Learning Bioinformatics

2023-07-18

This syllabus and resource list draws on material collected from the Harvard Informatics group and other contributors.

Table of contents

Unix for Bioinformatics
R for Bioinformatics
Python and Biopython
Git and version control
Molecular Drug Designing
RNA-seq
Single-cell Analysis
Read mapping
Variant calling
Miscellaneous

Unix for Bioinformatics

Introduction

Unix Basics

Introduction to Unix and its role in bioinformatics.
Understanding the Unix file system and directory structure.
Basic Unix commands: ls, cd, pwd, mkdir, rm, cp, mv, etc.
Working with files and directories.

Working with text data

Using text editors in Unix (e.g., nano, vi, vim) for editing files.
Redirection and pipes.
Text processing utilities: grep, awk, sed.

File manipulation

Archiving and compressing files: tar, gzip, zip.
File permissions and ownership: chmod, chown.

Introduction to Scripting

Writing and executing basic shell scripts.
Variables, control structures, and loops.

Intermediate

Data Retrieval and Transfer

Downloading files from the web using wget and curl.
Transferring files between local and remote systems using scp and rsync.

Working with Biological Data Formats

Introduction to common bioinformatics data formats (FASTA, FASTQ, SAM/BAM, VCF, etc.).
Using tools for file format conversion (e.g., samtools, bedtools).
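
As a small illustration of what format conversion involves, the sketch below turns FASTQ text into FASTA by dropping the quality strings. It assumes the common four-line FASTQ layout (one line each for header, sequence, `+` separator, and quality); the function names `read_fastq` and `fastq_to_fasta` are made up for this example, and dedicated tools should be preferred for real data.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        seq = next(it, None)
        plus = next(it, None)  # '+' separator line, content ignored
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError(f"truncated record: {header!r}")
        seq, qual = seq.rstrip("\n"), qual.rstrip("\n")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def fastq_to_fasta(lines: Iterable[str]) -> str:
    """Convert FASTQ text to FASTA text, dropping the quality strings."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in read_fastq(lines))


# Two toy reads (invented data, for illustration only).
example = ["@read1\n", "ACGT\n", "+\n", "IIII\n",
           "@read2\n", "GGCA\n", "+\n", "FFFF\n"]
fasta_text = fastq_to_fasta(example)
print(fasta_text, end="")
```

Because `read_fastq` accepts any iterable of lines, an open file handle can be passed in place of the list.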

Text Processing and Analysis

Advanced text processing with regular expressions.
Combining Unix tools for complex data analysis.
Extracting relevant information from large data files.

Advanced

Shell Scripting and Automation

Writing more complex shell scripts for automation.
Using Unix tools to automate bioinformatics workflows.
Advanced scripting techniques and best practices.

High-Performance Computing (HPC)

Introduction to HPC clusters and job submission systems.
Writing and submitting batch scripts for bioinformatics analysis.
Managing resources and optimizing performance.

Advanced Data Manipulation

Using awk, sed, and other tools for advanced data manipulation.
Handling large datasets efficiently.

Bioinformatics Pipelines

Designing and building bioinformatics pipelines using Unix tools.
Integrating third-party tools into custom pipelines.

R

Introduction

Getting Started with R

Introduction to R and its applications in bioinformatics.
Installing R and RStudio (Integrated Development Environment for R).
Basic R syntax: variables, data types, and basic arithmetic operations.

Working with Data in R

Data structures in R: vectors, matrices, data frames, and lists.
Reading and writing data from/to files (e.g., CSV, FASTA, FASTQ).
Basic data manipulation: subsetting, filtering, and sorting.

Data Visualization

Introduction to data visualization in R.
Using base R graphics and ggplot2 for creating plots.
Customizing plots for bioinformatics data (e.g., genomics, proteomics).

Intermediate

Statistical Analysis with R

Introduction to statistical analysis in R.
Descriptive statistics: mean, median, standard deviation, etc.
Hypothesis testing and statistical tests for bioinformatics data.

Bioconductor and Genomic Data Analysis

Overview of Bioconductor, a repository of R packages for bioinformatics.
Analyzing gene expression data (microarrays, RNA-seq) with Bioconductor packages.
Working with genomic data (e.g., DNA sequencing, ChIP-seq, variant analysis).

Data Visualization (Advanced)

Advanced data visualization techniques in R.
Creating complex plots for multi-dimensional bioinformatics data.
Interactive data visualization using packages like Plotly and Shiny.

Advanced

Machine Learning with R

Introduction to machine learning in R.
Supervised and unsupervised learning algorithms.
Applying machine learning to bioinformatics data (e.g., classification, clustering).

Bioinformatics Workflows and Reproducibility

Building and documenting bioinformatics workflows in R.
Using RMarkdown for creating reproducible reports.
Best practices for reproducible research in bioinformatics.

Integration with Other Tools and Databases

Connecting R with databases (e.g., MySQL, SQLite) for data storage and retrieval.
Accessing and querying biological databases through R.

High-Performance Computing (HPC) with R

Parallel computing in R for handling large-scale bioinformatics tasks.
Utilizing HPC clusters for bioinformatics analysis.

Python

Introduction

Getting Started with Python

Introduction to Python and its applications in bioinformatics.
Installing Python and setting up the development environment.
Basic Python syntax: variables, data types, and control structures.

Working with Data in Python

Data structures in Python: lists, tuples, dictionaries, and sets.
Reading and writing data from/to files (e.g., CSV, FASTA, FASTQ).
Basic data manipulation: slicing, filtering, and sorting.
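
The topics above can be tried out with a few lines of plain Python. The sketch below reads FASTA records into a dictionary and then slices, filters, and sorts them. The helper names (`parse_fasta`, `gc_fraction`) and the toy sequences are illustrative assumptions, not part of any library.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence}; multi-line sequences are joined."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""  # ID = first word of header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Toy input (invented data); a file handle works the same way.
fasta_lines = [">seq1 sample A", "ATGC", "GGCC", ">seq2", "ATAT", ">seq3", "GGGCCCAT"]
records = parse_fasta(fasta_lines)

first_codon = records["seq1"][:3]                                      # slicing
gc_rich = {n: s for n, s in records.items() if gc_fraction(s) >= 0.5}  # filtering
by_length = sorted(records, key=lambda n: len(records[n]), reverse=True)  # sorting
```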

Data Visualization in Python

Introduction to data visualization in Python.
Using matplotlib and seaborn libraries to create plots.
Customizing plots for bioinformatics data (e.g., genomics, proteomics).

Intermediate

Bioinformatics Algorithms in Python

Implementing common bioinformatics algorithms (e.g., sequence alignment, motif finding).
Utilizing Python libraries for bioinformatics tasks (e.g., Biopython's Bio.Align.PairwiseAligner, which supersedes the deprecated pairwise2 module, for sequence alignment).
Analyzing biological sequences and structures.
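
For the "implementing algorithms" topic, a textbook Needleman-Wunsch global alignment is a common first exercise. The sketch below uses a linear gap penalty and simple match/mismatch scores. The default scores are arbitrary assumptions, and this is a teaching implementation rather than what any particular library does internally.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[int, str, str]:
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)  # (2, 'ACGT', 'A-GT')
```

The table costs O(len(a) * len(b)) time and memory, which is fine for short sequences but is the reason production aligners use heuristics or banded variants.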

Biological Data Analysis with Pandas

Introduction to Pandas library for data manipulation and analysis.
Handling and processing bioinformatics data using Pandas DataFrames.
Data cleaning and preprocessing techniques.

Bioinformatics Libraries in Python

Exploring Biopython: installation and basic usage.
Working with biological sequences, structures, and annotations.
Retrieving data from biological databases using Biopython.

Advanced

Machine Learning for Bioinformatics

Introduction to machine learning in Python.
Supervised and unsupervised learning algorithms for bioinformatics data.
Applying machine learning to tasks like gene expression analysis, variant calling, etc.

Bioinformatics Workflows and Automation

Building bioinformatics pipelines in Python.
Utilizing workflow management tools like Snakemake.
Automating repetitive tasks and batch processing.

Data Visualization (Advanced)

Advanced data visualization in Python using Plotly, Bokeh, or Dash.
Creating interactive visualizations for complex bioinformatics data.

Structural Bioinformatics with PyMOL

Introduction to PyMOL for visualization and analysis of molecular structures.
Structural alignment, superimposition, and visualization.

Git and Version Control

Introduction

Understanding Version Control

What version control is and why it matters for researchers.
The benefits of using version control in research projects.
Overview of Git as a distributed version control system.

Installing Git and Basic Configuration

Installing Git on your computer (Windows, macOS, Linux).
Configuring your Git identity (name, email).
Setting up a global .gitignore file to exclude unnecessary files.

Creating and Cloning Repositories

Initializing a new Git repository for a research project.
Cloning an existing repository from a remote source (e.g., GitHub, GitLab).
Understanding the local and remote repository relationship.

Working with Git for Researchers

Basic Version Control Operations

Staging and committing changes to the repository.
Viewing the commit history and understanding commit messages.
Checking out previous versions of files and repositories.

Collaborating with Others

Adding collaborators to your repository.
Handling merge conflicts and resolving them.
Pulling changes from a remote repository and pushing your changes.

Branching and Merging

Creating and managing branches for different research tasks.
Merging branches and resolving conflicts during merges.
Utilizing feature branches for experimental work.

Advanced Git Techniques for Researchers

Managing Large Files and Data

Using Git LFS (Large File Storage) for handling large files.
Handling datasets and large research files with Git.

Tagging and Releases

Creating tags to mark important milestones in your research.
Creating releases for specific versions of your research project.

Git Best Practices for Research

Organizing your research project repository effectively.
Writing meaningful commit messages and documentation.
Using branching strategies that suit research workflows.

Integrating Git into Research Workflows

Version Control with Data Analysis

Using Git to version control scripts and notebooks.
Incorporating Git into data analysis workflows.

Collaborative Writing with Git

Using Git for collaborative writing (e.g., research papers, documentation).
Integrating Git with LaTeX, Markdown, or other writing formats.

Automating Workflows with Git Hooks

Setting up Git hooks to automate tasks (e.g., running tests, code formatting).
Customizing pre-commit and post-commit hooks for your research needs.

Molecular Drug Designing

Introduction

Introduction to Molecular Drug Design

Overview of molecular drug design and its importance in pharmaceutical research.
Understanding the process of drug discovery and development.

Proteins as Drug Targets

Identifying and selecting protein targets for drug design.
Understanding the importance of protein structure in drug design.

Molecular Docking

Principles of Molecular Docking

Understanding the basic principles of molecular docking.
Different types of molecular docking algorithms and scoring functions.

Bioinformatics Tools for Molecular Docking

Introduction to molecular docking software (e.g., AutoDock, AutoDock Vina).
Preparing protein and ligand structures for docking.

Performing Molecular Docking

Conducting protein-ligand docking simulations.
Analyzing docking results and interpreting binding interactions.

Molecular Dynamics Simulation

Introduction to Molecular Dynamics (MD)

Understanding the principles of molecular dynamics simulations.
Applications of MD in drug design and biomolecular studies.
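
At its core, an MD simulation repeatedly integrates Newton's equations of motion in small time steps. The sketch below applies the velocity Verlet scheme, a standard MD integrator, to a single particle on a one-dimensional spring. The spring constant, mass, and time step are arbitrary illustrative values, and real MD engines add force fields, thermostats, and many-particle bookkeeping on top of this idea.

```python
import math
from typing import Callable, List, Tuple


def velocity_verlet(x: float, v: float, force: Callable[[float], float],
                    mass: float, dt: float, n_steps: int) -> List[Tuple[float, float]]:
    """Integrate one particle in 1D; returns [(x, v), ...] including the start."""
    trajectory = [(x, v)]
    a = force(x) / mass
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update with averaged acceleration
        a = a_new
        trajectory.append((x, v))
    return trajectory


K_SPRING = 1.0  # assumed spring constant (arbitrary units)
MASS = 1.0      # assumed particle mass (arbitrary units)


def spring_force(x: float) -> float:
    return -K_SPRING * x


def total_energy(x: float, v: float) -> float:
    return 0.5 * MASS * v * v + 0.5 * K_SPRING * x * x


trajectory = velocity_verlet(1.0, 0.0, spring_force, MASS, dt=0.01, n_steps=1000)
energies = [total_energy(x, v) for x, v in trajectory]
energy_drift = max(energies) - min(energies)
print(f"final x = {trajectory[-1][0]:.4f}, energy drift = {energy_drift:.2e}")
```

Checking that total energy stays nearly constant, as done here, is a common sanity test for an integrator and time step.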

Bioinformatics Tools for Molecular Dynamics

Introduction to MD simulation software (e.g., Schrödinger's Desmond, GROMACS, AMBER, NAMD).
Preparing biomolecular systems for MD simulations.

Running Molecular Dynamics Simulations

Setting up and running MD simulations.
Analyzing MD trajectories and extracting relevant data.

Integration of Docking and Dynamics in Drug Design

Virtual Screening and Hit Identification

Using molecular docking for virtual screening of compound libraries.
Filtering and prioritizing potential drug candidates.

Free Energy Calculations

Introduction to free energy calculation methods.
Enhancing accuracy with binding free energy calculations.

Advanced Topics in Molecular Drug Design

Personalized Medicine and Drug Design

Exploring the concept of personalized medicine.
Customizing drug design approaches for individual patients.

RNA-seq

Introduction to RNA-Seq Analysis

Introduction to RNA-Seq

Understanding RNA-Seq technology and its applications in genomics.
Differences between RNA-Seq and other sequencing methods (e.g., DNA-Seq).

RNA-Seq Experimental Design

Design considerations for RNA-Seq experiments.
Sample preparation, library construction, and sequencing platforms.

Preprocessing and Quality Control

Raw Data Quality Assessment

Understanding the raw sequencing data formats (FASTQ).
Performing quality control (QC) checks using tools like FastQC.
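
To make the FASTQ quality line less opaque, the sketch below decodes it into Phred scores and error probabilities. It assumes the Phred+33 encoding used by Sanger and current Illumina output; the function names are illustrative, and this is not how FastQC itself is implemented.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean Phred score of a read (0.0 for an empty read)."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


scores = phred_scores("I5!")
print(scores)                 # [40, 20, 0]
print(error_probability(20))  # a Phred score of 20 means a 1% error rate
```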

Preprocessing of RNA-Seq Data

Trimming adapters and low-quality bases with tools like Trimmomatic.
Quality filtering and read preprocessing.
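
The idea behind trailing-quality trimming can be sketched in a few lines: drop bases from the 3' end while their quality is below a threshold, then discard reads that end up too short. The thresholds, the Phred+33 offset, and the function names are assumptions for illustration. Real trimmers such as Trimmomatic offer further strategies (adapter clipping, sliding windows) that are not shown here.

```python
from typing import Tuple


def trim_trailing(seq: str, qual: str, min_quality: int = 20,
                  offset: int = 33) -> Tuple[str, str]:
    """Remove 3' bases whose Phred score is below min_quality."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]


def passes_length_filter(seq: str, min_length: int = 4) -> bool:
    """Keep only reads that are still long enough after trimming."""
    return len(seq) >= min_length


# Toy read (invented): the last two bases have Phred scores 2 and 0.
trimmed_seq, trimmed_qual = trim_trailing("ACGTACGT", "IIIII5#!")
print(trimmed_seq, trimmed_qual, passes_length_filter(trimmed_seq))
```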

Mapping and Alignment

Reference Genome and Transcriptome

Selecting an appropriate reference genome or transcriptome for mapping.
Building custom references if needed.

RNA-Seq Read Alignment

Aligning preprocessed reads to the reference using tools like STAR or HISAT2.
Dealing with splice junctions and novel transcripts.

Quantification of Gene Expression

Gene Expression Estimation

Counting aligned reads at the gene level using tools like featureCounts or HTSeq.
Generating count matrices for downstream analysis.
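
A count matrix is a genes-by-samples table of read counts. The sketch below assembles one from per-sample dictionaries, filling in zero where a gene was not reported for a sample. The input layout and the sample and gene names are hypothetical; parsing the actual output files of featureCounts or HTSeq is left out.

```python
from typing import Dict, List, Tuple


def build_count_matrix(
    sample_counts: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (genes, samples, rows) where rows[i][j] is the count of
    genes[i] in samples[j]; missing gene/sample pairs become 0."""
    samples = sorted(sample_counts)
    genes = sorted({g for counts in sample_counts.values() for g in counts})
    rows = [[sample_counts[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows


# Hypothetical per-sample gene counts.
per_sample = {
    "ctrl_1": {"geneA": 10, "geneB": 0},
    "treat_1": {"geneA": 25, "geneC": 7},
}
genes, samples, rows = build_count_matrix(per_sample)
for gene, row in zip(genes, rows):
    print(gene, *row, sep="\t")
```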

Differential Gene Expression Analysis

Introduction to differential expression analysis.
Using tools like DESeq2 or edgeR to identify differentially expressed genes.
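
Before turning to dedicated tools, it helps to see the quantity they ultimately report: a log2 fold change between conditions. The sketch below is a deliberately naive version, assuming one sample per condition, counts-per-million scaling, and a pseudocount of 1 to avoid dividing by zero. It performs no statistical testing and is not the normalization or model used by DESeq2 or edgeR.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw counts by library size to counts per million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no reads")
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def log2_fold_change(control: Dict[str, int], treated: Dict[str, int],
                     pseudocount: float = 1.0) -> Dict[str, float]:
    """log2(treated / control) on CPM values, for genes present in both samples."""
    cpm_c = counts_per_million(control)
    cpm_t = counts_per_million(treated)
    return {
        gene: math.log2((cpm_t[gene] + pseudocount) / (cpm_c[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


# Invented counts for two genes in two samples.
lfc = log2_fold_change({"geneA": 100, "geneB": 900}, {"geneA": 400, "geneB": 600})
for gene, value in lfc.items():
    print(f"{gene}\t{value:+.2f}")
```

With replicates, a fold change alone says nothing about significance, which is the gap the statistical models in dedicated packages are meant to fill.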

Functional Analysis and Visualization

Gene Ontology (GO) Enrichment Analysis

Understanding GO terms and their significance in functional analysis.
Using tools like GOseq or topGO for GO enrichment analysis.
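
The basic statistic behind many enrichment analyses is a one-sided hypergeometric (Fisher-type) test: given how many genes carry a GO term overall, how surprising is the number seen among the differentially expressed genes? The sketch below computes that tail probability with the standard library (`math.comb` needs Python 3.8 or later). The parameter names and numbers are illustrative, and dedicated packages add refinements, such as multiple-testing correction, that are omitted here.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, overlap: int) -> float:
    """P(X >= overlap) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 4 annotated, 3 selected, 2 of those annotated.
p_value = hypergeom_enrichment_p(population=10, annotated=4, sample=3, overlap=2)
print(f"p = {p_value:.4f}")  # p = 0.3333
```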

Pathway Analysis

Introduction to pathway analysis and its importance in understanding gene functions.
Performing pathway analysis against pathway databases such as KEGG or Reactome, using methods such as GSEA.

Data Visualization

Creating various plots for RNA-Seq data visualization (e.g., heatmaps, volcano plots).
Using R and Python libraries for data visualization.

Advanced Topics in RNA-Seq Analysis

Isoform-level Analysis

Quantifying transcript isoforms using tools like Salmon or kallisto.
Analyzing alternative splicing events.

Long Non-Coding RNA (lncRNA) Analysis

Identifying and characterizing long non-coding RNAs in RNA-Seq data.
Special considerations for lncRNA analysis.

Integration with other Omics Data

Integrating RNA-Seq data with other genomics data (e.g., DNA-Seq, ChIP-Seq) for comprehensive analysis.