Introduction to Computational Genomics
- Instructor:
- David Molik dmolik@nd.edu
- GTA:
- Chissa Rivaldi crivaldi@nd.edu
- Supervisor:
- Michael Pfrender mpfrende@nd.edu
Course Number
- BIOS 60132
Resources
- ND_ICG_2020
- Code of Conduct
- Daily Feedback
General Description:
Massive amounts of data are being generated that biologists are now expected to analyze. In order to use this data to answer research questions, bioinformatics skills are necessary. The skills we will cover in this course center around manipulating big data, using and creating bioinformatic pipelines, and visualization of data.
This class is aimed at people who may have some experience with computational biology, but still consider themselves beginners to the field. Computational biologists employ a vast array of methods and skills that often change with evolving technology. Our goal here is to focus on the most commonly used and widely applicable ones that can provide a solid foundation from where to grow. These skills include overall efficiency, building pipelines, and writing and reading code. After this course, students should be able to carry this knowledge to more specialized courses (e.g. algorithms, data structures, data visualization, GIS).
The weekly structure of the class will include both lecture material and hands-on coding experience, and the material will be reinforced with homework covering fundamental and practical topics, including code review, using code to solve problems, and a final project that will incorporate as much of the course material as possible. Feedback on course material and instruction will be incorporated as the semester progresses.
Learning Goals
After successful completion of this course students will be able to: Complete intermediate bioinformatics tasks, such as, manipulate big data files, implement common bioinformatic pipelines, and visualize data. Understand beginner computational biology algorithms, as such, course participants will reach a level of comfort that would allow them to take a CSE Data Structures or Algorithms course after successful completion.
Grading Policies and Criteria
- Practicals: %35 - Practicals Are analysis and code problems that students have five days to complete. Practicals are assigned on Thursdays and due the night before class on Tuesdays, students will submit the practicals to a github repository on a branch named after their netid. Practicals can be retaken as many times as the student would like for a regrade.
- Assignments: %20 - Each class, due before class starts a series of activities that will need to be completed. These include:
- Readings handed during Tuesday class
- Weekly status updates on Final Project
- Other, mostly administrative activities (e.g. installing Julia)
- Course Participation: %5
- Final project: %40
Textbooks
- Practical reference material: Lauwens, Ben. 2019. Think Julia: How to Think Like a Computer Scientist. O’Reilly Media, Incorporated. Two physical copies have been placed on reserve at Hesburgh, however, the entire book is free online: ThinkJulia.jl
- Another reference material: https://docs.julialang.org/en/v1/
- Yet another: https://github.com/topics/julia
*Optional reading: “Introduction to Computational Biology An Evolutionary Approach” ISBN: 978-3-7643-6700-8 - We will not be using this book in the class, but it is an excellent algorithmic level resource
Office Hours
- David Molik: by appointment, remote
- Chissa Rivaldi: Thursdays 3-4pm (remain on class zoom) or by appointment, remote
Final Project
Over the course of semester you will be expected to design and complete a final project, the final project can be completed in Teams, however, the amount of work expected from the final project will be linear for additional team members (i.e. two people are expected to work on a project requiring two peoples amount of work). During the class we will have weekly homeworks ensuring the completion of the final project, however, it is a good idea to come into the class with an idea of data resources available to you. The final project will consist of an analysis designed to answer a novel question of that data, and a PLoS style write-up.
The Navari Family for Digital Scholarship (cds@nd.edu) the Biology Librarian (Parker Ladwig: ladwig.1@nd.edu) and the head of the Genomics and Bioinformatics Core (Mike Pfrender: mpfrende@nd.edu) are all excellent resources for genomics data resources. Genomics data is often closer than you realize, your advisor and senior labmates might be a great resource as well.
Final Project
You should be as concise and short as possible in the write-up, it should be no longer than 12 pages double spaced (something like eight is preferred), 12pt font not including references. Think PLOS Short Report or PLOS Resources. It should hover around three figures. See Here and Here for details.
you should submit the final paper through https://www.overleaf.com/ a Latex editor, using the PLOS latex format https://journals.plos.org/plosone/s/latex by adding Chissa and I as collaborators.
You will be graded as follows:
- %10 Merit of the Question
- %30 Presentation
- %30 Paper
- %30 Code
Course Flow
There will be lecture on Tuesdays and Thursdays, this will be augmented with a Lab/Pracitcal and Status Report due Monday night, and a Reading and sometimes admin due Wednesday night.
Schedule (see course website for updates)
- Week One: Introduction of “Introduction to Computational Genomics” This week will mostly be review of the Unix Command Line, and Git, but will set the groundwork for the Final Project, which is the majority of course grade, how to submit assignments, and introduce the main scripting language of the course, Julia, a just-in-time compiled analytical language similar to that of R or Python.
- Day One, Introduction, Syllabus, CRC login, Unix, time allowing: Julia
- Day Two, Git, Github Homework Submission
- Reading due: A Quick Introduction to Version Control with Git and GitHub
- Reading question: What is git fork, what does it do? and why would you use it?
- Admin due:
- Decide on CRC or local computer for Julia and Bioinformatics.
- create a .vimrc for yourself
- create a .bashrc for yourself
- Reading due: A Quick Introduction to Version Control with Git and GitHub
- Week Two: Assembly Part One - Introduction to a Assembly Pipeline, This week will demonstrate how to run assembly on the command line, specifically how to use a work queue manager, like SGE, which is in use at Notre Dame, to assemble reads into a de novo, we will then take our assembled genome and annotate it using an orthologs.
- Day One: Scripting, Submission of Grid Computing, de novo Assembly
- Practical due:
- Practical Repo One
- In order to complete the practical and gain access to the repo you will need email Dave your github username (dmolik @ nd.edu)
- Practical Repo One
- Final project status update due:
- Email Chissa three datasets that you could use for the final project
- Practical due:
- Day Two: Annotation Job Submission, Genome Graphs
- Reading due: Sequence assembly demystified
- Reading question: What is N50 in the context of the article, how is it used, and how is it misused?
- Admin due:
- install and use Julia to print fizz buzz from one to twenty
- A blog article going over fizz buzz, also the answer is on the page if you get stuck.
- install and use Julia to print fizz buzz from one to twenty
- Reading due: Sequence assembly demystified
- Day One: Scripting, Submission of Grid Computing, de novo Assembly
- Week Three: Assembly Part Two - We now go deeper down the computational chain, looking under the hood of how an assembler works. This will focus again on creating assemblies, but specifically look at how to compute the same work in the Julia language.
- Day One - How an Assembler works, Genome Assembly in Julia
- Practical due:
- Final project status update due:
- Email Chissa and Dave an abstract of a final project plan.
- Day Two - Julia Annotation, Genome Graphs
- Reading due: A beginner’s guide to eukaryotic genome annotation
- What is the difference between Ab initio gene prediction and Evidence based gene prediction?
- Admin due:
- Put together final project group and email Dave & Chissa
- Reading due: A beginner’s guide to eukaryotic genome annotation
- Day One - How an Assembler works, Genome Assembly in Julia
- Week Four: Assembly Part Three - We will continue our descent down the computational chain, now we go still deeper, looking at the underlying algorithms that make assembly possible.
- Day One - String Comparisons, Distances
- Practical due:
- Final project status update due:
- make Chissa and Dave collaborators on a repo with an extended project abstract with your project team covering:
- AS a README.md in a project repo add:
- Team
- New Code Required
- Activities Required
- Data Used
- Question Asked
- AS a README.md in a project repo add:
- make Chissa and Dave collaborators on a repo with an extended project abstract with your project team covering:
- Day Two - Smith-Waterman, Other Assembly Methods
- Reading Due: Basic Local Alignment Search Tool smithwaterman.jl
- Why are alignment algorithms faster than looping through both sequences (what makes them faster?)
- Admin due: Setup time to talk to Chissa/Dave about Final project.
- Reading Due: Basic Local Alignment Search Tool smithwaterman.jl
- Day One - String Comparisons, Distances
- Week Five: SNPs Part One - Back up for air! We’ve learned about assemblies, time for something different, we’ll be looking at Single Nucleotide Polymorphisms, this week will focus on what a SNP pipeline looks like, and how to submit Grid Computing jobs for a SNP pipeline.
- Day One: Read Quality/Trimming Filtering
- Pratical Due:
- Final project status update due:
- Please create a draft of your code for your final project, and send an update to Dave and Chissa (we will gove over what this means in class)
- Day Two: Variant Calling
- Admin due: take a breather
- Reading due: Compression of FASTQ and SAM Format Sequencing Data
- Second reading: What is a SNP?
- Day One: Read Quality/Trimming Filtering
- Week Six: SNPs Part Two, We’ve doen SNP calling on the command line, now we are going to turn around, and write our own pipeline in Julia
- Day One: Julia version of building a SNP, Read Quality/Trimming Filtering
- Practical Due:
- Final project status update due:
- Create Args for Julia Final project (what inputs and outputs are you going to need?)
- Day Two: Julia version of Variant Calling
- Admin due: install VariantVisualization
- Reading due: SNPs in ecology, evolution and conservation
- What is the difference between Qst and Fst when it comes to SNPs?
- Day One: Julia version of building a SNP, Read Quality/Trimming Filtering
- Week Seven: SNPs Part Three, What’s under the hood of SNP calling?
- Day One: Frequency Estimation
- Practical Due:
- Day Two: Key - Value stores
- Reading due: Mapping short DNA sequencing reads and callingvariants using mapping quality scores and SNP detection for massively parallel whole-genome resequencing
- This will require some research, but what are some faster ways to calculate Frequency?
- Reading due: Mapping short DNA sequencing reads and callingvariants using mapping quality scores and SNP detection for massively parallel whole-genome resequencing
- Day One: Frequency Estimation
- Week Eight: Differential Expression Part One, Say I want to run a differential expression analysis analysis on some RNAseq data, how do I do that?
- Day One - Creating the RNAseq expression table
- Practical Due:
- Final project status update due:
- Schedule a 20min meeting with Chissa and Dave
- Day Two - Visualizing RNAseq expression
- Reading Due: RNA-seq workflow: gene-level exploratory analysis and differential expression
- Reading Question: Out of this giant page of analyses, which are the most important to you to learn how to do with RNA-Seq data?
- Reading Due: RNA-seq workflow: gene-level exploratory analysis and differential expression
- Day One - Creating the RNAseq expression table
- Week Nine: Differential Expression Part Two, Now that we know how to run a differential expression analysis, lets look deeper and do it in Julia.
- Day One - Julia RNAseq expression table
- Practical Due:
- Day Two - Julia RNAseq Visualization
- Reading Due: Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs
- Reading question: Use the paper and the supplementary code to describe what Normalized Coverage is, this will involve you looking at the code.
- Reading Due: Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs
- Day One - Julia RNAseq expression table
- Week Ten: Differential Expression Part Three, What is this clustering thing anyway?
- Day One - Ordination
- Practical Due:
- Day Two - Clustering
- Reading Due: What is Principal Component Analysis?
- Day One - Ordination
- Week Eleven: Population Genomics, I want to look at populations, what are some ways to do that?
- Day One: Post Assembly - Calculating Genotypes
- Practical Due:
- Day Two: Visualization of Genotypes
- Reading Due: The Variety of Genes in the Gene Pool Can Be Quantified within a Population
- Reading question: What are the differences between allele frequency, genotype frequency, and phenotype frequency?
- Reading Due: The Variety of Genes in the Gene Pool Can Be Quantified within a Population
- Day One: Post Assembly - Calculating Genotypes
- Week Twelve: Population Genomics, More populations.
- Day One - Julia Fitness Landscapes
- Final project status update due:
- email Chissa and Dave a 350ish word or less on the status of your final project.
- Final project status update due:
- Day Two - Guest Lecture - Pavel Dimens
- Reading Due: Estimating Genetic Relatedness in Admixed Populations
- Reading question: none-make sure you have a vague concept of Relatedness for lecture.
- Reading Due: Estimating Genetic Relatedness in Admixed Populations
- Day One - Julia Fitness Landscapes
- Week Thirteen: Population Genomics, Still more Populations.
- Day One - Time Zones, Coordinate Transformations
- Day Two - Finding local optima
- Week Fourteen: Final Project Presentations