Setup
Computing on Rivanna
The University of Virginia’s High-Performance Computing (HPC) system includes two large clusters named Rivanna and Afton. As a centralized resource, the HPC system has hundreds of pre-installed software packages available for computational research across many disciplines. All of us have logged on to Rivanna before this class. For today’s lesson, each of us will request 2 CPUs on Rivanna for the next 8 hours. We can do that by issuing the following shell command:
ijob -c 2 --mem-per-cpu=9000 -A ${account} -p standard --time=08:00:00
We will replace ${account} in the command with the name of the group that we will create specifically for the course.
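As a minimal sketch of that substitution, the group name can be stored in a shell variable first, so that the command can be used unchanged. The name bims6000_demo below is a hypothetical placeholder, not the real course group.

```shell
# Hypothetical group name: replace with the group created for the course.
account="bims6000_demo"

# Build the same request as above: 2 CPUs, 9000 MB per CPU, 8 hours, standard partition.
request="ijob -c 2 --mem-per-cpu=9000 -A ${account} -p standard --time=08:00:00"

# Print the command to check the substitution before running it.
echo "${request}"
```

Once the printed line looks right, it can be pasted at the prompt on Rivanna to start the interactive job.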
Working directory
We will create a folder named my_rna_seq_analysis once we have logged on to Rivanna. Remember that you can use the following command to create the folder:
mkdir my_rna_seq_analysis
Now that we have the folder, we will cd into it and do all our analyses there:
cd my_rna_seq_analysis
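The two steps can also be combined, as in the sketch below. It uses mkdir -p, which does not complain if the folder already exists, so it is safe to rerun in a later session.

```shell
# Create the folder if needed (-p: no error if it already exists), then enter it.
mkdir -p my_rna_seq_analysis && cd my_rna_seq_analysis

# Confirm the current location.
pwd
```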
Software tools
We have installed all the tools that will be required for this analysis on the Rivanna cluster. Rivanna uses Environment Modules to let users use tools without having to install them.
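A typical module session might look like the sketch below. The module name trimmomatic is an assumption made for illustration: Trimmomatic is used in this lesson, but the exact module name and version on Rivanna may differ, and module avail lists the real names. The guard only lets the snippet run without error on a machine that has no module system.

```shell
# Hypothetical module name; check `module avail` for the name used on Rivanna.
tool="trimmomatic"

if command -v module >/dev/null 2>&1; then
    module avail "${tool}"   # list matching modules that can be loaded
    module load "${tool}"    # make the tool available in the current shell session
    module list              # show which modules are currently loaded
else
    echo "No module command found; run this on the cluster instead."
fi
```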
R and R packages
We have also installed the R packages we will need for this lesson in the directory /standard/bims6000/R.
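One way to make R look in that directory is the R_LIBS_USER environment variable, sketched below. Whether the course uses this mechanism or sets the library path from inside R is an assumption here. The Rscript call only runs if R is available in the current shell.

```shell
# Point R at the shared package directory from the lesson.
export R_LIBS_USER="/standard/bims6000/R"

# If R is available, print the library search path to confirm the setting.
# R skips directories listed in R_LIBS_USER that do not exist.
if command -v Rscript >/dev/null 2>&1; then
    Rscript -e '.libPaths()'
fi
```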
Data files
We have also downloaded the various files that you are going to need for this lesson on to Rivanna.
What you need for the fastq QC, alignment, and counting
The following files can be found at /standard/bims6000/data/morning/
- Arabidopsis_sample1/2/3/4.fq.gz: FASTQ files, each containing the sequenced mRNA-seq reads of one sample.
- AtChromosome1.fa: the chromosome 1 sequence of the Arabidopsis thaliana genome in FASTA format.
- ath_annotation.gff3: the genome annotation of Arabidopsis thaliana for chromosome 1 in GFF3 format. This indicates the positions of genes, their exons, and their 5’ and 3’ UTRs on the chromosome, and is used to generate the gene counts.
- adapters.fasta: the Illumina adapter sequences used for read trimming with Trimmomatic.
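Before starting, it can help to confirm that these files are visible from your session. The sketch below defines a small helper, check_files (a name made up for this example), and runs it on the directory above. The four sample file names are an expansion of "Arabidopsis_sample1/2/3/4.fq.gz" and are assumed rather than confirmed.

```shell
# Report which of the expected files are present in a directory.
# Usage: check_files DIR FILE...
# Sets the variable `missing` to the number of files not found.
check_files() {
    dir="$1"
    shift
    missing=0
    for f in "$@"; do
        if [ -e "${dir}/${f}" ]; then
            echo "found:   ${f}"
        else
            echo "missing: ${f}"
            missing=$((missing + 1))
        fi
    done
}

# Assumed file names, expanded from the list in the lesson.
check_files /standard/bims6000/data/morning \
    Arabidopsis_sample1.fq.gz Arabidopsis_sample2.fq.gz \
    Arabidopsis_sample3.fq.gz Arabidopsis_sample4.fq.gz \
    AtChromosome1.fa ath_annotation.gff3 adapters.fasta
echo "${missing} file(s) missing"
```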
What you need for the differential expression and enrichment analysis
The following files can be found at /standard/bims6000/data/afternoon
- Counts: the raw_counts.csv dataframe contains the raw counts for each sample. Despite its .csv extension, it is a tab-separated file, so the data are arranged in tab-delimited columns.
- Samples to experimental conditions: the samples_to_conditions.csv dataframe indicates the correspondence between samples and experimental conditions (e.g. control, treated).
- Differentially expressed genes: the differential_genes.csv dataframe contains the results of the DESeq2 analysis.
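Because the counts file is described as tab-separated, standard tools such as head and cut (whose default delimiter is a tab) are enough for a quick look at it. The helper name preview_table is made up for this sketch, and the default of 5 lines and 4 columns is arbitrary.

```shell
# Print the first lines and first columns of a tab-separated file.
# Usage: preview_table FILE [N_LINES] [N_COLUMNS]
preview_table() {
    head -n "${2:-5}" "$1" | cut -f "1-${3:-4}"
}

counts="/standard/bims6000/data/afternoon/raw_counts.csv"
if [ -r "${counts}" ]; then
    preview_table "${counts}"
else
    echo "Cannot read ${counts}; run this on Rivanna."
fi
```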
Original study
This RNA-seq lesson uses a dataset from a study of the model plant Arabidopsis thaliana inoculated with commensal leaf bacteria (Methylobacterium extorquens or Sphingomonas melonis) and infected or not with a bacterial leaf pathogen called Pseudomonas syringae. Leaf samples were collected from Arabidopsis plantlets that had been inoculated or not with commensal bacteria and infected or not with the leaf pathogen, either after two days (2 dpi; dpi: days post-inoculation) or after seven days (7 dpi).
All details of the study are available in Vogel et al. (2016), published in New Phytologist.