Transcriptome

Gene Expression Analysis

Quality Control
Estimate Expression Level
Normalize Across Samples
Perform Differential Expression Analysis
Perform Enriched Pathway and Transcription Factor Analysis

Microarray Analysis

Quality Control
Estimate Expression Level: RMA, GCRMA, dChip
Normalize Across Samples: Quantile Normalization, Scaling
Perform Differential Expression Analysis: limma
Perform Enriched Pathway and Transcription Factor Analysis

Microarray Analysis

dChip: Analyzes multiple chips simultaneously
For array $i$ , probe $j$ and number of probe pairs $J$
$p_{ij} = PM_{ij} - MM_{ij} = \theta_{i} \phi_{j} + e_{ij}$
$\theta_{i}$ : relative expression level; $\phi_{j}$ : relative affinity
$\sum_{j=1}^{J} \phi_{j}^{2} = J$ (constraint)
$e_{ij} \sim N(0,\sigma^2)$ : additive error model
Iteratively fit equations excluding outlier $\theta_{i}$ and $\phi_{j}$
Effective expression estimate $\theta_{i} = \frac{1}{J}\sum_{j} p_{ij} \phi_{j}$
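
The alternating fit described above can be sketched in a few lines. This is a minimal illustration, assuming a complete matrix of $PM - MM$ differences and skipping the outlier exclusion step; the function name `fit_li_wong` and its arguments are hypothetical and not part of dChip.

```python
import math

def fit_li_wong(p, n_iter=50):
    """Alternating least squares for p[i][j] = theta[i] * phi[j] + error.

    p[i][j] is the PM - MM difference for array i and probe j.
    Outlier arrays and probes are NOT excluded here (simplifying assumption).
    """
    n_arrays, n_probes = len(p), len(p[0])
    phi = [1.0] * n_probes  # starting value satisfies sum(phi^2) = J
    for _ in range(n_iter):
        # Given phi, the least-squares theta_i is (1/J) * sum_j p_ij * phi_j,
        # because the constraint fixes sum_j phi_j^2 = J.
        theta = [sum(p[i][j] * phi[j] for j in range(n_probes)) / n_probes
                 for i in range(n_arrays)]
        # Given theta, the least-squares phi_j is sum_i p_ij theta_i / sum_i theta_i^2.
        denom = sum(t * t for t in theta)
        phi = [sum(p[i][j] * theta[i] for i in range(n_arrays)) / denom
               for j in range(n_probes)]
        # Rescale phi so that the constraint sum_j phi_j^2 = J holds again.
        scale = math.sqrt(n_probes / sum(f * f for f in phi))
        phi = [f * scale for f in phi]
    # Final expression estimates from the constrained phi.
    theta = [sum(p[i][j] * phi[j] for j in range(n_probes)) / n_probes
             for i in range(n_arrays)]
    return theta, phi
```

On noise-free data of the form $\theta_{i} \phi_{j}$ the sketch recovers both factors; with real data, the outlier handling would need to be added before the estimates can be trusted.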

Microarray Analysis

Problems with dChip (Li-Wong model)
$\log(PM)$ , $\log(MM)$ tend to be normally distributed, so an additive normal error model on the raw scale fits poorly
$MM$ tends to capture a significant amount of the intended target signal: lowers sensitivity
$MM$ introduces a second "noisy" intensity: increases variance

Microarray Analysis

Robust Multiarray Average (RMA) (Li-Wong on log-scale; no MM):

$\log_{2}(PM_{ij}) = \log_{2}(\theta_{i}) + \log_{2}(\phi_{j}) + b + e_{ij}$

Estimate relative $\log_{2} \theta_{i}$ robustly: median polish
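
Median polish can be illustrated with a short sketch. It assumes a small table of $\log_{2}(PM_{ij})$ values with arrays as rows and probes as columns, and it takes the array-level estimate to be the overall term plus the row effect. The function name `median_polish` and its parameters are hypothetical, and background correction and normalization are left out.

```python
from statistics import median

def median_polish(x, n_iter=10, tol=1e-8):
    """Tukey median polish of x[i][j] (rows: arrays, columns: probes).

    Returns (overall, row_effects, col_effects, residuals) such that
    x[i][j] = overall + row_effects[i] + col_effects[j] + residuals[i][j].
    """
    n_rows, n_cols = len(x), len(x[0])
    resid = [list(row) for row in x]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        # Move the median of the column effects into the overall term.
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
            col_eff[j] += col_med[j]
        # Move the median of the row effects into the overall term.
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid
```

Because medians rather than means are swept out, a single aberrant probe on one array has little influence on that array's estimate, which is the sense in which the fit is robust.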

Microarray Analysis

GCRMA
Background strongly depends on probe sequence, so modeled as function of sequence ( $S_{ij}$ ) of probes
RMA with sequence dependent background correction

$\log_{2}(PM_{ij}) = \log_{2}(\theta_{i}) + \log_{2}(\phi_{j}) + b(S_{ij},MM_{ij}) + e_{ij}$

Microarray Analysis

Linear Models for Microarray Data (limma)
Assume linear model: $E[\mathbf{y}_{j}] = \mathbf{X} \alpha_{j}$
$\mathbf{y}_{j}$ expression values for gene $j$
$\mathbf{X}$ is the design matrix
$\alpha_{j}$ is vector of coefficients
$\mathbf{y}^{T}_{j}$ : $j$th row of expression matrix (log-intensities)
Contrasts: $\beta_{j} = \mathbf{C}^{T} \alpha_{j}$ ( $\mathbf{C}$ : contrast matrix)
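
As an illustration of the model above, the sketch below fits $\alpha_{j}$ for one gene by ordinary least squares and then applies a contrast matrix. It is a plain-Python stand-in, not limma's implementation: there are no weights and no correlation structure, and the helper names `solve`, `fit_gene` and `apply_contrasts` are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_gene(design, y):
    """Least-squares coefficients alpha_j from the normal equations X^T X a = X^T y."""
    n_coef = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(n_coef)]
           for a in range(n_coef)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(n_coef)]
    return solve(xtx, xty)

def apply_contrasts(contrasts, alpha):
    """beta_j = C^T alpha_j, with contrasts given as a (coefficients x contrasts) table."""
    n_contrasts = len(contrasts[0])
    return [sum(contrasts[a][k] * alpha[a] for a in range(len(alpha)))
            for k in range(n_contrasts)]
```

With a group-means design (one indicator column per group), $\alpha_{j}$ holds the group means and the contrast column $(-1, 1)^{T}$ gives the log fold change between the two groups.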

Microarray Analysis

Significance analysis: moderated t-statistic
Uses borrowed information from ensemble of genes
Ordinary t-statistic with
1. standard errors shrunk toward common value
2. increased degrees of freedom (greater reliability associated with smoothed standard errors)

Microarray Analysis

Linear model for gene $j$ has residual variance $\sigma_{j}^2$ with sample value $s_{j}^2$ and degrees of freedom $f_{j}$
Covariance matrix of estimated $\hat{\beta}_{j}$ is $\sigma_{j}^2 \mathbf{C}^T(\mathbf{X}^{T} \mathbf{V}_{j} \mathbf{X})^{-1} \mathbf{C}$
$\mathbf{V}_{j}$ is a weight matrix: prior weights, covariance terms introduced by correlation structure and iterative weights introduced by robust estimation
Unscaled standard deviation ( $u_{jk}$ ): square roots of diagonal elements of $\mathbf{C}^T(\mathbf{X}^{T} \mathbf{V}_{j} \mathbf{X})^{-1} \mathbf{C}$

Microarray Analysis

Ordinary t-statistic for the $k$th contrast and gene $j$ : $t_{jk} = \hat{\beta}_{jk}/(u_{jk} s_{j})$
Empirical Bayes method assumes a scaled inverse chi-square prior for the $\sigma_{j}^2$ with prior value $s_{0}^2$ and degrees of freedom $f_{0}$
Posterior values for residual variances given by

$\tilde{s}_{j}^2 = \frac{f_{0} s_{0}^{2} + f_{j} s_{j}^{2}}{f_{0} + f_{j}}$

Microarray Analysis

Moderated t-statistic: $\tilde{t}_{jk} = \frac{\hat{\beta}_{jk}}{u_{jk} \tilde{s}_{j}}$
Follows a t-distribution with $f_{0} + f_{j}$ degrees of freedom under the null hypothesis $\beta_{jk} = 0$
Extra degrees of freedom $f_{0}$ represent borrowed information from the ensemble of genes for each gene's inference
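
The arithmetic of the moderated t-statistic can be sketched for a single gene and contrast. The prior parameters $s_{0}^2$ and $f_{0}$ are treated as given inputs here, which is an assumption for illustration: in practice they are estimated from the whole ensemble of genes. The function name `moderated_t` is hypothetical.

```python
import math

def moderated_t(beta_hat, u, s2, f, s0_sq, f0):
    """Ordinary and moderated t-statistics for one gene and one contrast.

    beta_hat: estimated contrast; u: unscaled standard deviation u_jk;
    s2, f: sample residual variance s_j^2 and its degrees of freedom f_j;
    s0_sq, f0: prior variance and prior degrees of freedom (assumed known).
    """
    # Posterior variance: weighted average of prior and sample variances.
    s2_post = (f0 * s0_sq + f * s2) / (f0 + f)
    t_ordinary = beta_hat / (u * math.sqrt(s2))
    t_moderated = beta_hat / (u * math.sqrt(s2_post))
    # The moderated statistic is referred to a t-distribution with f0 + f_j df.
    return t_ordinary, t_moderated, s2_post, f0 + f
```

For a gene whose sample variance happens to be very small, the posterior variance is pulled up toward $s_{0}^2$, so the moderated statistic is less extreme than the ordinary one.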

Microarray vs RNA-SEQ

RNA-SEQ Analysis

Quality Control: FASTQC
Splice-Aware Alignment: HISAT, STAR
Estimate Expression Level: featureCounts in Rsubread, StringTie, Salmon, Sailfish, kallisto, RSEM
Normalize Across Samples: DESeq2 Normalization, Quantile Normalization, Scaling
Perform Differential Expression Analysis: DESeq2, edgeR
Perform Enriched Pathway and Transcription Factor Analysis: MSigDB, GSEA, STRING

RNA-SEQ Analysis

Poisson Distribution

The sum of $n$ independent Bernoulli random variables $X_{i}$ , each equal to 1 with probability $p$ and 0 with probability $1-p$ ,

$Y = \sum_{i=1}^{n} X_{i}$

is distributed as a binomial distribution with

$\mu = np$ and $\sigma^2 = np(1-p)$ .

For $p \rightarrow 0$ and $n \rightarrow \infty$ such that $np = \lambda$ ,

$Y$ approaches a Poisson distribution.
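
A quick numerical check of this limit: the sketch below compares binomial and Poisson probabilities for a fixed $\lambda = np$ as $n$ grows. The helper names are hypothetical and the comparison is restricted to small counts $k$.

```python
import math

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def max_pmf_gap(n, lam, k_max=20):
    """Largest absolute difference between the two pmfs for k = 0..k_max, with p = lam / n."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(k_max + 1))
```

Evaluating `max_pmf_gap(n, 2.0)` for `n = 10`, `1000` and `10000` shows the gap shrinking as $n$ increases while $np$ stays fixed.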

Poisson Distribution

Assume $Y$ is the number of reads mapping to a window in the genome coming from low coverage sequencing of a genome.

If the empirical probability of a read mapping to a specific location is $p \ll 1$ ,

and the number of bases in the window $n \sim 1000$ ,

the Poisson distribution is an excellent approximation of the distribution of $Y$ .

Poisson Distribution

Negative Binomial Distribution

The negative binomial distribution is a mixture of a Poisson and a gamma distribution where the

Poisson distribution is $p_{P}(k|\lambda) = \frac{\lambda^{k}}{k!} e^{-\lambda}$

and the gamma distribution is $g(\lambda) = \frac{\lambda^{r-1} \beta^{r} e^{-\beta \lambda}}{\Gamma(r)}$

$\begin{aligned} p_{NB}(k) & = \int_{0}^{\infty} p_{P}(k|\lambda) g(\lambda) d \lambda \\ & = \frac{\Gamma(r+k)}{k! \Gamma(r)} p^{k} (1-p)^{r} \end{aligned}$

Negative Binomial Distribution

Where we have used $\beta = (1-p)/p$ . The mean and variance of a random variable $K \sim NB(\mu, \alpha)$ are

$E(K) = \mu$ and $Var(K) = \mu + \alpha \mu^2$

where the variance has a Poisson/"shot noise" term $\mu$ and an overdispersion/"biological variability" term $\alpha \mu^2$
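
The two parameterizations can be connected numerically. For a gamma prior with shape $r$ and rate $\beta = (1-p)/p$, the mixture has mean $\mu = r p/(1-p)$ and dispersion $\alpha = 1/r$, a standard consequence of the mixture that is not spelled out on the slides. The sketch below sums the pmf directly to check the mean and variance; the function names and the truncation point `k_max` are hypothetical choices.

```python
import math

def nb_pmf(k, r, p):
    """Gamma(r + k) / (k! Gamma(r)) * p^k * (1 - p)^r, computed on the log scale."""
    log_pmf = (math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
               + k * math.log(p) + r * math.log(1 - p))
    return math.exp(log_pmf)

def nb_moments(r, p, k_max=2000):
    """Total probability, mean and variance from a truncated sum over k = 0..k_max - 1."""
    probs = [nb_pmf(k, r, p) for k in range(k_max)]
    total = sum(probs)
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return total, mean, var
```

For example, `nb_moments(2.5, 0.8)` can be compared with $\mu = 2.5 \cdot 0.8 / 0.2 = 10$ and $\mu + \alpha \mu^2 = 10 + 0.4 \cdot 100 = 50$.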

RNA-SEQ Analysis

DESeq2: Assumes read count $K_{ij}$ for gene $i$ in sample $j$ is described by a generalized linear model
$K_{ij} \sim NB(\mu_{ij}, \alpha_{i})$ (Negative Binomial with mean $\mu_{ij}$ and dispersion $\alpha_{i}$ )
$Var(K_{ij}) = \mu_{ij} + \alpha_{i} \mu_{ij}^2$
$\mu_{ij} = s_{ij} q_{ij}$ where $q_{ij}$ is proportional to gene $i$ 's concentration of cDNA fragments in sample $j$ and $s_{ij}$ is a size or normalization factor

RNA-SEQ Analysis

Size factor accounts for differences in sequencing depth in a robust manner
Motivation: if gene $i$ is not differentially expressed between samples $j$ and $j^{'}$ , then $E(K_{ij})/E(K_{ij^{'}}) = s_{j}/s_{j^{'}}$
Generalize this to multiple samples
Define pseudo-reference: $K_{i}^R = (\prod_{j=1}^{m} K_{ij})^{1/m}$
$s_{ij} = s_{j} = \textrm{median}_{i} \frac{K_{ij}}{K_{i}^{R}}$
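
The median-of-ratios size factor can be sketched as follows. Genes with a zero count in any sample are skipped because their geometric mean is zero; that handling and the function name `size_factors` are assumptions of this illustration rather than a description of DESeq2's code.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts[i][j] is the count of gene i in sample j."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # pseudo-reference would be zero, so the ratio is undefined
        # Log of the geometric mean across samples (the pseudo-reference K_i^R).
        log_ref = sum(math.log(k) for k in row) / n_samples
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_ref)
    # s_j is the median over genes of K_ij / K_i^R, computed on the log scale.
    return [math.exp(median(r)) for r in log_ratios]
```

Because the median is used, a minority of strongly differentially expressed genes does not distort the size factors, unlike scaling by total read count.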

RNA-SEQ Analysis

Fit generalized linear model to normalized counts
$\log_{2} q_{ij} = \sum_{r} x_{jr} \beta_{ir}$ where $x_{jr}$ are elements of the design matrix
Estimate $\alpha_{i}$ using shared information across genes assuming that genes with similar average expression have similar dispersion
Estimate each gene's dispersion using maximum likelihood
Fit smooth curve of dispersion estimate versus mean of normalized read count

RNA-SEQ Analysis

Shrink estimates of gene-wise dispersion toward values predicted by the curve using empirical Bayes approach where size of shrinkage depends on
1. estimate of how close true dispersion is to the fit
2. degrees of freedom
Final estimate of $\alpha_{i}$ is given by maximum a posteriori (MAP) estimate
Use the gene-wise estimate instead of the shrunken estimate if it is more than 2 residual standard deviations above the fitted curve

RNA-SEQ Analysis

Address the high variance of log fold change (LFC) estimates for genes with low read counts by shrinking LFC estimates toward zero. Shrinkage is stronger when less information is available for a gene (low counts, high dispersion or few degrees of freedom).

RNA-SEQ Analysis

Employ empirical Bayes procedure:
1. Perform GLM fits to obtain maximum likelihood estimates (MLEs) for the LFCs
2. Fit a zero-centered normal distribution to the observed distribution of MLEs over all genes
3. This distribution is used as prior on LFCs in second round of GLM fits
Final estimate of each LFC is given by MAP estimate
A standard error for each LFC estimate is derived from the posterior's curvature at its maximum

RNA-SEQ Analysis

Assess significance using a Wald test:
1. the shrunken estimate of LFC is divided by its standard error, resulting in a z-statistic
2. which is used to calculate p-values from a standard normal distribution
Correct p-values for multiple hypothesis testing using Benjamini and Hochberg procedure
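
The last two steps can be sketched directly: a two-sided p-value from the z-statistic, followed by Benjamini-Hochberg adjustment. The function names are hypothetical, and steps that a full pipeline may apply before adjustment (such as filtering low-count genes) are not shown.

```python
import math

def wald_pvalue(lfc, se):
    """Two-sided p-value for z = lfc / se under a standard normal distribution."""
    z = lfc / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below a chosen threshold, for instance 0.1, are then reported as differentially expressed at that false discovery rate.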