Transcriptome
Gene Expression Analysis
- Quality Control
- Estimate Expression Level
- Normalize Across Samples
- Perform Differential Expression Analysis
- Perform Enriched Pathway and Transcription Factor Analysis
Microarray Analysis
- Quality Control
- Estimate Expression Level: RMA, GCRMA, dChip
- Normalize Across Samples: Quantile Normalization, Scaling
- Perform Differential Expression Analysis: limma
- Perform Enriched Pathway and Transcription Factor Analysis
Microarray Analysis
- dChip: Analyzes multiple chips simultaneously
- For array $i$, probe $j$, and number of probe pairs $J$
- $p_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + e_{ij}$
- $\theta_i$: relative expression level; $\phi_j$: relative probe affinity
- $\sum_{j=1}^{J} \phi_j^2 = J$ (constraint)
- $e_{ij} \sim N(0, \sigma^2)$: additive error model
- Iteratively fit equations excluding outlier $\theta_i$ and $\phi_j$
- Effective expression estimate: $\hat{\theta}_i = \frac{1}{J} \sum_{j} p_{ij} \phi_j$
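The alternating fit described above can be sketched in a few lines of Python. This is a minimal illustration, not dChip's implementation: the function name `fit_li_wong` and the fixed iteration count are assumptions, the input is assumed to be a complete matrix of PM minus MM differences, and the outlier-exclusion step is omitted.

```python
import math


def fit_li_wong(p, n_iter=50):
    """Alternating least-squares fit of p[i][j] ~ theta[i] * phi[j].

    p[i][j] is assumed to be PM - MM for array i and probe j.
    Outlier exclusion is NOT implemented in this sketch.
    """
    n_arrays, n_probes = len(p), len(p[0])
    phi = [1.0] * n_probes  # starting value; satisfies sum(phi^2) = J
    for _ in range(n_iter):
        # Given phi (with sum(phi^2) = J): theta_i = (1/J) * sum_j p_ij * phi_j
        theta = [sum(p[i][j] * phi[j] for j in range(n_probes)) / n_probes
                 for i in range(n_arrays)]
        # Given theta: least-squares update of each probe affinity
        tt = sum(t * t for t in theta)
        phi = [sum(p[i][j] * theta[i] for i in range(n_arrays)) / tt
               for j in range(n_probes)]
        # Rescale so the constraint sum(phi^2) = J holds again
        scale = math.sqrt(n_probes / sum(f * f for f in phi))
        phi = [f * scale for f in phi]
    # Final expression estimates under the constrained phi
    theta = [sum(p[i][j] * phi[j] for j in range(n_probes)) / n_probes
             for i in range(n_arrays)]
    return theta, phi
```

On noise-free data of the form $\theta_i \phi_j$ the sketch recovers both factors exactly; with real data, the excluded outlier handling would matter.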
Microarray Analysis
- Problems with dChip (Li-Wong model)
- log(PM), log(MM) tend to be normally distributed, whereas the model assumes additive normal errors on the raw scale
- MM tends to capture a significant amount of the intended target signal: lowers sensitivity
- MM introduces a second "noisy" intensity: increases variance
Microarray Analysis
- Robust Multiarray Average (RMA) (Li-Wong on log-scale; no MM):
$\log_2(PM_{ij}) = \log_2(\theta_i) + \log_2(\phi_j) + b + e_{ij}$
- Estimate relative $\log_2 \theta_i$ robustly: median polish
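Median polish fits the additive log-scale model by alternately removing row and column medians. The sketch below is a generic Tukey median polish, not the RMA reference code: the function name, the iteration cap, and the layout (rows are arrays, columns are probes) are assumptions. Under that layout the overall term plays the role of $b$, the row effects the role of $\log_2 \theta_i$, and the column effects the role of $\log_2 \phi_j$.

```python
from statistics import median


def median_polish(x, max_iter=10, tol=1e-9):
    """Fit x[i][j] ~ overall + row_eff[i] + col_eff[j] by median sweeps.

    x[i][j] is assumed to hold log2(PM) for array i and probe j.
    Returns (overall, row_eff, col_eff, residuals).
    """
    resid = [list(row) for row in x]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each row's median from the residuals to row_eff
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Column sweep: move each column's median to col_eff
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once a full sweep no longer changes anything
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid
```

A per-array log2 expression summary is then `overall + row_eff[i]`. Because medians are used, a single aberrant probe intensity tends to end up in the residuals rather than in the expression estimate.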
Microarray Analysis
- GCRMA
- Background strongly depends on probe sequence, so it is modeled as a function of the probe sequence $S_{ij}$
- RMA with sequence-dependent background correction
$\log_2(PM_{ij}) = \log_2(\theta_i) + \log_2(\phi_j) + b(S_{ij}, MM_{ij}) + e_{ij}$
Microarray Analysis
- Linear Models for Microarray Data (limma)
- Assume linear model: $E[y_j] = X \alpha_j$
- $y_j$: expression values for gene $j$
- $X$ is the design matrix
- $\alpha_j$ is the vector of coefficients
- $y_j^T$: $j$th row of the expression matrix (log-intensities)
- Contrasts: $\beta_j = C^T \alpha_j$ ($C$: contrast matrix)
Microarray Analysis
- Significance analysis: moderated t-statistic
- Uses borrowed information from ensemble of genes
- Ordinary t-statistic with
- standard errors shrunk toward common value
- increased degrees of freedom (greater reliability associated with smoothed standard errors)
Microarray Analysis
- Linear model for gene $j$ has residual variance $\sigma_j^2$ with sample value $s_j^2$ and degrees of freedom $f_j$
- Covariance matrix of estimated $\hat{\beta}_j$ is $\sigma_j^2 C^T (X^T V_j X)^{-1} C$
- $V_j$ is a weight matrix: prior weights, covariance terms introduced by correlation structure, and iterative weights introduced by robust estimation
- Unscaled standard deviations ($u_{jk}$): square roots of the diagonal elements of $C^T (X^T V_j X)^{-1} C$
Microarray Analysis
- Ordinary t-statistic for $k$th contrast and gene $j$: $t_{jk} = \hat{\beta}_{jk} / (u_{jk} s_j)$
- Empirical Bayes method assumes an inverse chi-square prior for the $\sigma_j^2$ with mean $s_0^2$ and degrees of freedom $f_0$
- Posterior values for residual variances given by
$\tilde{s}_j^2 = \frac{f_0 s_0^2 + f_j s_j^2}{f_0 + f_j}$
Microarray Analysis
- Moderated t-statistic: $\tilde{t}_{jk} = \frac{\hat{\beta}_{jk}}{u_{jk} \tilde{s}_j}$
- Follows a t-distribution with $f_0 + f_j$ degrees of freedom under the null hypothesis $\beta_{jk} = 0$
- Extra degrees of freedom $f_0$ represent information borrowed from the ensemble of genes for each gene's inference
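To make the shrinkage concrete, here is a small sketch for the simplest case: one gene, two groups, unit weights, and the contrast "group 2 minus group 1". The function name is hypothetical, and the prior parameters $s_0^2$ and $f_0$ are passed in as given numbers, whereas limma estimates them from all genes. With a group-means design, $u_{jk} = \sqrt{1/n_1 + 1/n_2}$ and $f_j = n_1 + n_2 - 2$.

```python
import math
from statistics import mean


def moderated_t(group1, group2, s0_sq, f0):
    """Ordinary and moderated t for a two-group contrast of one gene.

    s0_sq and f0 are assumed known here; they are hyperparameters that
    an empirical Bayes fit would estimate from the whole gene ensemble.
    The sample variance must be nonzero for the ordinary t to exist.
    """
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    beta_hat = m2 - m1                      # estimated contrast
    f = n1 + n2 - 2                         # residual degrees of freedom f_j
    rss = (sum((y - m1) ** 2 for y in group1)
           + sum((y - m2) ** 2 for y in group2))
    s_sq = rss / f                          # sample residual variance s_j^2
    u = math.sqrt(1.0 / n1 + 1.0 / n2)      # unscaled standard deviation
    s_tilde_sq = (f0 * s0_sq + f * s_sq) / (f0 + f)  # posterior variance
    t_ordinary = beta_hat / (u * math.sqrt(s_sq))
    t_moderated = beta_hat / (u * math.sqrt(s_tilde_sq))
    return t_ordinary, t_moderated, f0 + f
```

For a gene whose sample variance happens to be small, the moderated statistic is pulled toward a less extreme value, while the degrees of freedom used for the reference distribution increase from $f_j$ to $f_0 + f_j$.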
Microarray vs RNA-SEQ
RNA-SEQ Analysis
- Quality Control: FASTQC
- Splice-Aware Alignment: HISAT, STAR
- Estimate Expression Level: featureCounts in Rsubread, StringTie, Salmon, Sailfish, kallisto, RSEM
- Normalize Across Samples: DESeq2 Normalization, Quantile Normalization, Scaling
- Perform Differential Expression Analysis: DESeq2, edgeR
- Perform Enriched Pathway and Transcription Factor Analysis: MSigDB, GSEA, STRING
Poisson Distribution
The sum of $n$ independent Bernoulli random variables $X_i$, each equal to 1 with probability $p$ and 0 with probability $1-p$,
$Y = \sum_{i=1}^{n} X_i$
is distributed as a binomial distribution with
$\mu = np$ and $\sigma^2 = np(1-p)$.
For $p \to 0$ and $n \to \infty$ such that $np = \lambda$,
$Y$ approaches a Poisson distribution.
Poisson Distribution
Assume $Y$ is the number of reads mapping to a window in the genome coming from low coverage sequencing of a genome.
If the empirical probability of a read mapping to a specific location is $p \ll 1$,
and the number of bases in the window $n \sim 1000$,
the Poisson distribution is an excellent approximation of the distribution of $Y$.
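A quick numerical check of this approximation compares the binomial and Poisson probability mass functions at fixed $\lambda = np$. The helper names and the choice $\lambda = 2$ are illustrative assumptions, not values from the slides.

```python
import math


def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def max_abs_diff(n, lam, k_max=20):
    """Largest pointwise gap between the two pmfs for k = 0..k_max."""
    p = lam / n
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(k_max + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max_abs_diff(n, lam=2.0))
```

The gap shrinks as $n$ grows with $np$ held fixed, which is the regime of a roughly 1000-base window with a small per-base mapping probability.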
Negative Binomial Distribution
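The negative binomial distribution is a mixture of a Poisson and a gamma distribution where the
Poisson distribution is $p_P(k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$
and the gamma distribution is $g(\lambda) = \frac{\beta^r \lambda^{r-1} e^{-\beta\lambda}}{\Gamma(r)}$
$p_{NB}(k) = \int_0^\infty p_P(k \mid \lambda)\, g(\lambda)\, d\lambda = \frac{\Gamma(r+k)}{k!\, \Gamma(r)} p^k (1-p)^r$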
Negative Binomial Distribution
Where we have used $\beta = (1-p)/p$. The mean and variance of a random variable $K \sim NB(\mu, \alpha)$, with $\mu = rp/(1-p)$ and $\alpha = 1/r$, are
$E(K) = \mu$ and $Var(K) = \mu + \alpha\mu^2$
where the variance has a Poisson/"shot noise" term $\mu$ and an overdispersion/"biological variability" term $\alpha\mu^2$
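The link between the $(r, p)$ form of the probability mass function and the $(\mu, \alpha)$ parameterization can be verified numerically. In the sketch below the function names and the truncation point `k_max` are assumptions; the conversion uses $r = 1/\alpha$ and $p = \mu/(\mu + r)$, which follows from $\mu = rp/(1-p)$.

```python
import math


def nb_pmf(k, r, p):
    """Gamma(r+k) / (k! Gamma(r)) * p^k * (1-p)^r, computed on the log scale."""
    log_coef = math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + k * math.log(p) + r * math.log(1.0 - p))


def nb_from_mean_dispersion(mu, alpha):
    """Convert (mu, alpha) to (r, p) using r = 1/alpha, p = mu / (mu + r)."""
    r = 1.0 / alpha
    return r, mu / (mu + r)


def nb_moments(r, p, k_max=2000):
    """Total mass, mean and variance from the pmf truncated at k_max."""
    pmf = [nb_pmf(k, r, p) for k in range(k_max + 1)]
    total = sum(pmf)
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return total, mean, var
```

For example, `nb_moments(*nb_from_mean_dispersion(10.0, 0.5))` should give a mean near 10 and a variance near $10 + 0.5 \cdot 10^2 = 60$, six times the Poisson variance at the same mean.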
RNA-SEQ Analysis
- DESeq2: Assumes read count $K_{ij}$ for gene $i$ in sample $j$ is described by a generalized linear model
- $K_{ij} \sim NB(\mu_{ij}, \alpha_i)$ (negative binomial with mean $\mu_{ij}$ and dispersion $\alpha_i$)
- $Var(K_{ij}) = \mu_{ij} + \alpha_i \mu_{ij}^2$
- $\mu_{ij} = s_{ij} q_{ij}$, where $q_{ij}$ is proportional to the concentration of cDNA fragments from gene $i$ in sample $j$ and $s_{ij}$ is a size or normalization factor
RNA-SEQ Analysis
- Size factor accounts for differences in sequencing depth in a robust manner
- Motivation: if gene $i$ is not differentially expressed between samples $j$ and $j'$, then $E(K_{ij})/E(K_{ij'}) = s_j/s_{j'}$
- Generalize this to multiple samples
- Define pseudo-reference: $K_i^R = \left(\prod_{j=1}^{m} K_{ij}\right)^{1/m}$
- $s_{ij} = s_j = \mathrm{median}_i \frac{K_{ij}}{K_i^R}$
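The median-of-ratios rule above translates directly into code. This is a sketch rather than DESeq2's own routine: the function name and the nested-list input layout are assumptions, and genes with any zero count are skipped because their pseudo-reference (a geometric mean) would be zero.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts[i][j] is the read count of gene i in sample j.
    Genes containing a zero count are skipped (their geometric mean is 0);
    at least one gene must have all-positive counts.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(k <= 0 for k in row):
            continue
        # log of the pseudo-reference K_i^R (geometric mean across samples)
        log_ref = sum(math.log(k) for k in row) / n_samples
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_ref)
    # s_j = median over genes of K_ij / K_i^R
    return [math.exp(median(lr)) for lr in log_ratios]
```

Taking the median over genes is what makes the estimate robust: a minority of strongly differentially expressed genes changes a few ratios but leaves the median in place, unlike scaling by total read count.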
RNA-SEQ Analysis
- Fit generalized linear model to normalized counts
- $\log_2 q_{ij} = \sum_r x_{jr} \beta_{ir}$, where $x_{jr}$ are elements of the design matrix
- Estimate $\alpha_i$ using shared information across genes, assuming that genes with similar average expression have similar dispersion
- Estimate each gene's dispersion using maximum likelihood
- Fit smooth curve of dispersion estimate versus mean of normalized read count
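As a rough illustration of a gene-wise dispersion estimate, the sketch below uses a method-of-moments estimator in place of the maximum likelihood fit named above; it is a stand-in chosen for brevity, not what DESeq2 does. It assumes a single condition (all samples share one $q_i$), and the function name is hypothetical. Since $Var(K_{ij}/s_j) = q_i/s_j + \alpha_i q_i^2$, solving for $\alpha_i$ gives the expression in the code.

```python
from statistics import mean, variance


def moment_dispersion(counts, size_factors):
    """Method-of-moments dispersion for one gene (NOT the ML estimate).

    counts[j] is the raw count in sample j and size_factors[j] its size
    factor; all samples are assumed to come from one condition.
    """
    norm = [k / s for k, s in zip(counts, size_factors)]
    q_bar = mean(norm)
    if q_bar == 0:
        return 0.0
    # Expected Poisson ("shot noise") part of the variance of K_ij / s_j
    shot = q_bar * mean(1.0 / s for s in size_factors)
    alpha = (variance(norm) - shot) / q_bar ** 2
    return max(alpha, 0.0)  # dispersion cannot be negative
```

With only a handful of replicates such estimates are very noisy, which is the motivation for fitting a trend against mean expression and sharing information across genes.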
RNA-SEQ Analysis
- Shrink estimates of gene-wise dispersion toward the values predicted by the curve using an empirical Bayes approach, where the amount of shrinkage depends on
- an estimate of how close true dispersion values tend to be to the fit
- degrees of freedom
- Final estimate of $\alpha_i$ is given by the maximum a posteriori (MAP) estimate
- Use gene-wise estimate if it is more than 2 residual standard deviations from shrunken estimate
RNA-SEQ Analysis
- Address the high variance of log fold change (LFC) estimates for genes with low read counts by shrinking LFC estimates toward zero; shrinkage is stronger when less information is available for a gene (low counts, high dispersion, or few degrees of freedom).
RNA-SEQ Analysis
- Employ empirical Bayes procedure:
- Perform GLM fits to obtain maximum likelihood estimates (MLEs) for the LFCs
- Fit a zero-centered normal distribution to the observed distribution of MLEs over all genes
- This distribution is used as prior on LFCs in second round of GLM fits
- Final estimate of each LFC is given by MAP estimate
- A standard error for each LFC estimate is derived from the posterior's curvature at its maximum
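The effect of the zero-centered prior can be illustrated with a normal approximation. This is a simplification, not DESeq2's procedure: the likelihood for one gene is approximated by a normal centered at the MLE with the MLE's standard error, and the prior variance is passed in rather than fitted to the distribution of MLEs. The function name is hypothetical. Under these assumptions the posterior is normal, so the MAP estimate and the curvature-based standard error have closed forms.

```python
import math


def shrink_lfc(lfc_mle, se_mle, prior_var):
    """MAP log fold change under a N(0, prior_var) prior.

    Assumes a normal approximation to the likelihood, centered at
    lfc_mle with standard deviation se_mle (a simplification).
    """
    # Posterior precision = prior precision + likelihood precision;
    # this is also the curvature of the negative log posterior at its mode.
    precision = 1.0 / prior_var + 1.0 / se_mle ** 2
    lfc_map = (lfc_mle / se_mle ** 2) / precision
    se_post = math.sqrt(1.0 / precision)
    return lfc_map, se_post
```

With `prior_var=1.0`, an MLE of 2.0 with standard error 1.0 is halved, while the same MLE with standard error 0.1 is barely changed, matching the statement that shrinkage is stronger when a gene carries less information.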
RNA-SEQ Analysis
- Assess significance using a Wald test:
- the shrunken estimate of LFC is divided by its standard error, resulting in a z-statistic
- which is used to calculate p-values from a standard normal distribution
- Correct p-values for multiple hypothesis testing using Benjamini and Hochberg procedure
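Both steps are short enough to sketch with the standard library. The function names are illustrative; the Wald p-value is taken as two-sided, and the adjustment is the standard Benjamini-Hochberg step-up procedure.

```python
import math


def wald_p_value(lfc, se):
    """Two-sided p-value for z = lfc / se under a standard normal."""
    z = lfc / se
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below a chosen threshold are then reported, with that threshold serving as the target false discovery rate.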