### GWAS & eQTL
---

### Simple Linear Model

$y_{i} = \beta_{0} + \beta_{1} x_{i} + e_{i}$, $i=1,\dots,n$

where the $e_{i}$ are independent random variables with $E(e_{i}) = 0$ and $Var(e_{i}) = \sigma^2$. The $x_{i}$ are assumed to be fixed.

This is referred to as the standard statistical model: the value of $y$ is a linear function of $x$ plus random noise. $y$ is called the **dependent** or response variable and $x$ is called the **independent** or predictor variable.

---

### Method of Least Squares

Find the slope, $\beta_{1}$, and intercept, $\beta_{0}$, that minimize

$S(\beta_{0}, \beta_{1}) = \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1} x_{i})^2$

$\frac{\partial S}{\partial \beta_{0}} = -2 \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1} x_{i}) = 0$

$\frac{\partial S}{\partial \beta_{1}} = -2 \sum_{i=1}^{n} x_{i} (y_{i} - \beta_{0} - \beta_{1} x_{i}) = 0$

---

### Method of Least Squares

$\hat{\beta}\_{0} = \bar{y} - \hat{\beta}_{1} \bar{x}$

$\hat{\beta}\_{1} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}$

The least squares estimates are unbiased: $E(\hat{\beta}\_{j}) = \beta_{j}$, for $j = 0,1$.

---

### Variance of Parameter Estimates

From the standard statistical model, $Var(y_{i}) = \sigma^2$ and $Cov(y_{i}, y_{j}) = 0$ when $i \ne j$.

$Var(\hat{\beta}\_{0}) = \frac{\sigma^2 \sum_{i=1}^{n} x_{i}^2}{n\sum_{i=1}^{n} x_{i}^2 - (\sum_{i=1}^{n} x_{i})^2}$

$Var(\hat{\beta}\_{1}) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{i} - \bar{x})^2}$

---

### Relationship to Hypothesis Testing

Define the **residual sum of squares (RSS)** to be

$\textrm{RSS} = \sum\_{i=1}^{n} (y\_{i} - \hat{\beta}\_{0} - \hat{\beta}\_{1} x\_{i})^2$

An unbiased estimate of $\sigma^2$ is given by

$s^2 = \frac{\textrm{RSS}}{n - 2}$

where $n - 2$ is used instead of $n$ because two parameters have been estimated from the data, yielding $n - 2$ degrees of freedom.

---

### Relationship to Hypothesis Testing

If the errors, $e_{i}$, are independent normal random variables, then the $\hat{\beta}\_{i}$ for $i=0,1$ are normally distributed and it can be shown that

$\frac{\hat{\beta}\_{i} - \beta_{i}}{s\_{\hat{\beta}\_{i}}} \sim t_{n-2}$

where $t_{n-2}$ is a $t$ distribution with $n-2$ degrees of freedom. We would use this to test the null hypothesis $H_{0}: \beta_{1} = 0$, for example.

---

### Relationship to Correlation

Let's define the following quantities:

$s_{xx} = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2$

$s_{yy} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \bar{y})^2$

$s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})$

The correlation coefficient between the $x$'s and $y$'s is $r = \frac{s_{xy}}{\sqrt{s_{xx} s_{yy}}}$ and $\hat{\beta}\_{1} = \frac{s_{xy}}{s_{xx}}$; therefore $r = \hat{\beta}\_{1} \sqrt{\frac{s_{xx}}{s_{yy}}}$.

---

### Relationship to Correlation

Let's standardize the variables: $u_{i} = \frac{x_{i} - \bar{x}}{\sqrt{s_{xx}}}$ and $v_{i} = \frac{y_{i} - \bar{y}}{\sqrt{s_{yy}}}$. This yields $s_{uu} = s_{vv} = 1$ and $s_{uv} = r$.

The least squares line for predicting $v$ from $u$ has: slope $r$, intercept $\tilde{\beta}_{0} = \bar{v} - r \bar{u} = 0$, and predicted values $\hat{v}\_{i} = r u_{i}$.
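
---

### Least Squares in Practice

The closed-form estimates above translate directly into code. The sketch below is not part of the original notes; it uses simulated, hypothetical genotype/expression data and computes $\hat{\beta}\_{0}$, $\hat{\beta}\_{1}$, $s^2$, and the $t$-statistic for $H_{0}: \beta_{1} = 0$ with `numpy` and `scipy`.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: genotype dosage x (0/1/2) and expression level y
rng = np.random.default_rng(0)
n = 50
x = rng.integers(0, 3, size=n).astype(float)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # true beta0 = 1, beta1 = 0.5

# Least squares estimates
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residual sum of squares and the unbiased estimate of sigma^2
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
s2 = rss / (n - 2)

# Standard error of beta1_hat and two-sided t-test of H0: beta1 = 0
se_beta1 = np.sqrt(s2 / np.sum((x - x_bar) ** 2))
t_stat = beta1_hat / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(beta0_hat, beta1_hat, t_stat, p_value)
```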

---

### Relationship to Maximum Likelihood

$\mathbf{\beta}_{\textrm{ML}} = \underset{\mathbf{\beta}}{\arg\max} \sum\_{i=1}^{n} \log P(y\_{i} \vert x\_{i}; \mathbf{\beta})$

Let $P(y_{i} \vert x_{i}; \mathbf{\beta}) = N(y_{i}; \hat{y}\_{i}(x\_{i}, \mathbf{\beta}), \sigma^2)$:

$\sum\_{i=1}^{n} \log P(y_{i} \vert x_{i}; \mathbf{\beta}) = C - \sum_{i=1}^{n} \frac{(y_{i} - \hat{y}_{i})^2}{2 \sigma^2}$

where $C = -n \log \sigma - \frac{n}{2} \log (2 \pi)$. Equivalent to

$\mathbf{\beta}_{\textrm{ML}} = \underset{\mathbf{\beta}}{\arg\min} \sum\_{i=1}^{n} (y\_{i} - \hat{y}\_{i})^2$

---

### Multiple Linear Regression

Model $y = \beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{p-1} x_{p-1}$ to fit data: $y_{i}, x_{i1}, x_{i2},\ldots, x_{i,p-1}$ for $i = 1,\ldots,n$

Represent the $x_{i,j}$ by an $n \times p$ matrix $\mathbf{X}$:

$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\\\ 1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\\\ \vdots & \vdots & \vdots & \vdots & \vdots \\\\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1} \end{bmatrix}$

---

### Multiple Linear Regression

Represent the unknowns & observations by vectors: $\mathbf{\beta}$, $\mathbf{Y}$. Find $\mathbf{\beta}$ by minimizing

$S(\mathbf{\beta}) = \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1} x_{i1} - \cdots - \beta_{p-1} x_{i,p-1})^2$

Differentiating $S(\mathbf{\beta})$ with respect to each $\beta_{k}$ and setting the derivative to zero yields $p$ equations.

---

### Multiple Linear Regression

$n \hat{\beta}\_{0} + \hat{\beta}\_{1} \sum_{i=1}^{n} x_{i1} + \cdots + \hat{\beta}\_{p-1} \sum_{i=1}^{n} x\_{i,p-1} = \sum_{i=1}^{n} y\_{i}$

$\hat{\beta}\_{0} \sum_{i=1}^{n} x\_{ik} + \hat{\beta}\_{1} \sum_{i=1}^{n} x\_{i1} x\_{ik} + \cdots + \hat{\beta}\_{p-1} \sum_{i=1}^{n} x\_{i,p-1} x\_{ik} = \sum_{i=1}^{n} y\_{i} x\_{ik}$

for $k=1,\ldots, p-1$. In matrix form: $\mathbf{X}^{T} \mathbf{X} \mathbf{\hat{\beta}} = \mathbf{X}^{T} \mathbf{Y}$. If $\mathbf{X}^{T} \mathbf{X}$ is nonsingular, then

$\mathbf{\hat{\beta}} = (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}^{T} \mathbf{Y}$

---

### Multiple Linear Regression

$\mathbf{\hat{Y}} = \mathbf{X} \mathbf{\hat{\beta}} = \mathbf{X} (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}^{T} \mathbf{Y} = \mathbf{P} \mathbf{Y}$

where $\mathbf{P} = \mathbf{X} (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}^{T}$ is an $n \times n$ matrix. Note that $\mathbf{P} = \mathbf{P}^{T} = \mathbf{P}^2$ and $\mathbf{I} - \mathbf{P} = (\mathbf{I} - \mathbf{P})^{T} = (\mathbf{I} - \mathbf{P})^2$

$\mathbf{P}$ (the projection matrix) projects onto the $p$-dimensional subspace of $\mathbf{R}^{n}$ spanned by the columns of $\mathbf{X}$. $\mathbf{\hat{Y}}$ is a projection of $\mathbf{Y}$ onto the $p$-dimensional subspace spanned by the columns of $\mathbf{X}$.
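
---

### Normal Equations in Practice

As a minimal sketch (simulated data, not part of the original notes), the normal equations $\mathbf{X}^{T} \mathbf{X} \mathbf{\hat{\beta}} = \mathbf{X}^{T} \mathbf{Y}$ and the projection matrix $\mathbf{P}$ can be checked numerically with `numpy`. Solving the linear system is numerically preferable to forming $(\mathbf{X}^{T} \mathbf{X})^{-1}$; the inverse is computed here only to illustrate $\mathbf{P}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                              # samples; p = 3 columns including intercept
X = np.column_stack([np.ones(n),     # intercept column of ones
                     rng.normal(size=n),
                     rng.normal(size=n)])
beta_true = np.array([2.0, 1.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: solve X^T X beta_hat = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Projection (hat) matrix P = X (X^T X)^{-1} X^T and fitted values
P = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = P @ Y

# P is symmetric and idempotent
assert np.allclose(P, P.T)
assert np.allclose(P, P @ P)

print(beta_hat)
```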

---

### Multicollinearity

If there is perfect multicollinearity among the columns of the $\mathbf{X}$ matrix (i.e., they are linearly dependent), then $\mathbf{X}$ has less than full rank, $\mathbf{X}^{T} \mathbf{X}$ cannot be inverted, and a unique solution for $\mathbf{\beta}$ does not exist.

Even without perfect multicollinearity, highly correlated input variables can make interpreting the fitted coefficients and their significance problematic.

---

### Contingency Tables

| Phenotype/Genotype | SNP      | No SNP   | Total    |
| ------------------ | -------- | -------- | -------- |
| Disease            | $n_{11}$ | $n_{12}$ | $n_{1.}$ |
| No Disease         | $n_{21}$ | $n_{22}$ | $n_{2.}$ |
| Total              | $n_{.1}$ | $n_{.2}$ | $n$      |

---

### Contingency Tables

Consider a table containing $I$ rows and $J$ columns, with counts for each category from $n$ total samples. Assume the probability of falling in cell $i,j$ is $\pi_{ij}$.

Define $\pi_{i.} = \sum_{j=1}^{J} \pi_{ij}$ and $\pi_{.j} = \sum_{i=1}^{I} \pi_{ij}$ as the marginal probabilities that an observation will fall in the $i$th row and $j$th column, respectively.

If the row and column classifications are independent of each other, then $\pi_{ij} = \pi_{i.} \pi_{.j}$.

---

### Contingency Tables

Consider testing the following null hypothesis:

$H_{0}: \pi_{ij} = \pi_{i.} \pi_{.j}$, $i=1,\ldots,I$ and $j=1,\ldots,J$

versus the alternative that the $\pi_{ij}$ are free. Under $H_{0}$, the maximum likelihood estimate of $\pi_{ij}$ is

$\hat{\pi}\_{ij} = \hat{\pi}\_{i.} \hat{\pi}\_{.j} = \frac{n_{i.}}{n} \frac{n_{.j}}{n}$

Under the alternative, the maximum likelihood estimate of $\pi_{ij}$ is $\tilde{\pi}\_{ij} = \frac{n\_{ij}}{n}$.

---

### Chi-Squared Test of Independence

We will perform Pearson's chi-squared test, which is asymptotically equivalent to the likelihood ratio test. Define Pearson's chi-squared statistic:

$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

where $O_{ij} = n_{ij}$ are the observed counts and $E_{ij} = n \hat{\pi}\_{ij} = \frac{n\_{i.} n\_{.j}}{n}$ are the expected counts under the null hypothesis.

---

### Chi-Squared Test of Independence

Pearson's chi-squared statistic is then given by

$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - n_{i.} n_{.j}/n)^2}{n_{i.} n_{.j}/n}$

which is asymptotically $\chi^2$ distributed with $k$ degrees of freedom. The degrees of freedom are the number of independent counts minus the number of independent parameters estimated from the data.

---

### Chi-Squared Test of Independence

There are $IJ - 1$ independent counts since $n$ is fixed. Also, $(I-1) + (J-1)$ independent parameters were estimated from the data. Thus, $k = IJ - 1 - (I - 1) - (J - 1) = (I-1)(J-1)$.

We would calculate p-values by taking the observed chi-squared statistic and integrating the upper tail of a chi-squared distribution with $k = (I-1)(J-1)$ degrees of freedom.
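
---

### Chi-Squared Test in Practice

As a sketch with hypothetical counts (not data from these notes), the expected counts, Pearson statistic, and p-value for a $2 \times 2$ disease/SNP table can be computed by hand and cross-checked against `scipy.stats.chi2_contingency` (with the continuity correction disabled so it matches the plain Pearson statistic).

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = disease status, columns = SNP / no SNP
counts = np.array([[30, 20],
                   [15, 35]])

n = counts.sum()
row_tot = counts.sum(axis=1, keepdims=True)   # n_{i.}
col_tot = counts.sum(axis=0, keepdims=True)   # n_{.j}

# Expected counts under H0: E_ij = n_{i.} n_{.j} / n
expected = row_tot @ col_tot / n

# Pearson chi-squared statistic and upper-tail p-value
chi2_stat = np.sum((counts - expected) ** 2 / expected)
dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, df=dof)

# Cross-check against scipy
chi2_sp, p_sp, dof_sp, _ = stats.chi2_contingency(counts, correction=False)
print(chi2_stat, p_value, chi2_sp, p_sp)
```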