Consider this MSA (Multiple Sequence Alignment)
```
Species 1 AAGTTTCCA
Species 2 AAA---CCA
Species 3 GGA---CCA
Species 4 AGA---CCA
```
$d_{K2P} = ...$ > or < D? $d_{JC}$ ? 0.402
Key point: distance depends on the model
---
## HKY model (1985)
$$
\mathbf{\pi} = \sout{[0.25, 0.25, 0.25, 0.25]}
$$
$$
\mathbf{\pi} = [\pi_A, \pi_C, \pi_G, \pi_T]
$$
$$
P_t = \begin{bmatrix}
\cdot & \alpha\pi_C & \beta\pi_G & \alpha\pi_T \\\\
\alpha\pi_A & \cdot & \beta\pi_G & \alpha\pi_T \\\\
\beta\pi_A & \alpha\pi_C & \cdot & \beta\pi_T \\\\
\alpha\pi_A & \beta\pi_C & \alpha\pi_G & \cdot
\end{bmatrix}
$$
---
## GTR model (1990,1994)
$$
P_t = \begin{bmatrix}
\cdot & \alpha_{CA}\pi_C & \alpha_{GA}\pi_G & \alpha_{TA}\pi_T \\\\
\alpha_{CA}\pi_A & \cdot & \alpha_{GC}\pi_G & \alpha_{TC}\pi_T \\\\
\alpha_{GA}\pi_A & \alpha_{GC}\pi_C & \cdot & \alpha_{TG}\pi_T \\\\
\alpha_{TA}\pi_A & \alpha_{TC}\pi_C & \alpha_{TG}\pi_G & \cdot
\end{bmatrix}
$$
Time-reversible: $P_{ij}(t) = P_{ji}(t)$
---
So: which model should you use?
---
How many parameters in each model?
rate parameters + equilibrium frequencies:
JC: 1 + 0 = 1
K2P: 2 + 0 = 2
HKY: 2 + 3 = 5
GTR: 6 + 3 = 9
Model testing:
choose the simplest model that fits the data
---
# Tree building
- Distance-based methods
- UPGMA
- Neighbor-joining
- Character-based methods
- Parsimony
- Maximum likelihood
- Bayesian inference
---
### Two types of tree
Rooted
Unrooted
Rooted trees provide **direction**;
unrooted trees only show relationships.
---
### What's the difference?
Are these trees rooted?
---
### Can you root an unrooted tree?
---
### Can you root an unrooted tree?
An outgroup is a species known to be outside the group of interest.
The root must be between the outgroup and the ingroup.
---
### Are these the same tree? Example 1
---
### Are these the same tree? Example 2
---
### Are these the same tree? Example 3
---
### Are these the same tree? Example 4
---
## Tree building
### For a given MSA,
### How can we find the optimal tree?
Brute force?
1. Enumerate all possible trees
2. Calculate the score for each tree
3. Choose the tree with the best score
---
# How many trees?
[Felsenstein 1978](https://doi.org/10.2307/2412810)
$$
T = \prod_{k=2}^{n} (2k-3)
$$
$$
= \frac{(2n-3)!}{(n-2)!2^{n-2}}
$$
```R
possibleTrees = function(ntaxa) {
factorial(2*ntaxa-3) / (factorial(ntaxa-2) * 2^(ntaxa-2))
}
```
---
# How many trees?
---
# Distance-based methods
1. Compute distances between all pairs of sequences
2. Build a tree based on a distance matrix
---
UPGMA: Unweighted Pair Group Method with Arithmetic Mean
1. Start with a distance matrix
2. Find the closest pair of sequences
3. Join them with a new node
4. Compute the distances for the new node by averaging the distances to the other nodes
5. Repeat until all sequences are joined
---
Merging nearest neighbors
---
The UPGMA tree is guaranteed to be correct
*if the data are ultrametric*.
ultrametric means:
- the distance from any two tips to their MRCA is the same
- equivalently, for any 3 species on an ultrametric tree, the two largest distances will be equal
if sequences evolve at a constant rate,
the data will be approximately ultrametric
---
Problems with UPGMA
assumes a molecular clock -- that the rate of evolution is constant across the tree
In reality, the rate of evolution is not constant
because selection varies:
- among lineages
- among types of genes
- among genes in an organism
- among sites in a gene
---
Problems with UPGMA
Also:
---
Neighbor-joining (Saitou and Nei 1987)
In UPGMA, the pair-to-merge is the one with the minimum $d(i,j)$.
In NJ, we also build a tree by iteratively joining nodes,
but the pair-to-merge is the one that minimizes $Q$:
$$
\small
Q(i,j) = (n-2)d(i,j) - \sum_k d(i,k) - \sum_k d(j,k)
$$
$n$ is the number of current taxa.
$k$ sums over all other taxa
---
The NJ tree is guaranteed to be correct
*if the data are additive*.
additive means:
- the distance between any two tips is the sum of the branch lengths between them
- ultrametric data is additive, but not vice versa
- additivity is thus a weaker assumption
---
Problems with distance-based methods
- they are greedy
- they consider only distances, not actual sequences
- they rely on restrictive assumptions not typically met in real data (ultrametricity or additivity)
- they don't consider the possibility of multiple trees with the same score
---
Character-based methods
- Parsimony
- Maximum likelihood
- Bayesian inference
---
# Parsimony
Choose the tree that requires the fewest evolutionary changes
---
Consider this MSA (Multiple Sequence Alignment)
```
Species 1 AAG
Species 2 AAA
Species 3 GGA
Species 4 AGA
```
Which tree is more parsimonious?
---
Consider this MSA (Multiple Sequence Alignment)
```
Species 1 AAG
Species 2 AAA
Species 3 GGA
Species 4 AGA
```
Which tree is more parsimonious?
---
Algorithms:
- Scoring parismony: Fitch (1971) [Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology](https://doi.org/10.1093/sysbio/20.4.406)
- Weighted parsimony: see Sankoff (1975) [Minimal Mutation Trees of Sequences](https://doi.org/10.1137/0128004)
---
Summary of parsimony criterion:
- considers sequences themselves, not just distances
- requires a tree search
- slower than distance methods
- simple, straightforward, logical
- lacks a model; suitable for morphological data
- faster than likelihood methods
---
Also: Parsimony is sensitive to long-branch attraction
---
# Maximum likelihood
What is the most likely tree?
Given: model of sequence evolution, tree, sequences
Compute: the likelihood of the tree given the data
---
Computing the likelihood of the data on a tree
What is the most likely state at position X? Y?
---
$$
L(X) = P_{X \rightarrow T}(t_1) \times P_{X \rightarrow C}(t_2)
$$
$$
L(X=C) = P_{C \rightarrow T}(t_1) \times P_{C \rightarrow C}(t_2)
$$
---
$$
L(Y) = P_{Y \rightarrow C}(t_4) \times \sum_{X}^{X \in TCGA}{L(X)P_{Y \rightarrow X}(t_3)}
$$
---
Finally, at the root, for each site $n$,
$$
L_{n} = \sum_{Y}^{Y \in TCGA}{L(Y) \times \pi_{Y}}
$$
Likelihoods are multiplied across all sites:
$$
L_{total} = \sum_{n}{ln (L_{n})}
$$
---
Summary of Maximum likelihood criterion:
- More computationally intensive than Parsimony
- Requires a model of sequence evolution
- Requires a tree-search
- More robust to long-branch attraction
---
Tree confidence
All methods considered so far yield
a *point estimate*: a single tree.
They do not provide a measure of confidence.
---
Order of operations for phylogenetics
1. Identify orthologs
2. Align sequences (homology statements)
3. Build a tree
---
Multiple alignment with fast-fourier transform (MAFFT)
See
---
## File formats
---
FASTA format: raw sequences
```console
>sequence1
tgcatgactgctagctatgcatgcatacggcatatagc
>sequence2
tgcacgcactgctagctatgcaggcatacggcatatagc
>sequence3
tgcatgactgctagctatgcatgcataccatatagc
```
---
FASTA format: aligned sequences
```console
mafft dummy.fasta > dummy_aligned.fasta
```
```console
>sequence1
tgcatg-actgctagctatgcatgcatacggcatatagc
>sequence2
tgcacgcactgctagctatgcaggcatacggcatatagc
>sequence3
tgcatg-actgctagctatgcatgcatac--catatagc
---
CLUSTAL format
```console
mafft --clustalout dummy.fasta > dummy_aligned.clustal
```
```console
CLUSTAL format alignment by MAFFT FFT-NS-2 (v7.520)
sequence1 tgcatg-actgctagctatgcatgcatacggcatatagc
sequence2 tgcacgcactgctagctatgcaggcatacggcatatagc
sequence3 tgcatg-actgctagctatgcatgcatac--catatagc
****.* *************** ****** ********
```
---
## NEXUS format
```
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=3 NCHAR=39;
FORMAT DATATYPE=DNA MISSING=? GAP=-;
MATRIX
sequence1 tgcatg-actgctagctatgcatgcatacggcatatagc
sequence2 tgcacgcactgctagctatgcaggcatacggcatatagc
sequence3 tgcatg-actgctagctatgcatgcatac--catatagc
;
END;
```
---
## PHYLIP format
```
3 39
sequence1 tgcatg-actgctagctatgcatgcatacggcatatagc
sequence2 tgcacgcactgctagctatgcaggcatacggcatatagc
sequence3 tgcatg-actgctagctatgcatgcatac--catatagc
```
---
## Newick (parenthetic) format
```console
((x,y),z);
```
```console
((x,y),(z,w));
```
---
Plot trees in R with the `ape` package
```R
plot(ape::read.tree(text="((x,y),z));"))
```
![](images/phylogenetics/xyz.png)
---
Nested groups
```R
plot(ape::read.tree(text="((x,y),(z,w));"))
```
![](images/phylogenetics/xyzw.png)
---
Branch lengths
```R
plot(ape::read.tree(text="((x:0.2,y:0.4):0.1,(z:0.3,w:0.3):0.8);"))
```
![](images/phylogenetics/xyzw-branch-lengths.png)
---
How to use IQ-TREE
```
# Infer maximum-likelihood tree with auto-selected model
iqtree -s example.phy
# Find best-fit model only
iqtree -s example.phy -m MF
# Parallel processing
iqtree -s example.phy -nt 4
```
---
# SLURM
```bash
#!/bin/bash
#SBATCH --job-name=mafft
#SBATCH --output=mafft.out
#SBATCH --error=mafft.err
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=cobi
#SBATCH --
mafft --auto input.fasta > output.fasta
```
```bash
sbatch slurm_script.sh
```