## Hidden Markov Models (HMMs)

Note: HMMs are well-known for their effectiveness in modeling the correlations between adjacent symbols, domains, or events, and they have been extensively used in various fields, especially in speech recognition and digital communication. Considering the remarkable success of HMMs in engineering, it is no surprise that a wide range of problems in biological sequence analysis have also benefited from them.

---

## What do you remember about dynamic programming?

What is dynamic programming really?

- introduced in the 1950s by Richard Bellman of the RAND Corporation
- it's hard to define
- it's extremely broad

[The theory of dynamic programming](https://www.ams.org/journals/bull/1954-60-06/S0002-9904-1954-09848-8/)

---

## The setting

---

## The insight

The essence of dynamic programming is "do the best you can from where you are"

---

Consider a process determined at any time by an M-dimensional vector

$$ \mathbf{p} = (x_1, x_2, \ldots, x_M) $$

Consider a set of transformations $\mathbf{T} = \\{ T_k \\}$, which are functions that transform $\mathbf{p}$:

$$ T_k(p) = p' $$

We want to maximize our "return" -- the output of some scalar function $R(p)$ of the final state.

---

We want to select a series of transformations, called a "policy" $P=(T_1, T_2, \ldots, T_N)$, which will yield successive states:

$$
p_1 = T_1(p_0) \\\\
p_2 = T_2(p_1) \\\\
\ldots \\\\
p_N = T_N(p_{N-1})
$$

The maximum value of $R(p_N)$, determined by an optimal policy, is a function only of the initial vector $p_0$ and the number of stages $N$. The optimal return value is:

$$ f_N(p) = \max_{P} R(p_N) $$

---

$$ f_N(p) = \max_{P} R(p_N) $$

Choose our first transformation $T_1(p_0)$. The maximum return from the following $(N-1)$ stages is, by definition:

$$ f_{N-1}(T_1(p_0)) = f_{N-1}(p_1) $$

Thus:

$$ f_N(p) = \max_{P} R(p_N) = \max_{k} f_{N-1}(T_k(p)) $$

Which means: at any given time, we must choose the transformation that maximizes the return.

---

## Dynamic programming: the gist

You have a sequence of states, and at each time step, you make a decision. You want to end optimally, for some definition of optimal.

The brute-force way is to consider all possible sequences of decisions and then pick the best one, but this is not feasible for long sequences with many possible decisions.

Instead, divide the problem into sub-problems: at any given time, the best remaining decisions depend only on the current state of the system, not on how you got there.

This concept can apply to a huge array of problems, from investing to aircraft control to...computational biology.
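---

To make the recursion $f_N(p) = \max_{k} f_{N-1}(T_k(p))$ concrete, here is a minimal Python sketch. The state, the transformations, and the return function are all invented for illustration; the point is that memoizing $f$ on (state, stages remaining) avoids enumerating all $k^N$ policies.

```python
from functools import lru_cache

# Toy instance (values invented for illustration): the state p is a
# single number, and each transformation T_k is a simple map.
transformations = (
    lambda p: p + 1,   # T_1
    lambda p: 2 * p,   # T_2
    lambda p: p - 3,   # T_3
)

def R(p):
    return p  # "return" of the final state

@lru_cache(maxsize=None)
def f(p, n):
    """Optimal return f_n(p): best achievable R from state p in n stages."""
    if n == 0:
        return R(p)
    # Bellman's recursion: try each transformation, then do the best
    # you can from wherever it lands.
    return max(f(T(p), n - 1) for T in transformations)

print(f(0, 10))  # best return from p0 = 0 with a 10-stage policy
```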
---

## A toy HMM for 5′ splice site recognition

From: [What is a hidden Markov model?](https://www.nature.com/articles/nbt1004-1315)

Don't confuse a state-space diagram with a [graphical model diagram](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-018-0629-0/figures/2).
Note: Often, biological sequence analysis is just a matter of putting the right label on each residue. In gene identification, we want to label nucleotides as exons, introns, or intergenic sequence. In sequence alignment, we want to associate residues in a query sequence with homologous residues in a target database sequence. We can always write an ad hoc program for any given problem, but the same frustrating issues will always recur. One is that we want to incorporate heterogeneous sources of information. A genefinder, for instance, ought to combine splice-site consensus, codon bias, exon/intron length preferences and open reading frame analysis into one scoring system. How should these parameters be set? How should different kinds of information be weighted? A second issue is to interpret results probabilistically. Finding a best scoring answer is one thing, but what does the score mean, and how confident are we that the best scoring answer is correct? A third issue is extensibility. The moment we perfect our ad hoc genefinder, we wish we had also modeled translational initiation consensus, alternative splicing and a polyadenylation signal. Too often, piling more reality onto a fragile ad hoc program makes it collapse under its own weight.

---

## Hidden Markov Models (HMMs)

* Provide a foundation for probabilistic models of linear sequence ‘labeling’ problems
* Can be designed just by drawing a graph diagram
* Originally developed for applications to speech recognition
* Applications include: Gene prediction, protein secondary structure prediction, copy-number variation, chromatin-state assignment, chromatin topology ...

Note: Joint probability $P(A,B) = P(A|B) P(B) = P(B|A) P(A)$, Marginal probability $P(X=A) = \sum_{y} P(X=A, Y=y)$

---

## Markov models

## ↓

## Hidden Markov models

---

## Markov Models

* Set of states: $S = \\{s_1, s_2, \ldots, s_n\\}$
* Process moves from one state to another, generating a sequence of $L$ states: $x_{1}, x_{2}, \ldots, x_{L}$
* Markov property: the probability of a symbol depends only on the preceding symbol, not the entire previous sequence
$$P(x_{L} = s|x_{1},x_{2},\ldots,x_{L-1}) = P(x_{L}=s|x_{L-1})$$
* A Markov chain is defined by:
  * transition probabilities: $a_{st} = P(x_i=t| x_{i-1}=s)$, $A=\\{a_{st}\\}$
  * initial probabilities: $a_{0s} = P(x_1=s)$

---

## Example

$A = \begin{bmatrix} 0.7 & 0.3\\\\ 0.4 & 0.6 \end{bmatrix}$, $a_{0s} = (0.4, 0.6)$

Note: So let's consider I have two coins: one of them is fair and the other one is loaded.
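---

As a quick check on the two-coin chain above, here is a minimal Python sketch that scores a state sequence under a Markov chain (the encoding F = 0, L = 1 is assumed here for illustration). It reproduces the hand calculation of $P(L,L,F,F)$ on the next slide.

```python
# Transition matrix and initial probabilities from the example slide.
A  = [[0.7, 0.3],   # F -> F, F -> L
      [0.4, 0.6]]   # L -> F, L -> L
a0 = [0.4, 0.6]     # P(F), P(L)

def chain_prob(states):
    """P(x_1, ..., x_L) = P(x_1) times the product of P(x_i | x_{i-1})."""
    p = a0[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

F, L = 0, 1
print(chain_prob([L, L, F, F]))  # 0.6 * 0.6 * 0.4 * 0.7 = 0.1008
```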
---

## Calculation of sequence probability

Multiplication Rule of probability:

$$ P(A, B) = P(A|B) P(B) $$

$$ P(x_{1}, \ldots, x_{L}) = P(x_{L} | x_{1}, \ldots, x_{L-1}) P(x_{1}, \ldots, x_{L-1}) $$

By the Markov property, the probability of a state sequence is:

$$
\begin{eqnarray}
&=& P(x_{L} | x_{L-1}) P(x_{1}, x_{2}, \ldots, x_{L-1})\\\\
&=& \ldots\\\\
&=& P(x_{L} | x_{L-1}) \ldots P(x_{2} | x_{1}) P(x_{1})
\end{eqnarray}
$$

---

## Calculation of sequence probability

$a_0 = (0.4, 0.6)$

Suppose we want to calculate $P(L,L,F,F)$:

$$
\begin{eqnarray}
P(L,L,F,F) &=& P(F|L,L,F) P(L,L,F) \\\\
&=& P(F|F) P(F|L,L) P(L,L)\\\\
&=& P(F|F) P(F|L) P(L|L) P(L)\\\\
&=& 0.7 \times 0.4 \times 0.6 \times 0.6
\end{eqnarray}
$$

---

## Hidden Markov Models

* Set of states: $S = \\{s_1, s_2, \ldots, s_n\\}$
* Process moves from one state to another, generating a sequence of $L$ states: $x_{1}, x_{2}, \ldots, x_{L}$
* Markov property:
$$P(x_{L}|x_{1},x_{2},\ldots,x_{L-1}) = P(x_{L}|x_{L-1})$$
* States are not visible, but each state randomly generates an observation (or emission), producing the observed sequence $o_1, o_2, \ldots, o_L$

---

## Components of Hidden Markov Models

The following need to be defined for model $M = (A, B, a_0)$:

* transition probabilities: $\mathbf{A} = \\{ a_{st} \\}$, $a_{st} = P(x_i=t |x_{i-1} =s)$
* initial probabilities: $a_{0s} = P(x_1 = s)$
* observation/emission probabilities: $\mathbf{B} = \\{ e_k(b) \\}$, $e_k(b) = P(x_i=b | \pi_i=k)$ (probability of seeing symbol $b$ when in state $k$)

---

## Example

$A = \begin{bmatrix} 0.7 & 0.3\\\\ 0.4 & 0.6 \end{bmatrix}$, $B = \begin{bmatrix} 0.5 & 0.5\\\\ 0.3 & 0.7 \end{bmatrix}$, $a_{0} = (0.4, 0.6)$

---

## Calculation of sequence probability

Suppose we want to calculate $P( \\{ H,H \\} )$:

$$
\begin{eqnarray}
P( \\{ H,H \\} ) &=& P( \\{ H,H \\}, \\{ F,F \\}) + \\\\
& & P( \\{ H,H \\}, \\{ F,L \\}) + \\\\
& & P( \\{ H,H \\}, \\{ L,F \\}) + \\\\
& & P( \\{ H,H \\}, \\{ L,L \\})
\end{eqnarray}
$$

$$
\begin{eqnarray}
P( \\{ H,H \\}, \\{ F,F \\}) &=& P( \\{ H,H \\} | \\{ F,F \\}) P(\\{ F,F \\})\\\\
&=& P(H|F) P(H|F) P(F|F) P(F)
\end{eqnarray}
$$

Note: Consider all possible hidden state sequences.

---

## 3 Computational applications of HMMs

* Decoding problem (aka uncovering, parsing, or inference): Given an HMM $M=(A,B,a_0)$ and an observation sequence $O$, find the sequence of states most likely to have produced $O$.
* Likelihood problem (aka evaluation, or scoring): Given an HMM $M=(A,B,a_0)$ and an observation sequence $O$, calculate the likelihood $P(O|M)$.
* Learning problem (aka parameter estimation, or fitting): Given an HMM structure and an observation sequence $O$, determine the HMM parameters that best fit the training data.

---

## Solutions to 3 applications of HMMs

* Decoding problem (aka uncovering, parsing, or inference): Viterbi algorithm
* Likelihood problem (aka evaluation, or scoring): Forward-backward algorithm
* Learning problem (aka parameter estimation, or fitting): Baum-Welch algorithm

---

## Decoding problem

Given HMM $M=(A,B,a_0)$ and observation sequence $O$, find the sequence of states most likely to have produced $O$.

---

## Decoding problem

$P(O, S)$: the probability that the HMM follows the sequence of states $S$ and emits string $O$.

$$P(O,S) = P(O_{1:T}, S_{1:T}) = P(O_{1:T}|S_{1:T}) P(S_{1:T})$$

$P(O_{1:T}|S_{1:T})$: product of emission probabilities

$P(S_{1:T})$: product of transition probabilities
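---

To see what summing over hidden paths looks like, here is a minimal Python sketch for the coin HMM above (encodings F = 0, L = 1 and H = 0, T = 1 are assumed for illustration). It computes $P(O,S)$ as the product of transition and emission probabilities, and sums it over every possible path to get $P(O)$ -- exactly the enumeration that becomes hopeless as $N^T$ grows.

```python
from itertools import product

# Parameters from the example slide.
A  = [[0.7, 0.3], [0.4, 0.6]]   # transitions: F->F, F->L / L->F, L->L
B  = [[0.5, 0.5], [0.3, 0.7]]   # emissions: P(H|F), P(T|F) / P(H|L), P(T|L)
a0 = [0.4, 0.6]                 # initial: P(F), P(L)

def joint_prob(obs, states):
    """P(O, S): initial and transition probabilities along S,
    times the emission probability of each observation."""
    p = a0[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p

H, T = 0, 1
obs = [H, H]
# P(O): sum P(O, S) over all N^T hidden paths (4 paths for T = 2).
print(sum(joint_prob(obs, s) for s in product([0, 1], repeat=len(obs))))
```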
---

## Decoding problem

Find the $S_{1:T}$ that maximizes the joint probability $P(O_{1:T}, S_{1:T})$ over all possible paths in the HMM.

$N$ states, $T$ time steps: $N^T$ paths...

---

## Viterbi algorithm

* Dynamic programming
* $N$ rows (number of states), $T$ columns (length of sequence)
* Initialization: $v_l(1) = a_{0l} B(O_1|S_l) = a_{0l} e_l(x_1)$
* Recursion: $v_l(i) = e_l(x_i) \times \max_k(v_k(i-1)a_{kl})$

Read as: the Viterbi score for state $l$ at position $i$ is the emission probability of observation $x_i$ in state $l$, times the best of the previous scores weighted by their transition probabilities.

---

## Viterbi algorithm

Observations: HHTTTTTTTH, $a_0=(0.5,0.5)$

---

## Viterbi algorithm

Use log space to avoid numerical underflow.

Observations: HHTTTTTTTH, $a_0=(-0.69,-0.69)$ in log space

---

## Viterbi algorithm

Observations: HHTTTTTTTH, $a_0=(-0.69,-0.69)$

Note: log(state prior * state emission): -0.69 - 0.69

---

## Viterbi algorithm

Observations: HHTTTTTTTH, $a_0=(-0.69,-0.69)$

Note: log(state prior * state emission): -1.20 - 0.69

---

## Viterbi algorithm

Note: -1.39 - 0.36 - 0.69 = -2.44, -1.90 - 0.92 - 0.69 = -3.51

---

## Viterbi algorithm

Note: -1.39 - 1.20 - 1.20 = -3.79, -1.90 - 0.51 - 1.20 = -3.60

---

## Viterbi algorithm

---

## Viterbi algorithm

---

## Viterbi algorithm

---

## Viterbi algorithm

---

## Likelihood problem

Given the HMM $M=(A,B,a_0)$ and an observation sequence $O$, calculate the likelihood $P(O|M)$.

If we knew the states, we could simply multiply the probabilities of each observation:

$$P(O|S) = \prod_{i=1}^T P(O_i|S_i) = \prod_{i=1}^T e_{S_i}(o_i)$$

But we do not know the state sequence!

---

## Likelihood problem

So, we must sum over all possible state sequences:

$$
\begin{eqnarray}
P(O) &=& \sum_{S} P(O,S) \\\\
&=& \sum_{S} P(O|S) P(S)
\end{eqnarray}
$$

Note: we can compute the total probability of the observations just by summing over all possible hidden state sequences. Again, for $N$ states and $T$ time steps, we are talking $N^T$ paths.

---

## Forward-backward algorithm

* Again, dynamic programming
* $N$ rows (number of states), $T$ columns (length of sequence)
* Initialization: $f_l(1) = a_{0l} B(O_1|S_l) = a_{0l} e_l(x_1)$
* Recursion: $f_l(i) = e_l(x_i) \times \sum_{k}{f_k(i-1)a_{kl}}$

---

## Forward vs Viterbi

The only difference between the forward algorithm and Viterbi is that you sum over all the possible paths instead of choosing the maximum:

Viterbi recursion step: $v_l(i) = e_l(x_i) \times \max_k(v_k(i-1)a_{kl})$

Forward algorithm recursion step: $f_l(i) = e_l(x_i) \times \sum_{k}{f_k(i-1)a_{kl}}$

---

### Forward-backward algorithm: both directions

Forward: probability of the state, given the observations up to and including this point

Backward: probability of the following observations, given the state

Combined: the two results give the distribution over states at any specific point, given all the observations.

The probability of being in state $k$ at position $i$ is

$$P(\pi_i = k|x) = \frac{f_k(i)b_k(i)}{P(x)}$$

Note: Because we are interested in comparing states at each time point, we have to scale: for example, $f_F(3) b_F(3) / \sum_k f_k(3) b_k(3)$. So when using this method, we also get confidence values associated with our determination of the most likely state.
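---

Since the two recursions differ only in max versus sum, here is a minimal log-space sketch of both for the coin HMM (encodings H = 0, T = 1 and fair = 0, loaded = 1 are assumed for illustration). On the observations HHTTTTTTTH it retraces the table filled in by hand in the slides.

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities
B  = [[0.5, 0.5], [0.3, 0.7]]   # emission probabilities
a0 = [0.5, 0.5]                 # initial probabilities from the slides

def viterbi(obs):
    """Most likely state path, computed in log space to avoid underflow."""
    N = len(A)
    v = [[math.log(a0[s]) + math.log(B[s][obs[0]]) for s in range(N)]]
    backptr = []
    for o in obs[1:]:
        prev, col, back = v[-1], [], []
        for l in range(N):
            # Best previous state k for landing in state l now.
            k = max(range(N), key=lambda q: prev[q] + math.log(A[q][l]))
            col.append(prev[k] + math.log(A[k][l]) + math.log(B[l][o]))
            back.append(k)
        v.append(col)
        backptr.append(back)
    # Trace back from the best final state.
    path = [max(range(N), key=lambda s: v[-1][s])]
    for back in reversed(backptr):
        path.append(back[path[-1]])
    return path[::-1]

def forward(obs):
    """Likelihood P(O): the same recursion with sum in place of max."""
    f = [a0[s] * B[s][obs[0]] for s in range(len(A))]
    for o in obs[1:]:
        f = [B[l][o] * sum(f[k] * A[k][l] for k in range(len(A)))
             for l in range(len(A))]
    return sum(f)

H, T = 0, 1
obs = [H, H, T, T, T, T, T, T, T, H]
print(viterbi(obs))   # 0 = fair, 1 = loaded
print(forward(obs))
```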
---

## Learning problem

Given an observation sequence $O$ and the general structure of an HMM, determine the HMM parameters that best fit the training data.

### Learning with annotated training data

If we had training data -- a sequence of observations with a known state sequence -- we could just use maximum likelihood. We could compute:

- emission probabilities for each state
- transition probabilities for each state
- initial state probabilities

Together, these define the model.

---

## Estimating parameters for coin flips

---

## Estimating parameters for coin flips

Emissions: $P(H|F), P(T|F), P(H|L), P(T|L)$

Transitions: $P(F|F), P(F|L), P(L|F), P(L|L)$

Problem: What if we don't have known state sequences?

Note: Consider a fully visible Markov model. This would easily allow us to compute the HMM parameters just by maximum likelihood estimation from the training data. For a real HMM, we cannot compute these counts directly from an observation sequence, since we don't know which path of states was taken through the machine for a given input. The Baum-Welch algorithm solves this by iteratively estimating the counts: we start with estimates for the transition and emission probabilities, use the forward-backward procedure to determine the probability of the states at each observation, and then use those probabilities to derive better estimates of the transition and emission probabilities.

---

### Learning *without* annotated training data

What algorithm have we seen that can be used to compute maximum likelihood estimates with incomplete data?

* Input is only observations: $o_1, o_2, \ldots, o_T$
* Missing: the hidden state sequence

---

### Learning *without* annotated training data

* Baum-Welch algorithm (an EM algorithm)
* Iteratively estimates the missing data and maximizes the parameters
* guaranteed to converge to a local optimum
* not guaranteed to find the global optimum

---

### Baum-Welch

Initialize parameters.

E-step (expectation): Use forward-backward to estimate state probabilities.

M-step (maximization): Adjust the transition/emission probabilities in the model according to those estimated state probabilities.

Iterate until convergence.
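---

Here is a minimal sketch of one Baum-Welch iteration for a discrete HMM, in plain probability space (no scaling or log space, so it is only suitable for short sequences; function names are my own). The E-step computes posterior state and transition probabilities with forward-backward; the M-step turns the expected counts into new parameters.

```python
def forward_table(obs, A, B, a0):
    N = len(A)
    f = [[a0[s] * B[s][obs[0]] for s in range(N)]]
    for o in obs[1:]:
        f.append([B[l][o] * sum(f[-1][k] * A[k][l] for k in range(N))
                  for l in range(N)])
    return f

def backward_table(obs, A, B):
    N = len(A)
    b = [[1.0] * N]
    for o in reversed(obs[1:]):
        b.insert(0, [sum(A[k][l] * B[l][o] * b[0][l] for l in range(N))
                     for k in range(N)])
    return b

def baum_welch_step(obs, A, B, a0):
    """One EM iteration: expected counts (E-step), re-estimation (M-step)."""
    N, T = len(A), len(obs)
    f, b = forward_table(obs, A, B, a0), backward_table(obs, A, B)
    PO = sum(f[-1][s] for s in range(N))          # likelihood P(O)
    # E-step: gamma[t][k] = P(state k at time t | O),
    #         xi[t][k][l] = P(transition k -> l at time t | O).
    gamma = [[f[t][k] * b[t][k] / PO for k in range(N)] for t in range(T)]
    xi = [[[f[t][k] * A[k][l] * B[l][obs[t + 1]] * b[t + 1][l] / PO
            for l in range(N)] for k in range(N)] for t in range(T - 1)]
    # M-step: normalized expected counts become the new parameters.
    newA = [[sum(x[k][l] for x in xi) / sum(g[k] for g in gamma[:-1])
             for l in range(N)] for k in range(N)]
    newB = [[sum(g[k] for t, g in enumerate(gamma) if obs[t] == o) /
             sum(g[k] for g in gamma)
             for o in range(len(B[0]))] for k in range(N)]
    return newA, newB, gamma[0]

A, B, a0 = [[0.7, 0.3], [0.4, 0.6]], [[0.5, 0.5], [0.3, 0.7]], [0.5, 0.5]
obs = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]   # HHTTTTTTTH with H = 0, T = 1
for _ in range(10):
    A, B, a0 = baum_welch_step(obs, A, B, a0)
print(A, B, a0)
```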
---

## Advantages and limitations

* Modularity: HMMs can be combined into larger HMMs
* Transparency: based on a state model, making them interpretable
* Prior knowledge: can be incorporated into the model

* Need accurate, applicable, and sufficiently sized training sets of data
* The model may not converge to a true optimum
* Can be slow in comparison to other methods

---

## Applications

1. ChromHMM: chromatin state segmentation
2. HMMER: protein homology search
3. Universe-HMM: consensus interval sets

---

## ChromHMM

> ChromHMM is a Java program for learning and characterizing chromatin states using a multivariate Hidden Markov Model that models the combinatorial and spatial patterns in data from multiple chromatin marks.

---

ChromHMM input is a binary matrix of chromatin mark presence/absence across genomic tiles.

- Columns are chromatin marks
- Rows are genomic tiles

Example:

```console
Cell   chr1
Mark1  Mark2  Mark3
0      0      0
0      1      0
0      0      1
```

---

ChromHMM output is a segmentation that divides the genome into chromatin states.

Example

---

## HMMER

> HMMER is a software package that provides tools for making probabilistic models of protein and DNA sequence domain families – called profile hidden Markov models, profile HMMs, or just profiles – and for using these profiles to annotate new sequences, to search sequence databases for additional homologs, and to make deep multiple sequence alignments.

---

HMMER input is a protein multiple-sequence alignment.

Example:

---

HMMER learns a profile HMM from the input:

```
hmmbuild globins4.hmm globins4.sto
```

---

Trained HMMER models can be used to search sequence databases for other sequences that match the alignment.

---

## Consensus region set HMM

> In our model, there are three observed sequences: the number of starts, overlaps, and ends at a given position. The hidden variable corresponds to the different parts of the flexible segment. We can tune the transition probabilities, which can be chosen in a way that will prevent unnecessary segmentation, and the emission matrix, which describes the relationship between observations and hidden states.

---

Region set HMM input is a collection of region sets, which is transformed into counts of starts, overlaps, and ends.

Output is a unified segmentation, a consensus region set.

---
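To make the observation encoding concrete, here is a minimal sketch (function and variable names are my own, not the published implementation) that turns a collection of region sets into the three observed tracks the model describes: counts of starts, overlaps, and ends at each genomic position. Decoding these tracks with an HMM then yields the unified segmentation.

```python
def count_tracks(region_sets, length):
    """Per-position counts of region starts, coverage (overlaps), and ends."""
    starts, overlaps, ends = [0] * length, [0] * length, [0] * length
    for regions in region_sets:        # one list of intervals per input file
        for begin, end in regions:     # half-open interval [begin, end)
            starts[begin] += 1
            ends[end - 1] += 1
            for pos in range(begin, end):
                overlaps[pos] += 1
    return starts, overlaps, ends

# Three toy region sets on a 10-position stretch (coordinates invented).
region_sets = [[(1, 5)], [(2, 6)], [(1, 6), (8, 10)]]
print(count_tracks(region_sets, 10))
```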