In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step. The Baum–Welch algorithm, the primary method for inference in hidden Markov models, is numerically unstable due to its recursive calculation of joint probabilities. As the number of variables grows, these joint probabilities become increasingly small, leading to the forward recursions rapidly approaching values below machine precision.[1]
The Baum–Welch algorithm is named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s.[2] One of the first major applications of HMMs was to the field of speech processing.[3] In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information.[4] They have since become an important tool in the probabilistic modeling of genomic sequences.[5]
A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the i-th hidden variable given the (i − 1)-th hidden variable is independent of previous hidden variables, and the current observation variables depend only on the current hidden state.
The Baum–Welch algorithm uses the well-known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors.
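Let $X_t$ be a discrete hidden random variable with $N$ possible values (that is, we assume there are $N$ states in total). We assume that $P(X_t \mid X_{t-1})$ is independent of time $t$, which leads to the definition of the time-independent stochastic transition matrix
$$A = \{a_{ij}\} = P(X_t = j \mid X_{t-1} = i).$$
The initial state distribution (that is, when $t = 1$) is given by
$$\pi_i = P(X_1 = i).$$
The observation variables $Y_t$ can take one of $K$ possible values. The probability of a certain observation $y_i$ at time $t$ for state $X_t = j$ is given by
$$b_j(y_i) = P(Y_t = y_i \mid X_t = j).$$
Taking into account all the possible values of $Y_t$ and $X_t$, we obtain the $N \times K$ matrix $B = \{b_j(y_i)\}$, where $b_j$ ranges over all the possible states and $y_i$ ranges over all the possible observations.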
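An observation sequence is given by $Y = (Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T)$.
Thus we can describe a hidden Markov chain by $\theta = (A, B, \pi)$. The Baum–Welch algorithm finds a local maximum for $\theta^* = \operatorname{argmax}_{\theta} P(Y \mid \theta)$, that is, the HMM parameters $\theta$ that maximize the probability of the observation.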
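The objective $P(Y \mid \theta)$ can be made concrete with a brute-force sketch that sums the joint probability over every hidden state path. The parameter values, the 0/1 coding of the observations and the name `brute_force_likelihood` are illustrative assumptions rather than part of the definition above, and the enumeration has $N^T$ terms, so it is only usable for tiny examples.

```python
from itertools import product

def brute_force_likelihood(A, B, pi, obs):
    """Sum P(Y, X | theta) over every hidden state path X (N**T terms)."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        # Initial state probability times the first emission probability.
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            # One transition and one emission per further time step.
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

# Illustrative two-state model (assumed values); observations coded as 0 or 1.
A = [[0.5, 0.5], [0.3, 0.7]]   # A[i][j] = P(X_t = j | X_{t-1} = i)
B = [[0.3, 0.7], [0.8, 0.2]]   # B[j][k] = P(Y_t = k | X_t = j)
pi = [0.2, 0.8]                # pi[i]   = P(X_1 = i)
likelihood = brute_force_likelihood(A, B, pi, [0, 0, 1])
```

Baum–Welch searches for parameters that increase this quantity without enumerating the paths.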
Set $\theta = (A, B, \pi)$ to an initial guess, for example random values.
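Let $\alpha_i(t) = P(Y_1 = y_1, \ldots, Y_t = y_t, X_t = i \mid \theta)$, the probability of seeing the observations $y_1, y_2, \ldots, y_t$ and being in state $i$ at time $t$. This is found recursively:
$$\alpha_i(1) = \pi_i b_i(y_1),$$
$$\alpha_i(t+1) = b_i(y_{t+1}) \sum_{j=1}^{N} \alpha_j(t) a_{ji}.$$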
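The forward recursion can be sketched in a few lines of plain Python. The model values and the 0/1 coding of observations are assumptions chosen for illustration, and time indices are 0-based in the code, so `alpha[0]` corresponds to $\alpha(1)$.

```python
def forward(A, B, pi, obs):
    """Return alpha, where alpha[t][i] = P(y_1..y_{t+1}, X_{t+1} = i | theta)."""
    n = len(pi)
    # Base case: alpha_i(1) = pi_i * b_i(y_1).
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # alpha_i(t+1) = b_i(y_{t+1}) * sum_j alpha_j(t) * a_ji.
        alpha.append([
            B[i][obs[t]] * sum(prev[j] * A[j][i] for j in range(n))
            for i in range(n)
        ])
    return alpha

# Illustrative two-state model (assumed values).
A = [[0.5, 0.5], [0.3, 0.7]]
B = [[0.3, 0.7], [0.8, 0.2]]
pi = [0.2, 0.8]
alpha = forward(A, B, pi, [0, 0, 1])
likelihood = sum(alpha[-1])  # P(Y | theta) is the sum of the final alphas
```

Summing the last row gives $P(Y \mid \theta)$ at a cost proportional to $N^2 T$ rather than $N^T$.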
Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences.[7] However, this can be avoided in a slightly modified algorithm by scaling $\alpha$ in the forward procedure and $\beta$ in the backward procedure.
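One common way to realize such scaling, shown here as a sketch and not necessarily the scheme the cited source has in mind, is to renormalize $\alpha$ to sum to one at every step and keep the normalizers. The normalizer at step $t$ equals $P(y_t \mid y_1, \ldots, y_{t-1})$, so the sum of their logarithms is $\log P(Y \mid \theta)$. The function name and model values are assumptions.

```python
import math

def forward_scaled(A, B, pi, obs):
    """Forward pass that renormalizes alpha at every step.

    Returns the scaled alphas, the scale factors, and log P(Y | theta).
    """
    n = len(pi)
    alpha_hat, scales = [], []
    for t, y in enumerate(obs):
        if t == 0:
            raw = [pi[i] * B[i][y] for i in range(n)]
        else:
            prev = alpha_hat[-1]
            raw = [B[i][y] * sum(prev[j] * A[j][i] for j in range(n))
                   for i in range(n)]
        c = sum(raw)  # scale factor: P(y_t | y_1, ..., y_{t-1})
        alpha_hat.append([v / c for v in raw])
        scales.append(c)
    return alpha_hat, scales, sum(math.log(c) for c in scales)

# Illustrative two-state model (assumed values).
A = [[0.5, 0.5], [0.3, 0.7]]
B = [[0.3, 0.7], [0.8, 0.2]]
pi = [0.2, 0.8]
short_ll = forward_scaled(A, B, pi, [0, 0, 1])[2]
long_hat, _, long_ll = forward_scaled(A, B, pi, [0, 1] * 2000)
```

Each scaled row stays a proper distribution over states, so the pass remains well behaved even for the 4000-step sequence, where the log-likelihood is reported instead of a probability too small to represent.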
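Let $\beta_i(t) = P(Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid X_t = i, \theta)$, the probability of the ending partial sequence $y_{t+1}, \ldots, y_T$ given starting state $i$ at time $t$. We calculate $\beta_i(t)$ as:
$$\beta_i(T) = 1,$$
$$\beta_i(t) = \sum_{j=1}^{N} \beta_j(t+1) a_{ij} b_j(y_{t+1}).$$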
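The backward recursion mirrors the forward one but runs from the end of the sequence. The following sketch uses assumed model values and 0-based time indices, so `beta[T - 1]` corresponds to $\beta(T)$.

```python
def backward(A, B, obs):
    """Return beta, where beta[t][i] = P(y_{t+2}..y_T | X_{t+1} = i, theta)."""
    n, T = len(A), len(obs)
    # Base case: beta_i(T) = 1 for every state.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        # beta_i(t) = sum_j beta_j(t+1) * a_ij * b_j(y_{t+1}).
        beta[t] = [
            sum(beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]] for j in range(n))
            for i in range(n)
        ]
    return beta

# Illustrative two-state model (assumed values).
A = [[0.5, 0.5], [0.3, 0.7]]
B = [[0.3, 0.7], [0.8, 0.2]]
pi = [0.2, 0.8]
obs = [0, 0, 1]
beta = backward(A, B, obs)
# Combining beta(1) with the initial step also yields P(Y | theta).
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(2))
```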
We can now calculate the temporary variables, according to Bayes' theorem:
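$$\gamma_i(t) = P(X_t = i \mid Y, \theta) = \frac{P(X_t = i, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t) \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t) \beta_j(t)},$$
which is the probability of being in state $i$ at time $t$ given the observed sequence $Y$ and the parameters $\theta$, and
$$\xi_{ij}(t) = P(X_t = i, X_{t+1} = j \mid Y, \theta) = \frac{P(X_t = i, X_{t+1} = j, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t) a_{ij} \beta_j(t+1) b_j(y_{t+1})}{\sum_{k=1}^{N} \sum_{w=1}^{N} \alpha_k(t) a_{kw} \beta_w(t+1) b_w(y_{t+1})},$$
which is the probability of being in states $i$ and $j$ at times $t$ and $t+1$ respectively, given the observed sequence $Y$ and the parameters $\theta$.
The denominators of $\gamma_i(t)$ and $\xi_{ij}(t)$ are the same; they represent the probability of making the observation $Y$ given the parameters $\theta$.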
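Given forward and backward variables, the two posteriors follow directly from the formulas above. In this sketch the `alpha` and `beta` tables are hard-coded; they are the values the two recursions produce for an assumed two-state model and the observation sequence `[0, 0, 1]`, so the numbers are illustrative only.

```python
def posteriors(alpha, beta, A, B, obs):
    """Compute gamma[t][i] and xi[t][i][j] from forward/backward variables."""
    T, n = len(alpha), len(alpha[0])
    gamma, xi = [], []
    for t in range(T):
        # Shared denominator: P(Y | theta) = sum_j alpha_j(t) * beta_j(t).
        p_y = sum(alpha[t][j] * beta[t][j] for j in range(n))
        gamma.append([alpha[t][i] * beta[t][i] / p_y for i in range(n)])
        if t < T - 1:
            xi.append([
                [alpha[t][i] * A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]] / p_y
                 for j in range(n)]
                for i in range(n)
            ])
    return gamma, xi

# Illustrative two-state model (assumed values) and its alpha/beta tables.
A = [[0.5, 0.5], [0.3, 0.7]]
B = [[0.3, 0.7], [0.8, 0.2]]
obs = [0, 0, 1]
alpha = [[0.06, 0.64], [0.0666, 0.3824], [0.103614, 0.060196]]
beta = [[0.2075, 0.2365], [0.45, 0.35], [1.0, 1.0]]
gamma, xi = posteriors(alpha, beta, A, B, obs)
```

A useful consistency check is that summing $\xi_{ij}(t)$ over $j$ recovers $\gamma_i(t)$.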
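The parameters of the hidden Markov model $\theta$ can now be updated:
$$\pi_i^* = \gamma_i(1),$$
which is the expected frequency spent in state $i$ at time $1$;
$$a_{ij}^* = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)},$$
which is the expected number of transitions from state $i$ to state $j$ relative to the expected total number of transitions out of state $i$; and
$$b_i^*(v_k) = \frac{\sum_{t=1}^{T} 1_{y_t = v_k} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)},$$
where
$$1_{y_t = v_k} = \begin{cases} 1 & \text{if } y_t = v_k, \\ 0 & \text{otherwise} \end{cases}$$
is an indicator function, and $b_i^*(v_k)$ is the expected number of times the output observations have been equal to $v_k$ while in state $i$ relative to the expected total number of times in state $i$.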
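The update formulas amount to ratios of summed posteriors. The sketch below applies them to small hand-made `gamma` and `xi` tables; these numbers are invented for illustration (chosen only so that the rows of each `xi[t]` sum to `gamma[t]`) and do not come from a fitted model. The name `reestimate` is likewise an assumption.

```python
def reestimate(gamma, xi, obs, num_symbols):
    """Apply the pi*, a*, b* update formulas to given posteriors."""
    T, n = len(gamma), len(gamma[0])
    new_pi = list(gamma[0])  # pi*_i = gamma_i(1)
    new_A = [
        [sum(xi[t][i][j] for t in range(T - 1)) /
         sum(gamma[t][i] for t in range(T - 1))
         for j in range(n)]
        for i in range(n)
    ]
    new_B = [
        # The "if" clause plays the role of the indicator function.
        [sum(gamma[t][i] for t in range(T) if obs[t] == k) /
         sum(gamma[t][i] for t in range(T))
         for k in range(num_symbols)]
        for i in range(n)
    ]
    return new_A, new_B, new_pi

# Hand-made posteriors for T = 3, N = 2 (illustrative numbers only).
gamma = [[0.5, 0.5], [0.25, 0.75], [0.6, 0.4]]
xi = [[[0.125, 0.375], [0.125, 0.375]],
      [[0.15, 0.10], [0.45, 0.30]]]
obs = [0, 1, 0]
new_A, new_B, new_pi = reestimate(gamma, xi, obs, num_symbols=2)
```

Because each row of the updated matrices is a ratio with a shared denominator, the rows of `new_A` and `new_B` sum to one automatically.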
These steps are now repeated iteratively until a desired level of convergence.
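Putting the steps together, a minimal unscaled iteration loop might look as follows. This is a sketch under stated assumptions: the starting parameters, the observation sequence, the stopping tolerance and the function names are all illustrative, no scaling is applied (so it suits short sequences only), and no guard is included for states whose expected occupancy is zero.

```python
def em_step(A, B, pi, obs):
    """One Baum-Welch iteration; returns updated (A, B, pi) and P(Y | old theta)."""
    n, T, K = len(pi), len(obs), len(B[0])
    # Forward pass.
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([B[i][obs[t]] * sum(alpha[t - 1][j] * A[j][i] for j in range(n))
                      for i in range(n)])
    # Backward pass.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]] for j in range(n))
                   for i in range(n)]
    p_y = sum(alpha[-1])
    # Posteriors gamma and xi.
    gamma = [[alpha[t][i] * beta[t][i] / p_y for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]] / p_y
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # Parameter updates.
    new_A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(n)] for i in range(n)]
    new_B = [[sum(g[i] for g, y in zip(gamma, obs) if y == k) / sum(g[i] for g in gamma)
              for k in range(K)] for i in range(n)]
    return new_A, new_B, list(gamma[0]), p_y

def baum_welch(A, B, pi, obs, tol=1e-9, max_iter=100):
    """Repeat em_step until the likelihood improves by less than tol."""
    history = []
    for _ in range(max_iter):
        A, B, pi, p_y = em_step(A, B, pi, obs)
        converged = bool(history) and p_y - history[-1] < tol
        history.append(p_y)
        if converged:
            break
    return A, B, pi, history

# Illustrative starting guess and data (assumed values; 0 and 1 code two symbols).
A0 = [[0.5, 0.5], [0.3, 0.7]]
B0 = [[0.3, 0.7], [0.8, 0.2]]
pi0 = [0.2, 0.8]
obs = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
A_fit, B_fit, pi_fit, history = baum_welch(A0, B0, pi0, obs)
```

The recorded likelihoods in `history` should never decrease from one iteration to the next, which is the defining property of an EM procedure and a convenient sanity check for an implementation.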
Note: It is possible to over-fit a particular data set. That is, $P(Y \mid \theta_{\text{final}}) > P(Y \mid \theta_{\text{true}})$.
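The algorithm described thus far assumes a single observed sequence $Y = y_1, \ldots, y_T$. However, in many situations several sequences are observed: $Y_1, \ldots, Y_R$. In this case, the information from all of the observed sequences must be used in the update of the parameters $A$, $\pi$, and $b$. Assuming that $\gamma_{ir}(t)$ and $\xi_{ijr}(t)$ have been computed for each sequence $y_{1,r}, \ldots, y_{N_r,r}$, where $N_r$ is the length of the $r$-th sequence, the parameters can now be updated:
$$\pi_i^* = \frac{\sum_{r=1}^{R} \gamma_{ir}(1)}{R},$$
$$a_{ij}^* = \frac{\sum_{r=1}^{R} \sum_{t=1}^{N_r - 1} \xi_{ijr}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{N_r - 1} \gamma_{ir}(t)},$$
$$b_i^*(v_k) = \frac{\sum_{r=1}^{R} \sum_{t=1}^{N_r} 1_{y_{t,r} = v_k} \gamma_{ir}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{N_r} \gamma_{ir}(t)},$$
where
$$1_{y_{t,r} = v_k} = \begin{cases} 1 & \text{if } y_{t,r} = v_k, \\ 0 & \text{otherwise} \end{cases}$$
is an indicator function.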
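The multi-sequence update pools numerators and denominators across sequences before dividing, rather than averaging per-sequence estimates. A sketch with two short sequences follows; the posterior tables are hand-made illustrative numbers and `pooled_update` is a hypothetical helper name.

```python
def pooled_update(stats, num_symbols):
    """stats: list of (gamma, xi, obs) triples, one per observed sequence."""
    n = len(stats[0][0][0])
    R = len(stats)
    # pi*: average of the time-1 posteriors over the R sequences.
    new_pi = [sum(g[0][i] for g, _, _ in stats) / R for i in range(n)]
    # a*: pooled expected transitions over pooled expected departures.
    new_A = [
        [sum(x[i][j] for _, xi, _ in stats for x in xi) /
         sum(row[i] for g, _, _ in stats for row in g[:-1])
         for j in range(n)]
        for i in range(n)
    ]
    # b*: pooled expected emissions of symbol k over pooled expected occupancy.
    new_B = [
        [sum(row[i] for g, _, obs in stats for row, y in zip(g, obs) if y == k) /
         sum(row[i] for g, _, _ in stats for row in g)
         for k in range(num_symbols)]
        for i in range(n)
    ]
    return new_A, new_B, new_pi

# Two sequences of lengths 3 and 2 with hand-made posteriors (illustrative only).
seq1 = ([[0.5, 0.5], [0.25, 0.75], [0.6, 0.4]],
        [[[0.125, 0.375], [0.125, 0.375]], [[0.15, 0.10], [0.45, 0.30]]],
        [0, 1, 0])
seq2 = ([[0.2, 0.8], [0.5, 0.5]],
        [[[0.1, 0.1], [0.4, 0.4]]],
        [1, 0])
new_A, new_B, new_pi = pooled_update([seq1, seq2], num_symbols=2)
```

Pooling in this way gives longer sequences proportionally more influence on $A$ and $b$, while every sequence contributes equally to $\pi$.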
Suppose we have a chicken from which we collect eggs at noon every day. Whether or not the chicken has laid eggs for collection depends on some unknown factors that are hidden. We can, however, assume (for simplicity) that the chicken is always in one of two states that influence whether it lays eggs, and that this state depends only on the state on the previous day. We do not know the state at the initial starting point, the transition probabilities between the two states, or the probability that the chicken lays an egg given a particular state.[8] [9] To start, we first guess the transition and emission matrices.
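Transition | S1 | S2
---|---|---
S1 | 0.5 | 0.5
S2 | 0.3 | 0.7

Emission | No eggs (N) | Eggs (E)
---|---|---
S1 | 0.3 | 0.7
S2 | 0.8 | 0.2

Initial | Probability
---|---
S1 | 0.2
S2 | 0.8

We then take a set of observations (E = eggs, N = no eggs): N, N, N, N, N, E, E, N, N, N.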
This gives us a set of observed transitions between days: NN, NN, NN, NN, NE, EE, EN, NN, NN
The next step is to estimate a new transition matrix. For example, the probability of the sequence NN and the state being S1 then S2 is given by the following:
P(S1) ⋅ P(N|S1) ⋅ P(S1 → S2) ⋅ P(N|S2).
Observed sequence | Probability of observing that sequence if the state is S1 then S2 | Highest probability of observing that sequence | States giving the highest probability
---|---|---|---
NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2
NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2
NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2
NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2
NE | 0.006 = 0.2 × 0.3 × 0.5 × 0.2 | 0.1344 | S2, S1
EE | 0.014 = 0.2 × 0.7 × 0.5 × 0.2 | 0.0490 | S1, S1
EN | 0.056 = 0.2 × 0.7 × 0.5 × 0.8 | 0.0896 | S2, S2
NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2
NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2
Total | 0.22 | 2.4234 |
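Thus the new estimate for the S1 to S2 transition is now $\frac{0.22}{2.4234} = 0.0908$. The S1 to S1, S2 to S1 and S2 to S2 transitions are estimated in the same way, and the results are then normalized so that each row sums to 1.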
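The arithmetic of this step can be reproduced with a short script. The parameter values are read off the factors shown in the table (the S2 row of the transition matrix is inferred from 0.3584 = 0.8 × 0.8 × 0.7 × 0.8), and the variable names are assumptions. Note that this follows the simplified "highest probability" bookkeeping of the worked example, not the full forward-backward computation.

```python
from itertools import product

states = ["S1", "S2"]
init = {"S1": 0.2, "S2": 0.8}
trans = {("S1", "S1"): 0.5, ("S1", "S2"): 0.5,
         ("S2", "S1"): 0.3, ("S2", "S2"): 0.7}
emit = {("S1", "N"): 0.3, ("S1", "E"): 0.7,
        ("S2", "N"): 0.8, ("S2", "E"): 0.2}
pairs = ["NN", "NN", "NN", "NN", "NE", "EE", "EN", "NN", "NN"]

def joint(s1, s2, pair):
    """P(s1) * P(o1 | s1) * P(s1 -> s2) * P(o2 | s2) for a two-day pair."""
    return init[s1] * emit[(s1, pair[0])] * trans[(s1, s2)] * emit[(s2, pair[1])]

# Denominator: for each pair, the highest probability over all state pairs.
best_total = sum(max(joint(a, b, p) for a, b in product(states, repeat=2))
                 for p in pairs)

# Pseudo probability of each transition, then row-wise normalization.
pseudo = {(a, b): sum(joint(a, b, p) for p in pairs) / best_total
          for a, b in product(states, repeat=2)}
new_trans = {(a, b): pseudo[(a, b)] / sum(pseudo[(a, c)] for c in states)
             for a, b in product(states, repeat=2)}
```

Here `pseudo[("S1", "S2")]` reproduces the 0.0908 figure above, and `new_trans` holds the normalized transition estimates.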
Next, we want to estimate a new emission matrix:
Observed sequence | Highest probability of observing that sequence if E is assumed to come from S1 | States | Highest probability of observing that sequence | States
---|---|---|---|---
NE | 0.1344 | S2, S1 | 0.1344 | S2, S1
EE | 0.0490 | S1, S1 | 0.0490 | S1, S1
EN | 0.0560 | S1, S2 | 0.0896 | S2, S2
Total | 0.2394 | | 0.2730 |
The new estimate for the probability that E is emitted from S1 is now $\frac{0.2394}{0.2730} = 0.8769$.
This allows us to calculate the emission matrix as described above in the algorithm, by adding up the probabilities for the respective observed sequences. We then repeat the calculation for the case where N came from S1 and for the cases where N and E came from S2, and normalize.
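The emission estimate for E coming from S1 can be checked in the same style. As before, the parameter values are read off the factors in the example tables, the helper names are assumptions, and the script mirrors the example's simplified "highest probability" bookkeeping.

```python
from itertools import product

states = ["S1", "S2"]
init = {"S1": 0.2, "S2": 0.8}
trans = {("S1", "S1"): 0.5, ("S1", "S2"): 0.5,
         ("S2", "S1"): 0.3, ("S2", "S2"): 0.7}
emit = {("S1", "N"): 0.3, ("S1", "E"): 0.7,
        ("S2", "N"): 0.8, ("S2", "E"): 0.2}

def joint(s1, s2, pair):
    """P(s1) * P(o1 | s1) * P(s1 -> s2) * P(o2 | s2) for a two-day pair."""
    return init[s1] * emit[(s1, pair[0])] * trans[(s1, s2)] * emit[(s2, pair[1])]

def best(pair, e_state=None):
    """Highest joint probability; optionally force every E to come from e_state."""
    candidates = [
        joint(a, b, pair)
        for a, b in product(states, repeat=2)
        if e_state is None
        or all(s == e_state for s, o in zip((a, b), pair) if o == "E")
    ]
    return max(candidates)

e_pairs = ["NE", "EE", "EN"]          # observed pairs that contain an E
numerator = sum(best(p, "S1") for p in e_pairs)
denominator = sum(best(p) for p in e_pairs)
pseudo_e_from_s1 = numerator / denominator
```

The numerator and denominator correspond to the two column totals in the emission table, and their ratio is the 0.8769 estimate before normalization.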
To estimate the initial probabilities, we assume all sequences start with the hidden state S1 and calculate the highest probability, and then repeat for S2. Again, we then normalize to give an updated initial vector.
Finally we repeat these steps until the resulting probabilities converge satisfactorily.
Hidden Markov models were first applied to speech recognition by James K. Baker in 1975.[10] Continuous speech recognition proceeds through the following steps, modeled by an HMM. Feature analysis is first undertaken on temporal and/or spectral features of the speech signal, producing an observation vector. The features are then compared to all sequences of the speech recognition units, which could be phonemes, syllables, or whole-word units. A lexicon decoding system constrains the paths investigated, so that only words in the system's lexicon (word dictionary) are considered. Similarly, the paths are further constrained by the rules of grammar and syntax. Finally, semantic analysis is applied and the system outputs the recognized utterance. A limitation of many HMM applications to speech recognition is that the current state depends only on the state at the previous time step, which is unrealistic for speech, where dependencies often span several time steps.[11] The Baum–Welch algorithm also has extensive applications in solving HMMs used in the field of speech synthesis.[12]
The Baum–Welch algorithm is often used to estimate the parameters of HMMs when deciphering hidden or noisy information, and is consequently often used in cryptanalysis. In data security, an observer would like to extract information from a data stream without knowing all the parameters of the transmission. This can involve reverse engineering a channel encoder.[13] HMMs, and as a consequence the Baum–Welch algorithm, have also been used to identify spoken phrases in encrypted VoIP calls.[14] In addition, HMM cryptanalysis is an important tool for automated investigations of cache-timing data. It allows for the automatic discovery of critical algorithm state, for example key values.[15]
The GLIMMER (Gene Locator and Interpolated Markov ModelER) software was an early gene-finding program used for the identification of coding regions in prokaryotic DNA.[16] [17] GLIMMER uses Interpolated Markov Models (IMMs) to identify the coding regions and distinguish them from the noncoding DNA. The latest release (GLIMMER3) has been shown to have increased specificity and accuracy compared with its predecessors with regard to predicting translation initiation sites, demonstrating an average 99% accuracy in locating 3' locations compared to confirmed genes in prokaryotes.[18]
The GENSCAN webserver is a gene locator capable of analyzing eukaryotic sequences up to one million base pairs (1 Mbp) long.[19] GENSCAN uses a general inhomogeneous, three-periodic, fifth-order Markov model of DNA coding regions. Additionally, this model accounts for differences in gene density and structure (such as intron lengths) that occur in different isochores. While most integrated gene-finding software (at the time of GENSCAN's release) assumed input sequences contained exactly one gene, GENSCAN solves a general case where partial, complete, or multiple genes (or even no gene at all) are present.[20] GENSCAN was shown to exactly predict exon location with 90% accuracy with 80% specificity compared to an annotated database.[21]
Copy-number variations (CNVs) are an abundant form of genome structure variation in humans. A discrete-valued bivariate HMM (dbHMM) was used to assign chromosomal regions to seven distinct states: unaffected regions, deletions, duplications and four transition states. Solving this model using Baum–Welch demonstrated the ability to predict the location of CNV breakpoints to approximately 300 bp from micro-array experiments.[22] This resolution enables more precise correlations between different CNVs and across populations than previously possible, allowing the study of CNV population frequencies. It also demonstrated a direct inheritance pattern for a particular CNV.