Hidden Markov models (HMM) - 3 qusetions on HMM (May/17/2007 )
1- This is a school exemple and i need some help. The qusteion is how the DNA sequence TAGT could be matched to the HMM in figure above. What is the resulting probability of the match!
my hint is commonly called "threading" the sequence through the model.
2- my second qusteion is how do I determine whether or not a sequence matches an HMM? Are there any differences compared to the profile case, and if so, how can this be used to improve the reliability of the results?
3- Third qusteion is HMMs are made from multiple sequence alignment (MSAs). How this is done!
I really need some help here and I try to understand the HMM but it is so difficult. please
hello,
you can find a good definition and examples of the hidden markov model at http://en.wikipedia.org/wiki/Hidden_Markov_model
good luck!
Hi milestone,
I remember reading about this awhile ago when I took a bioinformatics class. It's still fresh in my mind and I haven't forgotten it. But please correct me if I make mistakes in certain areas, OK?
As you know, a HMM is a stochastic probability tool used in understanding and predicting future states. Such forecasting is made from current states. What I mean by current states are the numbers in the 'Match State(s)'. Therefore, the probability of a given nucleotide is displayed. Now from this, there is a certain probability that the next character state will be an insertion or a deletion. For example, let's look @ the first Match state: We can see the sum of all nucleotides sum up to 1; that should make perfect sense. After we've seen this, we then look at specific character states, for example lets take G. There is a 0.2 probability you will get a G as your first character state. After this char state, there is a prob of 0.7 you get another Match OR a 0.1 prob you get an insertion OR 0.2 prob. of having a deletion.
So you basically repeat this probabilistic iteration for each character state.
To tell you the truth, I'm a little rusty of the next 2 questions and I don't want to tell you something which might be wrong. But this is the general idea though. Hope it was helpful.
This will also shed light as to how a HMM can be made from MSA's:
Let's take this hypothetical example of a MSA:
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
If you are having troubles with HMM's, this is a good example to aide in such acquisition. Try to do this as many times as possible because HMMs look pretty much like this.
(If certain parts are unclear, please let me know).
In this MSA there are gaps (columns 4-6). We can also see that this MSA is made up of 9 columns and 5 rows.
So lets look at the first column:
Column 1: 4X A's, 1X T's. Therefore the prob. of A = 0.8, T=0.2. We now move onto Column 2:
Column 2: 4X C, 1X G. Therefore the prob. of C = 0.8, G=0.2. Move onto Column 3:
Column 3: 4 A's, 1X G. Therefore the probability of A=0.8, G=0.2.
But notice at Column 4 there are gaps. These dashes are deletions. We can see that rows 4, 5 and 6 all have such deletions. Here, you count up the nucleotides in such column ranges: We have 1X A, 2X C's, 1X G and 1X T.
Therefore the prob. of an insertion is A: 0.2, C: 0.4, T: 0.2, G: 0.2.
Looking at Column 4, we can see that only 3 rows have insertions(row 2, 3 and 5), so there probability of any given insertion is 0.6 (when you are coming from Column 3->Column 4 only).
So therefore, going from Column 3 -> Column 4 (insertion) has a prob of 0.6 and the prob. of any specific nucleotide is written above (A: 0.2, C: 0.4, C: 0.2, G: 0.2).
(Because the probability to reach an insertion is 0.6, there is a 0.4 remainer for another insertion.)
So from the insertion (Column 4), we then move onto Column 7. Since we are coming from an insertion (prob=0.6), we use the same prob. to reach Column 7. Why the jump straight to 7? Because 4, 5 and 6 were all insertions and we done the calculations for insertions in the paragraph above. Remember? We grouped the 2 C's, 1 A, 1 T and 1G in total.
So now that we're at Column 7, there is 5X A, therefore a 1.0 prob. of getting all A's. You then move onto Column 8: 4X T's and 1X G, therefore that is: 0.8 for T and 0.2 for G.
Column 9: 4X C's and 1X G's, therefore prob. if 0.8 for C and 0.2 for G.
So if you want to calculate the probability of getting the sequence:
ACACATC, you look-up the probability of each given nucleotide. In this case, A is column 1, C is column 2, A is column 3, and so on.
You then multiply all these probabilities together. Remember that going from one state to another state (if there is no insertion) has a probabiliy of 1.0.
So to calc ACACATC, you do: (0.8)*1.0*(0.8)*1.0*(0.8)*0.6*(0.4)*0.6*(1.0)*1.0*(0.8)*1.0*(0.8) = 0.047
Note: I have written this calculation such that the number inside the parentheses is the probability to the respective nucleotide. Any number not in parantheses is the probability of going to another state.
In your case, to calculate TAGT, we wouldn't use this HMM. Using your existing HMM (in your image above), you'd make an MSA based on the probabilites and draw in the insertions and deletions. From there, I would use your sequence: TAGT and find the probabilities of each given nucleotide state.
Another way to think about calculating the prob of TAGT is to think in reverse with regards to this example because in your case, you already have the sequence (TAGT), so you have to apply the sequence to the HMM you already have.
In my hypothetical case, we are pretty much the opposite because we do not start with a sequence.
Note: The reason why T=State1, A=State2, G=State3, T=State4 will not work is because there is no state 4 which corresponds to the last T. There are a total of 3 maximum states, so that is why it isn't as easy as just doing the (prob of T)*(prob of A)..... so on so on.
And this HMM is not that linear because going from state A to state B either has an insertion and/or deletion. Therefore it must be more than just a simply linear multiplication and more to do with insertions/deletions.
Sorry if I haven't given more info. The above info on HMMs will serve as the backbone behind solving such problems though.
Try to think in reverse to my example and you'll get it.
Best,