|
Pfam 22.0 ::
Help Page : Scores
What Pfam HMM scores mean |
|
Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER2 package. In HMMER2, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal or better than this by chance alone. A good E-value is much less than 1. Around 1 is what we expect just by chance. In principle, all you need to decide on the significance of a match is the E-value.
However, there are a few complications.
The most serious complication is that there are no analytical results available for accurately determining E-values for gapped alignments, especially profile HMM alignments. HMMER uses empirical methods to estimate E-values. These methods are generally rather accurate. However, when "in doubt", HMMER tends to err on the conservative side.
We use a second, and even more empirical, system in maintaining Pfam models. This system is implemented in the Pfam database rather than in the HMMER software. For each Pfam family, we record a "gathering cutoff", "trusted cutoff", and a "noise cutoff": GA1, TC1, and NC1. GA1 is the threshold that we actually set to collect the sequences in the Pfam Full alignment. TC1 is the lowest score for sequences in the Full alignment). NC1 is the highest score for a sequence we did not include in the Full alignment. Thus, TC1 > GA1 > NC1.
Therefore, we can consider a hit very significant if it scores better than the curated cutoffs and also has a significant E-value. Sometimes sequences score better than the cutoffs though they don't have significant E-values; these are marginal hits that we've chosen to include in the family.
Then, there's one additional wrinkle in the scoring scheme. HMMER2 calculates two kinds of scores. The "sequence classification score" is the total score of a sequence aligned to a model; if there are more than one domain, the sequence score is the sum of all (finding multiple domains increases our confidence that the sequence belongs to that protein family, even if each domain individually is a weak match.) The "domain score" is a score for a single domain. (These two scores are identical for single domain proteins.)
When you set the E-value cutoff for a Pfam server search, you're setting the sequence classification cutoff. If your query sequence has a sequence score with an E-value less than that cutoff, all individual domains in your query will be reported, even if some of them have dubious scores individually. This is because in multidomain proteins, there are often eroded, weakly matching domains. Since the Pfam server reports the domain scores and E-values, you may therefore see individual domains with E-values worse than your cutoff.
A second set of trusted and noise cutoffs (GA2, TC2, and NC2) are kept for the domain scores. For multidomain proteins, these are typically lower than GA1, TC1, and NC1. Once a whole sequence is detected with significance, (its score is greater than the chosen cutoff), the individual domain scores are checked against the second (domain) cutoff. Intuitively, what we're saying is that once we detect one domain with a significant score, we relax our criteria for locating additional domains in that same sequence.
HMMER is most sensitive at identifying complete domains: its preferred algorithm is a "profile alignment" algorithm that is neither fully global nor fully local alignment, but instead looks for a "glocal" alignment that is global with respect to the model, but (multiply) local with respect to the sequence -- e.g. to look for one or more complete domains in a query sequence.
Fragments of domains do occur, especially in truncated sequences,
translated ESTs, or because of insertions of new domains in existing
ones. HMMER can also build models that do fully local alignment and
tolerate fragments (look for the hmmbuild -f notation
on the BM line of the Pfam annotation). A standard Pfam
model does not tolerate fragments.
Pfam therefore includes two different types of models, the glocal models ("ls" mode, in the Pfam_ls HMM database) and Smith/Waterman models ("fs" mode, in the Pfam_fs HMM database). In glocal mode, only full-length complete domains are found. In Smith/Waterman mode, fragmentary domains can also be found, because fully local alignments are allowed. "ls" mode is much more sensitive than "fs" mode, but only if a complete domain is actually present; if a partially deleted fragment is present, "fs" mode will be needed. The two modes have different curated cutoffs.
By default, both sets of models are searched, and the results are integrated -- if an "ls" mode hit completely overlaps an "fs" mode hit from the same model, the "fs" hit is discarded.
So far, we've said nothing about what the raw HMMER scores mean, that the E-values and GA, TC, NC cutoffs are based on.
In the formalism used by HMMs (basically a Bayesian statistical formalism), the HMM of the domain is considered as a model which "produces" sequences with a certain probability for each different sequence. For each possible sequence, these probabilities are very low, but they are compared to the probability that this sequence was produced by a "random" (also called NULL) model. The log of the ratio of these two likelihoods are reported (a log odds ratio). The base of the log is base2, giving a statistic that is commonly called a "bit score". That's what the HMMER raw scores are. The higher this score, the better.
We're now treading close to a discussion of the theoretical basis of probabilistic models such as profile HMMs. For a more in depth discussion, see publications from the Eddy lab, which include several reviews of HMM methods, including the book Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids by Durbin, Eddy, Krogh, and Mitchison.