Pfam 22.0 :: Help page : FAQ
Frequently asked questions and general overview
General background
Pfam is a database of multiple alignments of protein domains or conserved protein regions. The alignments represent some evolutionarily conserved structure that has implications for the protein's function. Profile hidden Markov models (profile HMMs) built from the Pfam alignments can be very useful for automatically recognizing that a new protein belongs to an existing protein family, even if the homology is weak. Unlike standard pairwise alignment methods (e.g. BLAST, FASTA), Pfam HMMs deal sensibly with multidomain proteins.
Pfam has two separate parts. Pfam-A consists of accurate, human-crafted multiple alignments, whereas Pfam-B is an automatic clustering, derived from the PRODOM database, of the rest of a nonredundant protein database. This FAQ is mostly about the Pfam-A section.
Questions about Pfam alignments
A profile hidden Markov model defines a multiple alignment by aligning each individual sequence to a single model. The model contains a number of "match" states that represent the main, consensus positions of the domain. A "consensus sequence" of the domain would (in general) align entirely to match states. Deletions relative to the consensus pass through a "delete state" instead of a match state; insertions relative to the consensus pass through an "insert state" between two match states.
The path each sequence took through the model is encoded in the capitalization and gap characters used in the alignment.
Capital letters represent residues that were aligned to match states (you can think of them as residues that are aligned to the consensus). Small letters represent residues assigned to insert states; these residues, importantly, are arbitrarily aligned (HMMs treat inserts as unaligned sequence).
A '-' character is a gap that represents an assignment to a delete state: a deletion relative to consensus. A '.' character is padding in an insert column; it means that one or more other sequences have insertions in this column.
Therefore, a column of the multiple alignment is either a "match column" or an "insert column". A match column contains capital letters (residues aligned to the match state for this column) and '-' gaps (uses of the delete state). An insert column contains small letters (residues emitted as an unaligned insertion from an insert state) and '.' gaps (which are padding characters, strictly for visualizing the alignment, and having no meaning as far as the HMM is concerned).
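The conventions above can be sketched in a few lines of Python. This is only an illustration of how to read the capitalization and gap characters, not part of Pfam or HMMER; the function names (classify_columns, path_states) and the toy alignment are made up for the example.

```python
def classify_columns(rows):
    """Label each alignment column 'match' or 'insert' from its characters.

    Match columns hold capital letters and '-'; insert columns hold
    lowercase letters and '.'. Hypothetical helper, for illustration only.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")

    labels = []
    for index, column in enumerate(zip(*rows)):
        has_match = any(c.isupper() or c == "-" for c in column)
        has_insert = any(c.islower() or c == "." for c in column)
        if has_match == has_insert:
            # Either a mix of both conventions or no recognised character.
            raise ValueError(f"column {index} is not a clean match or insert column")
        labels.append("match" if has_match else "insert")
    return labels


def path_states(row):
    """Recover the path one sequence took through the model.

    M = match state, D = delete state, I = insert state.
    '.' is padding only and corresponds to no state at all.
    """
    states = []
    for c in row:
        if c == ".":
            continue
        if c == "-":
            states.append("D")
        elif c.isupper():
            states.append("M")
        elif c.islower():
            states.append("I")
        else:
            raise ValueError(f"unexpected alignment character: {c!r}")
    return "".join(states)


if __name__ == "__main__":
    # Toy alignment, invented for this example.
    toy = ["ACDe.FG", "AC-..FG", "ACDqwFG"]
    print(classify_columns(toy))
    for row in toy:
        print(row, path_states(row))
```

In the toy alignment, columns four and five are insert columns, and the second sequence uses a delete state at the third consensus position. Note that the '.' characters drop out of the path entirely, which is what "no meaning as far as the HMM is concerned" amounts to.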
Gripes about Pfam
Oi. If you do feel that an alignment is bad, then please tell us - we can probably fix it, at least in the next release.
However, we find that most inquiries like this arise because people aren't used to looking at multiple alignments of hundreds or thousands of sequences. Remember that a rare insertion in even just one sequence means opening a gap in every other sequence of the alignment: Pfam full alignments look very gappy for this reason, but in fact they're not. A couple of other things to keep in mind:
We attempt to get the SEED alignments correct. The FULL alignments, which are made automatically from the profile HMM, are not hand checked, and so their quality will generally be somewhat lower.
The FULL alignments often look worse than they are because of how residues in the insert regions (the lower case residues) are laid out. Profile HMMs do not attempt to align insertions, so lower case residues are placed arbitrarily. Emphasis is placed on getting the conserved regions correct; in general we argue that the placement of residues in loop regions of multiple alignments is basically aesthetic, because these regions are often structurally variable.
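To see how much of the apparent gappiness comes from insert columns, one can drop them and look only at the consensus columns. The sketch below assumes the alignment is already loaded as a list of equal-length strings using the capitalization conventions described earlier; the helper names (match_columns_only, gap_fraction) and the toy alignment are invented for the illustration.

```python
def match_columns_only(rows):
    """Return the alignment with every insert column removed.

    A column is treated as an insert column if it contains any lowercase
    residue or '.' padding character. Hypothetical helper, for illustration.
    """
    keep = [
        index
        for index, column in enumerate(zip(*rows))
        if not any(c.islower() or c == "." for c in column)
    ]
    return ["".join(row[i] for i in keep) for row in rows]


def gap_fraction(rows):
    """Fraction of all alignment characters that are '-' or '.'."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    gaps = sum(row.count("-") + row.count(".") for row in rows)
    return gaps / total


if __name__ == "__main__":
    # Toy alignment: a single sequence carries a five-residue insertion,
    # so every other sequence is padded with '.' across those columns.
    toy = [
        "ACD.....FG",
        "AC-.....FG",
        "ACDqwertFG",
        "ACD.....FG",
    ]
    print("with insert columns:   ", gap_fraction(toy))
    print("match columns only:    ", gap_fraction(match_columns_only(toy)))
    print(match_columns_only(toy))
```

In this toy case one insertion in one sequence accounts for almost all of the gap characters; once the insert columns are removed, the only gap left is the single real deletion.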
We are very keen to be alerted to new domains. If you can provide us with a multiple alignment then we will try hard to incorporate it into the database. If you know of a domain, but don't have a multiple alignment, we still want to know. Please tell us.
You can download the Pfam data files from FTP servers at Sanger or WashU. You'll also need to obtain the HMMER profile HMM software. The Janelia Farm FTP site also distributes a complete mirror installation of the Janelia Farm Pfam web server, including the CGI scripts used to drive the Wulfpack Linux cluster that provides Janelia's compute power.
Pfam structure
Pfam was originally developed by Erik Sonnhammer and Sean Eddy in Richard Durbin's group. Currently it is maintained by a consortium of researchers that includes the labs of principal investigators:
Pfam is a collaborative venture and we hope to be able to interact with as many people as possible to provide a quality database. Please get in touch with any one of us for information.