![]() |
HMMER
User's Guide |
Unaligned sequence formats
FASTA | Pearson FASTA format; BLAST databases |
SWISSPROT | SWISSPROT protein sequence database |
PIR | PIR protein sequence database |
EMBL | EMBL DNA sequence database |
GenBank | GenBank DNA database flat files |
GCGdata | Wisconsin GCG sequence database format |
GCG | Wisconsin GCG single sequence format |
Multiple sequence alignment formats
MSF | GCG alignment format |
CLUSTAL | CLUSTALW and CLUSTALV format |
SELEX | My annotated format (see below) |
Alignment formats are read as if they were multiple unaligned sequences by programs doing sequential, database-style access.
There is no provision for enforcing that single unaligned sequence formats (i.e. GCG unaligned sequence format) are single sequence only. HMMER will happily try to read more than one sequence if your file contains more than one. However, this may not give the results you expected.
Staden ``experiment file'' format is parsed using the EMBL file parser, but this functionality is relatively unsupported. There is one wrinkle in this. Staden experiment files use '-' characters to indicate 'N' - i.e., that a base is present in a sequence read, but its identity is unknown. Therefore, the software replaces any '-' in an EMBL sequence with an 'N'. Sometimes people use the unaligned formats to distribute aligned sequences by including gap characters. If EMBL files are used i this way for aligned strings, they must use a different character than '-' to indicate gaps.
An example of a simple FASTA file:
> seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA CGACGTAGATGCTAGCTGACTCGATGC > seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG CATCGTCAGTTACTGCATGCTCG
FASTA is probably the simplest of formats for unaligned sequences.
FASTA files are easily created in a text editor. Each sequence is
preceded by a line starting with >
. The first word on this line
is the name of the sequence. The rest of the line is a description of
the sequence (free format). The remaining lines contain the sequence
itself. You can put as many letters on a sequence line as you want.
Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence. Any other non-amino or non-nucleic acid symbols in the sequence should produce an appropriately strident string of warnings on yur terminal screen when you try to use the file.
An example of a simple SELEX alignment file:
# Example selex file seq1 ACGACGACGACG. seq2 ..GGGAAAGG.GA seq3 UUU..AAAUUU.A seq1 ..ACG seq2 AAGGG seq3 AA...UUU
SELEX is an interleaved multiple alignment format that arose as an intuitive format that was easy to write and manipulate manually with a text editor like emacs. It is usually easy to convert other alignment formats into SELEX format, even with a couple of lines of Perl, but it can be harder to go the other way, since SELEX is more free-format than other alignment formats. For instance, GCG's MSF format and the output of the CLUSTALV multiple alignment program are similar interleaved formats that can be converted to SELEX just by stripping a small number of non-sequence lines out. Because SELEX evolved to accomodate different user input styles, it is very tolerant of various inconsistencies such as different gap symbols, varying line lengths, etc.
Each line contains a name, followed by the aligned sequence. A space,
dash, underscore, or period denotes a gap. If the alignment is too
long to fit on one line, the alignment is split into multiple blocks,
separated by blank lines. The number of sequences, their order, and
their names must be the same in every block (even if a sequence has no
residues in a given block!) Other blank lines are ignored. You can add
comments to the file on lines starting with a #
.
SELEX, by the way, stands for ``Systematic Evolution of Ligands by Exponential Enrichment'' - it refers to the Tuerk and Gold technology for evolving families of small RNAs for particular functions [Tuerk and Gold, 1990]. SELEX files were what we used to keep track of alignments of these small RNA families.
As the format evolved, more features have been added. To maintain compatibility with past alignment files, the new features are added using a reserved comment style. You don't have to worry about adding these extra information lines. They are generally the province of automated SELEX-generating software, such as my koala sequence alignment editor or the COVE and HMMER sequence analysis packages. This extra information includes consensus and individual RNA or protein secondary structure, sequence weights, a reference coordinate system for the columns, and database source information including name, accession number, and coordinates (for subsequences extracted from a longer source sequence) See below for details.
#=
as the first two characters is a
machine comment. #=
comments are reserved for parsed data
about the alignment. Usually these features are maintained by software
such as the koala editor, not by hand. The format of #=
lines is usually quite specific, since they must be parsed by the
file-reading software.
%
or #
as the first
character are user comments. User comments are ignored by all
software. Anything may appear on these lines. Any number of comments
may be included in a SELEX file, and at any point.
-_.
or a space are
recognized as gaps. Gaps will be converted to a '.'. Any other
characters are interpreted as sequence. Sequence is
case-sensitive. There is a common assumption by my software that
upper-case symbols are used for consensus (match) positions and
lower-case symbols are used for inserts. This language of ``match''
versus ``insert'' comes from the hidden Markov model formalism
[Krogh et al., 1994]. To almost all of my software, this isn't important,
and it immediately converts the sequence to all upper-case after it's
read.
I use one-letter codes to indicate secondary structures. Secondary structure strings are aligned to sequence blocks just like additional sequences.
For RNA secondary structure, the symbols >
and <
are
used for base pairs (pairs point at each other). +
indicate
definitely single-stranded positions, and any gap symbol indicates
unassigned bases or single-stranded positions. This description
roughly follows [Konings and Hogeweg, 1989]. For protein secondary structure, I
use E to indicate residues in -sheet, H for those
in
-helix, L for those in loops, and any gap symbol for
unassigned or unstructured residues.
RNA pseudoknots are represented by alphabetic characters, with upper case letters representing the 5' side of the helix and lower case letters representing the 3' side. Note that this restricts the annotation to a maximum of 26 pseudoknots per sequence.
Lines beginning with #=SS
or #=CS
are individual or
consensus secondary structure data, respectively. #=SS
individual secondary structure lines must immediately follow the
sequence they are associated with. There can only be one #=SS
per sequence. All, some, or none of the sequences may have #=SS
annotation.
#=CS
consensus secondary structure predictions precede all the
sequences in each block. There can only be one #=CS
per file.
Alignments are usually numbered by some reference coordinate system, often a canonical molecule. For instance, tRNA positions are numbered by reference to the positions of yeast tRNA-Phe.
A line beginning with #=RF
preceding the sequences in a block
gives a reference coordinate system. Any non-gap symbol in the
#=RF
line indicates that sequence positions in its columns are
numbered. For instance, the #=RF
lines for a tRNA alignment
would have 76 non-gap symbols for the canonical numbered columns; they
might be the aligned tRNA-Phe sequence itself, or they might be just
X's.
Additional per-sequence information can be placed in a header before any blocks appear. These lines, one per sequence and in exactly the same order as the sequences appear in the alignment, are formatted like
#=SQ <name> <wgt> <source> <accession> <start..stop::orig length> <desc>
This information includes a sequence weight (for compensating for biased representation of subfamilies of sequences in the alignment); source information, if the sequence came from a database, consisting of identifier, accession number, and source coordinates; and a description of the sequence.
If a #=SQ
line is present, all the fields must be present on
each line and one #=SQ
line must be present for each sequence.
If no information is available for a field, use '-' for all the fields
except the source coordinates, which would be given as '0'.
The first non-comment, non-blank line of the file may be a #=AU
``author'' line. There is a programmatic interface for
alignment-generating programs to record a short comment like
11 November 1993, by Feng-Doolittle v. 2.1.1
,
and this comment will be
recorded on the #=AU
line by WriteSELEX()
.
Here's an example SELEX file with examples of all the optional fields. The sequences are evolved RNA ligands for the phage R17 coat protein [Schneider et al., 1992].
#=AU COVE 2.2.3 #=SQ lig28 1.0 - - 1..29:29 R17 coat protein ligand 28 #=SQ lig1 1.0 - - 1..17:17 R17 coat protein ligand 1 #=SQ lig2 1.0 - - 1..29:29 R17 coat protein ligand 2 #=RF ***A** loop ***** #=CS .......>>>+>> ++++ <<<<<....... lig28 GGAGUAAGAUAGC AUCA GCAUCUUGUUCC #=SS >>>>>>>>>+>> ++++ <<<<<<<<<<<< lig1 GUUCACC AUCA GGGGAC #=SS >>>>+>> ++++ <<<<<< lig2 AUGGAUGCGCACC AUCA GGGCGUAUCUAU #=SS >>>>>>>>>>+>> ++++ <<<<<<<<<<<<