fastHmm/fastBlast Database Files
$Id: README,v 1.2 2008/07/31 21:26:13 mprice Exp $


Overview

The fastHmm database files are arranged in a specific directory hierarchy
according to specific naming conventions.  Only properly organized and
formatted databases will work with fastHmm.  This document will explain
the directory layout, which files are required, and how those files should
be named.  fastHmm also includes a tool, alnDbFormat.pl, which will generate
the proper database hierarchy for a number of different databases.  At the
time of this writing, this includes gene3d, panther, pfam, pirsf, smart,
superfam, and tigrfam.  You may also download some fastHmm-compatible
datasets from our website; however, due to licensing issues, not all
datasets are available.


Directory Structure

The fastHmm base directory is the directory in which you installed all the
fastHmm binaries and libraries.  Within the base directory, you will have an
empty "db" directory in which you can uncompress pre-formatted datasets
and/or build your own datasets.  Each dataset is stored in a directory
under the base "db" directory.  For example, gene3d data is stored in the
directory "db/gene3d/" off the base fastHmm directory.

Within each dataset directory are a number of different files.
Perhaps the most important of these are the target files.  Each target
within a database must be uniquely named and generally should not
contain spaces or other non-alphanumeric characters.  Each target
must have a seed sequence and either a psiblast profile (.psiblast
extension) or an alignment (.b extension).  For superfam, the seed
sequence has a .fa extension; otherwise it has a .seq extension.  The
psiblast profiles, or checkpoints, allow blastpgp to start up faster, so
if you have all three files, you can use just the psiblast profile
and omit the alignment file to save space.

The seed sequence file (.seq, or .fa for superfam) must contain a single
representative sequence only and must be in FASTA format with the same
defline as in the alignment file.  blastpgp will complain if the seed
sequence is not found in the alignment file and fastHmm will be unable to
produce an alignment.
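
The seed requirements above can be checked before running fastHmm.  The
following Python sketch is not part of fastHmm; the function names are
hypothetical.  It assumes that the identifier is the first
whitespace-delimited word of the FASTA defline and of each alignment row.

```python
def fasta_ids(text):
    """Return the identifier (first word of the defline) of each FASTA record."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def alignment_ids(text):
    """Return the sequence identifiers of a block-formatted (.b) alignment."""
    ids = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has an identifier followed by sequence characters.
        if len(fields) >= 2 and fields[0] not in ids:
            ids.append(fields[0])
    return ids


def seed_problems(seed_text, alignment_text):
    """Return a list of problems; an empty list means the seed looks usable."""
    problems = []
    seeds = fasta_ids(seed_text)
    if len(seeds) != 1:
        problems.append(
            "seed file must contain exactly one sequence, found %d" % len(seeds))
    known = alignment_ids(alignment_text)
    for name in seeds:
        if name not in known:
            problems.append("seed %r not found in alignment" % name)
    return problems
```

Pass the contents of the seed file and of the matching .b file to
seed_problems(); any returned message points at a target that blastpgp is
likely to reject.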

The multiple sequence alignment file (.b) must be in a psiblast-compatible
format.  The format is briefly explained here; a more thorough
explanation can be found in the blastpgp documentation.  First, it is
important that all sequences in the alignment be exactly the same length,
as is expected of the output of a multiple sequence alignment.  If your
database contains multiple sequence alignment files whose component
sequences are not of the same length, you should contact the database
maintainer for assistance.  psiblast alignments are split into blocks,
where each block contains a consecutive portion of each aligned sequence.
Though blastpgp does not specify an optimal wrapping width, we suggest 60
characters.  Each block must be delimited by an empty line.  The sequence
identifiers must be identical from block to block, as must be the number
of characters output for each sequence within a block.  There must be at
least one whitespace character delimiting each sequence id/name from the
actual sequence itself.  Here is a quick example showing a FASTA multiple
sequence alignment reformatted for psiblast.

[ Example in FASTA format ]

>seq1
TPDCVTGKVEYTKYNDDDTFTVKVGDKELATNRANLQSLLLSAQITGMTVTIKTNACHNGGGFSEVIFR
>seq2
AADCAKGKIEFSKYNEDDTFTVKVDGKEYWTSRWNLQPLLQSAQLTGMTVTIKSSTCESGSGFAEVQF-
>seq3
AADCAKGKIEFSKYNEDDTFTVKVDGKEYWTSRWNLQPLLQSAQLTGMTVTIKSSTCESGSGFAEVQF-
>seq4
AADCAKGKIEFSKYNEDDTFTVKVDGKEYWTSRWNLRPLLQSAQLTGMTVTIKSSTCESGSGFAEVQF-


[ Example in psiblast format ]

seq1                        TPDCVTGKVEYTKYNDDDTFTVKVGDKELATNRANLQSLLLSAQITGMTVTIKTNACHNG
seq2                        AADCAKGKIEFSKYNEDDTFTVKVDGKEYWTSRWNLQPLLQSAQLTGMTVTIKSSTCESG
seq3                        AADCAKGKIEFSKYNEDDTFTVKVDGKEYWTSRWNLQPLLQSAQLTGMTVTIKSSTCESG
seq4                        AADCAKGKIEFSKYNEDDTFTVKVDGKEYWTSRWNLRPLLQSAQLTGMTVTIKSSTCESG

seq1                        GGFSEVIFR
seq2                        SGFAEVQF-
seq3                        SGFAEVQF-
seq4                        SGFAEVQF-
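
The reformatting shown above can be sketched in a few lines of Python.
This is an illustration only, not the code used by alnDbFormat.pl; the
function names are hypothetical, and the identifier is assumed to be the
first word of each FASTA defline.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("defline without an identifier")
            records.append([fields[0], ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def fasta_to_blocks(text, width=60):
    """Reformat an aligned FASTA string into blank-line-delimited blocks."""
    records = read_fasta(text)
    if not records:
        raise ValueError("no sequences found")
    length = len(records[0][1])
    for name, seq in records:
        # Every aligned sequence must have exactly the same length.
        if len(seq) != length:
            raise ValueError(
                "%s has length %d, expected %d" % (name, len(seq), length))
    # Pad identifiers so at least one space separates id and sequence.
    pad = max(len(name) for name, _ in records) + 1
    blocks = []
    for start in range(0, length, width):
        rows = [name.ljust(pad) + seq[start:start + width]
                for name, seq in records]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks) + "\n"
```

The sketch pads identifiers with a single extra space rather than the
wider column used in the example above; both satisfy the rule of at least
one whitespace character between the identifier and the sequence.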


The targets within a database do not all have to be represented the same
way.  It is perfectly acceptable to have some targets represented by
psiblast profiles while other targets are represented by alignments and
seed sequences.  If your database contains any targets with alignments and
seed sequences, it is recommended that you generate their corresponding
psiblast profiles by using the -C flag with fastHmm the first time you run
it using that database.  The first run will take longer, but subsequent
runs will be significantly faster, especially for large multiple sequence
alignments.

Within each database directory there must be an "hmm" directory containing
the corresponding hmmer model for each target.  A target is only considered
valid if it has (a psiblast profile or (a seed sequence and multiple
sequence alignment)) and a hmmer model.  The first time fastHmm runs (and
on any subsequent run with the -r option), it will verify the integrity of
the database and build a cache (called ".accList") containing all verified
targets for the database.  You should not use the -r option unless the
underlying data has changed, as it will increase the startup time of
fastHmm.
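
The validity rule above can be written out as a short Python check.  This
is a sketch, not fastHmm's own verification code.  The function name is
hypothetical, and the model file name hmm/<target>.hmm is an assumption:
this document only states that the models live in the "hmm" directory.

```python
import os


def is_valid_target(db_dir, target, hmm_ext=".hmm"):
    """Apply the validity rule to one target in a dataset directory.

    hmm_ext is an assumption; adjust it to match your hmmer model files.
    """
    def has(*parts):
        return os.path.isfile(os.path.join(db_dir, *parts))

    profile = has(target + ".psiblast")
    seed = has(target + ".seq") or has(target + ".fa")  # .fa: superfam
    alignment = has(target + ".b")
    model = has("hmm", target + hmm_ext)
    # (profile or (seed and alignment)) and a hmmer model
    return (profile or (seed and alignment)) and model
```

For example, is_valid_target("db/pfam", name) returns False for a target
that has a seed and an alignment but no model under db/pfam/hmm/.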

Every database can optionally have an associated "hard list" named
<database>.hard.list.  This is a list of all targets for which fastHmm has
difficulty detecting domains.  These lists are pre-generated for all data
downloaded from our website and can be generated by alnDbFormat.pl with
the -h option.  In general, targets from this list will not be filtered
first with blastpgp and instead will be run directly with hmmsearch.  This
is slower, but will find small hits that may otherwise be missed.

The database directories will occasionally contain other files; these
are often specific to individual databases and will be discussed in the next
section.


Database-Specific Files

The gene3d database also contains a map from gene3d accessions to
corresponding gene3d superfamily names (gene3d.tab).  This file contains one
entry per line; each entry must be a gene3d accession and a superfamily name
delimited by one or more whitespace characters.  This file is required for
post-processing raw fastHmm hits into domains.
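
As an illustration of the gene3d.tab layout, the following Python sketch
reads such a file into a dictionary.  The function name is hypothetical.
Two points are assumptions, not documented behavior: blank lines are
skipped, and everything after the first run of whitespace is treated as
the superfamily name.

```python
def parse_tab(text):
    """Map each accession to its superfamily name, one entry per line."""
    mapping = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        # Split on the first run of whitespace only.
        fields = line.split(None, 1)
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected an accession and a name" % number)
        mapping[fields[0]] = fields[1].strip()
    return mapping
```

A line that carries an accession but no name raises ValueError, which
makes a malformed map visible before post-processing is attempted.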

The panther database contains a list of all panther families and subfamilies
and their corresponding names (panther.names).  Though the names are not
used by fastHmm, this file is required in order for proper expansion of
panther domains to include subfamilies.  This file is bundled with the
panther data.

The pfam database contains a list of all targets and the method by which
the alignment was selected (pfam.am).  This information is required for pfam
post-processing.

The pirsf database contains a list of all targets and the parent-child
relationships between targets (pirsf.dat).  This information is required for
pirsf post-processing.

The superfam database contains a list of all targets and the name of the
superfamily to which the target belongs (superfam.tab).  This information
is required for superfam post-processing in mapping superfam accessions to
superfam family names.


Custom Database

fastHmm supports databases other than gene3d, panther, pfam, pirsf, smart,
superfam, and tigrfam; however, custom databases must be built according to
the specifications outlined above.  Automatic post-processing is not
supported for custom databases.  fastHmm will produce both an .hmmhits (raw)
and a .domains file for custom databases, but their contents will be
identical.  If any post-processing is required, you must run it manually.


Using alnDbFormat.pl

This script is provided to build fastHmm-compatible databases given source
data for gene3d, panther, pfam, pirsf, smart, superfam, and/or tigrfam.
Some sources also require InterPro data, as explained below.  The format of
the data sources and/or the names or locations of their files may change,
in which case alnDbFormat.pl may no longer work.  If you encounter any
problems using this tool, please let us know which source database isn't
working and include the version of the source database.  We will release a
patch so that the tool works with newer versions of the problematic data
source.

The usage for alnDbFormat.pl is as follows:

Usage:
  alnDbFormat.pl <options>

Parameters:
  -g <tar.gz>   Specify gene3d alignments tar.gz file to build

  -n <tar.gz>   Specify panther alignments+hmm tar.gz file to build

  -r <tar.gz>   Specify pirsf alignments tar.gz file to build

  -p <pfamBase> Specify Pfam mirror base directory; must contain
                   Pfam-A.seed[.gz], Pfam-C[.gz], Pfam_fs[.gz], Pfam_ls[.gz]

  -s <tar.gz>   Specify smart alignments+hmm tar.gz file to build

  -ua <tar.gz>  Specify superfam alignment tar.gz file to build
  -uh <tar.gz>  Specify superfam hmm tar.gz file to build

  -ta <tar.gz>  Specify tigrfam alignment tar.gz file to build
  -th <tar.gz>  Specify tigrfam hmm tar.gz file to build

Required for -g, -r, -ua/-uh:
  -i <tar.gz>   Specify InterProScan data file for hmms

Optional Parameters:
  -c <thresh>   cdhit reduction for alignments with over <thresh> seqs
                   (except superfam)
  -ck <inFa>    Build psiblast checkpoint files using inFa fasta seqs
                   (except for superfam)
  -ckj <procs>  Build psiblast checkpoint files using <procs> processes
                   (Default: 1)
  -h            Build "hard" HMM lists (except gene3d, superfam)
  -o <outDir>   Directory in which to create reformatted database directories
                   (Default: Use FASTHMM_DIR/db or ./ if former doesn't exist)
  -t <tmpDir>   Temporary directory (Default: use /tmp)
  -q            Quiet execution; disable status updates to stdout


When building gene3d, pirsf, and/or superfam databases, you must also
have a copy of the latest InterProScan data.  Any parameter that takes a
.tar.gz file may instead be given a directory containing the unpacked
data.  This is useful because unpacking can take quite a while if done
repeatedly.  Any single input file may also be specified with the
extension .gz if the file is compressed; the tool will automatically
decompress the data before reading the file.
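
If you drive alnDbFormat.pl from another script, the InterProScan
requirement can be checked before the tool is started.  The Python sketch
below only assembles an argument list and does not run anything.  The
function name is hypothetical; only the option letters are taken from the
usage text above.

```python
# Options that the usage text lists as also requiring -i.
NEEDS_INTERPRO = {"-g", "-r", "-ua", "-uh"}


def build_command(options, script="alnDbFormat.pl"):
    """Turn a dict of option -> value into an argument list.

    A value of None marks an option that takes no argument (-h, -q).
    """
    if NEEDS_INTERPRO & set(options) and "-i" not in options:
        raise ValueError("-g, -r, -ua and -uh also require -i")
    args = [script]
    for flag, value in options.items():
        args.append(flag)
        if value is not None:
            args.append(str(value))
    return args
```

The resulting list can be passed to subprocess.run().  Each value may be
either a .tar.gz archive or an unpacked directory, as described above.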

It is recommended that you also use the -ck and -ckj options to build
psiblast profiles as well as the -h option to build the hard HMM lists.  If
building psiblast profiles, you must also specify an input FASTA file (the
-ck argument) to use while building the profiles.  Though we believe the
choice of input FASTA makes no difference in building the psiblast profiles,
we cannot say so with 100% certainty as of this writing.  All of the
psiblast profiles downloadable from our website were generated using the
E. coli K12 proteome.

The output directory (-o) must be the "db" directory located in your base
fastHmm directory.  This is the default when FASTHMM_DIR/db exists.


Downloads

Please visit our website to download pre-formatted datasets.  They are
accessible at http://microbesonline.org/fasthmm.

Once you have downloaded the desired datasets, you must unpack them into
the appropriate location using the following command:

  tar zxvf <dataset_archive.tar.gz> -C <fastHmm base dir>/db/

If your version of tar does not support decompressing gzip files, use this
command instead:

  gzip -d -c <dataset_archive.tar.gz> | tar xvf - -C <fastHmm base dir>/db/


Contact

For additional assistance, please contact the fastHmm/fastBlast team at
http://microbesonline.org/fasthmm or by email to fasthmm@microbesonline.org.
