This tarball (in release/FastBLAST-NR-May15-2008.tar.gz) includes the result of
running the first stage of FastBLAST on the non-redundant protein sequences in
Genbank ("NR") as of May 15, 2008.

Warning: the tarball requires 27 GB of disk space, and unpacking it requires
another 79 GB. Formatting the BLAST database so that you can use FastBLAST on
it will require a few more GB. So you'll need about 110 GB of free disk space
to install.

Once you have downloaded and expanded the tarball, you will need to run
formatdb before you can use topHomologs.pl to search for the top homologs of a
given gene:

    $FASTHMM_DIR/bin/formatdb -o T -p T -i nr.faa

The files in the tarball are:

nr.faa -- A fasta file of the sequences. Each sequence is identified by its gi
  number only, written as a bare number (without the "gi|" prefix). The
  nr.faa.map file has the mapping from the identifier in the original NR
  database from Genbank to the identifier in nr.faa.

fb.all.align -- The alignments of each family.

fb.all.align.seek.db -- A BerkeleyDB index of the seek positions of each
  family. The FastBLAST.pm Perl module in $FASTHMM_DIR/lib has routines to use
  this index.

fb.all.domains.bygene -- The families for each gene. Most of the families are
  from HMMs and have the same name as the HMM. COG families have names like
  "gnl|CDD|30365". Ad-hoc families from FastBLAST have names like
  fb.2345028.1.33.

fb.all.domains.bygene.seek.db -- A BerkeleyDB index of the seek positions of
  each gene. The FastBLAST.pm Perl module in $FASTHMM_DIR/lib has routines to
  use this index.

fb.all.nseq -- The number of sequences. topHomologs.pl uses this to determine
  how many homologs to return.
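
The install steps described above (check free space, unpack, run formatdb)
could be scripted roughly as follows. This is only a sketch, not part of the
FastBLAST distribution. The function names free_gb, have_space and install_nr
are hypothetical. It assumes a tar that accepts -z and -C, and that nr.faa
ends up directly in the destination directory, which may not match the actual
layout of the tarball.

```shell
#!/bin/sh
# Sketch only: helper functions for installing the NR FastBLAST tarball.
# All function names here are illustrative, not part of FastBLAST.

# free_gb DIR: print the free space on DIR's filesystem in whole GB.
# Uses the POSIX output format of df (-P) with 1024-byte units (-k).
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# have_space DIR GB: succeed if DIR's filesystem has at least GB free.
have_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# install_nr TARBALL DEST [NEED_GB]: unpack TARBALL into DEST and run formatdb.
# NEED_GB defaults to 110, the total quoted above; that figure includes the
# 27 GB tarball itself, so it is conservative if the download is already done.
install_nr() {
    tarball=$1
    dest=$2
    need_gb=${3:-110}

    if ! have_space "$dest" "$need_gb"; then
        echo "need about $need_gb GB free in $dest" >&2
        return 1
    fi

    tar -xzf "$tarball" -C "$dest" || return 1

    # Assumption: nr.faa is at the top level of $dest after unpacking.
    ( cd "$dest" && "$FASTHMM_DIR/bin/formatdb" -o T -p T -i nr.faa )
}
```

After sourcing these definitions with FASTHMM_DIR set, a call such as
`install_nr FastBLAST-NR-May15-2008.tar.gz /path/to/data` would run all three
steps. The directory /path/to/data is a placeholder.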