MicrobesOnline for Programmers

MicrobesOnline: Tools and Downloads for Programmers

Access to data
Code
Links
- Linking to the MicrobesOnline genome browser

This page gives technical advice on how to programmatically obtain data from the MicrobesOnline data. Please see the tutorial for background on MicrobesOnline and on the data it contains.

How to connect to pub.microbesonline.org

Due to security concerns, as of April 5, 2018, we are restricting access to our public SQL server. Please contact us at help@microbesonline.org if you wish to have access to our server.

We have set up a public read-only SQL server at pub.microbesonline.org. (This SQL server may move to another host in the future.) This server will include only data from public genomes. Data about genes and genomes will be in the "genomics" database. Functional genomics data, such as microarray data, will ultimately be in the "microarray" database. (The microarray database will not be available initially.) For both databases, use the username "guest" and the password "guest".

To connect, use the mysql command-line tool, e.g.

mysql -h pub.microbesonline.org -u guest -pguest genomics

mysql -h pub.microbesonline.org -u guest -pguest genomics -B -e 'my query' > output_file

Please limit your active queries to two, and active connections to three. If a query of yours is killed, it may be because it was running for too long, using too much memory, or using too many connections. If this happens to you, please email us at help@microbesonline.org and we will help you debug any problematic queries.

How to download auxiliary data: BLAST and FastBLAST databases

Nucleotide and protein BLAST databases and a FastBLAST database of the predicted proteins will be available for download at a later date.

Example SQL queries

Get all VIMSS locus Ids and genbank ids for protein-coding genes in E. coli K12 (NCBI taxonomyId 511145):

SELECT Locus.locusId, Synonym.name AS genbankId
FROM Scaffold JOIN Locus USING (scaffoldId) 
JOIN Synonym USING (locusId,version)
WHERE Locus.priority=1
AND Locus.type=1
AND Scaffold.isActive=1
AND Scaffold.taxonomyId=511145
AND Synonym.type=2;

Get all taxonomy Ids for sequenced delta-Proteobacteria (NCBI taxonomyId 28221), and their names:

SELECT DISTINCT Taxonomy.taxonomyId, Taxonomy.shortName
FROM Scaffold JOIN Taxonomy USING (taxonomyId)
JOIN TaxParentChild ON TaxParentChild.childId=Taxonomy.taxonomyId
WHERE Scaffold.isActive=1 AND Scaffold.isGenomic=1 AND TaxParentChild.parentId=28221;

Schema documentation

Groups of tables:

Taxa and Organisms: Taxonomy TaxParentChild TaxNode TaxName
Scaffolds and Genes: Scaffold Locus LocusType Position ObjectDescr
Sequences: ScaffoldSeq LocusSeq AASeq
Gene identifiers and annotations: Synonym SynonymType Description Annotation AnnotationDetail
Gene trees and species trees: Locus2Tree Taxonomy2Tree Tree
Gene families:
- COG: COG COGCount COGFun COGInfo COGrpsblast
- InterPro families: Locus2Domain DomainInfo FasthmmFamily2HMM FasthmmRawHits PfamClan TIGRInfo TIGRroles
- InterPro: InterPro IPR2Go IPRInfo
- Ad-hoc FastBLAST families: FastBLASTDomains
- Orthology: Tree-based OGMember OrthologGroup or group-based MOG MOGComponent MOGMember or bidirectional best hits Ortholog or see COG
Enzyme predictions: Locus2Ec ECInfo
Metabolic maps from KEGG: KEGG2Taxonomy KEGGCompound KEGGConf KEGGInfo KEGGMap KEGGblast
Links to characterized genes:
- Swissprot: SwissProt SwissProt2Locus SwissProt2Pubmed SwissProtBlast
- Experimentally characterized regulation from RegTransBase: Locus2RTB Locus2RTBArticles
- Curated regulatory predictions from RegPrecise: Locus2RegPrecise
- PDB: Locus2Pdb PdbEntries PdbReps PdbSeq
Operon predictions: Locus2Operon Operon
Relationships between genes from comparative genomics: RegulonCluster RegulonLinks MOGNeighborScores
Summary statistics for genomes: COGCount GoCount LocusCount
Gene Ontology (GO): GoCount Locus2Go term term2term term_definition term_synonym graph_path IPR2Go
Access control: ACL GroupUsers Groups Users
User-created data: Jobs Carts

Alphabetic list of tables:

AASeq
ACL
Annotation
AnnotationDetail
COG
COGCount
COGFun
COGInfo
COGrpsblast
Carts
CrisprFamily
Description
DomainInfo
ECInfo
FastBLASTDomains
FasthmmFamily2HMM
FasthmmRawHits
GoCount
GroupUsers
Groups
HitInfo
IPR2Go
IPRInfo
InterPro
Jobs
KEGG2Taxonomy
KEGGCompound
KEGGConf
KEGGInfo
KEGGMap

KEGGblast
Locus
Locus2Domain
Locus2Ec
Locus2Go
Locus2Ipr
Locus2Operon
Locus2Pdb
Locus2RegPrecise
Locus2RTB
Locus2RTBArticles
Locus2SwissProt
Locus2TigrFunCat
Locus2Tree
LocusCount
LocusSeq
LocusType
MOG
MOGComponent
MOGMember
MOGNeighborScores
ObjectDescr
OGMember
Operon
Ortholog
OrthologGroup
PdbEntries
PdbReps
PdbSeq
PfamClan
Position
RegulonCluster

RegulonLinks
Scaffold
ScaffoldIsActiveFlag
ScaffoldPosChunks
ScaffoldSeq
SwissProt
SwissProt2Locus
SwissProt2Pubmed
SwissProtBlast
Synonym
SynonymType
TIGRInfo
TIGRroles
TaxName
TaxNode
TaxParentChild
Taxonomy
Taxonomy2Tree
Tree
UserSessions
UserSessionsURLTracking
Users
graph_path
term
term2term
term_definition
term_synonym

Deprecated tables: AutoAnno, BLASTp, BLASTp_dddd where dddd is a taxonomyId, TMHMM, FamilyAlignments, FastBLASTDomains

MicrobesOnline also includes microarray data, in the "microarray" database, but this is not in the public SQL database yet.

AASeq

locusId	int(10) unsigned
version	int(2) unsigned
sequence	longblob

The amino acid sequence of a protein-coding gene.

ACL

requesterId	int(10) unsigned
requesterType	enum('user','group')
resourceId	int(10) unsigned
resourceType	enum('cart','job','uarray','scaffold')
read	tinyint(1)
write	tinyint(1)
admin	tinyint(1)

Permissions to access scaffolds, carts, jobs, and microarray experiments, either by user or by group. All of the data in pub.microbesonline.org is accessible by group 1 (public / all users).

Annotation

annotationId	int(10) unsigned
source	varchar(64)
date	int(10) unsigned

This table stores the history of all the annotations, including both MicrobesOnline automated annotations, annotations from RefSeq, and manual annotations created by MicrobesOnline users, sorted by date. (This table may not exist in pub.microbesonline.org.) The source field is a userId. The actual content of the annotation is in the AnnotationDetail table.

AnnotationDetail

annotationId	int(10) unsigned
locusId	int(10) unsigned
type	enum('name','synonym','description','ecNum','go','comment')
action	enum('append','replace','delete')
annotation	text

AnnotationDetail contains the actual data for each annotation. There may be multiple changes with the same annotationId.

COG

cogId	int(10) unsigned
locusId	int(10) unsigned
version	int(2) unsigned
cogInfoId	int(10) unsigned

The COG assignment for the gene is in cogInfoId. (Please ignore cogId.) Genes are only assigned to the best-matching COG, and the gene must cover at least 60% of the COG.

COGCount

count(l.locusId)	bigint(21)
funCode	varchar(5)
taxonomyId	int(10)
scaffoldId	int(10) unsigned

Number of genes in each COG functional category on each scaffold.

COGFun

funCode	char(1)
description	varchar(70)
funGroup	varchar(50)

Descriptions of the COG function codes.

COGInfo

cogInfoId	int(10) unsigned
funCode	varchar(5)
description	varchar(255)
geneName	varchar(20)
cddId	varchar(255)
length	int(10) unsigned

Metadata on each COG id (cogInfoId). cddId is the conserved domain database identifier for this COG. funCode can be multiple characters, each of which has its own row in COGFun.

COGrpsblast

locusId	int(10) unsigned
version	int(2) unsigned
subject	varchar(20)
identity	float unsigned
alignLength	int(10) unsigned
mismatch	int(10) unsigned
gap	int(10) unsigned
qBegin	int(10) unsigned
qEnd	int(10) unsigned
sBegin	int(10) unsigned
sEnd	int(10) unsigned
evalue	double
score	float

All hits, from rpsblast, to each COG. The subject is the COG number (as in COGInfo.cogInfoId). A gene may have hits to multiple COGs, even though it is only assigned to one of them (or to none of them if the hits are too short).

Carts

cartId	int(10) unsigned
userId	int(10) unsigned
name	varchar(32)
seqData	longtext
seqCount	int(10) unsigned
time	int(10) unsigned
active	int(1) unsigned

User-created lists of genes, or "carts." This table is empty in pub.microbesonline.org.

CrisprFamily

locusId	int(10) unsigned
familyId	int(10) unsigned
type	smallint(5) unsigned

Clusters of CRISPR repeats and spacers based on BLAST hit connectivity.

Description

descriptionId	int(10) unsigned
description	text
source	varchar(255)
created	date
locusId	int(10) unsigned
version	int(2) unsigned

The gene descriptions, e.g., "DNA-binding transcriptional activator, 3HPP-binding". These are often incorrect or misleading.

DomainInfo

domainDb	varchar(20)
domainId	varchar(20)
domainName	varchar(50)
iprId	varchar(10)
iprName	varchar(100)
domainLen	int(5) unsigned
fileName	varchar(20)

Meta-data on interpro domains and sites. iprId is defined in the IPRInfo table.

ECInfo

ecNum	varchar(20)
name	varchar(255)

Metadata on enzyme classification numbers.

FastBLASTDomains

domainId	varchar(20)
locusId	int(10) unsigned
version	int(10) unsigned
seqBegin	int(5) unsigned
seqEnd	int(5) unsigned

Associations of genes with ad-hoc domains from FastBLAST. This table has been deprecated.

FasthmmFamily2HMM

domainDb	varchar(20)
domainId	varchar(20)
hmmName	varchar(10)
domainLen	int(5) unsigned

Data on each HMM for the InterPro families, with domainDb and domainId linked to hmmName (the same as hmmId in FasthmmRawHits). Many InterPro families are described by more than one HMM.

FasthmmRawHits

hmmId	varchar(20)
locusId	int(10) unsigned
version	int(2) unsigned
seqBegin	int(5) unsigned
seqEnd	int(5) unsigned
domainBegin	int(5) unsigned
domainEnd	int(5) unsigned
score	float
evalue	float

The raw FastHMM hits for each InterPro domain (Gene3D, PANTHER, Pfam, PIRSF, SMART, Superfam, TIGRFAM). The hmmIds are described in FasthmmFamily2HMM.hmmName.

GoCount

goId	int(10) unsigned
goCount	int(10) unsigned
taxId	int(10)

Numbers of genes (goCount) in each gene ontology category (goId) for each genome (taxId). goId maps to id in the term table and taxId maps to taxonomyId in the Taxonomy table.

GroupUsers

groupId	int(10) unsigned
userId	int(10) unsigned
active	tinyint(1) unsigned
time	int(10) unsigned

Which users are members of which groups.

Groups

groupId	int(10) unsigned
name	varchar(32)
description	varchar(255)
adminUserId	int(10) unsigned

Groups of users are defined for access control. Group 1 is special: all users belong to it.

HitInfo

hitId	varchar(100)
synId	varchar(100)
sLength	int(10) unsigned
description	longtext

Deprecated. This table is empty on pub.microbesonline.org.

IPR2Go

iprId	varchar(10)
goId	varchar(10)

Which interpro id (iprId, defined in the IPRInfo table) maps to which goId (defined by id in the term table).

IPRInfo

iprId	varchar(9)
type	varchar(16)
shortName	varchar(50)
proteinCount	int(5)
iprName	varchar(255)

Each interpro id defines a conserved gene function (e.g., "IPR005471" corresponds to "Transcriptional regulator IclR, N-terminal"). proteinCount is from the InterPro database and does not reflect the number of genes in MicrobesOnline that is associated with that interpro id.

InterPro

locusId	int(10) unsigned
version	int(2) unsigned
checksum	varchar(16)
length	int(5) unsigned
domainDb	varchar(30)
domainId	varchar(30)
domainDesc	varchar(255)
domainStart	int(5)
domainEnd	int(5)
evalue	float
status	varchar(10)
date	varchar(50)
iprId	varchar(9)
iprName	varchar(255)
geneOntology	longtext

Assignments of genes to InterPro sites, such as from PRINTS or PROSITE. Assignments to InterPro families (e.g. PFams, TIGRFAMs, etc.) are in the Locus2Domain table, not in this table.

Jobs

jobId	int(10) unsigned
parentJobId	int(10) unsigned
userId	int(10) unsigned
cartId	int(10) unsigned
jobName	varchar(32)
jobType	varchar(32)
jobData	text
jobCmd	text
status	int(2) unsigned
time	int(10) unsigned
doneTime	int(10) unsigned
saved	int(1) unsigned

Computations requested by users. This table is empty in pub.microbesonline.org.

KEGG2Taxonomy

keggOrgId	varchar(5)
taxonomyId	int(10)

The mapping between NCBI taxonomy ids for genomes and KEGG organism ids.

KEGGCompound

compound	varchar(7)
name	tinytext
formula	tinytext
mass	double

Small molecules (metabolic compounds) from KEGG.

KEGGConf

mapId	varchar(20)
object	varchar(20)
type	int(1)
url	varchar(255)
coord	varchar(100)

Where the objects are on the KEGG maps, to support clicking through to compounds or enyzymes.

KEGGInfo

hitId	varchar(100)
sLength	int(10) unsigned
description	longtext

Descriptions of genes from the KEGG database. hitId is a KEGG identifier.

KEGGMap

mapId	varchar(20)
title	varchar(255)

Titles of the KEGG maps -- the actual maps are in image files.

KEGGblast

locusId	int(10) unsigned
version	int(2) unsigned
subject	varchar(30)
identity	float unsigned
alignLength	int(10) unsigned
mismatch	int(10) unsigned
gap	int(10) unsigned
qBegin	int(10) unsigned
qEnd	int(10) unsigned
sBegin	int(10) unsigned
sEnd	int(10) unsigned
evalue	double
score	float

Hits of MicrobesOnline genes to KEGG genes. Subjects are KEGG ids, e.g. "aae:aq_001". Despite the name, these hits are from BLAT, not from BLAST. Due to size limitations, this table currently does not exist.

Locus

locusId	int(10) unsigned
version	int(2) unsigned
priority	tinyint(3) unsigned
created	date
posId	int(10) unsigned
evidence	varchar(50)
scaffoldId	int(10) unsigned
type	smallint(5) unsigned

The central object in the MicrobesOnline database is a gene or "Locus". Locus includes protein-coding genes, non-coding RNA genes and pseudogenes. The Locus table includes a locusId, which is not unique, a version number, and a priority flag. The combination of a locusId and a version number is unique, and priority=1 for the current version. To avoid using stale and/oor redundant data, please make sure that you require "Locus.priority=1" whenever you query this table. Locus ids are also referred to as VIMSS ids and MicrobesOnline gene ids. (They are also referred to as orfIds in some of our older code.) Locus.type gives the type of the locus, e.g. type=1 for protein-coding gene. See the LocusType table for more information.

Locus also links to Scaffold.scaffoldId and to Position.posId. The Scaffold table includes a taxonomyId, so you can use that to determine what genome the Locus is from. The Position table stores the location of the gene. For protein-coding genes, the Position stores the range from the predicted start codon to the stop codon.

A large number of tables give additional information about a locus. Most of these tables include both locusId and version. Some of the tables includes only locusId but no version -- those entries implicitly refer to the active version of the locus (the one with priority=1).

Locus2Domain

domainId	varchar(20)
locusId	int(10) unsigned
version	int(2) unsigned
seqBegin	int(5) unsigned
seqEnd	int(5) unsigned
domainBegin	int(5) unsigned
domainEnd	int(5) unsigned
score	float
evalue	float

Assignments of genes to InterPro families. Within each database (e.g., within Pfam), these assignments are based on the best-hitting family for each region. There may be many more hits, or multiple hits to different HMMs for the same domainId, in FasthmmRawHits.

Locus2Ec

locusId	int(10) unsigned
version	int(2) unsigned
scaffoldId	int(10) unsigned
ecNum	varchar(20)
evidence	varchar(50)

Assignments of genes to enzyme classification numbers such as "1.4.1.15". These assignments are from KEGG or by best BLAST hit to KEGG, and are not very reliable. Via the ecNum field, this also links to descriptions of enzymes in the ECInfo table.

Locus2Go

locusId	int(10) unsigned
goId	int(10) unsigned
evidence	varchar(255)

Assignments of genes to Gene Ontology identifiers. These assignments have many false positives, partly because they are based on individual domains rather than the entire gene. goId maps to id from the term table.

Locus2Ipr

locusId	int(10) unsigned
iprId	varchar(9)
taxonomyId	int(10)

Assignment of genes to InterPro ids, which are described in the IPRInfo table.

Locus2Operon

locusId	int(10) unsigned
tuId	int(10) unsigned

MicrobesOnline operon predictions. Genes predicted to be transcribed by themselves will still have an operon, but with only one gene in it. For E. coli, we also have known operons, as explained in the Operon table. The tuId field links to the Operon table.

Plasmids, viruses, incomplete genomes, and eukaryotes usually do not have operon predictions.

Locus2Pdb

pdbId	varchar(6)
pdbChain	char(1)
locusId	int(10) unsigned
version	int(2) unsigned
seqBegin	int(5) unsigned
seqEnd	int(5) unsigned
pdbBegin	int(5) unsigned
pdbEnd	int(5) unsigned
identity	decimal(5,2) unsigned
alignLength	int(5) unsigned
mismatch	int(5) unsigned
gap	int(5) unsigned
evalue	double
evalueDisp	text
score	decimal(10,2) unsigned

BLAST hits of MicrobesOnline genes to proteins of known structure in the protein data bank ("PDB").

Locus2RegPrecise

locusId	int(11)
geneInOperonIndex	int(11)
sourceTypeTermId	int(11)
regulonId	int(11)
regulatorName	varchar(255)
regulatorLocusId	int(11)

Regulatory predictions from RegPrecise. regulatorLocusId is the regulator and locusId is the regulated gene. sourceTypeTermId is 1 for manually curated predictions and 2 for predictions that are automatically propagated to orthologs that have a conserved match to the weight matrix. regulonId is from RegPrecise itself and should be used together with sourceTypeTermId. geneInOperonIndex is the position in the operon (e.g., 1 for the 1st gene in the operon).

Locus2RTB

locusId	int(10) unsigned
rtb_seqfeature_id	int(10) unsigned
rtb_characterized	int(1) unsigned

Mapping of MicrobesOnline genes to RegTransBase genes. If rtb_characterized is set then the regulation of this gene has been characterized by a paper.

Locus2RTBArticles

locusId	int(10) unsigned
pubMedId	varchar(20)
wetExp	tinyint(1) unsigned

Mapping of MicrobesOnline genes to papers about them from the RegTransBase database. The RegTransBase database stores data, collected from published papers by curators, about gene regulation.

Locus2SwissProt

locusId	int(10) unsigned
version	int(2) unsigned
id	varchar(20)
accession	varchar(20)
identity	int(3) unsigned
bidir	int(2) unsigned

Locus2SwissProt stores a mapping between MicrobesOnline genes and UniProt (formerly known as SwissProt) accesssions and identifiers. This table only stores the best hit(s) of each MicrobesOnline gene in UniProt. The SwissProt2Locus table stores a best-hit mapping in the other direction. The Locus2SwissProt.bidir flag is set if this relation is also in the SwissProt2Locus table.

Redundant mappings are common if multiple strains of a genome have been sequenced, because all of the copies of the protein sequence will usually be identical. Although UniProt contains organism information, it generally does not specify the strain. So, we do not use the UniProt organism information to make the mapping -- we only use the protein sequences.

Usually, a gene from a complete genome will have an exact or nearly-exact match in UniProt, but hits down to 80% identity are used if no closer match is available.

Locus2TigrFunCat

locusId	int(10) unsigned
mainRole	varchar(255)
subRole	varchar(255)
evidence	varchar(64)

Links genes to TIGR functional categories. This is based on annotations made by CMR, and does not reference TIGRFam hits from HMMER. For those hits you should use the the Locus2Domain table, the metadata on TIGRFams in the TIGRInfo table, and the descriptions of the roles in the TIGRroles table. The mainRole and subRole columns in the Locus2TigrFunCat table are text descriptions not role ids.

Locus2Tree

treeId	int(10) unsigned
locusId	int(10) unsigned
version	int(2) unsigned
aaTree	tinyint(1)
begin	int(10) unsigned
end	int(10) unsigned
nAligned	int(10) unsigned
score	decimal(5,1) unsigned
scaffoldId	int(10) unsigned

Links genes to gene trees. aaTree is 1 if the tree is based on protein sequences. begin and end is the region of the gene that is in the alignment that was used to build the tree. (These are 1-based positions within the gene sequence.) Note that because of domain duplications, a gene can occur in a tree more than once. nAligned is the number of aligned positions, and may be less than end-begin+1 because of how profile-based alignment works and because of trimming. score is deprecated. scaffoldId is the scaffoldId of that locus.

LocusCount

scaffoldId	int(10) unsigned
taxonomyId	int(10)
nGenes	int(10) unsigned

Counts the number of genes on each scaffold. Also links to Taxonomy.taxonomyId.

LocusSeq

locusId	int(10) unsigned
version	int(2) unsigned
sequence	longblob

The nucleotide sequence of each gene. For protein-coding genes, this includes only the open reading frame (including the stop codon).

LocusType

type	smallint(5) unsigned
description	varchar(100)

This table explains the values of the type field in the Locus table. This is a controlled vocabulary, e.g. "protein-coding gene" (type=1), large subunit ribosomal RNA (type=2), pseudogene of a protein-coding gene (type=7), etc.

MOG

mogId	int(10) unsigned
nComponents	int(10) unsigned
nLoci	int (10) unsigned
metric	float

A MicrobesOnline Ortholog Group, or MOG, is the union of maximal ortholog groups, from different trees, that have similar membership. The "components" of a MOG are the maximal ortholog groups, see MOGComponent. The loci that belong to a MOG are stored in MOGMember. Note that a genome can have more than one representative in a MOG, but a locus belongs to at most one MOG. While constructing MOGs, we use the "best" maximal ortholog groups first, as determined by the metric of the average alignment length of the ogmembers. (We may use another metric, such as alignment score, in the future.) A MOG's metric is that of the first ortholog group that was placed in this MOG.

Also see the orthologs overview or the OrthologGroup or COG tables.

MOGComponent

mogId	int(10) unsigned
treeId	int(10) unsigned
ogId	int(10) unsigned
metric	float
nMembers	int(10) unsigned
nMembersBest	int(10) unsigned

A MOG component is a maximal ortholog-group (an ortholog group for a tree that is not a duplication and is not contained within any any other non-duplication ortholog group) that was added to a MOG. nMembers is the number of members, and nMembersBest is the number of members for which this was the best ortholog group.

MOGMember

mogId	int(10) unsigned
locusId	int(10) unsigned
version	int(2) unsigned
minBegin	int(10) unsigned
maxEnd	int(10) unsigned
treeId	int(10) unsigned
ogId	int(10) unsigned
metric	float
nAligned	int(10) unsigned
taxonomyId	int(10)

MOGMember reports which MOG a protein-coding gene belongs to (if any) and which ortholog group was used to add this locus to that MOG. minBegin and maxEnd are the range of positions in the protein that are covered by the trees/ortholog-groups that are included in this MOG. metric is average metric for the ortholog group, while nAligned describes the number of aligned positions for this gene in that tree (this ultimately is from Locus2Tree.nAligned).

MOGNeighborScores

mog1	int(10) unsigned
mog2	int(10) unsigned
score	float
nTaxGroupsBoth	int(10) unsigned
nTaxGroups1	int(10) unsigned
nTaxGroups2	int(10) unsigned

MOGNeighborScores reports which MOGs are conserved near each other across more than one group of closely-related genomes. Only pairs with mog1 < mog2 are included in the table. nTaxGroupsBoth is the number of genome groups for which the genes are near each other. nTaxGroups1 is the number of genome groups that contain at least one representative of mog1. score = nTaxGroupsBoth / max(nTaxGroups1, nTaxGroups2). Genes are considered to be near each other if they are adjacent and spaced by less than 200 nucleotides, or if they are within a run of genes on the same strand for which each adjacent pair of which is spaced by less than 200 nucleotides. Groups of closely-related genomes are defined by clades from the MicrobesOnline species tree that have diverged by at most 0.02 from their common ancestor. (For example, this threshold groups strains of Escherichia coli and various Shigella species with each other but separately from Salmonella species.)

ObjectDescr

objectId	int(2) unsigned
description	varchar(255)

This table explains the values of the objectId fields in the Position table, e.g. if the position is for a Locus or an Operon.

OGMember

treeId	int(10) unsigned
ogId	int(10) unsigned
locusId	int(10) unsigned
version	int(2) unsigned
begin	int(5) unsigned
end	int(5) unsigned
taxonomyId	int(10)
nMemberThisGenome	int(10) unsigned
aaLength	int(10) unsigned

This table describes the members of each ortholog group. A gene with a domain duplication can be in a tree more than once, which is why begin and end are necessary to guarantee uniqueness. nMemberThisGenomes helps determine if this is a 1:1 ortholog. aaLength (the length of this protein in amino acids) helps determine if the orthology relation covers the full length of the gene.

Operon

tuId	int(10) unsigned
posId	int(10) unsigned
evidence	varchar(255)

Each entry in the Operon links together one or more genes that are known or predicted to be transcribed together. The evidence field is either "VIMSS Predicted" for our operon predictions or "Confirmed" for experimentally characterized operons. Operon also links to the Position table, which stores the extent of the operon. Those positions are based on the extents of the genes, not on predicting transcription starts or stops.

Ortholog

locus1	int(10) unsigned
locus2	int(10) unsigned

Ortholog stores putative orthology relationships between pairs of genes. Any gene will map to at most 1 gene in another genome. These "orthologs" are from a FastBLAST-based variant of bidirectional best BLASTp hits and are often incorrect. Also, the table is growing very rapidly (it grows as the square of the number of genomes). So, these will be discontinued soon. Use the OrthologGroup or MOG tables instead.

OrthologGroup

treeId	int(10) unsigned
ogId	int(10) unsigned
parentOG	int(10) unsigned
isDuplication	tinyint(1)
nGenes	int(10) unsigned
nGenomes	int(10) unsigned
nNonUniqueGenomes	int(10) unsigned
splitTaxId	int(10)

The primary key of this table is the combination of treeId and ogId. ogIds themselves are not unique.

Our definition of ortholog group is a clade in a gene tree that is mostly one copy per genome, or, if isDuplication is set, a clade of genes from a closely-related group of genomes that have undergone duplication. These are not evolutionary orthologs as originally defined by Fitch (evolutionary orthologs are genes that diverged by a speciation event). Nor is there any guaranatee that these orthologs have the same function. However, be believe that these tree-based orthologs, if used carefully (see below), are more accurate than bidirectional-best hits, and they are faster to compute and smaller to store. The members of the ortholog groups are in the OGMember table. Every gene in a gene tree will be in an ortholog group, even if the ortholog "group" contains only that gene.

Ortholog groups can be nested inside larger ortholog groups, as described by parentOG (which may be NULL). For example, suppose that two closely related genomes 1 and 2 have a duplication of a gene A, into A1/A2 and a1/a2. There will be ortholog groups for A1/A2, for a1/a2, another with isDuplication=1 for A1/A2/a1/a2, and then (potentially) a larger ortholog group with isDuplication=0 that contains A1/A2/a1/a2 as well as orthologs of these genes from other genomes that did not undergo duplication. splitTaxId is a genome that contains paralogs that are just outside this ortholog group and is hence at least partially responsible for this ortholog group being split here.

nGenes and nGenomes are the total number of genes, and nNonUniqueGenomes is the number of genomes with more than one gene in this ortholog group.

Complications with using tree-based orthologs. These are handled by Gene::treeOrthologs().

A small number of duplications by horizontal gene transfer ("xenoparalogs") are also allowed within an ortholog group. If you want a stricter definition of orthology, use only the groups with no non-unique genomes. And, you probably want to examine the smaller orthology groups first.
Genes typically belong to many trees, so you need to choose which tree(s) to use.
Genes with very different domain structures, or truncated genes, may nevertheless be in some of the same gene trees, and you might not want to include these as orthologs.

Also see the orthologs overview or the MOG or COG tables.

PdbEntries

pdbId	varchar(6)
header	text
ascessionDate	date
compound	text
source	text
authorList	text
resolution	float
experimentType	text
dbSource	text

Data on proteins of known structure from the protein data bank ("PDB").

PdbReps

pdbIdRep	varchar(6)
pdbChainRep	char(1)
pdbId	varchar(6)
pdbChain	char(1)

Data on chains within proteins from PDB. pdbChain is not unique -- only the combination with pdbId is unique.

PdbSeq

pdbId	varchar(6)
pdbChain	char(1)
version	int(2) unsigned
sequence	longblob

The sequence of each chain of a protein in PDB.

PfamClan

clanId	varchar(10)
domainId	varchar(10)

For some diverse families, which are difficult to represent with a single HMM, PFam contains multiple models with overlapping specificity, or "Clans." Thus, PfamClan table stores the mapping between PFam clans and domainIds.

Position

posId	int(10) unsigned
strand	enum('+','-')
begin	int(10) unsigned
end	int(10) unsigned
scaffoldId	int(10) unsigned
objectId	int(2) unsigned

Position stores the position of objects such as genes or operons. begin and end are 1-based, and strand is either "+" or "-". For objects on linear chromosomes, begin is always less than end and strand, so for a protein-coding gene on the "-" strand, the start codon will be at the end position.

Objects on circular chromosomes (see Scaffold.isCircular) can wrap around the origin. In this case, both begin and end will be within the range [1,Scaffold.length], and end will be less than begin. In other words, end < begin indicates that the object wraps around the origin.

RegulonCluster

clusterId	int(10) unsigned
locusId	int(10) unsigned
updated	timestamp
link	varchar(50)

RegulonLinks

cluster1	int(10) unsigned
cluster2	int(10) unsigned
link	varchar(50)
score	decimal(10,3) unsigned
updated	timestamp

RegulonCluster and RegulonLinks store predicted "regulons", or genes that are likely to have related functions and/or regulation. Each clusterId links several locusId in the same genome together, and RegulonLinks gives a higher-level clustering of those clusters. At this time, all predictions are based on conserved proximity, or the "gene neighbor" method, and the link field is "GNScore". In the future, gene expression-based links may also be included. Also, at this time, each cluster contains only genes that are in the same operon, and the links link the operons together.

Scaffold

scaffoldId	int(10) unsigned
chr_num	int(10) unsigned
isCircular	int(1)
length	int(10) unsigned
file	varchar(32)
isGenomic	int(1)
gi	int(10) unsigned
taxonomyId	int(10)
comment	varchar(255)
isActive	int(1)
isPartial	int(1)
created	date
allowUpdates	tinyint(3) unsigned

This table represents chromosomes, plasmids, or contigs. A genome may have multiple scaffolds and each scaffold may have thousands of genes. To make sure that you do not use out-of-date data, please require "Scaffold.isActive=1" whenever you query this table. If you don't want data from plasmids, require "isGenomic=1". Require "isPartial=0" to consider only complete genomes. isCircular is 1 for circular chromosomes and 0 for linear chromosomes.

For scaffolds from RefSeq, file will be the NC_nnnnnn identifier and gi should be a genbank id.

ScaffoldIsActiveFlag

isActive	int(10) unsigned
description	varchar(255)

This table describes the levels of the Scaffold.isActive flag.

ScaffoldPosChunks

locusId	int(10) unsigned
version	int(2) unsigned
posId	int(10) unsigned
scaffoldId	int(10) unsigned
kbt	int(10) unsigned

This table indexes the positions of genes on scaffolds. This way you can quickly find genes at between, say, 110 and 120 kB on the E. coli chromosome. The posId field contains the gene's position (the same as Locus.posId), and the kbt value is the gene's position divided by 10,000 and rounded down. If a gene overlaps more than one 10 kB chunk of the genome then it will appear multiple times in this table.

ScaffoldSeq

scaffoldId	int(10) unsigned
sequence	longblob

This table stores the DNA sequence of each scaffold.

SwissProt

id	varchar(20)
accession	varchar(20)
length	int(10)
description	text

This table stores data about each UniProt protein.

SwissProt2Locus

locusId	int(10) unsigned
version	int(2) unsigned
id	varchar(20)
accession	varchar(20)
identity	int(3) unsigned

SwissProt2Locus stores a mapping from a UniProt protein to a MicrobesOnline gene. This table only stores the best hit of each UniProt gene in MicrobesOnline. The Locus2SwissProt table stores a best-hit mapping in the other direction. See its description for more explanation of this mapping.

SwissProt2Pubmed

id	varchar(20)
accession	varchar(20)
PubMedId	int(20) unsigned
isDetailed	tinyint(1)

SwissProt2Pubmed stores a mapping between UniProt proteins and papers at PubMed. isDetailed is set if the paper links to 5 genes or less. Most of the entries with isDetailed=0 are from genome papers; most of the entries with isDetailed=1 contain experimental data about the function or regulation of a gene.

SwissProtBlast

locusId	int(10) unsigned
version	int(2) unsigned
subject	varchar(20)
identity	float unsigned
alignLength	int(10) unsigned
mismatch	int(10) unsigned
gap	int(10) unsigned
qBegin	int(10) unsigned
qEnd	int(10) unsigned
sBegin	int(10) unsigned
sEnd	int(10) unsigned
evalue	double
score	float

This paper stores the homology relations between SwissProt genes and MicrobesOnline genes. Despite the name, the hits are from BLAT, not BLAST. The subjects are of the form accession|id, e.g. "Q31A92|PSBA_PROM9". Due to size limitations, this table currently does not exist.

Synonym

name	varchar(100)
locusId	int(10) unsigned
version	int(2) unsigned
type	int(2) unsigned
created	date
source	varchar(50)

This table contains names and identifiers for genes, such as gene names (type 0: "cheA"), locus tags (type 1: "b1888"), genbank ids (type 2: "16129840"), or genbank accessions (type 3: "NP_416402.1").

SynonymType

type	int(10) unsigned
description	varchar(255)

This table should document the types of identifiers or synonyms for genes.

TIGRInfo

id	varchar(16)
tigrId	varchar(30)
type	varchar(32)
roleId	int(5)
geneSymbol	varchar(10)
ec	varchar(16)
definition	longtext

Metadata on each TIGRFam. TIGRFam tigrId are the same as domain Ids that appear in the Locus2Domain and DomainInfo tables, and the roleIds are described in the TIGRroles table.

TIGRroles

roleId	int(5)
level	enum('main','sub1')
description	varchar(255)

Data on TIGR "roles", a functional classification of many of the TIGRFAM gene families.

TaxName

taxonomyId	int(20)
name	text
uniqueName	text
class	varchar(255)

From the NCBI taxonomy database: names for taxonomic groups. Note that the Taxonomy table only contains data for organisms with genomes (i.e., strains or species), while this table contains all taxa from the NCBI database. See TaxName for names for these taxa, and TaxParentChild for genomes of organisms within each taxonomic group.

TaxNode

taxonomyId	int(10) unsigned
parentId	int(10) unsigned
rank	varchar(50)
embl	varchar(10)
divId	int(3) unsigned
flag1	int(1) unsigned
flag2	int(1) unsigned
flag3	int(1) unsigned
flag4	int(1) unsigned
flag5	int(1) unsigned
flag6	int(1) unsigned
flag7	int(1) unsigned
comments	varchar(255)

From the NCBI taxonomy database: the parent/child relationship between taxonomic groups. Also see TaxName and TaxParentChild.

TaxParentChild

parentId	int(10)
childId	int(10)
nDistance	int(10) unsigned

This table expands the parent-child relationships for all genomes that are in our database and their higher-level taxa. For example, this table lets you select all sequenced genomes within a taxonomic group with a single query. parentId and childId are NCBI taxonomyIds from the TaxNode table. The childId may be in the Taxonomy table as well. nDistances is the number of links on the path from parent down to child.

Taxonomy

taxonomyId	int(10)
name	varchar(255)
placement	int(10)
shortName	varchar(100)
taxDispGroupId	int(10)
created	date
PMID	varchar(50)
Publication	varchar(255)
Uniprot	varchar(10)

This table includes only organisms (or viruses or plasmids) whose genome sequences are in our database. The taxonomyId values are from the NCBI taxonomy database. These are usually species- or strain-level taxa.

Higher-level taxa are available from TaxParentChild, TaxNode, or TaxName. Note that if a plasmid from an organism is sequenced, then that organism may appear in this table, even though its genome is not sequenced. Use Scaffold.isGenomic=1 to avoid plasmid scaffolds, and use TaxParentChild to get genomes within a certain taxa. In particular, to avoid viruses, use parentId=131567, which corresponds to all cellular organisms.

Use Scaffold to get to complete chromosomes or contigs, and from there you can get to genes (the Locus table).

Taxonomy2Tree

treeId	int(10) unsigned
taxonomyId	int(10)

Links genomes to species trees. Plasmids, viruses, incomplete genomes, and mixed "genomes" do not appear in the species trees.

Tree

treeId	int(10) unsigned
name	varchar(64)
type	varchar(30)
modified	timestamp
newick	longblob

Gene trees and species trees. The newick field contains the actual tree, in Newick format. For gene trees, the leaf ids are of the form locusId_version_begin_end or locusId_begin_end. For species trees, the leaf ids are taxonomy ids. See the tree-browser documentation for more information on how these trees are constructed. The full species tree always has name="MOSpeciesTree". There are also other species trees for subgroups of genomes. The type field is "species" for species trees or otherwise the type of family that was used to build the tree: 16S, Adhoc, COG, FastBLAST, Gene3D, PFAM, PIRSF, SMART, SSF, TIGRFAMs. For gene trees, the name is some sort of name for the family (e.g., the cogInfoId for COG trees).

Because private genomes are removed from pub.microbesonline.org, these trees may contain genes or organisms that are not in the public versions of the tables. Use the TreeUtils::simplify() routine or the Genomics/stat/simplifyTree.pl program to remove these leaves from the trees.

UserSessions

userSessionId	int(10) unsigned
sessionId	varchar(24)
ip	varchar(16)
uaHash	varchar(24)
userId	int(10) unsigned
browser	varchar(32)
version	varchar(16)
platform	varchar(16)
arch	varchar(16)
unknownUa	varchar(255)

Records cookies to keep track of user's sessions. This table is empty in pub.microbesonline.org.

UserSessionsURLTracking

userSessionId	int(10) unsigned
requestUri	varchar(255)
requestParams	longtext
method	enum('GET','POST')
accessTime	int(10) unsigned

Records HTTP requests. This table is empty in pub.microbesonline.org.

Users

userId	int(10) unsigned
name	varchar(40)
org	varchar(40)
email	varchar(80)
pwhash	varchar(32)
annotationTrust	smallint(6)
emailDataRelease	int(1)
emailSiteMaintenance	int(1)
emailSoftware	int(1)
isSysAdmin	int(1) unsigned

Records registered users. This table is empty in pub.microbesonline.org.

graph_path

id	int(11)
term1_id	int(11)
term2_id	int(11)
distance	int(11)

This records parent-child relationships in the Gene Ontology hierarchy. term1_id and term2_id link to term.id, and term1_id is the parent id. distance is the number of steps between them.

term

id	int(11)
name	varchar(255)
term_type	varchar(55)
acc	varchar(255)
is_obsolete	int(11)
is_root	int(11)

Gene Ontology identifiers. id (goId or term_id in tables that refer to this one) is an internal identifier; most people use the accession ("acc"), such as GO:0000018, regulation of DNA recombination. Also see Locus2Go and GoCount tables, and the other GO tables below.

term2term

id	int(11)
relationship_type_id	int(11)
term1_id	int(11)
term2_id	int(11)
complete	int(11)

Relationships between Gene Ontology identifiers. term1_id and term2_id are ids in the term table. The relationship_type_id is in the term_definition table -- usually it is 2, which means a parent-child relation.

term_definition

term_id	int(11)
term_definition	text
dbxref_id	int(11)
term_comment	mediumtext
reference	varchar(255)

Longer descriptions of term ids in the term table.

term_synonym

term_id	int(11)
term_synonym	text
acc_synonym	varchar(255)
synonym_type_id	int(11)
synonym_category_id	int(11)

Alternate names for Gene Ontology identifiers.

Obtaining our source code

You can download our source code from here. Note that the executables in this download (in Genomics/browser/bin/) are all 64-bit Linux executables. Please double-check any results you get from using our scripts, as the database schema may have changed since the script was written.

Popular perl scripts

Genomics/util/getGenes.pl: Get names, descriptions, and other annotations for all genes in a genome or a given list of loci.

Genomics/util/matchGenes.pl: Given a tab-delimited file, map the gene names to locusIds ("VIMSS ids")

Genomics/util/scaffoldSubseq.pl: Get sequences for ranges within scaffolds or for ranges relative to a gene

Genomics/seq/getUpstream.pl: Get 500 nt upstream of all genes in a genome, truncating parts that overlap other genes

Overview of the perl modules

The key modules are in Genomics/Perl/modules. There are also additional modules, mainly to support the web site code, in Genomics/browser/lib. The key modules are:

GenomicsUtils -- Use GenomicsUtils::connect(hostname) to connect to a MicrobesOnline sql database, before you create genome, gene, or scaffold objects.
Genome -- Use Genome::new(taxonomyId => 511145) to get a genome object for E. coli, and $genome->genes() or $genome->scaffolds() to get gene or scaffold objects. Use $genome->nongenomicGenes() or $genome->nongenomicScaffolds() to get any plasmid genes or scaffolds.
Gene -- Has many methods to obtain attributes of genes such as begin(), end(), strand(), synonym(), description(), etc.

Perl Examples

(Also see the source code of the above perl scripts.)

Fetch all genes in E. coli K12 (taxonomyId 511145) and tree-orthologs in another strain of E. coli and in D. vulgaris Hildenborough (taxonomyId 83334 and 882). (This takes ~1/2 a second per gene, so doing the whole genome will take a little while.)

use GenomicsUtils;
use Genome;
use Gene;
GenomicsUtils::connect();
my $genome = Genome::new("taxonomyId" => 511145);
my @genes = $genome->genes();
push @genes, $genome->nongenomicGenes();
foreach my $gene (@genes) {
   if($gene->type()==1) {
       my $l = $gene->treeOrthologs("taxa" => [83334,882]);
       while (my ($orthId, $hash) = each %$l) {
           print join("\t",$gene->locusId(), $orthId, $hash->{taxonomyId})."\n";
       }
   }
}

Fetch all 1:1 MOG-based orthologs between two loci of interest from E. coli K12 (VIMSS 14484 and 14485) in another strain of E.c oli and in D. vulgaris Hildenborough. (Note: MOGs need not be 1:1, but Gene::MOGForTaxa removes non-1:1 hitsorthologs.)

use GenomicsUtils;
use Gene;
GenomicsUtils::connect();
my @loci = (14484,14485);
my @taxa = (83334,882);
my $mogOrthologs = Gene::MOGForTaxa(\@loci, \@taxa);
while (my ($locus,$moghash) = each %$mogOrthologs) {
   while (my ($taxId,$orthId) = each %$moghash) {
      print join("\t", $locus, $orthId, $taxId)."\n";
   }
}

Installing your own web server

You can install your own MicrobesOnline web server, if you want to modify the cgi scripts and/or add your own data to the database. The first step is to install your own copy of the database. We will provide mysqldump files suitable for loading into your own MySQL server at a later date; please contact us at help@microbesonline.org if you are interested. Very rough guideliness on installing your own web server and on what what other packages you need to install are in Genomics/config/README. Please contact us for further advice.

Linking to the MicrobesOnline genome browser

You can use the MicrobesOnline browser to view a set of genomes or to show where items that you have identified (e.g., BLAST hits) are located relative to genes. You can use URLs with http://www.microbesonline.org/cgi-bin/browser? and various mode= and data= options:

To show a locus and its orthologs in other organisms, use mode=2 and data=locusId,tax1,tax2,...,taxN, e.g.
- mhpR from E. coli K12 with option to view orthologs
- mhpR from E. coli K12 and orthologs from E. coli B171 and K. pneuomiae
To show how your objects relate to the genes in MicrobesOnline, use mode=4 and a colon-delimited set of object specifiers with the prefixes s for scaffold, c for center, f for from, n for label, and d for description. The scaffold specifier (e.g. "s7023") must be the first entry in each track specifier, and the "c" or "f" entry to specify where the object goes must be the second entry. This way will put each object in its own track, but you can also have multiple objects within a single track -- just leave out the later scaffold specifiers. Example track specifiers:
- "s7023:c20000" (position 20000 on scaffold 7023, centered at 7023)
- "s7023:c20000:-" (centered at position 20000 on the - strand)
- "s7023:f395t478:-:nLabel:dDescription" (feature from 395 to 478 on - strand, with name "Label" and popup-text "Description")
- "s7023:f478t395:nLabel:dDescription" (feature from 478 to 395 on the - strand; this is on the minus strand as implied by to < from)
Also see an example URL with many tracks or with multiple objects within a track