Up to FastTree
Environmental or metagenomic sequences are often very short, which creates difficulties in inferring a phylogeny. In particular, it seems impossible to estimate the distance between two sequences if they do not overlap! Nevertheless, FastTree can generate reasonable topologies for alignments with many sequence fragments, and FastTree can use pseudocounts (the -pseudo option) to improve its results.
To test FastTree on highly gapped alignments, I first simulated gapped alignments of 50, 250, or 1,250 protein sequences, as described in the MBE09 paper. Each alignment corresponds to a broad family (33% identity, 9% gaps). I then replaced half of the sequences with fragments that covered 1/3rd of the extent of that sequence's alignment. In these final "fragmented" alignments, about 6% of pairs of sequences did not overlap, and about 8% of pairs of sequences overlapped by 10 positions or less, while on average, two sequences share 125 non-gapped positions.
|FastTree 2.0.0, -pseudo (with ML NNIs)||60.6%||60.5%||52.6%|
|PhyML 3.0 (no gamma, no SPR)||52.1%||--||--|
|Parsimony (from RAxML)||42.0%||54.6%||54.3%||FastTree 1.1, -pseudo (no ML)||47.2%||46.4%||39.6%||FastTree 1.1, default (no ML)||46.4%||44.5%||37.8%||FastME 1.1||35.2%||32.3%||28.0%|
Note: for FastME, which requires a distance matrix, I used FastTree's log-corrected distances, which sets the distance between non-overlapping sequences to 3.0. All methods had much more similar topological on the alignments with 9% gaps (not shown).
For fragmented alignments, the minimum evolution phase of FastTree is much more accurate than distance matrix methods. To see why, consider four sequences A,B,C,D, where A and B are full-length and C and D are fragments that do not overlap. By default, FastTree uses an arbitrary high value of 3.0 for the distance between non-overlapping sequences. Thus, FastTree will always reject the split AB|CD because d(C,D) will be so high. However, FastTree chooses between AC|BD and AD|BC by comparing d(A,C)+d(B,D) to d(A,D)+d(B,C). As this formula does not involve d(C,D), FastTree can make a reasonable choice. Furthermore, if these sequences are part of a larger alignment, and FastTree joins A to C and B to D, then the profiles for the parent nodes AC and BD will be full-length, and comparisons of those profiles to other sequences will work reasonably, even if those sequences are also fragments.
Also suppose FastTree selects the topology ((A,C),B,D). With default settings, FastTree will set the branch length for the interior branch between AC and BD by the formula d(AC,BD) = (d(A,B)+d(A,D)+d(C,B)+d(C,D))/4 - (d(A,C)+d(B,D))/2. Because this formula includes d(C,D), the branch length will be unreasonably high. With the pseudocount approach, FastTree instead uses the pairwise distances between A,B,C,D, for those pairs that have overlapping positions in their profiles, to get a prior estimate of what d(C,D) might be. (The prior is just the weighted average of uncorrected pairwise distances, where the weight is the number of overlapping positions in that comparison.) FastTree will then estimate each pairwise distance as a weighted average of the evidence and the prior. (This will set d(C,D) to the prior, and will have a small effect on the other distances.) Finally, FastTree will log-correct the distances.