Phylip vs. Column-based format changing downstream results

Hello everyone,
I am a new user of mothur, attempting to analyze my first set of 16S/18S pyrosequencing data from the Canadian Arctic. My question concerns the effect of switching distance file formats from Phylip to column-based and what that means for downstream analysis. I'll get to what I mean in a moment; first, the background: I have 10 samples of ~3,300 16S sequences each, which I was able to analyze individually quite easily all the way through mothur (on an average-power PC workstation) using generally the following set of commands (in this example, bacterial sample #1 = 1B.fas), after having pre-processed everything with 1) the Roche software to remove tags, 2) BioEdit to filter out my chosen error cutoffs, and 3) the RDP aligner, which I manually corrected, to generate the input fastas for mothur:

dist.seqs(fasta=1B.fas, output=lt, cutoff=0.05)
read.dist(phylip=1B.phylip.dist)
cluster(cutoff=0.05)
summary.single(label=0.03-0.05)
collect.single(label=0.03-0.05)
rarefaction.single(label=0.03-0.05)
get.oturep(phylip=1B.phylip.dist, fasta=1B.fas, list=1B.phylip.fn.list, label=0.03, sorted=size)

Everything is perfect up to this point. However, when I merge the 10 samples into one set of ~33,000 sequences, things start to heat up.

I can run the dist.seqs command fairly easily (50 min.) on our server (4 processors, on the unix/linux-compiled version of mothur), which generated a Phylip distance matrix of ~4 GB. However, I then cannot load this into memory (with read.dist) in order to cluster. I am now running the hcluster command instead, which looks as though it may work (it's running on our server as I type). As a side question, does anyone know why the multiple-processor option was not activated for the cluster command? In my experience analyzing the 10 samples individually, it takes roughly 10x as long as the dist.seqs command on a single processor, so that's where a big gain in time could be made. I assume it is a programming hurdle…
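In case it helps, this is roughly the command sequence I am running on the merged dataset (the merged fasta name is just a placeholder here, and I am still waiting to see whether hcluster finishes):

dist.seqs(fasta=allB.fas, output=lt, cutoff=0.05, processors=4)
hcluster(phylip=allB.phylip.dist, cutoff=0.05)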

I have seen that switching to the column-formatted matrix can be a good alternative for large datasets, significantly reducing the size of the resulting file - as the Sogin example remarks after running a column-based distance calculation on ~22,000 sequences:

“The resulting file, sogin.unique.filter.dist, will be 14.1 MB in size. Contrast this to the size of a phylip-formatted distance for the 21,908 sequences of about 6 GB…”

So I went ahead and tried switching to the column format on my ~3,700-sequence sample #1, which I re-analyzed to compare against the first run-through with the Phylip format. I had not used unique.seqs before, but did this time in order to generate the names file needed to use the column format in dist.seqs, and this is where the "problem" starts. After running the same analyses, I wind up with approximately the same number of OTUs (1000 vs. 997 in the second run) and roughly the same numerical values for the diversity indices (the latter being based on the former, so it is not surprising they shift a bit if the #OTUs changes). Note that the sample is rather diverse, so not many seqs were identical and therefore removed by unique.seqs to begin with (I went from 3,690 total to 3,624 unique). What I don't understand is why the #OTUs is not exactly the same, since it should be (computational rounding perhaps?).
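For clarity, the column-based re-analysis of sample #1 went roughly like this (same 0.05 cutoff as before; file names are approximately what mothur generated), followed by the same summary.single/collect.single/rarefaction.single calls as in the first run:

unique.seqs(fasta=1B.fas)
dist.seqs(fasta=1B.unique.fas, cutoff=0.05)
read.dist(column=1B.unique.dist, name=1B.names)
cluster(cutoff=0.05)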

This winds up affecting the analyses/commands afterwards…for example, when I launch the following command:

get.oturep(column=1B.unique.dist, names=1B.names, fasta=1B.fas, list=1B.unique.fn.list, label=0.03, sorted=size)

…I get back the OTUs along with the number of sequences each one represents, which is "incorrect" compared to the original Phylip-based analysis:

The top 10 OTUs using: Phylip matrix without unique.seqs removal / Column matrix with unique.seqs removal
OTU1 - #seqs= 163 / 170
OTU2 - #seqs= 119 / 115
OTU3 - #seqs= 112 / 103
OTU4 - #seqs= 93 / 83
OTU5 - #seqs= 82 / 81
OTU6 - #seqs= 62 / 56
OTU7 - #seqs= 50 / 48
OTU8 - #seqs= 50 / 46
OTU9 - #seqs= 43 / 43
OTU10 - #seqs= 43 / 39

…which I imagine is a direct result of the change in #OTUs. At least the total #seqs represented by all the OTUs works out to 3,690 for both, so mothur does "re-find" in the names file the few sequences culled out during the unique.seqs step. What this messes up is the proportion of sequences belonging to the "abundant" OTUs (and the number of "abundant" OTUs), something we are quite interested in ecologically speaking, which does not come out the same in both. In the above example, I originally had 15 abundant OTUs (those having >1% of the total sequences, so 37 or more) representing 1021 seqs, but had only 13 such OTUs for 940 seqs in the column-based re-analysis. My fear is that this problem would then carry over into the venn, get.sharedseqs, heatmap, etc. commands…

Could this long posting/problem (excuse the length, but I wanted to be clear) simply be chalked up to the famous variability of algorithm-based analyses? I would have thought, though, that since the extra unique.seqs step simply removes exact replicates, it should not have influenced the subsequent dist.seqs/cluster commands that generate the OTUs (identical sequences would theoretically collapse into the same OTUs anyway, leaving the total #OTUs unchanged)…or does the fn algorithm calculate "averages" of distances, so that including many identical sequences in a cluster/"proto-OTU" would influence the calculated distances to the "third parties" (the next seqs to be added to the growing OTU)?

Thanks to anyone who may have some illumination to shed!

PS: Another side question/curiosity - can anyone explain why switching from a Phylip matrix to a column-formatted one has such an impressive impact on the file size as reported above (several orders of magnitude)? I would assume that a large set of seqs with few replicates (say only 10% removed with a unique.seqs command) would still have nearly the same amount of information to code in bytes - you still have to return a number for each pair-wise combination…for example, my 33,000 seqs in a Phylip lower-triangle format should consume roughly: 33,000 × 33,000 / 2 = 544,500,000 numbers + 33,000 labels = 544,533,000 items to code for. If the column matrix has 30,000 seqs, you still have roughly: 30,000 × 30,000 / 2 = 450,000,000 numbers + labels to code, and there are two labels for every number/line in the column format…you can't lose orders of magnitude of the pair-wise data, so this isn't making any math/computational sense to me, but I must obviously be missing something about the column format…?

Let me know if I miss anything as I go through this…

However, I then cannot load this into memory (with read.dist) in order to cluster. I am now running the hcluster command instead, which looks as though it may work (it's running on our server as I type).

You can also use the cutoff option, which will have the effect of throwing out distances larger than you’re probably interested in. If you don’t do this, the column matrix will actually be larger than the phylip matrix.
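To sketch what I mean (placeholder file names, and a generous calculation cutoff so you can still cluster out to the distances you care about):

dist.seqs(fasta=allB.unique.fas, cutoff=0.10)
hcluster(column=allB.unique.dist, name=allB.names, cutoff=0.05)

With the 0.10 cutoff, any pair-wise distance larger than 0.10 never gets written to the column file, which is what keeps it small.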

As a side question, does anyone know why the multiple-processor option was not activated for the cluster command? In my experience analyzing the 10 samples individually, it takes roughly 10x as long as the dist.seqs command on a single processor, so that's where a big gain in time could be made.

The clustering algorithms actually scale at least quadratically, and because the algorithm repeatedly has to find the smallest remaining distance and then merge that row and column, these hierarchical clustering algorithms cannot be parallelized.

After running the same analyses, I wind up with approximately the same number of OTUs (1000 vs. 997 in the second run) and roughly the same numerical values for the diversity indices (the latter being based on the former, so it is not surprising they shift a bit if the #OTUs changes). Note that the sample is rather diverse, so not many seqs were identical and therefore removed by unique.seqs to begin with (I went from 3,690 total to 3,624 unique). What I don't understand is why the #OTUs is not exactly the same, since it should be (computational rounding perhaps?).

If you run each data format through cluster multiple times you will see slight differences like these. This is because after finding the smallest distance, that row and column are merged; however, if there is a tie for the smallest distance, mothur randomly picks among the tied pairs. Because of this, there will be some slight fluctuation. Other implementations use either the first or the last encountered minimum distance; randomly picking seems the least biased. Of course, if the resulting OTU distribution is slightly different, the downstream analysis will be slightly different. That's just the breaks.
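To make the tie issue concrete, here is a toy illustration (made-up sequences and distances, using the furthest-neighbor rule behind your .fn.list files): take four sequences A, B, C, D with d(A,B) = d(B,C) = d(C,D) = 0.02 and all other pairs at 0.05, so there is a three-way tie for the smallest distance.

If (B,C) happens to be merged first, the furthest-neighbor distance from {B,C} to A and to D becomes max(0.02, 0.05) = 0.05, nothing else can merge below 0.03, and you end up with 3 OTUs at the 0.03 level: {B,C}, {A}, {D}.
If (A,B) happens to be merged first, (C,D) can still merge at 0.02, and you end up with 2 OTUs at the 0.03 level: {A,B}, {C,D}.

Same distance matrix, different tie choice, different #OTUs - which is the kind of 1000 vs. 997 wiggle you saw.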

Somewhat concerning is that you get different results for the uniques line. Could you please email us (mothur.bugs@gmail.com) the original alignment file for this example and we can look into it further? We have not seen any differences between using unique'd and non-unique'd sequences other than the slight variation due to the random picking.

PS: Another side question/curiosity - can anyone explain why switching from a Phylip matrix to a column-formatted one has such an impressive impact on the file size as reported above (several orders of magnitude)? I would assume that a large set of seqs with few replicates (say only 10% removed with a unique.seqs command) would still have nearly the same amount of information to code in bytes - you still have to return a number for each pair-wise combination…for example, my 33,000 seqs in a Phylip lower-triangle format should consume roughly: 33,000 × 33,000 / 2 = 544,500,000 numbers + 33,000 labels = 544,533,000 items to code for. If the column matrix has 30,000 seqs, you still have roughly: 30,000 × 30,000 / 2 = 450,000,000 numbers + labels to code, and there are two labels for every number/line in the column format…you can't lose orders of magnitude of the pair-wise data, so this isn't making any math/computational sense to me, but I must obviously be missing something about the column format…?

You are correct that the number of pair-wise comparisons is the same. The savings come in because you don't have to save every distance with the column format. People typically aren't very interested in distances larger than 0.10 (or even 0.03), so we chuck those and don't keep them around; I think you'll notice the benefit even if you use a cutoff of 0.20. In a phylip matrix, on the other hand, you have to save every row/column combination. Within the code, we represent these matrices in the column format, expecting people to apply a cutoff. If you don't use a cutoff, it's going to hurt.
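To put rough, purely illustrative numbers on it: the full lower triangle for 33,000 sequences is about 33,000 × 33,000 / 2 ≈ 545 million distances, and the phylip format writes every one of them no matter how large they are. A column file written with cutoff=0.05 only gets a line for pairs whose distance is ≤ 0.05; in a diverse sample the vast majority of pairs are much further apart than that, so if (say) only 1% of pairs fall under the cutoff you write ~5.4 million lines instead of ~545 million. That, not the uniquing, is where the orders of magnitude go - the same effect is behind the 14.1 MB vs. ~6 GB gap in the Sogin example you quoted.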

I hope this helps, and if I’ve missed anything just let us know. Please send us the non-unique’d alignment file and the exact commands you entered and we can take a look.

Thanks,
Pat

Thanks Pat for the quick answers!

I should have realized the difference about "cutting off" the larger distances in the column format…I guess I had in my mind the cutoff as being the "clustering-type" cutoff parameter/option (since most of the wiki pages refer to this type) and not the actual distance values being cut off from the file. Perhaps a sentence or two making this clear would be beneficial on the Column-formatted distance matrix page of the wiki. I would put something well visible stating the difference in the calculation/final file output versus the Phylip matrix, with maybe a recommendation for people using pyro data.

Somewhat concerning is that you get different results for the uniques line. Could you please email us (mothur.bugs@gmail.com) the original alignment file for this example and we can look into it further? We have not seen any differences between using unique'd and non-unique'd sequences other than the slight variation due to the random picking.

I may have misspoken here, in that the first post above was simply referring to the number of seqs I had after running the unique.seqs command to generate the starting fasta file. Regardless of the distance matrix chosen, the number of unique seqs/OTUs in the rabund files was the correct number, 3,624…

Thanks again for the rest of the answers!