Hello everyone,
I am a new user of mothur, attempting to analyze my first set of 16S/18S pyrosequencing data from the Canadian Arctic. My question concerns the effect of switching distance-file formats from Phylip to column-based and what that means for downstream analysis. I'll get to what I mean in a moment; first, the background. I have 10 samples of ~3,300 16S sequences each, which I was able to analyze individually quite easily, all the way through mothur, on an average-power PC workstation using roughly the following set of commands (in this example, sample #1 for bacteria = 1B.fas). Everything had been pre-processed with 1) the Roche software to remove tags, 2) BioEdit to filter sequences against my chosen error cutoffs, and 3) the RDP aligner, whose output I manually corrected, to generate the input fastas for mothur:
dist.seqs(fasta=1B.fas, output=lt, cutoff=0.05)
read.dist(phylip=1B.phylip.dist)
cluster(cutoff=0.05)
summary.single(label=0.03-0.05)
collect.single(label=0.03-0.05)
rarefaction.single(label=0.03-0.05)
get.oturep(phylip=1B.phylip.dist, fasta=1B.fas, list=1B.phylip.fn.list, label=0.03, sorted=size)
Everything is perfect up to this point. However, when I merge the 10 samples into ~33,000 sequences, things start to heat up.
I can run the dist.seqs command fairly easily (~50 min.) on our server (4 processors, unix/linux-compiled version of mothur), which generates a Phylip distance matrix of ~4 GB. However, I then cannot load this into memory (with read.dist) in order to cluster. I am now running the hcluster command instead, which looks as though it may work (it's running on our server as I type). As a side question, does anyone know why the multiple-processor option was not implemented for the cluster command? In my experience analyzing the 10 samples individually, clustering takes roughly 10x as long as dist.seqs does on a single processor, so that is where a big gain in time could be made. I assume it is a programming hurdle…
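For reference, what I'm running on the merged dataset is essentially the following (just a sketch; "merged.fas" is a placeholder name and the hcluster syntax is as I understand it from the wiki):

dist.seqs(fasta=merged.fas, output=lt, cutoff=0.05, processors=4)
hcluster(phylip=merged.phylip.dist, cutoff=0.05)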
I have seen that switching to the column-formatted matrix can be a good alternative for large datasets, significantly reducing the size of the resulting file, as the Sogin example remarks after running a column-based distance calculation for ~22,000 sequences:
“The resulting file, sogin.unique.filter.dist, will be 14.1 MB in size. Contrast this to the size of a phylip-formatted distance for the 21,908 sequences of about 6 GB…”
So I went ahead and tried switching to the column format on my ~3,700-sequence sample #1, which I re-analyzed to compare against the original Phylip-based run. I had not used unique.seqs before, but did this time to generate the names file needed to run dist.seqs in column format, and this is where the "problem" starts. After running the same analyses, I wind up with approximately, but not exactly, the same number of OTUs (1,000 vs. 997 in the second run) and roughly the same values for the diversity indices (the latter being based on the former, so it is not surprising they shift a bit when the number of OTUs changes). Note that the sample is rather diverse and not many sequences were identical, so few were culled by unique.seqs to begin with (I went from 3,690 total to 3,624 unique). What I don't understand is why the number of OTUs is not exactly the same, since it should be (computational rounding, perhaps?).
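For reference, the column-based re-analysis was essentially the following (a sketch; parameter names are as I recall them from the wiki, and 1B.unique.fas/1B.names are the files unique.seqs writes out):

unique.seqs(fasta=1B.fas)
dist.seqs(fasta=1B.unique.fas, output=column, cutoff=0.05)
read.dist(column=1B.unique.dist, name=1B.names)
cluster(cutoff=0.05)
summary.single(label=0.03-0.05)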
This winds up affecting the analyses/commands afterwards…for example, when I launch the following command:
get.oturep(column=1B.unique.dist, names=1B.names, fasta=1B.fas, list=1B.unique.fn.list, label=0.03, sorted=size)
…I get back the OTUs along with the number of sequences each one represents, which is "incorrect" compared to the original Phylip-based analysis:
Top 10 OTUs, #seqs per OTU (Phylip matrix without unique.seqs / column matrix with unique.seqs):
OTU1 - #seqs= 163 / 170
OTU2 - #seqs= 119 / 115
OTU3 - #seqs= 112 / 103
OTU4 - #seqs= 93 / 83
OTU5 - #seqs= 82 / 81
OTU6 - #seqs= 62 / 56
OTU7 - #seqs= 50 / 48
OTU8 - #seqs= 50 / 46
OTU9 - #seqs= 43 / 43
OTU10 - #seqs= 43 / 39
…which I imagine is a direct result of the change in the number of OTUs. At least the total number of sequences represented by all the OTUs works out to 3,690 for both, so mothur does "re-find" in the names file the few sequences culled during the unique.seqs step. What this throws off is the proportion of sequences belonging to the "abundant" OTUs (and the number of "abundant" OTUs), something we are quite interested in ecologically speaking, which does not come out the same in the two analyses. In the above example, I originally had 15 abundant OTUs (those holding >1% of the total sequences, i.e., 37 or more) representing 1,021 seqs, but only 13 such OTUs representing 940 seqs in the column-based re-analysis. My fear is that this discrepancy would then carry through to the venn, get.sharedseqs, heatmap, etc. commands…
Could this long posting/problem (excuse the length, but I wanted to be clear) simply be chalked up to the famous variability of algorithm-based analyses? I would have thought, though, that since the extra unique.seqs step simply removes exact replicates, it should not influence the subsequent dist.seqs/cluster commands that generate the OTUs (identical sequences should simply collapse into the same OTUs anyway, theoretically leaving the total number of OTUs unchanged)…or does the fn algorithm calculate "averages" of distances, so that including many identical sequences in a cluster/"proto-OTU" would shift the calculated distances to the "third parties" (the next sequences to be added to the growing OTU)?
Thanks to anyone who may have some illumination to shed!
PS: Another side question/curiosity: can anyone explain why switching from a Phylip matrix to a column-formatted one has such an impressive impact on file size, as reported above (several orders of magnitude)? I would assume that a large set of sequences with few replicates (say, only 10% removed by unique.seqs) would still have nearly the same amount of information to encode in bytes; you still have to write a number for each pairwise combination. For example, my 33,000 seqs in a Phylip lower-triangle format should require 33,000 x 33,000 / 2 = 544,500,000 numbers + 33,000 labels = 544,533,000 items to encode. If the column matrix covers 30,000 unique seqs, you still have 30,000 x 30,000 / 2 = 450,000,000 numbers plus labels to encode, and there are two labels for every number/line in the column format…you can't lose orders of magnitude of the pairwise data, so this isn't making any math/computational sense to me, but I must obviously be missing something about the column format…?
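Just to put rough numbers on it (assuming something like 8 bytes per distance entry once you count digits and separator; that byte figure is only my guess):

544,500,000 entries x ~8 bytes ≈ 4.4 GB

…which lines up with the ~4 GB Phylip file I actually got, so the raw entry count seems about right, and that's exactly why I can't see where the column format would find orders of magnitude of savings.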