Hello everyone,
I’m starting another post for a second problem/question, but I refer you to my previous post (Phylip vs. Column-based format changing downstream results), since the background is the same. Briefly, I have 10 samples of ~3300 16S sequences each in individual fasta files, which I was able to analyze separately quite easily with mothur using the following command flow:
dist.seqs(fasta=X.fas, output=lt, cutoff=0.05)
read.dist(phylip=X.phylip.dist)
cluster(cutoff=0.05)
summary.single(label=0.03-0.05)
collect.single(label=0.03-0.05)
rarefaction.single(label=0.03-0.05)
get.oturep(phylip=X.phylip.dist, fasta=X.fas, list=X.phylip.fn.list, label=0.03, sorted=size)
However, when I combine all the sequences into one fasta file of 33,071 seqs and then use the following command flow:
dist.seqs(fasta=ALL.fas, output=lt, cutoff=0.03, processors=4)
hcluster(phylip=ALL.phylip.dist, cutoff=0.03)
read.otu(list=ALL.phylip.fn.list, group=ALL.groups)
summary.shared(label=0.03)
…the read.otu command gives me the 10 fn.X.rabund files and the fn.shared file, whose total numbers of OTUs at the 0.03 level are no longer concordant with the original individual analyses, whereas I assume they should be the same (or at least very close, allowing for computational rounding/randomness).
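For reference, the per-sample “after” counts below are the number of OTUs with a nonzero abundance for each group in the shared file. Here is a minimal Python sketch of that tally (not a mothur command; it assumes the usual shared-file layout of a header line followed by rows of label, group, numOtus and then one count per OTU, and the file name is simply what read.otu produced here):

def otus_per_group(shared_path, label="0.03"):
    # Count OTUs with a nonzero abundance for each group (sample)
    counts = {}
    with open(shared_path) as fh:
        next(fh)                       # skip the header line
        for line in fh:
            fields = line.split()
            if fields[0] != label:     # keep only the rows for this label
                continue
            counts[fields[1]] = sum(1 for c in fields[3:] if int(c) > 0)
    return counts

print(otus_per_group("ALL.phylip.fn.shared"))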
Here’s what I mean:
Sample | Uniques (before / after) | 0.03 OTUs (before / after)
1      | 3624 / 3611              | 1000 / 1424
2      | 2465 / 2455              |  770 / 1100
3      | 3591 / 3591              | 1025 / 1398
4      | 2539 / 2538              |  871 / 1232
5      | 3518 / 3517              |  997 / 1434
6      | 2993 / 2993              |  689 / 1077
7      | 3623 / 3620              |  892 / 1282
8      | 2753 / 2747              |  541 /  868
9      | 3835 / 3833              |  861 / 1266
10     | 3445 / 3443              |  805 / 1217
As you can see, there is a minor effect on the uniques, but the impact on the OTUs is enormous. Could this much variability be caused simply by the differing number of gaps between the individual alignments and the larger, all-inclusive one? Obviously, the large alignment has more positions (1186 columns) than the 10 separate files (~800-1000), since real 16S variability and pyrosequencing errors are not distributed completely evenly: as you add more and more sequences, you discover indels and errors at more positions (and, as the table shows, the larger the alignment and the greater the number of gaps, the more OTUs).
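To put a number on “enormous”, here is a quick sanity check on the table (plain Python; the two lists are just the before/after OTU columns copied from above):

before = [1000, 770, 1025, 871, 997, 689, 892, 541, 861, 805]
after = [1424, 1100, 1398, 1232, 1434, 1077, 1282, 868, 1266, 1217]
for sample, (b, a) in enumerate(zip(before, after), start=1):
    # relative inflation of the 0.03 OTU count after combined clustering
    print(f"sample {sample}: {b} -> {a} OTUs (+{100 * (a - b) / b:.0f}%)")

Every sample inflates by roughly 36-60%, with sample 8 at the top ((868 - 541) / 541 ≈ 0.60).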
I had not thought that the OTU clustering would be that sensitive to increased gaps, since: 1) for a fragment size of ~450 bp, you need 13-14 errors (not gaps) to keep sequences A and B out of the same OTU at the 0.03 level (0.03 × 450 ≈ 13.5); and 2) regardless, adding a sequence C which differs quite a bit from A and B will simply introduce common gaps into A and B, so their calculated similarity/distance should stay the same whichever way you count gaps. Shouldn’t it?
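To illustrate point 2, here is a toy pairwise-distance calculation (plain Python, not mothur’s actual code; as I understand the dist.seqs calculators, nogaps ignores gap positions, eachgap counts every gap position, and the default onegap counts a run of gaps as a single difference):

def pairwise_dist(a, b, gaps="ignore_common"):
    # Fraction of differing columns between two aligned sequences.
    # gaps="ignore_common": columns where BOTH sequences have '-' are
    # skipped, so gap columns forced in by a third sequence leave d(a, b)
    # untouched. gaps="each": common-gap columns still count toward the
    # length, which can only shrink the distance, never grow it.
    diffs = length = 0
    for x, y in zip(a, b):
        if gaps == "ignore_common" and x == "-" and y == "-":
            continue
        length += 1
        if x != y:
            diffs += 1
    return diffs / length

a1, b1 = "ACGTACGTAC", "ACGTACGAAC"        # A and B aligned on their own
a2, b2 = "ACGT--ACGTAC", "ACGT--ACGAAC"    # same pair after a divergent C
                                           # forces two common gap columns
print(pairwise_dist(a1, b1))               # 1/10 = 0.10
print(pairwise_dist(a2, b2))               # still 1/10 = 0.10
print(pairwise_dist(a2, b2, gaps="each"))  # 1/12 ≈ 0.083

Either way, the common gaps never push A and B farther apart, which is why the OTU inflation puzzles me.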
I can accept minor changes in the numbers, but up to a 60% difference (sample #8) is a lot of variability, given the different ways people go about their 16S pyrosequencing analyses in different papers. I also assume (and could see in some of the results) that these differences affect the downstream commands summary.shared, tree.shared, venn, get.sharedseqs, etc. Is there a way to manage this problem? (Simply not counting gaps can’t work, since gaps are informative in the 16S.)