OTU differences when analyzing separate fastas or one file

Hello everyone,
I’m starting another post for a second problem/question, but I refer you to my previous post (Phylip vs. Column-based format changing downstream results) since the background is the same. Briefly, I have 10 samples of ~3300 16S sequences each in individual FASTA files, which I was able to analyze separately quite easily with mothur using the following command flow:

dist.seqs(fasta=X.fas, output=lt, cutoff=0.05)
read.dist(phylip=X.phylip.dist)
cluster(cutoff=0.05)
summary.single(label=0.03-0.05)
collect.single(label=0.03-0.05)
rarefaction.single(label=0.03-0.05)
get.oturep(phylip=X.phylip.dist, fasta=X.fas, list=X.phylip.fn.list, label=0.03, sorted=size)

However, when I combine all the sequences into one fasta file of 33,071 seqs and then use the following command flow:

dist.seqs(fasta=ALL.fas, cutoff=0.03, processors=4)
hcluster(phylip=ALL.phylip.dist, cutoff=0.03)
read.otu(list=ALL.phylip.fn.list, group=ALL.groups)
summary.shared(label=0.03)

…the output of the read.otu command gives me the 10 fn.X.rabund files and the fn.shared file, whose total OTU counts at the 0.03 level no longer agree with the original individual analyses, whereas I assume they should be the same (or only slightly different due to computational rounding/randomness).

Here’s what I mean:

Sample - uniques before / after - 0.03 OTUs before / after
1 - 3624 / 3611 - 1000 / 1424
2 - 2465 / 2455 - 770 / 1100
3 - 3591 / 3591 - 1025 / 1398
4 - 2539 / 2538 - 871 / 1232
5 - 3518 / 3517 - 997 / 1434
6 - 2993 / 2993 - 689 / 1077
7 - 3623 / 3620 - 892 / 1282
8 - 2753 / 2747 - 541 / 868
9 - 3835 / 3833 - 861 / 1266
10 - 3445 / 3443 - 805 / 1217

As you can see, the effect on the uniques is minor, but the impact on the OTUs is enormous. Could this much variability be caused simply by the differing number of gaps between the individual alignments and the larger, all-inclusive one? The large alignment obviously has more “positions” (1186 columns) than the 10 separate files (~800-1000), since real 16S variability and pyrosequencing errors are not distributed completely evenly, and as you add more and more sequences you discover indels and errors at more positions (we can see above that the larger the alignment/greater the number of gaps, the more OTUs).

I had not thought that the OTU clustering would be that sensitive to added gaps, since: 1) for a fragment of ~450 bp, sequences A and B need 13-14 differences (450 × 0.03 ≈ 13.5), errors rather than gaps, before they fail to fall into the same OTU at the 0.03 level; and 2) regardless, adding a sequence C that differs quite a bit from A and B will simply introduce common gaps into A and B, so their calculated similarity/distance should stay the same whichever way you count gaps. Shouldn’t it?
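To make point 2 concrete, here is a toy pairwise-distance calculation (just an illustration, not mothur’s actual dist.seqs code, which by default also collapses a run of gaps into a single difference): columns where both sequences have a gap carry no information, so inserting a common gap column leaves the distance unchanged.

```python
def pairwise_distance(a, b, count_gaps=True):
    """Toy distance over two aligned sequences: columns where BOTH
    have a gap are ignored; remaining columns are compared.  With
    count_gaps=False, columns where either sequence has a gap are
    skipped as well."""
    diffs = compared = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue                      # common gap: no information
        if (x == "-" or y == "-") and not count_gaps:
            continue                      # skip one-sided gaps too
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared

# A and B aligned on their own: 2 mismatches over 10 columns -> 0.2
print(pairwise_distance("ACGTACGTAC", "ACGAACGTAA"))    # 0.2
# Re-aligned with a divergent C, which pushes a common gap column
# into both A and B; the distance is unchanged either way.
print(pairwise_distance("ACG-TACGTAC", "ACG-AACGTAA"))  # 0.2
```

The sensitivity comes in when the new alignment opens one-sided gaps (a gap in A opposite a base in B), because those columns do change the count depending on how gaps are scored.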

I can accept minor changes in the numbers, but up to a 60% difference (sample #8) is a lot of variability, given the different ways people go about their 16S pyrosequencing analyses in different papers. I also assume (and could see in some of the results) that these differences were affecting the downstream commands summary.shared, tree.shared, venn, get.sharedseqs, etc. Is there a way to manage this problem? (Avoiding counting gaps won’t work, since gaps are informative in the 16S.)

Yep, this is the biggest downfall of OTU-based analysis and unsupervised methods in general - the OTUs are only meaningful within a single analysis (whereas the phylotypes aren’t meaningful at all :slight_smile: ). We encourage people to put all of their sequences together and then process them as a batch. Once you have your mega list file you can parse that into your 10 groups and do all of the alpha diversity analyses on those groups. You might try average neighbor as that might be less sensitive to these types of things.
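For example, once you have the mega list/shared file, tallying the observed OTUs in each group is just bookkeeping. A quick Python sketch, assuming the standard tab-delimited shared layout (label, Group, numOtus, then one count per OTU); the file and group names are illustrative:

```python
def otus_per_group(lines):
    """Count OTUs with nonzero abundance per group in the lines of a
    mothur .shared file (tab-delimited: label, Group, numOtus, then
    one count per OTU).  Sketch only -- assumes the standard layout."""
    counts = {}
    for line in list(lines)[1:]:          # skip the OTU-name header row
        fields = line.rstrip("\n").split("\t")
        label, group = fields[0], fields[1]
        abunds = [int(x) for x in fields[3:]]
        counts[(label, group)] = sum(1 for n in abunds if n > 0)
    return counts

demo = [
    "label\tGroup\tnumOtus\tOtu1\tOtu2\tOtu3",
    "0.03\tA\t3\t5\t0\t2",
    "0.03\tB\t3\t0\t1\t0",
]
print(otus_per_group(demo))   # {('0.03', 'A'): 2, ('0.03', 'B'): 1}
```

In practice you would pass `open("ALL.fn.shared")` instead of the demo list; mothur’s own group-aware commands do the equivalent internally.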

Hope this helps,
pat

We encourage people to put all of their sequences together and then process them as a batch. Once you have your mega list file you can parse that into your 10 groups and do all of the alpha diversity analyses on those groups.

So if I read you correctly, you would go with the mega-alignment fasta file and forget about individual files? By “parse that into your 10 groups” you simply mean always using the large alignment and simply selecting the individual groups using the group=X-Y-Z option on the different commands to pull out the group-specific files/data?

If this is the case, then the alpha-diversity measurements (and beta by extension) will always wind up being based on the “inflated” OTU numbers… I guess this makes comparing two different studies’ Shannon indices, for example, meaningless. The calculations become purely comparative values for intra-study comparison, never to be used as absolute values for inter-study comparison. It seems to me there are people/papers out there making statements such as “X found less diversity than we did somewhere” or “Y’s Shannon index shows higher diversity in summer than we found in winter”, and all of that is suspect since they never re-aligned (nor re-analyzed) everything together…

A bit of a bummer that this doesn’t seem to do us much good for the standardisation of science/results :cry: , which I had originally thought was the point behind creating these metrics/indices - but perhaps it worked back in the day when they were created, before sequencing, when people were counting “integer” numbers of macro-organisms…

Much the wiser, I’ll give the average neighbor a try and see what the variability looks like.

On a side-note: too bad about the timing of the proposed mothur workshop in Detroit, many of us 16/18S pyro people (the ICoMM people as well - we’re having a sub-meeting) are going to be at ISME in Seattle that week…

A side-question related to the above:

The command get.oturep is the only one that will not work on the concatenated large alignment mentioned above, since you cannot specify a group=X option, right? Am I also correct in reading the manual that get.sharedseqs(list=X.fn.list, group=X.groups, label=0.03, shared=A, fasta=X.fas) would replace this command and output representative sequences (one per OTU) for group A, or would it instead output all of the sequences from group A?

If the latter, what do you replace get.oturep with when the alignment has multiple groups and you only want OTUs from a specific group?
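To make the question concrete, here is a naive sketch of the bookkeeping I’m after - this just takes the first member of each OTU that contains a sequence from the target group, not mothur’s actual get.oturep, which picks the member minimizing the distance to the rest of the OTU. The list-line and group-map formats are my reading of the mothur file layouts:

```python
def group_otu_reps(list_line, group_map, target_group):
    """From one line of a mothur .list file (label, numOtus, then one
    tab-separated field of comma-separated sequence names per OTU),
    return one representative per OTU that contains at least one
    sequence from target_group.  Naive sketch: takes the first member,
    whereas get.oturep chooses by minimum distance."""
    fields = list_line.rstrip("\n").split("\t")
    reps = []
    for otu in fields[2:]:
        members = otu.split(",")
        in_group = [s for s in members if group_map.get(s) == target_group]
        if in_group:
            reps.append(in_group[0])      # naive choice of representative
    return reps

groups = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
line = "0.03\t3\ts1,s3\ts2\ts4"
print(group_otu_reps(line, groups, "A"))   # ['s1', 's2']
```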

Thanks again!

So if I read you correctly, you would go with the mega-alignment fasta file and forget about individual files? By “parse that into your 10 groups” you simply mean always using the large alignment and simply selecting the individual groups using the group=X-Y-Z option on the different commands to pull out the group-specific files/data?

Yes. By the way, if you align one library at a time, the RDP aligner will give you fluctuating alignment lengths. Also, in light of your desire for standardization, you might want to note that the RDP aligner does a pretty poor job of aligning the variable regions - all those lowercase letters in the alignment are not aligned. You can see my recent PLoS ONE paper for more discussion along these lines.

If this is the case, then the alpha-diversity measurements (and beta by extension) will always wind up being based on the “inflated” OTU numbers… I guess this makes comparing two different studies’ Shannon indices, for example, meaningless. The calculations become purely comparative values for intra-study comparison, never to be used as absolute values for inter-study comparison. It seems to me there are people/papers out there making statements such as “X found less diversity than we did somewhere” or “Y’s Shannon index shows higher diversity in summer than we found in winter”, and all of that is suspect since they never re-aligned (nor re-analyzed) everything together…

I think the numbers would be more robust, not “inflated”. Some would argue that Shannon is meaningless regardless :). The whole concept of performing meta-analysis based on reading papers is somewhat problematic anyway when you consider all the caveats involved. I think that as you get to larger and larger sequence collections this problem is minimized.

A bit of a bummer that this doesn’t seem to do us much good for the standardisation of science/results, which I had originally thought was the point behind creating these metrics/indices - but perhaps it worked back in the day when they were created, before sequencing, when people were counting “integer” numbers of macro-organisms…

Standardization - what’s that? The analysis is standardized; we know how it works, what it does, etc. The standardization problem comes in with how we set up our PCRs, what primers are used, how DNA is extracted, how people QC their data, how they set up a taxonomy outline. All of this is subjective. The OTU algorithms are pretty objective and standardized - data in, OTUs out.

On a side-note: too bad about the timing of the proposed mothur workshop in Detroit, many of us 16/18S pyro people (the ICoMM people as well - we’re having a sub-meeting) are going to be at ISME in Seattle that week…

Yeah, I didn’t know about that. If people are interested in doing it a week or two earlier, we could probably do that instead.

Yes. By the way, if you align one library at a time, the RDP aligner will give you fluctuating alignment lengths. Also, in light of your desire for standardization, you might want to note that the RDP aligner does a pretty poor job of aligning the variable regions - all those lowercase letters in the alignment are not aligned. You can see my recent PLoS ONE paper for more discussion along these lines.

I had a read through the PLoS ONE paper and have now tried my 33,000 seqs with align.seqs against both the GreenGenes and SILVA template alignments - I’m much happier with it than with the RDP alignment, and in my case of 16S V6+7+8, the SILVA template performed best with 9-mers.

Thanks for the heads-up on that; now off to a new posting to get my next question/problem resolved over get.oturep with multiple samples…