I’m a bit puzzled about how the sub.sample command works. Not so much about the normalization itself, but about how to refer to these normalised data downstream. In the SOP, for example, I first see a subsampling:
mothur > sub.sample(shared=final.an.shared, size=4601)
followed by otu classification:
mothur > classify.otu(list=final.an.list, name=final.names, taxonomy=final.taxonomy, label=0.03, cutoff=80)
But without referring to the subsample.shared file that the sub.sample command creates.
Does this mean that Mothur automatically uses the normalised data in downstream commands? If so, does it do so for all downstream commands, or only for the one right after sub.sample? Or only when using the sub.sample command on a shared file? Should you repeat the normalisation if you wanted, for example, to compare two classifications (different cutoffs)? So first sub.sample, then classify.otu, then sub.sample again and classify.otu with the other cutoff? This doesn’t seem right.
So, how far does Mothur ‘remember’ to use normalised data? I would guess until another shared file is made? Can you do the same with a regular subsampled fasta/names/groups file?
Or is the original shared file normalized? If so, what is then the purpose of the subsample.shared file?
Maybe it’s not the sub.sample command that I’m not quite understanding, but the way shared files are being used.
Just to make sure I’m indeed using normalised datasets when I advocate for it…
Thanks in advance,
So mothur doesn’t remember much. In the sub.sample command we are making a subsampled shared file. In the second command we are classifying OTUs based on the sequences within them. Since the list file wasn’t based on subsampling then those data weren’t subsampled.
Also, I would strongly encourage people to differentiate between normalization and sub-sampling. In mothur-speak, normalization means taking the relative abundance of every OTU, multiplying by a common number, and then rounding everything to an integer. Sub-sampling means randomly drawing the same number of sequences from each sample. I’m personally a bigger fan of sub-sampling/rarefaction than of normalization because of how normalization treats the rarer populations, and because sub-sampling guarantees the same number of sequences in each sample.
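To make the distinction concrete, here is a toy Python sketch of the two ideas (this is not mothur’s implementation; the sample names and counts are made up for illustration):

```python
import random

# Made-up OTU counts for two samples with unequal sequencing depth.
counts = {
    "sampleA": {"Otu001": 60, "Otu002": 30, "Otu003": 10},    # 100 sequences
    "sampleB": {"Otu001": 300, "Otu002": 150, "Otu003": 50},  # 500 sequences
}

def normalize(sample_counts, common_total=100):
    # Normalization (mothur-speak): relative abundance of every OTU,
    # multiplied by a common number, rounded to an integer.
    total = sum(sample_counts.values())
    return {otu: round(n / total * common_total) for otu, n in sample_counts.items()}

def subsample(sample_counts, size=100, seed=1):
    # Sub-sampling/rarefaction: randomly draw `size` sequences from the
    # sample, without replacement, and tally the OTUs of the drawn sequences.
    pool = [otu for otu, n in sample_counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(pool, size)
    return {otu: drawn.count(otu) for otu in sample_counts}

print(normalize(counts["sampleB"]))  # deterministic
print(subsample(counts["sampleB"]))  # random draw, but always sums to `size`
```

Note that normalization is deterministic but can distort rare OTUs through rounding, while sub-sampling is random but guarantees every sample ends up with exactly the same number of sequences.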
OK, I should indeed distinguish between subsampling and normalizing, you’re totally right. I’m also more inclined to go for subsampling than for the real normalization.
So, I should indeed refer to the subsampled datasets each time, that makes sense.
Then my question remains about the SOP pipeline.
In phylotyping, tree making, diversity analysis, etc., you indeed refer to the .subsample. files each time, but at OTU classification, as I pointed out above, you don’t. Still, you do perform a subsampling right before the OTU classification. So what’s the rationale behind this specific subsampling step? I guess it is useful for downstream applications (like the alpha diversity and such). But still, why not do the OTU classification on the subsampled dataset?
Sorry if this is a confusing question, but, well, I am confused…
So that sub-sampling is there to generate a shared file for all of the downstream analyses. I didn’t use subsampled data for the OTU classification because I wanted to have as many sequences as possible in each OTU to base the classification on. The sub-sampling isn’t because we don’t trust the data, but because we have different numbers of sequences in each sample.
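So, concretely, the SOP ordering looks something like this (a sketch based on the commands quoted above; the subsampled file name follows mothur’s naming convention and the dist.shared step is just one example of a downstream analysis, so your exact file names may differ):

mothur > sub.sample(shared=final.an.shared, size=4601)
mothur > classify.otu(list=final.an.list, name=final.names, taxonomy=final.taxonomy, label=0.03, cutoff=80)
mothur > dist.shared(shared=final.an.0.03.subsample.shared, calc=thetayc)

The point is that classify.otu reads the full list/names/taxonomy files, so it sees every sequence, while the later shared-file-based analyses read the .subsample. file and so see equal depth per sample.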
Ahaaa, so for classification the number of sequences should logically be maximal, but for other analyses like tree building and diversity analysis, the OTU data can be standardized by subsampling to minimize bias from unequal sequencing depth.
Yesssss, I get it now! It perfectly makes sense.
THANKS for your patience
I am probably missing the obvious, but when I use sub-sampled data for analysis, I only have ~9000 OTUs, and they are listed (for instance in corr.axes command or the final.an.0.03.subsample.shared) as “OTU1, OTU2” etc. But when I go back to final.an.0.03.cons.taxonomy I have OTUs listed up to 15800, so obviously not the same ones. How do I figure out the taxonomy for my OTUs after they have been subsampled?
If you subsample a shared file, the column headings in the new shared file will correspond to the OTU numbers in the cons.taxonomy file. Is this what you mean?
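In other words, the link between the two files is the OTU label itself: sub.sample may drop columns, but it doesn’t renumber them. A small Python sketch of that lookup (the miniature file contents below are made up; real shared and cons.taxonomy files are tab-delimited in the same shapes):

```python
import csv

# Made-up miniature versions of the two file formats (tab-delimited).
shared_lines = [
    "label\tGroup\tnumOtus\tOtu0001\tOtu0007\tOtu0015",
    "0.03\tsampleA\t3\t12\t0\t5",
]
taxonomy_lines = [
    "Otu0001\t120\tBacteroidetes;",
    "Otu0007\t45\tFirmicutes;",
    "Otu0015\t9\tProteobacteria;",
]

# OTU labels are the shared-file column headings from the fourth column on.
# After sub.sample, only surviving OTUs remain as columns, but they keep their
# original labels, which match the first column of the cons.taxonomy file.
otus_in_shared = shared_lines[0].split("\t")[3:]

taxonomy = {row[0]: row[2] for row in csv.reader(taxonomy_lines, delimiter="\t")}

for otu in otus_in_shared:
    print(otu, taxonomy[otu])
```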
Yes, I think so? The problem is that every time I subsample and do an analysis, the OTUs in the output seem to get re-numbered 1 to n, and n is less than the number of OTUs in cons.taxonomy, so they clearly don’t correspond. I assume there is some simple way to use the subsample.shared file to link the OTUs from the analysis (for instance, the distance matrix->nmds->corr.axes, where I want to know which OTUs correspond to the axes, and then use cons.taxonomy to find their identity) to the larger number of OTUs in the cons.taxonomy. I’ve been following the Schloss SOP fairly faithfully, skipping the phylotype() because it is buggy, but I don’t see anything in the SOP addressing this.
Just looked it over, and yes, the subsample.shared is totally OK; it is the output of corr.axes that gives “1,2,3,…n” in the OTU column, where n < total OTUs, and those OTUs do not correspond at all to subsample.shared.