When to subsample? - correspondence of shared and taxonomy files


I recently had a discussion about when to subsample to the same number of reads per sample.

I normally run the commands in the following order:

dist.seqs(fasta=cz.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.20)

cluster(column=cz.good.unique.good.filter.unique.precluster.pick.pick.dist, count=cz.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table)



classify.otu(list=cz.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.subsample.list, count=cz.good.unique.good.filter.unique.precluster.uchime.pick.pick.subsample.count_table, taxonomy=cz.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.subsample.taxonomy)

Is that wrong? Should I subsample before making the shared file instead?
Thanks, C.

I think you should run sub.sample after making the shared file, and only for certain things (like indicator species). Alpha and beta diversity in mothur should be run on the whole shared matrix, because mothur will subsample repeatedly to calculate those. Repeated subsampling gives you a better idea of your data than a single subsampling.

Thanks, kmitchell, that’s what I did so far.

However, the question I had was the following:
I would like to plot the bacterial taxonomy. For that I planned to merge the shared file with the cons.taxonomy file and then plot the relative abundance of each taxon. If there is a much easier way to do it, then forget my question and please tell me how. :)
The problem with my strategy was that the OTUs in my subsampled shared file don't correspond to the OTUs in the cons.taxonomy file, so I was looking for a way either to subsample much earlier or to subsample the cons.taxonomy file as well.

Thanks, C.


I just found a user having the same question: http://www.mothur.org/forum/viewtopic.php?f=3&t=2006&start=10

The problem he described is the same one I am facing: after following the command procedure suggested by P. Schloss, my subsample.cons.taxonomy file contains different OTUs than my subsample.shared file. The number of OTUs differs slightly and, more importantly, the OTUs themselves are different. For instance, my shared file has OTU 00030, but it doesn't appear in the taxonomy file, and vice versa.


kmitchell is correct: for things like metastats, lefse, and classify.rf you want to run sub.sample first. If you want to use alpha and beta diversity metrics, you set the subsample value within the command and it will rarefy the data for you.

As for the cons.taxonomy file, the consensus taxonomy should not depend on the subsampling, so I would use the original list, names/count, and taxonomy files.
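
Once the shared file and the cons.taxonomy file come from the same list file, the OTU labels match and the merge for plotting is a simple join on the OTU label. Below is a minimal Python sketch of that join, done outside mothur. It assumes the usual tab-delimited layouts (shared: label, Group, numOtus, then one column per OTU; cons.taxonomy: OTU, Size, Taxonomy) and a shared file with a single label such as 0.03. The function names and the tiny example tables are made up for illustration, not taken from this thread.

```python
import csv
import io
import re
from collections import defaultdict


def read_shared(handle):
    """Return {sample: {otu_label: count}} from a mothur-style shared file.

    Assumes one distance label in the file; with several labels the later
    rows for a sample would overwrite the earlier ones.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    otu_labels = header[3:]  # first three columns: label, Group, numOtus
    samples = {}
    for row in reader:
        if not row:
            continue
        counts = [int(x) for x in row[3:3 + len(otu_labels)]]
        samples[row[1]] = dict(zip(otu_labels, counts))
    return samples


def read_cons_taxonomy(handle, level=1):
    """Return {otu_label: taxon name at the given 0-based rank}."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip header: OTU, Size, Taxonomy
    taxa = {}
    for row in reader:
        if len(row) < 3:
            continue
        # Drop the confidence scores, e.g. "Firmicutes(100)" -> "Firmicutes"
        ranks = [re.sub(r"\(\d+\)$", "", r) for r in row[2].split(";") if r]
        taxa[row[0]] = ranks[level] if level < len(ranks) else "unclassified"
    return taxa


def relative_abundance(samples, taxa):
    """Sum OTU counts per taxon and divide by the sample total."""
    missing = sorted({otu for c in samples.values() for otu in c} - set(taxa))
    if missing:
        # This is the mismatch described in the thread: labels that are in
        # the shared file but not in the taxonomy file.
        raise ValueError("OTUs without taxonomy: " + ", ".join(missing))
    result = {}
    for sample, counts in samples.items():
        total = sum(counts.values())
        if total == 0:
            continue  # nothing to normalise for an empty sample
        per_taxon = defaultdict(int)
        for otu, n in counts.items():
            per_taxon[taxa[otu]] += n
        result[sample] = {t: n / total for t, n in per_taxon.items()}
    return result


# Hypothetical example data, not from the thread.
SHARED_TEXT = (
    "label\tGroup\tnumOtus\tOtu001\tOtu002\tOtu003\n"
    "0.03\tA\t3\t6\t3\t1\n"
    "0.03\tB\t3\t0\t5\t5\n"
)
TAXONOMY_TEXT = (
    "OTU\tSize\tTaxonomy\n"
    "Otu001\t6\tBacteria(100);Firmicutes(100);\n"
    "Otu002\t8\tBacteria(100);Proteobacteria(98);\n"
    "Otu003\t6\tBacteria(100);Firmicutes(95);\n"
)

samples = read_shared(io.StringIO(SHARED_TEXT))
taxa = read_cons_taxonomy(io.StringIO(TAXONOMY_TEXT), level=1)
rel = relative_abundance(samples, taxa)
```

The error for missing labels is deliberate: if the two files were built from different list files, the join fails with the offending OTU labels listed instead of silently dropping them.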


So, to go further with the alpha and beta diversity analysis, is it not necessary to subsample the shared file to normalize the number of sequences in each sample? Won't the diversity estimators then be based on a different "sampling effort" for each sample? And likewise for beta diversity, won't the analysis be based on a different number of sequences per sample?
Sorry, but at this point I guess I was wrong: I thought all analyses should be based on the same "sampling effort" for all samples. I'd very much appreciate some explanation about this, please! :)
Thanks a lot,

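mothur will subsample repeatedly when calculating alpha and beta indices! It's one of the things that I love about mothur. With the subsample value set in the command, each round draws the same number of reads from every sample, so the sampling effort is still equal. So, don't subsample the shared file you feed into those commands.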
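
To make the idea concrete, here is a small Python sketch of repeated subsampling for one alpha metric (observed richness). It only illustrates the principle and is not mothur's implementation. The function names, the iteration count, and the example counts are assumptions chosen for the demo.

```python
import random
from statistics import mean


def subsample_counts(counts, depth, rng):
    """Draw `depth` reads without replacement from a vector of OTU counts."""
    # Expand the counts into one entry per read, labelled by OTU index.
    pool = [otu for otu, n in enumerate(counts) for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    out = [0] * len(counts)
    for otu in rng.sample(pool, depth):
        out[otu] += 1
    return out


def mean_rarefied_richness(counts, depth, iters=1000, seed=0):
    """Average the number of observed OTUs over `iters` random subsamples.

    iters=1000 is an arbitrary choice for this sketch.
    """
    rng = random.Random(seed)
    return mean(
        sum(1 for n in subsample_counts(counts, depth, rng) if n > 0)
        for _ in range(iters)
    )


# Hypothetical samples with very different read totals (1000 vs 100 reads).
deep_sample = [500, 300, 150, 40, 10]
shallow_sample = [50, 30, 15, 5]

# Both are compared at the same depth, so the sampling effort is equal.
depth = 100
deep_richness = mean_rarefied_richness(deep_sample, depth)
shallow_richness = mean_rarefied_richness(shallow_sample, depth)
```

A single subsample of the deep sample may or may not catch its rarest OTU. Averaging over many draws smooths out that luck, which is why one sub.sample run is a weaker basis for diversity indices than repeated subsampling.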