Why 79542 classified OTUs, but only 5249 OTUs with more than 20 hits?


This is quite common in our results. After running:
classify.otu(list= xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=xx.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v123.wang.pick.taxonomy, label=0.03)

In this case we get 79542 classified OTUs, but only 5249 OTUs have more than 20 hits (I picked the 20-hit threshold arbitrarily to make a point). We thought it was a data-quality issue, but we now have much better data and it still happens. Why would that be, how would this large number of low-hit OTUs affect our results, and how (if possible) can we avoid it? Can we trust the low-hit results, or should we cut off OTUs below some number of hits?

We follow the mothur MiSeq SOP with no parameter alterations. We run:
classify.seqs(fasta=xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=xx.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.nr_v123.align, taxonomy=silva.nr_v123.tax, cutoff=80)

One way I could think of is to run sub.sample and choose the smallest sample size to make a subsampled shared file. Would that be the way?
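To make the idea concrete, here is a minimal sketch of the first step, picking the subsampling depth: read the shared file, total the counts per sample, and take the minimum. The column layout (label, Group, numOtus, then one count column per OTU) is the standard mothur shared format; the sample names and counts below are invented for illustration, not from the dataset in question.

```python
# Hypothetical miniature .shared file contents (tab-separated).
shared_text = """\
label\tGroup\tnumOtus\tOtu001\tOtu002\tOtu003
0.03\tsampleA\t3\t10\t5\t1
0.03\tsampleB\t3\t200\t50\t0
0.03\tsampleC\t3\t30\t2\t4
"""

def smallest_sample_size(text):
    """Return (min depth, {group: total reads}) from shared-file text."""
    lines = text.strip().splitlines()
    sizes = {}
    for row in lines[1:]:                 # skip the header line
        fields = row.split("\t")
        group = fields[1]                 # sample name
        counts = map(int, fields[3:])     # per-OTU counts for this sample
        sizes[group] = sum(counts)
    return min(sizes.values()), sizes

depth, sizes = smallest_sample_size(shared_text)
print(sizes)   # {'sampleA': 16, 'sampleB': 250, 'sampleC': 36}
print(depth)   # 16 -> the depth sub.sample would rarefy every sample to
```

This is the same number mothur chooses by default when sub.sample is given a shared file without an explicit size= parameter.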

Thanks much!

What do you mean by “hits”?

Hi Dr. Pschloss,

Sorry for the confusion; by “hits” I meant the size of an OTU. For instance, when I open a xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy file, I see Otu1 with size 44924, meaning Otu1 was observed 44924 times across all the samples. By contrast, Otu5998 has size 16, and from Otu22934 to Otu79515 they all have size 1.
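A quick way to see this size distribution is to tally the Size column of the cons.taxonomy file. The sketch below assumes the standard classify.otu output layout (OTU, Size, Taxonomy, tab-separated); the four rows are made-up examples, not real output.

```python
from collections import Counter

# Hypothetical miniature .cons.taxonomy file contents.
cons_text = """\
OTU\tSize\tTaxonomy
Otu001\t44924\tBacteria(100);Firmicutes(100);
Otu002\t16\tBacteria(100);Bacteroidetes(97);
Otu003\t1\tBacteria(100);unclassified(100);
Otu004\t1\tBacteria(100);Proteobacteria(88);
"""

def size_histogram(text):
    """Map each OTU size to the number of OTUs with that size."""
    rows = text.strip().splitlines()[1:]   # skip the header line
    sizes = [int(line.split("\t")[1]) for line in rows]
    return Counter(sizes)

hist = size_histogram(cons_text)
print(hist[1])             # number of singleton OTUs -> 2
print(sum(hist.values()))  # total OTUs -> 4
```

Running this over the real file shows at a glance how many of the 79542 OTUs are singletons.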

So does the large number of OTUs observed only once in the data flag poor data quality or a bad alignment, or is this the way it should be?

Thanks much


I see - I wouldn’t throw away data like you propose, since it will disproportionately affect the larger samples and will likely throw out good OTUs along with the bad.


Got it. So you wouldn’t recommend running sub.sample on the shared file and then make.biom(shared=xx.0.03.subsample.shared, constaxonomy=xx.0.03.cons.taxonomy) to analyze and present rarefied data? I guess I should stick with the xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.shared and xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy files to make the biom file for analysis?

Also, how can we verify whether the large proportion of size=1 OTUs is expected, or an artifact of sequencing, alignment, or classification errors?

Thanks much

I would definitely subsample and rarefy your data to make sure that every sample has been equally sampled. I would not remove singletons.
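For anyone wondering what the subsampling step actually does: it draws reads without replacement from each sample until every sample is at the same depth. Here is a minimal sketch of that idea for a single sample; the OTU names, counts, and depth are made up, and mothur's sub.sample does the equivalent across all samples in the shared file.

```python
import random

def rarefy(counts, depth, seed=42):
    """Subsample a {otu: count} dict down to `depth` reads, no replacement."""
    # Expand counts into one entry per read, then draw `depth` of them.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    random.seed(seed)                      # fixed seed so the sketch is repeatable
    picked = random.sample(reads, depth)   # sampling without replacement
    out = {}
    for otu in picked:
        out[otu] = out.get(otu, 0) + 1
    return out

sample = {"Otu001": 200, "Otu002": 50, "Otu003": 1}
sub = rarefy(sample, depth=16)
print(sum(sub.values()))   # 16 -- every sample ends up at the same depth
```

Note how a singleton like Otu003 may or may not survive the draw; that is why rarefying handles uneven sampling without the bias of deleting singletons outright.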