Why 79542 classified OTUs, but 5249 OTUs have higher than 20 hits

ch3coch3 · July 15, 2016, 2:33pm

Hi,

We see this quite often in our results. After running
classify.otu(list=xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=xx.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v123.wang.pick.taxonomy, label=0.03)

we get, in this case, 79542 classified OTUs, but only 5249 of them have more than 20 hits (I picked the 20-hit cutoff arbitrarily to make a point). We thought it was a data quality problem, but we have much better data now and it still happens. Why would that be, how would this large number of low-hit OTUs affect our results, and how (if possible) can we avoid it? Can we trust the low-hit results, or should we cut off OTUs with fewer than a certain number of hits?

We follow the mothur MiSeq SOP with no parameter changes. We run
classify.seqs(fasta=xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=xx.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.nr_v123.align, taxonomy=silva.nr_v123.tax, cutoff=80)

One way I can think of is to run sub.sample, using the size of the smallest sample, to make a subsampled shared file. Would that be the way to go?

Thanks much!

pschloss · July 18, 2016, 12:11pm

What do you mean by “hits” ?

ch3coch3 · July 19, 2016, 3:44pm

Hi Dr. Pschloss,

Sorry for the confusion. By “hits” I meant the size of an OTU. For instance, when I open an xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy file, otu1 has size 44924, meaning otu1 was observed 44924 times across all samples, whereas otu5998 has size 16, and otu22934 through otu79515 all have size 1.

So does the large number of OTUs observed only once in the data indicate poor data quality or alignment, or is this how it should be?
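To put numbers on the size distribution described above, here is a minimal Python sketch that tallies OTU sizes from a cons.taxonomy-style table. It assumes the file is tab-separated with a header row of OTU, Size, and Taxonomy; the function name `otu_size_summary`, the 20-sequence threshold, and the toy rows are illustrative only and are not taken from the actual data.

```python
import csv
import io


def otu_size_summary(handle, threshold=20):
    """Summarize OTU sizes from a cons.taxonomy-style table.

    Assumption: tab-separated columns OTU, Size, Taxonomy with a header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    sizes = [int(row["Size"]) for row in reader]
    return {
        "total_otus": len(sizes),
        "singletons": sum(1 for s in sizes if s == 1),
        "above_threshold": sum(1 for s in sizes if s > threshold),
        "total_sequences": sum(sizes),
    }


# Toy table for illustration only (not the real file).
example = io.StringIO(
    "OTU\tSize\tTaxonomy\n"
    "Otu1\t44924\tBacteria(100);\n"
    "Otu2\t16\tBacteria(100);\n"
    "Otu3\t1\tBacteria(100);\n"
    "Otu4\t1\tBacteria(98);\n"
)
summary = otu_size_summary(example)
print(summary)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` object. Comparing `singletons` with `total_sequences` shows how little of the total read count the size-1 OTUs account for, even when they make up most of the OTU list.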

Thanks much

pschloss · July 21, 2016, 11:42am

Hi,

I see - I wouldn’t do anything to throw away data like you propose since it will disproportionately affect the larger samples and it will likely throw out good OTUs with bad.

Pat

ch3coch3 · July 21, 2016, 3:02pm

Got it. So you wouldn’t recommend running sub.sample on the shared file and then make.biom(shared=xx.0.03.subsample.shared, constaxonomy=xx.0.03.cons.taxonomy) to analyze and present rarefied data? I guess I should stick with the xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.shared and xx.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy files to make the biom file for analysis?

Also, how can we verify whether the large proportion of size=1 OTUs is how it should be, or whether it comes from mistakes in sequencing, alignment, or classification?

Thanks much

pschloss · July 25, 2016, 7:19pm

I would definitely subsample and rarefy your data to make sure that every sample has been equally sampled. I would not remove singletons.
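As a rough illustration of what subsampling to a common depth means, the Python sketch below draws sequences without replacement from each sample until every sample matches the smallest one. It is not mothur's implementation (sub.sample does the real work in mothur); the names `subsample_counts` and `rarefy_to_smallest`, the fixed seed, and the toy counts are all assumptions made for the example. Note that nothing is filtered by OTU size: singletons stay in the pool and are kept or dropped only by chance.

```python
import random


def subsample_counts(counts, size, rng):
    """Draw `size` sequences without replacement from one sample's OTU counts."""
    # Expand counts into one label per sequence, e.g. {"Otu1": 2} -> ["Otu1", "Otu1"].
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    picked = rng.sample(pool, size)
    result = {}
    for otu in picked:
        result[otu] = result.get(otu, 0) + 1
    return result


def rarefy_to_smallest(shared, seed=0):
    """Subsample every sample down to the depth of the smallest sample."""
    rng = random.Random(seed)
    depth = min(sum(c.values()) for c in shared.values())
    return {name: subsample_counts(c, depth, rng) for name, c in shared.items()}


# Toy per-sample OTU counts (illustrative values only).
shared = {
    "sampleA": {"Otu1": 50, "Otu2": 30, "Otu3": 1},
    "sampleB": {"Otu1": 10, "Otu2": 5, "Otu4": 1},
}
rarefied = rarefy_to_smallest(shared)
for name, counts in rarefied.items():
    print(name, sum(counts.values()), counts)
```

Here sampleB is already the smallest sample, so it comes through unchanged, while sampleA is reduced from 81 sequences to 16. A single draw like this depends on the random seed, which is one reason to average over many subsamplings when rarefying.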
