Issues on sub.sampling


I’m having some difficulties in my analysis and I decided to call for help here in the forum.
Mothur makes my job much easier and I’ve already solved many doubts through this forum without posting, but this time I can’t find other way.
I’m checking some HiSeq V6 sequences from freshwater samples and I’m having problems with weird subsampling results
This is how I’ve been doing it on a mothur 1.33v:

make.contigs(ffastq=cce1.fq, rfastq=cce2.fq)
trim.seqs(fasta=current, oligos=V6.oligos, pdiffs=2)
trim.seqs(fasta=current, oligos=V6.oligos, pdiffs=2)
merge.files(input=cce1.trim.contigs.scrap.rc.trim.fasta-cce1.trim.contigs.trim.fasta, output=cce.fasta)

----------I’m doing this with all my 9 samples, than i merge them:, groups=cce-cch-ccm-cse-csh-gce-gch-gse-gsh)
merge.files(input=cce.fasta-cch.fasta-ccm.fasta-cse.fasta-csh.fasta-gce.fasta-gch.fasta-gse.fasta-gsh.fasta, output=all.fasta)

screen.seqs(fasta=all.fasta, minlength=56, maxlength=64, maxhomop=8, maxambig=0, summary=all.summary)
align.seqs(fasta=current, reference=V6_vamps.align, ksize=9)
screen.seqs(fasta=current, name=current, start=31189, end=33183)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, name=current)
get.seqs(group=all.groups, accnos=current)
chimera.uchime(fasta=current, name=current, dereplicate=t, group=all.pick.groups)
remove.seqs(accnos=current, fasta=current, name=current)
classify.seqs(fasta=current, name=current, reference=gg_13_5_99.fasta,, cutoff=70)
remove.lineage(fasta=current, name=current, taxonomy=current, taxon=unknown-Chloroplast-mitochondria)
get.seqs(accnos=current, group=all.groups)
cluster.split(fasta=current, name=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15)
make.shared(list=current, group=current, label=0.03)
classify.otu(list=current, group=current, name=current, taxonomy=current, persample=t, label=0.03)

mothur > count.groups()
Using as input file for the shared parameter.
cce contains 2325437.
cch contains 2247329.
ccm contains 2222988.
cse contains 682916.
csh contains 651511.
gce contains 2331413.
gch contains 2359810.
gse contains 654156.
gsh contains 520838.

Total seqs: 13996398._
As some samples have four times more sequences than others (came from different sequencing runs) I need to equalize the sizes to perform alpha and beta analysis:

_sub.sample(list=current, group=current, label=0.03, persample=t)
make.shared(list=current, group=current)

mothur > count.groups(group=all.pick.subsample.groups)
cce contains 520838.
cch contains 520838.
ccm contains 520838.
cse contains 520838.
csh contains 520838.
gce contains 520838.
gch contains 520838.
gse contains 520838.
gsh contains 520838.

Total seqs: 4687542._
The problem arises when comparing the .shared files before and after sub.sampling.
Sub.sampling strongly changed the OTU’s accumulation curves of the samples that had more sequences.
After sub.sampling, these samples (which were before composed of hundreds of abundants OTUs) exhibit less than ten abundant OTUs and increased singletons frequency.

It’s normal or expected that a random process causes such a deep change in sample diversity?
Before sub.sampling the rank-abbundance(relative) curves were very similar between all samples.
If I’m not missing any commands mistake, what other strategy could I use for improving or getting more reliable results?

Hope someone helps me,
Thanks in advance!

We suggest subsampling after you have your shared file and we only use that for specific applications - namely to do otu-by-otu inference analyses (e.g. lefse, metastats). For calculating alpha and beta diversity metrics we use the full shared file and use the subsample option on those commands. Does that make sense?


Thanks for answering!

Using the subsample option on those commands yielded different results indeed. But I dont understand the difference between the two strategies …
I read that the different subsampling technics currently used are not reliable and that is an alternative parametric method more aproppiate for doing it:
“Waste Not, Want Not, Why Rarefying Microbiome Data Is Inadmissible” - McMurdie, Ploss paper 2014, April.
It seems that it is adaptable by importing my mothur results through phyloseq. Would you recommend trying this approach?

Pat, I need some clarification on how to do the following and some minor questions related to subsampling, i hope you got the time for them:
I am trying to create a pie and stacked-bar plots form my classified otus (by treatment) but i realized that i should be using a subsampled taxonomy to make sure that my plots really reflects my stats done on the alpha and beta analyses. I have already subsamples my shared file as instructed but he SOP from your paper.
mothur > sub.sample(, size=849)
Sampling 849 from each group.
Output File Names:
however in order to do that i need to subsample my list, name, fast, and groups files
here is what i will do:
mothur > sub.sample( ??
mothur > sub.sample(fasta=final.pick.fasta, name=esophagus.names) ??
mothur > sub.sample(fasta=final.pick.fasta, name=final.pick.names, taxonomy=final.pick.taxonomy)
mothur > sub.sample(, group=final.pick.groups)

is there any redundancy here??
now since i subsampled my list and groups file, shall i re do my make.shared from the new subsampled list and groups files??

then i will run
"mothur > classify.otu(, name=final.pick.names, taxonomy=final.pick.taxonomy, label=0.03, persample=t???)

My second main question is to whether i should be using the subsampled files throughout the analysis to ensure sequences i used are not changing with the use of the other random “subsample” options in the command down the line. I totally understand why and where we need to use the full shared file in the rarefaction and other analyses. so basically what i am trying to do is to use my subsampled files where a subsample option is being used i.e. summary.single (alpha OTU analysis), tree.shared and dis.shared (beta OTU analysis, and unifrac (beta phylogenic analysis).
the question what files beside the list, names, fast, taxonomy, and shared that i may need???

many many thanks


I don’t understand why subsmapling the shared file doesn’t get you what you want.


Pat, thanks for responding.