sub.sample with fasta, name & group or fasta and count file

Hi all,

I have analyzed an Illumina MiSeq dataset of the 16S V4 region following the MiSeq SOP.

To prepare a fasta file in the format required for oligotyping (http://oligotyping.org/2012/05/11/oligotyping-pipeline-explained/#Preparing_the_FASTA_File), I reran the whole SOP using name and group files instead of the count file. That solved the problem of generating the fasta file for oligotyping, but I came across another issue:

When running the MiSeq SOP with a count file, I get 261,137 uniques and 4,059,903 sequences. After cluster.split at taxlevel=5, I end up with 33,036 OTUs. After sub.sample, these are reduced to 26,255 OTUs, 185,224 uniques and 2,371,440 sequences.

When running the MiSeq SOP with name and group files, I get 257,435 uniques and 3,917,871 sequences. After cluster.split at taxlevel=5, I end up with 33,272 OTUs. But after sub.sample, these are reduced to only 7,732 OTUs, 38,802 uniques and 2,286,384 sequences.

I am not too worried about the slightly different numbers before sub.sample, but I am completely clueless as to why sub.sample produces such a massive difference in the number of OTUs and unique sequences.
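To put the discrepancy in numbers, the counts reported above can be turned into retention fractions. This is only a quick arithmetic sketch in Python using the figures from this post; the variable and function names are made up for illustration.

```python
# Counts copied from the post: (before sub.sample, after sub.sample).
runs = {
    "count file": {
        "sequences": (4_059_903, 2_371_440),
        "uniques": (261_137, 185_224),
        "OTUs": (33_036, 26_255),
    },
    "name + group files": {
        "sequences": (3_917_871, 2_286_384),
        "uniques": (257_435, 38_802),
        "OTUs": (33_272, 7_732),
    },
}


def retained(before, after):
    """Fraction of items that remain after subsampling."""
    return after / before


# Retention fraction for every metric in every run.
fractions = {
    run: {metric: retained(*pair) for metric, pair in metrics.items()}
    for run, metrics in runs.items()
}

for run, metrics in fractions.items():
    for metric, frac in metrics.items():
        print(f"{run:20s} {metric:10s} {frac:6.1%}")
```

Both runs keep about 58% of the sequences, but the count-file run keeps about 71% of the uniques and 79% of the OTUs, while the name and group run keeps only about 15% of the uniques and 23% of the OTUs.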

Any ideas?

By subsampling you are removing data and so you should expect to have fewer OTUs.
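The general effect can be illustrated with a toy simulation. This is a minimal sketch, not mothur's implementation: the community below is hypothetical (5 abundant OTUs plus 200 singletons), and the helper names are invented for the example.

```python
import random


def count_otus(reads):
    """Number of distinct OTU labels among the reads."""
    return len(set(reads))


def subsample(reads, size, seed=0):
    """Draw `size` reads at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(reads, size)


# Hypothetical community: each read is represented by its OTU label.
reads = []
for i in range(5):
    reads += [f"Otu_abundant_{i}"] * 200          # 5 OTUs with 200 reads each
reads += [f"Otu_rare_{i}" for i in range(200)]    # 200 singleton OTUs

full_otus = count_otus(reads)
sub_otus = count_otus(subsample(reads, size=600))  # keep half of the 1,200 reads

print("OTUs before subsampling:", full_otus)
print("OTUs after subsampling: ", sub_otus)
```

Rare OTUs are the ones most likely to be lost, because a singleton survives only if its single read happens to be drawn.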

Hi Pat,

thanks for your quick reply.

Reducing my data set by subsampling is what I want, and I know that it reduces the number of OTUs.
But I am confused by the very different outcomes of sub.sample when running the MiSeq SOP on exactly the same data set, once with a count file and once with name and group files: after sub.sample I get more than 26,000 OTUs (count file) or fewer than 8,000 OTUs (name and group files)! Before sub.sample, both data sets have about the same number of sequences (4,000,000), uniques (260,000) and OTUs (33,000).

Can you post the fasta, group, count, and names files somewhere for me to download and look at? It’d also be good to have the exact commands you are running.

Thanks,
Pat

Hi,
I ran the following commands on the final files from Pat’s 454 example, http://www.mothur.org/wiki/454_SOP.

make.table(group=final.groups, name=final.names)
sub.sample(fasta=final.fasta, count=current, size=4419, persample=t)
dist.seqs(fasta=current)
cluster(count=current)

and

sub.sample(fasta=final.fasta, group=final.groups, name=final.names, size=4419, persample=t)
dist.seqs(fasta=current)
cluster(name=current)

The resulting list files had a comparable number of OTUs. Could you have left the persample off of one of the sub.sample commands?

Kindly,
Sarah
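For readers unsure what difference persample could make, here is a toy Python sketch of the two subsampling modes. It assumes, based on my reading of the sub.sample documentation rather than the mothur source, that without persample `size` reads are drawn from the whole dataset and with persample=t `size` reads are drawn from each group. The group names, read labels and helper functions are hypothetical.

```python
import random


def subsample_pooled(reads_by_group, size, seed=0):
    """Draw `size` reads from all groups combined; groups get unequal shares."""
    rng = random.Random(seed)
    pooled = [(g, r) for g, reads in reads_by_group.items() for r in reads]
    picked = rng.sample(pooled, size)
    out = {g: [] for g in reads_by_group}
    for g, r in picked:
        out[g].append(r)
    return out


def subsample_per_group(reads_by_group, size, seed=0):
    """Draw `size` reads from every group separately."""
    rng = random.Random(seed)
    return {g: rng.sample(reads, size) for g, reads in reads_by_group.items()}


# Hypothetical groups with very different sequencing depths.
reads_by_group = {
    "A": [f"A_{i}" for i in range(1000)],
    "B": [f"B_{i}" for i in range(300)],
    "C": [f"C_{i}" for i in range(150)],
}
size = 100

pooled = subsample_pooled(reads_by_group, size)
per_group = subsample_per_group(reads_by_group, size)

pooled_total = sum(len(v) for v in pooled.values())
per_group_total = sum(len(v) for v in per_group.values())

print("pooled:   ", {g: len(v) for g, v in pooled.items()}, "total", pooled_total)
print("per group:", {g: len(v) for g, v in per_group.items()}, "total", per_group_total)
```

In this sketch the pooled draw returns 100 reads in total, split unevenly across groups, while the per-group draw returns 100 reads for each of the three groups, so the two modes can leave very different numbers of reads, uniques and OTUs.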