order of subsampling and finding unique sequences?

pavel47 · May 9, 2014, 2:58pm

Hello, first a disclaimer: I’m a statistician dabbling in biology, so please excuse the beginner-type questions.

I am trying to run comparative analysis on square distance matrices between DNA samples. Here’s what I’ve done so far:

1. The smallest sample has 2294 sequences, so I take a subsample of size 2294 from each sample (45 samples total).
2. Identify unique sequences in each of the subsamples.
3. Create a distance matrix of pairwise distances for each sample (45 of these distance matrices).
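
For readers following along, the three steps above can be sketched in plain Python. This is only an illustration, not mothur code: the function names, the toy sequences, and the choice of a simple p-distance (fraction of mismatched positions between aligned, equal-length sequences) are all assumptions made for the example.

```python
import random
from itertools import combinations


def subsample_to_smallest(samples, seed=0):
    """Step 1: draw the same number of reads (the smallest sample size) from every sample."""
    rng = random.Random(seed)
    size = min(len(reads) for reads in samples.values())
    return {name: rng.sample(reads, size) for name, reads in samples.items()}


def unique_sequences(reads):
    """Step 2: collapse identical reads into a sorted list of unique sequences."""
    return sorted(set(reads))


def p_distance(a, b):
    """Fraction of positions that differ (assumes aligned, equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Step 3: square, symmetric matrix of pairwise distances."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = p_distance(seqs[i], seqs[j])
    return matrix


# Toy data standing in for real samples (hypothetical sequences).
samples = {
    "sampleA": ["ACGT", "ACGT", "ACGA", "TCGA", "ACGT"],
    "sampleB": ["ACGT", "ACGA", "ACGA"],
}
subsampled = subsample_to_smallest(samples)
matrices = {
    name: distance_matrix(unique_sequences(reads))
    for name, reads in subsampled.items()
}
```

Note that each matrix has one row per unique sequence, so its size depends on how many uniques survive the subsampling in that sample.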

The problem I have is that after running steps 1-3, some samples have only 4 or 5 unique sequences. For example, one sample originally has 307 unique sequences out of 11560, but when I take a subsample of 2294, only 4 unique sequences are found among those 2294. I have repeated the procedure several times, and the number of unique sequences doesn’t vary significantly. Am I just being really unlucky, or is this pretty common?
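
One way to judge whether such a drop is plausible is to compute the expected number of unique sequences in a random subsample drawn without replacement, given the abundance of each unique sequence. The sketch below does that in plain Python. The abundance profile is hypothetical (the thread does not give the real counts): it assumes one dominant sequence plus 306 singletons, which is a deliberately skewed way to get 307 uniques out of 11560 reads.

```python
from math import comb


def expected_uniques(abundances, subsample_size):
    """Expected number of distinct sequences in a subsample drawn without replacement.

    abundances: read count of each unique sequence in the full sample.
    """
    total = sum(abundances)
    if subsample_size > total:
        raise ValueError("subsample is larger than the sample")
    denom = comb(total, subsample_size)
    expected = 0.0
    for count in abundances:
        # Probability that none of this sequence's reads is drawn (hypergeometric).
        p_missing = comb(total - count, subsample_size) / denom
        expected += 1.0 - p_missing
    return expected


# Hypothetical profile: one dominant sequence and 306 singletons (assumption).
abundances = [11254] + [1] * 306
result = expected_uniques(abundances, 2294)
print(round(result, 1))
```

Under this assumed profile the expectation is roughly 62 uniques, because each singleton is retained with probability 2294/11560. Plugging in the real per-sequence counts would show whether 4 uniques is in line with the data or a sign that something went wrong upstream.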

Does it make biological sense to first find unique sequences in the unnormalized data and then subsample? Straight away I suspect this may lead to the problem of rare sequences being weighted the same as common sequences.

Does anyone have any suggestions about how to deal with this problem?

Thanks
Pavel

pschloss · May 9, 2014, 4:40pm

So this isn’t really a “standard” way of doing things. Typically, subsampling is done after you have a shared file (about 5 steps later). It sounds pretty odd to have only 4 or 5 unique sequences unless you have some very low diversity gut community (e.g. individual Drosophila guts?).

pat

pavel47 · May 9, 2014, 8:10pm

Yes, the samples are taken from the gut; I will check if it’s Drosophila.

As for the “standard” way, I’m not really interested in constructing OTU tables as is done in the SOP. I just want to compare the sequences in the samples and how “close” they are to each other, which is why I’m looking at the distance matrices of sequences within each sample.

So other than the problem mentioned earlier (all sequences being weighted as equally likely), is it biologically acceptable to find the unique sequences and then subsample?
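
The weighting concern can be made concrete with a small plain-Python comparison of the two orders. This is a sketch, not mothur behaviour: it models the naive case where the list of uniques is sampled with no abundance information, and the toy reads ("common", "rare0", ...) are made up for the illustration.

```python
import random
from collections import Counter


def subsample_then_unique(reads, size, seed=0):
    """Draw reads first, then dereplicate; common sequences dominate the draw."""
    rng = random.Random(seed)
    return Counter(rng.sample(reads, size))


def unique_then_subsample(reads, size, seed=0):
    """Dereplicate first, then draw from the uniques; every unique is equally likely."""
    rng = random.Random(seed)
    uniques = sorted(set(reads))
    return rng.sample(uniques, size)


# Toy sample: one common sequence (900 reads) and 100 singletons (hypothetical).
reads = ["common"] * 900 + [f"rare{i}" for i in range(100)]

reads_first = subsample_then_unique(reads, 50)
uniques_first = unique_then_subsample(reads, 50)

print("reads first:  ", len(reads_first), "uniques among 50 reads")
print("uniques first:", len(uniques_first), "uniques, of which",
      sum(s.startswith("rare") for s in uniques_first), "are rare")
```

In the second order at least 49 of the 50 picks are necessarily rare sequences, even though rare reads make up only 10% of the toy sample. If abundance counts are carried along with the uniques and the draw is weighted by them, the two orders become equivalent.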

adamc83 · May 14, 2014, 7:22pm

I agree with pat that it’s odd to see so few unique sequences; I’ve had more than that from supposedly DNA-free control samples. How many sequences/uniques were in the other samples? Is it possible that the one sample with 4-5 uniques actually had only a trace amount of DNA extracted or sent to sequencing, or that there was a problem during sequencing?

As to:

is it biologically acceptable to find the unique sequences and then subsample?

I don’t see a particular problem with that if you are just trying to normalize the number of sequences per sample (make sure to use persample=true in sub.sample). As far as statistics go, the interesting stuff is generally done at the OTU level, not at the unique sequence level. Even if you have a very low number of uniques, I wouldn’t run statistics on unique-level sequences: you might end up analyzing sequencing noise, PCR artifacts, or biologically irrelevant sequence differences. 97% sequence similarity in the 16S gene is generally considered to be roughly equivalent to species-level similarity.
