Order of subsampling and finding unique sequences?

Hello. First, a disclaimer: I’m a statistician dabbling in biology, so please excuse the beginner-type questions.

I am trying to run comparative analysis on square distance matrices between DNA samples. Here’s what I’ve done so far:

  1. The smallest sample has 2294 sequences, so I take a subsample of size 2294 from each sample (45 samples total)
  2. Identify unique sequences in each of the subsamples
  3. Create a distance matrix of pairwise distances for each sample (45 of these distance matrices)
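For concreteness, steps 1-3 for a single sample can be sketched roughly as below. This is only an illustration under stated assumptions: the reads are already aligned to a common length, the distance is a plain uncorrected mismatch fraction (not necessarily the distance a tool such as mothur would compute), and the names `hamming_fraction` and `sample_distance_matrix` are made up for this example.

```python
import random

def hamming_fraction(a, b):
    """Fraction of mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def sample_distance_matrix(reads, size, seed=0):
    """Subsample reads, keep the unique sequences, return (uniques, matrix)."""
    rng = random.Random(seed)
    subsample = rng.sample(reads, size)   # step 1: draw without replacement
    uniques = sorted(set(subsample))      # step 2: unique sequences
    # step 3: square matrix of pairwise distances between the uniques
    matrix = [[hamming_fraction(a, b) for b in uniques] for a in uniques]
    return uniques, matrix

# Toy sample of 10 reads (hypothetical data, not from the real samples).
reads = ["ACGT"] * 5 + ["ACGA"] * 3 + ["TCGA"] * 2
uniques, matrix = sample_distance_matrix(reads, size=10)
```

Note that the matrix in step 3 has one row per unique sequence, so its size depends entirely on how many uniques survive step 1.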

The problem I have is that after running steps 1-3, some samples are left with only 4 or 5 unique sequences. For example, one sample originally has 307 unique sequences out of 11560, but when I take a subsample of 2294, only 4 unique sequences are found in those 2294. I have tried repeating the procedure, and in each case the number of unique sequences doesn’t vary significantly. Am I just being really unlucky, or is this pretty common?
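One way to check whether this is bad luck is to compute the expected number of uniques in a subsample. Assuming the subsample is a simple random draw of reads without replacement, a unique with n reads is missing from a subsample of size m out of N reads with probability C(N-n, m)/C(N, m), so even a singleton is kept with probability m/N. The sketch below evaluates this for a deliberately extreme, hypothetical abundance profile (one dominant sequence plus 306 singletons); the real per-sequence counts are not given here, and the function names are invented for illustration.

```python
def prob_present(count, total, size):
    """P(a unique with `count` reads appears in a random subsample of `size` reads)."""
    p_absent = 1.0
    for j in range(count):
        remaining = total - size - j
        if remaining <= 0:
            # More copies than unsampled slots: the unique must be drawn.
            return 1.0
        p_absent *= remaining / (total - j)
    return 1.0 - p_absent

def expected_uniques(counts, size):
    """Expected number of distinct sequences in a subsample of `size` reads."""
    total = sum(counts)
    return sum(prob_present(c, total, size) for c in counts)

# Hypothetical worst case: 307 uniques among 11560 reads, 306 of them singletons.
skewed = [11560 - 306] + [1] * 306
print(round(expected_uniques(skewed, 2294), 1))
```

Under that assumption, each of the 307 uniques is kept with probability at least 2294/11560 (about 0.2), so roughly 60 or more uniques would be expected on average. If that reasoning holds for the data, seeing only 4 looks hard to explain by chance, and it may be worth checking whether the subsampling step is really drawing individual reads.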

Does it make biological sense to first find unique sequences in the unnormalized data and then subsample? Straight away I suspect this may lead to the problem of rare sequences being weighted the same as common sequences.
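The weighting concern can be avoided if the deduplicated table keeps a read count for each unique sequence and the subsample is drawn in proportion to those counts. A minimal sketch, assuming such a count table is available (the function name and the toy table are hypothetical):

```python
import random
from collections import Counter

def subsample_with_counts(unique_counts, size, seed=0):
    """Draw `size` reads without replacement from a deduplicated count table.

    `unique_counts` maps each unique sequence to its read count, so abundant
    sequences remain proportionally more likely to be drawn than rare ones.
    """
    rng = random.Random(seed)
    # Expand back to one entry per read, then draw without replacement.
    reads = [seq for seq, n in unique_counts.items() for _ in range(n)]
    return Counter(rng.sample(reads, size))

# Toy table (hypothetical): one dominant sequence and two rare ones.
table = {"AAAA": 90, "AAAT": 9, "AATT": 1}
sub = subsample_with_counts(table, size=50)
```

Drawing this way is equivalent to subsampling the raw reads first and deduplicating afterwards. Sampling from the list of uniques without their counts would instead give "AATT" the same chance as "AAAA".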

Does anyone have any suggestions about how to deal with this problem?


So this isn’t really a “standard” way of doing things. Typically subsampling is done after you have a shared file (about 5 steps later). It sounds pretty odd to only have 4 or 5 unique sequences unless you have some very low diversity gut community (e.g. individual Drosophila guts?)


Yes, the samples are taken from the gut; I will check whether it’s Drosophila.

As for the “standard” way, I’m not really interested in constructing OTU tables as is done in the SOP. I just want to compare the sequences in the samples, and how “close” they are to each other, hence why I’m looking at the distance matrices of sequences within each sample.

So, other than the problem mentioned earlier (rare sequences being weighted the same as common ones), is it biologically acceptable to find the unique sequences and then subsample?

I agree with Pat that it’s odd to see so few unique sequences; I’ve had more than that from supposedly DNA-free control samples. How many sequences/uniques were in the other samples? Is it possible that the one sample with 4-5 uniques actually had only a trace amount of DNA extracted or sent for sequencing, or that there was a problem during sequencing?

As to:

is it biologically acceptable to find the unique sequences and then subsample?

I don’t see a particular problem with that if you are just trying to normalize the number of sequences per sample (make sure to use persample=true in sub.sample). As far as statistics go, the interesting stuff is generally done at the OTU level, not at the unique-sequence level. Even if you have a very low number of uniques, I wouldn’t run statistics on unique-level sequences: you might end up analyzing sequencing noise, PCR artifacts, or biologically irrelevant sequence differences. 97% sequence similarity in the 16S gene is generally considered to be roughly equivalent to species-level similarity.