Getting random seqs


I’ve got a set of 100 000 seqs, from which I need to get a subset of 20 000 randomly picked seqs. Is that possible with Mothur?

Sorry, not really at this point, but it would be a pretty simple addition. May I ask what your end goal is with the 20000?

The point is, that due to tiny lil’ pipeting error we’ve got 8 samples of which 4 have around 20 000 seqs and 4 around 100 000 seqs :roll: . So for the analysis we are planning to get equal amount of seqs for every sample. Luckily it was not my pipeting error :wink: . But anyway I’d be happy to get suggestions on how to get those random seqs :D.

Ok, that’s what i thought. The next release will have a command called normalize.shared that will give you a normalized shared file. It does this by finding the relative abundance for each OTU and then multiplying each relative abundance by the desired number (e.g. 20000). Also, I’d add that this may not be necessary depending on the beta-diversity metric you are using. Fore example, Bray-Curtis is based on counts whereas Morisita-Horn/thetaYC are based on relative abundances - I’d stay far away from B-C because of this.

Until the normalize option is available, you could try the following:

  1. Sort your sequences in alphabetical/numerical order and assign them values 1through n.
  2. Use a random integer generator (like to select a “subset” of values (be careful, as you can run into problems downstream with repeated values). Many of these generators have an option that will create a set of values without replication.
  3. Use the vlookup function in Excel to link your “random” values to your sequence IDs.
  4. Use the selected set of sequence IDs as a .accnos file and use get.seqs to generate your “normalized” data set.

This approach won’t necessarily produce the normalized subset that Pat described, but it should produce a random/pseudo random sampling of the data that you have on hand…and it allows you to use a set of Mothur’s features that are currently available.

Good luck!

I’d also be interested in hearing from people whether they prefer the random vs. normalize idea. I hate throwing away data and it seems that normalizing would take away that problem and not be sensitive to picking the “wrong” random subset. Thoughts?

I find it weird to compare just relative abundances. When comparing several samples with a different amount of reads each relative abundance is measured with different accuracy. This renders it hard to make statistical tests. Furthermore you deal with fractions when using relative abundances which again are not fun to do statistics with. I think for presence absence beta-diversity measures, read number is crucial for the low abundance microbes. You will find a bunch more of the low abundant microbes with samples you have more reads for and bias the results. With respect to this I would prefer throwing away data over biasing my results.I think there are valuable ecological approaches which circumvent throwing away data but take the problems into account. Maybe this can help you
Please let me know if you disagree, I am happy to discuss since encountering identical problems

Fabian - absolutely - I think it ultimately depends on the question. If you want to do something like indicator analysis / group-based statistics (e.g. metastats-ish analysis) counts are probably the way to go. But if you’re looking at things like alpha and beta-diversity throwing away data can actually hinder your ability to get the correct richness or tighter confidence intervals on an estimate. One frustration I have is that people are ignoring the beta-diversity literature that looks at things like uneven sampling and clearly shows huge problems with Bray-Curtis, yet it seems to be the method du jour. I appreciate your link to the Legendre paper.

Perhaps a more specific question is whether it is better to normalize a dataset to a specific number of sequences or to subsample. I’d vote for normalizing, but am open to other thoughts…


I agree with Pat and prefer to normalize my unevenly sampled sequence data rather than throw out good data. I know that Clarke and Warwick, who derived a lot of the statistics implemented in Primer-6, sometimes suggest taking the square root of the normalized values to further smooth abundance data. I know this is often done when comparing T-RFLP profiles using multivariate statistics, although I haven’t seen this approach mentioned in the literature for sequence data.

Hi Pat,

On this topic of normalise vs subsampling, I’m convinced that the normalise may be the way to go since the subsample may ‘randomly’ sample the less abundant sequences skewing the R-A curve. However, I’m not sure about how the normalise.shared function handles OTUs represented by only a single sequence? As you mentioned “It does this by finding the relative abundance for each OTU and then multiplying each relative abundance by the desired number (e.g. 20000)” Does this remove these OTUs from the analysis? or are they converted to a relative abundance of less than 1?


A singleton (or anything else) would be divided by the number of sequences in that group and multiplied by the normalization factor. The result would then be rounded. So if you took data from a library with 1000 sequences and normalized it to 400, you would lose the singletons.