Hi All,

So I have a question about normalizing my data. I see from the Schloss protocol that there is a command (sub.sample) that can generate a random subsample. Unfortunately, most of my projects have one or two groups (conditions) with significantly fewer sequences than the others. If I used the sub.sample command to bring every group down to the size of the smallest one, I'd end up losing most of my data. In several projects, the majority of the conditions would go from 3000-8000 sequences down to 1000, since that's the size of the smallest group. In other cases, I would go from 3000-15000 down to less than 1000, which isn't realistic.

Does anyone have suggestions as to how I can handle this situation? Can I use a larger subsample size for the sub.sample command and just add the smaller groups back in afterward? Is there a way to have sub.sample keep all the sequences from groups that fail to meet the minimum size? Any other suggestions?

My other question (posed by my supervisor) is whether there is a way in Mothur to check whether you actually need to normalize your data, or do we just assume we always need to? Any advice would be appreciated. Thanks!


The problem is that many metrics (alpha and beta diversity) are affected by having different levels of sampling. Furthermore, we showed in the recent PLoS ONE paper that the number of artifacts goes up with additional sequences. So having 10000 sequences in one library and 1000 in another screws things up on a number of levels. I would say that you always need to sub-sample/rarefy your data so everything is on an equal footing.
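To see why unequal depth alone skews things, here is a minimal Python sketch (the 200-OTU community, the geometric abundance profile, and the read counts are all invented for illustration, not taken from any real dataset): sampling the *same* community at 1000 versus 10000 reads gives different observed richness, so the deeper library looks more diverse for no biological reason.

```python
import random

random.seed(1)

# Hypothetical community: 200 OTUs with a skewed (geometric-like) abundance profile.
otus = list(range(200))
weights = [0.9 ** i for i in otus]

def sample_reads(n):
    """Draw n reads from the community (multinomial-style sampling)."""
    return random.choices(otus, weights=weights, k=n)

def observed_richness(reads):
    """Number of distinct OTUs seen in the sample."""
    return len(set(reads))

shallow = observed_richness(sample_reads(1000))
deep = observed_richness(sample_reads(10000))

# The deeper library "sees" more OTUs purely because of sampling depth,
# not because the underlying community differs.
print(shallow, deep)
```

The same depth effect hits beta-diversity metrics too, which is why subsampling everything to a common depth puts the libraries on an equal footing.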

We are strongly encouraging people to subsample down to the smallest library. You might decide that you really aren’t interested in that sample with 1000 sequences and so you’re willing to get rid of it. Alternatively, you might realize that you really need it. In that case, you could go back and obtain additional sequences.

Finally, I would encourage people to not think of it as “throwing data away”. When you have 10000 sequences and rarefy/subsample down to 1000 you will have greater confidence in the relative abundance than you might if you only had 1000 sequences and couldn’t rarefy down further.
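A quick Python sketch of that point (the library composition and the OTU name are made up; the subsampling is a plain random draw without replacement, mimicking what a rarefaction step does): repeatedly subsampling 1000 reads from a 10000-read library gives relative-abundance estimates that cluster tightly around the library-wide value, which is exactly the confidence a true 1000-read library lacks.

```python
import random

random.seed(2)

# Hypothetical 10,000-read library in which OTU_A truly makes up 20% of the reads.
library = ["OTU_A"] * 2000 + ["other"] * 8000
random.shuffle(library)

def rarefy(reads, depth):
    """One random subsample of `depth` reads, drawn without replacement."""
    return random.sample(reads, depth)

# Average OTU_A's relative abundance over many independent rarefactions.
estimates = []
for _ in range(100):
    sub = rarefy(library, 1000)
    estimates.append(sub.count("OTU_A") / 1000)

mean_estimate = sum(estimates) / len(estimates)
# Stays close to the true 0.20 because the 10,000 reads anchor the estimate.
print(round(mean_estimate, 3))
```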

Hope this helps,