Normalizing sequences in each sample

Hi All,

First of all, thanks for this awesome forum. I regret not searching for it earlier. I am comparatively new to mothur and I am currently following the 454 SOP for analysis. I understand that after clustering the data, I need to normalize the number of sequences from each sample. The SOP mentions that randomly picking the minimum number of sequences from each sample is not ideal. My data look like the following:

Plank_run2 contains 14725.
WE1_run1 contains 2602.
WE1_run2 contains 26980.
WE2_run1 contains 3697.
WE2_run2 contains 14793.
WE3_run1 contains 4606.
WE3_run2 contains 13596.
WE4_run1 contains 27960.
WE4_run2 contains 22737.
plank_run1 contains 3198.

One can see the huge variation in sequence numbers. Randomly picking 2600 sequences from each sample may not fairly represent the total community in the more deeply sequenced samples. Is there any other way I could proceed from here?

Thanks a ton in advance!

Unfortunately, uneven sequencing depth is pretty much a fact of life at the moment. If you’re doing alpha diversity analysis, I think you just have to bite the bullet and subsample your shared file, then calculate the diversity measures, possibly repeating the process a few times to make sure the results are consistent.
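For example, assuming the shared file from the SOP is called final.an.shared (adjust to your own file names) and 2602 is your smallest group, that might look like:

sub.sample(shared=final.an.shared, size=2602)
summary.single(shared=current, calc=nseqs-coverage-sobs-invsimpson)

Here the current keyword picks up the subsampled shared file that sub.sample just wrote. Each run of sub.sample draws a new random subsample, so rerunning the pair of commands a few times gives you a feel for how stable the estimates are.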

However, if you’re doing beta diversity analysis, you can build your input file from the full data set and then randomly subsample it multiple times to get your final distances. What I mean is, say, for an unweighted unifrac analysis you can build your tree off the full data set, then use the command:

unifrac.unweighted(tree=XXX, count=XXX, subsample=2600, iters=1000)

and mothur will randomly subsample each group to 2600 sequences and calculate the distances 1000 times, then report the average distance between groups (as well as the std dev for each distance). This can be done with the dist.shared() command as well for OTU-based distances (Jaccard, Bray-Curtis).
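For the OTU-based distances that would be something like the following (file name assumed, as above):

dist.shared(shared=final.an.shared, calc=jclass-braycurtis, subsample=2600, iters=1000)

which writes out the average distance matrices (*.ave.dist) for downstream analysis, along with the standard deviations (*.std.dist).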

Thanks a lot! I think your approach is similar to the one provided in the 454 SOP (randomly subsample with many iterations).

Just to add a small correction - for things like alpha and beta diversity we rarefy - not subsample - the data. In summary.single and dist.shared, you can rarefy the data to your desired number of reads per sample. We use sub.sample for doing things like metastats, lefse, classify.rf, and get.communitytype.
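In practice (file name and the 2602 cutoff assumed from the counts above), rarefying happens inside the calculators via the subsample option:

summary.single(shared=final.an.shared, calc=nseqs-coverage-sobs-invsimpson, subsample=2602)
dist.shared(shared=final.an.shared, calc=thetayc-jclass, subsample=2602)

whereas sub.sample writes out a single subsampled table to hand to commands like metastats or lefse:

sub.sample(shared=final.an.shared, size=2602)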

Looks like I missed this post.

This is in reference to the 454 SOP, OTU-based analysis.

I see that alpha diversity does not require any subsampling unless I run the summary.single command. For beta diversity as well, subsampling is done only after building a tree with the tree.shared command, and the results of this command are used for all the further analyses like unifrac. Also, pcoa analysis is done on the subsampled data.
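For reference, that part of the SOP workflow looks roughly like this (the shared file name is assumed as before, and the pcoa input name is illustrative since mothur derives it from your calculator and label):

tree.shared(shared=final.an.shared, calc=thetayc-jclass, subsample=2600)
dist.shared(shared=final.an.shared, calc=thetayc-jclass, subsample=2600)
pcoa(phylip=final.an.thetayc.0.03.lt.ave.dist)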

Thanks for clearing it out!

Hi,

I’m new to mothur and have questions on the same topic. Is there any literature on how sub.sample does the subsampling? (I assume some sort of unbiased sampling, but I want to learn more about the theory behind it.) I read the differences between “rarefying” and “subsampling” in this post, yet I’m thinking about using the same set of subsampled OTU-abundance data to do a random forest and alpha-/beta-diversity analysis, so I was wondering if there’s any way I could subsample 1000 times and make an OTU table from that.

Any response and thoughts on this topic will be much appreciated!

Fangqiong

We randomly select N sequences, where each sequence is equally likely to be sampled.
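So, for example, each time you run something like

sub.sample(shared=final.an.shared, size=2602)

(file name assumed), mothur draws a fresh random subset of 2602 reads per group without replacement, with every read equally likely to be picked; the OTU counts in the output are simply the tallies of the drawn reads. Rerunning the command gives a different random draw each time.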

Hi,

I am a bit confused by this post. I understand why we subsample our OTU table to the lowest number of sequences. However, I see that in the SOP we don’t use the subsampled table for the rest of the analyses, but rather the an.shared file again. I see that for beta-diversity analyses we can include the subsample option instead of using the subsampled table, but then why would we need the subsampled table?

Thanks

The subsampled shared file is the result of only one subsampling. Elsewhere (e.g. dist.shared) the ave.dist file is generated based on a large number of iterations. This allows you to overcome the problems that might occur if you get lucky/unlucky with that one subsampling.
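Concretely (file name assumed),

sub.sample(shared=final.an.shared, size=2602)

gives you one random draw, while

dist.shared(shared=final.an.shared, calc=braycurtis, subsample=2602, iters=1000)

repeats the draw 1000 times and reports the mean (*.ave.dist) and standard deviation (*.std.dist) of the distances, so no single lucky/unlucky draw dominates the result.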

Pat