Diversity analysis: How to build and compare OTUs from different samples with same sequence ID

Dear Mothur community,

I'm a Mothur newbie and am struggling a little with my data.

I have 24 samples containing varying numbers of sequences. Some sequences appear in several samples under the same sequence ID.
For example:

is in sample 1,2 and 4 but not in 3, M345 is in 2 and 3.
First, I split my fasta file into 24 fasta files, each containing only the sequence IDs and sequences of one subsample, then aligned them and clustered them into OTUs at 0.03. But I guess Mothur cannot compute diversity from merged files in which OTU001, OTU002, … each occur several times.
Next, I tried to sort the samples in the original fasta file using a group file, but Mothur warns that all my IDs appear more than once in my sample.

What I want Mothur to do is sort my IDs by sample, cluster them into OTUs, and compute the diversity within and across all my samples.

Does anyone know how to solve that problem? Many thanks in advance!

The approach you’re taking, with the groups file, is the right way to do it. The only issue is the identical sequence names: if you rename them in some way so that they’re unique, you won’t get the error from mothur.

Usually when I do this kind of thing I just write a simple script to rename each sequence as [File Name].[Sequence ID] so that they’re all unique, but you can also work out which original sequence they were.
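
A minimal sketch of such a renaming script in Python, in case it helps. This is not the script mentioned above, just one way to do it. The function name `rename_fasta`, the sample names, and the tiny in-memory sequences are made up for illustration; it assumes plain FASTA input and a two-column, tab-separated group file (sequence name, then sample). Anything after the first whitespace in a header line is dropped.

```python
import io


def rename_fasta(sample, fasta_in, fasta_out, groups_out):
    """Copy FASTA records, prefixing each sequence ID with the sample name.

    Also writes one "new_id<TAB>sample" line per sequence to groups_out.
    """
    for line in fasta_in:
        if line.startswith(">"):
            # Keep only the ID (first token); drop any description text.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            new_id = f"{sample}.{seq_id}"
            fasta_out.write(f">{new_id}\n")
            groups_out.write(f"{new_id}\t{sample}\n")
        else:
            # Sequence lines are copied unchanged.
            fasta_out.write(line)


# Made-up in-memory data standing in for two per-sample FASTA files.
samples = {
    "sample2": ">M345\nACGTACGT\n",
    "sample3": ">M345\nACGTACGT\n",
}

merged_fasta = io.StringIO()
groups = io.StringIO()
for name, text in samples.items():
    rename_fasta(name, io.StringIO(text), merged_fasta, groups)

print(merged_fasta.getvalue(), end="")
print(groups.getvalue(), end="")
```

With real data you would open the per-sample files and two output files in place of the `io.StringIO` objects. The merged fasta and the group file then go into mothur together.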

Thanks for your reply!
I renamed my subsample fasta files, merged and clustered them, and used make.shared to assign OTUs to their corresponding subsamples, but the number of OTUs appears much higher than when clustering each sample separately.
Sample 1 contains 14 OTUs when run as a single sample, but now 481 OTUs.

I guess all sequences are mixed during clustering, which results in this high number of OTUs. Should I first cluster the sequences sample by sample and then merge all shared folders, or is it possible to set that up beforehand?

Can anyone help?


You will need to merge all of your data together and cluster them together. Then you’ll generate a shared file and use summary.single to calculate your alpha diversity parameters for each sample. Have you tried to follow one of the SOPs on the wiki?
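
As a quick sanity check on the per-sample OTU numbers, here is a rough Python sketch that counts, for each sample, how many OTUs contain at least one sequence. It assumes a tab-separated shared-style table with the columns `label`, `Group`, `numOtus` followed by one count column per OTU, and a single label (e.g. 0.03) in the file. The function name `observed_otus` and the numbers are made up for illustration, and this is not a replacement for summary.single.

```python
import csv
import io


def observed_otus(shared_text):
    """Return {sample: number of OTUs with at least one sequence}."""
    reader = csv.reader(io.StringIO(shared_text), delimiter="\t")
    header = next(reader)
    group_col = header.index("Group")
    # OTU count columns are assumed to start right after "numOtus".
    first_otu = header.index("numOtus") + 1
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Ignore empty cells (e.g. from a trailing tab) and zero counts.
        counts[row[group_col]] = sum(
            1 for n in row[first_otu:] if n.strip() and int(n) > 0
        )
    return counts


# Made-up example table with two samples and three OTUs.
shared_text = (
    "label\tGroup\tnumOtus\tOtu001\tOtu002\tOtu003\n"
    "0.03\tsample1\t3\t5\t0\t2\n"
    "0.03\tsample2\t3\t0\t1\t0\n"
)
print(observed_otus(shared_text))
```

The point of the sketch is that an OTU only counts toward a sample if that sample's entry for it is nonzero, so the total number of OTU columns in the merged table can be much larger than the number of OTUs observed in any one sample.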


Yes, I tried the 454 and MiSeq SOPs (great for getting started!), but struggled with my own data. It seems to work now, thank you very much!