Removing split.abund after pre.clustering


After pre.clustering and chimera removal I still have:

# of unique seqs: 1,161,177

total # of seqs: 13,302,990

Obviously this causes a large problem when attempting to perform dist.seqs and the command crashes after about two days of feebly struggling.

I don’t want to use the phylotype method if at all possible. A few people were discussing the removal of singletons and when I perform this on the above sequences I get the following:

# of unique seqs: 40,083

Total # of seqs: 12,181,896

This is able to go through dist.seqs fine but I noticed that most people remove singletons after OTU clustering rather than earlier. Would doing it after preclustering rather than later cause any kind of problem in your opinion?
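To make the unique-vs-total numbers above concrete, here is a minimal Python sketch of what dropping singletons does to a names-style file. It assumes the usual two-column layout (representative name, tab, comma-separated list of identical reads) and that "rare" means abundance at or below the cutoff. The function names and the toy data are made up for illustration; this is not mothur's own code.

```python
def split_by_abundance(names_lines, cutoff=1):
    """Split names-file rows into (abund, rare) lists of (representative, read_count).

    A unique sequence is treated as rare when it stands for `cutoff` reads or fewer
    (assumed convention for this sketch).
    """
    abund, rare = [], []
    for line in names_lines:
        line = line.strip()
        if not line:
            continue
        rep, members = line.split("\t")
        count = len(members.split(","))
        (rare if count <= cutoff else abund).append((rep, count))
    return abund, rare


def summarize(rows):
    """Return (# of unique seqs, total # of seqs) for a list of (rep, count) rows."""
    return len(rows), sum(count for _, count in rows)


# Hypothetical toy data: four uniques representing eight reads in total.
toy_names = [
    "seqA\tseqA,seqB,seqC,seqD",
    "seqE\tseqE",
    "seqF\tseqF,seqG",
    "seqH\tseqH",
]

abund, rare = split_by_abundance(toy_names, cutoff=1)
print("kept    (uniques, reads):", summarize(abund))
print("dropped (uniques, reads):", summarize(rare))
```

In the toy data, half of the uniques are dropped but only a quarter of the reads. The real dataset above shows the same pattern far more strongly: most uniques disappear while most reads stay.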


I will join this question, as I was about to post another one and jumped into this one.

I am also trying to do something similar, as you can see from my other posts. I tried to filter the OTUs by filtering the shared file, and it worked well, but then the final dataset for OTU analysis differs from the one behind the unifrac-based PCoA that I had to obtain.

So, I tried to use split.abund before defining OTUs, with my fasta, name and group files, but I was not convinced by the results: when I built OTUs with the rare file and classified them, I realized I was losing valuable sequences and rather abundant OTUs.

Then I tried split.abund after defining OTUs, on the fasta, list and group files, to filter out OTUs with 1 or 2 seqs rather than seqs that appear only 1 or 2 times (since those can still be clustered and end up in OTUs with more than 1 or 2 seqs). That gave me the abund fasta, list and group files, and making the shared file worked well. Building a tree with the abund.fasta file also worked, but it was not possible to get unifrac distances because the abund.fasta and abund.groups files did not match (even though they worked well together to make the shared file) :?:

Then I wonder… with pre.cluster, are the sequences clustered in the same way as when using the cluster command to define OTUs, but at a lower threshold than 3% difference?
And… is it possible with split.abund (or another command?) to remove the “pre clusters” (built at a threshold below 3% diffs) that contain 1, 2 or 3 seqs, and then have a fasta file that works both for OTU clustering at 97% similarity AND for building a tree, to end up with a PCoA ordination based on unifrac metrics, both based on the same dataset?

I think that this last approach is similar to the uclust-usearch pipeline followed in Qiime? (sorry to ask about this comparison, but I have to find the differences and try to demonstrate that Mothur can do the same and even perform better, I was working hard to find the way but could not figure it out at the moment, as I’m far from being good at bioinformatics and cannot fully understand the theory behind all algorithms).

Any help with this will be really welcome! :roll:


The problem is the error rate. You are likely using a sequencing strategy where your reads do not fully overlap and are getting an elevated error rate. This will result in a large number of unique sequences, increased number of OTUs, and your samples will look more different from each other than they really are.
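As a rough back-of-the-envelope illustration of why error rate drives the number of uniques: if errors were independent with a constant per-base rate e, the chance that a read of length L has no errors at all would be (1 - e)^L. The rates and the read length of 250 below are illustrative assumptions only, not values measured on the dataset in this thread, and real error profiles are not uniform along a read.

```python
def error_free_fraction(per_base_error, read_length):
    """Probability that a read has zero errors, assuming independent,
    uniform per-base errors (a simplification)."""
    return (1.0 - per_base_error) ** read_length


# Illustrative (assumed) per-base error rates and read length.
for err in (0.0001, 0.001, 0.01):
    frac = error_free_fraction(err, 250)
    print(f"per-base error {err}: {frac:.3f} of reads expected to be error-free")
```

Every read that is not error-free can show up as a new unique sequence, so even a modest rise in the per-base error rate inflates the unique count quickly.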


Hi Pat,

Yeah I’m aware that that is the problem, I’m just trying to think of ways to not have to use the phylotype method if at all possible and was wondering if the method I suggested might work. However, from Susanna’s reply it doesn’t seem like it will…



Hi Pat,

Just wondering,

In the board discussion ‘Phylotypes vs. OTUs’ it was briefly mentioned that:

‘In your 2013 paper in AEM, you propose a heuristic that is basically doing phylotyping before OTU building?’

‘To be safe, I do generally first cluster to the class or family level and then do OTUs. If it’s a gnarly dataset that is huge, we’ll go to genus level.’

Do you think doing so might solve my problem? If so, how would I go about doing this in MOTHUR?

Thanks for all your help,


Hello forum users
Just wanted to resurface this subject since I also generate huge distance matrices, which apparently are due to errors in the sequencing process.
I was also toying with the idea of discarding singletons prior to dist.seqs. The weakness I see in that approach, though it is used in Robert Edgar’s UPARSE pipeline, is that you might get rid of legitimate OTUs that are just underrepresented, or get rid of unique seqs that at 97% similarity would be joined to abundant OTUs, thus screwing with OTU abundance. What I’m thinking of is increasing the diffs in pre.cluster to 3%, which, if I understand things correctly, will turn my unique seqs into clusters of 97% similarity based on single linkage. Then, if I choose to get rid of singletons prior to dist.seqs, that should reduce the screwing with OTU abundance and also reduce the chance of throwing away legitimate seqs of low abundance (if they merge with other low-abundance seqs). Hopefully this will also make my distance matrix more palatable for the cluster command.

Does that make sense?
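The worry about OTU abundance can be shown with a toy Python sketch. It groups equal-length sequences by single linkage on mismatch counts (a stand-in chosen for simplicity, not the actual algorithm behind pre.cluster or cluster), then compares dropping singleton uniques before grouping with dropping singleton groups afterwards. The sequences, abundances, and function names are all made up for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage_otus(abundance, max_diffs):
    """Group sequences into connected components where neighbours differ by
    at most `max_diffs` positions; return group read totals, largest first."""
    seqs = list(abundance)
    parent = {s: s for s in seqs}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            if hamming(a, b) <= max_diffs:
                parent[find(a)] = find(b)

    totals = {}
    for s in seqs:
        root = find(s)
        totals[root] = totals.get(root, 0) + abundance[s]
    return sorted(totals.values(), reverse=True)


# Hypothetical uniques and their read counts.
toy = {
    "AAAAAAAAAA": 100,  # abundant sequence
    "AAAAAAAAAT": 1,    # singleton, 1 mismatch from the abundant one
    "AAAAAAAATT": 1,    # singleton, 1 mismatch from the previous singleton
    "CCCCCCCCCC": 1,    # singleton with no close neighbour
    "GGGGGGGGGG": 5,
}

# Option 1: drop singleton uniques first, then group.
no_singletons = {s: n for s, n in toy.items() if n > 1}
before = single_linkage_otus(no_singletons, max_diffs=1)

# Option 2: group first, then drop groups that hold a single read.
after = [n for n in single_linkage_otus(toy, max_diffs=1) if n > 1]

print("singletons dropped before grouping:", before)
print("singleton groups dropped afterwards:", after)
```

In this toy case the big group ends up with 100 reads when singletons are dropped first and 102 when they are dropped afterwards, while the isolated singleton is lost either way. The size of that difference on real data depends on how many singletons sit close to abundant sequences.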


Why singletons and not doubletons, tripletons, or even ten-tons? Here is an expanded discussion of what’s going on…