split.abund (too many singletons and doubletons)!!


I have a real problem, as one-third of the OTUs that I defined seem to be singletons and doubletons. Maybe they are chimeras that were not detected? Who knows. But I really do want to remove those OTUs and see how the downstream analysis changes.
I don’t know how to do it. I found this command and tried it with my final.fasta file, but I ended up with more OTUs than before! Also, because the taxonomy, names and group files are unchanged, all of the originally defined OTUs get classified, not only the ones present in the shared file.

At which step of the SOP should I use the split.abund command? Or is there another way to get rid of the OTUs with 1-2 seqs and get the classification of only the final OTUs in the normalized shared file?

please, help! :roll: :oops: :geek:

I have a real problem, as one-third of the OTUs that I defined seem to be singletons and doubletons. Maybe they are chimeras that were not detected?

Why is this a problem? Just because something is rare doesn’t mean it’s bad (and just because it’s abundant doesn’t mean it’s good!).

If you insist on removing singletons and doubletons (we don’t), it would probably be better to use filter.shared.


Hi Pat
I understand what you mean and agree, but in this case I only want some insight into how the analysis changes when I keep or remove the singletons, or the singletons plus doubletons. I want to see whether the differences I am looking for between my samples, related to the metadata, are the same when I look only at the abundant OTUs, or whether they depend on the rare OTUs.

I started with a total of 13500 seqs (it is a subsampled dataset of 500 seqs per sample, used to find the best parameters for the analysis), and after filtering and improving the dataset I ended up with:

Final Fasta file=6239 unique seqs
total # of seqs: 8275
total # of singletons+doubletons: 2468

When I defined OTUs at a distance of 0.03, I got 2243 OTUs. I consider that quite a lot of rare populations given the type of samples I am working with, so I wanted to repeat the analysis in parallel using only the OTUs with 2 or more sequences, and only those with 3 or more, without losing the abundance and classification information for the “non-rare” OTUs.

I tried filter.shared as you suggested, but I have two further questions. Should I subsample (normalize) the shared file before or after running filter.shared? And what effect will allowing the “rareotus” to be present have on the later alpha diversity analysis? If I really want to remove the OTUs with only one or two sequences, and not have them represented as rare populations in the diversity indexes, should I add makerare=F?

Thanks a lot for your help! :slight_smile:
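
To make the makerare question concrete, here is a minimal Python sketch (not mothur code) of the two behaviours being asked about. It assumes that filtering drops every OTU whose total count across all samples is below a threshold, and that the rare-bin option pools the dropped counts into one extra column per sample. The function name filter_rare_otus, the toy table and the "rareOTUs" label are illustrative assumptions; the filter.shared documentation is the place to confirm what mothur actually does.

```python
def filter_rare_otus(shared, min_total=3, make_rare=False):
    """Drop OTUs whose total count across samples is below min_total.

    shared maps sample name -> {OTU name -> read count}.
    If make_rare is True, the dropped reads are pooled into a single
    "rareOTUs" bin per sample (hypothetical label) instead of discarded.
    """
    # Total abundance of each OTU over all samples.
    totals = {}
    for counts in shared.values():
        for otu, n in counts.items():
            totals[otu] = totals.get(otu, 0) + n

    keep = {otu for otu, total in totals.items() if total >= min_total}

    filtered = {}
    for sample, counts in shared.items():
        kept = {otu: n for otu, n in counts.items() if otu in keep}
        if make_rare:
            kept["rareOTUs"] = sum(n for otu, n in counts.items() if otu not in keep)
        filtered[sample] = kept
    return filtered


# Toy table: Otu2 is a singleton and Otu3 a doubleton across the dataset.
shared = {
    "sampleA": {"Otu1": 5, "Otu2": 1, "Otu3": 0, "Otu4": 2},
    "sampleB": {"Otu1": 3, "Otu2": 0, "Otu3": 2, "Otu4": 2},
}

dropped = filter_rare_otus(shared, min_total=3, make_rare=False)
binned = filter_rare_otus(shared, min_total=3, make_rare=True)

print({s: sum(c.values()) for s, c in dropped.items()})
print({s: sum(c.values()) for s, c in binned.items()})
```

In this toy table, pooling into a rare bin leaves each sample with its original read total (8 and 7), while discarding the rare OTUs shrinks the totals unevenly (7 and 5). That uneven shrinkage is why discarding rare OTUs after normalizing can leave samples with different numbers of reads.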

I would run filter.shared after sub.sample.


I tried that, and the filtered shared file had a different number of sequences per sample, so I had to normalize it again. The final normalized shared file had 150 seqs per sample and 664 OTUs in total.
If I do it the other way around (I had already tried both before asking for help), the final shared file (filtered first, then normalized for the first and only time) had 168 seqs per sample and 851 OTUs in total.
Conceptually, what difference do I have to consider between the two approaches? I mean, from a biological and statistical point of view, what would be wrong with filtering first and then subsampling?
Sorry, but I am trying to understand what I am doing and why one way is right and the other is not, or gives different results. :oops:
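
The mechanics behind the two sets of numbers can be sketched in Python. This is only a toy illustration under stated assumptions, not mothur's implementation: subsampling is modelled as drawing reads without replacement, filtering as dropping OTUs with fewer than 3 reads in total and no rare bin, and all names (drop_rare, subsample, the toy samples) are hypothetical. It says nothing about which order is statistically preferable; it only shows why the two orders give different tables.

```python
import random


def otu_totals(shared):
    """Total read count of each OTU across all samples."""
    totals = {}
    for counts in shared.values():
        for otu, n in counts.items():
            totals[otu] = totals.get(otu, 0) + n
    return totals


def drop_rare(shared, min_total=3):
    """Discard OTUs with fewer than min_total reads overall (no rare bin)."""
    totals = otu_totals(shared)
    return {
        sample: {otu: n for otu, n in counts.items() if totals[otu] >= min_total}
        for sample, counts in shared.items()
    }


def subsample(shared, size, rng):
    """Draw `size` reads per sample without replacement."""
    out = {}
    for sample, counts in shared.items():
        reads = [otu for otu, n in counts.items() for _ in range(n)]
        picked = rng.sample(reads, size)
        out[sample] = {otu: picked.count(otu) for otu in set(picked)}
    return out


def sample_totals(shared):
    return {sample: sum(counts.values()) for sample, counts in shared.items()}


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Toy data: 30 reads per sample, with many singleton OTUs.
shared = {
    "sampleA": {"Otu1": 20, **{f"A_single{i}": 1 for i in range(10)}},
    "sampleB": {"Otu1": 10, "Otu2": 15, **{f"B_single{i}": 1 for i in range(5)}},
}

# Order 1: subsample to 20 reads, then filter.
# The filter removes a different number of reads from each sample.
sub_then_filter_totals = sample_totals(drop_rare(subsample(shared, 20, rng)))

# Order 2: filter, then subsample to the smallest remaining sample.
filtered = drop_rare(shared)
size = min(sample_totals(filtered).values())
filter_then_sub_totals = sample_totals(subsample(filtered, size, rng))

print("subsample then filter:", sub_then_filter_totals)
print("filter then subsample:", filter_then_sub_totals)
```

In the first order the rare reads are drawn during subsampling and thrown away afterwards, so each sample ends up at or below the subsampling depth and a second normalization is needed. In the second order the rare reads never compete for a place in the subsample, so the samples stay even and a different set of OTUs survives.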

Like I said, I wouldn’t advocate doing what you’re doing in the first place, so it’s kind of hard for me to give you guidance on this.