Reducing OTU inflation due to misalignments

Hi,

I was wondering what you think about this issue. A reviewer of a paper I published some time ago suggested that I should have inspected the aligned sequence data matrix and manually removed misalignments that were inflating the total number of OTUs in my dataset. Indeed, it seemed that OTUs called as distinct at 97% identity in fact had 100% identical representative sequences, and were separated only because of a misalignment.

Instead of doing this manually, I created a loop that picks representative OTU sequences, realigns them with MUSCLE, and then runs mothur again (classify.seqs, remove.lineage, summary.tax, dist.seqs, cluster, make.shared, classify.otu, and finally get.oturep), feeding the new representative set back into MUSCLE. I found that looping this several times substantially reduces the number of OTUs, until they reach a final number that cannot be reduced further.

Has this issue been discussed and would you think this is indeed a valuable step to add to the analysis?

Best,
Joanito

That sounds pretty tortuous. We’ve shown that MUSCLE-generated alignments are not as good as those from align.seqs.

If the sequences were identical, then they should have been merged in the unique.seqs step. If there was some weird trimming-related issue going on, you could use degap.seqs, unique.seqs, and align.seqs to get them to agree better.
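
Something along these lines should do it (the file names here are placeholders for your own):

mothur "#degap.seqs(fasta=your.align); unique.seqs(fasta=current, count=your.count_table); align.seqs(fasta=current, reference=silva.v4.fasta)"

That strips the old gap characters, collapses anything that is now identical, and realigns everything against the same reference.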

Even better would be to find sequences that aligned poorly, identify their reference sequence, and then correct the reference sequence so the problems don’t propagate.
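
summary.seqs is usually the quickest way to flag the poorly aligned ones (again, placeholder file names):

mothur "#summary.seqs(fasta=your.align, count=your.count_table)"

Sequences whose start and end coordinates fall well away from where the bulk of the reads sit are the likely culprits, and the .align.report file from align.seqs will tell you which reference template each of them was aligned to.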

Pat

Dear Pat,

Thank you for your kind reply. I am not sure we are talking about the same thing. What I meant is that when you follow the MiSeq SOP tutorial (aligning with align.seqs), the number of OTUs you get after clustering is inflated: sequences that are less than 3% different end up in different OTUs because of misalignments (you are right that they cannot be 100% identical because of the unique.seqs step, but they are less than 3% different). So if you then pick a representative sequence set with get.oturep, realign it (either manually or with the loop I proposed), recluster, re-pick OTUs, and so on, the number of OTUs decreases.

Today, for example, I analyzed a small dataset. I followed all the steps in the SOP up to OTU picking and ended up with 96 OTUs. By picking the representative set, realigning, reclustering, etc., 20 times over, I ended up with 53 OTUs. This is a substantial drop!

As I mentioned before, this was suggested by a reviewer, who also pointed me to this paper, which first noticed the issue and dealt with it manually: https://doi.org/10.1111/mec.12607 — see the methods section, where they say:

“Operational taxonomic units (OTUs) were identified at 97% sequence similarity using the nearest neighbour option. However, we noticed that misaligned sequences in the data matrix seemed to inflate OTU number. At this stage, singletons (i.e. OTUs with only one read in the entire data set) were removed from the alignment. The remaining representative sequences were then manually adjusted to establish a quality alignment (Appendix S1, Supporting information). Finally, this realigned data set was used for new OTU clustering at 97% sequence similarity”

I guess my question is whether there would be any reason for not doing something like this.

Best,
Joanito

When you recluster, are you including the frequency of the reads, or are you treating them as singletons?

Dear Pat,

I think I do use the frequency of the reads, since I pass the count table (the count= option) to the various commands.

This is the script I’m using (I start right after picking the representative set from my OTU table):

muscle -in stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.opti_mcc.0.03.rep.fasta -out realignedRepSetMuscle.fasta -maxiters 2

PREFIX=realignedRepSetMuscle
FASTA=realignedRepSetMuscle.fasta
COUNT=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.opti_mcc.0.03.rep.count_table
TAXONOMY=path_to_folder/trainset16_022016rdp/trainset16_022016.rdp.tax
REFERENCE=path_to_folder/trainset16_022016rdp/trainset16_022016.rdp.fasta
TAXCUTOFF=80
NPROC=4

for i in {1..20}; do

# classify, remove unwanted lineages, re-cluster at 0.03, and pick new representatives
mothur "#set.dir(input=path_to_folder, output=path_to_folder); \
  classify.seqs(fasta=${FASTA}, count=${COUNT}, taxonomy=${TAXONOMY}, reference=${REFERENCE}, cutoff=${TAXCUTOFF}, processors=${NPROC}); \
  remove.lineage(fasta=${FASTA}, count=${COUNT}, taxonomy=${PREFIX}.rdp.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota); \
  summary.tax(taxonomy=current, count=current); \
  dist.seqs(fasta=current, cutoff=0.03); \
  cluster(column=current, count=current); \
  make.shared(list=current, count=current, label=0.03); \
  classify.otu(list=current, count=current, taxonomy=current, label=0.03); \
  get.oturep(column=current, list=current, count=current, fasta=current)" &&

# realign the new representative set for the next iteration
muscle -in realignedRepSetMuscle.pick.opti_mcc.0.03.rep.fasta -out realignedRepSetMuscle.fasta -maxiters 2 &&

COUNT=${PREFIX}.pick.opti_mcc.0.03.rep.count_table;

done

Would this make sense?

I am now also reading something else: this paper https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.12700 suggests removing singleton reads (not OTUs!) during the early stages of the analysis. Maybe this is actually a more efficient approach than mine? Do you think this makes sense and could help reduce the substantial inflation one often gets? And if one wants to do this within the mothur pipeline, when should one do it, and with which command? I was thinking it could be done right after the first unique.seqs call (after making contigs and screening sequences), but I am not sure which command could do this.
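
To make the question concrete, I was imagining something like this right after the first unique.seqs (assuming split.abund is indeed the right command, and that the file names follow the SOP):

mothur "#split.abund(fasta=stability.trim.contigs.good.unique.fasta, count=stability.trim.contigs.good.count_table, cutoff=1)"

If I understand it correctly, the *.abund.* output files would then carry on through the rest of the pipeline, while the *.rare.* files would hold the singleton reads.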

Thanks a lot for your kind help!

Best,
Joanito

Removing singletons is a pretty horrible idea at any stage as it will mess with the abundance distribution of your communities.

Hello, I’m new here and I was reading this thread. I hope I can contribute in some way.

Pat, you said that singleton removal would mess with the abundance distribution. Isn’t that what we want, though? To have an abundance distribution without singletons? (Usually.)

I have been struggling with this for a long time in mothur, and what I do is remove singletons from the last shared file created. Since not all singletons can be removed (because OTUs are shared between samples), it’s a nice way to reduce data size. Then, and I don’t know if I’m doing this right, I tabulate the data in Excel, manually remove singletons (by filtering), and plot the data.
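
In case it helps, I think mothur’s remove.rare can do this filtering directly on the shared file instead of Excel (if I have the syntax right; the shared file name here is just an example):

mothur "#remove.rare(shared=final.opti_mcc.shared, nseqs=1)"

which should drop the OTUs with only one read. I am not sure this is any better than what I do by hand, though.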

Why would you want an abundance distribution without singletons? You’re going to remove a lot of the diversity in a sample, and if your samples aren’t evenly sequenced, you’re going to create a bias.