aligning fungal ITS sequences for pre.cluster?

Hi all,

I’ve been grappling with fungal ITS analysis in mothur for some time…

As stated in other posts, the major drawback during ITS analysis is that ITS alignments don’t make phylogenetic sense past the genus level and thus can’t be used in any analysis steps where aligned fasta files are required.
In principle, a pretty good work-around pipeline is possible with mothur
(e.g. “https://github.com/krmaas/bioinformatics/blob/master/mothur.fungal.batch#L103”);
however, there is one major bottleneck that IMO needs addressing:

When using standard ITS2 primers, fungal amplicons range in length roughly between 300 and 480 nucleotides because insert lengths are highly variable. For longer amplicons, this leaves a large stretch of each paired-end read unchecked by the read from the complementary strand. Together with the usually shocking drop in base call quality with increasing read length (which we’ve come to accept with the unavoidable 2 x 300 kit), this should result in frighteningly high sequencing error rates for fungal ITS sequences. By the way, this will unavoidably bias against longer ITS sequences and thus underrepresent some species.
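
To put rough numbers on the overlap problem, here is a minimal Python sketch. It assumes both reads are exactly 300 nt, start at opposite ends of the amplicon, and are not trimmed; primers, adapters and quality trimming are ignored. The function name `overlap_profile` is made up for illustration.

```python
def overlap_profile(amplicon_len, read_len=300):
    """Return (bases covered by both reads, bases covered by one read only).

    Assumes untrimmed reads of equal length starting at opposite
    ends of the amplicon (a simplification, not what a real run gives).
    """
    if amplicon_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    if amplicon_len > 2 * read_len:
        raise ValueError("reads are too short to meet in the middle")
    # Short amplicons are read through completely by both reads.
    covered_twice = min(amplicon_len, 2 * read_len - amplicon_len)
    covered_once = amplicon_len - covered_twice
    return covered_twice, covered_once


for length in (300, 400, 480):
    both, single = overlap_profile(length)
    print(f"{length} nt amplicon: {both} nt checked by both reads, "
          f"{single} nt by one read only")
```

Under these assumptions a 300 nt amplicon is checked by both reads along its whole length, while a 480 nt amplicon has only 120 nt of overlap and 360 nt that rely on a single read.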

It seems thus all the more important to screen fungal ITS sequences as rigorously as possible.
So, if you are analysing MiSeq or HiSeq runs with highly diverse fungal community samples, including the mothur pre.cluster module should be mandatory: it removes likely sequencing errors by merging sequences that differ by no more than a set number of nucleotides.
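
For readers who haven’t looked at what this kind of pre-clustering does, the idea can be sketched in a few lines of Python. This is a simplified illustration and not mothur’s actual code: it assumes sequences are already aligned to equal length, counts plain positional mismatches, and merges rarer sequences into more abundant ones. The names `count_mismatches`, `pre_cluster_sketch` and the toy data are hypothetical.

```python
def count_mismatches(a, b, limit):
    """Count differing positions in two equal-length aligned sequences,
    giving up early once the count exceeds `limit`."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def pre_cluster_sketch(counts, diffs=2):
    """Merge rare sequences into more abundant ones within `diffs` mismatches.

    `counts` maps aligned sequence -> abundance. Returns a list of
    (sequence, merged abundance), most abundant first.
    """
    # Most abundant first; ties broken by sequence so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    absorbed = [False] * len(ordered)
    merged = []
    for i, (seq_i, n_i) in enumerate(ordered):
        if absorbed[i]:
            continue
        total = n_i
        for j in range(i + 1, len(ordered)):
            if absorbed[j]:
                continue
            seq_j, n_j = ordered[j]
            if len(seq_j) == len(seq_i) and count_mismatches(seq_i, seq_j, diffs) <= diffs:
                total += n_j
                absorbed[j] = True
        merged.append((seq_i, total))
    return merged


# Toy data, invented for illustration only.
toy = {"ACGTACGT": 100, "ACGTACGA": 3, "ACGTTCGA": 1, "TTTTACGT": 50, "GGGGGGGG": 2}
print(pre_cluster_sketch(toy, diffs=1))
```

Note that the comparison is always against the abundant "seed" sequence, so in this sketch errors do not chain: a sequence one mismatch away from an already absorbed sequence is not pulled in unless it is also within `diffs` of the seed.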

With an aligned fasta input file, the pre.cluster module is wickedly fast, but for unaligned ITS sequences the processing time is just too long. As the pre.cluster module takes a one-sample-per-processor approach, I recently tried to pre.cluster 54 highly diverse samples (about 65,000 cleaned-up sequences per sample) out of a 216-sample set on a high-spec 94-CPU server: it took 4 weeks to run. Between the run time, server crashes and hogging valuable processor time, this is just not feasible.

To my understanding, pre.cluster sorts aligned sequences (“the advantage of our approach is that the algorithm works on aligned sequences instead of a distance matrix”) and thus, for an unaligned fasta input file, has to perform a pairwise alignment step, for which processing time seems to increase much faster than linearly with the number of sequences.
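
As a back-of-the-envelope check on that scaling, the sketch below counts how many sequence pairs an all-against-all comparison would involve. It is only an upper bound under the assumption that every sequence is compared with every other one; the real number inside pre.cluster will be lower because merged sequences drop out, and treating the roughly 65,000 sequences per sample as all unique is also an assumption. The function name `pairwise_comparisons` is made up for illustration.

```python
def pairwise_comparisons(n_sequences):
    """Number of distinct pairs among n sequences: n * (n - 1) / 2."""
    if n_sequences < 0:
        raise ValueError("number of sequences cannot be negative")
    return n_sequences * (n_sequences - 1) // 2


for n in (6_500, 65_000):
    print(f"{n:>7,} sequences -> {pairwise_comparisons(n):>14,} possible pairs")
```

Ten times more sequences means roughly a hundred times more pairs, and each pair needs its own pairwise alignment when the input is unaligned, which would go some way towards explaining the run times.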

So my assumption is that, for the pre.cluster module, the phylogenetic validity of an ITS alignment is not the issue; aligned sequences are simply computationally easier to sort. Consequently, to get some sequencing error reduction, a reasonably good aligned ITS fasta file should be valid as an input file, even if the Huse et al. paper claims that pairwise alignment reduces the number of OTUs by 30% over multiple sequence aligners.
I performed an experiment: I cut the UNITE fungal database to the corresponding sequence length with pcr.seqs and aligned it with ClustalO, then used this alignment as a template to align my cleaned-up ITS sequences with align.seqs, cleaned up the output, and then ran pre.cluster, which completed within a few hours.

I assumed that if the alignment was erroneous due to the high ITS variability, this would inflate the cluster output and result in a high number of final OTUs. This was not the case: compared to completed, unaligned pre.cluster runs, the final OTU number was lower. Our artificial community was correctly assembled and replicate samples were highly similar.

Intuitively, a somewhat more conservative output can’t be wrong…
Is this a valid approach?

Cheers,
Tom

I’ve been grappling with this as well (my last project with ~40 samples took 9 days on our cluster). I know some fungal labs that have been using DADA2, but I haven’t played with that yet. Given how little foundation we have for determining what is a useful OTU level for a non-coding region in fungi, I’m not opposed to that approach. The other thing that I’ve been meaning to try is cluster.split (before pre.cluster), but I haven’t worked through that code yet.