MiSeq dataset (amplicons) takes too long to process

I have a MiSeq dataset, which consists of 17 million paired reads.
They are from 370 amplicon samples generated with Nextera barcode primers.
After assembling contigs and doing some basic cleaning I have about 6,500,000 reads split amongst my 370 samples.
Unique seqs only brings this number down to about 2,500,000 reads.
These are fungal ITS sequences by the way.
I don’t generally align fungal ITS sequences because it’s such a variable region, and I am also afraid that doing so would take forever.
I thought I could reduce my analysis load for clustering and blasting by first running pre.cluster with an allowed 2bp difference (my contigs are about 350-400 bp long).
It’s been running for over 3 full days now on 48 cores of a 64-core cluster. It seems to be nearing completion (based on the number of sequences I get when I run “cat run2stability.trim.contigs.good.unique.precluster.*map | wc -l”), but it is slowing down a lot, and I wonder if I am doing something wrong or if there is something I am missing.
Is there a faster way to do this?

The MiSeq SOP seems built for very small datasets (128,000 seqs), one or two orders of magnitude smaller than what one ought to expect from a MiSeq run.

You have a couple of things going on. First, because you’re sequencing the ITS region, I suspect you do not have fully overlapping reads. If you don’t have fully overlapping reads your error rates will be huge (http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/). This will artificially inflate the number of unique sequences. Second, ITS sequences aren’t really homologous to each other and do not lend themselves well to an alignment-based approach. You might try running pre.cluster with unaligned sequences and use align=needleman and diffs=2 (or 3). This will generate pairwise alignments, and should go a bit faster. Also, be sure that you are using a group or count file when you run this.
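As a sketch, an unaligned pre.cluster run along those lines might look like the following (the file names here are placeholders for whatever your own pipeline has produced; adjust diffs and processors to your data and hardware):

```
pre.cluster(fasta=stability.trim.contigs.good.unique.fasta, count=stability.trim.contigs.good.count_table, align=needleman, diffs=2, processors=8)
```

Passing the count_table is what lets pre.cluster process each sample's sequences separately, which is both faster and the intended behavior.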


Thanks very much for the response PSchloss.
However, I did run:
pre.cluster(fasta=run2stability.trim.contigs.good.unique.fasta, count=run2stability.trim.contigs.good.count_table, align=needleman, diffs=2, processors=48)
and it took 5 days to complete (this is a 376-sample dataset, run on 48 cores of a 64-core server)
and it reduced 2.6 million unique sequences to 1.3 million sequences
(chimera checker subsequently removed about 50 thousand)

Any idea why pre.cluster took 5 days?
Is it simply not appropriate for the ITS region, or any unaligned dataset?

I suspect it has a lot to do with the first problem then - data quality and your error rate. You might try using pre.cluster with a larger diff value and then treating each output sequence as its own OTU.
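A minimal sketch of that approach might look like the following (file names are placeholders, and the diffs value of 3 is just an example; make.shared run with only a count file, which treats each remaining unique sequence as its own OTU, is available in newer mothur releases, so check your version's documentation):

```
pre.cluster(fasta=stability.trim.contigs.good.unique.fasta, count=stability.trim.contigs.good.count_table, align=needleman, diffs=3)
make.shared(count=stability.trim.contigs.good.unique.precluster.count_table)
```

This skips the full cluster() step entirely, which is where an unaligned dataset of this size would otherwise get very expensive.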