MiSeq dataset (amplicons) takes too long to process

I have a MiSeq dataset consisting of 17 million paired reads.
They come from 370 amplicon samples generated with Nextera barcoded primers.
After assembling contigs and doing some basic cleaning, I have about 6,500,000 reads split among my 370 samples.
Running unique.seqs only brings this down to about 2,500,000 unique sequences.
These are fungal ITS sequences, by the way.
I don’t generally align fungal ITS sequences because the region is so variable, and I am also afraid that doing so would take forever.
I thought I could reduce my analysis load for clustering and BLASTing by first running pre.cluster with an allowed difference of 2 bp (my contigs are about 350-400 bp long).
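For reference, here is roughly what I ran before this point (file names match my dataset; the screen.seqs thresholds are from memory, so treat them as approximate):

# assemble the paired reads into contigs
make.contigs(file=run2stability.files, processors=48)
# basic cleaning: drop contigs with ambiguous bases or unexpected lengths
screen.seqs(fasta=run2stability.trim.contigs.fasta, group=run2stability.contigs.groups, maxambig=0, maxlength=450)
# collapse to unique sequences and build the count table
unique.seqs(fasta=run2stability.trim.contigs.good.fasta)
count.seqs(name=run2stability.trim.contigs.good.names, group=run2stability.contigs.good.groups)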
BUT PRE.CLUSTER TAKES FOREVER
It has been running for over 3 full days now on 48 cores of a 64-core cluster.
While it seems to be nearing completion (based on the number of sequences I get when I run "cat run2stability.trim.contigs.good.unique.precluster.*map | wc -l"), it is slowing down a lot, and I wonder if I am doing something wrong or if there is something I am missing.
Is there a faster way to do this?

The MiSeq SOP seems built for tiny, tiny datasets (128,000 seqs), one or two orders of magnitude smaller than what one ought to expect from a MiSeq run.

You have a couple of things going on. First, because you’re sequencing the ITS region, I suspect you do not have fully overlapping reads. If you don’t have fully overlapping reads, your error rates will be huge (http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/). This will artificially inflate the number of unique sequences. Second, ITS sequences aren’t really homologous to each other and do not lend themselves well to an alignment-based approach. You might try running pre.cluster with unaligned sequences, using align=needleman and diffs=2 (or 3). This will generate pairwise alignments and should go a bit faster. Also, be sure that you are using a group or count file when you run this.
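For example, something along these lines (using the file names from your cat command above; adjust as needed):

pre.cluster(fasta=run2stability.trim.contigs.good.unique.fasta, count=run2stability.trim.contigs.good.count_table, align=needleman, diffs=2)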

Pat

Thanks very much for the response, PSchloss.
However, I did run:
pre.cluster(fasta=run2stability.trim.contigs.good.unique.fasta, count=run2stability.trim.contigs.good.count_table, align=needleman, diffs=2, processors=48)
and it took 5 days to complete (this is a 376-sample dataset, run on 48 cores of a 64-core server),
and it reduced 2.6 million unique sequences to 1.3 million sequences
(chimera checking subsequently removed about 50,000 of those).

Any idea why pre.cluster took 5 days?
Is it simply not appropriate for the ITS region, or for unaligned datasets in general?

I suspect it has a lot to do with the first problem, then: data quality and your error rate. You might try running pre.cluster with a larger diffs value and then treating each output sequence as its own OTU.
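For example (diffs=4 here is only an illustration; pick a value that makes sense for your read lengths and error profile):

pre.cluster(fasta=run2stability.trim.contigs.good.unique.fasta, count=run2stability.trim.contigs.good.count_table, align=needleman, diffs=4, processors=48)

The sequences left in the resulting fasta and count_table could then be treated as your OTUs directly, skipping dist.seqs and cluster entirely.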

Pat