My laboratory group is running a soil 16S analysis. We used a MiSeq v2 kit (500 cycles) with an output of 7.5 GB.
We ran the sequences in mothur, using the following pipeline:
As you can see, ~77% of all sequences are ‘unique’. So we decided to check the quality of the sequences with seqyclean, using Phred quality thresholds of 20 and 30.
When running the same pipeline above with a subset of the sequences after quality trimming in seqyclean, we got:
Welcome to soils. Keep following the SOP, especially the pre.cluster step. Does that get you down to a manageable number? I had ~700k pre.clustered “uniques” that I managed to cluster with cluster.split, but just barely (because the alpha proteos are so abundant and diverse), by using a single processor on a high-memory (128 GB) machine.
The first time I ran the SOP, I got 3,859,378 unique sequences (from a total of 5,259,281). After pre.cluster, there were 3,001,495 unique seqs, and after removing the chimeras, there were 2,076,182 ‘uniques’.
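To put those counts in perspective, here is a minimal sketch that works out what fraction of the reads is still ‘unique’ after each step. The numbers are the ones quoted above; the function name `summarize` and the step labels are only illustrative, not part of mothur.

```python
# Counts quoted in the post; labels are illustrative, not mothur output.
total_reads = 5_259_281
steps = [
    ("unique.seqs", 3_859_378),
    ("pre.cluster", 3_001_495),
    ("chimera removal", 2_076_182),
]


def summarize(total, steps):
    """Return (name, uniques, fraction of total, fraction of previous step)."""
    rows = []
    previous = None
    for name, uniques in steps:
        frac_total = uniques / total
        # The first step has no previous step to compare against.
        frac_prev = None if previous is None else uniques / previous
        rows.append((name, uniques, frac_total, frac_prev))
        previous = uniques
    return rows


for name, uniques, frac_total, frac_prev in summarize(total_reads, steps):
    prev_txt = "n/a" if frac_prev is None else f"{frac_prev:.1%}"
    print(f"{name:16s} {uniques:>10,d}  {frac_total:.1%} of reads  {prev_txt} of previous step")
```

Even after chimera removal, well over a third of the original reads remain as separate ‘uniques’, which is why the clustering step is so heavy.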
I then tried to run cluster.split:
I’m unsure if the parameters used in cluster.split are correct or whether it’s possible to run this with ~2M unique sequences…
Do you think that using phylotypes is the only way to go?
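For a rough sense of scale, the sketch below counts how many pairwise distances a full all-against-all comparison would need for a given number of unique sequences. This is only an upper bound under a simplifying assumption: cluster.split divides the sequences into groups first and only distances below the cutoff are kept, so the real workload should be smaller. The helper name `pairwise_distances` is hypothetical.

```python
def pairwise_distances(n_unique):
    """Number of distinct sequence pairs among n_unique sequences: n * (n - 1) / 2."""
    return n_unique * (n_unique - 1) // 2


# Counts taken from this thread: ~2M uniques here vs. ~700k in the reply above.
for label, n in [("this dataset", 2_076_182), ("~700k uniques", 700_000)]:
    print(f"{label:15s} {n:>10,d} uniques -> {pairwise_distances(n):.3e} possible pairs")
```

Because the count grows with the square of the number of uniques, tripling the uniques means roughly nine times as many possible pairs.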
First, your cutoff should be higher, like 0.10 or 0.15. I don’t think that large=T is actually helpful, but maybe Pat or Sarah will weigh in on that. If that still hangs (possible because of certain very abundant groups), I’d change to taxlevel=5 before I’d go phylotype. Actually, I’d use a greedy algorithm (uclust, etc.) before I’d go with phylotype.
Illumina has definitely been having some quality issues with their V2 chemistry over the past few months. If you call and badger them about it and have your read metrics handy, they may be willing to provide you with a replacement kit or to help troubleshoot what is going on.