cluster.split

Hi,
I am having a problem following the MiSeq SOP with data generated by sequencing the V4 region of the 16S rRNA gene. My data set has about 1.2M reads after make.contigs, with over 800,000 unique sequences. Every time I try to run cluster.split the computer crashes after a while, even when I try to subsample the data. The analysis has been running on a computer with 8 processors for almost a week now and has still not finished. The cluster density was 475K clusters/mm² and 3% PhiX was detected (loaded at 5%).
What would be the consequences of subsampling the data after the first screen.seqs command in the SOP?
Many thanks.
Marcio.

Forgot to mention: we are using the 2x250 kit.

What Illumina and mothur software versions are you using?

Illumina RTA v1.17.28; MCS v2.2

mothur v.1.30.2 on Mac OS X 10.6 64-bit, but I have also tried the 64-bit versions for Windows and Linux (the Linux run is still going).

After you get through pre.cluster and removing chimeras how many unique sequences do you have left? Have you tried splitting at a lower taxonomic level? Did you happen to run a mock community?

Hi Pat,

When I try to open the logfile it only says: “linux version”. It’s been running for a week now. Should I stop it?
I tried splitting at levels 5 and 6 on my Mac, but it still couldn’t handle it. I didn’t use a mock community for this run…
Thanks.

How many unique reads after removing chimeras?

I’ve got 396,649 unique sequences out of 922,771.

Assuming you’re following the SOP exactly, I’m not sure what the problem might be. We’ve had no problems analyzing multiple runs together with cluster.split. I guess if it doesn’t finish, then you might just be left with phylotyping your data.

Pat
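
For a rough sense of scale, here is a small sketch of how many pairwise comparisons 396,649 unique sequences imply, and how restricting comparisons to within taxonomic bins reduces that number. It only counts pairs. It says nothing about how mothur actually stores distances or how much memory it uses, and the bin sizes are invented purely for illustration.

```python
def pair_count(n):
    """Number of unordered pairs among n unique sequences."""
    return n * (n - 1) // 2


def split_pair_count(bin_sizes):
    """Pairs to compare when sequences are only compared within their own bin."""
    return sum(pair_count(size) for size in bin_sizes)


# Unique sequences reported in the thread after chimera removal.
n_unique = 396_649
all_pairs = pair_count(n_unique)

# Hypothetical split into four taxonomic bins (sizes are made up and
# chosen only so that they add up to n_unique).
example_bins = [200_000, 100_000, 60_000, 36_649]
within_bins = split_pair_count(example_bins)

print(f"all pairs:          {all_pairs:,}")
print(f"within-bin pairs:   {within_bins:,}")
print(f"fraction remaining: {within_bins / all_pairs:.2f}")
```

The more evenly the sequences spread across bins, the larger the saving, which is one reason a lower taxonomic level can help when a single bin dominates.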

Hi,

I’ve repeatedly run into this error message. Initially I thought it was because I was running out of memory, but now I am running the analysis on a much more powerful computer. Is there any obvious reason for it?

Thanks again.

mothur > cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)


Using 12 processors.
Using splitmethod fasta.
Splitting the file...
ERROR: M00307_4_000000000-A2D4C_1_1101_10034_23311 is missing from your fastafile. This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate and error.

I also get errors about the same file when I run summary.seqs:

[ERROR]: ‘M00307_5_000000000-A439A_1_2112_22523_17065’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_1102_15694_19026’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_1108_6190_18091’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_2102_23385_20316’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_1111_18611_22216’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2108_15670_11145’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_2105_18893_23087’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_2111_14470_28074’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2103_7878_16270’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2108_21587_23231’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2110_7640_15956’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2107_5725_15887’ is not in your name or count file, please correct.

You are running…

cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

You want to run…

cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

You were using the taxonomy file generated by remove.lineage, but the fasta and count_table files you supplied were the input to remove.lineage rather than its output.

Pat
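
A mismatch like this can be caught before launching a long run by comparing the sequence names in the three files. The following is a minimal sketch, not part of mothur. It assumes a plain fasta file with ">name" header lines, a tab-delimited count_table whose header row starts with "Representative_Sequence", and a tab-delimited taxonomy file with the sequence name in the first column. The function names are made up for this example.

```python
def fasta_ids(path):
    """Collect sequence names from the '>' header lines of a fasta file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def first_column_ids(path):
    """Collect names from the first tab-delimited column.

    Blank lines, '#' comment lines, and a count_table header row
    (assumed to start with 'Representative_Sequence') are skipped.
    """
    ids = set()
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip() or line.startswith("#"):
                continue
            name = line.split("\t")[0].strip()
            if name == "Representative_Sequence":
                continue
            ids.add(name)
    return ids


def compare_ids(fasta_path, count_path, taxonomy_path):
    """Report names that appear in one file but not in another."""
    fasta = fasta_ids(fasta_path)
    count = first_column_ids(count_path)
    tax = first_column_ids(taxonomy_path)
    return {
        "in_taxonomy_not_fasta": tax - fasta,
        "in_fasta_not_taxonomy": fasta - tax,
        "in_fasta_not_count": fasta - count,
        "in_count_not_fasta": count - fasta,
    }
```

Calling compare_ids with the three file names passed to cluster.split and checking that every set in the result is empty should flag the case where the files come from different steps of the pipeline.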

Thanks a lot Pat.

Hi again Pat,

Do you see any problems with sub-sampling groups (let’s say to 10,000 reads/group) after chimera removal in order to make my files smaller before running cluster.split?

Thanks.

Eh, it’s not ideal since you could pick the “right” / “wrong” 10,000. You could do it a few times to make sure the answers agree, but I’d really suggest double checking your error rates and/or getting a computer with more RAM.

Pat
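
The "do it a few times" idea can be sketched outside mothur as follows. This is only an illustration under stated assumptions: a group is represented as a mapping from sequence name to abundance, reads are drawn without replacement, and the number of distinct sequences retained is compared across draws. The function names and the toy data are invented for the example, and in practice the comparison would be made on whatever downstream result matters.

```python
import random
from collections import Counter


def subsample_counts(counts, size, seed):
    """Draw `size` reads without replacement from a {name: abundance} mapping."""
    pool = [name for name, n in counts.items() for _ in range(n)]
    if size > len(pool):
        raise ValueError("group has fewer reads than the requested subsample size")
    rng = random.Random(seed)  # a seeded generator keeps each draw reproducible
    return Counter(rng.sample(pool, size))


def richness_across_draws(counts, size, seeds):
    """Number of distinct sequences retained in each independent draw."""
    return [len(subsample_counts(counts, size, seed)) for seed in seeds]


# Toy group: a few abundant sequences plus many singletons (invented data).
toy_group = {"seqA": 500, "seqB": 300, "seqC": 100}
toy_group.update({f"rare{i}": 1 for i in range(100)})

draws = richness_across_draws(toy_group, size=200, seeds=[1, 2, 3, 4, 5])
print("distinct sequences per draw:", draws)
```

If the values vary a lot from draw to draw, a single subsample is probably not representative enough to trust.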