cluster.split

costamc · May 9, 2013, 7:23pm

Hi,
I am having problems following the MiSeq SOP for data generated by sequencing the V4 region of the 16S gene. My data set has about 1.2M reads after make.contigs, with over 800,000 unique sequences. Every time I try to run cluster.split the computer crashes after a while, even when I try to subsample the data. The analysis has been running on a computer with 8 processors for almost a week now and has still not finished. The cluster density was 475K clusters/mm2 and 3% PhiX was detected (loaded at 5%).
What would be the consequences of subsampling the data after the first screen.seqs command in the SOP?
Many thanks.
Marcio.

costamc · May 9, 2013, 7:25pm

forgot to mention we are using the 2x250 kit.

pschloss · May 9, 2013, 7:55pm

What Illumina and mothur software versions are you using?

costamc · May 10, 2013, 3:27am

Illumina RTA v1.17.28; MCS v2.2

mothur v.1.30.2 on Mac OS X 10.6 64 bit, but have also tried the 64 bit for windows and linux (the last one is still running).

pschloss · May 10, 2013, 11:22am

After you get through pre.cluster and removing chimeras how many unique sequences do you have left? Have you tried splitting at a lower taxonomic level? Did you happen to run a mock community?

costamc · May 10, 2013, 6:46pm

Hi Pat,

When I try to open the logfile it only says: “linux version”. It’s been running for a week now. Should I stop it?
I tried splitting at levels 5 and 6 on my Mac, but it still couldn’t handle it. I didn’t use a mock community for this run…
Thanks.

pschloss · May 10, 2013, 7:44pm

How many unique reads after removing chimeras?

costamc · May 13, 2013, 3:39am

I’ve got 396,649 unique sequences out of 922,771.

pschloss · May 13, 2013, 1:12pm

Assuming you’re following the SOP exactly, I’m not sure what the problem might be. We’ve had no problems analyzing multiple runs together with cluster.split. I guess if it doesn’t finish, then you might just be left with phylotyping your data.

Pat

costamc · June 25, 2013, 2:59pm

Hi,

I’ve repeatedly run into the error message below. Initially I thought it was because I was running out of memory, but now I am running the analysis on a much more powerful computer. Is there any obvious reason for it?

Thanks again.

mothur > cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

Using 12 processors.
Using splitmethod fasta.
Splitting the file...
ERROR: M00307_4_000000000-A2D4C_1_1101_10034_23311 is missing from your fastafile. This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate and error.

I get an error about the same file when I run summary.seqs:

[ERROR]: ‘M00307_5_000000000-A439A_1_2112_22523_17065’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_1102_15694_19026’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_1108_6190_18091’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_2102_23385_20316’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_1111_18611_22216’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2108_15670_11145’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_2105_18893_23087’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_5_000000000-A439A_1_2111_14470_28074’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2103_7878_16270’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2108_21587_23231’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2110_7640_15956’ is not in your name or count file, please correct.
[ERROR]: ‘M00307_4_000000000-A2D4C_1_2107_5725_15887’ is not in your name or count file, please correct.

pschloss · June 25, 2013, 6:14pm

You are running…

cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

You want to run…

cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

You were using the taxonomy file generated by remove.lineage, but the fasta and count_table files you gave it were the input to remove.lineage rather than its output, so the three files no longer contain the same set of sequences.

Pat
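
A mismatch like this can be caught before launching cluster.split by comparing the sequence names in the three files. The following is a minimal Python sketch, not part of mothur. It assumes the fasta names are the first token after ">", that the count_table has a single header row with the sequence name in the first tab-delimited column, and that the taxonomy file has the sequence name in its first column. The function names are hypothetical.

```python
def fasta_names(path):
    """Return the set of sequence names from the '>' header lines of a fasta file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the name is the first whitespace-delimited token.
                tokens = line[1:].split()
                if tokens:
                    names.add(tokens[0])
    return names


def first_column_names(path, skip_header=False):
    """Return the set of names found in the first tab-delimited column."""
    names = set()
    with open(path) as handle:
        for index, line in enumerate(handle):
            if skip_header and index == 0:
                continue  # assumption: a count_table starts with one header row
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            names.add(line.split("\t")[0])
    return names


def compare_names(fasta, count_table, taxonomy):
    """Report sequence names that are present in one file but absent from another."""
    in_fasta = fasta_names(fasta)
    in_count = first_column_names(count_table, skip_header=True)
    in_taxonomy = first_column_names(taxonomy)
    return {
        "in_taxonomy_not_fasta": in_taxonomy - in_fasta,
        "in_fasta_not_taxonomy": in_fasta - in_taxonomy,
        "in_fasta_not_count": in_fasta - in_count,
        "in_count_not_fasta": in_count - in_fasta,
    }
```

Calling compare_names with the fasta, count and taxonomy files that are about to be passed to cluster.split should return four empty sets when the files come from the same step of the pipeline. Any non-empty set suggests that one of the files is from an earlier or later step.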

costamc · June 26, 2013, 12:35pm

Thanks a lot Pat.

costamc · July 11, 2013, 4:04pm

Hi again Pat,

Do you see any problems with sub-sampling groups (let’s say to 10,000 reads/group) after chimera removal in order to make my files smaller before running cluster.split?

Thanks.

pschloss · July 15, 2013, 2:23pm

Eh, it’s not ideal since you could pick the “right” / “wrong” 10,000. You could do it a few times to make sure the answers agree, but I’d really suggest double checking your error rates and/or getting a computer with more RAM.

Pat
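
The suggestion to subsample a few times and check that the answers agree can be illustrated with a small Python sketch. This is not mothur's sub.sample implementation. It assumes the reads of one group are available as a hypothetical dictionary mapping sequence name to read count, and it uses the number of distinct sequences per draw only as a stand-in for whatever downstream result is being compared.

```python
import random


def subsample_counts(counts, n, seed):
    """Draw n reads without replacement from {sequence_name: read_count}."""
    rng = random.Random(seed)
    # Expand to one entry per read so every read is equally likely to be drawn.
    reads = [name for name, count in counts.items() for _ in range(count)]
    if n > len(reads):
        raise ValueError("group has fewer reads than the requested subsample size")
    subsample = {}
    for name in rng.sample(reads, n):
        subsample[name] = subsample.get(name, 0) + 1
    return subsample


def unique_counts_across_draws(counts, n, seeds):
    """Number of distinct sequences observed in each independent subsample."""
    return [len(subsample_counts(counts, n, seed)) for seed in seeds]
```

If the values returned by unique_counts_across_draws vary widely between seeds, a single subsample of that size is probably not representative of the group.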
