cluster split failing

I tried running this over the weekend, but it hung (no kill error, but no writes to the logfile in 48 hrs):

cluster.split(fasta=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta, count=fancher.all.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=5, cutoff=0.15, processors=16, classic=T)

So I killed it and ran:

cluster.split(fasta=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta, count=fancher.all.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=5, cutoff=0.15, processors=16, classic=T, cluster=f)

followed by:

cluster.split(file=current, processors=1)
Using fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.file as input file for the file parameter.

Using 1 processors.
Using splitmethod distance.

Reading fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta.6.phylip.dist
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     ||[ERROR]: 57449 is not in your count table. Please correct.

Making the distance files seems to work, but it chokes on clustering the first distance file. That file is 11 GB, and I have ~150 GB available.

All the commands that got me to this point:

summary.seqs(fasta=fancher.all.trim.contigs.fasta, processors=16)
screen.seqs(fasta=current, group=fancher.all.contigs.groups, summary=current, maxambig=0, maxlength=280)
summary.seqs(fasta=current)
unique.seqs(fasta=current)
summary.seqs(fasta=current, name=current)
count.seqs(name=current, group=current)
align.seqs(fasta=current, reference=silva.v4.fasta)
summary.seqs(fasta=current, count=current)
screen.seqs(fasta=current, count=current, summary=current, start=8, end=9582, maxhomop=8)
filter.seqs(fasta=current, vertical=T)
summary.seqs(fasta=current, count=current)
pre.cluster(fasta=current, diffs=3, count=current)
summary.seqs(fasta=current, count=current)
chimera.uchime(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current, count=current)
summary.seqs(fasta=current, count=current)
classify.seqs(fasta=current, count=current, reference=trainset10_082014.pds.fasta, taxonomy=trainset10_082014.pds.tax, cutoff=60)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)

Is 57449 a real sequence name in your dataset?
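A quick grep on the fasta and count table should tell you, something like this (file names taken from your command above):

grep -c "57449" fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta
grep -c "57449" fancher.all.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.pick.count_table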

Hi Sarah

I’m sure it’s not a sequence name (these are MiSeq data), but I can’t figure out how it would have gotten into the count table.
Kendra

I pulled a few of the sequences; they are pretty divergent, 75-80% identity to anything in RefSeq (but 100% matches in nr). I reran the whole process, starting with the trim.contigs.fasta. I no longer have a sequence called “57449” in either the fasta or the count table. Around midnight (17 hrs ago) it finished making the distance files and started clustering, and it has not written to the logfile since. This is what happened last weekend too. This distance file is only 12 GB, so it shouldn’t be causing any problems, but the sequences are all very divergent from each other: when I head/tail the dist file they are ~0.25-0.35 from each other.

Help?

It sounds like you could be running into this issue: Large dist.seqs producing corrupt files?

I’ll try shortening the names, but the dist matrix that is causing all the problems is only 11 GB. Also, I’ve been running it to generate the LT matrix rather than the 3-column format. Should I change back to 3-column?

That worked!! I shortened the names (just dropped the sequencer info) and switched back to the 3-column dist, so I’m not sure which change solved the problem, but it finished clustering overnight.
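In case it helps anyone else: as far as I can tell, switching back to the 3-column distances just means dropping classic=T, so the command would look roughly like the original one without that option (same files as above):

cluster.split(fasta=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta, count=fancher.all.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=5, cutoff=0.15, processors=16)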

We are adding a feature to rename.seqs in the next version to shorten the names for you to help with this issue. https://github.com/mothur/mothur/issues/132

Hi. I’m having the same problem with cluster.split. May I ask (since I’m new at this) how you did that? Did you load the file (which one, the count_table or the taxonomy file?) into TextWrangler and just remove the sequencer info?

Thanks in advance – I’ve got a problem with the dist.seqs command (see separate post), and with my sample size I’d really like cluster.split to work.

I used sed to remove the sequencer info.

Here’s an example of one of my sequence names:
M00704_13_000000000-ACVUN_1_1101_13836_1721

sed 's/M00704_13_000000000-//g' input.fasta > output.fasta

**I can’t remember whether I used the double quotes in the command or not; every time I use sed I have to figure that out (for what it’s worth, they shouldn’t be needed inside the pattern). You could get rid of the flow cell ID as well (ACVUN in this case), but I use the flow cell ID for tracking sequence runs.
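One caution if you go this route: whatever rename you do has to be applied consistently to every file that carries the sequence names (the fasta, the count_table, and the taxonomy file in this case), otherwise mothur will complain about names missing from the count table, like the error earlier in this thread. A rough sketch, with placeholder file names:

sed 's/M00704_13_000000000-//g' input.fasta > output.fasta
sed 's/M00704_13_000000000-//g' input.count_table > output.count_table
sed 's/M00704_13_000000000-//g' input.taxonomy > output.taxonomy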