cluster split failing

I tried running this over the weekend, but it hung (no kill error, but no writes to the logfile in 48 hrs):

cluster.split(fasta=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta, count=fancher.all.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=5, cutoff=0.15, processors=16, classic=T)

So I killed it and ran:

cluster.split(fasta=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta, count=fancher.all.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=5, cutoff=0.15, processors=16, classic=T, cluster=f)

followed by:

cluster.split(file=current, processors=1)
Using fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.file as input file for the file parameter.

Using 1 processors.
Using splitmethod distance.

Reading fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta.6.phylip.dist
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     ||[ERROR]: 57449 is not in your count table. Please correct.

Making the distance files seems to work, but it chokes on clustering the first distance file. That file is 11 GB, and I have ~150 GB available.

All the commands that got me to this point:

summary.seqs(fasta=fancher.all.trim.contigs.fasta, processors=16)
screen.seqs(fasta=current, group=fancher.all.contigs.groups, summary=current, maxambig=0, maxlength=280)
summary.seqs(fasta=current)
unique.seqs(fasta=current)
summary.seqs(fasta=current, name=current)
count.seqs(name=current, group=current)
align.seqs(fasta=current, reference=silva.v4.fasta)
summary.seqs(fasta=current, count=current)
screen.seqs(fasta=current, count=current, summary=current, start=8, end=9582, maxhomop=8)
filter.seqs(fasta=current, vertical=T)
summary.seqs(fasta=current, count=current)
pre.cluster(fasta=current, diffs=3, count=current)
summary.seqs(fasta=current, count=current)
chimera.uchime(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current, count=current)
summary.seqs(fasta=current, count=current)
classify.seqs(fasta=current, count=current, reference=trainset10_082014.pds.fasta, taxonomy=trainset10_082014.pds.tax, cutoff=60)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)

Is 57449 a real sequence name in your dataset?
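A quick grep on the fasta and count table should tell you, something like this (file names taken from your command above):

grep -c "57449" fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta
grep -c "57449" fancher.all.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.pick.count_table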

Hi Sarah

I’m sure it’s not a sequence name (these are MiSeq data), but I can’t figure out how it would have gotten into the count table.
Kendra

I pulled a few of the sequences; they are pretty divergent, 75-80% identity to anything in RefSeq (but 100% matches in nr). I reran the whole process, starting with the trim.contigs.fasta. I no longer have a sequence called “57449” in either the fasta or the count table. Around midnight (17 hrs ago) it finished making the distance files and started clustering, and it has not written to the logfile since. This is what happened last weekend too. This distance file is only 12 GB, so it shouldn’t be causing any problems, but the sequences are all very divergent from each other: when I head/tail the dist file they are ~0.25-0.35 from each other.

Help?

It sounds like you could be running into this issue: Large dist.seqs producing corrupt files?

I’ll try shortening the names, but the dist matrix that is causing all the problems is only 11 GB. Also, I’ve been running it to generate the LT matrix rather than the 3-column format. Should I change back to 3-column?

That worked!! I shortened the names (just dropped the sequencer info) and switched back to the 3-column dist, so I’m not sure which change solved the problem, but it finished clustering overnight.
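In case it helps anyone else: as far as I can tell, switching back to the 3-column distances just means dropping classic=T, so the command would look roughly like the original one without that option (same files as above):

cluster.split(fasta=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta, count=fancher.all.trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=fancher.all.trim.contigs.good.unique.good.filter.precluster.pick.pds.wang.pick.taxonomy, splitmethod=classify, taxlevel=5, cutoff=0.15, processors=16)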

We are adding a feature to rename.seqs in the next version to shorten the names for you to help with this issue. https://github.com/mothur/mothur/issues/132

Hi. I’m having the same problem with cluster.split. May I ask (since I’m new at this) how you did that? Did you load the file (which one, the count_table or the taxonomy file?) into TextWrangler and just remove the sequencer info?

Thanks in advance – I’ve got a problem with the dist.seqs command (see separate post), and with my sample size I’d really like cluster.split to work.

I used sed to remove the sequencer info.

Here’s an example of one of my sequence names:
M00704_13_000000000-ACVUN_1_1101_13836_1721

sed 's/M00704_13_000000000-//g' input.fasta > output.fasta

**I can’t remember whether I used the double quotes in the command or not; every time I use sed I have to figure that out (for what it’s worth, they shouldn’t be needed inside the pattern). You could get rid of the flow cell ID as well (ACVUN in this case), but I use the flow cell ID for tracking sequence runs.
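One caution if you go this route: whatever rename you do has to be applied consistently to every file that carries the sequence names (the fasta, the count_table, and the taxonomy file in this case), otherwise mothur will complain about names missing from the count table, like the error earlier in this thread. A rough sketch, with placeholder file names:

sed 's/M00704_13_000000000-//g' input.fasta > output.fasta
sed 's/M00704_13_000000000-//g' input.count_table > output.count_table
sed 's/M00704_13_000000000-//g' input.taxonomy > output.taxonomy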