Cluster error --

Hi all

I’m running the MiSeq Mothur protocol (iMac, macOS Sierra, Mothur 1.38, oodles of disk space) on a set of ~80 sputum samples from patients with cystic fibrosis. Everything was going great until I hit the cluster command, which gives this error message:

[ERROR]: HWI-HWI-M04771_47_000000000-AUJLE_1_1105_19621_21354 is not in your count table. Please correct.

I then tried the cluster.split command and had the exact same error message. Both commands appear to execute properly until the error message hits. The dist file is about 207.6 GB; the count table is about 19 MB.

So I’m at a loss as to what to do. Thoughts?

Thanks in advance!

Let me add: the sequence noted in the error message indeed is not in the count table. Would it be as simple as adding it, along with values of ‘0’ for each of the samples?

So I took my own advice and added that sequence to the count table. I then re-ran cluster:

cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)

For which I got this error message:

[ERROR]: HWI-M04771_47_000000000AUJLE_1_1106_15430_14650 is not in your count table. Please correct.

So I could see doing this repeatedly, one at a time. The dist file is 207.6 GB so I sure can’t load that into a text editor. Thoughts?
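Rather than fixing names one error at a time, the dist file can be streamed line by line and every name that is absent from the count table reported in a single pass. The following is a minimal sketch, not part of Mothur: the function names are hypothetical, and it assumes the count table has one header line followed by lines that start with the sequence name, and that the column-format dist file holds "nameA nameB distance" on each line.

```python
def count_table_names(lines):
    """Return the set of sequence names in a count table.

    Assumes the first line is a header and every later line starts
    with the sequence name, followed by whitespace-separated counts.
    """
    names = set()
    for index, line in enumerate(lines):
        fields = line.split()
        if index == 0 or not fields:
            continue  # skip the header and any blank lines
        names.add(fields[0])
    return names


def names_missing_from_count_table(dist_lines, known_names):
    """Return names used in a column-format dist file but absent from known_names.

    Assumes each dist line is "nameA nameB distance". Lines are read one
    at a time, so the whole file is never held in memory.
    """
    missing = set()
    for line in dist_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        for name in fields[:2]:
            if name not in known_names:
                missing.add(name)
    return missing


# Example with real files (the file names are placeholders):
# with open("my.count_table") as counts, open("my.dist") as dist:
#     known = count_table_names(counts)
#     for name in sorted(names_missing_from_count_table(dist, known)):
#         print(name)
```

Because an open file is iterated one line at a time, this should work on a dist file of any size, although a single pass over roughly 200 GB will still take a while.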

Another update: in inspecting the count table, after the header line each line of the table (>153000 lines in mine) looks like this:

HWI-M04771_47_000000000-AUJLE_1_1105_19621_21354 1 0 0 …

Where there is an identifier for the sequence followed by the total count (here, 1) and the count for each sample.

The error message I received was:

[ERROR]: HWI-HWI-M04771_47_000000000-AUJLE_1_1105_19621_21354 is not in your count table. Please correct.

Notice the doubled “HWI-HWI”? I checked the count table and there are no entries with “HWI-HWI”. So now I’m wondering if the issue isn’t a missing sequence but a bug somewhere.
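One quick way to test that hunch is to collapse the repeated leading token and see whether the result matches a count-table entry. A minimal sketch follows; the function name is hypothetical, and it assumes the doubling always takes the form of the first hyphen-delimited token appearing twice in a row.

```python
def collapse_doubled_prefix(name):
    """Return name with a repeated leading hyphen-delimited token removed.

    Returns None when the first two tokens differ, i.e. when there is
    no doubled prefix to collapse.
    """
    tokens = name.split("-")
    if len(tokens) >= 2 and tokens[0] and tokens[0] == tokens[1]:
        return "-".join(tokens[1:])
    return None


reported = "HWI-HWI-M04771_47_000000000-AUJLE_1_1105_19621_21354"
in_count_table = "HWI-M04771_47_000000000-AUJLE_1_1105_19621_21354"

# The collapsed name equals the entry that is present in the count table.
print(collapse_doubled_prefix(reported) == in_count_table)
```

If the collapsed name is in the count table, the sequence itself is not missing and the name was altered somewhere upstream of the cluster step.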

Search for that seq in your dist file; you could try deleting it from the dist file.

Ok. It’s a 207 GB file; it’s not going to fit in any text editor I have. Suggestions?

Also, I am re-running cluster.split to see if perhaps I goofed up the first time.

I’d always suggest cluster.split for any next gen datasets.

You can use sed to find the offending sequence. Something like this will remove every line where it’s found:

sed -e '/F4Q4SKU0…/d' fung.dist > fung1.dist
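The same line-by-line filtering can be sketched in Python for anyone who would rather match the name exactly than as a pattern. This is an illustration only: the function name is hypothetical, and it assumes a column-format dist file with "nameA nameB distance" on each line. Unlike a sed pattern, it drops a line only when one of the first two fields equals the name exactly, so a name that merely contains the target as a substring is left alone.

```python
def drop_sequence(lines, name):
    """Yield the dist lines that do not involve the given sequence name.

    A line is dropped only when its first or second field equals name
    exactly. Lines are processed one at a time.
    """
    for line in lines:
        fields = line.split()
        if name in fields[:2]:
            continue  # this pairwise distance involves the unwanted sequence
        yield line


# Example with real files (the file names are placeholders):
# with open("in.dist") as source, open("out.dist", "w") as target:
#     target.writelines(drop_sequence(source, "name_to_remove"))
```

Writing to a new file, as the sed command does, keeps the original dist file intact in case the filter removes more than intended.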

I’ll take a look.

One quick further question: how long does it take for cluster.split to work with very large dist files? It generated the various smaller dist and temp files fairly quickly but now has been sitting for quite a while with no (obvious) activity. How long do I give it?

Unfortunately, you can’t know that until you’ve done it. Generally, when clustering >100 samples on 4 processors (my server has 512 GB of RAM), it takes between 12 hours and 3 days for the whole SOP to run.

Hi, I have the same problem. I am running cluster.split after following the exact protocol from the MiSeq SOP; the only difference is that I’m using 18S. This is the command:

cluster.split(fasta=889trim.contigs.good.unique.good.filter.precluster.pick.pick.fasta,count=889trim.contigs.good.unique.good.filter.precluster.denovo.uchime.pick.pick.count_table,taxonomy=889trim.contigs.good.unique.good.filter.precluster.pick.nr_v123.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.2, processors=8)

I have 2591884 sequences in total, of which just 537348 are unique.

After running for 3 days, it stops without producing an output file, and the only error I get is this:

[ERROR]: M03540_58_000000000-M03540_58_000000000-AK3J9_1_2107_9877_3118 is not in your count table. Please correct.

So cluster.split took just under 2 days (2 processors, 24 GB) for me. Perhaps in the future the command could have some sort of progress indicator to let the user know that it’s working and that Mothur hasn’t crashed. Just a thought.