Clearcut Syntax Error: Distance Matrix Issue in

Hi, how can I fix this?
The command is running on a cluster computer with 400 GB of RAM.

mothur 1.48.0

mothur > clearcut(phylip=current)
Using stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.phylip.dist as input file for the phylip parameter.
Clearcut: Syntax error in distance matrix at offset 25.

Can you post the first 5 or so lines of the distance matrix?

Pat

head stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.phylip.dist
82017
852240700170067925
852240700166174793 0.2421
85224070017627201 0.1535 0.2126
852240700168816867 0.01632 0.2324 0.1419
852240700176135264 0.3083 0.2482 0.2749 0.2937
852240700176748695 0.2721 0.2475 0.2469 0.2598 0.285
852240700167327312 0.2028 0.2494 0.1837 0.1958 0.318 0.2843
852240700173014210 0.3382 0.2745 0.3007 0.3284 0.2143 0.2948 0.2941
85224070017397912 0.3286 0.2867 0.3138 0.331 0.3212 0.3056 0.3192 0.3325

make.shared(list=current, count=current, label=0.03);
classify.otu(list=current, count=current, taxonomy=current, label=0.03);
dist.seqs(fasta=current, output=lt);
clearcut(phylip=current)

Can you tell me what running the following returns?

wc -l stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.phylip.dist

wc -l stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.phylip.dist

82018 stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.phylip.dist

I am not able to reproduce the issue with the MiSeq_SOP dataset. The error message suggests there are junk or hidden characters in the first line or two of the file. Could the file have gotten corrupted? If you want to send the log file and distance matrix to mothur.westcott@gmail.com, I can take a look for you.
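In case it helps anyone checking for hidden characters themselves, here is a minimal Python sketch that reports any byte near the start of a file that is not printable ASCII, a tab, or a newline. The function name `find_suspicious_bytes` and the 200-byte window are my own choices for illustration, not anything from mothur or clearcut.

```python
def find_suspicious_bytes(data: bytes, limit: int = 200):
    """Return (offset, byte value) pairs for unexpected bytes.

    Only the first `limit` bytes are checked. A byte is treated as
    expected if it is printable ASCII (0x20-0x7E), a tab, or a newline.
    Carriage returns and byte-order marks are reported.
    """
    allowed_controls = {0x09, 0x0A}  # tab and newline
    return [
        (offset, value)
        for offset, value in enumerate(data[:limit])
        if not (0x20 <= value <= 0x7E or value in allowed_controls)
    ]


def scan_file(path: str, limit: int = 200):
    """Read the first `limit` bytes of `path` in binary mode and scan them."""
    with open(path, "rb") as handle:
        return find_suspicious_bytes(handle.read(limit), limit)
```

Calling `scan_file` on the .dist file returns an empty list when the start of the file contains only plain text, so a clean result points away from file corruption.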

Thank you, I have sent the email.


Thanks for sending your files. I was able to find the source of the issue. The clearcut command uses code originally developed by Luke Sheneman for the clearcut program. That code expects sequence names to include non-numeric characters. Your sequence names consist only of digits, so they are being misinterpreted as distances in the matrix, and the command reports a corrupted file at offset 25 (the first sequence name). To resolve this issue, you can rename your sequences so that they include non-numeric characters. You can do this with the rename.seqs command as follows:

mothur > rename.seqs(fasta=current, count=current, delim="_")

The above command will create sequence names like: number_sampleName. So a sequence like 852240700170067925 belonging to sample1 would become 1_sample1. The rename.seqs command creates a map file you can use to restore the original names. To restore the names to the originals, run the command below.

mothur > rename.seqs(fasta=current, map=mapFileCreatedByFirstRenameseqs)
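To check whether a matrix is affected before running clearcut, something like the following Python sketch could be used. It assumes the lower-triangle layout shown earlier in this thread (a sequence count on the first line, then one row per sequence starting with its name); `numeric_only_names` is a hypothetical helper and is not part of mothur.

```python
def numeric_only_names(lines):
    """Yield sequence names that consist entirely of digits.

    `lines` is any iterable of text lines from a lower-triangle PHYLIP
    distance matrix, for example an open file handle.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header line holding the sequence count
    for row in rows:
        fields = row.split()
        # The first whitespace-separated field on each row is the name.
        if fields and fields[0].isdigit():
            yield fields[0]


# Rows adapted from the matrix posted above, plus one renamed sequence.
sample = [
    "82017",
    "852240700170067925",
    "852240700166174793 0.2421",
    "1_sample1 0.1535 0.2126",
]
print(list(numeric_only_names(sample)))
```

Any name this reports is one that the clearcut code would likely read as a distance. Names in the number_sampleName form produced by rename.seqs are not reported because they contain an underscore and letters.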

Hi again - it looks like your distance matrix has 82k sequences in it and is using tons of RAM. It’s taking more than 24 hours to just read in the distance matrix. It’s likely just too big to be processed by clearcut. I’d strongly suggest you use an OTU or phylotype based approach instead. I have yet to find a case where an OTU-based approach using something like Bray-Curtis disagreed with one of the UniFrac commands.

Pat
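To give a sense of scale, a lower-triangle matrix for n sequences holds n(n-1)/2 distances. The short Python sketch below works that out for the 82017 sequences in this matrix. The bytes-per-value figure is purely an assumption for illustration and says nothing about how clearcut actually stores the matrix in memory.

```python
def lower_triangle_pairs(n_seqs: int) -> int:
    """Number of pairwise distances in a lower-triangle matrix."""
    return n_seqs * (n_seqs - 1) // 2


def estimated_bytes(n_seqs: int, bytes_per_value: int = 4) -> int:
    """Rough size of the distances alone.

    bytes_per_value is an assumed figure, not clearcut's real storage.
    """
    return lower_triangle_pairs(n_seqs) * bytes_per_value


n = 82017  # sequence count from the first line of the matrix
print(lower_triangle_pairs(n))  # 3363353136 pairwise distances
print(estimated_bytes(n) / 1e9)  # gigabytes under the 4-byte assumption
```

That is over 3.3 billion distances before any working memory is counted, which fits with the matrix taking so long to read in.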