Hi, how can I fix this?
The command runs on a cluster computer with 400 GB of RAM.
mothur 1.48.0
mothur > clearcut(phylip=current)
Using stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.phylip.dist as input file for the phylip parameter.
Clearcut: Syntax error in distance matrix at offset 25.
I am not able to reproduce the issue with the MiSeq_SOP dataset. The error message indicates there are junk / hidden characters in the first line or two of the file. Could the file have gotten corrupted? If you want to send the log file and distance matrix to mothur.westcott@gmail.com I can take a look for you.
Thanks for sending your files. I was able to find the source of the issue. The clearcut command uses code originally developed by Luke Sheneman for the clearcut program. The code expects sequence names to include non-numeric characters. Your sequence names are purely numeric, so they are being misinterpreted as distances in the matrix, and the command assumes a corrupted file at offset 25 (the first sequence name). To resolve this issue you can rename your sequences to include non-numeric characters. You can do this with the rename.seqs command as follows:
The above command will create sequence names like: number_sampleName. So a sequence like 852240700170067925 belonging to sample1 would become 1_sample1. The rename.seqs command also creates a map file that you can use to restore the original names. To restore them, run the command below.
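As a side check, separate from the mothur commands themselves, you could confirm that numeric-only names are what is tripping the parser before renaming anything. The Python sketch below is not part of mothur or clearcut. It assumes a phylip-formatted distance file whose first line is the sequence count and whose remaining lines start with the sequence name. The function name `numeric_only_names` and the number pattern are illustrative choices, not a reproduction of clearcut's own tokenizer.

```python
import io
import re

# Assumption: a name is "numeric" if it looks like a plain integer,
# decimal, or exponent-style number. clearcut's real rules may differ.
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")

def numeric_only_names(handle):
    """Return the sequence names that consist only of a number."""
    flagged = []
    next(handle, None)  # skip the first line (sequence count)
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        name = fields[0]  # assumed: name is the first field on the line
        if NUMBER.fullmatch(name):
            flagged.append(name)
    return flagged

# Tiny in-memory example in lower-triangle layout (made-up distances).
sample = io.StringIO(
    "3\n"
    "852240700170067925\n"
    "seqB\t0.0100\n"
    "1_sample1\t0.0200\t0.0300\n"
)
print(numeric_only_names(sample))
```

To run it against a real file, pass an open file handle instead of the `io.StringIO` sample. Any names it reports are candidates for the misparse described above.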
Hi again - it looks like your distance matrix has 82k sequences in it and is using tons of RAM. It’s taking more than 24 hours just to read in the distance matrix. It’s likely just too big to be processed by clearcut. I’d strongly suggest you use an OTU- or phylotype-based approach instead. I have yet to find a case where an OTU-based approach using something like Bray-Curtis disagreed with one of the UniFrac commands.
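For a rough sense of scale, here is a back-of-the-envelope Python sketch of how many pairwise distances a matrix of that size holds. It takes "82k" as exactly 82,000 sequences and assumes 4 or 8 bytes per stored distance. Both are assumptions for illustration, and the result says nothing about how clearcut actually allocates memory or how long parsing the text file takes.

```python
def matrix_entries(n, square=False):
    """Number of distances for n sequences (lower triangle or full square)."""
    return n * n if square else n * (n - 1) // 2

n = 82_000  # assumed exact value for "82k"
for label, square in (("lower triangle", False), ("square", True)):
    entries = matrix_entries(n, square)
    for bytes_per_value in (4, 8):  # assumed float / double storage
        gib = entries * bytes_per_value / 2**30
        print(f"{label}: {entries:,} distances, "
              f"{gib:.1f} GiB at {bytes_per_value} bytes each")
```

Even the lower triangle alone is several billion values, which is consistent with the long read time described above.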