Hello, I’ve been using cluster.split with taxonomy to try to cluster my dataset (166k unique sequences, containing about 223 Order-level taxa). I’m running on a PC with 8 processors.
It appears to finish after about 13 hours and writes the output file, but there are still 7 small count_table temp files left over in the directory. When I try to run make.shared, I get an error message saying:
“Your group file contains 166381 sequences and list file contains 166360 sequences. Please correct.”
There are 21 read names in the temp files, so I assume there was an unflagged problem during merging.
Is there any way to recover those reads without rerunning the whole clustering step?
Is this something I can prevent from happening again in the future?
Thank you.
It looks like we have a small bug in the split by distance option. Here’s a workaround for your current dataset. The .temp count files contain reads that should be singletons in your list file. Rather than rerunning the entire cluster.split command, you can manually add the remaining 21 reads to your list file as singletons.
NOTE: Be sure to update the number of OTUs in the second column of the list file, and to add an OTUxxxx label to the header line for each new OTU added.
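For illustration only, here is a minimal sketch of what that edit looks like. The label (0.03), OTU count (16636), and read names (seqA, leftoverRead1, etc.) are all hypothetical placeholders; your own list file will have different values and many more columns.

Before (tab-separated, columns truncated):
```
label	numOtus	Otu00001	Otu00002	...
0.03	16636	seqA,seqB	seqC	...
```
After appending one leftover read (here called leftoverRead1) as a new singleton OTU, with numOtus bumped by one and a matching label added to the header:
```
label	numOtus	Otu00001	Otu00002	...	Otu16637
0.03	16637	seqA,seqB	seqC	...	leftoverRead1
```
Repeat the same pattern for each of the 21 leftover reads.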
The splitting-by-distance method is the slowest option for cluster.split. We recommend using the fasta, taxonomy, and count options for the cluster.split command, as that approach takes advantage of parallel processing during the splitting step. In the future, try something like this instead:
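The file names below are placeholders for your own fasta, count_table, and taxonomy files, and the taxlevel and cutoff values are just commonly used settings, not a prescription for your dataset:
```
cluster.split(fasta=yourFile.fasta, count=yourFile.count_table, taxonomy=yourFile.taxonomy, taxlevel=4, cutoff=0.03, processors=8)
```
With the fasta, count, and taxonomy files provided, mothur splits the sequences by taxonomy first and only computes distances within each taxonomic bin, so the splitting and clustering steps can run in parallel across your 8 processors.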