Using cluster.split with large data


I am having problems running this command because of the large number of unique sequences I have (776,819). I have tried running it on only one processor to decrease RAM usage, and I have also tried using lower taxonomic levels, but it still crashes. I am currently trying dist.seqs instead, but I doubt this will help, as it is taking a very long time (although it hasn't crashed yet after three days).

I have 64 samples run on a MiSeq with 250 bp reads of the V2-V3 region. I have been following the MiSeq SOP exactly up until this point.

I don’t want to have to use a phylotype-based approach unless I really have to.

Do you have any suggestions? Also, are there any analyses that can use the taxonomically assigned reads directly, without first clustering into OTUs?

I used:

split.abund(fasta=example.fasta, count=example.count_table, cutoff=1)

to remove singletons in a similar situation. You would then proceed to cluster on the resulting abund.fasta and abund.count_table. It is not uncommon in the literature to take a similar approach, as singletons are more likely to be error-containing reads anyway.

If you still have too many uniques, consider using sub.sample() to further reduce your counts.
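Putting the suggestions together, the workaround could be run in a single mothur session from the command line. This is a sketch, not a tested pipeline: the example.* file names follow the command above, split.abund should write example.abund.fasta and example.abund.count_table, and the taxonomy file and taxlevel for cluster.split are assumptions based on a standard MiSeq SOP run (adjust to your own file prefixes):

```
# Remove singletons, then cluster only the remaining abundant uniques.
# File names and taxlevel/cutoff values are illustrative, not prescriptive.
mothur "#split.abund(fasta=example.fasta, count=example.count_table, cutoff=1);
cluster.split(fasta=example.abund.fasta, count=example.abund.count_table, taxonomy=example.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03)"
```

Splitting by taxonomy (splitmethod=classify) keeps each distance matrix small, which is what makes the clustering step fit in RAM once the singletons are gone.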

Also, that is a suspiciously high number of uniques, even with 64 samples. What quality filtering/trimming are you doing prior to running mothur?

Thanks a lot, that completely solved my problem! Using split.abund I went from ~700,000 unique sequences to ~25,000. Much more manageable.