dist.seqs incorporating taxonomic splitting

I’d like to try splitting my aligned and classified sequences by phylum (a la your 2011 AEM paper), but I can’t find the option for that. Is there one, or do I need to split the fasta outside of mothur using the taxonomy file? If it has to be done outside of mothur, do you have a Perl script you’d share to do that? Thanks

http://www.mothur.org/wiki/Cluster.split
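If you do want to split the fasta by phylum outside of mothur, here is a rough Python sketch rather than a tested pipeline. It assumes the usual mothur taxonomy layout (sequence name, whitespace, then a semicolon-separated lineage with optional confidence scores in parentheses) and assumes phylum is the second level of the lineage, which depends on your reference taxonomy. The function names are made up for illustration, and the names file is not handled here.

```python
import re
from collections import defaultdict


def read_taxonomy(lines, level=2):
    """Map sequence name -> taxon at `level` (assumed: 1 = domain, 2 = phylum)."""
    taxon_of = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: name <whitespace> Rank1(100);Rank2(98);...;
        name, lineage = line.split(None, 1)
        # Drop trailing confidence scores such as "(100)" from each rank.
        ranks = [re.sub(r"\(\d+\)$", "", r) for r in lineage.split(";") if r]
        taxon_of[name] = ranks[level - 1] if len(ranks) >= level else "unclassified"
    return taxon_of


def split_fasta(fasta_lines, taxon_of):
    """Group fasta lines by the taxon of the record they belong to."""
    groups = defaultdict(list)
    current = None
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            # Sequences missing from the taxonomy go to "unclassified".
            current = taxon_of.get(name, "unclassified")
        if current is not None:
            groups[current].append(line)
    return dict(groups)
```

To use it on real files, pass open file handles to both functions and write each group to its own fasta, for example one file per phylum name.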

Thanks :)

Hey Sarah and Pat

I finally got around to running this but am confused by the output. I have bc_bac.dist.#.temp and bc_bac.names.#.temp, where # is 0-61. The original bc_bac.dist was 351GB; bc_bac.dist.0.temp is 260GB, while the remaining 61 are much more reasonable sizes. Are the 0-61 .temp files the distance matrices and names files for the different phyla, or is there something special about the .0.temp?

thanks
Kendra

here’s what I ran:

mothur > cluster.split(column=bc_bac.dist, name=bc_bac.names, taxonomy=bc_bac.taxonomy, large=T, splitmethod=classify)
Splitting the file…
It took 163320 seconds to split the distance file.

Reading bc_bac.dist.0.temp

I’m still stuck on this and would appreciate guidance on what the cluster.split output should look like.

Don’t use the large option, it’s a waste. How about trying:

cluster.split(fasta=bc_bac.fasta, name=bc_bac.names, taxonomy=bc_bac.taxonomy, splitmethod=classify, processors=???)

Thanks Pat. So you wouldn’t bother splitting the distance matrix even though I’ve already calculated it? Just go back to the fasta and calculate the distances for each phylum separately?

Right - it’s actually much, much faster to calculate the distance matrices on individual taxa than to calculate one matrix for everything and then split it.
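One likely reason for the difference is simply the number of pairwise comparisons: a single matrix over n sequences needs n(n-1)/2 distances, while per-taxon matrices only need the within-taxon pairs (and you also skip reading and splitting a huge distance file). A small Python sketch of that arithmetic follows; the group sizes are invented for illustration and are not from this dataset.

```python
def n_pairs(n):
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def pairs_all_together(group_sizes):
    """Pairwise distances needed for one matrix over every sequence."""
    return n_pairs(sum(group_sizes))


def pairs_within_groups(group_sizes):
    """Pairwise distances needed when each taxon gets its own matrix."""
    return sum(n_pairs(n) for n in group_sizes)


# Hypothetical phylum sizes (number of unique sequences per phylum).
sizes = [50000, 30000, 20000]
print(pairs_all_together(sizes))   # 4999950000
print(pairs_within_groups(sizes))  # 1899950000
```

With these made-up sizes the split calculation needs well under half the comparisons, and the saving grows as the sequences are spread over more taxa.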