cluster.split method fasta or classify

What’s the difference between the 2 methods? I have a very large fasta that I want to split by taxonomy before constructing the distance matrix

Here’s what I’m running, it’s been processing 5 days:
cluster.split(fasta=bc_bac.good.align, name=bc_bac.good.names, taxonomy=bc_bac.pick.tax, splitmethod=fasta, taxlevel=3, processors=2)

Also, I know I’ve seen something that give guidelines for how long clustering will take with X sequences but can’t refind that page on the Wiki. Help? I have followed the SOP (unique, precluster, cluster.split, etc) but still have 700k uniques so think that I have to move the clustering off my desktop. To do that I need to estimate time for clustering and memory required.

http://www.mothur.org/wiki/Cluster_stats

What kind of sequence data are these? Are you running shhh.flows, screen.seqs, filter.seqs(trump=.)???

Bac V1-3 from soil. I’ve run (with mothur SOP suggested defaults except where noted):

sffinfo
shhh.flows *minflows=320, maxflows=450 because my amplicons are fully sequenced in ~350-400 flows
trim.seqs *keepfirst=200
unique.seqs
align.seqs *silva
screen.seqs *start=1044
filter.seqs
unique.seqs
pre.cluster
chimera.uchime
classify.seqs
remove.lineage

I started with ~2.5M reads (from 600 soils) and have ~700k after all of that to run through cluster.split

I would set maxflows to 320 and get rid of the keep first step. shhh.flows has a hard time denoising when it has varying length of flowgrams. What are the exact commands you are running (with options)?

for n in *.flow; do o=`basename $n .flow`.oligos; mothur "# trim.flows(flow=$n, oligos=$o, pdiffs=2, bdiffs=1, processors=2, minflows=300, maxflows=450)"; done

for n in *.flow.files; do mothur "# shhh.flows(file=$n, processors=4)"; done

for n in ?????????.shhh.fasta; do o=`basename $n .shhh.fasta`.oligos; m=`basename $n .fasta`.names; mothur "# trim.seqs(fasta=$n, oligos=$o, name=$m,  pdiffs=2, bdiffs=1, maxhomop=8, minlength=200)"; done

for n in ?????????.shhh.trim.fasta; do m=`basename $n .fasta`.names; mothur "# unique.seqs(fasta=$n, name=$m)"; done

for n in ?????????.shhh.trim.unique.fasta; do mothur "# align.seqs(fasta=$n, reference=silva.bacteria.fasta, flip=T, processors=4)"; done

for n in ?????????.shhh.trim.unique.align; do mothur "# summary.seqs(fasta=$n)"; done  ### bac start 1044 97.5% end by 6111

for n in *.shhh.trim.unique.align; do g=`basename $n .trim.unique.align`.groups; m=`basename $n .align`.names; mothur "#screen.seqs(fasta=$n, name=$m, group=$g, start=1044, processors=4)"; done

for n in *.shhh.trim.unique.good.align; do mothur "#filter.seqs(fasta=$n, vertical=T)"; done

for n in *.shhh.trim.unique.good.filter.fasta; do m=`basename $n .filter.fasta`.names; mothur "#unique.seqs(fasta=$n, name=$m)"; done

for n in *.shhh.trim.unique.good.filter.unique.precluster.fasta; do m=`basename $n .trim.unique.good.filter.unique.precluster.fasta`.trim.unique.good.filter.unique.precluster.names; g=`basename $n .trim.unique.good.filter.unique.precluster.fasta`.good.groups; mothur "#chimera.uchime(fasta=$n, name=$m, group=$g, processors=4)"; done

for n in *.shhh.trim.unique.good.filter.unique.precluster.uchime.accnos; do m=`basename $n .trim.unique.good.filter.unique.precluster.uchime.accnos`.trim.unique.good.filter.unique.precluster.names; f=`basename $n .trim.unique.good.filter.unique.precluster.uchime.accnos`.trim.unique.good.filter.unique.precluster.fasta; g=`basename $n .trim.unique.good.filter.unique.precluster.uchime.accnos`.good.groups; mothur "#remove.seqs(accnos=$n, fasta=$f, name=$m, group=$g)"; done

for n in *.shhh.trim.unique.good.filter.unique.precluster.pick.fasta; do m=`basename $n .trim.unique.good.filter.unique.precluster.pick.fasta`.trim.unique.good.filter.unique.precluster.pick.names; g=`basename $n .trim.unique.good.filter.unique.precluster.pick.fasta`.good.pick.groups; mothur "#classify.seqs(fasta=$n, name=$m, group=$g, template=silva.bacteria.fasta, taxonomy=silva.bacteria.silva.tax, cutoff=50, processors=2)"; done

for n in *.shhh.trim.unique.good.filter.unique.precluster.pick.fasta; do m=`basename $n .trim.unique.good.filter.unique.precluster.pick.fasta`.trim.unique.good.filter.unique.precluster.pick.names; g=`basename $n .trim.unique.good.filter.unique.precluster.pick.fasta`.good.pick.groups; t=`basename $n .trim.unique.good.filter.unique.precluster.pick.fasta`.trim.unique.good.filter.unique.precluster.pick.silva.taxonomy; mothur "#remove.lineage(fasta=$n, name=$m, group=$g, taxonomy=$t, taxon=Mitochondria-Cyanobacteria_Chloroplast-Archaea-Eukarya-unknown)"; done

cat *.pick.pick.names > bc_bac.names
cat *.pick.pick.fasta > bc_bac.fasta
cat *.pick.taxonomy > bc_bac.tax
cat *.pick.pick.groups > bc_bac.groups

#switch to mothur now that I'm down to one set of file
## need to realign because they're not all the same length
align.seqs(fasta=bc_bac.fasta, reference=silva.bacteria.fasta, processors=4)
screen.seqs(fasta=current, name=current, group=bc_bac.groups, start=1044, processors=4)
filter.seqs(fasta=bc_bac.good.align, vertical=T)
cluster.split(fasta=bc_bac.good.align, name=bc_bac.good.names, taxonomy=bc_bac.pick.tax, countends=F, splitmethod=fasta, taxlevel=3, processors=2)

oy. this seems overly complicated. why not pool everything after trim.seqs? regardless, before realigning everything…

degap.seqs(fasta=bc_bac.fasta)
unique.seqs(fasta=bc_bac.ng.fasta, name=bc_bac.names)
align.seqs…

then how many sequences do you have?

arg, my reply got eaten. anyway I tried that and another precluster which got my fasta down to 360k sequences. I’m trying to cluster.split that now.

ok cluster.split with a cutoff=0.03 worked and I now have a otu matrix.

cluster.split(fasta=bc_bac.ng.unique.good.filter.trim.unique.good.filter.unique.precluster.fasta, name=bc_bac.ng.unique.good.filter.trim.unique.good.filter.unique.precluster.names, taxonomy=bc_bac.pick.tax, splitmethod=fasta, taxlevel=3, cutoff=0.03, processors=4)

Can you clarify-the distance matrix that command generated doesn’t have distances greater that 3% correct? If I also wan to look at 6 or 10% OTU’s I need to rerun the cluster.split with method=fasta, cutoff=0.1, rather than use the distance matrix I’ve generated and method=classify?

actually, the cluster.split from above only generated unique and 0.01 clusters? I rechecked my fasta and it seems correct. What is going wrong here?

mothur > summary.seqs(fasta=bc_bac.final2.fasta, name=bc_bac.final2.names, processors=4)


Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 8 4 0 1 1
2.5%-tile: 1 1004 200 0 3 48565
25%-tile: 1 1053 200 0 4 485645
Median: 1 1089 200 0 4 971290
75%-tile: 1 1121 200 0 5 1456934
97.5%-tile: 1 1129 200 0 6 1894014
Maximum: 1 1166 200 0 8 1942578
Mean: 1 1099.95 194.756 0 4.35654

of unique seqs: 361382

total # of seqs: 1942578