Troubleshooting cluster

Hi, dear friends,
I am facing a problem with a file generated by mothur during the analysis of a data set with almost 800,000 sequences. I analyzed my sequences following the Schloss SOP, but when dist.seqs finished, I got a 140 GB .dist file. When I ran the cluster command, I only got results for the unique level. mothur does not give me results for distances 0.01-0.10.
Please help me! Thanks!

I forgot to mention that I used the furthest neighbor method.
Thank you.

How closely are you following the SOP and what sequencing platform are you using?

Dear Dr. Pschloss, I followed the SOP almost exactly. I used the 454 sequencing platform.

What does “almost” mean? I’m suspicious of your large number of unique sequences.

Yes, Dr. Pschloss, that might be the reason. I think my data has a large number of unique sequences. But if that is true, how can I solve this problem?

What are you changing in the SOP instructions?

I looked at the SOP again and I think I found the problem. I will try again! Thank you so much!

Dear Dr. Pschloss, this time I followed your SOP. But when I run the filter.seqs command, trump=. leaves my sequences with start -1 and end 1, that is, no sequences.
So I tried trump=N, and got start 9 to end 1400. Then I ran unique.seqs and pre.cluster. But when I ran dist.seqs, only 470,000 unique sequences produced an 88 GB .dist file.

You are probably having problems with your settings for the screen.seqs command. Please provide the output of summary.seqs using the fasta/names file that you give to screen.seqs, as well as the parameter values you used in screen.seqs.

Dear Dr. Pschloss, this is the summary of the filter.seqs output:

             Start   End   NBases   Ambigs   Polymer   NumSeqs
Minimum:     0       0     0        0        1         1
2.5%-tile:   -1      -1    0        0        1         13743
25%-tile:    -1      -1    0        0        1         137425
Median:      -1      -1    0        0        1         274849
75%-tile:    -1      -1    0        0        1         412273
97.5%-tile:  -1      -1    0        0        1         535955
Maximum:     -1      -1    0        0        1         549697
Mean:        -1      -1    0        0        1

# of unique seqs: 469433
total # of seqs: 549697

And this is my screen.seqs command:

screen.seqs(fasta=./align/454.trim.cut.unique.align,name=./unique/454.trim.cut.names,group=./trim/454.groups,minlength=200,maxambig=0,maxhomop=8,processors=10, outputdir=./screen/)

I used the same steps to analyze the bacteria. There, more than 600,000 unique sequences produced only a 35 GB dist file.

What is the output of

summary.seqs(fasta=./align/454.trim.cut.unique.align)

Dear Dr. Pschloss, here is my output of summary.seqs(fasta=./align/454.trim.cut.unique.align)
             Start   End       NBases    Ambigs   Polymer   NumSeqs
Minimum:     1044    1076      5         0        2         1
2.5%-tile:   10352   19463     264       0        5         14380
25%-tile:    10352   25318     471       0        5         143792
Median:      10352   25497     485       0        5         287584
75%-tile:    10352   25503     488       0        6         431376
97.5%-tile:  10352   25503     497       0        7         560788
Maximum:     43107   43116     500       0        10        575167
Mean:        10362   24615.6   459.286   0        5.47729

# of unique seqs: 493957
total # of seqs: 575167

Try this…

screen.seqs(fasta=./align/454.trim.cut.unique.align,name=./unique/454.trim.cut.names,group=./trim/454.groups,minlength=200,maxambig=0,maxhomop=8,processors=10, outputdir=./screen/, start=10352)
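The start=10352 value matches the position where nearly all reads begin in the summary.seqs output above (the 2.5%-tile through the 97.5%-tile). As a rough illustration only, the Python sketch below mimics that kind of position filter on a few (start, end) pairs taken from the summary rows. The function name `passes_screen` and the `min_end` threshold are hypothetical; the thread itself only sets start, and this is not mothur's actual implementation.

```python
from typing import Optional

def passes_screen(start: int, end: int, max_start: int,
                  min_end: Optional[int] = None) -> bool:
    """Keep a read only if it starts at or before max_start and,
    when min_end is given, ends at or after min_end."""
    if start > max_start:
        return False
    if min_end is not None and end < min_end:
        return False
    return True

# (start, end) pairs copied from the Median, Maximum and Minimum summary rows.
reads = [(10352, 25497), (43107, 43116), (1044, 1076)]

# Filtering on start alone, as in the command above.
kept_start_only = [r for r in reads if passes_screen(*r, max_start=10352)]
print(kept_start_only)

# Hypothetical extra end threshold (the 2.5%-tile end value, for illustration).
kept_start_end = [r for r in reads
                  if passes_screen(*r, max_start=10352, min_end=19463)]
print(kept_start_end)
```

In this toy example the read that starts at 43107 is dropped by the start criterion, while the read spanning 1044 to 1076 is only dropped once an end threshold is also applied.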

Thank you, Dr. Pschloss. I’ll try it!

Dear Dr. Pschloss,

I think the dist file will still be very large. Right now, the second column of the dist.seqs output is at 38558, and the dist file is already 15.4 GB.

Please help me solve this! Thanks a lot!

Judging from the lengths of your sequences, I can tell you are probably not applying any meaningful quality trimming steps, using either trim.flows/shhh.flows or trim.seqs. The problem is that, with all of the remaining errors in the reads, there are many more unique sequences that are error-infested copies of the good sequences. This makes for larger distance matrices and, overall, less reliable results. Please consult the SOP for how to apply the two methods to your own data.

Pat
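To make the point about matrix size concrete: the number of possible pairwise distances grows with the square of the number of unique sequences. The Python sketch below is only back-of-the-envelope arithmetic using the unique-sequence counts quoted in this thread; `pairwise_count` is a hypothetical helper name. The actual .dist file size also depends on the cutoff, since only distances below it are written, so this is an upper bound on the number of entries, not a file-size prediction.

```python
def pairwise_count(n_unique: int) -> int:
    """Number of distinct sequence pairs among n_unique sequences: n(n-1)/2."""
    if n_unique < 0:
        raise ValueError("n_unique must be non-negative")
    return n_unique * (n_unique - 1) // 2

# Unique-sequence counts reported earlier in this thread.
for n in (469433, 493957):
    print(f"{n} unique sequences -> {pairwise_count(n)} possible pairs")

# Halving the number of unique sequences cuts the pair count roughly fourfold.
ratio = pairwise_count(400_000) / pairwise_count(200_000)
print(f"400000 vs 200000 uniques: {ratio:.2f}x as many pairs")
```

This is why removing error-derived unique sequences before dist.seqs has such a large effect on the size of the distance matrix.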

Dear Dr. Pschloss, I have another problem. If I add trump=. or trump=- to the filter.seqs command, the length of the filtered alignment becomes 0. Why is that? What can I do? Please help me. Thanks a lot.

This is because you are not using screen.seqs correctly to make sure that your sequences all overlap in the same region.

When you do this within mothur:

screen.seqs(fasta=./align/454.trim.cut.unique.align,name=./unique/454.trim.cut.names,group=./trim/454.groups,minlength=200,maxambig=0,maxhomop=8,processors=10, outputdir=./screen/, start=10352)
summary.seqs(fasta=current, name=current)

What do you get?
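The reason trump=. can leave an alignment of length 0 is that filter.seqs removes every column in which any sequence has the trump character, so a single read aligned to a different region is enough to wipe out all columns. The Python sketch below illustrates this on a made-up toy alignment; `surviving_columns` is a hypothetical helper, not part of mothur.

```python
def surviving_columns(alignment, trump="."):
    """Return indices of columns in which no sequence has the trump character."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [i for i in range(length)
            if all(seq[i] != trump for seq in alignment)]

# Toy alignment (made-up data): three reads covering the same region.
overlapping = [
    "..ACGTAC..",
    "..ACG-AC..",
    "...CGTAC..",
]
print(len(surviving_columns(overlapping)))

# Add one read aligned to a different region: no column is free of '.'.
with_stray = overlapping + ["ACG......."]
print(len(surviving_columns(with_stray)))
```

Screening out reads that do not start at the common position before running filter.seqs removes such stray reads, so the shared columns survive.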