Troubleshooting of cluster

john1608 · November 14, 2012, 2:46am

Hi, dear friends,
I am facing a problem with a file generated by mothur during the analysis of a data set with almost 800000 seqs. I analyzed my seqs following the Schloss SOP, but when dist.seqs finished I got a 140 GB .dist file. When I ran the cluster command, I only got the result for the unique label. mothur did not give me results for distances 0.01-0.10.
Please help me! Thanks!

john1608 · November 14, 2012, 2:57am

I forgot to say that I used the furthest neighbor method.
Thank you.

pschloss · November 19, 2012, 9:48pm

How closely are you following the SOP and what sequencing platform are you using?

john1608 · November 26, 2012, 6:24am

Dear Dr. Pschloss, I followed the SOP almost exactly. I used the 454 sequencing platform.

pschloss · November 28, 2012, 4:56pm

What does “almost” mean? I’m suspicious of your large number of unique sequences.

john1608 · December 4, 2012, 1:06am

Yes, Dr. Pschloss, that might be the reason. I think my data has a large number of unique sequences. But if that is true, how can I solve this problem?

pschloss · December 5, 2012, 12:04pm

What are you changing in the SOP instructions?

john1608 · December 6, 2012, 9:26am

I looked at the SOP again, and I think I found the problem. I will try again. Thank you so much!

john1608 · December 12, 2012, 1:32am

Dear Dr. Pschloss, this time I followed your SOP. But when I run the filter.seqs command with trump=., my sequences end up with start -1 and end 1, so no sequences are left.
So I tried trump=N, which gave start 9 to end 1400. Then I ran unique.seqs and pre.cluster. But when I ran dist.seqs, only 470000 unique sequences produced an 88 GB .dist file.

pschloss · December 12, 2012, 3:39pm

you are probably having problems with your settings for the screen.seqs command. please provide the output of summary.seqs using the fasta/names file that you give to screen.seqs as well as the parameter values you used in screen.seqs.

john1608 · December 13, 2012, 4:56am

Dear Dr. Pschloss, this is the summary of the filter.seqs

             Start   End   NBases   Ambigs   Polymer   NumSeqs
Minimum:     0       0     0        0        1         1
2.5%-tile:   -1      -1    0        0        1         13743
25%-tile:    -1      -1    0        0        1         137425
Median:      -1      -1    0        0        1         274849
75%-tile:    -1      -1    0        0        1         412273
97.5%-tile:  -1      -1    0        0        1         535955
Maximum:     -1      -1    0        0        1         549697
Mean:        -1      -1    0        0        1

# of unique seqs: 469433
total # of seqs: 549697

And this is my screen.seqs command:

screen.seqs(fasta=./align/454.trim.cut.unique.align,name=./unique/454.trim.cut.names,group=./trim/454.groups,minlength=200,maxambig=0,maxhomop=8,processors=10, outputdir=./screen/)

john1608 · December 13, 2012, 5:20am

I used the same steps to analyze the bacteria. More than 600000 unique bacterial sequences produced only a 35 GB dist file.

pschloss · December 13, 2012, 1:47pm

what is the output of

summary.seqs(fasta=./align/454.trim.cut.unique.align)

john1608 · December 14, 2012, 11:46am

Dear Dr. Pschloss, here is my output of summary.seqs(fasta=./align/454.trim.cut.unique.align)
             Start   End      NBases    Ambigs   Polymer   NumSeqs
Minimum:     1044    1076     5         0        2         1
2.5%-tile:   10352   19463    264       0        5         14380
25%-tile:    10352   25318    471       0        5         143792
Median:      10352   25497    485       0        5         287584
75%-tile:    10352   25503    488       0        6         431376
97.5%-tile:  10352   25503    497       0        7         560788
Maximum:     43107   43116    500       0        10        575167
Mean:        10362   24615.6  459.286   0        5.47729

# of unique seqs: 493957
total # of seqs: 575167

pschloss · December 14, 2012, 12:09pm

Try this…

screen.seqs(fasta=./align/454.trim.cut.unique.align,name=./unique/454.trim.cut.names,group=./trim/454.groups,minlength=200,maxambig=0,maxhomop=8,processors=10, outputdir=./screen/, start=10352)
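As a rough illustration of what the start option is doing here, the Python sketch below keeps only reads whose alignment begins at or before a chosen column. It is not mothur's code: the function name `screen_by_start` and the read names are hypothetical, and the coordinates are only loosely modeled on the summary.seqs output above, where the 2.5%-tile through 97.5%-tile all start at column 10352.

```python
def screen_by_start(reads, start):
    """Keep reads whose first aligned column is at or before `start`.

    `reads` maps a read name to its (start, end) alignment columns.
    """
    kept = {}
    for name, (read_start, read_end) in reads.items():
        if read_start <= start:
            kept[name] = (read_start, read_end)
    return kept


# Hypothetical coordinates, loosely modeled on the summary.seqs output above.
example_reads = {
    "read_a": (10352, 25503),
    "read_b": (10352, 25318),
    "read_c": (43107, 43116),  # aligns to a different region entirely
    "read_d": (1044, 1076),    # starts early but ends long before the region
}

kept = screen_by_start(example_reads, start=10352)
print(sorted(kept))
```

In this toy data a read like `read_d` passes a start-only check even though it barely overlaps the others, which suggests that a length or end-position criterion still matters alongside start.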

john1608 · December 14, 2012, 12:37pm

Thank you, Dr. Pschloss. I’ll try it!

john1608 · December 15, 2012, 12:33pm

Dear Dr. Pschloss,

I think the dist file will still be very large. Right now, the second column of the dist.seqs output is at 38558, and the dist file is already 15.4 GB.

Please help me solve this! Thanks a lot!

pschloss · December 17, 2012, 11:55am

Judging from the lengths of your sequences, I can tell you are probably not applying any meaningful quality trimming steps, either using trim.flows/shhh.flows or trim.seqs. The problem is that with all of the remaining errors in the reads there are many more unique sequences that are error-infested copies of the good sequences. This makes for larger distance matrices and, overall, less reliable results. Please consult the SOP for how to apply the two methods to your own data.
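To see why the number of unique sequences drives the size of the distance matrix, here is a small Python sketch of the arithmetic. It only computes an upper bound on the number of pairwise distances, n(n-1)/2. The actual size of a .dist file also depends on the cutoff and on how many pairs fall below it, which this sketch does not model. The function name is hypothetical.

```python
def max_pairwise_distances(n_unique):
    """Number of distinct sequence pairs among n_unique sequences."""
    return n_unique * (n_unique - 1) // 2


# 469433 and 493957 are the unique-sequence counts reported in this thread;
# 100000 is an arbitrary smaller value for comparison.
for n in (100_000, 469_433, 493_957):
    print(f"{n:>7} unique sequences -> up to {max_pairwise_distances(n):,} pairs")
```

Because the bound grows with the square of the count, halving the number of unique sequences cuts the possible number of distances roughly fourfold.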

Pat

john1608 · December 26, 2012, 6:53am

Dear Dr. Pschloss, I have another problem. If I add trump=. or trump=- to the filter.seqs command, the length of the filtered alignment becomes 0. Why is that? What can I do? Please help me. Thanks a lot.

pschloss · January 2, 2013, 5:25pm

This is because you are not using screen.seqs correctly to make sure that your sequences all overlap in the same region.
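A toy Python sketch may help show why the alignment can collapse to length 0. With trump=., any column where at least one sequence has a '.' is dropped, so if the reads do not all cover a common region, no column survives. This is a simplified illustration and not mothur's implementation: the function name `filter_trump` and the sequences are made up, and other filter.seqs behavior (such as removing all-gap columns) is not modeled.

```python
def filter_trump(alignment, trump="."):
    """Drop every column in which at least one sequence has the `trump` character."""
    length = len(next(iter(alignment.values())))
    keep = [
        col for col in range(length)
        if all(seq[col] != trump for seq in alignment.values())
    ]
    return {name: "".join(seq[col] for col in keep) for name, seq in alignment.items()}


# Toy alignments: '.' marks columns outside a read, '-' marks a gap inside it.
overlapping = {
    "read_a": "..ACGT-A..",
    "read_b": "..AC-TTA..",
}
non_overlapping = {
    "read_a": "..ACGT....",
    "read_c": "......TTAG",
}

print(filter_trump(overlapping))      # both reads keep their shared columns
print(filter_trump(non_overlapping))  # no shared columns, so nothing is left
```

A single read that aligns somewhere else is enough to empty the whole alignment in this model, which is why screening the reads to a common region first matters.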

When you do this within mothur:

screen.seqs(fasta=./align/454.trim.cut.unique.align,name=./unique/454.trim.cut.names,group=./trim/454.groups,minlength=200,maxambig=0,maxhomop=8,processors=10, outputdir=./screen/, start=10352)
summary.seqs(fasta=current, name=current)

What do you get?
