Average Clustering of ~10k unique V6 sequences

Hi~

I have met a problem when clustering OTUs with divergence 0.03, 0.05 and 0.10, from ~10k unique V6 sequences.

I set cutoff=0.20 in dist.seqs(), and used average method in cluster().

But the .rabund .sabund .list don’t give the OTUs below “unique”.

Is there a bug in mothur or the parameters I set did not suit?

Could anyone give me some suggustion?

Thank you.

ps: When I change the cluster method from “average” to 'furthest", the cluster() function worked well.

This typically happens when the sequences are poorly aligned or when a number of sequences do not overlap with the others. This causes distances of essentially infinity between sequences and will not allow sequences to merge together into OTUs by the average neighbor algorithm, but they can be merged using the furthest and nearest neighbor algorithms. Are you using screen.seqs and filter.seqs to make sure the sequences overlap in the same alignment space?

Pat

Hi, Prof. Patrick D. Schloss. Thank you for answering my question.
I had used screen.seqs() and filter.seqs(), but I am wondering if the parameters I set fit. Could you give me some advices about parameter setting in align.seqs() and filter.seqs() to avoid infinite distances? I have list the results of align.seqs() and filter.seqs() as follows. Thank you.

summary() after using align.seqs()
Start End NBases Ambigs Polymer
Minimum 1044 1461 45 0 2
2.5%-tile 31189 33183 57 0 3
25%-tile 31189 33183 59 0 3
Median 31189 33183 60 0 4
75%-tile 31189 33183 60 0 4
97.5%-tile 31189 33287 64 0 5
Maximum 42614 43116 75 0 11

of unique seqs 88172

total # of seqs 100355358

summary() after using screen.seqs() start=31189 end=33183
Start End NBases Ambigs Polymer
Minimum: 29675 33183 45 0 2
2.5%-tile: 31189 33183 57 0 3
25%-tile: 31189 33183 59 0 3
Median: 31189 33183 60 0 4
75%-tile: 31189 33183 60 0 4
97.5%-tile: 31189 33287 64 0 5
Maximum: 32821 35984 75 0 11

of unique seqs: 83592

total # of seqs: 100215306

summary() after using filter.seq()
Start End NBases Ambigs Polymer
Minimum 1 273 45 0 2
2.5%-tile 52 273 57 0 3
25%-tile 52 273 59 0 3
Median 52 273 60 0 4
75%-tile 52 273 60 0 4
97.5%-tile 52 278 64 0 5
Maximum 260 382 75 0 11

of unique seqs: 83592

total # of seqs: 100215306

Can you post the actual commands you used and their output? Some of this doesn’t make sense - the maximum start after screen.seqs should be 31189 and if you use filter.seqs(trump=., vertical=T), everything should more or less start and end at the same coordinates.

Pat

Hi, I post my commands used here. I am sorry for mistaking the results of the command screen.seqs() last time.

unique.seqs(fasta=MetatongueRun4.fasta)
align.seqs(candidate=MetatongueRun4.unique.fasta, template=silva.bacteria/silva.bacteria.fasta, ksize=9, align=needleman, gapopen=-1, processors=8)
summary.seqs(fasta=MetatongueRun4.unique.align, name=MetatongueRun4.names)

Start End NBases Ambigs Polymer
Minimum: 1044 1461 45 0 2
2.5%-tile: 31189 33183 57 0 3
25%-tile: 31189 33183 59 0 3
Median: 31189 33183 60 0 4
75%-tile: 31189 33183 60 0 4
97.5%-tile: 31189 33287 64 0 5
Maximum: 42614 43116 75 0 11

of unique seqs: 88172

total # of seqs: 100355358

screen.seqs(fasta=MetatongueRun4.unique.align, name=MetatongueRun4.names, group=MetatongueRun4.groups, start=31189, end=33183, processors=8)
summary.seqs(fasta=MetatongueRun4.unique.good.align, name=MetatongueRun4.good.names)

Start End NBases Ambigs Polymer
Minimum: 29675 33183 45 0 2
2.5%-tile: 31189 33183 57 0 3
25%-tile: 31189 33183 59 0 3
Median: 31189 33183 60 0 4
75%-tile: 31189 33183 60 0 4
97.5%-tile: 31189 33287 64 0 5
Maximum: 31189 34471 75 0 11

of unique seqs: 81491

total # of seqs: 100138523

filter.seqs(fasta=MetatongueRun4.unique.good.align, trump=.)

Length of filtered alignment: 1995
Number of columns removed: 48005
Length of the original alignment: 50000
Number of sequences used to construct filter: 81491

summary.seqs(fasta=MetatongueRun4.unique.good.filter.fasta, name=MetatongueRun4.good.names)

Start End NBases Ambigs Polymer
Minimum: 1 1637 42 0 2
2.5%-tile: 1 1995 57 0 3
25%-tile: 1 1995 59 0 3
Median: 1 1995 60 0 4
75%-tile: 1 1995 60 0 4
97.5%-tile: 1 1995 64 0 5
Maximum: 1218 1995 73 0 11

of unique seqs: 81491

total # of seqs: 100138523

unique.seqs(fasta=MetatongueRun4.unique.good.filter.fasta, name=MetatongueRun4.good.names)
summary.seqs(fasta=MetatongueRun4.unique.good.filter.unique.fasta, name=MetatongueRun4.good.filter.names)

Start End NBases Ambigs Polymer
Minimum: 1 1637 42 0 2
2.5%-tile: 1 1995 57 0 3
25%-tile: 1 1995 59 0 3
Median: 1 1995 60 0 4
75%-tile: 1 1995 60 0 4
97.5%-tile: 1 1995 64 0 5
Maximum: 1218 1995 73 0 11

of unique seqs: 80914

total # of seqs: 100138523

system(cp MetatongueRun4.unique.good.filter.unique.fasta MetatongueRun4.final.fasta)
system(cp MetatongueRun4.unique.good.filter.names MetatongueRun4.final.names)
system(cp MetatongueRun4.good.groups MetatongueRun4.final.groups)
dist.seqs(fasta=MetatongueRun4.final.fasta, cutoff=0.20, processors=8)
cluster(column=MetatongueRun4.final.dist, name=MetatongueRun4.final.names, cutoff=0.10, method=average, hard=t)


changed cutoff to 0.0187468

Output File Names:
/home/baijiang/Run4/MetatongueRun4.final.an.sabund
/home/baijiang/Run4/MetatongueRun4.final.an.rabund
/home/baijiang/Run4/MetatongueRun4.final.an.list

It took 17807 seconds to cluster

Hmm… This looks odd - Illumina data? Could you try using the default gap opening value in align.seqs (-5) and tell me how the rest of the output changes? My thinking is that your use of -1 is screwing things up. In the supplementary table 3 from the align.seqs paper, we show that -1 is the worst setting for v6 and that -5 is the desired level.

Also, are the sequences possibly a mix of sequences in both directions?

Yes, it is Illumina data. And the paired-end reads were assembled so the sequences are just in the forward direction. Does Illumina or 454 matter?

I will tried gapopen=-5 in align.seqs() and let you know the output changes.

Thank you.

The commands and results using gapopen=-5. The OTU average clustering did not work either.

align.seqs(candidate=MetatongueRun4.unique.fasta, template=silva.bacteria.fasta, ksize=9, align=needleman, gapopen=-5, processors=8)

Some of you sequences generated alignments that eliminated too many bases, a list is provided in MetatongueRun4.unique.flip.accnos. If you set the flip parameter to true mothur will try aligning the reverse compliment as well.
It took 493 secs to align 88172 sequences.

summary.seqs(fasta=MetatongueRun4.unique.align, name=MetatongueRun4.names)

Start End NBases Ambigs Polymer
Minimum: 0 0 0 0 1
2.5%-tile: 31189 33183 57 0 3
25%-tile: 31189 33183 59 0 3
Median: 31189 33183 60 0 4
75%-tile: 31189 33183 60 0 4
97.5%-tile: 31189 33287 64 0 5
Maximum: 43116 43116 75 0 11

of unique seqs: 88172

total # of seqs: 100355358

screen.seqs(fasta=MetatongueRun4.unique.align, name=MetatongueRun4.names, group=MetatongueRun4.groups, start=31189, end=33183, processors=20)
summary.seqs(fasta=MetatongueRun4.unique.good.align, name=MetatongueRun4.good.names)

Start End NBases Ambigs Polymer
Minimum: 31145 33183 53 0 2
2.5%-tile: 31189 33183 57 0 3
25%-tile: 31189 33183 59 0 3
Median: 31189 33183 60 0 4
75%-tile: 31189 33183 60 0 4
97.5%-tile: 31189 33287 64 0 5
Maximum: 31189 34128 75 0 11

of unique seqs: 82050

total # of seqs: 100178513

filter.seqs(fasta=MetatongueRun4.unique.good.align, trump=.)

Length of filtered alignment: 1995
Number of columns removed: 48005
Length of the original alignment: 50000
Number of sequences used to construct filter: 82050

summary.seqs(fasta=MetatongueRun4.unique.good.filter.fasta, name=MetatongueRun4.good.names)

Start End NBases Ambigs Polymer
Minimum: 1 1995 53 0 2
2.5%-tile: 1 1995 57 0 3
25%-tile: 1 1995 59 0 3
Median: 1 1995 60 0 4
75%-tile: 1 1995 60 0 4
97.5%-tile: 1 1995 64 0 5
Maximum: 2 1995 71 0 11

of unique seqs: 82050

total # of seqs: 100178513

unique.seqs(fasta=MetatongueRun4.unique.good.filter.fasta, name=MetatongueRun4.good.names)
summary.seqs(fasta=MetatongueRun4.unique.good.filter.unique.fasta, name=MetatongueRun4.unique.good.filter.names)

Start End NBases Ambigs Polymer
Minimum: 1 1995 53 0 2
2.5%-tile: 1 1995 57 0 3
25%-tile: 1 1995 59 0 3
Median: 1 1995 60 0 4
75%-tile: 1 1995 60 0 4
97.5%-tile: 1 1995 64 0 5
Maximum: 2 1995 71 0 11

of unique seqs: 81626

total # of seqs: 100178513


[baijiang@node10 Run4]$ cp MetatongueRun4.unique.good.filter.unique.fasta MetatongueRun4.final.fasta
[baijiang@node10 Run4]$ cp MetatongueRun4.unique.good.filter.names MetatongueRun4.final.names
[baijiang@node10 Run4]$ cp MetatongueRun4.good.groups MetatongueRun4.final.groups

dist.seqs(fasta=MetatongueRun4.final.fasta, cutoff=0.20, processors=20)
cluster(column=MetatongueRun4.final.dist, name=MetatongueRun4.final.names, cutoff=0.10, method=average, hard=t)


changed cutoff to 0.0192843

Output File Names:
/home/baijiang/Run4_2/MetatongueRun4.final.an.sabund
/home/baijiang/Run4_2/MetatongueRun4.final.an.rabund
/home/baijiang/Run4_2/MetatongueRun4.final.an.list

It took 18624 seconds to cluster

can you email the initial (unaligned) fasta file to mothur.bugs@gmail.com?

Hi, Prof. P.Schloss,

Our data is under analysis. And I will send it you if authorized by my supervisor’s after the data analysis.

I wanna to skirt the clustering problem by using furthest method.

Thanks for your help and developing Mothur. It is a very beneficial tool.

Best.

Bai