stuck with cluster

Hello,

I have to analyse 454 FLX+ sequences and in the beginning I tried to analyse the forward and reverse sequences (half of the sequences randomly starts from forward or reverse). Now I tried to combine them and this works out until the cluster command that will stuck (I don´t get an error message).

I used:

mothur v.1.31.2

trim.seqs(fasta=…, oligos=…, minlength=500, maxlength=1050, qaverage=35, qwindowsize=50, maxhomop=8, pdiffs=0)

align.seqs(fasta=16S_trim.unique.fasta, reference=silva.bacteria.fasta, align=gotoh, match=1, mismatch=-3, flip=T, processors=8)
screen.seqs(fasta=16S_trim.unique.align, maxhomop=8, name=16S_trim.names, group=mergegroupspick, minlength=500, processors=8)
filter.seqs(fasta=16S_trim.unique.good.align, vertical=T)
unique.seqs(fasta=16S_trim.unique.good.filter.fasta, name=16S_trim.good.names)
pre.cluster(fasta=16S_trim.unique.good.filter.unique.fasta, name=16S_trim.unique.good.filter.names, group=mergegroupspickgood, diffs=2, processors=8)
chimera.uchime
remove.seqs
classify.seqs
remove.lineage

summary from here:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1634 500 0 4 1
2.5%-tile: 1 1940 520 0 4 2073
25%-tile: 1 2767 659 0 5 20728
Median: 1 3005 752 0 5 41455
75%-tile: 506 3005 826 0 6 62182
97.5%-tile: 1218 3047 877 2 7 80837
Maximum: 1506 3094 936 6 8 82909
Mean: 284.745 2827.17 735.422 0.182393 5.42115

of unique seqs: 73481

total # of seqs: 82909

dist.seqs(fasta=16S_trim.unique.good.filter.unique.precluster.pick.pick.fasta, output=lt, processors=8)
cluster(phylip=16S_trim.unique.good.filter.unique.precluster.pick.pick.phylip.dist, name=16S_trim.unique.good.filter.unique.precluster.pick.pick.names)

Hm, I think the problem is that the sequences don´t align in the same area? I also tried to use a cutoff of 0.03 but I end up with some unique sequences.
Is there any way to get this done?
Could it help when I run the trim.flows etc. with the sff files? I am trying this in the moment but I am failing… But if this would help I would ask you about this later.

It would be so great if you could help!

Thank you so much!

I suspect the problem is your data quality. If you have a high error rate, you’ll have a lot of uniques and that will make everything painful.

First, do you know what image processing algorithm was used to generate the SFF file? I would suggest LA1.

Second, in filter.seqs I would add trump=. otherwise you’re not comparing apples to apples since your reads won’t be in the same alignment space. If you don’t want to do this then I would process your forwards separate from your reverses.

Let us know how that works out for you…
Pat

Dear Pat,

thank you very much for the answer! Unfortunately I don´t know what image processing algorithm the company used. I used trump=. before and it works for the 16S but I also have to analyse fungal ITS sequences and when I use trump then I end up with sequences that are 50-130 bp long. And so I thought, I just don´t do this. What is potentially not the best idea :wink:

Again: thank you!

Oh, I should add that I had the same problem for the ITS sequences (I use the UNITE database) with adding trump when I analysed the forward and reverse sequences seperately. Do you have any idea why this is the case?

Thank you!

I tried it and here is the summary after filtering with trump for ITS:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: -1 -1 0 0 1 1988
25%-tile: -1 -1 0 0 1 19876
Median: -1 -1 0 0 1 39752
75%-tile: -1 -1 0 0 1 59628
97.5%-tile: -1 -1 0 0 1 77516
Maximum: -1 -1 0 0 1 79503
Mean: -1 -1 0 0 1

of unique seqs: 63528

total # of seqs: 79503


and 16S:
Start End NBases Ambigs Polymer NumSeqs Minimum: 1 94 2 0 1 1 2.5%-tile: 1 129 9 0 1 2456 25%-tile: 1 129 9 0 2 24552 Median: 1 129 12 0 2 49103 75%-tile: 1 129 34 0 4 73654 97.5%-tile: 1 129 35 0 5 95749 Maximum: 76 129 59 2 8 98204 Mean: 1.08315 128.998 18.8158 0.00192457 2.71437 # of unique seqs: 92679 total # of seqs: 98204

Both really bad. :frowning:
And you say it´s not good to just filter without using trump? I would love to have a possibility to analyse the forward and reverse sequences together…

Sorry, for all the posts.

Best,

It’s all about picking the correct parameters for your data in screen.seqs. In the ITS case your reads don’t seem to overlap and so when you run trump=. they all go away. If you don’t use trump=. then you have distances of infinity or 1.0 between sequences because there’s no overlap - and that’s for reads that might actually be identical.

In the future can you get your sequencing company to only sequence in one direction?

Ah okay, I see the point. I just loose so much length when I only use the overlap. I hoped there would be some way around this.
I think I will just never use the company again :wink:

Thank you for your help!

One more question:

I tried your suggestion and would like to know if you think this is okay. Is there a problem when I don´t know which region of 16S I really end up with? Cause the primers would target the V1-V5 region and when I cut them down so much then I don´t know which region it really is in the end, isn´t it? Possibly that doesn´t matter at all.

Before screening:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1046 1 0 1 2633
25%-tile: 1044 23963 635 0 5 26326
Median: 1044 27654 730 0 5 52652
75%-tile: 3148 27654 822 0 6 78977
97.5%-tile: 7941 28476 874 2 7 102670
Maximum: 43116 43116 945 7 8 105302
Mean: 2839.95 24828.8 683.193 0.187119 5.06493

of unique seqs: 99596

total # of seqs: 105302

screen.seqs(fasta=16S_trim.unique.align, maxhomop=8, name=16S_trim.names, group=mergegroupspick, end=25000, minlength=300, processors=8)
filter.seqs(fasta=16S_trim.unique.good.align, vertical=T, trump=.)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1028 235 0 3 1
2.5%-tile: 1 1033 322 0 4 1920
25%-tile: 1 1033 324 0 5 19199
Median: 1 1033 327 0 5 38398
75%-tile: 1 1033 348 0 5 57597
97.5%-tile: 1 1033 351 1 6 74876
Maximum: 72 1033 473 5 8 76795
Mean: 1.08174 1032.97 332.883 0.0768019 5.0845

of unique seqs: 72774

total # of seqs: 76795


When you say that is fine then I will go for this method. Cause it´s so annoying to work with 16S forward and reverse and ITS forward and reverse (esspecially when I have to present this). Too many graphs. :)

Thank you so, so much!

Looks like you found a good set of parameters.

Really? Juhu! :smiley: And it also works for the ITS sequences. That is so good!

Thank you so much for your help!

I am very sorry to bother you again…
I have a last question. I am writing the stuff down in the moment and I am wondering how I can explain why I used this way of analysing (using overlap) and why this way is okay. Is there somewhere a reference that says that it would be better to sequence from both directions due to avoiding a bias? There is this reference that explains that there is a different taxonomic output depending on if the sequences were read from the forward (ITS1 region) or reverse (ITS2 region) side (Bazzicalupo et al. 2013). But because I use the overlap I can´t completely say which region I use, isn´t it? Something in between. Same for bacteria.

If anybody has an idea what I could cite for this, I would be very happy!

I am really sorry I read so many papers but most have done another sequencing strategy.

Sorry and thank you!

Actually, for 454, sequencing in both directions is a bad idea because they aren’t true paired ends. If you’re doing Sanger or Illumina, paired end works and is ideal because it gives higher quality data.