screen.seqs removes half of my sequences

Hi,

I’m investigating bacteria associated with fish skin. I amplified the V4 region of the 16S rRNA gene using the primers 515F/806R, as in the Kozich paper.
Two libraries with 120 and 200 fish samples were sequenced on an Illumina MiSeq with the MiSeq Reagent Kit v2 (500 cycles).

I got fastq files, one forward and one reverse for each sample. I made the contigs for each library, screened the sequences with screen.seqs (length between 230 and 275, maxambig=0), split the libraries according to two different experiments, and merged the respective files afterwards.
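In other words, the screen keeps a contig only if its length is inside the expected V4 window and it has no ambiguous bases. A minimal Python sketch of that logic (contig names and sequences are purely illustrative):

```python
# Minimal sketch of the length/ambiguity screen applied to the contigs:
# keep a contig only if its length is inside the expected V4 window and
# it contains no ambiguous bases. Names and sequences are illustrative.

def passes_screen(seq, min_len=230, max_len=275, max_ambig=0):
    """Return True if the contig survives the screen."""
    length_ok = min_len <= len(seq) <= max_len
    ambig_ok = seq.upper().count("N") <= max_ambig
    return length_ok and ambig_ok

contigs = {
    "read_a": "A" * 252,          # typical V4 contig length -> kept
    "read_b": "A" * 120,          # too short -> removed
    "read_c": "A" * 250 + "NN",   # ambiguous bases -> removed
}
kept = {name for name, seq in contigs.items() if passes_screen(seq)}
# kept == {"read_a"}
```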

summary.seqs generated the following:

mothur v.1.33.3

mothur > summary.seqs(fasta=czallstability.trim.contigs.good.pick.fasta, processors=8)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 230 230 0 3 1
2.5%-tile: 1 247 247 0 3 318972
25%-tile: 1 248 248 0 4 3189718
Median: 1 252 252 0 4 6379436
75%-tile: 1 252 252 0 5 9569153
97.5%-tile: 1 253 253 0 6 12439899
Maximum: 1 275 275 0 60 12758870
Mean: 1 250.067 250.067 0 4.39062

# of Seqs: 12758870

I then followed the SOP with:

unique.seqs(fasta=czallstability.trim.contigs.good.pick.fasta)

and

align.seqs(fasta=czallstability.trim.contigs.good.pick.unique.fasta, reference=silva.bacteria.fasta, processors=8)

and got the following summary:

mothur > summary.seqs(fasta=czallstability.trim.contigs.good.pick.unique.align, count=czallstability.trim.contigs.good.pick.count_table, processors=8)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 13862 22588 247 0 3 318972
25%-tile: 13862 23439 248 0 4 3189718
Median: 13862 23443 252 0 4 6379436
75%-tile: 13871 23443 252 0 5 9569153
97.5%-tile: 13875 23443 253 0 6 12439899
Maximum: 43116 43116 275 0 60 12758870
Mean: 13867.9 23418.1 250.044 0 4.39032

# of unique seqs: 760815

total # of seqs: 12758870

Subsequently, I ran screen.seqs with the following parameters:

mothur > screen.seqs(fasta=czallstability.trim.contigs.good.pick.unique.align, count=czallstability.trim.contigs.good.pick.count_table, start=13875, end=23443, maxhomop=8, processors=8)

which removed about half of my sequences:

mothur > summary.seqs(fasta=czallstability.trim.contigs.good.pick.unique.good.align, count=czallstability.trim.contigs.good.pick.good.count_table, processors=8)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 11892 23443 234 0 3 1
2.5%-tile: 13862 23443 252 0 3 163572
25%-tile: 13862 23443 252 0 4 1635717
Median: 13862 23443 252 0 4 3271434
75%-tile: 13862 23443 252 0 5 4907151
97.5%-tile: 13862 23443 253 0 6 6379296
Maximum: 13875 25318 275 0 8 6542867
Mean: 13862 23443.1 252.068 0 4.35542

# of unique seqs: 341350

total # of seqs: 6542867
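For reference, screen.seqs with start= and end= keeps only sequences whose alignment begins at or before the start cutoff and ends at or after the end cutoff, so a read ending at column 23439 fails end=23443 even though it is only four columns short. A minimal Python sketch (coordinates mimic the percentile rows above; read names are illustrative):

```python
# screen.seqs keeps a sequence only if its alignment starts at or before
# the start cutoff AND ends at or after the end cutoff. The 25%-tile reads
# above end at column 23439, so they fail end=23443 by four columns.
# Coordinates mimic the summary rows; names are illustrative.

def spans_window(aln_start, aln_end, start_cutoff=13875, end_cutoff=23443):
    return aln_start <= start_cutoff and aln_end >= end_cutoff

alignments = {
    "median_read": (13862, 23443),   # spans the window -> kept
    "q25_read":    (13862, 23439),   # ends 4 columns early -> removed
    "stray_read":  (43116, 43116),   # aligned far downstream -> removed
}
kept = [name for name, (s, e) in alignments.items() if spans_window(s, e)]
# kept == ["median_read"]
```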




When I run screen.seqs with start=13875 and optimize=end I get:

mothur > summary.seqs(fasta=czallstability.trim.contigs.good.pick.unique.good.align, count=czallstability.trim.contigs.good.pick.good.count_table)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 11892 23439 234 0 3 1
2.5%-tile: 13862 23439 248 0 3 310057
25%-tile: 13862 23439 248 0 4 3100569
Median: 13862 23443 252 0 4 6201137
75%-tile: 13871 23443 252 0 5 9301705
97.5%-tile: 13871 23443 253 0 6 12092216
Maximum: 13875 25318 275 0 8 12402272
Mean: 13866.3 23441.2 250.156 0 4.39429

# of unique seqs: 720310

total # of seqs: 12402272

I’m now puzzled that, after aligning, some of my sequences don’t overlap at all, but since more than 75% of them do, I still don’t see why I lose half of my sequences when setting the start and end as I did.
Further, I’m not sure whether I can use the files produced by screen.seqs with optimize=end.

I appreciate if somebody could help me.

Thanks a lot!

I’m now puzzled that, after aligning, some of my sequences don’t overlap at all, but since more than 75% of them do, I still don’t see why I lose half of my sequences when setting the start and end as I did. Further, I’m not sure whether I can use the files produced by screen.seqs with optimize=end.

Yeah that seems weird. I suspect you have an abundant sequence in there that is a base shorter than is typically found. Using optimize=end as you did should work fine.

Pat

Hi Pat, thanks for your answer!

I kept on processing as described above but got stuck at the cluster command and am wondering whether that’s a bug.
After screening with optimize=end I ran the following steps:

filter.seqs(fasta=czallstability.trim.contigs.good.pick.unique.good.align, vertical=T, trump=.,processors=8)
unique.seqs(fasta=czallstability.trim.contigs.good.pick.unique.good.filter.fasta, count=czallstability.trim.contigs.good.pick.good.count_table)
pre.cluster(fasta=czallstability.trim.contigs.good.pick.unique.good.filter.unique.fasta, count=czallstability.trim.contigs.good.pick.unique.good.filter.count_table, diffs=2)
chimera.uchime(fasta=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.fasta, count=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.count_table, dereplicate=t)
remove.seqs(fasta=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.fasta, accnos=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.uchime.accnos)
classify.seqs(fasta=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.fasta, count=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, iters=1000, cutoff=80, processors=8)
remove.lineage(fasta=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.fasta, count=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
remove.groups(count=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, fasta=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pick.fasta,taxonomy=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, groups=SNpl1H12-SNpl2H12-SNpl2H8-SNpl2H9) --> I removed some control groups
dist.seqs(fasta=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.20, processors=8)
cluster(column=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table)
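For anyone following along, filter.seqs(vertical=T, trump=.) removes every alignment column that is all gaps (vertical) or contains the trump character ‘.’ in any sequence. A toy Python sketch of that column filter (the two-sequence alignment is purely illustrative):

```python
# Toy sketch of filter.seqs(vertical=T, trump=.): drop every alignment
# column that contains the trump character '.' in any sequence, or that
# is a gap ('-' or '.') in all sequences. Alignment is illustrative.

def filter_columns(aligned_seqs, trump="."):
    ncols = len(next(iter(aligned_seqs.values())))
    keep = [
        i for i in range(ncols)
        if not any(s[i] == trump for s in aligned_seqs.values())   # trump
        and not all(s[i] in "-." for s in aligned_seqs.values())   # vertical
    ]
    return {name: "".join(s[i] for i in keep)
            for name, s in aligned_seqs.items()}

aln = {
    "seq1": ".ACG-T",   # column 0 has a '.', column 4 is all gaps
    "seq2": "AAC--T",
}
filtered = filter_columns(aln)
# filtered == {"seq1": "ACGT", "seq2": "AC-T"}
```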

When I start the cluster command, it stops after two minutes and tells me that a certain sequence is not in my count file.
I redid the whole processing because I thought I might have used the wrong count_table file at some point.
Again, I got the same error message from the cluster command, but now with a different sequence missing from my count file.
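A quick way to see exactly which names are affected is to compare the sequence names referenced in the column-format dist file against those in the count table. A hedged Python sketch (the file contents below are illustrative stand-ins, not my real data):

```python
# Diagnostic sketch for the "not in your count table" error: collect the
# names referenced by a column-format dist file and subtract the names
# present in the count table. Input lines below are illustrative.

def names_in_dist(lines):
    """Column-format dist lines look like: 'nameA nameB distance'."""
    names = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            names.update(parts[:2])
    return names

def names_in_count(lines):
    """Count-table lines: a header, then 'name total group1 ...'."""
    it = iter(lines)
    next(it, None)  # skip the header line
    return {line.split()[0] for line in it if line.strip()}

dist_lines = ["seqA seqB 0.03", "seqA seqC 0.10"]
count_lines = ["Representative_Sequence total", "seqA 5", "seqB 2"]
missing = names_in_dist(dist_lines) - names_in_count(count_lines)
# missing == {"seqC"} -> these names would trigger the cluster error
```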

Strangely, the error message doesn’t appear in my logfile!
There I just see the following:
mothur > cluster(column=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table)


I would appreciate your help!

Weird. Can you compress the czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pick.pick.fasta and czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table files, post them to Google Drive or Dropbox, and email me the URL where they are located along with a link to this post? I’ll take a look and get back to you.

Pat

Hi Pat,
thanks a lot for your help. I sent the URLs to your email address. Please let me know if something went wrong.

Meanwhile, I tried the same workflow without the remove.groups command, which didn’t change anything.
I forgot to mention that after the error message in the cluster command, mothur terminates automatically.

The last logfile looks like this:



mothur > dist.seqs(fasta=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.20, processors=8)

Using 8 processors.

Output File Names:
czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pick.dist

It took 8746 seconds to calculate the distances for 129923 sequences.

mothur > cluster(column=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.pick.pick.dist,count=czallstability.trim.contigs.good.pick.unique.good.filter.unique.precluster.uchime.pick.pick.count_table)
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||||||[ERROR]: HWI-M01755_82__1_2110_2394


Actually the message in the shell continues to: [...] [ERROR]: HWI-M01755_82__1_2110_2394_19118 is not in your count table. Please correct.

Then mothur is terminated.

I’m not sure what’s going on, but I think it has to do with the large size of your dataset doing weird things. There’s no way that you’ll get 130k sequences to cluster into OTUs. How are you running make.contigs to assemble the reads?

Pat

Hi Pat,

I do: make.contigs(file=stability.files, trimoverlap=T, processors=8).

I also think that my dataset is too big.
Although the commands seem to work now (the server had a memory problem), I got a dist file of 63.5 MB and have 134749 unique sequences.
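The quadratic growth is easy to see: n unique sequences give n(n-1)/2 candidate pairs, so 134749 uniques mean roughly 9 billion pairwise comparisons before the cutoff discards most of them. A quick Python check:

```python
# Back-of-the-envelope arithmetic: the number of candidate pairwise
# distances grows quadratically with the number of unique sequences,
# which is why dist.seqs/cluster struggle on large datasets.

def candidate_pairs(n_unique):
    return n_unique * (n_unique - 1) // 2

pairs = candidate_pairs(134_749)   # -> 9,078,579,126 (~9 billion pairs)
```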
Now I’m stuck with cluster.


C.

Hi,

there is an issue which I forgot to mention so far.
My first library was sequenced using the 300PE kit, and I just checked with FastQC that the sequences have poor quality at the end.
I read that you, Pat, prefer the 250PE kit for the V4 region because the 300PE kit generates poor-quality data at the read ends.
The second library was sequenced using the 250PE kit.

I made the contigs of both libraries separately with the command above and afterwards merged the two stability.trim.contigs.fasta files.

The summary output was:

summary.seqs(fasta=allstability.trim.contigs.fasta, processors=8)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1 1 0 1 1
2.5%-tile: 1 166 166 0 3 464105
25%-tile: 1 248 248 0 4 4641046
Median: 1 252 252 0 4 9282092
75%-tile: 1 252 252 0 5 13923138
97.5%-tile: 1 253 253 4 6 18100079
Maximum: 1 300 300 117 116 18564183
Mean: 1 246.287 246.287 0.348832 4.44951

# of Seqs: 18564183

Then, I ran the screen command with:
screen.seqs(fasta=allstability.trim.contigs.fasta, group=allstability.contigs.groups, maxambig=0, minlength=230, maxlength=275)

Then I got:

mothur > summary.seqs(fasta=allstability.trim.contigs.good.fasta, processors=8)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 230 230 0 3 1
2.5%-tile: 1 247 247 0 3 385778
25%-tile: 1 248 248 0 4 3857777
Median: 1 252 252 0 4 7715553
75%-tile: 1 252 252 0 5 11573329
97.5%-tile: 1 253 253 0 6 15045328
Maximum: 1 275 275 0 60 15431105
Mean: 1 250.292 250.292 0 4.42416

# of Seqs: 15431105


Maybe I have to change something in the very beginning?

Thanks!

When they did 300PE, did they do 300 cycles per read or 250? If they did 300, then you will get bad data out. Illumina has told us that doing 300 cycles on a ~250 bp fragment will increase your error rate - we’ve seen this too (by accident).

Hi Pat,

I’m pretty sure they did the 300PE kit with 300 cycles, without knowing about the worse data quality.
Would you suggest I just work with the 2nd library and repeat the first library with 250PE?
I don’t really know how to analyze my data when I merge the libraries. The distance matrix is poor, and after clustering (which took 4 days) the cutoff dropped to 0.06.

Thanks!

C

Would you suggest I just work with the 2nd library and repeat the first library with 250PE?

Yeah, that would be best

Thanks, Pat, for your advice!

One last question: I read that, when data are compromised by the 300PE kit, you suggest analyzing them with phylotype-based approaches.

Would you recommend that in my case, too?

Yeah, but it’s always better to get good data :)

Greetings,

I’m running into the same problem. I’m working with an old 454 dataset of the oyster microbiome, using F8-R357. I’ve quality filtered, denoised, and trimmed my sequences. After making a unique FASTA file, I aligned it to the silva.bacteria.fasta file.

align.seqs(fasta=X.fasta, reference=silva.bacteria.fasta, processors=8)

and saw the following results:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1096 6 0 2 32630
25%-tile: 1044 6389 13 0 2 326293
Median: 1044 6389 283 0 3 652586
75%-tile: 43059 43116 309 0 5 978879
97.5%-tile: 43105 43116 327 0 5 1272542
Maximum: 43116 43117 334 0 7 1305171
Mean: 18683.9 21315.7 155.593 0 3.58003


A very large fraction of my sequences aligned, very poorly, to the 43,000 bp region. At first I thought they were just poor reads and screened them out, but that cut over 50% of my reads. I saw this post and noticed that the original poster’s reads also aligned to the exact same position (43116). I went back, grabbed a subset of the sequences aligning to that region, and ran them through BLAST. For example, the following three reads all aligned across only 6-13 bases in the 43,000 region:

41_1257649
GGAGTCTGGACCGTGTCTCAGTTCCAGCGTGGCTGATCATCCTCTCAAACCAGCTATAGATCGTAGGCTTGGTAGGCCATTACCCCACCAACTACCTAATCTAACGCGGGCTAATCCTTCTCCGATAAATCTTT
CCCCCAAAGGGCGTATAAGGTATTACTCTCCGTTTCCAGAGGCTATTCCTTAGAGAAGGGCATATTCCCACGCGTTACTCACCCGTCCGCCGCTCACCCCGAAGGGCGCGCTCGACTTGCATGTATTAGGCC
TGCCGCCAGCGTTC

9_1257662
GGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCTCTCAGACCAGCTACGGATTGTCGCCTTGGTAGGCCTTTACCCCACCAACAAGCTAATCCGACGTAAGCTCAGCTAAAAGCGCAAGGCCCG
AAGGTCCCCTGCTTTCCTCCGTAGAGATTATGCGGTATTAGCCTGAGTTCCCCCAGGTTGTCCCCCACTTCTAGGTAGATTCCTACGCATTACTCACCCGTCCGCCACTCGTCAGCACCCCGAAGGGCCTGCTA
CCGTTCGACTTGCATGTGTTAAGCATGCCGCCAGCGTTCAAT

12_1258378
GGAGTAAGGGCCGTGTCTCAGTCCCCTTGTGGCCGTTCACCCTCTCAGGCCGGCTATCCATCATCGCCTTGGTAGGCCATTACCCTACCAACTAGCTAATGGAACGCAAAGTTCTCCTCTAGCGCATATAGC
TTTCATAAGTTCTTGATGCCAAGATCTCATAATATCCGGTATTAGCTAACCTTTCAGTTAGTTGTCCCAGTCTAGAGGGCAAATTCTTTACGCGTTACTCACCCGTCCGCCACCGTACTAACTCCGAAGAGAA
TTCCAGTCGACTTGCATGTGTTAAGCATTTTTTC

When I ran these through BLAST, they matched very well (example: 505 bits (273), 1e-139, 287/294 (98%), 0/294 (0%)) to “uncultured marine clones.”

So my question is what to make of this. Do I want to throw out 50% of my data just because they don’t have a good match in SILVA? Or would I be better off with the perspective that, because I really don’t know anything about the quality of the uncultured clones in BLAST, I should get rid of them? Instinctively, I don’t want to discard them; just because I don’t know who they are doesn’t mean they are garbage data - is there a way around this? Could I just skip the alignment step, or align my sequences to each other rather than to a reference and then trim them to the same length? Any suggestions would be super appreciated!

Thanks,
-Ashley

have you tried flip=T in align.seqs?
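For context: reads that come off the machine in the reverse orientation align terribly until flipped, and flip=T makes align.seqs try the reverse complement when the initial alignment is poor. A quick manual sketch of the check (the fragment is the start of read 41_1257649 from the post above; the helper itself is just illustrative):

```python
# Sketch of the orientation check that flip=T automates: reverse-
# complement a suspect read and re-align it to see whether orientation
# explains the poor alignment. Fragment is from read 41_1257649 above.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

fragment = "GGAGTCTGGACCGTGTCTCAGTTCCAGCGTGGCTGA"
flipped = revcomp(fragment)   # re-align this to test the orientation
```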

Pat

Coming back to my original problem.
Pat, I just read your blog entry - thanks for that!
What I still don’t fully understand is why I also get huge distance matrices, although I have paired reads which fully overlap (300-cycle kit for a 250 bp region)!
As a reminder: I have library I, sequenced with the V3 kit, and library II, sequenced with the V2 kit. Both libraries were built using the primers from the Kozich 2013 paper, spanning the V4 region. I can analyze my libraries separately, but I can’t merge them without running into the known problems with too many unique sequences.

Thanks for your help!

It’s because the overall error rate gets jacked up when you do 300 cycles on a fragment shorter than 300 bp. You might contact Illumina; I’m not clear on the mechanism myself.

Thanks a lot!
I’ll contact Illumina and report what they tell me.

I am running into the same problem with align.seqs and screen.seqs: a very large fraction of my sequences align to the 43,000 region, with reads aligning to the exact position 43116.
When I continue with screen.seqs, I get a high number of messages saying “Your groupfile does not include the sequence IWPCFFB04JZY48 please correct”. But the most striking part is the output: there are two merged.unique.good.names files rather than a group file.

Output File Names:
merged.unique.good.align
merged.unique.bad.accnos
merged.unique.good.names
merged.unique.good.names

mothur > summary.seqs(fasta=merged.unique.good.align, name=merged.unique.good.names, processors=16)


Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 1106 18 0 2 1
2.5%-tile: 1044 1106 23 0 2 57162
25%-tile: 1044 9876 421 0 5 571611
Median: 1044 9908 431 0 5 1143222
75%-tile: 1044 10237 438 0 5 1714833
97.5%-tile: 1044 10364 454 0 6 2229282
Maximum: 1044 13882 508 0 8 2286443
Mean: 1029.47 9347.37 403.839 0 4.97981

# of unique seqs: 527669
total # of seqs: 2286443

Output File Name:
merged.unique.good.align.summary

Anyone have any thoughts?

Many thanks

Can you move this to a new thread since it is a different question? These errors commonly occur because you’ve used the wrong name/group/fasta file at some point. It’s usually best to go back a step or two and try again.

pat