Blank file after summary.seqs

Hello everyone,

I’m a new mothur user and I’m trying to follow the MiSeq SOP tutorial with my own MiSeq data.

The problem starts here:

align.seqs(fasta=test.trim.contigs.good.unique.fasta,reference=/data/DataSet/SSURef_102_SILVA_full_align.pcr.fasta)

gave me the following message:

Some of you sequences generated alignments that eliminated too many bases, a list is provided in test.trim.contigs.good.unique.flip.accnos. If you set the flip parameter to true mothur will try aligning the reverse compliment as well.
It took 3 secs to align 1444 sequences.

Then I ran:

summary.seqs(fasta=test.trim.contigs.good.unique.fasta,count=test.trim.contigs.good.count_table)

which gave me this:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 232 237 1 0 1 1
2.5%-tile: 3465 3700 11 0 2 41
25%-tile: 3683 3961 11 0 2 404
Median: 9259 9684 126 0 3 807
75%-tile: 10547 11072 224 0 4 1210
97.5%-tile: 10547 11072 273 0 5 1573
Maximum: 11072 11073 275 0 9 1613
Mean: 8107.98 8640.33 132.613 0 3.29262

# of unique seqs: 1444

total # of seqs: 1613

I launched

screen.seqs(fasta=test.trim.contigs.good.unique.align,count=test.trim.contigs.good.count_table,summary=test.trim.contigs.good.unique.summary,start=1968, end=11550, maxhomop=8)

exactly as in the tutorial, but maybe I should increase the “start” value?

After that, I tried the following command:

 mothur > summary.seqs(fasta=current, count=current)

which gave me:

Using test.trim.contigs.good.good.count_table as input file for the count parameter.
Using test.trim.contigs.good.unique.good.align as input file for the fasta parameter.

[ERROR]: test.trim.contigs.good.unique.good.align is blank, aborting.

I got the same error message using

screen.seqs(fasta=test.trim.contigs.good.unique.align,count=test.trim.contigs.good.count_table,summary=test.trim.contigs.good.unique.summary,start=3683, end=11550, maxhomop=8)

So what am I doing wrong here? What does the align.seqs warning mean? Could someone help me, please?

I’d say the problem comes from your stop position. Interpreting the start/stop sites and the implications of using them to screen sequences is one of the things that always throws people new to our lab :mrgreen:

This might be a bit of a slow explanation, but hopefully it will make everything clear. The gist of the screen.seqs command is that you want to remove sequences which start after your ‘start’ position, and stop before your ‘stop’ position. Basically, the start/stop positions mark the minimum alignment positions a sequence must cover to be maintained (analogous to screening by length). In order to get the maximum number of sequences out you would pick the highest possible start position and the lowest possible stop position, but doing this also gives you the shortest possible read length. You therefore need to make a decision about trading off sequencing depth for sequence length.

Your 97.5th-percentile start position is 10547, which means that by this position in the full 50,000-character alignment space 97.5% of your sequences have started. If you screened on this parameter and no other, you would discard the 2.5% of your sequences that hadn’t started by this point. You picked 1968 as the start position, which occurs somewhere between the minimum and the 2.5th percentile (so you would only be keeping about 2.5% of your sequences at most). The opposite then happens with the stop position - as none of your sequences make it to position 11550, they all get discarded. This will be why you’re getting empty output files.
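Just to make the mechanics concrete, the most permissive screen would use the percentiles from your summary above - something like this (purely an illustration of the rule, not a recommendation for these exact numbers):

screen.seqs(fasta=test.trim.contigs.good.unique.align, count=test.trim.contigs.good.count_table, summary=test.trim.contigs.good.unique.summary, start=10547, end=3700)

That keeps only sequences that have started by column 10547 and that reach at least column 3700; everything else is removed from the fasta and the count table.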


As a more general observation, though - your sequence numbers look strange, both in terms of sequence yield and the alignment distribution. For example, you only got 1613 sequences post-alignment? How many sequences were there in your initial data? The MiSeq data I've worked with ran 96 samples with 40% PhiX and we obtained around 10k reads per sample, almost all of which survive alignment. If you really only received that many sequences back from your provider then something screwy is happening. If you had millions of sequences but only 1.6k made it through alignment then I'd recommend you retry the alignment using the 'flip=T' parameter.

Hello,
thanks for your help; indeed, the low number of sequences is strange.

Going back through my workflow, I start with 595,919 reads, and it appears that the number of sequences decreases after this command:

screen.seqs(fasta=test.trim.contigs.fasta,group=test.contigs.groups,maxambig=0,maxlength=275)

The mean size is 282 and the 97.5%-tile is 296, so that’s why I lost almost everything… Thus I ran the following command:

screen.seqs(fasta=test.trim.contigs.fasta,group=test.contigs.groups,maxambig=0,maxlength=300)

(changing 275 to 300)

After this command I got 575,206 sequences (that’s better :slight_smile: ).
I aligned them like this:

align.seqs(fasta=test.trim.contigs.good.unique.fasta,reference=/data/DataSet/SSURef_102_SILVA_full_align.pcr.fasta,flip=T)

And I got this warning again:

Some of you sequences generated alignments that eliminated too many bases, a list is provided in test.trim.contigs.good.unique.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 63 secs to align 97050 sequences.

I’m wondering about this warning, “Some of you sequences generated alignments that eliminated too many bases”: what counts as too many bases?
And I don’t get why mothur aligned 97,050 sequences while the summary.seqs command ( summary.seqs(fasta=test.trim.contigs.good.unique.fasta,count=test.trim.contigs.good.count_table) ) gives me this output:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 69 69 0 3 1
2.5%-tile: 1 282 282 0 4 14381
25%-tile: 1 282 282 0 4 143802
Median: 1 286 286 0 4 287604
75%-tile: 1 295 295 0 5 431405
97.5%-tile: 1 296 296 0 5 560826
Maximum: 1 300 300 0 216 575206
Mean: 1 288.09 288.09 0 4.49828

# of unique seqs: 97050

total # of seqs: 575206

Output File Names:
test.trim.contigs.good.unique.summary

It took 3 secs to summarize 575206 sequences.

So how many reads are mapped against the SILVA database? 97,050 or 575,206?

FYI, I processed the database like this:

pcr.seqs(fasta=/data/DataSet/SSURef_102_SILVA_full_align.fasta,start=11894,end=23519,keepdots=F,processors=32)

Again, I’m not very sure about the “start” and “end” values… I think they must correspond to the 16S gene location, so they’re always the same values whatever the reads’ length, right?

Going back through my workflow, I start with 595,919 reads, and it appears that the number of sequences decreases after this command:

Oh right, that’s a bit more reasonable!

I’m wondering about this warning, “Some of you sequences generated alignments that eliminated too many bases”: what counts as too many bases?

It basically means the sequence couldn’t be aligned to the reference database. This could be due either to very poor sequence quality or to some non-specific amplification during your PCR. As long as the number is quite small, don’t worry about it; I think it’s a given that you’ll always lose a few sequences here.
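If you want to eyeball the flagged reads before deciding, something like this should pull them into their own fasta (I’m guessing your file names here):

get.seqs(accnos=test.trim.contigs.good.unique.flip.accnos, fasta=test.trim.contigs.good.unique.fasta)

Bear in mind that file also lists sequences that were merely flipped rather than discarded, so it’s an upper bound on what you actually lost.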

And I don’t get why mothur aligned 97,050 sequences while the summary.seqs command ( summary.seqs(fasta=test.trim.contigs.good.unique.fasta,count=test.trim.contigs.good.count_table) ) gives me this output:

So how many reads are mapped against the SILVA database? 97,050 or 575,206?

There are 97,050 unique sequences in your data set, but these represent 575,206 non-unique sequences (for example: ATAC, TAGG, ATAC, ATAC - 4 sequences, but only 2 unique sequences). To make things faster mothur only aligns the unique representative sequences and records the abundance/distribution of each one among your groups. This is the data recorded in your count table.
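For reference, this is roughly the pair of commands from the SOP that builds that structure earlier in the workflow (your file names may differ slightly):

unique.seqs(fasta=test.trim.contigs.good.fasta)
count.seqs(name=test.trim.contigs.good.names, group=test.contigs.good.groups)

unique.seqs collapses identical reads into one representative each, and count.seqs records how many copies of each representative occur in every group, which is what summary.seqs is reporting as the two totals.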

Again, I’m not very sure about the “start” and “end” values… I think they must correspond to the 16S gene location, so they’re always the same values whatever the reads’ length, right?

Maybe in the ‘true’ sequence, but not necessarily in your data. You’ll almost certainly be trimming DNA from the start of each sequence as part of your quality filtering. Since the quality differs for each sequence some reads will be cut earlier than others and this would give rise to slightly different start positions for each sequence, even if they’re generated from the same piece of template DNA.

Thanks for your answers.

Maybe in the ‘true’ sequence, but not necessarily in your data. You’ll almost certainly be trimming DNA from the start of each sequence as part of your quality filtering. Since the quality differs for each sequence some reads will be cut earlier than others and this would give rise to slightly different start positions for each sequence, even if they’re generated from the same piece of template DNA.

OK, but for example, if I choose start=11894 and end=23519 and a read is trimmed and “loses” 20 bases, it will align from position 11914, but it will still be in the interval (between 11894 and 23519), so that should work, right?

This could be either due to very poor sequence quality or some non-specific amplification during your PCR. As long as the number is quite small don’t worry about it, I think it’s a given that you’ll always lose a few sequences here.

The test.trim.contigs.good.unique.flip.accnos file contains 430 lines, so I guess only 430 reads are not mapped, so yes, it’s quite small.

Another question:

after alignment, summary.seqs gives me this output:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 232 246 3 0 1 1
2.5%-tile: 3566 3855 282 0 4 14381
25%-tile: 3566 3859 282 0 4 143802
Median: 3566 3859 283 0 4 287604
75%-tile: 9259 11072 285 0 5 431405
97.5%-tile: 9259 11072 296 0 5 560826
Maximum: 11070 11073 300 0 23 575206
Mean: 5968.57 6933.08 283.755 0 4.49647

# of unique seqs: 97050

total # of seqs: 575206

Output File Names:
test.trim.contigs.good.unique.summary

It took 4 secs to summarize 575206 sequences.

The MiSeq SOP tutorial does a screen.seqs after this step. I think I should use start=3566 and end=3859 (because I want to keep reads that end at or after 3,859) and maxhomop=5, am I right?

OK, but for example, if I choose start=11894 and end=23519 and a read is trimmed and “loses” 20 bases, it will align from position 11914, but it will still be in the interval (between 11894 and 23519), so that should work, right?

Yeah. You’re basically just trying to pick standardised start/end points for your sequences so that you don’t get gaps at the ends (which will mess with your distance calculations). These overhangs are stripped out in the filter.seqs step. The common sequence between your start/end points is what gets analysed.

The test.trim.contigs.good.unique.flip.accnos file contains 430 lines, so I guess only 430 reads are not mapped, so yes, it’s quite small.

That sounds good. You might find it’s even less than that because that file records all sequences that were flipped or removed. You could work out exactly how many were removed by running this command on the console:

grep -c "NOT" test.trim.contigs.good.unique.flip.accnos

But really, in the worst case you’re losing about 0.5% of your sequences so I think that’s pretty good.

The MiSeq SOP tutorial does a screen.seqs after this step. I think I should use start=3566 and end=3859 (because I want to keep reads that end at or after 3,859) and maxhomop=5, am I right?

Hm, to be honest those alignment stats look quite odd to me. For example, your 97.5%-tile start position occurs after your median stop position. Can you run the following commands and then tell me the output of both summaries?

screen.seqs(fasta=test.trim.contigs.good.unique.align, count=test.trim.contigs.good.count_table, maxhomop=8, optimize=start-end, criteria=95)
summary.seqs(fasta=current,count=current)
filter.seqs(fasta=current,trump=.,vertical=T)
summary.seqs(fasta=current,count=current)

Well, I ran these commands; this is the first summary output:

summary.seqs(fasta=current,count=current)
Using test.trim.contigs.good.good.count_table as input file for the count parameter.
Using test.trim.contigs.good.unique.good.align as input file for the fasta parameter.

Using 1 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 2894 3855 47 0 3 1
2.5%-tile: 3566 3855 282 0 4 14319
25%-tile: 3566 3859 282 0 4 143187
Median: 3566 3859 283 0 4 286374
75%-tile: 9259 11072 285 0 5 429561
97.5%-tile: 9259 11072 296 0 5 558429
Maximum: 9259 11073 300 0 8 572747
Mean: 5966.9 6932.87 284.083 0 4.49983

# of unique seqs: 95337

total # of seqs: 572747

Output File Names:
test.trim.contigs.good.unique.good.summary

It took 25 secs to summarize 572747 sequences.

Filter output:

Length of filtered alignment: 0
Number of columns removed: 11625
Length of the original alignment: 11625
Number of sequences used to construct filter: 95337

Output File Names:
test.filter
test.trim.contigs.good.unique.good.filter.fasta

2nd summary output:

summary.seqs(fasta=current,count=current)
Using test.trim.contigs.good.good.count_table as input file for the count parameter.
Using test.trim.contigs.good.unique.good.filter.fasta as input file for the fasta parameter.

Using 1 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: -1 -1 0 0 1 1
2.5%-tile: 0 0 0 0 1 14319
25%-tile: 0 0 0 0 1 143187
Median: 0 0 0 0 1 286374
75%-tile: 0 0 0 0 1 429561
97.5%-tile: 0 0 0 0 1 558429
Maximum: -1 -1 0 0 1 572747
Mean: 0 0 0 0 1

# of unique seqs: 95337

total # of seqs: 572747

Output File Names:
test.trim.contigs.good.unique.good.filter.summary

It took 1 secs to summarize 572747 sequences.

So it looks like I got nothing after filtering my sequences… Does that mean every character of my alignments is indeed a gap “-”?

Not exactly, but in effect it’s the same thing. Looking at your summary output you seem to have two sequence distributions - a cluster of sequences around the 3566-3859 positions and another around the 9259-11073 positions. You’re probably losing all the alignment positions because the second cluster falls into the overhang region of the first cluster, and vice versa (filter.seqs documentation).
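As a toy picture of why the filtered length came out as zero (a made-up 12-column alignment; mothur marks the unaligned overhang at either end with ‘.’):

>clusterA_1
.ATGCA......
>clusterA_2
.ATGCT......
>clusterB_1
......TTGGA.
>clusterB_2
......TTGGC.

With trump=., any column containing a ‘.’ in any sequence is removed. Cluster A has dots across all of cluster B’s columns and vice versa, so every column gets trumped and the filtered alignment length is 0, which is exactly what your filter.seqs output reported.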

What primer set did you use, and do you know what variable region your primers were targeting? It might help you work out which cluster is correct, although the fact that you see these two clusters at all is a bit odd. I would recommend you at least split the data set, and then look at each cluster in detail. You might find at the classification stage that one of these is complete junk (host or environmental genomic DNA or some such), but that’s not clear at this point.

I’m guessing the file names a little bit (but I think they’re correct), but here’s how I’d start off (if you’re on Windows use ‘move’ instead of ‘mv’):

screen.seqs(fasta=test.trim.contigs.good.unique.good.align,count=test.trim.contigs.good.good.count_table, start=3566,end=3859)
filter.seqs(fasta=test.trim.contigs.good.unique.good.good.align, vertical=T,trump=.)
system(mv test.trim.contigs.good.unique.good.good.filter.fasta test.cluster1.fasta)
system(mv test.trim.contigs.good.good.good.count_table test.cluster1.count_table)

screen.seqs(fasta=test.trim.contigs.good.unique.good.align,count=test.trim.contigs.good.good.count_table, start=9259,end=11072)
filter.seqs(fasta=test.trim.contigs.good.unique.good.good.align, vertical=T,trump=.)
system(mv test.trim.contigs.good.unique.good.good.filter.fasta test.cluster2.fasta)
system(mv test.trim.contigs.good.good.good.count_table test.cluster2.count_table)

And from there look at the summary.seqs of each file set, as well as the classification of your sequences.

OK, I’ll find out what primers were used and what regions were targeted.

Here is the output of summary.seqs for each cluster:

mothur > summary.seqs(fasta=test.cluster1.fasta,count=test.cluster1.count_table)

Using 1 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: -1 -1 0 0 1 1
2.5%-tile: 1 294 282 0 4 6786
25%-tile: 1 294 282 0 4 67851
Median: 1 294 282 0 4 135702
75%-tile: 1 294 282 0 4 203553
97.5%-tile: 1 294 294 0 5 264618
Maximum: 1 294 294 0 8 271403
Mean: 0.988327 290.568 280.296 0 4.13621

# of unique seqs: 43961

total # of seqs: 271403

Output File Names:
test.cluster1.summary

It took 1 secs to summarize 271403 sequences.

mothur > summary.seqs(fasta=test.cluster2.fasta,count=test.cluster2.count_table)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 315 45 0 2 1
2.5%-tile: 1 315 284 0 4 6045
25%-tile: 1 315 285 0 5 60445
Median: 1 315 285 0 5 120890
75%-tile: 1 315 285 0 5 181335
97.5%-tile: 1 315 285 0 5 235735
Maximum: 243 315 291 0 8 241779
Mean: 1.45842 315 284.452 0 4.97612

# of unique seqs: 41452

total # of seqs: 241779

Output File Names:
test.cluster2.summary
It took 1 secs to summarize 241779 sequences.

Just a few questions:
What does it mean when summary.seqs prints “-1”? Does that mean the alignment failed?
According to my results, it looks like 271,403 reads are aligned in the 1st cluster over positions 1 to 294, and 241,779 reads in the 2nd cluster over positions 1 to 315.
So the 2nd cluster contains slightly fewer sequences, but they are a bit longer.


About classify.seqs: I’m not sure how to use this command.

For the taxonomy file I used silva.bacteria.silva.tax, which I downloaded from the SILVA reference files page (release 102), and for the template file I used silva.gold.align, downloaded from the same page (I found the idea here: http://mothur.ltcmp.net/t/template-file-in-classify-seqs/2121/1), but I got a lot of error messages like this:

Z76665.1 is in your taxonomy file and is not in your template file. Please correct.
Z76666.1 is in your taxonomy file and is not in your template file. Please correct.
Z76667.1 is in your taxonomy file and is not in your template file. Please correct.
Z76669.1 is in your taxonomy file and is not in your template file. Please correct.
Z76671.1 is in your taxonomy file and is not in your template file. Please correct.
Z76672.1 is in your taxonomy file and is not in your template file. Please correct.
Z76673.1 is in your taxonomy file and is not in your template file. Please correct.
Z76675.1 is in your taxonomy file and is not in your template file. Please correct.
Z78203.1 is in your taxonomy file and is not in your template file. Please correct.
Z78204.1 is in your taxonomy file and is not in your template file. Please correct.
Z78206.1 is in your taxonomy file and is not in your template file. Please correct.
Z83204.1 is in your taxonomy file and is not in your template file. Please correct.
Z83205.1 is in your taxonomy file and is not in your template file. Please correct.
Z94012.1 is in your taxonomy file and is not in your template file. Please correct.
Z95707.1 is in your taxonomy file and is not in your template file. Please correct.
Z95708.1 is in your taxonomy file and is not in your template file. Please correct.
Z95709.1 is in your taxonomy file and is not in your template file. Please correct.
Z95710.1 is in your taxonomy file and is not in your template file. Please correct.
Z95711.1 is in your taxonomy file and is not in your template file. Please correct.
Z95712.1 is in your taxonomy file and is not in your template file. Please correct.
Z95713.1 is in your taxonomy file and is not in your template file. Please correct.
Z95714.1 is in your taxonomy file and is not in your template file. Please correct.
Z95715.1 is in your taxonomy file and is not in your template file. Please correct.
Z95716.1 is in your taxonomy file and is not in your template file. Please correct.
Z95717.1 is in your taxonomy file and is not in your template file. Please correct.
Z95718.1 is in your taxonomy file and is not in your template file. Please correct.
Z95719.1 is in your taxonomy file and is not in your template file. Please correct.
Z95720.1 is in your taxonomy file and is not in your template file. Please correct.
Z95721.1 is in your taxonomy file and is not in your template file. Please correct.
Z95722.1 is in your taxonomy file and is not in your template file. Please correct.
Z95723.1 is in your taxonomy file and is not in your template file. Please correct.
Z95724.1 is in your taxonomy file and is not in your template file. Please correct.
Z95725.1 is in your taxonomy file and is not in your template file. Please correct.
Z95726.1 is in your taxonomy file and is not in your template file. Please correct.
Z95727.1 is in your taxonomy file and is not in your template file. Please correct.
Z95728.1 is in your taxonomy file and is not in your template file. Please correct.
Z95729.1 is in your taxonomy file and is not in your template file. Please correct.
Z95730.1 is in your taxonomy file and is not in your template file. Please correct.
Z95731.1 is in your taxonomy file and is not in your template file. Please correct.
Z95732.1 is in your taxonomy file and is not in your template file. Please correct.
Z95733.1 is in your taxonomy file and is not in your template file. Please correct.
Z95734.1 is in your taxonomy file and is not in your template file. Please correct.
Z95735.1 is in your taxonomy file and is not in your template file. Please correct.
Z95736.1 is in your taxonomy file and is not in your template file. Please correct.
Z95737.1 is in your taxonomy file and is not in your template file. Please correct.
Z96077.1 is in your taxonomy file and is not in your template file. Please correct.
Z96079.1 is in your taxonomy file and is not in your template file. Please correct.
Z96080.1 is in your taxonomy file and is not in your template file. Please correct.
Z96081.1 is in your taxonomy file and is not in your template file. Please correct.
Z96085.1 is in your taxonomy file and is not in your template file. Please correct.
Z96086.1 is in your taxonomy file and is not in your template file. Please correct.
Z96087.1 is in your taxonomy file and is not in your template file. Please correct.
Z96089.1 is in your taxonomy file and is not in your template file. Please correct.
Z96090.1 is in your taxonomy file and is not in your template file. Please correct.
Z96093.1 is in your taxonomy file and is not in your template file. Please correct.
Z96095.1 is in your taxonomy file and is not in your template file. Please correct.
Z96096.1 is in your taxonomy file and is not in your template file. Please correct.
Z96097.1 is in your taxonomy file and is not in your template file. Please correct.
Z96098.1 is in your taxonomy file and is not in your template file. Please correct.
Z99997.1 is in your taxonomy file and is not in your template file. Please correct.
Z99998.1 is in your taxonomy file and is not in your template file. Please correct.
Z99999.1 is in your taxonomy file and is not in your template file. Please correct.

So obviously my template file is wrong; what should I use then? I don’t really get what this file is for.

Oh right, the gold database is the one used for chimera screening. For classification the taxonomy files all refer to the sequences in the silva.bacteria.fasta file. There are several taxonomies, because the SILVA databases list the equivalent RDP/Greengenes/EMBL taxonomies for each sequence, so I guess it’s up to you which one you use.

OK, so I guess I can use silva.bacteria.silva.tax as the taxonomy file.
But what template file should I use ?

In the folder you downloaded from the silva reference files link there should be ‘silva.bacteria.fasta’. Use that one.

OK, thanks.
So I ran classify.seqs(fasta=test.cluster1.fasta,template=silva.bacteria/silva.bacteria.fasta,taxonomy=silva.bacteria/silva.bacteria.silva.tax) and classify.seqs(fasta=test.cluster2.fasta,template=silva.bacteria/silva.bacteria.fasta,taxonomy=silva.bacteria/silva.bacteria.silva.tax).

With the first command I got warnings like these:

MSQ-M01173_24_000000000-ABPJ2_1_1113_7894_10272 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1113_7894_10272 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_1113_21256_10359 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1113_21256_10359 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2106_20905_16882 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2106_20905_16882 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_1106_20961_15237 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1106_20961_15237 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_1106_20521_15283 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1106_20521_15283 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_1106_26743_15301 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1106_26743_15301 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2114_21786_25016 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2114_21786_25016 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
Processing sequence: 10900
MSQ-M01173_24_000000000-ABPJ2_1_1106_3862_15473 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1106_3862_15473 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_1106_11512_15525 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1106_11512_15525 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_1113_19321_10698 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1113_19321_10698 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2106_25074_17353 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2106_25074_17353 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_1113_14398_10751 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1113_14398_10751 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_1113_20413_10829 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_1113_20413_10829 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2114_13509_25813 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2114_13509_25813 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
Processing sequence: 10900
MSQ-M01173_24_000000000-ABPJ2_1_2114_23474_26013 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2114_23474_26013 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2106_19954_17741 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2106_19954_17741 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2114_9998_26153 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2114_9998_26153 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2114_19005_26316 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2114_19005_26316 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
Processing sequence: 10992
MSQ-M01173_24_000000000-ABPJ2_1_2106_22137_17959 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2106_22137_17959 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
Processing sequence: 10990
MSQ-M01173_24_000000000-ABPJ2_1_2106_3949_18212 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2106_3949_18212 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2114_10472_26873 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2114_10472_26873 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2114_20970_26980 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2114_20970_26980 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
MSQ-M01173_24_000000000-ABPJ2_1_2114_16464_27200 is bad. It has no kmers of length 8.
[WARNING]: MSQ-M01173_24_000000000-ABPJ2_1_2114_16464_27200 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.

But not with the second cluster.
The test.cluster1.silva.wang.taxonomy and test.cluster1.silva.wang.tax.summary files were created fine. I expected way more “unclassified” taxonomies in the cluster1 files, but that doesn’t seem to be the case:

$ grep -c "unclassified" test.cluster1.silva.wang.tax.summary
575
$ grep -c "unclassified" test.cluster2.silva.wang.tax.summary
353

Hm, I’ve never seen that message before, but I assume it means the flagged sequences were <8 bp long (a pretty sure sign that they’re bad).
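If you do want to drop them, the warning’s own suggestion would look something like this (I’m guessing your file names):

remove.lineage(fasta=test.cluster1.fasta, count=test.cluster1.count_table, taxonomy=test.cluster1.silva.wang.taxonomy, taxon=unknown;)

That strips the unclassifiable reads out of the fasta, count table and taxonomy file in one go.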

Grepping the summary file probably isn’t the best way to check the abundances because in those files all the read abundances are summarised as a table. You’ll get the keyword ‘unclassified’ a lot down at the low taxonomic levels, which isn’t very meaningful. For example, in the summary file grep would count ‘unclassified’ the same as ‘Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;unclassified’ although the former is much more likely to be an erroneous sequence*. You can open the summary files up in Excel and check the actual abundance of sequences unclassified at the different taxonomic levels. Does cluster 1 have more sequences unclassified at the domain/kingdom/phylum levels?


*More likely, but not necessarily. We had a manuscript in review a while ago where we reported an OTU classified to genus level which turned out (after BLASTing against the NCBI nt database) to be a random fragment of the host genome that got amplified in the PCR. So never trust anything!

What is the difference between “unknown” and “unclassified” in the *.tax.summary files?

Because in cluster 1 I got the following line:

1 0.3 unknown 1 1774

Otherwise I got this:

2 0.3.1 unclassified 1 1774
3 0.1.19.1 unclassified 1 2
3 0.1.20.1 unclassified 1 2
3 0.1.40.1 unclassified 1 107
3 0.3.1.1 unclassified 1 1774

Whereas in cluster 2 I only have this:
3 0.1.11.1 unclassified 1 42
3 0.1.40.1 unclassified 1 2

I used Excel to check, but grep works too:

$ grep "unclassified" test.cluster2.silva.wang.tax.summary | grep -P '^[0-3]\s+'
3       0.1.11.1        unclassified    1       42
3       0.1.40.1        unclassified    1       2
$ grep "unclassified" test.cluster1.silva.wang.tax.summary | grep -P '^[0-3]\s+'
3       0.1.19.1        unclassified    1       2
3       0.1.20.1        unclassified    1       2
3       0.1.40.1        unclassified    1       107
2       0.3.1   unclassified    1       1774
3       0.3.1.1 unclassified    1       1774
$ grep "unknown" test.cluster2.silva.wang.tax.summary | grep -P '^[0-3]\s+'
$ grep "unknown" test.cluster1.silva.wang.tax.summary | grep -P '^[0-3]\s+'
1       0.3     unknown 1       1774

I believe it’s to distinguish between sequences that can’t be classified at the domain level (unknown) and those that can be classified to some taxonomic ranking, but not down to genus/species level (unclassified).

Since most sequences won’t classify all the way down to genus/species level they’ll all have ‘unclassified’ in their taxonomy somewhere so you need a way of differentiating between sequences that can’t be classified further and sequences that can’t be classified at all.

Hello,

I’m back after checking my data’s background :slight_smile:

So regions V5 and V6 were amplified using V5-784F and V6-1061R primers.
I know that the SILVA DB comes in several regions like V35 and V69. Since it contains 50,000 columns (BTW, I find that very strange; could someone explain to me how that is possible? I doubt rRNA is that big), how can I select the region I want?

I think the general wisdom is to find yourself a full length E. coli 16S sequence (Google ftw!) and then run it through pcr.seqs with your primers. Align the result and then look at it using summary.seqs. That’ll give you the expected start/stop positions in the 50k position alignment.
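In mothur terms that would be something like this (the E. coli fasta and oligos file names here are just placeholders):

pcr.seqs(fasta=ecoli.16s.fasta, oligos=primers.oligos)
align.seqs(fasta=ecoli.16s.pcr.fasta, reference=silva.bacteria.fasta)
summary.seqs(fasta=ecoli.16s.pcr.align)

The Start/End columns in that summary are the alignment coordinates your amplicon should occupy in the SILVA space.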

For the 50,000 position alignment, it’s really just so that you can insert bases wherever you want without changing everything. Since the SILVA reference is a fixed alignment space (unlike the multiple alignments you get from MUSCLE et al.) you can’t change the positions in the reference database every time you discover a new insertion. What I mean is, say we had a reference database of 10bp sequences like:

>1
ATATATATAT
>2
ATATA-ATAT
>3
TGATCATGAT
>4
AAACATGACT

And then I discovered a new sequence ‘ATACTAGCATA’. This couldn’t fit into the existing alignment space (11bp vs 10bp), so we’d need to either filter out a base of my sequence, or resize the entire reference database. I know ‘entire’ means only four sequences here, but it’s over 100k in SILVA, and remember that the length of the 16S gene isn’t constant between species - it is size variable - so we expect to occasionally (frequently?) see new sequences that don’t quite fit into the old space around the variable loop regions. To pre-empt this problem, SILVA write their alignments more like

>1
---A---T---A---T---A---T---A---T---A---T---
>2
---A---T---A---T---A-------A---T---A---T---
>3
---T---G---A---T---C---A---T---G---A---T---
>4
---A---A---A---C---A---T---G---A---C---T---

So that any future indels can be easily accounted for in the existing space without error. This also allows you to align the 18S rRNA gene into the same database. If you use ARB, or have been to the SILVA website you’ll have noticed that the databases aren’t 16S/23S, but Small Subunit [SSU] and Large Subunit [LSU], because they include both prokaryotic genes and the eukaryotic equivalents in the same database.

Thanks for the explanations about SILVA database.

So I ran this:

pcr.seqs(fasta=eColi16s.fasta,oligos=primers.oligo)

(since I can’t use the ecoli and oligos options at the same time).
My primers.oligo file looks like this:
primer AGGATTAGATACCCTGGTA CRRCACGAGCTGACGAC V5-784F_V6-1061R

Then I did this to align the pcr.seqs output against SILVA:

align.seqs(fasta=eColi16s.pcr.fasta,reference=/data/DataSet/SSURef_102_SILVA_full_align.pcr.fasta)

Using 1 processors.

Reading in the /data/DataSet/SSURef_102_SILVA_full_align.pcr.fasta template sequences… DONE.
It took 267 to read 312918 sequences.
Aligning sequences from eColi16s.pcr.fasta …
Segmentation fault (core dumped)

What did I do wrong?