No data after filter.seqs

Good morning,

I’m helping a collaborator with their analysis and I’ve hit something I’ve never seen before. They did Miseq V1-3 and are using only the R1 read in analysis because the overlap of R1 and R2 was poor (I made them promise to never do V1-3 again and sent them the “huge distance matrix” blog post :smiley: ).

Here’s what I’ve done:
fastq.info(fastq=SJR1_S251_L001_R1_001.fastq) (did this for each R1 file)

trim.seqs(fasta=SJR1_S251_L001_R1_001.fasta, oligos=SJR1_oligos.txt, qfile=SJR1_S251_L001_R1_001.qual, pdiffs=2, maxambig=0, maxhomop=8, flip=F, qwindowaverage=30, qwindowsize=25, processors=2) (did this for each file)

make.group
merge.files
screen.seqs(fasta=tick.fasta, group=tick.groups, minlength=150, maxlength=350)
unique.seqs(fasta=tick.good.fasta)
count.seqs(name=tick.good.names, group=tick.good.groups)
align.seqs(fasta=tick.good.unique.fasta, reference=silva.v1v3.fasta)

summary.seqs(fasta=tick.good.unique.align, count=tick.good.count_table)
Start End NBases Ambigs Polymer NumSeqs
Minimum: -1 -1 0 0 1 1
2.5%-tile: 2 4 1 0 1 36471
25%-tile: 2 4295 30 0 3 364706
Median: 2 5060 239 0 4 729411
75%-tile: 2 5188 274 0 5 1094116
97.5%-tile: 10853 12081 280 0 5 1422350
Maximum: 12081 12081 283 0 8 1458820
Mean: 1578.71 5367.35 185.771 0 3.675

of unique seqs: 592288

total # of seqs: 1458820

screen.seqs(fasta=tick.good.unique.align, count=tick.good.count_table, summary=tick.good.unique.summary, end=12081)

filter.seqs(fasta=tick.good.unique.good.align, vertical=T, trump=.)
Length of filtered alignment: 1
Number of columns removed: 12082
Length of the original alignment: 12083
Number of sequences used to construct filter: 146562

I read similar posts in the forum and filter.seqs removes all data and tried:

screen.seqs(fasta=tick.good.unique.align, count=tick.good.count_table, optimize=start-end)
Optimizing start to 10841.
Optimizing end to 57.

summary.seqs(fasta=tick.good.unique.good.align, count=tick.good.good.count_table)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 57 6 0 1 1
2.5%-tile: 2 3663 10 0 2 29458
25%-tile: 2 4376 209 0 4 294574
Median: 2 5060 255 0 4 589148
75%-tile: 2 5188 280 0 5 883722
97.5%-tile: 9324 12081 280 0 5 1148838
Maximum: 10841 12081 283 0 8 1178295
Mean: 670.108 5243.43 228.897 0 4.19013

of unique seqs: 428552

total # of seqs: 1178295

filter.seqs(fasta=tick.good.unique.good.align, vertical=T, trump=.)
Length of filtered alignment: 0
Number of columns removed: 12081
Length of the original alignment: 12081
Number of sequences used to construct filter: 428552

I’m not sure what’s going on. The screen.seqs start and end optimization looks weird. Anyone seen this before?

Thank you,
Bonnie

1 Like

I had a similar problem:

Your minimum sequence goes from 1-57, and your maximum is from 10841-12081. Each of these will have … for all other positions. This means you have a sequence with 57- “…” -end and one that is start - “…” - 10841. The result is that for each position in the column there’s a ‘.’, which is your trump character (meaning all columns with that character get removed). Thus, you remove all columns.

Looking at the rest of your summary, it looks like you really want your end to be somewhere around 5188 and your start to be 2. You also want a minlength parameter, to remove the really small ones (2.5%-tile and below are really small).

I hope this helps, it worked for me when I ran into this.

1 Like

Hey Bonnie,

I suspect you don’t really know the orientation of each fragment - I looks like half aligned and half didn’t. Can you try using flip=T in align.seqs?

Also, you probably want to set start=2. The end will take anything that ends after 12081. So you probably have at least one sequence that also starts at
12081 leading every column to have at least one ‘.’ in it.

Pat

PS. Make them promise not to use the V3 chemistry too!

1 Like

Thank you both for your help!

Pat, sadly, the genome center here is set on using the V3 chemistry. I’ve tried…