Alignment issues

Hello. I am having some issues aligning a portion of my sequences. They kept getting filtered out when processed with the majority of the sequences, so I tried analyzing them separately to see what might be causing the issue. The number of bases is far too small, even after mothur flipped the sequences to produce a better alignment. Any help would be greatly appreciated. Thank you.

Before screen.seqs this was the summary:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 35 35 0 2 1
2.5%-tile: 1 35 35 0 4 215362
25%-tile: 1 440 440 18 5 2153611
Median: 1 445 445 20 6 4307222
75%-tile: 1 465 465 22 6 6460833
97.5%-tile: 1 538 538 53 35 8399082
Maximum: 1 602 602 86 301 8614443
Mean: 1 410 410 20 8
# of unique seqs: 8614443
total # of seqs: 8614443

It took 126 secs to summarize 8614443 sequences.

I ran screen.seqs using the same parameters that I used for the rest of the samples. Should I not be doing this? Ideally, I would like these samples to work in the same run as the rest of the samples.

mothur > screen.seqs(fasta = /Users/joehansen/Documents/USA/MicrobiomeProject/Bioinformatics/HansenWithPrelimNewMethodV3V4/PreLim/prelim.trim.contigs.fasta , count = /Users/joehansen/Documents/USA/MicrobiomeProject/Bioinformatics/HansenWithPrelimNewMethodV3V4/PreLim/prelim.contigs.count_table , maxambig = 0, minlength = 200, maxlength = 466, maxhomop = 8)

The summary.seqs after the screen.seqs was:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 200 200 0 3 1
2.5%-tile: 1 215 215 0 4 9489
25%-tile: 1 287 287 0 5 94882
Median: 1 288 288 0 5 189763
75%-tile: 1 290 290 0 6 284644
97.5%-tile: 1 313 313 0 7 370036
Maximum: 1 465 465 0 8 379524
Mean: 1 285 285 0 5
# of unique seqs: 379524
total # of seqs: 379524

It took 4 secs to summarize 379524 sequences.

mothur > align.seqs(fasta = /Users/joehansen/Documents/USA/MicrobiomeProject/Bioinformatics/HansenWithPrelimNewMethodV3V4/PreLim/prelim.trim.contigs.good.fasta , reference = /Users/joehansen/Documents/USA/MicrobiomeProject/Bioinformatics/silva.bacteria/silva.bacteria.fasta )

It took 483 secs to align 379524 sequences.

[WARNING]: 372581 of your sequences generated alignments that eliminated too many bases, a list is provided in /Users/joehansen/Documents/USA/MicrobiomeProject/Bioinformatics/HansenWithPrelimNewMethodV3V4/PreLim/prelim.trim.contigs.good.flip.accnos.

[NOTE]: 207099 of your sequences were reversed to produce a better alignment.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1044 1060 3 0 1 9489
25%-tile: 43029 43116 9 0 3 94882
Median: 43059 43116 13 0 3 189763
75%-tile: 43097 43116 19 0 4 284644
97.5%-tile: 43112 43116 38 0 6 370036
Maximum: 43116 43116 457 0 8 379524
Mean: 39361 39658 18 0 3
# of unique seqs: 379524
total # of seqs: 379524

It took 50 secs to summarize 379524 sequences.
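
For a sense of scale, the counts reported in the logs above can be turned into percentages. This is only a quick Python sketch using numbers copied from the mothur output; the variable names and the `percent` helper are made up for illustration and are not part of mothur.

```python
# Counts copied from the mothur output above.
total_before_screen = 8_614_443  # summary.seqs before screen.seqs
total_after_screen = 379_524     # summary.seqs after screen.seqs
flagged_by_align = 372_581       # align.seqs [WARNING]: eliminated too many bases
reversed_by_align = 207_099      # align.seqs [NOTE]: reversed for a better alignment


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals (hypothetical helper)."""
    return round(100 * part / whole, 2)


kept = percent(total_after_screen, total_before_screen)
flagged = percent(flagged_by_align, total_after_screen)
flipped = percent(reversed_by_align, total_after_screen)

print(f"kept by screen.seqs:   {kept}%")
print(f"flagged by align.seqs: {flagged}%")
print(f"reversed by align.seqs: {flipped}%")
```

With these counts, about 4.41% of the contigs survive screen.seqs, about 98.17% of those survivors are flagged by align.seqs, and about 54.57% were reversed.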

Can you look at this thread? They have nearly the same distribution of sequence lengths that you had before running screen.seqs.

Pat

Hi Pat. Thank you for the response. It is very similar to the issue I am having, and it also looks at the V3-V4 region (I know this is not recommended). It looks like their issue was setting maxlength too low in screen.seqs. Do you think this would still be the cause of my issue even though I set it at the 75%-tile? That value was the 97.5%-tile when all the other samples were included in the analysis.

It looks like most of your sequences have at least one ambiguous base. At least 75% of your sequences have 18 or more ambiguous bases in them. So, when you use screen.seqs with maxambig = 0, you will lose any sequence with an ambiguous base. I’m afraid this comes back to sequence quality: you are using 2x300 nt reads on a region (V3-V4) where the reads do not fully overlap each other.
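
To make the effect of maxambig = 0 concrete, here is a minimal Python sketch of the kind of per-sequence check described above. It is not mothur's implementation. The function names are hypothetical, the defaults simply mirror the screen.seqs settings used earlier in this thread, and it assumes that any character other than A, C, G, T, or U counts as ambiguous.

```python
from itertools import groupby

# Assumption: anything outside this set is treated as an ambiguous base.
UNAMBIGUOUS = set("ACGTU")


def count_ambiguous(seq):
    """Count bases that are not A, C, G, T, or U."""
    return sum(1 for base in seq.upper() if base not in UNAMBIGUOUS)


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def passes_screen(seq, maxambig=0, minlength=200, maxlength=466, maxhomop=8):
    """Illustrative filter mirroring the four criteria used in this thread."""
    return (
        count_ambiguous(seq) <= maxambig
        and minlength <= len(seq) <= maxlength
        and longest_homopolymer(seq) <= maxhomop
    )


clean = "ACGT" * 100           # 400 nt, no ambiguous bases
one_n = "ACGT" * 99 + "ACGN"   # 400 nt, a single N

print(passes_screen(clean))              # True
print(passes_screen(one_n))              # False: one N exceeds maxambig = 0
print(passes_screen(one_n, maxambig=1))  # True once one ambiguous base is allowed
```

Under these assumptions, a contig of acceptable length is still rejected as soon as it carries a single ambiguous base, which is why so few of the sequences survive the screen.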

Pat

Thank you for clearing that up for me. That makes sense.