Large number of scrap sequences

Hi all,

I have been using the ‘Costello Stool Analysis’ pipeline and applying it to my samples one by one (if there is an easier way to do this then I would love to know), however when I run the trim.seqs command (mothur > trim.seqs(fasta=stool.fasta, oligos=stool.oligos, qfile=stool.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50, processors=2) I lose a lot of my sequences. For example, if I had 3000 sequences in the untrimmed fasta file I would generally have around half this (1500 sequences) in the trimmed file. The reason for just about all the scrapped sequences is due the (‘q’) qaverage and thus poor average quality score. As expected, when i reduced the ‘qwindowaverage’ in the trim.seqs command from 35 to 25 just about all the sequences were included (only 10% removed compared with ~50% initially). Its not just more sequences that are included, the average number of bases in the sequences are increased substantially (e.g. ‘qwindowaverage=35’ gives average ~80 nbases and ‘qwindowaverage=25’ gives average ~250).

Below is the logfile for one sample as an e.g…

mothur > summary.seqs(fasta=current)
Using PGWM2SUMP.raw.trim.fasta as input file for the fasta parameter. -trimmed file(qwindowaverage=35)----------

Start End NBases Ambigs Polymer
Minimum: 1 50 50 0 3
2.5%-tile: 1 57 57 0 4
25%-tile: 1 72 72 0 4
Median: 1 88 88 0 4
75%-tile: 1 210 210 0 4
97.5%-tile: 1 281 281 0 5
Maximum: 1 414 414 0 6

of Seqs: 1087

Output File Name:

mothur > summary.seqs(fasta=PGWM2SUMP.raw.fasta) -Original File--------

Start End NBases Ambigs Polymer
Minimum: 1 225 225 0 3
2.5%-tile: 1 538 538 0 4
25%-tile: 1 549 549 0 4
Median: 1 556 556 0 5
75%-tile: 1 566 566 2 5
97.5%-tile: 1 667 667 7 7
Maximum: 1 1080 1080 35 31

of Seqs: 2757

Output File Name:

Using PGWM2SUMP.raw.trim.fasta as input file for the fasta parameter. -trimmed file(qwindowaverage=25)------------

Start End NBases Ambigs Polymer
Minimum: 1 50 50 0 3
2.5%-tile: 1 69 69 0 4
25%-tile: 1 92 92 0 4
Median: 1 251 251 0 4
75%-tile: 1 380 380 0 5
97.5%-tile: 1 437 437 0 6
Maximum: 1 524 524 0 7

of Seqs: 2406

[b]What I would like advice on is... Should I stick to using an 'qwindowaverage' of 35 and have the far reduced sequences in my analysis or lower the 'qwindowaverage' and increase the number of length of the sequences but obviously allow poorer quality sequences to make it into the analysis?!?[/b]

Im relatively new to mothur but having invested the last few days (with little sleep) into analysing the sequences from the 18 samples that I recently aquired I feel I know my way around the program pretty well.

Thanks in advance for any help,

Since posting the previous post I have carried one sample through the costello stool analysis pipeline exactly, hence I used the ‘qwindowaverage=35’. However, having got to the filter.seqs section I have encounted a problem, this is shown below in the logfile…

Length of filtered alignment: 0
Number of columns removed: 50000
Length of the original alignment: 50000
Number of sequences used to construct filter: 831

Output File Names:

mothur > summary.seqs(fasta=current, name=current)
Using 340080F.raw.trim.unique.good.filter.fasta as input file for the fasta parameter.
Using 340080F.raw.trim.good.names as input file for the name parameter.

Start End NBases Ambigs Polymer
Minimum: 0 0 0 0 1
2.5%-tile: -1 -1 0 0 1
25%-tile: -1 -1 0 0 1
Median: -1 -1 0 0 1
75%-tile: -1 -1 0 0 1
97.5%-tile: -1 -1 0 0 1
Maximum: -1 -1 0 0 1

of unique seqs: 831

total # of seqs: 3123

Output File Name:

Again, any help will be greatly appriciated!!! :s



It is common to have bad batches of sequences with generally low quality scores and its important to keep in mind that the goal isn’t the # of sequences but the # of high quality sequences. The manuscript is currently in review, but when we use a qwindowaverage=35 over 50 bases the error rate is 0.0010 (1 bp per 1000) and when we use qwindowaverage of 30 it is about 0.0020. I’m not sure what the error rate is for qwindowaverage=25 over 50, but it is likely to be at least double again. Considering even with the stringent 35/50 set up we don’t hit the correct number of OTUs (a few extra), I’d be leery of going to 25. Although we end up chucking ~35% of the reads at this step as being shorter than 200 bp, situations such as yours are generally typical of bad sequencing runs.

As for your second question, I suspect you might have a slight misstep in your screen.seqs step. Can you run summary.seqs on the input to filter.seqs and post the results along with what you’re using to run filter.seqs?


Hi Pat, thanks for the response.

Firstly I was able to overcome the second problem by adding flip=T to the align.seqs command.

With regard to the qwindowaverage I have since carried out analysis with that set to 35, however to incorporate more sequences without lowering the quality to <35 I increased the qwindowsize to 100. From the 121836 sequences I started with I was left with 729 by the end of the Costello pipeline example. Is there a reason why qwindowsize=50 is the ‘gold standard’?

I left a post yesturday asking this question but it appears to no longer exists on the forums, no does a post I left this morning (assisting people in where I went wrong when I first started using mothur so they had all the presteps to the costello pipeline), despite both posts definately getting successfully posted…

Diverging topic slightly, I was also wondering if its possible to draw a phylo tree with the names of the taxa and not the names of the fasta sequence (i.e. Staphylococcus and not G2HICOQ02EEQ37)? I followed the phylogenetic-based methods in the costello stool example to generate the .tre file.

Thanks very much for your helpful response, its greatly appriciated!

Sorry, I’m trying to limit threads to a single topic so they are actually useful to other people. I thought this thread covered the others - sorry.

  1. Opening the window to 100 bp increases the error rate slightly to ~0.0015.
  2. No, it’s not possible. The better way would probably be to include reference sequences and then use guilt by association to call everything that clusters w/ S. aureus as relate to S. aureus.

Hope this helps…

So how does shhh.seqs compare to the 35/50 trimming in terms of error rates?

Thanks much.