Hi all,
I have been using the ‘Costello Stool Analysis’ pipeline and applying it to my samples one by one (if there is an easier way to do this, I would love to know). However, when I run the trim.seqs command below, I lose a lot of my sequences:
mothur > trim.seqs(fasta=stool.fasta, oligos=stool.oligos, qfile=stool.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50, processors=2)
For example, if I had 3000 sequences in the untrimmed fasta file, I would generally have around half of that (1500 sequences) in the trimmed file. Just about all of the scrapped sequences are flagged ‘q’ (qaverage), i.e. a poor average quality score. As expected, when I reduced ‘qwindowaverage’ in the trim.seqs command from 35 to 25, just about all of the sequences were kept (only ~10% removed, compared with ~50% initially). It's not just that more sequences are kept: the average number of bases per sequence also increases substantially (e.g. ‘qwindowaverage=35’ gives an average of ~80 bases and ‘qwindowaverage=25’ gives ~250).
Below is the log file for one sample as an example:
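In case it helps frame the question, here is a minimal Python sketch of how I understand a sliding-window quality trim to behave. This is an assumption about the general technique, not mothur's actual code: the function name trim_length and the rule "keep everything up to the end of the last window whose mean is at or above the threshold" are purely illustrative.

```python
def trim_length(quals, window_size=50, min_average=35.0):
    """Return how many leading bases survive a sliding-window quality trim.

    Illustrative assumption only: slide a window one base at a time and
    stop at the first window whose mean quality drops below min_average.
    The read is kept up to the end of the last window that passed.
    Reads shorter than one window are discarded (length 0).
    """
    n = len(quals)
    if n < window_size:
        return 0
    keep = 0
    for start in range(n - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_average:
            break
        keep = start + window_size
    return keep


# A made-up read: 100 good bases (Q40) followed by 100 poor bases (Q10).
example_read = [40] * 100 + [10] * 100
strict = trim_length(example_read, window_size=50, min_average=35)
relaxed = trim_length(example_read, window_size=50, min_average=25)
print(strict, relaxed)
```

On this made-up read the stricter threshold cuts the read earlier than the relaxed one, which is the same direction of effect I see in my data.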
mothur > summary.seqs(fasta=current)
Using PGWM2SUMP.raw.trim.fasta as input file for the fasta parameter. -trimmed file(qwindowaverage=35)----------
Start End NBases Ambigs Polymer
Minimum: 1 50 50 0 3
2.5%-tile: 1 57 57 0 4
25%-tile: 1 72 72 0 4
Median: 1 88 88 0 4
75%-tile: 1 210 210 0 4
97.5%-tile: 1 281 281 0 5
Maximum: 1 414 414 0 6
# of Seqs: 1087
Output File Name:
PGWM2SUMP.raw.trim.fasta.summary
mothur > summary.seqs(fasta=PGWM2SUMP.raw.fasta) -Original File--------
Start End NBases Ambigs Polymer
Minimum: 1 225 225 0 3
2.5%-tile: 1 538 538 0 4
25%-tile: 1 549 549 0 4
Median: 1 556 556 0 5
75%-tile: 1 566 566 2 5
97.5%-tile: 1 667 667 7 7
Maximum: 1 1080 1080 35 31
# of Seqs: 2757
Output File Name:
PGWM2SUMP.raw.fasta.summary
Using PGWM2SUMP.raw.trim.fasta as input file for the fasta parameter. -trimmed file(qwindowaverage=25)------------
Start End NBases Ambigs Polymer
Minimum: 1 50 50 0 3
2.5%-tile: 1 69 69 0 4
25%-tile: 1 92 92 0 4
Median: 1 251 251 0 4
75%-tile: 1 380 380 0 5
97.5%-tile: 1 437 437 0 6
Maximum: 1 524 524 0 7
# of Seqs: 2406
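To put numbers on the three summaries above, here is a quick Python check. It is just arithmetic on the sequence counts reported in the logs; the helper name retained_percent is my own.

```python
def retained_percent(kept, total):
    """Percentage of the original sequences that survive trimming."""
    return 100.0 * kept / total


original = 2757      # sequences in PGWM2SUMP.raw.fasta
kept_q35 = 1087      # after trim.seqs with qwindowaverage=35
kept_q25 = 2406      # after trim.seqs with qwindowaverage=25

print(f"qwindowaverage=35: {retained_percent(kept_q35, original):.1f}% kept")
print(f"qwindowaverage=25: {retained_percent(kept_q25, original):.1f}% kept")
```

For this sample that works out to roughly 39% of sequences kept at 35 and roughly 87% kept at 25.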
[b]What I would like advice on is: should I stick with a 'qwindowaverage' of 35 and accept far fewer (and shorter) sequences in my analysis, or lower the 'qwindowaverage' to increase the number and length of the sequences, at the cost of letting poorer-quality sequences into the analysis?[/b]
I'm relatively new to mothur, but having invested the last few days (with little sleep) in analysing the sequences from the 18 samples that I recently acquired, I feel I know my way around the program pretty well.
Thanks in advance for any help,
Chris