High number of unique sequences when processing only R1 reads

I’m processing only MiSeq 2x300 R1 reads from samples sequenced targeting the 16S V1V3 region. (My attempt at using both R1 and R2: All groups removed after screen.seqs) We swapped the sequencing reads so that the read from V3 is R1 and the read from V1 is R2. Our genome center demultiplexes reads and removes barcodes before they send them, but primer sequences are not removed.

I’m using this as a guide for processing only the R1 reads:

I started with fastq.info for each sample:
mothur > fastq.info(fastq=MOF2W10B1C_S59_L001_R1_001.fastq)

I then did trim.seqs for each sample:
mothur > trim.seqs(fasta=MOF2W10B1C_S59_L001_R1_001.fasta, oligos=MOF2W10B1C_oligos.txt, qfile=MOF2W10B1C_S59_L001_R1_001.qual, pdiffs=2, maxambig=0, maxhomop=8, flip=T, qwindowaverage=25, qwindowsize=50, processors=24)
–> My oligos files reads: forward ATTACCGCGGCTGCTGG MOF2W10B1C

I then merged the many fastas using merge.files and made a groups file using make.group and looked at my dataset:
mothur > summary.seqs(fasta=MOSNR1.trim.fasta, processors=24)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 50 50 0 2 1
2.5%-tile: 1 81 81 0 3 333942
25%-tile: 1 204 204 0 4 3339414
Median: 1 243 243 0 4 6678827
75%-tile: 1 277 277 0 5 10018240
97.5%-tile: 1 284 284 0 6 13023711
Maximum: 1 286 286 0 8 13357652
Mean: 1 229.835 229.835 0 4.30143

of Seqs: 13357652

Output File Names:
/panfs/roc/scratch/youmansb_test/mothur351/MOSN_R1_B/MOSNR1.trim.summary

I then did screen.seqs:
mothur > screen.seqs(fasta=MOSNR1.trim.fasta, group=MOSNR1.groups, minlength=200, processors=24)
mothur > summary.seqs(fasta=MOSNR1.trim.good.fasta)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 200 200 0 2 1
2.5%-tile: 1 205 205 0 3 255578
25%-tile: 1 232 232 0 4 2555771
Median: 1 258 258 0 4 5111542
75%-tile: 1 284 284 0 5 7667313
97.5%-tile: 1 284 284 0 6 9967506
Maximum: 1 286 286 0 8 10223083
Mean: 1 254.783 254.783 0 4.33019

of Seqs: 10223083

Output File Names:
/panfs/roc/scratch/youmansb_test/mothur351/MOSN_R1_B/MOSNR1.trim.good.summary

Unique.seqs then count.seqs:
mothur > unique.seqs(fasta=MOSNR1.trim.good.fasta)
–> Last line from unique.seqs: 10223083 7120685

I get that using make.contigs on V1V3 data results in many spurious OTUs because of the minimal overlap and the un-quality-checked regions outside of the overlap region. I thought that using only the R1 reads and the quality parameters of trim.seqs would result in good quality reads and a “normal” ~10%ish unique sequences. Or least not 70% uniques…

What am I missing? Or is phylotype analysis the only thing I can do at this point?

Bonnie

I think the problem is this —> qwindowaverage=25, qwindowsize=50

The sequences come off with an average quality score of 20, so going up to 25 probably isn’t a big deal. When we were doing this with 454, we were using qwindowaverage=35, which is 10-times better than 25. I think if you increase the stringency, you’ll get fewer uniques, but you’ll also get much shorter sequences.

Pat