make.contigs on v3 600-cycle kit

Someone has handed me data from a v3 600-cycle kit. When I run my normal batch (make.contigs with all defaults, then screen.seqs with maxambig=0, maxlength=370), all sequences get tossed. After make.contigs all the seqs are ~350 bp. This group only has V4 Caporaso primers in their lab, so these have to be V4 sequences. Do you guys have any suggestions for getting any workable data out of v3 chemistry?

Oops, typo: my normal batch uses maxlength=270.

I raised my maxlength to 370 (knowing that this lets more garbage through).

Summary after make.contigs, screen.seqs, unique.seqs

mothur > summary.seqs(fasta=current, name=current)
Using eoea.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using eoea.trim.contigs.good.names as input file for the name parameter.

Using 16 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       301     301     0       3       1
2.5%-tile:      1       348     348     0       3       107339
25%-tile:       1       349     349     0       4       1073384
Median:         1       349     349     0       4       2146768
75%-tile:       1       350     350     0       5       3220152
97.5%-tile:     1       350     350     0       6       4186197
Maximum:        1       370     370     0       49      4293535
Mean:   1       349.414 349.414 0       4.35216
# of unique seqs:       4272439
total # of seqs:        4293535

After align.seqs against a SILVA alignment trimmed to the V4 region

mothur > summary.seqs(fasta=current, count=current)
Using eoea.trim.contigs.good.count_table as input file for the count parameter.
Using eoea.trim.contigs.good.unique.align as input file for the fasta parameter.

Using 16 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        -1      -1      0       0       1       1
2.5%-tile:      1       13425   292     0       3       107339
25%-tile:       1       13425   292     0       4       1073384
Median:         1       13425   293     0       4       2146768
75%-tile:       1       13425   293     0       4       3220152
97.5%-tile:     1       13425   294     0       6       4186197
Maximum:        13425   13425   323     0       19      4293535
Mean:   110.955 13360.4 288.765 0       4.07184
# of unique seqs:       4272439
total # of seqs:        4293535

After screen.seqs, filter.seqs, and pre.cluster (diffs=2)

mothur > summary.seqs(fasta=current, count=current)
Using eoea.trim.contigs.good.unique.good.filter.precluster.count_table as input file for the count parameter.
Using eoea.trim.contigs.good.unique.good.filter.precluster.fasta as input file for the fasta parameter.

Using 16 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       725     253     0       3       1
2.5%-tile:      1       817     292     0       3       105872
25%-tile:       1       817     292     0       4       1058711
Median:         1       817     293     0       4       2117422
75%-tile:       1       817     293     0       4       3176133
97.5%-tile:     1       817     294     0       6       4128972
Maximum:        65      817     323     0       8       4234843
Mean:   1.00098 816.999 292.62  0       4.10739
# of unique seqs:       1784585
total # of seqs:        4234843

I think this is still way too many "uniques" for human samples.

You’ll want to use trimoverlap=T in make.contigs and then you should be able to proceed as usual. I forget how often I’m repeating myself, but when the sequencer goes beyond the ends of the fragments the error rates go up significantly. This is on top of the usual craptitude of the V3 chemistry.
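To make the read-through point concrete, here is a rough back-of-the-envelope sketch in Python (not mothur code). It assumes 2x300 nt reads from the 600-cycle kit and a ~253 nt fragment, which is the median length reported in this thread once trimoverlap=T is used. The function name and the simple geometric model are my own illustration; real contigs will differ by a few bases because of alignment gaps.

```python
def contig_lengths(read_len: int, fragment_len: int) -> dict:
    """Approximate paired-end contig geometry (illustrative model only).

    read_len     -- length of each read (assumed equal for both reads)
    fragment_len -- length of the sequenced fragment between the primers
    """
    if 2 * read_len < fragment_len:
        raise ValueError("reads are too short to overlap")

    # Bases each read sequences past the far end of the fragment.
    overhang = max(0, read_len - fragment_len)

    # Region covered by both reads; it cannot exceed the fragment itself.
    overlap = min(fragment_len, 2 * read_len - fragment_len)

    return {
        "overhang_per_read": overhang,
        "overlap": overlap,
        # Untrimmed contig: the fragment plus the overhang on each side.
        "untrimmed": fragment_len + 2 * overhang,
        # Keeping only the overlapping section drops both overhangs.
        "trimmed_to_overlap": overlap,
    }


# Assumed values: 2x300 reads, ~253 nt V4 fragment.
v3_kit = contig_lengths(read_len=300, fragment_len=253)
print(v3_kit)
```

Under that model the untrimmed contig comes out around 347 nt, close to the ~349 nt seen after make.contigs, and about 47 nt at the end of each read lies beyond the fragment. Those overhanging bases are the ones that trimoverlap=T removes.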

Ah, that's more like it. After aligning/filtering/clustering:

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       552     221     0       3       1
2.5%-tile:      22      552     252     0       3       116830
25%-tile:       22      552     252     0       3       1168293
Median:         22      552     253     0       4       2336586
75%-tile:       22      552     253     0       4       3504879
97.5%-tile:     22      552     254     0       6       4556342
Maximum:        22      575     277     0       8       4673171
Mean:   21.9998 552.001 252.608 0       3.93159
# of unique seqs:       95508
total # of seqs:        4673171
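
For a quick comparison of the two runs, the unique and total counts from the summary.seqs outputs above can be turned into fractions. This is just arithmetic on the numbers posted in this thread; the labels and the helper name are mine.

```python
def unique_fraction(unique: int, total: int) -> float:
    """Fraction of reads that are still distinct sequences."""
    return unique / total


# (unique seqs, total seqs) copied from the summary.seqs outputs in this thread.
runs = {
    "default make.contigs": (1784585, 4234843),
    "trimoverlap=T": (95508, 4673171),
}

for label, (unique, total) in runs.items():
    pct = 100 * unique_fraction(unique, total)
    print(f"{label}: {unique} uniques out of {total} reads ({pct:.1f}%)")

# How many times fewer uniques the trimoverlap=T run has.
fold_reduction = runs["default make.contigs"][0] / runs["trimoverlap=T"][0]
print(f"fold reduction in uniques: {fold_reduction:.1f}")
```

So roughly 42% of the reads were still unique with default make.contigs, versus about 2% with trimoverlap=T, even though the second run kept more reads overall.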