Pcr.seqs does not remove logical amount of sequences with different values of pdiffs/rdiffs

Hi!

I am using mothur v.1.48.0. I used pcr.seqs with this kind of command:
pcr.seqs(fasta=testset.trim.contigs.good.fasta, count=testset.contigs.good.count_table, oligos=primers.oligos, rdiffs=1, pdiffs=1, processors=4)
When I tried pdiffs=0 and rdiffs=0, I got less sequences just as I would expect. However, when I tried pdiffs=2 and rdiffs=2, I also got LESS sequences in each sample. I would have expected that I would get more sequences when allowing larger difference with the primer sequence. I wonder why this happens?

Below are also summary.seqs in each of the three cases:

DIFFS=0

mothur > pcr.seqs(fasta=testset.trim.contigs.good.fasta, count=testset.contigs.good.count_table, oligos=primers.oligos, rdiffs=0, pdiffs=0, processors=4)

Using 4 processors.
/******************************************/
Running command: remove.seqs(accnos=testset.trim.contigs.good.bad.accnos, count=testset.contigs.good.count_table)
Removed 126786 sequences from testset.contigs.good.count_table.

Output File Names:
testset.contigs.good.pick.count_table

/******************************************/
It took 31 secs to screen 1129940 sequences.

Output File Names:
testset.trim.contigs.good.pcr.fasta
testset.trim.contigs.good.bad.accnos
testset.trim.contigs.good.scrap.pcr.fasta
testset.contigs.good.pcr.count_table

mothur > summary.seqs(fasta=current)
Using testset.trim.contigs.good.pcr.fasta as input file for the fasta parameter.

Using 4 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 99 99 0 3 1
2.5%-tile: 1 252 252 0 3 25079
25%-tile: 1 252 252 0 4 250789
Median: 1 253 253 0 4 501578
75%-tile: 1 253 253 0 5 752366
97.5%-tile: 1 254 254 0 6 978076
Maximum: 1 549 549 0 8 1003154
Mean: 1 252 252 0 4

of Seqs: 1003154

It took 4 secs to summarize 1003154 sequences.

Output File Names:
testset.trim.contigs.good.pcr.summary

mothur > count.groups(count=current)
Using testset.contigs.good.pcr.count_table as input file for the count parameter.
NW003 contains 56728.
NW004 contains 113152.
NW006 contains 207578.
NW010 contains 198100.
NW012 contains 427596.

Size of smallest group: 56728.

Total seqs: 1003154.

DIFFS=1

mothur > pcr.seqs(fasta=testset.trim.contigs.good.fasta, count=testset.contigs.good.count_table, oligos=primers.oligos, rdiffs=1, pdiffs=1, processors=4)

Using 4 processors.
/******************************************/
Running command: remove.seqs(accnos=testset.trim.contigs.good.bad.accnos, count=testset.contigs.good.count_table)
Removed 11966 sequences from testset.contigs.good.count_table.

Output File Names:
testset.contigs.good.pick.count_table

/******************************************/
It took 141 secs to screen 1129940 sequences.

Output File Names:
testset.trim.contigs.good.pcr.fasta
testset.trim.contigs.good.bad.accnos
testset.trim.contigs.good.scrap.pcr.fasta
testset.contigs.good.pcr.count_table

mothur > summary.seqs(fasta=current)
Using testset.trim.contigs.good.pcr.fasta as input file for the fasta parameter.

Using 4 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 99 99 0 3 1
2.5%-tile: 1 252 252 0 3 27950
25%-tile: 1 252 252 0 4 279494
Median: 1 253 253 0 4 558988
75%-tile: 1 253 253 0 5 838481
97.5%-tile: 1 254 254 0 6 1090025
Maximum: 1 549 549 0 8 1117974
Mean: 1 252 252 0 4

of Seqs: 1117974

It took 5 secs to summarize 1117974 sequences.

Output File Names:
testset.trim.contigs.good.pcr.summary

mothur > count.groups(count=current)
Using testset.contigs.good.pcr.count_table as input file for the count parameter.
NW003 contains 62028.
NW004 contains 140055.
NW006 contains 227243.
NW010 contains 240212.
NW012 contains 448436.

Size of smallest group: 62028.

Total seqs: 1117974.

DIFFS=2

mothur > pcr.seqs(fasta=testset.trim.contigs.good.fasta, count=testset.contigs.good.count_table, oligos=primers.oligos, rdiffs=2, pdiffs=2, processors=4)

Using 4 processors.
/******************************************/
Running command: remove.seqs(accnos=testset.trim.contigs.good.bad.accnos, count=testset.contigs.good.count_table)
Removed 68181 sequences from testset.contigs.good.count_table.

Output File Names:
testset.contigs.good.pick.count_table

/******************************************/
It took 269 secs to screen 1129940 sequences.

Output File Names:
testset.trim.contigs.good.pcr.fasta
testset.trim.contigs.good.bad.accnos
testset.trim.contigs.good.scrap.pcr.fasta
testset.contigs.good.pcr.count_table

mothur > summary.seqs(fasta=current)
Using testset.trim.contigs.good.pcr.fasta as input file for the fasta parameter.

Using 4 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 99 99 0 3 1
2.5%-tile: 1 252 252 0 3 26544
25%-tile: 1 252 252 0 4 265440
Median: 1 253 253 0 4 530880
75%-tile: 1 253 253 0 5 796320
97.5%-tile: 1 254 254 0 6 1035216
Maximum: 1 549 549 0 8 1061759
Mean: 1 252 252 0 4

of Seqs: 1061759

It took 5 secs to summarize 1061759 sequences.

Output File Names:
testset.trim.contigs.good.pcr.summary

mothur > count.groups(count=current)
Using testset.contigs.good.pcr.count_table as input file for the count parameter.
NW003 contains 60390.
NW004 contains 121622.
NW006 contains 218406.
NW010 contains 214387.
NW012 contains 446954.

Size of smallest group: 60390.

Total seqs: 1061759.

I wonder if your index sequences are pretty similar to each other. If you do pdiffs=2 and it’s 2 nt different from two index sequences then it will toss the sequence. I think this could explain the ~5% drop in the number of reads from 1 to 2.

pat

Thank you!

This is my oligos-file:

primer GTGCCAGCMGCCGCGGTAA GGACTACHVGGGTWTCTAAT

I am not sure how the similarity should be counted, but with the ways I tried it, the minimum difference I got was 11 nt. But I am really not sure how to do the comparison correctly…

Best wishes
Mira

What about your barcodes? If you have a sequence that is cut with the higher number of differences, it might be easier to see what’s going on. It would be nice to see what the sequence name and the sequence itself are in the scrap.fasta file.

Pat

Here are the first lines of the scrap file:

M03602_308_000000000-J4BT9_1_1119_14201_9066|ft(frt)(frt) fpdiffs=7(noMatch) rpdiffs=7(noMatch) ee=0.7826
TATCTGTGCCAGCCGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGGCTTAGATACCCTAGTAGTCCAGATAGATC
M03602_308_000000000-J4BT9_1_1101_15561_1048|ft(frt)(frt) fpdiffs=7(noMatch) rpdiffs=7(noMatch) ee=0.956905
TCGCTCTCAGTGCCAGCAGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGGCTTAGAAACCCGAGTAGTCCAGATCGG
M03602_308_000000000-J4BT9_1_2113_18537_21339|ft(frt)(frt) fpdiffs=7(noMatch) rpdiffs=7(noMatch) ee=0.605776
TTCTGCAGTGCCAGCAGCCGCGGTGATACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGTGGTTGATCGCGTCGGAAGTGTAATCTTGGGGCTTAACCCTGAGCGTGCTTTCGATACGGGTTGACTTGAGGAAGGTAGGGGAGAATGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCAGTGGCGAAGGCGGTTCTCTGGGCCTTTCCTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGCTTAGAAACCCGGGTAGTCCTGAGAT

The samples come de-multiplexed from the sequencer so I haven’t been thinking of the barcodes at all… Your example data in SOP does not include primers. In our case when the primers are in the sequences, which procedure would you suggest in general: remove the sequences with mismatch with the primer or only cut the sequences to remove the primer parts?

I can also send you the files via e-mail if that helps and is okay?

Mira

Can you post sequences that were accepted at one level of pdiffs/rdiffs but not another?