Pcr.seqs does not remove logical amount of sequences with different values of pdiffs/rdiffs

Hi!

I am using mothur v.1.48.0. I used pcr.seqs with this kind of command:

pcr.seqs(fasta=testset.trim.contigs.good.fasta, count=testset.contigs.good.count_table, oligos=primers.oligos, rdiffs=1, pdiffs=1, processors=4)

When I tried pdiffs=0 and rdiffs=0, I got less sequences just as I would expect. However, when I tried pdiffs=2 and rdiffs=2, I also got LESS sequences in each sample. I would have expected that I would get more sequences when allowing larger difference with the primer sequence. I wonder why this happens?

Below are also summary.seqs in each of the three cases:

DIFFS=0

mothur > pcr.seqs(fasta=testset.trim.contigs.good.fasta, count=testset.contigs.good.count_table, oligos=primers.oligos, rdiffs=0, pdiffs=0, processors=4)

Using 4 processors.
/******************************************/
Running command: remove.seqs(accnos=testset.trim.contigs.good.bad.accnos, count=testset.contigs.good.count_table)
Removed 126786 sequences from testset.contigs.good.count_table.

Output File Names:
testset.contigs.good.pick.count_table

/******************************************/
It took 31 secs to screen 1129940 sequences.

Output File Names: 
testset.trim.contigs.good.pcr.fasta
testset.trim.contigs.good.bad.accnos
testset.trim.contigs.good.scrap.pcr.fasta
testset.contigs.good.pcr.count_table



mothur > summary.seqs(fasta=current)
Using testset.trim.contigs.good.pcr.fasta as input file for the fasta parameter.

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	99	99	0	3	1
2.5%-tile:	1	252	252	0	3	25079
25%-tile:	1	252	252	0	4	250789
Median: 	1	253	253	0	4	501578
75%-tile:	1	253	253	0	5	752366
97.5%-tile:	1	254	254	0	6	978076
Maximum:	1	549	549	0	8	1003154
Mean:	1	252	252	0	4
# of Seqs:	1003154

It took 4 secs to summarize 1003154 sequences.

Output File Names:
testset.trim.contigs.good.pcr.summary


mothur > count.groups(count=current)
Using testset.contigs.good.pcr.count_table as input file for the count parameter.
NW003 contains 56728.
NW004 contains 113152.
NW006 contains 207578.
NW010 contains 198100.
NW012 contains 427596.

Size of smallest group: 56728.

Total seqs: 1003154.

DIFFS=1

mothur > pcr.seqs(fasta=testset.trim.contigs.good.fasta, count=testset.contigs.good.count_table, oligos=primers.oligos, rdiffs=1, pdiffs=1, processors=4)

Using 4 processors.
/******************************************/
Running command: remove.seqs(accnos=testset.trim.contigs.good.bad.accnos, count=testset.contigs.good.count_table)
Removed 11966 sequences from testset.contigs.good.count_table.

Output File Names:
testset.contigs.good.pick.count_table

/******************************************/
It took 141 secs to screen 1129940 sequences.

Output File Names: 
testset.trim.contigs.good.pcr.fasta
testset.trim.contigs.good.bad.accnos
testset.trim.contigs.good.scrap.pcr.fasta
testset.contigs.good.pcr.count_table



mothur > summary.seqs(fasta=current)
Using testset.trim.contigs.good.pcr.fasta as input file for the fasta parameter.

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	99	99	0	3	1
2.5%-tile:	1	252	252	0	3	27950
25%-tile:	1	252	252	0	4	279494
Median: 	1	253	253	0	4	558988
75%-tile:	1	253	253	0	5	838481
97.5%-tile:	1	254	254	0	6	1090025
Maximum:	1	549	549	0	8	1117974
Mean:	1	252	252	0	4
# of Seqs:	1117974

It took 5 secs to summarize 1117974 sequences.

Output File Names:
testset.trim.contigs.good.pcr.summary


mothur > count.groups(count=current)
Using testset.contigs.good.pcr.count_table as input file for the count parameter.
NW003 contains 62028.
NW004 contains 140055.
NW006 contains 227243.
NW010 contains 240212.
NW012 contains 448436.

Size of smallest group: 62028.

Total seqs: 1117974.

DIFFS=2

mothur > pcr.seqs(fasta=testset.trim.contigs.good.fasta, count=testset.contigs.good.count_table, oligos=primers.oligos, rdiffs=2, pdiffs=2, processors=4)

Using 4 processors.
/******************************************/
Running command: remove.seqs(accnos=testset.trim.contigs.good.bad.accnos, count=testset.contigs.good.count_table)
Removed 68181 sequences from testset.contigs.good.count_table.

Output File Names:
testset.contigs.good.pick.count_table

/******************************************/
It took 269 secs to screen 1129940 sequences.

Output File Names: 
testset.trim.contigs.good.pcr.fasta
testset.trim.contigs.good.bad.accnos
testset.trim.contigs.good.scrap.pcr.fasta
testset.contigs.good.pcr.count_table



mothur > summary.seqs(fasta=current)
Using testset.trim.contigs.good.pcr.fasta as input file for the fasta parameter.

Using 4 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	99	99	0	3	1
2.5%-tile:	1	252	252	0	3	26544
25%-tile:	1	252	252	0	4	265440
Median: 	1	253	253	0	4	530880
75%-tile:	1	253	253	0	5	796320
97.5%-tile:	1	254	254	0	6	1035216
Maximum:	1	549	549	0	8	1061759
Mean:	1	252	252	0	4
# of Seqs:	1061759

It took 5 secs to summarize 1061759 sequences.

Output File Names:
testset.trim.contigs.good.pcr.summary


mothur > count.groups(count=current)
Using testset.contigs.good.pcr.count_table as input file for the count parameter.
NW003 contains 60390.
NW004 contains 121622.
NW006 contains 218406.
NW010 contains 214387.
NW012 contains 446954.

Size of smallest group: 60390.

Total seqs: 1061759.

I wonder if your index sequences are pretty similar to each other. If you do pdiffs=2 and it’s 2 nt different from two index sequences then it will toss the sequence. I think this could explain the ~5% drop in the number of reads from 1 to 2.

pat

Thank you!

This is my oligos-file:

primer GTGCCAGCMGCCGCGGTAA GGACTACHVGGGTWTCTAAT

I am not sure how the similarity should be counted, but with the ways I tried it, the minimum difference I got was 11 nt. But I am really not sure how to do the comparison correctly…

Best wishes
Mira

What about your barcodes? If you have a sequence that is cut with the higher number of differences, it might be easier to see what’s going on. It would be nice to see what the sequence name and the sequence itself are in the scrap.fasta file.

Pat

Here are the first lines of the scrap file:

>M03602_308_000000000-J4BT9_1_1119_14201_9066|ft(frt)(frt)	fpdiffs=7(noMatch) rpdiffs=7(noMatch) 		ee=0.7826	
TATCTGTGCCAGCCGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGGCTTAGATACCCTAGTAGTCCAGATAGATC
>M03602_308_000000000-J4BT9_1_1101_15561_1048|ft(frt)(frt)	fpdiffs=7(noMatch) rpdiffs=7(noMatch) 		ee=0.956905	
TCGCTCTCAGTGCCAGCAGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGGCTTAGAAACCCGAGTAGTCCAGATCGG
>M03602_308_000000000-J4BT9_1_2113_18537_21339|ft(frt)(frt)	fpdiffs=7(noMatch) rpdiffs=7(noMatch) 		ee=0.605776	
TTCTGCAGTGCCAGCAGCCGCGGTGATACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGTGGTTGATCGCGTCGGAAGTGTAATCTTGGGGCTTAACCCTGAGCGTGCTTTCGATACGGGTTGACTTGAGGAAGGTAGGGGAGAATGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCAGTGGCGAAGGCGGTTCTCTGGGCCTTTCCTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGCTTAGAAACCCGGGTAGTCCTGAGAT

The samples come de-multiplexed from the sequencer so I haven’t been thinking of the barcodes at all… Your example data in SOP does not include primers. In our case when the primers are in the sequences, which procedure would you suggest in general: remove the sequences with mismatch with the primer or only cut the sequences to remove the primer parts?

I can also send you the files via e-mail if that helps and is okay?

Mira

Can you post sequences that were accepted at one level of pdiffs/rdiffs but not another?

I come back to this question (sorry for the delay!)

Here are ten examples of sequences that went through with diffs=1 but did not with diffs=2.

[1] ">M03602_308_000000000-J4BT9_1_1101_10254_3040\tfpdiffs=1(match) rpdiffs=0(match) \t\tee=0.241188\t"                                                                                                                                                            
 [2] "CACGTAGGCACCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTCAGTAAGTCGGGTGTGAAAACTTTGGGCTTAACCCAAAGCGTGCATCCGATACTGCTGTGACTTGAGTACGGTAGGGGAGCGGGGAATTCCTAGTGTAGCGGTGAAATGCGCAGATATTAGGAGGAACACCGGTGGCGAAGGCGCCGCTCTGGGCCGAAACTGACGCTGAGGAGCGAAAGCATGGGTAGCAAACAGG"
 [3] ">M03602_308_000000000-J4BT9_1_1101_10588_1346\tfpdiffs=1(match) rpdiffs=1(match) \t\tee=0.0637563\t"                                                                                                                                                           
 [4] "TACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGTGGTTGATCGCGTCGGAAGTGTAATCTTGGGGCTTAACCCTGAGCGTGCTTTCGATACGGGTTGACTTGAGGAAGGTAGGGGAGAATGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCAGTGGCGAAGGCGGTTCTCTGGGCCTTTCCTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGG" 
 [5] ">M03602_308_000000000-J4BT9_1_1101_10935_2192\tfpdiffs=1(match) rpdiffs=0(match) \t\tee=0.797005\t"                                                                                                                                                            
 [6] "TACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGTGGTTGATCGCGTCGGAAGTGTAATCTTGGGGCTTAACCCTGAGCGTGCTTTCGATACGGGTTGACTTGAGGAAGGTAGGGGAGAATGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCAGTGGCGAAGGCGGTTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG" 
 [7] ">M03602_308_000000000-J4BT9_1_1101_11231_1591\tfpdiffs=1(match) rpdiffs=0(match) \t\tee=1.2665\t"                                                                                                                                                              
 [8] "CACGTAGGCACCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCACGTCGGCTGTGAAAACCCGAGGCTCAACCTCGGGCCTGCAGTCGATACGGGCAAACTAGAGTACTGCAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGG" 
 [9] ">M03602_308_000000000-J4BT9_1_1101_11313_1170\tfpdiffs=1(match) rpdiffs=1(match) \t\tee=1.21612\t"                                                                                                                                                             
[10] "TACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGG" 
[11] ">M03602_308_000000000-J4BT9_1_1101_11439_1940\tfpdiffs=1(match) rpdiffs=1(match) \t\tee=0.904284\t"                                                                                                                                                            
[12] "TACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGTGGTTGATCGCGTCGGAAGTGTAATCTTGGGGCTTAACCCTGAGCGTGCTTTCGATACGGGTTGACTTGAGGAAGGTAGGGGAGAATGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCAGTGGCGAAGGCGGTTCTCTGGGCCTTTCCTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGG" 
[13] ">M03602_308_000000000-J4BT9_1_1101_11749_1352\tfpdiffs=1(match) rpdiffs=0(match) \t\tee=0.172391\t"                                                                                                                                                            
[14] "TACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGTCGTGAAAGTCCGGGGCTTAACCCCGGATCTGCGGTGGGTACGGGCAGACTAGAGTGCAGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCTGTAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAACAGG" 
[15] ">M03602_308_000000000-J4BT9_1_1101_12205_1573\tfpdiffs=1(match) rpdiffs=1(match) \t\tee=0.917797\t"                                                                                                                                                            
[16] "TACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGG" 
[17] ">M03602_308_000000000-J4BT9_1_1101_12361_1979\tfpdiffs=1(match) rpdiffs=1(match) \t\tee=1.73277\t"                                                                                                                                                             
[18] "TACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGTGGTTGATCGCGTCGGAAGTGTAATCTTGGGGCTTAACCCTGAGCGTGCTTTCGATACGGGTTGACTTGAGGAAGGTAGGGGAGAATGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCAGTGGCGAAGGCGGTTCTCTGGGCCTTTCCTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGG" 
[19] ">M03602_308_000000000-J4BT9_1_1101_12854_1751\tfpdiffs=1(match) rpdiffs=0(match) \t\tee=0.00407551\t"                                                                                                                                                          
[20] "TACGGAGGATCCAAGCGTTATCCGGAATTACTGGGCGTAAAGAGTTGCGTAGGTGGCATAGTAAGTAGATAGTGAAAGCGTTCGGCTCAACCGGATAAACATTATCTAAACTGCTAAGCTAGAGGATGAGAGAGGTTATTGGAATTCCCTGTGTAGGAGTGAAATCCGTAGATATAGGGAGGAACACCGATGGCGTAGGCAGATAACTGGCTCATTCCTGACACTAAGGCACGAAAGCGTGGGTAGCAAACGGG"

Hi Mira - I don’t see your primer sequences in those sequences. Are you sure that these sequences should still have the primers on them or that you posted the correct primer sequences previously?

Pat

Sorry! I accidentally picked the sequences from the fasta-file after pcr.seqs so that’s why they did not include the primers. Here are ten sequences before pcr.seqs that later were discarded when diffs=2 but accepted when diffs=1.

>M03602_308_000000000-J4BT9_1_1101_15561_1048   ee=0.956905
TCGCTCTCAGTGCCAGCAGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGGCTTAGAAACCCGAGTAGTCCAGATCGG
>M03602_308_000000000-J4BT9_1_1101_14712_1079   ee=0.959142
TCTCTGTGCCAGCCGCCGCGGTGATACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGTGGTTGATCGCGTCGGAAGTGTAATCTTGGGGCTTAACCCTGAGCGTGCTTTCGATACGGGTTGACTTGAGGAAGGTAGGGGAGAATGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCAGTGGCGAAGGCGGTTCTCTGGGCCTTTCCTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGCTTAGATACCCTAGTAGTCCAGATAGATC
>M03602_308_000000000-J4BT9_1_1101_13469_1130   ee=0.0147969
AGCAATTGTGCCAGCAGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGGCTTAGAAACCCTTGTAGTCCTG
>M03602_308_000000000-J4BT9_1_1101_11313_1170   ee=1.21612
TCCGTCTGTGCCAGCCGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGGCTTAGATACCCGAGTAGTCCTGAGATCGG
>M03602_308_000000000-J4BT9_1_1101_20262_1302   ee=0.344478
TTCTGCAGTGCCAGCAGCCGCGGTGATACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGTGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGTAGCAAACAGGATTAGATACCCGAGTAGTCCTGAGAT
>M03602_308_000000000-J4BT9_1_1101_17832_1236   ee=0.0407841
TTAGCAATTGTGCCAGCCGCCGCGGTGATACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGTGGTTGATCGCGTCGGAAGTGTAATCTTGGGGCTTAACCCTGAGCGTGCTTTCGATACGGGTTGACTTGAGGAAGGTAGGGGAGAATGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCAGTGGCGAAGGCGGTTCTCTGGGCCTTTCCTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGCTTAGAAACCCCTGTAGTCCAG
>M03602_308_000000000-J4BT9_1_1101_11749_1352   ee=0.172391
TTAGCAATTGTGCCAGCCGCCGCGGTGATACGTAGGGTGCGAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGTCGTGAAAGTCCGGGGCTTAACCCCGGATCTGCGGTGGGTACGGGCAGACTAGAGTGCAGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCTGTAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAACAGGATTAGAAACCCTGGTAGTCCAG
>M03602_308_000000000-J4BT9_1_1101_9376_1264    ee=0.225715
TTAGCAATTGTGCCAGCAGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCTCGCTGGGCCACACCTGACGCTGAGACGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCCGGTAGTCCAG
>M03602_308_000000000-J4BT9_1_1101_17620_1266   ee=0.966998
TACAGCAGTGCCAGCCGCCGCGGTCATACGTAGGACCCAAGCGTTATCCGGAGTGACTGGGCGTAAAGAGTTGCGTAGGTGGTTTGTAAAGTGAATAGTGAAATCTGGTGGCTCAACCATACAGGCTATTATTCAAACTCACAAACTCGAGAATGGTAGAGGTAACTGGAATTTCTTGTGTAGGAGTGAAATCCGTAGATATAAGAAGGAACACCAATGGCGTAGGCAGGTTACTGGACCATTTCTGACACTGAGGCACGAAAGCGTGGGGAGCGAACCGGATTAGATACCCCGGTAGTCCACAG
>M03602_308_000000000-J4BT9_1_1101_19748_1286   ee=1.28202
GATCCAGTGCCAGCAGCCGCGGTGATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGCGGTCCGTCGCGTCAGGAGTGAAAACTCGGGGCTTAACCCCGAGCCTGCTTTTGATACGGGCGGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGGAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCGGGTAGTCCTGAGATC

Thanks Mira - I’m sorry about this, it is a bug on our end. As a work around, I’d suggest aligning your sequences to the silva reference alignment, filter the sequences with vertical=T, and then pick the coordinates that would remove the primers from the sequence.

Pat

Okay! Good to know that the problem is not not in my sequences. I will remove the primers as you suggested. Thank you for your help!

Mira

By the way, even though there was this problem with pcr.seqs command, is pcr.seqs still safe to use when customizing the reference database for my region (V4)?

yep, it should be fine

Hi again! I have questions regarding the work around. If I understood correctly, the work around means doing: 1. align.seqs, 2. filter.seqs(vertical=T) and 3. trim with pcr.seqs(start=, end=).

## In the previous runs with other datasets my summary looks similar to SOP (i.e. start is mostly 1968 and end 11550), but now it looks different. Sequences mainly start at position 1 and end 13434. Is this okay?:

mothur > summary.seqs(fasta=testset.trim.contigs.good.unique.align, count=current, processors=16)

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 0 0 0 0 1 1
2.5%-tile: 1 13424 291 0 4 621604
25%-tile: 1 13424 292 0 4 6216031
Median: 1 13424 292 0 4 12432061
75%-tile: 1 13424 292 0 5 18648091
97.5%-tile: 1 13424 293 0 6 24242518
Maximum: 13424 13424 471 0 8 24864121
Mean: 6 13420 291 0 4
of unique seqs: 13055807
total # of seqs: 24864121

## If I do filter.seqs directly after align, my summary.seqs looks pretty weird:

mothur > filter.seqs(fasta=testset.trim.contigs.good.unique.align, vertical=T, trump=., processors=16)

Length of filtered alignment: 0
Number of columns removed: 13424
Length of the original alignment: 13424
Number of sequences used to construct filter: 13055807

mothur > summary.seqs(fasta=current)

Using 16 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 0 0 0 1 1
2.5%-tile: 1 0 0 0 1 326396
25%-tile: 1 0 0 0 1 3263952
Median: 1 0 0 0 1 6527904
75%-tile: 1 0 0 0 1 9791856
97.5%-tile: 1 0 0 0 1 12729412
Maximum: 1 0 0 0 1 13055807
Mean: 1 0 0 0 1
of Seqs: 13055807

## I also tried screen.seqs before filter as in SOP. Below are commands and summary.seqs in this case. If this is the correct order and the summary.seqs after align was okay, which starting and ending positions should I use in screen?

mothur > screen.seqs(fasta=current, count=current, start=1968, end=11550, processors=16)
Using testset.trim.contigs.good.count_table as input file for the count parameter.
Using testset.trim.contigs.good.unique.align as input file for the fasta parameter.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 11550 240 0 3 1
2.5%-tile: 1 13424 291 0 4 620840
25%-tile: 1 13424 292 0 4 6208395
Median: 1 13424 292 0 4 12416789
75%-tile: 1 13424 292 0 5 18625183
97.5%-tile: 1 13424 293 0 6 24212738
Maximum: 1968 13424 471 0 8 24833577
Mean: 1 13423 291 0 4
of unique seqs: 13036739
total of seqs: 24833577

## I suppose hat after these steps, I should use pcr.seqs with start=1968 and end=11550 to pick the V4 region without primers?

Hi - I think 1968 and 13424 should work. You might try decreasing the start position to smaller values if the resulting sequences are short.

Pat