make.contigs - recognizing primer

I currently have forward (R1) and reverse (R2) fastq files that still contain the primer. However, a few nucleotides appear before the primer, for example:

CTATAGTGCCAGCCGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTA

The primer starts at the sixth base; the 5 bases before it (CTATA) are not part of the primer. The number of preceding nucleotides varies (sometimes there are more than 5, and sometimes there are none at all and the read begins right at the primer).

My question is, for Mothur’s make.contigs command, is Mothur expecting the reads to begin with the primer, or will it scan the sequence until it finds the primer, and then make the contig?

I have successfully assembled contigs using mothur [my command was “mothur> make.contigs(file=stability.file, oligos=file.oligos, bdiffs=1, pdiffs=2)”], and I am sure that my stability and oligos files are set up correctly. However, most of my reads are still getting thrown out into the scrap file. So I’m wondering whether the source of my troubles is the primer, and whether mothur is failing to identify it because of these extra nucleotides. If mothur expects the primer at the beginning of the read, how would I go about removing the extra nucleotides found before my primers?

So mothur does expect the sequences to start/end with the barcodes and then the primers. Do you know what protocol your sequence provider is using? I know people are playing around with varying length barcodes to break the phasing (this is actually unnecessary). I wonder if this is what is happening.

One way to figure out why things are getting scrapped is to look in the scrap.fasta file that comes out of make.contigs. Something like this should tell you what you want to know…

grep ">" ERR361063_1.scrap.contigs.fasta | cut -f 2 -d "|" | sort | uniq -c

This is counting the number of times different codes show up after the “|” in the fasta sequence names. b=barcode, f=forward primer, etc. These are described in the make.contigs wiki page.
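
For anyone without grep/cut at hand, the same tally can be sketched in Python. This is only a rough equivalent of the pipeline above, not anything mothur provides: the function name count_scrap_codes is made up for illustration, and it assumes the code is whatever follows the first "|" in each header line.

```python
from collections import Counter
from typing import Iterable


def count_scrap_codes(lines: Iterable[str]) -> Counter:
    """Tally the scrap codes that follow the first '|' in FASTA header lines.

    Assumption: headers look like '>readname|code'. Headers without a '|'
    are counted under the key '(no code)' rather than being dropped.
    """
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines, keep only headers
        fields = line.rstrip("\n").split("|")
        code = fields[1].strip() if len(fields) > 1 else "(no code)"
        counts[code] += 1
    return counts


if __name__ == "__main__":
    # File name taken from the grep example; substitute your own scrap file.
    with open("ERR361063_1.scrap.contigs.fasta") as handle:
        for code, n in count_scrap_codes(handle).most_common():
            print(n, code)
```

The output is one line per code with its count, most frequent first, which is the same information the sort | uniq -c step gives.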

Let us know what you find out.
Pat

Thank you Dr. Schloss.

It looks like the main problem codes are either “bf” or “f”.

I have a list of the barcodes that were used, and they are all the same size (12 bases). I have an index file for these barcodes, and I set up my stability file in the 4 column format to include the index file. I had an issue with the index file before, but I got help from westcott in a previous thread, so I am certain that I am setting everything up correctly. Still, the error code indicates that something may be wrong with the barcode, and I’m not sure what that could be.

I will ask our collaborators for their protocol to see if they used anything other than these barcodes and primers.

The strange thing is, not all of my sequences are getting thrown out. I cross-referenced the reads in the trim.contigs.fasta file with the reads in the raw R1.fastq file, and even though the R1 sequence had extra nucleotides before the primer, mothur was able to assemble those just fine.

Might you know of a reason why Mothur would assemble some of the reads with extra nucleotides before their primers, but not others?

If you use pdiffs=2 and bdiffs=1, mothur will allow up to 2 mismatches/gaps to the primers and 1 mismatch/gap to the barcode. That would explain why some sequences with slightly funky primers get through.
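
To make the tolerance idea concrete, here is a rough sketch of how a mismatch threshold behaves when extra bases push the primer away from the start of the read. It is an illustration only, not mothur's matching code: it counts substitutions and ignores the gaps that pdiffs also permits, and ASSUMED_PRIMER is a guess taken from the 19 bases after the 5-base prefix in the example read, since the exact primer is not spelled out in this thread.

```python
# Example read from the original post.
READ = "CTATAGTGCCAGCCGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTA"

# Assumption: the primer is taken to be bases 6-24 of the example read.
ASSUMED_PRIMER = "GTGCCAGCCGCCGCGGTAA"


def count_mismatches(primer: str, read: str) -> int:
    """Count substitutions between the primer and the start of the read."""
    window = read[: len(primer)]
    # Primer positions that run past the end of a short read count as mismatches.
    mismatches = len(primer) - len(window)
    mismatches += sum(1 for p, r in zip(primer, window) if p != r)
    return mismatches


def passes(primer: str, read: str, pdiffs: int = 2) -> bool:
    """True if the read start is within pdiffs substitutions of the primer."""
    return count_mismatches(primer, read) <= pdiffs


if __name__ == "__main__":
    print("primer at read start:", passes(ASSUMED_PRIMER, READ[5:]))
    print("5 extra bases first: ", passes(ASSUMED_PRIMER, READ))
```

Under these assumptions, a read that begins right at the primer passes, while the same read with 5 extra leading bases accumulates far more than 2 mismatches and would fail.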

If you have a mismatch to the barcode, it will automatically mismatch the primer.

Also, what fraction of all sequences are going to scrap?

Thank you for your help Dr. Schloss! I was able to speak with our sequencing provider, and they used linkers of various lengths before the primer. I was able to adjust the oligos file to include the extra nucleotides, and my output looks more like what we were expecting.
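
For anyone who lands here with the same problem and no protocol from the provider, one way to see which linkers are present is to tally whatever precedes the primer in the raw R1 reads. The sketch below is a hypothetical helper, not part of mothur: the names fastq_sequences and tally_prefixes are made up, the FASTQ file is assumed to use plain 4-line records, and the anchor is an exact, non-degenerate stretch of the primer (here the first 8 primer bases of the example read), so reads with mismatches in that stretch land in the "not found" bin.

```python
from collections import Counter
from typing import Iterable, Iterator


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record, assuming 4-line FASTQ records."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def tally_prefixes(reads: Iterable[str], anchor: str) -> Counter:
    """Count the bases found before the first exact occurrence of the anchor.

    Reads that do not contain the anchor are counted under the key None.
    """
    counts = Counter()
    for read in reads:
        pos = read.find(anchor)
        counts[read[:pos] if pos != -1 else None] += 1
    return counts


if __name__ == "__main__":
    # "R1.fastq" is a placeholder file name; the anchor is an assumption.
    with open("R1.fastq") as handle:
        prefixes = tally_prefixes(fastq_sequences(handle), "GTGCCAGC")
    for prefix, n in prefixes.most_common(20):
        print(n, repr(prefix))
```

The most common prefixes should correspond to the linkers, which can then be checked against the provider's protocol before being added to the oligos file.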