How does make.contigs merging work?

While checking some Mothur output, I’ve come across two places where I don’t understand what Mothur is doing. The first is in make.contigs, the second is in pcr.seqs (that question is in a separate post).

When I look at a random sample of raw reads, I see the exact primer match at the start of R1 (forward) and R2 (reverse primer), as expected. I also see an exact match for the reverse complement of the reverse primer in R1, and an imperfect match for the reverse complement of the forward primer in R2. The imperfect forward primer has one mismatch.

I see that the imperfect forward primer appears in the output of make.contigs. So, why does Mothur, when merging, “capture” the imperfect forward primer when a perfect one exists?

I found this to be the case for about 50% of my random sample of reads. It leads me to believe I don’t quite understand how make.contigs works. Here are the parameters I run:

make.contigs(file=Undetermined.paired.files, processors=40, format=illumina1.8+, oligos=amr.oligos, bdiffs=0, pdiffs=0, checkorient=t, insert=25, trimoverlap=f, allfiles=0)

Can you post a forward and reverse sequence along with how they’re getting assembled and what you think is wrong with the output?


Thanks, Pat. Here’s the sequence info (this is the same example read I’ve used for my other post, linked above, on pcr.seqs). I bolded the primers, and I’ve un-bolded the mismatch on the forward primer in R2 (and the make.contigs output).

R1 sequence:


R2 sequence


merged sequence, after make.contigs


reverse complement of the fwd primer: CACTCTCTTTGTCACCATCTCAA

rev complement of the reverse primer: CTTTCCAATGTGCAGTACCAGTA

Some other relevant info I should have mentioned in the initial post:

1). I’m using Mothur v.1.44.3
2). This is not 16S data. This is highly-multiplexed amplicon sequencing data sequenced on Illumina MiSeq. Instead of 1 primer, we have ~800 primers. It’s designed for near-complete overlap on R1 and R2 (to minimize errors).

Do let me know if you have more questions or you need to see additional data. I have an Excel spreadsheet with all of this info for a random sample of 30 reads. Thanks again for your help!

The pcr.seqs command will scan the read from left to right and look for the first possible match of the forward primer. Then it will scan the read from right to left looking for the first possible match of the reverse primer. If both are found the read is trimmed, otherwise it’s scrapped.

The make.contigs command is more strict about where primers can be located in the read. The make.contigs command will look for the barcodes and primers ONLY at the ends of the read. If the primers are found, they are removed and the fragments are assembled. Mothur found the exact match of the primers and removed them. It then assembled the remaining read fragments. The imperfect matches were not considered because they are not located at the ends of the read.

Have you tried using trimoverlap=T ?
This should remove surplus sequence from both ends and probably also removes both your perfect and imperfect primer sequences. For the further analyses, you’ll want to remove primers anyway because they are not informative and might even cause misclassification (e.g., if you have ambiguous bases).

1 Like

Thanks for this suggestions, Andreas! We’re trying it out today. It seems like an elegant solution to our problem.

Be careful though, the behaviour of trimoverlap can be tricky, because it removes any overlap. That means 3’-overlap when you sequence inserts shorter than your readlength, which is good and removes primers and adapter sequences. But also 5’-overlaps that happen when your insert is longer than the readlength (e.g. with V3-V4 and 2x 250 bp), which will remove parts of your insert.

This can cause issues if your lab uses several regions and you just copy-paste your workflow.

1 Like

Thanks for the heads-up, Andreas. We just finished a trial run with trimoverlap=t and I looked at that data compared to when trimoverlap=f. For my random sample of 30 reads, trimoverlap=t option does exactly what we expect (and what we want). Our design specifies complete read-through, and our inserts are always shorter than our read length, so it makes sense that it is working.

I really appreciate your suggestion. I will continue to scrutinize the data to make sure it’s working as expected and keep an eye out for edge cases, but so far it’s looking great.

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.