Per Earth Microbiome website, using the 1391f and EukB primers should give me amplicons around 260 bp (+/-50), but my contig assemblies are basically all around 420 bp…can I be fairly certain there was sequencing error, and can I still work with my sequence data in some way?
Perhaps it’s including the primers and barcodes in the sequence data? Throw in the +/- 50 bp and I could see you getting something longer than 260. Having not done these types of PCRs in the past, I’m not sure how long it would be. You might grab your favorite 18S rRNA gene sequence and do the PCR and sequencing “in silico” on your computer to confirm the expected lengths.
I just checked the last time I used those primers and which lengths the amplicons had and came up with lengths between 100 bp (my lower mothur cut-off) and 130 bp. Those sequences were fine and classification made sense too (confirmed by BLASTing selected sequences too). So, no clue where the EMB people got the 260 bp from (typo?).
Anyway, 420 bp is far off the mark I think. My suggestion would be to take a few sequences from your FAST file and BLAST them. How do they look? Does the whole sequence match the V9 region of one database sequence or does it look as if the two R1 & R2 MiSeq sequences did not create a proper contig? Are they even 18S? (I had such a case which I was never able to explain but luckily also never able to replicate.) If your 420 bp ones are artefacts (as I suspect) are there maybe some sequences in your dataset with the right lengths that are “drowned” in the mass of wrong ones? You could run screen.seq with min/max length values of 100 & 150 (those would be around my settings for these primers) to get rid of the large stuff and see what is left.
After you figured out what went wrong with your sequences then you can decide if you can trust them enough to somehow use them. But to be honest, I don’t think it looks to promising.
Thank you for the replies!
I am getting alignments about 130 bp long, and error rates below 0.2% on my mock community, but I’m concerned because I expect alignments around 170 bp.
I compared one sequence after make.contigs and [that same sequence] after align.seqs, to see where my primers were landing - see below. I found a sequence (blue) similar to the reverse primer sequence a little bit upstream from the actual true primer (green). If you take the sequence all the way out to the end of the green-highlighted portion, it comes to about 170 bp. Could this be why I am getting truncated (130 bp sequences) alignments? I still have to look into this in more depth on a reference alignment viewer.
But, Rene, you’re saying that even if it is only keeping a 130-bp part of my sequences, the results would still be trustworthy?
Thanks so much!
your mock community might be very helpful in figuring out what is going on. The 130 bp length of your sequences fit to the length of my sequences when using the same primers, but why do you expect 170 bp? Is it based on previous sequencing of the same mock community or based on in silico data? I would suggest to take a 130 bp sequence from your mock community out of the mothur FASTA file and compare it to the 170 bp one you expect, then you should see what is different/missing.
Secondly, I am a little bit confused about what you wrote regarding the comparison of the (big) contig and the shorter sequence below. Is this really the exact same sequence (same FASTA ID)? Because the way I understood the mothur commands and the SOP, sequences should not be truncated during the pipeline but instead removed from the analysis when they did not follow certain criteria, e.g. ambiguities, lengths etc. (Please Pat or somebody in the know comment or correct me in this). Also, did the long sequence come from the file “YourName.trim.contigs.fasta” or “YourName.trim.contigs.good.fasta”?
How come that you find the primer sites like this within your contigs? They should be at the beginning of the sequence and not after 100+ bp, at least they are like this in my “good” contigs which also have the proper lengths. Unfortunately, I can’t copy & paste the sequences from your screen shot to have a closer look at them (and I am not typing them in manually to satisfy my curiosity ).
And to your final question, in my opinion it depends how the 130 bp are “created”. If they are the result of a proper PCR reaction and sequencing, then yes. You are only looking at part of the whole gene anyway, so 130 bp will give you real information if the sequence is real. (Or as much as 130 bp can give you, that’s why I normally prefer the 450 bp amplicons of my V4 primers.) You just need to be certain that these are real 18S sequences and not some kind of artefact, but your mock community looks good to me in this aspect.
All the best,
I expect 170 bp for several reasons: a) approximate sequence length for S. cerevisiae using these primers that was published in a paper cited on the EMB website b) I made a reference alignment using database sequences for the organisms in my mock community, aligned using MEGA and trimmed to primers, and they were 174 bp without gaps, and c) that is the length of sequences I sampled from my ‘stability.trim.contigs.fasta’ file, once I trimmed them down to the primers.
I forgot to label the sequences above, my bad! The longer sequence comes from ‘stability.trim.contigs.fasta.’ However, the screen.seqs max length I used was large enough that it didn’t screen much out - essentially ALL of my contigs were 420-450, so I don’t think using “YourName.trim.contigs.good.fasta” would have made much difference. The shorter sequence is from 'stability.trim.contigs.good.unique.align.
I did this comparison for several sets of sequences and I think the stuff before my forward primer is the adapter, linker etc from the sequencing process. So my contigs are larger because of that. When I go through the MiSeq SOP, the sequences are getting cut off too soon, as you can see in the short sequence from my example above.
I’ve taken several sequences, both from my mock reference alignment (the one I made on MEGA) and from my ‘stability.trim.contigs.fasta’ file (trimmed to primers), and aligned them 2 ways, (1) using mothur and reference file ‘silva.nr_v132.align’ (2) using https://www.arb-silva.de/aligner/ and viewing on Wasabi (Thanks for that Rene, by the way!).
For one example sequence:
(1) Run on mothur: Start= 41,791 End= 43,116 NBases= 131
(2) Run online: Start= 41,790 End= 43,275 Counted the #bases in alignment = 174
When I use the values from (2) for the pcr.seqs step in mothur, my alignments still come to 100-130 bp. Pat, any advice for me on that front?
Or maybe it doesn’t matter that much. My main concern is what I would be losing using truncated sequences, but you’re right Rene, and I’m not losing a ton of sequence length anyway.
Sorry for all the posts lately - I really appreciate all the help!
if I understand you correctly then you made your contigs from sequences that still had adapter, linker, etc. still in it? My colleague who handles our MiSeq facilities routinely gets rid of those when processing the run and only delivers us the clean fastq files for the R1 & R2 reads, that’s why I was confused by your long sequence. I would suggest that you also trim your sequences and get rid of that stuff before you start with mothur and make the contigs.
Admittingly, I don’t think that this has an impact on the shorter length of your sequences, so still confused by that
The example sequence you mentioned (mothur versus online), what is the difference, which bases are missing from the shorter (mothur) sequence? The End value for mothur is 160 positions shorter than for “online”, maybe bps got lost there/cut off? Do you see the primer sequences in the 131 bp and 174 bp sequences?
Yes they appear to still have that stuff on them. I will look into trimming that out before analysis.
The bases are getting cut off from the end. Here is the aligned sequence I get from mothur (black) with the missing bases in red and the primer underlined.
I’m just stumped about why they’re coming out short in mothur, and how to correct it.