Sequence Length

Hi everyone,

I have sequenced the v4 region of the 16S rRNA gene on a MiSeq with a 500 cycle kit (2 x 250 bp reads). Can anyone tell me why the majority of my read lengths are around 290 bases when I run summary.seqs? How is this possible if the reads are literally only 250 bp long?

Did you read through this thread? https://www.mothur.org/forum/viewtopic.php?f=1&t=4202&start=20

Thank you. I just read through the whole thread and it was helpful however I am still confused. We use the Caporaso primers for the V4 region (515F & 806R). I follow the Illumina protocol “16S Metagenomic Sequencing Library Preparation” which utilizes a two-step PCR approach. I then follow the Mothur MiSeq SOP for my analysis. Should I be clipping adapters and primers before I begin the Mothur pipeline?

The Caporaso primers already have the adapters, why are you using the illumina protocol to add the adapters?

Do you mean that you use 515F and 806R but not the rest of the Caporaso primers? Are you using custom sequencing primers?

Exactly. Specifically we use the following in the first PCR:

TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG [Illumina forward overhang] GTGCCAGCMGCCGCGGTAA [515F Caporaso et al]

GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG [Illumina reverse overhang] GGACTACHVGGGTWTCTAAT [806R Caporaso et al]

And in the 2nd PCR we add Illumina barcodes using the Nextera XT index 1 and 2 primers.

ok in this case you do need to trim the primers. Your length makes sense because you are sequencing the primers and part of the adapters.

Hi everyone ,

I am using mothur for my illumina Miseq data (v3 kit 300bp paired-end read/ 341F-785R/ V3-V4 regions) when I run summary.seqs after make.contigs I had this :
the minimum was 297bp
the max was 543bp
the mean was 427bp

is it normal?! (it is supposed to be 300 pb?!!!) I am confused how can figure out the min length and max length in screen.seqs command.

thank you !

You have sequenced a region of the 16s gene that spans from base number 341 to base 785 (from your primer names). If you take the difference of these sites you get 444bp which is around the length of the fragment you are getting so it all makes perfect sense.

The v3 kit from Illumina generates 600 bp of Paired End reads by generating a 300bp forward read and a 300bp reverse read. If your target region is larger than 300bp then these forward and reverse reads will not fully overlap and therefore your contig will be longer than 300bp.

This is a relevant read.

In terms of picking min/max lengths for the screen.seqs command there is no correct answer. Variation in the 16S gene means that its not unreasonable that sequences larger than your mean length are legitimate, but on average the longer sequences are usually trash. Pick a value maybe 10% greater/less than the mean as a starting point and compare to something like a 20% greater/less than the mean and see how that affects your results. I found once I was trimming too stringently and losing some archaeal sequences.

thank you campenr for the clarification, not confused anymore :slight_smile:

Hello, I have some followup questions on this topic. I similarly have V4 .fastq data from a 2x250 MiSeq run and it seems I too need to remove the primers (my reads were also assembling to ~290). I have 28 .fastq files for 14 samples and am following the MiSeq SOP. My oligos.txt file looks like this currently:

primer GTGYCAGCMGCCGCGGTAA GGACTACNVGGGTWTCTAAT
# BARCODE none none JG01
# BARCODE none none JG02
…etc.

make.contigs(file=stability.files, oligos=oligos.txt, processors=4)

Now reads assemble to the expected 253 bp length instead of 292, but I still have a few questions:

  1. Do I need to account for degeneracy (if that’s the correct term) in these primers by listing all the possible permutations in my oligos file? (e.g this post)

  2. make.contigs() throws a warning: * [WARNING]: your oligos file does not contain any group names. mothur will not create a groupfile. *. How do I work around this if barcodes are already removed and each sample is split into forward and reverse fastq files? Without a groupfile, do all the sequences end up in the same group, even if stability.files has the sample IDs?

  3. Could leaving the primers in previously be a cause for why my batch script was crashing during cluster.split() - perhaps the primers caused excessive unique sequences (per the blog post here)?

Appreciate any suggestions.

Would you mind starting a new threa to answer this question? The current thread is a few years old.

Not at all, wasn’t sure which would be preferred