Where to now? V4 region 16S rRNA (v3 kit, 600 cycles)

Hi there :slight_smile:

I’ve read quite a few of the posts on the mothur forum and realised all the problems I have and will be experiencing when I analyze my data. This post may seem like a rant, and I do apologize, but I am in desperate need of guidance.:cherry_blossom::cherry_blossom::cherry_blossom:

First things first: I am an MSc Medical Microbiology student with no bioinformatics experience (besides what I’ve tried doing by myself and a single course I attended, which only helps when you have ideal data to work with), and only one other microbiome project has been conducted within our division thus far. I am working with nasopharyngeal aspirate samples, and we targeted the V4 region of the 16S rRNA gene. Sequencing was done on the Illumina MiSeq platform using the v3 kit (600 cycles) (eeeek :sweat:). We did the sequencing in collaboration with another university and therefore followed their wet-lab preparation.

With regard to the sequencing run: it did not complete because of load shedding (welcome to South Africa :)). The run ended at about 590 cycles (hence the shorter reverse reads), and the fastq files had to be generated manually (according to Illumina's guidelines). I therefore have no information about the sequencing run itself other than the manually generated fastq files.

I checked the quality of the forward and reverse reads using FastQC. The forward reads are fully sequenced (300 bp) and the quality is >30 up to about position 175. The reverse reads are about 275 bp long and the quality is, well, bad (>20 only up to position 75 in some samples). Please see an example of the quality for one of my samples in the HTML files below.


I doubt the reverse reads are usable at all, so I think my analysis would be based on the forward reads only, which has problems of its own from what I understand? Why would the reverse reads be that bad? Also, I don’t think redoing the sequencing is an option, as I have to submit my thesis this year, so I basically have to work with what I have, which is okay since this is a pilot study. Hopefully whoever tackles this project in our department in the future will have better luck once I have discussed all the pitfalls.

(Please note, I followed the SOP just to see what the output would be, without adding additional commands at this point.) I have tried running the make.contigs command on a few samples and a mock control using the MiSeq SOP, and tried assessing the error rate with my mock control (a self-made mock control using ATCC strains from our lab). The error rate was calculated to be 0.0139, with 132 OTUs (I only have 5 organisms in my mock). But then again, I didn’t change anything in the SOP, so that may have contributed to so many OTUs?
For the mock control, I BLASTed my primers (NCBI) against the 5 known organisms in my mock, then selected a hit for each organism, copied the GenBank sequence region, and saved it. I then created a single fasta file in BioEdit containing each organism's sequence and used those sequences as the reference for my mock. Is that the right approach?

Does anyone have any feedback, please, or suggestions for what I can do based on the basic information provided? Anything at this point would be great, because I am not sure what I am doing.

Thanking you in advance :cherry_blossom:

Try make.contigs on the whole dataset, then remove all sequences that have ambiguous bases. How many sequences are you left with?
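If it helps to see what that count amounts to, here is a minimal Python sketch that tallies how many contigs in a FASTA file contain only A, C, G and T. Within mothur the MiSeq SOP does this filtering with screen.seqs and maxambig=0; the function names and the toy sequences below are made up for illustration and are not part of mothur.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_unambiguous(handle):
    """Return (kept, total): contigs with only A/C/G/T versus all contigs."""
    kept = total = 0
    for _, seq in read_fasta(handle):
        total += 1
        if set(seq.upper()) <= set("ACGT"):
            kept += 1
    return kept, total


# Toy input standing in for a contigs FASTA file (hypothetical sequences).
example = StringIO(">seq1\nACGTACGT\n>seq2\nACGNACGT\n>seq3\nacgt\nACGT\n")
kept, total = count_unambiguous(example)
print(f"{kept} of {total} contigs have no ambiguous bases")
```

To run it on a real file, pass an opened file object (for example `open("my.contigs.fasta")`, a placeholder name) instead of the `StringIO` example.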

Thank you for the feedback, I will try doing that this week just to see what the outcome will be. Would you recommend trimming the sequences to the same length (since the reverse is only 275bp) and then making the contigs?

Kind regards
Bianca Hamman
MSc Student (Medical microbiology)
Stellenbosch University, Tygerberg campus

Nope, no need to trim; just make the contigs. The tails of each read will be low quality and will be corrected by the opposite read, as long as the opposite base is of decent quality.
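To illustrate the idea of one read correcting the other, here is a simplified Python sketch of a quality-aware consensus over an overlap that has already been aligned. It is not mothur's implementation: the function names, the `min_delta` threshold of 6 and the toy reads are assumptions chosen for illustration. Where the two reads disagree, the base with the clearly higher quality score wins; if neither is clearly better, the position becomes an N.

```python
def consensus_base(fwd_base, fwd_q, rev_base, rev_q, min_delta=6):
    """Pick a consensus base for one aligned position of the overlap."""
    if fwd_base == rev_base:
        return fwd_base
    # Reads disagree: trust the higher-quality base only if the gap is large.
    if abs(fwd_q - rev_q) >= min_delta:
        return fwd_base if fwd_q > rev_q else rev_base
    return "N"


def merge_overlap(fwd, fwd_quals, rev, rev_quals, min_delta=6):
    """Merge two aligned, equal-length reads (reverse read already
    reverse-complemented) into one consensus string."""
    if not (len(fwd) == len(fwd_quals) == len(rev) == len(rev_quals)):
        raise ValueError("reads and quality lists must have equal length")
    return "".join(
        consensus_base(fb, fq, rb, rq, min_delta)
        for fb, fq, rb, rq in zip(fwd, fwd_quals, rev, rev_quals)
    )


# Toy overlap: the forward read's tail is poor, the reverse read is good there.
fwd, fwd_quals = "ACGTT", [35, 34, 30, 12, 8]
rev, rev_quals = "ACGAT", [10, 15, 30, 33, 36]
merged = merge_overlap(fwd, fwd_quals, rev, rev_quals)
print(merged)
```

In the toy example the fourth position disagrees (T at quality 12 against A at quality 33), so the better-supported A is kept. Two disagreeing bases of similar quality would give an N instead, which is why counting the contigs without ambiguous bases is a useful check on how well the reads overlap.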