reverse read bad quality - still make contigs?

Hi,

in my current project I am investigating fish skin bacteria with 16S amplicons, using the primers published in Kozich et al. (2013).

I just got my sequences back, and when uploading some reads to FastQC I see two major issues: i) the average quality of the forward reads drops to <= 20 after 125 bp, and ii) the average quality of the reverse reads doesn't reach 20 at all, apart from a few regions.
If I now merge the reads with make.contigs, I fear I will end up with crappy data from which I can't conclude anything. Could I still do the analysis with the forward reads only, despite their poor quality in the second half?
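In case it matters, this is roughly what I had in mind for a forward-read-only run in mothur (just a sketch; the file names and quality-window values are placeholders I made up, not settings I'm sure about):

# convert one R1 fastq into a fasta plus quality file
fastq.info(fastq=sample_R1.fastq)
# trim each read once a 50 bp sliding window drops below an average of Q25,
# and discard reads with ambiguous bases
trim.seqs(fasta=sample_R1.fasta, qfile=sample_R1.qual, qwindowsize=50, qwindowaverage=25, maxambig=0)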

Since I had some trouble during library preparation (poor DNA, and thus unspecific bands and little PCR product), I doubt that repeating the run would solve my problem :frowning:


Thanks!

C

A small update:

I was able to follow the SOP and got OK-looking data, but I don't trust it at all.
If I understand the algorithm in make.contigs correctly, the reads will still be merged even when both have similarly bad quality, as mine do.
In the end I get differences between sequences that come from a high number of sequencing errors.
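For completeness, this is roughly what I ran, together with my understanding of the deltaq option (the deltaq=10 value is just me experimenting, not a recommendation, and I use "current" because I'm not sure of the exact output file names):

# where the two reads disagree, keep the base whose quality is more than
# deltaq points higher; otherwise an N is inserted
make.contigs(file=stability.files, deltaq=10, processors=2)
# then drop every contig that ended up with an ambiguous base
screen.seqs(fasta=current, group=current, maxambig=0)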

What do you think?

This makes me think that the overall data quality is pretty bad and suggests something went wrong with the run. Do you know what the %Q30 was for the run? In general we see values over 80%.

Hi Pat,

the overall %Q30 is 73.9 %, so not too far below your 80 %.

The phasing/prephasing is bad, as well: 0.774/0.220 and 1.228/1.267, respectively.
Cluster PF is 49.17 % ±0.54.

In FastQC I also saw many highly overrepresented sequences, making up as much as 8.73 % of the reads.
I just found out that a lot of them align to the beginning of my reads, but they don't look like my primers.

If that helps I could send you the SAV output.

C.

I would very much appreciate your help, since I don't know whether I can continue my analysis or whether I should repeat the sequencing or the library preparation.

I'm in contact with a colleague from a genomics center who only does the sequencing itself. He told me that I can't use the reverse reads for the analysis and that the library (the DNA) might be the problem, not the sequencing.




C.

Yeah, you probably want to go back to the sequence provider and figure out what went wrong. Sorry I can’t be more helpful.

Pat

Hi everyone,

I know this is an old post, but I have had similar trouble with sequence analysis of the cod gut microbiome (basically cod poo/gut content). The samples were very tricky to extract and gave quite weak bands. I've held off posting in the forum, as I've been trying a lot of things and reading through the forum and Pat's blog, particularly this post (http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/)! So I guess part of the problem is targeting the V4 region with the 2x250 paired-end format (v2 500-cycle kit) (Caporaso et al.).

The issue is that for the forward reads the Q scores are >80 %, but the reverse reads are lower (between 40-50 %). I guess I could disregard the second read? But I've been trying to make the contigs in mothur anyway, since the two reads will be overlapped and mothur should choose the base calls from the higher-quality read (although I think I need to play around with the deltaq option?). The issue I'm having is ~800,000 unique sequences (I have 68 samples in total; I've seen similar numbers in soil studies on this forum, so maybe that's the case here, but I'm thinking this is a lot of errors?). I'm at the pre.cluster stage and the computer may go on fire soon! I'm just wondering: is there anything I could do differently?
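In case it helps to see the commands, this is roughly the pipeline I've been running, following the MiSeq SOP (the deltaq and screening values are just what I've been experimenting with for V4, not settings I'm confident in):

# merge the pairs; I've been trying a stricter deltaq than the default of 6
make.contigs(file=stability.files, deltaq=10, processors=8)
summary.seqs(fasta=current)
# remove contigs with ambiguous bases, long homopolymers, or odd lengths for V4
screen.seqs(fasta=current, group=current, maxambig=0, maxlength=275, maxhomop=8)
unique.seqs(fasta=current)
count.seqs(name=current, group=current)
# ...alignment and filtering steps from the SOP go here...
# allow 2 differences, roughly 1 per 100 bp, as in the SOP
pre.cluster(fasta=current, count=current, diffs=2)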

I had thought about merging my R1 reads and trimming them, merging my R2 reads and trimming them, and then making the contigs from these trimmed sequences and working from there? At the moment I use the file-type input, as I have separate R1 and R2 reads for each sample that already have the primers and barcodes removed.

Sorry I'm repeating a question you loathe! Perhaps you could even send me a link with some more information that could help? :oops:

Thanks so much,

Dr. Ciara Keating (NUI Galway)

Dear All,

I have recently started to analyze my Illumina MiSeq data: 16S rRNA genes, V3-V4 regions, 2x300 paired-end reads. The raw sequence data I obtained from the company are fastq files for each sample (forward and reverse reads), so the barcode, index, and primer sequences were not provided. I assembled both reads for each sample using “make.contigs(file=stability.files, processors=2)”. The results look fine, but the sequences are much longer than what we ordered (300 bp), i.e. approximately 500 bp, which means the primer sequences are also included. I then made another attempt with the same command plus trimoverlap=T, “make.contigs(file=stability.files, trimoverlap=T, processors=2)”, so that the output keeps only the overlap region (the primer sequences and the non-overlapping regions are removed). With that I ran into another problem, since the sequences are now too short, i.e. 50-158 bp, which is not workable for the subsequent analysis.
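Would giving make.contigs an oligos file be the right way to remove the primers instead of using trimoverlap? Something like this is what I had in mind (a sketch only; the primer entries are placeholders for whichever primer sequences the company used):

# contents of a hypothetical v34.oligos file: one "primer" line with both PCR primers
# primer   FORWARD_PRIMER_SEQUENCE   REVERSE_PRIMER_SEQUENCE
make.contigs(file=stability.files, oligos=v34.oligos, pdiffs=2, processors=2)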

Any suggestions or comments on how to improve this would be very much appreciated. Thank you in advance for your help.

Best regards,
Bunlong

Try using split.abund with cutoff=1 and see how many singletons you have. I often remove my singleton reads, as they constitute >50 % of my unique sequences but only about 5 % of my total counts.
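Something along these lines (the file names are placeholders; use whatever your latest fasta and count table are called):

# split sequences into rare (abundance of 1) and abundant fractions
split.abund(fasta=final.fasta, count=final.count_table, cutoff=1)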