Number of OTUs too high

Dear mothur-community,

We have followed the Schloss SOP, but with our own dataset: a 16S rRNA analysis (V3-V4 region) of two samples from two different saline habitats.
At the moment we are at the step “Preparing for Analysis”: we have created the OTUs with cluster.split and assigned taxonomy to the OTUs with classify.otu. We ended up with a huge number of OTUs that contain only a single representative sequence (75% of the total number of generated OTUs). Our assumption is that this is caused, at least in part, by sequencing errors, and that the diversity is therefore inflated. Unfortunately, we don’t have a sequenced mock community yet.
Would it be possible to point us to one of the commands we ran, and its parameters, so that we end up with fewer OTUs containing just a single sequence? Our guess is that we have to alter one of the “filters”, e.g. in the unique.seqs command.
As always, we are grateful for any advice or hints!

Regards,

Hi,

For starters, can you give some background about your data? Amplicon size, run chemistry, quality-filtering parameters and so forth. Also, did you sequence a mock community with your samples?

That kind of long singleton tail does sound like a sequencing quality issue.
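
If the singletons do turn out to be mostly noise, they can also be dropped at the OTU level after clustering; for example, mothur’s remove.rare command removes OTUs below an abundance threshold (a sketch with placeholder file names in the SOP style — whether dropping singletons is appropriate for your question is a separate judgement call):

remove.rare(list=x.list, count=x.count_table, nseqs=1, label=0.03)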

Cheers.

Hello dwaite,

thank you for your fast reply.

For starters, can you give some background about your data?

Sample origin: saline habitat
Topic of interest: biodiversity with respect to bacteria and archaea (extreme halophiles)

Amplicon size

Amplicon product: v3-v4 region of the 16S rRNA gene
Amplicon size: approx. 425 bp

run chemistry

Nextera-v3 kit by Illumina

quality-filtering parameters and so forth

make.contigs(file=x.files, oligos=x.oligos, pdiffs=2)

screen.seqs(fasta=x.fasta, group=x.groups, contigsreport=x.report, maxambig=1, minlength=403, maxlength=453, minoverlap=10)

unique.seqs(fasta=x.fasta)

count.seqs(name=x.names, group=x.groups)

pcr.seqs(fasta=silva.nr_v123.align, start=6427, end=23439, keepdots=F)

align.seqs(fasta=x.fasta, reference=silva.v3v4.fasta)

screen.seqs(fasta=x.align, count=x.count_table, summary=x.summary, start=1, end=17012, maxhomop=8)

filter.seqs(fasta=x.align, vertical=T, trump=.)

unique.seqs(fasta=x.fasta, count=x.count_table)

pre.cluster(fasta=x.fasta, count=x.count_table, diffs=2)

chimera.uchime(fasta=x.fasta, count=x.count_table, dereplicate=t)

remove.seqs(fasta=x.fasta, accnos=x.accnos)

classify.seqs(fasta=x.fasta, count=x.count_table, reference=silva.nr_v123.fasta, taxonomy=silva.nr_v123.tax, cutoff=80)

remove.lineage(fasta=x.fasta, count=x.count_table, taxonomy=x.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)

cluster.split(fasta=x.fasta, count=x.count_table, taxonomy=x.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

make.shared(list=x.list, count=x.count_table, label=0.03)

classify.otu(list=x.list, count=x.count_table, taxonomy=x.taxonomy, label=0.03)

For readability, we have truncated the file names.
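
As a side note on quantifying the singleton problem: the fraction of singleton OTUs can be read straight out of the .shared file that make.shared produces. A minimal sketch (the two-sample example data below is made up for illustration; real .shared files are tab-delimited with label, group, numOtus, then one count column per OTU):

```python
# Sketch: count how many OTUs in a mothur .shared file are singletons
# (total abundance 1 summed across all groups).
def count_singletons(lines):
    otu_names = lines[0].rstrip("\n").split("\t")[3:]
    totals = [0] * len(otu_names)
    for line in lines[1:]:
        counts = line.rstrip("\n").split("\t")[3:]
        for i, c in enumerate(counts):
            totals[i] += int(c)
    return sum(1 for t in totals if t == 1), len(otu_names)

# Made-up example: two samples, three OTUs; Otu002 and Otu003
# are singletons overall.
shared = [
    "label\tGroup\tnumOtus\tOtu001\tOtu002\tOtu003",
    "0.03\tSample1\t3\t5\t1\t0",
    "0.03\tSample2\t3\t2\t0\t1",
]
print(count_singletons(shared))  # -> (2, 3)
```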

Also, did you sequence a mock community with your samples?

Unfortunately, we don’t have a sequenced mock community, yet.

Thanks again in advance for any hints and help!
Regards,

Can you take a look at the read quality of your data? It might be that the read quality is low towards the end of the sequences, which leads to poor read joining.

FastQC is a really good tool for this - just open some fastq files in it and it will score the average quality along the sequences.

Hello dwaite,

thanks for your reply.
We looked at our fastq files with FastQC before proceeding with mothur.

[FastQC per-base quality plots attached: Sample1_fw, Sample1_rev, Sample2_fw, Sample2_rev]

Thanks again for your input!
Regards,

Since your product is so long, you are only overlapping the final 25 bp of each read. Check the quality of those last 25 bases, especially on the second read. You have a 1/10 chance of any base being a sequencing error there. This is pretty normal for the v3 chemistry.
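
To put numbers on that worry, here is a rough sketch using the 25 bp overlap and 1-in-10 per-base error rate quoted above (both assumptions from this thread, not measured values):

```python
# Sketch: probability that one read's 25 bp overlap region is
# entirely error-free, given a 1-in-10 per-base error rate.
p_err = 0.1
overlap = 25
p_clean = (1 - p_err) ** overlap
print(round(p_clean, 3))  # -> 0.072
```

So under these assumptions, only about 7% of reads would have an error-free overlap region, which is why contig formation suffers.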

I have yet to find anyone that has had a good experience with V3 chemistry or when their reads do not fully overlap. You probably need to see this:

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Hello kmitchell and Pat,

first of all, please excuse the late reply, and of course, thanks again for your answers.

Post by kmitchell » Fri Jan 22, 2016 3:51 pm
Since your product is so long, you are only overlapping the final 25bp of each read.

Alright, that makes sense. Nevertheless, what puzzles us at the moment is the following result.
We repeated the make.contigs command (make.contigs(file=Salz.files, oligos=Salz.oligos, pdiffs=2, trimoverlap=T)) and the summary.seqs command (summary.seqs(fasta=Salz.trim.contigs.fasta)) and got the following table:
            Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:    1       1       1       0       1       1
2.5%-tile:  1       20      20      0       3       37262
25%-tile:   1       136     136     0       3       372611
Median:     1       143     143     0       4       745221
75%-tile:   1       146     146     5       5       1117831
97.5%-tile: 1       161     161     22      6       1453180
Maximum:    1       286     286     60      20      1490440
Mean:       1       139.411 139.411 3.70482 4.24013

# of Seqs: 1490440

Now the interquartile range contains sequences of 136 - 146 bp in length. How is this possible?!? And a BLASTN search at NCBI with one of these sequences returned a reasonable hit. As said, we are quite confused by this.

You have a 1/10 chance of any base being a sequencing error. This is pretty normal for the v3 chemistry.

So essentially we gain sequencing length with the v3 kit but lose so much quality that the data is hardly usable. A Phred score of 10 seems quite low and makes us unsure whether to continue with this data.
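
For reference, the standard conversion between a Phred score Q and the implied per-base error probability is P = 10^(-Q/10), so the “1/10 chance” quoted above corresponds exactly to Q = 10:

```python
# Per-base error probability implied by a Phred quality score Q.
def phred_to_error_prob(q):
    return 10 ** (-q / 10)

print(phred_to_error_prob(10))  # -> 0.1  (1-in-10 errors)
print(phred_to_error_prob(30))  # -> 0.001 (1-in-1000 errors)
```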

Regards,