Are my numbers of sequences and OTUs weird?


I am comparing my results to the SOP, and I think they are a little bit weird!
For example, the SOP has 129 058 sequences before unique.seqs and 16 477 after.
I have 583 612 before and 486 121 after.

After pre.cluster and chimera screening, the SOP has 2609 unique sequences. I still have 241 958.

I ended up with 205 332 OTUs (including 182 929 OTUs containing only one sequence…)
Can you help me figure out whether there is a problem with my data? (V3-V4, 192 fecal samples. I used cluster.split with a cutoff of 0.15 and a taxlevel of 4.)

Also, is there a command to generate a shared file that excludes, for example, OTUs with fewer than 2 sequences?

Edit: I also noticed that make.contigs doesn't remove any sequences. If the R1 and R2 input files each have 40 000 reads, I end up with 40 000 reads after make.contigs. Is that normal?

Thank you

What are your samples: fecal, soil, …? What are your sequences: V4, V3-V4, MiSeq, HiSeq?

Thanks for answering.
I sequenced the V3-V4 region of mouse fecal samples on a MiSeq (v3 reagents).

After more investigation, I believe that the poor quality of the run (only 50% of the reads >Q30) is the cause of my problem. When I compare the results of the SOP with mine:

Step                   Mothur SOP    Me
After make.contigs     152360        7640334
Before unique.seqs     129058        583612
% reads kept           85            8
After unique.seqs      16477         486121
% unique               13            83
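
The two percentage rows in this comparison can be rechecked from the raw counts. A minimal Python sketch, using only the numbers quoted in this post; the function name `percent` and the dictionary keys are illustrative labels, not anything mothur produces:

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# Counts copied from the comparison in this post.
counts = {
    "SOP": {"contigs": 152360, "before_unique": 129058, "unique": 16477},
    "Me": {"contigs": 7640334, "before_unique": 583612, "unique": 486121},
}

summary = {
    name: {
        # Share of contigs that survive up to the unique.seqs step.
        "kept": percent(c["before_unique"], c["contigs"]),
        # Share of surviving reads that are distinct sequences.
        "unique": percent(c["unique"], c["before_unique"]),
    }
    for name, c in counts.items()
}

for name, s in summary.items():
    print(f"{name}: {s['kept']}% reads kept, {s['unique']}% unique")
```

A very high "% unique" means that almost every surviving read differs from every other one, which is what heavy sequencing error would be expected to produce.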

What is your feeling about it?

Run quality, and sequencing an insert that is too long. Pat has written about this issue.

I think you are left with phylotype analysis as the only way to try to salvage the data.

I am surprised that the overlap between the paired-end reads is an issue: with 2x300, I have ~130 bp of overlap (as reported by make.contigs).

I guess we will need to select a shorter region for the next experiment.

One last question: when I ran cluster.split with different taxlevel values, the OTUs in my taxonomy file are the same. I would have expected only family names in this file with taxlevel=3, and genus names with taxlevel=4. This is not the case…

Thanks again!

taxlevel=3 is phylum (root, kingdom, phylum); 4 is class.

Yes, you are right. Sorry about that.
But why are the taxonomy files identical whether I use taxlevel 4 or 3? Even when I use level 3, I have OTUs that are identified down to species.

cluster.split isn't making OTUs at a particular taxon level; it's splitting up the sequences by taxon identification before clustering. So if you use taxlevel=3, it will only calculate sequence dissimilarity among, and cluster, sequences that are all ID'd to the same phylum. This is a computational load reduction: it should result in roughly the same OTUs as clustering all sequences together, just in much less time.
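
To illustrate why splitting by taxonomy reduces the work, here is a small Python sketch. It is not mothur's implementation: the function names, the `depth` parameter (how many ranks of the taxonomy string are used as the bin key, which does not model how mothur numbers its levels), and the toy taxonomy strings are all assumptions made for illustration. It only counts how many pairwise distances would be needed with and without splitting.

```python
from collections import defaultdict

def split_by_taxonomy(taxonomy, depth):
    """Group sequence names by the first `depth` ranks of their taxonomy string."""
    bins = defaultdict(list)
    for name, lineage in taxonomy.items():
        ranks = [r for r in lineage.split(";") if r]
        bins[tuple(ranks[:depth])].append(name)
    return dict(bins)

def n_pairs(n):
    """Number of pairwise distances among n sequences."""
    return n * (n - 1) // 2

# Toy, made-up classifications (not real data).
taxonomy = {
    "seq1": "Bacteria;Firmicutes;Clostridia;",
    "seq2": "Bacteria;Firmicutes;Bacilli;",
    "seq3": "Bacteria;Firmicutes;Clostridia;",
    "seq4": "Bacteria;Bacteroidetes;Bacteroidia;",
    "seq5": "Bacteria;Bacteroidetes;Bacteroidia;",
    "seq6": "Bacteria;Proteobacteria;Gammaproteobacteria;",
}

# Bin on the first two ranks (here: kingdom and phylum).
bins = split_by_taxonomy(taxonomy, depth=2)
pairs_all = n_pairs(len(taxonomy))
pairs_split = sum(n_pairs(len(members)) for members in bins.values())

print(f"all together: {pairs_all} distances")
print(f"split into {len(bins)} bins: {pairs_split} distances")
```

In this picture, clustering would then run inside each bin separately, so the bin key only decides which sequences get compared, not what level the resulting OTUs are at.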

Remember that the taxon ID for an OTU tells you nothing about the level of that OTU. You could be looking at phylum-level OTUs and still see an ID down to species level, because classify.otu reports a consensus taxonomy built from the classifications of the sequences in that OTU.

So finally, how should I proceed to compare analyses at various taxonomic levels? (I would like to make stacked bars from the shared file.) I think that with classify.otu for phylotypes we can do it with label 2, 3 or 4, but for OTUs it is not clear what the label 0.03 means.

0.03 means 3% sequence dissimilarity, or 97% sequence similarity. You can use *.tax.summary to make your bar graphs.
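
As a rough sketch of how a *.tax.summary file could be turned into the per-sample proportions needed for stacked bars, the Python below assumes a tab-separated layout with the columns taxlevel, rankID, taxon, daughterlevels and total, followed by one column per sample. That layout, the level numbering in the example, and the function name are assumptions to check against your own file; the sample names and counts are made up.

```python
import csv
import io

# Columns assumed to precede the per-sample count columns.
FIXED_COLUMNS = ["taxlevel", "rankID", "taxon", "daughterlevels", "total"]

def relative_abundance(handle, taxlevel):
    """Return {sample: {taxon: fraction}} for rows at the given taxlevel."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Skip empty names, which appear if the header ends with a trailing tab.
    samples = [c for c in reader.fieldnames if c and c not in FIXED_COLUMNS]
    counts = {s: {} for s in samples}
    for row in reader:
        if int(row["taxlevel"]) != taxlevel:
            continue
        for s in samples:
            counts[s][row["taxon"]] = counts[s].get(row["taxon"], 0) + int(row[s])
    result = {}
    for s, taxa in counts.items():
        total = sum(taxa.values())
        result[s] = {t: (n / total if total else 0.0) for t, n in taxa.items()}
    return result

# Made-up example in the assumed layout.
example = (
    "taxlevel\trankID\ttaxon\tdaughterlevels\ttotal\tsampleA\tsampleB\n"
    "0\t0\tRoot\t1\t200\t100\t100\n"
    "1\t0.1\tBacteria\t2\t200\t100\t100\n"
    "2\t0.1.1\tBacteroidetes\t0\t90\t60\t30\n"
    "2\t0.1.2\tFirmicutes\t0\t110\t40\t70\n"
)

fractions = relative_abundance(io.StringIO(example), taxlevel=2)
for sample, taxa in fractions.items():
    print(sample, taxa)
```

Each sample's fractions sum to 1 at the chosen level, so they can be passed straight to whatever plotting tool you use for the stacked bars.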

Thanks, I will now work on phylotypes and maybe use only the R1 reads, for which the quality is a bit better…

I will continue the discussion here, even though I believe the answer is somewhere on the forum (is there a bug on the forum? When I search on Google using " x", it says that I am not allowed to search the forum).

I decided to work only with the R1 reads, and I would like to use mothur for single-end processing. Is that possible?