Problem: QIIME/mothur give different results

Hi everybody,
I am facing an issue with my dataset.
I used 454 to sequence the g23 viral protein gene from seawater samples. g23 analysis with newer sequencing technologies (OK, not that new anymore) is still not well developed, and one of the main differences from, say, 16S rRNA analysis is that there is no reference database you can count on.
Anyway, a colleague developed a pipeline using QIIME, but as I have started to learn mothur I tried to adapt the pipeline to it. My problem is that I get a crazy number of OTUs out of it: around 28,000 at the 0.03 level, for ca. 170,000 cleaned reads and 110,000 uniques. Even for viruses this is a lot…
So I asked my colleague to run my data through his QIIME pipeline, and he ends up with roughly 800 OTUs. My question is: what is wrong, and which one should I trust?
I put the protocols below. You will notice that the cleaning is not the same: while I used AmpliconNoise, he used Q27 trimming. I did try using the quality files in mothur and still got 16,000 OTUs, but maybe I am doing the Q27 trimming wrong, I am not sure… Maybe it is also because I have too many singletons remaining?
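For what it's worth, the quality trimming I tried in mothur was along these lines (just a sketch, using the fasta and qual files from sffinfo; I am not sure qaverage=27 is really equivalent to his Q27 trimming):
    mothur > trim.seqs(fasta=g23.fasta, qfile=g23.qual, qaverage=27, oligos=g23.oligos, pdiffs=2, bdiffs=1)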
I can also tell you that the average sequence length is the same regardless of the protocol (ca. 450 bp), and the number of sequences passing the cleaning is roughly the same.

So, here are the protocols, and thank you for your help!

Regards.


[u]Original protocol:[/u]

  4. Quality trimming reads - we trim at quality Q27, to (hopefully) remove most singletons (BBMap).
  5. Convert fastq to fasta - BioPython script.

  These next steps were all QIIME...

  6. Demultiplexing:
    # split_libraries.py -m Mapping_File.txt -f trim.fasta -b 10 -p -o Split_library_output
  7. Picking OTUs at the 0.97 similarity level, using usearch de novo chimera detection:
    # pick_otus.py -i Split_Library_Output/seqs.fna -s 0.97 -m usearch --suppress_reference_chimera_detection -o picked_otus_97/
  8. Pick representative OTUs:
    # pick_rep_set.py -i picked_otus_97/seqs_otus.txt -f Split_Library_Output/seqs.fna -o rep_set1_97.fna

[u]My protocol:[/u]

  1. Extract sff file:
    mothur > sffinfo(sff=g23.sff, flow=T)
  2. Preparing flows for amplicon noise:
    mothur > trim.flows(flow=g23.flow, oligos=g23.oligos, pdiffs=2, bdiffs=1, minflow=300, order=B)
  3. Run amplicon noise:
    mothur > shhh.flows(file=g23.flow.files, order=B)
  4. Trimming sequences, removing barcodes, and deleting long homopolymers; min and max lengths were chosen to remove the shortest and longest 2.5%:
    mothur > trim.seqs(fasta=g23.shhh.fasta, name=g23.shhh.names, oligos=g23.oligos, flip=F, pdiffs=2, bdiffs=1, minlength=399, maxlength=490, maxhomop=8)
  5. Simplify the dataset:
    mothur > unique.seqs(fasta=g23.shhh.trim.fasta, name=g23.shhh.trim.names)
  6. De novo chimera detection:
    mothur > chimera.uchime(fasta=g23.shhh.trim.unique.fasta, name=g23.shhh.trim.unique.names, group=g23.shhh.groups)
  7. Remove chimeras:
    mothur > remove.seqs(accnos=g23.shhh.trim.unique.uchime.accnos, fasta=g23.shhh.trim.unique.fasta, group=g23.shhh.groups, name=g23.shhh.trim.unique.names)
  8. Rename the files, because it gets a bit long to write everything out…
    g23.shhh.trim.unique.pick.names >> g23.cleaned.names
    g23.shhh.trim.unique.pick.fasta >> g23.cleaned.fasta
    g23.shhh.pick.groups >> g23.cleaned.groups
  9. Chop the sequences, needed for the distance matrix:
    mothur > chop.seqs(fasta=g23.cleaned.fasta, numbases=399)
  10. Simplify the dataset again, although it did not change much:
    mothur > unique.seqs(fasta=g23.cleaned.chop.fasta, name=g23.cleaned.names)
  11. Build a distance matrix:
    mothur > dist.seqs(fasta=g23.cleaned.chop.unique.fasta, cutoff=0.15)
  12. Cluster that mess into OTUs:
    mothur > cluster(column=g23.cleaned.chop.unique.dist, name=g23.cleaned.chop.names)
  13. For step 14, mothur complained that my group and name files had more sequences than my fasta file… This comes from the chopping at step 9, though I do not understand why 173 seqs did not make it through… Anyway:
    mothur > remove.seqs(accnos=g23.cleaned.chop.accnos, name=g23.cleaned.names, group=g23.cleaned.groups)
  14. Now I am ready to make the OTU tables, at 97% level:
    mothur > make.shared(list=g23.cleaned.chop.unique.an.list, group=g23.cleaned.pick.groups, label=0.03)
  15. Then I need to subsample everything down to the smallest sample; in fact that removed one sample, as I went down to 4082 instead of 1647.
    mothur > count.groups()
    mothur > sub.sample(shared=g23.cleaned.chop.unique.an.shared, size=4082)

To start with, QIIME and mothur use different clustering algorithms (QIIME uses the uclust method by default), so you’ll always get slightly different results between the pipelines.

That said, you should probably be using pairwise.seqs instead of dist.seqs. Dist.seqs assumes that your sequences have been aligned and trimmed so all represent an overlapping region, whereas pairwise.seqs aligns each pair of sequences against each other and then calculates the difference (which is more in line with what QIIME does). Dist.seqs is the choice when working with 16S data because we always align the sequences against a 16S reference database prior to this step. As soon as you move away from a well-studied gene, you need to start looking at either multiple sequence alignment, or pairwise distance comparisons.

Here’s a little schematic of what your current approach is doing, and why it’s probably inflating your OTUs.

If you have these sequences, which are clearly just slightly offset regions of the same sequence:

>Seq1
ATCGATCG
>Seq2
TCGATCG
>Seq3
CGATCG

Then the chop/dist.seqs approach will format the sequences as:

>Seq1
ATCGAT
>Seq2
TCGATC
>Seq3
CGATCG

And dist.seqs would tell you that each sequence is completely different from the others (distance = 1) because at each position, every sequence has a unique base. Pairwise.seqs (default settings) would give the following:

>Seq1 vs Seq2, distance = 0.125
ATCGATCG
-TCGATCG

>Seq1 vs Seq3, distance = 0.25
ATCGATCG
--CGATCG

>Seq2 vs Seq3, distance = 0.14
TCGATCG
-CGATCG

You can, however, toggle whether or not to penalise end gaps: if you were to run the command with “countends=F” you should get a distance of 0 between each pair.
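If it helps, here is a rough Python sketch of that distance calculation (not mothur's actual code, just an illustration of what countends toggles):

    # Toy illustration of how terminal gaps change a pairwise distance.
    # Not mothur's real implementation; just the countends idea.
    def pairwise_distance(a, b, count_ends=True):
        assert len(a) == len(b)  # sequences already pairwise-aligned
        # Find the region where both sequences have bases (no end gaps)
        start = 0
        while start < len(a) and (a[start] == '-' or b[start] == '-'):
            start += 1
        end = len(a)
        while end > start and (a[end - 1] == '-' or b[end - 1] == '-'):
            end -= 1
        # countends=T penalises everything; countends=F only the overlap
        lo, hi = (0, len(a)) if count_ends else (start, end)
        compared = hi - lo
        mismatches = sum(a[i] != b[i] for i in range(lo, hi))
        return mismatches / compared if compared else 0.0

    print(pairwise_distance("ATCGATCG", "-TCGATCG"))                    # 0.125
    print(pairwise_distance("ATCGATCG", "-TCGATCG", count_ends=False))  # 0.0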

If you repeat your pipeline using pairwise.seqs you should end up with far fewer OTUs. Personally, I wouldn’t bother with chop.seqs, and would instead run pairwise.seqs(fasta=XXX, countends=F) to account for the mismatched lengths, since in your data I don’t think there’s any way of telling whether end gaps are a property of the gene sequenced or just the cut-off from your sequencer.

Out of curiosity, what’s your rationale for using an OTU threshold of 97% similarity?

Hei Dwaite,
thanks a lot for your super detailed answer, I really had no idea about all that, and everybody here is on holiday right now, so no help was available… :slight_smile:
pairwise.seqs is now running, but I am afraid that even with 10 processors it will take a decade, so I’ll post the results later.
Regarding the 97%, well, it is really tricky to justify one value over another. It is already arbitrary for 16S rRNA, and here even more so. Several studies have used this value, which makes it easier to compare datasets. However, I will probably go down to 95% when it comes to using the data for statistics. There is probably no correct answer, but considering that viruses have a higher mutation rate than their hosts, we believe that 95% gives a better overview of the diversity.

Me again: pairwise.seqs has now been running for 80 hours and has gotten through around 15% of my 110,000 unique seqs… The computer is fairly powerful, and no command has ever taken me more than overnight, so does that make sense? I ran:
mothur > pairwise.seqs(fasta=g23.cleaned.fasta, countends=F, cutoff=0.15)

In the QIIME pipeline, the whole OTU clustering process took 1.5 sec. Can somebody explain to me how this is possible?

Thanks.

I’m assuming you did QIIME de novo clustering using the usearch method?

The short answer (explained in more detail here) is that usearch uses the abundance of a read as part of its weighting for determining OTU centroids, while mothur uses (at the distance-calculation step) just the sequence itself. What I mean is that with the usearch pipeline, reads are sorted by abundance and then the algorithm proceeds something like this:

The first sequence (the most abundant) becomes an OTU centroid; add it to the OTU pool. For each subsequent sequence, compare it to the existing OTU pool. If it matches an existing OTU within the threshold, it is part of that OTU. If it doesn’t match, add it to the pool as a new centroid.

This makes clustering extremely fast because you only iterate over the data set once. By contrast, with the pairwise.seqs approach you start by comparing every unique sequence against every other one; for your 110,000 uniques that is on the order of six billion pairwise alignments. I have zero opinion on which approach is better, although usearch is undeniably faster.
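As a toy sketch of that greedy loop (in Python; this is not usearch's real code, and naive_identity below is just a stand-in for usearch's alignment-based identity):

    # Toy sketch of greedy, abundance-sorted centroid clustering,
    # the usearch-style idea rather than the real implementation.
    def greedy_cluster(reads, identity, threshold=0.97):
        # reads: list of (sequence, abundance) pairs
        reads = sorted(reads, key=lambda r: r[1], reverse=True)
        centroids = []                    # the growing OTU pool
        for seq, _abundance in reads:
            for centroid in centroids:
                if identity(seq, centroid) >= threshold:
                    break                 # joins an existing OTU
            else:
                centroids.append(seq)     # becomes a new centroid
        return centroids

    # Stand-in identity over equal-length strings (usearch aligns first):
    def naive_identity(a, b):
        return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

    seq_a = "ATCG" * 10                   # 40 bp
    seq_b = seq_a[:-1] + "A"              # one mismatch: identity 0.975
    seq_c = "GGTT" * 10                   # completely different
    print(len(greedy_cluster([(seq_a, 50), (seq_b, 30), (seq_c, 5)],
                             naive_identity)))   # 2 OTUs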

If you want to stick with pairwise.seqs, you could improve the performance by using multiple processors to split the load:

pairwise.seqs(fasta=g23.cleaned.fasta, countends=F, cutoff=0.15, processors=XXX)

Hei,
I understand. I also tried running on our server with 10 processors, but it does not even go any faster… Well, I guess I’ll have to use the QIIME results.

Thanks a lot.