Variation in the results of Mothur/QIIME and BaseSpace

Dear Prof Pat Schloss,
I am working on V3-region MiSeq data obtained from 7 human faecal DNA samples. I am new to this kind of analysis and followed your MiSeq_SOP as well as the QIIME protocols for MiSeq data. I also have results for the same samples from BaseSpace (Illumina); that analysis was done by the sequencing provider, so I don’t know what parameters they used. My problem is that the Mothur and QIIME results differ greatly from the BaseSpace results. The differences are as follows:

BaseSpace Results:
Sample Id   No. of species identified   No. of reads   % of reads classified into genus
FN12        800                         631567         69.82 %
FN173       650                         345118         83.16 %
FN147       711                         485254         80.00 %
FN160       692                         425320         75.84 %
FN40        776                         709729         79.77 %
FN144       759                         469407         83.76 %
FN146       815                         916905         91.31 %

Mothur results:

mothur > summary.seqs(fasta=current, count=current)
Using metagenic.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table as input file for the count parameter.
Using metagenic.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta as input file for the fasta parameter.

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 462 156 0 3 1
2.5%-tile: 1 462 168 0 4 66388
25%-tile: 1 462 168 0 4 663873
Median: 1 462 168 0 5 1327746
75%-tile: 1 462 188 0 5 1991619
97.5%-tile: 1 462 194 0 6 2589104
Maximum: 1 462 199 0 8 2655491
Mean: 1 462 174.861 0 4.6974

# of unique seqs: 51130

total # of seqs: 2655491

count.groups(shared=metagenic.an.shared)
FN12 contains 376682.
FN144 contains 314796.
FN146 contains 659528.
FN147 contains 334515.
FN160 contains 279828.
FN173 contains 233850.
FN40 contains 456273.

Total seqs: 2655472.

Observed OTUs before subsampling:
group nseqs sobs
FN12 376682 3866
FN144 314796 3804
FN146 659528 4900
FN147 334515 3486
FN160 279828 3132
FN173 233850 2921
FN40 456273 4597

Observed OTUs after subsampling (to 233850 reads per sample):
group nseqs sobs
FN12 233850 2760
FN144 233850 3084
FN146 233850 2342
FN147 233850 2788
FN160 233850 2798
FN173 233850 2921
FN40 233850 2736


QIIME results:

mothur > summary.seqs(fasta=cdhit_rep_seps/cdhit_rep_seqs.fasta)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 151 151 0 3 1
2.5%-tile: 1 166 166 0 3 253
25%-tile: 1 169 169 0 4 2529
Median: 1 172 172 0 5 5058
75%-tile: 1 190 190 0 5 7587
97.5%-tile: 1 195 195 0 6 9863
Maximum: 1 195 195 0 9 10115
Mean: 1 178.658 178.658 0 4.54226

# of Seqs: 10115

Sample observed_species
FN144 3789.0
FN160 3721.0
FN173 3718.0
FN40 2671.0
FN147 3681.0
FN12 2996.0
FN146 2806.0

Are the figures given by Mothur and QIIME (observed species ranging from 2671 to 4900) natural and expected, or am I missing something in the analysis?

  1. If these numbers are fine, why do they differ so much from BaseSpace?
  2. Otherwise, I noticed that many OTUs are represented by a single sequence. Should I remove all of those? If yes, how?
  3. Is further subsampling necessary?

Thank you

Anwesh

I think “filter.shared” is my option…
Sorry for posting a new thread before searching…
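
For readers who want to see what removing single-sequence OTUs amounts to, here is a minimal Python sketch. It is not mothur's filter.shared implementation; check the mothur documentation for that command's actual options. It assumes a tab-delimited shared-style table whose first three columns are label, Group and numOtus, followed by one column per OTU. The function name `drop_rare_otus`, the `min_total` parameter and the toy data are all hypothetical.

```python
import csv
import io


def drop_rare_otus(shared_text, min_total=2):
    """Keep only OTUs whose count summed over all samples is >= min_total.

    Assumes a tab-delimited table: label, Group, numOtus, then OTU columns.
    """
    rows = [r for r in csv.reader(io.StringIO(shared_text), delimiter="\t") if r]
    header, data = rows[0], rows[1:]
    fixed = 3  # label, Group, numOtus
    # Column indices of the OTUs that pass the abundance threshold.
    keep = [
        i for i in range(fixed, len(header))
        if sum(int(r[i]) for r in data) >= min_total
    ]
    out = [header[:fixed] + [header[i] for i in keep]]
    for r in data:
        # Rewrite the numOtus column so it matches the filtered table.
        out.append(r[:2] + [str(len(keep))] + [r[i] for i in keep])
    return "\n".join("\t".join(r) for r in out) + "\n"


# Hypothetical toy table: Otu2 occurs once in total and is dropped.
example = (
    "label\tGroup\tnumOtus\tOtu1\tOtu2\tOtu3\n"
    "0.03\tFN12\t3\t10\t1\t0\n"
    "0.03\tFN40\t3\t7\t0\t2\n"
)
filtered = drop_rare_otus(example)
print(filtered)
```

The threshold here is applied to the total across samples, so an OTU seen once in each of two samples is kept; a per-sample threshold would be a different design choice.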

You have to keep in mind that these three methods use very different parameters and approaches. cd-hit is vastly different from what we’re doing. In fact, we’ve shown that cd-hit really sucks for OTU classification. Other things, like the method of denoising, chimera checking, databases, etc., can all have an impact on the output.
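
On the subsampling question raised above, the following Python sketch shows the general idea of rarefying one sample to a fixed depth: draw reads at random without replacement, then count the OTUs that remain. It is an illustration only, not the procedure mothur or QIIME uses internally, and the function name `subsample_counts` and the toy counts are hypothetical.

```python
import random


def subsample_counts(otu_counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from one sample."""
    total = sum(otu_counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand to one OTU label per read, then sample without replacement.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    result = {}
    for otu in rng.sample(reads, depth):
        result[otu] = result.get(otu, 0) + 1
    return result


# Hypothetical toy sample with two abundant OTUs and two singletons.
sample = {"Otu1": 50, "Otu2": 30, "Otu3": 1, "Otu4": 1}
rarefied = subsample_counts(sample, depth=40)
print(sum(rarefied.values()), len(rarefied))
```

Rare OTUs are the ones most likely to disappear in the draw, which is why the number of observed OTUs generally drops when deeper samples are cut down to the depth of the shallowest one.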