I am facing an issue with my dataset.
I used 454 to sequence g23 viral protein from sea water samples. g23 analysis on new sequencing (ok, not that new anymore) technologies is still not really developed, and one of the main difference with say 16SrRNA analsysis is that there is no database on which you can count.
Anyway, a colleague developed a pipeline using QIIME, but as I have starting to learn mothur I tried to adapt the pipelin on it. My problem is that I get a crazy amount of OTUs out of it: around 28000 at 0.03 level, for ca 170000 cleaned reads, 110000 uniques. Even for viruses this is a lot…
Thus I asked the colleague to run my data on his pipeline, using QIIME, and he ends up with roughly 800 OTUs. So my question is: what is wrong? which one should I trust?
I put here under the protocols. You will already notice that the cleaning is not the same, while I used amplicon noise, he used a q27 trimming. But actually, I tried using the quality files in mothur, and get still 16000 OTUs, but maybe I am doing the q27 trimming wrong, I am not sure of that… Maybe it is also because I have too many singletons remaining?
I can also tell you that the average sequence length is the same, regardless of the protocols ( ca. 450bp). Also, the amount of sequences passing the cleaning is roughly the same.
So, here are the protocols, and thank you for your help!
[u]Original protocol:[/u] 4. Quality trimming reads - we trim at quality Q27, to (hopefully) remove most singletons. (BBMap) 5. Convert fastq to fasta - BioPython script These next steps were all QIIME... 6. Demultiplexing # split_libraries.py -m Mapping_File.txt -f trim.fasta -b 10 -p -o Split_library_output 7. Picking OTUs at either 0.97 similarity level using usearch denovo chimera detection # pick_otus.py -i Split_Library_Output/seqs.fna -s 0.97 -m usearch --suppress_reference_chimera_detection -o picked_otus_97/ 8. Pick representative OTUs # pick_rep_set.py -i picked_otus_97/seqs_otus.txt -f Split_Library_Output/seqs.fna -o rep_set1_97.fna
- Extract sff file:
mothur > sffinfo(sff=g23.sff, flow=T)
- Preparing flows for amplicon noise:
mothur > trim.flows(flow=g23.flow, oligos=g23.oligos, pdiffs=2, bdiffs=1, minflow=300, order=B)
- Run amplicon noise:
mothur > shhh.flows(file=g23.flow.files, order=B)
- Trimming sequences, removing barcodes, delete long homopolymers, min and max length were chosen in order to delete the shortest and longest 2.5%:
mothur > trim.seqs(fasta=g23.shhh.fasta, name=g23.shhh.names, oligos=g23.oligos, flip=F, pdiffs=2, bdiffs=1, minlength=399, maxlength=490, maxhomop=8)
- Simplify the dataset:
mothur > unique.seqs(fasta=g23.shhh.trim.fasta, name=g23.shhh.trim.names)
- Denovo chimera detection:
mothur > chimera.uchime(fasta=g23.shhh.trim.unique.fasta, name=g23.shhh.trim.unique.names, group=g23.shhh.groups)
- Remove chimeras:
mothur > remove.seqs(accnos=g23.shhh.trim.unique.uchime.accnos, fasta=g23.shhh.trim.unique.fasta, group=g23.shhh.groups, name=g23.shhh.trim.unique.names)
- Rename the files, cause it gets a bit long to write everything…
g23.shhh.trim.unique.pick.names >> g23.cleaned.names
g23.shhh.trim.unique.pick.fasta >> g23.cleaned.fasta
g23.shhh.pick.groups >> g23.cleaned.groups
- Chop the sequences, needed for the matrix distance:
mothur > chop.seqs(fasta=g23.cleaned.fasta, numbases=399)
- Simplify the dataset again, although it did not change much:
mothur > unique.seqs(fasta=g23.cleaned.chop.fasta, name=g23.cleaned.names)
- Build a distance matrix:
mothur > dist.seqs(fasta=g23.cleaned.chop.unique.fasta, cutoff=0.15)
- Cluster that mess into OTUs:
mothur > cluster(column=g23.cleaned.chop.unique.dist, name=g23.cleaned.chop.names)
- For step 14, mothur complained that my group and name files had more sequences than my fasta file… This comes from the chopping at step 9, however I do not understand why 173 seqs did not make it through… Anyway:
mothur > remove.seqs(accnos=g23.cleaned.chop.accnos, name=g23.cleaned.names, group=g23.cleaned.groups)
- Now I am ready to make the OTU tables, at 97% level:
mothur > make.shared(list=g23.cleaned.chop.unique.an.list, group=g23.cleaned.pick.groups, label=0.03)
- Then I need to subsample everything to the smallest samples, in fact that deleted one, as I went down to 4082, instead of 1647.
mothur > count.groups()
mothur > sub.sample(shared=g23.cleaned.chop.unique.an.shared, size=4082)