Proving there is a problem with the input

I have been struggling for a while with a dataset that I feel has problems, but I cannot figure out how to prove it.

The study aims to find differences in 16S populations from a series of hosts that are classified into two groups (IS and IR).

What I am seeing is that multiple OTUs form that each end up being classified as the same taxonomy. Each individual OTU is very prominent in one MID, but nearly or completely absent in the others (we have two groups with multiple samples per group, but each sample comes from a unique host).

Essentially, a single multiplexed 454 run was performed with Titanium LibA. Both directions were barcoded and sequenced, but only reads starting from the forward primer were retained.
The SOP was followed, except that the shhh.flows() step was skipped because the sff file would not work with it. The first table shows the most prominent taxon. The first line is the sum of the remaining lines. TotalSize refers to the total number of reads per OTU, and the other numbers come from sub.sample (10,000 reads per group). The last two columns are just sums of the columns corresponding to group 1 and group 2. This table was created by merging output from these files (the commands behind them are sketched after the file list):

16sc.unique.good.filter.unique.precluster.pick.an.0.10.cons.taxonomy
16sc.unique.good.filter.unique.precluster.pick.0.10.an.0.10.subsample.0.10.IS-IR.metastats
16sc.unique.good.filter.unique.precluster.pick.0.10.an.0.10.subsample.shared
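
For reference, the subsampling and metastats steps that produced the last two of those files were along these lines - a sketch from memory, with approximate input names and the design file name a placeholder:

mothur > sub.sample(shared=16sc.unique.good.filter.unique.precluster.pick.an.shared, size=10000)
mothur > metastats(shared=16sc.unique.good.filter.unique.precluster.pick.0.10.an.0.10.subsample.shared, design=IS-IR.design)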

I started with a 0.03 distance, where there are even more OTUs, so here is 0.10:

OTU Taxonomy TotalSize Qval IS03f IS05f IS24f IS29f IS58f IS60f IR02f IR04f IR19f IR47f IR53f Group1Total Group2Total
1870 Bacteroidetes_Bacteroides_5 90322 0  1111 5276 2638 303 5810 626 5181 43 327 588 1414 15764 7553
1870 Bacteroidetes_Bacteroides_5 8156 0  85 2946 412 55 33 20 17 18 14 76 24 3551 149
1891 Bacteroidetes_Bacteroides_5 547 0  0 363 0 1 0 0 0 0 0 11 0 364 11
1868 Bacteroidetes_Bacteroides_5 3926 0  0 0 0 0 1 6 755 0 0 0 3 7 758
1867 Bacteroidetes_Bacteroides_5 16620 0  507 0 0 4 0 10 2675 13 0 47 0 521 2735
1984 Bacteroidetes_Bacteroides_5 213 0  2 119 0 5 2 0 0 0 0 0 0 128 0
1876 Bacteroidetes_Bacteroides_5 4274 0  31 0 0 14 1593 6 7 0 1 0 0 1644 8
1894 Bacteroidetes_Bacteroides_5 1971 0  0 70 0 0 707 0 0 0 0 0 4 777 4
1901 Bacteroidetes_Bacteroides_5 166 0  0 93 0 1 0 0 0 0 0 0 4 94 4
1877 Bacteroidetes_Bacteroides_5 2056 0  10 159 178 4 59 1 14 0 0 7 69 411 90
1907 Bacteroidetes_Bacteroides_5 470 0  0 0 70 0 0 1 0 1 0 0 0 71 1
1879 Bacteroidetes_Bacteroides_5 466 3e-06  10 2 42 0 11 0 3 0 0 0 9 65 12
1928 Bacteroidetes_Bacteroides_5 408 0  0 0 59 1 0 0 0 1 0 0 5 60 6
1912 Bacteroidetes_Bacteroides_5 70 0  0 0 0 0 0 0 0 0 63 0 0 0 63
1865 Bacteroidetes_Bacteroides_5 49418 0  466 1524 1877 217 3404 582 1710 10 249 447 1086 8070 3502
1976 Bacteroidetes_Bacteroides_5 1561 0  0 0 0 1 0 0 0 0 0 0 210 1 210

I used diffs=2 in pre.cluster, and it merged about half of the sequence reads at that step. If I set diffs to 10, things get even crazier (the pre.cluster call itself is sketched after the output):

Total number of sequences before pre.cluster was 13946.
pre.cluster removed 11729 sequences.
Total number of sequences before pre.cluster was 5321.
pre.cluster removed 4654 sequences.
Total number of sequences before pre.cluster was 692.
pre.cluster removed 583 sequences.
Total number of sequences before pre.cluster was 3166.
pre.cluster removed 2527 sequences.
Total number of sequences before pre.cluster was 7571.
pre.cluster removed 5899 sequences.
Total number of sequences before pre.cluster was 27992.
pre.cluster removed 20187 sequences.
Total number of sequences before pre.cluster was 19274.
pre.cluster removed 15176 sequences.
Total number of sequences before pre.cluster was 4742.
pre.cluster removed 3612 sequences.
Total number of sequences before pre.cluster was 17051.
pre.cluster removed 14080 sequences.
Total number of sequences before pre.cluster was 16172.
pre.cluster removed 12493 sequences.
Total number of sequences before pre.cluster was 6843.
pre.cluster removed 5638 sequences.
Total number of sequences before pre.cluster was 13840.
pre.cluster removed 11353 sequences.
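
For completeness, the call takes this shape - a sketch, with my long file names abbreviated to placeholders:

mothur > pre.cluster(fasta=16sc.unique.good.filter.unique.fasta, name=16sc.unique.good.filter.names, group=16sc.good.groups, diffs=10)

With the group file supplied, pre.cluster runs within each sample separately, which is why the output above is broken out into one block per MID.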

These reads are basically quite similar to each other, but different enough to form separate OTUs, and then each MID seems to like to spawn its own OTU for each taxonomy. My problem is: what do I do? How can I tell the sequencing center that they need to fix something, since the data is basically unusable? And if the problem stems from the input samples, how can I show that the sequencing and the bioinformatics are good, but the input is bad?

Hope this was clear, thanks for reading.

Sorry for all the problems you’re having.

What I am seeing is that multiple OTUs form that each end up being classified as the same taxonomy. Each individual OTU is very prominent in one MID, but nearly or completely absent in the others (we have two groups with multiple samples per group, but each sample comes from a unique host).

This is actually something you should expect - an OTU can be thought of as a sub-genus taxonomic level, so it would be reasonable to have multiple OTUs with the same taxonomy. For example, the diameter of the Bacteroidetes is about 0.15 - that is, sequences within the phylum can differ from each other by about 15%. You can fit a lot of OTUs into that space.

I used diffs=2 in pre.cluster, and it merged about half of the sequence reads at that step. If I set diffs to 10, things get even crazier:

The first comment makes sense. I’m not sure that I would ever recommend diffs=10, since I’m not really confident about how it would perform.

My problem is: what do I do? How can I tell the sequencing center that they need to fix something, since the data is basically unusable? And if the problem stems from the input samples, how can I show that the sequencing and the bioinformatics are good, but the input is bad?

So a couple things here…

  1. The first problem is that you have a bad sff file. I would ask them to re-provide it to you. That’s a simple thing for them to do. If they can’t do it, the problem is on them.
  2. Since you aren’t running shhh.flows, are you running trim.seqs with qwindowaverage=35 and qwindowsize=50? (See the sketch after this list.)
  3. This is another reason to suggest that people always run a mock community. You would have your sequence data, the truth, and an error rate to go back to the center with.
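
For item 2, the call I have in mind looks something like this - a sketch with placeholder file names that you’d swap for your own; the maxambig/maxhomop/bdiffs/pdiffs settings are the usual 454 SOP values, not anything specific to your data:

mothur > trim.seqs(fasta=yourdata.fasta, oligos=yourdata.oligos, qfile=yourdata.qual, qwindowaverage=35, qwindowsize=50, maxambig=0, maxhomop=8, bdiffs=1, pdiffs=2)

The sliding-window options trim each read back to the point where a 50-base window falls below an average quality of 35, which catches much of the error that shhh.flows would otherwise clean up.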

Since you probably can’t do #3, I’d start with #1 and #2. I’m also not convinced that you have a problem, or entirely sure what the problem is.

Pat

Thanks for your help. The problem with the sff file is beyond me. Apparently the biologist and the core facility decided to put 4 Ns upstream of the barcode. They have a perl script that separates out the sequences based on the barcodes, finds the primers, and, I think, reverse complements the sequences when the read came off the reverse primer (yes, sequenced in both directions, but not paired).

For this data set, they reported a problem generating .qual files. I think I might try to get these out of the sff myself (see the sketch below), as I was not able to run trim.seqs with the parameters you suggested.
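
A minimal sketch of what I have in mind, assuming mothur can read the sff at all (the file name is a placeholder):

mothur > sffinfo(sff=mydata.sff, fasta=T, qfile=T, flow=T)

That should write out the fasta, .qual, and flowgram files in one pass.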

Your explanation of multiple OTUs in a taxon makes sense, and I am glad you wrote it, but I am not sure I was clear about what I suspect the problem to be:

Can I have gut microbiota from six control mice, where each mouse has reads that cluster into OTUs under a single taxon, but none of the six mice actually share the same OTU between them?

Is that to be expected? If so, I am maybe more lost than I thought. If I want to say that one group of mice carries bugs differentially compared to another group of mice, I need to compare individual OTUs, not collections of OTUs that all get the same basic taxonomic label, right?

I think I’d find a new sequence provider in the future… One thing you might try, since you don’t have the quality scores or the flow data, is to use keepfirst=200 in trim.seqs (sketched below). This should cut down on a lot of the sequencing error and perhaps the problem you are seeing.
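
Something along these lines - again a sketch with placeholder file names:

mothur > trim.seqs(fasta=yourdata.fasta, oligos=yourdata.oligos, keepfirst=200)

keepfirst=200 keeps only the first 200 bases of each read; since 454 error rates climb toward the end of a read, throwing away the tail removes a disproportionate share of the errors.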

So you have six OTUs that are each found in a different mouse, they’re each abundant, and they classify as the same thing? Yeah, that does sound weird. Are they all from the same breeding colony? It’s possible that it’s a sequencing error problem. If you take a representative sequence from each of the six OTUs, how similar are they to each other? (One way to check is sketched below.)
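
As a sketch, with reps.fasta standing in for whatever file holds your representative sequences:

mothur > pairwise.seqs(fasta=reps.fasta, output=column)

That aligns every pair of sequences and writes a column-formatted distance file that you can eyeball or feed into cluster.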

Ok, hopefully a little progress. I managed to run get.oturep and fed its fasta output back into the pipeline (roughly as sketched below).
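
The call was along these lines - a sketch, since my real file names are much longer:

mothur > get.oturep(column=16sc.dist, list=16sc.list, name=16sc.names, fasta=16sc.fasta, label=0.10)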

Here are the sequences for the previously mentioned OTUs:

>IS03f_G0Z7RRJ01C2NVI 1870|8156
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGAACTTAGCTTGCTAAGTTTGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGATGACTCGGGGATAGCCTTTCGAAAGAAAGATTAATACCCGATGGCATAGTTCTTCCGCATGGTGGAACTATTAAAGAATTTCGGTCATCGATGGGGA
>IR47f_G0Z7RRJ01DSA6T 1891|547
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGATTGAAGCTTGCTTCAATCGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGATAACTCGGGGATAGCCTTTCGAAAGAAAGATTAATACCCGATAGCATAGTATTTCCGCATGGTTTCACTATTAAAGAACTTCGGTTATCGATGGGGA
>IR02f_G0Z7RRJ01BO84W 1868|3926
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGAACTTAGCTTGCTAAGTTTGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTTCCGTTTACTCAGGGATAGCCTTTCGAAAGAAAGATTAATACCTGATAGTATGGTGAGATTGCATGATAGCACCATTAAAGATTTATTGGTAAACGATGGGGA
>IS29f_G0Z7RRJ01EYVLN 1867|16620
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCGGGATTGAAGCTTGCTTCAATTGCCGGCGACCGGCGCACGGGTGAGTAACGCGTATCCAACCTTCCGTACACTCAGGGATAGCCTTTCGAAAGAAAGATTAATACCTGATGGTATGATGAGATTGCATGATAGCATCATTAAAGATTTATCGGTGTACGATGGGGA
>IS05f_G0Z7RRJ01AOAJJ 1984|213
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCATCAGGAAGAAAGCTTGCTTTCTTTGCTGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCCTTTACTCGGGGATAGCCTTTCGAAAGAAAGATTAATACCCGATGGCATAATGATTCCGCATGGTTTCATTATTAAAGGATTCCGGTAAAGGATGGGGA
>IR14f_G0Z7RRJ01CDQ2G 1876|4274
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATCATCAAAGCTTGCTTTGATGGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGACAACACTGGGATAGCCTTTCGAAAGAAAGATTAATACCGGATGGCATAGTTTTCCCGCATGGGATAATTATTAAAGAATTTCGGTTGTCGATGGGGA
>IR53f_G0Z7RRJ01EZHB7 1894|1971
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCATCAGGAAGAAAGCTTGCTTTCTTTGCTGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGATGACTCGGGGATAGCCTTTCGAAAGAAAGATTAATACCCGATGGTATATCTGAAAGGCATCTTTCAGCTATTAAAGAATTTCGGTCATTGATGGGGA
>IR53f_G0Z7RRJ01EF629 1901|166
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGATTGAAGCTTGCTTCAATTGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGATAACTCGGGGATAGCCTTTCGAAAGAAAGATTAATACCCGATAGTATAGTATTTCCGCATGGTTTCACTATTAAAGAATTTCGGTTATCGATGGGGA
>IR47f_G0Z7RRJ01BVXK9 1877|2056
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATTTTAGTTTGCTTGCAAACTAAAGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGATAACTCGGGGATAGCCTTTCGAAAGAAAGATTAATATCCGATGGTATATTAAAACCGCATGGTTTTACTATTAAAGAATTTCGGTTATCGATGGGGA
>IR04f_G0Z7RRJ01BAZ6I 1907|470
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCATCAGGGTGTAGCAATACACCGCTGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCCTTTACTCGGGGATAGCCTTTCGAAAGAAAGATTAATACCCGATGGTATAACATGACCTCCTGGTTTTGTTATTAAAGAATTTCGGTAGAGGATGGGGA
>IR53f_G0Z7RRJ01AYPNJ 1879|466
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCATCAGTTTGGTTTGCTTGCAAACCAAAGCTGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCTCATACTCGGGGATAGCCTTTTCGAAAGAAAGATTAATATCCGATAGCATATATTTCCCGCATGGGTTTTATATTAAAGAAATTCGGTATGAGATGGGGA
>IS24f_G0Z7RRJ01AXOEJ 1928|408
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGACCTAGCAATAGGTTGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTACCGGTTATTCCGGGATAGCCTTTCGAAAGAAAGATTAATACCGGATAGTATAACGAGAAGGCATCTTTTTGTTATTAAAGAATTTCGATAACCGATGGGGA
>IR19f_G0Z7RRJ01CZYEW 1912|70
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGAGCTTAGCTTGCTAAGCTTGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTCCCGCTTACTCAGGAATAGCCTTTCGAAAGAAAGATTAATGCCTGATGGTATCTTAAGCACACATGTAATTAAGATTAAAGATTTATCGGTAAGCGATGGGGA
>IR02f_G0Z7RRJ01ALARM 1865|49418
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGGTCTTAGCTTGCTAAGGCCGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGTCTACTCTTGGACAGCCTTCTGAAAGGAAGATTAATACAAGATGGCATCATGAGTCCGCATGTTCACATGATTAAAGGTATTCCGGTAGACGATGGGGA
>IR53f_G0Z7RRJ01DE4M7 1976|1561
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCGGGATTGTAGCAATACAATTGCCGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTTCCGGTTACTCGGGGATAGCCTTTCGAAAGAAAGATTAATACCCGATAGCATAAGATTTCCGCATGGTTTTCTTATTAAAGATTCATCGGTAACCGATGGGGA

And here is the screen output from cluster:

mothur > cluster(column=Bac5.unique.dist, name=Bac5.names)

********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     ||||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
unique 1 15 
0.09 3 10 1 1 
0.10 3 8 2 1 
0.11 4 7 2 0 1 
0.14 6 6 0 1 0 0 1 
0.15 7 5 0 1 0 0 0 1 
0.16 7 3 1 1 0 0 0 1 
0.17 9 3 0 1 0 0 0 0 0 1 
0.18 10 1 0 0 1 0 0 0 0 0 1 
0.19 14 1 0 0 0 0 0 0 0 0 0 0 0 0 1 
0.21 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

I am guessing that this means these sequences, when clustered on their own, give different results than when clustered within the whole data set, which is fine. But no OTUs merge between unique and 0.09, and at 0.10 (the setting I was playing with earlier) the line “0.10 3 8 2 1” still reads as 11 OTUs for these 15 sequences: 8 singletons, 2 OTUs of two sequences, and 1 OTU of three.

I guess that I really need to get at the quality data to make sure that the areas of difference are not bad calls by the machine, which is my next step.

Please let me know if anything in these sequences reveals something to you. Thanks!