I have been struggling a while with a dataset that I feel has problems but cannot figure out what to do to prove it.
The study is to find differences in 16S populations from a series of hosts that are classified into 2 groups (IS and IR).
What I am seeing is that multiple OTUs form that will each end up being classified as the same taxonomy. The individual OTUs are very prominent in one mid, but either close to or completely absent in the others (we have 2 groups with multiple different samples per group but each sample comes from a unique host).
Essentially a single, multiplex 454 run was performed with Titanium LibA. Both directions were barcoded and sequenced but only forward primer starting reads were retained.
The SOP was followed but the shhh.flows() part was not as the sff file does not want to work with it. The first table shows the most prominent taxon. The first line is a sum of the remaining lines. Totalsize refers to the number of reads in total per OTU and the other numbers come from the sub.sample (10,000 per group). The left columns are just sums from the columns corresponding to group 1 and group 2. This table was created from a merging of output from these files:
16sc.unique.good.filter.unique.precluster.pick.an.0.10.cons.taxonomy
16sc.unique.good.filter.unique.precluster.pick.0.10.an.0.10.subsample.0.10.IS-IR.metastats
16sc.unique.good.filter.unique.precluster.pick.0.10.an.0.10.subsample.shared
I started with 0.03 distance and there are even more OTUs, so here is 0.10:
OTU Taxonomy TotalSize Qval IS03f IS05f IS24f IS29f IS58f IS60f IR02f IR04f IR19f IR47f IR53f Group1Total Group2Total
1870 Bacteroidetes_Bacteroides_5 90322 0 1111 5276 2638 303 5810 626 5181 43 327 588 1414 15764 7553
1870 Bacteroidetes_Bacteroides_5 8156 0 85 2946 412 55 33 20 17 18 14 76 24 3551 149
1891 Bacteroidetes_Bacteroides_5 547 0 0 363 0 1 0 0 0 0 0 11 0 364 11
1868 Bacteroidetes_Bacteroides_5 3926 0 0 0 0 0 1 6 755 0 0 0 3 7 758
1867 Bacteroidetes_Bacteroides_5 16620 0 507 0 0 4 0 10 2675 13 0 47 0 521 2735
1984 Bacteroidetes_Bacteroides_5 213 0 2 119 0 5 2 0 0 0 0 0 0 128 0
1876 Bacteroidetes_Bacteroides_5 4274 0 31 0 0 14 1593 6 7 0 1 0 0 1644 8
1894 Bacteroidetes_Bacteroides_5 1971 0 0 70 0 0 707 0 0 0 0 0 4 777 4
1901 Bacteroidetes_Bacteroides_5 166 0 0 93 0 1 0 0 0 0 0 0 4 94 4
1877 Bacteroidetes_Bacteroides_5 2056 0 10 159 178 4 59 1 14 0 0 7 69 411 90
1907 Bacteroidetes_Bacteroides_5 470 0 0 0 70 0 0 1 0 1 0 0 0 71 1
1879 Bacteroidetes_Bacteroides_5 466 3e-06 10 2 42 0 11 0 3 0 0 0 9 65 12
1928 Bacteroidetes_Bacteroides_5 408 0 0 0 59 1 0 0 0 1 0 0 5 60 6
1912 Bacteroidetes_Bacteroides_5 70 0 0 0 0 0 0 0 0 0 63 0 0 0 63
1865 Bacteroidetes_Bacteroides_5 49418 0 466 1524 1877 217 3404 582 1710 10 249 447 1086 8070 3502
1976 Bacteroidetes_Bacteroides_5 1561 0 0 0 0 1 0 0 0 0 0 0 210 1 210
I used diffs=2 in pre.cluster and it merged about 1/2 of the sequence reads at that step. If I set the diffs to 10, things get even crazier:
Total number of sequences before pre.cluster was 13946.
pre.cluster removed 11729 sequences.
Total number of sequences before pre.cluster was 5321.
pre.cluster removed 4654 sequences.
Total number of sequences before pre.cluster was 692.
pre.cluster removed 583 sequences.
Total number of sequences before pre.cluster was 3166.
pre.cluster removed 2527 sequences.
Total number of sequences before pre.cluster was 7571.
pre.cluster removed 5899 sequences.
Total number of sequences before pre.cluster was 27992.
pre.cluster removed 20187 sequences.
Total number of sequences before pre.cluster was 19274.
pre.cluster removed 15176 sequences.
Total number of sequences before pre.cluster was 4742.
pre.cluster removed 3612 sequences.
Total number of sequences before pre.cluster was 17051.
pre.cluster removed 14080 sequences.
Total number of sequences before pre.cluster was 16172.
pre.cluster removed 12493 sequences.
Total number of sequences before pre.cluster was 6843.
pre.cluster removed 5638 sequences.
Total number of sequences before pre.cluster was 13840.
pre.cluster removed 11353 sequences.
These reads are basically quite similar to each other but different enough to form different OTUs and then each mid seems to like to spawn its own OTU for each taxonomy. My problem is, what do I do? How can I go tell the sequencing center that they need to fix something as the data is basically unusable. If the problem stems from the input samples, how can I show that they data and the bioinformatics are good, but the input is bad?
Hope this was clear, thanks for reading.