Diversity comparisons between different-sized datasets?

Hi guys;

I was wondering, on a broad theoretical level, what ways one could use the mothur program to compare datasets of drastically different sizes (100k vs 12k sequences, for example). Assuming the same sequencing platform, what major pitfalls might I run into in terms of diversity comparisons?

Thanks

You would need to rarefy or subsample everything to a common number of sequences to make comparisons between datasets.
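
For example, something along these lines would do it (the file name and the size of 12000 here are just placeholders; you’d subsample down to the number of sequences in your smallest group):

sub.sample(shared=yourdata.shared, size=12000)

Alternatively, several of the commands will do the subsampling for you through their subsample option, e.g. summary.single(shared=yourdata.shared, calc=sobs-chao-invsimpson, subsample=12000) or dist.shared(shared=yourdata.shared, calc=thetayc-jclass, subsample=12000).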

Hi Dr Schloss;

Thanks, I was afraid of that answer. I had previously used subsampling per the 454 SOP, but I shied away from it because it started giving me weird results. The datasets I’m currently working with range from 120k to 300k reads, and the other environmental groups I want to compare them to have anywhere from 9k to 30k reads.

The first thing I noticed was that when I subsampled, I got strange Chao1 values (1.04, with sobs and chao being exactly the same) and strange diversity values, and when I built Yue-Clayton trees, the relationships seemed very different from what I expected.

So I investigated further and ran the sub.sample command on some of my groups. The resulting shared files showed drastically different communities than the original groups (I went from 120-200k down to 10k, mind you). In the original files, all three of the sites in question had 50-70% of their sequences in the same single dominant OTU, but in the subsampled shared files I was getting a much more even spread, with the top OTUs differing in each group and often being OTUs that had made up only 0.1-0.2% of the community in the original file. I realize this is only one run of subsampling, but it still seems extremely off, particularly since those counts would amount to almost the entirety of those rarer OTUs in the original 100k+ dataset!

I wasn’t sure what any of this meant, and I had to submit something ASAP for the lab meeting, so I steered clear of subsampling and beta diversity in general.

But now that I have more time, I’d really like to get this ironed out.

Thanks so much for your help

Jon

Also, I suppose I’m not sure how sobs is calculated in the first place. I’m not particularly bright, but I don’t really get why it adds the number of sequences in the most dominant OTU, followed by the number of singleton, doubleton, tripleton, etc. OTUs.

The first thing I noticed was that when I subsampled, I got strange Chao1 values (1.04, with sobs and chao being exactly the same) and strange diversity values, and when I built Yue-Clayton trees, the relationships seemed very different from what I expected.

Have you taken the shared file and calculated the parameters by hand for a couple of comparisons to see if the math works out? If you have a ton of reads and a sufficiently simple community then Sobs may equal Chao1 (like when there are no singletons).
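
For reference, the usual bias-corrected form of the estimator is

Chao1 = Sobs + n1*(n1 - 1) / (2*(n2 + 1))

where n1 and n2 are the numbers of singleton and doubleton OTUs. When n1 = 0, the correction term is zero and Chao1 comes out equal to Sobs.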

So I investigated further and ran the sub.sample command on some of my groups. The resulting shared files showed drastically different communities than the original groups (I went from 120-200k down to 10k, mind you). In the original files, all three of the sites in question had 50-70% of their sequences in the same single dominant OTU, but in the subsampled shared files I was getting a much more even spread, with the top OTUs differing in each group and often being OTUs that had made up only 0.1-0.2% of the community in the original file. I realize this is only one run of subsampling, but it still seems extremely off, particularly since those counts would amount to almost the entirety of those rarer OTUs in the original 100k+ dataset!

If you can send us the shared file and the actual commands you’ve run, I can take a look.

Also, I suppose I’m not sure how sobs is calculated in the first place. I’m not particularly bright, but I don’t really get why it adds the number of sequences in the most dominant OTU, followed by the number of singleton, doubleton, tripleton, etc. OTUs.

It adds the number of OTUs in the singleton, doubleton, etc. classes - not the number of sequences. FWIW, all of the formulae should be on the wiki. For example…
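
In other words, if s_i is the number of OTUs represented by exactly i sequences (s_1 = singleton OTUs, s_2 = doubleton OTUs, and so on), then

Sobs = s_1 + s_2 + s_3 + ...

which is just the total number of OTUs observed in the sample.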

Hi professor;

Thanks for the prompt reply. Sorry, I had read the wiki page for sobs before, but the wording confused me a bit, mainly because I’m terrible at reading comprehension. I get it now.

The problem is still that I’m getting a Chao1 of ~1 for one of my samples, and I just don’t see how that’s possible. I did the calculations by hand and it doesn’t seem to work out, if only because there’s no way I have only one OTU. When I don’t subsample, the sobs calculator works perfectly.

Would you like me to post the commands that I ran?

That would be helpful, as would the rabund data for the sample giving you a Chao1 of 1.

Pat

Hi professor;

Sounds good. I hope this isn’t too much to sort through:

unique.seqs(fasta=archaea.fasta)

align.seqs(fasta=archaea.unique.fasta, reference=gg_13_5_99.align, processors=2)

screen.seqs(fasta=archaea.unique.align, name=archaea.names, group=archaea.groups, optimize=start-end, criteria=99)

filter.seqs(fasta=archaea.unique.good.align, vertical=T, trump=.)

unique.seqs(fasta=archaea.unique.good.filter.fasta, name=archaea.good.names)

chimera.uchime(fasta=archaea.unique.good.filter.unique.fasta, name=archaea.unique.good.filter.names, group=archaea.good.groups)

remove.seqs(accnos=archaea.unique.good.filter.unique.uchime.accnos, fasta=archaea.unique.good.filter.unique.fasta, name=archaea.unique.good.filter.names, group=archaea.good.groups, dups=T)

classify.seqs(fasta=archaea.unique.good.filter.unique.pick.fasta, name=archaea.unique.good.filter.pick.names, group=archaea.good.pick.groups, template=gg_13_5_99.fasta, taxonomy=gg_13_5_99.gg.tax, cutoff=80, processors=5)
dist.seqs(fasta=archaea.unique.good.filter.unique.pick.fasta, cutoff=0.15)

make.table(name=archaea.unique.good.filter.pick.names, group=archaea.good.pick.groups)

cluster(column=archaea.unique.good.filter.unique.pick.dist, count=archaea.unique.good.filter.pick.count_table, cutoff=0.03)

make.shared(list=archaea.unique.good.filter.unique.pick.an.unique_list.list, count=archaea.unique.good.filter.pick.count_table)

summary.single(shared=archaea.unique.good.filter.unique.pick.an.unique_list.unique.subsample.shared, calc=sobs-chao-invsimpson, subsample=10589)
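
(I should mention that the subsample.shared file in that last command came from a separate sub.sample step that I left out of the list above; it was something along the lines of sub.sample(shared=archaea.unique.good.filter.unique.pick.an.unique_list.shared, size=10589), though I may have the exact file name slightly wrong.)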

Here’s the subsampled rabund file. Things are transposed, and to make the list more manageable I grouped the counts shared by many different OTUs and listed them as count x (# of OTUs with that count):

unique
1743
995
527
473
364
300
265
238
220
178
149
141
133
131
126
108
108
100
98
90
74
71
71
71
70
68
67
66
58
49
49
48
46
44
43
42
42
40
38
38
37
37
37
37 x4
34 x1
33 x1
32 x1
30 x2
29 x1
27 x2
26 x1
24 x3
23 x2
22 x4
21 x3
20 x2
19 x3
18 x2
17 x2
16 x3
15 x3
14 x3
13 x1
12 x11
11 x6
10 x12
9 x9
8 x27
7 x23
6 x27
5 x27
4 x48
3 x107
2 x235
1 x1119

Can you just post the content of the rabund file as you are using it?

unique 1743 995 527 473 364 300 265 238 220 178 149 141 133 131 126 108 108 100 98 90 74 71 71 71 70 68 67 66 58 49 49 48 46 44 43 42 42 40 38 38 37 37 37 37 34 33 32 30 30 29 27 27 26 24 24 24 23 23 22 22 22 22 21 21 21 20 20 19 19 19 18 18 17 17 16 16 16 16 15 15 15 14 14 14 14 13 12 12 12 12 12 12 12 12 12 12 12 12 11 11 11 11 11 11 10 10 10 10 10 10 10 10 10 10 10 10 10 10 9 9 9 9 9 9 9 9 9 9 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

I just copied and pasted this into a file and ran it through summary.single with no problem. When I tried to rarefy the data to 10589 sequences as you did, I see that the data you sent me only has 10106 sequences. Was this sample perhaps excluded from the analysis? Can you email your shared file to us and reference the link to this thread?
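
(In case it helps, what I ran on the pasted data was essentially summary.single(rabund=jon.rabund, calc=sobs-chao-invsimpson), where jon.rabund is just the name I gave the file I pasted your numbers into.)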

Sure, that sounds good. Thanks, professor. Who should I email, and what email address would they prefer I use?

Please send the shared file that is giving you the problems. We’re trying to reproduce your problem and have been unable to with the data you’ve provided.

Pat

Hi professor;

I sent them to Ms. Westcott.

Interestingly, I tried running it again on several computers, and it worked on all the Macs.