Newbie - rarefaction.single

Hi,
I’m not sure whether this is the right place for this question (forgive me if I’ve posted in the wrong spot).

I have alignments (ergo a set of OTUs with the counts for each OTU already).
I need to do rarefaction curves and was wondering whether I need to go back, import all my fasta files into mothur, and have it redo the alignments, or whether I can jump in partway with a file I already have.

The format that is easiest for me to export is (random numbers to show the format, not real data):

OTU_Name  Sample1TotalReads  Sample2TotalReads  Sample3TotalReads
1         30                 2                  10001
2         100                34                 4023
etc.

but as long as I know how to format my data, I can turn it into whatever format I need.
I’m also not sure how I would write the rarefaction.single command line in the case where I’m not using a fasta file.
I’m working with Illumina Hi-Seq data, and need to generate some quick rarefaction plots.

If you have this in one of your tutorials already, a pointer to where would be appreciated.
Thanks.

Hi,

You would need to generate a shared file, which you could then feed to rarefaction.single with the shared option. You can see what shared files look like in the various tutorials. Briefly, the first column is a label that’s the same for all of your samples, the second column is the name of the sample, the third column is the number of OTUs, and the subsequent columns (one per OTU) each hold the number of reads for that OTU in that sample.

Pat
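The conversion Pat describes can be illustrated with a small script that turns an OTU-by-sample table like the one above into shared-format rows. This is only a sketch based on the description in this thread: the sample names, OTU counts, and the “0.03” label are placeholders, and newer mothur releases may also expect a header line, so compare the output against a shared file from one of the tutorials before relying on it.

```python
# Sketch: convert an OTU table (rows = OTUs, columns = samples) into
# shared-format rows: label, group name, number of OTUs, then one
# count per OTU. The "0.03" label and all names are placeholders.

def otu_table_to_shared(rows, sample_names, label="0.03"):
    """rows: list of (otu_name, counts_per_sample) tuples."""
    num_otus = len(rows)
    lines = []
    for i, sample in enumerate(sample_names):
        counts = [str(per_sample[i]) for _, per_sample in rows]
        lines.append("\t".join([label, sample, str(num_otus)] + counts))
    return lines

# The made-up example table from the question above:
rows = [("Otu1", [30, 2, 10001]), ("Otu2", [100, 34, 4023])]
for line in otu_table_to_shared(rows, ["Sample1", "Sample2", "Sample3"]):
    print(line)
```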

Thanks for the quick reply, Pat. I’m only now back at it and trying to get it to work, but the program seems to be stalling, or maybe I’m not patient enough.

the commands I’ve tried are
rarefaction.single(shared=nameoffile.shared,calc=ace,freq=100,groupmode=F)
rarefaction.single(shared=nameoffile.shared,calc=ace,freq=1,groupmode=F)

The shared file looks like:

0.1 B1 1928 021023…etc for all 1928 OTUs…it’s an Illumina data run…
0.1 B2 1928 …OTU counts…
etc.

Mothur spits out a set of rabund files and then I get
Processing group B1
0.1
and then it just sits there for a really really long time… a few hours.
Am I not being patient enough? I seem to recall getting things working once on a different computer and it ran faster but it was a 454 data set so the numbers weren’t nearly as large.
I can probably produce a list with a smaller number of OTUs for the data if that’s the problem.

My options for computer platforms here are Windows or Ubuntu (Linux). I’m currently trying it on the Windows system first, so if that’s the problem I can switch to the Ubuntu platform if needed.
(My IT department are Windows fiends and have a small freak-out whenever I start the Linux platform up on their system… so I don’t get internet access with it and can’t look for help files as easily, which means I’ll first try to sort out the issue and get it working on the Windows system.)
Thanks for any help you can give with this.

The number of OTUs isn’t the issue, rather it’s the number of sequences. How many sequences do you have? Can you try calc=sobs and pick a freq that’s 10% of the number of sequences you have?

Pat
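While waiting on a long run, the sobs curve can also be sanity-checked analytically: the classic hypergeometric rarefaction formula gives the expected number of OTUs observed in a random subsample of n reads, and it runs in seconds even for large Illumina libraries. A sketch of that check (this is not mothur’s implementation, and the example abundances are invented):

```python
from math import exp, lgamma

def lchoose(n, k):
    """log of the binomial coefficient C(n, k); -inf when C(n, k) = 0."""
    if k < 0 or k > n:
        return float("-inf")
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def expected_sobs(counts, n):
    """Expected number of OTUs seen in a random subsample of n reads:
    sum over OTUs of 1 - C(total - c, n) / C(total, n)."""
    total = sum(counts)
    seen = 0.0
    for c in counts:
        # probability that an OTU with c reads is missed entirely
        p_miss = exp(lchoose(total - c, n) - lchoose(total, n))
        seen += 1.0 - p_miss
    return seen

abundances = [5000, 200, 50, 10, 3, 1, 1]   # invented example
for depth in (10, 100, 1000):
    print(depth, round(expected_sobs(abundances, depth), 3))
```

At the full sequencing depth every OTU is observed, so the expectation equals the total OTU count, and at a depth of one read it is exactly 1.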

Piling on to “newbie”’s question, I have an embarrassingly simple/silly question about this output from a rarefaction.single(shared=whatever.shared) command:
sample group lci hci
1 VV1 1 1 1
10 VV1 3.837 1.7893 5.8847
20 VV1 5.229 2.8819 7.5761
30 VV1 6.314 3.7697 8.8583
40 VV1 7.157 4.5156 9.7984
50 VV1 7.882 5.2191 10.5449
60 VV1 8.556 5.8609 11.2511
70 VV1 9.167 6.4889 11.8451
80 VV1 9.714 7.0713 12.3567
90 VV1 10.268 7.6035 12.9325
100 VV1 10.723 8.1242 13.3218
110 VV1 11.186 8.6569 13.7151
120 VV1 11.649 9.2299 14.0681
130 VV1 12.079 9.865 14.293
140 VV1 12.473 10.4566 14.4894
150 VV1 12.859 11.0864 14.6316
160 VV1 13.222 11.687 14.757
170 VV1 13.571 12.3606 14.7814
180 VV1 13.931 13.434 14.428
182 VV1 14 14 14
1 VV10 1 1 1
10 VV10 2.552 0.5375 4.5665
20 VV10 3.846 1.2755 6.4165
30 VV10 4.882 2.2047 7.5593
40 VV10 5.786 3.0843 8.4877
50 VV10 6.552 3.9114 9.1926
60 VV10 7.193 4.5857 9.8003
70 VV10 7.82 5.268 10.372
80 VV10 8.327 5.8465 10.8075
90 VV10 8.778 6.3203 11.2357
100 VV10 9.215 6.7967 11.6333
110 VV10 9.565 7.1763 11.9537
120 VV10 9.93 7.6489 12.2111
130 VV10 10.302 8.0689 12.5351
140 VV10 10.635 8.5785 12.6915
150 VV10 10.968 9.1337 12.8023
160 VV10 11.291 9.7072 12.8748
170 VV10 11.602 10.3938 12.8102
180 VV10 11.926 11.3907 12.4613
182 VV10 12 12 12

[VV11, VV12 etc. - there are 22 samples resampled down to 182 sqs/sample]

My question is indeed simple: What is this third column without a header, and why does it get smaller towards the end?

I guessed that if lci and hci are the low/high conf-ints (as I have read), then this should be the “mean/central” estimate. But if so, then why is it smaller than the hci? If I plot this value, all samples strongly appear to have reached saturation - which is good, though a bit surprising for a mere 182 sqs. The same, though less clearly, holds true if I plot the other values. I hesitate to draw conclusions from this before I get a better understanding of it.

I remained somewhat puzzled and found no conclusive information about this on the mothur wiki’s page for the rarefaction.single command.

Best regards, CBH

My question is indeed simple: What is this third column without a header, and why does it get smaller towards the end?

Can you post the rabund file data for VV10 and the exact command you are running?

I first did a make.shared with a group file, which returned the file:

H33R9YH01.trim.unique.pick.good.precluster.abund.fn.0.10.rep_0.03.shared

Then, I used the rarefaction.single command:

rarefaction.single(shared=H33R9YH01.trim.unique.pick.good.precluster.abund.fn.0.10.rep_0.03.shared, freq=10)

which proceeded like this:

Processing group VV1

0.03

Processing group VV10

0.03

[…the same for all groups until VV22]

Output File Names:
H33R9YH01.trim.unique.pick.good.precluster.abund.fn.0.10.rep_0.03.groups.rarefaction

The rabund file for VV10 looks like this in its entirety:

0.03 12 152 7 5 4 4 4 1 1 1 1 1 1

My main puzzlement concerns how a rarefaction value for a simple species count can decrease. I might understand it if it were some kind of uncertainty expression, but that is hard to know since there is no header or other indication of what this string of numbers means.

But to be sure, thank you for an otherwise great programme, and a great support service. Best Regards, Christoffer Bugge Harder

Thanks for catching this…

1 VV10 1 1 1
10 VV10 2.552 0.5375 4.5665
20 VV10 3.846 1.2755 6.4165
30 VV10 4.882 2.2047 7.5593
40 VV10 5.786 3.0843 8.4877
50 VV10 6.552 3.9114 9.1926
60 VV10 7.193 4.5857 9.8003
70 VV10 7.82 5.268 10.372
80 VV10 8.327 5.8465 10.8075
90 VV10 8.778 6.3203 11.2357
100 VV10 9.215 6.7967 11.6333
110 VV10 9.565 7.1763 11.9537
120 VV10 9.93 7.6489 12.2111
130 VV10 10.302 8.0689 12.5351
140 VV10 10.635 8.5785 12.6915
150 VV10 10.968 9.1337 12.8023
160 VV10 11.291 9.7072 12.8748
170 VV10 11.602 10.3938 12.8102
180 VV10 11.926 11.3907 12.4613
182 VV10 12 12 12

I don’t think it’s a big deal, but let me describe what’s happening. The current way we calculate the lci and hci is by getting the standard error and converting that to the confidence level, so that the confidence intervals are symmetrical. That’s what you’re seeing here. Clearly the confidence interval shouldn’t be symmetrical at the ends of the sampling curve. The next release (next weekish) will get the confidence interval empirically, based on the iterations. The mean value won’t change. The reason I don’t think it’s a big deal is that I’m not really sure anyone uses/needs the confidence interval from rarefaction.single.

Pat
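The effect Pat describes can be reproduced with a quick simulation: subsample without replacement, then report the mean and a symmetric mean ± 1.96·sd interval across iterations. This is only a sketch of the idea, not mothur’s code; the iteration count, seed, and 1.96 factor are assumptions, and the counts are the VV10 rabund from above. At the full depth of 182 every subsample contains all 12 OTUs, so the spread collapses to zero and the hci is squeezed down to the mean.

```python
import random

def rarefy_ci(counts, n, iters=100, seed=1):
    """Mean sobs and a symmetric normal-approximation CI at depth n."""
    rng = random.Random(seed)
    # expand OTU counts into one entry per read
    reads = [otu for otu, c in enumerate(counts) for _ in range(c)]
    sobs = []
    for _ in range(iters):
        subsample = rng.sample(reads, n)      # without replacement
        sobs.append(len(set(subsample)))      # OTUs observed
    mean = sum(sobs) / iters
    sd = (sum((x - mean) ** 2 for x in sobs) / iters) ** 0.5
    return mean, mean - 1.96 * sd, mean + 1.96 * sd

vv10 = [152, 7, 5, 4, 4, 4, 1, 1, 1, 1, 1, 1]   # 182 reads, 12 OTUs
for depth in (10, 90, 170, 182):
    mean, lci, hci = rarefy_ci(vv10, depth)
    print(depth, round(mean, 3), round(lci, 3), round(hci, 3))
```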

Thanks for the fast reply. Then I understand a little better. And no, I don’t need the confidence interval from the species count, but I did need to be sure about which numbers represent what. Thus, in order to be certain that the hard of understanding has understood the numbers correctly: the three columns to the right, besides 1) sample and 2) group, represent

3) mean, 4) low ci, 5) high ci, right? I.e. to spell it out with the first rows from the above file:

sample group mean? lci? hci?
1 VV1 1 1 1
10 VV1 3.837 1.7893 5.8847
20 VV1 5.229 2.8819 7.5761
30 VV1 6.314 3.7697 8.8583
40 VV1 7.157 4.5156 9.7984
lci/hci appears as the header to columns 3-4 on my screen, while the fifth column has no header. Obviously, that hardly fits with the numbers, but then again, the 5th column appeared quite mysterious indeed, and then I got confused. Have I got the sequence of the columns about right? Best regards, CBH

You are correct: the 7.157 @ 40 is the mean.