I love Mothur, and there are fewer and fewer functions for which I need to use outside sources.
I have one request, but don’t know how simple it would be to implement. In classify.seqs, I generally like that the number of rows in the summary is reduced to only the organisms that are actually identified in your dataset. On the rare occasion, however, I have wanted to combine multiple classification summaries, but this was not trivial since they could not be matched up 1 to 1 (e.g. you have a Caulobacter in one dataset, but not in another, so that row is omitted).
Would it be possible to include an option to print all rows in the summary? Or are the rows derived from the output taxonomy file rather than the input .tax file?
Thanks for all your hard work!
We can work on this for you… Also, you should note that regardless of whether Caulobacter is present in your dataset, the numbering (e.g. 126.96.36.199) will be the same. That may allow you to merge the data files in R or something else.
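For anyone wanting to try that workaround, here is a minimal sketch of the merge, written in Python rather than R. It assumes the summary files are tab-delimited with `taxlevel`, `rankID`, `taxon`, `daughterlevels` and `total` columns; check the header of your own files first. The two inline tables are made-up toy data, and `read_summary`, `rank_key` and `merge_summaries` are hypothetical helper names, not mothur functions.

```python
import csv
import io


def read_summary(handle):
    """Return {rankID: (taxon, total)} from a tab-delimited summary table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["rankID"]: (row["taxon"], int(row["total"])) for row in reader}


def rank_key(rank_id):
    """Sort key so that '0.1.10' sorts after '0.1.2' (numeric, not text order)."""
    return tuple(int(part) for part in rank_id.split("."))


def merge_summaries(a, b):
    """Outer-join two summaries on rankID; a taxon missing from one gets 0."""
    merged = {}
    for rank_id in sorted(set(a) | set(b), key=rank_key):
        taxon = a[rank_id][0] if rank_id in a else b[rank_id][0]
        total_a = a[rank_id][1] if rank_id in a else 0
        total_b = b[rank_id][1] if rank_id in b else 0
        merged[rank_id] = (taxon, total_a, total_b)
    return merged


# Toy data only: the column layout is an assumption and the numbers are invented.
HEADER = "taxlevel\trankID\ttaxon\tdaughterlevels\ttotal\n"
SUMMARY_A = HEADER + (
    "0\t0\tRoot\t1\t10\n"
    "1\t0.1\tBacteria\t2\t10\n"
    "2\t0.1.1\tCaulobacter\t0\t4\n"
    "2\t0.1.2\tBacillus\t0\t6\n"
)
SUMMARY_B = HEADER + (
    "0\t0\tRoot\t1\t7\n"
    "1\t0.1\tBacteria\t1\t7\n"
    "2\t0.1.2\tBacillus\t0\t7\n"
)

merged = merge_summaries(
    read_summary(io.StringIO(SUMMARY_A)),
    read_summary(io.StringIO(SUMMARY_B)),
)
for rank_id, (taxon, total_a, total_b) in merged.items():
    print(f"{rank_id}\t{taxon}\t{total_a}\t{total_b}")
```

With real files you would pass open file handles instead of the `io.StringIO` objects. The join only makes sense if both summaries were classified against the same reference taxonomy, since that is what keeps the numbering identical.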
Thanks for the quick reply, and great suggestion for the workaround! I will use that for now.
Hi, helpful information! Thanks.
I'd like to add a related question/suggestion…
Isn’t subsampling important prior to classification? I mean, classifying a sample with 2000 sequences is not the same as classifying one with 8000 sequences, especially if you are using those numbers to compare samples later. Is there a way to do this within the suggested Mothur pipeline? At which step would it be advisable?
Sub-sampling is necessary because we aren’t sure of the frequencies when the number of sequences differs between samples. So I wouldn’t be concerned about running classify.otu before subsampling, because it shouldn’t affect the % of different taxa in each OTU. In classify.otu, you actually want more information so you can get a better idea of the consensus classification. Practically speaking, I would be surprised if there were OTUs that changed their consensus classification with and without subsampling first.
Thanks Pat for your reply. I realize my question was not clear. What I meant is:
Sample 1: 6000 sequences
Sample 2: 2000 sequences
If you run classify.seqs, I think the odds of detecting rare taxa are higher in Sample 1 than in Sample 2. If you sub.sample both to 2000 sequences, you more or less balance those odds. Is that right?
I found that this command is possible: sub.sample(fasta=X.fasta, group=X.groups, size=2000, persample=T, name=X.names)
Please let me know what you think. Thanks!
The classification of a sequence doesn’t depend on its abundance. So, I wouldn’t subsample the classifications - I would subsample the phylotype or OTU tables.
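To illustrate what subsampling a phylotype or OTU table to a common depth means, here is a rough Python sketch. It is not mothur's own implementation, only a generic draw without replacement from each sample's counts. The taxon names, the counts and the `subsample_counts` helper are all hypothetical; the 6000 and 2000 totals just mirror the example above.

```python
import random
from collections import Counter


def subsample_counts(counts, size, seed=None):
    """Draw `size` sequences without replacement from a {taxon: count} table."""
    total = sum(counts.values())
    if size > total:
        raise ValueError(f"cannot draw {size} sequences from a sample of {total}")
    # Expand the table to one label per sequence, then sample without replacement.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return Counter(rng.sample(pool, size))


# Invented counts for illustration only.
sample_1 = {"TaxonA": 4000, "TaxonB": 1900, "TaxonC": 100}  # 6000 sequences
sample_2 = {"TaxonA": 1500, "TaxonB": 500}                  # 2000 sequences

# Subsample both samples to the size of the smaller one.
depth = min(sum(sample_1.values()), sum(sample_2.values()))
rarefied_1 = subsample_counts(sample_1, depth, seed=1)
rarefied_2 = subsample_counts(sample_2, depth, seed=1)

print(depth, sum(rarefied_1.values()), sum(rarefied_2.values()))
```

Each sequence keeps whatever classification it already had; only the number of sequences counted per sample changes, which is why the subsampling can happen after classification.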