Challenges faced during ITS analysis with UNITE

For ITS sequences from UNITE, the recommended approach is to begin with the pairwise.seqs() command in mothur.

In this context, we have the following questions:

  1. With >400K sequences in the UNITE dataset, the distance matrix calculation takes a very long time.

[ In fact, applying unique.seqs() only brings the number of sequences down to >330K, which is still quite large for distance matrix computations.

Even if we specify cutoff=0.10 and do not set the output to "lt" or "square", the program still has to compute all distances and then retains only those that meet the cutoff criterion. ]

  2. Once we have the distance matrix, we would like to be able to generate something like the taxonomy summary of the sequence data in our sample (like: test.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.tax.summary).

Basically, how do we associate the distance matrix, which is a profile of the relatedness of sequences in the UNITE database, with our MiSeq data of interest?
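To put the first question in numbers: an all-against-all comparison of n sequences involves n(n-1)/2 pairs, and a cutoff only limits what is kept, not what is computed. The Python sketch below is a toy illustration, not mothur's implementation. `toy_distance` is a hypothetical stand-in (mismatch fraction over equal-length strings) for the pairwise alignment that pairwise.seqs() performs, and the sequence counts are the lower bounds quoted above.

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def toy_distance(a, b):
    """Hypothetical stand-in for an alignment-based distance:
    fraction of mismatched positions between two equal-length strings."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def distances_within_cutoff(seqs, cutoff):
    """Compute every pairwise distance, keep only those at or below cutoff.

    Returns (number of distances computed, dict of kept distances)."""
    computed = 0
    kept = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(seqs.items()), 2):
        computed += 1  # the work is done whether or not the pair is kept
        d = toy_distance(seq_a, seq_b)
        if d <= cutoff:
            kept[(name_a, name_b)] = d
    return computed, kept


if __name__ == "__main__":
    for n in (400_000, 330_000):
        print(f"{n} sequences -> {pair_count(n)} pairs")

    toy = {
        "s1": "ACGTACGTAC",
        "s2": "ACGTACGTAA",
        "s3": "TTTTACGTAC",
        "s4": "GGGGGGGGGG",
    }
    computed, kept = distances_within_cutoff(toy, cutoff=0.10)
    print(f"computed {computed} distances, kept {len(kept)}")
```

Going from 400,000 to 330,000 sequences removes about a third of the pairs, but tens of billions of comparisons remain, which is why dereplication alone does not make the calculation fast.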

Thank you very much for your help.

why are you clustering the unite database?

Sorry for getting back to you so late.

Actually, I was trying to illustrate the fact that even if we reduce the redundancy in the UNITE dataset, we still have a large collection of sequences to work with.

My questions remain:

  1. How to efficiently handle pairwise distance computations for a large collection of sequences such as UNITE.

  2. Once we have the distances, we would like to be able to generate something like the taxonomy summary of the sequence data in our sample (like: test.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.tax.summary).

Basically, how do we associate the distance matrix, which is a profile of the relatedness of sequences in the UNITE database, with our MiSeq data of interest?
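One way to picture the second question: a taxonomy summary is essentially a tally of reads per taxon, so it is built from a taxonomy assignment for each sample sequence plus that sequence's abundance. The Python sketch below illustrates that tally under stated assumptions and is not mothur's code. It assumes a taxonomy file with one `name<TAB>Rank1(conf);Rank2(conf);` line per sequence and a count table whose first two columns are the sequence name and its total count. The function names and the example data are made up for illustration.

```python
import re
from collections import Counter


def parse_taxonomy_line(line):
    """Split 'name<TAB>k__Fungi(100);p__Ascomycota(95);' into (name, [ranks])."""
    name, tax = line.rstrip("\n").split("\t", 1)
    # Drop the trailing '(confidence)' from each rank and skip empty pieces.
    ranks = [re.sub(r"\(\d+\)$", "", r) for r in tax.strip().split(";") if r]
    return name, ranks


def read_totals(count_lines):
    """Map sequence name -> total count, skipping header and comment lines."""
    totals = {}
    for line in count_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or line.startswith("#"):
            continue
        if fields[0] == "Representative_Sequence":
            continue
        totals[fields[0]] = int(fields[1])
    return totals


def tally_taxa(taxonomy_lines, count_lines):
    """Sum read counts for every taxon prefix (kingdom, kingdom;phylum, ...)."""
    totals = read_totals(count_lines)
    summary = Counter()
    for line in taxonomy_lines:
        name, ranks = parse_taxonomy_line(line)
        for depth in range(1, len(ranks) + 1):
            summary[";".join(ranks[:depth])] += totals.get(name, 0)
    return summary


if __name__ == "__main__":
    taxonomy = [
        "seqA\tk__Fungi(100);p__Ascomycota(95);",
        "seqB\tk__Fungi(100);p__Basidiomycota(88);",
        "seqC\tk__Fungi(100);p__Ascomycota(91);",
    ]
    counts = [
        "Representative_Sequence\ttotal\tsample1",
        "seqA\t10\t10",
        "seqB\t4\t4",
        "seqC\t1\t1",
    ]
    for taxon, total in sorted(tally_taxa(taxonomy, counts).items()):
        print(taxon, total)
```

Note that nothing in this tally uses a distance matrix of the reference database. It only needs the per-sequence assignments and counts from the sample itself.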

Thanks in advance.

I’m reworking my ITS analyses and will post my batch soon

I am very glad to know that you will be posting on ITS data analysis with mothur. I will be happy to learn from it. Thank you.

I finally finished my batch file

Wow!! Thank you so much. I will go through it carefully in order to learn and apply ITS analysis with mothur.

Please comment either on github or here if (when!) you find errors

Hi,
I apologize in advance if this is a stupid question. I understand that pairwise.seqs is a key step in analyzing ITS sequences (because their variable length prevents them from being aligned as is done for 16S sequences). But I can’t see this command in Kendra’s batch file (on github). Is it “replaced” by pre.cluster?
Yours,
Maxime

Yeah, I’m not sure how that worked with opticlust? vsearch includes pairwise alignment (I think!)

http://mothur.ltcmp.net/t/opticlust-w-fasta-rather-than-dist/3291/1

Once I understand what’s going on better, I’ll try to add more explanation

Hi Kendra,
I tried to follow your pipeline step-by-step, but I get an error at the step:

make.shared(list=current, count=current)

This comes right after the clustering step, which was successful (with method=agc).
The error message consists of a number of lines like this one:

[ERROR]: M00880_10_000000000-AA2UK_1_2114_9936_18529 is in your groupfile and not your listfile. Please correct.
Your group file contains 51350 sequences and list file contains 50254 sequences. Please correct.

Output File Names:
bff.trim.contigs.pcr.good.unique.precluster.pick.pick.agc.unique_list.shared

The list.shared file is not generated. It seems strange to me that there is something wrong with the group file. I am aware that it became outdated once it was replaced by the count file, and I don’t understand why mothur refers to this file instead of the count file (which is supposed to be up to date).
Thank you,
Maxime

Sorry, I’m busy this week and can’t run through it again. But usually that means that I didn’t have the count file in some previous command that was removing sequences. I’ll try to look at it next week

It’s possible that I forgot to include the count option somewhere, though I tried to include it everywhere needed.
What I find strange is that mothur complained about the group file (which I know is outdated, because earlier in the pipeline, I switched to count file), whereas the arguments I put in make.shared were list and count (but not group).
So I tried, without success, to update the group file by using remove.seqs to exclude the chimeras. I have not been able to dereplicate the excluded sequences by using the information in the name/count file. Only the “centroids” were removed.
Maxime

Dear colleagues,
I found what I was doing wrong, which had prevented me from going all the way through Kendra’s ITS workflow (the one on GitHub). My count file got outdated because I mistakenly used group instead of count at one step. Now it seems to be fine.
Thank you all!
Maxime
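For anyone hitting a similar make.shared error, a quick way to see which file has fallen out of sync is to compare the sequence names in the list file and the count file directly. The Python sketch below is a hypothetical helper, not part of mothur. It assumes a tab-separated list file (label, number of OTUs, then one column per OTU holding comma-separated names, with an optional header row starting with `label`) and a count table whose first column is the sequence name. The file names in the usage note are placeholders.

```python
def names_in_list(list_lines):
    """Collect sequence names from list-file rows."""
    names = set()
    for line in list_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank/short lines and the optional header row.
        if len(fields) < 3 or fields[0] == "label":
            continue
        for otu in fields[2:]:
            names.update(n for n in otu.split(",") if n)
    return names


def names_in_count(count_lines):
    """Collect sequence names from the first column of a count table."""
    names = set()
    for line in count_lines:
        if not line.strip() or line.startswith("#"):
            continue
        first = line.split("\t", 1)[0].strip()
        if first != "Representative_Sequence":
            names.add(first)
    return names


def compare(list_lines, count_lines):
    """Return (names only in the count file, names only in the list file)."""
    in_list = names_in_list(list_lines)
    in_count = names_in_count(count_lines)
    return sorted(in_count - in_list), sorted(in_list - in_count)


if __name__ == "__main__":
    list_lines = [
        "label\tnumOtus\tOtu1\tOtu2",
        "0.03\t2\tseqA,seqB\tseqC",
    ]
    count_lines = [
        "Representative_Sequence\ttotal\tsample1",
        "seqA\t5\t5",
        "seqB\t3\t3",
        "seqC\t2\t2",
        "seqD\t1\t1",
    ]
    only_count, only_list = compare(list_lines, count_lines)
    print("in count file but not list file:", only_count)
    print("in list file but not count file:", only_list)
```

With real data, the two line lists can be read with `open("your.list").readlines()` and `open("your.count_table").readlines()`. Names that appear only in the count file point to a step where sequences were removed without the count file being passed along.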