Challenges faced during ITS analysis with UNITE

meta_analyst · May 24, 2017, 6:00am

For dealing with ITS sequences from UNITE, the recommended approach is to begin with the pairwise.seqs() command in mothur.

In this context, we have the following questions:

When dealing with >400K sequences in the UNITE dataset, the computation time for distance matrix calculation increases a lot.

[ In fact, applying unique.seqs() only brings the number of sequences down to >330K, which is still quite large for distance matrix computations.

Even if we specify cutoff=0.10 and do not set the output to “lt” or “square”, the program still needs to compute all distances and retain only those that fulfil the cutoff criterion. ]
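To put rough numbers on this: an all-against-all comparison of n sequences involves n(n-1)/2 pairs, so the work grows quadratically. The short Python sketch below is only an illustration; the round figures 400,000 and 330,000 are assumptions standing in for the ">400K" and ">330K" counts mentioned above.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Assumed round figures standing in for ">400K" and ">330K".
datasets = [
    ("all sequences", 400_000),
    ("after unique.seqs()", 330_000),
]

for label, n in datasets:
    print(f"{label}: {n:,} sequences -> {pair_count(n):,} pairs")
```

Under these assumptions, dereplication removes about 18% of the sequences but still leaves roughly 54 billion of the original 80 billion pairs to evaluate.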

Once we have the distance matrix, we would like to be able to generate something like the taxonomy summary of the sequence data in our sample (like: test.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.tax.summary).

Basically, how do we associate the distance matrix, which is a profile of the relatedness of sequences in the UNITE database, with our MiSeq data of interest?

Thank you very much for your help.

Kendra · May 26, 2017, 6:19pm

Why are you clustering the UNITE database?

meta_analyst · June 4, 2017, 2:17pm

Sorry for the late reply.

Actually, I was trying to illustrate the fact that even if we reduce the redundancy in the UNITE dataset, we still have a large collection of sequences to work with.

My questions remain:

How to efficiently handle pairwise distance computations for a large collection of sequences such as UNITE.
Once we have the distances, we would like to be able to generate something like the taxonomical summary of the sequence data in our sample (like: test.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.tax.summary).

Basically, how do we associate the distance matrix, which is a profile of the relatedness of sequences in the UNITE database, with our MiSeq data of interest?

Thanks in advance.

Kendra · June 6, 2017, 2:35pm

I’m reworking my ITS analyses and will post my batch file soon.

meta_analyst · June 9, 2017, 10:19pm

I am very glad to know that you will be posting on ITS data analysis with mothur. I will be happy to learn. Thank you.

Kendra · June 13, 2017, 8:41pm

I finally finished my batch file

https://github.com/krmaas/bioinformatics/blob/master/mothur.fungal.batch

###################################################################
###################################################################
###################################################################


##Basic mothur processing of MiSeq sequences using ITS2 primers.  This batch is based on mothur.org/wiki/MiSeq_SOP  All commands are explained in much more detail in that SOP. If your data is not demultiplexed, see the chunk of code at the bottom of this file.
##I use a computer with a high memory node (512gb RAM) with 32 processors. You'll need to adjust processors used to match your system. 


#############
##### I've started using basemount to download the fastq directly to the server. This will download and rename the files to include sample name and sample ID.
#p=PROJECTNAME
#mkdir -p $p/fastq
#for f in basespace/Projects/$p/Samples/*/Files/*.gz; 
#do s=${f##basespace/Projects/$p/Samples/}; s=${s%%/*}; 
#cp $f $p"/fastq/"$s"."${f##*Files/}; 
#done
#####################

### In addition to your sequence files and the oligos file, you need some files from the mothur website to be accessible (either in your path or in the folder you are working in). The line below moves them from my public directory on the BBC server to your current directory.

This file has been truncated.

meta_analyst · June 15, 2017, 10:25am

Wow !! Thank you so much. I will go through it carefully in order to learn and apply ITS analysis with mothur.

Kendra · June 15, 2017, 6:34pm

Please comment either on GitHub or here if (when!) you find errors.

MaximeG · June 23, 2017, 12:54pm

Hi,
I apologize in advance if that’s a stupid question. I understood that pairwise.seqs is a key step in analyzing ITS sequences (because their variable length prevents them from being aligned as is done for 16S sequences). But I can’t see this command in Kendra’s batch file (on GitHub). Is it “replaced” by pre.cluster?
Yours,
Maxime

Kendra · June 23, 2017, 6:31pm

Yeah, I’m not sure how that worked with opticlust? vsearch includes pairwise alignment (I think!)

http://mothur.ltcmp.net/t/opticlust-w-fasta-rather-than-dist/3291/1

Once I understand what’s going on better, I’ll try to add more explanation.

MaximeG · June 27, 2017, 11:48am

Hi Kendra,
I tried to follow your pipeline step-by-step, but I get an error at the step:

make.shared(list=current, count=current)

This comes right after the clustering step, which was successful (with method=agc).
The error message is made of a number of lines like this one:

[ERROR]: M00880_10_000000000-AA2UK_1_2114_9936_18529 is in your groupfile and not your listfile. Please correct.
Your group file contains 51350 sequences and list file contains 50254 sequences. Please correct.

Output File Names:
bff.trim.contigs.pcr.good.unique.precluster.pick.pick.agc.unique_list.shared

The list.shared file is not generated. To me it’s strange that there is something wrong with the group file. I am aware that it becomes outdated because it is replaced by the count file and I don’t understand why mothur calls this file instead of the count file (that is supposed to be up-to-date).
Thank you,
Maxime

Kendra · June 28, 2017, 10:38pm

Sorry, I’m busy this week and can’t run through it again, but usually that means I didn’t include the count file in some previous command that was removing sequences. I’ll try to look at it next week.
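One way to narrow down this kind of mismatch is to compare the sequence names in the count table with those in the list file. The Python sketch below is a rough diagnostic, not part of mothur. It assumes the usual tab-separated layouts: a count table whose header row starts with Representative_Sequence and whose first column holds the sequence name, and a list file whose data rows hold a label, the number of OTUs, and then one comma-separated group of names per OTU (with an optional header row starting with label). The toy data and the names seq1 to seq3 are hypothetical.

```python
import io


def count_table_names(handle):
    """Collect names from column 1 of a count table (assumed layout)."""
    names = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        first = fields[0]
        # Skip blank lines, comment lines, and the header row.
        if not first or first.startswith("#") or first == "Representative_Sequence":
            continue
        names.add(first)
    return names


def list_file_names(handle):
    """Collect every name found in the OTU columns of a list file (assumed layout)."""
    names = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and the optional header row.
        if not fields[0] or fields[0] == "label":
            continue
        # Columns: label, numOtus, then one comma-separated OTU per column.
        for otu in fields[2:]:
            names.update(name for name in otu.split(",") if name)
    return names


# Hypothetical toy data: seq3 is in the count table but not in the list file.
count_text = (
    "Representative_Sequence\ttotal\tA\tB\n"
    "seq1\t3\t2\t1\n"
    "seq2\t1\t0\t1\n"
    "seq3\t2\t2\t0\n"
)
list_text = "label\tnumOtus\tOtu1\n0.03\t1\tseq1,seq2\n"

count_names = count_table_names(io.StringIO(count_text))
list_names = list_file_names(io.StringIO(list_text))
missing = sorted(count_names - list_names)
print("In count table but not in list file:", missing)
```

With real files, replace the io.StringIO objects with open(...) handles. Names that appear only in the count table point to a step where sequences were removed from the fasta or list file without the count file being passed along.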

MaximeG · June 29, 2017, 1:37pm

It’s possible that I forgot to include the count option somewhere, though I tried to include it everywhere needed.
What I find strange is that mothur complained about the group file (which I know is outdated, because earlier in the pipeline, I switched to count file), whereas the arguments I put in make.shared were list and count (but not group).
So I tried, without success, to update the group file by using remove.seqs to exclude the chimeras. I have not been able to dereplicate the excluded sequences by using the information in the name/count file. Only the “centroids” were removed.
Maxime

MaximeG · October 5, 2017, 1:32pm

Dear colleagues,
I found what I was doing wrong, which had prevented me from getting all the way through Kendra’s ITS workflow (the one on GitHub). My count file became outdated because I mistakenly used group instead of count at one step. Now it seems to be fine.
Thank you all!
Maxime
