Dist.seqs Taking Really Long

Hey Mothur Team,

I am using the dist.seqs command with mothur v.1.46.1. It has been running nonstop since yesterday morning and is still processing. Is this command supposed to take this long?

Thanks for all your help and I look forward to hearing back from you.

Hello!

This is a common question.

The usual answers start as questions: How long are the merged reads? How many unique sequences do you have? What are your computer specs? How do your controls (positive and negative) look?

Not enough overlap (merged reads over 300 bp) means not enough error correction, which inflates your uniques and asks too much of your computer. Try making contigs using maxee instead of deltaq, use a phylogeny-based approach for downstream analysis, or use cluster.split based on taxonomy.
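If you go the cluster.split route, a minimal sketch (assuming your fasta, count, and taxonomy files are the current ones in mothur; taxlevel=4 splits the dataset at the order level, and cutoff=0.03 matches the usual OTU threshold) would look something like:

cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.03)

Because distances are then only computed within each taxonomic bin, this avoids the single huge distance matrix that dist.seqs builds.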

Too many uniques? You could try another pre-clustering algorithm such as Deblur; your controls will tell you whether things are looking somewhat OK. It happens in our low-DNA samples.
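Within mothur itself, the diffs parameter of pre.cluster is the main knob; the rule of thumb in the SOPs is to allow roughly 1 difference per 100 bp of read length. A minimal sketch, assuming ~300 bp reads and current fasta/count files:

pre.cluster(fasta=current, count=current, diffs=3)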

Not enough computer power? Try using a supercomputer or cloud servers such as Amazon's.

Hope it helps.

Kind regards and do not forget to use controls,

I have about 770,000 unique sequences and 4,902,143 total sequences after the pre.cluster, chimera.vsearch, remove.seqs, classify.seqs, and remove.lineage commands. I am analyzing Ion Torrent data, which does not have paired ends.

As for controls, the commands I have used so far are below:

fastq.info(fastq=R_2021_10_05_12_00_01_user_S51010052021_Chip.fastq)

summary.seqs(fasta=R_2021_10_05_12_00_01_user_S51010052021_Chip.fasta, processors=8)

trim.seqs(fasta=R_2021_10_05_12_00_01_user_S51010052021_Chip.fasta, oligos=CXMicroMothurOligos.txt, maxambig=0, maxhomop=6, bdiffs=0, pdiffs=0, minlength=265, keepfirst=310, flip=F, processors=8)

summary.seqs(fasta=current, processors=8)

get.current()

unique.seqs(fasta=current)

get.current()

summary.seqs(fasta=current, name=current, processors=8)

count.seqs(name=current, group=current)

get.current()

#pcr.seqs(fasta=silva.nr_v138_1.align,start=11895,end=25318,keepdots=F,processors=32)
#rename.file(input=silva.nr_v138_1.pcr.align,new=silva.v4.fasta)
#summary.seqs(fasta=silva.v4.fasta)
#get.current()

align.seqs(fasta=current, reference=silva.v4.fasta, processors=2)

summary.seqs(fasta=current, count=current, processors=8)

get.current()

summary.seqs(fasta=current, count=current, processors=8)

screen.seqs(fasta=current, count=current, summary=current, start=1967, optimize=end, criteria=95, processors=8)

summary.seqs(fasta=current, count=current, processors=8)

filter.seqs(fasta=current, vertical=T, trump=.)

summary.seqs(fasta=current, count=current, processors=8)

unique.seqs(fasta=current, count=current)

summary.seqs(fasta=current, count=current, processors=8)

get.current()

pre.cluster(fasta=current, count=current, diffs=2)

summary.seqs(fasta=current, count=current, processors=8)

chimera.vsearch(fasta=current, count=current, dereplicate=t)

remove.seqs(accnos=current, fasta=current)

summary.seqs(fasta=current, count=current, processors=8)

classify.seqs(fasta=current, count=current, template=silva.nr_v138_1.align, taxonomy=silva.nr_v138_1.tax, cutoff=80, processors=8)

remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)

summary.seqs(fasta=current, count=current, processors=8)

summary.tax(taxonomy=current, count=current)

get.current()

rename.file(fasta=current, count=current, taxonomy=current, prefix=10052021)

dist.seqs(fasta=current, cutoff=0.03)

I have been struggling to understand the computing power and storage needed, since I am new to using Mothur and to analyzing data without prebuilt tools (RDPipeline).

Since I started using Mothur I can see the size of my files (32 KB to 56,000 KB for the barcoded count tables/map files, and 500,000 KB for the precluster files), the number of sequences, and the average read length (265-285 bp). The raw multiplexed FASTQ file was about 2 million KB (~2 GB), and the FASTA was about the same. The trimmed unique good align files were about 22 million KB (~22 GB). I do not have enough of a computer background to know how this all translates into the CPUs, storage, RAM, etc. needed, so when I ask IT or cloud-provider reps it is really hard for me to communicate what is needed.

I have a call scheduled with a rep from the Microsoft Azure team in the new year. However, I have no clue what to tell them to help get a better setup.

Do you have any pointers on what I should communicate to IT and cloud-computing reps to make clear what is needed based on the data?

If it helps, here is what I know now that I am further along in analyzing this data set:

• I only have access to Microsoft Azure (I am working in a business structure that uses Microsoft products, where the type of computer, parts, setup, etc. has to go through IT after I communicate what is needed).

• To test out Mothur I chose FASTQ data with about 47 barcodes (47 samples). However, at times I will have a FASTQ file with 32 to 72 barcodes (32-72 samples), so these sizes can be smaller or larger than what I am testing now. In a year I process about 10-15 chips that produce multiplexed FASTQ files, and all of this data must be saved.

• During this trial of 47-barcode multiplexed data, up through the summary.tax step, the file sizes range from 9 KB to 22 million KB, which doesn't seem like a lot of space when converted to GB.

• I am extremely comfortable with Ubuntu, Notepad++, command lines, and virtual machines. I was able to install the Ubuntu shell, ARB project files, and Mothur without trouble. However, in the past this was all set up with university help, and I really only have experience using tools that were already set up. Now that I am not in a university setting, it is a bit different.

• The computer I was initially provided has 16 GB RAM, a 64-bit operating system, and an Intel Core i5 processor (base speed 2.21 GHz, 4 cores, 8 logical processors).

• I have also tried a computer with the following specs: 4 GB RAM, a 64-bit operating system, and an Intel Core processor at 1.00 GHz (base speed 1.61 GHz, 2 cores, 4 logical processors).


It turns out the cluster.split command works better; however, I am still experiencing problems, which I will post in a different topic.

Hello!

Hope this year gets better for you. 770,000 uniques is a lot for the computer power that you have. I will look at what you have posted in the other topic.
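As a rough back-of-envelope estimate (not an exact prediction of mothur's memory use): a full pairwise distance matrix over n sequences has n(n-1)/2 entries, so

770,000 × 769,999 / 2 ≈ 2.96 × 10^11 pairwise distances

Even though the 0.03 cutoff keeps only a small fraction of those distances on disk, the computation itself still scales quadratically with the number of uniques, which is why dist.seqs crawls on a 4-8 core desktop.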

The problem is likely that you are using Ion Torrent data. These data are, in general, pretty low quality, and I strongly encourage people not to use Ion Torrent data. If you look at this blog post, the problems are pretty similar to what you are facing…

Pat