Calculation of analysis times

Does anyone know of a good way to estimate analysis times for mothur commands based on the number of sequences and the sequence length? I’m paying for server time to do some analyses, so I was wondering whether there are any calculators for how long an analysis might take. I know the variables involved on the server side would be CPU speed, memory, and OS, but there may be others.

So, for example, if I was going to create a distance matrix from 500,000 sequences, I would assume that with a given processor, OS, and amount of memory it might take, say, 10 seconds per 100 bp per sequence to calculate each pairwise distance. You could then plug in your average sequence length and total number of sequences and scale that up to get the expected analysis time?
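To put numbers on that reasoning (assuming the cost per pair is roughly constant and grows linearly with alignment length): a distance matrix over N sequences needs N(N-1)/2 pairwise comparisons, so for 500,000 sequences that is 500,000 × 499,999 / 2 ≈ 1.25 × 10^11 pairs. Even at a hypothetical 10 microseconds per pair, that works out to about 1.25 × 10^6 seconds, or roughly two weeks of single-core time before splitting across processors. In other words, if my assumptions are right, the time should grow with the square of the number of unique sequences but only linearly with their length.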

I’m particularly interested in being able to calculate analysis times for align.seqs, dist.seqs, and cluster.
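For reference, these are the sorts of calls I mean; the file names below are just placeholders, and processors/cutoff are the settings I would expect to matter most for run time:

mothur > align.seqs(fasta=seqs.fasta, reference=silva.bacteria.fasta, processors=4)
mothur > dist.seqs(fasta=seqs.filter.fasta, cutoff=0.03, processors=4)
mothur > cluster(column=seqs.filter.dist, count=seqs.count_table, cutoff=0.03)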

Sorry, we don’t really have anything like that. Although if you have 500,000 unique sequences going into dist.seqs, I’m pretty confident that it will never go through. That many uniques (i.e. after running pre.cluster) is usually a symptom of sequence data with a lot of sequencing errors.
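If you haven’t already run it, something along these lines is the usual way to collapse the uniques and then check the count before computing any distances (file names are placeholders, and diffs is typically scaled to about 1 difference per 100 bp of read length):

mothur > pre.cluster(fasta=seqs.fasta, count=seqs.count_table, diffs=2)
mothur > summary.seqs(fasta=current, count=current)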

Pat

Well, I combined data from three separate sequencing runs. I started with nearly 4 million reads, and after processing it has been reduced to 770k unique sequences. What is the typical reduction to unique sequences that you would expect in a highly diverse sample? It’s still running after 10 days with four processors and 32 GB RAM. What is the typical limiting factor on an analysis this large? I assume the data is being written to a temp file somewhere, so I’m not sure what would limit the processing. Another consideration is that even if it completes, I’m not sure how I will cluster into OTUs with a distance matrix this large…

There could certainly be a lot of sequencing error. The sequences average 450 bp in length, spanning the V3-V4 regions. I was wondering if I could cut the alignment file down to just the V3 region to save processing time and perhaps reduce the number of unique sequences? Or analyze the V3 and V4 regions separately and compare the results?
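If that is worth trying, my understanding is that trimming the aligned reads to a sub-region would look something like this once I know the alignment coordinates for V3 (the start/end values here are only placeholders, not real coordinates):

mothur > pcr.seqs(fasta=seqs.align, start=2000, end=4000, keepdots=F, processors=4)
mothur > unique.seqs(fasta=current, count=current)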

I know the data should just be resequenced, but that is not a possibility. I’m not even getting paid for this; I’m just trying to salvage what’s left of my scientific career. This is my pandemic project! Maybe if I had a garage or attic to clean out I wouldn’t have started this.

I’m afraid cutting to a subregion won’t help since there will still be low coverage of that region and high error rates. Can you go with the phylotype-based approach described in the MiSeq SOP?
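Roughly, the phylotype route from the SOP looks like the following; the file and reference names are placeholders for whatever you are actually working with:

mothur > classify.seqs(fasta=seqs.fasta, count=seqs.count_table, reference=trainset.fasta, taxonomy=trainset.tax, cutoff=80)
mothur > phylotype(taxonomy=seqs.wang.taxonomy)
mothur > make.shared(list=seqs.wang.tx.list, count=seqs.count_table, label=1)
mothur > classify.otu(list=seqs.wang.tx.list, count=seqs.count_table, taxonomy=seqs.wang.taxonomy, label=1)

The point is that it never builds a distance matrix at all.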

Pat

I used E. coli to determine the regions I was working with in the alignment. I then used pcr.seqs on the alignment file to take the first 200 bp (V3) and the last 200 bp (V4), as I assumed these would be the highest-quality stretches (albeit with no overlap between them). I then ran unique.seqs on both regions, which brought the total number of unique sequences from 770k down to about 450k. I was able to generate two distance matrices with a 0.03 distance cutoff, each about 50 GB, although it took a few days. I’m now trying to run cluster.split on one distance matrix, and so far the application hasn’t crashed, but it hasn’t completed either. If it does not complete, then I will fall back to splitmethod=classify with taxlevel=6.
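For completeness, what I am running is roughly the following, with placeholder file names (and the fallback assumes I have already classified the reads so that a taxonomy file exists):

mothur > dist.seqs(fasta=seqs.v3.fasta, cutoff=0.03, processors=4)
mothur > cluster.split(column=seqs.v3.dist, count=seqs.v3.count_table, cutoff=0.03, processors=4)

and, if that never finishes, the taxonomy-based split, which breaks the job into per-taxon pieces instead of one giant matrix:

mothur > cluster.split(fasta=seqs.v3.fasta, count=seqs.v3.count_table, taxonomy=seqs.v3.taxonomy, splitmethod=classify, taxlevel=6, cutoff=0.03, processors=4)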