Correspondence between OTU numbers across runs

Hello

I am running into a computational bottleneck with the cluster.split command.

I tried to process 3 MiSeq runs together because I want a correspondence between the OTU numbers I obtain among the samples from the different runs.

Here is what I did:

make.group on the 3 contig fastas so I can tell them apart in the final step

merge.files on the 3 contig fastas

merge.count on the 3 contig count tables
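In commands, roughly (the file names here are placeholders for my actual files):

make.group(fasta=run1.contigs.fasta-run2.contigs.fasta-run3.contigs.fasta, groups=run1-run2-run3)
merge.files(input=run1.contigs.fasta-run2.contigs.fasta-run3.contigs.fasta, output=all_runs.fasta)
merge.count(count=run1.contigs.count_table-run2.contigs.count_table-run3.contigs.count_table, output=all_runs.count_table)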

Then I processed everything, but when I reached cluster.split I could only use one of my shiny 128 processors because I ran out of RAM (128 GB): 629,274 unique seqs used 115 GB. It has been running for 3 days now, and I'm not sure how much longer it will take.

But there is probably a better/smarter way to do this. Maybe processing each run separately and then establishing a correspondence between the OTUs found in the 3 runs?

Thanks for your help!

David

Hi David,

What region are you sequencing and what MiSeq chemistry? If you don’t have fully overlapping reads, this can cause problems…

Pat

Hi Pat
This is a fragment of the COI barcode region; the products are 250 bp reads that completely overlap (MiSeq v3 reagents). It works fine one run at a time, though. And yes, we should include mock samples to monitor error rates.
Beyond these 3 runs, we are going to add more data to the dataset in the long run, and we want to keep consistent OTU numbers.
I saw that I could use cluster.fit to do that one run at a time, if I understood it correctly.
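Something like this, maybe, if I read the documentation right (file names are placeholders, and I have not tested it):

cluster.fit(fasta=new_run.fasta, count=new_run.count_table, reffasta=previous_runs.fasta, refcount=previous_runs.count_table, reflist=previous_runs.opti_mcc.list, method=open)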
Best
David

Thanks - Just to confirm, the COI region is about 250 nt long? We have had problems using v3 chemistry with 2x250 reads. We’ve found that to get the lower error rate we have to reduce the loading density significantly to basically create a more expensive version of the v2 chemistry. We only use the v2 chemistry at this point.

You might try increasing diffs in pre.cluster to 3 instead of 2. Another issue is what taxonomic level you are using with cluster.split. You might be able to safely work at taxlevel=6 (genus), which should make for smaller distance matrices.
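Something like this, roughly (pick the cutoff that fits your marker):

pre.cluster(fasta=current, count=current, diffs=3)
cluster.split(fasta=current, count=current, taxonomy=current, taxlevel=6, cutoff=0.03)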

Pat

Yes, it is 250 bp with primers (205 bp once trimmed). I'll check with the sequencing platform about trying out v2, and I'll test diffs=3 / taxlevel=6 to make it go through.
Thanks!
David

Hello! I can put together around 5 to 7 runs for analysis without problems.

I usually run all steps on all samples at the same time, and it works fine. But I ran into a problem where I needed to analyze more runs together and could not, due to limited computing resources. Here is what I did; hope it helps.

First, run mothur on as many samples as possible and save the results.

Second, run mothur on the samples you want to add to the dataset.

Third, merge the fasta and count files using the appropriate commands (merge.files and merge.count).

Finally, reanalyze starting from the alignment step.

Here is my full pipeline for a set of data that I added to an existing one. Cheers!

set.logfile(name=pouletchamp2merge_logFile_clustersplit)

set.current(processors=32)

set.seed(seed=100)
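# Step 1: process the new run (pouletchamp2) on its own, from contig assembly through lineage removal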
make.contigs(file=pouletchamp2.file, oligos=primers.oligo.txt, checkorient=t, pdiffs=2, deltaq=5, maxambig=0, maxlength=300, maxhomop=20, processors=32)

unique.seqs(fasta=current, count=current)
summary.seqs(fasta=current, count=current)

screen.seqs(fasta=current, count=current, summary=current, maxambig=0, maxhomop=20)

classify.seqs(fasta=current, count=current, iters=500, reference=silva.nr_v138.align, taxonomy=silva.nr_v138.tax, cutoff=80, processors=32)

align.seqs(fasta=current, reference=silva.nr_v132.pcr.align, flip=t, processors=32)
summary.seqs(fasta=current, count=current)
count.groups(count=current)

screen.seqs(fasta=current, count=current, summary=current, start=1968, end=11550, maxambig=0, maxhomop=20)
count.groups(count=current)
summary.seqs(fasta=current, count=current)

filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
count.groups(count=current)
summary.seqs(fasta=current, count=current)

set.seed(seed=100)
pre.cluster(fasta=current, count=current, diffs=4, processors=32)
count.groups(count=current)
summary.seqs(fasta=current, count=current)

chimera.vsearch(fasta=current, count=current, dereplicate=t, processors=32)
count.groups(count=current)
summary.seqs(fasta=current, count=current)

set.seed(seed=100)
classify.seqs(fasta=current, count=current, iters=1000, reference=silva.nr_v138.align, taxonomy=silva.nr_v138.tax, cutoff=80)

remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)
summary.tax(taxonomy=current, count=current)
summary.seqs(fasta=current, count=current)
count.groups(count=current)
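
# Step 2: merge the new run's fasta and count_table with the previously processed dataset (combined_saracperf)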

merge.files(input=pouletchamp2.trim.contigs.unique.good.good.filter.unique.precluster.denovo.vsearch.pick.fasta-combined_saracperf.good.filter.unique.precluster.denovo.vsearch.fasta, output=combined_pouletchamp.fasta)

merge.count(count=pouletchamp2.trim.contigs.unique.good.good.filter.unique.precluster.denovo.vsearch.pick.count_table-combined_saracperf.good.filter.unique.precluster.denovo.vsearch.count_table, output=combined_pouletchamp.count_table)

summary.seqs(fasta=combined_pouletchamp.fasta, count=combined_pouletchamp.count_table)

set.seed(seed=100)
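# Step 3: reanalyze the merged data starting from the alignment step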
align.seqs(fasta=current, reference=silva.nr_v132.pcr.align, flip=t, processors=32, match=2, mismatch=-2)
summary.seqs(fasta=current, count=current)

screen.seqs(fasta=current, count=current, summary=current, start=1968, end=11550, maxambig=0, maxhomop=20)
filter.seqs(fasta=current, vertical=T, trump=.)
summary.seqs(fasta=current, count=current)

unique.seqs(fasta=current, count=current)
summary.seqs(fasta=current, count=current)

set.seed(seed=100)
pre.cluster(fasta=current, count=current, diffs=4)
summary.seqs(fasta=current, count=current)

set.seed(seed=100)
chimera.vsearch(fasta=current, count=current, dereplicate=t)
summary.seqs(fasta=current, count=current)
count.groups(count=current)

set.seed(seed=100)
classify.seqs(fasta=current, count=current, iters=1000, reference=silva.nr_v138.align, taxonomy=silva.nr_v138.tax, cutoff=80, processors=32)

set.seed(seed=100)
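# Split the clustering at the genus level (taxlevel=6) to keep the distance matrices manageable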
cluster.split(fasta=current, count=current, taxonomy=current, taxlevel=6, iters=1000, precision=1000, delta=0, cutoff=0.02, processors=32)

make.shared(list=current, count=current, label=0.02)

classify.otu(list=current, count=current, taxonomy=current, threshold=75)

quit()

Hi Alexandre
Thanks a lot for this!
Cheers
David
