Clustering 368gb dist file

vladi · September 26, 2016, 1:59pm

Hello all,

My name is Vladimir and this is my first topic in the forum, so I really hope you can help me with the following:

I’m currently working on an Amazon Web Services instance: m4.4xlarge with 16 cores, 64 GB of RAM, a 600 GB HDD, and the 64-bit Windows Base AMI.

I’m working with 13 samples at the same time (one stability.files listing all 13) in order to build a network. Right now I’m having problems clustering the distance matrix: the dist.seqs command created a 368 GB dist file, and the cluster command can’t read it. It throws an error recommending that I use 64-bit mothur, contact Pat Schloss, etc. The main objective of my pipeline is to get the .shared file that the make.shared command creates, so that I can load it into other software and build the network I want between the 13 samples.

I’ve read here that you recommend the hcluster command because it doesn’t hold the matrix in RAM, but I don’t know exactly how to use it: it asks for column and name, and I don’t know what to use for the name file (unlike the cluster command, which asks for column and count). Can anyone explain in more detail how to use hcluster with the files I mentioned?

I haven’t yet tried setting cutoff=0.20 on the cluster command, as I once read here in the forum.

PS: I already did this with another 13 samples and a 90 GB distance file, and I successfully got my .shared file.

Here is the pipeline that I’m using:

*make.contigs(file=stability.files, processors=16)
*screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=292)
*unique.seqs(fasta=stability.trim.contigs.good.fasta)
*count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
*align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva123.fasta, flip=T)
*screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=2, end=13423, maxhomop=8)
*filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
*unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
*pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
*chimera.uchime(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t, processors=1)
*remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)
*classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=silva123.fasta, taxonomy=silva.nr_v123.tax, cutoff=80)
*remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v123.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
*summary.seqs -------> # of unique seqs= 160226 / total # of seqs= 624926
*dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.20)
------------------------------------(here follows the 2 commands i need to run but can’t yet)--------------------------------------------------------------------
*cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)
*make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.03)

Thank you!

Kendra · September 26, 2016, 2:17pm

Instead of dist.seqs and cluster, use cluster.split

*dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.20)
------------------------------------(here follows the 2 commands i need to run but can't yet)--------------------------------------------------------------------
*cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)
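
A minimal sketch of what that replacement could look like, assuming the file names produced by the pipeline above. With 160226 unique sequences there are roughly 12.8 billion possible pairwise comparisons, which is why a single distance file gets so large. When given a fasta file, cluster.split calculates the distances itself, one taxonomic group at a time, so no single huge dist file has to be read, and it accepts count= in place of name=. The taxonomy file name below is an assumption (the .pick.taxonomy output of remove.lineage); check it against what your run actually wrote. The taxlevel and cutoff values are the ones the MiSeq SOP used at the time and are only a starting point.

```
# Sketch only: the taxonomy file name is an assumption (the *.pick.taxonomy output of remove.lineage).
# splitmethod=classify splits the sequences by their taxonomy; taxlevel=4 and cutoff=0.15 follow the MiSeq SOP of the time.
cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v123.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

# The list file name is the one used in the make.shared call from the original post.
make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.03)
```

If a single taxonomic group is still too large to cluster, a deeper taxlevel splits the data into smaller pieces, at the cost of more groups to process.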

vladi · September 26, 2016, 4:22pm

Thank you for answering!
I have the same issue with the cluster.split command, because I don’t know what to fill in for the “name” parameter. Also, cluster.split requires a dist file, and you are telling me to use cluster.split instead of dist.seqs?

Thank you

vladi · September 28, 2016, 6:23am

It worked perfectly for the test sample that I was running! Thank you so much. I wasn’t sure how cluster.split worked, but now I am!

I have another question. I uploaded the sequences (the output of the remove.lineage command) into SILVAngs 1.2.3 (the same as I use to classify with mothur), and it reported 98.76% classified sequences. Should I then use cutoff=98.7 for the classify.seqs command? Sorry if the question is obvious, but I just want to be sure.

Thank you!

edd-gar · September 28, 2016, 7:00am

Hi,
I don’t know what your data look like, but in my experience, if you use such a strict cutoff you will get a large number of OTUs labelled “unclassified”. That may not be your case, but either way 80 would be a better idea, even though the default is 60.

Good luck, hope it helps.

Kendra · September 28, 2016, 2:12pm

The cutoff for classify.seqs isn’t a percent identity but a Bayesian probability of that sequence belonging to the taxon identified. Go with something between 60 and 80.
