Clustering a 368 GB dist file

Hello all,

My name is Vladimir and this is my first topic on the forum, so I really hope you can help me with the following:

I’m currently working on an Amazon Web Services instance (AMI): m4.4xlarge with 16 cores, 64 GB of RAM, a 600 GB HDD, and 64-bit Windows Base.

I’m processing 13 samples together (one stability.files listing all 13) for network analysis. Right now I’m having trouble clustering the distance matrix: the dist.seqs command created a 368 GB dist file, and the cluster command can’t read it. It throws an error recommending that I use 64-bit mothur, contact Pat Schloss, etc. The main goal of my pipeline is to get the .shared file that the make.shared command creates, so that I can take it into other software and build the network I want across the 13 samples.

I’ve read here that you recommend the hcluster command because it doesn’t hold the matrix in RAM, but I don’t know exactly how to use it: it asks for column and name, and I don’t know what to use for the name file (unlike the cluster command, which asks for column and count). Can anyone explain in more detail how to use hcluster with the files I mentioned?

I haven’t yet tried setting cutoff to 0.20 in the cluster command, as I once read here on the forum.

PS: I already did this with another 13 samples, which gave a 90 GB distance file, and I successfully got my .shared file.

Here is the pipeline I’m using:

*make.contigs(file=stability.files, processors=16)
*screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=292)
*count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
*align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva123.fasta, flip=T)
*screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=2, end=13423, maxhomop=8)
*filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
*unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
*pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
*chimera.uchime(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t, processors=1)
*remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)
*classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=silva123.fasta,, cutoff=80)
*remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table,, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
*summary.seqs -------> # of unique seqs= 160226 / total # of seqs= 624926
*dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.20)
---------- (the 2 commands below are the ones I need to run but can’t yet) ----------
*cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)
*make.shared(, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.03)
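
For a sense of scale, the unique-sequence count reported by summary.seqs above implies the number of pairwise comparisons computed below. This is only a rough sketch in Python: the function name is hypothetical, and it counts all possible pairs. It says nothing about how many of them fall under the cutoff or how many bytes each line takes in the dist file.

```python
def pairwise_count(n_unique: int) -> int:
    """Number of distinct unordered pairs among n_unique sequences."""
    return n_unique * (n_unique - 1) // 2


n_unique = 160226  # unique sequences reported by summary.seqs above
pairs = pairwise_count(n_unique)
print(f"{pairs:,} possible pairwise distances")
```

That works out to roughly 12.8 billion possible pairs, which helps explain why a column-format distance file can grow to hundreds of GB even with a cutoff applied.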

Thank you !

Instead of dist.seqs followed by cluster, use cluster.split.

Thank you for answering!
I have the same issue with the cluster.split command, because I don’t know what to put in the “name” parameter. Also, cluster.split requires a dist file, yet you are telling me to use cluster.split instead of dist.seqs?

Thank you

It worked perfectly for the test sample I was running! Thank you so much. I wasn’t sure how cluster.split worked, but now I am!

I have another question. I uploaded the sequences (the output of the remove.lineage command, the same ones I have to classify with mothur) to SILVAngs 1.2.3, and it classified 98.76% of them. So I should use cutoff=98.7 for the classify.seqs command, right? Sorry if the question is rhetorical; I just want to be sure.

Thank you!

I don’t know what your data look like, but in my experience, if you use such a stringent cutoff you will get a large number of OTUs labeled “unclassified”. Maybe that’s not your case, but either way 80 would be a better choice, even though the default is 60.

Good luck, hope it helps.

The cutoff for classify.seqs isn’t a percent identity but a Bayesian probability of that sequence belonging to the taxon identified. Go with something between 60 and 80.
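
To illustrate the point, here is a minimal Python sketch of what a confidence cutoff does to a single assignment. It assumes a taxonomy string in which each rank is followed by its confidence score in parentheses. The example string and the function name are made up for illustration, and this is not mothur's own code.

```python
import re


def apply_confidence_cutoff(taxonomy: str, cutoff: float) -> list:
    """Keep ranks scoring at or above cutoff; once one rank fails,
    that rank and every rank below it become 'unclassified'."""
    result = []
    passed = True
    for field in taxonomy.strip(";").split(";"):
        match = re.fullmatch(r"(.+)\((\d+)\)", field)
        if match is None:
            raise ValueError(f"unexpected taxonomy field: {field!r}")
        name, score = match.group(1), int(match.group(2))
        passed = passed and score >= cutoff
        result.append(name if passed else "unclassified")
    return result


# Hypothetical assignment for one sequence (invented values).
example = "Bacteria(100);Firmicutes(99);Clostridia(85);Clostridiales(62);"

print(apply_confidence_cutoff(example, 80))
print(apply_confidence_cutoff(example, 98.7))
```

With a cutoff of 80 the first three ranks survive. With 98.7 only the first two do, which is the "lots of unclassified" effect described above.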