When I ran my sequence analysis, I could only get 60 GB of memory on the computer cluster to work with. My aim is to generate fasta files with representative sequences at different cutoff values for phylogenetic analysis. I can clean up the original fasta file without problems following the mothur SOP. However, the files are too big for dist.seqs and the subsequent steps. I realised that when I clustered with a relatively low cutoff value, the computer could still generate files for further processing. The steps I would use if memory were not an issue are as follows:
dist.seqs(fasta=myfastafile.fa, cutoff=0.1, output=lt)
cluster(phylip=myfastafile.phylip.dist, cutoff=0.1, precision=1000)
get.oturep(phylip=myfastafile.phylip.dist, list=myfastafile.phylip.an.list, fasta=myfastafile.fa, count=myfastafile.count_table)
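To see why the full run overflows, here is a minimal Python sketch of the memory cost of a complete lower-triangle pairwise distance matrix. The 8 bytes per stored distance is my assumption for illustration only, not mothur's actual internal representation; the pair count itself is exact.

```python
def dist_matrix_gib(n_seqs, bytes_per_entry=8):
    """Rough size of a lower-triangle pairwise distance matrix in GiB.

    n_seqs sequences give n*(n-1)/2 pairwise distances; bytes_per_entry
    is an assumed cost per stored distance (mothur's real memory usage
    and on-disk phylip file size will differ).
    """
    n_pairs = n_seqs * (n_seqs - 1) // 2
    return n_pairs * bytes_per_entry / 2**30

# The pair count grows quadratically with the number of sequences,
# so trimming the input shrinks the matrix much faster than linearly.
for n in (100_000, 200_000, 400_000):
    print(f"{n} sequences -> ~{dist_matrix_gib(n):.1f} GiB")
```

Because the cost is quadratic, reducing the sequence set to about two-thirds of its original size (as the first get.oturep pass does) cuts the next distance matrix to roughly (2/3)² ≈ 44 % of the original, which is presumably why the repeated run fits in 60 GB.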
The code above will not work because it runs out of memory; however, the workflow below does work.
dist.seqs(fasta=myfastafile.fa, cutoff=0.03, output=lt)
cluster(phylip=myfastafile.phylip.dist, cutoff=0.03) ### just do not set the precision for now
get.oturep(phylip=myfastafile.phylip.dist, list=myfastafile.phylip.an.list, fasta=myfastafile.fa, count=myfastafile.count_table)
This workflow does generate one or two fasta files containing representative sequences, and the number of sequences is reduced by about a third. Let's say the generated files are:
myfastafile.0.01.rep.fasta
myfastafile.0.01.rep.count_table
### I then ran the following commands on the representative sequences:
dist.seqs(fasta=myfastafile.0.01.rep.fasta, cutoff=0.1, output=lt)
cluster(phylip=myfastafile.0.01.rep.phylip.dist, cutoff=0.1, precision=1000)
get.oturep(phylip=myfastafile.0.01.rep.phylip.dist, list=myfastafile.0.01.rep.phylip.an.list, fasta=myfastafile.0.01.rep.fasta, count=myfastafile.0.01.rep.count_table)
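The two passes differ only in their input fasta, cutoff, and precision, so the pattern can be sketched with one small Python helper. The `<prefix>.phylip.dist` / `<prefix>.phylip.an.list` naming is taken from the commands above and is an assumption about mothur's default output names; verify against your mothur version.

```python
def mothur_pass(fasta, cutoff, precision=None):
    """Build the three mothur commands for one dist/cluster/oturep pass.

    Output file names follow the <prefix>.phylip.* pattern seen in the
    workflow above (an assumption; check your mothur version's naming).
    """
    prefix = fasta.rsplit(".", 1)[0]          # strip the .fa/.fasta extension
    dist = f"{prefix}.phylip.dist"
    lst = f"{prefix}.phylip.an.list"
    cluster_opts = f"phylip={dist}, cutoff={cutoff}"
    if precision is not None:
        cluster_opts += f", precision={precision}"
    return [
        f"dist.seqs(fasta={fasta}, cutoff={cutoff}, output=lt)",
        f"cluster({cluster_opts})",
        f"get.oturep(phylip={dist}, list={lst}, fasta={fasta}, "
        f"count={prefix}.count_table)",
    ]

# First pass on the full file, second pass on the representatives:
for cmd in mothur_pass("myfastafile.fa", 0.03):
    print(cmd)
for cmd in mothur_pass("myfastafile.0.01.rep.fasta", 0.1, precision=1000):
    print(cmd)
```

This makes explicit that the second pass is the same pipeline applied to a smaller input, which is the crux of the validity question below.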
### Repeating the process this way generates a set of fasta files with representative sequences at different cutoff values, and the memory can handle the files. I would like to know whether this two-step repetition is valid. Can anyone answer, please?