Can I use these steps to work around the large files from dist.seqs and the subsequent procedures?

When I did my sequence analysis, I could only get 60 GB of memory on the computer cluster to work with. My aim is to generate fasta files of representative sequences at different cutoff values for phylogenetic analysis. I can clean up the original fasta file without problems by following the mothur SOP. However, the files from dist.seqs and the subsequent steps are too big. I realised that when I cluster with a relatively low cutoff value, the computer can still generate files for further processing. The steps I would use if memory were not an issue are as follows:

dist.seqs(fasta=myfastafile.fa, cutoff=0.1, output=lt)
cluster(phylip=myfastafile.phylip.dist, cutoff=0.1, precision=1000)
get.oturep(phylip=myfastafile.phylip.dist, list=myfastafile.phylip.an.list, fasta=myfastafile.fa, count=myfastafile.count_table)



The commands above will not work due to memory overload; however, the workflow below does work.

dist.seqs(fasta=myfastafile.fa, cutoff=0.03, output=lt)
cluster(phylip=myfastafile.phylip.dist, cutoff=0.03) ### just do not set the precision for now
get.oturep(phylip=myfastafile.phylip.dist, list=myfastafile.phylip.an.list, fasta=myfastafile.fa, count=myfastafile.count_table)

These commands generate one or two fasta files containing representative sequences, and the number of sequences is reduced by about a third. Let's say the generated files are:

myfastafile.0.01.rep.fasta
myfastafile.0.01.rep.count_table
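
As an aside, get.oturep writes one set of representative sequences per label in the list file, which may be why these files carry the 0.01 label. If I read the wiki correctly, you can also restrict the output to a single distance with the label option, reusing the file names from the first workflow:

get.oturep(phylip=myfastafile.phylip.dist, list=myfastafile.phylip.an.list, fasta=myfastafile.fa, count=myfastafile.count_table, label=0.01)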


### I then ran the following commands:
dist.seqs(fasta=myfastafile.0.01.rep.fasta, cutoff=0.1, output=lt)
cluster(phylip=myfastafile.0.01.rep.phylip.dist, cutoff=0.1, precision=1000)
get.oturep(phylip=myfastafile.0.01.rep.phylip.dist, list=myfastafile.0.01.rep.phylip.an.list, fasta=myfastafile.0.01.rep.fasta, count=myfastafile.0.01.rep.count_table)
### Repeating the procedure this way generates a set of fasta files with representative sequences at different cutoff values, and the memory can handle the files. I would like to know whether this repetition is valid or not. Can anyone answer, please?
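
P.S. Would column-format distances also shrink the files? A sketch of what I mean (untested on my data; exact output file names may differ between mothur versions), since column output only stores pairs within the cutoff rather than a full lower-triangle matrix:

dist.seqs(fasta=myfastafile.fa, cutoff=0.1, output=column)
cluster(column=myfastafile.dist, count=myfastafile.count_table, cutoff=0.1, precision=1000)
get.oturep(column=myfastafile.dist, list=myfastafile.an.list, fasta=myfastafile.fa, count=myfastafile.count_table)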

Why don’t you use cluster.split?
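
For example, splitting by taxonomy keeps each sub-matrix small enough to cluster in memory. A sketch, where the taxonomy file name is a placeholder for whatever your classification step produced:

cluster.split(fasta=myfastafile.fa, count=myfastafile.count_table, taxonomy=myfastafile.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.1)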

I tried that, but the run reported a lower cutoff value.


mothur > cluster.split(phylip=nxrA.cleaned.unique.precluster.phylip.dist, count=nxrA.cleaned.unique.precluster.count_table, cutoff=0.1, precision=1000)

Cutoff was 0.1005 changed cutoff to 0.032
It took 67 seconds to cluster
Merging the clustered files…
It took 2 seconds to merge.

Output File Names:
nxrA.cleaned.unique.precluster.phylip.column.an.unique_list.list



Does this mean that there is no difference in the OTU representatives between the 0.1005 and 0.032 cutoff values?

It looks like you are fine here. Please see:

“Why does the cutoff change when I cluster with average neighbor?”
http://www.mothur.org/wiki/Frequently_asked_questions#Why_does_the_cutoff_change_when_I_cluster_with_average_neighbor.3F
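
The short version: when dist.seqs only saves distances up to your cutoff, the averages computed during average neighbor clustering can drift past that cutoff, so mothur reports the largest distance at which it can still cluster reliably. If you need OTUs at the full 0.10 level, one option is to calculate distances with a looser cutoff than the one you cluster at, for example (a sketch; the fasta file name here is inferred from your dist file name):

dist.seqs(fasta=nxrA.cleaned.unique.precluster.fasta, cutoff=0.2, output=lt)
cluster.split(phylip=nxrA.cleaned.unique.precluster.phylip.dist, count=nxrA.cleaned.unique.precluster.count_table, cutoff=0.1, precision=1000)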

Pat