When I ran my sequence analysis, I could only get 60 GB of memory on the computer cluster to work with. My aim is to generate fasta files with representative sequences at different cutoff values for phylogenetic analysis. I can clean up the original fasta file without problems following the mothur SOP. However, the files are too big for dist.seqs and the subsequent steps. I realised that when I clustered with a relatively low cutoff value, the computer could still generate files for further processing. The steps I would use if memory were not an issue are as follows:
dist.seqs(fasta=myfastafile.fa, cutoff=0.1, output=lt)
cluster(phylip=myfastafile.phylip.dist, cutoff=0.1, precision=1000)
get.oturep(phylip=myfastafile.phylip.dist, list=myfastafile.phylip.an.list, fasta=myfastafile.fa, count=myfastafile.count_table)
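To see why the full run overflows, here is a minimal Python sketch of the memory cost of a complete lower-triangle pairwise distance matrix. The 8 bytes per stored distance is my assumption for illustration only, not mothur's actual internal representation; the pair count itself is exact.

```python
def dist_matrix_gib(n_seqs, bytes_per_entry=8):
    """Rough size of a lower-triangle pairwise distance matrix in GiB.

    n_seqs sequences give n*(n-1)/2 pairwise distances; bytes_per_entry
    is an assumed cost per stored distance (mothur's real memory usage
    and on-disk phylip file size will differ).
    """
    n_pairs = n_seqs * (n_seqs - 1) // 2
    return n_pairs * bytes_per_entry / 2**30

# The pair count grows quadratically with the number of sequences,
# so trimming the input shrinks the matrix much faster than linearly.
for n in (100_000, 200_000, 400_000):
    print(f"{n} sequences -> ~{dist_matrix_gib(n):.1f} GiB")
```

Because the cost is quadratic, reducing the sequence set to about two-thirds of its original size (as the first get.oturep pass does) cuts the next distance matrix to roughly (2/3)² ≈ 44 % of the original, which is presumably why the repeated run fits in 60 GB.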
The code above will not work because it runs out of memory; however, the workflow below does work.
dist.seqs(fasta=myfastafile.fa, cutoff=0.03, output=lt)
cluster(phylip=myfastafile.phylip.dist, cutoff=0.03) ### just do not set the precision for now
get.oturep(phylip=myfastafile.phylip.dist, list=myfastafile.phylip.an.list, fasta=myfastafile.fa, count=myfastafile.count_table)
This workflow does generate one or two fasta files containing representative sequences, and the number of sequences is reduced by about a third. Let's say the generated files are:
myfastafile.0.01.rep.fasta
myfastafile.0.01.rep.count_table
### I then ran the following commands on the representative sequences:
dist.seqs(fasta=myfastafile.0.01.rep.fasta, cutoff=0.1, output=lt)
cluster(phylip=myfastafile.0.01.rep.phylip.dist, cutoff=0.1, precision=1000)
get.oturep(phylip=myfastafile.0.01.rep.phylip.dist, list=myfastafile.0.01.rep.phylip.an.list, fasta=myfastafile.0.01.rep.fasta, count=myfastafile.0.01.rep.count_table)
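The two passes differ only in their input fasta, cutoff, and precision, so the pattern can be sketched with one small Python helper. The `<prefix>.phylip.dist` / `<prefix>.phylip.an.list` naming is taken from the commands above and is an assumption about mothur's default output names; verify against your mothur version.

```python
def mothur_pass(fasta, cutoff, precision=None):
    """Build the three mothur commands for one dist/cluster/oturep pass.

    Output file names follow the <prefix>.phylip.* pattern seen in the
    workflow above (an assumption; check your mothur version's naming).
    """
    prefix = fasta.rsplit(".", 1)[0]          # strip the .fa/.fasta extension
    dist = f"{prefix}.phylip.dist"
    lst = f"{prefix}.phylip.an.list"
    cluster_opts = f"phylip={dist}, cutoff={cutoff}"
    if precision is not None:
        cluster_opts += f", precision={precision}"
    return [
        f"dist.seqs(fasta={fasta}, cutoff={cutoff}, output=lt)",
        f"cluster({cluster_opts})",
        f"get.oturep(phylip={dist}, list={lst}, fasta={fasta}, "
        f"count={prefix}.count_table)",
    ]

# First pass on the full file, second pass on the representatives:
for cmd in mothur_pass("myfastafile.fa", 0.03):
    print(cmd)
for cmd in mothur_pass("myfastafile.0.01.rep.fasta", 0.1, precision=1000):
    print(cmd)
```

This makes explicit that the second pass is the same pipeline applied to a smaller input, which is the crux of the validity question below.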
### Repeating the process this way generates a set of fasta files with representative sequences at different cutoff values, and the memory can handle the files. I would like to know whether this two-step repetition is valid. Can anyone answer, please?