Hi again, after the very useful help with my first question I now have another one. As I'm new to this kind of analysis and to mothur, hopefully somebody with more experience and knowledge can help me out that well again…
The overall situation is that I have several fasta files of quite different sizes (the biggest with 1,897,954 sequences) and have to analyze them (OTUs, Shannon, Chao, …).
I had no problem with the smaller files, but with the bigger ones my distance matrix becomes far too big (>100 GB), and I read in http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/ that this isn't good and that something is wrong -> a high error rate.
My task is to get as many sequences classified as possible. I will post my commands here and hopefully somebody can help me optimize them and finally solve this problem.
I use the current mothur version and the current SILVA files from the mothur webpage.
Commands:
make.group(fasta=test1.fasta, groups=abc)
screen.seqs(fasta=test1.fasta, group=abc.groups, maxambig=0, optimize=start-end, criteria=98, processors=x)
unique.seqs(fasta=test1.good.fasta)
count.seqs(name=test1.good.names, group=abc.good.groups)
align.seqs(fasta=test1.good.unique.fasta, reference=silva.bacteria.fasta, flip=T)
screen.seqs(fasta=test1.good.unique.align, count=test1.good.count_table, optimize=start-end, criteria=98)
(If criteria=98 is not possible and my start ends up greater than my end, I set the range manually with the help of summary.seqs; see the example below.)
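For example, setting the range by hand looks roughly like this for me (the start and end positions below are only placeholders; I take the real ones from the summary.seqs output):
summary.seqs(fasta=test1.good.unique.align, count=test1.good.count_table)
screen.seqs(fasta=test1.good.unique.align, count=test1.good.count_table, start=1234, end=25000)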
filter.seqs(fasta=test1.good.unique.good.align, vertical=T, trump=.)
unique.seqs(fasta=test1.good.unique.good.filter.fasta, count=test1.good.good.count_table)
pre.cluster(fasta=test1.good.unique.good.filter.unique.fasta, count=test1.good.unique.good.filter.count_table, diffs=3)
chimera.uchime(fasta=test1.good.unique.good.filter.unique.precluster.fasta, count=test1.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
remove.seqs(fasta=test1.good.unique.good.filter.unique.precluster.fasta, accnos=test1.good.unique.good.filter.unique.precluster.uchime.accnos)
classify.seqs(fasta=test1.good.unique.good.filter.unique.precluster.pick.fasta, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=trainset14_032015.rdp.fasta, taxonomy=trainset14_032015.rdp.tax, cutoff=80)
dist.seqs(fasta=test1.good.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.11)
cluster(column=test1.good.unique.good.filter.unique.precluster.pick.dist, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, cutoff=0.20)
make.shared(list=test1.good.unique.good.filter.unique.precluster.pick.an.unique_list.list, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, label=0.03)
classify.otu(list=test1.good.unique.good.filter.unique.precluster.pick.an.unique_list.list, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=test1.good.unique.good.filter.unique.precluster.pick.rdp.wang.taxonomy, label=0.03)
[b]Calculators[/b]
mothur > dist.seqs(fasta=….precluster.pick.fasta, output=phylip, cutoff=0.11)
cluster(phylip=current, cutoff=0.20)
summary.single(list=current, label=unique-0.03-0.05-0.10)
It would be really helpful if you could check the commands and tell me if they are totally wrong and whether I should use command x with setting yz instead. I assume that this many sequences can't be processed this way and that I have to tighten my criteria in screen.seqs and carry fewer sequences into the later steps? Which criteria do you prefer, and how many sequences can the later steps handle? The other alternative would be a phylotype-based approach (a sketch of what I mean follows)?
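For the phylotype-based alternative I imagine something roughly like this, following the SOP; the label=1 (genus level) is only my assumption:
phylotype(taxonomy=test1.good.unique.good.filter.unique.precluster.pick.rdp.wang.taxonomy)
make.shared(list=current, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, label=1)
classify.otu(list=current, count=current, taxonomy=current, label=1)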
In short:
- Can you check the commands and tell me whether the procedure is correct?
- Can you give me a hint how to optimize or customize the procedure to prevent a >100 GB distance matrix? Would something like cluster.split be the right direction here (see the sketch below)?
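Just so you can see what I mean with cluster.split, this is a rough sketch based on the wiki that I have not tried yet; splitmethod, taxlevel and cutoff are only my guesses:
cluster.split(fasta=test1.good.unique.good.filter.unique.precluster.pick.fasta, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=test1.good.unique.good.filter.unique.precluster.pick.rdp.wang.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03)
make.shared(list=current, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, label=0.03)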
If you need more information, please ask. Otherwise big thanks for the help again!