pairwise.seqs vs. Esprit

Question - is the pairwise.seqs function suitable for ~500,000 sequences? And I have 24 CPUs available. I am assuming the answer is no since it doesn’t seem to start with a kmer strategy like Esprit, so it has to do all 2.5E11 pairs. But Esprit also seems to be running quite slow on this fungal dataset (reads ~200bp) and it is set up for distributed multiprocessing on a cluster, which I don’t have easy access to.

Never mind I realized that Esprit “compute cluster” mode also runs quite easily on a single multiprocessor machine. They have some pseudocode in the manual that’s easy to adapt. Here’s how I set it up in perl, using 10 CPUs (note that I renamed their executable “split” to “splitdist”, which is the name they use in the manual pseudocode).

execute(“mkdir Esprit”);
chdir(“Esprit”);

execute(“rm -f ");
execute(“ln -s …/all.unique.seq .”); # output of Mothur unique.seqs
for (my $i=1; $i<=4; $i++)
{
for (my $j=$i; $j<=4; $j++)
{
execute(“kmerdist_par all.unique.seq 4 $i $j &”);
}
}
while (1) # wait until all are done
{
my $test = ps -ef | grep kmerdist | grep -v grep;
if (not $test =~ /\S/)
{
last;
}
sleep(1);
}
execute(“rm kmer.dist”) if -f “kmer.dist”;
execute(“touch kmer.dist”);
execute("cat all.unique
.dist >> kmer.dist”);
execute(“splitdist -s 10 kmer.dist”);
for (my $i=0; $i<=9; $i++)
{
execute(“needledist all.unique.seq kmer.dist_$i needle.dist_$i &”);
}
while (1)
{
my $test = ps -ef | grep needledist | grep -v grep;
if (not $test =~ /\S/)
{
last;
}
sleep(1);
}

execute(“rm sequence.ndist”) if -f “sequence.ndist”;
execute(“touch sequence.ndist”);
execute(“cat needle.dist_* >> sequence.ndist”);

#execute(“hcluster -u -t 30000 sequence.ndist_sort all.unique_Clean.frq”);

The last step uses a script I wrote to convert their numeric indices back into sequence names. It’s very simple,

since the index is just the order of sequences in all.unique.seq, counting from 0.

execute(“rm esprit.dist”);
execute(“perl …/scripts/dist2names.pl > esprit.dist”);
chdir("…");