Convert between distance matrix formats

Hi,

I spent some time looking around today, but couldn’t find any utility in mothur (other than regenerating it) to convert between distance matrix formats. Does such a utility exist?

It looks like the column format is a representation of the lower triangle, but it seems like it’s maybe out of order wrt columns (or in some order that I’m not seeing right now). Thoughts on doing this? I’ve attached a sample from my own data.

Thanks,
Chris

FSQ564R01AEWK0 FSQ564R01EEDHP 0.1719
FSQ564R01BYQSB FSQ564R01EEDHP 0.1719
FSQ564R01BYQSB FSQ564R01AEWK0 0.03125
FSQ564R01EKQY4 FSQ564R01EEDHP 0.1875
FSQ564R01EKQY4 FSQ564R01AEWK0 0.2154
FSQ564R01EKQY4 FSQ564R01BYQSB 0.2462
FSQ564R01EOL34 FSQ564R01EEDHP 0.1719
FSQ564R01EOL34 FSQ564R01AEWK0 0.0625
FSQ564R01EOL34 FSQ564R01BYQSB 0.0625
FSQ564R01EOL34 FSQ564R01EKQY4 0.2769

Eh, we’ve thought about it, but it wouldn’t be possible to go from column to phylip since there are many distances missing. It would be more practical to go from phylip to column, but this probably wouldn’t be too necessary.
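
For the phylip-to-column direction, a minimal Python sketch might look like the one below. It assumes a lower-triangular phylip file (first line is the sequence count, then one whitespace-separated row per sequence holding the name followed by its distances to all earlier sequences) and no whitespace inside names. `phylip_to_column` is a hypothetical helper name, not a mothur command.

```python
def phylip_to_column(phylip_lines):
    """Yield 'nameA nameB distance' strings from a lower-triangular phylip matrix."""
    rows = iter(phylip_lines)
    expected = int(next(rows).split()[0])  # first line holds the sequence count
    names = []  # sequences seen so far, in row order
    for row in rows:
        fields = row.split()
        if not fields:
            continue  # skip blank lines
        name, distances = fields[0], fields[1:]
        if len(distances) != len(names):
            raise ValueError("row for %s is not lower-triangular" % name)
        # Pair this sequence with every earlier one, in row order
        for earlier, dist in zip(names, distances):
            yield "%s %s %s" % (name, earlier, dist)
        names.append(name)
    if len(names) != expected:
        raise ValueError("expected %d sequences, found %d" % (expected, len(names)))


# Tiny inline example; a real file object can be passed in the same way
for column_line in phylip_to_column(["3", "A", "B\t0.1", "C\t0.2\t0.3"]):
    print(column_line)
```

Passing an open file handle instead of the inline list streams the matrix row by row, so only the list of names is held in memory.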

This is probably too late now for OP, but I wrote a simple Python script that will convert a complete column-format distance file to phylip. Just save the code below as a *.py file and run it. Note that it’s written in Python 3, so you’ll need that. It is also very memory hungry since the entire distance file has to be loaded into memory (e.g., ~165 GB of memory for an 11 GB dist file). On a computer with 8 GB of memory, the largest dist file you can use is probably less than 1 GB.

The script includes simple instructions on how you can check if your dist file is complete.

#!/usr/local/bin/python3
# -*- coding: utf-8 -*-

# Convert column-style distance files to lower triangular Phylip distance matrix
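# Need to first confirm that the distance file is complete:
#  NumSeq = grep -c ">" ORIGINAL.FASTA
# NumDist = wc -l COLUMN.DIST
# NumDist should be NumSeq * (NumSeq - 1) / 2 if self-distances are omitted
# (as in the mothur sample above), or NumSeq * (NumSeq + 1) / 2 if they are included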

# Requires python3.0.1 or above
# Input and output file names are specified in the command line
# Version 2: Switch to dictionary instead of list. Drastic improvement

import sys

input_file_name = sys.argv[1]
output_file_name = sys.argv[2]

seqSet = set() # Store names of all sequences
bigList = {} # Store distances as a dictionary keyed by "nameA nameB"

print("Reading from "+input_file_name+" and writing to "+output_file_name)

with open(input_file_name, "rt") as input_file:
 for line in input_file: # Iterate line by line rather than calling readlines()
  seqA, seqB, dist = line.split()
  seqSet.add(seqA)
  seqSet.add(seqB) # Some sequences may appear only in the second column
  bigList[seqA+" "+seqB] = dist # Key always uses a single space, whatever the input separator

seqNames = sorted(seqSet) # Get unique, sorted list of sequences
print("Finished reading input file")
output_file = open(output_file_name, "wt")

seqNum = len(seqNames) # Count names directly rather than inferring from the number of lines
print("The input file contains "+str(seqNum)+" unique sequences")
output_file.write(str(seqNum)+"\n")

for i in range(0, seqNum):
 output_file.write("ID_"+str(i).rjust(6,'0'))
# output_file.write(seqNames[i]) # Outputs original sequence names instead
 if i % 100 == 0:
  print(str(i)+" sequences processed")
 for j in range(0, i, 1):
  query1 = seqNames[i]+" "+seqNames[j]
  query2 = seqNames[j]+" "+seqNames[i]
  if query1 in bigList:
   output_file.write("\t"+bigList[query1])
  elif query2 in bigList:
   output_file.write("\t"+bigList[query2])
  else:
   print("Missing distance: "+query1)
   break
 output_file.write("\n")

print("Phylip distance matrix generated. New sequence names were generated as below:")
for i in range(0, seqNum):
 print("ID_"+str(i).rjust(6,'0')+"\t"+seqNames[i])

output_file.close()
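
The completeness check described in the script's header comments could also be done in Python. The sketch below is only an illustration: `count_missing` is a made-up name, it assumes each pair is listed at most once, and it takes the sequence names from the distance file itself rather than from the FASTA file, so a sequence that is absent from every line would go unnoticed.

```python
def count_missing(column_lines):
    """Return (number of sequences, number of missing pairwise distances)."""
    names = set()
    n_pairs = 0
    for line in column_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        seq_a, seq_b = fields[0], fields[1]
        names.add(seq_a)
        names.add(seq_b)
        if seq_a != seq_b:
            n_pairs += 1  # self-distances, if present, are not counted
    n = len(names)
    expected = n * (n - 1) // 2  # unordered pairs of distinct sequences
    return n, expected - n_pairs


sample = [
    "B A 0.1719",
    "C A 0.1719",
    "C B 0.03125",
]
print(count_missing(sample))
```

A result with zero missing distances means the file has one line for every pair of sequences it mentions. Only the set of names is kept in memory, so this should be usable on files that are too big for the converter itself.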

Out of interest, when you say you couldn’t go from column to phylip, has something changed in the way mothur writes column distance files? I put together this script that can do the conversion, and I can change a column file created in 1.31.2 to a phylip with no issues.

It’s all a bit moot, since there doesn’t seem to be any reason to use the column files in mothur, but I put this together to convert between the two with a much lower memory footprint. It’s a bit slow, because it has to pass through the column file twice (since I’m not smart enough to work out the number of lines in advance), but it only needs a few MB of RAM to process a 1 GB column file.

import sys, os

columnFile = sys.argv[1]
phylipFile = os.path.splitext(columnFile)[0] + ".phylip.dist"

#First pass to get the order of unique reads, and the total size
readOrder = []
observedReads = set()
for line in open(columnFile, 'r'):
 firstRead, secondRead, distance = line.split() #Split on any whitespace, so tab-separated files also work
 if not secondRead in observedReads: #The second column is the only column to contain the first read
  observedReads.add(secondRead)
  readOrder.append(secondRead)

observedReads.add(firstRead) #The last read in the first column never appears in the second column, so stick it in here

totalSize = len(observedReads)

#Open a writing stream. Because the first read in the readOrder list will not have any
#comparisons on its phylip line, we'll just print it here before the loop
phylipFileWriter = open(phylipFile, 'w')
phylipFileWriter.write(str(totalSize) + '\n')

#When mothur writes phylip distance files, each line ends with a tab. I don't know
#if this is important, but this does the same behaviour
phylipFileWriter.write(readOrder.pop(0) + '\t')

for line in open(columnFile, 'r'):
 firstRead, secondRead, distance = line.split() #Split on any whitespace (spaces or tabs)

 if (firstRead in observedReads):
  phylipFileWriter.write('\n' + firstRead + '\t')
  observedReads.remove(firstRead)

 phylipFileWriter.write(distance.strip() + '\t')

phylipFileWriter.write('\n') #mothur phylip files end with a newline
phylipFileWriter.close()
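
On working out the size in advance: for a complete column file with no self-distances, the number of lines D and the number of reads N are related by D = N*(N-1)/2, so N can be recovered from a line count. A small sketch of that arithmetic follows. `reads_from_line_count` is a hypothetical helper, and `math.isqrt` needs Python 3.8 or later.

```python
import math


def reads_from_line_count(num_lines):
    """Solve D = N*(N-1)/2 for N; return None if D does not fit a full triangle."""
    discriminant = 1 + 8 * num_lines
    root = math.isqrt(discriminant)
    if root * root != discriminant:
        return None  # line count cannot come from a complete lower triangle
    return (1 + root) // 2


print(reads_from_line_count(10))
```

Counting the lines still means reading the file once (for example with `wc -l`), so this does not remove the extra pass. It does give a cheap sanity check: a `None` result means the file cannot be a complete lower triangle.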

If you used the cutoff function when generating the distance file with mothur, the column file would be incomplete. The mothur SOPs all employ this to speed up the pipeline, hence Pat’s original comment.

The script you wrote will work with a distance file generated by mothur running in single-threaded mode (I think), since it assumes the entries to be in order. I had to write the script the way I did because our dist files were generated outside of mothur using a multi-threaded approach, so the entries are not in order.
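
To see whether a given column file is in the order a streaming conversion relies on, something like the sketch below could be run first. It is an illustration under assumptions, not a description of what mothur guarantees: `is_row_ordered` is a made-up name, and "in order" here means that each new read in the first column is followed by its distances to all earlier reads, listed in the order those reads first appeared.

```python
def is_row_ordered(column_lines):
    """True if the lines follow lower-triangle row order with no gaps."""
    order = []      # reads in order of first appearance
    position = 0    # index into `order` expected for the next second-column read
    current = None  # read whose row is currently being filled
    for line in column_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        first, second = fields[0], fields[1]
        if first != current:
            if current is None:
                order.append(second)  # very first line: second read is row 0
            elif position != len(order):
                return False  # previous row ended before all earlier reads were seen
            else:
                order.append(current)
            current, position = first, 0
        if position >= len(order) or order[position] != second:
            return False  # second-column read is not the one expected next
        position += 1
    return current is None or position == len(order)


ordered = ["B A 0.1", "C A 0.2", "C B 0.3"]
shuffled = ["B A 0.1", "C B 0.3", "C A 0.2"]
print(is_row_ordered(ordered), is_row_ordered(shuffled))
```

Because a row with missing distances also fails the check, a file produced with a cutoff is reported as not ordered too.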

Oh, fair enough. Irritatingly, if I’d run my test with the cutoff I would have spotted that myself…

Out of curiosity, could you point me to that software? I’m having a hard time visualising a phylip matrix where the distance rows appear out of order (since I’ve always understood the column order to just be the row order but along the top).

I’m not sure I follow you. Did you mean the sequence alignment software used to generate our dist file? It’s ESPRIT: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2691849/. Since it doesn’t support OpenMPI (AFAIK), we just use a shell script to parallelize the job on a Mac Pro with 24 cores.

Sorry, I got myself confused (that’s twice in this thread now…). I thought you meant the phylip file had the rows/columns in a jumbled order, which would make no sense. I realise now that because we’re talking about converting a column to phylip, you’re talking about the order in your column file.

For what it’s worth, mothur keeps the order in a multi-threaded column distance, so my script still works there, but obviously won’t work with other software or a cutoff.