I’m hoping some R wizard can help me. I’ve constructed distance matrices in mothur (yay for repeated subsampling capability) and now need to use them in R for further analyses. I can get the lower triangle data in, but not the labels, using the following command, which I modified from the thread "Unifrac Distance Matrix -> Newick using R":
b_bc <- data.matrix(read.table("", fill=T, row.names=1, skip=1, col.names=1:<#samples>))
I don't get any errors, I just don't have the sample labels associated with the data. I've played with the row.names/col.names arguments. Thinking that the problem was related to the lower triangle data format, I reran the dist calculation in mothur to get a square matrix, but I still can't get the labels associated with the matrix. Do I need to bring the sample names in as a vector and then associate that vector with the matrix?
ETA: Oh, actually something’s not quite right when I load it into R: I get an N x (N-1) matrix. Anyone?
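The N x (N-1) shape can be reproduced on a toy file. This is only a sketch: the three-sample file, its tab-separated layout, and the temporary path are made up for illustration. It suggests that col.names = 1:N counts the label column as well, so only N-1 data columns remain, and that the labels should end up as row names.

```r
# Hypothetical three-sample lower-triangle file, written to a temp path for illustration
tmp <- tempfile(fileext = ".dist")
writeLines(c("3", "A", "B\t0.1", "C\t0.2\t0.3"), tmp)

# The first line of the file holds the number of samples
n <- as.numeric(readLines(tmp, n = 1))

# col.names = 1:n declares n columns in total; row.names = 1 uses one of them for labels
short <- read.table(tmp, fill = TRUE, row.names = 1, skip = 1, col.names = 1:n)
print(dim(short))       # n rows, n - 1 data columns
print(rownames(short))  # the sample labels are stored as row names

# Declaring n + 1 columns leaves n data columns (the last one is all NA)
square <- read.table(tmp, fill = TRUE, row.names = 1, skip = 1, col.names = 1:(n + 1))
print(dim(square))
```

If this matches what you see, the labels are not lost; they are just attached as row names rather than column names.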
I’ve been editing my mothur-generated dist matrices in Excel for the past couple of years but would really love to figure out how to get them into R straight from mothur. What I do in Excel is copy the sample names, paste them transposed as column headers, and add “NA” to the bottom-right cell. After these alterations I can import the dist file using:
I use the following function in R to import mothur distance files:
parseDistanceMatrix = function(phylip_file) {
  # Read the first line of the phylip file to find out how many sequences/samples it contains
  temp_connection = file(phylip_file, 'r')
  len = readLines(temp_connection, n=1)
  len = as.numeric(len)
  close(temp_connection)
  phylip_data = read.table(phylip_file, fill=TRUE, row.names=1, skip=1, col.names=1:len)
  phylip_matrix = as.dist(phylip_data)
  return(phylip_matrix)
}
Funnily enough, it’s actually based on your original R command in this thread.
Using this does give you a warning when it does the as.dist() cast, but I think this is just due to the trailing tab on the last line of the phylip file. I’ve tested it extensively and it doesn’t change the data in any way.
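Another possible explanation for the warning is that the table handed to as.dist() has N rows but only N-1 data columns, which as.dist() reports as a non-square matrix. A minimal sketch of a variant that pads the table to N x N first is below. The helper name read_lt_dist and the toy file are hypothetical, and it assumes a whitespace-separated lower-triangle file like the one mothur writes.

```r
# Hypothetical helper: read a lower-triangle phylip file into a labelled dist object
read_lt_dist <- function(path) {
  # The first line of the file holds the number of samples
  n <- as.numeric(readLines(path, n = 1))
  # n + 1 columns: one for the labels plus n data columns, so the table is square
  tab <- read.table(path, fill = TRUE, row.names = 1, skip = 1, col.names = 1:(n + 1))
  m <- as.matrix(tab)
  colnames(m) <- rownames(m)
  # as.dist() only reads the lower triangle, so the NA padding above it is ignored
  as.dist(m)
}

# Toy three-sample file, made up for illustration
toy_file <- tempfile(fileext = ".dist")
writeLines(c("3", "A", "B\t0.1", "C\t0.2\t0.3"), toy_file)

d <- read_lt_dist(toy_file)
print(d)
```

The values are the same ones the original function extracts; the only intended difference is that the matrix is square by the time as.dist() sees it.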
I tweaked your code a bit because I like working with data frames rather than dist objects (I can remove samples with logical vectors when they’re data frames).
parseDistanceDF = function(phylip_file) {
  # Read the first line of the phylip file to find out how many sequences/samples it contains
  temp_connection = file(phylip_file, 'r')
  len = readLines(temp_connection, n=1)
  len = as.numeric(len)
  len = len + 1
  close(temp_connection)
  phylip_data = read.table(phylip_file, fill=TRUE, row.names=1, skip=1, col.names=1:len)
  colnames(phylip_data) <- row.names(phylip_data)
  return(phylip_data)
}
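As a rough illustration of the logical-vector filtering mentioned above, here is a sketch on a hand-built data frame shaped like the output of parseDistanceDF() (lower triangle filled, everything else NA). The sample names and distances are made up, and converting back with as.dist() is my own addition, not part of the function above.

```r
# Hypothetical 3 x 3 data frame shaped like the output of parseDistanceDF()
samples <- c("A", "B", "C")
dist_df <- data.frame(
  A = c(NA, 0.1, 0.2),
  B = c(NA, NA, 0.3),
  C = rep(NA_real_, 3),
  row.names = samples
)

# Logical vector marking the samples to keep (drops sample "B")
keep <- rownames(dist_df) != "B"

# Apply the same vector to rows and columns so the table stays square
sub_df <- dist_df[keep, keep, drop = FALSE]

# Convert back to a dist object if a downstream function expects one
sub_dist <- as.dist(as.matrix(sub_df))
print(sub_dist)
```

Indexing rows and columns with the same vector keeps the labels aligned, which matters if the result is handed back to as.dist().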