I’m having a problem with the size of my distance matrix, so I’m wondering if anyone has advice. It’s an alignment of 49,000 fungal samples (Eukarya), aligned against the SILVA eukarya database. I filtered the alignment and generated a column-formatted distance matrix with a cutoff of 0.10. Unfortunately, the resulting distance matrix was over 14 GB! I tried running hcluster on it (since mothur errored out trying to read the matrix in for a normal cluster() analysis), but it’s been running for over four days, and the last time it updated its output files was over 12 hours ago, so I’m assuming mothur has frozen.
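For reference, the workflow was roughly along these lines (the file names are placeholders, and the clustering method shown is just an example):

```
mothur > filter.seqs(fasta=fungi.align, vertical=T, trump=.)
mothur > dist.seqs(fasta=fungi.filter.fasta, cutoff=0.10, output=column)
mothur > hcluster(column=fungi.filter.dist, name=fungi.names, method=average)
```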
I’ve done a bacterial alignment of 48,000 sequences using the greengenes core set database, and that distance matrix was only a fraction of the size (roughly 200 MB). I looked at the alignments in BioEdit, and the fungal alignment is only a little longer (100 bases or so) than the bacterial one, so I don’t understand why there’s such a large discrepancy in the distance matrix sizes. There are some gaps in the fungal alignment, but they’re not that much larger than the ones in the bacterial alignment. I’d considered lowering the cutoff to reduce the size, but I may need that data.
Anyway, if anyone has advice or can explain the discrepancy so I can understand it better, I’d appreciate it. Thanks!
I’m using a Windows PC with 8 processors (though I can’t use all of them because it’s Windows) and 48 GB of memory.
Hmm… I suspect the difference is that 18S sequences are far more similar to one another than 16S sequences are, so a tighter cutoff may be appropriate. Have you trimmed and unique’d the sequences before calculating the distances, etc.?
I ran unique.seqs, trimmed, and ran screen.seqs to remove what I could. I’m running a cutoff of 0.05 right now to see how much of a size difference it makes. I was just surprised that the difference in size was so drastic (almost 20 times as large).
I’m sure you did, but after you ran filter.seqs, did you run unique.seqs again?
I hadn’t thought to run unique.seqs again. Unfortunately, I have 15 different sites in one large batch, and I didn’t want to remove sequences that were unique to a particular site but represented in another site. However, just to see, I ran unique.seqs again, and it only pulled out 1,000 sequences; all the ones I saw were duplicated across sites rather than within sites (granted, I didn’t look at all of them).
As an update, it looks like mothur finished processing my 14 GB distance matrix. It took nearly five days, but it finished without erroring out, so I assume it’s good. I also generated a distance matrix at the 0.05 cutoff and got a 7 GB matrix, which I’m hclustering now to compare the results, so we’ll see how that goes. Thanks for the advice.
You aren’t really getting rid of sequences when you run unique.seqs; it’s just a way to simplify the processing. Either way, those sequences will wind up in the same OTU in the end.
Your discussion just reminded me of a question that’s been bouncing around in the back of my mind:
If you run unique.seqs once, modify your data in some way, and then run unique.seqs again, how should you deal with the multiple .names files? Do you append them to one another? Do you enter multiple name files in the name= portion of the command?
I believe you run unique.seqs again, passing the old names file in the name option. The old names file then gets updated rather than you having to juggle several of them.
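To make that concrete, a sketch of the sequence of commands (file names hypothetical, following mothur’s default output naming):

```
mothur > unique.seqs(fasta=fungi.align)
mothur > filter.seqs(fasta=fungi.unique.align, vertical=T, trump=.)
mothur > unique.seqs(fasta=fungi.unique.filter.fasta, name=fungi.names)
```

The second unique.seqs run reads the existing names file and writes an updated one that accounts for the newly collapsed duplicates, so you only ever track a single names file instead of appending them together.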