Mothur should be able to read/write gzipped files

Mothur should be able to read and write gzipped files, both to save disk space and to lower the amount of data that needs to be read from disk.

With MiSeq data, mothur's performance would be improved considerably by the ability to read and write compressed files.
For instance, I have a 53 GB file of sequences aligned against the SILVA template, which takes only 800 MB when gzipped. Mothur spends a lot of its time reading and writing such big files.
A nice side effect would be the disk space savings, of course :wink:
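For what it's worth, zlib's gzFile interface makes this kind of transparent reading/writing fairly painless, since the same calls also handle plain uncompressed files. Here is a minimal sketch under that assumption; the file names, buffer size, and compression level are placeholders, not anything from mothur's code base:

```cpp
// Minimal sketch: stream a (possibly gzipped) file line by line with zlib.
// gzopen/gzgets also read plain-text files, so one loop covers both cases.
// File names and buffer size are placeholders, not mothur internals.
#include <zlib.h>
#include <cstdio>

int main() {
    gzFile in = gzopen("seqs.align.gz", "rb");          // also opens uncompressed files
    if (in == NULL) { std::fprintf(stderr, "could not open input\n"); return 1; }

    gzFile out = gzopen("seqs.filter.fasta.gz", "wb6");  // write with compression level 6
    if (out == NULL) { gzclose(in); return 1; }

    char buf[65536];
    while (gzgets(in, buf, sizeof(buf)) != NULL) {
        // ... process the line (e.g. filter alignment columns) ...
        gzputs(out, buf);
    }

    gzclose(in);
    gzclose(out);
    return 0;
}
```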

Flo - are you following the SOP, including using a region-specific silva.bacteria.fasta file?

Also, I’m sure that compressing 53 GB to 800 MB took a very long time, right? Anything we would do would incur that time penalty as well.

We’re looking into it…
Pat

I think this would be a good addition as well. I do most of my analyses on EC2, where it’s easy to get ample compute resources, but adding high I/O on top of that means additional hassle as well as more money. The instance I use 99% of the time has 16 processors/32 threads, but struggles to read or write to disk at 50 MB/s. Alignment speed is actually limited by I/O in this case.

If you’re worried about speed, you could use something like LZ4 (which compresses at roughly 400 MB/s per core), with the tradeoff that it is more obscure than gzip/zlib.
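Just to illustrate how simple the LZ4 block API (lz4.h) is to use, here's a rough one-shot compress/decompress sketch; the sequence data is made up and error handling is kept minimal:

```cpp
// Rough sketch of LZ4's one-shot block API; the input data is just a placeholder.
#include <lz4.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const char* src = ">seq1\nACGTACGTACGTACGTACGTACGTACGT\n";
    const int srcSize = static_cast<int>(std::strlen(src));

    // LZ4_compressBound gives the worst-case compressed size for srcSize bytes.
    std::vector<char> compressed(LZ4_compressBound(srcSize));
    const int cSize = LZ4_compress_default(src, compressed.data(), srcSize,
                                           static_cast<int>(compressed.size()));
    if (cSize <= 0) { std::fprintf(stderr, "compression failed\n"); return 1; }

    // Round-trip to verify: decompress back into a buffer of the original size.
    std::vector<char> restored(srcSize);
    const int dSize = LZ4_decompress_safe(compressed.data(), restored.data(),
                                          cSize, srcSize);
    if (dSize != srcSize) { std::fprintf(stderr, "decompression failed\n"); return 1; }

    std::printf("%d bytes -> %d bytes compressed\n", srcSize, cSize);
    return 0;
}
```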