Hey guys,
So I am at the distance calculation and clustering stage of my first Illumina run. I started with 12 million sequences at 90-100 bp, which dropped to 5 million after quality filtering etc., leaving 700,000 unique sequences at 96 bp at the end of all the processing.
I ran dist.seqs (cutoff=0.15) on our computing server and ended up with a 350 GB file. It took 32 processors 11 hours. Does this seem plausible?
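For a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It is not anything mothur does internally: the 40 bytes per line of column-format output is an assumption (the real figure depends on how long the sequence names are), and the helper names are made up for illustration.

```python
def pair_count(n_seqs: int) -> int:
    """Number of unordered sequence pairs, n * (n - 1) / 2."""
    return n_seqs * (n_seqs - 1) // 2


def retained_fraction(file_bytes: float, n_seqs: int, bytes_per_line: float) -> float:
    """Estimated share of all pairs that were written to the distance file."""
    lines_written = file_bytes / bytes_per_line
    return lines_written / pair_count(n_seqs)


N_UNIQUE = 700_000      # unique sequences, from the post
FILE_BYTES = 350e9      # 350 GB distance file, from the post
BYTES_PER_LINE = 40     # ASSUMPTION: "name1 name2 0.0123\n" with ~15-character names

pairs = pair_count(N_UNIQUE)
frac = retained_fraction(FILE_BYTES, N_UNIQUE, BYTES_PER_LINE)
print(f"{pairs:,} possible pairs; roughly {frac:.1%} of them under the cutoff")
```

Under that assumption only a few percent of all possible pairs end up in the file, so the size is driven mostly by the number of pairs under the cutoff times the length of the names on each line.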
I’m now running the cluster.split command with large=T, again with 32 processors. Let’s see how it goes. I hope I’m not jumping the gun here (i.e. that there isn’t something funny with the distance matrix).
Pat, have you thought of renaming the sequences as integers, like ESPRIT-Tree does, to save hard drive space?
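To make the idea concrete, here is a minimal sketch of that kind of renaming, assuming a column-format distance file with three whitespace-separated fields per line (name1, name2, distance). The function name and the example read names are hypothetical, and this is not how ESPRIT-Tree or mothur actually implement it.

```python
import io


def rename_to_integers(src, dst):
    """Stream a column-format distance file, replacing names with integer IDs.

    Returns the name -> integer mapping so the original names can be restored.
    """
    ids = {}
    for line in src:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name_a, name_b, dist = fields
        id_a = ids.setdefault(name_a, len(ids))  # new names get the next free integer
        id_b = ids.setdefault(name_b, len(ids))
        dst.write(f"{id_a}\t{id_b}\t{dist}\n")
    return ids


# Hypothetical Illumina-style names, for illustration only.
original = io.StringIO(
    "HWI-ST123_1101_15000_1342 HWI-ST123_1101_15010_2201 0.0104\n"
    "HWI-ST123_1101_15000_1342 HWI-ST123_1101_15020_3310 0.0500\n"
)
renamed = io.StringIO()
mapping = rename_to_integers(original, renamed)
print(renamed.getvalue())
```

The mapping would have to be written out alongside the renamed file so the names can be put back after clustering.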
All the best, Shaun
Plausible… Biologically? No. Technically? Yes. Bioinformatically? No. Can you tell me a bit more about your sequences? I gather these are from a HiSeq run. Do the read pairs overlap?
At the end of the week we hope to release our latest version of mothur and an SOP for use with paired end data that overlap. This was designed for use with MiSeq data. However, it significantly reduces sequencing error rates and makes a lot of what you’re seeing go away. The problem is that the sequencer is generating biodiversity towards the end of the reads and you need to be pretty aggressive in trimming the sequences or make contigs.
As for the sequence names, the new version will help get around this by replacing the name/group files with a count_table file that holds the count of each unique sequence across the various groups.
Pat
Hi Pat
Thanks much for all your help to the community. When are you releasing the new version? I am also having problems processing MiSeq data. I am still struggling to get “cluster” output: my .names file is not recognized by the cluster command. I checked whether any FASTA headers were cropped during aligning or while calculating distances (column format), but they are all intact. I also checked manually and found that all the FASTA headers are present in the .names file.
Any comments?
Srini
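The manual check described above can be scripted. This is only a sketch under two assumptions: the .names file has the usual two-column layout (representative name, then a comma-separated list of the names it stands for), and the distance file is in column format (name1, name2, distance). The function name and the tiny example data are made up.

```python
def missing_names(names_lines, dist_lines):
    """Return names used in the distance file that are absent from the .names file."""
    # First column of the .names file holds the representative (unique) sequence names.
    known = {line.split()[0] for line in names_lines if line.strip()}
    missing = set()
    for line in dist_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or malformed line; not a name mismatch
        for name in fields[:2]:
            if name not in known:
                missing.add(name)
    return missing


# Hypothetical data for illustration.
names_lines = ["seqA\tseqA,seqB", "seqC\tseqC"]
dist_lines = ["seqA seqC 0.02", "seqA seqD 0.11"]
absent = missing_names(names_lines, dist_lines)
print(sorted(absent))
```

With real files, pass open file handles instead of lists so the distance file is read one line at a time.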
Hi Srini,
I was having the same problem and I think I figured it out. I’ve been running all of my stuff on iPlant, which is a great resource for us plant biologists, but it has some drawbacks. On iPlant you have to store all of your data on volumes, and the largest volume I can get is 100 GB. The size of my distance file exceeded this limit every time I ran dist.seqs, yet the process would still finish. When I went to cluster the sequences I would get that annoying warning about certain names not being recognized. I think it’s just because the .dist file outgrew the space left on the disk and things got screwy. Hopefully the new release of mothur will address the issue of the .dist file sizes being out of hand.
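If you want to test that theory on your own files, here is a rough sketch that checks how much disk space is left and whether the last line of a column-format distance file looks complete. The function names are made up, and the check assumes three whitespace-separated fields per line with a numeric distance. It will not catch a file that happened to be cut off exactly at a line boundary.

```python
import os
import shutil
import tempfile


def free_gb(path="."):
    """Free space, in GB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def last_line_is_complete(path, tail_bytes=4096):
    """True if the file ends with a newline and its last line is 'name name distance'."""
    size = os.path.getsize(path)
    if size == 0:
        return False
    with open(path, "rb") as fh:
        fh.seek(max(0, size - tail_bytes))  # only read the end of a huge file
        tail = fh.read()
    if not tail.endswith(b"\n"):
        return False
    fields = tail.splitlines()[-1].split()
    if len(fields) != 3:
        return False
    try:
        float(fields[2])
    except ValueError:
        return False
    return True


# Demo on two small temporary files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.dist")
    cut = os.path.join(tmp, "cut.dist")
    with open(good, "w") as fh:
        fh.write("seqA seqB 0.01\nseqA seqC 0.05\n")
    with open(cut, "w") as fh:
        fh.write("seqA seqB 0.01\nseqA se")  # write stopped mid-line
    good_ok = last_line_is_complete(good)
    cut_ok = last_line_is_complete(cut)
    print(f"{free_gb(tmp):.1f} GB free; good={good_ok}, cut={cut_ok}")
```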
Cheers,
Joe