Dist.seqs of 700 000 illumina sequences

Shaunson26 · March 25, 2013, 1:28am

Hey guys,

So I am at the distance calculation and clustering stage of my first Illumina run. I started with 12 million sequences at 90-100 bp, which dropped to 5 million after quality filtering etc., leaving 700 000 unique sequences at 96 bp at the end of all the processing.

I ran dist.seqs (cutoff=0.15) on our computing server and ended up with a 350 GB file. It took 32 processors 11 hours. Does this seem plausible?
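As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It counts all possible pairs among 700 000 sequences and estimates how many rows a 350 GB file could hold. The 40 bytes per row is purely an assumption (two sequence names plus a distance value); real rows depend on how long the sequence names are.

```python
def total_pairs(n_seqs: int) -> int:
    """Number of distinct sequence pairs: n * (n - 1) / 2."""
    return n_seqs * (n_seqs - 1) // 2


def estimated_rows(file_bytes: int, bytes_per_row: int) -> int:
    """Rough number of rows that fit in a file of the given size."""
    return file_bytes // bytes_per_row


n_seqs = 700_000
file_bytes = 350 * 10**9   # 350 GB, read as decimal gigabytes
bytes_per_row = 40         # ASSUMPTION: average size of one "nameA nameB dist" row

pairs = total_pairs(n_seqs)
rows = estimated_rows(file_bytes, bytes_per_row)

print(f"all possible pairs: {pairs:,}")
print(f"estimated rows in file: {rows:,}")
print(f"estimated fraction of pairs under the cutoff: {rows / pairs:.1%}")
```

Under that assumption only a few percent of all pairs were written, yet that is already billions of rows, so shorter names or a stricter cutoff would shrink the file a lot.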

I’m now running the cluster.split command with large=T, again with 32 processors. Let’s see how it goes, and I hope I’m not jumping the gun here (i.e. that there isn’t something funny with the distance matrix).

Pat, have you thought of renaming the sequences as integers, like ESPRIT-Tree does, to save hard drive space?
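To illustrate the idea, here is a minimal Python sketch of replacing long read names with integers. It is not how mothur or ESPRIT-Tree actually do it. It assumes the distance file is in column format, with three whitespace-separated fields per row (name, name, distance). The function name compact_names and the read names in the demo are made up.

```python
import io


def compact_names(rows, out, name_to_id=None):
    """Rewrite 'nameA nameB dist' rows using integer IDs.

    Returns the name -> ID mapping so the original names can be restored later.
    """
    name_to_id = {} if name_to_id is None else name_to_id
    for line in rows:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        name_a, name_b, dist = parts
        # setdefault assigns the next free integer the first time a name is seen
        id_a = name_to_id.setdefault(name_a, len(name_to_id))
        id_b = name_to_id.setdefault(name_b, len(name_to_id))
        out.write(f"{id_a}\t{id_b}\t{dist}\n")
    return name_to_id


# Hypothetical read names, only for demonstration
demo_rows = [
    "HWI-ST1234:8:1101:1234:2000 HWI-ST1234:8:1101:1300:2100 0.0104",
    "HWI-ST1234:8:1101:1234:2000 HWI-ST1234:8:1101:1499:2200 0.0312",
]
buffer = io.StringIO()
mapping = compact_names(demo_rows, buffer)

original_size = sum(len(row) + 1 for row in demo_rows)  # +1 for each newline
print(buffer.getvalue(), end="")
print(f"{original_size} bytes -> {len(buffer.getvalue())} bytes")
```

The mapping has to be kept alongside the compacted file, otherwise the IDs cannot be translated back to read names after clustering.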

All the best, Shaun

pschloss · March 25, 2013, 10:37am

Plausible… Biologically? No. Technically? Yes. Bioinformatically? No. Can you tell me a bit more about your sequences? I gather these are from a HiSeq run. Do the read pairs overlap?

At the end of the week we hope to release our latest version of mothur and an SOP for use with paired-end data that overlap. This was designed for use with MiSeq data, but it significantly reduces sequencing error rates and makes a lot of what you’re seeing go away. The problem is that the sequencer is effectively generating biodiversity towards the end of the reads, so you need to be pretty aggressive in trimming the sequences or else make contigs.

As for the sequence names, the new version will help get around this by replacing the name/group files with a count_table file that holds the count of each unique sequence across the various groups.
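As a rough illustration of the count-table idea, the Python sketch below collapses a names file and a group file into per-group counts for each unique sequence. It assumes the names file has two columns (a representative name, then a comma-separated list of every read it represents) and the group file has two columns (read name, group). The function name build_count_table and the demo data are made up, and the layout of the real count_table file may differ.

```python
from collections import Counter


def build_count_table(names_lines, group_lines):
    """Return {representative: Counter({group: number_of_reads})}."""
    group_of = {}
    for line in group_lines:
        parts = line.split()
        if len(parts) == 2:
            group_of[parts[0]] = parts[1]

    table = {}
    for line in names_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        representative, members = parts
        # Count how many of the represented reads fall in each group
        table[representative] = Counter(group_of[m] for m in members.split(","))
    return table


# Made-up demo data
names_lines = ["seq1\tseq1,seq2,seq3", "seq4\tseq4"]
group_lines = ["seq1\tA", "seq2\tA", "seq3\tB", "seq4\tB"]

table = build_count_table(names_lines, group_lines)
groups = sorted({g for counts in table.values() for g in counts})
print("\t".join(["representative", "total"] + groups))
for representative, counts in table.items():
    per_group = [str(counts[g]) for g in groups]
    print("\t".join([representative, str(sum(counts.values()))] + per_group))
```

The saving comes from storing one row of numbers per unique sequence instead of listing every redundant read name.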

Pat

Srini_UGA · March 28, 2013, 3:03pm

Hi Pat
Thanks much for all your help to the community. When are you releasing the new version? I am also having problems processing MiSeq data. I am still struggling to get “cluster” output. My .names file is not recognized by the cluster command. I checked whether any FASTA headers were cropped during alignment or distance calculation (column format), but they are all intact. I also manually checked and found that all the FASTA headers are present in the .names file.
Any comments?
Srini
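The manual check described above can also be scripted. The Python sketch below lists names that appear in a column-format distance file but not in the first column of the names file, and counts rows that do not have three fields. It assumes the distance file uses the representative names from the first column of the names file. The function name check_dist_names and the demo data are made up.

```python
def check_dist_names(dist_rows, names_lines):
    """Return (names missing from the names file, number of malformed rows)."""
    known = set()
    for line in names_lines:
        parts = line.split()
        if len(parts) == 2:
            known.add(parts[0])  # representative name in the first column

    missing = set()
    malformed = 0
    for line in dist_rows:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 3:
            malformed += 1  # e.g. a row cut off part-way through
            continue
        missing.update(name for name in parts[:2] if name not in known)
    return missing, malformed


# Made-up demo data: "seq9" is unknown and the last row is cut short
names_lines = ["seq1\tseq1,seq2", "seq3\tseq3"]
dist_rows = ["seq1 seq3 0.05", "seq1 seq9 0.10", "seq3 se"]

missing, malformed = check_dist_names(dist_rows, names_lines)
print("missing names:", sorted(missing))
print("malformed rows:", malformed)
```

For real files, pass open file handles instead of lists so the distance file is read line by line rather than loaded into memory.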

jedwards · March 31, 2013, 6:35am

Hi Srini,

I was having the same problem and I think I figured it out. I’ve been running all of my stuff on iPlant, which is a great resource for us plant biologists, but there are some drawbacks. On iPlant you have to store all of your data on volumes, and the largest volume I can get is 100 GB. The size of my distance file was exceeding this limit every time I ran dist.seqs, but the process would still finish. When I went to cluster the sequences I would get that annoying warning about certain names not being recognized. I think it’s just because the size of the .dist file exceeded the space left on the disk and things got screwy. Hopefully the new release of mothur will address the issue of the .dist file sizes being out of hand.
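One way to test that guess is to check how much space is left on the volume and whether the distance file ends cleanly. Here is a minimal Python sketch, assuming a column-format file with three fields per row; the function names and demo files are made up. A clean last row does not prove the whole file is intact, but a broken one is a strong hint that the write was cut off.

```python
import os
import shutil
import tempfile


def free_bytes(directory: str) -> int:
    """Free space, in bytes, on the volume that holds the directory."""
    return shutil.disk_usage(directory).free


def last_row_is_complete(path: str, tail_bytes: int = 4096) -> bool:
    """True if the file ends with a newline and a 'nameA nameB dist' row."""
    size = os.path.getsize(path)
    if size == 0:
        return False
    with open(path, "rb") as handle:
        handle.seek(max(0, size - tail_bytes))  # read only the end of the file
        tail = handle.read()
    if not tail.endswith(b"\n"):
        return False  # the last row was cut off mid-write
    fields = tail.rstrip(b"\n").split(b"\n")[-1].split()
    if len(fields) != 3:
        return False
    try:
        float(fields[2].decode("ascii"))  # the third field should be a number
    except ValueError:
        return False
    return True


# Demo with two small made-up files
workdir = tempfile.mkdtemp()
good_path = os.path.join(workdir, "good.dist")
bad_path = os.path.join(workdir, "truncated.dist")
with open(good_path, "w") as handle:
    handle.write("seq1 seq2 0.01\nseq1 seq3 0.02\n")
with open(bad_path, "w") as handle:
    handle.write("seq1 seq2 0.01\nseq1 se")

print("free space (GB):", free_bytes(workdir) / 10**9)
print("good file ends cleanly:", last_row_is_complete(good_path))
print("truncated file ends cleanly:", last_row_is_complete(bad_path))
```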

Cheers,

Joe
