Dist.seqs (set RAM?)

Hi there,

TL;DR version.

  • not looking at microbiome, but songbird MHC (highly duplicated genes in this taxon)
  • can only use 1 processor for dist.seqs because mothur uses all 24 GB of RAM on my PC and shuts down if I ask it to use more than one processor
  • have to use 0.0 for cut-off because one SNP may be of importance
  • filtered aligned FASTA file has ~2.3 million sequences
  • dist.seqs has been running for days, but has stopped at sequence 390,800
  • can you set how much RAM dist.seqs is allowed to use, so mothur won’t close? And would that let me use more processors?

I’m using mothur to identify the MHC alleles carried by each individual, in this case a songbird. In a sense, you can consider each individual as a bacterial community, and each MHC allele as an OTU. Since songbirds have incredibly diverse MHC class II allele repertoires, I need to use a program such as mothur.

So far I have tailored the MiSeq SOP to my MHC analysis. However, I am running into trouble at dist.seqs: my computer has 8 processors and 24 GB of RAM, and if I run it with more than 1 processor, it maxes out the RAM and mothur shuts down. It’s also important to note that I cannot use a cutoff above 0, as even a single SNP may be important in determining allelic variation at MHC, so I had to set it to 0.

Is there any way to tell this command to use a set amount of RAM so that it won’t shut down but can still run faster? The filtered aligned FASTA file has roughly 2.3 million sequences, and it has stalled at 390,800.
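
For a sense of scale, here is a rough back-of-the-envelope sketch in Python of how many pairwise distances 2.3 million sequences imply. The function names are made up for illustration, and the progress estimate assumes the calculation walks the lower triangle of the distance matrix row by row, which may not match how mothur actually schedules or reports its work.

```python
def pairwise_count(n_seqs):
    """Number of unordered sequence pairs among n_seqs sequences."""
    return n_seqs * (n_seqs - 1) // 2


def fraction_done(current_row, n_seqs):
    """Share of all pairs already compared when the run reaches current_row.

    Assumption: row i is compared against every earlier row, so the work
    finished so far equals the number of pairs among the first current_row
    sequences.
    """
    return pairwise_count(current_row) / pairwise_count(n_seqs)


if __name__ == "__main__":
    total = pairwise_count(2_300_000)
    print(f"total pairs: {total:,}")
    print(f"fraction done at row 390,800: {fraction_done(390_800, 2_300_000):.4f}")
```

Under that assumption the early rows are the cheap ones, so a run that stops at sequence 390,800 would have finished only a few percent of the comparisons.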

Any help is appreciated.

Consider this resolved! I’m going to have to use a server for sure.

If you’re using a threshold of 0, you should be able to use make.shared with a count file:

mothur > make.shared(count=amazon.count_table, label=0.03)

That should be considerably lighter and will not require dist.seqs or cluster.
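
To illustrate why this can be lighter: if a cutoff of 0 is taken to mean that only exactly identical aligned sequences belong together, the grouping is a single tally pass rather than an all-against-all comparison. Below is a minimal Python sketch of that idea. The function name, sample names, and sequences are hypothetical, and this is not a description of how mothur implements it.

```python
from collections import defaultdict


def tally_identical(records):
    """Count identical sequences per sample.

    records: iterable of (sample_name, aligned_sequence) pairs.
    Returns {sequence: {sample_name: count}}.
    Assumption: two reads belong together only if their aligned strings
    match exactly, so reads differing only in gap placement stay separate.
    """
    table = defaultdict(lambda: defaultdict(int))
    for sample, seq in records:
        table[seq][sample] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {seq: dict(counts) for seq, counts in table.items()}


if __name__ == "__main__":
    reads = [
        ("birdA", "ACGT-A"),
        ("birdA", "ACGT-A"),
        ("birdB", "ACGT-A"),
        ("birdB", "ACGTTA"),  # one SNP away, so it is kept as its own variant
    ]
    for seq, counts in tally_identical(reads).items():
        print(seq, counts)
```

Each read is touched once, so the cost grows with the number of reads rather than with the number of read pairs.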

Hi Pat,

Apologies if this is an overly simple question – still new to using mothur.

With my filtered and aligned file, I am having difficulty finding the commands to remake a count table that contains the new (and fewer) sequences. The only count table I see is the one formed before these commands, at the start of the MiSeq SOP.

Thanks in advance!

So I was able to create a shared file. Thank you. However, how would I go about cross-referencing the OTU names with the sequences, which are named in this format: “M03127_554_000000000-C89WG_1_2111_6953_9784”?

Thanks!

Those would be in the list file generated by the cluster commands.

This helped greatly! Thank you. The one issue I am having is the vast number of OTUs (1.4 million), most of which are probably singletons among my samples. I found some information regarding singletons in the clustering wiki, but I’m not sure that’s what I need to do. Is there a way to eliminate all of the singleton OTUs in a shared file? Because it has so many columns, I cannot open the file in a program without it stalling or shutting down.
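
One way to do this outside mothur is to drop the singleton columns with a short script. Below is a minimal Python sketch. It assumes the shared file is tab-delimited with three leading columns (label, Group, numOtus) followed by one column per OTU, and it treats an OTU as a singleton when its count summed over all samples is below 2. The function names and the example table are hypothetical.

```python
FIXED_COLUMNS = 3  # assumed layout: label, Group, numOtus, then one column per OTU


def split_row(line):
    # Tolerate a trailing tab or newline at the end of each row.
    return line.rstrip("\t\r\n").split("\t")


def drop_rare_otus(shared_text, min_total=2):
    """Keep only OTU columns whose count summed over all rows is >= min_total."""
    rows = [split_row(line) for line in shared_text.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    n_otus = len(header) - FIXED_COLUMNS

    # First pass: total abundance of each OTU across all samples.
    totals = [0] * n_otus
    for row in data:
        for i in range(n_otus):
            totals[i] += int(row[FIXED_COLUMNS + i])

    # Second pass: rewrite every row with the surviving columns only,
    # updating the numOtus field to the new column count.
    keep = [FIXED_COLUMNS + i for i, total in enumerate(totals) if total >= min_total]
    out = ["\t".join(header[:FIXED_COLUMNS] + [header[c] for c in keep])]
    for row in data:
        out.append("\t".join([row[0], row[1], str(len(keep))] + [row[c] for c in keep]))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = (
        "label\tGroup\tnumOtus\tOtu1\tOtu2\tOtu3\n"
        "0.03\tbirdA\t3\t5\t1\t0\n"
        "0.03\tbirdB\t3\t2\t0\t0\n"
    )
    print(drop_rare_otus(example), end="")
```

For a real file, the text could be read with `open(path).read()` and the result written back out. With only a handful of samples the rows are long but few, so this may fit in memory where a spreadsheet program would not.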
