Cluster.split and computer characteristics

We’re stuck in cluster.split. This command takes a long time (42 hours for 137,000 reads), at least on our remote desktop (16 GB RAM and 4 processors). However, it only uses 1 processor, and I was wondering whether, in your experience, more are needed for this command.

We’re now trying with 4 processors and more memory; let’s see whether it runs faster.

I forgot to mention that we want to analyse 8,000,000 sequences, a lot!

What would you recommend?

Welcome to the mothur community! We have several suggestions for clustering large datasets. First we recommend following the steps in Pat’s example analysis, http://www.mothur.org/wiki/MiSeq_SOP, to remove errors, contaminants and chimeras. It is important to make sure you have good overlap within your dataset as well. When you are ready to cluster there are a few things you can do:

  1. Use a cutoff equal to the distance you want OTUs created for. We recommend 0.03.

  2. Consider running the cluster.split command in 2 parts (https://mothur.org/wiki/Cluster.split#file). The file option lets you enter the file that lists your column and names/count files as well as the singleton file. mothur generates this file when you run cluster.split() with the cluster=f parameter. This can be helpful with a large dataset, where you may be able to use all your processors for the splitting step but have to reduce them for the cluster step due to RAM constraints.

    mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cluster=f, processors=8) - split dataset by taxonomy and create distance matrices for each grouping.
    mothur > cluster.split(file=final.file, processors=4) - cluster each distance matrix and combine results

    Note: The more processors used the more memory is required. Each process will load a distance matrix into memory (RAM).

  3. If you are unable to cluster your dataset due to the size on the hardware you have, consider using AMI, https://mothur.org/wiki/Mothur_AMI.

  4. Alternatively, you can cluster based on taxonomic assignment using the phylotype command, https://mothur.org/wiki/Phylotype.
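
To get a feel for why the cluster step is so demanding, here is a rough Python sketch (not part of mothur) that counts the pairwise distances an unsplit dataset could need in the worst case. It assumes every sequence is unique and that no cutoff or taxonomic split is applied; in practice both reduce what is actually stored, so treat the numbers as an upper bound.

```python
def pairwise_count(n_sequences: int) -> int:
    """Number of unordered sequence pairs: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


# Dataset sizes mentioned in this thread.
for n in (137_000, 8_000_000):
    print(f"{n:>9,} sequences -> {pairwise_count(n):,} pairwise distances")
```

The count grows with the square of the number of sequences, which is why splitting by taxonomy first and using a cutoff matter so much for large datasets.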

Hi!

Thank you so much for your help.

cluster.split is still running. The “splitting” step was fast, but the “cluster” step is still executing.

We’re going to try phylotype on another computer. Despite the explanation on the mothur website, we are not sure what the cutoff means in this command. Does it mean class, order, phylum…?

I guess the meaning here is different from that in cluster.split, isn’t it? In the example, what is the meaning of the number next to the cutoff (1,2,3…)? For instance: 35 in the first row.

Kind regards
Noemi

The cutoff parameter in the phylotype command is different from the one in the cluster commands. In the cluster commands, the cutoff refers to a biological distance used to bin OTUs. In the phylotype command, the cutoff refers to the taxonomic level. The OTUs are formed by binning together all sequences that have the same classification at a given level. Consider this trivial example:

seq1 Bacteria(100);Firmicutes(82);Firmicutes_unclassified(82);Firmicutes_unclassified(82);Firmicutes_unclassified(82);Firmicutes_unclassified(82);
seq2 Bacteria(100);“Bacteroidetes”(96);“Bacteroidia”(84);“Bacteroidales”(84);“Bacteroidales”_unclassified(84);“Bacteroidales”_unclassified(84);
seq3 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100);
seq4 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(89);Lachnospiraceae_unclassified(89);
seq5 Bacteria(100);“Bacteroidetes”(100);“Bacteroidia”(94);“Bacteroidales”(94);“Porphyromonadaceae”(82);“Porphyromonadaceae”_unclassified(82);

Results in these OTUs:

label numOtus Otu1 Otu2 Otu3 Otu4
1 4 seq3,seq4 seq1 seq2 seq5
2 4 seq3,seq4 seq1 seq2 seq5
3 3 seq2,seq5 seq3,seq4 seq1
4 3 seq2,seq5 seq3,seq4 seq1
5 2 seq1,seq3,seq4 seq2,seq5
6 1 seq1,seq2,seq3,seq4,seq5

The smaller the cutoff level, the finer the taxonomic resolution. At level 1 (most specific), the reads are classified into 4 groupings: Firmicutes_unclassified, “Bacteroidales”_unclassified, Lachnospiraceae_unclassified and “Porphyromonadaceae”_unclassified. At level 6 (least specific), all the reads classify to Bacteria.
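
The binning in this example can be sketched in a few lines of Python. This is only an illustration of the idea (group sequences whose lineage matches down to a given depth), not mothur's actual implementation. The function names are made up for the example, and the sort order (largest OTU first) is an assumption chosen to mirror the table.

```python
import re
from collections import defaultdict

# The five example classifications from this post (confidence scores in parentheses).
TAXONOMY = {
    "seq1": 'Bacteria(100);Firmicutes(82);Firmicutes_unclassified(82);Firmicutes_unclassified(82);Firmicutes_unclassified(82);Firmicutes_unclassified(82);',
    "seq2": 'Bacteria(100);"Bacteroidetes"(96);"Bacteroidia"(84);"Bacteroidales"(84);"Bacteroidales"_unclassified(84);"Bacteroidales"_unclassified(84);',
    "seq3": 'Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100);',
    "seq4": 'Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(89);Lachnospiraceae_unclassified(89);',
    "seq5": 'Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(94);"Bacteroidales"(94);"Porphyromonadaceae"(82);"Porphyromonadaceae"_unclassified(82);',
}


def parse_lineage(tax_string):
    """Split a taxonomy string into rank names, dropping the '(NN)' scores."""
    return [re.sub(r"\(\d+\)$", "", taxon) for taxon in tax_string.split(";") if taxon]


def bin_by_level(taxonomy, level):
    """Group sequences sharing a lineage; level 1 keeps every rank,
    and each higher level drops one more rank from the specific end."""
    bins = defaultdict(list)
    for name, tax_string in taxonomy.items():
        lineage = parse_lineage(tax_string)
        depth = len(lineage) - level + 1
        bins[tuple(lineage[:depth])].append(name)
    otus = [sorted(members) for members in bins.values()]
    otus.sort(key=lambda members: (-len(members), members))  # assumed ordering
    return otus


for level in range(1, 7):
    otus = bin_by_level(TAXONOMY, level)
    print(level, len(otus), " ".join(",".join(members) for members in otus))
```

Run on the five sequences above, the sketch gives 4, 4, 3, 3, 2 and 1 OTUs for levels 1 to 6, matching the numOtus column in the table.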

what is the meaning of the number next to the cut-off (1,2,3…)? For instance: 35 in the first row.

I am not sure what you mean. Could you explain?

Hi Sarah,

Thank you very much for your response.

Understood. So, is the best option to include all 6 cutoff levels, or can we include only a few of them (the most specific), 6,5,4,3? We’re more interested in phyla, families, genera and species. Do you think that 6 cutoff levels will slow the command?

I have doubts about the column “numOtus”. Does it refer to the number of a given OTU in a sample? Or is it the name of the sample? In your example, it is 4 in the first row.

On the mothur website, the example shows “35”:

Outputted to the screen is a level, 1 meaning the most specific
taxonomy given.

Opening abrecovery.tx.list you would see the output as:

1 35 AY457915,AY457912,AY457898,AY457895,AY457894 …
2 35 AY457915,AY457912,AY457898,AY457895,AY457894,AY457891,AY457869 …
3 32 AY457915,AY457914,AY457912,AY457898,AY457895,AY457894,AY457891 …
4 25 AY457915,AY457914,AY457912,AY457898,AY457895,AY457894,AY457891,AY457888
5 15 AY457915,AY457914,AY457913,AY457912,AY457908,AY457901,AY457898,AY457895
6 11 AY457915,AY457914,AY457913,AY457912,AY457908,AY457901,AY457898,AY457895
7 8 AY457915,AY457914,AY457913,AY457912,AY457911,AY457910,AY457908,AY457901,AY457898
8 5 AY457915,AY457914,AY457913,AY457912,AY457911,AY457910,AY457908,AY457901,AY457898
9 1 AY457915,AY457914,AY457913,AY457912,AY457911,AY457910,AY457909,AY457908,AY457907

What is the difference between the cluster.split and phylotype commands? Only the splitting process?

Kind regards and thank you again!!
Noemi

Hi Noemi,

If you are interested in the most specific OTU binning, I recommend running the phylotype command without a cutoff. Without a cutoff, mothur will produce a list for every tax level. The command runs pretty quickly (about 1 second on the MiSeq_SOP data with no cutoff), so processing time should not be a hindrance.

The list file, https://mothur.org/wiki/List_file, columns are label, numOTUs, otuNames. Each row in the list file represents a different label. Let’s look at this example:

1 35 AY457915,AY457912,AY457898,AY457895,AY457894 …
2 35 AY457915,AY457912,AY457898,AY457895,AY457894,AY457891,AY457869 …
3 32 AY457915,AY457914,AY457912,AY457898,AY457895,AY457894,AY457891 …

The first column contains 1, 2, 3. These are the labels / cutoffs.
The second column contains 35, 35, 32. These are the number of OTUs at the cutoffs / labels.
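
As a small illustration of that column layout, here is a Python sketch that splits one list-file row into its label, its numOtus value and the OTUs themselves. It is an assumption-laden reading aid rather than an official parser: parse_list_line is a made-up helper, the row is taken from the five-sequence example earlier in this thread, and real list files may also contain a header row that this sketch does not handle.

```python
def parse_list_line(line):
    """Split a list-file row into (label, numOtus, OTUs).

    Columns are whitespace-separated; the sequence names inside
    one OTU are comma-separated.
    """
    fields = line.split()
    label = fields[0]
    num_otus = int(fields[1])
    otus = [field.split(",") for field in fields[2:]]
    return label, num_otus, otus


label, num_otus, otus = parse_list_line("3 3 seq2,seq5 seq3,seq4 seq1")
print(f"label={label} numOtus={num_otus}")
for index, members in enumerate(otus, start=1):
    print(f"  Otu{index}: {len(members)} sequence(s): {', '.join(members)}")
```

For a complete row, numOtus equals the number of OTU columns that follow it, so it counts OTUs at that label and is neither a sample name nor a per-sample abundance.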

The cluster.split command and phylotype command are different. We have 2 papers that assess and discuss the clustering methods, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3126452/ and https://msphere.asm.org/content/2/2/e00073-17.

Kindly,
Sarah

Hi Sarah,

Thank you very much for all your help.

We ran the phylotype command yesterday and yes, it was very fast! Afterwards, we ran classify.otu to assign the OTUs in our samples. It generated a table with information about the type of OTU, the taxonomic rank, daughter levels (what does this mean?), total, and a list of samples. Below each sample, there is a number assigned to each OTU, and we are not sure whether this is a percentage or a count. In our example, for instance, for p_Actinobacteria, sample P115 has “16”.

We now understand the differences between the cluster.split and phylotype commands. It seems that cluster.split is more accurate and more useful for large datasets. However, this command is very slow for us because we don’t have suitable computing resources. We’ve decided to continue with the phylotype results, and the next step is to build a phylogenetic tree and run the diversity commands. We’ve reviewed the commands in the mothur MiSeq SOP and have seen that we can do those either in the “Phylogeny” section or in “OTU-based analyses”. Since we have decided to continue with the phylotype results, can we use the commands in the “OTU-based analyses” section even though our analysis did not follow that approach? Are the OTU-based analysis commands only appropriate after running cluster.split?

Many thanks again,

Noemi
