Cluster.split problem in second step of clustering

I have tried to run the cluster.split command, but it never gets to the clustering part. I have also tried running it in two parts (see the sketch at the end of this post), but I am still facing issues, even when I run it on an HPC.
These are the commands I ran:

chimera.vsearch(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.fasta, count= obesasth.trim.contigs.unique.good.filter.unique.precluster.count_table)
classify.seqs(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table, reference=trainset18_062020.pds.fasta, taxonomy=trainset18_062020.pds.tax)
remove.lineage(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table, taxonomy=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

summary.seqs(fasta=obesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.pick.fasta)

Here is my output for summary:
Using 40 processors.

            Start   End     NBases  Ambigs  Polymer NumSeqs

Minimum: 1 632 220 0 3 1
2.5%-tile: 1 669 252 0 3 36094
25%-tile: 1 669 252 0 4 360931
Median: 1 669 253 0 4 721862
75%-tile: 1 669 253 0 5 1082792
97.5%-tile: 1 669 253 0 6 1407629
Maximum: 2 669 272 0 8 1443722
Mean: 1 668 252 0 4

# of Seqs: 1443722

It took 10 secs to summarize 1443722 sequences.

After this, I am running cluster.split

cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cluster=f, processors=36)

Using 36 processors.
Splitting the file…
/******************************************/
Selecting sequences for group Enterobacterales (1 of 129)
Number of unique sequences: 153062

I get files like final.0.count.temp and final.0.count.dist, but the second step of cluster.split just keeps running endlessly. What should I do?
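For reference, the two-part approach I mean looks roughly like this (a sketch; the second command picks up the split written by the first, and the exact name of the .file output may differ):

cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cluster=f, processors=36)
cluster.split(file=final.file, processors=36)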

Morning.

One thing that is striking to me is that you do not seem to be using the unique.seqs command, and your summary does not show the number of unique sequences. So it looks like you are clustering 1.5 million reads, which is a lot and will take forever even on a good cluster.

Best of success.

Oh, I just saw the names of your files and you did run unique.seqs. How come the number of uniques is not showing up in the summary you posted? That seems odd to me.

Hi Alexandre,
Right after running the summary, when I run cluster.split it does show the number of unique sequences.
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, cluster=f, processors=36)

Using 36 processors.
Splitting the file…
/******************************************/
Selecting sequences for group Enterobacterales (1 of 129)
Number of unique sequences: 153062

Selected 4735881 sequences from final.count_table.

Calculating distances for group Enterobacterales (1 of 129):
bla bla bla…

I am sorry, I cannot help further. Sometimes, when my students get errors that are hard to solve, I tell them to delete all files generated by mothur and start from a clean environment, regenerating every file needed, just to make sure everything is correct. Also, we always use "current" in our pipeline instead of the full file names, to make sure there is no mismatch with any earlier files (see the sketch below). That's my only advice for now.
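As a minimal illustration of what I mean by using current (the reference files are the ones from your earlier commands):

chimera.vsearch(fasta=current, count=current)
classify.seqs(fasta=current, count=current, reference=trainset18_062020.pds.fasta, taxonomy=trainset18_062020.pds.tax)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)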

I am working with mothur 1.47, using 32 processors and 128 000 MB of memory, and I run cluster.split at taxlevel 6 (genus, for past reasons too long to explain) before clustering at 0.03. I have never had any problems with a new dataset I am currently analysing, which ends up with a final total of 52 million sequences containing 430 000 uniques.

Best of success,

Is it possible that the data are 2x150 rather than 2x250 reads? If they are 2x150 I suspect the error rates will be quite high since the reads don’t fully overlap the V4 region.

Pat

Yes, my data are 2x150 reads. But is there any workaround to improve the alignment and overcome the downstream issues?

Not really… and that's why your clustering is taking forever (see the other post).

What I would do to "save" the experiment: I would use maxee in make.contigs to remove really bad reads, with the options deltaq=0 and maxee=2, 3, or even 5 (I would adjust maxee based on the number of uniques after pre.cluster and, of course, on the results of count.groups for my positive and negative controls). A rough sketch is below.
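Something like this, as a sketch that keeps your other make.contigs settings (count.groups here assumes make.contigs wrote a count table, as recent mothur versions do):

make.contigs(file=current, maxambig=0, maxlength=275, maxhomop=8, deltaq=0, maxee=2)
count.groups(count=current)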

Best of success.

Thank you so much for your support. Today, after the weekend, I checked the results of my experiment and thankfully the process completed; it did not crash this time. But there are 257877 OTUs in all from 1.5 million reads. Is this normal?

The process completed successfully. I am now confident that I can run this process with the resources available to me. I have one query regarding the final.opti_mcc.shared file.
Is the order of the samples preserved in this file? That is, is the order of samples the same as when we start the pipeline at make.contigs?

I am running my pipeline again on the same dataset to reduce the number of OTUs. When I used the parameters deltaq=0 and maxee=2 with make.contigs, I got a blank obesasth.trim.contigs.fasta file. What does this mean and what can I do about it?

Hello! It means your reads are not passing QC. Increase the maxee value until you get satisfactory results. Sorry, in my view there is no easy way out of it.


I just realized that the reason I got the blank file was not maxee, but the addition of the oligos file. This is the command I had given:
make.file(inputdir=/mnt/client_sharedfolder/obesity/empty/metagenafterbwtphixhg, type=fastq, prefix=obesasth)
make.contigs(file=current, maxambig=0, maxlength=275, maxhomop=8, maxee=2, deltaq=0, oligos=test.oligos)

Here is the content of my oligos file.
primer GTGCCAGCMGCCGCGGTAA GGACTACHVGGGTWTCTAAT V4

Can you suggest why this happened? I mean, why did I get zero reads in my contigs file when I used the oligos option?
Now that I have removed the oligos option, I am getting contigs in my obesasth.trim.contigs.fasta file.

I am not really sure whether you are using the right primers. The other thing you can try is pdiffs=2, to allow some differences between your reads and the primers before a read is discarded (see the sketch below). Also, maybe you no longer have primers attached to your reads, and since the primers are never found, all reads are automatically removed? In your earlier post you had a length of 253; was that with or without the oligos file?
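Something along these lines (a sketch; note that the parameter is named pdiffs in make.contigs):

make.contigs(file=current, maxambig=0, maxlength=275, maxhomop=8, maxee=2, deltaq=0, oligos=test.oligos, pdiffs=2)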

These are the standard primers (515F and 806R). I think the primers are not in my files, because I had run cutadapt before starting the mothur pipeline. Even cutadapt did not do anything to my files: the 'before' and 'after' files were exactly the same. So I think I can safely assume that the primers are not included in my reads.
The length of 253 that I got earlier was without any oligos file.
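For context, the cutadapt call was along these lines (a sketch with placeholder file names, not my exact command):

cutadapt -g GTGCCAGCMGCCGCGGTAA -G GGACTACHVGGGTWTCTAAT -o trimmed_R1.fastq -p trimmed_R2.fastq sample_R1.fastq sample_R2.fastq

Without --discard-untrimmed, reads in which no primer is found are written out unchanged, which is consistent with the before and after files being identical if the primers were never there.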

After running the mothur script yesterday with maxee=2 (and deltaq=0, and no oligos parameter), here is my summary after pre.cluster:

Using 40 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 631 220 0 3 1
2.5%-tile: 1 669 252 0 3 848227
25%-tile: 1 669 252 0 4 8482261
Median: 1 669 253 0 4 16964522
75%-tile: 1 669 253 0 5 25446783
97.5%-tile: 1 669 254 0 6 33080817
Maximum: 2 669 270 0 8 33929043
Mean: 1 668 252 0 4

# of unique seqs: 2265207

total # of seqs: 33929043

It took 41 secs to summarize 33929043 sequences.

Output File Names:
nobesasth.trim.contigs.unique.good.filter.unique.precluster.summary

The number of sequences is more than what I had when I did not set maxee=2 and deltaq=0. I don’t understand why this is so. What is the default value of maxee?

Hello! I am glad you found out what the problem was. I would finish the run and check, after chimera removal, whether you have fewer uniques than last time. Best of success!


Hello, I have figured out the primer problem. However, I am still stuck because I don't know how to reduce the number of sequences I am getting. The number of sequences I get now is higher than what I was getting earlier after chimera removal, so adding maxee=2 did not do any good. What is the default value of maxee?

Here is the summary I get after chimera removal and classify.seqs:

summary.seqs(fasta=nobesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.pick.fasta)

Using 40 processors.

            Start   End     NBases  Ambigs  Polymer NumSeqs

Minimum: 1 631 220 0 3 1
2.5%-tile: 1 669 252 0 3 51134
25%-tile: 1 669 252 0 4 511337
Median: 1 669 253 0 4 1022673
75%-tile: 1 669 253 0 5 1534009
97.5%-tile: 1 669 253 0 6 1994211
Maximum: 2 669 270 0 8 2045344
Mean: 1 668 252 0 4

# of Seqs: 2045344

It took 13 secs to summarize 2045344 sequences.

Output File Names:
nobesasth.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.pick.summary

Hello!

You can always try maxee=1 if you want. Another thing I found this morning, from tests I ran overnight, is that being more aggressive in the alignment step reduces my number of uniques and gives slightly better error rates. I am now using match=2, mismatch=-2 instead of the defaults (a sketch is below). That is the most I will tinker with the original SOP, for now at least.
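For example (a sketch; I am assuming these are the pairwise alignment parameters of make.contigs, combined with the maxee/deltaq settings discussed above):

make.contigs(file=current, align=needleman, match=2, mismatch=-2, maxee=2, deltaq=0)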

At some point you will have to decide how you want your data to look and choose the pipeline that best suits you. Downstream analysis can also help with that decision.

Best of success

Hello, I have some interesting results to share. In my last experiment, I had 1443722 sequences going into cluster.split, which gave me 25877 OTUs at taxlevel=4.
This time, after setting some parameters in make.contigs, I had 913717 sequences going into cluster.split and got 78715 OTUs at taxlevel=5.
Is this a realistic number, and is it good to take forward for downstream analysis?

Thank you

Morning! My advice: just try it!
