Problems with Cluster.split

AliceBaek · July 18, 2022, 12:13am

Hi,

I am experiencing some difficulties getting past cluster.split in the MiSeq SOP. I sequenced the V4 region of the 16S rRNA gene on the MiSeq using the 2x250 kit. I followed the MiSeq SOP (mothur v.1.47.0) to make contigs and clean the dataset without any issues.
I have tried a number of commands in an attempt to produce OTUs:

make.file(inputdir=., type=gz, prefix=stability)
make.contigs(file=stability.files, maxambig=0, maxlength=292, maxhomop=8, processors=24)
summary.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table)
       Start   End   NBases   Ambigs   Polymer   NumSeqs
Minimum:   1   37   37   0   3   1
2.5%-tile:   1   291   291   0   3   190879
25%-tile:   1   291   291   0   4   1908783
Median:    1   292   292   0   4   3817565
75%-tile:   1   292   292   0   5   5726347
97.5%-tile:   1   292   292   0   5   7444251
Maximum:   1   300   300   0   8   7635129
Mean:   1   291   291   0   4
# of unique seqs:   7635129
total # of seqs:   7635129

It took 83 secs to summarize 7635129 sequences.

mothur > unique.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table)
mothur > summary.seqs(fasta=stability.trim.contigs.unique.fasta, count=stability.trim.contigs.count_table)

Using 24 processors.

      Start   End   NBases   Ambigs   Polymer   NumSeqs
Minimum:   1   37   37   0   3   1
2.5%-tile:   1   291   291   0   3   190879
25%-tile:   1   291   291   0   4   1908783
Median:    1   292   292   0   4   3817565
75%-tile:   1   292   292   0   5   5726347
97.5%-tile:   1   292   292   0   5   7444251
Maximum:   1   300   300   0   8   7635129
Mean:   1   291   291   0   4
# of unique seqs:   2524391
total # of seqs:   7635129

mothur > align.seqs(fasta=stability.trim.contigs.unique.fasta, reference=/data/ref_database/mothur_silva/silva.nr_v138_1.align)
summary.seqs(fasta=stability.trim.contigs.unique.align, count=stability.trim.contigs.count_table)

Using 24 processors.

      Start   End   NBases   Ambigs   Polymer   NumSeqs
Minimum:   1044   1048   1   0   1   1
2.5%-tile:   11895   25318   291   0   3   190879
25%-tile:   11895   25318   291   0   4   1908783
Median:    11895   25318   292   0   4   3817565
75%-tile:   11895   25318   292   0   5   5726347
97.5%-tile:   11895   25318   292   0   5   7444251
Maximum:   43116   43116   300   0   8   7635129
Mean:   11896   25318   291   0   4
# of unique seqs:   2524391
total # of seqs:   7635129

mothur > screen.seqs(fasta=stability.trim.contigs.unique.align, count=stability.trim.contigs.count_table, start=11895, end=25318)
summary.seqs(fasta=stability.trim.contigs.unique.good.align, count=stability.trim.contigs.good.count_table)

Using 24 processors.

      Start   End   NBases   Ambigs   Polymer   NumSeqs
Minimum:   10366   25318   260   0   3   1
2.5%-tile:   11895   25318   291   0   3   188773
25%-tile:   11895   25318   291   0   4   1887729
Median:    11895   25318   292   0   4   3775457
75%-tile:   11895   25318   292   0   5   5663185
97.5%-tile:   11895   25318   292   0   5   7362141
Maximum:   11895   25498   300   0   8   7550913
Mean:   11894   25318   291   0   4
# of unique seqs:   2466444
total # of seqs:   7550913

filter.seqs(fasta=stability.trim.contigs.unique.good.align, vertical=T, trump=.)
unique.seqs(fasta=stability.trim.contigs.unique.good.filter.fasta, count=stability.trim.contigs.good.count_table)
summary.seqs(fasta=stability.trim.contigs.unique.good.filter.unique.fasta, count=stability.trim.contigs.unique.good.filter.count_table)

Using 24 processors.

      Start   End   NBases   Ambigs   Polymer   NumSeqs
Minimum:   1   611   260   0   3   1
2.5%-tile:   1   629   291   0   3   188773
25%-tile:   1   629   291   0   4   1887729
Median:    1   629   292   0   4   3775457
75%-tile:   1   629   292   0   5   5663185
97.5%-tile:   1   629   292   0   5   7362141
Maximum:   1   629   300   0   8   7550913
Mean:   1   628   291   0   4
# of unique seqs:   2447732
total # of seqs:   7550913

pre.cluster(fasta=stability.trim.contigs.unique.good.filter.unique.fasta, count=stability.trim.contigs.unique.good.filter.count_table, diffs=3)
mothur > chimera.vsearch(fasta=stability.trim.contigs.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.unique.good.filter.unique.precluster.count_table, dereplicate=t)
mothur > summary.seqs(fasta=stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table)

Using 24 processors.

      Start   End   NBases   Ambigs   Polymer   NumSeqs
Minimum:   1   611   260   0   3   1
2.5%-tile:   1   629   291   0   3   173851
25%-tile:   1   629   291   0   4   1738510
Median:    1   629   292   0   4   3477020
75%-tile:   1   629   292   0   5   5215530
97.5%-tile:   1   629   292   0   5   6780189
Maximum:   1   629   300   0   8   6954039
Mean:   1   628   291   0   4
# of unique seqs:   814566
total # of seqs:   6954039

classify.seqs(fasta=stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table, reference=/data/ref_database/mothur_silva/silva.nr_v138_1.align, taxonomy=/data/ref_database/mothur_silva/silva.nr_v138_1.tax)
mothur > remove.lineage(fasta=stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.fasta, count=stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.count_table, taxonomy=stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.nr_v138_1.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
mothur > rename.file(fasta=current, count=current, taxonomy=current, prefix=final)
Using stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table as input file for the count parameter.
Using stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.pick.fasta as input file for the fasta parameter.
Using stability.trim.contigs.unique.good.filter.unique.precluster.denovo.vsearch.nr_v138_1.wang.pick.taxonomy as input file for the taxonomy parameter.

Next, I tried cluster.split several times in this way
(1) mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, processors=40)
(2) mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=4, processors=36)
(3) mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=5, processors=16)
(4) mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=6, processors=16)
(5)mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=6, processors=16, cutoff=0.10)
(6)mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, taxlevel=6, processors=8, cutoff=0.10)

But, the result is: Out of memory: Killed process 2276 (mothur) total-vm: 263820292kB, anin-rss: 127037696kB, file-rss: 348kB, shmem-rss:0kB, UID: 1003 pgtables: 501792kB oom_score_adj:0
Killed

How can I figure it out to get OTUs?

(Total RAM: 128GB, processors:48)

Thank you.

pschloss · July 18, 2022, 5:47pm

Hi Alice,

From the lengths of your contigs, I suspect your primers and barcodes are still attached to the sequences. You can use an oligos file in make.contigs to remove those sequences and assign the sequences to samples. Your V4 data should be about 252 nt long. Can you give that a try? If you still have barcodes on your sequences, then I suspect you have N-times as many unique sequences in this dataset than you really do (N = number of samples in dataset).

Pat

AliceBaek · July 21, 2022, 8:16am

Hi Pat,

I thought carefully about your words.

I sequenced the V4 region using the same 515F and 806R primers you used according to your Wet-lab SOP.

Also, I confirmed that barcodes have been removed from the sequencing reads (sequencing company confirmed).

I think that if we use 515F and 806R to generate PCR products before the sequencing, the length of PCR products should be 292 bp.

And also think that if we make contigs with two paired reads, the length of contigs should be 292 nt (please refer to the figure I made below).

But based on your reply and MiSeq SOP, the length of contigs are 252 nt.

I do not fully understand why this happened.

Could you let me know about this?

Thanks for your help!

Best,

Alice

pschloss · July 21, 2022, 5:58pm

Hi Alice,

Can you post a few of your assembled sequences? I suspect they start with you forward primer. They will need to be removed to proceed with the rest of the analysis. Just because people use the same primers as us, that doesn’t mean that they are sequencing the same way that we recommend.

Pat

AliceBaek · July 22, 2022, 1:21am

Hi Pat,

As you said, I saw the file created after make.contigs. As a result, there was a forward primer.
However, I have a question here.
I understand why the barcode should be removed, but I do not fully understand why the forward primer should be removed.

Could you let me know about this?

Also, what command should I use to remove forward primer?

Thanks for your help!

Best,
Alice

AliceBaek · July 25, 2022, 2:35am

Hi Pat,

I made oligos file and removed the primer as you said.
Then after make.contigs, the length of contigs came out to be 252 and I was able to solve the problem.

Thank you so much for helping me solve the problem.

Best,

Alice

pschloss · August 2, 2022, 3:38pm

Wonderful! Glad it’s working now.
Pat

Topic		Replies	Views
Cluster.split issues: out of memory	1	427	August 16, 2022
Stuck at cluster.split -- how do I overcome RAM issue? Commands in mothur	12	12755	August 20, 2013
cluster and cluster.split error mothur bugs	3	5934	July 24, 2014
Cluster.split issue "Num_Dists_Below_Cutoff" Commands in mothur	4	1159	March 14, 2019
cluster.split V4 MiSeq runtime problem Commands in mothur	7	3511	February 25, 2016

Problems with Cluster.split

Related topics