Pre.cluster command after demultiplexing is taking too long

Hello, I hope the community can help me with this:
I am working with mothur v.1.48.0 on a dataset that I first need to demultiplex and then run through the complete mothur pipeline.
The demultiplexing itself went well, but the pre.cluster command is taking forever. I checked the threads and realized that pre.cluster is only using one, instead of the full capacity of the PC.
I wonder whether it is not analyzing the samples in group mode, in which case not only will the pre.cluster step take a long time, but the samples won’t separate as they should. In other experiments where I did not need to demultiplex, I always prepared a stability.files as part of the script to separate the groups of sequences and analyze them together, but I cannot do that now, because I only have one forward and one reverse fastq file.
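For context, the stability.files referred to above is a simple three-column, tab-separated file that maps each group name to its forward and reverse fastq files; a minimal sketch with hypothetical sample and file names:

    sampleA	sampleA_R1.fastq	sampleA_R2.fastq
    sampleB	sampleB_R1.fastq	sampleB_R2.fastq

With only one forward/reverse pair per library, this file cannot be built before demultiplexing; the oligos file passed to make.contigs takes over the job of assigning reads to groups.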
This is the script I’m using:

    # Prepare Mothur batch file
    MOTHUR_BATCH_FILE="$MOTHUR_OUTPUT_DIR/mothur_commands.batch"
    {
        echo "make.contigs(ffastq=LU_L_2_forward_paired.fq, rfastq=LU_L_2_reverse_paired.fq, oligos=$BASE_DIR/barcode_map/LU_L_2_barcode_map.tsv, processors=15, checkorient=t, pdiffs=3, bdiffs=2, tdiffs=4);"
        echo "summary.seqs(fasta=current, processors=$CORES);"
        echo "screen.seqs(fasta=current, count=current, maxambig=0, minlength=$MIN_LENGTH, maxlength=$MAX_LENGTH, maxhomop=8, processors=15);"
        echo "unique.seqs(fasta=current, count=current);"
        echo "pre.cluster(fasta=current, count=current, diffs=2, processors=15);"
        echo "chimera.vsearch(fasta=current, count=current, dereplicate=t, processors=15);"
        echo "classify.seqs(fasta=current, count=current, reference=$EZ_DATABASE, taxonomy=$EZ_TAXONOMY, cutoff=60, processors=15);"
        echo "remove.lineage(fasta=current, count=current, taxonomy=current, taxon='unknown-Protista');"
        echo "summary.tax(taxonomy=current, count=current, processors=15);"
        echo "make.shared(count=current, label=ASV);"
        echo "classify.otu(list=current, count=current, taxonomy=current, label=ASV);"
    } > "$MOTHUR_BATCH_FILE"

    # Execute Mothur with batch file and direct logs to the Mothur log directory
    $MOTHUR_EXECUTABLE "$MOTHUR_BATCH_FILE" | tee "$MOTHUR_LOG_DIR/mothur.logfile"

And these are the logs:

Linux version

Using ReadLine,Boost,GSL
mothur v.1.48.0
Last updated: 5/20/22
by
Patrick D. Schloss

Department of Microbiology & Immunology

University of Michigan
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

For questions and analysis support, please visit our forum at https://forum.mothur.org

Type 'quit()' to exit program

[NOTE]: Setting random seed to 19760620.

Batch Mode



mothur > make.contigs(ffastq=/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.fq, rfastq=/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_reverse_paired.fq, oligos=/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/barcode_map/LU_L_2_barcode_map.tsv, processors=15, checkorient=t, pdiffs=3, bdiffs=2, tdiffs=4);

Using 15 processors.
Making contigs...
Done.

Group count: 
LU_L_2_1.V5_V7	93935
LU_L_2_10.V5_V7	125363
LU_L_2_11.V5_V7	39409
LU_L_2_12.V5_V7	54061
LU_L_2_13.V5_V7	6523
LU_L_2_14.V5_V7	31704
LU_L_2_15.V5_V7	127507
LU_L_2_16.V5_V7	57839
LU_L_2_17.V5_V7	61705
LU_L_2_18.V5_V7	170966
LU_L_2_19.V5_V7	50277
LU_L_2_2.V5_V7	169554
LU_L_2_20.V5_V7	173857
LU_L_2_21.V5_V7	61362
LU_L_2_22.V5_V7	70759
LU_L_2_23.V5_V7	148012
LU_L_2_24.V5_V7	91537
LU_L_2_25.V5_V7	53512
LU_L_2_26.V5_V7	123139
LU_L_2_27.V5_V7	198013
LU_L_2_28.V5_V7	197719
LU_L_2_29.V5_V7	156640
LU_L_2_3.V5_V7	137612
LU_L_2_30.V5_V7	128871
LU_L_2_31.V5_V7	293816
LU_L_2_32.V5_V7	166654
LU_L_2_33.V5_V7	195727
LU_L_2_34.V5_V7	170883
LU_L_2_35.V5_V7	93765
LU_L_2_36.V5_V7	153331
LU_L_2_37.V5_V7	200890
LU_L_2_38.V5_V7	84394
LU_L_2_39.V5_V7	174023
LU_L_2_4.V5_V7	87426
LU_L_2_40.V5_V7	203069
LU_L_2_41.V5_V7	126804
LU_L_2_42.V5_V7	147330
LU_L_2_43.V5_V7	131324
LU_L_2_44.V5_V7	66301
LU_L_2_45.V5_V7	118774
LU_L_2_46.V5_V7	204762
LU_L_2_47.V5_V7	107987
LU_L_2_48.V5_V7	212448
LU_L_2_49.V5_V7	204668
LU_L_2_5.V5_V7	155856
LU_L_2_50.V5_V7	262382
LU_L_2_51.V5_V7	178735
LU_L_2_52.V5_V7	135582
LU_L_2_53.V5_V7	152664
LU_L_2_54.V5_V7	273555
LU_L_2_55.V5_V7	297188
LU_L_2_56.V5_V7	228081
LU_L_2_57.V5_V7	300399
LU_L_2_58.V5_V7	220290
LU_L_2_6.V5_V7	196192
LU_L_2_7.V5_V7	22751
LU_L_2_8.V5_V7	35444
LU_L_2_9.V5_V7	83346

Total of all groups is 8216717

It took 753 secs to process 12910226 sequences.

Output File Names: 
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.fasta
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.scrap.contigs.fasta
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.contigs_report
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.contigs.count_table


mothur > summary.seqs(fasta=current, processors=15);
Using /media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.fasta as input file for the fasta parameter.

Using 15 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	38	38	0	2	1
2.5%-tile:	1	68	68	0	4	205418
25%-tile:	1	372	372	0	5	2054180
Median: 	1	372	372	0	5	4108359
75%-tile:	1	378	378	0	5	6162538
97.5%-tile:	1	380	380	9	5	8011300
Maximum:	1	410	410	67	205	8216717
Mean:	1	332	332	0	4
# of Seqs:	8216717

It took 7 secs to summarize 8216717 sequences.

Output File Names:
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.summary


mothur > screen.seqs(fasta=current, count=current, maxambig=0, minlength=330, maxlength=440, maxhomop=8, processors=15);
Using /media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.contigs.count_table as input file for the count parameter.
Using /media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.fasta as input file for the fasta parameter.

Using 15 processors.

It took 8 secs to screen 8216717 sequences, removed 1456520.

/******************************************/
Running command: remove.seqs(accnos=/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.bad.accnos.temp, count=/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.contigs.count_table)
Removed 1456520 sequences from /media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.contigs.count_table.

Output File Names:
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.contigs.pick.count_table

/******************************************/

Output File Names:
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.good.fasta
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.bad.accnos
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.contigs.good.count_table


It took 37 secs to screen 8216717 sequences.

mothur > unique.seqs(fasta=current, count=current);
Using /media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.contigs.good.count_table as input file for the count parameter.
Using /media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.good.fasta as input file for the fasta parameter.
6760197	2603500

Output File Names: 
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.good.unique.fasta
/media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.good.count_table


mothur > pre.cluster(fasta=current, count=current, diffs=2, processors=15);
Using /media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.good.count_table as input file for the count parameter.
Using /media/shot89_1000/Ubuntu_data/mtp_librerias_maite_05_24/output_LU_L_2/trimmomatic_output/LU_L_2_forward_paired.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 15 processors.

/******************************************/
Splitting by sample: 

Using 15 processors.

Selecting sequences for groups LU_L_2_12.V5_V7-LU_L_2_13.V5_V7-LU_L_2_14.V5_V7


Selecting sequences for groups LU_L_2_6.V5_V7-LU_L_2_7.V5_V7-LU_L_2_8.V5_V7-LU_L_2_9.V5_V7


Selecting sequences for groups LU_L_2_1.V5_V7-LU_L_2_10.V5_V7-LU_L_2_11.V5_V7


Selecting sequences for groups LU_L_2_22.V5_V7-LU_L_2_23.V5_V7-LU_L_2_24.V5_V7-LU_L_2_25.V5_V7


Selecting sequences for groups LU_L_2_15.V5_V7-LU_L_2_16.V5_V7-LU_L_2_17.V5_V7-LU_L_2_18.V5_V7


Selecting sequences for groups LU_L_2_44.V5_V7-LU_L_2_45.V5_V7-LU_L_2_46.V5_V7-LU_L_2_47.V5_V7


Selecting sequences for groups LU_L_2_33.V5_V7-LU_L_2_34.V5_V7-LU_L_2_35.V5_V7-LU_L_2_36.V5_V7


Selecting sequences for groups LU_L_2_19.V5_V7-LU_L_2_2.V5_V7-LU_L_2_20.V5_V7-LU_L_2_21.V5_V7


Selecting sequences for groups LU_L_2_26.V5_V7-LU_L_2_27.V5_V7-LU_L_2_28.V5_V7-LU_L_2_29.V5_V7


Selecting sequences for groups LU_L_2_40.V5_V7-LU_L_2_41.V5_V7-LU_L_2_42.V5_V7-LU_L_2_43.V5_V7


Selecting sequences for groups LU_L_2_37.V5_V7-LU_L_2_38.V5_V7-LU_L_2_39.V5_V7-LU_L_2_4.V5_V7


Selecting sequences for groups LU_L_2_48.V5_V7-LU_L_2_49.V5_V7-LU_L_2_5.V5_V7-LU_L_2_50.V5_V7


Selecting sequences for groups LU_L_2_3.V5_V7-LU_L_2_30.V5_V7-LU_L_2_31.V5_V7-LU_L_2_32.V5_V7


Selecting sequences for groups LU_L_2_51.V5_V7-LU_L_2_52.V5_V7-LU_L_2_53.V5_V7-LU_L_2_54.V5_V7


Selecting sequences for groups LU_L_2_55.V5_V7-LU_L_2_56.V5_V7-LU_L_2_57.V5_V7-LU_L_2_58.V5_V7

Selected 17101 sequences from LU_L_2_12.V5_V7.
Selected 1121 sequences from LU_L_2_13.V5_V7.
Selected 4052 sequences from LU_L_2_14.V5_V7.
Selected 18956 sequences from LU_L_2_6.V5_V7.
Selected 7862 sequences from LU_L_2_7.V5_V7.
Selected 7815 sequences from LU_L_2_8.V5_V7.
Selected 222 sequences from LU_L_2_9.V5_V7.
Selected 40178 sequences from LU_L_2_1.V5_V7.
Selected 22554 sequences from LU_L_2_10.V5_V7.
Selected 2113 sequences from LU_L_2_11.V5_V7.
Selected 32169 sequences from LU_L_2_22.V5_V7.
Selected 63134 sequences from LU_L_2_23.V5_V7.
Selected 40749 sequences from LU_L_2_24.V5_V7.
Selected 22962 sequences from LU_L_2_25.V5_V7.
Selected 53046 sequences from LU_L_2_15.V5_V7.
Selected 17389 sequences from LU_L_2_16.V5_V7.
Selected 21379 sequences from LU_L_2_17.V5_V7.
Selected 69062 sequences from LU_L_2_18.V5_V7.
Selected 34513 sequences from LU_L_2_44.V5_V7.
Selected 41316 sequences from LU_L_2_45.V5_V7.
Selected 67755 sequences from LU_L_2_46.V5_V7.
Selected 51110 sequences from LU_L_2_47.V5_V7.
Selected 16048 sequences from LU_L_2_19.V5_V7.
Selected 57193 sequences from LU_L_2_2.V5_V7.
Selected 84622 sequences from LU_L_2_20.V5_V7.
Selected 34681 sequences from LU_L_2_21.V5_V7.
Selected 91767 sequences from LU_L_2_33.V5_V7.
Selected 71676 sequences from LU_L_2_34.V5_V7.
Selected 38830 sequences from LU_L_2_35.V5_V7.
Selected 69779 sequences from LU_L_2_36.V5_V7.
Selected 85556 sequences from LU_L_2_40.V5_V7.
Selected 42430 sequences from LU_L_2_41.V5_V7.
Selected 60831 sequences from LU_L_2_42.V5_V7.
Selected 65871 sequences from LU_L_2_43.V5_V7.
Selected 39917 sequences from LU_L_2_26.V5_V7.
Selected 64720 sequences from LU_L_2_27.V5_V7.
Selected 71739 sequences from LU_L_2_28.V5_V7.
Selected 70263 sequences from LU_L_2_29.V5_V7.
Selected 89706 sequences from LU_L_2_37.V5_V7.
Selected 37617 sequences from LU_L_2_38.V5_V7.
Selected 82795 sequences from LU_L_2_39.V5_V7.
Selected 44227 sequences from LU_L_2_4.V5_V7.
Selected 66376 sequences from LU_L_2_3.V5_V7.
Selected 45250 sequences from LU_L_2_30.V5_V7.
Selected 117114 sequences from LU_L_2_31.V5_V7.
Selected 65855 sequences from LU_L_2_32.V5_V7.
Selected 81832 sequences from LU_L_2_48.V5_V7.
Selected 67028 sequences from LU_L_2_49.V5_V7.
Selected 30904 sequences from LU_L_2_5.V5_V7.
Selected 95079 sequences from LU_L_2_50.V5_V7.
Selected 71140 sequences from LU_L_2_51.V5_V7.
Selected 66475 sequences from LU_L_2_52.V5_V7.
Selected 67566 sequences from LU_L_2_53.V5_V7.
Selected 114121 sequences from LU_L_2_54.V5_V7.
Selected 148688 sequences from LU_L_2_55.V5_V7.
Selected 99130 sequences from LU_L_2_56.V5_V7.
Selected 159087 sequences from LU_L_2_57.V5_V7.
Selected 80353 sequences from LU_L_2_58.V5_V7.

It took 35 seconds to split the dataset by sample.
/******************************************/

Processing group LU_L_2_12.V5_V7:

Processing group LU_L_2_15.V5_V7:

Processing group LU_L_2_19.V5_V7:

Processing group LU_L_2_22.V5_V7:

Processing group LU_L_2_26.V5_V7:

Processing group LU_L_2_3.V5_V7:

Processing group LU_L_2_33.V5_V7:

Processing group LU_L_2_37.V5_V7:

Processing group LU_L_2_40.V5_V7:

Processing group LU_L_2_44.V5_V7:

Processing group LU_L_2_48.V5_V7:

Processing group LU_L_2_51.V5_V7:

Processing group LU_L_2_55.V5_V7:

Processing group LU_L_2_6.V5_V7:

Processing group LU_L_2_1.V5_V7:
LU_L_2_19.V5_V7	16048	5374	10674
Total number of sequences before pre.cluster was 16048.
pre.cluster removed 10674 sequences.

It took 4352 secs to cluster 16048 sequences.

Processing group LU_L_2_2.V5_V7:
LU_L_2_12.V5_V7	17101	8127	8974
Total number of sequences before pre.cluster was 17101.
pre.cluster removed 8974 sequences.

It took 10346 secs to cluster 17101 sequences.

…(pre.cluster is still running, so the log continues with many more blocks like the ones above)

Hi there,

I see a couple of things. First, you are trimming your sequences outside of mothur - I assume using Trimmomatic. We find that pre-trimming reads produces worse output than letting make.contigs handle them. Second, you aren’t aligning your sequences - I’m surprised mothur isn’t kicking it out when you try to run pre.cluster. pre.cluster can be run without aligning sequences, but it will take a very long time. Perhaps that’s where you’re at. If you follow the MiSeq SOP and align your sequences, then screen, filter, and unique (again), pre.cluster and what follows should run much faster.

As an aside, it appears that you have very long contigs. This means the individual reads do not fully overlap, resulting in suboptimal denoising. You can read about this here. Since we’re discussing pre.cluster, you could probably safely increase the diffs argument to 3 or 4 (about 1 per 100 nt).

Pat

Dear Pat,
Thank you very much for your kind suggestions. I implemented them and they worked like a charm. The analysis had been running for days, and now it completes very quickly. Thank you very much!

Can I ask you a couple more things?

  • In this analysis I focused on demultiplexing one sample; in the future, however, I’ll have several multiplexed samples that I want to include in the same analysis, and the barcodes can be the same across the different multiplexed samples… My guess is that I will need a stability.files plus multiple barcode/oligo maps, but I have no clue how to implement that.
  • How can I obtain fastq files of the demultiplexed samples so I can upload them to NCBI?

Glad it worked!

  • If you have multiple files that need to be demultiplexed, I’d suggest running make.contigs on each separately and then using merge.files to merge the fasta and count files.
  • I’d encourage you to check out our make.sra command for submitting your sequences to NCBI’s SRA.
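A hedged sketch of that two-step approach (all filenames hypothetical): each library is assembled with its own oligos file, and the per-library outputs are then combined.

    make.contigs(ffastq=lib1_R1.fq, rfastq=lib1_R2.fq, oligos=lib1_oligos.txt, processors=15);
    make.contigs(ffastq=lib2_R1.fq, rfastq=lib2_R2.fq, oligos=lib2_oligos.txt, processors=15);
    merge.files(input=lib1_R1.trim.contigs.fasta-lib2_R1.trim.contigs.fasta, output=combined.fasta);
    merge.count(count=lib1_R1.contigs.count_table-lib2_R1.contigs.count_table, output=combined.count_table);

Because each make.contigs call uses its own oligos file, identical barcodes across libraries don’t collide; only the group names in the merged count_table need to be unique across libraries.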

Pat