Pre.cluster taking longer than usual and eliminating 90% of sequences

I’m running pre.cluster on 5 samples, 3 of which I have already analysed in mothur once before. The first time I ran pre.cluster on one of them (Sal_16SB), it reported:

Sal_16SB 148876 111371 37505
Total number of sequences before pre.cluster was 148876.
pre.cluster removed 37505 sequences.

It took 7059 secs to cluster 148876 sequences.

This time round however:

Sal_16SB 77348 12493 64855
Total number of sequences before pre.cluster was 77348.
pre.cluster removed 64855 sequences.

It took 118026 secs to cluster 77348 sequences.

What could be causing the delay and the removal of so many sequences?
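
For a side-by-side view, the figures quoted above can be turned into a removal fraction and a time per sequence. This is a small Python sketch that uses only the numbers from the two pre.cluster reports; the function and variable names are made up for illustration.

```python
def precluster_stats(total, removed, seconds):
    """Return (fraction of sequences removed, seconds spent per input sequence)."""
    return removed / total, seconds / total

# Numbers copied from the two pre.cluster reports for Sal_16SB above.
first_run = precluster_stats(total=148876, removed=37505, seconds=7059)
second_run = precluster_stats(total=77348, removed=64855, seconds=118026)

for label, (frac, per_seq) in (("first run", first_run), ("second run", second_run)):
    print(f"{label}: {frac:.1%} removed, {per_seq:.3f} s per sequence")
```

By this arithmetic the first run removed about 25% of the sequences and the second about 84%, while the time per sequence rose from roughly 0.05 s to roughly 1.5 s.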

Here is my logfile prior to pre.cluster as it is still running:

mothur > make.file(inputdir=/Users/NVujacic/mothur, type=gz, prefix=stability)
Setting input directories to:
/Users/NVujacic/mothur/

Output File Names:
/Users/NVujacic/mothur/stability.files

mothur > make.contigs(file=stability.files, processors=16)

Using 16 processors.

[WARNING]: group MMC-16S-A2B contains illegal characters in the name. Group names should not include :, -, or / characters. The ‘:’ character is a special character used in trees. Using ‘:’ will result in your tree being unreadable by tree reading software. The ‘-’ character is a special character used by mothur to parse group names. Using the ‘-’ character will prevent you from selecting groups. The ‘/’ character will created unreadable filenames when mothur includes the group in an output filename.

[NOTE] Updating MMC-16S-A2B to MMC_16S_A2B to avoid downstream issues.

Unable to open MMC-16S-A2B_S1_L001_R1_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/MMC-16S-A2B_S1_L001_R1_001.fastq.gz.
Unable to open MMC-16S-A2B_S1_L001_R2_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/MMC-16S-A2B_S1_L001_R2_001.fastq.gz.

[WARNING]: group MMC-16S-A3-2ng contains illegal characters in the name. Group names should not include :, -, or / characters. The ‘:’ character is a special character used in trees. Using ‘:’ will result in your tree being unreadable by tree reading software. The ‘-’ character is a special character used by mothur to parse group names. Using the ‘-’ character will prevent you from selecting groups. The ‘/’ character will created unreadable filenames when mothur includes the group in an output filename.

[NOTE] Updating MMC-16S-A3-2ng to MMC_16S_A3_2ng to avoid downstream issues.

Unable to open MMC-16S-A3-2ng_S3_L001_R1_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/MMC-16S-A3-2ng_S3_L001_R1_001.fastq.gz.
Unable to open MMC-16S-A3-2ng_S3_L001_R2_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/MMC-16S-A3-2ng_S3_L001_R2_001.fastq.gz.

[WARNING]: group MMC-16S-A3B contains illegal characters in the name. Group names should not include :, -, or / characters. The ‘:’ character is a special character used in trees. Using ‘:’ will result in your tree being unreadable by tree reading software. The ‘-’ character is a special character used by mothur to parse group names. Using the ‘-’ character will prevent you from selecting groups. The ‘/’ character will created unreadable filenames when mothur includes the group in an output filename.

[NOTE] Updating MMC-16S-A3B to MMC_16S_A3B to avoid downstream issues.

Unable to open MMC-16S-A3B_S2_L001_R1_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/MMC-16S-A3B_S2_L001_R1_001.fastq.gz.
Unable to open MMC-16S-A3B_S2_L001_R2_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/MMC-16S-A3B_S2_L001_R2_001.fastq.gz.

[WARNING]: group Sal-16S-2ng contains illegal characters in the name. Group names should not include :, -, or / characters. The ‘:’ character is a special character used in trees. Using ‘:’ will result in your tree being unreadable by tree reading software. The ‘-’ character is a special character used by mothur to parse group names. Using the ‘-’ character will prevent you from selecting groups. The ‘/’ character will created unreadable filenames when mothur includes the group in an output filename.

[NOTE] Updating Sal-16S-2ng to Sal_16S_2ng to avoid downstream issues.

Unable to open Sal-16S-2ng_S5_L001_R1_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/Sal-16S-2ng_S5_L001_R1_001.fastq.gz.
Unable to open Sal-16S-2ng_S5_L001_R2_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/Sal-16S-2ng_S5_L001_R2_001.fastq.gz.

[WARNING]: group Sal-16SB contains illegal characters in the name. Group names should not include :, -, or / characters. The ‘:’ character is a special character used in trees. Using ‘:’ will result in your tree being unreadable by tree reading software. The ‘-’ character is a special character used by mothur to parse group names. Using the ‘-’ character will prevent you from selecting groups. The ‘/’ character will created unreadable filenames when mothur includes the group in an output filename.

[NOTE] Updating Sal-16SB to Sal_16SB to avoid downstream issues.

Unable to open Sal-16SB_S3_L001_R1_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/Sal-16SB_S3_L001_R1_001.fastq.gz.
Unable to open Sal-16SB_S3_L001_R2_001.fastq.gz. Trying input directory /Users/NVujacic/mothur/Sal-16SB_S3_L001_R2_001.fastq.gz.

Processing file pair /Users/NVujacic/mothur/MMC-16S-A3-2ng_S3_L001_R1_001.fastq.gz - /Users/NVujacic/mothur/MMC-16S-A3-2ng_S3_L001_R2_001.fastq.gz (files 2 of 5) <<<<<

Processing file pair /Users/NVujacic/mothur/MMC-16S-A3B_S2_L001_R1_001.fastq.gz - /Users/NVujacic/mothur/MMC-16S-A3B_S2_L001_R2_001.fastq.gz (files 3 of 5) <<<<<

Processing file pair /Users/NVujacic/mothur/Sal-16S-2ng_S5_L001_R1_001.fastq.gz - /Users/NVujacic/mothur/Sal-16S-2ng_S5_L001_R2_001.fastq.gz (files 4 of 5) <<<<<

Processing file pair /Users/NVujacic/mothur/Sal-16SB_S3_L001_R1_001.fastq.gz - /Users/NVujacic/mothur/Sal-16SB_S3_L001_R2_001.fastq.gz (files 5 of 5) <<<<<

Processing file pair /Users/NVujacic/mothur/MMC-16S-A2B_S1_L001_R1_001.fastq.gz - /Users/NVujacic/mothur/MMC-16S-A2B_S1_L001_R2_001.fastq.gz (files 1 of 5) <<<<<
Making contigs…
Making contigs…
Making contigs…
Making contigs…
Making contigs…
Done.

It took 641 secs to assemble 214665 reads.

Done.

It took 673 secs to assemble 207981 reads.

Done.

It took 689 secs to assemble 219971 reads.

Done.

It took 712 secs to assemble 231362 reads.

Done.

It took 1062 secs to assemble 437576 reads.

Group count:
MMC_16S_A2B 231362
MMC_16S_A3B 214665
MMC_16S_A3_2ng 207981
Sal_16SB 219971
Sal_16S_2ng 437576

Total of all groups is 1311555

It took 1087 secs to process 1311555 sequences.

Output File Names:
/Users/NVujacic/mothur/stability.trim.contigs.fasta
/Users/NVujacic/mothur/stability.scrap.contigs.fasta
/Users/NVujacic/mothur/stability.contigs_report
/Users/NVujacic/mothur/stability.contigs.count_table

mothur > summary.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table)

Using 16 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 35 35 0 2 1
2.5%-tile: 1 98 98 0 3 32789
25%-tile: 1 258 258 0 4 327889
Median: 1 359 359 0 5 655778
75%-tile: 1 451 451 0 5 983667
97.5%-tile: 1 497 497 13 6 1278767
Maximum: 1 502 502 89 250 1311555
Mean: 1 344 344 1 4

# of unique seqs: 1311555

total # of seqs: 1311555

It took 80 secs to summarize 1311555 sequences.

Output File Names:
/Users/NVujacic/mothur/stability.trim.contigs.summary

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, count=stability.contigs.count_table, maxambig=0, maxhomop=8)

Using 16 processors.

It took 22 secs to screen 1311555 sequences, removed 279247.

/******************************************/
Running command: remove.seqs(accnos=/Users/NVujacic/mothur/stability.trim.contigs.bad.accnos.temp, count=/Users/NVujacic/mothur/stability.contigs.count_table)
Removed 279247 sequences from /Users/NVujacic/mothur/stability.contigs.count_table.

Output File Names:
/Users/NVujacic/mothur/stability.contigs.pick.count_table

/******************************************/

Output File Names:
/Users/NVujacic/mothur/stability.trim.contigs.good.fasta
/Users/NVujacic/mothur/stability.trim.contigs.bad.accnos
/Users/NVujacic/mothur/stability.contigs.good.count_table

It took 51 secs to screen 1311555 sequences.

mothur > summmary.seqs(count=current)
[ERROR]: Invalid command.
[ERROR]: did not complete summmary.seqs.

mothur > summary.seqs(count=current)
Using /Users/NVujacic/mothur/stability.contigs.good.count_table as input file for the count parameter.
Using /Users/NVujacic/mothur/stability.trim.contigs.good.fasta as input file for the fasta parameter.

Using 16 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 35 35 0 2 1
2.5%-tile: 1 89 89 0 3 25808
25%-tile: 1 238 238 0 4 258078
Median: 1 327 327 0 5 516155
75%-tile: 1 407 407 0 5 774232
97.5%-tile: 1 497 497 0 6 1006501
Maximum: 1 502 502 0 8 1032308
Mean: 1 318 318 0 4

# of unique seqs: 1032308

total # of seqs: 1032308

It took 55 secs to summarize 1032308 sequences.

Output File Names:
/Users/NVujacic/mothur/stability.trim.contigs.good.summary

mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta, count=stability.contigs.count_table)
1032308 857615

Output File Names:
/Users/NVujacic/mothur/stability.trim.contigs.good.unique.fasta
/Users/NVujacic/mothur/stability.trim.contigs.good.count_table

mothur > summary.seqs(count=current)
Using /Users/NVujacic/mothur/stability.trim.contigs.good.count_table as input file for the count parameter.
Using /Users/NVujacic/mothur/stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 16 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 35 35 0 2 1
2.5%-tile: 1 89 89 0 3 25808
25%-tile: 1 238 238 0 4 258078
Median: 1 327 327 0 5 516155
75%-tile: 1 407 407 0 5 774232
97.5%-tile: 1 497 497 0 6 1006501
Maximum: 1 502 502 0 8 1032308
Mean: 1 318 318 0 4

# of unique seqs: 857615

total # of seqs: 1032308

It took 49 secs to summarize 1032308 sequences.

Output File Names:
/Users/NVujacic/mothur/stability.trim.contigs.good.unique.summary

mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.bacteria.fasta)

Using 16 processors.

Reading in the /Users/NVujacic/mothur/silva.bacteria.fasta template sequences… DONE.
It took 23 to read 14956 sequences.

Aligning sequences from /Users/NVujacic/mothur/stability.trim.contigs.good.unique.fasta …
It took 6991 secs to align 857615 sequences.

[WARNING]: 426285 of your sequences generated alignments that eliminated too many bases, a list is provided in /Users/NVujacic/mothur/stability.trim.contigs.good.unique.flip.accnos.
[NOTE]: 423692 of your sequences were reversed to produce a better alignment.

It took 6991 seconds to align 857615 sequences.

Output File Names:
/Users/NVujacic/mothur/stability.trim.contigs.good.unique.align
/Users/NVujacic/mothur/stability.trim.contigs.good.unique.align_report
/Users/NVujacic/mothur/stability.trim.contigs.good.unique.flip.accnos

mothur > summary.seqs(count=current)
Using /Users/NVujacic/mothur/stability.trim.contigs.good.count_table as input file for the count parameter.
Using /Users/NVujacic/mothur/stability.trim.contigs.good.unique.align as input file for the fasta parameter.

Using 16 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 0 0 0 0 1 1
2.5%-tile: 1083 6202 83 0 3 25808
25%-tile: 6389 21596 236 0 4 258078
Median: 21785 32520 325 0 5 516155
75%-tile: 32412 40339 404 0 5 774232
97.5%-tile: 40310 43116 495 0 6 1006501
Maximum: 43116 43116 502 0 8 1032308
Mean: 19567 29697 315 0 4

# of unique seqs: 857615

total # of seqs: 1032308

It took 1863 secs to summarize 1032308 sequences.

Output File Names:
/Users/NVujacic/mothur/stability.trim.contigs.good.unique.summary

The results of the align.seqs command show poor overlap. This will result in fewer sequences being merged together during pre.cluster. Did you do any screening before the pre.cluster command?
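
To put rough numbers on the align.seqs output in the log, the sketch below computes what share of the unique sequences were flagged or reversed. It uses only counts printed in the log above; the variable names are illustrative.

```python
# Counts copied from the align.seqs output in the log above.
unique_seqs = 857615
flagged = 426285   # "generated alignments that eliminated too many bases"
reversed_ = 423692  # "were reversed to produce a better alignment"

flagged_share = flagged / unique_seqs
reversed_share = reversed_ / unique_seqs

print(f"flagged:  {flagged_share:.1%} of unique sequences")
print(f"reversed: {reversed_share:.1%} of unique sequences")
```

Both shares come out at close to half of the unique sequences.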

That would be because I’m doing a shotgun approach of the whole 16S gene. No screening beforehand.

Thanks

Ah, that makes sense. This workflow really only works for amplicon data since it assumes all reads start and end at the same location in the gene.

Thanks,
Pat
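
As an illustration of the assumption described above, the sketch below counts how many alignment columns two reads have in common. It is not how mothur implements pre.cluster; the function name and all coordinates are invented for the example. The point is only that reads covering different parts of the gene share few or no columns to compare.

```python
def shared_columns(read_a, read_b):
    """Number of alignment columns covered by both reads.

    Each read is a (start, end) pair of alignment positions, inclusive.
    """
    start = max(read_a[0], read_b[0])
    end = min(read_a[1], read_b[1])
    return max(0, end - start + 1)

# Hypothetical amplicon reads: same primers, so the same start and end.
amplicon_a = (2000, 11500)
amplicon_b = (2000, 11500)

# Hypothetical shotgun reads: fragments from different parts of the gene.
shotgun_a = (1000, 6000)
shotgun_b = (22000, 32000)

print(shared_columns(amplicon_a, amplicon_b))  # every column is shared
print(shared_columns(shotgun_a, shotgun_b))    # no columns are shared
```

With the invented amplicon coordinates the two reads share their full length, while the two shotgun fragments share nothing.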
