Pre.cluster removes the majority of sequences and names mismatch

dagg · June 23, 2021, 4:02pm

Hi,

I am still a mothur rookie but have encountered several problems at the pre.cluster stage (and then subsequently as I have tried various other commands to try and resolve this). Please ignore processor use numbers as I have varied these according to the number of samples I have included in the analysis:

I am using mothur v 1.44.1. The entire set of commands I am using is as follows:
make.contigs(file=IDtestStudy.txt, trimoverlap=F, processors=4)
summary.seqs(fasta=current)
count.groups(group=current)
screen.seqs(fasta=current, group=current, maxambig=0, maxhomop=6, minlength=200, maxlength=450, processors=4)
summary.seqs(fasta=current) count.groups(group=current)

exit Mothur and trim sequences using a prinseq

reload Mothur:
unique.seqs(fasta=IDtestStudy_prinseq_good.fasta)
summary.seqs(fasta=current, name=current, processors=4) count.groups(group=IDtestStudy.contigs.good.groups)
align.seqs(fasta=current, reference=silva.nr_v132.align, flip=T, processors=4) summary.seqs(fasta=current, name=current)
screen.seqs(fasta=current, name=current, group=current, summary=IDtestStudy_prinseq_good.unique.summary, start=1046, end=6421, processors=4)
summary.seqs(fasta=current, name=current) count.groups(group=current)
filter.seqs(fasta=current, vertical=T, trump=., processors=4)
summary.seqs(fasta=current, name=current)
unique.seqs(fasta=current, name=current)
summary.seqs(fasta=current, name=current)
pre.cluster(fasta=current, name=current, group=current, diffs=3)
summary.seqs(fasta=current, name=current) count.groups(group=current)

At this stage I enter issues. Firstly the pre.cluster removes the vast majority of the sequences. E.g.:
Processing group XXXXX_1:
XXXXX_14 1121 86 1035
Total number of sequences before pre.cluster was 1121.
pre.cluster removed 1035 sequences.

Once pre.cluster finishes and I try to run summary seqs, I get the following error:
mothur > summary.seqs(fasta=current, name=current)
Using IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.fasta as input file for the fasta parameter.
Using IDtestStudy_prinseq_good.unique.good.filter.names as input file for the name parameter.

Using 4 processors.
[ERROR]: Your name file contains 452721 unique sequences, but your fasta file contains 61338. File mismatch detected, quitting command.

I have then switched tack and used an alternate set script using the example analysis on the website. Please ignore the different reference SILVA file. The complete set of commands I am using are as follows:

make.contigs(file=IDtestStudy.txt, trimoverlap=F, processors=4)

summary.seqs(fasta=current)

count.groups(group=current)

screen.seqs(fasta=IDtestStudy.trim.contigs.fasta, group=IDtestStudy.contigs.groups, maxambig=0, maxhomop=6, minlength=200, maxlength=450, processors=4)

summary.seqs(fasta=current)

count.groups(group=current)

perl prinseq-lite.pl -fasta IDtestStudy.trim.contigs.good.fasta -trim_left 4 -trim_right 4

mothur > unique.seqs(fasta=IDtestStudy_prinseq_good.fasta)

summary.seqs(fasta=current, name=current, processors=4)

count.groups(group=IDtestStudy.contigs.good.groups)

count.seqs(name=IDtestStudy_prinseq_good.names, group=IDtestStudy.contigs.good.groups)

summary.seqs(count=IDtestStudy_prinseq_good.count_table)

align.seqs(fasta=IDtestStudy_prinseq_good.unique.fasta, reference=silva.nr_v123.chop.fasta, flip=T, processors=4)

summary.seqs(fasta=IDtestStudy_prinseq_good.unique.align, count=IDtestStudy_prinseq_good.count_table)

screen.seqs(fasta=IDtestStudy_prinseq_good.unique.align, count=IDtestStudy_prinseq_good.count_table, summary=IDtestStudy_prinseq_good.unique.summary, start=1046, end=6421, processors=4)

summary.seqs(fasta=current, count=current)

filter.seqs(fasta=IDtestStudy_prinseq_good.unique.good.align, vertical=T, trump=., processors=4)

unique.seqs(fasta=IDtestStudy_prinseq_good.unique.good.filter.fasta, count=IDtestStudy_prinseq_good.good.count_table)

pre.cluster(fasta=IDtestStudy_prinseq_good.unique.good.filter.unique.fasta, count=IDtestStudy_prinseq_good.unique.good.filter.count_table, diffs=3)

chimera.vsearch(fasta=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.fasta, count=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.count_table, dereplicate=t, processors=4)

remove.seqs(fasta=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.fasta, accnos=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.denovo.vsearch.accnos)

(name file used is IDtestStudy_prinseq_good.names)

summary.seqs()

classify.seqs(fasta=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.pick.fasta, count=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, reference=silva.nr_v123.chop.fasta, taxonomy=silva.nr_v123.tax, cutoff=80)

remove.lineage(fasta=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.pick.fasta, count=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table, taxonomy=IDtestStudy_prinseq_good.unique.good.filter.unique.precluster.pick.nr_v123.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

summary.seqs()

summary.tax(taxonomy=current, count=current)

Once again, the pre.cluster command removes the vast majority of sequences. E.g:
It took 0 secs to cluster 2720 sequences.
XXXXX_1 4642 512 4130
Total number of sequences before pre.cluster was 4642.
pre.cluster removed 4130 sequences.

I have then reverted to mothur v 1.38.1 as my postdoc has run the samples through the original script successfully with that version. Depressingly, with the same fastq files that my postdoc has used (and used in the above analyses without issue at Step 1), I get the following fail at the make.contigs step:

mothur > make.contigs(file=IDtestStudy.txt, trimoverlap=F, processors=4)

Using 4 processors.

Processing file pair XXXX_1F.fastq - XXXX_1R.fastq (files 1 of 3) <<<<<
MS7_XXXX_1_1101_19506_1072/1 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
MS7_XXXX_1_1110_20061_11698/1 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
MS7_XXXX_1_1119_28400_16794/1 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
MS7_XXXX_1_2110_28119_9527/1 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
Making contigs…
23042 Segmentation fault

I have manually checked the reverse and forward files and there doesn’t appear to be an issue with the headers, and the postdoc used the same files without issue (using the same version of mothur, loaded on the same high computing server…).

I have also tried all the above with different datasets, on a personal Mac with few samples and a server with lots of samples (and few samples) and encountered the same problems in all situations.

Sorry for the long thread but I would appreciate any help at any of the above stages to get on to the next stage of the analysis - my attempts to rectify have failed miserably!

Thank you

westcott · June 25, 2021, 2:46pm

The pre.cluster command converts the the name and group files to a count file. The count file contains the same information as the name and group file, but uses less memory and runs faster. The summary.seqs command with the name file fails because the name file no longer matches the fasta file after the pre.cluster command is run. I recommend using the count file instead of the name and group option.

The pre.cluster command will combine sequences that are within xxx diffs. You set diffs=3, which may be a little high. Have you tried running pre.cluster with diffs=2?

We don’t recommend reverting to older versions. It looks like mothur may be having a hard time with the sequence names. Is the matching read for MS7_XXXX_1_1101_19506_1072/1 MS7_XXXX_1_1101_19506_1072/2? Have you tried the make.contigs command with our current version, Release Version 1.45.3 · mothur/mothur · GitHub. If the current version does not resolve this issue, can you send the fastq file pairs and your logfile to mothur.bugs@gmail.com so I can track down the issue for you?

system · July 5, 2021, 2:46pm

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Pre.cluster removed all sequences Commands in mothur	3	710	July 3, 2022
Issue with pre.cluster Commands in mothur	10	521	October 30, 2023
Help in pre.cluster Commands in mothur	3	325	August 18, 2023
An error occurs while running pre.cluster command Commands in mothur	7	721	February 13, 2023
pre.cluster removes >90 % of sequences? Commands in mothur	3	1216	August 9, 2016

Pre.cluster removes the majority of sequences and names mismatch

Related topics