mothur

Start and End differs after remove.seqs compared to screen.seqs

Hi,

I would be most grateful for your help please. I am getting very different start and end (1 - 502 after the remove.seqs; previously 1 - 249 after screen.seqs) from your MiSEQ SOP. I am using the v4 region, and the start and end has been almost the same for the all the previous steps.

These are the instructions and output so far, notably, I did attempt to go through the MiSeq mothur SOP over two separate timings, and wondering if I have missed a step/used the wrong files!

[NOTE]: Setting random seed to 19760620.

Interactive Mode



mothur > set.dir(input=/data/M
M01359_150_000000000-APKK7_1_1101_12712_1956  M01359_150_000000000-APKK7_1_1101_16430_1874
M01359_150_000000000-APKK7_1_1101_14955_1878  M01359_150_000000000-APKK7_1_1101_18618_1866
M01359_150_000000000-APKK7_1_1101_15214_1945  MMP_gutmicrobiome_2020/

mothur > set.dir(input=/data/MMP_gutmicrobiome_2020/)
Mothur's directories:
inputDir=/data/MMP_gutmicrobiome_2020/

Mothur > make.contigs(file=inputMMPHC.txt)

Total of all groups is 12889892

It took 4137 secs to process 12889892 sequences.

Output File Names: 
inputfileMMPHC.trim.contigs.fasta
inputfileMMPHC.scrap.contigs.fasta
inputfileMMPHC.contigs.report
inputfileMMPHC.contigs.groups

Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       244     244     0       3       1
2.5%-tile:      1       252     252     0       3       322248
25%-tile:       1       253     253     0       4       3222474
Median:         1       253     253     0       4       6444947
75%-tile:       1       253     253     0       5       9667420
97.5%-tile:     1       253     253     8       6       12567645
Maximum:        1       502     502     116     251     12889892
Mean:   1       252     252     0       4
# of Seqs:      12889892

It took 101 secs to summarize 12889892 sequences.

Output File Names:
inputfileMMPHC.trim.contigs.summary

mothur > screen.seqs(fasta=inputfileMMPHC.trim.contigs.fasta , group=inputfileMMPHC.contigs.groups , summary=inputfileMMPHC.trim.contigs.summary , maxambig=0, maxlength=275)

Using 8 processors.

It took 136 secs to screen 12889892 sequences, removed 2428829.

/******************************************/
Running command: remove.seqs(accnos=inputfileMMPHC.trim.contigs.bad.accnos.temp, group=inputfileMMPHC.contigs.groups)
Removed 2428829 sequences from your group file.

Output File Names:
inputfileMMPHC.contigs.pick.groups

/******************************************/

Output File Names:
inputfileMMPHC.trim.contigs.good.summary
inputfileMMPHC.trim.contigs.good.fasta
inputfileMMPHC.trim.contigs.bad.accnos
inputfileMMPHC.contigs.good.groups


It took 275 secs to screen 12889892 sequences.


mothur > summary.seqs(fasta=inputfileMMPHC.trim.contigs.good.fasta )

Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       249     249     0       3       1
2.5%-tile:      1       252     252     0       3       261527
25%-tile:       1       253     253     0       4       2615266
Median:         1       253     253     0       4       5230532
75%-tile:       1       253     253     0       5       7845798
97.5%-tile:     1       253     253     0       6       10199537
Maximum:        1       274     274     0       11      10461063
Mean:   1       252     252     0       4
# of Seqs:      10461063

It took 86 secs to summarize 10461063 sequences.

Output File Names:
inputfileMMPHC.trim.contigs.good.summary

mothur > unique.seqs(fasta=inputfileMMPHC.trim.contigs.good.fasta )

Output File Names: 
inputfileMMPHC.trim.contigs.good.names
inputfileMMPHC.trim.contigs.good.unique.fasta

mothur > count.seqs(name=inputfileMMPHC.trim.contigs.good.names , group=inputfileMMPHC.contigs.good.groups )


It took 120 secs to create a table for 10461063 sequences.

Total number of sequences: 10461063

Output File Names: 
inputfileMMPHC.trim.contigs.good.count_table

mothur > summary.seqs(count=inputfileMMPHC.trim.contigs.good.count_table )
Using inputfileMMPHC.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 8 processors.

                **Start   End     NBases  Ambigs  Polymer NumSeqs**
**Minimum:        1       249     249     0       3       1**
**2.5%-tile:      1       252     252     0       3       261527**
**25%-tile:       1       253     253     0       4       2615266**
**Median:         1       253     253     0       4       5230532**
**75%-tile:       1       253     253     0       5       7845798**
**97.5%-tile:     1       253     253     0       6       10199537**
**Maximum:        1       274     274     0       11      10461063**
**Mean:   1       252     252     0       4**
**# of unique seqs:       363479**
**total # of seqs:        10461063**

**It took 10 secs to summarize 10461063 sequences.**

Output File Names:
inputfileMMPHC.trim.contigs.good.unique.summary

(base) ubuntu@liying-biggvl:/data/MMP_gutmicrobiome_2020$ wget https://mothur.s3.us-east-2.amazonaws.com/wiki/silva.seed_v138.tgz
--2020-06-24 21:08:53--  https://mothur.s3.us-east-2.amazonaws.com/wiki/silva.seed_v138.tgz
Resolving mothur.s3.us-east-2.amazonaws.com (mothur.s3.us-east-2.amazonaws.com)... 52.219.104.40
Connecting to mothur.s3.us-east-2.amazonaws.com (mothur.s3.us-east-2.amazonaws.com)|52.219.104.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13559908 (13M) [application/x-gzip]
Saving to: 'silva.seed_v138.tgz'

silva.seed_v138.tgz                                         100%[=========================================================================================================================================>]  12.93M  10.4MB/s    in 1.2s    

2020-06-24 21:08:55 (10.4 MB/s) - 'silva.seed_v138.tgz' saved [13559908/13559908]

(base) ubuntu@liying-biggvl:/data/MMP_gutmicrobiome_2020$ tar -xvzf silva.seed_v138.tgz 
silva.seed_v138.tax
silva.seed_v138.align
README.md

mothur > pcr.seqs(fasta=silva.seed_v138.align , start=11894, end=25319, keepdots=F)   

Using 8 processors.
957
957
957
957
956
957
957
958
[NOTE]: no sequences were bad, removing silva.seed_v138.bad.accnos

It took 2 secs to screen 7656 sequences.

Output File Names: 
silva.seed_v138.pcr.align

mothur > rename.file(input=silva.seed_v138.align, new=silva.v4.align) 

Current files saved by mothur:
fasta=silva.seed_v138.pcr.align
processors=8

mothur > rename.file(input=silva.v4.align, new=silva.v4.fasta)

Current files saved by mothur:
fasta=silva.seed_v138.pcr.align
processors=8
mothur > summary.seqs(fasta=silva.v4.fasta )

Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
:
silva.v4.summaryMinimum:        1044    42803   1336    0       4       1
2.5%-tile:      1044    43116   1400    0       5       192
25%-tile:       1044    43116   1444    0       5       1915
Median:         1044    43116   1463    0       6       3829
75%-tile:       1044    43116   1500    0       6       5743
97.5%-tile:     1044    43116   1748    4       7       7465
Maximum:        1103    43116   2838    5       12      7656
Mean:   1044    43115   1520    0       5
# of Seqs:      7656

It took 2 secs to summarize 7656 sequences.

Output File Names

mothur > rename.file(input=silva.seed_v138.pcr.align, new=silva.v4.align)

Current files saved by mothur:
processors=8

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       12067   261     0       3       1
2.5%-tile:      1       13425   292     0       4       192
25%-tile:       1       13425   293     0       4       1915
Median:         1       13425   293     0       5       3829
75%-tile:       1       13425   296     0       6       5743
97.5%-tile:     1       13425   461     1       6       7465
Maximum:        3       13425   1122    5       10      7656
Mean:   1       13424   333     0       4
# of Seqs:      7656

It took 1 secs to summarize 7656 sequences.

Output File Names:
Silva.v4.summary

mothur > align.seqs(fasta=inputfileMMPHC.trim.contigs.good.unique.fasta , reference=silva.v4.align )

Using 8 processors.

It took 117 secs to align 363479 sequences.

[WARNING]: 18 of your sequences generated alignments that eliminated too many bases, a list is provided in inputfileMMPHC.trim.contigs.good.unique.flip.accnos.
[NOTE]: 6 of your sequences were reversed to produce a better alignment.

It took 117 seconds to align 363479 sequences.

Output File Names:
inputfileMMPHC.trim.contigs.good.unique.align
inputfileMMPHC.trim.contigs.good.unique.align.report
inputfileMMPHC.trim.contigs.good.unique.flip.accnos

mothur > summary.seqs(fasta=inputfileMMPHC.trim.contigs.good.unique.align, count=inputfileMMPHC.trim.contigs.good.count_table )

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	1235	1	0	1	1
2.5%-tile:	1968	11550	252	0	3	261527
25%-tile:	1968	11550	253	0	4	2615266
Median: 	1968	11550	253	0	4	5230532
75%-tile:	1968	11550	253	0	5	7845798
97.5%-tile:	1968	11550	253	0	6	10199537
Maximum:	13425	13425	273	0	11	10461063
Mean:	1968	11549	252	0	4
# of unique seqs:	363479
total # of seqs:	10461063

It took 31 secs to summarize 10461063 sequences.

Output File Names:
inputfileMMPHC.trim.contigs.good.unique.summary

mothur > screen.seqs(fasta=inputfileMMPHC.trim.contigs.good.unique.align, count=inputfileMMPHC.trim.contigs.good.count_table, summary=inputfileMMPHC.trim.contigs.good.unique.summary , start=1968, end=11550, maxhomop=8)

It took 80 secs to screen 363479 sequences, removed 4959.

/******************************************/
Running command: remove.seqs(accnos=inputfileMMPHC.trim.contigs.good.unique.bad.accnos.temp, count=inputfileMMPHC.trim.contigs.good.count_table)
Removed 48300 sequences from your count file.

Output File Names:
inputfileMMPHC.trim.contigs.good.pick.count_table

/******************************************/

Output File Names:
inputfileMMPHC.trim.contigs.good.unique.good.summary
inputfileMMPHC.trim.contigs.good.unique.good.align
inputfileMMPHC.trim.contigs.good.unique.bad.accnos
inputfileMMPHC.trim.contigs.good.good.count_table


It took 94 secs to screen 363479 sequences.

mothur > summary.seqs(fasta=current, count=current)
Using inputfileMMPHC.trim.contigs.good.good.count_table as input file for the count parameter.
Using inputfileMMPHC.trim.contigs.good.unique.good.align as input file for the fasta parameter.

Using 8 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       11550   249     0       3       1
2.5%-tile:      1968    11550   252     0       3       260320
25%-tile:       1968    11550   253     0       4       2603191
Median:         1968    11550   253     0       4       5206382
75%-tile:       1968    11550   253     0       5       7809573
97.5%-tile:     1968    11550   253     0       6       10152444
Maximum:        1968    13425   273     0       8       10412763
Mean:   1967    11550   252     0       4
# of unique seqs:       358520
total # of seqs:        10412763

It took 31 secs to summarize 10412763 sequences.

Output File Names:
inputfileMMPHC.trim.contigs.good.unique.good.summary

mothur > filter.seqs(fasta=inputfileMMPHC.trim.contigs.good.unique.good.align , vertical=T, trump=.)

It took 21 secs to filter 358520 sequences.



Length of filtered alignment: 502
Number of columns removed: 12923
Length of the original alignment: 13425
Number of sequences used to construct filter: 358520

Output File Names: 
inputfileMMPHC.filter
inputfileMMPHC.trim.contigs.good.unique.good.filter.fasta

mothur > unique.seqs(fasta=inputfileMMPHC.trim.contigs.good.unique.good.filter.fasta , count = inputfileMMPHC.trim.contigs.good.good.count_table )

Output File Names: 
inputfileMMPHC.trim.contigs.good.unique.good.filter.count_table
inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.fasta

mothur > pre.cluster(fasta=inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.fasta , count = inputfileMMPHC.trim.contigs.good.unique.good.filter.count_table , diffs=2)

mothur > chimera.vsearch(fasta=inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.fasta , count = inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.count_table , dereplicate=t)

Output File Names:
inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table
inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.chimeras
inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos

mothur > remove.seqs(fasta=inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.fasta , accnos=inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.accnos )

Removed 78744 sequences from your fasta file.

Output File Names:
inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta

mothur > summary.seqs(fasta=current, count=current)
Using inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.count_table as input file for the count parameter.
Using inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta as input file for the fasta parameter.

Using 8 processors.

                **Start   End     NBases  Ambigs  Polymer NumSeqs**
**Minimum:        1       502     229     0       3       1**
**2.5%-tile:      1       502     252     0       3       252954**
**25%-tile:       1       502     253     0       4       2529535**
**Median:         1       502     253     0       4       5059069**
**75%-tile:       1       502     253     0       5       7588603**
**97.5%-tile:     1       502     253     0       6       9865183**
**Maximum:        1       502     259     0       8       10118136**
**Mean:   1       502     252     0       4**
**# of unique seqs:       36090**
**total # of seqs:        10118136**

It took 0 secs to summarize 10118136 sequences.

Output File Names:
inputfileMMPHC.trim.contigs.good.unique.good.filter.unique.precluster.pick.summary

I don’t think anything is wrong. The output when the sequences are around 250 nt is on sequences that are not aligned. Those that are around 500 have been aligned.

Pat

Hi Pat,

Thank you for your prompt reply. So sorry, I am still unclear about your explanation (i’m still a beginner at this!).

What does the start and end column refer to? How come they do not correspond to the number of bases (?Nbases)? Apologies - I’m sure you have mentioned this before!

Start is where the bases start and end is where they end. When the sequences are aligned there is often padding before the first bases. Also, because there will be gaps between bases in the aligned sequences the end position will be different than the total number of bases.

Thank you for the clarification!

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.