Lost in the analyses

Good morning. I’m new to mothur and have no idea where to start or what workflow to follow, so I’m basing my analyses on the example analyses available on the site. I have 10 samples from 16S Sanger sequencing, and I’m following the Esophageal community analysis from the site. The first step was to align my sequences against the SILVA database, using the command

align.seqs(fasta=merged.fasta,reference=silva.seed_v138_1.align)
Aligning sequences from merged.fasta …
3
2
2
3
It took 7 secs to align 10 sequences.

[WARNING]: 1 of your sequences generated alignments that eliminated too many bases, a list is provided in merged.flip.accnos.
[NOTE]: 1 of your sequences were reversed to produce a better alignment.

It took 12 seconds to align 10 sequences.

Output File Names:
merged.align
merged.align_report
merged.flip.accnos

I don’t understand whether this [WARNING] and [NOTE] are good or bad. I also don’t understand what the merged.flip.accnos file tells me or what to do with it in the analyses. I carried on with the workflow from the example analysis.

Next, I ran a basic summary to see whether my sequences overlap in the same region of the 16S rRNA gene.

summary.seqs(fasta=merged.align)
Using 4 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 6117 271 0 4 1
2.5%-tile: 1044 6117 271 0 4 1
25%-tile: 1044 15650 496 0 4 3
Median: 1044 16319 536 0 4 6
75%-tile: 1044 21303 562 0 4 8
97.5%-tile: 9820 31135 598 0 5 10
Maximum: 9820 31135 598 0 5 10
Mean: 1921 17574 509 0 4

# of Seqs: 10

It took 0 secs to summarize 10 sequences.

Output File Names:
merged.summary

My set of sequences is small, but I still want to keep only the good-quality ones. So I ran the command
screen.seqs(fasta=merged.align,start=9820,end=6117,maxambig=5)

Using 4 processors.
2
2
3
3

It took 0 secs to screen 10 sequences, removed 0.

[NOTE]: no sequences were bad, removing merged.bad.accnos

Output File Names:
merged.good.align

It took 0 secs to screen 10 sequences.

It took 0 secs to screen 10 sequences.

My command: screen.seqs(fasta=merged.align,start=9820,end=6117,maxambig=5)
The example analysis: screen.seqs(fasta=esophagus.align, group=esophagus.groups, start=204, end=4456, maxambig=5).

I didn’t use group=esophagus.groups because I don’t know how the .groups file was generated in the example analysis; it isn’t clear where it came from. I carried on and hoped for the best.

Next, I ran a summary to see how my sequences turned out.

summary.seqs(fasta=merged.good.align)

Using 4 processors.

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1044 6117 271 0 4 1
2.5%-tile: 1044 6117 271 0 4 1
25%-tile: 1044 15650 496 0 4 3
Median: 1044 16319 536 0 4 6
75%-tile: 1044 21303 562 0 4 8
97.5%-tile: 9820 31135 598 0 5 10
Maximum: 9820 31135 598 0 5 10
Mean: 1921 17574 509 0 4

# of Seqs: 10

It took 0 secs to summarize 10 sequences.

Output File Names:
merged.good.summary

Apparently no sequence was removed and all the features were maintained. The next step was to use the filter command to eliminate gaps.

Using 4 processors.
Creating Filter…
3
2
2
3
It took 0 secs to create filter for 10 sequences.

Running Filter…
3
2
3
2
It took 0 secs to filter 10 sequences.

Length of filtered alignment: 0
Number of columns removed: 50000
Length of the original alignment: 50000
Number of sequences used to construct filter: 10

Output File Names:
merged.filter
merged.good.filter.fasta

There was no change in the set of sequences.

The next step was to build the distance matrix, which I did with the command:

dist.seqs(fasta=merged.good.filter.fasta, output=lt)

Using 4 processors.

Sequence Time Num_Dists_Below_Cutoff
7 0 7
0 0 0
6 0 11
4 0 10
9 0 17

It took 1 secs to find distances for 10 sequences. 45 distances below cutoff 1.

Output File Names:
merged.good.filter.phylip.dist

I didn’t rename the merged.good.filter.phylip.dist file to final.good.filter.phylip.dist as the example does.

Next I started the OTU-based analyses. I ran the command cluster(phylip=merged.good.filter.phylip.dist,cutoff=0.10)

Using 4 processors.

Clustering merged.good.filter.phylip.dist

iter time label num_otus cutoff tp tn fp fn sensitivity specificity ppv npv fdr accuracy mcc f1score

0.10
0 0 0.1 10 0.1 0 0 0 45 0 0 0 0 1 0 0 0
1 0 0.1 10 0.1 0 0 0 45 0 0 0 0 1 0 0 0

It took 0 seconds to cluster

Output File Names:
merged.good.filter.phylip.opti_mcc.list
merged.good.filter.phylip.opti_mcc.steps
merged.good.filter.phylip.opti_mcc.sensspec

These output files (merged.good.filter.phylip.opti_mcc.list, merged.good.filter.phylip.opti_mcc.steps, and merged.good.filter.phylip.opti_mcc.sensspec) are not the same as the ones generated by the example analysis, which are esophagus.fn.sabund, esophagus.fn.list, and esophagus.fn.rabund. What I mean is that I run the same command but get different output files, in different formats.
The next step would be the make.shared command to continue the analysis, but I can’t generate the .an.list file and I don’t know how the .good.groups file was generated.

I’m stuck at this point in my analysis. Unfortunately I’m doing this alone: I don’t have laboratory colleagues to help me, and my boss doesn’t know much about programming or command lines. Please help me, as I need to finish these analyses to complete my master’s work and I’m under serious pressure. I’m exhausted.

Hi - it looks like your Sanger data is not full-length, so you’ll need to use screen.seqs to select sequences that overlap the same region before running filter.seqs. I’d suggest doing something like this with your data…

screen.seqs(fasta=merged.align, start=1044, minlength=400, maxambig=0)

This will make sure that everything starts at the 5’ end of the gene and is at least 400 nt long. You’re losing everything because in your version of screen.seqs the start position was after the end position, so nothing was removed, and by the time you get to filter.seqs you have at least one . in every column. This character indicates missing data to the function. When you use trump=. in filter.seqs, columns with a . are removed, so your filtered alignment is 0 nt long. I’m not actually sure how you were able to run dist.seqs.
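To make that concrete, here is a toy sketch in Python of the two ideas: screening on start/end/minlength, then dropping every column that contains a . for any sequence. It is not mothur's code. The function names and the three 12-column sequences are made up for illustration, and I'm assuming the usual behavior that start removes sequences beginning after that position and end removes sequences ending before it.

```python
# Toy illustration only: this is NOT mothur's implementation.
# '.' marks missing data before/after a read, '-' marks an alignment gap.

def start_end(aligned):
    """Return the 1-based first and last alignment columns that hold a base."""
    cols = [i for i, ch in enumerate(aligned, start=1) if ch not in ".-"]
    return cols[0], cols[-1]

def n_bases(aligned):
    """Count bases, ignoring '.' and '-'."""
    return sum(ch not in ".-" for ch in aligned)

def screen(seqs, start=None, end=None, minlength=None):
    """Keep sequences that begin at or before `start`, end at or after `end`,
    and contain at least `minlength` bases (each test is optional)."""
    kept = {}
    for name, aligned in seqs.items():
        s, e = start_end(aligned)
        if start is not None and s > start:
            continue
        if end is not None and e < end:
            continue
        if minlength is not None and n_bases(aligned) < minlength:
            continue
        kept[name] = aligned
    return kept

def filter_columns(seqs, trump="."):
    """Drop columns where any sequence has the trump character,
    and columns that are gaps in every sequence."""
    names = list(seqs)
    width = len(seqs[names[0]])
    keep = []
    for c in range(width):
        column = [seqs[n][c] for n in names]
        if trump in column or all(ch == "-" for ch in column):
            continue
        keep.append(c)
    return {n: "".join(seqs[n][c] for c in keep) for n in names}

# Hypothetical alignment: seqC covers a different region than seqA and seqB.
toy = {
    "seqA": "ACGTAC......",
    "seqB": "ACG-ACGT....",
    "seqC": "......ACGTAC",
}

# Start set to the latest start and end to the earliest end: nothing is removed,
# and every column then has a '.' somewhere, so the filtered alignment is empty.
loose = screen(toy, start=7, end=6)
print(len(loose), repr(filter_columns(loose)["seqA"]))

# Require the common start and a minimum length: seqC is dropped,
# and the remaining sequences share columns that survive the filter.
strict = screen(toy, start=1, minlength=6)
print(sorted(strict), filter_columns(strict))
```

The first call keeps all three sequences and filters down to nothing, which mirrors the "Length of filtered alignment: 0" result above. The second call removes the sequence that sits in a different region, and six columns survive.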

Thanks Pat, I’ll try your suggestion and get back to you. About dist.seqs: it did actually run, given the conditions it was applying to my data. And the clustering always returned the output files merged.good.filter.phylip.opti_mcc.list, merged.good.filter.phylip.opti_mcc.steps, and merged.good.filter.phylip.opti_mcc.sensspec, which are not like the outputs from the example analysis (esophagus.fn.sabund, esophagus.fn.list, and esophagus.fn.rabund). Pat, I also don’t understand what the groups files are (for example, file.groups), what the names files are (file.names), or how to generate them.
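On the groups file: as far as I know it is just a plain text file with two tab-separated columns, the sequence name and the sample (group) that sequence came from. A names file is similar in shape: a representative sequence name, then a comma-separated list of the identical sequences it stands for, normally written by unique.seqs. Below is a minimal Python sketch that builds the lines of a groups file from FASTA headers. The headers and the rule "sample name is everything before the first underscore" are hypothetical, so swap in whatever naming your own sequences use. I believe mothur also has a make.group command for this, but check the wiki for its exact parameters.

```python
import io

def read_fasta_names(handle):
    """Yield the sequence name (first word after '>') of each FASTA record."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].split()[0]

def make_group_lines(names, name_to_group):
    """Return one 'sequence<TAB>group' line per sequence name."""
    return [f"{name}\t{name_to_group(name)}" for name in names]

# Hypothetical FASTA content; in practice you would open merged.fasta instead.
fasta_text = (
    ">patientA_clone1\nACGT\n"
    ">patientA_clone2\nACGA\n"
    ">patientB_clone1\nACGG\n"
)

names = list(read_fasta_names(io.StringIO(fasta_text)))

# Assumed naming rule: the sample is the part of the name before the first '_'.
group_lines = make_group_lines(names, lambda n: n.split("_")[0])
print("\n".join(group_lines))
```

Writing group_lines to a file ending in .groups, one line each, would give something you could pass as group= to screen.seqs and later to make.shared.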
