help with unique.seqs

Hi! I’m doing an OTU analysis with mothur. Since I started I’ve had a lot of problems learning to use it. Dr. Schloss helped me before, and I need help again.

I have trouble with the unique.seqs command. This is the second time I use it in my SOP; the first time it ran without problems.

mothur > unique.seqs(fasta=sff.unique.good.filter.fasta, name=sff.unique.good.names)
1000 11
2000 13
3000 13
4000 14
5000 16
6000 16
7000 17
8000 17
9000 17
10000 18
11000 19
12000 20
13000 20
14000 20
15000 20
16000 20
17000 20
18000 20
19000 20
20000 20
21000 20
22000 20
23000 20
24000 20
25000 20
26000 20
27000 20
28000 20
29000 20
30000 20
31000 20
32000 20
33000 20
34000 20
35000 20
36000 21
37000 21
38000 21
39000 21
40000 21
41000 21
42000 21
43000 22
44000 22
45000 22
46000 22
47000 22
48000 22
49000 22
50000 22
51000 22
52000 22
53000 22
54000 22
55000 22
56000 22
57000 22
58000 22
59000 22
60000 22
61000 22
62000 22
63000 22
64000 22
65000 22
66000 22
67000 22
68000 22
[ERROR]: You already have a sequence named H33W2WV03C1OF6 in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03CY8LW in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03DL1K2 in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03CXYJ2 in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03DMZB1 in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03DFD3E in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03C5ZHH in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03C70JX in your fasta file, sequence names must be unique, please correct.
[ERROR]: H33W2WV03DDBEA is in your fasta file, and not in your namefile, please correct.
[ERROR]: H33W2WV03C32KZ is in your fasta file, and not in your namefile, please correct.
[ERROR]: H33W2WV03C7B8D is in your fasta file, and not in your namefile, please correct.
[ERROR]: You already have a sequence named H33W2WV03DCWP4 in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03CUN7E in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03C35BY in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03DGP02 in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03C8OZM in your fasta file, sequence names must be unique, please correct.
[ERROR]: H33W2WV03C8E5R is in your fasta file, and not in your namefile, please correct.
[ERROR]: You already have a sequence named H33W2WV03C2ABT in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03C62O6 in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03DHHMS in your fasta file, sequence names must be unique, please correct.
[ERROR]: H33W2WV03DC94X is in your fasta file, and not in your namefile, please correct.
[ERROR]: You already have a sequence named H33W2WV03DBH4V in your fasta file, sequence names must be unique, please correct.
[ERROR]: You already have a sequence named H33W2WV03C7QT7 in your fasta file, sequence names must be unique, please correct.
[ERROR]: H33W2WV03DD7M8 is in your fasta file, and not in your namefile, please correct.
[ERROR]: H33W2WV03DH577 is in your fasta file, and not in your namefile, please correct.


It continues like this until it finishes. I solved the "[ERROR]: H33W2WV03DD7M8 is in your fasta file, and not in your namefile, please correct." problem by using a previous names file, but I don’t know how to solve the duplicate names in my fasta file.

These are the command lines I used previously:

```text
sff.multiple(file=sff.txt, processors=20)
unique.seqs(fasta=sff.fasta, name=sff.names)
summary.seqs(fasta=sff.unique.fasta, name=sff.unique.names)
align.seqs(fasta=sff.unique.fasta, reference=silva.bacteria.fasta, processors=20, search=kmer, flip=t)
summary.seqs(fasta=sff.unique.align, name=sff.unique.names)
screen.seqs(fasta=sff.unique.align, name=sff.unique.names, group=sff.groups, start=1044, end=5709, optimize=start, criteria=95, processors=20)
summary.seqs(fasta=sff.unique.good.align, name=sff.unique.good.names)
filter.seqs(fasta=sff.unique.good.align, vertical=T, trump=., processors=20)
```

These are the sff files I used in the sff.multiple command:

2013_02_06_MikoJorquera_PASTURE2_SFF.sff barcodes.oligos
2013_02_06_MikoJorquera_PASTURE1_SFF.sff barcodes.oligos

and the output files are named sff.fasta, sff.names and sff.groups




Can I continue without using the unique.seqs command in this step? Is there a way to solve it? I would be really glad if somebody could help me find an answer.
Oscar :)

Hi,

What kind of data are you analysing? It seems that after going through 68,000 sequences mothur has found only 22 unique sequences. If you are working with samples of such low richness that is OK, but if you are working with, e.g., environmental samples I would expect a higher number of unique seqs. For instance, I just ran unique.seqs for the second time (after filter.seqs) on my peatland and lake sediment samples: of 68,000 sequences, mothur found 40,000 unique sequences.

If you would expect a higher richness, you could check your alignment after filter.seqs (run summary.seqs). What is the length of your alignment?
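As a quick sanity check outside mothur, the same questions (how many sequences, how many distinct ones, how long they are once gaps are removed) can be sketched in a few lines of Python. This is only an illustration, not a replacement for summary.seqs; the function names `read_fasta` and `summarize_alignment` are made up for this example, and it assumes that `-` and `.` are the only gap characters in the aligned file.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize_alignment(lines):
    """Return (total sequences, distinct sequences, {ungapped length: count})."""
    total = 0
    distinct = set()
    lengths = Counter()
    for _, seq in read_fasta(lines):
        total += 1
        distinct.add(seq)
        # Assumption: '-' and '.' are the only gap symbols in the alignment.
        bases = seq.replace("-", "").replace(".", "")
        lengths[len(bases)] += 1
    return total, len(distinct), dict(lengths)


# Toy data; for a real file, pass an open file handle instead of a list.
example = [">a", "AC-G..", ">b", "AC-G..", ">c", "A-----"]
print(summarize_alignment(example))
```

If almost every sequence ends up only 1 or 2 bases long, the filtering step has probably removed nearly all columns, which would also explain a very small number of unique sequences.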

Best,

Antti Rissanen
Jyväskylä University
Finland

The duplicate names could be coming from using the same barcode file for multiple sff files with multiple processors. Can you try it with processors=1?
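To see which names are affected before rerunning anything, a small script along these lines could list the names that occur more than once in the fasta file and the names that are absent from the names file. It is a rough sketch: `fasta_names` and `check_names` are hypothetical helpers, and it assumes the names file holds the representative sequence name in its first column.

```python
from collections import Counter


def fasta_names(lines):
    """Return the sequence names (first token after '>') in file order."""
    names = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.append(parts[0])
    return names


def check_names(fasta_lines, names_lines):
    """Return (names duplicated in the fasta, fasta names missing from the names file)."""
    names = fasta_names(fasta_lines)
    counts = Counter(names)
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    # Assumption: the first column of the names file is the representative name.
    representatives = {line.split()[0] for line in names_lines if line.strip()}
    missing = sorted(set(names) - representatives)
    return duplicates, missing


# Toy data; for real files, pass open file handles instead of lists.
fasta = [">s1", "ACGT", ">s2", "ACGA", ">s1", "ACGT", ">s3", "AAAA"]
names = ["s1\ts1,s4", "s2\ts2"]
duplicates, missing = check_names(fasta, names)
print("duplicated:", duplicates)
print("not in names file:", missing)
```

If the duplicated names all trace back to the sff.multiple output, that would be consistent with the multiple-processor explanation.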

Hi!, Thanks for your answers.

risku, when I ran summary.seqs I realized that my data were corrupt (all my sequences were 1 or 2 bp long), so I decided to start again from the outputs of my sff.multiple run. When I did this I could continue with the next steps.

westcott, I followed your tip and ran my analysis using just 1 core (processors=1) to be sure that it wouldn’t fail due to problems with parallel analysis.

I could continue, but now I don’t know what to do with the mock community used in the 454 SOP.

When I ran filter.seqs(fasta=HMP_MOCK.v35.align, hard=sff.filter) I didn’t know what to do next, because the instruction in the SOP is as follows:

 get.groups(fasta=sff.unique.good.filter.unique.precluster.pick.pick.fasta, group=sff.good.pick.pick.groups, name=sff.unique.good.filter.unique.precluster.pick.pick.names, groups=MOCK.GQY1XT001)

I don’t understand very well how to use the mock community. I think it is a sequencing of standardized organisms, but I’m not sure whether these files can be applied to other sequencing projects. I read a similar question asked previously, where Dr. Schloss said that I had to make a pool of 16S samples.

Can I use the HMP_MOCK.v35.align file and look for the MOCK.GQY1XT001 file?
Do I have to skip these steps?
Is there another option?

So sorry for asking so much, but it’s the first time I use mothur and nobody I know has used it before.

Thanks for helping me :)

Oscar,

Hi Oscar,

I’m kind of new to this, too, but I’ll give it a shot. The mock file is simply a tailor-made group of known sequences from known organisms from the basic community you are assessing, which you can use to estimate error rates among your experimental group. If I’m understanding part of your question correctly, the mock file you create is fairly specific to the community you wish to evaluate. The more your community structure differs from the mock community, the less valid your error estimation.

Cheers,
Mike

Thanks for your answer. If I understand correctly, I need a mock file because it is a known piece of information about my experiments that I need to estimate my error rate.

So I need a mock file made by me or my lab team, but I don’t have one. What I can do is try to compile sequences from similar work in this lab. From this I have 2 questions.

Can I skip these steps, or is it really important to use these files?
If I find sequences from known organisms, how do I make a mock file? I saw a script, but I don’t know if this is the optimal procedure (the file format is .align, not .fasta or something else).

Thanks everyone for helping out - this is great to see people helping others!

The Mock community isn’t critical to the analysis. If you want to do an error analysis, you have to sequence a defined mock community in parallel with your real samples and process them in parallel too. Then to calculate the error rate, you have to know the actual sequence of the fragments in the mock community. Ideally this would be based on high quality Sanger sequence data that you have 100% trust in.
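To make the idea of an error rate concrete, here is a simplified Python sketch that compares one aligned read with the known reference sequence it came from and reports the fraction of positions that disagree. It is not how mothur computes it (mothur has its own seq.error command for the real analysis); the function name `error_rate` is made up for this illustration, and it assumes both sequences come from the same alignment, so they have equal length and use `-` or `.` for gaps.

```python
GAPS = {"-", "."}


def error_rate(read, reference):
    """Fraction of aligned positions where the read differs from the reference."""
    if len(read) != len(reference):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    errors = 0
    for r, ref in zip(read.upper(), reference.upper()):
        if r in GAPS and ref in GAPS:
            continue  # column is empty in both sequences, nothing to compare
        compared += 1
        if r != ref:
            # Counts substitutions as well as a gap against a base
            # (an insertion or deletion in the read).
            errors += 1
    return errors / compared if compared else 0.0


# Toy example: one substitution over five compared positions.
print(error_rate("ACGT-A", "ACGA-A"))
```

The result is only as good as the reference: if the true sequence of the mock organisms is uncertain, apparent errors may be real differences rather than sequencing mistakes.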

Pat