unique.seqs & abundance

Hi,

This is a very trivial question (and kinda obvious), but just to be sure, I'll ask it anyway…

As the SOP says, working only with unique seqs saves both memory and time… but IF one is interested in the abundances of OTUs, e.g. in samples taken at different time points, you can't process only the unique seqs (e.g. in dist.seqs), right? And then in some magical way use e.g. the deunique.seqs command to bring their abundances back in?

Thanks,

J

Easier than that - when you run cluster and then make.shared you bring the name file and the abundance of each sequence back in.

Make sense?
Pat
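For reference, the OTU-based path Pat is describing looks roughly like this; the file names are just placeholders, and the exact output names depend on your data and mothur version:

dist.seqs(fasta=final.unique.fasta, cutoff=0.20)
cluster(column=final.unique.dist, name=final.names, cutoff=0.15)
make.shared(list=final.unique.an.list, group=final.groups, label=0.03)

The name file handed to cluster() is what folds the abundance of each redundant read back in, so the resulting shared file reports counts per sample, not just unique sequences.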

When you use the unique.seqs() command, a .names file is generated, which is basically an index of which redundant reads each unique read represents. This information is re-incorporated into your dataset in later steps, depending on the analysis path you take. If you go the OTU-based route, information from the names file is used during cluster(), so by the time you get to your shared file (the mothur equivalent of a biom file) the abundance data is accounted for. If you analyse phylogenetically (unifrac etc.), the names file is used during the analysis to account for abundance.
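To make that concrete: a .names file is just a two-column, tab-separated table, with the name of each unique read on the left and a comma-separated list of every read it stands in for (including itself) on the right. A toy example with invented sequence names:

seq001	seq001,seq047,seq113
seq002	seq002
seq003	seq003,seq250

Here seq001 represents three reads and seq002 only itself, so three unique sequences carry the abundance of six total reads.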

Thanks guys! That makes sense indeed…

I still have one minor question… So, I ran unique.seqs on my final.fasta (without the names file) and checked that it worked using summary.seqs, but when comparing the size and the number of lines of the two fasta files (pre and post unique.seqs), they are exactly the same… How, then, can this be less computationally demanding when running e.g. dist.seqs?

Thanks again,

Cheers

You need to include the most recent names file at each step - by running unique.seqs(fasta=final.fasta) without it, you've effectively discarded the record of all the duplicate sequences from before. At this point in the SOP everything should be unique'd already, but the proper syntax would be unique.seqs(fasta=final.fasta, name=final.names)
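In other words, keep the name file paired with the fasta whenever you dereplicate, and check the result with the name file included as well (file names are placeholders):

unique.seqs(fasta=final.fasta, name=final.names)
summary.seqs(fasta=current, name=current)

With the name file supplied, summary.seqs reports both '# of unique seqs' and 'total # of seqs', so you can see the collapse without losing track of the total abundance.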

Thanks for replying,

OK. But it should be possible to run it at a later stage as well, right? I didn't run unique.seqs earlier (because of my concern about the abundances), but realized that I really need to now, when running dist.seqs. I tried running unique.seqs together with the final.names file as well, but that didn't discard any of the sequences when I checked against summary.seqs. It did when I ran it without the name file. Still, the file is the same size and contains the same number of rows (i.e. seqs), which I don't really understand.

No, you need to process the fasta, group, and names files in parallel.
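For example, a screening step in the SOP passes all three files together so they stay in sync (the thresholds here are illustrative, not a recommendation for your data):

screen.seqs(fasta=final.fasta, name=final.names, group=final.groups, maxambig=0, maxhomop=8)

Any command that removes or renames sequences should get the current name (and group) file, so the abundance and sample bookkeeping follow along.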

Hi,

So, just for the sake of it, I re-ran the workflow, using the unique.seqs commands where the SOP calls for them. The summary outputs are almost identical, but the weird thing is that when counting the number of lines (i.e. seqs) in the .fasta and .names files… the non-unique and unique-processed files contain sequences in the same order of magnitude. Actually, the non-unique fasta file only contains the unique seqs if one compares to the summary output… which the unique-processed one also does. So, from what I can tell there is no difference?? (see below for outputs)
confused

Many many thanks,


Non-unique

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 933 210 0 3 1
2.5%-tile: 1 935 225 0 3 17814
25%-tile: 1 935 241 0 4 178132
Median: 1 935 256 0 5 356263
75%-tile: 1 935 266 0 5 534394
97.5%-tile: 1 935 276 0 6 694711
Maximum: 1 935 302 0 8 712524
Mean: 1 935 253.828 0 4.70271

# of unique seqs: 61864

total # of seqs: 712524

wc -l
123728 final.final.fasta
61864 final.final.names

Unique-processed


Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 933 210 0 3 1
2.5%-tile: 1 935 225 0 3 17917
25%-tile: 1 935 241 0 4 179167
Median: 1 935 256 0 5 358333
75%-tile: 1 935 266 0 5 537499
97.5%-tile: 1 935 276 0 6 698749
Maximum: 1 935 302 0 8 716665
Mean: 1 935 253.781 0 4.69973

# of unique seqs: 60966

total # of seqs: 716665

wc -l
121932 final.fasta
60966 final.names

the non-unique fasta file only contains the unique seqs if one compares to the summary output

Sorry, I don’t know what you’re trying to say here.


It's hard to say why you have a slight difference in the number of sequences between the two approaches, since I'm still not clear what you've done. Perhaps the different sequence frequencies have affected chimera.seqs and thrown things off there.

wc -l
121932 final.fasta
60966 final.names

Each sequence in the fasta file takes up two lines and each sequence in the name file takes up one line. Perhaps I don’t understand your question.
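A quick way to compare the counts directly, using standard shell commands on the files from the output above:

grep -c '>' final.final.fasta   # counts header lines, i.e. sequences in the fasta
wc -l final.final.names         # one line per unique sequence

123728 fasta lines / 2 = 61864 sequences, which matches the 61864 lines in the names file, so the two files agree.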

Hi Pat,

Sorry for not making myself clear. I know that between the two 'rounds' of running the SOP I got slightly different results. I would imagine that to be normal, even if you start out with exactly the same files(?). And I know the .fasta files have two lines per sequence. So here's the thing I did. I ran the SOP with and without implementing the unique.seqs() cmd. The first time without, because I didn't get how the abundances were treated, and the second time with, because I wanted the process to be less computationally demanding (starting at dist.seqs). And from what I understood, you can't just run the unique.seqs cmd, for example, right before dist.seqs to make it run faster.

My point (rather question, confusion) is that I didn't see any difference (in the final files) between running the SOP with or without unique.seqs. So, e.g., the final.using.unique.seqs.cmd.fasta contained as many sequences (in the same order of magnitude) as the final.not.using.unique.seqs.cmd.fasta. I would imagine that without the unique.seqs cmd all sequences should be treated as unique, and thus the final.not.using.unique.seqs.cmd.fasta should contain the total number of sequences, rather than only the unique sequences, which I imagine the final.using.unique.seqs.cmd.fasta should contain(?). So what I don't understand is why I don't see a difference between the two .fasta files, and thus what running unique.seqs gains in terms of reducing the computational burden.

I have probably misunderstood some major thing here…

Many thanks for taking the time to answer my confused questions!

Did you run pre.cluster by both methods? If so, pre.cluster will unique the sequences before running which might explain the similarity in numbers.
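For reference, pre.cluster takes the fasta, name, and group files together and dereplicates the sequences internally before denoising, roughly like this (file names are placeholders, and diffs depends on your read length):

pre.cluster(fasta=final.unique.fasta, name=final.unique.names, group=final.groups, diffs=2)

so even a run that skipped an explicit unique.seqs call ends up with only unique sequences in its output fasta.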

Ah, ok! Indeed, I have.

Thanks for getting my head around it!