unique.seqs command

Hi,

I want to analyze data from a Titanium 454 run containing 16S V1-V3 region amplicons with an average length of ~600 bp. Because the amplicon is longer than the maximum read length, the data set contains sequences of varying read lengths.
As a result, the unique.seqs command finds very few identical sequences (many sequences are nearly identical but differ slightly in read length). Consequently, I end up with a unique.fasta file that still contains 200,000-300,000 sequences. Clustering then becomes a problem: mothur runs for days, and the distance matrix has already reached 30 GB without finishing.

My question is whether it is possible to collapse a sequence into a longer one when the two are identical over the full length of the shorter sequence. In other words, I would like to treat sequences that are identical but differ slightly in length as identical, and include only the longest of them in my unique.fasta file.

Is this currently possible with mothur or can it be made possible? Or do you have any other suggestions to overcome this problem?

Thanks,
John

John,

The solution is to run unique.seqs twice: first as you describe, then again after you align the sequences and filter them. After that second pass, run dist.seqs, cluster, etc. It’s a bad idea to compare sequences that don’t completely overlap, because the 16S rRNA gene does not vary uniformly along its length.
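To illustrate why the second unique.seqs pass helps, here is a minimal Python sketch, not mothur's implementation. It assumes aligned reads of equal length in which `.` marks positions a read does not cover; the read names and sequences are made up for the example. Trimming every read to the columns that all reads cover makes reads that differed only in length identical, so they collapse.

```python
def trim_to_shared_columns(aligned):
    """Keep only the alignment columns that every sequence covers.

    `aligned` maps name -> aligned string (all the same length), where '.'
    marks positions outside the read (an assumed convention for this sketch).
    """
    length = len(next(iter(aligned.values())))
    keep = [i for i in range(length)
            if all(seq[i] != "." for seq in aligned.values())]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in aligned.items()}


def unique_seqs(seqs):
    """Group sequence names by identical sequence string."""
    groups = {}
    for name, seq in seqs.items():
        groups.setdefault(seq, []).append(name)
    return groups


# Hypothetical aligned reads: read1-3 differ only in where they end,
# read4 carries a real mismatch.
aligned = {
    "read1": "ACGTACGTAC..",
    "read2": "ACGTACGTACGT",
    "read3": "ACGTACGT....",
    "read4": "ACGAACGTACGT",
}

before = unique_seqs(aligned)                       # every read is unique
after = unique_seqs(trim_to_shared_columns(aligned))  # length-only variants merge
```

In this toy data the first pass finds four unique sequences, while the pass after trimming finds two: the three length variants together, and the read with the mismatch on its own.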

Pat

I second what Pat says.

Also, be sure to run summary.seqs() and look for long homopolymers and N’s, then screen.seqs() to get rid of them. These problems were really messing up my downstream analysis of V1 data.
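The kind of check described here can be sketched in Python as follows. This is only an illustration of the idea, not how mothur does it, and the thresholds (`max_homop=8`, `max_ambig=0`) are assumptions you would pick after looking at your own summary.

```python
import re


def longest_homopolymer(seq):
    """Length of the longest run of one repeated character in seq."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)),
               default=0)


def passes_screen(seq, max_homop=8, max_ambig=0):
    """True if seq has few enough N's and no overly long homopolymer.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    seq = seq.upper()
    return (seq.count("N") <= max_ambig
            and longest_homopolymer(seq) <= max_homop)


# Hypothetical reads for the example.
reads = {
    "clean": "ACGTACGTACGT",
    "has_n": "ACGNACGTACGT",
    "long_run": "ACG" + "T" * 10 + "A",
}
kept = [name for name, seq in reads.items() if passes_screen(seq)]
```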

Hi,
I want to use the unique.seqs command to find unique sequences in a way that takes only mismatched bases into account and disregards indels/gaps. I am not sure if there is a way to do it.
For example:
seq 1 CTCGGGATTTCCTGGGAGCA
seq 2 CTCGGGATT-CCTGAGAGCA
seq 3 CTCGGGATTTCCTGAGAGCA

Currently the unique.seqs command treats these 3 sequences as 3 unique sequences. However, I want to collapse seq 2 and seq 3 into 1 unique sequence (disregarding the gap/indel) and keep seq 1 as a separate unique sequence (because of its mismatched G at position 15).

Is there a way to do this?

Thank you,
Arif

Nope.
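For anyone who wants to try this outside mothur, the comparison Arif describes could be sketched in Python as below. This is an illustration under assumptions, not a mothur feature: it expects sequences that are already aligned to the same length with `-` as the gap character, and it groups greedily against the first sequence of each group. Because "equal where neither has a gap" is not a transitive relation, the result can depend on input order.

```python
def same_ignoring_gaps(a, b, gap="-"):
    """True if aligned a and b agree at every column where neither has a gap."""
    if len(a) != len(b):
        return False
    return all(x == y for x, y in zip(a, b) if x != gap and y != gap)


def collapse(seqs):
    """Greedy grouping: a sequence joins the first group whose
    representative it matches, otherwise it starts a new group."""
    groups = []  # list of (representative sequence, [names])
    for name, seq in seqs.items():
        for rep, names in groups:
            if same_ignoring_gaps(rep, seq):
                names.append(name)
                break
        else:
            groups.append((seq, [name]))
    return groups


# The three sequences from the post above.
seqs = {
    "seq1": "CTCGGGATTTCCTGGGAGCA",
    "seq2": "CTCGGGATT-CCTGAGAGCA",
    "seq3": "CTCGGGATTTCCTGAGAGCA",
}
groups = collapse(seqs)
```

With these three sequences, seq1 stays on its own and seq2 and seq3 end up in one group.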