I have full-length (FL) 16S reads and ~400 bp 16S pyro reads from each of 15 samples. The FL reads are processed through the normal mothur pipeline. For the pyro reads I would like to first use unique.seqs to deconvolute the dataset, and then "deconvolute" against my FL dataset. In other words, I want to get rid of all pyro reads that match exactly to FL reads. The goal is to identify the diversity "missed" by the FL analysis.
So:
Is this an appropriate/useful analysis?
Can anyone suggest a way of performing such a task?
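One way to attempt the second "deconvolution" outside mothur is a short script. The sketch below (filenames and function names are my own, purely illustrative) keeps only the pyro reads that do not occur verbatim inside any FL read; a real analysis would probably also want to check the reverse complement and tolerate sequencing error, which this does not.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def filter_novel_pyro(pyro_records, fl_seqs):
    """Keep pyro reads that are NOT an exact substring of any FL read."""
    return [(name, seq) for name, seq in pyro_records
            if not any(seq in fl for fl in fl_seqs)]

if __name__ == "__main__":
    # hypothetical filenames
    fl = [seq for _, seq in read_fasta("fl_reads.fasta")]
    with open("pyro_not_in_fl.fasta", "w") as out:
        for name, seq in filter_novel_pyro(read_fasta("pyro_reads.fasta"), fl):
            out.write(">%s\n%s\n" % (name, seq))
```

The naive all-against-all substring scan is quadratic; for 15k reads per sample against a FL dataset it may be tolerable, but an index (e.g. hashing FL k-mers) would scale better.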
I also have problems similar to those of jarrod_s…
In practice I have 15 samples, each containing >15k pyro reads covering the 16S V5-V6 region, with an average length of ~350 bp.
I was wondering whether I should simply remove sequences perfectly contained in others. I don't think this will affect my OTUs in any way, but I'm somewhat uncertain about how to proceed. Consider that I already evaluated this "hard removal" with a Perl script: it can remove up to 2/3 of the sequences, much more than the unique.seqs approach, leaving a global file of "only" 60k sequences instead of the initial >225k.
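For reference, the "hard removal" step described above can be sketched in a few lines of Python (my own illustration, not the original Perl script). The quadratic scan is slow on >225k reads, but the logic is the same:

```python
def collapse_contained(seqs):
    """Drop every sequence that is an exact substring of another one.
    Identical duplicates count too: only the first copy survives."""
    # Longest first, so a potential container is always examined
    # before anything it could contain.
    ordered = sorted(seqs, key=len, reverse=True)
    kept = []
    for s in ordered:
        if not any(s in k for k in kept):
            kept.append(s)
    return kept
```

Unlike unique.seqs, this also folds a short read into any longer read that spans it, which is why it removes far more sequences; the cost is that the per-read abundance information normally preserved in the name file is lost.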
So I wouldn't remove any sequences. Just make one big fasta and group file and proceed as we do in the Costello example analysis on the wiki. If 2/3 of the sequences are redundant, the unique.seqs command will figure that out, so the hard steps of aligning, classifying, distance calculation, and clustering are only done on the uniques, and the redundants are then mapped back in.
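That combined route might look roughly like the following mothur batch sketch (filenames and the reference alignment are placeholders; see the Costello example on the wiki for the exact parameters used there):

```
unique.seqs(fasta=all_samples.fasta)
align.seqs(fasta=all_samples.unique.fasta, reference=silva.bacteria.fasta)
filter.seqs(fasta=all_samples.unique.align)
dist.seqs(fasta=all_samples.unique.filter.fasta, cutoff=0.25)
cluster(column=all_samples.unique.filter.dist, name=all_samples.names)
```

The name file written by unique.seqs carries the redundant-to-unique mapping, so downstream OTU counts still reflect every original read even though only the unique sequences go through the expensive steps.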