Preprocessing 454 data for mothur analysis


I’m trying to find the best way to prepare sets of 454 reads for mothur analysis.
The initial data were obtained by pyrosequencing total metagenomic DNA, i.e. there was no amplification step with specific PCR primers. The 454 adaptor sequences and low-quality reads were removed. Additionally, the reads were screened for duplicated sequences that arise as pyrosequencing artifacts (using CD-HIT with the extract_replicates wrapper).

I want to perform (a) a mothur analysis of the 16S rRNA reads and (b) a mothur analysis of ‘metaproteomic’ sequences. Here are my specific questions:

  1. For the 16S rRNA analysis, could anybody suggest a suitable procedure for selecting rRNA-specific sequences from the whole dataset? It does not really matter whether the procedure relies only on mothur or needs external tools (such as BLASTing against an rRNA database). I’m currently experimenting with both options, but the problem is that I can estimate neither how much useful data I lose nor how many contaminating (i.e. non-16S rRNA) sequences remain in the dataset after cleaning. Any thoughts on the minimum sequence length would also be appreciated, as there seems to be no consensus on how short a sequence can be and still be included in further analysis.

  2. My doubts about the ORF analysis are similar. In the BMC paper on mg-dotur, the initial ORFs were longer than 100 aa. However, the average read length in some groups of my data is short (~200 bp), so the number of ORFs in these groups is ten times smaller than in the groups with an average read length of 320-350 bp. How should I treat the data in this case? I see two options: (a) lower the minimum ORF size in all samples (say, to 70 aa), or (b) lower the minimum ORF size only in the samples with a short average read length. Of course, one could argue for not lowering the minimum ORF length at all, but I do not see a justification for that. Also, what are the optimal settings for running BLAST? Should I apply low-complexity filtering to the input ORFs when self-BLASTing?
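
To make the read-length issue in question 2 concrete, here is a minimal Python sketch of how I understand the ORF cut-off to interact with read length. It is only an illustration under my own assumptions, not the procedure used in the mg-dotur paper: an "ORF" here is any stop-free stretch in one of the six reading frames (no start codon is required, since reads are fragments), the standard genetic code is used, and the function names (`orfs_in_read`, `max_orf_aa`) are hypothetical. Note that a read of N bp can never yield an ORF longer than N // 3 residues, so a 200 bp read tops out at 66 aa.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def orfs_in_read(read, min_aa):
    """Collect stop-free peptide stretches of at least min_aa residues
    from all six reading frames of a single read (assumed ORF definition)."""
    read = read.upper()
    found = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            protein = translate(strand[frame:])
            found.extend(seg for seg in protein.split("*") if len(seg) >= min_aa)
    return found


def max_orf_aa(read_length_bp):
    """Upper bound on ORF length (in residues) obtainable from one read."""
    return read_length_bp // 3


if __name__ == "__main__":
    # Synthetic 213 bp read without stop codons, used purely as an example.
    read = "ATG" + "GCT" * 70
    for cutoff in (70, 100):
        print(cutoff, "aa cut-off:", len(orfs_in_read(read, cutoff)), "ORF(s)")
    for length in (200, 320, 350):
        print(length, "bp read: at most", max_orf_aa(length), "aa")
```

If this reasoning is right, a 100 aa cut-off excludes every read shorter than 300 bp regardless of its content, which may explain part of the difference I see between the groups.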