Being used to analysing 454 community data in mothur, I now have a somewhat complicated community analysis to do with Illumina amplicon data, consisting of non-overlapping 18S contigs.
To begin with, I am not used to receiving files that have already been demultiplexed into a separate fastq file per sample. For the routine quality sorting, pre-clustering, aligning/filtering and chimera-check steps, I have simply used a standard Linux loop to repeat the analysis for each fastq file. However, I need to generate shared files, cluster my data, and throw out non-target clusters without doing this separately for each fastq file. I also want to discard singletons that occur only once in the entire dataset, but not sequences that occur only once within a single sample.
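In mothur itself this kind of filtering would normally go through a count table, but the distinction I mean (global singletons vs. per-sample singletons) can be sketched in Python. This is only an illustration; the function name and the dict-of-lists input are made up for the example:

```python
from collections import Counter

def drop_global_singletons(sample_seqs):
    """Remove sequences that occur exactly once across ALL samples,
    while keeping sequences that are singletons within one sample
    but appear again in another sample.

    sample_seqs: dict mapping sample name -> list of sequences.
    Returns a dict of the same shape with global singletons removed.
    """
    # Count each sequence over the whole dataset, not per sample.
    total = Counter()
    for seqs in sample_seqs.values():
        total.update(seqs)
    return {
        sample: [s for s in seqs if total[s] > 1]
        for sample, seqs in sample_seqs.items()
    }
```

For example, a sequence seen once in sample A and once in sample B is a per-sample singleton in both, but not a global singleton, so it is kept; a sequence seen once in the whole dataset is dropped.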
If I use merge.files on the quality-checked fasta files and proceed with the subsample, dist, cluster, make.shared and unifrac steps etc., is it still possible to keep track of the individual samples that were merged, and to split the files again afterwards if necessary?
Because some (in fact most) of the contigs are non-overlapping, they contain an NN joint flanked by two poor-quality nucleotides (quality < 5) on each side. Applying a normal qthreshold would therefore scrap most of, if not the entire, dataset.
I could write a Python script or similar to remove the low-quality nucleotides around the NN, but is there a simpler way to trim the middle six nucleotides of a contig without scrapping the entire sequence?
- If I use merge.files on the quality-checked fasta files and proceed with the subsample, dist, cluster, make.shared and unifrac steps etc., is it still possible to keep track of the individual samples that were merged, and to split the files again afterwards if necessary?
Have you looked at the MiSeq SOP? We deal with this issue in the make.contigs command. Seems like you’re trying to make extra work for yourself.
- Because some (in fact most) of the contigs are non-overlapping, they contain an NN joint flanked by two poor-quality nucleotides (quality < 5) on each side. Applying a normal qthreshold would therefore scrap most of, if not the entire, dataset.
Hmmm… well I guess you have the data, but you’re really going down a black hole of pain. If you look at our AEM MiSeq paper you’ll see that unless your reads fully overlap with each other, your error rates are going to be gigantic. Having no overlap is an even bigger headache because it becomes next to impossible to align the reads. If this is the case, then I’d suggest just phylotyping.
Thanks for the answer. Having looked at the MiSeq paper now, I realise that I created extra work for myself. Thank you. Sometimes things are less complicated than you think.
I do realise where I'm going (as you correctly summarise). However, the quality really is not that poor except at the ends, where it plummets quite suddenly. I am able to trim that part, and these are 18S amplicons, which are conserved at each end, so I have not had trouble aligning the pre-OTUs as long as I included a complete template. Are there other "hidden" errors that are not accounted for this way?
But still, phylotyping might well be a better (and less painful) option. Then I have another question: I have an alignment of about 250 microeukaryotic reference sequences from my target group, which is poorly represented in the Silva database, that I would like to use. What is the easiest way to add these to, e.g., the Silva taxonomy files?
Then, I suppose I would run a command like this:
classify.seqs(fasta=mysequences.fasta, template=silva_18S_plus_myrefseqs.fasta, taxonomy=silva_18S_plus_myrefseqs.taxonomy)
or something. Right?
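If I understand it correctly, building the combined reference is essentially just concatenating files: my custom fasta appended to the Silva fasta, and my taxonomy lines appended to the Silva taxonomy file (with the same rank structure). A minimal Python sketch of that step, with all filenames being placeholders for illustration:

```python
def merge_references(silva_file, custom_file, out_file):
    """Concatenate a custom reference file onto a Silva file.
    Works for both the fasta and the two-column taxonomy files,
    provided the custom taxonomy uses the same rank structure as
    Silva's. All filenames are placeholders, not real Silva releases.
    """
    with open(out_file, "w") as out:
        for path in (silva_file, custom_file):
            with open(path) as fh:
                for line in fh:
                    # Guard against a missing final newline in either input.
                    out.write(line if line.endswith("\n") else line + "\n")
```

The combined fasta and taxonomy files would then be the ones passed to classify.seqs, if that is indeed the right approach.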
And finally, if I proceed this way, is there a good standard argument for why the possibly large error rates will not be a problem with this approach? And as long as one sticks to phylotyping, could one even justify comparing such data with datasets where there was full overlap?
Phylotyping works because you're typically classifying at a much broader level than you are when you make OTUs. If a genus really corresponds to a 10% cutoff, you can have a lot of mistakes and still correctly call something a Bacteroides.