Community analysis in illumina

cbharder · December 4, 2013, 8:59am

Being used to analysing 454 community data i Mothur, I now have a somewhat complicated community analysis to do with Illumina amplicon data, consisting of nonoverlapping 18S contigs.

To begin with, I am not used to files being demultiplexed into separate fastq-files for each sample in advance before I get them. For the routine quality sorting, preclustering, aligning/filtering and chimera check steps, I have just used a standard linux loop to repeat the analyses for each fastq-file, but I need to generate shared files, and to cluster my data and throw out nontarget clusters without having to do this for each separate fastq-file. I also want to discard singletons occurring only once in the entire dataset, but not those occurring only once in each sample.

If you use the merge.files on the quality checked fasta files and proceed with the subsample, dist, cluster and the make.shared + unifrac steps etc, is it then still possible to keep track of the individual samples that were merged and split the files again afterwards if necessary?
Because some (most) of the contigs are nonoverlapping, they contain two NN flanked by two nucleotides of poor quality (<5) on each sides. I.e. applying normal qthreshold commands would scrap most of, if not the entire dataset.
I could write up a python script or similar to remove all nucleotides with a quality of less than 5 if NN is found, but is there a simpler way to trim the middle 6 nucleotides of a contig without scrapping the entire sequence?

Best regards,

Christoffer

pschloss · December 4, 2013, 4:02pm

If you use the merge.files on the quality checked fasta files and proceed with the subsample, dist, cluster and the make.shared + unifrac steps etc, is it then still possible to keep track of the individual samples that were merged and split the files again afterwards if necessary?

Have you looked at the MiSeq SOP? We deal with this issue in the make.contigs command. Seems like you’re trying to make extra work for yourself.

Because some (most) of the contigs are nonoverlapping, they contain two NN flanked by two nucleotides of poor quality (<5) on each sides. I.e. applying normal qthreshold commands would scrap most of, if not the entire dataset.

Hmmm… well I guess you have the data, but you’re really going down a black hole of pain. If you look at our AEM MiSeq paper you’ll see that unless your reads fully overlap with each other, your error rates are going to be gigantic. Having no overlap is an even bigger headache because it becomes next to impossible to align the reads. If this is the case, then I’d suggest just phylotyping.

Pat

cbharder · December 12, 2013, 10:15am

Thanks for the answer. Having looked at the MiSeq paper now, I realise that I created extra work for myself. Thank you. Sometimed things are less complicated than you think.

I do realise where IÂ´m going (as you correctly summarise). However, the quality really is not that poor except for the ends, where it plummets quite suddenly. I am able to trim this part, and it is 18S amplicons which are conserved in each end, so I have not had trouble in aligning the pre-OTUs as long as I included a complete template. Are there other “hidden” errors that are not accounted for this way?

But still, phylotyping might well be a better (and less painful) option. Then I have another question: I have an alignment of about 250 microeukaryotic reference sequences from my targeted group which is poorly represented in the Silva database that I would like to use. How do I add this to e.g. the silva taxonomy files the easiest way?

Then, I suppose I would run some commands like this

classify.seqs(fasta=mysequences.fasta, template=silva_18S_+myextendedrefseqs.fasta, taxonomy=silva_18S+_myextended.taxonomy)

and then

phylotype(mysequences.taxonomy)

or something. Right?

And finally, if I proceed this way, is there a very good standard argument as to why the possible large error rates will not be a problem with this approach? And as long as one sticks to phylotyping, could on even justify a comparison of such data with others where there was a full overlap?

Best, cbharder

pschloss · December 13, 2013, 12:48pm

Phylotyping works because you’re typically classifying at a much broader level than you are when you do OTUs. If a genus is really a 10% cutoff, you can have a lot of mistakes and still call something a Bacteroides.

Topic		Replies	Views
fastq, fasta and qual Commands in mothur	2	781	December 15, 2017
make.contigs and 'clean' data Commands in mothur	1	663	February 27, 2018
Subsampling with Illumina data Commands in mothur	5	3032	July 29, 2014
equivalent of trim.flows for single-end Illumina reads Commands in mothur	2	2692	April 14, 2014
Analysis of Illumina data - problem with make.contigs Commands in mothur	10	9899	September 10, 2014

Community analysis in illumina

Related topics