Hi,
I am new to the field of metagenomic analysis and am trying to understand it. I am sorry if this has been answered earlier, and for the long post. Right now I am doing a pilot study to train myself in the analysis using mothur, and would love some help with the following.
I am trying to analyse the human oral microbiome using datasets from 8 countries. Each country's dataset has 2 groups: healthy and diseased. Each group has multiple individuals who have been sampled and sequenced. The sequences are available as a separate fastq file for each individual, each containing 10,000+ sequences. Note that all of these are either single-end reads or paired-end reads for which the forward and reverse reads have already been joined into contigs. I would like to estimate/plot the following:
- Rarefaction curves of each individual sample but colour coded by the country-health status combination
- Alpha diversity value (Shannon index) calculated for each individual, shown in a box-whisker plot grouped by country-health status combination on the x-axis
- Two PCoA plots with one scatter point per individual's microbiome sample: one plot colour coded by health status, the other by country
- Core microbiome composition of each of the Country-health status combinations
What I expect to see in the PCoA plots is how the points representing each individual cluster together when colour coded by healthy versus diseased state, and how they cluster when colour coded by country.
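To make the alpha diversity item concrete, here is a minimal Python sketch of what I mean by a Shannon index per individual, collected by country-health status combination for a box plot. The sample names, counts and metadata below are made up for illustration, and this is not mothur's implementation, just the textbook formula H' = -sum(p_i * ln p_i).

```python
import math
from collections import defaultdict


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def shannon_by_group(otu_counts, metadata):
    """Collect one Shannon value per individual under its country-status label."""
    grouped = defaultdict(list)
    for sample, counts in otu_counts.items():
        country, status = metadata[sample]
        grouped[f"{country}-{status}"].append(shannon_index(counts))
    return dict(grouped)


# Hypothetical per-individual OTU counts (made-up numbers, not real data).
otu_counts = {
    "CountryA_healthy_01": [50, 30, 20],
    "CountryA_healthy_02": [25, 25, 25, 25],
    "CountryA_diseased_01": [90, 10],
}

# Hypothetical metadata: sample name -> (country, health status).
metadata = {
    "CountryA_healthy_01": ("CountryA", "healthy"),
    "CountryA_healthy_02": ("CountryA", "healthy"),
    "CountryA_diseased_01": ("CountryA", "diseased"),
}

for label, values in shannon_by_group(otu_counts, metadata).items():
    print(label, [round(v, 3) for v in values])
```

Each list of values would then become one box in the box-whisker plot, so every individual needs its own diversity value while the grouping label is only used for plotting.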
Considering the needs above and my limited understanding of groups in mothur, I would like some help with how to go about the whole process. I understand that mothur combines all the input fastq files into a single fasta file at the start and creates groups, but how do I do the above without mixing up the datasets?
- Do I have to analyse each individual's fastq file separately (using a batch script) and give each sample its own group name in the group file, because I need the alpha diversity values for each individual?
- Or do I analyse the samples in batches, one per country-health status combination (8x2=16 batches)? If so, what would my group names be in the group file for each batch? Would they be the country-health status combination, as that is the basis for the colours in the PCoA scatter plot later, or the names of the individual samples, as alpha diversity is to be calculated per individual sample?
- Or is there a way to feed in all the data at once and use the group file to keep the analysis of each group independent? If so, my group names in the group file would be the country-health status combination, right?
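To illustrate the arrangement I have in mind for the last option, here is a rough Python sketch in which every individual keeps its own name as the group, and a separate table records the country and health status used only for colouring and grouping the plots. The naming convention `CountryA_healthy_01` and the column names are my own assumptions for this example; I do not know whether this matches what mothur expects.

```python
def build_metadata_rows(sample_names):
    """Split names like 'CountryA_healthy_01' into plotting metadata.

    Assumes a hypothetical '<country>_<status>_<individual>' naming scheme.
    """
    rows = []
    for name in sample_names:
        country, status, _individual = name.split("_")
        rows.append((name, country, status, f"{country}-{status}"))
    return rows


def to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    header = "sample\tcountry\tstatus\tcombination"
    return "\n".join([header] + ["\t".join(row) for row in rows]) + "\n"


# Made-up sample names, one per individual fastq file.
samples = ["CountryA_healthy_01", "CountryA_diseased_01", "CountryB_healthy_01"]
print(to_tsv(build_metadata_rows(samples)))
```

With a table like this, the per-individual results could be coloured by `status`, by `country`, or by `combination` afterwards without rerunning anything.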
I also have a few further questions and am hoping someone can clarify them for me:
- The file file type in mothur contains groups and the fastq file names for paired-end reads. But the documentation does not mention how to use the file file type for single-end reads. Can I use a 2-column format (group, singleRead.fastq)? Will mothur mistake this for the 2-column format for paired-end sequences (forwardRead.fastq, reverseRead.fastq)?
- The group file type in mothur contains 2 columns (sequenceName, group). The sequenceName column contains the names of the individual sequences present in a fastq file. Can I instead create a group file with the names of the fastq files rather than the names of each individual sequence?
- In case I wish to add more datasets partway through the process (or analyse the datasets separately from the start), can I analyse them separately up to some step and then merge them? If my understanding is correct, I will not be able to add or merge datasets after the pcoa command, as it requires all of the data to be plotted. Up to which step can I still do so, and how would I go about merging them?
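For the group file question, the alternative I would otherwise fall back on is generating the per-sequence group file myself, roughly as in the Python sketch below. It assumes plain 4-line FASTQ records and uses the fastq file name (without extension) as the group; the function names are hypothetical, and the sequence names would still have to match whatever names end up in the fasta file mothur works with.

```python
from pathlib import Path


def fastq_sequence_names(path):
    """Return sequence names from a FASTQ file with strict 4-line records."""
    names = []
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            # Every 4th line (0, 4, 8, ...) is a header of the form '@name ...'.
            if line_number % 4 == 0 and line.strip():
                if not line.startswith("@"):
                    raise ValueError(f"unexpected header in {path}: {line!r}")
                names.append(line[1:].split()[0])
    return names


def group_file_lines(fastq_paths):
    """Build 'sequenceName<TAB>group' lines, one group per fastq file."""
    lines = []
    for path in fastq_paths:
        group = Path(path).stem  # e.g. 'CountryA_healthy_01' (made-up name)
        for name in fastq_sequence_names(path):
            lines.append(f"{name}\t{group}")
    return lines
```

Calling `group_file_lines` on the list of per-individual fastq paths and writing the result to a text file would give one two-column line per sequence, which is the layout described above.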
Thanks for patiently reading through all this. Since I'm new, I wanted to convey my ideas clearly. Kindly correct me if my understanding is wrong; I'm hoping to get some help/clarification on the above.
Thanks in advance.