How to go about grouping in mothur?


I am new to the field of metagenomic analysis and am trying to understand this. I am sorry if this has been answered earlier, and for the long post. Right now I am doing a pilot study to train myself in the analysis using mothur and would love some help with the following.

I am trying to analyse the human oral microbiome using datasets from 8 countries. Each country's dataset has 2 groups: healthy and diseased. Each of these groups has multiple individuals who have been sampled and sequenced. The sequences are available as a separate fastq file for each individual, with 10,000+ sequences in each file. Note that all of these are either single-end reads or paired-end reads for which the forward and reverse reads have already been joined into contigs. I would like to estimate/plot the following:

  1. Rarefaction curves of each individual sample but colour coded by the country-health status combination
  2. Alpha diversity value (Shannon index) calculated for each individual and plotted in a box-whisker plot and grouped by country-health status combination in the x-axis
  3. Two PCoA plots, with one scatter point for each individual’s microbiome sample, colour coded by health status in one plot and by country in the other
  4. Core microbiome composition of each of the Country-health status combinations

What I expect to see in the PCoA plot is how the scatter points representing each individual, cluster together when colour coded based on healthy and diseased states. Also how the individual samples cluster when colour coded based on country.

Considering my needs mentioned above and my limited understanding of groups in mothur, I would like some help with how to go about the whole process. I understand that mothur combines all the input fastq files into a single fasta file at the start and creates groups, but how do I do the above without mixing up the datasets?

  1. Do I have to analyse each individual's fastq file separately (using a batch script) and give each sample its own group name in the group file, because I need the alpha diversity values for each individual?
  2. Or do I analyse the samples in batches, one per country-health status combination (8x2=16 batches)? If so, what would my group names be in the group file for each batch? Would they be the country-health combination, as that is the basis for the colours in the PCoA scatter plot later, or the names of the individual samples, as alpha diversity is to be calculated per individual sample?
  3. Or is there a way to feed in all the data at once and use the group file to keep the analysis of each group independent? If so, my group names in the group file would be the country-health status combination, right?

I also have a few further questions and am hoping someone can clarify them for me:

  1. The file filetype in mothur contains groups and the fastq file names for paired end. But in the documentation there is no mention of how to use the file filetype for single end reads. Can I use a 2 column format (groups—singleRead.fastq)? Will mothur mistake this for the 2 column format for paired end sequences (forwardRead.fastq—reverseRead.fastq)?

  2. The group filetype in mothur contains 2 columns (sequenceName—group). The sequenceName column contains names of the individual sequences present in a fastq file. But can I create a group file with the names of the fastq files instead of the names of each individual sequence?

  3. In case I wish to add more datasets in between the process (or analyse the datasets separately from the start), can I analyse them separately till a point and merge them at a point? If my understanding is correct, I will not be able to add/merge datasets after the pcoa command, as it will require all of the data to be plotted. But in the process till where can I do so and how can I go about merging them?

Thanks for patiently reading through all this. Since I’m new, I wanted to convey my ideas clearly. Kindly correct me if my understanding is wrong. I’m hoping to get some help/clarification on the above.

Thanks in advance.


Welcome! I’d strongly encourage you to go through the MiSeq SOP and run the commands on the example data to get a sense of what is going on.

I would strongly recommend against using single read data. Individual MiSeq reads are pretty bad. The second read, which hopefully fully overlaps the first, denoises the first read. If you don’t have a choice, then I would use the same single read from all of your samples to keep things consistent. Otherwise you will have a weird mishmash of data. You would likely want to use and trim.seqs to trim your sequences based on the quality scores. You can get a sense of how to do these initial steps from the 454 SOP.

Once you have a shared file for all of your samples, you can then pool things however you want using the merge.groups function. To generate the shared file, all of your data need to be processed together from the beginning.
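To make the "process together, pool later" idea concrete, here is a minimal Python sketch, not mothur itself. It assumes each individual was kept as its own group, so the shared file has one row per individual. The sample names, the counts, and the `DESIGN` mapping are all made up for illustration, and summing counts is just one simple way to pool; treat it as a rough analogue of what pooling does rather than a description of merge.groups internals.

```python
import math
from collections import defaultdict

# Hypothetical miniature of a tab-separated shared file:
# label, Group (one row per individual sample), numOtus, then OTU counts.
SHARED_TEXT = (
    "label\tGroup\tnumOtus\tOtu001\tOtu002\tOtu003\n"
    "0.03\tIN_H_01\t3\t50\t50\t0\n"
    "0.03\tIN_H_02\t3\t25\t25\t50\n"
    "0.03\tIN_D_01\t3\t100\t0\t0\n"
)

# Hypothetical mapping: sample name -> country-health combination.
DESIGN = {
    "IN_H_01": "India_healthy",
    "IN_H_02": "India_healthy",
    "IN_D_01": "India_diseased",
}


def read_shared(text):
    """Return {sample name: [OTU counts]} from shared-file text."""
    lines = text.strip().splitlines()
    counts = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split("\t")
        counts[fields[1]] = [int(x) for x in fields[3:]]
    return counts


def shannon(otu_counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over OTUs with non-zero counts."""
    total = sum(otu_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in otu_counts if c > 0)


def shannon_by_category(counts, design):
    """Per-individual Shannon values, collected under each category."""
    grouped = defaultdict(list)
    for sample, otus in counts.items():
        grouped[design[sample]].append(shannon(otus))
    return dict(grouped)


def pool_counts(counts, design):
    """Sum OTU counts over all samples that share a category."""
    pooled = {}
    for sample, otus in counts.items():
        category = design[sample]
        if category not in pooled:
            pooled[category] = list(otus)
        else:
            pooled[category] = [a + b for a, b in zip(pooled[category], otus)]
    return pooled


if __name__ == "__main__":
    counts = read_shared(SHARED_TEXT)
    print(shannon_by_category(counts, DESIGN))  # values for a box plot
    print(pool_counts(counts, DESIGN))          # pooled table per category
```

The point of the sketch is that the per-individual values (for the box plots) and the pooled tables (for a per-category core microbiome) both come from the same per-sample table, so nothing has to be decided about pooling when the group names are chosen.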


Thank you for the reply and guidance. I will surely check those SOPs out.

I am actually analyzing publicly available data from NCBI SRA. Some of the authors have uploaded the reads as joined, trimmed and pre-processed fastq files, which is why I am dealing with non-paired files, while a few other datasets have raw, unprocessed paired reads.

Also since I’m trying to do a cross study analysis, would it be a problem that each dataset has been sequenced on a different platform - Illumina, 454, etc.?

Also, how would I go about changing the colours for the individual points/curves in the PCoA plot or rarefaction curve? I would like to colour them based on some criteria. Can I somehow make use of the group file as input to R to generate the colour coded images?

Again, thank you so much

I’d encourage you to check out for tutorials on analyzing mothur data in R
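For the colouring question, the general idea is the same in any plotting environment: keep a small table that maps each sample to its metadata, then turn one column of that table into a colour per point. Below is a minimal sketch of that idea in Python rather than R. The table layout, the sample names, and the palette are assumptions for illustration only. Note that the group file itself only maps sequence names to samples, so a separate hand-made metadata table like this one is assumed here.

```python
# Hypothetical tab-separated metadata table: one row per sample.
DESIGN_TEXT = (
    "sample\tcountry\tstatus\n"
    "IN_H_01\tIndia\thealthy\n"
    "IN_D_01\tIndia\tdiseased\n"
    "JP_H_01\tJapan\thealthy\n"
    "JP_D_01\tJapan\tdiseased\n"
)

# Arbitrary illustrative palette of hex colours.
PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a",
           "#66a61e", "#e6ab02", "#a6761d", "#666666"]


def read_design(text):
    """Return {sample: {column name: value}} from the metadata table text."""
    lines = text.strip().splitlines()
    header = lines[0].split("\t")
    design = {}
    for line in lines[1:]:
        fields = line.split("\t")
        design[fields[0]] = dict(zip(header[1:], fields[1:]))
    return design


def colours_for(samples, design, column, palette=PALETTE):
    """Return (one colour per sample, {level: colour}) for a metadata column."""
    levels = sorted({design[s][column] for s in samples})
    if len(levels) > len(palette):
        raise ValueError("palette has fewer colours than levels")
    lookup = dict(zip(levels, palette))
    return [lookup[design[s][column]] for s in samples], lookup


if __name__ == "__main__":
    design = read_design(DESIGN_TEXT)
    samples = list(design)  # same order as the points to be plotted
    by_status, status_legend = colours_for(samples, design, "status")
    by_country, country_legend = colours_for(samples, design, "country")
    print(by_status, status_legend)
    print(by_country, country_legend)
```

The returned list has one colour per sample in plotting order, so the same PCoA coordinates or rarefaction curves can be drawn twice, once with the status colours and once with the country colours, and the lookup dictionary can feed the legend.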
