Pulling out lineages and assigning OTUs

Hi Pat, Sarah and team,

I am in the final steps of processing my 18S PacBio reads and I’m finding mothur has been awesome in helping me analyze data. Thank you! And thanks for answering some previous questions about my dataset. I just have two quick questions as I prepare to sign this project off.

  1. Firstly, I spiked my DNA for sequencing with internal standards (fungal constructs), and I pulled out the fungi from a processed file (Antarctica.unique.trim.good.filter.unique.precluster.pick.fasta) using the get.lineage command. I pulled out the corresponding name, group and taxonomy files at the same time. Since there were many groups in there, I wanted to get the fungi from one group at a time, so as a trial I used the get.groups command to get the fungi reads out of a group named “St7t0”. I noticed from my silva taxonomy summary that there were 1665 fungi sequences, but once I pulled them out I got only 775:

Selected 1665 sequences from your name file.
Selected 775 sequences from your fasta file.
Selected 1665 sequences from your group file.

Is this because of the unique.seqs commands that I ran earlier? Because I am trying to be quantitative and want to pull out all of my fungi (there are several different standards in there), should I classify the raw Antarctic file before running any unique.seqs commands, or do you think this would be too computationally heavy? Is there any way of “deunique-ing” the data so that I can get the 1665 sequences for St7t0?

  2. Secondly, I have been doing OTU based analyses on a different set of data and all has gone well but I was confused about OTU IDs - are the OTUs static when looking at subsampled data and when looking at data for the whole set of sequences? I.e. is OTU001 the same “species” when looking at all summary files or is this when the reftaxonomy command is required? I got myself a bit confused.

Thanks for any help you may be able to give me! Thanks again for such an awesome resource!

Bethan

Is this because of the unique.seqs commands that I ran earlier? Because I am trying to be quantitative and want to pull out all of my fungi (there are several different standards in there), should I classify the raw Antarctic file before running any unique.seqs commands, or do you think this would be too computationally heavy? Is there any way of “deunique-ing” the data so that I can get the 1665 sequences for St7t0?

That is correct - the difference in numbers is because of the unique.seqs step. There actually is a deunique.seqs command, but you can also get the counts back by making a count file using the count.seqs command. The full number of sequences also shows up in the shared file.
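To make the 775 vs. 1665 difference concrete, here is a minimal Python sketch (not part of mothur) that tallies unique and total sequences from a name file and a group file. It assumes the usual two-column, tab-separated layouts: the name file lists a representative sequence followed by a comma-separated list of every sequence it stands for, and the group file lists a sequence name followed by its group. The sequence names and the second group "St1t0" are made up for illustration.

```python
def summarize_names(names_text, group_text):
    """Return (unique_count, total_count, per_group_totals).

    names_text: lines of "representative<TAB>member1,member2,..."
    group_text: lines of "sequence<TAB>group"
    """
    # Map every sequence name to its group.
    group_of = {}
    for line in group_text.strip().splitlines():
        seq, group = line.split()
        group_of[seq] = group

    unique = 0      # one per representative (what the fasta file holds)
    total = 0       # every redundant read (what the name/group files hold)
    per_group = {}  # total reads per group
    for line in names_text.strip().splitlines():
        _rep, members = line.split()
        unique += 1
        for seq in members.split(","):
            total += 1
            g = group_of[seq]
            per_group[g] = per_group.get(g, 0) + 1
    return unique, total, per_group


# Toy data (hypothetical sequence names; "St1t0" is an invented second group).
names_text = "seqA\tseqA,seqB,seqC\nseqD\tseqD\n"
group_text = "seqA\tSt7t0\nseqB\tSt7t0\nseqC\tSt1t0\nseqD\tSt7t0\n"

unique, total, per_group = summarize_names(names_text, group_text)
print(unique, total, per_group)
```

In this toy case the fasta file would hold 2 unique sequences while the name and group files account for 4 reads, which is the same kind of gap as 775 vs. 1665 above.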


  2. Secondly, I have been doing OTU based analyses on a different set of data and all has gone well but I was confused about OTU IDs - are the OTUs static when looking at subsampled data and when looking at data for the whole set of sequences? I.e. is OTU001 the same “species” when looking at all summary files or is this when the reftaxonomy command is required? I got myself a bit confused.

If you subsample a shared file, the OTU IDs in the output should be the same as the OTU IDs in the input.
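As a rough illustration of why the labels carry over, here is a small Python sketch of subsampling one group's OTU counts without replacement. This is not mothur's actual sub.sample implementation, just the general idea under simple assumptions: reads are drawn at random from the existing OTUs, so counts shrink but each label keeps pointing at the same OTU. The function name, counts and seed are invented for the example, and OTUs that drop to zero are kept here for clarity.

```python
import random


def subsample_counts(otu_counts, size, seed=0):
    """Randomly draw `size` reads (without replacement) from one group's OTU counts.

    otu_counts: dict mapping OTU label -> read count.
    Returns a dict with the same labels, in the same order, and reduced counts.
    """
    rng = random.Random(seed)
    # Expand counts into one entry per read, each tagged with its OTU label.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    picked = rng.sample(pool, size)

    # Start every original label at zero so no label is renamed or reordered.
    result = {otu: 0 for otu in otu_counts}
    for otu in picked:
        result[otu] += 1
    return result


# Hypothetical counts for a single group.
full = {"OTU001": 50, "OTU002": 30, "OTU003": 5}
sub = subsample_counts(full, 20)
print(sub)
```

The subsampled table still uses OTU001, OTU002 and OTU003 for the same clusters; only the read counts differ.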

Pat

Thanks Pat!

Bethan