Classify.seqs [HELP].

Dear Mothur users,

I’ve been running mothur on two files by following the MiSeq SOP, and everything went well. I could perform all the steps in the guide, since my files contain data from an Illumina MiSeq run.

My problems started when I tried to compare the Illumina files with other biological data. I’m using two samples from a MiSeq run, and I wanted to compare them with metagenome samples (bovine rumen, cecum and gut samples, which are related to mine).

I created a topic because I was unable to perform the analysis, and then I discovered that if I want to run mothur with all the samples, I need to create a single group file and a single fasta file (by using make.contigs, merge.files and make.group).

I followed the MiSeq SOP (not all the steps, because the libraries were built differently), and when I ran the classify.seqs command, I couldn’t classify a single sequence other than those from my own samples.

If I run only my two samples, following the MiSeq SOP:
M00988_41_000000000-ACE44_1_1101_12783_9498 Bacteria(100);Firmicutes(95);Clostridia(94);Clostridiales(93);unclassified;unclassified;
M00988_41_000000000-ACE44_1_1101_12769_26602 Bacteria(100);Firmicutes(84);Clostridia(81);Clostridiales(81);unclassified;unclassified;
M00988_41_000000000-ACE44_1_1101_12175_15467 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);unclassified;
M00988_41_000000000-ACE44_1_1101_10469_7046 Bacteria(100);Firmicutes(100);Clostridia(99);Clostridiales(99);Ruminococcaceae(95);unclassified;
M00988_41_000000000-ACE44_1_1101_10021_12449 Bacteria(100);Firmicutes(98);Clostridia(98);Clostridiales(98);Ruminococcaceae(98);unclassified;

And if I run with more samples than just mine (metagenome samples obtained from the MG-RAST web server):

M00988_41_000000000-ACE44_1_2114_23954_25084 Bacteria(100);Verrucomicrobia(100);Verrucomicrobiae(100);Verrucomicrobiales(100);Verrucomicrobiaceae(100);Akkermansia(100);
M00988_41_000000000-ACE44_1_2114_18925_25608 Bacteria(99);Firmicutes(97);Clostridia(71);Clostridiales(70);unclassified;unclassified;
M00988_41_000000000-ACE44_1_2114_16242_26215 Bacteria(100);Firmicutes(70);unclassified;unclassified;unclassified;unclassified;
FTJKNNL02FWWIT Bacteria(82);unclassified;unclassified;unclassified;unclassified;unclassified;
FTJKNNL02HH0RK Bacteria(87);unclassified;unclassified;unclassified;unclassified;unclassified;
FTJKNNL02JJ1VG Bacteria(85);unclassified;unclassified;unclassified;unclassified;unclassified;
FTJKNNL02HA9V3 Bacteria(87);unclassified;unclassified;unclassified;unclassified;unclassified;
FTJKNNL02H12DN Bacteria(85);unclassified;unclassified;unclassified;unclassified;unclassified;
FTJKNNL02HDIRT Bacteria(76);unclassified;unclassified;unclassified;unclassified;unclassified;


So, my first question is: what am I missing??

The other samples contain 16S information. I ran the classify.seqs and remove.lineage commands, so if there is nothing that can be compared to the RDP files, it should be removed (as unknown), right?

I ran the command using the default settings from the MiSeq SOP, and also using an updated RDP taxonomy file, trainset14.

I didn’t perform all the commands in the guide, but I was able to refine the files and remove duplicated entries. I performed the analysis using all 12 samples at once, then divided them into 3 files containing 4 samples each, and also took one file on its own to run mothur. The results are always the same.

I understand that running only classify.seqs wouldn’t give me very specific assignments, but I don’t understand why nothing classifies below the domain level.

Finally, I also changed the default value, going as low as 20%, but nothing changed. Can you guys help me?

Thanks for your attention,
Rafael.

To start with, when you say metagenomes from MG-RAST I’m assuming you mean the proper shotgun-sequenced metagenomes? If so, I wouldn’t be surprised that the majority of sequences don’t classify, since most of the DNA in samples won’t be ribosomal.

My first question would be - when running your own data through the MiSeq SOP did you perform alignment filtering on the taxonomic database? If so, this will drastically reduce your ability to find 16S fragments in the metagenomes, since you only have a small region of the ribosome to identify.

What does the summary file from taxonomic assignment look like? Do you see any sequences classified to at least phylum level? You would expect the majority of your sequences to be unclassified, so don’t worry if that percentage is huge, but are you at least seeing some sequences successfully classified? When doing this kind of assignment, removing sequences with an ‘unknown’ assignment probably isn’t stringent enough; you will probably also need to remove sequences that are classified as ‘Bacteria;unclassified’.
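If it helps to put numbers on the phylum-level question, here is a minimal Python sketch that counts how many leading ranks each read was assigned before hitting "unclassified". It assumes each taxonomy line has the read name, whitespace, then the semicolon-separated taxonomy with confidence scores in parentheses, as in the output pasted earlier in the thread. The function names are made up for illustration and are not part of mothur.

```python
import re
from collections import Counter


def classified_depth(taxonomy):
    """Count leading ranks that are not 'unclassified' or 'unknown'."""
    depth = 0
    for rank in taxonomy.strip().strip(";").split(";"):
        # Drop a trailing confidence score such as "(97)".
        name = re.sub(r"\(\d+\)$", "", rank.strip())
        if name in ("", "unclassified", "unknown"):
            break
        depth += 1
    return depth


def depth_summary(lines):
    """Map classification depth (1 = domain only, 2 = phylum, ...) to read count."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        # Read name first, then the taxonomy string.
        _read_name, taxonomy = line.split(None, 1)
        counts[classified_depth(taxonomy)] += 1
    return counts


# Two lines copied from the output posted above.
example = [
    "M00988_41_000000000-ACE44_1_2114_18925_25608 Bacteria(99);Firmicutes(97);"
    "Clostridia(71);Clostridiales(70);unclassified;unclassified;",
    "FTJKNNL02FWWIT Bacteria(82);unclassified;unclassified;unclassified;"
    "unclassified;unclassified;",
]
summary = depth_summary(example)
for depth in sorted(summary):
    print(depth, summary[depth])
```

With a real run you would pass the open taxonomy file instead of the example list. A depth of 1 means the read stopped at domain level, which is the ‘Bacteria;unclassified’ case.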

Thanks for your answer!

About the metagenome samples, no. I downloaded the sequences from a specific step of the analysis “RNA Clustering 97%”:

“The FASTA formatted file METAGENOMESAMPLE contains sequence clusters that have at least 70% identity to ribosomal sequences and have sequences within 97% identity”.

So the samples I downloaded contain sequence clusters related to ribosomal sequences, and that’s what I need.

About the alignment filtering, no. My samples use V4 region primers, which is why the guide was so useful for analysing them, but the metagenome samples don’t. So I didn’t perform any alignment filtering, because the software would remove all the sequences in the fasta file (like it did before).

And the summary file only shows taxonomic assignments at the domain level. I can run remove.lineage with “Bacteria;unclassified” as an option, but then the only remaining classified sequences would be from my samples.

So, what am I doing wrong? For classify.seqs I only use the RDP files, and I don’t perform an alignment against the SILVA files, because it would focus on the V4 region (as the guide teaches) and I don’t have this information for the other metagenome samples (I also tried the alignment filtering and everything went wrong, because the region is incompatible).

Thanks again,
Rafael.

It’s odd that it only assigns to domain level. Have you tried submitting a few sequences to the NCBI BLAST tool to see what sort of gene they are? I don’t know how MG-RAST does its ribosome calling, but I assume it annotates anything from any of the ribosomal RNA genes as such, whereas mothur is only interested in the 16S part.

I would have expected you to have some of these, but if nothing is classifying to phylum level then that’s odd. Try BLASTing online to make sure they are actually 16S.
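To pull out a handful of sequences for pasting into the BLAST web form, something like the following Python sketch should do. It assumes an ordinary FASTA file and picks reads by name prefix. The function name is made up for illustration, and the sequences in the example are placeholders, not real reads; only the read names come from the output posted above.

```python
def first_fasta_records(fasta_text, n, id_prefix=""):
    """Return the first n FASTA records whose ID starts with id_prefix, as FASTA text."""
    records = []
    header, seq_lines = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))

    picked = []
    for head, seq in records:
        # The record ID is the first whitespace-delimited token of the header.
        record_id = (head.split() or [""])[0]
        if record_id.startswith(id_prefix):
            picked.append((head, seq))
        if len(picked) == n:
            break
    return "".join(f">{head}\n{seq}\n" for head, seq in picked)


# Placeholder sequences, for illustration only.
example_fasta = """>M00988_41_000000000-ACE44_1_2114_23954_25084
ACGTACGT
>FTJKNNL02FWWIT
TTGACC
GGA
>FTJKNNL02HH0RK
CCATGG
"""
subset = first_fasta_records(example_fasta, 1, id_prefix="FTJKNNL")
print(subset, end="")
```

Picking by the FTJKNNL prefix keeps the check focused on the MG-RAST reads rather than the MiSeq ones, which already classify well.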