remove.lineage removes half of my sequences

Hi,

I can´t figure out what is going on after I run the remove.lineage command to remove Archaea, chloroplasts, and mitochondira sequences. I recover only half of the sequences and it seems very high, In the SOP just 350 sequences are removed.

After removing quimeras I obtained

of unique seqs: 83768

total # of seqs: 976210

Then, I run “mothur > classify.seqs(fasta=Undetermined_S0_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=Undetermined.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)”, and then “remove.lineage(fasta=Undetermined_S0_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=Undetermined.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=Undetermined_S0_L001_R1_001.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)”.

When I run summary.seqs again I obtain

of unique seqs: 34744

total # of seqs: 305921

so the number of both unique and total reads seem to have been reduced by half.

Why can that be?

My samples are a mix of mostly cyanobacteria and other heterotrophic bacteria, I wonder if most of the cyanobacteria were not classified and were removed?

Thanks

  1. I actually forgot to mention that my research question is to investigate the microbial diversity (heterotrophic community) associated to a cyanobacteria (trichodesmium). realized that when I run the classify command I included “chloroplast” in it. When I look into the trainset9 dataset I see that cyanobacteria are considered as “chloroplasts”, since they are phytoplankton of course. I guess that means that all cyanobacteria are removed and only heterotrophic bacteria remain in my dataset?

  2. Also, I see that “Trichodesmium” is not classified as such in the trainset9 dataset. When I look for Trichodesmium in the silva database online, I find different entries. Some of them I can find them in the trainset9 database but some of them I can´t. For example, I find X70767, which in the silva dataset is classified as Trichodesmium sp. gene for 16S rRNA, but in the trainset9 database is classified as Root;Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria;Family XII;GpXII.

If I wanted to remove only Trichodesmium reads but keep the rest of cyanobacteria, would I have to create my own trainset9 database, or remove only the trichodesmium reads from trainset9?, how could I do it?

Thank you!

Correct - you would have to manually alter the tax file and then use remove.lineage.

Pat