Removing singletons (yeah, I know)

Hi guys,

I have clustered my 16S V4 amplicon sequences from 96 different samples according to the Mothur MiSeq SOP, and I now want to remove singletons (OTUs consisting of at most one sequence (read) across all 96 samples) from my dataset. I consider this a technical question; let’s leave the discussion of whether it’s right or not until later.

After clustering 149,352 pre-clusters at 97% similarity level, I have 70,376 OTUs, representing about 12 million reads.

From the user manual and forum I understand that there are at least two strategies to remove singletons (after clustering):

  1. split.abund using fasta file, list or count file, label=0.03 and cutoff=1
  2. remove.rare using list file, count file, label=0.03 and nseqs=2

Using split.abund:
Should I use the count_table (generated before clustering), or the list (generated after clustering)?

LIST:

split.abund(fasta=xxx.fasta, list=xxx.list, cutoff=1, label=0.03)

outputs an .abundant.list with 10,711 “abundant” OTUs (with >1 sequence).

OR

COUNT_TABLE:

split.abund(fasta=xxx.fasta, count=xxx.count_table, cutoff=1, label=0.03)

outputs an .abund.count_table with 21,103 pre-clusters (not OTUs) with >1 read.

Is the list option compatible with the make.shared command? If so, how do I update my count_table for use in this command?

Using remove.rare:

remove.rare(list=pilot.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=pilot.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, nseqs=2, label=0.03)

Here, I get only 6,788 OTUs. This is far fewer than when running split.abund with the list option.

Can anyone help me? Does anyone have a good way to remove singletons?

All answers appreciated!

Even

What about filter.shared(…, mintotal=1)?

If you are trying to remove singleton OTUs, then you want to use the list and count files after clustering. You need to be sure to include the count file. This is because when you cluster with a count file, the list file created only contains unique names. This could mean you have an OTU with 1 sequence name that is not a “singleton” OTU, because that one unique name may represent hundreds of sequences from various samples. The following command will create *.abund and *.rare versions of your files:

mothur > split.abund(fasta=xxx.fasta, list=xxx.list, count=xxx.count, cutoff=1, label=0.03)

If you want to use the remove.rare command, you can do the following:

mothur > remove.rare(fasta=xxx.fasta, list=xxx.list, count=xxx.count, nseqs=2, label=0.03)

You can also use the filter.shared command:

mothur > make.shared(list=xxx.list, count=xxx.count, label=0.03)
mothur > filter.shared(shared=current, mintotal=1, makerare=f, label=0.03)
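
To illustrate why the count file matters, here is a minimal Python sketch, not mothur code. It assumes made-up sequence names, OTU names and read counts. It compares two ways of calling an OTU a singleton: by the number of unique names in the list file, and by the number of reads those names represent in the count file.

```python
# Toy data: every name and number here is hypothetical, for illustration only.
# Total reads represented by each unique sequence name (as in a count file).
count_totals = {"seqA": 350, "seqB": 1, "seqC": 1, "seqD": 1}

# OTUs as lists of unique sequence names (as in one label of a list file).
otus = {
    "Otu1": ["seqA"],          # one unique name, but 350 reads
    "Otu2": ["seqB", "seqC"],  # two unique names, two reads
    "Otu3": ["seqD"],          # one unique name, one read: a true singleton
}


def otu_abundance(names, totals):
    """Sum the read counts of all unique names assigned to one OTU."""
    return sum(totals[name] for name in names)


abundances = {otu: otu_abundance(names, count_totals) for otu, names in otus.items()}

# Judging by unique names alone wrongly flags Otu1 as a singleton.
singletons_by_names = sorted(otu for otu, names in otus.items() if len(names) == 1)
# Judging by reads flags only OTUs that really contain a single read.
singletons_by_reads = sorted(otu for otu, total in abundances.items() if total == 1)

print("by unique names:", singletons_by_names)
print("by read counts: ", singletons_by_reads)
```

In this toy case, only the read-based count identifies Otu3 as the single true singleton, which is the reason for passing both list and count to the commands above.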

Kindly,
Sarah

Hi guys, thanks for answering! Still, I have some problems getting this to work properly. Let’s focus on “split.abund”:

Okay, so I’m working with three input files:
1) The .count_table.
I did not get any .count_table output from the clustering, just the list file. I used dist.seqs(fasta=xxx.fasta, cutoff=0.20, processors=32). Then I used cluster.split with the split method “classify”:

cluster.split(column=pilot.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=pilot.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=pilot.trim.contigs.good.unique.good.filter.unique.precluster.pick.seed_v119.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.2, method=average, processors=32)

Output File Names:
pilot.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list

As you can see, this gave me only a .list file! Sarah suggests I use the “list and count files after cluster”, which therefore makes no sense to me…

I only have the .count_table generated during “remove.lineage” before clustering. This contains 149,352 representative sequences, with a total of 12,428,311 reads. Of these 149,352 sequences, I can see that 128,249 represent only 1 read and can be considered singletons. The remaining 21,103 representative sequences still contain most of the reads (about 12.3 million).
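
For reference, a tally like the one above can be reproduced with a short Python sketch. It assumes the count table is tab-separated with a header row starting with `Representative_Sequence`, followed by a `total` column and one column per sample. The function name and toy data are hypothetical. Note that this counts singleton unique sequences, not singleton OTUs.

```python
import csv
import io


def summarize_count_table(handle):
    """Return (n_sequences, n_single_read_sequences, total_reads).

    Assumes a tab-separated table whose second column is the total
    read count for each representative sequence.
    """
    n_seqs = n_single = total_reads = 0
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines and the header row.
        if not row or row[0].startswith("#") or row[0] == "Representative_Sequence":
            continue
        total = int(row[1])
        n_seqs += 1
        total_reads += total
        if total == 1:
            n_single += 1
    return n_seqs, n_single, total_reads


# Toy table with made-up names and counts.
toy_table = (
    "Representative_Sequence\ttotal\tS1\tS2\n"
    "seqA\t350\t200\t150\n"
    "seqB\t1\t1\t0\n"
    "seqC\t1\t0\t1\n"
)
summary = summarize_count_table(io.StringIO(toy_table))
print(summary)
```

For a real file, the same function can be called on an open file handle, e.g. `with open(path) as fh: summarize_count_table(fh)`. With the numbers above, 149,352 − 128,249 = 21,103 sequences with more than one read, which matches the split.abund count_table result.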

2) The .fasta file, also generated during “remove.lineage”. Contains the same 149,352 representative sequences. Is this the one I should be using for split.abund?

3) The .list file, generated during “cluster.split” (which probably went wrong, as I did not get any .count_table output file). From the “unique” line, I see that the number of OTUs is 149,352, which corresponds to all of my 149,352 representative sequences left after “remove.lineage”, before clustering.

Can anyone tell if I did the clustering process correctly, and if yes, also guide me through the singleton removal?

Kind regards,

Even

  1. The cluster.split command does not create a count file, but it does use the count file to determine the merging of OTUs. The count file I was referring to is the pilot.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table file.

  2. “The .fasta file, also generated during “remove.lineage”. Contains the same 149,352 representative sequences. Is this the one I should be using for split.abund?” Yes.

  3. “The .list file, generated during “cluster.split” (which probably went wrong, as I did not get any .count_table output file). From the “unique” line, I see that the number of OTUs is 149,352, which corresponds to all of my 149,352 representative sequences left after “remove.lineage”, before clustering.”

http://www.mothur.org/wiki/Frequently_asked_questions#Aren.27t_the_.27unique.27_and_.270.00.27_distance_levels_the_same.3F Did it cluster beyond unique?
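
One quick way to check this is to list every label in the list file with its OTU count. The Python sketch below assumes each data line is tab-separated as label, number of OTUs, then one column per OTU, and that any header line starts with `label`. The function name and toy data are hypothetical.

```python
import io


def otus_per_label(handle):
    """Map each label in a list file (e.g. 'unique', '0.03') to its OTU count."""
    counts = {}
    for line in handle:
        fields = [f for f in line.rstrip("\n").split("\t") if f]
        # Skip blank lines and header lines.
        if not fields or fields[0] == "label":
            continue
        label = fields[0]
        # Columns after the label and the numOtus field are the OTUs themselves.
        counts[label] = len(fields) - 2
    return counts


# Toy list file with made-up names: three OTUs at unique, two at 0.03.
toy_list = (
    "label\tnumOtus\tOtu1\tOtu2\tOtu3\n"
    "unique\t3\tseqA\tseqB\tseqC\n"
    "0.03\t2\tseqA,seqB\tseqC\n"
)
label_counts = otus_per_label(io.StringIO(toy_list))
print(label_counts)
```

If the result contains only a `unique` entry, the clustering did not go beyond unique. If it also contains a `0.03` entry with fewer OTUs (70,376 in the first post, against 149,352 at unique), the 0.03 label is the one to pass to the downstream commands.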

“Can anyone tell if I did the clustering process correctly?”

What do you mean by clustered correctly?