Subsampling with Illumina data


I am working with Illumina data for the first time; I have worked with 454 and Ion Torrent data in the past. I have a really large data set and I have gotten through most of the processing, up to the cluster.split command. I would like to subsample before this command so that it runs faster. I tried running the sub.sample command and setting the number of sequences to be t so that mothur picks the sample with the lowest number of sequences. I have about 3,000,000 unique sequences and about 21,000,000 sequences in total, and mothur chose to subsample to 15 sequences when I ran this command. This is way too small, and I would like to see how many sequences each of my samples has at this point in the analysis. Is there a command I can run that tells me this? Excel will not open the count.table because the file is too large.



I have also run cluster.split without subsampling, using 10 processors, and mothur froze and did not finish the command after running for a long time.

15 is probably the size of the smallest group. You can run count.groups to get the frequency of each sample.
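If the count table is too large for Excel, the per-sample totals can also be tallied outside mothur by reading the file one row at a time. The following is only a rough sketch: it assumes the uncompressed, tab-delimited count table layout (a header row with a sequence-name column, a total column, then one column per sample), and the small table in the demo is made up for illustration. count.groups remains the simpler route.

```python
import csv
import io


def group_totals(handle):
    """Sum each sample column of a tab-delimited count table, one row at a time.

    Assumed layout (not checked against any particular mothur version):
    column 1 = sequence name, column 2 = total, remaining columns = samples.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Ignore empty header cells, which appear if lines end with a trailing tab.
    groups = [name for name in header[2:] if name]
    totals = dict.fromkeys(groups, 0)
    for row in reader:
        if not row:
            continue  # skip blank lines
        for name, value in zip(groups, row[2:]):
            totals[name] += int(value)
    return totals


# Made-up miniature table standing in for a real count file.
demo = (
    "Representative_Sequence\ttotal\t40B\tA1\tA2\n"
    "seq1\t10\t1\t4\t5\n"
    "seq2\t6\t0\t3\t3\n"
)
totals = group_totals(io.StringIO(demo))

# List samples from smallest to largest so the outliers are easy to spot.
for name, n in sorted(totals.items(), key=lambda kv: kv[1]):
    print(name, n)
```

For a real file, replace the io.StringIO(demo) argument with an open file handle, for example open(path, newline=""), where path is whatever your count table is called.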

Thanks, I was able to run count.groups to see the frequency of each sample. I found that 3 samples have very low frequencies, and I would like to remove them from my data set. I ran a remove.groups command using the final count_table file and the sequences were removed. I then tried running sub.sample(,,, persample=t), but it will not run because it says there are different numbers of sequences in my fasta file and my count_table. How do I remove these 3 samples so that I can run sub.sample?

I just ran:
remove.groups(, groups=40B-11B-131B)
Removed 481 sequences from your count file.

Output File names:

remove.groups(group=stability.contigs.good.groups, groups=40B-11B-131B,
Removed 106 sequences from your fasta file.
Removed 595 sequences from your group file.

Output File names:

remove.groups(group=stability.contigs.good.groups, groups=40B-11B-131B,

Removed 595 sequences from your group file.
Removed 106 sequences from your taxonomy file.

Output File names:

sub.sample(,,, persample=t)

and I received this error: [ERROR]: your fasta file contains 2857086 sequences, and your count file contains 2857089 unique sequences, please correct.

How do I correct this?

You just need a single command…

remove.groups(,,, groups=40B-11B-131B)

Then try the subsample command.
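If the error about mismatched sequence counts comes back, it can help to see exactly which sequence names are in one file but not the other. Below is a minimal sketch of that check, assuming a standard fasta file (names on lines starting with ">") and a tab-delimited count table with one header row and the sequence name in the first column. The function names and the tiny demo inputs are invented for illustration and are not part of mothur.

```python
import io


def fasta_ids(handle):
    """Collect sequence names from fasta header lines (text after '>')."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])  # keep only the name, drop any description
    return ids


def count_table_ids(handle):
    """Collect first-column names from a tab-delimited count table."""
    next(handle, None)  # skip the header row (assumed to be a single line)
    return {line.split("\t", 1)[0].strip() for line in handle if line.strip()}


# Made-up inputs: the count table has one name the fasta file lacks.
demo_fasta = ">seq1\nACGT\n>seq2\nACGA\n"
demo_count = "Representative_Sequence\ttotal\nseq1\t3\nseq2\t2\nseq3\t1\n"

in_fasta = fasta_ids(io.StringIO(demo_fasta))
in_count = count_table_ids(io.StringIO(demo_count))

only_in_count = in_count - in_fasta
only_in_fasta = in_fasta - in_count
print("only in count table:", sorted(only_in_count))
print("only in fasta:", sorted(only_in_fasta))
```

Names that show up on only one side are the ones that were removed from one file but not the other, which is what running remove.groups on all of the files together is meant to avoid.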