Subsampling with Illumina data

Hey,

I am working with Illumina data for the first time; I have worked with 454 and Ion Torrent data in the past. I have a really large data set and have gotten through most of the processing, up to the cluster.split command. I would like to subsample before this command so that it runs faster. I tried running the sub.sample command and setting the number of sequences to t so that mothur picks the size of the sample with the fewest sequences. I have about 3,000,000 unique sequences and about 21,000,000 sequences in total, and mothur chose to subsample to 15 sequences when I ran this command. This is way too small, so I would like to see how many sequences each of my samples has at this point in the analysis. Is there a command I can run that tells me this? Excel will not open the count_table because the file is too large.

Thanks,

Nicole

I have also run cluster.split without subsampling, using 10 processors, and mothur froze and did not finish the command after running for a long time.

15 is probably the size of the smallest group. You can run count.groups to get the frequency of each sample.
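If you also want the per-sample totals outside of mothur (since Excel cannot open the file), here is a minimal Python sketch that streams through a count table one line at a time. It assumes the uncompressed, tab-delimited layout with a header of Representative_Sequence, total, and then one column per sample; the function name sample_totals and the demo data are made up for illustration, so check the header of your own file first.

```python
import csv
import io


def sample_totals(handle):
    """Sum each per-sample column of a tab-delimited count table.

    Assumed layout (not verified against your file):
    Representative_Sequence <tab> total <tab> sample1 <tab> sample2 ...
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Skip the first two columns; ignore an empty name left by a trailing tab.
    samples = [name for name in header[2:] if name]
    totals = [0] * len(samples)
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        for i, value in enumerate(row[2:2 + len(samples)]):
            totals[i] += int(value)
    return dict(zip(samples, totals))


# Tiny made-up table standing in for a real count_table file.
demo = (
    "Representative_Sequence\ttotal\t11B\t40B\tS1\n"
    "seq1\t10\t1\t2\t7\n"
    "seq2\t5\t0\t1\t4\n"
)
totals = sample_totals(io.StringIO(demo))

# Print samples from smallest to largest so tiny groups stand out.
for name, n in sorted(totals.items(), key=lambda kv: kv[1]):
    print(name, n)
```

For a real file, pass an open file handle instead of the demo string, for example open("stability.trim.contigs.final.count_table", newline=""). Only one row is held in memory at a time, so the file size should not be a problem.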

Thanks, I was able to run count.groups to see the frequency of each sample. I found that 3 samples have very low frequencies, and I would like to remove them from my data set. I ran a remove.groups command using the final count_table file, and the sequences were removed. I then tried running sub.sample(fasta=filename.final.fasta, count=filename.final.pick.count_table, taxonomy=filename.final.taxonomy, persample=t). However, sub.sample will not run because it says there are different numbers of sequences in my fasta file and my count_table. How would I remove these 3 samples so that I can run sub.sample?

I just ran:
remove.groups(count=stability.trim.contigs.final.count_table, groups=40B-11B-131B)
Removed 481 sequences from your count file.

Output File names:
stability.trim.contigs.final.pick.count_table

remove.groups(group=stability.contigs.good.groups, groups=40B-11B-131B, fasta=stability.trim.contigs.final.fasta)
Removed 106 sequences from your fasta file.
Removed 595 sequences from your group file.

Output File names:
stability.trim.contigs.final.pick.fasta
stability.contigs.good.pick.groups

remove.groups(group=stability.contigs.good.groups, groups=40B-11B-131B, taxonomy=stability.trim.contigs.final.taxonomy)

Removed 595 sequences from your group file.
Removed 106 sequences from your taxonomy file.

Output File names:
stability.contigs.good.pick.groups
stability.trim.contigs.final.pick.taxonomy

sub.sample(fasta=stability.trim.contigs.final.pick.fasta, count=stability.trim.contigs.final.pick.count_table, taxonomy=stability.trim.contigs.final.pick.taxonomy, persample=t)

And I received this error: [ERROR]: your fasta file contains 2857086 sequences, and your count file contains 2857089 unique sequences, please correct.

How do I correct this?

You just need a single remove.groups command…

remove.groups(count=stability.trim.contigs.final.count_table, fasta=stability.trim.contigs.final.fasta, taxonomy=stability.trim.contigs.final.taxonomy, groups=40B-11B-131B)

Then try the sub.sample command.

Pat
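
If the counts still disagree after that, one way to see which sequence names are causing the mismatch is to compare the names in the fasta file with the names in the count table. This is only a rough Python sketch, not a mothur feature: the helper names fasta_ids and count_table_ids are made up, and it assumes a plain fasta file plus an uncompressed, tab-delimited count table whose first line is a header.

```python
import io


def fasta_ids(handle):
    """Collect sequence names from the '>' header lines of a fasta file."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])  # name is the first token after '>'
    return ids


def count_table_ids(handle):
    """Collect names from the first column of a count table.

    Assumes one header line followed by tab-delimited rows.
    """
    ids = set()
    next(handle, None)  # skip the header line
    for line in handle:
        line = line.rstrip("\n")
        if line:
            ids.add(line.split("\t", 1)[0])
    return ids


# Made-up data standing in for the real files.
demo_fasta = ">seq1\nACGT\n>seq2\nACGA\n"
demo_count = (
    "Representative_Sequence\ttotal\tS1\tS2\n"
    "seq1\t3\t1\t2\n"
    "seq2\t2\t2\t0\n"
    "seq3\t1\t0\t1\n"
)

in_fasta = fasta_ids(io.StringIO(demo_fasta))
in_count = count_table_ids(io.StringIO(demo_count))

only_in_count = in_count - in_fasta  # names missing from the fasta file
only_in_fasta = in_fasta - in_count  # names missing from the count table
print(sorted(only_in_count), sorted(only_in_fasta))
```

With the real data you would open stability.trim.contigs.final.pick.fasta and stability.trim.contigs.final.pick.count_table in place of the demo strings. Both sets of names are kept in memory, which should be manageable for a few million unique sequences.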