Subsampling with Illumina data

nminalt · July 24, 2014, 4:26pm

Hey,

I am working with Illumina data for the first time; I have worked with 454 and Ion Torrent in the past. I have a really large data set and have gotten through most of the processing, up to the cluster.split command. I would like to subsample before this command so that it runs faster. I tried running the sub.sample command and setting the number of sequences to be t so mothur picks the sample with the lowest number of sequences. I have about 3,000,000 unique sequences and about 21,000,000 sequences total, and mothur chose to subsample to 15 sequences when I ran this command. This is way too small, and I would like to see how many sequences each of my samples has at this point in the analysis. Is there a command I can run that tells me this? Excel will not open the count_table because the file is too large.

Thanks,

Nicole

nminalt · July 24, 2014, 7:30pm

I have also run cluster.split without subsampling, using 10 processors, and mothur froze up and did not finish the command after running for a long time.

pschloss · July 24, 2014, 8:15pm

15 is probably the size of the smallest group. You can run count.groups to get the frequency of each sample.
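
If the count_table is too large to open in Excel, the per-sample totals can also be tallied outside mothur by reading the file one row at a time. The following is only a minimal Python sketch, not part of mothur. It assumes the uncompressed, tab-separated count_table layout (a header row, then the sequence name, the total, and one column per group); the function name `group_totals` and the tiny in-memory table are made up for illustration.

```python
import csv
import io


def group_totals(handle):
    """Sum each group column of a count table, reading one row at a time."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumed layout: column 0 = sequence name, column 1 = total, rest = groups.
    # Empty names (from a trailing tab) are dropped.
    groups = [name for name in header[2:] if name]
    totals = [0] * len(groups)
    for row in reader:
        if not row:
            continue  # skip blank lines
        for i, value in enumerate(row[2:2 + len(groups)]):
            totals[i] += int(value)
    return dict(zip(groups, totals))


# Made-up example table; for a real file, pass open("your.count_table") instead.
example = io.StringIO(
    "Representative_Sequence\ttotal\tA\tB\n"
    "seq1\t5\t3\t2\n"
    "seq2\t1\t0\t1\n"
)

# Print the groups from smallest to largest so tiny samples show up first.
for name, n in sorted(group_totals(example).items(), key=lambda kv: kv[1]):
    print(name, n)
```

Sorting by size puts the smallest groups at the top, which is where a sample like the one with 15 sequences would appear.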

nminalt · July 29, 2014, 4:11pm

Thanks, I was able to run count.groups to see the frequency of each sample. I found that 3 samples have very low frequencies, and I would like to remove them from my data set. I ran a remove.groups command using the final count_table file and the sequences were removed. I then tried running sub.sample(fasta=filename.final.fasta, count=filename.final.pick.count_table, taxonomy=filename.final.taxonomy, persample=t); however, sub.sample will not run because it says there are different numbers of sequences in my fasta file and my count_table. How would I remove these 3 samples in order to run sub.sample?

nminalt · July 29, 2014, 4:55pm

I just ran:
remove.groups(count=stability.trim.contigs.final.count_table, groups=40B-11B-131B)
Removed 481 sequences from your count file.

Output File names:
stability.trim.contigs.final.pick.count_table

remove.groups(group=stability.contigs.good.groups, groups=40B-11B-131B, fasta=stability.trim.contigs.final.fasta)
Removed 106 sequences from your fasta file.
Removed 595 sequences from your group file.

Output File names:
stability.trim.contigs.final.pick.fasta
stability.contigs.good.pick.groups

remove.groups(group=stability.contigs.good.groups, groups=40B-11B-131B, taxonomy=stability.trim.contigs.final.taxonomy)

Removed 595 sequences from your group file.
Removed 106 sequences from your taxonomy file.

Output File names:
stability.contigs.good.pick.groups
stability.trim.contigs.final.pick.taxonomy

sub.sample(fasta=stability.trim.contigs.final.pick.fasta, count=stability.trim.contigs.final.pick.count_table, taxonomy=stability.trim.contigs.final.pick.taxonomy, persample=t)

and I received this error: [ERROR]: your fasta file contains 2857086 sequences, and your count file contains 2857089 unique sequences, please correct.

How do I correct this?

pschloss · July 29, 2014, 7:30pm

You just need a single remove.groups command…

remove.groups(count=stability.trim.contigs.final.count_table, fasta=stability.trim.contigs.final.fasta, taxonomy=stability.trim.contigs.final.taxonomy, groups=40B-11B-131B)

Then try the subsample command.

Pat
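
If the fasta and count files still disagree after this, it can help to list exactly which sequence names appear in one file but not the other. The following is a rough Python sketch, not a mothur feature. It assumes a plain FASTA file whose headers start with the sequence name and an uncompressed, tab-separated count_table with a single header row; the helper names and the tiny example data are hypothetical.

```python
import io


def fasta_names(handle):
    """Collect sequence names: the text after '>' up to the first whitespace."""
    names = set()
    for line in handle:
        if line.startswith(">") and len(line.strip()) > 1:
            names.add(line[1:].split()[0])
    return names


def count_table_names(handle):
    """Collect the first column of a count table, skipping the header row."""
    next(handle)  # assumed: exactly one header line
    return {line.split("\t", 1)[0].strip() for line in handle if line.strip()}


# Made-up data; for real files, pass open("...fasta") and open("...count_table").
fasta = io.StringIO(">s1\nACGT\n>s2\nACGT\n")
counts = io.StringIO(
    "Representative_Sequence\ttotal\tA\n"
    "s1\t2\t2\n"
    "s2\t1\t1\n"
    "s3\t4\t4\n"
)

in_fasta = fasta_names(fasta)
in_counts = count_table_names(counts)

# Names present in only one of the two files.
print("only in count table:", sorted(in_counts - in_fasta))
print("only in fasta:", sorted(in_fasta - in_counts))
```

For the error above, the count table would be expected to hold three names that the fasta file lacks.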
