I wonder if there is an error in the sub.sample command, or am I asking it to do something it is not meant for? I am trying to sample an equal number of sequences from each of my groups within a fasta file using the fasta, name, group and size options. However, it samples an unequal number of sequences from each group, and the total number of sequences in the subsample.groups file equals the value of the size option. What I need is a subsample.groups file whose total equals the size value times the number of groups in my input file. Am I doing something wrong?
The current version of the sub.sample command cannot do what you are looking for, but we can certainly add a feature request.
If you run a command like: sub.sample(fasta=abrecovery.fasta, group=abrecovery.groups, name=abrecovery.names, size=100)
mothur will randomly select 100 sequences from the name file and output them to a new fasta file, and create a new group file with those sequences in it.
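To see why the group counts come out uneven, here is a minimal Python sketch of pooled sampling. It is not mothur's code: the group sizes and the `pooled_subsample` helper are made up for illustration, and it ignores the name file entirely. Drawing `size` names from the whole pool gives `size` sequences in total, with each group represented roughly in proportion to its share of the input.

```python
import random
from collections import Counter

def pooled_subsample(seq_to_group, size, seed=None):
    """Pick `size` sequence names from all groups combined (hypothetical helper)."""
    rng = random.Random(seed)
    # Sort the names so the draw is reproducible for a given seed.
    chosen = rng.sample(sorted(seq_to_group), size)
    return {name: seq_to_group[name] for name in chosen}

# Made-up input: three groups of different sizes.
seq_to_group = {}
for group, n in (("A", 500), ("B", 300), ("C", 200)):
    for i in range(n):
        seq_to_group[f"{group}_seq{i}"] = group

subset = pooled_subsample(seq_to_group, size=100, seed=1)
counts = Counter(subset.values())

# The total always equals `size`; the per-group counts vary from draw to draw.
print(len(subset), dict(counts))
```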
Here’s a workaround to get what you are looking for. Unfortunately, it could get tedious with many groups:
sub.sample(fasta=abrecovery.fasta, group=abrecovery.groups, name=abrecovery.names, size=100, groups=A)
sub.sample(fasta=abrecovery.fasta, group=abrecovery.groups, name=abrecovery.names, size=100, groups=B)
sub.sample(fasta=abrecovery.fasta, group=abrecovery.groups, name=abrecovery.names, size=100, groups=C)
For each of these commands, mothur will select 100 names from the name file that belong to the specified group, and create a new fasta file and group file. Each run writes to the same output names (for example, abrecovery.subsample.fasta), so between commands you will need to rename the fasta and groups files. Then you can merge the fasta files and the group files to create one large fasta file and one large group file containing 100 sequences from each of your groups.
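If you have many groups, the per-group commands above can be generated rather than typed by hand. The following is a minimal Python sketch under that assumption; `per_group_commands` is a hypothetical helper, not part of mothur, and it only builds the command text in the same form as the three commands shown above. You would still need to rename the output files between runs and merge them afterwards.

```python
def per_group_commands(fasta, group_file, name_file, size, groups):
    """Build one sub.sample command string per group (hypothetical helper)."""
    template = (
        "sub.sample(fasta={fasta}, group={group}, name={name}, "
        "size={size}, groups={g})"
    )
    return [
        template.format(fasta=fasta, group=group_file, name=name_file,
                        size=size, g=g)
        for g in groups
    ]

groups = ["A", "B", "C"]
size = 100
commands = per_group_commands(
    "abrecovery.fasta", "abrecovery.groups", "abrecovery.names", size, groups
)

for cmd in commands:
    print(cmd)

# After merging, the combined files should hold size * number of groups sequences.
expected_total = size * len(groups)
print(expected_total)
```

With three groups and size=100, the merged files should contain 300 sequences, provided every group has at least 100 sequences to draw from.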
I hope this helps,
Sarah
Thanks, Sarah! I had already figured out the workaround. It's just tedious and adds to a huge list of commands to run, especially when you have many files. I hope this feature will be added very soon.