Problem with split.abund and group file

Hi

I am working with 454 data and have to remove the OTUs with only 1, 2 or 3 seqs (yes, I know it is not the best choice, but I was asked to do it this way, I’m sorry). I also have to get the ordination (PCoA) based on unifrac distances. Of course the latter and the other results (which will be based on the OTU approach) have to be based on the same dataset (same fasta file, same type and number of sequences per sample).

I followed the 454 SOP up to the cluster step and got the final4.an.list, together with my final4.fasta, final4.groups and final4.names files.

Then,
split.abund(fasta=final4.fasta, list=final4.an.list, group=final4.groups, cutoff=3, label=0.03)
Output File Names:
final4.an.0.03.rare.list
final4.an.0.03.abund.list
final4.0.03.rare.groups
final4.0.03.abund.groups
final4.0.03.rare.fasta
final4.0.03.abund.fasta

summary.seqs(fasta=final4.0.03.abund.fasta, name=final4.names)

             Start    End   NBases   Ambigs  Polymer  NumSeqs
Minimum:     1        1571  493      0       3        1
2.5%-tile:   1        1573  500      0       4        4608
25%-tile:    1        1573  505      0       4        46073
Median:      1        1573  520      0       5        92145
75%-tile:    1        1573  527      0       5        138217
97.5%-tile:  1        1573  529      0       7        179681
Maximum:     3        1573  574      0       8        184288
Mean:        1.00001  1573  517.172  0       4.94459

# of unique seqs: 38482

total # of seqs: 184288

make.shared(list=final4.an.0.03.abund.list, group=final4.0.03.abund.groups, label=0.03)
count.groups()

10A_11 contains 7563
11A_10 contains 7117
11B_10 contains 6154
11C_10 contains 6696
12A_10 contains 5950
12B_10 contains 8900
1A_11 contains 6508
1B_11 contains 7614
2A_11 contains 8065
2B_11 contains 7924
3B_11 contains 6144
4A_11 contains 5278
4B_11 contains 6270
5A_10 contains 6590
5A_11 contains 6302
6A_11 contains 7735
6B_11 contains 7408
7A_10 contains 5736
7A_11 contains 8435
7B_10 contains 7686
8A_10 contains 6151
8A_11 contains 6781
8B_10 contains 5481
9A_10 contains 6463
9A_11 contains 6509
S1B1S_10 contains 5767
S1B1S_11 contains 7061
Total seqs: 184288

sub.sample(shared=final4.an.0.03.abund.shared, size=5278)
classify.otu(list=final4.an.0.03.abund.list, name=final4.names, taxonomy=final4.taxonomy, label=0.03)

Up to here, everything seems to have worked well!! I could get the rarefaction curves, diversity indexes, heatmap and so on from the subsampled shared file.

Then, when I tried to do the PCoA based on unifrac distances:
dist.seqs(fasta=final4.0.03.abund.fasta, output=phylip, processors=10)
clearcut(phylip=final4.0.03.abund.phylip.dist)

I tried:
unifrac.unweighted(tree=final4.0.03.abund.phylip.tre, name=final4.names, group=final4.0.03.abund.groups, distance=lt, processors=10, random=F, subsample=T)

It didn’t work: a lot of sequence names were listed, with a message saying that those seqs were not in my groups file.
I then tried with the first groups file (from before split.abund):
unifrac.unweighted(tree=final4.0.03.abund.phylip.tre, name=final4.names, group=final4.groups, distance=lt, processors=10, random=F, subsample=T)

Setting subsample size to 5830

Now it worked, but the subsample size (5830) was higher than the one I had to set for subsampling the shared file (5278).

Why is it not working if the fasta file is the same and the groups file should match it, as both were outputs of split.abund??

Sorry, but I cannot see where I am doing something wrong.

Thanks!!

You might want to include the names file in the split.abund command (name=final4.names), so that it gets split along with the fasta, list and groups files.
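
The likely reason for the error is that final4.names still lists every sequence, including those from the rare OTUs, while final4.0.03.abund.groups only contains the sequences kept in the abundant OTUs. If you want to confirm this before rerunning anything, here is a rough Python sketch (not part of mothur) that lists the names present in a names file but absent from a groups file. It assumes the usual two-column layouts: the names file has the representative name followed by a comma-separated list of all the names it stands for, and the groups file has the sequence name followed by its group. The function names are only illustrative.

```python
import io


def read_group_names(handle):
    """Collect the sequence names (first column) of a mothur groups file."""
    names = set()
    for line in handle:
        parts = line.split()
        if parts:
            names.add(parts[0])
    return names


def read_names_file(handle):
    """Collect every sequence name listed in the second column of a names file."""
    names = set()
    for line in handle:
        parts = line.split()
        if len(parts) >= 2:
            names.update(n for n in parts[1].split(",") if n)
    return names


def missing_from_groups(names_handle, groups_handle):
    """Return the sorted names that are in the names file but not in the groups file."""
    return sorted(read_names_file(names_handle) - read_group_names(groups_handle))


if __name__ == "__main__":
    # Toy data: seqC is in the names file but has no entry in the groups file.
    names_txt = "seqA\tseqA,seqB\nseqC\tseqC\n"
    groups_txt = "seqA\t10A_11\nseqB\t5A_10\n"
    print(missing_from_groups(io.StringIO(names_txt), io.StringIO(groups_txt)))
```

With real files you would pass open("final4.names") and open("final4.0.03.abund.groups") instead of the toy strings. A non-empty result means the two files do not describe the same set of sequences, which is what the unifrac.unweighted message is complaining about.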