During a recent analysis with mothur we got a strange result for our heatmap and it would be great if someone could help us understand it and tell us if we encountered a bug or are just unable to properly interprete the heatmap (of which I don´t have much expertise).
In our analysis we compared the top 20 OTUs from three treatments and the number of sequences in the OTUs (as seen in the shared file used for creating the heatmap) do not seem to match the colouring in the heatmap. For example:
A B C
OTU005 61 451 449
OTU006 0 0 449
OTU007 0 0 419
In the heatmap treatment C shows for example a light red for OTUs 005 & 006 but black for the very similar number of sequences at OTU007. OTU005 at treatment B, which is again nearly identical to the numbers in C, shows also a much darker red than those.
We did this analysis on the Windows and Linux versions of mothur (both V1.32) with the following commands:
Starting from our processed and quality controled final fasta, names & groups files, we excluded some samples with remove.groups resulting in “pick” files. We also created a new groups file in which we combined samples that belong to the same treatment. After that we used:
The number of sequences in the 20 most abundant OTUs were taken from the shared file (final.pick.an.shared) at label 0.03 and (of course) compared to the heatmap of the same level.
Any help in understanding this output and putative discrepancy would be greatly appreciated.
I tried the linear and log2 options of the command, but with the linear output everything but the first three OTU lines are pitch black while the log2 output does not seem to differ from the log10.
So you are saying that the colour differences come from the scaling, even when the numbers are very similar (451 versus 449 & 449 versus 419 in my example)? Is this an effect of the differences in abundance between the top three OTUs with sequence numbers of 3000-4000 versus the other 17 in the analysis with 400 and below? As I mentioned, I have not much experience with heatmaps, so I hope you forgive my stupid questions.
Hi Rene,
There are no stupid questions, :). I am happy to help. For all options mothur finds the maximum relative abundance for each group, and the relative abundance for each group in each OTU. In all cases the relative abundance is divided by the maximum relative abundance. The scalers add a layer of complexity. Let’s look at your example in concrete numbers:
label Group numOtus OTU1 OTU2 OTU3
0.03 A 3 61 0 0
0.03 B 3 451 0 0
0.03 C 3 449 449 419
group OTU relabund maxRelabund (relabund/maxrelabund)
A OTU1 1 1 1
B OTU1 1 1 1
C OTU1 0.340926 0.340926 1
C OTU2 0.340926 0.340926 1
C OTU3 0.318147 0.340926 0.933185
The entire picture is almost all red, with the exception of the 0 abundances which are white. If you are seeing black you are likely using an old version of mothur with a bug where the colors were reversed.
Now the log10 scale:
group OTU relabund maxRelabund (log10(relabund)/log10(maxrelabund))
A OTU1 1 1.001 0
B OTU1 1 1.001 0
C OTU1 0.340926 0.340926 1
C OTU2 0.340926 0.340926 1
C OTU3 0.318147 0.340926 1.06426
For groups A and B, because all your sequences are in one OTU the maxRelabund = 1 and the log10(1)=0. This is causing a bit of a weird result. I will bring this special case up with Pat.
I also have 8 libraries (sorry, still Sanger sequences) with few clones each (normalized dataset to a maximum of 96 seqs per sample) and they happen to be dominated by one to three OTUs and then a lot of 2-3 and unique OTUs. When I tried to build the heatmap, I have the same issue. Should I use the linear option? Is there any solution Pat proposed for this problem?
And one more question, only black and red colors can be used or there is a way to make the plots look more colorfull?
It would probably be best to build the heat maps in R. They have far more resources available for making pretty heat maps and we really aren’t interested in reimplementing R.
But if I use the Mothur heatmap command, due to all the above explained about the distribution of the sequences among the OTUs defined, should I use linear, log2 or log10?