Statistical details


The Normalized Enrichment Score

The Normalized Enrichment Score (NES) is an estimate of the probability $P\big[X_{in} > X_{out}\big]$, where $X_{in} \sim F_{in}(x)$ and $X_{out} \sim F_{out}(x)$. $F_{in}(x)$ is the distribution function of the intensities of the genes in the gene-set, while $F_{out}(x)$ is the distribution function of the intensities of the genes outside the gene-set. The estimate of the NES comes from the Mann-Whitney (MW) test.

If we assume that the gene-profile recapitulates the differential expression of a set of treatment samples versus a set of control samples, an NES close to 1 suggests an association of the gene-set with the treatment. Conversely, an NES close to 0 suggests an association with the control group. In this view, we can say that

$NES = P\left[\mbox{the gene-set is associated with the treatment group}\right]$.

A different way to look at the NES is the

$odds = \frac{NES}{1-NES} = \frac{P\left[\mbox{the gene-set is associated with the treatment group}\right]}{P\left[\mbox{the gene-set is not associated with the treatment group}\right]}$

The $odds$ is the ratio of the probability that the gene-set is associated with the treatment group to the probability that it is not, i.e. that the gene-set is related to the control group. The larger the odds, the stronger the association with the treatment; as the odds approaches zero, the association shifts to the control group. Odds of about 1.0 mean no association with either the treatment or the control group.

We introduce a second transformation of the NES that makes clearer the "direction" of the enrichment.

$logit2NES = \log_2\frac{NES}{1-NES} = \log_2 (odds) = \log_2 \frac{P\left[\mbox{the gene-set is associated with the treatment group}\right]}{P\left[\mbox{the gene-set is not associated with the treatment group}\right]}$

In this version of the NES, a logit2NES close to 0 means no association; a positive value (logit2NES > 0) measures the association of the gene-set with the treatment group, while a negative value (logit2NES < 0) points to the control group.
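
The two transformations of the NES can be illustrated with a minimal Python sketch. The function names `odds` and `logit2_nes` are hypothetical and chosen only for this example; the sketch assumes that the NES lies strictly between 0 and 1, so that both quantities are finite.

```python
import math


def odds(nes: float) -> float:
    """Odds NES / (1 - NES); assumes 0 <= NES < 1."""
    return nes / (1.0 - nes)


def logit2_nes(nes: float) -> float:
    """Base-2 logarithm of the odds; assumes 0 < NES < 1."""
    return math.log2(odds(nes))


# Illustrative NES values: below, at, and above the "no association" point 0.5.
for nes in (0.2, 0.5, 0.8):
    print(f"NES={nes:.1f}  odds={odds(nes):.3f}  logit2NES={logit2_nes(nes):+.3f}")
```

An NES of 0.5 maps to odds of 1 and a logit2NES of 0, while NES values of 0.8 and 0.2 map to logit2NES values of about +2 and -2, which makes the symmetry between the treatment and control directions explicit.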


The ordering of the enrichments

In the implementation of the enrichment procedure, we left the user free to order the results according to any of the measures available.
Generally, small gene-sets are more likely to get a higher NES and a lower p-value than large gene-sets. Given this empirical observation, we defined a variable that summarizes the relevance of a gene-set by taking into account three features: size, NES, and p-value. The relevance of a gene-set can be used to define a marginal ordering of the results.
Let us consider a two-sided test, so that significant gene-sets with logit2NES > 0 and with logit2NES < 0 should both be considered.
If $k'$ is the index of any gene-set in the collection such that $\mbox{logit2NES} > 0$, then

$\mbox{relevance}_{k'}^+ = \mbox{rank}\left(\mbox{actual_size}_{k'}\right) + \mbox{rank}\left(\mbox{logit2NES}_{k'}\right) + \mbox{rank}\left(1 - p\mbox{-value}_{k'}\right)$,

where the $\mbox{rank}\left(\cdot\right)$ function assigns the highest rank to the highest value of its argument. The variable $\mbox{actual_size}$ is the size of the gene-set restricted to the genes present in the gene-profile. For the subset of gene-sets such that $\mbox{logit2NES} < 0$, the relevance is

$\mbox{relevance}_{k''}^- = \mbox{rank}\left(\mbox{actual_size}_{k''}\right) + \mbox{rank}\left(-\mbox{logit2NES}_{k''}\right) + \mbox{rank}\left(1 - p\mbox{-value}_{k''}\right)$.

In the end, for gene-set$_k$, the variable $\mbox{relevance}_k$ takes its value from $\mbox{relevance}^+$ when $\mbox{logit2NES}_k > 0$, and from $\mbox{relevance}^-$ when $\mbox{logit2NES}_k < 0$.

When the test considers the "greater" alternative hypothesis, the $relevance$ is $relevance^+$. With the "less" alternative, the $relevance$ is $relevance^-$.

By default, the table of results is ordered according to the relevance.
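
The relevance for a two-sided test can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's implementation: the names `rank` and `relevance` are hypothetical, ties are given average ranks, ranks are computed separately within the logit2NES > 0 and logit2NES < 0 subsets, and gene-sets with logit2NES exactly 0 are left without a relevance value.

```python
def rank(values):
    """1-based average ranks; the largest value gets the largest rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def relevance(actual_size, logit2nes, p_value):
    """Sum of three ranks, computed within each sign of logit2NES (assumption)."""
    out = [None] * len(logit2nes)
    for sign in (+1, -1):
        idx = [i for i, v in enumerate(logit2nes) if sign * v > 0]
        r_size = rank([actual_size[i] for i in idx])
        r_nes = rank([sign * logit2nes[i] for i in idx])
        r_p = rank([1.0 - p_value[i] for i in idx])
        for j, i in enumerate(idx):
            out[i] = r_size[j] + r_nes[j] + r_p[j]
    return out


# Made-up values for four gene-sets, for illustration only.
sizes = [10, 50, 200, 30]
l2nes = [1.5, 0.4, -0.8, -2.0]
pvals = [0.001, 0.2, 0.05, 0.0001]
print(relevance(sizes, l2nes, pvals))
```

In this toy input, the first and last gene-sets obtain the highest relevance within their respective subsets because they combine the strongest logit2NES with the smallest p-value, even though they are the smaller gene-sets.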

The network-map

To plot the results of the enrichment analysis, we build a network where each ball/node is associated with a gene-set. Two nodes are connected when their similarity is above a cutoff value.

Two gene-sets are essentially two sets $A$ and $B$ of genes. We compute two indices of similarity:

1) the Jaccard index $\delta_0(A,B) = \frac{|A\cap B|}{|A\cup B|}$, and

2) the overlap index $\delta_1(A,B) = \frac{|A\cap B|}{\min\left(|A|, |B|\right)}$.

A convex combination of these two indices provides the final similarity used in the network map:

$\delta_{\epsilon}(A,B) = \epsilon\cdot \delta_1(A,B) + (1-\epsilon)\cdot \delta_0(A,B)$

where $\epsilon$ ranges from 0 to 1. When $\epsilon = 0$, $\delta_{\epsilon}\equiv \delta_0$, the Jaccard index; when $\epsilon = 1$, $\delta_{\epsilon}\equiv \delta_1$, the overlap index.
Empirically, we observed better results in terms of connection between the nodes when $\epsilon = 0.25$ (the suggested default value).
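
The similarity and the resulting network edges can be sketched in Python as follows. The function names, the gene identifiers, and the cutoff value of 0.1 are hypothetical and used only for illustration; the sketch assumes non-empty gene-sets represented as Python sets.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def overlap(a: set, b: set) -> float:
    """Overlap index: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))


def similarity(a: set, b: set, eps: float = 0.25) -> float:
    """Convex combination eps * overlap + (1 - eps) * Jaccard, with 0 <= eps <= 1."""
    return eps * overlap(a, b) + (1.0 - eps) * jaccard(a, b)


def edges(gene_sets: dict, cutoff: float, eps: float = 0.25) -> list:
    """Pairs of gene-sets whose similarity is above the cutoff, with that similarity."""
    out = []
    for (name_a, a), (name_b, b) in combinations(gene_sets.items(), 2):
        s = similarity(a, b, eps)
        if s > cutoff:
            out.append((name_a, name_b, s))
    return out


# Toy gene-sets with made-up gene identifiers.
gene_sets = {
    "S1": {"g1", "g2", "g3", "g4"},
    "S2": {"g3", "g4", "g5"},
    "S3": {"g6", "g7"},
}
print(edges(gene_sets, cutoff=0.1))
```

Here S1 and S2 share two genes, giving a Jaccard index of 0.4, an overlap index of 2/3, and a combined similarity of about 0.47 at $\epsilon = 0.25$; S3 shares no genes with the others and stays disconnected.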

The thickness of the segment connecting two nodes is proportional to the value of $\delta_{\epsilon}$.
The size of the ball/node is proportional to the size of the gene-set.
The color of the node, instead, reflects the intensity of the NES.

The Mann-Whitney-Wilcoxon (MWW) test

We tailor these brief notes on the Mann-Whitney test (1947) to its use as a test for enrichment.

The MW-test concerns the null hypothesis that there is no mutual dominance of the distribution functions $F_{in}(x)$ and $F_{out}(x)$ associated with the genes in the gene-set and the genes outside:
$\mathcal H_0: F_{out}(x) = F_{in}(x).$

The alternative hypothesis states that the distribution function $F_{out}(x)$ of the genes outside the gene-set dominates $F_{in}(x)$, i.e.

$\mathcal H_1: F_{out}(x)>F_{in}(x).$

Under the alternative hypothesis, the genes in the gene-set tend to have intensities $x^{in}_i$, $i = 1, 2, \ldots, m''$, higher than the intensities $x^{out}_j$, $j = 1, 2, \ldots, m'$, of the genes outside the gene-set. Here $m''$ is the size of the gene-set, and $m'+m''$ is the number of genes in the gene-profile.

The test statistic $U$ of the MW-test is the number of pairs $(i, j)$ for which the relation $x^{in}_i > x^{out}_j$ holds. In practice, $U$ is computed from the rank-sum statistic of Wilcoxon (1945).

According to Bamber (1975), the ratio $U/(m'm'')$ is an unbiased estimator of the probability $P\big[X_{in} > X_{out}\big]$, where $X_{in} \sim F_{in}(x)$ and $X_{out} \sim F_{out}(x)$. Given a gene-set, the event $X_{in} > X_{out}$ says that "a gene randomly drawn from the gene-set has an intensity greater than that of a second gene randomly sampled from outside the gene-set".
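
The relation between the pair count, the rank sum, and the estimate $U/(m'm'')$ can be illustrated with a minimal Python sketch. The function names and the intensity values are hypothetical, and the sketch assumes that there are no tied intensities; it is not the implementation used by the package.

```python
def u_by_counting(x_in, x_out):
    """U as the number of pairs (i, j) with x_in[i] > x_out[j]."""
    return sum(1 for xi in x_in for xj in x_out if xi > xj)


def u_by_rank_sum(x_in, x_out):
    """U from the Wilcoxon rank sum of the in-set values (no ties assumed)."""
    pooled = sorted(list(x_in) + list(x_out))
    # With no ties, the rank of a value is its 1-based position in the pooled sample.
    rank_of = {value: r for r, value in enumerate(pooled, start=1)}
    rank_sum_in = sum(rank_of[v] for v in x_in)
    m_in = len(x_in)
    return rank_sum_in - m_in * (m_in + 1) // 2


def nes_estimate(x_in, x_out):
    """Estimate of P[X_in > X_out] as U / (m' * m'')."""
    return u_by_counting(x_in, x_out) / (len(x_in) * len(x_out))


# Made-up intensities: 3 genes in the gene-set, 5 genes outside.
x_in = [2.1, 0.7, 1.5]
x_out = [0.2, -0.3, 1.0, 0.9, -1.2]

print(u_by_counting(x_in, x_out), u_by_rank_sum(x_in, x_out))
print(nes_estimate(x_in, x_out))
```

In this toy example both routes give $U = 13$ out of $3 \times 5 = 15$ pairs, so the estimate is $13/15 \approx 0.87$. The rank-sum route avoids enumerating all $m'm''$ pairs, which matters when the gene-profile is large.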