# The input files

To run the analysis, two kinds of input must be provided: 1) the gene-profile, and 2) one or more collections of gene-sets.

### Gene-profile

A gene-profile is a tab-separated .txt file in which the first column is the gene symbol and the second column is a numeric value (e.g. differential expression level). No particular ordering is required, as the tool automatically computes the rank associated with each gene symbol.
As an example, see this file.
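
The layout described above can be sketched in a few lines of Python. This is only an illustration, not the tool's own code: the gene symbols and values in `EXAMPLE_PROFILE` are made up, the function names are hypothetical, and the sketch assumes the file has no header line and no tied values.

```python
import csv
import io

# Hypothetical gene-profile content: gene symbol <TAB> numeric value.
EXAMPLE_PROFILE = "TP53\t2.31\nBRCA1\t-0.87\nMYC\t1.05\nEGFR\t-1.92\n"

def read_gene_profile(handle):
    """Return a dict {gene symbol: value} from a two-column tab-separated file."""
    profile = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        profile[row[0]] = float(row[1])
    return profile

def rank_genes(profile):
    """Give rank 1 to the smallest value; ties are not handled in this sketch."""
    ordered = sorted(profile, key=profile.get)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}

profile = read_gene_profile(io.StringIO(EXAMPLE_PROFILE))
ranks = rank_genes(profile)
print(ranks)
```

Because the ranks are derived from the values, the rows of the file can appear in any order.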

### Collection of gene-sets

A gene-sets collection is a file in the .gmt format (Gene Matrix Transposed file format) from the Broad Institute. A clear description of the format is available on this page, while an example is available from the Ma'ayan Laboratory.

The tool allows uploading multiple .gmt files. The source of each collection is shown in the result table.
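
As a rough illustration of the .gmt layout (one gene-set per line: name, description or link, then the gene symbols, all tab-separated), the following Python sketch parses a collection and records the file it came from. The content of `EXAMPLE_GMT`, the file name `example.gmt`, and the function name are hypothetical.

```python
import io

# Hypothetical .gmt content: name <TAB> description or link <TAB> gene1 <TAB> gene2 ...
EXAMPLE_GMT = (
    "SET_A\thttp://example.org/SET_A\tTP53\tMYC\tEGFR\n"
    "SET_B\tna\tBRCA1\tTP53\n"
)

def read_gmt(handle, collection):
    """Return {gene-set name: {"collection", "description", "genes"}}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # a valid line has a name, a description and at least one gene
        name, description, genes = fields[0], fields[1], fields[2:]
        gene_sets[name] = {
            "collection": collection,          # source file of the gene-set
            "description": description,        # often a link in Broad collections
            "genes": {g for g in genes if g},  # drop empty trailing fields
        }
    return gene_sets

gene_sets = read_gmt(io.StringIO(EXAMPLE_GMT), collection="example.gmt")
print(sorted(gene_sets))
```

Calling `read_gmt` once per uploaded file and merging the resulting dictionaries would keep track of the source of each gene-set.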

# Setting the parameters of the analysis

Three parameters can be set to run the analysis: 1) the alternative hypothesis, 2) the level of significance of the test with respect to the p-value or its corrected versions (Bonferroni and BH), and 3) the threshold value of logit2NES. The tool provides a default value for each parameter, which the user can change before running the analysis.

### The alternative hypothesis

The choice of the alternative hypothesis depends on the design behind the gene-profile. The tool offers three options:

- Greater (upper tail), connected to $\mathcal H_1: F_{out}(x) > F_{in}(x)$;
- Less (lower tail), connected to $\mathcal H_1: F_{out}(x) < F_{in}(x)$;
- Two sided (upper and lower tails), connected to $\mathcal H_1: F_{out}(x) \neq F_{in}(x)$.

When the scheme behind the gene-profile is treatment group versus control group, the greater alternative is the better choice; when the scheme is treatment group 1 versus treatment group 2, the two sided alternative is more appropriate.

The choice of the alternative hypothesis affects the computation of the p-value.
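
To make the effect of the tail concrete, here is a minimal Python sketch that turns a standardized test statistic `z` into a p-value under each alternative, using the normal distribution. It is an illustration only: the labels `"greater"`, `"less"` and `"two.sided"` are hypothetical names, and `z` is assumed to be positive when the genes in the set tend to have larger values than the genes outside it.

```python
import math

def upper_tail(z):
    """P[Z > z] for a standard normal Z, computed with the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def p_value(z, alternative):
    """P-value of a standardized statistic z under the chosen alternative."""
    if alternative == "greater":    # upper tail
        return upper_tail(z)
    if alternative == "less":       # lower tail
        return upper_tail(-z)
    if alternative == "two.sided":  # both tails
        return min(1.0, 2.0 * upper_tail(abs(z)))
    raise ValueError("unknown alternative: " + alternative)

for alt in ("greater", "less", "two.sided"):
    print(alt, round(p_value(1.5, alt), 4))
```

The same statistic can therefore be significant under one alternative and not under another, which is why the alternative should be fixed before looking at the results.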

### The level of significance of the test

The user can choose any value between 0 and 1 as the level of significance of the enrichment test. This value can be applied to the p-values, to the p-values adjusted according to the Benjamini-Hochberg rule (BH-value), or to the p-values adjusted according to Bonferroni's method (B-value).

This choice affects the tabular results displayed on the screen (and the network-map), while the downloadable tables always contain one row per gene-set.
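
The two adjustments can be sketched in Python as follows. This is the textbook form of the Bonferroni and Benjamini-Hochberg corrections, not the tool's own code; the p-values are made up, and whether the tool compares with `<` or `<=` is an assumption.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.02, 0.03, 0.5]   # hypothetical p-values of four gene-sets
b_values = bonferroni(p_values)
bh_values = benjamini_hochberg(p_values)

alpha = 0.05                           # level of significance chosen by the user
shown_bh = [i for i, q in enumerate(bh_values) if q <= alpha]
shown_b = [i for i, q in enumerate(b_values) if q <= alpha]
print(shown_bh, shown_b)
```

With these numbers the BH-value keeps three gene-sets on screen while the more conservative B-value keeps only one.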

### $logit2NES$ threshold value

The $logit2NES$ threshold selects only those enrichments having a minimum probability of being associated with the treatment group.

$logit2NES$ is defined as

$logit2NES = \log_2\frac{NES}{1-NES}$

which is a non-linear monotone transformation of the NES. The table below shows the correspondence between values of the NES, the odds, and the logit2NES. The default value is 0.9, meaning that we consider only those gene-sets having a probability greater than about 65% of being associated with the treatment group.

When the alternative hypothesis is "two sided" (treatment group no. one versus treatment group no. two), the threshold applies to

$abs\_logit2NES = |\log_2\frac{NES}{1-NES}|$.

With this version of the index we symmetrically select the gene-sets that have the same minimum probability of being associated either with treatment group no. one (when $logit2NES > 0$) or with treatment group no. two (when $logit2NES < 0$). As an example, if we require $abs\_logit2NES > 0.9$, we select the gene-sets with $NES > 0.65$, which are associated with treatment group no. one, and those with $NES < 0.35$. The latter have a probability $1 - NES > 0.65$ of being associated with treatment group no. two.

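| NES | odds | logit2NES |
|------|------|-----------|
| 0.2 | 0.25 | -2 |
| 0.3 | 0.43 | -1.23 |
| 0.4 | 0.67 | -0.58 |
| 0.5 | 1.0 | 0.0 |
| 0.6 | 1.5 | 0.58 |
| 0.65 | 1.86 | 0.90 |
| 0.75 | 3.0 | 1.58 |
| 0.9 | 9.0 | 3.17 |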
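
The transformation and the threshold rule can be reproduced with a short Python sketch. The function names and the `two_sided` flag are hypothetical; the formulas are the ones given above, and the sketch assumes $0 < NES < 1$ so that the odds are finite and positive.

```python
import math

def odds(nes):
    """Odds of the NES: NES / (1 - NES)."""
    return nes / (1.0 - nes)

def logit2nes(nes):
    """Base-2 logarithm of the odds."""
    return math.log2(odds(nes))

def passes_threshold(nes, threshold=0.9, two_sided=False):
    """Apply the threshold to logit2NES, or to its absolute value when two sided."""
    value = logit2nes(nes)
    return abs(value) > threshold if two_sided else value > threshold

for nes in (0.2, 0.5, 0.75, 0.9):
    print(nes, round(odds(nes), 2), round(logit2nes(nes), 2))

print(passes_threshold(0.7), passes_threshold(0.3), passes_threshold(0.3, two_sided=True))
```

A gene-set with $NES = 0.3$ is discarded by the one-sided rule but kept by the two-sided one, because its complement $1 - NES = 0.7$ exceeds 0.65.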

# Analysis results

When the analysis is completed, the results appear in a table below.
Some minimal statistics and the chosen parameters are shown at the top of the table.
The title of the table can be set before the run using the "caption" box.
The table shows those gene-sets whose statistics satisfy the constraints on the level of significance and the threshold value of $logit2NES$. Results can be exported as: 1) a comma separated value text file, 2) a tab separated value text file, and 3) an HTML file of the table. The two text files (1 and 2) contain the enrichments associated with every gene-set in the collection (no constraint is applied), while the HTML file reproduces the table shown on screen.

The table contains 11 columns for each gene-set (in the rows). The rows can be ordered according to any column by clicking on the column name.

| column name | description |
|-------------|-------------|
| gene-set | The gene-set name as it appears in the first column of the .gmt file. The name links to the description of the gene-set when a link is present in the second column of the .gmt file, as in the .gmt collections from the Broad Institute. |
| collection | The name of the file from which the gene-set comes. |
| size | The number of genes in the gene-set. |
| actualSize | The number of genes of the gene-set that are also present in the gene-profile. |
| NES | The Normalized Enrichment Score, that is $P\left[\mbox{the gene-set is associated with the treatment group}\right]$. |
| odds | The unbalance of the NES, i.e. $odds = \frac{NES}{1-NES}$ $=\frac{P\left[\mbox{the gene-set is associated with the treatment group}\right]}{P\left[\mbox{the gene-set is not associated with the treatment group}\right]}$ |
| logit2NES | The $logit$ transformation of the NES, i.e. $logit2NES = \log_2 (odds)$ $= \log_2\frac{NES}{1-NES}$ $=\log_2 \frac{P\left[\mbox{the gene-set is associated with the treatment group}\right]}{P\left[\mbox{the gene-set is not associated with the treatment group}\right]}$ |
| p‑value | The p-value of the Mann-Whitney test, computed using the Central Limit Theorem. |
| BH‑value | The p-value adjusted according to the Benjamini-Hochberg methodology. |
| B‑value | The p-value adjusted according to the Bonferroni methodology. |
| relevance | The ordering variable (see the statistical details for more info). |
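
The first few columns can be illustrated with a small Python sketch. It rests on an assumption this page does not state explicitly: that the NES is estimated as the Mann-Whitney U statistic divided by the number of (inside, outside) gene pairs, with ties counted as one half. The profile values, the gene-set and the function name `enrichment_row` are made up for the example.

```python
def enrichment_row(name, genes, profile):
    """Compute size, actualSize and an assumed NES estimate for one gene-set."""
    members = set(genes) & set(profile)           # genes found in the profile
    inside = [profile[g] for g in members]
    outside = [v for g, v in profile.items() if g not in members]
    if not inside or not outside:
        nes = None                                # NES is undefined in this case
    else:
        # U counts the (inside, outside) pairs where the inside value is larger;
        # ties count one half. Quadratic time, acceptable for a sketch.
        u = sum((x > y) + 0.5 * (x == y) for x in inside for y in outside)
        nes = u / (len(inside) * len(outside))
    return {"gene-set": name, "size": len(set(genes)),
            "actualSize": len(members), "NES": nes}

# Hypothetical gene-profile and gene-set.
profile = {"TP53": 2.31, "BRCA1": -0.87, "MYC": 1.05, "EGFR": -1.92, "KRAS": 0.10}
row = enrichment_row("SET_A", {"TP53", "KRAS", "FOXP3"}, profile)
print(row)
```

Here `FOXP3` is missing from the profile, so `size` is 3 while `actualSize` is 2, and only the two matched genes contribute to the score.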

# Visualizing the table of enrichments: the network map

By clicking on the "Network" tab, the user can visualize the gene-sets network built from the obtained results.
Two sliders control the network: 1) Gene-sets similarity metric, and 2) Similarity threshold.

### Gene-sets similarity metric

Similarity between gene-sets is computed as a convex combination of Jaccard's index and the overlap index. The leftmost position of the slider corresponds to pure Jaccard's index, while the rightmost position corresponds to pure overlap. Any intermediate position corresponds to a value of the $\epsilon$ parameter that controls the convex combination of the two similarities (see the statistical details for more info). The default value is 0.25.
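
A minimal Python sketch of such a convex combination follows. The parameterization is an assumption: `eps = 0` is taken to mean pure Jaccard and `eps = 1` pure overlap, and the overlap index is taken to be the size of the intersection divided by the size of the smaller set. The gene-sets are hypothetical and assumed to be non-empty.

```python
def jaccard(a, b):
    """Jaccard's index: |A and B| / |A or B|."""
    return len(a & b) / len(a | b)

def overlap(a, b):
    """Overlap index: |A and B| / min(|A|, |B|) (assumed definition)."""
    return len(a & b) / min(len(a), len(b))

def similarity(a, b, eps=0.25):
    """Convex combination: eps = 0 is pure Jaccard, eps = 1 is pure overlap."""
    return (1.0 - eps) * jaccard(a, b) + eps * overlap(a, b)

set_a = {"TP53", "MYC", "EGFR", "KRAS"}   # hypothetical gene-sets
set_b = {"TP53", "MYC"}

for eps in (0.0, 0.25, 1.0):
    print(eps, similarity(set_a, set_b, eps))
```

Since `set_b` is entirely contained in `set_a`, the overlap index is 1 while Jaccard's index is only 0.5, so moving the slider to the right pulls nested gene-sets closer together.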

### Similarity threshold

This slider controls the minimum similarity required for the balls representing two gene-sets to be connected by a segment. The default value is 0.35.
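
The role of the threshold can be sketched as an edge-list construction in Python. This is an illustration under assumptions: plain Jaccard's index stands in for the similarity metric to keep the example short, the comparison is taken to be `>=`, and the gene-sets and the function name `network_edges` are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard's index, used here as a stand-in similarity."""
    return len(a & b) / len(a | b)

def network_edges(gene_sets, threshold=0.35, similarity=jaccard):
    """Return (name, name, similarity) for every pair at or above the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        s = similarity(gene_sets[name_a], gene_sets[name_b])
        if s >= threshold:
            edges.append((name_a, name_b, round(s, 3)))
    return edges

# Hypothetical gene-sets that passed the significance and logit2NES filters.
gene_sets = {
    "SET_A": {"TP53", "MYC", "EGFR"},
    "SET_B": {"TP53", "MYC"},
    "SET_C": {"BRCA1"},
}
edges = network_edges(gene_sets)
print(edges)
```

Raising the threshold removes segments and leaves more isolated balls; lowering it produces a denser network.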

# Massive gene-sets test in R

The algorithm has been implemented as an R package, available at https://CRAN.R-project.org/package=massiveGST, where a vignette shows how to use the functions.

# Supported browsers

The latest versions of all major browsers are supported (Safari, Chrome, Firefox, and IE).
IE does not support the network-graph.