Commit 447b42f4 authored by domingue's avatar domingue
Browse files

Added in depth explanations

parent 976dbd42
......@@ -161,8 +161,30 @@ category_exists <- function(category, species) {
TRUE
}
#
#' GSEA uses ranked genes based on a measure of each gene's differential expression with respect to the two phenotypes. In our case genes are ranked with the formula -log10(padj) multiplied by -1 if the gene is down-regulated (negative fold-change). This means that significant down-regulated genes will be at the top of the rank, and significantly up-regulated genes at the bottom. In the middle will be those with non-significant p-values. There other ways of ranking genes. Then the entire ranked list is used to assess how the genes of each gene set are distributed across the ranked list. To do this, GSEA walks down the ranked list of genes, increasing a running-sum statistic when a gene belongs to the set and decreasing it when the gene does not. A simplified example is shown in the following figure:
#'
#' ![figure](https://www.genepattern.org/uploaded/content_gseapic1.png)
#'
#' The enrichment score (ES) is the maximum deviation from zero encountered during that walk. The ES reflects the degree to which the genes in a gene set are overrepresented at the top or bottom of the entire ranked list of genes. A set that is not enriched will have its genes spread more or less uniformly through the ranked list. An enriched set, on the other hand, will have a larger portion of its genes at one or the other end of the ranked list. The extent of enrichment is captured mathematically as the ES statistic:
#'
#' ![figure](https://www.genepattern.org/uploaded/content_gseapic2.png)
#'
#' A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list. This is an example plot from our report:
#' ![gsea_example](https://git.mpi-cbg.de/bioinfo/ngs_tools/uploads/6cf3f282c707fcd0e00df0374b8b55d9/gsea_example.png)
#'
#' What is confusing is that at the top of the list are the down-regulated genes, and at the bottom the up-regulated, which is counter-intuitive.
#'
#' Next, GSEA estimates the statistical significance of the ES by a permutation test. To do this, GSEA creates a version of the data set with phenotype labels randomly scrambled, produces the corresponding ranked list, and recomputes the ES of the gene set for this permuted data set. GSEA repeats this many times (1000 is the default) and produces an empirical null distribution of ES scores. Alternatively, permutations may be generated by creating “random” gene sets (genes randomly selected from those in the expression dataset) of equal size to the gene set under analysis.
#' The nominal p-value estimates the statistical significance of a single gene set's enrichment score, based on the permutation-generated null distribution. The nominal p-value is the probability under the null distribution of obtaining an ES value that is as strong or stronger than that observed for your experiment under the permutation-generated null distribution.
#' Typically, GSEA is run with a large number of gene sets. For example, the MSigDB collection and subcollections each contain hundreds to thousands of gene sets. This has implications when comparing enrichment results for the many sets. The ES must be adjusted to account for differences in the gene set sizes and in correlations between gene sets and the expression data set. The resulting normalized enrichment scores (NES) allow you to compare the analysis results across gene sets. The nominal p-values need to be corrected to adjust for multiple hypothesis testing. For a large number of sets (rule of thumb: more than 30), we recommend paying attention to the False Discovery Rate (FDR) q-values: consider a set significantly enriched if its NES has an FDR q-value below 0.25.
#' https://www.genepattern.org/modules/docs/GSEA/14
# ==========================================================================
#' ## Read data and prepare variables
#' # Data
# ==========================================================================
if (category == "None" & is.na(extra_sets) & is.na(gmt_file)) {
stop("One of --category or --extra_sets|--gmt_file must be used.")
......@@ -245,11 +267,11 @@ ranks_l <- gsea_dat %>%
# ==========================================================================
#' ## GSEA
#' # GSEA
# ==========================================================================
#' The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L, in which case the gene set is correlated with the phenotypic class distinction.
#'
#' For this analysis we retrieved the hallmark gene sets from MSigDB for mouse. The MSigDB version used was `r packageVersion("msigdbr")`.
#' For this analysis we retrieved the hallmark gene sets from MSigDB for `r species`. The MSigDB version used was `r packageVersion("msigdbr")`.
#'
#' The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software.
#'
......@@ -313,9 +335,16 @@ fgseaRes_l <- ranks_l %>%
# ==========================================================================
#' For a gene set to be considered enriched we used a p-adjusted value cutoff of `r qcutoff`. The plots are for those sets that passed the cut-off.
#'
#' We can have an overview of the pathway enrichment by plotting the normalized enrichment score (NES) and p-adjusted values for each of the top enriched pathways. A positive NES indicates gene set enrichment at the top of the ranked list; a negative NES indicates gene set enrichment at the bottom of the ranked list.
#' We can have an overview of the pathway enrichment by plotting the normalized enrichment score (NES) and p-adjusted values for each of the top enriched pathways. A positive NES indicates gene set enrichment at the top of the ranked list; a negative NES indicates gene set enrichment at the bottom of the ranked list.
#' **Note:** At the top of the list (left of x-axis) are the down-regulated genes, and at the bottom the up-regulated (right of x-axis).
#' Reference: https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html
#'
#' Regarding the plots saved as png in the `figure` folder, they show the value of the ranking metric as you move down the list of ranked genes. So the x-axis is the metric used to rank the genes (-log10(padj) multiplied by -1 or 1 depending on the fold-change) and the y-axis is the gene rank. Only genes part of the pathway are shown. In essence this is a visually appellative plot but not very informative. Here is an example:
#'
#'![gsea_table_example](https://git.mpi-cbg.de/bioinfo/ngs_tools/uploads/5acc91e05974178323e9663233e6d035/gsea_table_example.png)
#'
#'The information contained in this plot is also in the final results table present in the report and in the results folder (`fgseaRes.txt`).
#'
#+ results='asis'
top_pathways_l <- list()
for (contrast in names(fgseaRes_l)){
......@@ -328,7 +357,7 @@ for (contrast in names(fgseaRes_l)){
cat("\nAfter data preparation ", length(ranks), "are used for analysis, from the original", length(unique(gene_df$ensembl_gene_id)), "\n")
## if too many pathways are enriched, only top 20 should be plotted in the report,otherwise it's a mess.
## if too many pathways are enriched, only top 20 should be plotted in the report,otherwise it's the report becomes too large. All plots are saved in the the folder `figure`.
if (length(enriched_pathways) > 20) {
msg <- paste0(length(enriched_pathways), " pathways found significantly enriched for a p-adjust value of ", qcutoff, ".In this report only the top 20 will be show for visualization reasons. Enrichment plots for all enriched pathways can be found in the folder figures/")
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment