with more text

7155f9eb · Melanie Schneider · b5c62ff0 · 7155f9eb
Commit 7155f9eb authored 9 years ago by Melanie Schneider
--- a/dge_workflow/featcounts_deseq.R
+++ b/dge_workflow/featcounts_deseq.R
@@ -46,8 +46,13 @@ qcutoff <- if(is.numeric(pcutoff)) NULL else as.numeric(opts$qcutoff)
 if(is.numeric(pcutoff)) opts$qcutoff <- NULL


+#' # Differential Expression Analysis
 ########################################################################################################################
-#' ## Data preparation
+#' ## Data Preparation
+
+#' The detection of differentially expressed genes is performed using the R package DEseq2, which requires a count matrix as input.
+#' The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.
+#' An important analysis question is the quantification and statistical inference of systematic changes between conditions, as compared to within-condition variability.

 #' Working Directory: `r getwd()`

@@ -65,12 +70,10 @@ countMatrix <- countData %>% column2rownames("ensembl_gene_id") %>% as.matrix()
 genesBefore <- nrow(countMatrix)
 countMatrix %<>% filterByExpression(opts$min_count)
 genesAfter <- nrow(countMatrix)
-#' Counts were filtered to only retain genes which had more that `r opts$min_count` alignments in at least one replicates. This reduced the number of genes from `r genesBefore` to `r genesAfter`.
+#' Counts were filtered to only retain genes which had more that `r opts$min_count` alignments in at least one replicate. This reduced the number of genes from `r genesBefore` to `r genesAfter`.


 ########################################################################################################################
-#' ## Data Preparation
-

 #' DESeq is an R package to analyse count data from high-throughput sequencing assays such as RNA-Seq and test for differential expression. The latest version, DESeq2, is used.
 #' DESeq uses a model based on the negative binomial distribution and offers the following features:
@@ -125,11 +128,10 @@ dds <- estimateSizeFactors(dds)

 #' ### Data Dispersion

-# TODO explain how this plot is useful
 #' Plot of the per-gene dispersion estimates together with the fitted mean-dispersion relationship.
 #' The dispersion plot shows how the estimates are shrunk from the gene wise values (black dots) toward the fitted estimates, with the final values used in testing being the blue dots.
 #' The dispersion can be understood as the square of the coefficient of biological variation.  So, if a gene's expression typically differs from replicate to replicate sample by 20%, this gene's dispersion is: .20^2 = .04.
-## The function estimateDispersions performs three steps. First, it estimates, for each gene and each (replicated) condition, a dispersion value, then, it fits, for each condition, a curve through the estimates. Finally, it assigns to each gene a dispersion value, using either the estimated or the fitted value.
+#' The estimation of dispersions performs three steps. First, it estimates, for each gene and each (replicated) condition, a dispersion value, then, it fits, for each condition, a curve through the estimates. Finally, it assigns to each gene a dispersion value, using either the estimated or the fitted value.
 #dds <- estimateDispersions(dds)
 plotDispEsts(dds, main="Dispersion plot")

@@ -261,6 +263,9 @@ normCounts <- counts(dds,normalized=TRUE) %>%
 ########################################################################################################################
 #' # MA and Volcano plots

+#' MA-plot: The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by log2 is shown on the x-axis (“M” for minus, because a log ratio is equal to log minus log, and “A” for average). Each gene is represented with a dot. Genes with an adjusted p value below a certain threshold are shown in cyan (True).
+#' This plot demonstrates that only genes with a large average normalized count contain sufficient information to yield a significant call.
+
 # deseq approach
 # plotMA(deResults, main="DESeq2", ylim=c(-2,2))

@@ -286,7 +291,12 @@ deResults %>% ggplot(aes(0.5*log2(norm_count_1*norm_count_2), log2(norm_count_2/

 #deResults %$% pvalue %>%  log10() %>% quantile(0.05, na.rm=T)

-#' For the volcano plot we have fold change on the X-axis and -log10 of the p-value on the y-axis.
+#' A volcano plot displays unstandardized signal (e.g. log-fold-change) against noise-adjusted/standardized signal (e.g. t-statistic or -log(10)(p-value) from the t-test).
+#' This scatter plot is used to quickly identify changes in large datasets composed of replicate data.
+#' Here we have log2-fold-change on the X-axis and -log10 of the p-value on the y-axis.
+#' This results in datapoints with low p-values (highly significant) appearing toward the top of the plot. For the x-axis the log of the fold-change is used so that changes in both directions (left and right) appear equidistant from the center.
+#' Therefore interesting points are those, which are found toward the top of the plot that are far to either the left- or the right-hand side. These represent values that display large magnitude fold changes (hence being left- or right- of center) as well as high statistical significance (hence being toward the top).
+
 ##Volcano plots
 hitCounts <- filter(deResults, is_hit) %>%
        count(sample_1, sample_2, s1_overex) %>%
@@ -461,6 +471,7 @@ maData %>% ggplot(aes(0.5*log2(norm_count_1*norm_count_2), log2(norm_count_1/nor
 multiplot(plotlist=maPlots, cols=min(2, length(maPlots)))


+#' 
 ##--------------------------------------------------------

 ## Useful link: https://gist.github.com/stephenturner/f60c1934405c127f09a6