ngs_tools issues

Hardcoded pvalue in plot

2020-04-15T10:41:53Z

In the "MA and Volcano plots" section, the description of the MA plot reads:

> Each gene is represented with a dot. Genes with an adjusted p value below a certain threshold are shown in cyan (True)

However, the code that sets the color uses `pvalue` instead of `padj`, with a hardcoded threshold of `0.05`: https://git.mpi-cbg.de/bioinfo/ngs_tools/blob/master/dge_workflow/featcounts_deseq_mf.R#L576

```r
deResults %>%
    ggplot(aes(0.5 * log2(mean_norm_count_1 * mean_norm_count_2), log2(mean_norm_count_2 / mean_norm_count_1), color = pvalue < 0.05)) +
    geom_point(alpha = 0.1) +
    geom_hline(yintercept = 0, color = "red") +
    facet_grid(condition_1 ~ condition_2)
```

![pvalue_maplot](/uploads/86b9de6b7faebb9ee3c7c3c8caa59200/pvalue_maplot.png)

**Expected behaviour**

Genes coloured by `is_hit`, which reflects the cut-off passed in the arguments (`qcutoff` or `pcutoff`).
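A minimal sketch of the expected behaviour, assuming `deResults` carries a logical `is_hit` column. The toy table, the `qcutoff` value and the way `is_hit` is derived here are illustrative assumptions, not the script's actual logic:

```r
library(ggplot2)

# Toy stand-in for the script's deResults table (all values are made up).
deResults <- data.frame(
    condition_1 = "ctrl",
    condition_2 = "treated",
    mean_norm_count_1 = c(10, 200, 50, 1000),
    mean_norm_count_2 = c(40, 210, 5, 990),
    padj = c(0.001, 0.8, 0.02, 0.95)
)

# Assumption: is_hit is derived once from the user-supplied cut-off,
# so the plot never needs its own threshold.
qcutoff <- 0.01
deResults$is_hit <- !is.na(deResults$padj) & deResults$padj < qcutoff

# Same MA plot as in the report, but coloured by is_hit.
ma_plot <- ggplot(
    deResults,
    aes(
        x = 0.5 * log2(mean_norm_count_1 * mean_norm_count_2),
        y = log2(mean_norm_count_2 / mean_norm_count_1),
        color = is_hit
    )
) +
    geom_point(alpha = 0.1) +
    geom_hline(yintercept = 0, color = "red") +
    facet_grid(condition_1 ~ condition_2)
```

Colouring by a precomputed column keeps the plot and the hit tables consistent when the cut-off arguments change.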

argparser error when --bam_files is the last optional argument

2020-07-01T14:58:02Z

Lena reported an error that occurs when the `--bam_files` option is placed as the last optional argument of `genic_counts.R`. I will investigate.

dge_workflow: expression_explorer app failed to load due to renamed/additiona...

2020-07-10T10:23:01Z

**Issues**:

- At least for the igenome `Homo_sapiens/Ensembl_v99` (*others were not tested*), running `featcounts_deseq_mf.R` with the `--gtf` flag results in empty gene descriptions, which have to be added manually. However, adding the information from biomaRt Ensembl produces a column named 'description' instead of 'gene_description' unless it is renamed manually, and this breaks the `expression_explorer` app, which expects the 'gene_description' column, not 'description'.
- Annotation columns (e.g. domain prediction) that are added to the dge results are not taken into account when the columns for further data summarization are selected in the `gather` functions.
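As a stopgap for the first point, the column could be renamed before the table reaches the app. This is only a sketch: the `annotation` table and its gene IDs are made up, and where such a step would live in the workflow is an open question.

```r
# Toy stand-in for an annotation table fetched from biomaRt (values are made up).
annotation <- data.frame(
    ensembl_gene_id = c("ENSG_example_1", "ENSG_example_2"),
    description = c("example gene A", "example gene B"),
    stringsAsFactors = FALSE
)

# Rename only when needed, so the step is safe to run more than once.
has_old <- "description" %in% names(annotation)
has_new <- "gene_description" %in% names(annotation)
if (has_old && !has_new) {
    names(annotation)[names(annotation) == "description"] <- "gene_description"
}
```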

Stranded counts

2020-07-13T10:52:16Z

Change the function `dge_star_counts2matrix` to extract read counts based on the library stranding
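One possible shape for this, assuming the function reads STAR `ReadsPerGene.out.tab` files (column 1 is the gene ID, column 2 the unstranded counts, columns 3 and 4 the counts for read 1 and read 2 aligned to the feature strand). The helper names and the strandedness labels below are hypothetical and not taken from `dge_star_counts2matrix`:

```r
# Hypothetical helper: map a strandedness label to the matching count column.
star_count_column <- function(strandedness = c("unstranded", "forward", "reverse")) {
    strandedness <- match.arg(strandedness)
    switch(strandedness, unstranded = 2L, forward = 3L, reverse = 4L)
}

# Hypothetical reader: returns a named count vector for one sample.
read_star_counts <- function(con, strandedness = "unstranded") {
    tab <- read.delim(con, header = FALSE, stringsAsFactors = FALSE)
    # Drop the summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous).
    tab <- tab[!grepl("^N_", tab[[1]]), ]
    counts <- tab[[star_count_column(strandedness)]]
    names(counts) <- tab[[1]]
    counts
}

# Toy file content (made-up numbers) for a reverse-stranded library.
toy <- paste(
    "N_unmapped\t5\t5\t5",
    "N_multimapping\t2\t2\t2",
    "N_noFeature\t1\t8\t3",
    "N_ambiguous\t0\t0\t0",
    "geneA\t10\t1\t9",
    "geneB\t20\t2\t18",
    sep = "\n"
)
reverse_counts <- read_star_counts(textConnection(toy), "reverse")
```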

dge_workflow: future improvements

2020-07-20T08:54:19Z

Whilst https://git.mpi-cbg.de/bioinfo/ngs_tools/blob/master/dge_workflow/featcounts_deseq_mf.R works well, it was developed a long time ago and some of the `DESeq2` functionality and best-practices changed. So did what we might want to add or remove from the script. In here we should list the things that we would like to change if we consider creating a new script for analysis.

GSEA: add option to use custom gene sets

2020-09-09T10:13:36Z

Sometimes researchers will come with lists of genes that were taken from a paper, and not listed in the mSigDB.

gsea: bug, hard-coded species

2020-09-15T15:04:24Z

The mouse database (org.Mm.eg.db) is hard-coded when converting ensembl gene IDs into entrez IDs in the gsea.R script.
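A possible fix is to pick the annotation package from a species argument. The sketch below is an assumption about how that could look: `orgdb_for_species` and `ensembl_to_entrez` are hypothetical names, and the list of supported species is illustrative.

```r
# Hypothetical lookup from a species label to a Bioconductor OrgDb package name.
orgdb_for_species <- function(species) {
    known <- c(mouse = "org.Mm.eg.db", human = "org.Hs.eg.db")
    key <- tolower(species)
    if (!key %in% names(known)) {
        stop(
            "Unsupported species '", species, "'. Supported: ",
            paste(names(known), collapse = ", "),
            call. = FALSE
        )
    }
    known[[key]]
}

# Hypothetical wrapper: convert Ensembl gene IDs to Entrez IDs for a species.
# Requires AnnotationDbi and the selected OrgDb package to be installed.
ensembl_to_entrez <- function(ensembl_ids, species) {
    pkg <- orgdb_for_species(species)
    orgdb <- getExportedValue(pkg, pkg)
    AnnotationDbi::mapIds(
        orgdb,
        keys = ensembl_ids,
        column = "ENTREZID",
        keytype = "ENSEMBL",
        multiVals = "first"
    )
}
```

An unsupported species then fails early with a readable message instead of silently using the mouse database.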

gsea: bug, in extra sets the first line is being read as header

2020-09-15T15:13:36Z

gsea: bug, in extra sets the first line is being read as header
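The likely fix is to read the file with `header = FALSE`. The file layout assumed here (tab-separated, one gene set name and one gene per line) and the column names are guesses for illustration only:

```r
# Toy content for an extra-sets file without a header line (made-up names).
toy_file <- paste(
    "my_set\tGeneA",
    "my_set\tGeneB",
    "other_set\tGeneC",
    sep = "\n"
)

# header = FALSE keeps the first line as data; col.names supplies the labels.
extra_sets <- read.delim(
    textConnection(toy_file),
    header = FALSE,
    col.names = c("gene_set", "gene"),
    stringsAsFactors = FALSE
)
```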

gsea: generate results for each contrast

2020-09-23T12:06:53Z

Right now all genes are taken into account because I only had a single-contrast experiment, but this is not always the case.
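A sketch of how the results could be split per contrast before ranking. The `condition_1`/`condition_2` columns follow the dge results table; the `log2fc` column name, the `_vs_` label and the toy values are assumptions:

```r
# Toy dge results with two contrasts (all values are made up).
deResults <- data.frame(
    condition_1 = c("ctrl", "ctrl", "ctrl", "ctrl"),
    condition_2 = c("treatA", "treatA", "treatB", "treatB"),
    ensembl_gene_id = c("g1", "g2", "g1", "g2"),
    log2fc = c(1.5, -0.3, -2.0, 0.7),  # hypothetical column name
    stringsAsFactors = FALSE
)

# One label per contrast, then one ranked statistic vector per label.
contrast_id <- paste(deResults$condition_1, deResults$condition_2, sep = "_vs_")
ranks_by_contrast <- lapply(split(deResults, contrast_id), function(tab) {
    stats <- setNames(tab$log2fc, tab$ensembl_gene_id)
    sort(stats, decreasing = TRUE)
})
```

Each element of `ranks_by_contrast` can then be passed to the enrichment step separately, giving one set of results per contrast.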

GSEA

2020-09-23T13:41:04Z

Until now GSEA has been done by @henry using API calls, and it was not super efficient. While working on a [project](https://git.mpi-cbg.de/scicomp/bioinfo_team/alexaki_rnaseq_deg) I found out that this could be done exclusively with R packages:

- msigdb, contains the annotations
- fgsea, does the enrichment analysis

Code for testing in: https://git.mpi-cbg.de/domingue/test_gsea
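For orientation, a minimal `fgsea` call looks roughly like the sketch below. The gene names, statistics and gene sets are random toy data, and the annotation lookup is left out on purpose because it depends on how the gene sets are retrieved:

```r
library(fgsea)

set.seed(42)

# Toy ranked statistics: one named numeric value per gene (made up).
stats <- setNames(rnorm(200), paste0("gene", 1:200))

# Toy gene sets as a named list of character vectors.
pathways <- list(
    set_up = names(sort(stats, decreasing = TRUE))[1:20],
    set_random = sample(names(stats), 20)
)

# minSize and maxSize drop gene sets outside the given size range.
res <- fgsea(pathways = pathways, stats = stats, minSize = 5, maxSize = 500)
```

Because `pathways` is a plain named list, gene sets from any source can be analysed once they are in this shape.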

Gene enrichment: script selects random genes if list too large

2020-10-07T14:53:05Z

The `cp_enr` function randomly samples genes if the list is larger than 1500 ([this bit](https://git.mpi-cbg.de/bioinfo/ngs_tools/-/blob/master/common/cp_utils.R#L123-125)). Replace this function in the script until we start using `corescf` [package](https://git.mpi-cbg.de/scicomp/bioinfo_team/corescf/)

gsea: improvements

2020-10-13T06:42:05Z

A few things to add:

- [x] include a table with the number of Entrez IDs per gene set at the beginning of the report. This would also help to directly see why some of the gene sets were not tested (i.e. because of too few genes) and could help to adjust the settings accordingly.
- [x] add how many genes we miss because no corresponding Entrez ID was found. We did something similar in the cp_enrichment.R script, where we give the percentage of ‘lost’ genes.

Some issues to fix:

- [ ] when a single gene list is analysed and it has more genes than `--maxSize` or fewer than `--minSize`, it will fail without a meaningful error. Add an error message to fail gracefully.
- [ ] related, add a table with the gene lists analysed, the number of genes per list, and whether they pass the thresholds.
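The two open items could be handled together, roughly as sketched here. The function names are hypothetical, and `min_size`/`max_size` stand in for the values of `--minSize` and `--maxSize`:

```r
# Hypothetical helper: one row per gene list with its size and threshold status.
summarise_gene_lists <- function(gene_lists, min_size, max_size) {
    n_genes <- lengths(gene_lists)
    data.frame(
        gene_list = names(gene_lists),
        n_genes = unname(n_genes),
        passes = unname(n_genes >= min_size & n_genes <= max_size),
        stringsAsFactors = FALSE
    )
}

# Hypothetical check: stop with a readable message if no list can be tested.
check_gene_lists <- function(gene_lists, min_size, max_size) {
    tab <- summarise_gene_lists(gene_lists, min_size, max_size)
    if (!any(tab$passes)) {
        stop(
            "No gene list has between ", min_size, " and ", max_size,
            " genes. Sizes: ",
            paste0(tab$gene_list, "=", tab$n_genes, collapse = ", "),
            ". Adjust --minSize/--maxSize.",
            call. = FALSE
        )
    }
    tab
}

# Toy gene lists (made-up names).
toy_lists <- list(small = paste0("g", 1:3), ok = paste0("g", 1:20))
size_table <- check_gene_lists(toy_lists, min_size = 15, max_size = 500)
```

The returned table can be printed in the report, and the error only fires when nothing is left to analyse.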

gsea: better description

2020-11-19T15:10:43Z

I got feedback that the plots are not very intuitive. We need to add better explanations to the report.

exDesign: numeric vs categorical variables and shrinkage method

2021-02-23T07:51:33Z

- DESeq2 treats numeric experimental variables as numeric (continuous) variables, not as discrete variables.
- Hence, all discrete variables should have values that are not numbers, e.g. litter1, litter2, litter3 instead of 1, 2, 3.
- DESeq2: one should go through these steps:
  1. `contrast_oe <- c("sampletype", "MOV10_overexpression", "control")`
  2. `res_tableOE_unshrunken <- results(dds, contrast=contrast_oe, alpha = 0.05)`
  3. `res_tableOE <- lfcShrink(dds, contrast=contrast_oe, res=res_tableOE_unshrunken)`

This allows using approaches for shrinking the logFC other than the DESeq2 standard approach. See:

> What you observe is consistent with what we see in testing on the benchmarking data and on simulation data.
> If you just compare method="normal" to method="apeglm" or "ashr", the differences you are likely to see is
> that normal will shrink large effects even if they have high precision (so shrinking too much) and allow
> small effects to float around 0, while apeglm/ashr will not shrink the precise, large effects much at all and
> the small effects which are indistinguishable from 0 will be shrunk to 0.

Papers show that these other two approaches are more effective.
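The points above can be sketched on DESeq2's simulated example data. The `litter` values are made up, and the example design uses `condition` with levels `A` and `B` rather than the `sampletype` contrast from the steps. Note that `type = "ashr"` accepts a `contrast` (and needs the ashr package installed), whereas `type = "apeglm"` requires `coef` instead:

```r
library(DESeq2)

# Numeric codes for a discrete variable are turned into labelled factor levels
# so they cannot be mistaken for a continuous covariate.
litter <- c(1, 2, 3, 1, 2, 3)
litter_factor <- factor(paste0("litter", litter))

# Simulated example dataset shipped with DESeq2 (design is ~ condition).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 6)
dds <- DESeq(dds)

# Step 1: define the contrast (factor, numerator level, denominator level).
contrast_ba <- c("condition", "B", "A")

# Step 2: unshrunken results at the chosen alpha.
res_unshrunken <- results(dds, contrast = contrast_ba, alpha = 0.05)

# Step 3: shrink the log fold changes with an explicit shrinkage type.
res_ashr <- lfcShrink(dds, contrast = contrast_ba, res = res_unshrunken, type = "ashr")
```

Setting `type` explicitly avoids depending on the default estimator, which differs between DESeq2 versions.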

GSEA: improve color code of KEGG pathways

2021-03-23T15:34:26Z

KEGG pathways are overlaid with a color code that should represent differential gene expression. This code is flawed because some boxes/entities are associated with several genes, and the abs. max over genes was color-encoded without indicating that several genes are involved or which gene's expression the color displays. The new version of cp_enrichment improves on this and now i) makes clear whether several genes are associated with a box/entity, ii) shows the diff. expression of all these genes, and iii) color-codes these diff. expressions in a coherent way so the user can interpret the color codes.

gsea: fixing expression explorer

2021-04-08T13:38:56Z

The EE uses the package shinyjqui. This package has changed its data structures, which caused the EE to not work properly with the most recent version, 0.4.0. Searching for the cause and adapting the EE to make it run with all versions of shinyjqui up to v0.4.0. I've added expression_explorer.yml with a minimal Conda env to run the EE. On top of that, I've changed the EE to no longer depend on devtools and additional SCF-internal scripts, which includes loading libraries with `library` statements rather than `load_pack` statements.

gsea: improvement of description

2021-04-08T13:39:07Z

The current description is misleading as to where up- and down-regulated genes are in the sorted list of analyzed genes. It's true that genes are sorted by diff. expression from down-regulated to up-regulated, but the GSEA reverses this list, so the output results (plots) have the up-regulated genes at the beginning and the down-regulated genes at the end. I've changed the GSEA description to be clearer on this aspect. I also shortened it to be more precise.