ngs_tools issues

gsea: fixing expression explorer

2021-04-08T13:38:56Z

The expression explorer (EE) uses the package shinyjqui. This package changed its data structures, which caused the EE to stop working properly with the most recent version of the package, 0.4.0. I searched for the affected code and adapted the EE so that it runs with all versions of shinyjqui up to v0.4.0. I've added expression_explorer.yml with a minimal Conda env to run the EE. On top of that, I've changed the EE so that it no longer depends on devtools or on additional SCF internal scripts; this includes loading libraries with library statements rather than load_pack statements.

gsea: improvement of description

2021-04-08T13:39:07Z

The current description is misleading about where up- and down-regulated genes sit in the sorted list of analysed genes. It's true that genes are sorted by differential expression from down-regulated to up-regulated, but the GSEA reverses this list, so the output results (plots) have the up-regulated genes at the beginning and the down-regulated genes at the end. I've changed the GSEA description to be clearer on this aspect. I also shortened it to be more precise.

GSEA: improve color code of KEGG pathways

2021-03-23T15:34:26Z

KEGG pathways are overlaid with a color code that should represent differential gene expression. This code is flawed because some boxes/entities are associated with several genes: the absolute maximum over those genes was color-encoded, with no indication that several genes are involved or which gene's expression the color shows. The new version of cp_enrichment improves on this. It now i) makes clear whether several genes are associated with a box/entity, ii) shows the differential expression of each of these genes, and iii) color-codes these differential expressions coherently so that the user can interpret them.

exDesign: numeric vs categorical variables and shrinkage method

2021-02-23T07:51:33Z

- DESeq2 treats numeric experimental variables as numeric variables, not as discrete variables. Hence, all discrete variables should have values that are not numbers, e.g. litter1, litter2, litter3 instead of 1, 2, 3.
- DESeq2: one should go through these steps:
  1. `contrast_oe <- c("sampletype", "MOV10_overexpression", "control")`
  2. `res_tableOE_unshrunken <- results(dds, contrast=contrast_oe, alpha = 0.05)`
  3. `res_tableOE <- lfcShrink(dds, contrast=contrast_oe, res=res_tableOE_unshrunken)`

  This allows using approaches for shrinking the logFC other than the DESeq2 standard approach. See:

  > What you observe is consistent with what we see in testing on the benchmarking data and on simulation data. If you just compare method="normal" to method="apeglm" or "ashr", the differences you are likely to see is that normal will shrink large effects even if they have high precision (so shrinking too much) and allow small effects to float around 0, while apeglm/ashr will not shrink the precise, large effects much at all and the small effects which are indistinguishable from 0 will be shrunk to 0.

  Papers show that these two other approaches (apeglm and ashr) are more effective.
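The first point can be illustrated with base R alone. The following is a minimal sketch, assuming a hypothetical sample sheet with a `litter` column coded as 1, 2, 3. It uses `model.matrix` to show how a design formula expands a numeric versus a categorical variable; the column names and design formula are made up for illustration.

```r
# Hypothetical sample sheet: `litter` is coded with plain numbers.
coldata <- data.frame(
  sampletype = factor(
    rep(c("control", "MOV10_overexpression"), times = 3),
    levels = c("control", "MOV10_overexpression")
  ),
  litter = rep(1:3, each = 2)
)

# Numeric coding: a single `litter` column, i.e. litter is modelled as a linear trend.
mm_numeric <- model.matrix(~ litter + sampletype, data = coldata)

# Relabel as litter1, litter2, litter3 so the variable becomes categorical.
coldata$litter <- factor(paste0("litter", coldata$litter))
mm_factor <- model.matrix(~ litter + sampletype, data = coldata)

colnames(mm_numeric)  # one column for litter
colnames(mm_factor)   # one indicator column per non-reference litter
```

With the numeric coding, litter 3 is forced to have twice the effect of litter 2 relative to litter 1, which is rarely what is meant for a batch-like variable.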

gsea: better description

2020-11-19T15:10:43Z

I got feedback that the plots are not very intuitive. We need to add better explanations to the report.

Gene enrichment: script selects random genes if list too large

2020-10-07T14:53:05Z

The `cp_enr` function randomly samples genes if the list contains more than 1500 genes ([this bit](https://git.mpi-cbg.de/bioinfo/ngs_tools/-/blob/master/common/cp_utils.R#L123-125)). Replace this function in the script until we start using the `corescf` [package](https://git.mpi-cbg.de/scicomp/bioinfo_team/corescf/).

gsea: improvements

2020-10-13T06:42:05Z

A few things to add:

- [x] include a table with the number of Entrez IDs per gene set at the beginning of the report. This would also help to see directly why some of the gene sets were not tested (i.e. because of too few genes) and could help to adjust the settings accordingly.
- [x] add how many genes we miss because no corresponding Entrez ID was found. We did something similar in the cp_enrichment.R script, where we give the percentage of ‘lost’ genes.

Some issues to fix:

- [ ] when a single gene list is analysed and it has more genes than `--maxSize` or fewer than `--minSize`, it fails without a meaningful error. Add an error message to fail gracefully.
- [ ] related: add a table with the gene lists analysed, the number of genes per list, and whether they pass the thresholds.
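The two open items could be covered by one small check. This is a sketch only: `check_gene_lists` is a hypothetical helper, and `min_size` / `max_size` stand in for the values passed via `--minSize` and `--maxSize`.

```r
# Hypothetical helper: summarise gene lists against the size thresholds.
# `gene_lists` is a named list of character vectors of gene IDs.
check_gene_lists <- function(gene_lists, min_size, max_size) {
  n_genes <- lengths(gene_lists)
  summary_tbl <- data.frame(
    gene_list = names(gene_lists),
    n_genes   = unname(n_genes),
    passes    = unname(n_genes >= min_size & n_genes <= max_size)
  )
  # Fail with an explicit message when nothing can be tested.
  if (!any(summary_tbl$passes)) {
    stop(sprintf(
      "No gene list has between %s and %s genes (--minSize/--maxSize); sizes found: %s",
      min_size, max_size, paste(summary_tbl$n_genes, collapse = ", ")
    ))
  }
  summary_tbl
}

lists <- list(small = c("g1", "g2"), ok = paste0("g", 1:20))
check_gene_lists(lists, min_size = 15, max_size = 500)
```

The returned data frame can be printed in the report as the table of analysed lists, while the `stop()` call replaces the uninformative failure for the single-list case.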

gsea: generate results for each contrast

2020-09-23T12:06:53Z

Right now all genes are taken into account because I only had a single-contrast experiment, but this is not always the case.

gsea: bug, in extra sets the first line is being read as header

2020-09-15T15:13:36Z

gsea: bug, in extra sets the first line is being read as header
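The effect can be reproduced with base R. This is a sketch under assumptions: the two-column layout (gene set name, gene ID) and the column names are made up, and the way the gsea script actually reads the file may differ.

```r
# Hypothetical extra gene set file without a header line: set name, then gene ID.
extra_sets_file <- tempfile(fileext = ".tsv")
writeLines(
  c("my_set\tGeneA", "my_set\tGeneB", "other_set\tGeneC"),
  extra_sets_file
)

# header = TRUE consumes the first record as column names, so one gene is lost.
with_header <- read.delim(extra_sets_file, header = TRUE)

# header = FALSE keeps every line; column names are supplied explicitly.
without_header <- read.delim(
  extra_sets_file,
  header = FALSE,
  col.names = c("gene_set", "gene"),
  stringsAsFactors = FALSE
)

nrow(with_header)     # 2
nrow(without_header)  # 3
```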

gsea: bug, hard-coded species

2020-09-15T15:04:24Z

The mouse database (org.Mm.eg.db) is hard-coded when converting Ensembl gene IDs into Entrez IDs in the gsea.R script.
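One possible way to remove the hard-coding is a lookup from a species argument to the annotation package name. This is a sketch only: `orgdb_for_species` and the accepted species labels are hypothetical, while the package names are the standard Bioconductor OrgDb packages.

```r
# Hypothetical lookup from a species argument to the Bioconductor annotation package.
orgdb_for_species <- function(species) {
  known <- c(
    mouse     = "org.Mm.eg.db",
    human     = "org.Hs.eg.db",
    zebrafish = "org.Dr.eg.db"
  )
  if (!species %in% names(known)) {
    stop(
      "Unsupported species '", species, "'; expected one of: ",
      paste(names(known), collapse = ", ")
    )
  }
  unname(known[species])
}

orgdb_for_species("human")
```

The returned name could then be loaded with `library(pkg, character.only = TRUE)` in place of the fixed mouse database.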

GSEA: add option to use custom gene sets

2020-09-09T10:13:36Z

Sometimes researchers come with lists of genes that were taken from a paper and are not listed in MSigDB.

GSEA

2020-09-23T13:41:04Z

Until now GSEA has been done by @henry using API calls and it was not super efficient. While working on a [project](https://git.mpi-cbg.de/scicomp/bioinfo_team/alexaki_rnaseq_deg) I found out that this could be done exclusively with R packages:

- msigdb, contains the annotations
- fgsea, does the enrichment analysis

Code for testing in: https://git.mpi-cbg.de/domingue/test_gsea

dge_workflow: expression_explorer app failed to load due to renamed/additiona...

2020-07-10T10:23:01Z

**Issues**:

- at least for the igenome `Homo_sapiens/Ensembl_v99` (*others were not tested*), running `featcounts_deseq_mf.R` with the `--gtf` flag results in empty gene descriptions, which have to be added manually. However, adding the information from biomaRt Ensembl yields a column named 'description' instead of 'gene_description' unless it is renamed manually, and this breaks the `expression_explorer` app, which expects the 'gene_description' column, not 'description'.
- annotation columns (e.g. domain prediction) that are additionally added to the dge results are not taken into account when the columns for further data summarization are selected in the `gather` functions.
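The column name mismatch in the first point could be handled defensively before the results reach the app. This is a minimal sketch; `harmonise_description` and the example columns are hypothetical.

```r
# Hypothetical helper: make sure the annotation column is called `gene_description`.
harmonise_description <- function(dge_results) {
  has_old <- "description" %in% names(dge_results)
  has_new <- "gene_description" %in% names(dge_results)
  # Rename only when the expected column is missing, to avoid duplicate names.
  if (has_old && !has_new) {
    names(dge_results)[names(dge_results) == "description"] <- "gene_description"
  }
  dge_results
}

dge <- data.frame(gene_id = "gene_1", description = "example description")
names(harmonise_description(dge))
```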

argparser error when --bam_files is the last optional argument

2020-07-01T14:58:02Z

Lena reported an error that occurs when the `--bam_files` option is placed as the last optional argument of `genic_counts.R`. I will investigate.

general: create new human igenome for the Ensembl release 99

2020-06-24T14:01:01Z

further information: http://www.ensembl.info/category/01-release/

Stranded counts

2020-07-13T10:52:16Z

Change the function `dge_star_counts2matrix` to extract read counts based on the library strandedness.
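A sketch of the column selection, assuming the function reads STAR's `ReadsPerGene.out.tab` files. In that file, column 2 holds unstranded counts, column 3 counts for read 1 on the RNA strand, and column 4 counts for read 2 on the RNA strand. The helper name and the strandedness labels are hypothetical.

```r
# Column of STAR's ReadsPerGene.out.tab to use for each library type.
# The argument values ("unstranded", "forward", "reverse") are hypothetical.
star_count_column <- function(strandedness) {
  switch(strandedness,
    unstranded = 2L,  # counts for unstranded libraries
    forward    = 3L,  # read 1 on the same strand as the RNA
    reverse    = 4L,  # read 2 on the same strand as the RNA
    stop("Unknown strandedness: ", strandedness)
  )
}

# Toy table in the four-column layout, including the N_* summary rows.
star_counts <- data.frame(
  V1 = c("N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous", "geneA", "geneB"),
  V2 = c(10, 5, 3, 1, 100, 50),
  V3 = c(10, 5, 60, 0, 4, 2),
  V4 = c(10, 5, 4, 1, 98, 49)
)

# Drop the summary rows, then pick the column matching the library type.
genes  <- star_counts[!grepl("^N_", star_counts$V1), ]
counts <- setNames(genes[[star_count_column("reverse")]], genes$V1)
counts
```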

Move workflows to separate repositories

2020-03-20T08:32:41Z

The goal is to keep the NGS tools repo tidy and focused on bulk NGS workflows. We already started the process by having the single cell workflow in a separate repo, but the ms_workflow is still here. Ideally we should be able to:

1. copy the ms_workflow in its current state to a separate repo for further development
2. transfer the commits and versioning history as well
3. keep the current ms_workflow in ngs_tools to avoid breaking projects.

ms_workflow: ms_ms_prop and reorder information are missing for protein IDs w...

2020-03-17T15:07:56Z

ms_workflow: ms_ms_prop and reorder information are missing for protein IDs without fasta_header information

Hardcoded pvalue in plot

2020-04-15T10:41:53Z

In the MA plot section "MA and Volcano plots" the description reads:

> Each gene is represented with a dot. Genes with an adjusted p value below a certain threshold are shown in cyan (True)

However, the code for adding color uses `pvalue` instead of `padj` and a hardcoded value of `0.05`: https://git.mpi-cbg.de/bioinfo/ngs_tools/blob/master/dge_workflow/featcounts_deseq_mf.R#L576

```r
deResults %>%
  ggplot(aes(0.5 * log2(mean_norm_count_1 * mean_norm_count_2), log2(mean_norm_count_2 / mean_norm_count_1), color = pvalue < 0.05)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, color = "red") +
  facet_grid(condition_1 ~ condition_2)
```

![pvalue_maplot](/uploads/86b9de6b7faebb9ee3c7c3c8caa59200/pvalue_maplot.png)

**Expected behaviour**

Genes coloured by `is_hit`, which reflects the cut-off used in arguments (`qcutoff` or `pcutoff`).
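To make the expected behaviour concrete, here is a hedged sketch of how a hit flag could follow the cut-off arguments. How the script actually computes `is_hit` is not shown in this issue, so `flag_hits` and its precedence rule (`qcutoff` first, then `pcutoff`) are assumptions.

```r
# Hypothetical reconstruction of the hit flag: use padj when a q-value cutoff is
# given, otherwise fall back to the raw p-value cutoff. NA values are never hits.
flag_hits <- function(de_results, qcutoff = NULL, pcutoff = NULL) {
  if (!is.null(qcutoff)) {
    de_results$is_hit <- !is.na(de_results$padj) & de_results$padj <= qcutoff
  } else if (!is.null(pcutoff)) {
    de_results$is_hit <- !is.na(de_results$pvalue) & de_results$pvalue <= pcutoff
  } else {
    stop("Provide either qcutoff or pcutoff")
  }
  de_results
}

de <- data.frame(pvalue = c(0.001, 0.04, 0.5), padj = c(0.01, 0.2, NA))
flag_hits(de, qcutoff = 0.05)$is_hit  # TRUE FALSE FALSE
flag_hits(de, pcutoff = 0.05)$is_hit  # TRUE TRUE FALSE
```

In the plot, `color = pvalue < 0.05` would then become `color = is_hit`, so the colouring matches whichever cut-off the user passed.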

dge_workflow: collect_kallisto_data.R fails for paired-end data

2020-03-12T10:54:15Z

- parsing the kallisto.log files failed because the two fastq files are not listed on one line
- additionally, it makes more sense to provide a list of kallisto output folders as an argument to the script instead of taking all subfolders in the current working directory, which causes issues when other subfolders, e.g. from `multiqc`, are present as well