ngs_tools issues
https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues
2021-04-08T13:38:56Z

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/103
gsea: fixing expression explorer
2021-04-08T13:38:56Z
gohr

The EE uses the package shinyjqui. This package changed its data structures, which caused the EE to stop working properly with the most recent version of the package, 0.4.0. I searched for the affected code and adapted the EE so that it runs with all versions of shinyjqui up to v0.4.0. I've added expression_explorer.yml with a minimal Conda env to run the EE. On top of that, I've changed the EE so that it no longer depends on devtools or additional SCF internal scripts, which includes loading libraries with library statements rather than load_pack statements.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/102
gsea: improvement of description
2021-04-08T13:39:07Z
gohr

The current description is misleading as to where up- and down-regulated genes are in the sorted list of analyzed genes. It's true that genes are sorted by diff. expression from down-regulated to up-regulated genes, but GSEA inverts this list, so the output results (plots) have the up-regulated genes at the beginning and the down-regulated genes at the end.
I've changed the GSEA description to be clearer on this aspect. I also shortened it to be more precise.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/101
GSEA: improve color code of KEGG pathways
2021-03-23T15:34:26Z
gohr

KEGG pathways are overlaid with a color code that should represent differential gene expression. This code is flawed because some boxes/entities are associated with several genes, and the abs.max over these genes was color-encoded without indicating that several genes are involved or which gene's expression the color displays. The new version of cp_enrichment improves on this and now makes clear
i) if there are several genes associated to a box/entity
ii) what is the diff. expression of all these genes
iii) color codes these diff. expressions in a coherent way so the user can interpret the color codes.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/100
exDesign: numeric vs categorical variables and shrinkage method
2021-02-23T07:51:33Z
gohr

- DESeq2 treats numeric experimental variables as numeric variables, not as discrete variables
- hence: all discrete variables should have values that are not numbers, e.g. litter1, litter2, litter3 instead of 1, 2, 3
- DESeq2: one should go through these steps:
1. contrast_oe <- c("sampletype", "MOV10_overexpression", "control")
2. res_tableOE_unshrunken <- results(dds, contrast=contrast_oe, alpha = 0.05)
3. res_tableOE <- lfcShrink(dds, contrast=contrast_oe, res=res_tableOE_unshrunken)
This makes it possible to use approaches for shrinking the logFC other than the DESeq2 standard approach. See:
> What you observe is consistent with what we see in testing on the benchmarking data and on simulation data.
> If you just compare method="normal" to method="apeglm" or "ashr", the differences you are likely to see is
> that normal will shrink large effects even if they have high precision (so shrinking too much) and allow
> small effects to float around 0, while apeglm/ashr will not shrink the precise, large effects much at all and
> the small effects which are indistinguishable from 0 will be shrunk to 0.
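
The two points above can be sketched in R. The first part uses only base R and shows why a numerically coded `litter` column should be recoded before it enters the design; the sample sheet is made up for illustration. The second part is a function that is defined but not called here. It assumes a fitted `DESeqDataSet` named `dds` with a `sampletype` factor as in the steps above, and it picks `type = "ashr"` (which needs the ashr package) because `type = "apeglm"` is, to my knowledge, specified via `coef` rather than `contrast`.

```r
# Hypothetical sample sheet: litter is coded 1, 2, 3 and is therefore numeric.
col_data <- data.frame(
  sample     = paste0("s", 1:6),
  sampletype = rep(c("control", "MOV10_overexpression"), each = 3),
  litter     = rep(1:3, times = 2)
)

# As a number, litter contributes a single slope term (intercept + slope).
n_numeric <- ncol(model.matrix(~ litter, col_data))

# Recode as a labelled factor so that each litter becomes its own level
# (intercept + one column per non-reference level).
col_data$litter <- factor(paste0("litter", col_data$litter))
n_factor <- ncol(model.matrix(~ litter, col_data))

# Sketch only, not run here: shrink the logFC with an alternative estimator.
# Assumes DESeq2 and ashr are installed and that DESeq() was already run on dds.
shrink_oe <- function(dds, alpha = 0.05) {
  contrast_oe <- c("sampletype", "MOV10_overexpression", "control")
  res_unshrunken <- DESeq2::results(dds, contrast = contrast_oe, alpha = alpha)
  DESeq2::lfcShrink(dds, contrast = contrast_oe, res = res_unshrunken, type = "ashr")
}
```

Comparing `n_numeric` with `n_factor` shows how the recoding changes the number of model terms.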
Papers show that these other two approaches are more effective.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/99
gsea: better description
2020-11-19T15:10:43Z
domingue

I got feedback that the plots are not very intuitive. We need to add better explanations to the report.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/98
Gene enrichment: script selects random genes if list too large
2020-10-07T14:53:05Z
domingue

The `cp_enr` function randomly samples genes if the list is larger than 1500 ([this bit](https://git.mpi-cbg.de/bioinfo/ngs_tools/-/blob/master/common/cp_utils.R#L123-125)).
Replace this function in the script until we start using the `corescf` [package](https://git.mpi-cbg.de/scicomp/bioinfo_team/corescf/).

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/97
gsea: improvements
2020-10-13T06:42:05Z
domingue

A few things to add:
- [x] include a table with the number of Entrez IDs per gene set at the beginning of the report. This would also help to see directly why some of the gene sets were not tested (i.e. because of too few genes) and could help to adjust the settings accordingly.
- [x] add how many genes we miss because there was no corresponding Entrez ID found. We did something similar in cp_enrichment.R script where we give the percentage of ‘lost’ genes.
Some issues to fix:
- [ ] when a single gene list is analysed and it has more genes than `--maxSize` or fewer genes than `--minSize`, it will fail without a meaningful error. Add an error message so that it fails gracefully.
- [ ] related, add a table with the gene lists analysed, number of genes per list, and whether they pass the thresholds.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/96
gsea: generate results for each contrast
2020-09-23T12:06:53Z
domingue

Right now all genes are taken into account because I had only a single-contrast experiment, but this is not always the case.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/95
gsea: bug, in extra sets the first line is being read as header
2020-09-15T15:13:36Z
domingue

This leads to one set missing from the analysis.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/94
gsea: bug, hard-coded species
2020-09-15T15:04:24Z
domingue

The mouse database (org.Mm.eg.db) is hard-coded when converting Ensembl gene IDs into Entrez IDs in the gsea.R script.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/93
GSEA: add option to use custom gene sets
2020-09-09T10:13:36Z
domingue

Sometimes researchers will come with lists of genes that were taken from a paper and are not listed in the mSigDB.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/92
GSEA
2020-09-23T13:41:04Z
domingue

Until now GSEA has been done by @henry using API calls and it was not super efficient. While working on a [project](s://git.mpi-cbg.de/scicomp/bioinfo_team/alexaki_rnaseq_deg) I found out that this could be done exclusively with R packages:
- msigdb, contains the annotations
- fgsea, does the enrichment analysis
Code for testing in: https://git.mpi-cbg.de/domingue/test_gsea

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/91
dge_workflow: expression_explorer app failed to load due to renamed/additional annotation columns
2020-07-10T10:23:01Z
herseman

**Issues**:
- at least for the igenome `Homo_sapiens/Ensembl_v99` (*others were not tested*) running `featcounts_deseq_mf.R` with the `--gtf` flag results in empty gene descriptions which have to be manually added; however, adding the information from biomaRt ensembl results in the column 'description' instead of 'gene_description' if not manually changed and this leads to issues with the `expression_explorer` app which assumes the 'gene_description' but not the 'description' column
- annotation columns (e.g. domain prediction) which are additionally added to the dge results are not taken into account when the columns for further data summarization are selected in the `gather` functions

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/90
argparser error when --bam_files is the last optional argument
2020-07-01T14:58:02Z
domingue

Lena reported an error occurring when the `--bam_files` option was placed as the last optional argument of `genic_counts.R`.
I will investigate.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/89
general: create new human igenome for the Ensembl release 99
2020-06-24T14:01:01Z
herseman

further information: http://www.ensembl.info/category/01-release/

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/88
Stranded counts
2020-07-13T10:52:16Z
domingue

Change the function `dge_star_counts2matrix` to extract read counts based on the library stranding.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/87
Move workflows to separate repositories
2020-03-20T08:32:41Z
domingue

The goal is to keep the NGS tools repo tidy and focused on bulk NGS workflows. We already started the process by having the single cell workflow in a separate repo, but the ms_workflow is still here.
Ideally we should be able to:
1. copy the ms_workflow in its current state to a separate repo for further development
2. the commits and versioning history should be transferred as well
3. the current ms_workflow stays in ngs_tools to avoid breaking projects.

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/86
ms_workflow: ms_ms_prop and reorder information are missing for protein IDs without fasta_header information
2020-03-17T15:07:56Z
herseman

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/85
Hardcoded pvalue in plot
2020-04-15T10:41:53Z
domingue

In the MA plot section "MA and Volcano plots" the description reads:
> Each gene is represented with a dot. Genes with an adjusted p value below a certain threshold are shown in cyan (True)
However, the code for adding color uses `pvalue` instead of `padj` and a hardcoded value of `0.05`:
https://git.mpi-cbg.de/bioinfo/ngs_tools/blob/master/dge_workflow/featcounts_deseq_mf.R#L576
```r
deResults %>% ggplot(aes(0.5 * log2(mean_norm_count_1 * mean_norm_count_2), log2(mean_norm_count_2 / mean_norm_count_1), color = pvalue < 0.05)) +
geom_point(alpha = 0.1) +
geom_hline(yintercept = 0, color = "red") +
facet_grid(condition_1 ~ condition_2)
```
![pvalue_maplot](/uploads/86b9de6b7faebb9ee3c7c3c8caa59200/pvalue_maplot.png)
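
One way to derive the colour flag from the cut-off arguments instead of a literal `0.05` is sketched below in base R. The helper name `flag_hits` and the toy table are hypothetical; only `pvalue`, `padj`, `qcutoff`, `pcutoff` and `is_hit` are names from this report. The rule that `qcutoff` takes precedence over `pcutoff`, and the use of `<=`, are assumptions.

```r
# Hypothetical helper: flag genes using the user-supplied cut-off.
# Assumption: if qcutoff is given it is applied to padj,
# otherwise pcutoff is applied to pvalue.
flag_hits <- function(de_results, qcutoff = NULL, pcutoff = NULL) {
  if (!is.null(qcutoff)) {
    hit <- de_results$padj <= qcutoff
  } else if (!is.null(pcutoff)) {
    hit <- de_results$pvalue <= pcutoff
  } else {
    stop("Provide either qcutoff or pcutoff.")
  }
  # Genes without a p value (NA) are treated as non-hits.
  de_results$is_hit <- !is.na(hit) & hit
  de_results
}

# Toy data for illustration only.
toy <- data.frame(
  gene   = c("g1", "g2", "g3"),
  pvalue = c(0.001, 0.04, NA),
  padj   = c(0.01, 0.20, NA)
)
flagged <- flag_hits(toy, qcutoff = 0.05)
```

With such a column in place, the plot call could map `color = is_hit` rather than `color = pvalue < 0.05`.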
**Expected behaviour**
Genes coloured by `is_hit`, which reflects the cut-off used in arguments (`qcutoff` or `pcutoff`).

https://git.mpi-cbg.de/bioinfo/ngs_tools/-/issues/84
dge_workflow: collect_kallisto_data.R fails for paired-end data
2020-03-12T10:54:15Z
herseman

- parsing the kallisto.log files failed because the two fastq files are not listed on one line
- additionally, it makes more sense to provide a list of kallisto output folders as an argument to the script instead of taking all subfolders in the current working directory, which causes issues if other subfolders, e.g. from `multiqc`, are present as well
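
The second point could look roughly like the base R sketch below. The helper name `select_kallisto_dirs` is hypothetical, and using the presence of `abundance.tsv` to recognise a kallisto output folder is an assumption about how the script might validate its input, not a description of `collect_kallisto_data.R`.

```r
# Hypothetical helper: keep only the folders that look like kallisto output.
# Assumption: a kallisto run folder contains abundance.tsv.
select_kallisto_dirs <- function(dirs) {
  is_kallisto <- file.exists(file.path(dirs, "abundance.tsv"))
  if (any(!is_kallisto)) {
    warning("Skipping folders without abundance.tsv: ",
            paste(dirs[!is_kallisto], collapse = ", "))
  }
  dirs[is_kallisto]
}

# In a script, the folders would come from the command line, e.g.
#   Rscript collect_kallisto_data.R sampleA sampleB
kallisto_dirs <- select_kallisto_dirs(commandArgs(trailingOnly = TRUE))
```

Passing the folders explicitly means that unrelated subfolders such as those written by `multiqc` are never scanned, and any folder passed by mistake is reported instead of silently parsed.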