Skip to content
Snippets Groups Projects
Commit da71bb8a authored by luppino's avatar luppino
Browse files

R Rmd file to run predictions and evaluate performance on 334 genes.

parent 2ae5c606
No related merge requests found
---
---
---
### Run DeMAG on new genes
In order to show that DeMAG is generalizable we run the model on a set of extra 334 genes that have at least 5 benign and 5 pathogenic high-quality (at least 2 review stars) variants in ClinVar (version 20220812).
For the features we used:
- precomputed EVmutation scores from the [webserver](<https://marks.hms.harvard.edu/evmutation/>)
- [IUPred2A](<https://iupred2a.elte.hu/>) disorder scores that we obtained with the command line application. To note here that for the UniProt id Q8NFD5, we specified Q8NFD5-1 because IUPred2A considered as canonical isoform the id Q8NFD5-5 (as correctly indicated in UniProt).
- pLDDT scores obtained from the AlphaFold pdb files. For protein longer than 1800 residues we assembled AlphaFold multiple models to produce an ensemble structure. Nevertheless, these models are highly unreliable. (AlphaFold model is missing for UniProt id Q9NZV5).
- Partners score that we designed based on 3D contacts between C-alpha atoms but without evolutionary coupled residues because it would have required to run an alignment pipeline not optimized for hundreds of genes.
- [PolyPhen-2](<http://genetics.bwh.harvard.edu/pph2/dokuwiki/downloads>) was used to annotate the remaining 9 features (Nsubs, Score2, Nres, Score1, phylop, dScore, NormASA, BaRE, DistQmin). Check [here](<http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_a>) for a description of the features.
```{r,eval=TRUE,warning=FALSE,include=TRUE,echo=TRUE,results='markup', echo=TRUE, warning=FALSE, include=TRUE, r,eval=TRUE, results='markup'}
library(data.table)
library(stringr)
library(gbm)
library(caret)
setwd("/Users/luppino/ownCloud/GitLab/DeMAG/demag_newgenes/")
new_genes = fread("/Users/luppino/ownCloud/GitLab/DeMAG/demag_newgenes/genes_at_least_5.tsv")
neut = fread("clinvar_20220812.neutral.clean.pph2.output")
dele = fread("clinvar_20220812.deleterious.clean.pph2.output")
neut$label1 = "neutral"
dele$label1 = "deleterious"
df = rbind(neut,dele)
df = df[acc%in%new_genes$From,]
# do not consider genes with length greater than 1800 residues because AF models are unreliable
df = df[length<=1800,]
## attach plddt
plddt = fread("/Users/luppino/ownCloud/GitLab/DeMAG/demag_newgenes/genes_at_least_5_plddt.txt",col.names = c("acc","pos","plddt"))
df = merge(df,plddt,by=c("acc","pos"),all.x=T)
## attach iupred
iupred = fread("/Users/luppino/ownCloud/GitLab/DeMAG/demag_newgenes/genes_at_least_5_iupred.txt",col.names = c("pos","aa1","iupred","acc"))
iupred$acc = str_extract(iupred$acc,"[^_]+")
df = merge(df,iupred,by=c("acc","pos","aa1"),all.x=T)
## attach partners score
partners = fread("/Users/luppino/ownCloud/Gitlab/DeMAG/demag_newgenes/genes_at_least_5_partners_score.txt")
df = merge(df,partners,by=c("pos","acc"),all.x=T)
## evmutation
evmut = fread("/Users/luppino/ownCloud/Gitlab/DeMAG/demag_newgenes/genes_at_least_5_evmut.txt")
evmut = evmut[!duplicated(evmut[,.(id,pos,wt,subs)]),]
df = merge(df,evmut[,.(id,pos,wt,subs,evm_score_v2=prediction_epistatic)],by.x=c("acc","pos","aa1","aa2"),by.y=c("id","pos","wt","subs"),all.x=T)
##load model (it is located in the models folder)
load("/Users/luppino/ownCloud/PhD_results/R workspace/gbm.RData")
## try to exclude some features to check drop in performance
#df$partners_score = NA
#df$evm_score_v2 = NA
## DeMAG prediction
cl.gbm = predict(final.gbm,newdata=df,n.trees=final_bestTree,type="response")
variants = df[,"DeMAG_prediction":=cl.gbm]
variants = variants[,"DeMAG":=ifelse(DeMAG_prediction>=0.5,"deleterious","neutral")]
caret::confusionMatrix(as.factor(variants$DeMAG),as.factor(variants$label1),positive="deleterious")
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment