Specs for storing PubMed IDs with specific protein annotations
This document outlines the feature additions and modifications needed to store PubMed IDs and strong evidence for all protein-specific annotations in condensates. As of now, we only store a common set of PubMed IDs for the general membership of a protein in a condensate.
Database Schema Change
Referring to the current version of the schema.
Condensate Attributes concerned
protein_functional_type
protein_exp_evidence
protein_driver_criterion
Currently, these attributes are of data type dict, where the key is the UniProt ID of the member protein and the value is the respective annotation(s). For example, the value in protein_exp_evidence is currently a list of str, where each member is a type of experimental evidence (e.g. in_vivo, in_vitro, etc.).
The value in the dict needs to change: it should become a dict where it was a single str, and a list of dict where it was a list of str. This will allow us to store more contextual details along with the annotation. To handle the situation at hand, each inner dict should have two mandatory keys: 'annotation' and 'pubmed_ids'. For example, the functional_type annotation of a protein in a condensate would change from simply "driver" to {'annotation': "driver", 'pubmed_ids': ['31366629', '31366121']}.
For now, pubmed_ids in the dict will not hold any values. Hopefully, it will be curated over time using the values from protein_pubmed_ids (the list of all PubMed IDs supporting the membership of this protein in this condensate, which does not yet say anything about any specific annotation).
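As a rough illustration of the schema change, the conversion from the old value shape to the new one could look like the sketch below. The function names (wrap_annotation, migrate_attribute) are hypothetical, the UniProt ID in the usage note is only a placeholder, and representing "no PubMed IDs yet" as an empty list is an assumption rather than something the current schema prescribes.

```python
from typing import Any


def wrap_annotation(value: Any) -> dict[str, Any]:
    """Wrap a bare annotation string in the proposed dict shape."""
    if isinstance(value, dict):
        # Already in the new shape; leave it untouched so the
        # conversion can be re-run safely.
        return value
    if not isinstance(value, str):
        raise TypeError(f"unexpected annotation type: {type(value).__name__}")
    # Assumption: 'pubmed_ids' starts as an empty list until curated.
    return {"annotation": value, "pubmed_ids": []}


def migrate_attribute(old: dict[str, Any]) -> dict[str, Any]:
    """Convert {uniprot_id: str | list[str]} to the nested dict shape."""
    new: dict[str, Any] = {}
    for uniprot_id, value in old.items():
        if isinstance(value, list):
            # e.g. protein_exp_evidence: list of str -> list of dict
            new[uniprot_id] = [wrap_annotation(v) for v in value]
        else:
            # e.g. protein_functional_type: str -> dict
            new[uniprot_id] = wrap_annotation(value)
    return new
```

For instance, migrate_attribute({"P12345": ["in_vivo", "in_vitro"]}) would give each evidence entry its own 'annotation' and an empty 'pubmed_ids' list.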
API Change
The same data structure change flows through from the database to the API. The visual changes will be handled by the frontend. I realize that this is a breaking change and calls for a new version, but are we supporting API versions yet? I don't think we need versioning unless the API is powering another web app or a long-running service.
Frontend
Component concerned
Proteins table in condensate detail page. Columns:
- Role in Condensate (rendering protein_functional_type)
- Driver Criterion (rendering protein_driver_criterion)
- Experimental Evidence (rendering protein_exp_evidence)
With this change, we will be able to show PubMed IDs in brackets beside each annotation, which is how most other databases display PubMed IDs.
Example
Currently, we parse the protein_functional_type field and look up the UniProt ID of the current protein row among this dict's keys. If found, we show the value of this key. Since the value is now a dict, we have to display the 'annotation' of this inner dict and, if pubmed_ids is non-empty, join the list of pubmed_ids with a suitable separator and show it in brackets. If there are no pubmed_ids, we show no brackets.
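The display logic described above could be sketched as follows. This is written in Python only to stay consistent with the dict examples in this document; the actual frontend would port the same logic to its own language. The function names, the ", " separator between PubMed IDs and the "; " separator between multiple annotations are all assumptions.

```python
from typing import Any


def format_annotation(entry: dict[str, Any], separator: str = ", ") -> str:
    """Render one annotation, with PubMed IDs in brackets when present."""
    label = entry["annotation"]
    pubmed_ids = entry.get("pubmed_ids") or []
    if not pubmed_ids:
        # No PubMed IDs: show the annotation alone, without brackets.
        return label
    return f"{label} ({separator.join(pubmed_ids)})"


def render_cell(attribute: dict[str, Any], uniprot_id: str) -> str:
    """Render the table cell for one protein row."""
    value = attribute.get(uniprot_id)
    if value is None:
        # The protein has no annotation for this attribute.
        return ""
    # A single dict (e.g. functional_type) and a list of dicts
    # (e.g. exp_evidence) are handled the same way.
    entries = value if isinstance(value, list) else [value]
    return "; ".join(format_annotation(e) for e in entries)
```

With the sample value from the schema section, format_annotation would produce "driver (31366629, 31366121)", and plain "driver" when pubmed_ids is empty.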
Database Update (Sync)
While receiving update_items
from contributors from "Role in Condensates" (functional_type), "Driver Criterion" (driver_criterion) and "Experimental Evidence" (exp_evidence), we must also accept pubmed_ids
in another text input box along with annotation drop-down selection. This would ask for changes in the frontend input boxes. The rest of the part for the CMS frontend is described here dd-code-web#149
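One possible way to turn the contributor's free-text input into the pubmed_ids list is sketched below. It assumes the text box accepts IDs separated by commas, semicolons or whitespace, and that duplicates should be dropped; neither the input format nor the function names (parse_pubmed_ids, build_annotation_entry) are decided anywhere yet. The only fixed fact relied on is that PubMed IDs are plain positive integers.

```python
import re


def parse_pubmed_ids(raw: str) -> list[str]:
    """Split free-text input into a de-duplicated list of PubMed IDs."""
    tokens = [t for t in re.split(r"[,;\s]+", raw.strip()) if t]
    ids: list[str] = []
    for token in tokens:
        # PubMed IDs are numeric; reject anything else early.
        if not re.fullmatch(r"[0-9]+", token):
            raise ValueError(f"not a valid PubMed ID: {token!r}")
        if token not in ids:
            ids.append(token)
    return ids


def build_annotation_entry(annotation: str, raw_pubmed_ids: str) -> dict:
    """Combine the drop-down selection and the text box into one entry."""
    return {
        "annotation": annotation,
        "pubmed_ids": parse_pubmed_ids(raw_pubmed_ids),
    }
```

For example, build_annotation_entry("driver", "31366629, 31366121") yields the same shape as the sample entry in the schema section, and an empty text box yields an empty pubmed_ids list.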