
Specs for storing PubMed IDs with protein-specific tags

In this document, I outline the specification of the feature additions and modifications needed to store PubMed IDs and strong evidence for all protein-specific annotations in condensates. As of now, we only store a common set of PubMed IDs for general protein membership in a condensate.

Database Schema Change

Referring to current version of schema

Condensate Attributes concerned

  1. protein_functional_type
  2. protein_exp_evidence
  3. protein_driver_criterion

Currently, these attributes are of data type dict, where the key is the UniProt ID of the member protein and the value is the respective annotation(s). For example, the value for protein_exp_evidence is currently a list of str, where each member is one type of experimental evidence (e.g. in_vivo, in_vitro).

The value in the dict needs to change: where it was a single str it becomes a dict, and where it was a list of str it becomes a list of dict. This will allow us to store more contextual details along with the annotation. To handle the situation at hand, each inner dict should have two mandatory keys: 'annotation' and 'pubmed_ids'. For example, the functional_type annotation of a protein in a condensate would change from simply "driver" to {'annotation': "driver", 'pubmed_ids': ['31366629', '31366121']}
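The conversion described above could be sketched as follows. This is only a minimal illustration, not the actual migration script: the function names wrap_annotation and migrate_attribute are hypothetical, the UniProt IDs in the usage note are placeholders, and starting with an empty pubmed_ids list is an assumption about how un-curated entries would be represented.

```python
from typing import Any


def wrap_annotation(value: str) -> dict[str, Any]:
    """Wrap a bare annotation string in the new two-key structure."""
    # Assumption: un-curated entries start with an empty pubmed_ids list.
    return {"annotation": value, "pubmed_ids": []}


def migrate_attribute(attr: dict[str, Any]) -> dict[str, Any]:
    """Convert one condensate attribute (keyed by UniProt ID) to the new shape.

    str          -> dict
    list of str  -> list of dict
    Values that are already dicts are left untouched, so the function
    can safely be run more than once.
    """
    migrated: dict[str, Any] = {}
    for uniprot_id, value in attr.items():
        if isinstance(value, str):
            migrated[uniprot_id] = wrap_annotation(value)
        elif isinstance(value, list):
            migrated[uniprot_id] = [
                v if isinstance(v, dict) else wrap_annotation(v) for v in value
            ]
        else:
            migrated[uniprot_id] = value
    return migrated
```

For instance, migrate_attribute({"P04637": "driver"}) would give {"P04637": {"annotation": "driver", "pubmed_ids": []}}, and a protein_exp_evidence value such as ["in_vivo", "in_vitro"] would become a list of two such dicts.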

For now, pubmed_ids in the dict will be empty; hopefully, it will be curated over time using the values from protein_pubmed_ids (the list of all PubMed IDs supporting the membership of this protein in this condensate, which does not yet say anything about specific annotations).

API Change

The same data structure change flows through from the database. The visual changes will be done by the frontend. I realize that this is a breaking change and calls for a new version. But are we supporting API versions yet? I don't think we need that unless the API is powering another web app or long-running service.

Frontend

Component concerned

Proteins table in condensate detail page. Columns:

  1. Role in Condensate (rendering protein_functional_type)
  2. Driver Criterion (rendering protein_driver_criterion)
  3. Experimental Evidence (rendering protein_exp_evidence)

Now we would be able to show PubMed IDs in brackets beside each annotation. This is how most other databases have displayed PubMed IDs so far.

Example

Currently, we parse the protein_functional_type field and look up the UniProt ID of the current protein row among this dict's keys. If found, we show the value for this key. Since the value is now a dict, we have to display the 'annotation' of this inner dict; if pubmed_ids is non-empty, we join the list with a suitable separator and show it in brackets, otherwise we show no brackets.
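The display logic above could look roughly like this. It is written in Python only to stay consistent with the dict/list terminology of this spec; the real frontend would implement it in its own language. The function names, the ", " separator between PubMed IDs, and the "; " separator between multiple annotations are all assumptions.

```python
def format_annotation(entry: dict, separator: str = ", ") -> str:
    """Render one annotation dict, adding PubMed IDs in brackets if present."""
    label = entry["annotation"]
    pubmed_ids = entry.get("pubmed_ids") or []
    if pubmed_ids:
        return f"{label} ({separator.join(pubmed_ids)})"
    # No PubMed IDs: show the bare annotation without brackets.
    return label


def render_cell(attribute: dict, uniprot_id: str, separator: str = ", ") -> str:
    """Render the table cell for one protein row.

    `attribute` is one of the condensate attributes keyed by UniProt ID.
    The value may be a single dict or a list of dicts.
    """
    value = attribute.get(uniprot_id)
    if value is None:
        return ""  # this protein has no annotation of this kind
    entries = value if isinstance(value, list) else [value]
    return "; ".join(format_annotation(e, separator) for e in entries)
```

With the sample value from the schema section, the cell for that protein would read "driver (31366629, 31366121)"; an entry with an empty pubmed_ids list would render as just "driver".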

Database Update (Sync)

When receiving update_items from contributors for "Role in Condensate" (functional_type), "Driver Criterion" (driver_criterion) and "Experimental Evidence" (exp_evidence), we must also accept pubmed_ids in a separate text input box alongside the annotation drop-down selection. This requires changes to the frontend input boxes. The rest of the CMS frontend work is described in dd-code-web#149
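One possible way to turn the free-text PubMed ID input into the pubmed_ids list is sketched below. The accepted separators (commas, semicolons, whitespace), the digits-only validation, and the names parse_pubmed_ids and build_annotation_entry are assumptions for illustration; the actual update_items payload format is not specified here.

```python
import re


def parse_pubmed_ids(raw: str) -> list[str]:
    """Split a text-box value into a de-duplicated list of PubMed ID strings."""
    # Assumption: contributors separate IDs with commas, semicolons or spaces.
    tokens = [t for t in re.split(r"[\s,;]+", raw.strip()) if t]
    invalid = [t for t in tokens if not re.fullmatch(r"[0-9]+", t)]
    if invalid:
        raise ValueError(f"Not valid PubMed IDs: {invalid}")
    # dict.fromkeys keeps the first occurrence of each ID, in input order.
    return list(dict.fromkeys(tokens))


def build_annotation_entry(annotation: str, raw_pubmed_ids: str) -> dict:
    """Combine the drop-down selection and the text input into one entry."""
    return {
        "annotation": annotation,
        "pubmed_ids": parse_pubmed_ids(raw_pubmed_ids),
    }
```

An empty text box yields an empty pubmed_ids list, which matches the case where an annotation has not been linked to specific publications yet.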

Edited by Soumyadeep Ghosh