From a user’s point of view, CD-CODE is a web application that exposes a database of condensates and proteins, with data-update features offered in a community crowdsourcing style. Behind this simple-looking web application, however, sit multiple frontend and backend frameworks and servers. The distinct technical parts and sub-parts of CD-CODE are depicted in figure X below. In this chapter, we discuss the motivation for and importance of each component one by one.
At the heart of CD-CODE sits the “main database” (numbered 1.a. in figure X), which holds the ground-truth data of condensates and proteins. This database contains only verified data, along with all available evidence, including PubMed IDs. The data structure and schema of the entities and relations in this database are explained in Chapter Schema. To implement this, we use the community version of [MongoDB](https://www.mongodb.com/) 4.4 as the database server. The primary key of each collection is enforced using the “unique indexes” option. The textual and informative attributes are indexed for tokenized text search, which also powers the search and filter features on the frontend. The categorical attributes, such as species, functional_type, and condensate type, are also indexed to facilitate faster filtered queries.
A thin and lightweight Application Programming Interface (API) layer (numbered 1.b. in figure X) sits on top of the main database to expose consumable data both for the frontend applications and for programmatic use. The APIs are designed to be REST-compliant. The two main resources are **“proteins”** and **“condensates”**. Each has a list API: `[GET] /proteins` and `[GET] /condensates`. The list APIs support standard size-based pagination via query parameters for the page number and the size of each page. They also allow filtering and sorting on a selected set of attributes. The tokenized search configured in the database additionally powers text-search filtering in the list APIs through the query parameter “query”. For both resources, there is also a detail API: `[GET] /proteins/{uniprot_id}` and `[GET] /condensates/{unique_identifier}`. The list and detail APIs power the respective list and detail pages on the frontend as well.
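The translation from query parameters to a database query can be sketched as below. The parameter names `page`, `size`, and `query` come from the text; the filter whitelist, defaults, and page-size cap are illustrative assumptions.

```python
# Sketch of how a list endpoint might map request query parameters onto
# a MongoDB find(): size-based pagination, optional text search via the
# "query" parameter, and filtering on a whitelisted set of attributes.

ALLOWED_FILTERS = {"species", "functional_type"}  # assumed whitelist

def build_list_query(params):
    """Translate request query parameters into (filter, skip, limit)."""
    page = max(int(params.get("page", 1)), 1)
    size = min(max(int(params.get("size", 20)), 1), 100)  # assumed cap
    mongo_filter = {}
    if params.get("query"):
        # Uses the tokenized text index configured on the collection.
        mongo_filter["$text"] = {"$search": params["query"]}
    for key in ALLOWED_FILTERS & params.keys():
        mongo_filter[key] = params[key]
    return mongo_filter, (page - 1) * size, size
```

The returned tuple would feed directly into `collection.find(mongo_filter).skip(skip).limit(limit)`.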
We used the Python framework [Flask](https://flask.palletsprojects.com/) to develop the API layer and the package [pyMongo](https://pymongo.readthedocs.io/) to connect it to the database. The APIs are protected with an Authorization Bearer token, which must be passed in the header of each request. The status codes of the responses are also REST-compliant.
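A minimal sketch of the bearer-token check is shown below; in the Flask app this would likely live in a decorator or `before_request` hook. The token value and the error payload are illustrative assumptions.

```python
# Sketch of validating the Authorization Bearer header on each request.
import hmac

API_TOKEN = "example-secret-token"  # illustrative; the real token comes from config

def authorize(headers):
    """Return a REST-style (status_code, error_payload) for a request's headers."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    # compare_digest avoids leaking the token length via timing differences.
    if scheme != "Bearer" or not hmac.compare_digest(token, API_TOKEN):
        return 401, {"error": "invalid or missing bearer token"}
    return 200, None
```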
## Frontend
This component of CD-CODE contains all the GUI elements of the web application that a regular user has access to. A user with read-only rights can see the list and detail pages for proteins and condensates, as well as a page showing data statistics. These user interfaces are built using the progressive JavaScript framework [Vue.js](https://vuejs.org/). This component directly consumes the REST APIs from the backend layer.
To facilitate the crowdsourcing functionality, we use an underlying layer of a headless content management system (CMS) (numbered 2.a.) powered by [Strapi](https://strapi.io/). Upon successful recruitment as contributors, users get a modified user interface with edit options beside selected data points on the detail pages. We term these contributions “UpdateItems”. Using these interactive options, contributors can submit data modifications to CD-CODE. They also get an additional form to create novel condensates and add their protein members, along with the condensate-specific protein properties. The submissions from these forms are termed “NovelCondensates”. We store these update items and novel condensate submissions in the database connected to the CMS by Strapi (numbered 3 in figure X). For this purpose, we selected the open-source relational database [PostgreSQL](https://www.postgresql.org/). The fixed data structure of the RDBMS shapes the Strapi APIs and the CRUD functionalities for UpdateItems and NovelCondensates. At first impression, the CMS might appear to be a redundant layer that merely supports data-write functionality; however, it serves the following purposes:
* A layer of moderation for data edits
* Version control and history of data changes
* Record and reward the contribution by users (very important in science)
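An UpdateItem record supporting all three purposes above might look like the following. Every field name here is an assumption for illustration, not the actual Strapi schema.

```python
# Illustrative sketch of an "UpdateItem" contribution record carrying
# moderation status, change history, and contributor attribution.
from datetime import datetime, timezone

def make_update_item(contributor, target_id, field, old_value, new_value):
    return {
        "contributor": contributor,   # attribution: record and reward the user
        "target_id": target_id,       # the protein or condensate being edited
        "field": field,
        "old_value": old_value,       # version control / history of changes
        "new_value": new_value,
        "status": "pending",          # moderation: pending -> accepted/rejected
        "submitted_at": datetime.now(timezone.utc).isoformat(),
    }
```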
The [encyclopedia](https://wiki.cd-code.org/) (numbered 2.b. in figure X), another crowdsourcing element of CD-CODE, contains descriptive content covering definitions, synonyms, and terminology in the world of biomolecular condensates and liquid-liquid phase separation in biology. Recruited contributors have the credentials required to create content in the encyclopedia. It is internally powered by [Wiki.js](https://js.wiki/) and provides straightforward content management to users without programming or HTML skills.
## Sync Scripts
The final and least visible component of CD-CODE is a scheduled process that runs in the background at regular intervals to copy the accepted data submissions from the Contribution Database (PostgreSQL) to the Main Database (MongoDB). These are simple Python scripts that connect to each of the databases using an ORM connector and copy the update items one by one. The script performs checks on the feasibility of accommodating each update, for example, detecting duplicates and validating data types. A list of post-processing tasks then runs at the end to maintain data sanity, e.g., updating confidence scores based on evidence or fetching data points from UniProt for entirely new proteins that did not exist in CD-CODE before. This is also where further post-processing tasks and rule-based updates can be added in the future.
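The copy-and-check loop described above can be sketched as follows. Plain dicts stand in for the two databases, and the specific feasibility rules are illustrative assumptions rather than the actual script.

```python
# Sketch of the sync step: copy accepted UpdateItems from the contribution
# store into the main store, skipping infeasible ones.

def sync_update_items(update_items, main_db):
    applied, skipped = [], []
    for item in update_items:
        if item["status"] != "accepted":
            continue  # only moderated-and-accepted edits are synced
        doc = main_db.get(item["target_id"])
        if doc is None:
            skipped.append(item)  # unknown target: feasibility check fails
            continue
        current = doc.get(item["field"])
        if current is not None and type(item["new_value"]) is not type(current):
            skipped.append(item)  # data-type check fails
            continue
        if current == item["new_value"]:
            skipped.append(item)  # duplicate: nothing would change
            continue
        doc[item["field"]] = item["new_value"]
        applied.append(item)
    return applied, skipped
```

In the real scripts, the reads and writes would go through the ORM connectors for PostgreSQL and MongoDB, and the post-processing tasks would run after this loop.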