Category Archives: Ideas

Why is algorithmic thinking needed?

Introduction

  • Let’s say you have two files A and B.
  • Let’s say the size of file A, in terms of line count, is m and the size of file B is n.
  • Both files are reasonably large, especially file B. Let’s say file B is 180 GB.
  • Both tend to grow over time, especially file A. Let’s say file A gets updated every week.
  • Each line of file A (or a part of it) needs to be searched for among the lines of file B (or parts of them).

Problem modeling

  • It’s a search problem.
  • It’s the same as this problem.

Solution Approach

Approach 1 : Brute force

  • Iterate through one file say A.
  • Read each line of A
  • Iterate through the other file B.
  • Read each line of B
  • See if there is a match
  • Report

Analysis

  • Big-O time complexity is m * n, since every line of A is compared against every line of B.
  • Will take a very long time for files of this size

Approach 2 : Sort file B and do binary search for file A’s contents

  • Sort file B
  • Iterate through A.
  • Read each line
  • Do a binary search in file B
  • See if there is a match
  • Report

Analysis

  • Big-O time complexity is: n * log(n) + m * log(n)
  • Big-O space complexity is: n (file B needs to be held in memory)
  • Sorting file B ahead of time saves the n * log(n) cost on later runs.
  • However, each lookup still costs log(n), so every pass over file A costs m * log(n).
  • Will need a lot of space
  • Will still be slow
  • Will break in the future as m and n grow
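
Approach 2 can be sketched as below. This is a minimal illustration that assumes file B fits in memory as a list of lines, which is exactly the limitation noted above. The function names `contains` and `search_all` are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List


def contains(sorted_lines: List[str], key: str) -> bool:
    """Binary search: O(log n) per lookup in an already sorted list."""
    i = bisect_left(sorted_lines, key)
    return i < len(sorted_lines) and sorted_lines[i] == key


def search_all(lines_a: Iterable[str], lines_b: Iterable[str]) -> List[str]:
    """Return the lines of A that also occur in B."""
    sorted_b = sorted(lines_b)  # O(n log n), and all of B is held in memory
    return [line for line in lines_a if contains(sorted_b, line)]
```

For example, `search_all(["x", "b", "q"], ["d", "b", "a", "x"])` reports `"x"` and `"b"` but not `"q"`.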

Approach 3 : Sort file A and B and do file co-parsing by streaming both files

  • Sort file A
  • Sort file B
  • Iterate through A while simultaneously reading through contents of B
  • See if there is a match.
    • If both are equal, report the match.
    • If file B’s current line is ‘smaller’ than file A’s, read the next line of file B. Continue till you find a match.
    • If file A’s current line is ‘smaller’ than file B’s, read the next line of file A. Continue till you find a match.
  • If ‘end of file’ is reached on either A or B, then report and exit.

Analysis

  • Big-O time complexity is : n * log(n) + m * log(m) + m + n. 
  • Sorting file A and B ahead of time will save considerable time.
  • Slightly more time consuming than Approach 2 when all steps are done for the first time. However, if sorting is done a priori, subsequent searches take linear time: m + n.
  • No additional space requirements for the search itself, since both files are streamed
  • Approach is future-proof as the file sizes of A and B grow or change
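
The co-parsing step of Approach 3 can be sketched as below. This is a minimal illustration, not the implementation referenced in this post. It assumes both inputs are already sorted in the same order that Python uses to compare strings (for example, files sorted with a byte-wise collation). The function name `merge_matches` is hypothetical.

```python
from typing import Iterable, Iterator


def merge_matches(sorted_a: Iterable[str], sorted_b: Iterable[str]) -> Iterator[str]:
    """Yield every line of A that also occurs in B, reading each stream once."""
    iter_a, iter_b = iter(sorted_a), iter(sorted_b)
    a = next(iter_a, None)
    b = next(iter_b, None)
    while a is not None and b is not None:
        if a == b:
            yield a
            a = next(iter_a, None)  # advance A only, so a repeated line in A still matches
        elif a < b:
            a = next(iter_a, None)  # A is 'smaller': read the next line of A
        else:
            b = next(iter_b, None)  # B is 'smaller': read the next line of B
```

Because the function accepts any iterable, two open file handles can be passed in directly and only one line from each file is held in memory at a time.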

Author : Madhulatha Mandarapu

Context:

  • Search PubMed article ids linked to any clinical trial against NLM’s MRCOC detailed MeSH term co-occurrence file.
  • The detailed MeSH term co-occurrence file is very large, at 183 GB and 1.7B rows, and will grow.
  • The number of global clinical trials done or in flight is approximately 760k. As trials grow, the PubMed article ids linked to any clinical trial will also grow.
  • A brute force approach to this problem will take a long time, and the binary search approach will need considerable memory, especially if you search in MRCOC.
  • Here’s an implementation

Enterprise Knowledge Graphs and the need for it

Today, most enterprises have Business Intelligence and analytics teams. They address the time-sensitive, operational needs of the organization. Importantly, business decisions are taken based on insights discovered from these platforms. Often AI/ML helps project into the future. Most often, the need for BI platforms and the workforce behind them is acknowledged and highly valued by the CXO team. Data from various departments, including sales and R&D, flows into BI platforms via data warehouses, data lakes or lakehouses. However, direct access to operational DBMS systems is still needed at times. Also, data may need to flow back from data warehouses to DBMS systems through reverse ETL.

The above scheme is mostly accepted as necessary. In reality, adoption, data lineage, speed of insight generation and subsequent discovery vary. Often human insight is still ahead of the system. Here’s an opportunity for improvement.

One of the key additions to the above ecosystem could be Enterprise Knowledge Graphs (EKGs). They can address a critical need for ‘drilling down’ into the data to arrive at the ‘nugget of gold’. While this is feasible in the current scheme, it is dependent on human skill. A skilled CXO might be able to get to the ‘insight’ with the right ‘SQL’ query (they may or may not write it themselves). This is not uncommon.

EKGs have the potential to bring together key ‘identities’ and their ‘relationships’ across the organization: people, departments, products, customers, geography, time, research, language and their inter-dependencies. The ‘operational’ facts can and should continue to come from the data warehouse, data lake or lakehouse.

Can an organization achieve the benefits of an EKG by leveraging investments in a ‘Master data management’ system? Yes, partially. In practice, ‘MDM’ is not brought into BI platforms; it’s siloed and has less visibility. ‘Mastering’ data is considered a data engineering act. An EKG system, by contrast, would address the organizational needs more holistically when it’s integrated into BI platforms through GraphQL.

The understanding needed for building an EKG comes naturally to any organization’s team; they know it intuitively. Skills and standards may still be evolving. Web3 and the subsequent conversations around the semantic web are helping bridge the gaps, although most of these conversations are about blockchains. This is a necessary area that needs a focused effort of its own in the very near future. EKGs, though, can be built now and can provide value right away.

Let us know if EKGs and the Semantic Web interest you. Here’s an open knowledge graph that can help you draw an analogy to your organizational needs. Write to us. We are happy to help.

Open knowledge graph on clinical trials

VaidhyaMegha is building an open knowledge graph on clinical trials, comprising

  • Clinical trial ids from across the globe
  • MeSH ids for
    • Symptoms/Phenotype
    • Diseases
  • PubMed Article ids
  • Genotype (from Human Genome)

Specification

Below is a very brief specification

  • Inputs
    • Mesh RDF
    • WHO’s clinical trials database – ICTRP.
    • US clinical trial registry data from CTTI’s AACT database.
    • Data from clinical trial registries across the globe, scraped from their websites, e.g. India.
    • MEDLINE Co-Occurrences (MRCOC) Files
  • Outputs
    • Clinical Trials RDF with below constituent ids and their relationships
      • MeSH, Clinical Trial, PubMed Article, Symptom/Phenotype, Genotype(from Human Genome)
      • Additionally, clinical trial -> clinical trial links across trial registries will also be discovered and added.

Source code

  • Source code would be hosted here.

Release notes

v0.2 : 27-Jan-2022

  • Clinical trials are linked to the RDF nodes corresponding to the MeSH terms for conditions.
  • Download the enhanced RDF from here.

Reference

VaidhyaMegha’s prior work on

  • clinical trial registries data linking.
  • symptoms to diseases linking.
  • phenotype to genotype linking.
  • trials to research articles linking.

Last 3 are covered in the examples folder. They were covered in prior work in separate public repos.

Web 3.0 musings

1 Introduction

Given the interest generated recently by Elon Musk’s and Jack Dorsey’s tweets on Web 3.0, here are a few topics, thoughts, thought experiments and ideas.

1.1 Topics

Web 1.0 – Static content – HTML, URI, URL, HTTP. Access/Read content from anywhere with a browser.

Web 2.0 – Interactive – Apps – Forms, AJAX, Search, Secure access, reports – WSDL, SOAP, REST, GraphQL, TLS.

Web 3.0 – Decentralized, Semantic, No trust/permissions needed – IPFS, Blockchain.

  • Semantic Web – fully connected WWW beyond hyperlinks.

Current hyperlinks may break, redirect or be subverted (by updated pages). No versioning is feasible. Search engines fill this gap only slightly, using page ranking and other heuristics/proxies for intention and authenticity.

Blockchain-based non-repudiation with versioning will make a link immutable.

  • Knowledge Graphs

Connected data/information/insights/knowledge/wisdom across various dimensions, domains and any observable phenomena.

  • Front end

GraphQL (Will need to be further refined) – for API.

  • Backend

SPARQL – SQL for graphs

RDF, TripleStore

  • Architectural tools and frameworks

Apache Marmotta

  • Impact

Cloud/IoT to Intelligent connected edge.

Blogs/Content can be implemented as kgraph.

Search engines / Social Media / Rental Apps to SPARQL Apps. Examples include

> find a movie theater near Lalbagh

> Get Posts made by my friends on Politics

> Find a 2 BHK for rent for 2 days in Guruvayur

Open standards in BioInformatics

As in any other domain, there have been several significant developments in health standards in recent years. Several new standards like FHIR have evolved and have seen strong adoption globally.

However, health is a domain of domains, and there are many areas within the health industry that can significantly benefit from a strong set of standards. BioInformatics is one such sub-domain which could benefit from stronger standards and better adoption.

There are quite a few standards already in BioInformatics. A few standards are listed below :

Most of these standards in BioInformatics have been addressing the need for standardizing file-format/content for various datasets generated a.

For processes, a.k.a. pipelines, the techniques below are often used:

  • Direct command line scripts using languages like BASH, Perl and/or Python
  • CWL – Common Workflow Language
  • OpenWDL – Open Workflow Definition Language

The last amongst these, OpenWDL, is being championed by MIT and Broad Institute and has generated significant interest in the BioInformatics community and industry at large.

Below is a brief listing of these standards and how support for them stacks up today.

VaidhyaMegha’s BISDLC SaaS offering has out-of-the-box support for all 3 paradigms and more. Below is a very quick demonstration on top of the fantastic open source platform BioStar central, showing in an oversimplified manner how these paradigms can be supported with small enhancements to BioStar central.

Credits : Biostar central and our intern Ananya Shivkumar

More here : https://github.com/vaidhyaMegha/biostar-central/

Animated Bar Chart Race For Clinical Trial Data

Github : https://github.com/VaidhyaMegha/Animated-Bar-chart-race

What is a Bar Chart Race:

The bar chart race is an animated chart that displays the top “n” values per year or any other category.

The chart consists of four parts. From bottom to top in z-order: the bars, the x-axis, the labels, and the ticker showing the current date. 

It has the features below.

  • Make the animation faster or slower by adjusting the duration in milliseconds
  • Select the top “n” values to display in the bar chart race
  • Good visualization with a different color for each value
  • Replay button
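
The “top n per year” selection behind each frame of the race can be sketched as below. The chart itself is drawn in JavaScript with D3; this Python snippet only illustrates the ranking logic. The `(year, name, value)` row layout and the function name `top_n_per_year` are assumptions for illustration, not the layout of the actual data file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str, float]  # (year, name, value): hypothetical column layout


def top_n_per_year(rows: Iterable[Row], n: int) -> Dict[int, List[Tuple[str, float]]]:
    """Group rows by year and keep the n largest values in each year."""
    by_year = defaultdict(list)
    for year, name, value in rows:
        by_year[year].append((name, value))
    return {
        year: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for year, items in sorted(by_year.items())
    }
```

Each year’s list is ordered from the largest value down, which is the order in which the bars are stacked in a frame.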

Reference Link: https://observablehq.com/@d3/bar-chart-race-explained

Clinical Trial Registry Data :

The Clinical trial data file (clinical trial data.csv) is taken from “ClinicalTrials.gov”. ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world. 

It is a Web-based resource that provides patients, their family members, health care professionals, researchers, and the public with easy access to information on publicly and privately supported clinical studies on a wide range of diseases and conditions. 

The Web site is maintained by the National Library of Medicine (NLM) at the National Institutes of Health (NIH).

Steps for implementation of this graph:

  1. Open the link https://observablehq.com/@d3/bar-chart-race-explained and click on the “three dots icon (…)” (displayed next to “Teams” at the top right). Then, click on “Download code”. The file “bar-chart-race-explained.tgz” will be downloaded. Now, extract the tgz file. It should contain .html, .js, .css and .json files as shown below:
  2. The file e9e3929cf7c50b45@3007.js contains JavaScript code as shown in the image below. Open this file using any editor (say VS Code) and replace the file name “category-brands.csv” with the name of our data file (CSV file).
  3. Download our data file (CSV format).
  4. If you want to add extra features to the bar chart (selecting the top n values, adjusting the duration in a text box, etc.), include the extra code in the file e9e3929cf7c50b45@3007.js.
  5. Finally, open a terminal and change into the project folder. Ex: cd followed by your project folder path (/home/desktop/bar chart race explained).
  6. Run the command “simplehttpserver .” The local server will start as shown below:

Listening 0.0.0.0:8000 web root dir /home/Desktop/Bar Graph/bar-chart-race-explained

  7. Open the browser and go to localhost:8000. The animated bar chart will be generated as shown below.
  8. The duration of the animation can be adjusted.


Connecting symptoms, diseases and genotype

In the previous article, Connecting Clinical Trials to Research Articles, we saw how to search the PubMed database by specifying clinical trial id(s) and retrieve all the relevant journal articles. In this article, let’s learn about the association between symptoms and diseases, and between phenotype and genotype.

Importance of symptom and disease relationship

A disease is an abnormal condition that negatively affects the functionality of an organism. A symptom is a physical or mental feature which can indicate a condition. The relation between diseases and their symptoms is important for diagnosing any disease. This information is also useful for medical research purposes.

Each article in PubMed is associated with metadata that includes the major topics of the article. By using a Perl script with the NCBI E-utilities, we can retrieve the PubMed identifiers for any symptom and disease terms. The symptom and disease terms are defined by MeSH. We can find an association between symptoms and diseases by using the PubMed ids.

Program: The below program gives the pubmed identifiers of co-occurrence of symptoms and diseases.

Input file for Diseases:

Input file for Symptoms:

Output file: The output file contains list of pubmed identifiers of co-occurrence of diseases and symptoms.
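
The co-occurrence idea can be sketched as below: a symptom and a disease are linked when at least one PubMed id appears under both terms. This is a minimal Python illustration, not the Perl program used in this article. The function name `cooccurrences` is hypothetical, and the ids in the usage note are made-up placeholders.

```python
from typing import Dict, List, Set, Tuple


def cooccurrences(
    disease_pmids: Dict[str, Set[str]],
    symptom_pmids: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Map each (disease, symptom) pair to the PubMed ids they share."""
    pairs = {}
    for disease, d_ids in disease_pmids.items():
        for symptom, s_ids in symptom_pmids.items():
            shared = d_ids & s_ids  # set intersection: ids listed under both terms
            if shared:
                pairs[(disease, symptom)] = sorted(shared)
    return pairs
```

For example, with `{"Malaria": {"1", "2", "3"}}` as diseases and `{"Fever": {"2", "3", "9"}, "Cough": {"7"}}` as symptoms, only the pair `("Malaria", "Fever")` is reported, with the shared ids `"2"` and `"3"`.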

Phenotype-Genotype
Once the association between diseases and symptoms is identified, we can find the phenotype and genotype information based on symptoms. Let’s take a look at the “Phenotype-Genotype” integrator.

Phenotype is the composite of the organism’s observable characteristics. 

Genotype is the part of the genetic makeup of a cell which determines one of its characteristics.

What is PheGenI?

PheGenI is a web interface that integrates various genomic databases with genome-wide association study (GWAS) data.

The genomic databases are from the National Center for Biotechnology Information (NCBI) and the association data is from the National Human Genome Research Institute (NHGRI). Here, the phenotype terms are MeSH terms.

  • A GWAS is a study of a genome-wide set of genetic variants in different individuals to observe if any variant is associated with a phenotype/trait.
  • Clinicians and epidemiologists are interested in the results of GWAS because they help with study design considerations and the generation of biological hypotheses.
  • GWAS results consist of various fields, namely SNP rs, Gene, Gene ID, Gene2, Gene ID2, Chromosome and PubMed ids.

Phegeni Association results file:

All association results can be downloaded from the PheGenI browser. A sample file looks as below.

The Association results can be accessed from here – https://www.ncbi.nlm.nih.gov/gap/phegeni

Program: The program below gives the list of SNP rs, Gene, Gene ID, Gene2, Gene ID2, Chromosome and PubMed ids for the respective phenotype term.

The input file contains a list of phenotype search terms based on MeSH. A sample file looks as below.

Output file: The output file contains the list of SNP rs, Gene, Gene ID, Gene2, Gene ID2, Chromosome and PubMed ids for the respective phenotype term.

In this way, we can retrieve genetic variants related to any phenotype(s).
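
The lookup can be sketched as below. This is a minimal Python illustration rather than the program from the repository. It assumes the association results file is tab-delimited with a header row, and that the phenotype column is called `Trait`. Both are assumptions, and the sample rows are made-up placeholders, not real associations.

```python
import csv
import io
from typing import Dict, Iterable, List


def associations_for(lines: Iterable[str], phenotypes: Iterable[str]) -> List[Dict[str, str]]:
    """Return the association rows whose trait matches one of the search terms."""
    wanted = {term.strip().lower() for term in phenotypes}
    reader = csv.DictReader(lines, delimiter="\t")
    # "Trait" is an assumed column name; adjust it to the actual file header.
    return [row for row in reader if row["Trait"].strip().lower() in wanted]


# Made-up sample in the assumed layout.
SAMPLE = (
    "Trait\tSNP rs\tGene\tGene ID\tChromosome\tPubMed\n"
    "Asthma\trs0000001\tGENE1\t1\t5\t10000001\n"
    "Obesity\trs0000002\tGENE2\t2\t16\t10000002\n"
)
rows = associations_for(io.StringIO(SAMPLE), ["asthma"])
```

An open file handle can be passed in place of the `io.StringIO` sample, so the results file is read row by row.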

Note:

All the files (input, script and results) for the symptom-disease relationship used in the above example are available on GitHub and can be downloaded from https://github.com/VaidhyaMegha/SymptomDiseaseRelationships

All the files (input, script and results) for PheGenI used in the above example are available on GitHub and can be downloaded from https://github.com/VaidhyaMegha/Phegeni

Connecting clinical trials to research articles

We can search the PubMed database by specifying clinical trial id(s) and retrieve all the relevant journal articles. The NLM (the U.S. National Library of Medicine, the world’s largest medical library, is part of the National Institutes of Health) extracts those trial IDs and places them into PubMed’s secondary ID field.

How to query a single trial id?

Example: To search for journal articles related to a clinical trial id, say NCT00000419, use PubMed’s application programming interface (API) called E-utilities, which is documented at https://www.ncbi.nlm.nih.gov/books/NBK25497/

Now, specify the clinical trial id in the E-utilities API as shown below (this URL uses NCT01874691):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=NCT01874691[si]

All the journal articles related to the clinical trial id will be displayed as shown below:

Articles

In the above URL, the search field tag [SI] refers to Secondary ID, which can be used to search for articles. This field contains accession numbers from various databases (molecular sequence data, gene expression, chemical compounds, etc.).
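
Building such a URL programmatically can be sketched as below. The endpoint and the `db` and `term` parameters are taken from the URL above. The function name `esearch_url` is hypothetical. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(trial_id: str) -> str:
    """Build an ESearch URL that looks up a trial id in the Secondary ID field."""
    # urlencode percent-encodes the square brackets of the [si] field tag.
    query = urlencode({"db": "pubmed", "term": f"{trial_id}[si]"})
    return f"{ESEARCH_URL}?{query}"
```

Calling `esearch_url("NCT01874691")` produces the same query as the URL above, with `[si]` percent-encoded as `%5Bsi%5D`.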

How to query multiple trial ids?

Here, we have taken a large number of clinical trial numbers in one file, and the results were written to another file which contains the PubMed articles of the respective trials.

The input file contains multiple NCT numbers which can be used as queries in the search field of the PubMed API (E-utilities). A sample input file looks as below:

Trial ids

The Perl script below searches for each clinical trial id specified in the input file against the PubMed database using the PubMed E-utilities API.

Output file: The output file contains the PMIDs (PubMed records) of the respective clinical trials.

Trials connected to research articles

In this way, we can retrieve journal articles related to any clinical trial(s) by using the PubMed E-utilities API.
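
Reading the PMIDs out of an ESearch response can be sketched as below. This is a minimal Python illustration, not the Perl script from the repository. It relies on ESearch returning XML with the matching ids under `IdList/Id`. The sample response is a trimmed, made-up example, and the function name `extract_pmids` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_pmids(esearch_xml: str) -> List[str]:
    """Return the PubMed ids listed in an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Trimmed, made-up response used only to exercise the parser.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>"""
pmids = extract_pmids(SAMPLE_RESPONSE)
```

Applying this to the response for each NCT number in the input file gives the trial-to-article mapping described above.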

Note:
All the files (input, script and results) used in the above example are available on GitHub and can be downloaded from https://github.com/VaidhyaMegha/Connectingclinicaltrialstoresearcharticles

Open Dental Software

What is Open Dental

“Open Dental” is dental practice management software. It enables providers and staff to schedule appointments and send reminders to patients automatically through text messages and email. Although Open Dental has been described as compatible with Windows, Linux and Mac operating systems, current versions of the software require Microsoft Windows.

Download and install the Open Dental trial version

Download the “Open Dental” software from https://www.opendental.com/trial.html and install it on the system. Launch the application and register as a new user by entering the mandatory information. Now, log in to the application using the registered credentials. The “Home” screen will be displayed as shown below:

Create new patient 

To create a new patient, click on “Select Patient” button on the main toolbar. “Select Patient” window will be displayed as shown below :

Now, enter either the “Last name” or the “First name” to check that the patient doesn’t already exist. The system will perform an auto-search with the entered text and display the search results. Once the search is complete and the patient doesn’t exist, click on the “Add Pt” button (under the ‘Add New Family’ section) to add a single patient. The “Edit Patient Information” screen will be displayed as shown below with the entered text:

Enter relevant information and save the patient record by clicking on “OK” button.

To add multiple patients of a family, click on “Add Many” button (under ‘Add New Family’ section) in “Select Patient” screen. “Add Family” screen will be displayed as shown below:

Create new appointment 

To create or view appointments, click on “Appts” module in the left side bar. To add an appointment, select a patient by clicking on “Select Patient” button in the main toolbar. Search for a patient and select the record. Appointment view will be displayed as shown in the below screenshot :

Click on the “Make Appt” button displayed on the right side. “Edit Appointment” window will be displayed as shown in the below screenshot:

Add procedure(s) and other information and click on “OK” button. Appointment will be created and displayed as shown in the below screenshot :

View of appointments in a day is shown as below:

Cancel the appointment

To cancel an appointment, right click the appointment and select “Break appointment”. “Break appointment” dialog will be displayed as shown below:

Click “OK” to cancel the appointment. Cancelled appointment will be displayed as shown below (highlighted using green color):

FHIR api access 

Access FHIR by navigating to the ‘Set up’ menu -> Advanced setup -> FHIR. A pop-up dialog will be displayed as shown below.

Generate developer key:

From the FHIR Setup pop-up dialog (shown above), click on “Generate key” and fill in the required details. Finally, the newly generated API key will be displayed as shown above. It is authorized for read operations only.

Eservices

Navigate to the ‘eservices’ menu -> Sign up. A pop-up dialog will be displayed as shown below.

Database access – MySQL

MySQL database access from DBeaver

Before logging in to the DBeaver tool, select the database (e.g. MySQL) and provide login credentials. Now, the user can log in to the DBeaver window as shown in the screenshot below:

Retrieving patient(s) information (Manual method)

Expand MYSQL 8+ -localhost in the Database Navigator (left side). Now, expand Databases, expand opendental, then expand Tables and double-click on the patient table. The patient table will be displayed as shown below:

Retrieving all patients information (by using SQL query)

Click on SQL Editor from main menu and select SQL Editor. Enter SQL query. Ex: select * from opendental.patient;

Retrieving specific patient information by using SQL query

Eg: SQL query: select * from opendental.patient where LName='PPO' and FName='Patty';
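
When the same lookup is run from a program rather than from DBeaver, the names are usually passed as query parameters instead of being pasted into the SQL text. The sketch below uses Python’s built-in sqlite3 module and an in-memory table purely as a stand-in for the MySQL opendental database. The reduced table layout, the sample rows and the function name `find_patients` are assumptions for illustration, and MySQL drivers typically use a different placeholder style than `?`.

```python
import sqlite3
from typing import List, Tuple


def find_patients(conn: sqlite3.Connection, last_name: str, first_name: str) -> List[Tuple[str, str]]:
    """Look up patients by name using bound parameters."""
    cursor = conn.execute(
        "SELECT LName, FName FROM patient WHERE LName = ? AND FName = ?",
        (last_name, first_name),
    )
    return cursor.fetchall()


# In-memory stand-in for the patient table (assumed, heavily reduced layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (PatNum INTEGER PRIMARY KEY, LName TEXT, FName TEXT)")
conn.executemany(
    "INSERT INTO patient (LName, FName) VALUES (?, ?)",
    [("PPO", "Patty"), ("Smith", "John")],
)
matches = find_patients(conn, "PPO", "Patty")
```

Binding the values this way keeps quoting problems, such as the curly quotes that word processors insert, out of the SQL statement.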

Testing FHIR API in postman

1. Getting all the patient information by using the FHIR API (GET request)

Testing Environment: Open Dental hosts a test database for developers to play with FHIR. The base URL is the same for all customers and is https://api.opendental.com/fhir. The Capability Statement on this server gives a detailed, technical description of Open Dental’s FHIR capabilities. These three headers must be included in requests sent to this server: 

Content-Type: application/json

Authorization: FHIRAPIKey=paste your API key

Accept: application/json 

A browser extension or other software is necessary to send request headers.

URL: https://api.opendental.com/fhir/patient

After a “get” request to the above URL, we get all the patients’ information as shown in the image below.
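
Outside Postman, the same request can be prepared in a few lines of Python. The sketch below only builds the request with the three headers listed above and does not send it. The function name `build_fhir_request` is hypothetical, and the API key is a placeholder to be replaced with your own.

```python
import urllib.request

BASE_URL = "https://api.opendental.com/fhir"


def build_fhir_request(resource_path: str, api_key: str) -> urllib.request.Request:
    """Prepare a GET request carrying the three required headers."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"FHIRAPIKey={api_key}",  # header format as given above
        "Accept": "application/json",
    }
    return urllib.request.Request(f"{BASE_URL}/{resource_path}", headers=headers, method="GET")


# Placeholder key for illustration only.
request = build_fhir_request("patient", "paste-your-API-key")
```

Passing the prepared request to `urllib.request.urlopen` would send it. The response body is expected to be JSON, as requested by the Accept header.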

2. Finding a specific appointment through a GET request

The appointment we want has ID 4. After a “get” request to the URL below, we get the appointment information as shown in the image below.

URL: https://api.opendental.com/fhir/appointment/4

Conclusion

As shown above, we can create a new patient, create and edit appointments, and cancel appointments using Open Dental software. We can also work with the FHIR API and the database.

Clinicaltrialsdata.org – Part 3

In Part 2, we discussed the “Access” feature of the clinicaltrialsdata.org platform. In this article, let’s explore the “Search” feature.

The “Search” feature enables users to search for any clinical trial record by entering any text, i.e. disease, medical condition, study type, location, etc. Below are the aspects of the Search feature:

  • The search engine examines all the words of the clinical trial records in each clinical trial registry to match the search text
  • Returns all clinical trial records which meet the search criteria
  • Search results can be bookmarked to quickly view and analyze later
  • Customize search results by selecting needed database columns from UI 

Search functionality can be accomplished in 2 ways:

  • UI -User Interface
  • API – from command line or programmatically 

Let’s take a look at the navigation steps and screenshots which show how the search functionality can be accomplished.

1. Launch clinicaltrialsdata.org and click on the “Search” tab. The “Search” page will be displayed as shown in the screenshot below:

2. Enter any search text say “Malaria” and click “SEARCH” button. Search results will be displayed as shown below : 

3. Customize search results by selecting any “Available Columns” say trialid, Date_Of_Registration, study_type, enrollment.type, primary_outcome.measure. Search results will be displayed as shown below:

In the coming articles, we will discuss the Discover, Compare and Analyse features.