Animated Bar Chart Race For Clinical Trial Data

What is a Bar Chart Race?

A bar chart race is an animated chart that displays the top “n” values for each year or other category.

The chart consists of four parts. From bottom to top in z-order: the bars, the x-axis, the labels, and the ticker showing the current date.

It has the following features:

  • Make the animation faster or slower by adjusting the duration in milliseconds
  • Select the top “n” values to display in the bar chart race
  • Clear visualization, with a different color for each value
  • Replay button
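
As a rough illustration of the “top n values per year” idea, the sketch below groups rows by year and keeps the n largest values in each group. The row layout (year, name, value) and the sample numbers are assumptions made up for illustration; they are not taken from the clinical trial data file.

```python
from collections import defaultdict

def top_n_per_year(rows, n):
    """Group (year, name, value) rows by year and keep the n largest values."""
    by_year = defaultdict(list)
    for year, name, value in rows:
        by_year[year].append((name, value))
    # Sort each year's entries by value, largest first, and keep the top n.
    return {
        year: sorted(entries, key=lambda e: e[1], reverse=True)[:n]
        for year, entries in by_year.items()
    }

# Hypothetical sample rows: (year, condition, number of registered trials).
sample_rows = [
    (2019, "Diabetes", 120),
    (2019, "Asthma", 80),
    (2019, "Malaria", 95),
    (2020, "Diabetes", 130),
    (2020, "Asthma", 140),
    (2020, "Malaria", 60),
]

top2 = top_n_per_year(sample_rows, 2)
```

Each animation frame of the chart would then draw one year's list, with the bars ordered by value.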

Reference Link: https://observablehq.com/@d3/bar-chart-race-explained

Clinical Trial Registry Data:

The clinical trial data file (clinical trial data.csv) is taken from “ClinicalTrials.gov”. ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world.

It is a Web-based resource that provides patients, their family members, health care professionals, researchers, and the public with easy access to information on publicly and privately supported clinical studies on a wide range of diseases and conditions. 

The Web site is maintained by the National Library of Medicine (NLM) at the National Institutes of Health (NIH).

Steps for implementation of this graph:

  1. Open the link https://observablehq.com/@d3/bar-chart-race-explained and click on the three-dots icon (…) displayed next to “Teams” at the top right. Then click on “Download code”. The file “bar-chart-race-explained.tgz” will be downloaded. Extract the tgz file; it should contain .html, .js, .css and .json files.
  2. The file e9e3929cf7c50b45@3007.js contains the JavaScript code. Open it in any editor (say VS Code) and replace the file name “category-brands.csv” with the name of our data file (CSV file).
  3. Delete all existing data in the file mentioned below and replace it with the contents of our data file (CSV format).

     File location: /home/Downloads/bar-chart-race-explained(1) /files/aec3792837253d4c6168f9bbecdf495140a5f9bb1cdb12c7c8113cec26332634a71ad29b446a1e8236e0a45732ea5d0b4e86d9d1568ff5791412f093ec06f4f1

  4. To add extra features to the bar chart (selecting the top n values, adjusting the duration in a text box, etc.), include the extra code in the file e9e3929cf7c50b45@3007.js.
  5. Open a terminal and change to the project folder. Ex: cd "/home/desktop/bar chart race explained"
  6. Run the command “simplehttpserver .” The local server will start and print a message like the one below:

     Listening 0.0.0.0:8000 web root dir /home/Desktop/Bar Graph/bar-chart-race-explained

  7. Open the browser and go to localhost:8000. The animated bar chart will be generated.
  8. The duration of the animation can be adjusted.


Connecting symptoms, diseases and genotype

In the previous article, Connecting Clinical Trials to Research Articles, we saw how to search the PubMed database by specifying clinical trial id(s) and retrieve all the relevant journal articles. In this article, let’s learn about the association between symptoms and diseases, and about Phenotype-Genotype.

Importance of symptom and disease relationship

A disease is an abnormal condition that negatively affects the functioning of an organism. A symptom is a physical or mental feature that can indicate a condition. The relationship between diseases and their symptoms is important for diagnosing any disease. This information is also useful for medical research purposes.

Each article in PubMed is associated with metadata that includes the major topics of the article. By using a Perl script with the NCBI E-utilities, we can retrieve the PubMed identifiers for any symptom or disease term. The symptom and disease terms are defined by MeSH. We can find an association between a symptom and a disease by looking for the PubMed ids they have in common.

Program: The program below gives the PubMed identifiers for co-occurrences of symptoms and diseases.

Input file for Diseases:

Input file for Symptoms:

Output file: The output file contains the list of PubMed identifiers for co-occurrences of diseases and symptoms.
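
The original program is a Perl script that calls the E-utilities. The Python sketch below covers only the final step, under the assumption that the PubMed ids for each term have already been retrieved: it intersects the id sets for every symptom and disease pair. The terms and ids are made up for illustration.

```python
def cooccurrence(symptom_pmids, disease_pmids):
    """Return, for each (symptom, disease) pair, the PubMed ids they share."""
    result = {}
    for symptom, s_ids in symptom_pmids.items():
        for disease, d_ids in disease_pmids.items():
            shared = set(s_ids) & set(d_ids)
            if shared:
                result[(symptom, disease)] = sorted(shared)
    return result

# Hypothetical PubMed ids, made up for illustration only.
symptom_pmids = {"Fever": [101, 102, 103], "Cough": [103, 104]}
disease_pmids = {"Malaria": [102, 103, 200], "Asthma": [104, 300]}

pairs = cooccurrence(symptom_pmids, disease_pmids)
```

Pairs with no shared article are left out, so the result lists only the symptom and disease terms that co-occur in at least one article.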

Phenotype-Genotype

Once the association between diseases and symptoms is identified, we can find the phenotype and genotype information based on symptoms. Let’s take a look at the “Phenotype-Genotype” integrator.

Phenotype is the composite of the organism’s observable characteristics. 

Genotype is the part of the genetic makeup of a cell which determines one of its characteristics.

What is PheGenI?

PheGenI is a web interface that integrates various genomic databases with genome-wide association study (GWAS) data.

The genomic databases are from the National Center for Biotechnology Information (NCBI) and the association data is from the National Human Genome Research Institute (NHGRI). Here, the phenotype terms are MeSH terms.

  • A GWAS is a study of a genome-wide set of genetic variants in different individuals to observe whether any variant is associated with a phenotype/trait.
  • Clinicians and epidemiologists are interested in GWAS results because they help with study design considerations and the generation of biological hypotheses.
  • The GWAS results include several fields: SNP rs, Gene, Gene ID, Gene2, Gene ID2, Chromosome and PubMed ids.

PheGenI association results file:

All association results can be downloaded from the PheGenI browser; a sample file looks as below.

The association results can be accessed from here: https://www.ncbi.nlm.nih.gov/gap/phegeni

Program: The program below gives the list of SNP rs, Gene, Gene ID, Gene2, Gene ID2, Chromosome and PubMed ids for each phenotype term.

The input file contains a list of phenotype search terms based on MeSH; a sample file looks as below.

Output file: The output file contains the list of SNP rs, Gene, Gene ID, Gene2, Gene ID2, Chromosome and PubMed ids for each phenotype term.

In this way, we can retrieve genetic variants related to any phenotype(s).
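
A minimal Python sketch of this lookup is shown here, assuming the association results are a tab-separated file with a header row. The column name “Trait” and the sample rows are assumptions for illustration, not the actual PheGenI export, and the original program may work differently.

```python
import csv
import io

def associations_for(phenotypes, association_file):
    """Return rows of a tab-separated association file whose trait is wanted."""
    wanted = {p.strip().lower() for p in phenotypes}
    reader = csv.DictReader(association_file, delimiter="\t")
    return [row for row in reader if row["Trait"].strip().lower() in wanted]

# Made-up rows in an assumed layout; these are not real PheGenI results.
sample = io.StringIO(
    "Trait\tSNP rs\tGene\tGene ID\tChromosome\tPubMed\n"
    "Asthma\t111\tGENE_A\t1\t5\t900001\n"
    "Obesity\t222\tGENE_B\t2\t16\t900002\n"
    "Asthma\t333\tGENE_C\t3\t17\t900003\n"
)

rows = associations_for(["asthma"], sample)
```

The matching is case-insensitive so that search terms such as “asthma” and “Asthma” return the same rows.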

Note:

All the files (input, script and results) for the symptom-disease relationship example above are available on GitHub and can be downloaded from https://github.com/VaidhyaMegha/SymptomDiseaseRelationships

All the files (input, script and results) for the PheGenI example above are available on GitHub and can be downloaded from https://github.com/VaidhyaMegha/Phegeni

Connecting clinical trials to research articles

We can search the PubMed database by specifying clinical trial id(s) and retrieve all the relevant journal articles. The NLM (the U.S. National Library of Medicine, the world’s largest medical library, is part of the National Institutes of Health) extracts those trial IDs and places them in the PubMed Secondary ID field.

How to query a single trial id?

Example: To search for journal articles related to a clinical trial id, say NCT01874691, use PubMed’s application programming interface (API) called E-utilities, which is documented at https://www.ncbi.nlm.nih.gov/books/NBK25497/

Now, specify the clinical trial id in the E-utilities API as shown below:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=NCT01874691[si]

All the journal articles related to the clinical trial id will be displayed as shown below:

In the above URL, the search field tag [si] refers to Secondary ID, which can be used to search for articles. This field contains accession numbers from various databases (molecular sequence data, gene expression, chemical compounds, etc.).
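
The same URL can be assembled in code. The sketch below builds the esearch URL for a trial id in Python; the function name is a hypothetical helper, and the URL is only built here, not requested.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(trial_id):
    """Build an esearch URL that looks up a trial id in the Secondary ID field."""
    params = {"db": "pubmed", "term": f"{trial_id}[si]"}
    # urlencode percent-encodes the square brackets of the [si] tag.
    return f"{ESEARCH_URL}?{urlencode(params)}"

url = build_esearch_url("NCT01874691")
```

The square brackets come out percent-encoded (%5B and %5D), which is the URL-safe form of the same query.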

How to query multiple trial ids?

Here, we have taken a large number of clinical trial numbers in one file, and the results are written to another file which contains the PubMed articles for the respective trials.

The input file contains multiple NCT numbers, which are used as queries in the PubMed API (E-utilities) search field. A sample input file looks as below:

The Perl script below searches for each clinical trial id specified in the input file against the PubMed database using the PubMed API (E-utilities).

Output file: The output file contains the PMIDs (PubMed records) of the respective clinical trials.

In this way, we can retrieve journal articles related to any clinical trial(s) by using the PubMed API (E-utilities).
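
The original script is written in Perl. As a rough Python equivalent of its loop, the sketch below maps each trial id to the PMIDs found in the IdList element of an esearch XML response. The HTTP call is replaced by a stand-in function so the example runs offline; the trial ids and PMIDs are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_pmids(esearch_xml):
    """Extract PMIDs from the <IdList> of an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [id_node.text for id_node in root.findall("./IdList/Id")]

def pmids_for_trials(trial_ids, fetch):
    """Map each trial id to its PMIDs; `fetch` returns esearch XML for a term."""
    return {tid: parse_pmids(fetch(f"{tid}[si]")) for tid in trial_ids}

# Stand-in for a real HTTP call, with made-up PMIDs for illustration.
def fake_fetch(term):
    ids = {"NCT00000001[si]": ["111", "222"]}.get(term, [])
    id_xml = "".join(f"<Id>{i}</Id>" for i in ids)
    return (
        f"<eSearchResult><Count>{len(ids)}</Count>"
        f"<IdList>{id_xml}</IdList></eSearchResult>"
    )

results = pmids_for_trials(["NCT00000001", "NCT00000002"], fake_fetch)
```

In a real run, `fetch` would request the esearch URL for the term and return the response body; a trial with no linked articles simply maps to an empty list.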

Note:
All the files (input, script and results) used in the above example are available on GitHub and can be downloaded from https://github.com/VaidhyaMegha/Connectingclinicaltrialstoresearcharticles

Open Dental Software

What is Open Dental?

“Open Dental” is dental practice management software. It enables providers and staff to schedule appointments and send reminders to patients automatically through text messages and email. Open Dental is compatible with Windows, Linux and Mac operating systems. Current versions of the software require Microsoft Windows.

Download and install the Open Dental trial version

Download the “Open Dental” software from https://www.opendental.com/trial.html and install it on the system. Launch the application and register as a new user by entering the mandatory information. Now, log in to the application using the registered credentials. The “Home” screen will be displayed as shown below:

Create new patient 

To create a new patient, click on the “Select Patient” button on the main toolbar. The “Select Patient” window will be displayed as shown below:

Now, enter either the “Last name” or the “First name” to check that the patient doesn’t already exist. The system will auto-search with the entered text and display the search results. Once the search is complete and the patient doesn’t exist, click on the “Add Pt” button (under the ‘Add New Family’ section) to add a single patient. The “Edit Patient Information” screen will be displayed as shown below with the entered text:

Enter the relevant information and save the patient record by clicking on the “OK” button.

To add multiple patients of a family, click on the “Add Many” button (under the ‘Add New Family’ section) in the “Select Patient” screen. The “Add Family” screen will be displayed as shown below:

Create new appointment 

To create or view appointments, click on the “Appts” module in the left sidebar. To add an appointment, select a patient by clicking on the “Select Patient” button in the main toolbar. Search for the patient and select the record. The appointment view will be displayed as shown in the below screenshot:

Click on the “Make Appt” button displayed on the right side. The “Edit Appointment” window will be displayed as shown in the below screenshot:

Add the procedure(s) and other information and click on the “OK” button. The appointment will be created and displayed as shown in the below screenshot:

The view of a day’s appointments is shown below:

Cancel the appointment

To cancel an appointment, right-click the appointment and select “Break appointment”. The “Break appointment” dialog will be displayed as shown below:

Click “OK” to cancel the appointment. The cancelled appointment will be displayed as shown below (highlighted in green):

FHIR API access

To access FHIR, navigate to the ‘Set up’ menu -> Advanced setup -> FHIR. A pop-up dialog will be displayed as shown below.

Generate developer key:

In the FHIR Setup pop-up dialog (shown above), click on “Generate key” and fill in the required details. The newly generated API key will then be displayed as shown above. It is authorized for read operations only.

eServices

Navigate to the ‘eservices’ menu -> Sign up. A pop-up dialog will be displayed as shown below.

Database access: MySQL

MySQL database access from DBeaver

In DBeaver, select the database type (e.g. MySQL) and provide the login credentials. The DBeaver window will then be displayed as shown in the below screenshot:

Retrieving patient(s) information (Manual method)

Expand “MYSQL 8+ -localhost” in the Database Navigator (left side). Now expand Databases, then opendental, then Tables, and double-click on the patient table. The patient table will be displayed as shown below:

Retrieving all patients information (by using SQL query)

Click on SQL Editor in the main menu and select SQL Editor. Enter the SQL query. Ex: select * from opendental.patient;

Retrieving specific patient information by using SQL query

Ex: SQL query: select * from opendental.patient where LName='PPO' and FName='Patty';

Testing FHIR API in postman

1. Getting all the patient information by using the FHIR API (GET request)

Testing environment: Open Dental hosts a test database for developers to play with FHIR. The base URL is the same for all customers and is https://api.opendental.com/fhir. The Capability Statement on this server gives a detailed, technical description of Open Dental’s FHIR capabilities. These three headers must be included in requests sent to this server:

Content-Type: application/json

Authorization: FHIRAPIKey=paste your API key

Accept: application/json 

A browser extension or other software is necessary to send request headers.
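
Besides Postman or a browser extension, a short script can attach the headers. The sketch below builds, but does not send, a GET request with the three headers using Python's standard library. The function name and the placeholder key are assumptions for illustration; a real API key has to be generated as described earlier.

```python
import urllib.request

BASE_URL = "https://api.opendental.com/fhir"

def build_fhir_request(resource, api_key):
    """Build (but do not send) a GET request carrying the three required headers."""
    return urllib.request.Request(
        f"{BASE_URL}/{resource}",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"FHIRAPIKey={api_key}",
            "Accept": "application/json",
        },
        method="GET",
    )

request = build_fhir_request("patient", "YOUR_API_KEY")  # placeholder key
# To send it, pass `request` to urllib.request.urlopen and read the JSON body.
```

Keeping the request construction separate from sending makes it easy to inspect the URL and headers before any call goes out.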

URL: https://api.opendental.com/fhir/patient

After sending a GET request to the above URL, we get the information for all patients, as shown in the image below.

2. Finding a specific appointment through a GET request

The appointment we want has ID 4. After sending a GET request to the URL below, we get the appointment information, as shown in the image below.

URL: https://api.opendental.com/fhir/appointment/4

Conclusion

As shown above, we can create new patients, and create, edit and cancel appointments using the Open Dental software. We can also work with the FHIR API and the database.

Clinicaltrialsdata.org – Part 3

In Part 2, we discussed the “Access” feature of the clinicaltrialsdata.org platform. In this article, let’s explore the “Search” feature.

The “Search” feature enables users to search for any clinical trial record by entering any text, i.e. disease, medical condition, study type, location etc. Below are the aspects of the Search feature:

  • The search engine examines all the words of the clinical trial records in each clinical trial registry to match the search text
  • Returns all clinical trial records that meet the search criteria
  • Search results can be bookmarked to quickly view and analyze later
  • Customize search results by selecting the needed database columns from the UI

Search functionality can be accomplished in 2 ways:

  • UI (User Interface)
  • API (from the command line or programmatically)

Let’s take a look at the navigation steps and screenshots which show how the search functionality can be accomplished.

1. Launch clinicaltrialsdata.org and click on the “Search” tab. The “Search” page will be displayed as shown in the below screenshot:

2. Enter any search text, say “Malaria”, and click the “SEARCH” button. Search results will be displayed as shown below:

3. Customize the search results by selecting any “Available Columns”, say trialid, Date_Of_Registration, study_type, enrollment.type, primary_outcome.measure. Search results will be displayed as shown below:

In the coming articles, we will discuss the Discover, Compare and Analyse features.

Clinicaltrialsdata.org – Part 2

In Part 1 we briefly introduced the clinicaltrialsdata.org platform, the problem it solves and the alternatives as of now. In this Part 2 article we will explore one of its features: ‘Access’.

‘Access’ essentially provides a way to retrieve trial records from a specific trial registry. There are 3 ways this might happen:

  • A specific/known trial id needs to be looked up to retrieve its record in the registry.
  • A random exploration of the list of trials from the registry.
  • Querying the trial registry records directly by filtering them against any attribute, i.e. SQL.

Like all other platform features, ‘Access’ is available through both the UI and the API (which can be used directly from the command line and/or programmatically).

Below are a few examples and screenshots:

  • Navigation

This feature is currently available for 15 registries, across 4 continents and several countries.

In the next articles we will cover other features: Search, Discover, Analyze.

Clinicaltrialsdata.org – Part 1

In this post, we will introduce the clinicaltrialsdata.org platform, the problem it solves and the alternatives as of now.

Clinical trials are executed across the globe and several trial registries, including primary registries (as classified by WHO in their registry network), exist today. If one needs to look at trials in a specific area of interest, at present the only options are searching on Google and visiting each of the registries where a trial may exist and manually curating the content from them. This process is tedious, error-prone and, not least, frustrating.

clinicaltrialsdata.org aims to collect data from all trial registries across the globe. There have been very few attempts before this; opentrials.net is one of the most recent. clinicaltrialsdata.org took a fresh look at this problem and tried to address it in a cloud-native, on-demand, API-driven manner.

clinicaltrialsdata.org aims to provide the below features:

  • access to all trials registered anywhere. ex: trials registered in clinicaltrials.gov can be accessed here.
  • full text search on registry records. ex: if you are searching for a topic, say balloon sinuplasty, try this search. (NOTE: Search results can be bookmarked to quickly look at updates.)
  • discover additional information about a trial. ex: a trial may be registered in 3 trial registries globally, where it is assigned different ids along with several other internal/external ids. Here’s an example.
  • compare two trials using their registry ids
  • analysis of trials. ex: try the geo-visualization of trial counts across countries here
  • learning from previous trials’ designs and their outcomes.

These features aim to help avoid inefficiencies in trial selection, design and execution, which might eventually lead to redundant and/or failed trials that waste money, resources, time and effort.

clinicaltrialsdata.org is part of a suite of sites focused on making open datasets from clinical research, health care and related domains readily accessible. For more details please click here.

NOTE: Please honor the data usage policies as prescribed by source registry sites.

Compression and Analytics : Querying Compressed Files – Part 1

Efficient ways of searching compressed files, that are comparable to searching uncompressed files, would benefit several applications, analytical or otherwise.

Specifically, if space savings can be achieved without adding significant costs in searching for content, then cost savings can be achieved by reducing the number of disks and thereby the number of nodes/servers needed to store the data.

Some relevant ideas (some can be applied in combination with others):

  1. Pre-process the file by reading it once and extracting metadata that can be used to quickly match against predicates, ex: partitioning, file-level indexes, min/max values of a column etc. Netezza (which compresses records after pre-processing and uses FPGAs to decompress files) and Hive (ex: using the ORC SerDe to store data in compressed columnar form) use some of these approaches on compressed files.
  2. For files with small alphabets, code compression would help. For example, 2-bit code compression (similar to bitmaps) on a genome file containing the characters A, C, T, G gives a 75% space saving (or a 25% compression ratio) compared to an original ASCII file containing the characters themselves. Both compressed and uncompressed files can be read in approximately the same time since the alphabet contains very few letters.
  3. Flip the search by compressing the search string using the compression algorithm’s specifics, i.e. translate the search string into a compressed binary format that can be quickly compared against the binary compressed content. (NOTE: this cannot be used for regular expression searches, yet could have wide applicability.) For example:
    • For code compression, code the search string itself.
    • For Huffman, pre-process to read and build the prefix-free codes trie and then code the search string.
    • For LZW, pre-process once by decompressing the file completely to rebuild the code table and then code the search string using the code table.
    • For LZ77 (used by gzip), pre-process once per sliding window and generate a separate file (perhaps a serialized symbol table with block ids as keys and a trie of repeated strings) which can later be used to check whether a search string is present in a block.
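
Ideas 2 and 3 can be sketched for the genome case in a few lines of Python. The code below packs each base into 2 bits and then searches by comparing the coded search string against the coded text, so nothing is decoded. It is an illustration under simplifying assumptions: the function names and the sample sequence are made up, and the text is coded inside the search function only to keep the example self-contained (a real file would already be stored in coded form).

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(seq):
    """Pack a string over A, C, G, T into an integer, 2 bits per character."""
    bits = 0
    for ch in seq:
        bits = (bits << 2) | CODES[ch]
    return bits

def find_coded(text, pattern):
    """Find pattern offsets by comparing 2-bit codes, without decoding."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    text_bits, pattern_bits = encode(text), encode(pattern)
    mask = (1 << (2 * m)) - 1
    hits = []
    for i in range(n - m + 1):
        # Shift so that characters i..i+m-1 sit in the low 2*m bits.
        window = (text_bits >> (2 * (n - m - i))) & mask
        if window == pattern_bits:
            hits.append(i)
    return hits

genome = "ACGTACGTTTGACA"                  # made-up sample sequence
packed_size = (2 * len(genome) + 7) // 8   # bytes needed at 2 bits per base
ascii_size = len(genome)                   # bytes needed at 8 bits per base
hits = find_coded(genome, "ACG")
```

Because every character has a fixed-width code, match positions in the coded data line up with character positions. Variable-length codes such as Huffman would need extra care with code boundaries.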

Experiment

For illustrative purposes, below is an experiment, its implementation and some results. It does not apply the above ideas; instead it tries to establish a baseline by creating an analogy for the grep and zgrep (decompress and search) tools.

  1. Implement GREP to search for all matches of a regular expression on a stream reader.
  2. Unzip the compressed .gz files and retain a copy of each original .gz file.
  3. Read the uncompressed file and GREP.
  4. Read the compressed file, decompress and stream the content to GREP.
  5. Repeat and measure the difference in time. (TO-DO NOTE: memory is another parameter to measure.)

To ensure an apples-to-apples comparison, use one language for all of the above.
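
The shape of this baseline can be sketched in Python as below. This is not the implementation referred to in this article: it uses Python's built-in `re` module rather than an NFA-based GREP, and the file contents are made up. It only shows the two read paths (plain, and decompress-then-stream) feeding the same search function, with timing around each.

```python
import gzip
import os
import re
import tempfile
import time

def grep(lines, pattern):
    """Return the lines of a text stream that match a regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

def timed_grep(path, pattern, compressed):
    """Search a plain or gzip file and return (matches, elapsed seconds)."""
    opener = gzip.open if compressed else open
    start = time.perf_counter()
    with opener(path, "rt", encoding="utf-8") as stream:
        matches = grep(stream, pattern)
    return matches, time.perf_counter() - start

def demo():
    """Write the same made-up text plain and gzipped, then search both."""
    text = "alpha\nbeta 42\ngamma\ndelta 7\n" * 1000
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "sample.txt")
        packed = plain + ".gz"
        with open(plain, "w", encoding="utf-8") as f:
            f.write(text)
        with gzip.open(packed, "wt", encoding="utf-8") as f:
            f.write(text)
        plain_hits, plain_time = timed_grep(plain, r"\d+", compressed=False)
        gz_hits, gz_time = timed_grep(packed, r"\d+", compressed=True)
    # Both paths must find the same lines; only the elapsed time differs.
    return len(plain_hits), len(gz_hits), gz_time - plain_time
```

Calling `demo()` several times and averaging the third return value gives the kind of time difference that the experiment reports.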

Implementation

Source code for the implementation is here.

  • It is based on a GREP implementation using NFAs (available here), which builds on Kleene’s theorem stating the equivalence between DFAs and regular expressions.

Results

  • For a 22K compressed file and a 225K uncompressed file and 1000 trials: average time difference (uncompressed minus compressed) is -0.002 seconds, i.e. the additional cost of decompressing and searching is 2 ms.
  • For a 147K compressed file and a 2.3M uncompressed file and 100 trials: average time difference (uncompressed minus compressed) is -0.009 seconds, i.e. the additional cost of decompressing and searching is 9 ms.
  • For a 34M compressed file and a 229M uncompressed file and 10 trials: average time difference (uncompressed minus compressed) is -1.773 seconds, i.e. the additional cost of decompressing and searching is 1.773 seconds.

Next Steps

  • Attempt implementation of Flipped Search on LZW and LZ77 and compare results.

Java 8 Streams (Parallel), Lambda expressions and Amdahl’s law observations

Java has fantastic language features, ex: Generics, Annotations and now Lambda expressions. These are complemented very well by several core capabilities packaged within the JDK, ex: collections (also the newer concurrency package) and multi-threading.

“Streams”, a feature that Java made available in version 8, can in combination with lambda expressions be extremely expressive, almost akin to functional programming (it leverages functional interfaces). It also makes writing code that leverages multiple processors for parallel execution much simpler.

Purely for illustrative purposes I thought of putting together a scenario to understand these features better.

Below is the description of the scenario:

  • 100,000 products have monthly sales over one year period.
  • Requirement is to sort products by their sales of a specific month.
  • If we were to do this on a single processor – the filter, sort and collect operations will happen sequentially with sort being the slowest step.
  • If we were to do this in a multi-processor system – we have a choice to either do them in parallel or sequentially.

(Caveat: not all operations benefit uniformly from parallel execution, ex: sorting needs specialized techniques to improve performance, while filtering should theoretically execute much faster straight away. Concurrency with List implementations, esp. ArrayList, is again a non-trivial consideration. In addition, streams add a different dimension. NOTE: Sorting in parallel, using streams or otherwise, may be unstable. Stable refers to the expectation that two equal-valued entries in a list appear in their original order after sorting as well.)

To simplify things we could break this into 4 smaller test cases:

  1. In stream execute filter and collect and then sort the list
  2. In parallel stream execute filter and collect and then sort the list
  3. In stream execute filter, sort and collect the list
  4. In parallel stream execute filter, sort and collect operations.(NOTE: don’t try this as noted above)

Timing 100 runs of each of the above 4 test cases and making observations against Amdahl’s law, which helps predict the theoretical speedup when using multiple processors, felt like a fun illustrative way to understand – streams, parallel streams, sorting, lambda expressions and multi-processor runs.

Here’s the code for it and below are results of one complete execution.

  • For case 2 vs case 1: Average Speedup = 1.2291615907722224, Amdahl’s percentage = 0.18190218936593688 and Speedup = 1.1000508269148064
  • For case 4 vs case 3: Average Speedup = 2.1987344172707037, Amdahl’s percentage = 0.8987062837645722 and Speedup = 1.8160459562383011
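
For reference, Amdahl’s law gives the theoretical speedup as 1 / ((1 - p) + p / N), where p is the fraction of the work that can run in parallel and N is the number of processors. The short Python sketch below applies the formula to the percentages reported above. The processor count of 2 is an assumption; it is not stated in the results, but it reproduces the reported “Speedup” values.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Theoretical speedup when a fraction of the work runs on N processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# Parallel fractions taken from the results; 2 processors is an assumption.
case_2_vs_1 = amdahl_speedup(0.18190218936593688, 2)  # about 1.10
case_4_vs_3 = amdahl_speedup(0.8987062837645722, 2)   # about 1.82
```

As the processor count grows, the speedup approaches the limit 1 / (1 - p), so the serial part of the work sets the ceiling.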

Distributed Database Design – Notes

  • Most distributed systems (ex: HBase, VoltDB) require coordination amongst their nodes for several/all of their tasks. Coordination can either be taken up by the system itself, or the system can give this responsibility to a separate coordination system, ex: ZooKeeper.
  • The coordination system itself needs to be a fault-tolerant, preferably very high throughput, scalable (preferably horizontally scalable) system, i.e. it is a distributed system in its own right.
  • The ZooKeeper paper (ZooKeeper: Wait-free coordination for Internet-scale systems) has these key ideas:
    • Provide wait-free base features and let clients of ZooKeeper implement higher-order primitives on top of these simpler primitives, ex: read/write locks, simple locks without herd effect, group membership, configuration management.
    • The base features are:
      1. Sequential/linearizable writes at the leader node
      2. All reads are executed locally (at the server the client is connected to)
      3. FIFO execution of client requests
      4. Optimistic locking, i.e. version-based operations
    • The underlying mechanisms for linearizable writes are:
      • A protocol for atomic broadcast: ZooKeeper defines this as guaranteed delivery of a message to all nodes (ex: a write message from the leader)
      • ZAB (ZooKeeper Atomic Broadcast) seems to be essentially a variant of the two-phase commit protocol. An article explaining it further is here.
  • VoltDB is an in-memory, fault-tolerant, ACID-compliant, horizontally scalable system. Some of its architecture choices are here. The key ideas are:
    • Single-threaded operations
    • No client-controlled transactions; instead, each stored procedure runs in a transaction.
    • Partitioning and a shared-nothing cluster (with replication for fault tolerance)
    • Leveraging/supporting deterministic operations. Instead of writing at the master and propagating changes to all nodes containing replicas, each node which has a replica executes the same transaction in parallel.
  • VoltDB is based on this paper, which tries to analyze the operations involved in OLTP systems. Its key observations are:
    • buffer management, locking, logging and latching are the costliest operations in OLTP….unless one strips out all of these components, the performance of a main memory-optimized database is unlikely to be much better than a conventional database where most of the data fit into RAM.
    • In a fully stripped down system — e.g., that is single threaded, implements recovery via copying state from other nodes in the network, fits in memory, and uses reduced functionality transactions — the performance is orders of magnitude better than an unmodified system.
  • VoltDB uses ZooKeeper for leader election. More here.
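
The “optimistic locking, i.e. version-based operations” item listed among the ZooKeeper base features can be illustrated with a small in-memory Python sketch. The class and method names are hypothetical and this is not ZooKeeper’s API; it only shows the idea that a write succeeds when the version the client read is still the current one.

```python
class VersionedStore:
    """In-memory key/value store where every write bumps a version number."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        """Return (value, version); a missing key reads as (None, 0)."""
        return self._data.get(key, (None, 0))

    def write_if_version(self, key, value, expected_version):
        """Apply the write only if nobody changed the key since it was read."""
        _, current = self.read(key)
        if current != expected_version:
            return False  # stale version: caller should re-read and retry
        self._data[key] = (value, current + 1)
        return True

store = VersionedStore()
_, v0 = store.read("config")
first = store.write_if_version("config", "a", v0)    # succeeds, version -> 1
second = store.write_if_version("config", "b", v0)   # rejected, v0 is stale
```

No lock is held between the read and the write. A conflicting writer is detected only at write time, which is what makes the scheme optimistic.
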
Some more notes on VoltDB: