CanDIG Blog

Blog posts from the CanDIG team

Simplify Your Development Workflow with Dev Environments

Son Chau

Introduction At CanDIG, we leverage Docker to streamline the setup of microservices, providing a lightweight and adaptable environment across various platforms. However, using Docker adds a layer of complexity to developers’ workflows. In this blog post, we will explore how Dev Environments can simplify development and streamline the process of creating, running, and sharing code. Figure 1: Docker Dev Environments. Development with Docker: The Challenges Developing applications using Docker can be challenging, especially when it comes to certain aspects of the development workflow. Here are...

CanDIG Variant Search

Francis Nguyen

Introduction – Genomic Variants Genomic variants are differences from one genome to the next. These variants can take multiple forms, from modifications of a single nucleotide to the deletion, insertion, copying, translocation, inversion, etc., of entire segments. While a genomic variant can technically be relative to anything, in human genetics we generally talk about genomic variants in one of two contexts: comparison of individuals or groups of individuals to a genome assembly, for example, GRCh38; comparison of the genome of tumour tissue to that of normal tissue....

Standardizing Clinical Data for Canada-wide Cancer Research

Javier Castillo-Arnemann

Challenges in creating clinical data standards When working with any kind of data, creating a model is fundamental to facilitating its querying and analysis. A data model explicitly separates the data into discrete entities, defines the relationships between them, and enables standardization when compiling data from different sources. Since a model will never capture reality fully, one must make compromises when creating a standard data model, depending on the analysis goals and research questions. Creating data models for biological processes is particularly difficult since they...

Data Pipelines - CanDIG’s Approach and Best Practices

Niharika Srivatsa

Data is key to CanDIG’s operations. CanDIG is constantly receiving data from various sources, including different clinical trials. In the past year, CanDIG received data from over seven different clinical trials. About 97% of all cancer data is recorded in Electronic Health Records (EHRs) in various institutions and cohorts. The data in these institutions and cohorts differ in data models and standardization. This poses an issue for the analysis and reporting of health-related research. A solution includes developing strong and maintainable data pipelines...

Deploy React.js Applications on DigitalOcean

Jimmy Li

Overview At CanDIG, we have developed multiple React.js-based frontend applications. For demo purposes, we wanted to deploy them on a public hosting platform. There are multiple app hosting platforms, including Amazon AWS, DigitalOcean, Google Cloud, Heroku, etc. Currently, our demo React.js applications are hosted on DigitalOcean’s App Platform. It supports the deployment of applications written in multiple languages and frameworks and allows free deployment of up to three static sites. For non-static sites, the starting cost is $5/month. Quick Walkthrough The overall deployment process is...

Applying Differential Privacy to Federated Learning

Laiba Zaman

Federated learning has many applications, from the protection of healthcare data to typing models in smartphones. Particularly in the healthcare context, it avoids the centralization of data and thus allows the data to remain with the federation client. This is especially helpful as provinces have different laws concerning the transport of data. There is one particular shortcoming to consider for federated learning. When training a model on a dataset, the model can often memorize specific data examples. This is especially a...

GA4GH Beacon API - Implementing a REST API specification in GraphQL

Ali Raza Zaidi

The CanDIG team, in their quest to enable “national scale analysis over locally-controlled data”, makes extensive use of APIs. The backbone for many of these APIs is the set of standards laid out by the Global Alliance for Genomics and Health (GA4GH). These standards tend to be RESTful, which makes adopting an additional standard like GraphQL a tall order. GraphQL is booming in popularity, attributable, in part, to its excellent data response system, which eliminates both the over-fetching and under-fetching of data by returning only the fields...

Making a GitHub Action

Daisie Huang

At CanDIG, we work with lots of data in spreadsheet form. Many of our clinicians and data specialists expect to be able to share their data in Excel spreadsheets or similar, but these formats are less easily processed by scripts or ingested into databases. For example, we often have to map datasets to internal data formats before they can be ingested into our system. I often wish that I could convince my more spreadsheet-oriented colleagues to export their work into a format like CSV...

Introduction to Figma and Design Principles

Courtney Gosselin

What is Figma? Figma is a collaborative tool built with designers in mind; it allows multiple people to work together on a design file. It can be used for many design workflows, but is primarily used in the conception or brainstorming phases as well as for high- and low-fidelity prototyping. Perhaps the most useful feature of Figma is that it allows for seamless collaboration by multiple users on a project. This helps in getting fast feedback from clients as well as fast iteration...

Federated Learning on CanDIG

Rishabh Sambare

In its journey to provide analytics of distributed, locally-controlled health data, CanDIG is naturally looking to machine learning as one of its next steps towards improving its platform. Learning in decentralized environments is a somewhat new problem, with some of the first marked progress being made in 2016 with the introduction of Federated Averaging. Since then, federated learning – the study of attempting to federate ML models – has received much attention from the ML community, with frameworks such as TensorFlow Federated, NVFlare, and Flower...

Secure Cross-service GraphQL interface

Siyue Wang

GraphQL is a query language that was released by Facebook in 2015. A lot of big companies such as Twitter, Netflix and PayPal have adopted GraphQL and released GraphQL APIs. However, most use cases for GraphQL are to connect it directly to their user databases. Since the GraphQL interface is usually deployed within one organization and connected with multiple data stores within the same organization, there are not too many security or privacy concerns. At CanDIG, our efforts are focused on developing a large-scale, federated...

Keycloak - Our Journey on How to Contribute to the Project

Dashaylan Naidoo

Keycloak Overview and Our Use Case Keycloak is an open-source identity and access management solution. Some of the features that it provides out-of-the-box are the easy deployment of application authentication, single sign-on, identity brokering from various sources like social media applications, user federation by creating a facade for LDAP, Active Directory, etc., and all of this using standard protocols like SAML, OpenID Connect (OIDC), etc. CanDIG uses Keycloak (one deployment on each site in a project) to create a uniform interface for authentication and authorization infrastructure...

Working with mCODE

Nebiyou Petros

CanDIG is building the next generation of its federated solution for the analysis of privacy-sensitive genomic data across Canada. It will enable clinical researchers from distributed sites to examine and analyze quality data, including data from cancer trials, with no need for a central infrastructure to maintain or secure. This sharing of research-quality data between numerous cancer trials and treatment centers will help generate essential information that will aid in the efficient and effective diagnosis, treatment and follow-up of health conditions. An introduction to mCODE...

The CanDIGv2 Authentication and Authorization Stack

Amanjeev Sethi

Overview CanDIG has some pretty particular authentication and authorization requirements! For authentication in our fully distributed federation, with each site self-governing, we need to accept authentication credentials, and claims about those identities, from multiple institutions. Similarly for authorization: we support a number of different data projects, and fine-grained access control within each (e.g. a researcher might be able to view somatic variants but not germline). In our federation, authorization decisions are made locally, but informed by platform-wide information. So there are multiple sources of claims of entitlements...

A Gentle Introduction to Federated and Distributed Authorization

Amanjeev Sethi

Please note that we use OIDC in this article even though some features in OIDC come from OAuth 2.0. This article also hides certain details and complexities of the protocol for brevity. The problem Say you are solving the problem of understanding a rare disease - Rarivitis. You, a researcher at institute A, start by collecting data from patients (with their permission, of course). In this problem, two facts stand out - You need a sufficient amount of data to make a judgment on a specific...

Data Federation - Design Dimensions and CanDIG’s Approach

Jonathan Dursi

CanDIG has spent the last two years hypothesizing, prototyping, and validating data federation approaches for its national efforts. For us, data federation means something fairly specific: We are interested in federating queries and analyses across the network. These are read-only operations; thus we’re not worried about consistency of updates, which is a complex problem in distributed read/write databases. In CanDIG’s case, the different sites represent different geographical regions and the data subjects are research project participants, with all of their data located at one given...

Using PySAM to access private S3 Bucket

Jackson Zheng

I was working on a project for CanDIG which was an implementation of the Htsget API — a simple data-access API for reads and variants. A key component of our application was the PySAM library, which provides an interface to genomic data. Its main use in the service was to parse a desired file and return only a chunk of that file with various filters. The most common way of accessing a file is with a local file path. However, with S3 buckets emerging as...

Federated Analysis of the Thousand Genomes Project

Jonathan Dursi

The Thousand Genomes Project was an early large-scale population genomics project which sequenced 2,504 individuals from a broad range of regions and ancestries, with aligned reads and called variants being made publicly available. The resulting populations included some now-classic figures in population genetics, such as those shown to the right. Here we describe our process of reproducing this analysis of important, public, moderately sized data sets with well-understood characteristics, but now across a partitioned, federated set of the data. Our aim is to demonstrate that...

Choosing a Container Solution for CanDIG

Jonathan Dursi

An important part of our architecture is the ability to run specified bioinformatics tasks against data sets, but this requires the ability to bundle up these tasks and distribute them across potentially quite heterogeneous sites. There are many different possibilities for such an approach: containers (Docker, rkt, LXC, Singularity, Intel Clear Containers); VMs; and application packagers (AppImage, Snappy). To weigh these options, we conducted a simple benchmark: a crude remapping pipeline taking reads from GRCh37 to GRCh38, then measuring coverage in specific exons. This benchmark allows...

Distributed privacy preserving data mining

Neelam Memon

Motivation While de-identification by removing explicit identifiers is the first step we take towards privacy protection, we also consider cases where this approach may fail to provide a sufficient level of protection, specifically when records associated with individuals contain characteristics or combinations of characteristics that are rare, if not unique, and make them identifiable even in large crowds and even when data query responses are restricted to aggregate statistical results only. CanDIG provides an additional layer of privacy...