The Architecture of CanDIG

CanDIG is fundamentally distributed

Fundamental to CanDIG is national-scale analysis over locally controlled data. Our platform is completely distributed, with no central infrastructure to maintain or secure. On top of that, researchers need to be able to readily discover, access, and analyze this information, possibly jointly across sites, while data stewards remain able to ensure the security and privacy of their data.

We do this by building on established or in-progress projects elsewhere, such as OpenID Connect and Keycloak for authentication, and the GA4GH (Global Alliance for Genomics and Health) APIs and schemas for genomic data and genomic data exchange.
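
As a rough illustration of joint analysis across sites with no central data store, the sketch below fans one count query out to several sites and combines only the answers they choose to return. Everything in it is hypothetical: the site names, the `federated_count` helper, and the stand-in query functions are assumptions for illustration, not part of CanDIG.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical: each site answers a count query against its own locally
# controlled data, or returns None when its data stewards refuse access.
SiteQuery = Callable[[str], Optional[int]]


def federated_count(sites: Dict[str, SiteQuery], variant: str) -> Dict[str, object]:
    """Ask every site for a count and combine only the answers returned."""
    per_site: Dict[str, int] = {}
    refused: List[str] = []
    for name, query in sites.items():
        answer = query(variant)
        if answer is None:
            refused.append(name)  # the site keeps control: no answer, no data
        else:
            per_site[name] = answer
    return {"total": sum(per_site.values()), "per_site": per_site, "refused": refused}


# Stand-in sites; a real deployment would call each site's API instead.
example_sites: Dict[str, SiteQuery] = {
    "site_a": lambda variant: 3,
    "site_b": lambda variant: 2,
    "site_c": lambda variant: None,  # this site declines the request
}

result = federated_count(example_sites, "chr1:12345:A>T")
print(result)
```

Only aggregate answers leave each site in this sketch; the underlying records never move, which is the property the distributed design is meant to preserve.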

API-based data access

All Data Access is API-based

In the CanDIG platform, all data access, even local, is API-based; that is, there are no processes let loose on directories of data files. This gives us several advantages:

We are making use of the GA4GH APIs for data (and metadata) access, with a thin CanDIG layer on top, which we will use for

Task execution

API accesses against any particular dataset can be simple queries (“please tell me how many individuals have this particular variant in this data set”) or longer-lived tasks, which must be scheduled and require a particular executable. For the latter we are making use of the GA4GH Task Execution Schemas and implementations such as Funnel.

Doing this requires bundling and distributing CanDIG-blessed images for fundamental bioinformatics tasks. We have examined various container and VM approaches for executing these discrete tasks. While we are proceeding with Docker containers in the short term, in the medium term we will be moving to Singularity or rkt, which give us what we need (application bundling, no system-wide root daemons) without what we don’t need in our context of unprivileged users and sandboxes (container-level isolation).
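
To make the idea of a scheduled task with a bundled executable more concrete, here is a minimal sketch of a task description in the shape of the GA4GH Task Execution Schemas (a named task, its inputs, and a list of executors that each pair a container image with a command). The image name, file URL, and the `build_tes_task` helper are assumptions for illustration; the sketch only builds the JSON document and does not contact any server.

```python
import json
from typing import Dict, List


def build_tes_task(name: str, image: str, command: List[str],
                   input_url: str, input_path: str) -> Dict[str, object]:
    """Build a task description following the TES task layout."""
    return {
        "name": name,
        # Files staged into the container before the command runs.
        "inputs": [{"url": input_url, "path": input_path, "type": "FILE"}],
        # Each executor pairs a container image with the command to run in it.
        "executors": [{"image": image, "command": command}],
    }


task = build_tes_task(
    name="variant-stats",
    image="registry.example.org/candig/bcftools:latest",   # hypothetical image
    command=["bcftools", "stats", "/data/sample.vcf.gz"],
    input_url="file:///site/storage/sample.vcf.gz",         # hypothetical location
    input_path="/data/sample.vcf.gz",
)

# This JSON body is what would be POSTed to a TES server's task-creation endpoint.
payload = json.dumps(task, indent=2)
print(payload)
```

Because the task names an image rather than a path to a locally installed tool, the same description can be run unchanged at any site that trusts that image.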


OpenID Connect

For authentication, we are following best practices for RESTful API authentication, such as OpenID Connect, using tools such as Keycloak. As the project evolves, we anticipate using UMA (User-Managed Access) for federated, role-based authorization.
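
The sketch below shows, under stated assumptions, what token-based access to a RESTful API looks like from a client's point of view: obtain an access token from the identity provider's token endpoint, then present it as a bearer token on each API call. The host names, realm, client credentials, and API path are hypothetical, and the client-credentials grant is used only because it is the shortest flow to show; the requests are built but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def token_request(base_url: str, realm: str,
                  client_id: str, client_secret: str) -> Request:
    """Build an OAuth 2.0 client-credentials token request."""
    # Assumption: Keycloak-style token endpoint layout; older Keycloak
    # releases put an extra "/auth" prefix before "/realms".
    url = f"{base_url}/realms/{realm}/protocol/openid-connect/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


def authorized_request(api_url: str, access_token: str) -> Request:
    """Attach the access token as a bearer token to an API call."""
    return Request(api_url, headers={"Authorization": f"Bearer {access_token}"})


# Hypothetical values; nothing here is sent over the network.
tok_req = token_request("https://idp.example.org", "candig", "my-client", "s3cret")
api_req = authorized_request("https://site.example.org/api/variants", "abc123")
print(tok_req.full_url)
print(api_req.get_header("Authorization"))
```

The API server never sees the user's password in this pattern; it only validates the token issued by the identity provider.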