CanDIG Blog

Blog posts from the CanDIG team

Federated Analysis of the Thousand Genomes Project

The Thousand Genomes Project was an early large-scale population genomics project that sequenced 2,504 individuals from a broad range of regions and ancestries, with aligned reads and called variants made publicly available. The resulting publications included some now-classic figures in population genetics, such as those shown to the right. Here we describe our process of reproducing this analysis of important, public, moderately sized data sets with well-understood characteristics, but now across a partitioned, federated set of the data. Our aim is to demonstrate that...

Continue...

Docker or rkt or Singularity, Oh My!

An important part of our architecture is the ability to run specified bioinformatics tasks against data sets, but this requires the ability to bundle up these tasks and distribute them across potentially quite heterogeneous sites. There are many different possibilities for such an approach: containers (Docker, rkt, LXC, Singularity, Intel Clear Containers), VMs, and application packagers (AppImage, Snappy). To weigh these options, we conducted a simple benchmark: a crude remapping pipeline that takes reads from GRCh37 to GRCh38 and then measures coverage in specific exons. This benchmark allows...

Continue...

Distributed privacy preserving data mining

Motivation: De-identification by removing explicit identifiers is the first step we take towards privacy protection, but we also consider cases where this approach may fail to provide a sufficient level of protection. This happens specifically when records associated with individuals contain characteristics, or combinations of characteristics, that are rare if not unique and make those individuals identifiable even in large crowds, and even when data query responses are restricted to aggregate statistical results. CanDIG provides an additional layer of privacy...

Continue...