CanDIG has spent the last two years hypothesizing, prototyping, and validating data federation approaches for its national efforts. For us, data federation means something fairly specific:
Here we describe five design dimensions - which interact but are useful to think about separately - across which decisions have to be made when designing federated queries: authentication, authorization, query flow, combining of results, and access to federation audit trails and logs. Design choices along each of those dimensions are both informed by, and help clarify, the foundational design issue: the trust model and accountability of the federated access.
The underlying constraint on making data available for querying and analyses across participating institutions, sites, or projects is the trust model for the federation, and thus the accountability that has to be baked into the structure. Who trusts whom to see or do what?
In an environment where no one trusts anyone to see or do anything, data federation isn’t meaningfully possible. On the other hand, if everyone implicitly trusts everyone to see and do everything, then data federation is barely necessary; researchers can readily gather everything they need, maybe with a little bit of tooling help or technical support.
Most situations where federation is being considered are necessarily somewhere in between. In low-trust environments, there may be some third party which is trusted somewhat more, and technologies like secure multi-party computation for querying, homomorphic encryption for combining results, and blockchains for low-trust audit logs can help; but the amount and resultant utility of data sharing is always going to be quite limited. In higher-trust environments, both a wider range of options and higher utility of data sharing are available, but audit logs and data access agreements take on new importance to ensure that everyone is being held accountable to the higher levels of trust that have been extended to them; “trust but verify”.
In CanDIG’s case, the participating sites are currently large institutions that are very well known to each other and which frequently participate in multi-institutional projects; the end users are researchers, typically part of groups with a long history of collaboration across the participating sites. The data is consented research data, generally collected as part of explicitly national projects, with directly identifying information (names, medical record numbers, etc.) removed. This represents a comparatively high-trust use case, although as the project expands to include a larger number of smaller organizations it’s possible that this may change over time.
The technological implementation of authentication in a data federation can be quite complicated, but from a policy and overall design point of view, AuthN and for that matter AuthZ are comparatively straightforward - or at least are the least subtle.
Ultimately the choices here for authentication - for issuing accounts and tying them to individuals - are some combination of:

- centralized, with accounts issued and maintained in a central registry trusted by all participating sites;
- decentralized and case-by-case, with each site explicitly granting accounts to specific external users upon request and vetting;
- decentralized and by-institution, with each site accepting its peers’ existing institutional credentials as authentication.
(Not listed here is authentication by institution: external “service accounts” for requests from institution B to query institution A. This is one of the few places where we are inclined to be prescriptive: if the end users in your model are individual users such as researchers or clinicians, accounts must be issued to those specific individuals, who are then accountable for their queries. Obscuring the source of the queries through institutional service accounts reduces accountability and ultimately erodes trust.)
The centralized approach can work well, particularly in a low-trust environment, if there is some way to set up a central registry of users that each institution trusts more than it trusts its peers; but if such a central authentication registry doesn’t already exist, it has the downside of requiring one to be created and maintained, and the question of who decides which users get added, and when they are removed, remains.
Decentralized and case-by-case, with (e.g.) institution A granting accounts to specific users from institution B upon request and vetting, is labour-intensive but has the advantage of being extremely explicit; at any given time institution A has a list of exactly which users are permitted to authenticate into their system. It does require additional process development, though, around how external accounts are to be renewed and/or revoked, which will likely still require some coordination between institutions.
Decentralized and by-institution makes use of the existing credentials at each site, which are accepted as authentication by the peer sites (but not necessarily authorized to see or do anything in particular). While this relies on trusting the other sites to authenticate properly and to maintain good security practices around their authentication (for instance, good identity proofing, revoking accounts quickly once a user leaves the institution or project, adequate security controls), in the highly regulated areas where data federation is most needed (health data, financial data) these requirements are almost automatically met. This is the approach taken by CanDIG (Fig 1) where CanDIG users (not all or even particularly many people at the institution, necessarily) at each institution use their institutional credentials locally to get an OpenID Connect token that is used for authentication purposes at all peer sites.
Related but different is authorization - determining which users, regardless of how they were authenticated, are allowed to make queries or perform analyses at each data site. This too can be decentralized or centralized, although in most data federations we are aware of, participating sites are (understandably) quite reluctant to outsource authorization decisions to a third party, and authorization decisions are ultimately made locally. This is certainly the model within CanDIG.
However, it is often the case that the locally-made authorization decisions are informed by additional, external information - some list of who is participating in the federation project or some subset of it - and that list can be held centrally or in a decentralized fashion.
If there is already a central authentication list, such information is most naturally held there (explicitly or implicitly - it may be the case that anyone on the central user list is implicitly authorized to make certain queries.) Alternately, in a high-trust environment that information could be maintained by each of the sites for their users (again, implicitly or explicitly).
CanDIG has a quite distributed authorization model. Much of the data supported by the CanDIG platform belongs to multiple national projects, each governed by its own data access committee, so the sites must take those committees’ decisions into account when making authorization decisions. Currently those lists are transmitted to the sites “out of band”, and lookups are local; we are moving to a more streamlined model where data access committee portals provide that information at query time based on the identity of the requestor.
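The local-decision-with-external-input pattern can be sketched as follows. This is not the CanDIG API - the names and data structures are assumptions for illustration - but it captures the shape: the site makes the final call, consulting a data-access-committee (DAC) approval list for project-governed datasets alongside its own purely local grants.

```python
# DAC-provided approvals, delivered out of band or fetched at query time:
# dataset -> set of approved user identities. (Names are illustrative.)
dac_approvals = {
    "national-project-1": {"alice@site-a.example", "bob@site-b.example"},
}

# Purely local grants, for data not governed by any DAC.
local_grants = {
    "site-internal-cohort": {"carol@site-a.example"},
}

def authorize(user: str, dataset: str) -> bool:
    """The site's local decision: allow if either the DAC has approved
    this user for the dataset, or the site itself has granted access."""
    return (user in dac_approvals.get(dataset, set())
            or user in local_grants.get(dataset, set()))
```

Replacing the `dac_approvals` table with a query-time call to a DAC portal changes where the list lives, but not who makes the decision: that stays with the site.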
Related to what data a user is authorized to see is the question of how much. Data sets that a user does not have row-level authorization for might still be queryable for aggregated results, or usable for computations such as training models. In CanDIG, we have been building out infrastructure since the beginning of the project to authorize differentially-private aggregations of data, allowing data custodians to make some data sets accessible for calculations without necessarily exposing the data directly to researchers.
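As a minimal illustration of the kind of aggregation involved, the Laplace mechanism answers a counting query with calibrated noise: a count has sensitivity 1, so noise drawn from a Laplace distribution with scale 1/ε gives ε-differential privacy. The sketch below is a textbook version, not CanDIG’s implementation.

```python
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon yields epsilon-differential privacy. The difference of two
    i.i.d. exponential draws is a Laplace(0, scale) sample.
    """
    scale = 1.0 / epsilon
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise
```

Smaller ε means more noise and stronger privacy; the custodian’s job is choosing ε so the released aggregate is still useful.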
Once the request can be authenticated and a process is in place for authorizing the request, the next design question is how data - and in particular the initial query - should be communicated from the user to the participating sites.
There are four primary approaches to distributing the query:

- centralized hub-and-spoke, where a central service receives the query and relays it to each participating site;
- peer-to-peer, where the site the user submits the query to forwards it directly to each of its peers;
- mesh, where any site can accept a query and route it onwards across the federation;
- cycles, where the query is passed from site to site around a ring, accumulating results as it goes.
In the CanDIG model, which is committed for a variety of reasons to avoiding any centralized infrastructure, we take the peer-to-peer approach; with currently three and soon five-or-six sites the topology is largely irrelevant, but we’d like to move towards more of a mesh model. In the early days we experimented with the cycles model, which worked well enough for our needs but was slightly more complicated, and we didn’t need its advantage of obscuring partial results (which we’ll talk about more in the next section).
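The peer-to-peer fan-out is simple to sketch: the user’s home site sends the same authenticated query to every peer concurrently and collects per-site partial results for later aggregation. The site names, counts, and `query_site` stub below are stand-ins for real HTTP calls to peer endpoints.

```python
from concurrent.futures import ThreadPoolExecutor

def query_site(site: str, query: dict) -> dict:
    # Stand-in for an authenticated HTTP call to a peer site's query
    # endpoint; the counts here are invented illustrative data.
    fake_counts = {"site-a": 12, "site-b": 7, "site-c": 3}
    return {"site": site, "count": fake_counts[site]}

def fan_out(sites: list[str], query: dict) -> list[dict]:
    """Send the same query to all peers concurrently and collect the
    per-site partial results for later aggregation."""
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        return list(pool.map(lambda s: query_site(s, query), sites))
```

With only a handful of sites this flat fan-out is entirely adequate, which is why topology matters so little at our current scale.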
After the authenticated, authorized query successfully reaches the sites, the results have to be generated and aggregated.
How that can be done comes down to several questions about our trust model:

- are the participating sites permitted to see each other’s partial results?
- is the researcher permitted to see which result came from which site?
- is any one site (such as a central hub) trusted more, or less, than the others?
Note that there’s something of a tradeoff here in the privacy of results; the obvious methods for preventing the participating sites from seeing each other’s partial results mean that the researcher sees quite explicitly which result came from which site. That exposes some information to the researcher that isn’t strictly necessary to answer the query, and requires some work on the client side.
There is also a functionality tradeoff; the stronger privacy-preserving methods (secure multi-party computation and homomorphic encryption) both require carefully crafted approaches to the particular sort of query or calculation in question, so adding a new query type or data type requires a significant amount of work, and sometimes progress on open research questions; there’s no approach that works well with arbitrary queries.
Finally, in a hub-and-spokes model of data flow, the central site may have a higher or lower level of trust with each of the participating sites than the sites have with each other. Depending on which, this can be exploited, or must be mitigated, when designing the result-combination approach.
In CanDIG’s case, (a) the sites trust each other strongly, (b) each user has an affiliation with a site at which the query results are gathered, and (c) authorizations are tied to institutional affiliation, suggesting the institution is at least as trusted as the user, so we currently allow the results they are authorized to see to be aggregated “in the clear” at the home site. We do make efforts to ensure through API design that as little unnecessary information is exposed as possible. In the early days we experimented with homomorphic encryption for simple aggregations, and may revisit such privacy-enhancing approaches to aggregation as the project grows.
Further, we use local differential privacy for sensitive data sets. Since in our model it is difficult to distinguish the trust placed in the institutions from that placed in the researchers, global differential privacy - combining the results at the user’s institution, then applying noise, then returning the results - hasn’t been used, even though it can yield higher result utility per unit of privacy. Moreover, applying differential privacy at each site separately allows each site to apply its own privacy policies to determine the level of differential-privacy “noise”, which is a very useful capability in our very privacy-heterogeneous case.
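The local model can be sketched as each site adding Laplace noise under its own ε before its partial count ever leaves the site, with the home site simply summing the already-noisy values. The per-site ε values and counts below are invented for illustration; the point is that the privacy parameter is a per-site policy choice, not a federation-wide one.

```python
import random

def local_dp_count(count: int, epsilon: float) -> float:
    # Laplace mechanism for a sensitivity-1 count, sampled as the
    # difference of two exponential draws with scale 1/epsilon.
    scale = 1.0 / epsilon
    return count + random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

# Each site applies its own privacy policy (its own epsilon) before the
# noisy partial count leaves the site. Values here are illustrative.
site_policies = {"site-a": 0.5, "site-b": 1.0, "site-c": 2.0}
site_counts = {"site-a": 12, "site-b": 7, "site-c": 3}

def federated_noisy_total() -> float:
    """The home site just sums the already-noisy per-site counts."""
    return sum(local_dp_count(site_counts[s], site_policies[s])
               for s in site_policies)
```

The cost is visible in the variance: noise from three sites accumulates, which is exactly the utility-per-unit-privacy penalty of local versus global differential privacy noted above.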
Finally, to understand access patterns across the federation it is useful to have the capacity to access audit logs across the federation rather than simply having a single local window of logging available. In CanDIG, data custodians of data sets that are distributed across sites want to be able to see at a national level how and by whom the data entrusted to them is being used; site operators want to be able to see global access patterns for the purposes of intrusion/suspicious behaviour detection; and federation operators will find such information useful for capacity and capability planning.
The difficulty here is that, unless carefully handled, system logs are arguably more sensitive than the health data itself: if too much is exposed, they could compromise the security of one or more systems and thus jeopardize the security and privacy of significant amounts of the health data.
As with authentication and authorization, the choice is along a spectrum of centralized to decentralized.
In CanDIG, our operations teams work together quite closely, and so distributed logs plus rapid communication between staff at the sites serve as our main audit log model. As data volumes and numbers of sites increase, this is unlikely to remain feasible. Since each CanDIG site is running the same software stack, we can collaboratively determine what is safe to put into logs, and we would like to move towards a model where the logs are just another distributed data set that can be queried, but this remains some time in the future.
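The "logs as another distributed data set" idea can be sketched as a minimal, deliberately sparse audit record plus a cross-site filter. The field names and record shape here are assumptions, not the CanDIG log format; the design point is recording who accessed which dataset, where and when, while keeping query parameters and internal details - the parts that could themselves leak sensitive information - out of the shared record.

```python
import json

def audit_record(user: str, site: str, dataset: str,
                 action: str, ts: str) -> str:
    # A deliberately minimal record: enough for custodians and site
    # operators to see who queried what, without logging anything
    # (parameter values, internal paths) that could itself leak data.
    return json.dumps({"user": user, "site": site, "dataset": dataset,
                       "action": action, "time": ts})

def accesses_to(dataset: str, site_logs: dict[str, list[str]]) -> list[dict]:
    """Treat per-site logs as one distributed data set: gather every
    access to a given dataset across all sites for a custodian's view."""
    hits = []
    for site, lines in site_logs.items():
        for line in lines:
            rec = json.loads(line)
            if rec["dataset"] == dataset:
                hits.append(rec)
    return hits
```

The same federated-query machinery used for research data could then answer custodians’ "who used my data, nationally?" questions, with each site still controlling what its logs expose.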
We’ve presented here the CanDIG model of data federation, in the context of five useful dimensions for considering how to design other data federations: authentication, authorization, query flow, combining of results, and access to federation audit trails and logs.
Design decisions along each of those dimensions both stem from and help clarify the underlying trust model of the data federation - how much sites trust each other, how much data they are willing to let each other see, and the trust and authorization relationships between the users and the sites. Working through possibilities along these dimensions is a useful way to design, plan for, or even just assess the feasibility of a data federation project.