Data Sharing Taskforce - Minimum Cohort Sizes
Ian Oppermann
Commonwealth Data Standards Chair, Co-founder ServiceGen, Industry Professor, UTS
Following the release of the ACS Data Sharing Frameworks Technical Whitepaper, the Data Sharing Taskforce has kicked off again in 2018 to explore the challenges of developing trust-preserving data sharing frameworks. The March workshop focussed on "minimum identifiable cohort sizes".
Personal data covers a very wide field and is described differently in different jurisdictions. In the NSW context[1]:
“… personal information means information or an opinion (including information or an opinion forming part of a database and whether or not recorded in a material form) about an individual whose identity is apparent or can reasonably be ascertained from the information or opinion”.
The definition is very broad and, in principle, covers any information that relates to an identifiable individual, living or within 30 years of death.
The ambiguity about the presence of personal information in sets of data highlights the limitations of most existing privacy regulatory frameworks. The inability of human judgment to determine “reasonable” likelihood of reidentification when faced with large numbers of complex data sets limits the ability to appropriately apply the regulatory test.
Development of standards around what constitutes “anonymised” would help to address the challenges of dealing with privacy. In all parts of the world, there is currently only very high-level guidance, and certainly nothing quantitative, as to what “anonymised” means, hence many organisations must determine what “anonymised” means to them based on different data sets.
Technology can potentially play a role in addressing this challenge, but agreeing and then communicating what an acceptable degree of anonymisation is, and how to achieve it in quantitative terms, would also greatly improve data sharing. This clarification of existing legal frameworks needs to include quantified descriptions of acceptable levels of risk in ways which are meaningful for modern data analytics.
2. DATA SHARING FRAMEWORKS
Australian Data Sharing Framework Development
In September 2017, the Australian Computer Society (ACS) released a technical whitepaper which explored the challenges of data sharing[2]. This paper was the culmination of more than 18 months’ work by a taskforce which included ACS, the NSW Data Analytics Centre (DAC), Standards Australia, the office of the NSW Privacy Commissioner, the NSW Information Commissioner, the Federal Government’s Digital Transformation Office (DTO), CSIRO, Data61, the Department of Prime Minister and Cabinet, the Australian Institute of Health and Welfare (AIHW), SN-NT DataLink, South Australian Government, Victorian Government, West Australian Government, Queensland Government, Gilbert and Tobin, the Communications Alliance, the Internet of Things Alliance, and a number of interested companies.
Modified “Five Safes” Framework
The whitepaper introduced a number of conceptual frameworks for practical data sharing including an adapted version of the “Five Safes” framework[3]. A number of organisations around the world including the Australian Bureau of Statistics use the Five Safes model to help make decisions about effective use of data which is confidential or sensitive. The dimensions of the framework are:
Safe People – refers to the knowledge, skills, and incentives of the users to store and use the data appropriately. In this context, ‘appropriately’ means ‘in accordance with the required standards of behaviour’, rather than level of statistical skill. In practice, a basic technical ability is often necessary to understand training or restrictions and avoid inadvertent breaches of confidentiality; an inability to analyse data may lead to frustration and increase incentives to ‘share’ access with unauthorised people.
Safe Projects – refers to the legal, moral, and ethical considerations surrounding use of the data. This is often specified in regulations or legislation, typically allowing but limiting data use to some form of ‘valid statistical purpose’, and with appropriate ‘public benefit’. ‘Grey’ areas might exist when ‘exploitation of data’ may be acceptable if an overall ‘public good’ is realised.
Safe Setting – refers to the practical controls on the way the data is accessed. At one extreme, researchers may be restricted to using the data in a supervised physical location. At the other extreme, there are no restrictions on data downloaded from the internet. Safe Settings encompass both the physical environment (such as network access) and procedural arrangements such as supervision and auditing regimes.
Safe Data – refers primarily to the potential for identification in the data. It could also refer to the sensitivity of the data itself.
Safe Outputs – refers to the residual risk in publications from sensitive data.
The Five Safes model is relatively easy to conceptualise when considering the extreme case of an ‘extremely Safe’ environment, although it does not unambiguously define what this is. An extremely Safe environment may involve researchers who have had background checks, projects which have ethics approval, and rigorous vetting of outcomes. Best practice may be established for such frameworks, but none of these measures can be described in unambiguous terms as they all involve judgement.
The adapted model explores different, quantifiable levels of “Safe” for each of People, Projects, Setting, Data and Outputs as well as how these different “Safe” levels could interact in different situations. Figure 1 shows the dimensions of the adapted “Five Safes” model taken from the ACS Technical whitepaper.
Figure 1. Adapted Five Safes Model
Personal Information and Aggregation
Personal information (often also called personally identifiable information (PII) or personal data) covers a very broad range of information about individuals. In principle, it covers any information that relates to an identifiable individual (living or within 30 years of death), where identifiability is determined not only by reference to the information itself but also having regard to other information that is reasonably available to any entity that holds relevant information.
The test for personal information relates to the situation where an individual’s identity can “… reasonably be ascertained from the information or opinion”. The ACS Technical whitepaper uses the concept of a Personal Information Factor (PIF) to describe the level of personal information in a data set or outcome, as shown in Figure 2. A PIF of 1.0 means personal information exists; a value of 0.0 means there is no personal information.
Aggregation is often used to protect individual identity by ensuring outputs are not released for cohorts smaller than “N”. In principle, if (N-1) other data sets can be found which relate to the cohort of interest, then the cohort of size N can be decomposed into identifiable individuals. As the aggregation level increases (cohort sizes of N², N³ and so on for N > 1), the level of protection increases, as more related data sets are needed to identify an individual within the cohort.
The definition of PIF is still to be robustly determined; however, the working definition is an upper bound, defined within a closed, linked, de-identified data set, as:

PIF < 10^(-log10(Minimum Identifiable Cohort Size) - Epsilon)
The Minimum Identifiable Cohort Size is the smallest group within a data set that can be identified from the available features. For example, in one data set, there may be 100 males without beards, born in NSW. If an additional feature is included (those under 18), this number may reduce to 10. In this example, the Minimum Identifiable Cohort Size is at most 10.
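To make this concrete, the sketch below (not from the whitepaper; the records and feature names are invented purely for illustration) shows one way a Minimum Identifiable Cohort Size could be computed for a chosen set of features in a de-identified data set, and how adding a feature can only hold or shrink the smallest cohort.

```python
# Illustrative sketch only: synthetic records and hypothetical feature names.
from collections import Counter

# Each dict is one de-identified unit record.
records = [
    {"sex": "M", "beard": False, "birth_state": "NSW", "under_18": False},
    {"sex": "M", "beard": False, "birth_state": "NSW", "under_18": True},
    {"sex": "F", "beard": False, "birth_state": "VIC", "under_18": True},
    {"sex": "F", "beard": False, "birth_state": "VIC", "under_18": True},
]

def minimum_identifiable_cohort_size(records, features):
    """Size of the smallest cohort sharing the same values for `features`."""
    counts = Counter(tuple(r[f] for f in features) for r in records)
    return min(counts.values())

# Adding a feature partitions existing cohorts further, so the minimum
# cohort size can only stay the same or decrease.
print(minimum_identifiable_cohort_size(records, ["sex", "beard", "birth_state"]))               # 2
print(minimum_identifiable_cohort_size(records, ["sex", "beard", "birth_state", "under_18"]))   # 1
```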
For a Minimum Identifiable Cohort Size of:
- 1, the PIF is less than 1.0
- 2, the PIF is less than 0.5
- 5, the PIF is less than 0.2
- 10, the PIF is less than 0.1
- 100, the PIF is less than 0.01
As new data sets are added to an existing closed, linked data set, new features are potentially identified. As a consequence, the Minimum Identifiable Cohort Size will potentially reduce, leading to higher PIF values.
The notion of a “bound” is important, as having a cohort size of 1 is not always the same as having personal information. The term “Epsilon” in the PIF calculation is intended to reflect the fact that, with de-identified data, even at a Minimum Identifiable Cohort Size of 1, at least one additional data field is required to map to the identified individual.
In the example above of a defined anonymised cohort, knowing there is only one male member does not provide sufficient information to identify that individual. Depending on the exact circumstances, it is possible to imagine additional data which would allow identification. Similarly, if there were 2 males in the cohort, it is possible to imagine several additional data sets which would allow individual identification. The same reasoning applies for 5 or 10 males in a defined cohort. The PIF is therefore treated as an upper bound rather than an exact value.
The quantification of Epsilon is still to be determined and will be contextual. For an Epsilon of 0.01 and a Minimum Identifiable Cohort Size of:
- 1, the PIF is less than 0.98
- 2, the PIF is less than 0.49
- 5, the PIF is less than 0.20
- 10, the PIF is less than 0.10
- 100, the PIF is less than 0.01
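These bounds follow directly from the working formula above. As a sanity check, the short sketch below (illustrative only; the function and variable names are not from the whitepaper) reproduces the values listed for an Epsilon of 0 and of 0.01.

```python
import math

def pif_upper_bound(min_cohort_size, epsilon=0.0):
    """Working upper bound on the Personal Information Factor:
    PIF < 10 ** (-log10(min_cohort_size) - epsilon)."""
    return 10 ** (-math.log10(min_cohort_size) - epsilon)

for n in (1, 2, 5, 10, 100):
    print(n, round(pif_upper_bound(n), 2), round(pif_upper_bound(n, epsilon=0.01), 2))
# 1    1.0   0.98
# 2    0.5   0.49
# 5    0.2   0.2
# 10   0.1   0.1
# 100  0.01  0.01
```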
Figure 2. Personal Information Factor and Aggregation Level
When Is Personal Information Revealed?
In this paper, a distinction is made between the level of personal information in three contexts (see Figure 3):
· when linked and analysed in an analytical environment (Insights and Models level),
· when considering outputs at different stages in a project which are seen by an observer (Personal context level) and
· when outputs are made available to the wider world and may be linked to data sets in the wider world (Real world context level).
Figure 3. Context for determining the degree of personal information
In the lowest level in Figure 3 (Insights and models), it is possible to link anonymised data sets and ensure the PIF does not reach 1.0 by mathematically exploring the feature sets which determine the Minimum Identifiable Cohort Size. If the smallest identifiable cohort has size N > 1, then the PIF is less than 1.0, meaning more independent data would be needed to reach a PIF of 1.0 (personal information).
When working with anonymised data, a Minimum Identifiable Cohort Size of 1 does not explicitly imply a PIF of 1.0 (personal identification). As discussed above, a de-identified data set with a Minimum Identifiable Cohort Size of 1 may still require additional data to map to an individual. In the closed analytical environment (Insights and models level), this additional data is not necessarily available.
In the next level of this model (Personal context), any observer who views results brings to that observation their own experience, knowledge, and perspective. It is at this point that the “reasonable” test is truly applied. At this stage, it is impossible to know the total range of interactions between the PIF developed by the linking and analytical processes and the additional information brought by the Personal context of the observer. The risk mitigation required when revealing outputs at different stages of the project depends on the level of “Safety” of the observer in the context of the other Safe dimensions of the project (setting, data, and output). This is discussed further in the next section.
In the final level of the model (Real world context), any observer who views the results not only brings their own knowledge and experience, but also has access to a wide range of other data sets to potentially link to the project outcomes. The level of protection via Minimum Identifiable Cohort size becomes increasingly important.
With Whom Can Data and Outputs be Shared?
In this paper, a distinction is made between concerns about project findings and privacy. A project may produce results which are challenging; however, unless there is an issue of privacy, these concerns are not considered here. This paper also acknowledges that outcomes are produced at multiple stages in a project rather than only at completion. This section therefore deals with Safe People and Safe Projects. The level of “Safeness” of people relates to the level of pre-qualification for inclusion in the project – from deep involvement to no vetting at all. The level of “Safeness” of a project relates to the level of PIF involved in the project – from “very Safe” with a PIF of 0.0 to “Not Safe” at a level close to 1.0. The term “Not Safe” is used simply to reflect a scale which has “very Safe” at one end.
As results from the different stages of an analytics project are produced, the PIF associated with the linking of data sets potentially increases, and so the risk associated with sharing those results also increases.
Figure 4 shows an example of how Safe Settings may be established for combinations of different levels of Safety for People and Projects. In this example, People considered to be ‘UnSafe’ (or unevaluated) only gain access to data which is publicly available. If open data is the only data used, it is impossible to overlay governance on a project. Projects which are evaluated as ‘Not Safe’ (PIF of exactly 1.0) are excluded from this example as they require individual evaluation.
Figure 4. Safe Settings for a combination of "Projects" and “People”
Whilst technology cannot be considered to be the complete answer to Safe Setting, it can help mitigate risks for different levels of ‘Safe’. Examples of systems which provide Safe Setting at different levels already exist. The challenge with many of these current frameworks is that they are not particularly well suited to widespread, automated data sharing.
As an example, the SURE framework is a long-established framework which enables a researcher to access sensitive data. Authorised researchers working on approved projects operate on data within a constrained environment. Researchers can perform operations over unit record level data and cannot on-share data. Whilst addressing the needs of individual researchers, the system is not well suited to wide ranging collaboration in its current form.
At the other extreme, systems such as data.gov.au provide examples of data sharing mechanisms for open data. While appropriate for the release of raw data, particularly from government agencies, such systems remain limited from the perspective of wide-ranging collaboration.
An area which is actively being developed is technology which allows computational operations to be performed where the data is stored, returning the answer to a query without providing access to the underlying data. The anonymised computations can be distributed, performing calculations over multiple data sources at multiple sites, and still returning just the computed outcome. These approaches are well advanced and, while there will be a significant additional ICT burden associated with this approach, it may significantly lower the privacy and legal concerns associated with use of data, and so reduce governance requirements.
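As a rough illustration of this “query, not data” pattern, the sketch below describes an assumed interface (it is not a description of any existing system such as SURE or data.gov.au): an aggregate query is answered at the place the data is held, and the answer is suppressed when the contributing cohort falls below an agreed minimum size.

```python
# Illustrative sketch of an assumed query-answering service; the threshold,
# record layout, and function names are invented for this example.
MIN_COHORT_SIZE = 10  # assumed policy threshold for release

def answer_query(records, predicate, aggregate):
    """Run an aggregate over the records matching `predicate`.
    Returns None (suppressed) if the matching cohort is too small."""
    cohort = [r for r in records if predicate(r)]
    if len(cohort) < MIN_COHORT_SIZE:
        return None  # answer withheld: cohort smaller than the agreed minimum
    return aggregate(cohort)

# Example: the requester receives only the average age, never the unit records.
records = [{"age": 20 + i, "postcode": "2000"} for i in range(40)]
mean_age = answer_query(
    records,
    predicate=lambda r: r["postcode"] == "2000",
    aggregate=lambda cohort: sum(r["age"] for r in cohort) / len(cohort),
)
print(mean_age)  # 39.5
```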
Dealing with Mixed “Safe” Levels
One of the fundamental principles underpinning data sharing is addressing the challenge of value, risk, and trust. This can change as a data analysis project (the simplest case being data sharing) develops through the major phases of:
· Project scoping (including identification of people)
· Data collection, organisation, and curation
· Data analysis
· Results interpretation
· Release of results.
As each of these phases progresses, the ‘value’ of the outcomes increases, and the potential risk may also increase. An important consideration is that projects which involve any element of discovery need periodic review depending on the level of risk which is assessed at each of the major project phases. Identification of the impact on privacy or the ethical considerations of a project will depend on what is identified, and this may not be known at the outset.
A more flexible approach to data analysis projects may allow light touch up-front assessment of privacy impact, people, and technology, and increase the frequency or intensity of these assessments as the project continues.
A summary of possible guidelines is given in Figure 5. Figure 6 attempts to map the major data analysis project phases to the risk mitigation focus for each dimension of the “Five Safes” model. The benefit of a multistage assessment for privacy and ethics is that it is no longer necessary to preconceive, at the outset of the project, all of the issues or risks which may arise during analysis.
Figure 5. Ethics, Privacy Impact, Technology, and People Assessments for Different Risk Levels
Figure 6. Mapping to the Five Safes Framework
3. POSITIONING PROJECTS IN THE DATA SHARING FRAMEWORK
Safe People and access to Safe Outputs
The recommendation is to adopt the data sharing frameworks described in this paper and the ACS Technical Whitepaper to allow the project to progress and to support practical data sharing.
Within the scope of a project, the major factors to consider are:
· the (potentially) increasing PIF at each stage of the project,
· the people who can access the outcomes at each stage of the project and at what level of aggregation.
Following the flow of logic in Figure 4, Figure 7 shows the relevant squares highlighted for different levels of “Safe” for observers of the project:
Researcher / Research Supervisor (Highly Safe People):
- Police check and Working with Children Check[4]
- Qualified data analytics skills
- Higher technical degree or working under supervision of Higher technical degree
- Named access on data sharing MOU
- Completion of deed of access
- Access to de-identified, linked, unit record data
- Access to results at de-identified, linked, unit record level
Partner Agency Project Reviewer (Safe People):
- Police check and Working with Children Check
- Qualified data analytics skills
- Named access on data sharing MOU
- Knowledge of data at dictionary level
- No access to de-identified, linked, unit record data
- Access to aggregated results at cohort level
Agency Partner (Moderately Safe People):
- Working with Children Check
- Named access on data sharing MOU
- Knowledge of data at dictionary level
- No access to de-identified, linked, unit record data
- Access to aggregated results at cohort level
Affiliated Agency (Low Level of Safety):
- Not named access on data sharing MOU
- No access to de-identified, linked, unit record data
- Delayed access to aggregated results at report level
General Audience (No screening done so "Not Safe People"):
- No security checks
- Not named on data sharing MOU
- No access to de-identified, linked, unit record data
- Access to aggregated results at trend level
Figure 7. Safe people and Safe projects for a Project
Recommended Aggregation Levels for Outcomes
The level of aggregation depends on the “Safe” level of the people involved:
Researcher / Research Supervisor (Highly Safe People):
- Access to results at de-identified, linked, unit record level
- Minimum Identifiable Cohort size = 1
- PIF < 1.0
Partner Agency Project Reviewer (Safe People):
- Access to aggregated results at cohort level
- Minimum Identifiable Cohort size = N
- PIF < 1/N
Agency Partner (Moderately Safe People):
- Access to aggregated results at report level
- Minimum Identifiable Cohort size = N*N
- PIF < 1/(N*N)
Affiliated Agency (Low Level of Safety):
- Delayed access to aggregated results at report level
- Minimum Identifiable Cohort size = N*N
- PIF < 1/(N*N)
General Audience (Unscreened so Not Safe People):
- Access to aggregated results at trend level
- Minimum Identifiable Cohort size = N*N*N
- PIF < 1/(N*N*N)
In practice, the access provided to the General Audience would be a different type of document from that released to Agency Partners. In this context “Not Safe People” refers to no vetting process or assumed analytical skills.
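The aggregation rules above can be expressed as a simple release policy. The sketch below is illustrative only: the observer categories mirror the list above, and the base aggregation level N is assumed to be 10 purely for the example (the paper leaves N contextual).

```python
# Illustrative policy sketch; N is assumed, not specified by the paper.
N = 10  # assumed base aggregation level

REQUIRED_MIN_COHORT = {
    "researcher": 1,                # unit record level
    "partner_reviewer": N,          # cohort level
    "agency_partner": N * N,        # report level
    "affiliated_agency": N * N,     # report level (delayed release)
    "general_audience": N * N * N,  # trend level
}

def may_release(observer, min_identifiable_cohort_size):
    """True if an output with this Minimum Identifiable Cohort Size may be
    shown to the given observer category under the policy sketched above."""
    return min_identifiable_cohort_size >= REQUIRED_MIN_COHORT[observer]

print(may_release("general_audience", 500))  # False: needs at least N**3 = 1000
print(may_release("partner_reviewer", 25))   # True: needs at least N = 10
```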
Project Risk Management based on Outputs
In this discussion, the consideration of personal information is separated from that of concerns about the significance or interpretations of outcomes. The recommendation is to adopt the data sharing frameworks described in this paper and the ACS Technical Whitepaper to allow the project to progress and to support practical data sharing.
Within the scope of a project, the major factor to consider is:
· whether the PIF has approached 1.0 for any of the outputs of the different stages of the project.
Following the flow of logic in Figure 5, Figure 8 shows the relevant squares highlighted for different “stages” of the project if the project works with and reports on anonymised data with a Minimum Identifiable Cohort Size of 1.
To reduce the risk at each stage of the project, the Minimum Identifiable Cohort Size can be increased before outputs are released, as shown in Figure 9.
Figure 8. Project Risk Profile for small Minimum Identifiable Cohort size
Figure 9. Risk Reduction in Project based on increasing Minimum Identifiable Cohort size
[1] See Privacy and Personal Information Protection Act (1998) No 133. Current version for 1 April 2016 to date (accessed 19 February 2017 at 15:17) https://www.legislation.nsw.gov.au/#/view/act/1998/133/part1/sec4
[2] See ACS website, available online https://www.acs.org.au/content/dam/acs/acs-publications/ACS_Data-Sharing-Frameworks_FINAL_FA_SINGLE_LR.pdf
[3] Five Safes: designing data access for research, T. Desai, F. Ritchie, R. Welpton, October 2016, https://www.nss.gov.au/nss/home.NSF/533222ebfd5ac03aca25711000044c9e/b691218a6fd3e55fca257af700076681/$FILE/The%20Five%20Safes%20Framework.%20ABS.pdf
[4] See office of the Children’s Guardian https://www.kidsguardian.nsw.gov.au/child-Safe-organisations/working-with-children-check