Introducing a "Data Safety Factor"- Data Sharing Taskforce Update #15
Ian Oppermann
Commonwealth Data Standards Chair, Co-founder ServiceGen, Industry Professor, UTS
This update refines the developing view of how to quantify "Safe Data" and introduces a Data Safety Factor for consideration. It reworks some older ground, but also tries to get specific about quantifying two of the most important aspects of the Five Safes: Safe Data and Safe Outputs.
1: DATA SHARING FRAMEWORKS
Sharing data relating to individuals safely, storing it securely and ensuring it is only accessed and used by approved users is a global challenge. Data that has been de-identified by replacing personal identifiers with a linkage code poses particular challenges: the analytics environment built on that de-identification must be managed prudently to protect against re-identification, and the perimeters of that environment must be reliably and verifiably effective.
Sharing large quantities of people-centred data to create smart services requires robust data sharing frameworks that preserve privacy and ensure proper evaluation of outputs before implementation. Such data frameworks will need to provide the necessary guidance and direction to enable all actors in a data-driven process – data custodians, analysts, data governance staff, managers and service providers – to understand how to meet their obligations and respect limits of acceptable use.
Those frameworks and the social benefits they deliver must be sufficiently transparent to, and understood by, interested citizens to sustain and nurture data trust (social licence). Social licence requires proper management of any data analytics environment based on de-identification and proper evaluation of outputs before they are implemented.
These frameworks, the manner of their implementation and management, and the safeguards for quarantining outputs for human evaluation, must be sufficiently transparent to, and understood by, citizens to mitigate risk. The views of government and its agencies as to their good intentions are likely to be contested by at least some citizens. An effective data sharing framework must contain controls and safeguards that can be demonstrated to citizens as reliable and effective.
An efficient authorising environment could be managed consistently through a nationally accepted information governance framework, designed to guide the regulators, data owners and data custodians in a practical way. This framework could help clarify the risks at each stage of the data analysis process and provide appropriate transparency to citizens.
For the authorising environment to be truly effective, it needs to comply with an appropriate information governance framework that demonstrates transparency, trust, efficacy and value.
A modified Five Safes Framework
In September 2017, the Australian Computer Society (ACS) released a technical whitepaper, Data Sharing Frameworks, that explored the challenges of data sharing.[1] The whitepaper highlighted that one fundamental challenge for the creation of smart services is addressing the question of whether a set of datasets contains personal information. Determining the answer to this question is further complicated because the act of combining datasets creates information. The whitepaper proposed a modified version of the Five Safes Framework[2] for data sharing that attempts to quantify different thresholds for ‘safe’.
The 2017 whitepaper introduced several conceptual frameworks for practical data sharing, including an adapted version of the Five Safes Framework. Many organisations around the world, including the Australian Bureau of Statistics, use the Five Safes Framework to help make decisions about effective use of data which is confidential or sensitive. The dimensions of the framework are:
Safe People – refers to the knowledge, skills and incentives of the users to store and use the data appropriately. In this context, ‘appropriately’ means ‘in accordance with the required standards of behaviour’, rather than level of statistical skill. In practice, a basic technical ability is often necessary to understand training or restrictions and avoid inadvertent breaches of confidentiality – an inability to analyse data may lead to frustration and increase incentives to share access with unauthorised people.
Safe Projects – refers to the legal, moral and ethical considerations surrounding use of the data. This is often specified in regulations or legislation, typically allowing but limiting data use to some form of valid statistical purpose, and with appropriate public benefit. Grey areas might exist when exploitation of data may be acceptable if an overall public good is realised.
Safe Setting – refers to the practical controls on the way the data is accessed. At one extreme, researchers may be restricted to using the data in a supervised physical location. At the other extreme, there are no restrictions on data downloaded from the internet. Safe settings encompass both the physical environment (such as network access) and procedural arrangements (such as the supervision and auditing regimes).
Safe Data – refers primarily to the potential for identification in the data. It may also refer to the conditions under which the data was collected, the quality of the data (accuracy), the percentage of a population covered (completeness), the number of features included in the data (richness), or the sensitivity of the data.
Safe Outputs – refers to the residual risk in publications from sensitive data.
The Five Safes Framework is relatively easy to conceptualise when considering the idea of ‘extremely safe’, although it does not unambiguously define this. An ‘extremely safe’ environment may involve researchers who have had background checks, projects that have ethics approval, and rigorous vetting of outputs from that data environment. Best practice may be established for such frameworks, but none of these measures can be described in unambiguous terms, as each involves judgement.
The adapted model explores different, quantifiable levels of safe for each dimension of people, projects, setting, data and outputs and how these different safe levels could interact in different situations. Figure 1 shows the dimensions of the adapted Five Safes Framework taken from the 2017 ACS Technical whitepaper.
Figure 1. Modified Five Safes Framework
2: SAFE DATA AND PERSONAL INFORMATION FACTORS
How safe is a dataset?
The aspects of safe data described earlier primarily focus on the risk of re-identification, but also include the quality of the data (accuracy), the conditions under which it was collected, the percentage of a population covered (completeness), the number of features included in the data (richness), and the sensitivity of the data. Figure 2 illustrates the different aspects that will be considered in this article to determine the safe level of data.
The legal tests for personal information generally relate to the situation where an individual's identity can ‘reasonably be ascertained’. The 2017 ACS Data Sharing Frameworks technical whitepaper uses the concept of a Personal Information Factor (PIF) to describe the level of personal information in a dataset or outcome, as shown in Figure 3. A PIF of 1 means personal information exists; a value of 0 means there is no personal information.
Personal information that includes health information is excluded from the scope of this article. It is important to note the PIF method described is not a technique for anonymisation: rather, it is a heuristic measure of potential risk of re-identification.
Figure 2. Aspects of Safe Data including Personal Information Factors
In this article:
· Feature depth is the number of independent features in the dataset. For example, in the binary valued feature set: eye_colour_is_brown, individual_is_adult, gender_is_female, the feature depth is 3. If one of these features is dependent on another, or can be derived from a combination of features, the feature depth would be 2 (for example, a pregnancy feature may also enable the gender feature to be derived). An implicit assumption is that all features carry equal information for an individual. Also, the sensitivity of each feature is not considered.
· Coverage probability is the probability that an individual is in the population included in the dataset. In a closed analytical environment, with randomly selected samples and no other information available, this is taken to be the percentage of the entire population covered by the sample dataset. For example, a sample dataset of ten men with beards taken from a known population of 1,000 men with beards has a coverage probability of 1/100. If an individual is known to be in a dataset, either because the data was not selected randomly or the sample set covers the entire population, the coverage probability is 1.
· Accuracy refers to the ratio of the number of correct values in all features in the dataset to the number of all values for all features in the dataset. For a sample population of eight individuals with ten features each, of which 20% of the values were known to be wrong for one feature, the accuracy is 0.98. If a second feature was known to have 20% of incorrect values, the accuracy would drop to 0.96. No consideration is given to values which are almost correct. This is discussed further in a subsequent section.
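As a concrete (if simplified) illustration of the coverage probability and accuracy calculations above, the short Python sketch below reproduces the worked numbers from these definitions; the function names are illustrative rather than drawn from the whitepaper.

```python
# Minimal sketch of the coverage-probability and accuracy calculations
# described above. Function names are illustrative, not from the whitepaper.

def coverage_probability(sample_size, population_size, known_to_be_included=False):
    """Probability that a given individual appears in the sample dataset."""
    if known_to_be_included:
        return 1.0  # non-random selection, or the sample covers the whole population
    return sample_size / population_size

def accuracy(correct_values, total_values):
    """Ratio of correct feature values to all feature values in the dataset."""
    return correct_values / total_values

# Ten bearded men sampled at random from a known population of 1,000 bearded men.
print(coverage_probability(10, 1000))        # 0.01, i.e. 1/100

# Eight individuals x ten features = 80 values; 20% of one feature's values
# (1.6 of 8) are wrong, then 20% of a second feature's values as well.
total = 8 * 10
print(accuracy(total - 0.2 * 8, total))      # 0.98
print(accuracy(total - 2 * 0.2 * 8, total))  # 0.96
```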
Personal, spatial, temporal and relationship features will be discussed in greater detail in a subsequent section.
Figure 3. Personal Information Factor and aggregation level
Revisiting the Personal Information Factor (PIF)
Aggregation is often used to protect individual identity, ensuring outputs are not released for cohorts smaller than ‘N’. The value of N depends on the risk appetite of the organisation and the perceived sensitivity of the data itself.
In principle, for any value of N selected, if (N-1) other datasets can be found that relate to the cohort of interest, then the cohort of size N can be decomposed into identifiable individuals. As the aggregation levels increase (cohort sizes of N, N^2, N^3 and so on for N > 1), the level of protection increases, as more related datasets are needed to identify an individual within the cohort. The fundamental weakness nonetheless remains that determining N is dependent on the risk appetite.
The definition of PIF is still to be robustly determined; however, the working definition is upper-bound and defined within a closed, linked, de-identified dataset as:
The minimum identifiable cohort size (MICS) is the smallest group within a dataset that can be created from the available features.
For example, in one dataset there may be 100 males without beards born in NSW. If an additional feature is included (those under 18), this number may reduce to 10. In this example, the MICS is at most 10. The ‘at most’ is important, as it specifies there cannot be a cohort smaller than this. The strict condition of the MICS being determined within a closed, linked, de-identified dataset is required to satisfy the condition that no additional data can be introduced to this set.
As new datasets are added to an existing closed linked dataset, new features are potentially identified. As a consequence, the MICS will potentially reduce, leading to higher PIF values.
The notion of bound is important, as having a cohort size of 1 in a de-identified dataset is not the same as having personal information (when the MICS is 1, the PIF is still strictly less than 1). Some additional data or feature is needed to identify the actual individual.
The term ‘epsilon’ in the PIF calculation is intended to reflect the fact that with de-identified data, even at MICS of 1, at least one additional data field is required to map to the identifiable individual.[3]
In the example above of a defined de-identified cohort, knowing there is only one male member does not provide sufficient information to identify the male as a named individual. Depending on the exact circumstances, it is possible to imagine additional data (an additional feature) which would allow identification. Similarly, if there were two males in the cohort, it is possible to imagine several additional datasets (features) that would allow individual identification.
The same reasoning applies for five or ten males in a defined cohort. The PIF is therefore treated as an upper bound rather than an exact value. The additional information required to link an individual to their feature set in the data may include a unique personal feature, a unique name, a unique address or a unique relationship.
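The MICS calculation itself is straightforward to sketch. The hypothetical Python fragment below groups a small, closed, linked, de-identified table by its feature values and reports the smallest cohort together with the 1/MICS upper bound on the PIF; the epsilon adjustment introduced in the next section is deliberately omitted, and the records are invented for illustration.

```python
# Hypothetical sketch: find the minimum identifiable cohort size (MICS)
# in a closed, linked, de-identified dataset and the resulting 1/MICS
# upper bound on the PIF (the epsilon refinement is ignored here).
from collections import Counter

# Each row is one de-identified individual; features are categorical.
rows = [
    # (male, has_beard, born_in_NSW, under_18)
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 1),   # the only male without a beard, born in NSW, under 18
    (0, 1, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 1, 0),
]

def mics(records):
    """Smallest cohort that can be formed from the available features."""
    return min(Counter(records).values())

def pif_upper_bound(records):
    """The PIF is strictly less than 1/MICS once other cohorts exist."""
    return 1.0 / mics(records)

print(mics(rows))             # 1 -> a minimum identifiable cohort of one person
print(pif_upper_bound(rows))  # 1.0, though the true PIF stays strictly below 1
```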
Figure 4 shows a simple example of a closed, linked, de-identified dataset with a population of size 16 (P=16), with eight features (F=8) and four equal-sized cohorts (MICS=4). The PIF for each of these cohorts is strictly less than 0.25. In this simplistic example, the first four features (f1, f2, f3, f4) define the cohorts, and the addition of features 5 through 8 does not impact the cohort sizes.
Figure 4. Population of 16, with eight features and four equal-sized cohorts
The quantification of epsilon is still to be finally determined and will be contextual. It relates to the uniqueness of the minimum identifiable cohort and is currently defined for the purpose of this article as:
Where:
· d(i) is the Hamming Distance[4] (the count of features that do not match) between the minimum identifiable cohort and all cohorts of size Gp at distance i.
· Gp(i) is non-zero.
· P is the size of the population.
· F is the number of features (for example, hair colour) in the closed, linked, de-identified dataset.
As illustrated in the example population shown in Figure 4 and conceptualised in Figure 5, there may be more than one cohort at any given distance from the minimum identifiable cohort and there may be more than one cohort with the MICS.
Figure 5. Illustration of the relationship between the minimum identifiable cohort and other cohorts
The more unique a cohort is, the smaller the epsilon. In the population example of Figure 4, the Hamming Distance (the count of the number of features which are different) between each pair of cohorts is 2, and the epsilon value for each cohort is approximately 0.02. The larger the value of epsilon, the more similar the cohort is to other cohorts, and so the larger the number of additional features required to identify a unique member of the population. In the special case that the MICS is the entire population (P), epsilon is 0 and the PIF is bounded by 1/P.
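Figure 4 is not reproduced here, but a population consistent with its description can be constructed and checked in a few lines. The bit patterns below are an assumption chosen so that the four cohorts of size 4 sit pairwise at Hamming distance 2, as described above; everything else follows from the definitions already given.

```python
# Hypothetical reconstruction of a Figure 4-style population: 16 people,
# eight binary features, four cohorts of size 4 defined by features f1-f4.
# The bit patterns are illustrative; they are chosen so that every pair of
# cohorts differs in exactly two features (Hamming distance 2).
from collections import Counter
from itertools import combinations

defining_patterns = ["0000", "0011", "0101", "0110"]   # f1-f4 for each cohort
population = [p + "0000" for p in defining_patterns for _ in range(4)]

cohorts = Counter(population)                 # cohort signature -> cohort size
mics = min(cohorts.values())                  # minimum identifiable cohort size
print(cohorts)                                # four cohorts of size 4
print("MICS:", mics, "PIF <", 1 / mics)       # MICS = 4, PIF strictly < 0.25

def hamming(a, b):
    """Count of features whose values differ between two cohort signatures."""
    return sum(x != y for x, y in zip(a, b))

for a, b in combinations(cohorts, 2):
    print(a, b, hamming(a, b))                # every pair of cohorts is at distance 2
```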
Breaking context: spatial, temporal and relationship information factors
In a closed, linked, de-identified dataset, it is assumed that each feature is independent. That is, no information can be gained about one feature by examining another. Knowledge of context can, however, allow information to be inferred, decreasing the feature depth. Separating features that provide context from features that describe a person or object helps ensure the independence of features.
In an exact analogy to the Personal Information Factor, this article introduces a Contextual Information Factor (CIF), which is a combination of a Spatial Information Factor (SIF), Temporal Information Factor (TIF) and Relationship Information Factor (RIF).
When calculating the overall Contextual Information Factor (CIF) for a population described by a set of spatial/temporal/relationship attributes:
where ∩ is the intersection operator acting on cohorts defined by spatial/temporal/relationship features, and epsilon_c is defined as before, but across all cohorts formed by spatial/temporal/relationship features.
Figure 6 illustrates how cohorts based on different contextual feature sets intersect to create a minimum identifiable cohort. For clarity, the spatial/temporal/relationship MICS cannot be larger than the cohort formed by any one contextual feature set, and the MICS must be at least 1.
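A minimal sketch of this intersection step is shown below; the members are hypothetical de-identified linkage codes, and the three contextual cohorts are invented purely to illustrate how the contextual MICS emerges from the intersection.

```python
# Hypothetical sketch: the contextual minimum identifiable cohort is found by
# intersecting the cohorts an individual falls into for each contextual
# feature set (spatial, temporal, relationship).

spatial_cohort      = {"a1", "a2", "a3", "a4", "a5"}   # e.g. same suburb
temporal_cohort     = {"a2", "a3", "a4", "a7", "a9"}   # e.g. same time window
relationship_cohort = {"a3", "a4", "a8"}               # e.g. same household type

contextual_cohort = spatial_cohort & temporal_cohort & relationship_cohort
print(contextual_cohort)                  # {'a3', 'a4'}

# The intersection can never be larger than the smallest contributing cohort,
# and for someone known to be in all three cohorts it has at least one member.
print("contextual MICS:", len(contextual_cohort))   # 2
```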
Figure 6. Illustration of cohorts based on spatial, temporal and relationship features
The PIF is then updated to be defined by:
The relationship between personal features and context features means that the PIF can potentially be reduced while maintaining the MICS for personal features. This is useful if population sizes are relatively small or if the number of personal features is relatively large.
A Data Safety Factor (DSF)
Referring again to Figure 2, the safeness of data combines the PIF with a range of other aspects, including feature depth, coverage probability and accuracy. This article proposes a heuristic for a DSF:
Feature depth is weighted as an inverse multiple, reflecting the significance of the additional information that would be revealed about an individual for each additional feature included in the dataset.
Coverage probability is weighted as an inverse exponential (squared), reflecting that small reductions in total population inclusion lead to increased uncertainty that a known individual will be present in a sample dataset.
Accuracy is weighted as a sigmoid function, reflecting that small reductions in accuracy produce significantly less safe data and outputs. The sigmoid function has a value of 0.5 at 70% accuracy, reflecting the significant reduction in data safety as data accuracy reduces.
These factors interact to ensure that as the value of PIF increases towards 1 or the accuracy decreases, the DSF reduces rapidly towards 1.
As the number of independent features increases, the DSF decreases. As coverage probability approaches 100%, the DSF reduces rapidly. Figure 7 shows the scaling factors associated with coverage and accuracy.
If accuracy and coverage are unknown, they are assumed to have no effect. The Data Safety Factor simplifies to the combination of the inverse of the PIF and the inverse of the feature depth. If the feature depth is not known, the DSF reduces to simply the inverse of the PIF.
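The exact scaling functions are shown in Figure 7 rather than stated in the text, so the sketch below should be read only as an indicative implementation of the behaviour described above; the sigmoid steepness and the inverse-squared coverage weighting are assumptions, not the whitepaper's published forms.

```python
# Indicative sketch of a Data Safety Factor (DSF) heuristic based on the
# behaviour described above. The scaling functions and constants are assumed;
# the exact forms used in the article are shown in Figures 7 and 8.
import math

def accuracy_scale(accuracy, steepness=10.0, midpoint=0.7):
    """Sigmoid in accuracy, equal to 0.5 at 70% accuracy as described."""
    return 1.0 / (1.0 + math.exp(-steepness * (accuracy - midpoint)))

def coverage_scale(coverage):
    """Assumed inverse-squared weighting: lower coverage probability means
    safer data, and the DSF falls away as coverage approaches 100%."""
    return 1.0 / (coverage ** 2)

def data_safety_factor(pif, feature_depth=None, coverage=None, accuracy=None):
    """Combine the inverse PIF and inverse feature depth with optional
    coverage and accuracy scalings; unknown factors have no effect."""
    dsf = 1.0 / pif
    if feature_depth is not None:
        dsf /= feature_depth
    if coverage is not None:
        dsf *= coverage_scale(coverage)
    if accuracy is not None:
        dsf *= accuracy_scale(accuracy)
    return dsf

print(data_safety_factor(pif=0.25))                     # 4.0 (PIF only)
print(data_safety_factor(pif=0.25, feature_depth=10))   # 0.4 (more features, less safe)
print(data_safety_factor(pif=0.25, feature_depth=10,
                         coverage=0.1, accuracy=1.0))   # ~38: low coverage is safer
print(data_safety_factor(pif=0.25, feature_depth=10,
                         coverage=1.0, accuracy=1.0))   # ~0.38: full coverage is least safe
```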
Figure 7. Scaling associated with coverage and accuracy parameters
Figure 8 shows how the DSF changes with coverage probability (from 10% to 100%) for a linked, de-identified dataset with ten independent features and an accuracy of 100% for differing PIF values. It is again emphasised that this Data Safety Factor is a heuristic measure.
Figure 8. Data Safety Factor versus coverage probability for a feature depth of 10 and accuracy of 100%
The level of data safety to be made available will depend on the other safe settings. This will be discussed later.
Exploring Personal Information Factor(s), k-anonymity and the importance of context
A common approach to protecting personal information in a dataset is to reduce the risk of re-identification through the use of k-anonymity (or l-diversity)[5]. These techniques represent ways of minimising the risk of re-identification, rather than measures of the personal information in the dataset.
A dataset is said to have the k-anonymity property if the information for each individual contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the dataset. There are two commonly employed approaches for achieving k-anonymity (for a given value of ‘k’):
· Generalisation – where values of selected attributes are replaced by a broader category. For example, age may be replaced by bands of 0–5 years, 5–10 years and so on.
· Suppression – where certain values of the attributes are replaced by a null value before release. This is often used for values such as a person’s religion.
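To make the generalisation step and the resulting value of k concrete, the short sketch below bands ages, suppresses a sensitive attribute and reports the size of the smallest equivalence class; the records, band width and choice of quasi-identifiers are illustrative only.

```python
# Illustrative sketch: generalise a quasi-identifier (age -> age band),
# suppress a sensitive attribute, then report k as the size of the smallest
# equivalence class over the released quasi-identifiers.
from collections import Counter

records = [
    {"age": 23, "postcode": "2000", "religion": "x"},
    {"age": 27, "postcode": "2000", "religion": "y"},
    {"age": 24, "postcode": "2000", "religion": "z"},
    {"age": 41, "postcode": "2010", "religion": "x"},
    {"age": 44, "postcode": "2010", "religion": "y"},
]

def generalise_age(age, band=10):
    """Replace an exact age with a broader band such as '20-29'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"

released = [
    {"age_band": generalise_age(r["age"]),
     "postcode": r["postcode"],
     "religion": None}          # suppression of the sensitive value before release
    for r in records
]

# k is the size of the smallest group sharing the same quasi-identifier values.
groups = Counter((r["age_band"], r["postcode"]) for r in released)
print(groups)                        # {('20-29','2000'): 3, ('40-49','2010'): 2}
print("k =", min(groups.values()))   # k = 2 -> the release is 2-anonymous
```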
Because k-anonymisation does not include any randomisation, someone attempting to re-identify an individual can still make inferences by linking other datasets to the k-anonymised set. It has also been shown that using k-anonymity can skew the statistical characteristics of a dataset if it disproportionately suppresses and generalises data points with unrepresentative values.
One of the weaknesses of k-anonymity is that it does not include randomisation of data, so it is possible to make inferences about members of the population. If a person is known to be in a dataset and can be identified as being in the minimum identifiable cohort using a subset of features (for example eye colour, gender, age), then any additional features not used to identify them can be learned. For this reason, k-anonymity is not considered a good technique for protecting the privacy of individuals in high-dimensional datasets.
The approach of separating selected context features in a dataset has been explored in mobile communications systems[6] to protect location information of mobile users. These approaches are referred to as ‘spatial cloaking’ and are employed in circumstances where aggregation techniques such as k-anonymity are used to reduce privacy threats resulting from uncontrolled usage of location-based services. Extending the contextual separation process to include spatial, temporal and relationship features potentially further increases the effectiveness of spatial cloaking-style protection.
The PIF described is not a technique for anonymisation. Rather, it is a heuristic measure of the potential risk of re-identification of an individual in a given dataset, based on the smallest identifiable cohort. At its simplest, the PIF described reduces to 1/k if k is the MICS and there are no other cohorts identified in the population. If there are other cohorts, the PIF is less than 1/k.
To illustrate, Figure 9 shows the evolution of the PIF for the data in Figure 4 as the features in the first row change value (individual feature values change from 0 to 1). The dataset initially has four cohorts of size 4 and all cohorts are equidistant from each other. From a value of approximately 0.23, the PIF rises quickly as a minimum cohort of 1 is created with the first change of feature value (f2 changes from 0 to 1).
This smallest cohort now has distance 1 from a cohort of size 3 and a cohort of size 4, and distance 2 from two cohorts of size 4. The distance to, and size of, these other cohorts means the PIF does not reach 1. As the number of features that change value in the first row increases, the PIF moves closer to 1. The cohort of size 1 becomes more unique as its distance from all other members of the population increases.
Figure 9. Population example with MICS of 1 and cohorts of size 3 and size 4
[1] See ACS website. Available online at https://www.acs.org.au/content/dam/acs/acs-publications/ACS_Data-Sharing-Frameworks_FINAL_FA_SINGLE_LR.pdf
[2] T. Desai, F. Ritchie, R. Welpton, ‘Five Safes: designing data access for research’, October 2016. Available online at https://www.nss.gov.au/nss/home.NSF/533222ebfd5ac03aca25711000044c9e/b691218a6fd3e55fca257af700076681/$FILE/The%20Five%20Safes%20Framework.%20ABS.pdf
[3] The members of this data set may be reasonably identifiable in this circumstance, just not actually identified.
[4] For an explanation of Hamming distance, see for example https://www.oxfordmathcenter.com/drupal7/node/525. While many features will have non-binary values – hair colour may be a range of values, age may be recorded in number of years – each feature can be mapped to one of a finite number of values as a categorical variable without loss of information. The use of Hamming distance as a measure of similarity relies on counting the number of features which differ, not considering how much they differ.
[5] See, for example, L. Sweeney, ‘k-anonymity: a model for protecting privacy’, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002, pp. 557-570. Available online at https://epic.org/privacy/reidentification/Sweeney_Article.pdf.
[6] See, for example, B. Gedik, L. Liu, ‘Protecting Location Privacy with Personalized k-Anonymity: Architecture and Algorithms’, IEEE Transactions on Mobile Computing 7(1), January 2008. Available online at https://ieeexplore.ieee.org/abstract/document/4359010/ Accessed 16 September 2018.