Data Sharing Taskforce - A Framework for “Reasonable”
Ian Oppermann
Commonwealth Data Standards Chair, Co-founder ServiceGen, Industry Professor, UTS
The Data Sharing Taskforce met in Sydney in March to tackle some significant challenges which have been identified for data sharing frameworks.
A framework for "Reasonable"
Throughout this document, we have used “personal information” to mean information which can be used to identify a person. In many jurisdictions in Australia, the Commonwealth Privacy Act 1988 regulates how personal information is handled. The Act defines personal information as:
…information or an opinion, whether true or not, and whether recorded in a material form or not, about an identified individual, or an individual who is reasonably identifiable. [emphasis added]
This definition ignores the distinction between data and information. Throughout this document, we use the working definition that “Personal Data” is that which contains personal information (high Personal Information Factor) or which is combined to create personal information (has a non-zero Personal Information Factor).
Examples of Personal Information cited on the website of the Office of the Australian Information Commissioner[1] include:
“an individual’s name, signature, address, telephone number, date of birth, medical records, bank account details and commentary or opinion about a person”.
Of these examples, it is easy to see how some could be used to identify an individual:
- After formal identification, a copy of an individual’s signature may be kept on record by a “trust” centre such as a bank and used to reidentify that individual in future;
- After identification, formal data handling and governance processes are used to manage collection and use of medical records, ensuring the same individual is always associated with the same data.
Whilst they may contain some personal information, the other examples put forward are less clearly associated with being able to identify an individual:
- Whilst each of us has only one date of birth, many people share the same date of birth. Whilst it may be used as one factor for identification, date of birth with no other information cannot be used to uniquely identify an individual (low Personal Information Factor);
- Whilst a person typically has one legal name, many people share the same name. Like date of birth, a name with no other information cannot be used to uniquely identify an individual (moderate to low Personal Information Factor). An individual may also have a set of commonly used nicknames or aliases, and many online identities in different contexts (low to very low Personal Information Factor);
- Telephone numbers, once associated with a fixed residence and published in “white” or “yellow” pages directories, are now automatically allocated each time a prepaid SIM card is purchased and may have a useful life of only one call or data connection (low to zero Personal Information Factor).
The level of personal information associated with a telephone number has changed with the changing use of telecommunications from person to person communications, to being an entry point to the Internet of Things. A telephone number may have a high personal information factor if there is strong or long standing association with that individual. It will have a low personal information factor if the number is used only once, shared amongst many people, or used by an anonymous device in the possession of the individual.
Names have an even more interesting relationship to personal information factor. In some close-knit communities, combinations of given and family names can be quite common as a result of tradition.
Taking another perspective, if an individual used their legal name to register for online services such as those provided by Skype, LinkedIn, Twitter, Tinder, or Ashley Madison, this can be personally identifiable. If, however, an individual created a different online persona for each of these services, it may not be possible to identify the individual. Knowledge of the set of personas may, however, be used to identify the individual.
Taking the question of what is personal information further, the questions associated with identifying an individual from the data they create, or from the data sets which contain information about them, can be broadly categorised as:
- A Cohort of One – is identifying an anonymous individual (person, company, entity) the same as identifying the individual?
- Radius of Convergence – if ever more data sets are brought together, is it certain that personal information will be reached or that an individual will be identified?
- Uniqueness – how small a cohort is required before an individual can be uniquely identified?
When “reasonably” is used as the test, the framing question is: what is the limit on the ability to decide whether personal information is present as more and more data sets are brought together?
Finally, if we can find answers to the questions above, could we develop automated “trust” frameworks which use measurable units of “Trust”?
A Cohort of One – Identifying “any anyone”
Privacy legislation is framed in terms of identification of an individual. In the case of NSW privacy legislation, the individual need not even be living: the legislation covers people for up to 30 years after death.
Data anonymization is often used as a means to avoid dealing with personal data. A fundamental challenge, however, arises when exploring cohorts of people in data sets which begin to narrow to individuals. Online advertising may be shaped by individual preferences and browsing behaviours, and in-game promotions may be targeted based on gaming behaviours. The value of these focussed services is clear. The challenge, however, is whether narrowing service delivery to an anonymous individual is the same as dealing with a named person. Is the identification of a “cohort of 1” the same as identification of an individual person?
Figure 1. Identification of an anonymous individual, and the potential to link to the person.
Framing Questions
- What are the circumstances required to unambiguously connect a cohort of one de-identified individual to an individual person?
- What circumstances would prevent the mapping to an individual person?
The Ability to Decide
People are notoriously poor at making decisions based on multiple inputs. One of the challenges with a test for personal information described in terms of “reasonably” is deciding whether data sets contain sufficient information to identify an individual. If data sets contain, for example, date of birth (low personal information factor), then it is possible to think of scenarios where an individual can be identified by adding further data sets with non-zero personal information factors. If date of birth is linked to postcode (low), gender (low), school attendance (low), dietary restrictions (low to moderate), and work location (low to moderate), it is easy to see how this narrowing set of candidates would lead to an identified individual (personal information factor builds to 1).
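The narrowing described above can be sketched in a few lines of Python. This is a minimal illustration only: the records, attribute names and values are all invented, and "cohort size" is used here as a rough stand-in for how close the combined data is to a personal information factor of 1.

```python
# Hypothetical toy records; every value is invented for illustration.
RECORDS = [
    {"dob": "1980-01-01", "postcode": "2000", "gender": "F", "diet": "none"},
    {"dob": "1980-01-01", "postcode": "2000", "gender": "M", "diet": "none"},
    {"dob": "1980-01-01", "postcode": "2000", "gender": "F", "diet": "vegan"},
    {"dob": "1980-01-01", "postcode": "2010", "gender": "F", "diet": "none"},
    {"dob": "1975-06-30", "postcode": "2000", "gender": "F", "diet": "none"},
    {"dob": "1975-06-30", "postcode": "2010", "gender": "M", "diet": "none"},
]


def cohort_size(records, target, attributes):
    """Count the records that match `target` on every listed attribute."""
    return sum(all(r[a] == target[a] for a in attributes) for r in records)


def narrowing(records, target, attributes):
    """Return the cohort size after each additional attribute is linked."""
    return [
        cohort_size(records, target, attributes[: i + 1])
        for i in range(len(attributes))
    ]


target = RECORDS[2]
sizes = narrowing(RECORDS, target, ["dob", "postcode", "gender", "diet"])
print(sizes)  # [4, 3, 2, 1]
```

Each attribute on its own leaves several candidates, yet four low-factor attributes together are enough to isolate one record in this toy set.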
Framing Questions
- What are the measures a person can use to decide if a person can “reasonably” be identified?
- What is the limit on the number of data sets a person can mentally process to determine “reasonably”?
Figure 2. The ability to decide (See top of article)
Radius of Convergence
A related argument to that presented above is the assumption that combining ever more data sets must lead to the identification of an individual (personal information factor of 1). As with the example above, it is possible to imagine scenarios where this is the case: linking home postcode, work postcode, online login name, date of birth, and so on. However, if you combined home postcode with data that has a very low (or zero) personal information factor, such as weather information for that postcode, you could link data sets spanning the last hundred years without coming any closer to identifying an individual.
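The contrast between the two cases can be illustrated with a small Python sketch. The people, postcodes and rainfall figures below are hypothetical. The point it shows is that a linked attribute which depends only on something already known (weather keyed by postcode) leaves the cohort unchanged, whereas an attribute which varies between people (work postcode) narrows it.

```python
# Hypothetical people and postcodes, invented for illustration.
PEOPLE = [
    {"id": 1, "home_postcode": "2000", "work_postcode": "2000"},
    {"id": 2, "home_postcode": "2000", "work_postcode": "2060"},
    {"id": 3, "home_postcode": "2000", "work_postcode": "2150"},
    {"id": 4, "home_postcode": "2010", "work_postcode": "2000"},
]

# Hypothetical weather data keyed by postcode only.
RAINFALL_MM = {"2000": 1200, "2010": 1150}


def link_weather(people, rainfall):
    """Attach the rainfall figure for each person's home postcode."""
    return [dict(p, rainfall_mm=rainfall[p["home_postcode"]]) for p in people]


def cohort(records, target, attributes):
    """Return the records that match `target` on every listed attribute."""
    return [r for r in records if all(r[a] == target[a] for a in attributes)]


linked = link_weather(PEOPLE, RAINFALL_MM)
target = linked[0]

before = len(cohort(linked, target, ["home_postcode"]))
after_weather = len(cohort(linked, target, ["home_postcode", "rainfall_mm"]))
after_work = len(cohort(linked, target, ["home_postcode", "work_postcode"]))

print(before, after_weather, after_work)  # 3 3 1
```

Linking a hundred years of such weather data would add a hundred more columns, all determined by postcode, and the cohort would still contain three people.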
Figure 3. Radius of Convergence
Framing Questions
- Under which circumstances will linking of ever more data sets lead to identification of an individual?
- What conditions must be met to ensure linking ever more data sets will not lead to identification of an individual?
How Unique is Too Unique?
The risk of identifying an individual is often addressed through aggregation of data. For example, data sets which contain age, gender and income may be aggregated to suburb or SA1 (Statistical Area Level 1) level and then released. The remaining risk, however, is the classical linkage problem, where additional external data sets are used to dissect an aggregated data set sufficiently to identify an individual (cohort of one). Combining age / has-a-beard / income with religion, marital status, employment type, car ownership, credit card debt, smoking preference, favourite beverage and so on may lead to a cohort of one.
To use aggregation as a personal information protecting technique, the challenge becomes identifying the feature set which describes the smallest cohort within the aggregated set. Individuals within the smallest (most unique) cohort of the feature set are potentially the most vulnerable to linkage attack. The level of uniqueness of a cohort in a data set can be described in terms of the percentage of the total data set that the individuals in the cohort match.
Taking a real example examined recently, if individuals in a cohort match the entire set, they are not unique (men in a small working group of all men). If an individual uniquely matches one characteristic in a data set (men with beards in a small working group of all men), they can be identified uniquely.
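The uniqueness measure described above, the share of the total data set matched by a cohort, can be sketched as follows. This is one possible reading rather than a definitive method. The function names and the four-person working group are hypothetical.

```python
from collections import Counter


def cohort_shares(records, features):
    """Map each combination of feature values to its share of the data set."""
    counts = Counter(tuple(r[f] for f in features) for r in records)
    total = len(records)
    return {key: n / total for key, n in counts.items()}


def smallest_cohort(records, features):
    """Return the feature combination with the smallest share, and that share."""
    shares = cohort_shares(records, features)
    key = min(shares, key=shares.get)
    return key, shares[key]


# Hypothetical small working group of all men, one of whom has a beard.
GROUP = [
    {"gender": "M", "has_beard": False},
    {"gender": "M", "has_beard": False},
    {"gender": "M", "has_beard": False},
    {"gender": "M", "has_beard": True},
]

print(smallest_cohort(GROUP, ["gender"]))               # (('M',), 1.0)
print(smallest_cohort(GROUP, ["gender", "has_beard"]))  # (('M', True), 0.25)
```

On gender alone every member matches the whole set, so nobody is unique. Adding "has_beard" produces a cohort of one person (a quarter of this four-person group), which is the cohort most exposed to a linkage attack.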
Perturbation is another technique which is often used to limit how small a cohort can become. If there is uncertainty as to the exact match of features in a data set, it may not be possible to reduce the cohort to one. In the example of the bearded male working group, perturbation of the property “has-a-beard” may be sufficient to remove it as a unique identifying feature.
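As a hedged illustration of perturbation, the sketch below uses randomized response, which is one common perturbation technique and not necessarily the one the taskforce has in mind. Each true value is kept with a chosen probability and otherwise replaced by a fair coin flip. The function names, the `keep_probability` parameter and the sample group are all assumptions made for this example.

```python
import random


def randomized_response(value, keep_probability, rng):
    """Keep the true boolean with `keep_probability`, else return a fair coin flip."""
    if rng.random() < keep_probability:
        return value
    return rng.random() < 0.5


def perturb(records, field, keep_probability, seed=0):
    """Return copies of `records` with the boolean `field` perturbed."""
    rng = random.Random(seed)  # seeded so that a release is reproducible
    return [
        dict(r, **{field: randomized_response(r[field], keep_probability, rng)})
        for r in records
    ]


# Hypothetical working group: all men, one of whom has a beard.
GROUP = [
    {"gender": "M", "has_beard": False},
    {"gender": "M", "has_beard": False},
    {"gender": "M", "has_beard": False},
    {"gender": "M", "has_beard": True},
]

released = perturb(GROUP, "has_beard", keep_probability=0.7, seed=42)
for row in released:
    print(row)
```

Because any released "has_beard" value may be the result of a coin flip, an observer cannot be certain which record belongs to the bearded member. The cost is some loss of accuracy in the released data, which `keep_probability` trades off against protection.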
Figure 4. Uniqueness in a set of data
Framing Questions
- How unique is “too unique” to limit the effectiveness of linkage attack?
- What conditions must be met to ensure linking ever more data sets will not lead to identification of a cohort of one?
How do you Measure Trust?
Much of the challenge of data sharing is essentially related to trust. Developing trust preserving frameworks is at the heart of the work of this taskforce.
Figure 5. The Trust "Equation"
People often are unwilling to share data in environments of low trust. Concerns are typically based on fear of unintended consequences, concerns about loss of control, or concerns about adverse outcomes.
In 2001, a heuristic model of trust was developed [1] to describe the major components of trust and how the challenges of developing a trusted relationship could be addressed. The Trust equation described in [1] uses four objective variables to measure trustworthiness best described as: Credibility, Reliability, Intimacy, and Self-Orientation.
- Credibility: refers to the professional or technical credibility of the subject.
- Reliability: refers to actions and consistency of performance.
- Intimacy: refers to the safety or security that someone feels when entrusting the subject with important information.
- Self-Orientation: refers to the subject’s focus and motivations.
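In its commonly cited form, the trust equation combines these four variables as a simple ratio. The single-letter symbols below are shorthand chosen for this sketch: T for trustworthiness, C for credibility, R for reliability, I for intimacy and S for self-orientation.

```latex
T = \frac{C + R + I}{S}
```

Read this way, higher credibility, reliability and intimacy raise trustworthiness, while higher self-orientation (a stronger focus on one's own interests) lowers it.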
The Trust equation provides a framework for potential interventions to improve the effectiveness of an engagement between individuals, or between an individual and an organisation.
The framing questions are:
- Can trust be measured?
- If trust can be measured, what are the units of trust?
The Work continues
The work of the Data Taskforce continues. The next workshop will be in Canberra in mid-May.
[1] See website https://www.oaic.gov.au/privacy-law/privacy-act/
Professor at Yokohama National University
(7 years ago) Well, we need global consensus on secure data sharing and mining. Ryuji
Information Governance | Digital Government
(7 years ago) Hi Ian, great summary and thank you for continuing to drive this work forward. In relation to the definition of 'personal information', I agree it is important to understand the distinction between data and information. Information is data that has been processed. As such, data can't actually contain personal information. But all data would have a 'personal information factor' related to the potential to derive personal information by processing that data (e.g. by combining with other data sets). I think it will be very interesting to explore further the concept of the PIF, and the alternative 'wave shaped' graph Group 3 proposed around the Radius of Convergence. It's also interesting to look at this in combination with the factors to be weighed in the legal test for 'reasonableness' and the public interest test, which weigh risk with utility and allow for evaluating alternatives - and are based on information available to the decider at the time.