K-Anonymity's process for protecting the data of its users
In the wake of the Facebook-Meta announcement, discussions have intensified around the need for reimagined user data protection regulations and what users can do to protect their data [1]. It is equally essential that user data is handled responsibly within an organization. Encrypting sensitive information is one of the primary responsibilities of data owners, protecting both the data's integrity and the users' trust.
Anonymizing data, however, is not as simple as redacting or randomizing it. An attribute that cannot identify a person on its own can still do so when combined with other attributes. Anonymized data can be re-identified by fitting jigsaw puzzle pieces together to complete the picture, i.e., by finding attributes that two datasets have in common and joining them to build more detailed knowledge. For example, someone who knows a person's postal code, date of birth, and gender can use them to link that person to medical or voter registration records.
Businesses must, therefore, first understand how data is anonymized before formulating a data protection strategy and mitigating the risk of data breaches. The k-Anonymity model is one of the most widely used techniques for anonymizing data.
How the k-Anonymity model works
K-Anonymity was first proposed in 1998 and refined in 2002 [2]. The technique involves "hiding the identity of an individual in plain sight" or "hiding the identity of an individual in a crowd." If the data is gathered from individuals who share similar characteristics, it won't be easy to link the data back to a specific person.
The k is a variable, much like the x in an algebraic expression. A dataset satisfies the k-Anonymity principle if every individual in it shares the same quasi-identifying characteristics with at least k-1 other individuals. Imagine a dataset where k is 100 and the only such characteristic is the postal code. If we take any random individual, there will always be at least 99 others with the same postal code; thus, identifying an individual based on only the postal code value in a k-Anonymous dataset would be impossible. To gain more clarity, let's review the following table.
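As a minimal sketch of this definition, the k of a dataset can be read off as the size of the smallest group of rows that share the same quasi-identifier values. The function name k_of and the sample records below are assumptions made for illustration; they are not taken from the article's table.

```python
from collections import Counter

def k_of(rows, quasi_identifiers):
    """Return the largest k for which `rows` is k-anonymous.

    That is the size of the smallest group of rows sharing the same
    values in every quasi-identifier column (0 for an empty dataset).
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

# Hypothetical records: only the postal code is treated as a quasi-identifier.
records = [
    {"postal_code": "100**", "disease": "Flu"},
    {"postal_code": "100**", "disease": "Asthma"},
    {"postal_code": "100**", "disease": "Diabetes"},
    {"postal_code": "200**", "disease": "Flu"},
    {"postal_code": "200**", "disease": "Flu"},
]

# The "200**" group has only two rows, so the dataset is 2-anonymous.
print(k_of(records, ["postal_code"]))
```

The smallest group sets the k for the whole dataset, which is why a single rare combination of values is enough to weaken it.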
The table is divided into three parts: Identifiers, Quasi-Identifiers, and Sensitive Data. An identifier is a piece of information that directly identifies a person. A quasi-identifier is a piece of information that may not uniquely identify an individual on its own but may reveal the individual when combined with other quasi-identifiers. Here, the identifier is 'Name,' the quasi-identifiers are 'Age,' 'Postal Code,' and 'Gender,' and the sensitive data is 'Disease.'
The table above is k-Anonymous with k = 3, which is achieved by generalizing or redacting certain quasi-identifiers. Notice that the age has been generalized from exact numbers to a group, while the identifying digits of the postal codes have been masked. Similarly, all names and some genders are redacted entirely.
Generalization
In the context of k-Anonymity, generalization means replacing specific values with broader categories. Placing an identifiable value in a larger group removes the identifying detail that could otherwise be derived from it. Think of it as increasing the radius. For example, suppose a dataset contains references to Italian cities such as Palermo, Turin, Milan, Rome, and Naples. In such a case, they can all be generalized as 'Italy.' Similarly, in the preceding example, specific ages are generalized into age groups to achieve anonymity.
Here is an example of how Generalization can be accomplished.
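The sketch below shows one possible way to code the three generalizations mentioned so far: ages into ranges, postal codes into masked prefixes, and cities into a country. The function names, the 10-year range width, the number of digits kept, and the sample values are all assumptions chosen for illustration.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def mask_postal_code(code, keep=3):
    """Keep the first `keep` characters and mask the rest with '*'."""
    return code[:keep] + "*" * (len(code) - keep)

# Hypothetical lookup built from the cities named in the text.
CITY_TO_COUNTRY = {
    "Palermo": "Italy",
    "Turin": "Italy",
    "Milan": "Italy",
    "Rome": "Italy",
    "Naples": "Italy",
}

def generalize_city(city):
    """Replace a known city with its country; leave unknown values as-is."""
    return CITY_TO_COUNTRY.get(city, city)

print(generalize_age(27))         # 20-29
print(mask_postal_code("10115"))  # 101**
print(generalize_city("Turin"))   # Italy
```

Wider age ranges or fewer kept digits produce larger groups and therefore a higher k, at the cost of less precise data.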
Suppression
Suppression is the process of removing certain values or records from a dataset entirely. Data handlers should, however, be cautious about what data is suppressed. For example, if we need to know which disease is more common in which age group of patients, suppressing the age group would defeat the purpose of the study. Instead, we must suppress the data points that are irrelevant to the current study.
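A minimal sketch of two common forms of suppression follows: masking whole columns, and dropping rows whose quasi-identifier combination is too rare. The function names, the '*' marker, and the patient rows are hypothetical and only serve to illustrate the idea.

```python
from collections import Counter

def suppress_columns(rows, columns, marker="*"):
    """Replace every value in the given columns with a marker."""
    return [
        {key: (marker if key in columns else value) for key, value in row.items()}
        for row in rows
    ]

def suppress_rare_rows(rows, quasi_identifiers, k):
    """Drop rows whose quasi-identifier combination appears fewer than k times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [
        row for row in rows
        if counts[tuple(row[q] for q in quasi_identifiers)] >= k
    ]

# Hypothetical patient rows; the names are invented.
patients = [
    {"name": "A. Rossi", "age_group": "20-29", "disease": "Flu"},
    {"name": "B. Bianchi", "age_group": "20-29", "disease": "Asthma"},
    {"name": "C. Verdi", "age_group": "60-69", "disease": "Diabetes"},
]

# Names are irrelevant to a disease-by-age-group study, so they are masked.
without_names = suppress_columns(patients, {"name"})
# The single 60-69 row would stand out, so it is dropped for k = 2.
kept = suppress_rare_rows(without_names, ["age_group"], k=2)
print(kept)
```

The age group column is left untouched here because the study in the example depends on it.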
To better understand the mechanism behind k-Anonymization, consider another hypothetical dataset of the number of goals scored by youth players in a soccer league.
Here, the dataset is k-Anonymous with k = 1, as the combinations (13, M), (13, F), (14, M), (14, F), (15, M), (15, F), (17, M), and (17, F) are each represented only once.
However, the table above is k-Anonymous with k = 2, as each age-gender pair, (13, F), (15, M), (14, F), and (17, M), appears in at least two rows. That is, there are at least two rows for each combination of identity-revealing characteristics.
In a soccer academy, we could use the Table 3 dataset to figure out which players scored how many goals since each combination is represented only once. Due to the obfuscation of the data in Table 4, it is difficult to determine the goals accurately.
As k increases, the anonymity of the dataset becomes more robust, because anyone trying to attribute a row to a specific person has at most a 1/k chance of being correct. As a result, organizations that use a higher level of k-Anonymity in their data protection mechanism can achieve a higher level of data security while minimizing risks.
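The linking scenario from the soccer example can be sketched as follows. Someone who already knows a player's age and gender filters the table for matching rows, and a uniform guess among the matches is right with probability 1 divided by the number of candidates. The rows below are hypothetical stand-ins, not the article's tables, and matching_rows is an illustrative helper.

```python
def matching_rows(rows, known):
    """Return the rows consistent with what the observer already knows."""
    return [
        row for row in rows
        if all(row[key] == value for key, value in known.items())
    ]

# Hypothetical rows in which every (age, gender) pair is unique (k = 1).
unique_rows = [
    {"age": 13, "gender": "M", "goals": 4},
    {"age": 13, "gender": "F", "goals": 6},
    {"age": 15, "gender": "M", "goals": 2},
    {"age": 15, "gender": "F", "goals": 5},
]

# Hypothetical rows in which every (age, gender) pair appears twice (k = 2).
grouped_rows = [
    {"age": 13, "gender": "F", "goals": 6},
    {"age": 13, "gender": "F", "goals": 1},
    {"age": 15, "gender": "M", "goals": 2},
    {"age": 15, "gender": "M", "goals": 7},
]

known = {"age": 13, "gender": "F"}

for label, rows in (("k = 1", unique_rows), ("k = 2", grouped_rows)):
    candidates = matching_rows(rows, known)
    chance = 1 / len(candidates)  # probability of a correct uniform guess
    print(label, [row["goals"] for row in candidates], chance)
```

With unique combinations the single match reveals the goal count outright, while with two candidates per combination the guess is no better than a coin flip.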
To Conclude
With businesses relying more and more on collecting data to gain insights, data masking is becoming increasingly important. For example, Google already uses k-Anonymity [3] to protect user data. Meanwhile, other privacy-preserving techniques such as l-diversity, t-closeness, and differential privacy are already being incorporated into the larger picture of data masking.
Nevertheless, k-Anonymity, Suppression, and Generalization remain the foundations of more advanced anonymization algorithms and are the most widely used techniques for masking data.
References
[1] Facebook is Meta now - How it will impact data privacy regulations law. shorturl.at/eBEK3
[2] k-Anonymity: A model for protecting privacy. shorturl.at/dxBLS
[3] How Google Anonymised Data. shorturl.at/lDM28