Methods and challenges of de-identifying data
Data de-identification is the process of stripping data of personal identifiers. It is a set of practices, algorithms, and tools applied to data at varying levels and with varying degrees of effectiveness. HIPAA brought the concept into the mainstream through its Privacy Rule, which sets the standard for de-identifying Protected Health Information (PHI). HIPAA's commendable efforts have brought much-needed regulation to health data, and it is widely seen as a beacon among privacy regulations in the United States. De-identification is also useful well beyond hospital and patient records.
De-identification is one of the fastest ways to comply with regulations such as HIPAA and to strengthen data protection. In a digital age of strict privacy regulations, it is a necessity. Let's look more closely at how it works.
Methods of De-Identification
The HIPAA Privacy Rule sets out two methods of de-identifying data in Sections 164.514(b) and (c): Expert Determination and Safe Harbor. [1]
Expert Determination
As the name suggests, Expert Determination relies on individuals with the knowledge and experience to apply statistical and scientific principles that minimize the risk of re-identification. The expert's job is to determine that the risk of an anticipated recipient identifying an individual is very small. However, engaging an expert can be expensive.
Safe Harbor
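Safe Harbor requires the removal of 18 types of identifiers to minimize the chance of residual information leakage. The 18 identifiers are: names; geographic subdivisions smaller than a state; all elements of dates (except the year) directly related to an individual; telephone numbers; fax numbers; email addresses; Social Security numbers; medical record numbers; health plan beneficiary numbers; account numbers; certificate and license numbers; vehicle identifiers and serial numbers, including license plates; device identifiers and serial numbers; web URLs; IP addresses; biometric identifiers, including finger and voice prints; full-face photographs and comparable images; and any other unique identifying number, characteristic, or code.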
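As a rough illustration of Safe Harbor-style removal, the sketch below drops a handful of direct identifiers from a record and coarsens the date and ZIP code. The field names (`name`, `birth_date`, `zip_code`, and so on) are hypothetical and cover only a few of the 18 categories. The sketch also skips rules a real implementation would need, such as the population check on three-digit ZIP codes and the aggregation of ages over 89.

```python
# Hypothetical field names; a real schema will differ and must cover
# all 18 Safe Harbor identifier categories, not just these.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "phone", "fax", "email", "ssn",
    "medical_record_number", "ip_address",
}


def safe_harbor_sketch(record: dict) -> dict:
    """Return a copy of `record` with direct identifiers removed."""
    # Drop the fields that identify a person outright.
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIER_FIELDS
    }
    # Dates: keep only the year (assumes ISO "YYYY-MM-DD" strings).
    if "birth_date" in cleaned:
        cleaned["birth_year"] = cleaned.pop("birth_date")[:4]
    # Geography: keep only the first three ZIP digits. The rule's
    # population threshold is NOT checked here.
    if "zip_code" in cleaned:
        cleaned["zip3"] = cleaned.pop("zip_code")[:3]
    return cleaned


patient = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "birth_date": "1980-04-12",
    "zip_code": "02139",
    "diagnosis": "asthma",
}
print(safe_harbor_sketch(patient))
```

The clinical field (`diagnosis`) survives untouched, which is what makes the approach cheap: no statistical modeling is required, only a fixed list of fields to remove or coarsen.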
This method is one of the most cost-effective ways of protecting user data. However, it does not suit every use case and may lead to information loss. On top of Safe Harbor, experts often apply data masking techniques such as generalization and randomization. Let's see how de-identification can be achieved through them:
De-Identification through Generalization
Generalization refers to grouping or coarsening data, usually in the context of k-anonymity: a dataset is k-anonymous when every record is indistinguishable from at least k-1 others on its identifying attributes. [2] The technique involves "hiding an individual's identity in plain sight." If the data comes from individuals who share similar characteristics, it is hard to link any record back to a specific person. Organizing identifiable data into larger groups removes much of the identifying information that could be derived from it. Generalization can also reduce the amount of data that has to be redacted, limiting information loss while preserving the integrity of the data.
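The idea can be sketched in a few lines. The example below is a minimal illustration, not a production anonymizer: it coarsens age into ten-year ranges and ZIP codes into three-digit prefixes, then reports the size of the smallest group of records that share the same generalized values, which is the k the dataset achieves. The records, field names, and bucket sizes are all made up for the example.

```python
from collections import Counter


def generalize(record: dict) -> dict:
    """Coarsen the identifying attributes of one record."""
    decade = (record["age"] // 10) * 10
    return {
        "age_range": f"{decade}-{decade + 9}",
        "zip_prefix": record["zip_code"][:3] + "**",
        "diagnosis": record["diagnosis"],  # the value we want to keep
    }


def smallest_group(records: list, attributes: list) -> int:
    """Size of the smallest group sharing the same attribute values (k)."""
    groups = Counter(tuple(r[a] for a in attributes) for r in records)
    return min(groups.values())


# Made-up records for illustration only.
records = [
    {"age": 34, "zip_code": "02139", "diagnosis": "asthma"},
    {"age": 37, "zip_code": "02141", "diagnosis": "flu"},
    {"age": 52, "zip_code": "10027", "diagnosis": "diabetes"},
    {"age": 58, "zip_code": "10025", "diagnosis": "asthma"},
]

generalized = [generalize(r) for r in records]
print(smallest_group(records, ["age", "zip_code"]))              # before
print(smallest_group(generalized, ["age_range", "zip_prefix"]))  # after
```

Before generalization every record is unique on age and ZIP code. Afterwards each record shares its generalized values with at least one other, so no combination points to a single person.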
De-Identification through Randomization
De-identification can also be achieved through randomization. In this technique, data is perturbed so that personal information cannot leak. A common way to do this is differential privacy, which adds calibrated random noise to the data or to query results so that they are imprecise enough to be hard to trace back to any individual. You can then use the data for statistical analysis without exposing personal information. Technology giants such as Facebook, Amazon, and Apple are already using differential privacy to anonymize and de-identify data.
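One standard way to add such noise is the Laplace mechanism. The sketch below applies it to a simple counting query and makes no claim about how any of the companies above implement it. Adding or removing one person changes a count by at most 1, so the noise is drawn from a Laplace distribution with scale 1/epsilon, where a smaller epsilon means more noise and stronger privacy. The function names and the sample data are illustrative assumptions.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential samples with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon: float, rng=None) -> float:
    """Noisy count of the values matching `predicate`.

    A count has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    rng = rng if rng is not None else random.Random()
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up ages for illustration only.
ages = [23, 35, 41, 52, 67, 29, 38, 44]
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0)
print(f"noisy count of people aged 40+: {noisy:.2f}")
```

Each call returns a slightly different answer, but the noise averages out to zero, so aggregate statistics stay useful while any single person's presence in the data is masked.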
Drawbacks
Although data de-identification is necessary, it can still pose severe privacy risks if it is not done correctly. In 2006, AOL, one of the best-known Internet companies of the 1990s, published a set of search logs from its subscribers that supposedly contained no personally identifiable data. Yet New York Times reporters were able to identify a subscriber from the searches alone. [3] In the same year, Netflix, now probably the world's favorite streaming platform but then an online DVD rental service, released over 100 million movie ratings by roughly 500,000 subscribers. The dataset was anonymized, but researchers still used information from other movie review platforms to work backward and match names to the profiles and their viewing behavior. [4] These instances are a stark reminder that, to prevent re-identification, companies should do their due diligence and perform a risk assessment before releasing any data online, even if it has been de-identified.
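The Netflix case is an example of a linkage attack: records with no names attached are matched against a second, public source that does carry names. The toy sketch below shows the mechanics with made-up data. It is not the method the researchers used, only an illustration of why a handful of overlapping ratings can be enough. All names, titles, and the `min_overlap` threshold are hypothetical.

```python
def link_profiles(anonymous: dict, public: dict, min_overlap: int = 2) -> dict:
    """Match anonymous IDs to public names by shared (title, rating) pairs."""
    matches = {}
    for anon_id, ratings in anonymous.items():
        best_name, best_overlap = None, 0
        for name, public_ratings in public.items():
            overlap = len(ratings & public_ratings)
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        # Only report a match when enough ratings coincide.
        if best_overlap >= min_overlap:
            matches[anon_id] = best_name
    return matches


# "Anonymized" release: IDs instead of names (made-up data).
anonymous = {
    "user_1": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "user_2": {("Movie A", 3), ("Movie D", 1)},
}
# Public reviews posted under real names elsewhere (made-up data).
public = {
    "alice": {("Movie A", 5), ("Movie C", 4)},
    "bob": {("Movie D", 4)},
}

print(link_profiles(anonymous, public))
```

Here `user_1` is linked to `alice` because two ratings coincide, which also exposes the third rating that was never public. A pre-release risk assessment would ask which outside datasets could be joined against the release in this way.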
Conclusion
Real-time data analysis is the fuel that modern organizations run on. Data de-identification preserves data integrity, confidentiality, and privacy while still allowing the data to be used to gain insights. Although the process can be intimidating, you can address some of its complexities with automated tools and with experts or institutions that can provide counsel. In any case, any data that is made public must be de-identified and anonymized.
References cited in the article:
[1] Guidance Regarding Methods for De-identification of Protected Health Information, https://bit.ly/3EuJfVK
[2] K-Anonymity's process for protecting the data of its users, https://bit.ly/3dkaRAO
[3] A Face Is Exposed for AOL Searcher No. 4417749, https://nyti.ms/3rFS23w
[4] Who's Watching? De-anonymization of Netflix Reviews using Amazon Reviews, https://bit.ly/31wB2BE