Why Data Sharing Is Still Creating Significant Data Breaches
Aloke Guha
Serial Entrepreneur | Innovator | Data-Driven Distributed Systems | Data Science
A recent conversation with a data architect at a major healthcare company brought this surprising fact to my attention: there are still ongoing data breaches in healthcare and other industry sectors that rely on just-in-time data sharing with their service providers, suppliers and other partners in their data ecosystem. One recent case[1] was the inadvertent sharing of personally identifiable information (PII) of more than 13 million patients with third parties.
We have seen a similar situation before, where application-level data sharing caused massive security breaches. In that case, Facebook granted a third-party app access to a range of information from its users’ profiles, which collected PII on people’s locations, interests, photos, etc.[2]. This information was then shared with Cambridge Analytica, which misused it with major political and historical ramifications.
Inadvertent Data Sharing Results in Large-Scale Security Breaches
In the Kaiser Permanente case, the data was likely handed over to user tracking and analytics tools from well-known internet companies, including Google, Microsoft Bing, and X/Twitter, via tools and applications installed on its websites and mobile apps. Unfortunately, there was no oversight of what was shared with these applications, resulting in a significant data breach.
Look closer into why such oversight is missing in this kind of data sharing, and you will uncover some interesting systemic challenges:
Chaotic Nature of Data
The shared data may have very different, and sometimes changing, structure (schema). In many cases the data also come in different formats (e.g., ID numbers vs. biometric data).
Need for a Universal Policy
The applicable sharing policy varies from partner to partner, since each partner has its own applications consuming the data through distinct APIs.
Need for Semantic Understanding
Whether data should be shared depends on the semantics, or meaning, of the data and the context in which it is used (e.g., the relevance of the temperature of animate versus inanimate objects).
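To make the semantic-understanding challenge concrete, here is a minimal sketch (all field names, tags, and the tagging scheme are hypothetical) of a sharing decision that depends on what a value *means* rather than on its name or type. Two fields carry the same kind of number, a temperature, but only one is a biometric reading about a person:

```python
# Hypothetical sketch: a sharing decision driven by data semantics.
# In practice these tags would come from a data catalog or classifier.
SEMANTIC_TAGS = {
    "body_temp_c": {"temperature", "biometric", "pii"},  # animate: a patient vital sign
    "room_temp_c": {"temperature", "environmental"},     # inanimate: a facility reading
}

# Centrally defined policy: fields carrying these tags must never leave.
BLOCKED_TAGS = {"pii", "biometric"}

def may_share(field_name: str) -> bool:
    """Allow sharing only if none of the field's semantic tags are blocked.
    Unknown fields default to their (empty) tag set and are allowed here;
    a stricter policy would default to deny."""
    return not (SEMANTIC_TAGS.get(field_name, set()) & BLOCKED_TAGS)

record = {"body_temp_c": 38.2, "room_temp_c": 21.5}
shared = {k: v for k, v in record.items() if may_share(k)}
print(shared)  # only the environmental temperature is shared
```

Note that a schema-level rule ("temperatures are safe") would have leaked the patient vital sign; only the semantic tags distinguish the two fields.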
Limitations of Current Data Sharing Practices
Common approaches to data sharing used in practice today exhibit some major limitations.
Application-Level Sharing - Limited Scalability
One well-known approach is to embed the sharing policy within the application that is the source of the data being shared, or to overlay business logic above the application. The limitation is that this does not scale: one must map schemas across every pair of applications involved in the data exchange, or write business logic for all possible application pairs, and the number of pairs grows quadratically with the number of partners.
Data Lake Solutions - Violates Governance
While cloud-based data storage and analytics services such as Snowflake are popular today, they require the data producer and data consumer to share their data with a third party. For sensitive data that is privy only to the sharing partners, disclosing it to a third party violates their confidentiality and governance requirements.
Streaming Data Platforms - Content-Unaware Sharing
Solutions based on data streaming platforms such as Kafka or Kinesis can efficiently transport distributed data, but they are not aware of data semantics and hence are too blunt a tool for policy-based sharing.
The Need for Distributed Edge Analytics
As the data used by enterprises becomes more distributed, analytics too has to become more distributed and move to the edge[3].
The emerging requirement is a new, distributed data platform with intelligence at the edge that can enforce centrally defined security policies without exposing data to anyone not involved in the data exchange.
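The requirement above can be sketched as an edge-side gateway: the policy is authored centrally, distributed to every producer site, and applied there before any record leaves, so raw data never reaches a third-party service. This is a minimal illustration with hypothetical partner names, fields, and policy format, not a real product API:

```python
# Hypothetical sketch: centrally defined policy, enforced at the edge.
# A central control plane would distribute this allow-list to edge nodes;
# here it is just a dict keyed by partner.
CENTRAL_POLICY = {
    "analytics_partner": {"allow": {"page_url", "visit_ts"}},       # no PII fields
    "billing_partner":   {"allow": {"member_id", "claim_id"}},
}

def enforce_at_edge(partner: str, record: dict) -> dict:
    """Drop every field the central policy does not explicitly allow for
    this partner. Unknown partners get an empty allow-list (deny all)."""
    allowed = CENTRAL_POLICY.get(partner, {}).get("allow", set())
    return {k: v for k, v in record.items() if k in allowed}

event = {
    "page_url": "/lab-results",
    "visit_ts": "2024-05-01T10:00Z",
    "patient_name": "J. Doe",          # must never reach the analytics partner
    "member_id": "A123",
}
print(enforce_at_edge("analytics_partner", event))
```

The design choice is that enforcement happens where the data is produced, so even a misconfigured or overly curious tracking tool downstream only ever sees the filtered view.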