Why Data Sharing Is Still Creating Significant Data Breaches
Image created by Dall-E


A recent conversation with a data architect at a major healthcare company brought this surprising fact to my attention: there are still ongoing data breaches in healthcare and other industry sectors that rely on just-in-time data sharing with their service providers, suppliers, and other partners in their data ecosystem. One recent case[1] involved the inadvertent sharing of personally identifiable information (PII) of more than 13M patients with 3rd parties.

We have seen a similar situation before, where application-level data sharing caused a massive security breach. In that case, Facebook granted a 3rd-party app access to a range of information from users’ profiles, allowing it to collect PII on people’s locations, interests, photos, etc.[2]. This information was then shared with Cambridge Analytica, which misused it with major political and historical ramifications.

Inadvertent Data Sharing Results in Large-Scale Security Breaches

In the Kaiser Permanente case, the data was likely handed over by user tracking and analytics tools from well-known internet companies, including Google, Microsoft Bing, and X/Twitter, installed on its websites and mobile applications. Unfortunately, there was no oversight of what was shared with these tools, resulting in a significant data breach.

Look closer at why such oversight is missing in this kind of data sharing, and you will uncover some interesting systemic challenges:

Chaotic Nature of Data

The shared data may have very different, and sometimes changing, structure (schema). In many cases the data also come in different formats (e.g., an ID field versus biometric data).

Need for a Universal Policy

The applicable sharing policy varies from partner to partner, since each partner has its own applications consuming different data through distinct APIs.

Need for Semantic Understanding

Whether data should be shared depends on the semantics, or meaning, of the data and the context in which it is used (e.g., the relevance of the temperature of animate versus inanimate objects).
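
To make the policy and semantics challenges concrete, here is a minimal sketch in Python (all field names, semantic tags, and partner policies are hypothetical) of a sharing check that depends on the per-partner policy and on the semantic tag attached to each field, rather than on the application that produced the data:

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str          # e.g., "patient_name"
    semantic_tag: str  # e.g., "PII", "biometric", "telemetry"
    value: object

# Each partner gets its own policy: the set of semantic tags it may receive.
PARTNER_POLICIES = {
    "analytics_vendor": {"telemetry"},
    "care_partner": {"telemetry", "PII"},
}

def filter_for_partner(record, partner):
    """Keep only the fields whose semantic tag the partner is allowed to see."""
    allowed = PARTNER_POLICIES.get(partner, set())  # default deny: unknown partner gets nothing
    return [f for f in record if f.semantic_tag in allowed]

record = [
    Field("patient_name", "PII", "Jane Doe"),
    Field("device_temp_c", "telemetry", 36.7),
]

# The analytics vendor receives only telemetry; PII never leaves the source.
print([f.name for f in filter_for_partner(record, "analytics_vendor")])
# -> ['device_temp_c']
```

The point of the sketch is that the sharing decision is driven by the meaning of each field and the identity of the receiving partner, not by which application happens to emit the record.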

Limitations of Current Data Sharing Practices

Common approaches to data sharing used in practice today exhibit some major limitations.

Application-Level Sharing - Limited Scalability

One well-known approach is to embed the sharing policy within the application that is the source of the data being shared, or to overlay business logic on top of the application. The limitation is that this does not scale well, as one needs to map schemas across every pair of applications involved in the data exchange, or write business logic for all possible application pairs.
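
A quick back-of-the-envelope illustration of why this does not scale: with n applications exchanging data, point-to-point schema mappings (or pairwise business logic) grow quadratically, while mappings to a single shared canonical model grow only linearly.

```python
def pairwise_mappings(n):
    """One schema mapping (or piece of business logic) per application pair."""
    return n * (n - 1) // 2

def canonical_mappings(n):
    """One mapping per application to a shared canonical model."""
    return n

for n in (5, 20, 100):
    print(f"{n} apps: {pairwise_mappings(n)} pairwise vs {canonical_mappings(n)} canonical")
# 5 apps: 10 pairwise vs 5 canonical
# 20 apps: 190 pairwise vs 20 canonical
# 100 apps: 4950 pairwise vs 100 canonical
```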

Data Lake Solutions - Violates Governance

While cloud-based data storage and analytics services such as Snowflake are popular today, they require the data producer and data consumer to share their data with a 3rd party. For sensitive data that should be visible only to the sharing partners, disclosing it to a 3rd party violates their confidentiality and governance requirements.

Streaming Data Platforms - Content-Unaware Sharing

Solutions based on data streaming platforms such as Kafka or Kinesis can efficiently transport distributed data, but they are not aware of data semantics and hence are a blunt tool that cannot be used for policy-based sharing.
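
One way to see what "content-aware" would mean here is a sketch like the following, where sensitive fields are redacted before a record is ever handed to the streaming client. The publish function is a stand-in for whatever producer API (Kafka, Kinesis, etc.) is actually in use, and the tag names are hypothetical:

```python
import json

SENSITIVE_TAGS = {"PII", "biometric"}

def redact(record, tags):
    """Drop any field whose semantic tag marks it as sensitive."""
    return {k: v for k, v in record.items() if tags.get(k) not in SENSITIVE_TAGS}

def publish(topic, payload):
    # Stand-in for producer.send(...) / put_record(...) in a real streaming client.
    print(topic, json.dumps(payload))

record = {"patient_id": "P-123", "heart_rate": 72}
tags = {"patient_id": "PII", "heart_rate": "telemetry"}

# Only the non-sensitive content reaches the stream.
publish("shared-metrics", redact(record, tags))
# -> shared-metrics {"heart_rate": 72}
```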

The Need for Distributed Edge Analytics

As the data used by enterprises becomes more distributed, analytics too has to become more distributed and move to the edge[3].

The emerging requirement is for a new data platform that is distributed, with intelligence at the edge that can enforce centrally defined security policies without exposing data to anyone not involved in the data exchange.
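
As a rough sketch of that pattern (the policy, tag names, and actions are hypothetical), a policy defined once at the center can be pushed to every edge node and enforced locally, so raw data never travels to a central service just to be filtered:

```python
import hashlib

# Defined once, centrally; distributed to every edge node.
CENTRAL_POLICY = {
    "PII": "drop",        # never leaves the edge
    "biometric": "hash",  # only a digest is shared
    "telemetry": "pass",  # shared as-is
}

def enforce_at_edge(record, tags, policy):
    """Apply the centrally defined policy locally, on the edge node itself."""
    shared = {}
    for key, value in record.items():
        action = policy.get(tags.get(key, ""), "drop")  # default deny for unknown tags
        if action == "pass":
            shared[key] = value
        elif action == "hash":
            shared[key] = hashlib.sha256(str(value).encode()).hexdigest()
        # "drop": the field is simply never included in what is shared
    return shared

record = {"patient_name": "Jane Doe", "retina_scan": "scan-bytes", "device_temp_c": 36.7}
tags = {"patient_name": "PII", "retina_scan": "biometric", "device_temp_c": "telemetry"}

print(enforce_at_edge(record, tags, CENTRAL_POLICY))
# -> {'retina_scan': '<sha256 digest>', 'device_temp_c': 36.7}
```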


[1] https://www.theregister.com/2024/04/26/kaiser_patient_data/

[2] “Here’s how Facebook allowed Cambridge Analytica to get data for 50 million users,” Vox

[3] https://www.forbes.com/sites/forbestechcouncil/2021/03/15/computing-on-the-edge-can-be-transformative---but-look-before-you-leap/

Sankar N.

Principal Data Scientist | AIOps GenAI AIDC | Driving Innovation in Enterprise Observability

9 months ago

Interesting. So, essentially, it's about data governance.

Dilip P.

Writing a fresh story on a clean slate!

9 months ago

#DataSharing #distributeddata #semanticdrivensharing. Great points, Aloke! As enterprises move towards a hybrid future in which data is in several locations, it has an impact on where data is processed. Data is highly distributed and too large to move to a centralized cloud service. Enterprises must move the processing close to the data to deliver value faster and at a lower cost. From a security perspective, it also reduces the risk of data being compromised during transfer.
