Databricks Governance and Security (Data Masking): Implementation with Example

Saikrishna Cheruvu

Lead Developer | Data Engineer | MLOPS | ex@ BOFA

Published: October 19, 2022

A few words about data masking:

Data masking is a technique for creating a fake but realistic version of your organization's data. The goal is to secure sensitive data while also providing a functional alternative when real data is not required. For example, suppose a centralized database serves a large number of business users, and you want APAC users to see only APAC data and EMEA users to see only EMEA data. With masking applied, each group sees only its own region's data, so there is no need to create two sets of objects. Databricks provides features that help with these data masking tasks.

Data masking processes alter the values of data while maintaining the same format. The goal is to develop a version that cannot be decoded or reverse-engineered. Character shuffling, word or character substitution, and encryption are all methods for changing the data.
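
As a rough illustration of substitution-style masking, the sketch below uses only built-in Spark SQL functions (concat, right, regexp_replace, sha2). The literal values are made up for the example, and the hash is shown as a one-way stand-in rather than as encryption.

```sql
-- Hypothetical sample values; only built-in Spark SQL functions are used.
SELECT
  -- Character substitution: keep the last four characters, hide the rest.
  concat('XXX-XX-', right('123-45-6789', 4))                AS masked_ssn,
  -- Pattern substitution: replace the local part of an email address.
  regexp_replace('jane.doe@example.com', '^[^@]+', '*****') AS masked_email,
  -- One-way hash (not encryption): the same input always yields the same token.
  sha2('jane.doe@example.com', 256)                         AS email_token;
```

The first two columns come back as 'XXX-XX-6789' and '*****@example.com', so the masked values keep the shape of the originals while hiding the sensitive part.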

Check the below examples of production environment data (actual data) and test environment data (data masking applied).

How can we handle data masking on Databricks?

Dynamic view functions

Databricks includes two user functions that allow you to express column- and row-level permissions dynamically in the body of a view definition.

current_user(): returns the current user name.
is_member(): determines whether the current user is a member of a specific group.
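
A quick way to see what these two functions return for your own session is to call them directly. This is a minimal sketch, assuming a group named auditors exists in the workspace (the name is taken from the examples in this article).

```sql
-- Returns one row: the login name of the caller and a boolean group check.
-- 'auditors' is assumed to be an existing group; replace it with your own.
SELECT
  current_user()        AS user_name,
  is_member('auditors') AS in_auditors;
```

Running this as two different users is a simple check that group membership is resolved as expected before the functions are relied on inside a view.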

Column-level data masking

With column-level masking, you control which columns specific users or groups can see. Consider the following example, where only users who belong to the auditors group are able to see email addresses from the sales_raw table. At analysis time, Spark replaces the CASE statement with either the literal 'DATA MASKED' or the column email. This behavior allows for all the usual performance optimizations provided by Spark.

-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE VIEW sales_redacted AS
SELECT
  user_id,
  CASE WHEN
    is_member('auditors') THEN email
    ELSE 'DATA MASKED'
  END AS email,
  country,
  product,
  total
FROM sales_raw

Row-level data masking

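With row-level masking, you control which rows specific users or groups can see. Consider the following example, where only users who belong to the managers group are able to see transaction amounts (total column) greater than $1,000,000.00. The same view also applies column-level masking to email, so everyone outside the managers group sees only the domain part of each address: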

-- Non-managers see only rows with total <= 1,000,000 and only the domain
-- part of each email address; managers see every row and the full address.
-- OR REPLACE is used because the column-level example uses the same view name.
CREATE OR REPLACE VIEW sales_redacted AS
SELECT
  user_id,
  country,
  product,
  CASE
    WHEN is_member('managers') THEN email
    ELSE regexp_extract(email, '^.*@(.*)$', 1)
  END AS email,  -- Column-level masking
  total
FROM sales_raw
WHERE
  CASE
    WHEN is_member('managers') THEN TRUE
    ELSE total <= 1000000
  END;  -- Row-level masking
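
The same pattern can be applied to the regional scenario from the introduction, where APAC and EMEA users each see only their own region's rows. The sketch below is an illustration only: the view name sales_by_region, the group names apac_users and emea_users, and the country codes are all assumptions, not names from the original setup.

```sql
-- Hypothetical view name, group names, and country codes;
-- adjust them to match your workspace and data.
CREATE OR REPLACE VIEW sales_by_region AS
SELECT
  user_id,
  country,
  product,
  total
FROM sales_raw
WHERE
  CASE
    WHEN is_member('managers')   THEN TRUE                            -- all rows
    WHEN is_member('apac_users') THEN country IN ('IN', 'JP', 'AU')   -- APAC rows only
    WHEN is_member('emea_users') THEN country IN ('DE', 'FR', 'GB')   -- EMEA rows only
    ELSE FALSE                                                        -- no rows by default
  END;
```

Because CASE stops at the first matching branch, a user who belongs to more than one group gets the filter of the earliest branch listed, so the order of the WHEN clauses is part of the access policy.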

Here are a few reasons why data masking is critical for many organizations:

Data masking mitigates several critical threats, including data loss, data exfiltration, insider threats or account compromise, and insecure interfaces with third-party systems.
It reduces the data-related risks of cloud adoption.
Masked data is useless to an attacker while retaining many of its inherent functional properties.
It allows authorized users, such as testers and developers, to share data without exposing production data.
Data sanitization - normal file deletion leaves traces of data on storage media, whereas sanitization replaces the old values with masked ones.

Thank you!
