Data bricks Governance and Security(Data masking) Implementation with example

Data bricks Governance and Security(Data masking) Implementation with example

Some lines about Data masking:

Data masking is a technique for creating a dummy data (fake)?but realistic version of your organization's data. The goal is to secure sensitive data while also providing a functional alternative when real data is not required, example: If you want to limit APAC users to seeing only APAC data and EMEA data, you can use centralized databases and a large number of business users. Users can only see EMEA data, so there is no need to create two sets of objects. The data masking technique on data bricks assists with data masking tasks.

Data masking processes alter the values of data while maintaining the same format. The goal is to develop a version that cannot be decoded or reverse-engineered. Character shuffling, word or character substitution, and encryption are all methods for changing the data.

Check the below examples of production environment data (actual data) and test environment data (data masking applied).

No alt text provided for this image

How we can deal with the data masking on data bricks?

Dynamic view functions

Data bricks include two user functions that allow you to express column- and row-level permissions dynamically in the body of a view definition.

  • current_user(): return the current user name.
  • is_member(): determine if the current user is a member of a specific group

Column-level data masking

Specific groups or user can see. Consider the following example where only users who belong to the?auditor's?group are able to see email addresses from the?sales_raw?table. At analysis, time Spark replaces the?CASE?statement with either the literal?'DATA MASKED'?or the column?email. This behavior allows for all the usual performance optimizations provided by Spark.

-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE VIEW sales_redacted AS
SELECT
  user_id,
  CASE WHEN
    is_member('auditors') THEN email
    ELSE 'DATA MASKED'
  END AS email,
  country,
  product,
  total
FROM sales_raw        

Row Level data masking

The row or field level. Consider the following example, where only users who belong to the?managers?group are able to see transaction amounts (total?column) greater than $1,000,000.00:

-- permission logic from showing up directly in the total <= 1000000 results.
CREATE VIEW sales_redacted AS
SELECT
  user_id,
  country,
  product,
  CASE
    WHEN is_member('managers') THEN email
    ELSE regexp_extract(email, '^.*@(.*)$', 1)
  END email   --- Column level masking
  total
FROM sales_raw
WHERE
  CASE
    WHEN is_member('managers') THEN TRUE
    ELSE total <= 1000000
  END; -- Row level masking         

Here are a few reasons why data masking is critical for many organizations:

  • Data masking eliminates a number of critical threats, including data loss, data exfiltration, insider threats or account compromise, and insecure interfaces with third-party systems.
  • Reduces the risks associated with cloud adoption in terms of data.
  • Data is rendered unusable by an attacker while retaining many of its inherent functional properties.
  • Allows authorized users, such as testers and developers, to share data without exposing production data.
  • Data sanitization - normal file deletion leaves traces of data on storage media, whereas sanitization replaces the old values with masked ones.

Thank you!

Lovely, add a brief about on-the-fly data masking.

回复

要查看或添加评论,请登录

Saikrishna Cheruvu的更多文章

社区洞察

其他会员也浏览了