Data bricks Governance and Security(Data masking) Implementation with example
Some lines about Data masking:
Data masking is a technique for creating a dummy data (fake)?but realistic version of your organization's data. The goal is to secure sensitive data while also providing a functional alternative when real data is not required, example: If you want to limit APAC users to seeing only APAC data and EMEA data, you can use centralized databases and a large number of business users. Users can only see EMEA data, so there is no need to create two sets of objects. The data masking technique on data bricks assists with data masking tasks.
Data masking processes alter the values of data while maintaining the same format. The goal is to develop a version that cannot be decoded or reverse-engineered. Character shuffling, word or character substitution, and encryption are all methods for changing the data.
Check the below examples of production environment data (actual data) and test environment data (data masking applied).
How we can deal with the data masking on data bricks?
Dynamic view functions
Data bricks include two user functions that allow you to express column- and row-level permissions dynamically in the body of a view definition.
领英推荐
Column-level data masking
Specific groups or user can see. Consider the following example where only users who belong to the?auditor's?group are able to see email addresses from the?sales_raw?table. At analysis, time Spark replaces the?CASE?statement with either the literal?'DATA MASKED'?or the column?email. This behavior allows for all the usual performance optimizations provided by Spark.
-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE VIEW sales_redacted AS
SELECT
user_id,
CASE WHEN
is_member('auditors') THEN email
ELSE 'DATA MASKED'
END AS email,
country,
product,
total
FROM sales_raw
Row Level data masking
The row or field level. Consider the following example, where only users who belong to the?managers?group are able to see transaction amounts (total?column) greater than $1,000,000.00:
-- permission logic from showing up directly in the total <= 1000000 results.
CREATE VIEW sales_redacted AS
SELECT
user_id,
country,
product,
CASE
WHEN is_member('managers') THEN email
ELSE regexp_extract(email, '^.*@(.*)$', 1)
END email --- Column level masking
total
FROM sales_raw
WHERE
CASE
WHEN is_member('managers') THEN TRUE
ELSE total <= 1000000
END; -- Row level masking
Here are a few reasons why data masking is critical for many organizations:
Thank you!
Lovely, add a brief about on-the-fly data masking.