Handling Sensitive Data: Redaction, Masking and Compliance

Vijay Chaudhary

Lead Software Engineer

Published: February 2, 2025

In today's data-driven world, digital documents containing sensitive information pose challenges to privacy and compliance. Personally Identifiable Information (PII), such as social security numbers, credit card details, and tax file numbers, must be handled with care to prevent misuse or unauthorized access. Organizations are required to redact or mask such sensitive data to comply with stringent regulations like GDPR and HIPAA while also maintaining trust with their customers. Redaction solutions can play a key role in protecting this content.

I have been involved in projects around redaction and masking of data from image files for some time. I first came across the concept about a decade ago, when there was a regulatory requirement not to show sensitive tax file numbers (TFNs) to an offshore data indexing team.

Problem statement - The data indexing team sees the whole image except the areas where sensitive data is present; those areas are blacked out to protect the data and to meet the compliance requirement of not showing it to users in a different geography. This use case was about temporary redaction: the data was fully hidden from the indexing users, but once the indexing work was complete the original image (without the redacted zones) was restored for further action on the documents or for archival.

Let’s start by understanding both terms.

Redaction - No trace of the original content remains visible or accessible.

Masking - Partial details (like the last four digits of a number) are retained without revealing all of the sensitive information.
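To make the distinction concrete, here is a minimal sketch in Python. The function names `redact` and `mask` and the `[REDACTED]` placeholder are illustrative assumptions, not part of any particular product.

```python
def redact(value: str, placeholder: str = "[REDACTED]") -> str:
    """Redaction: nothing of the original value survives in the output."""
    return placeholder


def mask(value: str, visible_digits: int = 4, mask_char: str = "*") -> str:
    """Masking: hide every digit except the last `visible_digits`."""
    total_digits = sum(ch.isdigit() for ch in value)
    digits_to_hide = max(total_digits - visible_digits, 0)
    masked = []
    for ch in value:
        if ch.isdigit() and digits_to_hide > 0:
            masked.append(mask_char)
            digits_to_hide -= 1
        else:
            masked.append(ch)  # separators and the trailing digits are kept
    return "".join(masked)


card = "4111 1111 1111 1111"  # widely used test number, not a real card
print(redact(card))  # [REDACTED]
print(mask(card))    # **** **** **** 1111
```

Note that the masked output still reveals the length and grouping of the number, which is usually acceptable for masking but not for redaction.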

When implementing redaction, it is crucial to minimize false positives. Accidentally redacting non-sensitive information can lead to loss of critical data and operational inefficiencies, particularly when redaction is applied permanently. You can apply a checksum algorithm to verify that the number you have picked is the intended one; for some use cases, such as credit card numbers, the Luhn algorithm can be used to validate the number. Precise detection algorithms and human validation steps help mitigate such risks.
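As an illustration, a Luhn check can be written in a few lines of Python. This is a minimal sketch; the function name `luhn_valid` is an assumption, and passing the check only means the digits are consistent with a card number, not that the number is a real card.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True  (widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False (last digit changed)
```

Roughly one in ten random digit strings passes this check by chance, so it works best as a filter on top of pattern matching rather than as a detector on its own.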

[Non-Image Data]

For non-image data, another important concept to consider is tokenization. Instead of permanently masking sensitive information like a tax file number (TFN), the data can be replaced with a randomized token. This token can later be securely mapped back to the original value (detokenized) to allow controlled access when integrating with backend systems. For some formats, also delete any hidden text layers behind the redacted areas, especially in formats like PDF where OCR-processed text might still exist behind the visible image. If text layers are not properly removed, redacted information can still be extracted using simple text selection.
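The tokenization idea can be sketched as follows. This is a simplified in-memory illustration that resolves tokens through a lookup table rather than through decryption; the class name `TokenVault`, the `tok_` prefix and the `authorized` flag are assumptions, and a real system would use a hardened, access-controlled token store.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, carries no information
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str, authorized: bool) -> str:
        # The access decision is reduced to a flag here for illustration only.
        if not authorized:
            raise PermissionError("caller may not resolve tokens")
        return self._token_to_value[token]


vault = TokenVault()
tfn = "123 456 782"  # fictitious tax file number
token = vault.tokenize(tfn)
record = {"name": "Jane Citizen", "tfn": token}  # downstream systems see the token
original = vault.detokenize(token, authorized=True)
```

Because the token is random, nothing about the original value can be derived from it without access to the vault.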

Another critical aspect of redaction for non-image data is ensuring that sensitive data is not embedded within larger values. For example, a BPAY reference number or another transaction identifier might contain a credit card number, making it hard to detect with simple pattern-matching techniques. PII detection systems must go beyond standalone values and perform deeper analysis to identify instances where sensitive information is embedded inside larger strings. Once a sensitive value is detected, it should be searched for and redacted across the entire document. This helps prevent data leakage through indirect exposure and supports regulatory compliance.
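A minimal sketch of the "search and redact everywhere" step is shown below. It assumes the sensitive value has already been detected, and it matches the value's digits with optional space or hyphen separators so that embedded and reformatted copies are caught too. The function name `redact_everywhere` and the sample reference number are illustrative.

```python
import re


def redact_everywhere(text: str, sensitive_value: str,
                      placeholder: str = "[REDACTED]") -> str:
    """Replace every occurrence of the value's digits, even inside longer strings."""
    digits = [ch for ch in sensitive_value if "0" <= ch <= "9"]
    if not digits:
        return text
    # Allow one optional space or hyphen between consecutive digits.
    pattern = r"[ -]?".join(digits)
    return re.sub(pattern, placeholder, text)


card = "4111 1111 1111 1111"  # widely used test number
document = "Card: " + card + ". BPAY ref: 99" + card.replace(" ", "") + "07."
print(redact_everywhere(document, card))
# Card: [REDACTED]. BPAY ref: 99[REDACTED]07.
```

The surviving digits of the reference number may still need review, since the remainder can sometimes be enough to re-identify a transaction.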

[Image]

Our focus will mostly be on image-based files. Let's go through some of the possible use cases.

[1] Permanent Redaction/Masking - A black or colored area is permanently burnt onto the source image, or a few characters are permanently changed into “*” or another character, with no intention to restore the data.

[2] Temporary Redaction/Masking - A black or colored area is temporarily overlaid on the source image, or a few characters are temporarily changed into “*” or another character, with the intention to restore the data later; the original image and data are maintained.

[3] Masking/Redaction Indexing - Put humans in the loop to confirm that the correct sensitive data has been hidden: a review process where a human validates or adjusts automatically identified redaction zones (e.g. a phone number mis-detected as a credit card number).

[4] Selective Redaction/Masking by Role or Context - Redact only certain fields for certain user roles (e.g. the full Aadhaar number is visible to an HR manager but only the last four digits are shown to others).

[5] On-the-Fly Redaction/Masking - Store the original images in a repository and generate redacted/masked versions on demand.
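Use cases [4] and [5] can be combined in a small sketch: the original record is stored untouched and a role-specific view is generated on demand. The role names, field names and the `FIELD_POLICY` table below are hypothetical, and the example works on text fields rather than image zones to keep it short.

```python
# Hypothetical policy: which fields each role may see in full.
FIELD_POLICY = {
    "hr_manager": {"name", "aadhaar_number", "phone"},
    "team_lead": {"name", "phone"},
}


def view_for_role(record: dict, role: str, placeholder: str = "[REDACTED]") -> dict:
    """Build a redacted copy of `record` for `role`; the stored record is not modified."""
    allowed = FIELD_POLICY.get(role, set())  # unknown roles see nothing
    return {
        field: (value if field in allowed else placeholder)
        for field, value in record.items()
    }


stored = {"name": "A. Kumar", "aadhaar_number": "1234 5678 9012", "phone": "555-0100"}
hr_view = view_for_role(stored, "hr_manager")
lead_view = view_for_role(stored, "team_lead")
guest_view = view_for_role(stored, "guest")
```

Defaulting unknown roles to an empty set means a missing policy entry hides data rather than exposing it.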

For simplicity, let’s stick to redaction use cases for now. At a high level, the following steps could be involved in building a redaction solution:

There are different technologies we can use to build such a solution. The two key parts are identifying the sensitive data and then applying redaction using a custom or off-the-shelf program. Here are a few options to try:

Google Cloud Platform (GCP)

Cloud Vision API
Cloud Data Loss Prevention (DLP)
Custom program to apply the redaction

Microsoft Azure

ACS Computer Vision
ACS Document Intelligence
Text Analytics (PII Detection)
Custom program to apply the redaction

Third Party Applications

Identify sensitive data using proprietary technology

Custom program to apply the redaction

Alternatively, the application itself may offer a built-in feature to apply redaction

Note - Azure does not currently offer a single “one-click” image redaction service for PII in the same way GCP does with Cloud DLP. A common approach is to combine Computer Vision OCR (or Form Recognizer) with Text Analytics PII detection and then mask the resulting bounding boxes in the image.

Now let's focus on an example custom program: its input, processing steps and output.
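The glue step in that approach, mapping a detected PII string back to the OCR word boxes, can be sketched without calling any cloud API. The word dictionaries below are a hypothetical simplification of OCR output, not the actual response format of any service, and `find_zone` is an illustrative name.

```python
def find_zone(ocr_words, pii_text):
    """Return one box enclosing the consecutive OCR words that spell `pii_text`."""
    tokens = pii_text.split()
    if not tokens:
        return None
    for start in range(len(ocr_words) - len(tokens) + 1):
        window = ocr_words[start:start + len(tokens)]
        if [word["text"] for word in window] == tokens:
            # Merge the individual word boxes into a single enclosing rectangle.
            return {
                "x1": min(word["x1"] for word in window),
                "y1": min(word["y1"] for word in window),
                "x2": max(word["x2"] for word in window),
                "y2": max(word["y2"] for word in window),
            }
    return None  # text not found on this page


# Fictitious OCR output: one entry per recognized word, coordinates in pixels.
OCR_WORDS = [
    {"text": "Card:", "x1": 5, "y1": 30, "x2": 35, "y2": 60},
    {"text": "4111", "x1": 40, "y1": 30, "x2": 90, "y2": 60},
    {"text": "1111", "x1": 96, "y1": 31, "x2": 146, "y2": 60},
    {"text": "1111", "x1": 152, "y1": 30, "x2": 202, "y2": 61},
    {"text": "1111", "x1": 208, "y1": 30, "x2": 260, "y2": 60},
]

zone = find_zone(OCR_WORDS, "4111 1111 1111 1111")
print(zone)  # {'x1': 40, 'y1': 30, 'x2': 260, 'y2': 61}
```

This exact-match lookup only handles text on a single line; values that wrap across lines or that OCR splits differently would need fuzzier matching.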

A fictitious mapped sensitive data table would look something like this:

Sensitive Data - The detected text (credit card number, tax file number, SSN, etc.) flagged by your PII detection process.

Page Number - The page number within a multi-page document.

X1, Y1 - Top-left corner of the bounding box, measured in pixels.

X2, Y2 - Bottom-right corner of the bounding box, also in pixels.
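One possible in-memory representation of such a table is sketched below. The class name `RedactionZone`, the helper `zones_by_page` and all values are illustrative assumptions; the fields simply mirror the columns described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactionZone:
    sensitive_data: str  # detected text, kept only for audit or review
    page_number: int     # 1-based page index
    x1: int              # top-left corner, pixels
    y1: int
    x2: int              # bottom-right corner, pixels
    y2: int

    def __post_init__(self):
        if self.page_number < 1:
            raise ValueError("page_number must be 1 or greater")
        if self.x2 <= self.x1 or self.y2 <= self.y1:
            raise ValueError("bottom-right corner must be below and right of top-left")


def zones_by_page(zones):
    """Group zones by page so each page image can be redacted separately."""
    grouped = defaultdict(list)
    for zone in zones:
        grouped[zone.page_number].append(zone)
    return dict(grouped)


# Fictitious rows of the mapped table.
ZONES = [
    RedactionZone("4111 1111 1111 1111", 1, 40, 30, 260, 60),
    RedactionZone("123 456 782", 1, 40, 90, 180, 120),
    RedactionZone("123 456 782", 2, 300, 410, 440, 440),
]
pages = zones_by_page(ZONES)
```

Validating the coordinates when the row is created catches swapped corners before they reach the drawing step.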

Let’s go through a sample program, written in Python, to see how redaction would work.

[1] Load the required packages and do the imports.

[2] Load the fictitious bounding boxes into a list of dictionaries. In a real workflow, these come from OCR + PII detection.

[3] Open the local input image and draw a filled rectangle on the image over each bounding box.

Note - Page number is not taken into account here; a real use case may involve multiple pages as well.

[4] Set the input image, call the above method to redact, and then save and deliver the redacted output image.

[5] Check whether the fictitious sensitive data areas of the image have been redacted. You should see the corresponding portions of the image blacked out.
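The five steps above can be sketched as follows. This is one possible version that assumes the Pillow imaging library, not necessarily the original program; the function names, file paths and coordinates are illustrative. A blank in-memory page stands in for the input image so the sketch runs without any files.

```python
# [1] Imports (assumes Pillow is installed: pip install Pillow)
from PIL import Image, ImageDraw

# [2] Fictitious bounding boxes; in a real workflow these come from OCR + PII detection.
SENSITIVE_ZONES = [
    {"sensitive_data": "4111 1111 1111 1111", "page": 1,
     "x1": 40, "y1": 30, "x2": 260, "y2": 60},
    {"sensitive_data": "123 456 782", "page": 1,
     "x1": 40, "y1": 90, "x2": 180, "y2": 120},
]


# [3] Draw a filled rectangle over each bounding box (page number is ignored here).
def redact_image(image, zones, fill=(0, 0, 0)):
    """Return a redacted RGB copy of `image`; the input image is left unchanged."""
    redacted = image.convert("RGB")  # convert() returns a new image object
    draw = ImageDraw.Draw(redacted)
    for zone in zones:
        draw.rectangle(
            [zone["x1"], zone["y1"], zone["x2"], zone["y2"]], fill=fill
        )
    return redacted


# [4] Open a local image, redact it, and save the result to a new file.
def redact_file(input_path, output_path, zones):
    with Image.open(input_path) as source:
        redact_image(source, zones).save(output_path)


# [5] Quick check on a blank stand-in page instead of a real scan.
page = Image.new("RGB", (300, 150), "white")
redacted_page = redact_image(page, SENSITIVE_ZONES)
print(redacted_page.getpixel((100, 45)))  # (0, 0, 0) inside the first zone
print(redacted_page.getpixel((10, 10)))   # (255, 255, 255) outside any zone
```

Saving the output to a separate file leaves the original untouched, which fits the temporary redaction use case. For permanent redaction, the original would also have to be removed or replaced.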

Summary

Redaction and masking are important tools for protecting privacy and ensuring regulatory compliance. We covered how sensitive data like tax file numbers, social security numbers, or credit card details can be handled in image files.

We also explored the difference between redaction (fully hiding sensitive information) and masking (which allows partial visibility, such as showing only the last few digits of a number). Real-world scenarios like temporary redaction for data indexing queues and selective masking for user-specific roles show the flexibility of these techniques. By combining technologies like Google Cloud DLP, Microsoft Azure's PII detection, and custom programs, organizations can explore solutions that fit their data protection needs.
