Handling Sensitive Data: Redaction, Masking and Compliance

In today's data-driven world, digital documents containing sensitive information pose challenges to privacy and compliance. Personally Identifiable Information (PII), such as social security numbers, credit card details, and tax file numbers, must be handled with care to prevent misuse or unauthorized access. Organizations are often required to redact or mask such data to comply with stringent regulations like GDPR and HIPAA while maintaining the trust of their customers. Redaction solutions can play an important role in protecting sensitive content.

I have been involved in projects around redaction or masking of data from image files for some time. When I first came across this concept a decade ago, there was a regulatory need to hide the sensitive tax file number (TFN) from an offshore data indexing team.

Problem statement: the data indexing team sees the whole image except the areas where sensitive data is present. Those areas are blacked out to protect the data and to meet the compliance requirement of not showing it to users in a different geography. This use case was about temporary redaction: the data was fully hidden from the indexing users, but once indexing was completed the original image was restored (the redacted zones were removed) for further action on the documents or for archival.

Let's start by understanding both of these terms.

Redaction - No trace of the original content remains visible or accessible.

Masking - Partial details are retained (like the last four digits of a number) without revealing all of the sensitive information.
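To make the distinction concrete, here is a minimal Python sketch. The function names `redact` and `mask`, the `[REDACTED]` placeholder and the sample number (a widely used test card number) are assumptions chosen for illustration, not part of any particular product.

```python
def redact(value: str, placeholder: str = "[REDACTED]") -> str:
    """Replace the whole value so that no trace of it remains."""
    # The value is intentionally ignored: not even its length is kept.
    return placeholder


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Hide everything except the last `visible` characters."""
    visible = max(visible, 0)
    hidden = max(len(value) - visible, 0)
    return fill * hidden + value[hidden:]


sample = "4111111111111111"  # test card number, not a real account
print(redact(sample))
print(mask(sample))
```

A fixed placeholder is used for redaction so that even the length of the original value is not revealed, while masking deliberately keeps the trailing digits for identification.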

When implementing redaction, it is crucial to minimize false positives. Accidentally masking non-sensitive information can lead to loss of critical data and operational inefficiencies, particularly when redaction is applied permanently. You can apply a checksum algorithm to verify that the number you are picking is the intended one. For some use cases, such as credit card numbers, you can use the Luhn algorithm to validate the number. Precision in detection algorithms and human validation steps can help mitigate such risks.
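As a rough illustration of the checksum idea, the sketch below implements the Luhn check in Python. The function name `luhn_valid` is an assumption, and the numbers are common test values. Passing the check only tells us that a number is plausibly a card number; it does not prove that it is one.

```python
def luhn_valid(number: str) -> bool:
    """Return True if `number` passes the Luhn checksum."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, ch in enumerate(reversed(cleaned)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

A candidate that fails the check can be dropped before redaction, which reduces the chance of blacking out an unrelated 16-digit number.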

[Non-Image Data]

For non-image data, another important concept to consider is tokenization. Instead of permanently masking sensitive information like a tax file number (TFN), the data can be replaced with a randomized token. The token can later be securely exchanged for the original value (detokenized) to allow controlled access when integrating with backend systems. For some formats, also delete any hidden text layers behind the redacted areas, especially in formats like PDF where OCR-processed text might still exist behind the visible image. If text layers are not properly removed, redacted information can still be extracted using simple text selection.
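The tokenization idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `TokenVault` class, the `tok_` prefix and the boolean `authorized` flag stand in for a secured token store and a real access-control check.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a secured token store (hypothetical)."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, carries no data
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str, authorized: bool) -> str:
        # A real system would check the caller's identity and log the access.
        if not authorized:
            raise PermissionError("caller is not allowed to detokenize")
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("123 456 782")  # fictitious TFN-style sample
print(token)
print(vault.detokenize(token, authorized=True))
```

Because the token is random rather than derived from the value, it reveals nothing on its own; the sensitive value lives only in the vault.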

Another critical aspect of redaction for non-image data is detecting sensitive data that is embedded within larger values. For example, a BPAY reference number or another transaction identifier might contain a credit card number, which is hard to detect with simple pattern-matching techniques. PII detection systems must go beyond standalone values and analyze larger strings for embedded sensitive information. Once a sensitive value is detected, it should be searched for and redacted across the entire document. This helps prevent data leakage through indirect exposure and supports regulatory compliance.
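One possible way to look for embedded values is sketched below: slide a 16-digit window over every longer run of digits, keep the windows that pass the Luhn check, and then redact each hit everywhere in the text. The function names, the reference number and the sample text are all made up for illustration. A sliding window will also flag some innocent digit runs (roughly one random window in ten passes Luhn), so a human review step is still advisable.

```python
import re


def luhn_ok(digits: str) -> bool:
    """Luhn checksum over a string that contains only ASCII digits."""
    total = 0
    for index, ch in enumerate(reversed(digits)):
        d = int(ch)
        if index % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def find_embedded_cards(text: str, length: int = 16) -> set:
    """Return Luhn-valid windows of `length` digits inside digit runs."""
    found = set()
    for run in re.findall(r"[0-9]{%d,}" % length, text):
        for start in range(len(run) - length + 1):
            window = run[start:start + length]
            if luhn_ok(window):
                found.add(window)
    return found


def redact_everywhere(text: str, values: set) -> str:
    """Replace every occurrence of each detected value in the text."""
    for value in sorted(values, key=len, reverse=True):
        text = text.replace(value, "[REDACTED]")
    return text


# Fictitious reference number with a test card number embedded in it.
document = "BPAY ref 90411111111111111107 paid. Card on file: 4111111111111111."
hits = find_embedded_cards(document)
print(redact_everywhere(document, hits))
```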

[Image]

Our focus will mostly be on image-based files. Now let's go through some of the possible use cases.

[1] Permanent Redaction/Masking - A black or colored area is permanently burnt onto the source image, or a few characters are permanently changed into “*” or another symbol, with no intention of restoring the data.

[2] Temporary Redaction/Masking - A black or colored area is temporarily placed on top of the source image, or a few characters are temporarily changed into “*” or another symbol, with the intention of restoring the data later. The original image and data are maintained.

[3] Masking/Redaction Indexing - Put humans in the loop to confirm that the correct sensitive data has been hidden. This is a review process where a human validates or adjusts automatically identified redaction zones (e.g. a phone number mis-detected as a credit card).

[4] Selective Redaction/Masking by Role or Context - Redact only certain fields for certain user roles (e.g., the full Aadhaar number is visible to an HR manager but only the last four digits are shown to others).

[5] On-the-Fly Redaction/Masking - Store the original images in a repository and generate redacted/masked versions on demand.
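Use cases [4] and [5] can be combined in a small sketch: the original value stays in the store and a masked view is produced at read time depending on the caller's role. The store, the role names and the 12-digit sample number are hypothetical, and text stands in for image content to keep the example short.

```python
# Hypothetical repository holding the original, unmasked values.
STORE = {"doc-001": "1234 5678 9012"}  # fictitious 12-digit sample

# Assumption: only these roles may see the full value.
FULL_ACCESS_ROLES = frozenset({"hr_manager"})


def mask_digits(value: str, visible: int = 4) -> str:
    """Mask all but the last `visible` digits, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in value)
    to_hide = max(total_digits - visible, 0)
    out = []
    for ch in value:
        if ch.isdigit() and to_hide > 0:
            out.append("*")
            to_hide -= 1
        else:
            out.append(ch)
    return "".join(out)


def render(doc_id: str, role: str) -> str:
    """Build the view for `role` on demand; the stored value is untouched."""
    original = STORE[doc_id]
    if role in FULL_ACCESS_ROLES:
        return original
    return mask_digits(original)


print(render("doc-001", "hr_manager"))
print(render("doc-001", "indexer"))
```

Since nothing is written back to the store, the same document can serve different roles without keeping multiple masked copies.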

For simplicity, let's stick to redaction use cases for now. At a high level, building a redaction solution involves two key parts: identifying the sensitive data, and then applying redaction using a custom program or an off-the-shelf program. There are different technologies we can use for this. Here are a few options we can try:

Google Cloud Platform (GCP)

  • Cloud Vision API
  • Cloud Data Loss Prevention (DLP)
  • Custom program to apply redaction

Microsoft Azure

  • ACS Computer Vision
  • ACS Document Intelligence
  • Text Analytics (PII Detection)
  • Custom program to apply redaction

Third Party Applications

  • Identify sensitive data using proprietary technology
  • Custom program to apply redaction
  • Alternatively, the application itself may offer a similar feature to apply redaction

Note - Azure does not currently offer a single “one-click” image redaction service for PII in the same way GCP does with Cloud DLP. A common approach is to combine Computer Vision OCR (or Form Recognizer) with Text Analytics PII detection and then mask the bounding boxes in the image.
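The glue step in that combined approach is mapping a detected PII span back to the OCR word boxes that cover it. A minimal sketch follows, assuming the OCR step yields words in reading order as `(text, (x1, y1, x2, y2))` tuples and the PII step yields a character offset and length into the space-joined text. These shapes are assumptions for illustration, not the response format of any cloud service.

```python
def boxes_for_entity(words, offset, length):
    """Return the boxes of all OCR words overlapping the entity span."""
    boxes = []
    cursor = 0
    end = offset + length
    for text, box in words:
        word_start, word_end = cursor, cursor + len(text)
        if word_start < end and word_end > offset:
            boxes.append(box)
        cursor = word_end + 1  # skip the joining space
    return boxes


def union_box(boxes):
    """Smallest rectangle that contains every given box."""
    if not boxes:
        raise ValueError("no boxes to merge")
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


# Fictitious OCR output for one line of a scanned form.
words = [
    ("TFN:", (40, 100, 90, 120)),
    ("123", (100, 100, 140, 120)),
    ("456", (150, 100, 190, 120)),
    ("782", (200, 100, 240, 120)),
]
full_text = " ".join(text for text, _ in words)
entity = "123 456 782"  # pretend this came from PII detection
hit = boxes_for_entity(words, full_text.find(entity), len(entity))
print(union_box(hit))
```

Merging boxes is only safe when the words sit on one line; for an entity that wraps across lines it is better to redact each word box separately.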

Now let's focus on an example custom program: its input, processing steps and output.


Custom Program Outline

A fictitious table of mapped sensitive data would contain columns like these:

Sensitive Data - The detected text (credit card number, tax file number, SSN, etc.) flagged by your PII detection process.

Page Number - The page number within a multi-page document.

X1, Y1 - Top-left corner of the bounding box, measured in pixels.

X2, Y2 - Bottom-right corner of the bounding box, also in pixels.
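One possible way to hold rows of this table in a program is sketched below. The `RedactionZone` class, its field names and the sample rows are assumptions made up for illustration. Validating the coordinates up front and grouping zones by page keeps the later drawing step simple.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactionZone:
    sensitive_data: str
    page_number: int
    x1: int  # top-left corner, in pixels
    y1: int
    x2: int  # bottom-right corner, in pixels
    y2: int

    def __post_init__(self) -> None:
        if self.page_number < 1:
            raise ValueError("page_number starts at 1")
        if self.x1 >= self.x2 or self.y1 >= self.y2:
            raise ValueError("(x1, y1) must be above and left of (x2, y2)")


def group_by_page(zones):
    """Map each page number to the zones that belong on that page."""
    pages = defaultdict(list)
    for zone in zones:
        pages[zone.page_number].append(zone)
    return dict(pages)


# Fictitious rows; real ones would come from OCR + PII detection.
zones = [
    RedactionZone("4111111111111111", 1, 50, 40, 250, 70),
    RedactionZone("123 456 782", 1, 50, 100, 180, 130),
    RedactionZone("123 456 782", 2, 60, 300, 190, 330),
]
for page, page_zones in group_by_page(zones).items():
    print(page, len(page_zones))
```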

Let's go through a sample program to see how redaction works, using the Python programming language.

[1] Load the required packages and do the imports.

[2] Load the fictitious bounding boxes into a list of dictionaries. In a real workflow, these come from OCR + PII detection.

[3] Open the local input image and draw rectangles on the image over each bounding box.

Note - The page number is not taken into account here; a real use case may involve multiple pages.

[4] Set the input image, call the above method to redact, and then save and deliver the redacted output image.

[5] Check whether the fictitious sensitive data areas of the image are redacted. You should see the redacted portions of the image.
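The five steps could look roughly like the sketch below. It assumes the Pillow imaging library is installed; the bounding boxes, the function name `redact_image` and the image size are made up. So that the sketch runs without any files, a blank white canvas stands in for the scanned input and the output is written to an in-memory buffer.

```python
# [1] Imports (Pillow is a third-party package: pip install Pillow).
import io

from PIL import Image, ImageDraw

# [2] Fictitious bounding boxes; a real workflow gets these from OCR + PII detection.
BOUNDING_BOXES = [
    {"sensitive_data": "4111111111111111", "page": 1,
     "x1": 50, "y1": 40, "x2": 250, "y2": 70},
    {"sensitive_data": "123 456 782", "page": 1,
     "x1": 50, "y1": 100, "x2": 180, "y2": 130},
]


def redact_image(image, boxes, fill="black"):
    """[3] Return a copy of `image` with a filled rectangle over each box."""
    redacted = image.convert("RGB")  # convert() returns a new image
    draw = ImageDraw.Draw(redacted)
    for box in boxes:
        # The page number is ignored in this single-page sketch.
        draw.rectangle([box["x1"], box["y1"], box["x2"], box["y2"]], fill=fill)
    return redacted


# [4] Stand-in for the input; a real program would use Image.open(<input path>).
source = Image.new("RGB", (400, 200), "white")
result = redact_image(source, BOUNDING_BOXES)
buffer = io.BytesIO()  # a real program would pass an output file path instead
result.save(buffer, format="PNG")

# [5] Spot-check one pixel inside a zone and one outside all zones.
print(result.getpixel((100, 55)))
print(result.getpixel((300, 150)))
```

Because `redact_image` works on a copy, the source image is left untouched, which suits the temporary redaction use case. In the saved output the rectangles replace the underlying pixels, so the hidden content cannot be recovered from that file.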

Summary

Redaction and masking are important today for protecting privacy and ensuring regulatory compliance. We covered how sensitive data like tax file numbers, social security numbers, or credit card details can be handled in image files.

We also explored the difference between redaction (fully hiding sensitive information) and masking (which allows partial visibility, such as showing only the last few digits of a number). Real-world scenarios like temporary redaction for data indexing queues and selective masking for user-specific roles show the flexibility of these techniques. By combining technologies like Google Cloud DLP, Microsoft Azure's PII detection, and custom programs, organizations can explore solutions that fit their data protection needs.

