Data Lake - Part 3: Simplifying Secure Data Analytics and AI Workflows

This article discusses a common challenge: securely managing data analytics and AI projects that involve sensitive data. Strict regulations and standards such as GDPR, PCI DSS, HIPAA, and FHIR/HL7 add complexity to these projects.

We'll explore a reference architecture that streamlines data access for analytics, AI modeling, and deploying models to production (MLOps). This architecture incorporates security measures to ensure compliance with relevant regulations.

Note: This is Part 3 of the Data Lake series; please refer to Part 2 for better context.

Keeping Sensitive Data Safe in a Data Lake

Here's a breakdown of the additional security requirements for handling sensitive data in a data lake:

Sharing Sensitive Data Securely:

  • Redaction and Tokenization: Sensitive data can be masked (redacted) or replaced with tokens before being shared with a wider data science audience for experiments. Data analytics jobs can then flag records containing sensitive data for special attention.
  • De-tokenization: The source system can reverse the tokenization process to recreate the original data for specific purposes, like comparing it with other records or taking action based on it.
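To make redaction and tokenization concrete, here is a minimal Python sketch (an illustration, not a production design). The in-memory dictionary stands in for a secure token vault held by the source system, and the field names are invented for the example.

```python
# Minimal redaction/tokenization sketch: sensitive values are masked or
# replaced with opaque tokens before sharing; only the source system keeps
# the token-to-value mapping (a plain dict stands in for a secure vault).
import secrets

token_vault: dict[str, str] = {}  # token -> original value (source side only)


def tokenize(value: str) -> str:
    """Replace a sensitive value with an opaque random token."""
    token = f"tok_{secrets.token_hex(8)}"
    token_vault[token] = value
    return token


def detokenize(token: str) -> str:
    """Reverse the mapping; only the source system can perform this lookup."""
    return token_vault[token]


def redact(value: str, keep_last: int = 4) -> str:
    """Mask all but the last few characters, e.g. a card or account number."""
    return "*" * max(len(value) - keep_last, 0) + value[-keep_last:]


record = {"customer_id": "C-1001", "card_number": "4111111111111111"}
shared = {
    "customer_id": tokenize(record["customer_id"]),
    "card_number": redact(record["card_number"]),
}
print(shared)                             # safe to hand to the wider audience
print(detokenize(shared["customer_id"]))  # source-side de-tokenization only
```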

Encryption Throughout the Process:

  • Encryption at Rest, In Motion, and Processing: Data must be encrypted at all times, whether it's stored (at rest), transferred (in motion), or being processed. This ensures that even a system crash (core dump) wouldn't expose sensitive data in system logs or files.
  • Tenant-Managed Encryption Keys and Zero Trust: Each tenant's data is encrypted using a key they manage themselves, following Zero Trust security principles. This means even the data lake administrator cannot access a tenant's data unless they are part of a tenant’s authorized user group.
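As a hedged sketch of tenant-managed keys, the Python example below uses the `cryptography` package's Fernet primitive. In a real deployment each key would be fetched from the tenant's own key vault/KMS under Zero Trust controls; the dictionary, tenant names, and payload here are invented for illustration.

```python
# Per-tenant encryption sketch (pip install cryptography). The tenant_keys
# dict is a stand-in for keys resolved from each tenant's own key vault/KMS,
# which the data lake operator cannot read.
from cryptography.fernet import Fernet

tenant_keys = {
    "tenant-a": Fernet.generate_key(),  # in practice: fetched from tenant's KMS
    "tenant-b": Fernet.generate_key(),
}


def encrypt_for_tenant(tenant_id: str, plaintext: bytes) -> bytes:
    """Encrypt data with the tenant's own key before it lands in the lake."""
    return Fernet(tenant_keys[tenant_id]).encrypt(plaintext)


def decrypt_for_tenant(tenant_id: str, ciphertext: bytes) -> bytes:
    """Only callers authorised for this tenant can resolve the key."""
    return Fernet(tenant_keys[tenant_id]).decrypt(ciphertext)


blob = encrypt_for_tenant("tenant-a", b'{"patient_id": "P-42", "dob": "1980-01-01"}')
print(decrypt_for_tenant("tenant-a", blob))
# Without tenant-a's key, even a data lake administrator cannot decrypt `blob`.
```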

Secure Deployment and Access:

  • Infrastructure as Code: Tools like Terraform manage infrastructure deployment, Helm charts handle Kubernetes workload deployment, and Ansible takes care of pre-deployment cleanup and post-deployment tasks.
  • Limited User Access: Direct user access to data lake components is restricted. Access is only allowed for monitoring tools and event/alert notification systems on a separate network (e.g., VNet in Azure, VPC in AWS).
  • "Break Glass" Access and Secure Credentials: DevOps can only access the system in critical situations ("break glass" scenarios) with manual approval and additional controls. A secure vault service stores all access credentials and certificates.

Auditing and Data Distribution:

  • API Access Audit Trail: The data distribution API tracks who accessed what data and when, providing a clear audit log for security purposes.
  • Data Distribution via APIs: All data shall be distributed through secure APIs. These APIs offer functionalities like:

- Data Discovery: Searchable and navigable data catalog.

- Data Quality Reports: Get insights into the data quality before using it.

- Data Query Definitions: Register data request queries for future use.

- Data Subscriptions: Automatically receive data updates based on pre-registered queries.
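A hypothetical client-side view of these functions is sketched below; the base URL, endpoint paths, and JSON fields are invented for illustration and do not describe a published API.

```python
# Illustrative calls against a hypothetical data-distribution API.
# Every request carries an OAuth 2 bearer token and is logged in the
# audit trail described above.
import requests

BASE_URL = "https://datalake.example.com/api/v1"  # placeholder
HEADERS = {"Authorization": "Bearer <access-token-from-oauth2>"}

# 1. Data discovery: search the catalog for datasets about "claims"
catalog = requests.get(
    f"{BASE_URL}/catalog/search", params={"q": "claims"}, headers=HEADERS
).json()

# 2. Data quality report for a chosen dataset
quality = requests.get(
    f"{BASE_URL}/datasets/claims_2024/quality-report", headers=HEADERS
).json()

# 3. Register a reusable query definition
query = requests.post(
    f"{BASE_URL}/queries",
    json={"name": "open_claims", "sql": "SELECT * FROM claims WHERE status = 'OPEN'"},
    headers=HEADERS,
).json()

# 4. Subscribe to automatic deliveries for that registered query
requests.post(
    f"{BASE_URL}/subscriptions",
    json={"query_id": query["id"], "schedule": "daily"},
    headers=HEADERS,
).raise_for_status()
```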

Development and Model Management:

  • Synthetic Data for Development and Validation: All development and validation activities should use synthetic data (artificially generated) or carefully curated sample data.
  • MLOps Workflow: Validated AI models are promoted to production and then retrained with real-world data as part of an automated MLOps (Machine Learning Operations) workflow.
  • Model Monitoring and Drift Detection: Continuously monitor the performance of production models and fine-tune them to address data drift or model drift (where the model's performance degrades over time).
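One simple way to approximate drift detection (among many) is to compare a recent production sample of a feature against its training-time reference distribution with a two-sample Kolmogorov-Smirnov test. The sketch below uses synthetic data and an illustrative threshold; real monitoring would run per feature and feed an alerting or retraining trigger.

```python
# Basic data-drift check: two-sample KS test between the training-time
# reference distribution and a recent production sample of one feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training distribution
production = rng.normal(loc=0.3, scale=1.1, size=5_000)  # recent production data

result = ks_2samp(reference, production)
if result.pvalue < 0.01:                                  # illustrative threshold
    print(f"Drift suspected (KS={result.statistic:.3f}, "
          f"p={result.pvalue:.4f}) -> trigger model retraining")
else:
    print("No significant drift detected")
```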

Authentication and Authorisation:

  • OAuth 2 for Secure Access: All APIs leverage OAuth 2, a secure authorisation framework, to control who can access the data lake and what they can do.
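For service-to-service access, the usual fit is the OAuth 2.0 client-credentials grant. The sketch below uses placeholder URLs, client IDs, and scopes, with the client secret assumed to come from the secure vault mentioned earlier.

```python
# OAuth 2.0 client-credentials flow sketch: obtain a short-lived access
# token, then attach it as a bearer token to every data-lake API call.
import requests

TOKEN_URL = "https://auth.example.com/oauth2/token"  # placeholder

resp = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "data-science-notebook",          # placeholder client
        "client_secret": "<retrieved-from-secure-vault>",
        "scope": "datalake.read catalog.search",       # placeholder scopes
    },
    timeout=10,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]

headers = {"Authorization": f"Bearer {access_token}"}
datasets = requests.get(
    "https://datalake.example.com/api/v1/catalog/search",  # placeholder API
    params={"q": "encounters"},
    headers=headers,
    timeout=10,
).json()
```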

Data Lake Architecture Considerations

When dealing with complex data architecture challenges, a proven approach is to break them down into smaller, more manageable pieces. Here's how it works:

  1. Decompose and Abstract: We take the big problem and divide it into smaller, well-defined modules with clear responsibilities.
  2. Modular Excellence: Each module is built to excel at its specific task.
  3. Seamless Teamwork: These modules are designed to work together smoothly, ensuring they solve the overall problem when combined.
  4. Validation by Experts: Finally, the entire solution is reviewed by an Architecture Review Board (ARB) to confirm it meets all the project's needs.

This approach makes complex problems more manageable and ensures the final solution is effective and well-designed.


While architectural analyses are typically kept general and independent of specific vendors for flexibility, it can be helpful to mention familiar technologies like Airflow or S3 as reference points. This can make the concepts easier to grasp for the team.

Here's an architect's (Specific but Opinionated) viewpoint based on experience:

  • AWS Lake Formation: This bundled service simplifies access control across various AWS services using LF-tags, making data management a breeze.
  • GCP BigQuery: This is a great choice for quickly launching new data analytics experiments.
  • Azure Power BI: This tool excels at data visualisation, while Azure DevOps (ADO) fosters collaboration across teams.


Data Lake Solution Overview, Evolved from the Above Architecture Analysis

Why Store Data in Multiple Zones? It's About Efficiency!

One might wonder why data gets stored in three different places (Raw, Stage, and Refined) within a data lake, especially since it seems to triple storage costs. Well, there's a good reason for it!

In reality, with the data source and target data marts considered, there can be even more copies (at least 5!). This might seem excessive, but it's a well-thought-out architectural decision. Here's why:

Separation of Concerns:

Having three separate zones allows for a simpler design based on the principle of "separation of concerns." Each zone has a specific purpose:

  • Raw Zone: This is the initial landing spot for data, directly copied "as is" from the source. This makes the pipeline from source to raw zone flexible and resistant to changes in the source data format or flow rate. Additionally, it isolates downstream processes from these source-side variations. Schema changes can be applied to the staged and refined zones only when necessary. This approach also helps with data lineage and change history within the data lake, independent of the source system. In many cases, the source system can even offload the responsibility of maintaining change history to the data lake, allowing its database to focus on optimizing transactional workloads.

Flexibility and Efficiency:

  • Raw Data Format: The raw data can be stored in various formats like JSON, CSV, Avro, or Parquet, depending on the source system and data ingestion method. If the data is streamed or arrives in small batches, it might create numerous small files. These can be periodically combined or compressed for better data management (see the compaction sketch after this list).
  • Bringing in All Usable Content: It's generally recommended to bring all usable content into the Raw zone. Establishing network connections, security measures, and governance for new data sources can be time-consuming. Delays here could impact the turnaround time for data analytics tasks.
  • Unified Zone (Optional): Data is pulled into this zone based on actual usage. Data that sits unused in the Refined zone for a set period (e.g., 90 days) can be automatically removed using storage lifecycle management. This data can be recreated if needed, as long as the data pipeline versions are well managed.
  • Stage Zone: Data is only moved from the Raw zone to the Stage zone if it's specifically needed for generating Refined data, improving data quality, or enhancing data point coverage by merging with related content from the Raw zone.
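As a sketch of the small-file compaction mentioned above, a periodic PySpark job might look like the following; the bucket paths, partition column, and target file count are placeholders for whatever the lake actually uses.

```python
# Illustrative Raw-to-Stage compaction: read many small raw JSON files and
# rewrite them as a smaller number of Parquet files in the Stage zone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-stage-compaction").getOrCreate()

raw_df = spark.read.json("s3://my-lake/raw/events/2024/06/")  # many small files

(
    raw_df
    .coalesce(16)                      # merge into a handful of larger files
    .write
    .mode("overwrite")
    .partitionBy("event_date")         # assumes an event_date column exists
    .parquet("s3://my-lake/stage/events/")
)
```

The 90-day expiry of unused data mentioned above can then be delegated to the object store's native lifecycle rules rather than custom pipeline code.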

Teamwork Makes the Dream Work:

Typically, separate teams manage the three data pipelines:

  • Source to Raw: Managed by the source data team.
  • Raw to Stage: Managed by the feature engineering team.
  • Stage to Refined: Managed by the data science or feature engineering team.

Prioritising Agility:

In short, these three separate data pipelines each have a distinct role. Keeping them separate helps with maintainability and day-to-day operations. This approach prioritises the agility and maintainability of the data pipelines (costing around $3-5 per GB) over storage costs (which are much lower, at around $0.1-0.2 per GB per month for uncompressed data).

Data Lake Reference Architecture: Detailed Component View After Solution Design

Breakdown of a Data Lake and MLOps System

Here the overall ecosystem is well connected, with cohesive capabilities and a clear separation of functions:

  • Data lake (the 6 core services covered earlier)
  • Dev AI/ML environment (4 services: an Exploratory Data Analysis (EDA) service, AI/ML experiments, a build-and-deploy pipeline to the data lake, and model tuning/validation)
  • MLOps (3 services: model deployment and rebuilding using production data, continuous model evaluation for model/data drift, and a scalable AI/ML model inference service)

Summary

The Problem of Organic Data Flow Growth Between Applications

When data flows between applications grow organically over time, things can get messy. Imagine 10 applications, all sharing data with each other. Without a central plan, there could be up to 45 different connections (calculated as N * (N-1) / 2, where N is the number of applications; a quick check follows the list below).

The number of flows can be even higher if data gets passed through multiple applications or loops back on itself. These tangled connections can lead to major headaches down the road:

  • Data Quality Issues: Inconsistencies and errors can multiply as data bounces between applications.
  • Maintenance Challenges: Keeping track of all these connections and fixing problems becomes a nightmare.
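As a quick sanity check of the N * (N-1) / 2 figure:

```python
# Point-to-point integrations grow quadratically with the number of applications.
def max_connections(n: int) -> int:
    return n * (n - 1) // 2

for n in (5, 10, 20):
    print(f"{n} applications -> up to {max_connections(n)} connections")
# 5 -> 10, 10 -> 45, 20 -> 190
```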

The Data Lake Solution

A data lake provides a central hub for data storage and access. This planned approach helps avoid the chaos of organic data flows.


The Key to Success: People, Policies, Processes, and Culture

While technology plays a role, the real key to a data lake's success lies in people, policies, processes, and company culture. This article focuses on the technical aspects that provide security and governance, but it's important to remember these are just guardrails.

For a data lake to truly thrive, it needs broad adoption within the organization. This means fostering a data-driven culture where people are comfortable using and contributing to the lake.

Consulting services like Senn Delaney Culture Shaping (now part of Heidrick & Struggles) can be valuable partners in this process. They can help guide organizations through a cultural transformation that embraces data as a strategic asset.

In short, the right technology is just one piece of the puzzle. Building a successful data lake requires a holistic approach that addresses people, policies, processes, and culture.



Up Next: My Experiments with Generative AI in the Data Lake

Part 4 of this series will dive into my exciting experiments and findings (work in progress!) with Generative AI (Gen AI) and Large Language Models (LLMs), and how they can revolutionise the data lake. Serious partners for collaboration are welcome! Here are a few inspirational thoughts as a teaser:

  • Data Discovery Guru: Imagine a system that can automatically comb through your data lake and pinpoint exactly the information you need. That's the power of Gen AI! It can help us discover hidden insights and valuable data sets that might otherwise have been overlooked, e.g., through automated metadata curation.
  • The Ultimate Summarizer and Organizer: Struggling with complex data sources? Gen AI can automatically generate summaries, create ontologies (classification systems), and build taxonomies (hierarchies) to bring order to the data chaos.
  • Test Data Kitchen: Need high-quality test data to validate models and applications? Gen AI can cook it up! It can generate realistic, anonymised test data, saving time and resources.
  • Data Quality Champion: Improving data quality is an ongoing battle. Gen AI can join the team, automatically identifying and addressing data inconsistencies and errors.
  • Data Flow Detective: Planning changes to data pipelines? Gen AI can analyse the end-to-end data pipeline, assess the potential impact on any component, and even suggest automated solutions to minimise disruption.
  • Optimising Cost (FinOps): Are we getting the most out of our data storage and processing resources? Gen AI can help by analysing usage patterns and recommending cost-saving improvements to the data models.

Stay tuned for the next part, where I will share the excitement and frustrations of my Gen AI experiments.
