Selected Data Engineering Posts . . . August 2024

The most popular of my posts on data engineering in August 2024 ... with additional references ...


Welcome to the latest edition of "Selected Data Engineering Posts". This month we look at key trends that are reshaping data processing. Learn how separating business logic from data management can optimize workflows and reduce errors. Explore the transition from traditional ETL to the modern EtLT architecture, which enables real-time data processing and the integration of multiple sources. See how the EU AI Act is raising standards for data quality, with tools such as Databricks Lakehouse and Unity Catalog proving essential for compliance. We also examine the importance of choosing the right semantic layer, using brainwriting for innovation, and ensuring data observability for reliable AI. Finally, we discuss the benefits of structured data tagging for improved governance and utilization. Dive in to expand your knowledge and stay at the forefront of advances in data engineering.

Each post is accompanied by references to further reading so that you can deepen your knowledge of these informative topics.


Subscribe now to stay updated with our monthly issues to realize the full potential of data engineering and make impactful business decisions. Expand your data engineering expertise today!



This issue:

Data Engineering, Redefined: The main problem with data engineering today is the inappropriate mixing of data management and business logic. Data engineers are overburdened with implementing business rules that should be the responsibility of application developers, resulting in inefficient and error-prone data pipelines.

Beyond Traditional ETL: The evolution of data integration has progressed from traditional ETL to the modern EtLT architecture, which integrates real-time data processing and hybrid batch-stream integration. Embrace EtLT for enhanced real-time capabilities and support for diverse data sources while preparing for future innovations like DataFabric and automated governance.

EU AI Act as a Catalyst: The EU AI Act serves as a catalyst for raising standards for data engineering by enforcing strict requirements for data quality and management. Tools such as Databricks Lakehouse and Unity Catalog play a critical role in meeting these standards, ensuring compliance and improving data management.

Understanding Semantic Layers: In the modern data landscape, "Semantic Layer" refers to two distinct types: the Metrics layer, which centralizes analytics and simplifies query access, and the Semantic Layer, which provides contextual understanding and interoperability across data. Choosing the appropriate layer is crucial for building a robust, adaptable enterprise infrastructure.

Data Trustworthiness: A flexible approach to data trustworthiness should be adopted to fully leverage data assets, with recognition that not all data requires top-tier certification. This shift enhances usability, reduces costs, and fosters innovation while necessitating updates to governance roles and policies.

Brainwriting: Brainwriting fosters innovation by allowing all team members, including quieter ones, to contribute ideas independently, reducing groupthink and dominant voices. This method enhances idea generation, promotes diverse perspectives, and improves focus, making it valuable for complex data engineering challenges.

Data Quality for AI: High-quality training data is crucial for effective LLMs and AI models. Poor data introduces noise and bias, leading to unreliable predictions. Prioritizing data accuracy, completeness, and validity is essential for achieving accurate, trustworthy AI outcomes and meaningful insights.

Data Observability: To ensure trustworthy and effective AI, data observability is essential. It allows for continuous monitoring, quality checks, and error correction of data, which enhances AI model performance, transparency, and compliance. Implementing data observability practices fosters collaboration and builds trust in AI systems.

Data Tagging: Implementing a structured tagging strategy enhances data governance by improving classification, lifecycle management, and compliance, while automating and integrating tagging with Unity Catalog boosts efficiency and utility.


We look forward to sharing these insights with you and supporting your journey towards data excellence.

Enjoy reading!



Redefining Data Engineering for Modern Needs

Focusing on data management and processing, without executing business logic.

In this article Bernd Wessely points out that data engineering is often defined as developing, implementing, and maintaining systems that transform raw data into high-quality information for uses like analysis and machine learning. However, this approach has significant issues.

Transforming raw data into meaningful information involves applying the correct logic, usually handled by applications developed by software engineers. Currently, data engineers are tasked with implementing business logic, leading to inconsistent and hidden logic within brittle data pipelines.

A new definition of data engineering is proposed: focusing solely on data movement, manipulation, and management, without executing business logic. Data engineers should provide tools and platforms for application developers, who handle business logic.

Challenges in Current Data Engineering:

  • Mixing data transformation with business logic.
  • Creating brittle data pipelines with inconsistent logic.
  • Encroaching on roles traditionally handled by application developers.

Proposed Redefinition:

  • Data engineering should be about data movement, manipulation, and management.
  • Business logic should be handled by application developers.
  • Technical manipulations allowed for data engineers include partitioning, reformatting, and indexing, but not adding new business information.
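As a minimal sketch of this boundary (all record fields and the premium-order rule are invented for illustration), the pipeline below only moves and partitions records, while the business rule lives in application code:

```python
# --- Data engineering layer: generic movement and technical manipulation only.
# Partitioning and reformatting are allowed; no business meaning is added.
def partition_by_key(records: list, key: str) -> dict:
    """Technical manipulation: group records by a partition key."""
    partitions: dict = {}
    for rec in records:
        partitions.setdefault(str(rec[key]), []).append(rec)
    return partitions

# --- Application layer: business logic owned by application developers.
def is_premium_order(order: dict) -> bool:
    """Business rule (lives with the application, not the pipeline)."""
    return order["amount"] >= 100.0

orders = [
    {"id": 1, "country": "BE", "amount": 250.0},
    {"id": 2, "country": "NL", "amount": 40.0},
    {"id": 3, "country": "BE", "amount": 99.0},
]

# The pipeline only moves and partitions data ...
by_country = partition_by_key(orders, "country")
# ... while the application applies the business rule to the partitioned data.
premium_be = [o for o in by_country["BE"] if is_premium_order(o)]
print([o["id"] for o in premium_be])  # [1]
```

The point of the split: the pipeline function would work unchanged for any dataset, while the rule about what counts as "premium" stays with the team that owns that definition.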

Conclusion:

Data engineering should focus on providing a robust data infrastructure and tools and leave the business logic to the application developers. This separation ensures clear responsibilities, improves data quality and supports scalable, maintainable systems.

Go to Article


Additional References - including controversial ones

Data engineering: a role redefined by business needs and interpersonal skills

Dear Data Engineer – Get to know your Stakeholders

The Future of Data Engineering as a Data Engineer

The Synergy of Algorithms: Generative AI Redefining Data Engineering

Data Engineering in the Age of Generative AI



The Future of Data Integration: Moving Beyond ETL

EtLT architecture is becoming a global standard in data integration

In this article, Dr. RVS Praveen, Ph.D., points out that data integration is evolving beyond traditional ETL (Extract, Transform, Load) to more advanced architectures such as ELT (Extract, Load, Transform) and EtLT (Extract, transform, Load, Transform).
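A minimal sketch of the EtLT flow, with invented sample records standing in for a real source system and warehouse:

```python
# Minimal EtLT sketch: Extract -> light transform (t) -> Load -> Transform (T).
raw = [
    {"ts": "2024-08-01T10:00:00", "amount": " 12.50 ", "currency": "eur"},
    {"ts": "2024-08-01T11:30:00", "amount": "7.00", "currency": "EUR"},
]

# E: extract (here, the raw list stands in for an API or CDC feed)
extracted = list(raw)

# t: light, business-free technical cleansing before loading
cleansed = [
    {**r, "amount": float(r["amount"].strip()), "currency": r["currency"].upper()}
    for r in extracted
]

# L: load into the target system (a list standing in for a warehouse table)
warehouse_table: list = []
warehouse_table.extend(cleansed)

# T: heavier transformation inside the target (normally SQL in the warehouse)
daily_revenue = sum(row["amount"] for row in warehouse_table)
print(daily_revenue)  # 19.5
```

The small "t" keeps the pre-load step cheap and generic (type casts, trimming, normalization), while the heavy "T" runs where the compute and governance live.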

ETL Architecture:

  • Advantages: Ensures data consistency and quality, integrates complex data sources, provides clear technical architecture, and facilitates business rule implementation.
  • Disadvantages: Lacks real-time processing, incurs high hardware costs, offers limited flexibility, has expensive maintenance, and poorly handles unstructured data.

ELT Architecture:

  • Advantages: Efficiently handles large data volumes, enhances development and operational efficiency, is cost-effective, flexible, and scalable, and integrates seamlessly with new technologies.
  • Disadvantages: Offers limited real-time processing support, incurs high data storage costs, may have data quality issues, and depends on target system capabilities.

EtLT Architecture:

  • Advantages: Supports real-time data processing, complex data sources, cost reduction, enhanced flexibility and scalability, optimized performance, large model handling, and improved data quality and governance.
  • Disadvantages: Involves increased technical complexity, depends on target system capabilities, faces management and monitoring challenges, has complex data change management, and depends on specific tools and platforms.

The Future of Data Integration:

  1. Shift from batch processing to real-time data capture and hybrid batch-stream integration.
  2. Evolution of data transformation capabilities to handle complex transformations within data warehouses.
  3. Expansion of data source support to include emerging sources, unstructured data systems, and cloud databases.
  4. Emphasis on core capabilities like data source diversity, accuracy, and ease of troubleshooting.

Conclusion: EtLT architecture is becoming a global standard in data integration, addressing the limitations of traditional ETL and ELT by enabling real-time data processing, flexibility, and scalability.

Go to Article


Additional References

How Data Integration Is Evolving Beyond ETL

Future Trends in Data Integration

ETL vs. ELT: Dive deeper into two data processing approaches

The convergence of ETL and ELT: The future of unified data management



The EU AI Act as a Catalyst to Raise Data Engineering Standards

Effective data management and data governance are essential for maximizing the value of corporate data, ensuring its quality, transparency, and regulatory compliance. Article 10 of the EU AI Act underscores the importance of using high-quality data for high-risk AI systems. It mandates that data for training, validation, and testing must be carefully managed to be relevant, accurate, and free from bias.

The EU AI Act offers a clear opportunity and encouragement to elevate data engineering standards. By adhering to the Act’s requirements, organizations are encouraged to adopt advanced data platforms and governance frameworks, such as Databricks Lakehouse and Unity Catalog. These tools facilitate the integration, management, and governance of data, ensuring it meets stringent quality and compliance standards.

Implementing best practices for data engineering not only supports regulatory compliance, but also strengthens the reliability and trustworthiness of AI systems.

The EU AI Act can thus act as a catalyst for improving data management processes, enhancing data quality and promoting responsible AI development.

Go to Article


Additional References

How to prepare for the EU AI Act

Navigating the EU AI Act: Technical Key Priorities for Businesses

What does the AI Act mean for the use of AI in companies?

Compliance under the EU AI Act: Summary and Key Issues



Understanding Semantic Layers: Metrics vs. True Semantics

Evaluating solutions that really serve complex needs


In his article, Helyx Chase Scearce Horwitz points out that in data management, the term "Semantic Layer" can refer to two distinct types of layers:

Metrics Layer: Originating from the 1990s, this layer simplifies querying by allowing users to access databases without SQL knowledge. It focuses on standardizing metrics and analytics to ensure consistent interpretations across an organization.

Semantic Layer: This layer provides a broader context by linking various data and information sources. It emphasizes interoperability and extensibility, utilizing frameworks akin to the semantic web to make implicit relationships explicit and facilitate comprehensive understanding.

Key Functions:

  • Metrics Layer: Standardizes and simplifies analytics for consistent, self-serve access.
  • Semantic Layer: Connects disparate data sources, enhancing understanding and enabling complex insights through a graph data model and ontologies.
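The Metrics Layer idea above can be sketched as a central metric registry that compiles every request to the same SQL; metric names, columns, and tables below are invented for illustration:

```python
# Minimal metrics-layer sketch: one central definition per metric, so every
# dashboard or notebook compiles to the same SQL for the same metric.
METRICS = {
    "revenue": {"agg": "SUM", "column": "amount", "table": "orders"},
    "order_count": {"agg": "COUNT", "column": "*", "table": "orders"},
}

def compile_metric(name: str, group_by: str = "") -> str:
    """Compile a named metric to SQL, optionally grouped by one dimension."""
    m = METRICS[name]
    select = f"{m['agg']}({m['column']}) AS {name}"
    if group_by:
        return f"SELECT {group_by}, {select} FROM {m['table']} GROUP BY {group_by}"
    return f"SELECT {select} FROM {m['table']}"

print(compile_metric("revenue", group_by="country"))
# SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country
```

A true Semantic Layer, by contrast, would sit on top of many such sources and model the relationships between them, not just the aggregations.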

While AIO (All-In-One) Semantic Layer tools offer integrated analytics, they often fall short of the extensive capabilities of a true Semantic Layer. A genuine Semantic Layer not only provides consistent metrics but also integrates diverse information sources, supports advanced analytics, and ensures a unified source of truth for the entire enterprise.

Conclusion: For transformative insights and comprehensive data integration, enterprises need a semantic layer that goes beyond metrics and integrates organizational knowledge assets. Customized solutions with semantic depth will better serve complex needs and future-proof enterprise data strategies.

Go to Article


Additional References

Industry Panel: Different Applications of a Semantic Layer — Takeaways Blog

Semantic layer vs metric layer, or a hybrid solution. Which is right for me?

What is a Metrics Layer?

Semantic Layer — One Layer to Serve Them All

Semantic layer 101: Why your data team should focus on metrics over data



Rethinking Data Trustworthiness: Beyond Certification

Flexible trustworthiness of data for more value

In his article, Francesco De Cassai points out that the "Data as a Product" concept has become a key focus in data management, defined by traits such as Discoverable, Addressable, Trustworthy, Secure, Interoperable, and Self-Describing (DATSIS). However, the term "Trustworthy" is often misconceived as requiring high certification standards, which can limit the potential of new data solutions.

Key Takeaways:

  • Data Products: A data product comprises data assets and metadata that describe its features and access methods.
  • Product Types: Data products can be general-purpose or specialized, similar to physical products that serve different needs.
  • Trust Levels: The concept of "Trustworthy" should not be restricted to highly certified data. Instead, a "good enough" approach should be adopted, akin to different categories of physical products.
  • Operational Costs: Limiting interoperability to only highly certified data increases operational costs and restricts access to broader data assets.
  • Cultural Shift: Moving beyond the need for perfect data requires a cultural and procedural shift, encouraging the use of "silver" and "bronze" data alongside "gold" data.
  • Data Policies: Effective data policy management and metadata availability at all levels are essential for managing different trustworthiness levels.
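A small sketch of the "good enough" idea, with invented product names and tiers: each consumer requests the minimum trust tier it needs instead of demanding gold certification everywhere:

```python
# Sketch: match a use case to the lowest trust tier that is "good enough",
# instead of requiring gold certification everywhere. Tiers are illustrative.
TIER_RANK = {"bronze": 0, "silver": 1, "gold": 2}

data_products = [
    {"name": "web_clickstream", "tier": "bronze"},
    {"name": "customer_360", "tier": "silver"},
    {"name": "finance_ledger", "tier": "gold"},
]

def usable_for(required_tier: str) -> list:
    """Return all products meeting at least the required trust tier."""
    needed = TIER_RANK[required_tier]
    return [p["name"] for p in data_products if TIER_RANK[p["tier"]] >= needed]

# Exploratory analytics can accept bronze; regulatory reporting demands gold.
print(usable_for("bronze"))  # all three products
print(usable_for("gold"))    # ['finance_ledger']
```

Making the tier explicit in product metadata is what lets policies and access controls handle each level differently instead of blocking everything below gold.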

Conclusion: A flexible approach to data trustworthiness can improve the usability and value of data products while reducing costs. This approach promotes a more comprehensive data ecosystem that supports innovation and efficiency.

Go to Article


Additional References

The 3 Data Product components and how a Data-First Stack enables each

Data Trustworthiness

What is data trust?

4 Ways to Achieve Trustworthy Data



Idea Generation: The Power of Brainwriting

... and how to increase creativity and accelerate problem-solving

Brainstorming, while a common method for generating ideas, often falls short due to inherent group dynamics and behaviors. Challenges such as dominant personalities, focus drift, and self-censorship can hinder innovation. Brainwriting offers a more effective alternative by allowing all participants to contribute ideas silently, thereby addressing these issues and fostering a more inclusive environment.

Challenges:

  • Group Dynamics: Traditional brainstorming often suffers from dominance by vocal participants and difficulty maintaining focus.
  • Self-Censorship: Participants may withhold ideas based on perceived negative feedback or fear of judgment.
  • Coordination: Large groups can struggle with time management and maintaining engagement.

Implementation:

  • Define the Problem Clearly: Ensure that the issue being addressed is specific and actionable.
  • Establish Ground Rules: Set clear guidelines for participation and idea generation to maintain structure.
  • Enable the Reuse of Ideas: Allow participants to enhance or modify ideas from others, promoting collaboration and refinement.
  • Consolidate and Evaluate: Gather and assess ideas to identify the most impactful solutions.

Brainwriting effectively mitigates the common pitfalls of traditional brainstorming, offering a structured approach that promotes equal participation and reduces the influence of dominant personalities.

By incorporating brainwriting, teams can enhance creativity and accelerate problem-solving, paving the way for innovative breakthroughs and more effective solutions.

Go to Article


Additional References

Brainwriting 101: how to unlock new, innovative ideas

How to use brainwriting to generate ideas

Brainwriting: A 3-Step Approach to Generating Innovative Ideas



Data Quality: The Cornerstone of Effective AI Models

Ensuring AI Effectiveness through High-Quality Data Management


In his blog post, Tejasvi Addagada points out that Large Language Models (LLMs) derive their capabilities from vast datasets collected from diverse sources. The quality of this data is crucial, as it enables LLMs to learn language patterns and generate accurate responses. However, poor data quality introduces noise, leading to incorrect embeddings and reduced model effectiveness. Poor data quality impacts AI outcomes in several ways:

Inaccurate Predictions: Errors in training data cause unreliable or incorrect model predictions.

Biased Results: Biased training data perpetuates biases in AI-generated results.

Off-Target Outputs: Incomplete or inconsistent data confuses models, resulting in nonsensical outputs.

Misleading Information: Erroneous data can produce misleading information, detrimental to decision-making processes.

Poor data quality undermines model training and leads to unreliable downstream outcomes. For Generative AI, data quality is critical to producing accurate insights. Data scientists spend a significant amount of time preparing data, highlighting the challenge of maintaining high-quality data.

Poor data quality also impacts business productivity and revenue. Inaccurate predictions can lead to wrong decisions, decreasing customer trust and satisfaction. Implementing systematic quality control and verification can mitigate these issues, much like quality checks in a production line.
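Such systematic checks can be sketched as simple record-level rules; the fields and label set below are invented for illustration:

```python
# Sketch of systematic quality checks on training records before they reach a
# model: completeness, validity, and label accuracy. Rules are illustrative.
def check_record(rec: dict) -> list:
    """Return a list of quality issues found in one training record."""
    issues = []
    if rec.get("text") in (None, ""):                  # completeness
        issues.append("missing text")
    if not isinstance(rec.get("label"), str):          # validity
        issues.append("non-string label")
    if rec.get("label") not in ("positive", "negative", None):
        issues.append("unknown label")                 # accuracy of labels
    return issues

records = [
    {"text": "great product", "label": "positive"},
    {"text": "", "label": "positive"},
    {"text": "meh", "label": "netural"},  # typo in label value
]

clean = [r for r in records if not check_record(r)]
print(len(clean))  # 1
```

Running such rules at ingestion time is the "production line" check: bad records are quarantined for correction instead of silently degrading the training set.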

In conclusion, data quality is a business imperative that impacts financial outcomes, customer satisfaction, and AI effectiveness. High-quality data ensures that AI models are accurate and reliable, driving growth and value in data-driven projects.

Go to Article


Additional References

The risks of poor data quality in AI systems

Data Quality in AI: Challenges, Importance & Best Practices in '24

Data Quality For Good AI Outcomes

Why a Data Governance Strategy Is Crucial to Harness the Capabilities of Artificial Intelligence

Why Data Quality Matters in the Age of Generative AI



The Essential Role of Data Observability in Data-Centric AI

On continuous monitoring, validation, and governance of data

In this article, Jatin Solanki points out that data observability is crucial for ensuring that AI systems are trustworthy and accountable. It provides visibility into data pipelines, allowing organizations to monitor, validate, and govern data throughout its lifecycle. This proactive approach helps maintain data quality, which is essential for the reliability and performance of AI systems.

The components of data observability include:

  • Data Quality Monitoring: Continuously tracking data metrics to ensure accuracy and timeliness, preventing errors before they impact AI models.
  • Data Lineage Tracking: Documenting the data's journey to trace errors and understand transformations, which aids in compliance and impact analysis.
  • Data Validation: Applying rules to verify data integrity and consistency, reducing the risk of incorporating faulty data into AI models.
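The three practices can be sketched as follows; the SLA threshold, pipeline steps, and field names are all illustrative:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the three practices above: a freshness check (monitoring), a
# recorded lineage trail, and a row-level validation rule.
def is_fresh(last_loaded: datetime, max_age_hours: int = 24) -> bool:
    """Monitoring: flag a table whose latest load is older than the SLA."""
    return datetime.now(timezone.utc) - last_loaded <= timedelta(hours=max_age_hours)

lineage = []  # lineage: one entry appended per pipeline step
def record_step(step: str, source: str, target: str) -> None:
    lineage.append({"step": step, "source": source, "target": target})

def validate_row(row: dict) -> bool:
    """Validation: integrity rule checked before the row reaches the model."""
    return row.get("user_id") is not None and row.get("amount", 0) >= 0

record_step("ingest", "orders_api", "raw.orders")
record_step("clean", "raw.orders", "silver.orders")

stale = not is_fresh(datetime.now(timezone.utc) - timedelta(hours=30))
print(stale, len(lineage), validate_row({"user_id": 7, "amount": 10}))
# True 2 True
```

In production these checks would feed alerts and dashboards; the lineage trail is what turns "the model degraded" into "this upstream step broke".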

By integrating data observability into MLOps workflows, teams can streamline collaboration and enhance model performance. Real-world examples, such as Uber, PayPal, and Airbnb, demonstrate the benefits of this approach, including improved accuracy, transparency, and operational efficiency.

Data observability is indispensable for building trustworthy and effective AI systems. It not only enhances data quality and governance but also promotes greater transparency and compliance. As AI continues to evolve, prioritizing data observability will be essential for developing reliable and accountable AI solutions.

Go to Article


Additional References

What is Data Observability? 5 Key Pillars To Know

What is data observability?

What is Data Observability?

Data Observability for Data Engineers: What, Why & How?

How to Calculate the ROI of Data Observability



Improving Data Governance and Security with Structured Asset Tagging

... structured tagging to improve data governance and compliance


Databricks Unity Catalog centralizes security and management for data and AI assets across the lakehouse. It provides fine-grained access control for databases, tables, files, and models, improving governance and reducing workload.

In this article, David Callaghan points out that tags, structured as key-value pairs, can be attached to any asset in the lakehouse. This strategy improves data classification, regulatory compliance, and data lifecycle management. Key steps include identifying a use case as Proof of Value and securing stakeholder buy-in.

Common Tagging Use Cases:

  • Data Classification and Security: Tagging data as PII integrates with access controls for better security.
  • Data Lifecycle Management: Tags identify data stages to enforce policies and manage transitions.
  • Data Discovery and Usability: Descriptive tags improve data searchability and usability for analysts.
  • Compliance and Regulation: Tags like 'GDPR' simplify audits and regulatory compliance efforts.
  • Project Management and Collaboration: Tags organize assets by project or department, aiding collaboration and tracking.

Databricks Runtime supports tag management through SQL commands, which are preferred for their ease of use. Tags can be managed manually or through automated PySpark scripts.
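As a sketch, such tagging statements can be generated from a tag map and then executed, e.g. via spark.sql(...) in a Databricks notebook. The table names and tag keys below are invented; the ALTER TABLE ... SET TAGS form follows the Unity Catalog SQL reference:

```python
# Sketch: generate Unity Catalog tagging statements from a tag map. In a
# notebook each statement would be run with spark.sql(stmt); here we only
# build the strings, which is also how automated tagging scripts often work.
def set_tags_sql(table: str, tags: dict) -> str:
    """Build an ALTER TABLE ... SET TAGS statement for one securable table."""
    pairs = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())
    return f"ALTER TABLE {table} SET TAGS ({pairs})"

statements = [
    set_tags_sql("main.crm.customers", {"pii": "true", "lifecycle": "silver"}),
    set_tags_sql("main.finance.ledger", {"compliance": "GDPR"}),
]
for stmt in statements:
    print(stmt)
# ALTER TABLE main.crm.customers SET TAGS ('pii' = 'true', 'lifecycle' = 'silver')
# ALTER TABLE main.finance.ledger SET TAGS ('compliance' = 'GDPR')
```

Generating statements from a central tag map keeps tagging consistent across hundreds of tables, which manual tagging rarely achieves.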

Conclusion: Utilizing Unity Catalog and a robust tagging strategy enhances governance, security, and utility of the data lakehouse, facilitating broader enterprise adoption and compliance.

Go to Article


Additional References

Apply tags to Unity Catalog securable objects

Identifying and Tagging PII data with Unity Catalog

Data tagging for Databricks




Takeaways

Here are the key takeaways from this month's edition, providing you with essential strategies and insights to excel in data engineering:

Data Engineering, Redefined: As a data engineer, focus on data movement and management while leaving the implementation of business logic to application developers to enhance efficiency and accuracy in your data processes.

Beyond Traditional ETL: Prioritize EtLT architecture to leverage real-time data processing and effectively integrate multiple data sources. Stay ahead of the curve by preparing for future advances in automated governance and data virtualization to ensure your solutions remain cutting-edge and scalable.

EU AI Act as a Catalyst: Advanced platforms such as Databricks Lakehouse and Unity Catalog should be used to meet the stringent requirements of the EU AI Act and improve data quality and management. This approach ensures compliance and increases the reliability of high-risk AI systems.

Understanding Semantic Layers: To achieve a transformative impact, organizations should choose a semantic layer that goes beyond basic analytics and provides comprehensive insights and interoperability across all data and knowledge assets, rather than relying solely on metrics-driven solutions.

Data Trustworthiness: Embrace a flexible approach to data trustworthiness with varying certification levels. Implement robust data policy management and adjust governance roles to enhance data usability, reduce operational costs, and drive innovation across data-driven projects.

Brainwriting: To address complex data engineering challenges, hold brainwriting sessions to ensure equal participation, reduce groupthink, and encourage idea generation. This approach brings in diverse perspectives and streamlines the ideation process, leading to more innovative and effective solutions.

Data Quality for AI: Prioritize data quality by emphasizing accuracy, completeness, and validity. Reliable AI outcomes hinge on high-quality data, which prevents biases, errors, and misleading results. Implement robust data governance practices to ensure data integrity and enhance the effectiveness of AI models and predictions.

Data Observability: To build trustworthy and reliable AI systems, organizations should prioritize data observability. By continuously monitoring data quality, tracking data lineage, and validating data integrity, teams can enhance AI performance, ensure compliance, and foster greater transparency and trust in their AI solutions.

Data Tagging: Complement Databricks Unity Catalog with a comprehensive tagging strategy to enhance data governance, improve searchability, and streamline lifecycle management. Use automation for consistent tagging across assets, ensuring better regulatory compliance and efficient management of your data lakehouse.



Conclusion

To wrap up, this issue highlights key advancements in data engineering. By focusing on data management and delegating business logic to developers, efficiency and accuracy are improved. The EtLT architecture ensures effective real-time processing and integration. The use of platforms such as Databricks Lakehouse and Unity Catalog is in line with EU AI Act and improves data quality and compliance. Choosing the right semantic layer provides deeper insights and better interoperability. Adopting flexible approaches to data trustworthiness and robust governance drives innovation and reduces costs. Brainwriting encourages diverse, innovative solutions. Prioritizing data quality and observability is essential for reliable AI outcomes, while structured data tagging improves governance and management. Stay up to date with these insights to drive effective data processing practices.

Stay tuned for our next issue, where we’ll dive into the latest advancements and discoveries in data technology.

See you next month ...



#DataEngineering #DataManagement #DataGovernance #MachineLearning #BigData #DataQuality #DataIntegration #AI #CloudComputing #DataTransformation #DataOps #DataArchitecture #DataPipelines #DataSecurity #BusinessIntelligence #AICompliance #UnityCatalog #PrivacyCompliance #DataStrategy #DataLakes #DataScience #MLOps #RealTimeData #RegulatoryCompliance #Databricks #Analytics #DataMesh #DataLifecycle #DataInfrastructure #SelfServiceAnalytics #DataClassification #DataObservability #DataLineage #AIRegulation #DataProducts #DataWarehouse #AITransparency #Automation #InnovationInData #GenerativeAI #CustomerSatisfaction
