Selected Data Engineering Posts . . . July 2024
Axel Schwanke
Senior Data Engineer | Data Architect | Data Science | Data Mesh | Data Governance | 4x Databricks certified | 2x AWS certified | 1x CDMP certified | Medium Writer | Turning Data into Business Growth | Nuremberg, Germany
The most popular of my data engineering posts in July 2024 ... with additional references ...
Welcome to the latest edition of "Selected Data Engineering Posts". This month, we look at the integration of generative AI with data products and emphasize the importance of robust data management for AI success. We look at AI content management practices for better AI interaction, the role of semantic layers in unifying data for better decision making, and how GenAI will enhance, not replace, data engineering tasks.
We also explore the need for strong data governance alongside AI governance, introduce Deequ for scalable data quality checks, and review the impact of the EU AI Act on data management. We discuss how embedded analytics and AI can improve business intelligence through the seamless integration of data into workflows. Finally, we look at how intelligent orchestration can support business growth and the adoption of GenAI in highly regulated industries.
Each article contains additional references to further reading so that you can enhance your knowledge of these informative topics.
Discover the latest trends, best practices and innovative strategies that are transforming data engineering. Whether you're a seasoned pro or just starting out, "Selected Data Engineering Posts . . . July 2024" offers key insights for all levels.
Subscribe to our monthly issues now to stay up to date and unlock the full potential of data technology. Expand your expertise today!
This issue:
GenAI and Data Products: Effective data products are essential for leveraging Generative AI, requiring well-defined design principles and robust data management. They enhance AI capabilities by ensuring high-quality, accessible, and reliable data across various industry applications.
AI Content Management: To effectively implement AI, organizations must prepare their content through strategic structuring, cleanup, and standardization. Key steps include defining knowledge domains, auditing content, and developing reusable content models and components for better AI interaction and efficiency.
Semantic Layer: Organizations struggle with scattered and inconsistent business data that makes decision-making difficult. Resolving entities into semantic layers unifies data and improves decision making and efficiency. Experts discuss typical data challenges, benefits and real-world applications of Entity Resolved Knowledge Graphs.
GenAI & Data Engineering: GenAI won't replace data engineers due to its lack of abstract thinking, business understanding, and context application. Instead, it will automate routine tasks, allowing engineers to focus on strategic, value-driven work, enhancing overall efficiency.
Digital Transformation & Data Strategy: Digital transformation often results in a disconnect between elegant front-end innovations and the complex data management required behind the scenes. To bridge this gap, organizations must integrate a robust data strategy aligned with digital initiatives to ensure seamless data processing and foster a truly data-driven culture.
AI & Ungoverned Data: To ensure effective AI implementation, distinguish between AI Governance, which focuses on ethical AI use, and Data Governance, which ensures data quality and security. Implement strong data governance policies, conduct audits, and promote a data-driven culture.
Deequ - Data Quality Library: Deequ, an open-source library built on Apache Spark, provides scalable data quality checks via a declarative API and enables integration into ETL pipelines. It supports large data sets and incremental validation, but does not have a user interface and has limited community support.
EU AI Act: The final version of the AI Act outlines data management for AI systems, featuring simplified compliance for SMEs, integrated assessments, and updated risk definitions. It emphasizes data management and privacy practices, aligning with GDPR's requirements for detailed documentation and data quality.
Embedded Analytics and AI: Business intelligence has evolved from static, exclusive platforms to dynamic, widespread analytics. However, many users struggle with fragmented data sources. To address this, implement a universal semantic layer to unify data and enhance internal and external analytics with embedded AI, improving decision-making, productivity, and workflows.
Intelligent Orchestration: As technology transforms consumer behavior, delivering superior customer experience (CX) is crucial for business growth. Intelligent orchestration—integrating processes, data, and technology—emerges as a key strategy for enhancing CX, ensuring businesses remain competitive and responsive to evolving customer needs.
GenAI in Highly Regulated Industries: Generative AI offers significant benefits across industries by automating tasks and extracting insights, but financial services lag in adoption due to regulatory concerns. Successful implementation requires cautious, well-planned efforts, transparency, and strong security measures.
We’re excited to share this knowledge with you and support your journey to data excellence.
Enjoy reading!
Generative AI and Data Products
In his article, Willem Koenders points out that data experts discussed the importance of data products and generative AI (Gen AI) in the pharmaceutical industry at the Pharma SOS conference.
Data products are curated collections of data components designed to improve understanding and access to data.
Key characteristics of data products:
Effective data products follow design principles such as autonomy, a common development framework, consistent metadata management, automated governance, and data-sharing protocols. They can be categorized into four levels: raw/staged data, conformed data, analytics-ready data, and fit-for-purpose data, each serving different business needs.
Generative AI, which creates content resembling its training data, depends on high-quality, diverse data. Ensuring robust data management and governance is crucial for effective Gen AI deployment. AI models require substantial and varied datasets to avoid inefficiencies and biases.
Gen AI also enhances the data value chain in areas like data acquisition, transformation, consumption, and operations. It streamlines data tagging, code generation, business settings configuration, and routine operations, improving efficiency and decision-making.
The integration of generative AI into data management is revolutionizing the collection, transformation and use of data. The synergy between data governance and AI will be critical to the long-term success of organizations.
AI Content Management
Understanding how AI interacts with content and developing a solid content strategy is essential.
In her article, Emily Crockett points out that organizations face challenges in effectively managing vast amounts of content. AI offers innovative solutions like chatbots, auto-tagging, and personalization to improve operations and efficiency. However, content must be properly prepared before AI can work with it effectively. Understanding how AI interacts with content and developing a solid content strategy is critical.
Recommendations: Define knowledge domains, audit and clean up existing content, and develop reusable content models and components. These practices ensure AI can access correct, comprehensive content with meaningful relationships, enhancing the overall effectiveness and accuracy of AI applications.
Semantic Layer and Entity Resolution
... to enhance decision-making and operational efficiency
In their article, Lulit Tesfaye and Jeff Jonas point out that the explosion of information often leaves organizations struggling to make sense of their data due to its scattered, inconsistent nature. This article delves into the use of Entity Resolution within the Semantic Layer to contextualize enterprise data, thereby enhancing decision-making and operational efficiency.
Combining a semantic layer with entity resolution unifies and contextualizes scattered, inconsistent enterprise data. This approach enhances customer experiences, improves data quality, and informs strategic decision-making, making organizations more competitive across industries.
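As a rough illustration of what entity resolution means in practice, the sketch below groups customer records from two hypothetical source systems under one resolved entity. The record fields, the `normalize` helper, and the exact-match rule on a normalized key are assumptions made for this example; they are not taken from the article, and real entity resolution relies on much richer matching.

```python
from collections import defaultdict

# Hypothetical records from two source systems (illustrative data only).
records = [
    {"source": "crm", "name": "Acme Corp.", "email": "Info@Acme.com"},
    {"source": "billing", "name": "ACME Corp", "email": "info@acme.com"},
    {"source": "crm", "name": "Globex GmbH", "email": "sales@globex.de"},
]


def normalize(record):
    """Build a simple match key: lowercased email plus name without punctuation."""
    name = "".join(
        ch for ch in record["name"].lower() if ch.isalnum() or ch == " "
    )
    return (record["email"].strip().lower(), " ".join(name.split()))


def resolve(records):
    """Group records that share the same match key into one entity."""
    entities = defaultdict(list)
    for record in records:
        entities[normalize(record)].append(record)
    return list(entities.values())


entities = resolve(records)
for entity in entities:
    # Show which source systems contributed to each resolved entity.
    print([r["source"] for r in entity], entity[0]["name"])
```

Here the two Acme records collapse into a single entity even though their spelling differs, which is the kind of unification a semantic layer can then build on.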
Will GenAI Replace Data Engineers?
The evolving landscape of GenAI in data engineering
In this article, Barr Moses underlines that the evolving landscape of GenAI in data engineering presents both challenges and opportunities for professionals in the field. Challenges include pressure to adopt new tools, navigate complexities of AI technologies, and address privacy and security concerns. Recommendations include getting closer to the business, measuring team ROI, and prioritizing data quality:
Get Closer to the Business: Data engineers should build relationships with stakeholders and gain a deep understanding of business needs to align AI initiatives with organizational objectives effectively.
Measure and Communicate ROI: Data teams should develop metrics to measure the return on investment (ROI) of AI initiatives and communicate the value delivered to the organization, highlighting the strategic impact of data engineering efforts.
Prioritize Data Quality: Focus on ensuring data quality by implementing rigorous validation processes and data observability tools to support AI models and enhance their accuracy and reliability.
Stay Updated on Emerging Trends: Data engineers should stay updated on emerging trends and best practices in GenAI and data engineering to remain agile and responsive to evolving technology landscapes.
Despite the increasing role of GenAI in data engineering, data engineers remain indispensable, because GenAI lacks the abstract thinking, business understanding, and context application that strategic work requires. By embracing new technologies while focusing on business understanding and data quality, data engineers can thrive in an AI-powered future.
Bridging the Gap: Digital Transformation and Data Strategy
Integrating a solid data strategy into digital transformation ensures that companies fully leverage their digital investments to drive sustainable growth and innovation.
In his second article about data strategy, Jan Meskens explores the relationship between digital transformation and data strategy, highlighting common pitfalls and suggesting strategies for bridging the gap between the two. A notable observation from a data strategy masterclass highlighted that digital transformation initiatives often become obstacles to an effective data strategy. This paradox arises because while digital transformation aims to create data-driven organizations, there is often a disconnect between the IT and data worlds.
Digital transformation involves integrating digital technologies into business operations in order to redesign interactions with customers and optimize processes. Despite the promises, many companies end up with a "hyper-digitized front end" and a cumbersome back end, leading to inefficiencies and stunted growth. The article shows why aligning digital tools with data-driven decisions is crucial for long-term success.
To bridge this gap, organizations must integrate a robust data strategy that is aligned with their digital initiatives. This strategy should be a mix of top-down leadership and bottom-up innovation and promote a data-centric culture.
AI and Ungoverned Data
In his article, Robert S. Seiner points out that the rapid adoption of Artificial Intelligence (AI) relies heavily on the quality and governance of the data used. Differentiating between AI Governance and Data Governance is crucial for effective implementation.
The article covers data governance, the importance of differentiating it from AI governance, and the risks of running AI on ungoverned data. AI Governance focuses on ethical AI use, while Data Governance ensures data quality and security.
Conclusion: Differentiating between AI and Data Governance is essential for building robust, ethical AI solutions. Investing in both frameworks ensures high-quality, secure, and reliable AI practices, mitigating risks and unlocking AI's full potential.
Deequ - An Open Source Data Quality Library
Data Quality with Deequ: An Overview
Deequ, an open-source data quality library developed on Apache Spark, offers scalable data quality checks that can be integrated into ETL pipelines. It introduces a declarative API for crafting quality constraints and validation code, enabling seamless integration of unit tests for data at scale. Here's a concise breakdown:
What is Deequ? A library that leverages Apache Spark for scalable data quality validation, supporting both small and large datasets with built-in and user-defined constraints.
Deequ's approach: Uses Spark’s distributed computation to validate constraints, supports incremental updates, and provides a domain-specific language for quality checks.
Constraint suggestion: Assesses data quality column-wise and suggests improvements based on completeness and uniqueness metrics.
Analyzers and metrics: Uses built-in analyzers to compute and track data metrics over time.
Anomaly detection: Employs standard and customizable algorithms to detect data anomalies based on user-defined thresholds.
Technology stack: Built on Apache Spark and compatible with Scala and PySpark, it integrates with Spark's computation engine and stores data in DynamoDB and S3.
Limitations: Lacks a user interface, requires data in Spark DataFrame format, and has challenges in defining quality checks for all columns.
Industry adoption: Adopted by Amazon, Thoughtworks, Netflix, and others, valued for its open-source nature and integration with Spark.
Conclusion:
Deequ emerges as a robust solution for data quality checks across large-scale ETL pipelines. Despite limitations, its declarative API, anomaly detection capabilities, and industry adoption underscore its significance in ensuring robust data quality management.
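To make the idea of declarative quality constraints more concrete, here is a minimal plain-Python sketch. It deliberately does not use Deequ or Spark, and it does not reproduce Deequ's API: the `Check` class, the `completeness` and `uniqueness` helpers, the sample rows, and the thresholds are all hypothetical names and values chosen for illustration. The point is only the pattern of declaring a metric, a column, and a threshold, then evaluating all checks in one pass.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative rows; in a Spark pipeline this would be a DataFrame.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": None},
    {"id": 3, "country": "FR", "amount": 7.5},
    {"id": 3, "country": None, "amount": 3.0},
]


def completeness(rows, column):
    """Fraction of rows where the column is not None."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Fraction of rows whose value in the column occurs exactly once."""
    values = [r[column] for r in rows]
    return sum(values.count(v) == 1 for v in values) / len(values)


@dataclass
class Check:
    """A declared constraint: the metric on a column must reach a threshold."""
    name: str
    metric: Callable
    column: str
    threshold: float

    def run(self, rows):
        value = self.metric(rows, self.column)
        return {"check": self.name, "value": value,
                "passed": value >= self.threshold}


# Constraints are declared as data, separate from the evaluation logic.
checks = [
    Check("id is unique", uniqueness, "id", 1.0),
    Check("country mostly complete", completeness, "country", 0.7),
    Check("amount mostly complete", completeness, "amount", 0.9),
]

results = [check.run(rows) for check in checks]
for result in results:
    print(result)
```

With the sample rows, the duplicate `id` and the missing `amount` cause two of the three checks to fail, which is the kind of signal an ETL job could use to stop or quarantine a batch.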
Data Governance in the AI Act
AI Act: data governance in the age of artificial intelligence
Implementing strong data governance practices under the AI Act is crucial for ensuring transparency, trustworthiness, and compliance in AI systems, supporting responsible AI development.
In his article, Leon Doorn points out that the AI Act introduces new regulations for data governance to ensure secure and ethical management of AI-driven data. The final version of the AI Act, recently leaked on LinkedIn, provides insights into the agreements among the European Commission, Council, and Parliament. Key takeaways include:
Compliance documentation: Manufacturers can use existing documentation and procedures to demonstrate compliance, integrating AI-related risks into their current frameworks.
Simplified documentation for SMEs: Small and medium-sized enterprises (SMEs) and startups can provide technical documentation in a simplified form, though compatibility with existing frameworks remains uncertain.
Integrated conformity assessments: AI system conformity assessments will be incorporated into existing procedures, requiring Notified Bodies to meet AI Act requirements.
Risk definition and environmental considerations: The significant risk definition proposed by the European Parliament was not agreed upon, and explicit environmental risk assessment requirements were removed, aligning with fundamental environmental protection rights.
Data governance as defined by the AI Act includes practices to maintain data quality and privacy throughout the lifecycle of the data. High data quality is essential, but must be balanced with privacy protections, as demonstrated by the biases in health algorithms. The AI Act requires organizations to implement robust data management procedures, including data collection, storage, analysis, and retention. Compliance with the Act requires documentation of these processes as part of the quality management system and technical documentation.
Additional References
An Introduction to EU AI Act: A Practical Guide to Governance, Compliance, and Regulatory Guidelines
Semantic Layer for Embedded Analytics and AI
AI and embedded analytics, supported by a semantic layer, are transforming the way organizations use data. This integration enhances the employee experience and transforms siloed data into actionable insights that drive growth and innovation.
In this article, Artyom Keydunov points out that business intelligence (BI) has evolved from traditional platforms to modern analytics that are accessible to many users. Nevertheless, the numerous data sources and dashboards can make it difficult to find the right data. Curated data experiences, such as embedded analytics and AI, improve the accessibility of data for internal and external applications.
Key insights:
A universal semantic layer is critical to the delivery of embedded analytics and AI. It serves as a translation layer between data repositories and endpoints, providing a consistent view of unified data. On this basis, companies can efficiently provide curated data experiences and thus significantly improve internal processes.
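The "translation layer" idea can be sketched in a few lines of Python. Everything in this example is hypothetical: the two sources, their field names, the `FIELD_MAPPINGS` table, and the metric definitions are assumptions made for illustration, and the sketch ignores concerns such as currency conversion, access control, and caching that a real semantic layer would handle.

```python
# Two hypothetical sources with inconsistent field names.
orders_eu = [
    {"order_total": 120.0, "cust": "a"},
    {"order_total": 80.0, "cust": "b"},
]
orders_us = [{"amount_usd": 200.0, "customer_id": "c"}]

SOURCES = {"orders_eu": orders_eu, "orders_us": orders_us}

# The semantic layer maps each source's fields onto shared business terms.
FIELD_MAPPINGS = {
    "orders_eu": {"revenue": "order_total", "customer": "cust"},
    "orders_us": {"revenue": "amount_usd", "customer": "customer_id"},
}


def unified_rows():
    """Yield rows from all sources expressed in the shared vocabulary."""
    for source_name, rows in SOURCES.items():
        mapping = FIELD_MAPPINGS[source_name]
        for row in rows:
            yield {term: row[field] for term, field in mapping.items()}


# Business metrics are defined once, against the shared vocabulary.
METRICS = {
    "total_revenue": lambda rows: sum(r["revenue"] for r in rows),
    "customer_count": lambda rows: len({r["customer"] for r in rows}),
}


def query(metric_name):
    """Answer a metric request the same way for every consuming application."""
    return METRICS[metric_name](list(unified_rows()))


print(query("total_revenue"))
print(query("customer_count"))
```

Because dashboards, embedded analytics, and AI assistants would all call `query` rather than the raw sources, each of them receives the same definition of a metric.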
Driving Growth Through Intelligent Orchestration
Adopting intelligent orchestration is essential for businesses seeking to lead in their industry, drive growth, and build lasting customer loyalty.
In this article, Rob Vatter points out that superior customer experience (CX) is essential for business growth and differentiation in today’s competitive landscape, citing Forrester’s CX Index™. However, achieving exceptional CX is challenging due to factors such as skill gaps, misaligned priorities, and rapid technological advancements, including generative AI.
A recent Forrester Consulting report commissioned by Cognizant highlights that U.S. firms have experienced declining customer satisfaction in recent years. Intelligent orchestration emerges as a critical solution, integrating processes, data, technology, and operations into a cohesive system. This approach enhances CX by ensuring interactions are personalized and relevant.
Key insights:
High-maturity firms using intelligent orchestration show significant improvements in metrics like customer satisfaction and Net Promoter Score.
GenAI Adoption in Financial Services
GenAI in highly regulated industries
Generative Artificial Intelligence (GenAI) offers substantial benefits, such as automating repetitive tasks, extracting insights from complex data, and making knowledge widely accessible. Despite these advantages, the financial services industry has been cautious in adopting GenAI, with 30% of organizations banning its use due to concerns about accuracy, security, and regulatory compliance.
Best practices for successful GenAI adoption: Start with manageable projects and clear goals, scale incrementally based on proven successes, and prioritize transparency, security, and compliance.
By taking a cautious, well-planned approach and focusing on security and transparency, financial services organizations can effectively leverage GenAI and improve their operational efficiency and compliance.
Takeaways
Here are the key takeaways from this month's edition, providing you with essential strategies and insights to excel in data engineering:
GenAI and Data Products: Data engineers should focus on creating high-quality data products with clear principles and governance. These products are essential for optimizing Generative AI applications and enhancing decision-making and operational efficiency.
AI Content Management: Data engineers should focus on preparing content for AI by defining clear knowledge domains, reviewing and cleansing data, creating reusable content models, and structuring content into manageable components to improve the effectiveness and accuracy of AI.
Semantic Layer: Implement entity resolution within semantic layers to unify and contextualize scattered enterprise data. This approach supports decision making, improves data quality and optimizes organizational efficiency across business units.
GenAI & Data Engineering: Use GenAI to automate routine tasks and streamline workflows, but focus on developing strategic, business-oriented skills and knowledge to add unique value. This approach will enhance your role and adaptability in a rapidly evolving technical landscape.
Digital Transformation & Data Strategy: Focus on integrating robust data strategies into digital transformation. Prioritize seamless data management and alignment with digital tools to close gaps between front-end innovation and back-end data systems and ensure a cohesive, data-driven approach to business success.
AI & Ungoverned Data: Ensure robust data governance practices by establishing clear data quality, security, and compliance policies. Regularly audit and validate data to maintain integrity, and integrate these practices with your AI initiatives to support ethical and effective AI outcomes.
Deequ - Data Quality Library: Use Deequ for scalable data quality checks in Spark-based ETL pipelines. Use the declarative API to define and automate data quality constraints to ensure robust data integrity, but be aware of the lack of user interface and limited support.
EU AI Act: To comply with the AI Act, implement comprehensive data governance practices. Develop and document robust procedures for data management, including acquisition, quality control, and bias mitigation. Ensure these practices align with GDPR requirements and integrate them into your quality management system to demonstrate compliance and maintain data integrity.
Embedded Analytics and AI: Implement a universal semantic layer to unify data sources and streamline access. This enables the creation of embedded analytics and AI solutions, starting with internal applications to enhance decision-making and workflows, before expanding to customer-facing solutions.
Intelligent Orchestration: To excel in today’s competitive market, integrate intelligent orchestration by harmonizing processes, data, and technology. This approach enhances customer experience and ensures your business remains agile and responsive to evolving customer needs and market changes.
GenAI in Highly Regulated Industries: To maximize the benefits of generative AI, take a cautious approach with clear goals, prioritize transparency and security, and ensure compliance. Start with manageable projects and scale incrementally based on proven successes and safe practices.
Conclusion
This month's issue offers key strategies for data engineering excellence. Focus on creating high-quality data products with robust governance to optimize generative AI and improve decision making. Implement semantic layers to unify data and increase efficiency, and use generative AI to automate routine tasks while you develop strategic capabilities. If you are driving digital transformation, integrate a strong data strategy to align innovative solutions with effective data management. Ensure compliance with the EU AI Act through comprehensive data practices, and remember that intelligent orchestration drives business growth through the integration of CX, data and technology. Adopt generative AI cautiously, with clear goals and a focus on security. Finally, don't forget to use tools like Deequ for scalable data quality checks.
Stay tuned for the next issue, in which we will explore the latest advances and findings in data technology.
See you next month ...
#DataProducts #GenerativeAI #DataManagement #DataEngineering #DataGovernance #DataScience #BusinessIntelligence #AIApplications #DataQuality #DataTransformation #AIandData #ContentManagement #KnowledgeGraph #AIReadiness #Metadata #MachineLearning #DigitalTransformation #SemanticLayer #DataStrategy #EntityResolution #DataIntegration #AdvancedAnalytics #DataEnrichment #EnterpriseData #DataOptimization #ArtificialIntelligence #BusinessUnderstanding #ValueDelivery #AIInnovation #DataDriven #BigData #DataAnalytics #BusinessGrowth #EnterpriseIT #AIGovernance #DataSecurity #EthicalAI #DataCompliance #CustomerExperience #CX #FinancialServices #Regulation #TechInnovation #AIAdoption #FinancialTechnology #GenAI #BusinessStrategy #TechSolutions