Data Warehousing, BI, Big Data & Data Science for Data Managers

Executive Summary

The landscape of Data Management encompasses several critical components, including Big Data, Data Warehousing, and Business Intelligence, each presenting unique challenges and opportunities. Key themes include the development of enterprise Data Warehouses and strategies for effective self-service reporting, as well as the implementation of robust data governance frameworks.

Understanding Data Warehousing architecture is essential, particularly in relation to Big Data and its implications for organisational decision-making. Concepts like the Abate Information Triangle highlight the importance of data integration and analysis, while the roles and responsibilities of Data Managers are pivotal in bridging Data Science with business performance measurement.

Additionally, the adoption of models like Data Vault can significantly impact enterprise Data Warehousing, reflecting the evolving nature of Data Management in the face of privacy concerns. The fourth paradigm of Data Science completes the picture, focusing on the key elements and processes essential for effective Data Management and statistical analysis.


Big Data in Data Warehousing and Business Intelligence

Howard Diesel opens the webinar by emphasising the integral role of Big Data within the context of Data Warehousing and Business Intelligence (BI). He goes on to highlight its importance in decision support and gaining valuable insights for businesses. The data life cycle runs from data use and enhancement through to integrating Big Data back into planning. The aim is to elevate data assets through the Data, Information, Knowledge, Wisdom (DIKW) triangle, transforming data into information, knowledge, and ultimately wisdom. He references the "Five Ws and Two Hows" framework, encouraging the exploration of essential questions such as Who, What, When, Where, Why, How, and How Many, which are crucial for understanding and measuring business events and metrics effectively.


Figure 1 How-to Analyse Data Using the 5W & 2H

Challenges in Writing the Revised Edition of DMBoK V2


The upcoming revised edition of the Data Management Body of Knowledge (DMBoK) Version 2 is set to highlight the differences between the second version and the new edition. It is crucial for those familiar with Version 2 to understand these changes, especially as Version 3 is in the works. As a member of the editorial board for Version 3, Howard invites collaboration to review the writing for accuracy and clarity, ensuring that any errors are addressed, as previous editions have had their share of mistakes.

Figure 2 Content of "How-To Analyse"
Figure 3 DW & BI Essential Components


Understanding the Role of Data Warehousing and Business Intelligence

The mind map of Data Warehousing outlines its definition and goals, emphasising the importance of maintaining a technical environment for Business Intelligence activities to facilitate effective decision-making. Initially referred to as Decision Support Systems (DSS), the field evolved to include Business Intelligence, which encompasses Data Science to aid businesses in their decision-making processes. Key business drivers include compliance, operational support, and leveraging insights for innovation, focusing on how data can be utilised to improve business operations and outcomes.

Figure 4 Focus on “Governance”

Strategies for Developing an Enterprise Data Warehouse

To develop an effective Enterprise Data Warehouse (EDW), it's essential to begin with business goals and maintain a focus on the desired outcomes. The process involves initially understanding all potential data inputs on a global scale while building the system incrementally, starting with a single subject and ultimately creating data marts. It is crucial to avoid importing aggregated data from applications and instead aim to retain the lowest level of detail to support thorough archiving.

Streamlining applications is beneficial, allowing for a lean design that offloads data into the EDW. Following data retention policies, data can be archived and disposed of appropriately, often requiring aggregation for efficient storage. A successful example highlighted a 2012 implementation of in-memory databases that significantly enhanced performance by extracting data from applications into memory, showcasing the EDW as a strategic archiving solution.


Data Strategies and the Challenges in Self-Service Reporting

The development of aggregate data and self-service data strategies is rapidly advancing, particularly in the context of reporting. A well-defined reporting strategy is essential, especially as organisations increasingly utilise self-service tools to create regulatory reports, which can provide significant value.

In his previous experience as a BI developer, Howard found achieving precise reporting challenging, often necessitating a return to SQL Server Reporting Services due to limitations in other visualisation tools. This frustration highlighted the complexities involved in producing accurate and detailed reports.

Data Governance and Implementation Strategies?

Data Governance emphasises the importance of obtaining business acceptance before going live, ensuring consistency and quality across logical data sets. User satisfaction should be monitored through Service Level Agreements (SLAs), particularly regarding the quality of data movement and ETL processes, which are crucial for timely decision-making. A release roadmap should be created using the MoSCoW Prioritisation method (Must-have, Should-have, Could-have, Won't-have), allowing for structured planning of project phases. Key elements of implementation include configuration management, cultural change, and the use of various tools such as metadata repositories and data integration techniques, along with methodologies like prototyping and self-service Business Intelligence.
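As a simple illustration of how such a roadmap might be captured, the minimal Python sketch below (the deliverable names are hypothetical) groups backlog items by MoSCoW category and prints a phased plan:

```python
from collections import OrderedDict

# Hypothetical backlog items tagged with MoSCoW priorities.
backlog = [
    ("Customer data mart", "Must"),
    ("ETL quality dashboards", "Must"),
    ("Self-service BI portal", "Should"),
    ("Metadata repository search", "Could"),
    ("Real-time streaming feeds", "Won't"),
]

# Group deliverables into release phases by priority.
roadmap = OrderedDict((p, []) for p in ["Must", "Should", "Could", "Won't"])
for item, priority in backlog:
    roadmap[priority].append(item)

for phase, items in roadmap.items():
    print(f"{phase}-have: {', '.join(items) or '(none this release)'}")
```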

Figure 5 Focus on "Implementation"
Figure 6 Focus on "Technology"

The Key Concepts in Data Processing and Analysis

Howard spends some time covering the essential concepts in data processing and warehousing, including batch change data capture (CDC) and historical data integration. The distinction between Inmon's Corporate Information Factory (CIF) and Kimball's dimensional modelling approach highlights different methodologies for building Data Warehouses, with Kimball emphasising quicker implementations through fact tables and conformed dimensions. Howard also touches on various data analytical methods such as OLAP (Online Analytical Processing), ROLAP (Relational OLAP), and MOLAP (Multidimensional OLAP), as well as central components like staging, reference, Master Data, data marts, and operational data stores (ODS). Understanding these concepts is vital for effective Business Intelligence (BI) and data analysis strategies.

Figure 7 Focus on "Load Processing"
Figure 8 Focus on "DW Architecture Concepts"
Figure 9 Focus on "Essential Concepts"
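To make Kimball's fact-and-dimension idea concrete, here is a minimal sketch in Python with pandas; the table and column names are hypothetical. A conformed dimension such as the date dimension below would be shared across data marts so that measures from different fact tables roll up consistently:

```python
import pandas as pd

# Hypothetical conformed date dimension, shared across data marts.
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month": ["2024-01", "2024-01"],
})

# Hypothetical fact table holding measures at the lowest grain.
fact_sales = pd.DataFrame({
    "date_key": [20240101, 20240101, 20240102],
    "product_key": [1, 2, 1],
    "sales_amount": [100.0, 250.0, 75.0],
})

# A typical dimensional query: join fact to dimension, aggregate a measure.
monthly_sales = (
    fact_sales.merge(dim_date, on="date_key")
              .groupby("month")["sales_amount"]
              .sum()
)
print(monthly_sales)
```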

Understanding Data Warehousing Architecture and Big Data

The chapter on Big Data and Data Science highlights the importance of the evolving architecture in this field. It emphasises that while data lakes are prevalent, they have not eliminated the need for centralised Data Warehouses. Recently, the concept of the data lakehouse has emerged, which aims to integrate the functionalities of warehousing with data lakes. This approach standardises data transformation and ensures efficient processing of data from the lake to the warehouse, making it ready for operational use quickly. A key takeaway is the significance of utilising insights from Big Data to enhance and inform Data Warehousing practices.

Figure 10 Big Data and Data Science Essential Components

The Impact of Data Privacy and Digital Decisioning in Organisations

Approximately three to four months ago, a case emerged involving an individual who had been denied credit by an organisation based on a decision made by a machine learning model. Ten years after the decision, the individual took legal action against the organisation, citing the negative impact it had on their life. The organisation faced challenges in providing evidence regarding the rejection since they had not retained sufficient data on the decisions made by their model. Ultimately, the court awarded damages to the individual. This case highlights the importance of incorporating human oversight in high-stakes digital decision-making processes, emphasising the need for better data retention and accountability in algorithms that influence significant outcomes.

The Abate Information Triangle and Data Integration Strategy

The Abate Information Triangle highlights the connections between data, information, knowledge, and wisdom, emphasising the importance of Master Data in validating relevant information. As organisations collect streaming and social networking data, particularly from sources like photographs, it's crucial to ensure that this data pertains to the organisation itself rather than competitors.

The process involves understanding business needs, selecting appropriate data sources, and integrating Master Data to provide context. This integration facilitates data analysis and requires monitoring to minimise error rates before deployment, after which insights may generate further data requests. A governance framework is essential to evaluate the relevance and provenance of the collected data, ensuring it is vetted appropriately to support informed decision-making.

Data Integration and Analysis?

The updated DMBoK Version 2 Revised edition introduces significant changes, particularly emphasising the importance of an integrated data system that consolidates data from various sources, which was not mentioned in the previous version. While Version 1 focuses on providing decision support, Version 2 highlights the integration of data as a crucial element.

The focus on Big Data shifts from posing unknown questions at the outset of analysis to addressing the handling of large volumes of data and the associated statistical analysis within the Data Science paradigm. This change reflects a broader understanding of data complexities and analytical approaches, moving beyond initial concepts of known and unknown data states to encompass various analytical methodologies such as confirmatory and exploratory analytics.

Figure 11 Definitions for Data Warehousing and Business Intelligence and Big Data and Data Science


Figure 12 Alteration of Definitions

Big Data and Data Science in Business Performance Measurement?

The essence of Data Science lies in uncovering answers and insights from various data types. Business Intelligence (BI) focuses on analysing historical data to understand what has happened and why. This involves identifying changes in performance, such as an increase or decrease in sales, and determining their causes.

When evaluating business performance, it's crucial to address five key questions: What happened in each period? Why did it happen? What might happen if current trends continue? What actions should be taken? And what critical information might be missing? These discussions with executives are essential for effectively sharing insights.

Figure 13 Business Drivers

Understanding and Managing Data Flow

The DMBoK Version 2 Revised edition outlines a structured approach to Big Data strategy, encompassing activities such as understanding business needs, establishing environments, selecting data sources, and data acquisition and ingestion. The process includes developing hypotheses, integrating and exploring data, and communicating insights through data storytelling and visualisation.

A critical aspect highlighted is model drift, which occurs when real-world conditions diverge from the original model, potentially diminishing its value and accuracy. This phenomenon necessitates ongoing feature engineering and monitoring of model performance to prevent increased error rates, which can lead to a reduction in the perceived value of data assets—a concept referred to as applying a "haircut" to their valuation. Additionally, the discussion touches on data drift, which involves changes in the statistical properties of the data over time, further impacting model effectiveness.

Figure 14 Inputs, Activities & Deliverables (HOW)
Figure 15 Definition of "Data Drift"
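One common way to monitor the data drift described above is to compare the statistical distribution of incoming feature values against a training-time baseline. A minimal sketch, assuming scipy is available and using a two-sample Kolmogorov-Smirnov test on simulated data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Baseline: feature values observed when the model was trained.
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)

# Production: recent values whose mean has shifted (simulated drift).
production = rng.normal(loc=0.5, scale=1.0, size=5_000)

# The KS test flags a significant difference between the two samples.
statistic, p_value = ks_2samp(baseline, production)
if p_value < 0.01:
    print(f"Data drift detected (KS={statistic:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected")
```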

Responsibilities of the Data Manager in Big Data & Data Science and Data Warehousing & Business Intelligence

The Data Manager is responsible for key areas in Data Warehousing and Big Data, focusing on operations planning, model performance monitoring, and enhancement strategies. In Big Data, metrics such as usage, loading speed, and data lake ingestion time are crucial for evaluating how data, including large videos, is processed. Effective storytelling helps derive insights from data, revealing "aha moments" that can drive valuable business reports. Unlike the fixed structure of Data Warehousing, Big Data operates in a more dynamic environment, employing various analytics to answer critical questions about business performance and decision-making. The progression from data-enabled to knowledge-driven capabilities indicates an organisation's AI maturity, with assessments determining the level of questions that can be addressed with existing data.

Figure 16 Data Manager's Role and Responsibility
Figure 17 Metrics (How Many)
Figure 18 "Information to Insights to Decisions to Action"
Figure 19 DIKW Diagram

Data Vault and its Impact on Enterprise Data Warehousing

Howard highlights the emerging importance of Data Vault in enterprise Data Warehousing, particularly in light of its limited coverage in the DMBoK Version 2 Revised edition. Despite the absence of extensive updates on Data Vault in the textbook, it unexpectedly featured prominently in exam questions. Data Vault, often referred to as Data Warehouse 2.0, has effectively replaced the traditional relational model, addressing key issues such as handling unstructured data and the constraints of the relational model that hindered speed and flexibility in Data Management. Overall, the transition to Data Vault represents a significant evolution in the approach to enterprise Data Warehousing.

Figure 20 "BUT, What about Data Vault?"

Enterprise Data Warehousing and Data Vault Models

The customer journey in Data Management begins with the identification of potential clients as "suspects" when they first visit a website. Once they express interest, they transition to "prospects." Traditional relational models hinder data entry due to strict business rules and cardinality requirements, often leaving essential details, such as customer names or IDs, uncollected. In contrast, the Data Vault model simplifies this process by requiring only a customer code to access and link additional data as the customer progresses.

The Data Vault consists of raw and business data components, facilitating improved data quality and design by allowing for comparisons across multiple data sources in the staging area. This enables effective Master Data modelling and provides a foundation for data quality assessment and trust rules without needing to revert to original sources, marking a significant advancement in enterprise Data Warehousing.

Figure 21 Data Vault Architecture
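The hub-and-satellite pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not Data Vault tooling; the table and column names are hypothetical. The point is that only the business key (the customer code) is required up front, and descriptive attributes are linked later as they become known:

```python
from datetime import datetime, timezone

# Hub: one row per business key; nothing else is mandatory.
hub_customer = []

# Satellite: descriptive attributes, versioned by load timestamp.
sat_customer_details = []

def load_customer(customer_code, **attributes):
    """Record the business key immediately; attach attributes if known."""
    now = datetime.now(timezone.utc)
    if customer_code not in {h["customer_code"] for h in hub_customer}:
        hub_customer.append({"customer_code": customer_code, "load_ts": now})
    if attributes:
        sat_customer_details.append(
            {"customer_code": customer_code, "load_ts": now, **attributes}
        )

# A "suspect" visits the website: only a code is captured.
load_customer("C-1001")
# Later, as a "prospect", their name becomes known and is linked.
load_customer("C-1001", name="A. Suspect", status="prospect")

print(hub_customer)
print(sat_customer_details)
```

Because satellite rows are versioned by load timestamp, the full history of the customer's progression from suspect to prospect is preserved without ever blocking data entry on missing attributes.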

Fourth Paradigm of Data Science

The evolution of scientific paradigms has progressed from the empirical approach, where natural phenomena were simply described, to the theoretical paradigm utilising mathematical models, followed by the computational paradigm that simulated complex systems. The fourth paradigm, termed "e-science" by Jim Gray and Alex Szalay, integrates experimental data, archives, and simulations into a comprehensive Data Warehouse, allowing data scientists to explore and analyse massive datasets. This approach combines computer science, business domain knowledge, and mathematics, leading to the development of Data Science. Moving from multidisciplinary to interdisciplinary and finally to transdisciplinary collaboration, Data Science fosters a holistic understanding of complex systems by merging computational models, algorithms, and extensive metadata with diverse knowledge sources.

Figure 22 Kimball: Dimensional Data Warehousing


Figure 23 Data Science Paradigm


Figure 24 eScience - A Transformed Scientific Method


Figure 25 Fourth Paradigm of Science
Figure 26 "X-Info & Comp-X for Discipline X"


Figure 27 Levels of Discipline Integration



The Key Elements and Process of Data Science

The Data Science paradigm encompasses several key elements, starting with data acquisition, followed by collection, cleaning, integration, and transformation. Once the data is prepared, exploratory analysis is conducted using statistical methods, data visualisation, and hypothesis testing. This involves defining a hypothesis and determining whether it is confirmatory or exploratory, leading to predictions or repeated probes of the hypothesis. Following exploratory analysis, the focus shifts to model building and training, which includes feature engineering and model selection to identify core features. The model is then trained and evaluated for quality, after which it is deployed and monitored for performance drift. Finally, the process culminates in data interpretation, storytelling, and decision-making.

Figure 28 Key Elements to Data Science Paradigm
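As a minimal sketch of this end-to-end flow (assuming scikit-learn is installed; the bundled dataset merely stands in for acquired and cleaned data), the steps of preparation, feature engineering, model selection, training, and evaluation can be chained as follows:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Acquisition: a bundled dataset stands in for collected, cleaned data.
X, y = load_breast_cancer(return_X_y=True)

# Split so that model quality can be evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature engineering (scaling) and model selection in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation before deployment; in production this metric would be
# monitored over time to catch model and data drift.
print(f"Held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```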


Data Management in Data Science?

Data Science heavily relies on effective Data Management, particularly in the context of simulations and experiments involving user behaviour and customer data platforms. The advent of concepts like "bring your own lake" allows platforms like Snowflake to create virtualised views of vast amounts of data.

Traditional statistical methods often fall short when applied to single files, as the sheer volume of information in data lakes can overwhelm these approaches. To effectively analyse this data, it's essential to think of analysis as searching for needles in an ever-expanding haystack, where careful integration, alignment, and aggregation of data are crucial.

Furthermore, successful data analysis requires transdisciplinary integration, necessitating domain knowledge alongside mathematical expertise, as statisticians and mathematicians cannot deliver value without a solid understanding of the context in which they are working.

Figure 29 (Data) Science Needs Data Management


Figure 30 Data Analysis



Data Management and Analysis?

Scientists are facing significant challenges in data delivery, particularly in fields like astronomy, where numerous observatories worldwide contribute vast amounts of data, often reaching terabytes and even petabytes. Traditional file-by-file analysis methods are proving inadequate, with processes like searching a petabyte of data taking up to three years and costing around $1 million. To address these issues, researchers advocate for better Data Management through indexing, similar to Google's approach, which enables efficient search and analysis. The emphasis is shifting from analysing files to utilising databases, highlighting the importance of transitioning from data lakes to Data Warehouses, a concept popularised by recent innovations such as the data lakehouse model introduced by Inmon.

Figure 31 Data Analysis Expanded


Figure 32 Data Delivery: Hitting a Wall
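The case for indexing over file-by-file scanning can be seen in miniature with a toy Python example (the record layout is hypothetical): a linear scan touches every record for each query, while a prebuilt index answers the same lookup almost instantly:

```python
import time

# One million hypothetical observation records keyed by object id.
records = [(f"obj-{i}", i * 0.5) for i in range(1_000_000)]

# File-by-file style: scan every record for each query.
start = time.perf_counter()
result = next(value for key, value in records if key == "obj-999999")
scan_time = time.perf_counter() - start

# Database style: build an index once, then look up directly.
index = dict(records)
start = time.perf_counter()
result = index["obj-999999"]
lookup_time = time.perf_counter() - start

print(f"scan: {scan_time:.4f}s, indexed lookup: {lookup_time:.6f}s")
```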

Intricacies of Data Management and Statistical Analysis

Statistical analysis involves the creation of uniform samples and filtering relevant data subsets, relying on Data Management capabilities such as data models and controlled vocabularies. Ensuring data quality through completeness and addressing bad data are critical for ethical data handling. Business Intelligence (BI) allows for the automation of certain procedures to understand data's "What" and "Why."

Data Science improves hypothesis testing and likelihood calculations through structured storage. Recent developments, like Microsoft enabling R and Python routines to run directly in databases, enhance processing efficiency by executing queries server-side. Additionally, effective Data Management skills are essential for visualisations and building scalable platforms for Big Data, which encompasses capturing, curating, analysing, publishing, and peer reviewing data before providing access.
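As a sketch of the sampling-and-testing workflow described above (assuming pandas and scipy; the data is simulated), the code below filters relevant subsets, draws uniform samples, and runs a two-sample t-test:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# Hypothetical observations with a region attribute and a measure.
df = pd.DataFrame({
    "region": rng.choice(["north", "south"], size=10_000),
    "value": rng.normal(loc=50, scale=10, size=10_000),
})

# Filter relevant subsets, then draw uniform samples of equal size.
north = df[df["region"] == "north"]["value"].sample(500, random_state=7)
south = df[df["region"] == "south"]["value"].sample(500, random_state=7)

# Hypothesis test: do the two regions differ in mean value?
statistic, p_value = ttest_ind(north, south)
print(f"t={statistic:.3f}, p={p_value:.3f}")
```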


Importance and Challenges of Data Management

The costs associated with data platforms are significant, estimated at around $1,000,000, with schema, ontologies, and provenance being the most expensive areas in Data Science. As data is increasingly generated digitally through sensors, IoT devices, and simulators, the need for effective Data Management becomes crucial.

A data catalogue, often referred to by data scientists as a digital data library, plays a vital role in organising these diverse datasets and accompanying research, facilitating access to valuable insights from both small and large data sets. This comprehensive approach aims to enhance the visibility and understanding of data assets, ultimately supporting better organisational decision-making.

Figure 33 Analysis and Databases


Figure 34 Analysis and Databases Expanded
Figure 35 Data Science Paradigm Elements
Figure 36 Data Science Paradigm Elements & Costs
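A catalogue entry need not be elaborate; at minimum it records a dataset's schema, provenance, and discovery tags. A minimal Python sketch, with hypothetical field choices:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """Minimal metadata record for one dataset in a digital data library."""
    name: str
    description: str
    schema: dict            # column name -> type
    provenance: str         # where the data came from
    tags: list = field(default_factory=list)

catalogue = [
    CatalogueEntry(
        name="sensor_readings_2024",
        description="Hourly IoT sensor readings from plant A",
        schema={"sensor_id": "str", "ts": "datetime", "reading": "float"},
        provenance="ingested from MQTT broker, curated nightly",
        tags=["iot", "time-series"],
    ),
]

# Discovery: find datasets by tag, the way a library index is searched.
print([e.name for e in catalogue if "iot" in e.tags])
```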

Data Warehousing and Big Data Challenges

Howard highlights the ongoing relevance of Data Warehousing in the face of the rise of Big Data, emphasising the challenges posed by inconsistent transformations within data lakes, where varying interpretations of data attributes can lead to chaos. Participants stress the importance of discipline in data modelling and cataloguing to ensure a reliable semblance of reality, which has diminished in modern data practices.

The introduction of the lakehouse concept aims to standardise transformations from data lakes through ETL processes, providing a more consistent data view. Concrete examples from astronomy illustrate how a common schema has enabled federated data querying across multiple observatories, while health informatics showcases the use of ontologies to integrate data from various hospitals through shared metadata standards.

A discussion follows on the transition from traditional ETL (Extract, Transform, Load) processes to automated, API-driven integration systems utilising ontologies for data normalisation across various genomic projects. This shift allows for the seamless integration of diverse data sources, enabling real-time data curation and analysis.

Initiatives like the James Webb Space Telescope and the MeerKAT array exemplify the use of standardised ontologies to compile large datasets from global observational efforts, facilitating timely queries and insights without the constraints of weather-related data availability. This advancement marks significant progress in the automated handling and interpretation of complex data in the fields of genomics and astrophysics.
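Ontology-driven normalisation can be pictured as mapping each source's local field names onto shared canonical terms before integration. The sketch below is a toy Python illustration; the source names and mappings are hypothetical:

```python
# Hypothetical per-source mappings onto a shared ontology of terms.
ontology_map = {
    "observatory_a": {"ra_deg": "right_ascension", "dec_deg": "declination"},
    "observatory_b": {"RA": "right_ascension", "DEC": "declination"},
}

def normalise(source, record):
    """Rename a source record's fields to canonical ontology terms."""
    mapping = ontology_map[source]
    return {mapping.get(k, k): v for k, v in record.items()}

# Two observatories, two naming conventions, one federated record shape.
a = normalise("observatory_a", {"ra_deg": 10.68, "dec_deg": 41.27})
b = normalise("observatory_b", {"RA": 83.82, "DEC": -5.39})
print(a, b)  # both now use right_ascension / declination keys
```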


Figure 37 Digital Data Library (Catalogue)


Figure 38 Use of ontologies in Astronomy


Figure 39 Use of ontologies in RNA Structure Genomics

Join our CDMP Data Warehousing, BI, Big Data & Data Science Specialist training to prepare for a Master Pass and upskill your career!

Learn more about our training here:

CDMP Specialist Training
