Data Warehousing, BI, Big Data & Data Science for Data Managers

Executive Summary

The landscape of Data Management encompasses several critical components, including Big Data, Data Warehousing, and Business Intelligence, each presenting unique challenges and opportunities. Key themes include the development of enterprise Data Warehouses and strategies for effective self-service reporting, as well as the implementation of robust data governance frameworks.

Understanding Data Warehousing architecture is essential, particularly in relation to Big Data and its implications for organisational decision-making. Concepts like the Abate Information Triangle highlight the importance of data integration and analysis, while the roles and responsibilities of Data Managers are pivotal in bridging Data Science with business performance measurement.

Additionally, the adoption of models like Data Vault can significantly impact enterprise Data Warehousing, reflecting the evolving nature of Data Management in the face of privacy concerns. The fourth paradigm of Data Science completes the picture, focusing on the key elements and processes essential for effective Data Management and statistical analysis.


Big Data in Data Warehousing and Business Intelligence

Howard Diesel opens the webinar by emphasising the integral role of Big Data within the context of Data Warehousing and Business Intelligence (BI). He goes on to highlight its importance in decision support and gaining valuable insights for businesses. The data life cycle runs from data use and enhancement through to integrating Big Data back into planning. The aim is to elevate data assets through the Data, Information, Knowledge, Wisdom (DIKW) triangle, transforming data into information, knowledge, and ultimately wisdom. He references the "Five Ws and Two Hows" framework, encouraging the exploration of essential questions such as Who, What, When, Where, Why, How, and How Many, which are crucial for understanding and measuring business events and metrics effectively.


Figure 1 How-to Analyse Data Using the 5W & 2H

Challenges in Writing the Revised Edition of DMBoK V2


The upcoming revised edition of the Data Management Body of Knowledge (DMBoK) Version 2 is set to highlight the differences between the second version and the new edition. It is crucial for those familiar with Version 2 to understand these changes, especially as Version 3 is in the works. As a member of the editorial board for Version 3, Howard invites collaboration to review the writing for accuracy and clarity, ensuring that any errors are addressed, as previous editions have had their share of mistakes.

Figure 2 Content of "How-To Analyse"
Figure 3 DW & BI Essential Components


Understanding the Role of Data Warehousing and Business Intelligence

The mind map of Data Warehousing outlines its definition and goals, emphasising the importance of maintaining a technical environment for Business Intelligence activities to facilitate effective decision-making. Initially referred to as Decision Support Systems (DSS), the field evolved to include Business Intelligence, which encompasses Data Science to aid businesses in their decision-making processes. Key business drivers include compliance, operational support, and leveraging insights for innovation, focusing on how data can be utilised to improve business operations and outcomes.

Figure 4 Focus on “Governance”

Strategies for Developing an Enterprise Data Warehouse

To develop an effective Enterprise Data Warehouse (EDW), it's essential to begin with business goals and maintain a focus on the desired outcomes. The process involves initially understanding all potential data inputs on a global scale while building the system incrementally, starting with a single subject and ultimately creating data marts. It is crucial to avoid importing aggregated data from applications and instead aim to retain the lowest level of detail to support thorough archiving.

Streamlining applications is beneficial, allowing for a lean design that offloads data into the EDW. Following data retention policies, data can be archived and disposed of appropriately, often requiring aggregation for efficient storage. A successful example highlighted a 2012 implementation of in-memory databases that significantly enhanced performance by extracting data from applications into memory, showcasing the EDW as a strategic archiving solution.


Data Strategies and the Challenges in Self-Service Reporting

The development of aggregate data and self-service data strategies is rapidly advancing, particularly in the context of reporting. A well-defined reporting strategy is essential, especially as organisations increasingly utilise self-service tools to create regulatory reports, which can provide significant value.

In his previous experience as a BI developer, Howard found achieving precise reporting challenging, often necessitating a return to SQL Server Reporting Services due to limitations in other visualisation tools. This frustration highlighted the complexities involved in producing accurate and detailed reports.

Data Governance and Implementation Strategies?

Data Governance emphasises the importance of obtaining business acceptance before going live, ensuring consistency and quality across logical data sets. User satisfaction should be monitored through Service Level Agreements (SLAs), particularly regarding the quality of data movement and ETL processes, which are crucial for timely decision-making. A release roadmap should be created using the MoSCoW Prioritisation method (Must-have, Should-have, Could-have, Won't-have), allowing for structured planning of project phases. Key elements of implementation include configuration management, cultural change, and the use of various tools such as metadata repositories and data integration techniques, along with methodologies like prototyping and self-service Business Intelligence.
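As a simple illustration of how such a roadmap might be captured, the minimal Python sketch below (the deliverable names are hypothetical) groups backlog items by MoSCoW category and prints a phased plan:

```python
from collections import OrderedDict

# Hypothetical backlog items tagged with MoSCoW priorities.
backlog = [
    ("Customer data mart", "Must"),
    ("ETL quality dashboards", "Must"),
    ("Self-service BI portal", "Should"),
    ("Metadata repository search", "Could"),
    ("Real-time streaming feeds", "Won't"),
]

# Group deliverables into release phases by priority.
roadmap = OrderedDict((p, []) for p in ["Must", "Should", "Could", "Won't"])
for item, priority in backlog:
    roadmap[priority].append(item)

for phase, items in roadmap.items():
    print(f"{phase}-have: {', '.join(items) or '(none this release)'}")
```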

Figure 5 Focus on "Implementation"
Figure 6 Focus on "Technology"

The Key Concepts in Data Processing and Analysis

Howard spends some time covering the essential concepts in data processing and warehousing, including batch change data capture (CDC) and historical data integration. The distinction between Inmon's Corporate Information Factory (CIF) and Kimball's dimensional modelling approach highlights different methodologies for building Data Warehouses, with Kimball emphasising quicker implementations through fact tables and conformed dimensions. Howard also touches on various data analytical methods such as OLAP (Online Analytical Processing), ROLAP (Relational OLAP), and MOLAP (Multidimensional OLAP), as well as central components like staging, reference, Master Data, data marts, and operational data stores (ODS). Understanding these concepts is vital for effective Business Intelligence (BI) and data analysis strategies.

Figure 7 Focus on "Load Processing"
Figure 8 Focus on "DW Architecture Concepts"
Figure 9 Focus on "Essential Concepts"
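To make Kimball's fact-and-dimension idea concrete, here is a minimal sketch in Python with pandas; the table and column names are hypothetical. A conformed dimension such as the date dimension below would be shared across data marts so that measures from different fact tables roll up consistently:

```python
import pandas as pd

# Hypothetical conformed date dimension, shared across data marts.
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month": ["2024-01", "2024-01"],
})

# Hypothetical fact table holding measures at the lowest grain.
fact_sales = pd.DataFrame({
    "date_key": [20240101, 20240101, 20240102],
    "product_key": [1, 2, 1],
    "sales_amount": [100.0, 250.0, 75.0],
})

# A typical dimensional query: join fact to dimension, aggregate a measure.
monthly_sales = (
    fact_sales.merge(dim_date, on="date_key")
              .groupby("month")["sales_amount"]
              .sum()
)
print(monthly_sales)
```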

Understanding Data Warehousing Architecture and Big Data

The chapter on Big Data and Data Science highlights the importance of the evolving architecture in this field. It emphasises that while data lakes are prevalent, they have not eliminated the need for centralised Data Warehouses. Recently, the concept of the data lakehouse has emerged, which aims to integrate the functionalities of warehousing with data lakes. This approach standardises data transformation and ensures efficient processing of data from the lake to the warehouse, making it ready for operational use quickly. A key takeaway is the significance of utilising insights from Big Data to enhance and inform Data Warehousing practices.

Figure 10 Big Data and Data Science Essential Components

The Impact of Data Privacy and Digital Decisioning in Organisations

Approximately three to four months ago, a case emerged involving an individual who had been denied credit by an organisation based on a decision made by a machine learning model. Ten years after the decision, the individual took legal action against the organisation, citing the negative impact it had on their life. The organisation faced challenges in providing evidence regarding the rejection since they had not retained sufficient data on the decisions made by their model. Ultimately, the court awarded damages to the individual. This case highlights the importance of incorporating human oversight in high-stakes digital decision-making processes, emphasising the need for better data retention and accountability in algorithms that influence significant outcomes.

The Abate Information Triangle and Data Integration Strategy

The Abate Information Triangle highlights the connections between data, information, knowledge, and wisdom, emphasising the importance of Master Data in validating relevant information. As organisations collect streaming and social networking data, particularly from sources like photographs, it's crucial to ensure that this data pertains to the organisation itself rather than competitors.

The process involves understanding business needs, selecting appropriate data sources, and integrating Master Data to provide context. This integration facilitates data analysis and requires monitoring to minimise error rates before deployment, after which insights may generate further data requests. A governance framework is essential to evaluate the relevance and provenance of the collected data, ensuring it is vetted appropriately to support informed decision-making.

Data Integration and Analysis?

The updated DMBoK Version 2 Revised edition introduces significant changes, particularly emphasising the importance of an integrated data system that consolidates data from various sources, which was not mentioned in the previous version. While Version 1 focuses on providing decision support, Version 2 highlights the integration of data as a crucial element.

The focus on Big Data shifts from posing unknown questions at the outset of analysis to addressing the handling of large volumes of data and the associated statistical analysis within the Data Science paradigm. This change reflects a broader understanding of data complexities and analytical approaches, moving beyond initial concepts of known and unknown data states to encompass various analytical methodologies such as confirmatory and exploratory analytics.

Figure 11 Definitions for Data Warehousing and Business Intelligence and Big Data and Data Science


Figure 12 Alteration of Definitions

Big Data and Data Science in Business Performance Measurement?

The essence of Data Science lies in uncovering answers and insights from various data types. Business Intelligence (BI) focuses on analysing historical data to understand what has happened and why. This involves identifying changes in performance, such as an increase or decrease in sales, and determining their causes.

When evaluating business performance, it's crucial to address five key questions: What happened in each period? Why did it happen? What might happen if current trends continue? What actions should be taken? And what critical information might be missing? These discussions with executives are essential for effectively sharing insights.

Figure 13 Business Drivers

Understanding and Managing Data Flow

The DMBoK Version 2 Revised edition outlines a structured approach to Big Data strategy, encompassing activities such as understanding business needs, establishing environments, selecting data sources, and data acquisition and ingestion. The process includes developing hypotheses, integrating and exploring data, and communicating insights through data storytelling and visualisation.

A critical aspect highlighted is model drift, which occurs when real-world conditions diverge from the original model, potentially diminishing its value and accuracy. This phenomenon necessitates ongoing feature engineering and monitoring of model performance to prevent increased error rates, which can lead to a reduction in the perceived value of data assets—a concept referred to as applying a "haircut" to their valuation. Additionally, the discussion touches on data drift, which involves changes in the statistical properties of the data over time, further impacting model effectiveness.

Figure 14 Inputs, Activities & Deliverables (HOW)
Figure 15 Definition of "Data Drift"
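One common way to monitor the data drift described above is to compare the statistical distribution of incoming feature values against a training-time baseline. A minimal sketch, assuming scipy is available and using a two-sample Kolmogorov-Smirnov test on simulated data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Baseline: feature values observed when the model was trained.
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)

# Production: recent values whose mean has shifted (simulated drift).
production = rng.normal(loc=0.5, scale=1.0, size=5_000)

# The KS test flags a significant difference between the two samples.
statistic, p_value = ks_2samp(baseline, production)
if p_value < 0.01:
    print(f"Data drift detected (KS={statistic:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected")
```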

Responsibilities of the Data Manager in Big Data & Data Science and Data Warehousing & Business Intelligence

The Data Manager is responsible for key areas in Data Warehousing and Big Data, focusing on operations planning, model performance monitoring, and enhancement strategies. In Big Data, metrics such as usage, loading speed, and data lake ingestion time are crucial for evaluating how data, including large videos, is processed. Effective storytelling helps derive insights from data, revealing "aha moments" that can drive valuable business reports. Unlike the fixed structure of Data Warehousing, Big Data operates in a more dynamic environment, employing various analytics to answer critical questions about business performance and decision-making. The progression from data-enabled to knowledge-driven capabilities indicates an organisation's AI maturity, with assessments determining the level of questions that can be addressed with existing data.

Figure 16 Data Manager's Role and Responsibility
Figure 17 Metrics (How Many)
Figure 18 "Information to Insights to Decisions to Action"
Figure 19 DIKW Diagram

Data Vault and its Impact on Enterprise Data Warehousing

Howard highlights the emerging importance of Data Vault in enterprise Data Warehousing, particularly in light of its limited coverage in the DMBoK Version 2 Revised edition. Despite the absence of extensive updates on Data Vault in the textbook, it unexpectedly featured prominently in exam questions. Data Vault, often referred to as Data Warehouse 2.0, has effectively replaced the traditional relational model, addressing key issues such as handling unstructured data and the constraints of the relational model that hindered speed and flexibility in Data Management. Overall, the transition to Data Vault represents a significant evolution in the approach to enterprise Data Warehousing.

Figure 20 "BUT, What about Data Vault?"

Enterprise Data Warehousing and Data Vault Models

The customer journey in Data Management begins with the identification of potential clients as "suspects" when they first visit a website. Once they express interest, they transition to "prospects." Traditional relational models hinder data entry due to strict business rules and cardinality requirements, often leaving essential details, such as customer names or IDs, uncollected. In contrast, the Data Vault model simplifies this process by requiring only a customer code to access and link additional data as the customer progresses.

The Data Vault consists of raw and business data components, facilitating improved data quality and design by allowing for comparisons across multiple data sources in the staging area. This enables effective Master Data modelling and provides a foundation for data quality assessment and trust rules without needing to revert to original sources, marking a significant advancement in enterprise Data Warehousing.

Figure 21 Data Vault Architecture
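The hub-and-satellite pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not Data Vault tooling; the table and column names are hypothetical. The point is that only the business key (the customer code) is required up front, and descriptive attributes are linked later as they become known:

```python
from datetime import datetime, timezone

# Hub: one row per business key; nothing else is mandatory.
hub_customer = []

# Satellite: descriptive attributes, versioned by load timestamp.
sat_customer_details = []

def load_customer(customer_code, **attributes):
    """Record the business key immediately; attach attributes if known."""
    now = datetime.now(timezone.utc)
    if customer_code not in {h["customer_code"] for h in hub_customer}:
        hub_customer.append({"customer_code": customer_code, "load_ts": now})
    if attributes:
        sat_customer_details.append(
            {"customer_code": customer_code, "load_ts": now, **attributes}
        )

# A "suspect" visits the website: only a code is captured.
load_customer("C-1001")
# Later, as a "prospect", their name becomes known and is linked.
load_customer("C-1001", name="A. Suspect", status="prospect")

print(hub_customer)
print(sat_customer_details)
```

Because satellite rows are versioned by load timestamp, the full history of the customer's progression from suspect to prospect is preserved without ever blocking data entry on missing attributes.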

Fourth Paradigm of Data Science

The evolution of scientific paradigms has progressed from the empirical approach, where natural phenomena were simply described, to the theoretical paradigm utilising mathematical models, followed by the computational paradigm that simulated complex systems. The fourth paradigm, termed "e-science" by Jim Gray and Alex Szalay, integrates experimental data, archives, and simulations into a comprehensive Data Warehouse, allowing data scientists to explore and analyse massive datasets. This approach combines computer science, business domain knowledge, and mathematics, leading to the development of Data Science. Moving from multidisciplinary to interdisciplinary and finally to transdisciplinary collaboration, Data Science fosters a holistic understanding of complex systems by merging computational models, algorithms, and extensive metadata with diverse knowledge sources.

Figure 22 Kimball: Dimensional Data Warehousing


Figure 23 Data Science Paradigm


Figure 24 eScience - A Transformed Scientific Method


Figure 25 Fourth Paradigm of Science
Figure 26 "X-Info & Comp-X for Discipline X"


Figure 27 Levels of Discipline Integration



The Key Elements and Process of Data Science

The Data Science paradigm encompasses several key elements, starting with data acquisition, followed by collection, cleaning, integration, and transformation. Once the data is prepared, exploratory analysis is conducted using statistical methods, data visualisation, and hypothesis testing. This involves defining a hypothesis and determining whether it is confirmatory or exploratory, leading to predictions or repeated probes of the hypothesis. Following exploratory analysis, the focus shifts to model building and training, which includes feature engineering and model selection to identify core features. The model is then trained and evaluated for quality, after which it is deployed and monitored for performance drift. Finally, the process culminates in data interpretation, storytelling, and decision-making.

Figure 28 Key Elements to Data Science Paradigm
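As a minimal sketch of this end-to-end flow (assuming scikit-learn is installed; the bundled dataset merely stands in for acquired and cleaned data), the steps of preparation, feature engineering, model selection, training, and evaluation can be chained as follows:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Acquisition: a bundled dataset stands in for collected, cleaned data.
X, y = load_breast_cancer(return_X_y=True)

# Split so that model quality can be evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature engineering (scaling) and model selection in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation before deployment; in production this metric would be
# monitored over time to catch model and data drift.
print(f"Held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```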


Data Management in Data Science?

Data Science heavily relies on effective Data Management, particularly in the context of simulations and experiments involving user behaviour and customer data platforms. The advent of concepts like "bring your own lake" allows platforms like Snowflake to create virtualised views of vast amounts of data.

Traditional statistical methods often fall short when applied to single files, as the sheer volume of information in data lakes can overwhelm these approaches. To effectively analyse this data, it's essential to think of analysis as searching for needles in an ever-expanding haystack, where careful integration, alignment, and aggregation of data are crucial.

Furthermore, successful data analysis requires transdisciplinary integration, necessitating domain knowledge alongside mathematical expertise, as statisticians and mathematicians cannot deliver value without a solid understanding of the context in which they are working.

Figure 29 (Data) Science Needs Data Management


Figure 30 Data Analysis



Data Management and Analysis?

Scientists are facing significant challenges in data delivery, particularly in fields like astronomy, where numerous observatories worldwide contribute vast amounts of data, often reaching terabytes and even petabytes. Traditional file-by-file analysis methods are proving inadequate, with processes like searching a petabyte of data taking up to three years and costing around $1 million. To address these issues, researchers advocate for better Data Management through indexing, similar to Google's approach, which enables efficient search and analysis. The emphasis is shifting from analysing files to utilising databases, highlighting the importance of transitioning from data lakes to Data Warehouses, a concept popularised by recent innovations such as the data lakehouse model introduced by Inmon.

Figure 31 Data Analysis Expanded


Figure 32 Data Delivery: Hitting a Wall
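The case for indexing over file-by-file scanning can be seen in miniature with a toy Python example (the record layout is hypothetical): a linear scan touches every record for each query, while a prebuilt index answers the same lookup almost instantly:

```python
import time

# One million hypothetical observation records keyed by object id.
records = [(f"obj-{i}", i * 0.5) for i in range(1_000_000)]

# File-by-file style: scan every record for each query.
start = time.perf_counter()
result = next(value for key, value in records if key == "obj-999999")
scan_time = time.perf_counter() - start

# Database style: build an index once, then look up directly.
index = dict(records)
start = time.perf_counter()
result = index["obj-999999"]
lookup_time = time.perf_counter() - start

print(f"scan: {scan_time:.4f}s, indexed lookup: {lookup_time:.6f}s")
```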

Intricacies of Data Management and Statistical Analysis

Statistical analysis involves the creation of uniform samples and filtering relevant data subsets, relying on Data Management capabilities such as data models and controlled vocabularies. Ensuring data quality through completeness and addressing bad data are critical for ethical data handling. Business Intelligence (BI) allows for the automation of certain procedures to understand data's "What" and "Why."

Data Science improves hypothesis testing and likelihood calculations through structured storage. Recent developments, like Microsoft enabling R and Python routines to run directly in databases, enhance processing efficiency by executing queries server-side. Additionally, effective Data Management skills are essential for visualisations and building scalable platforms for Big Data, which encompasses capturing, curating, analysing, publishing, and peer reviewing data before providing access.
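As a sketch of the sampling-and-testing workflow described above (assuming pandas and scipy; the data is simulated), the code below filters relevant subsets, draws uniform samples, and runs a two-sample t-test:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# Hypothetical observations with a region attribute and a measure.
df = pd.DataFrame({
    "region": rng.choice(["north", "south"], size=10_000),
    "value": rng.normal(loc=50, scale=10, size=10_000),
})

# Filter relevant subsets, then draw uniform samples of equal size.
north = df[df["region"] == "north"]["value"].sample(500, random_state=7)
south = df[df["region"] == "south"]["value"].sample(500, random_state=7)

# Hypothesis test: do the two regions differ in mean value?
statistic, p_value = ttest_ind(north, south)
print(f"t={statistic:.3f}, p={p_value:.3f}")
```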


Importance and Challenges of Data Management

The costs associated with data platforms are significant, estimated at around $1,000,000, with schema, ontologies, and provenance being the most expensive areas in Data Science. As data is increasingly generated digitally through sensors, IoT devices, and simulators, the need for effective Data Management becomes crucial.

A data catalogue, often referred to by data scientists as a digital data library, plays a vital role in organising these diverse datasets and accompanying research, facilitating access to valuable insights from both small and large data sets. This comprehensive approach aims to enhance the visibility and understanding of data assets, ultimately supporting better organisational decision-making.

Figure 33 Analysis and Databases


Figure 34 Analysis and Databases Expanded
Figure 35 Data Science Paradigm Elements
Figure 36 Data Science Paradigm Elements & Costs
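A catalogue entry need not be elaborate; at minimum it records a dataset's schema, provenance, and discovery tags. A minimal Python sketch, with hypothetical field choices:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """Minimal metadata record for one dataset in a digital data library."""
    name: str
    description: str
    schema: dict            # column name -> type
    provenance: str         # where the data came from
    tags: list = field(default_factory=list)

catalogue = [
    CatalogueEntry(
        name="sensor_readings_2024",
        description="Hourly IoT sensor readings from plant A",
        schema={"sensor_id": "str", "ts": "datetime", "reading": "float"},
        provenance="ingested from MQTT broker, curated nightly",
        tags=["iot", "time-series"],
    ),
]

# Discovery: find datasets by tag, the way a library index is searched.
print([e.name for e in catalogue if "iot" in e.tags])
```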

Data Warehousing and Big Data Challenges

Howard highlights the ongoing relevance of Data Warehousing in the face of the rise of Big Data, emphasising the challenges posed by inconsistent transformations within data lakes, where varying interpretations of data attributes can lead to chaos. Participants stress the importance of discipline in data modelling and cataloguing to ensure a reliable semblance of reality, which has diminished in modern data practices.

The introduction of the lakehouse concept aims to standardise transformations from data lakes through ETL processes, providing a more consistent data view. Concrete examples from astronomy illustrate how a common schema has enabled federated data querying across multiple observatories, while health informatics showcases the use of ontologies to integrate data from various hospitals through shared metadata standards.

A discussion follows on the transition from traditional ETL (Extract, Transform, Load) processes to automated, API-driven integration systems utilising ontologies for data normalisation across various genomic projects. This shift allows for the seamless integration of diverse data sources, enabling real-time data curation and analysis.

Initiatives like the James Webb Space Telescope and the MeerKAT array exemplify the use of standardised ontologies to compile large datasets from global observational efforts, facilitating timely queries and insights without the constraints of weather-related data availability. This advancement marks significant progress in the automated handling and interpretation of complex data in the fields of genomics and astrophysics.
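Ontology-driven normalisation can be pictured as mapping each source's local field names onto shared canonical terms before integration. The sketch below is a toy Python illustration; the source names and mappings are hypothetical:

```python
# Hypothetical per-source mappings onto a shared ontology of terms.
ontology_map = {
    "observatory_a": {"ra_deg": "right_ascension", "dec_deg": "declination"},
    "observatory_b": {"RA": "right_ascension", "DEC": "declination"},
}

def normalise(source, record):
    """Rename a source record's fields to canonical ontology terms."""
    mapping = ontology_map[source]
    return {mapping.get(k, k): v for k, v in record.items()}

# Two observatories, two naming conventions, one federated record shape.
a = normalise("observatory_a", {"ra_deg": 10.68, "dec_deg": 41.27})
b = normalise("observatory_b", {"RA": 83.82, "DEC": -5.39})
print(a, b)  # both now use right_ascension / declination keys
```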


Figure 37 Digital Data Library (Catalogue)


Figure 38 Use of ontologies in Astronomy


Figure 39 Use of ontologies in RNA Structure Genomics

Join our CDMP Data Warehousing, BI, Big Data & Data Science Specialist training to prepare for a Master Pass and upskill your career!

Learn more about our training here:

CDMP Specialist Training
