Blueprints of Data: Concepts of Modern Data Engineering
The following text provides an overview of some of the critical components of data engineering as they relate to machine learning. Understanding the intricacies of structured and unstructured data, along with the systems that store and manage them, is essential for machine learning practitioners. This knowledge enables the design of efficient data pipelines, the selection of appropriate data storage solutions, and the implementation of robust data processing workflows, all of which are vital for deriving valuable insights from data.
The content presented draws upon my notes from Chip Huyen's seminal work, "Designing Machine Learning Systems." This book stands as a comprehensive guide for those aiming to master the art of building machine learning systems with a production-ready mindset. Chip Huyen, a seasoned expert in the field whose insights can be further explored on her LinkedIn profile, delves into the multifaceted decision-making process required to build robust ML systems.
It is with great admiration that I praise this book for its methodical and thorough exploration of machine learning systems design. Huyen's ability to distill complex concepts into actionable knowledge is unparalleled and serves as an invaluable resource for both newcomers and veterans in the field. I highly recommend this book to anyone looking to elevate their understanding of machine learning in a practical and impactful way.
Structured vs. Unstructured Data
In the world of data management, data can broadly be categorized into structured and unstructured data. Structured data refers to any data that adheres to a predefined data model and is organized in a manner that is easily searchable with straightforward queries or other search operations. It is typically tabular data represented by rows and columns, each with a specific data type. Structured data benefits from a schema, an outline or blueprint that defines how the data is logically organized and how records relate to one another. This type of data is often managed in SQL databases.
Unstructured data, on the other hand, is data that does not have an identifiable structure or does not fit neatly into a database. It includes formats like audio, video, and social media postings. This type of data is characterized by the lack of a schema and is typically stored in data lakes or filesystems that can handle the variability and complexity of the data. Unstructured data is often processed and analyzed using more complex methods, such as natural language processing (NLP), machine learning, and big data processing frameworks.
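To make the contrast concrete, here is a minimal Python sketch. The table, columns, and sample values are illustrative assumptions of mine, not examples from the book:

```python
import sqlite3

# Structured data: a predefined schema with typed columns, managed by the engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT NOT NULL
    )"""
)
conn.execute("INSERT INTO users (name, email) VALUES (?, ?)",
             ("Ada", "ada@example.com"))

# The schema makes the data directly searchable with simple queries.
for row in conn.execute("SELECT name, email FROM users"):
    print(row)

# Unstructured data: no schema; stored as-is (e.g., in a data lake or filesystem)
# and interpreted later by NLP, ML, or other downstream processing.
with open("review.txt", "w") as f:
    f.write("Loved the product! Shipping was slow though...")
```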
Data Warehousing and ETL vs. Data Lakes and ELT
Data warehouses and data lakes are two distinct types of data storage that serve different purposes in an organization. A data warehouse is a centralized repository for structured data, optimized for querying and analysis. Data warehouses use an ETL (Extract, Transform, Load) process, where data is extracted from the original source, transformed into the desired format, and then loaded into the warehouse.
Data lakes, in contrast, are designed to store a vast amount of raw, unstructured data. The data here is kept in its native format until it is needed. When the data is utilized, it is then extracted, loaded, and transformed if necessary (ELT). This approach is more flexible when dealing with the variety and volume of big data.
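A minimal sketch of the ETL/ELT difference, using a toy in-memory SQLite store to stand in for the warehouse and the lake (all table and column names are hypothetical):

```python
import sqlite3

raw_rows = [("2024-01-05", "42.50"), ("2024-01-06", "17.00")]  # pretend "extract" step

# ETL: transform *before* loading into the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, amount REAL)")
transformed = [(day, float(amount)) for day, amount in raw_rows]   # transform
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transformed)  # load

# ELT: load the raw data first; transform later, inside the store, when needed.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
lake.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)  # load as-is
total = lake.execute(
    "SELECT SUM(CAST(amount AS REAL)) FROM raw_sales"  # transform at query time
).fetchone()[0]
print(total)  # 59.5
```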
Data Storage Engines – OLTP and OLAP
When discussing data storage engines, we differentiate between systems optimized for OLTP (Online Transaction Processing) and those optimized for OLAP (Online Analytical Processing). OLTP systems are optimized for managing transactional data. They are designed to handle a large number of short online transactions and emphasize fast query processing and data integrity in multi-access environments. OLAP systems, on the other hand, are designed for query-heavy analytical workloads. They are optimized for data reading and are typically used for complex analytical and ad-hoc queries, including aggregations and joins.
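The shape of the queries captures the difference. Here is an illustrative sketch (the orders table and its contents are assumptions for demonstration, and SQLite is used only as a convenient stand-in for both kinds of engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount, status) VALUES (?, ?, ?)",
    [("alice", 30.0, "open"), ("bob", 12.5, "open"), ("alice", 99.9, "shipped")],
)

# OLTP-style workload: short, targeted transactions touching a few rows, low latency.
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (1,))

# OLAP-style workload: scan-heavy analytical query aggregating over many rows.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
):
    print(customer, total)
```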
Transactional Databases and ACID Properties
Transactional databases are a type of database management system that ensures all database transactions are processed reliably. They adhere to ACID properties: Atomicity guarantees that all operations within a work unit are completed successfully; Consistency ensures that the database properly changes states upon a successfully committed transaction; Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially; Durability guarantees that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.
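Atomicity and consistency are easiest to see in a failed transaction. Below is a minimal sketch using Python's sqlite3 module, whose connection context manager commits on success and rolls back on an exception; the accounts table and the simulated failure are my own illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

# Atomicity: both legs of the transfer succeed, or neither does.
try:
    with conn:  # transaction: commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        raise RuntimeError("simulated crash before the credit leg runs")
except RuntimeError:
    pass

# The rollback left balances unchanged, preserving a consistent state.
print(conn.execute("SELECT name, balance FROM accounts").fetchall())
# [('alice', 100.0), ('bob', 50.0)]
```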
Data Sources
Data sources can be classified into various types, such as user input data, system-generated data, internal databases, and third-party data. User input data is directly provided by users, often in the form of feedback, submissions, or interactions. System-generated data is produced by systems and applications as a byproduct of their operations, capturing logs, transactions, and system states. Internal databases are repositories that organizations maintain, containing structured records of operations, customer details, and other business-related information. Third-party data, on the other hand, is sourced from external providers and can include datasets for benchmarking, demographic information, or additional data that can enhance the organization's own data.
Data Serialization Formats
The way data is formatted is crucial for ensuring interoperability between different systems and technologies. Common data serialization formats include JSON, CSV, and Parquet, each serving different needs. JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write, as well as for machines to parse and generate. CSV (Comma-Separated Values) is a simple format that stores tabular data in plain text, where each line of the file is a data record, with each record consisting of one or more fields separated by commas. Parquet is a column-major format that is optimized for working with complex nested data structures and is particularly efficient for queries that process large volumes of data at once.
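The same records can be written in all three formats with a few lines of Python. This sketch assumes pandas and pyarrow are installed for the Parquet step; the file names and records are illustrative:

```python
import csv
import json

import pandas as pd  # assumed available, with pyarrow, for the Parquet step

records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# JSON: human-readable text, row-oriented, allows nested structures.
with open("records.json", "w") as f:
    json.dump(records, f, indent=2)

# CSV: flat tabular plain text, one record per line, fields separated by commas.
with open("records.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(records)

# Parquet: binary, column-major, compressed; efficient for large analytical scans.
pd.DataFrame(records).to_parquet("records.parquet")
```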
Data Storage Engines (Databases)
Data storage engines, commonly referred to as databases, are systems that efficiently store, retrieve, and manage data. They can be optimized for different purposes: OLTP, which handles transaction processing and requires low latency and high availability, and OLAP, which suits analytical processing of complex queries over large datasets. Transactional databases that support the ACID (Atomicity, Consistency, Isolation, Durability) properties ensure that all transactions are processed reliably.
Data Models
Data models define the conceptual structure of data and inform how data is stored, accessed, and manipulated. The relational model is one of the most common data models and is used in relational databases managed by SQL (Structured Query Language), a declarative language for interacting with data. The NoSQL model, which includes document and graph models, is used for more flexible data storage options that don't require a fixed schema. Document models store data in formats like JSON, BSON, or XML, which can accommodate a variety of data types and structures. Graph models are designed to represent and store data in terms of entities and their interrelations, which is particularly useful for social networks, recommendation engines, and other applications where relationships are a focus.
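To see how the choice of model shapes the data, here is a sketch of the same fact, "Alice follows Bob," in each of the three models, using plain Python structures as stand-ins (the entities and field names are hypothetical):

```python
# Relational model: normalized rows in tables with a fixed schema.
users = [(1, "Alice"), (2, "Bob")]   # users(id, name)
follows = [(1, 2)]                   # follows(follower_id, followee_id)

# Document model: self-contained, schema-flexible documents (JSON-like).
user_doc = {
    "id": 1,
    "name": "Alice",
    "follows": [{"id": 2, "name": "Bob"}],  # related data nested inside
}

# Graph model: entities as nodes, relationships as first-class edges.
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
edges = [(1, "FOLLOWS", 2)]

# In the graph model, a relationship query is a direct edge traversal.
followees = [dst for src, rel, dst in edges if src == 1 and rel == "FOLLOWS"]
print([nodes[i]["name"] for i in followees])  # ['Bob']
```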
#DataEngineering #BigData #DataSystems #MachineLearning #DataProcessing #InformationTechnology #DataScience #Analytics #CloudComputing #DataStorage