Blueprints of Data: Concepts of Modern Data Engineering
The following text provides an overview of some of the critical components of data engineering as they relate to machine learning. Understanding the intricacies of structured and unstructured data, along with the systems that store and manage them, is essential for machine learning practitioners. This knowledge enables the design of efficient data pipelines, the selection of appropriate data storage solutions, and the implementation of robust data processing workflows, all of which are vital for deriving valuable insights from data.
The content presented draws upon my notes from Chip Huyen's seminal work, "Designing Machine Learning Systems." This book stands as a comprehensive guide for those aiming to master the art of building machine learning systems with a production-ready mindset. Chip Huyen, a seasoned expert in the field whose insights can be further explored on her LinkedIn profile, delves into the multifaceted decision-making process required to build robust ML systems.
It is with great admiration that I praise this book for its methodical and thorough exploration of machine learning systems design. Huyen's ability to distill complex concepts into actionable knowledge is unparalleled and serves as an invaluable resource for both newcomers and veterans in the field. I highly recommend this book to anyone looking to elevate their understanding of machine learning in a practical and impactful way.
Structured vs. Unstructured Data
In the world of data management, data can broadly be categorized into structured and unstructured data. Structured data refers to any data that adheres to a predefined data model and is organized in a manner that is easily searchable with straightforward queries or other search operations. It is typically tabular data represented by rows and columns, each with a specific data type. Structured data benefits from a schema, an outline or blueprint that defines how the data is logically organized and how records relate to one another. This type of data is often managed in SQL databases.
Unstructured data, on the other hand, is data that does not have an identifiable structure or does not fit neatly into a database. It includes formats like audio, video, and social media postings. This type of data is characterized by the lack of a schema and is typically stored in data lakes or filesystems that can handle the variability and complexity of the data. Unstructured data is often processed and analyzed using more complex methods, such as natural language processing (NLP), machine learning, and big data processing frameworks.
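To make the contrast concrete, here is a minimal Python sketch. The table, columns, and sample values are illustrative assumptions of mine, not examples from the book:

```python
import sqlite3

# Structured data: a predefined schema with typed columns, managed by the engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT NOT NULL
    )"""
)
conn.execute("INSERT INTO users (name, email) VALUES (?, ?)",
             ("Ada", "ada@example.com"))

# The schema makes the data directly searchable with simple queries.
for row in conn.execute("SELECT name, email FROM users"):
    print(row)

# Unstructured data: no schema; stored as-is (e.g., in a data lake or filesystem)
# and interpreted later by NLP, ML, or other downstream processing.
with open("review.txt", "w") as f:
    f.write("Loved the product! Shipping was slow though...")
```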
Data Warehousing and ETL vs. Data Lakes and ELT
Data warehouses and data lakes are two distinct types of data storage that serve different purposes in an organization. A data warehouse is a centralized repository for structured data, optimized for querying and analysis. Data warehouses use an ETL (Extract, Transform, Load) process, where data is extracted from the original source, transformed into the desired format, and then loaded into the warehouse.
Data lakes, in contrast, are designed to store a vast amount of raw, unstructured data. The data here is kept in its native format until it is needed. When the data is utilized, it is then extracted, loaded, and transformed if necessary (ELT). This approach is more flexible when dealing with the variety and volume of big data.
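A minimal sketch of the ETL/ELT difference, using a toy in-memory SQLite store to stand in for the warehouse and the lake (all table and column names are hypothetical):

```python
import sqlite3

raw_rows = [("2024-01-05", "42.50"), ("2024-01-06", "17.00")]  # pretend "extract" step

# ETL: transform *before* loading into the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, amount REAL)")
transformed = [(day, float(amount)) for day, amount in raw_rows]   # transform
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transformed)  # load

# ELT: load the raw data first; transform later, inside the store, when needed.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
lake.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)  # load as-is
total = lake.execute(
    "SELECT SUM(CAST(amount AS REAL)) FROM raw_sales"  # transform at query time
).fetchone()[0]
print(total)  # 59.5
```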
Data Storage Engines – OLTP and OLAP
When discussing data storage engines, we differentiate between systems optimized for OLTP (Online Transaction Processing) and those optimized for OLAP (Online Analytical Processing). OLTP systems are optimized for managing transactional data. They are designed to handle a large number of short online transactions and emphasize fast query processing and data integrity in multi-access environments. OLAP systems, on the other hand, are designed for query-heavy analytical workloads. They are optimized for data reading and are typically used for complex analytical and ad-hoc queries, including aggregations and joins.
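The shape of the queries captures the difference. Here is an illustrative sketch (the orders table and its contents are assumptions for demonstration, and SQLite is used only as a convenient stand-in for both kinds of engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount, status) VALUES (?, ?, ?)",
    [("alice", 30.0, "open"), ("bob", 12.5, "open"), ("alice", 99.9, "shipped")],
)

# OLTP-style workload: short, targeted transactions touching a few rows, low latency.
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (1,))

# OLAP-style workload: scan-heavy analytical query aggregating over many rows.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
):
    print(customer, total)
```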
Transactional Databases and ACID Properties
Transactional databases are a type of database management system that ensures all database transactions are processed reliably. They adhere to ACID properties: Atomicity guarantees that all operations within a work unit are completed successfully; Consistency ensures that the database properly changes states upon a successfully committed transaction; Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially; Durability guarantees that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.
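Atomicity and consistency are easiest to see in a failed transaction. Below is a minimal sketch using Python's sqlite3 module, whose connection context manager commits on success and rolls back on an exception; the accounts table and the simulated failure are my own illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

# Atomicity: both legs of the transfer succeed, or neither does.
try:
    with conn:  # transaction: commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        raise RuntimeError("simulated crash before the credit leg runs")
except RuntimeError:
    pass

# The rollback left balances unchanged, preserving a consistent state.
print(conn.execute("SELECT name, balance FROM accounts").fetchall())
# [('alice', 100.0), ('bob', 50.0)]
```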
Data Sources
Data sources can be classified into various types, such as user input data, system-generated data, internal databases, and third-party data. User input data is directly provided by users, often in the form of feedback, submissions, or interactions. System-generated data is produced by systems and applications as a byproduct of their operations, capturing logs, transactions, and system states. Internal databases are repositories that organizations maintain, containing structured records of operations, customer details, and other business-related information. Third-party data, on the other hand, is sourced from external providers and can include datasets for benchmarking, demographic information, or additional data that can enhance the organization's own data.
Data Serialization Formats
The way data is formatted is crucial for ensuring interoperability between different systems and technologies. Common data serialization formats include JSON, CSV, and Parquet, each serving different needs. JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write, as well as for machines to parse and generate. CSV (Comma-Separated Values) is a simple format that stores tabular data in plain text, where each line of the file is a data record, with each record consisting of one or more fields separated by commas. Parquet is a column-major format that is optimized for working with complex nested data structures and is particularly efficient for queries that process large volumes of data at once.
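The same records can be written in all three formats with a few lines of Python. This sketch assumes pandas and pyarrow are installed for the Parquet step; the file names and records are illustrative:

```python
import csv
import json

import pandas as pd  # assumed available, with pyarrow, for the Parquet step

records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# JSON: human-readable text, row-oriented, allows nested structures.
with open("records.json", "w") as f:
    json.dump(records, f, indent=2)

# CSV: flat tabular plain text, one record per line, fields separated by commas.
with open("records.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(records)

# Parquet: binary, column-major, compressed; efficient for large analytical scans.
pd.DataFrame(records).to_parquet("records.parquet")
```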
Data Storage Engines (Databases)
Data storage engines, commonly referred to as databases, are systems that efficiently store, retrieve, and manage data. They can be optimized for different purposes: OLTP, which handles transaction processing and requires low latency and high availability, and OLAP, which suits analytical processing of complex queries over large datasets. Transactional databases that support the ACID (Atomicity, Consistency, Isolation, Durability) properties ensure that all transactions are processed reliably.
Data Models
Data models define the conceptual structure of data and inform how data is stored, accessed, and manipulated. The relational model is one of the most common data models and is used in relational databases managed by SQL (Structured Query Language), a declarative language for interacting with data. The NoSQL model, which includes document and graph models, is used for more flexible data storage options that don't require a fixed schema. Document models store data in formats like JSON, BSON, or XML, which can accommodate a variety of data types and structures. Graph models are designed to represent and store data in terms of entities and their interrelations, which is particularly useful for social networks, recommendation engines, and other applications where relationships are a focus.
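To see how the choice of model shapes the data, here is a sketch of the same fact, "Alice follows Bob," in each of the three models, using plain Python structures as stand-ins (the entities and field names are hypothetical):

```python
# Relational model: normalized rows in tables with a fixed schema.
users = [(1, "Alice"), (2, "Bob")]   # users(id, name)
follows = [(1, 2)]                   # follows(follower_id, followee_id)

# Document model: self-contained, schema-flexible documents (JSON-like).
user_doc = {
    "id": 1,
    "name": "Alice",
    "follows": [{"id": 2, "name": "Bob"}],  # related data nested inside
}

# Graph model: entities as nodes, relationships as first-class edges.
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
edges = [(1, "FOLLOWS", 2)]

# In the graph model, a relationship query is a direct edge traversal.
followees = [dst for src, rel, dst in edges if src == 1 and rel == "FOLLOWS"]
print([nodes[i]["name"] for i in followees])  # ['Bob']
```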
#DataEngineering #BigData #DataSystems #MachineLearning #DataProcessing #InformationTechnology #DataScience #Analytics #CloudComputing #DataStorage