Basics of Data Engineering

In Chapter 2, Data Engineering Fundamentals, of the book Designing Machine Learning Systems, author Chip Huyen discusses the various aspects of data engineering.

For data scientists, it is vitally important to know these basics: it gives them a holistic view of the different moving parts and also allows them to have better conversations with their own or their clients' data engineers.

We will start with the different types of data sources with which ML systems can work.

Data Sources

1) User Input Data - This is the data input by users, which ML models use to make predictions. It can be in the form of images, text, videos, files, etc. Users may or may not give input in the correct format, so it requires checking and processing. Also, users tend to be impatient, so we need a system that can process their input at a brisk speed.
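To make the checking step concrete, here is a minimal validation sketch (the field name and length limit are illustrative, not from the book):

```python
def validate_user_input(payload: dict) -> dict:
    """Check and normalize a user-submitted record before it reaches the model."""
    if "text" not in payload or not isinstance(payload["text"], str):
        raise ValueError("missing or malformed 'text' field")
    cleaned = payload["text"].strip()
    if not cleaned:
        raise ValueError("'text' must not be empty")
    return {"text": cleaned[:10_000]}  # cap the length so one huge input can't slow us down
```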

2) System Generated Data - This is the data generated by the different components of your system, e.g. logs and model outputs. Logs are mainly used for debugging and improving the application. Logs grow very quickly, and important information can get lost because of this. To overcome it, specialized tools like Logstash can be used. In addition, we can keep logs only for as long as they are useful and discard them after that.

3) Internal Databases - These are generated by the various services and enterprise applications in a company. They can store information like inventory, customer relationships, etc. This data can be used directly by an ML model or by one of its components.

4) Third Party Data - First-party data is the data your company collects about its own users. Second-party data is the data another company collects about its users and makes available to you for a price. Third-party data companies collect data even on users who aren't their customers. In today's digitalized world it has become easier to collect data about users, but the collection has to comply with privacy laws.

Now we will look into the different data formats available for storing data. The process of converting a data structure into a state that can be stored or transmitted and reconstructed later is called Data Serialization.

Data Formats

1) JSON - JavaScript Object Notation. It is language independent and human readable. It uses a key-value pair format that is simple but powerful, and it can also store data hierarchically. However, JSON is a text format and takes up a lot of space.
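For instance, here is a minimal sketch (not from the book) of serializing a hierarchical key-value record to JSON and back in Python:

```python
import json

# A nested (hierarchical) record expressed as key-value pairs
user = {
    "id": 42,
    "name": "Ada",
    "orders": [
        {"item": "book", "price": 9.99},
        {"item": "pen", "price": 1.5},
    ],
}

serialized = json.dumps(user)      # Python object -> JSON text
restored = json.loads(serialized)  # JSON text -> Python object
assert restored == user
```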

2) Row-Major and Column-Major Formats - In a row-major format like CSV, the consecutive elements of a row are stored next to each other. A column-major format like Parquet stores the consecutive elements of a column next to each other.

Computers process sequential data faster, so row-major storage is faster when we access whole rows at a time. On the other side, column-major formats allow flexible column-based reads, which makes them faster when we only want to read a few columns. Row-major formats are better when we have to write a lot of data.
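A small sketch of the trade-off using pandas (the file names are hypothetical, and Parquet support assumes pyarrow or fastparquet is installed):

```python
import pandas as pd

df = pd.DataFrame({"ride_id": range(1_000), "price": 10.0, "city": "NYC"})
df.to_csv("rides.csv", index=False)
df.to_parquet("rides.parquet", index=False)

# Column-major: Parquet lets us read only the columns we care about.
prices = pd.read_parquet("rides.parquet", columns=["price"])

# Row-major: CSV is easy to append to row by row, which suits write-heavy workloads.
pd.DataFrame([{"ride_id": 1_000, "price": 12.5, "city": "NYC"}]).to_csv(
    "rides.csv", mode="a", header=False, index=False
)
```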

3) Text and Binary Formats - Text files are plain text files that humans can read, whereas binary files are made up of 0s and 1s and are meant for programs that can interpret the raw bytes.

Moving on, let's discuss Data Models. A data model tells us how data is represented. We will look into the two most commonly used models - relational and NoSQL.

Data Models

1) Relational Model - Data is organized into relations, and each relation is a set of tuples. A table is an example of a relation, and each row of the table makes up a tuple. Relations are unordered: if we change the order of rows or columns, the relation is still the same. In relational models, data is usually kept in a normalized format, which reduces data redundancy and enhances data integrity.

SQL is the language used for fetching the desired data from these relational databases. It is a declarative language: we specify the output we want and the system figures out the steps needed to produce it. When writing SQL queries, one needs to gain expertise in writing optimized queries that keep the system fast; otherwise it might take too much time to get the desired data.
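As a quick illustration of the declarative style, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER PRIMARY KEY, city TEXT, price REAL)")
conn.executemany(
    "INSERT INTO rides (city, price) VALUES (?, ?)",
    [("NYC", 12.5), ("NYC", 9.0), ("SF", 15.0)],
)

# Declarative: we state WHAT we want (average price per city),
# not HOW to scan, group, or aggregate the rows.
for city, avg_price in conn.execute("SELECT city, AVG(price) FROM rides GROUP BY city"):
    print(city, avg_price)
```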

2) NoSQL - It stands for "not only SQL" and is used to store semi-structured and unstructured data. There are two major types - the document model and the graph model.

A) Document Model - A document is a single continuous string encoded in JSON, XML, or a binary format like BSON. All documents in a collection are assumed to be encoded in the same format, and each document has a unique key that is used to retrieve that particular document. A schema is not enforced, so different documents in the same collection can have different schemas.
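A tiny in-memory sketch of the idea (the keys and fields are illustrative only):

```python
# Two "documents" in the same collection with different schemas,
# each retrievable by its unique key.
collection = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Bob", "phone": "+1-555-0100", "tags": ["beta"]},
}

doc = collection["user:2"]                   # retrieval by unique key
print(doc.get("email", "no email on file"))  # schemas may differ per document
```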

B) Graph Model - The graph model is built around the concept of a graph. A graph consists of nodes and edges, where an edge represents the relationship between two nodes. Social networking sites generally use this type of data model.
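A minimal sketch of a social graph as an adjacency list (purely illustrative):

```python
# Nodes are users; edges are "follows" relationships.
follows = {
    "ada": {"bob", "carol"},
    "bob": {"carol"},
    "carol": set(),
}

# Relationship-first query: which accounts that Ada follows also follow Carol?
print({u for u in follows["ada"] if "carol" in follows[u]})
```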



Now let's discuss structured vs. unstructured data.

Structured Data - It follows a predefined data model or schema. The disadvantage is that you are committed to that predefined schema, and changing it is cumbersome because the change has to be applied retrospectively to all the existing data. For example, if you've never kept your users' email addresses before but now you do, you have to retrospectively add email information for all previous users. Business requirements change over time, and this can be tricky with a predefined schema. Structured data is typically stored in data warehouses.

Unstructured Data - It doesn't adhere to a predefined schema, though it might still contain intrinsic patterns. It can handle data from any source, and there is no need to worry about changes in schema. Unstructured data is typically stored in data lakes.

Next topic of discussion is Data Storage Engines and Processing - Storage engines, a.k.a. databases, determine how data is stored and retrieved on machines. They are typically optimized for either transactional or analytical processing.

1) Transactional - A transaction refers to any kind of action happening online, for example watching a YouTube video or ordering food from an app. Transactions are inserted as they are generated, occasionally updated when something changes, and deleted when they are no longer needed. This type of processing is known as OnLine Transaction Processing (OLTP).

Transactions need to be processed fast because users are involved. Transactional databases tend to be ACID compliant (Atomicity, Consistency, Isolation, and Durability). ACID is not a mandate, though, since it can sometimes be restrictive. These databases are often row-major.
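A minimal sketch of what atomicity buys us, again using sqlite3 (the accounts table is made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ada", 100.0), ("bob", 50.0)])
conn.commit()

try:
    # Both updates succeed together or not at all (atomicity).
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'ada'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()
except sqlite3.Error:
    conn.rollback()  # undo the partial transfer if anything fails
```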

2) Analytical - If we want to look at the data from different viewpoints, we use analytical databases. They are proficient at OnLine Analytical Processing (OLAP): they can efficiently aggregate the values of a column across many rows, which makes them faster for analytics.

However, today we have databases that can handle both.

Let's look into the favourite part of our fellow data engineers - ETL: Extract, Transform, and Load.

ETL - It is the process of extracting data from multiple sources, transforming it into the desired format, and then loading it into databases or data warehouses.

Extract - Data is pulled from multiple sources in different formats. Data validation is an important step here.

Transform - The data is processed here: sources are joined and cleaned, data is standardized, and other operations can be performed as well.

Load - This step decides how, and how often, the data is loaded into the destination, which could be a file, a database, or a data warehouse.
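Putting the three steps together, here is a minimal end-to-end sketch (the file, table, and column names are hypothetical):

```python
import sqlite3
import pandas as pd

# Extract: pull raw records from a CSV export.
raw = pd.read_csv("raw_orders.csv")

# Transform: validate, clean, and standardize into the shape we want to keep.
clean = raw.dropna(subset=["order_id", "amount"])
clean["amount"] = clean["amount"].astype(float)
clean["country"] = clean["country"].str.upper()

# Load: write the transformed table into the destination database.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```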

If we are using a data lake, we can go for an ELT (extract, load, transform) process instead, where the raw data is loaded first and transformed later.

In ML systems, a lot of components need to talk to each other, so we will learn about the modes of dataflow in the next section.

There are three main modes of data flow:

A) Data passing through databases.

B) Data passing through services using requests, such as those provided by REST and RPC APIs (e.g. POST/GET requests).

C) Data passing through a real-time transport like Apache Kafka or Amazon Kinesis.

A) Data Passing Through Databases - This is the easiest way: process A writes data to a database, and process B picks the data up from there. It is not very effective, though, as reading from and writing to a database can make the system slow.

B) Data Passing Through Services - Here data is sent between processes via the network connecting them. It is request-driven, since processes communicate through requests. The processes can also be distributed across organizations. When services are developed, tested, and maintained independently, this gives us a microservices architecture.

Consider a ride-sharing example in which three services talk to each other via requests. They perform the following tasks:



Driver management service - Predicts the number of drivers in an area in the next minute

Ride management service - Predicts the number of rides in an area in the next minute

Price optimization service - Predicts the optimal price for each ride


The most common styles of request used for passing data through the network are REST and RPC.
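For example, a minimal sketch of one service calling another over REST with the requests library (the endpoint and payload are hypothetical):

```python
import requests

# Hypothetical endpoint and payload for the price optimization service.
resp = requests.post(
    "http://price-optimization.internal/v1/price",
    json={"ride_id": 123, "city": "NYC", "eta_minutes": 7},
    timeout=2,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"ride_id": 123, "optimal_price": 14.2}
```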

C) Data Passing Through a Real-Time Transport - Here we add a broker, which acts as a central point connected to all the processes. It stores the output of each process, and when some other process needs that data, the broker provides it. These brokers mainly use in-memory storage, as it is faster than databases. This style of dataflow is event-driven.
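A minimal sketch with the kafka-python client, assuming a broker at localhost:9092 and a made-up ride_requests topic:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: one process publishes its output as an event to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("ride_requests", {"ride_id": 123, "city": "NYC"})
producer.flush()

# Consumer side: any interested process reads the events from the broker.
consumer = KafkaConsumer(
    "ride_requests",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for event in consumer:
    print(event.value)
    break
```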



The last section of the chapter is about the difference between two types of processing.

1) Batch Processing - The data is processed in the form of batch jobs that run periodically, with a large period - for example, once a day. Distributed systems like MapReduce are efficient at this. In ML lingo, batch processing gives us batch features, which are static in nature as they don't change often.

2) Stream Processing - This is computation on streaming data. Jobs are also run periodically, but the periods are very small - for example, every 5 minutes. It has low latency, as we can process data as soon as it is generated. Streaming gives us streaming features, which are dynamic as they change continuously. Example - the number of ride requests in the last 5 minutes.
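To contrast the two kinds of features, here is a minimal sketch (the feature definitions are illustrative only):

```python
from collections import deque
from datetime import datetime, timedelta

# Batch feature (static): recomputed over the whole history, e.g. once a day.
def avg_rides_per_day(daily_counts: list) -> float:
    return sum(daily_counts) / len(daily_counts)

# Streaming feature (dynamic): rides seen in the last 5 minutes,
# updated as each event arrives.
window = deque()  # (timestamp, ride_id) pairs

def on_ride_event(ts: datetime, ride_id: int) -> int:
    window.append((ts, ride_id))
    cutoff = ts - timedelta(minutes=5)
    while window and window[0][0] < cutoff:
        window.popleft()
    return len(window)  # current value of the streaming feature
```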

These are the things we need to consider in the data engineering part of our ML systems.





