How do you handle complex and unstructured data sources when ingesting data into your data lake?
Data lakes are repositories that store raw data, structured or not, for a variety of analytical purposes. Ingesting data into a data lake can be challenging, however, especially when the sources are complex and diverse. In this article, we will look at common methods and best practices for data lake ingestion and how they can help you optimize your data pipeline and analytics.
- Batch processing: Use batch ingestion for stable data sources. Tools like Apache Spark or AWS Glue can help you schedule regular data loads, which makes batch processing ideal for historical or transactional data.
- Real-time streaming: Stream ingestion is a good fit for dynamic data sources that need fast processing. Tools like Apache Kafka or AWS Kinesis can efficiently load real-time data such as sensor readings or social media feeds.
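The batch pattern above usually means copying files from a landing zone into a partitioned area of the lake on a schedule. Spark or Glue would do this at scale; as a minimal standard-library sketch of the same idea (the function name, directory layout, and `ingest_date=` partition convention here are illustrative assumptions, not from any particular tool):

```python
import csv
from datetime import date
from pathlib import Path


def batch_ingest(source_dir: Path, lake_dir: Path, run_date: date) -> int:
    """Copy every CSV file from a landing zone into a date-partitioned
    folder of the lake, returning the number of rows ingested."""
    # Hive-style partition folder, e.g. lake/ingest_date=2024-01-01/
    partition = lake_dir / f"ingest_date={run_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)

    rows = 0
    for src in sorted(source_dir.glob("*.csv")):
        with src.open() as fin, (partition / src.name).open("w", newline="") as fout:
            writer = csv.writer(fout)
            for row in csv.reader(fin):
                writer.writerow(row)
                rows += 1
    return rows
```

Partitioning by ingestion date keeps each scheduled run idempotent and makes it easy for downstream queries to prune to the load they need.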
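For streaming ingestion, a common approach (used by Spark Structured Streaming and Kinesis Firehose alike) is micro-batching: buffer incoming events and flush them to the lake once the buffer fills or the stream ends. A minimal sketch of that pattern, with an in-memory iterable standing in for a real Kafka or Kinesis consumer (the names `stream_ingest` and `sink` are illustrative assumptions):

```python
from typing import Callable, Iterable, List


def stream_ingest(events: Iterable, sink: Callable[[List], None],
                  batch_size: int = 3) -> int:
    """Micro-batch a stream: buffer events and flush them to the sink
    whenever the buffer reaches batch_size, plus once at the end.
    Returns the total number of events flushed."""
    buffer: List = []
    flushed = 0
    for event in events:
        buffer.append(event)
        if len(buffer) >= batch_size:
            sink(list(buffer))   # hand a snapshot to the writer
            flushed += len(buffer)
            buffer.clear()
    if buffer:                   # drain the final partial batch
        sink(list(buffer))
        flushed += len(buffer)
    return flushed
```

In production the `sink` would write Parquet files or call a lake API, and a real consumer would also flush on a timeout so slow streams do not sit in the buffer indefinitely.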