What is the best way to handle duplicates in batch processing?
Batch processing is a common technique in data engineering in which large volumes of data are processed at fixed intervals rather than as they arrive. One recurring problem it introduces is duplicates: records that share the same key or identifier, either as exact copies or as conflicting versions with different values or attributes. Duplicates degrade the quality, consistency, and accuracy of the data and can lead to errors or inefficiencies in downstream applications and analyses. How can you handle duplicates in batch processing effectively and efficiently? Here are some tips and strategies to consider.
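To make the problem concrete, here is a minimal sketch of one common approach: a single in-memory pass that collapses duplicates by key, keeping the most recent version of each record. It is written in plain Python for illustration; the field names `id` and `updated_at` are assumptions, and a real pipeline would more likely do this inside the batch framework itself (for example, with Spark's `dropDuplicates`).

```python
from typing import Any, Dict, Iterable, List

def dedupe_latest(
    records: Iterable[Dict[str, Any]],
    key: str = "id",
    order_by: str = "updated_at",
) -> List[Dict[str, Any]]:
    """Collapse duplicates by `key`, keeping the record with the
    highest `order_by` value (e.g., the latest ISO-8601 timestamp)."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for rec in records:
        k = rec[key]
        # Store this record only if the key is new, or if this
        # version is newer than the one already stored.
        if k not in latest or rec[order_by] > latest[k][order_by]:
            latest[k] = rec
    return list(latest.values())

# Hypothetical batch with a key collision carrying conflicting attributes.
batch = [
    {"id": 1, "name": "Ada",    "updated_at": "2024-05-01"},
    {"id": 1, "name": "Ada L.", "updated_at": "2024-05-02"},
    {"id": 2, "name": "Grace",  "updated_at": "2024-05-01"},
]
print(dedupe_latest(batch))
# Keeps id=1 "Ada L." (the newer version) and id=2 "Grace".
```

This "keep the latest" rule is only one possible resolution policy; depending on the use case, you might instead keep the first record seen, merge attributes, or flag conflicts for review.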