Data Integration in the Era of Big Data, Cloud Storage and Data Analytics
The zettabyte era is officially here. More data has been generated in the last two years than in all of prior history. One zettabyte is equal to a trillion gigabytes. The digitalization of society, social media, mobile devices, satellites, cloud computing, falling technology costs, growing data science expertise and new technologies such as the Internet of Things (IoT) ensure that data will keep growing exponentially. Integrating all of this data effectively, and doing so quickly, is crucial for its use and exploitation.
This article describes what data integration is, which data integration methods exist and how integration processes are performed, how data integration works in business intelligence, what role data quality plays in data integration, and what impact big data and cloud computing have on data integration.
What is data integration?
Many companies store large amounts of data on different platforms such as ERP, CRM, legacy and other information systems. Because the data is distributed, businesses cannot interpret and analyze it in an integral manner. By integrating data in a single place such as a data warehouse, it becomes a business asset that is available for operations management, decision making, reporting, planning, forecasting and more. Data integration is the extracting, transforming and loading (ETL) of data from different sources into a form that can be used in business operations or consumed by analytical tools.
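As a rough illustration of these three steps, the sketch below extracts rows from a hypothetical CSV export of a source system, applies a simple transformation and loads the result into a SQLite table standing in for the data warehouse. The file name, column names and table name are assumptions made for the example, not part of any specific tool.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source system's CSV export."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: harmonize formats and drop obviously invalid rows."""
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):          # skip incomplete records
            continue
        cleaned.append({
            "customer_id": row["customer_id"].strip(),
            "country": row.get("country", "").strip().upper(),
            "revenue": float(row.get("revenue", 0) or 0),
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the harmonized rows into a warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales
                   (customer_id TEXT, country TEXT, revenue REAL)""")
    con.executemany("INSERT INTO sales VALUES (:customer_id, :country, :revenue)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("crm_export.csv")))
```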
Data integration comes with several implicit benefits. Because the data must be harmonized, data quality rules are applied to ensure that it is complete, error-free and reliable, which improves data quality and integrity. It also reduces data silos by consolidating the data into one repository, improving the ability to analyze it and support business decision-making. Data integration gives companies access to valuable insights, for example to understand their customers and improve customer experiences.
Data and analytics build off of each other to deliver a deeper understanding of the business. Analytics is used for the discovery, interpretation, and communication of meaningful patterns in data. These insights are used to make strategic, tactical and operational decisions and improve business performance. Artificial intelligence (AI) and machine learning (ML) are replacing standard BI reporting; they extract real-time patterns from large sets of data and make predictions based on that information, without human intervention.
Methods of data integration
Data integration can be performed in different ways. An organization must have a good understanding of the different data integration methods and approaches in order to choose the one that fits best.
The most common data integration methods can be grouped into five categories:
Retrieving data from different sources can be challenging, especially when the quality of the data must be high. Choosing the correct integration method depends on the type of business, its processes, the applications the organization uses, and the specific data needs of the company. Organizations may require data migration and integration tools that range from simple to very complex.
Engineering the ETL process
The exact steps of how data is processed and transferred between source systems and target systems may differ from one method or tool to the next, but the end result is always the same.
Examples of how data is processed between source systems and target systems are:
Independent of the method used, the goal is to engineer the ETL process in such a way that the data pipelines and analytics provide business value to the company. Setting up the data pipelines can be challenging and relies on engineering best practices. Examples of problems in the engineering process are the lack of a proper flow of business information, loss of data during the ETL process, and difficulties in obtaining test data that is a correct representation of the business processes.
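One simple safeguard against silent data loss is to reconcile record counts between source and target after each load. The sketch below reuses the sales table and warehouse.db assumed in the earlier example; it illustrates the idea only and is not a complete data quality framework.

```python
import sqlite3

def reconcile(source_rows, db_path="warehouse.db", table="sales"):
    """Compare the number of rows extracted with the number of rows loaded."""
    con = sqlite3.connect(db_path)
    loaded = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    con.close()
    if loaded != len(source_rows):
        # In a real pipeline this would alert the data engineering team.
        raise RuntimeError(
            f"Possible data loss: extracted {len(source_rows)} rows, loaded {loaded}"
        )
    return loaded
```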
Batch ETL versus streaming ETL
In traditional data environments, ETL tools collect and extract data in batches, usually based on a schedule. That data is then transformed and loaded into a database or a data warehouse. Examples of processes where batch ETL works well are salary reporting and tax calculations.
Nowadays, however, businesses cannot wait hours or days for applications to handle data in batches. They must act on new data in real time, as the data is generated. These real-time applications require streaming data integration: the continuous, real-time integration of data sources into data stores and analytics tools. Examples of processes that require real-time data are fraud detection, payment processing and the forecasting of future power consumption.
The shift from batch processing to information being generated in real time has had profound implications for the underlying architecture, which has moved to a model based on distributed message queues and stream processing. Stream processing is the real-time processing of continuous streams of data in motion. In addition, streaming ETL pipelines are emerging that are capable of handling continuous streams of data in real time.
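The contrast with the batch sketch above can be made concrete with a minimal streaming loop: instead of loading a finished file, the pipeline transforms and forwards each event as it arrives. The event source below is simulated with a Python generator; in practice it would be a consumer of a message queue such as Kafka or a cloud equivalent, and the field names and threshold are assumptions for the example.

```python
import json
import random
import time

def event_stream():
    """Simulated message-queue consumer yielding payment events continuously."""
    while True:
        yield json.dumps({"account": random.randint(1, 999),
                          "amount": round(random.uniform(1, 5000), 2)})
        time.sleep(0.1)

def flag_suspicious(event):
    """Transform step: enrich each event with a simple fraud indicator."""
    event["suspicious"] = event["amount"] > 4000
    return event

def run(max_events=50):
    for raw in event_stream():
        event = flag_suspicious(json.loads(raw))
        # Load step: in a real pipeline this would write to a store or alerting system.
        print(event)
        max_events -= 1
        if max_events == 0:
            break

if __name__ == "__main__":
    run()
```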
Data integration for business intelligence
Business intelligence (BI) is one of the most significant data integration use cases. A powerful BI solution incorporates everything from reporting to predictive analytics and operations management. A typical BI architecture is displayed in the figure below.
Figure 1: Business Intelligence Architecture
In this architecture the data warehouse is populated from the various source systems through the three steps of the ETL process: extract, transform and load.
After the data warehouse is populated, part of the data is loaded into a data mart. A data mart is a collection of data, similar to a data warehouse, but usually smaller and used for a specific purpose. It is a subset of the data warehouse and is usually oriented to a certain business line or team, for example sales or finance. The data mart holds the focused data that is used in the reporting tool and consumed by the end user.
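Continuing the earlier warehouse sketch, a data mart can be as simple as a focused extract of the warehouse table. The sales_mart table and the aggregation below are illustrative assumptions, not part of any particular BI product.

```python
import sqlite3

def build_sales_mart(db_path="warehouse.db"):
    """Create a small, sales-oriented subset of the warehouse for the sales team."""
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS sales_mart")
    con.execute("""CREATE TABLE sales_mart AS
                   SELECT country, SUM(revenue) AS total_revenue
                   FROM sales
                   GROUP BY country""")
    con.commit()
    con.close()

if __name__ == "__main__":
    build_sales_mart()
```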
Quality in data integration
The primary objective of data integration is to provide consolidated data sets that are clean and consistent and meet the information needs of the different end users in an organization. Data integration depends on data architecture standards, because the data architecture describes the data structures used by a business and its applications, and because it sets data standards, including quality standards, for all data systems.
Data cleansing is the process of preparing data for use or analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated or improperly formatted. It is an important activity in the transformation step of the ETL process and includes the identification, removal and/or correction of invalid data from the source systems. The goal of data cleansing is to improve data quality and utility by catching and correcting errors before the data is transferred to a target database or data warehouse. Data integrity is the overall accuracy, completeness, and consistency of data and defines the criteria for data quality.
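The sketch below shows a few typical cleansing rules applied during the transformation step, here with pandas; the column names and format choices are assumptions for the example rather than a prescribed standard.

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Apply simple cleansing rules before loading into the warehouse."""
    df = df.drop_duplicates()                              # remove duplicated records
    df = df.dropna(subset=["customer_id"])                 # drop rows missing a key field
    df["country"] = df["country"].str.strip().str.upper()  # harmonize formatting
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df = df.dropna(subset=["order_date"])                  # drop improperly formatted dates
    return df

# Example usage with a hypothetical source extract:
# cleaned = cleanse(pd.read_csv("crm_export.csv"))
```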
There are five criteria to measure data quality:
Data integrity also refers to data security and data safety. Data security concerns the protection of data against unauthorized use and data safety concerns protecting data against loss by ensuring safe storage and making regular backups of the data. Companies must ensure that all data migrations are handled in accordance with the company's security and safety policies and with regulatory requirements.
Handling big data with distributed computing
Stored data was projected to reach 44 ZB by 2020, and the volume of incoming data will only increase. Big data is a term that refers to the collection and analysis of data, both structured and unstructured, that is huge in size, keeps growing exponentially over time and cannot be analyzed using traditional database methods and tools.
The use of data-driven mathematical models to analyze large amounts of available data, and the need to collect, store and process that data, have raised many technical challenges for big data applications, related to inconsistency, incompleteness, privacy, scale and timeliness, amongst others. Parallel processing and distributed computing have become essential in handling the scale and timeliness challenges.
The concept of parallel processing is based on dividing a large problem into smaller ones, each of which is carried out by its own processor. A distributed system, also known as distributed computing, is a group of networked computers that interact with each other in order to appear as a single coherent system to the end user and achieve a common goal. It runs algorithms across many machines in a computing environment and shares resources ranging from memory to networks, storage and processes.
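As a minimal illustration of this divide-and-conquer idea on a single machine, the sketch below splits a large aggregation across worker processes with Python's multiprocessing module; a distributed framework applies the same pattern across many networked machines.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Each worker processes its own slice of the problem independently."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # divide the large problem into 4 smaller ones
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)
    print(sum(partials))                      # combine the partial results
```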
Internet users generate about 2.5 quintillion bytes of data each day.
The most popular open-source framework for distributed computing and parallel processing is Hadoop. It offers a robust solution to handle, process and group massive streams of structured and unstructured data. It comprises two components: a distributed processing framework (MapReduce), in which processing tasks are distributed across clusters of nodes so that large data volumes can be processed very quickly across the system as a whole, and a distributed file system (HDFS), in which the output is stored.
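To give a feel for the MapReduce programming model, the sketch below reproduces the classic word-count example in plain Python; it illustrates only the map and reduce phases, without Hadoop's actual distribution of the work across a cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    lines = ["big data needs distributed processing",
             "distributed processing scales with data"]
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    print(reduce_phase(mapped))
```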
The most outstanding feature of Hadoop is that processing happens orders of magnitude faster than with traditional on-site ETL. Organizations that are still using ETL in an on-premise or hybrid environment are at a direct competitive disadvantage in terms of speed. Furthermore, Hadoop is highly fault-tolerant and designed to be deployed on low-cost hardware. Because it processes data in batches, it cannot be used for real-time streaming data.
Hadoop became the framework over which cloud-based ETL has developed. Hadoop distributes the computing processes, which means that data from different sources can be extracted remotely, transformed via a network of computing resources, and then made available for local analysis.
ETL and cloud computing
Businesses have been using ETL for a long time, but what is new is that both the sources of data and the target databases are moving to the cloud. One of the reasons companies are moving from on-premise to cloud computing is to modernize their data platforms in order to leverage new-age applications and advanced analytics.
The main benefits of cloud computing are:
The cloud-based ETL process is similar to the traditional ETL process, but cloud-based tools offer new and competitive functions. They provide real-time streaming data processing, scalability and integrations with a constantly growing number of data sources. Data can be extracted from sources such as online services, SaaS applications, third-party platforms and IoT devices.
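As a hedged sketch of extracting from such a cloud source, the example below pulls JSON records from a hypothetical SaaS REST endpoint with the requests library; the URL, token and field names are placeholders, and a real cloud ETL tool would also handle pagination, retries and loading into the cloud warehouse.

```python
import requests

API_URL = "https://api.example-saas.com/v1/orders"   # hypothetical SaaS endpoint
API_TOKEN = "replace-with-real-token"                # placeholder credential

def extract_from_saas():
    """Extract step against a cloud source: fetch order records as JSON."""
    response = requests.get(API_URL,
                            headers={"Authorization": f"Bearer {API_TOKEN}"},
                            timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    orders = extract_from_saas()
    print(f"Extracted {len(orders)} records for transformation and loading")
```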
About the Author
Guido N. is a Senior Project Manager (PMP, SMC) with 10+ years of experience leading strategic business and technology projects in the financial sector in Mexico. You can follow him on LinkedIn.