Data Integration in the Era of Big Data, Cloud Storage and Data Analytics
By Guido N.


The zettabyte era is official. In the last two years more data has been generated than in the entire history before. One zettabyte is equal to a trillion gigabytes. The digitalization of society, social media, mobile devices, satellites, cloud computing, falling technology costs, growing knowledge about data science and new technologies such as the internet of things (IoT) all ensure that data will keep growing exponentially. The effective integration of all this data, and the speed at which that integration happens, is crucial for its use and exploitation.

This article describes what data integration is, which data integration methods exist and how integration processes are performed, how data integration works in business intelligence, what role data quality plays in data integration, and what impact big data and cloud computing have on data integration.


What is data integration?

Many companies store large amounts of data on different platforms such as ERP, CRM, legacy and other information systems. Because the data is distributed, businesses cannot interpret and analyze it in a holistic manner. By integrating data in a single place such as a data warehouse, it becomes a business asset, available for operations management, decision making, reporting, planning, forecasting and more. Data integration is the extracting, transforming and loading (ETL) of data from different sources into a form that can be used in business operations or consumed by analytical tools.

Data integration comes with several implicit benefits. First, because data must be harmonized, data quality rules are applied to ensure that data is complete, error-free and reliable, which improves data quality and integrity. Second, it reduces data silos by consolidating data into one repository, improving the capacity to analyze it and support business decision-making. Finally, data integration gives companies access to valuable insights, for example to understand their customers and improve customer experiences.

Data and analytics build off each other to deliver a deeper understanding of the business. Analytics is used for the discovery, interpretation, and communication of meaningful patterns in data. These insights are used to make strategic, tactical and operational decisions and improve business performance. Artificial intelligence (AI) and machine learning (ML) are replacing standard BI reporting and are used to extract real-time patterns from large sets of data and make predictions based on that information, without human intervention.


Figure: Data integration and data analytics


Methods of data integration

Data integration can be performed in different ways. An organization must have a good understanding of the different data integration methods and approaches in order to choose the one that best fits.

The most common data integration methods can be grouped into five categories:

  1. Manual data integration: all phases of the integration, from collection to visualization, are performed manually.
  2. Uniform data access integration: a technique that retrieves and uniformly displays data while the data is maintained in its original source.
  3. Middleware data integration: integrates data from legacy systems into a new middleware layer.
  4. Common data storage integration: a copy of the data from the source is stored in a new database or data warehouse in order to present the data in a uniform way.
  5. Application-based integration: software applications extract and integrate data by harmonizing data from different sources and systems.

Retrieving data from different sources while keeping data quality high can be a challenging task. Choosing the correct integration method depends on the type of business, its processes, the applications that the organization is using, and the specific data needs of the company. Depending on these factors, organizations may require data migration and integration tools that range from simple to very complex.
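
To make one of these categories concrete, the sketch below illustrates uniform data access integration: the data stays in its original sources and is only combined into a single view at query time. It is a minimal, hypothetical example using pandas and SQLite; the file, table and column names are assumptions made for illustration.

# Minimal sketch of uniform data access integration: data stays in its
# original sources (a CSV file and a SQLite database in this illustration)
# and is only combined into a unified view at query time.
# File names, table names and columns are hypothetical.
import sqlite3
import pandas as pd

def unified_customer_view(csv_path: str, db_path: str) -> pd.DataFrame:
    # Read customers managed by a CRM export (left in place as a CSV).
    crm = pd.read_csv(csv_path, usecols=["customer_id", "name", "email"])

    # Read customers managed by an ERP system (left in place in SQLite).
    with sqlite3.connect(db_path) as conn:
        erp = pd.read_sql_query(
            "SELECT customer_id, name, email FROM erp_customers", conn
        )

    # Present a single, de-duplicated view without copying data into a
    # new permanent repository.
    view = pd.concat([crm, erp], ignore_index=True)
    return view.drop_duplicates(subset="customer_id")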


Engineering the ETL process

The exact steps of how data is processed and transferred between the source systems and the target systems may differ from one method or tool to the next, but the end result is always the same.

Examples of how data is processed between source systems and target systems include the following (a small sketch of one of them, change data capture, follows the list):

  • Extract, transform and load: copies of datasets from different sources are collected, harmonized and loaded into a database or data warehouse.
  • Extract, load and transform: data is loaded as is into a data lake and transformed subsequently for specific analytics purposes.
  • Change data capture: identifies data changes in databases in real-time and applies them to a database or data warehouse.
  • Data replication: data in one database is replicated to other databases to keep the data synchronized for operational usage and backup.
  • Data virtualization: data from different systems are virtually combined to create a unified view instead of loading data into a new repository.
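
To make one of these patterns concrete, the following is a minimal sketch of change data capture based on polling a timestamp column, written in Python against SQLite. It is an illustration only, not a reference to any particular CDC tool; the table, columns and watermark mechanism are invented for the example.

# Minimal polling-based change data capture (CDC) sketch.
# Rows changed in the source since the last run (tracked by a watermark
# timestamp) are picked up and upserted into the target. All table and
# column names here are hypothetical.
import sqlite3

def capture_changes(source_db: str, target_db: str, last_sync: str) -> str:
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)

    # Identify rows that changed after the previous watermark.
    changes = src.execute(
        "SELECT id, name, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_sync,),
    ).fetchall()

    # Apply the changes to the target (insert new rows, overwrite existing ones).
    tgt.executemany(
        "INSERT OR REPLACE INTO orders (id, name, amount, updated_at) "
        "VALUES (?, ?, ?, ?)",
        changes,
    )
    tgt.commit()

    # Advance the watermark to the newest change that was applied.
    new_watermark = max((row[3] for row in changes), default=last_sync)
    src.close()
    tgt.close()
    return new_watermark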

Independent of the method used, the goal is to engineer the ETL process in such a way that the data pipelines and analytics provide business value to the company. Setting up the data pipelines can be challenging and relies on engineering best practices. Examples of problems in the engineering process are the lack of a proper flow of business information, loss of data during the ETL process, and difficulties in obtaining test data that correctly represents the business processes.


Figure: Real-time data streaming


Batch ETL versus streaming ETL

In traditional data environments, ETL tools collect and extract data in batches, usually based on a schedule. That data is then transformed and loaded into a database or a data warehouse. Examples of processes where batch ETL works well are salary reporting and tax calculations.

Nowadays, however, businesses cannot wait hours or days for applications to handle data in batches. They must act on new data in real time, as it is generated. These real-time applications require streaming data integration: the continuous, real-time integration of data sources into data stores and analytics tools. Examples of processes that require real-time data are fraud detection, payment processing and the forecasting of future power consumption.

The shift from batch processing to information being generated in real time has had profound implications for the technology architecture, which has moved to a model based on distributed message queues and stream processing. Stream processing is the real-time processing of continuous streams of data in motion. In addition, streaming ETL pipelines are emerging that are capable of handling continuous streams of data in real time.
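
As a small illustration of the difference with batch ETL, the sketch below processes a continuous stream of events one at a time, as they arrive. The event generator and the "suspicious payment" rule are stand-ins for a real message queue and a real fraud model.

# Minimal streaming-ETL sketch: each event is extracted, transformed and
# loaded the moment it arrives, instead of being collected for a nightly
# batch. The event source and the fraud rule are illustrative stand-ins.
import random
import time
from itertools import islice
from typing import Iterator

def payment_stream() -> Iterator[dict]:
    # Stand-in for a consumer reading from a message queue of payment events;
    # here events are simply generated forever.
    while True:
        yield {"account": random.randint(1, 5), "amount": random.uniform(1, 5000)}
        time.sleep(0.1)

def transform(event: dict) -> dict:
    # Enrich each event in flight, e.g. flag unusually large payments.
    event["suspicious"] = event["amount"] > 4000
    return event

def load(event: dict) -> None:
    # Stand-in for writing to a data store or alerting system.
    if event["suspicious"]:
        print(f"ALERT: possible fraud on account {event['account']}")

if __name__ == "__main__":
    # Process the first 50 events as they arrive, one at a time.
    for raw_event in islice(payment_stream(), 50):
        load(transform(raw_event))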


Data integration for business intelligence

Business intelligence (BI) is one of the most significant data integration use cases. A powerful BI solution incorporates everything from reporting to predictive analytics and operations management. A typical BI architecture is shown in the figure below.



Figure 1: Business Intelligence Architecture


In this architecture the data warehouse is populated from the various source systems through the three steps of the ETL process (a minimal sketch follows the list):

  1. Extraction. In the first step of the ETL process, data is extracted from various source systems with different formats like relational databases, NoSQL, XML and flat files and loaded into an intermediate storage area, the staging area.
  2. Transformation. Then a set of rules is applied to the extracted data to transform it into a single standard format. The data structures of the different sources may vary depending on the type of database, and need harmonization as part of the integration activities. This may involve filtering, cleaning, joining, splitting and sorting.
  3. Loading. In the last step of the ETL process, the transformed data is loaded into the data warehouse. Only the most relevant data is carried through, so that the warehouse represents the data needed for decision-making.
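
The sketch below walks through these three steps using pandas and SQLite: a CSV export and a JSON export play the role of the source systems, and a SQLite database plays the role of the data warehouse. All file, table and column names are assumptions made for the example.

# Minimal ETL sketch: extract from two differently formatted sources,
# transform them into one standard format, and load the result into a
# warehouse table. File, table and column names are hypothetical.
import sqlite3
import pandas as pd

# 1. Extraction: pull raw data from heterogeneous sources into a staging area.
crm_orders = pd.read_csv("crm_orders.csv")        # e.g. columns: OrderId, Amount, Date
erp_orders = pd.read_json("erp_orders.json")      # e.g. columns: order_id, amount, date

# 2. Transformation: harmonize column names, clean values and combine.
crm_orders = crm_orders.rename(
    columns={"OrderId": "order_id", "Amount": "amount", "Date": "date"}
)
orders = pd.concat([crm_orders, erp_orders], ignore_index=True)
orders["date"] = pd.to_datetime(orders["date"])   # single standard date format
orders = orders.dropna(subset=["order_id"]).drop_duplicates("order_id")

# 3. Loading: write the harmonized data into the data warehouse.
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="replace", index=False)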

After the data warehouse is populated, a part of the data is loaded into a data mart. A data mart is a collection of data, similar to a data warehouse, but it usually holds a smaller amount of data used for a specific purpose. It is a subset of the data warehouse and is usually oriented to a certain business line or team, for example sales or finance. The data mart holds the focused data that is used in the reporting tool and consumed by the end user.
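
Continuing the hypothetical warehouse sketch above, a sales-oriented data mart can be as simple as a filtered view over the warehouse table, so that the reporting tool only sees the subset it needs. This is only an illustration of the idea, not a prescribed design.

# Illustrative data mart: a sales-oriented subset of the hypothetical
# warehouse, exposed as a view so the reporting tool only sees recent orders.
import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    conn.execute(
        """
        CREATE VIEW IF NOT EXISTS sales_mart AS
        SELECT order_id, amount, date
        FROM fact_orders
        WHERE date >= date('now', '-1 year')
        """
    )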


Quality in data integration

The primary objective of data integration is to provide consolidated data sets that are clean and consistent and meet the information needs of the different end users in an organization. Data integration depends on data architecture standards, because the data architecture describes the data structures used by a business and its applications, and sets data standards, including quality standards, for all data systems.

Data cleansing is the process of preparing data for use or analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated or improperly formatted. It is an important activity in the transformation step of the ETL process and includes the identification, removal and/or update of invalid data from the source systems. The goal of data cleansing is to improve data quality and utility by catching and correcting errors before the data is transferred to a target database or data warehouse. Data integrity is the overall accuracy, completeness and consistency of data and defines the criteria for data quality.
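
A minimal cleansing step in the transformation phase could look like the sketch below; the column names and rules are hypothetical stand-ins for real business rules.

# Minimal data-cleansing sketch for the transformation step: remove
# duplicates, fix formatting, and handle incomplete records.
import pandas as pd

def cleanse(customers: pd.DataFrame) -> pd.DataFrame:
    cleaned = customers.copy()

    # Improperly formatted: normalize text fields to a uniform shape.
    cleaned["email"] = cleaned["email"].str.strip().str.lower()
    cleaned["country"] = cleaned["country"].str.upper()

    # Duplicated: keep one record per customer id.
    cleaned = cleaned.drop_duplicates(subset="customer_id")

    # Incomplete: drop records missing mandatory fields, default the rest.
    cleaned = cleaned.dropna(subset=["customer_id", "email"])
    cleaned["country"] = cleaned["country"].fillna("UNKNOWN")

    # Incorrect: reject values that violate a simple business rule.
    cleaned = cleaned[cleaned["email"].str.contains("@")]
    return cleaned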

There are five criteria to measure data quality (a small sketch showing how some of them can be computed follows the list):

  1. Completeness: the degree to which data is available.
  2. Validity: the degree to which data conforms to defined business rules.
  3. Accuracy: the degree to which data is close to true values.
  4. Consistency: the degree to which data is consistent within the same dataset and/or across multiple data sets.
  5. Uniformity: the degree to which data is specified using the same unit of measure.
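
As mentioned above, some of these criteria can be expressed as simple scores over a dataset. The checks below are illustrative examples, not an exhaustive data quality framework; the columns and business rules are assumptions.

# Illustrative data quality scores for a few of the criteria above.
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    return {
        # Completeness: share of non-missing values across all fields.
        "completeness": float(df.notna().mean().mean()),
        # Validity: share of rows satisfying a defined business rule.
        "validity": float((df["amount"] >= 0).mean()),
        # Uniformity: share of rows using the expected unit of measure.
        "uniformity": float((df["currency"] == "USD").mean()),
        # Consistency: share of rows with a non-duplicated identifier.
        "consistency": float(1 - df["order_id"].duplicated().mean()),
    }

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 4],
        "amount": [100.0, -5.0, 30.0, None],
        "currency": ["USD", "USD", "EUR", "USD"],
    }
)
print(quality_report(orders))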

Data integrity also refers to data security and data safety. Data security concerns the protection of data against unauthorized use, while data safety concerns protecting data against loss by ensuring safe storage and making regular backups. Companies must ensure that all data migrations are handled in accordance with the company's security and safety policies and with regulatory requirements.


Figure: Data quality in big data


Handling big data with distributed computing

By 2020, stored data was projected to grow to 44 ZB, and the amount of incoming data will only increase. Big data is a term that refers to the collection and analysis of data, both structured and unstructured, that is huge in size, keeps growing exponentially over time, and cannot be analyzed using traditional database methods and tools.

The use of data-driven mathematical models to analyze large amounts of available data, and the need to collect, store and process that data, have raised many technical challenges for big data applications, related to inconsistency, incompleteness, privacy, scale and timeliness, among others. Parallel processing and distributed computing have become essential in handling the scale and timeliness challenges.

The concept of parallel processing is based on dividing a large problem into smaller ones, each of which is carried out individually by a single processor. A distributed system, also known as distributed computing, is a group of networked computers that interact with each other in order to appear as a single coherent system to the end user and achieve a common goal. It runs algorithms across many machines in a computing environment and shares resources ranging from memory to networks, storage and processes.
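
The sketch below is a toy illustration of this divide-and-conquer idea, using Python's multiprocessing module to spread one computation over several worker processes; the workload (summing a large range of numbers) is chosen purely for illustration.

# Toy parallel-processing sketch: a large problem (summing many numbers)
# is split into chunks that are processed by separate worker processes.
from multiprocessing import Pool

def chunk_sum(chunk: range) -> int:
    # Each worker solves one small piece of the overall problem.
    return sum(chunk)

if __name__ == "__main__":
    n = 10_000_000
    workers = 4
    step = n // workers
    chunks = [range(i * step, (i + 1) * step) for i in range(workers)]

    with Pool(processes=workers) as pool:
        partial_sums = pool.map(chunk_sum, chunks)

    # Combine the partial results into the final answer.
    print(sum(partial_sums))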


Internet users generate about 2.5 quintillion bytes of data each day.


The most popular open source framework for distributed computing and parallel processing is Hadoop. It offers a robust solution to handle, process and group massive streams of structured and unstructured data. It consists of two components: a distributed processing framework (MapReduce), where processing tasks are distributed across clusters of nodes so that large data volumes can be processed very quickly across the system as a whole, and a distributed file system (HDFS), where the output is stored.
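
To give a feel for the MapReduce model, the following is a small, self-contained simulation of the map, shuffle and reduce phases for a word count. On a real Hadoop cluster these phases would run in parallel across many nodes, with HDFS storing the input and output, so this is only a conceptual sketch.

# Conceptual MapReduce sketch (word count) run in a single process.
from collections import defaultdict

def map_phase(line: str) -> list[tuple[str, int]]:
    # Map: emit a (key, value) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    # Shuffle: group all values by key so each reducer sees one key's values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data needs big tools", "data tools process big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle(pairs)))
# {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'process': 1}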

The most outstanding feature of Hadoop is that processing happens orders of magnitude faster than with traditional on-site ETL. Organizations that are still using ETL in an on-premise or hybrid environment therefore have a direct competitive disadvantage in terms of speed. Furthermore, Hadoop is highly fault-tolerant and designed to be deployed on low-cost hardware. Because it processes data in batches, it cannot be used for real-time streaming data.

Hadoop became the framework over which cloud-based ETL has developed. Hadoop distributes the computing processes, which means that data from different sources can be extracted remotely, transformed via a network of computing resources, and then made available for local analysis.


ETL and cloud computing

Businesses have been using ETL for a long time; what is new nowadays is that both the sources of data and the target databases are moving to the cloud. One of the reasons companies are moving from on-premise to cloud computing is to modernize their data platforms in order to leverage new-age applications and advanced analytics.

The main benefits of cloud computing are:

  • Security: providers offer high levels of security through encryption of data, the use of AI, constant updates, built-in firewalls and third-party security testing, among other measures.
  • Automatic updates: automatic and real-time software updates as well as upgrades to servers and computer processing power.
  • Scalability: the cloud allows on-demand and instant upscaling and downscaling of storage in accordance with needs.
  • Use of modern databases: big players like Amazon, Cisco and Microsoft offer mature cloud computing services and data centers.
  • Reduced IT costs: savings from avoiding capital expenditure, hardware maintenance costs and payment for unused storage space.
  • Safety: backups are available in case of power outages, hardware failures or hacks. Data is copied multiple times and stored in different data centers.
  • Data integration: cloud-based applications can be extended and integrated with each other through published APIs.

The cloud-based ETL process is similar to the traditional ETL process; nevertheless, cloud-based tools offer new and competitive functions. They provide real-time streaming data processing, scalability and integrations with a constantly growing number of data sources. Data can be extracted from sources such as online services, SaaS applications, third parties and IoT devices.
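
As an illustration of extraction from such a cloud source, the sketch below pulls JSON records from a REST API and stages them for the transform step. The endpoint, token and field names are entirely hypothetical.

# Illustrative extraction from a cloud/SaaS source: pull JSON records from
# a REST API and stage them for the transform step. The endpoint, token and
# field names are hypothetical.
import json
import requests

API_URL = "https://api.example-saas.com/v1/orders"   # hypothetical endpoint
API_TOKEN = "replace-with-a-real-token"

def extract_orders() -> list[dict]:
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"updated_since": "2024-01-01"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["orders"]

if __name__ == "__main__":
    # Stage the raw extract; a cloud ETL tool would stream this onward to
    # the transformation and loading steps.
    with open("staging_orders.json", "w") as staging_file:
        json.dump(extract_orders(), staging_file)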


Figure: Big data and cloud computing


About the Author

Guido N. is a Senior Project Manager (PMP, SMC) with 10+ years of experience in leading strategic business and technology projects within the financial sector in Mexico. You can follow him on LinkedIn.
