Data Engineering
In the smart data era, the complexity of data and of data application contexts means that data engineering must integrate both AI and human wisdom to maximize its effectiveness. In terms of implementation, data engineering normally includes data acquisition, organization, analytics, and action, which together form a closed data loop (see Figure 4-1).
Data Acquisition
Data acquisition focuses on newly generated data and captures it into the system for processing. It is divided into two stages: data harvest and data ingestion.
Different data application contexts place different demands on the latency of the data acquisition process. There are four main modes:
Acquisition Latency
Real time (Stream)
Data should be processed in real time, with minimal delay. Streaming data is generated continuously, without a boundary. Common streams include video streams, click-event streams on web pages, mobile phone sensor data streams, and so on.
Batch
Batch data is generated periodically at a certain time interval, with a boundary. Common batch data includes server log files, video files, and so on.
Micro batch
Data should be processed periodically, on the order of minutes. Real-time processing is not required, and some delay is allowed. For example, the effect of an advertisement may be monitored every five minutes to determine a future release strategy; this requires that the data be aggregated and processed centrally every five minutes (a brief sketch follows these modes).
Mega batch
Data should be processed periodically over a span of several hours; there is no high volume of data ingested in real time, and a long processing delay is acceptable. For example, some web pages are not updated frequently, so their content may be crawled and refreshed once a day.
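As a minimal sketch of the micro-batch mode, the following Python snippet aggregates ad-click events into five-minute windows; the event format, the ad identifiers, and the five-minute interval are assumptions chosen for illustration.

import time
from collections import Counter

WINDOW_SECONDS = 300  # assumed five-minute micro-batch window

def micro_batch_aggregate(events):
    """Count clicks per ad within each five-minute window."""
    windows = {}
    for ts, ad_id in events:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows.setdefault(window_start, Counter())[ad_id] += 1
    return windows

# Example: three clicks, the first two falling in the same window.
now = time.time()
events = [(now, "ad-1"), (now + 10, "ad-1"), (now + 400, "ad-2")]
for window, counts in sorted(micro_batch_aggregate(events).items()):
    print(window, dict(counts))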
Data Ingestion
Data ingestion refers to the process by which data acquired from data sources is brought into your system so that the system can start acting upon it. It concerns how data is acquired.
Data ingestion typically involves three operations, namely discover, connect, and sync. Generally, no revision of any form is made to the data values, so as to avoid information loss.
Discover
Refers to the process by which accessible data sources are located in the corporate environment. Active scanning, connection, and metadata ingestion help automate this process and reduce the workload of data ingestion.
Connect
Refers to the process by which data sources that are confirmed to exist are connected. Once connected, the system may directly access data from a data source. For example, building a connection to a MySQL database actually involves configuring the connection string of the data source, including the IP address, username, password, database name, and so on.
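As a minimal sketch of the connect operation for the MySQL example above, assuming the PyMySQL client library and placeholder connection parameters:

import pymysql  # assumed client library; other MySQL drivers work similarly

# Placeholder connection-string parameters for the data source.
conn = pymysql.connect(
    host="10.0.0.5",      # IP address
    user="ingest_user",   # username
    password="secret",    # password
    database="sales",     # database name
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")  # verify the connection works
        print(cur.fetchone())
finally:
    conn.close()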
Sync
Refers to the process by which data is copied into a system under your control. Sync is not always necessary once a connection is complete. For example, in an environment with highly sensitive data security requirements, only connection is allowed for certain data sources; copying of that data is not permitted.
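One common way to implement sync, shown here only as a sketch, is incremental copying keyed on a watermark column; the table name (events) and column names are invented, and the source connection is assumed to come from the connect step above.

def sync_increment(source_conn, sink_rows, last_watermark):
    """Copy rows newer than last_watermark into a local sink (a list here),
    returning the new watermark for the next sync run."""
    with source_conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, updated_at FROM events "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_watermark,),
        )
        rows = cur.fetchall()
    sink_rows.extend(rows)  # values are copied unmodified, avoiding information loss
    return rows[-1][2] if rows else last_watermark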
Data Organization
Data organization refers to the process of making data more usable through various operations. It is divided into two stages, namely data preparation and data enrichment.
Data Preparation
Data preparation refers to a process by which data quality is improved using tools. In general, data integrity, timeliness, accuracy, and consistency are regarded as the indicators to improve, so as to prepare for further analytics.
Common data preparation operations include the following (a brief sketch follows the list):
· Supplement and update of metadata
· Building and presentation of data catalogs
· Data munging, such as replacement, duplicate removal, partitioning, and combination
· Data correlation
· Checking of consistency in terms of format and meaning
· Application of data security strategies
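As a brief sketch of two of these operations, duplicate removal and format consistency checking, using pandas on an invented customer table:

import pandas as pd

# Invented sample table: one duplicated row, one inconsistently formatted date.
df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "signup_date": ["2021-01-05", "2021-01-05", "05/01/2021"],
})

df = df.drop_duplicates()  # data munging: duplicate removal

# Consistency check: flag rows whose date is not in YYYY-MM-DD format.
is_iso = df["signup_date"].str.match(r"^\d{4}-\d{2}-\d{2}$")
print("Format violations:")
print(df[~is_iso])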
Data Enrichment
In contrast to data preparation, data enrichment is more context-dependent. It can be understood as a higher-level data preparation process that is driven by context.
Common data enrichment operations include the following (a labeling sketch follows the list):
· Data labels
  o Labels are highly contextual. They may have different meanings in different contexts, so they should be discussed within a specific context. For example, gender labels mean different things in contexts such as ecommerce, fundamental demography, and social networking.
· Data modeling
  o This targets the algorithm models of a business; for example, a graph model built to screen the age group of connoisseurs in the internet finance field.
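As a minimal, hedged sketch of contextual labeling, the rule below assigns an ecommerce "gender" label inferred from browsing behavior rather than from demographic records; the category names and the rule itself are invented for illustration.

# In an ecommerce context, a "gender" label may describe shopping behavior
# rather than a demographic fact, unlike the same label in demography.
def ecommerce_gender_label(category_views: dict) -> str:
    """Label a user from category view counts (illustrative rule only)."""
    feminine = category_views.get("cosmetics", 0) + category_views.get("dresses", 0)
    masculine = category_views.get("razors", 0) + category_views.get("menswear", 0)
    if feminine > masculine:
        return "female-leaning shopper"
    if masculine > feminine:
        return "male-leaning shopper"
    return "unknown"

print(ecommerce_gender_label({"cosmetics": 12, "menswear": 3}))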
Data Analytics
Data analytics refers to a process by which data is searched, explored, or visualized around specific problems so as to form insights and ultimately make decisions. Data analytics is the key step in converting data into action and is also the most complicated part of data engineering.
Data analytics is usually performed by data analysts with specialized knowledge. Figure 4-2 highlights some key aspects of analytics that are used to obtain decision-making support.
Each step from insight to decision is based on the results of this analysis. Nevertheless, each level of analytics poses greater challenges than the previous one, and if the system cannot complete the analytics on its own, the intervention of human wisdom is required.
A data analytics system should continuously learn from human wisdom and enrich its data dimensions and AI capabilities so as to solve these problems and reduce the cost of human involvement as far as possible.
For example, in the currently popular internet finance field, big data and AI algorithms can be used to evaluate user credit quickly and determine the limit of a personal loan, almost without human intervention, and the cost of such solutions is far lower than that of traditional banks.
Data analytics is divided into two stages, namely data insight and data decisions.
Data Insight
Data insight refers to a process by which data is understood through data analytics. Data insights are usually presented in the form of documents, figures, charts, or other visualizations.
Data insight can be divided into the following types depending on the time delay from data ingestion to data insight:
Real time
Applicable to contexts where data insight needs to be obtained in real time. Server system monitoring is a simple example: an alarm and response plan should be triggered immediately when key indicators (including disk and network metrics) exceed their designated thresholds. In more complicated contexts such as P2P fraud prevention, a judgment must be made about the possibility of fraud according to contextual data (the borrower's data and characteristics) and third-party data (the borrower's credit data), and an alarm should be triggered based on that judgment. (A minimal monitoring sketch follows these types.)
Interactive
Applicable to contexts where insight needs to be obtained interactively. For example, a business expert studying the reason for a recent fall in sales of a particular product cannot get an answer from a single query; clues are gathered through successive queries, with each result determining the target of the next. As interactive insight requires, query responses should therefore be close to real time.
Batch
Applicable to contexts where insight should be produced once per time interval. For example, in general there is no real-time requirement for behavioral statistics of mobile app users (including new users, daily active users, and retention).
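As a minimal sketch of the real-time monitoring case above, with invented metric names and thresholds:

# Invented thresholds for the server-monitoring example.
THRESHOLDS = {"disk_usage_pct": 90.0, "network_errors_per_min": 100.0}

def check_metrics(metrics: dict) -> list:
    """Return an alarm message for every metric exceeding its threshold."""
    return [
        f"ALARM: {name}={value} exceeds {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

print(check_metrics({"disk_usage_pct": 95.2, "network_errors_per_min": 12}))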
The depth and completeness of data insight results greatly affect the quality of decisions.
Data Decisions
A decision is a process by which an action plan is formulated based on the results of data insight. With sufficient and deep data insight, decisions are much easier to make.
Action
An action is a process by which the decision generated in the analytics stage is put into use and the effect is assessed. It includes two stages, namely deployment and assessment.
Deployment
Deployment is a process by which action strategies are put into practice. Simple deployments include presenting a visualized result or reaching users in a marketing campaign; in practice, however, deployment is usually more complicated.
Assessment
Assessment is a process by which the action result is measured; it aims to provide a basis for optimizing the entire data engineering loop.
In practice, although problems in the action result appear to stem from the decision, they are more often a reflection of data quality. Data quality may relate to every stage of data engineering, including acquisition, preparation, enrichment, insight, decision, and action. It is therefore necessary to track the processing steps at each stage, which helps locate the root causes of problems; a minimal lineage-tracking sketch follows.
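One simple way to support such tracking, sketched here with invented stage names and log entries, is to record a lineage entry describing what each stage did to the data:

import datetime

# Minimal lineage log: record what each stage did, so a problem in the
# action result can be traced back to the stage that introduced it.
lineage = []

def record(stage: str, detail: str) -> None:
    lineage.append({
        "stage": stage,
        "detail": detail,
        "at": datetime.datetime.now().isoformat(timespec="seconds"),
    })

record("acquisition", "ingested 10,000 rows from MySQL source 'sales'")
record("preparation", "dropped 37 duplicate rows")
record("enrichment", "applied ecommerce gender labels")
record("insight", "computed five-minute ad-click aggregates")

for entry in lineage:
    print(entry["at"], entry["stage"], "-", entry["detail"])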