登录查看更多内容

DATA TRANSFORMATION

Yuvaraj Pandey

Assistant Manager

发布日期: 2023年3月21日

What is data transformation?

Data transformation is the process of changing the format, structure, or values of data. For data analytics projects, data may be transformed at two stages of the data pipeline. Organizations that use on-premises data warehouses generally use an ETL (extract, transform, load) process, in which?data transformation is the middle step. Today, most organizations use cloud-based data warehouses, which can scale compute and storage resources with latency measured in seconds or minutes. The scalability of the cloud platform lets organizations skip preload transformations and load raw data into the data warehouse, then transform it at query time — a model called ELT (?extract, load, transform).

Processes such as?data integration,?data migration,?data warehousing, and?data wrangling?all may involve data transformation.

Data transformation may be constructive (adding, copying, and replicating data), destructive (deleting fields and records), aesthetic (standardizing salutations or street names), or structural (renaming, moving, and combining columns in a database).

An enterprise can choose among a variety of?ETL tools?that automate the process of data transformation. Data analysts, data engineers, and data scientists also transform data using?scripting languages such as Python?or?domain-specific languages like SQL.

Benefits and challenges of data transformation

Transforming data yields several benefits:

Data is transformed to make it better-organized. Transformed data may be easier for both humans and computers to use.
Properly formatted and validated data improves data quality and protects applications from potential landmines such as null values, unexpected duplicates, incorrect indexing, and incompatible formats.
Data transformation facilitates compatibility between applications, systems, and types of data. Data used for multiple purposes may need to be transformed in different ways..

However, there are challenges to transforming data effectively:

Data transformation can be expensive. The cost is dependent on the specific infrastructure, software, and tools used to process data. Expenses may include those related to licensing, computing resources, and hiring necessary personnel.
Data transformation processes can be resource-intensive. Performing transformations in an on-premises data warehouse after loading, or transforming data before feeding it into applications, can create a computational burden that slows down other operations. If you use a cloud-based data warehouse, you can do the transformations after loading because the platform can scale up to meet demand.
Lack of expertise and carelessness can introduce problems during transformation. Data analysts without appropriate subject matter expertise are less likely to notice typos or incorrect data because they are less familiar with the range of accurate and permissible values. For example, someone working on medical data who is unfamiliar with relevant terms might fail to flag disease names that should be mapped to a singular value or notice misspellings.
Enterprises can perform transformations that don't suit their needs. A business might change information to a specific format for one application only to then revert the information back to its prior format for a different application.

How to transform data

Data transformation can increase the efficiency of analytic and business processes and enable better data-driven decision-making. The first phase of data transformations should include things like data type conversion and flattening of hierarchical data. These operations shape data to increase compatibility with analytics systems. Data analysts and data scientists can implement further transformations additively as necessary as?individual layers of processing. Each layer of processing should be designed to perform a specific set of tasks that meet a known business or technical requirement.

Data transformation serves many functions within the data analytics stack.

Extraction and parsing

In the modern ELT process, data ingestion begins with extracting information from a data source, followed by copying the data to its destination. Initial transformations are focused on shaping the format and structure of data to ensure its compatibility with both the destination system and the data already there. Parsing fields out of comma-delimited log data for loading to a relational database is an example of this type of data transformation.

Translation and mapping

Some of the most basic data transformations involve the mapping and translation of data. For example, a column containing integers representing error codes can be mapped to the relevant error descriptions, making that column easier to understand and more useful for display in a customer-facing application.

领英推荐

The Future of Data Transformation

Ravit Jain 6 个月前

Data Pipeline: Purpose, Types, Components and More

Lyftrondata 3 个月前

GOOGLE TUG Leader Says Data Analysts Should Learn Data…

Andrew C. Madson 6 个月前

Translation converts data from formats used in one system to formats appropriate for a different system. Even after parsing, web data might arrive in the form of hierarchical JSON or XML files, but need to be translated into row and column data for inclusion in a relational database.

Filtering, aggregation, and summarization

Data transformation is often concerned with whittling data down and making it more manageable. Data may be consolidated by filtering out unnecessary fields, columns, and records. Omitted data might include numerical indexes in data intended for graphs and dashboards or records from business regions that aren’t of interest in a particular study.

Data might also be aggregated or summarized. by, for instance, transforming a time series of customer transactions to hourly or daily sales counts.

BI tools can do this filtering and aggregation, but it can be more efficient to do the transformations before a reporting tool accesses the data.

Enrichment and imputation

Data from different sources can be merged to create denormalized, enriched information. A customer’s transactions can be rolled up into a grand total and added into a customer information table for quicker reference or for use by customer analytics systems. Long or freeform fields may be split into multiple columns, and missing values can be imputed or corrupted data replaced as a result of these kinds of transformations.

Indexing and ordering

Data can be transformed so that it's ordered logically or to suit a data storage scheme. In relational database management systems, for example, creating indexes can improve performance or improve the management of relationships between different tables.

Anonymization and encryption

Data containing personally identifiable information, or other information that could compromise privacy or security, should be anonymized before propagation. Encryption of private data is a requirement in many industries, and systems can perform encryption at multiple levels, from individual database cells to entire records or fields.

Modeling, typecasting, formatting, and renaming

Finally, a whole set of transformations can reshape data without changing content. This includes casting and converting data types for compatibility, adjusting dates and times with offsets and format localization, and renaming schemas, tables, and columns for clarity.

Refining the data transformation process

Before your enterprise can run analytics, and even before you transform the data, you must replicate it to a data warehouse architected for analytics. Most organizations today choose a cloud data warehouse, allowing them to take full advantage of ELT. Stitch can load all of your data to your?preferred data warehouse?in a raw state, ready for transformation.?Try Stitch?for free.

#HUQUO #DATA TRANSFORMATION

DATA TRANSFORMATION

Yuvaraj Pandey

Assistant Manager

What is data transformation?

Benefits and challenges of data transformation

How to transform data

Extraction and parsing

Translation and mapping

领英推荐

Filtering, aggregation, and summarization

Enrichment and imputation

Indexing and ordering

Anonymization and encryption

Modeling, typecasting, formatting, and renaming

Refining the data transformation process

更多精彩文章

社区洞察

其他会员也浏览了

Understanding the Data Vault Model: ABC to Advanced Strategies and Best Practices for Data Vault Modeling

Data Build Tool(DBT) — Aamir P

Data Engineering: The Backbone of Modern Data-Driven Decision Making

Data Assets, Data Products, Data as a Product, Data Engineering - The Whimsical World of Data Terminology Soup

The Bridge to Insight: Data Engineers and the Importance of Understanding Data Analytics Concepts

Understanding the Roles in a Data Ingestion Project: A Deep Dive

??? Why Data Lakehouses Could Make Data Warehouses Obsolete

Data Lineage and Impact Analysis: Understanding and Dealing with Data Dependencies in Data Pipelines by Fidel V.

High Scale Ingestion Meets Big Data Analytics

What is data transformation?

Benefits and challenges of data transformation

How to transform data

Extraction and parsing

Translation and mapping

领英推荐

Filtering, aggregation, and summarization

Enrichment and imputation

Indexing and ordering

Anonymization and encryption

Modeling, typecasting, formatting, and renaming

Refining the data transformation process

DATA MANAGEMENT

2023年4月14日

SCALA

2023年4月13日

RISK MANAGEMENT

2023年4月10日

TERADATA

2023年4月8日

SPARK

2023年4月5日

POWER BI

2023年4月4日

DOT NET DEVELOPER

2023年4月3日

DATA NETWORK

2023年4月1日

DATA EXTRACTION

2023年3月30日

DATA VISUALIZATION

2023年3月29日

社区洞察

其他会员也浏览了

Understanding the Data Vault Model: ABC to Advanced Strategies and Best Practices for Data Vault Modeling

Data Build Tool(DBT) — Aamir P

Data Engineering: The Backbone of Modern Data-Driven Decision Making

Data Assets, Data Products, Data as a Product, Data Engineering - The Whimsical World of Data Terminology Soup

The Bridge to Insight: Data Engineers and the Importance of Understanding Data Analytics Concepts

Understanding the Roles in a Data Ingestion Project: A Deep Dive

??? Why Data Lakehouses Could Make Data Warehouses Obsolete

Data Lineage and Impact Analysis: Understanding and Dealing with Data Dependencies in Data Pipelines by Fidel V.

High Scale Ingestion Meets Big Data Analytics