Data Transformation in Data Mining
Darshika Srivastava
Associate Project Manager @ HuQuo | MBA, Amity Business School
Raw data is often difficult to trace and understand, so it must be preprocessed before any information can be retrieved from it. Data transformation is a technique for converting raw data into a format that eases data mining and the retrieval of strategic information. It draws on data cleaning and data reduction techniques to convert the data into the appropriate form.
Data transformation is an essential data preprocessing step that is performed before data mining to produce patterns that are easier to understand.
Data transformation changes the format, structure, or values of the data and converts it into clean, usable data. Data may be transformed at two stages of the data pipeline for data analytics projects. Organizations that use on-premises data warehouses generally use an ETL (extract, transform, and load) process, in which data transformation is the middle step. Today, most organizations use cloud-based data warehouses, which can scale compute and storage resources with latency measured in seconds or minutes. The scalability of the cloud platform lets organizations skip preload transformations, load raw data into the data warehouse, and transform it at query time.
Data integration, migration, data warehousing, and data wrangling may all involve data transformation. Data transformation increases the efficiency of business and analytic processes, and it enables businesses to make better data-driven decisions. During the data transformation process, an analyst will determine the structure of the data. This could mean that data transformation may be:
Constructive: adding, copying, or replicating data.
Destructive: deleting fields or records.
Aesthetic: standardizing values such as salutations or street names.
Structural: renaming, moving, or combining columns in a database.
Data Transformation Techniques
There are several data transformation techniques that can help structure and clean up the data before analysis or storage in a data warehouse. Let's study the techniques used for data transformation, some of which we have already covered under data reduction and data cleaning.
1. Data Smoothing
Data smoothing is the process of removing noise from a dataset using a suitable algorithm. It highlights the important features present in the dataset and helps in predicting patterns. Smoothing eliminates or reduces the variance and other forms of noise introduced when the data was collected.
The idea behind data smoothing is that it can identify simple changes and thereby help predict different trends and patterns. This helps analysts and traders who must digest large amounts of data to find patterns they would not otherwise see.
We have already seen how noise is removed from data using techniques such as binning, regression, and clustering; a binning sketch follows below.
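As a rough illustration, here is a minimal sketch of smoothing by bin means (an equal-frequency binning approach) in Python. The prices list, the bin count, and the helper name smooth_by_bin_means are assumptions made up for this example, not part of any particular library:

```python
# A minimal sketch of smoothing by bin means (equal-frequency binning).
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort the values, partition them into equal-frequency bins,
    and replace each value with the mean of its bin."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(sorted_vals, n_bins)
    return [[float(b.mean())] * len(b) for b in bins]

# Illustrative price data (assumed for this example)
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, n_bins=3))
# -> [[9.0, 9.0, 9.0, 9.0], [22.75, 22.75, 22.75, 22.75],
#     [29.25, 29.25, 29.25, 29.25]]
```

Each bin's values collapse to the bin mean, so small fluctuations within a bin disappear while the overall trend across bins is preserved.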
2. Attribute Construction
In the attribute construction method, new attributes are derived from the existing attributes and added to the data set to ease mining. Constructing new attributes from the given ones simplifies the original data and makes the mining process more efficient.
For example, suppose we have a data set of measurements of different plots, i.e., the height and width of each plot. Here, we can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps us understand the relations among the attributes in the data set, as shown below.
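A minimal sketch of this construction in Python with pandas; the plots data frame and its values are assumptions made up for the example:

```python
# A minimal sketch of attribute construction: deriving 'area'
# from the existing 'height' and 'width' attributes.
import pandas as pd

# Illustrative plot measurements (assumed data)
plots = pd.DataFrame({"height": [10, 20, 15], "width": [5, 8, 4]})

# Construct the new attribute from the existing ones
plots["area"] = plots["height"] * plots["width"]
print(plots)
#    height  width  area
# 0      10      5    50
# 1      20      8   160
# 2      15      4    60
```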
3. Data Aggregation
Data aggregation is the method of collecting, storing, and presenting data in a summary format. The data may be obtained from multiple sources and integrated into a single summary for analysis. This is a crucial step, since the accuracy of the insights from data analysis depends highly on the quantity and quality of the data used.
Gathering accurate data of high enough quality and large enough quantity is necessary to produce relevant results. Aggregated data is useful for everything from decisions about financing and product strategy to pricing, operations, and marketing.
For example, given a data set of an enterprise's sales reports with quarterly sales for each year, we can aggregate the data to get the enterprise's annual sales report, as in the sketch below.
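A minimal sketch of this quarterly-to-annual aggregation in Python with pandas; the sales figures are assumed for illustration:

```python
# A minimal sketch of aggregating quarterly sales into annual totals.
import pandas as pd

# Illustrative quarterly sales data (assumed values)
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [120, 135, 150, 160, 140, 155, 165, 180],
})

# Summarize the quarterly rows into one annual row per year
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
#    year  amount
# 0  2022     565
# 1  2023     640
```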
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range, such as [-1, 1] or [0.0, 1.0]. There are several methods for normalizing the data, as discussed below.
Consider a numeric attribute A with n observed values V1, V2, V3, …, Vn. The commonly used normalization methods are:
Min-max normalization: linearly rescales each value Vi from A's original range [minA, maxA] into a new range [new_minA, new_maxA]: Vi' = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA.
Z-score normalization: rescales each value using the mean Ā and standard deviation σA of attribute A: Vi' = (Vi - Ā) / σA.
Decimal scaling: moves the decimal point of each value: Vi' = Vi / 10^j, where j is the smallest integer such that max(|Vi'|) < 1.
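A minimal sketch of these three methods in Python with NumPy; the values in v are assumed for illustration:

```python
# A minimal sketch of min-max, z-score, and decimal-scaling normalization.
import numpy as np

# Illustrative observed values for attribute A (assumed data)
v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0.0, 1.0]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: subtract the mean, divide by the std. deviation
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that every scaled magnitude falls below 1 (here j = 4)
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal_scaled = v / 10**j

print(min_max)         # [0.    0.125 0.25  0.5   1.   ]
print(z_score)
print(decimal_scaled)  # [0.02 0.03 0.04 0.06 0.1 ]
```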
5. Data Discretization
This is the process of converting continuous data into a set of data intervals: continuous attribute values are replaced by small interval labels, which makes the data easier to study and analyze. If a data mining task handles a continuous attribute, replacing its values with discrete interval labels improves the efficiency of the task.
This method is also called a data reduction mechanism, as it transforms a large dataset into a set of categorical data. Discretization can also employ decision tree-based algorithms, which produce short, compact, and accurate results when using discrete values.
Data discretization can be classified into two types: supervised discretization, which uses the class information, and unsupervised discretization, which does not. Unsupervised discretization is further characterized by the direction in which the process proceeds, i.e., a 'top-down splitting strategy' or a 'bottom-up merging strategy'.
For example, the values of the age attribute can be replaced by interval labels such as (0-10, 11-20, …) or (kid, youth, adult, senior), as in the sketch below.
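A minimal sketch of this interval labeling in Python with pandas; the ages and the bin edges are assumptions chosen for the example:

```python
# A minimal sketch of discretizing a continuous 'age' attribute
# into categorical interval labels.
import pandas as pd

# Illustrative ages (assumed data)
ages = pd.Series([4, 15, 23, 37, 45, 62, 71])

# Replace continuous values with interval labels
labels = pd.cut(ages, bins=[0, 12, 19, 64, 120],
                labels=["kid", "youth", "adult", "senior"])
print(labels.tolist())
# -> ['kid', 'youth', 'adult', 'adult', 'adult', 'adult', 'senior']
```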
6. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchies. This conversion from a lower conceptual level to a higher one is useful for getting a clearer picture of the data. Data generalization can be divided into two approaches: the data cube process (the OLAP approach) and attribute-oriented induction (the AOI approach).
For example, age data in a dataset may appear as numeric values such as 20 or 30; these can be transformed to the higher conceptual level of categorical values such as 'young' and 'old'. The sketch below rolls a 'city' attribute up to 'country' in the same way.
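A minimal sketch of climbing a concept hierarchy in Python; the city_to_country mapping and the records list are assumptions made up for illustration:

```python
# A minimal sketch of generalization along a concept hierarchy:
# low-level 'city' values are rolled up to the higher-level 'country'.

# Illustrative concept hierarchy (assumed mapping)
city_to_country = {
    "Chicago": "USA", "New York": "USA",
    "Toronto": "Canada", "Vancouver": "Canada",
}

records = ["Chicago", "Toronto", "New York", "Vancouver"]
generalized = [city_to_country[city] for city in records]
print(generalized)  # -> ['USA', 'Canada', 'USA', 'Canada']
```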
Data Transformation Process
The entire process of transforming data is known as ETL (Extract, Transform, and Load). Through the ETL process, analysts can convert data to its desired format. Here are the steps involved in the data transformation process:
1. Data discovery: identify and understand the data in its original source format.
2. Data mapping: define how individual fields will be mapped, modified, joined, filtered, and aggregated.
3. Code generation: create the code that will carry out the transformation.
4. Code execution: run the code to convert the data to the desired output format.
5. Data review: check the transformed output to ensure it meets the requirements.
A toy pipeline illustrating these steps follows below.
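This is only a sketch under the assumption of in-memory Python dictionaries; the source rows, field names, and cleaning rules are invented for the example, and a real pipeline would read from source systems and write to a warehouse:

```python
# A minimal sketch of a toy ETL pipeline: extract, transform, load.

def extract():
    # In practice this would read from a source system, file, or API
    return [{"name": " alice ", "spend": "120.5"},
            {"name": "BOB", "spend": "80"}]

def transform(rows):
    # Standardize the names and cast 'spend' to a numeric type
    return [{"name": r["name"].strip().title(),
             "spend": float(r["spend"])} for r in rows]

def load(rows):
    # In practice this would write to a data warehouse
    for row in rows:
        print(row)

load(transform(extract()))
# {'name': 'Alice', 'spend': 120.5}
# {'name': 'Bob', 'spend': 80.0}
```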
Advantages of Data Transformation
Transforming data can help businesses in a variety of ways. Here are some of the essential advantages of data transformation:
Better organization: transformed data is easier for both humans and computers to use.
Improved data quality: transformation removes formatting inconsistencies, missing values, and other quality issues from raw data.
Faster queries: standardized data stored in a consistent format can be retrieved and queried quickly.
Broader compatibility: properly transformed data is compatible across different applications and systems.
More value from data: transformation makes otherwise unusable raw data available for analysis.
Disadvantages of Data Transformation
While data transformation comes with a lot of benefits, there are still some challenges to transforming data effectively, such as:
Cost: data transformation can be expensive, depending on the infrastructure, software, and tools used.
Resource intensity: transformations can impose a heavy computational burden and slow down other operations.
Required expertise: data analysts need domain knowledge of the data to transform it correctly; a lack of expertise can introduce errors that propagate into analysis.
Ways of Data Transformation
There are several different ways to transform data, such as:
Scripting: writing code, for example in SQL or Python, to extract and transform the data.
On-premises ETL tools: installed software that automates the extract, transform, and load process on local infrastructure.
Cloud-based ETL tools: hosted services that perform the same process in the cloud, typically with less setup and maintenance.