MBA Students: A primer on Data Quality & ETL

Today, data analytics plays a major role in an organization's decision making. This is possible because data is pulled from a variety of sources and assembled in a single repository that corporate decision makers can access. When data is combined from different areas of the company, decision makers get a 360-degree view of what is going on, which enables them to make more informed decisions.

What is ETL?

ETL stands for Extract, Transform and Load. You typically come across this term in data migration projects. It involves extracting data from its original source, transforming it into a format suited to the target database, and loading it into the final destination. Organizations use ETL tools to convert their existing data into a format that allows them to govern it more easily.

What are the three steps involved in the ETL process?

They are Extract, Transform and Load.

Extract – This step gathers relevant data from various sources, whether homogeneous or heterogeneous. The sources may use different formats, such as relational databases, XML, JSON, flat files, IMS and VSAM, or data obtained from external sources by web spidering or screen scraping. In some cases an intermediate staging area is needed to process the data before it is loaded into the destination database. The focus here is on the completeness, accuracy and consistency of the data.
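
As an illustration, here is a minimal sketch of an extract step in Python using pandas. The file names, database and table ("crm_customers.csv", "web_orders.json", "legacy_erp.db", "products") are hypothetical placeholders, not references to any real system.

```python
import json
import sqlite3

import pandas as pd

# Hypothetical source locations, for illustration only.
CSV_PATH = "crm_customers.csv"
JSON_PATH = "web_orders.json"
DB_PATH = "legacy_erp.db"

def extract_sources() -> dict:
    """Pull raw data from three heterogeneous sources into DataFrames."""
    frames = {}

    # Flat file (e.g. a CSV export from a CRM system)
    frames["customers"] = pd.read_csv(CSV_PATH)

    # Semi-structured JSON (e.g. orders captured by a web application)
    with open(JSON_PATH) as fh:
        frames["orders"] = pd.json_normalize(json.load(fh))

    # Relational database (a legacy system exposed via SQLite here)
    with sqlite3.connect(DB_PATH) as conn:
        frames["products"] = pd.read_sql("SELECT * FROM products", conn)

    return frames
```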

Transform – This step applies a set of rules or functions to the data extracted in the previous step. Why is transformation required? There can be several reasons: to fit the structure and format of the destination database, to perform intermediate processing, to create computed fields, and so on. Transformations also help to clean up the data. They typically involve multiple steps and interfaces with other systems. Some examples of transformation types include the following (a short sketch follows the list):

· Encoding free-form values: mapping “High School Graduates” to “HS”

· Selecting only a set of fields or columns to load and ignoring the rest

· Creating unique keys, e.g. adding a new ID or number

· Normalizing data: joining first and last names into a single column called “Name”

· Sorting data: sorting customer IDs in ascending or descending order

· Deriving new calculated values: computing average products sold per customer

· Pivoting and transposing data: converting columns into rows
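
To make these transformation types concrete, the sketch below applies a few of them with pandas. The column names and mapping values are hypothetical and would differ in a real project.

```python
import pandas as pd

def transform_customers(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply a few of the transformation types listed above (hypothetical columns)."""
    df = raw.copy()

    # Select only the fields to load; ignore the rest.
    df = df[["customer_id", "first_name", "last_name",
             "education", "units_sold", "order_count"]]

    # Encode free-form values, e.g. "High School Graduate" -> "HS".
    df["education"] = df["education"].map(
        {"High School Graduate": "HS", "College Graduate": "CG"}).fillna("OTHER")

    # Join first and last names into a single "name" column.
    df["name"] = df["first_name"].str.strip() + " " + df["last_name"].str.strip()

    # Derive a new calculated value: average units sold per order.
    df["avg_units_per_order"] = df["units_sold"] / df["order_count"].replace(0, float("nan"))

    # Sort by customer ID and create a simple surrogate key.
    df = df.sort_values("customer_id").reset_index(drop=True)
    df["row_key"] = df.index + 1

    return df
```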

Load – The last step of ETL loads the transformed data into its end target. The target could be as simple as a single file or as complex as a data warehouse. This process varies widely depending on the requirements of each organization and its data migration projects.
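
A minimal load step might look like the sketch below, which writes the transformed frame into a relational target through SQLAlchemy. The connection URL and table name are placeholders; a production load would usually append new rows or perform a merge/upsert rather than replace the table.

```python
import pandas as pd
import sqlalchemy as sa

def load_customers(df: pd.DataFrame,
                   connection_url: str = "sqlite:///warehouse.db") -> None:
    """Write the transformed frame into the target table (names are hypothetical)."""
    engine = sa.create_engine(connection_url)
    # Replace the table on each run for simplicity; real pipelines typically
    # append incrementally or merge into an existing warehouse table.
    df.to_sql("dim_customer", engine, if_exists="replace", index=False)
```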

What are some of the ETL tools available?

ETL tools pull, scrub and store data from multiple sources for future use and insight. They are used to migrate data from one system to another, be it a database management system, a data warehouse or even an external storage system. These tools can run in the cloud or on premises and often come with an interface that presents the extraction, transformation and loading processes as a visual workflow. Their versatility allows datasets to be analyzed, cleansed and (re)structured, making them invaluable in most industries today.

Typical tools include Informatica PowerCenter, AWS Glue, Azure Data Factory, Google Cloud Dataflow, IBM InfoSphere DataStage and Oracle Data Integrator.

With plenty of options available on the market, organizations can select an ETL tool that is suited to their needs in terms of capability and complexity. Below are some details for cloud-based, on-premises and hybrid, and open-source ETL tools.

AWS Glue - A strong cloud ETL option for companies already familiar with SQL databases and Amazon storage services. It can clean, validate, organize and load data from diverse sources such as streams and clickstreams, and run processing jobs written in either Scala or Python.
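
For a sense of what a Glue job looks like, here is a minimal Python (PySpark) sketch using the standard Glue job boilerplate. The catalog database, table name and S3 path are hypothetical examples.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a table registered in the Glue Data Catalog
# (database and table names are hypothetical).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Transform: rename and retype columns on the way through.
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amt", "double", "order_amount", "double")])

# Load: write the result to S3 as Parquet (bucket path is hypothetical).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet")

job.commit()
```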

Azure Data Factory - Boasting an impressive suite of features, Azure Data Factory is a pay-as-you-go, cloud-based tool that can quickly scale ETL processing and storage to meet a business's data needs. Users benefit from a no-code graphical user interface for simple tasks and a code-based option for data professionals. It offers a range of connectors, including AWS, DB2, MongoDB, Oracle, MySQL, SQL, Sybase, Salesforce and SAP.

IBM DataStage - Its client/server model allows data integration from multiple sources in large volumes and across different platforms, including Intel, UNIX, Linux and even IBM mainframes. It includes extended metadata management, enterprise connectivity and a high-performance parallel framework. As such, IBM DataStage is well suited to large enterprises running various clouds and to companies with on-premises data centers that want to maximize their ETL capabilities.

Talend Open Studio - An open-source, user-friendly tool with a graphical user interface that facilitates data and application connections. It comes packed with a library of more than 900 connectors, enabling users to access corporate data from sources including Excel, Dropbox, Oracle, Salesforce and Microsoft Dynamics. It can also handle many forms of structured and unstructured data from relational databases and software applications. Data professionals can use Talend Open Studio with on-premises, cloud and multicloud platforms, so it is well equipped for companies working in hybrid computing modes. With an effective ETL process in place, even non-data professionals and less-experienced data management staff can carry out a transition between databases with far fewer errors and inconsistencies.

What is Data Quality?

The cornerstone for any data strategy or data-driven system is high quality data. As organizations realize the importance of data, there is an increased emphasis on improving and maintaining data quality. However, the vast volume and increasing complexity of data make it challenging to monitor and improve data quality on a continuous basis.

Data quality is a measure of the condition of data based on characteristics such as its integrity, validity, uniqueness, accuracy, timeliness, consistency and reliability. Data that is high quality is well suited to serve its specific purpose. From a business perspective, data quality can have a major impact on the ability of the business to gather business insights, make strategic decisions, improve operational efficiency and improve other business outcomes. Common issues that can compromise data quality include poorly defined data, incomplete data, duplicate data, incorrect data or data that is not securely stored.

Organizations measure data quality so they can identify and fix data issues before they turn into bigger business problems. Various methodologies are used to assess data quality; one example is the Data Quality Assessment Framework (DQAF), which measures data quality along the dimensions of consistency, timeliness, validity, completeness and integrity.
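
As a simple illustration of how such dimensions can be scored, the sketch below computes rough completeness, uniqueness and validity figures for a pandas DataFrame. It is not the DQAF itself, just an approximation of the idea; the key column and validity rules are supplied by the caller and are hypothetical.

```python
import pandas as pd

def quality_scorecard(df: pd.DataFrame, key_column: str, validity_checks: dict) -> pd.Series:
    """Return simple 0-1 scores for a few common data quality dimensions.

    validity_checks maps a column name to a boolean check applied to that column,
    e.g. {"email": lambda s: s.str.contains("@", na=False)}.
    """
    scores = {}

    # Completeness: share of cells that are not missing.
    scores["completeness"] = 1 - df.isna().mean().mean()

    # Uniqueness: share of rows whose key value is not a repeat of an earlier row.
    scores["uniqueness"] = 1 - df[key_column].duplicated().mean()

    # Validity: average pass rate across the supplied rule checks.
    if validity_checks:
        pass_rates = [check(df[col]).mean() for col, check in validity_checks.items()]
        scores["validity"] = sum(pass_rates) / len(pass_rates)

    return pd.Series(scores)
```

For example, `quality_scorecard(customers, "customer_id", {"email": lambda s: s.str.contains("@", na=False)})` would report how complete the table is, how unique its customer IDs are, and what share of email values pass a basic validity rule.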

It is common for organizations to perform data asset inventories to establish a baseline of data quality and then to measure and improve based on those baseline scores.

What are data quality tools?

Combing through datasets to find and fix duplicate entries, fix formatting issues, and correct errors can use up valuable time and resources. Although data quality can be improved through manual processes, using data quality tools increases the effectiveness, efficiency and reliability of the process.

Data quality tools are used to monitor and analyze business data, determining if the quality of data makes it useful enough for business decision-making while also defining how data quality can be improved. This can include gathering data from multiple data sources that exist in different formats and effectively scrubbing, cleaning, analyzing and managing the data to make it ready for use. Data sources can be databases, emails, social media, IoT, data logs or other types of data.

What can you do with data quality tools?

  1. Data profiling: Analyze and explore data to understand how it is structured and how it can be used for maximum benefit.
  2. Connectivity: Gather data from all sources of relevant enterprise data, including internal and external data.
  3. Data parsing: Convert data from one format to another. A data quality tool also uses data parsing for data validation.
  4. Data matching: Algorithms that help to identify and eliminate duplicate data.
  5. Monitoring and notifications: Monitor data throughout the data lifecycle and notify administrators and management of any issues that need to be addressed.
  6. Data cleaning and standardization: Identify incorrect or duplicate data and modify it according to predefined requirements (a short sketch follows this list).
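
The sketch below illustrates a few of these capabilities (standardization, simple parsing and exact-match deduplication) with pandas. The column names are hypothetical, and real data quality tools add far more sophisticated fuzzy matching, profiling and monitoring on top of this.

```python
import pandas as pd

def clean_and_dedupe(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize, parse and deduplicate a customer table (hypothetical columns)."""
    out = df.copy()

    # Standardization: trim whitespace and normalize case.
    out["email"] = out["email"].str.strip().str.lower()
    out["name"] = out["name"].str.strip().str.title()

    # Simple data parsing: split a combined field into components,
    # assuming values look like "City, Postcode".
    out[["city", "postcode"]] = out["location"].str.split(",", n=1, expand=True)

    # Data matching (exact match only): drop duplicates on the standardized email.
    out = out.drop_duplicates(subset="email", keep="first")

    return out
```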

What are some of the data quality tools?

· Data Ladder, OpenRefine, Talend, Ataccama, Dataedo, Precisely, Informatica

