The Story of Data - Part 4

This is the fourth post in a multi-part series that delves into our approach to partnering with organisations to drive impactful outcomes. In this post, we will focus on the Data Preparation phase of the Data Story. All of this can be found on our website at www.wearepoweredbydata.com/datastory, where we will build up to the full story as we go.

Data Preparation

The Data Preparation phase of the Data Story is concerned with taking all of the raw data we have now collected and preparing it for analysis. This phase can be quite lengthy, particularly if the data is messy, incomplete, sizable or contains known errors. However, the end result is a dataset that is orders of magnitude more useful than the raw data alone.

It is important to note that we do not totally replace the raw data with new prepared data – the original data should always remain available, even if we rarely delve into it. This availability can really help in a forensic understanding of why certain results have been achieved, and to provide a trail of truth that can be interrogated.

Data preparation is a very wide subject, and we’ll look at just a few techniques and methods here. Often the process we go through will depend on the data itself, its source, how it has been collected and a multitude of other factors.

Data Cleaning

One of the first steps in preparing the data for analysis is to clean it appropriately. This is often one of the hardest stages as it deals with “unknown unknowns”, meaning that whether data is “clean” or not can in some cases be a little subjective. It takes years of experience handling data to identify which items within a data set need cleaning and then to define what “clean” actually means.

Key Data Cleaning items include:

· Handling missing data

· Handling outliers

· Handling duplicate data

· Handling inconsistent data

· Formatting data

It's important to note that the data cleaning process is iterative and requires a lot of attention to detail. It may require multiple rounds of cleaning and validation to ensure that the data is fit for the intended analysis.
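To make the cleaning items above a little more concrete, here is a minimal sketch of one cleaning pass in plain Python. The records, field names, country mapping and outlier threshold are all hypothetical and chosen purely for illustration; what counts as “clean” will differ for every data set, and in practice a data library would usually do this work.

```python
from statistics import median

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"customer": " Alice ", "country": "UK", "order_value": "120.50"},
    {"customer": "Bob", "country": "United Kingdom", "order_value": ""},
    {"customer": "alice", "country": "uk", "order_value": "120.50"},
    {"customer": "Carol", "country": "FR", "order_value": "98000"},
    {"customer": "Dan", "country": "fr", "order_value": "80.00"},
]

# Assumed mapping for inconsistent spellings of the same value.
COUNTRY_ALIASES = {"united kingdom": "UK", "uk": "UK", "fr": "FR"}


def clean(records, outlier_cap=10_000.0):
    """Return cleaned copies of the records; the raw input is left untouched."""
    cleaned = []
    seen = set()
    for rec in records:
        # Formatting: trim whitespace and standardise case.
        name = rec["customer"].strip().title()
        # Inconsistent data: map known variants to one canonical value.
        country = COUNTRY_ALIASES.get(rec["country"].strip().lower(), rec["country"])
        # Missing data: keep the gap explicit as None for now.
        value = float(rec["order_value"]) if rec["order_value"].strip() else None
        # Duplicates: skip rows that match an earlier row after formatting.
        key = (name, country, value)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"customer": name, "country": country,
                        "order_value": value, "flags": []})

    # Outliers: flag rather than delete, so a person can review them.
    for rec in cleaned:
        if rec["order_value"] is not None and rec["order_value"] > outlier_cap:
            rec["flags"].append("outlier")

    # Missing data: impute with the median of the known, non-flagged values.
    known = [r["order_value"] for r in cleaned
             if r["order_value"] is not None and "outlier" not in r["flags"]]
    fill = median(known) if known else None
    for rec in cleaned:
        if rec["order_value"] is None and fill is not None:
            rec["order_value"] = fill
            rec["flags"].append("imputed")
    return cleaned


cleaned_records = clean(raw_records)
```

Every change is recorded in a `flags` list and the raw records are never modified, so anyone reviewing the results can see which values were imputed or judged unusual and trace them back to the original data.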

Data Transformation

Data transformation, also known as data pre-processing or data wrangling, is the process of converting the data into a format that is appropriate for the intended analysis. The goal of this stage is to ensure that the data is in a format that can be easily used and understood by analysts and analytical systems, and that it is structured in a way that supports the intended analysis.

Key Data Transformation items include:

· Data merging

· Data normalisation

· Data aggregation

· Data filtering

· Data enrichment

The data transformation process is iterative, and it may require multiple rounds of work to ensure that the data is ready for the intended analysis. Also, for both transformation and cleaning it's important to keep track of the changes made to the original data set and to document the process, so that the results can be easily replicated and the process can be audited.
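As a rough illustration of the transformation items and the change tracking described above, the sketch below filters, merges, scales and aggregates a tiny order table while logging each step. The tables, field names and the zero-value rule are hypothetical, and “normalisation” is read here as min-max scaling, which is only one of several meanings the term can have.

```python
from collections import defaultdict

# Hypothetical inputs: order rows plus a customer lookup table.
orders = [
    {"customer_id": 1, "order_value": 120.0},
    {"customer_id": 1, "order_value": 80.0},
    {"customer_id": 2, "order_value": 300.0},
    {"customer_id": 3, "order_value": 0.0},
]
customers = {1: {"region": "North"}, 2: {"region": "South"}, 3: {"region": "North"}}

change_log = []  # one entry per step, so the process can be replicated and audited


def log_step(name, rows_in, rows_out):
    change_log.append({"step": name, "rows_in": rows_in, "rows_out": rows_out})


# Filtering: drop zero-value orders (an assumed business rule).
filtered = [o for o in orders if o["order_value"] > 0]
log_step("filter_zero_value", len(orders), len(filtered))

# Merging / enrichment: attach the region from the customer table.
# Each merged row is a new dict, so the original orders are not modified.
merged = [{**o, **customers[o["customer_id"]]} for o in filtered]
log_step("merge_customers", len(filtered), len(merged))

# Normalisation: min-max scale order_value into the range 0 to 1.
# This assumes the values are not all identical (otherwise high == low).
values = [m["order_value"] for m in merged]
low, high = min(values), max(values)
for m in merged:
    m["order_value_scaled"] = (m["order_value"] - low) / (high - low)
log_step("min_max_scale", len(merged), len(merged))

# Aggregation: total order value per region.
totals = defaultdict(float)
for m in merged:
    totals[m["region"]] += m["order_value"]
region_totals = dict(totals)
log_step("aggregate_by_region", len(merged), len(region_totals))
```

Reading `change_log` back gives a simple record of what was done and how many rows each step kept, which is often enough to explain why a figure in the final analysis differs from the raw data.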

Data Reduction

Data reduction, subsetting or summarisation is the process of reducing the size of the data set by selecting a subset of relevant data or by aggregating data at a higher level. The goal of this stage is to make the data set more manageable and easier to analyse, without compromising the integrity of the data. We generally have a very light touch in this area as we want to use as much of the data as possible for analysis and have designed our systems to process very large datasets as required – however, there is always value in making sure the data is appropriately sized for the analysis at hand.

Key Data Reduction items include:

· Sampling

· Dimensionality reduction

· Aggregation

· Filtering

All of these approaches reduce the amount of information held within the data to some extent, so they need to be applied with substantial care.
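The sketch below shows what some of these reduction steps might look like on a small, made-up sales table. The data, store names and sample size are hypothetical. Dropping unused columns stands in for dimensionality reduction here; more sophisticated techniques are outside the scope of this example.

```python
import random
from collections import defaultdict

# Hypothetical event-level data: 1,000 rows of sales per store.
rng = random.Random(42)  # fixed seed so the example is repeatable
rows = [
    {"store": f"S{i % 5}", "day": i % 30, "sales": rng.uniform(50, 150), "notes": "n/a"}
    for i in range(1000)
]

# Sampling: a 10% simple random sample, seeded for reproducibility.
sample = random.Random(7).sample(rows, k=len(rows) // 10)

# Filtering: keep only the stores relevant to the question at hand.
subset = [r for r in rows if r["store"] in {"S0", "S1"}]

# Dropping columns: a very simple form of dimensionality reduction.
slim = [{"store": r["store"], "sales": r["sales"]} for r in subset]

# Aggregation: one row per store instead of one row per event.
totals = defaultdict(float)
counts = defaultdict(int)
for r in slim:
    totals[r["store"]] += r["sales"]
    counts[r["store"]] += 1
summary = {s: {"rows": counts[s], "mean_sales": totals[s] / counts[s]} for s in totals}
```

Note that `summary` cannot be turned back into the individual rows. This is the loss of information mentioned above, and one reason the unreduced data should stay available.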

Data Validation

A crucial step in the preparation of data for analysis is to validate it. Data validation is the process of checking the data for errors, inconsistencies, or missing values, and ensuring that the data is complete and accurate. Optimally there will be a second source of data against which to validate the data – sometimes this can be something as basic as handwritten records, or someone’s understanding of the business. Often though, the validation requires the input of subject matter experts within a business to ascertain how “sensible” the data is. The goal of data validation is to ensure that the data is fit for purpose and can be used to make informed decisions.

Key Data Validation items include:

· Quality checks (errors, inconsistencies etc)

· Integrity checks

· Completeness checks

· Cross-checking

The data validation process should be done carefully, as it may affect the validity of the analysis results. As with cleaning and transformation, it's important to keep track of any changes made to the original data set and to document the validation process, so that the results can be easily replicated and the process can be audited. It's also important to balance the need for data validation with the need for data completeness, so that the results of the analysis are accurate and reliable.
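As a minimal sketch of the validation items listed above, the function below runs quality, integrity, completeness and cross-checks over a few rows and reports what it finds. The rows, the customer list and the ledger total used as a second source are all hypothetical, as are the rules themselves; real rules would normally be agreed with subject matter experts.

```python
# Hypothetical prepared data and an independent reference figure.
prepared = [
    {"order_id": 1, "customer_id": 10, "order_value": 120.0},
    {"order_id": 2, "customer_id": 11, "order_value": -5.0},
    {"order_id": 3, "customer_id": 99, "order_value": None},
]
known_customer_ids = {10, 11, 12}
ledger_total = 120.0  # assumed second source, e.g. a finance ledger


def validate(rows, customer_ids, reference_total, tolerance=0.01):
    """Return a list of readable issues; an empty list means every check passed."""
    issues = []
    for row in rows:
        value = row["order_value"]
        # Completeness check: required fields must be present.
        if value is None:
            issues.append(f"order {row['order_id']}: missing order_value")
        # Quality check: values must fall in a sensible range.
        elif value < 0:
            issues.append(f"order {row['order_id']}: negative order_value")
        # Integrity check: every order must reference a known customer.
        if row["customer_id"] not in customer_ids:
            issues.append(f"order {row['order_id']}: unknown customer_id")
    # Cross-check: compare the total against the second source.
    total = sum(r["order_value"] for r in rows if r["order_value"] is not None)
    if abs(total - reference_total) > tolerance:
        issues.append(f"total {total} differs from reference {reference_total}")
    return issues


issues = validate(prepared, known_customer_ids, ledger_total)
```

The function only reports problems and never alters the data. Deciding whether a flagged row is a genuine error or a real but unusual event is left to someone who knows the business.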


Data Preparation is the final stage before we start to explore and visualise the data. That is the point at which the client will start to see some novel insights, as well as a reflection of some of what they already know. We know the phases up to this point have been successful when the client starts to recognise some of the results of the initial visualisations, because they have a much deeper understanding of their business.

In the next article we will focus on the Data Visualisation and Exploration phase of the Data Story.

Wherever you are in your data story, we can help - from acting as a sounding board as you dip your toe in the water, right through to a full data and insight stack implementation. If you’d like to have a chat about your business challenges and opportunities, drop a DM to Anna Blackwell or visit our website www.wearepoweredbydata.com – we love meeting new people and hearing about your data plans!
