An Introduction To Data Analysis Workflow

[TL;DR] A data analysis workflow delineates a systematic, repeatable, and scalable process for analysing data. It comprises several distinct stages, each with its prescribed tasks and objectives, offering a structured approach to ensure methodical data analysis.

A well-defined data analysis workflow is crucial for the overall success of project delivery. The data analysis workflow is often integrated into the data governance framework of many companies, allowing each company to create a tailored workflow that aligns with its specific business needs. Additionally, numerous proposals regarding the data analysis process can be found online or in data analysis textbooks (such as Blitzstein & Pfister's workflow, Aakash Tandel's workflow, Grolemund & Wickham's workflow, CRISP-DM, etc.). Some workflows follow a linear progression, while others may be nonlinear, involving multiple feedback loops.

Experience from many data analysis projects indicates that the workflow is rarely a simple linear procedure, nor merely a handful of small inner loops. In practice, the process may look like the following image, where feedback can occur at any step, even after delivery.

Figure 1: Data Analysis Workflow

The data analysis workflow comprises various stages, each encompassing its distinct set of tasks and objectives. While the specific steps may vary based on the project's nature and the available data, a common workflow generally involves the following stages.

Define the Question

Sales stewards tasked with contract signing must convey the requirement specifications to the data stewards responsible for conducting the data analysis, and must do so in a manner that is clear and easily understood by all parties involved. In other words, translating requirements into questions is essential.

Data analysis usually begins with formulating pertinent questions. Mastering these questions is the foundational, arguably the most pivotal, stage in the workflow, often termed requirement analysis. This crucial initial step sets the trajectory for the entire analysis process, ensuring that the outcomes are relevant and actionable. If the problem is thoroughly understood, half of the data analysis success has already been achieved.

Many painful lessons show that flawed project deliveries often stem from an incomplete understanding of users' genuine requirements, that is, a failure to translate user needs into problems that can be addressed effectively. Some might argue that customers are not always clear about their own needs; either way, such gaps are often the clients' primary reason for not accepting the subsequent data analysis results. Adopting an Agile working methodology can, to a certain extent, alleviate the issues caused by ambiguous requirements, since Agile demands more direct and prompt feedback and usually offers more flexibility than traditional methods. Of course, the best solution is to clear up the ambiguities before conducting any data analysis.

Investing 5% to 10% of project time in analysing requirements and defining problems is usually worthwhile.

Data Preparation

Once the question has been defined, the next step is to prepare the data to answer it. Data preparation involves transforming raw data to make it suitable for analysis and processing. This may include collecting data, cleaning it, converting formats for a more comprehensive view, and correcting inaccuracies in improperly recorded data. Although this type of work can be time-consuming, it is crucial for any task that deals with substantial amounts of intricate data. The data preparation process can generally be broken down into three steps:

  • Collect data: Data collection should prioritize obtaining information directly from the most original sources, followed by data derived from those sources, and lastly, data generated by models. For example, in the analysis of extreme precipitation, site-specific observational data should be collected from the local meteorological and climatic authorities with the highest priority. Subsequently, merged satellite-observational data can be utilized, with reanalysis data as the last resort.
  • Clean data: After gathering the data, the next step is to prepare it for analysis. This process, often called cleaning or 'scrubbing', is essential for working with high-quality data. It includes filling in missing values, removing inaccurate information, harmonizing inconsistent data, and converting the data into a standardized format. Sometimes, it also involves data enrichment and validation: the former further enhances and optimizes data sets as needed, for example by augmenting and adding data, while the latter explores the data to ensure it is correct and ready for analysis. Visual tools such as histograms, scatter plots, box-and-whisker plots, line plots, and bar charts are vital for verifying data accuracy. These visualizations also help data science teams conduct exploratory data analysis, making it easier to identify patterns, detect anomalies, test hypotheses, and validate assumptions. A brief cleaning sketch follows this list.
  • Store data: Once prepared, the data is stored in data warehouses, data lakes, or other public or private cloud storage until it is time to use it.
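
As a rough illustration of the cleaning step referenced above, the sketch below standardizes units, removes implausible values, fills short gaps, and runs a quick visual check on a hypothetical CSV of daily station precipitation. The file name, column names, and thresholds are assumptions for illustration only, not part of any specific project.

```python
import pandas as pd

# Hypothetical raw file with columns: date, precip, unit (illustrative names only)
raw = pd.read_csv("station_precip_raw.csv", parse_dates=["date"])

# Harmonize inconsistent units: convert values recorded in inches to millimetres
in_inches = raw["unit"] == "in"
raw.loc[in_inches, "precip"] = raw.loc[in_inches, "precip"] * 25.4
raw["unit"] = "mm"

# Drop physically implausible values and sort the records by time
clean = raw[raw["precip"].between(0, 500)].copy()
clean = clean.set_index("date").sort_index()

# Fill only short gaps (here, at most two consecutive missing days)
clean["precip"] = clean["precip"].interpolate(limit=2)

# Quick visual validation: a histogram often reveals remaining anomalies
clean["precip"].hist(bins=50)
```

In practice, the plausibility thresholds and gap-filling rules should be agreed with domain experts rather than chosen ad hoc.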

Generally, data preparation may account for 20-80% of the project time, depending largely on the data infrastructure. If a company has a robust data infrastructure such as a data lake or data warehouse, and the data already exists, the preparation work may be relatively minimal. However, if everything needs to be established from scratch, it will likely require much more time. In practice, even a proficient data analyst may spend approximately 70-90% of their time on data preparation. While this may seem extensive, it underscores the importance of ensuring data quality and reliability in the analytical workflow.

Data preparation encompasses various data engineering tasks (such as ETL, ELT, and feature engineering) and should be overseen by data science stewards. It is important to avoid relying solely on IT personnel, as their focus tends to be on technology rather than the subtle nuances of the data itself. One IT professional, for example, used the nearest-neighbour method to impute all missing data and believed this was a reasonable approach; in many cases, that is not acceptable for serious data analysis.

Data Analysis

After cleaning and preparing the data, the next stage is its analysis. In contrast to the complexities of collecting and preparing data, the analysis itself is relatively straightforward. As Lei Jun, the CEO of Xiaomi, has put it: "99% of problems have standard answers. Just ask someone who understands, and you'll save a lot of time and effort."

Data analysis involves applying statistical techniques, machine learning algorithms, or other methods to uncover data patterns, relationships, and insights. Many experts have already summarized and categorized various data analysis methods. In many cases, it's sufficient to analyse data following a specific type or pattern systematically. This resembles software development, where design patterns are extensively applied to solve common problems in software design.

The four most fundamental types of data analytics are:

  • Descriptive Analytics: describes events or trends over time, such as determining whether heatwave days have increased or decreased during the past 30 years (a minimal sketch of this follows the list).
  • Diagnostic Analytics: unravels the causes behind specific events, such as investigating the factors leading to the heavy rainfall in Auckland, New Zealand, in January 2023. This process requires formulating hypotheses and working with a diverse dataset to gain insights into the root causes.
  • Predictive Analytics: aims to anticipate events that are likely to occur in the future. For example, people are keen to know how heatwave days will change by 2050 under climate change. In predictive analysis, data analysts leverage insights from historical data. Machine learning, deep learning, and other AI technologies have been extensively applied to predictive analysis, and some predictive analyses depend heavily on numerical models such as climate change projections.
  • Prescriptive Analytics: involves identifying the most effective strategy for implementing a decision that has been reached; in other words, it proposes a course of action. For instance, if global climate models project a very high probability of rising sea levels in a coastal area, an adaptation strategy could involve designing a seawall.
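
To make the descriptive example above concrete, here is a minimal sketch that counts heatwave days per year from a hypothetical daily maximum-temperature file and fits a simple trend line. The file name, column name, and the 30 °C threshold are assumptions for illustration, not values from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical daily maximum-temperature series indexed by date (illustrative names)
tmax = pd.read_csv("station_tmax.csv", parse_dates=["date"], index_col="date")["tmax"]

# Descriptive analytics: count days above an assumed heatwave threshold (30 °C) per year
annual_heatwave_days = (tmax > 30).groupby(tmax.index.year).sum()

# A least-squares slope summarizes whether the count has risen or fallen over the record
years = annual_heatwave_days.index.to_numpy()
slope, intercept = np.polyfit(years, annual_heatwave_days.to_numpy(), 1)
print(f"Trend: {slope:+.2f} heatwave days per year over {years.min()}-{years.max()}")
```

A real heatwave definition would normally use a percentile-based threshold and a minimum duration, but the workflow (count, summarize, describe the trend) stays the same.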

Other analytics types may include quantitative and qualitative analysis, exploratory analysis (mentioned in data preparation), statistical analysis, etc. It is worth noting that the selection of data analysis methods is primarily determined by the questions that need addressing.

Reporting

After finishing the analyses, the next step is to interpret the analysis insights and report them clearly and understandably. This involves creating visualizations, writing reports, and presenting the findings to internal reviewers.
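
As one hedged sketch of the visualization step, the snippet below turns an annual heatwave-day count into a report-ready figure and saves it as an image. The series here is a synthetic placeholder standing in for real analysis output.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic placeholder series standing in for real analysis output (illustration only)
rng = np.random.default_rng(0)
years = np.arange(1994, 2024)
heatwave_days = rng.poisson(10, size=years.size)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(years, heatwave_days, color="tab:orange")
ax.set_xlabel("Year")
ax.set_ylabel("Heatwave days")
ax.set_title("Annual heatwave days (illustrative data)")
fig.tight_layout()
fig.savefig("heatwave_days_report.png", dpi=150)  # figure to embed in the written report
```

Saving figures to files, rather than screenshotting interactive plots, keeps the reporting step reproducible and easy to re-run when reviewers request changes.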

To demonstrate to customers that their interests are prioritized, it is essential to make everything transparent, including all data collection, cleaning, organization, and presentation processes. In other words, everything should be clear, defensible, and documented.

Internal Review

Internal auditing mainly validates the appropriateness of the data and methods used, ensuring that the generated reports and data meet the quality specifications of the company and its clients. This process also encompasses essential adjustments for grammar and formatting. Moreover, certain reports may be subject to compliance checks.

Delivery

The final stage in the data analysis workflow involves delivering the thoroughly reviewed data and reports to the client.

Happy, Hooray!

Submitting the deliverables often does not mark the end of the process. In many instances, client feedback can be swift, but it might take considerable time in extreme cases. Therefore, backing up all data and documents for necessary contingencies is crucial.

Summary

Adopting a common data analysis workflow doesn't guarantee success but significantly enhances the likelihood of achieving it.

“Happy families are all alike; every unhappy family is unhappy in its own way.”

— Leo Tolstoy

“Successful data analysis projects often share common practices, but each unsuccessful project faces its own distinctive challenges and pitfalls.”

— Chonghua Yin


References

R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, 2023

https://docs.kanaries.net/articles/data-analysis-workflow

https://www.datascience-pm.com/data-science-workflow/

https://www.fieldengineer.com/blogs/data-preparation

https://datasciencedojo.com/blog/data-analysis-methods/#

https://careerfoundry.com/en/blog/data-analytics/the-data-analysis-process-step-by-step/
