An Introduction To Data Analysis Workflow
Chonghua Yin
Head of Data Science | Climate Risk & Extreme Event Modeling | AI & Geospatial Analytics
[TL;DR] A data analysis workflow delineates a systematic, repeatable, and scalable process for analysing data. It comprises several distinct stages, each with its prescribed tasks and objectives, offering a structured approach to ensure methodical data analysis.
A well-defined data analysis workflow is crucial for the overall success of project delivery. The data analysis workflow is often integrated into the data governance framework of many companies, allowing each company to create a tailored workflow that aligns with its specific business needs. Additionally, numerous proposals regarding the data analysis process can be found online or in data analysis textbooks (such as Blitzstein & Pfister's workflow, Aakash Tandel's workflow, Grolemund & Wickham's workflow, CRISP-DM, etc.). Some workflows follow a linear progression, while others may be nonlinear, involving multiple feedback loops.
Experience from many data analysis projects indicates that the workflow is rarely a simple linear procedure; more often it consists of several small inner loops. The data analysis process may look like the following image, where feedback could occur at any step, even after delivery.
The data analysis workflow comprises various stages, each encompassing its distinct set of tasks and objectives. While the specific steps may vary based on the project's nature and the available data, a common workflow generally involves the following stages.
Define Question
Sales stewards tasked with contract signing must adeptly convey the requirement specifications to the data stewards, who are responsible for conducting the data analysis. The requirements should be communicated in a manner that is clear and easily understood by all parties involved. In other words, the translation of requirements into questions is essential.
Data analysis usually starts with formulating pertinent questions. Getting these questions right is the foundational, and arguably the most pivotal, stage in the workflow, often termed requirement analysis. This crucial initial step establishes the trajectory for the entire analysis process and helps ensure that the outcomes are pertinent and actionable. If the problem is thoroughly understood, half of the data analysis success has already been achieved.
Many painful lessons demonstrate that imperfect project deliveries often stem from a lack of thorough understanding of the users' genuine requirements, meaning the failure to translate user needs into effectively addressable problems. Some might argue that customers are not always clear about their needs. These factors are often the clients' primary reasons for not accepting the subsequent data analysis results. Adopting an Agile working methodology within these companies could, to a certain extent, alleviate these issues caused by ambiguous requirements. The Agile method requires more direct and prompt feedback and usually offers more flexibility than traditional methods. Of course, the best solution is to clear the ambiguities before conducting any data analysis.
Investing 5% to 10% of project time in analysing requirements and defining problems is usually worthwhile.
Data Preparation
Once the question has been defined, the next step is to prepare data to answer it. Data preparation involves transforming raw data to make it suitable for analysis and processing. This process may include collecting, cleaning, and converting formats for a more comprehensive view and correcting inaccuracies in improperly recorded data. Although this type of work can be time-consuming, it is crucial for any task that deals with substantial amounts of intricate data. The data preparation process can generally be broken down into three steps:
Generally, data preparation may account for 20-80% of the project time, depending largely on the data infrastructure. If a company has a robust data infrastructure, such as a data lake or data warehouse, and the data already exists, the data preparation work may be relatively minimal. However, if everything needs to be established from scratch, it will likely require more time. In some cases, even a proficient data analyst allocates approximately 70-90% of their time to data preparation. While this may seem extensive, it underscores the importance of ensuring data quality and reliability in the analytical workflow.
Data preparation encompasses various data engineering tasks (such as ETL, ELT, and feature engineering) and should be overseen by data science stewards. It is important to avoid relying solely on IT personnel, as their focus tends to be on technology rather than the subtle nuances of the data itself. There was once an IT professional who used the nearest neighbour method to impute all missing data and believed it was a reasonable approach. This is not acceptable for serious data analysis in many cases.
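The point about imputation can be illustrated with a minimal sketch. It assumes a small, hypothetical series of readings with a three-value gap; the function names and the data are illustrative only and are not taken from any particular project. It reports how much data is missing and then compares nearest neighbour filling with linear interpolation, showing that the two methods can tell quite different stories about the same gap.

```python
from typing import List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def nearest_neighbour_fill(values: Sequence[Optional[float]]) -> List[float]:
    """Fill each gap with the closest observed value (ties go to the earlier one)."""
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("no observed values to copy from")
    filled = []
    for i, v in enumerate(values):
        if v is None:
            j = min(observed, key=lambda k: (abs(k - i), k))
            filled.append(values[j])
        else:
            filled.append(v)
    return filled


def linear_fill(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps by linear interpolation; gaps at the edges stay None."""
    filled = list(values)
    observed = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(observed, observed[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical readings: a steady rise with a three-value gap in the middle.
readings = [10.0, None, None, None, 18.0]

print("missing fraction:", missing_fraction(readings))
print("nearest neighbour:", nearest_neighbour_fill(readings))
print("linear interpolation:", linear_fill(readings))
```

With 60% of the values missing, neither result should be trusted blindly. Checking the missing fraction first and choosing a method that matches how the data behaves is the kind of judgement the data science stewards are expected to bring.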
Data Analysis
After cleaning and preparing the data, the subsequent stage involves its analysis. In contrast to the complexities of collecting and preparing data, data analysis is relatively straightforward. As Lei Jun, the CEO of Xiaomi, has expressed, "99% of problems have standard answers. Just ask someone who understands, and you'll save a lot of time and effort."
Data analysis involves applying statistical techniques, machine learning algorithms, or other methods to uncover data patterns, relationships, and insights. Many experts have already summarized and categorized various data analysis methods. In many cases, it's sufficient to analyse data following a specific type or pattern systematically. This resembles software development, where design patterns are extensively applied to solve common problems in software design.
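As a small illustration of applying a standard statistical technique to uncover a relationship, the sketch below summarises one variable and measures the linear association between two. The advertising and sales figures are invented for the example, and the helper names are hypothetical. It uses only the Python standard library.

```python
from math import sqrt
from statistics import mean, median, stdev
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Basic summary statistics for one numeric variable."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


# Invented example data: monthly advertising spend and sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 8.1, 9.8]

print(describe(sales))
print("correlation:", round(pearson(ad_spend, sales), 3))
```

A high correlation here only describes the pattern in the sample. Whether it answers the business question still depends on how that question was defined at the start of the workflow.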
The four most fundamental types of data analytics include:
Other analytics types may include quantitative and qualitative analysis, exploratory analysis (mentioned in data preparation), statistical analysis, etc. It is worth noting that the selection of data analysis methods is primarily determined by the questions that need addressing.
Reporting
After finishing the analyses, the next step is to interpret the analysis insights and report them clearly and understandably. This involves creating visualizations, writing reports, and presenting the findings to internal reviewers.
To demonstrate to customers that their interests are prioritized, it is essential to make everything transparent, including all data collection, cleaning, organization, and presentation processes. In other words, everything should be clear, defensible, and documented.
Internal Review
Internal auditing mainly validates the appropriateness of the utilized data and methods, ensuring that the generated reports and data meet the company's and its clients' quality specifications. This process also encompasses essential adjustments for grammar and formatting. Moreover, certain reports may be subject to compliance checks.
Delivery
The final stage in the data analysis workflow involves delivering the thoroughly reviewed data and reports to the client.
Happy, Hooray!
Submitting the deliverables often does not mark the end of the process. In many instances, client feedback can be swift, but it might take considerable time in extreme cases. Therefore, backing up all data and documents for necessary contingencies is crucial.
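One possible way to back up deliverables is sketched below. It assumes the data and reports live in a single folder, and it writes a zip archive together with a SHA-256 checksum manifest so that the files can be verified if the client comes back much later. The folder names, the label, and the `back_up` helper are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(deliverables: Path, backup_root: Path, label: str) -> Path:
    """Zip the deliverables folder and write a checksum manifest beside the archive.

    backup_root should sit outside the deliverables folder so that the
    archive does not end up containing itself.
    """
    backup_root.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in deliverables.rglob("*") if p.is_file())
    lines = [
        f"{sha256_of(p)}  {p.relative_to(deliverables).as_posix()}\n" for p in files
    ]
    (backup_root / f"{label}.sha256").write_text("".join(lines), encoding="utf-8")
    archive = shutil.make_archive(str(backup_root / label), "zip", root_dir=deliverables)
    return Path(archive)


if __name__ == "__main__":
    # Demonstration in a temporary directory with a made-up report file.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "deliverables"
        source.mkdir()
        (source / "report.txt").write_text("final report", encoding="utf-8")
        print(back_up(source, Path(tmp) / "backups", "client_delivery_v1"))
```

Keeping one archive and manifest per delivery makes it easier to show exactly what was handed over when feedback arrives long after submission.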
Summary
Adopting a common data analysis workflow doesn't guarantee success but significantly enhances the likelihood of achieving it.
“Happy families are all alike; every unhappy family is unhappy in its own way.”
— Leo Tolstoy
“Successful data analysis projects often share common practices, but each unsuccessful project faces its own distinctive challenges and pitfalls.”
—Chonghua Yin
References
R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, 2023