What is a Data Analyst?

I often use the phrase Data Analyst to identify a distinct role, separate from that of a Data Scientist or Data Engineer. The role itself has, I believe, become blurred and occasionally even devalued when compared to these other roles. The activity of data analysis itself is often misunderstood or misinterpreted. The role of a Data Analyst, though, can be the most important role in a data-driven organization.

Data analysis plays a critical role in any modern organization. The phrase "Data-Driven," though, is thrown around as if it were enough to refresh a report daily and consider yourself in tune with the insights of your "Data Lake." And no, your organization is not "Data-Driven" simply because it has a Data Lake. Without ongoing data analysis, insights are unlikely to materialize from data, and ongoing data analysis requires both knowledge of the domain you are analyzing and a breadth of analytical skills and techniques.

I want to start by baselining some definitions. Perhaps you would define these roles and activities differently than what I have here, and I welcome that feedback in the comments. Let me start with a basic definition of Data Analysis, then explore some additional activities that are often lumped in with analysis and Data Science. I will then explore Data Analysis in more depth, and finish with a few comments about how I define Data Science as distinct from Data Analysis.

Data Analysis - the process of directly analyzing data to answer specific business questions. It requires an individual (a human) to identify the problem, propose an approach to a solution, and work directly with the data, applying multiple techniques to answer the question. It may require only simple pivot table skills (Exploratory analysis), or it may require more advanced tools and statistical methods (Inferential, Predictive, Causal), depending on the scope of the question.

Machine Learning - the automation of data analysis techniques on data to provide ongoing and repetitive analysis of data. This allows frequent and rapid repetition of data analysis and can result in improvement of the analysis through feeding in additional data in subsequent iterations. That is the Learning part - more data allows the system to "learn" more, improving the iterative analysis.

Artificial Intelligence - A subset of Machine Learning that imitates what are normally thought of as human characteristics. Examples include:

  • Vision - identifying objects, animals, people, etc. based on analyzing images.
  • Speech recognition - also known as audition (as in "the power or sense of hearing", not a try-out for a play or movie). This is the interpretation of human speech and responding either through text translation or through actions that are responses to commands.
  • Natural language - interpreting and responding to spoken or written commands presented in "natural" human language. Analyzing the sense or meaning of that language.
  • Text generation - creating new text based on a request for a specific topic or genre, including the generation of legal documents based on specific scenarios, or even the creation of research papers based on a given topic.
  • Creativity - examples include creation of new images or artistic concepts based on random input.

Note, I have often seen Machine Learning defined as a subset of AI. That is, in my mind, ridiculous, as there are a vast number of examples of ML that I would never classify as imitating something human in nature. Running a linear regression rapidly and repeatedly is not AI; it is simply performing a math problem really fast. Or perhaps you want to classify a calculator as AI? AI uses extremely advanced mathematical techniques to imitate human capabilities.

Data Engineering - Includes several activities related to the design, development, and deployment of data solutions including SQL databases, data warehouses, data lakes, etc. Can include relational data design, NoSQL, columnar, BLOB, Hadoop, or really any form of data storage. In fact, the broad set of technologies covered within this activity seems to grow each year, and most individuals must now specialize in only a few of these platforms to have any hope of keeping up. Data Engineering covers both the design of data solutions as well as the actual implementation both in development and production. These activities used to be called either Database Development or Data Architecture, but I like the Engineering title, especially if it brings more attention and respect to these critical roles.

Data Science - The activity of science includes research and development of practical information and capabilities either through experimentation or observation. Data Scientists would, therefore, develop the techniques and tools that have a practical application to Data Analysis, Machine Learning, and Artificial Intelligence. For many, I may be drastically redefining what a Data Scientist is, so more on this later in the document.

Data Analysis and Domain Knowledge

Stephen Few, in his book The Data Loom, dislikes the term data analysis to describe the activity of understanding quantitative data.

"The term data analysis suggests that the process consists entirely of breaking information down into its component parts - digging into the details - which is what analysis means, but this is only one of the activities required to make sense of data."
(Few, p. 10)

Although I agree with Few's intent here, I struggle with adopting the phrase he suggests as an alternative: data sensemaking. I am aligned with his key points throughout the book, but I hate to think of taking on the corporate title of Data Sensemaker.

As Few explores the skills and capabilities required of data analysis, the skill set most often overlooked, and yet, perhaps, most important is domain knowledge. Indeed, domain knowledge and a close association to the business strategies and processes you are supporting is what most stands out in the role of a Data Analyst.

"To understand data in a particular domain, we must understand that domain to a fair degree. The greater our knowledge of a domain, the better equipped we are in most situations to make sense of its data."
(Few, p. 2)

Few continues with several additional skill sets, including critical thinking, scientific thinking, and visual design, with all of which I strongly concur. I hope to explore domain knowledge and the business side of data analysis in a future article. What I would like to consider for the rest of this article, though, is the analytical skills required of a Data Analyst, and why these skills are so critical to the role. At the same time, I hope to better define the breadth of capabilities a Data Analyst can cover, so that organizations may more clearly understand the value of the role and raise it to a more appropriate level of consideration.

Data Analysis in Depth

In the book The Elements of Data Analytic Style, Jeff Leek, a Biostatistician and Chief Data Officer at Fred Hutchinson Cancer Center, identifies six types of analysis based on the type of questions being answered (Leek, Chapter 2, "The data analytic question"):

  • Descriptive
  • Exploratory
  • Inferential
  • Predictive
  • Causal
  • Mechanistic

Mechanistic analysis, as Carl Anderson points out in Creating a Data-Driven Organization, seems to be a transition from pure analysis into engineering and modeling of systems and structures. In fact, it better aligns with what I define as Data Science below. I'll leave out Mechanistic in my descriptions, but will otherwise stick with Leek's framework and provide my own interpretation of each level of analysis.

Descriptive - Something happened (let’s call it 'Y'). - Essentially data is gathered into a summarized form that presents what happened during a given time period with little or no interpretation. Unfortunately, the vast majority of data analysis work stops with descriptive analysis. We often call this simply reporting.
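To make the "simply reporting" point concrete, here is a minimal sketch of a descriptive summary. The order records and the summarize_by_month helper are hypothetical, invented only for illustration. Note that the output states what happened in each month and nothing more.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order records as (month, revenue); invented for illustration.
orders = [
    ("2024-01", 120.0),
    ("2024-01", 80.0),
    ("2024-02", 150.0),
    ("2024-02", 50.0),
    ("2024-02", 100.0),
]


def summarize_by_month(records):
    """Group revenue by month and report count, total, and average."""
    grouped = defaultdict(list)
    for month, revenue in records:
        grouped[month].append(revenue)
    return {
        month: {
            "orders": len(values),
            "total": sum(values),
            "average": mean(values),
        }
        for month, values in sorted(grouped.items())
    }


summary = summarize_by_month(orders)
for month, stats in summary.items():
    # A descriptive report: what happened, with no interpretation.
    print(month, stats)
```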

Exploratory - Do we know anything else about Y? - According to Leek, this includes trends, correlations, or relationships between multiple variables in the data. The defining aspect of this work is the high level of interaction with the data required of the analyst. This is an activity, rather than a specific skill, in my mind. In fact, it is often the first step in any more complete data analysis project: data exploration. Exploratory analysis rarely will answer a question, but may often help with better defining how to answer that question in subsequent steps by reducing the variables and relationships being considered.
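One common exploratory step is to check how strongly each candidate variable moves with Y before deciding what to study further. The sketch below computes a Pearson correlation by hand and ranks two variables against sales. The variable names and numbers are hypothetical, and correlation is only one of many ways to explore; it does not answer the business question by itself.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical weekly data; Y is sales, the rest are candidate variables.
sales = [10, 12, 15, 19, 24]
candidates = {
    "discount_pct": [0, 5, 10, 15, 20],
    "temperature_c": [20, 18, 22, 19, 21],
}

# Rank candidates by the strength of their relationship with sales,
# which helps narrow down what to examine in later steps.
ranked = sorted(
    ((name, pearson(values, sales)) for name, values in candidates.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:.2f}")
```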

Inferential - When X happens, what is the likelihood of Y happening? - This is the heart of what I find to be more complete data analysis. It includes hypothesis testing, probability, confidence intervals, and regression. I believe absolutely every aspect of business - Sales, Marketing, Supply Chain, Finance - would benefit from this first step beyond Descriptive and Exploratory analysis to statistical inference. Further, without taking this step beyond basic reporting, we may actually be damaging our business models, losing revenue, or at least losing opportunity. "Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers." (Spiegelhalter, p. 236).
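As one illustration of hypothesis testing, the sketch below runs an exact two-sided permutation test on the difference in mean sales between weeks with and without a promotion. The data, the group names, and the permutation_test helper are all hypothetical, and a permutation test is just one of several methods that could be used here. The idea is to ask how often a gap at least as large as the observed one would appear if the group labels were assigned arbitrarily.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference in means.

    Returns (observed absolute difference, p-value). Enumerates every way
    of splitting the pooled values into groups of the original sizes, so
    it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in indices)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return observed, extreme / splits


# Hypothetical weekly unit sales; invented for illustration.
with_promo = [12, 14, 15, 16, 18]
without_promo = [8, 9, 10, 11, 13]

observed, p_value = permutation_test(with_promo, without_promo)
print(f"observed difference in means: {observed:.2f}")
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the gap is unlikely to be an artifact of how the weeks happened to be labeled. It does not, on its own, show that the promotion caused the difference.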

Predictive - If X happens, what will happen to Y? - The goal of Predictive analysis is to "develop a statistical model that can predict values of attributes for new, incomplete, or future data points." (Anderson, p. 103). Interestingly, as advanced as this type of analysis is, it is one whose results many people will readily recognize: "if you bought this on Amazon, you might also be interested in this other thing." In fact, predictive analytics is often considered the holy grail of business analysis, yet organizations rarely invest in it beyond some financial forecasting metrics.
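A minimal sketch of the idea, assuming a simple straight-line relationship: fit a least-squares line to past observations, then use it to estimate Y for a value of X not yet seen. The ad spend and sales figures and the helper names are hypothetical. Real predictive work would also hold out data to check how well the model generalizes.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Estimate y for a new x using the fitted line."""
    return slope * x + intercept


# Hypothetical history: ad spend (X) and sales (Y), in arbitrary units.
ad_spend = [1, 2, 3, 4, 5]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)

# Estimate sales for a spend level that has not been observed yet.
forecast = predict(6, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast:.2f}")
```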

Causal - Does X cause Y to happen? - Actual causation is, perhaps, one of the most difficult things to establish in data analytics. It is one thing to say that when X happens, Y will also happen, but a completely different thing to say that X caused Y to happen. To demonstrate true causality, we must build on Inferential and Predictive analytics and apply both on a frequent and repetitive basis, narrowing the scope of what levers trigger Y. "Such experiments provide a causal, deeper understanding of the system that [can] be used for predictions" (Anderson, p. 108).

There is certainly some overlap between inference, prediction, and causality. Regardless, I believe this list covers the breadth of what I, and clearly others (Anderson, Few, Leek), include in the practice of Data Analysis. One must ask, then, if a Data Analyst is responsible for all of the activities described above, what is the role of a Data Scientist? Or is "Data Science" simply a synonym for "Data Analysis"?

Data Science Redefined

Many reading this may disagree with the definition of Data Scientist I submit here, and many may classify the work I describe above as Data Science. I believe this is a misnomer, though. While I have great respect for the role of a Data Scientist, I believe the misuse of this term to describe the work of data analysis has devalued the role of a Data Analyst.

Really, I believe the term Data Scientist has simply become too generalized. Anderson implies this in the opening sentence of his description of the role: "A broad term that tends to include more mathematically or statistically inclined staff" (Anderson, p. 62, emphasis my own). In fact, Anderson is simply categorizing a Data Scientist as one type of Data Analyst. Other categorizations include Data Engineers, Business Analysts, Statisticians, "Quants" (what?), Accountants and Financial Analysts. Indeed, after reading the descriptions provided, I am still left wondering what a Data Scientist is and how it differs from other Data Analysts.

To understand my definition of a Data Scientist, an analogy may help. When I schedule an appointment with my doctor, I do not expect to be part of a scientific research project. I expect my doctor to have extensive knowledge about medical science, and to apply that knowledge to assess my health or to prescribe a treatment - get more exercise, take this antibiotic, etc. No one would deny the knowledge and expertise a doctor, nurse, or other medical professional has about medicine. However, we do not refer to these professionals as scientists.

There are, in fact, medical scientists, often at university research labs, or working at biotech or pharmaceutical companies. These scientists are constantly conducting research and experiments to advance and expand the overall body of medical knowledge that a doctor, nurse, or other medical professional can apply when examining or treating a patient. Medical scientists do scientific research to improve the tools and processes available to medical professionals.

Similarly, a Data Scientist conducts research to improve the tools and processes available to Data Analysts. A Data Analyst needs to understand the proper application of those tools and processes to answer business questions (diagnose a patient).

Let Data Scientists be actual scientists. This is an important role necessary for the ongoing improvement of data analysis, machine learning, and artificial intelligence. At the same time, elevate the value of Data Analysts. Identify the key capabilities necessary for data analysis and provide a clear path of advancement in those careers. Recognize the value that both roles bring and, most importantly, bring that value into your organization through application of more complete data analysis.

References

Anderson, C. (2015). Creating a Data-Driven Organization. Sebastopol, CA: O'Reilly Media Inc.

Few, S. (2019). The Data Loom. El Dorado Hills, CA: Analytics Press.

Leek, J. (2015). The Elements of Data Analytic Style. Leanpub.

Spiegelhalter, D. (2019). The Art of Statistics: How to Learn from Data. New York: Basic Books.
