The Worst Survey Question in the World
This is a question that I saw recently in a report on using AI in data governance tasks. This might seem hyperbolic, but I genuinely loathe it. The intent was good, but I refuse to believe anyone can glean any shred of useful information out of this mess.
First, it’s important to understand the point of asking this question. Based on the conclusions drawn, it seems the report’s authors want to know how to free up the data team’s time by automating the activities on which the team spends the most time. I think this is flawed on its face, since automation can’t be implemented for broad genres of activities, only for specific use cases within those activities. As an example (outside of the data sphere), take house cleaning. If I spend too much time cleaning, can I automate my tasks? I can easily get a robot that will automate my daily vacuuming. It might even have sensors that prevent it from running into shoes, toys, or other objects left on the floor. To my everlasting disappointment, though, it is a lot more difficult to get a robot that will pick up those objects and put them away. The same may be said for data cleansing: some parts of it are easily automated, while others are so difficult as to be practically impossible.
But for the sake of argument, let's say that the question was asked with the specification that the only activities considered were those that would be easy to automate. The data team is entirely college interns, and as I previously mentioned, ChatGPT is basically a decently skilled college intern. Even in this situation, making the question multiselect negates any meaning the results can have. It’s nonadditive. Consider the following situation: We all want to spend less time on chores. I spend the majority of my time on vacuuming; you spend the majority of your time on dishes and laundry; and our roommate spends the majority of her time on laundry and mopping. According to this survey, that would mean that 67% of people spend the majority of their time on laundry. This is clearly wrong. If it takes two activities together to make up the majority of your time, then neither activity is necessarily a majority on its own, and in no case can both of them be.
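To make the nonadditivity concrete, here is a minimal Python sketch that tallies the chores example the way a multiselect survey would. The `responses` data and the `selection_shares` helper are hypothetical names invented for illustration; they are not taken from the report.

```python
from collections import Counter

# Hypothetical multiselect answers, taken from the chores example above.
responses = {
    "me": ["vacuuming"],
    "you": ["dishes", "laundry"],
    "roommate": ["laundry", "mopping"],
}


def selection_shares(responses):
    """Return the fraction of respondents who selected each option."""
    counts = Counter(
        option for picks in responses.values() for option in set(picks)
    )
    total = len(responses)
    return {option: count / total for option, count in counts.items()}


shares = selection_shares(responses)
for option, share in sorted(shares.items()):
    print(f"{option}: {share:.0%}")

# The shares add up to more than 100%, so they cannot be read as
# portions of anyone's time.
print(f"sum of shares: {sum(shares.values()):.0%}")
```

Laundry is selected by two of the three respondents, yet nothing in the tally says how much time anyone actually spends on it. The shares count selections, not hours.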
The one exception to the tautology above is when the two activities are not distinct from each other. If I spend a majority of my time vacuuming, I can also say that I spend a majority of my time cleaning. This overlap in terms essentially renders the question useless. What do you mean when you say that you spend the majority of your time doing data governance? Is it different from what you mean when you say data validation or data consolidation? Say I get a ticket from a business user that asks about a potential duplicate record. I investigate where each record came from and discover that the discrepancy was due to two separate source systems sending data in a slightly inconsistent way. I combine both records and run some checks on the database to see if there are any other discrepancies of a similar nature. I then write up my actions and suggest a policy that might be implemented to prevent this situation from occurring in the future. This one action might then count as data validation, data consolidation, data lineage, responding to a ticket from business users, and data governance. I can’t imagine that, in that situation, saying I spend the majority of my time on those five activities would lead to any valid insights.
How could this question be made better? The purpose of the question is valid. With the recent democratization of AI and the subsequent push for AI-automated solutions, it is important to know what sort of activities might be good candidates for automation. But I don’t think that this question can be asked in this manner. As I pointed out above, there are significant issues with equating ‘majority of one’s time’ with ‘should be automated’. The simple solution would be to ask people, “What tasks does your data team do that you think would be beneficial to have automated?” The better solution would be to conduct real interviews with data leaders and to truly understand what the data teams’ tasks are, what sort of difficulties they face, and what sort of tedium they endure. These interviews can then be categorized by someone who understands the capabilities of AI automation and how those capabilities might fit the business use cases.