Counting is classification in disguise

Foundations of AI series

I fear for the future of AI -- in particular the vast swaths of applied AI that rely on counting labelled data to train and/or evaluate models. Our systems live or die based on whether the data drives them across the finish line or drives them off a cliff. Data quality is crucial to everything we do, so the widely cited estimate that 80% of our time goes to working with data makes complete sense. But are we doing a good enough job of managing data quality?

AI, for the most part, relies on mathematical models that leverage frequencies in a collection of training examples (a "sample" of items with labels). These frequencies are estimates of probabilities of real-world phenomena that we are modeling (a "population"). In essence, when an AI reacts to something new, it responds with the label taken from the most similar and most frequent example that it has seen during training. A new image? The most similar and most frequent image seen while building the system was labelled "octopus". So, the AI responds that this new image is of an octopus. These frequencies are most often calculated by counting items in the training examples that are labelled as having a specific set of characteristics. Similarly, for evaluating our AI systems -- even when they're not trained in this particular way -- we count system responses that match the labels for previously-identified-as-correct test items.

Many of the key issues that hinder progress in AI are related to the accuracy and precision of the data sets that we rely on for training and for testing. Bias in data and models -- which boils down to how we choose what to count -- has sparked heated debate (see Jon Stokes' cogent analysis of one part of it). Developing and deploying applied AI systems is often slow and unpredictable in part because of variability in how data is counted. This led Andrew Ng to call for much more systematic attention to data creation, or "data-centric AI". Results of AI experiments are often difficult to reproduce, sometimes because data is created and processed differently from one iteration to the next, which also slows down progress and deployment. Many AI models suffer from difficulty in generalizing to new situations (called "overfitting") because of how the "correct" responses in the test data have been labeled and counted. Merging diverse data from multiple sources to meet the needs of voracious algorithms is very hard because "the same" features and entities are counted in different ways by different data developers.

My data science friends say that there are lots of people who don't know how to count

Counting, then, is one of the very foundations of AI -- it's how we create accurate and precise data in the first place -- both for building and testing AI systems. But my data science friends say that there are lots of people out there who don't know how to count. Of course we're not talking about mindlessly reciting numbers one after the other like in grade school. We're talking about establishing quantities with high accuracy and high precision, like in grad school or in science. And science has taught us that we can't just rely on intuitive methods -- even those based on expert intuitions -- for high accuracy and precision (see Kahneman, 2011, Ch. 22 on the variable reliability of expert intuitions): we need to objectively examine and assess our processes step by step -- even when we're "just" talking about counting.

We can't rely on intuitive methods for counting -- not even expert intuitions.

So, if you manage or mentor or partner with people on an AI project, then you might want to discreetly check out how they think about counting. Here's a quick and dirty test you can use: four short true/false questions.

  • Counting consistently is easy to do. True or False?
  • We need to count before we can classify. True or False?
  • Computers can count with very high accuracy and precision. True or False?
  • We can't count the same thing in several different ways. True or False?

TL;DR: If your colleagues or partners answered "True" to any of these questions, then it's time for you to get a stiff drink or do some serious meditation! For your project to succeed, everyone helping has to understand that there are dozens, sometimes hundreds, of ways to count the same things, so variability in your counts is unavoidable. Part of that is because counting is first and foremost classification, which everyone seems to do differently. For your AI project to be a success, everyone needs to count like they really mean it.

How to count

Here's a crash course in how to count, in case your colleagues or partners need a place to start or materials to review.

Counting is measurement. Counts are the result of measurement, which is the process of mapping some attribute of whatever-we're-interested-in to a numeric scale by a particular method (see Bunge, 1967, Ch. 13). For example, we might measure some attribute like the temperature of water on the centigrade scale with an electronic thermometer. (See Chang's (2007) interesting book on how scientists' measurement of temperature has evolved.) In the case of counting to populate the random variables that we use for most AI models, we start by mapping the presence of an item of a particular kind to a whole-number scale, using a specific counting method. Each time we detect the presence of one such item, we increment the count by 1. Easy peasy, right? Maybe not.
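As a minimal sketch of that mapping (in Python, with a hypothetical `is_instance_of_interest` predicate standing in for the counting method):

```python
def count_items(items, is_instance_of_interest):
    """Map the presence of items of a particular kind onto a whole-number scale.

    `is_instance_of_interest` is the counting method: a predicate that decides,
    for each item, whether it is the kind of thing we want to count.
    """
    count = 0
    for item in items:
        if is_instance_of_interest(item):  # detect the presence of one such item...
            count += 1                     # ...and increment the count by 1
    return count
```

Every choice that matters is hidden inside that predicate -- which is where the rest of this article lives.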

A key assumption of measurement theory [1, 2] (abundantly corroborated in practice!) is that the variability of measurements is unavoidable: we can manage variability but not eliminate it. Perfect measurement is impossible. The figure below from Edward Tufte's The Visual Display of Quantitative Information is my favorite illustration of this assumption. Each thin line in the graph is identified with an arrow and a number in a circle -- it represents a set of measurements from a specific experiment. All of the experiments attempted to measure the same attribute (thermal conductivity) of the same material (tungsten). The thick line represents the theoretical values for conductivity at each temperature.

[Figure: measured thermal conductivity of tungsten from many experiments (thin lines) against the theoretical curve (thick line), reproduced from Tufte.]

What jumps out immediately from this figure is the huge range of variation in the measurements. Even under the very carefully controlled conditions of these experiments, the numbers produced are wildly different. This unavoidable variability is one reason why statistics is so useful -- it helps us derive more reliable conclusions from differing measurements. It's also a key reason why we have to pay careful attention to how we count.

It turns out that the kind of counting that artificial intelligence, decision making, and other applications depend on so heavily is far more complex than the grade-school notion of reciting numbers one after the other (also called "counting"). In fact, high-accuracy, high-precision counting is so complex that it offers very many opportunities for variation in the resulting counts -- like the other measurement example illustrated above. And a lot is at stake in getting counts right (or at least making them comparable). Reproducing the same counts for the same things is essential for sharing data, for comparing systems, and for tracking progress. It's essential for deciding when an AI system is ready to deploy among unsuspecting human users.

Counting in practice. Let's take a timely and seemingly straightforward example: How can we reliably count green jobs -- jobs that are related to fighting climate change or enabling sustainable business? This data will play a central role in allocating resources to build back better after COVID, in defusing the climate crisis, as well as in tracking opportunities and progress toward these goals. High-accuracy, high-precision counts of everything "green" will be essential for reliable matching and recommendation of green jobs or funding for green companies, for example.

For this discussion, we can identify five steps of counting -- five ways in which the measurement process can vary and yield sometimes dramatically differing results:

Step 1: Sampling. Where will we look to find the entities (jobs, in this example) that we want to count? Looking for jobs in newspapers, company websites, or government taxonomies won't give us an accurate estimate of their frequency across states, industries, or periods of time. We don't know how representative these samples are of the population of jobs, how their data was collected and annotated, or sometimes even which time periods or locations they cover. Sampling biases describe well-known ways in which sampling methods can vary and lead to dramatic differences in counts. Many of these sampling biases boil down to a lack of domain knowledge, and similar biases are introduced in the steps described below. Without solid knowledge of the domain, we are forced to rely on random sampling, which does not always generate representative samples for us to start with. We can generate counts for samples, of course, but how well can we generalize our conclusions from these samples? Differences in sampling lead to lots of variation in the results of counting.
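As one illustration, here is a hypothetical stratified-sampling sketch; the source names, the equal-allocation rule, and the data layout are assumptions made for the example, not a recommended design:

```python
import random

def stratified_sample(postings_by_source, n_per_source, seed=0):
    """Draw an equal number of postings from each source so that no single
    source (newspapers, company sites, government listings) dominates.

    Real stratification would weight each stratum by its share of the
    population of jobs -- which is exactly what we usually don't know.
    """
    rng = random.Random(seed)
    sample = []
    for source, postings in sorted(postings_by_source.items()):
        k = min(n_per_source, len(postings))
        sample.extend(rng.sample(postings, k))
    return sample
```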

Step 2: Delimiting. Given a (hopefully representative) sample of items, which of them count as jobs? We need to double-check carefully. This step yields a set of candidate entities from the sample. Is a "volunteer opportunity" a job? A consulting contract? A one-day gig? A request for advice? Whether a job is present in a list of items is unclear until we can define clearly what a job is, based on domain knowledge. Then we need to use the criteria in the definition to identify valid entities -- i.e., which items count as the jobs that we want to count. Creating robust definitions and reliably delimiting items using these definitions are very hard to do with real-world data while herding real-world annotators -- even for something as seemingly simple as whether an item is a job or not. Differences in delimiting valid jobs (or other entities) lead to lots of variation in the results of counting.
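To make this concrete, here is a hypothetical predicate encoding one team's working definition of "job"; every field name and exclusion below is an assumption for illustration, and a different team's definition would delimit -- and therefore count -- differently:

```python
# One team's (hypothetical) working definition: paid, ongoing, and posted
# by an identifiable employer. Changing any clause changes what gets counted.
EXCLUDED_TYPES = {"volunteer", "one_day_gig", "advice_request"}

def is_job(item):
    """Return True if this item counts as a job under the definition above."""
    return (
        item.get("is_paid", False)
        and item.get("employer") is not None
        and item.get("posting_type") not in EXCLUDED_TYPES
    )
```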

Step 3: Featurizing. With a list of candidate jobs in hand, we need to identify features that will help us decide whether the candidates are or are not in fact instances of what we want to count. We want to count not just jobs but green jobs. Is a job "green" because it requires green skills? Because it is at a company that uses green processes? Because it is related to a green product or service? Again, domain knowledge will tell us which features to focus on and which will probably not be useful. Keyword spotting is no substitute for domain knowledge. If we use the words "green" or "sustainable" as important indicators, we'll miss a huge proportion of jobs and include items that are irrelevant -- so counts will vary even more from team to team. This ubiquitous featurizing step is another component of the foundations of AI, which I've discussed elsewhere. It yields evidence for each candidate job that we can use to decide whether we should count it as green or not. But featurizing is often very difficult. On top of the difficulty of knowing which features to focus on, many features stubbornly resist measurement. For a different example, where does red stop and orange begin when we want to count items that have the color red? The boundary is inherently unclear. On top of that, Berlin & Kay long ago documented that people from different cultures delimit and classify "the same" colors in radically different ways. Creating robust definitions and reliably featurizing items using these definitions are very hard to do with real-world data and real-world annotators -- even for something as seemingly simple as whether an item is green or not. Differences in choosing the features that are relevant and important -- the definitions of what to count -- lead to lots of variation in the results of counting.
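A hypothetical featurizer for the "green" question might look like the sketch below; the skill list, field names, and features are illustrative assumptions, and the first feature shows why keyword spotting alone is weak evidence:

```python
# Illustrative only: a tiny feature set for "is this job green?".
GREEN_SKILLS = {"solar installation", "lifecycle assessment", "emissions reporting"}

def featurize(job):
    """Map one candidate job to a dict of binary features (evidence of greenness)."""
    text = job.get("description", "").lower()
    return {
        # Keyword spotting: cheap, but misses many green jobs and admits noise.
        "mentions_green_keyword": any(w in text for w in ("green", "sustainable")),
        # Domain-informed features: need a curated skill list and employer data.
        "requires_green_skill": bool(GREEN_SKILLS & set(job.get("skills", []))),
        "employer_reports_emissions": bool(job.get("employer_discloses_emissions", False)),
    }
```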

Step 4: Deciding. After this, we will have a collection of items that are likely to be jobs, along with evidence that they might be green. Now we need rules or computations to decide whether the features we detected constitute enough evidence to increment our green-jobs counter. Just how green is green enough to count? Like a neuron in an artificial network, we need some sort of weighted sum of the features detected plus a decision ("activation") function. This is like a definition with the importance of each feature embedded in it. But given that incomplete, unclear, or inconsistent features are common, it is very hard to come up with a reliable, highly accurate decision function. This is yet another step in which domain knowledge plays a key role. Differences in the decision function lead to lots of variation in the results of counting. For a different example, the linguist William Labov (2004) famously studied how people have difficulty classifying images of cups, mugs, and bowls that vary along a gradient. They had candidate items and features, but could not reliably decide where cups end and mugs begin. This is a very common difficulty for annotators -- with any kind of data. AI classifiers are built to make these kinds of decisions at scale -- based on a lot of features that might or might not be relevant -- and the algorithms show similar difficulties.
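A minimal sketch of such a decision function, assuming the hypothetical features from the previous step: the weights and threshold are invented for illustration, and they are exactly the knobs that encode "how green is green enough" -- every team sets them differently.

```python
# Hypothetical weights and threshold: the embedded definition of "green enough".
WEIGHTS = {
    "mentions_green_keyword": 0.2,
    "requires_green_skill": 0.6,
    "employer_reports_emissions": 0.4,
}
THRESHOLD = 0.5

def is_green(features, weights=WEIGHTS, threshold=THRESHOLD):
    """Weighted sum of the detected features plus a hard threshold decision."""
    score = sum(weights.get(name, 0.0) for name, present in features.items() if present)
    return score >= threshold
```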


Step 5: Responding. The usual response in counting is to add one to the variable that is accumulating the counts -- and to skip this step if the decision outcome is not true. Incrementing a variable, it seems, is the only easy part in the whole counting process. But not enough information (i.e., no decision) is not the same as "not true", so even "no response" can create problems. And other responses are possible at this point. We might mark yes or no in some one-hot encoding scheme (in a column marked "green") or annotate our degree of confidence that a particular job is green or not (in a column marked "confidence that green"). We might also record a label ("green") instead of a number (in a column marked "job type") or write a whole sentence ("This is a green job."). Note that for some of these alternative responses we would call the process "classification" instead of "counting" -- even though the process as a whole is the same.
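A sketch of this last step, reusing the hypothetical `featurize` and `is_green` functions from the earlier sketches, shows how the same classification decision can feed a running count and a per-item label at once; only the form of the response differs:

```python
def respond(candidate_jobs, featurize, decide):
    """Run the decision over delimited candidates and record two kinds of
    response: a running count and a per-item label."""
    green_count = 0
    labelled_rows = []
    for job in candidate_jobs:
        decision = decide(featurize(job))        # the classification step
        green_count += int(decision)             # counting response: increment by 1
        labelled_rows.append({
            "job_id": job.get("id"),
            "job_type": "green" if decision else "other",   # labelling response
        })
    return green_count, labelled_rows

# e.g., respond(candidate_jobs, featurize, is_green)
```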

Counting is first and foremost classification.

It's easy to see that counting is first and foremost classification, and high-precision counting may even be just classification with a different kind of response. Crucially, we can classify the same things in different ways, and if we do, we will count them differently. To classify or categorize items reliably, we need to leverage domain knowledge to establish which features are important enough and reliable enough to determine how we identify an item as the kind of thing that we want to count. So domain knowledge shapes data in a very direct way -- and the absence of domain knowledge creates its own kinds of bias. The different assumptions, definitions, and rankings of features that come from domain knowledge dramatically change the counts that we come up with. This is why science has found that we can't just rely on intuitive methods -- even those based on expert intuitions -- for high accuracy and precision.

Count like you really mean it!

Now you know how to count. But counting like you really mean it is hard. We've seen that you can count the same thing in a variety of different ways. You'll get very different numbers of green jobs if you change sampling, delimitation, featurization, or your decision rules. You have to classify things before you can count them -- otherwise you can't tell whether to count them or not. And classification is hard to do at scale. Algorithms can help, but they usually can't classify with very high accuracy and precision, so they can't establish quantities with high accuracy and high precision either.

To count like you really mean it, start with these steps:

Check all of your assumptions and definitions -- and those of the people who are helping you count. Again, you'll get very different results if you and your annotators or data providers have different assumptions about how to do sampling, delimitation, featurization, or which decision rules to use. Is everyone using the same definition of "job" and of "green", for example? If you've ever worked with data annotators on site or through crowdsourcing, you'll know that the people who are doing the actual tallying often have wildly different assumptions about what counts and what doesn't. These differences translate directly into lower precision and higher bias of whatever model you're building. You have to work very hard to ensure that everyone is on exactly the same page.
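One concrete way to check whether everyone is really on the same page is to have two annotators label the same items and measure their agreement. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic; the labels in the example are invented for illustration:

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items.
    Values near 1.0 suggest shared definitions; values near 0 suggest the
    annotators are counting different things."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Example: two annotators disagree on one item out of four -> kappa = 0.5
print(cohens_kappa(["green", "green", "other", "other"],
                   ["green", "other", "other", "other"]))
```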

Distrust your data (and especially others' data!). Assume that much of it is wrong or irrelevant. Doing so will help you to avoid blindly gathering additional data (that you will have to spend lots of time "cleaning up") and to remember the importance of data provenance and data governance.

Data provenance (or data lineage) is data seen from the consumer's perspective. It's like traceability of ingredients in food: it refers to everything that happened before the data set (or food) got into your hot little hands. That includes sampling, data "cleanup", and everything in the counting methods described above. When I talk to colleagues across a range of organizations, we collectively lament that measurement, "cleanup", and processing of data are rarely carefully controlled or even documented during the building of AI systems. These are key parts of data provenance. In experimental science, standard practice is quite the opposite: data provenance is the focus of experimental studies like those illustrated above and is a key criterion that reviewers check for. My colleagues and I all agree that you ignore data provenance at your peril, the same as for food. That's where almost all of your problems with model accuracy or recall come from, and no amount of tweaking hyperparameters can overcome data quality issues.

In AI, we need mountains of data, often from various sources, so we often use data sets for which we don't know exactly how the data was sampled and labelled or how representative the samples are. We sometimes don't know what the labels were intended to mean or how relevant the data are for the problems we want to solve. Are you using someone else's "green jobs" data? Check carefully how they counted!

It would be best if we had clear guidance on how to systematically distrust a data set. But as far as I know, we don't have this kind of guidance yet. For now, my rule of thumb is to judge data quality by its documentation. If the data creators have documented their assumptions, definitions, and methods for sampling, delimiting, featurizing, annotating, and deciding how to count, then you can rely on the data. But good luck with that! The colleagues I talk to in different organizations rarely see even a link to an extraction script or a simple data dictionary. Even less common is documentation of sampling and counting methods. This is one reason why merging data sets and using third-party data yields questionable results: the data developers probably counted differently from the way your team would. If their assumptions aren't documented, then you can't tell the quality or relevance of the data -- but you can be very sure that it will add variability. This is a key part of why modeling results are often so hard to reproduce.
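Until better guidance exists, even a lightweight, structured record of how the counting was done goes a long way. The sketch below is a hypothetical example of the minimum provenance you'd want attached to a "green jobs" data set; every field value is invented for illustration:

```python
# Hypothetical provenance record -- the counting-method documentation that
# would let someone else judge whether your counts are comparable to theirs.
DATASET_PROVENANCE = {
    "name": "green_jobs_sample_v1",
    "sampling": "stratified by industry; company career sites only; 2021 Q3",
    "delimiting": "paid, ongoing postings from identifiable employers",
    "featurizing": "curated green-skill list v2; employer emissions-disclosure flag",
    "deciding": "weighted feature sum with threshold 0.5",
    "annotation": "two annotators per item; disagreements adjudicated by a third",
    "known_gaps": "no government listings or gig-platform postings",
}
```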

Data governance is data seen from the producer's perspective. It's the process of creating and enforcing best practices for creating reliable, well-documented data. If your colleagues or partners are unfamiliar with the term or don't know who owns data governance in your organization, then that's a sign that everyone is probably struggling with unreliable data quality. If counting is first and foremost classification, then data quality is classification quality. And that means that you need a solid team of people to focus on classification -- the foundation of data quality -- like the team that I built at LinkedIn.

Data Quality is classification quality.

Get access to as much domain expertise as you possibly can. Domain expertise is the key to data quality and to the success of your AI projects. You ignore domain expertise at your peril. The amazingly useful domain-independent techniques of machine learning, for example, only yield the best results when combined with domain expertise. Are your colleagues or partners having a hard time improving the accuracy and generalizability of your green jobs model? It's probably not because you need more data. You probably need more and better domain expertise, which is often encoded in taxonomies or labels.

In the optimal case, all of the steps in data creation are guided by domain expertise (and documented for better data governance). We leverage that expertise for more targeted stratified sampling, better delimitation of the entities that we want to model, improved featurization and tagging, etc. In practice, it's very easy to see that domain expertise shapes the data that we model -- and in very direct ways. So in data-driven modeling, more or less emphasis on data provenance and domain expertise in the process means more or less accuracy in the models.

Domain expertise is the key to data quality.

Almost all of our AI, mathematical modelling, and data analytics work is quite literally driven by the characteristics of the data sets that we use. So the call for a shift to data-centric AI is an important one. Good data depends quite directly on good counting. And counting is first and foremost classification, which in turn relies crucially on domain expertise. The key point here is that our methods and assumptions shape all of our data in very, very direct ways -- even for something as seemingly simple as counting. That's a key reason why we have to understand the foundations of AI very clearly.

With data, as with many other things, you get what you pay for: when you invest in clearer assumptions and more precise, better documented methods, you can count like you really mean it.

#ArtificialIntelligence, #MachineLearning, #Data, #DataScience

Many thanks to my diverse circle of pre-readers for important contributions and spirited discussion!

Read other articles in my Foundations of AI series here:
