Does the Number of Outcomes Affect Required Sample Size in "Shallow" Learning?
I wanted to take a minute to write this, not to relay findings, but actually to solicit help from anyone willing to provide it. I am finding myself in one of those "everyone knows that..." instances, where it seems so obvious it surely must be true, but for the life of me, I cannot find a citation that supports it. Thus the existential crisis and public appeals for help.
Background/Goal
While deep learning is becoming more commonplace, there are arguments about opacity and data requirements that make it infeasible in certain applications. In many applications, we don’t get thousands (or even hundreds) of observations for what we are trying to study; hence, I am focusing on “shallow” methods in this particular problem space (e.g., Naïve Bayes, linear/logistic regression, etc.). One assumption commonly made (i.e., an assumption I have made, which the handful of people I talk to seem to agree makes sense) is that the required sample size, or number of observations needed to achieve a given accuracy in a supervised learning problem, increases with the number of potential outcomes. That is, if I want to classify all the birds that come to my bird feeder, I would assume I need a larger sample of data to reliably classify 15 types of birds than to classify five types of birds (all other variables, including accuracy, being held constant).
An Example
To try and explain the prima facie case for this (and to let smart people poke holes in it), let's look at a simple classification/forecasting problem. Let’s say I have a very persnickety plant that burns if it gets too much direct sunlight but droops if it doesn't get enough. I get tired of the guesswork, put a sunlight sensor in the window by the plant, and ask somebody to build me an ML model that tells me whether the light level is good or too bright (a binary classification problem), so I can lug the thing away from the window based on the local weather forecast, if needed.
A natural question I would have (assuming daily readings) is, “what sample size of sensor readings do you need for this model to be ready?” Because I’m a social scientist and have had alpha = .05 beaten into my head, let’s assume I mean 95% accuracy when I say “ready.” You would probably speculate about a lot of idiosyncrasies and then likely give me a number we’ll call N2 (the sample needed to achieve 95% accuracy with two outcomes). Now let’s say I really want to save this plant from moldy roots and ask if you can make a 'Goldilocks' model that tells me whether the sunlight is too high, just right, or too low (leaving it damp for too long). You explain some stuff to me about how it's a slightly tougher problem, and then give me a number, N3 (the sample needed to achieve 95% accuracy with three outcomes). Now let’s say I want you to predict the exact amount of sunlight that will be received in that window, to the nearest integer value, based on the weather forecasts. That’s a continuous variable with a likely range of about 100 integer values based on previous measurements. Of course, you question my sanity and the practicality of (and need for) such a thing given the $25 price tag on the plant, grumble, and then give me a number, N100 (the sample needed to achieve 95% accuracy over the approximately 100 possible outcomes).
The assumption I would make, and I think many would, is that N2 < N3 < N100.
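For what it's worth, the assumption is at least cheap to poke at empirically. Below is a minimal sketch (assuming scikit-learn, synthetic data from make_classification, and logistic regression standing in for any shallow learner) that sweeps the number of classes and records the smallest training-set size that first clears 95% held-out accuracy. Every setting in it (the generator parameters, the step size, the 95% target) is an arbitrary choice of mine for illustration, not anything drawn from the literature.

```python
# A toy sketch, not evidence: on synthetic data, find the smallest training-set
# size at which a simple "shallow" model (logistic regression) first reaches
# 95% held-out accuracy, and see how that size moves as the number of classes
# grows. All settings below are arbitrary assumptions for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def smallest_n_for_target(n_classes, target=0.95, seed=0):
    # One synthetic pool per class count, with a fixed held-out test set.
    X, y = make_classification(
        n_samples=12_000, n_features=5, n_informative=5, n_redundant=0,
        n_classes=n_classes, n_clusters_per_class=1, class_sep=2.0,
        random_state=seed,
    )
    X_test, y_test = X[-4_000:], y[-4_000:]
    X_pool, y_pool = X[:-4_000], y[:-4_000]
    for n in range(20, len(X_pool) + 1, 20):   # grow the training set
        model = LogisticRegression(max_iter=1_000).fit(X_pool[:n], y_pool[:n])
        if accuracy_score(y_test, model.predict(X_test)) >= target:
            return n                           # first N clearing the target
    return None                                # target never reached on this pool

for k in (2, 3, 5, 10):
    print(f"{k:>2} classes -> smallest N at 95% accuracy: {smallest_n_for_target(k)}")
```

If the assumption holds for this kind of toy setup, the printed N values should rise with the class count; a single run on one synthetic dataset obviously proves nothing either way, which is exactly why I am after citations rather than anecdotes.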
(Micro) Lit Review
There are a variety of articles out there that discuss heuristics or rules of thumb for the sample sizes required for various tasks, although they don’t all use the same variables, some are entirely invariant to certain variables, and the answers range wildly. Most importantly, none of the ones I’ve seen thus far include the number of outcomes. At a very basic level, it seems one could appeal to the Condorcet Jury Theorem, where the more jurors there are voting on a binary decision, the higher the likelihood of the correct decision being made, provided the theorem's assumptions hold (Page, 2018). If this binary outcome is expanded into a series of binary outcomes (or bits) for more complicated decisions, one might assume that the theorem still holds, and that as the number of outcomes increases, the number of jurors/votes (or observations) must also increase in kind.
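To make the jury-theorem intuition concrete, the binary version is easy to compute directly: the sketch below simply evaluates the binomial probability that a simple majority is correct as jurors are added, with a per-juror accuracy of 0.6 assumed purely for illustration. Whether this extends cleanly to a stack of binary decisions standing in for a multi-class outcome is precisely the leap I am unsure about.

```python
# Condorcet Jury Theorem, binary case: with independent jurors who are each
# correct with probability p > 0.5, the probability that a simple majority is
# correct climbs toward 1 as jurors are added. p = 0.6 is an assumed value.
from math import comb

def majority_correct(n_jurors, p):
    # P(strictly more than half of the jurors are correct); odd n avoids ties.
    return sum(comb(n_jurors, k) * p**k * (1 - p)**(n_jurors - k)
               for k in range(n_jurors // 2 + 1, n_jurors + 1))

for n in (1, 5, 15, 51, 201):
    print(f"{n:>3} jurors, p = 0.6 -> P(majority correct) = {majority_correct(n, 0.6):.3f}")
```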
Conversely, Abramovich & Pensky (2019) have studied classification with many outcome classes, high dimensionality, and small samples, and found that in some instances having a larger number of outcomes was actually helpful, in that it aided feature engineering by identifying features that are relevant across all outcomes. However, they do acknowledge that having more outcome classes may require more features; elsewhere in the literature, it is more widely accepted that an increase in features does require an increase in sample size.
The curse of dimensionality is a well-documented phenomenon, sometimes summarized with the heuristic that accuracy scales with the ratio of the sample size to the number of dimensions, or Accuracy (A) = N/L, where N is the sample size of the training set and L is the number of dimensions or features. Thus, to get the required sample size for a desired level of accuracy and number of features, one would use N = L*A. The chart below shows this relationship, with the sample size on the Y axis, the number of features (L) on the X axis, and different lines for different levels of desired accuracy.
Subramanian & Simon (2013) asserted a simpler rule of thumb: one should have a sample size ten times the number of features, or N = 10*L. This, of course, yields a linear function of L in which N is invariant to the desired accuracy, as shown below.
Finally, Guyon, Makhoul, Schwartz, & Vapnik (1998) state that the test set size should be inversely proportional to the error rate, with a rule of thumb of N = 100/p, where p is the error rate. We can easily find the error rate from accuracy, since accuracy is simply 1 - p. Plotted on the same axes, this gives a flat line: N is invariant to L, as shown below.
One should note that, given these three sources, for 95% accuracy in my Goldilocks example (i.e., too low, just right, or too high sunlight), with say 5 features (e.g., cloud cover, humidity, etc.), I would need a sample of at least 50, 475, or 2,000 cases, depending on which rule of thumb is employed. Not only does this raise some questions about our theoretical understanding, but it also has practical implications, as I would need to wait somewhere between less than two months and five and a half years of daily readings to be able to reliably predict whether my plant should be pulled away from the window.
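For transparency, here is the back-of-the-envelope arithmetic behind those three numbers as a short sketch. I am reading the curse-of-dimensionality rule as taking accuracy expressed as a percentage (A = 95), since that is what produces the 475 figure; the month conversion assumes one sensor reading per day.

```python
# The three rules of thumb evaluated for the Goldilocks case:
# L = 5 features, desired accuracy = 95%, so error rate p = 0.05.
L = 5
accuracy = 0.95
p = 1 - accuracy

rules = {
    "N = L*A (curse of dimensionality)": L * (accuracy * 100),  # A as a percentage
    "N = 10*L (Subramanian & Simon)":    10 * L,
    "N = 100/p (Guyon et al.)":          100 / p,
}

for name, n in rules.items():
    # One reading per day, ~30.4 days per month on average.
    print(f"{name:<35} -> {n:>6.0f} readings (~{n / 30.4:.1f} months of daily data)")
```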
Lingering Questions
Given all of this, I have some questions that remain to be answered. My hope is that somebody can answer them, or even better, provide some citations that can be used in the future. I would use such references both to address at least some of the normative questions we're trying to tackle and to help with practical assessments of future work.
- Is there any research out there that can substantiate or refute the assumption that a larger sample or test set is required for a larger set of outcomes?
- Are there any more recently developed rules of thumb for determining sample/test set size based on a variety of factors?
References
Abramovich, F. & Pensky, M. (2019). Classification with many classes: Challenges and pluses. arXiv preprint arXiv:1506.01567v4.
Guyon, I., Makhoul, J., Schwartz, R. & Vapnik, V. (1998). What size test set gives good error rate estimates? IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 52-64.
Page, S.E. (2018). The model thinker: What you need to know to make data work for you. Basic Books.
Subramanian, J. & Simon, R. (2013). Overfitting in prediction models - is it a problem only in high dimensions? Contemporary Clinical Trials, 36, 636-641.
Comments
Ph.D., CSSLP, Senior Software Engineer, Sonalysts, Adjunct Professor of Computer Science:
Excellent questions, Steve. You are priming yourself to turn this into a citation for itself! I say go for it and let's find the answer! Here are my two cents anyway. It might help to separate the discussion of discrete vs. continuous outcome classification examples, since those could lead to two different analyses in terms of sample size determination. As far as rules of thumb go, when discussing discrete outcomes, make sure the base case for training from a sample is included in the discussion: the absolute smallest sample size possible for training would have to be N (i.e., all N outcomes have actually occurred, to allow a classification of each outcome to be possible). So your N2 example would require a minimum sample size of 2 to train, your N3 requires a minimum sample size of 3 to train, N100 a sample size of 100 (again, assuming discrete and not continuous outcomes), and so on. All minimum sample sizes are "anecdotal" at that minimum size, obviously, but the curve of improving margin of error and confidence levels as the sample size increases must necessarily start from those floor numbers. Just make sure any citation you find (or decide to write yourself, wink wink) covers that base case. :-)
LSE at CBP Program Management Office Directorate (PMOD):
Steve, this is a very thought-provoking inquiry. I have no paper to support or contradict the arguments made (granted, this is with less than an hour of reading your inquiry and doing a cursory search of some literature myself), but I would tend to assume (personal bias) the relationship in the first graph you outlined, i.e., holding accuracy fixed, N would increase as L increases. I did find one paper of interest that is more of a literature review, titled “A survey of high dimension low sample size asymptotics” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6124695/#!po=0.549451). I will note this paper is generally focused on biomedical applications as they relate to genetics and the use of Principal Component Analysis (PCA), but it does mention applications to big data and has a number of cited sources to reference and expand upon. I will continue to give this some thought and, if I find anything, be sure to let you know.
Digital Nomad:
I can see some good and bad in your assumption. Are you concerned about the possibility that the data will be overfit? Did you start with an analysis to determine the difference between how many birds did come to your bird feeder vs. the number of different kinds of birds that could have come?