DATA SCIENCE IN AGRITECH: BEING APPROXIMATELY RIGHT OR EXACTLY WRONG
Michael C. Rubin
A Discussion about Bayesian Inference and Classical Statistics in the Machine Learning Era
1 Introduction
1.1 Machine Learning in Agriculture
We need to produce food for 10 billion people while reducing the net carbon output to almost zero. The mission for agriculture in the next decades sounds like a mission impossible, and indeed, it is a very hard task. But big challenges have always been big opportunities and have given birth to great businesses. No wonder that we see hundreds of Agritech start-ups[1], all with similar plans – use Big Data and Machine Learning to make agriculture more efficient. This is certainly a good and needed trend; however, relatively little has been discussed about the practicability of the classical statistical modelling traditionally used in the agronomic sciences. Machine Learning, Artificial Intelligence and Data Science are thereby often used synonymously, as an extension (or automation) of classical statistics as it is traditionally used in Agronomy.
This article discusses the limitations of classical statistics in agriculture, given the structure of the data typically available. Having data in the quantity, structure and quality needed to run accurate statistical models is often a big challenge in agriculture. Further, I suggest an alternative approach, which can be more accurate under practical constraints. The discussion is based on the analysis of associations between variables, i.e. regression models.
This paper has no ambition to be exhaustive on the mentioned topics. Rather, I intend to launch a discussion on alternative approaches.
1.2 Statistics, Machine Learning – is this not the same anyway?
Well, if you think architecture and sand-castle building are the same, then the answer is probably yes. In that case, you should stop reading at this point and devote your time to other things. Otherwise, here are some subtle but important differences.
The goal of statistics is to find associations and differences in a given set of data. In the case of regression, the goal is to find the line which best fits the observations. This line is described by its parameters, the intercept and the slope coefficients β. Once you have found this association, the job of statistics is done. Of course, the estimated regression parameters are then often used to predict the future based on new data, but this is actually not the strength of statistics, which often has poor generalization capacity. More on this topic later. Statistics works with relatively limited sets of data and therefore makes some assumptions about their true distribution, like normality of the errors.
Machine learning (henceforth ML), on the other hand, aims at making predictions. It is based on statistical learning theory, which starts with a set of n data points, each of which is described by a set of values we call features (x); in supervised learning, these features are mapped by some function to the target value y. The algorithm tries to learn this mapping function by applying a loss function, which evaluates how poorly the model is predicting, and then running an optimization process to minimize this loss. ML does not make assumptions about the data and their underlying distribution and usually works with much larger data sets and many more features.
One particular problem in all of Data Science is that models can invent unreasonably complex functions which almost perfectly represent the data at hand, including the noise, but then perform poorly on new data (i.e. in prediction). This is called the overfitting problem, and this is where the rubber meets the road. Machine Learning, as opposed to statistics, has methods to deal with this issue and optimize generalization performance rather than pure data fitting[2]. In short, Statistics and Machine Learning are related, but not the same. While statistics is strong in explaining associations between individual variables, Machine Learning's strength is to make accurate predictions which can be generalized[3].
2 The Frequentist Regression
The frequentist or classical linear ordinary least squares regression (henceforth OLS regression) is probably the one you are familiar with from school: the model assumes that the dependent variable (y) is a linear combination of parameters (β) multiplied by a set of predictor variables (x), plus some error term (ε). To fit the model to a set of data, we need to find the coefficients β that best explain the data. In OLS regression this means minimizing the residual sum of squares (RSS), i.e. the total of the squared differences between the known values (y) and the predicted model outputs (ŷ). This problem has a closed-form solution, so we can solve for β directly and obtain the coefficients which give us the best estimate for the observed data. Mathematically:

y = X\beta + \varepsilon, \qquad \mathrm{RSS}(\beta) = (y - X\beta)^{\top}(y - X\beta), \qquad \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y
What this formula gives us is the Maximum Likelihood Estimator (MLE), the parameter value that makes the observed data most likely under the model's assumptions, which is the main mechanism for finding statistical parameters in frequentist statistical modelling[4].
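As a minimal illustration of the closed form (purely synthetic data, NumPy assumed available), the whole estimation fits in a few lines:

```python
import numpy as np

# Illustrative, synthetic data: 100 observations and 3 explanatory variables
# (think temperature, rainfall, fertilizer rate) against a yield-like response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.column_stack([np.ones(len(X)), X])      # add an intercept column
true_beta = np.array([2.0, 0.5, -1.0, 3.0])
y = X @ true_beta + rng.normal(scale=0.5, size=len(X))

# Closed-form OLS / maximum likelihood estimate: beta_hat = (X'X)^-1 X'y
# (np.linalg.solve avoids forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to true_beta, up to noise
```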
So, what is the problem with linear regression in the first place? Why don't we use it all the time? It turns out that linear regression is suitable and very convenient for modelling under certain constraints and conditions. These, so goes my hypothesis, often do not hold in agricultural practice. In the following, I discuss some of these limitations.
2.1 Large Datasets and Computational Complexity
The above closed-form equation is very simple for univariate cases and when the number of observations N is relatively low. However, in agronomy there are few cases where a dependent variable (y) is affected by only a single independent variable (x). As the science is complex and still relatively unexplored, we need to test y's, such as yield or fruit growth, against dozens or hundreds of possible explanatory variables x, for which tens of thousands of observations are needed. Forming and inverting the (X'X) matrix has a computational cost that grows cubically with the number of features and linearly with the number of observations, and can easily run into billions of operations once we have tens of thousands of observations and hundreds of features[5]. Further, such large calculations can also cause memory problems on any ordinary machine. Generally, the closed-form approach is considered practical only up to a few thousand observations[6]. Otherwise, the method of gradient descent should be used as the optimization algorithm. For the strictly linear least-squares problem the loss is still convex, so gradient descent converges towards the global optimum, but only iteratively and approximately; for the non-linear problems discussed below, it no longer guarantees a global optimum, only a local one. Finding the global optimum, however, is the essence of Maximum Likelihood Estimators. So it is at least questionable whether we should use this procedure.
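A sketch of the gradient-descent alternative (learning rate and iteration count are illustrative, untuned choices):

```python
import numpy as np

def gradient_descent_ols(X, y, lr=0.01, n_iter=2000):
    """Minimize the residual sum of squares by gradient descent.

    Avoids the expensive matrix inversion of the closed form; each step
    only needs matrix-vector products, so it scales to large data sets."""
    n, d = X.shape
    beta = np.zeros(d)                       # arbitrary starting point
    for _ in range(n_iter):
        residuals = X @ beta - y
        grad = 2.0 / n * (X.T @ residuals)   # gradient of the mean squared error
        beta -= lr * grad
    return beta

# Usage (with X including an intercept column and y the observed responses):
# beta_hat = gradient_descent_ols(X, y)
```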
2.2 Non-Linearity
Non-linear OLS regressions generally do not have a closed-form solution. This means that we can no longer compute them directly and have to resort to gradient-descent optimization, even for small data sets. As discussed above, this no longer yields a unique solution; there may be multiple minima. As one has to initiate the model with 'random parameters', it is almost certain that different runs will result in different solutions. In other words, the model starts becoming numerically unstable. Additionally, in practice, results are often biased, even under global-minimum conditions.
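A small, entirely synthetic sketch of this instability: fitting y = a·sin(b·x) by gradient descent from different starting points will typically end in different local minima (the model, data and step sizes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 2.0 * np.sin(1.5 * x) + rng.normal(scale=0.2, size=x.size)   # 'true' b is 1.5

def fit(a, b, lr=1e-3, n_iter=5000):
    """Gradient descent on the mean squared error of y_hat = a*sin(b*x)."""
    for _ in range(n_iter):
        err = a * np.sin(b * x) - y
        grad_a = 2 * np.mean(err * np.sin(b * x))
        grad_b = 2 * np.mean(err * a * x * np.cos(b * x))
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, np.mean((a * np.sin(b * x) - y) ** 2)

for a0, b0 in [(1.0, 1.4), (1.0, 0.3), (1.0, 4.0)]:   # three different initializations
    print(fit(a0, b0))   # typically different parameter sets and losses
```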
A further problem you might face, especially when applying polynomial functions, is the risk of overfitting and non-generalizability. In particular, if your data sample is relatively small and not 100% unbiased (which is the norm in practice), you can be almost sure that your model overfits and performs poorly on new data. A good solution to this is to transform the feature space into a higher-dimensional and/or non-linear space and apply a linear regression model there. In other words, I transform the independent variables to a quadratic or polynomial scale (e.g. 2->4, 3->9, 4->16 and so on…), so that the relation between the observed data (y) and the features (x) on the new scale is still linear[7]. This has the advantage that the simpler linear procedures can be applied, and I can also control for overfitting via the regularization parameter (lambda). Popular methods are the kernel trick or using Neural Networks with several layers, which 'learn' the transformation[8]. On the other hand, interpretability will suffer, especially when using Neural Networks.
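A minimal sketch of this idea, assuming scikit-learn is available: the features are expanded polynomially and a linear model with an L2 penalty is fitted on the transformed space (Ridge's alpha plays the role of the lambda mentioned above; all numbers are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# Synthetic example: a non-linear response modelled by a *linear* regression
# on polynomially transformed features, regularized against overfitting.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=(60, 1))
y = 1.5 * x[:, 0] ** 2 - 2.0 * x[:, 0] + rng.normal(scale=1.0, size=60)

model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),  # x -> x, x^2, x^3
    Ridge(alpha=1.0),                                  # alpha = lambda, the regularizer
)
model.fit(x, y)
print(model.predict(np.array([[2.0], [4.0]])))
```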
2.3 Data Scarcity, Normality and the Central Limit Theorem
Contrary to popular belief, OLS regression does not require the normality assumption in every case. In cases where the data set is "sufficiently large", this assumption can be relaxed. More precisely, as β is a weighted sum of Y, if we consider repeated sampling from the population, then for large sample sizes the distribution (across repeated samples) of the coefficients β follows a normal distribution. This is a consequence of the Central Limit Theorem[9]. In other words, we need either the errors to be normally distributed or the sample size to be large enough. If neither of these conditions holds, we cannot use OLS regression: the error-term distribution would distort the estimate and yield a wrong result, as the example in Figure 1 shows. But how large is "sufficiently large"? For the Central Limit Theorem to start taking effect and healing the above problem of non-normality, a sample size of about 30 is considered sufficient for populations with a symmetric and unimodal shape, and between 50 and 100 for highly skewed or multimodal populations. This quantity of data is often available, also in agriculture. Still, we have to remember this when dealing with small data sets, as we can almost never assume data to be normally distributed in agronomy.
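A small simulation (illustrative only) makes the point: even with strongly skewed (exponential) errors, the OLS slope estimated across many repeated samples concentrates around the true value, and its spread shrinks as n grows:

```python
import numpy as np

rng = np.random.default_rng(42)

def slope_estimate(n):
    """One OLS slope from a sample of size n with skewed (exponential, centred) errors."""
    x = rng.uniform(0, 10, n)
    y = 2.0 + 3.0 * x + (rng.exponential(scale=2.0, size=n) - 2.0)
    return np.polyfit(x, y, deg=1)[0]          # polyfit returns [slope, intercept]

for n in (5, 30, 100):
    slopes = np.array([slope_estimate(n) for _ in range(5000)])
    print(n, round(slopes.mean(), 3), round(slopes.std(), 3))
    # the mean stays near the true slope of 3, the spread shrinks roughly with
    # 1/sqrt(n), and a histogram of `slopes` looks increasingly normal
```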
2.4 Maximum Likelihood Estimator, Confidence and Overfitting
However, there is another aspect to the sample size problem. In theory, two data points are enough to draw a straight line. However, it is intuitive that, given the 'random' nature of agronomic data, chances are high that your slope is partly determined by random noise and does not represent the true underlying correlation between x and y. Hence, we certainly need more data to gain more confidence. How much exactly depends on 'how wrong' you accept to be. Traditionally, the confidence of a regression is expressed by the p-value, which tells us how often we would observe a correlation as strong as the one we see purely by chance, if in truth there were no correlation. Scientists typically want this probability to be <5% to accept the result. Think for a moment what this means. Let's assume we have a quite steep slope of y = 3*x and p = 0.05, i.e. we are 95% confident that the true slope β is > 0. In that case, we accept the hypothesis and consider β = 3 to be the true relation. However, we did not prove with 95% confidence that the slope is indeed β = 3; rather, we are just 95% confident that it is not smaller than or equal to 0, with 3 being the most likely among many possible values. This value 3 is referred to as the Maximum Likelihood Estimate. But there is still a very large range of possible true values, between 0.001 and 2.999 or between 3.001 and infinity, which combined are much more likely than the tiny band around the MLE.
What does this mean? Is every linear regression wrong? Don't worry, it's not all that bad. But there are important considerations here. Here too, the Central Limit Theorem has a healing effect: as the sample size gets larger, the spread of the estimate gets smaller and smaller and eventually approaches a single value as n goes to infinity. Figure 2 shows this effect. Put differently, as we get larger sample sizes, our single-value result becomes increasingly correct. Formally:

\operatorname{SE}(\hat{\beta}) \propto \frac{\sigma}{\sqrt{n}}, \qquad \hat{\beta} \to \beta \ \text{ as } n \to \infty
The healing effect of the Central Limit Theorem, however, only kicks in at the rate of the square root of n, so the spread collapses towards a single value only for really large data sets. As a consequence, in practice we may have to live with the fact that our parameter estimate comes with a large spread and thus large uncertainty.
2.5 Using OLS Regression in Machine Learning?
We can conclude that for strictly linear functions, OLS can be used up to data sets of several thousand observations before it gets computationally too complex. However, the MLE result only becomes expressive with much larger data sets; otherwise the result's spread is large, and a single value tells only half the story. With very little data, we additionally need to assume normality, which is not always justified in agronomy. Such models based on 'small data' have very little predictive power, given the complexity of the subject and the overfitting problem. If we deal with non-linear functions, which is almost always the case, the mentioned problems worsen. OLS regression is certainly a good tool within a relatively narrow range of applications. However, it should not be used as a 'one-size-fits-all' tool and blindly fed into machine learning algorithms.
3 Bayesian Regression as an Alternative
3.1 Bayesian vs Frequentist – a short Excursion into Philosophy
Bayesian inference is an alternative way of doing statistical inference to the frequentist approach. The difference lies mainly in the way we see the concept of randomness. An exhaustive treatment of this topic goes into deep philosophical waters and lies beyond the scope of this paper, but here follows a short introduction to the interesting discussion[10].
Frequentists believe that sampling is infinite and decision rules can be sharp. Data come from a repeatable random sample. The underlying parameters, i.e. the population distribution, are fixed and remain constant over time. If there is no evident population distribution, we need to 'invent' one and assume, e.g., normality. It is a more rigid concept.
In the Bayesian world, unknown quantities are treated probabilistically, as random variables, and the state of the world can always be updated. Data are what is fixed in this world, i.e. each observed data point is a realization of a sample. Parameters are unknown, are described probabilistically, and will never be entirely known. It is a more flexible concept.
The Bayesian process starts out with an initial estimate, our prior belief, and as we gather more evidence, our model becomes less wrong. This is shown in Figure 3. Bayesian reasoning is a natural extension of our intuition: often we have an initial hypothesis, and as we collect data that either supports or disproves our ideas, we change our model of the world. Here follows the central equation, Bayes' rule, for parameters θ and data D:

P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
3.2 Bayesian Regression
The aim of Bayesian linear regression is not to find the single "best" value of the model parameters, but rather to determine the posterior distribution of the model parameters. The output y is generated from a normal (Gaussian) distribution characterized by a mean and a variance. The mean is the offset parameter b_0 plus the slope b_1 times x_i, and the variance is simply the square of the standard deviation, σ^2 (note that in matrix notation, σ^2 is multiplied by the identity matrix I). Hence, we have three unknown or hidden parameters, b_0, b_1 and σ^2, to be learned through Bayesian inference given the X and Y training data. Mathematically (note: univariate case to simplify notation)[11]:

y_i \sim \mathcal{N}\!\left(b_0 + b_1 x_i,\; \sigma^2\right)
From this follows the learning function over all n training examples. Note that I omitted the evidence P(Y | X) to simplify, as it does not affect the proportionality:

P(b_0, b_1, \sigma^2 \mid Y, X) \;\propto\; \prod_{i=1}^{n} P\!\left(y_i \mid x_i, b_0, b_1, \sigma^2\right) \times P(b_0, b_1, \sigma^2)
Here, the parameters of the prior distribution represent the initial beliefs and are hyperparameters of the model. They are set by domain knowledge. In agronomy, this is an excellent opportunity for an expert agronomist to bring their expertise into the model.
Let's stop and think about what this means. In contrast to OLS, we obtain a posterior distribution for the model parameters that is proportional to the likelihood of the observed data given the parameters, multiplied by the prior probability of the parameters, which encodes the agronomy expert's viewpoint. In other words, we can start with the expert's beliefs and, as we get more data, either correct the model or strengthen it, depending on how much the data agrees. Here follow some explicit advantages of Bayesian regression over OLS.
3.3 Uncertainty and Agronomic Decision-Making
First and foremost, the Bayesian model has the ability to express the uncertainty of its predictions. This can be very important in real-world decision making in agriculture. Imagine the following situation: the MLE predictor predicts that on a particular day there is no risk of a crop disease, hence no protective treatment is needed. However, the probability distribution of the prediction is very widely spread, almost uniform, with the likelihood of the MLE being only slightly higher than the rest. Taking the decision not to apply crop protection would bring a saving of several hundred dollars. However, skipping the treatment, if the prediction turns out to be wrong, would cost several tens of thousands of dollars. Aware of that situation, no expert would decide against the crop treatment: the relation between financial utility and risk is simply not good. Bayesian regression delivers the information needed to do exactly this kind of utility calculation. We have an explicit quantification of the uncertainty of the prediction and can weigh this risk against the costs and benefits of both options. With OLS, in contrast, one has just the 'blind' choice to either trust or not trust the model.
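A minimal sketch of such a utility calculation; the dollar figures are hypothetical, and p_disease_samples stands in for draws from the Bayesian model's predictive distribution:

```python
import numpy as np

# Hypothetical posterior predictive draws of the disease probability for a
# given day, as a Bayesian model would provide them. Placeholder values only.
rng = np.random.default_rng(7)
p_disease_samples = rng.beta(2, 8, size=4000)   # wide, uncertain distribution

cost_treatment = 300.0        # $ spent if we spray (hypothetical)
loss_if_untreated = 30000.0   # $ lost if disease strikes and we did not spray (hypothetical)

# Expected cost of each decision, averaged over the model's uncertainty
expected_cost_spray = cost_treatment
expected_cost_skip = float(np.mean(p_disease_samples)) * loss_if_untreated

print(f"spray: ${expected_cost_spray:.0f}, skip: ${expected_cost_skip:.0f}")
# Even if the single most likely outcome is 'no disease', the expected loss of
# skipping can far exceed the cost of treating once uncertainty is accounted for.
```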
3.4 Incorporating Expert Knowledge and Small Data Sets
In agronomy, we often face the situation that we have very little reliable data, but good expert knowledge is available. Farmers know their soils, and specialized agronomy consultants have very deep knowledge about particular problems. This is how agriculture has evolved so much in recent times.
If we introduce a frequentist model trained on small data sets, chances are that its predictions are distorted by noise and disagree with the expert. Conflicts and low credibility of the data model would be almost unavoidable.
A Bayesian model, however, allows the expert to incorporate their knowledge as a prior distribution. The model then starts learning from there and strengthens or adjusts its predictions as more data comes in. This allows for a more practical and smoother adoption of data science in agriculture.
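As a sketch of how this can look, assuming Gaussian noise with known variance and a Gaussian prior over the coefficients (all numbers are illustrative), the posterior can be computed in closed form:

```python
import numpy as np

def posterior(X, y, prior_mean, prior_cov, noise_var):
    """Gaussian posterior over regression coefficients (noise variance assumed known).

    posterior_cov  = (prior_cov^-1 + X'X / sigma^2)^-1
    posterior_mean = posterior_cov (prior_cov^-1 prior_mean + X'y / sigma^2)"""
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_prec + X.T @ X / noise_var)
    post_mean = post_cov @ (prior_prec @ prior_mean + X.T @ y / noise_var)
    return post_mean, post_cov

# Expert prior (illustrative): intercept around 2.0, response to the input around 0.3
prior_mean = np.array([2.0, 0.3])
prior_cov = np.diag([1.0, 0.1])

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 25)                       # only 25 observations
X = np.column_stack([np.ones_like(x), x])
y = 2.5 + 0.4 * x + rng.normal(scale=0.5, size=x.size)

mean, cov = posterior(X, y, prior_mean, prior_cov, noise_var=0.25)
print(mean)                    # pulled from the expert's prior towards the data
print(np.sqrt(np.diag(cov)))   # remaining uncertainty in each coefficient
```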
3.5 Machine Learning and Overfitting
One of the major challenges in machine learning is to prevent the model from overfitting. With small datasets, the model only learns the particularities of the given data and does not generalize to unseen data. ML models in agronomy are especially prone to overfitting. On the one hand, we have a lot of explanatory variables (x's), ranging from climate and soil conditions to fertilization and genetic and biological variables. In sharp contrast, we often have very limited observations: for dependent variables (y) like crop yield, each season represents one observation, i.e. we have to work with a very limited number of observations. Hence, there is no chance to heal the overfitting problem with the brute-force method of feeding the model so much data that the Central Limit Theorem kicks in to regularize.
However, we often do have domain knowledge from experts, gained from a much broader base of experience. This knowledge can be used to generalize our models in the absence of a sufficiently large dataset: the expert's beliefs act as a regularizer (the lambda term), balancing trust in the data against trust in human expertise.
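One way to make this connection explicit: with a zero-mean Gaussian prior on the coefficients and Gaussian noise, the most probable (MAP) estimate is exactly the L2-regularized (ridge) least-squares solution, with lambda given by the ratio of the noise variance to the prior variance:

\hat{\beta}_{\mathrm{MAP}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda\,\lVert\beta\rVert^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}

where τ^2 is the prior variance. A confident expert (small τ^2) implies a large lambda and more weight on prior knowledge; a vague prior lets the data dominate.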
3.6 Computational Aspects
As seen above, using very large data sets is a challenge due to limited computational resources. If a machine learning algorithm requires the complete dataset to be available in memory for its computations, it will need a large amount of memory to scale to big-data applications. This won't work on normal machines. However, if a machine learning algorithm inherently supports incremental learning, we can load a subset of the data into memory for each increment of the computation. Bayesian learning can be used as exactly such an incremental learning technique, updating the prior belief whenever new evidence is available: we feed the evidence through the model piece by piece as it comes in and gradually refine the model.
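A minimal sketch of this incremental updating (function and variable names are illustrative, and the noise variance is assumed known): each chunk's posterior becomes the prior for the next chunk, so only one chunk ever needs to be in memory:

```python
import numpy as np

def update(prior_mean, prior_cov, X, y, noise_var):
    """One Bayesian update step: the returned posterior becomes the next prior."""
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_prec + X.T @ X / noise_var)
    post_mean = post_cov @ (prior_prec @ prior_mean + X.T @ y / noise_var)
    return post_mean, post_cov

rng = np.random.default_rng(11)
mean, cov = np.zeros(2), np.eye(2) * 10.0        # vague initial prior

for _ in range(100):                             # stream the data in small chunks
    x = rng.uniform(0, 10, 500)                  # only 500 rows in memory at a time
    X = np.column_stack([np.ones_like(x), x])
    y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)
    mean, cov = update(mean, cov, X, y, noise_var=1.0)

print(mean)   # converges towards the data-generating coefficients [1.0, 2.0]
```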
4 Conclusion
This article is not intended to advocate one approach over the other; I think both have their place. What I hope to have shown is that the classical approach, which is used in the overwhelming majority of cases, is not always the ideal one. Especially given the data limitations in agriculture and the potentially expensive wrong decisions that can follow, frequentist models can be too rigid. A Bayesian approach, which allows for the incorporation of human expertise and reports the model's uncertainty, can be a good alternative. It does not force us to think in black and white, i.e. that models are either right or wrong, but allows for the shades of grey in between.
The fact that this topic has received little attention so far in the Agritech community does not come as a surprise to me. What we currently see in the market is a very generic level of Machine Learning and Data Science application, which, in many cases, is just automated statistical modelling and, in the worst case, my apologies in advance, a mere catchword for fundraising. At that level it doesn't really matter which approach one uses; probably even descriptive statistics will do the job.
My boss Gottfried Pessl, founder of Pessl Instruments, one of the pioneering companies in digital agriculture, once told me: "In agriculture, we are currently running the 100 meters in 25 seconds, struggling to get the 20". At the time, I didn't realize how right Gottfried was. Looking some years ahead, I claim that we may need to think about how our models deal with uncertainty to get practically usable results, if we want to make our 15-second run. Bayesian inference can be an effective tool.
In the end, it's better to be approximately right than exactly wrong!
Sources:
[1] https://www.crunchbase.com/hub/agtech-companies#section-overview
[2] https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3
[3] MITx - 14.310x, Data Analysis for Social Scientists, Module 10
[4] https://www.quantstart.com/articles/Maximum-Likelihood-Estimation-for-Linear-Regression
[5] https://towardsdatascience.com/my-journey-into-machine-learning-class-5-regression-cb6f04006b29
[6] https://stats.stackexchange.com/questions/278755/why-use-gradient-descent-for-linear-regression-when-a-closed-form-math-solution?noredirect=1&lq=1
[7] https://www.futurelearn.com/courses/advanced-machine-learning/0/steps/49532
[8] MIT Course MITx - 6.86x
[9] https://pdfs.semanticscholar.org/3f2c/aa91466e0fea50beb8178a1d2c3af35cf16b.pdf
[10] https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading20.pdf
[11] https://wso2.com/blog/research/part-two-linear-regression