Machine Learning
What is machine learning, and how does it relate to science, Bayesian inference and statistics?
Science is the systematic study of the universe—through observation and experiment—in the pursuit of knowledge that allows us to generalise. Science is essentially Bayesian inference (Sewell, 2012). This means that, in its purest sense, the application of science involves making assumptions (in the form of prior probabilities), gathering data and applying Bayes’ theorem.
Machine learning is an area of artificial intelligence concerned with the study of computer algorithms that improve automatically through experience. In practice, machine learning is generally a practical approximation of Bayesian inference, justified because the techniques are simpler and good enough. In other words, machine learning can be viewed as an attempt to automate ‘doing science’. The practical automation of science involves gathering data and applying a machine learning algorithm with the correct inductive bias. The task of choosing a model with the correct inductive bias is known as model selection. Note that the no free lunch theorem for supervised machine learning (Wolpert, 1996) showed that in a noise-free scenario where the loss function is the misclassification rate, in terms of off-training-set error, there are no a priori distinctions between learning algorithms. In other words, averaged over all possible data sets, all supervised machine learning algorithms are equivalent.
Since science is essentially Bayesian inference, Bayesian model selection (Jeffreys, 1939) is, at least in theory, the optimal way of choosing the best model. In principle, all we need is the following equation.
P(model|data) ∝ P(model) × P(data|model), i.e. posterior ∝ prior × likelihood
Note that we are only interested in the relative probability of different hypotheses (models). It is optimal to take an average of common parameter estimates or predictions across all the models, with each model weighted by its posterior probability. We have simply formalised the fact that a surprising result requires more evidence. The prior is our subjective assessment of how surprising a model is. This is the philosophy that should drive model selection. In practice, it is not always feasible to apply Bayesian model selection exactly, but an approximation should take place implicitly. As we shall see, there are many tools in the machine learning toolbox.
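As a concrete illustration, here is a minimal Python sketch of this idea, assuming a handful of hypothetical candidate models for which we can evaluate the likelihood of the observed data; the model names, prior probabilities, likelihood values and predictions are made up purely for illustration.

```python
import numpy as np

# Hypothetical candidate models (names and numbers are illustrative only).
models = ["constant", "linear", "quadratic"]

# Prior: our subjective assessment of how surprising each model is.
prior = np.array([0.5, 0.3, 0.2])

# Likelihood of the observed data under each model, P(data | model),
# e.g. obtained by integrating out (or optimising over) each model's parameters.
likelihood = np.array([1e-6, 5e-4, 2e-4])

# Bayes' theorem: the posterior is proportional to prior times likelihood.
unnormalised = prior * likelihood
posterior = unnormalised / unnormalised.sum()

# Each model's prediction for some new input (again, illustrative values).
predictions = np.array([0.0, 1.2, 1.5])

# Bayesian model averaging: weight each model's prediction by its posterior.
bma_prediction = np.dot(posterior, predictions)

for name, p in zip(models, posterior):
    print(f"P({name} | data) = {p:.3f}")
print("Model-averaged prediction:", round(bma_prediction, 3))
```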
Machine learning algorithms may be formulated in terms of three parts:
- Loss function A function, applied to a single data point, of a prediction and the desired output, designed to penalise the error.
- Cost function Typically the sum of loss functions over the training set plus a model complexity penalty (regularization).
- Optimisation algorithm Uses training data to find a minimum of the cost function. Typically gradient descent or stochastic gradient descent.
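To make these three parts concrete, below is a minimal sketch in Python, assuming a linear model with squared-error loss, an L2 (ridge) complexity penalty and batch gradient descent; the synthetic data and hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y is roughly a linear function of x plus noise.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def loss(prediction, target):
    """Loss function: squared error for a single data point."""
    return (prediction - target) ** 2

def cost(w, X, y, lam=0.1):
    """Cost function: mean loss over the training set plus an L2 penalty."""
    return np.mean(loss(X @ w, y)) + lam * np.sum(w ** 2)

def grad(w, X, y, lam=0.1):
    """Gradient of the cost function with respect to the weights."""
    n = len(y)
    return 2 * X.T @ (X @ w - y) / n + 2 * lam * w

# Optimisation algorithm: batch gradient descent on the cost function.
w = np.zeros(3)
for step in range(500):
    w -= 0.1 * grad(w, X, y)

print("Estimated weights:", np.round(w, 2))
print("Final cost:", round(cost(w, X, y), 4))
```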
Machine learning algorithms may be categorised in at least a dozen ways, as the following taxonomy shows.
• Type of data
- Categorical features May need to be converted into numerical features, e.g. via one-hot encoding.
- Numerical features May be discrete or continuous.
• Processing
- In-memory processing The data set can be fully loaded into RAM.
- Out-of-memory processing The data does not fit into RAM. Incremental learning algorithms may be preferred.
• Model type
- Probabilistic Build a full or partial probability model.
- Non-probabilistic Find a discriminant/regression function directly.
• Type of reasoning
- Induction Model-based learning: reasoning from observed training cases to general rules, which are then applied to the test cases.
- Transduction Instance-based learning: reasoning from observed, specific (training) cases to specific (test) cases. For example, the k-nearest neighbors algorithm.
• Type of machine learning
- Supervised learning The algorithm is presented with training data that consists of many examples, each of which consists of inputs and the desired output, thus enabling it to learn a function. The learner should then be able to generalise from the presented data to unseen examples. Examples include classification and regression. In situations where there is a cost to labelling data, a method known as active learning may be used, where the learner chooses which data to label.
- Unsupervised learning The algorithm is presented with examples from the input space only and a model is fit to these observations. Examples include cluster analysis and anomaly detection.
- Semi-supervised learning The algorithm is presented with both labelled data (as in supervised learning) and unlabelled data (as in unsupervised learning).
- Reinforcement learning An agent explores an environment and at the end receives a reward, which may be either positive or negative. In effect, the agent is told whether it was correct or incorrect, but is not told how. Examples include playing a game of chess or Go.
• The manner in which the training data are presented to the learner
- Batch All of the data is given to the learner at the start of learning.
- Online The learner receives one example at a time, and updates its current hypothesis in response to each new example. A type of incremental learning.
• Task
- Classification May be binary or multiclass.
- Regression Real-valued targets (generalises classification).
• Classification model type
- Generative model Defines the joint probability of the data and latent variables of interest, and therefore explicitly states how the observations are assumed to have been generated.
- Discriminative model Focuses only on discriminating one class from another.
• Depth of model
- Shallow Between input and output, the data is transformed at most twice.
- Deep Uses multiple layers and automatically discovers the representations needed from the raw data.
• Parameterisation
- Parametric A fixed, finite number of parameters; strong assumptions about the form of the underlying function.
- Nonparametric The effective number of parameters can grow with the data; weak assumptions about the form of the underlying function.
• Linearity
- Linear Assumes a linear relationship between inputs and output.
- Nonlinear Allows for a nonlinear relationship between inputs and output.
• Number of learning algorithms
- Individual learner A single learning algorithm.
- Ensemble learning Multiple learning algorithms with outputs combined.
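As an illustration of this last distinction, the following sketch combines the predictions of a few individual learners by simple averaging; it assumes scikit-learn is available, and the choice of base learners and the synthetic data set are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative only).
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Individual learners: each is a single learning algorithm.
learners = [LinearRegression(),
            KNeighborsRegressor(n_neighbors=5),
            DecisionTreeRegressor(max_depth=4, random_state=0)]

for learner in learners:
    learner.fit(X_train, y_train)

# Ensemble learning: combine the outputs, here by simple averaging.
individual = np.column_stack([m.predict(X_test) for m in learners])
ensemble = individual.mean(axis=1)

for m, preds in zip(learners, individual.T):
    print(type(m).__name__, "MSE:", round(np.mean((preds - y_test) ** 2), 4))
print("Ensemble MSE:", round(np.mean((ensemble - y_test) ** 2), 4))
```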
Regardless of the type, below is a list of desirable features of a machine learning algorithm. Naturally, trade-offs may be necessary.
- Complexity First and foremost, the complexity of our model should be matched to the amount and structure of the data, so that it neither underfits nor overfits.
- Simplicity With the above in mind, simple solutions are favoured over complicated ones—we prefer explanatory power over a black box algorithm.
- Stability We prefer a robust algorithm, one that is not overly sensitive to changes in parameters.
- Convergence Many algorithms are iterative, and we require them to converge in finite time.
- Scalability If we have large complex data sets, we need an algorithm that scales (in terms of running time and space) when confronted with a large number of training examples, input features and/or test examples.
What makes machine learning distinct from, say, statistics or econometrics? There is no distinct demarcation, but the assumptions and goals tend to differ. In essence, the spirit of machine learning is best served by models that make weak assumptions concerning the dependency between the model inputs and output. For example, machine learning is embodied by nonparametric nonlinear models. In terms of goals, machine learning generally concerns formulating the process of generalisation by searching for the best model. In contrast, the statistical community generally assumes that the data are generated by a given stochastic data model, and is often mostly concerned with validating that model (Breiman, 2001). Econometrics, meanwhile, tends to revolve around multiple linear regression, a model with strong assumptions. Although even linear regression makes up an important and well-understood part of machine learning, it does not fully capture what typifies the paradigm.
How is machine learning applied in practice? Naturally, one should use as much relevant training data as possible. However, for example in the case of forecasting time series, the dynamics of the time series could change over time, leading to a structural break, in which case recent data will be more informative than data from the distant past. The key to choosing a suitable model, or class of models, is to use domain knowledge to select a model of optimal complexity for the given data set. You want a model that doesn’t underfit or overfit the training set, but generalises well when presented with new data. In general, choosing one’s model space and model selection are both difficult, whilst parameter selection is relatively straightforward (though it is often computationally intensive). When dealing with time series, a popular technique is to use a sliding window training set with a (later in time) validation set for parameter selection, and a previously unseen final out-of-sample test set (later still in time). Though note that you only have one shot at testing your model on the test set if you want an unbiased estimate of its future performance.
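By way of illustration, the sketch below sets up such a split for a univariate time series: sliding training windows, a later validation window for parameter selection, and a final, later-still test set used only once; the series and window lengths are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic time series, ordered in time (values are illustrative only).
series = np.cumsum(rng.normal(size=1000))

train_len, val_len, test_len = 500, 100, 100

# Final out-of-sample test set: the most recent observations, used only once.
test = series[-test_len:]

# Validation set: immediately precedes the test set, used for parameter selection.
val = series[-(test_len + val_len):-test_len]

# Sliding training windows drawn from the data before the validation set.
history = series[:-(test_len + val_len)]
window_starts = range(0, len(history) - train_len + 1, 50)
training_windows = [history[s:s + train_len] for s in window_starts]

print("Number of sliding training windows:", len(training_windows))
print("Train/validation/test lengths:",
      len(training_windows[0]), len(val), len(test))
```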
To conclude, machine learning is the ideal tool for doing science, as it enables us to formulate the process of generalisation by searching for the best model.
References
Breiman, L. (2001, August). Statistical modeling: The two cultures. Statistical Science, 16 (3), 199–231. https://doi.org/10.1214/ss/1009213726
Jeffreys, H. (1939). Theory of Probability. Oxford: Clarendon Press. (Third ed. (1998, August). Oxford: Oxford University Press.)
Sewell, M. (2012, April). The demarcation of science. (Young Statisticians’ Meeting, Cambridge, 2–3 April 2012.)
Wolpert, D. H. (1996, October). The lack of a priori distinctions between learning algorithms. Neural Computation, 8 (7), 1341–1390. https://doi.org/10.1162/neco.1996.8.7.1341