Machine Learning - Some basic definitions
Machine learning is a branch of computer science that studies the design and use of algorithms and models that learn patterns from data and then make predictions, without human intervention, when similar patterns appear in new data.
Methods
Machine learning algorithms are categorized into methods. These are the most important ones:
- Supervised learning algorithms are trained using past data with labeled examples to predict labels in future data. For example, we could have data points for flowers, each one labeled with the species it belongs to, along with other features such as petal size. The algorithm reads the input set and learns from it, updating the model. Then, the algorithm uses the information in the model to predict the species a new flower belongs to.
- Unsupervised learning is used on data that has no labels. The algorithms explore the data and find structure in it. For example, we could have data points for flowers. The algorithm would read the input set, learn from it and suggest a set of species for classifying flowers into groups.
- Reinforcement learning algorithms interact with their environment by producing actions and learning from the results. For example, drones may learn to fly by trial and error.
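The supervised case above can be sketched in a few lines. This is a minimal illustration only: the petal measurements and species names are made up, and the "algorithm" is the simplest possible one, 1-nearest neighbour.

```python
import math

# Toy labeled flower data: (petal length, petal width) -> species.
# Measurements and labels are made up for illustration.
train = [
    ((1.4, 0.2), "setosa"),
    ((1.3, 0.2), "setosa"),
    ((4.7, 1.4), "versicolor"),
    ((4.5, 1.5), "versicolor"),
]

def predict(point):
    """Supervised learning in miniature: predict the label of a new
    point from the labeled examples (here, 1-nearest neighbour)."""
    nearest = min(train, key=lambda ex: math.dist(ex[0], point))
    return nearest[1]

print(predict((1.5, 0.3)))  # a new flower with small petals -> "setosa"
```

A new flower with small petals lands close to the labeled setosa examples, so the model predicts "setosa".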
Problems
The type of label defines the problem, as follows:
- Categorical and discrete variables can take one of a limited number of possible values. Predicting categorical labels is called classification. For example, the species a flower belongs to.
- Continuous or real variables can take any real value. Predicting quantitative labels is called regression. For example, the petal width of a flower, taking into account its species and petal length.
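The regression example above can be made concrete with a simple least-squares fit. The petal measurements below are made up, and the model is the most basic one possible: a straight line predicting petal width from petal length.

```python
# Minimal regression sketch: fit petal width from petal length by
# ordinary least squares. The measurements are made up.
lengths = [1.4, 1.3, 4.7, 4.5, 5.1]
widths = [0.2, 0.2, 1.4, 1.5, 1.8]

n = len(lengths)
mean_x = sum(lengths) / n
mean_y = sum(widths) / n
# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, widths)) \
        / sum((x - mean_x) ** 2 for x in lengths)
intercept = mean_y - slope * mean_x

def predict_width(length):
    """Predict a continuous label (petal width) from a feature."""
    return slope * length + intercept

print(round(predict_width(4.0), 2))
```

Contrast this with classification: here the prediction is a number on a continuous scale, not a choice among categories.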
Features
In machine learning, statistical variables are called features.
Features can be of any of the following types:
- Numeric: describe a quantity as a number. Subtypes are continuous and discrete.
- Continuous: observations can take any numeric value in a range of real values. Examples include length, width and time.
- Discrete: observations can take any numeric value in a set of numeric values. A discrete variable cannot take the value of a fraction between one value and the next closest value in the set. Examples include number of flowers and number of passengers.
- Categorical: Describe a quality or characteristic. Subtypes are ordinal and nominal.
- Ordinal: observations can be ordered. Examples include t-shirt size (e.g. XL, L, M, S, XS) and satisfaction grade (e.g. high, medium, low).
- Nominal: observations cannot be ordered. Examples include species, sex, brand, etc.
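In practice, these feature types determine how data is prepared for an algorithm: numeric features can be used as-is, while categorical features are usually encoded as numbers first. The sketch below shows one common convention (the sizes and species values are made up): ordinal categories keep their order, and nominal categories get one-hot codes.

```python
# Ordinal feature: encode with integers that preserve the order.
sizes = ["S", "XL", "M"]
size_order = {"XS": 0, "S": 1, "M": 2, "L": 3, "XL": 4}
encoded_sizes = [size_order[s] for s in sizes]

# Nominal feature: no order exists, so use one-hot encoding instead
# (one binary column per category).
species = ["setosa", "versicolor", "setosa"]
levels = sorted(set(species))
one_hot = [[int(s == level) for level in levels] for s in species]

print(encoded_sizes)  # [1, 4, 2]
print(one_hot)        # [[1, 0], [0, 1], [1, 0]]
```

Encoding a nominal feature with plain integers would invent an order that is not in the data, which is why one-hot encoding is the usual choice there.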
A formal and detailed classification of statistical variables according to the nature of the information they represent is beyond the scope of this article. However, here is a valuable resource if you are interested: Statistical data types.
Algorithms
The most frequently used algorithms are listed below.
Supervised learning algorithms
- Linear algorithms for regression and classification (e.g. linear regression, Fisher's linear discriminant analysis (LDA), logistic regression, naive Bayes, Winnow, perceptron)
- Non-linear algorithms for regression and classification
- Linear and non-linear support vector machine (SVM)
- Learning vector quantization (LVQ)
- Classification and regression trees (CART)
- K-nearest neighbours (KNN)
- Neural networks
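One of the linear classifiers listed above, the perceptron, is small enough to sketch in full. The training set below is made up and linearly separable; the labels stand for "small petals" (-1) versus "large petals" (+1).

```python
# Minimal perceptron sketch on made-up, linearly separable data:
# (petal length, petal width) -> -1 or +1.
data = [((1.4, 0.2), -1), ((1.3, 0.2), -1),
        ((4.7, 1.4), +1), ((4.5, 1.5), +1)]

w = [0.0, 0.0]  # weights of the separating line
b = 0.0         # bias term

for _ in range(20):  # a few passes over the training set
    for (x1, x2), label in data:
        # misclassified if the sign of the score disagrees with the label
        if label * (w[0] * x1 + w[1] * x2 + b) <= 0:
            w[0] += label * x1  # nudge the line towards the mistake
            w[1] += label * x2
            b += label

def classify(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1

print(classify(1.5, 0.3), classify(4.6, 1.4))  # -1 1
```

On separable data like this the update rule is guaranteed to converge; real datasets are rarely so tidy, which is where the other algorithms in the list come in.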
Unsupervised learning algorithms
- Apriori
- K-means clustering
- Principal Component Analysis (PCA)
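K-means clustering, listed above, is also compact enough to sketch. The petal lengths below are made up; the point is that the algorithm recovers the two groups without ever seeing a label.

```python
# Minimal k-means sketch (k = 2) on made-up 1-D petal lengths.
points = [1.3, 1.4, 1.5, 4.5, 4.7, 5.1]
centroids = [points[0], points[-1]]  # naive initialisation

for _ in range(10):
    # assignment step: each point joins its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # update step: each centroid moves to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(sorted(round(c, 2) for c in centroids))  # [1.4, 4.77]
```

The two centroids settle on the small-petal and large-petal groups, which is exactly the "suggest a set of species" behaviour described in the unsupervised learning example earlier.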
Ensemble learning techniques
- Random forest
- AdaBoost
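The idea behind both ensemble techniques is combining many weak models into a stronger one. Here is that idea in miniature: three made-up decision "stumps" (single-threshold classifiers) vote on a label by majority, much as the trees in a random forest do.

```python
from collections import Counter

# Three made-up threshold classifiers ("stumps") over petal measurements.
stumps = [
    lambda length, width: "setosa" if length < 2.5 else "versicolor",
    lambda length, width: "setosa" if width < 0.8 else "versicolor",
    lambda length, width: "setosa" if length + width < 3.0 else "versicolor",
]

def ensemble_predict(length, width):
    """Combine the weak classifiers by majority vote."""
    votes = Counter(stump(length, width) for stump in stumps)
    return votes.most_common(1)[0][0]

print(ensemble_predict(1.4, 0.2))  # setosa
```

Random forest builds its trees on random subsets of the data and features; AdaBoost instead trains its weak models in sequence, weighting the examples the previous ones got wrong.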
A more comprehensive list of the algorithms available in the caret package for R can be found here.
Well, enough theory for today. In the following articles we will play with these algorithms using R.