Machine Learning Interview BuzzTerms
This article is a list of terms defined in a simple manner that enables easy recall. Feel free to use this list to refresh on keywords and definitions on your way to an interview. Or perhaps you are already a decent Data Scientist or Machine Learning expert, but you would like to revisit your foundations. Or you want to pass the time.
Share this article with anyone who might need it. Also, add some more definitions in the comment section to help others and show off your skills.
Neuron: In the world of deep learning and machine learning, a neuron can be described as a processing unit that is one of the fundamental building blocks of a neural network. It computes a weighted sum of its inputs and passes the result through an activation function.
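Here is a minimal sketch of that computation in Python (the sample values and the choice of a sigmoid activation are illustrative, not part of the definition):

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Single neuron: weighted sum of inputs plus bias, passed through a sigmoid."""
    z = np.dot(inputs, weights) + bias   # weighted sum
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

x = np.array([0.5, -1.2, 3.0])           # inputs
w = np.array([0.4, 0.7, -0.2])           # connection weights
print(neuron(x, w, bias=0.1))
```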
Weights: These are the strengths of the connections between neurons. They are the critical element in the storing of knowledge within neural networks.
Neural Network: A collection of neurons (processing units) connected in a manner that enables the retention of knowledge. Neural networks can also be described as parallel distributed processors.
Machine learning: The science of implementing computer algorithms that learn from data, where the resulting algorithm is applied to a task related to the data it learned from.
Deep learning: This is an area of machine learning where algorithms leverage several layers of neural networks to extract richer features from input data. Examples of deep learning techniques are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
Supervised learning: This is the training of a machine learning algorithm on data annotated with labels. The annotations are typically provided by a human annotator or an external system. The task of classification is an example of a supervised learning task.
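A minimal supervised-learning sketch with scikit-learn (the iris dataset and the logistic-regression classifier are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)     # features X, expert-provided labels y
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)                         # learn the mapping from features to labels
print(clf.predict(X[:3]))             # predict labels for new instances
```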
Unsupervised learning: Algorithms designed to tackle this type of learning have self-organizing characteristics built into them. These algorithms organize data based on patterns detected in the data, without the involvement of an expert system.
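A sketch of unsupervised learning via clustering (k-means and the random data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                     # unlabeled data
km = KMeans(n_clusters=3, n_init=10).fit(X)    # self-organizes the data into 3 groups
print(km.labels_[:10])                         # cluster assignment per point
```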
Semi-supervised learning: Machine learning algorithms that are semi-supervised train on both unlabeled and labeled data. Labeled data usually makes up a smaller proportion of the training dataset than unlabeled data.
Reinforcement learning: This is a type of machine learning technique that involves defined programs referred to as agents. An agent is placed in an environment and is governed by the notion of increasing rewards through interactions with that environment. The agent aims to accumulate rewards where possible; there is also a form of negative reward, or penalty. The agent's task is to improve its governing system over time so that it collects rewards and avoids penalties.
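A toy reinforcement-learning loop, assuming a simple multi-armed bandit environment and an epsilon-greedy agent (the reward probabilities and epsilon value are illustrative):

```python
import random

true_means = [0.2, 0.5, 0.8]     # environment: chance of reward per action (arm)
estimates = [0.0, 0.0, 0.0]      # agent's learned value estimates
counts = [0, 0, 0]
epsilon = 0.1

for step in range(1000):
    if random.random() < epsilon:               # explore a random action
        arm = random.randrange(3)
    else:                                       # exploit the best-known action
        arm = estimates.index(max(estimates))
    reward = 1 if random.random() < true_means[arm] else 0
    counts[arm] += 1
    # incremental update of the estimated value of the chosen action
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print([round(e, 2) for e in estimates])         # approaches true_means
```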
Batch Learning: This is a method of presenting training data to a machine learning algorithm's network. The accumulated training data is fed to the algorithm all at once, and the weights and biases of the network being trained are updated only after all of the training data has been fed forward.
Online Learning: This method of presenting training data to the network is carried out incrementally. The training data is split into groups referred to as mini-batches; once a mini-batch has been fed through the network, the network's weights and biases are updated, and the next mini-batch is fed forward. This process is repeated until all mini-batches have passed through the network, as in the sketch below.
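A sketch of mini-batch updates for a one-parameter linear model trained by gradient descent (the synthetic data, learning rate, and batch size are illustrative):

```python
import numpy as np

X = np.random.rand(1000, 1)                       # inputs
y = 3 * X[:, 0] + np.random.randn(1000) * 0.1     # targets with true slope 3
w, lr, batch_size = 0.0, 0.1, 32

for epoch in range(20):
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size, 0]
        yb = y[start:start + batch_size]
        grad = -2 * np.mean(xb * (yb - w * xb))   # gradient on this mini-batch only
        w -= lr * grad                            # update after each mini-batch

print(round(w, 2))                                # approaches the true slope of 3
```

Batch learning, by contrast, would compute the gradient over all 1000 examples before making a single update.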
Generalization: A measure of how well a machine learning algorithm or model performs on unseen data, i.e., data it did not encounter during training.
Model: This can be described as a mathematical representation of the generalized pattern observed in a dataset.
Instance-based learning (memory-based learning): A machine learning system that generalizes by memorizing the instances (patterns) presented to it during training. When new instances arrive at test time, similarity scores are calculated between them and the stored training instances, and the algorithm makes predictions on new instances based on these previously observed ones.
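k-nearest neighbours is the classic instance-based learner; a sketch (the iris dataset and k = 5 are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)               # "training" essentially stores the instances
print(knn.predict(X[:3]))   # prediction compares new points to stored instances
```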
Model-based learning: An alternative to instance-based learning: generate a model from a dataset and use that model to tackle tasks such as prediction.
Dataset: This is a collection of information that contains related elements that can be treated by a machine learning algorithm as a single unit.
Feature: This is a measurable definitive characteristic of an object, observation, or dataset.
Training Dataset: This is the portion of our dataset used to train the neural network directly. In the task of using a convolutional neural network for classification, the relationships between the training set's images and labels are learned by the network. This is the portion of the dataset the network sees during training.
Validation Dataset: This is the portion of our dataset utilized during training to assess the performance of the network at various stages. This portion is essential, as it acts as an indicator of issues such as overfitting and underfitting of the network model.
Test Dataset: We utilize this portion of the dataset to evaluate the performance of the network after the training stage is completed.
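A sketch of producing the three splits (the 60/20/20 proportions and the iris dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# first carve out 40% of the data, then split that portion into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 90 30 30
```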
Underfitting: This occurs when a machine learning algorithm fails to learn the patterns in a dataset. Underfitting can be fixed by using a better algorithm or model that is more suited to the task. It can also be fixed by recognizing more features within the data and presenting them to the algorithm.
Overfitting: This problem involves the algorithm predicting new instances based too closely on the instances it observed during training, which can cause the algorithm to generalize poorly to unseen data. Overfitting can occur if the training data does not accurately represent the distribution of the test data. It can be fixed by reducing the number of features in the training data and by reducing the complexity of the network through various techniques.
Regularization: In machine learning, regularization is a technique used to reduce the complexity of a network (and hence prevent overfitting) by encouraging the weights of the network to take on only small values, or values with a small order of magnitude. By placing constraints on the values the weights can have, we make the network simpler, thus reducing its complexity. Regularization is implemented by applying a cost/penalty to the loss function where the values of the weights are large. By applying regularization, we reduce the contribution of large-valued weights to the network, and this can mitigate overfitting.
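A sketch of L2 regularization, where a penalty proportional to the squared weights is added to the loss (the mean-squared-error data term and the lambda value are illustrative):

```python
import numpy as np

def regularized_loss(y_true, y_pred, weights, lam=0.01):
    mse = np.mean((y_true - y_pred) ** 2)   # data-fit term
    penalty = lam * np.sum(weights ** 2)    # cost grows with weight magnitude
    return mse + penalty

w = np.array([0.5, -2.0, 4.0])
print(regularized_loss(np.array([1.0]), np.array([0.8]), w))
```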
Hyperparameters: These are values defined before the training of the network begins; they are initialized to help steer the network toward a positive training outcome. They have an effect on the machine/deep learning algorithm but are not affected by it, and their values do not change during training. Examples of hyperparameters are regularization strength, learning rate, number of layers, etc.
Network parameters: These are components of our network that are not manually set. They are internal network values that are manipulated by the network directly during training. An example of a network parameter is the weights internal to the network.
Ground-Truth: This is an element within a dataset that is annotated by observation. In machine learning, ground-truth data is used to measure the accuracy of an algorithm's predictions by comparing the inference results provided by the algorithm with the observational results contained in the ground truth.
Confusion matrix (error matrix): Provides a visual illustration of the number of matches and mismatches between the ground-truth annotations and the classifier's results. A confusion matrix is typically structured in tabular form, where the rows are filled with the observational results from the ground truth and the columns are filled with the inference results from the classifier.
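A sketch using scikit-learn (the label values are illustrative):

```python
from sklearn.metrics import confusion_matrix

ground_truth = [0, 0, 1, 1, 2, 2]   # observational results (rows)
predictions  = [0, 1, 1, 1, 2, 0]   # classifier inference results (columns)
print(confusion_matrix(ground_truth, predictions))
```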
Precision-Recall: These are performance metrics used to evaluate classification algorithms, visual search systems, and more. Using the example of evaluating a visual search system (finding similar images based on a query image): precision captures the proportion of returned results that are relevant, while recall captures the proportion of the relevant results in your dataset that are returned. Read that again.
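Computed from raw counts (the counts below are illustrative):

```python
true_positives  = 8    # relevant items the system returned
false_positives = 2    # irrelevant items it returned
false_negatives = 4    # relevant items it failed to return

precision = true_positives / (true_positives + false_positives)   # 0.8
recall    = true_positives / (true_positives + false_negatives)   # ~0.67
print(precision, recall)
```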
Sampling-bias: This can occur during the process of data collection. Sampling bias can be described as the limited representation of a specific member or subgroup within a dataset; the dataset is distributed in a manner where one subset is significantly larger in number than another. Sampling bias can be mitigated with random sampling techniques.
Backpropagation: An algorithm that enables learning within a neural network by efficiently calculating the gradient of the network's cost function with respect to its weights; this gradient is then used to minimize the cost.
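A minimal sketch of backpropagation for a single sigmoid neuron trained on one example (the sample values, squared-error cost, and learning rate are illustrative):

```python
import numpy as np

x = np.array([0.5, -0.3])   # input
y = 1.0                     # target
w = np.array([0.1, 0.2])    # weights
b, lr = 0.0, 0.5            # bias and learning rate

for step in range(100):
    z = np.dot(w, x) + b                  # forward pass: weighted sum
    a = 1.0 / (1.0 + np.exp(-z))          # sigmoid activation
    loss = (a - y) ** 2                   # squared-error cost
    # backward pass: chain rule yields the gradient of the cost
    dloss_da = 2 * (a - y)
    da_dz = a * (1 - a)                   # derivative of the sigmoid
    grad_w = dloss_da * da_dz * x         # gradient w.r.t. each weight
    grad_b = dloss_da * da_dz             # gradient w.r.t. the bias
    w -= lr * grad_w                      # descend the gradient
    b -= lr * grad_b

print(round(float(loss), 4))              # loss shrinks toward zero
```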
.................................................................................................................................................
Credits: Richmond Alake