Article 2 - The Predicament of Predictors
"Innovations in technology are most impactful and enjoyable when they solve real-life problems, improve lives, enhance safety, and are accessible to the average person."
In the first article we touched upon some foundational concepts of linear regression used in machine learning, focusing on "understanding and minimizing error". The article included some examples and an overview of the exciting million-dollar prize competition run by Netflix, where the challenge was to better the RMSE of its Cinematch algorithm by 10%.
You can read the first article here: Article 1: The Predicament of Predictors
In this article we continue to build upon our Machine Learning vocabulary and move beyond linear regression, considering solutions for more real-life challenges with much larger volumes of data to process. We will also cover the essential branches of mathematics that are the building blocks of Machine Learning algorithms.
Many challenges that impact human lives at a large scale (for example those related to health, cyber security, finance, governance, and law and order) involve qualitative data in addition to quantitative data, which needs to be understood well and in the right context to make recommendations. The analysis of these issues often leads to results that are a classification into pre-determined categories (and not a value from a continuous numeric range).
Consider these three scenarios as real-life examples where the output is a classification into a category (a categorical variable or label) and not a continuous numerical value.
A. Digital Communication: Email Categorization
Objective: Scan incoming email and classify into any of the following categories as appropriate:
Primary, Social, Marketing, or Spam (the categories referenced later in this article).
B. Online Financial Safety: Detecting a Fraudulent Credit/Debit Card Transaction
Objective: Reduce fraudulent misuse of credit and debit card information by unauthorized entities.
In 2022, merchants and card owners lost over 30 billion USD worldwide to debit and credit card fraud. For the card-issuing FSI company (or consortium), it is imperative to detect and prevent fraudulent transactions pre-emptively to safeguard customers, prevent losses to the merchant and the card issuer, and maintain brand loyalty and trust.
C. Health and Well Being: Detection and prevention of Cardiovascular diseases
Objective: Identify the risk of heart disease at an individual and community level.
As per the World Health Organization, cardiovascular diseases (CVDs) are the leading cause of death globally. In 2019, 32% of all global deaths were due to CVDs, of which 85% were due to heart attack and stroke.
With the power of AI, we can process large volumes of individual and demographic data managed by health organizations to identify the responsible factors with the highest impact, so that awareness can be built and preventive measures can be encouraged.
Moving beyond Linear Regression
Algorithms based on linear regression alone may not provide the most efficient models in many scenarios, including the ones noted above. For example, when estimating the probability of an event happening, linear regression (which worked well for the multi-channel advertising spend vs. sales example covered in Article 1) can produce a negative value or a value greater than 1, which is invalid for a probability.
Refer to the figure above with two graphs. Let's consider an example of finding the probability of a credit card holder defaulting on payment (Y axis) given the outstanding balance due on the card (X axis).
If we apply linear regression to the values (left image) to find the probability of a person defaulting, we could get a negative value or a value greater than 1. That is not what is expected for a probability, which should lie between 0 and 1.
However, we solve this problem by shifting to a logistic function (shown above and the basis of logistic regression). Now we always get values between 0 and 1 (right image). There are more equations to solve and transform to understand logistic regression fully, but this is enough to drive home the point that a different approach is needed for different types of data and outcomes.
An approach that worked for one set of problems may not work for another. Therefore we need models better suited to classification problems, such as logistic regression or Linear Discriminant Analysis, for a better fit.
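As a minimal sketch of this difference, assuming scikit-learn and a synthetic balance-vs-default dataset (the numbers and variable names below are made up purely for illustration), we can compare the two approaches:

```python
# Minimal sketch: linear vs. logistic regression on a synthetic "balance vs. default" dataset.
# The data and the decision rule used to generate it are hypothetical, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)
balance = rng.uniform(0, 3000, size=500).reshape(-1, 1)     # outstanding card balance
# Higher balances make default more likely (a synthetic rule, not real data)
default = (balance.ravel() + rng.normal(0, 500, size=500) > 2000).astype(int)

linear = LinearRegression().fit(balance, default)
logistic = LogisticRegression(max_iter=1000).fit(balance, default)

test_balances = np.array([[0.0], [1500.0], [3000.0]])
print("Linear regression 'probabilities':", linear.predict(test_balances))                 # can fall outside [0, 1]
print("Logistic regression probabilities:", logistic.predict_proba(test_balances)[:, 1])   # always within [0, 1]
print("Default predicted (0.5 threshold):", logistic.predict(test_balances))
```

The logistic model also illustrates the threshold idea discussed below: the predicted probability is converted into a Yes/No decision once it crosses a chosen cut-off (0.5 by default in scikit-learn).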
For example, the classification would need a decision like a Yes or No:
Based on the data is any risk of heart disease detected? Yes/No.
What is the probability of the current transaction being fraudulent?
The analysis may produce a likelihood, and the result is assigned to a certain category if that likelihood crosses a threshold we set based on prior knowledge.
Branches of Mathematics - Building blocks of ML
So what are the "must-know" branches of mathematics that form the building blocks of machine learning and AI systems? Interestingly, the mathematical theories and algorithms used in modern-day AI applications range from those proposed nearly a century ago to far more recent developments.
In order to pursue ML and AI more deeply, it is highly recommended to build strong fundamentals in the following branches of mathematics: linear algebra, probability, statistics and calculus, each of which appears in the worked example below.
Mathematics in Action - Applied to ML
From the examples above, let's pick one, email categorization, and touch upon at a high level how these mathematical concepts and various algorithms play together to help build the right model.
Digital Communication: Email Categorization
Objective: Scan incoming email and classify it into the appropriate category.
Key Machine Learning concepts applied:
Key Mathematical concepts applied:
Logic applied:
Algorithms applied:
Here, feature extraction relies mainly on linear algebra (vectorization), classification relies mainly on probability (Naive Bayes), and model performance evaluation relies mainly on statistics (precision, recall, accuracy).
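A minimal scikit-learn sketch of that pipeline follows; the example emails and labels are my own illustrative assumptions, not a real dataset.

```python
# Minimal sketch of the pipeline described above: vectorization (linear algebra),
# Naive Bayes classification (probability), and precision/recall/accuracy (statistics).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

emails = [
    "Meeting agenda for Monday",             # primary
    "Your friend tagged you in a photo",     # social
    "Huge discount on shoes this weekend",   # marketing
    "Win a free prize, click now",           # spam
]
labels = ["primary", "social", "marketing", "spam"]

vectorizer = CountVectorizer()               # feature extraction: text -> word-count vectors
X = vectorizer.fit_transform(emails)

model = MultinomialNB().fit(X, labels)       # classification via Bayes' theorem

predictions = model.predict(X)
print("Accuracy :", accuracy_score(labels, predictions))
print("Precision:", precision_score(labels, predictions, average="macro", zero_division=0))
print("Recall   :", recall_score(labels, predictions, average="macro", zero_division=0))
```

In practice the model would be trained and evaluated on separate splits of a much larger labelled corpus; the tiny in-sample evaluation here is only to show where each mathematical branch enters the pipeline.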
The diagram above, which I created, represents an Artificial Neural Network (ANN) model and shows the parameters and hyper-parameters that we need to tweak and optimize to get maximum accuracy. It also shows how the mathematical concepts we touched upon are applied, in this case matrices (linear algebra) and gradient descent (calculus). The key elements of this ANN are the parameters (weights and biases) and the hyper-parameters.

The data is divided into training, validation and test sets (usually an 80:10:10 mix). We start with forward propagation from the input variables. In this scenario there are 3 inputs (independent variables) and the final output is a classification into two categories. A batch is a subset (sample) of the total training data propagated through the network in a single pass, and the parameters are adjusted after every batch to reduce error. Each full pass of the training dataset through the ANN is called an epoch, and an epoch can consist of one or more batches. Multiple epochs are run across all of the training data, and the training process completes when all of the epochs have completed.
The initial weights can be initialized randomly with values drawn from a standard normal distribution (mean = 0 and standard deviation = 1). The matrix multiplication and addition in the diagram show how these weights and biases are used (Wᵀ·X + B = Y) when processed for the 1st and 2nd hidden layers.
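To make this concrete, here is a minimal NumPy sketch of the Wᵀ·X + B step flowing through two hidden layers; the layer sizes, the ReLU activation and the random input are illustrative assumptions, not values from the diagram.

```python
# Minimal sketch of one forward-propagation pass: 3 inputs, two hidden layers,
# and a 2-category output, with weights drawn from a standard normal distribution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                          # 3 input features (one example)

# Weight matrices and bias vectors (mean 0, standard deviation 1 for the weights)
W1, b1 = rng.normal(size=(3, 4)), np.zeros((4, 1))   # hidden layer 1: 4 neurons
W2, b2 = rng.normal(size=(4, 4)), np.zeros((4, 1))   # hidden layer 2: 4 neurons
W3, b3 = rng.normal(size=(4, 2)), np.zeros((2, 1))   # output layer: 2 categories

relu = lambda z: np.maximum(0, z)

h1 = relu(W1.T @ x + b1)        # W^T * X + B, followed by the activation function
h2 = relu(W2.T @ h1 + b2)
logits = W3.T @ h2 + b3         # raw output for the 2 categories
print("Raw output (logits):", logits.ravel())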
The matrix output for each layer is fed into an activation function that filters and reduces noise and determines what information is propagated to the next layer. The type of activation function is decided based on the context of the problem being solved.
Some of the activation functions, for example, are listed here (a minimal code sketch follows this list):
Sigmoid: output is 0 to 1
TanH: output is -1 to 1
Rectified Linear Unit (ReLU): output is 0 if x < 0, and x otherwise
Softmax: output is a vector of probabilities that add up to 1
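Here is a minimal NumPy sketch of these four activation functions.

```python
# Minimal sketch of the activation functions listed above, written with NumPy.
import numpy as np

def sigmoid(z):                      # output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                         # output in (-1, 1)
    return np.tanh(z)

def relu(z):                         # 0 if z < 0, z otherwise
    return np.maximum(0, z)

def softmax(z):                      # vector of probabilities summing to 1
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.5, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep="\n")
```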
Usage example: In the case of the email classification problem, we could use the softmax function, which gives the probability of an email belonging to each category (Primary, Social, Marketing, Spam). We run the iterations using forward propagation, and the category with the highest probability determines which category the email is assigned to. To determine the error and accuracy during each iteration we use a cost function suited to the problem being solved; in this scenario of multi-class email classification, "categorical cross-entropy" would be ideal. Based on the error value, back-propagation is used to adjust the weights and improve accuracy. In the diagram above, back-propagation moves towards the left, against the direction of the arrows, to correct the errors and adjust the weights.
This process of forward propagation, application of the activation function, error estimation and back-propagation to improve accuracy is known as the gradient descent process, and it relies on calculus as its core branch of mathematics.
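For the email scenario, here is a minimal NumPy sketch of the softmax output and the categorical cross-entropy cost for a single email; the logit values and the one-hot label are made up purely for illustration.

```python
# Minimal sketch of the error-measuring step: softmax probabilities over the four
# email categories and the categorical cross-entropy loss for one example.
import numpy as np

categories = ["Primary", "Social", "Marketing", "Spam"]

logits = np.array([2.1, 0.3, -0.5, 0.9])     # raw network output for one email (illustrative)
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                   # softmax: probabilities summing to 1

true_label = np.array([1, 0, 0, 0])           # the email is actually "Primary" (one-hot)
cross_entropy = -np.sum(true_label * np.log(probs))

print("Predicted category:", categories[int(np.argmax(probs))])
print("Categorical cross-entropy loss:", round(float(cross_entropy), 4))
# Back-propagation uses the gradient of this loss to nudge each weight:
#   w_new = w_old - learning_rate * d(loss)/d(w)   (the gradient descent update)
```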
This is a simplistic representation that does not get into the finer details of the mathematics of neural networks and deep learning, which would require much more time and rigor and is often a dedicated term-long subject in a formal course.
Access to Computing Power
The computing power available today makes testing on large datasets possible in ways that were not feasible when only a few CPUs were available. With the rapid advancement of technology to process larger amounts of data faster, data centers around the world are being redesigned and revamped with modern architectures whose computing power comes from running thousands of GPUs and DPUs.
Large Language Models (LLMs) and Large Multimodal Models (LMMs) today are possible due to more computing power available than ever before.
In parallel, we are quickly developing LMMs that process massive amounts of data spanning text, images, video, audio (examples: OpenAI's GPT-4o "omni", Alphabet's Gemini 1.5) and even sensory data (device inputs such as touch, gyroscope and navigation data), going beyond LLMs like GPT-4 that process massive amounts of primarily text-based data.
The deep learning neural network behind GPT-3 has over 175 billion machine learning parameters, and GPT-4 reportedly has around 1.7 trillion parameters.
With the computing power available today we can apply all of the mathematical algorithms to large volumes of data and perform the compute intensive validations to determine the most efficient models.
In programming languages like Python, all of these calculations can be performed with ease using the wide array of libraries available today. As the features and parameters increase, you will need processing power beyond that of your personal computer (as we move from tens of millions to billions to trillions of data points forming LLMs and LMMs), and you will use dedicated servers running on thousands of cores (CPUs, GPUs, TPUs and DPUs) in data centers built for this purpose.
For example, with a large dataset, when we use Cross-Validation (CV) techniques (such as K-Fold CV or Leave-One-Out CV) instead of just the initial training and test data split, we split the data into many folds and iteratively run through them, marking some data as training data and the rest as test data, to get more reliable estimates from a diverse data set. This needs much higher computational power than your laptop or desktop.
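A minimal scikit-learn sketch of K-Fold cross-validation follows; the synthetic dataset, model and fold count are illustrative choices so that it runs quickly on a laptop.

```python
# Minimal sketch of K-Fold cross-validation: the data is split into 5 folds and each
# fold takes a turn as the test set while the remaining folds are used for training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = LogisticRegression(max_iter=1000)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(model, X, y, cv=kfold)
print("Accuracy per fold:", np.round(scores, 3))
print("Mean accuracy    :", round(scores.mean(), 3))
```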
In such cases, cloud computing resources or a data center designed for such purposes is best suited. Examples of powerful processing hardware found in such data centers today include the NVIDIA A100 Tensor Core GPU, NVIDIA H100 Tensor Core GPU, AMD Instinct MI250X, Google TPU v4 and NVIDIA BlueField-4 DPU.
If you want to play with large datasets that need computing power beyond your personal computer at home, consider leveraging the cloud computing power available to you. Some of the providers (offering both free and paid plans, depending on the usage limits and computing resources you need) are Kaggle.com, Colab.Google, Lightning.ai and Brev.dev.
I ran code [dataset and code courtesy: Kaggle.com] on colab.google to find which model is most suitable for spam detection and got these results:
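The exact dataset and results are not reproduced here, but the comparison follows a pattern like the sketch below; the toy emails, labels and choice of models are my own illustrative assumptions, and you would swap in the actual Kaggle spam dataset to run a real comparison.

```python
# Minimal sketch of comparing several candidate spam classifiers with cross-validation.
# The texts and labels are toy data for illustration only, not the Kaggle dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

texts = ["win a free prize now", "meeting at 10 am", "cheap loans click here",
         "lunch tomorrow?", "urgent: claim your reward", "project status update"]
labels = [1, 0, 1, 0, 1, 0]          # 1 = spam, 0 = ham (toy labels)

X = TfidfVectorizer().fit_transform(texts)

for name, model in [("Naive Bayes", MultinomialNB()),
                    ("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Linear SVM", LinearSVC())]:
    scores = cross_val_score(model, X, labels, cv=2)
    print(f"{name:20s} mean accuracy: {scores.mean():.2f}")
```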
The Ground We Covered
In this article we expanded our horizons to go deeper and wider into the field of ML and its applications in AI systems. We went through some real-life problems and mapped them to the branches of mathematics that are the building blocks of ML; moved beyond linear regression into classification-based models; learned the key terminology used with neural networks and some key activation functions; walked through an example to see how all the concepts are applied; and looked at the computing power needed and available today to run LLMs and LMMs, along with cloud platforms where you can open your own free or paid account to get hands-on and run your own ML code on processors much more powerful than your personal computer.
In the next article we will discuss the areas where AI can play a greater role, possibly performing better than humans, the challenges that AI will face, and the importance of human-AI collaboration.
These views are personal. Just like in statistics, please allow for a margin of error, though efforts have been made to minimize them. These articles are intended to spark the interest of general readers in statistics, machine learning, and AI.
Acknowledgement: