Modern machine learning through simulation systems.

Statistics and machine learning methods can be thought of as algorithms that learn patterns from historical data, subject to certain assumptions. For instance, OLS regression rests on assumptions such as a linear relationship between covariates (features) and the dependent variable, no perfect multicollinearity among covariates, and independent, identically distributed residuals (normally distributed, for inference). The upside of these assumptions is that we get closed-form solutions to such problems, as sketched below.

OLS Regression - Shamelessly generated by ChatGPT from somewhere on the internet
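To make the closed-form point concrete, here is a minimal sketch (NumPy only, on made-up synthetic data) of the normal-equations estimate beta_hat = (X'X)^(-1) X'y:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 100 observations, 3 features
X = np.column_stack([np.ones(100), X])               # prepend an intercept column
true_beta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_beta + rng.normal(scale=0.1, size=100)  # linear signal + i.i.d. noise

# Closed-form OLS estimate via the normal equations: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # recovers something close to true_beta when the assumptions hold

No iterative training loop is needed; when the assumptions hold, the estimate drops out of linear algebra in one step.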

Present-day machine learning methods that work on structured data help break free from the tyranny of these assumptions, and also address challenges like multicollinearity, high dimensionality, and latent factors.

For instance, random forests allow for non-linear relationships and high-dimensional feature spaces under somewhat less constraining assumptions, namely that the ensemble's decision trees are weakly correlated and learn relatively independent patterns (see the sketch after the figure below).

Random Forests: Generated Using Midjourney
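As a quick illustration (scikit-learn on synthetic data; all values made up), here is a random forest fitting a non-linear relationship that a straight line cannot capture:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)  # non-linear target

# Each tree is grown on a bootstrap sample with random feature subsets,
# which keeps the trees relatively de-correlated, per the assumption above.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
print(model.predict([[1.5]]))  # compare against np.sin(1.5), roughly 0.997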


XGBoost (gradient-boosted trees) is another such successful method and code library, one that can solve most structured-data problems to a reasonable degree of effectiveness in practical business settings (as opposed to academia). A brief sketch follows the figure below.


Gradient Boosted Trees: Generated Using Midjourney
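For flavor, a minimal sketch using the xgboost library's scikit-learn-style wrapper (synthetic data; the hyperparameters are illustrative choices, not tuned recommendations):

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] * X[:, 1] + np.abs(X[:, 2]) + rng.normal(scale=0.1, size=1000)

# Boosting fits each new tree to the residual errors of the ensemble so far,
# so the model improves incrementally rather than averaging independent trees.
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:3]))

In practice, a handful of hyperparameters (tree depth, learning rate, number of boosting rounds) carry most of the tuning burden.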


Neural networks - convolutional neural nets (CNNs), recurrent neural nets (RNNs), and their myriad avatars - further made it easier to learn patterns from massive datasets, especially non-tabular data like images, audio, and text.
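As one concrete example, here is a minimal sketch of a tiny CNN in PyTorch; the architecture and sizes are arbitrary choices for illustration, not a recommended design:

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learn local image patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
        )
        self.classifier = nn.Linear(16 * 14 * 14, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TinyCNN()
logits = model(torch.randn(8, 1, 28, 28))  # a batch of 8 fake 28x28 grayscale images
print(logits.shape)                        # torch.Size([8, 10])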

One common problem across all of these algorithms is generalization, i.e., how to make predictions on data, events, or features that the algorithm has never seen before. A related problem is that of unknown "unknowns": I do not know what I do not know, so I cannot solve for it.

Large language models (LLMs), diffusion models, and flow models have largely solved the generalization problem for text and images. Well-known tools like ChatGPT, Stable Diffusion, and Midjourney provide ample examples of this: they are able to generate novel text, images, audio, and video. These models work effectively because they do not just learn the patterns that exist in the data; rather, they learn the underlying unobserved drivers that generate the observed patterns. By manipulating the behavior of these unobserved drivers, novel information can be generated.

The next frontier in this space is to bring the same capability to the unknown-"unknowns" problem for structured business data. The worlds of images and text offer billions of learnable data points, which makes it comparatively easy to assemble training data for generative models. However, when one evaluates a business problem, say sourcing coffee for a company like Starbucks, there is only one version of the past that can be used to train a learning algorithm.

Hence, to generalize the statistics and machine learning models in use today, one needs a way to generate alternate realities and alternate past datasets. A system that can take existing data, combine it with existing business domain knowledge, and simulate reliable alternate realities can be very useful for evaluating scenarios the business has not come across in the past. By combining such simulations with generative machine learning techniques, existing analytics can be made much more robust; one place this learning paradigm is already used successfully is the generation of synthetic training data for self-driving cars. A toy sketch of the idea follows the figure below.

Simulations to model user behavior and their interactions.
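To make the idea tangible, here is a toy sketch of generating "alternate pasts" by simulation. Every number, distribution, and the demand model itself are hypothetical assumptions for illustration, not a real sourcing model:

import numpy as np

rng = np.random.default_rng(42)
observed_weekly_demand = np.array([980, 1010, 995, 1050, 970])  # tons; made-up history

def simulate_alternate_history(n_weeks=52):
    """Simulate one plausible alternate year of weekly coffee demand."""
    base = observed_weekly_demand.mean()
    # Domain knowledge encoded as assumptions: mild seasonality plus rare supply shocks.
    season = 50 * np.sin(2 * np.pi * np.arange(n_weeks) / 52)
    shocks = rng.choice([0, -200], size=n_weeks, p=[0.95, 0.05])  # occasional disruptions
    noise = rng.normal(scale=30, size=n_weeks)
    return base + season + shocks + noise

# A thousand alternate histories become training data covering scenarios
# that the single observed past never contained.
alternate_pasts = np.stack([simulate_alternate_history() for _ in range(1000)])
print(alternate_pasts.shape)  # (1000, 52)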


Simulations, however, are not simple to do. One needs to consider the reliability of the simulation, the computational cost of running simulations at scale, and then the problem of learning from the massive amounts of data the simulations generate. This brings us to the next frontier of machine learning: the use of simulations to generalize existing machine-learning-based decision support systems. By doing so, existing analytics can be leveraged to respond to unknowns that have not been observed in the past, and to discover new optimal solutions that were simply never considered because only a single version of the past was available.

An exciting time to be in machine learning, indeed!


Alok Ranjan

Co-founder at WalkingTree and Qritrim | Generative AI, AI/ML and Product Engineering

1y

While the whole article was an excellent read, I was especially excited to read about synthetic data and the use of simulation in machine learning. I look forward to your next article in this series. Also, part of the work of Ashish Kapoor's team relates to the "generation of data & evaluation of scenarios through simulation." While they focus on robot intelligence, it can be scaled to many other areas.

Shilpi Sharma

Builder| Technologist | Investor | Advisor

1y

Great summary. Do you think that, on this journey of working with unknown "unknowns", we would first have to codify systems with known "unknowns", i.e., the outcomes of permutations and combinations of known attributes of a process and/or the interactions of multiple processes? And that those systems would then be able to generate new attributes that can imagine an evolved process or set of interactions, to truly simulate unknown "unknowns"? Would love to hear more about it in the rest of the series.
