Data Before ML

Intelligence is about solving a chaotic problem: you can see the manifold (the butterfly wings) but never know exactly when, or to which wing, the form will flip. Data is both the bottleneck and the gold mine when it comes to building such systems.

ML can be understood as supervised or unsupervised learning: a process of classification driven by error reduction.


Figure 1 - Plain Vanilla View of ML
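As a toy illustration of this plain-vanilla view, the sketch below trains a perceptron-style classifier by repeatedly reducing its misclassification error. The eight-point dataset, learning rate, and number of passes are invented for the sketch, not taken from the article.

```python
# Toy illustration: a perceptron-style classifier that learns by
# reducing misclassification error. Dataset, learning rate, and the
# number of passes are invented assumptions for this sketch.
data = [((0.1, 0.2), 0), ((0.3, 0.1), 0), ((0.2, 0.4), 0), ((0.4, 0.3), 0),
        ((0.9, 0.8), 1), ((0.7, 0.9), 1), ((0.8, 0.6), 1), ((0.6, 0.7), 1)]

w = [0.0, 0.0]   # weights
b = 0.0          # bias
lr = 0.1         # learning rate

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Each misclassified point nudges the weights toward the correct side,
# so the error count falls over repeated passes (error reduction).
for _ in range(100):
    for x, y in data:
        err = y - predict(x)          # -1, 0, or +1
        w[0] += lr * err * x[0]
        w[1] += lr * err * x[1]
        b += lr * err

errors = sum(predict(x) != y for x, y in data)
print("misclassified:", errors, "of", len(data))  # → misclassified: 0 of 8
```

The same loop, minus the labels, is what an unsupervised method does against its own reconstruction or clustering error rather than against given answers.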

An ML system built on a data architecture that mirrors, or at least resembles, the data generating mechanism performs better, because such a design asks: where is the data coming from? What should the data architecture look like? Will the dataset lend itself well to the chosen ML process?


Figure 2 - Data Generating Mechanism View of ML

However, a data generating mechanism and a good architecture that assume stability and linearity in information can still be biased and deliver erroneous results. Specifying a linear model and expecting it to capture causality has led modern finance into a set of conflicting theories over the last 100 years, so expecting such an approach to solve the challenges of non-financial domains is naive.


Figure 3 - Causal Complex View of ML

And because causality eventually leads to chaos, a robust ML system should specify a model that assumes a varying, dynamic degree of influence among a set of causes, and expects some causes to fail and new causes to emerge and succeed. Such a system makes no assumption about information validity and embraces both error reduction and error amplification as outputs, which it uses to calibrate its understanding of the data generating mechanism.
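One way to picture such a system, purely as an illustrative sketch (the causes, drift schedule, and learning rate are all invented here, not the author's method), is an online learner whose belief in each cause is continuously recalibrated by the error, so a fading cause loses influence and an emerging cause gains it:

```python
import random

random.seed(1)

# True influence of each cause drifts over time: c1 fades away,
# c2 stays steady, and a new cause c3 emerges at t = 50.
# All of this is invented illustration data.
def true_influence(t):
    return {"c1": max(0.0, 2.0 - 0.04 * t), "c2": 1.0,
            "c3": 3.0 if t >= 50 else 0.0}

est = {"c1": 0.0, "c2": 0.0, "c3": 0.0}  # estimated influence per cause
lr = 0.1

for t in range(200):
    x = {c: random.gauss(0.0, 1.0) for c in est}      # cause signals
    y = sum(true_influence(t)[c] * x[c] for c in x)   # observed outcome
    err = y - sum(est[c] * x[c] for c in x)           # surprise (amplifies at t = 50)
    for c in est:
        est[c] += lr * err * x[c]   # recalibrate belief in each cause
```

The error spikes when c3 emerges; rather than being treated as noise, that amplification is exactly what drives the weights toward the new causal structure.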

Being data heavy or data light is a characteristic of the data architecture, not of the ML process. A good data architecture can sample the data well for the ML process; a well-designed one can drive the ML process with a fraction (e.g. less than 10%) of the data, refreshed periodically. It is therefore essential to ask: how much of my database do I really need for my ML process to run optimally? The more carefully we use data for training, the fewer biases we introduce into our ML processes and the less electricity we burn, a desirable objective in a computation-heavy world.
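As a minimal sketch of this idea (the dataset, labels, and the 10% figure are illustrative assumptions), a data architecture can hand the ML process a stratified sample that preserves the class balance of the full store while using only a fraction of the rows:

```python
import random
from collections import defaultdict

random.seed(2)

# A full data store of 1,000 rows with imbalanced labels
# (roughly 90% class 0, 10% class 1). Entirely invented data.
full = [(random.random(), 0 if random.random() < 0.9 else 1)
        for _ in range(1000)]

def stratified_sample(rows, frac):
    """Sample `frac` of the rows per label, preserving class balance."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    sample = []
    for group in by_label.values():
        k = max(1, int(len(group) * frac))
        sample.extend(random.sample(group, k))
    return sample

subset = stratified_sample(full, 0.10)   # the periodically refreshed slice
print(len(subset), "rows instead of", len(full))
```

Re-running the sampler on a schedule keeps the training slice fresh without ever pushing the full table through the ML process.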

If we want to build ML systems that simulate real-world problems, we have to train them on datasets that evolve from an understanding of data generating mechanisms, appropriate data architectures, and causal complexity, which rests on the assumption that every cause has a certain probability of failing or succeeding in producing an effect.
