Probabilistic Nearest Neighbors: The Swiss Army Knife of GenAI

ANN — Approximate Nearest Neighbors — is at the core of fast vector search, which is itself central to GenAI, especially GPT and other LLMs. My new methodology, abbreviated as PANN, has many other applications: clustering, classification, measuring the similarity between two datasets (images, soundtracks, time series, and so on), tabular data synthesis (improving poor synthetic data), model evaluation, and even detecting extreme observations.

Just to give an example, you could use it to categorize time series without any statistical theory. Statistical models tend to be redundant, math-heavy, and harder to explain, leading to definitions that are less useful to developers. PANN avoids that.

Fast and simple, PANN (for Probabilistic ANN) does not involve training or neural networks, and it is essentially math-free. Its versatility comes from four features:

  • Most algorithms aim at minimizing a loss function. Here I also explore what you can achieve by maximizing the loss.
  • Rather than working within a single dataset, I use two sets S and T. For instance, K-NN looks for nearest neighbors within one set S. What about looking, for each observation in S, for its nearest neighbors in T? This leads to far more applications than the one-set approach.
  • Some algorithms are very slow and may never converge, so no one pays attention to them. But what if the loss function drops very fast at the beginning, fast enough that by stopping early you get better results in a fraction of the time than with the “best” method?
  • In many contexts, a good approximate solution obtained in little time from an otherwise non-converging algorithm may be as good for practical purposes as a more accurate solution obtained after far more steps with a more sophisticated algorithm. A minimal sketch illustrating the two-set search and early stopping follows this list.
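To make the two-set search and the early-stopping idea concrete, here is a minimal sketch in Python. It is not the actual PANN algorithm described in the full article; it simply compares each point of a set S against random samples of a set T, keeps the best neighbor found so far, and tracks the loss (average distance to the current approximate neighbor). The function name probabilistic_nn and all parameters are illustrative assumptions.

```python
import numpy as np

def probabilistic_nn(S, T, n_iter=200, sample_size=50, tol=1e-4, seed=0):
    """Toy iterative nearest-neighbor search from set S into set T.

    Each iteration compares every point of S against a small random
    sample of T and keeps the closest neighbor found so far. The loss
    is the average distance to the current approximate neighbor.
    """
    rng = np.random.default_rng(seed)
    n_S, n_T = len(S), len(T)
    best_dist = np.full(n_S, np.inf)   # distance to best neighbor found so far
    best_idx = np.full(n_S, -1)        # index (in T) of that neighbor
    loss_history = []

    for it in range(n_iter):
        cand = rng.choice(n_T, size=min(sample_size, n_T), replace=False)
        # pairwise distances between all of S and the sampled subset of T
        d = np.linalg.norm(S[:, None, :] - T[cand][None, :, :], axis=2)
        d_min = d.min(axis=1)
        better = d_min < best_dist
        best_dist[better] = d_min[better]
        best_idx[better] = cand[d.argmin(axis=1)[better]]

        loss_history.append(best_dist.mean())
        # stop early once the loss curve flattens out
        if it > 5 and loss_history[-6] - loss_history[-1] < tol:
            break

    return best_idx, best_dist, loss_history

# Example: 500 query points, 5,000 reference points in 10 dimensions
rng = np.random.default_rng(1)
S = rng.normal(size=(500, 10))
T = rng.normal(size=(5000, 10))
idx, dist, losses = probabilistic_nn(S, T)
```

There is no index construction and no training step: the only tuning knobs are the sample size per iteration and the early-stopping tolerance, which captures the idea of stopping as soon as the loss stops improving.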

The figure below shows how quickly the loss function drops at the beginning. In this case, the loss represents the average distance to the approximate nearest neighbor obtained so far in the iterative algorithm. The X-axis represents the iteration number. Note the excellent fit of the orange curve to the loss function, allowing you to predict its baseline (minimum loss, or optimum) after only a small number of iterations. To see what happens if you maximize the loss instead, read the full technical document.

Figure: Fast convergence of PANN at the beginning (Y-axis is the loss function, X-axis is the iteration number).
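The article does not specify the functional form of the fitted orange curve, so the sketch below assumes a simple power-law decay, loss(t) ≈ a + b·t^(-c), fitted with SciPy to the first few iterations of the loss history returned by the sketch above. The asymptote a then serves as an early estimate of the baseline (minimum loss). The function names and the model choice are assumptions, not the author's method.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay_model(t, a, b, c):
    # a is the asymptote (predicted minimum loss); b and c control the decay
    return a + b * np.power(t, -c)

def predict_baseline(loss_history, n_fit=30):
    """Fit the first n_fit iterations of the loss curve and return the
    asymptote of the fitted curve, i.e. the predicted minimum loss."""
    t = np.arange(1, len(loss_history) + 1, dtype=float)
    y = np.asarray(loss_history, dtype=float)
    k = min(n_fit, len(t))
    p0 = (y.min(), max(y[0] - y.min(), 1e-6), 0.5)   # rough starting values
    (a, b, c), _ = curve_fit(decay_model, t[:k], y[:k], p0=p0, maxfev=10000)
    return a

# Example: predict the optimum from the first 30 iterations only
# baseline = predict_baseline(losses, n_fit=30)
```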

Read the full article on GitHub, here. It is included as project 8.1 in my coursebook State of the Art in GenAI & LLMs — Creative Projects, with Solutions, available here. This 200+ page coursebook features many related projects, case studies, Python code, and datasets.

Prasenjit Singh

Technologist | Digital Innovation & Management

5 months ago

Intricate graph! The loops and twists represent the complex relationships within knowledge graphs. Fascinating to see how 346.18 and 346.21 are connected.

Abdelhakim M.

Research And Development Engineer

5 months ago

Spoiler alert: maximizing a loss function is the same as minimizing the negative of that loss function.

Stanley Waite - Inventor

Time is Everything and I Help Teach that You can take control of it, for better Mental & Physical Wellbeing - Thinking outside the Box with Business, Relationships and Pleasure.

5 months ago

Without understanding the context of the question that produces the answers/data points, all generative AI results are just garbage in the end: not able to create, just giving answers at the mean.
