30 Features that Dramatically Improve LLM Performance - Part 1

Read the full article here.

Many of these are ground-breaking innovations that make LLMs much faster and far less prone to hallucinations. They reduce cost, latency, and compute resources (GPU, training) by several orders of magnitude. Some of them improve security, making your LLM more attractive to corporate clients. I introduced a few of these features in my previous article, "New Trends in LLM Architecture". Now I offer a comprehensive list, based on the most recent developments.

1. From one trillion parameters to fewer than 5

By parameter, I mean the weight between two connected neurons in a deep neural network. How can you possibly replace one trillion parameters with fewer than 5, and still get better results, faster? The idea is to use parametric weights: the many weights are updated with a simple formula relying on a handful of explainable parameters, as opposed to neural network activation functions repeatedly updating billions of black-box parameters (the weights themselves) over time. I illustrate this in Figure 1. The example comes from my recent book, available here.


Figure 1: LLM for classification, with only 2 parameters
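To make this concrete, here is a minimal sketch of the parametric-weight idea. The PMI-style formula and the two parameters alpha and beta are illustrative assumptions, not the exact formula from the book; the point is that every token-pair weight is generated on the fly from summary counts, so only two numbers ever get tuned.

import numpy as np

def parametric_weight(pair_count, x_count, y_count, n, alpha=1.0, beta=0.5):
    # Weight linking two tokens, computed on demand from co-occurrence counts
    # via a PMI-style formula. Only alpha and beta are tuned, instead of
    # storing and training billions of individual weights.
    pxy = max(pair_count, 1e-9) / n       # joint frequency of the token pair
    px, py = x_count / n, y_count / n     # marginal frequencies
    pmi = np.log(pxy / (px * py))         # pointwise mutual information
    return max(pmi, 0.0) ** alpha * pxy ** beta

# Tuning reduces to a tiny grid search over (alpha, beta) on a validation set,
# rather than gradient descent over billions of weights.
w = parametric_weight(pair_count=40, x_count=500, y_count=300, n=100_000)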

2. Adaptive loss function

The goal of many deep neural networks (DNNs) is to minimize a loss function, usually via stochastic gradient descent. This is also true for LLMs that use transformers. The loss function is a proxy for the evaluation metric that measures the quality of your output. In supervised-learning LLMs (for instance, those performing supervised classification), you may use the evaluation metric itself as the loss function, to get better results. One of the best evaluation metrics is the full multivariate Kolmogorov-Smirnov distance (KS), see here, with a Python library here.
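To illustrate what using the evaluation metric directly would involve, here is a rough sketch (not the library linked above) that approximates a multivariate KS distance by comparing the joint empirical CDFs of a real and a synthetic sample at random probe points.

import numpy as np

def multivariate_ks(real, synth, n_probes=1000, seed=0):
    # Approximate multivariate KS: largest gap between the joint empirical
    # CDFs of the two samples, evaluated at probe points drawn from the
    # pooled data (an approximation of the supremum, not an exhaustive search).
    rng = np.random.default_rng(seed)
    pooled = np.vstack([real, synth])
    idx = rng.choice(len(pooled), size=min(n_probes, len(pooled)), replace=False)
    probes = pooled[idx]
    ecdf = lambda sample: np.array([(sample <= p).all(axis=1).mean() for p in probes])
    return float(np.abs(ecdf(real) - ecdf(synth)).max())

Each call scans both samples in full, which is exactly why it is too expensive to recompute from scratch at every weight update.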

But it is extremely hard to design an algorithm that applies billions of atomic changes to KS extremely fast, a requirement in all DNNs since the loss is re-evaluated each time you update a weight. A workaround is to use an adaptive loss function that slowly converges to the KS distance over many epochs. I did not succeed at that, but I was able to build one that converges to the multivariate Hellinger distance, the discrete alternative that is asymptotically equivalent to the continuous KS.
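The sketch below is only an illustration of the kind of O(1) atomic update described above, not the actual adaptive loss from the article. It maintains the Hellinger distance between a fixed target histogram p and a moving histogram q via the identity H^2(p, q) = 1 - sum_i sqrt(p_i * q_i), so moving one observation between bins touches just two terms of the sum.

import numpy as np

class AdaptiveHellinger:
    # Hellinger distance between a fixed target histogram p and a moving
    # histogram q, maintained incrementally: each atomic update costs O(1)
    # instead of a full recomputation over all bins.

    def __init__(self, target_counts, current_counts):
        self.p = np.asarray(target_counts, dtype=float)
        self.p /= self.p.sum()
        self.q_counts = np.asarray(current_counts, dtype=float)
        self.n = self.q_counts.sum()
        self.bc = np.sum(np.sqrt(self.p * self.q_counts / self.n))  # Bhattacharyya coefficient

    def move(self, src_bin, dst_bin):
        # Atomic update: one synthetic observation moves from src_bin to dst_bin.
        for b in (src_bin, dst_bin):
            self.bc -= np.sqrt(self.p[b] * self.q_counts[b] / self.n)
        self.q_counts[src_bin] -= 1
        self.q_counts[dst_bin] += 1
        for b in (src_bin, dst_bin):
            self.bc += np.sqrt(self.p[b] * self.q_counts[b] / self.n)
        return self.distance()

    def distance(self):
        return float(np.sqrt(max(0.0, 1.0 - self.bc)))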

Read more

This is the first post in a series of three; each one lists 10 features. To read the full article and learn about agentic LLMs, LLM routers, contextual tables, fast search, and more, follow this link.

Akshay Sharma

Recruitment Manager | Connecting Top U.S. IT and Non-IT Talent with Leading U.S. Clients | Your Partner in Excellence | IT Consultants for Public and Federal Clients

5 months ago

Hi, this is Kritika Sharma. I have a position for the title Finsys Analytics Consultant at Instacart (remote; PST- and CST-based candidates only) with a good pay rate. Please let me know if you are interested, or you can send me your updated resume at [email protected]


Instead of the first four lines, you can do local_hash = hash.get(key, {}), which is both shorter and faster.

This is great

Akshay Radha Manohar

AI Prompt Engineer @ Soul AI | Fine Tuning | Advanced Excel | Power BI | SQL | Python | Statistics | Python Libraries |

7 months ago

Your article on new LLM architectures is fascinating, especially the move from trillions of parameters to just a few and the adaptive loss function. Great work on these groundbreaking ideas! Your insights could significantly advance LLM technology and its practical uses. Thanks for sharing your expertise!

Danial Hosseinpour

Data Analyst | MSc by Research Student at the University of Huddersfield

7 months ago

Useful tips
