LLM complexity
Esau Rodriguez Sicilia
Chief Technology Officer at Scope Better | Leading Tech Innovations | Ex Octopus, Ex Triller
I went to university long ago, and back then neural networks were very far from where they are now. In recent years, computing capacity has increased so much that it has become possible to build far more complex networks, which in turn has driven more research on the topic and the really impressive results we are all accustomed to nowadays with GPT-4, Claude, etc. Right now the key players have created humongous networks that are incredibly useful.
I didn't learn much about neural networks at university, so I've been trying to get up to speed recently. I normally learn by doing, but in this case I had to understand the underlying concepts before I could start doing. I began by watching the excellent content from 3Blue1Brown.
The series covers the theoretical details with the right balance of precision and simplicity, which is really good. However, as I said before, I personally tend to learn by doing rather than by reading, listening or watching. Luckily for me, LLMs are a hot topic right now, so it was easy to find good content with a more practical approach. The great content by Andrej Karpathy came to the rescue; the next video is where things clicked together for me.
I've been playing with nanoGPT since then. It's small enough that you can run it on your laptop on either the CPU or the GPU, and it runs on Apple Silicon using the MPS backend. From doing this, it became very apparent how costly it is to train and run a neural network for LLM purposes. To achieve decent results you need several layers, several attention heads and many training iterations, and the number of calculations needed grows very quickly, to the point that you need to train for a very long time even with this toy-like network.
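To get a feel for how quickly the numbers blow up, here is a rough back-of-the-envelope sketch. The 12 · n_layer · d_model² formula is a common approximation for the transformer blocks, and the configurations are illustrative values of my own, not the exact nanoGPT presets.

```python
# Rough parameter count for a GPT-style model, to show how quickly size
# grows with depth and width. 12 * n_layer * d_model^2 approximates the
# attention + feed-forward weights; the configs below are illustrative only.

def approx_params(n_layer: int, d_model: int, vocab_size: int = 50257) -> int:
    blocks = 12 * n_layer * d_model ** 2   # attention + MLP weights per stack
    embeddings = vocab_size * d_model      # token embedding table
    return blocks + embeddings

for n_layer, d_model in [(6, 384), (12, 768), (24, 1024), (48, 1600)]:
    millions = approx_params(n_layer, d_model) / 1e6
    print(f"{n_layer:2d} layers, d_model={d_model:4d} -> ~{millions:,.0f}M parameters")
```

Doubling the depth and width a couple of times is enough to go from a toy model to something in the billions of parameters, and the training compute grows along with it.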
I've been counting FLOPS ever since. It's striking to see researchers suggesting roughly $10k machine configurations for their local work, to learn that the price of a single A100 is around that mark too, and to realise that companies like OpenAI are using hundreds if not thousands of them.
Another author whose content I love is Shaw Talebi, and in the next video he quotes a paper estimating the compute needed to train a model based on the number of parameters it uses.
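As a hedged sketch of that kind of estimate: a common rule of thumb from the scaling-law literature (not necessarily the exact formula in the paper he quotes) is roughly 6 FLOPs per parameter per training token, which lands in the same ballpark as the figures below.

```python
# Training-compute rule of thumb: ~6 FLOPs per parameter per training token
# (forward + backward pass). A common scaling-law approximation, not
# necessarily the exact formula from the quoted paper.

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# Illustrative inputs: a 1.8T-parameter model trained on ~12T tokens
# (both are public estimates, not confirmed figures).
print(f"{training_flops(1.8e12, 12e12):.2e} FLOPs")   # ~1.30e+26
```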
For context, GPT-4o is estimated to have around 1.8 trillion parameters, so according to the table in the video it would require over 1.27e+26 FLOPs to train. The consumer laptop I'm currently using can do around 5 TFLOPS, that is 5e+12 FLOPs per second. Let's forget about memory for now. If I were to train a model like that on my computer, it would take me nearly 300 million days. There is much better-equipped hardware for this, but that hardware is not cheap. A company in the LLM space fighting for ever bigger networks will need an incredible amount of money to be able to train them.
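Spelled out, with those figures taken at face value (the table's ~1.27e+26 FLOPs and my laptop's ~5 TFLOPS) and the very generous assumption that the laptop runs at full throughput the whole time:

```python
# Hypothetical training time on a ~5 TFLOPS consumer laptop, using the
# figures above and assuming full, uninterrupted throughput.
total_flops = 1.27e26        # estimated training compute for a ~1.8T-param model
laptop_flops = 5e12          # ~5 TFLOPS sustained (optimistic)

days = total_flops / laptop_flops / 86_400
print(f"~{days:,.0f} days (~{days / 365:,.0f} years)")   # ~294 million days
```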
A state-of-the-art H100 can do up to around 1,000 TFLOPS, so many of them would still be required to complete the training in a reasonable time, and each of them costs over $40,000.
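A hedged sketch of the same arithmetic from the cluster side, with hypothetical numbers: how many H100s would be needed to finish in 90 days, assuming (unrealistically) peak throughput and perfect linear scaling?

```python
# Hypothetical H100 cluster sizing for the same job, assuming peak throughput
# and perfect linear scaling -- both optimistic assumptions.
total_flops = 1.27e26
h100_flops = 1e15            # ~1,000 TFLOPS per H100 at low precision (peak)
target_days = 90
h100_price = 40_000          # rough USD price per card

n_gpus = total_flops / (h100_flops * target_days * 86_400)
print(f"~{n_gpus:,.0f} H100s, ~${n_gpus * h100_price / 1e6:,.0f}M in GPUs alone")
```

That comes out at roughly sixteen thousand GPUs and several hundred million dollars of hardware, before electricity, networking and the engineers to run it.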
Running an LLM is not cheap either. You might be able to use much more commodity hardware, but memory requirements grow with the number of parameters the network has, so even if you could run it on a CPU, a lot of memory would be needed. You can use quantization and similar tricks, but the requirements are still high.
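To make the memory point concrete, here is a quick sketch of the memory needed just to hold the weights at different precisions (activations, KV cache and runtime overhead come on top of this), for a few illustrative model sizes:

```python
# Approximate memory just to store the weights at different precisions.
# Activations, KV cache and runtime overhead are not included, so real
# usage is higher. Model sizes are illustrative.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for n_params in (7e9, 70e9, 1.8e12):
    row = ", ".join(f"{p}: {weight_memory_gb(n_params, p):,.0f} GB"
                    for p in BYTES_PER_PARAM)
    print(f"{n_params / 1e9:,.0f}B params -> {row}")
```

Even aggressively quantized, a trillion-parameter-class model simply does not fit on consumer hardware; only small models do.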
Probably the most recognizable company in AI right now is OpenAI. They have been very successful at creating a lot of awareness and interest in the space, and according to reports they are even monetising it quite well, but they are potentially in trouble. They are reportedly expected to lose $5B this year, and they will need to raise again within as little as 12 months to keep operating. Building cool technology and building a great business are not always the same thing.
One could expect Moore's Law to come to the rescue: in a few years, computing capacity would increase so much that these networks could be built on commodity hardware. The problem is that the networks themselves are growing significantly in size too, so the gains in computing power will likely be eaten up by the ever-increasing number of parameters.
I do think we will need to start looking at making networks not only bigger but also more efficient. How can we achieve the same quality of output with a smaller number of parameters? That, for me, is the key question.
I look at this as if we are currently living in the era of muscle cars: very big engines, not very efficient. Over time we got cars with similar capabilities in terms of acceleration and speed, but with much smaller, more efficient engines. There is work being done on this front, for instance Liquid Neural Networks. They are task-specific but produce pretty interesting results.