The Week of Small Language Models

This week marks an exciting trend in AI: the rise of small language models from Hugging Face, Mistral AI, NVIDIA, and OpenAI, among others. “LLM model size competition is intensifying… backwards!” quipped OpenAI founding member Andrej Karpathy (more on this below).

The week started off with Paris-based AI startup Mistral AI, in collaboration with NVIDIA, introducing Mistral NeMo, a 12-billion parameter model with a 128k token context length, alongside the launch of MathΣtral, a specialised 7B model for advanced mathematical reasoning and scientific exploration.

A few days later, Hugging Face released a new series of compact language models called SmolLM, available in three sizes: 135M, 360M, and 1.7B parameters. These models are small enough to run on local devices such as laptops and phones, removing the dependence on cloud resources and significantly reducing energy consumption.
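For readers who want to try one of these checkpoints locally, a minimal sketch using the Hugging Face transformers library might look like the following. The repo ID HuggingFaceTB/SmolLM-360M is our reading of the published checkpoint name; verify it on the Hub before running.

```python
# Minimal sketch: running a SmolLM checkpoint on a laptop CPU.
# Assumes `pip install transformers torch`; the repo ID below is the
# name we believe is published on the Hugging Face Hub and may change.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-360M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float32)

# Generate a short continuation entirely on-device, with no cloud calls.
inputs = tokenizer("Small language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```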

This was followed by OpenAI releasing GPT-4o mini, a highly cost-efficient model designed to expand AI applications by making intelligence more affordable. Priced at $0.15 per million input tokens and $0.60 per million output tokens, GPT-4o mini is roughly 30x cheaper than GPT-4o and more than 60% cheaper than GPT-3.5 Turbo.
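To make that pricing concrete, here is a back-of-the-envelope cost sketch at the quoted rates (the token counts in the example are purely illustrative):

```python
# Back-of-the-envelope cost of a GPT-4o mini call at the quoted rates:
# $0.15 per million input tokens, $0.60 per million output tokens.
INPUT_RATE = 0.15 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.60 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 2,000-token prompt with a 500-token reply costs $0.0006,
# i.e. well over a million such calls per thousand dollars.
print(f"${request_cost(2_000, 500):.6f}")
```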

Similarly, H2O.ai introduced H2O-Danube3, a new series of small language models (SLMs) that brings AI capabilities to mobile devices. The series includes two models: H2O-Danube3-4B, trained on 6 trillion tokens, and H2O-Danube3-500M, trained on 4 trillion tokens.

Apple, too, entered the game, releasing a model called DCLM-BASELINE 7B, along with its weights, training code, and dataset. Trained on 2.5 trillion tokens from open datasets, the model primarily uses English data and features a 2048-token context window.

In China, Alibaba released Qwen2 Base and Instruct models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B, trained on data in 27 languages besides English and Chinese.

A few months ago, researchers from IIT Gandhinagar, India, released Ganga-1B, a pre-trained LLM for Hindi, as part of their Unity project. Built from scratch on the largest curated Hindi dataset, the model outperforms all open-source Hindi-supporting LLMs of up to 7B parameters.

Earlier this year, Microsoft unveiled Phi-3-Mini, a 3.8 billion parameter language model trained on an extensive dataset of 3.3 trillion tokens. Despite its compact size, Phi-3-Mini boasts performance levels that rival larger models such as Mixtral 8x7B and GPT-3.5.

LLMs Get Cheaper

Karpathy recently said that the cost of building an LLM has come down drastically over the past five years due to improvements in compute hardware (H100 GPUs), software (CUDA, cuBLAS, cuDNN, FlashAttention) and data quality (e.g., the FineWeb-Edu dataset).

He said that it is now possible to train models like GPT-2 for approximately $672 on “one 8XH100 GPU node in 24 hours”.
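The arithmetic implied by that figure is simple; a quick sketch follows (the per-hour rates are back-solved from the quoted $672, not published prices):

```python
# Back-solving Karpathy's figure: $672 total for one 8xH100 node
# running for 24 hours implies ~$28 per node-hour, or ~$3.50 per
# GPU-hour. These rates are inferred, not quoted prices.
total_usd, hours, gpus_per_node = 672, 24, 8

node_rate = total_usd / hours         # 28.0 USD per node-hour
gpu_rate = node_rate / gpus_per_node  # 3.5 USD per GPU-hour

print(f"{node_rate:.2f} USD/node-hour, {gpu_rate:.2f} USD/GPU-hour")
```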

Mayank Singh, the creator of Ganga-1B, told AIM that it was built from scratch for under INR 10 lakh. Meanwhile, Tech Mahindra told us that it built Project Indus, which is likewise based on the GPT-2 architecture, from the tokeniser to the decoder, for well within $5 million (INR 41 crore).

Are SLMs the future of generative AI?

Abacus.AI chief Bindu Reddy predicted that in the next five years, smaller models will become more efficient, LLMs will continue to become cheaper to train, and LLM inference will become widespread. “We should expect to see several Sonnet 3.5 class models that are 100x smaller and cheaper in the next one to two years.”

Given the cost benefits, hundreds, if not thousands, of SLMs will soon mushroom in the space. But the question is, how are enterprises going to benefit from them? And are SLMs really the future of generative AI?

“Technology is rapidly evolving how we operate and train. Three months ago, using an SLM isolated in a customer's private database might have seemed like a drawback compared to a private instance of a large public model,” said Kasey Roh of Upstage, which has built an SLM called Solar, a fine-tune of Llama 2 and one of the top-ranked models on the Hugging Face Open LLM Leaderboard.

Further, Roh said that to address these challenges, the company has adopted a subscription model with its customers, allowing them to continuously train and fine-tune models with the latest data points.

She believes that this modular and standardised approach significantly mitigates the drawbacks of using small private models. “Once we have a complete set of the tools and programmes available, I think that the drawback of having a small model that's private can be largely addressable,” said Roh.

Many experts believe that SLMs or specialised models are going to be the future, alongside generalised large models like GPT-4 or Claude 3.5 Sonnet. “For everyday use, an 8B or maybe 70B LLM will suffice. If you don't want to test a model to the max, you don't need a SOTA model. For everyday questions, which now appear in all training data, smaller models are sufficient,” posted a user on X who goes by the name Kimmonismus, aka Chubby.

“Both specialised and generic models will coexist, one is not a replacement for the other. It’s the wrong dream to believe that we only need one API, such as OpenAI. We will need both mega models aiming for AGI and specialised micro models that can integrate into today’s workflows,” said Pramod Varma, chief architect of Aadhaar, in a recent interaction with AIM.

Check out the full story here.


Top Stories of the Week >>

Google Trying to Mimic Google Pay for Transportation via Namma Yatri

Back in 2017, Google made UPI and cashless transactions household terms with the launch of Google Pay in India. The app has over 67 million users in the country, its largest single market to date. Now, the company is trying to replicate the success of Google Pay to democratise transportation for all via Namma Yatri, the Bengaluru-based open-source rival of ride-hailing services like Ola and Uber.

Read the full story here.


People & Tech >>

In the Era of GenAI, MongoDB and PostgreSQL Will Run Out of Gas, DataStax Won't

Jihad Dannawi of DataStax, in an exclusive interview with AIM, highlighted the need for scalable databases like Apache Cassandra to support the growing complexity and real-time demands of generative AI applications.

He said that in the era of generative AI, “MongoDB will run out of gas; PostgresDB will run out of gas when it has to handle real-time processing or high volumes of data. To effectively manage large-scale operations, you require a robust infrastructure that can handle substantial demands without running out of capacity.”

Check out the full story here.

BharatGPT’s Ganesh Ramakrishnan’s AI Startup bbsAI Tackles Limited Indic Data Challenge

Apart from working on BharatGPT, Ganesh Ramakrishnan, a professor at IIT Bombay, has been dedicated to developing translation engines. In pursuit of this, he has co-founded bbsAI, a venture a decade in the making, with Ganesh Arnaal. Get to know more about bbsAI here.


AIM Videos >>

The Secret Behind Aadhaar Card Photos

In our latest episode of Tech Talks, AIM speaks to Pramod Varma, the CTO of EkStep Foundation and chief architect of Aadhaar & India Stack, about the secret behind our ‘unfortunate’ photos on Aadhaar cards, besides discussing some of the challenges and struggles in executing the Aadhaar project nationwide.


Workshop Wonders

RAG & Fine Tuning in GenAI with Snowflake

Join Prashant Wate from Snowflake India for an exciting workshop on RAG & Fine Tuning in GenAI. Learn how to optimise models and create seamless AI apps effortlessly.

Date: July 25, 2024

Time: 6:00-7:30 pm

Register now!


AIM Shots >>

  • A faulty software update from cybersecurity firm CrowdStrike caused one of the largest tech outages ever, resulting in the widespread “blue screen of death” on Microsoft Windows hosts, disrupting flights, financial institutions, and news broadcasts worldwide, with industries expected to feel the aftermath for weeks.
  • NVIDIA recently invested $30 million in San Jose and Bengaluru-based Arrcus to enhance its data traffic management platform, which serves major clients like SoftBank and Target, alongside scaling its global operations and technology infrastructure.
  • Salesforce recently unveiled its fully autonomous AI agent, Einstein Service Agent, poised to revolutionise customer service by augmenting human agents, making traditional chatbots obsolete.
  • Infosys, in its FY25 Q1 results, emphasised its dedication to generative AI through its Topaz platform, highlighting strong client traction but refrained from disclosing specific revenue figures, while TCS reported doubling its AI pipeline to $1.5 billion.
  • Bengaluru-based AI-powered personalisation platform Fibr recently raised $1.8 million in a funding round led by Accel, with participation from 2am VC and angel investors, including Cred founder Kunal Shah, aiming to enhance its platform, expand its customer base, and hire new talent.
  • Pintel.ai, a Bengaluru-based platform designed for sales development, raised $1 million in a seed funding round led by IvyCap Ventures, with participation from notable investors, to enhance its AI-driven solution for automating prospect research and improving sales engagement.

