Why Does India Need an Indic LLM?
Sooraj Divakaran
B2B Tech Marketer | prev. Lenovo, Infosys, TCS | ISB Alum | Award-Winning Creator
Earlier this year, Tech Mahindra announced the launch of Project Indus, an Indic-based foundational model for Indian languages.
Despite their multilingual capabilities, existing large language models often have limitations in comprehending and generating content in Indic languages.
We have seen these models often struggle when it comes to constructing coherent sentences. But is that the only problem we want to solve with an Indic LLM, or is there more to it?
Why Is Building an Indic LLM Difficult?
Most large language models today are trained predominantly on English data; hence, they struggle when generating output in Indic languages.
To effectively build an Indic LLM, you'll have to obtain quality datasets for training. However, there is a scarcity of datasets for Indic languages and dialects.
Recognizing the need to build Indic datasets, the government of India launched the Bhashini project last year.
Bhashini aims to develop language translation technologies that effectively translate content from one Indian language to another.
The government of India also launched an initiative titled Bhasha Daan to crowd-source voice datasets in multiple Indian languages. This is critical because India has 22 major languages and more than 19,500 dialects.
Even the team at Tech Mahindra recognized the complexity involved; hence, it plans to support only 40 Hindi dialects at launch, with additional languages and dialects to follow.
Microsoft launched Project Ellora earlier this year. The project aims to build an open repository and support some of the Indic languages that UNESCO considers vulnerable or endangered.
Educational institutions are also pitching in, with initiatives from the Indian Institute of Science (IISc) and IIT Madras (AI4Bharat).
AI4Bharat has already launched IndicTrans2, India's first open-source transformer-based multilingual NMT model. It supports high-quality translations across all 22 scheduled Indic languages.
Lastly, Tech Mahindra is gathering voice datasets using a dedicated website they have set up for Project Indus.
They say it takes a village to raise a child, but in the case of Indic datasets, it might take the entire country to build this open repository.
What Problems Will the Indic LLM Solve?
Sarvam launched its Indic LLM earlier this week, and most of your questions are answered in the write-up and the eighteen-minute demo they have put together.
Sarvam's OpenHathi model is built on top of Llama 2. It took me a while to make the connection, but that is why you see an elephant (hathi in Hindi) sitting on top of a llama.
They had to train a SentencePiece tokenizer and then merge it with the Llama 2 tokenizer to create a brand-new tokenizer for the model.
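To make the idea of merging two tokenizers concrete, here is a minimal sketch in plain Python. It is not Sarvam's code and does not use the SentencePiece API; the vocabularies and the merge_vocabularies helper are hypothetical. It only illustrates the general idea that the new pieces are appended to the existing vocabulary, so the original token ids stay unchanged.

```python
def merge_vocabularies(base_vocab, new_vocab):
    """Append pieces from new_vocab that base_vocab lacks.

    Existing pieces keep their position, so their ids do not change.
    """
    merged = list(base_vocab)
    seen = set(base_vocab)
    for piece in new_vocab:
        if piece not in seen:
            merged.append(piece)
            seen.add(piece)
    return merged


# Hypothetical, tiny vocabularies for illustration only.
base_vocab = ["<unk>", "▁the", "▁model", "ing"]
hindi_vocab = ["<unk>", "▁भारत", "▁भाषा", "ing"]

merged_vocab = merge_vocabularies(base_vocab, hindi_vocab)
token_to_id = {piece: idx for idx, piece in enumerate(merged_vocab)}

print(len(merged_vocab))       # size of the merged vocabulary
print(token_to_id["▁भारत"])    # id assigned to a newly added Hindi piece
```

In a real model, the embedding table would also have to grow to cover the new ids, which is one reason further training is needed after the merge.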
Let's take a moment to explain what a tokenizer is.
Imagine you have a big box of words and want a computer to understand and play with those words. But the computer doesn't see words like we do; it sees a long string of letters.
A tokenizer is a tool that breaks down sentences or phrases into smaller units, like words or subwords. It helps the model understand and process language more effectively by organizing the text into manageable parts.
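Here is a toy illustration of that idea, assuming a tiny hand-written vocabulary. Real tokenizers such as SentencePiece learn their vocabulary from data and use more sophisticated algorithms; this sketch just picks the longest known piece at each step, and the tokenize function and vocab set are made up for the example.

```python
def tokenize(word, vocab):
    """Split a word into the longest pieces found in vocab, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: emit the single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical vocabulary for illustration only.
vocab = {"token", "izer", "s", "un", "break", "able"}

print(tokenize("tokenizers", vocab))   # ['token', 'izer', 's']
print(tokenize("unbreakable", vocab))  # ['un', 'break', 'able']
print(tokenize("xyz", vocab))          # ['x', 'y', 'z']
```

The last call shows what happens to text the vocabulary does not cover: it shatters into many tiny tokens, which is roughly the problem Indic languages face with English-centric tokenizers.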
With the tokenizer the team at Sarvam has built, on average 70% of Hindi words across domains are no longer broken into smaller tokens.
This is significant. The more tokens the model uses, the more it adds to the cost.
For instance, my name in English uses six tokens, while the same name spelled in Hindi uses 14 tokens, which is more than double. You can see this in action for any Indian language using the OpenAI Tokenizer.
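A quick back-of-the-envelope calculation shows how the token counts quoted above (6 and 14) translate into cost. The price per 1,000 tokens below is a made-up placeholder, not a real rate from any provider.

```python
# Token counts quoted in the article for the same name.
english_tokens = 6
hindi_tokens = 14

# Hypothetical price, for illustration only.
PRICE_PER_1K_TOKENS = 0.002


def cost(n_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost of processing n_tokens at a flat per-1,000-token price."""
    return n_tokens / 1000 * price_per_1k


ratio = hindi_tokens / english_tokens
print(f"Hindi uses {ratio:.2f}x the tokens of English")
print(f"Relative cost: {cost(hindi_tokens) / cost(english_tokens):.2f}x")
```

Because pricing is linear in tokens, the cost ratio equals the token ratio regardless of the actual price.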
But for the model to be useful, the team at Sarvam also had to train it on 'world knowledge.'
The model was trained on approximately 150,000 Hindi Wikipedia articles, only about 2% of the total number of articles in English on the platform. To prepare the model further, Sarvam had to translate content from English sources, primarily Wikipedia, using IndicTrans2.
Let's take a moment to understand why the team performed this task: the answer is bilingual next token prediction.
This allows the model to predict the next token in a sentence while drawing on information from two languages, which means you could ask a question in English and get a reply in Hindi.
So, instead of using only Hindi text to predict the next token, the team took a bilingual approach: Hindi and English sentences are mixed alternately in the training text, so the model learns to use context from both languages.
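The alternating mix can be sketched as follows. This is a simplified illustration, not Sarvam's data pipeline: it assumes sentence-aligned English and Hindi versions of the same passage, and the interleave_bilingual function and example sentences are hypothetical.

```python
def interleave_bilingual(english_sentences, hindi_sentences, start_with_hindi=True):
    """Build one passage that alternates languages sentence by sentence.

    Assumes the two lists are aligned: sentence i in one list is the
    translation of sentence i in the other.
    """
    if len(english_sentences) != len(hindi_sentences):
        raise ValueError("sentence lists must be aligned and of equal length")
    mixed = []
    for i, (en, hi) in enumerate(zip(english_sentences, hindi_sentences)):
        use_hindi = (i % 2 == 0) == start_with_hindi
        mixed.append(hi if use_hindi else en)
    return " ".join(mixed)


english = [
    "India is a vast country.",
    "Many languages are spoken here.",
    "Hindi is one of them.",
]
hindi = [
    "भारत एक विशाल देश है।",
    "यहाँ कई भाषाएँ बोली जाती हैं।",
    "हिंदी उनमें से एक है।",
]

print(interleave_bilingual(english, hindi))
```

To predict the English sentence in the middle, a model trained on such text has to rely on the Hindi sentence before it, which is the intuition behind the bilingual approach.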
Since Sarvam's base model has been extensively trained on English text, it provides valuable responses in English.
To leverage this capability, users can use Chain-of-Thought (CoT) prompting, i.e., ask the model to generate outputs in English and then rewrite them in Hindi.
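As a rough illustration of that prompting pattern, the helper below builds a prompt that asks for an English answer first and a Hindi rewrite second. The wording of the prompt and the function name are assumptions made for this example; they are not taken from Sarvam's documentation.

```python
def build_english_then_hindi_prompt(question):
    """Return a prompt asking for an English answer, then a Hindi rewrite."""
    return (
        "Answer the question below in two steps.\n"
        "Step 1: Write the answer in English.\n"
        "Step 2: Rewrite the same answer in Hindi.\n\n"
        f"Question: {question}\n"
    )


prompt = build_english_then_hindi_prompt("What is the capital of India?")
print(prompt)
```

The resulting string would then be passed to the model through whatever interface you use to run it.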
Perhaps the most effective use case for an Indic LLM would be social media content moderation. Sarvam has partnered with Koo to fine-tune its model further. It also has partnerships with Kissan AI and VerSe; both platforms could derive tremendous value as the model improves over time.
Sarvam clearly has more work to do, but the progress it has made so far is inspiring.
A year from now, or perhaps in just a few months, we will witness the emergence of an Indic LLM that will revolutionize how people connect and conduct their business.
The potential use cases for this technology are virtually limitless.
Message for the reader: Hey, I know I'm breaking my one-newsletter-a-week rule, but I couldn't resist sharing this excitement with you. If you're into the AI scene, don't miss out on exploring Sarvam's OpenHathi on Hugging Face.
Your thoughts mean the world, so leave a comment if you enjoyed this read, and hit subscribe if you haven't already joined the fun!
11 months ago: The user base makes sense too and calls for separately fine-tuned LLMs for Indic languages.
11 months ago: Nice, insightful article, Sooraj Divakaran. It is known that LLMs face challenges in contextual understanding, dealing with misinformation, ethical issues, and creativity limitations, each crucially impacting their efficiency and applicability in India's complex linguistic landscape.