Why Does India Need an Indic LLM?

Earlier this year, Tech Mahindra announced the launch of Project Indus, a foundational model for Indian languages.

Despite their multilingual capabilities, existing large language models often have limitations in comprehending and generating content in Indic languages.

We have seen these models struggle to construct coherent sentences in Indic languages. But is that the only problem we want to solve with an Indic LLM, or is there more to it?

Why Is Building an Indic LLM Difficult?

Most large language models today are trained primarily on English datasets; hence, they struggle to generate output in Indic languages.

To effectively build an Indic LLM, you'll have to obtain quality datasets for training. However, there is a scarcity of datasets for Indic languages and dialects.

Recognizing the need to build Indic datasets, the government of India launched the Bhashini project last year.

Bhashini aims to develop language translation technologies that effectively translate content from one Indian language to another.

The government of India also launched an initiative titled Bhasha Daan to crowd-source voice datasets in multiple Indian languages. This is critical: India has 22 major languages and over 19,500 dialects.

Even the team at Tech Mahindra realized the complexity involved; hence, it plans to support only 40 Hindi dialects at launch, adding more languages and dialects subsequently.

Microsoft launched Project Ellora earlier this year. The project aims to build an open repository and support Indic languages that UNESCO considers vulnerable or endangered.

Educational institutions are also pitching in, with initiatives from the Indian Institute of Science (IISc) and IIT Madras (AI4Bharat).

AI4Bharat has already launched IndicTrans2, India's first open-source transformer-based multilingual NMT (neural machine translation) model. It supports high-quality translation across all 22 scheduled Indian languages.

Lastly, Tech Mahindra is gathering voice datasets using a dedicated website they have set up for Project Indus.

They say it takes a village to raise a child, but in the case of Indic datasets, it might take the entire country to build this open repository.

What Problems Will the Indic LLM Solve?

Sarvam launched its Indic LLM earlier this week; most of your questions are answered in the write-up and the eighteen-minute demo the team has put together.

Sarvam's OpenHathi model is built on top of Llama 2; it took me a while to get it, but that's why you see an elephant (hathi) sitting on top of a llama.

They had to train a SentencePiece tokenizer on Hindi text and then merge it with the Llama 2 tokenizer to create a brand-new tokenizer for the model.
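Sarvam hasn't published its exact merge script, but the core idea of extending a base vocabulary with new subword pieces can be sketched in plain Python. Real pipelines operate on SentencePiece model protos; the dict-based vocabularies below are purely illustrative.

```python
# Toy illustration of merging a new Hindi subword vocabulary into an
# existing (Llama-2-style) base vocabulary. Pieces already present in
# the base are skipped; new pieces get fresh ids appended at the end.

base_vocab = {"<s>": 0, "</s>": 1, "hello": 2, "wor": 3, "ld": 4}
hindi_pieces = ["नमस्ते", "भारत", "hello"]  # "hello" already exists

def merge_vocabs(base, new_pieces):
    """Append pieces not already in the base vocab, assigning fresh ids."""
    merged = dict(base)
    next_id = max(base.values()) + 1
    for piece in new_pieces:
        if piece not in merged:
            merged[piece] = next_id
            next_id += 1
    return merged

merged = merge_vocabs(base_vocab, hindi_pieces)
print(len(merged))  # 7: two new Hindi pieces added, the duplicate skipped
```

Keeping the base ids unchanged matters: the base model's embedding rows stay valid, and only the new rows need to be initialized and trained.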

Let's take a moment to explain what a tokenizer is.

Imagine you have a big box of words and want a computer to understand and play with those words. But the computer doesn't see words like we do; it sees a long string of letters.

A tokenizer is a tool that breaks down sentences or phrases into smaller units, like words or subwords. It helps the model understand and process language more effectively by organizing the text into manageable parts.
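To make this concrete, here is a toy greedy longest-match subword tokenizer. Real LLM tokenizers (BPE, SentencePiece) learn their vocabularies from data; this hand-picked vocabulary just illustrates the splitting behaviour described above.

```python
# A toy greedy longest-match subword tokenizer: at each position, take
# the longest vocabulary piece that matches, then continue. Unknown
# characters fall back to single-character tokens.

VOCAB = {"token", "izer", "un", "break", "able"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest matching vocabulary pieces."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])             # unknown character fallback
            i += 1
    return tokens

print(tokenize("tokenizer"))    # ['token', 'izer']
print(tokenize("unbreakable"))  # ['un', 'break', 'able']
```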

With this work, the Sarvam team reports that, on average, 70% of Hindi words across domains are no longer broken into smaller tokens.

This is significant. The more tokens the model uses, the more it adds to the cost.

For instance, my name in English uses up to six tokens, while the same name spelled in Hindi uses 14 tokens, which is more than double. You can see this in action for any Indian language using the OpenAI Tokenizer.

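Part of the reason Hindi costs more tokens is encoding: each Devanagari character takes 3 bytes in UTF-8, while ASCII takes 1. A tokenizer without Hindi subwords in its vocabulary falls back to byte-level pieces, so the token count grows with the byte count. (The strings below are illustrative, not OpenAI's exact token counts.)

```python
# Why Devanagari text is expensive for byte-fallback tokenizers:
# the same short name is 6 bytes in Latin script but 12 in Devanagari,
# because each Devanagari code point occupies 3 bytes in UTF-8.

english = "Sooraj"
hindi = "सूरज"

print(len(english), len(english.encode("utf-8")))  # 6 characters, 6 bytes
print(len(hindi), len(hindi.encode("utf-8")))      # 4 characters, 12 bytes
```

Adding whole Hindi words and subwords to the vocabulary, as Sarvam did, is what brings the token count back down.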

But to train the model effectively, the team at Sarvam also had to teach it 'world knowledge.'

The model was trained on approximately 150,000 Hindi Wikipedia articles, only about 2% of the number of English articles on the platform. To prepare the model further, Sarvam translated English-language content, primarily Wikipedia, into Hindi using IndicTrans2.

Let's take a moment to understand why the team performed this task: the answer is bilingual next-token prediction.

This allows the model to predict the next word or token in a sentence while considering information from two languages, which means you could ask a question in English and get a reply in Hindi.

So, instead of just using Hindi text to predict the next word, they suggested a bilingual approach. This means mixing Hindi and English sentences alternately. By doing this, the model can better understand and predict the next word, considering information from both languages.
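Sarvam's exact data recipe isn't public, but the alternation idea can be sketched as a small function that mixes parallel English and Hindi sentences into one training string. The sentence pairs here are my own examples, not Sarvam's data.

```python
# Hypothetical sketch of building bilingual training text for
# next-token prediction: alternate between an English sentence and
# the Hindi translation of the following sentence, so the model must
# carry context across languages.

def interleave(english_sents, hindi_sents):
    """Alternate languages sentence by sentence: EN, HI, EN, HI, ..."""
    mixed = []
    for i, (en, hi) in enumerate(zip(english_sents, hindi_sents)):
        mixed.append(en if i % 2 == 0 else hi)
    return " ".join(mixed)

en = ["India has 22 scheduled languages.", "Many have little web text."]
hi = ["भारत में 22 अनुसूचित भाषाएँ हैं।", "कई का वेब पाठ बहुत कम है।"]

print(interleave(en, hi))
# India has 22 scheduled languages. कई का वेब पाठ बहुत कम है।
```

Because the next token may be in either language, the model is pushed to align its Hindi and English representations rather than treat them as separate worlds.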

Since Sarvam's base model has been extensively trained on English text, it provides valuable responses in English.

To leverage this capability, users can use Chain-of-Thought (CoT) prompting, i.e., ask the model to generate outputs in English and then rewrite them in Hindi.
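A prompt for that trick might look like the template below. This is an illustrative template I wrote for this article, not Sarvam's official prompt format or API.

```python
# Hypothetical Chain-of-Thought prompt template: ask the model to
# answer in English first (where the base model is strongest), then
# rewrite the answer in Hindi.

COT_TEMPLATE = (
    "Question: {question}\n"
    "First, answer the question in English.\n"
    "Then, rewrite your answer in Hindi.\n"
)

def build_prompt(question):
    """Fill the template with a user question."""
    return COT_TEMPLATE.format(question=question)

print(build_prompt("What is the capital of India?"))
```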

Perhaps the most effective use case for an Indic LLM is social media content moderation. Sarvam has partnered with Koo to fine-tune its model further. It also has partnerships with Kissan AI and VerSe; both platforms could derive tremendous value as the model improves over time.

Sarvam clearly has more work to do, but the progress made so far is inspiring.

A year from now, or perhaps in just a few months, we will witness the emergence of an Indic LLM that will revolutionize how people connect and conduct their business.

The potential use cases for this technology are virtually limitless.


Message for the reader: Hey, I know I'm breaking my one-newsletter-a-week rule, but I couldn't resist sharing this excitement with you. If you're into the AI scene, don't miss out on exploring Sarvam's OpenHathi on Hugging Face.

Your thoughts mean the world, so leave a comment if you enjoyed this read, and hit subscribe if you haven't already joined the fun!

Vidya Sagar Pasupuleti


11 months ago

The user base makes sense too and calls for separately fine-tuned LLMs for Indic languages.

Anuradha Shiv


11 months ago

Nice, insightful article, Sooraj Divakaran. It is known that LLMs face challenges in contextual understanding, dealing with misinformation, ethical issues, and creativity limitations, each crucially impacting their efficiency and applicability in India's complex linguistic landscape.
