ChatGPT for Bahasa Indonesia
ChatGPT for Bahasa Indonesia by Datasaur.ai, GLAIR.ai and Prosa.ai

Authors: Ayu Purwarianti, Prosa.ai Co-Founder & Chief Scientist and Associate Professor at ITB; Hammam Riza, KORIKA President; Ivan Lee, Datasaur.ai Founder & CEO; Michell Handaka, GLAIR.ai Founder & CEO; On Lee, GDP Labs CEO & CTO and GDP Venture CTO

FULL DISCLOSURE: GDP Venture has invested in Datasaur.ai, GLAIR.ai, and Prosa.ai.

Introduction

Today, we announce the development of a “ChatGPT for Bahasa Indonesia.”

In today's rapidly evolving technological landscape, groundbreaking advancements set the stage for future innovations. One such revolutionary development is the Large Language Model (LLM), exemplified by OpenAI's ChatGPT.

However, most LLM research has predominantly centered on English, leaving a void in the market for other languages and concentrating the technology's advantages primarily among English-speaking nations.

Problem: ChatGPT Limitations

Despite its impressive growth, reaching over 100 million users within two months of launch, OpenAI's ChatGPT has certain limitations:

  1. Limited Bahasa Indonesia support: ChatGPT's training data for Bahasa Indonesia is significantly smaller than that for English, resulting in limited support for the language. According to Statista data from January 2023 on the languages most used for web content, English accounts for 58.8% of websites while Indonesian accounts for only 0.6%. This disparity highlights the need for research and development dedicated to Bahasa Indonesia.
  2. Country-specific knowledge: As a "Jack of all trades, master of none," ChatGPT lacks specialized, in-depth knowledge about particular countries, topics, and industries. For example, it easily recognizes US brands like Coca-Cola but may not recognize Indonesian household brands like Limun Linggardjati.
  3. Outdated information: ChatGPT's training data encompasses material up to 2021, meaning it lacks knowledge of events and developments after that date. Consequently, it cannot provide real-time updates on weather conditions, stock market prices, and other current affairs.

Recently, there has been a notable increase in demand from Indonesian companies looking for ChatGPT-like capabilities tailored specifically for Bahasa Indonesia.

To address this demand, Datasaur.ai, GLAIR.ai, and Prosa.ai have collaborated to develop a Bahasa Indonesia-specific LLM that caters to the diverse needs of businesses in the region by addressing the above ChatGPT limitations.

Solution: Promising Preliminary Results

Below are some preliminary results where “ChatGPT for Bahasa Indonesia” outperforms ChatGPT.

Legal Questions

Below is an example of the chatbot answering questions about the Omnibus Law (“Undang-Undang Cipta Kerja”):

Question: Apa itu ketenagakerjaan?

English translation: What is employment?

Expected answer: See the image below.

[Image: Expected answer]

English translation: Article 1 - In this law, the following terms are defined: Employment refers to all matters related to the labor force before, during, and after the period of employment.

ChatGPT-4 answer: The response is overly general and fails to cite the origin of the information provided. This can lead to mistrust regarding whether the information is correct.

[Image: ChatGPT-4 answer]

ChatGPT for Bahasa Indonesia answer: The response is precise and concise. The definition is derived directly from a government document source, which is cited and provided.

[Image: ChatGPT for Bahasa Indonesia answer]

Financial Questions

Question: Berapa limit harian transfer antar bank?

English translation: What is the daily transfer limit between banks?

Expected answer: For this initial model, our Bahasa Indonesia training data includes a corpus of information provided by BCA Bank. See the image below.

[Image: Expected answer]

English translation: A table of Interbank Transfer rates

ChatGPT-4 answer: The response is overly general and fails to cite the origin of the information provided.

[Image: ChatGPT-4 answer]

ChatGPT for Bahasa Indonesia answer: The response is precise and concise. The answer is derived directly from the information on BCA’s website.

[Image: ChatGPT for Bahasa Indonesia answer]

Conclusion

Developing a Bahasa Indonesia LLM offers significant advantages to Indonesian companies and users: it better understands country- and language-specific prompts and gives more concise and precise answers.

The preliminary results of ChatGPT for Bahasa Indonesia are encouraging. These findings demonstrate that it is feasible to harness the capabilities of an LLM and tailor it specifically for Bahasa Indonesia.

Future developments will focus on feeding in more diverse types of data, including Indonesia’s many local dialects and everyday slang, and on providing better tools for understanding document scans, tables, and images via Optical Character Recognition (OCR).

Together, we are ushering in a new era of language technology that will shape the future of communication and collaboration among speakers of Bahasa Indonesia.

Yos Vincenzo

Experienced Enterprise Sales with Focus on Solving Customer Business Problem

1 year ago

Looking forward to further advancements on this!

Sahal Zain

Solution Architect

1 year ago

Is it fine-tuning? Using the same GPT-4 model? Or is it pairing GPT-4 with semantic search?

Ibrahim Arief

Building Asah & Daily Friend AI | 1x unicorn exit | Fortune's 40u40

1 year ago

I don't see the need to build a Bahasa Indonesia-specific GPT model. Even though OpenAI's training corpus is predominantly English-language media, the sheer volume of training data they use means there is already an adequate volume of Bahasa Indonesia text in it. In fact, in the GPT-4 technical paper [https://arxiv.org/pdf/2303.08774.pdf], Figure 5 on page 8 shows that the model's performance in understanding and generating proper answers in Bahasa Indonesia (83.1%) is already nearly comparable with its English-language performance (85.5%), putting Bahasa Indonesia on par with German, French, and Spanish (83%-84%). The examples showcased in your article demonstrate models trained with an additional fine-tuned contextual corpus, which is a positive improvement in itself, but they do not necessarily represent a novel, quantifiable improvement in Bahasa Indonesia performance compared to the GPT-4 model used in ChatGPT.

Restu Kresnadi

Chief Data Officer - E-commerce, supply chain, and marketing leader

1 year ago

So where is the ChatGPT for Bahasa Indonesia? Can you share a link so we can access it?

Prasidhi Artono

Enterprise Digital Specialist, Azure @ Microsoft

1 year ago

Great stuff!
