ChatGPT for Bahasa Indonesia
Authors: Ayu Purwarianti, Prosa.ai Co-Founder & Chief Scientist and Associate Professor ITB, Hammam Riza, KORIKA President, Ivan Lee, Datasaur.ai Founder & CEO, Michell Handaka, GLAIR.ai Founder & CEO, On Lee, GDP Labs CEO & CTO and GDP Venture CTO
FULL DISCLOSURE: GDP Venture has invested in Datasaur.ai, GLAIR.ai, and Prosa.ai.
Introduction
Today, we announce the development of a “ChatGPT for Bahasa Indonesia.”
In today's rapidly evolving technological landscape
However, LLM research has predominantly centered on English, leaving a void in the market for other languages and concentrating the technology's advantages among English-speaking nations.
Problem: ChatGPT Limitations
Despite its impressive growth and success, ChatGPT, OpenAI's Large Language Model, which garnered over 100 million users within two months of its launch, has certain limitations:
Recently, there has been a notable increase in demand from Indonesian companies
To address this demand, Datasaur.ai, GLAIR.ai, and Prosa.ai have collaborated to develop a Bahasa Indonesia-specific LLM that caters to the diverse needs of businesses in the region by addressing the above ChatGPT limitations.
Solution: Promising Preliminary Results
Below are some preliminary results where “ChatGPT for Bahasa Indonesia” outperforms ChatGPT.
Below is an example of the chatbot answering questions about the Omnibus Law (“Undang-Undang Cipta Kerja”):
Question: Apa itu ketenagakerjaan?
English translation: What is employment?
Expected answer: See the image below.
English translation: Article 1 - In this law, the following terms are defined: Employment refers to all matters related to the labor force before, during, and after the period of employment.
ChatGPT-4 answer: The response is overly general and fails to cite the origin of the information provided. This can lead to mistrust regarding whether the information is correct.
ChatGPT for Bahasa Indonesia answer: The response is precise and concise. The definition is derived directly from a government document source, which is cited and provided.
Question: Berapa limit harian transfer antar bank?
English translation: What is the daily transfer limit between banks?
Expected answer: For this initial model, our Bahasa Indonesia training data includes a corpus of information provided by BCA Bank. See the image below.
English translation: A table of Interbank Transfer rates
ChatGPT-4 answer: The response is overly general and fails to cite the origin of the information provided.
ChatGPT for Bahasa Indonesia answer: The response is precise and concise. The answer is derived directly from the information on BCA’s website.
Conclusion
Developing a Bahasa Indonesia LLM offers significant advantages to Indonesian companies and users: it better understands country- and language-specific prompts, and its answers are more concise and precise.
The preliminary results of ChatGPT for Bahasa Indonesia are encouraging. These findings demonstrate that it is feasible to harness the capabilities of an LLM and tailor it specifically for Bahasa Indonesia.
Future developments will focus on feeding in more diverse types of data, including Indonesia’s many local dialects and everyday slang, and on providing better tools for understanding document scans, tables, and images via Optical Character Recognition (OCR).
Together, we are ushering in a new era of language technology that will shape the future of communication and collaboration among speakers of Bahasa Indonesia.
Experienced Enterprise Sales with Focus on Solving Customer Business Problem
1 year ago: Looking forward to further advancement on this!
Solution Architect
1 year ago: Is it fine-tuning? Using the same GPT-4 model? Or is it pairing GPT-4 with semantic search?
Building Asah & Daily Friend AI | 1x unicorn exit | Fortune's 40u40
1 year ago: I don't see the need to build a Bahasa Indonesia-specific GPT model. Even though OpenAI's training corpus is predominantly English-language media, the sheer volume of training data they use means there is already an adequate volume of Bahasa Indonesia text. In fact, Figure 5 on page 8 of the GPT-4 technical report [https://arxiv.org/pdf/2303.08774.pdf] shows that the model's performance in understanding and generating proper answers in Bahasa Indonesia (83.1%) is already nearly comparable with English (85.5%), putting Bahasa Indonesia on par with German, French, and Spanish (83%-84%). The examples showcased in your article demonstrate models trained with an additional fine-tuned contextual corpus, which is a positive improvement in itself, but does not necessarily represent a novel and quantifiable improvement in Bahasa Indonesia performance compared to the GPT-4 model used in ChatGPT.
Chief Data Officer - E-commerce, supply chain, and marketing leader
1 year ago: So where is the ChatGPT for Bahasa Indonesia? Can you share the link for us to access?
Enterprise Digital Specialist, Azure @ Microsoft
1 year ago: Great stuff!