CodeMix/CodeSwitch in ChatGPT: Hit or Miss?
ChatGPT has taken the world by storm and raised a lot of eyebrows in the research community since OpenAI released it in November 2022. Although similar models have appeared in the recent past (e.g., DeepMind's Sparrow), ChatGPT has gained immense popularity owing to its range of generation capabilities.
Code-mixing, or code-switching, is the mixing of languages within the same context, predominantly observed in multilingual societies. Codemixing has gradually become the lingua franca of social media conversations and posts. In the current era of social media, codemixing is extremely relevant, especially from the perspective of modern-day AI and NLP.
In this article, we explore the potential of ChatGPT in codemix scenarios. We experiment and analyze the responses of ChatGPT for several codemixed prompts and summarize the capabilities and bottlenecks of ChatGPT with respect to codemixing.
We used Hindi-English (Hinglish) and Spanish-English (Spanglish) codemix settings to keep the analysis simple and understandable. We analyzed the responses to understand where ChatGPT performs well and where it fails in codemixed settings.
ChatGPT: Understanding and Generating Codemixed Content
In certain cases, ChatGPT is able to understand as well as generate codemixed content. In the following example, ChatGPT understands the codemixed query in Spanglish (Spanish-English codemix) and generates a proper response.
Query: Describe una review en English para la mejor película española
Translation: Write a review in English for the best Spanish film
ChatGPT: Sure, here's an example review in English for one of the best Spanish films: "Talk to Her" (2002), directed by Pedro Almodovar, is a masterpiece of Spanish cinema. The film follows two men, a nurse named Benigno and a journalist named Marco, who form an unlikely bond over their mutual love for comatose women. Through Almodovar's signature blend of drama, comedy, and surrealism, "Talk to Her" explores themes of love, loss, and the power of human connection. The film features exceptional performances from its lead actors, including Javier Camara as Benigno and Dario Grandinetti as Marco, who deliver nuanced and emotionally resonant portrayals of their complex characters. The cinematography, art direction, and music all contribute to the film's dreamlike atmosphere, immersing the viewer in Almodovar's unique vision. Overall, "Talk to Her" is a beautiful and thought-provoking work of art that showcases the best of Spanish cinema. It is a must-see for fans of Almodovar's work, as well as anyone interested in exploring the depths of the human experience.
In some cases, ChatGPT is able to generate perfectly codemixed sentences in which the language switching occurs with high precision and the generated words and sentences follow standard spellings, unlike social media text, where codemixing is most prevalent. The following is an example:
Query: ChatGPT features kya hai?
Translation: What are the features of ChatGPT?
ChatGPT: ChatGPT ke kuch key features neeche diye gaye hai:
ChatGPT: Role of Data
To understand the codemix capability of ChatGPT, we analyzed several of its responses. When asked whether it is trained on a specific codemix corpus, ChatGPT responded as follows:
Query: Is Chatgpt trained on Codemix?
ChatGPT: No, ChatGPT is not specifically trained on CodeMix Text data. ChatGPT is a large language model developed by OpenAI that has been trained on a diverse range of text data from the internet, including news articles, books, websites, and more.
ChatGPT uses a transformer-based architecture that is similar to the GPT-2 model used by CodeMix Text, which allows it to generate coherent and fluent text in response to prompts or questions. However, ChatGPT has been trained on a much larger and more diverse dataset, which enables it to understand and generate text in a wide range of topics and contexts.
This suggests that ChatGPT is not specifically built to understand codemix. It is language agnostic: like multilingual Large Language Models (LLMs) such as M-BERT and XLM-RoBERTa, where huge amounts of data from multiple languages are pumped into a single model, ChatGPT, with arguably the largest dataset for LLMs so far, is capable of handling codemix, but without complete awareness of its usage.
From our analysis of several codemix queries to ChatGPT and the corresponding responses, we found that because ChatGPT is trained on a multitude of data from several languages, as well as codemixed data, its understanding of codemixed contexts is fairly good. However, as the model is not aware of this usage, its codemixed generation is inconsistent and falls short of expectations. For example, when asked to generate content specifically in codemix, the ratio of mixed words is very low:
Query: Tell me something about India in Hinglish
ChatGPT: Bharat ek bahut hi sundar desh hai jahan anek bhashayein boli jaati hain aur vibhinna sanskritiyon ke log rehte hain. Yahan ki khana-pina aur desh ki vastuon ka prachalan poori duniya mein hai. Yahan par dher saare dharmik sthal, puraatatv sangrahalay, sundar vaadiyan aur sahyogiyon se bhara vyaparik mahaul hai. Bharat ki film aur television industry poori duniya mein mashhoor hai. Iske alawa, yahan ki literature, classical sangeet aur naach-gaana bhi bohot lokpriya hain. Is desh mein alag-alag rajyon mein alag-alag shailiyon ka vastra pehna jaata hai aur yahan ki handicrafts poori duniya mein mashhoor hain.
The above response has 96 words, with 5 words in English and 91 words in Hindi. The codemix ratio is therefore about 0.05 (5/96, roughly 5%), which is extremely low. This reinforces the point made above: ChatGPT is decent at understanding codemix context, but due to a lack of awareness, its generation is not up to the mark.
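The ratio quoted above can be reproduced with a few lines of Python. The following is a minimal sketch, not the procedure used for this analysis: the function name `codemix_ratio` and the hand-supplied English word list are assumptions made for illustration, and whitespace tokenization is a simplification. It is applied to one sentence from the response above, and the article's own counts (5 English words out of 96) are plugged in directly at the end.

```python
import string


def codemix_ratio(text, english_words):
    """Return (english_count, total_count, ratio) for a romanized codemixed text.

    `english_words` is a hand-supplied set of lowercase English tokens
    (an assumption for this sketch; a real pipeline would need
    token-level language identification).
    """
    # Split on whitespace, then strip surrounding punctuation such as "," and "."
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    english = sum(1 for t in tokens if t in english_words)
    total = len(tokens)
    return english, total, (english / total if total else 0.0)


# One sentence taken from the Hinglish response above.
sentence = "Bharat ki film aur television industry poori duniya mein mashhoor hai."
ENGLISH_WORDS = {"film", "television", "industry"}  # hand-labelled for this sentence

english, total, ratio = codemix_ratio(sentence, ENGLISH_WORDS)
print(english, total, round(ratio, 3))  # 3 11 0.273

# The counts reported in the article for the full response.
article_ratio = 5 / 96
print(f"{article_ratio:.3f} = {article_ratio:.1%}")  # 0.052 = 5.2%
```

Note that a single sentence can look heavily mixed (3 of 11 words here) while the response as a whole stays close to monolingual, which is why the ratio is computed over the full response.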
Another important aspect that we observed in our analysis is that for generic prompts (e.g., when ChatGPT is asked to talk about itself or to write an essay on a generic topic), the generation is of high quality. In contrast, when ChatGPT generates content on non-generic or domain-specific topics, the quality of both the generation and the codemixing degrades. This follows from the same logic: ChatGPT is data-centric and is tuned more for generic responses, as such data is more abundant and more of the fine-tuning has been done on it.
ChatGPT: Generating variations in codemix responses
One of the most highlighted features of the GPT-3 model, released by OpenAI in May 2020, was the ability to paraphrase and summarize content. ChatGPT, which is built on top of GPT-3.5 or InstructGPT, inherits this feature. This essentially means that it is able to generate variations in content for the same or similar prompts.
Unfortunately, the variations are not reflected in the codemix parts of the content. Hence, if the same or a similar prompt is given, the codemixed portions of the responses are almost the same.
Essentially, this means that ChatGPT generates its responses and their variations by reworking content learned from its training data. Further, as it is not conscious of the codemix parts of the response, it is unable to generate variations of them. This is exhibited in the following examples, where ChatGPT is queried about the Indian healthcare situation using prompts that are similar in meaning. ChatGPT is able to generate variations in its responses, but the exact same words are used to introduce codemixing in both responses, without any variation:
Query: India ke swasth dekhbhal situation ke bare mein tell me
Translation: Tell me something about the healthcare situation in India
ChatGPT: Bharat mein swasthya dekhbhaal ki sthiti kaafi kathin hai. Desh ki aabaadi bahut badi hai aur iske saath hi, kayi logon ko basic swasthya suvidhaon tak pahunchane mein mushkil hoti hai. Kayi logon ke paas koi bhi aisi suvidha nahin hai jo unhe badi bimariyon se bacha sake.
Desh mein kuchh mukhya samasyaon mein se ek hai malnutrition, jiski vajah se bachchon mein raktalpata (anemia), kamzori aur anya bimariyan hoti hain. Iske alava, desh mein kayi jagahon par saaf paani aur swachhata ki kami hai, jiski vajah se kai bimariyan phail jaati hain.
COVID-19 mahamari ne bhi Bharat ke swasthya vyavastha par bahut bada prabhav dala hai. Pehle toh vaccine ki kami thi aur phir bhi ek baar vaccination shuru hone ke baad bhi, kayi jagahon par logon ke liye vaccine ki upalabdhata ki samasya aayi. COVID-19 ke karan kayi logon ko jeevan ke lie larna pada aur iske alava, desh mein abhi bhi COVID-19 ke cases aur usse judi bimariyon ki sankhya adhik hai.
Halaanki, Bharat ke swasthya vyavastha ko sudhaarna ki koshish ki jaa rahi hai. Sarkaar ne ayushman bharat yojana jaisi yojanaon ko shuru kiya hai, jiske tahat garib logon ko muft ilaaj diya jata hai. Bharat ke alag-alag hisson mein COVID-19 vaccination drives chal rahi hain aur logon ko vaccine di ja rahi hai. Iske alava, kayi NGOs aur sarkari sansthayein bhi desh ke swasthya par dhyaan deti hain aur unhe sudhaane ki koshish karti hain.
Query: Swasth care ki halat India mein kya hai tell me
Translation: Tell me something about the healthcare situation in India
ChatGPT: Bharat mein swasthya dekhbhaal ki sthiti bahut kathin hai. Desh mein kuchh mukhya samasyaon mein se ek hai malnutrition, jiski vajah se bachchon mein raktalpata (anemia), kamzori aur anya bimariyan hoti hain. Iske alava, desh mein kayi jagahon par saaf paani aur swachhata ki kami hai, jiski vajah se kai bimariyan phail jaati hain.
Bharat mein kayi jagahon par swasthya suvidhaon ki kami hai. Garib logon ko badi bimariyon se ladne ke lie koi suvidha nahin hai aur kayi jagahon par swasthya se sambandhit suvidhaon ki kami hai. Iske alava, jahan tak swachhata ki baat hai, desh mein bahut si jagahon par logon ko safai ki suvidhaon ki kami hai, jiski vajah se kayi bimariyan phailti hain.
COVID-19 ke dauraan, Bharat ki swasthya vyavastha aur bhi jyada prabhavit hui. Vaccine ki kami ke karan, desh mein kayi logon ko vaccine nahi di ja paayi aur COVID-19 ke cases aur usse judi bimariyon ki sankhya adhik hai. Iske alava, kayi jagahon par sahi COVID-19 dekhbhaal nahin kiya jaa raha hai aur logon ko sahi ilaaj nahin diya ja raha hai.
Halaanki, sarkar aur kayi NGOs aur sansthayein desh ke swasthya par dhyaan deti hain aur unhe sudhaane ki koshish karti hain. Sarkar ne ayushman bharat yojana jaisi yojanaon ko shuru kiya hai, jiske tahat garib logon ko muft ilaaj diya jata hai. Iske alava, kayi NGOs aur sarkari sansthayein bhi desh ke swasthya par dhyaan deti hain aur unhe sudhaane ki koshish karti hain.
In both responses generated by ChatGPT above, English words are used in a Hindi context, which qualifies the responses as codemix. Almost every paragraph across the two examples captures similar content but expresses it differently, as the responses are generated from the same content and summarized in a different manner. The codemix parts of the responses (English words such as "malnutrition", "anemia", "vaccine", "cases", and "NGOs") are almost an exact match. The ratio of mixing is very low here as well, as discussed earlier.
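The overlap between the codemix parts of the two responses can be quantified with simple set measures. The sketch below is illustrative only: the two English word sets were picked by hand from the responses above (an assumption, not an annotation supplied with the original analysis; "COVID-19" is left out as a proper name), and the helper names `jaccard` and `containment` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """Fraction of the items in `a` that also occur in `b`."""
    return len(a & b) / len(a) if a else 1.0


# Hand-picked English words from each Hinglish response (assumed annotation).
english_in_response_1 = {
    "basic", "malnutrition", "anemia", "vaccine",
    "vaccination", "cases", "drives", "ngos",
}
english_in_response_2 = {"malnutrition", "anemia", "vaccine", "cases", "ngos"}

shared = sorted(english_in_response_1 & english_in_response_2)
print(shared)  # ['anemia', 'cases', 'malnutrition', 'ngos', 'vaccine']
print(jaccard(english_in_response_1, english_in_response_2))      # 0.625
print(containment(english_in_response_2, english_in_response_1))  # 1.0
```

Under this annotation, every English word in the second response already appears in the first, which is consistent with the observation that the codemixed vocabulary does not vary even when the surrounding Hindi text does.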
Conclusion and Inferences
From the points discussed above, the impact of ChatGPT in terms of understanding and generation of codemix content can be summarized as follows:
Final Note: Overall, ChatGPT understands codemixed context but is unable to generate decent codemix content, due to its language-agnostic nature. This opens an avenue for focused codemix research, in which language models and resources are built specifically to address codemix understanding and generation. These models should be able to understand and generate content not only in codemix but also in the constituent languages of the codemix settings.