Translating Code-Switched Languages
Our project builds a sequence-to-sequence model for code-switched translation. It takes as input a sequence of text mixing two source languages and produces a sequence of equivalent meaning in one target language, or vice versa. Our focus was Hinglish to English only, as we did not see any meaningful application of English to Hinglish.
Full Project Presentation including more about the models here
Motivation and Contributions/Originality:
The "What" of Code-Switching : Code-switching is a phenomenon where multilingual speakers mix words from more than one language within a single conversation or sentence. It is best understood with an example: "हाँ muje भी ऐसे hi lagtha लेकिन personally muje facebook इतना भी पसंद नहीं hein." (roughly: "Yes, I felt the same, but personally I don't like Facebook that much.") This sentence mixes two languages, Hindi and English. In addition to code-switching, some Hindi words are in Latin script while others are in the Devanagari Hindi script.
The "Where" of Code-Switching : Code-switching is very common in multilingual parts of the world (e.g., India, the Philippines) and among multilingual speakers. While conventional machine translation maps one language to another, code-switched machine translation must translate from a mixed language into a single language and vice versa.
The "Why" of Code-Switching : There are many reasons why people code-switch between languages:
Directive function - to exclude others from conversation
Expressive function - to express part of one's identity
Reference function - if unable to express idea in one language
Phatic function - repeat in both languages to emphasize
In our research we found that, while research papers exist on the topic, there is little or no available solution for Hinglish code-switched machine translation.
Code-switching is very common these days, be it in text over social media, SMS messages, etc., or in speech. There are a number of applications where translation into a majority language (e.g., English) is needed to improve understandability for users who do not understand the code-switched language.
Real Life Application of the Project :
Our project is aimed at providing a solution to a real-life company project that needs to translate and understand the code-switched conversations of suspects and potential criminals whose phone lines are monitored and tapped by police and security agencies. Criminals in this part of the world frequently code-switch in their conversations.
There could be many more applications, such as a translation feature for such code-switched posts on Facebook or Twitter.
Code-switched language options could also be offered in Google Translate or any other translation service, none of which currently provide them.
Note: Due to the confidential nature of the project, we will not be using the actual conversation data. Instead we will use a publicly available Hinglish-English parallel corpus. The real-life project would also require automatic speech recognition (ASR); our focus for the 585 project is not code-switched ASR, or ASR at all.
We combined several publicly available Hinglish-English parallel datasets into a parallel corpus of 28464 code-mixed Hindi-English sentences and their corresponding English translations. The primary source of the data is Twitter, and it was available to us from the websites listed below:
Linguistic Code Switching Evaluation Benchmark
A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation
The IIT Bombay parallel English-Hindi corpus
Hindi-English parallel corpora
Cleaned and Integrated new corpus
We combined all the source datasets into one parallel corpus of 28464 Hinglish-English sentence pairs. We split this corpus randomly into train, dev and test sets, each provided as both ".tsv" and ".json" files.
Train data: 20464 sentence pairs. Dev data: 4000 sentence pairs. Test data: 4000 sentence pairs.
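The random split described above can be sketched as follows (a minimal stand-in corpus is generated in place of the real merged dataset, and the seed is an assumption, used only to make the split reproducible):

```python
import random

# Stand-in for the combined parallel corpus of 28464 (Hinglish, English)
# pairs; the real data is loaded from the merged source files.
corpus = [(f"hinglish {i}", f"english {i}") for i in range(28464)]

random.seed(42)  # assumed seed, only so the split is reproducible
random.shuffle(corpus)

test_set = corpus[:4000]
dev_set = corpus[4000:8000]
train_set = corpus[8000:]

print(len(train_set), len(dev_set), len(test_set))  # 20464 4000 4000
```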
Errors and Corrections Made
Some of the data was not correct gold English; it was in Latin-script Hindi, seen specifically where the conversation was about songs, e.g.: "GuessTheSong Dil mein khwab jaage jaage, hai nigaahein khoyi khoyi Dance Dance!"
Some words are heavily shortened, like "plz"; we plan to correct them using an English dictionary.
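The planned shorthand correction can be sketched as a simple lexicon lookup. The mapping below is an illustrative assumption; the real correction would consult a full English dictionary:

```python
# Illustrative shorthand-to-English mapping (an assumption for this
# sketch; a full English dictionary would drive the real correction).
SHORTHAND = {"plz": "please", "u": "you", "r": "are", "thx": "thanks"}

def expand_shorthand(sentence: str) -> str:
    """Replace known shortened tokens with their full English forms."""
    return " ".join(SHORTHAND.get(tok.lower(), tok) for tok in sentence.split())

print(expand_shorthand("plz call me"))  # -> please call me
```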
Data cleaning was required on the source and target sentences for the following attributes:
We removed the above-mentioned attributes from our corpus, since doing so involves only text deletions.
We found some instances in both the source and target sentences that were completely blank; we removed them, as they would not be beneficial.
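Dropping blank pairs is a straightforward filter over the parallel corpus; a minimal sketch (the example pairs are illustrative):

```python
def drop_blank_pairs(pairs):
    """Keep only pairs where both the source and the target are non-blank."""
    return [(src, tgt) for src, tgt in pairs if src.strip() and tgt.strip()]

pairs = [
    ("muje bhi aisa lagta hai", "I feel the same way"),
    ("", "orphan target"),      # blank source: dropped
    ("orphan source", "   "),   # blank target: dropped
]
cleaned = drop_blank_pairs(pairs)
print(cleaned)  # only the first pair survives
```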
The models we tried, mBART (pretrained) and mT5, do not work well on the Latin-script Hindi in the Hinglish sentences. Therefore we converted the Latin Hindi script to Devanagari Hindi script using a lexicon-based approach with a publicly available dictionary. To skip English words, we detected them using the Enchant Python library.
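The transliteration step can be sketched as below. The tiny LATIN_TO_DEVANAGARI mapping and the ENGLISH_WORDS set are illustrative stand-ins: the real pipeline uses the public Latin-to-Devanagari dictionary and Enchant's English-dictionary check.

```python
# Tiny illustrative stand-ins (assumptions for this sketch): the real
# pipeline uses a public Latin-to-Devanagari lexicon and the Enchant
# library's dictionary check to detect English tokens.
LATIN_TO_DEVANAGARI = {"muje": "मुझे", "bhi": "भी", "pasand": "पसंद"}
ENGLISH_WORDS = {"personally", "facebook", "i", "like"}

def to_devanagari(sentence: str) -> str:
    out = []
    for tok in sentence.split():
        low = tok.lower()
        if low in ENGLISH_WORDS:
            out.append(tok)  # leave detected English tokens untouched
        else:
            # Map Latin-script Hindi via the lexicon; unknown tokens pass through.
            out.append(LATIN_TO_DEVANAGARI.get(low, tok))
    return " ".join(out)

print(to_devanagari("muje personally facebook pasand hai"))
```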
Detailed .ipynb notebook for Data preprocessing and corpus analysis can be found here
After the above data pre-processing, the corpus was ready for translations.
We explored state-of-the-art models with multilingual capabilities. Since code-switching is inherently multilingual, these models were a natural choice.
We performed code-switched neural machine translation using the models below with the PyTorch framework.
Multilingual sequence-to-sequence denoising auto-encoder using the BART objective (mBART)
Multilingual Text-to-Text Transfer Transformer (mT5)
We used Google Colab as our computing infrastructure. We used the Google Colab TPU for mBART and mT5 training/fine-tuning and testing, while the Google Colab GPU was sufficient for testing the pretrained mBART model.
**Our Models: Pretrained and Fine-Tuned:**
We completed Hinglish-to-English machine translation using the pretrained mBART model
Translation results here, Source code to run the translation here
We completed fine-tuning of the pretrained vanilla mBART cc-25 model above by training it on our corpus for Hinglish-to-English machine translation.
Pretrained mBART model further fine-tuned up to checkpoint 2500
Pretrained mBART model further fine-tuned checkpoints
We also completed further fine-tuning of the existing pretrained and fine-tuned mBART many-to-many model by training it on our corpus for Hinglish-to-English machine translation.
Pretrained and fine-tuned mBART model further fine-tuned up to checkpoint 3000 ->
Pretrained and fine-tuned mBART model further fine-tuned up to checkpoint 4500 ->
We completed fine-tuning of the pretrained mT5 model above by training it on our corpus for Hinglish-to-English machine translation.
Translation results here, Source code to run the translation here
Since fine-tuning of the above-mentioned models takes a very long time, we followed a strategy of using the models saved at intermediate checkpoints to evaluate the translation.
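The checkpoint strategy amounts to evaluating each saved intermediate model on the dev set and keeping the best one. A minimal sketch, with hypothetical dev scores (illustrative numbers only, not our actual results):

```python
# Hypothetical dev-set BLEU per saved checkpoint (illustrative numbers
# only, not our actual results), to show the selection strategy.
dev_bleu_by_checkpoint = {500: 9.1, 1000: 11.4, 1500: 12.0, 2000: 12.6, 2500: 12.4}

# Pick the checkpoint whose intermediate model scored best on dev.
best_ckpt = max(dev_bleu_by_checkpoint, key=dev_bleu_by_checkpoint.get)
print(best_ckpt)  # -> 2000
```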
There were a number of challenges while performing code-switched translation:
The Google Colab Pro GPU, with its RAM limitation, is not sufficient to train Transformer models like mBART and mT5. We had to switch to the Google Colab TPU.
Data cleaning for both the source and target languages is challenging, but it is the most important part of performing neural machine translation.
The most challenging piece of data cleaning is the Hinglish language, for which we used a lexicon-based Latin Hindi to Devanagari Hindi approach.
These models, particularly mBART, also took a long time to translate.
Credits: Project team of Gurpreet Bedi, Yi Yan and Shounak (UBC MDS)
Cover page image credit : https://owlcation.com/humanities/Code-Switching-Definition-Types-and-Examples-of-Code-Switching
Please write to firstname.lastname@example.org to request access to the source code for this project