Classifying English Language Learners
Updated: Dec 31, 2020
Problem: Read the written journal entries of English language learners and predict whether each writer's native language is Asian or European.
A key distinction exists between European and East Asian languages in terms of their relationship to the "L2" (second language) in this case, English. Although English is in a different branch of the Indo-European language family (Germanic), Spanish and French nonetheless share numerous linguistic similarities with English (particularly with regards to having a shared educated vocabulary), whereas the East Asian Languages, having developed quite separately, have only coincidental similarities with English. One might presume this linguistic distance generally makes English more challenging for East Asian learners, who are in essence "starting from scratch" in a way that Spanish and French speakers are not. Credit : Julian Brooke
Solution Approach: We took a classical NLP approach, iteratively hand-crafting features from our hypotheses about what might separate the two classes.
Features Considered :
1. TTR (Type Token Ratio)
We chose TTR because we assume that Asian learners of English have a smaller vocabulary (and therefore a lower TTR) than European learners, given that English has borrowed numerous words from European languages, especially French.
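As a minimal sketch of the ratio described above, TTR can be computed as the number of distinct tokens divided by the total number of tokens. The regex tokenizer and the function names here are illustrative assumptions, not the project's actual code.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Return distinct tokens / total tokens, or 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

Note that TTR tends to fall as a text gets longer, so comparing it across entries of very different lengths can be misleading.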
2. article_density
We chose this article-usage feature because the definite and indefinite articles that are central to English do not exist in many Asian languages. We therefore assume that Asian learners of English lack a frame of reference for articles when writing in English and are more likely to omit them. French and Spanish, on the other hand, do have articles, so learners from those backgrounds can transfer article usage from their L1s.
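One way to turn this hypothesis into a number is the share of tokens that are articles. The sketch below assumes a simple regex tokenizer and a three-word article list; the project's exact definition is not given in this write-up.

```python
import re

# English articles; the list is an assumption made for this sketch.
ARTICLES = {"a", "an", "the"}


def article_density(text):
    """Return the fraction of word tokens that are articles (0.0 if empty)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in ARTICLES for token in tokens) / len(tokens)
```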
3. has_greeting
Based on our observation of the texts, Asian learners of English are more likely to write ‘nice to meet you’ near the start of their entries, showing respect and politeness in line with their cultural background. European learners rarely use this greeting, so we chose it as a feature.
4. has_hey_INTJ
We chose this feature because, based on our observation, European learners tend to open their writings with a strong show of emotion. ‘Hey!’ is a typical opener they use.
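The two opener features can be approximated with plain pattern matching. This is only a sketch: the 200-character window that stands in for "the initial part" is a hypothetical choice, and the INTJ suffix in the feature name suggests the project may have relied on a part-of-speech tagger instead.

```python
import re


def has_greeting(text, window=200):
    """True if 'nice to meet you' occurs in the first `window` characters."""
    opening = text[:window].lower()
    return re.search(r"\bnice\s+to\s+meet\s+you\b", opening) is not None


def has_hey(text):
    """True if the text opens with the word 'Hey' (any capitalization)."""
    return re.match(r"\s*hey\b", text, flags=re.IGNORECASE) is not None
```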
5. has_lexicon_asn / has_lexicon_euro
These two features identify the names of countries, cities, and native languages. Based on our observation of genre variation, most L2 learners of English give a brief introduction of their background, including the name of their home country or city, such as Japan or Tokyo. In addition, many learners mix their native language into their English writing, using French accented letters (à, è, é, …), Spanish accented letters (ñ, …), or Chinese, Japanese, and Korean characters. We therefore assume these are important features to look at.
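A rough sketch of the two lexicon features follows. The word lists are tiny placeholders and the character ranges are assumptions made for illustration; the lexicons actually used in the project are not listed in this write-up.

```python
import re

# Tiny illustrative lexicons (hypothetical, not the project's lists).
ASIAN_TERMS = {"japan", "tokyo", "korea", "seoul", "china", "beijing"}
EURO_TERMS = {"france", "paris", "spain", "madrid"}

# Hiragana and Katakana, CJK unified ideographs, Hangul syllables.
CJK_PATTERN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")
# A few French and Spanish accented letters and punctuation marks.
ACCENT_PATTERN = re.compile(r"[àâçèéêëîïôùûñáíóú¿¡]", re.IGNORECASE)


def _words(text):
    """Return the set of lowercase ASCII words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def has_lexicon_asn(text):
    """True if the text names an Asian place or contains CJK characters."""
    return bool(_words(text) & ASIAN_TERMS) or bool(CJK_PATTERN.search(text))


def has_lexicon_euro(text):
    """True if the text names a European place or contains accented letters."""
    return bool(_words(text) & EURO_TERMS) or bool(ACCENT_PATTERN.search(text))
```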
6. has_subordinator
This feature identifies Because, But, and So written with a capitalized first letter. We chose it because Asian learners of English have equivalent connectives in their L1s but use them in different logical relations than English does. For example, they more often place them in sentence-initial position or use all of them in a single sentence. Hence, we assume that this usage is more typical of Asian learners’ English.
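Because the feature is defined by capitalization, a case-sensitive regular expression is enough for a first approximation. The helper names below are illustrative, and the count variant is an extra added only for this sketch.

```python
import re

# Case-sensitive on purpose: only the capitalized forms are counted.
CAPITALIZED_CONNECTIVE = re.compile(r"\b(?:Because|But|So)\b")


def has_subordinator(text):
    """True if a capitalized Because, But, or So appears anywhere in the text."""
    return CAPITALIZED_CONNECTIVE.search(text) is not None


def count_subordinators(text):
    """Number of capitalized Because/But/So occurrences in the text."""
    return len(CAPITALIZED_CONNECTIVE.findall(text))
```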
7. jj_rb_density
This feature is the number of adjectives and adverbs divided by the number of words. We chose it on the assumption that Asian learners use fewer words from these classes than European learners. French and Spanish have a large number of adjectives similar to their English counterparts (for example, fantastic versus French fantastique) thanks to shared language families and loanwords, so European learners’ vocabulary of adjectives and adverbs is relatively larger than that of Asian learners.
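The feature name suggests Penn Treebank tags (JJ for adjectives, RB for adverbs). Assuming the text has already been run through a part-of-speech tagger that emits such tags, the density can be sketched as below; the tagger itself and the input format are assumptions.

```python
def jj_rb_density(tagged_tokens):
    """Fraction of tokens tagged as adjective (JJ*) or adverb (RB*).

    `tagged_tokens` is a list of (word, tag) pairs using Penn Treebank
    tags, as produced by a POS tagger of your choice.
    """
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _, tag in tagged_tokens if tag.startswith(("JJ", "RB")))
    return hits / len(tagged_tokens)
```

Matching on the tag prefix also counts comparative and superlative forms (JJR, JJS, RBR, RBS).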
8. modal_frequency
We chose this feature on the assumption that Asian learners of English use more modal verbs than European learners. In both Japanese and Korean, suffixes added to the verb root express attitude and politeness, so we assume that these learners are more likely to add modal verbs to strengthen a claim or show their attitude. Mandarin has direct equivalents of modal verbs such as ‘can’, which are used very often in native-language writing, so we assume that an L1 transfer may be taking place.
9. particular_word_count
Based on our observation of the Korean, Japanese, and Mandarin corpora, “good” is one of the five most commonly used words. It is absent from the top words of the Spanish and French corpora, so we assume this is an essential feature.
10. plural_usage
It is well known that neither Japanese nor Chinese marks the singular/plural distinction unless absolutely necessary (speakers often add determiners and quantifiers in front of nouns instead). We assume that this grammatical trait transfers from the L1 to English writing and therefore helps in classification.
11. preposition density
It is commonly believed that Japanese, Korean, and Mandarin either do not have prepositions (Japanese and Korean have postpositions instead) or use them differently (Mandarin). We assume that Asian learners avoid linguistic markers such as prepositions that do not exist in their L1s, because they expect to use them incorrectly. This is why we chose this feature.
12. Reading ease
We assume that the two groups differ in reading ease because of their different vocabulary sizes and levels of English proficiency. As mentioned previously, Asian learners of English seem to have a relatively small vocabulary, so their writing should be easier to read than European learners’. This feature should help separate the two groups.
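The write-up does not say which readability formula was used. As one plausible choice, the sketch below computes the Flesch Reading Ease score, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), where higher means easier to read. The syllable counter is a crude vowel-group heuristic and is an assumption of this sketch.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease score; returns 0.0 if the text has no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(word) for word in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Short sentences built from short words score high, while long sentences with polysyllabic vocabulary score low, which is the contrast the hypothesis above relies on.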
13. total length
We assume that the two groups differ in total sentence length. Based on our observations of the texts, Asian learners produce longer sentences than European learners because they use more conjunctions, such as ‘and’, to connect clauses.
14. sent_num
Based on our observation of the texts, European learners’ writings contain more sentences than Asian learners’ writings, so we chose this feature to help in classification.
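The length features can be sketched with a naive sentence splitter. The exact definition of total_length is not spelled out in this write-up, so treating it as a word count is an assumption, and mean_sentence_length is an extra shown only for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def length_features(text):
    """Return sentence count, word count, and mean words per sentence."""
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z']+", text)
    sent_num = len(sentences)
    total_length = len(words)  # assumption: total length counted in words
    mean_length = total_length / sent_num if sent_num else 0.0
    return {
        "sent_num": sent_num,
        "total_length": total_length,
        "mean_sentence_length": mean_length,
    }
```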
So, how did the features perform? Did they work as we expected? What did we end up dropping?
Many features did not affect the score during ablation, so we rejected the corresponding hypotheses and dropped them: article_density, TTR, has_lexicon_asn, has_greeting, has_hey_INTJ, has_subordinator, particular_word_count, plural_usage, preposition_density, sent_num, and modal_frequency. The results are not exactly what we expected. We suspect that some features are highly correlated. We had also assumed that features such as has_hey_INTJ, has_subordinator, and article_density would be important, but they turned out to matter less in a decision tree limited to a depth of three.
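A leave-one-out ablation of the kind described here can be sketched as follows. The function name, the tolerance parameter, and the score_fn callback (which would train and validate a model on a given feature subset) are all hypothetical; the project's actual ablation code is not shown in this write-up.

```python
def ablation_report(features, score_fn, tolerance=0.0):
    """Measure how much the score drops when each feature is removed.

    `score_fn` takes a list of feature names and returns a validation
    score. Features whose removal costs at most `tolerance` are
    reported as candidates to drop.
    """
    baseline = score_fn(list(features))
    drops = {}
    for name in features:
        reduced = [f for f in features if f != name]
        drops[name] = baseline - score_fn(reduced)
    droppable = [name for name, drop in drops.items() if drop <= tolerance]
    return baseline, drops, droppable
```

Removing one feature at a time can understate the value of correlated features, since a correlated partner may compensate for the one removed. That is consistent with the suspicion about correlation voiced above.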
The four remaining features that drove the accuracy score are has_lexicon_euro, jj_rb_density, total_length, and reading_ease. We fed these features to a decision tree model.
The final decision tree :
Test classification accuracy achieved: 75.8%
We also tried other algorithms to improve on the decision tree score; the results are below. We achieved only a minor improvement. (Note that test_score in the table below is the validation set score, not the score on the test data.)
Our scores on test data:
Random Forest on test set: 76.6%
CatBoost on test set: 76.2%
Credits: Project team of Mia, Yi Yan (UBC MDS Candidate) and Shounak Mondal
Please write to email@example.com to request access to the Jupyter notebook for this project.