Shounak Mondal

Understanding and Comparing Text Corpus

Evaluating and comparing corpora is a fundamental part of exploratory data analysis.

Below I explore three genres (topics) within the Brown corpus on the following four metrics, and then present supporting visuals and inferences.

(1) Average Sentence Length - a quick indication of how easy a text is to read. For a more robust indication of readability, refer to my other article, "Readability".

(2) % POS (Parts of Speech) that are Adjectives - indicative of how much the nouns in the text are described.

(3) Lexical Density - the proportion of the text made up of nouns, adjectives, verbs and adverbs. It tells us how informative the text is.

(4) Top 50 most strongly associated words in each genre, with word clouds to get a sense of the genre.
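Metrics (1) to (3) can be sketched in a few lines of Python. This is a minimal sketch, not the code from my notebook: it assumes sentences arrive as lists of (word, tag) pairs using the universal tagset (where punctuation is tagged "."), it excludes punctuation from all word counts, and the function names and the toy sample are made up for illustration.

```python
from typing import List, Sequence, Tuple

# Assumption: universal tagset, where content words carry these tags
# and punctuation is tagged ".".
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}
PUNCT_TAG = "."

TaggedSentence = Sequence[Tuple[str, str]]


def _word_tags(tagged_sents: Sequence[TaggedSentence]) -> List[str]:
    """Return the tag of every non-punctuation token."""
    return [tag for sent in tagged_sents for (_, tag) in sent if tag != PUNCT_TAG]


def average_sentence_length(tagged_sents: Sequence[TaggedSentence]) -> float:
    """Mean number of non-punctuation tokens per sentence."""
    if not tagged_sents:
        return 0.0
    return len(_word_tags(tagged_sents)) / len(tagged_sents)


def adjective_share(tagged_sents: Sequence[TaggedSentence]) -> float:
    """Percentage of non-punctuation tokens tagged as adjectives."""
    tags = _word_tags(tagged_sents)
    if not tags:
        return 0.0
    return 100.0 * sum(tag == "ADJ" for tag in tags) / len(tags)


def lexical_density(tagged_sents: Sequence[TaggedSentence]) -> float:
    """Percentage of non-punctuation tokens that are content words."""
    tags = _word_tags(tagged_sents)
    if not tags:
        return 0.0
    return 100.0 * sum(tag in CONTENT_TAGS for tag in tags) / len(tags)


# Toy sample (hypothetical data, not taken from the Brown corpus).
sample = [
    [("The", "DET"), ("old", "ADJ"), ("horse", "NOUN"),
     ("ran", "VERB"), ("quickly", "ADV"), (".", ".")],
    [("She", "PRON"), ("watched", "VERB"), ("it", "PRON"), (".", ".")],
]

print(average_sentence_length(sample))
print(adjective_share(sample))
print(lexical_density(sample))
```

With NLTK installed and the Brown corpus and universal tagset mapping downloaded, the same functions should accept `nltk.corpus.brown.tagged_sents(categories="adventure", tagset="universal")`, and likewise for "religion" and "editorial".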

Below are the results for the three genres from the Brown corpus: Adventure, Religion and Editorial.

Word Cloud for Adventure

Word Cloud for Religion

Word Cloud for Editorial Genre

Top 50 common words most strongly associated with each of the three genres:

50 words strongly associated with genre : adventure 

 ['ground', 'wide', 'heard', 'surprise', 'burned', "you'd", "you're", 'grass', 'quickly', 'steps', 'head', 'woman', 'corner', 'door', 'water', 'blood', 'voice', 'looked', 'lifted', "wouldn't", 'walked', 'dark', 'fingers', 'she', 'walk', 'minutes', 'shot', 'hit', 'gray', 'oh', 'turned', "that's", 'started', 'yards', 'throat', 'horse', "didn't", 'waiting', 'drink', 'eyes', 'wife', 'front', 'got', 'watched', 'girl', 'sat', 'her', 'caught', 'maybe', "i'm"]
 50 words strongly associated with genre : religion 

 ['number', 'claim', 'kingdom', 'created', 'personality', 'mere', 'code', 'mercy', 'certainty', 'relatives', 'intended', 'modern', 'beings', 'image', 'fill', 'existence', 'reality', 'terms', 'nature', 'devoted', 'consciousness', 'creation', 'moral', 'human', 'everywhere', 'soul', 'ourselves', 'experience', 'numbers', 'structure', 'ideas', 'aspects', 'positive', 'england', 'persons', 'opportunities', 'belong', 'religion', 'thou', 'eternal', 'magic', 'sacred', 'spirit', 'represented', 'evil', 'thee', 'god', 'holy', 'faith', 'spiritual']
 50 words strongly associated with genre : editorial 

 ['authority', 'gained', 'pictures', 'operation', 'prevent', 'project', 'providence', 'comment', 'press', 'facts', 'policy', 'citizens', 'peaceful', 'news', 'including', 'survive', 'parents', 'aj', 'armed', 'cities', 'city', 'increase', 'economic', 'task', 'politics', 'countries', 'defeat', 'responsible', 'problems', 'wants', 'east', 'club', 'november', 'patient', 'national', 'seek', 'lack', 'chicago', 'mr.', 'medical', 'british', 'official', 'nations', 'interested', 'united', 'tests', 'party', 'foreign', 'army', 'industrial']
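This post does not spell out how "association" was scored, so the following is only one plausible way to produce lists like the ones above, not necessarily what the notebook does. It ranks words by a smoothed log-ratio of their relative frequency in the target genre versus the other genres combined. The function name, the add-one smoothing and the `min_count` filter are assumptions made for this sketch.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def top_associated_words(
    genre_tokens: Dict[str, Iterable[str]],
    target: str,
    n: int = 50,
    min_count: int = 1,
) -> List[str]:
    """Rank words in `target` by log(P(word | target) / P(word | other genres)).

    Add-one smoothing keeps the ratio finite for words absent elsewhere.
    """
    counts = {
        genre: Counter(token.lower() for token in tokens)
        for genre, tokens in genre_tokens.items()
    }
    target_counts = counts[target]
    target_total = sum(target_counts.values())

    # Pool every other genre into one reference distribution.
    rest_counts: Counter = Counter()
    for genre, genre_counts in counts.items():
        if genre != target:
            rest_counts.update(genre_counts)
    rest_total = sum(rest_counts.values())

    vocab_size = len(set(target_counts) | set(rest_counts))

    scores = {}
    for word, count in target_counts.items():
        if count < min_count:
            continue  # skip words too rare in the target genre to trust
        p_target = (count + 1) / (target_total + vocab_size)
        p_rest = (rest_counts[word] + 1) / (rest_total + vocab_size)
        scores[word] = math.log(p_target / p_rest)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Toy data (hypothetical, not the Brown corpus).
toy = {
    "adventure": ["horse", "horse", "door", "the"],
    "religion": ["god", "faith", "the", "the"],
    "editorial": ["policy", "city", "the", "the"],
}
print(top_associated_words(toy, "adventure", n=2))
```

On real data, raising `min_count` helps keep very rare words from dominating the ranking, since a word seen once in one genre and never elsewhere gets a high ratio by chance.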

Conclusions:

1. Sentence length for the adventure genre is shorter, indicating that works of fiction use shorter sentences for easier reading than religious texts do.

2. % POS Adjective for the adventure genre is much lower, indicating that works of fiction describe nouns less than editorial and religious texts do. This is expected, as the adventure genre has fewer nouns than religion or editorial.

3. There is no major difference in lexical density, but as expected editorial has the highest lexical density, indicating that it is more informative than adventure or religious text.

IPython Notebook in GitHub Repo
