Shounak Mondal

Text Corpus Readability

Updated: Dec 30, 2020

We are often unsure whether a piece of text is too complex for readers in a given age group. This is a useful application that classifies any text into one of three readability groups: (1) Primary School Level, (2) Secondary School Level and (3) College Level and above.

Step 1 : Load the CMU Pronouncing Dictionary, which is available through the NLTK library. It contains pronunciations of English words.
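
A minimal sketch of this step is shown here. The function name load_cmu_dict is our own choice for illustration, and the fallback download assumes network access; the original notebook may load the dictionary differently.

```python
def load_cmu_dict():
    """Return the CMU Pronouncing Dictionary as {word: [pronunciation, ...]}.

    Each pronunciation is a list of phonemes, and vowel phonemes carry a
    stress digit, e.g. "cat" maps to [["K", "AE1", "T"]].
    """
    # Imported inside the function so this sketch can be defined even
    # where NLTK is not installed (an illustrative choice, not a requirement).
    import nltk
    from nltk.corpus import cmudict

    try:
        return cmudict.dict()
    except LookupError:
        # The corpus data is not on disk yet: fetch it once, then retry.
        nltk.download("cmudict")
        return cmudict.dict()
```

Calling load_cmu_dict() once and reusing the returned dictionary avoids re-reading the corpus for every word.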


Step 2 : Write a function to get syllables for a word.
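
One way to sketch this: in the CMU dictionary every vowel phoneme ends in a stress digit (0, 1 or 2), so counting those phonemes gives the syllable count. The name count_syllables, the choice of the first listed pronunciation, and the vowel-group fallback for words missing from the dictionary are all assumptions made for this example.

```python
import re


def count_syllables(word, pronunciations):
    """Count syllables in `word` using a CMU-style pronunciation dict."""
    word = word.lower()
    if word in pronunciations:
        # Use the first listed pronunciation; vowel phonemes end in a digit.
        first = pronunciations[word][0]
        return sum(1 for phoneme in first if phoneme[-1].isdigit())
    # Fallback heuristic (an assumption): one syllable per run of vowels,
    # and at least one syllable for any word.
    return max(1, len(re.findall(r"[aeiouy]+", word)))


# A tiny hand-written dictionary in CMU format, for illustration only.
sample = {"cat": [["K", "AE1", "T"]],
          "water": [["W", "AO1", "T", "ER0"]]}
print(count_syllables("water", sample))   # found in the dictionary
print(count_syllables("banana", sample))  # falls back to the heuristic
```

The fallback is rough (it miscounts silent vowels, for instance), but it keeps the function from failing on words the dictionary does not cover.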

Step 3 : Write a function to get reading ease for a sentence.

The reading ease is obtained using the Flesch reading-ease score, one of the Flesch–Kincaid readability tests. Higher scores mean the text is easier to read.
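
The Flesch reading-ease score is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). For a single sentence the first ratio is just the word count. The sketch below is an illustration under assumptions: the name reading_ease, the regex tokenizer, and the simple vowel-group syllable counter (a stand-in so the example runs without the CMU dictionary) are ours.

```python
import re


def simple_syllables(word):
    """Stand-in syllable counter: one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def reading_ease(sentence, syllable_counter=simple_syllables):
    """Flesch reading-ease score for one sentence, or None if it has no words."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return None
    syllables = sum(syllable_counter(w) for w in words)
    # With a single sentence, words / sentences is simply len(words).
    return 206.835 - 1.015 * len(words) - 84.6 * syllables / len(words)


print(reading_ease("The cat sat."))
```

Passing the syllable counter as a parameter lets a dictionary-based counter replace the stand-in without changing the formula.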

Step 4 : Write a function to get readability for the entire corpus.
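
A sketch of this step, assuming the corpus score is the mean of the per-sentence scores (the original notebook may aggregate differently). The cutoffs of 50 and 70 are inferred from the discussion of the results; the function names and the treatment of a score of exactly 50 or 70 as secondary level are assumptions.

```python
from statistics import mean


def classify(score):
    """Map a reading-ease score to one of the three readability groups."""
    if score > 70:
        return "Primary School Level"
    if score >= 50:
        return "Secondary School Level"
    return "College Level and above"


def corpus_readability(sentences, score_fn):
    """Average the sentence scores and return (average, group).

    `score_fn` maps a sentence to a reading-ease score, or to None for
    sentences that cannot be scored; those are skipped.
    """
    scores = [s for s in map(score_fn, sentences) if s is not None]
    if not scores:
        raise ValueError("no sentence in the corpus could be scored")
    average = mean(scores)
    return average, classify(average)
```

Here score_fn would be the sentence-level reading-ease function from Step 3, and sentences could come from an NLTK corpus reader or any sentence splitter.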

Here are the results on 5 text corpora.

The results make sense: Penn is the most complex to read because it is financial text, and its score below 50 indicates that, in terms of readability, it is appropriate only for college students.

Web text, by contrast, is the easiest to read, and its score above 70 indicates that it is appropriate for primary school students in terms of ease of reading.

IPython Notebook in GitHub Repo

