• Shounak Mondal

Text Summarisation and Topic Modelling

Updated: Nov 29, 2020

Text Summarisation - Extraction-based (more advanced abstraction-based summarisation coming up later)

Gensim builds a graph with sentences as nodes and the lexical similarity between pairs of sentences as the edge weights, then ranks the sentences by their position in that graph (a TextRank-style approach). Here is an example of using Gensim for text summarisation:

The input is a news article. We can pass the length of the summary we want; here I ask for a short summary of about 100 words:

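import gensim  # needs gensim < 4.0; the summarization module was removed in 4.0

# strip_text holds the cleaned article text
summary = gensim.summarization.summarize(strip_text, word_count=100)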

Summary obtained:
And Moderna's supply will be tied up with the US for at least probably the first half of 2021, so in light of that, the Oxford/AstraZeneca vaccine is really good news for the rest of the world," Andrea Taylor, assistant director of programs at Duke Global Health Innovation Center, told CNN.
She said that as Moderna and Pfizer build up information and manufacturing capacity, they may be able to find storage methods at higher temperatures, but the Oxford vaccine "has the potential to be able to be shipped more readily around the globe" using existing supply chains.

That's pretty impressive.
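To make the graph idea concrete, here is a minimal sketch of extraction-based summarisation in plain Python. It is not Gensim's implementation: the sentence splitter, the word-overlap similarity and the function names (split_sentences, similarity, summarise) are simplifying assumptions of mine, and Gensim's own similarity measure differs. It only shows the general recipe: sentences as nodes, lexical similarity as edge weights, and a PageRank-style score to pick the most central sentences.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Shared words between two sentences, normalised by sentence length."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    overlap = len(set(words_a) & set(words_b))
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def summarise(text, num_sentences=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    tokens = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    n = len(sentences)
    if n <= num_sentences:
        return sentences

    # Edge weights: similarity between every pair of distinct sentences.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]

    # Weighted PageRank by power iteration.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]

    # Keep the top-scoring sentences, in their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]


text = ("The vaccine can be stored in a normal fridge. "
        "The vaccine is cheap and the vaccine can be shipped in a normal fridge. "
        "My cat sleeps all day. "
        "Shipping the vaccine in a fridge helps poorer countries.")
print(summarise(text, num_sentences=2))
```

A sentence that shares no words with the rest of the text has no edges, so it only receives the baseline score and is unlikely to be selected.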

Now let's try something more complex: topic modelling, which identifies the major concepts in a piece of text. I will use an algorithm called LDA (Latent Dirichlet Allocation). The algorithm:

1. Identifies potential topics using pruning techniques like "upward closure"

2. Computes conditional probabilities for topic word sets

3. Identifies the most likely topics

4. Repeats this over multiple passes, updating the topic probabilities each time
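The probability and multiple-pass steps above can be sketched with a small LDA fitted by collapsed Gibbs sampling. This is an illustration only and an assumption on my part: Gensim's LdaModel uses a different inference method (online variational Bayes), and the function name lda_gibbs and its parameters alpha, beta and seed are my own. On every pass, each word is reassigned to a topic with probability proportional to how common that topic is in the document and how common the word is in that topic.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, passes=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling. `docs` is a list of token lists.

    Returns one list per topic of (word, probability) pairs, most likely first.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})

    doc_topic = [Counter() for _ in docs]                # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # token counts per topic

    def update(d, word, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][word] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for word, k in zip(doc, assignments[d]):
            update(d, word, k, +1)

    for _ in range(passes):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the token out of the counts, then resample its topic.
                update(d, word, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * len(vocab))
                    for k in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                update(d, word, k, +1)

    # Smoothed word distribution for each topic, sorted by probability.
    return [
        sorted(
            ((w, (topic_word[k][w] + beta) / (topic_total[k] + beta * len(vocab)))
             for w in vocab),
            key=lambda pair: (-pair[1], pair[0]))
        for k in range(num_topics)
    ]


docs = [["vaccine", "dose", "vaccine", "trial"],
        ["country", "supply", "country", "vaccine"],
        ["fridge", "storage", "fridge", "supply"]]
for topic in lda_gibbs(docs, num_topics=2):
    print(topic[:3])
```

With num_topics=1 every word lands in the same topic, so the result is essentially a smoothed word-frequency list for the text.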

Now let's run the same CNN article through the LDA topic-modelling algorithm and see what topics it gives.

[(1, '0.002*"vaccine" + 0.002*"country" + 0.002*"astrazeneca"')]

This is pretty good. It correctly identifies that the article is about vaccines, the company AstraZeneca, and countries.
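Each entry in that output is a pair of a topic id and a string of weight*"word" terms, where the weight is the word's probability under the topic. If you want the words and weights as data instead of a string, a small helper like the one below can split them out. The name parse_topic is my own and the regular expression assumes the exact format shown above; Gensim's show_topic method should also return (word, probability) pairs directly.

```python
import re


def parse_topic(topic_string):
    """Turn '0.002*"vaccine" + ...' into [('vaccine', 0.002), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


topics = [(1, '0.002*"vaccine" + 0.002*"country" + 0.002*"astrazeneca"')]
for topic_id, topic_string in topics:
    print(topic_id, parse_topic(topic_string))
```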

iPython Notebook in Github Repo
