Six Unsupervised Extractive Text Summarization Techniques Side by Side


In text summarization, we create a summary of the original content that is coherent and captures its salient points. Text summarization has many important uses. One we encounter almost every day is the text snippet shown in search engine results. That snippet is essentially a summary. Our decision of whether to click on an item in the search results is largely driven by the title and the summary of the content.

In this post we will go through 6 unsupervised extractive text summarization algorithms that have been implemented in Python and are part of my open source project avenir on GitHub.

Usage of Text Summarization

Here are some common usage scenarios for text summarization.

  • News Article: A summary of a news article helps the user to decide whether to browse the content
  • Search Engine Result: Summary of the retrieved content helps the user to decide whether to click on the content link
  • Product Review: Long product reviews in eCommerce sites could be summarized to help customers quickly understand the essence of the review
  • Technical Article: Although many technical articles contain an abstract, summarization could help when the abstract is missing
  • Email Thread: A summary could quickly convey to the user the essence of an email thread discussion, without having to go through all the emails in a thread.

With the information overload that we all experience, text summarization helps greatly by enabling us to digest the essence of some content quickly.

Dimensions of Text Summarization

The text summarization problem and its solutions can be described along various dimensions:

  • Input
  • Context
  • Output
  • Machine learning solution approaches

Depending on the type of input, the summary could be over one document or over multiple documents. With multiple documents, a single summary is created that spans all of them.

Summarization could have different contexts, e.g. domain specific or query driven. For domain specific content summarization, there could be a set of predefined key phrases. An additional constraint could be placed on the summary so that the occurrence of the key phrases is maximized in the summary. For query driven summarization, the summary could be required to include sentences that maximize the occurrence of the query words.

Based on output, the summary could be extractive or abstractive. In extractive summarization, some key sentences from the original content are extracted verbatim. In abstractive summarization, the summary is rephrased or reworded to capture the essence of the original content, just as a human would summarize. Although more appealing, abstractive summarization is a lot more challenging.

Considering machine learning solution approaches, the solution could be unsupervised or supervised. Most of the existing solutions are unsupervised. For supervised machine learning, you will need training data, which for text summarization means human generated summaries.

For extractive supervised machine learning, a set of features could be extracted for each sentence, e.g. sentence length, position of the sentence in the document and whether the sentence contains title words.
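As a rough illustration of such sentence level features, here is a minimal sketch. The function name sentence_features and the feature names are my own choices for this example and not part of avenir. Stop words are not filtered here, so a title containing "the" would match almost every sentence.

```python
import re

def sentence_features(sentences, title):
    """Return one feature dict per sentence: length, relative position, title-word overlap."""
    title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
    last = max(len(sentences) - 1, 1)  # avoid division by zero for a single sentence
    features = []
    for i, sentence in enumerate(sentences):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        features.append({
            "length": len(words),
            "position": i / last,  # 0.0 = first sentence, 1.0 = last sentence
            "has_title_word": any(w in title_words for w in words),
        })
    return features

sentences = ["Methane traps more heat than CO2.", "This idea is mostly theoretical."]
feats = sentence_features(sentences, "Removing methane from the atmosphere")
```

Each feature dict could then be paired with a label saying whether the sentence appears in a human written summary, and fed to any classifier.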

For abstractive supervised learning, the solutions are still nascent. Promising results have been found with Deep Learning sequence to sequence modeling. The problem is akin to language translation. Classical supervised machine learning techniques have failed to produce satisfactory results.

Unsupervised Extractive Algorithms

The focus of this post is on unsupervised extractive algorithms. These algorithms are relatively simple and provide satisfactory output. The solutions we will discuss belong to one of these categories:

  • Term frequency
  • Latent variable
  • Graphical

In the term frequency based approach, sentences whose terms match the most frequent terms of the document get a higher score.
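A minimal sketch of term frequency scoring follows. It is not the avenir implementation. The helper names are hypothetical, and dividing by the sentence length is just one assumed way of keeping long sentences from dominating.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; stop word removal and lemmatization are skipped here.
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequency_scores(sentences):
    tokens = [tokenize(s) for s in sentences]
    counts = Counter(w for sent in tokens for w in sent)
    total = sum(counts.values())
    scores = []
    for sent in tokens:
        if not sent:
            scores.append(0.0)
            continue
        # Sum of relative document frequencies, averaged over the sentence length.
        scores.append(sum(counts[w] / total for w in sent) / len(sent))
    return scores

sentences = ["methane traps heat", "methane levels are rising", "the idea is theoretical"]
scores = term_frequency_scores(sentences)
```

In this toy input the first sentence scores highest because "methane", the only repeated term, makes up a larger share of it.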

In the latent variable based approach, the sentence term matrix is factorized into a lower dimensional space. For each independent latent variable in the lower dimensional space, sentences are sorted. Finally, for each latent variable, the top sentence, i.e. the sentence most representative of the latent concept, is picked.

In graphical methods, a sentence to sentence similarity matrix is built from the sentence term matrix. With the similarity matrix, a PageRank like algorithm called Text Rank is run. After the algorithm runs, each sentence gets a score.

The 6 algorithms belonging to the 3 categories above are as follows. There are many other extractive summarization algorithms, but these are commonly used. The first 2 belong to the term frequency category, the next 2 to the latent variable category and the last 2 to the graphical category.

  • Term frequency
  • Sum basic
  • Latent semantic indexing
  • Non negative matrix factorization
  • Text rank
  • Embedding text rank

In Term Frequency, the frequencies of the terms appearing in a sentence are added up to calculate the score. A normalization is applied based on sentence length. Sum Basic works in a similar way, except that after the sentence with the top score is selected, the frequency of all the words appearing in the selected sentence is reduced to induce more diversity in the sentences selected.
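The Sum Basic selection loop can be sketched as below. This follows the commonly described form of the algorithm, where the probability of each word in the chosen sentence is squared after selection. The exact discounting used in avenir may differ, so treat the update rule as an assumption.

```python
import re
from collections import Counter

def sum_basic(sentences, n):
    """Return indices of n sentences, in the order they were selected."""
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    counts = Counter(w for sent in tokens for w in sent)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}
    candidates = [i for i, sent in enumerate(tokens) if sent]
    selected = []
    while candidates and len(selected) < n:
        # Score = average word probability of the sentence.
        best = max(candidates, key=lambda i: sum(prob[w] for w in tokens[i]) / len(tokens[i]))
        selected.append(best)
        candidates.remove(best)
        # Discount words already covered so that similar sentences score lower next time.
        for w in set(tokens[best]):
            prob[w] = prob[w] ** 2
    return selected

sentences = ["methane traps heat", "methane traps more heat", "carbon prices could rise"]
picked = sum_basic(sentences, 2)
```

Without the discounting step the two near duplicate methane sentences would both be picked. With it, the second pick moves to the carbon price sentence.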

In Latent Semantic Indexing (LSI), matrix factorization is done with Singular Value Decomposition. Non negative matrix factorization (NMF) is similar to LSI, except that the matrix factorization technique is different. The advantage of NMF is that it’s more interpretable. The latent concepts found are more in line with the actual underlying concepts in the document.
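Here is a minimal sketch of the LSI selection step, assuming NumPy is available. The function name lsi_select and the toy matrix are made up for illustration. Rows of the matrix are sentences and columns are terms. Since singular vectors are only defined up to sign, sentences are ranked by absolute value.

```python
import numpy as np

def lsi_select(sentence_term, n_concepts):
    """Pick one sentence index per latent concept from a sentence-term matrix."""
    matrix = np.asarray(sentence_term, dtype=float)
    # Columns of u are latent concepts, ordered by decreasing singular value.
    u, _, _ = np.linalg.svd(matrix, full_matrices=False)
    selected = []
    for k in range(min(n_concepts, u.shape[1])):
        # Walk the sentences from most to least representative of concept k.
        for idx in np.argsort(-np.abs(u[:, k])):
            if int(idx) not in selected:
                selected.append(int(idx))
                break
    return selected

# Columns: methane, gas, carbon, price, tax (toy counts, two unrelated topics)
sentence_term = [
    [2, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
]
picked = lsi_select(sentence_term, 2)
```

An NMF variant would replace the SVD with a non negative factorization and pick the top sentence per component in the same way.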

In Text Rank, the sentence term matrix is used to compute cosine similarity between sentences. The similarity matrix is used to construct a graph, where sentences are nodes. The Text Rank algorithm is run on the graph. Embedding Text Rank is similar, except that instead of the raw sentence term vector, a sentence embedding vector is used to calculate the similarity matrix. The sentence embedding vector is calculated by averaging the embeddings of all the words in the sentence.
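The following is a minimal pure Python sketch of Text Rank over raw term count vectors. The damping factor of 0.85 and the fixed iteration count are conventional PageRank defaults that I am assuming here, not values taken from avenir.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def text_rank(sentences, damping=0.85, iterations=50):
    vectors = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(vectors)
    if n == 0:
        return []
    # Edge weight = cosine similarity between two distinct sentences.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight.
            rank = sum(sim[j][i] / out_weight[j] * scores[j] for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

sentences = [
    "methane traps heat",
    "methane traps more heat than carbon",
    "carbon prices may rise",
    "rice fields flood",
]
scores = text_rank(sentences)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The second sentence ranks first because it shares terms with both of its neighbors, while the last sentence shares no terms with any other and only receives the baseline score.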

All the algorithms follow these steps. The scoring logic is different for each algorithm. All the configuration parameters for each algorithm are defined in a properties file.

  • Split document into sentences
  • Pre process each sentence including tokenization, stop word removal and lemmatization
  • Assign an algorithm specific score to each sentence
  • Sort the sentences by descending order of the score and retain top n, where n is a configurable parameter.
  • Sort the sentences from the last step in the same order as they appear in the original document.
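The steps above can be sketched as a small pipeline with a pluggable scoring function. Everything here is a simplified stand-in: the sentence splitter is a naive regular expression, the stop word list is tiny, lemmatization is left out, and length_score is only a placeholder scorer for the demo.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "it", "that", "this"}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def preprocess(sentence):
    # Tokenize and drop stop words; lemmatization is omitted in this sketch.
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if w not in STOP_WORDS]

def summarize(text, score_fn, n):
    sentences = split_sentences(text)
    tokens = [preprocess(s) for s in sentences]
    scores = score_fn(tokens)
    # Keep the n best sentences, then restore their original document order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]

def length_score(tokens):
    # Placeholder scorer for the demo only: more content words, higher score.
    return [len(t) for t in tokens]

text = ("Methane is rising. Methane traps much more heat. So can it be done? "
        "Researchers propose large fans and zeolites to convert methane.")
summary = summarize(text, length_score, 2)
```

Any of the six scoring methods could be passed in as score_fn, as long as it returns one score per sentence.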

There could be various other extractive summarization techniques. Here is one example

  • Get word embedding vectors, by Word2Vec or GloVe
  • Get sentence embedding vectors by averaging word embeddings
  • Get clusters using K-means based on sentence embeddings
  • Find 1 or 2 sentences close to each cluster center
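A minimal sketch of this clustering approach is shown below, assuming K-means as the clustering step. The 2 dimensional word vectors are invented toy values standing in for real Word2Vec or GloVe embeddings, and the centers are initialized from the first k sentences only to keep the demo deterministic.

```python
import math

def average(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def k_means(points, k, iterations=20):
    centers = [list(p) for p in points[:k]]  # deterministic init for the demo
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [average(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

def closest_sentences(points, centers):
    # One sentence per cluster: the one nearest to the cluster center.
    return sorted({min(range(len(points)), key=lambda i: math.dist(points[i], c)) for c in centers})

# Toy 2-D word vectors (hypothetical values, not real embeddings)
word_vectors = {
    "methane": [1.0, 0.0], "gas": [0.9, 0.1], "heat": [0.8, 0.2],
    "price": [0.0, 1.0], "tax": [0.1, 0.9], "market": [0.2, 0.8],
}
tokenized = [["methane", "gas"], ["price", "tax"], ["methane", "heat"],
             ["price", "market"], ["gas", "heat"], ["tax", "market"]]
embeddings = [average([word_vectors[w] for w in sent]) for sent in tokenized]
picked = closest_sentences(embeddings, k_means(embeddings, 2))
```

The result is one representative sentence from the methane cluster and one from the pricing cluster.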

Diversification

Ideally, we want the sentences in the summary to represent the underlying concepts and not have any redundancy. Multiple sentences should not convey essentially the same information. In other words, sentences should be as diverse as possible. Diversity is the inverse of similarity.

Sum Basic, Latent semantic indexing and Non negative matrix factorization have redundancy reduction built into the algorithms. The remaining algorithms require explicit diversification as an extra step. A technique called Maximal Marginal Relevance (MMR) is used for them. The steps are as follows:

  • Get sentences sorted in descending score from the core algorithm
  • To select a sentence, find its diversity with respect to the already selected sentences. This could be the minimum, maximum or average diversity with the already selected sentences.
  • A weighted score is found based on the original score and the diversity score
  • The sentence with the highest weighted score is selected next
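The MMR steps above can be sketched as follows. The function name mmr_select, the default weight of 0.5 and the choice of minimum diversity (one minus the maximum similarity to any selected sentence) are assumptions made for this example.

```python
def mmr_select(scores, similarity, n, weight=0.5):
    """scores: base score per sentence; similarity: square matrix with values in [0, 1]."""
    candidates = list(range(len(scores)))
    selected = []
    while candidates and len(selected) < n:
        def weighted(i):
            # Minimum diversity = 1 - maximum similarity to anything already selected.
            diversity = 1.0 - max((similarity[i][j] for j in selected), default=0.0)
            return weight * scores[i] + (1 - weight) * diversity
        best = max(candidates, key=weighted)
        selected.append(best)
        candidates.remove(best)
    return selected

scores = [0.9, 0.85, 0.6]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
picked = mmr_select(scores, similarity, 2)
```

Sentence 1 has the second highest base score, but it is nearly identical to sentence 0, so the more diverse sentence 2 is picked instead. A higher weight favors the original score, a lower weight favors diversity.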

Results

The document we will summarize is as follows. It’s about climate change and methane in the atmosphere. In each summary, the score for each sentence is also shown in parentheses.

It’s not that CO2 isn’t a problem — it’s the main problem. But on a molecule-for-molecule basis, methane traps more heat, so converting it into something less potent would reduce its climate impact.

In fact, by restoring the concentration of methane in the atmosphere to preindustrial levels, this counterintuitive strategy could eliminate about a sixth of human-caused warming, according to a paper published Monday in Nature Sustainability. And it would add only a few months’ worth of CO2 emissions to the atmosphere.

“In the grand scheme of carbon dioxide emissions, this would not be a deal-breaker,” said lead author Rob Jackson, an earth scientist at Stanford University.

Jackson, like most scientists, says the best strategy to combat climate change is to stop emitting greenhouse gases.

“Having said that, we’re not getting the job done on reducing emissions, so I think we need to look at some of these other approaches,” said Jackson, who chairs the Global Carbon Project, which tracks greenhouse gases.

Already, it’s clear that people will have to pull huge amounts of carbon dioxide out of the atmosphere to meet the goals of the Paris climate accord, which would limit global warming to less than 2 degrees Celsius above preindustrial levels.

Some scenarios call for removing up to 10 billion metric tons of the gas per year — a quarter of humanity’s annual emissions — by storing it in biomass or soil, or building facilities that directly capture the gas from the air.

“But no one’s talking about this for methane,” Jackson said. “That’s what we want to accomplish with this paper.”
Methane in the atmosphere is surging, and that’s got scientists worried.

Methane hasn’t caused as much warming as CO2, but humans have had a much bigger impact on the methane cycle, he said.

Over the last 200 years, we have more than doubled its concentration in the atmosphere by extracting fossil fuels, raising livestock, and allowing the gas to escape from landfills and wastewater treatment plants, among other activities. Today, methane levels are 1,860 parts per billion — and rising — compared with 750 ppb before 1800.

Methane also has a more acute warming effect. It traps about 84 times as much heat as CO2 over a 20-year period, and 28 times as much over the course of a century. (Its potency drops because it has a shorter lifetime in the atmosphere.) So turning a molecule of methane into a molecule of CO2 would slash its climate-altering capacity, Jackson and his coauthors say.

And the trade-off would be worth it, they argue. Restoring methane to its preindustrial concentration would involve a one-time conversion of 3.2 billion tons of the gas into CO2. That would increase CO2 levels — which were 280 ppm in preindustrial times and are currently 415 ppm — by less than 1 part per million.

“That’s one of the real selling points in my mind,” Jackson said.

So can it be done?

At this stage, the idea is mostly theoretical, but the authors are cautiously optimistic.

Researchers have started developing ways to oxidize methane into methanol, a valuable compound used for fuel and chemical manufacturing. The same reactions, if allowed to proceed further, could also be used to convert methane into carbon dioxide.

Edward Solomon, a Stanford chemist who worked on the new paper, studies one promising method of processing methane using minerals called zeolites. They catch the gas, and help with oxidation.
Researchers propose removing methane from the atmosphere by pulling air through large fans and using materials like zeolites to catalyze its conversion into carbon dioxide. Thousands of these arrays would be needed to restore preindustrial methane levels.

The researchers envision using these kinds of materials in facilities like those being developed to remove CO2, which use fans to draw air into chambers where the gases are captured through chemical reactions.

The low concentration of methane means you’d have to process a lot of air to reduce atmospheric levels, Jackson said. “It would take many thousands of these arrays to make a dent,” he said, although it would be a much smaller effort than what’s been proposed for dealing with CO2.

And the economics could prove attractive — if countries eventually settle on a price for carbon emissions, either through a tax or a cap-and-trade system like the one in California.

Indeed, removing methane could be many times more lucrative than removing CO2, because the value of keeping a greenhouse gas out of the atmosphere typically depends on its heat-trapping ability.

Imagine a system in which the market will pay $50 for every ton of CO2 emissions that can be avoided. In the 100-year scenario in which methane is considered to be 28 times more powerful, the value of eliminating one ton of it would be $1,400. There’d be a small cost for emitting a little bit of CO2 in the conversion process, but even so, the take home would be about $1,250. (A bipartisan bill introduced in the House this year would impose a fee on carbon starting at $15 per ton of CO2 and increasing to more than $100 by 2030; models predict prices could climb as high as $500 later in the century.)

At the upper end of that range, researchers estimate that a methane removal facility the size of a football field could generate millions of dollars of income.

But that could be dangerous, warned Myles Allen, a climate scientist at Oxford University in the U.K. who was not involved in the paper.

Allen fears that, if there’s money to be made — or saved — removing methane from the atmosphere, some emitters may favor doing that instead dealing with CO2.

“That might make good business sense, but it would be terrible for the climate,” he said.

That’s because methane and carbon dioxide affect the climate in different ways.

Methane only stays in the atmosphere for about a decade, so its climate effects are short-lived. CO2, on the other hand, lingers for centuries. Thus, removing some methane might yield a small benefit right now, but won’t help solve the climate problem in the long run, Allen said.

Yet if carbon pricing schemes treat the gases as if they are interchangeable, then companies could offset their CO2 emissions by reducing methane emissions.

“You can imagine the world’s airlines getting on to this and getting all excited because suddenly it looks like they’ve got a very cheap pass and they can just carry on emitting,” he said. “And of course, they’ll just carry on causing global warming.”

Still, Allen agreed that we have to reduce the concentration of methane in the atmosphere. And he endorses the idea of developing technologies to offset the methane emissions that are difficult to eliminate, as the researchers also suggest in the paper.
Rice grows in a watery field near the city of Williams in the Sacramento Valley. Flooded soils produce methane, and rice cultivation represents about 10% of human-caused emissions. (Brian van der Brug / Los Angeles Times)

Raising cattle, growing rice and other seemingly essential activities produce the gas and they are unlikely to go away. “It’s hard for me to see any time in the near future when methane emissions will be zero,” Jackson said.

He and his colleagues will keep working to make methane conversion technology a reality, and they hope their paper will encourage others to try as well.

Jackson said part of his motivation is symbolic. Resetting methane concentrations to their preindustrial levels offers a way to “repair the atmosphere,” he said. And that might be inspiring.

“The notion of restoring the atmosphere sends a message of hope to people,” he said. “I would love to see something like this happen in my lifetime.”

Term Frequency output

Having said that, were not getting the job done on reducing emissions, so I think we need to look at some of these other approaches, said Jackson, who chairs the Global Carbon Project, which tracks greenhouse gases.  (0.743875)
Already, its clear that people will have to pull huge amounts of carbon dioxide out of the atmosphere to meet the goals of the Paris climate accord, which would limit global warming to less than 2 degrees Celsius above preindustrial levels.  (0.752175)
Methane hasnt caused as much warming as CO2, but humans have had a much bigger impact on the methane cycle, he said.  (0.7)
Restoring methane to its preindustrial concentration would involve a one-time conversion of 3.2 billion tons of the gas into CO2.  (0.754675)
Yet if carbon pricing schemes treat the gases as if they are interchangeable, then companies could offset their CO2 emissions by reducing methane emissions.  (0.807025)

Sum Basic output

And it would add only a few months worth of CO2 emissions to the atmosphere.  (0.0116501145913)
But no ones talking about this for methane, Jackson said.  (0.0163101604278)
It traps about 84 times as much heat as CO2 over a 20-year period, and 28 times as much over the course of a century.  (0.00429536290585)
They catch the gas, and help with oxidation.  (0.00501336898396)
Thats because methane and carbon dioxide affect the climate in different ways.  (0.00539518037405)

Latent Semantic Indexing output

Having said that, were not getting the job done on reducing emissions, so I think we need to look at some of these other approaches, said Jackson, who chairs the Global Carbon Project, which tracks greenhouse gases.  (2.916704268372756)
Already, its clear that people will have to pull huge amounts of carbon dioxide out of the atmosphere to meet the goals of the Paris climate accord, which would limit global warming to less than 2 degrees Celsius above preindustrial levels.  (3.177025178180708)
Some scenarios call for removing up to 10 billion metric tons of the gas per year  a quarter of humanitys annual emissions  by storing it in biomass or soil, or building facilities that directly capture the gas from the air.  (2.2522541614428437)
Indeed, removing methane could be many times more lucrative than removing CO2, because the value of keeping a greenhouse gas out of the atmosphere typically depends on its heat-trapping ability.  (2.0782810099983973)
(A bipartisan bill introduced in the House this year would impose a fee on carbon starting at $15 per ton of CO2 and increasing to more than $100 by 2030; models predict prices could climb as high as $500 later in the century.)  (2.3918155164303068)

Non Negative Matrix Factorization output

In fact, by restoring the concentration of methane in the atmosphere to preindustrial levels, this counterintuitive strategy could eliminate about a sixth of human-caused warming, according to a paper published Monday in Nature Sustainability.  (2.3304601198013146)
Having said that, were not getting the job done on reducing emissions, so I think we need to look at some of these other approaches, said Jackson, who chairs the Global Carbon Project, which tracks greenhouse gases.  (2.495239211099699)
Already, its clear that people will have to pull huge amounts of carbon dioxide out of the atmosphere to meet the goals of the Paris climate accord, which would limit global warming to less than 2 degrees Celsius above preindustrial levels.  (2.1822043493045586)
Yet if carbon pricing schemes treat the gases as if they are interchangeable, then companies could offset their CO2 emissions by reducing methane emissions.  (2.423957931929234)
And he endorses the idea of developing technologies to offset the methane emissions that are difficult to eliminate, as the researchers also suggest in the paper.  (2.33185113146257)

Text Rank output

(Brian van der Brug / Los Angeles Times)
Raising cattle, growing rice and other seemingly essential activities produce the gas and they are unlikely to go away.  (0.9234375)
He and his colleagues will keep working to make methane conversion technology a reality, and they hope their paper will encourage others to try as well.  (0.9447125)
And that might be inspiring.  (0.978125)
The notion of restoring the atmosphere sends a message of hope to people, he said.  (0.9890625)
I would love to see something like this happen in my lifetime.  (0.7)

Embedding Text Rank output

Jackson said part of his motivation is symbolic.  (60)
Resetting methane concentrations to their preindustrial levels offers a way to repair the atmosphere, he said.  (61)
And that might be inspiring.  (62)
The notion of restoring the atmosphere sends a message of hope to people, he said.  (63)
I would love to see something like this happen in my lifetime.  (64)

The implementation contains 6 Python classes, 1 for each algorithm. The various parameters for each algorithm are defined in a separate properties file.

Optimization and Tuning

How do we know which algorithm to use and how to tune the parameters of the algorithm? Unlike supervised learning, we don’t have labeled data. So we cannot throw everything into an Auto ML like optimization pipeline and have it spit out the ideal algorithm and the associated optimum set of parameter values.

One solution is to get user feedback as ratings and use Multi-Armed Bandit learning to find the optimum solution. This will be a continuously running learning engine. Periodically it will be asked for the algorithm to run and the associated parameters. The optimization search space will consist of the algorithms and the parameters associated with each algorithm.
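As a rough sketch of how such a learning engine could work, here is an epsilon greedy bandit where each arm is an (algorithm, parameter) combination. Epsilon greedy is just one simple bandit strategy that I am assuming for illustration. The arm names and the simulated ratings are invented for the demo and are not measured results.

```python
import random

class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in self.arms}
        self.mean_rating = {arm: 0.0 for arm in self.arms}

    def choose(self):
        # Try every arm once, then mostly exploit the best mean rating so far.
        untried = [arm for arm in self.arms if self.counts[arm] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.mean_rating[arm])

    def update(self, arm, rating):
        # Incremental update of the running mean rating for this arm.
        self.counts[arm] += 1
        self.mean_rating[arm] += (rating - self.mean_rating[arm]) / self.counts[arm]

# Each arm is (algorithm name, number of summary sentences); ratings are made up.
simulated_rating = {("lsi", 5): 4.0, ("nmf", 5): 3.0, ("text_rank", 5): 2.0}
bandit = EpsilonGreedyBandit(simulated_rating.keys(), epsilon=0.1, seed=0)
for _ in range(200):
    arm = bandit.choose()
    bandit.update(arm, simulated_rating[arm])
most_used = max(bandit.counts, key=bandit.counts.get)
```

In a real deployment the rating passed to update would come from a user, and the engine would keep exploring occasionally in case user preferences change.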

Summing Up

We have gone through 6 different unsupervised extractive algorithms for text summarization. Recently we deployed all 6 algorithms for a customer with a limited amount of data and collected user ratings. The winner was found to be LSI. To run all the algorithms with your data, you could use the tutorial document.

Support

For commercial support for any solution in my GitHub repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.