Similarity Based Recommendation – Tossed up with Text Analytic

In my last post I mentioned that the similarity based recommendation engine in sifarish only considered categorical and integer attributes. I have since added support for text attributes. I am using Lucene for text processing and a variation of the Jaccard coefficient for the distance between text attribute values. Lucene is an Apache open source package for text indexing and processing. In this post, I will go over the text analytic related enhancements.

As a quick disclaimer: for this post, by text processing I mean the so called bag of words approach, which is based simply on the occurrence of words and not on the semantic structure of the text. Analysis based on semantics falls within the realm of natural language processing.

Text Attribute

A text attribute value is a bag of words. Distance between text attributes is measured in terms of the match or overlap between the two bags of words.

To bring some context into this, I am going back to the Kiva micro loan example. The loan data has a set of attributes which are free form text. I have also added a set of keywords as an attribute to the lender profile. The idea is to match the keywords with the text attributes of a loan and calculate the distance in terms of the match.

Text Preprocessing

Text consists of one or more sentences. To convert them to a bag of words or terms, some processing is necessary. This is where Lucene enters the scene. I am using the Lucene StandardAnalyzer, which takes a block of text and spits out a list of tokens. The analyzer does the following:

  • Lower casing:  Turns everything to lower case
  • Stop word removal:  Removes words like is and the, which add no value
  • Stemming:  Transforms inflected or derived words to their base, e.g., cars becomes car

Here is an example of some text

If enterprises have a problem coping with their own structured data, then the huge volumes of unstructured data they receive has the potential to cause an even bigger headache. This could be video, or audio, email correspondence, internet traffic, diagnostic or spatial data – a myriad of data types that do not arise from transactional activities or fit into a relational database in a data warehouse.

Here is the list of tokens after Lucene analysis of the text.

enterprises have problem coping own structured data huge volumes unstructured data receive has potential cause even bigger headache could video audio email correspondence internet traffic diagnostic spatial data myriad data types do arise from transactional activities fit relational database data warehouse

If you compare the text chunks, you will notice how the lower casing, stop word removal and stemming have operated to produce the processed text.
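The analyzer's steps can be sketched in plain Python. This is a minimal illustration, not Lucene's StandardAnalyzer: the stop word list and the crude suffix-stripping "stemmer" below are stand-ins for Lucene's much more complete versions.

```python
# Illustrative stand-ins for the analyzer's steps; a real analyzer uses a
# full stop word list and a proper stemmer (e.g., the Porter algorithm).
STOP_WORDS = {"a", "an", "and", "are", "the", "is", "of", "to", "in",
              "that", "then", "this", "their", "with", "or", "do", "not"}

def crude_stem(word):
    # Naive plural stripping, only for illustration
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def analyze(text):
    tokens = []
    for raw in text.lower().split():              # lower casing
        word = raw.strip(".,;:!?\"'()-")          # punctuation removal
        if word and word not in STOP_WORDS:       # stop word removal
            tokens.append(crude_stem(word))       # stemming
    return tokens

print(analyze("the cars are red"))
```

The output is the bag of words the downstream similarity measures operate on.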

The preprocessing is done by a map reduce job that has been added to sifarish, with only a mapper. Essentially, it analyzes every text field by calling the Lucene API. It only transforms the data. You can think of it as the T of an ETL process. The source code can be viewed here.

For my example, I ran the loan data through this map reduce job. Each text field is replaced with the output of the analyzer which is a list of terms.

Optionally, based on the boolean configuration parameter consolidate.field, all the terms from all the fields can be consolidated into one field at the end of the record, replacing the original text fields.

There is an advantage in not consolidating all processed text fields into one. If left separate, different weights can be assigned to different text fields depending on their relative importance. For example, when dealing with product data, you may want to assign a higher weight to the product description text field, compared to other text fields. It's known as weighted zone scoring in text analytics.

Attribute weight assignment is discussed in my earlier post. When computing distance between entities, it allows controlling the contribution of a particular attribute distance towards the aggregated distance.

Map Reduce

The MR class TextAnalyzer has only a mapper. It does text preprocessing with Lucene. By preprocessing I mean the lower casing, stop word removal and stemming described earlier. The text fields are identified through the configuration parameter text.field.ordinals. Its value is a comma separated list of text field ordinals.

Every text field is replaced by the output of the analysis. If the boolean configuration parameter consolidate.field is set to true, all the text fields are consolidated into one text field at the end.

This map reduce also has the additional feature of extracting values from other text fields based on regular expressions defined in the entity schema file. These extracted fields are appended to every row generated in the output.

Sample output

I only ran the loan data through this map reduce. I assumed that the lender profile data contains a set of keywords, which don't need any transformation based on text analysis. Here is a row of loan data before and after processing.

368472[]Zeinab[]Clothing[]Lebanon[]1600[]0[][]Clothing Sales[]to purchase a new collection of clothes such as dresses, scarves, pants, and many other items.
368472[]Zeinab[]Clothing[]Lebanon[]1600[]0[][]clothing sales purchase new collection clothes dresses scarves pants many other items

I had to use those funny [] as my field delimiter, because the text fields have commas embedded in them. The field delimiter is configurable through the configuration parameter field.delim. In this case, the last two text fields in the original record have been replaced with one text field at the end of the processed record.
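Splitting such a record is straightforward. A small Python sketch, using the sample row above (note that the empty field between the two consecutive [] delimiters survives the split):

```python
# The delimiter string is the value of the field.delim configuration parameter
FIELD_DELIM = "[]"

# Sample processed loan record from above
record = ("368472[]Zeinab[]Clothing[]Lebanon[]1600[]0[][]"
          "clothing sales purchase new collection clothes dresses "
          "scarves pants many other items")

fields = record.split(FIELD_DELIM)
print(len(fields))     # the empty field between "0" and the text is preserved
print(fields[-1])      # the consolidated text field
```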

Text Similarity and Matching

After being preprocessed, the loan data is ready to be run through the similarity measure map reduce class, which in our case is DiffTypeSimilarity. The similarity measure map reduce is enhanced with text matching logic. If we were matching entities of the same type, we would have been using the MR class SameTypeSimilarity. A good example would be products with text attributes among other types of attributes.

Essentially, the distance between two text fields depends on the extent of overlapping terms between the two text attribute values: the higher the overlap, the lower the distance. The different approaches for text similarity calculation are as follows. The first two algorithms perform boolean matching. Each text field is considered a vector of booleans, each value indicating whether the corresponding term is present or not.

Simple Matching

In simple matching, the matching coefficient is the ratio of matching terms (including both co-presence and co-absence of terms) to the total number of terms. This does not work very well for text matching, because the term vectors happen to be sparse, i.e., the majority of the terms in the vocabulary will not be present in the text fields being matched.

The number of terms absent from both text fields will be high and will have the dominant influence on the matching coefficient, which will end up with a value close to 1.
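A quick Python sketch makes the problem concrete. The small vocabulary below is a hypothetical stand-in; in practice the vocabulary is far larger, making the effect far worse:

```python
def simple_matching(terms1, terms2, vocab):
    s1, s2 = set(terms1), set(terms2)
    # a term "matches" on co-presence or co-absence
    matches = sum(1 for t in vocab if (t in s1) == (t in s2))
    return matches / len(vocab)

# Hypothetical 12-term vocabulary
vocab = ["loan", "clothing", "sales", "farm", "water", "school",
         "health", "seed", "tools", "stock", "repair", "travel"]

# Only one term actually matches, yet the coefficient is high (10/12),
# because 9 co-absent terms also count as matches
print(simple_matching(["clothing", "sales"], ["clothing", "farm"], vocab))
```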


Jaccard Coefficient

The Jaccard coefficient is based on the occurrence of terms and the number of term matches between two text fields. It's defined as (number of matching terms) / (number of matching terms + number of terms that appear in one field but not the other). It's essentially the ratio between the intersection and the union of the two sets of terms for the two text fields.

Since we don’t consider terms that don’t appear in either of the text fields, such variables are called asymmetric binary variables. It’s asymmetric because we don’t consider (false, false) to be a match. For text matching, the presence of a term in two text fields is more important than the absence of a term.
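The Jaccard coefficient amounts to a few lines of Python (an illustration, not the sifarish Java code). With the same example terms as before, only the three distinct terms that occur somewhere matter; co-absent vocabulary terms play no role:

```python
def jaccard(terms1, terms2):
    # ratio of intersection to union of the two term sets
    s1, s2 = set(terms1), set(terms2)
    return len(s1 & s2) / len(s1 | s2)

# one shared term out of three distinct terms overall -> 1/3
print(jaccard(["clothing", "sales"], ["clothing", "farm"]))
```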

Cosine Distance

In cosine distance, the frequency of term occurrence is used and each text field is treated as a vector of term frequency counts. This method finds the angle between the two term vectors. It reflects similarity in terms of the similarity in term count distribution between the two text fields. It’s similar to the Jaccard coefficient in that (0,0) does not contribute to the match.
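A sketch of cosine similarity over term frequency counts, again as a plain Python illustration rather than the sifarish implementation:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(terms1, terms2):
    # treat each text field as a vector of term frequency counts
    c1, c2 = Counter(terms1), Counter(terms2)
    dot = sum(c1[t] * c2[t] for t in c1)
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

# Unlike the boolean measures, repeated terms change the result
print(cosine_similarity(["data", "data", "loan"], ["data", "loan"]))
```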

Generalized Jaccard

Since I am matching a small set of keywords with text field content, I found the Jaccard coefficient to be a better fit. I took the liberty of generalizing it a bit and redefined it as follows:

jaccard(t1,t2) = match(t1,t2) / (match(t1,t2) + w1 * nonMatch(t1,t2) + w2 * nonMatch(t2,t1))
match(t1,t2) = number of matching terms between t1 and t2
nonMatch(t1,t2) = number of terms in t1 but not in t2
w1,w2 = weights less than 1
distance(t1,t2) = 1 – jaccard(t1,t2)

We can derive some interesting cases depending on the values of the weights chosen.

  • When w1 = w2 = 1, we get the original jaccard coefficient.
  • When w1 = w2 = 0.5, it’s called Dice coefficient.
  • You would set w1 = 1 and w2 = 0, if you wanted a similarity coefficient based simply on the percentage of keywords that matched.
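The generalized formula and the special cases above can be sketched directly in Python (the keywords and loan text below are made-up examples, and this is an illustration, not the sifarish Java code):

```python
def generalized_jaccard(t1, t2, w1, w2):
    s1, s2 = set(t1), set(t2)
    match = len(s1 & s2)
    non_match_1 = len(s1 - s2)   # nonMatch(t1,t2): in t1 but not t2
    non_match_2 = len(s2 - s1)   # nonMatch(t2,t1): in t2 but not t1
    return match / (match + w1 * non_match_1 + w2 * non_match_2)

keywords = ["clothing", "tailoring"]                               # hypothetical lender keywords
text = ["clothing", "sales", "purchase", "collection", "clothes"]  # processed loan text

print(generalized_jaccard(keywords, text, 1.0, 1.0))  # plain Jaccard: 1/(1+1+4) = 1/6
print(generalized_jaccard(keywords, text, 1.0, 0.0))  # fraction of keywords matched: 1/2
```

Note how even here, with only a modest body of text, the plain Jaccard value is dragged down by the non-matching text terms, which is exactly what the weights are there to counteract.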

The problem with using the original form of the Jaccard coefficient in our case is that, since we are matching a few keywords with a large body of text, the coefficient value will be close to zero because the intersection of the terms will be much smaller than the union of the terms. If we were only interested in ordering the text fields in terms of match with the keywords, we would not have cared about the absolute values.

Since we are normalizing the distance between 0 and 1, we want the distance between text attributes to be well spread out in that range. If the distance values are always going to be close to 1, they won’t have enough discriminating power. As a result,  the distance between the text attributes won’t have much impact on the overall distance between two entities.

One way to do that is to choose weight values less than 1 and weaken the impact of the non matching term counts. Intuitively, it makes sense to choose weight values less than 1: since the weight of the count of terms that appear in neither text field is 0, the weight for the count of terms that appear in one field and not the other should be somewhere between 0 and 1.

Similarity Match Map Reduce

As I mentioned earlier, this map reduce has been enhanced to handle text fields. I ran the same Kiva data to find the top N loan matches for a lender profile. Here is a snippet of configuration related to the text analytic part of this map reduce.

	"textMatchingAlgorithm" : "jaccard",
	"srcNonMatchingTermWeight" : 0.8,
	"trgNonMatchingTermWeight" : 0.2,

As you can see, I have left the door open for other text attribute distance algorithms through configuration. In the future, I might implement some other algorithm.

Language Support

Several other languages besides English are supported. The language is selected by setting the configuration parameter text.language appropriately. Here is the list of supported languages.

  1. English
  2. German
  3. French
  4. Spanish
  5. Italian
  6. Brazilian Portuguese
  7. Russian
  8. Polish



You might be interested in this tutorial with all the details of using the two map reduce jobs in tandem. The tutorial also has a third map reduce to find the top n matches. The tutorial uses data for retail products with one text attribute along with attributes of other types. Here is the JSON schema file for the data.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on the cloud, including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Data Mining, Hadoop and Map Reduce, Recommendation Engine, Text Analytic. Bookmark the permalink.

8 Responses to Similarity Based Recommendation – Tossed up with Text Analytic

  1. George says:

    I am a student and I have as a project for my school to implement Jaccard similarity in Lucene.
    So far my thought is to override the default similarity and tweak it so that it only returns the number of terms common between the query and the doc, divided by the union of the two sets.
    The problem is that only coord() seems to have that info (as overlap in its arguments), but coord is only called in queries with more than one term.
    So I was wondering if you could give me any insights on how to obtain the “match(t1,t2) = number of matching terms between t1 and t2” of your code.
    For the union I have already thought of counting the terms of each doc during indexing.
    So if you have the time for that simple answer I would be grateful.
    Anyway, thank you.

  2. Pranab says:


    It’s like this: if you have two term vectors v1 and v2
    1. Iterate through both in a nested loop to find the count of matching terms (m)
    2. Count of terms in v1 and not in v2 = v1.length – m
    3. Count of terms in v2 and not in v1 = v2.length – m

    You have everything for Jaccard coeff now. Good luck with your project.

  3. Pingback: Similarity Based Recommendation – Tossed up with Text Analytic |

  4. Pingback: Semantic Matching with Hadoop | Mawazo

  5. Pingback: Identifying Duplicate Records with Fuzzy Matching | Mawazo

  6. wi2lWilliam says:

    Hello. Could you provide any implementation of the Jaccard coefficient algorithm with Lucene please? I am trying to derive from the DefaultSimilarity and/or SimilarityBase classes of the Lucene Similarity API with no success.
    Best regards

    • Pranab says:

      I use Lucene only for stop word removal, stemming, etc. The Jaccard algorithm is implemented inside my Hadoop implementation.

      • wi2l says:

        Ok. I found a solution in the meantime. Instead of using the Similarity API of Lucene, I created a CustomScoreProvider by extending the CustomScoreQuery class, and then overrode the getCustomScoreProvider() function. This function returns a new CustomScoreProvider instance which has an overridden customScore() member function. From this last one, I’m able to get the terms of the query and the stored content of the field I want to score with the query. For that purpose, I had to use the tokenStream() function of my analyzer to process the content of my stored field and get better results (I used my own analyzer which implements the Porter stemmer algorithm).
        Then I can compare the query and the analyzed field with a simple Jaccard algorithm that works on two Strings.
        Do you think it is a good way to do it? I think my problem is similar to George’s.
        Best regards
