In my last post I mentioned that similarity based recommendation engine in sifarish only considered categorical and integer attributes. I have added support for text attributes to sifarish. I am using Lucene for text processing and a variation of jaccard coefficient for distance between text attribute values. Lucene is an Apache open source package for text indexing and processing. In this post, I will go over the text analytic related enhancements.
As a quick disclaimer: for this post, by text processing I mean the so called bag of words approach, which is simply based on the occurrence of words and not on the semantic structure of the text. Analysis based on semantics falls within the realm of natural language processing .
A text attribute value is a bag of words. Distance between text attributes is measured in terms of the match or overlap between the two bag of words.
To bring some context into this, I am going back to the Kiva micro loan example. The loan data has set of attributes which are free form text. I have also added set of keywords as an attribute to the lender profile. The idea is to match the keywords with text attributes of loan and calculate the distance in terms of the match
Text consists of one or more sentences. To convert them to a bag of words or terms some processing is necessary. This is where Lucene enters the scene. I am using the Lucence StandardAnalyzer which takes a block of text and spits out a list of tokens. The analyzer does the following
- Lower casing: Turns everything to lower case
- Stop word removal: Removes words like is, the which add no value
- Stemming: Transforms inflected or derived words to their base e.g., cars become car
Here is an example of some text
Here is the list of tokens after lucene analysis of the text.
If you compare the text chunks, you will notice how the lower casing, stop word removal and stemming have operated to produce the processed text.
The preprocessing is done by a map reduce job that has been added to sifarish, that has only the mapper. Essentially, it analyzes every text field by calling lucene API. It only transforms the data.You can think of it as the T of and ETL process. The source code can be viewed here.
For my example, I ran the loan data through this map reduce job. Each text field is replaced with the output of the analyzer which is a list of terms.
There is an advantage in not consolidating all processed text fields into one. If left separately, different weights could be assigned to different text fields depending on their relative importance. For example, when dealing with product data, you may want to assign higher weight to the product description text field, compared to other text fields. It’s know as weighted zone scoring in text analytic.
Attribute weight assignment is discussed in my earlier post. When computing distance between entities, it allows to control the the contribution of a particular attribute distance towards the aggregated distance.
The MR class TextAnalyzer has only a mapper. It does text pre processing with lucene. By pre processing I mean stemming. The text fields are identified through the configuration parameter text.field.ordinals. It’s value is a coma separated list of text field ordinals.
Every text field is replaced by the output of the stemming operation. If the boolean configuration parameter consolidate.field is set to true, all the text fields are consolidated into one text field at the end.
This map reduce also has the additional feature of extracting values from other text fields based on reg ex defined in the entity schema file. These extracted fields are appended to every row that is generated in the output.
I only ran the loan data through this map reduce. I assumed that the lender profile data contains set of keywords, which don’t need any transformation based on text analysis. Here is row of loan data before and after processing.
I had to use those funny  as my field delimiter, because the text fields have comma embedded in them. The field delimiter is configurable through the configuration parameter field.delim. In this case, the last two text fields in the original record have been replaced with one text field at the end of the processed record.
Text Similarity and Matching
After being preprocessed, the loan data is ready to be run through the similarity measure map reduce class which in our case is DiffTypeSimilarity. The similarity measure map reduce is enhanced with text matching logic.If we were matching entities of same type, we would have been using the MR SameTypeSimilarity. A good example would be products with text attributes among other types of attributes.
Essentially, the distance between two text fields depends on the extent of overlapping terms between the two text attribute values, higher the overlap lower the distance. The different approaches for text similarity calculation are as follows. The first two algorithms perform boolean matching. Each text field is considered as a vector of boolean, the value of which indicates whether the corresponding term is present or not.
In simple matching, the matching coefficient is the ratio of matching terms (including both co presence and co absence of terms) to the total number of terms. This does not work very well for text matching, because of the term vectors happen to be sparse i.e majority of the terms in the vocabulary will not be present the in the text fields being matched.
Since the number of terms not appearing in both text fields will be high and will have the dominant influence on the matching coefficient and it will have a value close to 1.
Jaccard coefficient is based on the occurrence of terms and the numbers of term matches between two text fields. It’s defined as number of matching terms / number of matching terms + terms that appear in one but not the other. It’s essentially the ratio between the intersection and union of the two sets of terms for the two text fields.
Since we don’t consider terms that don’t appear in any of the text fields, such variables are called asymmetric binary variables. It’s asymmetric because we don’t consider (false, false) to be a match. For text matching, presence of a term in two text fields is more important than the absence of a term.
In cosine distance, frequency of term occurrence is used and each text field is treated as a vector of the term frequency count. This method finds the angle between two term vectors. It reflects similarity in terms of the similarity in term count distribution between the two text fields. It’s similar to jaccard coefficient because (0,0) does not contribute to the match.
Since I am am matching a small set of keywords with text field content, I found Jaccard coefficient to be better fit. I took the liberty of generalizing it a bit and redefined it as follows
We can derive some interesting cases depending on values of the weight chosen.
- When w1 = w2 = 1, we get the original jaccard coefficient.
- When w1 = w2 = 0.5, it’s called Dice coefficient.
- You would set w1 = 1 and w2 = 0, if you wanted a similarity coefficient simply based on the percentage of keyword that matched.
The problem of using the original form of jaccard coefficient in our case is that since we are matching few keywords with a large body of text the coefficient value will be close to zero because the intersection of the terms will be much smaller than the union of terms. If we were interested in ordering the text fields in terms of match with the keywords, we would not have cared about the absolute values.
Since we are normalizing the distance between 0 and 1, we want the distance between text attributes to be well spread out in that range. If the distance values are always going to be close to 1, they won’t have enough discriminating power. As a result, the distance between the text attributes won’t have much impact on the overall distance between two entities.
Similarity Match Map Reduce
As I mentioned earlier, this map reduce has been enhanced to handle text fields. I ran the same kiva data to find top N loan matches for a lender profile. Here is a snippet of configuration related to text analytic part of this map reduce.
"textMatchingAlgorithm" : "jaccard", "srcNonMatchingTermWeight" : 0.8, "trgNonMatchingTermWeight" : 0.2,
As you can see, I have left the door open for text attribute distance algorithm through configuration. In future, I might implement some other algorithm.
Several other languages besides English are supported. The language is selected by appropriately setting the configuration parameter text.language. Here is the list of languages supported.
- Brazilian Portuguese
You might be interested in this tutorial with all the details of using these the two map reduce jobs in tandem. The tutorial also has a third map reduce to find the top n matches.The tutorial uses data for retail products with one text attribute along with attributes of other types. Here is the JSON schema file for the data.