Identifying Duplicate Records with Fuzzy Matching


I was prompted to write this post in response to a recent discussion thread in the LinkedIn Hadoop Users Group about fuzzy string matching for duplicate record identification with Hadoop. As part of my open source Hadoop based recommendation engine project sifarish, I have a MapReduce class for fuzzy matching between entities with multiple attributes. It's used for content based recommendation. My initial reaction was that the same MapReduce class could be used for duplicate detection, by modelling all the fields of a record as text fields and calculating the Jaccard similarity between corresponding text fields.

On further thought, I realized that Jaccard similarity may not be the right choice and that Levenshtein Distance is more appropriate. Accordingly, I added support for Levenshtein Distance based text matching to sifarish. In this post, I will show how it's been used to detect duplicate records.

Fuzzy Entity Matching

The details of the matching algorithms can be found in my earlier posts. For an entity, various attribute types are supported, including integer, double, categorical, text, time, location, etc. Essentially, for any pair of entities, the distance is calculated between corresponding attributes. The attribute wise distances are then aggregated over all the attributes of an entity to find the distance between the two entities.

Why is Jaccard distance not appropriate for this use case? Given two blocks of text, Jaccard similarity depends on the number of words common to both and the union of words in the two text blocks. Two words either match exactly or not at all, whether they differ in one character or in many. For example, the street numbers 2934 and 2935 differ by a single character but count as completely different words for Jaccard. Levenshtein Distance gives us a more fine grained text matching algorithm.

Levenshtein Distance

Given two words, it's defined as the number of inserts and deletes of characters necessary to make one word the same as the other. The Levenshtein Distance between two words w1 and w2 is calculated as follows

distance(w1,w2) = length(w1) + length(w2) – 2 * maxCommonSubSeqLength(w1,w2)
where
maxCommonSubSeqLength is the length of the longest common subsequence of w1 and w2.

The distance is normalized by dividing it by the sum of the lengths of the two words, as below, forcing the normalized distance to always be between 0 and 1.0.

normDistance(w1,w2) = distance(w1,w2) / (length(w1) + length(w2))
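
For example, for the words w1 = "encino" and w2 = "encina", the longest common subsequence is "encin", of length 5, so

distance = 6 + 6 – 2 * 5 = 2
normDistance = 2 / (6 + 6) = 0.167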

For a pair of text blocks, the normalized distance is calculated between corresponding word pairs. The final distance between the two text fields is the distance averaged over all the words. The implementation is in the class EditDistanceSimilarity. Various attribute distance algorithms are supported, e.g., Euclidean, Manhattan, etc.

This is not the only way to calculate Levenshtein Distance. It is also possible to take the whole text field as one token and calculate the distance by setting the parameter edit.dist.token to false. A word of caution: the edit distance computation time grows non linearly with the token length, so treating the whole text field as one token may be very computationally intensive.
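
To make the word level calculation concrete, here is a minimal Java sketch of the normalized distance described above, based on the longest common subsequence formula. It is only illustrative and is not the actual sifarish EditDistanceSimilarity code; the class and method names are mine, and it assumes the words of the two text fields are compared positionally.

public class EditDistanceSketch {
	// length of the longest common subsequence of w1 and w2
	static int maxCommonSubSeqLength(String w1, String w2) {
		int[][] lcs = new int[w1.length() + 1][w2.length() + 1];
		for (int i = 1; i <= w1.length(); ++i) {
			for (int j = 1; j <= w2.length(); ++j) {
				lcs[i][j] = w1.charAt(i - 1) == w2.charAt(j - 1) ?
					lcs[i - 1][j - 1] + 1 : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
			}
		}
		return lcs[w1.length()][w2.length()];
	}

	// normalized distance between two words, always between 0 and 1
	static double normDistance(String w1, String w2) {
		int dist = w1.length() + w2.length() - 2 * maxCommonSubSeqLength(w1, w2);
		return (double) dist / (w1.length() + w2.length());
	}

	// distance between two text fields, averaged over corresponding word pairs
	static double textFieldDistance(String t1, String t2) {
		String[] wordsOne = t1.trim().split("\\s+");
		String[] wordsTwo = t2.trim().split("\\s+");
		int count = Math.min(wordsOne.length, wordsTwo.length);
		double sum = 0;
		for (int i = 0; i < count; ++i) {
			sum += normDistance(wordsOne[i], wordsTwo[i]);
		}
		return count > 0 ? sum / count : 0;
	}

	public static void main(String[] args) {
		// prints a small non zero distance for two near duplicate address fields
		System.out.println(textFieldDistance("2934 Encino Pt", "2935 Encino Pt"));
	}
}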

Similarity Map Reduce

The class SameTypeSimilarity contains the implementation. The details of its inner workings can be found in my earlier posts. Essentially, it does a hash based self join and works as follows

  1. We configure a set of buckets
  2. Each record is hashed into one of the buckets
  3. A bucket pair and the associated records are processed in one reducer call.
  4. The reducer pairs records from the two buckets in a nested loop and calculates distances, as sketched below.
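
The following is a simplified sketch of what one reducer call does conceptually. It is not the actual SameTypeSimilarity reducer, and the helper names are mine; it assumes the records of the two buckets in a bucket pair have already been collected on the reducer side.

import java.util.List;

public class BucketPairSketch {
	// a record is hashed into one of bucket.count buckets using its ID field
	static int bucket(String record, int bucketCount) {
		String id = record.split(",")[0];
		return (id.hashCode() & Integer.MAX_VALUE) % bucketCount;
	}

	// one reducer call: pair every record of one bucket with every record
	// of the other bucket in a nested loop and emit the distance
	static void processBucketPair(List<String> bucketOne, List<String> bucketTwo) {
		for (String recOne : bucketOne) {
			for (String recTwo : bucketTwo) {
				double dist = recordDistance(recOne, recTwo);
				System.out.println(recOne.split(",")[0] + "," + recTwo.split(",")[0] + "," + dist);
			}
		}
	}

	// placeholder for the attribute wise distance calculation described earlier
	static double recordDistance(String recOne, String recTwo) {
		return 0.0;
	}
}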

It turns out that proper configuration of the number of buckets is critical for this example. The number of buckets should be large enough that each reducer call does not have to process too many records. The number of buckets is set using the parameter bucket.count. While running the map reduce, I had the Hadoop job tracker terminating a reducer task because of a heart beat timeout, until I increased the bucket count.

As a rule of thumb, I would suggest choosing the number of buckets such that each bucket has about 4 or 5 records. Otherwise you may have the unfortunate experience of a reducer taking too much processing time in one call and eventually getting terminated on timeout.
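
For example, if the input has around 5,000 records, that rule of thumb works out to roughly a thousand buckets; the exact value below is only an illustration.

bucket.count=1000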

In this Map Reduce, every record is compared with every other record, so the complexity is O(n x n). However, for our particular example, where the two data sets are supposed to be identical, a simpler approach could be taken: simply compare corresponding records from each set. The complexity in that case would be O(n).

Attribute Distance Threshold

One way to optimize the problem and bring the complexity down to almost O(n) is to abandon distance processing between two records as soon as the distance between an attribute pair is found to be above a predefined threshold.

For example, if the distance between the name fields of two customer records is large enough, we can skip processing the remaining attributes and set the distance between the two records to a large value.
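
Here is a minimal sketch of that early abandonment idea. It is not the sifarish code, and the method and parameter names are mine; the per attribute thresholds would come from the distThreshold values in the meta data, as shown below.

public class ThresholdSketch {
	// returns the aggregate distance, or a large value as soon as any
	// attribute pair is further apart than its threshold
	static double recordDistance(String[] attrsOne, String[] attrsTwo, double[] distThreshold) {
		double sum = 0;
		for (int i = 0; i < attrsOne.length; ++i) {
			double dist = attributeDistance(attrsOne[i], attrsTwo[i]);
			if (dist > distThreshold[i]) {
				// abandon processing of the remaining attributes
				return Double.MAX_VALUE;
			}
			sum += dist;
		}
		return sum;
	}

	// placeholder for the attribute level distance, e.g. the text distance described earlier
	static double attributeDistance(String one, String two) {
		return one.equals(two) ? 0.0 : 1.0;
	}
}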

The threshold value is defined in the meta data JSON file as below. The threshold can be set for any number of attributes. This is how it’s set for the name field of a customer record.

{
	"name" : "name",
	"ordinal" : 1,
	"dataType" : "text",
	"distThreshold" : 0.4
}

Two Near Identical Data Sets

If you have two data sets where the records are supposed to be identical, except for small differences, the processing can be performed in O(n) time. This optimization is done by enabling inter set matching as below.

inter.set.matching=true
set.ID.size=1

The first parameter enables inter set matching. For this to work, the entity ID of each record needs to be encoded in a special way. The first few characters of the entity ID are the set ID and the remaining characters comprise the real entity ID within a set. The length of the set ID is defined by the second parameter above.

Essentially, matching is performed between records from the two sets only if the entity IDs match after peeling off the set ID. So we are finding distances only between corresponding entities from the two data sets.
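
A small sketch of the set ID peeling logic, assuming the encoding described above; the class and method names are mine. With set.ID.size=1, the IDs 106379 and 206379 come from different sets but peel down to the same entity ID 06379, so the two records get compared.

public class InterSetMatchSketch {
	// compare two records only if they come from different sets and their
	// entity IDs match after peeling off the set ID prefix
	static boolean shouldCompare(String idOne, String idTwo, int setIdSize) {
		String setOne = idOne.substring(0, setIdSize);
		String setTwo = idTwo.substring(0, setIdSize);
		String entityOne = idOne.substring(setIdSize);
		String entityTwo = idTwo.substring(setIdSize);
		return !setOne.equals(setTwo) && entityOne.equals(entityTwo);
	}

	public static void main(String[] args) {
		System.out.println(shouldCompare("106379", "206379", 1));   // true
		System.out.println(shouldCompare("106379", "258280", 1));   // false
	}
}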

Duplicate Customer Records

We will be using customer records as the example. A record has the following fields. The first field, the ID, does not enter into the distance calculation in any way.

  1. ID
  2. Name
  3. Address
  4. City and state
  5. Phone number

The data consists of two sets of identical customer records. For some records, I introduced small typographical errors. Here is some sample input

106379,Richard Gomez,2934 Encino Pt,San Antonio TX 78259,210 4811932
158280,Jim Dobbs,2756 Bulls Bay Hwy,Jacksonville FL 32220,312 2850284
137943,Robert Lewis,194 Buckboard Dr,Augusta GA 30907,406 5404029
125849,Jena Marin,276 Durham St,Menlo Park CA 94025,650 2957406
156290,Dharam Patel,84 Prospect Hill Dr,Tewksbury MA 01876,718 4702915
.........
206379,Richard Gomez,2935 Encino Pt,San Antonio TX 78259,210 4811932
258280,Jim Dobbs,2756 Bulls Bay Hwy,Jacksonville FL 32220,312 2850284
237943,Robert Lewis,194 Buckboard Dr,Augusta GA 30908,406 5404029
225849,Jena Marin,276 Durham St,Menlo Park CA 94025,650 2957406

The output of the MR class SameTypeSimilarity consists of the ID pair followed by the distance. There are three types of output, as follows

  1. Distance is 0 for identical records
  2. Distance is close to 0 for records that are identical except for small  typographical errors
  3. Different records with high distance value

The second type is of main interest to us. Because of the fuzzy matching logic, we are able to identify records as duplicates in spite of small typographical errors. Here is some sample output. The last column is the distance scaled by 1000.

106379,278253,757
106379,206379,16
178253,278253,0
178253,206379,757
278253,206379,757
168295,185062,612

We see all 3 types of output as described earlier. Let's pay attention to the records with IDs 106379 and 206379, whose distance is 16. It's a case of near duplicate records with small typographical errors. Here are the two corresponding records. There is one wrong character in the address field.

106379,Richard Gomez,2934 Encino Pt,San Antonio TX 78259,210 4811932
206379,Richard Gomez,2935 Encino Pt,San Antonio TX 78259,210 4811932

That one wrong character has caused the distance between the records to be 16 instead of 0.
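
At the word level, the numbers work out as follows. For the street numbers 2934 and 2935, the longest common subsequence is 293, so the word distance is 4 + 4 – 2 * 3 = 2 and the normalized distance is 2 / 8 = 0.25. Averaged over the three words of the address field, the field distance is about 0.083, while all the other fields contribute 0. The exact record level value of 16 then depends on how the attribute distances are aggregated and scaled by 1000, as described next.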

Attribute Distance Aggregation

The attribute distances are aggregated over all attributes. There are different aggregation algorithms. For this example, I have used Manhattan distance, a.k.a. L1 distance.

All attributes are treated equally. However, it is possible to assign different attribute weights through the meta data. For example, if I had specified a weight greater than 1 (e.g., 1.2) for the phone number field, it would have shrunk the distance for that field. In other words, the distance between two phone numbers is made to appear smaller than the actual distance.
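
Here is one way to read the weighting, as a sketch; this is my interpretation rather than the sifarish code. Dividing each attribute distance by its weight makes a weight above 1 shrink that attribute's contribution to the Manhattan aggregate.

public class WeightedManhattanSketch {
	// weighted Manhattan (L1) aggregation of attribute distances; a weight
	// greater than 1 reduces that attribute's contribution
	static double aggregate(double[] attrDist, double[] weight) {
		double sum = 0;
		for (int i = 0; i < attrDist.length; ++i) {
			sum += attrDist[i] / weight[i];
		}
		return sum;
	}
}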

Wrapping Up

In this post I have shown how to use a Hadoop based solution for finding near duplicate records. Here is a tutorial document with details on how to run the map reduce job.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay Area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source contributor. I am passionate about technology and green, sustainable living. My technical interest areas are Big Data, distributed processing, NoSQL databases, data mining and programming languages. I am fascinated by problems that don't have a neat closed form solution.

6 Responses to Identifying Duplicate Records with Fuzzy Matching

  1. Mathew says:

    Hi Pranab,
    A great post. I found it a really great idea for identifying duplicate records.

  2. Mathew says:

    Thank you Pranab!

  3. Asutosh Parida says:

    Hi Pranab,
    Let's say I have a huge number of candidate records (CSV data), and I want to find duplicate records based on the email ID and contact number of the candidate using MapReduce.

    I am using a WritableComparator for my candidate object; the problem is that I am not able to figure out where to compare two objects.
    Can you please help me out?

  4. Pranab says:

    Please follow the steps in the tutorial to find near duplicate records using sifarish. There are complex algorithms involved in finding the distance between two objects with different types of attributes. The link to the tutorial is at the end of the blog. Here it is anyway

    https://github.com/pranab/sifarish/blob/master/resource/duplicate_detection_tutorial.txt

  5. Pingback: Location and Time Based Service | Mawazo
