Removing Duplicates from Order Data Using Spark

If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process and parallel cluster processing with Hadoop or Spark becomes a necessity. In this post we will focus on de duplication based on exact match, whether for the whole record or set of specified key fields. De duplication can also be performed based on fuzzy matching. We will address de duplication for flat record oriented data only.

The Spark based implementation is available in my open source project chombo. There is a corresponding Hadoop based implementation also in the same project.

Duplicates Based on Exact Match

Exact match is based on simple string equality. If the exact match is to be done for the whole record, then it’s a trivial problem in Spark. You can simply use the distinct function in Spark. However, in may real world problems, equality between records is defined by a sub set of the fields in the record.

Consider merging of customer data between 2 companies, one of which has acquired the other one. In finding duplicates between the two customer data sets, it will suffice to consider only the following fields.

first name
last name
zip code
last 4 digits of SSN

These fields are enough to uniquely identify a customer. Considering all fields is unnecessary. Here some example fields, inclusion of which will exacerbate the de duplication process and cause it to fail.

last transaction data
last login date

These fields are transitory and do not define innate characteristics of a customer.

Business to Business Order Data

We will use Business to Business (B2B) order data as the use case. The data set is available from UCI data repository. The data consists of the following fields. A business customer places order for a set of product, and they all get the same invoice number.

invoice no
stock code
description
quantity
invoice date
unit price
customer ID
country

The second field stock code uniquely identifies a product. Here is some sample input.

536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/10 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/10 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/10 8:28,1.85,17850,United Kingdom
536366,22632,HAND WARMER RED POLKA DOT,6,12/1/10 8:28,1.85,17850,United Kingdom
536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,12/1/10 8:34,1.69,13047,United Kingdom
536367,22745,POPPY'S PLAYHOUSE BEDROOM ,6,12/1/10 8:34,2.1,13047,United Kingdom
536367,22748,POPPY'S PLAYHOUSE KITCHEN,6,12/1/10 8:34,2.1,13047,United Kingdom

We may get duplicate orders, because multiple representatives from the same buyer organization inadvertently may place the same order. Another way, duplicates may be created is by double clicking on the submit button in the order placement page. Considering these scenarios, we are going to find duplicates based on partial matching using the fields stock code, quantity and customer ID

Duplicates should be found for orders data with invoice date not too far apart and within a small time window of 2 or 3 days. If the invoice date is too far apart, the duplicates found may be legitimate orders that happen to have the same items and quantities.

Data Validation

The invoice data set was down loaded from UCI data repository. I used a small small sub set of the actual data. A cursory look at the data revealed that there were lot of missing and invalid data. I decided to run a Spark based simple data validator from chombo to purge all the invalid records.

The implementation is in the Spark job SimpleDataValidator. Validation checks are performed based on number of fields in the record, missing fields and whether the field value comply with expected data type. Lot of records were discarded resulting from validation failure. The output is same as the input, except for fewer number of records.

Duplicate Removal

Duplicate processing is implemented in the Spark job DuplicateRemover. The output can be controlled in 3 different ways through configuration parameter.

all : all records included in output
duplicate : only duplicate records included in the output
unique : only unique records included in the output

In case of all or duplicate, when duplicate records are found, one of them is selected randomly to be part of the output. Rest of the records are discarded. With partial matching, there is no guarantee that the record selected among the duplicates, is the one desired, causing some ambiguity. Here is some sample output

536464,21329,DINOSAURS  WRITING SET ,1,12/1/10 12:23,1.65,17968,United Kingdom
536404,21063,PARTY INVITES JAZZ HEARTS,12,12/1/10 11:29,0.85,16218,United Kingdom
536415,22551,PLASTERS IN TIN SPACEBOY,3,12/1/10 11:57,1.65,12838,United Kingdom
536488,22144,CHRISTMAS CRAFT LITTLE FRIENDS,2,12/1/10 12:31,2.1,17897,United Kingdom
536385,22783,SET 3 WICKER OVAL BASKETS W LIDS,1,12/1/10 9:56,19.95,17420,United Kingdom

To alleviate this problem, all the duplicate records can also be output as an option. Here are some duplicate records

(22749,1,17920,536412,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,1,12/1/10 11:49,3.75,17920,United Kingdom)
(22749,1,17920,536412,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,1,12/1/10 11:49,3.75,17920,United Kingdom)

The first 3 fields are the key fields and duplication is detected based on these fields. The rest is the actual record. This output file can reviewed to remove any ambiguity alluded to earlier.

Record Group wise Duplicates

In our use case, what is really required is a way to detect duplicates for a group of records. When there are duplicate orders, the same of line items with same quantity for each line item will appear under 2 separate invoices. The de duplication process should select one of those groups of records i.e one of the 2 invoice numbers.

However, the current implementation does single record level de duplication only. As a result, for a given customer with duplicate orders, we may find line items from both invoices. It will be necessary for someone to select one of the invoice numbers from the line items. I will circle back some time to tackle the problem of de duplication for groups of records.

Summing Up

We have gone through a Spark based implementation for de duplicating data. For a step by step instruction to execute the use case, please refer to the tutorial document.

We have used eCommerce order data as example. Any data set can be used for duplicate removal. We have addressed only de duplication of flat record oriented data. De duplication of binary data (think email attachment) is also a big topic.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.

View all posts by Pranab →

2 Responses to Removing Duplicates from Order Data Using Spark

Wim Geurden (@wgeurden) says:

January 4, 2018 at 11:11 am

Very interesting and usefull. Just to make sure – the fuzzy matcher referenced is implemented in the sifarish project, not in the chombo project- correct ?

I am new to spark/hadoop/…. – how hard is it to port the SameTypeSimilarity class so it can be used by chombo ?

- Pranab says:
  
  January 4, 2018 at 5:34 pm
  
  That’s correct. It’s in sifarish. It’s been ported to chombo under the name RecordSimilarity