If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process and parallel cluster processing with Hadoop or Spark becomes a necessity. In this post we will focus on de duplication based on exact match, whether for the whole record or set of specified key fields. De duplication can also be performed based on fuzzy matching. We will address de duplication for flat record oriented data only.
The Spark based implementation is available in my open source project chombo. There is a corresponding Hadoop based implementation also in the same project.
Duplicates Based on Exact Match
Exact match is based on simple string equality. If the exact match is to be done for the whole record, then it’s a trivial problem in Spark. You can simply use the distinct function in Spark. However, in may real world problems, equality between records is defined by a sub set of the fields in the record.
Consider merging of customer data between 2 companies, one of which has acquired the other one. In finding duplicates between the two customer data sets, it will suffice to consider only the following fields.
- first name
- last name
- zip code
- last 4 digits of SSN
These fields are enough to uniquely identify a customer. Considering all fields is unnecessary. Here some example fields, inclusion of which will exacerbate the de duplication process and cause it to fail.
- last transaction data
- last login date
These fields are transitory and do not define innate characteristics of a customer.
Business to Business Order Data
We will use Business to Business (B2B) order data as the use case. The data set is available from UCI data repository. The data consists of the following fields. A business customer places order for a set of product, and they all get the same invoice number.
- invoice no
- stock code
- invoice date
- unit price
- customer ID
The second field stock code uniquely identifies a product. Here is some sample input.
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/10 8:26,7.65,17850,United Kingdom 536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/10 8:26,4.25,17850,United Kingdom 536366,22633,HAND WARMER UNION JACK,6,12/1/10 8:28,1.85,17850,United Kingdom 536366,22632,HAND WARMER RED POLKA DOT,6,12/1/10 8:28,1.85,17850,United Kingdom 536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,12/1/10 8:34,1.69,13047,United Kingdom 536367,22745,POPPY'S PLAYHOUSE BEDROOM ,6,12/1/10 8:34,2.1,13047,United Kingdom 536367,22748,POPPY'S PLAYHOUSE KITCHEN,6,12/1/10 8:34,2.1,13047,United Kingdom
We may get duplicate orders, because multiple representatives from the same buyer organization inadvertently may place the same order. Another way, duplicates may be created is by double clicking on the submit button in the order placement page. Considering these scenarios, we are going to find duplicates based on partial matching using the fields stock code, quantity and customer ID
Duplicates should be found for orders data with invoice date not too far apart and within a small time window of 2 or 3 days. If the invoice date is too far apart, the duplicates found may be legitimate orders that happen to have the same items and quantities.
The invoice data set was down loaded from UCI data repository. I used a small small sub set of the actual data. A cursory look at the data revealed that there were lot of missing and invalid data. I decided to run a Spark based simple data validator from chombo to purge all the invalid records.
The implementation is in the Spark job SimpleDataValidator. Validation checks are performed based on number of fields in the record, missing fields and whether the field value comply with expected data type. Lot of records were discarded resulting from validation failure. The output is same as the input, except for fewer number of records.
Duplicate processing is implemented in the Spark job DuplicateRemover. The output can be controlled in 3 different ways through configuration parameter.
- all : all records included in output
- duplicate : only duplicate records included in the output
- unique : only unique records included in the output
In case of all or duplicate, when duplicate records are found, one of them is selected randomly to be part of the output. Rest of the records are discarded. With partial matching, there is no guarantee that the record selected among the duplicates, is the one desired, causing some ambiguity. Here is some sample output
536464,21329,DINOSAURS WRITING SET ,1,12/1/10 12:23,1.65,17968,United Kingdom 536404,21063,PARTY INVITES JAZZ HEARTS,12,12/1/10 11:29,0.85,16218,United Kingdom 536415,22551,PLASTERS IN TIN SPACEBOY,3,12/1/10 11:57,1.65,12838,United Kingdom 536488,22144,CHRISTMAS CRAFT LITTLE FRIENDS,2,12/1/10 12:31,2.1,17897,United Kingdom 536385,22783,SET 3 WICKER OVAL BASKETS W LIDS,1,12/1/10 9:56,19.95,17420,United Kingdom
To alleviate this problem, all the duplicate records can also be output as an option. Here are some duplicate records
(22749,1,17920,536412,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,1,12/1/10 11:49,3.75,17920,United Kingdom) (22749,1,17920,536412,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,1,12/1/10 11:49,3.75,17920,United Kingdom)
The first 3 fields are the key fields and duplication is detected based on these fields. The rest is the actual record. This output file can reviewed to remove any ambiguity alluded to earlier.
Record Group wise Duplicates
In our use case, what is really required is a way to detect duplicates for a group of records. When there are duplicate orders, the same of line items with same quantity for each line item will appear under 2 separate invoices. The de duplication process should select one of those groups of records i.e one of the 2 invoice numbers.
However, the current implementation does single record level de duplication only. As a result, for a given customer with duplicate orders, we may find line items from both invoices. It will be necessary for someone to select one of the invoice numbers from the line items. I will circle back some time to tackle the problem of de duplication for groups of records.
We have gone through a Spark based implementation for de duplicating data. For a step by step instruction to execute the use case, please refer to the tutorial document.
We have used eCommerce order data as example. Any data set can be used for duplicate removal. We have addressed only de duplication of flat record oriented data. De duplication of binary data (think email attachment) is also a big topic.