Duplicate Data Detection with Neural Network and Contrastive Learning


Duplicate data is a ubiquitous problem in the data world. It often appears when data from different silos are consolidated, and it can be an issue in any analytic project based on data aggregated from various sources. The training data for a machine learning model may also contain duplicates, and unless they are removed they will have an adverse impact on model performance. In this post we will go through a simple feed forward neural network based solution for finding duplicates. It is applicable to any structured data, whether a relational table or JSON.

The solution is available in the open source project avenir. PyTorch has been used for the NN model; it could easily be reimplemented with TensorFlow.

Near Duplicate Data

The general approach is to pair a record with all other records and compute a similarity. If the similarity is above some threshold, the pair of records is considered a duplicate. Here are the two steps for the similarity calculation:

  • Find the pairwise similarity between corresponding fields of the two records
  • Aggregate the field similarities into a record similarity

For numeric fields, the field similarity could simply be a normalized difference. For text fields there are various options. Edit distance algorithms are computationally intensive and do not scale well for long text. Vectorization based approaches scale better; the frequencies may be normalized, and sometimes the log of the frequency is taken. Two common vectorizations are listed below.

  • Term i.e. word frequency inverse document frequency, or TF-IDF
  • Frequency of character ngrams

I have chosen a character ngram based approach, which is more appropriate when the data is characterized by small typographical errors. It has the following advantages compared to TF-IDF (a small sketch of the approach follows the list):

  • The vocabulary size is fixed and depends only on the number of characters in the ngram, so there is no out of vocabulary issue
  • It is more tolerant of typographical errors in the data
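
As a minimal illustration of the ngram approach (this is my own sketch, not the avenir implementation, and the function names are illustrative), a text field similarity can be computed from character ngram frequency vectors as follows.

from collections import Counter
import math

def char_ngrams(text, n=2):
    # character ngram frequency vector for a text field
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(freq1, freq2):
    # cosine similarity between two sparse ngram frequency vectors
    dot = sum(freq1[g] * freq2[g] for g in set(freq1) & set(freq2))
    norm1 = math.sqrt(sum(v * v for v in freq1.values()))
    norm2 = math.sqrt(sum(v * v for v in freq2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# a small typo leaves most ngrams intact, so the similarity stays high
print(cosine_similarity(char_ngrams("Chagrin Falls"), char_ngrams("Chagrin Qalls")))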

Field Similarity Aggregation with Machine Learning

The second task in finding record similarity is aggregation of the field similarities. There are various distance and similarity metrics that could be used for this. Some of the popular ones are listed below:

  • Euclidean
  • Cosine
  • Minkowski
  • Manhattan

All these metrics treat all fields equally, which may not be appropriate for highly nuanced business data, where all fields may not be equally relevant. For example, the email ID may differ for the same customer coming from different sources. We would like to assign different weights to different fields.

Instead of guessing the weights we can train a neural network for regression. The training data consists of pairs of records and a target value: 1 if the two records are duplicates and 0 otherwise. For a positive case, the two records are near duplicates. A negative case is generated by pairing a record with a random record, so that the probability of the two being very similar is very low. The input size of the network is the number of fields. If the network has no hidden layer, the edge weights directly represent the field weights.

By training the network we learn the field weights in a data driven way instead of guessing them. Essentially we empirically discover a custom record similarity metric that is most appropriate for the data we have. A minimal sketch of such a network is shown below.
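
Here is a rough PyTorch sketch of such a network (not the avenir wrapper, which is driven by a configuration file): one input per field similarity and a single linear output unit whose weights act as the field weights.

import torch
import torch.nn as nn

NUM_FIELDS = 6  # one similarity value per field

# no hidden layer: a single linear unit whose weights act as the field weights
model = nn.Linear(NUM_FIELDS, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(field_sim, target):
    # field_sim: (batch, NUM_FIELDS) field similarities; target: 1.0 for duplicate pairs, 0.0 otherwise
    optimizer.zero_grad()
    pred = model(field_sim).squeeze(1)
    loss = loss_fn(pred, target)
    loss.backward()
    optimizer.step()
    return loss.item()

# after training, the learnt field weights
print(model.weight.data)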

The technique of learning to keep similar data close together in some representation space and dissimilar data far apart is known as contrastive learning. It is a powerful mechanism used in both supervised and unsupervised learning settings, and it has been used successfully in few shot learning.
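
For reference, the classic pairwise form of contrastive loss pulls similar pairs together and pushes dissimilar pairs at least a margin apart; a minimal sketch is shown below. The duplicate detector in this post uses a simpler regression objective on field similarities, so this is for illustration only.

import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    # label is 1 for similar pairs, 0 for dissimilar pairs
    dist = F.pairwise_distance(emb1, emb2)
    loss = label * dist.pow(2) + (1 - label) * torch.clamp(margin - dist, min=0).pow(2)
    return loss.mean()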

Training Data Preparation

As an example I have used customer data from two different sources with some near duplicates. The data has been synthetically generated based on a sample customer data set I downloaded. The customer record has the following fields:

  • Name
  • Address
  • City
  • State
  • Zip
  • Email

For the positive cases, near duplicate data has been generated with a script by randomly mutating some characters. In real life, to generate training data you would have to manually find near duplicate records and label each such record pair with 1.

For the negative cases, one of the records from the pair is selected and paired with some random record. Since real duplicate records are going to be rare compared to normal records, you have to make sure that the training data is balanced, i.e. you have a more or less equal number of positive and negative cases. A sketch of this pair generation process is shown below.
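
Here is a rough sketch of how such training pairs could be generated; the actual data generation script is in the avenir project, and the function names below are just for illustration.

import random
import string

def mutate(text, num_changes=1):
    # introduce small typographical errors to create a near duplicate field value
    chars = list(text)
    if not chars:
        return text
    for _ in range(num_changes):
        pos = random.randrange(len(chars))
        chars[pos] = random.choice(string.ascii_lowercase)
    return "".join(chars)

def generate_pairs(records):
    # one positive (mutated copy) and one negative (random pairing) per record keeps the data balanced
    pairs = []
    for rec in records:
        near_dup = [mutate(field) for field in rec]
        pairs.append((rec, near_dup, 1))
        other = random.choice([r for r in records if r is not rec])
        pairs.append((rec, other, 0))
    return pairs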

Training Model and Detecting Duplicates

I have used a neural network without any hidden layer; adding a hidden layer did not have any significant impact. I use a wrapper class around PyTorch for the feed forward network. The wrapper class, along with a configuration file, allows configuration and training of a network without writing any Python code. The configuration file for our network and the driver script, which contains code for data preparation, training and duplicate detection, are available in the avenir project. Please consult the tutorial document for detailed instructions.

Here is some output from duplicate detection. To find near duplicates between a newly arriving data set and the existing data set, each record of the new data set is paired with every record of the existing data set and a prediction is made with the trained model. If the prediction output is close to 1, a near duplicate has been found.

['Serina Zagen', '7 S Beverly Dr', 'Hays', 'KS', '67601', 'serina@yahoo.com']  0.198
['Graciela Ruta', '98 Connecticut Ave Nw', 'Chagrin Qalls', 'OH', '44023', 'gruta@cox.net']  1.072
['Haydee Denooyer', '25346 New Rd', 'Anchorage', 'AK', '99515', 'haydee@aol.com']  0.446

For the second record in the output, a near duplicate has been found in the existing data set, because the similarity score is close to 1.0. A sketch of the detection loop is shown below.
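
This is a sketch of the detection loop, assuming a record_field_similarities helper (a hypothetical function, for example built from the ngram similarity functions sketched earlier) that returns the per field similarities for a pair of records, and the trained model from above.

import torch

def find_near_duplicates(new_records, existing_records, model, threshold=0.9):
    # pair each new record with every existing record and keep pairs the model scores near 1
    matches = []
    for new_rec in new_records:
        for old_rec in existing_records:
            field_sim = record_field_similarities(new_rec, old_rec)  # hypothetical helper, returns n field similarities
            score = model(torch.tensor([field_sim], dtype=torch.float32)).item()
            if score >= threshold:
                matches.append((new_rec, old_rec, score))
    return matches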

If you have the setting train.print.weights=True in the configuration file, it will output the network edge weights. You will find that the weight distribution is far from uniform; the network has learnt the relative importance of the various fields.

Although I have not used any hidden layer for this particular problem, it should not be construed as a general guideline. For another data set in another domain, a hidden layer may be appropriate.

Contrastive Learning

As alluded to earlier, contrastive learning is based on similarities between data items in some representation space. Generally the representation is also learnt by the network. Here, however, we have defined the representation manually, following these two steps, which result in a similarity vector of size n when n fields are involved (a small sketch follows the list):

  • Find the ngram frequency vector for each field
  • Find the cosine similarity between the ngram vectors of corresponding fields in the two records
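
Put together, those two steps could be implemented as a small helper; this is the record_field_similarities function assumed in the detection sketch above, reusing the illustrative char_ngrams and cosine_similarity functions from earlier.

def record_field_similarities(rec1, rec2):
    # one cosine similarity per corresponding field pair, giving a vector of size n
    return [cosine_similarity(char_ngrams(f1), char_ngrams(f2)) for f1, f2 in zip(rec1, rec2)]

rec_a = ['Graciela Ruta', '98 Connecticut Ave Nw', 'Chagrin Falls', 'OH', '44023', 'gruta@cox.net']
rec_b = ['Graciela Ruta', '98 Connecticut Ave Nw', 'Chagrin Qalls', 'OH', '44023', 'gruta@cox.net']
print(record_field_similarities(rec_a, rec_b))  # six values, all close to 1 for this near duplicate pair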

Wrapping Up

Not all fields are created equal when it comes to similarity calculation. We have trained a simple neural network model that learns the relative importance of the various fields. I set the problem up as a regression problem. It could also be set up as a classification problem by adding a softmax at the output. I preferred regression so that you can manually set a minimum threshold on the output for detecting duplicates.

Instead of a neural network, multiple regression, i.e. multiple predictors and one target, could also have been used. I used a neural network because I was not sure whether non linearities would be needed to solve the problem. Moreover, a neural network is likely to work much better for higher dimensional data.

Generally data pre processing is done on training data before training ML models. Here we have turned it around and used ML to solve a data pre processing problem.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solutions.

3 Responses to Duplicate Data Detection with Neural Network and Contrastive Learning

  1. Kenneth Foster says:

    Hi Pranab, if I could ask a high level question: You mention record similarity metrics (like cosine similarity) and their weights, and suggest that a neural network or multiple regression could be used to find these appropriate weights. Can the weights from a neural network truly be transposed into weights for a similarity metric and still produce robust results? Seems too simple to be true!

    Thanks in advance for sharing your work,
    -Kenneth

    • Pranab says:

      I have computed pairwise similarity for all fields in vectorized form between the 2 records. Did that for both positive and negative examples. Fed the field similarity values to a simple linear NN to train for similarity at the record level.

      Ideally I should have used a Siamese twin network with the original vectorized field values of the record pair as input.
