Semantic Matching with Hadoop

Recently, I had a request to support semantic matching in sifarish, my open source matching and recommendation engine. By semantic matching, I mean any algorithm that does not rely on explicit keyword-based matching.

In this post, I will provide the details of a semantic-network-based solution and the enhancements made in sifarish to implement it. I will use an example with personal preferences and show how to find the match between two persons, given their preferences in various facets of life.

A text matching feature is already available in sifarish. That feature is useful when the entities have text fields and you want an explicit keyword-based match. It uses Lucene for pre-processing and cosine or Jaccard similarity measures.


As input we have an entity and a set of associated tags. For example, for user preferences the entity is a person and the tags represent the user's preferences. A semantic matching algorithm will find a match between two tags, even though there may not be any keyword-based match between them. For example, if user1 likes Chinese restaurant A and user2 likes Chinese restaurant B, a semantic matching solution will be able to find a match, because they both like Chinese cuisine, or more broadly, Asian cuisine.

Since there is a plethora of possible algorithms for semantic matching, I decided to implement the solution as a plugin framework, so that custom algorithms can be provided through plugins. My semantic-network-based solution will be one such plugin, available in sifarish.

Semantic Network

There are various ways to represent knowledge in AI. A semantic network is one simple and intuitive way of representing knowledge. It is essentially a graph structure where the nodes are concepts or entities. The entities usually appear at the leaf level of the graph.

The arcs connecting nodes represent relationships between the concepts or entities. You can think of an arc as a predicate connecting a subject with an object. Here is an example:

John likes San Francisco 49ers.
where subject = John, predicate = likes and object = San Francisco 49ers

Resource Description Framework

The next question that naturally comes up is how to represent the semantic network. I decided to use the Resource Description Framework (RDF). RDF is an XML-based semantic web standard. It's essentially a set of (subject, predicate, object) triples. An object can be a literal, in which case it's a leaf in the RDF graph. It can also be another subject, which becomes an internal node in the RDF graph.

I will be using the Apache Jena RDF implementation. Jena provides an API for building and querying an RDF model. An RDF model can also be loaded from an XML file describing it.

Matching Algorithm

Each tag in our example is assumed to be an object literal in the RDF model. To find the match between two tags, we navigate the RDF model graph starting from the two object literals, which are leaf nodes, and find the shortest possible path between them. The shorter the path, the better the match. There are other algorithms for finding semantic similarity; the following paper is a good resource:

“Element level semantic matching using WordNet” by Mikalai Yatskevich and Fausto Giunchiglia
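To make the shortest-path idea concrete, here is a minimal sketch using plain breadth-first search over a tiny hand-built concept graph. The graph content, node names, and method names are illustrative only; sifarish navigates the actual RDF model instead.

```java
import java.util.*;

// Shortest-path semantic distance over a small undirected concept graph.
// The graph below is hand-built for illustration; in sifarish the graph
// comes from the RDF model.
public class SemanticDistance {

    static int shortestPath(Map<String, List<String>> graph, String from, String to) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (node.equals(to)) return dist.get(node);
            for (String next : graph.getOrDefault(node, List.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(node) + 1);
                    queue.add(next);
                }
            }
        }
        return -1; // leaves are unreachable from each other
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("asian", List.of("chinese", "thai"));
        graph.put("chinese", List.of("asian", "Hong Fu", "Golden Wok"));
        graph.put("thai", List.of("asian", "Red Pepper"));
        graph.put("Hong Fu", List.of("chinese"));
        graph.put("Golden Wok", List.of("chinese"));
        graph.put("Red Pepper", List.of("thai"));

        // Two Chinese restaurants meet at the "chinese" node: path length 2
        System.out.println(shortestPath(graph, "Hong Fu", "Golden Wok"));
        // A Chinese and a Thai restaurant meet at "asian": path length 4
        System.out.println(shortestPath(graph, "Hong Fu", "Red Pepper"));
    }
}
```

The shorter distance between the two Chinese restaurants is exactly what makes them a better semantic match than the Chinese and Thai pair.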

In the so-called Leacock-Chodorow matcher, the matching score is defined as below. It's a function of the shortest path length normalized by the maximum path length in the graph.

score = -ln(path(n1, n2) / (2 * D))
where path(n1, n2) is the shortest path between nodes n1 and n2 and
D is the maximum path length in the semantic graph

In our implementation, we have not taken the natural logarithm. Since we are interested in the top k matches, we only care about the relative score.
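For illustration, here is a minimal sketch of both the logged Leacock-Chodorow score and the unlogged normalized distance. The method names are illustrative, and the scale of 1000 mirrors the output format described later in this post.

```java
// Leacock-Chodorow style scoring: score = -ln(path / (2 * D)).
// The unlogged variant is a plain normalized distance: smaller is better,
// and the relative ordering of pairs is the same as with the logarithm.
public class MatchScore {

    static double leacockChodorow(int pathLen, int maxPathLen) {
        return -Math.log((double) pathLen / (2.0 * maxPathLen));
    }

    // Unlogged variant, scaled to an integer range (illustrative only).
    static int normalizedDistance(int pathLen, int maxPathLen, int scale) {
        return pathLen * scale / (2 * maxPathLen);
    }

    public static void main(String[] args) {
        int maxPathLen = 5; // D: maximum path length in the semantic graph

        // Closer pair yields a higher logged score
        System.out.println(leacockChodorow(2, maxPathLen));
        System.out.println(leacockChodorow(4, maxPathLen));

        // Unlogged distances on a 0..1000 scale
        System.out.println(normalizedDistance(2, maxPathLen, 1000)); // 200
        System.out.println(normalizedDistance(4, maxPathLen, 1000)); // 400
    }
}
```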

Map Reduce

The map reduce class ItemDynamicAttributeSimilarity is used for semantic matching. It's a general-purpose map reduce for finding similarities between entities with a variable number of attributes, all of the same type. One example is document similarity, where each document is associated with a set of words.

Each record of the input consists of a user ID, followed by a variable number of RDF object literals. In our example, each object literal represents something the user likes. Here is a snippet of the input:

30GDF2YX36,http://restaurant/chinese/Hong Fu,http://restaurant/thai/Red Pepper
GIA3U30T20,http://restaurant/chinese/Hong Fu,http://sports/footbal/49ers
1HUT68NQV1,http://sports/baseball/Giants,http://restaurant/chinese/Hong Fu

The RDF model used for this problem can be found here. Each XML element represents a subject and all its associated predicate-object pairs. The attributes following the user ID in the input correspond to the rdf:about attribute values of the leaf nodes in the RDF model. Each of these represents something the user likes.
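The model file is plain RDF/XML. A hypothetical fragment for the restaurant example might look like the following; the predicate names and the namespace are illustrative, not the actual sifarish model file:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:p="http://example.org/pref#">
  <!-- "chinese" is a subject; its predicate-object pairs link it to a
       parent concept and to the leaf-level restaurants users refer to -->
  <rdf:Description rdf:about="http://restaurant/chinese">
    <p:subConceptOf rdf:resource="http://restaurant/asian"/>
    <p:instance rdf:resource="http://restaurant/chinese/Hong Fu"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://restaurant/thai">
    <p:subConceptOf rdf:resource="http://restaurant/asian"/>
    <p:instance rdf:resource="http://restaurant/thai/Red Pepper"/>
  </rdf:Description>
</rdf:RDF>
```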

The configuration for the map reduce defines the class name of the semantic matcher and other related parameters. The configuration specific to the semantic matcher class is provided through the semantic.matcher.params parameter. The RDF model is stored in HDFS; the path is defined through the semantic.rdf.modelFilePath parameter. The semantic matching scale, which in our case is twice the maximum path length of the RDF model graph, is defined through semantic.match.scale.
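Put together, a configuration fragment might look like this. Only the three parameter names described above come from this post; the class-name key, the package, the HDFS path, and the scale value are illustrative placeholders.

```properties
# Hypothetical key for the matcher plugin class; the package is a placeholder
semantic.matcher.class=org.sifarish.feature.ResourceDescribedEntity

# Parameters specific to the chosen matcher class
semantic.matcher.params=...

# HDFS path of the RDF model file (placeholder path)
semantic.rdf.modelFilePath=/user/sifarish/model/preference.xml

# Twice the maximum path length of the RDF model graph (here assuming D = 5)
semantic.match.scale=10
```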

The workhorse of the semantic matching algorithm is the class ResourceDescribedEntity. It extends the TaggedEntity class, which all plugin classes need to extend. It computes all the paths through a node and then finds all the intersecting nodes between the two sets of paths corresponding to the two nodes.
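The intersection idea can be sketched as follows: collect each leaf's chain of ancestors with their depths, intersect the two chains, and take the meeting node with the smallest combined depth. The parent map and helper names here are hypothetical, not sifarish code.

```java
import java.util.*;

// Meeting-point distance: intersect the two leaves' ancestor chains and
// take the minimum combined depth. Illustrative only; sifarish derives
// the paths from the RDF model.
public class PathIntersection {

    // Depth of every node reachable by following parent links from the leaf.
    static Map<String, Integer> ancestorDepths(Map<String, String> parent, String leaf) {
        Map<String, Integer> depths = new LinkedHashMap<>();
        int depth = 0;
        for (String node = leaf; node != null; node = parent.get(node)) {
            depths.put(node, depth++);
        }
        return depths;
    }

    static int distance(Map<String, String> parent, String leafA, String leafB) {
        Map<String, Integer> a = ancestorDepths(parent, leafA);
        Map<String, Integer> b = ancestorDepths(parent, leafB);
        int best = -1;
        for (Map.Entry<String, Integer> e : a.entrySet()) {  // intersecting nodes
            Integer db = b.get(e.getKey());
            if (db != null) {
                int d = e.getValue() + db;
                if (best < 0 || d < best) best = d;
            }
        }
        return best; // -1 if the leaves share no ancestor
    }

    public static void main(String[] args) {
        Map<String, String> parent = Map.of(
            "Hong Fu", "chinese", "Golden Wok", "chinese", "Red Pepper", "thai",
            "chinese", "asian", "thai", "asian");
        System.out.println(distance(parent, "Hong Fu", "Golden Wok")); // meet at "chinese": 2
        System.out.println(distance(parent, "Hong Fu", "Red Pepper")); // meet at "asian": 4
    }
}
```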

The match() method of the class is called for every possible pair of user preferences between two users. The in-memory RDF model is built using the RDF data in the HDFS file. Once built, it's cached for the duration of the reducer.

Here is some sample output from the map reduce. Each line of the output consists of two user IDs, followed by the semantic distance, normalized to a scale of 1000.


Optionally, we can also include some matching-related context data in the output. The context data is the intersection node found in the shortest path calculation between the two nodes. This output can be fed to the MR ToMatches to find the top k matches for a given user.

Summing Up

As long as the domain knowledge is available in a structured form as a semantic network, the algorithm described here is an effective way to find matches. Another approach, which is a lot more complex, is Natural Language Processing (NLP). Some of the specific techniques are Latent Semantic Indexing and Random Indexing.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed-form solution.
This entry was posted in AI, Big Data, Hadoop and Map Reduce, Java, Semantic.
