Recently, I had a request to support semantic matching in sifarish, my open source matching and recommendation engine. By semantic matching, I mean any algorithm that does not rely on explicit keyword based match.
In this post, I will provide the details of a semantic network based solution and the enhancements made in sifarish to implement the solution. I will use an example with personal preferences and show how to find the match between two persons, given their preferences in various facets of life.
Text matching feature is already provided in sifarish. That feature is useful if the entities have text fields and you want explicit keyword based match. It uses Lucene for pre processing and Cosine or Jaccard similarity measures.
As input we have an entity and set of associated tags. For example. for user preference the entity is a person and the tags represent preferences for the user. A semantic matching algorithm will find match between two tags, even though there may not be any keyword based match between the tags. For example, if one user1 likes a chinese restaurant A and user2 likes chinese restaurant B, a semantic matching based solution will be able to find a match, because they both like chinese cuisine or asian cuisine.
Since there may be a plethora of algorithms for semantic matching, I decided to implement the solution as a plugin framework, so that custom algorithms may be used, provided through a plugin. My solution based on semantic network will be one such plugin, available in sifarish.
There are various ways to represent knowledge in AI. Semantic network is one simple and intuitive way of representing knowledge. Semantic network is essentially a graphical structure where the nodes are concepts or some entity. The entities usually appear at the leaf level of the graph.
The arcs connecting nodes represent relationship between concepts or the entities. You can think of the arc as a predicate connecting a subject with an object. Here is an example
Resource Description Framework
The next question that naturally comes up is how to represent the semantic network. I decided to use the Resource Description Framework (RDF). RDF is an XML based semantic web standard. It’s essentially a set (subject, predicate, object) of triples. An object can be a literal, in which case it’s a leaf in the RDF graph. It can also be another subject, which will be an internal node in RDF graph
I will be using the Apache Jena RDF implementation. Jena provides API for building and querying RDF model. An RDF model can also be loaded from an XML file describing the RDF model.
Each tag in our example is assumed to be an object literal in the RDF model. To find the match between two tags, we navigate the RDF model graph starting from the two object literals which are the leaf nodes and we find the shortest possible path between the two leaf nodes. The shorter the path length, better is the match. There are other algorithms for finding semantic similarity. The following paper is a good resource
“Element level semantic matching using WordNet” by Mikalai Yatskevich and Fausto Giunchiglia
In the so called Leacock Chodorow matcher, the matching score is defined as below. It’s function of the the shortest path length normalized by the maximum path length in the graph.
In our implementation, we have not taken the natural logarithm. Since we are interested in top k matches, we only care about the relative score.
The map reduce class ItemDuynamicAttributeSimilarity is used for semantic matching. This a general purpose map reduce used to find similarities between entities with variable number of attributes, but all of the same type. One example is document similarity, where each document is associated with a set of words.
Each record of the input consists of an user ID, followed by a variable number RDF object literal. In our example, the object literal represents something the user likes. Here is a snippet of the input
L12E6PCN38,http://sports/baseball/Giants 30GDF2YX36,http://restaurant/chinese/Hong Fu,http://restaurant/thai/Red Pepper GIA3U30T20,http://restaurant/chinese/Hong Fu,http://sports/footbal/49ers E1BTUFA8H0,http://sports/baseball/Mariners 1HUT68NQV1,http://sports/baseball/Giants,http://restaurant/chinese/Hong Fu ...............................................
The RDF model used for this problem can be found here. Each XML element represents a subject and all associated predicate object pairs. The attributes following the user ID in the input corresponds to rdf:about attribute value of the leaf nodes in the RDF model. Each of these represents something the user likes for our example.
The configuration for the map reduce defines the class name for the semantic matcher class and other related parameters. The configuration for the specific semantic matcher class is defined through the semantic.matcher.params parameter. The RDF model is stored in HDFS. the path is defined through semantic.rdf.modelFilePath parameter. The semantic matching scale, which in our case is the twice the maximum path length of the RDF model graph is defined through semantic.match.scale.
The workhorse of the semantic matching algorithm is in the class ResourceDescribedEntity. It’s extended from the TaggedEntity class. All plugin classes need to to extend this class. It computes all the paths through a node and then finds all the intersecting nodes between two sets of the paths, corresponding to the two nodes.
The match() method of the class is called for every possible pairs of user preferences between two users. The in memory RDF model is built using the RDF data in the HDFS file. Once built it’s cached for the duration of the reducer.
Here is some sample output from the map reduce. Each line of the input consists of two user IDs, followed by the semantic distance, normalized to a scale of 1000.
L1XYIH2I18,9KHZI4FTQU,0 L1XYIH2I18,30GDF2YX36,125 L1XYIH2I18,QM1FP0NPCQ,1000 L1XYIH2I18,E2Q1135HFQ,375 L1XYIH2I18,VE0NSHKC13,1000 L1XYIH2I18,T04XAF17NA,0 JG15VV4E4O,I6QWUJMBFI,0 JG15VV4E4O,1HNJG054HF,500 JG15VV4E4O,S0H8P3E7QE,0 JG15VV4E4O,I390HTSXAV,500 JG15VV4E4O,0IKPKA4P06,1000
Optionally, we can also include some matching related context data in our output. The context data is the intersection node in the shortest path calculation between the two nodes. This output can be fed to the MR ToMatches to find the top k matches for a given user.
As long as the domain knowledge representation is available in a structured forms as a semantic net, the algorithm described here is an effective way to find match. Another approach which is lot more complex is Natural Language Processing (NLP). Some of the specific techniques are Latent Semantic Indexing and Random Indexing.