You have built an enterprise search engine with Elasticsearch or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your search engine results are satisfying the users and meeting their information needs?
If you have access to relevance feedback data from users, various search result relevance metrics can be calculated. In this post the focus is on computing a metric called Normalized Discounted Cumulative Gain (NDCG) on Spark.
Two kinds of data are necessary to compute NDCG: 1) search engine results as queries are executed and 2) relevance feedback from users who interact with the search results. The Spark implementation of NDCG is available in my open source project visitante.
Search Engine Result
When a query is executed, the search engine returns results with lots of details. The search results need to be saved by your search application before they are returned to the user. As far as the computation of NDCG is concerned, the following 3 fields are necessary.
- Query ID
- Document ID
- Score
Each query needs to be assigned a unique ID. One way to generate a unique ID from the search keywords is as follows
- Perform text pre-processing on the keywords, including lemmatization, stop word removal, and lowercase conversion
- Optionally, perform synonym replacement
- Sort the keywords
- Generate a unique ID, if one does not already exist for the tokens from step 3
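The steps above can be sketched in Python. The function name, stop word list, and 12-character ID length here are illustrative assumptions, and lemmatization is omitted since it would need an NLP library.

```python
import hashlib

def query_id(keywords, stop_words=frozenset({"the", "a", "an", "of"}),
             synonyms=None):
    """Derive a stable query ID from raw search keywords."""
    synonyms = synonyms or {}
    tokens = []
    for word in keywords.lower().split():        # lowercase conversion
        if word in stop_words:                   # stop word removal
            continue
        tokens.append(synonyms.get(word, word))  # optional synonym replacement
    tokens.sort()                                # make the ID order-independent
    # hash the normalized tokens into a fixed-length ID
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()[:12].upper()
```

Because the tokens are normalized and sorted before hashing, queries like "the red shoes" and "shoes red" map to the same ID.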
Search engines typically generate a unique ID for each indexed document. The search engine also assigns a score to each document returned in response to a query. The score indicates how relevant the document is with respect to the query, and is calculated from the content of the document relative to the query. Documents are presented to the user in descending order of score, so that the most relevant document appears at the beginning of the search result, i.e. the documents are ranked by descending score.
Here is some sample synthetic data for one query, generated with a Python script.
E0NGKNS66TH2,WN88E17Y,0.927
E0NGKNS66TH2,PU448556,0.926
E0NGKNS66TH2,CK42DJ7J,0.872
E0NGKNS66TH2,52LED81S,0.864
E0NGKNS66TH2,58Z09GOT,0.836
E0NGKNS66TH2,KZ30O9JT,0.738
E0NGKNS66TH2,AIU8W7T4,0.723
E0NGKNS66TH2,A300D2BT,0.686
E0NGKNS66TH2,YAW39CW1,0.467
E0NGKNS66TH2,3Z6D2N87,0.420
Relevance Feedback
Ideally, relevance feedback from users should be explicit, e.g. a rating on a scale of 1 to 10. However, in reality it's difficult to get users to provide explicit feedback. Instead, we rely on implicit feedback signals to calculate the relevance score. The following implicit feedback signals could be used
- Number of clicks
- Dwell time on a page
- Amount of scrolling performed on a page
These signals need to be combined in some way, e.g. a weighted average or some other algorithm, to produce the final relevance score. For our analysis, I have used the raw click count as relevance feedback.
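As a sketch of such a weighted average: the signal caps, weights, and function name below are illustrative assumptions, not part of visitante. Each signal is scaled to [0, 1] before combining so that no signal dominates purely because of its units.

```python
def relevance_score(clicks, dwell_secs, scroll_frac,
                    weights=(0.5, 0.3, 0.2), max_dwell=300.0):
    """Combine implicit feedback signals into one relevance score in [0, 1]."""
    c = min(clicks, 10) / 10.0                  # cap click count at 10
    d = min(dwell_secs, max_dwell) / max_dwell  # cap dwell time
    s = max(0.0, min(scroll_frac, 1.0))         # fraction of page scrolled
    wc, wd, ws = weights
    return wc * c + wd * d + ws * s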
The relevance feedback data for computing NCDG should have the following fields
- Query ID
- Document ID
- Relevance Score
We will use only the number of clicks as the relevance feedback score. Ideally, however, the other signals should also be used if they are available.
Here is some sample synthetic data. It's basically click data. Click counts are aggregated and interpreted as the relevance score for each query and document pair.
YH3OVM01WI3L,YYZDH90Q,1
0TK4NM256DGS,981L9E62,1
YH3OVM01WI3L,JWVX95M0,1
H882CSC1K3LP,E7QLD0S0,1
YH3OVM01WI3L,MO8J6X4H,1
YH3OVM01WI3L,YYZDH90Q,1
6FB6CZA8Q58X,B4JR5675,1
YH3OVM01WI3L,MO8J6X4H,1
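The per-record clicks above get summed per (query ID, document ID) pair before the metric is computed. A minimal Python sketch of that aggregation (the Spark job does the equivalent with a reduce by key):

```python
from collections import Counter

def aggregate_clicks(click_lines):
    """Sum click counts per (query ID, document ID) pair."""
    counts = Counter()
    for line in click_lines:
        query_id, doc_id, clicks = line.strip().split(",")
        counts[(query_id, doc_id)] += int(clicks)
    return counts
```

On the sample above, the pair (YH3OVM01WI3L, YYZDH90Q) appears twice and aggregates to a relevance score of 2.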
Normalized Discounted Cumulative Gain (NDCG)
The metric DCG is a function of the relevance scores and ranks of the top n documents in the search result. It is defined as DCG@n = sum over i = 1 to n of rel_i / log2(i + 1), where rel_i is the relevance score of the document at rank i; the logarithmic discount means a relevant document contributes less the further down the ranking it appears.
The steps for calculating NDCG are
- Calculate DCG with documents ranked in descending order of the search engine score
- Calculate DCG with documents ranked in descending order of the relevance score
- Divide the result of step 1 by the result of step 2
In the second step, we calculate the maximum possible value of DCG, the so-called ideal DCG, so we can use it to normalize in step 3 and get NDCG.
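These steps can be sketched in plain Python; the Spark job distributes this per-query computation, and the function names here are illustrative.

```python
import math

def dcg(scores):
    """Discounted cumulative gain of relevance scores listed in rank order."""
    # rank i (1-based) is discounted by log2(i + 1)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(scores))

def ndcg(ranked_docs, relevance):
    """ranked_docs: document IDs in descending search-score order.
    relevance: document ID -> user relevance score for this query."""
    # step 1: DCG in the order the search engine ranked the documents
    actual = dcg([relevance.get(doc, 0.0) for doc in ranked_docs])
    # step 2: ideal DCG, same documents re-ranked by descending relevance
    ideal = dcg(sorted((relevance.get(d, 0.0) for d in ranked_docs), reverse=True))
    # step 3: normalize
    return actual / ideal if ideal > 0 else 0.0
```

If the search engine's ranking already matches the relevance ordering, NDCG is exactly 1; any disagreement pushes it below 1.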
The value of NDCG ranges from 0 to 1. A higher value of NDCG indicates better alignment between the search engine score distribution and the relevance score distribution. In other words, a higher NDCG value indicates users are confirming the search engine's judgement about the relevancy of documents for a given query.
Spark Job for NDCG
The Spark job is implemented in the Scala object NormDiscCumGain. A union operation is performed on two RDDs, one for the search score data and one for the relevance score data.
The relevance data is aggregated, keyed by the query ID and document ID pair, before the union operation. Here is the output, with one record per query. The first field is the query ID and the second field is the corresponding NDCG value.
H882CSC1K3LP,0.993
U56TS9626Z5I,0.928
0TK4NM256DGS,0.975
2ZRJ3B576G98,0.967
YH3OVM01WI3L,0.983
E0NGKNS66TH2,0.911
6FB6CZA8Q58X,0.969
0S4U7QWSRGB2,0.968
GGEKJ3A6F1XD,0.977
7LOZSYYR2BC3,0.995
From the result, the queries E0NGKNS66TH2 and U56TS9626Z5I seem to be the worst performing. There can be many reasons for poor performance, including lack of relevant content for the query. The query 7LOZSYYR2BC3 is the best performing.
In case raw click data is used, the aggregated relevance data can be saved, so that when additional recent click data becomes available it can be added to the existing aggregated relevance data and NDCG recomputed.
When raw click data is used, a configuration option is available to regularize the click data to handle large click counts. The regularization functions available are natural log and base 10 log, selected with the configuration parameter rel.regularizer. The regularization is applied before computing NDCG.
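A sketch of what such regularization might look like; the exact transform used by visitante may differ, and log(1 + x) is used here as an assumption so that a zero click count maps to zero.

```python
import math

def regularize(clicks, regularizer="log"):
    """Dampen large click counts before using them as relevance scores."""
    if regularizer == "log":      # natural log
        return math.log(1 + clicks)
    if regularizer == "log10":    # base-10 log
        return math.log10(1 + clicks)
    return float(clicks)          # regularization turned off
```

The log transform compresses the range: a document with 1000 clicks ends up only a few times more relevant than one with 10 clicks, instead of 100 times.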
If you are processing the relevance data separately in another process, then this step may not be necessary. However, if you are generating relevance data with a wide range of values, regularization should be turned on.
Relevance data should have only the 3 columns shown in the example. If you have additional columns, they need to be filtered out. For removing unnecessary columns and doing any other kind of transformation, you could use my Spark based data transformation ETL solution.
If you want to use only recent relevance data, e.g. the last year, then the aggregated relevance data should not be saved. Instead, incremental relevance data should be accumulated, and before computing NDCG, older data should be purged.
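A sketch of that purge step, assuming (as an illustration) that each relevance record carries an epoch timestamp as its first element:

```python
def purge_old(records, now_ts, window_secs=365 * 24 * 3600):
    """Drop relevance records that fall outside the retention window."""
    return [r for r in records if now_ts - r[0] <= window_secs]
```

The surviving records would then be aggregated and fed into the NDCG computation as before.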
Search Score Data
Assuming that search results are logged whenever a query is executed, the search engine result data has to be consolidated before it is used for the computation of NDCG.
This step is necessary because the same query may return different search results at different points in time, for many reasons. Here are some examples.
- A new query may have been executed since the last computation of NDCG.
- A new document may have been added and indexed by the search engine. This may result in the new document displacing an existing document in the top n search results
- An existing document may have been updated, resulting in a change in the search score and potentially changing the rank of the document among the top n results
- Search engine configuration may have been tweaked, changing the behavior of the search engine and resulting in a change in the search scores and ranks
- Addition or removal of documents will cause the inverse document frequency of any term to change, and hence the search scores
To consolidate the search result data, you can use the Spark job for bulk mutation as described in one of my earlier blogs. For this Spark job to work, you need to have a timestamp or sequence column with each record of the search score data. The timestamp or sequence data is used to decide the most recent record for a given query and document combination.
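The consolidation logic can be sketched in Python; assume, for illustration, that each search score record is a (timestamp, query ID, document ID, score) tuple and the record with the largest timestamp wins.

```python
def consolidate(records):
    """Keep only the most recent search score per (query ID, document ID)."""
    latest = {}
    for ts, qid, doc, score in records:
        key = (qid, doc)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, score)
    # strip timestamps, leaving (query ID, document ID) -> most recent score
    return {key: ts_score[1] for key, ts_score in latest.items()}
```

In the actual Spark job the equivalent is a reduce by key that keeps the record with the maximum timestamp per (query, document) pair.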
We have gone through a Spark based implementation for computing the NDCG metric used for evaluating search result quality. Since search results and user click behavior may change with time, NDCG needs to be computed periodically. If you want to execute the use case in this article, please follow the steps in the tutorial.
For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.