Elasticsearch or Solr Search Result Quality Evaluation with the NDCG Metric on Spark


You have built an enterprise search engine with Elasticsearch or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your search engine results are satisfying the users and meeting their information needs?

If you have access to relevance feedback data from users, there are various search result relevance metrics that could be calculated. In this post the focus is on computing a metric called Normalized Discounted Cumulative Gain (NDCG) on Spark.

Two kinds of data are necessary to compute NDCG: 1) search engine results as queries are executed and 2) relevance feedback from users who interact with the search results. The Spark implementation of NDCG is available in my open source project visitante.

Search Engine Result

When a query is executed, the search engine returns results with many details. The search results need to be saved by your search application before they are returned to the user. As far as the computation of NDCG is concerned, the following 3 fields are necessary.

  1. Query ID
  2. Document ID
  3. Score

Each query needs to be assigned a unique ID. One way to generate a unique ID from the search keywords is as follows; a code sketch follows the list.

  1. Perform text preprocessing on the keywords, including lemmatization, stop word removal and lower case conversion
  2. Optionally, perform synonym replacement
  3. Sort the keywords
  4. Generate a unique ID, if one does not already exist for the tokens in step 3
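
Here is a minimal sketch of steps 1, 3 and 4 in Scala. Lemmatization and synonym replacement are omitted, and the stop word list is a made-up placeholder. Hashing the sorted tokens is one stateless way to satisfy step 4, since the same normalized query always yields the same ID; a lookup table keyed by the sorted tokens would work equally well.

import java.security.MessageDigest

object QueryIdSketch {
  // placeholder stop word list; a real one would be much larger
  val stopWords = Set("a", "an", "the", "of", "for", "in", "on")

  // steps 1 and 3: lower case, remove stop words, sort the tokens
  def normalize(query: String): Seq[String] =
    query.toLowerCase.split("\\s+").filterNot(stopWords.contains).sorted.toSeq

  // step 4: derive a stable ID by hashing the normalized tokens
  def queryId(query: String): String =
    MessageDigest.getInstance("MD5")
      .digest(normalize(query).mkString(" ").getBytes("UTF-8"))
      .map("%02x".format(_)).mkString
      .take(12)

  def main(args: Array[String]): Unit = {
    // both variants normalize to the same tokens, hence the same ID
    println(queryId("the wireless headphones"))
    println(queryId("Headphones wireless"))
  }
}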

Search engines typically generate a unique ID for each document they index. The search engine also assigns a score to each document returned in response to a query.

The score indicates how relevant the document is with respect to the query, and is calculated from the content of the document relative to the query. Documents are presented to the user in descending order of score, so that the most relevant document appears at the beginning of the search result, i.e. the documents are ranked by descending score.

Here is some sample synthetic data for one query, generated with a Python script

E0NGKNS66TH2,WN88E17Y,0.927
E0NGKNS66TH2,PU448556,0.926
E0NGKNS66TH2,CK42DJ7J,0.872
E0NGKNS66TH2,52LED81S,0.864
E0NGKNS66TH2,58Z09GOT,0.836
E0NGKNS66TH2,KZ30O9JT,0.738
E0NGKNS66TH2,AIU8W7T4,0.723
E0NGKNS66TH2,A300D2BT,0.686
E0NGKNS66TH2,YAW39CW1,0.467
E0NGKNS66TH2,3Z6D2N87,0.420

Relevance Feedback

Ideally, relevance feedback from users should be explicit, e.g. a rating on a scale of 1 to 10. However, in reality it’s difficult to get the user to provide explicit feedback. Instead, we rely on implicit feedback signals to calculate the relevance score. The following implicit feedback signals could be used

  1. Number of clicks
  2. Dwell time on a page
  3. Amount of scrolling performed on a page

These signals need to be combined in some way, e.g. with a weighted average or some other algorithm, to get the final relevance score; a simple weighted combination is sketched below. For our analysis, I have used raw click count as relevance feedback.
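
As an illustration, here is a minimal sketch of a weighted average of the three signals. The weights and the assumption that each signal has already been normalized to the 0 to 1 range are mine, not anything prescribed.

object RelevanceScoreSketch {
  // hypothetical weights; they should be tuned for your application
  val clickWeight = 0.5
  val dwellWeight = 0.3
  val scrollWeight = 0.2

  // each signal is assumed to be normalized to the range 0 to 1
  def relevance(clicks: Double, dwellTime: Double, scrolling: Double): Double =
    clickWeight * clicks + dwellWeight * dwellTime + scrollWeight * scrolling

  def main(args: Array[String]): Unit =
    println(relevance(0.8, 0.6, 0.4))  // 0.66
}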

The relevance feedback data for computing NDCG should have the following fields

  1. Query ID
  2. Document ID
  3. Relevance Score

We will use only the number of clicks as the relevance feedback score, but ideally the other signals should also be used if they are available.

Here is some sample synthetic data; it’s basically click data. Click counts are aggregated and interpreted as the relevance score for each query, document pair.

YH3OVM01WI3L,YYZDH90Q,1
0TK4NM256DGS,981L9E62,1
YH3OVM01WI3L,JWVX95M0,1
H882CSC1K3LP,E7QLD0S0,1
YH3OVM01WI3L,MO8J6X4H,1
YH3OVM01WI3L,YYZDH90Q,1
6FB6CZA8Q58X,B4JR5675,1
YH3OVM01WI3L,MO8J6X4H,1

Normalized Discounted Cumulative Gain (NDCG)

The metric DCG is a function of the relevance scores and ranks of the top n documents in the search result. It’s defined as below

DCG = Σ(i=1 to n) (2^rel(i) − 1) / log2(i + 1)   where
DCG = discounted cumulative gain
n = number of documents
rel(i) = relevance score of the ith ranked document

The steps for calculating NDCG are

  1. Calculate DCG with the documents ranked in descending order of the search engine score, using the relevance scores as the gains
  2. Calculate DCG with the documents ranked in descending order of the relevance score
  3. Divide the result of step 1 by the result of step 2

In the second step, we calculate the maximum possible value of DCG, also called the ideal DCG, which we use to normalize and get NDCG in step 3. A sketch of all three steps follows.
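
Here is a minimal self-contained sketch of the three steps in Scala; the actual visitante implementation may differ in details, and a guard for all-zero relevance is omitted. Each document is represented as a (search score, relevance score) pair for one query.

object NdcgSketch {
  // DCG over a list of relevance scores already in ranked order
  def dcg(rels: Seq[Double]): Double =
    rels.zipWithIndex.map { case (rel, i) =>
      (math.pow(2, rel) - 1) / (math.log(i + 2) / math.log(2))  // rank = i + 1
    }.sum

  // docs: (search engine score, relevance score) pairs for one query
  def ndcg(docs: Seq[(Double, Double)]): Double = {
    // step 1: relevance gains in descending order of search engine score
    val bySearchScore = docs.sortBy(-_._1).map(_._2)
    // step 2: relevance gains in descending order of relevance (ideal ranking)
    val ideal = docs.map(_._2).sortBy(r => -r)
    // step 3: normalize
    dcg(bySearchScore) / dcg(ideal)
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq((0.927, 3.0), (0.926, 1.0), (0.872, 2.0), (0.864, 2.0))
    println(f"NDCG = ${ndcg(docs)}%.3f")
  }
}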

The value of NDCG ranges from 0 to 1. A higher value of NDCG indicates better alignment between the search engine score distribution and the relevance score distribution. In other words, a higher NDCG value indicates users are confirming the search engine’s judgment about the relevancy of documents for a given query.

Spark Job for NDCG

The Spark job is implemented in the Scala object NormDiscCumGain. A union operation is performed on two RDDs, one for the search score data and one for the relevance score data; a simplified sketch of the pipeline follows.
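
In this sketch the file paths are placeholders, and a join is used in place of the tag-and-union approach of the actual job, which amounts to the same thing for documents present in both data sets.

import org.apache.spark.sql.SparkSession

object NdcgJobSketch {
  // DCG over relevance scores already in ranked order
  def dcg(rels: Seq[Double]): Double =
    rels.zipWithIndex.map { case (rel, i) =>
      (math.pow(2, rel) - 1) / (math.log(i + 2) / math.log(2))
    }.sum

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ndcgSketch").getOrCreate()
    val sc = spark.sparkContext

    // search score data: queryID,docID,searchScore
    val searchScore = sc.textFile("searchScore.txt")
      .map(_.split(",")).map(a => ((a(0), a(1)), a(2).toDouble))

    // raw click data: queryID,docID,clickCount, aggregated to a relevance score
    val relScore = sc.textFile("clicks.txt")
      .map(_.split(",")).map(a => ((a(0), a(1)), a(2).toDouble))
      .reduceByKey(_ + _)

    // pair the two scores per (query, document), then compute NDCG per query
    val ndcg = searchScore.join(relScore)
      .map { case ((qId, _), scores) => (qId, scores) }
      .groupByKey()
      .mapValues { docs =>
        val bySearchScore = docs.toSeq.sortBy(-_._1).map(_._2)
        val ideal = docs.map(_._2).toSeq.sortBy(r => -r)
        dcg(bySearchScore) / dcg(ideal)
      }

    ndcg.map { case (q, v) => f"$q,$v%.3f" }.saveAsTextFile("ndcgOut")
  }
}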

The relevance data is aggregated, keyed by the query ID and document ID pair, before the union operation. Here is the output, with one record per query. The first field is the query ID and the second field is the corresponding NDCG value.

H882CSC1K3LP,0.993
U56TS9626Z5I,0.928
0TK4NM256DGS,0.975
2ZRJ3B576G98,0.967
YH3OVM01WI3L,0.983
E0NGKNS66TH2,0.911
6FB6CZA8Q58X,0.969
0S4U7QWSRGB2,0.968
GGEKJ3A6F1XD,0.977
7LOZSYYR2BC3,0.995

From the result, the queries E0NGKNS66TH2 and U56TS9626Z5I seem to be the worst performing. There can be many reasons for poor performance, including a lack of relevant content for the query. The query 7LOZSYYR2BC3 is the best performing.

Relevance Data

In case raw click data is used, the aggregated relevance data can be saved, so that when additional recent click data becomes available it can be added to the existing aggregated relevance data and NDCG recomputed.

When raw click data is used, there is a configuration available to regularize the click data to handle large click counts. The regularization functions available are natural log and base 10 log, selected with the configuration parameter rel.regularizer. The regularization is applied before computing NDCG; a sketch of the idea follows.
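
This sketch shows what such regularization looks like; the value names "logn" and "log10" are illustrative only, so check the visitante documentation for the exact values accepted by rel.regularizer.

object ClickRegularizerSketch {
  // apply the regularization function selected by rel.regularizer
  def regularize(clickCount: Double, regularizer: String): Double = regularizer match {
    case "logn"  => math.log(1 + clickCount)    // natural log
    case "log10" => math.log10(1 + clickCount)  // base 10 log
    case _       => clickCount                  // no regularization
  }

  def main(args: Array[String]): Unit = {
    // a count of 500 is damped to about 6.2, keeping the range manageable
    println(regularize(500, "logn"))
  }
}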

If you are processing relevance data in a separate external process, this step may not be necessary. However, if you are generating relevance data with a wide range of values, regularization should be turned on.

Relevance data should have only the 3 columns shown in the example. If you have additional columns, they need to be filtered out. For removing unnecessary columns and doing any other kind of transformation, you could use my Spark based data transformation ETL solution.

If you want to use only recent relevance data, e.g. from the last year, then the aggregated relevance data should not be saved. Instead, incremental relevance data should be accumulated, and older data should be purged before computing NDCG, as sketched below.
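
A sketch of such a purge is below, assuming each raw click record carries an epoch timestamp in milliseconds as a fourth field; that timestamp field is my assumption, not part of the format shown earlier.

object RelevancePurgeSketch {
  val oneYearMs = 365L * 24 * 60 * 60 * 1000

  // record layout assumed: queryID,docID,clickCount,epochMs
  def isRecent(record: String, nowMs: Long): Boolean =
    nowMs - record.split(",")(3).toLong <= oneYearMs

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis
    val records = Seq(
      s"YH3OVM01WI3L,YYZDH90Q,1,$now",
      s"YH3OVM01WI3L,JWVX95M0,1,${now - 2 * oneYearMs}"
    )
    // keeps only the record from the last year
    println(records.filter(isRecent(_, now)))
  }
}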

Search Score Data

Assuming that search results are logged whenever a query is executed, the search engine result data has to be consolidated before using it for the computation of NDCG.

This step is necessary because the same query may return different search results at different points in time, for many reasons. Here are some examples.

  • A new query may have been executed since the last computation of NDCG.
  • A new document may have been added and indexed by the search engine. This may result in the new document displacing an existing document in the top n search results.
  • An existing document may have been updated, resulting in a change in the search score and potentially changing the rank of the document among the top n results.
  • The search engine configuration may have been tweaked, changing the behavior of the search engine and resulting in a change in the search scores and the ranks.
  • Addition or removal of documents will cause the inverse document frequency of any term to change, and hence the search scores.

To consolidate the search result data, you can use the Spark job for bulk mutation as described in one of my earlier blogs. For this Spark job to work, you need to have a timestamp or sequence column with each record of the search score data. The timestamp or sequence data gets used to decide the most recent record for a given query and document combination. The core of the consolidation logic is sketched below.
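
This sketch assumes the record layout queryID,docID,score,epochMs; the file paths and the epoch timestamp field are placeholders, and the actual bulk mutation job from my earlier blog is more general.

import org.apache.spark.sql.SparkSession

object SearchScoreConsolidationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("consolidate").getOrCreate()
    val sc = spark.sparkContext

    // assumed record layout: queryID,docID,score,epochMs
    val latest = sc.textFile("searchLog.txt")
      .map(_.split(","))
      .map(a => ((a(0), a(1)), (a(3).toLong, a(2))))
      .reduceByKey((x, y) => if (x._1 >= y._1) x else y)  // keep the newest record per key
      .map { case ((qId, dId), (_, score)) => s"$qId,$dId,$score" }

    latest.saveAsTextFile("searchScoreConsolidated")
  }
}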

Wrapping Up

We have gone through a Spark based implementation for computing the NDCG metric, used for evaluating search result quality. Since search results and user click behavior may change with time, NDCG needs to be computed periodically. If you want to execute the use case in this article, please follow the steps in the tutorial.

Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.


