Cassandra Secondary Index Patterns


We all know that any real application needs to do query based on attributes other than the primary key or row key in case of Cassandra. Cassandra version .7 onwards provides native secondary index support. But there are several limitations.

Native Secondary Index

Cassandra’s native index  is like a hashed index, which means you can only do equality query and not range query. The link I just mentioned shows how you can do range query on one attributes once you  have an  index on another attributes, as in the following query written SQL

select * from product where category = xxx and price < yyy

This kind of query will work in Cassandra, provided an index is defined for category. In this case there is no index for price. Cassandra simply does additional filtering once the rows are fetched based on the index on category.

The other limitation is interesting.  Cassandra’s index is only recommended for attributes with low cardinality, i.e. attributes that have few unique values e.g., color of a product. It’s not immediately clear why this limitation exists and and I am sure it has something to do with the implementation of secondary index in Cassandra.

For attributes with high cardinality, the documentation recommends using separate column families and to implement your own secondary index. There are few suggestions there. In this post, I will expand upon those ideas and provide a set of secondary index patterns as far as storage and usage is concerned.

Secondary Index Patterns

Index storage structure is different depending whether the whole index is stored in one row or not

One row per index

In the first set of  patterns the whole index is stored in one row of a column family. The row key in this column family is the name of secondary index, which should be unique in a keyspace.

This index structure is appropriate when the number of rows in the target column family is not expected to be very large. A column family with super column is used for storage of this kind of index. There will be as many rows in this column family, as the number of indexes you define.

The super column name is the indexed column value. As far as the columns under the super column are concerned, there are several variations. In simplest one, the column name is the row key for the target column family and column value is empty.

Type 1

Index name Indexed col value Indexed col value
Target row key Target row key Target row key Target row key

The next pattern is same as before, except that the column value contains some column value of the target row. This kind of caching may prevent an additional query on the target column family based on the row key found from the index, if the goal of the query was to fetch the column cached in the index.

Type 2

Index name Indexed col value Indexed col value
Target row key Target row key Target row key Target row key
Target col value Target col value Target col value Target col value

In the third variation,the column name and value are switched. This kind of index is useful if you want to get the target row keys  sorted by some column value of the target column family. An equivalent SQL query will be as follows

select * from product where category = xxx order by price

For this example, category will be the indexed column value stored in super column name. Column name will be price and column value will store the row key, as below.

With the last two patterns, it is possible to do range queries with row key, which is not possible unless you have ordered partitioning, instead of hash partitioning.

Type 3

Index name Indexed col value Indexed col value
Target col value Target col value Target col value Target col value
Target row key Target row key Target row key Target row key

This is essentially a composite index, the first one being the super column  name and the second  one the column name.  We can do point queries or range queries using them.  Since they are both sorted we can do range query on both.

The following equivalent SQL query will be able to fully leverage this kind of index. Here we are doing point query on one and range query on the other.

select * from product where category = xxx and price < yyy

This query is same as the one mentioned earlier, except that it will use  real secondary index for both columns

Since the second and third pattern  involve an additional column from the target column family, there is some additional overhead for index management. Whenever the column value changes, the index needs to be revised, to reflect the change.

The index patterns described so far can be useful for paginating through a column family, e.g, to fetch 10 rows at  a time from a column family.

Since the super columns and the sub columns are sorted, with appropriate use of SliceRange, the pagination logic can be easily implemented.

With the first two patterns, the pagination is ordered by the row key and with the third pattern, the pagination is ordered by the  column values of the target column family.

One row per index value

For the patterns described so far, the whole index is stored in one row. In that row there is one column for each row in the target column family.  When very large number of rows are expected in the target column  family, storing the  index in one row, may not be practical.

Instead, for the second set of patterns,  we use a column family with regular columns and use the indexed column values as the row key.  Essentially,  an index will be stored in multiple rows, instead of one as in the earlier patterns.

Actually, the row key will be indexed column value prefixed with index name, to make it unique. Since the the same column family may be used for storing indexes for multiple target column families, indexed column value may not be unique. The simplest structure is as follows.

Type 4

Index name + col value Target row key Target row key

As before, when we want to cache a column from the target column family , the structure is as below.

Type 5

Index name + col value Target row key Target row key
Target col value Target col value

To be able to get target row keys sorted or to be able to to do range query on an additional column, we just flip the contents of column name and value, as below.

Type 6

Index name + col value Target col value Target col value
Target row key Target row key

One advantage of the first set of super column base index patterns is that you can do range queries based on the indexed column, in addition to equality query, since super column names are sorted.For example, if order data is indexed with order date, the following query will be able to leverage the index.

select * from order where orderDate > xxx

For simple ordered access through row keys, type 1,2,4 or 5 can be used with a dummy indexed column value. It’s a a fictitious column value that encompasses all rows in the target column family.

Type 3 and type 6 index are only valid when the relation between the target column value and the row key is one to one i.e., when the target column values are unique, unless a list of row keys is stored as the column value.

Target Column Family

So far in the discussion, I have been referring to what’s being indexed as the target row. Strictly speaking, we are indexing a target record. I am defining a record as a collection of columns or super columns. A record may occupy a whole row. In that case, storing the row key as the target of the index makes sense.

However, a row in a super column family may store multiple records, each super column being a record. For example, in a super column family for storing order, each row might store all the orders for the day. Within the row, each super column could store an order. The super column name could be the order time, which will sort the orders by the time of the day.

Date Time Time
cust Id amount cust Id amount
___ ___ ___ ___

To handle this kind of storage structure, the index needs to store a concatenation of row key and super column name in the target column family, instead of the just the row key. Only then the order record can be accessed from the index for our example.

Final Thoughts

I have described few simple patterns for Cassandra secondary indexes. By no means, I claim that they will meet requirements of any query. Feel free to suggest any other pattern that you think might be useful.

Earlier, I blogged about a simple Cassandra API that I implemented. I am planning to enhance that by adding support for these index patterns and named queries like in Hibernate. The user will define the indexes and named queries in JSON, which the API will consume at run time. The full source code is hosted on github. It’s work in progress.

About these ads

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source contributor. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Data Mining and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Cassandra, Indexing, NOSQL and tagged , . Bookmark the permalink.

7 Responses to Cassandra Secondary Index Patterns

  1. Geeek says:

    Hello..
    In the type 2 & 3, can you please provide real world data/example ? Like, instead of saying, “row key”, “indexed col value” etc., can you say in terms of category, price, etc., ? It will be easy to understand? thanks..

  2. Pranab says:

    @Geek
    Here is an example

    Type 2
    indexed col val : product category
    target row key (col name) : whatever is the row key for product col family
    target col value (col value) : product price

    Type 3
    Same as before except that target row key and target col value are switched, so that you get the row keys sorted by some column value (product price in this example) in the target col family

  3. Pingback: Cassandra Secondary Index Patterns | BigDataCloud.com

  4. Ganges says:

    How to use one-to-many and many-to-many relationship indexes in Cassandra.

  5. Pranab says:

    You could use type 1, with one row for one entity type and the other row for the other entity type

  6. Indika says:

    Is there any way to create secondary index using two col names? Do I need to customize Cassandra for implement such a solution?

  7. Pingback: Count data written in last 15 min to Cassandra | BlogoSfera

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s