Storing Nested Objects in Cassandra with Composite Columns


One of the popular features of MongoDB is the ability to store arbitrarily nested objects and be able to index on any nested field. In this post I will show how to store nested objects in Cassandra using composite columns. I have recently added this feature to my open source Cassandra project agiato. Since in Cassandra, like many other NOSQL databases, stored data is highly denormalized. The denormalized data  often manifests itself in the form of a nested object e.g., denormalizing one to many relations.

In the  solution presented here, the object data is stored in column families with composite columns. An object will typically have some statically defined fields. Rest of the fields will be dynamic. It is possible to define the table or column family behind the object in CQL3, based on the statically known fields of the object.

Composite Column

Super columns are frowned upon in Cassandra. Composite columns are the preferred approach. Composite columns can do what super columns could do and more. The underlying data model in a super column is a 3 level nested map(including row key, super column name or key and column name or key). Composite columns are more flexible, allowing you to create a data model with multi level nested map.

Here is an example of data model based on composite columns. Consider a building HVAC system with the sensor data  stored in a column family with composite columns. Here are the fields.

day day as integer 20130421
city city name Los Angeles
buildingID An integer ID for a building 2658
sensorID An integer ID for a sensor in a building 295016
time timestamp as long 1361237156
temp temarature as double 71.2
humidity relative humidity as double 30.7

A record is uniquely identified with the tuple (day, city, buildingID, sensorID, time). The corresponding column family definition with composite key will be as below

create column family buildingControl 
with key_validation_class = 'IntegerType'
and comparator = 'CompositeType(UTF8Type, IntegerType, IntegerType, LongType, UTF8Type)'
and default_validation_class='DoubleType';

The composite column along with row key provide a 6 level nested map. The composite column has 5 components. The first 4 correspond to the value of the  tuple (city, buildingID, sensorID, time).  The last element is the name of a non primary column. The storage layout of a record will look as below

20130421 Los Angeles:2658:295016:1361237156:temp Los Angeles:2658:295016:1361237156:humidity
71.2 30.7

The first 4 fields of the composite column name are the values of (city, buildingID, sensorID, time) values respectively, which are the clustering keys. The last component is the column name for a non primary key column. The value of the composite column is the corresponding non primary key column value.

The composite column name is shown in a human readable form. Internally, the composite column name is stored with some byte encoding logic. As a nested Map,  a record, in this example can be represented as Map<Integer,Map<String, Map<Integer, Map<Integer, Map<Long, Map<String, Double>>>>>> . Strictly speaking all the inner maps are sorted maps. As we will see later, in terms of CQL, the outer map corresponds to the row key and all the inner maps except for the last one correspond to the clustering key.

A record comprises of a set of columns. The set of columns is repeated in the row for different unique values of (city, buildingID, sensorID, time). Essentially, a row will contain multiple records. Before composite key was introduced in Cassandra, people had resorted to using this pattern for multi level map by doing string concatenation for column values and names.

Primary Key Syntactic Sugar

In thrift, the interface to Cassandra is tightly coupled to the internal storage structure.  CQL introduces a layer of abstraction between the underlying storage structure and the logical model and makes the the data model  look like a familiar RDBMS table. The entity in our example has 7 attributes and as shown below, all 7 of them appear as columns in the CQL  table definition.

CREATE TABLE buildingControl (
	day int,
	city text,
	buildingID int,
	sensorID int,
	time timestamp,
	temp double,
	humidity double
	PRIMARY KEY (day, city, buildingID, sensorID, time)
);

The primary key definition maps to the storage structure as follows. The first element day is the row key or the partitioning key. The remaining 4 elements constitute the clustering key. Internally, a record is stored with  2 physical columns, as we saw in the thrift data model definition. The 7 logical columns map to one row key and two composite columns.

Into the Object Land

So far in out example, data model had a flat record structure and we were able to define everything nicely with CQL.  What if we have multiple related entities in our use case and we wanted to denormalize the data model. Here is an example of an order object in an eCommerce application, represented with JSON.

{
	"custID" : '12736467',
	"date" : '2013-06-10',
	"orderID" : 19482065,
	"amount" : 216.28,
	"status" : 'picked',
	"notes" : 'on time',
	"items" : [
		{
			"sku" : 87482734,
			"quantity" : 4
		},
		{
			"sku" : 32851042,
			"quantity" : 2
		}
	],
	"customer" : {
		"name" : 'Joe Smith',
		"tel" : '231 456 7890',
		"email" : 'joe@yahoo.com'
	}
}

As you can see the the order object contains a child customer object and a list of order line item objects. The embedded customer object is statically defined and we could include it’s attributes  as columns in CQL table definition. However, the embedded list of order line item objects can not be represented in CQL. Because, CQL can handle only list and map of primitives.

Cassandra is a schema optional data base and you can have any of the following with respect to schema

  1. Completely static
  2. Completely dynamic
  3. Semi static

The schema for our example falls into the third category, because we can not define all the fields in the schema.

Linearizing Object Hieararchy

If we a traverse an arbitrarily nested object, the leaf nodes will contain field name, primitive value pair.  To linearize an object, after the traversing the object hierarchy, a list of leaf node objects is generated. Each such node is saved as a composite column. For nested fields, the column name is generated by concatenating all fields names in the path from the leaf node to the root, separated by period.

A CQL table along with primary key could be defined comprising of only the statically defined primitive fields of an object.  Here is the partial definition for the order object

CREATE TABLE orders (
    custID text,
    date text,
    orderID int,
    amount double,
    status text,
    notes text,
    PRIMARY KEY (custID,date,orderID)
);

The only advantage of creating the CQL schema, is that you could run CQL select query. However the query will only return the columns defined in the CQL schema, but not the dynamic attributes of the object.

Saving the Object

The object to be saved in passed to the agiato API as a SimpleDynaBean which is simplified implementation  of apache common  DynaBean interface. With SimpleDynaBean you could define an arbitrarily nested object. The API also gets passed the  primary key definition.   The different options for object representation are as follows.

  1. SimpleDynaBean object
  2. JSON string
  3. Java bean object

The following code snippet uses SimpleDynaBean  and saves the order object in the column family orders.

SimpleDynaBean obj = new  SimpleDynaBean();
//populate hierarchical object
……
AgiatoContext.initialize(“/home/pranab/Projects/bin/agiato/test.json”);
ColumnFamilyWriter writer = AgiatoContext.createWriter(“orders”);
PrimaryKey primKey = new PrimaryKey(“custID”, “date”, “orderID”);
writer.writeObject(obj, primKey, ConsistencyLevel.ONE);

The object  is traversed in a depth first way and a list leaf nodes is generated. The leaf nodes will include the columns defined in CQL as well as all the dynamic fields.

The  arguments  in the PrimaryKey  constructor are the primary key column names. Here only the first element constitutes the row key. If that was not the case  you had to call setRowKeyElementCount() to specify how many fields out of the primary key elements constitute  the row key.

If after the depth first traversal, the primary key fields  appear in the beginning of the list,  you could call another PrimaryKey constructor and pass the number of primary key elements.

The class ObjectSerDes hadles object traversal and mapping object fields to composite key columns. Here is the DataAccess class for Cassandra data access through thrift. From the list of leaf nodes, the row key and the clustering key are identified.  After traversing an object, the flattened list of NamedObject  is created as shown below

row key ObjectNode clustering key ObjectNodes other ObjectNodes

The first NamedObject value is converted to a row key. The clustering key NamedObject values makeup the prefix part of the composite key column name.  Then the columns are saved in a way that has the same effect as running a CQL insert query, except that it will store all the dynamic fields of the object, which were not part of the CQL definition.

Field values are serialized based on introspection of the field data. There is no check made against table meta data. For the fields defined in CQL schema, if data with wrong type is passed in the object,  be prepared to get surprising results from CQL query.

Update and Delete

For update and delete, it’s necessary to pass a partially formed object, whether a dynamic object or JSON string, to the API, containing the primary key fields and other fields that are going to be updated or deleted. For example, to update the status of an order, it’s necessary for the order object to have the primary key fields and the status field.

With  statically compiled java bean object,  it’s a difficult to define a  partially formed object. One way of doing it is to have all the fields as primitive wrapper objects, where null will imply absence of a field value.

Querying the Object

I don’t have the query method to return fully populated object implemented yet. I will be adding it soon. In the mean time we can use CQL and CLI. If we run a CQl query this is what we get. As expected it return only the columns it knows about from the table definition. It does not return any of the dynamic columns.

 custid   | date       | orderid  | amount | notes   | status
----------+------------+----------+--------+---------+--------
 12736467 | 2013-06-10 | 19482065 | 216.28 | in time | picked

However, when we run a CLI query it returns every field that is physically stored in Cassandra as below. All the values are in bytes in the output

RowKey: 3132373336343637
=> (name=2013-06-10:19482065:amount, value=406b08f5c28f5c29, timestamp=13..)
=> (name=2013-06-10:19482065:customer.email, value=6a6f65407961686f6f2e636f6d, timestamp=13..)
=> (name=2013-06-10:19482065:customer.name, value=4a6f6520536d697468, timestamp=13..)
=> (name=2013-06-10:19482065:customer.tel, value=323331203435362037383930, timestamp=13..)
=> (name=2013-06-10:19482065:items.[0].quantity, value=00000004, timestamp=13..)
=> (name=2013-06-10:19482065:items.[0].sku, value=000000000536e16e, timestamp=13..)
=> (name=2013-06-10:19482065:items.[1].quantity, value=00000002, timestamp=13..)
=> (name=2013-06-10:19482065:items.[1].sku, value=0000000001f54462, timestamp=13..)
=> (name=2013-06-10:19482065:notes, value=696e2074696d65, timestamp=13..)
=> (name=2013-06-10:19482065:status, value=7069636b6564, timestamp=13..)

Why not Just Serialize

You might ask why bother with mapping all the nested fields of an object to columns. Why not serialize all the dynamic fields of the object and store it as a JSON string and include that column as a text filed in CQL create table.  Granted, the column will be visible as a JSON string when you run CQl select query. If you are leaning in that direction, consider the following scenario

  • You want to create secondary index on some nested field e.g., customer.zipCode. The field needs an independent existence as a column
  • You want to treat such nested field  it as dimension in for aggregation and analytic queries involving nested objects

The other advantage of having separate column for each nested field, is that CLI query results are more readable.

Next Steps

My first goal is to support object based query. A prototype object will be passed to agiato. The fields defining the query criteria will need to have the values defined in this object. These fields will be a subset of the primary key fields.

The API after running the query will return one or more fully populated object. Other fields need not be passed as part of the query object. However, if some prototype value for other fields is passed, then the API will return typed value for the fields, otherwise it will return byte arrays for values.

My next goal is to support aggregate and analytic queries. The user will define cube schema consisting dimensions and measures pointing to object attributes, including nested attributes. While storing an object aggregate values will be computed and stored in separate column families. Storm will be part of the aggregation engine ecosystem.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Cassandra, Data Model, NOSQL and tagged , , , . Bookmark the permalink.

18 Responses to Storing Nested Objects in Cassandra with Composite Columns

  1. Pingback: Cassandra | Annotary

  2. Alex Petrov says:

    Do I understand it correctly, you’re basically usin Wide Rows and use custom notation for transforming them back to the objects, right?

    For example, in {foo: “Bar”, baz: {baz1: “baz2” } }, Bar will be stored with a “foo” key, and “baz2” – with “baz.baz1”.

    Don’t want to criticize anything, just first part of post is clear and elaborate and after that you give a partial CQL example and then something that has something to do with agiato rather than Cassandra itself in general. I’m familiar with concepts, but would you still mind to elaborate?

  3. Pingback: Reading Nested Objects Modeled with Composite Key from Cassandra | Mawazo

  4. Jeff says:

    I’m wondering about your translation of JSON array elements into composite column names. Rather than use the array index directly in the column name, wouldn’t there be a benefit in using some field within the array element as an “index”? In your example, sku would work well in that respect, since its value is unique within the array. That would allow the JSON array to be considered an unordered list rather than an ordered one. Of course, JSON itself expects its arrays to be ordered, but an unordered one allows elements to be inserted and deleted (mostly deleted, since being unordered there would be no particular value in inserting one in the middle). How would deleting an array element work in your implementation?

    • Jeff says:

      Actually, that was a poorly thought out question. Forget about unordered lists. Let me ask this instead: Because the JSON array index is part of the composite column’s name, if a client deletes an element from the middle of the array, it appears that every subsequent composite column must get an update, and the last one gets deleted. That seems very wasteful, and I’m wondering whether it could be made more efficiently by somehow making use of the fact that at least in your example, “sku” has a unique value among the elements in the array?

      • Pranab says:

        Jeff

        For any collection, the collection key becomes part of the composite column name. For array, the key is the index. For sku to be part of the collection key and hence part of column name, instead of an index, you have to model items as a map instead of an array in the JSON.

        If any collection element is deleted, if you have delete the corresponding columns, before writing the updated object. Otherwise when reading an object based on complete or partial primary key, it won’t be properly deserialized based on the column values

      • Jeff says:

        Pranab,

        Thanks for explaining. I’m not sure I understand your second paragraph though. I think you’re saying that if I remove an element from the middle of the JSON array, I should delete ALL the columns from the Cassandra row – or at least from the deleted element onward – before updating the row, is that right? Sounds painful; I should probably just do my best to avoid ever deleting an element from the middle of a JSON array!

  5. Padmika says:

    Trying to do the same in C#, having a tough time though

  6. Pranab says:

    Jeff
    What I meant was if the array contains a primitive type, then removing one column will suffice. However if the array element contains a complex type, you have to remove all columns corresponding to the leaf nodes of a sub tree rooted at that index.

    No, you don’t have to delete all subsequent index. When read the object back, you will see a gap in the array for the index deleted. It’s a small fix in the code to consolidate the array elements in case there are deletes in between in the deserializer. I will take care of that.

  7. Fabio Yamada says:

    Just an updated info: Cassandra 2.1 was released and it provides Collections and indexes within collections. Also they included the ability to declare user defined types for better organization (must use CQL3 though)

  8. deepak says:

    i have data which is heirarchical in nature. some thing like customer->region->plant->machine
    and each machine can have thousands of sensors . So does cassandra is good choice for this kind of data. we need to store the heirarchical and time series. Is HBase a good choice or Cassandra a good choice considering OLAP is done mostly on the data.

    • Pranab says:

      You will be better off with Cassandra. Cassandra does not provide any aggregation function. You can do aggregation as the data is being ingested into Cassandra. You could also read data back from Cassandra and do aggregation on the client side. Cassandra 3.0 will provide aggregation functions. I am not sure about the scope.

  9. deepak says:

    Thanks Pranab for your insight. So are you suggesting it is better if we don’t use Cassandra and use some relational data base for hierarchical data . what can we use for time series data , H Base or Cassandra ?

    • Pranab says:

      Relational database is for relational data. You still need NOSQL databases. Most NOSQL databases provided limited aggregation functions. MongoDB may have better aggregation engine compared to others. Take a look.

  10. Pingback: Storing nested JSON with Cassandra - DexPage

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s