Reading Nested Objects Modeled with Composite Key from Cassandra


My earlier post was about storing nested objects modeled with a composite key in Cassandra. We also need to be able to read the data back as objects, and that is the topic of this post, which covers the rest of the object story. This work is part of my open source project agiato.

Mapping Object to Composite Columns

As described in the earlier post, here are some of the salient features of the mapping between an object and a Cassandra column family. The mapping logic does not use column family metadata. Instead, it relies on introspection of the object passed.

  • The object representations supported are JSON and Apache DynaBean.
  • Each leaf node of a nested object that is a primitive data type is mapped to a column.
  • An intermediate node of a nested object could be another object, a list or a map.
  • A non primary key column name is derived from the path in the object tree from the root to the leaf node in question.
  • Certain attributes of the object are designated as row key and cluster key components.
  • A minimal column family schema definition requires only the primary key components. The rest can be left as dynamic.
  • While writing, all primary key fields should be included in the object. Only the non primary key fields that need to be written should be included.
  • While reading, the prototype object should include the complete or a partial primary key. Any additional field value, if provided, is used only for gathering type information, so that strongly typed field values can be returned in the query result.
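The path-to-column-name rule above can be sketched as follows. This is a minimal illustration, not agiato's actual code: the class name PathFlattener and the naming scheme (map keys joined with '.', list positions written as [i]) are assumptions made for the example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a nested object into (column name -> value) pairs.
final class PathFlattener {

    // Assumed naming scheme: map keys joined with '.', list positions as [i].
    static Map<String, Object> flatten(Map<String, Object> root) {
        Map<String, Object> columns = new LinkedHashMap<>();
        walk("", root, columns);
        return columns;
    }

    private static void walk(String path, Object node, Map<String, Object> columns) {
        if (node instanceof Map) {
            // Intermediate node: descend into each attribute
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                String child = path.isEmpty() ? e.getKey().toString() : path + "." + e.getKey();
                walk(child, e.getValue(), columns);
            }
        } else if (node instanceof List) {
            // Intermediate node: descend into each list element
            List<?> list = (List<?>) node;
            for (int i = 0; i < list.size(); i++) {
                walk(path + "[" + i + "]", list.get(i), columns);
            }
        } else {
            // Leaf node: the accumulated path becomes the column name
            columns.put(path, node);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("sku", 87482734);
        item.put("quantity", 4);
        Map<String, Object> order = new LinkedHashMap<>();
        order.put("amount", 216.28);
        order.put("status", "picked");
        order.put("items", Collections.singletonList(item));
        System.out.println(flatten(order));
    }
}
```

With this assumed scheme, the non primary key fields of an order become columns named amount, status, items[0].sku and items[0].quantity.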

If the JSON object representation is used, on the client side it could represent a statically defined Java bean object or a nested map Map<String, Object>. Essentially, Java bean type objects are supported indirectly through JSON.

While reading an object, we follow the reverse process. From each column name, we reconstruct the path in the object tree from the root to the leaf node in question and create every node on that path that does not already exist. Repeating this for all the columns, i.e. all the leaf nodes, results in a fully formed nested object.
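The reverse process can be sketched in the same spirit. Again, this is only an illustration under assumptions: PathUnflattener is a hypothetical name, column names are assumed to look like items[0].sku, and lists nested directly inside lists are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rebuilds a nested object from (column name -> value) pairs.
final class PathUnflattener {

    static Map<String, Object> unflatten(Map<String, Object> columns) {
        Map<String, Object> root = new LinkedHashMap<>();
        for (Map.Entry<String, Object> col : columns.entrySet()) {
            insert(root, col.getKey(), col.getValue());
        }
        return root;
    }

    @SuppressWarnings("unchecked")
    private static void insert(Map<String, Object> root, String path, Object value) {
        String[] segments = path.split("\\.");
        Map<String, Object> current = root;
        for (int s = 0; s < segments.length; s++) {
            boolean last = (s == segments.length - 1);
            String seg = segments[s];
            int bracket = seg.indexOf('[');
            if (bracket < 0) {
                // Plain attribute: set the leaf, or create the intermediate map if needed
                if (last) {
                    current.put(seg, value);
                } else {
                    current = (Map<String, Object>) current.computeIfAbsent(
                            seg, k -> new LinkedHashMap<String, Object>());
                }
            } else {
                // List attribute such as items[0]: create the list and pad it up to the index
                String name = seg.substring(0, bracket);
                int index = Integer.parseInt(seg.substring(bracket + 1, seg.length() - 1));
                List<Object> list = (List<Object>) current.computeIfAbsent(
                        name, k -> new ArrayList<Object>());
                while (list.size() <= index) {
                    list.add(null);
                }
                if (last) {
                    list.set(index, value);
                } else {
                    Object element = list.get(index);
                    if (element == null) {
                        element = new LinkedHashMap<String, Object>();
                        list.set(index, element);
                    }
                    current = (Map<String, Object>) element;
                }
            }
        }
    }
}
```

Each column touches only the nodes on its own path, so the order in which columns arrive does not matter.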

Query Criteria

As I mentioned in my earlier post, two kinds of dynamic object representation are supported: Apache DynaBean and JSON. A prototype object is passed to the query API and is used to construct the query condition.

The query condition is based on the primary key only. You cannot use non primary key attributes of the object to specify the query condition. In Cassandra, the primary key consists of the row key and the cluster key.

The number of objects returned from the query depends on whether the primary key is fully or partially specified in the prototype object. There are two scenarios:

  • If the primary key is fully specified, only one object is returned.
  • If only the row key is specified, or the row key along with a partial cluster key is specified, then multiple objects may be returned.
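To make the two scenarios concrete, here is a small sketch of how a leading prefix of the primary key components could be turned into an equality condition. It is written in CQL-like form purely for illustration; it is not how agiato builds its queries internally, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: derives a query condition from a primary key prefix.
final class KeyPrefixCondition {

    // Equality condition on the first numSet key components, in key order.
    static String whereClause(List<String> keyComponents, int numSet) {
        checkPrefix(keyComponents, numSet);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < numSet; i++) {
            if (i > 0) {
                sb.append(" AND ");
            }
            sb.append(keyComponents.get(i)).append(" = ?");
        }
        return sb.toString();
    }

    // Values for the condition, taken from the prototype object.
    static List<Object> boundValues(List<String> keyComponents, int numSet,
                                    Map<String, Object> prototype) {
        checkPrefix(keyComponents, numSet);
        List<Object> values = new ArrayList<>();
        for (String name : keyComponents.subList(0, numSet)) {
            Object v = prototype.get(name);
            if (v == null) {
                throw new IllegalArgumentException("prototype is missing key component " + name);
            }
            values.add(v);
        }
        return values;
    }

    // Only a fully specified primary key identifies exactly one object.
    static boolean identifiesSingleObject(List<String> keyComponents, int numSet) {
        return numSet == keyComponents.size();
    }

    private static void checkPrefix(List<String> keyComponents, int numSet) {
        if (numSet < 1 || numSet > keyComponents.size()) {
            throw new IllegalArgumentException("numSet must be between 1 and " + keyComponents.size());
        }
    }

    public static void main(String[] args) {
        List<String> key = Arrays.asList("custID", "date", "orderID");
        Map<String, Object> proto = new LinkedHashMap<>();
        proto.put("custID", "12736467");
        proto.put("date", "2013-06-10");
        proto.put("orderID", 19482065);
        System.out.println(whereClause(key, 2));            // custID = ? AND date = ?
        System.out.println(boundValues(key, 2, proto));
        System.out.println(identifiesSingleObject(key, 2)); // false
    }
}
```

The prefix always starts at the row key, so a cluster key component can only be used if all the components before it are also set.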

Example

In our example, we pass a partially formed order object in JSON representation as the prototype object. We have specified the row key and cluster key as follows. This query will return multiple orders for a given customer on a given day. As we will see later, the last element of the cluster key, orderID, although defined, is not actually used in the query condition.

{
	"custID" : "12736467",
	"date" : "2013-06-10",
	"orderID" : 19482065,
	"amount" : 216.28,
	"status" : "picked",
	"items" : [
		{
			"sku" : 87482734,
			"quantity" : 4
		}
	]
}

Although (custID, date, orderID) comprise the primary key, we have specified some other fields in the JSON fragment. This has to do with returning strongly typed field values in the returned objects. We will get to this later. The code snippet to run the query is as follows. We create a reader object and call the query API on the reader object.

AgiatoContext.initialize("/home/pranab/Projects/bin/agiato/test.json");
ColumnFamilyReader reader = AgiatoContext.createReader("orders");
String proto = "JSON string described above…";
// Only the first 2 of the 3 primary key components (custID, date) are used in the query condition
PrimaryKey primKey = new PrimaryKey("custID", "date", "orderID").withNumPrimKeyComponentsSet(2);
// consLevel is a ConsistencyLevel value chosen by the caller
List<Object> objs = reader.retrieveObject(proto, primKey, consLevel);

The variable proto contains the JSON string listed earlier. The number of primary key components used in the query condition is set by the call withNumPrimKeyComponentsSet(2). In this case, the query will return all order objects matching custID and date. The last call returns a list of objects. Here it will be a list of Strings, each a JSON string representing one order object returned from the query.

The orders column family used here has been denormalized all the way. Modeling the same data using an RDBMS would have required 4 tables: order, orderItem, product and customer. However, we are paying the price of denormalization, i.e. duplicate data.

To be Typed or Not

How do we return typed values for all the fields in the returned objects? Even if we used column family metadata, it may not solve the problem completely, because a column family may have a dynamic schema and metadata may not have been explicitly defined for all columns.

We take a different approach by introspecting the object passed. To get typed primitive field values back, the user should provide prototype typed field values in the query criteria object. If you look at the JSON definition, you will find that the amount and status fields have actual values. These values are used to infer the types of the field values in the returned object. As a result, those values will be returned as double and String respectively.

We have also provided sku and quantity values, and we expect those attribute values to be returned with proper types. Since items is a list, it suffices to provide only one element of the list. If the data type of any primitive field cannot be inferred, it is returned as a raw byte array in the returned objects.
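The idea of using a prototype value to pick the return type can be sketched as below. The byte layout is an assumption made for the example (UTF-8 for strings, big-endian fixed width for numbers, as java.nio.ByteBuffer produces by default), and PrototypeTypedDecoder is a hypothetical name, not an agiato class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a raw column value using the type of a prototype value.
final class PrototypeTypedDecoder {

    // Assumed encoding: UTF-8 strings, big-endian fixed-width numbers.
    static Object decode(byte[] raw, Object prototypeValue) {
        if (prototypeValue instanceof String) {
            return new String(raw, StandardCharsets.UTF_8);
        }
        if (prototypeValue instanceof Integer) {
            return ByteBuffer.wrap(raw).getInt();
        }
        if (prototypeValue instanceof Long) {
            return ByteBuffer.wrap(raw).getLong();
        }
        if (prototypeValue instanceof Double) {
            return ByteBuffer.wrap(raw).getDouble();
        }
        // No usable prototype value: the type cannot be inferred, so hand back the raw bytes
        return raw;
    }

    public static void main(String[] args) {
        byte[] amount = ByteBuffer.allocate(8).putDouble(216.28).array();
        byte[] status = "picked".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(amount, 0.0));       // typed as Double
        System.out.println(decode(status, "any"));     // typed as String
        System.out.println(decode(status, null) instanceof byte[]); // untyped: raw bytes
    }
}
```

Only the type of the prototype value matters here, not the value itself, which is why any sample amount or status in the prototype object is enough.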

Caveat with JSON

If JSON is being used, to get the returned objects as JSON, prototype values for all the attributes need to be passed in the prototype object. In other words, it should be possible to infer the types of all the primitive field values in the data returned from Cassandra. Otherwise, we cannot generate a serialized JSON string for the returned objects.

With JSON, if not all primitive value types can be inferred, the query API will return a nested map object Map<String,Object>, and some of the primitive field values will be raw byte arrays.

Secondary Index

Although secondary indexes should be avoided in favor of appropriate primary key design, sometimes there is no other option. In the future I will add support for a secondary index on any nested field, e.g. sku in the order object. It will be very similar to indexing in MongoDB.

Analytics

I am considering adding an analytics feature to agiato, by aggregating data at the point of ingestion. The user will be able to define dimensions and facts at any nested attribute level. Here is an example for analytics on product quantity sold.

  1. dimension : date (with hierarchy month, year)
  2. dimension : items[].sku (with hierarchy category, dept)
  3. fact : items[].quantity

Both the dimensions here have a hierarchy. We could roll up along a dimension hierarchy or remove a dimension altogether. Adding the analytics feature would require defining cube metadata for dimensions and facts.
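Since this feature does not exist yet, the following is only a rough sketch of what aggregating at ingestion and rolling up could look like for the example above. Everything in it is hypothetical: the QuantityCube class, the cell key format, and the assumption that dates are yyyy-MM-dd strings. The sku hierarchy (category, dept) is left out because it would need a lookup that the post does not describe.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical cube: fact = quantity, dimensions = month (derived from date) and sku.
final class QuantityCube {

    // One aggregated cell per (month, sku) pair, keyed as "yyyy-MM|sku"
    private final Map<String, Long> cells = new TreeMap<>();

    // Called once per order item at ingestion time.
    void ingest(String date, int sku, int quantity) {
        String month = date.substring(0, 7); // roll date up to month: yyyy-MM
        cells.merge(month + "|" + sku, (long) quantity, Long::sum);
    }

    // Aggregated quantity for one month and one sku.
    long quantity(String month, int sku) {
        return cells.getOrDefault(month + "|" + sku, 0L);
    }

    // Removes the sku dimension altogether: total quantity per month.
    Map<String, Long> byMonth() {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> e : cells.entrySet()) {
            String month = e.getKey().substring(0, e.getKey().indexOf('|'));
            totals.merge(month, e.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        QuantityCube cube = new QuantityCube();
        cube.ingest("2013-06-10", 87482734, 4);
        cube.ingest("2013-06-11", 87482734, 3);
        System.out.println(cube.quantity("2013-06", 87482734));
        System.out.println(cube.byMonth());
    }
}
```

Rolling up further, from month to year, would follow the same pattern with a shorter date prefix.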

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.

2 Responses to Reading Nested Objects Modeled with Composite Key from Cassandra

  1. This is a good post. I have a question: how can we model a star schema in Cassandra? This is the kind of problem I am working on. Any help will be greatly appreciated.
