Roll out your own BPM with some help from Cassandra (Part 2)


In Part 1 we talked about BPM architecture in general. My focus in this post is a data model for our BPM using Cassandra. You might ask: why not stick with MySQL, which we all love?

Why Cassandra

We expect our order processing system built with BPM to have the following desirable characteristics.

  • When the user is searching and browsing for products, we want a snappy response, so as to minimize the time the user spends on product selection.
  • When a customer places an item in his or her shopping cart, we don’t want it to take too long, leaving the customer wondering about the outcome of the action.
  • When a customer finally checks out and places the order, we don’t want the dreaded message “Server error : please try again”. Chances are the user will not try again and will just leave the site. Worse yet, while leaving the site the user might wonder whether the credit card was charged even though the order was not processed.

The first requirement implies a database with low read latency. The second requirement boils down to having a data store that has low write latency even in the face of a traffic spike. The last requirement calls for a system that is almost always available.

These requirements can be met with a traditional RDBMS like MySQL, but not without some extra work. You can improve the read latency of MySQL with master-slave replication. For write latency, you may use sharding or partitioning. Finally, you can improve availability using replication with failover enabled by a load balancer. Adding all these capabilities requires a lot of extra work.

Lucky for us, Cassandra comes with these capabilities out of the box. Cassandra is a highly available, eventually consistent and highly scalable database. There is plenty of information on Cassandra. Here is a quick bullet list of key features:

  • Fault tolerant : Data is automatically replicated to multiple nodes. The user chooses the replication factor. Replication also provides low read latency.
  • Low write latency : Achieved through data partitioning. The default partitioning scheme is based on consistent hashing.
  • Availability and consistency : The consistency level is user selected for read and write operations. For example, the user could read from one node (highest availability), from all replica nodes (highest consistency) or anywhere in between. A similar choice exists for writes.
  • Flexible schema : A column family is essentially a multi-dimensional map with up to two levels of nesting. Data organization is based more on usage patterns than on structural similarity.
  • No single point of failure : All nodes in the cluster are treated equally and have the same role.
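The trade-off between availability and consistency in the list above comes down to simple arithmetic. Here is a minimal sketch, not Cassandra code: the function names are my own, and it only encodes the general rule that a read sees the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas (hypothetical helper)."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int,
                           replicas_read: int,
                           replicas_written: int) -> bool:
    """True when every read set must overlap every write set in at least one replica."""
    return replicas_read + replicas_written > replication_factor


# With replication factor 3:
print(read_sees_latest_write(3, 1, 1))                  # read one, write one
print(read_sees_latest_write(3, quorum(3), quorum(3)))  # quorum reads and writes
print(read_sees_latest_write(3, 1, 3))                  # read one, write all
```

Reading and writing a single replica gives the lowest latency but no overlap guarantee; quorum reads with quorum writes is the usual middle ground.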

BPM Data Model

We will be using two column families, both with super columns. One will store process meta data and the other will store process run time data.

ProcessMeta  Column Family

In the first column family, we will store the following different entities. This requires us to get out of the relational mind set. In this column family, rows have different structures, with different rows having different columns. You would have to be crazy to try doing that with an RDBMS.

  • Process meta data
  • Process state transitions
  • Process states
  • Process events

The column family will have one row per process type, holding its meta data and state transition data. One row will store all the process states. Yet another row will store all the process events.

A process meta data row has one super column for the process type name, description etc. The rest of the super columns are for the state transitions, one for each state transition. The structure is as below. There is one such row for each process type defined. The process type name could serve as the row key for these rows.

Super Col Name: meta
  Col Name            Col Value
  name                process type name
  description         process type description
  createdBy           process type creator

Super Col Name: stateID1
  Col Name            Col Value
  event               eventId1
  nextState           stateId2
  invocationLang      java
  invocationClass     mycom.order.ProcessOrder
  invocationmethod    handleInventory

Super Col Name: stateID2
  Col Name            Col Value
  event               eventId1
  nextState           stateId2
  invocationLang      java
  invocationClass     mycom.order.ProcessOrder
  invocationmethod    handleInventory
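To make the row layout above concrete, here is a minimal in-memory sketch that models one ProcessMeta row as a nested map and looks up the transition for a given state and event. This is plain Python, not a Cassandra client call; the row key "orderProcess", the sample meta values and the function name find_transition are my own assumptions for illustration.

```python
from typing import Optional

# One ProcessMeta row: super column name -> {column name -> column value}.
# The row key (here the hypothetical process type "orderProcess") is kept outside.
order_process_row = {
    "meta": {
        "name": "orderProcess",
        "description": "order processing",
        "createdBy": "admin",
    },
    "stateID1": {
        "event": "eventId1",
        "nextState": "stateId2",
        "invocationLang": "java",
        "invocationClass": "mycom.order.ProcessOrder",
        "invocationmethod": "handleInventory",
    },
}


def find_transition(row: dict, current_state: str, event: str) -> Optional[dict]:
    """Return the transition super column for (state, event), or None if there is none."""
    if current_state == "meta":
        return None  # the "meta" super column describes the process type, not a state
    transition = row.get(current_state)
    if transition is None or transition.get("event") != event:
        return None
    return transition


transition = find_transition(order_process_row, "stateID1", "eventId1")
print(transition["nextState"], transition["invocationClass"])
```

The BPM engine would then use the invocation columns to call the handler and move the process to the next state.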

The structure of the row containing state definitions is as below. There is only one such row. The row key could be “states”.

Super Col Name: stateID1
  Col Name       Col Value
  name           state1 name
  description    state1 description

Super Col Name: stateID2
  Col Name       Col Value
  name           state2 name
  description    state2 description

The row containing event data has the following structure. The row key could be “events”. There is only one such row. You may have observed that a super column is like a row within a row.

Super Col Name: eventID1
  Col Name       Col Value
  name           event1 name
  description    event1 description

Super Col Name: eventID2
  Col Name       Col Value
  name           event2 name
  description    event2 description

The number of rows in this column family will be the number of process types defined plus two.
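The row count is easy to check with a small in-memory sketch of the whole column family. Again, this is just nested Python maps, not Cassandra; the process type names and the helper build_process_meta are hypothetical, and I am assuming “states” and “events” are reserved as row keys.

```python
RESERVED_ROW_KEYS = ("states", "events")


def build_process_meta(process_type_rows: dict, states: dict, events: dict) -> dict:
    """Assemble the column family: one row per process type plus the two shared rows."""
    for key in RESERVED_ROW_KEYS:
        if key in process_type_rows:
            raise ValueError(f"process type name clashes with reserved row key: {key}")
    column_family = dict(process_type_rows)
    column_family["states"] = states
    column_family["events"] = events
    return column_family


# Two hypothetical process types, each with only its "meta" super column for brevity.
process_type_rows = {
    "orderProcess": {"meta": {"name": "orderProcess"}},
    "returnProcess": {"meta": {"name": "returnProcess"}},
}
states = {
    "stateID1": {"name": "state1 name", "description": "state1 description"},
    "stateID2": {"name": "state2 name", "description": "state2 description"},
}
events = {
    "eventID1": {"name": "event1 name", "description": "event1 description"},
    "eventID2": {"name": "event2 name", "description": "event2 description"},
}

process_meta = build_process_meta(process_type_rows, states, events)
print(len(process_meta))  # number of process types plus two
```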

Process Column Family

The process column family stores run time data for a process. This column family is also based on super columns. There is one row per process instance.

Each row has two super columns: one for the current state related data of the process and the other for the context data of the process. The structure of a row in JSONish notation is as follows.


{
    key : processId                      // row key
    value : [
        { col : state                    // super column
          value : {
              processType : ---          // column
              currentState : ---
              prevState : ---
              lastEvent : ---
              correlatedEntityType : ---
              correlatedEntityID : ---
          }
        },
        { col : context                  // super column
          value : {
              paramName1 : paramVal1     // column
              paramName2 : paramVal2
              paramName3 : paramVal3
              ----------- : ---------
          }
        }
    ]
}
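Here is a minimal in-memory sketch of what a state transition does to such a row. It is not Cassandra client code: the helper names, the sample IDs and the context parameter are my own assumptions, and in a real system the update would be a write to the row keyed by the process ID.

```python
def new_process_row(process_type, initial_state, entity_type, entity_id):
    """Build one Process row with its two super columns, "state" and "context"."""
    return {
        "state": {
            "processType": process_type,
            "currentState": initial_state,
            "prevState": None,
            "lastEvent": None,
            "correlatedEntityType": entity_type,
            "correlatedEntityID": entity_id,
        },
        "context": {},
    }


def apply_transition(row, event, next_state, context_updates=None):
    """Record a state transition in the "state" super column and merge context data."""
    state = row["state"]
    state["prevState"] = state["currentState"]
    state["currentState"] = next_state
    state["lastEvent"] = event
    row["context"].update(context_updates or {})


# The column family is a map from row key (process ID) to row.
process_cf = {
    "process-1": new_process_row("orderProcess", "stateID1", "order", "order-42"),
}
apply_transition(process_cf["process-1"], "eventId1", "stateID2", {"warehouse": "west"})
print(process_cf["process-1"]["state"])
```

Note that every transition touches only the one row for that process instance, which is what makes the write pattern a good fit for partitioning by row key.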

This column family is expected to hold a lot of data. For an order processing BPM, there will be one row for every order in the system. But the data will be evenly distributed across the Cassandra cluster because of the hash partitioning of the row key, which in this case is the process ID. This column family will be very write intensive. Every time a state transition occurs, there will be an update to this column family.

But we need not worry. The built-in hash partitioning of Cassandra makes writes fast. As data grows and we need to scale, we just add more nodes to the Cassandra cluster. The good news is that adding a new node does not require us to shut down Cassandra.
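Why adding a node is cheap can be seen with a toy hash ring. This is a simplified sketch of consistent hashing in general, not Cassandra's actual partitioner: there is one token per node, no replication, MD5 is used only as a convenient stable hash, and the class and node names are made up.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring with one token per node."""

    def __init__(self, nodes):
        self._entries = []  # sorted list of (token, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        bisect.insort(self._entries, (token(node), node))

    def node_for(self, row_key: str) -> str:
        # The owner is the first node at or after the key's token, wrapping around.
        index = bisect.bisect_left(self._entries, (token(row_key), ""))
        if index == len(self._entries):
            index = 0
        return self._entries[index][1]


ring = Ring(["node-a", "node-b", "node-c"])
process_ids = [f"process-{i}" for i in range(1000)]
owners_before = {pid: ring.node_for(pid) for pid in process_ids}

ring.add_node("node-d")
owners_after = {pid: ring.node_for(pid) for pid in process_ids}

moved = [pid for pid in process_ids if owners_before[pid] != owners_after[pid]]
print(len(moved), "of", len(process_ids), "rows changed owner")
```

Every row that changes owner moves to the new node; rows owned by the other nodes stay where they are, so the cluster can keep serving traffic while the new node takes over its share.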

Are We There Yet

Cassandra allows only a primary index, which is the row key. So, although we can query the Process column family based on the process ID, what if we need to query based on the correlated entity ID?

As I mentioned in Part 1 of this post, we often have to look up the process ID based on the correlated entity ID, e.g., the order ID.

The only option appears to be reading all the rows in the Process column family and matching on the correlated entity ID. But when there are hundreds of thousands of rows, this is not a realistic solution.
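To see why, here is a minimal in-memory sketch of that brute force lookup. The column family is just a Python map from process ID to row, and the function name and sample IDs are hypothetical; against a real cluster each row would also have to be fetched over the network.

```python
def find_process_ids_by_entity(process_cf: dict, entity_type: str, entity_id: str) -> list:
    """Scan every row and collect the process IDs whose correlated entity matches."""
    matches = []
    for process_id, row in process_cf.items():
        state = row["state"]
        if (state["correlatedEntityType"] == entity_type
                and state["correlatedEntityID"] == entity_id):
            matches.append(process_id)
    return matches


# Build 1000 sample rows; only the columns needed for the lookup are filled in.
process_cf = {
    f"process-{i}": {
        "state": {
            "correlatedEntityType": "order",
            "correlatedEntityID": f"order-{i}",
        },
        "context": {},
    }
    for i in range(1000)
}

print(find_process_ids_by_entity(process_cf, "order", "order-742"))
```

The cost grows linearly with the number of process instances, even though we are looking for a single row.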

We need a secondary index. Unfortunately, we are on our own: at the time of writing, there is no native support for secondary indexes in Cassandra. I will explore this topic in another post.

References

You might like these posts if you want to dig more into the Cassandra data model:

  1. WTF is a SuperColumn? An introduction to Cassandra data model
  2. Up and running with Cassandra

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.