Cassandra and Hadoop


I have always been interested in mining patterns and knowledge from data. While working on a data mining project some time ago, I ran into a roadblock when dealing with a very large data set. Most data mining algorithms are monolithic and sequential in nature. They expect the whole data set to be loaded into memory before processing. If the memory is not large enough for the data, you are simply out of luck.

The memory and scalability issues gradually steered me in the direction of Hadoop and map reduce. I started exporting data out of the MySQL database and into HDFS, which worked fine. However, I could not use the data mining product any more. I had to recast the data mining algorithms in a parallel, distributed processing setting and implement them as map reduce tasks.
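
To make that concrete, here is a minimal sketch of such a map reduce task, assuming the exported data lands in HDFS as comma separated lines. The class names, field layout and paths are made up for illustration, and this is only the simplest counting building block, not the actual mining code.

// Bare-bones Hadoop job skeleton (new mapreduce API). ItemCountJob, the
// input layout and the "item id in field 1" assumption are illustrative only.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ItemCountJob {
    // Emits (item, 1) for every comma separated record exported from MySQL.
    public static class ItemMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            ctx.write(new Text(fields[1]), ONE);   // assume field 1 holds the item id
        }
    }

    // Sums the counts emitted for each item.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "item count");
        job.setJarByClass(ItemCountJob.class);
        job.setMapperClass(ItemMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}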

Everything was going fine, until the next wrinkle came up. The company I was doing the work for started experiencing massive growth in data volume, which was pushing MySQL to its limit. They are seriously evaluating Cassandra as a solution. My initial reaction to the migration plan was to export data out of Cassandra just as I had with MySQL, although there are some interesting issues with doing range queries on timestamped data using an order preserving partitioner in Cassandra. I will take that up in a future post.

Then I found out that the Cassandra 0.6 release provides native Hadoop support with InputFormat and InputSplit implementations. It provides data locality by having a map task run on the node where the input split resides in the Cassandra cluster. You can find more information here:

http://allthingshadoop.com/2010/04/24/running-hadoop-mapreduce-with-cassandra-nosql/
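
For a flavor of the integration, here is a sketch of the job wiring, modeled on the word_count example that ships with Cassandra 0.6. The keyspace, column family and column names are placeholders, and the exact ConfigHelper calls and mapper input types may differ in other releases.

// Sketch of wiring a Hadoop job to a Cassandra 0.6 column family, based on
// the word_count contrib example. "Keyspace1"/"Standard1"/"text" are placeholders.
import java.util.Arrays;
import java.util.SortedMap;
import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CassandraSourcedJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(CassandraSourcedJob.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Tell the input format which keyspace/column family to scan and
        // which columns of each row to hand to the mapper.
        ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
        SlicePredicate predicate =
            new SlicePredicate().setColumn_names(Arrays.asList("text".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
        // mapper/reducer classes, output types, etc. go here as in any Hadoop job
    }

    // In 0.6 the mapper sees the row key as a String and the selected
    // columns as a SortedMap of column name to IColumn.
    public static class RowMapper
            extends Mapper<String, SortedMap<byte[], IColumn>, Text, IntWritable> {
        // map() would iterate over the columns of each row here
    }
}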

While looking at the details of the Cassandra and Hadoop integration, I found out that Cassandra will essentially pass all the data in a column family to Hadoop. A column family is like an RDBMS table, except that you can have any number of columns and the rows don't all need to have the same set of columns. A Cassandra column is essentially a name-value pair.
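
If it helps to picture it, a column family behaves roughly like a sparse two level map keyed by row key and then by column name. The snippet below is just a mental model in plain Java with made-up keys and columns, not how Cassandra actually stores the data.

// Mental model only: row key -> (column name -> column value), with each
// row free to carry a different set of columns.
import java.util.HashMap;
import java.util.Map;

public class ColumnFamilyModel {
    public static void main(String[] args) {
        Map<String, Map<String, String>> events = new HashMap<String, Map<String, String>>();

        Map<String, String> row1 = new HashMap<String, String>();
        row1.put("userId", "u42");
        row1.put("action", "click");
        events.put("2010-06-01T10:15:00:u42", row1);

        Map<String, String> row2 = new HashMap<String, String>();
        row2.put("userId", "u7");          // this row has a different column set
        row2.put("referrer", "search");
        events.put("2010-06-01T10:16:30:u7", row2);

        System.out.println(events);
    }
}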

However, in my case I only want to pass a subset of a column family's data to Hadoop. I am dealing with timestamped data, and I may want to send only the data for the last 24 hours to Hadoop. I need to be able to specify a key range, just like a Cassandra key range query. There does not seem to be any way of doing it.
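
For reference, this is roughly what the key range query I have in mind looks like against the 0.6 Thrift API. The keyspace, column family and key format are placeholders, and it assumes keys start with a timestamp and the cluster uses an order preserving partitioner, so that a key range is effectively a time range.

// Sketch of a Cassandra 0.6 Thrift key range query pulling only the rows
// whose (timestamp based) keys fall in the last 24 hours. Names are placeholders.
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class Last24HoursScan {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // Keys are assumed to begin with an ISO timestamp; with an order
        // preserving partitioner the key range below covers one day of data.
        KeyRange range = new KeyRange(1000)
            .setStart_key("2010-05-31T12:00:00")
            .setEnd_key("2010-06-01T12:00:00");

        // Empty start/finish means all columns of each row, up to 100 per row.
        SlicePredicate allColumns = new SlicePredicate()
            .setSlice_range(new SliceRange(new byte[0], new byte[0], false, 100));

        List<KeySlice> rows = client.get_range_slices(
            "Keyspace1", new ColumnParent("EventLog"), allColumns, range, ConsistencyLevel.ONE);

        System.out.println("rows in window: " + rows.size());
        socket.close();
    }
}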

I guess I could do a key range query on the original column family and populate another column family to be used by Hadoop. But that is no better than exporting the result of the range query to HDFS. So I am back where I started, i.e., exporting data out of the database (MySQL or Cassandra) into HDFS.
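
That fallback amounts to a loop like the following: run the range query and write the rows out as plain text for an ordinary FileInputFormat based job. The HDFS path and record layout below are made up for illustration.

// Sketch of the fallback export: dump the rows returned by the range query
// into HDFS as plain text. Path and record layout are illustrative only.
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RangeExport {
    public static void export(Iterable<KeySlice> rows) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/data/events/last24h.txt");
        BufferedWriter writer =
            new BufferedWriter(new OutputStreamWriter(fs.create(out, true)));
        for (KeySlice row : rows) {
            // One line per row: key followed by name=value pairs.
            StringBuilder line = new StringBuilder(row.getKey());
            for (ColumnOrSuperColumn cosc : row.getColumns()) {
                Column col = cosc.getColumn();
                line.append(',').append(new String(col.getName()))
                    .append('=').append(new String(col.getValue()));
            }
            writer.write(line.toString());
            writer.newLine();
        }
        writer.close();
    }
}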

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed-form solution.