I have always been interested in mining patterns and knowledge from data. While working on a data mining project some time ago, I ran into a roadblock when dealing with a very large data set. Most data mining algorithms are monolithic and sequential in nature: they expect the whole data set to be loaded in memory before processing. If the memory is not large enough for the data, you are simply out of luck.
The memory and scalability issues gradually steered me in the direction of Hadoop and MapReduce. I started exporting data out of the MySQL database and into HDFS, which worked fine. However, I could no longer use the data mining product; I had to recast the data mining algorithms in a parallel, distributed processing setting and implement them as MapReduce jobs.
Everything was going fine until the next wrinkle came up. The company I was doing the work for started experiencing massive growth in data volume, and it was pushing MySQL to its limit. They are seriously evaluating Cassandra as a solution. My initial reaction to the migration plan was to export data out of Cassandra just as I had with MySQL, although there are some interesting issues with doing range queries on timestamped data using the order-preserving partitioner in Cassandra. I will take that up in a future post.
Then I found out that the Cassandra 0.6 release provides native Hadoop support with InputFormat and InputSplit implementations. It provides data locality by having a map task run on the node where the input split resides in the Cassandra cluster. You can find more information here.
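To make this concrete, here is a rough sketch of a Hadoop job driver using that input format, modeled loosely on the word_count example that ships with Cassandra 0.6. The keyspace, column family, and column names are made up for illustration, and the exact ConfigHelper method names changed across later Cassandra releases.

```java
import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EventCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "event-count");
        job.setJarByClass(EventCountJob.class);

        // Read rows directly from Cassandra instead of from HDFS files.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Hypothetical keyspace and column family names, for illustration only.
        ConfigHelper.setColumnFamily(job.getConfiguration(), "MyKeyspace", "Events");

        // Ask Cassandra for a specific column from each row.
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList("payload".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

        // Mapper/reducer classes and output settings would be configured here
        // as in any other Hadoop job.
        job.waitForCompletion(true);
    }
}
```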
While looking at the details of the Cassandra-Hadoop integration, I found out that Cassandra will essentially pass all the data in a column family to Hadoop. A column family is like an RDBMS table, except that you can have any number of columns and the rows don't all need to have the same set of columns. A Cassandra column is essentially a name-value pair.
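On the Hadoop side, each input record is one Cassandra row: the row key plus a sorted map of that row's columns. A minimal mapper sketch, assuming the 0.6-era API where row keys and column names are byte arrays (the class and column names here are hypothetical):

```java
import java.io.IOException;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One map() call per Cassandra row: the row key plus whatever columns
// the slice predicate selected, keyed by column name.
public class EventMapper extends Mapper<byte[], SortedMap<byte[], IColumn>, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(byte[] key, SortedMap<byte[], IColumn> columns, Context context)
            throws IOException, InterruptedException {
        for (IColumn column : columns.values()) {
            // column.name() and column.value() are raw bytes; the application
            // decides how to interpret them.
            context.write(new Text(new String(column.name())), ONE);
        }
    }
}
```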
However, in my case I only want to pass a subset of a column family's data to Hadoop. I am dealing with timestamped data, and I may want to send only the last 24 hours of data to Hadoop. I need to be able to specify a key range, just like a Cassandra key range query, but there does not seem to be any way of doing it.
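For reference, this is the kind of key range query I mean, sketched against the 0.6-era Thrift API (which changed in later versions). The keyspace, column family, and time-prefixed key format are made up, and a key range like this only makes sense with the order-preserving partitioner:

```java
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class RangeQueryExample {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // Fetch all columns of each row in the range.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1000));

        // Hypothetical time-prefixed keys covering roughly the last day.
        KeyRange range = new KeyRange();
        range.setStart_key("2010-06-14");
        range.setEnd_key("2010-06-15");
        range.setCount(1000);

        List<KeySlice> rows = client.get_range_slices(
                "MyKeyspace", new ColumnParent("Events"), predicate, range, ConsistencyLevel.ONE);

        System.out.println("Rows in range: " + rows.size());
        socket.close();
    }
}
```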
I guess I could do a key range query on the original column family and populate another column family to be used by Hadoop, but that's no better than exporting the result of the range query to HDFS. So I am back where I started, i.e., exporting data out of the database (MySQL or Cassandra) into HDFS.