The sheer number of big data solutions can be overwhelming and confusing. The purpose of this post is to lay out a road map for them. I will categorize the products under four categories of solutions: 1. Query 2. Analytic 3. Real time processing 4. Search. My list includes only well known open source solutions. I won’t be doing a comprehensive analysis of the pros and cons of the different products. There are a lot of posts out there with comparative studies, some of them very strongly opinionated. I will provide links for further exploration of the listed products for interested readers.
I have often been in situations where I had to make recommendations for big data products. Having a high level, holistic view of the big data landscape is helpful in such situations.
My focus is on the storage and processing aspects of the big data platform. I will not cover the whole big data ecosystem. For example, data collection and data movement products are not included.
Here are the open source big data products. Sometimes the functionality of products overlaps, so the categorization below is not strictly correct.
By query, I mean being able to select a subset of the data based on some criteria. The products are subdivided based on whether a SQL interface is supported or not. Among those supporting SQL, Stinger, Impala and Shark are next generation products claiming low latency.
Stinger is also known as the next generation Hive, with much better performance. Impala does not use MapReduce and instead has a parallel distributed query engine. Shark is built on top of Spark. You can think of it as Hive for Spark. Presto is a new entry from Facebook. Like Impala, Presto does not use MapReduce.
By analytic, I mean processing that requires scanning all or most of the data. Analytic falls into two sub categories: aggregate and other. By aggregate, I mean the classical data warehousing type queries. Under this sub category, some products support SQL and some do not.
For NoSQL databases that do not provide aggregation support in the database engine (e.g., Cassandra, HBase), an alternative is to pre compute the aggregates at the point of data ingestion and store them.
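As a rough illustration of pre computing aggregates at ingestion time, here is a minimal Python sketch. The `aggregates` dictionary is only a stand-in for a table in a NoSQL store, and the event fields (`timestamp`, `product`, `amount`) and the hourly bucketing are assumptions made up for this example, not anything prescribed by Cassandra or HBase.

```python
from collections import defaultdict

# Hypothetical stand-in for a NoSQL table keyed by (product, hour bucket).
aggregates = defaultdict(lambda: {"count": 0, "total": 0.0})


def ingest(event):
    """Update the running aggregates for one incoming event."""
    # Bucket by hour: "2014-01-05T13:42:10" -> "2014-01-05T13"
    hour = event["timestamp"][:13]
    key = (event["product"], hour)
    aggregates[key]["count"] += 1
    aggregates[key]["total"] += event["amount"]


# Made-up sample events, purely for illustration.
events = [
    {"timestamp": "2014-01-05T13:42:10", "product": "book", "amount": 12.0},
    {"timestamp": "2014-01-05T13:55:01", "product": "book", "amount": 8.0},
    {"timestamp": "2014-01-05T14:03:47", "product": "book", "amount": 5.0},
]
for event in events:
    ingest(event)

# Reading an aggregate is now a single key lookup instead of a scan.
print(aggregates[("book", "2014-01-05T13")])
```

The trade off is that the set of aggregates has to be decided up front. A new grouping requires either re-ingesting the raw data or running a batch job over it.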
Sometimes a crisp categorization does not apply. For example, Druid, although a NoSQL analytic engine for aggregation, also has the capability to ingest data in real time like Storm or Samza. In Druid, the aggregates are stored in memory across the cluster for fast access.
By the other sub category, I mean any kind of custom algorithmic processing, including predictive analytics. Although Hadoop is the king in this space, Spark is the new kid on the block, claiming better performance. In addition to native HDFS, Hadoop also works with HBase or Cassandra as the storage engine.
By real time, I mean the capability to ingest and process real time streams. The processing could range from something as simple as cleaning up and pre processing data to aggregation and algorithmic analysis.
Generally these systems are pure processing engines and do not include any storage sub system. If storage of processed data is necessary, the user is free to choose any storage solution.
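To make the range of processing a bit more concrete, here is a small Python sketch of a stream pipeline with a cleaning step followed by a windowed aggregation. It uses plain generators and is not the API of Storm, Samza or any other product. The record fields (`user`, `ts`), the 10 second window and the `sink` list standing in for the storage of your choice are all assumptions for illustration.

```python
from collections import Counter


def clean(stream):
    """Pre processing step: drop malformed records and normalize the user name."""
    for record in stream:
        if record.get("user") and "ts" in record:
            yield {"user": record["user"].strip().lower(), "ts": record["ts"]}


def tumbling_counts(stream, window_seconds):
    """Aggregation step: count events per user in fixed, non overlapping windows.

    Assumes records arrive in timestamp order.
    """
    current_window, counts = None, Counter()
    for record in stream:
        window = record["ts"] // window_seconds
        if current_window is not None and window != current_window:
            # The window has closed, so emit its counts and start a new one.
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[record["user"]] += 1
    if counts:
        yield current_window * window_seconds, dict(counts)


# Made-up input stream; timestamps are seconds.
raw = [
    {"user": " Alice ", "ts": 1},
    {"user": "bob", "ts": 4},
    {"user": None, "ts": 5},  # malformed, dropped by clean()
    {"user": "alice", "ts": 9},
    {"user": "bob", "ts": 12},
]

sink = []  # stand-in for whatever storage solution the user chooses
for window_start, counts in tumbling_counts(clean(raw), window_seconds=10):
    sink.append((window_start, counts))

print(sink)
```

Because the processing steps only hand results to `sink`, swapping the list for a database writer would not change the pipeline itself, which mirrors the separation between processing engine and storage described above.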
By search, I mean the capability to index and search data, primarily text. Both Solr and Elasticsearch use Lucene as the underlying indexing engine.
Since many NoSQL databases lack adequate secondary index capabilities, these products are sometimes used to supplement the main NoSQL database with indexing capability.
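The idea of a search index sitting beside the main database can be sketched in a few lines of Python. This is a toy inverted index, far simpler than what Lucene actually does, and the `rows` table, its keys and the `bio` field are hypothetical. The point is only that the index maps terms back to row keys, which are then used to fetch the full records from the primary store.

```python
import re
from collections import defaultdict

# Hypothetical primary store: row key -> record, as in a key-value NoSQL table.
rows = {
    "u1": {"name": "Ann", "bio": "Hadoop and Spark engineer"},
    "u2": {"name": "Raj", "bio": "Search engineer working on Solr"},
    "u3": {"name": "Mei", "bio": "Spark streaming developer"},
}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Inverted index: term -> set of row keys whose bio contains the term.
index = defaultdict(set)
for key, row in rows.items():
    for term in tokenize(row["bio"]):
        index[term].add(key)


def search(query):
    """Return the row keys whose bio contains every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Look up keys in the index, then fetch the records from the primary store.
for key in sorted(search("spark engineer")):
    print(key, rows[key]["name"])
```

In a real deployment the index has to be kept in sync with the primary store whenever a row is written or deleted, which is the main operational cost of this arrangement.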
Messaging systems deliver messages in point to point and publish and subscribe modes. Typically these systems sit in front of real time stream processors like Storm or Spark Streaming.
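The difference between the two delivery modes can be shown with a small in-memory Python sketch. The class names `PointToPointQueue` and `PubSubTopic` are made up for this illustration and do not correspond to the API of any particular messaging product.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one receiver."""

    def __init__(self):
        self._pending = deque()

    def send(self, message):
        self._pending.append(message)

    def receive(self):
        # Whoever calls receive() first gets the message; it is then gone.
        return self._pending.popleft() if self._pending else None


class PubSubTopic:
    """Each message is delivered to every current subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


# Point to point: two workers compete for messages.
queue = PointToPointQueue()
queue.send("job-1")
queue.send("job-2")
worker_a = queue.receive()
worker_b = queue.receive()

# Publish and subscribe: both subscribers see the same message.
inbox_a, inbox_b = [], []
topic = PubSubTopic()
topic.subscribe(inbox_a.append)
topic.subscribe(inbox_b.append)
topic.publish("price-update")

print(worker_a, worker_b, inbox_a, inbox_b)
```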
Kafka is a messaging system, with the caveat that you can also access messages in a random access manner.
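One way to picture this random access is as an append only log in which every message has a sequential offset, and a consumer decides which offset to read from. The Python sketch below is a toy model of that idea under those assumptions. The `TopicLog` class and its methods are hypothetical and are not the Kafka client API.

```python
class TopicLog:
    """Toy append only log: each message gets a sequential offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store the message and return the offset it was written at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


log = TopicLog()
for message in ["m0", "m1", "m2", "m3"]:
    log.append(message)

# A consumer tracks its own position and can move it freely.
position = 2
latest = log.read(position)            # continue from where it left off
replayed = log.read(0, max_messages=2)  # rewind and re-read old messages

print(latest, replayed)
```

Since reading does not remove anything from the log, several consumers can read the same messages independently, each at its own offset.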
Deciding on a big data platform is a difficult process involving complex engineering trade offs. Many factors need to be taken into account. Hopefully, this post will be helpful in providing a holistic view of big data solutions and presenting the product choices available.