The number of choices among big data solutions can be overwhelming and confusing. The purpose of this post is to lay out a road map for big data solutions. I will categorize the products under five categories of solutions: 1. Query 2. Analytic 3. Real time processing 4. Search 5. Message Bus. My list includes only well known open source solutions. I will not do a comprehensive analysis of the pros and cons of different products. There are a lot of posts out there with comparative studies, some of them very strongly opinionated. I will provide links for further exploration of the products listed for interested readers.
I have often been in situations where I had to make recommendations for big data products. Having a high level, holistic view of the big data landscape is helpful in such situations.
My focus is on the storage and processing aspects of the big data platform. I will not cover the whole big data ecosystem. For example, data collection and data movement products are not included.
Here are the open source big data products. Sometimes the functionality of products overlaps, so the categorization below is not strictly correct.
By query, I mean being able to select a subset of the data based on some criteria. The products are subdivided based on whether a SQL interface is supported or not. Among those supporting SQL, Stinger, Impala and Shark are next generation products claiming low latency.
Stinger is also known as the next generation Hive, with much better performance. Impala does not use map reduce and instead has a parallel distributed query engine. Shark is built on top of Spark. You can think of it as Hive for Spark. Presto is a new entry from Facebook. Like Impala, Presto does not use map reduce.
By analytic, I mean doing some kind of processing that requires scanning all or most of the data. Analytic falls into two sub categories: aggregate and algorithmic. By aggregate, I mean the classical data warehousing type queries. Under this sub category, some products support SQL and some do not.
For NOSQL databases that do not provide aggregation support in the database engine (e.g., Cassandra, HBase), an alternative is to pre compute the aggregates at the point of data ingestion and store them.
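The idea of pre computing aggregates at ingestion time can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `PreAggregator`, the event fields (`page`, `ts`, `latency_ms`) and the hourly bucket are all hypothetical, and the in-memory dictionaries stand in for rows in whatever NOSQL store is actually used.

```python
from collections import defaultdict


class PreAggregator:
    """Keeps a running count and sum per (page, hour) as events arrive.

    Hypothetical sketch: the dictionaries stand in for rows in a NOSQL store.
    """

    def __init__(self):
        self.counts = defaultdict(int)
        self.sums = defaultdict(float)

    def ingest(self, event):
        # Bucket the timestamp (epoch seconds) to the start of its hour.
        hour = event["ts"] - event["ts"] % 3600
        key = (event["page"], hour)
        # Update the aggregates now, so no scan is needed at query time.
        self.counts[key] += 1
        self.sums[key] += event["latency_ms"]

    def average(self, page, hour):
        # Answering the query is a single key lookup.
        key = (page, hour)
        if key not in self.counts:
            return None
        return self.sums[key] / self.counts[key]


agg = PreAggregator()
agg.ingest({"page": "/home", "ts": 3600, "latency_ms": 100})
agg.ingest({"page": "/home", "ts": 3700, "latency_ms": 200})
print(agg.average("/home", 3600))
```

The trade off is that the set of aggregates has to be decided up front. A query on a dimension that was not pre aggregated still needs a full scan.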
Sometimes, a crisp categorization does not apply to a product. For example, Druid, although a NOSQL analytic engine for aggregation, also has the capability to ingest data in real time like Storm or Samza. In Druid, the aggregates are stored in memory across the cluster for fast access.
By algorithmic sub category, I mean any kind of custom algorithmic processing, including predictive analytic. Although Hadoop is the king in this space, Spark is a new kid on the block claiming better performance. Flink is the latest entry in this field. In addition to native HDFS, Hadoop also works with HBase or Cassandra as the storage engine.
Real time stream
By real time, I mean the capability to ingest and process real time streams. The processing could range from something as simple as cleaning up and pre processing of data to aggregation and algorithmic analysis.
Generally these systems are pure processing engines and do not include any storage sub system. If storage of processed data is necessary, the user is free to choose any storage solution.
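To make the range of processing concrete, here is a minimal Python sketch of a two stage stream pipeline: a cleaning step followed by a windowed aggregation. It is not how Storm, Samza or Spark Streaming are programmed. The function names, the event fields (`user`, `ts`) and the assumption that events arrive in timestamp order are all mine, chosen for illustration.

```python
def clean(events):
    """Drop events with a missing user or timestamp and normalize the rest."""
    for e in events:
        if e.get("user") and e.get("ts") is not None:
            yield {"user": e["user"].strip().lower(), "ts": int(e["ts"])}


def tumbling_counts(events, window_secs):
    """Yield (window_start, {user: count}) for each fixed size window.

    Assumes events arrive ordered by timestamp.
    """
    current = None
    counts = {}
    for e in events:
        start = e["ts"] - e["ts"] % window_secs
        if current is not None and start != current:
            # The window has closed, so emit its counts and start a new one.
            yield current, counts
            counts = {}
        current = start
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    if current is not None:
        yield current, counts


raw = [
    {"user": " Ann ", "ts": 1},
    {"user": "", "ts": 2},
    {"user": "bob", "ts": 5},
    {"user": "ann", "ts": 12},
]
for window_start, per_user in tumbling_counts(clean(raw), 10):
    print(window_start, per_user)
```

Note that nothing here is stored. Each window's result is emitted and forgotten, and persisting it is left to whatever storage solution the user chooses.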
By search, I mean the capability to index and search data, primarily text. Both Solr and Elasticsearch use Lucene as the underlying indexing engine.
Since many NOSQL databases lack adequate secondary index capabilities, these products are sometimes used to supplement the main NOSQL database with indexing capability.
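The pattern of supplementing a key based store with a search index can be sketched as follows. This is a toy model, not the Solr or Elasticsearch API: the class `IndexedStore` and its methods are hypothetical, a dictionary plays the role of the NOSQL database, and a simple inverted index on whitespace separated terms plays the role of the search engine.

```python
from collections import defaultdict


class IndexedStore:
    """Primary key/value store plus a separate inverted index on one text field."""

    def __init__(self, text_field):
        self.text_field = text_field
        self.rows = {}                  # primary store: row key -> record
        self.index = defaultdict(set)   # inverted index: term -> row keys

    @staticmethod
    def _terms(text):
        return set(text.lower().split())

    def put(self, key, record):
        # Remove index entries left over from a previous version of the row.
        old = self.rows.get(key)
        if old is not None:
            for term in self._terms(old[self.text_field]):
                self.index[term].discard(key)
        self.rows[key] = record
        for term in self._terms(record[self.text_field]):
            self.index[term].add(key)

    def search(self, query):
        """Return records whose text field contains every query term."""
        terms = self._terms(query)
        if not terms:
            return []
        keys = set.intersection(*(self.index.get(t, set()) for t in terms))
        # The index yields row keys. The records come from the primary store.
        return [self.rows[k] for k in sorted(keys)]


store = IndexedStore("bio")
store.put("u1", {"name": "Alice", "bio": "big data engineer"})
store.put("u2", {"name": "Bob", "bio": "data scientist"})
print(store.search("big data"))
```

The main cost of this arrangement is that every write has to update two systems, and keeping them consistent becomes the application's responsibility.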
Messaging systems deliver messages in point to point and publish and subscribe modes. Typically these systems appear in front of real time stream processors like Storm or Spark Streaming.
Kafka is a messaging system, with the caveat that you can also access messages in a random access manner.
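What random access means here can be shown with a toy append only log in Python. This is a sketch of the idea, not the Kafka client API: the class `MessageLog` and its methods are hypothetical, and a real Kafka topic is partitioned and persisted to disk, which this model ignores.

```python
class MessageLog:
    """Append only log in which every message is addressed by its offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store the message and return the offset assigned to it."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset, without removing them."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._messages[offset:offset + max_messages]


log = MessageLog()
for msg in ["a", "b", "c", "d"]:
    log.append(msg)

print(log.read(0, 2))   # a consumer reading from the beginning
print(log.read(2))      # another consumer starting at offset 2
print(log.read(0, 2))   # the first two messages can be replayed again
```

Because reading does not remove messages, each consumer only needs to remember its own offset, and it can rewind to reprocess older messages. In a traditional queue, by contrast, a message is gone once it has been consumed.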
With a plethora of solution options and a continuously evolving big data ecosystem, deciding on the architecture of your system can be overwhelming. You could ask yourself these questions to help narrow down your choices.
- Is data ingestion in incremental streaming fashion or in batch fashion?
- Is scalable storage important?
- Is SQL important?
- Does my data have a deeply nested structure?
- Is secondary indexing important for low query latency?
- Do I need scalable text search?
- Is scalable analytical processing important?
- Do I need to process data in real time?
- Is SQL enough for my analytical processing?
- Do I need to do algorithm driven analytical processing?
- For algorithm driven analytic processing, is it descriptive or predictive?
Deciding on a big data platform is difficult and involves complex engineering trade offs. Many factors need to be taken into account. Hopefully, this post will be helpful in providing a holistic view of big data solutions and presenting the product choices available.