The sheer number of choices among big data solutions can be overwhelming and confusing. The purpose of this post is to lay out a road map for big data solutions. I will categorize the products under five categories of solutions: 1. Query 2. Analytic 3. Real time processing 4. Search 5. Message Bus. My list will include only well known open source solutions. I won’t be doing a comprehensive analysis of the pros and cons of different products. There are a lot of posts out there with comparative studies, some of them very strongly opinionated. I will provide links for further exploration of the listed products for interested readers.
I have often been in situations where I had to make recommendations for big data products. Having a high level, holistic view of the big data landscape is helpful in such situations.
My focus is on the storage and processing aspects of the big data platform. I will not cover the whole big data ecosystem. For example, data collection and data movement products are not included.
Road Map
Here are the open source big data products. The functionality of some products overlaps, so the categorization below is not strictly correct.
- Query
- Analytic
- Real time stream
- Search
- Messaging
Query
By query, I mean being able to select a subset of the data based on some criteria. The products are subdivided based on whether a SQL interface is supported or not. Among those supporting SQL, Stinger, Impala and Shark are next generation products claiming low latency.
Stinger is also known as the next generation Hive, with much better performance. Impala does not use map reduce and instead has a parallel distributed query engine. Shark is built on top of Spark. You can think of it as Hive for Spark. Presto is a new entry from Facebook. Like Impala, Presto does not use map reduce.
Analytic
By analytic, I mean doing some kind of processing that requires scanning all or most of the data. Analytic falls into two sub categories: aggregate and algorithmic. By aggregate, I mean the classical data warehousing type of queries. Under this sub category, some products support SQL and some do not.
For NOSQL databases that do not provide aggregation support in the database engine (e.g., Cassandra, HBase), precomputing the aggregates at the point of data ingestion and storing them is an alternative.
Sometimes a crisp categorization does not apply to a product. For example, Druid, although a NOSQL analytic engine for aggregation, also has the capability to ingest data in real time like Storm or Samza. In Druid, the aggregates are stored in memory across the cluster for fast access.
By the algorithmic sub category, I mean any kind of custom algorithmic processing, including predictive analytics. Although Hadoop is the king in this space, Spark is a new kid on the block claiming better performance. Flink is the latest entry in this field. In addition to native HDFS, Hadoop also works with HBase or Cassandra as the storage engine.
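As a rough illustration of precomputing aggregates at ingestion time, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `IngestAggregator` class, the event fields (`product`, `ts`, `amount`), the hourly bucket, and the in-memory dictionary that stands in for a table in a store such as Cassandra or HBase. It is not the API of either database.

```python
from collections import defaultdict


class IngestAggregator:
    """Keeps running aggregates per (product, hour bucket) as events arrive."""

    def __init__(self):
        # Hypothetical stand-in for a NOSQL table:
        # row key -> {"count": int, "sum": float}
        self.table = defaultdict(lambda: {"count": 0, "sum": 0.0})

    def ingest(self, event):
        # Row key combines the dimension with the hour bucket of the timestamp
        key = (event["product"], event["ts"] // 3600)
        row = self.table[key]
        row["count"] += 1
        row["sum"] += event["amount"]

    def total(self, product, hour):
        # Point lookup by row key; no scan over raw events is needed
        row = self.table.get((product, hour))
        return row["sum"] if row else 0.0
```

The trade off is that the aggregates to maintain have to be decided up front; a query along a dimension that was not precomputed still needs a scan of the raw data.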
Real time stream
By real time, I mean the capability to ingest and process real time streams. The processing could range from something as simple as cleaning up and preprocessing data to aggregation and algorithmic analysis.
Generally these systems are pure processing engines and do not include any storage sub system. If storage of the processed data is necessary, the user is free to choose any storage solution.
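To make the range from cleanup to aggregation a little more concrete, here is a small Python sketch of a two stage stream pipeline built from generators. It is only a toy model, not how Storm, Samza or Spark Streaming are programmed. The function names, the event fields (`user`, `ts`) and the assumption that events arrive in timestamp order are all made up for the example.

```python
def clean(events):
    """Drop malformed events and normalize the user field."""
    for e in events:
        if e.get("user") and e.get("ts") is not None:
            yield {"user": e["user"].strip().lower(), "ts": e["ts"]}


def tumbling_counts(events, window_secs):
    """Yield (window_start, {user: count}) per fixed, non overlapping window.

    Assumes events arrive in timestamp order (an assumption of this sketch).
    """
    current = None
    counts = {}
    for e in events:
        start = (e["ts"] // window_secs) * window_secs
        if current is not None and start != current:
            # The window is closed; emit it and start a fresh one
            yield current, counts
            counts = {}
        current = start
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    if current is not None:
        # Emit the last, possibly partial, window
        yield current, counts
```

Note that the pipeline only emits results; where they are stored is left to the caller, which mirrors the separation between processing and storage described above.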
Search
By search, I mean the capability to index and search data, primarily text. Both Solr and Elasticsearch use Lucene as the underlying indexing engine.
Since many NOSQL databases lack adequate secondary index capabilities, these products are sometimes used to supplement the main NOSQL database with indexing capability.
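The idea of supplementing a key based store with a separate index can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `IndexedStore` class is hypothetical, both the primary store and the index are plain in-memory dictionaries, and field values must be hashable. In a real deployment the index would live in a product like Solr or Elasticsearch and would have to be kept in sync with the database.

```python
from collections import defaultdict


class IndexedStore:
    """Primary key/value store with a side index on field values."""

    def __init__(self):
        self.rows = {}                  # primary store: key -> record
        self.index = defaultdict(set)   # (field, value) -> set of primary keys

    def put(self, key, record):
        # Remove index entries of the previous version of the record, if any
        old = self.rows.get(key)
        if old is not None:
            for field, value in old.items():
                self.index[(field, value)].discard(key)
        self.rows[key] = record
        for field, value in record.items():
            self.index[(field, value)].add(key)

    def find(self, field, value):
        # Look up primary keys in the index, then fetch the records by key
        keys = self.index.get((field, value), set())
        return [self.rows[k] for k in sorted(keys)]
```

A lookup by a non key field then costs one index read plus a few point reads, instead of a scan over the whole table.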
Messaging
Messaging systems deliver messages in point to point and publish and subscribe modes. Typically these systems sit in front of real time stream processors like Storm or Spark Streaming.
Kafka is a messaging system, with the caveat that you can also access messages in a random access manner.
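The random access property can be pictured as an append only log in which every message gets a sequential offset, and a consumer may start reading from any offset it likes. The Python sketch below is a toy model of that idea only. The `MessageLog` class and its methods are hypothetical and are not Kafka's client API.

```python
class MessageLog:
    """Append only log; each message is addressed by its offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # The offset is simply the position of the message in the log
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_count=1):
        # Messages are not removed when read, so any offset can be re-read
        return self._messages[offset:offset + max_count]
```

Because reading does not consume a message, two consumers can read the same log independently, and a consumer can replay older messages by going back to an earlier offset.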
Ask Yourself
With a plethora of solution options and a continuously evolving big data ecosystem, deciding on the architecture of your system can be overwhelming. You could ask yourself these questions to help narrow down your choices.
- Is data ingested in an incremental streaming fashion or in batch fashion?
- Is scalable storage important?
- Is SQL important?
- Does my data have a deeply nested structure?
- Is secondary indexing important for low query latency?
- Do I need scalable text search?
- Is scalable analytical processing important?
- Do I need to process data in real time?
- Is SQL enough for my analytical processing?
- Do I need to do algorithm driven analytical processing?
- For algorithm driven analytical processing, is it descriptive or predictive?
Summing Up
Deciding on a big data platform is a difficult process involving complex engineering trade offs. Many factors need to be taken into account. Hopefully, this post will be helpful in providing a holistic view of big data solutions and presenting the product choices available.
For commercial support for any solution in my GitHub repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.
Hi
This categorization is very good for me. But what is the difference between Elasticsearch and Druid?
My problem is about full text search over terabytes of data in PDF format. Which is best for my problem, Elasticsearch or Druid?
Samira
You could follow the links and evaluate. The purpose of the post was to give the layout of the Big Data landscape and provide links for detailed information and not to recommend products.
Thank you for your reply. I need help with full text search in PDF files. What is your suggestion for dealing with this amount of data?
thanks a lot
Samira
A simple Google search brought me these, for a solution based on Elasticsearch and Tika:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html
http://stackoverflow.com/questions/10854858/best-practices-for-searchable-archive-of-thousands-of-documents-pdf-and-or-xml/10861308#10861308
I really don’t want to turn this into a thread on indexing PDF and other office documents. If you want to communicate further, please use my email from my LinkedIn profile.