Big Road Map for Big Data

The number of choices for big data solutions can be overwhelming and confusing. The purpose of this post is to lay out a road map for big data solutions. I will be categorizing the products under five different categories of solutions: 1. Query 2. Analytic 3. Real time processing 4. Search 5. Message Bus. My list will include only well known open source solutions. I won’t be doing a comprehensive analysis of the pros and cons of different products. There are a lot of posts out there with comparative studies, sometimes very strongly opinionated. I will be providing links for further exploration of the products listed for interested readers.

I have often been in situations where I had to make recommendations for big data products. Having a high level, holistic view of the big data landscape is helpful in such situations.

My focus is on the storage and processing aspects of the big data platform. I will not cover the whole big data ecosystem. For example, data collection and data movement products are not included.

Road Map

Here are the open source big data products. Sometimes the functionality of products overlaps, so the categorization below is not strictly correct.

Query
- NOSQL
  - Cassandra
  - HBase
  - MongoDB
  - Couchbase
- SQL
  - Hive
  - Stinger
  - Impala
  - Presto
  - Spark SQL
Analytic
- Aggregate
  - SQL
    - Hive
    - Stinger
    - Impala
    - Presto
    - Spark SQL
  - NOSQL
    - Druid
    - MongoDB
- Algorithmic
  - Hadoop
  - Spark
  - Flink
Real time stream
- Storm
- Samza
- S4
- Spark Streaming
- Flink
- Kafka Streams
Search
- Solr
- Elasticsearch
Messaging
- Kafka

Query

By query, I mean being able to select a subset of the data based on some criteria. The products are subdivided based on whether a SQL interface is supported or not. Among those supporting SQL, Stinger, Impala and Shark are next generation products claiming low latency.

Stinger is also known as the next generation Hive, with much better performance. Impala does not use MapReduce and instead has a parallel distributed query engine. Shark, the predecessor of Spark SQL, is built on top of Spark. You can think of it as Hive for Spark. Presto is a new entry from Facebook. Like Impala, Presto does not use MapReduce.

Analytic

By analytic, I mean doing some kind of processing that requires scanning all or most of the data. Analytic falls into two sub categories: aggregate and algorithmic. By aggregate, I mean the classical data warehousing type queries. Under this sub category, some products support SQL and some do not.

For NOSQL databases that do not provide aggregation support in the database engine (e.g., Cassandra, HBase), an alternative is to pre-compute the aggregates at the point of data ingestion and store them.
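To make the idea of pre-computing aggregates at ingestion concrete, here is a minimal sketch. It is not tied to Cassandra or HBase: the `IngestTimeAggregator` class and its in-memory dictionary are hypothetical stand-ins for a table in a key-value store, and the sample records are made up for illustration.

```python
from collections import defaultdict


class IngestTimeAggregator:
    """Keeps a running count and sum per key so reads never scan raw rows."""

    def __init__(self):
        # Hypothetical stand-in for a key-value table: key -> [count, total]
        self.aggregates = defaultdict(lambda: [0, 0.0])

    def ingest(self, key, value):
        # Update the stored aggregate as each record arrives.
        entry = self.aggregates[key]
        entry[0] += 1
        entry[1] += value

    def average(self, key):
        # A point lookup by key; no scan over raw records is needed.
        entry = self.aggregates.get(key)
        if entry is None:
            return None
        count, total = entry
        return total / count


agg = IngestTimeAggregator()
for day, amount in [("2014-08-23", 10.0), ("2014-08-23", 30.0), ("2014-08-24", 5.0)]:
    agg.ingest(day, amount)

print(agg.average("2014-08-23"))  # 20.0
```

The trade off is that only the aggregates decided on up front are available at query time. A new kind of aggregate would require reprocessing the raw data.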

Sometimes, a crisp categorization does not apply to some products. For example, Druid, although a NOSQL analytic engine for aggregation, also has the capability to ingest data in real time like Storm or Samza. In Druid, the aggregates are stored in memory across the cluster for fast access.

By algorithmic sub category, I mean any kind of custom algorithmic processing, including predictive analytics. Although Hadoop is the king in this space, Spark is a new kid on the block claiming better performance. Flink is the latest entry in this field. In addition to native HDFS, Hadoop also works with HBase or Cassandra as the storage engine.

Real time stream

By real time, I mean the capability to ingest and process a real time stream. The processing could range from something as simple as cleaning up and pre-processing data to aggregation and algorithmic analysis.

Generally these systems are pure processing engines and do not include any storage sub system. If storage of processed data is necessary, the user is free to choose any storage solution.
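The shape of such a pipeline can be sketched with plain Python generators. This is only an illustration and not the API of Storm, Samza or any other product listed here. The function names, the event fields `ts` and `key`, and the dictionary used as the store are all assumptions made for the example.

```python
from collections import Counter


def clean(events):
    """Pre-processing step: drop malformed records and normalize keys."""
    for event in events:
        if "ts" in event and "key" in event:
            yield {"ts": event["ts"], "key": event["key"].strip().lower()}


def tumbling_counts(events, window_seconds):
    """Aggregation step: count keys per fixed, non-overlapping time window.

    Assumes events arrive in timestamp order.
    """
    current_window, counts = None, Counter()
    for event in events:
        window = event["ts"] // window_seconds
        if current_window is not None and window != current_window:
            # The window has closed, so emit its result downstream.
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[event["key"]] += 1
    if counts:
        yield current_window * window_seconds, dict(counts)


def sink(results, store):
    """Storage is outside the engine: here, any dict-like object will do."""
    for window_start, counts in results:
        store[window_start] = counts


raw = [
    {"ts": 1, "key": " Click "},
    {"ts": 3, "key": "view"},
    {"bad": True},
    {"ts": 12, "key": "click"},
]
store = {}
sink(tumbling_counts(clean(raw), window_seconds=10), store)
print(store)  # {0: {'click': 1, 'view': 1}, 10: {'click': 1}}
```

The processing steps know nothing about where results end up. Swapping the dictionary for a real database would only change `sink`.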

Search

By search, I mean the capability to index and search, primarily for text data. Both Solr and Elasticsearch use Lucene as the underlying indexing engine.

Since many NOSQL databases lack adequate secondary index capabilities, these products are sometimes used to supplement the main NOSQL database with indexing capability.
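A toy sketch of this pattern follows. It does not use the Solr or Elasticsearch APIs. `KeyValueStore` and `InvertedIndex` are hypothetical stand-ins for the main database and the search engine, and the sample documents are invented.

```python
from collections import defaultdict


class KeyValueStore:
    """Stand-in for a NOSQL database that only supports lookup by primary key."""

    def __init__(self):
        self.rows = {}

    def put(self, key, row):
        self.rows[key] = row

    def get(self, key):
        return self.rows.get(key)


class InvertedIndex:
    """Stand-in for a search engine: maps each term to the primary keys containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, key, text):
        for term in text.lower().split():
            self.postings[term].add(key)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


db, idx = KeyValueStore(), InvertedIndex()
docs = {"u1": "Alice lives in Oakland", "u2": "Bob lives in Berkeley"}
for key, text in docs.items():
    db.put(key, {"bio": text})  # write to the main store
    idx.index(key, text)        # and to the index alongside it

# Query by content: the index returns primary keys, the store returns the rows.
keys = idx.search("oakland")
rows = [db.get(k) for k in keys]
print(keys, rows)
```

In a real deployment the two writes go to separate systems, so keeping them consistent becomes something the application has to handle.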

Messaging

Messaging systems deliver messages in point to point and publish and subscribe modes. Typically these systems appear in front of real time stream processors like Storm or Spark Streaming.

Kafka is a messaging system, with the caveat that you can also access messages in a random access manner.
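One way to picture this random access is as an append only log in which every message has an offset, and a consumer can move its read position to any offset. The sketch below is a toy model under that assumption and is not the Kafka client API. `PartitionLog` and `Consumer` are hypothetical names.

```python
class PartitionLog:
    """A single append-only log; each message gets a sequential offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the message just written

    def read(self, offset, max_messages=1):
        # Messages are retained after being read, so any offset can be revisited.
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own read position instead of having the log delete messages."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.position, max_messages)
        self.position += len(batch)
        return batch

    def seek(self, offset):
        self.position = offset


log = PartitionLog()
for message in ["a", "b", "c"]:
    log.append(message)

consumer = Consumer(log)
first_pass = consumer.poll()   # ['a', 'b', 'c']
consumer.seek(1)               # jump back to an earlier offset
replayed = consumer.poll(1)    # ['b']
print(first_pass, replayed)
```

Because reading does not remove messages, several consumers can read the same log independently, each with its own position.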

Ask Yourself

With a plethora of solution options and a continuously evolving Big Data ecosystem, it can be overwhelming to decide on the architecture of your system. You could ask yourself these questions to help narrow down your choices.

Is data ingestion in incremental streaming fashion or in batch fashion?
Is scalable storage important?
Is SQL important?
Does my data have a deeply nested structure?
Is secondary indexing important for low query latency?
Do I need scalable text search?
Is scalable analytical processing important?
Do I need to process data in real time?
Is SQL enough for my analytical processing?
Do I need to do algorithm driven analytical processing?
For algorithm driven analytic processing, is it descriptive or predictive?

Summing Up

Deciding on a big data platform is a difficult decision making process involving complex engineering trade offs. Many factors need to be taken into account. Hopefully, this post will be helpful in providing a holistic view of big data solutions and presenting the product choices available.

For commercial support for any solution in my GitHub repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.


5 Responses to Big Road Map for Big Data

samira says:

August 23, 2014 at 12:02 am

Hi
This categorization is very good for me. But what is the difference between Elasticsearch and Druid?
My problem is about full text searching in terabytes of data in PDF format. Which is best for my problem, Elasticsearch or Druid?

- Pranab says:
  
  August 24, 2014 at 6:16 pm
  
  Samira
  You could follow the links and evaluate. The purpose of the post was to give the layout of the Big Data landscape and provide links for detailed information and not to recommend products.
  
  - samira says:
    
    August 25, 2014 at 12:26 am
    
    Thank you for your reply. I need help with full text searching in PDF files. What is your suggestion for dealing with this amount of data?
    thanks a lot
Pranab says:

August 25, 2014 at 3:16 pm

Samira

A simple Google search brought me these for a solution based on Elasticsearch and Tika:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html
http://stackoverflow.com/questions/10854858/best-practices-for-searchable-archive-of-thousands-of-documents-pdf-and-or-xml/10861308#10861308

I really don’t want to turn this into a thread for indexing PDF and other office documents. If you want to communicate further, please use my email from my LinkedIn profile.