Hive Plays Well with JSON


Hive is an abstraction on top of Hadoop MapReduce. It provides a SQL-like interface for querying HDFS data, which accounts for most of its popularity. In Hive, structured data in HDFS is exposed as a table, much as in an RDBMS, and the DDL for creating a Hive table looks very similar to table creation DDL in an RDBMS.

In one of my recent projects, I needed to store and query JSON formatted hierarchical data. Hive is known to work well with flat, record structured data, so I wanted to find out how it handles hierarchically structured data. It turns out that Hive handles hierarchical JSON formatted data quite well.

Quick Review of Hive

The Hive wiki is a very good resource for learning Hive. Here is a quick review. When you run a Hive query, roughly the following happens.

  1. The query planner analyzes the query and converts it to a DAG (Directed Acyclic Graph) of Hadoop MapReduce jobs.
  2. The jobs get submitted to the Hadoop cluster in the order implied by the DAG.
  3. For a simple query, only mappers run. The input/output format is responsible for managing an input split and reading the data off HDFS. The data then flows into a layer called the SerDe (Serializer Deserializer). On the read path, the deserializer part of the SerDe converts the byte stream into a structured, row oriented format.
  4. For aggregate queries, the MapReduce jobs also include reducers. On the write path, the serializer part of the SerDe converts structured data into a byte stream, which is handed over to the input/output format and written to HDFS.

A typical Hive query will involve one or more MapReduce jobs and a full scan of the data, unless the table has partitions.

As a result, a Hive query may be a high latency operation, depending on the amount of data and the Hadoop cluster configuration.
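If you are curious how Hive plans to run a particular query, you can prefix it with EXPLAIN. Hive then prints the plan, showing the stages and the operators in each, without actually running anything. Here is a minimal sketch against the comments table created later in this post.

EXPLAIN SELECT count(*) FROM comments;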

Blog Comment Example

For exploring Hive’s capability in handling JSON data, I am using blog comments as an example. A blog comment has the following fields.

  1. blogID
  2. date
  3. commenter name
  4. comment
  5. commenter email
  6. commenter web site

Here are some typical JSON records for blog comments. The data has two levels of nesting.

{ "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" } }
{ "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment",}

You may have noticed that not all comment records have the commenter's contact info. As we will see later, when we query, Hive returns NULL for any missing field in any record.

Create Table

My table is very simple, with only one column, which holds the JSON data. You can think of the data as completely denormalized. I am using an external table, which means the table data stays outside Hive's control; I am simply telling Hive where to look for it. Here is the DDL for the table.

CREATE  EXTERNAL  TABLE comments
(
value STRING 
)
LOCATION  '/Users/pghosh/test/comments';

Next, I loaded my data from a file into the Comments table as below

LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;

The keyword OVERWRITE causes any existing data for the table to be wiped out before the load. Now that the data is loaded, we are ready for some queries.
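Before running anything more interesting, we can sanity check the load and preview the NULL behavior mentioned earlier. This is just a small sketch using Hive's built in get_json_object function, which takes a JSON string and a JSONPath like expression; for records that have no contact node, the second column comes back as NULL.

SELECT get_json_object(value, '$.name'), get_json_object(value, '$.contact.email') 
FROM comments LIMIT 5;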

Time for Query

My first query is going to find the emails of all commenters on a given blog. You may want to run such a query in the following scenario: we have just published a new blog post that happens to be very similar to an older one. We want to find the email of everyone who commented on the old post, so that we can send them an email with a link to the new one; it is very likely that they will be interested in it. Here is the query.

SELECT b.blogID, c.email FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b 
AS blogID, contact  LATERAL VIEW json_tuple(b.contact, 'email', 'website') c 
AS email, website WHERE b.blogID='64FY4D0B28';

In this query, the JSON support in Hive has kicked in. LATERAL VIEW and json_tuple give us a way of navigating the JSON tree. The first instance gives us a virtual table with two columns, blogID and contact. We repeat the same process to extract data from the next level of the JSON tree; this time it gives us another virtual table with the columns email and website.

The function json_tuple explodes a JSON node and returns the child node values. The first argument is the node to explode; the rest of the arguments are the child node names.

Here is the query result. As expected, the query triggers one MapReduce job with no reducers; the Hadoop job contains only map tasks.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201205052116_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0001
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0001
2012-05-05 21:42:08,639 Stage-1 map = 0%,  reduce = 0%
2012-05-05 21:42:15,639 Stage-1 map = 100%,  reduce = 0%
2012-05-05 21:42:17,900 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201205052116_0001
OK
64FY4D0B28	NULL
64FY4D0B28	NULL
64FY4D0B28	NULL
64FY4D0B28	lyau@hotmail.com
64FY4D0B28	NULL
64FY4D0B28	gyocok@gmail.com
64FY4D0B28	shrmf@gmail.com
Time taken: 21.459 seconds
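The same result could also be obtained with get_json_object instead of json_tuple; the sketch below is just for comparison. The advantage of json_tuple with LATERAL VIEW is that the JSON string is parsed once per row, rather than once for every extracted field.

SELECT get_json_object(a.value, '$.blogID'), get_json_object(a.value, '$.contact.email') 
FROM comments a WHERE get_json_object(a.value, '$.blogID')='64FY4D0B28';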

Our next query is an aggregate query. We are interested in the number of comments for each blog. This is a group by query, as below. Here we navigate only one level of the JSON tree.

SELECT b.blogID, count(b.name) FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b 
AS blogID, name group by b.blogID;

Here is the result of the query. I am only showing a partial result. The interesting thing to note is that this query has mapped to one Hadoop job, and the job includes reducers.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201205052116_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0002
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0002
2012-05-05 21:48:06,878 Stage-1 map = 0%,  reduce = 0%
2012-05-05 21:48:23,088 Stage-1 map = 4%,  reduce = 0%
2012-05-05 21:48:25,097 Stage-1 map = 100%,  reduce = 0%
2012-05-05 21:48:39,214 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201205052116_0002
OK
002F01U5F1	8
00PYU30L6R	13
012A2RDU7V	8
01D31LI365	10
01UBOW8E31	5
0223774V51	6
023WWOWS60	16
02E09F81RD	5
02MNORRJNP	3
02RCP58IIG	7
030GFNQN0K	7
033TWOLGY3	7
03C2WCN4UA	6
03M4N0WFB1	7
04197CKG0J	11
042JJOY1X1	7
04HAB1U275	2
073VPCIE4H	5
07NDU9S4B1	10
..................
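The group by query can be refined in the usual SQL ways. For example, if we only care about blogs that attracted more than ten comments, a HAVING clause does the job; this is a sketch along the same lines as the query above.

SELECT b.blogID, count(b.name) FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b 
AS blogID, name GROUP BY b.blogID HAVING count(b.name) > 10;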

Another Way

Another way to handle JSON is to use a JSON SerDe. The external view of any Hive encapsulated data is always column and row oriented. However, the internal on disk representation of the data could be anything.

A Hive SerDe is the bridge between that internal representation and the external column and row oriented view. In this case, it does all the necessary mapping between JSON fields and columns. You have to add the following to your table DDL. With this approach, your table will have multiple columns.

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
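With a SerDe, the comments table could be declared with one column per JSON field, which gives a much cleaner external view. The sketch below is only illustrative: it assumes the SerDe jar has been put on Hive's classpath with ADD JAR (the path is a placeholder), and that this particular SerDe maps columns to JSON fields by name, with the nested contact node declared as a struct.

ADD JAR /path/to/json-serde.jar;

CREATE EXTERNAL TABLE comments_json
(
blogID STRING,
`date` STRING,
name STRING,
`comment` STRING,
contact STRUCT<email:STRING, website:STRING>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/Users/pghosh/test/comments';

Queries then read like ordinary SQL, for example SELECT contact.email FROM comments_json WHERE blogID='64FY4D0B28';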

Summing Up

Hopefully, this post has given you some idea of what Hive can do with JSON data. An interesting question that comes up is whether we still need MongoDB. MongoDB is a popular NOSQL document structured database, with very powerful indexing and querying capabilities.

If you are doing mostly aggregate queries that touch most of the data, as in a data warehouse application, Hive is a better option than MongoDB. However, if your use cases include queries touching only a small fraction of the data, you may be better off staying with MongoDB or an equivalent document structured database.

Update in Hive is an expensive operation, because it involves a full table overwrite. So for data that is frequently updated, MongoDB may be a better choice.
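To make that cost concrete, even removing the comments of a single blog means reading and rewriting the entire data set, for example into a second table. This is only a sketch; comments_cleaned is a hypothetical table with the same single value STRING column.

INSERT OVERWRITE TABLE comments_cleaned
SELECT a.value FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID') b 
AS blogID WHERE b.blogID <> '64FY4D0B28';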

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and anything in between. I am an active blogger and open source contributor. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Data Mining and Programming languages. I am fascinated by problems that don't have a neat closed form solution.

23 Responses to Hive Plays Well with JSON

  1. Pingback: Hive (computing): What is the best Hive JSON SerDe? - Quora

  2. Pingback: Hive Plays Well with JSON | BigDataCloud.com

  3. Alicedelic says:

    Your example saved me, thank you.

  4. Manoj says:

    Good Work! Much helpful…

  5. bourne says:

    This blog is very helpful, and I have a question.
    It seems that json_tuple() does not support arrays, is that right?

  6. Asher says:

    Could you give an example getting email and website info where you have json like this:
    { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contacts" : [{ "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" }, { "email" : "vpxnksu@yahoo.com", "website" : "vpxnksu.tumblr.com" } ] }

    • Pranab says:

      It’s synthetic data generated by a script

      • Asher says:

        Right. Can you give a sample hive query that would return the email and website from the json I provided? I’m not sure how to get them from a nested structure.

      • Nagesh says:

        when i run first query i am getting nothing.. no data.. please find the below information.

        hive> SELECT b.blogID, c.website FROM sample a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
        > AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
        > AS email, website WHERE b.blogID='FJY26J1333';
        Total MapReduce jobs = 1
        Launching Job 1 out of 1
        Number of reduce tasks is set to 0 since there’s no reduce operator
        Starting Job = job_201305131006_0022, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201305131006_0022
        Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201305131006_0022
        Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
        2013-05-13 15:10:20,754 Stage-1 map = 0%, reduce = 0%
        2013-05-13 15:10:26,781 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:27,792 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:28,817 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:29,823 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:30,838 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:31,844 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:32,859 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.69 sec
        MapReduce Total cumulative CPU time: 690 msec
        Ended Job = job_201305131006_0022
        MapReduce Jobs Launched:
        Job 0: Map: 1 Cumulative CPU: 0.69 sec HDFS Read: 498 HDFS Write: 0 SUCCESS
        Total MapReduce CPU Time Spent: 690 msec
        OK
        Time taken: 28.19 seconds
        hive>

  7. Fero Kocun says:

    It helped me, thanks!

  8. Nagesh says:

    Hi pranab,
    This blog is very helpful for us. Thanks for your blog.
    I have installed Hadoop and Hive in Linux. I am getting an error message while creating a table using Hive. I am using a JSON SerDe jar which is copied into the $HIVE_HOME/lib dir. I have added the jar to the class path also. Please help me with this issue.
    Please find the error below.
    hive> create table air(
    > airline string,
    > airlineid string,
    > sourceairport string,
    > sourceairportid string,
    > destinationairport string,
    > destinationairportid string,
    > codeshare string,
    > stop string,
    > equipment String)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
    > location '/tmp/air_routes_result/part-00000';
    FAILED: Error in metadata: java.lang.NullPointerException
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

  9. Pingback: 如何在Hive中使用Json格式数据 | 四号程序员

  10. Pingback: Live Streaming Κάλυψη Συνεδρείων

  11. Romit says:

    Hello, is it possible to use this method on nested json structure?
    (Referring to the sample from above) for example: { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contacts" : [{ "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" }, { "email" : "vpxnksu@yahoo.com", "website" : "vpxnksu.tumblr.com" } ] }

    Thanks for your help!

  12. Hadokee says:

    Hello Pranab, thanks for your article. I have one question for you: I'm trying to import this JSON into a Hive table with your method:

    JSON
    {
    "id":"3434",
    "hashtags": [
    {"text":"element1","indices":[1,2,3]},
    {"text":"element2","indices":[1,2,3]}
    ]
    }

    HiveTable
    id int
    hashtags array

    I would simply like to import "element1","element2" into the hashtags ARRAY column.
    How can I do it? It seems it is not possible to import data directly into an array…

    Thanks you

  13. Hadokee says:

    json_tuple seems not to support arrays…

  14. Chandra Bhatt says:

    Thanks Pranab. Nice explanation with good deal of details.

  15. dheeraj says:

    Thanks for the post Pranab. It helped me a lot. Just one quick question, in your example you mentioned about inserting of data in to the external table created. Can you give an example of how to work with json data, if partitions are involved. My data is partitioned as below year=2014,month=01,day=01
    year=2014,month=01,day=02 & so on.
    Thanks for your time.

    • Pranab says:

      Dheeraj, I am not sure if this is the best way. You could store the data in a non partitioned table first. Then you can do an insert overwrite to read data from the first table and write to the partitioned table. You should use dynamic partitioning, and you can extract the partition values using json_tuple etc. in the query against the first table.
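      A rough sketch of what I mean, with assumed names: comments_partitioned is a hypothetical table with the same value column, partitioned by year, month and day, and the partition values are derived from the date field of the JSON.

      set hive.exec.dynamic.partition=true;
      set hive.exec.dynamic.partition.mode=nonstrict;

      INSERT OVERWRITE TABLE comments_partitioned PARTITION (year, month, day)
      SELECT a.value, substr(b.dt, 1, 4), substr(b.dt, 6, 2), substr(b.dt, 9, 2)
      FROM comments a LATERAL VIEW json_tuple(a.value, 'date') b AS dt;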
