Hive Plays Well with JSON


Hive is an abstraction on Hadoop Map Reduce. It provides a SQL-like interface for querying HDFS data, which accounts for most of its popularity. In Hive, structured data in HDFS is encapsulated as a table, much as in an RDBMS. The DDL for creating a table in Hive looks very similar to table creation DDL in an RDBMS.

In one of my recent projects, I needed to store and query JSON formatted hierarchical data. Hive works well with flat, record-structured data, so I wanted to find out how it handles hierarchically structured data. It turns out that Hive works well with hierarchical JSON formatted data too.

Quick Review of Hive

The Hive wiki is a very good resource for learning Hive. Here is a quick review of what happens, roughly, when you run a Hive query:

  1. The Query Planner analyzes the query and converts it into a DAG (Directed Acyclic Graph) of Hadoop Map Reduce jobs.
  2. The jobs get submitted to the Hadoop cluster in the order implied by the DAG.
  3. For a simple query, only mappers run. The input/output format is responsible for managing an input split and reading the data off HDFS. Next, the data flows into a layer called the SerDe (Serializer Deserializer); here the deserializer part of the SerDe converts the byte stream into a structured format.
  4. For aggregate queries, the Map Reduce jobs also include reducers. In this case, the serializer part of the SerDe converts structured data into a byte stream, which gets handed over to the input/output format to be written to HDFS.

A typical Hive query will involve one or more Map Reduce jobs and a full scan of the data, unless the table has partitions.

As a result, a Hive query may be a high-latency operation, depending on the amount of data and the Hadoop cluster configuration.

Blog Comment Example

To explore Hive's capability in handling JSON data, I am using blog comments as an example. A blog comment has the following fields:

  1. blogID
  2. date
  3. commenter name
  4. comment
  5. commenter email
  6. commenter web site

Here are some typical JSON records for a blog comment. The data has two levels of nesting.

{ "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" } }
{ "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment" }

You may have noticed that not all comment records have the commenter's contact info. As we will see later, when we query, Hive returns NULL for any missing field in any record.

Create Table

My table is very simple, with only one column that holds the JSON data. You can think of the data as completely denormalized. I am using an external table, which means I am keeping the table data outside Hive's control; I am simply telling Hive where to look for the data. Here is the DDL for the table:

CREATE EXTERNAL TABLE comments
(
value STRING
)
LOCATION '/Users/pghosh/test/comments';

Next, I loaded my data from a file into the comments table as below:

LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;

The keyword OVERWRITE causes any existing data for the table to be wiped out before the load. Now that the data is loaded, we are ready for some queries.
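
A quick way to verify the load is to pull back a couple of raw rows; since the table has a single string column, each row comes back as one complete JSON document:

SELECT * FROM comments LIMIT 2;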

Time for Query

My first query finds the emails of all commenters on a given blog. You might want to run such a query in the following scenario: we have just published a new blog, which happens to be very similar to a past blog. We want to find the emails of all commenters on the old blog, so that we can send them an email with a link to the new blog; it's very likely that they will be interested in it. Here is the query.

SELECT b.blogID, c.email FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b 
AS blogID, contact  LATERAL VIEW json_tuple(b.contact, 'email', 'website') c 
AS email, website WHERE b.blogID='64FY4D0B28';

In this query, the JSON support in Hive kicks in. LATERAL VIEW and json_tuple give us a way of navigating the JSON tree. The first instance gives us a virtual table with two columns, blogID and contact. We repeat the same process to extract data from the next level of the JSON tree; this time it gives us another virtual table with the columns email and website.

The function json_tuple explodes a JSON node and returns the child node values. The first argument is the node to explode; the rest of the arguments are the child node names.

Here is the query result. As expected, this query triggers one Map Reduce job that does not include any reducer; the Hadoop job contains only a map task.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201205052116_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0001
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0001
2012-05-05 21:42:08,639 Stage-1 map = 0%,  reduce = 0%
2012-05-05 21:42:15,639 Stage-1 map = 100%,  reduce = 0%
2012-05-05 21:42:17,900 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201205052116_0001
OK
64FY4D0B28	NULL
64FY4D0B28	NULL
64FY4D0B28	NULL
64FY4D0B28	lyau@hotmail.com
64FY4D0B28	NULL
64FY4D0B28	gyocok@gmail.com
64FY4D0B28	shrmf@gmail.com
Time taken: 21.459 seconds
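
Incidentally, the same lookup can also be written without LATERAL VIEW, using Hive's built-in get_json_object function, which takes the JSON string and a JSON path expression. It is more verbose when several fields are needed, since the document is parsed once per call, but it can be handy for a quick one-off query. A rough equivalent of the query above would look something like this:

SELECT get_json_object(a.value, '$.blogID') AS blogID,
       get_json_object(a.value, '$.contact.email') AS email
FROM comments a
WHERE get_json_object(a.value, '$.blogID') = '64FY4D0B28';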

Our next query is an aggregate query. We are interested in the number of comments for each blog. This is a group by query, as below. Here we navigate only one level of the JSON tree.

SELECT b.blogID, count(b.name) FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b 
AS blogID, name GROUP BY b.blogID;

Here is the result of the query; I am only showing a partial result. The interesting thing to note is that this query maps to one Hadoop job, and the job includes reducers.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201205052116_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0002
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0002
2012-05-05 21:48:06,878 Stage-1 map = 0%,  reduce = 0%
2012-05-05 21:48:23,088 Stage-1 map = 4%,  reduce = 0%
2012-05-05 21:48:25,097 Stage-1 map = 100%,  reduce = 0%
2012-05-05 21:48:39,214 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201205052116_0002
OK
002F01U5F1	8
00PYU30L6R	13
012A2RDU7V	8
01D31LI365	10
01UBOW8E31	5
0223774V51	6
023WWOWS60	16
02E09F81RD	5
02MNORRJNP	3
02RCP58IIG	7
030GFNQN0K	7
033TWOLGY3	7
03C2WCN4UA	6
03M4N0WFB1	7
04197CKG0J	11
042JJOY1X1	7
04HAB1U275	2
073VPCIE4H	5
07NDU9S4B1	10
..................
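
The json_tuple pattern extends naturally to grouping on more than one extracted field. For example, a sketch of counting comments per blog per day might look like this (the dt alias is arbitrary; it simply avoids the reserved word date):

SELECT b.blogID, b.dt, count(b.name)
FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'date', 'name') b
AS blogID, dt, name
GROUP BY b.blogID, b.dt;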

Another Way

Another way to handle JSON is to use a JSON SerDe. The external view of any Hive-encapsulated data is always row and column oriented. However, the internal on-disk representation of the data could be anything.

A Hive SerDe is the bridge between the internal representation and the external column- and record-oriented view. In this case it does all the necessary mapping between JSON fields and columns. You have to add the following to your table DDL; with this approach, your table will have multiple columns.

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
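
As an illustrative sketch, a table for the same comment data with the SerDe doing the field mapping might look like the DDL below. This assumes the SerDe jar has been added to the session with ADD JAR and that the SerDe in use supports nested structs (many community JSON SerDes do); the class name here is the one mentioned above, so substitute whatever class matches the jar you actually use.

-- the jar path below is illustrative; point ADD JAR at your local SerDe jar
ADD JAR /path/to/json-serde.jar;

CREATE EXTERNAL TABLE comments_typed
(
blogID STRING,
`date` STRING,
name STRING,
`comment` STRING,
contact STRUCT<email:STRING, website:STRING>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/Users/pghosh/test/comments';

With the mapping handled by the SerDe, the earlier email lookup reduces to plain column access:

SELECT blogID, contact.email FROM comments_typed WHERE blogID = '64FY4D0B28';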

Summing Up

Hopefully, this post has given you some idea of what Hive can do with JSON data. An interesting question that comes up is whether we still need MongoDB. MongoDB is a popular NOSQL document-oriented database with very powerful indexing and querying capabilities.

If you are doing mostly aggregate queries that touch most of the data, as in a data warehouse application, Hive is a better option than MongoDB. However, if your use cases involve queries touching only a small fraction of the data, you may be better off staying with MongoDB or another equivalent document-oriented database.

An update in Hive is an expensive operation, because it involves a full table overwrite. So for data that is frequently updated, MongoDB may be a better choice.

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment in the cloud, including installation, configuration and testing.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms, in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green, sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming Languages. I am fascinated by problems that don't have a neat closed-form solution.

46 Responses to Hive Plays Well with JSON

  1. Pingback: Hive (computing): What is the best Hive JSON SerDe? - Quora

  2. Pingback: Hive Plays Well with JSON | BigDataCloud.com

  3. Alicedelic says:

    Your example saved me, thank you.

  4. Manoj says:

    Good Work! Much helpful…

  5. bourne says:

    This blog is very helpful, and I have a question.
    It seems that json_tuple() not support array, is it right?

  6. Asher says:

    Could you give an example getting email and website info where you have json like this:
    { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contacts" : [{ "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" }, { "email" : "vpxnksu@yahoo.com", "website" : "vpxnksu.tumblr.com" } ] }

    • Pranab says:

      It’s synthetic data generated by a script

      • Asher says:

        Right. Can you give a sample hive query that would return the email and website from the json I provided? I’m not sure how to get them from a nested structure.

      • Nagesh says:

        when i run first query i am getting nothing.. no data.. please find the below information.

        hive> SELECT b.blogID, c.website FROM sample a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
        > AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
        > AS email, website WHERE b.blogID='FJY26J1333';
        Total MapReduce jobs = 1
        Launching Job 1 out of 1
        Number of reduce tasks is set to 0 since there’s no reduce operator
        Starting Job = job_201305131006_0022, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201305131006_0022
        Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201305131006_0022
        Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
        2013-05-13 15:10:20,754 Stage-1 map = 0%, reduce = 0%
        2013-05-13 15:10:26,781 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:27,792 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:28,817 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:29,823 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:30,838 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:31,844 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:32,859 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.69 sec
        MapReduce Total cumulative CPU time: 690 msec
        Ended Job = job_201305131006_0022
        MapReduce Jobs Launched:
        Job 0: Map: 1 Cumulative CPU: 0.69 sec HDFS Read: 498 HDFS Write: 0 SUCCESS
        Total MapReduce CPU Time Spent: 690 msec
        OK
        Time taken: 28.19 seconds
        hive>

      • Rishav says:

        I’m having the same issue, can you please provide the hive query for this scenario.

    • Rishav says:

      Please let me know if you found the hive query for this scenario. I’m having the same issue.

  7. Fero Kocun says:

    It helped me, thanks!

  8. Nagesh says:

    Hi pranab,
    This blog is very helpful for us.Thanks for your blog.
    I have installed Hadoop and Hive in linux. I am getting an error message while creating table using hive. I am using Json serde jar which is copied in $HIVE_HOME/lib dir. I have added jar into class path also.Please help me on this issue.
    Please find the error below.
    hive> create table air(
    > airline string,
    > airlineid string,
    > sourceairport string,
    > sourceairportid string,
    > destinationairport string,
    > destinationairportid string,
    > codeshare string,
    > stop string,
    > equipment String)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
    > location '/tmp/air_routes_result/part-00000';
    FAILED: Error in metadata: java.lang.NullPointerException
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

  9. Pingback: 如何在Hive中使用Json格式数据 | 四号程序员

  10. Pingback: Live Streaming Κάλυψη Συνεδρείων

  11. Romit says:

    Hello, is it possible to use this method on nested json structure?
    (Refering to the sample from above) for example: { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contacts" : [{ "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" }, { "email" : "vpxnksu@yahoo.com", "website" : "vpxnksu.tumblr.com" } ] }

    Thanks for your help!

  12. Hadokee says:

    Hello Pranab thanks for your Article. I have one question for you, I’m trying to import with ur method this JSON into a Hive table:

    JSON
    {
    "id":"3434",
    "hashtags": [
    {"text":"element1","indices":[1,2,3]},
    {"text":"element2","indices":[1,2,3]}
    ]
    }

    HiveTable
    id int
    hashtags array

    I would simply import "element1","element2" in the hashtag ARRAY column.
    How I can do It? It seems not possibile import data directly into an array…

    Thanks you

    • sri says:

      Hi Hadokee.
      I had same use case ,can you please let me know how to write query if i want to select id and text from the table

  13. Hadokee says:

    Json_tuple seems not support arrays…

  14. Chandra Bhatt says:

    Thanks Pranab. Nice explanation with good deal of details.

  15. dheeraj says:

    Thanks for the post Pranab. It helped me a lot. Just one quick question, in your example you mentioned about inserting of data in to the external table created. Can you give an example of how to work with json data, if partitions are involved. My data is partitioned as below year=2014,month=01,day=01
    year=2014,month=01,day=02 & so on.
    Thanks for your time.

    • Pranab says:

      Dheeraj, I am not sure if this is the best way. You could store data in a non partitioned table first. Then you can do insert overwrite to read data from the first table and write to the partitioned table, You should use dynamic partitioning. You could extract the partition values using json_tuple etc in the query to the first table

  16. Rahul Banerjee says:

    Do not Format the JSON using formatting tools…keep the JSON in a single line otherwise it does not return any data.

  17. hi i am new to spark i want to read json file and insert the fields from json file to hbase using spark
    thanks in advance

  18. Sandeep says:

    I am facing one problem with your example, really appreciate if you can help resolving this. I loaded below data. But SQL result did not show any column value if “contact” tag is missing.

    { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" } }
    { "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment",}

    If you execute your SQL, It did not return any column value for blogID VSAUMDFGSD
    hive> SELECT b.blogID, c.email FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
    > AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
    > AS email, website WHERE b.blogID='VSAUMDFGSD';
    OK
    Time taken: 0.12 seconds

    How we can get result something like this ?
    VSAUMDFGSD NULL

  19. afsar says:

    I am just thinking when you say "create external table comments" with some location then why are you loading data into table "LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;"

    My understanding is if you create any external table in hive and define the location of data then you don’t need to load any data otherwise it is going against the terminalogy of external table in hive.

    Please correct me if I am wrong with this.

    • Pranab says:

      With external table you are telling hive to refer to some directory for data. You are free to do whatever you want in that directory outside hive

      • afsar says:

        Thanks Pranab for the reply.

        My original question remains unanswered. If we have given the location directory for data then hive can look for data in that directory and no need to load data in hive for same table.

  20. mohan says:

    this blog is very helpful..thankyou for this blog..
    also see our blog for hive json..
    http://bit.ly/2c2cQsO

  21. Pingback: Big data and Football: Using Pig to Analyze Pigskin - The Blend: A West Monroe Partners Blog

  22. syam says:

    Hi Pranab,

    Nice Blog.
    I have a querry for analysing the flume data using hive.
    I am using the Cloudera.quickstart VM
    I have configure the Flume and required Flume data loaded into HDFS

    Required hive-serdes-1.0-SNAPSHOT.jar downloaded and copied to Hive Lib and while creating the table jar is added :
    > Add jar /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
    Jar Added for the class path
    After that I have Created a external table in hive :
    > create external table load_tweets(id BIGINT,text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/cloudera/flume/tweets;
    Table is created

    > select * from load_tweets;
    Console error :
    Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.StringReader@4c445c13; line: 1, column: 2]

    Please Help me on this,

    Thanks,
    Syam.

  23. Kullai says:

    Hi
    Good work Pranab ! Most helpfull

  24. Hi,
    Hi, Very informative Article.There is so much information about the Hadoop and Cybersecurity in this post. keep it up and thanks for the great content. Bookmarked your site.
    Thanks,
    Learn Hadoop Online Training In India

  25. Brenda says:

    Call us and request a quote with no obligation; our rates are very competitive. http://all4webs.com/ovalmaniac667/insumosdeaseo.htm?48410=53394

  26. newuser says:

    how to create josn record from oracle data base?

  27. Ariph says:

    what is the best possible way to find JSON performance in between two different Databases and What could be the methodologies for performance analysis?

  28. Artur Łysik says:

    Thank you for this article. It helped me a lot 🙂

  29. Pingback: Analiza danych z Uboat.net – część 2 – przygotowanie danych – Fifty men, one head

  30. Amit Kumar says:

    Hi Pranab,

    I am dealing with a json file in which each record has class id and each class id have different structure. Could you please suggest an approach for analyzing such files.

    Thanks,
    Amit
