Hive Plays Well with JSON


Hive is an abstraction on Hadoop Map Reduce. It provides a SQL-like interface for querying HDFS data, which accounts for most of its popularity. In Hive, structured data in HDFS is encapsulated as a table, much as in an RDBMS. The DDL for creating a table in Hive looks very similar to table creation DDL in an RDBMS.

In one of my recent projects, I needed to store and query JSON formatted hierarchical data. Hive works well with flat, record-structured data, so I wanted to find out how it handles hierarchically structured data. It turns out that Hive works well with hierarchical JSON formatted data too.

Quick Review of Hive

The Hive wiki is a very good resource for learning Hive. Here is a quick review of what happens, roughly, when you run a Hive query:

  1. The Query Planner analyzes the query and converts it into a DAG (Directed Acyclic Graph) of Hadoop Map Reduce jobs.
  2. The jobs get submitted to the Hadoop cluster in the order implied by the DAG.
  3. For a simple query, only mappers run. The input/output format is responsible for managing an input split and reading the data off HDFS. Next, the data flows into a layer called the SerDe (Serializer Deserializer); here the deserializer part of the SerDe converts the byte stream into a structured format.
  4. For aggregate queries, the Map Reduce jobs also include reducers. In this case, the serializer part of the SerDe converts structured data into a byte stream, which gets handed over to the input/output format to be written to HDFS.

A typical Hive query will involve one or more Map Reduce jobs and a full scan of the data, unless the table has partitions.

As a result, a Hive query may be a high-latency operation, depending on the amount of data and the Hadoop cluster configuration.

Blog Comment Example

To explore Hive's capability in handling JSON data, I am using blog comments as an example. A blog comment has the following fields:

  1. blogID
  2. date
  3. commenter name
  4. comment
  5. commenter email
  6. commenter web site

Here are some typical JSON records for a blog comment. The data has two levels of nesting.

{ "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" } }
{ "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment" }

You may have noticed that not all comment records have the commenter's contact info. As we will see later, when we query, Hive returns NULL for any missing field in any record.

Create Table

My table is very simple, with only one column that holds the JSON data. You can think of the data as completely denormalized. I am using an external table, which means I am keeping the table data outside Hive's control; I am simply telling Hive where to look for the data. Here is the DDL for the table:

CREATE EXTERNAL TABLE comments
(
value STRING
)
LOCATION '/Users/pghosh/test/comments';

Next, I loaded my data from a file into the comments table as below:

LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;

The keyword OVERWRITE causes any existing data for the table to be wiped out before the load. Now that the data is loaded, we are ready for some queries.
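
A quick way to verify the load is to pull back a couple of raw rows; since the table has a single string column, each row comes back as one complete JSON document:

SELECT * FROM comments LIMIT 2;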

Time for Query

My first query finds the emails of all commenters on a given blog. You might want to run such a query in the following scenario: we have just published a new blog, which happens to be very similar to a past blog. We want to find the emails of all commenters on the old blog, so that we can send them an email with a link to the new blog; it's very likely that they will be interested in it. Here is the query.

SELECT b.blogID, c.email FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b 
AS blogID, contact  LATERAL VIEW json_tuple(b.contact, 'email', 'website') c 
AS email, website WHERE b.blogID='64FY4D0B28';

In this query, the JSON support in Hive kicks in. LATERAL VIEW and json_tuple give us a way of navigating the JSON tree. The first instance gives us a virtual table with two columns, blogID and contact. We repeat the same process to extract data from the next level of the JSON tree; this time it gives us another virtual table with the columns email and website.

The function json_tuple explodes a JSON node and returns the child node values. The first argument is the node to explode; the rest of the arguments are the child node names.

Here is the query result. As expected, this query triggers one Map Reduce job that does not include any reducer; the Hadoop job contains only a map task.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201205052116_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0001
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0001
2012-05-05 21:42:08,639 Stage-1 map = 0%,  reduce = 0%
2012-05-05 21:42:15,639 Stage-1 map = 100%,  reduce = 0%
2012-05-05 21:42:17,900 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201205052116_0001
OK
64FY4D0B28	NULL
64FY4D0B28	NULL
64FY4D0B28	NULL
64FY4D0B28	lyau@hotmail.com
64FY4D0B28	NULL
64FY4D0B28	gyocok@gmail.com
64FY4D0B28	shrmf@gmail.com
Time taken: 21.459 seconds
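
Incidentally, the same lookup can also be written without LATERAL VIEW, using Hive's built-in get_json_object function, which takes the JSON string and a JSON path expression. It is more verbose when several fields are needed, since the document is parsed once per call, but it can be handy for a quick one-off query. A rough equivalent of the query above would look something like this:

SELECT get_json_object(a.value, '$.blogID') AS blogID,
       get_json_object(a.value, '$.contact.email') AS email
FROM comments a
WHERE get_json_object(a.value, '$.blogID') = '64FY4D0B28';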

Our next query is an aggregate query. We are interested in the number of comments for each blog. This is a group by query, as below. Here we navigate only one level of the JSON tree.

SELECT b.blogID, count(b.name) FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b 
AS blogID, name GROUP BY b.blogID;

Here is the result of the query; I am only showing a partial result. The interesting thing to note is that this query maps to one Hadoop job, and the job includes reducers.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201205052116_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0002
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0002
2012-05-05 21:48:06,878 Stage-1 map = 0%,  reduce = 0%
2012-05-05 21:48:23,088 Stage-1 map = 4%,  reduce = 0%
2012-05-05 21:48:25,097 Stage-1 map = 100%,  reduce = 0%
2012-05-05 21:48:39,214 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201205052116_0002
OK
002F01U5F1	8
00PYU30L6R	13
012A2RDU7V	8
01D31LI365	10
01UBOW8E31	5
0223774V51	6
023WWOWS60	16
02E09F81RD	5
02MNORRJNP	3
02RCP58IIG	7
030GFNQN0K	7
033TWOLGY3	7
03C2WCN4UA	6
03M4N0WFB1	7
04197CKG0J	11
042JJOY1X1	7
04HAB1U275	2
073VPCIE4H	5
07NDU9S4B1	10
..................
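
The json_tuple pattern extends naturally to grouping on more than one extracted field. For example, a sketch of counting comments per blog per day might look like this (the dt alias is arbitrary; it simply avoids the reserved word date):

SELECT b.blogID, b.dt, count(b.name)
FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'date', 'name') b
AS blogID, dt, name
GROUP BY b.blogID, b.dt;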

Another Way

Another way to handle JSON is to use a JSON SerDe. The external view of any Hive-encapsulated data is always row and column oriented. However, the internal on-disk representation of the data could be anything.

A Hive SerDe is the bridge between the internal representation and the external column- and record-oriented view. In this case it does all the necessary mapping between JSON fields and columns. You have to add the following to your table DDL; with this approach, your table will have multiple columns.

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
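
As an illustrative sketch, a table for the same comment data with the SerDe doing the field mapping might look like the DDL below. This assumes the SerDe jar has been added to the session with ADD JAR and that the SerDe in use supports nested structs (many community JSON SerDes do); the class name here is the one mentioned above, so substitute whatever class matches the jar you actually use.

-- the jar path below is illustrative; point ADD JAR at your local SerDe jar
ADD JAR /path/to/json-serde.jar;

CREATE EXTERNAL TABLE comments_typed
(
blogID STRING,
`date` STRING,
name STRING,
`comment` STRING,
contact STRUCT<email:STRING, website:STRING>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/Users/pghosh/test/comments';

With the mapping handled by the SerDe, the earlier email lookup reduces to plain column access:

SELECT blogID, contact.email FROM comments_typed WHERE blogID = '64FY4D0B28';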

Summing Up

Hopefully, this post has given you some idea of what Hive can do with JSON data. An interesting question that comes up is whether we still need MongoDB. MongoDB is a popular NOSQL document-oriented database with very powerful indexing and querying capabilities.

If you are doing mostly aggregate queries that touch most of the data, as in a data warehouse application, Hive is a better option than MongoDB. However, if your use cases involve queries touching only a small fraction of the data, you may be better off staying with MongoDB or another equivalent document-oriented database.

An update in Hive is an expensive operation, because it involves a full table overwrite. So for data that is frequently updated, MongoDB may be a better choice.

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment in the cloud, including installation, configuration and testing.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms, in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green, sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming Languages. I am fascinated by problems that don't have a neat closed-form solution.

46 Responses to Hive Plays Well with JSON

  1. Pingback: Hive (computing): What is the best Hive JSON SerDe? - Quora

  2. Pingback: Hive Plays Well with JSON | BigDataCloud.com

  3. Alicedelic says:

    Your example saved me, thank you.

  4. Manoj says:

    Good Work! Much helpful…

  5. bourne says:

    This blog is very helpful, and I have a question.
    It seems that json_tuple() not support array, is it right?

  6. Asher says:

    Could you give an example getting email and website info where you have json like this:
    { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contacts" : [{ "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" }, { "email" : "vpxnksu@yahoo.com", "website" : "vpxnksu.tumblr.com" } ] }

    • Pranab says:

      It’s synthetic data generated by a script

      • Asher says:

        Right. Can you give a sample hive query that would return the email and website from the json I provided? I’m not sure how to get them from a nested structure.

      • Nagesh says:

        when i run first query i am getting nothing.. no data.. please find the below information.

        hive> SELECT b.blogID, c.website FROM sample a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
        > AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
        > AS email, website WHERE b.blogID='FJY26J1333';
        Total MapReduce jobs = 1
        Launching Job 1 out of 1
        Number of reduce tasks is set to 0 since there’s no reduce operator
        Starting Job = job_201305131006_0022, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201305131006_0022
        Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201305131006_0022
        Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
        2013-05-13 15:10:20,754 Stage-1 map = 0%, reduce = 0%
        2013-05-13 15:10:26,781 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:27,792 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:28,817 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:29,823 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:30,838 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:31,844 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
        2013-05-13 15:10:32,859 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.69 sec
        MapReduce Total cumulative CPU time: 690 msec
        Ended Job = job_201305131006_0022
        MapReduce Jobs Launched:
        Job 0: Map: 1 Cumulative CPU: 0.69 sec HDFS Read: 498 HDFS Write: 0 SUCCESS
        Total MapReduce CPU Time Spent: 690 msec
        OK
        Time taken: 28.19 seconds
        hive>

      • Rishav says:

        I’m having the same issue, can you please provide the hive query for this scenario.

    • Rishav says:

      Please let me know if you found the hive query for this scenario. I’m having the same issue.

  7. Fero Kocun says:

    It helped me, thanks!

  8. Nagesh says:

    Hi pranab,
    This blog is very helpful for us.Thanks for your blog.
    I have installed Hadoop and Hive in linux. I am getting an error message while creating table using hive. I am using Json serde jar which is copied in $HIVE_HOME/lib dir. I have added jar into class path also.Please help me on this issue.
    Please find the error below.
    hive> create table air(
    > airline string,
    > airlineid string,
    > sourceairport string,
    > sourceairportid string,
    > destinationairport string,
    > destinationairportid string,
    > codeshare string,
    > stop string,
    > equipment String)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
    > location '/tmp/air_routes_result/part-00000';
    FAILED: Error in metadata: java.lang.NullPointerException
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

  9. Pingback: 如何在Hive中使用Json格式数据 | 四号程序员

  10. Pingback: Live Streaming Κάλυψη Συνεδρείων

  11. Romit says:

    Hello, is it possible to use this method on nested json structure?
    (Refering to the sample from above) for example: { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contacts" : [{ "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" }, { "email" : "vpxnksu@yahoo.com", "website" : "vpxnksu.tumblr.com" } ] }

    Thanks for your help!

  12. Hadokee says:

    Hello Pranab thanks for your Article. I have one question for you, I’m trying to import with ur method this JSON into a Hive table:

    JSON
    {
    "id":"3434",
    "hashtags": [
    {"text":"element1","indices":[1,2,3]},
    {"text":"element2","indices":[1,2,3]}
    ]
    }

    HiveTable
    id int
    hashtags array

    I would simply import "element1","element2" in the hashtag ARRAY column.
    How I can do It? It seems not possibile import data directly into an array…

    Thanks you

    • sri says:

      Hi Hadokee.
      I had same use case ,can you please let me know how to write query if i want to select id and text from the table

  13. Hadokee says:

    Json_tuple seems not support arrays…

  14. Chandra Bhatt says:

    Thanks Pranab. Nice explanation with good deal of details.

  15. dheeraj says:

    Thanks for the post Pranab. It helped me a lot. Just one quick question, in your example you mentioned about inserting of data in to the external table created. Can you give an example of how to work with json data, if partitions are involved. My data is partitioned as below year=2014,month=01,day=01
    year=2014,month=01,day=02 & so on.
    Thanks for your time.

    • Pranab says:

      Dheeraj, I am not sure if this is the best way. You could store data in a non partitioned table first. Then you can do insert overwrite to read data from the first table and write to the partitioned table, You should use dynamic partitioning. You could extract the partition values using json_tuple etc in the query to the first table

  16. Rahul Banerjee says:

    Do not Format the JSON using formatting tools…keep the JSON in a single line otherwise it does not return any data.

  17. hi i am new to spark i want to read json file and insert the fields from json file to hbase using spark
    thanks in advance

  18. Sandeep says:

    I am facing one problem with your example, really appreciate if you can help resolving this. I loaded below data. But SQL result did not show any column value if “contact” tag is missing.

    { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" } }
    { "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment",}

    If you execute your SQL, It did not return any column value for blogID VSAUMDFGSD
    hive> SELECT b.blogID, c.email FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
    > AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
    > AS email, website WHERE b.blogID='VSAUMDFGSD';
    OK
    Time taken: 0.12 seconds

    How we can get result something like this ?
    VSAUMDFGSD NULL

  19. afsar says:

    I am just thinking when you say "create external table comments" with some location then why are you loading data into table "LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;"

    My understanding is if you create any external table in hive and define the location of data then you don’t need to load any data otherwise it is going against the terminalogy of external table in hive.

    Please correct me if I am wrong with this.

    • Pranab says:

      With external table you are telling hive to refer to some directory for data. You are free to do whatever you want in that directory outside hive

      • afsar says:

        Thanks Pranab for the reply.

        My original question remains unanswered. If we have given the location directory for data then hive can look for data in that directory and no need to load data in hive for same table.

  20. mohan says:

    this blog is very helpful..thankyou for this blog..
    also see our blog for hive json..
    http://bit.ly/2c2cQsO

  21. Pingback: Big data and Football: Using Pig to Analyze Pigskin - The Blend: A West Monroe Partners Blog

  22. syam says:

    Hi Pranab,

    Nice Blog.
    I have a querry for analysing the flume data using hive.
    I am using the Cloudera.quickstart VM
    I have configure the Flume and required Flume data loaded into HDFS

    Required hive-serdes-1.0-SNAPSHOT.jar downloaded and copied to Hive Lib and while creating the table jar is added :
    > Add jar /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
    Jar Added for the class path
    After that I have Created a external table in hive :
    > create external table load_tweets(id BIGINT,text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/cloudera/flume/tweets;
    Table is created

    > select * from load_tweets;
    Console error :
    Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.StringReader@4c445c13; line: 1, column: 2]

    Please Help me on this,

    Thanks,
    Syam.

  23. Kullai says:

    Hi
    Good work Pranab ! Most helpfull

  24. Hi,
    Hi, Very informative Article.There is so much information about the Hadoop and Cybersecurity in this post. keep it up and thanks for the great content. Bookmarked your site.
    Thanks,
    Learn Hadoop Online Training In India

  25. Brenda says:

    Call us and request a quote with no obligation; our rates are very competitive. http://all4webs.com/ovalmaniac667/insumosdeaseo.htm?48410=53394

  26. newuser says:

    how to create josn record from oracle data base?

  27. Ariph says:

    what is the best possible way to find JSON performance in between two different Databases and What could be the methodologies for performance analysis?

  28. Artur Łysik says:

    Thank you for this article. It helped me a lot 🙂

  29. Pingback: Analiza danych z Uboat.net – część 2 – przygotowanie danych – Fifty men, one head

  30. Amit Kumar says:

    Hi Pranab,

    I am dealing with a json file in which each record has class id and each class id have different structure. Could you please suggest an approach for analyzing such files.

    Thanks,
    Amit
