Hive is an abstraction over Hadoop MapReduce. It provides a SQL-like interface for querying HDFS data, which accounts for most of its popularity. In Hive, structured data in HDFS is exposed as a table, much as in an RDBMS. The DDL for creating a table in Hive looks very similar to table-creation DDL in an RDBMS.
In one of my recent projects, I needed to store and query JSON-formatted hierarchical data. Hive works well with flat, record-structured data, so I wanted to find out how it handles hierarchically structured data. As it turns out, Hive handles hierarchical JSON data quite well.
Quick Review of Hive
The Hive wiki is a very good resource for learning Hive. Here is a quick review. When you run a Hive query, roughly the following happens.
- The query planner analyzes the query and converts it to a DAG (Directed Acyclic Graph) of Hadoop MapReduce jobs
- The jobs get submitted to the Hadoop cluster in the order implied by the DAG
- For a simple query, only mappers run. The input output format is responsible for managing an input split and reading the data off HDFS. Next, the data flows into a layer called the SerDe (Serializer Deserializer). On the read path, the deserializer part of the SerDe converts the data from a byte stream to a structured format.
- For aggregate queries, the MapReduce jobs also include reducers. In this case, the serializer part of the SerDe converts structured data to a byte stream, which gets handed over to the input output format, which writes it to HDFS.
A typical Hive query involves one or more MapReduce jobs and a full scan of the data, unless the table has partitions.
As a result, a Hive query may be a high-latency operation, depending on the amount of data and the Hadoop cluster configuration.
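Partitioning is the standard way to avoid those full scans. As a sketch (the table and partition names here are illustrative, not part of the example that follows), a table partitioned by date lets Hive prune everything outside the requested partition:

```sql
-- Illustrative sketch: partition pruning avoids a full scan.
-- Table, column, and path names are hypothetical.
CREATE EXTERNAL TABLE comments_by_day ( value STRING )
PARTITIONED BY (dt STRING)
LOCATION '/Users/pghosh/test/comments_by_day';

-- Only the files under the dt='2012-04-01' partition directory are read
SELECT count(*) FROM comments_by_day WHERE dt = '2012-04-01';
```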
Blog Comment Example
To explore Hive's capability to handle JSON data, I am using blog comments as an example. A blog comment has the following fields.
- blogID
- date
- commenter name
- comment
- commenter email
- commenter web site
Here are some typical JSON records for blog comments. The data has two levels of nesting.
{ "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" } }
{ "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment",}
You may have noticed that not all comment records have the commenter's contact info. As we will see later, when we query, Hive returns NULL for any missing field in any record.
Create Table
My table is very simple, with only one column, which holds the JSON data. You can think of the data as being completely denormalized. I am using an external table, which means I am keeping my table data outside Hive's control; I am simply telling Hive where to look for the data. Here is the DDL for the table.
CREATE EXTERNAL TABLE comments ( value STRING ) LOCATION '/Users/pghosh/test/comments';
Next, I loaded my data from a file into the comments table as below.
LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;
The keyword OVERWRITE causes existing data for the table to be wiped out before the load. Now that the data is loaded, we are ready for some queries.
Time for Query
My first query finds the emails of all commenters on a given blog. You might want such a query for the following scenario: we just published a new blog post that happens to be very similar to a past one. We want to find the email of everyone who commented on the old post, so that we can send them an email with a link to the new one. It's very likely that they will be interested in it. Here is the query.
SELECT b.blogID, c.email
FROM comments a
LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b AS blogID, contact
LATERAL VIEW json_tuple(b.contact, 'email', 'website') c AS email, website
WHERE b.blogID = '64FY4D0B28';
In this query, Hive's JSON support has kicked in. LATERAL VIEW and json_tuple give us a way of navigating the JSON tree. The first instance gives us a virtual table with two columns, blogID and contact. We repeat the same process to extract data from the next level of the JSON tree; this time it gives us another virtual table with the columns email and website.
The function json_tuple explodes a JSON node and returns the child node values. The first argument is the node to explode; the rest of the arguments are the child node names.
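If you prefer not to chain LATERAL VIEWs, Hive's built-in get_json_object offers another way to reach nested fields: it takes a JSONPath-style expression and can also index into arrays. A sketch equivalent to the query above:

```sql
-- Sketch: the same lookup using get_json_object with JSONPath-style paths
SELECT get_json_object(a.value, '$.blogID')        AS blogID,
       get_json_object(a.value, '$.contact.email') AS email
FROM comments a
WHERE get_json_object(a.value, '$.blogID') = '64FY4D0B28';
```

Note that get_json_object parses the JSON string once per call, so json_tuple tends to be the more efficient choice when you need several fields from the same node.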
Here is the query result. As expected, this query triggers one MapReduce job that does not include any reducers; the Hadoop job contains only map tasks.
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201205052116_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0001
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0001
2012-05-05 21:42:08,639 Stage-1 map = 0%, reduce = 0%
2012-05-05 21:42:15,639 Stage-1 map = 100%, reduce = 0%
2012-05-05 21:42:17,900 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201205052116_0001
OK
64FY4D0B28	NULL
64FY4D0B28	NULL
64FY4D0B28	NULL
64FY4D0B28	lyau@hotmail.com
64FY4D0B28	NULL
64FY4D0B28	gyocok@gmail.com
64FY4D0B28	shrmf@gmail.com
Time taken: 21.459 seconds
Our next query is an aggregate query: we are interested in the number of comments for each blog. This is a GROUP BY query, as below. Here we navigate only one level of the JSON tree.
SELECT b.blogID, count(b.name)
FROM comments a
LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b AS blogID, name
GROUP BY b.blogID;
Here is the result of the query (partial output only). The interesting thing to note is that this query maps to one Hadoop job, and the job includes reducers.
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201205052116_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205052116_0002
Kill Command = /Users/pghosh/Tools/hadoop-0.20.2-cdh3u1/bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201205052116_0002
2012-05-05 21:48:06,878 Stage-1 map = 0%, reduce = 0%
2012-05-05 21:48:23,088 Stage-1 map = 4%, reduce = 0%
2012-05-05 21:48:25,097 Stage-1 map = 100%, reduce = 0%
2012-05-05 21:48:39,214 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201205052116_0002
OK
002F01U5F1	8
00PYU30L6R	13
012A2RDU7V	8
01D31LI365	10
01UBOW8E31	5
0223774V51	6
023WWOWS60	16
02E09F81RD	5
02MNORRJNP	3
02RCP58IIG	7
030GFNQN0K	7
033TWOLGY3	7
03C2WCN4UA	6
03M4N0WFB1	7
04197CKG0J	11
042JJOY1X1	7
04HAB1U275	2
073VPCIE4H	5
07NDU9S4B1	10
..................
Another Way
Another way to handle JSON is to use a JSON SerDe. The external view of any Hive-managed data is always column- and row-oriented. However, the internal on-disk representation of the data could be anything.
A Hive SerDe is the bridge between the internal representation and the external column- and record-oriented view. In this case it does all the necessary mapping between JSON fields and columns. You have to add the following to your table DDL. With this approach, your table will have multiple columns.
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
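For illustration, a full DDL using this SerDe might look like the sketch below. The column list is an assumption based on the comment fields described earlier, and whether nested objects map to STRUCT columns depends on the particular SerDe implementation, so treat this as a template rather than a tested statement:

```sql
-- Hedged sketch: column names are assumed from the blog-comment example;
-- the exact SerDe class and its nested-type support depend on the jar you use
CREATE EXTERNAL TABLE comments_typed (
  blogID  STRING,
  `date`  STRING,
  name    STRING,
  comment STRING,
  contact STRUCT<email:STRING, website:STRING>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/Users/pghosh/test/comments';
```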
Summing Up
Hopefully, this post has given you some idea of what Hive can do with JSON data. An interesting question that comes up is whether we still need MongoDB. MongoDB is a popular NoSQL document-oriented database with very powerful indexing and querying capabilities.
If you are mostly running aggregate queries that touch most of the data, as in a data warehouse application, Hive is a better option than MongoDB. However, if your use cases include queries touching only a small fraction of the data, you may be better off staying with MongoDB or an equivalent document-oriented database.
An update in Hive is an expensive operation, because it involves a full table overwrite. So for data that is frequently updated, MongoDB may be a better choice.
For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.
Your example saved me, thank you.
I am glad it helped
Good Work! Much helpful…
This blog is very helpful, and I have a question.
It seems that json_tuple() does not support arrays, is that right?
Could you give an example getting email and website info where you have json like this:
{ "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contacts" : [{ "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" }, { "email" : "vpxnksu@yahoo.com", "website" : "vpxnksu.tumblr.com" } ] }
It’s synthetic data generated by a script
Right. Can you give a sample hive query that would return the email and website from the json I provided? I’m not sure how to get them from a nested structure.
When I run the first query I am getting nothing, no data. Please find the information below.
hive> SELECT b.blogID, c.website FROM sample a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
    > AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
    > AS email, website WHERE b.blogID='FJY26J1333';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201305131006_0022, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201305131006_0022
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201305131006_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-13 15:10:20,754 Stage-1 map = 0%, reduce = 0%
2013-05-13 15:10:26,781 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
2013-05-13 15:10:27,792 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
2013-05-13 15:10:28,817 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
2013-05-13 15:10:29,823 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
2013-05-13 15:10:30,838 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
2013-05-13 15:10:31,844 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
2013-05-13 15:10:32,859 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.69 sec
MapReduce Total cumulative CPU time: 690 msec
Ended Job = job_201305131006_0022
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.69 sec HDFS Read: 498 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 690 msec
OK
Time taken: 28.19 seconds
hive>
I’m having the same issue, can you please provide the hive query for this scenario.
Please let me know if you found the hive query for this scenario. I’m having the same issue.
It helped me, thanks!
Hi pranab,
This blog is very helpful for us.Thanks for your blog.
I have installed Hadoop and Hive on Linux. I am getting an error message while creating a table using Hive. I am using the JSON SerDe jar, which is copied into the $HIVE_HOME/lib dir. I have added the jar to the classpath as well. Please help me with this issue.
Please find the error below.
hive> create table air(
> airline string,
> airlineid string,
> sourceairport string,
> sourceairportid string,
> destinationairport string,
> destinationairportid string,
> codeshare string,
> stop string,
> equipment String)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
> location '/tmp/air_routes_result/part-00000';
FAILED: Error in metadata: java.lang.NullPointerException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Please help with the above issue. I am stuck here and can't find the cause.
Hi Nagesh, did you find the solution? I am stuck in the very same situation; I have to create an external table.
Nagesh
I have not used JsonSerde. Try its home on Google Code. There might even be a Google group for it.
Hello, is it possible to use this method on a nested JSON structure?
(Referring to the sample from above) for example: { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contacts" : [{ "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" }, { "email" : "vpxnksu@yahoo.com", "website" : "vpxnksu.tumblr.com" } ] }
Thanks for your help!
This solution is all about nested JSON structure.
The contacts are in an array in this case. I think they are asking how to deal with a json array.
Hi Romit,
Did you find a solution for this? If so, can you please let me know the query to see results for blogID and email.
Hello Pranab, thanks for your article. I have one question for you: I'm trying to import this JSON into a Hive table with your method:
JSON
{
  "id":"3434",
  "hashtags": [
    {"text":"element1","indices":[1,2,3]},
    {"text":"element2","indices":[1,2,3]}
  ]
}
HiveTable
id int
hashtags array
I would simply like to import "element1","element2" into the hashtags ARRAY column.
How can I do it? It seems it is not possible to import data directly into an array…
Thank you
Hi Hadokee.
I had same use case ,can you please let me know how to write query if i want to select id and text from the table
json_tuple seems not to support arrays…
Thanks Pranab. Nice explanation with good deal of details.
Thanks for the post Pranab. It helped me a lot. Just one quick question, in your example you mentioned about inserting of data in to the external table created. Can you give an example of how to work with json data, if partitions are involved. My data is partitioned as below year=2014,month=01,day=01
year=2014,month=01,day=02 & so on.
Thanks for your time.
Dheeraj, I am not sure if this is the best way. You could store the data in a non-partitioned table first. Then you can do an INSERT OVERWRITE to read data from the first table and write to the partitioned table. You should use dynamic partitioning. You can extract the partition values using json_tuple etc. in the query against the first table.
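A sketch of that two-step flow, with illustrative table and column names, assuming the partitioned target table already exists:

```sql
-- Enable dynamic partitioning first
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Read from the flat staging table and derive partition values from the JSON date;
-- comments_part and its columns are hypothetical
INSERT OVERWRITE TABLE comments_part PARTITION (year, month, day)
SELECT b.blogID, b.name,
       substr(b.dt, 1, 4) AS year,
       substr(b.dt, 6, 2) AS month,
       substr(b.dt, 9, 2) AS day
FROM comments a
LATERAL VIEW json_tuple(a.value, 'blogID', 'date', 'name') b AS blogID, dt, name;
```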
Do not format the JSON using formatting tools. Keep each JSON record on a single line, otherwise the query does not return any data.
Hi, I am new to Spark. I want to read a JSON file and insert the fields from it into HBase using Spark.
Thanks in advance
I am facing one problem with your example; I would really appreciate it if you could help resolve it. I loaded the data below, but the SQL result did not show any column value when the "contact" tag is missing.
{ "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu", "comment" : "good stuff", "contact" : { "email" : "vpxnksu@gmail.com", "website" : "vpxnksu.wordpress.com" } }
{ "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx", "comment" : "another comment",}
If you execute your SQL, it does not return any column value for blogID VSAUMDFGSD.
hive> SELECT b.blogID, c.email FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b
    > AS blogID, contact LATERAL VIEW json_tuple(b.contact, 'email', 'website') c
    > AS email, website WHERE b.blogID='VSAUMDFGSD';
OK
Time taken: 0.12 seconds
How can we get a result something like this?
VSAUMDFGSD NULL
I am just thinking: when you say "create external table comments" with some location, why are you then loading data into the table with "LOAD DATA LOCAL INPATH '/Users/pghosh/Workspace/comment.txt' OVERWRITE INTO TABLE comments;"?
My understanding is that if you create an external table in hive and define the location of the data, then you don't need to load any data; otherwise it goes against the terminology of an external table in hive.
Please correct me if I am wrong about this.
With external table you are telling hive to refer to some directory for data. You are free to do whatever you want in that directory outside hive
Thanks Pranab for the reply.
My original question remains unanswered: if we have given the location directory for the data, then hive can look for data in that directory and there is no need to load data into hive for the same table.
Hi Pranab,
Nice Blog.
I have a query about analysing Flume data using Hive.
I am using the Cloudera.quickstart VM
I have configured Flume and the required Flume data is loaded into HDFS.
The required hive-serdes-1.0-SNAPSHOT.jar was downloaded and copied to the Hive lib directory, and the jar is added while creating the table:
> Add jar /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
Jar Added for the class path
After that I created an external table in hive:
> create external table load_tweets(id BIGINT,text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/cloudera/flume/tweets';
Table is created
> select * from load_tweets;
Console error :
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.StringReader@4c445c13; line: 1, column: 2]
Please Help me on this,
Thanks,
Syam.
Hi
Good work Pranab! Most helpful.
How to create a JSON record from an Oracle database?
What is the best possible way to compare JSON performance between two different databases, and what could be the methodologies for the performance analysis?
Thank you for this article. It helped me a lot 🙂
Hi Pranab,
I am dealing with a JSON file in which each record has a class id, and each class id has a different structure. Could you please suggest an approach for analyzing such files?
Thanks,
Amit