JSON to Relational Mapping with Spark


If there one data format that’s ubiquitous, it’s JSON. Whether  you are calling an API, or exporting data from some system, the format is most likely to be JSON these days. However many databases can not handle  JSON and you want to store the data in relational format. You may want to do offline batch processing with Spark or Hadoop and and your Hadoop or Spark application may only support flat field delimited record oriented input.

Whatever the case, a tool is needed to map JSON to flat record oriented data. In this post we will go over a Spark based solution for converting JSON to flat relational data. The implementation is part of my open source project chombo on github. A Hadoop based implementation is also available in the same project.

Mapping from JSON to Relational

JSON has hierarchical data structure. A parent node may have a 1 to 1 or a 1 to many association with child nodes. The leaf node of the JSON tree contains primitive data. Internal JSON nodes are either an object or arrays of objects. We perform JSON to relational mapping in the following way.

  1. We define a set of Xpath like paths though the JSON data. Each path  starts  from the root and goes all the way to a leaf node.
  2. Each path defines a field in the target relational format. You don’t necessarily have to map all JSON fields to relational data.
  3. If there is an array in any path, we will have multiple child records. With multiple child records, we have the choice of either generating normalized or denormalized flat relational data, through a configuration setting. There could be multiple arrays encountered in a path
  4. If the output is normalized, separate records are created for child records. To provide linkage to the parent record, the child record will contain a field which is the id or unique key of the corresponding parent record. If the parent does not contain any unique id, one can be generated automatically.
  5. If the output is denormalized, only i dataset is produced and only 1 child type can be handled. If there are multiple child types, normalized output should be used.
  6. If an array of primitive data is encountered at the end of a path at the leaf, it’s serialized to a coma separated string as the field value

For each path and each JSON record, it is recursively navigated to extract one or more values. If the path contains one or more array, multiple values will be extracted.

Mobile Data Usage

We will use mobile usage data as example. The JSON data contains usage data for various users or accounts. For each account there are multiple usage data records. Here is a fragment of the input JSON

{
	"dataUsage" : [
		{
			"acctID" : "5UAD44ZM",
			"email" : "Q1831Q@yahoo.com",
			"phoneNum" : "(305)6730550",
  			"usages" : [ 
  			{
    			"plan" : "super",
    			"deviceID" : "G1ZH8B33ERCT",
    			"used" : 5627690,
    			"startTime" : 1479223694295,
    			"endTime" : 1479223727386
  			},
  			{
    			"plan" : "super",
    			"deviceID" : "G1ZH8B33ERCT",
    			"used" : 5059768,
    			"startTime" : 1478562784891,
    			"endTime" : 1478562859544
  			},
  			{
    			"plan" : "super",
    			"deviceID" : "G1ZH8B33ERCT",
    			"used" : 3693797,
    			"startTime" : 1478741647986,
    			"endTime" : 1478741719360
  			},
  			{
    			"plan" : "super",
    			"deviceID" : "G1ZH8B33ERCT",
    			"used" : 3649133,
    			"startTime" : 1478889534448,
    			"endTime" : 1478889579611
  			},
........

Here are some examples of path expression for extracting fields from JSON for this use case.

Path JSON Mapped relational
acctID account ID field of root node field in parent record, also used for referencing from child records
usages@a.used used data field  in child node used data field in child record, of which there are many
usages@a.startTime start time for used data field  in child node start time for used data field in child record, of which there are many

The symbol @a is used to indicate 1 to many association between parent and child. if the symbol is followed by an index as in @a[2], a specific child object will selected, resulting in 1 to 1 association.

Mapping Spark Job

The Spark job FlatRecordExtractorFromJson  in chombo converts JSON to flat relational data. It performs the following steps

  1. Identify a complete JSON record. In the input JSON record could be contained in one line or it could span across multiple lines of input. This information is provided through a configuration parameter.
  2. For each JSON record extracted from step 1, it creates a set of flat records for denormalized data. For normalized data, it generates flat records of various child types. There is a always a parent record. For each child type , there are multiple records, with referential links to the parent record

I have run this for normalized output. Here are some parent records. We have only 3 fields in root JSON node, the account ID, email and phone number.

5UAD44ZM,Q1831Q@yahoo.com,(305)6730550
U6J8QTH4,4WEIA9@yahoo.com,(650)2070155

Here some child records for the 2 parent records above. The first field contains the referential ID of the parent record, which is the account ID.

5UAD44ZM,super,G1ZH8B33ERCT,5627690,1479223694295,1479223727386
5UAD44ZM,super,G1ZH8B33ERCT,5059768,1478562784891,1478562859544
5UAD44ZM,super,G1ZH8B33ERCT,3693797,1478741647986,1478741719360
5UAD44ZM,super,G1ZH8B33ERCT,3649133,1478889534448,1478889579611
5UAD44ZM,super,G1ZH8B33ERCT,6182859,1478992345974,1478992428628
U6J8QTH4,super,06WCVOIN22EH,5260203,1478677808227,1478677852566
U6J8QTH4,super,06WCVOIN22EH,2699333,1479207884617,1479207914106
U6J8QTH4,super,06WCVOIN22EH,6592006,1478779260111,1478779286157
U6J8QTH4,super,06WCVOIN22EH,6031358,1478606824205,1478606886746
U6J8QTH4,super,06WCVOIN22EH,6607773,1479180818790,1479180868043
U6J8QTH4,super,06WCVOIN22EH,2228982,1478562267221,1478562327598

Once all the data is imported to a SQL database, they are available for queries or analytic processing.

Output for Normalized Data

For normalized output, we have separate data sets, 1 for parent entity type and 1 each for the child entity types. Each data set is written to a separate directory under the base output directory as specified through the command line.

For our use case, two output sub directories are created as follows. Each sub directory may have multiple output files, depending on the input data size;

  1. /etl/output/jex/root  (parent records)
  2. /etl/output/jex/usages@a  (child records)

Wrapping Up

We have a gone through JSON to relational mapping tool, using path like expressions referring to fields within JSON that need to be mapped relational format. Only selected data from JSON, as defined through the path expressions,  is converted relational format. Here is a tutorial document for running the use case.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, ETL, Spark and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s