Data Loader for NOSQL Databases


In one of my recent projects, I had to load product data from a CSV file into HBase and also index it for search purposes. I decided to separate out the loader part of the project as a stand-alone tool and make it available as open source. It is currently hosted on GitHub. It supports HBase, with Cassandra support to be added soon, and it also supports Solr indexing.

Introduction

The tool is very generic and configurable. It takes a CSV file as input and writes to HBase or Cassandra.

The CSV file could have been generated from queries on Oracle or MySQL, so the tool can also be used to migrate data from an RDBMS to a NOSQL database.
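Since the input is a plain CSV file, it can come from any RDBMS export. As an illustration, here is a minimal JDBC sketch of producing such a file from a MySQL query; the connection settings, table, and column names are assumptions made up for the example.

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CsvExport {
    public static void main(String[] args) throws Exception {
        // Illustrative connection settings; requires the MySQL JDBC driver on the classpath
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/catalog", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT seller, sku, model, title FROM products");
             PrintWriter out = new PrintWriter("products.csv")) {
            // Header row, matching the colNameInFirstRow option described below
            out.println("seller,sku,model,title");
            while (rs.next()) {
                // Naive CSV output; real data would need quoting and escaping
                out.printf("%s,%s,%s,%s%n", rs.getString(1), rs.getString(2),
                        rs.getString(3), rs.getString(4));
            }
        }
    }
}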

It also takes a JSON file, which defines the mapping between the columns in the CSV and the NOSQL column family and column, along with other metadata.

Here is a quick summary of the features. The terminology I am using is based on HBase.

  • Loads data from a CSV file.
  • The mapping between CSV columns and NOSQL column family and column is provided in a JSON file.
  • Many-to-many associations between CSV columns and NOSQL column family and column pairs are supported.
  • The NOSQL row key can be created by concatenating multiple CSV columns.
  • Solr indexing of the data as it is being loaded.

A CSV column could be split into multiple parts and used to populate multiple NOSQL columns. On the flip side, multiple CSV columns could be consolidated to populate one NOSQL column. Splitting a column value is one of the jobs of the user-supplied pre-load handler described below; a sketch follows.
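Here is a minimal sketch of what such a pre-load handler might look like. The tool's actual handler interface is not shown in this post, so the class shape and method names below are purely illustrative assumptions.

import java.util.List;

public class ProductPreLoadHandler {

    // Receives one parsed CSV row and may rewrite it before it is loaded
    public List<String> handle(List<String> row) {
        // Hypothetical split: a combined "category/subCategory" value
        // becomes two separate data items
        String[] parts = row.get(4).split("/", 2);
        row.set(4, parts[0]);
        if (parts.length > 1) {
            row.add(parts[1]);
        }
        // Hypothetical validation: reject rows with an empty SKU
        if (row.get(1).trim().isEmpty()) {
            throw new IllegalArgumentException("missing sku");
        }
        return row;
    }
}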

Configuration Dissection

Here is a snippet of a sample configuration. There is some global configuration, such as the NOSQL table name and the row key definition. The global section is followed by multiple sections, one for each data item extracted from a CSV row.

{
	"name" : "product catalog",
	"database" : "hbase",
	"table" : "products",
	"colNameInFirstRow" : true,
	"rowKey" : ["seller", "category", "sku"],
	"batchSize" : 5,
	"preLoadHandler" : "org.chargeur.handler.ProductPreLoadHandler",
	"indexManagerClass" : "org.chargeur.core.SolrIndexManager",
	"indexServerUrl" : "http://localhost:8983/solr",
	"columnMappers" :
	[
		{
			"name" : "sku",
			"ordinal" : 1,
			"dataType" : "string",
			"colFamily" : "main",
			"col" : "sku",
			"useAsRowKey" : true,
			"maxSize" : 20,
			"sideData" : false,
			"indexed" : true
		},
		{
			"name" : "seller",
			"ordinal" : 0,
			"dataType" : "string",
			"colFamily" : "main",
			"col" : "seller",
			"useAsRowKey" : true,
			"maxSize" : 30,
			"sideData" : true,
			"defaultValue" : "anySeller",
			"indexed" : true
		},
		{
			"name" : "model",
			"ordinal" : 3,
			"dataType" : "string",
			"colFamily" : "main",
			"col" : "model",
			"useAsRowKey"  : false,
			"sideData" : false
		},
		{
			"name" : "title",
			"ordinal" : 4,
			"dataType" : "string",
			"colFamily" : "main",
			"col" : "title",
			"useAsRowKey" : false,
			"sideData" : false,
			"indexed" : true
		},
		{
			"name" : "description",
			"ordinal" : 5,
			"dataType" : "string",
			"colFamily" : "main",
			"col" : "description",
			"useAsRowKey" : false,
			"sideData" : false,
			"indexed" : true
		},
		{
			"name" : "category",
			"ordinal" : 4,
			"dataType" : "string",
			"colFamily" : "main",
			"col" : "category",
			"useAsRowKey" : true,
			"maxSize" : 30,
			"sideData" : false,
			"defaultValue" : "generic",
			"indexed" : true
		},
		{
			"name" : "price",
			"ordinal" : 17,
			"colFamily" : "main",
			"col" : "price"
		},
		{
			"name" : "details1",
			"ordinal" : 6,
			"colFamily" : "other",
			"col" : "details",
			"manyToOne" : true,
			"indexed" : true
		},
		{
			"name" : "details2",
			"ordinal" : 7,
			"colFamily" : "other",
			"col" : "details",
 			"manyToOne" : true,
			"indexed" : true
		},
		........
	]
}

The global configuration parameters are as follows.

  • database: hbase or cassandra
  • table: table name
  • colNameInFirstRow: true if the first row of the CSV contains column names
  • rowKey: data items that make up the row key
  • batchSize: number of rows to load before flushing
  • preLoadHandler: user-supplied pre-load handler Java class, which can be used for splitting a column value into multiple parts and for data validation
  • indexManagerClass: Java class for indexing
  • indexServerUrl: URL of the index server

The configuration for each data item is as follows.

  • name: data item name
  • ordinal: CSV column ordinal
  • dataType: data type (string, int, etc.)
  • colFamily: NOSQL table column family
  • col: NOSQL table column
  • useAsRowKey: true if the item is part of the NOSQL row key
  • maxSize: maximum byte array size, needed if useAsRowKey=true and dataType=string
  • sideData: true if this data comes in as a hash map in the API call
  • defaultValue: default value to use if the CSV column value is missing
  • indexed: true if Solr indexing is required

Row Key

A composite row key is created from the data items listed in the global rowKey parameter. In HBase, rows are ordered and range partitioned, so for the sort order to work, a fixed-length byte array is created for the key.

For each string data item that is part of the row key, a byte array of length maxSize is created and padded as needed with zero-value bytes. These component byte arrays are concatenated to create the row key. For numeric data items that are part of the row key this is not necessary, because they have a natural fixed size.
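As an illustration, here is a minimal sketch of the padding-and-concatenation logic just described. The class and method names are mine, not the tool's actual API.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class RowKeyBuilder {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // Pads (or truncates) a string component to exactly maxSize bytes,
    // filling the tail with zero-value bytes
    public RowKeyBuilder addString(String value, int maxSize) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        byte[] fixed = new byte[maxSize];   // initialized to zero bytes
        System.arraycopy(raw, 0, fixed, 0, Math.min(raw.length, maxSize));
        buffer.write(fixed, 0, maxSize);
        return this;
    }

    // Numeric components already have a natural fixed size (4 bytes for int)
    public RowKeyBuilder addInt(int value) {
        for (int shift = 24; shift >= 0; shift -= 8) {
            buffer.write((value >>> shift) & 0xFF);
        }
        return this;
    }

    public byte[] build() {
        return buffer.toByteArray();
    }
}

With the sample configuration above, the key would be built as new RowKeyBuilder().addString(seller, 30).addString(category, 30).addString(sku, 20).build(), so every row gets a fixed 80-byte key.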

For Cassandra, where rows are hash partitioned by default, creating a fixed-size byte array for the row key is not necessary, but it does not hurt to follow the same logic either.

Solr Indexing

All attributes with the indexed flag set are indexed with Solr. It is assumed that there is a unique field called id in the Solr schema. The id value is constructed by concatenating the same attributes used for the HBase row key. The Solr field names are assumed to be the same as the attribute names in the JSON metadata file, so the Solr schema should be created with field names matching those in the metadata file.
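Here is a minimal SolrJ sketch of indexing one row, assuming a products core and the field names from the metadata file above; the document values, id delimiter, and core name are illustrative assumptions.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexExample {
    public static void main(String[] args) throws Exception {
        // Core name "products" is an assumption for this example
        SolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/products").build();

        SolrInputDocument doc = new SolrInputDocument();
        // id concatenates the same attributes as the HBase row key:
        // seller, category, sku
        doc.addField("id", "anySeller:generic:SKU123");
        doc.addField("seller", "anySeller");
        doc.addField("category", "generic");
        doc.addField("sku", "SKU123");
        doc.addField("title", "some product title");
        doc.addField("description", "some product description");

        client.add(doc);
        client.commit();
        client.close();
    }
}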

To minimize duplication between NOSQL and Solr, indexed fields generally should not also be stored in Solr; Solr should store only a minimal subset of the NOSQL columns. Once a document is fetched from Solr, the NOSQL row key can be reconstructed from the id of the document and used to fetch any other fields of interest from NOSQL.
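Here is a minimal sketch of that read path, reusing the illustrative RowKeyBuilder from the previous section; the id delimiter and component sizes follow the sample configuration and are assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class FetchBySolrId {
    public static Result fetch(String solrDocId) throws IOException {
        // Assumed id format: "seller:category:sku"
        String[] parts = solrDocId.split(":");
        byte[] rowKey = new RowKeyBuilder()
            .addString(parts[0], 30)   // seller
            .addString(parts[1], 30)   // category
            .addString(parts[2], 20)   // sku
            .build();
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("products"))) {
            return table.get(new Get(rowKey));
        }
    }
}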

HBase and Cassandra

I have used the terminology from HBase. Here is the terminology mapping between HBase and Cassandra.

  • HBase table = Cassandra column family (like an RDBMS table)
  • HBase column family = Cassandra super column (a group of columns)
  • HBase column = Cassandra column (has a name and a value)

API

The API consists of a load() method in the Loader class, shown below.

load(String csvFile, String mappingFile, Map sideData)

Here is the usage:

import java.util.HashMap;
import java.util.Map;

Loader loader = new Loader();
// side data supplies values that are not in the CSV, e.g. the seller
Map<String, String> sideData = new HashMap<String, String>();
sideData.put("seller", "testSeller");
loader.load(csvFile, mappingFile, sideData);

Why Not Hadoop?

You may ask why not use Hadoop and TableOutputFormat to write to HBase. We could, but that approach works only for HBase. For Cassandra you could use Brisk, which is essentially MapReduce with Cassandra for input and output. Unless you are importing a massive amount of data, this simple approach might be enough.
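For comparison, here is a minimal sketch of that Hadoop alternative: a map-only job writing Puts to HBase through TableOutputFormat. The column layout and class names are illustrative, and a real job would mirror the JSON mapping instead of hard-coding columns.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CsvToHBaseJob {

    public static class CsvMapper
            extends Mapper<LongWritable, Text, NullWritable, Put> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws java.io.IOException, InterruptedException {
            String[] cols = line.toString().split(",");
            // Row key and column layout are hard-coded for the sketch
            Put put = new Put(Bytes.toBytes(cols[0]));
            put.addColumn(Bytes.toBytes("main"), Bytes.toBytes("sku"),
                    Bytes.toBytes(cols[1]));
            ctx.write(NullWritable.get(), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableOutputFormat.OUTPUT_TABLE, "products");
        Job job = Job.getInstance(conf, "csv to hbase");
        job.setJarByClass(CsvToHBaseJob.class);
        job.setMapperClass(CsvMapper.class);
        job.setNumReduceTasks(0);   // map-only load
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}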

Wrapping Up

The source code for this tool is available on GitHub. More on HBase can be found on the HBase project site; for Cassandra, the Cassandra project site is the place to go.
