Transforming Big Data

This is a sequel to my earlier posts on Hadoop based ETL covering validation and profiling. Considering that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant ETL functionality to my OSS project chombo on GitHub, including validation, transformation and profiling.

This post will focus on data transformation. Validation, transformation and profiling are the key activities in any ETL process. I will be using the retail sales data for a fictitious multinational retailer to showcase data transformation capabilities in chombo.

Data Transformation

Chombo provides ready-to-use transformers out of the box that can easily be configured. Each transformer is identified by a unique tag and, optionally, a set of configuration parameters. Typesafe HOCON is used for configuration. Transformers have the following characteristics.

  1. Transformers transform a field value to one or more field values, e.g., converting date time format
  2. One or more transformers can be chained for a given field. Only the last transformer in the chain can generate multiple values
  3. While transforming data, the field position can be changed in the output
  4. Generators generate completely new fields, e.g., a unique key
  5. Depending on the data type, i.e., string or numeric, there are two categories of transformers
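As an illustration, a per-field transformer chain might be configured along these lines in HOCON; the field name and parameter layout here are hypothetical, not chombo's actual configuration keys:

```
# Hypothetical sketch of per-field transformer chaining; actual
# chombo configuration keys may differ
transformers {
    # ordered chain for one field; only the last transformer
    # in the chain may emit multiple values
    productDesc {
        transformers = [trimTrans, lowerCaseTrans]
    }
}
```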

Here is a list of transformers and generators available out of the box. I expect to add more in future as needs arise.


Tag Type Data Type Comment
lowerCaseTrans transformer string converts to lower case
upperCaseTrans transformer string converts to upper case
patternBasedTrans transformer string creates multiple values using regex with groups
searchReplaceTrans transformer string does search and replace
keyValueTrans transformer string if field matches with key, replace with value
defaultValueTrans transformer string if field value is missing, replaces with default
anoynmizerTrans transformer string anonymizes field value
trimTrans transformer string trims white space from both ends
dateFormatTrans transformer string transform date format
groupTrans transformer string replaces value with group ID
forcedReplaceTrans transformer string replaces field value irrespective of content
stringCustomTrans transformer string script based custom transformer
uniqueKeyGen generator string generates unique key
dateGen generator string generates current date
constGen generator string generates a literal string
longPolynomialTrans transformer long generates second degree polynomial value
doublePolynomialTrans transformer double generates second degree polynomial value
longCustomTrans transformer long script based custom transformation
doubleCustomTrans transformer double script based custom transformation
intAddTrans transformer int adds
intSubtractTrans transformer int subtracts
intMultiplyTrans transformer int multiplies
intDivideTrans transformer int divides
doubleAddTrans transformer double adds
doubleSubtractTrans transformer double subtracts
doubleMultiplyTrans transformer double multiplies
doubleDivideTrans transformer double divides
discretizerTrans transformer numeric discretizes into buckets
epochTimeGen generator long generates epoch time
stringConcatenationTrans transformer string appends or prepends string to a field
stringFieldMergeTrans transformer string merges multiple fields
elapsedTimeTrans transformer string (date) finds elapsed time w.r.t. a configured reference date time
contextualElapsedTimeTrans transformer string (date) finds elapsed time w.r.t. a reference date time in another field
timeCycleShiftTrans transformer string (date) moves date time by cycle length, e.g. month, until a milestone date
contextualTimeCycleShiftTrans transformer string (date) moves date time by cycle length, with cycle and milestone date from other fields
stringWithinFieldDelimTrans transformer string replaces field delimiter to collapse adjacent columns into one
stringBinaryTrans transformer string (numeric) converts to binary value based on supplied predicate
categoricalBinaryTrans transformer string (categorical) replaces field with multiple binary fields
dateComponentTrans transformer string (date) extracts one or more components from date
binaryArithOperatorGen generator numeric binary arithmetic operation between 2 columns
intRoundOffTrans transformer numeric rounds off floating point to integer
floatRoundOffTrans transformer numeric rounds off floating point to configured precision

Most of the transformers and generators require configuration. For example, patternBasedTrans requires the regex and the number of groups in the regex as configuration parameters.
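For instance, patternBasedTrans might be configured along these lines; the parameter names below are illustrative only, not chombo's actual ones:

```
# Hypothetical configuration sketch for patternBasedTrans;
# parameter names are illustrative
patternBasedTrans {
    # regex with groups; each group becomes an output field value
    regex = "(\\w+)-(\\d+)"
    numGroups = 2
}
```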

Retail Sales Data

As mentioned earlier, the data is retail sales data for a multinational retailer. The retailer receives data from various countries, which needs to be processed to handle the idiosyncrasies of different countries and converted to a canonical format.

In our use case, the data from various source countries gets consolidated into a Hadoop based data warehouse located on the US west coast.

The data fields are as follows, along with the transformers and generators applied. Here is the JSON schema for the data.

Field Transformer or Generator Comment
xactionID (none) transaction ID
custID anoynmizerTrans customer ID
storeID (none) store ID
dateTime dateFormatTrans date and time of transaction
prodID (none) product ID
quantity (none) product quantity
amount doubleMultiplyTrans monetary amount
currency forcedReplaceTrans currency for the amount
country constGen country of origin

Essentially, we are doing the following:

  1. Anonymizing the customer ID,
  2. Converting the date time to a different format and time zone,
  3. Multiplying the amount by the exchange rate for currency conversion,
  4. Changing the currency to the converted currency and
  5. Adding a new field for the country of origin.

All the transformation related configuration is in the Typesafe configuration file. For our use case, here is the transformer and generator configuration file.
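A sketch of what such a configuration might look like is given below. All parameter names are hypothetical; the exchange rate and date formats are inferred from the sample data in this post:

```
# Hypothetical sketch of the use case configuration; actual
# chombo configuration keys may differ
transformers {
    custID { transformers = [anoynmizerTrans] }
    dateTime {
        transformers = [dateFormatTrans]
        sourceDateFormat = "yyyy-MM-dd HH:mm:ss"
        targetDateFormat = "MM-dd-yy HH:mm:ss"
        targetTimeZone = "America/Los_Angeles"
    }
    amount {
        transformers = [doubleMultiplyTrans]
        operand = 1.522   # assumed GBP to USD exchange rate
    }
    currency {
        transformers = [forcedReplaceTrans]
        replacementValue = "USD"
    }
    country {
        generators = [constGen]
        constantValue = "UK"
    }
}
```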

Transformer Map Reduce

The solution consists of a mapper only. The logic in the mapper is on the lighter side, since all the heavy lifting is done by the various transformers and generators, which are plain Java classes.

For each field in each record, the processing steps are as follows.

  • Each transformer in the chain is called.
  • The output of one transformer is fed as input to the next transformer.
  • The transformer output can be placed at a different field position. The JSON schema file defines the target field ordinal.

Generators are also called. Generators are never cascaded. Each generator generates a new field.
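The per-field processing above can be sketched as follows. The class and method names are illustrative only and do not reflect chombo's actual API; field reordering by target ordinal is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of the mapper's per-field logic; all names
// here are hypothetical, not chombo's actual classes.
public class FieldProcessor {
    // applies a chain of single-value transformers; the output of
    // one transformer is fed as input to the next
    static String applyChain(String value, List<Function<String, String>> chain) {
        String out = value;
        for (Function<String, String> t : chain) {
            out = t.apply(out);
        }
        return out;
    }

    // builds the output record: transformed source fields followed
    // by generated fields, joined with the field delimiter
    static String buildRecord(String[] fields,
                              Map<Integer, List<Function<String, String>>> chains,
                              List<String> generated, String delim) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fields.length; ++i) {
            List<Function<String, String>> chain = chains.get(i);
            out.add(chain == null ? fields[i] : applyChain(fields[i], chain));
        }
        out.addAll(generated); // generators append brand new fields
        return String.join(delim, out);
    }

    public static void main(String[] args) {
        Map<Integer, List<Function<String, String>>> chains = new LinkedHashMap<>();
        // anonymizer-like transformer: mask every character
        chains.put(1, List.of(s -> s.replaceAll(".", "x")));
        // multiply amount by an assumed exchange rate
        chains.put(2, List.of(s -> String.format(Locale.ROOT, "%.2f",
                Double.parseDouble(s) * 1.522)));
        String rec = buildRecord(new String[] {"985WYA", "B83GE87PFD", "25.25"},
                                 chains, List.of("UK"), ",");
        System.out.println(rec); // 985WYA,xxxxxxxxxx,38.43,UK
    }
}
```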

When all the target fields have been generated after running all the transformers and generators, a delimiter separated record is built and emitted by the mapper.

Here is some input data received from stores in the UK. This data is transformed according to the standards and norms of the central Hadoop based data warehouse on the US west coast.

653DZO3APG25,B83GE87PFD,985WYA,2015-11-04 15:35:30,9VC040B7,1,25.25,UKP 
653DZO3APG25,B83GE87PFD,985WYA,2015-11-04 15:35:30,S9J51190,2,190.23,UKP 
653DZO3APG25,B83GE87PFD,985WYA,2015-11-04 15:35:30,K40LI8W4,1,94.99,UKP 
249WU6H6059K,16O81VBFJA,4YIM5Q,2015-11-04 15:39:16,E918R4ET,3,189.31,UKP 
249WU6H6059K,16O81VBFJA,4YIM5Q,2015-11-04 15:39:16,PAR53EAI,1,16.34,UKP

Here is the corresponding transformed data after applying the transformations.

653DZO3APG25,xxxxxxxxxx,985WYA,11-15-04 07:35:30,9VC040B7,1,38.43,USD,UK
653DZO3APG25,xxxxxxxxxx,985WYA,11-15-04 07:35:30,S9J51190,2,289.53,USD,UK
653DZO3APG25,xxxxxxxxxx,985WYA,11-15-04 07:35:30,K40LI8W4,1,144.57,USD,UK
249WU6H6059K,xxxxxxxxxx,4YIM5Q,11-15-04 07:39:16,E918R4ET,3,288.13,USD,UK
249WU6H6059K,xxxxxxxxxx,4YIM5Q,11-15-04 07:39:16,PAR53EAI,1,24.87,USD,UK


Custom transformers and generators can be created in two ways. One option is to create Java transformer classes extending a base class, along with a custom transformer factory class implementing a factory interface. Instead of a custom transformer factory class, the mapping from transformer tags to the corresponding custom transformer class names can be provided through configuration.

Another option is to use the existing custom transformer Java classes and configure them with appropriate Groovy scripts. The script takes a field value as input and is expected to return a value of the same type. Non-Java programmers will likely prefer the script based approach.
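A script based setup might look like the following sketch; the parameter name and script variable are illustrative, not chombo's actual ones:

```
# Hypothetical sketch of a script based custom transformer
doubleCustomTrans {
    # Groovy script: receives the field value, returns the same type
    script = "fieldValue * 1.522"
}
```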


The configuration properties file provides the schema file path and the HOCON configuration file path. The list of transformers to be applied to a field is defined either in the properties file or in the schema file.

The configuration file contains the configuration for each type of transformer to be used. As transformer instances are built for the transformers in the list, configuration data from the configuration file is used.

Wrapping Up

We have gone through a simple process of transforming Hadoop data. The process is completely driven by metadata and configuration and is extensible. A tutorial to run the example in this post is available.

A transformer takes one field value as input. It would be nice to have transformers that take multiple field values as input, e.g., you may want to combine two field values to generate a new field value.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.
This entry was posted in Big Data, Data Transformation, ETL, Hadoop and Map Reduce. Bookmark the permalink.

3 Responses to Transforming Big Data

  1. Kritika Kapoor says:

    The author has done a great job in tracing the transformation of Big data. I found it comprehensively informative!

  2. swarnava datta says:

    Thanks keep up the good work , its really interesting reading your posts … 🙂

  3. Pingback: Combating High Cardinality Features in Supervised Machine Learning | Mawazo
