This is a sequel to my earlier posts on Hadoop based ETL covering validation and profiling. Considering the fact that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant ETL function to my OSS project chombo in github, including validation, transformation and profiling.
This post will focus on data transformation. Validation, transformation and profiling are the key activities in any ETL process. I will be using the retail sales data for a fictitious multinational retailer to showcase data transformation capabilities in chombo.
There are ready to use transformers out of the box in chombo that can easily be configured. Each transformer is identified with an unique tag and optionally set of configuration parameters. Typesafe HOCON is used for configuration. Transformers have the following characteristics.
- Transformers transform a field value to 1 or more field values e.g. converting date time format
- One or more transformers can be chained for a given field. only the last transformer in the chain can generate multiple values
- While transforming data, the field position can be changed in the output
- Generators generate completely new fields e.g. unique key
- Depending on the data type i.e., string or numeric, there are two categories of transformers
Here is a list of transformers and generators available out of the box. I expect to add more in future as needs arise.
|lowerCaseTrans||transformer||string||converts to lower case|
|upperCaseTrans||transformer||string||converts to upper case|
|patternBasedTrans||transformer||string||creates multiple values using regex with groups|
|searchReplaceTrans||transformer||string||does search and replace|
|keyValueTrans||transformer||string||if field matches with key, replace with value|
|defaultValueTrans||transformer||string||if field value is missing replaced with deafult|
|anoynmizerTrans||transformer||string||anoynmizes field value|
|trimTrans||transformer||string||trims white space from both ends|
|dateFormatTrans||transformer||string||transform date format|
|groupTrans||transformer||string||replaces value with group ID|
|forcedReplaceTrans||transformer||string||replaces field value irrespective of content|
|stringCustomTrans||transformer||string||script based custom transformer|
|uniqueKeyGen||generator||string||generates unique key|
|dateGen||generator||string||generates current date|
|constGen||generator||string||generates a literal string|
|longPolynomialTrans||transformer||long||generates second degree polynimial value|
|doublePolynomialTrans||transformer||double||generates second degree polynimial value|
|longCustomTrans||transformer||long||script based custom transformation|
|doubleCustomTrans||transformer||double||script based custom transformation|
|discretizerTrans||transformer||numeric||discretizes into buckets|
|epochTimeGen||generator||long||generates epoch time|
|stringConcatenationTrans||transformer||string||appends or preppends string to a field|
|stringFieldMergeTrans||transformer||string||merges multiple fields|
|elapsedTimeTrans||transformer||string (date)||finds elapsed time w.r.t a configured reference data time|
|contextualElapsedTimeTrans||transformer||string (date)||finds elapsed time w.r.t a reference data time, which is in another field|
|timeCycleShiftTrans||transformer||string (date)||moves date time by cycle length e.g. month until a mile stone date|
|contextualTimeCycleShiftTrans||transformer||string (date)||moves date time by cycle length with cycle and mile stone date from other fields|
|stringWithinFieldDelimTrans||transformer||string||replaces field delimiter to collapse adjacent columns into one|
|stringBinaryTrans||transformer||string (numeric)||converts to binary value based on supplied predicate|
|categoricalBinaryTrans||transformer||string (categorical)||replaces field with multiple binary fields|
|dateComponentTrans||transformer||string (date)||extracts one or more components from date|
|binaryArithOperatorGen||generator||numeric||binary arithmetic operation between 2 columns|
|intRoundOffTrans||transformer||numeric||rounds off floating to integer|
|floatRoundOffTrans||transformer||numeric||rounds off floating to floating with configured precision|
Most of the transformers and generators require configuration. For example, patternBasedTrans requires the regex and the number of groups in the regex as configuration parameters .
Retail Sales Data
As mentioned earlier, the data is retails sales data for a multinational retailer. It receives data from various countries, which needs to be processed to handle the idiosyncrasies of different countries and converted to a canonical format.
In our use case, the data from various source countries get consolidated into a Hadoop based data warehouse located in US west coast.
The data fields are as follows, along with the transformers and generators applied. Here is the JSON schema for the data.
|Field||Transformer or Generator||Comment|
|dateTime||dateFormatTrans||date and time of transaction|
|currency||forcedReplaceTrans||currency for the amount|
|country||constGen||country of origin|
Essentially,we are doing the following
- Anoynmizing customer ID,
- Converting data time a different format and time zone,
- Multiplying the amount by exchange rate for different currency,
- Changing the currency to converted currency and
- Adding a new field which is the country of origin.
All the transformation related configuration are in the Typesafe configuration file. For our use case, here is the transformer and generator configuration file.
Transformer Map Reduce
The solution consists of a mapper only. The logic in the mapper is on the lighter side, since all the heavy lifting is done by various transformers and generators which are plain java classes.
For each field in each record, the processing steps are as follows.
- Each transformer in the chain is called.
- The out of one transformer is fed as input of the next transformer.
- The transformer output can be placed a different field position. The JSON schema file defines the target field ordinal.
Generators are also called. Generators are never cascaded. Each generator generates a new field.
When all the target fields have been generated after running all the transformers and generators, a delemeter separated record is built and emitted by the mapper.
Here is some input data received from stores in UK. This data is transformed according to central Hadoop based data warehouse standards and norms in the US west coast.
653DZO3APG25,B83GE87PFD,985WYA,2015-11-04 15:35:30,9VC040B7,1,25.25,UKP 653DZO3APG25,B83GE87PFD,985WYA,2015-11-04 15:35:30,S9J51190,2,190.23,UKP 653DZO3APG25,B83GE87PFD,985WYA,2015-11-04 15:35:30,K40LI8W4,1,94.99,UKP 249WU6H6059K,16O81VBFJA,4YIM5Q,2015-11-04 15:39:16,E918R4ET,3,189.31,UKP 249WU6H6059K,16O81VBFJA,4YIM5Q,2015-11-04 15:39:16,PAR53EAI,1,16.34,UKP
Here is the corresponding transformed data, that complies with the transformations applied.
653DZO3APG25,xxxxxxxxxx,985WYA,11-15-04 07:35:30,9VC040B7,1,38.43,USD,UK 653DZO3APG25,xxxxxxxxxx,985WYA,11-15-04 07:35:30,S9J51190,2,289.53,USD,UK 653DZO3APG25,xxxxxxxxxx,985WYA,11-15-04 07:35:30,K40LI8W4,1,144.57,USD,UK 249WU6H6059K,xxxxxxxxxx,4YIM5Q,11-15-04 07:39:16,E918R4ET,3,288.13,USD,UK 249WU6H6059K,xxxxxxxxxx,4YIM5Q,11-15-04 07:39:16,PAR53EAI,1,24.87,USD,UK
Custom transformers and generators can be created in two ways. Customization can be done by creating Java transformer classes extending a base class. It is also necessary to create a custom transformer factory class implementing a factory interface. Instead of custom transformer factory class, the list of transformer tag and the corresponding custom transformer class name could be provided through configuration.
Another option is to use existing custom transformer Java classes and configuring them with appropriate Groovy scripts.The script takes a field value as input and expected to return a value of the same type. Non Java programmers will prefer the script based approach.
The configuration properties file provides information on schema file path and HOCON configuration file path. The list of transformer to be applied to a field is defined either in the properties files or in the schema file.
The configuration file contains configuration for each type of transformer to be used. As transformers are built for all the transformers in the list, configuration data from the configuration file is used.
We have gone through a simple process of transforming Hadoop data. The process is completely driven by meta data and configuration and is extensible.A tutorial to run the example in this post is available.
A transformer takes one field value as input. It will be nice to have transformers that take multiple field values as input e.g. you may want to combine two filed values to generate a new field value.