Data transformation is one of the key components of most ETL processes. It is well known that in most data projects more than 50% of the time is spent in data preprocessing. In an earlier blog, I discussed a Hadoop based data transformation solution with a plugin framework in detail.
This is a companion article, in which we will go through a data transformation implementation on Spark. The Spark implementation is part of my open source project chombo on GitHub.
Data transformation is needed because the data representation at the source is often not compatible with the processing logic of the process that consumes the data. The consuming process may be an executive dashboard or a data science project.
Please refer to my earlier blog for details of the plugin framework. Some transformations are commonly needed in most data projects. This observation led to the development of the plugin based framework, where each transformer is identified by a unique tag. A transformer may also be associated with some configuration, depending on its complexity.
Each field may be associated with a set of transformers, piped together. The output of one goes as input to the next. The last transformer in the chain may produce multiple values. Each field in the source data may produce no output, one output, or multiple outputs. The output field positions in the target record are specified in the configuration.
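The piping described above can be sketched in plain Java. This is a minimal illustration of the idea, not chombo's actual API; the interface and method names here are my own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Sketch of transformer piping: each transformer emits one or more values,
// and every value from one stage is fed into the next stage.
public class TransformerChain {
    // a transformer maps one field value to a list of output values
    interface FieldTransformer extends Function<String, List<String>> {}

    static List<String> applyChain(List<FieldTransformer> chain, String value) {
        List<String> current = Collections.singletonList(value);
        for (FieldTransformer t : chain) {
            List<String> next = new ArrayList<>();
            for (String v : current) {
                next.addAll(t.apply(v));
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        FieldTransformer trim = v -> Collections.singletonList(v.trim());
        // the last transformer in the chain may emit multiple values
        FieldTransformer split = v -> Arrays.asList(v.split("-"));
        System.out.println(applyChain(Arrays.asList(trim, split), " a-b "));
        // prints [a, b]
    }
}
```

A chain returning an empty list models a field that produces no output at all, which is how a field gets dropped from the target record.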
Retail Transaction Data
We will use some fictitious eCommerce transaction data, to which we will apply some transformations to make it ready for algorithmic analysis of customer loyalty. The fields are as follows:
- segment ID
- customer ID
- transaction ID
- transaction timestamp
- monetary amount
The customer data has been segmented based on online behavior, and the first field is the segment ID. Here is some sample data:
2,OCW9VCN59K,689NW1W11YP2OL3Q,1516885604,165.50
3,8J2ZEARB96,43H8HBAUF2M94685,1516886324,314.63
0,MUSRQ5Y27U,PX54GX7CLY665M17,1516887104,44.44
1,2XA2RL92LU,I89R6LVT8I8Y9CAC,1516887704,80.59
2,UY05GS13X1,W36RQGV8064SX8PG,1516888364,117.55
3,9G6J4N10DT,011IMT38W1X72RB1,1516889024,314.56
3,7SQ9A379OZ,K44A2IA4K6H033LK,1516889444,332.64
0,MAL4V0WYV4,89971KBCFNXLLXB5,1516890104,66.27
1,2BMHP1392C,E45HCRNKCV5H6C33,1516890644,51.54
We want to perform the following transformations:
- Get rid of segment ID
- Based on the customer ID, generate an additional field for customer loyalty. It is assumed that current loyalty data is available for all customers. Loyalty levels are defined as High, Medium, and Low. A transformer called keyValueTrans will be used
- Retain transaction ID
- Based on the transaction time, generate two fields, day of the week and hour of the day, and replace the original field with the generated fields. A transformer called multiTimeCycleTrans is used for this field
- Retain monetary amount
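For intuition, here is a plain Java sketch of what the two tagged transformers do. The method names, the loyalty lookup table, the default level, and the day-of-week encoding are my assumptions for illustration; the actual keyValueTrans and multiTimeCycleTrans implementations live in the framework.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the two transformers used in this use case.
public class LoyaltyAndTimeSketch {
    // keyValueTrans-style lookup: map a key to a configured value;
    // the loyalty table and default level are illustrative assumptions
    static String loyaltyLevel(Map<String, String> loyaltyByCustomer, String custId) {
        return loyaltyByCustomer.getOrDefault(custId, "L");
    }

    // multiTimeCycleTrans-style extraction: turn an epoch-second timestamp
    // into day-of-week (Monday = 1 ... Sunday = 7) and hour-of-day fields
    static List<String> timeCycles(long epochSec, ZoneId zone) {
        ZonedDateTime t = Instant.ofEpochSecond(epochSec).atZone(zone);
        return Arrays.asList(
                String.valueOf(t.getDayOfWeek().getValue()),
                String.valueOf(t.getHour()));
    }

    public static void main(String[] args) {
        // first sample timestamp, interpreted in UTC
        System.out.println(timeCycles(1516885604L, ZoneOffset.UTC));
        // prints [4, 13] : a Thursday at hour 13 UTC
    }
}
```

Note that the day-of-week and hour values depend on the time zone chosen, so the numbers in the sample output below reflect whatever zone and encoding the transformer was configured with.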
The transformer configuration for the fields is in the JSON based schema. Both transformers used for this task require additional configuration, which is defined in a Typesafe format configuration file. Some simple transformers, such as capitalization, do not require additional configuration.
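As a rough illustration of the shape of such a Typesafe (HOCON) configuration, the per-transformer parameters might look like the following. All keys and values here are hypothetical, not the framework's actual configuration schema.

```
transformers {
  keyValueTrans {
    # hypothetical: file mapping customer ID to loyalty level
    mappingFile = "loyalty.txt"
    defaultValue = "L"
  }
  multiTimeCycleTrans {
    # hypothetical: cyclic time components to extract and time zone
    timeCycles = ["dayOfWeek", "hourOfDay"]
    timeZone = "America/Los_Angeles"
  }
}
```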
For a complete list of transformers available out of the box, please refer to my earlier blog.
Transformation Spark Job
The Spark implementation is in the Scala object DataTransformer. It offloads most of the transformation work to the Java based transformation framework. Here is some sample output. The input has 5 fields; as expected, the output has 6 fields.
W49WEWZ6P1,M,NV2KF1Z22PW87S56,5,0,122.16
OCW9VCN59K,M,689NW1W11YP2OL3Q,5,0,165.50
8J2ZEARB96,H,43H8HBAUF2M94685,5,0,314.63
MUSRQ5Y27U,L,PX54GX7CLY665M17,5,0,44.44
2XA2RL92LU,M,I89R6LVT8I8Y9CAC,5,0,80.59
UY05GS13X1,H,W36RQGV8064SX8PG,5,0,117.55
9G6J4N10DT,M,011IMT38W1X72RB1,5,1,314.56
7SQ9A379OZ,H,K44A2IA4K6H033LK,5,1,332.64
MAL4V0WYV4,L,89971KBCFNXLLXB5,5,1,66.27
2BMHP1392C,L,E45HCRNKCV5H6C33,5,1,51.54
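Ignoring Spark's distribution and the configurable output field positions, the per-record driver logic amounts to something like the following. This is a simplified sketch, not chombo's actual DataTransformer code; it simply keeps outputs in field order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Simplified sketch: map each field through its transformer chain (if any)
// and assemble the surviving values into the output record.
public class RecordTransform {
    static String transformRecord(String line,
            Map<Integer, Function<String, List<String>>> transByFieldIndex) {
        String[] fields = line.split(",");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            Function<String, List<String>> t = transByFieldIndex.get(i);
            if (t == null) {
                out.add(fields[i]);             // no transformer: retain as is
            } else {
                out.addAll(t.apply(fields[i])); // may emit 0, 1, or many values
            }
        }
        return String.join(",", out);
    }

    public static void main(String[] args) {
        Map<Integer, Function<String, List<String>>> trans = new HashMap<>();
        trans.put(0, v -> Collections.emptyList()); // drop segment ID
        System.out.println(transformRecord("2,OCW9VCN59K,165.50", trans));
        // prints OCW9VCN59K,165.50
    }
}
```

In the actual Spark job this per-record function is simply mapped over the input RDD, which is why the transformation framework itself needs no knowledge of Spark.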
We have gone through a plugin framework based data transformation solution on Spark. If you are interested in executing this use case, please follow the steps in the tutorial.
If you have an idea for a common transformation that’s not yet supported, please feel free to suggest it. If it makes sense, I will implement it.
For commercial support for any solution in my GitHub repositories, please contact ThirdEye Data Science Services. Support is available for Hadoop and Spark deployment on the cloud, including installation, configuration, and testing.