Plugin Framework Based Data Transformation on Spark

Data transformation is one of the key components of most ETL processes. It is well known that in most data projects, more than 50% of the time is spent in data pre-processing. In an earlier blog, I discussed a Hadoop based data transformation solution with a plugin framework in detail.

This is a companion article, in which we will go through a data transformation implementation on Spark. The Spark implementation is part of my open source project chombo on GitHub.

Data Transformation

Data transformation is needed because the data representation at the source is often not compatible with the processing logic of the process that consumes the data. The consuming process may be an executive dashboard or a Data Science project.

Please refer to my earlier blog for details of the plugin framework. Some transformations are commonly needed in most data projects. This observation led to the development of the plugin based framework, where each transformer is identified with a unique tag. A transformer may also be associated with some configuration, depending on its complexity.

Each field may be associated with a set of transformers, piped together. The output of one goes as input to the next. The last transformer in the chain may produce multiple values. Each field in the source data may produce no output, one output or multiple outputs. The output field positions in the target record are specified in the configuration.
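The piping described above can be sketched as follows. This is only an illustration of the idea; the interface and class names here are hypothetical and do not reflect the actual chombo API.

```java
import java.util.List;

// hypothetical sketch of per-field transformer piping; not the chombo API
interface FieldTransformer {
    // takes one field value; may emit zero, one or multiple values
    List<String> transform(String value);
}

class TransformerChain {
    private final List<FieldTransformer> transformers;

    TransformerChain(List<FieldTransformer> transformers) {
        this.transformers = transformers;
    }

    // output of each transformer feeds the next; only the last may fan out
    List<String> apply(String value) {
        List<String> out = List.of(value);
        for (FieldTransformer t : transformers) {
            out = t.transform(out.get(0));
        }
        return out;
    }
}
```

A chain of an upper-casing transformer followed by a splitting transformer would turn one input value into two output values, which is how a single source field can map to multiple target fields.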

Retail Transaction Data

We will use some fictitious eCommerce transaction data, to which we will apply some transformations to make it ready for algorithmic analysis for evaluating customer loyalty. The fields are as follows:

  1. segment ID
  2. customer ID
  3. transaction ID
  4. transaction time stamp
  5. monetary amount

The customer data has been segmented based on online behavior, and the first field is the segment ID. Here is some sample data
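The sample shown in the original post does not survive here; rows in the field order above would look like the following, where the segment IDs, customer IDs, transaction IDs, timestamps and amounts are all invented purely for illustration:

```
seg2,C8F3A2,T4599201,1462720145000,114.75
seg1,C1D904,T4599202,1462723388000,36.20
seg3,C77B1E,T4599203,1462730912000,228.40
```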


Transformation Requirement

We want to perform the following transformations:

  • Get rid of segment ID
  • Based on customer ID, generate an additional field for the customer's loyalty level. It is assumed that current loyalty data is available for all customers. Loyalty levels are defined as High, Medium and Low. A transformer called keyValueTrans will be used
  • Retain transaction ID
  • Based on transaction time, generate two fields for day of the week and hour of the day, and replace the original field with the generated fields. A transformer called multiTimeCycleTrans is used for this field
  • Retain monetary amount
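The time based transformation in the list above can be sketched as below. This is a minimal illustration of the kind of output multiTimeCycleTrans produces, assuming the transaction timestamp is epoch time in milliseconds and UTC is acceptable; both are assumptions, and the actual chombo implementation may differ.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class TimeCycleSketch {
    // returns {day of week (1 = Monday .. 7 = Sunday), hour of day (0..23)}
    // assumes an epoch millisecond timestamp interpreted in UTC (illustrative)
    public static int[] cyclicComponents(long epochMs) {
        ZonedDateTime dt = Instant.ofEpochMilli(epochMs).atZone(ZoneOffset.UTC);
        return new int[] {dt.getDayOfWeek().getValue(), dt.getHour()};
    }
}
```

Day of week and hour of day are cyclic attributes, which is what makes them useful features for behavioral analysis of transaction times.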

The transformer configuration for the fields is in the JSON based schema. Both of the transformers used for this task require additional configuration, which is defined in a Typesafe format configuration file. Some simple transformers, like one that capitalizes a string, do not require additional configuration.
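As a rough, hypothetical sketch only, the Typesafe (HOCON) configuration for the two transformers might look something like the fragment below. The key names and layout here are illustrative; the actual keys expected by chombo are documented in the earlier blog and the project wiki.

```
# hypothetical HOCON fragment; actual chombo configuration keys may differ
transformers {
  keyValueTrans {
    # customer ID to loyalty level mapping (values invented for illustration)
    keyToValue = { C8F3A2 = "High", C1D904 = "Low" }
  }
  multiTimeCycleTrans {
    timeUnit = "epochMilliSec"
    cycles = ["dayOfWeek", "hourOfDay"]
  }
}
```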

For a complete list of transformers available out of the box, please refer to my earlier blog.

Transformation Spark Job

The Spark implementation is in the Scala object DataTransformer. It offloads most of the transformation work to the Java based transformation framework. Here is some sample output. The input has 5 fields; as expected, the output has 6 fields.
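The per record field mapping that the Spark map operation delegates to the framework can be sketched end to end as below. The names are hypothetical, a hardcoded lookup stands in for the configured loyalty data, and the timestamp is assumed to be epoch milliseconds in UTC; this is a sketch of the field accounting (5 fields in, 6 out), not the actual framework code.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RecordTransformSketch {
    // hypothetical stand-in for the configured customer -> loyalty lookup
    static final Map<String, String> LOYALTY = Map.of("C8F3A2", "High");

    // input: segID,custID,txnID,epochMsTime,amount  ->  6 output fields
    public static String transform(String line) {
        String[] f = line.split(",");
        List<String> out = new ArrayList<>();
        // segment ID (f[0]) is dropped
        out.add(f[1]);                                      // customer ID retained
        out.add(LOYALTY.getOrDefault(f[1], "Low"));         // added loyalty field
        out.add(f[2]);                                      // transaction ID retained
        ZonedDateTime dt = Instant.ofEpochMilli(Long.parseLong(f[3]))
                .atZone(ZoneOffset.UTC);
        out.add(String.valueOf(dt.getDayOfWeek().getValue())); // day of week
        out.add(String.valueOf(dt.getHour()));                 // hour of day
        out.add(f[4]);                                      // monetary amount retained
        return String.join(",", out);
    }
}
```

Dropping one field, adding one, and splitting one into two is exactly why a 5 field input record yields a 6 field output record.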


Wrapping Up

We have gone through a plugin framework based data transformation solution on Spark. If you are interested in executing this use case, please follow the steps in the tutorial.

If you have an idea about some common transformation that’s not yet supported, please feel free to suggest it. If it makes sense, I will implement it.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.
This entry was posted in Big Data, Data Science, ETL, Scala, Spark.

