Data Type Auto Discovery with Spark

In the life of a Data Scientist, it’s not uncommon to run into a data set with no knowledge or very little knowledge about the data. You may be interested in learning about such data with missing meta data  through some tools instead of going through the tedious process of manually perusing the data and try to make sense out of it.

In this a post we will go through a Spark based implementation to automatically discover data types for the various fields in a data set. The implementation in available in my OSS project chombo.

Data type discovery is only one of the ways to learn about data. There are various other techniques to learn characteristics of individual fields as well relationship between fields. We will discuss about some them and refer you some other blogs of mine for more details.

Data Type Hierarchy

Basic data types are essentially numeric and string. However stronger data types may be defined based on the basic data types. A stronger data as referred to in this post is defined as a data type that imposes stricter constraints on it’s values compared to an weaker type. The relationship between different types is Directed Acyclic Graph (DAG). A weaker type subsumes a stronger type.

Stronger data types may be defined for business domain specific attributes.  For example, in eCommerce we may have attributes like phone number, zip code, order quantity and monetary amount

As an example, an integer is stronger type compared to a floating point, because digits with decimal point can not be interpreted as an integer. An integer is also a floating point number but not vice versa. A person’s age is a stronger data type than an integer, because age enforces a range limit e.g. between 1 and 120.

For numeric data types, a constraint for a stronger data types is defined by imposing a range of acceptable values, with a narrower range implying a stronger constraint.

For string data types, a constraint for a stronger data type is defined by patterns expressed regular expression, with stricter pattern matching implying a stronger constraint.

Data Type Discovery

Data type is inferred based on content only. In it’s raw form,  all data is string.  That’s why when the type inference fails, it defaults to string type.

A numeric data type is defined with the following attributes.

  1. data type name
  2. numeric range
  3. strength

A string based data type is defined with the following attributes.

  1. data type name
  2. regular expression
  3. length of the string (optional)
  4. strength

The strength value is used as tie breaker.  If a data value is matched by multiple types, the one with the maximum strength value is chosen. A stronger type is assigned higher strength compared to an weaker type. For example if a field matches both zip code and string, zip code is inferred to be type.

A large number of data types are pre defined and available out of the box. They are as follows. Custom data types may also be introduced by providing the attributes as listed above.

type base type comment
age numeric age of person
epochTime numeric epoch time in ms
phoneNumber string US phone number
SSN string US social security number
zip string US zip code
date string formatted date
streetAddress string US street address
city string city name
state string US state
currency string currency symbol
monetaryAmount string monetary symbol followed by amount
idShort string id of short length
idMedium string id of medium length
idLong string id of long length

Custom Data Types

In addition, any custom types can also be configured. The custom types can be numeric based or string based. They serve two purposes

  1. New data types related to a business domain may be defined
  2. Existing data types supported out of the box may be modified.

Under the first category, you can create domain specific  data types e.g. order quantity or shipping mode. As an example of  the second category, you may want to modify the regex for the data type streetAddress, which is available out of the box

Retail Sale Data

The data contains the following fields. As we will see later,  we will be able to discover data type for most fields.

  1. order ID
  2. customer ID
  3. store ID
  4. zip
  5. date
  6. product ID
  7. quantity
  8. monetary amount

Here is some sample input

9O7AP683JG87,O7CO0RC670,0GU84O,97218,2016-12-28 05:13:49,U6HATUW1,5,USD 80.91
9O7AP683JG87,O7CO0RC670,0GU84O,97218,2016-12-28 05:13:49,N6609ISN,1,USD 39.28
H3OEXT17E7PV,968EUP61A6,899454,95678,2016-12-28 05:21:46,R00P2QTF,1,USD 31.04
5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,NRIU8IO2,5,USD 295.91
5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,T0287LFL,1,USD 16.91
5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,U5EJLZ4T,5,USD 294.85
5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,4V4643U5,4,USD 245.89
5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,3DM7US8Z,2,USD 163.68
394I975K7814,BO9G00PSQ9,1V0KSS,37639,2016-12-28 05:30:53,6EX5X1FC,2,USD 89.06

Type Inference Spark Job

The Spark job is implemented in the scala object DataTypeInferencer. For each field, we count the number of matches for each data type configured. The type that matches for all the records is considered to be the discovered data type. In case we have more than one type that matches for all the records for a given field, the type type with the highest strength is considered to be the winner.

Probabilistic type matching is also possible, in case of some missing values, wrong values or incorrectly specified range constraint for numeric type. For example for some records the phone number field may be missing or incorrectly formatted. If a threshold fraction is configured, as long fraction of the records that match exceed the threshold, it’s considered to be a match.

Here is the output. The first field is the column index and the second field is the discovered data type.


Column 6 which is quantity has defaulted to an integer. If we knew the typical range of values for order quantity, we could defined a orderQuantity type and it would have been discovered.

The length of ID fields is considered to be fixed. For ID types, 3 different types are supported and for each type the length needs to be specified. We have failed to discover one ID field and as a result column 2 is shown as string.

Probabilistic Type Matching

This feature is enabled by setting a probability threshold. For a given field and type, if the fraction of matched records is above the threshold, it’s considered to be probabilistic-ally matched to that that type

Consider a data type order quantity from eCommerce domain. You might have a hunch that quantity is between 1 and 10 and define a data type orderQuantity accordingly. As long as most of the records meet the constraint and the fraction of matched records is above the configured probability threshold e.g., 0.9, it’s be considered to be a match.

The output also includes the fraction of the records that matched. This value will be indicative of how confident you might feel about the inferred data type.

Next Steps

Going beyond data type discovery, chombo also provides various tools for data exploration to learn about statistical and other properties of data fields as well as relationship between fields. They are implemented in Hadoop, but being ported to Spark.

Summing Up

We have gone through a Spark based implementation for automatic data type discovery for a data set without any available meta data. This tool may also be used to verify available meta data about some data set.  The algorithm makes it’s best effort and tries to discover data type for as many fields as  possible. To run the use case please follow the steps as described in the tutorial document.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Data Profiling, Data Science, Scala, Spark and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s