In the life of a Data Scientist, it’s not uncommon to run into a data set with no knowledge or very little knowledge about the data. You may be interested in learning about such data with missing meta data through some tools instead of going through the tedious process of manually perusing the data and try to make sense out of it.
In this a post we will go through a Spark based implementation to automatically discover data types for the various fields in a data set. The implementation in available in my OSS project chombo.
Data type discovery is only one of the ways to learn about data. There are various other techniques to learn characteristics of individual fields as well relationship between fields. We will discuss about some them and refer you some other blogs of mine for more details.
Data Type Hierarchy
Basic data types are essentially numeric and string. However stronger data types may be defined based on the basic data types. A stronger data as referred to in this post is defined as a data type that imposes stricter constraints on it’s values compared to an weaker type. The relationship between different types is Directed Acyclic Graph (DAG). A weaker type subsumes a stronger type.
Stronger data types may be defined for business domain specific attributes. For example, in eCommerce we may have attributes like phone number, zip code, order quantity and monetary amount
As an example, an integer is stronger type compared to a floating point, because digits with decimal point can not be interpreted as an integer. An integer is also a floating point number but not vice versa. A person’s age is a stronger data type than an integer, because age enforces a range limit e.g. between 1 and 120.
For numeric data types, a constraint for a stronger data types is defined by imposing a range of acceptable values, with a narrower range implying a stronger constraint.
For string data types, a constraint for a stronger data type is defined by patterns expressed regular expression, with stricter pattern matching implying a stronger constraint.
Data Type Discovery
Data type is inferred based on content only. In it’s raw form, all data is string. That’s why when the type inference fails, it defaults to string type.
A numeric data type is defined with the following attributes.
- data type name
- numeric range
A string based data type is defined with the following attributes.
- data type name
- regular expression
- length of the string (optional)
The strength value is used as tie breaker. If a data value is matched by multiple types, the one with the maximum strength value is chosen. A stronger type is assigned higher strength compared to an weaker type. For example if a field matches both zip code and string, zip code is inferred to be type.
A large number of data types are pre defined and available out of the box. They are as follows. Custom data types may also be introduced by providing the attributes as listed above.
|age||numeric||age of person|
|epochTime||numeric||epoch time in ms|
|phoneNumber||string||US phone number|
|SSN||string||US social security number|
|zip||string||US zip code|
|streetAddress||string||US street address|
|monetaryAmount||string||monetary symbol followed by amount|
|idShort||string||id of short length|
|idMedium||string||id of medium length|
|idLong||string||id of long length|
Custom Data Types
In addition, any custom types can also be configured. The custom types can be numeric based or string based. They serve two purposes
- New data types related to a business domain may be defined
- Existing data types supported out of the box may be modified.
Under the first category, you can create domain specific data types e.g. order quantity or shipping mode. As an example of the second category, you may want to modify the regex for the data type streetAddress, which is available out of the box
Retail Sale Data
The data contains the following fields. As we will see later, we will be able to discover data type for most fields.
- order ID
- customer ID
- store ID
- product ID
- monetary amount
Here is some sample input
9O7AP683JG87,O7CO0RC670,0GU84O,97218,2016-12-28 05:13:49,U6HATUW1,5,USD 80.91 9O7AP683JG87,O7CO0RC670,0GU84O,97218,2016-12-28 05:13:49,N6609ISN,1,USD 39.28 H3OEXT17E7PV,968EUP61A6,899454,95678,2016-12-28 05:21:46,R00P2QTF,1,USD 31.04 5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,NRIU8IO2,5,USD 295.91 5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,T0287LFL,1,USD 16.91 5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,U5EJLZ4T,5,USD 294.85 5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,4V4643U5,4,USD 245.89 5KIH2953U808,T159QZ1H35,8F762H,40560,2016-12-28 05:23:53,3DM7US8Z,2,USD 163.68 394I975K7814,BO9G00PSQ9,1V0KSS,37639,2016-12-28 05:30:53,6EX5X1FC,2,USD 89.06
Type Inference Spark Job
The Spark job is implemented in the scala object DataTypeInferencer. For each field, we count the number of matches for each data type configured. The type that matches for all the records is considered to be the discovered data type. In case we have more than one type that matches for all the records for a given field, the type type with the highest strength is considered to be the winner.
Probabilistic type matching is also possible, in case of some missing values, wrong values or incorrectly specified range constraint for numeric type. For example for some records the phone number field may be missing or incorrectly formatted. If a threshold fraction is configured, as long fraction of the records that match exceed the threshold, it’s considered to be a match.
Here is the output. The first field is the column index and the second field is the discovered data type.
4,date 0,idShort 6,int 2,string 1,idMedium 3,zip 7,monetaryAmount 5,string
Column 6 which is quantity has defaulted to an integer. If we knew the typical range of values for order quantity, we could defined a orderQuantity type and it would have been discovered.
The length of ID fields is considered to be fixed. For ID types, 3 different types are supported and for each type the length needs to be specified. We have failed to discover one ID field and as a result column 2 is shown as string.
Probabilistic Type Matching
This feature is enabled by setting a probability threshold. For a given field and type, if the fraction of matched records is above the threshold, it’s considered to be probabilistic-ally matched to that that type
Consider a data type order quantity from eCommerce domain. You might have a hunch that quantity is between 1 and 10 and define a data type orderQuantity accordingly. As long as most of the records meet the constraint and the fraction of matched records is above the configured probability threshold e.g., 0.9, it’s be considered to be a match.
The output also includes the fraction of the records that matched. This value will be indicative of how confident you might feel about the inferred data type.
Going beyond data type discovery, chombo also provides various tools for data exploration to learn about statistical and other properties of data fields as well as relationship between fields. They are implemented in Hadoop, but being ported to Spark.
We have gone through a Spark based implementation for automatic data type discovery for a data set without any available meta data. This tool may also be used to verify available meta data about some data set. The algorithm makes it’s best effort and tries to discover data type for as many fields as possible. To run the use case please follow the steps as described in the tutorial document.