For many Big Data projects, it has been reported that a significant part of the time, sometimes up to 70-80%, is spent on data cleaning and preparation. Typically, in most ETL tools, you define constraints and rules statically for data validation. Some examples of such rules are limit checking for numerical quantities and pattern matching for text data.
Sometimes it’s not feasible to define the rules statically, because there could be too many variables and the variables could be non-stationary. Data is non-stationary when its statistical properties change with time. In this post, we will go through a technique for detecting whether some numerical data is outside an acceptable range by detecting outliers. Continue reading
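As a minimal illustrative sketch (not necessarily the exact technique in the post), an acceptable range can be derived from the data itself rather than a static rule, using the modified z-score based on median and MAD, which is robust to the outliers it is trying to find:

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag outliers using the modified z-score (median / MAD),
    which is more robust than mean/std when the data drifts."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    # 0.6745 scales MAD so the score is comparable to a standard z-score
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

data = [10, 11, 10, 12, 11, 10, 95, 11]
flags = mad_outliers(data)  # only the reading 95 is flagged
```

Because the range is recomputed from each batch of data, it adapts as the statistical properties of the data shift over time.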
I have seen the topic of data size as it relates to machine learning discussed often. But the discussions are mostly opinions, views and innuendos, not backed by any rational explanation. Comments like “we need meaningful data, not big data” or “insightful data, not big data” abound, which often leave me puzzled. How does Big Data preclude meaning or insight in the data? Motivated to find concrete answers to the question of data size as it relates to learning, I decided to explore theories of machine learning to get to the heart of the issue. I found some interesting and insightful answers in Computational Learning Theory.
In this post I will summarize my findings. As part of this exercise, I have implemented a Python script to estimate the minimum training set size required by a learner, based on PAC theory. Continue reading
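The flavor of the estimate can be sketched with the standard PAC bound for a consistent learner over a finite hypothesis space: m ≥ (1/ε)(ln|H| + ln(1/δ)). This is only the textbook bound, not necessarily the author's script:

```python
import math

def pac_sample_size(hyp_space_size, epsilon, delta):
    """Minimum training set size for a consistent learner over a
    finite hypothesis space H, per the standard PAC bound:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hyp_space_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 2**10 hypotheses, at most 5% error with 95% confidence
m = pac_sample_size(2 ** 10, epsilon=0.05, delta=0.05)  # 199 examples
```

Note that the bound grows only logarithmically with the hypothesis space size but linearly with 1/ε, so demanding lower error drives the required data size up much faster than a richer model does.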
A Hadoop data lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository of data in different formats from various sources. If the data collected in a data lake is immutable, it simply accumulates in an append-only fashion and is easy to handle. Such data tends to be fact data, e.g., user behavior tracking data or sensor data.
However, dimension data or master data, e.g., customer data and product data, will typically be mutable. Generally it arrives in batches from some external source system, reflecting incremental inserts, updates and deletes in that system. All incremental changes need to be consolidated with the existing master data. This is a problem faced in many Hadoop or Hive based Big Data analytic projects. Even if you are using the latest version of Hive, there is no bulk update or delete support. This post is about a Map Reduce job that will perform bulk insert, update and delete on data in HDFS. Continue reading
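The core of such a consolidation job can be sketched outside Map Reduce: group existing and incremental records by key, keep the record with the latest timestamp, and drop the key when the newest operation is a delete. The record layout and operation codes below are assumptions for illustration, not sifarish's actual format:

```python
from itertools import groupby
from operator import itemgetter

def consolidate(existing, incremental):
    """Shuffle-and-reduce sketch of master data consolidation.
    Record: (key, timestamp, op, payload) where op is I/U/D."""
    records = sorted(existing + incremental, key=itemgetter(0, 1))
    merged = []
    for key, group in groupby(records, key=itemgetter(0)):
        latest = max(group, key=itemgetter(1))   # newest record wins
        if latest[2] != "D":                     # drop deleted keys
            merged.append(latest)
    return merged

master = [("c1", 1, "I", "Alice"), ("c2", 1, "I", "Bob")]
delta = [("c1", 2, "U", "Alicia"), ("c2", 2, "D", ""), ("c3", 2, "I", "Carol")]
out = consolidate(master, delta)  # c1 updated, c2 deleted, c3 inserted
```

In the actual Map Reduce job, the sort and group steps map naturally onto the shuffle phase, with the reducer emitting only the surviving record per key.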
You may be wondering about the relationship I alluded to in the title. A personalization and recommendation system like sifarish bootstraps from user and item engagement data. This kind of data is gleaned from various signals, e.g., a user’s engagement with various items or a user’s explicit rating of an item. A recommendation system could benefit significantly from customer service data residing in Customer Relationship Management (CRM) systems. It’s yet another customer item engagement signal.
In this post the focus will be on extracting customer engagement data from CRM and how it can be combined with other customer item engagement signals to define a Continue reading
As far as analytics of sensor-generated data is concerned, in the Internet of Things (IoT) and in a connected everything world, it’s mostly about real-time analytics of time series data. In this post, I will be addressing a use case involving detecting outliers in sensor-generated data with Spark Streaming. Outliers are data points that deviate significantly from most of the data. The implementation is part of my new open source project ruscello, implemented in Scala. The project is focused on real-time analytics of sensor data with various IoT use cases in mind.
The specific use case we will be addressing in this post Continue reading
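The streaming flavor of the problem can be sketched with a sliding window: each new reading is compared against the statistics of the recent window before being added to it. This is a simplified Python stand-in for the Spark Streaming job, not the ruscello implementation:

```python
from collections import deque

class StreamOutlierDetector:
    """Sliding-window outlier check: a reading is flagged when it
    deviates from the window mean by more than k standard deviations."""
    def __init__(self, window_size=10, k=3.0):
        self.window = deque(maxlen=window_size)
        self.k = k

    def observe(self, value):
        flagged = False
        if len(self.window) >= 3:          # need a few points for stats
            n = len(self.window)
            mean = sum(self.window) / n
            var = sum((v - mean) ** 2 for v in self.window) / n
            std = var ** 0.5
            flagged = std > 0 and abs(value - mean) > self.k * std
        self.window.append(value)          # the reading joins the window
        return flagged

det = StreamOutlierDetector(window_size=5, k=3.0)
flags = [det.observe(v) for v in [20, 21, 20, 21, 20, 80, 21]]  # 80 flagged
```

A bounded window is the natural fit for non-stationary sensor data: the baseline statistics track the signal as it drifts, instead of being anchored to all history.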
One of the nagging problems in personalized recommendation systems is crowding of items with the same attribute values in the recommendation list. For example, if you happen to like a certain artist, songs by that artist will tend to flood the recommended song list in a music recommendation system. The flooding happens precisely because of the accuracy of the collaborative filtering algorithm, but that accuracy may not always produce the best outcome. The only criterion for success for a recommender is whether or not the user has clicked on one or more items in the recommendation impression list.
As I mentioned in my earlier post, getting a recommendation list through Collaborative Filtering (CF) machine learning algorithms goes only part of the way towards providing an effective solution. Various post-processing mechanisms are required to modify and improve the recommendation list Continue reading
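One such post-processing mechanism, sketched here as a simple heuristic rather than sifarish's actual algorithm, caps how many times an attribute value (e.g., the artist) may appear near the top of the list, demoting the crowding items to the tail:

```python
def diversify(ranked_items, get_attr, max_per_attr=2):
    """Post-process a CF-ranked list: once an attribute value has
    appeared max_per_attr times, further items with that value are
    moved to the tail, preserving the relative order of each group."""
    counts, kept, demoted = {}, [], []
    for item in ranked_items:
        attr = get_attr(item)
        if counts.get(attr, 0) < max_per_attr:
            counts[attr] = counts.get(attr, 0) + 1
            kept.append(item)
        else:
            demoted.append(item)
    return kept + demoted   # crowded items end up at the tail

songs = [("s1", "artistA"), ("s2", "artistA"), ("s3", "artistA"),
         ("s4", "artistB"), ("s5", "artistA")]
result = diversify(songs, get_attr=lambda s: s[1], max_per_attr=2)
# artistB's song s4 now surfaces ahead of the third artistA song
```

The trade-off is explicit: a little predicted-rating accuracy is sacrificed near the top of the list in exchange for variety, which is what actually drives clicks on the impression list.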
The basic recommendation output, consisting of the tuple (user, item, predicted rating), is easy to obtain from any Collaborative Filtering (CF) based Recommendation and Personalization engine, including sifarish. It has been reported that applying various post-processing logic to the basic CF output yields a bigger return in recommendation quality than tweaking the core machine learning algorithm. We can think of these post-processing units as plugins that apply to the basic recommendation output.
Many of those have been implemented in sifarish. They are not based on rigorous machine learning algorithms, but on simple intuition and heuristics. In this post I will go through one that was implemented recently, where the actual rating of an item by a user is considered in conjunction with the predicted rating to derive a net rating. Continue reading
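The shape of such a plugin can be sketched as a blend of the two ratings; this linear blend is a hypothetical stand-in, not sifarish's actual formula:

```python
def net_rating(predicted, actual=None, weight=0.5):
    """Derive a net rating from the predicted rating and, when
    available, the user's actual rating of the item. The linear
    blend and default weight are illustrative assumptions."""
    if actual is None:
        return predicted            # no actual rating: use prediction as is
    return weight * actual + (1.0 - weight) * predicted

r1 = net_rating(predicted=4.2, actual=3.0, weight=0.5)  # blended: 3.6
r2 = net_rating(predicted=4.2)                          # prediction only: 4.2
```

The weight controls how much the user's own expressed opinion tempers the CF prediction, which is exactly the kind of knob a heuristic post-processing plugin exposes.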