Is Bigger Data Better for Machine Learning?

I have often seen the topic of data size as it relates to machine learning being discussed. But the discussions are mostly opinions, views and innuendo, not backed by any rational explanation. Comments like “we need meaningful data, not big data” or “insightful data, not big data” abound, and they often leave me puzzled. How does Big Data preclude meaning or insight in the data? Motivated to find concrete answers to the question of data size as it relates to learning, I decided to explore theories of Machine Learning to get to the heart of the issue. I found some interesting and insightful answers in Computational Learning Theory.

In this post I will summarize my findings. As part of this exercise, I have implemented a Python script to estimate the minimum training set size required by a learner, based on PAC theory. Continue reading
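The script itself is behind the link, but the core of the standard PAC bound for a consistent learner over a finite hypothesis space is small enough to sketch here; the function name and the example hypothesis space are mine, not from the post:

```python
import math

def pac_sample_size(hypothesis_space_size, epsilon, delta):
    """Minimum number of training examples m for a consistent learner over
    a finite hypothesis space H to be probably (confidence 1 - delta)
    approximately (error < epsilon) correct:
        m >= (1 / epsilon) * (ln|H| + ln(1 / delta))
    """
    return math.ceil(
        (math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

# Illustrative example: boolean conjunctions over 10 features, |H| = 3^10
print(pac_sample_size(3 ** 10, epsilon=0.05, delta=0.05))  # -> 280
```

Note how the bound grows only logarithmically with the size of the hypothesis space but linearly with 1/epsilon: demanding lower error drives the required data size much harder than a richer model does.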


Bulk Insert, Update and Delete in Hadoop Data Lake

A Hadoop data lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository of data in different formats from various sources. If the data collected in a data lake is immutable, it simply accumulates in an append-only fashion and is easy to handle. Such data tends to be fact data, e.g., user behavior tracking data or sensor data.

However, dimension data or master data, e.g., customer data or product data, will typically be mutable. Generally it arrives in batches from some external source system, reflecting incremental inserts, updates and deletes in that system. All incremental changes need to be consolidated with the existing master data. This is a problem faced in many Hadoop or Hive based Big Data analytics projects. Even with the latest version of Hive, there is no bulk update or delete support. This post is about a Map Reduce job that will perform bulk insert, update and delete on data in HDFS. Continue reading
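To make the consolidation idea concrete, here is a minimal sketch in the spirit of such a job, written as a Hadoop Streaming reducer in Python rather than the post’s actual Map Reduce code; the record layout (key, timestamp, op flag I/U/D, payload) is an assumption:

```python
#!/usr/bin/env python
# Hadoop Streaming reducer sketch: consolidates incremental changes with
# existing master data. Assumed input (sorted by key by the shuffle):
#   key <TAB> timestamp <TAB> op <TAB> payload   where op is I, U or D.
import sys

def emit(key, latest):
    ts, op, payload = latest
    if op != 'D':                        # a delete drops the record entirely
        print('%s\t%s' % (key, payload))

current_key, latest = None, None
for line in sys.stdin:
    key, ts, op, payload = line.rstrip('\n').split('\t', 3)
    if key != current_key:
        if current_key is not None:
            emit(current_key, latest)    # flush the previous key
        current_key, latest = key, None
    if latest is None or int(ts) > int(latest[0]):
        latest = (ts, op, payload)       # keep only the most recent change

if current_key is not None:
    emit(current_key, latest)
```

Inserts and updates fall out of the same rule (the newest record per key wins), which is why the whole consolidation reduces to a group-by-key merge.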


Customer Service and Recommendation System

You may be wondering about the relationship I alluded to in the title. A personalization and recommendation system like sifarish bootstraps from user and item engagement data. This kind of data is gleaned from various signals, e.g., a user’s engagement with various items or a user’s explicit rating of an item. A recommendation system could benefit significantly from customer service data residing in Customer Relationship Management (CRM) systems. It’s yet another customer item engagement signal.
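As an illustration of that signal (the event names and weights below are hypothetical, not an actual sifarish schema), CRM records might be reduced to a per user, per item engagement score along these lines:

```python
# Hypothetical mapping from CRM event types to implicit engagement scores;
# event names and weights are illustrative only.
CRM_EVENT_SCORE = {
    'product_inquiry': 3,
    'support_ticket_opened': 2,
    'ticket_resolved_satisfied': 4,
    'return_requested': -2,
}

def crm_engagement(events):
    """Aggregate per (user, item) engagement from (user, item, event)
    tuples extracted from the CRM system."""
    scores = {}
    for user, item, event in events:
        key = (user, item)
        scores[key] = scores.get(key, 0) + CRM_EVENT_SCORE.get(event, 0)
    return scores
```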

In this post the focus will be on extracting customer engagement data from CRM and how it can be combined with other customer item engagement signals to define a Continue reading


Real Time Detection of Outliers in Sensor Data using Spark Streaming

As far as analytics of sensor generated data is concerned, in the Internet of Things (IoT) and in a connected-everything world, it’s mostly about real time analysis of time series data. In this post, I will be addressing a use case involving detection of outliers in sensor generated data with Spark Streaming. Outliers are data points that deviate significantly from most of the data. The implementation is part of my new open source project ruscello, written in Scala. The project is focused on real time analysis of sensor data with various IoT use cases in mind.
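As a rough illustration of the detection idea (ruscello itself is Scala on Spark Streaming; this plain-Python sliding-window z-score check is a simplified stand-in, with made-up window size and threshold):

```python
from collections import deque
import math

def zscore_outliers(readings, window=50, threshold=3.0):
    """Yield (timestamp, value) pairs that deviate from the sliding-window
    mean by more than `threshold` standard deviations."""
    win = deque(maxlen=window)
    for ts, value in readings:
        if len(win) >= 10:               # wait for enough history to judge
            mean = sum(win) / len(win)
            std = math.sqrt(sum((x - mean) ** 2 for x in win) / len(win))
            if std > 0 and abs(value - mean) / std > threshold:
                yield ts, value
        win.append(value)
```

In a streaming job the window would be kept per sensor key, with each micro batch updating the window state and flagging deviant readings as they arrive.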

The specific use case we will be addressing in this post Continue reading


Diversity in Personalization with Attribute Diffusion

One of the nagging problems in personalized recommendation systems is the crowding of items with the same attribute values in the recommendation list. For example, if you happen to like a certain artiste, songs by that artiste will tend to flood the recommended song list in a music recommendation system. The flooding happens precisely because of the accuracy of the collaborative filtering algorithm, but that accuracy may not always generate the best outcome. The only criterion for success for a recommender is whether or not the user has clicked on one or more items in the recommendation impression list.
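To illustrate one way of countering such flooding (a simplified greedy sketch, not the attribute diffusion technique this post describes), remaining items can have their scores damped each time their attribute value is picked:

```python
def diversify(scored_items, attr_of, penalty=0.2):
    """Greedy re-rank of (item, score) pairs: every time an attribute value
    (e.g. an artiste) is picked, remaining items sharing it are damped."""
    remaining = dict(scored_items)
    picked, times_picked = [], {}
    while remaining:
        item = max(
            remaining,
            key=lambda i: remaining[i]
            * (1 - penalty) ** times_picked.get(attr_of[i], 0))
        picked.append(item)
        times_picked[attr_of[item]] = times_picked.get(attr_of[item], 0) + 1
        del remaining[item]
    return picked
```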

As I mentioned in my earlier post, getting a recommendation list through Collaborative Filtering (CF) machine learning algorithms goes only part of the way towards providing an effective solution. Various post processing mechanisms are required to modify and improve the recommendation list Continue reading


Positive Feedback Driven Recommendation Rank Reordering

The basic recommendation output, consisting of the tuple (user, item, predicted rating), is easy to obtain from any Collaborative Filtering (CF) based Recommendation and Personalization engine, including sifarish. It has been reported that applying various post processing logic to the basic CF output yields a bigger return in recommendation quality than tweaking the core machine learning algorithm. We can think of these post processing units as plugins that apply to the basic recommendation output.

Many of these have been implemented in sifarish. They are not based on rigorous machine learning algorithms, but on simple intuition and heuristics. In this post I will go through one of them that was implemented recently. The actual rating of an item by a user is considered in conjunction with the predicted rating to derive a net rating. Continue reading
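The exact heuristic is behind the link; as a hedged sketch, the net rating could be a simple weighted blend of the two, with the recommendation list then re-sorted by it (the linear blend form and the weight here are my assumptions, not necessarily sifarish’s formula):

```python
def net_rating(predicted, actual=None, weight=0.3):
    """Blend the CF-predicted rating with the user's actual rating, when
    one exists. The linear blend and the weight are illustrative only."""
    if actual is None:
        return predicted
    return (1 - weight) * predicted + weight * actual
```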


Counting Unique Mobile App Users with HyperLogLog

Continuing along the theme of real time analytics with approximate algorithms, the focus this time is approximate cardinality estimation. To put the ideas in context, the use case we will be working with is counting the number of unique users of a mobile app. Analyzing the trend of such unique counts reveals valuable insights into the popularity of an app.

We will be using the HyperLogLog algorithm, which is available in my open source project hoidla as a Java API. The Storm implementation of the use case Continue reading
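For readers who want a feel for how HyperLogLog arrives at an approximate count, here is a toy Python version (not the hoidla Java API):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: p bits of the hash pick a register, and each
    register keeps the max run of leading zeros seen in the rest."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                          # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction

    def add(self, item):
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:        # small-range correction
            est = self.m * math.log(float(self.m) / zeros)
        return int(round(est))

# Usage: hll = HyperLogLog(); hll.add(user_id) per app event; hll.count()
```

The appeal for unique user counting is the memory footprint: the registers take a few kilobytes regardless of how many millions of distinct users stream through.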
