Category Archives: Spark

eCommerce Order Processing System Monitoring with Isolation Forest Based Anomaly Detection on Spark

Timely delivery of orders is critical to customer satisfaction in any retail eCommerce business. It’s even more critical for orders with a time-bound delivery guarantee. Retail eCommerce businesses generally use order processing workflow systems, which are state machines where state transition … Continue reading

Posted in Anomaly Detection, Data Science, eCommerce, Scala, Spark

Time Series Change Point Detection with Two Sample Statistic on Spark with Application for Retail Sales Data

The goal of change point detection is to detect the times when statistically significant and sustained changes happen in a time series. It has a wide range of applications in various domains including retail, medical, IoT, finance, business and meteorology. In … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Scala, Spark, Time Series Analytic

Detecting Quarantine Violation from Mobile Phone Location Anomaly on Spark

With the world under siege from the coronavirus, you might find this topic timely. Any epidemic outbreak has two main aspects: spread and containment. There are various strategies for containing epidemic spread. One of them is to … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Scala, Spark

Model Drift Detection with Kolmogorov Smirnov Statistic on Spark

In a retail business, you may be using various business solutions based on product demand data, e.g., inventory management or tracking how a newly introduced product performs over time. The buying behavior model may change with time, rendering those … Continue reading

Posted in Data Science, Machine Learning, Spark, Statistics

Contextual Data Completeness Metric Computation on Spark

Data quality is critical for the healthy operation of any data-driven enterprise. There are various kinds of data quality metrics. In this post, the focus will be on the completeness of data. Data quality from a completeness point of … Continue reading

Posted in Big Data, Data Science, ETL, Spark

Time Series Trend and Seasonality Component Decomposition with STL on Spark

You may be interested in decomposing a time series into level, trend, seasonality and remainder components to gain more insight into your time series. You may also be interested in decomposition to separate out the remainder component for anomaly detection. … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, ETL, Spark, Time Series Analytic

Encoding High Cardinality Categorical Variables with Feature Hashing on Spark

Categorical variables are ubiquitous in data. They pose a serious problem in many Data Science analysis processes. For example, many supervised Machine Learning algorithms work only with numerical data. With high cardinality categorical variables, popular encoding solutions like One Hot … Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark | 2 Comments

Time Series Sequence Anomaly Detection with Markov Chain on Spark

There are many techniques for time series anomaly detection. In this post, the focus is on sequence based anomaly detection of time series data with Markov Chain. The technique will be elucidated with a use case involving data from a … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Machine Learning, Outlier Detection, Scala, Spark | 1 Comment

Elastic Search or Solr Search Result Quality Evaluation with NDCG Metric on Spark

You have built an enterprise search engine with Elastic Search or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your … Continue reading

Posted in Big Data, Data Science, elastic search, Log Analysis, Scala, Search Analytic, Solr, Spark

Plugin Framework Based Data Transformation on Spark

Data transformation is one of the key components in most ETL processes. It is well known that in most data projects, more than 50% of the time is spent in data preprocessing. In my earlier blog, a Hadoop based … Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark | 2 Comments