Category Archives: Statistics

Time Series Sequence Anomaly Detection with Markov Chain

There are many algorithms for anomaly detection in time series, including deep learning based solutions. Anomalies are of two types: point and sequence. Sequence anomaly detection is generally of more interest for time series data. In this post we … Continue reading

Posted in Anomaly Detection, Data Science, Machine Learning, Python, Statistics, time series

Stock Portfolio Balancing with Monte Carlo Simulation

Portfolio balancing is a complex optimization problem. It can be stated as assigning weights to the stocks in a portfolio so that a metric called the Sharpe ratio is maximized. In this post we will see how Monte Carlo … Continue reading

Posted in Data Science, Python, Simulation, Statistics | 1 Comment

Concept Drift Detection Techniques with Python Implementation for Supervised Machine Learning Models

Concept drift is a serious problem for machine learning models deployed in production. Concept drift occurs when there is a significant change in the underlying data generation process, causing a significant shift in the posterior distribution p(y|x). Concept drift is manifested as significant increase … Continue reading

Posted in Data Science, Machine Learning, Python, Statistics | 3 Comments

Learn about Your Data with about Hundred Data Exploration Functions All in One Python Class

It’s a costly mistake to jump straight into building machine learning models before getting good insight into your data. I have made that mistake and paid the price. Since then I made a resolution to learn about the data … Continue reading

Posted in Data Science, Python, Statistics | 7 Comments

Monte Carlo Simulation Library in Python with Project Cost Estimation as an Example

I was working on a solution for change point detection in time series, which led me to a certain two-sample statistic for which critical values didn’t exist. The only option was to simulate the statistic values and estimate critical values … Continue reading

Posted in Data Science, Python, Statistics

Model Drift Detection with Kolmogorov Smirnov Statistic on Spark

In the retail business, you may be using various business solutions based on product demand data, e.g., inventory management or tracking how a newly introduced product performs over time. The buying behavior model may change with time, rendering those … Continue reading

Posted in Data Science, Machine Learning, Spark, Statistics

Synthetic Training Data Generation for Machine Learning Classification Problems using Ancestral Sampling

Lack of access to a good training data set is a serious impediment to building supervised machine learning models. Such data is scarce and, when available, its quality may be questionable. Even if a good quality data set is available, … Continue reading

Posted in Python, Statistics, Supervised Learning | 1 Comment

Normal Distribution Fitness Test with Chi Square on Spark

Many machine learning models are based on certain assumptions made about the data. For example, in z-score based anomaly detection, it is assumed that the data has a normal distribution. Your machine learning model will be as good as how those … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark, Statistics

Time Series Seasonal Cycle Detection with Auto Correlation on Spark

There are many benefits of auto correlation analysis on time series data, as we will discuss in detail later. It allows us to gain important insights into the nature of the time series data. Cycle detection is one … Continue reading

Posted in Big Data, Correlation, Spark, Statistics, Time Series Analytic | 3 Comments

Mobile Phone Usage Data Analytics for Effective Marketing Campaign

Insights gained from analyzing mobile phone usage data can be extremely valuable in marketing campaigns and customer engagement efforts. For example, the hour of the day when a user engages most with his or her mobile device could be used to … Continue reading

Posted in Big Data, Data Profiling, Marketing Analytic, Spark, Statistics