Category Archives: Data Science

Time Series Trend and Seasonality Component Decomposition with STL on Spark

You may be interested in decomposing a time series into level, trend, seasonality and remainder components to gain more insight into your time series. You may also be interested in decomposition to separate out the remainder component for anomaly detection. … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, ETL, Spark, Time Series Analytic | Tagged , , , | Leave a comment

Missing Value Imputation with Restricted Boltzmann Machine Neural Network

Missing value is a common problem in many real world data set. There are various techniques for imputing missing values. We will use a kind of Neural Network called RBM for imputing missing values. Restricted Boltzmann Machine (RBM) are stochastic … Continue reading

Posted in Data Science, Deep Learning, ETL, Machine Learning, Python | Tagged , , , | Leave a comment

Encoding High Cardinality Categorical Variables with Feature Hashing on Spark

Categorical variables are ubiquitous in data. They pose a serious problem in many Data Science analysis processes. For example, many supervised Machine Learning algorithms work only with numerical data. With high cardinality categorical variables, popular encoding solutions like One Hot … Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark | Tagged , , | Leave a comment

Time Series Sequence Anomaly Detection with Markov Chain on Spark

There are many techniques for time series anomaly detection. In this post, the focus is on sequence based anomaly detection of time series data with Markov Chain. The technique will be elucidated with a use case involving data from a … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Machine Learning, Outlier Detection, Scala, Spark | Tagged , , , , | 1 Comment

Six Unsupervised Extractive Text Summarization Techniques Side by Side

In text summarization, we create a summary of the original content that is coherent and captures the salient points in the original content. There are various important usages of text summarization. Something we face almost every day is the text … Continue reading

Posted in Data Science, NLP, Python, Text Analytic, Text Mining | Tagged , | Leave a comment

Elastic Search or Solr Search Result Quality Evaluation with NCDG Metric on Spark

You have built an enterprise search engine with Elastic Search or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your … Continue reading

Posted in Big Data, Data Science, elastic search, Log Analysis, Scala, Search Analytic, Solr, Spark | Tagged , , , | Leave a comment

Plugin Framework Based Data Transformation on Spark

Data transformation is one of the key components in most ETL process. It is well known, that in most data projects, more than 50% of the time in spent in data pre processing. In my earlier blog, a Hadoop based … Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark | Tagged , | 1 Comment