Category Archives: Scala

Encoding High Cardinality Categorical Variables with Feature Hashing on Spark

Categorical variables are ubiquitous in data. They pose a serious problem in many Data Science analysis processes. For example, many supervised Machine Learning algorithms work only with numerical data. With high cardinality categorical variables, popular encoding solutions like One Hot … Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark

Time Series Sequence Anomaly Detection with Markov Chain on Spark

There are many techniques for time series anomaly detection. This post focuses on sequence-based anomaly detection of time series data with a Markov chain. The technique is illustrated with a use case involving data from a … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Machine Learning, Outlier Detection, Scala, Spark

Elastic Search or Solr Search Result Quality Evaluation with NDCG Metric on Spark

You have built an enterprise search engine with Elastic Search or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your … Continue reading

Posted in Big Data, Data Science, elastic search, Log Analysis, Scala, Search Analytic, Solr, Spark

Plugin Framework Based Data Transformation on Spark

Data transformation is one of the key components of most ETL processes. It is well known that in most data projects, more than 50% of the time is spent on data preprocessing. In my earlier blog, a Hadoop based … Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark | 1 Comment

Handling Categorical Feature Variables in Machine Learning using Spark

Categorical feature variables, i.e. feature variables with a fixed set of unique values, appear in the training data set for many real world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms … Continue reading

Posted in Big Data, Data Science, Data Transformation, ETL, Scala, Spark

Optimizing Discount Price for Perishable Products with Thompson Sampling using Spark

For retailers, stocking perishable products is a risky business. If a product doesn’t sell out by the expiry date, the remaining inventory has to be discarded and a loss taken on those items. Retailers will do whatever is necessary … Continue reading

Posted in AI, Big Data, Data Science, Reinforcement Learning, Scala, Spark | 2 Comments

Data Type Auto Discovery with Spark

In the life of a Data Scientist, it’s not uncommon to run into a data set with little or no knowledge about the data. You may be interested in learning about such data with missing metadata through … Continue reading

Posted in Big Data, Data Profiling, Data Science, Scala, Spark