Real Time Detection of Outliers in Sensor Data using Spark Streaming

When it comes to analytics of sensor generated data in the Internet of Things (IoT) and a connected everything world, it’s mostly about real time analytics of time series data. In this post, I will address a use case that involves detecting outliers in sensor generated data with Spark Streaming. Outliers are data points that deviate significantly from most of the data. The implementation is part of my new open source project ruscello, implemented in Scala. The project is focused on real time analytics of sensor data with various IoT use cases in mind.

The specific use case we will be addressing in this post has to do with temperature data from temperature controlled shipping containers.

Temperature Controlled Shipping Containers

Consider some product being shipped in temperature controlled containers. The customer has a Service Level Agreement (SLA) with the transportation company, which defines how the temperature is to be maintained within a predefined range. Here are some examples of SLAs that have been implemented.

  1. Mean temperature within a time window has to be below some predefined upper threshold or above some predefined lower threshold.
  2. Some minimum percentage of the data within a time window has to be below some upper threshold or above some lower threshold.

The second SLA is more tolerant of noisy data, i.e., data with a low signal to noise ratio. All data points that cause the SLA to be violated are outliers. In Machine Learning parlance, the problem we are solving is supervised outlier detection. It’s supervised because we are specifying the outlier conditions explicitly through the SLA.
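The two SLA types above can be sketched in a few lines of Python. This is only an illustration, not code from ruscello or hoidla: the function names, the thresholds and the sample readings are all hypothetical.

```python
from statistics import mean

def mean_sla_violated(window, lower, upper):
    """SLA 1: violated when the window mean falls outside [lower, upper]."""
    m = mean(window)
    return m < lower or m > upper

def fraction_sla_violated(window, lower, upper, min_fraction):
    """SLA 2: violated when fewer than min_fraction of readings are inside [lower, upper]."""
    inside = sum(1 for t in window if lower <= t <= upper)
    return inside / len(window) < min_fraction

# Hypothetical window: one noisy spike in otherwise steady readings.
readings = [40.1, 39.8, 40.3, 55.0, 40.0]

sla1 = mean_sla_violated(readings, 35.0, 42.0)            # spike drags the mean above 42
sla2 = fraction_sla_violated(readings, 35.0, 42.0, 0.75)  # 4 of 5 readings are still in range
```

With these made up numbers the single spike breaches the mean based SLA but not the percentage based one, which is the sense in which the second SLA is more tolerant of noise.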

In time series analysis literature, this problem is also known as level shift detection. There are other techniques for detecting level shift, e.g., computing a histogram for the sliding window data and detecting any notable change in the histogram mode.
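As a rough sketch of the histogram mode idea, assuming fixed width bins and a comparison between two consecutive windows (the function names, the bin width and the min_bins cutoff are my own illustrative choices, not something taken from ruscello):

```python
from collections import Counter

def histogram_mode(window, bin_width):
    """Return the lower edge of the most populated histogram bin."""
    bins = Counter(int(v // bin_width) for v in window)
    # Highest count wins; ties are broken in favor of the lower bin.
    best_bin, _ = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))
    return best_bin * bin_width

def mode_shifted(prev_window, curr_window, bin_width, min_bins=2):
    """Flag a level shift when the mode moves by at least min_bins bins."""
    shift = abs(histogram_mode(curr_window, bin_width)
                - histogram_mode(prev_window, bin_width))
    return shift >= min_bins * bin_width

# Hypothetical windows before and after a shift to a higher level.
before = [40.2, 40.5, 39.9, 40.1, 40.7]
after = [47.3, 47.8, 47.1, 46.9, 47.5]
shifted = mode_shifted(before, after, bin_width=1.0)
```

Because the mode ignores a few stray readings, this kind of check reacts to a sustained change in level rather than to isolated noise.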

The transportation company is responsible for detecting all incidents of SLA violation and reporting them to the customer. The customer may claim damages when the SLA is violated.

IoT and Big Data Analytics

According to Cisco, there will be 21 billion connected devices by 2018. All these devices will be spewing data, creating a data deluge. But not all of the data may end up in a Big Data platform.

Although Big Data analytics will play a significant role in IoT, we must be cognizant of the fact that with IoT, some of the data processing may happen at the edge of the sensor network. This has been called fog computing. There may be hub nodes in a sensor network which collect and aggregate data from a set of sensors.

The data arriving at the Big Data platform may not always be the raw sensor data. It may be data aggregated and preprocessed at the network edge. For example, inside a sensor network a hub may aggregate data from multiple sensors and, based on some computation, may decide to activate some actuator. The hub may send data related to the decision it made to a Big Data platform, instead of the raw data. The Big Data platform will still have a big role in IoT, however some of the computation will happen at the edge.

If the raw sensor data needs to be stored in a NOSQL database, it always has to be sent to the back end Big Data platform. If the insight being sought through analytics needs a global context, e.g., other sensor data or some metadata, then all the data also needs to be collected and processed in a Big Data platform.

Quick Tour of Spark Streaming

Spark Streaming architecture is fundamentally different from other real time stream processing systems. It has the following notable characteristics.

  1. Messages are processed in micro batches, where the stream is essentially a sequence of RDDs. The batch size is defined by a batch interval set by the user.
  2. It may not sound very intuitive, but Spark Streaming throughput has been reported to be higher than that of Storm, especially for small messages.
  3. RDDs from the stream are processed just like RDDs in normal offline Spark processing.
  4. Recovery from node failure is handled through RDD lineage. Recovery is parallelized and fast.
  5. Stragglers are handled effectively through speculative execution.
  6. Spark unifies batch and real time processing in one system, which is a big plus for many projects.

As you can tell, Spark Streaming is a powerful platform for real time stream processing. The fact that it leverages many features of the underlying core engine for batch processing is very clever.
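To make the micro batch idea concrete, here is a plain Python illustration that groups timestamped messages by batch interval. It does not use the Spark API at all; the function name and the sample messages are hypothetical, and each inner list merely stands in for one RDD of the stream.

```python
from collections import defaultdict

def micro_batches(messages, batch_interval):
    """Group (timestamp, payload) messages into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for ts, payload in messages:
        batches[int(ts // batch_interval)].append(payload)
    # Intervals with no messages are simply skipped in this sketch.
    return [batches[k] for k in sorted(batches)]

# Hypothetical messages with arrival times in seconds.
messages = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = micro_batches(messages, batch_interval=1)
```

Here the first two messages land in the same one second batch, while the other two each get a batch of their own.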

Outlier Detection in Streaming Temperature Sensor Data

The sensor data generation is simulated through a Python script. The temperature stays at the desired level except for some random noise superimposed. At some random time, the temperature shifts to an upper or lower level, again with some random noise superimposed. It stays there for a while and then goes back to the normal level. The sensor data has three fields, as listed below.

  1. Sensor ID
  2. Time stamp
  3. Temperature
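A minimal sketch of such a generator is shown below. It is not the actual script from the project; the comma separated record layout, the parameter names and all default values are assumptions made for illustration.

```python
import random

def simulate_sensor(sensor_id, n, normal=40.0, shift=8.0, noise=0.5,
                    shift_prob=0.02, shift_len=20, interval=5, seed=42):
    """Yield 'sensorID,timestamp,temperature' lines with occasional level shifts."""
    rng = random.Random(seed)
    level, remaining = normal, 0
    ts = 1_400_000_000  # arbitrary start time in epoch seconds
    for _ in range(n):
        if remaining == 0 and rng.random() < shift_prob:
            # Jump to an upper or lower level for shift_len samples.
            level = normal + rng.choice([-shift, shift])
            remaining = shift_len
        elif remaining > 0:
            remaining -= 1
            if remaining == 0:
                level = normal  # shift is over, back to the normal level
        temp = level + rng.gauss(0.0, noise)  # superimpose random noise
        yield f"{sensor_id},{ts},{temp:.2f}"
        ts += interval

lines = list(simulate_sensor("U4W8U4L3", 100))
```

Each generated line could then be written to a socket or a file, one record per line.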

The data is piped to a socket server. In Spark Streaming, the data is ingested by socket text stream receivers. The data is partitioned by sensor ID and processed through windows. Kafka and HDFS are also supported as data stream sources, although they are not fully tested yet.

With Kafka, the data stream ingestion by Spark receivers could be parallelized by having multiple input DStreams, if the sensor data in Kafka is partitioned by the sensorID.

The project ruscello is essentially a Spark shell around hoidla, which is a Java library for streaming algorithms. Windowing functions from hoidla are used. Windowing in hoidla offers the following powerful features.

  1. Size bound window or time bound window.
  2. Various window population techniques, including a simple sliding window with every data point added to the window and more complex techniques, e.g., reservoir sampling.
  3. Various statistical parameter calculations on windowed data, e.g., standard deviation, entropy etc.
  4. A simple predicate logic based expression evaluator, e.g., mean(window) < 60 and stdDev(window) < 1.2. Currently only a disjunction of conjunctions is supported.

The condition expression is a disjunction of a set of conjunctive expressions. Each conjunctive expression is a conjunction of a set of predicates. Predicates are either relational or functional.
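The structure of such an expression can be sketched in Python as a list of lists, where the outer list is the disjunction and each inner list is a conjunction. This is not hoidla's evaluator or its expression syntax; the tuple representation, the FUNCTIONS and OPERATORS tables and the use of population standard deviation are assumptions, and only relational predicates over window functions are covered.

```python
from statistics import mean, pstdev

# Hypothetical window functions and relational operators available to predicates.
FUNCTIONS = {"mean": mean, "stdDev": pstdev}
OPERATORS = {"<": lambda a, b: a < b, ">": lambda a, b: a > b}

def evaluate(expression, window):
    """Evaluate a disjunction (outer list) of conjunctions (inner lists).

    Each predicate is a (function name, operator, threshold) tuple.
    """
    return any(
        all(OPERATORS[op](FUNCTIONS[fn](window), threshold)
            for fn, op, threshold in conjunction)
        for conjunction in expression
    )

# mean(window) < 60 and stdDev(window) < 1.2
expr = [[("mean", "<", 60), ("stdDev", "<", 1.2)]]
steady = [58, 59, 58.5, 59.5]
noisy = [58, 70, 50, 59]
steady_ok = evaluate(expr, steady)
noisy_ok = evaluate(expr, noisy)
```

Adding a second inner list to expr adds an alternative: the whole expression is true as soon as any one conjunction is fully satisfied.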

Since we are dealing with time series data, we should be using a time bound window. However, I made the assumption that data samples arrive at regular intervals and the variability in sampling period is negligible. Accordingly, I have used a size bound window. Knowing the sampling rate and the desired time span of the window, we can easily find the window size in terms of number of data points.
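The conversion is simple arithmetic. In the sketch below the function name and the sample numbers are hypothetical, and rounding up is my own choice so that the window always covers at least the desired time span.

```python
import math

def window_size(time_span_sec, sampling_interval_sec):
    """Number of data points needed to cover time_span_sec at a fixed sampling interval."""
    return math.ceil(time_span_sec / sampling_interval_sec)

# A 5 minute window with one reading every 5 seconds.
size = window_size(300, 5)
```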

The window object is held inside a state object, leveraging the state management feature of Spark Streaming. Because of partitioning based on sensorID, there is a separate state object for each sensor. I could not use Spark’s native window function, because it did not meet my requirements.

As each data point arrives, it’s added to the window and the SLA condition expression is evaluated. If the evaluation returns true, then the tuple (timestamp, mean) is appended to a list inside the state object.
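The per sensor bookkeeping can be sketched in plain Python as follows. This is not the ruscello implementation and it does not use Spark's state management API; the class and function names are hypothetical, the SLA is reduced to a single upper limit on the window mean, and waiting for a full window before evaluating is my own assumption.

```python
from collections import deque
from statistics import mean

class SensorState:
    """Holds the sliding window and the violation list for one sensor."""

    def __init__(self, window_size, upper):
        self.window = deque(maxlen=window_size)  # oldest reading drops out automatically
        self.upper = upper
        self.violations = []

    def add(self, timestamp, temperature):
        self.window.append(temperature)
        if len(self.window) == self.window.maxlen:  # evaluate only on a full window
            m = mean(self.window)
            if m > self.upper:
                self.violations.append((timestamp, m))

def process(lines, window_size, upper):
    """Route 'sensorID,timestamp,temperature' lines to one state object per sensor."""
    states = {}
    for line in lines:
        sensor_id, ts, temp = line.split(",")
        if sensor_id not in states:
            states[sensor_id] = SensorState(window_size, upper)
        states[sensor_id].add(int(ts), float(temp))
    return states

data = ["A,1,40", "A,2,40", "A,3,50", "A,4,50", "B,1,40", "B,2,41", "B,3,40"]
states = process(data, window_size=2, upper=44.0)
```

In this toy run sensor A accumulates two (timestamp, mean) tuples once its readings jump to 50, while sensor B stays clean.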

SLA Breach Output

Finally, Spark returns a stream of RDDs, where each RDD is comprised of (sensorID, state object) pairs. We query the state object for the number of violations and return the number of violations for each sensor. We see the number of violations increasing with time. Here is some sample output.

device:U4W8U4L3 num violations:102
device:HCEJRWFP num violations:194
device:U4W8U4L3 num violations:102
device:HCEJRWFP num violations:194
device:U4W8U4L3 num violations:247
device:HCEJRWFP num violations:411

We could also produce a more detailed output containing the time stamp and mean temperature reading for each violation of each sensor.

Summing Up

Spark Streaming is a powerful real time stream processing platform. We have shown its usefulness using IoT sensor data analytics as an example. IoT related data is expected to have explosive growth and Spark Streaming can play a critical role in the analysis of such data.

If you are interested in running this use case, here is a tutorial document. You will find other associated configuration files, shell scripts etc. in the same github directory.


For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.