Monday, 27 February 2017

Rule-Based Fault Management for Environmental Monitoring IoT system

Final report

Description of the project

The idea of the project was to build a fault management solution for distributed IoT systems. Fault management, along with security, is among the most important management features of IoT networks. The solution had to demonstrate three main fault management functions:
  • detect faults in the IoT system from symptoms (events and messages containing raw data about faults);
  • isolate and diagnose the causes of faults;
  • apply fault recovery procedures.
These functions had to be implemented using the rule-based reasoning technique.
An environmental monitoring IoT system was selected as the example system to diagnose.

The current status

The following is already done:
  • the hardware part of the environmental monitoring system created;
  • the sensor node software fully implemented;
  • the Raspberry Pi based field IoT gateway configured to work as the Wireless Access Point;
  • Mosquitto MQTT broker installed and configured on the IoT gateway component;
  • the IoT gateway software partially implemented (this work is in progress, see GitHub repo with the sources);
  • one DigitalOcean 1 GB CentOS 7 droplet created;
  • Docker and Docker Compose installed on the droplet;
  • Docker daemon secured with TLS;
  • Eclipse Hono v0.5-M3 deployed on the droplet and tested.
Well, there is still much to be done :)

Changes in the components of the solution

The current changes concern the implementation of the sensor node and field gateway components.

The sensor node component's ESP-12 modules were replaced with SparkFun ESP8266 Things. The Raspberry Pi based IoT gateway is now configured as the WAP for the sensor nodes and is connected to the router over a LAN connection.
I initially considered Kura for the IoT gateway, but after some research I think it is overcomplicated for this project. I needed a bidirectional MQTT-AMQP bridge to connect the Mosquitto broker, which runs locally on the Raspberry Pi, to the Eclipse Hono AMQP telemetry endpoint, which runs on the DigitalOcean droplet. As far as I can see, there is no direct way to implement this with Kura: I would have to develop a bundle from scratch to support the request-response scenario for the local Mosquitto. Also, Kura's CloudService is EDC specific and does not seem to fit the Hono API. So I decided to implement a custom IoT micro-gateway using Apache Camel and run it as a Linux service on the IoT gateway component. This makes it possible to connect to Hono through the Hono API over AMQP.

Lessons learned

In hindsight, the declared scope of the project turned out to be too broad for the stated time frame. Unfortunately, as I work full-time, I could not always fully engage in the Challenge.

In conclusion I want to note that this is definitely not the end of the journey. I will update this blog regularly. So stay tuned :)

Saturday, 18 February 2017

Rule-Based Fault Management for Environmental Monitoring IoT system

Sensor node software

In this post I'm showing the software part of the sensors. The code is written in the Arduino IDE. Libraries used:
  • ESP - provides ESP8266 specific functions;
  • Wire - provides I2C protocol support;
  • ESP8266WiFi - provides WiFi related functions;
  • EEPROM - provides access to persistent data storage;
  • PubSubClient - a client library for MQTT support.
The software is divided into two parts: a) the BME280 I2C driver library and b) the Arduino sketch with the sensor functionality. Below I describe the implementation of each part.

BME280 driver

The BME280 sensor driver is implemented as an Arduino library. The BME280 datasheet was the main reference for the implementation. The functional spec of the library is as follows:
  • only the I2C interface is supported;
  • only Forced mode is supported;
  • the BME280 I2C address can be set dynamically;
  • the SDA and SCL pins of the I2C interface can be set dynamically;
  • oversampling isn't used (acceptable for environmental measurements);
  • the IIR filter isn't used (also acceptable for environmental measurements).
The public API of the library includes two constructors and three methods.
The UML diagram depicts the library usage workflow:

The diagram is self-explanatory. One thing is worth mentioning: according to the BME280 datasheet (see 3.3.3. Forced mode), Forced mode has to be selected again for each subsequent measurement cycle. For this reason, the last statement of the readAll() method sets the sensor mode on every call (5.4.5 Register 0xF4 "ctrl_meas").
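As a rough illustration of what re-selecting Forced mode involves, here is a minimal sketch of composing the ctrl_meas byte. The bit layout comes from the datasheet; the function and constant names are hypothetical and are not taken from the driver, and I'm assuming that "oversampling isn't used" corresponds to the x1 setting for temperature and pressure.

```cpp
#include <cstdint>

// Register address and field encodings from the BME280 datasheet (section 5.4.5).
// All names here are hypothetical, chosen for this illustration only.
constexpr uint8_t BME280_REG_CTRL_MEAS = 0xF4;
constexpr uint8_t BME280_OSRS_X1       = 0x01;  // oversampling x1 (assumed setting)
constexpr uint8_t BME280_MODE_FORCED   = 0x01;  // 0b01 (0b10 also selects Forced mode)

// Packs ctrl_meas: osrs_t in bits 7..5, osrs_p in bits 4..2, mode in bits 1..0.
constexpr uint8_t makeCtrlMeas(uint8_t osrsT, uint8_t osrsP, uint8_t mode) {
    return static_cast<uint8_t>(((osrsT & 0x07) << 5) |
                                ((osrsP & 0x07) << 2) |
                                (mode & 0x03));
}
```

On the node, the resulting byte would be written to register 0xF4 over I2C at the end of every measurement cycle, since the sensor returns to sleep mode after each forced measurement.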
The BME280 ADC output values for temperature, pressure and humidity are compensated using formulas from the datasheet (4.2.3 Compensation formulas).
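For reference, the temperature part of the compensation can be sketched as follows. The arithmetic follows the 32-bit integer formula published in the datasheet; the struct and function names are hypothetical and may differ from the ones used in the driver.

```cpp
#include <cstdint>

// Temperature calibration words read from the sensor (dig_T1..dig_T3 in the datasheet).
// The struct and function names are hypothetical.
struct Bme280TempCalib {
    uint16_t digT1;
    int16_t  digT2;
    int16_t  digT3;
};

// Returns the temperature in units of 0.01 DegC and stores the intermediate
// "fine" temperature in tFine, which the pressure and humidity formulas reuse.
// Relies on arithmetic right shift of negative values, as the datasheet code does.
int32_t compensateTemperature(int32_t adcT, const Bme280TempCalib& c, int32_t& tFine) {
    const int32_t t1 = static_cast<int32_t>(c.digT1);
    const int32_t var1 = (((adcT >> 3) - (t1 << 1)) * static_cast<int32_t>(c.digT2)) >> 11;
    const int32_t d = (adcT >> 4) - t1;
    const int32_t var2 = (((d * d) >> 12) * static_cast<int32_t>(c.digT3)) >> 14;
    tFine = var1 + var2;
    return (tFine * 5 + 128) >> 8;
}
```

Dividing the returned value by 100 gives the temperature in DegC, which matches the unit published in the temperature topic.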
You can find sources for the BME280 driver in the GitHub repo.

Arduino sketch

The project's sensor node software is implemented as an Arduino sketch. The main features implemented in the sketch:
  • a WiFi communication channel with the field IoT gateway;
  • an MQTT client that provides connectivity with the MQTT broker running on the field IoT gateway;
  • bidirectional communication with the BME280 sensor;
  • persistent storage that preserves the current state of the sensor node across restarts;
  • three modes of operation:
    • active;
    • power save;
    • suspended;
  • a one-way message exchange pattern over MQTT for environmental telemetry data messages;
  • a request-reply message exchange pattern over MQTT for control messages sent from the field gateway to the sensor node.


The UML sequence diagram depicts the startup process after powering the sensor node:

The program starts by reading the MAC address, which is used as the MQTT client name and is included in the MQTT topic tree structure. Then the BME280 is initialized over I2C. Next, the EEPROM memory is allocated: 5 bytes in total (4 bytes for the sleep period value in seconds + 1 byte for the current mode id). After that, a WiFi connection is established with the WiFi access point hosted on the Raspberry Pi. The same Raspberry Pi runs the local Mosquitto MQTT broker instance and the field IoT gateway software. The remaining steps set up the MQTT PubSub client callback function, initialize the MQTT topic names and process the current operating mode of the sensor node.
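The 5-byte state record could be laid out as in the sketch below. This is only an illustration: the struct, the function names, the field order and the little-endian byte order are my assumptions here, not necessarily what the sketch does.

```cpp
#include <cstddef>
#include <cstdint>

// 4 bytes for the sleep period in seconds + 1 byte for the current mode id.
constexpr std::size_t STATE_SIZE = 5;

// Hypothetical in-memory view of the persisted node state.
struct NodeState {
    uint32_t sleepSeconds;
    uint8_t  modeId;
};

// Serializes the state into a 5-byte buffer (little-endian, assumed layout).
void packState(const NodeState& s, uint8_t (&buf)[STATE_SIZE]) {
    buf[0] = static_cast<uint8_t>(s.sleepSeconds & 0xFF);
    buf[1] = static_cast<uint8_t>((s.sleepSeconds >> 8) & 0xFF);
    buf[2] = static_cast<uint8_t>((s.sleepSeconds >> 16) & 0xFF);
    buf[3] = static_cast<uint8_t>((s.sleepSeconds >> 24) & 0xFF);
    buf[4] = s.modeId;
}

// Restores the state from the same 5-byte layout.
NodeState unpackState(const uint8_t (&buf)[STATE_SIZE]) {
    NodeState s;
    s.sleepSeconds = static_cast<uint32_t>(buf[0]) |
                     (static_cast<uint32_t>(buf[1]) << 8) |
                     (static_cast<uint32_t>(buf[2]) << 16) |
                     (static_cast<uint32_t>(buf[3]) << 24);
    s.modeId = buf[4];
    return s;
}
```

On the ESP8266 the bytes would then go through the EEPROM library, for example EEPROM.begin(5) at startup, EEPROM.write() for each byte and EEPROM.commit() to persist them.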
The following topics are configured by the sensor node:

Topic name template | Topic name example | Topic type | Purpose
env/<MAC_address>/status | env/5ccf7f2f1d04/status | Publish | Sensor node state messages. Possible values: 'off', 'sleeping', 'on'
env/<MAC_address>/temperature | env/5ccf7f2f1d04/temperature | Publish | Telemetry data messages: temperature in DegC
env/<MAC_address>/pressure | env/5ccf7f2f1d04/pressure | Publish | Telemetry data messages: atmospheric pressure in hPa
env/<MAC_address>/humidity | env/5ccf7f2f1d04/humidity | Publish | Telemetry data messages: humidity in %RH
env/<MAC_address>/request | env/5ccf7f2f1d04/request | Subscribe | Command messages: request
 |  |  | Command messages: reply
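Building these topic names from the MAC address can be sketched like this. The helper names are hypothetical; the only things taken from the post are the env/<MAC_address>/<name> pattern and the lowercase hex form of the MAC without separators.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a 6-byte MAC address as 12 lowercase hex digits, e.g. "5ccf7f2f1d04".
std::string macToNodeId(const uint8_t (&mac)[6]) {
    char buf[13];
    std::snprintf(buf, sizeof(buf), "%02x%02x%02x%02x%02x%02x",
                  mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return std::string(buf);
}

// Builds a topic name following the env/<MAC_address>/<leaf> template.
std::string makeTopic(const std::string& nodeId, const std::string& leaf) {
    return "env/" + nodeId + "/" + leaf;
}
```

Because the MAC address is unique per node, the same firmware image can run on both sensor nodes without any per-device topic configuration.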

The preprocessMode() function implements a conditional workflow that depends on the current mode of the sensor node. Three operational modes are supported: active, power save (PWR_SAVE) and suspended (SUSPENDED).

According to SparkFun's ESP8266 Thing Hookup Guide, the XPD pin has to be connected to the DTR pin to enable the sleep capability.

The UML activity diagram represents the details of the workflow:

Two interesting points here:
1. As the Arduino Client for MQTT only supports Clean Sessions (see for example this note), command messages addressed to the sensor node can only be delivered while the node is connected to the local MQTT broker. In other words, the node never receives a command message that was sent while it was in deep sleep and disconnected from the broker. As a workaround, before going to deep sleep in PWR_SAVE or SUSPENDED mode, the node waits a fixed interval of 5 seconds for an incoming command message and then runs the PubSub client's loop() method to process incoming messages.
2. Here is the strange thing: I never managed to process an incoming request message successfully when loop() was called only once; the callback function was never called in that case. I was able to fix it by inserting a second loop() call.

First integration test

I ran integration tests using the following setup:

I created several gifs to visualize the tests.
This gif reflects the following scenario:
1. The sensor node is configured to connect to the Mosquitto MQTT broker, which runs on the Raspberry Pi based field IoT gateway;
2. The mqtt-spy utility is configured to connect to the same MQTT broker;
3. Four MQTT subscriptions are created in mqtt-spy for the node's status, temperature, humidity and pressure topics;
4. The sensor node runs in PWR_SAVE mode and is initially in the deep sleep state;
5. The message in the env/5ccf7f2f1dc8/status topic has the payload 'off', which corresponds to the payload of the node's retained message;
6. When the power save period ends, the sensor node establishes a wireless MQTT connection with the broker;
7. The node publishes a new message to env/5ccf7f2f1dc8/status with the payload 'on', indicating that the status of the node has changed.
8. The node reads new environmental data from the BME280 and publishes three new messages to the env/5ccf7f2f1dc8/temperature, env/5ccf7f2f1dc8/humidity and env/5ccf7f2f1dc8/pressure topics (see Preprocess Mode Activity Diagram).
9. The node publishes a new message to env/5ccf7f2f1dc8/status with the payload 'sleeping', indicating that the status of the node has changed.
10. The node goes into the deep sleep state.
11. The broker closes the network connection, as it doesn't receive any packets from the sensor node within one and a half times the Keep Alive period.
12. The broker publishes the last will message of the node to env/5ccf7f2f1dc8/status with the payload 'off'.

Commands and command messages

The sensor node supports request-reply message exchange for command messages. The current implementation supports three commands:
  • getsensid - returns the BME280 sensor identifier in hex string format;
  • getbattery - returns the supply voltage (VCC) value;
  • setmode - sets the operational mode of the node.
A very lightweight application level protocol was created for the command messages. The payload of a command message is formatted as a CSV string.

Request command message format:
<msg_id>,<command_name>,<param 1>,<param 2>,...,<param n>
  msg_id - unique identifier of the message (required),
  command_name - predefined command name (required),
  param 1, param 2, param n - the command input parameters (optional). 

Reply command message format:
<correl_id>,<status>,<payload>
  correl_id - correlation identifier, must match the msg_id value of the corresponding request message (required),
  status - command execution status; 200 is the only value supported in the current implementation (required),
  payload - command output payload (optional).


Examples:

  Request command message payload: 0001,getbattery
  Reply command message payload: 0001,200,3153

  Request command message payload: 0002,setmode,1
  Reply command message payload: 0002,200
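A minimal sketch of parsing a request payload and formatting a reply in this CSV protocol is shown below. It is written in plain standard C++ so it can be tried off-device; the type and function names are hypothetical, and the real sketch may handle strings differently on the ESP8266.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of "<msg_id>,<command_name>,<param 1>,...,<param n>".
struct CommandRequest {
    std::string msgId;
    std::string name;
    std::vector<std::string> params;
};

// Splits the payload on commas. Returns false when a required field
// (msg_id or command_name) is missing or empty.
bool parseRequest(const std::string& payload, CommandRequest& out) {
    std::vector<std::string> fields;
    std::stringstream ss(payload);
    std::string field;
    while (std::getline(ss, field, ',')) {
        fields.push_back(field);
    }
    if (fields.size() < 2 || fields[0].empty() || fields[1].empty()) {
        return false;
    }
    out.msgId = fields[0];
    out.name = fields[1];
    out.params.assign(fields.begin() + 2, fields.end());
    return true;
}

// Builds "<correl_id>,<status>" and appends ",<payload>" only when there is one.
std::string formatReply(const std::string& correlId, int status,
                        const std::string& payload = "") {
    std::string reply = correlId + "," + std::to_string(status);
    if (!payload.empty()) {
        reply += "," + payload;
    }
    return reply;
}
```

Copying msg_id into correl_id is what lets the gateway match a reply to its request, since requests and replies travel over MQTT topics rather than over a dedicated connection.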

Command messages are handled by the message callback function of the MQTT PubSub client. The callback function dispatches the command from the message to the corresponding command handler function.
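The dispatch step can be pictured as a lookup table from command name to handler, as in the sketch below. This is not the sketch's actual code: the handler names are hypothetical and the returned values are placeholders (the battery value simply reuses the number from the example above).

```cpp
#include <cstdlib>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Params  = std::vector<std::string>;
using Handler = std::function<std::string(const Params&)>;

// Hypothetical in-memory copy of the current mode id.
static int currentModeId = 0;

// Placeholder handlers: on the node they would query the BME280, read VCC
// and persist the new mode. The return values here are illustrative only.
std::string handleGetSensId(const Params&)  { return "60"; }
std::string handleGetBattery(const Params&) { return "3153"; }
std::string handleSetMode(const Params& p) {
    if (!p.empty()) {
        currentModeId = std::atoi(p[0].c_str());
    }
    return "";  // setmode has no output payload
}

// Looks up the handler for a command name and runs it.
// Returns false for an unknown command, leaving result untouched.
bool dispatchCommand(const std::string& name, const Params& params, std::string& result) {
    static const std::map<std::string, Handler> table = {
        {"getsensid",  handleGetSensId},
        {"getbattery", handleGetBattery},
        {"setmode",    handleSetMode},
    };
    const auto it = table.find(name);
    if (it == table.end()) {
        return false;
    }
    result = it->second(params);
    return true;
}
```

A table like this keeps the callback short, and adding a new command only means registering one more entry.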

You can find sources of the sketch in the GitHub repo.

Wednesday, 25 January 2017

Rule-Based Fault Management for Environmental Monitoring IoT system

Hardware setup

For the hardware part of the system I'm using two DIY environmental sensors. Each sensor consists of two parts: the external part, mounted outside the window, and the internal part, mounted inside the apartment.

The internal part

The internal part is an ESP8266 module powered by a 3.7V 700mAh LiPo battery. After submitting the proposal I started to work on its hardware components. My initial development configuration looked like this:

But since my project won the Gift Certificate, I decided to use it, and after a little waiting I got these goodies from SparkFun:

and modified the prototyping setup:

This is the final configuration that I used to develop the sensor driver software. I'll describe the sensor software development in the next post.

The SparkFun ESP8266 Thing has an onboard voltage regulator and a charger for 3.7V LiPo batteries. It is a great addition for solutions with an autonomous power supply.

The components are mounted on a small solderable breadboard with 2x10-pin stackable headers and a 4-pin break away header for the sensor ribbon cable:

The assembled boards and LiPo batteries are placed in a protective enclosure made of latticed PVC plates. The enclosure has a separate section for each board. The back panel is removed to provide access to the ESP8266 Thing power switch and Micro USB connector.

The external part

For the external part I made two simple sensor holders. The holder places the BME280 sensor at a distance from the wall of the building to slightly reduce the impact of the wall's proximity on the temperature measurement. It also provides partial protection from the weather.

Simple raw materials:

And it could not be done without hot glue :)

Unfortunately I burned one SparkFun BME280 sensor breakout board, so I had to go with a 4-pin BME280 that I bought on eBay early last year. Here is a bottom-up view of the covers with the sensors inside:
I attached sensor holders to the window and put the fully assembled device on a sill:


Btw, it's true: there is a lot of snow in winter in Russia :)

Sunday, 25 December 2016

Rule-Based Fault Management for Environmental Monitoring IoT system

Fault Management is an important part of the IoT Management area in general, and several approaches to fault detection and diagnosis in information systems exist. For the 3.0 edition of the Challenge I'm working on a Fault Management solution for a distributed Environmental Monitoring IoT system. The solution is based on rule-based principles of error (symptom) detection and fault isolation and diagnosis.

Components of the system

The solution will be implemented and deployed as a distributed system composed of several components. Here is a short description of the main components.
Component view of the solution

1. Sensor component, consisting of two identical battery powered, WiFi enabled IoT sensors based on the ESP8266 and BME280. The component provides temperature, humidity and barometric pressure values using a push/pull communication scheme with its field gateway via the MQTT protocol.

2. IoT Gateway component, based on a Raspberry Pi running Raspbian OS. The Raspberry Pi has a WiFi USB dongle connected. Eclipse Kura will be installed on the device and used as the IoT gateway implementation. The gateway communicates with the IoT sensors via the MQTT protocol. An SNMP agent for Raspbian OS will also be deployed to enable the Raspberry Pi to receive SNMP commands and send SNMP trap events.

The following components will be deployed on two or more cloud CentOS 7 instances, provided by Vscale. Some components will be run inside Docker containers.

3. Connectivity component, based on Eclipse Hono. The component will provide a bidirectional communication channel between the IoT gateway and its cloud backend. Two communication protocols will be used for the interaction: SNMP and MQTT. Telemetry data from the sensors and control commands from the backend will be transmitted via MQTT. SNMP will be used for receiving TRAP and INFORM events from the OS components of the IoT gateway, and for sending GET requests from the backend. For SNMP I'm going to develop an SNMP Protocol Adapter for Hono. So from the gateway side there will be two separate data flows: sensor readings; and monitoring data, error events and alarms.

4. Fault Management component, based on JBoss Drools. The aim of this component is to receive symptom events (errors, alarms), detect errors, isolate and diagnose the causes of faults and apply recovery actions. For this purpose a set of rules will be created for decision making and complex event processing. The component will be able to send control messages to check the status of other components and to send commands to components as a recovery procedure (trying to restart a failed component, for example). The component will be deployed as a self-contained decision service, which communicates with the Hono and Data Storage components via Apache Camel routes.

5. Data Storage component, implemented with the Redis data structure store. Two instances of the data storage will be used: one as temporary storage for the Environmental telemetry data and the second as the Fault Management database, which persists symptom events, notifications and alerts data.

6. Integration component, based on Apache Camel. This component will implement the business scenario of the solution by receiving Environmental telemetry data (temperature, humidity and barometric pressure values) from the Connectivity component and transmitting this data to the local geoinformation SaaS service, Public Monitoring Project, to display the sensor readings on the world map.

7. UI component, implemented as a standalone web application. The component will deliver real-time fault and system status data to the user, providing online visualization and notification functionality over the WebSocket protocol.

8. UI client component, implemented as an HTML5 web client application. The component will display the data provided by the UI component web app.

Sample Use Cases

Use Case 1: Two Environmental data sensors are deployed in the field. One sensor is in active mode, periodically sending readings to its field gateway. The other sensor is a reserve and is in standby mode. The first sensor stops functioning, for example due to power problems. The Fault Management system detects that readings have ceased to flow from the gateway. The system starts the fault isolation procedure by sending a status request control message to the malfunctioning sensor. As the sensor doesn't respond, the system executes a recovery action by sending a wake command message to the second sensor. The second sensor switches to active mode, so the overall system keeps operating.

Use Case 2: The Fault Management system receives multiple SNMP trap events reporting high memory usage by the OS of the IoT field gateway. The system sends an SNMP GET request to retrieve the OS performance data. The response confirms the poor OS performance. The system executes the recovery procedure by sending a reset command message to the field gateway OS.

At this time I'm working on the hardware and software parts of the Sensor component.
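To make the detect, isolate and recover chain of Use Case 1 more concrete, here is a small sketch of the decision logic written as plain C++ conditions. In the planned solution this logic would live in Drools rules, so this is only an illustration: all names and both timeout values are hypothetical.

```cpp
#include <cstdint>

// Possible outcomes of one evaluation pass (hypothetical names).
enum class Action { None, SendStatusRequest, WakeStandbySensor };

// Facts the rules reason about for the active sensor.
struct SensorFacts {
    uint32_t secondsSinceLastReading;
    bool     statusRequestSent;
    bool     statusReplyReceived;
    uint32_t secondsSinceStatusRequest;
};

// Illustrative thresholds, not values from the project.
constexpr uint32_t READING_TIMEOUT_S = 900;
constexpr uint32_t REPLY_TIMEOUT_S   = 30;

Action evaluateRules(const SensorFacts& f) {
    // Detection: readings still arrive, so there is no symptom.
    if (f.secondsSinceLastReading <= READING_TIMEOUT_S) {
        return Action::None;
    }
    // Isolation: readings ceased, so probe the sensor with a status request.
    if (!f.statusRequestSent) {
        return Action::SendStatusRequest;
    }
    // The sensor answered the probe, so it is alive and is not replaced.
    if (f.statusReplyReceived) {
        return Action::None;
    }
    // Recovery: the probe timed out, so wake the standby sensor.
    if (f.secondsSinceStatusRequest > REPLY_TIMEOUT_S) {
        return Action::WakeStandbySensor;
    }
    // Still waiting for the reply to the status request.
    return Action::None;
}
```

Each branch maps to one rule, which is roughly how the same logic would be split into separate condition and action pairs in a rule engine.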