Sunday 25 December 2016

Rule-Based Fault Management for an Environmental Monitoring IoT System


Fault management is an important part of IoT management in general, and several approaches exist for fault detection and diagnosis in information systems. For the 3.0 edition of the Challenge I’m working on a Fault Management solution for a distributed Environmental Monitoring IoT system. The solution is based on rule-based principles of error (symptom) detection and fault isolation and diagnosis.

Components of the system

The solution will be implemented and deployed as a distributed system composed of several components. Here is a short description of the main ones.
Component view of the solution

1. Sensor component, consisting of two identical battery-powered, WiFi-enabled IoT sensors based on the ESP8266 and BME280. The component provides temperature, humidity and barometric pressure values to its field gateway over MQTT, using a push/pull communication scheme.

2. IoT Gateway component, based on a Raspberry Pi running Raspbian OS, with a USB WiFi dongle attached. Eclipse Kura will be installed on the device and used as the IoT gateway implementation. The gateway communicates with the IoT sensors via MQTT. An SNMP agent for Raspbian OS will also be deployed so that the RPi can receive SNMP commands and send SNMP trap events.
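
To make the sensor-to-gateway exchange more concrete, here is a minimal Python sketch of how one reading could be packaged for MQTT. The topic layout, the field names and the JSON encoding are all assumptions made for illustration. The design above does not fix a payload format, and the real firmware will run on the ESP8266 rather than in Python.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Reading:
    """One BME280 sample; field names and units are hypothetical."""
    sensor_id: str
    temperature_c: float
    humidity_pct: float
    pressure_hpa: float


def telemetry_topic(sensor_id: str) -> str:
    # Hypothetical topic layout for readings pushed to the gateway.
    return f"sensors/{sensor_id}/telemetry"


def encode(reading: Reading) -> bytes:
    # JSON is an assumed wire format, chosen here only for readability.
    return json.dumps(asdict(reading), sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> Reading:
    # Inverse of encode(); rebuilds the reading on the gateway side.
    return Reading(**json.loads(payload.decode("utf-8")))
```

The gateway side would call decode() on each message received on the telemetry topic before forwarding it to the cloud backend.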

The following components will be deployed on two or more cloud CentOS 7 instances, provided by Vscale. Some components will be run inside Docker containers.

3. Connectivity component, based on Eclipse Hono. The component will provide a bidirectional communication channel between the IoT gateway and its cloud backend. Two communication protocols will be used: SNMP and MQTT. Telemetry data from the sensors and control commands from the backend will be transmitted via MQTT. SNMP will be used for receiving TRAP and INFORM events from the OS components of the IoT gateway, and for sending GET requests from the backend. For SNMP I’m going to develop an SNMP protocol adapter for Hono. So from the gateway side there will be two separate data flows: sensor readings and monitoring, error events and alarms.

4. Fault Management component, based on JBoss Drools. The aim of this component is to receive symptom events (errors, alarms), detect errors, isolate and diagnose the causes of faults, and apply recovery actions. For this purpose a set of rules will be created for decision making and complex event processing. The component will be able to send control messages to check the status of other components, and to send commands to components as a recovery procedure (for example, trying to restart a failed component). The component will be deployed as a self-contained decision service that communicates with the Hono and Data Storage components via Apache Camel routes.

5. Data Storage component, implemented with the Redis data structure store. Two instances will be used: one as temporary storage for the Environmental telemetry data, and the second as the Fault Management database, which persists symptom events, notifications and alerts.

6. Integration component, based on Apache Camel. This component will implement the business scenario of the solution: it receives Environmental telemetry data (temperature, humidity and barometric pressure values) from the Connectivity component and transmits it to a local geoinformation SaaS service, the Public Monitoring Project (narodmon.ru), which displays sensor readings on a world map.

7. UI component, implemented as a standalone web application. The component will deliver real-time fault and system status data to the user, providing online visualization and notifications over the WebSocket protocol.

8. UI client component, implemented as an HTML5 web client application. The component will display the data provided by the UI component web app.
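
The split between telemetry and fault data described for the Connectivity and Data Storage components can be sketched roughly as follows. This is only an illustration in Python: the message kinds, the "kind" field and the in-memory Stores class are hypothetical stand-ins, and in the planned system this routing would be done by Apache Camel routes writing to two Redis instances.

```python
from collections import defaultdict

TELEMETRY = "telemetry"
FAULT = "fault"

# Hypothetical message kinds; the design names the data flows but not a schema.
FAULT_KINDS = {"error", "alarm", "snmp_trap", "snmp_inform"}


def classify(message: dict) -> str:
    """Decide which store a message coming from the gateway belongs to."""
    return FAULT if message.get("kind") in FAULT_KINDS else TELEMETRY


class Stores:
    """In-memory stand-in for the two Redis instances."""

    def __init__(self) -> None:
        self._data = defaultdict(list)

    def route(self, message: dict) -> str:
        # Append the message to the store chosen by classify().
        target = classify(message)
        self._data[target].append(message)
        return target

    def count(self, target: str) -> int:
        return len(self._data[target])
```

Keeping the classification in one small function means the same decision can be reused by whichever component ends up doing the routing.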

Sample Use Cases

Use Case 1: Two Environmental data sensors are deployed in the field. One sensor is in active mode, periodically sending readings to its field gateway. The other sensor is a reserve and is in standby mode. The first sensor stops functioning, for example because of a power problem. The Fault Management system detects that readings have ceased to flow from the gateway. The system initiates the fault isolation procedure by sending a status request control message to the malfunctioning sensor. As the sensor doesn’t respond, the system executes a recovery action by sending a wake command message to the second sensor. The second sensor switches to active mode, and the overall system continues to operate.
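
The logic of Use Case 1 can be written down as a small Python sketch. It is not Drools rule code, only an illustration of the detect, isolate and recover steps. The timeout value, the probe callback and the command strings are assumptions, since the use case does not specify a reporting interval or a message format.

```python
from dataclasses import dataclass, field
from typing import Callable, List

READING_TIMEOUT_S = 120.0  # assumed; no reporting interval is specified


@dataclass
class FailoverRule:
    """Sketch of Use Case 1: detect silence, probe, then wake the standby."""
    active: str
    standby: str
    probe: Callable[[str], bool]  # hypothetical: True if the sensor answers
    last_reading_at: float = 0.0
    commands: List[str] = field(default_factory=list)

    def on_reading(self, now: float) -> None:
        # Remember when the active sensor last delivered a reading.
        self.last_reading_at = now

    def evaluate(self, now: float) -> str:
        # Symptom detection: have readings stopped flowing?
        if now - self.last_reading_at <= READING_TIMEOUT_S:
            return "ok"
        # Fault isolation: ask the silent sensor for its status.
        self.commands.append(f"status:{self.active}")
        if self.probe(self.active):
            return "sensor_alive"
        # Recovery: wake the standby sensor and make it the active one.
        self.commands.append(f"wake:{self.standby}")
        self.active, self.standby = self.standby, self.active
        self.last_reading_at = now
        return "failed_over"
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the rule easy to test with simulated time.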

Use Case 2: The Fault Management system receives multiple SNMP trap events reporting high memory usage by the OS of the IoT field gateway. The system sends an SNMP GET request to retrieve the OS performance data. The response confirms the poor OS performance. The system executes the recovery procedure by sending a reset command message to the field gateway OS.
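
Use Case 2 is essentially event correlation over a time window followed by a confirmation step, which can be sketched in Python as below. Again this is an illustration rather than Drools code. The threshold, the window length, the memory limit and the get_memory_pct callback (standing in for the SNMP GET round trip) are all assumed values and names.

```python
from collections import deque
from typing import Callable, Deque

TRAP_THRESHOLD = 3       # assumed number of traps that counts as "multiple"
WINDOW_S = 60.0          # assumed correlation window in seconds
MEMORY_LIMIT_PCT = 90.0  # assumed limit above which usage is confirmed as bad


class MemoryPressureRule:
    """Sketch of Use Case 2: correlate traps, confirm via GET, then reset."""

    def __init__(self, get_memory_pct: Callable[[], float]) -> None:
        # get_memory_pct is a hypothetical stand-in for the SNMP GET request.
        self._get_memory_pct = get_memory_pct
        self._traps: Deque[float] = deque()

    def on_trap(self, now: float) -> str:
        self._traps.append(now)
        # Drop traps that have fallen out of the sliding window.
        while self._traps and now - self._traps[0] > WINDOW_S:
            self._traps.popleft()
        if len(self._traps) < TRAP_THRESHOLD:
            return "observe"
        # Confirm the symptom before acting on it.
        if self._get_memory_pct() < MEMORY_LIMIT_PCT:
            return "not_confirmed"
        # Recovery: forget the handled traps and request a gateway reset.
        self._traps.clear()
        return "reset_gateway"
```

The confirmation step means a burst of traps alone never triggers a reset; the gateway is only restarted when the polled value agrees with the alarms.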

At the moment I'm working on the hardware and software parts of the Sensor component.