How we ensure the stability of our platform

In the context of an Energy IoT platform, stability essentially means one thing: it must be reliably available. Availability is measured as the percentage of time during which the service is actually usable. In this blog post, we explain how Kiwigrid ensures the stability of KiwiOS.


When is a platform considered available?

When does something count as available? For example, if a frontend still loads in the web browser but nothing happens when the user clicks a button, is it still available? To capture such cases as accurately as possible, several gradations of availability are recorded:

  • Operational
  • Degraded Performance
  • Partial Outage
  • Major Outage
  • Under Maintenance

“Degraded performance”, for example, means that a service (e.g., the user interface) is disproportionately slow. Such delays are included in the calculation of availability. However, announced interruptions during which the platform or individual areas are being maintained ("under maintenance") are not included in the calculation.
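
To illustrate how such gradations could feed into an availability figure, here is a minimal sketch in Python. The weights per status are purely hypothetical (this post does not state how strongly degraded performance or a partial outage counts), and the names StatusInterval and availability_percent are invented for this example. The only rules taken from the text are that degraded periods count against availability and that announced maintenance is left out of the calculation entirely.

```python
from dataclasses import dataclass

# Hypothetical weights: the share of an interval that counts as "available".
# The real weighting is not described in this post.
AVAILABILITY_WEIGHT = {
    "operational": 1.0,
    "degraded_performance": 0.5,
    "partial_outage": 0.25,
    "major_outage": 0.0,
}


@dataclass
class StatusInterval:
    status: str     # one of the keys above, or "under_maintenance"
    minutes: float  # duration of the interval in minutes


def availability_percent(intervals):
    """Return availability in percent, ignoring announced maintenance."""
    # Maintenance windows are removed from both the measured and the available time.
    counted = [i for i in intervals if i.status != "under_maintenance"]
    total = sum(i.minutes for i in counted)
    if total == 0:
        raise ValueError("no measured time outside maintenance windows")
    available = sum(AVAILABILITY_WEIGHT[i.status] * i.minutes for i in counted)
    return 100.0 * available / total


# One day: 60 minutes of announced maintenance, 10 minutes of degraded performance.
day = [
    StatusInterval("operational", 1370),
    StatusInterval("degraded_performance", 10),
    StatusInterval("under_maintenance", 60),
]
print(round(availability_percent(day), 2))  # 99.64
```

With these assumed weights, the 10 slow minutes cost 5 minutes of availability, while the maintenance hour does not affect the result at all.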
 

The availability of the KiwiOS platform

In principle, an IoT platform runs 24/7 without a break. Determining the stability of a platform involves checking how much of this time it was available without interruption.

Kiwigrid aims for a platform availability of 99.5%, which corresponds to approximately 23 hours and 53 minutes per day, or about 7 minutes of permitted downtime. The availability target that a company sets for itself is also known as a Service Level Objective (SLO). If a certain minimum availability is agreed upon with customers, it is recorded in a Service Level Agreement (SLA). If the availability specified in the SLA cannot be maintained, the customer may lodge a complaint and, where applicable, the fees for the service are reduced accordingly.
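
The arithmetic behind an availability target is easy to check. The following Python sketch converts a target into a downtime budget; the function name downtime_budget_minutes and the 30-day period are choices made for this example only.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(slo_percent, period_minutes):
    """Minutes of allowed unavailability in a period for a given target."""
    return period_minutes * (100.0 - slo_percent) / 100.0


per_day = downtime_budget_minutes(99.5, MINUTES_PER_DAY)
per_30_days = downtime_budget_minutes(99.5, 30 * MINUTES_PER_DAY)

print(f"{per_day:.1f} minutes per day")          # 7.2 minutes per day
print(f"{per_30_days / 60:.1f} hours per 30 days")  # 3.6 hours per 30 days
```

A 99.5% target therefore leaves roughly 7 minutes of downtime per day, or about three and a half hours over 30 days.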
 

How we ensure platform stability at Kiwigrid

Kiwigrid's platform is built on a microservice architecture. If a request is sent and the corresponding microservice is not active, the request fails. For this reason, multiple instances of a service run simultaneously. If one instance fails, another one takes over.
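
The principle of redundancy can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how KiwiOS routes requests: in practice, failover is typically handled by a load balancer or an orchestrator rather than by application code. The names call_with_failover and InstanceUnavailable are invented for this example.

```python
class InstanceUnavailable(Exception):
    """Raised by an instance that cannot handle the request."""


def call_with_failover(instances, request):
    """Try each instance in order and return the first successful response."""
    last_error = None
    for instance in instances:
        try:
            return instance(request)
        except InstanceUnavailable as error:
            last_error = error  # this instance is down, so try the next one
    # Only reached if every instance failed (or none was given).
    raise InstanceUnavailable("all instances failed") from last_error


def broken_instance(request):
    raise InstanceUnavailable("instance is down")


def healthy_instance(request):
    return f"handled {request}"


print(call_with_failover([broken_instance, healthy_instance], "GET /status"))
# handled GET /status
```

The request only fails if every instance is unavailable, which is why running several of them at the same time raises overall availability.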

On an organizational level, the stability of a software platform is traditionally overseen by an SRE (Site Reliability Engineering) team. At Kiwigrid, responsibility for this area lies with the DPE (Developer Productivity Engineering) team. In addition to various tasks that overlap with those of a traditional SRE team, the DPE team is also responsible for making developers' work easier by optimizing processes. At Kiwigrid, the DPE team is always on call, which means that at least one person is available 24/7. If one of the regularly run automated health checks fails, an alarm is automatically sent to a DPE team member's mobile phone.
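
As a rough illustration of how automated health checks can lead to an alarm, consider the following Python sketch. The check names, the notify callback, and the function run_health_checks are hypothetical; the actual monitoring and paging tools used at Kiwigrid are not named in this post. A scheduler would call such a function at regular intervals.

```python
def run_health_checks(checks, notify):
    """Run every check and call notify(name, reason) for each failure.

    checks maps a check name to a function returning True when healthy.
    Returns the names of all failed checks.
    """
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
            reason = "check returned False"
        except Exception as error:  # a crashing check also counts as a failure
            healthy = False
            reason = f"check raised {error!r}"
        if not healthy:
            failed.append(name)
            notify(name, reason)
    return failed


# Example with two made-up checks; alerts are collected in a list
# instead of being sent to a phone.
alerts = []
checks = {
    "api": lambda: True,
    "frontend": lambda: False,
}
failed = run_health_checks(checks, lambda name, reason: alerts.append((name, reason)))
print(failed)  # ['frontend']
```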

Good incident management is important in order to be able to respond to faults as quickly as possible. If a fault occurs at 6 p.m., it should not be left overnight but dealt with right away. Overall, a distinction is made between three alarm levels:

  • P1: very critical; notification by call, SMS, and app message.
  • P2: may require immediate action or may be able to wait (e.g., IP address sync suspended); notification by app message only.
  • P3: does not require immediate action; notification by email the next day.
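
The three alarm levels can be expressed as a simple routing table. The following Python sketch uses the channels named in the list; the identifiers Priority, CHANNELS, and channels_for are invented for this example and do not refer to a real paging tool.

```python
from enum import Enum


class Priority(Enum):
    P1 = 1  # very critical
    P2 = 2  # may or may not require immediate action
    P3 = 3  # can wait until the next day


# Notification channels per alarm level, following the list above.
CHANNELS = {
    Priority.P1: ["call", "sms", "app"],
    Priority.P2: ["app"],
    Priority.P3: ["email"],  # sent the next day rather than immediately
}


def channels_for(priority):
    """Return the notification channels used for an alarm level."""
    return CHANNELS[priority]


print(channels_for(Priority.P1))  # ['call', 'sms', 'app']
```

Keeping the mapping in one table makes it easy to see at a glance which faults are allowed to wake someone up at night and which are not.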
     

"When we migrated to Google Cloud at the end of 2021, it was quite common for the alarm to go off during the night shift. Over time, however, this has become much less frequent. The good thing, after all, is that the stability of our platform continues to improve as a result of constant checks and adjustments, and faults become rarer." – Christian Laußat (DPE team member)


KiwiOS - Stable at all levels

At Kiwigrid, platform availability is calculated automatically in a complex process. Kiwigrid assures its customers a platform availability of 99.5%. On an operational level, availability is ensured by stable APIs and a sophisticated microservice infrastructure. On an organizational level, there is a dedicated DPE team that monitors the stability of the platform 24/7.

 

Are you interested in Energy IoT? Follow us on LinkedIn for more news.