Concurrent Alarms Processing
I am currently working on the development of a control system, and I am faced with processing a massive stream of measurement events in order to signal alarm conditions (alarms that are activated when a value rises above or falls below a certain level, etc.). The control system notifies the alarm processing component of each measured variable (tag), and the component must update the state machine of every alarm configured in the control system. The problem is that we want the system as a whole to be scalable (we need to resize it depending on the load it is subjected to), and in some scenarios the number of notifications sent to the alarm processing component exceeds its handling capacity, so a single computing entity (task or process) cannot do the processing in real time.
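To make "updating the state machine of an alarm" concrete, here is a minimal sketch of a single high-limit alarm. The class name, the two states, and the tag name are assumptions made for illustration; the real alarms in the project may have more states and conditions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlarmState(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"


@dataclass
class HighLimitAlarm:
    """Hypothetical alarm that is active while a tag value is above `limit`."""
    tag: str
    limit: float
    state: AlarmState = AlarmState.INACTIVE

    def update(self, value: float) -> Optional[AlarmState]:
        """Feed one measurement; return the new state if it changed, else None."""
        new_state = AlarmState.ACTIVE if value > self.limit else AlarmState.INACTIVE
        if new_state is self.state:
            return None  # no transition, nothing to notify
        self.state = new_state
        return new_state


# Each notified measurement of the tag drives the state machine once.
alarm = HighLimitAlarm(tag="boiler.temperature", limit=90.0)
transitions = [alarm.update(v) for v in (85.0, 92.5, 95.0, 88.0)]
```

Only the second and fourth measurements cause a transition; those are the moments when a notification would go out.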
Given this scenario, I raised the possibility of scaling the alarm processing component horizontally.
The problem with scaling the alarm processing component, assuming that reports of measured values are evenly distributed among the agents, is that the agents (processes, tasks) compete for control of the state machines of the alarms configured in the control system, so an exclusion mechanism is needed to prevent corruption of those state machines.
This is where the concept of consensus comes to the rescue. How to do it using only Cassandra (the database the project mostly uses) can be found here. The idea is that each state machine remains under the control of an owner (task, process) chosen by a consensus mechanism, so that the agents mutually exclude each other when operating on the state machine of each alarm.
In upcoming entries we will go into the details of the final implementation of the proposed solution.
I read your post and it left me with more questions than answers. I couldn't figure out what you were trying to solve. I don't know if that is because I am not familiar with the particular project you are working on or because of my default marginality.
You mention that the entire system needs to be scalable depending on demand, but I don't know the nature of the system or the function of the alarms.
I am assuming that the role of those alarms is to notify the control element of changes in a "variable" so it can adjust others accordingly.
If that is the case, the variable related to "demand" should be your "main variable", and you should work in a hierarchical order: the relevance of what a variable represents for the system determines the order in which variables are allowed to influence the control element.
As I mentioned before, I don't know the nature of the system, so I am just coming up with conjectures here, but if the system is a real one, there are limits of operation for everything. For instance, if temperature "T" is not highly relevant for your process, it should not affect the control element over a more relevant variable or the "main variable", unless the value of "T" is such that it could affect the process's end product or safety. That would include the concept of a "range of validity" for your default hierarchy. Alarms monitoring alarms, so to speak.
That's what I gather from your entry: you have variables that are not necessarily fundamental competing for control of your system. My answer: give them priorities, and for as long as the value of a variable is not such that it will affect your process, keep it locked in place.
The alarms are conditions (value above some limit, etc.); they do not cause the control system to take actions (at least not now, maybe in the future). Currently, when alarms are activated or deactivated (when they change state), the change is notified to interested parties (users).
The problem I face is that under some conditions (a large number of configured alarms changing state, or a massive stream of notifications in a short period of time) a single processing entity (process, thread of execution, etc.) doesn't have enough computational capacity to perform in real time (that is, to receive the notification and notify the change of state before some deadline). If the computational power is small enough, then even under normal load the notifications start to pile up and the alarm processing takes longer and longer to change the state of alarms and notify conditions, so processing is no longer done in real time (a requirement of every control system; remember that control systems control processes happening in real time).
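The pile-up can be put into rough numbers. The sketch below assumes constant arrival and service rates and an empty queue at the start; the rates are made-up figures for illustration, not measurements from the real system.

```python
def backlog_after(seconds: float, arrival_rate: float, service_rate: float) -> float:
    """Notifications still queued after `seconds`, assuming constant rates
    and an empty queue at the start."""
    return max(0.0, (arrival_rate - service_rate) * seconds)


def delay_of_newest(seconds: float, arrival_rate: float, service_rate: float) -> float:
    """Seconds the most recent notification waits behind the backlog."""
    return backlog_after(seconds, arrival_rate, service_rate) / service_rate


# Hypothetical load: 12,000 notifications/s arriving, one worker handling 10,000/s.
single_worker_backlog = backlog_after(60, 12_000, 10_000)
single_worker_delay = delay_of_newest(60, 12_000, 10_000)

# Two such workers (20,000/s combined, if the work splits evenly) keep up.
two_worker_backlog = backlog_after(60, 12_000, 20_000)
```

Under these assumptions, after one minute the single worker is 120,000 notifications behind and the newest one waits about 12 seconds, which would miss any sub-second deadline.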
That's why we need the alarm processing element to be horizontally scalable (i.e., just by adding more instances of the entity you increase computational power and therefore processing capacity, all without changing the code or the infrastructure). The problem with this scalability is that it introduces race conditions (two processing entities trying to update the same alarm state machine at the same time, corrupting the state of the alarm), and therefore a mutual exclusion mechanism is needed. That mutual exclusion mechanism must itself be distributed, to overcome the computational power limitations of a single host (computer).
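As an illustration of that race, the following sketch replays one unlucky interleaving of two workers by hand, with no threads involved. The dictionary stands in for wherever the alarm state is stored, and the names and the limit are hypothetical.

```python
LIMIT = 90.0                      # hypothetical high limit for the alarm
store = {"alarm-1": "inactive"}   # shared alarm state, e.g. a database row
notifications = []                # what interested users would receive


def evaluate(value):
    """State the alarm should be in for one measurement."""
    return "active" if value > LIMIT else "inactive"


def commit(worker, alarm_id, seen_state, value):
    """Write the new state and notify, if it differs from the state
    the worker read earlier."""
    new_state = evaluate(value)
    if new_state != seen_state:
        store[alarm_id] = new_state
        notifications.append((worker, alarm_id, new_state))


# Unlucky interleaving: both workers read before either one writes.
seen_by_a = store["alarm-1"]
seen_by_b = store["alarm-1"]
commit("A", "alarm-1", seen_by_a, 95.0)
commit("B", "alarm-1", seen_by_b, 96.0)
```

Both workers believe they caused the transition, so users receive two activation notifications for a single change of state. With other interleavings a transition can be overwritten or lost instead.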
Here enters the concept of consensus (multiple entities trying to agree on something). In this case the processing elements are the entities agreeing on which of them has control over each alarm state machine, which allows an alarm state machine lock mechanism to be built on top of that. Cassandra (the database engine) allows consensus to be implemented through lightweight transactions, a feature of recent versions. Using this feature will allow us to build a locking mechanism (something like this https://github.com/leandromoreira/cassandra-lock) on top of existing infrastructure, thus avoiding the introduction of more complexity into the control system and its deployment procedure.
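A rough sketch of how such a lock could look with lightweight transactions follows. It is not the cassandra-lock library's API and not the final implementation: the `alarm_locks` table, the 30-second lease, and the function names are assumptions. The functions expect a session object in the style of the DataStax Python driver, whose `execute()` result exposes `was_applied` for conditional statements; `FakeSession` is an in-memory stand-in so the sketch runs without a cluster.

```python
from types import SimpleNamespace

LOCK_TTL_SECONDS = 30  # hypothetical lease length; tune to the processing deadline

# Assumed schema (not from the post):
#   CREATE TABLE alarm_locks (alarm_id text PRIMARY KEY, owner text);
ACQUIRE = (
    "INSERT INTO alarm_locks (alarm_id, owner) VALUES (%s, %s) "
    f"IF NOT EXISTS USING TTL {LOCK_TTL_SECONDS}"
)
RENEW = (
    f"UPDATE alarm_locks USING TTL {LOCK_TTL_SECONDS} "
    "SET owner = %s WHERE alarm_id = %s IF owner = %s"
)
RELEASE = "DELETE FROM alarm_locks WHERE alarm_id = %s IF owner = %s"


def acquire(session, alarm_id, worker_id):
    """Try to become owner of one alarm; True only for the single winner."""
    return session.execute(ACQUIRE, (alarm_id, worker_id)).was_applied


def renew(session, alarm_id, worker_id):
    """Extend the lease, but only if this worker still owns it."""
    return session.execute(RENEW, (worker_id, alarm_id, worker_id)).was_applied


def release(session, alarm_id, worker_id):
    """Give up ownership, but only if this worker still owns it."""
    return session.execute(RELEASE, (alarm_id, worker_id)).was_applied


class FakeSession:
    """In-memory stand-in for a driver session. It mimics only the
    conditional (IF ...) semantics; TTL expiry is not simulated."""

    def __init__(self):
        self.owners = {}

    def execute(self, query, params):
        if query == ACQUIRE:
            alarm_id, owner = params
            applied = alarm_id not in self.owners
            if applied:
                self.owners[alarm_id] = owner
        elif query == RENEW:
            _owner, alarm_id, expected = params
            applied = self.owners.get(alarm_id) == expected
        else:  # RELEASE
            alarm_id, expected = params
            applied = self.owners.get(alarm_id) == expected
            if applied:
                del self.owners[alarm_id]
        return SimpleNamespace(was_applied=applied)


session = FakeSession()
results = [
    acquire(session, "alarm-1", "worker-a"),  # first claimant wins
    acquire(session, "alarm-1", "worker-b"),  # second claimant is refused
    release(session, "alarm-1", "worker-b"),  # non-owner cannot release
    release(session, "alarm-1", "worker-a"),  # owner releases
    acquire(session, "alarm-1", "worker-b"),  # lock is free again
]
```

The TTL is what would let another worker take over if the owner crashes without releasing. The owner has to renew before the lease runs out, and a worker that loses its lease must stop touching that alarm's state machine.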