AODH - Events and Alarms

Alarms provide basic monitoring-as-a-service for users' resources running on OpenStack. One application of these alarms is autoscaling stacks, where alarms determine whether a scale-up or scale-down policy is applied to a group of instances.

Alarm states follow a tri-state model:

ok: The rule governing the alarm has been evaluated as False.

alarm: The rule governing the alarm has been evaluated as True.

insufficient data: There are not enough datapoints available in the evaluation periods to meaningfully determine the alarm state.

Installing the clients

sudo apt install python3-aodhclient python3-gnocchiclient

Or using pip:

python3 -m pip install aodhclient
python3 -m pip install gnocchiclient

Threshold alarms

The state transitions of a threshold alarm are governed by the value of a metric compared against a threshold.

Examples of available metrics:

cpu (Unit: ns)

memory.usage (Unit: MB)

vcpus (Unit: number)

openstack alarm create \
  --name memory_hi \
  --type gnocchi_resources_threshold \
  --description 'instance consuming memory' \
  --metric memory.usage \
  --threshold 100 \
  --comparison-operator gt \
  --aggregation-method mean \
  --granularity 300 \
  --evaluation-periods 3 \
  --resource-id MY_INSTANCE_ID \
  --resource-type instance \
  --alarm-action "http://www.mydomain.ch/alarm"

This creates an alarm that fires when the average memory usage of an individual instance exceeds 100 MB for three consecutive 5-minute periods. In this case, the notification is sent to a web URL.

You can display the memory.usage values of instance 9f43a7e6-12bb-4d4d-846d-47712657a5e4 as follows:

openstack metric measures show -r 9f43a7e6-12bb-4d4d-846d-47712657a5e4 memory.usage --granularity 300
+---------------------------+-------------+--------+
| timestamp                 | granularity |  value |
+---------------------------+-------------+--------+
| 2021-05-20T12:30:00+02:00 |       300.0 |  148.0 |
| 2021-05-20T12:35:00+02:00 |       300.0 |  147.0 |
| 2021-05-20T12:40:00+02:00 |       300.0 |  147.0 |
| 2021-05-20T13:00:00+02:00 |       300.0 |  148.0 |
+---------------------------+-------------+--------+
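To see how the three alarm states relate to measures like the ones above, here is a small shell sketch. It is a simplified model written for illustration, not Aodh's actual evaluator: it assumes a gt comparison, treats fewer values than --evaluation-periods as insufficient data, and assumes the previous state is kept when the last periods disagree. The function name evaluate_gt is hypothetical.

```shell
#!/usr/bin/env bash
# evaluate_gt THRESHOLD PERIODS VALUE...
# Hypothetical, simplified model of a threshold rule using the gt operator.
# It is NOT Aodh's evaluator; it only illustrates the tri-state model:
#   fewer than PERIODS values      -> insufficient data
#   last PERIODS values all > T    -> alarm
#   last PERIODS values all <= T   -> ok
#   mixed results                  -> unchanged (assumed: previous state kept)
evaluate_gt() {
  local threshold=$1 periods=$2
  shift 2
  if [ "$#" -lt "$periods" ]; then
    echo "insufficient data"
    return
  fi
  # Keep only the last PERIODS values (the evaluation window)
  shift $(( $# - periods ))
  # awk compares decimal values such as 148.0, which [ -gt ] cannot handle
  printf '%s\n' "$@" | awk -v t="$threshold" '
    { n++; if ($1 > t) above++ }
    END {
      if (above == n) print "alarm"
      else if (above == 0) print "ok"
      else print "unchanged"
    }'
}

# Measures from the table above, threshold 100 MB, 3 evaluation periods
evaluate_gt 100 3 148.0 147.0 147.0 148.0   # prints: alarm
```

The last three measures (147.0, 147.0, 148.0) are all above 100, so the sketch returns alarm, which is consistent with memory_hi being in the alarm state in the alarm list at the end of this page.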

openstack alarm create \
  --name cpu_hi \
  --type gnocchi_resources_threshold \
  --description 'instance consuming cpu' \
  --metric cpu \
  --threshold 180000000000 \
  --comparison-operator gt \
  --aggregation-method rate:mean \
  --granularity 300 \
  --evaluation-periods 3 \
  --resource-id MY_INSTANCE_ID \
  --resource-type instance \
  --alarm-action "http://www.mydomain.ch/alarm"

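This creates an alarm that fires when an individual instance consumes more than 180000000000 ns (180 s) of CPU time per 5-minute period, for three consecutive periods. Because rate:mean returns the CPU time used during each 300-second period, this threshold corresponds to 60% of one vCPU (180 s / 300 s). In this case, the notification is sent to a web URL.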
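Since rate:mean returns the CPU time consumed during each granularity period, a threshold can be derived from a target utilisation. The sketch below uses a hypothetical shell function, cpu_threshold_ns, and assumes that one vCPU busy for a whole period accounts for granularity * 10^9 ns of CPU time, the same assumption behind dividing by 300000000000 to obtain vCPU utilisation.

```shell
#!/usr/bin/env bash
# cpu_threshold_ns PERCENT VCPUS GRANULARITY
# Hypothetical helper (not part of any OpenStack client): returns the
# threshold, in ns of CPU time per period, for a cpu alarm using rate:mean.
#   PERCENT      target utilisation, in % of VCPUS (integer)
#   VCPUS        number of vCPUs the percentage refers to
#   GRANULARITY  alarm granularity in seconds
cpu_threshold_ns() {
  local percent=$1 vcpus=$2 granularity=$3
  # One fully busy vCPU consumes GRANULARITY * 10^9 ns of CPU time per period
  echo $(( percent * vcpus * granularity * 1000000000 / 100 ))
}

cpu_threshold_ns 60 1 300   # 60% of one vCPU over 5 minutes: 180000000000
cpu_threshold_ns 80 4 300   # 80% of four vCPUs over 5 minutes: 960000000000
```

The result can be passed directly to --threshold; the granularity given to the function must be the same value passed to --granularity.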

You can display the cpu values of instance 9f43a7e6-12bb-4d4d-846d-47712657a5e4 as follows:

openstack metric aggregates "(metric cpu rate:mean)" "id=9f43a7e6-12bb-4d4d-846d-47712657a5e4"
+----------------------------------------------------+---------------------------+-------------+-------------------+
| name                                               | timestamp                 | granularity |             value |
+----------------------------------------------------+---------------------------+-------------+-------------------+
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:20:00+00:00 |       300.0 |       450000000.0 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:25:00+00:00 |       300.0 |       500000000.0 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:30:00+00:00 |       300.0 |       400000000.0 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:35:00+00:00 |       300.0 |       370000000.0 |
+----------------------------------------------------+---------------------------+-------------+-------------------+

To get more readable values, expressed as vCPU utilisation rather than CPU time in nanoseconds, divide the rate by the number of nanoseconds in one 300-second period (300000000000):

taylor@laptop:~$ openstack metric aggregates "(/ (metric cpu rate:mean) 300000000000)" "id=9f43a7e6-12bb-4d4d-846d-47712657a5e4"
+----------------------------------------------------+---------------------------+-------------+--------------------+
| name                                               | timestamp                 | granularity |              value |
+----------------------------------------------------+---------------------------+-------------+--------------------+
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:20:00+00:00 |       300.0 |             0.0015 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:25:00+00:00 |       300.0 | 0.0016666666666668 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:30:00+00:00 |       300.0 | 0.0013333333333333 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:35:00+00:00 |       300.0 | 0.0012333333333332 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:40:00+00:00 |       300.0 | 0.0012666666666666 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:45:00+00:00 |       300.0 |             1.2279 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:50:00+00:00 |       300.0 |     8.000166666667 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T11:55:00+00:00 |       300.0 |     7.999866666667 |
| 9f43a7e6-12bb-4d4d-846d-47712657a5e4/cpu/rate:mean | 2021-05-20T12:00:00+00:00 |       300.0 |     7.999366666667 |
+----------------------------------------------------+---------------------------+-------------+--------------------+

A value of 0.01 corresponds to 1% of one vCPU.

A value of 1 corresponds to one vCPU used at 100%.

A value of 8 corresponds to eight vCPUs used at 100%.
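The same conversion can be done locally on values copied from the rate:mean output. The hypothetical ns_to_vcpus function below divides a value by granularity * 10^9 ns, which reproduces the division by 300000000000 used above for a 300-second granularity; awk is used because the shell itself only does integer arithmetic.

```shell
#!/usr/bin/env bash
# ns_to_vcpus RATE_NS GRANULARITY
# Hypothetical helper: converts a cpu rate:mean value (ns of CPU time per
# period) into vCPU equivalents, e.g. 1.0000 = one vCPU used at 100%.
ns_to_vcpus() {
  awk -v ns="$1" -v g="$2" 'BEGIN { printf "%.4f\n", ns / (g * 1000000000) }'
}

ns_to_vcpus 450000000 300      # 0.0015, i.e. 0.15% of one vCPU
ns_to_vcpus 2400000000000 300  # 8.0000, i.e. eight vCPUs used at 100%
```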

Warning

The alarm granularity must match one of the granularities defined in the archive policy of the metric in Gnocchi; otherwise the alarm will remain in the insufficient data state.

Event Alarms

Event alarms are triggered when a specific event occurs, such as a VM powering off.

This example creates an event alarm that is triggered when an instance powers off:

openstack alarm create --type event \
  --name instance_off \
  --description 'Instance powered OFF' \
  --event-type "compute.instance.power_off.*" \
  --enabled True \
  --query "traits.instance_id=string::YOUR_INSTANCE_UUID" \
  --alarm-action "https://weburl.domain/notify_vm_is_off"

This example creates an event alarm that is triggered when an instance is powered on but in the error state:

openstack alarm create --type event \
  --name instance_on_but_in_error \
  --description 'Instance powered ON but in ERROR' \
  --event-type "compute.instance.power_on.*" \
  --enabled True \
  --query "traits.instance_id=string::YOUR_INSTANCE_UUID;traits.state=string::error" \
  --alarm-action "https://weburl.domain/notify_vm_is_in_error"

Note

Unlike threshold alarms, event alarms only change state when the event you defined occurs. This means the alarm never transitions back to ok automatically; you have to reset it manually. It also means the alarm remains in the insufficient data state until the event happens for the first time.

To manually set the alarm state to ok:

openstack alarm state set --state ok ALARM_ID

The full list of event types and traits is available here

List alarms

taylor@laptop:~$ openstack alarm list
+--------------------------------------+-----------------------------+--------------+-------+----------+---------+
| alarm_id                             | type                        | name         | state | severity | enabled |
+--------------------------------------+-----------------------------+--------------+-------+----------+---------+
| 2ec95ea9-6444-401a-9ec3-9e21bf714334 | gnocchi_resources_threshold | cpu_hi       | ok    | low      | True    |
| 2ec95ea9-6444-401a-9ec3-9e21bf714334 | gnocchi_resources_threshold | memory_hi    | alarm | low      | True    |
| ae1e20f9-4927-49d8-bcb0-f4f007e51e44 | event                       | instance_off | ok    | low      | True    |
+--------------------------------------+-----------------------------+--------------+-------+----------+---------+