[kagent-chef] kagent falsely assumes service is dead when it restarts

Description

kagent monitors the services running on the local host by periodically executing 'systemctl status SERVICE'. Certain services, such as 'airflow-scheduler', restart by design. When a check happens to run during such a restart, kagent reports to Hopsworks that the service is dead. This pollutes the logs and, more importantly, fills up the 'alerts' table in the Hopsworks database with false positives.

Instead of white-listing a service, which can be dangerous, kagent should find a service dead N times before sending an alert to Hopsworks.
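
A minimal sketch of the proposed behaviour, not kagent's actual code. The names `is_active`, `DeadServiceTracker`, `threshold` and `record` are hypothetical, and the sketch assumes that "N times" means N consecutive failed checks, so a single successful check resets the counter. It relies on 'systemctl status SERVICE' exiting with status 0 when the unit is active.

```python
import subprocess


def is_active(service):
    """Return True if 'systemctl status SERVICE' reports the unit as active."""
    # systemctl status exits with 0 when the unit is active, non-zero otherwise.
    result = subprocess.run(
        ["systemctl", "status", service],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class DeadServiceTracker:
    """Counts consecutive failed checks per service (hypothetical helper)."""

    def __init__(self, threshold):
        self.threshold = threshold  # the "N" from the proposal above
        self.failures = {}          # service name -> consecutive failed checks

    def record(self, service, alive):
        """Record one check result; return True when an alert should be sent."""
        if alive:
            # The service is up again, e.g. after a by-design restart.
            self.failures[service] = 0
            return False
        count = self.failures.get(service, 0) + 1
        self.failures[service] = count
        # Alert exactly once, at the moment the threshold is reached.
        return count == self.threshold
```

In the monitoring loop this would be used as `if tracker.record(name, is_active(name)): send_alert(name)`, where `send_alert` stands in for whatever kagent uses to notify Hopsworks. Alerting only when the count equals the threshold also keeps a service that really is dead from adding a new row to the 'alerts' table on every check.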

Assignee

Antonios Kouzoupis

Reporter

Antonios Kouzoupis

Labels

None

Fix versions

Affects versions

Priority

High