kagent - conda can get stuck

Description

A kagent on hadoop1 could not authenticate with Hopsworks. The process was alive but was not receiving conda commands, so I restarted it.
After the restart, the commands' status was still "ongoing", but the kagent did not retrieve the commands' statuses from local storage. I deleted all the commands from the admin UI, since there was nothing else I could do from the UI.

Suggested fixes:
1. Change the admin UI so that we can move commands back into the "new" state to retry them.
2. When a conda command fails, return its stderr/stdout message to Hopsworks as a VARCHAR, and make that message viewable in both the admin UI and the python microservice.
3. Have Hopsworks issue a canary conda command to kagent every 10 minutes or so, and check that the return value is correct. We need to know that kagent is working correctly with conda. If it is not, we can run a local sudo shell script to restart kagent on that host (/srv/hops/domains/domain1/bin/kagent-restart.sh HOST).
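
A rough sketch of fixes 2 and 3, to make the intent concrete. This is not existing kagent or Hopsworks code. The function names, the use of "conda --version" as the canary, the local subprocess call standing in for the Hopsworks-to-kagent round trip, and the 10000-character limit for the VARCHAR column are all assumptions. Only the restart script path comes from this ticket.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumed column size for the message stored in Hopsworks (hypothetical).
MAX_MESSAGE_LEN = 10000

# Restart script path as given in this ticket; it takes the host as argument.
RESTART_SCRIPT = "/srv/hops/domains/domain1/bin/kagent-restart.sh"


@dataclass
class CommandResult:
    returncode: int
    output: str  # combined stdout/stderr, truncated to MAX_MESSAGE_LEN


def run_command(argv: Sequence[str], timeout_s: float = 60.0) -> CommandResult:
    """Run a command and capture its output so it can be reported back (fix 2)."""
    try:
        proc = subprocess.run(
            list(argv), capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary or a hung command: report it as a failure.
        return CommandResult(-1, str(exc)[:MAX_MESSAGE_LEN])
    combined = (proc.stdout or "") + (proc.stderr or "")
    # Keep the tail of the output, where the error usually is.
    return CommandResult(proc.returncode, combined[-MAX_MESSAGE_LEN:])


def check_canary(
    host: str,
    runner: Callable[[Sequence[str]], CommandResult] = run_command,
    canary: Sequence[str] = ("conda", "--version"),
) -> CommandResult:
    """Run the canary; if it fails, invoke the restart script for the host (fix 3)."""
    result = runner(canary)
    if result.returncode != 0:
        runner(["sudo", RESTART_SCRIPT, host])
    return result
```

A scheduler would call check_canary("hadoop1") every 10 minutes or so and store result.output alongside the command status, so that the failure message is visible in the admin UI.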

Assignee

Antonios Kouzoupis

Reporter

Jim Dowling

Labels

None

Fix versions

Affects versions

Priority

High