kagent - conda can get stuck


A kagent on hadoop1 could not authenticate with Hopsworks. The process was alive but was not receiving conda commands, so I restarted it.
After the restart, the commands still showed the status "ongoing", and the kagent did not retrieve the commands' statuses from local storage. I deleted all the commands from the admin UI, since the UI offered no other option.

Suggested fixes:
1. Change the admin UI so that commands can be moved back into the "new" state and retried.
2. When a conda command fails, return its stderr/stdout message to Hopsworks as a VARCHAR, and make that message viewable in both the admin UI and the Python microservice.
3. Have Hopsworks issue a canary conda command to kagent every 10 minutes or so and check that the return value is correct, so that we know kagent is working correctly with conda. If it is not, we can run a local sudo shell script to restart kagent on that host (/srv/hops/domains/domain1/bin/kagent-restart.sh HOST).
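
A minimal sketch of the decision logic behind fix 3, to make the proposal concrete. It assumes the canary result reaches the checker as an exit code (or nothing at all if kagent never reported back). The names CanaryResult, canary_ok, restart_kagent and check_host are hypothetical and not part of kagent or Hopsworks. How the canary is dispatched and how its result is collected are not specified in this ticket, so they are left out.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional

# Restart script path as given in this ticket; it takes the host as its argument.
RESTART_SCRIPT = "/srv/hops/domains/domain1/bin/kagent-restart.sh"


@dataclass
class CanaryResult:
    """Outcome of one canary conda command (hypothetical structure)."""
    host: str
    exit_code: Optional[int]  # None: kagent never reported a result
    output: str = ""          # captured stderr/stdout, if any (see fix 2)


def canary_ok(result: CanaryResult, expected_exit_code: int = 0) -> bool:
    """The canary passes only if a result came back with the expected code."""
    return result.exit_code is not None and result.exit_code == expected_exit_code


def restart_kagent(host: str) -> None:
    """Run the local sudo restart script for the given host."""
    subprocess.run(["sudo", RESTART_SCRIPT, host], check=True)


def check_host(result: CanaryResult,
               restart: Callable[[str], None] = restart_kagent) -> bool:
    """Return True if the canary passed; otherwise restart kagent and return False."""
    if canary_ok(result):
        return True
    restart(result.host)
    return False
```

The restart action is passed in as a parameter so the check can be exercised without sudo, for example check_host(CanaryResult("hadoop1", None), restart=print). A periodic scheduler (every 10 minutes or so) would call check_host once per host.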


Antonios Kouzoupis


Jim Dowling


