GPUs can get stuck and need an nvidia-smi command to restart them

Description

On hops.site, we have seen cases where every process using a GPU has died, but the GPU is still reported as in use. To restart the affected devices, we need to kill all processes that still hold handles on the GPU devices.
There will be two scripts: a 'nice' one that tries to restart only the stuck device, and a 'kill -9' one that restarts all the devices on a host.
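
A rough sketch of what the two scripts might run, written in Python for illustration. The ticket does not say how the scripts are implemented, so everything here is an assumption: the function names are hypothetical, the GPUs are assumed to be exposed as /dev/nvidia<index>, `fuser` (from psmisc) is assumed to be available for finding and signalling the processes that hold the device open, and `nvidia-smi --gpu-reset -i <index>` is assumed to be the restart step. Both commands need root, and a reset may be refused on some GPUs or driver configurations.

```python
import subprocess


def nice_restart_commands(gpu_index):
    """Commands for the 'nice' script: only the stuck device is touched."""
    device = f"/dev/nvidia{gpu_index}"  # assumed device file naming
    return [
        # Ask the processes holding the device open to exit (SIGTERM).
        ["fuser", "-k", "-TERM", device],
        # Then reset just that GPU.
        ["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)],
    ]


def force_restart_commands(gpu_indices):
    """Commands for the 'kill -9' script: every listed device on the host."""
    commands = []
    for index in gpu_indices:
        # SIGKILL anything that still holds a handle on the device.
        commands.append(["fuser", "-k", "-KILL", f"/dev/nvidia{index}"])
    for index in gpu_indices:
        commands.append(["nvidia-smi", "--gpu-reset", "-i", str(index)])
    return commands


def run_all(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # fuser exits non-zero when no process holds the file,
            # so a failing command does not abort the sequence.
            subprocess.run(command, check=False)


if __name__ == "__main__":
    # Dry run: show what the 'nice' script would do for GPU 0.
    run_all(nice_restart_commands(0))
```

Keeping the command construction separate from execution makes it possible to review a dry run before anything is killed on a shared host.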

Assignee

Jim Dowling

Reporter

Jim Dowling

Labels

None

Fix versions

Affects versions

Priority

Medium