Linux Finally Introducing A Standardized Way Of Informing User-Space Of Hung GPUs

The upcoming Linux 6.15 kernel is set to finally introduce a standardized way of informing user-space when a GPU becomes hung or otherwise unresponsive. It is initially wired up for the AMD and Intel graphics drivers on Linux, so the user can be properly notified of problems and/or user-space software can take steps to address the hung/unresponsive graphics processor.

The work was started by Intel graphics driver engineers for their Xe and i915 Direct Rendering Manager drivers: a new device wedged event is set to be added to Linux 6.15 for reporting unresponsive hardware to user-space via a uevent. The AMDGPU driver has also been adapted to make use of this device wedged event, and over time other Linux GPU drivers beyond Intel and AMD will likely adopt this event interface too.

The event notifies user-space of a hung/unusable hardware state, which can be useful when the driver has already attempted a GPU reset on its own to correct the hardware state. The hope is that this will be a generic way of helping to recover from hung GPUs with user-space intervention. Besides alerting user-space to the problem itself, udev rules or other custom recovery scripts could take steps once informed of the hung/unresponsive GPU.
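To illustrate how user-space might consume such an event, here is a minimal Python sketch that parses a raw kernel uevent payload (a header followed by NUL-separated KEY=VALUE pairs) and extracts the suggested recovery methods. It assumes the event carries a `WEDGED` key holding a comma-separated list of methods such as `rebind` or `bus-reset`; that key name, the values, and the sample device path are assumptions here and should be checked against the final kernel documentation.

```python
from typing import Dict, List


def parse_uevent(payload: bytes) -> Dict[str, str]:
    """Split a NUL-separated uevent payload into a KEY -> VALUE mapping."""
    fields: Dict[str, str] = {}
    for chunk in payload.split(b"\0"):
        text = chunk.decode("utf-8", errors="replace")
        # The leading "action@devpath" header has no "=" and is skipped.
        if "=" in text:
            key, _, value = text.partition("=")
            fields[key] = value
    return fields


def wedged_methods(fields: Dict[str, str]) -> List[str]:
    """Return the recovery methods listed in a DRM wedged event, if any.

    ASSUMPTION: the event uses a WEDGED=<method>[,<method>...] key.
    """
    if fields.get("SUBSYSTEM") != "drm" or "WEDGED" not in fields:
        return []
    return [m.strip() for m in fields["WEDGED"].split(",") if m.strip()]


# Hypothetical payload, shaped like what a netlink uevent listener receives.
sample = (
    b"change@/devices/pci0000:00/0000:00:02.0/drm/card0\0"
    b"ACTION=change\0SUBSYSTEM=drm\0WEDGED=rebind,bus-reset\0"
)
methods = wedged_methods(parse_uevent(sample))
```

In practice a udev rule matching on the same key would likely be the simpler route; the parser above only shows what information the event is expected to carry.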

Recovery methods could include unbinding and rebinding the kernel driver, unbinding and rebinding the driver with a bus device reset after the driver unbind, or other steps and/or no action. This is useful for situations where the kernel driver can’t address the problematic GPU hardware state on its own, either because it is unable to unload/reload itself or because other steps are needed to correct the hardware state. An example GPU recovery script is being added to the new documentation:

[Image: GPU recovery script example]
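The unbind/rebind recovery method, with an optional bus reset in between, can be sketched roughly as follows. This is not the script from the kernel documentation, only an illustration built on the standard sysfs layout: a DRM card node links to its bus device, which links to its driver, and the driver directory exposes `bind` and `unbind` files. The function name `recover_gpu` is made up for this example, and the `reset` attribute is assumed to exist, which is true for PCI devices but not necessarily for other buses.

```python
from pathlib import Path


def recover_gpu(card_dir: str, bus_reset: bool = False) -> str:
    """Unbind and rebind the driver behind a DRM card node.

    Returns the bus device name (e.g. a PCI address) that was rebound.
    """
    # Resolve everything up front: the card node and the driver symlink
    # disappear once the driver has been unbound.
    device = Path(card_dir, "device").resolve()
    driver = (device / "driver").resolve()
    name = device.name

    (driver / "unbind").write_text(name)
    if bus_reset:
        # ASSUMPTION: the bus device exposes a sysfs "reset" attribute.
        (device / "reset").write_text("1")
    (driver / "bind").write_text(name)
    return name
```

On a real system this would be called as root with something like `recover_gpu("/sys/class/drm/card0")`, typically from a script launched by a udev rule. Anything still using the GPU loses its device when the driver is unbound, so applications generally have to be restarted afterwards.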

More details on the new GPU/DRM device-wedged event are available via the documentation patch.

This new device-wedged event was submitted today via drm-misc-next to DRM-Next ahead of the Linux 6.15 merge window opening at the end of March. So, barring any last-minute issues, this new standardized functionality will be found in Linux 6.15.

Today’s drm-misc-next pull also adds atomic helpers for async pageflips on arbitrary planes. AMDGPU is making use of this to support async page flipping on overlay planes.