Debugging GPU hangs with RADV

UMR (optional)

UMR is optional, but without it a lot of useful information cannot be dumped. Clone, build and install UMR. Do not forget to run chmod +s $(which umr) so RADV can actually access UMR.

UMR needs to access some kernel debug interfaces, so make them accessible (as root):

chmod 777 /sys/kernel/debug
chmod -R 777 /sys/kernel/debug/dri

Secure boot has to be disabled as well.
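
The prerequisites above can be checked before starting a debugging session. The following is only a sketch: the function name check_umr_access and its two optional arguments are made up for this example and are not part of RADV or UMR. It only inspects permissions and changes nothing.

```shell
# Sketch: report which UMR prerequisites from this section are not met.
# check_umr_access [debugfs_root] [umr_path]   (hypothetical helper)
check_umr_access() {
    debugfs_root="${1:-/sys/kernel/debug}"
    umr_path="${2:-$(command -v umr)}"
    status=0

    if [ -z "$umr_path" ]; then
        echo "missing: umr is not in PATH"
        status=1
    elif [ ! -u "$umr_path" ]; then
        # -u tests for the setuid bit that "chmod +s" sets.
        echo "missing: setuid bit on $umr_path (chmod +s)"
        status=1
    fi

    # The current user needs to read and traverse both debugfs directories.
    for dir in "$debugfs_root" "$debugfs_root/dri"; do
        if [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
            echo "missing: read/search permission on $dir"
            status=1
        fi
    done

    return "$status"
}
```

Calling check_umr_access without arguments checks the default locations. It prints one line per unmet prerequisite and returns a non-zero status if any was found.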

Generating and analyzing hang reports

With UMR installed, you can now set RADV_DEBUG=hang, which makes RADV insert trace markers and synchronization, and check for hangs. The hang report will be saved to ~/radv_dumps_<pid>_<time>. The hang report directory contains the following files:

  • *.spv: SPIR-V binaries of the pipeline that was bound when the hang occurred.

  • addr_binding_report.log: VK_EXT_address_binding_report logs.

  • app_info.log: VkApplicationInfo fields.

  • bo_history.log: A list of every GPU memory allocation and deallocation. If the GPU hang was caused by a page fault, you can use radv_check_va.py to figure out whether the faulting address is invalid or was used after the memory was deallocated.

  • bo_ranges.log: Address ranges that were valid at the time of submission.

  • dmesg.log: Output of dmesg, if available.

  • gpu_info.log: Fields of radeon_info.

  • pipeline.log: IR of the shaders that were bound during the hang, as well as the program counters of waves executing said shaders and the bound descriptors.

  • registers.log: Various GPU state registers.

  • trace.log: An annotated list of the command stream that caused the hang. The commands that hung come after !!!!! This is the last trace point that was reached by the CP !!!!!.

  • umr_ring.log: Similar to trace.log.

  • umr_waves.log: A list of waves that were active at the time of the hang, including register values.

  • vm_fault.log: The page fault address if a page fault occurred.
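
Since the commands that hung follow the last-trace-point marker in trace.log, it can help to print only that part of the file. This is a minimal sketch. The function name is made up for this example, and it assumes the marker text appears in the file exactly as quoted above.

```shell
# Sketch: print the lines of a trace.log that follow the marker
# "This is the last trace point that was reached by the CP".
# commands_after_last_trace_point <path/to/trace.log>   (hypothetical helper)
commands_after_last_trace_point() {
    awk '
        # Once the marker has been seen, print every following line.
        found { print }
        # Checked after the print rule so the marker line itself is skipped.
        /This is the last trace point that was reached by the CP/ { found = 1 }
    ' "$1"
}
```

If the marker is not present in the file, the function prints nothing.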

Note: By default, the backend IR (ACO or LLVM) and the disassembly should be dumped to pipeline.log. However, due to shader caching, you might need RADV_DEBUG=hang,nocache to get SPIR-V and NIR in the GPU hang report.
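
After reproducing a hang, the newest report directory can be located with a small helper. This is a sketch that relies only on the radv_dumps_<pid>_<time> naming described above. The function name latest_radv_dump and the application name in the usage comment are placeholders, not part of RADV.

```shell
# Sketch: print the most recently modified radv_dumps_* directory.
# latest_radv_dump [base_dir]   (hypothetical helper, base_dir defaults to $HOME)
#
# Possible usage, with ./my_vulkan_app standing in for the application:
#   RADV_DEBUG=hang,nocache ./my_vulkan_app
#   ls "$(latest_radv_dump)"
latest_radv_dump() {
    base_dir="${1:-$HOME}"
    # The trailing slash restricts the glob to directories; ls -t sorts
    # newest first. The final sed strips the trailing slash again.
    ls -td "$base_dir"/radv_dumps_*/ 2>/dev/null | head -n 1 | sed 's:/$::'
}
```

If no report directory exists under the base directory, the function prints nothing.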

Debugging Steam games

Steam games require a bit more work so RADV can access UMR: In your Steam library, make sure Tools is checked and search for Steam Linux Runtime. Under Properties -> Installed Files, click Browse, open _v2-entry-point and add

shift 2
exec "$@"

at the top of the file. Hang debugging can be enabled by selecting the faulting game and adding RADV_DEBUG=hang %command% under Properties -> General -> LAUNCH OPTIONS.

Debugging hangs without RADV_DEBUG=hang

In some situations, RADV_DEBUG=hang cannot generate a GPU hang report, for example with synchronization issues (because it enables RADV_DEBUG=syncshaders behind the scenes). An alternative is to disable GPU recovery by adding amdgpu.gpu_recovery=0 to your kernel command line options. Once the GPU hang has occurred, invoke UMR manually to dump the waves:

umr --by-pci <pci_id> -O bits,halt_waves -go 0 -wa <ring> -go 1 2>&1

and to dump the rings:

umr --by-pci <pci_id> -RS <ring> 2>&1
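
The two manual UMR invocations above can be wrapped in a small helper that saves their output to files. This is only a sketch. The function name dump_hang_state and the UMR_BIN override are made up for this example, and the output file names merely mirror those of the hang report. The UMR options are copied unchanged from the commands above.

```shell
# Sketch: run the wave and ring dumps from this section and save the output.
# dump_hang_state <pci_id> <ring> [out_dir]   (hypothetical helper)
# UMR_BIN is a hypothetical override for the umr executable (default: umr).
dump_hang_state() {
    pci_id="$1"
    ring="$2"
    out_dir="${3:-.}"
    umr_bin="${UMR_BIN:-umr}"

    if [ -z "$pci_id" ] || [ -z "$ring" ]; then
        echo "usage: dump_hang_state <pci_id> <ring> [out_dir]" >&2
        return 2
    fi
    mkdir -p "$out_dir" || return 1

    # Waves: same options as the manual command, stderr captured as well.
    "$umr_bin" --by-pci "$pci_id" -O bits,halt_waves -go 0 -wa "$ring" -go 1 \
        > "$out_dir/umr_waves.log" 2>&1
    # Rings.
    "$umr_bin" --by-pci "$pci_id" -RS "$ring" > "$out_dir/umr_ring.log" 2>&1
}
```

Run it after the hang has occurred, with GPU recovery disabled as described above, and pass the same <pci_id> and <ring> values you would give to UMR directly.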