Debugging GPU hangs with RADV¶
UMR (optional)¶
UMR is needed for dumping a lot of useful information. Clone, build and install
UMR. Do not forget to run
chmod +s $(which umr)
so RADV can actually access UMR.
UMR needs to access some kernel debug interfaces:
chmod 777 /sys/kernel/debug
chmod -R 777 /sys/kernel/debug/dri
Secure boot has to be disabled as well.
Generating and analyzing hang reports¶
With UMR installed, you can now set RADV_DEBUG=hang
which makes RADV insert
trace markers and synchronization and check for hangs. The hang report will be
saved to ~/radv_dumps_<pid>_<time>
. Inside the directory of the hang report,
there are a couple of files:
*.spv
: SPIR-V binaries of the pipeline that was bound when the hang occurred.addr_binding_report.log
: VK_EXT_address_binding_report logs.app_info.log
:VkApplicationInfo
fields.bo_history.log
: A list of every GPU memory allocation and deallocation. If the GPU hang was caused by a page fault, you can use radv_check_va.py to figure out if address is invalid or used after the memory was deallocated.bo_ranges.log
: Address ranges that were valid at the time of submission.dmesg.log
: Output ofdmesg
, if available.gpu_info.log
: Fields ofradeon_info
.pipeline.log
: IR of the shaders that were bound during the hang as well as programm counters of waves executing said shaders and bound descriptors.registers.log
: Various GPU state registers.trace.log
: An annotated list of the command stream that caused the hang. the commands that hung come after!!!!! This is the last trace point that was reached by the CP !!!!!
.umr_ring.log
: Similar totrace.log
.umr_waves.log
: A list of waves that were active at the time of the hang, including register values.vm_fault.log
: The page fault address if a page fault occurred.
Note: By default, the backend IR (ACO or LLVM) and the disassembly should be
dumped to pipeline.log
. But due to shaders caching, you might need
RADV_DEBUG=hang,nocache
to get SPIR-V and NIR in the GPU hang report.
Debugging Steam games¶
Steam games require a bit more work so RADV can access UMR: In your Steam library,
make sure Tools is checked and search for Steam Linux Runtime.
Under Properties -> Installed Files, click Browse, open
_v2-entry-point
and add
shift 2
exec "$@"
at the top of the file. Hang debugging can be enabled by selecting the faulting
game and adding RADV_DEBUG=hang %command%
under Properties -> General
-> LAUNCH OPTIONS.
Debugging hangs without RADV_DEBUG=hang¶
In some situations, RADV_DEBUG=hang
wouldn’t be able to generate a GPU hang
report, like for synchronization issues (because it enables
RADV_DEBUG=syncshaders
behind the scene). An alternative solution is to
disable GPU recovery by adding amdgpu.gpu_recovery=0
to your kernel command
line options. And then invoke UMR manually with
umr --by-pci <pci_id> -O bits,halt_waves -go 0 -wa <ring> -go 1 2>&1
for
dumping the waves and umr --by-pci <pci_id> -RS <ring> 2>&1
for dumping the
rings once the GPU hang occurred.