Panfrost

The Panfrost driver stack includes an OpenGL ES implementation for Arm Mali GPUs based on the Midgard, Bifrost, and Valhall microarchitectures. It is conformant on Mali-G52 and Mali-G57 but non-conformant on other GPUs. The following hardware is currently supported:

Product     Architecture     OpenGL ES   OpenGL
----------  ---------------  ----------  -------
Mali T600   Midgard (v4)     2.0         2.1
Mali T620   Midgard (v4)     2.0         2.1
Mali T720   Midgard (v4)     2.0         2.1
Mali T760   Midgard (v5)     3.1         3.1
Mali T820   Midgard (v5)     3.1         3.1
Mali T830   Midgard (v5)     3.1         3.1
Mali T860   Midgard (v5)     3.1         3.1
Mali T880   Midgard (v5)     3.1         3.1
Mali G72    Bifrost (v6)     3.1         3.1
Mali G31    Bifrost (v7)     3.1         3.1
Mali G51    Bifrost (v7)     3.1         3.1
Mali G52    Bifrost (v7)     3.1         3.1
Mali G76    Bifrost (v7)     3.1         3.1
Mali G57    Valhall (v9)     3.1         3.1
Mali G310   Valhall (v10)    3.1         3.1
Mali G610   Valhall (v10)    3.1         3.1

Other Midgard, Bifrost, and Valhall chips (e.g. G71) are not yet supported.

Older Mali chips based on the Utgard architecture (Mali 400, Mali 450) are supported in the Lima driver, not Panfrost. Lima is also available in Mesa.

Other graphics APIs (Vulkan, OpenCL) are not supported at this time.

Building

Panfrost’s OpenGL support is a Gallium driver. Since Mali GPUs are 3D-only and do not include a display controller, Mesa uses kmsro to support display controllers paired with Mali GPUs. If your board has a Panfrost-supported GPU and a display controller with mainline Linux support that kmsro does not yet handle, adding support is easy; see commit cff7de4bb597e9 for an example.

LLVM is not required by Panfrost’s compilers. LLVM support in Mesa can safely be disabled for most OpenGL ES users with Panfrost.

Build like meson . build/ -Dvulkan-drivers= -Dgallium-drivers=panfrost -Dllvm=disabled for a build directory build.

For general information on building Mesa, read the install documentation.

Chat

Panfrost developers and users hang out on IRC at #panfrost on OFTC. Note that registering and authenticating with NickServ is required to prevent spam. Join the chat.

Compressed texture support

Panfrost supports ASTC, ETC, and all BCn formats (RGTC, S3TC, etc.) in the driver. However, Panfrost depends on the hardware to support these formats efficiently. All supported Mali architectures support these formats, but not every system-on-chip with a Mali GPU supports all of them. Many lower-end systems lack support for some BCn formats, which can cause problems when playing desktop games with Panfrost. To check whether this issue applies to your system-on-chip, Panfrost includes a panfrost_texfeatures tool to query supported formats.

To use this tool, include the option -Dtools=panfrost when configuring Mesa. Then, inside your Mesa build directory, the tool is located at src/panfrost/tools/panfrost_texfeatures. Copy it to your target device, mark it executable if necessary, and run it on the target device. A table of supported formats will be printed to standard output.

drm-shim

Panfrost implements drm-shim, stubbing out the Panfrost kernel interface. Use cases for this functionality include:

  • Future hardware bring-up

  • Running shader-db on non-Mali workstations

  • Reproducing compiler (and some driver) bugs without Mali hardware

Although Mali hardware is usually paired with an Arm CPU, Panfrost is portable C code and should work on any Linux machine. In particular, you can test the compiler with shader-db on an Intel desktop.

To build Mesa with Panfrost drm-shim, configure Meson with -Dgallium-drivers=panfrost and -Dtools=drm-shim. See the above building section for a full invocation. The drm-shim binary will be built to build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so.

To use, set the LD_PRELOAD environment variable to the drm-shim binary. It may also be necessary to set LIBGL_DRIVERS_PATH to the location where Mesa was installed.

By default, drm-shim mocks a Mali-G52 system. To select a specific Mali GPU, set the PAN_GPU_ID environment variable to the desired GPU ID:

Product     Architecture     GPU ID
----------  ---------------  -------
Mali-T720   Midgard (v4)     720
Mali-T860   Midgard (v5)     860
Mali-G72    Bifrost (v6)     6221
Mali-G52    Bifrost (v7)     7212
Mali-G57    Valhall (v9)     9093

Additional GPU IDs are enumerated in the panfrost_model_list list in src/panfrost/lib/pan_props.c.

As an example: assuming Mesa is installed to a local path ~/lib and Mesa’s build directory is ~/mesa/build, a shader can be compiled for Mali-G52 as:

~/shader-db$ BIFROST_MESA_DEBUG=shaders \
LIBGL_DRIVERS_PATH=~/lib/dri/ \
LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
PAN_GPU_ID=7212 \
./run shaders/glmark/1-1.shader_test

The same shader can be compiled for Mali-T720 as:

~/shader-db$ MIDGARD_MESA_DEBUG=shaders \
LIBGL_DRIVERS_PATH=~/lib/dri/ \
LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
PAN_GPU_ID=720 \
./run shaders/glmark/1-1.shader_test

These examples set the compilers’ shaders debug flag to dump the optimized NIR, the backend IR after instruction selection, the backend IR after register allocation and scheduling, and a disassembly of the final compiled binary.

As another example, this invocation runs a single dEQP test “on” Mali-G52, pretty-printing GPU data structures and disassembling all shaders (PAN_MESA_DEBUG=trace) as well as dumping raw GPU memory (PAN_MESA_DEBUG=dump). The EGL_PLATFORM=surfaceless environment variable and various flags to dEQP mimic the surfaceless environment that our continuous integration (CI) uses. This eliminates window system dependencies, although it requires a specially built CTS:

~/VK-GL-CTS/build/external/openglcts/modules$ PAN_MESA_DEBUG=trace,dump \
LIBGL_DRIVERS_PATH=~/lib/dri/ \
LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
PAN_GPU_ID=7212 EGL_PLATFORM=surfaceless \
./glcts --deqp-surface-type=pbuffer \
--deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 \
--deqp-surface-height=256 -n \
dEQP-GLES31.functional.shaders.builtin_functions.common.abs.float_highp_compute

U-interleaved tiling

Panfrost supports u-interleaved tiling. U-interleaved tiling is indicated by the DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED modifier.

The tiling reorders whole pixels (blocks). It does not compress or modify the pixels themselves, so it can be used for any image format. Internally, images are divided into tiles. Tiles occur in source order, but pixels (blocks) within each tile are reordered according to a space-filling curve.

For regular formats, 16x16 tiles are used. This harmonizes with the default tile size for binning and CRCs (transaction elimination). It also means a single line (16 pixels) at 4 bytes per pixel equals a single 64-byte cache line.

For formats that are already block compressed (S3TC, RGTC, etc.), 4x4 tiles are used, where entire blocks are reordered. Most of these formats compress 4x4 blocks, so this gives an effective 16x16 tiling. This justifies the tile size intuitively, though it’s not a rule: ASTC may use larger blocks.

Within a tile, the X and Y bits are interleaved (like Morton order), but with a twist: adjacent bit pairs are XORed. The reason to add XORs is not obvious. Visually, addresses take the form:

| y3 | (y3 ^ x3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
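To illustrate, here is a scalar C sketch of the intra-tile swizzle (the function name is hypothetical; the production code is an optimized version of this idea):

   /* Compute a pixel's index along the space-filling curve within a
    * 16x16 tile, given its 4-bit x/y coordinates inside the tile. */
   static unsigned
   u_interleaved_pixel_index(unsigned x, unsigned y)
   {
      unsigned index = 0;

      for (int bit = 3; bit >= 0; --bit) {
         unsigned xb = (x >> bit) & 1;
         unsigned yb = (y >> bit) & 1;

         /* Each bit position contributes y_i, then (y_i ^ x_i) */
         index = (index << 2) | (yb << 1) | (yb ^ xb);
      }

      return index;
   }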

Reference routines to encode/decode u-interleaved images are available in src/panfrost/shared/test/test-tiling.cpp, which documents the space-filling curve. This reference implementation is used to unit test the optimized implementation used in production. The optimized implementation is available in src/panfrost/shared/pan_tiling.c.

Although these routines are part of Panfrost, they are also used by Lima, as Arm introduced the format with Utgard. It is the only tiling supported on Utgard. On Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and should be used instead where possible. However, not all formats are compressible, so u-interleaved tiling remains an important fallback on Panfrost.

Instancing

The attribute descriptor lets the attribute unit compute the address of an attribute given the vertex and instance ID. Unfortunately, the way this works is rather complicated when instancing is enabled.

To explain this, first we need to explain how compute and vertex threads are dispatched. When a quad is dispatched, it receives a single, linear index. However, we need to translate that index into a (vertex id, instance id) pair. One option would be to do:

\[
\begin{aligned}
\text{vertex id} &= \text{linear id} \bmod \text{num vertices} \\
\text{instance id} &= \text{linear id} / \text{num vertices}
\end{aligned}
\]

but this involves a costly division and modulus by an arbitrary number. Instead, we could pad num_vertices. We dispatch padded_num_vertices * num_instances threads instead of num_vertices * num_instances, which results in some “extra” threads with vertex_id >= num_vertices, which we have to discard. The more we pad num_vertices, the more “wasted” threads we dispatch, but the division is potentially easier.
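Concretely, the padded mapping looks like this (a sketch with illustrative names; threads that land in the padding are discarded):

   unsigned vertex_id   = linear_id % padded_num_vertices;
   unsigned instance_id = linear_id / padded_num_vertices;

   if (vertex_id >= num_vertices)
      return; /* "extra" thread, discard */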

One straightforward choice is to pad num_vertices to the next power of two, which means that the division and modulus are just simple bit shifts and masking. But the actual algorithm is a bit more complicated. The thread dispatcher has special support for dividing by 3, 5, 7, and 9, in addition to dividing by a power of two. As a result, padded_num_vertices can be 1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads, since we need less padding.

padded_num_vertices is picked by the hardware. The driver just specifies the actual number of vertices. Note that padded_num_vertices is a multiple of four (presumably because threads are dispatched in groups of 4). Also, padded_num_vertices is always at least one more than num_vertices, which seems like a quirk of the hardware. For larger num_vertices, the hardware uses the following algorithm: using the binary representation of num_vertices, we look at the most significant set bit as well as the following 3 bits. Let n be the number of bits after those 4 bits. Then we set padded_num_vertices according to the following table:

high bits   padded_num_vertices
----------  ----------------------
1000        \(9 \cdot 2^n\)
1001        \(5 \cdot 2^{n+1}\)
101x        \(3 \cdot 2^{n+2}\)
110x        \(7 \cdot 2^{n+1}\)
111x        \(2^{n+4}\)

For example, if num_vertices = 70 is passed to a draw call, its binary representation is 1000110, so n = 3 and the high bits are 1000; therefore padded_num_vertices = \(9 \cdot 2^3\) = 72.
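In C, the rule from the table can be sketched as follows (illustrative only; it covers just the "larger num_vertices" case described above and assumes at least 4 significant bits):

   static unsigned
   padded_vertex_count(unsigned num_vertices)
   {
      /* Position of the most significant set bit (num_vertices >= 8) */
      unsigned msb = 31 - __builtin_clz(num_vertices);

      /* n = the number of bits after the top 4 bits */
      unsigned n = msb - 3;

      /* The top 4 bits; the leading bit is always 1 */
      unsigned high = (num_vertices >> n) & 0xF;

      if (high == 0x8)        /* 1000 */
         return 9u << n;
      else if (high == 0x9)   /* 1001 */
         return 5u << (n + 1);
      else if (high <= 0xB)   /* 101x */
         return 3u << (n + 2);
      else if (high <= 0xD)   /* 110x */
         return 7u << (n + 1);
      else                    /* 111x */
         return 1u << (n + 4);
   }

For num_vertices = 70 this returns 72, matching the worked example.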

The attribute unit works in terms of the original linear_id. If num_instances = 1, then they are the same, and everything is simple. However, with instancing things get more complicated. There are four possible modes, two of which we can group together:

  1. Use the linear_id directly. Only used when there is no instancing.

  2. Use the linear_id modulo a constant. This is used for per-vertex attributes with instancing enabled, by making the constant equal to padded_num_vertices. Because the modulus is always padded_num_vertices, this mode only supports a modulus that is a power of two times 1, 3, 5, 7, or 9. The shift field specifies the power of two, while the extra_flags field specifies the odd number. If shift = n and extra_flags = m, then the modulus is \((2m + 1) \cdot 2^n\). As an example, if num_vertices = 70, then as computed above, padded_num_vertices = \(9 \cdot 2^3\), so we should set extra_flags = 4 and shift = 3 (a sketch of this encoding appears at the end of this section). Note that we must exactly follow the hardware algorithm used to get padded_num_vertices in order to correctly implement per-vertex attributes.

  3. Divide the linear_id by a constant. In order to correctly implement instance divisors, we have to divide linear_id by padded_num_vertices times the user-specified divisor. So first we compute padded_num_vertices, again following the exact same algorithm that the hardware uses, then multiply it by the GL-level divisor to get the hardware-level divisor. This case is further divided into two more cases. If the hardware-level divisor is a power of two, then we just need to shift. The shift amount is specified by the shift field, so that the hardware-level divisor is just \(2^\text{shift}\).

If it isn’t a power of two, then we have to divide by an arbitrary integer. For that, we use the well-known technique of multiplying by an approximation of the inverse. The driver must compute the magic multiplier and shift amount, and then the hardware does the multiplication and shift. The hardware and driver also use the “round-down” optimization as described in https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf. The hardware further assumes the multiplier is between \(2^{31}\) and \(2^{32}\), so the high bit is implicitly set to 1 even though it is set to 0 by the driver – presumably this simplifies the hardware multiplier a little. The hardware first multiplies linear_id by the multiplier and takes the high 32 bits, then applies the round-down correction if extra_flags = 1, then finally shifts right by the shift field.

There are some differences between ridiculousfish’s algorithm and the Mali hardware algorithm, which means that the reference code from ridiculousfish doesn’t always produce the right constants. Mali does not use the pre-shift optimization, since that would make a hardware implementation slower (it would have to always do the pre-shift, multiply, and post-shift operations). It also forces the multiplier to be at least \(2^{31}\), which means that the exponent is entirely fixed, so there is no trial-and-error. Altogether, given the divisor d, the algorithm the driver must follow is:

  1. Set shift = \(\lfloor \log_2(d) \rfloor\).

  2. Compute \(m = \lceil 2^{shift + 32} / d \rceil\) and \(e = 2^{shift + 32} \bmod d\).

  3. If \(e \leq 2^{shift}\), then we need to use the round-down algorithm. Set magic_divisor = m - 1 and extra_flags = 1.

  4. Otherwise, set magic_divisor = m and extra_flags = 0.
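To make these encodings concrete, here are two minimal C sketches; the helper names and signatures are illustrative, not the production API. The first decomposes padded_num_vertices into the shift/extra_flags encoding used by mode 2:

   /* padded_num_vertices = (2 * extra_flags + 1) << shift */
   static void
   encode_modulus(unsigned padded_num_vertices,
                  unsigned *shift, unsigned *extra_flags)
   {
      /* The power-of-two factor is given by the trailing zeros */
      *shift = __builtin_ctz(padded_num_vertices);

      /* The remaining factor is odd: 1, 3, 5, 7, or 9 */
      unsigned odd = padded_num_vertices >> *shift;
      *extra_flags = (odd - 1) / 2;
   }

For padded_num_vertices = 72 = \(9 \cdot 2^3\), this yields shift = 3 and extra_flags = 4, matching the mode 2 example. The second sketch follows steps 1-4 above to compute the magic divisor constants for a non-power-of-two divisor d (power-of-two divisors take the plain shift path instead):

   #include <stdint.h>

   static void
   compute_magic_divisor(uint32_t d, uint32_t *magic_divisor,
                         unsigned *shift, unsigned *extra_flags)
   {
      /* 1. shift = floor(log2(d)) */
      *shift = 31 - __builtin_clz(d);

      /* 2. m = ceil(2^(shift + 32) / d) and e = 2^(shift + 32) mod d.
       * Since shift <= 31, 2^(shift + 32) fits in a uint64_t. */
      uint64_t pow = 1ull << (*shift + 32);
      uint64_t m = pow / d + (pow % d != 0);
      uint64_t e = pow % d;

      if (e <= (1ull << *shift)) {
         /* 3. Round-down case */
         *magic_divisor = (uint32_t)(m - 1);
         *extra_flags = 1;
      } else {
         /* 4. Plain case */
         *magic_divisor = (uint32_t)m;
         *extra_flags = 0;
      }

      /* Per the text above, the hardware treats bit 31 of the multiplier
       * as implicitly set; how the driver packs the stored field is not
       * shown here. */
   }

As a worked example, take d = 7: shift = 2, \(m = \lceil 2^{34} / 7 \rceil = 2454267027\), and \(e = 2 \leq 2^2\), so magic_divisor = 2454267026 and extra_flags = 1.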