Tiling¶

The naive view of an image in memory is that the pixels are stored one after another in memory usually in an X-major order. An image that is arranged in this way is called “linear”. Linear images, while easy to reason about, can have very bad cache locality. Graphics operations tend to act on pixels that are close together in 2-D euclidean space. If you move one pixel to the right or left in a linear image, you only move a few bytes to one side or the other in memory. However, if you move one pixel up or down you can end up kilobytes or even megabytes away.

Tiling (sometimes referred to as swizzling) is a method of re-arranging the pixels of a surface so that pixels which are close in 2-D euclidean space are likely to be close in memory.

Basics¶

The basic idea of a tiled image is that the image is first divided into two-dimensional blocks or tiles. Each tile takes up a chunk of contiguous memory and the tiles are arranged like pixels in linear surface. This is best demonstrated with a specific example. Suppose we have a RGBA8888 X-tiled surface on Intel graphics. Then the surface is divided into 128x8 pixel tiles each of which is 4KB of memory. Within each tile, the pixels are laid out like a 128x8 linear image. The tiles themselves are laid out row-major in memory like giant pixels. This means that, as long as you don’t leave your 128x8 tile, you can move in both dimensions without leaving the same 4K page in memory.

You can, however do even better than this. Suppose that same image is, instead, Y-tiled. Then the surface is divided into 32x32 pixel tiles each of which is 4KB of memory. Within a tile, each 64B cache line corresponds to 4x4 pixel region of the image (you can think of it as a tile within a tile). This means that very small deviations don’t even leave the cache line. This added bit of pixel shuffling is known to have a substantial performance impact in most real-world applications.

Intel graphics has several different tiling formats that we’ll discuss in detail in later sections. The most commonly used as of the writing of this chapter is Y-tiling. In all tiling formats the basic principal is the same: The image is divided into tiles of a particular size and, within those tiles, the data is re-arranged (or swizzled) based on a particular pattern. A tile size will always be specified in bytes by rows and the actual X-dimension of the tile in elements depends on the size of the element in bytes.

Bit-6 Swizzling¶

On some older hardware, there is an additional address swizzle that is applied on top of the tiling format. This has been removed starting with Broadwell because, as it says in the Broadwell PRM Vol 5 “Tiling Algorithm” (p. 17):

Address Swizzling for Tiled-Surfaces is no longer used because the main memory controller has a more effective address swizzling algorithm.

Whether or not swizzling is enabled depends on the memory configuration of the system. Generally, systems with dual-channel RAM have swizzling enabled and single-channel do not. Supposedly, this swizzling allows for better balancing between the two memory channels and increases performance. Because it depends on the memory configuration which may change from one boot to the next, it requires a run-time check.

The best documentation for bit-6 swizzling can be found in the Haswell PRM Vol. 5 “Memory Views” in the section entitled “Address Swizzling for Tiled-Y Surfaces”. It exists on older platforms but the docs get progressively worse the further you go back.

ISL Representation¶

The structure of any given tiling format is represented by ISL using the isl_tiling enum and the isl_tile_info structure:

enum isl_tiling¶

Describes the memory tiling of a surface

This differs from the HW enum values used to represent tiling. The bits used by hardware have varried significantly over the years from the “Tile Walk” bit on old pre-Broadwell parts to the “Tile Mode” enum on Broadwell to the combination of “Tile Mode” and “Tiled Resource Mode” on Skylake. This enum represents them all in a consistent manner and in one place.

Note that legacy Y tiling is ISL_TILING_Y0 instead of ISL_TILING_Y, to clearly distinguish it from Yf and Ys.

enumerator ISL_TILING_LINEAR = 0¶: Linear, or no tiling

enumerator ISL_TILING_W¶: W tiling

enumerator ISL_TILING_X¶: X tiling

enumerator ISL_TILING_Y0¶: Legacy Y tiling

enumerator ISL_TILING_SKL_Yf¶: Standard 4K tiling. The ‘f’ means “four”.

enumerator ISL_TILING_SKL_Ys¶: Standard 64K tiling. The ‘s’ means “sixty-four”.

enumerator ISL_TILING_ICL_Yf¶: Standard 4K tiling. The ‘f’ means “four”.

enumerator ISL_TILING_ICL_Ys¶: Standard 64K tiling. The ‘s’ means “sixty-four”.

enumerator ISL_TILING_4¶: 4K tiling.

enumerator ISL_TILING_64¶: 64K tiling.

enumerator ISL_TILING_64_XE2¶: Xe2 64K tiling.

enumerator ISL_TILING_HIZ¶: Tiling format for HiZ surfaces

enumerator ISL_TILING_CCS¶: Tiling format for CCS surfaces

void isl_tiling_get_info(enum isl_tiling tiling, enum isl_surf_dim dim, enum isl_msaa_layout msaa_layout, uint32_t format_bpb, uint32_t samples, struct isl_tile_info *tile_info)¶

Returns an isl_tile_info representation of the given isl_tiling when combined when used in the given configuration.

Parameters:

tiling – [in] The tiling format to introspect
dim – [in] The dimensionality of the surface being tiled
msaa_layout – [in] The layout of samples in the surface being tiled
format_bpb – [in] The number of bits per surface element (block) for the surface being tiled
samples – [in] The samples in the surface being tiled
tile_info – [out] Return parameter for the tiling information

struct isl_tile_info¶

enum isl_tiling tiling¶: Tiling represented by this isl_tile_info

uint32_t format_bpb¶

The size (in bits per block) of a single surface element

For surfaces with power-of-two formats, this is the same as isl_format_layout::bpb. For non-power-of-two formats it may be smaller. The logical_extent_el field is in terms of elements of this size.

For example, consider ISL_FORMAT_R32G32B32_FLOAT for which isl_format_layout::bpb is 96 (a non-power-of-two). In this case, none of the tiling formats can actually hold an integer number of 96-bit surface elements so isl_tiling_get_info returns an isl_tile_info for a 32-bit element size. It is the responsibility of the caller to recognize that 32 != 96 and adjust accordingly. For instance, to compute the width of a surface in tiles, you would do:

width_tl = DIV_ROUND_UP(width_el * (format_bpb / tile_info.format_bpb),
                        tile_info.logical_extent_el.width);

struct isl_extent4d logical_extent_el¶

The logical size of the tile in units of format_bpb size elements

This field determines how a given surface is cut up into tiles. It is used to compute the size of a surface in tiles and can be used to determine the location of the tile containing any given surface element. The exact value of this field depends heavily on the bits-per-block of the format being used.

uint32_t max_miptail_levels¶

The maximum number of miplevels that will fit in the miptail.

This does not guarantee that the given number of miplevels will fit in the miptail as that is also dependent on the size of the miplevels.

struct isl_extent2d phys_extent_B¶

The physical size of the tile in bytes and rows of bytes

This field determines how the tiles of a surface are physically laid out in memory. The logical and physical tile extent are frequently the same but this is not always the case. For instance, a W-tile (which is always used with ISL_FORMAT_R8) has a logical size of 64el x 64el but its physical size is 128B x 32rows, the same as a Y-tile.

See isl_surf.row_pitch_B

const uint8_t *swiz¶: Swizzle of the virtual address

uint8_t swiz_count¶: Number of bit swizzles in swiz[]

The isl_tile_info structure has two different sizes for a tile: a logical size in surface elements and a physical size in bytes. In order to determine the proper logical size, the bits-per-block of the underlying format has to be passed into isl_tiling_get_info. The proper way to compute the size of an image in bytes given a width and height in elements is as follows:

uint32_t width_tl = DIV_ROUND_UP(width_el * (format_bpb / tile_info.format_bpb),
                                 tile_info.logical_extent_el.w);
uint32_t height_tl = DIV_ROUND_UP(height_el, tile_info.logical_extent_el.h);
uint32_t row_pitch = width_tl * tile_info.phys_extent_el.w;
uint32_t size = height_tl * tile_info.phys_extent_el.h * row_pitch;

It is very important to note that there is no direct conversion between isl_tile_info.logical_extent_el and isl_tile_info.phys_extent_B. It is tempting to assume that the logical and physical heights are the same and simply divide the width of isl_tile_info.phys_extent_B by the size of the format (which is what the PRM does) to get isl_tile_info.logical_extent_el but this is not at all correct. Some tiling formats have logical and physical heights that differ and so no such calculation will work in general. The easiest case study for this is W-tiling. From the Sky Lake PRM Vol. 2d, “RENDER_SURFACE_STATE” (p. 427):

If the surface is a stencil buffer (and thus has Tile Mode set to TILEMODE_WMAJOR), the pitch must be set to 2x the value computed based on width, as the stencil buffer is stored with two rows interleaved.

What does this mean? Why are we multiplying the pitch by two? What does it mean that “the stencil buffer is stored with two rows interleaved”? The explanation for all these questions is that a W-tile (which is only used for stencil) has a logical size of 64el x 64el but a physical size of 128B x 32rows. In memory, a W-tile has the same footprint as a Y-tile (128B x 32rows) but every pair of rows in the stencil buffer is interleaved into a single row of bytes yielding a two-dimensional area of 64el x 64el. You can consider this as its own tiling format or as a modification of Y-tiling. The interpretation in the PRMs vary by hardware generation; on Sandy Bridge they simply said it was Y-tiled but by Sky Lake there is almost no mention of Y-tiling in connection with stencil buffers and they are always W-tiled. This mismatch between logical and physical tile sizes are also relevant for hierarchical depth buffers as well as single-channel MCS and CCS buffers.

X-tiling¶

The simplest tiling format available on Intel graphics (which has been available since gen4) is X-tiling. An X-tile is 512B x 8rows and, within the tile, the data is arranged in an X-major linear fashion. You can also look at X-tiling as being an 8x8 cache line grid where the cache lines are arranged X-major as follows:

0x000	0x040	0x080	0x0c0	0x100	0x140	0x180	0x1c0
0x200	0x240	0x280	0x2c0	0x300	0x340	0x380	0x3c0
0x400	0x440	0x480	0x4c0	0x500	0x540	0x580	0x5c0
0x600	0x640	0x680	0x6c0	0x700	0x740	0x780	0x7c0
0x800	0x840	0x880	0x8c0	0x900	0x940	0x980	0x9c0
0xa00	0xa40	0xa80	0xac0	0xb00	0xb40	0xb80	0xbc0
0xc00	0xc40	0xc80	0xcc0	0xd00	0xd40	0xd80	0xdc0
0xe00	0xe40	0xe80	0xec0	0xf00	0xf40	0xf80	0xfc0

Each cache line represents a piece of a single row of pixels within the image. The memory locations of two vertically adjacent pixels within the same X-tile always differs by 512B or 8 cache lines.

As mentioned above, X-tiling is slower than Y-tiling (though still faster than linear). However, until Sky Lake, the display scan-out hardware could only do X-tiling so we have historically used X-tiling for all window-system buffers (because X or a Wayland compositor may want to put it in a plane).

Bit-6 Swizzling¶

When bit-6 swizzling is enabled, bits 9 and 10 are XORed in with bit 6 of the tiled address:

addr[6] ^= addr[9] ^ addr[10];

Y-tiling¶

The Y-tiling format, also available since gen4, is substantially different from X-tiling and performs much better in practice. Each Y-tile is an 8x8 grid of cache lines arranged Y-major as follows:

0x000	0x200	0x400	0x600	0x800	0xa00	0xc00	0xe00
0x040	0x240	0x440	0x640	0x840	0xa40	0xc40	0xe40
0x080	0x280	0x480	0x680	0x880	0xa80	0xc80	0xe80
0x0c0	0x2c0	0x4c0	0x6c0	0x8c0	0xac0	0xcc0	0xec0
0x100	0x300	0x500	0x700	0x900	0xb00	0xd00	0xf00
0x140	0x340	0x540	0x740	0x940	0xb40	0xd40	0xf40
0x180	0x380	0x580	0x780	0x980	0xb80	0xd80	0xf80
0x1c0	0x3c0	0x5c0	0x7c0	0x9c0	0xbc0	0xdc0	0xfc0

Each 64B cache line within the tile is laid out as 4 rows of 16B each:

0x00

0x01

0x02

0x03

0x04

0x05

0x06

0x07

0x08

0x09

0x0a

0x0b

0x0c

0x0d

0x0e

0x0f

0x10

0x11

0x12

0x13

0x14

0x15

0x16

0x17

0x18

0x19

0x1a

0x1b

0x1c

0x1d

0x1e

0x1f

0x20

0x21

0x22

0x23

0x24

0x25

0x26

0x27

0x28

0x29

0x2a

0x2b

0x2c

0x2d

0x2e

0x2f

0x30

0x31

0x32

0x33

0x34

0x35

0x36

0x37

0x38

0x39

0x3a

0x3b

0x3c

0x3d

0x3e

0x3f

Y-tiling is widely regarded as being substantially faster than X-tiling so it is generally preferred. However, prior to Sky Lake, Y-tiling was not available for scanout so X tiling was used for any sort of window-system buffers. Starting with Sky Lake, we can scan out from Y-tiled buffers.

Bit-6 Swizzling¶

When bit-6 swizzling is enabled, bit 9 is XORed in with bit 6 of the tiled address:

addr[6] ^= addr[9];

W-tiling¶

W-tiling is a new tiling format added on Sandy Bridge for use in stencil buffers. W-tiling is similar to Y-tiling in that it’s arranged as an 8x8 Y-major grid of cache lines. The bytes within each cache line are arranged as follows:

0x00	0x01	0x04	0x05	0x10	0x11	0x14	0x15
0x02	0x03	0x06	0x07	0x12	0x13	0x16	0x17
0x08	0x09	0x0c	0x0d	0x18	0x19	0x1c	0x1d
0x0a	0x0b	0x0e	0x0f	0x1a	0x1b	0x1e	0x1f
0x20	0x21	0x24	0x25	0x30	0x31	0x34	0x35
0x22	0x23	0x26	0x27	0x32	0x33	0x36	0x37
0x28	0x29	0x2c	0x2d	0x38	0x39	0x3c	0x3d
0x2a	0x2b	0x2e	0x2f	0x3a	0x3b	0x3e	0x3f

While W-tiling has been required for stencil all the way back to Sandy Bridge, the docs are somewhat confused as to whether stencil buffers are W or Y-tiled. This seems to stem from the fact that the hardware seems to implement W-tiling as a sort of modified Y-tiling. One example of this is the somewhat odd requirement that W-tiled buffers have their pitch multiplied by 2. From the Sky Lake PRM Vol. 2d, “RENDER_SURFACE_STATE” (p. 427):

If the surface is a stencil buffer (and thus has Tile Mode set to TILEMODE_WMAJOR), the pitch must be set to 2x the value computed based on width, as the stencil buffer is stored with two rows interleaved.

The last phrase holds the key here: “the stencil buffer is stored with two rows interleaved”. More accurately, a W-tiled buffer can be viewed as a Y-tiled buffer with each set of 4 W-tiled lines interleaved to form 2 Y-tiled lines. In ISL, we represent a W-tile as a tiling with a logical dimension of 64el x 64el but a physical size of 128B x 32rows. This cleanly takes care of the pitch issue above and seems to nicely model the hardware.

Tile4¶

The tile4 format, introduced on Xe-HP, is somewhat similar to Y but with more internal shuffling. Each tile4 tile is an 8x8 grid of cache lines arranged as follows:

0x000	0x040	0x080	0x0a0	0x200	0x240	0x280	0x2a0
0x100	0x140	0x180	0x1a0	0x300	0x340	0x380	0x3a0
0x400	0x440	0x480	0x4a0	0x600	0x640	0x680	0x6a0
0x500	0x540	0x580	0x5a0	0x700	0x740	0x780	0x7a0
0x800	0x840	0x880	0x8a0	0xa00	0xa40	0xa80	0xaa0
0x900	0x940	0x980	0x9a0	0xb00	0xb40	0xb80	0xba0
0xc00	0xc40	0xc80	0xca0	0xe00	0xe40	0xe80	0xea0
0xd00	0xd40	0xd80	0xda0	0xf00	0xf40	0xf80	0xfa0

Each 64B cache line within the tile is laid out the same way as for a Y-tile, as 4 rows of 16B each:

0x00

0x01

0x02

0x03

0x04

0x05

0x06

0x07

0x08

0x09

0x0a

0x0b

0x0c

0x0d

0x0e

0x0f

0x10

0x11

0x12

0x13

0x14

0x15

0x16

0x17

0x18

0x19

0x1a

0x1b

0x1c

0x1d

0x1e

0x1f

0x20

0x21

0x22

0x23

0x24

0x25

0x26

0x27

0x28

0x29

0x2a

0x2b

0x2c

0x2d

0x2e

0x2f

0x30

0x31

0x32

0x33

0x34

0x35

0x36

0x37

0x38

0x39

0x3a

0x3b

0x3c

0x3d

0x3e

0x3f

Tiling as a bit pattern¶

There is one more important angle on tiling that should be discussed before we finish. Every tiling can be described by three things:

A logical width and height in elements

A physical width in bytes and height in rows

A mapping from logical elements to physical bytes within the tile

We have spent a good deal of time on the first two because this is what you really need for doing surface layout calculations. However, there are cases in which the map from logical to physical elements is critical. One example is W-tiling where we have code to do W-tiled encoding and decoding in the shader for doing stencil blits because the hardware does not allow us to render to W-tiled surfaces.

There are many ways to mathematically describe the mapping from logical elements to physical bytes. In the PRMs they give a very complicated set of formulas involving lots of multiplication, modulus, and sums that show you how to compute the mapping. With a little creativity, you can easily reduce those to a set of bit shifts and ORs. By far the simplest formulation, however, is as a mapping from the bits of the texture coordinates to bits in the address. Suppose that \((u, v)\) is location of a 1-byte element within a tile. If you represent \(u\) as \(u_n u_{n-1} \cdots u_2 u_1 u_0\) where \(u_0\) is the LSB and \(u_n\) is the MSB of \(u\) and similarly \(v = v_m v_{m-1} \cdots v_2 v_1 v_0\), then the bits of the address within the tile are given by the table below:

Tiling	11	10	9	8	7	6	5	4	3	2	1	0
`isl_tiling.ISL_TILING_X`	\(v_2\)	\(v_1\)	\(v_0\)	\(u_8\)	\(u_7\)	\(u_6\)	\(u_5\)	\(u_4\)	\(u_3\)	\(u_2\)	\(u_1\)	\(u_0\)
`isl_tiling.ISL_TILING_Y0`	\(u_6\)	\(u_5\)	\(u_4\)	\(v_4\)	\(v_3\)	\(v_2\)	\(v_1\)	\(v_0\)	\(u_3\)	\(u_2\)	\(u_1\)	\(u_0\)
`isl_tiling.ISL_TILING_W`	\(u_5\)	\(u_4\)	\(u_3\)	\(v_5\)	\(v_4\)	\(v_3\)	\(v_2\)	\(u_2\)	\(v_1\)	\(u_1\)	\(v_0\)	\(u_0\)
`isl_tiling.ISL_TILING_4`	\(v_4\)	\(v_3\)	\(u_6\)	\(v_2\)	\(u_5\)	\(u_4\)	\(v_1\)	\(v_0\)	\(u_3\)	\(u_2\)	\(u_1\)	\(u_0\)

Constructing the mapping this way makes a lot of sense when you think about hardware. It may seem complex on paper but “simple” things such as addition are relatively expensive in hardware while interleaving bits in a well-defined pattern is practically free. For a format that has more than one byte per element, you simply chop bits off the bottom of the pattern, hard-code them to 0, and adjust bit indices as needed. For a 128-bit format, for instance, the Y-tiled pattern becomes \(u_2 u_1 u_0 v_4 v_3 v_2 v_1 v_0\). The Sky Lake PRM Vol. 5 in the section “2D Surfaces” contains an expanded version of the above table (which we will not repeat here) that also includes the bit patterns for the Ys and Yf tiling formats.