Fragment Density Map

VK_EXT_fragment_density_map is an extension that allows applications to render parts of the screen at a lower resolution. It is designed for tiled rendering GPU architectures such as Adreno: the intended implementation renders some of the tiles at a lower resolution and scales them up when resolving to system memory or when sampling the resulting image. This inherently means that it is “all or nothing,” that is, it must be enabled or disabled for the entire render pass. While the idea is simple, the implementation in turnip is quite subtle, with many interactions with other features. This page attempts to document the main principles behind the implementation.

Coordinate Space Soup

In order to render a tile at lower resolution, we have to override the user’s viewport and scissor for each tile depending on the scaling factor provided by the user. This becomes complicated fast, so let’s start by defining a few coordinate spaces that we’ll have to work with.

Framebuffer space

This is the space of the final rendered image. From the user’s perspective everything is specified in this space, and fragments created by the rasterizer appear to be larger than 1 pixel. But this is not what actually happens in the hardware; it is a fiction created by the driver. The other spaces below are what the hardware actually “sees”.

GMEM space

This space exists whenever tiled rendering/GMEM is used, even without FDM. It is the space used to access GMEM, with the origin at the upper left of the tile. The hardware automatically transforms rendering space into GMEM space, via the various *_WINDOW_OFFSET registers, whenever GMEM is accessed. The origin of this space will be called \(b_{cs}\), the common bin start, for reasons that are explained below. When using FDM, coordinates in this space must be multiplied by the scaling factor \(s\) derived from the fragment density map, or equivalently divided by the fragment area (as defined by the Vulkan specification), with the origin still at the upper left of the tile. For example, if \(s_x = 1/2\), then the bin is half as wide as it would’ve been without FDM and all coordinates in this space must be divided by 2.
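As a purely illustrative sketch (not actual turnip code), the combined effect of the driver-side transform and the *_WINDOW_OFFSET subtraction on one axis can be written as the distance from the bin start in framebuffer space, scaled by \(s\):

   /* Illustrative only: map a framebuffer-space coordinate within a bin to
    * GMEM space along one axis. Without FDM, scale is 1.0 and this is just
    * the *_WINDOW_OFFSET subtraction; with FDM the distance from the bin
    * start is additionally multiplied by s.
    */
   static float
   fb_to_gmem(float coord, float bin_start, float scale)
   {
      return (coord - bin_start) * scale;
   }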

Rendering space

This is the space in which the hardware rasterizer operates and produces fragments. Normally this is the same as framebuffer space, but with FDM it is not. We transform the viewport and scissor from framebuffer space to rendering space by patching them per-tile in the driver. Then, when we resolve the tile, we scale it back to the correct resolution by blitting from the rendering-space source to the framebuffer-space destination.

The correct transform from framebuffer space to rendering space has to shrink coordinates by \(s\) while mapping the original bin start in framebuffer space, \(b_s\), to \(b_{cs}\). Since \(b_{cs}\) is entirely defined by the driver when programming *_WINDOW_OFFSET, one tempting way to do this is to just multiply by \(s\) and define \(b_{cs} = b_s * s\). It turns out, however, that this doesn’t work. A key requirement is to handle cases where the same scene is rendered in multiple different views at the same time using VK_KHR_multiview, as in VR use-cases. In this case we want \(s\) to vary per view, but \(b_{cs}\) is always the same for every view because there is only one *_WINDOW_OFFSET register for all layers (hence the name).

We follow the blob by leaving \(b_{cs}\) the same regardless of whether FDM is enabled or not. This means that normally \(b_s = b_{cs}\), although this is not the case if VK_EXT_fragment_density_map_offset is in use and the bins are shifted per-view. Since the coordinates need to be scaled by \(s\), we know that the transform needs to look like \(x' = s * x + o\), where only the offset \(o\) is free. Plugging in the constraint that \(b_s\) maps to \(b_{cs}\), we get that \(b_{cs} = s * b_s + o\) or \(o = b_{cs} - s * b_s\). This is the function computed by tu_fdm_per_bin_offset and used to calculate the transform for the viewport, scissor, and gl_FragCoord. One critical thing is that the offset must be an integer, or in other words the framebuffer space bin start \(b_s\) must be a multiple of \(1 / s\). This is a natural constraint anyway, because if it wasn’t the case then the bin would start in the middle of a fragment which isn’t possible to handle correctly.
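As a rough sketch of what tu_fdm_per_bin_offset computes (the struct and function signature below are hypothetical, not the actual turnip code), the offset follows directly from the constraint above:

   /* Hypothetical sketch of the per-bin offset o = b_cs - s * b_s. frag_area
    * is the fragment area sampled from the FDM for this bin and view, so the
    * per-axis scale is s = 1 / frag_area.
    */
   struct offset2d { float x, y; };

   static struct offset2d
   fdm_per_bin_offset(float frag_area_x, float frag_area_y,
                      int bin_start_x, int bin_start_y,               /* b_s */
                      int common_bin_start_x, int common_bin_start_y) /* b_cs */
   {
      float sx = 1.0f / frag_area_x;
      float sy = 1.0f / frag_area_y;

      /* b_s must be a multiple of 1/s, so the result is an integer. */
      return (struct offset2d) {
         .x = common_bin_start_x - sx * bin_start_x,
         .y = common_bin_start_y - sy * bin_start_y,
      };
   }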

Viewport and Scissor Patching

In order to have \(s\) differ per view, we have to be able to override the viewport per view. That is, we need to transform the viewport for each view differently. If there is only one viewport, then we duplicate the user’s viewport for each view and transform it using the \(b_s\) and \(s\) for that view, and we set a “per-view viewport” bit to select the viewport per view instead of using the default viewport 0. When VK_VALVE_fragment_density_map_layered is in use, we instead have to insert shader code to achieve the same thing.

If the user specifies multiple viewports but they are per-view because VK_QCOM_multiview_per_view_viewports is enabled, then we can just set the per-view viewport bit and transform each user viewport individually by the corresponding scale. But if the user explicitly writes gl_ViewportIndex, then there is nothing we can do and we have to make \(s\) the same for all views by conservatively taking the minimum. Then we apply \(s\) to all of the user-specified viewports.
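A simplified sketch of the per-view viewport patching described above, assuming the per-view scale and offset have already been computed; the struct and helper names are illustrative, not the actual turnip code:

   /* Illustrative only: duplicate the single user viewport for each view and
    * apply the per-view transform x' = s * x + o derived earlier.
    */
   struct viewport { float x, y, width, height, min_depth, max_depth; };

   static void
   patch_viewports_per_view(const struct viewport *user_vp,
                            const float scale[][2],  /* s per view */
                            const float offset[][2], /* o per view */
                            unsigned view_count,
                            struct viewport *out)
   {
      for (unsigned i = 0; i < view_count; i++) {
         out[i] = *user_vp;
         out[i].x      = user_vp->x * scale[i][0] + offset[i][0];
         out[i].y      = user_vp->y * scale[i][1] + offset[i][1];
         out[i].width  = user_vp->width  * scale[i][0];
         out[i].height = user_vp->height * scale[i][1];
      }
   }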

Because the bin size is now per-view, the usual mechanism of *_WINDOW_SCISSOR for clipping fragments outside the bin doesn’t work. Instead the driver needs to intersect the transformed user-specified scissor with the transformed rendering-space bin coordinates, effectively replacing *_WINDOW_SCISSOR.
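Conceptually, the replacement for *_WINDOW_SCISSOR is just an intersection in rendering space; a minimal sketch, with a hypothetical rect type:

   /* Illustrative only: the per-bin scissor is the intersection of the
    * transformed user scissor with the transformed bin, both in rendering
    * space. */
   struct rect { int x0, y0, x1, y1; };

   static struct rect
   intersect_scissor_with_bin(struct rect scissor, struct rect bin)
   {
      return (struct rect) {
         .x0 = scissor.x0 > bin.x0 ? scissor.x0 : bin.x0,
         .y0 = scissor.y0 > bin.y0 ? scissor.y0 : bin.y0,
         .x1 = scissor.x1 < bin.x1 ? scissor.x1 : bin.x1,
         .y1 = scissor.y1 < bin.y1 ? scissor.y1 : bin.y1,
      };
   }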

Fragment density map offset

In order to “properly” implement VK_EXT_fragment_density_map_offset, we need to add an extra row/column of bins at the end and then shift the binning grid up and to the left by an offset \(b_o\). This offset is based on the user’s offset but has the opposite sign, i.e. when shifting the FDM to the left we have to shift the binning grid to the right, and once the user’s offset becomes large enough we “wrap around” and shift the scaling factor \(s\) over to the next bin. This has to happen per-view. In turnip the function that computes \(b_o\) is called tu_bin_offset. Each tile then gets an offset start \(b_s = b_{cs} - b_o\), except for the tiles in the first row/column, which keep their start and instead shrink in height/width respectively.
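A very rough sketch of the wrap-around idea for one axis (the real tu_bin_offset works per view and in terms of the hardware bin grid; the arithmetic here is only an assumption for illustration):

   /* Illustrative only: derive a binning-grid offset b_o from the user's FDM
    * offset along one axis. The sign is flipped (shifting the FDM left shifts
    * the grid right), and the result wraps within one bin so that a large
    * user offset shifts which bin a scale factor applies to instead of
    * growing b_o without bound.
    */
   static int
   bin_offset(int user_fdm_offset, int bin_size)
   {
      int o = -user_fdm_offset % bin_size;
      if (o < 0)
         o += bin_size;
      return o;
   }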

If we cannot make \(s\) per-view, then we also cannot make \(b_s\) per-view and so we cannot shift the bins over. Therefore we fall back to only shifting where \(s\) is sampled from, which produces jittery and jarring transitions when a bin suddenly changes resolution.

Bin merging

FDM shrinks the size of the bin in GMEM, which results in a lot of wasted, unused space in GMEM. a7xx mitigates this by introducing “bin merging”. If two tiles next to each other have the same scaling for each view, then we combine them into one tile, as long as the combined size in rendering space isn’t larger than the original size of an unscaled bin in framebuffer space. We can even merge larger groups of tiles. The only hardware feature needed for this to work is the ability to merge the visibility streams for the tiles, which was added on a7xx by a new bitmask in CP_SET_BIN_DATA5 and variants. Only bins within the same visibility stream/VSC pipe can be merged.
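A hedged sketch of the merge criterion for two adjacent bins, under the assumption that it boils down to the two checks described above (the helper and its parameters are hypothetical):

   #include <stdbool.h>

   /* Illustrative only: two adjacent bins may merge if every view uses the
    * same fragment area and the merged extent, scaled into rendering space,
    * still fits inside an unscaled bin.
    */
   static bool
   can_merge_bins(const float frag_area_a[][2], const float frag_area_b[][2],
                  unsigned view_count,
                  unsigned merged_fb_width, unsigned merged_fb_height,
                  unsigned bin_width, unsigned bin_height)
   {
      for (unsigned i = 0; i < view_count; i++) {
         if (frag_area_a[i][0] != frag_area_b[i][0] ||
             frag_area_a[i][1] != frag_area_b[i][1])
            return false;

         if (merged_fb_width  / frag_area_a[i][0] > bin_width ||
             merged_fb_height / frag_area_a[i][1] > bin_height)
            return false;
      }
      return true;
   }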

Hardware scaling registers and LRZ

One disadvantage of FDM on a6xx is that low-resolution tiles cannot use LRZ, because the LRZ hardware is not aware of the transform between framebuffer space and rendering space and applies the framebuffer-space LRZ values to the rendering-space fragments. In order to fix this, a740 adds new offset and scale registers. The offset \(o'\) is applied to fragment coordinates during rasterization after LRZ, so that viewport, scissor, and LRZ are in a new “LRZ space” while the other operations (resolves, unresolves, and attachment writes) still happen in rendering space, which is now offset. \(o'\) is specified for each layer. The scale \(s\) is the same as before, and it is used to multiply the fragment area covered by each LRZ pixel.

Without VK_EXT_fragment_density_map_offset, we can simply make LRZ space equal to framebuffer space scaled down by \(s\). That is, we can set \(o'\) to what \(o\) was before and then set \(o\) to 0, only scaling down the viewport but not shifting it and letting the hardware handle the shift. Then LRZ pixels will be scaled up appropriately and everything will work. However, this doesn’t work if there is a bin offset \(b_o\). In order to make binning work, we shift the viewport and scissor by \(b_o\) when binning. Unfortunately the offset registers do not have any effect when binning, so rendering space and LRZ space have to be the same when binning, and the visibility stream is generated from rendering space. This means that LRZ space also has to be shifted over compared to framebuffer space, and the LRZ buffer must be overallocated when FDM offset might be used with it (which is signalled by VK_IMAGE_CREATE_FRAGMENT_DENSITY_MAP_OFFSET_BIT_EXT) because the LRZ image will be shifted by \(b_o\).

In order for LRZ to work, LRZ space when rendering must be equal to LRZ space when binning scaled down by \(s\). The origin of LRZ space when binning is \(-b_o\), and this must be mapped to 0. The transform from framebuffer space to LRZ space is \(x' = x * s + o\), and the transform from framebuffer space to rendering space is \(x'' = x * s + o + o'\). We get that \(o + o' = b_{cs} - b_s * s\), similar to before, and \(0 = -b_o * s + o\) so that \(o = b_o * s\) and finally \(o' = b_{cs} - b_s * s - b_o * s\), or after rearranging \(o' = b_{cs} - (b_s + b_o) * s\). For all tiles except those in the first row or column, this simplifies to \(o' = b_{cs} - b_{cs} * s\) because \(b_{cs} = b_s + b_o\). For tiles in the first row or column, \(b_s\) and \(b_{cs}\) are both 0 in one of the coordinates, so it becomes \(o' = -b_o * s\) in that coordinate. This isn’t representable in hardware: it is negative (which could be worked around by artificially shifting \(b_{cs}\)), but more importantly it may not meet the alignment requirements for the hardware register (which is currently 8 pixels). We have to just disable LRZ in this case.
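Putting the derivation together, a hedged sketch of how \(o'\) might be computed and validated per axis; the helper is hypothetical and the 8-pixel alignment check mirrors the description above:

   #include <math.h>
   #include <stdbool.h>

   /* Illustrative only: compute o' = b_cs - (b_s + b_o) * s for one axis and
    * decide whether it can be programmed. A negative o' could in principle be
    * worked around by shifting b_cs, but an o' that misses the register's
    * alignment (8 pixels here) cannot, so LRZ is disabled for such bins.
    */
   static bool
   compute_lrz_offset(int common_bin_start, int bin_start, int bin_offset,
                      float scale, float *out_offset)
   {
      float o_prime = common_bin_start - (bin_start + bin_offset) * scale;

      if (o_prime < 0.0f || fmodf(o_prime, 8.0f) != 0.0f)
         return false;

      *out_offset = o_prime;
      return true;
   }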