Hardware docs

Command buffers

Command format

Each command sent to the GPU contains a method and some data. Method names are documented in the corresponding header files copied from NVIDIA, e.g. the Fermi graphics methods are in src/nouveau/headers/nvidia/classes/cl9097.h.

P_IMMD

Often you will want to issue a single method with its data, which can be done with P_IMMD:

P_IMMD(p, NV9097, WAIT_FOR_IDLE, 0);

P_IMMD emits either a single-word immediate-data method or a pair of words equivalent to P_MTHD followed by the provided data. Code must therefore count each P_IMMD as up to two words.
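The one-or-two-word choice can be sketched as follows. This is a minimal illustration, not the actual macro: make_hdr and emit_immd are hypothetical helpers, and the header layout (opcode in bits 31:29, a 13-bit count or immediate in bits 28:16, subchannel in bits 15:13, method address divided by 4 in the low bits) is an assumption about the Fermi-and-later pushbuffer format rather than something stated on this page. The sketch also decides at run time, which may differ from how the real macro decides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed header opcodes (bits 31:29 of the header word). */
#define HDR_INC   0x20000000u  /* incrementing method header */
#define HDR_IMMD  0x80000000u  /* immediate-data method header */

/* Hypothetical helper: build a header word. `field` is the data-word count
 * for HDR_INC, or the 13-bit immediate payload for HDR_IMMD. */
static uint32_t make_hdr(uint32_t op, uint32_t subc, uint32_t mthd,
                         uint32_t field)
{
   assert(subc < 8 && (mthd & 3) == 0 && field <= 0x1fff);
   return op | (field << 16) | (subc << 13) | (mthd >> 2);
}

/* Hypothetical stand-in for P_IMMD: writes one or two words to `out`
 * and returns how many were written. */
static size_t emit_immd(uint32_t *out, uint32_t subc, uint32_t mthd,
                        uint32_t data)
{
   if (data <= 0x1fff) {
      /* The payload fits in the header itself: one word. */
      out[0] = make_hdr(HDR_IMMD, subc, mthd, data);
      return 1;
   }
   /* Otherwise fall back to a header with count 1 plus a data word. */
   out[0] = make_hdr(HDR_INC, subc, mthd, 1);
   out[1] = data;
   return 2;
}
```

Because the caller cannot always know which branch is taken, reserving two words per call is the safe budget.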

P_MTHD

P_MTHD is a convenient way to execute multiple consecutive methods without repeating the method header. For example, the code:

P_MTHD(p, NV9097, SET_REPORT_SEMAPHORE_A);
P_NV9097_SET_REPORT_SEMAPHORE_A(p, addr >> 32);
P_NV9097_SET_REPORT_SEMAPHORE_B(p, addr);
P_NV9097_SET_REPORT_SEMAPHORE_C(p, value);

generates four words: one word for the method header (defaulting to NINC) and then three words of data. NINC automatically increments the method ID by one word for each data value, which is why the example can advance from SET_REPORT_SEMAPHORE_A to B to C.
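As a rough picture of those four words, the sketch below builds them by hand. The function emit_report_semaphore is hypothetical, the header layout is the same assumed Fermi-and-later format (incrementing opcode 0x20000000, count in bits 28:16, subchannel in bits 15:13, method address divided by 4 in the low bits), and the method address 0x1b00 is an assumed value for SET_REPORT_SEMAPHORE_A that should be checked against cl9097.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed byte address of SET_REPORT_SEMAPHORE_A; B and C are taken to
 * follow at +4 and +8. Verify against the class header before relying on it. */
#define MTHD_SET_REPORT_SEMAPHORE_A 0x1b00u

/* Hypothetical helper: one incrementing header plus three data words. */
static size_t emit_report_semaphore(uint32_t *out, uint32_t subc,
                                    uint64_t addr, uint32_t value)
{
   assert(subc < 8);
   /* Header: incrementing opcode, count = 3 data words. */
   out[0] = 0x20000000u | (3u << 16) | (subc << 13) |
            (MTHD_SET_REPORT_SEMAPHORE_A >> 2);
   out[1] = (uint32_t)(addr >> 32); /* lands on ..._A (upper address bits) */
   out[2] = (uint32_t)addr;         /* lands on ..._B (lower 32 bits) */
   out[3] = value;                  /* lands on ..._C */
   return 4;
}
```

The explicit casts mirror what the original snippet does implicitly: the 64-bit address is split across two 32-bit data words.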

P_0INC

0INC will issue the same method repeatedly for each following data word.

P_1INC

1INC will increment after one word and then issue the following method repeatedly. For example, the code:

P_1INC(p, NV9097, CALL_MME_MACRO(NVK_MME_SET_PRIV_REG));
P_INLINE_DATA(p, 0);
P_INLINE_DATA(p, BITFIELD_BIT(3));

sends the first data word to NV9097_CALL_MME_MACRO, then increments the method once and sends every following data word (here, one) to NV9097_CALL_MME_DATA.
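The difference between the increment behaviours described in this section comes down to which method each data word lands on. The toy model below makes that explicit. The enum and target_method are hypothetical names, MODE_NINC stands for the fully incrementing mode that steps through consecutive methods, and methods are assumed to sit 4 bytes apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three increment behaviours. */
enum inc_mode {
   MODE_NINC, /* advance to the next method after every data word */
   MODE_0INC, /* never advance */
   MODE_1INC, /* advance once, after the first data word */
};

/* Byte address of the method that receives data word `i` (0-based),
 * given the method address named in the header. */
static uint32_t target_method(enum inc_mode mode, uint32_t first_mthd,
                              size_t i)
{
   assert((first_mthd & 3) == 0);
   switch (mode) {
   case MODE_NINC:
      return first_mthd + 4 * (uint32_t)i;
   case MODE_0INC:
      return first_mthd;
   case MODE_1INC:
      return first_mthd + (i > 0 ? 4 : 0);
   }
   return first_mthd;
}
```

Under this model, a 1INC header followed by several data words sends word 0 to the named method and all later words to the method immediately after it, which is the pattern a macro call followed by its parameters needs.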

Execution barriers

Commands within a command buffer can be synchronized in a few different ways.

Explicit WFI - Idles all engines before executing the next command, e.g. via NVA16F_WFI or NV9097_WAIT_FOR_IDLE

Semaphores - Delay execution based on values in a memory location. See open-gpu-doc on semaphores

A subchannel switch - Causes the hardware to execute an implied WFI

Subchannel switches

A subchannel switch occurs when the hardware receives a command for a different subchannel than the one that it’s currently executing. For example, if the hardware is currently executing commands on the 3D engine (SUBC_NV9097 == 0), a command executed on the compute engine (SUBC_NV90C0 == 1) will cause a subchannel switch. Host methods (class *6F) are an exception to this - they can be issued to any subchannel and will not trigger a subchannel switch [5] [2].

Subchannel switches act the same way that an explicit WFI does - they fully idle the channel before issuing commands to the next engine [1] [2].
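To reason about where implied WFIs fall in a command stream, the rule above can be written down as a small counting model. It is only a sketch of the rule as described here, not of the hardware: struct cmd and count_subchannel_switches are hypothetical, and whether a command is a host method is supplied by the caller rather than derived from the method address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one command in a pushbuffer. */
struct cmd {
   uint32_t subc;        /* subchannel the command is addressed to */
   bool is_host_method;  /* host (class *6F) methods never cause a switch */
};

/* Count how many implied WFIs the stream incurs from subchannel switches. */
static size_t count_subchannel_switches(const struct cmd *cmds, size_t n)
{
   size_t switches = 0;
   bool have_current = false;
   uint32_t current = 0;

   for (size_t i = 0; i < n; i++) {
      assert(cmds[i].subc < 8);
      if (cmds[i].is_host_method)
         continue; /* does not change the active subchannel */
      if (have_current && cmds[i].subc != current)
         switches++; /* engine change: behaves like an explicit WFI */
      current = cmds[i].subc;
      have_current = true;
   }
   return switches;
}
```

A stream that alternates between 3D and compute on every command pays for an idle each time, so batching work per engine keeps the count down.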

The same holds on Blackwell. Some NVIDIA documentation contradicts this: “On NVIDIA Blackwell Architecture GPUs and newer, subchannel switches do not occur between 3D and compute workloads” [1]. This documentation appears to be wrong or inapplicable: tests do not reproduce the behavior it describes [2] [3], and the blob does not change its event implementation for Blackwell [4].

Copy engine

The copy engine’s PIPELINED mode allows a new transfer to start before the previous transfer finishes, while NON_PIPELINED acts as an execution barrier between the current copy and the previous one [6].

[6] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2b436058a616676ec888ef3814d1db6b2220f2eb/src/common/sdk/nvidia/inc/ctrl/ctrl0050.h#L75-L83