Thursday, November 1, 2012

i915/GEM Crashcourse, Part 2

After the previous installment, this part covers command submission to the gpu. See the i915/GEM crashcourse overview for links to the other parts of this series.

Command Submission and Relocations

As I've already alluded to, gpu command submission on intel hardware happens by sending a special buffer object filled with rendering commands, the so-called batch buffer, to the kernel for execution on the gpu. The ioctl to do this is called execbuf. Now this buffer contains tons of references to various other buffer objects which hold textures, render buffers, depth&stencil buffers, vertices, all kinds of gpu-specific things like shaders, and also quite a few state buffers which e.g. describe the configuration of specific (fixed-function) gpu units.

The problem now is that userspace doesn't control where all these buffers are - the kernel manages the entire GTT. And the kernel needs to manage the entire GTT, since otherwise multiple users of the same single gpu can't get along. So the kernel needs to be able to move buffers around in the GTT when they don't all fit in at the same time, which means clearing the PTEs in the relevant pagetables for the old buffers that get kicked out and then filling them again with entries pointing at the new buffers which the gpu now requires to execute the batch buffer. In short, userspace needs to fill the batchbuffer with tons of GTT addresses, but only the kernel really knows them at any given point.

This little problem is solved by supplying a big list of relocation entries along with the batchbuffer, plus a list of all the buffers required to execute this batch. To optimize for the common case where buffers don't move around, userspace prefills all references with the GTT offsets from the last command submission (the kernel is so kind as to tell userspace the updated offset after successful submission of a batch). The kernel then goes through that relocation list and checks whether the offsets that userspace presumed are still correct. If that's not the case, it updates the buffer references in the batch and so relocates the referenced data, hence the name.
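The core of that fastpath can be sketched in a few lines. This is a simplified model, not the driver's actual code: the struct loosely mirrors the real `drm_i915_gem_relocation_entry` UAPI (`offset`, `presumed_offset`, `delta` are real field names, everything else is illustrative).

```c
/* Minimal sketch of relocation processing; structures and helper are
 * illustrative stand-ins, loosely modeled on the i915 relocation UAPI. */
#include <stdint.h>

struct reloc_entry {
    uint32_t offset;          /* byte offset into the batch to patch */
    uint64_t presumed_offset; /* GTT offset userspace assumed */
    uint32_t delta;           /* constant added to the target's address */
};

/* Patch one reference in the batch if the target buffer moved.
 * Returns 1 if a rewrite was needed, 0 if the presumed offset was
 * still correct (the common fastpath). */
int apply_reloc(uint32_t *batch, struct reloc_entry *r, uint64_t actual_offset)
{
    if (r->presumed_offset == actual_offset)
        return 0; /* buffer didn't move, nothing to do */

    batch[r->offset / sizeof(uint32_t)] = (uint32_t)(actual_offset + r->delta);
    r->presumed_offset = actual_offset; /* remember for the next execbuf */
    return 1;
}
```

When no buffer has moved, the whole relocation pass degenerates into a read-and-compare loop, which is why prefilling presumed offsets pays off.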

A slight complication is that the gpu data structures can be several levels deep, e.g. the batch points at the surface state, which then points at the texture/render buffers. So each buffer in the command submission list has a relocation list pointer, but for most buffers it's just NULL (since they just contain data and don't reference any other buffers).

Now along with the information required to rewrite references for relocated buffers, userspace also supplies some other information (the read/write domains) about how it wants to use the buffer. Originally this was meant to optimize cache coherency management (coherency will be covered in detail later on), but nowadays that code is massively simplified, since that clever optimized cache tracking is simply not worth it. We do still use these domain values to implement a workaround on Sandybridge, though: Since we use PPGTT, all memory accesses from the batch are directed to go through the PPGTT, with the exception that pipe control writes (useful for a bunch of things, but mostly for OpenGL queries) always go through the global GTT (it's a bug in the hw ...). Hence we need to ensure that we not only bind to the PPGTT, but also set up a mapping in the global GTT. And we detect this situation by checking for a GEM_INSTRUCTION write domain - only pipe control writes have that set.
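The detection itself is a one-line check. A sketch, with domain values mirroring the `I915_GEM_DOMAIN_*` flags from the real UAPI (the function name is made up for illustration):

```c
/* Sketch of the Sandybridge workaround check: a relocation whose write
 * domain is the instruction domain marks a pipe control write, which
 * (due to the hw bug) goes through the global GTT. Flag values mirror
 * the real I915_GEM_DOMAIN_* definitions. */
#include <stdint.h>

#define GEM_DOMAIN_RENDER      (1u << 1)
#define GEM_DOMAIN_INSTRUCTION (1u << 4)

/* Returns nonzero if the buffer needs a global GTT binding in
 * addition to its PPGTT binding. */
int needs_global_gtt(uint32_t read_domains, uint32_t write_domain)
{
    (void)read_domains; /* only the write domain matters here */
    return (write_domain & GEM_DOMAIN_INSTRUCTION) != 0;
}
```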

The other special relocation case is for older generations, where the gpu needs a fence set up to access tiled buffers, at least for some operations. The relocation entries have a flag to signal to the kernel that a fence is required. Another peculiarity is that fences can only be set up in the mappable part of the GTT, at least on those chips that require them for rendering. Hence we also restrict the placement of any buffers that require a fence to the mappable part of the GTT.

So after rewriting any references to buffers that moved around, the kernel is ready to submit the batch to the gpu. Every gpu engine has a ringbuffer that the kernel can fill with its own commands. First we emit a few preparatory commands to flush caches and set a few registers (which normal userspace batches can't write) to the values that userspace needs. Then we start the batch by emitting a MI_BATCHBUFFER_START command.

Retiring and Synchronization

Now the gpu can happily process the commands and do the rendering, but that leaves the kernel with a problem: When is the gpu done? Userspace obviously needs to know this to avoid reading back incomplete results. But the kernel also needs to know this, to avoid unmapping buffers which are still in use by the gpu. For example, when a render operation requires a temporary buffer, userspace might free that buffer right away after the execbuf call completes; the kernel then needs to delay the unmapping and freeing of the backing storage until the gpu no longer needs that buffer.

Therefore the kernel associates a sequence number with every batchbuffer and adds a write of that sequence number to the ringbuffer. Every engine has a hardware status page (HWS_PAGE) which we can use for such synchronization purposes. The advantage of that special status page is that gpu writes to it snoop the cpu caches, and hence a read from it is much faster than directly reading the gpu's ring buffer head pointer register. We also add a MI_USER_IRQ command after the sequence number (seqno for short) write, so that we don't need to constantly poll while waiting for the gpu.
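The check for "has this batch completed" then reduces to comparing the seqno the gpu last wrote to the status page against the batch's seqno. Since seqnos are 32-bit counters that eventually wrap, the comparison needs to be done in modular arithmetic, as the real driver's `i915_seqno_passed()` does. A sketch (helper names are illustrative):

```c
/* Sketch of seqno bookkeeping: the gpu writes the seqno of the last
 * completed batch into the hardware status page; the kernel compares
 * against it modulo 2^32 so a wrapping seqno still orders correctly. */
#include <stdint.h>

/* True if seq1 is at or after seq2, allowing for 32-bit wraparound. */
int seqno_passed(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) >= 0;
}

/* Has the batch with the given seqno completed, according to the
 * seqno the gpu last wrote into the status page? */
int batch_completed(const uint32_t *status_page, uint32_t batch_seqno)
{
    return seqno_passed(status_page[0], batch_seqno);
}
```

The signed subtraction trick means the ordering stays correct as long as no two outstanding seqnos are more than 2^31 apart, which is trivially true in practice.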

Two little optimizations apply to this gpu-to-cpu synchronization mechanism: If the cpu doesn't wait for the gpu, we mask the gpu engine interrupts to avoid flooding the cpu with thousands of interrupts (and potentially waking it up from deeper sleep states all the time). And the seqno read has a fastpath which might not be fully coherent, and a potentially much more expensive slowpath. This is because of some coherency issues on recent platforms, where the interrupt seemingly arrives before the seqno write has landed in the status page. Since we check that seqno rather often, it's good to have a lightweight check which might not give the most up-to-date value, but is good enough to avoid going through more costly slowpaths in the code that handles synchronization with the gpu.

So now we have a means to track the progress of the gpu through the batches submitted to the engine's ringbuffer, but not yet a means to prevent the kernel from unmapping or freeing still in-use buffers. For that the kernel keeps a per-engine list of all active buffers, and marks each buffer with the seqno of the latest batch it has been used for. It also keeps a list of all still outstanding seqnos in a per-engine request list. The clever trick now is that the kernel keeps an additional reference on each buffer object that resides on one of the active lists - that way a buffer can never disappear while still in use by the gpu, even when userspace drops all its references. To batch up the active list processing and the retiring of any newly completed requests, the kernel runs a regular task from a worker thread to clean things up.
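The retiring pass can be sketched like this. The structures are simplified stand-ins for the driver's real ones; the key invariants are real: the active list is ordered by submission (oldest first), and each buffer on it holds an extra reference that is dropped on retire.

```c
/* Sketch of active-list retiring: each buffer on an engine's active
 * list holds an extra reference and records the seqno of the last
 * batch using it; the retire worker drops buffers whose seqno the gpu
 * has passed. Structures are illustrative, not the driver's. */
#include <stdint.h>
#include <stddef.h>

struct bo {
    int refcount;
    uint32_t last_seqno;   /* seqno of the last batch using this bo */
    struct bo *next;       /* link in the engine's active list */
};

/* Walk the active list (oldest first) and unreference every buffer
 * whose last batch has completed; returns how many were retired. */
int retire_active_list(struct bo **head, uint32_t completed_seqno)
{
    int retired = 0;
    while (*head && (int32_t)(completed_seqno - (*head)->last_seqno) >= 0) {
        struct bo *obj = *head;
        *head = obj->next;
        obj->refcount--;   /* drop the active-list reference */
        obj->next = NULL;
        retired++;
    }
    return retired;
}
```

Because the list is submission-ordered, the walk can stop at the first still-busy buffer instead of scanning everything.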

To avoid polling in userspace, the kernel also provides interfaces for userspace to wait until rendering completes on a given buffer object: The wait_timeout ioctl simply waits until the gpu is done with an object (optionally with a timeout). The set_domain ioctl doesn't have a timeout, but additionally takes a flag to indicate whether userspace only wants to read, or whether it also wants to write. Furthermore set_domain also ensures that cpu caches are coherent, but we will look at that little issue later on. The set_domain ioctl doesn't wait for all gpu usage to complete if userspace only wants to read the buffer. In that case it only waits for all outstanding gpu writes - the kernel knows this thanks to the separate read/write domains in the relocation entries, and keeps track of both the last gpu write and the last gpu read by remembering the seqnos of the respective batches.
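The read-versus-write wait decision boils down to picking which tracked seqno to wait on. A sketch, assuming (as in the driver) that every gpu access, including writes, updates the last-read/last-use seqno, so waiting for it covers all outstanding access; field and function names are illustrative:

```c
/* Sketch of the set_domain wait decision: a reader only needs to wait
 * for the last gpu *write*, while a writer must wait for all gpu
 * access. Assumes writes are also tracked as uses, so last_read_seqno
 * is always at or after last_write_seqno. Names are illustrative. */
#include <stdint.h>

struct bo_tracking {
    uint32_t last_read_seqno;  /* last batch accessing the bo at all */
    uint32_t last_write_seqno; /* last batch writing the bo */
};

/* Returns the seqno to wait for before the cpu may access the bo. */
uint32_t seqno_to_wait_for(const struct bo_tracking *bo, int cpu_wants_write)
{
    if (cpu_wants_write)
        return bo->last_read_seqno;  /* must wait for all gpu access */
    return bo->last_write_seqno;     /* readers only wait for writes */
}
```

This is why a compositor that merely samples from a client buffer doesn't stall on the client's subsequent read-only use of it.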

The kernel also supports a busy ioctl to simply inquire whether a buffer is still being used by the gpu. This recently gained the ability to tell userspace which gpu engine an object is busy on - which is useful for compositors that get buffer objects from clients, so they can decide which engine is the most suitable one if a given operation can be done with more than one engine (pretty much all of them can be coaxed into copying data).
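From userspace, the result might be unpacked along these lines. The exact bit layout here is illustrative, not a documented ABI guarantee: I'm assuming a low "active" bit plus an engine identifier packed into the upper half of the returned value.

```c
/* Sketch of unpacking a busy-ioctl style result: the low bit flags
 * activity, the upper half carries an engine id. Encoding is an
 * assumption for illustration, not the exact i915 ABI. */
#include <stdint.h>

#define BUSY_ACTIVE     0x1u
#define BUSY_RING_SHIFT 16

int is_busy(uint32_t busy)         { return (busy & BUSY_ACTIVE) != 0; }
unsigned busy_ring(uint32_t busy)  { return busy >> BUSY_RING_SHIFT; }
```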

With that we have gpu/cpu synchronization covered. But as just mentioned above, the gpu itself also has multiple engines (at least on newer platforms) which can run in parallel. So we need to have some means of synchronization between them. To do that the kernel not only keeps track of the seqno of the last batch an object has been used for, but also of the engine (commonly just called ring in the kernel, since that's what the kernel really cares about).

If a batchbuffer then uses an object which is still busy on a different engine, the kernel inserts a synchronization point: Either by employing so-called hardware semaphores, which wait for the correct seqno to appear much like the kernel does when waiting for the gpu, only using internal registers instead of the status page. Or, if that's disabled, simply by blocking in the kernel until rendering completes. To avoid inserting too many synchronization points, the kernel also keeps track of the last synchronization point for each ring. For ring/ring synchronization we don't track read domains separately, at least not yet.
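The "avoid redundant sync points" bookkeeping can be sketched as a small per-ring-pair table, in the spirit of the driver's per-ring sync seqno tracking (the table and function names are simplified for illustration):

```c
/* Sketch of inter-engine sync bookkeeping: before engine `to` may use
 * a buffer still busy on engine `from`, it must sync up to that
 * buffer's seqno; if it already synced past it, the semaphore (or cpu
 * wait) can be skipped. Names and layout are illustrative. */
#include <stdint.h>

#define NUM_RINGS 3

/* last seqno of `from` that `to` has already synchronized with */
static uint32_t sync_seqno[NUM_RINGS][NUM_RINGS];

/* Returns 1 if a new synchronization point must be emitted, and
 * records the new high-water mark; 0 if the existing sync suffices. */
int ring_sync_needed(int to, int from, uint32_t seqno)
{
    if ((int32_t)(sync_seqno[to][from] - seqno) >= 0)
        return 0; /* already waited for an equal or later seqno */
    sync_seqno[to][from] = seqno;
    return 1;
}
```

Since waiting for seqno N implies all earlier work on that ring is done too, one sync point per ring pair covers every older dependency for free.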

Note that a big difference of GEM compared to a lot of other gpu execution management frameworks and kernel drivers is that GEM does not expose explicit sync objects/gpu fences to userspace. A synchronization point is always only implicitly attached to a buffer object, which is the only thing userspace deals with. In practice the difference is not big, since userspace can have equally fine-grained control over synchronization by holding onto all the batchbuffer objects - keeping them around until the gpu is done with them won't waste any memory anyway. But the big upside is that when sharing buffers across processes, e.g. with DRI2 on X or generally when using a compositor, there's no need to also share a sync object: Any required sync state comes attached to the buffer object, and the kernel simply Does The Right Thing.

This concludes the part about command submission, active object retiring and synchronization. In the next installment we will take a closer look at how the kernel manages the GTT, what happens when we run out of space in the GTT and how the i915.ko currently handles out-of-memory conditions.


  1. I've always wondered - why relocations? Or more specifically, is there a plan to switch to true paging?

    As I understand, OpenCL makes relocations impossible anyway (since one can cast a pointer to integer and do all kinds of tricks with it).

    1. Well, a bunch of reasons: First, you'd need pagefault support to do real paging (since otherwise userspace still needs to supply a list of buffers), and current hardware just can't. Then there's the issue that 2GB of GTT isn't really a sensible amount of address space in today's world with 8GB of main memory. Much better than the 256MB of just a few generations ago, but still. And last but not least, current hardware is still rather disastrous at address space switching, performance-wise (and only Ivybridge actually supports more than one address space), so the added incentive of using real per-process address spaces with better insulation just isn't there.

      Wrt OpenCL I'm by far no expert, but I think OpenCL 1 still works perfectly well in a relocatable world. OpenCL 2 seems to be different, but on Intel hardware that's still in the distant future. At least Haswell has pretty much the same GTT model as Ivybridge, so I don't see a good way to get rid of relocations soon.

      Last but not least, I think we can squeeze some more juice out of relocations; we haven't yet explored all the ways to micro-optimize them. And given that simple things like batch-processing relocations in groups of 64 seem to have helped tremendously with relocation throughput, I guess there is some other low-hanging fruit still around.

    2. In defense of this strategy, relocation tables are perfectly fine for GPUs IMO. Though relocation tables are best implemented using segments, not page tables, since using a single address space leaves you vulnerable to security issues across separate processes.

      Even when you have a fast-switching MMU and/or a 40-bit address space, I would generally vouch for using a single address space for graphics (at least). The reason being, page tables require a lot of memory of their own, especially when you have more than 2 levels of page tables. Consider storing 3 512K textures in 3 separate address spaces. You would allocate first-, second- and third-level page directories (tables) for each address space. With a single address space you do those 3 levels of allocation just once.

  2. What's the command buffer submission granularity here? Draw call boundary, or render pass boundary, or whenever the user buffer fills up? Can you have render-to-texture dependencies within the same command submission?

    1. The granularity is a driver heuristic that tries to trade off latency (i.e. flushing the command buffer often) against throughput. It's at least one drawcall, but could be a lot more than a render pass.

      The command buffer gets flushed only on fence waits, glFlush or when the cpu reads back from a render buffer. For render to texture the driver internally keeps track of renderbuffers and inserts a pipeline cache flush within the current command buffer for coherency (so it doesn't otherwise stall, nor force a full command buffer flush).

    2. Got it.

      >> and inserts a pipeline cache flush within the current command
      >> buffer for coherency

      Can I press for a bit more detail here, if you are comfortable sharing this level of HW detail?

      So render-to-texture or transform feedback set up a producer/consumer relationship between draw calls. There is a cache visibility issue (as you pointed out), and there is a stall that needs to be implemented somewhere - essentially to prevent the shader threads of the second draw call from starting before the shader threads of the first have completed and written out to whatever the point of coherence is.

      My question being this, at what point in the pipeline is this being enforced. Is it the Command Streamer that enforces the cache_flush + stall because of some commands you have encoded, is it a Firmware layer in there somewhere doing this or do the EUs have a mechanism to mark certain threads as blocked until signaled otherwise.

    3. Hey, we have a full open-source driver with public docs. It's all there, so I have no problem explaining how it works ;-)

      The flush happens at the command parser level, but is at least to some extent (the newer your chip, the more) internally pipelined and asynchronous. If you want to see the details, look at the PIPE_CONTROL command in the public PRM:

  3. I am curious how Pixel Sync works for Haswell

    Are there smarts in the EU to, say, have lane 2 out of the 8 SIMD lanes be masked out (similar to what branches in a shader would do) to prevent concurrent execution with another fragment at the same pixel from a different primitive?

    This would require elaborate Scoreboarding or per fragment resource tracking in Hw.

    Or is there a big hammer synchronization at the top of the flight, where primitives are fragment shaded one at a time?