i915/GEM Q&A
So apparently people do indeed read my i915/GEM crashcourse, and a bunch of follow-up questions popped up in private mails. Since I’m a lazy bastard I’ve cleaned up some of the common questions and answers so that I can easily point at them. Hopefully they also help someone else clarify things a bit.
Question: What’s the significance of i915_gem_sw_finish_ioctl? It seems to flush cpu caches, but only if obj->pin_count != 0. Why does it not unconditionally flush the cpu caches, like e.g. when we move an unsnooped/not LLC-cached object into a gpu domain?
Answer: i915_gem_sw_finish_ioctl is only used to flush out cpu rendering to the display (and in current userspace it’s not used at all). obj->pin_count != 0 is used as a proxy for “this is a scanout buffer”. Obviously more intelligent userspace should know whether it is doing cpu rendering to a displayed buffer or not, and force the expensive clflushing with e.g. the set_domain ioctl only when really required. But the sw_finish ioctl is called from the libdrm cpu mmap unmap function, which does not have this knowledge at hand, hence the check in the kernel. Furthermore, for efficient integration of cpu rendering into the gpu render pipeline we want to use snoopable objects even on non-LLC platforms, which means that this ioctl shouldn’t really be used any more in new code.
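To make the pin_count proxy concrete, here is a minimal userspace sketch of the check described above. It is not the driver code: the struct, the flush counter and all sketch_* names are made up for illustration, and the real clflush is replaced by a counter.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the driver's buffer object. */
struct sketch_finish_obj {
    int pin_count;     /* non-zero is treated as "probably a scanout buffer" */
    int clflush_count; /* counts how often we paid for a cache flush */
};

/* Stand-in for the expensive clflush of the object's backing pages. */
static void sketch_finish_clflush(struct sketch_finish_obj *obj)
{
    obj->clflush_count++;
}

/*
 * Models the sw_finish check: only flush when the object looks like it is
 * pinned for display. Returns true if a flush happened.
 */
static bool sketch_sw_finish(struct sketch_finish_obj *obj)
{
    if (obj->pin_count == 0)
        return false; /* not a scanout candidate, skip the clflush */

    sketch_finish_clflush(obj);
    return true;
}
```

An unpinned object goes through sketch_sw_finish without any flush, which is exactly why userspace that really needs coherency has to ask for it explicitly.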
Question: So the cpu can only access a GEM object through the GTT when it’s in the mappable part of the GTT, i.e. when gtt_offset + size <= gtt_mappable_end. But the i915_gem_object_set_to_gtt_domain function does not check whether this condition is satisfied and simply goes ahead with the domain change. Why is that, even though the cpu won’t be able to access the buffer object at its current place?
Answer: The GTT domain is purely about coherency, i.e. a buffer object is in the GTT domain if reads/writes through the GTT would see the correct values. The other big domain is the cpu domain, i.e. the data (when accessed directly in the physical memory location, not going through the GTT) is coherent with cpu caches. Shifting between these two domains requires flushing/invalidating cpu caches.
Note that on recent kernels being in the GTT domain doesn’t even mean that there’s a global GTT mapping allocated for that buffer object. This is used to optimize away redundant cache flushing when moving an object around, e.g. when moving it into the mappable range to serve a cpu access page fault. In the future this will be even more common once we have proper per-process GTT address spaces. Then an object could be fully coherent with the GTT domain and read by the gpu through a PPGTT mapping, but not have an offset allocated for it in the global GTT at all.
The mappable GTT address range on the other hand is a different concept and simply means the object has a GTT mapping visible to the cpu (on gpus without PPGTT the global GTT can be up to 2g, but only 256m are usually visible in the pci bar). Note that a GEM object can be mappable and at the same time be in the cpu domain. This happens when userspace writes to the buffer object through the cpu mappings.
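The independence of the two concepts can be sketched in a few lines of plain C. This is only an illustration under my own assumptions: the struct layout, the has_gtt_offset flag and all sketch_* names are hypothetical and only mirror the gtt_offset + size <= gtt_mappable_end condition from the question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical coherency domains, reduced to the two discussed here. */
enum sketch_domain { SKETCH_DOMAIN_CPU, SKETCH_DOMAIN_GTT };

/* Hypothetical, simplified buffer object. */
struct sketch_gtt_obj {
    bool has_gtt_offset;       /* false: no global GTT mapping allocated */
    uint64_t gtt_offset;
    uint64_t size;
    enum sketch_domain domain; /* tracks coherency only, not placement */
};

/* Placement question: is the object inside the cpu-visible GTT range? */
static bool sketch_is_mappable(const struct sketch_gtt_obj *obj,
                               uint64_t gtt_mappable_end)
{
    return obj->has_gtt_offset &&
           obj->gtt_offset + obj->size <= gtt_mappable_end;
}

/* Coherency question: the domain change never looks at the placement. */
static void sketch_set_to_gtt_domain(struct sketch_gtt_obj *obj)
{
    /* The real code would flush/invalidate cpu caches here if needed. */
    obj->domain = SKETCH_DOMAIN_GTT;
}
```

With these definitions all combinations are possible: an object above the mappable end can be in the GTT domain, and an object inside the mappable range can sit in the cpu domain.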
Question: How does the i915_gem_fault function handle a page fault when it is itself invoked through a page fault in the i915 GEM kernel code? Suppose for example that the fault_in_pages_readable function is called, which dereferences a user pointer - won’t that cause issues with deadlocks?
Answer: Yes, this can happen and we need to be careful that we cannot possibly deadlock with our own pagefault handlers. And it’s not just theoretical, it happens in the wild when a GL client tries to use a pointer obtained from one of the texture mapping functions (which can use a GTT memory mapping internally) to upload data (which could use the pwrite GEM ioctl).
These potential deadlocks are resolved by instructing the linux memory subsystem to not serve pagefaults when accessing userspace memory, but to fail the access instead. Then our code can release any resources and locks required by our own page fault handler and retry the operation in a slowpath. Often this requires that we copy the data into an (unfaultable) temporary buffer in the kernel’s memory space. These atomic sections are often implicit, but we have a few places where we need to explicitly disable the page fault handler with pagefault_disable/enable() calls.
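The fastpath/slowpath dance is easier to see in code. The following is a userspace model only, under loud assumptions: there is no real locking or page faulting, the lock is a bool, "user page not resident" is a flag, and every pf_sketch_* name is invented for this example. It just shows the ordering of the steps.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model state: one driver lock and one user page. */
struct pf_sketch_state {
    bool lock_held;          /* lock our own fault handler would also need */
    bool user_page_resident; /* false: touching user memory would fault */
    int slowpath_count;      /* how often we had to fall back */
};

/* Models a copy with pagefaults disabled: fails instead of faulting.
 * Returns the number of bytes NOT copied. */
static size_t pf_sketch_copy_atomic(struct pf_sketch_state *st, void *dst,
                                    const void *src, size_t len)
{
    if (!st->user_page_resident)
        return len;
    memcpy(dst, src, len);
    return 0;
}

/* Models a normal, faultable copy. Faulting with the lock held is the
 * deadlock we want to avoid, reported here as -EDEADLK. */
static int pf_sketch_copy_faultable(struct pf_sketch_state *st, void *dst,
                                    const void *src, size_t len)
{
    if (st->lock_held)
        return -EDEADLK;
    st->user_page_resident = true; /* the fault got served */
    memcpy(dst, src, len);
    return 0;
}

/* Models a pwrite-like upload into object memory. */
static int pf_sketch_pwrite(struct pf_sketch_state *st, void *obj_mem,
                            const void *user, size_t len)
{
    void *tmp;
    int ret;

    st->lock_held = true;
    if (pf_sketch_copy_atomic(st, obj_mem, user, len) == 0) {
        st->lock_held = false; /* fastpath succeeded */
        return 0;
    }

    /* Slowpath: drop the lock, copy into a temporary kernel-side buffer. */
    st->lock_held = false;
    st->slowpath_count++;
    tmp = malloc(len);
    if (!tmp)
        return -ENOMEM;
    ret = pf_sketch_copy_faultable(st, tmp, user, len);
    if (ret == 0) {
        st->lock_held = true; /* retake the lock, tmp cannot fault */
        memcpy(obj_mem, tmp, len);
        st->lock_held = false;
    }
    free(tmp);
    return ret;
}
```

The important property is that the only faultable access happens while the lock is dropped, and the copy done under the lock only touches memory that cannot fault.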
Question: Is obj->fenced_gpu_access ever set on modern platforms - it seems not? Or could this cause a stall waiting for the gpu when all fences are in use and we need a free fence to handle a GTT page fault?
Answer: No, this is only ever set on Gen2/3 devices. Those gpus use the same GTT fences that all platforms use for detiling cpu access also for gpu access, at least for some gpu rendering functions. So this is irrelevant on modern platforms and can’t lead to a stall in the pagefault handler when accessing an otherwise idle buffer object.
Question: What is this
Answer: This is part of the gpu hang detection and reset handling code, which I didn’t really cover in my crashcourse. It is set when we’ve detected a hang but failed to reset the gpu. It will cause all subsequent command submission from userspace to fail with -EIO, which is used by userspace as a signal to fall back to software rendering. The i915 hang detection and reset code has been (and still is) under pretty active development and is nowadays a rather complex piece of code. I plan to cover it more in-depth hopefully soon.
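As a rough sketch of the contract described above, with made-up names throughout (hang_sketch_dev, reset_failed and both functions are hypothetical and not the driver's or any userspace library's API):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state: set once a hang was detected and the
 * reset attempt failed. */
struct hang_sketch_dev {
    bool reset_failed;
};

/* Kernel side of the model: refuse all further command submission. */
static int hang_sketch_submit(const struct hang_sketch_dev *dev)
{
    if (dev->reset_failed)
        return -EIO;
    return 0; /* the real code would queue the batch here */
}

/* Userspace side of the model: -EIO is the cue to stop using the gpu. */
static bool hang_sketch_needs_sw_fallback(int submit_ret)
{
    return submit_ret == -EIO;
}
```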
Question: In the use_cpu_reloc function, why is the obj->cache_level != I915_CACHE_NONE condition used?
Answer: That’s just a crazy optimization - it’s always faster to write relocations through cpu maps if LLC caching is enabled. Without caching it’s faster to use global GTT access, but only if we already have the mappable mapping set up. Note that the pwrite ioctl code has similar tricks.
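The decision described in this answer boils down to something like the following sketch. It only encodes the two rules stated above and is not a copy of the kernel function: the enum, the struct and the mappable_gtt_ready flag are assumptions made up for the example.

```c
#include <stdbool.h>

/* Hypothetical cache levels, reduced to what matters for the decision. */
enum reloc_sketch_cache {
    RELOC_SKETCH_CACHE_NONE,
    RELOC_SKETCH_CACHE_LLC
};

/* Hypothetical, simplified buffer object. */
struct reloc_sketch_obj {
    enum reloc_sketch_cache cache_level;
    bool mappable_gtt_ready; /* mappable GTT mapping already set up */
};

/* Returns true if relocations should be written through a cpu map. */
static bool reloc_sketch_use_cpu(const struct reloc_sketch_obj *obj)
{
    /* Cached objects: cpu maps always win. */
    if (obj->cache_level != RELOC_SKETCH_CACHE_NONE)
        return true;

    /* Uncached: GTT writes win, but only if the mapping already exists. */
    return !obj->mappable_gtt_ready;
}
```

So the GTT path is only picked in one corner: an uncached object that already has its mappable mapping.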