<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>stuff by sima</title>
    <description>Simona Vetter&apos;s ramblings on code, bugs, graphics, hw and the utter lack of sanity in all this.
</description>
    <link>http://blog.ffwll.ch/</link>
    <atom:link href="http://blog.ffwll.ch/feeds/posts/default" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 07 Dec 2025 19:29:28 +0000</pubDate>
    <lastBuildDate>Sun, 07 Dec 2025 19:29:28 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Upstream, Why &amp; How</title>
        <description>&lt;p&gt;In a different epoch, before the pandemic, I’ve done a &lt;a href=&quot;/2019/05/upstream-first.html&quot;&gt;presentation about
upstream first&lt;/a&gt; at the Siemens Linux Community
Event 2018, where I’ve tried to explain the fundamentals of open source using
microeconomics. Unfortunately that talk didn’t work out too well with an
audience that isn’t well-versed in upstream and open source concepts, largely
because it was just too much material crammed into too little time.&lt;/p&gt;

&lt;p&gt;Last year I got the opportunity to try again at an Intel-internal event series,
and this time I split the material into two parts. I think that worked a lot
better. For obvious reasons I cannot publish the recordings, but I can publish
the slides.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/slides/intel-gdansk-2023.pdf&quot;&gt;first part “Upstream, Why?”&lt;/a&gt; covers a
few concepts from microeconomics 101, and then applies them to upstream
open source. The key concept is that open source achieves an
efficient software market in the microeconomic sense by driving margins and
prices to zero. The only ways to make money in such a market are to either
have more-or-less stable barriers to entry that prevent the efficient market
from forming and destroying all monetary value, or to sell a complementary
product.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/slides/intel-gdansk-2023-part2.pdf&quot;&gt;second part “Upstream, How?”&lt;/a&gt; then
looks at what this all means for the different stakeholders involved:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Individual engineers, who have skills and create a product with zero economic
value, and might still be stupid enough to try to build a career on that.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Upstream communities, often with a formal structure as a foundation, and what
exactly their goals should be to build a thriving upstream open source project
that can actually pay some bills, generate some revenue somewhere else and get
engineers paid. Because without that you’re not going to have much of a
project with a long-term future.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Engineering organizations, what exactly their incentives and goals should
be, and the fundamental conflicts of interest this causes. Specifically on
this I’ve only seen bad solutions and ugly solutions, but not yet a really
good one. A relevant pre-pandemic talk of mine on this topic is also
&lt;a href=&quot;/2019/12/upstream-too-little-too-late.html&quot;&gt;“Upstream Graphics: Too Little, Too
Late”&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;And finally the overall business, and more importantly, what kind of business
strategy is needed to really thrive with an open source upstream first
approach: You need to clearly understand which software market’s economic
value you want to destroy by driving margins and prices to zero, and which
complementary product you’re selling to still earn money.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At least judging by the feedback I’ve received internally, taking more time and
going a bit more in-depth on the various concepts worked much better than the
keynote presentation I did at Siemens, hence I decided to publish at
least the slides.&lt;/p&gt;
</description>
        <pubDate>Thu, 14 Mar 2024 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2024/03/upstream-why-how.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2024/03/upstream-why-how.html</guid>
        
        <category>Maintainer-Stuff</category>
        
        
      </item>
    
      <item>
        <title>EOSS Prague: Kernel Locking Engineering</title>
        <description>&lt;p&gt;EOSS in Prague was great, lots of hallway track, good talks, good food,
&lt;a href=&quot;https://www.meetea.cz/&quot;&gt;excellent tea at meetea&lt;/a&gt; - first time I had proper tea
in my life, quite an experience. And also my first talk since covid, packed room
with a standing audience, &lt;a href=&quot;https://events.linuxfoundation.org/wp-content/uploads/2023/07/EOSS-23-PostEventReport_072023.pdf&quot;&gt;apparently one of the top ten most attended talks per
LF’s conference
report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.youtube.com/watch?v=LPH3MUw9m0o&quot;&gt;video recording is now
uploaded&lt;/a&gt;, I’ve uploaded &lt;a href=&quot;/slides/elce-2023-locking.pdf&quot;&gt;the fixed
slides&lt;/a&gt;, including the missing slide that I
accidentally cut in a last-minute edit. It’s the same content as my blog posts
from last year, first talking about &lt;a href=&quot;/2022/07/locking-engineering.html&quot;&gt;locking engineering
principles&lt;/a&gt; and then &lt;a href=&quot;/2022/08/locking-hierarchy.html&quot;&gt;the hierarchy of
locking engineering patterns&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2023/07/eoss-prague-locking-engineering.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2023/07/eoss-prague-locking-engineering.html</guid>
        
        <category>Conferences</category>
        
        
      </item>
    
      <item>
        <title>Locking Engineering Hierarchy</title>
        <description>&lt;p&gt;The first part of this series covered &lt;a href=&quot;/2022/07/locking-engineering.html&quot;&gt;principles of locking
engineering&lt;/a&gt;. This part goes through a pile
of locking patterns and designs, from the most favourable and easiest to adjust,
and hence resulting in a long term maintainable code base, to the least
favourable, since it is hardest to ensure they work correctly and stay that way
while the code evolves. For convenience they are even color coded, with the
dangerous levels getting progressively more crispy red, indicating how close to
the burning fire you are! Think of it as Dante’s Inferno, but for locking.&lt;/p&gt;

&lt;p&gt;As a reminder from the intro of the first part, by locking engineering I mean
the art of ensuring that there’s sufficient consistency in reading and
manipulating data structures, and not just sprinkling &lt;code&gt;mutex_lock()&lt;/code&gt;
and &lt;code&gt;mutex_unlock()&lt;/code&gt; calls around until the result looks reasonable
and lockdep has gone quiet.
&lt;!--more--&gt;&lt;/p&gt;

&lt;h2 id=&quot;level-0-no-locking&quot;&gt;Level 0: No Locking&lt;/h2&gt;

&lt;p&gt;The dumbest possible locking is no need for locking at all. Which does not mean
extremely clever lockless tricks for a “look, no calls to
&lt;code&gt;mutex_lock()&lt;/code&gt;” feint, but an overall design which guarantees that
any writers cannot exist concurrently with any other access at all. This
removes the need for consistency guarantees while accessing an object at the
architectural level.&lt;/p&gt;

&lt;p&gt;There are a few standard patterns to achieve locking nirvana.&lt;/p&gt;

&lt;h3 id=&quot;locking-pattern-immutable-state&quot;&gt;Locking Pattern: Immutable State&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;The&lt;/em&gt; lesson in graphics API design over the last decade is that immutable state
objects rule, because they both lead to simpler driver stacks and also better
performance. Vulkan instead of OpenGL with its ridiculous amount of
mutable and implicit state is the big example, but atomic instead of legacy
kernel mode setting or Wayland instead of X11 are also built on the
assumption that immutable state objects are a Great Thing (tm).&lt;/p&gt;

&lt;p&gt;The usual pattern is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;A single thread fully constructs an object, including any sub structures and
anything else you might need. Often subsystems provide initialization helpers for
objects that drivers can subclass through embedding, e.g.
&lt;code&gt;drm_connector_init()&lt;/code&gt; for initializing a kernel modesetting output
object. Additional functions can set up different or optional aspects of an
object, e.g. &lt;code&gt;drm_connector_attach_encoder()&lt;/code&gt; sets up the invariant
links to the preceding element in a kernel modesetting display chain.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The fully formed object is published to the world, in the kernel this often
happens by registering it under some kind of identifier. This could be a global
identifier like &lt;code&gt;register_chrdev()&lt;/code&gt; for character devices, something attached to a device like
registering a new display output on a driver with
&lt;code&gt;drm_connector_register()&lt;/code&gt; or some &lt;code&gt;struct xarray&lt;/code&gt; in the
file private structure. Note that this step here requires memory barriers of
some sort. If you hand roll the data structure like a list or lookup
tree with your own fancy locking scheme instead of using existing standard
interfaces you are on a fast path to level 3 locking hell. Don’t do that.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;From this point on there are no consistency issues anymore and all threads
can access the object without any locking.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
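
&lt;p&gt;As an illustrative sketch of these three steps, with hypothetical driver
structures and not any real driver’s code:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;driver_output_init(dev)
{
	/* step 1: single-threaded construction, no locking needed */
	connector = kzalloc(sizeof(*connector), GFP_KERNEL);
	drm_connector_init(dev, &amp;amp;connector-&amp;gt;base, &amp;amp;funcs, type);
	drm_connector_attach_encoder(&amp;amp;connector-&amp;gt;base, encoder);

	/* step 2: publishing provides the needed memory barriers */
	drm_connector_register(&amp;amp;connector-&amp;gt;base);

	/* step 3: everyone can access the object, without any locking */
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;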

&lt;h3 id=&quot;locking-pattern-single-owner&quot;&gt;Locking Pattern: Single Owner&lt;/h3&gt;

&lt;p&gt;Another way to ensure there’s no concurrent access is by only allowing one
thread to own an object at a given point in time, with well-defined handover
points where necessary.&lt;/p&gt;

&lt;p&gt;Most often this pattern is used for asynchronously processing a userspace
request:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The syscall or IOCTL constructs an object with sufficient information to
process the userspace’s request.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;That object is handed over to a worker thread with e.g.
&lt;code&gt;queue_work()&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The worker thread is now the sole owner of that piece of memory and can do
whatever it feels like with it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Again the second step requires memory barriers, which means if you hand roll
your own lockless queue you’re firmly in level 3 territory and won’t get rid of
the burned in red hot afterglow in your retina for quite some time. Use standard
interfaces like &lt;code&gt;struct completion&lt;/code&gt; or even better libraries like the
workqueue subsystem here.&lt;/p&gt;
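
&lt;p&gt;A minimal sketch of such a handover, with hypothetical names, the only real
interfaces used being &lt;code&gt;INIT_WORK()&lt;/code&gt; and &lt;code&gt;queue_work()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ioctl_submit(file, args)
{
	req = kzalloc(sizeof(*req), GFP_KERNEL);
	/* fill in the request, req is still invisible to other threads */
	INIT_WORK(&amp;amp;req-&amp;gt;work, process_fn);

	/* handover, queue_work() provides the needed memory barriers */
	queue_work(wq, &amp;amp;req-&amp;gt;work);
}

process_fn(work)
{
	req = container_of(work, struct req, work);
	/* sole owner of req now, no locking needed, free when done */
	kfree(req);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;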

&lt;p&gt;Note that the handover can also be chained or split up, e.g. for nonblocking
atomic kernel modeset requests there are three asynchronous processing pieces
involved:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The main worker, which pushes the display state update to the hardware and
which is enqueued with &lt;code&gt;queue_work()&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The userspace completion event handling built around &lt;code&gt;struct
drm_pending_event&lt;/code&gt; and generally handed off to the interrupt handler of
the driver from the main worker and processed in the interrupt handler.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The cleanup of the no longer used old scanout buffers from the preceding
update. The synchronization between the preceding update and the cleanup is
done through &lt;code&gt;struct completion&lt;/code&gt; to ensure that there’s only ever a
single worker which owns a state structure and is allowed to change it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;locking-pattern-reference-counting&quot;&gt;Locking Pattern: Reference Counting&lt;/h3&gt;

&lt;p&gt;Users generally don’t appreciate if the kernel leaks memory too much, and
cleaning up objects by freeing their memory and releasing any other resources
tends to be an operation of the very much mutable kind. Reference counting to
the rescue!&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Every pointer to the reference counted object must guarantee that a reference
exists for as long as the pointer is in use. Usually that’s done by calling
&lt;code&gt;kref_get()&lt;/code&gt; when making a copy of the pointer, but implied
references by e.g. continuing to hold a lock that protects a different pointer
are often good enough too for a temporary pointer.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The cleanup code runs when the last reference is released with
&lt;code&gt;kref_put()&lt;/code&gt;. Note that this again requires memory barriers to work
correctly, which means if you’re not using &lt;code&gt;struct kref&lt;/code&gt; then it’s
safe to assume you’ve screwed up.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
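
&lt;p&gt;A sketch of the pattern with &lt;code&gt;struct kref&lt;/code&gt;, for a hypothetical
object:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;obj_get(obj)
{
	/* a reference must already be held when copying the pointer */
	kref_get(&amp;amp;obj-&amp;gt;kref);
	return obj;
}

obj_put(obj)
{
	/* the release function runs exactly once, after the last put,
	 * with the necessary memory barriers provided by struct kref */
	kref_put(&amp;amp;obj-&amp;gt;kref, obj_release);
}

obj_release(kref)
{
	obj = container_of(kref, struct obj, kref);
	/* no concurrent access possible anymore, cleanup without locks */
	kfree(obj);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;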

&lt;p&gt;Note that this scheme falls apart when released objects are put into some kind
of cache and can be resurrected. In that case your cleanup code needs to somehow
deal with these zombies and ensure there’s no confusion, and vice versa any code
that resurrects a zombie needs to deal with the wooden spikes the cleanup code might
throw at an inopportune time. The worst example of this kind is
&lt;code&gt;SLAB_TYPESAFE_BY_RCU&lt;/code&gt;, where readers that are only protected with
&lt;code&gt;rcu_read_lock()&lt;/code&gt; may need to deal with objects potentially going
through simultaneous zombie resurrections, potentially multiple times, while
the readers are trying to figure out what is going on. This generally leads to
lots of sorrow, wailing and ill-tempered maintainers, as the GPU subsystem
has experienced, and continues to experience, with &lt;code&gt;struct dma_fence&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Hence use standard reference counting, and don’t be tempted by the siren song
of implementing clever caching of any kind.&lt;/p&gt;

&lt;h2 id=&quot;level-1-big-dumb-lock&quot;&gt;Level 1: Big Dumb Lock&lt;/h2&gt;

&lt;p&gt;It would be great if nothing ever changes, but sometimes that cannot be avoided.
At that point you add a single lock for each logical object. An object could be
just a single structure, but it could also be multiple structures that are
dynamically allocated and freed under the protection of that single big dumb
lock, e.g. when managing GPU virtual address space with different mappings.&lt;/p&gt;
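
&lt;p&gt;Sketched for the GPU virtual address space example, with hypothetical names,
the single lock covers the entire logical object:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vm_map_range(vm, va, size)
{
	vma = kzalloc(sizeof(*vma), GFP_KERNEL);

	mutex_lock(&amp;amp;vm-&amp;gt;lock);
	/* the one big dumb lock protects the tree and all the
	 * dynamically allocated mappings hanging off it */
	insert_mapping(&amp;amp;vm-&amp;gt;va_tree, vma, va, size);
	mutex_unlock(&amp;amp;vm-&amp;gt;lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;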

&lt;p&gt;The tricky part is figuring out what is an object to ensure that your lock is
neither too big nor too small:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;If you make your lock too big you run the risk of creating a dreaded subsystem
lock, or violating the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;“Protect Data, not
Code”&lt;/a&gt; principle in
some other way. Split your locking further so that a single lock really only
protects a single object, and not a random collection of unrelated ones. So
one lock per device instance, not one lock for all the device instances in a
driver or worse in an entire subsystem.&lt;/p&gt;

    &lt;p&gt;The trouble is that once a lock is too big and has firmly moved into “protects
some vague collection of code” territory, it’s very hard to get out of that
hole.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Different problems strike when the locking scheme is too fine-grained, e.g. in
the GPU virtual memory management example when every address mapping in the
big vma tree has its own private lock. Or when a structure has a lot of
different locks for different member fields.&lt;/p&gt;

    &lt;p&gt;One issue is that locks aren’t free, the overhead of fine-grained locking can
seriously hurt, especially when common operations have to take most of the
locks anyway and so there’s no chance of any concurrency benefit. Furthermore
fine-grained locking leads to the temptation of solving locking overhead with
ever more clever lockless tricks, instead of radically simplifying the
design.&lt;/p&gt;

    &lt;p&gt;The other issue is that more locks improve the odds for locking inversions,
and those can be tough nuts to crack. Again trying to solve this with more
lockless tricks to avoid inversions is tempting, and again in most cases the
wrong approach.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ideally, your big dumb lock would always be right-sized every time the
requirements on the data structures change. But working magic 8 balls tend to be
in short supply, and you tend to only find out that your guess was wrong when
the pain of the lock being too big or too small is already substantial. The
inherent struggles of resizing a lock as the code evolves then keep pushing you
further away from the optimum instead of closer. Good luck!&lt;/p&gt;

&lt;h2 style=&quot;background:yellow;&quot;&gt; Level 2: Fine-grained Locking&lt;/h2&gt;

&lt;p&gt;It would be great if this is all the locking we ever need, but sometimes there
are functional reasons that force us to go beyond the single lock for each logical
object approach. This section will go through a few of the common examples, and
the usual pitfalls to avoid.&lt;/p&gt;

&lt;p&gt;But before we delve into details, remember to document the locking rules in
kerneldoc with the inline per-member comment style once you go beyond a simple
single lock per object approach. It’s the best place for future bug fixers and
reviewers - meaning you - to find the rules for how things were at least meant
to work.&lt;/p&gt;
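
&lt;p&gt;A sketch of what that looks like, for a hypothetical structure:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;struct foo_device {
	/**
	 * @lock: Protects @state and @mapping_list, but not
	 * @irq_state, which is only accessed by the interrupt handler.
	 */
	struct mutex lock;

	/** @state: Hardware state mirror, protected by @lock. */
	struct foo_state state;
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;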

&lt;h3 id=&quot;locking-pattern-object-tracking-lists&quot;&gt;Locking Pattern: Object Tracking Lists&lt;/h3&gt;

&lt;p&gt;One of the main duties of the kernel is to track everything, not least to make sure
there are no leaks and everything gets cleaned up again. But there are other reasons
to maintain lists (or other container structures) of objects.&lt;/p&gt;

&lt;p&gt;Now sometimes there’s a clear parent object, with its own lock, which could also
protect the list with all the objects, but this does not always work:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;It might force the lock of the parent object to essentially become a subsystem
lock and so protect much more than it should when following the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;“Protect Data, not
Code”&lt;/a&gt; principle. In
that case it’s better to have a separate (spin-)lock just for the list to be
able to clearly untangle what the parent and subordinate object’s lock each
protect.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Different code paths might need to walk and possibly manipulate the list both
from the container object and contained object, which would lead to locking
inversion if the list isn’t protected by its own stand-alone (nested) lock.
This tends to especially happen when an object can be attached to multiple
other objects, like a GPU buffer object can be mapped into multiple GPU
virtual address spaces of different processes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The constraints of calling contexts for adding or removing objects from the
list could be different from, and incompatible with, the requirements when walking
the list itself. The main example here are LRU lists, where the shrinker needs
to be able to walk the list from reclaim context, whereas the superior object
locks often have a need to allocate memory while holding each lock. Those
object locks the shrinker can then only trylock, which is generally good
enough, but only being able to trylock the LRU list lock itself is not.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simplicity should still win, therefore only add a (nested) lock for lists or
other container objects if there’s really no suitable object lock that could do
the job instead.&lt;/p&gt;
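
&lt;p&gt;A sketch of such a separate list lock, using the GPU buffer object example
with hypothetical names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vm_map_bo(vm, bo, vma)
{
	/* the list has its own spinlock, nested within the object locks
	 * the callers hold, so walking and manipulating the list works
	 * from both the vm side and the bo side without inversion */
	spin_lock(&amp;amp;bo-&amp;gt;vma_list_lock);
	list_add(&amp;amp;vma-&amp;gt;bo_link, &amp;amp;bo-&amp;gt;vma_list);
	spin_unlock(&amp;amp;bo-&amp;gt;vma_list_lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;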

&lt;h3 id=&quot;locking-pattern-interrupt-handler-state&quot;&gt;Locking Pattern: Interrupt Handler State&lt;/h3&gt;

&lt;p&gt;Another example that requires nested locking is when part of the object is
manipulated from a different execution context. The prime example here are
interrupt handlers. Interrupt handlers can only use interrupt safe spinlocks,
but often the main object lock must be a mutex to allow sleeping or allocating
memory or nesting with other mutexes.&lt;/p&gt;

&lt;p&gt;Hence the need for a nested spinlock to just protect the object state shared
between the interrupt handler and code running from process context. Process
context should generally only acquire the spinlock nested with the main object
lock, to avoid surprises and limit any concurrency issues to just the singleton
interrupt handler.&lt;/p&gt;
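
&lt;p&gt;A sketch of this nesting, with hypothetical names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;foo_irq_handler(irq, data)
{
	foo = data;

	/* only the state shared with process context is under irq_lock */
	spin_lock(&amp;amp;foo-&amp;gt;irq_lock);
	foo-&amp;gt;irq_status = read_hw_status(foo);
	spin_unlock(&amp;amp;foo-&amp;gt;irq_lock);
}

foo_update(foo)
{
	/* main object lock first, it is a mutex and can sleep */
	mutex_lock(&amp;amp;foo-&amp;gt;lock);

	/* nested interrupt safe spinlock just for the shared state */
	spin_lock_irq(&amp;amp;foo-&amp;gt;irq_lock);
	/* ... manipulate the state shared with the irq handler ... */
	spin_unlock_irq(&amp;amp;foo-&amp;gt;irq_lock);

	mutex_unlock(&amp;amp;foo-&amp;gt;lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;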

&lt;h3 id=&quot;locking-pattern-async-processing&quot;&gt;Locking Pattern: Async Processing&lt;/h3&gt;

&lt;p&gt;Very similar to the interrupt handler problems is coordination with async
workers. The best approach is the &lt;a href=&quot;#locking-pattern-single-owner&quot;&gt;single owner
pattern&lt;/a&gt;, but often state needs to be shared
between the worker and other threads operating on the same object.&lt;/p&gt;

&lt;p&gt;The naive approach of just using a single object lock tends to deadlock:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;start_processing(obj)
{
	mutex_lock(&amp;amp;obj-&amp;gt;lock);
	/* set up the data for the async work */;
	schedule_work(&amp;amp;obj-&amp;gt;work);
	mutex_unlock(&amp;amp;obj-&amp;gt;lock);
}

stop_processing(obj)
{
	mutex_lock(&amp;amp;obj-&amp;gt;lock);
	/* clear the data for the async work */;
	cancel_work_sync(&amp;amp;obj-&amp;gt;work);
	mutex_unlock(&amp;amp;obj-&amp;gt;lock);
}

work_fn(work)
{
	obj = container_of(work, struct obj, work);

	mutex_lock(&amp;amp;obj-&amp;gt;lock);
	/* do some processing */
	mutex_unlock(&amp;amp;obj-&amp;gt;lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Do not worry if you don’t spot the deadlock, because it is a cross-release
dependency between the entire &lt;code&gt;work_fn()&lt;/code&gt; and
&lt;code&gt;cancel_work_sync()&lt;/code&gt;, and these are a lot trickier to spot. Since
cross-release dependencies are an entire huge topic on their own I won’t go into
more details, a good starting point is &lt;a href=&quot;https://lwn.net/Articles/709849/&quot;&gt;this LWN
article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There’s a bunch of variations of this theme, with problems in different
scenarios:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Replacing the &lt;code&gt;cancel_work_sync()&lt;/code&gt; with &lt;code&gt;cancel_work()&lt;/code&gt;
avoids the deadlock, but often means the &lt;code&gt;work_fn()&lt;/code&gt; is prone to
use-after-free issues.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Calling &lt;code&gt;cancel_work_sync()&lt;/code&gt; before taking the mutex can work in
some cases, but falls apart when the work is self-rearming. Or maybe the
work or overall object isn’t guaranteed to exist without holding its lock,
e.g. if this is part of an async processing queue for a parent structure.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Cancelling the work after the call to &lt;code&gt;mutex_unlock()&lt;/code&gt; might race
with concurrent restarting of the work and upset the bookkeeping.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Like with interrupt handlers the clean solution tends to be an additional nested
lock which protects just the mutable state shared with the work function and
nests within the main object lock. That way work can be cancelled while the main
object lock is held, which avoids a ton of races. But without holding the
sublock that &lt;code&gt;work_fn()&lt;/code&gt; needs, which avoids the deadlock.&lt;/p&gt;
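
&lt;p&gt;Applied to the earlier example this could look like the following sketch,
with a new &lt;code&gt;work_lock&lt;/code&gt; spinlock nested within &lt;code&gt;obj-&amp;gt;lock&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;stop_processing(obj)
{
	mutex_lock(&amp;amp;obj-&amp;gt;lock);

	spin_lock(&amp;amp;obj-&amp;gt;work_lock);
	/* clear the data for the async work */;
	spin_unlock(&amp;amp;obj-&amp;gt;work_lock);

	/* no deadlock: work_fn() only takes work_lock, never obj-&amp;gt;lock */
	cancel_work_sync(&amp;amp;obj-&amp;gt;work);

	mutex_unlock(&amp;amp;obj-&amp;gt;lock);
}

work_fn(work)
{
	obj = container_of(work, struct obj, work);

	spin_lock(&amp;amp;obj-&amp;gt;work_lock);
	/* only touch the state shared with the other threads */
	spin_unlock(&amp;amp;obj-&amp;gt;work_lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;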

&lt;p&gt;Note that in some cases the superior lock doesn’t need to exist, e.g.
&lt;code&gt;struct drm_connector_state&lt;/code&gt; is protected by the &lt;a href=&quot;#locking-pattern-single-owner&quot;&gt;single
owner pattern&lt;/a&gt;, but drivers might have some need
for some further decoupled asynchronous processing, e.g. for handling the
content protection or link training machinery. In that case only the sublock for
the mutable driver private state shared with the worker exists.&lt;/p&gt;

&lt;h3 id=&quot;locking-pattern-weak-references&quot;&gt;Locking Pattern: Weak References&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;#locking-pattern-reference-counting&quot;&gt;Reference counting&lt;/a&gt; is a great pattern, but
sometimes you need to be able to store pointers without them holding a full
reference. This could be for lookup caches, or because your userspace API
mandates that some references do not keep the object alive - we’ve unfortunately
committed that mistake in the GPU world. Or because holding full references
everywhere would lead to unreclaimable reference loops and there’s no better
way to break them than to make some of the references weak. In languages with a
garbage collector weak references are implemented by the runtime, and so are no
real worry. But in the kernel the concept has to be implemented by hand.&lt;/p&gt;

&lt;p&gt;Since weak references are such a standard pattern &lt;code&gt;struct kref&lt;/code&gt; has
ready-made support for them. The simple approach is using
&lt;code&gt;kref_put_mutex()&lt;/code&gt; with the same lock that also protects the
structure containing the weak reference. This guarantees that either the weak
reference pointer is gone too, or there is at least somewhere still a strong
reference around and it is therefore safe to call &lt;code&gt;kref_get()&lt;/code&gt;. But
there are some issues with this approach:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;It doesn’t compose to multiple weak references, at least if they are protected
by different locks - all the locks need to be taken before the final
&lt;code&gt;kref_put()&lt;/code&gt; is called, which means minimally some pain with lock
nesting and you get to hand-roll it all to boot.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The mutex required to be held during the final put is the one which protects
the structure with the weak reference, and often has little to do with the
object that’s being destroyed. So a pretty nasty violation of the &lt;a href=&quot;#level-1-big-dumb-lock&quot;&gt;big dumb
lock pattern&lt;/a&gt;. Furthermore the lock is held
over the entire cleanup function, which defeats the point of the &lt;a href=&quot;#locking-pattern-reference-counting&quot;&gt;reference
counting pattern&lt;/a&gt;, which is meant to
enable “no locking” cleanup code. It becomes very tempting to stuff random
other pieces of code under the protection of this lock, making it a sprawling
mess and violating the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;principle to protect data, not
code&lt;/a&gt;: The
lock held during the entire cleanup operation is protecting against that
cleanup code doing things, and not anymore a specific data structure.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The much better approach is using &lt;code&gt;kref_get_unless_zero()&lt;/code&gt;, together
with a spinlock for your data structure containing the weak reference. This
looks especially nifty in combination with &lt;code&gt;struct xarray&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;obj_find_in_cache(id)
{
	xa_lock();
	obj = xa_find(id);
	if (obj &amp;amp;&amp;amp; !kref_get_unless_zero(&amp;amp;obj-&amp;gt;kref))
		obj = NULL;
	xa_unlock();

	return obj;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this all the issues are resolved:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Arbitrary amounts of weak references in any kind of structures protected by
their own spinlock can be added, without causing dependencies between them.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In the object’s cleanup function the same spinlock only needs to be held right
around when the weak references are removed from the lookup structure. The
lock critical section is no longer needlessly enlarged, we’re back to
protecting data instead of code.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
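
&lt;p&gt;The matching cleanup side, as a sketch in the same simplified style as the
lookup example above:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;obj_release(kref)
{
	obj = container_of(kref, struct obj, kref);

	/* the spinlock is only held right around removing the weak
	 * reference, the rest of the cleanup runs without any locks */
	xa_lock();
	xa_erase(obj-&amp;gt;id);
	xa_unlock();

	kfree(obj);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;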

&lt;p&gt;With both together the locking no longer leaks beyond the lookup structure
and its associated code, unlike with &lt;code&gt;kref_put_mutex()&lt;/code&gt; and
similar approaches. Thankfully &lt;code&gt;kref_get_unless_zero()&lt;/code&gt; has become
the much more popular approach since it was added 10 years ago!&lt;/p&gt;

&lt;h2 id=&quot;locking-antipattern-confusing-object-lifetime-and-data-consistency&quot;&gt;Locking Antipattern: Confusing Object Lifetime and Data Consistency&lt;/h2&gt;

&lt;p&gt;We’ve now seen a few examples where the &lt;a href=&quot;#level-0-no-locking&quot;&gt;“no locking” patterns from level
0&lt;/a&gt; collide in annoying ways when more locking is added, to
the point where we seem to violate the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;principle to protect data, not
code&lt;/a&gt;. It’s worth
looking at this a bit closer, since we can generalize what’s going on here to a
fairly high-level antipattern.&lt;/p&gt;

&lt;p&gt;The key insight is that the “no locking” patterns all rely on memory barrier
primitives in disguise, not classic locks, to synchronize access between
multiple threads. In the case of the &lt;a href=&quot;#locking-pattern-single-owner&quot;&gt;single owner
pattern&lt;/a&gt; there might also be blocking semantics
involved, when the next owner needs to wait for the previous owner to finish
processing first. These are functions like &lt;code&gt;flush_work()&lt;/code&gt; or the
various wait functions like &lt;code&gt;wait_event()&lt;/code&gt; or
&lt;code&gt;wait_for_completion()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Calling these barrier functions while holding locks commonly leads to issues:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Blocking functions like &lt;code&gt;flush_work()&lt;/code&gt; pull in every lock or other
dependency that the work we wait on, or more generally any of the previous owners
of an object, needed, as a so-called cross-release dependency. Unfortunately
lockdep does not understand these natively, and the usual tricks to add manual
annotations have severe limitations. There’s work ongoing to add
&lt;a href=&quot;https://lwn.net/Articles/709849/&quot;&gt;cross-release dependency tracking to
lockdep&lt;/a&gt;, but nothing looks anywhere near
ready to merge. Since these dependency chains can be really long and get ever
longer when more code is added to a worker - dependencies are pulled in even
if only a single lock is held at any given time - this can quickly become a
nightmare to untangle.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Often the requirement to hold a lock over these barrier type functions comes
from the fact that the object would otherwise disappear, or undergo some
serious confusion about its lifetime state - not just whether it’s still
alive or getting destroyed, but also who exactly owns it or whether it’s maybe
a resurrected zombie representing a different instance now. This encourages
the lock to morph from a “protects some specific data” to a “protects
specific code from running” design, leading to all the code maintenance issues
discussed in the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;protect data, not code
principle&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons try as hard as possible to not hold any locks, or as few as
feasible, when calling any of these memory-barriers-in-disguise functions used
to manage object lifetime or ownership in general. The antipattern here is
abusing locks to fix lifetime issues. We have seen two specific instances thus
far:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;code&gt;kref_put_mutex()&lt;/code&gt; instead of &lt;code&gt;kref_get_unless_zero()&lt;/code&gt; in
the &lt;a href=&quot;#locking-pattern-weak-reference&quot;&gt;weak reference pattern&lt;/a&gt;. This is a
special case of the &lt;a href=&quot;#locking-pattern-reference-counting&quot;&gt;reference counting
pattern&lt;/a&gt;, but with some finer-grained
locking added to support weak references.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Calling &lt;code&gt;flush_work()&lt;/code&gt; while holding locks in the &lt;a href=&quot;#locking-pattern-async-processing&quot;&gt;async
worker&lt;/a&gt;. This is a special case of the
&lt;a href=&quot;#locking-pattern-single-owner&quot;&gt;single owner pattern&lt;/a&gt;, again with a bit more
locking added to support some mutable state.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
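
&lt;p&gt;The lookup side of the preferred weak reference pattern mentioned in the first
instance can be sketched in kernel-style C like this - it’s a sketch only, not
compilable stand-alone, and &lt;code&gt;cache_find()&lt;/code&gt; plus the surrounding types
are invented for illustration:&lt;/p&gt;

```c
/* Weak reference lookup using kref_get_unless_zero(): the lookup lock
 * only protects the cache structure, not the object's lifetime.
 * Kernel-style sketch; cache_find() and the types are invented. */
struct obj *obj_lookup(struct obj_cache *cache, unsigned long id)
{
	struct obj *obj;

	spin_lock(&cache->lock);
	obj = cache_find(cache, id);
	/* The object might be on its way out already; only a successful
	 * kref_get_unless_zero() lets us use it past the lock. */
	if (obj && !kref_get_unless_zero(&obj->kref))
		obj = NULL;
	spin_unlock(&cache->lock);

	return obj;
}
```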

&lt;p&gt;We will see some more instances later, but the antipattern holds in general as
a source of trouble.&lt;/p&gt;

&lt;h2 style=&quot;background:orange;&quot;&gt; Level 2.5: Splitting Locks for Performance
Reasons&lt;/h2&gt;

&lt;p&gt;We’ve looked at a pile of functional reasons for complicating the locking
design, but sometimes you need to add more fine-grained locking for performance
reasons. This is already getting dangerous, because it’s very tempting to tune
some microbenchmark just because we can, or maybe delude ourselves that it will
be needed in the future. Therefore only complicate your locking if:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;You have actual real world benchmarks with workloads relevant to users that
show measurable gains outside of statistical noise.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You’ve fully exhausted architectural changes to outright avoid the overhead,
like io_uring pre-registering file descriptors locally to avoid manipulating
the file descriptor table.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You’ve fully exhausted algorithm improvements like batching up operations to
amortize locking overhead better.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only then should you guarantee yourself worse future maintenance pain by
applying trickier locking than the bare minimum necessary for correctness. Even
then, go with the simplest approach - often converting a lock to its read-write
variant is good enough.&lt;/p&gt;

&lt;p&gt;Sometimes this isn’t enough, and you actually have to split up a lock into more
fine-grained locks to achieve more parallelism and less contention among
threads. Note that doing so blindly will backfire because locks are not free.
When common operations still have to take most of the locks anyway, even if
only for a short time and in strict succession, the performance hit on single
threaded workloads will not justify any benefit in more threaded use-cases.&lt;/p&gt;

&lt;p&gt;Another issue with more fine-grained locking is that often you cannot define a
strict nesting hierarchy, or worse might need to take multiple locks of the same
object or lock class. I’ve written previously about this specific issue, and
more importantly, &lt;a href=&quot;/2020/08/lockdep-false-positives.html#fighting-lockdep-badly&quot;&gt;how to teach lockdep about lock nesting, the bad and the
good ways&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One really entertaining story from the GPU subsystem, for bystanders at least,
is that we really screwed this up for good by de facto allowing userspace to
control the lock order of all the objects involved in an IOCTL, while
furthermore requiring that disjoint operations proceed without contention. If
you ever manage to repeat this feat you can take a look at the &lt;a href=&quot;https://www.kernel.org/doc/html/latest/locking/ww-mutex-design.html&quot;&gt;wait-wound
mutexes&lt;/a&gt;.
Or if you just want some pretty graphs, &lt;a href=&quot;https://lwn.net/Articles/548909/&quot;&gt;LWN has an old article about wait-wound
mutexes too&lt;/a&gt;.&lt;/p&gt;

&lt;h2 style=&quot;background:red&quot;&gt; Level 3: Lockless Tricks&lt;/h2&gt;

&lt;p&gt;Do not go here wanderer!&lt;/p&gt;

&lt;p&gt;Seriously, I have seen a lot of very fancy driver subsystem locking designs, but I
have not yet found many that were actually justified. Because only real world,
non-contrived performance issues can ever justify reaching for this level, and
in almost all cases algorithmic or architectural fixes yield much better
improvements than any kind of (locking) micro-optimization could ever hope for.&lt;/p&gt;

&lt;p&gt;Hence this is just a long list of antipatterns, so that people who do not yet
have a grumpy expression permanently chiseled into their facial structure know when
they’re in trouble.&lt;/p&gt;

&lt;p&gt;Note that this section isn’t limited to lockless tricks in the academic sense of
guaranteed constant overhead forward progress, meaning no spinning or retrying
anywhere at all. It’s for everything which doesn’t use standard locks like
&lt;code&gt;struct mutex&lt;/code&gt;, &lt;code&gt;spinlock_t&lt;/code&gt;, &lt;code&gt;struct
rw_semaphore&lt;/code&gt;, or any of the others provided in the Linux kernel.&lt;/p&gt;

&lt;h3 id=&quot;locking-antipattern-using-rcu&quot;&gt;Locking Antipattern: Using RCU&lt;/h3&gt;

&lt;p&gt;Yeah RCU is really awesome and impressive, but it comes at serious costs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;By design, at least with standard usage, RCU elevates &lt;a href=&quot;#locking-antipattern-confusing-object-lifetime-and-data-consistency&quot;&gt;mixing up lifetime and
consistency
concerns&lt;/a&gt;
to a virtue. &lt;code&gt;rcu_read_lock()&lt;/code&gt; gives you both a read-side critical
section &lt;em&gt;and&lt;/em&gt; an extension of the lifetime of any RCU protected object. There’s
absolutely no way you can avoid that antipattern - it’s built in.&lt;/p&gt;

    &lt;p&gt;Worse, RCU read-side critical sections nest rather freely, which means unlike
with real locks abusing them to keep objects alive won’t run into nasty locking
inversion issues when you pull that stunt with nesting different objects or
classes of objects. Using locks to paper over lifetime issues is bad, but with
RCU it’s weapons-grade levels of dangerous.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Equally nasty, RCU practically forces you to deal with zombie objects, which
breaks the &lt;a href=&quot;#locking-pattern-reference-counting&quot;&gt;reference counting pattern&lt;/a&gt;
in annoying ways.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On top of all this, breaking out of RCU is costly and kinda defeats the point,
and hence there’s a huge temptation to delay this as long as possible: check
as many things and dereference as many pointers under RCU protection as
you can before you take a real lock or upgrade to a proper reference with
&lt;code&gt;kref_get_unless_zero()&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;Unless extreme restraint is applied this results in RCU leading you towards
locking antipatterns. Worse, RCU tends to spread them to ever more objects and
ever more fields within them.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Altogether, all that freely using RCU achieves is proving that there really is no
bottom on the code maintainability scale. It is not a great day when your driver
dies in &lt;code&gt;synchronize_rcu()&lt;/code&gt; and lockdep has no idea what’s going on,
and I’ve seen such days.&lt;/p&gt;

&lt;p&gt;Personally I think that in driver subsystems the most that’s still a legit and
justified use of RCU is object lookup with &lt;code&gt;struct xarray&lt;/code&gt; and
&lt;code&gt;kref_get_unless_zero()&lt;/code&gt;, and cleanup handled entirely by
&lt;code&gt;kfree_rcu()&lt;/code&gt;. Anything more and you’re very likely chasing a rabbit
down its hole and have not realized it yet.&lt;/p&gt;
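
&lt;p&gt;That one legit pattern can be sketched in kernel-style C like this - not
compilable stand-alone, and &lt;code&gt;struct thing&lt;/code&gt; plus the lookup function are
invented for illustration:&lt;/p&gt;

```c
/* Object lookup under RCU, immediately upgraded to a real reference
 * with kref_get_unless_zero(); cleanup is handled entirely by
 * kfree_rcu(), with no synchronize_rcu() anywhere in the driver. */

struct thing {
	struct kref kref;
	struct rcu_head rcu;
	/* ... */
};

struct thing *thing_lookup(struct xarray *xa, unsigned long id)
{
	struct thing *t;

	rcu_read_lock();
	t = xa_load(xa, id);
	/* t may be a zombie; only a successful kref_get_unless_zero()
	 * makes it safe to use outside the RCU read-side section. */
	if (t && !kref_get_unless_zero(&t->kref))
		t = NULL;
	rcu_read_unlock();

	return t;
}

static void thing_release(struct kref *kref)
{
	struct thing *t = container_of(kref, struct thing, kref);

	kfree_rcu(t, rcu);
}
```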

&lt;h3 id=&quot;locking-antipattern-atomics&quot;&gt;Locking Antipattern: Atomics&lt;/h3&gt;

&lt;p&gt;Firstly, Linux atomics have two annoying properties just to start:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Unlike e.g. C++ atomics in userspace they are unordered or weakly ordered
by default in a lot of cases. A lot of people are surprised by that, and then
have an even harder time understanding the memory barriers they need to
sprinkle over the code to make it work correctly.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Worse, many atomic functions neither operate on the atomic types
&lt;code&gt;atomic_t&lt;/code&gt; and &lt;code&gt;atomic64_t&lt;/code&gt; nor have &lt;code&gt;atomic&lt;/code&gt;
anywhere in their names, and so pose serious pitfalls to reviewers:&lt;/p&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;code&gt;READ_ONCE()&lt;/code&gt; and &lt;code&gt;WRITE_ONCE()&lt;/code&gt; for volatile loads and
stores.&lt;/li&gt;
      &lt;li&gt;&lt;code&gt;cmpxchg()&lt;/code&gt; and the various variants of atomic exchange with or
without a compare operation.&lt;/li&gt;
      &lt;li&gt;Bitops like &lt;code&gt;set_bit()&lt;/code&gt; are atomic. Worse, their
non-atomic variants carry double underscores, like &lt;code&gt;__set_bit()&lt;/code&gt;, to
scare you away from using them, despite these being the ones you really
want by default.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
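
&lt;p&gt;A few of these disguised atomics in code - a kernel-style sketch, where
&lt;code&gt;obj&lt;/code&gt; and its fields are invented for illustration:&lt;/p&gt;

```c
/* Atomics that don't look like atomics; obj and its fields are
 * invented for illustration. */

/* Volatile load and store - single-copy atomic, but unordered: */
busy = READ_ONCE(obj->busy);
WRITE_ONCE(obj->busy, 1);

/* Atomic compare-exchange on a plain unsigned long: */
old = cmpxchg(&obj->state, STATE_IDLE, STATE_BUSY);

/* set_bit() is the atomic variant; the double-underscore __set_bit()
 * is the plain store you usually want when the word is already
 * protected by a lock: */
set_bit(OBJ_DIRTY, &obj->flags);
__set_bit(OBJ_DIRTY, &obj->flags);
```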

&lt;p&gt;Those are a lot of unnecessary trap doors, but the real bad part is what people
tend to build with atomic instructions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;I’ve seen at least three different, incomplete and ill-defined
reimplementations of read-write semaphores without lockdep support. Reinventing
completions is also pretty popular. Worse, the folks involved didn’t realize
what they built. That’s an impressive violation of the &lt;a href=&quot;/2022/07/locking-engineering.html#2-make-it-correct&quot;&gt;“Make it Correct”
principle&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It seems very tempting to build terrible variations of the &lt;a href=&quot;#level-0-no-locking&quot;&gt;“no locking”
patterns&lt;/a&gt;. It’s very easy to screw them up by extending
them in a bad way; e.g. reference counting with weak references or RCU
optimizations done wrong very quickly leads to a complete mess. There are
reasons why you should never deviate from these patterns.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Statistics counters implemented with atomics look innocent, but almost always
there’s already a lock you could take instead of doing unordered counter updates.
Often this results in better code organization to boot, since the statistics for a
list and its manipulation are then closer together. There are some exceptions
with real performance justification; a recent one I’ve seen is memory
shrinkers, where you really want your &lt;code&gt;shrinker-&amp;gt;count_objects()&lt;/code&gt; to
not have to acquire any locks. Otherwise in a memory intense workload all
threads are stuck on the one thread doing actual reclaim holding the same lock
in your &lt;code&gt;shrinker-&amp;gt;scan_objects()&lt;/code&gt; function.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, unless you’re actually building a new locking or synchronization
primitive in the core kernel, you most likely do not want to get seen even
looking at atomic operations as an option.&lt;/p&gt;

&lt;h3 id=&quot;locking-antipattern-preemptlocal_irqbh_disable-and-friends-&quot;&gt;Locking Antipattern: &lt;code&gt;preempt/local_irq/bh_disable()&lt;/code&gt; and Friends …&lt;/h3&gt;

&lt;p&gt;This one is simple: Lockdep doesn’t understand them. The real-time folks hate
them. Whatever it is you’re doing, use proper primitives instead, and at least
read up on the &lt;a href=&quot;https://lwn.net/Articles/828477/&quot;&gt;LWN coverage on why these are
problematic and what to do instead&lt;/a&gt;. If you need
some kind of synchronization primitive - maybe to avoid the &lt;a href=&quot;#locking-antipattern-confusing-object-lifetime-and-data-consistency&quot;&gt;lifetime vs.
consistency antipattern
pitfalls&lt;/a&gt; -
then use the proper functions for that like &lt;code&gt;synchronize_irq()&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;locking-antipattern-memory-barriers&quot;&gt;Locking Antipattern: Memory Barriers&lt;/h3&gt;

&lt;p&gt;Or more often, lack of them, incorrect or imbalanced use of barriers, badly or
wrongly or just not at all documented memory barriers, or …&lt;/p&gt;

&lt;p&gt;Fact is that the vast majority of kernel hackers, and driver people even more so, have no
useful understanding of the Linux kernel’s memory model, and should never be
caught entertaining use of explicit memory barriers in production code.
Personally I’m pretty good at spotting holes, but I’ve had to learn the hard way
that I’m not even close to being able to positively prove correctness. And for
better or worse, nothing short of that tends to cut it.&lt;/p&gt;

&lt;p&gt;For a still fairly cursory discussion read the &lt;a href=&quot;https://lwn.net/Articles/844224/&quot;&gt;LWN series on lockless
algorithms&lt;/a&gt;. If the code comments and commit
message are anything less rigorous than that, it’s fairly safe to assume there’s
an issue.&lt;/p&gt;

&lt;p&gt;Now don’t get me wrong, I love to read an article or watch a talk by Paul
McKenney on RCU like anyone else to get my brain fried properly. But aside from
extreme exceptions this kind of maintenance cost has simply no justification in
a driver subsystem - at least not unless it’s packaged in a driver-hacker-proof
library or core kernel service of some sort, with all the memory barriers well
hidden away where ordinary fools like me can’t touch them.&lt;/p&gt;

&lt;h2 id=&quot;closing-thoughts&quot;&gt;Closing Thoughts&lt;/h2&gt;

&lt;p&gt;I hope you enjoyed this little tour of progressively more worrying levels of
locking engineering, with really just one key takeaway:&lt;/p&gt;

&lt;p&gt;Simple, dumb locking is good locking, since with that you have a fighting chance
to make it correct locking.&lt;/p&gt;

&lt;p&gt;Thanks to Daniel Stone and Jason Ekstrand for reading and commenting on drafts
of this text.&lt;/p&gt;
</description>
        <pubDate>Wed, 03 Aug 2022 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2022/08/locking-hierarchy.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2022/08/locking-hierarchy.html</guid>
        
        <category>In-Depth Tech</category>
        
        
      </item>
    
      <item>
        <title>Locking Engineering Principles</title>
        <description>&lt;p&gt;For various reasons I spent way too much of the last two years looking at code
with terrible locking design and trying to rectify it, instead of a lot more actual
building of cool things. It’s symptomatic that the last post here on my neglected blog
is also a &lt;a href=&quot;/2020/08/lockdep-false-positives.html&quot;&gt;rant on lockdep abuse&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I tried to distill all the lessons learned into some training slides, and this
two-part series is the writeup of the same material. There are some GPU specific rules, but I
think the key points should apply to kernel drivers in
general.&lt;/p&gt;

&lt;p&gt;The first part here lays out some principles, the &lt;a href=&quot;/2022/08/locking-hierarchy.html&quot;&gt;second part builds a locking
engineering design pattern hierarchy&lt;/a&gt; from the
easiest to understand and maintain to the most nightmare-inducing
approaches.&lt;/p&gt;

&lt;p&gt;Also, by locking engineering I mean the general problem of protecting data
structures against concurrent access by multiple threads, trying to ensure
that each thread sees a sufficiently consistent view of the data it reads and that the updates
it commits won’t result in confusion. Of course what exactly “sufficiently
consistent” means depends highly upon the precise requirements, but figuring
out these kinds of questions is out of scope for this little series here.&lt;/p&gt;

&lt;!--more--&gt;
&lt;h2 id=&quot;priorities-in-locking-engineering&quot;&gt;Priorities in Locking Engineering&lt;/h2&gt;

&lt;p&gt;Designing a correct locking scheme is hard, validating that your code actually
implements your design is harder, and then debugging when - not if! - you
screwed up is even worse. Therefore the absolute most important rule in locking
engineering, at least if you want to have any chance at winning this game, is to
make the design as simple and dumb as possible.&lt;/p&gt;

&lt;h3 id=&quot;1-make-it-dumb&quot;&gt;1. Make it Dumb&lt;/h3&gt;

&lt;p&gt;Since this is &lt;em&gt;the&lt;/em&gt; key principle the entire second part of this series will go
through a lot of different locking design patterns, from the simplest and
dumbest and easiest to understand, to the most hair-raising horrors of
complexity and trickiness.&lt;/p&gt;

&lt;p&gt;Meanwhile let’s continue to look at everything else that matters.&lt;/p&gt;

&lt;h3 id=&quot;2-make-it-correct&quot;&gt;2. Make it Correct&lt;/h3&gt;

&lt;p&gt;Since simple doesn’t necessarily mean correct, especially when transferring a
concept from design to code, we need guidelines. On the design front the most
important one is to &lt;a href=&quot;/2020/08/lockdep-false-positives.html&quot;&gt;design for lockdep, and not fight
it&lt;/a&gt;, for which I already wrote a full length
rant. Here I will only go through the main lessons: Validating locking by hand
against all the other locking designs and nesting rules the kernel has overall
is nigh impossible, extremely slow, something only a few people can do with any
chance of success, and hence in almost all cases a complete waste of time. We
need tools to automate this, and in the Linux kernel this is lockdep.&lt;/p&gt;

&lt;p&gt;Therefore if lockdep doesn’t understand your locking design your design is at
fault, not lockdep. Adjust accordingly.&lt;/p&gt;

&lt;p&gt;A corollary is that you actually need to teach lockdep your locking rules,
because otherwise different drivers or subsystems will end up with de facto
incompatible nesting and dependencies. Which, as long as you never exercise them
on the same kernel boot-up, much less the same machine, won’t make lockdep grumpy.
But it will make maintainers very much question why they are doing what they’re
doing.&lt;/p&gt;

&lt;p&gt;Hence at driver/subsystem/whatever load time, when CONFIG_LOCKDEP is enabled,
take all key locks in the correct order. One example for this relevant
to GPU drivers is &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/dma-resv.c?h=v5.18#n685&quot;&gt;in the dma-buf
subsystem&lt;/a&gt;.&lt;/p&gt;
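
&lt;p&gt;Such priming can be sketched in kernel-style C roughly like this, modeled
loosely on the dma-buf example above - not compilable stand-alone, and everything
named &lt;code&gt;my_&lt;/code&gt; is invented for illustration:&lt;/p&gt;

```c
/* Take the key locks once, in the intended order, at init time so
 * lockdep records the nesting rules before any real workload runs. */
#if IS_ENABLED(CONFIG_LOCKDEP)
static void my_subsys_lockdep_prime(struct my_dev *dev)
{
	mutex_lock(&dev->big_lock);
	mutex_lock(&dev->obj_lock);	/* obj_lock nests under big_lock */

	/* memory allocation must be allowed under both locks */
	fs_reclaim_acquire(GFP_KERNEL);
	fs_reclaim_release(GFP_KERNEL);

	mutex_unlock(&dev->obj_lock);
	mutex_unlock(&dev->big_lock);
}
#endif
```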

&lt;p&gt;In the same spirit, at every entry point to your library or subsystem, or
anything else big, validate that the callers hold up the locking contract with
&lt;code&gt;might_lock(), might_sleep(), might_alloc()&lt;/code&gt; and all the variants and
more specific implementations of this. Note that there’s a huge overlap between
locking contracts and calling context in general (like interrupt safety, or
whether memory allocation is allowed to call into direct reclaim), and since all
these functions compile away to nothing when debugging is disabled there’s
really no cost in sprinkling them around very liberally.&lt;/p&gt;
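
&lt;p&gt;Sketched at an invented entry point, such contract checks look like this (a
kernel-style sketch; &lt;code&gt;my_subsys_do_stuff()&lt;/code&gt; and &lt;code&gt;struct
my_obj&lt;/code&gt; are invented):&lt;/p&gt;

```c
/* Entry-point contract checks; these all compile away entirely when
 * the corresponding debug options are disabled. */
int my_subsys_do_stuff(struct my_obj *obj)
{
	might_sleep();			/* callers must be able to block */
	might_lock(&obj->lock);		/* we may take obj->lock */
	might_alloc(GFP_KERNEL);	/* we may call into direct reclaim */

	/* ... actual work ... */
	return 0;
}
```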

&lt;p&gt;On the implementation and coding side there’s a few rules of thumb to follow:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Never invent your own locking primitives; you’ll get them wrong, or at least
build something that’s slow. The kernel’s locks are built and tuned by people
who’ve done nothing else their entire career. You won’t beat them except in bug
count, and that by a lot.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The same holds for synchronization primitives - don’t build your own with a
&lt;code&gt;struct wait_queue_head&lt;/code&gt;, or worse, hand-roll your own wait queue.
Instead use the most specific existing function that provides the
synchronization you need, e.g. &lt;code&gt;flush_work()&lt;/code&gt; or
&lt;code&gt;flush_workqueue()&lt;/code&gt; and the enormous pile of variants available for
synchronizing against scheduled work items.&lt;/p&gt;

    &lt;p&gt;A key reason here is that very often these more specific functions already
come with elaborate lockdep annotations, whereas anything hand-rolled tends to
require much more manual design validation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally at the intersection of “make it dumb” and “make it correct”, pick the
simplest lock that works, like a normal mutex instead of a read-write
semaphore. This is because in general, stricter rules catch bugs and design
issues quicker, hence picking a very fancy “anything goes” locking primitive
is a bad choice.&lt;/p&gt;

    &lt;p&gt;As another example, pick spinlocks over mutexes because spinlocks are a lot
more strict in what code they allow in their critical section. Hence there’s much
less risk of accidentally putting something silly in there and closing a dependency
loop that could lead to a deadlock.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-make-it-fast&quot;&gt;3. Make it Fast&lt;/h3&gt;

&lt;p&gt;Speed doesn’t matter if you no longer understand the design in the future;
you need simplicity first.&lt;/p&gt;

&lt;p&gt;Speed doesn’t matter if all you’re doing is crashing faster. You need
correctness before speed.&lt;/p&gt;

&lt;p&gt;Finally speed doesn’t matter where users don’t notice it. If you
micro-optimize a path that doesn’t even show up in real world workloads users
care about, all you’ve done is wasted time and committed to future maintenance
pain for no gain at all.&lt;/p&gt;

&lt;p&gt;Similarly, optimizing code paths which should never run in the first place,
instead of improving your design, is not worth it. This holds especially for GPU drivers,
where the real application interfaces are OpenGL, Vulkan or similar, and there’s
an entire driver on the userspace side - the right fix for performance issues
is very often to radically update the contract and sharing of responsibilities
between the userspace and kernel driver parts.&lt;/p&gt;

&lt;p&gt;The big example here is GPU address patch list processing at command submission
time, which was necessary for old hardware that completely lacked any useful
concept of a per process virtual address space. But that has changed, which
means virtual addresses can stay constant, while the kernel can still freely
manage the physical memory by manipulating pagetables, like on the CPU.
Unfortunately one driver in the DRM subsystem instead spent an easy engineer
decade of effort to tune relocations, write lots of testcases for the resulting
corner cases in the multi-level fastpath fallbacks, and even more time handling
the impressive amounts of fallout in the form of bugs and future headaches due
to the resulting unmaintainable code complexity …&lt;/p&gt;

&lt;p&gt;In other subsystems where the kernel ABI is the actual application contract
these kind of design simplifications might instead need to be handled between
the subsystem’s code and driver implementations. This is what we’ve done when
moving from the old kernel modesetting infrastructure to atomic modesetting.
But sometimes no clever tricks at all help and you only get true speed with a
radically revamped uAPI - io_uring is a great example here.&lt;/p&gt;

&lt;h2 id=&quot;protect-data-not-code&quot;&gt;Protect Data, not Code&lt;/h2&gt;

&lt;p&gt;A common pitfall is to design locking by looking at the code, perhaps just
sprinkling locking calls over it until it feels like it’s good enough. The right
approach is to design locking for the data structures, which means specifying
for each structure or member field how it is protected against concurrent
changes, and how the necessary amount of consistency is maintained across the
entire data structure with rules that stay invariant, irrespective of how code
operates on the data. Then roll it out consistently to all the functions,
because the code-first approach tends to have a lot of issues:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;A code centric approach to locking often leads to locking rules changing over
the lifetime of an object, e.g. with different rules for a structure or member
field depending upon whether an object is in active use, maybe just cached or
undergoing reclaim. This is hard to teach to lockdep, especially when the
nesting rules change for different states. Lockdep assumes that the
locking rules are completely invariant over the lifetime of the entire kernel,
not just over the lifetime of an individual object or structure.&lt;/p&gt;

    &lt;p&gt;Starting from the data structures on the other hand encourages that locking
rules stay the same for a structure or member field.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Locking design that changes depending upon the code that can touch the data
needs either complicated documentation entirely separate from the
code - with a high risk of becoming stale - or the explanations, if there are any,
are sprinkled over the various functions, which means reviewers need to
reacquire the entire relevant chunk of the code base again to make sure they
don’t miss any odd corner cases.&lt;/p&gt;

    &lt;p&gt;With data structure driven locking design there’s a perfect, because unique
place to document the rules - in the kerneldoc of each structure or member
field.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A consequence for code review is that to recheck the locking design for a code
first approach every function and flow has to be checked against all others,
and changes need to be checked against all the existing code. If this is not
done you might miss a corner case where the locking falls apart with a race
condition or could deadlock.&lt;/p&gt;

    &lt;p&gt;With a data first approach to locking changes can be reviewed incrementally
against the invariant rules, which means review of especially big or complex
subsystems actually scales.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;When facing a locking bug it’s tempting to try and fix it just in the affected
code. By repeating that often enough, a locking scheme that protects data
acquires code-specific special cases. Therefore locking issues always
need to be first mapped back to new or changed requirements on the data
structures and how they are protected.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
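
&lt;p&gt;As a sketch, the per-field kerneldoc documentation mentioned above might look
like this - the structure and all field names are invented for illustration:&lt;/p&gt;

```c
/**
 * struct my_obj - example object with data-first locking rules
 * @lock: protects @state and @refcount
 * @state: object state, protected by @lock
 * @refcount: reference count, protected by @lock
 * @link: entry in my_dev.obj_list, protected by my_dev.list_lock,
 *	not by @lock
 */
struct my_obj {
	struct mutex lock;
	unsigned int state;
	unsigned int refcount;
	struct list_head link;
};
```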

&lt;p&gt;&lt;em&gt;The&lt;/em&gt; big antipattern of how you end up with code centric locking is to protect
an entire subsystem (or worse, a group of related subsystems) with a single
huge lock. The canonical example was the big kernel lock &lt;em&gt;BKL&lt;/em&gt;; that’s gone, but
in many cases it’s just been replaced by smaller, but still huge, locks like
&lt;code&gt;console_lock()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This results in a lot of long term problems when trying to adjust the locking
design later on:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Since the big lock protects everything, it’s often very hard to tell what it
does not protect. Locking at the fringes tends to be inconsistent, and due to
that its coverage tends to creep ever further when people try to fix bugs
where a given structure is not consistently protected by the same lock.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Also, subsystems often have different entry points; e.g. consoles can be
reached through the console subsystem directly, through the vt and tty subsystems, and
also through an enormous pile of driver specific interfaces, with the fbcon
IOCTLs as an example. Attempting to split the big lock into smaller
per-structure locks pretty much guarantees that different entry points have to
take the per-object locks in opposite order, which often can only be resolved
through a large-scale rewrite of all impacted subsystems.&lt;/p&gt;

    &lt;p&gt;Worse, as long as the big subsystem lock continues to be in use no one is
spotting these design issues in the code flow. Hence they will slowly get
worse instead of the code moving towards a better structure.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons big subsystem locks tend to live way past their justified
usefulness until code maintenance becomes nigh impossible: no individual
bugfix is worth the task of really rectifying the design, but each bugfix tends to
make the situation worse.&lt;/p&gt;

&lt;h2 id=&quot;from-principles-to-practice&quot;&gt;From Principles to Practice&lt;/h2&gt;

&lt;p&gt;Stay tuned for next week’s installment, which will cover what these principles
mean when applied in practice: going through a large pile of locking design
patterns from the most desirable to the most hair-raisingly complex.&lt;/p&gt;
</description>
        <pubDate>Wed, 27 Jul 2022 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2022/07/locking-engineering.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2022/07/locking-engineering.html</guid>
        
        <category>In-Depth Tech</category>
        
        
      </item>
    
      <item>
        <title>Lockdep False Positives, some stories about</title>
        <description>&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Lockdep is giving false positives are the new the compiler is broken.&lt;/p&gt;&amp;mdash; David Airlie (@DaveAirlie) &lt;a href=&quot;https://twitter.com/DaveAirlie/status/1291932064606859269?ref_src=twsrc%5Etfw&quot;&gt;August 8, 2020&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;Recently we’ve looked a bit at lockdep annotations in the GPU subsystems, and I
figured it’s a good opportunity to explain how this all works, and what the
tradeoffs are. Creating working locking hierarchies for the kernel isn’t easy,
making sure the kernel’s locking validator
&lt;a href=&quot;https://www.kernel.org/doc/html/v5.6/locking/lockdep-design.html&quot;&gt;lockdep&lt;/a&gt; is
happy and reviewers don’t have their brains explode even more so.&lt;/p&gt;

&lt;p&gt;First things first, and the fundamental issue:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Lockdep is about trading false positives against better testing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The only way to avoid false positives for deadlocks is to only report a deadlock
when the kernel actually deadlocked. Which is useless, since the entire point of
lockdep is to catch potential deadlock issues before they actually happen. Hence
false postives are not avoidable, at least not in theory, to be able to report
potential issues before they hang the machine. Read on for what to do in
practice.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;We need to understand how exactly lockdep trades false positives for better
discovery of locking inconsistencies. Lockdep makes a few assumptions about how
real code does locking in practice:&lt;/p&gt;

&lt;h2 id=&quot;invariance-of-locking-rules-over-time&quot;&gt;Invariance of locking rules over time&lt;/h2&gt;

&lt;p&gt;First assumption baked into lockdep is that the locking rules for a given lock
do not change over the lifetime of the lock’s existence. This already throws out
a large chunk of perfectly correct locking designs, since state transitions can
control how an object is accessed, and therefore how the lock is used.  Examples
include different rules for creation and destruction, or whether an object is on
a specific list (e.g. only a gpu buffer object that’s in the lru can be
evicted). It’s not possible to automatically prove that certain code flat
out won’t ever run together with some other code on the same structure, at least
not in general. Hence this is pretty much a required assumption to make
lockdep useful - if every new &lt;code&gt;lock()&lt;/code&gt; call could follow new rules
there’s nothing to check, besides realizing that an actual deadlock indeed
occurred and all is lost already.&lt;/p&gt;

&lt;p&gt;And of course getting such state transitions correct, with the guarantee that
all the old code will no longer run, is tricky to get right, and very hard on
reviewers. It’s a good thing lockdep has problems with such code too.&lt;/p&gt;

&lt;h2 id=&quot;common-locking-rules-for-the-same-objects&quot;&gt;Common locking rules for the same objects&lt;/h2&gt;

&lt;p&gt;Second assumption is that all locks initialized by the same code are following
the same locking rules. This is achieved by making all lock initializers C
macros, which create the corresponding lockdep class as a static variable within
the calling function. Again this is pretty much required, since to spot
inconsistencies you need as many observations of all the different code path
possibilities. Best to share them all between the same object. Also a distinct
lockdep class for each individual object would explode the runtime overhead in
both memory and cpu cycles.&lt;/p&gt;

&lt;p&gt;And again this is good from a code design point too, since having the same data
structure and code follow different locking rules for different objects is at
best very confusing for reviewers.&lt;/p&gt;
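
&lt;p&gt;To illustrate the mechanism: this is roughly how the kernel’s
&lt;code&gt;mutex_init()&lt;/code&gt; macro pulls it off (a simplified sketch, not the
verbatim source) - every invocation site gets its own static key, and all locks
initialized at that site share the lockdep class hanging off that key:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#define mutex_init(mutex)                         \
do {                                              \
        /* one static key per call site, shared   \
         * by all locks initialized through it */ \
        static struct lock_class_key __key;       \
                                                  \
        __mutex_init((mutex), #mutex, &amp;amp;__key);    \
} while (0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;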

&lt;h2 id=&quot;fighting-lockdep-badly&quot;&gt;Fighting lockdep, badly&lt;/h2&gt;

&lt;p&gt;Now things went wrong: You have a lockdep splat at your hands, you’ve concluded
it’s a false positive, and you go ahead trying to teach lockdep about what’s
going on. The first class of annotations are the special
&lt;code&gt;lock_nested(lock, subclass)&lt;/code&gt; functions. Without lockdep nothing in
the generated code changes, but it tells lockdep that for this lock acquisition,
we’re using a different class to track the observed locking.&lt;/p&gt;

&lt;p&gt;This breaks both the time invariance - nothing is stopping you from using
different classes for the same lock at different times - and commonality of
locking for the same objects. Worse, you can write code which obviously
deadlocks, but lockdep will think everything is perfectly fine:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutex_init(&amp;amp;A);

mutex_lock(&amp;amp;A);
mutex_lock_nested(&amp;amp;A, SINGLE_DEPTH_NESTING);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is no good and puts a huge burden on reviewers to carefully check all these
places themselves, manually. Exactly the kind of tedious and error-prone work
lockdep was meant to take over.&lt;/p&gt;

&lt;p&gt;Slightly better are the annotations which adjust the lockdep class once, when
the object is initialized, using &lt;code&gt;lockdep_set_class()&lt;/code&gt; and related
functions. This at least does not break time invariance, and hence will at least
guarantee that lockdep spots the deadlock at the latest when it actually happens.
It still reduces how much lockdep can connect what’s going on, but occasionally
“rewrite the entire subsystem” to resolve a locking inconsistency is just not a
reasonable option.&lt;/p&gt;
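
&lt;p&gt;A minimal sketch of this pattern, with hypothetical names - the special
objects get moved into their own class once, at initialization time:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;static struct lock_class_key special_key;

mutex_init(&amp;amp;obj-&amp;gt;lock);
if (obj-&amp;gt;is_special)
        /* opt this object out of the shared class, once, at init */
        lockdep_set_class(&amp;amp;obj-&amp;gt;lock, &amp;amp;special_key);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;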

&lt;p&gt;It still means that reviewers always need to remember what the locking rules for
all types of different objects behind the same structure are, instead of just
one. And then check against every path whether that code needs to work with all
of them, or just some, or only one. Again, tedious work that lockdep is really
supposed to help with. And if it’s hard to come by a system where you can
easily run the code for the different types of objects without rebooting,
then lockdep cannot help at all.&lt;/p&gt;

&lt;p&gt;All these annotations have in common that they don’t change the code logic, only
how lockdep interprets what’s going on.&lt;/p&gt;

&lt;p&gt;An even more insidious trick on reviewers and lockdep is to push locking into an
asynchronous worker of some sorts. This hides issues because lockdep does not
follow dependencies between threads through waiter/wakee relationships like
&lt;code&gt;wait_for_completion()&lt;/code&gt; and &lt;code&gt;complete()&lt;/code&gt;, or through wait
queues. There are lockdep annotations for specific dependencies, like in
the kernel’s workqueue code when flushing workers or specific work items with
&lt;code&gt;flush_work()&lt;/code&gt;. Automatic annotations have been attempted with the
&lt;a href=&quot;https://lwn.net/Articles/709849/&quot;&gt;lockdep cross-release extension&lt;/a&gt;, which for
various reasons had to be backed out again. Therefore hand-rolled asynchronous
code is a great place to create complexity and hide locking issues from both
lockdep and reviewers.&lt;/p&gt;
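
&lt;p&gt;As a hypothetical sketch, the following deadlocks for real, but lockdep only
ever observes two plain acquisitions of &lt;code&gt;A&lt;/code&gt; and stays silent, since
the completion dependency between the two threads is not tracked:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/* thread A */
mutex_lock(&amp;amp;A);
complete(&amp;amp;go);
wait_for_completion(&amp;amp;done); /* waits forever: worker blocks on A */
mutex_unlock(&amp;amp;A);

/* worker, kicked off earlier */
wait_for_completion(&amp;amp;go);
mutex_lock(&amp;amp;A);             /* never succeeds, thread A holds it */
complete(&amp;amp;done);
mutex_unlock(&amp;amp;A);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;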

&lt;h2 id=&quot;playing-to-lockdeps-strength&quot;&gt;Playing to lockdep’s strength&lt;/h2&gt;

&lt;p&gt;Except when there’s very strong justification for all the complexity, the real
fix is to change the locking and make it simpler. Simple enough for lockdep to
understand what’s going on, which also makes reviewers’ lives a lot better.
Often this means substantial code rework, but at least in some cases there
are useful tricks.&lt;/p&gt;

&lt;p&gt;A special kind of annotations are the &lt;code&gt;lock_nest_lock(lock,
superlock)&lt;/code&gt; family of functions - these tell lockdep that when multiple
locks of the same class are acquired, it’s all serialized by the single
superlock. Lockdep then validates that the right superlock is indeed held. A
great example is &lt;code&gt;mm_take_all_locks()&lt;/code&gt;, which as the name implies,
takes all locks related to the given &lt;code&gt;mm_struct&lt;/code&gt;. In a sense this is
not a pure annotation, unlike the ones above, since it requires that the
superlock is actually locked. For reviewers too, not just for lockdep, this is
generally an easier scheme to understand than some clever sorting of lock
acquisition order.&lt;/p&gt;
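
&lt;p&gt;A sketch of the nest-lock pattern, with hypothetical names - all the
per-object locks share one class, and lockdep checks that the superlock is held
whenever several of them are acquired:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutex_lock(&amp;amp;dev-&amp;gt;big_lock);
list_for_each_entry(obj, &amp;amp;dev-&amp;gt;objects, node)
        /* same class for every obj-&amp;gt;lock, serialized by big_lock */
        mutex_lock_nest_lock(&amp;amp;obj-&amp;gt;lock, &amp;amp;dev-&amp;gt;big_lock);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;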

&lt;p&gt;A different situation often arises when creating or destroying an object. At
that stage often no other thread has a reference to the object yet, and
therefore no one else can take the lock. The best way to resolve a locking
inconsistency over the lifetime of an object due to creation and destruction
code is then to not take any locks at all in these paths. There is nothing to
protect against, after all!&lt;/p&gt;

&lt;p&gt;In all these cases the best option for long term maintainability is to simplify
the locking design, not reduce lockdep’s power by reducing the amount of false
positives it reports. And that should be the general principle.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;tldr; do not fix lockdep false positives, fix your locking&lt;/p&gt;
&lt;/blockquote&gt;
</description>
        <pubDate>Thu, 13 Aug 2020 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2020/08/lockdep-false-positives.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2020/08/lockdep-false-positives.html</guid>
        
        <category>In-Depth Tech</category>
        
        
      </item>
    
      <item>
        <title>Upstream Graphics: Too Little, Too Late</title>
        <description>&lt;p&gt;Unlike the tradition of my past few talks at Linux Plumbers or Kernel
conferences, this time around in Lisboa I did not start out with a rant proposing
to change everything. Instead I celebrated roughly 10 years of upstream graphics
progress and finally achieving paradise.  But that was all just prelude to a few
bait-and-switches, to later fulfill expectations about what’s broken this time around
in upstream, totally, and what needs to be fixed and changed, maybe.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.youtube.com/watch?v=S1I34t5RpnI&quot;&gt;LPC video recording&lt;/a&gt; is now
released, &lt;a href=&quot;/slides/lpc-2019-upstream.pdf&quot;&gt;slides&lt;/a&gt; are uploaded. If neither of
those is to your taste, read below the break for the written summary.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/S1I34t5RpnI&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;mission-accomplished&quot;&gt;Mission Accomplished&lt;/h2&gt;

&lt;p&gt;10 or so years ago upstream graphics was essentially a proof of concept for the
promises to come. Kernel display modesetting had just landed, finally bringing a
somewhat modern display driver userspace API to linux. And GEM, the graphics
execution manager, landed too, bringing proper GPU memory management and multi-client
rendering. Realistically a lot still needed to be done, from rendering drivers
for all the various SoCs, to an &lt;a href=&quot;/2015/08/atomic-modesetting-design-overview.html&quot;&gt;atomic display
API&lt;/a&gt; that can expose all the
features, not just what was needed to light up a linux desktop back in the days.
And lots of work to improve the codebase and make it much easier and quicker to
write drivers.&lt;/p&gt;

&lt;p&gt;There’s obviously still a lot to do, but I think we’ve achieved that - for full
details, check out my &lt;a href=&quot;/2019/12/elce-lyon-everything-great.html&quot;&gt;ELCE talk about everything great for upstream
graphics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now despite all this justified celebrating, there is one sticking point still:&lt;/p&gt;

&lt;h2 id=&quot;nvidia&quot;&gt;NVIDIA&lt;/h2&gt;

&lt;p&gt;The trouble with team green from an open source perspective - for them it’s a
great boon - is that they own the GPU software stack in two crucial ways:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;p&gt;NVIDIA defines how desktop GL works. Not so relevant anymore, and at least the
core profile is a solid spec and has a fully open source test suite from Khronos
by now. But the compatibility profile, which didn’t throw out all the legacy
features from the GL 1.x days in the 90s, does not have any of the interactions
with all the new features specced out and covered with tests - NVIDIA’s binary
driver is that standard, and has been for roughly 20 years.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;More relevant today is CUDA - not around for quite as long as desktop GL, but
serving a market that’s growing at a rather brisk pace. CUDA is the undisputed king of the
general purpose GPU compute hill. Anything and everything that matters runs on
top of it, often exclusively.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together these create a huge software moat around the high margin hardware
business. All an open stack would achieve is filling in that moat and inviting
competition to eat the nice surplus. In other words, stupid to even attempt,
vendor lock-in just pays too well.&lt;/p&gt;

&lt;p&gt;Now of course the reverse engineered nouveau driver still exists. But if you
have to pay for reverse engineering already, then you might as well go with
someone else’s hardware, since you’re not going to get any of the CUDA/GL
goodies.&lt;/p&gt;

&lt;p&gt;Elsewhere the business case for open source drivers very much exists - so much so
that even paying for reverse engineering a full stack is no problem. The result is a
vibrant community of hardware vendors, customers, distros and consulting shops
who pay the bills for all the open driver work that’s being done. And in
userspace even “upstream first” works - releases happen quickly and often
enough, with sufficiently smooth merge process that having a vendor tree is
simply not needed. Plus customers’ willingness to upgrade if necessary, because
it’s usually a well-contained component to enable new hardware support.&lt;/p&gt;

&lt;p&gt;In short: without a solid business case behind open graphics drivers, they’re
just not going to happen - viz. NVIDIA.&lt;/p&gt;

&lt;h2 id=&quot;not-shipping-upstream&quot;&gt;Not Shipping Upstream&lt;/h2&gt;

&lt;p&gt;Unfortunately the business case for “upstream first” on the kernel side is
completely broken. Not for open source, and not for any fundamental reasons, but
simply because the kernel moves too slowly, is too big, drivers aren’t well
contained enough, and therefore customers will not, or even cannot, upgrade. For
some hardware upstreaming early enough is possible, but graphics simply moves
too fast: By the time the upstreamed driver is actually in shipping distros,
it’s already one hardware generation behind. And missing almost a year of tuning
and performance improvements. Worse, it’s not just new hardware, but also GL and
Vulkan versions that won’t work on older kernels due to missing features,
fragmenting the ecosystem further.&lt;/p&gt;

&lt;p&gt;This is entirely unlike the userspace side, where refactoring and code sharing
in a cross-vendor shared upstream project actually pays off. Even in the short
term.&lt;/p&gt;

&lt;p&gt;There are a lot of approaches trying to paper over this rift with the linux
kernel:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Stable kernel ABI for driver modules, so that you can upgrade the core kernel
and drivers independently. Google Android is very much aiming this solution at
their huge vendor tree problem. Traditionally enterprise distros do the same.
This works, save that a stable kernel-internal ABI is &lt;a href=&quot;https://www.kernel.org/doc/html/latest/process/stable-api-nonsense.html&quot;&gt;not a notion that’s very
popular with kernel maintainers
…&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you go with an “upstream first” approach to shipping graphics drivers you
first need to polish your driver, refactor out common components, and push it
to upstream.  Only to then pay a second team to re-add all the crap so you can
ship your driver on all the old kernels, where all the helpers and new common
code don’t exist.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Pay your distro or OS vendor to just backport the new helpers before they even
have landed in an upstream release. Which means instead of a backporting team
for the driver on your payroll you now pay for backporting the entire
subsystem - which in many cases is cheaper, but an even harder sell to
beancounters. And sometimes not possible because other driver teams from
competitors might not be on board and insist on not breaking the stable driver
ABI for a given distro release kernel.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, there just isn’t a single LTS kernel. Even upstream has multiple, plus
every distro has their own flavour, plus customers love to grow their own
variety trees too. Often they’re not even coordinated on the same upstream
release. The cheapest way to support this entire madness is to completely ignore
upstream and just write your own subsystem. Or at least not use any of the
helper libraries provided by kernel subsystems, completely defeating the
supposed benefit of upstreaming code.&lt;/p&gt;

&lt;p&gt;No matter the strategy, they all boil down to paying twice - if you want to
upstream your code. And there’s no added return for the doubled bill. In
conclusion, upstream first needs a business case, like the open source graphics
stack in general. And that business case is very much real - except that for
upstreaming, it’s only real in userspace.&lt;/p&gt;

&lt;p&gt;In the kernel, “upstream first” is a sham, at least for graphics drivers.&lt;/p&gt;

&lt;p&gt;Thanks to Alex Deucher for reading and commenting on drafts of this text.&lt;/p&gt;
</description>
        <pubDate>Tue, 10 Dec 2019 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2019/12/upstream-too-little-too-late.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2019/12/upstream-too-little-too-late.html</guid>
        
        <category>Maintainer-Stuff</category>
        
        <category>Conferences</category>
        
        
      </item>
    
      <item>
        <title>ELCE Lyon: Everything Great About Upstream Graphics</title>
        <description>&lt;p&gt;At ELC Europe in Lyon I held a nice little presentation about the state of
upstream graphics drivers, and how absolutely awesome it all is. Of course with
a big focus on SoC and embedded drivers. &lt;a href=&quot;/slides/elce-2019-upstream.pdf&quot;&gt;Slides&lt;/a&gt;
and the &lt;a href=&quot;https://www.youtube.com/watch?v=kVzHOgt6WGE&quot;&gt;video
recording&lt;/a&gt;&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/kVzHOgt6WGE&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Key takeaways for the busy:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The upstream DRM graphics subsystem really scales down to tiny drivers now,
with the smallest driver coming in at just around 250 lines (including
comments and whitespace), 10’000x less than the biggest!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Batteries all included - there’s modular helpers for everything. As a rule of
thumb even minimal legacy fbdev drivers ported to DRM shrink by a factor of
2-4 thanks to these helpers taking care of anything that’s remotely
standardized in displays and GPUs.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For shipping userspace drivers go with a dual-stack: Open source GL and Vulkan
drivers for those who want that, and for getting the kernel driver merged into
upstream. Closed source for everyone else, running on the same userspace API
and kernel driver.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;And for customer support, backport the entire subsystem; try to avoid
backporting an individual driver.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, world domination is assured and progressing according to plan.&lt;/p&gt;
</description>
        <pubDate>Tue, 03 Dec 2019 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2019/12/elce-lyon-everything-great.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2019/12/elce-lyon-everything-great.html</guid>
        
        <category>Conferences</category>
        
        
      </item>
    
      <item>
        <title>Upstream First</title>
<description>&lt;p&gt;lwn.net just featured an article on &lt;a href=&quot;https://lwn.net/Articles/786304/&quot;&gt;the sustainability of open
source&lt;/a&gt;, which &lt;a href=&quot;https://lwn.net/Articles/783169/&quot;&gt;seems to be a bit of a
topic&lt;/a&gt; in &lt;a href=&quot;https://www.youtube.com/watch?v=W2AR1owg0ao&quot;&gt;various places for a
while now&lt;/a&gt;. I gave a keynote at the
Siemens Linux Community Event 2018 last year which lends itself to a different
take on all this:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/eJDcdYyOwko&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;a href=&quot;/slides/siemens-2018.pdf&quot;&gt;The slides&lt;/a&gt; for those who don’t like videos.&lt;/p&gt;

&lt;p&gt;This talk was mostly aimed at managers of engineering teams and projects with
fairly little experience in shipping open source, and much less experience in
shipping open source through upstream cross vendor projects like the kernel. It
goes through all the usual failings and missteps and explains why an upstream
first strategy is the right one, but with a twist: Instead of technical reasons,
it’s all based on economical considerations of why open source is succeeding.
Fundamentally it’s not about the better software, or the cheaper price, or that
the software freedoms are a good thing worth supporting.&lt;/p&gt;

&lt;p&gt;Instead open source is eating the world because it enables a much more
competitive software market. And all the best practices around open development
are just there to enable that highly competitive market. Instead of arguing that
open source has open development and strongly favours public discussions because
that results in better collaboration and better software, we put on the economic
lens, where private discussions become insider trading and collusion. And that’s
just not considered cool in a competitive market. Similar arguments can be made
with everything else going on in open source projects.&lt;/p&gt;

&lt;p&gt;Circling back to the list of articles at the top I think it’s worth looking at
the sustainability of open source as an economic issue of an extremely
competitive market, in other words, as a market failure: Occasionally the result
is that no one gets paid, the customers only receive a sub-par product with
all costs externalized - costs like keeping up with security issues. And like
with other market failures, a solution needs to be externally imposed through
regulations, taxation and transfers to internalize all the costs again into the
product’s price. Frankly, I have no idea what that would look like in practice though.&lt;/p&gt;

&lt;p&gt;Anyway, just a thought, but good enough a reason to finally publish the
recording and slides of my talk, which covers this just in passing in an offhand
remark.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update:&lt;/em&gt; Fix slides link.&lt;/p&gt;
</description>
        <pubDate>Thu, 02 May 2019 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2019/05/upstream-first.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2019/05/upstream-first.html</guid>
        
        <category>Maintainer-Stuff</category>
        
        <category>Conferences</category>
        
        
      </item>
    
      <item>
        <title>X.org Elections: freedesktop.org Merger - Vote Now!</title>
        <description>&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img border=&quot;0&quot; height=&quot;320&quot; src=&quot;/img/vote_now.jpg&quot; width=&quot;320&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Aside from the regular &lt;a href=&quot;https://www.x.org/wiki/BoardOfDirectors/Elections/2019/&quot;&gt;board
elections&lt;/a&gt; we also have
some &lt;a href=&quot;https://gitlab.freedesktop.org/xorgfoundation/bylaws/blob/bylaw-updates/bylaws.pdf&quot;&gt;bylaw
changes&lt;/a&gt;
to vote on. As usual with bylaw changes, we need a supermajority of all members
to agree - if you don’t vote you essentially reject it, but the board has no way
of knowing.&lt;/p&gt;

&lt;p&gt;Please see the &lt;a href=&quot;https://gitlab.freedesktop.org/xorgfoundation/bylaws/commit/06e7f04f79131df2c86e9cdfedc00aa1d1ec3f52&quot;&gt;detailed changes of the
bylaws&lt;/a&gt;,
make up your mind, and go voting on the &lt;a href=&quot;https://members.x.org/&quot;&gt;shiny new members
page&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Fri, 29 Mar 2019 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2019/03/xorg-election-fdo-merger.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2019/03/xorg-election-fdo-merger.html</guid>
        
        
      </item>
    
      <item>
        <title>Why no 2D Userspace API in DRM?</title>
<description>&lt;p&gt;The DRM (direct rendering manager, not the content protection stuff) graphics
subsystem in the linux kernel does not have a generic 2D acceleration API,
despite an awful lot of GPUs having more or less featureful blitter
units. And many systems need them for a lot of use-cases, because the 3D engine
is a bit too slow or too power hungry for just rendering desktops.&lt;/p&gt;

&lt;p&gt;It’s a FAQ why this doesn’t exist and why it won’t get added, so I figured I’ll
answer this once and for all.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;A bit of nomenclature upfront: A 2D engine (or blitter) is a bit of hardware that
can copy stuff with some knowledge of the 2D layout usually used for pixel
buffers.  Some blitters can also do more, like basic blending, converting color
spaces or stretching/scaling. A 3D engine on the other hand is the fancy
high performance compute block, which runs small programs (called shaders) on
a massively parallel architecture. Generally with huge memory bandwidth and a
dedicated controller to feed this beast through an asynchronous command buffer.
3D engines happen to be really good at rendering the pixels for 3D action games,
among other things.&lt;/p&gt;

&lt;h2 id=&quot;theres-no-2d-acceleration-standard&quot;&gt;There’s no 2D Acceleration Standard&lt;/h2&gt;

&lt;p&gt;3D has it easy: There’s OpenGL and Vulkan and DirectX that require a certain
feature set. And huge market forces that make sure if you use these features
like a game would, rendering is fast.&lt;/p&gt;

&lt;p&gt;Aside: This means the 2D engine in a browser actually needs to work like a
3D action game, or the GPU will crawl. The impedance mismatch compared to
traditional 2D rendering designs is huge.&lt;/p&gt;

&lt;p&gt;On the 2D side there’s no such thing: Every blitter engine is its own bespoke
thing, with its own features, limitations and performance characteristics.
There are also no standard benchmarks that would drive common performance
characteristics - today blitters are needed mostly in small systems, with very
specific use cases. Anything big enough to run more generic workloads will have
a 3D rendering block anyway. These systems still have blitters, but mostly just
to help move data in and out of VRAM for the 3D engine to consume.&lt;/p&gt;

&lt;p&gt;Now the huge problem here is that you need to fill these gaps in various
hardware 2D engines using CPU side software rendering. The crux with any 2D
render design is that transferring buffers and data too often between the GPU
and CPU will kill performance. Usually the cliff is so steep that pure
CPU rendering using only software easily beats any simplistic 2D acceleration
design.&lt;/p&gt;

&lt;p&gt;The only way to fix this is to be really careful when moving data between the
CPU and GPU for different rendering operations. Sticking to one side, even if
it’s a bit slower, tends to be an overall win. But these decisions highly depend
upon the exact features and performance characteristics of your 2D engine.
Putting a generic abstraction layer in the middle of this stack, where it’s guaranteed
to be if you make it a part of the kernel/userspace interface, will not result
in actual acceleration.&lt;/p&gt;

&lt;p&gt;So either you make your 2D rendering look like it’s a 3D game, using 3D
interfaces like OpenGL or Vulkan. Or you need a software stack that’s bespoke to
your use-case &lt;em&gt;and&lt;/em&gt; the specific hardware you want to run on.&lt;/p&gt;

&lt;h2 id=&quot;2d-accelaration-is-really-hard&quot;&gt;2D Acceleration is Really Hard&lt;/h2&gt;

&lt;p&gt;This is the primary reason really. If you don’t believe that, look at all the
tricks a browser employs to render CSS and HTML and text really fast, while
still animating all that stuff smoothly. Yes, a web-browser is the pinnacle of
current 2D acceleration tech, and you really need all the things in there for
decent performance: Scene graphs, clever render culling, massive batching and
huge amounts of pain to make sure you don’t have to fall back to CPU-based
software rendering at the wrong point in a rendering pipeline. Plus managing
all kinds of assorted caches to balance reuse against running out of memory.&lt;/p&gt;

&lt;p&gt;Unfortunately lots of people assume 2D must be a lot simpler than 3D rendering,
and therefore they can design a 2D API that’s fast enough for everyone. No one
jumps in and suggests we should have a generic 3D interface at the kernel level,
because the lessons there are very clear:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The real application interface is fairly high level, and in userspace.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s a huge industry group doing really hard work to specify these
interfaces.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;The actual kernel-to-userspace interfaces end up being highly specific to the
hardware and architecture of the userspace driver (which contains most of the
magic). Any attempt at a generic interface leaves lots of hardware specific
tricks and hence performance on the floor.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;3D APIs like OpenGL or Vulkan have all the batching and queueing and memory
management issues covered in one way or another.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a bunch of DRM drivers which have support for 2D render engines
exposed to userspace. But they all use highly hardware specific interfaces,
fully streamlined for the specific engine. And they all require a decently sized
chunk of driver code in userspace to translate from a generic API to the
hardware formats. This is what DRM maintainers will recommend you do, if you
submit a patch to add a generic 2D acceleration API.&lt;/p&gt;

&lt;p&gt;Exactly like a 3D driver.&lt;/p&gt;

&lt;h2 id=&quot;if-all-else-fails-theres-options&quot;&gt;If All Else Fails, There’s Options&lt;/h2&gt;

&lt;p&gt;Now if you don’t care about the last bit of performance, and your use-case is
limited, and your blitter engine is limited, then there’s already options:&lt;/p&gt;

&lt;p&gt;You can take whatever pixel buffer you have, export it as a dma-buf, and then
import it into some other subsystem which already has some kind of limited 2D
acceleration support. Depending upon your blitter engine, that could be a v4l2
mem2m device, or for simpler things there’s also dmaengines.&lt;/p&gt;

&lt;p&gt;On top, the DRM subsystem does allow you to implement the traditional
acceleration methods exposed by the fbdev subsystem, in case you have userspace
that really insists on using these; it’s not recommended for anything new.&lt;/p&gt;

&lt;h2 id=&quot;what-about-kms&quot;&gt;What about KMS?&lt;/h2&gt;

&lt;p&gt;The above is kind of a lie, since the KMS (kernel modesetting) IOCTL userspace API
is a fairly full-featured 2D rendering interface. The aim of course is to render
different pixel buffers onto a screen. With the recently added writeback support,
operations targeting memory are now possible.  This could be used to expose a
traditional blitter, if you only expose writeback support and no other outputs
in your KMS driver.&lt;/p&gt;

&lt;p&gt;There are a few downsides:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;KMS is highly geared for compositing just a few buffers (hardware usually has
a very limited set of planes). For accelerated text rendering you want to do a
composite operation for each character, which means this has rather limited
use.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;KMS only needs to run at 60Hz, or whatever the refresh rate of your monitor
is. It’s not optimized for efficiency at higher throughput at all.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So all together this isn’t the high-speed 2D acceleration API you’re looking for
either. It is a valid alternative to the options above though, e.g. instead of a
v4l2 mem2m device.&lt;/p&gt;

&lt;h2 id=&quot;faq-for-the-faq-or-openvg&quot;&gt;FAQ for the FAQ, or: OpenVG?&lt;/h2&gt;

&lt;p&gt;OpenVG isn’t the standard you’re looking for either. For one it’s a userspace
API, like OpenGL. All the same reasons for not implementing a generic OpenGL
interface at the kernel/userspace boundary apply to OpenVG, too.&lt;/p&gt;

&lt;p&gt;Second, the Mesa3D userspace library did support OpenVG once. Didn’t gain
traction, got canned. Just because it calls itself a standard doesn’t make it a
widely adopted industry default. Unlike OpenGL/Vulkan/DirectX on the 3D side.&lt;/p&gt;

&lt;p&gt;Thanks to Dave Airlie and Daniel Stone for reading and commenting on drafts of this
text.&lt;/p&gt;
</description>
        <pubDate>Mon, 27 Aug 2018 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2018/08/no-2d-in-drm.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2018/08/no-2d-in-drm.html</guid>
        
        <category>In-Depth Tech</category>
        
        
      </item>
    
  </channel>
</rss>
