Locking Engineering Principles

For various reasons I spent the last two years way too much looking at code with terrible locking design and trying to rectify it, instead of a lot more actual building cool things. Symptomatic that the last post here on my neglected blog is also a rant on lockdep abuse.

I tried to distill all the lessons learned into some training slides, and this two part is the writeup of the same. There are some GPU specific rules, but I think the key points should apply to at least apply to kernel drivers in general.

The first part here lays out some principles, the second part builds a locking engineering design pattern hierarchy from the most easiest to understand and maintain to the most nightmare inducing approaches.

Also with locking engineering I mean the general problem of protecting data structures against concurrent access by multiple threads and trying to ensure that each sufficiently consistent view of the data it reads and that the updates it commits won’t result in confusion. Of course it highly depends upon the precise requirements what exactly sufficiently consistent means, but figuring out these kind of questions is out of scope for this little series here.

Priorities in Locking Engineering

Designing a correct locking scheme is hard, validating that your code actually implements your design is harder, and then debugging when - not if! - you screwed up is even worse. Therefore the absolute most important rule in locking engineering, at least if you want to have any chance at winning this game, is to make the design as simple and dumb as possible.

1. Make it Dumb

Since this is the key principle the entire second part of this series will go through a lot of different locking design patterns, from the simplest and dumbest and easiest to understand, to the most hair-raising horrors of complexity and trickiness.

Meanwhile let’s continue to look at everything else that matters.

2. Make it Correct

Since simple doesn’t necessarily mean correct, especially when transferring a concept from design to code, we need guidelines. On the design front the most important one is to design for lockdep, and not fight it, for which I already wrote a full length rant. Here I will only go through the main lessons: Validating locking by hand against all the other locking designs and nesting rules the kernel has overall is nigh impossible, extremely slow, something only few people can do with any chance of success and hence in almost all cases a complete waste of time. We need tools to automate this, and in the Linux kernel this is lockdep.

Therefore if lockdep doesn’t understand your locking design your design is at fault, not lockdep. Adjust accordingly.

A corollary is that you actually need to teach lockdep your locking rules, because otherwise different drivers or subsystems will end up with defacto incompatible nesting and dependencies. Which, as long as you never exercise them on the same kernel boot-up, much less same machine, wont make lockdep grumpy. But it will make maintainers very much question why they are doing what they’re doing.

Hence at driver/subsystem/whatever load time, when CONFIG_LOCKDEP is enabled, take all key locks in the correct order. One example for this relevant to GPU drivers is in the dma-buf subsystem.

In the same spirit, at every entry point to your library or subsytem, or anything else big, validate that the callers hold up the locking contract with might_lock(), might_sleep(), might_alloc() and all the variants and more specific implementations of this. Note that there’s a huge overlap between locking contracts and calling context in general (like interrupt safety, or whether memory allocation is allowed to call into direct reclaim), and since all these functions compile away to nothing when debugging is disabled there’s really no cost in sprinkling them around very liberally.

On the implementation and coding side there’s a few rules of thumb to follow:

Never invent your own locking primitives, you’ll get them wrong, or at least build something that’s slow. The kernel’s locks are built and tuned by people who’ve done nothing else their entire career, you wont beat them except in bug count, and that by a lot.
The same holds for synchronization primitives - don’t build your own with a struct wait_queue_head, or worse, hand-roll your own wait queue. Instead use the most specific existing function that provides the synchronization you need, e.g. flush_work() or flush_workqueue() and the enormous pile of variants available for synchronizing against scheduled work items.

A key reason here is that very often these more specific functions already come with elaborate lockdep annotations, whereas anything hand-roll tends to require much more manual design validation.
Finally at the intersection of “make it dumb” and “make it correct”, pick the simplest lock that works, like a normal mutex instead of an read-write semaphore. This is because in general, stricter rules catch bugs and design issues quicker, hence picking a very fancy “anything goes” locking primitives is a bad choice.

As another example pick spinlocks over mutexes because spinlocks are a lot more strict in what code they allow in their critical section. Hence much less risk you put something silly in there by accident and close a dependency loop that could lead to a deadlock.

3. Make it Fast

Speed doesn’t matter if you don’t understand the design anymore in the future, you need simplicity first.

Speed doesn’t matter if all you’re doing is crashing faster. You need correctness before speed.

Finally speed doesn’t matter where users don’t notice it. If you micro-optimize a path that doesn’t even show up in real world workloads users care about, all you’ve done is wasted time and committed to future maintenance pain for no gain at all.

Similarly optimizing code paths which should never be run when you instead improve your design are not worth it. This holds especially for GPU drivers, where the real application interfaces are OpenGL, Vulkan or similar, and there’s an entire driver in the userspace side - the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts.

The big example here is GPU address patch list processing at command submission time, which was necessary for old hardware that completely lacked any useful concept of a per process virtual address space. But that has changed, which means virtual addresses can stay constant, while the kernel can still freely manage the physical memory by manipulating pagetables, like on the CPU. Unfortunately one driver in the DRM subsystem instead spent an easy engineer decade of effort to tune relocations, write lots of testcases for the resulting corner cases in the multi-level fastpath fallbacks, and even more time handling the impressive amounts of fallout in the form of bugs and future headaches due to the resulting unmaintainable code complexity …

In other subsystems where the kernel ABI is the actual application contract these kind of design simplifications might instead need to be handled between the subsystem’s code and driver implementations. This is what we’ve done when moving from the old kernel modesetting infrastructure to atomic modesetting. But sometimes no clever tricks at all help and you only get true speed with a radically revamped uAPI - io_uring is a great example here.

Protect Data, not Code

A common pitfall is to design locking by looking at the code, perhaps just sprinkling locking calls over it until it feels like it’s good enough. The right approach is to design locking for the data structures, which means specifying for each structure or member field how it is protected against concurrent changes, and how the necessary amount of consistency is maintained across the entire data structure with rules that stay invariant, irrespective of how code operates on the data. Then roll it out consistently to all the functions, because the code-first approach tends to have a lot of issues:

A code centric approach to locking often leads to locking rules changing over the lifetime of an object, e.g. with different rules for a structure or member field depending upon whether an object is in active use, maybe just cached or undergoing reclaim. This is hard to teach to lockdep, especially when the nesting rules change for different states. Lockdep assumes that the locking rules are completely invariant over the lifetime of the entire kernel, not just over the lifetime of an individual object or structure even.

Starting from the data structures on the other hand encourages that locking rules stay the same for a structure or member field.
Locking design that changes depending upon the code that can touch the data would need either complicated documentation entirely separate from the code - so high risk of becoming stale. Or the explanations, if there are any are sprinkled over the various functions, which means reviewers need to reacquire the entire relevant chunks of the code base again to make sure they don’t miss an odd corner cases.

With data structure driven locking design there’s a perfect, because unique place to document the rules - in the kerneldoc of each structure or member field.
A consequence for code review is that to recheck the locking design for a code first approach every function and flow has to be checked against all others, and changes need to be checked against all the existing code. If this is not done you might miss a corner cases where the locking falls apart with a race condition or could deadlock.

With a data first approach to locking changes can be reviewed incrementally against the invariant rules, which means review of especially big or complex subsystems actually scales.
When facing a locking bug it’s tempting to try and fix it just in the affected code. By repeating that often enough a locking scheme that protects data acquires code specific special cases. Therefore locking issues always need to be first mapped back to new or changed requirements on the data structures and how they are protected.

The big antipattern of how you end up with code centric locking is to protect an entire subsystem (or worse, a group of related subsystems) with a single huge lock. The canonical example was the big kernel lock BKL, that’s gone, but in many cases it’s just replaced by smaller, but still huge locks like console_lock().

This results in a lot of long term problems when trying to adjust the locking design later on:

Since the big lock protects everything, it’s often very hard to tell what it does not protect. Locking at the fringes tends to be inconsistent, and due to that its coverage tends to creep ever further when people try to fix bugs where a given structure is not consistently protected by the same lock.
Also often subsystems have different entry points, e.g. consoles can be reached through the console subsystem directly, through vt, tty subsystems and also through an enormous pile of driver specific interfaces with the fbcon IOCTLs as an example. Attempting to split the big lock into smaller per-structure locks pretty much guarantees that different entry points have to take the per-object locks in opposite order, which often can only be resolved through a large-scale rewrite of all impacted subsystems.

Worse, as long as the big subsystem lock continues to be in use no one is spotting these design issues in the code flow. Hence they will slowly get worse instead of the code moving towards a better structure.

For these reasons big subsystem locks tend to live way past their justified usefulness until code maintenance becomes nigh impossible: Because no individual bugfix is worth the task to really rectify the design, but each bugfix tends to make the situation worse.

From Principles to Practice

Stay tuned for next week’s installment, which will cover what these principles mean when applying to practice: Going through a large pile of locking design patterns from the most desirable to the most hair raising complex.