<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>stuff by sima</title>
    <description>Simona Vetter&apos;s ramblings on code, bugs, graphics, hw and the utter lack of sanity in all this.
</description>
    <link>http://blog.ffwll.ch/</link>
    <atom:link href="http://blog.ffwll.ch/feeds/posts/default" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 07 Dec 2025 19:29:28 +0000</pubDate>
    <lastBuildDate>Sun, 07 Dec 2025 19:29:28 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Upstream, Why &amp; How</title>
        <description>&lt;p&gt;In a different epoch, before the pandemic, I’ve done a &lt;a href=&quot;/2019/05/upstream-first.html&quot;&gt;presentation about
upstream first&lt;/a&gt; at the Siemens Linux Community
Event 2018, where I’ve tried to explain the fundamentals of open source using
microeconomics. Unfortunately that talk didn’t work out too well with an
audience that isn’t well-versed in upstream and open source concepts, largely
because it was just too much material crammed into too little time.&lt;/p&gt;

&lt;p&gt;Last year I got the opportunity to try again at an Intel-internal event series,
and this time I split the material into two parts. I think that worked a lot
better. For obvious reasons I cannot publish the recordings, but I can publish
the slides.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/slides/intel-gdansk-2023.pdf&quot;&gt;first part “Upstream, Why?”&lt;/a&gt; covers a
few concepts from microeconomics 101, and then applies them to upstream
open source. The key concept is that open source achieves an
efficient software market in the microeconomic sense by driving margins and
prices to zero. The only ways to make money in such a market are to either
have more-or-less stable barriers to entry that prevent the efficient market
from forming and destroying all monetary value, or to sell a complementary
product.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/slides/intel-gdansk-2023-part2.pdf&quot;&gt;second part “Upstream, How?”&lt;/a&gt; then
looks at what this all means for the different stakeholders involved:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Individual engineers, who have skills and create a product with zero economic
value, and might still be stupid enough to try to build a career on that.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Upstream communities, often with a formal structure as a foundation, and what
exactly their goals should be to build a thriving upstream open source project
that can actually pay some bills, generate some revenue somewhere else and get
engineers paid. Because without that you’re not going to have much of a
project with a long-term future.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Engineering organizations, what exactly their incentives and goals should
be, and the fundamental conflicts of interest this causes. Specifically on
this I’ve only seen bad solutions and ugly solutions, but not yet a really
good one. A relevant pre-pandemic talk of mine on this topic is also
&lt;a href=&quot;/2019/12/upstream-too-little-too-late.html&quot;&gt;“Upstream Graphics: Too Little, Too
Late”&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;And finally the overall business, and more importantly, what kind of business
strategy is needed to really thrive with an open source upstream first
approach: You need to clearly understand which software market’s economic
value you want to destroy by driving margins and prices to zero, and which
complementary product you’re selling to still earn money.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At least judging by the feedback I’ve received internally, taking more time and
going a bit more in-depth on the various concepts worked much better than the
keynote presentation I did at Siemens, hence I decided to publish at
least the slides.&lt;/p&gt;
</description>
        <pubDate>Thu, 14 Mar 2024 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2024/03/upstream-why-how.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2024/03/upstream-why-how.html</guid>
        
        <category>Maintainer-Stuff</category>
        
        
      </item>
    
      <item>
        <title>EOSS Prague: Kernel Locking Engineering</title>
        <description>&lt;p&gt;EOSS in Prague was great, lots of hallway track, good talks, good food,
&lt;a href=&quot;https://www.meetea.cz/&quot;&gt;excellent tea at meetea&lt;/a&gt; - first time I had proper tea
in my life, quite an experience. And also my first talk since covid, packed room
with a standing audience, &lt;a href=&quot;https://events.linuxfoundation.org/wp-content/uploads/2023/07/EOSS-23-PostEventReport_072023.pdf&quot;&gt;apparently one of the top ten most attended talks per
LF’s conference
report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.youtube.com/watch?v=LPH3MUw9m0o&quot;&gt;video recording is now
uploaded&lt;/a&gt;, I’ve uploaded &lt;a href=&quot;/slides/elce-2023-locking.pdf&quot;&gt;the fixed
slides&lt;/a&gt;, including the missing slide that I
accidentally cut in a last-minute edit. It’s the same content as my blog posts
from last year, first talking about &lt;a href=&quot;/2022/07/locking-engineering.html&quot;&gt;locking engineering
principles&lt;/a&gt; and then &lt;a href=&quot;/2022/08/locking-hierarchy.html&quot;&gt;the hierarchy of
locking engineering patterns&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2023/07/eoss-prague-locking-engineering.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2023/07/eoss-prague-locking-engineering.html</guid>
        
        <category>Conferences</category>
        
        
      </item>
    
      <item>
        <title>Locking Engineering Hierarchy</title>
        <description>&lt;p&gt;The first part of this series covered &lt;a href=&quot;/2022/07/locking-engineering.html&quot;&gt;principles of locking
engineering&lt;/a&gt;. This part goes through a pile
of locking patterns and designs, from the most favourable and easiest to adjust,
and hence resulting in a long term maintainable code base, to the least
favourable, since it is hardest to ensure they work correctly and stay that way
while the code evolves. For convenience they are even color coded, with the
dangerous levels getting progressively more crispy red, indicating how close to
the burning fire you are! Think of it as Dante’s Inferno, but for locking.&lt;/p&gt;

&lt;p&gt;As a reminder from the intro of the first part, by locking engineering I mean
the art of ensuring that there’s sufficient consistency in reading and
manipulating data structures, and not just sprinkling &lt;code&gt;mutex_lock()&lt;/code&gt;
and &lt;code&gt;mutex_unlock()&lt;/code&gt; calls around until the result looks reasonable
and lockdep has gone quiet.
&lt;!--more--&gt;&lt;/p&gt;

&lt;h2 id=&quot;level-0-no-locking&quot;&gt;Level 0: No Locking&lt;/h2&gt;

&lt;p&gt;The dumbest possible locking is no need for locking at all. Which does not mean
extremely clever lockless tricks for a “look, no calls to
&lt;code&gt;mutex_lock()&lt;/code&gt;” feint, but an overall design which guarantees that
any writers cannot exist concurrently with any other access at all. This
removes the need for consistency guarantees while accessing an object at the
architectural level.&lt;/p&gt;

&lt;p&gt;There are a few standard patterns to achieve locking nirvana.&lt;/p&gt;

&lt;h3 id=&quot;locking-pattern-immutable-state&quot;&gt;Locking Pattern: Immutable State&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;The&lt;/em&gt; lesson in graphics API design over the last decade is that immutable state
objects rule, because they both lead to simpler driver stacks and also better
performance. Vulkan instead of OpenGL with its ridiculous amount of
mutable and implicit state is the big example, but atomic instead of legacy
kernel mode setting or Wayland instead of X11 are also built on the
assumption that immutable state objects are a Great Thing (tm).&lt;/p&gt;

&lt;p&gt;The usual pattern is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;A single thread fully constructs an object, including any sub structures and
anything else you might need. Often subsystems provide initialization helpers for
objects that drivers can subclass through embedding, e.g.
&lt;code&gt;drm_connector_init()&lt;/code&gt; for initializing a kernel modesetting output
object. Additional functions can set up different or optional aspects of an
object, e.g. &lt;code&gt;drm_connector_attach_encoder()&lt;/code&gt; sets up the invariant
links to the preceding element in a kernel modesetting display chain.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The fully formed object is published to the world, in the kernel this often
happens by registering it under some kind of identifier. This could be a global
identifier like &lt;code&gt;register_chrdev()&lt;/code&gt; for character devices, something attached to a device like
registering a new display output on a driver with
&lt;code&gt;drm_connector_register()&lt;/code&gt; or some &lt;code&gt;struct xarray&lt;/code&gt; in the
file private structure. Note that this step here requires memory barriers of
some sort. If you hand roll the data structure like a list or lookup
tree with your own fancy locking scheme instead of using existing standard
interfaces you are on a fast path to level 3 locking hell. Don’t do that.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;From this point on there are no consistency issues anymore and all threads
can access the object without any locking.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
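
&lt;p&gt;As an illustrative sketch of these three steps, with hypothetical driver
structures and not any real driver’s code:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;driver_output_init(dev)
{
	/* step 1: single-threaded construction, no locking needed */
	connector = kzalloc(sizeof(*connector), GFP_KERNEL);
	drm_connector_init(dev, &amp;amp;connector-&amp;gt;base, &amp;amp;funcs, type);
	drm_connector_attach_encoder(&amp;amp;connector-&amp;gt;base, encoder);

	/* step 2: publishing provides the needed memory barriers */
	drm_connector_register(&amp;amp;connector-&amp;gt;base);

	/* step 3: everyone can access the object, without any locking */
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;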

&lt;h3 id=&quot;locking-pattern-single-owner&quot;&gt;Locking Pattern: Single Owner&lt;/h3&gt;

&lt;p&gt;Another way to ensure there’s no concurrent access is by only allowing one
thread to own an object at a given point in time, with well-defined handover
points where necessary.&lt;/p&gt;

&lt;p&gt;Most often this pattern is used for asynchronously processing a userspace
request:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The syscall or IOCTL constructs an object with sufficient information to
process the userspace’s request.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;That object is handed over to a worker thread with e.g.
&lt;code&gt;queue_work()&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The worker thread is now the sole owner of that piece of memory and can do
whatever it feels like with it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Again the second step requires memory barriers, which means if you hand roll
your own lockless queue you’re firmly in level 3 territory and won’t get rid of
the burned in red hot afterglow in your retina for quite some time. Use standard
interfaces like &lt;code&gt;struct completion&lt;/code&gt; or even better libraries like the
workqueue subsystem here.&lt;/p&gt;
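
&lt;p&gt;A minimal sketch of such a handover, with hypothetical names, the only real
interfaces used being &lt;code&gt;INIT_WORK()&lt;/code&gt; and &lt;code&gt;queue_work()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ioctl_submit(file, args)
{
	req = kzalloc(sizeof(*req), GFP_KERNEL);
	/* fill in the request, req is still invisible to other threads */
	INIT_WORK(&amp;amp;req-&amp;gt;work, process_fn);

	/* handover, queue_work() provides the needed memory barriers */
	queue_work(wq, &amp;amp;req-&amp;gt;work);
}

process_fn(work)
{
	req = container_of(work, struct req, work);
	/* sole owner of req now, no locking needed, free when done */
	kfree(req);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;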

&lt;p&gt;Note that the handover can also be chained or split up, e.g. for nonblocking
atomic kernel modeset requests there are three asynchronous processing pieces
involved:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The main worker, which pushes the display state update to the hardware and
which is enqueued with &lt;code&gt;queue_work()&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The userspace completion event handling built around &lt;code&gt;struct
drm_pending_event&lt;/code&gt; and generally handed off to the interrupt handler of
the driver from the main worker and processed in the interrupt handler.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The cleanup of the no longer used old scanout buffers from the preceding
update. The synchronization between the preceding update and the cleanup is
done through &lt;code&gt;struct completion&lt;/code&gt; to ensure that there’s only ever a
single worker which owns a state structure and is allowed to change it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;locking-pattern-reference-counting&quot;&gt;Locking Pattern: Reference Counting&lt;/h3&gt;

&lt;p&gt;Users generally don’t appreciate if the kernel leaks memory too much, and
cleaning up objects by freeing their memory and releasing any other resources
tends to be an operation of the very much mutable kind. Reference counting to
the rescue!&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Every pointer to the reference counted object must guarantee that a reference
exists for as long as the pointer is in use. Usually that’s done by calling
&lt;code&gt;kref_get()&lt;/code&gt; when making a copy of the pointer, but implied
references by e.g. continuing to hold a lock that protects a different pointer
are often good enough too for a temporary pointer.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The cleanup code runs when the last reference is released with
&lt;code&gt;kref_put()&lt;/code&gt;. Note that this again requires memory barriers to work
correctly, which means if you’re not using &lt;code&gt;struct kref&lt;/code&gt; then it’s
safe to assume you’ve screwed up.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
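
&lt;p&gt;A sketch of the pattern with &lt;code&gt;struct kref&lt;/code&gt;, for a hypothetical
object:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;obj_get(obj)
{
	/* a reference must already be held when copying the pointer */
	kref_get(&amp;amp;obj-&amp;gt;kref);
	return obj;
}

obj_put(obj)
{
	/* the release function runs exactly once, after the last put,
	 * with the necessary memory barriers provided by struct kref */
	kref_put(&amp;amp;obj-&amp;gt;kref, obj_release);
}

obj_release(kref)
{
	obj = container_of(kref, struct obj, kref);
	/* no concurrent access possible anymore, cleanup without locks */
	kfree(obj);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;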

&lt;p&gt;Note that this scheme falls apart when released objects are put into some kind
of cache and can be resurrected. In that case your cleanup code needs to somehow
deal with these zombies and ensure there’s no confusion, and vice versa any code
that resurrects a zombie needs to deal with the wooden spikes the cleanup code might
throw at an inopportune time. The worst example of this kind is
&lt;code&gt;SLAB_TYPESAFE_BY_RCU&lt;/code&gt;, where readers that are only protected with
&lt;code&gt;rcu_read_lock()&lt;/code&gt; may need to deal with objects potentially going
through simultaneous zombie resurrections, potentially multiple times, while
the readers are trying to figure out what is going on. This generally leads to
lots of sorrow, wailing and ill-tempered maintainers, as the GPU subsystem
has experienced, and continues to experience, with &lt;code&gt;struct dma_fence&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Hence use standard reference counting, and don’t be tempted by the siren song
of implementing clever caching of any kind.&lt;/p&gt;

&lt;h2 id=&quot;level-1-big-dumb-lock&quot;&gt;Level 1: Big Dumb Lock&lt;/h2&gt;

&lt;p&gt;It would be great if nothing ever changes, but sometimes that cannot be avoided.
At that point you add a single lock for each logical object. An object could be
just a single structure, but it could also be multiple structures that are
dynamically allocated and freed under the protection of that single big dumb
lock, e.g. when managing GPU virtual address space with different mappings.&lt;/p&gt;
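
&lt;p&gt;Sketched for the GPU virtual address space example, with hypothetical names,
the single lock covers the entire logical object:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vm_map_range(vm, va, size)
{
	vma = kzalloc(sizeof(*vma), GFP_KERNEL);

	mutex_lock(&amp;amp;vm-&amp;gt;lock);
	/* the one big dumb lock protects the tree and all the
	 * dynamically allocated mappings hanging off it */
	insert_mapping(&amp;amp;vm-&amp;gt;va_tree, vma, va, size);
	mutex_unlock(&amp;amp;vm-&amp;gt;lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;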

&lt;p&gt;The tricky part is figuring out what is an object to ensure that your lock is
neither too big nor too small:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;If you make your lock too big you run the risk of creating a dreaded subsystem
lock, or violating the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;“Protect Data, not
Code”&lt;/a&gt; principle in
some other way. Split your locking further so that a single lock really only
protects a single object, and not a random collection of unrelated ones. So
one lock per device instance, not one lock for all the device instances in a
driver or worse in an entire subsystem.&lt;/p&gt;

    &lt;p&gt;The trouble is that once a lock is too big and has firmly moved into “protects
some vague collection of code” territory, it’s very hard to get out of that
hole.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Different problems strike when the locking scheme is too fine-grained, e.g. in
the GPU virtual memory management example when every address mapping in the
big vma tree has its own private lock. Or when a structure has a lot of
different locks for different member fields.&lt;/p&gt;

    &lt;p&gt;One issue is that locks aren’t free, the overhead of fine-grained locking can
seriously hurt, especially when common operations have to take most of the
locks anyway and so there’s no chance of any concurrency benefit. Furthermore
fine-grained locking leads to the temptation of solving locking overhead with
ever more clever lockless tricks, instead of radically simplifying the
design.&lt;/p&gt;

    &lt;p&gt;The other issue is that more locks improve the odds for locking inversions,
and those can be tough nuts to crack. Again trying to solve this with more
lockless tricks to avoid inversions is tempting, and again in most cases the
wrong approach.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ideally, your big dumb lock would always be right-sized every time the
requirements on the data structures change. But working magic 8 balls tend to be
in short supply, and you tend to only find out that your guess was wrong when
the pain of the lock being too big or too small is already substantial. The
inherent struggles of resizing a lock as the code evolves then keep pushing you
further away from the optimum instead of closer. Good luck!&lt;/p&gt;

&lt;h2 style=&quot;background:yellow;&quot;&gt; Level 2: Fine-grained Locking&lt;/h2&gt;

&lt;p&gt;It would be great if this is all the locking we ever need, but sometimes there
are functional reasons that force us to go beyond the single lock for each logical
object approach. This section will go through a few of the common examples, and
the usual pitfalls to avoid.&lt;/p&gt;

&lt;p&gt;But before we delve into details, remember to document the locking rules in
kerneldoc with the inline per-member comment style once you go beyond a simple
single lock per object approach. It’s the best place for future bug fixers and
reviewers - meaning you - to find the rules for how things were at least meant
to work.&lt;/p&gt;
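
&lt;p&gt;A sketch of what that looks like, for a hypothetical structure:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;struct foo_device {
	/**
	 * @lock: Protects @state and @mapping_list, but not
	 * @irq_state, which is only accessed by the interrupt handler.
	 */
	struct mutex lock;

	/** @state: Hardware state mirror, protected by @lock. */
	struct foo_state state;
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;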

&lt;h3 id=&quot;locking-pattern-object-tracking-lists&quot;&gt;Locking Pattern: Object Tracking Lists&lt;/h3&gt;

&lt;p&gt;One of the main duties of the kernel is to track everything, not least to make sure
there are no leaks and everything gets cleaned up again. But there are other reasons
to maintain lists (or other container structures) of objects.&lt;/p&gt;

&lt;p&gt;Now sometimes there’s a clear parent object, with its own lock, which could also
protect the list with all the objects, but this does not always work:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;It might force the lock of the parent object to essentially become a subsystem
lock and so protect much more than it should when following the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;“Protect Data, not
Code”&lt;/a&gt; principle. In
that case it’s better to have a separate (spin-)lock just for the list to be
able to clearly untangle what the parent and subordinate object’s lock each
protect.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Different code paths might need to walk and possibly manipulate the list both
from the container object and contained object, which would lead to locking
inversion if the list isn’t protected by its own stand-alone (nested) lock.
This tends to especially happen when an object can be attached to multiple
other objects, like a GPU buffer object can be mapped into multiple GPU
virtual address spaces of different processes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The constraints of calling contexts for adding or removing objects from the
list could be different from, and incompatible with, the requirements when walking
the list itself. The main example here are LRU lists, where the shrinker needs
to be able to walk the list from reclaim context, whereas the superior object
locks often have a need to allocate memory while holding each lock. Those
object locks the shrinker can then only trylock, which is generally good
enough, but only being able to trylock the LRU list lock itself is not.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simplicity should still win, therefore only add a (nested) lock for lists or
other container objects if there’s really no suitable object lock that could do
the job instead.&lt;/p&gt;
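
&lt;p&gt;A sketch of such a separate list lock, using the GPU buffer object example
with hypothetical names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vm_map_bo(vm, bo, vma)
{
	/* the list has its own spinlock, nested within the object locks
	 * the callers hold, so walking and manipulating the list works
	 * from both the vm side and the bo side without inversion */
	spin_lock(&amp;amp;bo-&amp;gt;vma_list_lock);
	list_add(&amp;amp;vma-&amp;gt;bo_link, &amp;amp;bo-&amp;gt;vma_list);
	spin_unlock(&amp;amp;bo-&amp;gt;vma_list_lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;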

&lt;h3 id=&quot;locking-pattern-interrupt-handler-state&quot;&gt;Locking Pattern: Interrupt Handler State&lt;/h3&gt;

&lt;p&gt;Another example that requires nested locking is when part of the object is
manipulated from a different execution context. The prime example here are
interrupt handlers. Interrupt handlers can only use interrupt safe spinlocks,
but often the main object lock must be a mutex to allow sleeping or allocating
memory or nesting with other mutexes.&lt;/p&gt;

&lt;p&gt;Hence the need for a nested spinlock to just protect the object state shared
between the interrupt handler and code running from process context. Process
context should generally only acquire the spinlock nested with the main object
lock, to avoid surprises and limit any concurrency issues to just the singleton
interrupt handler.&lt;/p&gt;
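
&lt;p&gt;A sketch of this nesting, with hypothetical names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;foo_irq_handler(irq, data)
{
	foo = data;

	/* only the state shared with process context is under irq_lock */
	spin_lock(&amp;amp;foo-&amp;gt;irq_lock);
	foo-&amp;gt;irq_status = read_hw_status(foo);
	spin_unlock(&amp;amp;foo-&amp;gt;irq_lock);
}

foo_update(foo)
{
	/* main object lock first, it is a mutex and can sleep */
	mutex_lock(&amp;amp;foo-&amp;gt;lock);

	/* nested interrupt safe spinlock just for the shared state */
	spin_lock_irq(&amp;amp;foo-&amp;gt;irq_lock);
	/* ... manipulate the state shared with the irq handler ... */
	spin_unlock_irq(&amp;amp;foo-&amp;gt;irq_lock);

	mutex_unlock(&amp;amp;foo-&amp;gt;lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;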

&lt;h3 id=&quot;locking-pattern-async-processing&quot;&gt;Locking Pattern: Async Processing&lt;/h3&gt;

&lt;p&gt;Very similar to the interrupt handler problems is coordination with async
workers. The best approach is the &lt;a href=&quot;#locking-pattern-single-owner&quot;&gt;single owner
pattern&lt;/a&gt;, but often state needs to be shared
between the worker and other threads operating on the same object.&lt;/p&gt;

&lt;p&gt;The naive approach of just using a single object lock tends to deadlock:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;start_processing(obj)
{
	mutex_lock(&amp;amp;obj-&amp;gt;lock);
	/* set up the data for the async work */;
	schedule_work(&amp;amp;obj-&amp;gt;work);
	mutex_unlock(&amp;amp;obj-&amp;gt;lock);
}

stop_processing(obj)
{
	mutex_lock(&amp;amp;obj-&amp;gt;lock);
	/* clear the data for the async work */;
	cancel_work_sync(&amp;amp;obj-&amp;gt;work);
	mutex_unlock(&amp;amp;obj-&amp;gt;lock);
}

work_fn(work)
{
	obj = container_of(work, struct obj, work);

	mutex_lock(&amp;amp;obj-&amp;gt;lock);
	/* do some processing */
	mutex_unlock(&amp;amp;obj-&amp;gt;lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Do not worry if you don’t spot the deadlock, because it is a cross-release
dependency between the entire &lt;code&gt;work_fn()&lt;/code&gt; and
&lt;code&gt;cancel_work_sync()&lt;/code&gt;, and these are a lot trickier to spot. Since
cross-release dependencies are an entire huge topic on their own I won’t go into
more details, a good starting point is &lt;a href=&quot;https://lwn.net/Articles/709849/&quot;&gt;this LWN
article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There’s a bunch of variations of this theme, with problems in different
scenarios:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Replacing the &lt;code&gt;cancel_work_sync()&lt;/code&gt; with &lt;code&gt;cancel_work()&lt;/code&gt;
avoids the deadlock, but often means the &lt;code&gt;work_fn()&lt;/code&gt; is prone to
use-after-free issues.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Calling &lt;code&gt;cancel_work_sync()&lt;/code&gt; before taking the mutex can work in
some cases, but falls apart when the work is self-rearming. Or maybe the
work or overall object isn’t guaranteed to exist without holding its lock,
e.g. if this is part of an async processing queue for a parent structure.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Cancelling the work after the call to &lt;code&gt;mutex_unlock()&lt;/code&gt; might race
with concurrent restarting of the work and upset the bookkeeping.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Like with interrupt handlers the clean solution tends to be an additional nested
lock which protects just the mutable state shared with the work function and
nests within the main object lock. That way work can be cancelled while the main
object lock is held, which avoids a ton of races. But without holding the
sublock that &lt;code&gt;work_fn()&lt;/code&gt; needs, which avoids the deadlock.&lt;/p&gt;
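
&lt;p&gt;Applied to the earlier example this could look like the following sketch,
with a new &lt;code&gt;work_lock&lt;/code&gt; spinlock nested within &lt;code&gt;obj-&amp;gt;lock&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;stop_processing(obj)
{
	mutex_lock(&amp;amp;obj-&amp;gt;lock);

	spin_lock(&amp;amp;obj-&amp;gt;work_lock);
	/* clear the data for the async work */;
	spin_unlock(&amp;amp;obj-&amp;gt;work_lock);

	/* no deadlock: work_fn() only takes work_lock, never obj-&amp;gt;lock */
	cancel_work_sync(&amp;amp;obj-&amp;gt;work);

	mutex_unlock(&amp;amp;obj-&amp;gt;lock);
}

work_fn(work)
{
	obj = container_of(work, struct obj, work);

	spin_lock(&amp;amp;obj-&amp;gt;work_lock);
	/* only touch the state shared with the other threads */
	spin_unlock(&amp;amp;obj-&amp;gt;work_lock);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;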

&lt;p&gt;Note that in some cases the superior lock doesn’t need to exist, e.g.
&lt;code&gt;struct drm_connector_state&lt;/code&gt; is protected by the &lt;a href=&quot;#locking-pattern-single-owner&quot;&gt;single
owner pattern&lt;/a&gt;, but drivers might have some need
for some further decoupled asynchronous processing, e.g. for handling the
content protection or link training machinery. In that case only the sublock for
the mutable driver private state shared with the worker exists.&lt;/p&gt;

&lt;h3 id=&quot;locking-pattern-weak-references&quot;&gt;Locking Pattern: Weak References&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;#locking-pattern-reference-counting&quot;&gt;Reference counting&lt;/a&gt; is a great pattern, but
sometimes you need to be able to store pointers without them holding a full
reference. This could be for lookup caches, or because your userspace API
mandates that some references do not keep the object alive - we’ve unfortunately
committed that mistake in the GPU world. Or because holding full references
everywhere would lead to unreclaimable reference loops and there’s no better
way to break them than to make some of the references weak. In languages with a
garbage collector weak references are implemented by the runtime, and so are no
real worry. But in the kernel the concept has to be implemented by hand.&lt;/p&gt;

&lt;p&gt;Since weak references are such a standard pattern &lt;code&gt;struct kref&lt;/code&gt; has
ready-made support for them. The simple approach is using
&lt;code&gt;kref_put_mutex()&lt;/code&gt; with the same lock that also protects the
structure containing the weak reference. This guarantees that either the weak
reference pointer is gone too, or there is at least somewhere still a strong
reference around and it is therefore safe to call &lt;code&gt;kref_get()&lt;/code&gt;. But
there are some issues with this approach:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;It doesn’t compose to multiple weak references, at least if they are protected
by different locks - all the locks need to be taken before the final
&lt;code&gt;kref_put()&lt;/code&gt; is called, which means minimally some pain with lock
nesting and you get to hand-roll it all to boot.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The mutex required to be held during the final put is the one which protects
the structure with the weak reference, and often has little to do with the
object that’s being destroyed. So a pretty nasty violation of the &lt;a href=&quot;#level-1-big-dumb-lock&quot;&gt;big dumb
lock pattern&lt;/a&gt;. Furthermore the lock is held
over the entire cleanup function, which defeats the point of the &lt;a href=&quot;#locking-pattern-reference-counting&quot;&gt;reference
counting pattern&lt;/a&gt;, which is meant to
enable “no locking” cleanup code. It becomes very tempting to stuff random
other pieces of code under the protection of this lock, making it a sprawling
mess and violating the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;principle to protect data, not
code&lt;/a&gt;: The
lock held during the entire cleanup operation is protecting against that
cleanup code doing things, and not anymore a specific data structure.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The much better approach is using &lt;code&gt;kref_get_unless_zero()&lt;/code&gt;, together
with a spinlock for your data structure containing the weak reference. This
looks especially nifty in combination with &lt;code&gt;struct xarray&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;obj_find_in_cache(id)
{
	xa_lock();
	obj = xa_find(id);
	if (obj &amp;amp;&amp;amp; !kref_get_unless_zero(&amp;amp;obj-&amp;gt;kref))
		obj = NULL;
	xa_unlock();

	return obj;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this all the issues are resolved:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Arbitrary amounts of weak references in any kind of structures protected by
their own spinlock can be added, without causing dependencies between them.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In the object’s cleanup function the same spinlock only needs to be held right
around when the weak references are removed from the lookup structure. The
lock critical section is no longer needlessly enlarged, we’re back to
protecting data instead of code.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
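
&lt;p&gt;The matching cleanup side, as a sketch in the same simplified style as the
lookup example above:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;obj_release(kref)
{
	obj = container_of(kref, struct obj, kref);

	/* the spinlock is only held right around removing the weak
	 * reference, the rest of the cleanup runs without any locks */
	xa_lock();
	xa_erase(obj-&amp;gt;id);
	xa_unlock();

	kfree(obj);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;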

&lt;p&gt;With both together the locking no longer leaks beyond the lookup structure
and its associated code, unlike with &lt;code&gt;kref_put_mutex()&lt;/code&gt; and
similar approaches. Thankfully &lt;code&gt;kref_get_unless_zero()&lt;/code&gt; has become
the much more popular approach since it was added 10 years ago!&lt;/p&gt;

&lt;h2 id=&quot;locking-antipattern-confusing-object-lifetime-and-data-consistency&quot;&gt;Locking Antipattern: Confusing Object Lifetime and Data Consistency&lt;/h2&gt;

&lt;p&gt;We’ve now seen a few examples where the &lt;a href=&quot;#level-0-no-locking&quot;&gt;“no locking” patterns from level
0&lt;/a&gt; collide in annoying ways when more locking is added, to
the point where we seem to violate the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;principle to protect data, not
code&lt;/a&gt;. It’s worth
looking at this a bit closer, since we can generalize what’s going on here to a
fairly high-level antipattern.&lt;/p&gt;

&lt;p&gt;The key insight is that the “no locking” patterns all rely on memory barrier
primitives in disguise, not classic locks, to synchronize access between
multiple threads. In the case of the &lt;a href=&quot;#locking-pattern-single-owner&quot;&gt;single owner
pattern&lt;/a&gt; there might also be blocking semantics
involved, when the next owner needs to wait for the previous owner to finish
processing first. These are functions like &lt;code&gt;flush_work()&lt;/code&gt; or the
various wait functions like &lt;code&gt;wait_event()&lt;/code&gt; or
&lt;code&gt;wait_for_completion()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Calling these barrier functions while holding locks commonly leads to issues:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Blocking functions like &lt;code&gt;flush_work()&lt;/code&gt; pull in every lock or other
dependency that the work we wait on, or more generally any of the previous owners
of an object, needed, as a so-called cross-release dependency. Unfortunately
lockdep does not understand these natively, and the usual tricks to add manual
annotations have severe limitations. There’s work ongoing to add
&lt;a href=&quot;https://lwn.net/Articles/709849/&quot;&gt;cross-release dependency tracking to
lockdep&lt;/a&gt;, but nothing looks anywhere near
ready to merge. Since these dependency chains can be really long and get ever
longer when more code is added to a worker - dependencies are pulled in even
if only a single lock is held at any given time - this can quickly become a
nightmare to untangle.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Often the requirement to hold a lock over these barrier type functions comes
from the fact that the object would otherwise disappear, or undergo some
serious confusion about its lifetime state - not just whether it’s still
alive or getting destroyed, but also who exactly owns it or whether it’s maybe
a resurrected zombie representing a different instance now. This encourages
the lock to morph from a “protects some specific data” to a “protects
specific code from running” design, leading to all the code maintenance issues
discussed in the &lt;a href=&quot;/2022/07/locking-engineering.html#protect-data-not-code&quot;&gt;protect data, not code
principle&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons try as hard as possible to not hold any locks, or as few as
feasible, when calling any of these memory-barriers-in-disguise functions used
to manage object lifetime or ownership in general. The antipattern here is
abusing locks to fix lifetime issues. We have seen two specific instances thus
far:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;code&gt;kref_put_mutex()&lt;/code&gt; instead of &lt;code&gt;kref_get_unless_zero()&lt;/code&gt; in
the &lt;a href=&quot;#locking-pattern-weak-reference&quot;&gt;weak reference pattern&lt;/a&gt;. This is a
special case of the &lt;a href=&quot;#locking-pattern-reference-counting&quot;&gt;reference counting
pattern&lt;/a&gt;, but with some finer-grained
locking added to support weak references.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Calling &lt;code&gt;flush_work()&lt;/code&gt; while holding locks in the &lt;a href=&quot;#locking-pattern-async-processing&quot;&gt;async
worker&lt;/a&gt;. This is a special case of the
&lt;a href=&quot;#locking-pattern-single-owner&quot;&gt;single owner pattern&lt;/a&gt;, again with a bit more
locking added to support some mutable state.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
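
&lt;p&gt;The lookup side of the preferred weak reference pattern mentioned in the first
instance can be sketched in kernel-style C like this - it’s a sketch only, not
compilable stand-alone, and &lt;code&gt;cache_find()&lt;/code&gt; plus the surrounding types
are invented for illustration:&lt;/p&gt;

```c
/* Weak reference lookup using kref_get_unless_zero(): the lookup lock
 * only protects the cache structure, not the object's lifetime.
 * Kernel-style sketch; cache_find() and the types are invented. */
struct obj *obj_lookup(struct obj_cache *cache, unsigned long id)
{
	struct obj *obj;

	spin_lock(&cache->lock);
	obj = cache_find(cache, id);
	/* The object might be on its way out already; only a successful
	 * kref_get_unless_zero() lets us use it past the lock. */
	if (obj && !kref_get_unless_zero(&obj->kref))
		obj = NULL;
	spin_unlock(&cache->lock);

	return obj;
}
```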

&lt;p&gt;We will see some more instances later, but the antipattern holds in general as
a source of trouble.&lt;/p&gt;

&lt;h2 style=&quot;background:orange;&quot;&gt; Level 2.5: Splitting Locks for Performance
Reasons&lt;/h2&gt;

&lt;p&gt;We’ve looked at a pile of functional reasons for complicating the locking
design, but sometimes you need to add more fine-grained locking for performance
reasons. This is already getting dangerous, because it’s very tempting to tune
some microbenchmark just because we can, or maybe delude ourselves that it will
be needed in the future. Therefore only complicate your locking if:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;You have actual real world benchmarks with workloads relevant to users that
show measurable gains outside of statistical noise.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You’ve fully exhausted architectural changes to outright avoid the overhead,
like io_uring pre-registering file descriptors locally to avoid manipulating
the file descriptor table.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You’ve fully exhausted algorithm improvements like batching up operations to
amortize locking overhead better.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only then should you guarantee yourself worse future maintenance pain by
applying trickier locking than the bare minimum necessary for correctness. Even
then, go with the simplest approach - often converting a lock to its read-write
variant is good enough.&lt;/p&gt;

&lt;p&gt;Sometimes this isn’t enough, and you actually have to split up a lock into more
fine-grained locks to achieve more parallelism and less contention among
threads. Note that doing so blindly will backfire because locks are not free.
When common operations still have to take most of the locks anyway, even if
only for a short time and in strict succession, the performance hit on single
threaded workloads will not justify any benefit in more threaded use-cases.&lt;/p&gt;

&lt;p&gt;Another issue with more fine-grained locking is that often you cannot define a
strict nesting hierarchy, or worse might need to take multiple locks of the same
object or lock class. I’ve written previously about this specific issue, and
more importantly, &lt;a href=&quot;/2020/08/lockdep-false-positives.html#fighting-lockdep-badly&quot;&gt;how to teach lockdep about lock nesting, the bad and the
good ways&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One really entertaining story from the GPU subsystem, for bystanders at least,
is that we really screwed this up for good by de facto allowing userspace to
control the lock order of all the objects involved in an IOCTL, while
furthermore requiring that disjoint operations proceed without contention. If
you ever manage to repeat this feat you can take a look at the &lt;a href=&quot;https://www.kernel.org/doc/html/latest/locking/ww-mutex-design.html&quot;&gt;wait-wound
mutexes&lt;/a&gt;.
Or if you just want some pretty graphs, &lt;a href=&quot;https://lwn.net/Articles/548909/&quot;&gt;LWN has an old article about wait-wound
mutexes too&lt;/a&gt;.&lt;/p&gt;

&lt;h2 style=&quot;background:red&quot;&gt; Level 3: Lockless Tricks&lt;/h2&gt;

&lt;p&gt;Do not go here wanderer!&lt;/p&gt;

&lt;p&gt;Seriously, I have seen a lot of very fancy driver subsystem locking designs, but I
have not yet found many that were actually justified. Because only real world,
non-contrived performance issues can ever justify reaching for this level, and
in almost all cases algorithmic or architectural fixes yield much better
improvements than any kind of (locking) micro-optimization could ever hope for.&lt;/p&gt;

&lt;p&gt;Hence this is just a long list of antipatterns, so that people who do not yet
have a grumpy expression permanently chiseled into their facial structure know when
they’re in trouble.&lt;/p&gt;

&lt;p&gt;Note that this section isn’t limited to lockless tricks in the academic sense of
guaranteed constant overhead forward progress, meaning no spinning or retrying
anywhere at all. It’s for everything which doesn’t use standard locks like
&lt;code&gt;struct mutex&lt;/code&gt;, &lt;code&gt;spinlock_t&lt;/code&gt;, &lt;code&gt;struct
rw_semaphore&lt;/code&gt;, or any of the others provided in the Linux kernel.&lt;/p&gt;

&lt;h3 id=&quot;locking-antipattern-using-rcu&quot;&gt;Locking Antipattern: Using RCU&lt;/h3&gt;

&lt;p&gt;Yeah RCU is really awesome and impressive, but it comes at serious costs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;By design, at least with standard usage, RCU elevates &lt;a href=&quot;#locking-antipattern-confusing-object-lifetime-and-data-consistency&quot;&gt;mixing up lifetime and
consistency
concerns&lt;/a&gt;
to a virtue. &lt;code&gt;rcu_read_lock()&lt;/code&gt; gives you both a read-side critical
section &lt;em&gt;and&lt;/em&gt; an extension of the lifetime of any RCU protected object. There’s
absolutely no way you can avoid that antipattern - it’s built in.&lt;/p&gt;

    &lt;p&gt;Worse, RCU read-side critical sections nest rather freely, which means unlike
with real locks abusing them to keep objects alive won’t run into nasty locking
inversion issues when you pull that stunt with nesting different objects or
classes of objects. Using locks to paper over lifetime issues is bad, but with
RCU it’s weapons-grade levels of dangerous.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Equally nasty, RCU practically forces you to deal with zombie objects, which
breaks the &lt;a href=&quot;#locking-pattern-reference-counting&quot;&gt;reference counting pattern&lt;/a&gt;
in annoying ways.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On top of all this, breaking out of RCU is costly and kinda defeats the point,
and hence there’s a huge temptation to delay this as long as possible: check
as many things and dereference as many pointers under RCU protection as
you can before you take a real lock or upgrade to a proper reference with
&lt;code&gt;kref_get_unless_zero()&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;Unless extreme restraint is applied this results in RCU leading you towards
locking antipatterns. Worse, RCU tends to spread them to ever more objects and
ever more fields within them.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Altogether, all that freely using RCU achieves is proving that there really is no
bottom on the code maintainability scale. It is not a great day when your driver
dies in &lt;code&gt;synchronize_rcu()&lt;/code&gt; and lockdep has no idea what’s going on,
and I’ve seen such days.&lt;/p&gt;

&lt;p&gt;Personally I think that in driver subsystems the most that’s still a legit and
justified use of RCU is object lookup with &lt;code&gt;struct xarray&lt;/code&gt; and
&lt;code&gt;kref_get_unless_zero()&lt;/code&gt;, and cleanup handled entirely by
&lt;code&gt;kfree_rcu()&lt;/code&gt;. Anything more and you’re very likely chasing a rabbit
down its hole and have not realized it yet.&lt;/p&gt;
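
&lt;p&gt;That one legit pattern can be sketched in kernel-style C like this - not
compilable stand-alone, and &lt;code&gt;struct thing&lt;/code&gt; plus the lookup function are
invented for illustration:&lt;/p&gt;

```c
/* Object lookup under RCU, immediately upgraded to a real reference
 * with kref_get_unless_zero(); cleanup is handled entirely by
 * kfree_rcu(), with no synchronize_rcu() anywhere in the driver. */

struct thing {
	struct kref kref;
	struct rcu_head rcu;
	/* ... */
};

struct thing *thing_lookup(struct xarray *xa, unsigned long id)
{
	struct thing *t;

	rcu_read_lock();
	t = xa_load(xa, id);
	/* t may be a zombie; only a successful kref_get_unless_zero()
	 * makes it safe to use outside the RCU read-side section. */
	if (t && !kref_get_unless_zero(&t->kref))
		t = NULL;
	rcu_read_unlock();

	return t;
}

static void thing_release(struct kref *kref)
{
	struct thing *t = container_of(kref, struct thing, kref);

	kfree_rcu(t, rcu);
}
```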

&lt;h3 id=&quot;locking-antipattern-atomics&quot;&gt;Locking Antipattern: Atomics&lt;/h3&gt;

&lt;p&gt;Firstly, Linux atomics have two annoying properties just to start:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Unlike e.g. C++ atomics in userspace they are unordered or weakly ordered
by default in a lot of cases. A lot of people are surprised by that, and then
have an even harder time understanding the memory barriers they need to
sprinkle over the code to make it work correctly.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Worse, many atomic functions neither operate on the atomic types
&lt;code&gt;atomic_t&lt;/code&gt; and &lt;code&gt;atomic64_t&lt;/code&gt; nor have &lt;code&gt;atomic&lt;/code&gt;
anywhere in their names, and so pose serious pitfalls to reviewers:&lt;/p&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;code&gt;READ_ONCE()&lt;/code&gt; and &lt;code&gt;WRITE_ONCE()&lt;/code&gt; for volatile loads and
stores.&lt;/li&gt;
      &lt;li&gt;&lt;code&gt;cmpxchg()&lt;/code&gt; and the various variants of atomic exchange with or
without a compare operation.&lt;/li&gt;
      &lt;li&gt;Bitops like &lt;code&gt;set_bit()&lt;/code&gt; are atomic. Worse, their
non-atomic variants carry double underscores, like &lt;code&gt;__set_bit()&lt;/code&gt;, to
scare you away from using them, despite these being the ones you really
want by default.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
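
&lt;p&gt;A few of these disguised atomics in code - a kernel-style sketch, where
&lt;code&gt;obj&lt;/code&gt; and its fields are invented for illustration:&lt;/p&gt;

```c
/* Atomics that don't look like atomics; obj and its fields are
 * invented for illustration. */

/* Volatile load and store - single-copy atomic, but unordered: */
busy = READ_ONCE(obj->busy);
WRITE_ONCE(obj->busy, 1);

/* Atomic compare-exchange on a plain unsigned long: */
old = cmpxchg(&obj->state, STATE_IDLE, STATE_BUSY);

/* set_bit() is the atomic variant; the double-underscore __set_bit()
 * is the plain store you usually want when the word is already
 * protected by a lock: */
set_bit(OBJ_DIRTY, &obj->flags);
__set_bit(OBJ_DIRTY, &obj->flags);
```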

&lt;p&gt;Those are a lot of unnecessary trap doors, but the real bad part is what people
tend to build with atomic instructions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;I’ve seen at least three different, incomplete and ill-defined
reimplementations of read-write semaphores without lockdep support. Reinventing
completions is also pretty popular. Worse, the folks involved didn’t realize
what they built. That’s an impressive violation of the &lt;a href=&quot;/2022/07/locking-engineering.html#2-make-it-correct&quot;&gt;“Make it Correct”
principle&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It seems very tempting to build terrible variations of the &lt;a href=&quot;#level-0-no-locking&quot;&gt;“no locking”
patterns&lt;/a&gt;. It’s very easy to screw them up by extending
them in a bad way; e.g. reference counting with weak references or RCU
optimizations done wrong very quickly leads to a complete mess. There are
reasons why you should never deviate from these patterns.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Statistics counters implemented with atomics look innocent, but almost always
there’s already a lock you could take instead of doing unordered counter updates.
Often this results in better code organization to boot, since the statistics for a
list and its manipulation are then closer together. There are some exceptions
with real performance justification; a recent one I’ve seen is memory
shrinkers, where you really want your &lt;code&gt;shrinker-&amp;gt;count_objects()&lt;/code&gt; to
not have to acquire any locks. Otherwise in a memory intense workload all
threads are stuck on the one thread doing actual reclaim holding the same lock
in your &lt;code&gt;shrinker-&amp;gt;scan_objects()&lt;/code&gt; function.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, unless you’re actually building a new locking or synchronization
primitive in the core kernel, you most likely do not want to get seen even
looking at atomic operations as an option.&lt;/p&gt;

&lt;h3 id=&quot;locking-antipattern-preemptlocal_irqbh_disable-and-friends-&quot;&gt;Locking Antipattern: &lt;code&gt;preempt/local_irq/bh_disable()&lt;/code&gt; and Friends …&lt;/h3&gt;

&lt;p&gt;This one is simple: Lockdep doesn’t understand them. The real-time folks hate
them. Whatever it is you’re doing, use proper primitives instead, and at least
read up on the &lt;a href=&quot;https://lwn.net/Articles/828477/&quot;&gt;LWN coverage on why these are
problematic and what to do instead&lt;/a&gt;. If you need
some kind of synchronization primitive - maybe to avoid the &lt;a href=&quot;#locking-antipattern-confusing-object-lifetime-and-data-consistency&quot;&gt;lifetime vs.
consistency antipattern
pitfalls&lt;/a&gt; -
then use the proper functions for that like &lt;code&gt;synchronize_irq()&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;locking-antipattern-memory-barriers&quot;&gt;Locking Antipattern: Memory Barriers&lt;/h3&gt;

&lt;p&gt;Or more often, lack of them, incorrect or imbalanced use of barriers, badly or
wrongly or just not at all documented memory barriers, or …&lt;/p&gt;

&lt;p&gt;Fact is that the vast majority of kernel hackers, and driver people even more so, have no
useful understanding of the Linux kernel’s memory model, and should never be
caught entertaining use of explicit memory barriers in production code.
Personally I’m pretty good at spotting holes, but I’ve had to learn the hard way
that I’m not even close to being able to positively prove correctness. And for
better or worse, nothing short of that tends to cut it.&lt;/p&gt;

&lt;p&gt;For a still fairly cursory discussion read the &lt;a href=&quot;https://lwn.net/Articles/844224/&quot;&gt;LWN series on lockless
algorithms&lt;/a&gt;. If the code comments and commit
message are anything less rigorous than that, it’s fairly safe to assume there’s
an issue.&lt;/p&gt;

&lt;p&gt;Now don’t get me wrong, I love to read an article or watch a talk by Paul
McKenney on RCU like anyone else to get my brain fried properly. But aside from
extreme exceptions this kind of maintenance cost has simply no justification in
a driver subsystem - at least not unless it’s packaged in a driver-hacker-proof
library or core kernel service of some sort, with all the memory barriers well
hidden away where ordinary fools like me can’t touch them.&lt;/p&gt;

&lt;h2 id=&quot;closing-thoughts&quot;&gt;Closing Thoughts&lt;/h2&gt;

&lt;p&gt;I hope you enjoyed this little tour of progressively more worrying levels of
locking engineering, with really just one key takeaway:&lt;/p&gt;

&lt;p&gt;Simple, dumb locking is good locking, since with that you have a fighting chance
to make it correct locking.&lt;/p&gt;

&lt;p&gt;Thanks to Daniel Stone and Jason Ekstrand for reading and commenting on drafts
of this text.&lt;/p&gt;
</description>
        <pubDate>Wed, 03 Aug 2022 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2022/08/locking-hierarchy.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2022/08/locking-hierarchy.html</guid>
        
        <category>In-Depth Tech</category>
        
        
      </item>
    
      <item>
        <title>Locking Engineering Principles</title>
        <description>&lt;p&gt;For various reasons I spent way too much of the last two years looking at code
with terrible locking design and trying to rectify it, instead of a lot more actual
building of cool things. It’s symptomatic that the last post here on my neglected blog
is also a &lt;a href=&quot;/2020/08/lockdep-false-positives.html&quot;&gt;rant on lockdep abuse&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I tried to distill all the lessons learned into some training slides, and this
two-part series is the writeup of the same material. There are some GPU specific rules, but I
think the key points should apply to kernel drivers in
general.&lt;/p&gt;

&lt;p&gt;The first part here lays out some principles, the &lt;a href=&quot;/2022/08/locking-hierarchy.html&quot;&gt;second part builds a locking
engineering design pattern hierarchy&lt;/a&gt; from the
easiest to understand and maintain to the most nightmare-inducing
approaches.&lt;/p&gt;

&lt;p&gt;Also, by locking engineering I mean the general problem of protecting data
structures against concurrent access by multiple threads, trying to ensure
that each thread sees a sufficiently consistent view of the data it reads and that the updates
it commits won’t result in confusion. Of course what exactly “sufficiently
consistent” means depends highly upon the precise requirements, but figuring
out these kinds of questions is out of scope for this little series here.&lt;/p&gt;

&lt;!--more--&gt;
&lt;h2 id=&quot;priorities-in-locking-engineering&quot;&gt;Priorities in Locking Engineering&lt;/h2&gt;

&lt;p&gt;Designing a correct locking scheme is hard, validating that your code actually
implements your design is harder, and then debugging when - not if! - you
screwed up is even worse. Therefore the absolute most important rule in locking
engineering, at least if you want to have any chance at winning this game, is to
make the design as simple and dumb as possible.&lt;/p&gt;

&lt;h3 id=&quot;1-make-it-dumb&quot;&gt;1. Make it Dumb&lt;/h3&gt;

&lt;p&gt;Since this is &lt;em&gt;the&lt;/em&gt; key principle the entire second part of this series will go
through a lot of different locking design patterns, from the simplest and
dumbest and easiest to understand, to the most hair-raising horrors of
complexity and trickiness.&lt;/p&gt;

&lt;p&gt;Meanwhile let’s continue to look at everything else that matters.&lt;/p&gt;

&lt;h3 id=&quot;2-make-it-correct&quot;&gt;2. Make it Correct&lt;/h3&gt;

&lt;p&gt;Since simple doesn’t necessarily mean correct, especially when transferring a
concept from design to code, we need guidelines. On the design front the most
important one is to &lt;a href=&quot;/2020/08/lockdep-false-positives.html&quot;&gt;design for lockdep, and not fight
it&lt;/a&gt;, for which I already wrote a full length
rant. Here I will only go through the main lessons: Validating locking by hand
against all the other locking designs and nesting rules the kernel has overall
is nigh impossible, extremely slow, something only a few people can do with any
chance of success, and hence in almost all cases a complete waste of time. We
need tools to automate this, and in the Linux kernel this is lockdep.&lt;/p&gt;

&lt;p&gt;Therefore if lockdep doesn’t understand your locking design your design is at
fault, not lockdep. Adjust accordingly.&lt;/p&gt;

&lt;p&gt;A corollary is that you actually need to teach lockdep your locking rules,
because otherwise different drivers or subsystems will end up with de facto
incompatible nesting and dependencies. Which, as long as you never exercise them
on the same kernel boot-up, much less the same machine, won’t make lockdep grumpy.
But it will make maintainers very much question why they are doing what they’re
doing.&lt;/p&gt;

&lt;p&gt;Hence at driver/subsystem/whatever load time, when CONFIG_LOCKDEP is enabled,
take all key locks in the correct order. One example for this relevant
to GPU drivers is &lt;a href=&quot;https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/dma-resv.c?h=v5.18#n685&quot;&gt;in the dma-buf
subsystem&lt;/a&gt;.&lt;/p&gt;
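
&lt;p&gt;Such priming can be sketched in kernel-style C roughly like this, modeled
loosely on the dma-buf example above - not compilable stand-alone, and everything
named &lt;code&gt;my_&lt;/code&gt; is invented for illustration:&lt;/p&gt;

```c
/* Take the key locks once, in the intended order, at init time so
 * lockdep records the nesting rules before any real workload runs. */
#if IS_ENABLED(CONFIG_LOCKDEP)
static void my_subsys_lockdep_prime(struct my_dev *dev)
{
	mutex_lock(&dev->big_lock);
	mutex_lock(&dev->obj_lock);	/* obj_lock nests under big_lock */

	/* memory allocation must be allowed under both locks */
	fs_reclaim_acquire(GFP_KERNEL);
	fs_reclaim_release(GFP_KERNEL);

	mutex_unlock(&dev->obj_lock);
	mutex_unlock(&dev->big_lock);
}
#endif
```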

&lt;p&gt;In the same spirit, at every entry point to your library or subsystem, or
anything else big, validate that the callers hold up the locking contract with
&lt;code&gt;might_lock(), might_sleep(), might_alloc()&lt;/code&gt; and all the variants and
more specific implementations of this. Note that there’s a huge overlap between
locking contracts and calling context in general (like interrupt safety, or
whether memory allocation is allowed to call into direct reclaim), and since all
these functions compile away to nothing when debugging is disabled there’s
really no cost in sprinkling them around very liberally.&lt;/p&gt;
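
&lt;p&gt;Sketched at an invented entry point, such contract checks look like this (a
kernel-style sketch; &lt;code&gt;my_subsys_do_stuff()&lt;/code&gt; and &lt;code&gt;struct
my_obj&lt;/code&gt; are invented):&lt;/p&gt;

```c
/* Entry-point contract checks; these all compile away entirely when
 * the corresponding debug options are disabled. */
int my_subsys_do_stuff(struct my_obj *obj)
{
	might_sleep();			/* callers must be able to block */
	might_lock(&obj->lock);		/* we may take obj->lock */
	might_alloc(GFP_KERNEL);	/* we may call into direct reclaim */

	/* ... actual work ... */
	return 0;
}
```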

&lt;p&gt;On the implementation and coding side there’s a few rules of thumb to follow:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Never invent your own locking primitives; you’ll get them wrong, or at least
build something that’s slow. The kernel’s locks are built and tuned by people
who’ve done nothing else their entire career. You won’t beat them except in bug
count, and that by a lot.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The same holds for synchronization primitives - don’t build your own with a
&lt;code&gt;struct wait_queue_head&lt;/code&gt;, or worse, hand-roll your own wait queue.
Instead use the most specific existing function that provides the
synchronization you need, e.g. &lt;code&gt;flush_work()&lt;/code&gt; or
&lt;code&gt;flush_workqueue()&lt;/code&gt; and the enormous pile of variants available for
synchronizing against scheduled work items.&lt;/p&gt;

    &lt;p&gt;A key reason here is that very often these more specific functions already
come with elaborate lockdep annotations, whereas anything hand-rolled tends to
require much more manual design validation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally at the intersection of “make it dumb” and “make it correct”, pick the
simplest lock that works, like a normal mutex instead of a read-write
semaphore. This is because in general, stricter rules catch bugs and design
issues quicker, hence picking a very fancy “anything goes” locking primitive
is a bad choice.&lt;/p&gt;

    &lt;p&gt;As another example, pick spinlocks over mutexes because spinlocks are a lot
more strict in what code they allow in their critical section. Hence there’s much
less risk of accidentally putting something silly in there and closing a dependency
loop that could lead to a deadlock.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-make-it-fast&quot;&gt;3. Make it Fast&lt;/h3&gt;

&lt;p&gt;Speed doesn’t matter if you no longer understand the design in the future;
you need simplicity first.&lt;/p&gt;

&lt;p&gt;Speed doesn’t matter if all you’re doing is crashing faster. You need
correctness before speed.&lt;/p&gt;

&lt;p&gt;Finally speed doesn’t matter where users don’t notice it. If you
micro-optimize a path that doesn’t even show up in real world workloads users
care about, all you’ve done is wasted time and committed to future maintenance
pain for no gain at all.&lt;/p&gt;

&lt;p&gt;Similarly, optimizing code paths which should never run in the first place,
instead of improving your design, is not worth it. This holds especially for GPU drivers,
where the real application interfaces are OpenGL, Vulkan or similar, and there’s
an entire driver on the userspace side - the right fix for performance issues
is very often to radically update the contract and sharing of responsibilities
between the userspace and kernel driver parts.&lt;/p&gt;

&lt;p&gt;The big example here is GPU address patch list processing at command submission
time, which was necessary for old hardware that completely lacked any useful
concept of a per process virtual address space. But that has changed, which
means virtual addresses can stay constant, while the kernel can still freely
manage the physical memory by manipulating pagetables, like on the CPU.
Unfortunately one driver in the DRM subsystem instead spent an easy engineer
decade of effort to tune relocations, write lots of testcases for the resulting
corner cases in the multi-level fastpath fallbacks, and even more time handling
the impressive amounts of fallout in the form of bugs and future headaches due
to the resulting unmaintainable code complexity …&lt;/p&gt;

&lt;p&gt;In other subsystems where the kernel ABI is the actual application contract
these kind of design simplifications might instead need to be handled between
the subsystem’s code and driver implementations. This is what we’ve done when
moving from the old kernel modesetting infrastructure to atomic modesetting.
But sometimes no clever tricks at all help and you only get true speed with a
radically revamped uAPI - io_uring is a great example here.&lt;/p&gt;

&lt;h2 id=&quot;protect-data-not-code&quot;&gt;Protect Data, not Code&lt;/h2&gt;

&lt;p&gt;A common pitfall is to design locking by looking at the code, perhaps just
sprinkling locking calls over it until it feels like it’s good enough. The right
approach is to design locking for the data structures, which means specifying
for each structure or member field how it is protected against concurrent
changes, and how the necessary amount of consistency is maintained across the
entire data structure with rules that stay invariant, irrespective of how code
operates on the data. Then roll it out consistently to all the functions,
because the code-first approach tends to have a lot of issues:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;A code centric approach to locking often leads to locking rules changing over
the lifetime of an object, e.g. with different rules for a structure or member
field depending upon whether an object is in active use, maybe just cached or
undergoing reclaim. This is hard to teach to lockdep, especially when the
nesting rules change for different states. Lockdep assumes that the
locking rules are completely invariant over the lifetime of the entire kernel,
not just over the lifetime of an individual object or structure.&lt;/p&gt;

    &lt;p&gt;Starting from the data structures on the other hand encourages that locking
rules stay the same for a structure or member field.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Locking design that changes depending upon the code that can touch the data
needs either complicated documentation entirely separate from the
code - with a high risk of becoming stale - or the explanations, if there are any,
are sprinkled over the various functions, which means reviewers need to
reacquire the entire relevant chunk of the code base again to make sure they
don’t miss any odd corner cases.&lt;/p&gt;

    &lt;p&gt;With data structure driven locking design there’s a perfect, because unique
place to document the rules - in the kerneldoc of each structure or member
field.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A consequence for code review is that to recheck the locking design for a code
first approach every function and flow has to be checked against all others,
and changes need to be checked against all the existing code. If this is not
done you might miss a corner case where the locking falls apart with a race
condition or could deadlock.&lt;/p&gt;

    &lt;p&gt;With a data first approach to locking changes can be reviewed incrementally
against the invariant rules, which means review of especially big or complex
subsystems actually scales.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;When facing a locking bug it’s tempting to try and fix it just in the affected
code. By repeating that often enough, a locking scheme that protects data
acquires code-specific special cases. Therefore locking issues always
need to be first mapped back to new or changed requirements on the data
structures and how they are protected.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
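
&lt;p&gt;As a sketch, the per-field kerneldoc documentation mentioned above might look
like this - the structure and all field names are invented for illustration:&lt;/p&gt;

```c
/**
 * struct my_obj - example object with data-first locking rules
 * @lock: protects @state and @refcount
 * @state: object state, protected by @lock
 * @refcount: reference count, protected by @lock
 * @link: entry in my_dev.obj_list, protected by my_dev.list_lock,
 *	not by @lock
 */
struct my_obj {
	struct mutex lock;
	unsigned int state;
	unsigned int refcount;
	struct list_head link;
};
```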

&lt;p&gt;&lt;em&gt;The&lt;/em&gt; big antipattern of how you end up with code centric locking is to protect
an entire subsystem (or worse, a group of related subsystems) with a single
huge lock. The canonical example was the big kernel lock &lt;em&gt;BKL&lt;/em&gt;; that’s gone, but
in many cases it’s just been replaced by smaller, but still huge, locks like
&lt;code&gt;console_lock()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This results in a lot of long term problems when trying to adjust the locking
design later on:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Since the big lock protects everything, it’s often very hard to tell what it
does not protect. Locking at the fringes tends to be inconsistent, and due to
that its coverage tends to creep ever further when people try to fix bugs
where a given structure is not consistently protected by the same lock.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Also, subsystems often have different entry points; e.g. consoles can be
reached through the console subsystem directly, through the vt and tty subsystems, and
also through an enormous pile of driver specific interfaces, with the fbcon
IOCTLs as an example. Attempting to split the big lock into smaller
per-structure locks pretty much guarantees that different entry points have to
take the per-object locks in opposite order, which often can only be resolved
through a large-scale rewrite of all impacted subsystems.&lt;/p&gt;

    &lt;p&gt;Worse, as long as the big subsystem lock continues to be in use no one is
spotting these design issues in the code flow. Hence they will slowly get
worse instead of the code moving towards a better structure.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons big subsystem locks tend to live way past their justified
usefulness until code maintenance becomes nigh impossible: no individual
bugfix is worth the task of really rectifying the design, but each bugfix tends to
make the situation worse.&lt;/p&gt;

&lt;h2 id=&quot;from-principles-to-practice&quot;&gt;From Principles to Practice&lt;/h2&gt;

&lt;p&gt;Stay tuned for next week’s installment, which will cover what these principles
mean when applied in practice: going through a large pile of locking design
patterns from the most desirable to the most hair-raisingly complex.&lt;/p&gt;
</description>
        <pubDate>Wed, 27 Jul 2022 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2022/07/locking-engineering.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2022/07/locking-engineering.html</guid>
        
        <category>In-Depth Tech</category>
        
        
      </item>
    
      <item>
        <title>Lockdep False Positives, some stories about</title>
        <description>&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Lockdep is giving false positives are the new the compiler is broken.&lt;/p&gt;&amp;mdash; David Airlie (@DaveAirlie) &lt;a href=&quot;https://twitter.com/DaveAirlie/status/1291932064606859269?ref_src=twsrc%5Etfw&quot;&gt;August 8, 2020&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;Recently we’ve looked a bit at lockdep annotations in the GPU subsystems, and I
figured it’s a good opportunity to explain how this all works, and what the
tradeoffs are. Creating working locking hierarchies for the kernel isn’t easy,
making sure the kernel’s locking validator
&lt;a href=&quot;https://www.kernel.org/doc/html/v5.6/locking/lockdep-design.html&quot;&gt;lockdep&lt;/a&gt; is
happy and reviewers don’t have their brains explode even more so.&lt;/p&gt;

&lt;p&gt;First things first, and the fundamental issue:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Lockdep is about trading false positives against better testing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The only way to avoid false positives for deadlocks is to only report a deadlock
when the kernel actually deadlocked. Which is useless, since the entire point of
lockdep is to catch potential deadlock issues before they actually happen. Hence
false postives are not avoidable, at least not in theory, to be able to report
potential issues before they hang the machine. Read on for what to do in
practice.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;We need to understand how exactly lockdep trades false positives for better
discovery of locking inconsistencies. Lockdep makes a few assumptions about how
real code does locking in practice:&lt;/p&gt;

&lt;h2 id=&quot;invariance-of-locking-rules-over-time&quot;&gt;Invariance of locking rules over time&lt;/h2&gt;

&lt;p&gt;First assumption baked into lockdep is that the locking rules for a given lock
do not change over the lifetime of the lock’s existence. This already throws out
a large chunk of perfectly correct locking designs, since state transitions can
control how an object is accessed, and therefore how the lock is used.  Examples
include different rules for creation and destruction, or whether an object is on
a specific list (e.g. only a gpu buffer object that’s in the lru can be
evicted). It’s not possible to automatically prove that certain code flat
out won’t ever run together with some other code on the same structure, at least
not in general. Hence this is pretty much a required assumption to make
lockdep useful - if every new &lt;code&gt;lock()&lt;/code&gt; call could follow new rules
there’s nothing to check, besides realizing that an actual deadlock indeed
occurred and all is lost already.&lt;/p&gt;

&lt;p&gt;And of course getting such state transitions correct, with the guarantee that
all the old code will no longer run, is tricky to get right, and very hard on
reviewers. It’s a good thing lockdep has problems with such code too.&lt;/p&gt;

&lt;h2 id=&quot;common-locking-rules-for-the-same-objects&quot;&gt;Common locking rules for the same objects&lt;/h2&gt;

&lt;p&gt;Second assumption is that all locks initialized by the same code are following
the same locking rules. This is achieved by making all lock initializers C
macros, which create the corresponding lockdep class as a static variable within
the calling function. Again this is pretty much required, since to spot
inconsistencies you need as many observations of all the different code path
possibilities. Best to share them all between the same object. Also a distinct
lockdep class for each individual object would explode the runtime overhead in
both memory and cpu cycles.&lt;/p&gt;

&lt;p&gt;And again this is good from a code design point too, since having the same data
structure and code follow different locking rules for different objects is at
best very confusing for reviewers.&lt;/p&gt;
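
&lt;p&gt;To illustrate the mechanism: this is roughly how the kernel’s
&lt;code&gt;mutex_init()&lt;/code&gt; macro pulls it off (a simplified sketch, not the
verbatim source) - every invocation site gets its own static key, and all locks
initialized at that site share the lockdep class hanging off that key:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#define mutex_init(mutex)                         \
do {                                              \
        /* one static key per call site, shared   \
         * by all locks initialized through it */ \
        static struct lock_class_key __key;       \
                                                  \
        __mutex_init((mutex), #mutex, &amp;amp;__key);    \
} while (0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;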

&lt;h2 id=&quot;fighting-lockdep-badly&quot;&gt;Fighting lockdep, badly&lt;/h2&gt;

&lt;p&gt;Now things went wrong: You have a lockdep splat at your hands, you’ve concluded
it’s a false positive, and you go ahead trying to teach lockdep about what’s
going on. The first class of annotations are the special
&lt;code&gt;lock_nested(lock, subclass)&lt;/code&gt; functions. Without lockdep nothing in
the generated code changes, but it tells lockdep that for this lock acquisition,
we’re using a different class to track the observed locking.&lt;/p&gt;

&lt;p&gt;This breaks both the time invariance - nothing is stopping you from using
different classes for the same lock at different times - and commonality of
locking for the same objects. Worse, you can write code which obviously
deadlocks, but lockdep will think everything is perfectly fine:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutex_init(&amp;amp;A);

mutex_lock(&amp;amp;A);
mutex_lock_nested(&amp;amp;A, SINGLE_DEPTH_NESTING);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is no good and puts a huge burden on reviewers to carefully check all these
places themselves, manually. Exactly the kind of tedious and error-prone work
lockdep was meant to take over.&lt;/p&gt;

&lt;p&gt;Slightly better are the annotations which adjust the lockdep class once, when
the object is initialized, using &lt;code&gt;lockdep_set_class()&lt;/code&gt; and related
functions. This at least does not break time invariance, and hence will at least
guarantee that lockdep spots the deadlock at the latest when it actually happens.
It still reduces how much lockdep can connect what’s going on, but occasionally
“rewrite the entire subsystem” to resolve a locking inconsistency is just not a
reasonable option.&lt;/p&gt;
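
&lt;p&gt;A minimal sketch of this pattern, with hypothetical names - the special
objects get moved into their own class once, at initialization time:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;static struct lock_class_key special_key;

mutex_init(&amp;amp;obj-&amp;gt;lock);
if (obj-&amp;gt;is_special)
        /* opt this object out of the shared class, once, at init */
        lockdep_set_class(&amp;amp;obj-&amp;gt;lock, &amp;amp;special_key);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;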

&lt;p&gt;It still means that reviewers always need to remember what the locking rules for
all types of different objects behind the same structure are, instead of just
one. And then check against every path whether that code needs to work with all
of them, or just some, or only one. Again, tedious work that lockdep is really
supposed to help with. And if it’s hard to come by a system where you can
easily run the code for the different types of objects without rebooting,
then lockdep cannot help at all.&lt;/p&gt;

&lt;p&gt;All these annotations have in common that they don’t change the code logic, only
how lockdep interprets what’s going on.&lt;/p&gt;

&lt;p&gt;An even more insidious trick on reviewers and lockdep is to push locking into an
asynchronous worker of some sorts. This hides issues because lockdep does not
follow dependencies between threads through waiter/wakee relationships like
&lt;code&gt;wait_for_completion()&lt;/code&gt; and &lt;code&gt;complete()&lt;/code&gt;, or through wait
queues. There are lockdep annotations for specific dependencies, like in
the kernel’s workqueue code when flushing workers or specific work items with
&lt;code&gt;flush_work()&lt;/code&gt;. Automatic annotations have been attempted with the
&lt;a href=&quot;https://lwn.net/Articles/709849/&quot;&gt;lockdep cross-release extension&lt;/a&gt;, which for
various reasons had to be backed out again. Therefore hand-rolled asynchronous
code is a great place to create complexity and hide locking issues from both
lockdep and reviewers.&lt;/p&gt;
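
&lt;p&gt;As a hypothetical sketch, the following deadlocks for real, but lockdep only
ever observes two plain acquisitions of &lt;code&gt;A&lt;/code&gt; and stays silent, since
the completion dependency between the two threads is not tracked:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/* thread A */
mutex_lock(&amp;amp;A);
complete(&amp;amp;go);
wait_for_completion(&amp;amp;done); /* waits forever: worker blocks on A */
mutex_unlock(&amp;amp;A);

/* worker, kicked off earlier */
wait_for_completion(&amp;amp;go);
mutex_lock(&amp;amp;A);             /* never succeeds, thread A holds it */
complete(&amp;amp;done);
mutex_unlock(&amp;amp;A);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;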

&lt;h2 id=&quot;playing-to-lockdeps-strength&quot;&gt;Playing to lockdep’s strength&lt;/h2&gt;

&lt;p&gt;Except when there’s very strong justification for all the complexity, the real
fix is to change the locking and make it simpler. Simple enough for lockdep to
understand what’s going on, which also makes reviewers’ lives a lot better.
Often this means substantial code rework, but at least in some cases there
are useful tricks.&lt;/p&gt;

&lt;p&gt;A special kind of annotations are the &lt;code&gt;lock_nest_lock(lock,
superlock)&lt;/code&gt; family of functions - these tell lockdep that when multiple
locks of the same class are acquired, it’s all serialized by the single
superlock. Lockdep then validates that the right superlock is indeed held. A
great example is &lt;code&gt;mm_take_all_locks()&lt;/code&gt;, which as the name implies,
takes all locks related to the given &lt;code&gt;mm_struct&lt;/code&gt;. In a sense this is
not a pure annotation, unlike the ones above, since it requires that the
superlock is actually locked. For reviewers too, not just for lockdep, this is
generally an easier scheme to understand than some clever sorting of lock
acquisition order.&lt;/p&gt;
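
&lt;p&gt;A sketch of the nest-lock pattern, with hypothetical names - all the
per-object locks share one class, and lockdep checks that the superlock is held
whenever several of them are acquired:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutex_lock(&amp;amp;dev-&amp;gt;big_lock);
list_for_each_entry(obj, &amp;amp;dev-&amp;gt;objects, node)
        /* same class for every obj-&amp;gt;lock, serialized by big_lock */
        mutex_lock_nest_lock(&amp;amp;obj-&amp;gt;lock, &amp;amp;dev-&amp;gt;big_lock);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;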

&lt;p&gt;A different situation often arises when creating or destroying an object. At
that stage often no other thread has a reference to the object yet, and
therefore no one else can take the lock. The best way to resolve a locking
inconsistency over the lifetime of an object due to creation and destruction
code is then to not take any locks at all in these paths. There is nothing to
protect against, after all!&lt;/p&gt;

&lt;p&gt;In all these cases the best option for long term maintainability is to simplify
the locking design, not reduce lockdep’s power by reducing the amount of false
positives it reports. And that should be the general principle.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;tldr; do not fix lockdep false positives, fix your locking&lt;/p&gt;
&lt;/blockquote&gt;
</description>
        <pubDate>Thu, 13 Aug 2020 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2020/08/lockdep-false-positives.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2020/08/lockdep-false-positives.html</guid>
        
        <category>In-Depth Tech</category>
        
        
      </item>
    
      <item>
        <title>Upstream Graphics: Too Little, Too Late</title>
        <description>&lt;p&gt;Unlike the tradition of my past few talks at Linux Plumbers or Kernel
conferences, this time around in Lisboa I did not start out with a rant proposing
to change everything. Instead I celebrated roughly 10 years of upstream graphics
progress and finally achieving paradise.  But that was all just prelude to a few
bait-and-switches, to later fulfill expectations about what’s broken this time around
in upstream, totally, and what needs to be fixed and changed, maybe.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.youtube.com/watch?v=S1I34t5RpnI&quot;&gt;LPC video recording&lt;/a&gt; is now
released, &lt;a href=&quot;/slides/lpc-2019-upstream.pdf&quot;&gt;slides&lt;/a&gt; are uploaded. If neither of
those is to your taste, read below the break for the written summary.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/S1I34t5RpnI&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;mission-accomplished&quot;&gt;Mission Accomplished&lt;/h2&gt;

&lt;p&gt;10 or so years ago upstream graphics was essentially a proof of concept for the
promises to come. Kernel display modesetting had just landed, finally bringing a
somewhat modern display driver userspace API to linux. And GEM, the graphics
execution manager, landed too, bringing proper GPU memory management and multi-client
rendering. Realistically a lot still needed to be done, from rendering drivers
for all the various SoCs, to an &lt;a href=&quot;/2015/08/atomic-modesetting-design-overview.html&quot;&gt;atomic display
API&lt;/a&gt; that can expose all the
features, not just what was needed to light up a linux desktop back in the days.
And lots of work to improve the codebase and make it much easier and quicker to
write drivers.&lt;/p&gt;

&lt;p&gt;There’s obviously still a lot to do, but I think we’ve achieved that - for full
details, check out my &lt;a href=&quot;/2019/12/elce-lyon-everything-great.html&quot;&gt;ELCE talk about everything great for upstream
graphics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now despite all this justified celebrating, there is one sticking point still:&lt;/p&gt;

&lt;h2 id=&quot;nvidia&quot;&gt;NVIDIA&lt;/h2&gt;

&lt;p&gt;The trouble with team green from an open source perspective - for them it’s a
great boon - is that they own the GPU software stack in two crucial ways:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;p&gt;NVIDIA defines how desktop GL works. Not so relevant anymore, and at least the
core profile is a solid spec and has a fully open source test suite from Khronos
by now. But the compatibility profile, which didn’t throw out all the legacy
features from the GL 1.x days in the 90s, does not have any of the interactions
with all the new features specced out and covered with tests - NVIDIA’s binary
driver is that standard, and has been for roughly 20 years.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;More relevant today is CUDA - not around for quite as long as desktop GL, but
serving a market that’s growing at a rather brisk pace. CUDA is the undisputed king of the
general purpose GPU compute hill. Anything and everything that matters runs on
top of it, often exclusively.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together these create a huge software moat around the high margin hardware
business. All an open stack would achieve is filling in that moat and inviting
competition to eat the nice surplus. In other words, stupid to even attempt,
vendor lock-in just pays too well.&lt;/p&gt;

&lt;p&gt;Now of course the reverse engineered nouveau driver still exists. But if you
have to pay for reverse engineering already, then you might as well go with
someone else’s hardware, since you’re not going to get any of the CUDA/GL
goodies.&lt;/p&gt;

&lt;p&gt;Elsewhere the business case for open source drivers very much exists - so much so
that even paying for reverse engineering a full stack is no problem. The result is a
vibrant community of hardware vendors, customers, distros and consulting shops
who pay the bills for all the open driver work that’s being done. And in
userspace even “upstream first” works - releases happen quickly and often
enough, with sufficiently smooth merge process that having a vendor tree is
simply not needed. Plus customers’ willingness to upgrade if necessary, because
it’s usually a well-contained component to enable new hardware support.&lt;/p&gt;

&lt;p&gt;In short: without a solid business case behind open graphics drivers, they’re
just not going to happen - viz. NVIDIA.&lt;/p&gt;

&lt;h2 id=&quot;not-shipping-upstream&quot;&gt;Not Shipping Upstream&lt;/h2&gt;

&lt;p&gt;Unfortunately the business case for “upstream first” on the kernel side is
completely broken. Not for open source, and not for any fundamental reasons, but
simply because the kernel moves too slowly, is too big, drivers aren’t well
contained enough, and therefore customers will not, or even cannot, upgrade. For
some hardware upstreaming early enough is possible, but graphics simply moves
too fast: By the time the upstreamed driver is actually in shipping distros,
it’s already one hardware generation behind. And missing almost a year of tuning
and performance improvements. Worse, it’s not just new hardware, but also GL and
Vulkan versions that won’t work on older kernels due to missing features,
fragmenting the ecosystem further.&lt;/p&gt;

&lt;p&gt;This is entirely unlike the userspace side, where refactoring and code sharing
in a cross-vendor shared upstream project actually pays off. Even in the short
term.&lt;/p&gt;

&lt;p&gt;There are a lot of approaches trying to paper over this rift with the linux
kernel:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Stable kernel ABI for driver modules, so that you can upgrade the core kernel
and drivers independently. Google Android is very much aiming this solution at
their huge vendor tree problem. Traditionally enterprise distros do the same.
This works, save that a stable kernel-internal ABI is &lt;a href=&quot;https://www.kernel.org/doc/html/latest/process/stable-api-nonsense.html&quot;&gt;not a notion that’s very
popular with kernel maintainers
…&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you go with an “upstream first” approach to shipping graphics drivers you
first need to polish your driver, refactor out common components, and push it
to upstream.  Only to then pay a second team to re-add all the crap so you can
ship your driver on all the old kernels, where all the helpers and new common
code don’t exist.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Pay your distro or OS vendor to just backport the new helpers before they even
have landed in an upstream release. Which means instead of a backporting team
for the driver on your payroll you now pay for backporting the entire
subsystem - which in many cases is cheaper, but an even harder sell to
beancounters. And sometimes not possible because other driver teams from
competitors might not be on board and insist on not breaking the stable driver
ABI for a given distro release kernel.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, there just isn’t a single LTS kernel. Even upstream has multiple, plus
every distro has their own flavour, plus customers love to grow their own
variety trees too. Often they’re not even coordinated on the same upstream
release. The cheapest way to support this entire madness is to completely ignore
upstream and just write your own subsystem. Or at least not use any of the
helper libraries provided by kernel subsystems, completely defeating the
supposed benefit of upstreaming code.&lt;/p&gt;

&lt;p&gt;No matter the strategy, they all boil down to paying twice - if you want to
upstream your code. And there’s no added return for the doubled bill. In
conclusion, upstream first needs a business case, like the open source graphics
stack in general. And that business case is very much real - except that for
upstreaming, it’s only real in userspace.&lt;/p&gt;

&lt;p&gt;In the kernel, “upstream first” is a sham, at least for graphics drivers.&lt;/p&gt;

&lt;p&gt;Thanks to Alex Deucher for reading and commenting on drafts of this text.&lt;/p&gt;
</description>
        <pubDate>Tue, 10 Dec 2019 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2019/12/upstream-too-little-too-late.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2019/12/upstream-too-little-too-late.html</guid>
        
        <category>Maintainer-Stuff</category>
        
        <category>Conferences</category>
        
        
      </item>
    
      <item>
        <title>ELCE Lyon: Everything Great About Upstream Graphics</title>
        <description>&lt;p&gt;At ELC Europe in Lyon I held a nice little presentation about the state of
upstream graphics drivers, and how absolutely awesome it all is. Of course with
a big focus on SoC and embedded drivers. &lt;a href=&quot;/slides/elce-2019-upstream.pdf&quot;&gt;Slides&lt;/a&gt;
and the &lt;a href=&quot;https://www.youtube.com/watch?v=kVzHOgt6WGE&quot;&gt;video
recording&lt;/a&gt;&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/kVzHOgt6WGE&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Key takeaways for the busy:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The upstream DRM graphics subsystem really scales down to tiny drivers now,
with the smallest driver coming in at just around 250 lines (including
comments and whitespace), 10’000x less than the biggest!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Batteries all included - there’s modular helpers for everything. As a rule of
thumb even minimal legacy fbdev drivers ported to DRM shrink by a factor of
2-4 thanks to these helpers taking care of anything that’s remotely
standardized in displays and GPUs.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For shipping userspace drivers go with a dual-stack: Open source GL and Vulkan
drivers for those who want that, and for getting the kernel driver merged into
upstream. Closed source for everyone else, running on the same userspace API
and kernel driver.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;And for customer support, backport the entire subsystem; try to avoid
backporting an individual driver.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, world domination is assured and progressing according to plan.&lt;/p&gt;
</description>
        <pubDate>Tue, 03 Dec 2019 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2019/12/elce-lyon-everything-great.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2019/12/elce-lyon-everything-great.html</guid>
        
        <category>Conferences</category>
        
        
      </item>
    
      <item>
        <title>Upstream First</title>
<description>&lt;p&gt;lwn.net just featured an article on &lt;a href=&quot;https://lwn.net/Articles/786304/&quot;&gt;the sustainability of open
source&lt;/a&gt;, which &lt;a href=&quot;https://lwn.net/Articles/783169/&quot;&gt;seems to be a bit of a
topic&lt;/a&gt; in &lt;a href=&quot;https://www.youtube.com/watch?v=W2AR1owg0ao&quot;&gt;various places for a
while now&lt;/a&gt;. I gave a keynote at the
Siemens Linux Community Event 2018 last year which lends itself to a different
take on all this:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/eJDcdYyOwko&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;a href=&quot;/slides/siemens-2018.pdf&quot;&gt;The slides&lt;/a&gt; for those who don’t like videos.&lt;/p&gt;

&lt;p&gt;This talk was mostly aimed at managers of engineering teams and projects with
fairly little experience in shipping open source, and much less experience in
shipping open source through upstream cross vendor projects like the kernel. It
goes through all the usual failings and missteps and explains why an upstream
first strategy is the right one, but with a twist: Instead of technical reasons,
it’s all based on economical considerations of why open source is succeeding.
Fundamentally it’s not about the better software, or the cheaper price, or that
the software freedoms are a good thing worth supporting.&lt;/p&gt;

&lt;p&gt;Instead open source is eating the world because it enables a much more
competitive software market. And all the best practices around open development
are just there to enable that highly competitive market. Instead of arguing that
open source has open development and strongly favours public discussions because
that results in better collaboration and better software, we put on the economic
lens, where private discussions become insider trading and collusion. And that’s
just not considered cool in a competitive market. Similar arguments can be made
with everything else going on in open source projects.&lt;/p&gt;

&lt;p&gt;Circling back to the list of articles at the top I think it’s worth looking at
the sustainability of open source as an economic issue of an extremely
competitive market, in other words, as a market failure: Occasionally the result
is that no one gets paid, the customers only receive a sub-par product with
all costs externalized - costs like keeping up with security issues. And like
with other market failures, a solution needs to be externally imposed through
regulations, taxation and transfers to internalize all the costs again into the
product’s price. Frankly, I have no idea what that would look like in practice though.&lt;/p&gt;

&lt;p&gt;Anyway, just a thought, but good enough a reason to finally publish the
recording and slides of my talk, which covers this just in passing in an offhand
remark.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update:&lt;/em&gt; Fix slides link.&lt;/p&gt;
</description>
        <pubDate>Thu, 02 May 2019 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2019/05/upstream-first.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2019/05/upstream-first.html</guid>
        
        <category>Maintainer-Stuff</category>
        
        <category>Conferences</category>
        
        
      </item>
    
      <item>
        <title>X.org Elections: freedesktop.org Merger - Vote Now!</title>
        <description>&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img border=&quot;0&quot; height=&quot;320&quot; src=&quot;/img/vote_now.jpg&quot; width=&quot;320&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Aside from the regular &lt;a href=&quot;https://www.x.org/wiki/BoardOfDirectors/Elections/2019/&quot;&gt;board
elections&lt;/a&gt; we also have
some &lt;a href=&quot;https://gitlab.freedesktop.org/xorgfoundation/bylaws/blob/bylaw-updates/bylaws.pdf&quot;&gt;bylaw
changes&lt;/a&gt;
to vote on. As usual with bylaw changes, we need a supermajority of all members
to agree - if you don’t vote you essentially reject it, but the board has no way
of knowing.&lt;/p&gt;

&lt;p&gt;Please see the &lt;a href=&quot;https://gitlab.freedesktop.org/xorgfoundation/bylaws/commit/06e7f04f79131df2c86e9cdfedc00aa1d1ec3f52&quot;&gt;detailed changes of the
bylaws&lt;/a&gt;,
make up your mind, and go voting on the &lt;a href=&quot;https://members.x.org/&quot;&gt;shiny new members
page&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Fri, 29 Mar 2019 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2019/03/xorg-election-fdo-merger.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2019/03/xorg-election-fdo-merger.html</guid>
        
        
      </item>
    
      <item>
        <title>Why no 2D Userspace API in DRM?</title>
<description>&lt;p&gt;The DRM (direct rendering manager, not the content protection stuff) graphics
subsystem in the linux kernel does not have a generic 2D acceleration API,
despite an awful lot of GPUs having more or less featureful blitter
units. And many systems need them for a lot of use-cases, because the 3D engine
is a bit too slow or too power hungry for just rendering desktops.&lt;/p&gt;

&lt;p&gt;It’s a FAQ why this doesn’t exist and why it won’t get added, so I figured I’ll
answer this once and for all.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;A bit of nomenclature upfront: A 2D engine (or blitter) is a bit of hardware that
can copy stuff with some knowledge of the 2D layout usually used for pixel
buffers.  Some blitters can also do more, like basic blending, converting color
spaces or stretching/scaling. A 3D engine on the other hand is the fancy
high performance compute block, which runs small programs (called shaders) on
a massively parallel architecture. Generally with huge memory bandwidth and a
dedicated controller to feed this beast through an asynchronous command buffer.
3D engines happen to be really good at rendering the pixels for 3D action games,
among other things.&lt;/p&gt;

&lt;h2 id=&quot;theres-no-2d-acceleration-standard&quot;&gt;There’s no 2D Acceleration Standard&lt;/h2&gt;

&lt;p&gt;3D has it easy: There’s OpenGL and Vulkan and DirectX that require a certain
feature set. And huge market forces that make sure if you use these features
like a game would, rendering is fast.&lt;/p&gt;

&lt;p&gt;Aside: This means the 2D engine in a browser actually needs to work like a
3D action game, or the GPU will crawl. The impedance mismatch compared to
traditional 2D rendering designs is huge.&lt;/p&gt;

&lt;p&gt;On the 2D side there’s no such thing: Every blitter engine is its own bespoke
thing, with its own features, limitations and performance characteristics.
There are also no standard benchmarks that would drive common performance
characteristics - today blitters are needed mostly in small systems, with very
specific use cases. Anything big enough to run more generic workloads will have
a 3D rendering block anyway. These systems still have blitters, but mostly just
to help move data in and out of VRAM for the 3D engine to consume.&lt;/p&gt;

&lt;p&gt;Now the huge problem here is that you need to fill these gaps in various
hardware 2D engines using CPU side software rendering. The crux with any 2D
render design is that transferring buffers and data too often between the GPU
and CPU will kill performance. Usually the cliff is so steep that pure
CPU rendering using only software easily beats any simplistic 2D acceleration
design.&lt;/p&gt;

&lt;p&gt;The only way to fix this is to be really careful when moving data between the
CPU and GPU for different rendering operations. Sticking to one side, even if
it’s a bit slower, tends to be an overall win. But these decisions highly depend
upon the exact features and performance characteristics of your 2D engine.
Putting a generic abstraction layer in the middle of this stack, where it’s guaranteed
to be if you make it a part of the kernel/userspace interface, will not result
in actual acceleration.&lt;/p&gt;

&lt;p&gt;So either you make your 2D rendering look like it’s a 3D game, using 3D
interfaces like OpenGL or Vulkan. Or you need a software stack that’s bespoke to
your use-case &lt;em&gt;and&lt;/em&gt; the specific hardware you want to run on.&lt;/p&gt;

&lt;h2 id=&quot;2d-accelaration-is-really-hard&quot;&gt;2D Acceleration is Really Hard&lt;/h2&gt;

&lt;p&gt;This is the primary reason really. If you don’t believe that, look at all the
tricks a browser employs to render CSS and HTML and text really fast, while
still animating all that stuff smoothly. Yes, a web-browser is the pinnacle of
current 2D acceleration tech, and you really need all the things in there for
decent performance: Scene graphs, clever render culling, massive batching and
huge amounts of pain to make sure you don’t have to fall back to CPU-based
software rendering at the wrong point in a rendering pipeline. Plus managing
all kinds of assorted caches to balance reuse against running out of memory.&lt;/p&gt;

&lt;p&gt;Unfortunately lots of people assume 2D must be a lot simpler than 3D rendering,
and therefore they can design a 2D API that’s fast enough for everyone. No one
jumps in and suggests we should have a generic 3D interface at the kernel level,
because the lessons there are very clear:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The real application interface is fairly high level, and in userspace.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s a huge industry group doing really hard work to specify these
interfaces.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;The actual kernel-to-userspace interfaces end up being highly specific to the
hardware and architecture of the userspace driver (which contains most of the
magic). Any attempt at a generic interface leaves lots of hardware specific
tricks and hence performance on the floor.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;3D APIs like OpenGL or Vulkan have all the batching and queueing and memory
management issues covered in one way or another.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a bunch of DRM drivers which have support for 2D render engines
exposed to userspace. But they all use highly hardware specific interfaces,
fully streamlined for the specific engine. And they all require a decently sized
chunk of driver code in userspace to translate from a generic API to the
hardware formats. This is what DRM maintainers will recommend you do, if you
submit a patch to add a generic 2D acceleration API.&lt;/p&gt;

&lt;p&gt;Exactly like a 3D driver.&lt;/p&gt;

&lt;h2 id=&quot;if-all-else-fails-theres-options&quot;&gt;If All Else Fails, There’s Options&lt;/h2&gt;

&lt;p&gt;Now if you don’t care about the last bit of performance, and your use-case is
limited, and your blitter engine is limited, then there’s already options:&lt;/p&gt;

&lt;p&gt;You can take whatever pixel buffer you have, export it as a dma-buf, and then
import it into some other subsystem which already has some kind of limited 2D
acceleration support. Depending upon your blitter engine, that could be a v4l2
mem2m device, or for simpler things there’s also dmaengines.&lt;/p&gt;

&lt;p&gt;On top, the DRM subsystem does allow you to implement the traditional
acceleration methods exposed by the fbdev subsystem, in case you have userspace
that really insists on using these; it’s not recommended for anything new.&lt;/p&gt;

&lt;h2 id=&quot;what-about-kms&quot;&gt;What about KMS?&lt;/h2&gt;

&lt;p&gt;The above is kind of a lie, since the KMS (kernel modesetting) IOCTL userspace API
is a fairly full-featured 2D rendering interface. The aim of course is to render
different pixel buffers onto a screen. With the recently added writeback support,
operations targeting memory are now possible.  This could be used to expose a
traditional blitter, if you only expose writeback support and no other outputs
in your KMS driver.&lt;/p&gt;

&lt;p&gt;There are a few downsides:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;KMS is highly geared for compositing just a few buffers (hardware usually has
a very limited set of planes). For accelerated text rendering you want to do a
composite operation for each character, which means this has rather limited
use.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;KMS only needs to run at 60Hz, or whatever the refresh rate of your monitor
is. It’s not optimized for efficiency at higher throughput at all.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So all together this isn’t the high-speed 2D acceleration API you’re looking for
either. It is a valid alternative to the options above though, e.g. instead of a
v4l2 mem2m device.&lt;/p&gt;

&lt;h2 id=&quot;faq-for-the-faq-or-openvg&quot;&gt;FAQ for the FAQ, or: OpenVG?&lt;/h2&gt;

&lt;p&gt;OpenVG isn’t the standard you’re looking for either. For one it’s a userspace
API, like OpenGL. All the same reasons for not implementing a generic OpenGL
interface at the kernel/userspace boundary apply to OpenVG, too.&lt;/p&gt;

&lt;p&gt;Second, the Mesa3D userspace library did support OpenVG once. Didn’t gain
traction, got canned. Just because it calls itself a standard doesn’t make it a
widely adopted industry default. Unlike OpenGL/Vulkan/DirectX on the 3D side.&lt;/p&gt;

&lt;p&gt;Thanks to Dave Airlie and Daniel Stone for reading and commenting on drafts of this
text.&lt;/p&gt;
</description>
        <pubDate>Mon, 27 Aug 2018 00:00:00 +0000</pubDate>
        <link>http://blog.ffwll.ch/2018/08/no-2d-in-drm.html</link>
        <guid isPermaLink="true">http://blog.ffwll.ch/2018/08/no-2d-in-drm.html</guid>
        
        <category>In-Depth Tech</category>
        
        
      </item>
    
  </channel>
</rss>
