Message ID | 20130228102502.15191.14146.stgit@patser (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote: > +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow: > + Similar to mutex_reserve_lock, except it won't backoff with > -EAGAIN. > + This is useful when mutex_reserve_lock failed with -EAGAIN, and you > + unreserved all reservation_locks so no deadlock can occur. > + I don't particularly like these function names, with lock implementations the _slow post-fix is typically used for slow path implementations, not API type interfaces.
On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote: > +struct ticket_mutex { > + struct mutex base; > + atomic_long_t reservation_id; > +}; I'm not sure this is a good name, esp. due to the potential confusion vs the ticket spinlocks we have.
On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote: > +Reservation type mutexes > +struct ticket_mutex { > +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock, That's two different names and two different forms of one (for a total of 3 variants) for the same scheme. FAIL... Also, is there anything in CS literature that comes close to this? I'd think the DBMS people would have something similar with their transactional systems. What do they call it?
On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote: > +The algorithm that TTM came up with for dealing with this problem is > +quite simple. For each group of buffers (execbuf) that need to be > +locked, the caller would be assigned a unique reservation_id, from a > +global counter. In case of deadlock while locking all the buffers > +associated with a execbuf, the one with the lowest reservation_id > +wins, and the one with the higher reservation_id unlocks all of the > +buffers that it has already locked, and then tries again. So the thing that made me cringe inside when I read this was making it all work on -rt, where we also need to take PI into account and ensure the entire thing is deterministic. It _might_ be 'easy' and we could fall back to PI mutex behaviour in the case the reservation IDs match; however the entire for-all-bufs retry loops do worry me still. Head hurts, needs more time to ponder. It would be good if someone else (this would probably be you maarten) would also consider this and explore this 'interesting' problem space :-)
Hey, Thanks for reviewing. Op 02-04-13 13:00, Peter Zijlstra schreef: > On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote: >> +Reservation type mutexes >> +struct ticket_mutex { >> +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock, > That's two different names and two different forms of one (for a total > of 3 variants) for the same scheme. > > FAIL... It's been hard since I haven't seen anything similar in the kernel, I originally went with tickets since that's what ttm originally called it, and tried to kill as many references as I could when I noticed ticket mutexes already being taken. I'll fix up the ticket_mutex -> reservation_mutex, and mutex_reserve_* -> reserve_mutex_* > On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote: >> +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow: >> + Similar to mutex_reserve_lock, except it won't backoff with >> -EAGAIN. >> + This is useful when mutex_reserve_lock failed with -EAGAIN, and you >> + unreserved all reservation_locks so no deadlock can occur. >> + > I don't particularly like these function names, with lock > implementations the _slow post-fix is typically used for slow path > implementations, not API type interfaces. I didn't intend for drivers to use the new calls directly, but rather through a wrapper, for example by ttm_eu_reserve_buffers in drivers/gpu/drm/ttm/ttm_execbuf_util.c > Also, is there anything in CS literature that comes close to this? I'd > think the DBMS people would have something similar with their > transactional systems. What do they call it? I didn't study cs, but judging from your phrasing I guess you mean you want me to call it transaction_mutexes instead? > Head hurts, needs more time to ponder. It would be good if someone else > (this would probably be you maarten) would also consider this and > explore > this 'interesting' problem space :-) My head too, evil priority stuff! Hacky but pragmatical workaround for now: use a real mutex around all the reserve_mutex_lock* calls instead of a virtual lock. It can be unlocked as soon as all locks have been taken, before any actual work is done. It only slightly kills the point of having a reservation in the first place, but at least it won't break completely -rt completely for now. ~Maarten
On Tue, Apr 2, 2013 at 1:00 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > Also, is there anything in CS literature that comes close to this? I'd > think the DBMS people would have something similar with their > transactional systems. What do they call it? I've looked around a bit and in dbms row-locking land this seems to be called the wound-wait deadlock avoidance algorithm. It's the same approach where if you encounter an older ticket (there called transaction timestamp) you drop all locked rows and retry (or abort) the transaction. If you encounter a newer ticket when trying to grab a lock simply do a blocking wait. So ticket/reservation in Maartens patches is the analog of timestamp/transaction. -Daniel
On Tue, 2013-04-02 at 16:57 +0200, Maarten Lankhorst wrote: > Hey, > > Thanks for reviewing. Only partway through so far :-) > Op 02-04-13 13:00, Peter Zijlstra schreef: > > On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote: > >> +Reservation type mutexes > >> +struct ticket_mutex { > >> +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock, > > That's two different names and two different forms of one (for a total > > of 3 variants) for the same scheme. > > > > FAIL... > It's been hard since I haven't seen anything similar in the kernel, I > originally went with tickets since that's what ttm originally called > it, and tried to kill as many references as I could when I noticed > ticket mutexes already being taken. Ticket mutexes as such don't exist, but we have ticket based spinlock implementations. It seems a situation ripe for confusion to have two locking primitives (mutex, spinlock) with similar names (ticket) but vastly different semantics. > I'll fix up the ticket_mutex -> reservation_mutex, and mutex_reserve_* > -> reserve_mutex_* Do a google for "lock reservation" and observe the results.. its some scheme where they pre-assign lock ownership to the most likely thread. > > On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote: > >> +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow: > >> + Similar to mutex_reserve_lock, except it won't backoff with > >> -EAGAIN. > >> + This is useful when mutex_reserve_lock failed with -EAGAIN, and you > >> + unreserved all reservation_locks so no deadlock can occur. > >> + > > I don't particularly like these function names, with lock > > implementations the _slow post-fix is typically used for slow path > > implementations, not API type interfaces. > I didn't intend for drivers to use the new calls directly, but rather > through a wrapper, for example by ttm_eu_reserve_buffers in > drivers/gpu/drm/ttm/ttm_execbuf_util.c You're providing a generic interface to the core kernel, other people will end up using it. Providing a proper API is helpful. > > Also, is there anything in CS literature that comes close to this? I'd > > think the DBMS people would have something similar with their > > transactional systems. What do they call it? > I didn't study cs, but judging from your phrasing I guess you mean you > want me to call it transaction_mutexes instead? Nah, me neither, I just hate reinventing names for something that's already got a perfectly fine name under which a bunch of people know it. See the email from Daniel, apparently its known as wound-wait deadlock avoidance -- its actually described in the "deadlock" wikipedia article. So how about we call the thing something like: struct ww_mutex; /* wound/wait */ int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */ int mutex_wait_lock(struct ww_mutex *); /* does not fail */ Hmm.. thinking about that,.. you only need that second variant because you don't have a clear lock to wait for the 'older' process to complete; but having the unconditional wait makes the entire thing prone to accidents and deadlocks when the 'user' (read your fellow programmers) make a mistake. Ideally we'd only have the one primitive that returns -EDEADLK and use a 'proper' mutex to wait on or somesuch.. let me ponder this a bit more. > > Head hurts, needs more time to ponder. It would be good if someone else > > (this would probably be you maarten) would also consider this explore > > this 'interesting' problem space :-) > My head too, evil priority stuff! > > Hacky but pragmatical workaround for now: use a real mutex around all > the reserve_mutex_lock* calls instead of a virtual lock. It can be > unlocked as soon as all locks have been taken, before any actual work > is done. > > It only slightly kills the point of having a reservation in the first > place, but at least it won't break completely -rt completely for now. Yeah, global lock, yay :-(
On Tue, Apr 2, 2013 at 6:59 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: >> > Head hurts, needs more time to ponder. It would be good if someone else >> > (this would probably be you maarten) would also consider this explore >> > this 'interesting' problem space :-) > >> My head too, evil priority stuff! >> >> Hacky but pragmatical workaround for now: use a real mutex around all >> the reserve_mutex_lock* calls instead of a virtual lock. It can be >> unlocked as soon as all locks have been taken, before any actual work >> is done. >> >> It only slightly kills the point of having a reservation in the first >> place, but at least it won't break completely -rt completely for now. > > Yeah, global lock, yay :-( We've discussed this quite a bit on irc and came up with a bunch of other ideas. The global lock is completely transparent to users, since the lockdep annotations already rely on ticket_init/fini being a virtual lock. So we can always fall back to that option. For more fancy approaches we need to consider the aim first - do we just want to prevent deadlocks through PI or do we aim for bounded per-reservation_mutex wait block-to-acquire times for the thread with highest rt-prio. If it's just the former I think we can get by by piggy-packing on top of the existing PI mutex code. Only downside is that threads can lock arbitrary many reservation locks and so we're looking at boosting an entire tree of processes. Otoh common operations done while holding such a lock are swapping buffer objects in or waiting for gpu rendering. And since we can easily queue up a few ms of rendering rt guarantees are out the window ;-) If that's not good enough and the global lock not scalable enough we could try to limit the fan-out by setting a PI-boost flag in the reservation ticket (in additional to the normal PI boosting for the reservation mutex). Threads which are boosted in that fashion will get a -EAGAIN on the next mutex_reserv_lock call, ensuring that the blocking doesn't spread to further threads. But that requires that we pass around pointers to tickets instead of values, so lifetime fun (atm the ticket is on the stack) and probably tons of races in updating the ticket boost state. I'd like to avoid that until we've demonstrated a need for it ... In any way I think that all three approaches should fit into the proposed interfaces, so we should be able to do something sane here. But since I have pretty much zero clue about rt I have no idea which of the first two approaches would be preferable. Cheers, Daniel
On Tue, Apr 2, 2013 at 6:59 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: >> > Also, is there anything in CS literature that comes close to this? I'd >> > think the DBMS people would have something similar with their >> > transactional systems. What do they call it? > >> I didn't study cs, but judging from your phrasing I guess you mean you >> want me to call it transaction_mutexes instead? > > Nah, me neither, I just hate reinventing names for something that's > already got a perfectly fine name under which a bunch of people know > it. > > See the email from Daniel, apparently its known as wound-wait deadlock > avoidance -- its actually described in the "deadlock" wikipedia > article. > > So how about we call the thing something like: > > struct ww_mutex; /* wound/wait */ > > int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */ > int mutex_wait_lock(struct ww_mutex *); /* does not fail */ I'm not sold on this prefix, since wound-wait is just the implementation detail of how it detects/handles deadlocks. For users a really dumb strategy of just doing a mutex trylock and always returning -EAGAIN if that fails (together with a msleep(rand) in the slowpath) would have the same interface. Almost at least, we could ditch the ticket - but the ticket is used as a virtual lock for the lockdep annotation, so ditching it would also reduce lockdep usefulness (due to all those trylocks). So in case we ever switch the deadlock/backoff algo ww_ would be a bit misleading. Otoh reservation_ is just what it's used for in graphics-land, so not that much better. I don't really have a good idea for what this is besides mutexes_with_magic_deadlock_detection_and_backoff. Which is a bit long. -Daniel
I'm sorry, this email ended up quite a bit longer than I had hoped for; please bear with me. On Tue, 2013-04-02 at 18:59 +0200, Peter Zijlstra wrote: > struct ww_mutex; /* wound/wait */ > > int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */ > int mutex_wait_lock(struct ww_mutex *); /* does not fail */ > > Hmm.. thinking about that,.. you only need that second variant because > you don't have a clear lock to wait for the 'older' process to > complete; but having the unconditional wait makes the entire thing > prone to accidents and deadlocks when the 'user' (read your fellow > programmers) make a mistake. > > Ideally we'd only have the one primitive that returns -EDEADLK and use > a 'proper' mutex to wait on or somesuch.. let me ponder this a bit > more. So I had me a little ponder.. Its all really about breaking the symmetry (equivalence of both sides) of the deadlock. Lets start with the simplest AB-BA deadlock: task-O task-Y A B A <-- blocks on O B <-- blocks on Y In this case the wound-wait algorithm says that the older task (O) has precedence over the younger task (Y) and we'll 'kill' Y to allow progress for O. Now clearly we don't really want to kill Y, instead we'll allow its lock operation to return -EDEADLK. This would suggest that our 'new' mutex implementation (ww_mutex henceforth) has owner tracking -- so O acquiring B can find Y to 'kill' -- and that we have a new wakeup state TASK_DEADLOCK similar to TASK_WAKEKILL (can we avoid this by using the extra state required below? I don't think so since we'd have the risk of waking Y while it would be blocked on a !ww_mutex) Now this gets a little more interesting if we change the scenario a little: task-O task-Y A B B <-- blocks on Y * <-- could be A In this case when O blocks Y isn't actually blocked, so our TASK_DEADLOCK wakeup doesn't actually achieve anything. This means we also have to track (task) state so that once Y tries to acquire A (creating the actual deadlock) we'll not wait so our TASK_DEADLOCK wakeup doesn't actually achieve anything. Note that Y doesn't need to acquire A in order to return -EDEADLK, any acquisition from the same set (see below) would return -EDEADLK even if there isn't an actual deadlock. This is the cost of heuristic; we could walk the actual block graph but that would be prohibitively expensive since we'd have to do this on every acquire. This raises the point of what would happen if Y were to acquire a 'regular' mutex in between the series of ww_mutexes. In this case we'd clearly have an actual deadlock scenario. However lockdep would catch this for we'd clearly invert the lock class order: ww_mutex -> other -> ww_mutex For the more complex scenarios there's the case of multiple waiters. In this case its desirable to order the waiters in age order so that we don't introduce age inversion -- this minimizes the amount of -EDEADLK occurrences, but also allows doing away with the unconditional wait. With the lock queue ordering we can simply immediately re-try the lock acquisition sequence. Since the older task is already queued it will obtain the lock, even if we re-queue ourselves before the lock is assigned. Now the one thing I've so far 'ignored' is how to assign age. Forward progress guarantees demand that the age doesn't get reset on retries; doing so would mean you're always pushing yourself fwd, decreasing your 'priority', never getting to be the most eligible. However it also means that we should (re)set the time every time we start a 'new' acquisition sequence. If we'd use a static (per task) age the task with the lowest age would be able to 'starve' all other. This means we need means to mark the start of an acquisition sequence; one that is not included in the restart loop. Having a start suggests having an end marker too, and indeed we can find a reason for having one; suppose our second scenario above, where Y has already acquired all locks it needs to proceed. In this case we would have marked Y to fail on another acquisition. This doesn't seem like a problem until you consider the case where we nest different mm_mutex sets. In this case Y would -EDEADLK on the first of the second (nested) set, even though re-trying that set is pointless, O doesn't care. Furthermore, considering the first scenario, O could block on a lock of the first set when Y is blocked on a lock of the second (nested) set; we need means of discerning the different lock sets. This cannot be done by local means, that is, any context created by the start/end markers would be local to that task and would not be recognizable by another as to belong to the same group. Instead we'd need to create a set-class, much like lockdep has and somehow ensure all locks in a set are of that class. One thing I'm not entirely sure of is the strict need for a local context it could be used to track the set-class, but I'm not entirely sure we need that to clean up state at the end marker. I haven't been creative enough to construct a fail case where both O and Y nest and a pending EDEADLK state could cross the end marker wrongly. However, having a local context, even if empty, does put a few constraints on the API usage, which as per below, is a good thing. To recap, we now have: struct ww_mutex_key { }; struct ww_mutex; struct ww_mutex_ctx; void __ww_mutex_init(struct ww_mutex *, void *); #define ww_mutex_init(ww_mutex) \ do { \ static struct ww_mutex_key __key; \ __ww_mutex_init(ww_mutex, &__key); \ } while (0) void ww_mutex_acquire_start(struct ww_mutex_ctx *); void ww_mutex_acquire_end (struct ww_mutex_ctx *); int ww_mutex_lock(struct ww_mutex_ctx *, struct ww_mutex *); void ww_mutex_unlock(struct ww_mutex *); Which are to be used like: int bufs_lock(bufs) { struct ww_mutex_ctx ww_ctx; ww_mutex_acquire_start(&ww_ctx); retry: for-all-buffers(buf, bufs) { err = ww_mutex_lock(&ww_ctx, &buf->lock); if (err) goto error; } ww_mutex_acquire_end(&ww_ctx); return 0; /* success */ error: for-all-buffers(buf2, bufs) { if (buf2 == buf) break; ww_mutex_unlock(&buf->lock); } if (err == -EDEADLK) goto again; return err; /* other error */ } void bufs_unlock(bufs) { for-all-buffers(buf, bufs) ww_mutex_unlock(&buf->lock); } This API has a few problems as per Rusty's API guidelines, it is fairly trivial to use it wrong; mostly the placement of the start and end markers are easy to get wrong, but of course one can get all kinds of badness from doing the retry wrong. At least having the local context forces one to use the start/end markers. The only remaining 'problem' left is RT, however the above suggests a very clean solution; change the lock queueing order to first sort on priority (be it the SCHED_FIFO/RR static priority or SCHED_DEADLINE dynamic priority) and secondly on the age. Note that (re)setting the age at every acquisition set is really only a means to provide FIFO fairness between tasks, RT tasks don't strictly require this, they have their own ordering. With all that this locking scheme should be deterministic; and in cases where the older task would block (the young task has passed the end marker) we can apply PI and push Y's priority. Did I miss something? Anyway, I haven't actually looked at the details of your implementation yet, I'll get to that next, but I think something like the above should be what we want to aim for. *phew* thanks for reading this far ! :-)
On Thu, Apr 04, 2013 at 02:01:48PM +0200, Peter Zijlstra wrote: > > I'm sorry, this email ended up quite a bit longer than I had hoped for; > please bear with me. > > On Tue, 2013-04-02 at 18:59 +0200, Peter Zijlstra wrote: > > struct ww_mutex; /* wound/wait */ > > > > int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */ > > int mutex_wait_lock(struct ww_mutex *); /* does not fail */ > > > > Hmm.. thinking about that,.. you only need that second variant because > > you don't have a clear lock to wait for the 'older' process to > > complete; but having the unconditional wait makes the entire thing > > prone to accidents and deadlocks when the 'user' (read your fellow > > programmers) make a mistake. > > > > Ideally we'd only have the one primitive that returns -EDEADLK and use > > a 'proper' mutex to wait on or somesuch.. let me ponder this a bit > > more. > > So I had me a little ponder.. I need to chew some more through this, but figured some early comments to clarify a few things would be good. > Its all really about breaking the symmetry (equivalence of both sides) > of the deadlock. Lets start with the simplest AB-BA deadlock: > > task-O task-Y > A > B > A <-- blocks on O > B <-- blocks on Y > > In this case the wound-wait algorithm says that the older task (O) has > precedence over the younger task (Y) and we'll 'kill' Y to allow > progress for O. > > Now clearly we don't really want to kill Y, instead we'll allow its > lock operation to return -EDEADLK. > > This would suggest that our 'new' mutex implementation (ww_mutex > henceforth) has owner tracking -- so O acquiring B can find Y to 'kill' > -- and that we have a new wakeup state TASK_DEADLOCK similar to > TASK_WAKEKILL (can we avoid this by using the extra state required > below? I don't think so since we'd have the risk of waking Y while it > would be blocked on a !ww_mutex) We do add some form of owner tracking by storing the reservation ticket of the current holder into every ww_mutex. So when task-Y in your above example tries to acquire lock A it notices that it's behind in the global queue and immedieately returns -EAGAIN to indicate the deadlock. Aside, there's a bit of fun in that ttm uses -EDEADLCK for when you try to reserve the same buffer twice - it can easily detect that by comparing the lock owner ticket with the provided one and if it matches bail out. The special error code is useful since userspace could prepare a gpu batch command submission which lists a buffer twice, which is a userspace bug with the current drm/gem apis. But since userspace usually doesn't try to trick the kernel it's better to detect this lazily, and due to the retry logic we have all the relevant error bail-out code anyway. Hence we've kept the special -EDEADLCK semantics even for the ww_mutex stuff. > Now this gets a little more interesting if we change the scenario a > little: > > task-O task-Y > A > B > B <-- blocks on Y > * <-- could be A The current code at least simple blocks task-O on B until task-Y unlocks B. Deadlocks cannot happen since if task-Y ever tries to acquire a lock which is held by an older task (e.g. lock A) it will bail out with -EAGAIN. > In this case when O blocks Y isn't actually blocked, so our > TASK_DEADLOCK wakeup doesn't actually achieve anything. > > This means we also have to track (task) state so that once Y tries to > acquire A (creating the actual deadlock) we'll not wait so our > TASK_DEADLOCK wakeup doesn't actually achieve anything. > > Note that Y doesn't need to acquire A in order to return -EDEADLK, any > acquisition from the same set (see below) would return -EDEADLK even if > there isn't an actual deadlock. This is the cost of heuristic; we could > walk the actual block graph but that would be prohibitively expensive > since we'd have to do this on every acquire. Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait times of older task. This could be interesting for RT, but I'm unsure of the implications. The trick with the current code is that the oldest task will never see an -EAGAIN ever and hence is guaranteed to make forward progress. If the task is really unlucky though it might be forced to wait for a younger task for every ww_mutex it tries to acquire. The thing is now that you're not expected to hold these locks for a long time - if you need to synchronously stall while holding a lock performance will go down the gutters anyway. And since most current gpus/co-processors still can't really preempt fairness isn't that high a priority, either. So we didn't think too much about that. > This raises the point of what would happen if Y were to acquire a > 'regular' mutex in between the series of ww_mutexes. In this case we'd > clearly have an actual deadlock scenario. However lockdep would catch > this for we'd clearly invert the lock class order: > > ww_mutex -> other -> ww_mutex > > For the more complex scenarios there's the case of multiple waiters. In > this case its desirable to order the waiters in age order so that we > don't introduce age inversion -- this minimizes the amount of -EDEADLK > occurrences, but also allows doing away with the unconditional wait. > > With the lock queue ordering we can simply immediately re-try the lock > acquisition sequence. Since the older task is already queued it will > obtain the lock, even if we re-queue ourselves before the lock is > assigned. > > Now the one thing I've so far 'ignored' is how to assign age. Forward > progress guarantees demand that the age doesn't get reset on retries; > doing so would mean you're always pushing yourself fwd, decreasing your > 'priority', never getting to be the most eligible. However it also > means that we should (re)set the time every time we start a 'new' > acquisition sequence. If we'd use a static (per task) age the task with > the lowest age would be able to 'starve' all other. > > This means we need means to mark the start of an acquisition sequence; > one that is not included in the restart loop. > > Having a start suggests having an end marker too, and indeed we can > find a reason for having one; suppose our second scenario above, where > Y has already acquired all locks it needs to proceed. In this case we > would have marked Y to fail on another acquisition. This doesn't seem > like a problem until you consider the case where we nest different > mm_mutex sets. In this case Y would -EDEADLK on the first of the second > (nested) set, even though re-trying that set is pointless, O doesn't > care. Another big reason for having a start/end marker like you've describe is lockdep support. Lockdep is oblivious to the magic deadlock avoidance, so when trying to just annotate the ww_mutexes themselves you can only annotate them as trylocks (which in a way they are). But that completely breaks lockdep warnings for the above mentioned > ww_mutex -> other -> ww_mutex loop. So we don't want that. The trick Maarten's patches employ is to add a fake/virtual lockdep lock for the reservation ticket/age itself, which is acquired in start and dropped again in end. All the ww_mutex locking is then annotated as blocking locks nested within that virtual mutex (this is already used in mm_take_all_locks). Maarten added a few patches to ensure that lockdep properly checks this nesting, too: commit d094595078d00b63839d0c5ccb8b184ef242cb45 Author: Maarten Lankhorst <maarten.lankhorst@canonical.com> Date: Thu Sep 13 11:39:51 2012 +0200 lockdep: Check if nested lock is actually held > Furthermore, considering the first scenario, O could block on a lock of > the first set when Y is blocked on a lock of the second (nested) set; > we need means of discerning the different lock sets. This cannot be > done by local means, that is, any context created by the start/end > markers would be local to that task and would not be recognizable by > another as to belong to the same group. > > Instead we'd need to create a set-class, much like lockdep has and > somehow ensure all locks in a set are of that class. I'm a bit confused about the different classes you're talking about. Since the ticket queue is currently a global counter there's only one class of ww_mutexes. I guess we could change that once a second user shows up, but currently with dma-bufs we _want_ all reservartion mutexes to be in the same class. And if you nest reservation mutexe loops (i.e. nest calls to acquire_start in your example below) we want that to fail. So currently we only have one lockdep class for the mutexes plus one class for the virtual lock everything nests in. > One thing I'm not entirely sure of is the strict need for a local > context it could be used to track the set-class, but I'm not entirely > sure we need that to clean up state at the end marker. I haven't been > creative enough to construct a fail case where both O and Y nest and a > pending EDEADLK state could cross the end marker wrongly. > > However, having a local context, even if empty, does put a few > constraints on the API usage, which as per below, is a good thing. > > To recap, we now have: > > struct ww_mutex_key { }; > struct ww_mutex; > struct ww_mutex_ctx; > > void __ww_mutex_init(struct ww_mutex *, void *); > > #define ww_mutex_init(ww_mutex) \ > do { \ > static struct ww_mutex_key __key; \ > __ww_mutex_init(ww_mutex, &__key); \ > } while (0) > > void ww_mutex_acquire_start(struct ww_mutex_ctx *); > void ww_mutex_acquire_end (struct ww_mutex_ctx *); > > int ww_mutex_lock(struct ww_mutex_ctx *, struct ww_mutex *); > void ww_mutex_unlock(struct ww_mutex *); > > Which are to be used like: > > int bufs_lock(bufs) > { > struct ww_mutex_ctx ww_ctx; > > ww_mutex_acquire_start(&ww_ctx); > retry: > for-all-buffers(buf, bufs) { > err = ww_mutex_lock(&ww_ctx, &buf->lock); > if (err) > goto error; > } > > ww_mutex_acquire_end(&ww_ctx); > > return 0; /* success */ > > error: > for-all-buffers(buf2, bufs) { > if (buf2 == buf) > break; > ww_mutex_unlock(&buf->lock); > } > if (err == -EDEADLK) > goto again; > > return err; /* other error */ > } > > void bufs_unlock(bufs) > { > for-all-buffers(buf, bufs) > ww_mutex_unlock(&buf->lock); > } > > This API has a few problems as per Rusty's API guidelines, it is fairly > trivial to use it wrong; mostly the placement of the start and end > markers are easy to get wrong, but of course one can get all kinds of > badness from doing the retry wrong. At least having the local context > forces one to use the start/end markers. I share your concerns about the api, it's way too easy to get wrong. But with the lockdep nesting checks we should at least be covered for the ordering in the fastpath. For the error paths themselves I've just thought about ranodm -EAGAIN injection as a debug option. With that we should at least be able to cover these paths sufficiently in regression test suites. Would be a new patch for Maarten to write though ;-) > The only remaining 'problem' left is RT, however the above suggests a > very clean solution; change the lock queueing order to first sort on > priority (be it the SCHED_FIFO/RR static priority or SCHED_DEADLINE > dynamic priority) and secondly on the age. > > Note that (re)setting the age at every acquisition set is really only a > means to provide FIFO fairness between tasks, RT tasks don't strictly > require this, they have their own ordering. > > With all that this locking scheme should be deterministic; and in cases > where the older task would block (the young task has passed the end > marker) we can apply PI and push Y's priority. We've discussed this approach of using (rt-prio, age) instead of just age to determine the the "oldness" of a task for deadlock-breaking with -EAGAIN. The problem is that through PI boosting or normal rt-prio changes while tasks are trying to acquire ww_mutexes you can create acyclic loops in the blocked-on graph of ww_mutexes after the fact and so cause deadlocks. So we've convinced ourselves that this approche doesn't work. The only somewhat similar approache we couldn't poke holes into is to set a PI-boost flag on the reservation ticket (your ww_ctx) of the task a higher-prio RT thread is blocking on. Any task who has set that flag on its ww_ctx would then unconditionally fail with -EAGAIN on the next ww_mutex_lock. See my other mail from yesterday for some of the ideas we've discussed about RT-compliant ww_mutexes on irc. > Did I miss something? > > Anyway, I haven't actually looked at the details of your implementation > yet, I'll get to that next, but I think something like the above should > be what we want to aim for. > > *phew* thanks for reading this far ! :-) Well, it was a good read and I'm rather happy that we agree on the ww_ctx thing (whatever it's called in the end), even though we have slightly different reasons for it. I don't really have a useful idea to make the retry handling for users more rusty-compliant though, and I'm still unhappy with all current naming proposals ;-) Cheers, Daniel
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > We do add some form of owner tracking by storing the reservation > ticket > of the current holder into every ww_mutex. So when task-Y in your > above > example tries to acquire lock A it notices that it's behind in the > global > queue and immedieately returns -EAGAIN to indicate the deadlock. > > Aside, there's a bit of fun in that ttm uses -EDEADLCK for when you > try to > reserve the same buffer twice - it can easily detect that by comparing > the > lock owner ticket with the provided one and if it matches bail out. Sure, but this should bear no influence on the design of the mutex primitive. At most we should ensure this situation is recognizable. > Hence we've kept the special -EDEADLCK semantics even for the ww_mutex > stuff. There's EDEADLK and EDEADLOCK, EDEADLCK does alas not exist :-) > > Now this gets a little more interesting if we change the scenario a > > little: > > > > task-O task-Y > > A > > B > > B <-- blocks on Y > > * <-- could be A > > The current code at least simple blocks task-O on B until task-Y > unlocks > B. Deadlocks cannot happen since if task-Y ever tries to acquire a > lock > which is held by an older task (e.g. lock A) it will bail out with > -EAGAIN. Agreed, O would have to wait until Y unlocks B. It was a 'detail' I skimped over for the sake of brevity. I was merely trying to illustrate the ways in which we must bring Y to return -EDEADLK (or -EAGAIN in your version).
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the > wait > times of older task. No, imagine the following: struct ww_mutex A, B; struct mutex C; task-O task-Y task-X A B C C B At this point O finds that Y owns B and thus we want to make Y 'yield' B to make allow B progress. Since Y is blocked, we'll send a wakeup. However Y is blocked on a different locking primitive; one that doesn't collaborate in the -EDEADLK scheme therefore we don't want the wakeup to succeed.
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > The trick with the current code is that the oldest task > will never see an -EAGAIN ever and hence is guaranteed to make forward > progress. If the task is really unlucky though it might be forced to > wait > for a younger task for every ww_mutex it tries to acquire. Agreed on that.. while I didn't state this my proposed thing should behave the same. It follows from the symmetry breaking in that only younger tasks can get 'kill'ed.
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > The thing is now that you're not expected to hold these locks for a > long > time - if you need to synchronously stall while holding a lock > performance > will go down the gutters anyway. And since most current > gpus/co-processors > still can't really preempt fairness isn't that high a priority, > either. > So we didn't think too much about that. Yeah but you're proposing a new synchronization primitive for the core kernel.. all such 'fun' details need to be considered, not only those few that bear on the one usecase.
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > Another big reason for having a start/end marker like you've describe > is > lockdep support. Yeah, I saw how you did that.. but there's other ways of making it work too, you could for instance create a new validation state for this type of lock. That said, I didn't consider lockdep too much, I first want the regular semantics clear.
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > I'm a bit confused about the different classes you're talking about. > Since > the ticket queue is currently a global counter there's only one class > of > ww_mutexes. Right, so that's not something that's going to fly.. we need to support multiple users, including nesting of those. Again, you're adding something to the generic kernel infrastructure, it had better be usable :-) > I guess we could change that once a second user shows up No, we fix that before it all goes in. I would _so_ hate to find out it cannot be 'fixed' and be stuck with a half-arsed sync primitive in the core kernel that's only every usable by the one existent user. So for now, forget all about TTM, DMA-BUF and other such coolness except to verify that whatever we end up with does indeed work for the case you need it for ;-)
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > We've discussed this approach of using (rt-prio, age) instead of just > age > to determine the the "oldness" of a task for deadlock-breaking with > -EAGAIN. The problem is that through PI boosting or normal rt-prio > changes > while tasks are trying to acquire ww_mutexes you can create acyclic > loops > in the blocked-on graph of ww_mutexes after the fact and so cause > deadlocks. So we've convinced ourselves that this approche doesn't > work. Could you pretty please draw me a picture, I'm having trouble seeing what you mean. AFAICT as long as you boost Y while its the lock owner things should work out, no?
On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > Well, it was a good read and I'm rather happy that we agree on the > ww_ctx > thing (whatever it's called in the end), even though we have slightly > different reasons for it. Yeah, I tried various weirdness to get out from under it, but the whole progress/fairness thing made it rather hard. Ideally you'd be able to use some existing scheduler state since its the same goal, but the whole wakeup-retry muck makes that hard. > I don't really have a useful idea to make the retry handling for users > more rusty-compliant though, and I'm still unhappy with all current > naming > proposals ;-) Ah, naming,.. yeah I'm not too terribly attached to most of them. I just want to avoid something that's reasonably well known to mean something different. Furthermore, since we use the wound/wait symmetry breaking it would make sense for that to appear somewhere in the name.
On Thu, Apr 4, 2013 at 3:31 PM, Daniel Vetter <daniel@ffwll.ch> wrote: >> In this case when O blocks Y isn't actually blocked, so our >> TASK_DEADLOCK wakeup doesn't actually achieve anything. >> >> This means we also have to track (task) state so that once Y tries to >> acquire A (creating the actual deadlock) we'll not wait so our >> TASK_DEADLOCK wakeup doesn't actually achieve anything. >> >> Note that Y doesn't need to acquire A in order to return -EDEADLK, any >> acquisition from the same set (see below) would return -EDEADLK even if >> there isn't an actual deadlock. This is the cost of heuristic; we could >> walk the actual block graph but that would be prohibitively expensive >> since we'd have to do this on every acquire. > > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait > times of older task. This could be interesting for RT, but I'm unsure of > the implications. The trick with the current code is that the oldest task > will never see an -EAGAIN ever and hence is guaranteed to make forward > progress. If the task is really unlucky though it might be forced to wait > for a younger task for every ww_mutex it tries to acquire. [Aside: I'm writing this while your replies trickle in, but I think it's not yet answered already.] Ok, I've discussed this a lot with Maarten on irc and I think I see a bit clearer now what's the aim with the new sleep state. Or at least I have an illusion about it ;-) So let me try to recap my understanding to check whether we're talking roughly about the same idea. I think for starters we need to have a slightly more interesting example: 3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M is in between. 2 ww_mutexes: A, B Y has already acquired ww_mutex A, M has already acquired ww_mutex B. Now O wants to acquire B and M wants to acquire A (let's ignore detailed ordering for now), resulting in O blocking on M (M holds B already, but O is older) and M blocking on Y (same for lock B). Now first question to check my understanding: Your aim with that special wakeup is to kick M so that it backs off and drops B? That way O does not need to wait for Y to complete whatever it's currently doing, unlock A and then in turn M to complete whatever it's doing so that it can unlock A&B and finally allows O to grab the lock. Presuming I'm still following we should be able to fix this with the new sleep state TASK_DEADLOCK and a flag somewhere in the thread info (let's call it PF_GTFO for simplicity). Then every time a task does a blocking wait on a ww_mutex it would set this special sleep state and also check the PF_GTFO bit. If the later is set, it bails out with -EAGAIN (so that all locks are dropped). Now if a task wants to take a lock and notices that it's held by a younger locker it can set that flag and wake the thread up (need to think about all the races a bit, but we should be able to make this work). Then it can do the normal blocking mutex slowpath and wait for the unlock. Now if O and M race a bit against each another M should either get woken (if it's already blocked on Y) and back off, or notice that the thread flag is set before it even tries to grab another mutex (and so before the block tree can extend further to Y). And the special sleep state is to make sure we don't cause any other spurious interrupts. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Apr 4, 2013 at 6:38 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: >> Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the >> wait >> times of older task. > > No, imagine the following: > > struct ww_mutex A, B; > struct mutex C; > > task-O task-Y task-X > A > B > C > C > B > > At this point O finds that Y owns B and thus we want to make Y 'yield' > B to make allow B progress. Since Y is blocked, we'll send a wakeup. > However Y is blocked on a different locking primitive; one that doesn't > collaborate in the -EDEADLK scheme therefore we don't want the wakeup to > succeed. Yeah, I've thought about this some more and the special sleep state seems to be only required to ensure we don't cause spurious wakeups for other any other reasons a task blocks. But I think we can use that kick-current-holder approach to ensure that older tasks get the lock in a more timely fashion than the current code. I've tried to detail it a bit with another 3 task example - that seems to be the point where the fun starts ;-) -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Apr 4, 2013 at 6:49 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: >> We've discussed this approach of using (rt-prio, age) instead of just >> age >> to determine the the "oldness" of a task for deadlock-breaking with >> -EAGAIN. The problem is that through PI boosting or normal rt-prio >> changes >> while tasks are trying to acquire ww_mutexes you can create acyclic >> loops >> in the blocked-on graph of ww_mutexes after the fact and so cause >> deadlocks. So we've convinced ourselves that this approche doesn't >> work. > > Could you pretty please draw me a picture, I'm having trouble seeing > what you mean. > > AFAICT as long as you boost Y while its the lock owner things should > work out, no? So this one here took a bit of pondering. But I think you'll like the conclusion. Now the deadlock detection works because it impose a dag onto all lockers which clearly denotes who'll wait and who will get wounded. The problem with using (rt_prio, ww_age/ticket) instead of just the ticket si that it's ambigous in the face of PI boosting. E.g. - rt_prio of A > rt_prio of B, but - task A is younger than task B Before boosting task A wins, but after boosting (when only the age matters) task B wins, since it's now older. So now when B comes around and tries to grab a lock A currently holds, it'll also block. Now the cool thing with TASK_KILLABLE (hey, I start to appreciate it, bear with me!) is that this is no problem - it limits the length of any chain of blocked tasks to just one link of blocked tasks: - If the length is currently one it cannot be extended - the blocked task will set the PF_GTFO flag on the current holder, and since that would be checked before blocking on another ww_mutex the chain can't grow past one blocked task. - If the current holder itself is blocked on another ww_mutex it'll be in TASK_DEADLOCK state and get woken up. In both case the current holder of the lock will either continue with what it's been doing (PI-boosted ofc) until it unlocks everything or hits the PF_GTFO check and backs off from all currently held locks. RT mission accomplished I think since the wait time for the highest-prio task for grabbing a lock is bounded by the lock hold time for the completely uncontended case. And within a given rt-prio the normal wait/wound algorithm will resolve conflicts (and ensure forward progress). So I'm now rather convinced that with the TASK_DEADLOCK implementation we can make (rt_prio, age) lock ordering work. But it's not an artifact of consistent PI boosting (I've raced down that alley first trying to prove it to no avail), but the fact that lock holder kicking through PF_GTFO naturally limits blocked task chains on ww_mutexes to just one link. That in turn makes any post-PI-boosted considerations moot and so can't result in havoc due to a PI-boost having flipped an edge in the lock_er_ ordering DAG (we sort tasks with wait/wound, not locks, hence it's not a lock ordering DAG). Please poke holes into this argument, I've probably missed a sublety somewhere ... -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote: > On Thu, Apr 4, 2013 at 3:31 PM, Daniel Vetter <daniel@ffwll.ch> wrote: > >> In this case when O blocks Y isn't actually blocked, so our > >> TASK_DEADLOCK wakeup doesn't actually achieve anything. > >> > >> This means we also have to track (task) state so that once Y tries to > >> acquire A (creating the actual deadlock) we'll not wait so our > >> TASK_DEADLOCK wakeup doesn't actually achieve anything. > >> > >> Note that Y doesn't need to acquire A in order to return -EDEADLK, any > >> acquisition from the same set (see below) would return -EDEADLK even if > >> there isn't an actual deadlock. This is the cost of heuristic; we could > >> walk the actual block graph but that would be prohibitively expensive > >> since we'd have to do this on every acquire. > > > > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait > > times of older task. This could be interesting for RT, but I'm unsure of > > the implications. The trick with the current code is that the oldest task > > will never see an -EAGAIN ever and hence is guaranteed to make forward > > progress. If the task is really unlucky though it might be forced to wait > > for a younger task for every ww_mutex it tries to acquire. > > [Aside: I'm writing this while your replies trickle in, but I think > it's not yet answered already.] > > Ok, I've discussed this a lot with Maarten on irc and I think I see a > bit clearer now what's the aim with the new sleep state. Or at least I > have an illusion about it ;-) So let me try to recap my understanding > to check whether we're talking roughly about the same idea. > > I think for starters we need to have a slightly more interesting example: > > 3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M > is in between. > 2 ww_mutexes: A, B > > Y has already acquired ww_mutex A, M has already acquired ww_mutex B. > > Now O wants to acquire B and M wants to acquire A (let's ignore > detailed ordering for now), resulting in O blocking on M (M holds B > already, but O is older) and M blocking on Y (same for lock B). drawing the picture for myself: task-O task-M task-Y A B B A > Now first question to check my understanding: Your aim with that > special wakeup is to kick M so that it backs off and drops B? That way > O does not need to wait for Y to complete whatever it's currently > doing, unlock A and then in turn M to complete whatever it's doing so > that it can unlock A&B and finally allows O to grab the lock. No, we always need to wait for locks to be unlocked. The sole purpose of the special wakeups state is to not wake other (!ww_mutex) locks that might be held by the task holding the contended ww_mutex. While all schedule() sites should deal with spurious wakeups its a sad fact of life that they do not :/ > Presuming I'm still following we should be able to fix this with the > new sleep state TASK_DEADLOCK and a flag somewhere in the thread info > (let's call it PF_GTFO for simplicity). I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are unsuitable since they are not atomic and we'd need to set it from another thread. > Then every time a task does a > blocking wait on a ww_mutex it would set this special sleep state and > also check the PF_GTFO bit. So its the contending task (O for B) setting PF_GTFO on the owning task (M for B), right? But yeah, all ww_mutex sleep states should have the new TASK_DEADLOCK sleep state added. > If the later is set, it bails out with > -EAGAIN (so that all locks are dropped). I would really rather see -EDEADLK for that.. > Now if a task wants to take a lock and notices that it's held by a > younger locker it can set that flag and wake the thread up (need to > think about all the races a bit, but we should be able to make this > work). Then it can do the normal blocking mutex slowpath and wait for > the unlock. Right. > Now if O and M race a bit against each another M should either get > woken (if it's already blocked on Y) and back off, or notice that the > thread flag is set before it even tries to grab another mutex ww_mutex, it should block just fine on regular mutexes and other primitives. > (and so > before the block tree can extend further to Y). And the special sleep > state is to make sure we don't cause any other spurious interrupts. Right, I think we're understanding one another here.
On Mon, Apr 08, 2013 at 12:39:24PM +0200, Peter Zijlstra wrote: > On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote: > > Presuming I'm still following we should be able to fix this with the > > new sleep state TASK_DEADLOCK and a flag somewhere in the thread info > > (let's call it PF_GTFO for simplicity). > > I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are > unsuitable since they are not atomic and we'd need to set it from > another thread. I think the PF_ flag is a non-starter for a different reason, too. To make different clases of ww_mutexes nest properly, we need to make sure that we don't dead/live-lock trying to wake up a task holdinga ww_mutex of class A, while that task is trying to acquire ww_mutexes of class B. Picture: task 1 task 2 task 3 holds lock B ticket_A = acquire_start(class A) ww_mutex_lock(lock A, ticket_A) busy doing something ticket_B = acquire_start(class B) ww_mutex_lock(lock B, ticket_B) -> contention with task 3 ticket_task2 = acquire_start(class A) ww_mutex_lock(lock A, ticket_task2) -> contention with task 1 If we go with the PF_GTFO task flag, task 2 will set it on task 1 (handwave locking/atomicity again here ;-). But taks 1 will only back off from any locks in class B. Or at least I think we should impose that restriction, since otherwise a ww_mutex acquire loop will leak out badly abstraction-wise and making nesting pretty much impossible in practice. But if we really want nesting to work transparently, then task 1 should be allowed to continue taking locks from class A (after an acquire_end(class B) call ofc). But then it will have missed the wakeup from task 2, so task 2 blocks for too long and task 1 doesn't back off from further acquiring locks in class A. Presuming my reasoning for the rt case isn't broken, this break deadlock avoidance if we sort by (rt_prio, ticket). So I think we need to move the PF_GTFO flag to the ticket to make sure that task1 will notice that there's contention with one of the locks it holds from class A and that it better gtfo asap (i.e. back off, drop all currently held locks in class A and go into the slowpath). But I still need to think the nesting case through a bit more (and draw some diagrams), so not sure yet. One thing I still need to ponder is how to best stack tickets (for tasks doing nested ww_mutex locking) and where all we need a pointer to relevant tickets and what kind of fun this entails ... I need to ponder this some more. > > Then every time a task does a > > blocking wait on a ww_mutex it would set this special sleep state and > > also check the PF_GTFO bit. > > So its the contending task (O for B) setting PF_GTFO on the owning task > (M for B), right? Yeah, the idea is that the contending task sets this bit on the current holder (whether we put that bit into the task or the ticket), so that the current owner can back off and drop all currently held locks at the next opportunity (either due to being woken up in TASK_DEADLCOK state or in the next ww_mutex_lock call). > But yeah, all ww_mutex sleep states should have the new TASK_DEADLOCK > sleep state added. > > > If the later is set, it bails out with > > -EAGAIN (so that all locks are dropped). > > I would really rather see -EDEADLK for that.. I agree it makes more sense for a general api (outside of ttm), but I struggle a bit to come up with a good errno for the "you hold this lock already" case. Best I could do is -EBUSY ... Hm, looking through errno.h I've just spotted -EALREADY. Seems to be used all over (not just networking), so I guess we could extend it's meaning a bit? > > Now if a task wants to take a lock and notices that it's held by a > > younger locker it can set that flag and wake the thread up (need to > > think about all the races a bit, but we should be able to make this > > work). Then it can do the normal blocking mutex slowpath and wait for > > the unlock. > > Right. > > > Now if O and M race a bit against each another M should either get > > woken (if it's already blocked on Y) and back off, or notice that the > > thread flag is set before it even tries to grab another mutex > > ww_mutex, it should block just fine on regular mutexes and other > primitives. Yes, that special behaviour should only apply to ww_mutexes not to any other locking. I've not been terribly careful here ... -Daniel
On Tue, Apr 02, 2013 at 06:59:14PM +0200, Peter Zijlstra wrote: > > So how about we call the thing something like: > > struct ww_mutex; /* wound/wait */ Reading this I can't help but think of Elmer Fudd saying "Round Robin" as "Wound Wobin" -- Steve > > int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */ > int mutex_wait_lock(struct ww_mutex *); /* does not fail */ >
On Thu, Apr 04, 2013 at 06:38:36PM +0200, Peter Zijlstra wrote: > On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the > > wait > > times of older task. > > No, imagine the following: > > struct ww_mutex A, B; > struct mutex C; > > task-O task-Y task-X > A > B > C > C > B > > At this point O finds that Y owns B and thus we want to make Y 'yield' > B to make allow B progress. Since Y is blocked, we'll send a wakeup. > However Y is blocked on a different locking primitive; one that doesn't > collaborate in the -EDEADLK scheme therefore we don't want the wakeup to > succeed. I'm confused to why the above is a problem. Task-X will eventually release C, and then Y will release B and O will get to continue. Do we have to drop them once the owner is blocked? Can't we follow the chain like the PI code does? -- Steve
On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote: > On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > > The thing is now that you're not expected to hold these locks for a > > long > > time - if you need to synchronously stall while holding a lock > > performance > > will go down the gutters anyway. And since most current > > gpus/co-processors > > still can't really preempt fairness isn't that high a priority, > > either. > > So we didn't think too much about that. > > Yeah but you're proposing a new synchronization primitive for the core > kernel.. all such 'fun' details need to be considered, not only those > few that bear on the one usecase. Which bares the question, what other use cases are there? -- Steve
On Thu, Apr 04, 2013 at 06:56:58PM +0200, Daniel Vetter wrote: > > I think for starters we need to have a slightly more interesting example: > > 3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M > is in between. > 2 ww_mutexes: A, B > > Y has already acquired ww_mutex A, M has already acquired ww_mutex B. > Now I probably missed it in the thread somewhere, but what makes task O the oldest and Y the youngest? Is it the actual time from when the task was created? What about setting an age as soon as it starts the process of grabbing one of these locks? And it keeps the age until it successfully grabs and releases all the locks again. It wont reset if it had to drop the locks and start over. Or am I totally not understanding what's going on here? Which could also be the case ;-) -- Steve
On Tue, 2013-04-09 at 18:42 -0400, Steven Rostedt wrote: > What about setting an age as soon as it starts the process > of grabbing one of these locks? And it keeps the age until it > successfully grabs and releases all the locks again. It wont reset if > it > had to drop the locks and start over. That is indeed the proposed mechanism. It ensures FIFO fairness between the various threads that try to acquire a set.
On Wed, Apr 10, 2013 at 12:27 AM, Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, Apr 04, 2013 at 06:38:36PM +0200, Peter Zijlstra wrote: >> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: >> > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the >> > wait >> > times of older task. >> >> No, imagine the following: >> >> struct ww_mutex A, B; >> struct mutex C; >> >> task-O task-Y task-X >> A >> B >> C >> C >> B >> >> At this point O finds that Y owns B and thus we want to make Y 'yield' >> B to make allow B progress. Since Y is blocked, we'll send a wakeup. >> However Y is blocked on a different locking primitive; one that doesn't >> collaborate in the -EDEADLK scheme therefore we don't want the wakeup to >> succeed. > > I'm confused to why the above is a problem. Task-X will eventually > release C, and then Y will release B and O will get to continue. Do we > have to drop them once the owner is blocked? Can't we follow the chain > like the PI code does? Just waiting until every task already holding a lock completes and unlucks it is indeed a viable solution - it's the currently implemented algorithm in ttm and Maarten's current patches. The nice thing with Peter's wakeup idea on top is: - It bounds blocked times. - And (at least I think so) it's the key thing making PI boosting possible without any ugly PI inversion deadlocks happening. See Message-ID: <CAKMK7uEUdtiDDCRPwpiumkrST6suFY7YuQcPAXR_nJ0XHKzsAw@mail.gmail.com> for my current reasoning about this (I have not yet managed to poke a hole into it). -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Tue, Apr 09, 2013 at 06:28:08PM -0400, Steven Rostedt wrote: > On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote: > > On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > > > The thing is now that you're not expected to hold these locks for a > > > long > > > time - if you need to synchronously stall while holding a lock > > > performance > > > will go down the gutters anyway. And since most current > > > gpus/co-processors > > > still can't really preempt fairness isn't that high a priority, > > > either. > > > So we didn't think too much about that. > > > > Yeah but you're proposing a new synchronization primitive for the core > > kernel.. all such 'fun' details need to be considered, not only those > > few that bear on the one usecase. > > Which bares the question, what other use cases are there? Tbh I don't see any other either - but I agree with Peter and thinking things through and making the api a bit more generic seems to help in clarifying the semantics. Reminds me that I still need to draw a few diagrams ;-) -Daniel
On Mon, Apr 08, 2013 at 01:50:26PM +0200, Daniel Vetter wrote: > On Mon, Apr 08, 2013 at 12:39:24PM +0200, Peter Zijlstra wrote: > > On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote: > > > Presuming I'm still following we should be able to fix this with the > > > new sleep state TASK_DEADLOCK and a flag somewhere in the thread info > > > (let's call it PF_GTFO for simplicity). > > > > I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are > > unsuitable since they are not atomic and we'd need to set it from > > another thread. > > I think the PF_ flag is a non-starter for a different reason, too. To make > different clases of ww_mutexes nest properly, we need to make sure that we > don't dead/live-lock trying to wake up a task holdinga ww_mutex of class > A, while that task is trying to acquire ww_mutexes of class B. Picture: > > task 1 task 2 task 3 > holds lock B > ticket_A = acquire_start(class A) > ww_mutex_lock(lock A, ticket_A) busy doing something > > ticket_B = acquire_start(class B) > ww_mutex_lock(lock B, ticket_B) > -> contention with task 3 > > ticket_task2 = acquire_start(class A) > ww_mutex_lock(lock A, ticket_task2) > -> contention with task 1 > > If we go with the PF_GTFO task flag, task 2 will set it on task 1 > (handwave locking/atomicity again here ;-). But taks 1 will only back off > from any locks in class B. Or at least I think we should impose that > restriction, since otherwise a ww_mutex acquire loop will leak out badly > abstraction-wise and making nesting pretty much impossible in practice. > > But if we really want nesting to work transparently, then task 1 should be > allowed to continue taking locks from class A (after an acquire_end(class > B) call ofc). But then it will have missed the wakeup from task 2, so task > 2 blocks for too long and task 1 doesn't back off from further acquiring > locks in class A. > > Presuming my reasoning for the rt case isn't broken, this break deadlock > avoidance if we sort by (rt_prio, ticket). > > So I think we need to move the PF_GTFO flag to the ticket to make sure > that task1 will notice that there's contention with one of the locks it > holds from class A and that it better gtfo asap (i.e. back off, drop all > currently held locks in class A and go into the slowpath). > > But I still need to think the nesting case through a bit more (and draw > some diagrams), so not sure yet. One thing I still need to ponder is how > to best stack tickets (for tasks doing nested ww_mutex locking) and where > all we need a pointer to relevant tickets and what kind of fun this > entails ... I need to ponder this some more. Ok, so I think the nesting case should all work if we have a per-task/ticket flag to signal contention. The point I've mused over a bit is how to get at both the flag and the corresponding task to do the wakeup. I think for non-rt the simplest way is to store a pointer to the ticket in the ww_mutex, and the ticket the holds both the contention flag and a pointer to the task. Lifetime of that stuff would be protected by the wait_lock from disappearing untimely, which should allow the lock/unlock fastpaths to set/clear it non-atomically - only careful places is in the contented unlock slowpath. So rough api sketch for nesting ww_mutexes: struct ww_class { atomic_t next_ticket; /* virtual lock to annotate the acquire_start/end sections. */ struct lock_class_key acquire_key; /* lockdep class of all ww_mutexes in this class */ struct lock_class_key mutex_key; /* for debug/tracepts */ const char *name }; DEFINE_WW_CLASS(name) /* ... and all the other usual magic init funcitons */ struct ww_acquire_ctx { struct task_struct *task; /* invariant */ usinged ticket; /* invariant */ /* atomic is probably overkill, but need to review the resulting * code with all the barriers first. */ atomic_t contention; } void ww_acquire_start(struct ww_acuire_ctx *ctx, struct ww_class *class); void ww_acquire_end(struct ww_acuire_ctx *ctx); Originally I've thought we could store the pointer to the current acquire ctx in task_struct (since we need that handy for PI boosting anyway), but that makes things a bit ugly with nesting. Having a (somewhat) redundant pointer to the acquiring taks in the ctx seemed less evil. struct ww_mutex { struct mutex base; struct ww_acquire_ctx *holding_ctx; } I've considered whether we should try to pull clever tricks like with lockdep keys and make the ww_class more implicit (by auto-instantiating it somewhere and adding a pointer to it in each ww_mutex). But I think we should not hide this complexity, but instead make it explicit. All the ww_mutex_(un)lock* variants would then also need to take the context instead of the ticket. Besides the usual return codes matching the normal mutex functions we'd have: -EDEADLK: Deadlock detected (or got kicked by a higher-prio task in RT), drop all currently held locks (of the same class) and block on this lock (since it's the contention point) with ww_mutex_lock_slow. -EALREADY: Lock already held by the same acquire context. We need this to be able to efficiently yell at evil userspace. If ww_mutex_lock goes into the slowpath and it's not one of the above cases (i.e. wait, not wound) then it sets the contention flag (unconditionally) and wakes up the current holder iff the current holder is in the TASK_DEADLOCK state. If ww_mutex_lock gets woken up it checks the contextion flag on the acquire context. If it is set it backs off and returns -EDEADLK. Obviously it also needs to check that before blocking, and we might even want to check unconditionally before attempting any lock operation (to ensure we get out of the way of higher-prio tasks quicker). One thing we probably need is to make TASK_DEADLOCK a flag, since we still want the different interruptible types of blocking. At least i915 goes to great lengths to make sure that we take every lock which could be held while waiting for the gpu with interruptible sleeps (or at least killable). Did I miss anything? -Daniel
On Wed, Apr 10, 2013 at 12:28 AM, Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote: >> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: >> > The thing is now that you're not expected to hold these locks for a >> > long >> > time - if you need to synchronously stall while holding a lock >> > performance >> > will go down the gutters anyway. And since most current >> > gpus/co-processors >> > still can't really preempt fairness isn't that high a priority, >> > either. >> > So we didn't think too much about that. >> >> Yeah but you're proposing a new synchronization primitive for the core >> kernel.. all such 'fun' details need to be considered, not only those >> few that bear on the one usecase. > > Which bares the question, what other use cases are there? Just stumbled over one I think: If we have a big graph of connected things (happens really often for video pipelines). And we want multiple users to use them in parallel. But sometimes a configuration change could take way too long and so would unduly stall a 2nd thread with just a global mutex, then per-object ww_mutexes would be a fit: You'd start with grabbing all the locks for the objects you want to change anything with, then grab anything in the graph that you also need to check. Thanks to loop detection and self-recursion this would all nicely work out, even for cyclic graphs of objects. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Wed, Apr 17, 2013 at 09:08:17PM +0200, Daniel Vetter wrote: > On Wed, Apr 10, 2013 at 12:28 AM, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote: > >> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote: > >> > The thing is now that you're not expected to hold these locks for a > >> > long > >> > time - if you need to synchronously stall while holding a lock > >> > performance > >> > will go down the gutters anyway. And since most current > >> > gpus/co-processors > >> > still can't really preempt fairness isn't that high a priority, > >> > either. > >> > So we didn't think too much about that. > >> > >> Yeah but you're proposing a new synchronization primitive for the core > >> kernel.. all such 'fun' details need to be considered, not only those > >> few that bear on the one usecase. > > > > Which bares the question, what other use cases are there? > > Just stumbled over one I think: If we have a big graph of connected > things (happens really often for video pipelines). And we want > multiple users to use them in parallel. But sometimes a configuration > change could take way too long and so would unduly stall a 2nd thread > with just a global mutex, then per-object ww_mutexes would be a fit: > You'd start with grabbing all the locks for the objects you want to > change anything with, then grab anything in the graph that you also > need to check. Thanks to loop detection and self-recursion this would > all nicely work out, even for cyclic graphs of objects. Indeed, that would make the locking for atomic modeset/page flip very easy to handle, while still allowing the use of suitable fine grained locks. I like the idea.
diff --git a/Documentation/reservation-mutex-design.txt b/Documentation/reservation-mutex-design.txt new file mode 100644 index 0000000..4e2866c --- /dev/null +++ b/Documentation/reservation-mutex-design.txt @@ -0,0 +1,136 @@ +Reservation type mutexes +--- + +Please read mutex-design.txt first, as it applies to reservation mutexes too. + +GPU's do operations that commonly involve many buffers. Those buffers +can be shared across contexts/processes, exist in different memory +domains (for example VRAM vs system memory), and so on. And with +PRIME / dmabuf, they can even be shared across devices. So there are +a handful of situations where the driver needs to wait for buffers to +become ready. If you think about this in terms of waiting on a buffer +mutex for it to become available, this presents a problem because +there is no way to guarantee that buffers appear in a execbuf/batch in +the same order in all contexts. That is directly under control of +userspace, and a result of the sequence of GL calls that an application +makes. Which results in the potential for deadlock. The problem gets +more complex when you consider that the kernel may need to migrate the +buffer(s) into VRAM before the GPU operates on the buffer(s), which +may in turn require evicting some other buffers (and you don't want to +evict other buffers which are already queued up to the GPU), but for a +simplified understanding of the problem you can ignore this. + +The algorithm that TTM came up with for dealing with this problem is +quite simple. For each group of buffers (execbuf) that need to be +locked, the caller would be assigned a unique reservation_id, from a +global counter. In case of deadlock while locking all the buffers +associated with a execbuf, the one with the lowest reservation_id +wins, and the one with the higher reservation_id unlocks all of the +buffers that it has already locked, and then tries again. + +How it is used: +--------------- + +A very simplified version: + + int lock_execbuf(execbuf, ticket) + { + struct buf *res_buf = NULL; + + /* acquiring locks, before queuing up to GPU: */ + *ticket = assign_global_seqno(); + + retry: + for (buf in execbuf->buffers) { + if (buf == res_buf) { + res_buf = NULL; + continue; + } + ret = mutex_reserve_lock(&buf->lock, ticket, ticket->seqno); + if (ret < 0) + goto err; + } + + /* now everything is good to go, submit job to GPU: */ + ... + + return 0; + + err: + for (all buf2 before buf in execbuf->buffers) + mutex_unreserve_unlock(&buf2->lock); + if (res_buf) + mutex_unreserve_unlock(&res_buf->lock); + + if (ret == -EAGAIN) { + /* we lost out in a seqno race, lock and retry.. */ + mutex_reserve_lock_slow(&buf->lock, ticket, ticket->seqno); + res_buf = buf; + goto retry; + } + release_global_seqno(ticket); + + return ret; + } + + int unlock_execbuf(execbuf, ticket) + { + /* when GPU is finished; */ + for (buf in execbuf->buffers) + mutex_unreserve_unlock(&buf->lock); + release_global_seqno(ticket); + } + +Functions: +---------- + +mutex_reserve_lock, and mutex_reserve_lock_interruptible: + Lock a reservation_lock with a reservation_id set. reservation_id must not + be set to 0, since this is a special value that means no reservation_id. + + Normally if reservation_id is not set, or is older than the + reservation_id which is currently set on the mutex, the behavior will + be to wait normally. However, if the reservation_id is newer than + the current reservation_id, -EAGAIN will be returned. + + These functions will return -EDEADLK instead of -EAGAIN if + reservation_id is the same as the reservation_id that's attempted to + lock the mutex with, since in that case you presumably attempted to + lock the same lock twice. + +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow: + Similar to mutex_reserve_lock, except it won't backoff with -EAGAIN. + This is useful when mutex_reserve_lock failed with -EAGAIN, and you + unreserved all reservation_locks so no deadlock can occur. + +mutex_unreserve_unlock: + Unlock a reservation_lock reserved with one of the mutex_reserve_*lock* + calls. + +Design: + ticket_mutex encapsulates a struct mutex, this means no extra overhead for + normal mutex locks, which are far more common. As such there is only a small + increase in code size if reservation mutexes are not used. + + In general, not much contention is expected. The locks are intended to + serialize access to resources for devices. The only way to make wakeups + smarter would be at the cost of adding a field to struct mutex_waiter. This + would add overhead to all cases where normal mutexes are used, and + ticket_mutexes are less performance sensitive. For this reason + mutex_reserve_lock wakes up all waiters, so they can recheck themselves. + +Lockdep: + Lockdep operates on classes of locks. Ideally there should be only 2 lockdep + classes, one for the reservation_id itself, and one shared across all + reservation mutexes. + + It may seem odd that reservation_id needs a lock class, but it's needed for + lockdep to work correctly. Because the lock class of all reservation mutexes + are equal, without nesting lockdep will complain. Using multiple lockdep + classes will result in reduced effectiveness. + + The reason only a single lockdep class is wanted for reservation mutexes is + because it's always nested to the reservation_id's "lock", or only a single + lock is taken without any nesting at all. In the latter case no more than 1 + mutex should be locked at the same time, which lockdep can only verify + if the lock class is always the same. diff --git a/include/linux/mutex.h b/include/linux/mutex.h index 9121595..602c247 100644 --- a/include/linux/mutex.h +++ b/include/linux/mutex.h @@ -62,6 +62,11 @@ struct mutex { #endif }; +struct ticket_mutex { + struct mutex base; + atomic_long_t reservation_id; +}; + /* * This is the control structure for tasks blocked on mutex, * which resides on the blocked task's kernel stack: @@ -109,12 +114,24 @@ static inline void mutex_destroy(struct mutex *lock) {} __DEBUG_MUTEX_INITIALIZER(lockname) \ __DEP_MAP_MUTEX_INITIALIZER(lockname) } +#define __TICKET_MUTEX_INITIALIZER(lockname) \ + { .base = __MUTEX_INITIALIZER(lockname) \ + , .reservation_id = ATOMIC_LONG_INIT(0) } + #define DEFINE_MUTEX(mutexname) \ struct mutex mutexname = __MUTEX_INITIALIZER(mutexname) extern void __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key); +static inline void __ticket_mutex_init(struct ticket_mutex *lock, + const char *name, + struct lock_class_key *key) +{ + __mutex_init(&lock->base, name, key); + atomic_long_set(&lock->reservation_id, 0); +} + /** * mutex_is_locked - is the mutex locked * @lock: the mutex to be queried @@ -133,26 +150,91 @@ static inline int mutex_is_locked(struct mutex *lock) #ifdef CONFIG_DEBUG_LOCK_ALLOC extern void mutex_lock_nested(struct mutex *lock, unsigned int subclass); extern void _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest_lock); + extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass); extern int __must_check mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass); +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock, + struct lockdep_map *nest_lock, + unsigned long reservation_id); + +extern int __must_check _mutex_reserve_lock_interruptible(struct ticket_mutex *, + struct lockdep_map *nest_lock, + unsigned long reservation_id); + +extern void _mutex_reserve_lock_slow(struct ticket_mutex *lock, + struct lockdep_map *nest_lock, + unsigned long reservation_id); + +extern int __must_check _mutex_reserve_lock_intr_slow(struct ticket_mutex *, + struct lockdep_map *nest_lock, + unsigned long reservation_id); + #define mutex_lock(lock) mutex_lock_nested(lock, 0) #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0) #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0) #define mutex_lock_nest_lock(lock, nest_lock) \ do { \ - typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ + typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ _mutex_lock_nest_lock(lock, &(nest_lock)->dep_map); \ } while (0) +#define mutex_reserve_lock(lock, nest_lock, reservation_id) \ +({ \ + typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ + _mutex_reserve_lock(lock, &(nest_lock)->dep_map, reservation_id); \ +}) + +#define mutex_reserve_lock_interruptible(lock, nest_lock, reservation_id) \ +({ \ + typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ + _mutex_reserve_lock_interruptible(lock, &(nest_lock)->dep_map, \ + reservation_id); \ +}) + +#define mutex_reserve_lock_slow(lock, nest_lock, reservation_id) \ +do { \ + typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ + _mutex_reserve_lock_slow(lock, &(nest_lock)->dep_map, reservation_id); \ +} while (0) + +#define mutex_reserve_lock_intr_slow(lock, nest_lock, reservation_id) \ +({ \ + typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ + _mutex_reserve_lock_intr_slow(lock, &(nest_lock)->dep_map, \ + reservation_id); \ +}) + #else extern void mutex_lock(struct mutex *lock); extern int __must_check mutex_lock_interruptible(struct mutex *lock); extern int __must_check mutex_lock_killable(struct mutex *lock); +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock, + unsigned long reservation_id); +extern int __must_check _mutex_reserve_lock_interruptible(struct ticket_mutex *, + unsigned long reservation_id); + +extern void _mutex_reserve_lock_slow(struct ticket_mutex *lock, + unsigned long reservation_id); +extern int __must_check _mutex_reserve_lock_intr_slow(struct ticket_mutex *, + unsigned long reservation_id); + +#define mutex_reserve_lock(lock, nest_lock, reservation_id) \ + _mutex_reserve_lock(lock, reservation_id) + +#define mutex_reserve_lock_interruptible(lock, nest_lock, reservation_id) \ + _mutex_reserve_lock_interruptible(lock, reservation_id) + +#define mutex_reserve_lock_slow(lock, nest_lock, reservation_id) \ + _mutex_reserve_lock_slow(lock, reservation_id) + +#define mutex_reserve_lock_intr_slow(lock, nest_lock, reservation_id) \ + _mutex_reserve_lock_intr_slow(lock, reservation_id) + # define mutex_lock_nested(lock, subclass) mutex_lock(lock) # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock) # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock) @@ -167,6 +249,8 @@ extern int __must_check mutex_lock_killable(struct mutex *lock); */ extern int mutex_trylock(struct mutex *lock); extern void mutex_unlock(struct mutex *lock); +extern void mutex_unreserve_unlock(struct ticket_mutex *lock); + extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX diff --git a/kernel/mutex.c b/kernel/mutex.c index 84a5f07..d6999a5 100644 --- a/kernel/mutex.c +++ b/kernel/mutex.c @@ -127,16 +127,116 @@ void __sched mutex_unlock(struct mutex *lock) EXPORT_SYMBOL(mutex_unlock); +/** + * mutex_unreserve_unlock - release the mutex + * @lock: the mutex to be released + * + * Unlock a mutex that has been locked by this task previously + * with _mutex_reserve_lock*. + * + * This function must not be used in interrupt context. Unlocking + * of a unlocked mutex is not allowed. + */ +void __sched mutex_unreserve_unlock(struct ticket_mutex *lock) +{ + /* + * The unlocking fastpath is the 0->1 transition from 'locked' + * into 'unlocked' state: + */ + + /* + * mark mutex as no longer part of a reservation, next + * locker can set this again + */ +#ifdef CONFIG_DEBUG_MUTEXES + unsigned long rid; + + rid = atomic_long_xchg(&lock->reservation_id, 0); + + /* + * If this WARN_ON triggers, you used mutex_lock to acquire, + * but released with mutex_unreserve_unlock in this call. + */ + DEBUG_LOCKS_WARN_ON(!rid); +#else + atomic_long_set(&lock->reservation_id, 0); + + /* + * When debugging is enabled we must not clear the owner before time, + * the slow path will always be taken, and that clears the owner field + * after verifying that it was indeed current. + */ + mutex_clear_owner(&lock->base); +#endif + __mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath); +} +EXPORT_SYMBOL(mutex_unreserve_unlock); + +static inline int __sched +__mutex_lock_check_reserve(struct mutex *lock, unsigned long reservation_id) +{ + struct ticket_mutex *m = container_of(lock, struct ticket_mutex, base); + unsigned long cur_id; + + cur_id = atomic_long_read(&m->reservation_id); + if (!cur_id) + return 0; + + if (unlikely(reservation_id == cur_id)) + return -EDEADLK; + + if (unlikely(reservation_id - cur_id <= LONG_MAX)) + return -EAGAIN; + + return 0; +} + +/* + * after acquiring lock with fastpath or when we lost out in contested + * slowpath, set reservation_id and wake up any waiters so they can recheck. + * + * This function is never called when CONFIG_DEBUG_LOCK_ALLOC is set, + * as the fastpath and opportunistic spinning are disabled in that case. + */ +static __always_inline void +mutex_set_reservation_fastpath(struct ticket_mutex *lock, + unsigned long reservation_id) +{ + unsigned long flags; + struct mutex_waiter *cur; + + atomic_long_set(&lock->reservation_id, reservation_id); + + /* + * Check if lock is contended, if not there is nobody to wake up + */ + if (likely(atomic_read(&lock->base.count) == 0)) + return; + + /* + * Uh oh, we raced in fastpath, wake up everyone in this case, + * so they can see the new reservation_id + */ + spin_lock_mutex(&lock->base.wait_lock, flags); + list_for_each_entry(cur, &lock->base.wait_list, list) { + debug_mutex_wake_waiter(&lock->base, cur); + wake_up_process(cur->task); + } + spin_unlock_mutex(&lock->base.wait_lock, flags); +} + /* * Lock a mutex (possibly interruptible), slowpath: */ -static inline int __sched +static __always_inline int __sched __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, - struct lockdep_map *nest_lock, unsigned long ip) + struct lockdep_map *nest_lock, unsigned long ip, + unsigned long reservation_id, bool res_slow) { struct task_struct *task = current; struct mutex_waiter waiter; unsigned long flags; + int ret; preempt_disable(); mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip); @@ -163,6 +263,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, for (;;) { struct task_struct *owner; + if (!__builtin_constant_p(reservation_id) && !res_slow) { + ret = __mutex_lock_check_reserve(lock, reservation_id); + if (ret) + goto err_nowait; + } + /* * If there's an owner, wait for it to either * release the lock or go to sleep. @@ -173,6 +279,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, if (atomic_cmpxchg(&lock->count, 1, 0) == 1) { lock_acquired(&lock->dep_map, ip); + if (res_slow) { + struct ticket_mutex *m; + m = container_of(lock, struct ticket_mutex, base); + + mutex_set_reservation_fastpath(m, reservation_id); + } + mutex_set_owner(lock); preempt_enable(); return 0; @@ -228,15 +341,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * TASK_UNINTERRUPTIBLE case.) */ if (unlikely(signal_pending_state(state, task))) { - mutex_remove_waiter(lock, &waiter, - task_thread_info(task)); - mutex_release(&lock->dep_map, 1, ip); - spin_unlock_mutex(&lock->wait_lock, flags); + ret = -EINTR; + goto err; + } - debug_mutex_free_waiter(&waiter); - preempt_enable(); - return -EINTR; + if (!__builtin_constant_p(reservation_id) && !res_slow) { + ret = __mutex_lock_check_reserve(lock, reservation_id); + if (ret) + goto err; } + __set_task_state(task, state); /* didn't get the lock, go to sleep: */ @@ -251,6 +365,41 @@ done: mutex_remove_waiter(lock, &waiter, current_thread_info()); mutex_set_owner(lock); + if (!__builtin_constant_p(reservation_id)) { + struct ticket_mutex *m = container_of(lock, + struct ticket_mutex, + base); + struct mutex_waiter *cur; + + /* + * this should get optimized out for the common case, + * and is only important for _mutex_reserve_lock + */ + +#ifdef CONFIG_DEBUG_LOCK_ALLOC + unsigned long old_id; + old_id = atomic_long_xchg(&m->reservation_id, reservation_id); + + /* + * If this WARN_ON triggers, you used mutex_lock to acquire, + * but released with mutex_unreserve_unlock in this call. + */ + DEBUG_LOCKS_WARN_ON(old_id); +#else + atomic_long_set(&m->reservation_id, reservation_id); +#endif + + /* + * give any possible sleeping processes the chance to wake up, + * so they can recheck if they have to back off from + * reservations + */ + list_for_each_entry(cur, &lock->wait_list, list) { + debug_mutex_wake_waiter(lock, cur); + wake_up_process(cur->task); + } + } + /* set it to 0 if there are no waiters left: */ if (likely(list_empty(&lock->wait_list))) atomic_set(&lock->count, 0); @@ -261,6 +410,19 @@ done: preempt_enable(); return 0; + +err: + mutex_remove_waiter(lock, &waiter, task_thread_info(task)); + spin_unlock_mutex(&lock->wait_lock, flags); + debug_mutex_free_waiter(&waiter); + +#ifdef CONFIG_MUTEX_SPIN_ON_OWNER +err_nowait: +#endif + mutex_release(&lock->dep_map, 1, ip); + + preempt_enable(); + return ret; } #ifdef CONFIG_DEBUG_LOCK_ALLOC @@ -268,7 +430,8 @@ void __sched mutex_lock_nested(struct mutex *lock, unsigned int subclass) { might_sleep(); - __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_); + __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, + subclass, NULL, _RET_IP_, 0, 0); } EXPORT_SYMBOL_GPL(mutex_lock_nested); @@ -277,7 +440,8 @@ void __sched _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest) { might_sleep(); - __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, nest, _RET_IP_); + __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, + 0, nest, _RET_IP_, 0, 0); } EXPORT_SYMBOL_GPL(_mutex_lock_nest_lock); @@ -286,7 +450,8 @@ int __sched mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass) { might_sleep(); - return __mutex_lock_common(lock, TASK_KILLABLE, subclass, NULL, _RET_IP_); + return __mutex_lock_common(lock, TASK_KILLABLE, + subclass, NULL, _RET_IP_, 0, 0); } EXPORT_SYMBOL_GPL(mutex_lock_killable_nested); @@ -295,10 +460,63 @@ mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass) { might_sleep(); return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, - subclass, NULL, _RET_IP_); + subclass, NULL, _RET_IP_, 0, 0); } EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested); + +int __sched +_mutex_reserve_lock(struct ticket_mutex *lock, struct lockdep_map *nest, + unsigned long reservation_id) +{ + DEBUG_LOCKS_WARN_ON(!reservation_id); + + might_sleep(); + return __mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, + 0, nest, _RET_IP_, reservation_id, 0); +} +EXPORT_SYMBOL_GPL(_mutex_reserve_lock); + + +int __sched +_mutex_reserve_lock_interruptible(struct ticket_mutex *lock, + struct lockdep_map *nest, + unsigned long reservation_id) +{ + DEBUG_LOCKS_WARN_ON(!reservation_id); + + might_sleep(); + return __mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, + 0, nest, _RET_IP_, reservation_id, 0); +} +EXPORT_SYMBOL_GPL(_mutex_reserve_lock_interruptible); + +void __sched +_mutex_reserve_lock_slow(struct ticket_mutex *lock, struct lockdep_map *nest, + unsigned long reservation_id) +{ + DEBUG_LOCKS_WARN_ON(!reservation_id); + + might_sleep(); + __mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, 0, + nest, _RET_IP_, reservation_id, 1); +} +EXPORT_SYMBOL_GPL(_mutex_reserve_lock_slow); + +int __sched +_mutex_reserve_lock_intr_slow(struct ticket_mutex *lock, + struct lockdep_map *nest, + unsigned long reservation_id) +{ + DEBUG_LOCKS_WARN_ON(!reservation_id); + + might_sleep(); + return __mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, 0, + nest, _RET_IP_, reservation_id, 1); +} +EXPORT_SYMBOL_GPL(_mutex_reserve_lock_intr_slow); + + #endif /* @@ -401,20 +619,39 @@ __mutex_lock_slowpath(atomic_t *lock_count) { struct mutex *lock = container_of(lock_count, struct mutex, count); - __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_); + __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, + NULL, _RET_IP_, 0, 0); } static noinline int __sched __mutex_lock_killable_slowpath(struct mutex *lock) { - return __mutex_lock_common(lock, TASK_KILLABLE, 0, NULL, _RET_IP_); + return __mutex_lock_common(lock, TASK_KILLABLE, 0, + NULL, _RET_IP_, 0, 0); } static noinline int __sched __mutex_lock_interruptible_slowpath(struct mutex *lock) { - return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_); + return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, + NULL, _RET_IP_, 0, 0); +} + +static noinline int __sched +__mutex_lock_reserve_slowpath(struct ticket_mutex *lock, unsigned long rid) +{ + return __mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, 0, + NULL, _RET_IP_, rid, 0); +} + +static noinline int __sched +__mutex_lock_interruptible_reserve_slowpath(struct ticket_mutex *lock, + unsigned long rid) +{ + return __mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, 0, + NULL, _RET_IP_, rid, 0); } + #endif /* @@ -470,6 +707,63 @@ int __sched mutex_trylock(struct mutex *lock) } EXPORT_SYMBOL(mutex_trylock); +#ifndef CONFIG_DEBUG_LOCK_ALLOC +int __sched +_mutex_reserve_lock(struct ticket_mutex *lock, unsigned long rid) +{ + int ret; + + might_sleep(); + + ret = __mutex_fastpath_lock_retval(&lock->base.count); + + if (likely(!ret)) { + mutex_set_reservation_fastpath(lock, rid); + mutex_set_owner(&lock->base); + } else + ret = __mutex_lock_reserve_slowpath(lock, rid); + return ret; +} +EXPORT_SYMBOL(_mutex_reserve_lock); + +int __sched +_mutex_reserve_lock_interruptible(struct ticket_mutex *lock, unsigned long rid) +{ + int ret; + + might_sleep(); + + ret = __mutex_fastpath_lock_retval(&lock->base.count); + + if (likely(!ret)) { + mutex_set_reservation_fastpath(lock, rid); + mutex_set_owner(&lock->base); + } else + ret = __mutex_lock_interruptible_reserve_slowpath(lock, rid); + return ret; +} +EXPORT_SYMBOL(_mutex_reserve_lock_interruptible); + +void __sched +_mutex_reserve_lock_slow(struct ticket_mutex *lock, unsigned long rid) +{ + might_sleep(); + __mutex_lock_common(&lock->base, TASK_UNINTERRUPTIBLE, + 0, NULL, _RET_IP_, rid, 1); +} +EXPORT_SYMBOL(_mutex_reserve_lock_slow); + +int __sched +_mutex_reserve_lock_intr_slow(struct ticket_mutex *lock, unsigned long rid) +{ + might_sleep(); + return __mutex_lock_common(&lock->base, TASK_INTERRUPTIBLE, + 0, NULL, _RET_IP_, rid, 1); +} +EXPORT_SYMBOL(_mutex_reserve_lock_intr_slow); + +#endif + /** * atomic_dec_and_mutex_lock - return holding mutex if we dec to 0 * @cnt: the atomic which we are to dec
New version. All of the documentation has been moved from the commit log to Documentation/reservation-mutex-design.txt Missing at the moment, maybe TODO? Add a lockdep check in the *_slow calls that verifies that the lock being nested into has no nested lock any more? This would be a check to make sure that mutex_unreserve_unlock has been called on all other locks correctly. Changes since RFC patch v1: - Updated to use atomic_long instead of atomic, since the reservation_id was a long. - added mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow - removed mutex_locked_set_reservation_id (or w/e it was called) Changes since RFC patch v2: - remove use of __mutex_lock_retval_arg, add warnings when using wrong combination of mutex_(,reserve_)lock/unlock. Changes since v1: - Add __always_inline to __mutex_lock_common, otherwise reservation paths can be triggered from normal locks, because __builtin_constant_p might evaluate to false for the constant 0 in that case. Tests for this have been added in the next patch. - Updated documentation slightly. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> --- Documentation/reservation-mutex-design.txt | 136 ++++++++++++ include/linux/mutex.h | 86 +++++++ kernel/mutex.c | 326 +++++++++++++++++++++++++++- 3 files changed, 531 insertions(+), 17 deletions(-) create mode 100644 Documentation/reservation-mutex-design.txt