From patchwork Tue Mar 25 14:23:11 2014
From: Maarten Lankhorst
Subject: Re: [PATCH] drm/ttm: remove fence_lock
Date: Tue, 25 Mar 2014 15:23:11 +0100
Message-ID: <5331914F.6060906@canonical.com>
In-Reply-To: <532C38CF.7050106@shipmail.org>
To: Thomas Hellstrom
Cc: Daniel Vetter, dri-devel

Hey,

op 21-03-14 14:04, Thomas Hellstrom schreef:
> On 03/21/2014 01:12 PM, Maarten Lankhorst wrote:
>> Hey,
>>
>> op 21-03-14 09:27, Thomas Hellstrom schreef:
>>> On 03/21/2014 12:55 AM, Dave Airlie wrote:
>>>> On Fri, Oct 19, 2012 at 3:04 AM, Jerome Glisse wrote:
>>>>> On Thu, Oct 18, 2012 at 06:43:40PM +0200, Thomas Hellstrom wrote:
>>>>>> On 10/18/2012 04:45 PM, Maarten Lankhorst wrote:
>>>>>>> Op 18-10-12 13:55, Thomas Hellstrom schreef:
>>>>>>>> On 10/18/2012 01:38 PM, Maarten Lankhorst wrote:
>>>>>>>>> Op 18-10-12 13:02, Thomas Hellstrom schreef:
>>>>>>>>>> On 10/18/2012 10:37 AM, Maarten Lankhorst wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> Op 18-10-12 09:59, Thomas Hellstrom schreef:
>>>>>>>>>>>> On 10/18/2012 09:28 AM, Thomas Hellstrom wrote:
>>>>>>>>>>>>> Hi, Maarten,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you know I have been having my doubts about this change.
>>>>>>>>>>>>> To me it seems insane to be forced to read the fence pointer
>>>>>>>>>>>>> under a reserved lock, simply because when you take the
>>>>>>>>>>>>> reserve lock, another process may have it, and there is a
>>>>>>>>>>>>> substantial chance that that process will also be waiting
>>>>>>>>>>>>> for idle while holding the reserve lock.
>>>>>>>>>>> I think it makes perfect sense: the only time you want to read
>>>>>>>>>>> the fence is when you want to change the members protected by
>>>>>>>>>>> the reservation.
>>>>>>>>>> No, that's not true. A typical case (or the only case) is where
>>>>>>>>>> you want to do a map with no_wait semantics. You will want to
>>>>>>>>>> be able to access a buffer's results even if the eviction code
>>>>>>>>>> is about to schedule an unbind from the GPU, and have the
>>>>>>>>>> buffer reserved?
>>>>>>>>> Well, either block on reserve or return -EBUSY if reserved,
>>>>>>>>> presumably the former.
>>>>>>>>>
>>>>>>>>> ttm_bo_vm_fault does the latter already, anyway.
>>>>>>>> ttm_bo_vm_fault only trylocks to avoid a deadlock with mmap_sem;
>>>>>>>> it's really a waiting reserve, but for different reasons.
>>>>>>>> Typically a user-space app will keep asynchronous maps to TTM
>>>>>>>> during a buffer lifetime, and the buffer lifetime may be long as
>>>>>>>> user-space apps keep caches. That's not the same as accessing a
>>>>>>>> buffer after the GPU is done with it.
>>>>>>> Ah, indeed.
>>>>>>>>> You don't need to hold the reservation while performing the
>>>>>>>>> wait itself, though. You could check if ttm_bo_wait(no_wait_gpu
>>>>>>>>> = true) returns -EBUSY, and if so take a ref to the sync_obj
>>>>>>>>> member and then wait after unreserving. You won't reset the
>>>>>>>>> sync_obj member to NULL in that case, but that should be
>>>>>>>>> harmless. This will allow you to keep the reservations fast and
>>>>>>>>> short. Maybe a helper would be appropriate for this, since
>>>>>>>>> radeon and nouveau both seem to do this.
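(For concreteness, such a helper could look roughly like this, sketched
against the old sync_obj interface; the helper name and the exact
locking details are made up and untested:)

static int ttm_bo_wait_idle_unreserved(struct ttm_buffer_object *bo,
				       bool interruptible)
{
	struct ttm_bo_device *bdev = bo->bdev;
	void *sync_obj = NULL;
	int ret;

	/* caller holds the reservation */
	spin_lock(&bdev->fence_lock);
	ret = ttm_bo_wait(bo, false, interruptible, true /* no_wait */);
	if (ret == -EBUSY && bo->sync_obj)
		sync_obj = bdev->driver->sync_obj_ref(bo->sync_obj);
	spin_unlock(&bdev->fence_lock);

	ttm_bo_unreserve(bo);

	if (!sync_obj)
		return ret;

	/* the actual wait happens with the reservation dropped */
	ret = bdev->driver->sync_obj_wait(sync_obj, false, interruptible);
	bdev->driver->sync_obj_unref(&sync_obj);
	return ret;
}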
>>>>>>>> The problem is that as long as other users are waiting for idle
>>>>>>>> with the reservation held, for example to switch GPU engine or
>>>>>>>> to delete a GPU bind, waiting for reserve will in many cases
>>>>>>>> mean waiting for the GPU.
>>>>>>> This example sounds inefficient. I know nouveau can do this, but
>>>>>>> it essentially just stalls the gpu entirely. I guess you mean
>>>>>>> functions like nouveau_gem_object_close? It wouldn't surprise me
>>>>>>> if performance in nouveau could be improved simply by fixing
>>>>>>> those cases up instead, since it also stalls the application
>>>>>>> completely for other uses.
>>>>>>>
>>>>>> Please see the Nouveau cpu_prep patch as well.
>>>>>>
>>>>>> While there are a number of cases that can be fixed up, also in
>>>>>> Radeon, there's no way around the fact that reservation is a
>>>>>> heavyweight lock that, particularly on simpler hardware without
>>>>>> support for fence ordering with barriers and/or "semaphores" and
>>>>>> accelerated eviction, will be held while waiting for idle.
>>>>>>
>>>>>> As such, it is unsuitable to protect read access to the fence
>>>>>> pointer. If the intention is to keep a single fence pointer, I
>>>>>> think the best solution is to allow reading the fence pointer
>>>>>> outside reservation, but make sure this can be done atomically.
>>>>>> If the intention is to protect an array or list of fence pointers,
>>>>>> I think a spinlock is the better solution.
>>>>>>
>>>>>> /Thomas
>>>>> Just wanted to point out that, like Thomas, I am concerned about
>>>>> having to have the object reserved when waiting on its associated
>>>>> fence. I fear it will hurt us somehow.
>>>>>
>>>>> I will try to spend a couple of days stress testing your branch on
>>>>> radeon, trying to see if it hurts performance in any current use
>>>>> case.
>>>>>
>>>> I've been trying to figure out what to do with Maarten's patches
>>>> going forward, since they are essential for all kinds of SoC people.
>>>>
>>>> However, I'm still not 100% sure I believe that the problem is
>>>> anything more than a micro-optimisation, and possibly a premature
>>>> one at that. This needs someone to start suggesting benchmarks we
>>>> can run and a driver set to run them on; otherwise I'm starting to
>>>> tend towards this being an optimisation we can fix later.
>>>>
>>>> The other option is to somehow merge this stuff under an option that
>>>> allows us to test it using a command line option, but I don't think
>>>> that is sane either.
>>>>
>>>> So Maarten, Jerome, Thomas, please start discussing actual real
>>>> world tests you think merging this code will affect, and then I can
>>>> make a better consideration.
>>>>
>>>> Dave.
>>> Dave,
>>>
>>> This is IMHO a fundamental design discussion, not a
>>> micro-optimization.
>>>
>>> I'm pretty sure all sorts of horrendous things could be done to the
>>> DRM design without affecting real world application performance.
>>>
>>> In TTM, data protection is primarily done with spinlocks. This serves
>>> two purposes:
>>>
>>> a) You can't unnecessarily hold a data protection lock while waiting
>>> for the GPU, which is typically a very stupid thing to do (perhaps
>>> not so in this particular case), but when the sleep-while-atomic or
>>> locking-inversion kernel warning turns up, that should at least ring
>>> a bell. Trying to implement dma-buf fencing using the TTM fencing
>>> callbacks will AFAICT cause a locking inversion.
>>>
>>> b) Spinlocks essentially go away on UP systems. The whole reservation
>>> path was essentially lock-free on UP systems pre dma-buf integration,
>>> and with very few atomic locking operations even on SMP systems. It
>>> was all prompted by a test some years ago (by Eric Anholt, IIRC)
>>> showing that locking operations turned up quite high in Intel driver
>>> profiling.
>>>
>>> If we start protecting data with reservations while we also export
>>> the reservation locks, so that people find them a convenient
>>> "serialize work on this buffer" lock, all kinds of interesting things
>>> might happen, and we might end up in a situation similar to
>>> protecting everything with struct_mutex.
>>>
>>> This is why I dislike this change, in particular when there is a very
>>> simple remedy.
>>>
>>> But if I can get an ACK to convert the reservation object fence
>>> pointers and associated operations on them to be rcu-safe when I have
>>> some time left, I'd be prepared to accept this patch series in its
>>> current state.
>> RCU could be a useful way to deal with this. But in my case I've
>> shown there are very few places where it's needed; core ttm does not
>> need it at all. This is why I wanted to leave it to individual
>> drivers to implement it.
>>
>> I think kfree_rcu for freeing in the driver itself should be enough,
>> and obtaining a reference in the driver would require something like
>> this, similar to what's in rcuref.txt:
>>
>> rcu_read_lock();
>> f = rcu_dereference(bo->fence);
>> if (f && !kref_get_unless_zero(&f->kref))
>>         f = NULL;
>> rcu_read_unlock();
>>
>> if (f) {
>>         // do stuff here
>>         ...
>>
>>         // free fence
>>         kref_put(&f->kref, fence_put_with_kfree_rcu);
>> }
>>
>> Am I wrong, or is something like this enough to make sure the fence
>> is still alive?
> No, you're correct.
>
> And a bo check for idle would amount to this (since we don't care if
> the fence refcount is zero):
>
> rcu_read_lock();
> f = rcu_dereference(bo->fence);
> signaled = !f || f->signaled;
> rcu_read_unlock();
>
> /Thomas
>
This is very easy to implement when there is only one fence slot, with
bo->fence being equal to reservation_object->fence_excl. It appears to
break down slightly when I implement it on top of shared fences.
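For the shared case, an equivalent lockless idle check would have to
look something like the sketch below. This is only a sketch on top of
the structures from the diff that follows, not part of the patch; it
assumes the fence series' FENCE_FLAG_SIGNALED_BIT so that, as in your
snippet, no reference needs to be taken:

static inline bool
reservation_object_test_signaled_rcu(struct reservation_object *obj)
{
	struct reservation_object_shared *shared;
	struct fence *fence_excl;
	bool signaled = true;
	unsigned i, count;

	rcu_read_lock();
	fence_excl = rcu_dereference(obj->fence_excl);
	if (fence_excl)
		signaled = test_bit(FENCE_FLAG_SIGNALED_BIT,
				    &fence_excl->flags);

	shared = rcu_dereference(obj->shared);
	count = shared ? ACCESS_ONCE(shared->count) : 0;

	for (i = 0; signaled && i < count; ++i) {
		struct fence *f = ACCESS_ONCE(shared->fence[i]);

		/* no reference held, so only look at the signaled bit */
		signaled = test_bit(FENCE_FLAG_SIGNALED_BIT, &f->flags);
	}
	rcu_read_unlock();

	return signaled;
}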
My diff is against
git://git.linaro.org/people/sumit.semwal/linux-3.x.git for-next-fences.

shared_max_fence is held in reservation_object to prevent the
reallocation in reservation_object_reserve_shared_fence every time the
same reservation_object gets > 4 shared fences; if it happens once,
it's likely going to happen again on the same object.

Anyhow, does the below look sane to you? This has only been tested by
my compiler; I haven't checked if it boots.

---
commit c73b87d33fd08f7c1b0a381b08b3626128f8f3b8
Author: Maarten Lankhorst
Date:   Tue Mar 25 15:10:21 2014 +0100

    add rcu to fence

    HACK WIP DO NOT USE KILLS YOUR POT PLANTS PETS AND FAMILY

diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 96338bf7f457..0a88c10b3db9 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -134,7 +134,10 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 {
 	struct dma_buf *dmabuf;
 	struct reservation_object *resv;
+	struct reservation_object_shared *shared;
+	struct fence *fence_excl;
 	unsigned long events;
+	unsigned shared_count;
 
 	dmabuf = file->private_data;
 	if (!dmabuf || !dmabuf->resv)
@@ -148,14 +151,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 	if (!events)
 		return 0;
 
-	ww_mutex_lock(&resv->lock, NULL);
+	rcu_read_lock();
 
-	if (resv->fence_excl && (!(events & POLLOUT) ||
-				 resv->fence_shared_count == 0)) {
+	shared = rcu_dereference(resv->shared);
+	fence_excl = rcu_dereference(resv->fence_excl);
+	shared_count = shared ? ACCESS_ONCE(shared->count) : 0;
+
+	if (fence_excl && (!(events & POLLOUT) ||
+			   (!shared || shared_count == 0))) {
 		struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_excl;
 		unsigned long pevents = POLLIN;
 
-		if (resv->fence_shared_count == 0)
+		if (!shared || shared_count == 0)
 			pevents |= POLLOUT;
 
 		spin_lock_irq(&dmabuf->poll.lock);
@@ -167,19 +174,26 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 		spin_unlock_irq(&dmabuf->poll.lock);
 
 		if (events & pevents) {
-			if (!fence_add_callback(resv->fence_excl,
-						&dcb->cb, dma_buf_poll_cb))
+			if (!kref_get_unless_zero(&fence_excl->refcount)) {
+				/* force a recheck */
+				events &= ~pevents;
+				dma_buf_poll_cb(NULL, &dcb->cb);
+			} else if (!fence_add_callback(fence_excl, &dcb->cb,
+						       dma_buf_poll_cb)) {
 				events &= ~pevents;
-			else
+				fence_put(fence_excl);
+			} else {
 				/*
 				 * No callback queued, wake up any additional
 				 * waiters.
 				 */
+				fence_put(fence_excl);
 				dma_buf_poll_cb(NULL, &dcb->cb);
+			}
 		}
 	}
 
-	if ((events & POLLOUT) && resv->fence_shared_count > 0) {
+	if ((events & POLLOUT) && shared && shared_count > 0) {
 		struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_shared;
 		int i;
 
@@ -194,20 +208,34 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 		if (!(events & POLLOUT))
 			goto out;
 
-		for (i = 0; i < resv->fence_shared_count; ++i)
-			if (!fence_add_callback(resv->fence_shared[i],
-						&dcb->cb, dma_buf_poll_cb)) {
+		for (i = 0; i < shared_count; ++i) {
+			struct fence *fence = ACCESS_ONCE(shared->fence[i]);
+
+			if (!kref_get_unless_zero(&fence->refcount)) {
+				/*
+				 * fence refcount dropped to zero, this means
+				 * that the shared object has been freed from
+				 * under us. Call dma_buf_poll_cb and force a
+				 * recheck!
+				 */
 				events &= ~POLLOUT;
+				dma_buf_poll_cb(NULL, &dcb->cb);
 				break;
 			}
+			if (!fence_add_callback(fence, &dcb->cb,
+						dma_buf_poll_cb)) {
+				fence_put(fence);
+				events &= ~POLLOUT;
+				break;
+			}
+			fence_put(fence);
+		}
 
 		/* No callback queued, wake up any additional waiters. */
-		if (i == resv->fence_shared_count)
+		if (i == shared_count)
 			dma_buf_poll_cb(NULL, &dcb->cb);
 	}
 
 out:
-	ww_mutex_unlock(&resv->lock);
+	rcu_read_unlock();
 	return events;
 }

diff --git a/drivers/base/fence.c b/drivers/base/fence.c
index 8fff13fb86cf..be03a9e8ad8b 100644
--- a/drivers/base/fence.c
+++ b/drivers/base/fence.c
@@ -170,7 +170,7 @@ void release_fence(struct kref *kref)
 	if (fence->ops->release)
 		fence->ops->release(fence);
 	else
-		kfree(fence);
+		kfree_rcu(fence, rcu);
 }
 EXPORT_SYMBOL(release_fence);

diff --git a/include/linux/fence.h b/include/linux/fence.h
index 65f2a01ee7e4..d19620405c34 100644
--- a/include/linux/fence.h
+++ b/include/linux/fence.h
@@ -40,6 +40,7 @@ struct fence_cb;
  * struct fence - software synchronization primitive
  * @refcount: refcount for this fence
  * @ops: fence_ops associated with this fence
+ * @rcu: used for releasing fence with kfree_rcu
  * @cb_list: list of all callbacks to call
  * @lock: spin_lock_irqsave used for locking
  * @context: execution context this fence belongs to, returned by
@@ -73,6 +74,7 @@ struct fence_cb;
 struct fence {
 	struct kref refcount;
 	const struct fence_ops *ops;
+	struct rcu_head rcu;
 	struct list_head cb_list;
 	spinlock_t *lock;
 	unsigned context, seqno;
diff --git a/include/linux/reservation.h b/include/linux/reservation.h
index f3f57460a205..91576fabafdb 100644
--- a/include/linux/reservation.h
+++ b/include/linux/reservation.h
@@ -42,6 +42,7 @@
 #include <linux/ww_mutex.h>
 #include <linux/fence.h>
 #include <linux/slab.h>
+#include <linux/rcupdate.h>
 
 extern struct ww_class reservation_ww_class;
 
@@ -49,8 +50,12 @@ struct reservation_object {
 	struct ww_mutex lock;
 
 	struct fence *fence_excl;
-	struct fence **fence_shared;
-	u32 fence_shared_count, fence_shared_max;
+	u32 shared_max_fence;
+	struct reservation_object_shared {
+		struct rcu_head rcu;
+		u32 count;
+		struct fence *fence[];
+	} *shared;
 };
 
 static inline void
@@ -58,8 +63,8 @@ reservation_object_init(struct reservation_object *obj)
 {
 	ww_mutex_init(&obj->lock, &reservation_ww_class);
 
-	obj->fence_shared_count = obj->fence_shared_max = 0;
-	obj->fence_shared = NULL;
+	obj->shared = NULL;
+	obj->shared_max_fence = 4;
 	obj->fence_excl = NULL;
 }
 
@@ -70,11 +75,97 @@ reservation_object_fini(struct reservation_object *obj)
 	if (obj->fence_excl)
 		fence_put(obj->fence_excl);
 
-	for (i = 0; i < obj->fence_shared_count; ++i)
-		fence_put(obj->fence_shared[i]);
-	kfree(obj->fence_shared);
+	if (obj->shared) {
+		for (i = 0; i < obj->shared->count; ++i)
+			fence_put(obj->shared->fence[i]);
+
+		/*
+		 * This object should be dead and all references to it
+		 * must have been released, so no need to free with rcu.
+		 */
+		kfree(obj->shared);
+	}
 
 	ww_mutex_destroy(&obj->lock);
 }
 
+/*
+ * Reserve space to add a shared fence to a reservation_object,
+ * must be called with obj->lock held.
+ */
+static inline int
+reservation_object_reserve_shared_fence(struct reservation_object *obj)
+{
+	struct reservation_object_shared *shared, *old;
+	u32 max = obj->shared_max_fence;
+
+	if (obj->shared) {
+		if (obj->shared->count < max)
+			return 0;
+		max *= 2;
+	}
+
+	shared = kmalloc(offsetof(typeof(*shared), fence[max]), GFP_KERNEL);
+	if (!shared)
+		return -ENOMEM;
+
+	old = obj->shared;
+	if (old) {
+		shared->count = old->count;
+		memcpy(shared->fence, old->fence,
+		       old->count * sizeof(*old->fence));
+	} else {
+		shared->count = 0;
+	}
+
+	rcu_assign_pointer(obj->shared, shared);
+	obj->shared_max_fence = max;
+	if (old)
+		kfree_rcu(old, rcu);
+	return 0;
+}
+
+/*
+ * Add a fence to a shared slot. obj->lock must be held, and
+ * reservation_object_reserve_shared_fence must have been called.
+ */
+static inline void
+reservation_object_add_shared_fence(struct reservation_object *obj,
+				    struct fence *fence)
+{
+	unsigned i;
+
+	BUG_ON(obj->shared->count == obj->shared_max_fence);
+	fence_get(fence);
+
+	for (i = 0; i < obj->shared->count; ++i)
+		if (obj->shared->fence[i]->context == fence->context) {
+			struct fence *old = obj->shared->fence[i];
+
+			rcu_assign_pointer(obj->shared->fence[i], fence);
+			fence_put(old);
+			return;
+		}
+
+	obj->shared->fence[obj->shared->count] = fence;
+	/* make the new fence visible before bumping the count */
+	smp_wmb();
+	obj->shared->count++;
+}
+
+/*
+ * May be called after adding an exclusive fence to wipe all shared
+ * fences.
+ */
+static inline void
+reservation_object_clear_shared(struct reservation_object *obj)
+{
+	struct reservation_object_shared *old = obj->shared;
+	unsigned i;
+
+	if (!old)
+		return;
+
+	rcu_assign_pointer(obj->shared, NULL);
+	for (i = 0; i < old->count; ++i)
+		fence_put(old->fence[i]);
+	kfree_rcu(old, rcu);
+}
+
 #endif /* _LINUX_RESERVATION_H */
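For reference, the intended write side in a driver after command
submission would then look something like this (a sketch only; "resv"
and "fence" stand in for the driver's own objects, and error handling
is omitted):

	ww_mutex_lock(&resv->lock, NULL);

	ret = reservation_object_reserve_shared_fence(resv);
	if (!ret)
		reservation_object_add_shared_fence(resv, fence);

	ww_mutex_unlock(&resv->lock);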