From: Maarten Lankhorst
Date: Wed, 19 Dec 2012 14:57:56 +0100
Subject: Re: GPU lockup CP stall for more than 10000msec on latest vanilla git
Message-ID: <50D1C7E4.1060701@canonical.com>
In-Reply-To: <20121218161238.GA213@x4>
To: Markus Trippelsdorf
Cc: Michel Dänzer, dri-devel@lists.freedesktop.org
X-Patchwork-Submitter: Maarten Lankhorst
X-Patchwork-Id: 1895701

On 18-12-12 17:12, Markus Trippelsdorf wrote:
> On 2012.12.18 at 16:24 +0100, Maarten Lankhorst wrote:
>> On 18-12-12 14:38, Markus Trippelsdorf wrote:
>>> On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote:
>>>> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote:
>>>>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote:
>>>>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
>>>>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf wrote:
>>>>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
>>>>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf wrote:
>>>>>>>>>> As soon as I open the following website:
>>>>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
>>>>>>>>>>
>>>>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
>>>>>>>>> Is this a regression? Most likely a 3D driver bug unless you are only
>>>>>>>>> seeing it with specific kernels. What browser are you using and do
>>>>>>>>> you have hw accelerated webgl, etc. enabled? If so, what version of
>>>>>>>>> mesa are you using?
>>>>>>>> This is a regression, because it is caused by yesterdays merge of
>>>>>>>> drm-next by Linus. IOW I only see this bug when running a
>>>>>>>> v3.7-9432-g9360b53 kernel.
>>>>>>> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly:
>>>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
>>>>>> Yes, the commit above causes the issue.
>>>>>>
>>>>>> 2d6cc72 GPU lockups
>>>>> With 2d6cc72 reverted I get:
>>>>>
>>>>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------
>>>> Probably a separate issue, can you bisect this one as well?
>>> Yes. Git-bisect points to:
>>>
>>> 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit
>>> commit 85b144f860176ec18db927d6d9ecdfb24d9c6483
>>> Author: Maarten Lankhorst
>>> Date: Thu Nov 29 11:36:54 2012 +0000
>>>
>>>     drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock
>>>     held, v3
>>>
>>> (Please note that this bug is a little bit harder to reproduce. But
>>> when you scroll up and down for ~10 seconds on the webpage mentioned
>>> above it will trigger the oops.
>>> So while I'm not 100% sure that the issue is caused by exactly this
>>> commit, the vicinity should be right)
>>>
>> Those dmesg warnings sound suspicious, looks like something is going
>> very wrong there.
>>
>> Can you revert the one before it? "drm/radeon: allow move_notify to be
>> called without reservation" Reservation should be held at this point,
>> that commit got in accidentally.
>>
>> I doubt not holding a reservation is causing it though, I don't really
>> see how that commit could cause it however, so can you please double
>> check it never happened before that point, and only started at that
>> commit?
>>
>> also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in
>> ttm_bo_cleanup_refs_and_unlock for good measure, and a
>> BUG_ON(spin_trylock(&bdev->fence_lock)); to ttm_bo_wait.
>>
>> I really don't see how that specific commit can be wrong though, so
>> awaiting your results first before I try to dig more into it.
> I just reran git-bisect just on your commits (from 1a1494def to 97a875cbd)
> and I landed on the same commit as above:
>
> commit 85b144f86 (drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3)
>
> So now I'm pretty sure it's specifically this commit that started the
> issue.
>
> With your supposed debugging BUG_ONs added I still get:
>
> Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------
> Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
> Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name
> Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174
> Dec 18 17:01:15 x4 kernel: Call Trace:
> Dec 18 17:01:15 x4 kernel: [] ? warn_slowpath_common+0x74/0xb0
> Dec 18 17:01:15 x4 kernel: [] ? radeon_fence_ref+0x2c/0x40
> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0
> Dec 18 17:01:15 x4 kernel: [] ? ttm_mem_evict_first+0x1dc/0x2a0
> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_man_get_node+0x62/0xb0
> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_mem_space+0x28e/0x340
> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_move_buffer+0xfc/0x170
> Dec 18 17:01:15 x4 kernel: [] ? kmem_cache_alloc+0xb2/0xc0
> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_validate+0x95/0x110
> Dec 18 17:01:15 x4 kernel: [] ? ttm_bo_init+0x2ec/0x3b0
> Dec 18 17:01:15 x4 kernel: [] ? radeon_bo_create+0x18a/0x200
> Dec 18 17:01:15 x4 kernel: [] ? radeon_bo_clear_va+0x40/0x40
> Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_object_create+0x92/0x160
> Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_create_ioctl+0x6c/0x150
> Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_object_free+0x2f/0x40
> Dec 18 17:01:15 x4 kernel: [] ? drm_ioctl+0x420/0x4f0
> Dec 18 17:01:15 x4 kernel: [] ? radeon_gem_pwrite_ioctl+0x20/0x20
> Dec 18 17:01:15 x4 kernel: [] ? do_vfs_ioctl+0x2e4/0x4e0
> Dec 18 17:01:15 x4 kernel: [] ? vfs_read+0x118/0x160
> Dec 18 17:01:15 x4 kernel: [] ? sys_ioctl+0x4c/0xa0
> Dec 18 17:01:15 x4 kernel: [] ? sys_read+0x51/0xa0
> Dec 18 17:01:15 x4 kernel: [] ? system_call_fastpath+0x16/0x1b

So the kref to the fence is null here. This should be impossible and
indicates a refcounting bug somewhere, or possibly memory corruption.
Let's first look at where things could go wrong.
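
For readers unfamiliar with that warning, here is a rough userspace
illustration of what it seems to mean: a reference is being taken on an
object whose count has already dropped to zero, i.e. an object that
should already be gone. This is only a sketch, not kernel code. The
names toy_ref, toy_ref_get, toy_ref_put and toy_demo are made up for
the example, and the assumption that the check at kref.h:42 is the
zero-count test in kref_get() comes from reading the trace, not from
the thread itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a refcounted fence; not the kernel's struct kref. */
struct toy_ref {
	int count;
};

/* A freshly created object starts with one reference held by its creator. */
static void toy_ref_init(struct toy_ref *r)
{
	r->count = 1;
}

/*
 * Take a reference. Returns false (and prints a warning) when the count
 * was already zero, which is the situation assumed to trigger the
 * WARNING in the log above. Like the kernel helper, it still increments.
 */
static bool toy_ref_get(struct toy_ref *r)
{
	bool was_live = r->count > 0;

	if (!was_live)
		fprintf(stderr, "WARNING: get on object with zero refcount\n");
	r->count++;
	return was_live;
}

/* Drop a reference. Returns true when the last reference went away. */
static bool toy_ref_put(struct toy_ref *r)
{
	assert(r->count > 0);
	return --r->count == 0;
}

/* Walk through one balanced get/put pair, then one get that comes too late. */
static int toy_demo(void)
{
	struct toy_ref fence;
	int warnings = 0;

	toy_ref_init(&fence);           /* count = 1 */
	if (!toy_ref_get(&fence))       /* count = 2, legitimate */
		warnings++;
	toy_ref_put(&fence);            /* count = 1 */
	toy_ref_put(&fence);            /* count = 0, object is logically dead */
	if (!toy_ref_get(&fence))       /* too late: this is the bad case */
		warnings++;
	return warnings;
}
```

Calling toy_demo() reports exactly one late get. In the real driver the
equivalent would be some path reading a fence pointer after the last
reference was dropped, which is why the locking around sync_obj matters.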

The sync_obj member requires fence_lock to be taken, but radeon code in
general doesn't do that, hm. I think radeon_cs_sync_rings needs to take
fence_lock during the iteration and then take a refcount on the fence,
and radeon_crtc_page_flip and radeon_move_blit are lacking a refcount
under fence_lock as well.

But that would probably still not explain why it crashes in
radeon_vm_bo_invalidate shortly after, so it seems just as likely that
it's operating on freed memory there or something. But none of the code
touches refcounting for that bo, and I really don't see how I messed up
anything there.

I seem to be able to reproduce it if I add a hack, though. Can you test
whether you get the exact same issues if you apply this patch? I call it
"aggressively evict MRU buffer, and never call ddestroy", and for me it
triggers by merely starting X. :-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 0bf66f9..9a8f0d8 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -512,6 +512,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 
 	spin_lock(&glob->lru_lock);
 	ret = ttm_bo_reserve_locked(bo, false, true, false, 0);
+	goto skip;
 	spin_lock(&bdev->fence_lock);
 	(void) ttm_bo_wait(bo, false, false, true);
 	if (!ret && !bo->sync_obj && 0) {
@@ -529,6 +530,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 		sync_obj = driver->sync_obj_ref(bo->sync_obj);
 	spin_unlock(&bdev->fence_lock);
 
+skip:
 	if (!ret) {
 		atomic_set(&bo->reserved, 0);
 		wake_up_all(&bo->event_queue);
@@ -542,8 +544,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 		driver->sync_obj_flush(sync_obj);
 		driver->sync_obj_unref(&sync_obj);
 	}
-	schedule_delayed_work(&bdev->wq,
-			      ((HZ / 100) < 1) ? 1 : HZ / 100);
+	schedule_delayed_work(&bdev->wq, HZ * 100);
 }
 
 /**
@@ -699,8 +700,7 @@ static void ttm_bo_delayed_workqueue(struct work_struct *work)
 	    container_of(work, struct ttm_bo_device, wq.work);
 
 	if (ttm_bo_delayed_delete(bdev, false)) {
-		schedule_delayed_work(&bdev->wq,
-				      ((HZ / 100) < 1) ? 1 : HZ / 100);
+		schedule_delayed_work(&bdev->wq, HZ * 100);
 	}
 }
 
@@ -743,8 +743,7 @@ EXPORT_SYMBOL(ttm_bo_lock_delayed_workqueue);
 void ttm_bo_unlock_delayed_workqueue(struct ttm_bo_device *bdev, int resched)
 {
 	if (resched)
-		schedule_delayed_work(&bdev->wq,
-				      ((HZ / 100) < 1) ? 1 : HZ / 100);
+		schedule_delayed_work(&bdev->wq, HZ * 100);
 }
 EXPORT_SYMBOL(ttm_bo_unlock_delayed_workqueue);
 
@@ -815,12 +814,15 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
 
 retry:
 	spin_lock(&glob->lru_lock);
-	if (list_empty(&man->lru)) {
-		spin_unlock(&glob->lru_lock);
-		return -EBUSY;
-	}
+	if (list_empty(&bdev->ddestroy)) {
+		if (list_empty(&man->lru)) {
+			spin_unlock(&glob->lru_lock);
+			return -EBUSY;
+		}
+		bo = list_entry(man->lru.prev, struct ttm_buffer_object, lru);
+	} else
+		bo = list_entry(bdev->ddestroy.prev, struct ttm_buffer_object, ddestroy);
 
-	bo = list_first_entry(&man->lru, struct ttm_buffer_object, lru);
 	kref_get(&bo->list_kref);
 
 	if (!list_empty(&bo->ddestroy)) {
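
For a sense of scale on the schedule_delayed_work() change in the hack:
the stock expression asks for a delay of 10 ms worth of jiffies (but
never less than one jiffy), while HZ * 100 asks for 100 seconds, which
is what makes the delayed-destroy worker effectively never run during a
short test. A small userspace sketch of that arithmetic follows. In the
kernel HZ is a build-time constant; passing it as a parameter, and the
helper names stock_delay, hack_delay and jiffies_to_ms_toy, are
assumptions made purely for illustration.

```c
#include <assert.h>

/* Delay in jiffies requested by the unmodified code: 10 ms, at least 1 jiffy. */
static unsigned long stock_delay(unsigned long hz)
{
	return ((hz / 100) < 1) ? 1 : hz / 100;
}

/* Delay in jiffies requested by the debugging hack: 100 seconds. */
static unsigned long hack_delay(unsigned long hz)
{
	return hz * 100;
}

/* Convert a jiffy count to milliseconds for a given tick rate. */
static unsigned long jiffies_to_ms_toy(unsigned long jiffies, unsigned long hz)
{
	assert(hz > 0);
	return jiffies * 1000 / hz;
}
```

With hz = 1000 the stock delay is 10 jiffies (10 ms) and the hack delay
is 100000 jiffies (100 s); with a tick rate below 100 the stock delay
clamps to a single jiffy.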