From patchwork Fri Jan 21 11:40:02 2022
X-Patchwork-Submitter: Tetsuo Handa
X-Patchwork-Id: 12719637
From: Tetsuo Handa
To: Jens Axboe, Christoph Hellwig, Jan Kara
Cc: linux-block@vger.kernel.org, Tetsuo Handa
Subject: [PATCH v3 1/5] task_work: export task_work_add()
Date: Fri, 21 Jan 2022 20:40:02 +0900
Message-Id: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>

Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
silenced a circular locking dependency warning by moving the autoclear
operation to WQ context. It was then reported that WQ context is too late
for running the autoclear operation; some userspace programs (e.g.
xfstest) assume that the autoclear operation has already completed by the
moment close() returns to user mode, so that they can immediately call
umount() on a partition containing a backing file which the autoclear
operation should have closed.

Then, Jan Kara found that the fundamental problem is that waiting for I/O
completion (from blk_mq_freeze_queue() or flush_workqueue()) with
disk->open_mutex held can deadlock. And I found that, since the
disk->open_mutex => lo->lo_mutex dependency is recorded by lo_open() and
lo_release(), and blk_mq_freeze_queue() called from e.g. loop_set_status()
waits for I/O completion with lo->lo_mutex held, from the locking
dependency chain perspective we need to avoid holding lo->lo_mutex in
lo_open() and lo_release().

We can avoid holding lo->lo_mutex in lo_open(), because a spinlock
dedicated to the Lo_deleting check can be used instead. But we cannot
avoid holding lo->lo_mutex in lo_release(), and WQ context turned out to
be too late for running the autoclear operation. The whole lo_release()
operation has to start without disk->open_mutex and complete before
returning to user mode. One approach that meets this requirement is to
use the task_work context. Thus, export task_work_add() for the loop
driver.
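For illustration, below is a minimal sketch of the task_work usage pattern
this series relies on. It is not part of the patch; struct deferred_ctx and
deferred_release() are hypothetical names, and a real user needs a fallback
for tasks where task_work_add() is unavailable (as patch 4/5 does with a
workqueue).

#include <linux/task_work.h>
#include <linux/sched.h>
#include <linux/slab.h>

/* Hypothetical context for one deferred cleanup request. */
struct deferred_ctx {
	struct callback_head cb;
	/* driver-specific state would go here */
};

/*
 * Runs in the context of the task that is returning to user mode
 * (TWA_RESUME), i.e. the cleanup is observed as completed by the time
 * the triggering system call such as close() returns.
 */
static void deferred_release(struct callback_head *cb)
{
	struct deferred_ctx *ctx = container_of(cb, struct deferred_ctx, cb);

	/* ... cleanup that must finish before returning to user mode ... */
	kfree(ctx);
}

static int schedule_deferred_release(void)
{
	struct deferred_ctx *ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);

	if (!ctx)
		return -ENOMEM;
	init_task_work(&ctx->cb, deferred_release);
	/* Fails with -ESRCH if the task is already exiting. */
	if (task_work_add(current, &ctx->cb, TWA_RESUME)) {
		kfree(ctx);
		return -ESRCH;
	}
	return 0;
}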
Cc: Jan Kara
Cc: Christoph Hellwig
Signed-off-by: Tetsuo Handa
---
 kernel/task_work.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..2a1644189182 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -60,6 +60,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(task_work_add);
 
 /**
  * task_work_cancel_match - cancel a pending work added by task_work_add()

From patchwork Fri Jan 21 11:40:03 2022
X-Patchwork-Submitter: Tetsuo Handa
X-Patchwork-Id: 12719638
From: Tetsuo Handa
To: Jens Axboe, Christoph Hellwig, Jan Kara
Cc: linux-block@vger.kernel.org, Tetsuo Handa, kernel test robot
Subject: [PATCH v3 2/5] loop: revert "make autoclear operation asynchronous"
Date: Fri, 21 Jan 2022 20:40:03 +0900
Message-Id: <20220121114006.3633-2-penguin-kernel@I-love.SAKURA.ne.jp>
In-Reply-To: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>
References: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>

The kernel test robot is reporting that the xfstest sequence which does
"umount ext2 on xfs" followed by "umount xfs" started failing, because
commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
removed the guarantee that fput() of the backing file is processed before
lo_release() from close() returns to user mode. As a preparation for
retrying with the task_work_add() approach, first make a clean revert.
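For reference, the userspace expectation that the asynchronous autoclear
broke can be sketched as follows. This is an illustration only, not part of
the patch; the paths and loop device number are hypothetical.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Assume /dev/loop0 is bound, with LO_FLAGS_AUTOCLEAR set, to a
	 * backing file residing on the filesystem mounted at /mnt/xfs.
	 */
	int fd = open("/dev/loop0", O_RDONLY);

	if (fd < 0)
		return 1;
	/* Last close: autoclear is expected to fput() the backing file. */
	close(fd);
	/*
	 * If the autoclear operation was deferred to a workqueue, the
	 * backing file may still be held open here, and this umount()
	 * fails with EBUSY.
	 */
	if (umount("/mnt/xfs"))
		perror("umount");
	return 0;
}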
Reported-by: kernel test robot
Cc: Jan Kara
Cc: Christoph Hellwig
Signed-off-by: Tetsuo Handa
---
 drivers/block/loop.c | 65 ++++++++++++++++++++------------------------
 drivers/block/loop.h |  1 -
 2 files changed, 29 insertions(+), 37 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b1b05c45c07c..e52a8a5e8cbc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1082,7 +1082,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo)
+static void __loop_clr_fd(struct loop_device *lo, bool release)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1144,6 +1144,8 @@ static void __loop_clr_fd(struct loop_device *lo)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
+	/* This is safe: open() is still holding a reference. */
+	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
@@ -1151,52 +1153,44 @@ static void __loop_clr_fd(struct loop_device *lo)
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		mutex_lock(&lo->lo_disk->open_mutex);
+		/*
+		 * open_mutex has been held already in release path, so don't
+		 * acquire it if this function is called in such case.
+		 *
+		 * If the reread partition isn't from release path, lo_refcnt
+		 * must be at least one and it can only become zero when the
+		 * current holder is released.
+		 */
+		if (!release)
+			mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		mutex_unlock(&lo->lo_disk->open_mutex);
+		if (!release)
+			mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
 
+	/*
+	 * lo->lo_state is set to Lo_unbound here after above partscan has
+	 * finished. There cannot be anybody else entering __loop_clr_fd() as
+	 * Lo_rundown state protects us from all the other places trying to
+	 * change the 'lo' device.
+	 */
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
-
-	fput(filp);
-}
-
-static void loop_rundown_completed(struct loop_device *lo)
-{
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
-	module_put(THIS_MODULE);
-}
-
-static void loop_rundown_workfn(struct work_struct *work)
-{
-	struct loop_device *lo = container_of(work, struct loop_device,
-					      rundown_work);
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__loop_clr_fd(lo);
-	kobject_put(&bdev->bd_device.kobj);
-	module_put(disk->fops->owner);
-	loop_rundown_completed(lo);
-}
-
-static void loop_schedule_rundown(struct loop_device *lo)
-{
-	struct block_device *bdev = lo->lo_device;
-	struct gendisk *disk = lo->lo_disk;
-
-	__module_get(disk->fops->owner);
-	kobject_get(&bdev->bd_device.kobj);
-	INIT_WORK(&lo->rundown_work, loop_rundown_workfn);
-	queue_work(system_long_wq, &lo->rundown_work);
+	/*
+	 * Need not hold lo_mutex to fput backing file. Calling fput holding
+	 * lo_mutex triggers a circular lock dependency possibility warning as
+	 * fput can take open_mutex which is usually taken before lo_mutex.
+	 */
+	fput(filp);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1228,8 +1222,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo);
-	loop_rundown_completed(lo);
+	__loop_clr_fd(lo, false);
 	return 0;
 }
 
@@ -1754,7 +1747,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		 * In autoclear mode, stop the loop thread
 		 * and remove configuration after last close.
 		 */
-		loop_schedule_rundown(lo);
+		__loop_clr_fd(lo, true);
 		return;
 	} else if (lo->lo_state == Lo_bound) {
 		/*
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 918a7a2dc025..082d4b6bfc6a 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,7 +56,6 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
-	struct work_struct	rundown_work;
 };
 
 struct loop_cmd {

From patchwork Fri Jan 21 11:40:04 2022
X-Patchwork-Submitter: Tetsuo Handa
X-Patchwork-Id: 12719641
From: Tetsuo Handa
To: Jens Axboe, Christoph Hellwig, Jan Kara
Cc: linux-block@vger.kernel.org, Tetsuo Handa
Subject: [PATCH v3 3/5] loop: don't hold lo->lo_mutex from lo_open()
Date: Fri, 21 Jan 2022 20:40:04 +0900
Message-Id: <20220121114006.3633-3-penguin-kernel@I-love.SAKURA.ne.jp>
In-Reply-To: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>
References: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>

Waiting for I/O completion with disk->open_mutex held can deadlock.
Since the disk->open_mutex => lo->lo_mutex dependency is recorded by
lo_open(), and blk_mq_freeze_queue() called from e.g. loop_set_status()
waits for I/O completion with lo->lo_mutex held, from the locking
dependency chain perspective the possibility of waiting for I/O
completion with disk->open_mutex held still remains. Introduce
loop_delete_spinlock, dedicated to protecting against the lo->lo_state
versus lo->lo_refcnt race between lo_open() and loop_control_remove().
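As an aside, the open-versus-delete handshake this patch introduces can be
modeled as below. The snippet mirrors the hunks that follow, but it is a
simplified standalone sketch reusing the driver's own types; open_side()
and delete_side() are hypothetical names.

static DEFINE_SPINLOCK(loop_delete_spinlock);

/* Sketch of the lo_open() side: either see Lo_deleting, or take a ref. */
static int open_side(struct loop_device *lo)
{
	int err = 0;

	spin_lock(&loop_delete_spinlock);
	if (data_race(lo->lo_state) == Lo_deleting)
		err = -ENXIO;	/* deletion already won the race */
	else
		atomic_inc(&lo->lo_refcnt);
	spin_unlock(&loop_delete_spinlock);
	return err;
}

/*
 * Sketch of the loop_control_remove() side: refuse if anybody holds a
 * reference, otherwise mark the device so no new opener can take one.
 */
static int delete_side(struct loop_device *lo)
{
	int ret = 0;

	spin_lock(&loop_delete_spinlock);
	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0)
		ret = -EBUSY;
	else
		lo->lo_state = Lo_deleting;
	spin_unlock(&loop_delete_spinlock);
	return ret;
}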
Cc: Jan Kara
Cc: Christoph Hellwig
Signed-off-by: Tetsuo Handa
---
 drivers/block/loop.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index e52a8a5e8cbc..5ce8ac2dfa4c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -89,6 +89,7 @@ static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
+static DEFINE_SPINLOCK(loop_delete_spinlock);
 
 /**
  * loop_global_lock_killable() - take locks for safe loop_validate_file() test
@@ -1717,16 +1718,15 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
-	int err;
+	int err = 0;
 
-	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
-		return err;
-	if (lo->lo_state == Lo_deleting)
+	spin_lock(&loop_delete_spinlock);
+	/* lo->lo_state may be changed to any Lo_* but Lo_deleting. */
+	if (data_race(lo->lo_state) == Lo_deleting)
 		err = -ENXIO;
 	else
 		atomic_inc(&lo->lo_refcnt);
-	mutex_unlock(&lo->lo_mutex);
+	spin_unlock(&loop_delete_spinlock);
 	return err;
 }
 
@@ -2112,19 +2112,18 @@ static int loop_control_remove(int idx)
 	ret = mutex_lock_killable(&lo->lo_mutex);
 	if (ret)
 		goto mark_visible;
-	if (lo->lo_state != Lo_unbound ||
-	    atomic_read(&lo->lo_refcnt) > 0) {
-		mutex_unlock(&lo->lo_mutex);
+	spin_lock(&loop_delete_spinlock);
+	/* Mark this loop device no longer open()-able if nobody is using. */
+	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0)
 		ret = -EBUSY;
-		goto mark_visible;
-	}
-	/* Mark this loop device no longer open()-able. */
-	lo->lo_state = Lo_deleting;
+	else
+		lo->lo_state = Lo_deleting;
+	spin_unlock(&loop_delete_spinlock);
 	mutex_unlock(&lo->lo_mutex);
-
-	loop_remove(lo);
-	return 0;
-
+	if (!ret) {
+		loop_remove(lo);
+		return 0;
+	}
 mark_visible:
 	/* Show this loop device again.
 	 */
 	mutex_lock(&loop_ctl_mutex);

From patchwork Fri Jan 21 11:40:05 2022
X-Patchwork-Submitter: Tetsuo Handa
X-Patchwork-Id: 12719640
From: Tetsuo Handa
To: Jens Axboe, Christoph Hellwig, Jan Kara
Cc: linux-block@vger.kernel.org, Tetsuo Handa
Subject: [PATCH v3 4/5] loop: don't hold lo->lo_mutex from lo_release()
Date: Fri, 21 Jan 2022 20:40:05 +0900
Message-Id: <20220121114006.3633-4-penguin-kernel@I-love.SAKURA.ne.jp>
In-Reply-To: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>
References: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>

This is a retry of commit 322c4293ecc58110 ("loop: make autoclear
operation asynchronous"). Since it turned out that we need to avoid
waiting for I/O completion with disk->open_mutex held, move the whole
lo_release() operation to task work context (when possible) or WQ
context (otherwise). Refcount management in lo_release() and
loop_release_workfn() needs to be updated in sync with blkdev_put(),
because blkdev_put() has already dropped its references by the moment
loop_release_callbackfn() is invoked.
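The refcount point can be illustrated concretely; the sketch below is an
illustration rather than the patch itself, with hypothetical function
names, showing which references the deferred path has to pin.

/*
 * In lo_release(), before deferring: pin objects that blkdev_put()
 * stops protecting once lo_release() returns.
 */
static void before_deferring(struct gendisk *disk)
{
	__module_get(disk->fops->owner);	/* keep the loop module loaded */
	kobject_get(&disk_to_dev(disk)->kobj);	/* keep the device object alive */
	/* ... then task_work_add() or queue_work() ... */
}

/* At the end of the deferred callback: drop the pinned references. */
static void after_deferred_work(struct gendisk *disk)
{
	kobject_put(&disk_to_dev(disk)->kobj);
	module_put(disk->fops->owner);
}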
Cc: Jan Kara
Cc: Christoph Hellwig
Signed-off-by: Tetsuo Handa
---
 drivers/block/loop.c | 151 +++++++++++++++++++++++++++++++------------
 drivers/block/loop.h |   1 +
 2 files changed, 111 insertions(+), 41 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5ce8ac2dfa4c..74d919e98a6b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1083,7 +1083,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static void __loop_clr_fd(struct loop_device *lo)
 {
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
@@ -1145,8 +1145,6 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
-	/* This is safe: open() is still holding a reference. */
-	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
@@ -1154,37 +1152,18 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN) {
 		int err;
 
-		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
-		 *
-		 * If the reread partition isn't from release path, lo_refcnt
-		 * must be at least one and it can only become zero when the
-		 * current holder is released.
-		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
+		mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
+		mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
 		/* Device is gone, no point in returning error */
 	}
 
-	/*
-	 * lo->lo_state is set to Lo_unbound here after above partscan has
-	 * finished. There cannot be anybody else entering __loop_clr_fd() as
-	 * Lo_rundown state protects us from all the other places trying to
-	 * change the 'lo' device.
-	 */
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART;
-	mutex_lock(&lo->lo_mutex);
-	lo->lo_state = Lo_unbound;
-	mutex_unlock(&lo->lo_mutex);
 
 	/*
 	 * Need not hold lo_mutex to fput backing file. Calling fput holding
@@ -1192,6 +1171,10 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	 * fput can take open_mutex which is usually taken before lo_mutex.
 	 */
 	fput(filp);
+	mutex_lock(&lo->lo_mutex);
+	lo->lo_state = Lo_unbound;
+	mutex_unlock(&lo->lo_mutex);
+	module_put(THIS_MODULE);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1223,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_rundown;
 	mutex_unlock(&lo->lo_mutex);
 
-	__loop_clr_fd(lo, false);
+	__loop_clr_fd(lo);
 	return 0;
 }
 
@@ -1715,10 +1698,31 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
+struct loop_release_task {
+	union {
+		struct list_head head;
+		struct callback_head cb;
+		struct work_struct ws;
+	};
+	struct loop_device *lo;
+};
+
+static LIST_HEAD(release_task_spool);
+static DEFINE_SPINLOCK(release_task_spool_spinlock);
+
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
 	int err = 0;
+	/*
+	 * In order to avoid doing __GFP_NOFAIL allocation from lo_release(),
+	 * reserve memory for calling lo_post_release() from lo_open().
+	 */
+	struct loop_release_task *lrt =
+		kmalloc(sizeof(*lrt), GFP_KERNEL | __GFP_NOWARN);
+
+	if (!lrt)
+		return -ENOMEM;
 
 	spin_lock(&loop_delete_spinlock);
 	/* lo->lo_state may be changed to any Lo_* but Lo_deleting. */
@@ -1727,33 +1731,40 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
 	else
 		atomic_inc(&lo->lo_refcnt);
 	spin_unlock(&loop_delete_spinlock);
-	return err;
+	if (err) {
+		kfree(lrt);
+		return err;
+	}
+	spin_lock(&release_task_spool_spinlock);
+	list_add(&lrt->head, &release_task_spool);
+	spin_unlock(&release_task_spool_spinlock);
+	return 0;
 }
 
-static void lo_release(struct gendisk *disk, fmode_t mode)
+static void lo_post_release(struct gendisk *disk)
 {
 	struct loop_device *lo = disk->private_data;
 
 	mutex_lock(&lo->lo_mutex);
-	if (atomic_dec_return(&lo->lo_refcnt))
-		goto out_unlock;
+	/* Check whether this loop device can be cleared. */
+	if (atomic_dec_return(&lo->lo_refcnt) || lo->lo_state != Lo_bound)
+		goto out_unlock;
+	/*
+	 * Clear this loop device since nobody is using. Note that since
+	 * lo_open() increments lo->lo_refcnt without holding lo->lo_mutex,
+	 * I might become no longer the last user, but there is a fact that
+	 * there was no user.
+	 *
+	 * In autoclear mode, destroy WQ and remove configuration.
+	 * Otherwise flush possible ongoing bios in WQ and keep configuration.
+	 */
 	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
-		if (lo->lo_state != Lo_bound)
-			goto out_unlock;
 		lo->lo_state = Lo_rundown;
 		mutex_unlock(&lo->lo_mutex);
-		/*
-		 * In autoclear mode, stop the loop thread
-		 * and remove configuration after last close.
-		 */
-		__loop_clr_fd(lo, true);
+		__loop_clr_fd(lo);
 		return;
-	} else if (lo->lo_state == Lo_bound) {
-		/*
-		 * Otherwise keep thread (if running) and config,
-		 * but flush possible ongoing bios in thread.
-		 */
+	} else {
 		blk_mq_freeze_queue(lo->lo_queue);
 		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
@@ -1762,6 +1773,60 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 	mutex_unlock(&lo->lo_mutex);
 }
 
+static void loop_release_workfn(struct work_struct *work)
+{
+	struct loop_release_task *lrt =
+		container_of(work, struct loop_release_task, ws);
+	struct loop_device *lo = lrt->lo;
+	struct gendisk *disk = lo->lo_disk;
+
+	lo_post_release(disk);
+	/* Drop references which will be dropped after lo_release(). */
+	kobject_put(&disk_to_dev(disk)->kobj);
+	module_put(disk->fops->owner);
+	kfree(lrt);
+	atomic_dec(&lo->async_pending);
+}
+
+static void loop_release_callbackfn(struct callback_head *callback)
+{
+	struct loop_release_task *lrt =
+		container_of(callback, struct loop_release_task, cb);
+
+	loop_release_workfn(&lrt->ws);
+}
+
+static void lo_release(struct gendisk *disk, fmode_t mode)
+{
+	struct loop_device *lo = disk->private_data;
+	struct loop_release_task *lrt;
+
+	atomic_inc(&lo->async_pending);
+	/*
+	 * Fetch from spool. Since a successful lo_open() call is coupled with
+	 * a lo_release() call, we are guaranteed that spool is not empty.
+	 */
+	spin_lock(&release_task_spool_spinlock);
+	lrt = list_first_entry(&release_task_spool, typeof(*lrt), head);
+	list_del(&lrt->head);
+	spin_unlock(&release_task_spool_spinlock);
+	/* Hold references which will be dropped after lo_release(). */
+	__module_get(disk->fops->owner);
+	kobject_get(&disk_to_dev(disk)->kobj);
+	/*
+	 * Prefer task work so that clear operation completes
+	 * before close() returns to user mode.
+	 */
+	lrt->lo = lo;
+	if (!(current->flags & PF_KTHREAD)) {
+		init_task_work(&lrt->cb, loop_release_callbackfn);
+		if (!task_work_add(current, &lrt->cb, TWA_RESUME))
+			return;
+	}
+	INIT_WORK(&lrt->ws, loop_release_workfn);
+	queue_work(system_long_wq, &lrt->ws);
+}
+
 static const struct block_device_operations lo_fops = {
 	.owner		= THIS_MODULE,
 	.open		= lo_open,
@@ -2023,6 +2088,7 @@ static int loop_add(int i)
 	if (!part_shift)
 		disk->flags |= GENHD_FL_NO_PART;
 	atomic_set(&lo->lo_refcnt, 0);
+	atomic_set(&lo->async_pending, 0);
 	mutex_init(&lo->lo_mutex);
 	lo->lo_number = i;
 	spin_lock_init(&lo->lo_lock);
@@ -2064,6 +2130,9 @@ static int loop_add(int i)
 
 static void loop_remove(struct loop_device *lo)
 {
+	/* Wait for task work and/or WQ to complete. */
+	while (atomic_read(&lo->async_pending))
+		schedule_timeout_uninterruptible(1);
 	/* Make this loop device unreachable from pathname. */
 	del_gendisk(lo->lo_disk);
 	blk_cleanup_disk(lo->lo_disk);
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 082d4b6bfc6a..20fc5eebe455 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -56,6 +56,7 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
+	atomic_t		async_pending;
 };
 
 struct loop_cmd {

From patchwork Fri Jan 21 11:40:06 2022
X-Patchwork-Submitter: Tetsuo Handa
X-Patchwork-Id: 12719639
From: Tetsuo Handa
To: Jens Axboe, Christoph Hellwig, Jan Kara
Cc: linux-block@vger.kernel.org, Tetsuo Handa, Jan Stancek, Mike Galbraith
Subject: [PATCH v3 5/5] loop: add workaround for racy loop device reuse logic in /bin/mount
Date: Fri, 21 Jan 2022 20:40:06 +0900
Message-Id: <20220121114006.3633-5-penguin-kernel@I-love.SAKURA.ne.jp>
In-Reply-To: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>
References: <20220121114006.3633-1-penguin-kernel@I-love.SAKURA.ne.jp>

Since lo_open() and lo_release() were previously serialized via
disk->open_mutex, new file
descriptors returned by open() never reached a loop device in Lo_rundown
state unless an ioctl(LOOP_CLR_FD) request was already inside
__loop_clr_fd(). But now that lo_open() and lo_release() no longer hold
lo->lo_mutex in order to kill the disk->open_mutex => lo->lo_mutex
dependency, new file descriptors returned by open() can easily reach a
loop device in Lo_rundown state.

So far, Jan Stancek and Mike Galbraith found that LTP's isofs testcase,
which does mount/umount in close succession, started failing. The root
cause is that the loop device reuse logic in /bin/mount is racy, and Jan
Kara posted a patch for fixing one of the two bugs [1]. But we need some
migration period to allow users to update their util-linux package; not
everybody can use the latest userspace programs. Therefore, add a switch
that allows emulating serialization between lo_open() and lo_release()
without using disk->open_mutex. This emulation is disabled by default,
and will be removed eventually. Since this emulation runs from task work
context, we don't need to worry about locking dependency problems.

Link: https://lkml.kernel.org/r/20220120114705.25342-1-jack@suse.cz [1]
Reported-by: Jan Stancek
Reported-by: Mike Galbraith
Analyzed-by: Jan Kara
Cc: Christoph Hellwig
Signed-off-by: Tetsuo Handa
---
This hack is not popular, but without this workaround enabled, about 20%
of mount requests fail. With this workaround enabled, no mount request
fails. I think we need this hack for a while.

root@fuzz:/mnt# time for i in $(seq 1 100); do mount -o loop,ro isofs.iso isofs/ && umount isofs/; done
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: isofs/: operation permitted for root only.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: isofs/: operation permitted for root only.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
real	0m9.896s
user	0m0.161s
sys	0m8.523s

 drivers/block/loop.c | 58 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 74d919e98a6b..844471213494 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -90,6 +90,7 @@ static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
 static DEFINE_SPINLOCK(loop_delete_spinlock);
+static DECLARE_WAIT_QUEUE_HEAD(loop_rundown_wait);
 
 /**
  * loop_global_lock_killable() - take locks for safe loop_validate_file() test
@@ -1174,6 +1175,7 @@ static void __loop_clr_fd(struct loop_device *lo)
 	mutex_lock(&lo->lo_mutex);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
+	wake_up_all(&loop_rundown_wait);
 	module_put(THIS_MODULE);
 }
 
@@ -1710,6 +1712,38 @@ struct loop_release_task {
 static LIST_HEAD(release_task_spool);
 static DEFINE_SPINLOCK(release_task_spool_spinlock);
 
+/* Workaround code for racy loop device reuse logic in /bin/mount. */
+static bool open_waits_rundown_device;
+module_param(open_waits_rundown_device, bool, 0644);
+MODULE_PARM_DESC(open_waits_rundown_device, "Please report if you need to enable this option.");
+
+struct loop_open_task {
+	struct callback_head cb;
+	struct loop_device *lo;
+};
+
+static void lo_post_open(struct gendisk *disk)
+{
+	struct loop_device *lo = disk->private_data;
+
+	/* Wait for lo_post_release() to leave lo->lo_mutex section. */
+	if (mutex_lock_killable(&lo->lo_mutex) == 0)
+		mutex_unlock(&lo->lo_mutex);
+	/* Also wait for __loop_clr_fd() to complete if Lo_rundown was set. */
+	wait_event_killable(loop_rundown_wait, data_race(lo->lo_state) != Lo_rundown);
+	atomic_dec(&lo->async_pending);
+}
+
+static void loop_open_callbackfn(struct callback_head *callback)
+{
+	struct loop_open_task *lot =
+		container_of(callback, struct loop_open_task, cb);
+	struct gendisk *disk = lot->lo->lo_disk;
+
+	lo_post_open(disk);
+	kfree(lot);
+}
+
 static int lo_open(struct block_device *bdev, fmode_t mode)
 {
 	struct loop_device *lo = bdev->bd_disk->private_data;
@@ -1738,6 +1772,30 @@ static int lo_open(struct block_device *bdev, fmode_t mode)
 	spin_lock(&release_task_spool_spinlock);
 	list_add(&lrt->head, &release_task_spool);
 	spin_unlock(&release_task_spool_spinlock);
+
+	/*
+	 * Try to avoid accessing Lo_rundown loop device.
+	 *
+	 * Since the task_work list is LIFO, lo_post_release() scheduled by
+	 * lo_release() can run before lo_post_open() scheduled by lo_open()
+	 * runs when an error occurred and fput() scheduled lo_release() before
+	 * returning to user mode. This means that lo->lo_refcnt may be already
+	 * 0 when lo_post_open() runs. Therefore, use lo->async_pending in
+	 * order to prevent loop_remove() from releasing this loop device.
+	 */
+	if (open_waits_rundown_device && !(current->flags & PF_KTHREAD)) {
+		struct loop_open_task *lot =
+			kmalloc(sizeof(*lot), GFP_KERNEL | __GFP_NOWARN);
+
+		if (!lot)
+			return 0;
+		lot->lo = lo;
+		init_task_work(&lot->cb, loop_open_callbackfn);
+		if (task_work_add(current, &lot->cb, TWA_RESUME))
+			kfree(lot);
+		else
+			atomic_inc(&lo->async_pending);
+	}
 	return 0;
 }
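For completeness: since the parameter is declared with mode 0644, the
workaround can presumably be enabled either at module load time or at
runtime (assuming the driver is built as the usual loop module, so the
parameter appears under /sys/module/loop/parameters/):

  modprobe loop open_waits_rundown_device=1
  echo 1 > /sys/module/loop/parameters/open_waits_rundown_device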