From patchwork Thu Jun 2 11:07:38 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 9149799 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 3965C6074E for ; Thu, 2 Jun 2016 11:07:45 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2A17725EF7 for ; Thu, 2 Jun 2016 11:07:45 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 1EDCA264FB; Thu, 2 Jun 2016 11:07:45 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=2.0 tests=BAYES_00, DKIM_ADSP_CUSTOM_MED, DKIM_SIGNED, FREEMAIL_FROM, RCVD_IN_DNSWL_HI, T_DKIM_INVALID autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 79BDB25EF7 for ; Thu, 2 Jun 2016 11:07:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932344AbcFBLHk (ORCPT ); Thu, 2 Jun 2016 07:07:40 -0400 Received: from mail-io0-f193.google.com ([209.85.223.193]:33862 "EHLO mail-io0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751217AbcFBLHj (ORCPT ); Thu, 2 Jun 2016 07:07:39 -0400 Received: by mail-io0-f193.google.com with SMTP id l9so6050505ioe.1 for ; Thu, 02 Jun 2016 04:07:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:cc; bh=sOKHkT+vE5q6rqH2EglaRteQPgNtlUicYrsusmquyWk=; b=L9AW/CDSm9qU1j16QkyXopwgb0AifEtrzjGG9fcIltdSByC4uNMSesa5/LPN/bFKsZ 5Czj4+3gHFIwWdDK9G3shK2cqhM6W3ctF0so3YbGs2slmz8uiQaGcrcURpFw7aaQCiMd ifhUPT7J3TN6Z13ewkvxkIN79bR1Pl+75OZ5dZ8MgC2BW4CTkQz70Fl9tyfw01VENPTM 6h4OhSXSdf1O/CUN7GBIW/JTsusqNUeI1ImZPQUXgzJOAjt1MmmvCbL2H9YsLfxbg3ga RA1XGNBurFGtozfMgnwRyurDkn9nYKqVPDLssAp8Q9O6zKKmnYG1gSZq55jcRZjoeOw/ ecDA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:reply-to:in-reply-to:references :date:message-id:subject:from:to:cc; bh=sOKHkT+vE5q6rqH2EglaRteQPgNtlUicYrsusmquyWk=; b=Ai/BwCHnLxfwRt2SbsObdNbG12IKqzjCEdlk3H7dFhnMSSsXC3qTxVsAlYVDtNKfi7 w1JfpAQAES0QXKgqkwqQ35chbb1XPJeB/rD75O+RZJhY9Rz2qJhD2oUFO8qKqtM2sqX/ 24Cyz6p+rjAeOuCDeZexPtc3s8hZTxqmcZKpvMPFImvO96O4UilwiGXDphTPr3RP707z dk2jb8guBkg0Bjz4FvoUhD78GTH1dKALFk02woDPTeIkcbgbz2wucAXm5Un8lW/+wUZ6 SRZKIHb2EBhrSGgYUZ09KYp1kfyU2+pyPfl9Yt5gjstRGSc4ytZQ+znu/mcsNixH6/d0 OBzA== X-Gm-Message-State: ALyK8tI9b2DoJdazkQ/aR1Xcfy5LJGJRmk1LuRwE967A2CgaGZwo3Tz8U8R4RjPWhUPv+ff2D9xE/15WgQ6JQQ== MIME-Version: 1.0 X-Received: by 10.107.1.199 with SMTP id 190mr2350205iob.74.1464865658633; Thu, 02 Jun 2016 04:07:38 -0700 (PDT) Received: by 10.79.111.193 with HTTP; Thu, 2 Jun 2016 04:07:38 -0700 (PDT) Reply-To: fdmanana@gmail.com In-Reply-To: <20160601043959.GJ15597@hungrycats.org> References: <1462422229-23797-1-git-send-email-ce3g8jdj@umail.furryterror.org> <20160601043959.GJ15597@hungrycats.org> Date: Thu, 2 Jun 2016 12:07:38 +0100 Message-ID: Subject: Re: [PATCH] btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes From: Filipe Manana To: Zygo Blaxell Cc: "linux-btrfs@vger.kernel.org" Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On Wed, Jun 1, 2016 at 5:39 AM, Zygo Blaxell wrote: > On Thu, May 05, 2016 at 12:23:49AM -0400, Zygo Blaxell wrote: >> During a mount, we start the cleaner kthread first because the transaction >> kthread wants to wake up the cleaner kthread. We start the transaction >> kthread next because everything in btrfs wants transactions. We do reloc >> recovery in the thread that was doing the original mount call once the >> transaction kthread is running. This means that the cleaner kthread >> could already be running when reloc recovery happens (e.g. if a snapshot >> delete was started before a crash). >> >> Relocation does not play well with the cleaner kthread, so a mutex was >> added in commit 5f3164813b90f7dbcb5c3ab9006906222ce471b7 "Btrfs: fix >> race between balance recovery and root deletion" to prevent both from >> being active at the same time. >> >> If the cleaner kthread is already holding the mutex by the time we get >> to btrfs_recover_relocation, the mount will be blocked until at least >> one deleted subvolume is cleaned (possibly more if the mount process >> doesn't get the lock right away). During this time (which could be an >> arbitrarily long time on a large/slow filesystem), the mount process is >> stuck and the filesystem is unnecessarily inaccessible. >> >> Fix this by locking cleaner_mutex before we start cleaner_kthread, and >> unlocking the mutex after mount no longer requires it. This ensures >> that the mounting process will not be blocked by the cleaner kthread. >> The cleaner kthread is already prepared for mutex contention and will >> just go to sleep until the mutex is available. >> --- >> fs/btrfs/disk-io.c | 18 +++++++++++++++--- >> 1 file changed, 15 insertions(+), 3 deletions(-) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index d8d68af..7c8f435 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -2509,6 +2509,7 @@ int open_ctree(struct super_block *sb, >> int num_backups_tried = 0; >> int backup_index = 0; >> int max_active; >> + bool cleaner_mutex_locked = false; >> >> tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info); >> chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info); >> @@ -2988,6 +2989,13 @@ retry_root_backup: >> goto fail_sysfs; >> } >> >> + /* >> + * Hold the cleaner_mutex thread here so that we don't block >> + * for a long time on btrfs_recover_relocation. cleaner_kthread >> + * will wait for us to finish mounting the filesystem. >> + */ >> + mutex_lock(&fs_info->cleaner_mutex); >> + cleaner_mutex_locked = true; >> fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root, >> "btrfs-cleaner"); >> if (IS_ERR(fs_info->cleaner_kthread)) > > Unfortunately, if we have a log to replay, we get to code like this > in open_ctree: > > /* do not make disk changes in broken FS */ > if (btrfs_super_log_root(disk_super) != 0) { > ret = btrfs_replay_log(fs_info, fs_devices); > if (ret) { > err = ret; > goto fail_qgroup; > } > } > > and: > > static int btrfs_replay_log(struct btrfs_fs_info *fs_info, > struct btrfs_fs_devices *fs_devices) > { > [...] > if (fs_info->sb->s_flags & MS_RDONLY) { > ret = btrfs_commit_super(tree_root); > if (ret) > return ret; > } > > and finally: > > int btrfs_commit_super(struct btrfs_root *root) > { > struct btrfs_trans_handle *trans; > > mutex_lock(&root->fs_info->cleaner_mutex); > btrfs_run_delayed_iputs(root); > mutex_unlock(&root->fs_info->cleaner_mutex); > wake_up_process(root->fs_info->cleaner_kthread); > > Well, dammit. Since we have already locked cleaner_mutex, it promptly > recursive-deadlocks on itself--but only if the filesystem was not cleanly > umounted, and the problem disappears if you reboot and try to mount again > because there won't be a log to replay the second time. > > Could we just add a bool to fs_info that says to cleaner_kthread "don't > do anything yet, we're not finished mounting"? That way it doesn't break > if some new place to lock cleaner_mutex pops up (they do seem to move > around from one kernel version to the next). > > I think we can do btrfs_run_delayed_iputs and just skip the > wake_up_process call here? Or neuter it by having cleaner_kthread do > nothing while we are still somewhere in the middle of open_ctree. You can try something as simple as (untested): > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 6628fca..a96a71a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3827,9 +3827,13 @@ int btrfs_commit_super(struct btrfs_root *root) { struct btrfs_trans_handle *trans; - mutex_lock(&root->fs_info->cleaner_mutex); + if (root->fs_info->open) + mutex_lock(&root->fs_info->cleaner_mutex); + else + ASSERT(mutex_is_locked(&root->fs_info->cleaner_mutex)); btrfs_run_delayed_iputs(root); - mutex_unlock(&root->fs_info->cleaner_mutex); + if (root->fs_info->open) + mutex_unlock(&root->fs_info->cleaner_mutex); wake_up_process(root->fs_info->cleaner_kthread); /* wait until ongoing cleanup work done */