From patchwork Tue Jun 14 20:25:39 2011 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 880012 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter2.kernel.org (8.14.4/8.14.4) with ESMTP id p5EKPl4I031318 for ; Tue, 14 Jun 2011 20:25:47 GMT Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753740Ab1FNUZo (ORCPT ); Tue, 14 Jun 2011 16:25:44 -0400 Received: from mx1.redhat.com ([209.132.183.28]:58356 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753059Ab1FNUZn (ORCPT ); Tue, 14 Jun 2011 16:25:43 -0400 Received: from int-mx12.intmail.prod.int.phx2.redhat.com (int-mx12.intmail.prod.int.phx2.redhat.com [10.5.11.25]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id p5EKPg2g012585 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Tue, 14 Jun 2011 16:25:42 -0400 Received: from localhost.localdomain.com (vpn-10-46.rdu.redhat.com [10.11.10.46]) by int-mx12.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id p5EKPejJ007115; Tue, 14 Jun 2011 16:25:41 -0400 From: Josef Bacik To: linux-btrfs@vger.kernel.org, jaschut@sandia.gov Subject: [PATCH] Btrfs: set no_trans_join after trying to expand the transaction Date: Tue, 14 Jun 2011 16:25:39 -0400 Message-Id: <1308083139-23734-1-git-send-email-josef@redhat.com> X-Scanned-By: MIMEDefang 2.68 on 10.5.11.25 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Greylist: IP, sender and recipient auto-whitelisted, not delayed by milter-greylist-4.2.6 (demeter2.kernel.org [140.211.167.43]); Tue, 14 Jun 2011 20:25:47 +0000 (UTC) We can lockup if we try to allow new writers join the transaction and we have flushoncommit set or have a pending snapshot. This is because we set no_trans_join and then loop around and try to wait for ordered extents again. The problem is the ordered endio stuff needs to join the transaction, which it can't do because no_trans_join is set. So instead wait until after this loop to set no_trans_join and then make sure to wait for num_writers == 1 in case anybody got started in between us exiting the loop and setting no_trans_join. This could easily be reproduced by mounting -o flushoncommit and running xfstest 13. It cannot be reproduced with this patch. Thanks, Reported-by: Jim Schutt Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 14 +++++++++++--- 1 files changed, 11 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 8754997..d0f1c07 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1239,12 +1239,20 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, schedule_timeout(1); finish_wait(&cur_trans->writer_wait, &wait); - spin_lock(&root->fs_info->trans_lock); - root->fs_info->trans_no_join = 1; - spin_unlock(&root->fs_info->trans_lock); } while (atomic_read(&cur_trans->num_writers) > 1 || (should_grow && cur_trans->num_joined != joined)); + /* + * Ok now we need to make sure to block out any other joins while we + * commit the transaction. We could have started a join before setting + * no_join so make sure to wait for num_writers to == 1 again. + */ + spin_lock(&root->fs_info->trans_lock); + root->fs_info->trans_no_join = 1; + spin_unlock(&root->fs_info->trans_lock); + wait_event(cur_trans->writer_wait, + atomic_read(&cur_trans->num_writers) == 1); + ret = create_pending_snapshots(trans, root->fs_info); BUG_ON(ret);