From patchwork Tue May 11 14:01:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andreas Gruenbacher X-Patchwork-Id: 12251087 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D9F64C433ED for ; Tue, 11 May 2021 14:01:31 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 6547661625 for ; Tue, 11 May 2021 14:01:31 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6547661625 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A3C346B006E; Tue, 11 May 2021 10:01:30 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A121D6B0071; Tue, 11 May 2021 10:01:30 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 88D0A6B0072; Tue, 11 May 2021 10:01:30 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0194.hostedemail.com [216.40.44.194]) by kanga.kvack.org (Postfix) with ESMTP id 6D23C6B006E for ; Tue, 11 May 2021 10:01:30 -0400 (EDT) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 2AC3A180B144F for ; Tue, 11 May 2021 14:01:30 +0000 (UTC) X-FDA: 78129112740.29.B151A5A Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf07.hostedemail.com (Postfix) with ESMTP id DF29AA0049CA for ; Tue, 11 May 2021 14:01:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1620741681; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=/J2SbDDpf1/xl12w6sV/ZvOSp9oddaI9kZRn1S4ksFA=; b=MBWemO5ULQJSPpJ/5Q9pjQpNyZ4yELbNsT5g5pWa8aS0IidVJbXFTuqT8WgevS7XlcOCx+ zQtQDobslKAR1x5ZHFCsBUow1bo7t5o5nXt7Uq8orjPsY8k+tKLBZtmT6O7qtF+q5UAZJi bWCVqpQdwPaSh02iVQDtYsELfw3gbF8= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-506-UbNFmkdyNFOv3l7m6nD3FA-1; Tue, 11 May 2021 10:01:19 -0400 X-MC-Unique: UbNFmkdyNFOv3l7m6nD3FA-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 067BA18E5641; Tue, 11 May 2021 14:01:17 +0000 (UTC) Received: from max.com (unknown [10.40.195.97]) by smtp.corp.redhat.com (Postfix) with ESMTP id 8040E5D6D1; Tue, 11 May 2021 14:01:15 +0000 (UTC) From: Andreas Gruenbacher To: linux-fsdevel@vger.kernel.org Cc: cluster-devel@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jan Kara , Andreas Gruenbacher Subject: [PATCH] [RFC] Trigger retry from fault vm operation Date: Tue, 11 May 2021 16:01:13 +0200 Message-Id: <20210511140113.1225981-1-agruenba@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Rspamd-Queue-Id: DF29AA0049CA Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=MBWemO5U; dmarc=pass (policy=none) header.from=redhat.com; spf=none (imf07.hostedemail.com: domain of agruenba@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=agruenba@redhat.com X-Rspamd-Server: rspam03 X-Stat-Signature: sx67y6xzc6jekjwpqjdbdhewbhecf3hn Received-SPF: none (redhat.com>: No applicable sender policy available) receiver=imf07; identity=mailfrom; envelope-from=""; helo=us-smtp-delivery-124.mimecast.com; client-ip=170.10.133.124 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1620741683-797904 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hello, we have a locking problem in gfs2 that I don't have a proper solution for, so I'm looking for suggestions. What's happening is that a page fault triggers during a read or write operation, while we're holding a glock (the cluster-wide gfs2 inode lock), and the page fault requires another glock. We can recognize and handle the case when both glocks are the same, but when the page fault requires another glock, there is a chance that taking that other glock would deadlock. When we realize that we may not be able to take the other glock in gfs2_fault, we need to communicate that to the read or write operation, which will then drop and re-acquire the "outer" glock and retry. However, there doesn't seem to be a good way to do that; we can only indicate that a page fault should fail by returning VM_FAULT_SIGBUS or similar; that will then be mapped to -EFAULT. We'd need something like VM_FAULT_RESTART that can be mapped to -EBUSY so that we can tell the retry case apart from genuine -EFAULT errors. To show what I mean, below is a proof of concept that adds a restart_hack flag to struct task_struct as a side channel. An even uglier alternative would be to abuse task->journal_info. The obvious response to this email would be "fix your locking order". Well, we can do that by pinning the user pages in gfs2_file_{read,write}_iter before taking the "outer" glock, but we'd ideally only do that when it's actually necessary (i.e., when gfs2_fault has indicated to retry). Thanks for any thoughts. Andreas --- fs/gfs2/file.c | 37 +++++++++++++++++++++++++++++++++++-- include/linux/sched.h | 1 + 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 658fed79d65a..253e720f2df0 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -543,10 +543,15 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf) vm_fault_t ret; int err; - gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); + gfs2_holder_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_TRY, &gh); if (likely(!recursive)) { err = gfs2_glock_nq(&gh); if (err) { + if (err == GLR_TRYFAILED) { + current->restart_hack = 1; + ret = VM_FAULT_SIGBUS; + goto out_uninit; + } ret = block_page_mkwrite_return(err); goto out_uninit; } @@ -773,12 +778,16 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to, return 0; /* skip atime */ gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); +restart: ret = gfs2_glock_nq(gh); if (ret) goto out_uninit; + current->restart_hack = 0; ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0); gfs2_glock_dq(gh); + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart; out_uninit: gfs2_holder_uninit(gh); return ret; @@ -803,6 +812,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, * VFS does. */ gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); + +restart: ret = gfs2_glock_nq(gh); if (ret) goto out_uninit; @@ -811,11 +822,14 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, if (offset + len > i_size_read(&ip->i_inode)) goto out; + current->restart_hack = 0; ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0); if (ret == -ENOTBLK) ret = 0; out: gfs2_glock_dq(gh); + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart; out_uninit: gfs2_holder_uninit(gh); return ret; @@ -834,7 +848,9 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) return ret; iocb->ki_flags &= ~IOCB_DIRECT; } +restart1: iocb->ki_flags |= IOCB_NOIO; + current->restart_hack = 0; ret = generic_file_read_iter(iocb, to); iocb->ki_flags &= ~IOCB_NOIO; if (ret >= 0) { @@ -842,6 +858,8 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) return ret; written = ret; } else { + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart1; if (ret != -EAGAIN) return ret; if (iocb->ki_flags & IOCB_NOWAIT) @@ -849,13 +867,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) } ip = GFS2_I(iocb->ki_filp->f_mapping->host); gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); + +restart2: ret = gfs2_glock_nq(&gh); if (ret) goto out_uninit; + current->restart_hack = 0; ret = generic_file_read_iter(iocb, to); if (ret > 0) written += ret; gfs2_glock_dq(&gh); + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart2; out_uninit: gfs2_holder_uninit(&gh); return written ? written : ret; @@ -912,13 +935,18 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from) goto out_unlock; iocb->ki_flags |= IOCB_DSYNC; +restart1: + current->restart_hack = 0; current->backing_dev_info = inode_to_bdi(inode); buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); current->backing_dev_info = NULL; if (unlikely(buffered <= 0)) { + if (unlikely(buffered == -EFAULT) && current->restart_hack) + goto restart1; if (!ret) ret = buffered; goto out_unlock; + } /* * We need to ensure that the page cache pages are written to @@ -935,10 +963,15 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from) if (!ret || ret2 > 0) ret += ret2; } else { +restart2: + current->restart_hack = 0; current->backing_dev_info = inode_to_bdi(inode); ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); current->backing_dev_info = NULL; - if (likely(ret > 0)) { + if (unlikely(ret <= 0)) { + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart2; + } else { iocb->ki_pos += ret; ret = generic_write_sync(iocb, ret); } diff --git a/include/linux/sched.h b/include/linux/sched.h index d2c881384517..de053ac8d8d6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1067,6 +1067,7 @@ struct task_struct { /* Journalling filesystem info: */ void *journal_info; + unsigned restart_hack:1; /* Stacked block device info: */ struct bio_list *bio_list;