From patchwork Fri Jan 5 01:34:02 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Trond Myklebust X-Patchwork-Id: 10145917 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 2F4C160244 for ; Fri, 5 Jan 2018 01:34:11 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 66BB726256 for ; Fri, 5 Jan 2018 01:34:11 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 57B46286F6; Fri, 5 Jan 2018 01:34:11 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 66D7526256 for ; Fri, 5 Jan 2018 01:34:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751383AbeAEBeJ (ORCPT ); Thu, 4 Jan 2018 20:34:09 -0500 Received: from us-smtp-delivery-194.mimecast.com ([63.128.21.194]:49984 "EHLO us-smtp-delivery-194.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751208AbeAEBeH (ORCPT ); Thu, 4 Jan 2018 20:34:07 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=primarydata.com; s=mimecast20170802; t=1515116046; h=from:subject:date:message-id:to:cc:mime-version:content-type:in-reply-to:references; bh=OEo0HuDE2vXRzr6SmcwB7jKaOAfo7HRrrv1v5pCFpSg=; b=eAUwOPrz67Bpq5wvZn5XZJLvLUmXcznR0S2WZQ53brBaUFX6cqh7WQ50+CdTgdVn9d5R6CWHejNv3re6aZxKTk69VBFEFhLInPK+Im5VB/+qT17EBhcTfvCxPVnr89XE+uDVHU1Ac01kQ8edsL8wYA2SppPmiagZtxOfaCbZzfA= Received: from NAM01-BN3-obe.outbound.protection.outlook.com (mail-bn3nam01lp0183.outbound.protection.outlook.com [216.32.180.183]) (Using TLS) by us-smtp-1.mimecast.com with ESMTP id us-mta-140-Am6PFMlRNYWqjDq4ODQrNQ-1; Thu, 04 Jan 2018 20:34:04 -0500 X-MC-Unique: Am6PFMlRNYWqjDq4ODQrNQ-1 Received: from DM5PR11MB0075.namprd11.prod.outlook.com (10.164.155.144) by DM5PR11MB0075.namprd11.prod.outlook.com (10.164.155.144) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P256) id 15.20.366.8; Fri, 5 Jan 2018 01:34:03 +0000 Received: from DM5PR11MB0075.namprd11.prod.outlook.com ([10.164.155.144]) by DM5PR11MB0075.namprd11.prod.outlook.com ([10.164.155.144]) with mapi id 15.20.0366.011; Fri, 5 Jan 2018 01:34:02 +0000 From: Trond Myklebust To: "neilb@suse.com" , "chuck.lever@oracle.com" , "jlayton@kernel.org" CC: "Anna.Schumaker@netapp.com" , "linux-kernel@vger.kernel.org" , "linux-nfs@vger.kernel.org" , "linux-fsdevel@vger.kernel.org" Subject: Re: [PATCH/RFC] NFS: add nostatflush mount option. Thread-Topic: [PATCH/RFC] NFS: add nostatflush mount option. Thread-Index: AQHTegeFGOaJJxFIg0GzqIIc5l9e7aNN77WAgAAES4CAAFU3gIACo1KAgA7QEwCABNnSgA== Date: Fri, 5 Jan 2018 01:34:02 +0000 Message-ID: <1515116033.87651.1.camel@primarydata.com> References: <87k1xgkct1.fsf@notabene.neil.brown.name> <4B4DA4D4-8068-4C10-92BE-F03632522C75@oracle.com> <1513871689.11836.3.camel@primarydata.com> <87efnnkda2.fsf@notabene.neil.brown.name> <1514035013.3425.8.camel@kernel.org> <87d12tf99x.fsf@notabene.neil.brown.name> In-Reply-To: <87d12tf99x.fsf@notabene.neil.brown.name> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: yes X-MS-TNEF-Correlator: authentication-results: spf=none (sender IP is ) smtp.mailfrom=trondmy@primarydata.com; x-originating-ip: [68.49.162.121] x-ms-publictraffictype: Email x-microsoft-exchange-diagnostics: 1; DM5PR11MB0075; 7:JaMCUrBfhMECmGPeSXBxpHJM9XmOcKKmUHz2SjAM/4bHFA6pswL7TdDm6f6d3Ae8Sbn1YQ+0u/rihqNQifCh4d/uDQMOzaJbatMHiG+zxGS8p7MANJULAu7v0txrf4YvnpWuOFmyvwDwBzbLKcU2Yt5dWV9SoMyNB2g1/e852e1M7V4hF63SC2UOsun8W6adlHpX8BSgxGQ3pD//pFDZ7LkNkJJMW0/N74pTqUBsrTMHxJew6WS491Y25bN/kEjO; 20:AtqboHUTdKq9rxnEcPeDjcP72q5qIM4TR8t3fJiWdivjjtW7JhUstcm+WbdYn0skdQrB7Krappbt8eV3Q5ry58bVAIIwvjId742J1FJ3DcoYJyFeM4kjs6A8TGA4QB1vI4yINtE5KlHVxydXM1iIbWUv7ELkFDs5uCh+6xFXA88= x-ms-exchange-antispam-srfa-diagnostics: SSOS; x-ms-office365-filtering-correlation-id: 352aae73-7db9-4900-7d5e-08d553dc6707 x-microsoft-antispam: UriScan:; BCL:0; PCL:0; RULEID:(4534020)(4602075)(4603075)(4627115)(201702281549075)(5600026)(4604075)(3008032)(2017052603307)(7153060)(49563074); SRVR:DM5PR11MB0075; x-ms-traffictypediagnostic: DM5PR11MB0075: x-microsoft-antispam-prvs: x-exchange-antispam-report-test: UriScan:(158342451672863)(278428928389397); x-exchange-antispam-report-cfa-test: BCL:0; PCL:0; RULEID:(102415395)(6040470)(2401047)(5005006)(8121501046)(3231023)(944501075)(10201501046)(3002001)(93006095)(93001095)(6041268)(2016111802025)(20161123562045)(20161123564045)(20161123560045)(201703131423095)(201702281528075)(20161123555045)(201703061421075)(201703061406153)(20161123558120)(6043046)(6072148)(201708071742011); SRVR:DM5PR11MB0075; BCL:0; PCL:0; RULEID:(100000803101)(100110400095); SRVR:DM5PR11MB0075; x-forefront-prvs: 05437568AA x-forefront-antispam-report: SFV:NSPM; SFS:(10019020)(39380400002)(346002)(396003)(376002)(39840400004)(366004)(24454002)(377424004)(199004)(189003)(478600001)(14454004)(3280700002)(105586002)(25786009)(2906002)(106356001)(6512007)(2201001)(3846002)(6116002)(2900100001)(3660700001)(36756003)(4326008)(305945005)(316002)(99936001)(81166006)(8936002)(54906003)(7736002)(110136005)(6436002)(66066001)(102836004)(2501003)(99286004)(76176011)(6246003)(68736007)(8676002)(4001150100001)(81156014)(229853002)(2950100002)(5660300001)(53546011)(93886005)(6506007)(53936002)(59450400001)(6486002)(86362001)(77096006)(97736004)(103116003); DIR:OUT; SFP:1102; SCL:1; SRVR:DM5PR11MB0075; H:DM5PR11MB0075.namprd11.prod.outlook.com; FPR:; SPF:None; PTR:InfoNoRecords; A:1; MX:1; LANG:en; received-spf: None (protection.outlook.com: primarydata.com does not designate permitted sender hosts) x-microsoft-antispam-message-info: g5tAztm3mIn+pCqfWvTkYymsLyvLDxn/AQLrQpEpuuFg/dJBYlsP2G8IkOPxqQLH+NT9BbKeLg4SjtRvtF8PZg== spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM MIME-Version: 1.0 X-OriginatorOrg: primarydata.com X-MS-Exchange-CrossTenant-Network-Message-Id: 352aae73-7db9-4900-7d5e-08d553dc6707 X-MS-Exchange-CrossTenant-originalarrivaltime: 05 Jan 2018 01:34:02.6278 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: 03193ed6-8726-4bb3-a832-18ab0d28adb7 X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM5PR11MB0075 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Hi Neil, On Tue, 2018-01-02 at 10:29 +1100, NeilBrown wrote: > On Sat, Dec 23 2017, Jeff Layton wrote: > > > On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote: > > > On Thu, Dec 21 2017, Trond Myklebust wrote: > > > > > > > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote: > > > > > Hi Neil- > > > > > > > > > > > > > > > > On Dec 20, 2017, at 9:57 PM, NeilBrown > > > > > > wrote: > > > > > > > > > > > > > > > > > > When an i_op->getattr() call is made on an NFS file > > > > > > (typically from a 'stat' family system call), NFS > > > > > > will first flush any dirty data to the server. > > > > > > > > > > > > This ensures that the mtime reported is correct and stable, > > > > > > but has a performance penalty. 'stat' is normally thought > > > > > > to be a quick operation, and imposing this cost can be > > > > > > surprising. > > > > > > > > > > To be clear, this behavior is a POSIX requirement. > > > > > > > > > > > > > > > > I have seen problems when one process is writing a large > > > > > > file and another process performs "ls -l" on the containing > > > > > > directory and is blocked for as long as it take to flush > > > > > > all the dirty data to the server, which can be minutes. > > > > > > > > > > Yes, a well-known annoyance that cannot be addressed > > > > > even with a write delegation. > > > > > > > > > > > > > > > > I have also seen a legacy application which frequently > > > > > > calls > > > > > > "fstat" on a file that it is writing to. On a local > > > > > > filesystem (and in the Solaris implementation of NFS) this > > > > > > fstat call is cheap. On Linux/NFS, the causes a noticeable > > > > > > decrease in throughput. > > > > > > > > > > If the preceding write is small, Linux could be using > > > > > a FILE_SYNC write, but Solaris could be using UNSTABLE. > > > > > > > > > > > > > > > > The only circumstances where an application calling > > > > > > 'stat()' > > > > > > might get an mtime which is not stable are times when some > > > > > > other process is writing to the file and the two processes > > > > > > are not using locking to ensure consistency, or when the > > > > > > one > > > > > > process is both writing and stating. In neither of these > > > > > > cases is it reasonable to expect the mtime to be stable. > > > > > > > > > > I'm not convinced this is a strong enough rationale > > > > > for claiming it is safe to disable the existing > > > > > behavior. > > > > > > > > > > You've explained cases where the new behavior is > > > > > reasonable, but do you have any examples where the > > > > > new behavior would be a problem? There must be a > > > > > reason why POSIX explicitly requires an up-to-date > > > > > mtime. > > > > > > > > > > What guidance would nfs(5) give on when it is safe > > > > > to specify the new mount option? > > > > > > > > > > > > > > > > In the most common cases where mtime is important > > > > > > (e.g. make), no other process has the file open, so there > > > > > > will be no dirty data and the mtime will be stable. > > > > > > > > > > Isn't it also the case that make is a multi-process > > > > > workload where one process modifies a file, then > > > > > closes it (which triggers a flush), and then another > > > > > process stats the file? The new mount option does > > > > > not change the behavior of close(2), does it? > > > > > > > > > > > > > > > > Rather than unilaterally changing this behavior of 'stat', > > > > > > this patch adds a "nosyncflush" mount option to allow > > > > > > sysadmins to have applications which are hurt by the > > > > > > current > > > > > > behavior to disable it. > > > > > > > > > > IMO a mount option is at the wrong granularity. A > > > > > mount point will be shared between applications that > > > > > can tolerate the non-POSIX behavior and those that > > > > > cannot, for instance. > > > > > > > > Agreed. > > > > > > > > The other thing to note here is that we now have an embryonic > > > > statx() > > > > system call, which allows the application itself to decide > > > > whether or > > > > not it needs up to date values for the atime/ctime/mtime. While > > > > we > > > > haven't yet plumbed in the NFS side, the intention was always > > > > to use > > > > that information to turn off the writeback flushing when > > > > possible. > > > > > > Yes, if statx() were actually working, we could change the > > > application > > > to avoid the flush. But then if changing the application were an > > > option, I suspect that - for my current customer issue - we could > > > just > > > remove the fstat() calls. I doubt they are really necessary. > > > I think programmers often think of stat() (and particularly > > > fstat()) as > > > fairly cheap and so they use it whenever convenient. Only NFS > > > violates > > > this expectation. > > > > > > Also statx() is only a real solution if/when it gets widely > > > used. Will > > > "ls -l" default to AT_STATX_DONT_SYNC ?? > > > > > > > Maybe. Eventually, I could see glibc converting normal > > stat/fstat/etc. > > to use a statx() syscall under the hood (similar to how stat > > syscalls on > > 32-bit arches will use stat64 in most cases). > > > > With that, we could look at any number of ways to sneak a "don't > > flush" > > flag into the call. Maybe an environment variable that causes the > > stat > > syscall wrapper to add it? I think there are possibilities there > > that > > don't necessarily require recompiling applications. > > Thanks - interesting ideas. > > One possibility would be an LD_PRELOAD which implements fstat() using > statx(). > That doesn't address the "ls -l is needlessly slow" problem, but it > would address the "legacy application calls fstat too often" problem. > > This isn't an option for the "enterprise kernel" the customer is > using > (statx? what is statx?), but having a clear view of a credible > upstream solution is very helpful. > > So thanks - and thanks a lot to Trond and Chuck for your input. It > helped clarify my thoughts a lot. > > Is anyone working on proper statx support for NFS, or is it a case of > "that shouldn't be hard and we should do that, but it isn't a high > priority for anyone" ?? How about something like the following? Cheers Trond 8<-------------------------------------------------------- From 755b6771deb8d793c90f56fddf7070d7c2ea87b5 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Thu, 4 Jan 2018 17:46:09 -0500 Subject: [PATCH] Support statx() mask and query flags parameters Support the query flags AT_STATX_FORCE_SYNC by forcing an attribute revalidation, and AT_STATX_DONT_SYNC by returning cached attributes only. Use the mask to optimise away server revalidation for attributes that are not being requested by the user. Signed-off-by: Trond Myklebust --- fs/nfs/inode.c | 40 ++++++++++++++++++++++++++++++++++------ 1 file changed, 34 insertions(+), 6 deletions(-) diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index b112002dbdb2..a703b1d1500d 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -735,12 +735,22 @@ int nfs_getattr(const struct path *path, struct kstat *stat, u32 request_mask, unsigned int query_flags) { struct inode *inode = d_inode(path->dentry); - int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME; + unsigned long cache_validity; + bool force_sync = query_flags & AT_STATX_FORCE_SYNC; + bool dont_sync = !force_sync && query_flags & AT_STATX_DONT_SYNC; + bool need_atime = !dont_sync; + bool need_cmtime = !dont_sync; + bool reval = force_sync; int err = 0; + if (!(request_mask & STATX_ATIME)) + need_atime = false; + if (!(request_mask & (STATX_CTIME|STATX_MTIME))) + need_cmtime = false; + trace_nfs_getattr_enter(inode); /* Flush out writes to the server in order to update c/mtime. */ - if (S_ISREG(inode->i_mode)) { + if (S_ISREG(inode->i_mode) && need_cmtime) { err = filemap_write_and_wait(inode->i_mapping); if (err) goto out; @@ -757,9 +767,22 @@ int nfs_getattr(const struct path *path, struct kstat *stat, */ if ((path->mnt->mnt_flags & MNT_NOATIME) || ((path->mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))) - need_atime = 0; - - if (need_atime || nfs_need_revalidate_inode(inode)) { + need_atime = false; + + /* Check for whether the cached attributes are invalid */ + cache_validity = READ_ONCE(NFS_I(inode)->cache_validity); + if (need_cmtime) + reval |= cache_validity & NFS_INO_REVAL_PAGECACHE; + if (need_atime) + reval |= cache_validity & NFS_INO_INVALID_ATIME; + if (request_mask & (STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID| + STATX_ATIME|STATX_MTIME|STATX_CTIME| + STATX_SIZE|STATX_BLOCKS)) + reval |= cache_validity & NFS_INO_INVALID_ATTR; + if (dont_sync) + reval = false; + + if (reval) { struct nfs_server *server = NFS_SERVER(inode); if (!(server->flags & NFS_MOUNT_NOAC)) @@ -767,13 +790,18 @@ int nfs_getattr(const struct path *path, struct kstat *stat, else nfs_readdirplus_parent_cache_hit(path->dentry); err = __nfs_revalidate_inode(server, inode); - } else + } else if (!dont_sync) nfs_readdirplus_parent_cache_hit(path->dentry); if (!err) { generic_fillattr(inode, stat); stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode)); if (S_ISDIR(inode->i_mode)) stat->blksize = NFS_SERVER(inode)->dtsize; + /* Return only the requested attrs if others may be stale */ + if (!reval && cache_validity & (NFS_INO_REVAL_PAGECACHE| + NFS_INO_INVALID_ATIME| + NFS_INO_INVALID_ATTR)) + stat->result_mask &= request_mask; } out: trace_nfs_getattr_exit(inode, err);