nfsd: don't use GFP_KERNEL from nfsd_getxattr()/nfsd_listxattr()

Message ID 72bf692e-bb6b-c1f2-d1ba-3205ab649b43@I-love.SAKURA.ne.jp (mailing list archive)
State New, archived
Series nfsd: don't use GFP_KERNEL from nfsd_getxattr()/nfsd_listxattr()

Commit Message

Tetsuo Handa April 15, 2023, 11:07 a.m. UTC
Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
does not make sense. Drop __GFP_FS flag in order to avoid deadlock.

Fixes: 32119446bb65 ("nfsd: define xattr functions to call into their vfs counterparts")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 fs/nfsd/vfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Jeff Layton April 15, 2023, 11:42 a.m. UTC | #1
On Sat, 2023-04-15 at 20:07 +0900, Tetsuo Handa wrote:
> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> 
> Fixes: 32119446bb65 ("nfsd: define xattr functions to call into their vfs counterparts")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  fs/nfsd/vfs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 5783209f17fc..109b31246666 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -2164,7 +2164,7 @@ nfsd_getxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name,
>  		goto out;
>  	}
>  
> -	buf = kvmalloc(len, GFP_KERNEL | GFP_NOFS);
> +	buf = kvmalloc(len, GFP_NOFS);
>  	if (buf == NULL) {
>  		err = nfserr_jukebox;
>  		goto out;
> @@ -2230,7 +2230,7 @@ nfsd_listxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char **bufp,
>  	/*
>  	 * We're holding i_rwsem - use GFP_NOFS.
>  	 */
> -	buf = kvmalloc(len, GFP_KERNEL | GFP_NOFS);
> +	buf = kvmalloc(len, GFP_NOFS);
>  	if (buf == NULL) {
>  		err = nfserr_jukebox;
>  		goto out;

I don't get it.

These are the NFS server handlers for the GETXATTR and LISTXATTR
operations. Basically the client is making a call to the server to fetch
or list xattrs. I don't really see a huge issue with them recursing into
reclaim. It's not like this code is in the writeback path.
Chuck Lever III April 15, 2023, 4:13 p.m. UTC | #2
> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.

The server side threads run in process context. GFP_KERNEL
is safe to use here -- as Jeff said, this code is not in
the server's reclaim path. Plenty of other call sites in
the NFS server code use GFP_KERNEL.

But I agree that the flag combination doesn't make sense.
Maybe drop GFP_NOFS instead and call it only a clean-up?


> Fixes: 32119446bb65 ("nfsd: define xattr functions to call into their vfs counterparts")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
> fs/nfsd/vfs.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 5783209f17fc..109b31246666 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -2164,7 +2164,7 @@ nfsd_getxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name,
> goto out;
> }
> 
> - buf = kvmalloc(len, GFP_KERNEL | GFP_NOFS);
> + buf = kvmalloc(len, GFP_NOFS);
> if (buf == NULL) {
> err = nfserr_jukebox;
> goto out;
> @@ -2230,7 +2230,7 @@ nfsd_listxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char **bufp,
> /*
> * We're holding i_rwsem - use GFP_NOFS.
> */
> - buf = kvmalloc(len, GFP_KERNEL | GFP_NOFS);
> + buf = kvmalloc(len, GFP_NOFS);
> if (buf == NULL) {
> err = nfserr_jukebox;
> goto out;


--
Chuck Lever
Tetsuo Handa April 15, 2023, 5:11 p.m. UTC | #3
On 2023/04/16 1:13, Chuck Lever III wrote:
>> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
>>
>> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
>> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> 
> The server side threads run in process context. GFP_KERNEL
> is safe to use here -- as Jeff said, this code is not in
> the server's reclaim path. Plenty of other call sites in
> the NFS server code use GFP_KERNEL.

GFP_KERNEL memory allocation calls filesystem's shrinker functions
because of __GFP_FS flag. My understanding is

  Whether this code is in memory reclaim path or not is irrelevant.
  Whether memory reclaim path might hold lock or not is relevant.

. Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?

> 
> But I agree that the flag combination doesn't make sense.
> Maybe drop GFP_NOFS instead and call it only a clean-up?
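
For reference, the flag relationship under discussion follows directly from the GFP definitions; below is a simplified, illustrative excerpt paraphrasing include/linux/gfp.h (the exact macro layering varies between kernel versions):

	/* Simplified, illustrative only -- see include/linux/gfp.h for the real definitions. */
	#define __GFP_RECLAIM	(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM)
	#define GFP_NOFS	(__GFP_RECLAIM | __GFP_IO)
	#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)	/* i.e. GFP_NOFS | __GFP_FS */

So GFP_KERNEL | GFP_NOFS simply evaluates to GFP_KERNEL, and such an allocation may recurse into filesystem shrinkers because __GFP_FS is set.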
Jeff Layton April 15, 2023, 6:40 p.m. UTC | #4
On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> On 2023/04/16 1:13, Chuck Lever III wrote:
> > > On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > > 
> > > Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> > > does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> > 
> > The server side threads run in process context. GFP_KERNEL
> > is safe to use here -- as Jeff said, this code is not in
> > the server's reclaim path. Plenty of other call sites in
> > the NFS server code use GFP_KERNEL.
> 
> GFP_KERNEL memory allocation calls filesystem's shrinker functions
> because of __GFP_FS flag. My understanding is
> 
>   Whether this code is in memory reclaim path or not is irrelevant.
>   Whether memory reclaim path might hold lock or not is relevant.
> 
> . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
> 

No. At the time of these allocations, the i_rwsem is not held.

> > 
> > But I agree that the flag combination doesn't make sense.
> > Maybe drop GFP_NOFS instead and call it only a clean-up?
> 

Yeah, I think that's the right thing to do here.
Tetsuo Handa April 15, 2023, 11:21 p.m. UTC | #5
On 2023/04/16 3:40, Jeff Layton wrote:
> On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
>> On 2023/04/16 1:13, Chuck Lever III wrote:
>>>> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
>>>>
>>>> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
>>>> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
>>>
>>> The server side threads run in process context. GFP_KERNEL
>>> is safe to use here -- as Jeff said, this code is not in
>>> the server's reclaim path. Plenty of other call sites in
>>> the NFS server code use GFP_KERNEL.
>>
>> GFP_KERNEL memory allocation calls filesystem's shrinker functions
>> because of __GFP_FS flag. My understanding is
>>
>>   Whether this code is in memory reclaim path or not is irrelevant.
>>   Whether memory reclaim path might hold lock or not is relevant.
>>
>> . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
>>
> 
> No. At the time of these allocations, the i_rwsem is not held.

Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
That's why

	/*
	 * We're holding i_rwsem - use GFP_NOFS.
	 */

is explicitly there in nfsd_listxattr() side.

If memory reclaim path (directly or indirectly via locking dependency) involves
inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
Jeff Layton April 16, 2023, 11:51 a.m. UTC | #6
On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
> On 2023/04/16 3:40, Jeff Layton wrote:
> > On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> > > On 2023/04/16 1:13, Chuck Lever III wrote:
> > > > > On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > > > > 
> > > > > Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> > > > > does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> > > > 
> > > > The server side threads run in process context. GFP_KERNEL
> > > > is safe to use here -- as Jeff said, this code is not in
> > > > the server's reclaim path. Plenty of other call sites in
> > > > the NFS server code use GFP_KERNEL.
> > > 
> > > GFP_KERNEL memory allocation calls filesystem's shrinker functions
> > > because of __GFP_FS flag. My understanding is
> > > 
> > >   Whether this code is in memory reclaim path or not is irrelevant.
> > >   Whether memory reclaim path might hold lock or not is relevant.
> > > 
> > > . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
> > > 
> > 
> > No. At the time of these allocations, the i_rwsem is not held.
> 
> Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
> via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
> That's why
> 
> 	/*
> 	 * We're holding i_rwsem - use GFP_NOFS.
> 	 */
> 
> is explicitly there in nfsd_listxattr() side.
> 
> If memory reclaim path (directly or indirectly via locking dependency) involves
> inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
> 

(cc'ing Frank V. who wrote this code and -fsdevel)

I stand corrected! You're absolutely right that it's taking the i_rwsem
for read there. That seems pretty weird, actually. I don't believe we
need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
certainly nothing else under there requires it.

Frank, was there some reason you decided you needed the inode_lock
there? It looks like under the hood, the xattr code requires you to take
it for write in setxattr and removexattr, but you don't need it at all
in getxattr or listxattr. Go figure.

If there's no reason to keep it there, then in addition to removing
GFP_NOFS there I think we probably ought to just remove the inode_lock
from nfsd_getxattr and nfsd_listxattr altogether.

Longer term, I wonder what the inode_lock is protecting in setxattr and
removexattr operations, given that get and list don't require them?
These are always delegated to the filesystem driver -- there is no
generic xattr implementation.

Maybe we ought to do a lock pushdown on those operations at some point?
I'd bet that most of the setxattr/removexattr operations do their own
locking and don't require the i_rwsem at all.
Chuck Lever III April 16, 2023, 4:20 p.m. UTC | #7
> On Apr 16, 2023, at 7:51 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
>> On 2023/04/16 3:40, Jeff Layton wrote:
>>> On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
>>>> On 2023/04/16 1:13, Chuck Lever III wrote:
>>>>>> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
>>>>>> 
>>>>>> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
>>>>>> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
>>>>> 
>>>>> The server side threads run in process context. GFP_KERNEL
>>>>> is safe to use here -- as Jeff said, this code is not in
>>>>> the server's reclaim path. Plenty of other call sites in
>>>>> the NFS server code use GFP_KERNEL.
>>>> 
>>>> GFP_KERNEL memory allocation calls filesystem's shrinker functions
>>>> because of __GFP_FS flag. My understanding is
>>>> 
>>>>  Whether this code is in memory reclaim path or not is irrelevant.
>>>>  Whether memory reclaim path might hold lock or not is relevant.
>>>> 
>>>> . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
>>>> 
>>> 
>>> No. At the time of these allocations, the i_rwsem is not held.
>> 
>> Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
>> via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
>> That's why
>> 
>> /*
>> * We're holding i_rwsem - use GFP_NOFS.
>> */
>> 
>> is explicitly there in nfsd_listxattr() side.
>> 
>> If memory reclaim path (directly or indirectly via locking dependency) involves
>> inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
>> 
> 
> (cc'ing Frank V. who wrote this code and -fsdevel)

Frank is no longer at Amazon. I've added his correct address.


> I stand corrected! You're absolutely right that it's taking the i_rwsem
> for read there. That seems pretty weird, actually.

For sure. I certainly didn't expect that.


> I don't believe we
> need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
> certainly nothing else under there requires it.
> 
> Frank, was there some reason you decided you needed the inode_lock
> there? It looks like under the hood, the xattr code requires you to take
> it for write in setxattr and removexattr, but you don't need it at all
> in getxattr or listxattr. Go figure.
> 
> If there's no reason to keep it there, then in addition to removing
> GFP_NOFS there I think we probably ought to just remove the inode_lock
> from nfsd_getxattr and nfsd_listxattr altogether.
> 
> Longer term, I wonder what the inode_lock is protecting in setxattr and
> removexattr operations, given that get and list don't require them?
> These are always delegated to the filesystem driver -- there is no
> generic xattr implementation.
> 
> Maybe we ought to do a lock pushdown on those operations at some point?
> I'd bet that most of the setxattr/removexattr operations do their own
> locking and don't require the i_rwsem at all.
> -- 
> Jeff Layton <jlayton@kernel.org>

--
Chuck Lever
Dave Chinner April 16, 2023, 11:37 p.m. UTC | #8
On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
> On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
> > On 2023/04/16 3:40, Jeff Layton wrote:
> > > On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> > > > On 2023/04/16 1:13, Chuck Lever III wrote:
> > > > > > On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > > > > > 
> > > > > > Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> > > > > > does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> > > > > 
> > > > > The server side threads run in process context. GFP_KERNEL
> > > > > is safe to use here -- as Jeff said, this code is not in
> > > > > the server's reclaim path. Plenty of other call sites in
> > > > > the NFS server code use GFP_KERNEL.
> > > > 
> > > > GFP_KERNEL memory allocation calls filesystem's shrinker functions
> > > > because of __GFP_FS flag. My understanding is
> > > > 
> > > >   Whether this code is in memory reclaim path or not is irrelevant.
> > > >   Whether memory reclaim path might hold lock or not is relevant.
> > > > 
> > > > . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
> > > > 
> > > 
> > > No. At the time of these allocations, the i_rwsem is not held.
> > 
> > Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
> > via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
> > That's why
> > 
> > 	/*
> > 	 * We're holding i_rwsem - use GFP_NOFS.
> > 	 */
> > 
> > is explicitly there in nfsd_listxattr() side.

You can do GFP_KERNEL allocations holding the i_rwsem just fine.
All that it requires is that the caller holds a reference to the inode,
and at that point inode reclaim should skip the given inode without
ever locking it.

Of course, lockdep can't handle the "referenced inode lock ->
fsreclaim -> unreferenced inode lock" pattern at all. It throws out
false positives when it detects this because it's not aware of the
fact that reference counts prevent inode lock recursion based
deadlocks in the vfs inode cache shrinker.

If a custom, non-vfs shrinker is walking inodes that have no
references and taking i_rwsem in a way that can block without first
checking whether it is safe to lock the inode in a deadlock free
manner, they are doing the wrong thing and the custom shrinker needs
to be fixed.

> > 
> > If memory reclaim path (directly or indirectly via locking dependency) involves
> > inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
> > 
> 
> (cc'ing Frank V. who wrote this code and -fsdevel)
> 
> I stand corrected! You're absolutely right that it's taking the i_rwsem
> for read there. That seems pretty weird, actually. I don't believe we
> need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
> certainly nothing else under there requires it.
> 
> Frank, was there some reason you decided you needed the inode_lock
> there? It looks like under the hood, the xattr code requires you to take
> it for write in setxattr and removexattr, but you don't need it at all
> in getxattr or listxattr. Go figure.

IIRC, the filesystem can't take the i_rwsem for get/listxattr
because the lookup contexts may already hold the i_rwsem. I think
this is largely a problem caused by LSMs (e.g. IMA) needing to
access security xattrs in paths where the inode is already
locked.

> Longer term, I wonder what the inode_lock is protecting in setxattr and
> removexattr operations, given that get and list don't require them?
> These are always delegated to the filesystem driver -- there is no
> generic xattr implementation.

Serialising updates against each other. xattr modifications are
supposed to be "atomic" from the perspective of the user - they see
the entire old or the new xattr, never a partial value.

FWIW, XFS uses its internal metadata rwsem for access/update
serialisation of xattrs because we can't rely on the i_rwsem for
reliable serialisation. I'm guessing that most journalling
filesystems do something similar.

Cheers,

Dave.
Frank van der Linden April 17, 2023, 10:25 p.m. UTC | #9
On Sun, Apr 16, 2023 at 4:38 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
> > On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
> > > On 2023/04/16 3:40, Jeff Layton wrote:
> > > > On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> > > > > On 2023/04/16 1:13, Chuck Lever III wrote:
> > > > > > > On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > > > > > >
> > > > > > > Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> > > > > > > does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> > > > > >
> > > > > > The server side threads run in process context. GFP_KERNEL
> > > > > > is safe to use here -- as Jeff said, this code is not in
> > > > > > the server's reclaim path. Plenty of other call sites in
> > > > > > the NFS server code use GFP_KERNEL.
> > > > >
> > > > > GFP_KERNEL memory allocation calls filesystem's shrinker functions
> > > > > because of __GFP_FS flag. My understanding is
> > > > >
> > > > >   Whether this code is in memory reclaim path or not is irrelevant.
> > > > >   Whether memory reclaim path might hold lock or not is relevant.
> > > > >
> > > > > . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
> > > > >
> > > >
> > > > No. At the time of these allocations, the i_rwsem is not held.
> > >
> > > Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
> > > via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
> > > That's why
> > >
> > >     /*
> > >      * We're holding i_rwsem - use GFP_NOFS.
> > >      */
> > >
> > > is explicitly there in nfsd_listxattr() side.
>
> You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> All that it requires is the caller holds a reference to the inode,
> and at that point inode will should skip the given inode without
> every locking it.
>
> Of course, lockdep can't handle the "referenced inode lock ->
> fsreclaim -> unreferenced inode lock" pattern at all. It throws out
> false positives when it detects this because it's not aware of the
> fact that reference counts prevent inode lock recursion based
> deadlocks in the vfs inode cache shrinker.
>
> If a custom, non-vfs shrinker is walking inodes that have no
> references and taking i_rwsem in a way that can block without first
> checking whether it is safe to lock the inode in a deadlock free
> manner, they are doing the wrong thing and the custom shrinker needs
> to be fixed.
>
> > >
> > > If memory reclaim path (directly or indirectly via locking dependency) involves
> > > inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
> > >
> >
> > (cc'ing Frank V. who wrote this code and -fsdevel)
> >
> > I stand corrected! You're absolutely right that it's taking the i_rwsem
> > for read there. That seems pretty weird, actually. I don't believe we
> > need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
> > certainly nothing else under there requires it.
> >
> > Frank, was there some reason you decided you needed the inode_lock
> > there? It looks like under the hood, the xattr code requires you to take
> > it for write in setxattr and removexattr, but you don't need it at all
> > in getxattr or listxattr. Go figure.
>
> IIRC, the filesytsem can't take the i_rwsem for get/listxattr
> because the lookup contexts may already hold the i_rwsem. I think
> this is largely a problem caused by LSMs (e.g. IMA) needing to
> access security xattrs in paths where the inode is already
> locked.
>
> > Longer term, I wonder what the inode_lock is protecting in setxattr and
> > removexattr operations, given that get and list don't require them?
> > These are always delegated to the filesystem driver -- there is no
> > generic xattr implementation.
>
> Serialising updates against each other. xattr modifications are
> supposed to be "atomic" from the perspective of the user - they see
> the entire old or the new xattr, never a partial value.
>
> FWIW, XFS uses it's internal metadata rwsem for access/update
> serialisation of xattrs because we can't rely on the i_rwsem for
> reliable serialisation. I'm guessing that most journalling
> filesystems do something similar.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Sorry about the slow response - thanks Chuck for adding my
correct/updated email address.

As Dave notes, you need inode_lock to get an atomic view of an xattr.
Since both nfsd_getxattr and nfsd_listxattr do the standard trick of
querying the xattr length with a NULL buf argument (just getting the
length back), allocating the right buffer size, and then querying
again, they need to hold the inode lock to avoid having the xattr
changed from under them while doing that.

From that then flows the concern that __GFP_FS could cause problems
while holding i_rwsem, so I added GFP_NOFS.

If any of these requirements have changed since then, obviously this
should be fixed. But I don't think they have, right?

- Frank
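
A minimal sketch of the query-allocate-requery pattern described above, with the shared lock held across both calls. This is simplified; the exact helpers and error handling in fs/nfsd/vfs.c differ, and the allocation flag shown reflects the conclusion of this thread rather than the code as it stood:

	/* Illustrative sketch only -- simplified from the nfsd_listxattr() pattern. */
	static ssize_t listxattr_locked_sketch(struct dentry *dentry, char **bufp)
	{
		struct inode *inode = d_inode(dentry);
		char *buf = NULL;
		ssize_t len;

		inode_lock_shared(inode);

		/* First pass: a NULL buffer just reports the required length. */
		len = vfs_listxattr(dentry, NULL, 0);
		if (len <= 0)
			goto out;

		buf = kvmalloc(len, GFP_KERNEL);
		if (!buf) {
			len = -ENOMEM;
			goto out;
		}

		/* Second pass: the shared i_rwsem keeps the xattr list stable in between. */
		len = vfs_listxattr(dentry, buf, len);
	out:
		inode_unlock_shared(inode);
		*bufp = buf;
		return len;
	}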
Frank van der Linden April 17, 2023, 11:07 p.m. UTC | #10
On Mon, Apr 17, 2023 at 3:25 PM Frank van der Linden <fvdl@google.com> wrote:
>
> On Sun, Apr 16, 2023 at 4:38 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
> > > On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
> > > > On 2023/04/16 3:40, Jeff Layton wrote:
> > > > > On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> > > > > > On 2023/04/16 1:13, Chuck Lever III wrote:
> > > > > > > > On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > > > > > > >
> > > > > > > > Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> > > > > > > > does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> > > > > > >
> > > > > > > The server side threads run in process context. GFP_KERNEL
> > > > > > > is safe to use here -- as Jeff said, this code is not in
> > > > > > > the server's reclaim path. Plenty of other call sites in
> > > > > > > the NFS server code use GFP_KERNEL.
> > > > > >
> > > > > > GFP_KERNEL memory allocation calls filesystem's shrinker functions
> > > > > > because of __GFP_FS flag. My understanding is
> > > > > >
> > > > > >   Whether this code is in memory reclaim path or not is irrelevant.
> > > > > >   Whether memory reclaim path might hold lock or not is relevant.
> > > > > >
> > > > > > . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
> > > > > >
> > > > >
> > > > > No. At the time of these allocations, the i_rwsem is not held.
> > > >
> > > > Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
> > > > via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
> > > > That's why
> > > >
> > > >     /*
> > > >      * We're holding i_rwsem - use GFP_NOFS.
> > > >      */
> > > >
> > > > is explicitly there in nfsd_listxattr() side.
> >
> > You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> > All that it requires is the caller holds a reference to the inode,
> > and at that point inode will should skip the given inode without
> > every locking it.
> >
> > Of course, lockdep can't handle the "referenced inode lock ->
> > fsreclaim -> unreferenced inode lock" pattern at all. It throws out
> > false positives when it detects this because it's not aware of the
> > fact that reference counts prevent inode lock recursion based
> > deadlocks in the vfs inode cache shrinker.
> >
> > If a custom, non-vfs shrinker is walking inodes that have no
> > references and taking i_rwsem in a way that can block without first
> > checking whether it is safe to lock the inode in a deadlock free
> > manner, they are doing the wrong thing and the custom shrinker needs
> > to be fixed.
> >
> > > >
> > > > If memory reclaim path (directly or indirectly via locking dependency) involves
> > > > inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
> > > >
> > >
> > > (cc'ing Frank V. who wrote this code and -fsdevel)
> > >
> > > I stand corrected! You're absolutely right that it's taking the i_rwsem
> > > for read there. That seems pretty weird, actually. I don't believe we
> > > need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
> > > certainly nothing else under there requires it.
> > >
> > > Frank, was there some reason you decided you needed the inode_lock
> > > there? It looks like under the hood, the xattr code requires you to take
> > > it for write in setxattr and removexattr, but you don't need it at all
> > > in getxattr or listxattr. Go figure.
> >
> > IIRC, the filesytsem can't take the i_rwsem for get/listxattr
> > because the lookup contexts may already hold the i_rwsem. I think
> > this is largely a problem caused by LSMs (e.g. IMA) needing to
> > access security xattrs in paths where the inode is already
> > locked.
> >
> > > Longer term, I wonder what the inode_lock is protecting in setxattr and
> > > removexattr operations, given that get and list don't require them?
> > > These are always delegated to the filesystem driver -- there is no
> > > generic xattr implementation.
> >
> > Serialising updates against each other. xattr modifications are
> > supposed to be "atomic" from the perspective of the user - they see
> > the entire old or the new xattr, never a partial value.
> >
> > FWIW, XFS uses it's internal metadata rwsem for access/update
> > serialisation of xattrs because we can't rely on the i_rwsem for
> > reliable serialisation. I'm guessing that most journalling
> > filesystems do something similar.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>
> Sorry about the slow response - thanks Chuck for adding my
> correct/updated email address.
>
> As Dave notes, you need inode_lock to get an atomic view of an xattr.
> Since both nfsd_getxattr and nfsd_listxattr to the standard trick of
> querying the xattr length with a NULL buf argument (just getting the
> length back), allocating the right buffer size, and then querying
> again, they need to hold the inode lock to avoid having the xattr
> changed from under them while doing that.
>
> From that then flows the requirement that GFP_FS could cause problems
> while holding i_rwsem, so I added GFP_NOFS.
>
> If any of these requirements have changed since then, obviously this
> should be fixed. But I don't think they have, right?
>
> - Frank

Forgot to mention - I know there is vfs_getxattr_alloc, but that
doesn't hold any lock across the query-alloc-requery sequence, so I
didn't use it. The locking in fs/xattr.c does seem to be inconsistent
in this regard, as was mentioned in this thread.

Arguably, the chances of an xattr being changed while you're querying
one are very low (and it doesn't seem to have caused any problems all
this time), but being correct doesn't cost much here.

- Frank
Chuck Lever III April 19, 2023, 1:51 p.m. UTC | #11
> On Apr 16, 2023, at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
>> On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
>>> On 2023/04/16 3:40, Jeff Layton wrote:
>>>> On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
>>>>> On 2023/04/16 1:13, Chuck Lever III wrote:
>>>>>>> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
>>>>>>> 
>>>>>>> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
>>>>>>> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
>>>>>> 
>>>>>> The server side threads run in process context. GFP_KERNEL
>>>>>> is safe to use here -- as Jeff said, this code is not in
>>>>>> the server's reclaim path. Plenty of other call sites in
>>>>>> the NFS server code use GFP_KERNEL.
>>>>> 
>>>>> GFP_KERNEL memory allocation calls filesystem's shrinker functions
>>>>> because of __GFP_FS flag. My understanding is
>>>>> 
>>>>>  Whether this code is in memory reclaim path or not is irrelevant.
>>>>>  Whether memory reclaim path might hold lock or not is relevant.
>>>>> 
>>>>> . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
>>>>> 
>>>> 
>>>> No. At the time of these allocations, the i_rwsem is not held.
>>> 
>>> Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
>>> via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
>>> That's why
>>> 
>>> /*
>>>  * We're holding i_rwsem - use GFP_NOFS.
>>>  */
>>> 
>>> is explicitly there in nfsd_listxattr() side.
> 
> You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> All that it requires is the caller holds a reference to the inode,
> and at that point inode will should skip the given inode without
> every locking it.

This suggests that the fix is to replace "GFP_KERNEL | GFP_NOFS"
with "GFP_KERNEL" /and/ ensure those paths are holding an
appropriate inode reference.

I'd rather not weaken the retry semantics of the memory
allocation if we do not have to do so.


> Of course, lockdep can't handle the "referenced inode lock ->
> fsreclaim -> unreferenced inode lock" pattern at all. It throws out
> false positives when it detects this because it's not aware of the
> fact that reference counts prevent inode lock recursion based
> deadlocks in the vfs inode cache shrinker.
> 
> If a custom, non-vfs shrinker is walking inodes that have no
> references and taking i_rwsem in a way that can block without first
> checking whether it is safe to lock the inode in a deadlock free
> manner, they are doing the wrong thing and the custom shrinker needs
> to be fixed.
> 
>>> 
>>> If memory reclaim path (directly or indirectly via locking dependency) involves
>>> inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
>>> 
>> 
>> (cc'ing Frank V. who wrote this code and -fsdevel)
>> 
>> I stand corrected! You're absolutely right that it's taking the i_rwsem
>> for read there. That seems pretty weird, actually. I don't believe we
>> need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
>> certainly nothing else under there requires it.
>> 
>> Frank, was there some reason you decided you needed the inode_lock
>> there? It looks like under the hood, the xattr code requires you to take
>> it for write in setxattr and removexattr, but you don't need it at all
>> in getxattr or listxattr. Go figure.
> 
> IIRC, the filesytsem can't take the i_rwsem for get/listxattr
> because the lookup contexts may already hold the i_rwsem. I think
> this is largely a problem caused by LSMs (e.g. IMA) needing to
> access security xattrs in paths where the inode is already
> locked.
> 
>> Longer term, I wonder what the inode_lock is protecting in setxattr and
>> removexattr operations, given that get and list don't require them?
>> These are always delegated to the filesystem driver -- there is no
>> generic xattr implementation.
> 
> Serialising updates against each other. xattr modifications are
> supposed to be "atomic" from the perspective of the user - they see
> the entire old or the new xattr, never a partial value.
> 
> FWIW, XFS uses it's internal metadata rwsem for access/update
> serialisation of xattrs because we can't rely on the i_rwsem for
> reliable serialisation. I'm guessing that most journalling
> filesystems do something similar.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


--
Chuck Lever
Dave Chinner April 19, 2023, 11:32 p.m. UTC | #12
On Wed, Apr 19, 2023 at 01:51:12PM +0000, Chuck Lever III wrote:
> 
> 
> > On Apr 16, 2023, at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
> >> On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
> >>> On 2023/04/16 3:40, Jeff Layton wrote:
> >>>> On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> >>>>> On 2023/04/16 1:13, Chuck Lever III wrote:
> >>>>>>> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> >>>>>>> 
> >>>>>>> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> >>>>>>> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> >>>>>> 
> >>>>>> The server side threads run in process context. GFP_KERNEL
> >>>>>> is safe to use here -- as Jeff said, this code is not in
> >>>>>> the server's reclaim path. Plenty of other call sites in
> >>>>>> the NFS server code use GFP_KERNEL.
> >>>>> 
> >>>>> GFP_KERNEL memory allocation calls filesystem's shrinker functions
> >>>>> because of __GFP_FS flag. My understanding is
> >>>>> 
> >>>>>  Whether this code is in memory reclaim path or not is irrelevant.
> >>>>>  Whether memory reclaim path might hold lock or not is relevant.
> >>>>> 
> >>>>> . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
> >>>>> 
> >>>> 
> >>>> No. At the time of these allocations, the i_rwsem is not held.
> >>> 
> >>> Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
> >>> via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
> >>> That's why
> >>> 
> >>> /*
> >>>  * We're holding i_rwsem - use GFP_NOFS.
> >>>  */
> >>> 
> >>> is explicitly there in nfsd_listxattr() side.
> > 
> > You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> > All that it requires is the caller holds a reference to the inode,
> > and at that point inode will should skip the given inode without
> > every locking it.
> 
> This suggests that the fix is to replace "GFP_KERNEL | GFP_NOFS"
> with "GFP_KERNEL" /and/ ensure those paths are holding an
> appropriate inode reference.

If the code that provided the inode to nfsd_listxattr() did not
already have an active inode reference in the first place then there
are much, much bigger UAF problems to worry about than simple
memory reclaim deadlocks.

That said, nfsd_listxattr() does:

        dentry = fhp->fh_dentry;
        inode = d_inode(dentry);
        *lenp = 0;

        inode_lock_shared(inode);

        len = vfs_listxattr(dentry, NULL, 0);

Given that a dentry pointing to an inode *must* hold an active
reference to that inode, I don't see how it is possible this code
path could be using an unreferenced inode.

nfsd_getxattr() has a similar code fragment to obtain the inode as
well, so same goes for that...

-Dave.
Chuck Lever III April 20, 2023, 1:41 p.m. UTC | #13
> On Apr 19, 2023, at 7:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Wed, Apr 19, 2023 at 01:51:12PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Apr 16, 2023, at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
>>>> On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
>>>>> On 2023/04/16 3:40, Jeff Layton wrote:
>>>>>> On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
>>>>>>> On 2023/04/16 1:13, Chuck Lever III wrote:
>>>>>>>>> On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
>>>>>>>>> 
>>>>>>>>> Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
>>>>>>>>> does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
>>>>>>>> 
>>>>>>>> The server side threads run in process context. GFP_KERNEL
>>>>>>>> is safe to use here -- as Jeff said, this code is not in
>>>>>>>> the server's reclaim path. Plenty of other call sites in
>>>>>>>> the NFS server code use GFP_KERNEL.
>>>>>>> 
>>>>>>> GFP_KERNEL memory allocation calls filesystem's shrinker functions
>>>>>>> because of __GFP_FS flag. My understanding is
>>>>>>> 
>>>>>>> Whether this code is in memory reclaim path or not is irrelevant.
>>>>>>> Whether memory reclaim path might hold lock or not is relevant.
>>>>>>> 
>>>>>>> . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
>>>>>>> 
>>>>>> 
>>>>>> No. At the time of these allocations, the i_rwsem is not held.
>>>>> 
>>>>> Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
>>>>> via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
>>>>> That's why
>>>>> 
>>>>> /*
>>>>> * We're holding i_rwsem - use GFP_NOFS.
>>>>> */
>>>>> 
>>>>> is explicitly there in nfsd_listxattr() side.
>>> 
>>> You can do GFP_KERNEL allocations holding the i_rwsem just fine.
>>> All that it requires is the caller holds a reference to the inode,
>>> and at that point inode will should skip the given inode without
>>> every locking it.
>> 
>> This suggests that the fix is to replace "GFP_KERNEL | GFP_NOFS"
>> with "GFP_KERNEL" /and/ ensure those paths are holding an
>> appropriate inode reference.
> 
> If the code that provided the inode to nfsd_listxattr() did not
> already have an active inode reference in the first place then there
> are much, much bigger UAF problems to worry about than simple
> memory reclaim deadlocks.
> 
> That said, nfsd_listxattr() does:
> 
>        dentry = fhp->fh_dentry;
>        inode = d_inode(dentry);
>        *lenp = 0;
> 
>        inode_lock_shared(inode);
> 
>        len = vfs_listxattr(dentry, NULL, 0);
> 
> Given that a dentry pointing to an inode *must* hold an active
> reference to that inode, I don't see how it is possible this code
> path could be using an unreferenced inode.
> 
> nfsd_getxattr() has a similar code fragment to obtain the inode as
> well, so same goes for that...

Dave, thanks for handling the due diligence! I was not 100% sure
about code that handles xattrs rather than the primary byte stream
of a file.

Tetsuo, you can send a v2, or just let me know and I will make
a patch to correct the GFP flags.

--
Chuck Lever
Tetsuo Handa April 20, 2023, 2:05 p.m. UTC | #14
On 2023/04/20 22:41, Chuck Lever III wrote:
>> That said, nfsd_listxattr() does:
>>
>>        dentry = fhp->fh_dentry;
>>        inode = d_inode(dentry);
>>        *lenp = 0;
>>
>>        inode_lock_shared(inode);
>>
>>        len = vfs_listxattr(dentry, NULL, 0);
>>
>> Given that a dentry pointing to an inode *must* hold an active
>> reference to that inode, I don't see how it is possible this code
>> path could be using an unreferenced inode.
>>
>> nfsd_getxattr() has a similar code fragment to obtain the inode as
>> well, so same goes for that...
> 
> Dave, thanks for handling the due diligence! I was not 100% sure
> about code that handles xattrs rather than the primary byte stream
> of a file.
> 
> Tetsuo, you can send a v2, or just let me know and I will make
> a patch to correct the GFP flags.

So, this inode_lock_shared() was there with an intention to make sure that
xattr of inode inside the exported filesystem does not change between
vfs_listxattr(dentry, NULL, 0) and vfs_listxattr(dentry, buf, len),
wasn't it?

Then, we can remove this inode_lock_shared() by adding a "goto retry;"
when vfs_listxattr(dentry, buf, len) failed with out of buffer size
due to a race condition, can't we?

I leave replacing inode lock with retry path and removing GFP_NOFS to you.

Thank you.
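
A rough sketch of the retry approach suggested here, dropping the inode lock entirely. This is hypothetical, is not what fs/nfsd/vfs.c does, and a real version would want a cap on the number of retries:

	/* Hypothetical lock-free variant -- a sketch of the retry idea, not actual nfsd code. */
	static ssize_t listxattr_retry_sketch(struct dentry *dentry, char **bufp)
	{
		ssize_t len, ret;
		char *buf;

	retry:
		len = vfs_listxattr(dentry, NULL, 0);
		if (len <= 0)
			return len;

		buf = kvmalloc(len, GFP_KERNEL);
		if (!buf)
			return -ENOMEM;

		ret = vfs_listxattr(dentry, buf, len);
		if (ret == -ERANGE) {
			/* The xattr list grew between the two calls -- free and start over. */
			kvfree(buf);
			goto retry;
		}
		if (ret < 0) {
			kvfree(buf);
			return ret;
		}

		*bufp = buf;
		return ret;
	}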
Chuck Lever III April 20, 2023, 2:41 p.m. UTC | #15
> On Apr 20, 2023, at 10:05 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> On 2023/04/20 22:41, Chuck Lever III wrote:
>>> That said, nfsd_listxattr() does:
>>> 
>>>       dentry = fhp->fh_dentry;
>>>       inode = d_inode(dentry);
>>>       *lenp = 0;
>>> 
>>>       inode_lock_shared(inode);
>>> 
>>>       len = vfs_listxattr(dentry, NULL, 0);
>>> 
>>> Given that a dentry pointing to an inode *must* hold an active
>>> reference to that inode, I don't see how it is possible this code
>>> path could be using an unreferenced inode.
>>> 
>>> nfsd_getxattr() has a similar code fragment to obtain the inode as
>>> well, so same goes for that...
>> 
>> Dave, thanks for handling the due diligence! I was not 100% sure
>> about code that handles xattrs rather than the primary byte stream
>> of a file.
>> 
>> Tetsuo, you can send a v2, or just let me know and I will make
>> a patch to correct the GFP flags.
> 
> So, this inode_lock_shared() was there with an intention to make sure that
> xattr of inode inside the exported filesystem does not change between
> vfs_listxattr(dentry, NULL, 0) and vfs_listxattr(dentry, buf, len),
> wasn't it?
> 
> Then, we can remove this inode_lock_shared() by adding a "goto retry;"
> when vfs_listxattr(dentry, buf, len) failed with out of buffer size
> due to a race condition, can't we?
> 
> I leave replacing inode lock with retry path and removing GFP_NOFS to you.

Jeff similarly mentioned to me the tactic of simply retrying if the
retrieved listxattr contents exceed the size of the buffer. I think
I'd like to leave that for another time, as it seems to be safe to
use GFP_KERNEL with the inode lock held in this case.


--
Chuck Lever
Jeff Layton April 20, 2023, 3:20 p.m. UTC | #16
On Mon, 2023-04-17 at 09:37 +1000, Dave Chinner wrote:
> On Sun, Apr 16, 2023 at 07:51:41AM -0400, Jeff Layton wrote:
> > On Sun, 2023-04-16 at 08:21 +0900, Tetsuo Handa wrote:
> > > On 2023/04/16 3:40, Jeff Layton wrote:
> > > > On Sun, 2023-04-16 at 02:11 +0900, Tetsuo Handa wrote:
> > > > > On 2023/04/16 1:13, Chuck Lever III wrote:
> > > > > > > On Apr 15, 2023, at 7:07 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > > > > > > 
> > > > > > > Since GFP_KERNEL is GFP_NOFS | __GFP_FS, usage like GFP_KERNEL | GFP_NOFS
> > > > > > > does not make sense. Drop __GFP_FS flag in order to avoid deadlock.
> > > > > > 
> > > > > > The server side threads run in process context. GFP_KERNEL
> > > > > > is safe to use here -- as Jeff said, this code is not in
> > > > > > the server's reclaim path. Plenty of other call sites in
> > > > > > the NFS server code use GFP_KERNEL.
> > > > > 
> > > > > GFP_KERNEL memory allocation calls filesystem's shrinker functions
> > > > > because of __GFP_FS flag. My understanding is
> > > > > 
> > > > >   Whether this code is in memory reclaim path or not is irrelevant.
> > > > >   Whether memory reclaim path might hold lock or not is relevant.
> > > > > 
> > > > > . Therefore, question is, does nfsd hold i_rwsem during memory reclaim path?
> > > > > 
> > > > 
> > > > No. At the time of these allocations, the i_rwsem is not held.
> > > 
> > > Excuse me? nfsd_getxattr()/nfsd_listxattr() _are_ holding i_rwsem
> > > via inode_lock_shared(inode) before kvmalloc(GFP_KERNEL | GFP_NOFS) allocation.
> > > That's why
> > > 
> > > 	/*
> > > 	 * We're holding i_rwsem - use GFP_NOFS.
> > > 	 */
> > > 
> > > is explicitly there in nfsd_listxattr() side.
> 
> You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> All that it requires is the caller holds a reference to the inode,
> and at that point inode will should skip the given inode without
> every locking it.
> 
> Of course, lockdep can't handle the "referenced inode lock ->
> fsreclaim -> unreferenced inode lock" pattern at all. It throws out
> false positives when it detects this because it's not aware of the
> fact that reference counts prevent inode lock recursion based
> deadlocks in the vfs inode cache shrinker.
> 
> If a custom, non-vfs shrinker is walking inodes that have no
> references and taking i_rwsem in a way that can block without first
> checking whether it is safe to lock the inode in a deadlock free
> manner, they are doing the wrong thing and the custom shrinker needs
> to be fixed.
> 
> > > 
> > > If memory reclaim path (directly or indirectly via locking dependency) involves
> > > inode_lock_shared(inode)/inode_lock(inode), it is not safe to use __GFP_FS flag.
> > > 
> > 
> > (cc'ing Frank V. who wrote this code and -fsdevel)
> > 
> > I stand corrected! You're absolutely right that it's taking the i_rwsem
> > for read there. That seems pretty weird, actually. I don't believe we
> > need to hold the inode_lock to call vfs_getxattr or vfs_listxattr, and
> > certainly nothing else under there requires it.
> > 
> > Frank, was there some reason you decided you needed the inode_lock
> > there? It looks like under the hood, the xattr code requires you to take
> > it for write in setxattr and removexattr, but you don't need it at all
> > in getxattr or listxattr. Go figure.
> 
> IIRC, the filesytsem can't take the i_rwsem for get/listxattr
> because the lookup contexts may already hold the i_rwsem. I think
> this is largely a problem caused by LSMs (e.g. IMA) needing to
> access security xattrs in paths where the inode is already
> locked.

> > Longer term, I wonder what the inode_lock is protecting in setxattr and
> > removexattr operations, given that get and list don't require them?
> > These are always delegated to the filesystem driver -- there is no
> > generic xattr implementation.
> 
> Serialising updates against each other. xattr modifications are
> supposed to be "atomic" from the perspective of the user - they see
> the entire old or the new xattr, never a partial value.
> 
> FWIW, XFS uses it's internal metadata rwsem for access/update
> serialisation of xattrs because we can't rely on the i_rwsem for
> reliable serialisation. I'm guessing that most journalling
> filesystems do something similar.

(Almost?) All of the existing xattr implementations already have their
own internal locking for consistency. Given that they can't rely on
taking the i_rwsem during getxattr or listxattr, it's hard to see how
any of them could rely on that to ensure consistency there.

We probably ought to see if we can push the inode_lock down into the
various setxattr/removexattr handlers and then we could just have them
drop that lock on a case-by-case basis after validating that it wasn't
needed.

Patch

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5783209f17fc..109b31246666 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -2164,7 +2164,7 @@  nfsd_getxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name,
 		goto out;
 	}
 
-	buf = kvmalloc(len, GFP_KERNEL | GFP_NOFS);
+	buf = kvmalloc(len, GFP_NOFS);
 	if (buf == NULL) {
 		err = nfserr_jukebox;
 		goto out;
@@ -2230,7 +2230,7 @@  nfsd_listxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char **bufp,
 	/*
 	 * We're holding i_rwsem - use GFP_NOFS.
 	 */
-	buf = kvmalloc(len, GFP_KERNEL | GFP_NOFS);
+	buf = kvmalloc(len, GFP_NOFS);
 	if (buf == NULL) {
 		err = nfserr_jukebox;
 		goto out;