
[RFC] NFSD: fix cannot umounting mount points under pseudo root

Message ID 5538EB18.7080802@gmail.com (mailing list archive)
State New, archived

Commit Message

Kinglong Mee April 23, 2015, 12:52 p.m. UTC
On 4/23/2015 7:44 AM, NeilBrown wrote:
> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> wrote:
>>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
>>> it is necessary to store them in svc_export.
>>>
>>> Neil points out another way of 'fs_pin', I will check that.
>>
>> Yes, that'd be interesting.  On a very quick look--I don't understand
>> what it's meant to do at all.  But if it does provide a way to get a
>> callback on umount, that'd certainly be interesting....
> 
> Yeah, on a quick look it isn't really obvious at all.
> 
> But I didn't read the code.  I read
>  https://lwn.net/Articles/636730/
> 
> which says:
> 
>     In its place is a mechanism by which an object can be added to a vfsmount
>     structure (which represents a mounted filesystem); that object supports 
>     only one method: kill(). These "pin" objects hang around until the final
>     reference to the vfsmount goes away, at which point each one's kill() function
>     is called. It thus is a clean mechanism for performing cleanup when a filesystem
>     goes away.
> 
> This is used to close "BSD process accounting" files on umount, and sound
> like the perfect interface for flushing things from the sunrpc cache on
> umount.

I have reviewed the code in fs/fs_pin.c and kernel/acct.c.
'fs_pin' is used to fix the race between acct's file->f_path.mnt and
umount: file->f_path.mnt holds a reference on the vfsmount but does not
keep it busy, so umount would otherwise see that reference and return busy.

The logic is the same as for the sunrpc cache of exports.
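
For reference, the interface in include/linux/fs_pin.h is small; roughly
(paraphrased from that era's header, comments added for clarity):

struct fs_pin {
        wait_queue_head_t       wait;   /* pin_kill() sleeps here            */
        int                     done;   /* -1: kill in progress, 1: removed  */
        struct hlist_node       s_list; /* on the superblock's s_pins        */
        struct hlist_node       m_list; /* on the vfsmount's mnt_pins        */
        void (*kill)(struct fs_pin *);  /* called at the final mntput()      */
};

static inline void init_fs_pin(struct fs_pin *p, void (*kill)(struct fs_pin *))
{
        init_waitqueue_head(&p->wait);
        INIT_HLIST_NODE(&p->s_list);
        INIT_HLIST_NODE(&p->m_list);
        p->kill = kill;
}

void pin_remove(struct fs_pin *);       /* detach; the owner calls this      */
void pin_insert_group(struct fs_pin *, struct vfsmount *, struct hlist_head *);
void pin_insert(struct fs_pin *, struct vfsmount *);
void pin_kill(struct fs_pin *);         /* run ->kill() and wait for ->done  */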

Before commit 3064c3563b ("death to mnt_pinned"), acct used a hack:
it took a pin on the vfsmount and then dropped its reference, so umount
found no other references, called back the pins attached to the vfsmount,
and finally succeeded.

After that commit, besides pinning the original vfsmount and dropping
the reference to it, acct stores a clone of the vfsmount in
file->f_path.mnt.

The difference between the two methods is the value of file->f_path.mnt:
with the first, it contains the original vfsmount without a reference;
with the second, it contains a cloned vfsmount with a reference held.

I have tested the first method, pinning the vfsmount for every exports
cache entry.  nfsd works, but rpc.mountd's call to name_to_handle_at()
sometimes delays or hangs, and I can't find the reason.

I also tested the second method; the cloned vfsmount causes many
problems, because mnt_clone_internal() changes mnt_root to the path's
dentry.

Both cases were tested with pinning applied to all exports cache entries,
but I think nfsd should drop the vfsmount reference when the exports
upcall fails; a successful exports cache entry should take it as well.

The following is a draft of the first method.

thanks,
Kinglong Mee

----------------------------------------------------------------------


Comments

NeilBrown April 24, 2015, 3 a.m. UTC | #1
On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:

> On 4/23/2015 7:44 AM, NeilBrown wrote:
> > On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> > wrote:
> >>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
> >>> it is necessary to store them in svc_export.
> >>>
> >>> Neil points out another way of 'fs_pin', I will check that.
> >>
> >> Yes, that'd be interesting.  On a very quick look--I don't understand
> >> what it's meant to do at all.  But if it does provide a way to get a
> >> callback on umount, that'd certainly be interesting....
> > 
> > Yeah, on a quick look it isn't really obvious at all.
> > 
> > But I didn't read the code.  I read
> >  https://lwn.net/Articles/636730/
> > 
> > which says:
> > 
> >     In its place is a mechanism by which an object can be added to a vfsmount
> >     structure (which represents a mounted filesystem); that object supports 
> >     only one method: kill(). These "pin" objects hang around until the final
> >     reference to the vfsmount goes away, at which point each one's kill() function
> >     is called. It thus is a clean mechanism for performing cleanup when a filesystem
> >     goes away.
> > 
> > This is used to close "BSD process accounting" files on umount, and sound
> > like the perfect interface for flushing things from the sunrpc cache on
> > umount.
> 
> I have review those codes in fs/fs_pin.c and kernel/acct.c.
> 'fs_pin' is used to fix the race between file->f_path.mnt for acct
> and umount, file->f_path.mnt just holds the reference of vfsmount
> but not keep busy, umount will check the reference and return busy.
> 
> The logical is same as sunrpc cache for exports.
> 
> Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack
> method, acct get a pin to vfsmount and then put the reference,
> so umount finds no other reference and callback those pins in vfsmount,
> at last umount success.
> 
> After that commit, besides pin to original vfsmount and put the reference
> of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt.
> 
> The different between those two methods is the value of file->f_path.mnt,
> the first one, it contains the original vfsmnt without reference,
> the second one, contains a cloned vfsmnt with reference.
> 
> I have test the first method, pins to vfsmount for all exports cache,
> nfsd can work but sometimes will delay or hang at rpc.mountd's calling
> of name_to_handle_at, I can't find the reason.

A kernel stack trace of exactly where it is hanging would help a lot.


> 
> Also test the second method, there are many problem caused by the cloned
> vfsmount, mnt_clone_internal() change mnt_root values to the path dentry.
> 
> Those cases are tested for all exports cache, but, I think nfsd should
> put the reference of vfsmount when finding exports upcall fail.
> The success exports cache should also take it.
> 
> The following is a draft of the first method.

I think the first method sounds a lot better than the second.

> 
> thanks,
> Kinglong Mee
> 
> ----------------------------------------------------------------------
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b540..553e8b1 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin)
>  	wake_up_locked(&pin->wait);
>  	spin_unlock_irq(&pin->wait.lock);
>  }
> +EXPORT_SYMBOL(pin_remove);
>  
>  void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
>  {
> @@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
>  	hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
>  	spin_unlock(&pin_lock);
>  }
> +EXPORT_SYMBOL(pin_insert_group);
>  
>  void pin_insert(struct fs_pin *pin, struct vfsmount *m)
>  {
>  	pin_insert_group(pin, m, &m->mnt_sb->s_pins);
>  }
> +EXPORT_SYMBOL(pin_insert);
>  
>  void pin_kill(struct fs_pin *p)
>  {
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index c3e3b6e..3e3df8c 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -309,7 +309,7 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
>  static void svc_export_put(struct kref *ref)
>  {
>  	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
> -	path_put(&exp->ex_path);
> +	pin_remove(&exp->ex_pin);
>  	auth_domain_put(exp->ex_client);
>  	nfsd4_fslocs_free(&exp->ex_fslocs);
>  	kfree(exp->ex_uuid);
> @@ -695,15 +695,26 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
>  		orig->ex_path.mnt == new->ex_path.mnt;
>  }
>  
> +static void export_pin_kill(struct fs_pin *pin)
> +{
> +	struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
> +
> +	write_lock(&exp->cd->hash_lock);
> +	cache_mark_expired(&exp->h);
> +	pin_remove(pin);
> +	write_unlock(&exp->cd->hash_lock);
> +}

I think you need to add some sort of delay here.  The svc_export entry might
still be in active use by an nfsd thread and you need to wait for that to
complete.

I think you need to wait for the refcnt to drop to '1'.  Maybe create a
global wait_queue to wait for that.

Actually... svc_expkey contains a reference to a 'path' too, so you need to
use pinning to purge those when the filesystem is unmounted.

Oh.. and you don't want to call 'pin_remove' from export_pin_kill().  It is
called from svc_export_put() and calling from both could be problematic.

Hmmm...  Ahhhh.

If you get export_pin_kill to only call cache_mark_expired() (maybe move the
locking into that function) and *not* call pin_remove(), then the pin will
remain on the list and ->done will be -1.
So mnt_pin_kill will loop around and this time pin_kill() will wait for
p->done to be set.
So we really need that thing to be removed from cache promptly.  I don't
think we can wait for the next time cache_clean will be run.  We could
call cache_flush, but I think I'd rather remove just that one entry.

Also.... this requires that the pin (and the struct containing it) be freed
using RCU.

So:
 - add an rcu_head to svc_export and svc_expkey
 - use kfree_rcu to free both those objects
 - write a 'cache_force_expire()' which sets the expiry time, and then
   removes the entry from the cache.
 - Use pin_insert_group rather than mntget for both svc_export and svc_expkey
 - have the 'kill' functions for both call cache_force_expire(), but *not*
   pin_remove (a rough sketch follows below)
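
A rough, untested sketch of what those bullets could look like on the
svc_export side (cache_force_expire() is the suggested new helper, not an
existing function, and actually unhashing the entry is left out here;
svc_expkey would get the same treatment):

static void cache_force_expire(struct cache_detail *cd, struct cache_head *h)
{
        /* expire this one entry immediately; a real version would also
         * unhash it so it cannot be found again before cache_clean runs */
        write_lock(&cd->hash_lock);
        cache_mark_expired(h);
        write_unlock(&cd->hash_lock);
}

static void export_pin_kill(struct fs_pin *pin)
{
        struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);

        /*
         * No pin_remove() here: svc_export_put() calls it when the last
         * nfsd reference is dropped, and pin_kill() then sees ->done
         * change and stops waiting.
         */
        cache_force_expire(exp->cd, &exp->h);
}

The pin (and the svc_export containing it) would then be freed via
kfree_rcu(), so a pin_kill() walker still holding an RCU reference never
touches freed memory.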


> +
>  static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
>  {
>  	struct svc_export *new = container_of(cnew, struct svc_export, h);
>  	struct svc_export *item = container_of(citem, struct svc_export, h);
>  
> +	init_fs_pin(&new->ex_pin, export_pin_kill);
>  	kref_get(&item->ex_client->ref);
>  	new->ex_client = item->ex_client;
>  	new->ex_path = item->ex_path;
> -	path_get(&item->ex_path);
> +	pin_insert_group(&new->ex_pin, new->ex_path.mnt, NULL);

This doesn't look right.  path_get does a 'mntget()' and a 'dget()'.
You are replacing 'mntget()' with 'pin_insert_group()', but I think you still
need the dget().

Similarly you need a dput() up where you removed path_put().
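
Applied to the hunks above, the init/put pair might end up looking like
this (hypothetical sketch; ex_rcu is an assumed new rcu_head field in
svc_export, and the rest of svc_export_init() is unchanged):

static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
{
        struct svc_export *new = container_of(cnew, struct svc_export, h);
        struct svc_export *item = container_of(citem, struct svc_export, h);

        init_fs_pin(&new->ex_pin, export_pin_kill);
        kref_get(&item->ex_client->ref);
        new->ex_client = item->ex_client;
        new->ex_path = item->ex_path;
        dget(item->ex_path.dentry);             /* keep the dentry reference */
        pin_insert_group(&new->ex_pin, new->ex_path.mnt, NULL); /* replaces mntget() */
        /* remaining initialisation as before */
}

static void svc_export_put(struct kref *ref)
{
        struct svc_export *exp = container_of(ref, struct svc_export, h.ref);

        dput(exp->ex_path.dentry);      /* matches the dget() above */
        pin_remove(&exp->ex_pin);       /* replaces the mntput() half of path_put() */
        auth_domain_put(exp->ex_client);
        nfsd4_fslocs_free(&exp->ex_fslocs);
        kfree(exp->ex_uuid);
        kfree_rcu(exp, ex_rcu);         /* RCU-delayed free, as noted above */
}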


Thanks for working on this!

NeilBrown



>  	new->ex_fslocs.locations = NULL;
>  	new->ex_fslocs.locations_count = 0;
>  	new->ex_fslocs.migrated = 0;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index 1f52bfc..718a27e 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -4,6 +4,7 @@
>  #ifndef NFSD_EXPORT_H
>  #define NFSD_EXPORT_H
>  
> +#include <linux/fs_pin.h>
>  #include <linux/sunrpc/cache.h>
>  #include <uapi/linux/nfsd/export.h>
>  
> @@ -49,6 +50,7 @@ struct svc_export {
>  	struct auth_domain *	ex_client;
>  	int			ex_flags;
>  	struct path		ex_path;
> +	struct fs_pin		ex_pin;
>  	kuid_t			ex_anon_uid;
>  	kgid_t			ex_anon_gid;
>  	int			ex_fsid;
> diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
> index 437ddb6..6936684 100644
> --- a/include/linux/sunrpc/cache.h
> +++ b/include/linux/sunrpc/cache.h
> @@ -206,6 +206,11 @@ static inline int cache_is_expired(struct cache_detail *detail, struct cache_hea
>  		(detail->flush_time > h->last_refresh);
>  }
>  
> +static inline void cache_mark_expired(struct cache_head *h)
> +{
> +	h->expiry_time = seconds_since_boot() - 1;
> +}
> +
>  extern int cache_check(struct cache_detail *detail,
>  		       struct cache_head *h, struct cache_req *rqstp);
>  extern void cache_flush(void);
> 
Kinglong Mee April 27, 2015, 12:11 p.m. UTC | #2
On 4/24/2015 11:00 AM, NeilBrown wrote:
> On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
> 
>> On 4/23/2015 7:44 AM, NeilBrown wrote:
>>> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
>>> wrote:
>>>>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
>>>>> it is necessary to store them in svc_export.
>>>>>
>>>>> Neil points out another way of 'fs_pin', I will check that.
>>>>
>>>> Yes, that'd be interesting.  On a very quick look--I don't understand
>>>> what it's meant to do at all.  But if it does provide a way to get a
>>>> callback on umount, that'd certainly be interesting....
>>>
>>> Yeah, on a quick look it isn't really obvious at all.
>>>
>>> But I didn't read the code.  I read
>>>  https://lwn.net/Articles/636730/
>>>
>>> which says:
>>>
>>>     In its place is a mechanism by which an object can be added to a vfsmount
>>>     structure (which represents a mounted filesystem); that object supports 
>>>     only one method: kill(). These "pin" objects hang around until the final
>>>     reference to the vfsmount goes away, at which point each one's kill() function
>>>     is called. It thus is a clean mechanism for performing cleanup when a filesystem
>>>     goes away.
>>>
>>> This is used to close "BSD process accounting" files on umount, and sound
>>> like the perfect interface for flushing things from the sunrpc cache on
>>> umount.
>>
>> I have review those codes in fs/fs_pin.c and kernel/acct.c.
>> 'fs_pin' is used to fix the race between file->f_path.mnt for acct
>> and umount, file->f_path.mnt just holds the reference of vfsmount
>> but not keep busy, umount will check the reference and return busy.
>>
>> The logical is same as sunrpc cache for exports.
>>
>> Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack
>> method, acct get a pin to vfsmount and then put the reference,
>> so umount finds no other reference and callback those pins in vfsmount,
>> at last umount success.
>>
>> After that commit, besides pin to original vfsmount and put the reference
>> of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt.
>>
>> The different between those two methods is the value of file->f_path.mnt,
>> the first one, it contains the original vfsmnt without reference,
>> the second one, contains a cloned vfsmnt with reference.
>>
>> I have test the first method, pins to vfsmount for all exports cache,
>> nfsd can work but sometimes will delay or hang at rpc.mountd's calling
>> of name_to_handle_at, I can't find the reason.
> 
> A kernel stack trace of exactly where it is hanging would help a lot.

With fs_pin added to the exports, it seems there is a mutex deadlock
between nfsd and rpc.mountd over locking the parent inode.

Apr 27 19:57:03 ntest kernel: rpc.mountd      D ffff88006ac5fc28     0  1152      1 0x00000000
Apr 27 19:57:03 ntest kernel: ffff88006ac5fc28 ffff880068c38970 ffff880068c3de60 ffff88006ac5fc08
Apr 27 19:57:03 ntest kernel: ffff88006ac60000 ffff880035ecf694 ffff880068c3de60 00000000ffffffff
Apr 27 19:57:03 ntest kernel: ffff880035ecf698 ffff88006ac5fc48 ffffffff8177ffd7 0000000000000000
Apr 27 19:57:03 ntest kernel: Call Trace:
Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
Apr 27 19:57:03 ntest kernel: [<ffffffff8178031e>] schedule_preempt_disabled+0xe/0x10
Apr 27 19:57:03 ntest kernel: [<ffffffff81781e92>] __mutex_lock_slowpath+0xb2/0x120
Apr 27 19:57:03 ntest kernel: [<ffffffff81781f23>] mutex_lock+0x23/0x40
Apr 27 19:57:03 ntest kernel: [<ffffffff81230714>] lookup_slow+0x34/0xc0
Apr 27 19:57:03 ntest kernel: [<ffffffff812358ee>] path_lookupat+0x89e/0xc60
Apr 27 19:57:03 ntest kernel: [<ffffffff81205192>] ? kmem_cache_alloc+0x1e2/0x260
Apr 27 19:57:03 ntest kernel: [<ffffffff812367a6>] ? getname_flags+0x56/0x200
Apr 27 19:57:03 ntest kernel: [<ffffffff81235cd7>] filename_lookup+0x27/0xc0
Apr 27 19:57:03 ntest kernel: [<ffffffff81237b53>] user_path_at_empty+0x63/0xd0
Apr 27 19:57:03 ntest kernel: [<ffffffff8121d722>] ? put_object+0x32/0x60
Apr 27 19:57:03 ntest kernel: [<ffffffff8121d94d>] ? delete_object_full+0x2d/0x40
Apr 27 19:57:03 ntest kernel: [<ffffffff8123e273>] ? dput+0x33/0x230
Apr 27 19:57:03 ntest kernel: [<ffffffff81237bd1>] user_path_at+0x11/0x20
Apr 27 19:57:03 ntest kernel: [<ffffffff81286489>] SyS_name_to_handle_at+0x59/0x200
Apr 27 19:57:03 ntest kernel: [<ffffffff81783f6e>] system_call_fastpath+0x12/0x71
Apr 27 19:57:03 ntest kernel: nfsd            S ffff88006e92b708     0  1170      2 0x00000000
Apr 27 19:57:03 ntest kernel: ffff88006e92b708 ffffffff81c12480 ffff88006acbcb80 ffff88006e92b758
Apr 27 19:57:03 ntest kernel: ffff88006e92c000 ffffffff82290040 ffff88006e92b758 ffffffff82290040
Apr 27 19:57:03 ntest kernel: 00000000fffffff5 ffff88006e92b728 ffffffff8177ffd7 0000000100043a09
Apr 27 19:57:03 ntest kernel: Call Trace:
Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
Apr 27 19:57:03 ntest kernel: [<ffffffff81782e97>] schedule_timeout+0x117/0x230
Apr 27 19:57:03 ntest kernel: [<ffffffff8110b240>] ? internal_add_timer+0xb0/0xb0
Apr 27 19:57:03 ntest kernel: [<ffffffff81780ff3>] wait_for_completion_interruptible_timeout+0xf3/0x150
Apr 27 19:57:03 ntest kernel: [<ffffffff810cb090>] ? wake_up_state+0x20/0x20
Apr 27 19:57:03 ntest kernel: [<ffffffffa01350cc>] cache_wait_req.isra.10+0x9c/0x110 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0134760>] ? sunrpc_init_cache_detail+0xc0/0xc0 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0135c54>] cache_check+0x1d4/0x380 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffff8133248c>] ? inode_doinit_with_dentry+0x48c/0x6a0
Apr 27 19:57:03 ntest kernel: [<ffffffffa045e132>] exp_get_by_name+0x82/0xb0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffff81776d0a>] ? kmemleak_free+0x3a/0xa0
Apr 27 19:57:03 ntest kernel: [<ffffffff81332210>] ? inode_doinit_with_dentry+0x210/0x6a0
Apr 27 19:57:03 ntest kernel: [<ffffffff813326bc>] ? selinux_d_instantiate+0x1c/0x20
Apr 27 19:57:03 ntest kernel: [<ffffffff8123def7>] ? _d_rehash+0x37/0x40
Apr 27 19:57:03 ntest kernel: [<ffffffff8123f776>] ? d_splice_alias+0xa6/0x2d0
Apr 27 19:57:03 ntest kernel: [<ffffffff812be90b>] ? ext4_lookup+0xdb/0x160
Apr 27 19:57:03 ntest kernel: [<ffffffffa0460114>] rqst_exp_get_by_name+0x64/0x140 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa045a8f6>] nfsd_cross_mnt+0x76/0x1b0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0475300>] nfsd4_encode_dirent+0x160/0x3d0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa04751a0>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa045c631>] nfsd_readdir+0x1c1/0x2a0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa045a1e0>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa046bd70>] nfsd4_encode_readdir+0x120/0x220 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa047573d>] nfsd4_encode_operation+0x7d/0x190 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa046a51d>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0455f83>] nfsd_dispatch+0xc3/0x220 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa012a01b>] svc_process_common+0x43b/0x690 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffffa012b133>] svc_process+0x103/0x1b0 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffffa045596f>] nfsd+0xff/0x170 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0455870>] ? nfsd_destroy+0x80/0x80 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffff810bd378>] kthread+0xd8/0xf0
Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
Apr 27 19:57:03 ntest kernel: [<ffffffff81784362>] ret_from_fork+0x42/0x70
Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180

>>
>> Also test the second method, there are many problem caused by the cloned
>> vfsmount, mnt_clone_internal() change mnt_root values to the path dentry.
>>
>> Those cases are tested for all exports cache, but, I think nfsd should
>> put the reference of vfsmount when finding exports upcall fail.
>> The success exports cache should also take it.
>>
>> The following is a draft of the first method.
> 
> I think the first method sounds a lot better than the second.
> 
>>
>> thanks,
>> Kinglong Mee
>>
>> ----------------------------------------------------------------------
>> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
>> index 611b540..553e8b1 100644
>> --- a/fs/fs_pin.c
>> +++ b/fs/fs_pin.c
>> @@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin)
>>  	wake_up_locked(&pin->wait);
>>  	spin_unlock_irq(&pin->wait.lock);
>>  }
>> +EXPORT_SYMBOL(pin_remove);
>>  
>>  void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
>>  {
>> @@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
>>  	hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
>>  	spin_unlock(&pin_lock);
>>  }
>> +EXPORT_SYMBOL(pin_insert_group);
>>  
>>  void pin_insert(struct fs_pin *pin, struct vfsmount *m)
>>  {
>>  	pin_insert_group(pin, m, &m->mnt_sb->s_pins);
>>  }
>> +EXPORT_SYMBOL(pin_insert);
>>  
>>  void pin_kill(struct fs_pin *p)
>>  {
>> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
>> index c3e3b6e..3e3df8c 100644
>> --- a/fs/nfsd/export.c
>> +++ b/fs/nfsd/export.c
>> @@ -309,7 +309,7 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
>>  static void svc_export_put(struct kref *ref)
>>  {
>>  	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
>> -	path_put(&exp->ex_path);
>> +	pin_remove(&exp->ex_pin);
>>  	auth_domain_put(exp->ex_client);
>>  	nfsd4_fslocs_free(&exp->ex_fslocs);
>>  	kfree(exp->ex_uuid);
>> @@ -695,15 +695,26 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
>>  		orig->ex_path.mnt == new->ex_path.mnt;
>>  }
>>  
>> +static void export_pin_kill(struct fs_pin *pin)
>> +{
>> +	struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
>> +
>> +	write_lock(&exp->cd->hash_lock);
>> +	cache_mark_expired(&exp->h);
>> +	pin_remove(pin);
>> +	write_unlock(&exp->cd->hash_lock);
>> +}
> 
> I think you need to add some sort of delay here.  The svc_export entry might
> still be in active use by an nfsd thread and you need to wait for that to
> complete.
> 
> I think you need to wait for the refcnt to drop to '1'.  Maybe create a
> global wait_queue to wait for that.
> 
> Actually... svc_expkey contains a reference to a 'path' too, so you need to
> use pinning to purge those when the filesystem is unmounted.
> 
> Oh.. and you don't want to call 'pin_remove' from export_pin_kill().  It is
> called from svc_export_put() and calling from both could be problematic.
> 
> Hmmm...  Ahhhh.
> 
> If you get export_pin_kill to only call cache_mark_expired() (maybe move the
> locking into that function) and *not* call pin_remove(), then the pin will
> remain on the list and ->done will be -1.
> So mnt_pin_kill will loop around and this time pin_kill() will wait for
> p->done to be set.
> So we really need that thing to be removed from cache promptly.  I don't
> think we can wait for the next time cache_clean will be run.  We could
> call cache_flush, but I think I'd rather remove just that one entry.
> 
> Also.... this requires that the pin (and the struct containing it) be freed
> using RCU.
> 
> So:
>  - add an rcu_head to svc_export and svc_expkey
>  - use rcu_kfree to free both those objects
>  - write a 'cache_force_expire()' which sets the expiry time, and then
>    removes the entry from the cache.
>  - Use pin_insert_group rather then mntget for both svc_export and svc_expkey
>  - have the 'kill' functions for both call cache_force_expire(), but *not*
>    pin_remove

Thanks very much for your detailed comments.
I will send a patch after fixing the above race and running basic tests.

thanks,
Kinglong Mee

> 
>> +
>>  static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
>>  {
>>  	struct svc_export *new = container_of(cnew, struct svc_export, h);
>>  	struct svc_export *item = container_of(citem, struct svc_export, h);
>>  
>> +	init_fs_pin(&new->ex_pin, export_pin_kill);
>>  	kref_get(&item->ex_client->ref);
>>  	new->ex_client = item->ex_client;
>>  	new->ex_path = item->ex_path;
>> -	path_get(&item->ex_path);
>> +	pin_insert_group(&new->ex_pin, new->ex_path.mnt, NULL);
> 
> This doesn't look right.  path_get does a 'mntget()' and a 'dget()'.
> You are replacing 'mntget()' with 'pin_insert_group()', but I think you still
> need the dget().
> 
> Similarly you need a dput() up where you removed path_put().
> 
> 
> Thanks for working on this!
> 
> NeilBrown
> 
> 
> 
>>  	new->ex_fslocs.locations = NULL;
>>  	new->ex_fslocs.locations_count = 0;
>>  	new->ex_fslocs.migrated = 0;
>> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
>> index 1f52bfc..718a27e 100644
>> --- a/fs/nfsd/export.h
>> +++ b/fs/nfsd/export.h
>> @@ -4,6 +4,7 @@
>>  #ifndef NFSD_EXPORT_H
>>  #define NFSD_EXPORT_H
>>  
>> +#include <linux/fs_pin.h>
>>  #include <linux/sunrpc/cache.h>
>>  #include <uapi/linux/nfsd/export.h>
>>  
>> @@ -49,6 +50,7 @@ struct svc_export {
>>  	struct auth_domain *	ex_client;
>>  	int			ex_flags;
>>  	struct path		ex_path;
>> +	struct fs_pin		ex_pin;
>>  	kuid_t			ex_anon_uid;
>>  	kgid_t			ex_anon_gid;
>>  	int			ex_fsid;
>> diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
>> index 437ddb6..6936684 100644
>> --- a/include/linux/sunrpc/cache.h
>> +++ b/include/linux/sunrpc/cache.h
>> @@ -206,6 +206,11 @@ static inline int cache_is_expired(struct cache_detail *detail, struct cache_hea
>>  		(detail->flush_time > h->last_refresh);
>>  }
>>  
>> +static inline void cache_mark_expired(struct cache_head *h)
>> +{
>> +	h->expiry_time = seconds_since_boot() - 1;
>> +}
>> +
>>  extern int cache_check(struct cache_detail *detail,
>>  		       struct cache_head *h, struct cache_req *rqstp);
>>  extern void cache_flush(void);
>>
NeilBrown April 29, 2015, 2:57 a.m. UTC | #3
On Mon, 27 Apr 2015 20:11:48 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:

> On 4/24/2015 11:00 AM, NeilBrown wrote:
> > On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
> > 
> >> On 4/23/2015 7:44 AM, NeilBrown wrote:
> >>> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> >>> wrote:
> >>>>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
> >>>>> it is necessary to store them in svc_export.
> >>>>>
> >>>>> Neil points out another way of 'fs_pin', I will check that.
> >>>>
> >>>> Yes, that'd be interesting.  On a very quick look--I don't understand
> >>>> what it's meant to do at all.  But if it does provide a way to get a
> >>>> callback on umount, that'd certainly be interesting....
> >>>
> >>> Yeah, on a quick look it isn't really obvious at all.
> >>>
> >>> But I didn't read the code.  I read
> >>>  https://lwn.net/Articles/636730/
> >>>
> >>> which says:
> >>>
> >>>     In its place is a mechanism by which an object can be added to a vfsmount
> >>>     structure (which represents a mounted filesystem); that object supports 
> >>>     only one method: kill(). These "pin" objects hang around until the final
> >>>     reference to the vfsmount goes away, at which point each one's kill() function
> >>>     is called. It thus is a clean mechanism for performing cleanup when a filesystem
> >>>     goes away.
> >>>
> >>> This is used to close "BSD process accounting" files on umount, and sound
> >>> like the perfect interface for flushing things from the sunrpc cache on
> >>> umount.
> >>
> >> I have review those codes in fs/fs_pin.c and kernel/acct.c.
> >> 'fs_pin' is used to fix the race between file->f_path.mnt for acct
> >> and umount, file->f_path.mnt just holds the reference of vfsmount
> >> but not keep busy, umount will check the reference and return busy.
> >>
> >> The logical is same as sunrpc cache for exports.
> >>
> >> Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack
> >> method, acct get a pin to vfsmount and then put the reference,
> >> so umount finds no other reference and callback those pins in vfsmount,
> >> at last umount success.
> >>
> >> After that commit, besides pin to original vfsmount and put the reference
> >> of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt.
> >>
> >> The different between those two methods is the value of file->f_path.mnt,
> >> the first one, it contains the original vfsmnt without reference,
> >> the second one, contains a cloned vfsmnt with reference.
> >>
> >> I have test the first method, pins to vfsmount for all exports cache,
> >> nfsd can work but sometimes will delay or hang at rpc.mountd's calling
> >> of name_to_handle_at, I can't find the reason.
> > 
> > A kernel stack trace of exactly where it is hanging would help a lot.
> 
> Witch adding fs_pin to exports, it seems there is a mutex race between
> nfsd and rpc.mountd for locking parent inode.
> 
> Apr 27 19:57:03 ntest kernel: rpc.mountd      D ffff88006ac5fc28     0  1152      1 0x00000000
> Apr 27 19:57:03 ntest kernel: ffff88006ac5fc28 ffff880068c38970 ffff880068c3de60 ffff88006ac5fc08
> Apr 27 19:57:03 ntest kernel: ffff88006ac60000 ffff880035ecf694 ffff880068c3de60 00000000ffffffff
> Apr 27 19:57:03 ntest kernel: ffff880035ecf698 ffff88006ac5fc48 ffffffff8177ffd7 0000000000000000
> Apr 27 19:57:03 ntest kernel: Call Trace:
> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
> Apr 27 19:57:03 ntest kernel: [<ffffffff8178031e>] schedule_preempt_disabled+0xe/0x10
> Apr 27 19:57:03 ntest kernel: [<ffffffff81781e92>] __mutex_lock_slowpath+0xb2/0x120
> Apr 27 19:57:03 ntest kernel: [<ffffffff81781f23>] mutex_lock+0x23/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff81230714>] lookup_slow+0x34/0xc0
> Apr 27 19:57:03 ntest kernel: [<ffffffff812358ee>] path_lookupat+0x89e/0xc60
> Apr 27 19:57:03 ntest kernel: [<ffffffff81205192>] ? kmem_cache_alloc+0x1e2/0x260
> Apr 27 19:57:03 ntest kernel: [<ffffffff812367a6>] ? getname_flags+0x56/0x200
> Apr 27 19:57:03 ntest kernel: [<ffffffff81235cd7>] filename_lookup+0x27/0xc0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81237b53>] user_path_at_empty+0x63/0xd0
> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d722>] ? put_object+0x32/0x60
> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d94d>] ? delete_object_full+0x2d/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123e273>] ? dput+0x33/0x230
> Apr 27 19:57:03 ntest kernel: [<ffffffff81237bd1>] user_path_at+0x11/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffff81286489>] SyS_name_to_handle_at+0x59/0x200
> Apr 27 19:57:03 ntest kernel: [<ffffffff81783f6e>] system_call_fastpath+0x12/0x71
> Apr 27 19:57:03 ntest kernel: nfsd            S ffff88006e92b708     0  1170      2 0x00000000
> Apr 27 19:57:03 ntest kernel: ffff88006e92b708 ffffffff81c12480 ffff88006acbcb80 ffff88006e92b758
> Apr 27 19:57:03 ntest kernel: ffff88006e92c000 ffffffff82290040 ffff88006e92b758 ffffffff82290040
> Apr 27 19:57:03 ntest kernel: 00000000fffffff5 ffff88006e92b728 ffffffff8177ffd7 0000000100043a09
> Apr 27 19:57:03 ntest kernel: Call Trace:
> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
> Apr 27 19:57:03 ntest kernel: [<ffffffff81782e97>] schedule_timeout+0x117/0x230
> Apr 27 19:57:03 ntest kernel: [<ffffffff8110b240>] ? internal_add_timer+0xb0/0xb0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81780ff3>] wait_for_completion_interruptible_timeout+0xf3/0x150
> Apr 27 19:57:03 ntest kernel: [<ffffffff810cb090>] ? wake_up_state+0x20/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffffa01350cc>] cache_wait_req.isra.10+0x9c/0x110 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0134760>] ? sunrpc_init_cache_detail+0xc0/0xc0 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0135c54>] cache_check+0x1d4/0x380 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffff8133248c>] ? inode_doinit_with_dentry+0x48c/0x6a0
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045e132>] exp_get_by_name+0x82/0xb0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffff81776d0a>] ? kmemleak_free+0x3a/0xa0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81332210>] ? inode_doinit_with_dentry+0x210/0x6a0
> Apr 27 19:57:03 ntest kernel: [<ffffffff813326bc>] ? selinux_d_instantiate+0x1c/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123def7>] ? _d_rehash+0x37/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123f776>] ? d_splice_alias+0xa6/0x2d0
> Apr 27 19:57:03 ntest kernel: [<ffffffff812be90b>] ? ext4_lookup+0xdb/0x160
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0460114>] rqst_exp_get_by_name+0x64/0x140 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a8f6>] nfsd_cross_mnt+0x76/0x1b0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0475300>] nfsd4_encode_dirent+0x160/0x3d0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa04751a0>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045c631>] nfsd_readdir+0x1c1/0x2a0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a1e0>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa046bd70>] nfsd4_encode_readdir+0x120/0x220 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa047573d>] nfsd4_encode_operation+0x7d/0x190 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa046a51d>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455f83>] nfsd_dispatch+0xc3/0x220 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa012a01b>] svc_process_common+0x43b/0x690 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa012b133>] svc_process+0x103/0x1b0 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045596f>] nfsd+0xff/0x170 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455870>] ? nfsd_destroy+0x80/0x80 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd378>] kthread+0xd8/0xf0
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
> Apr 27 19:57:03 ntest kernel: [<ffffffff81784362>] ret_from_fork+0x42/0x70
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
> 

Hmm... That looks bad.  I wonder why we haven't hit the deadlock before.

nfsd gets a LOOKUP request for $DIR/name which is a mountpoint, or a READDIR
request for $DIR which contains 'name' which is a mountpoint.

nfsd_lookup_dentry() or nfsd_buffered_readdir() locks i_mutex on $DIR, then does
the mountpoint check.  If the mounted filesystem is not exported, but the $DIR
filesystem has 'crossmnt', an upcall to mountd asks for export information.
When mountd tries to look up the name, it works fine if lookup_fast finds
the child, but it deadlocks on the parent's i_mutex if it has to go to
lookup_slow.

It should only go to lookup_slow() if:
  - the child isn't in cache .. but as it is mounted on, it must be
  - d_revalidate returns 0

This probably means that $DIR is on an NFS filesystem and it failed the
revalidation for some reason....

Kinglong: can you reproduce this easily?  If so can you enable nfs debugging 
and collect the output?
I want to confirm that the dfprintk in nfs_lookup_revalidate is saying that
the name is invalid.

In any case, holding a mutex while waiting for an upcall is probably a bad
idea.  We should try to find a way to work around that...
Any ideas Bruce ?-)

NeilBrown
Kinglong Mee April 29, 2015, 8:45 a.m. UTC | #4
On 4/29/2015 10:57 AM, NeilBrown wrote:
> On Mon, 27 Apr 2015 20:11:48 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
> 
>> On 4/24/2015 11:00 AM, NeilBrown wrote:
>>> On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
>>>
>>>> On 4/23/2015 7:44 AM, NeilBrown wrote:
>>>>> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
>>>>> wrote:
>>>>>>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
>>>>>>> it is necessary to store them in svc_export.
>>>>>>>
>>>>>>> Neil points out another way of 'fs_pin', I will check that.
>>>>>>
>>>>>> Yes, that'd be interesting.  On a very quick look--I don't understand
>>>>>> what it's meant to do at all.  But if it does provide a way to get a
>>>>>> callback on umount, that'd certainly be interesting....
>>>>>
>>>>> Yeah, on a quick look it isn't really obvious at all.
>>>>>
>>>>> But I didn't read the code.  I read
>>>>>  https://lwn.net/Articles/636730/
>>>>>
>>>>> which says:
>>>>>
>>>>>     In its place is a mechanism by which an object can be added to a vfsmount
>>>>>     structure (which represents a mounted filesystem); that object supports 
>>>>>     only one method: kill(). These "pin" objects hang around until the final
>>>>>     reference to the vfsmount goes away, at which point each one's kill() function
>>>>>     is called. It thus is a clean mechanism for performing cleanup when a filesystem
>>>>>     goes away.
>>>>>
>>>>> This is used to close "BSD process accounting" files on umount, and sound
>>>>> like the perfect interface for flushing things from the sunrpc cache on
>>>>> umount.
>>>>
>>>> I have review those codes in fs/fs_pin.c and kernel/acct.c.
>>>> 'fs_pin' is used to fix the race between file->f_path.mnt for acct
>>>> and umount, file->f_path.mnt just holds the reference of vfsmount
>>>> but not keep busy, umount will check the reference and return busy.
>>>>
>>>> The logical is same as sunrpc cache for exports.
>>>>
>>>> Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack
>>>> method, acct get a pin to vfsmount and then put the reference,
>>>> so umount finds no other reference and callback those pins in vfsmount,
>>>> at last umount success.
>>>>
>>>> After that commit, besides pin to original vfsmount and put the reference
>>>> of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt.
>>>>
>>>> The different between those two methods is the value of file->f_path.mnt,
>>>> the first one, it contains the original vfsmnt without reference,
>>>> the second one, contains a cloned vfsmnt with reference.
>>>>
>>>> I have test the first method, pins to vfsmount for all exports cache,
>>>> nfsd can work but sometimes will delay or hang at rpc.mountd's calling
>>>> of name_to_handle_at, I can't find the reason.
>>>
>>> A kernel stack trace of exactly where it is hanging would help a lot.
>>
>> Witch adding fs_pin to exports, it seems there is a mutex race between
>> nfsd and rpc.mountd for locking parent inode.
>>
>> Apr 27 19:57:03 ntest kernel: rpc.mountd      D ffff88006ac5fc28     0  1152      1 0x00000000
>> Apr 27 19:57:03 ntest kernel: ffff88006ac5fc28 ffff880068c38970 ffff880068c3de60 ffff88006ac5fc08
>> Apr 27 19:57:03 ntest kernel: ffff88006ac60000 ffff880035ecf694 ffff880068c3de60 00000000ffffffff
>> Apr 27 19:57:03 ntest kernel: ffff880035ecf698 ffff88006ac5fc48 ffffffff8177ffd7 0000000000000000
>> Apr 27 19:57:03 ntest kernel: Call Trace:
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8178031e>] schedule_preempt_disabled+0xe/0x10
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81781e92>] __mutex_lock_slowpath+0xb2/0x120
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81781f23>] mutex_lock+0x23/0x40
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81230714>] lookup_slow+0x34/0xc0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff812358ee>] path_lookupat+0x89e/0xc60
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81205192>] ? kmem_cache_alloc+0x1e2/0x260
>> Apr 27 19:57:03 ntest kernel: [<ffffffff812367a6>] ? getname_flags+0x56/0x200
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81235cd7>] filename_lookup+0x27/0xc0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81237b53>] user_path_at_empty+0x63/0xd0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d722>] ? put_object+0x32/0x60
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d94d>] ? delete_object_full+0x2d/0x40
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8123e273>] ? dput+0x33/0x230
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81237bd1>] user_path_at+0x11/0x20
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81286489>] SyS_name_to_handle_at+0x59/0x200
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81783f6e>] system_call_fastpath+0x12/0x71
>> Apr 27 19:57:03 ntest kernel: nfsd            S ffff88006e92b708     0  1170      2 0x00000000
>> Apr 27 19:57:03 ntest kernel: ffff88006e92b708 ffffffff81c12480 ffff88006acbcb80 ffff88006e92b758
>> Apr 27 19:57:03 ntest kernel: ffff88006e92c000 ffffffff82290040 ffff88006e92b758 ffffffff82290040
>> Apr 27 19:57:03 ntest kernel: 00000000fffffff5 ffff88006e92b728 ffffffff8177ffd7 0000000100043a09
>> Apr 27 19:57:03 ntest kernel: Call Trace:
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81782e97>] schedule_timeout+0x117/0x230
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8110b240>] ? internal_add_timer+0xb0/0xb0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81780ff3>] wait_for_completion_interruptible_timeout+0xf3/0x150
>> Apr 27 19:57:03 ntest kernel: [<ffffffff810cb090>] ? wake_up_state+0x20/0x20
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa01350cc>] cache_wait_req.isra.10+0x9c/0x110 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0134760>] ? sunrpc_init_cache_detail+0xc0/0xc0 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0135c54>] cache_check+0x1d4/0x380 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8133248c>] ? inode_doinit_with_dentry+0x48c/0x6a0
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045e132>] exp_get_by_name+0x82/0xb0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81776d0a>] ? kmemleak_free+0x3a/0xa0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81332210>] ? inode_doinit_with_dentry+0x210/0x6a0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff813326bc>] ? selinux_d_instantiate+0x1c/0x20
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8123def7>] ? _d_rehash+0x37/0x40
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8123f776>] ? d_splice_alias+0xa6/0x2d0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff812be90b>] ? ext4_lookup+0xdb/0x160
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0460114>] rqst_exp_get_by_name+0x64/0x140 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a8f6>] nfsd_cross_mnt+0x76/0x1b0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0475300>] nfsd4_encode_dirent+0x160/0x3d0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa04751a0>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045c631>] nfsd_readdir+0x1c1/0x2a0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a1e0>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa046bd70>] nfsd4_encode_readdir+0x120/0x220 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa047573d>] nfsd4_encode_operation+0x7d/0x190 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa046a51d>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455f83>] nfsd_dispatch+0xc3/0x220 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa012a01b>] svc_process_common+0x43b/0x690 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa012b133>] svc_process+0x103/0x1b0 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045596f>] nfsd+0xff/0x170 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455870>] ? nfsd_destroy+0x80/0x80 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd378>] kthread+0xd8/0xf0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81784362>] ret_from_fork+0x42/0x70
>> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
>>
> 
> Hmm... That looks bad.  I wonder why we haven't hit the deadlock before.

It appeared after nfs-utils commit 6091c0a4c4
("mountd: add support for case-insensitive file names"),
which makes mountd use name_to_handle_at() and so lock the parent.
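
For context, the userspace side is just a path-resolving syscall; a
minimal stand-in for what mountd does here (not mountd's actual code)
would be:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define HANDLE_SZ 128   /* big enough for the handles we expect */

int main(int argc, char **argv)
{
        struct file_handle *fh;
        int mount_id;

        if (argc < 2)
                return 1;

        fh = malloc(sizeof(*fh) + HANDLE_SZ);
        if (!fh)
                return 1;
        fh->handle_bytes = HANDLE_SZ;

        /*
         * The path walk happens inside this syscall.  If the final
         * component is not in the dcache (or fails ->d_revalidate), the
         * kernel falls back to lookup_slow(), which takes the parent
         * directory's i_mutex, the same mutex nfsd is holding in the
         * trace above while it waits for mountd's answer.
         */
        if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0)
                perror("name_to_handle_at");

        free(fh);
        return 0;
}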

> 
> nfsd gets a LOOKUP request for $DIR/name which is a mountpoint, or a READDIR
> request for $DIR which contains 'name' which is a mountpoint.
> 
> nfsd_lookup_dentry of nfsd_buffered_readdir lock i_mutex on $DIR, then do
> the mountpoint check.  If the mounted filesystem is not exported, but the $DIR
> filesystem has 'crossmnt', an upcall to mountd asks for export information.
> When mountd try to do a lookup on the name, it works fine if lookup_fast finds
> the child, but it deadlocks on the parent's i_mutex if it has to go to
> lookup_slow.

I have sent a patch for it:
"[PATCH] NFSD: Avoid race of locking parent's mutex at cross mount"

> 
> It should only go to lookup_slow() if:
>   - the child isn't in cache .. but as it is mounted on, it must be

Yes, it is.
Once those files have been operated on and are cached,
the race no longer appears.

>   - d_revalidate returns 0
> 
> This probably means that $DIR is on an NFS filesystem and it failed the
> revalidation for some reason....

It is caused by a readdir of the pseudo root, which contains many unexported files.

> 
> Kinglong: can you reproduce this easily?  If so can you enable nfs debugging 
> and collect the output?
> I want to confirm that the dfprintk in nfs_lookup_revalidate is saying that
> the name is invalid.

Yes, I will send the reproduction steps in another thread and Cc you.

thanks,
Kinglong Mee

J. Bruce Fields April 29, 2015, 7:19 p.m. UTC | #5
On Wed, Apr 29, 2015 at 12:57:28PM +1000, NeilBrown wrote:
> In any case, holding a mutex while waiting for an upcall is probably a bad
> idea.  We should try to find a way to work around that...

Yeah, it sounds fragile even if we could fix this particular issue in
mountd.

> Any ideas Bruce ?-)

Looking at nfsd_buffered_readdir():

	/*
	 * Various filldir functions may end up calling back into
	 * lookup_one_len() and the file system's ->lookup() method.
	 * These expect i_mutex to be held, as it would within readdir.
	 */
        host_err = mutex_lock_killable(&dir_inode->i_mutex);

and as far as I can tell Kinglong's approach (adding an unlock and lock
pair around the nfsd_cross_mnt() calls) might actually be OK.

Though in general I expect

	lock()
	...code...
	unlock()

to mark a critical section and would be unpleasantly surprised to
discover that a function somewhere in the middle had a buried
unlock/lock pair.

Maybe drop the locking from nfsd_buffered_readdir and *just* take the
i_mutex around lookup_one_len(), if that's the only place we need it?
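
Something like this untested sketch (nfsd_lookup_one() is just a made-up
name for the helper):

static struct dentry *nfsd_lookup_one(struct dentry *dir, const char *name,
                                      int len)
{
        struct dentry *dentry;

        /* take i_mutex only around the call that actually requires it */
        mutex_lock(&dir->d_inode->i_mutex);
        dentry = lookup_one_len(name, dir, len);
        mutex_unlock(&dir->d_inode->i_mutex);

        return dentry;
}

The filldir callbacks that currently rely on the caller holding i_mutex
would call this instead, so the nfsd_cross_mnt() upcall would no longer
run under the directory's mutex.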

--b.
NeilBrown April 29, 2015, 9:52 p.m. UTC | #6
On Wed, 29 Apr 2015 15:19:34 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Wed, Apr 29, 2015 at 12:57:28PM +1000, NeilBrown wrote:
> > In any case, holding a mutex while waiting for an upcall is probably a bad
> > idea.  We should try to find a way to work around that...
> 
> Yeah, it sounds fragile even if we could fix this particular issue in
> mountd.
> 
> > Any ideas Bruce ?-)
> 
> Looking at nfsd_buffered_readdir():
> 
> 	/*
> 	 * Various filldir functions may end up calling back into
> 	 * lookup_one_len() and the file system's ->lookup() method.
> 	 * These expect i_mutex to be held, as it would within readdir.
> 	 */
>         host_err = mutex_lock_killable(&dir_inode->i_mutex);
> 
> and as far as I can tell Kinglong's approach (adding an unlock and lock
> around nfsd_cross_mnt() calls might actually be OK.

Yes, I think now that it would work.  It would look odd though, as you
suggest.
There would need to be very clear comments explaining why the lock is being
managed that way.

> 
> Though in general I expect
> 
> 	lock()
> 	...code...
> 	unlock()
> 
> to mark a critical section and would be unpleasantly surprised to
> discover that a function somewhere in the middle had a buried
> unlock/lock pair.
> 
> Maybe drop the locking from nfsd_buffered_readdir and *just* take the
> i_mutex around lookup_one_len(), if that's the only place we need it?

I think it is the only place needed.  Certainly normal path lookup only takes
i_mutex very briefly around __lookup_hash().

One cost would be that we would take the mutex for every name instead of once
for a whole set of names.  That is generally frowned upon for performance
reasons.

An alternative might be to stop using lookup_one_len, and use kern_path()
instead.  Or kern_path_mountpoint if we don't want to cross a mountpoint.

They would only take the i_mutex if absolutely necessary.

The drawback is that they don't take a 'len' argument, so we would need to
make sure the name  is nul terminated (and maybe double-check that there is
no '/'??).  It would be fairly easy to nul-terminate names from readdir -
just use a bit more space  in the buffer in nfsd_buffered_filldir().

I'm less sure about the locking needed in nfsd_lookup_dentry().
The comments suggest that there is good reason to keep the lock for an
extended period.  But that reasoning might only apply to files, and
nfsd_mountpoint should  only be true for directories... I hope.

A different approach would be to pass NULL for the rqstp to nfsd_cross_mnt().
This will ensure it doesn't block, but it might fail incorrectly.
If it does fail, we drop the lock, retry with the real rqstp, retake the lock
and .... no, I think that gets a bit too messy.

I think I'm in favour of:
 - not taking the lock in readdir
 - maybe not taking it in lookup
 - using kern_path_mountpoint or kern_path
 - nul terminating the names, copying the nfsd_lookup name to
   __getname() if necessary.

Sound reasonable?

NeilBrown
J. Bruce Fields April 30, 2015, 9:36 p.m. UTC | #7
On Thu, Apr 30, 2015 at 07:52:25AM +1000, NeilBrown wrote:
> On Wed, 29 Apr 2015 15:19:34 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> wrote:
> > Maybe drop the locking from nfsd_buffered_readdir and *just* take the
> > i_mutex around lookup_one_len(), if that's the only place we need it?
> 
> I think it is the only place needed.  Certainly normal path lookup
> only takes i_mutex very briefly around __lookup_hash().
> 
> One cost would be that we would take the mutex for every name instead
> of once for a whole set of names.  That is generally frowned upon for
> performance reasons.
> 
> An alternative might be to stop using lookup_one_len, and use
> kern_path() instead.  Or kern_path_mountpoint if we don't want to
> cross a mountpoint.
> 
> They would only take the i_mutex if absolutely necessary.
> 
> The draw back is that they don't take a 'len' argument, so we would
> need to make sure the name  is nul terminated (and maybe double-check
> that there is no '/'??).  It would be fairly easy to nul-terminate
> names from readdir - just use a bit more space  in the buffer in
> nfsd_buffered_filldir().

They're also a lot more complicated than lookup_one_len.  Is any of this
extra stuff they do (audit?) going to bite us?

For me understanding that would be harder than writing a variant of
lookup_one_len that took the i_mutex when required.  But I guess that's
my problem, I should try to understand the lookup code better....
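
For what it's worth, a very rough sketch of such a variant (dcache first,
lock only for the slow path) might look like the following.  It ignores
->d_hash/->d_compare and ->d_revalidate handling, so it is an
illustration of the idea only, not something to merge:

static struct dentry *lookup_one_len_unlocked_sketch(const char *name,
                                                     struct dentry *base,
                                                     int len)
{
        struct qstr this = QSTR_INIT(name, len);
        struct dentry *dentry;

        this.hash = full_name_hash(this.name, this.len);

        /* fast path: dcache hit, no i_mutex needed */
        dentry = d_lookup(base, &this);
        if (dentry)
                return dentry;

        /* slow path: fall back to the locked lookup */
        mutex_lock(&base->d_inode->i_mutex);
        dentry = lookup_one_len(name, base, len);
        mutex_unlock(&base->d_inode->i_mutex);
        return dentry;
}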

> I'm less sure about the locking needed in nfsd_lookup_dentry().  The
> comments suggests that there is good reason to keep the lock for an
> extended period.  But that reasoning might only apply to files, and
> nfsd_mountpoint should  only be true for directories... I hope.

I thought d_mountpoint() could be true for files, but certainly we won't
be doing nfs4 opens on directories.

> A different approach would be to pass NULL for the rqstp to
> nfsd_cross_mnt().  This will ensure it doesn't block, but it might
> fail incorrectly.  If it does fail, we drop the lock, retry with the
> real rqstp, retake the lock and .... no, I think that gets a bit too
> messy.

Yes.

> I think I'm in favour of:
>  - not taking the lock in readdir
>  - maybe not taking it in lookup
>  - using kern_path_mountpoint or kern_path
>  - nul terminating then names, copying the nfsd_lookup name to
>    __getname() if necessary.
> 
> Sound reasonable?

I guess so, though I wish I understood kern_path_mountpoint better.

And nfsd_lookup_dentry looks like work for another day.  No, wait, I
forgot the goal here: you're right, nfsd_lookup_dentry has the same
upcall-under-i_mutex problem, so we need to fix it too.

OK, agreed.

--b.
Al Viro May 1, 2015, 1:55 a.m. UTC | #8
On Thu, Apr 30, 2015 at 07:52:25AM +1000, NeilBrown wrote:

> An alternative might be to stop using lookup_one_len, and use kern_path()
> instead.  Or kern_path_mountpoint if we don't want to cross a mountpoint.

NAK.  This is a monumentally bad idea - please, do not do it.  Not to mention
anything else, kern_path() is a _lot_ heavier, including much worse stack
footprint.

Patch

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index 611b540..553e8b1 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -17,6 +17,7 @@  void pin_remove(struct fs_pin *pin)
 	wake_up_locked(&pin->wait);
 	spin_unlock_irq(&pin->wait.lock);
 }
+EXPORT_SYMBOL(pin_remove);
 
 void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
 {
@@ -26,11 +27,13 @@  void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
 	hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
 	spin_unlock(&pin_lock);
 }
+EXPORT_SYMBOL(pin_insert_group);
 
 void pin_insert(struct fs_pin *pin, struct vfsmount *m)
 {
 	pin_insert_group(pin, m, &m->mnt_sb->s_pins);
 }
+EXPORT_SYMBOL(pin_insert);
 
 void pin_kill(struct fs_pin *p)
 {
diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index c3e3b6e..3e3df8c 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -309,7 +309,7 @@  static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
 static void svc_export_put(struct kref *ref)
 {
 	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
-	path_put(&exp->ex_path);
+	pin_remove(&exp->ex_pin);
 	auth_domain_put(exp->ex_client);
 	nfsd4_fslocs_free(&exp->ex_fslocs);
 	kfree(exp->ex_uuid);
@@ -695,15 +695,26 @@  static int svc_export_match(struct cache_head *a, struct cache_head *b)
 		orig->ex_path.mnt == new->ex_path.mnt;
 }
 
+static void export_pin_kill(struct fs_pin *pin)
+{
+	struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
+
+	write_lock(&exp->cd->hash_lock);
+	cache_mark_expired(&exp->h);
+	pin_remove(pin);
+	write_unlock(&exp->cd->hash_lock);
+}
+
 static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
 {
 	struct svc_export *new = container_of(cnew, struct svc_export, h);
 	struct svc_export *item = container_of(citem, struct svc_export, h);
 
+	init_fs_pin(&new->ex_pin, export_pin_kill);
 	kref_get(&item->ex_client->ref);
 	new->ex_client = item->ex_client;
 	new->ex_path = item->ex_path;
-	path_get(&item->ex_path);
+	pin_insert_group(&new->ex_pin, new->ex_path.mnt, NULL);
 	new->ex_fslocs.locations = NULL;
 	new->ex_fslocs.locations_count = 0;
 	new->ex_fslocs.migrated = 0;
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index 1f52bfc..718a27e 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -4,6 +4,7 @@ 
 #ifndef NFSD_EXPORT_H
 #define NFSD_EXPORT_H
 
+#include <linux/fs_pin.h>
 #include <linux/sunrpc/cache.h>
 #include <uapi/linux/nfsd/export.h>
 
@@ -49,6 +50,7 @@  struct svc_export {
 	struct auth_domain *	ex_client;
 	int			ex_flags;
 	struct path		ex_path;
+	struct fs_pin		ex_pin;
 	kuid_t			ex_anon_uid;
 	kgid_t			ex_anon_gid;
 	int			ex_fsid;
diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 437ddb6..6936684 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -206,6 +206,11 @@  static inline int cache_is_expired(struct cache_detail *detail, struct cache_hea
 		(detail->flush_time > h->last_refresh);
 }
 
+static inline void cache_mark_expired(struct cache_head *h)
+{
+	h->expiry_time = seconds_since_boot() - 1;
+}
+
 extern int cache_check(struct cache_detail *detail,
 		       struct cache_head *h, struct cache_req *rqstp);
 extern void cache_flush(void);