
[RFC,0/6] nfs-utils: Improving NFS re-exports

Message ID 20220217131531.2890-1-richard@nod.at (mailing list archive)

Message

Richard Weinberger Feb. 17, 2022, 1:15 p.m. UTC
This is the second iteration of the NFS re-export improvement series for nfs-utils.
While the kernel side didn't change at all and is still small,
the userspace side saw many more changes.
Please note that this is still an RFC, there is room for improvement.

The core idea is adding a new export option: reexport=
Using reexport= it is possible to mark an export entry in the exports file
explicitly as an NFS re-export and to select a strategy for how unique
identifiers should be provided.
"remote-devfsid" is the strategy I proposed in my first patch.
I understand that this one is dangerous, but I still find it useful in some
situations.
"auto-fsidnum" and "predefined-fsidnum" are new and use a SQLite database as
backend to keep track of generated ids.
For a more detailed description see patch "exports: Implement new export option reexport=".
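For illustration, an exports entry using the new option might look like this (the paths and the other export options here are invented for the example; only the reexport= values come from the series):

```
# Re-export of an NFS mount; fsid numbers allocated automatically
/mnt/upstream  *(rw,no_subtree_check,crossmnt,reexport=auto-fsidnum)

# Same, but with fsid numbers predefined by the administrator
/mnt/other     *(rw,no_subtree_check,crossmnt,reexport=predefined-fsidnum)
```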
I chose SQLite because nfs-utils already uses it, and with SQL, ids can be
nicely generated and maintained. It will also scale for large setups where
the number of subvolumes is high.
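The id bookkeeping can be pictured with a small sketch (Python with the stdlib sqlite3 module; the table layout is invented for illustration and is not the actual schema used by the series):

```python
import sqlite3

# Invented schema: one row per re-exported subvolume path.
# INTEGER PRIMARY KEY AUTOINCREMENT keeps ids unique and stable,
# mirroring the "auto-fsidnum" idea described above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE subvolumes (
                  fsidnum INTEGER PRIMARY KEY AUTOINCREMENT,
                  path    TEXT UNIQUE NOT NULL)""")

def get_fsidnum(path):
    # Reuse an existing id if the path is already known...
    row = db.execute("SELECT fsidnum FROM subvolumes WHERE path = ?",
                     (path,)).fetchone()
    if row:
        return row[0]
    # ...otherwise let SQLite hand out the next free one.
    cur = db.execute("INSERT INTO subvolumes(path) VALUES (?)", (path,))
    db.commit()
    return cur.lastrowid

print(get_fsidnum("/mnt/upstream/vol1"))  # a fresh id
print(get_fsidnum("/mnt/upstream/vol2"))  # another fresh id
print(get_fsidnum("/mnt/upstream/vol1"))  # same id as the first call
```

"predefined-fsidnum" would differ only in that the rows are inserted by the administrator up front instead of on demand.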

Besides id generation, this series also addresses the reboot problem.
If the re-exporting NFS server reboots, uncovered NFS subvolumes are not yet
mounted and file handles become stale.
Now mountd/exportd keeps track of uncovered subvolumes and makes sure they get
uncovered again when nfsd starts.

The whole set of features is currently opt-in via --enable-reexport.
I'm also not sure about the rearrangement of the reexport code,
currently it is a helper library.

Please let me know whether you like this approach.
If so, I'd tidy it up and submit it as a non-RFC.

TODOs/Open questions:
- When re-exporting, fs.nfs.nfs_mountpoint_timeout should be set to 0
  so that subvolumes don't vanish.
  Is this something exportfs should do automatically when it sees an export entry with a reexport= option?
- exportd has seen only minimal testing so far; I wasn't aware of it until now. :-S
- Currently there is no way to release the shared memory which contains the database lock.
  I guess it could be released via exportfs -f, which is the very last exec in nfs-server.service.
- Add a tool to import/export entries from the reexport database which obeys the shared lock.
- When doing v4->v4 or v3->v4 re-exports, the very first read access to a file blocks
  for a few seconds until the client does a retransmit.
  v3->v3 works fine. More investigation needed.
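The first TODO item above would amount to something like the following sysctl fragment (file name and comment are illustrative; whether exportfs should apply this automatically is exactly the open question):

```
# /etc/sysctl.d/90-nfs-reexport.conf (example)
# Keep crossed NFS mountpoints from expiring, so re-exported
# subvolumes don't vanish underneath nfsd.
fs.nfs.nfs_mountpoint_timeout = 0
```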

Looking forward to your feedback!

Thanks,
//richard

Richard Weinberger (6):
  Implement reexport helper library
  exports: Implement new export option reexport=
  export: Implement logic behind reexport=
  export: Record mounted volumes
  nfsd: statfs() every known subvolume upon start
  export: Garbage collect orphaned subvolumes upon start

 configure.ac                 |  12 +
 support/Makefile.am          |   4 +
 support/export/Makefile.am   |   2 +
 support/export/cache.c       | 241 +++++++++++++++++-
 support/export/export.h      |   3 +
 support/include/nfslib.h     |   1 +
 support/nfs/Makefile.am      |   1 +
 support/nfs/exports.c        |  73 ++++++
 support/reexport/Makefile.am |   6 +
 support/reexport/reexport.c  | 477 +++++++++++++++++++++++++++++++++++
 support/reexport/reexport.h  |  53 ++++
 utils/exportd/Makefile.am    |   8 +-
 utils/exportd/exportd.c      |  17 ++
 utils/exportfs/Makefile.am   |   4 +
 utils/mount/Makefile.am      |   6 +
 utils/mountd/Makefile.am     |   6 +
 utils/mountd/mountd.c        |   1 +
 utils/mountd/svc_run.c       |  18 ++
 utils/nfsd/Makefile.am       |   6 +
 utils/nfsd/nfsd.c            |  10 +
 20 files changed, 934 insertions(+), 15 deletions(-)
 create mode 100644 support/reexport/Makefile.am
 create mode 100644 support/reexport/reexport.c
 create mode 100644 support/reexport/reexport.h

Comments

J. Bruce Fields Feb. 17, 2022, 4:33 p.m. UTC | #1
On Thu, Feb 17, 2022 at 02:15:25PM +0100, Richard Weinberger wrote:
> This is the second iteration of the NFS re-export improvement series for nfs-utils.
> While the kernel side didn't change at all and is still small,
> the userspace side saw much more changes.
> Please note that this is still an RFC, there is room for improvement.
> 
> The core idea is adding a new export option: reexport=
> Using reexport= it is possible to mark an export entry in the exports file
> explicitly as NFS re-export and select a strategy how unique identifiers
> should be provided.
> "remote-devfsid" is the strategy I have proposed in my first patch,
> I understand that this one is dangerous. But I still find it useful in some
> situations.
> "auto-fsidnum" and "predefined-fsidnum" are new and use a SQLite database as
> backend to keep track of generated ids.
> For a more detailed description see patch "exports: Implement new export option reexport=".

Thanks, I'll try to take a look.

Before upstreaming, I would like us to pick just one.  These kinds of
options tend to complicate testing and documentation and debugging.

For an RFC, though, I think it makes sense, so I'm fine with keeping
"reexport=" while we're still exploring the different options.  And,
hey, maybe we end up adding more than one after we've upstreamed the
first one.

> I choose SQLite because nfs-utils already uses it and using SQL ids can nicely
> generated and maintained. It will also scale for large setups where the amount
> of subvolumes is high.
> 
> Beside of id generation this series also addresses the reboot problem.
> If the re-exporting NFS server reboots, uncovered NFS subvolumes are not yet
> mounted and file handles become stale.
> Now mountd/exportd keeps track of uncovered subvolumes and makes sure they get
> uncovered while nfsd starts.
> 
> The whole set of features is currently opt-in via --enable-reexport.
> I'm also not sure about the rearrangement of the reexport code,
> currently it is a helper library.
> 
> Please let me know whether you like this approach.
> If so I'd tidy it up and submit it as non-RFC.
> 
> TODOs/Open questions:
> - When re-exporting, fs.nfs.nfs_mountpoint_timeout should be set to 0
>   to make subvolumes not vanish.
>   Is this something exportfs should do automatically when it sees an export entry with a reexport= option?

Setting the timeout to 0 doesn't help with re-export server reboots.
After a reboot we could again end up in a situation where a client hands
us a filehandle for a filesystem that isn't mounted yet.

I think you want to keep a path with each entry in the database.  When
mountd gets a request for a filesystem it hasn't seen before, it stats
that path, which should trigger the automounts.
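The suggestion can be sketched as follows (a hypothetical illustration in Python; mountd itself is C, and the recorded paths here are stand-ins):

```python
import os

# Hypothetical: one recorded path per known fsid, as suggested above.
# "/" and "/tmp" stand in for re-exported subvolume paths.
known_subvolumes = {
    1: "/",
    2: "/tmp",
}

def ensure_mounted(fsidnum):
    """stat() the recorded path; on a real re-export server, the
    traversal is what triggers the NFS automount if needed."""
    path = known_subvolumes[fsidnum]
    try:
        os.stat(path)   # the path crossing matters, not the result
        return True
    except OSError:
        return False

print(ensure_mounted(1))  # True: "/" is always reachable
```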

And it'd be good to have a test case with a client (Linux client or
pynfs) that, say, opens a file several mounts deep, then reboots the
reexport server, then tries to, say, write to the file descriptor after
the reboot.  (Or maybe there's a way to force the mounts to expire as a
shortcut instead of doing a full reboot.)

> - exportd saw only minimal testing so far, I wasn't aware of it yet. :-S
> - Currently there is no way to release the shared memory which contains the database lock.
>   I guess it could be released via exportfs -f, which is the very last exec in nfs-server.service
> - Add a tool to import/export entries from the reexport database which obeys the shared lock.
> - When doing v4->v4 or v3->v4 re-exports very first read access to a file block a few seconds until
>   the client does a retransmit. 
>   v3->v3 works fine. More investigation needed.

Might want to strace mountd and look at the communication over the
/proc/fs/nfsd/*/channel files, maybe mountd is failing to respond to an
upcall.

--b.

> 
> Looking forward for your feedback!
> 
> Thanks,
> //richard
> 
> Richard Weinberger (6):
>   Implement reexport helper library
>   exports: Implement new export option reexport=
>   export: Implement logic behind reexport=
>   export: Record mounted volumes
>   nfsd: statfs() every known subvolume upon start
>   export: Garbage collect orphaned subvolumes upon start
> 
>  configure.ac                 |  12 +
>  support/Makefile.am          |   4 +
>  support/export/Makefile.am   |   2 +
>  support/export/cache.c       | 241 +++++++++++++++++-
>  support/export/export.h      |   3 +
>  support/include/nfslib.h     |   1 +
>  support/nfs/Makefile.am      |   1 +
>  support/nfs/exports.c        |  73 ++++++
>  support/reexport/Makefile.am |   6 +
>  support/reexport/reexport.c  | 477 +++++++++++++++++++++++++++++++++++
>  support/reexport/reexport.h  |  53 ++++
>  utils/exportd/Makefile.am    |   8 +-
>  utils/exportd/exportd.c      |  17 ++
>  utils/exportfs/Makefile.am   |   4 +
>  utils/mount/Makefile.am      |   6 +
>  utils/mountd/Makefile.am     |   6 +
>  utils/mountd/mountd.c        |   1 +
>  utils/mountd/svc_run.c       |  18 ++
>  utils/nfsd/Makefile.am       |   6 +
>  utils/nfsd/nfsd.c            |  10 +
>  20 files changed, 934 insertions(+), 15 deletions(-)
>  create mode 100644 support/reexport/Makefile.am
>  create mode 100644 support/reexport/reexport.c
>  create mode 100644 support/reexport/reexport.h
> 
> -- 
> 2.31.1
Richard Weinberger Feb. 17, 2022, 5:27 p.m. UTC | #2
Bruce,

----- Original Message -----
> From: "bfields" <bfields@fieldses.org>
> To: "richard" <richard@nod.at>
> CC: "linux-nfs" <linux-nfs@vger.kernel.org>, "david" <david@sigma-star.at>, "luis turcitu"
> <luis.turcitu@appsbroker.com>, "david young" <david.young@appsbroker.com>, "david oberhollenzer"
> <david.oberhollenzer@sigma-star.at>, "trond myklebust" <trond.myklebust@hammerspace.com>, "anna schumaker"
> <anna.schumaker@netapp.com>, "chris chilvers" <chris.chilvers@appsbroker.com>
> Sent: Thursday, 17 February 2022 17:33:32
> Subject: Re: [RFC PATCH 0/6] nfs-utils: Improving NFS re-exports

> On Thu, Feb 17, 2022 at 02:15:25PM +0100, Richard Weinberger wrote:
>> This is the second iteration of the NFS re-export improvement series for
>> nfs-utils.
>> While the kernel side didn't change at all and is still small,
>> the userspace side saw much more changes.
>> Please note that this is still an RFC, there is room for improvement.
>> 
>> The core idea is adding a new export option: reexport=
>> Using reexport= it is possible to mark an export entry in the exports file
>> explicitly as NFS re-export and select a strategy how unique identifiers
>> should be provided.
>> "remote-devfsid" is the strategy I have proposed in my first patch,
>> I understand that this one is dangerous. But I still find it useful in some
>> situations.
>> "auto-fsidnum" and "predefined-fsidnum" are new and use a SQLite database as
>> backend to keep track of generated ids.
>> For a more detailed description see patch "exports: Implement new export option
>> reexport=".
> 
> Thanks, I'll try to take a look.
> 
> Before upstreaming, I would like us to pick just one.  These kind of
> options tend to complicate testing and documentation and debugging.
> 
> For an RFC, though, I think it makes sense, so I'm fine with keeping
> "reexport=" while we're still exploring the different options.  And,
> hey, maybe we end up adding more than one after we've upstreamed the
> first one.

Which one do you prefer?
"predefined-fsidnum" should be the safest one to start.

>> - When re-exporting, fs.nfs.nfs_mountpoint_timeout should be set to 0
>>   to make subvolumes not vanish.
>>   Is this something exportfs should do automatically when it sees an export entry
>>   with a reexport= option?
> 
> Setting the timeout to 0 doesn't help with re-export server reboots.
> After a reboot is another case where we could end up in a situation
> where a client hands us a filehandle for a filesystem that isn't mounted
> yet.
> 
> I think you want to keep a path with each entry in the database.  When
> mountd gets a request for a filesystem it hasn't seen before, it stats
> that path, which should trigger the automounts.

I have implemented that already. This feature is part of this series. :-)
 
> And it'd be good to have a test case with a client (Linux client or
> pynfs) that, say, opens a file several mounts deep, then reboots the
> reexport server, then tries to, say, write to the file descriptor after
> the reboot.  (Or maybe there's a way to force the mounts to expire as a
> shortcut instead of doing a full reboot.)

Okay, I'll test that.

>> - exportd saw only minimal testing so far, I wasn't aware of it yet. :-S
>> - Currently there is no way to release the shared memory which contains the
>> database lock.
>>   I guess it could be released via exportfs -f, which is the very last exec in
>>   nfs-server.service
>> - Add a tool to import/export entries from the reexport database which obeys the
>> shared lock.
>> - When doing v4->v4 or v3->v4 re-exports very first read access to a file block
>> a few seconds until
>>   the client does a retransmit.
>>   v3->v3 works fine. More investigation needed.
> 
> Might want to strace mountd and look at the communication over the
> /proc/fs/nfsd/*/channel files, maybe mountd is failing to respond to an
> upcall.

Ahh. The upcall might be the issue. Thanks for the pointer.

Thanks,
//richard
J. Bruce Fields Feb. 17, 2022, 7:27 p.m. UTC | #3
On Thu, Feb 17, 2022 at 06:27:15PM +0100, Richard Weinberger wrote:
> > From: "bfields" <bfields@fieldses.org>
> > Thanks, I'll try to take a look.
> > 
> > Before upstreaming, I would like us to pick just one.  These kind of
> > options tend to complicate testing and documentation and debugging.
> > 
> > For an RFC, though, I think it makes sense, so I'm fine with keeping
> > "reexport=" while we're still exploring the different options.  And,
> > hey, maybe we end up adding more than one after we've upstreamed the
> > first one.
> 
> Which one do you prefer?
> "predefined-fsidnum" should be the safest one to start.

I don't know!  I'll have to do some more reading and think about it.

> > Setting the timeout to 0 doesn't help with re-export server reboots.
> > After a reboot is another case where we could end up in a situation
> > where a client hands us a filehandle for a filesystem that isn't mounted
> > yet.
> > 
> > I think you want to keep a path with each entry in the database.  When
> > mountd gets a request for a filesystem it hasn't seen before, it stats
> > that path, which should trigger the automounts.
> 
> I have implemented that already. This feature is part of this series. :-)

Oh, good, got it.  It'll take me some time to catch up.

--b.
Richard Weinberger Feb. 17, 2022, 8:15 p.m. UTC | #4
----- Original Message -----
> From: "bfields" <bfields@fieldses.org>
>> Which one do you prefer?
>> "predefined-fsidnum" should be the safest one to start.
> 
> I don't know!  I'll have to do some more reading and think about it.

No need to worry, take your time.
 
>> > Setting the timeout to 0 doesn't help with re-export server reboots.
>> > After a reboot is another case where we could end up in a situation
>> > where a client hands us a filehandle for a filesystem that isn't mounted
>> > yet.
>> > 
>> > I think you want to keep a path with each entry in the database.  When
>> > mountd gets a request for a filesystem it hasn't seen before, it stats
>> > that path, which should trigger the automounts.
>> 
>> I have implemented that already. This feature is part of this series. :-)
> 
> Oh, good, got it.  It'll take me some time to catch up.

The reason why setting the timeout to 0 is still needed is that when
mountd uncovers a subvolume but no client uses a filehandle for it,
the mount is not pinned and can be unmounted later.
Only when nfsd sees a matching filehandle is the reference counter
increased, making unmounting no longer possible.

Thanks,
//richard
J. Bruce Fields Feb. 17, 2022, 8:18 p.m. UTC | #5
On Thu, Feb 17, 2022 at 09:15:38PM +0100, Richard Weinberger wrote:
> ----- Original Message -----
> > From: "bfields" <bfields@fieldses.org>
> >> Which one do you prefer?
> >> "predefined-fsidnum" should be the safest one to start.
> > 
> > I don't know!  I'll have to do some more reading and think about it.
> 
> No need to worry, take your time.
>  
> >> > Setting the timeout to 0 doesn't help with re-export server reboots.
> >> > After a reboot is another case where we could end up in a situation
> >> > where a client hands us a filehandle for a filesystem that isn't mounted
> >> > yet.
> >> > 
> >> > I think you want to keep a path with each entry in the database.  When
> >> > mountd gets a request for a filesystem it hasn't seen before, it stats
> >> > that path, which should trigger the automounts.
> >> 
> >> I have implemented that already. This feature is part of this series. :-)
> > 
> > Oh, good, got it.  It'll take me some time to catch up.
> 
> The reason why setting the timeout to 0 is still needed is that when
> mountd uncovers a subvolume but no client uses a filehandle for it,
> the mount is not pinned and can be unmounted later.
> Only when nfsd sees a matching filehandle is the reference counter
> increased, making unmounting no longer possible.

I understand that.  But, then if a client comes in with a matching
filehandle, mountd should be able to look up the filesystem and trigger
a new mount, right?

I can imagine that setting the timeout to 0 might be an optimization,
but I'm not seeing why it's required for correct behavior.

--b.
Richard Weinberger Feb. 17, 2022, 8:29 p.m. UTC | #6
----- Original Message -----
> From: "bfields" <bfields@fieldses.org>
>> The reason why setting the timeout to 0 is still needed is that when
>> mountd uncovers a subvolume but no client uses a filehandle for it,
>> the mount is not pinned and can be unmounted later.
>> Only when nfsd sees a matching filehandle is the reference counter
>> increased, making unmounting no longer possible.
> 
> I understand that.  But, then if a client comes in with a matching
> filehandle, mountd should be able to look up the filesystem and trigger
> a new mount, right?

This does not match what I saw in my experiments. The handle matched only
when the subvolume was mounted before.
But now I understand your point; in theory it should work.
I'll look into this.

Thanks,
//richard
Richard Weinberger March 7, 2022, 9:25 a.m. UTC | #7
Bruce,

----- Original Message -----
> From: "bfields" <bfields@fieldses.org>

> On Thu, Feb 17, 2022 at 02:15:25PM +0100, Richard Weinberger wrote:
>> This is the second iteration of the NFS re-export improvement series for
>> nfs-utils.
>> While the kernel side didn't change at all and is still small,
>> the userspace side saw much more changes.
>> Please note that this is still an RFC, there is room for improvement.
>> 
>> The core idea is adding a new export option: reexport=
>> Using reexport= it is possible to mark an export entry in the exports file
>> explicitly as NFS re-export and select a strategy how unique identifiers
>> should be provided.
>> "remote-devfsid" is the strategy I have proposed in my first patch,
>> I understand that this one is dangerous. But I still find it useful in some
>> situations.
>> "auto-fsidnum" and "predefined-fsidnum" are new and use a SQLite database as
>> backend to keep track of generated ids.
>> For a more detailed description see patch "exports: Implement new export option
>> reexport=".
> 
> Thanks, I'll try to take a look.

Did you find some cycles to think about which approach you prefer?

Thanks,
//richard
J. Bruce Fields March 7, 2022, 10:29 p.m. UTC | #8
On Mon, Mar 07, 2022 at 10:25:52AM +0100, Richard Weinberger wrote:
> Did you find some cycles to think about which approach you prefer?

I haven't.  I'll try to take a look in the next couple days.  Thanks for
the reminder.

--b.
Steve Dickson April 19, 2022, 8:20 p.m. UTC | #9
On 3/7/22 5:29 PM, bfields wrote:
> On Mon, Mar 07, 2022 at 10:25:52AM +0100, Richard Weinberger wrote:
>> Did you find some cycles to think about which approach you prefer?
> 
> I haven't.  I'll try to take a look in the next couple days.  Thanks for
> the reminder.

Is there any movement on this? It would be good to have something
for the upcoming Bakeathon... next week.

steved.
Richard Weinberger April 19, 2022, 8:31 p.m. UTC | #10
----- Original Message -----
> On 3/7/22 5:29 PM, bfields wrote:
>> On Mon, Mar 07, 2022 at 10:25:52AM +0100, Richard Weinberger wrote:
>>> Did you find some cycles to think about which approach you prefer?
>> 
>> I haven't.  I'll try to take a look in the next couple days.  Thanks for
>> the reminder.
> 
> Is there any movement on this? It would be good to have something
> for the upcoming Bakeathon... next week.

I'm in the middle of preparing the next version of the series.

Thanks,
//richard