diff mbox series

[v3] nfsd: disallow file locking and delegations for NFSv4 reexport

Message ID 20241023155846.63621-1-snitzer@kernel.org (mailing list archive)
State New
Headers show
Series [v3] nfsd: disallow file locking and delegations for NFSv4 reexport | expand

Commit Message

Mike Snitzer Oct. 23, 2024, 3:58 p.m. UTC
We do not and cannot support file locking with NFS reexport over
NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport
server reboot cannot allow clients to recover locks because the source
NFS server has not rebooted, and so it is not in grace.  Since the
source NFS server is not in grace, it cannot offer any guarantees that
the file won't have been changed between the locks getting lost and
any attempt to recover/reclaim them.  The same applies to delegations
and any associated locks, so disallow them too.

Add EXPORT_OP_NOLOCKSUPPORT and exportfs_lock_op_is_unsupported(), set
EXPORT_OP_NOLOCKSUPPORT in nfs_export_ops and check for it in
nfsd4_lock(), nfsd4_locku() and nfs4_set_delegation().  Clients are
not allowed to get file locks or delegations from a reexport server;
any attempt will fail with operation not supported.

Update the "Reboot recovery" section accordingly in
Documentation/filesystems/nfs/reexport.rst

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 Documentation/filesystems/nfs/reexport.rst | 10 +++++++---
 fs/nfs/export.c                            |  3 ++-
 fs/nfsd/nfs4state.c                        | 20 ++++++++++++++++++++
 include/linux/exportfs.h                   | 14 ++++++++++++++
 4 files changed, 43 insertions(+), 4 deletions(-)

v3: refine the patch header and reexport.rst to make clear that both
    locks and delegations will fail against an NFS reexport server.
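
For reference, a rough sketch of the mechanism described above (the new
export-op flag, the helper, and the check in the NFSv4 locking paths).
The real hunks are in the files listed in the diffstat; the flag value
and the way each caller reaches the inode are placeholders here:

  /* include/linux/exportfs.h (sketch): new capability bit plus helper.
   * The bit value is a placeholder for whichever free EXPORT_OP_* bit
   * the patch actually claims. */
  #define EXPORT_OP_NOLOCKSUPPORT  (0x100) /* no lock support over this export */

  static inline bool
  exportfs_lock_op_is_unsupported(const struct export_operations *export_ops)
  {
          return export_ops->flags & EXPORT_OP_NOLOCKSUPPORT;
  }

  /* fs/nfs/export.c (sketch): NFS, when it is the filesystem being
   * re-exported, advertises that it cannot support lock recovery. */
  const struct export_operations nfs_export_ops = {
          /* ... existing ops ... */
          .flags = /* existing flags | */ EXPORT_OP_NOLOCKSUPPORT,
  };

  /* fs/nfsd/nfs4state.c (sketch): nfsd4_lock(), nfsd4_locku() and
   * nfs4_set_delegation() bail out early on such exports. */
          if (exportfs_lock_op_is_unsupported(inode->i_sb->s_export_op))
                  return nfserr_notsupp;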

Comments

Martin Wege Oct. 29, 2024, 1:57 p.m. UTC | #1
On Wed, Oct 23, 2024 at 5:58 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> We do not and cannot support file locking with NFS reexport over
> NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport
> server reboot cannot allow clients to recover locks because the source
> NFS server has not rebooted, and so it is not in grace.  Since the
> source NFS server is not in grace, it cannot offer any guarantees that
> the file won't have been changed between the locks getting lost and
> any attempt to recover/reclaim them.  The same applies to delegations
> and any associated locks, so disallow them too.
>
> Add EXPORT_OP_NOLOCKSUPPORT and exportfs_lock_op_is_unsupported(), set
> EXPORT_OP_NOLOCKSUPPORT in nfs_export_ops and check for it in
> nfsd4_lock(), nfsd4_locku() and nfs4_set_delegation().  Clients are
> not allowed to get file locks or delegations from a reexport server,
> any attempts will fail with operation not supported.

Are you aware that this virtually castrates NFSv4 reexport to a point
that it is no longer usable in real life? If you really want this,
then the only way forward is to disable and remove NFS reexport
support completely.

So this patch is absolutely a NO-GO, r-

Thanks,
Martin
Chuck Lever III Oct. 29, 2024, 2:11 p.m. UTC | #2
> On Oct 29, 2024, at 9:57 AM, Martin Wege <martin.l.wege@gmail.com> wrote:
> 
> On Wed, Oct 23, 2024 at 5:58 PM Mike Snitzer <snitzer@kernel.org> wrote:
>> 
>> We do not and cannot support file locking with NFS reexport over
>> NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport
>> server reboot cannot allow clients to recover locks because the source
>> NFS server has not rebooted, and so it is not in grace.  Since the
>> source NFS server is not in grace, it cannot offer any guarantees that
>> the file won't have been changed between the locks getting lost and
>> any attempt to recover/reclaim them.  The same applies to delegations
>> and any associated locks, so disallow them too.
>> 
>> Add EXPORT_OP_NOLOCKSUPPORT and exportfs_lock_op_is_unsupported(), set
>> EXPORT_OP_NOLOCKSUPPORT in nfs_export_ops and check for it in
>> nfsd4_lock(), nfsd4_locku() and nfs4_set_delegation().  Clients are
>> not allowed to get file locks or delegations from a reexport server,
>> any attempts will fail with operation not supported.
> 
> Are you aware that this virtually castrates NFSv4 reexport to a point
> that it is no longer usable in real life?

"virtually castrates" is pretty nebulous. Please provide a
detailed (and less hostile) account of an existing application
that works today that no longer works when this patch is
applied. Only then can we count this as a regression report.


> If you really want this,
> then the only way forward is to disable and remove NFS reexport
> support completely.

"No locking" is already the way NFSv3 re-export works.

At the moment I cannot remember why we chose not to go with
the "only local locking for re-export" design instead.


--
Chuck Lever
Brian Cowan Oct. 29, 2024, 3:54 p.m. UTC | #3
Honestly, I don't know the use case for re-exporting another server's
NFS export in the first place. Is this someone trying to share NFS
through a firewall? I've seen people share remote NFS exports via
Samba in an attempt to avoid paying their NAS vendor for SMB support.
(I think it's "standard equipment" now, but 10+ years ago? Not
always...) But re-exporting another server's NFS exports? Haven't seen
anyone do that in a while.

Using "only local locks for reexport" would mean that -- in cases
where different clients access the underlying export directly and
others access the re-export -- you would have 2 different sources of
"truth" with respect to locks... I have supported multiple tools that
used file or byte-range record locks in my career... And this could
easily royally hork any shared databases...

Regards,

Brian Cowan
ClearCase/VersionVault SWAT
Mob: +1 (978) 907-2334
hcltechsw.com

On Tue, Oct 29, 2024 at 10:11 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Oct 29, 2024, at 9:57 AM, Martin Wege <martin.l.wege@gmail.com> wrote:
> >
> > On Wed, Oct 23, 2024 at 5:58 PM Mike Snitzer <snitzer@kernel.org> wrote:
> >>
> >> We do not and cannot support file locking with NFS reexport over
> >> NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport
> >> server reboot cannot allow clients to recover locks because the source
> >> NFS server has not rebooted, and so it is not in grace.  Since the
> >> source NFS server is not in grace, it cannot offer any guarantees that
> >> the file won't have been changed between the locks getting lost and
> >> any attempt to recover/reclaim them.  The same applies to delegations
> >> and any associated locks, so disallow them too.
> >>
> >> Add EXPORT_OP_NOLOCKSUPPORT and exportfs_lock_op_is_unsupported(), set
> >> EXPORT_OP_NOLOCKSUPPORT in nfs_export_ops and check for it in
> >> nfsd4_lock(), nfsd4_locku() and nfs4_set_delegation().  Clients are
> >> not allowed to get file locks or delegations from a reexport server,
> >> any attempts will fail with operation not supported.
> >
> > Are you aware that this virtually castrates NFSv4 reexport to a point
> > that it is no longer usable in real life?
>
> "virtually castrates" is pretty nebulous. Please provide a
> detailed (and less hostile) account of an existing application
> that works today that no longer works when this patch is
> applied. Only then can we count this as a regression report.
>
>
> > If you really want this,
> > then the only way forward is to disable and remove NFS reexport
> > support completely.
>
> "No locking" is already the way NFSv3 re-export works.
>
> At the moment I cannot remember why we chose not to go with
> the "only local locking for re-export" design instead.
>
>
> --
> Chuck Lever
>
>
Chuck Lever III Oct. 29, 2024, 4:03 p.m. UTC | #4
> On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
> 
> Honestly, I don't know the usecase for re-exporting another server's
> NFS export in the first place. Is this someone trying to share NFS
> through a firewall? I've seen people share remote NFS exports via
> Samba in an attempt to avoid paying their NAS vendor for SMB support.
> (I think it's "standard equipment" now, but 10+ years ago? Not
> always...) But re-exporting another server's NFS exports? Haven't seen
> anyone do that in a while.

The "re-export" case is where there is a central repository
of data and branch offices that access that via a WAN. The
re-export servers cache some of that data locally so that
local clients have a fast persistent cache nearby.

This is also effective in cases where a small cluster of
clients want fast access to a pile of data that is
significantly larger than their own caches. Say, HPC or
animation, where the small cluster is working on a small
portion of the full data set, which is stored on a central
server.


> Using "only local locks for reexport" would mean that -- in cases
> where different clients access the underlying export directly and
> others access the re-export -- you would have 2 different sources of
> "truth" with respect to locks... I have supported multiple tools that
> used file or byte-range record locks in my career... And this could
> easily royally hork any shared databases...

Yes, that's the downside of the local-lock-only approach.

I had assumed that when locking is not available on the NFS
server, the client can mount with "local_lock" (man nfs(5)).


> Regards,
> 
> Brian Cowan
> 
> Regards,
> 
> Brian Cowan
> 
> ClearCase/VersionVault SWAT
> 
> 
> 
> Mob: +1 (978) 907-2334
> 
> 
> 
> hcltechsw.com
> 
> 
> 
> On Tue, Oct 29, 2024 at 10:11 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Oct 29, 2024, at 9:57 AM, Martin Wege <martin.l.wege@gmail.com> wrote:
>>> 
>>> On Wed, Oct 23, 2024 at 5:58 PM Mike Snitzer <snitzer@kernel.org> wrote:
>>>> 
>>>> We do not and cannot support file locking with NFS reexport over
>>>> NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport
>>>> server reboot cannot allow clients to recover locks because the source
>>>> NFS server has not rebooted, and so it is not in grace.  Since the
>>>> source NFS server is not in grace, it cannot offer any guarantees that
>>>> the file won't have been changed between the locks getting lost and
>>>> any attempt to recover/reclaim them.  The same applies to delegations
>>>> and any associated locks, so disallow them too.
>>>> 
>>>> Add EXPORT_OP_NOLOCKSUPPORT and exportfs_lock_op_is_unsupported(), set
>>>> EXPORT_OP_NOLOCKSUPPORT in nfs_export_ops and check for it in
>>>> nfsd4_lock(), nfsd4_locku() and nfs4_set_delegation().  Clients are
>>>> not allowed to get file locks or delegations from a reexport server,
>>>> any attempts will fail with operation not supported.
>>> 
>>> Are you aware that this virtually castrates NFSv4 reexport to a point
>>> that it is no longer usable in real life?
>> 
>> "virtually castrates" is pretty nebulous. Please provide a
>> detailed (and less hostile) account of an existing application
>> that works today that no longer works when this patch is
>> applied. Only then can we count this as a regression report.
>> 
>> 
>>> If you really want this,
>>> then the only way forward is to disable and remove NFS reexport
>>> support completely.
>> 
>> "No locking" is already the way NFSv3 re-export works.
>> 
>> At the moment I cannot remember why we chose not to go with
>> the "only local locking for re-export" design instead.
>> 
>> 
>> --
>> Chuck Lever
>> 
>> 

--
Chuck Lever
Cedric Blancher Oct. 30, 2024, 2:55 p.m. UTC | #5
On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
> >
> > Honestly, I don't know the usecase for re-exporting another server's
> > NFS export in the first place. Is this someone trying to share NFS
> > through a firewall? I've seen people share remote NFS exports via
> > Samba in an attempt to avoid paying their NAS vendor for SMB support.
> > (I think it's "standard equipment" now, but 10+ years ago? Not
> > always...) But re-exporting another server's NFS exports? Haven't seen
> > anyone do that in a while.
>
> The "re-export" case is where there is a central repository
> of data and branch offices that access that via a WAN. The
> re-export servers cache some of that data locally so that
> local clients have a fast persistent cache nearby.
>
> This is also effective in cases where a small cluster of
> clients want fast access to a pile of data that is
> significantly larger than their own caches. Say, HPC or
> animation, where the small cluster is working on a small
> portion of the full data set, which is stored on a central
> server.
>
Another use case is "isolation": IT shares a filesystem with your
department, and you need to re-export only a subset to another
department or home office. Part of such a scenario might also be policy
related, e.g. IT shares the full filesystem with you but will do NOTHING
else, and any further compartmentalization must be done in your own
department.
This is the typical use case for gov NFS re-export.

Of course no one needs the gov customers, so feel free to break locking.

Ced
Chuck Lever III Oct. 30, 2024, 4:15 p.m. UTC | #6
> On Oct 30, 2024, at 10:55 AM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> 
> On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@oracle.com> wrote:
>> 
>>> On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
>>> 
>>> Honestly, I don't know the usecase for re-exporting another server's
>>> NFS export in the first place. Is this someone trying to share NFS
>>> through a firewall? I've seen people share remote NFS exports via
>>> Samba in an attempt to avoid paying their NAS vendor for SMB support.
>>> (I think it's "standard equipment" now, but 10+ years ago? Not
>>> always...) But re-exporting another server's NFS exports? Haven't seen
>>> anyone do that in a while.
>> 
>> The "re-export" case is where there is a central repository
>> of data and branch offices that access that via a WAN. The
>> re-export servers cache some of that data locally so that
>> local clients have a fast persistent cache nearby.
>> 
>> This is also effective in cases where a small cluster of
>> clients want fast access to a pile of data that is
>> significantly larger than their own caches. Say, HPC or
>> animation, where the small cluster is working on a small
>> portion of the full data set, which is stored on a central
>> server.
>> 
> Another use case is "isolation", IT shares a filesystem to your
> department, and you need to re-export only a subset to another
> department or homeoffice. Part of such a scenario might also be policy
> related, e.g. IT shares you the full filesystem but will do NOTHING
> else, and any further compartmentalization must be done in your own
> department.
> This is the typical use case for gov NFS re-export.

It's not clear to me from this description why re-export is
the right tool for this job. Please explain why ACLs are not
used in this case -- this is exactly what they are designed
to do.

And again, clients of the re-export server need to mount it
with local_lock. Apps can still use locking in that case,
but the locks are not visible to apps on other clients. Your
description does not explain why local_lock is not
sufficient or feasible.


> Of course no one needs the gov customers, so feel free to break locking.


Please have a look at the patch description again: lock
recovery does not work now, and cannot work without
changes to the protocol. Isn't that a problem for such
workloads?

In other words, locking is already broken on NFSv4 re-export,
but the current situation can lead to silent data corruption.


--
Chuck Lever
Cedric Blancher Oct. 30, 2024, 4:37 p.m. UTC | #7
On Wed, 30 Oct 2024 at 17:15, Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Oct 30, 2024, at 10:55 AM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> >
> > On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>
> >>> On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
> >>>
> >>> Honestly, I don't know the usecase for re-exporting another server's
> >>> NFS export in the first place. Is this someone trying to share NFS
> >>> through a firewall? I've seen people share remote NFS exports via
> >>> Samba in an attempt to avoid paying their NAS vendor for SMB support.
> >>> (I think it's "standard equipment" now, but 10+ years ago? Not
> >>> always...) But re-exporting another server's NFS exports? Haven't seen
> >>> anyone do that in a while.
> >>
> >> The "re-export" case is where there is a central repository
> >> of data and branch offices that access that via a WAN. The
> >> re-export servers cache some of that data locally so that
> >> local clients have a fast persistent cache nearby.
> >>
> >> This is also effective in cases where a small cluster of
> >> clients want fast access to a pile of data that is
> >> significantly larger than their own caches. Say, HPC or
> >> animation, where the small cluster is working on a small
> >> portion of the full data set, which is stored on a central
> >> server.
> >>
> > Another use case is "isolation", IT shares a filesystem to your
> > department, and you need to re-export only a subset to another
> > department or homeoffice. Part of such a scenario might also be policy
> > related, e.g. IT shares you the full filesystem but will do NOTHING
> > else, and any further compartmentalization must be done in your own
> > department.
> > This is the typical use case for gov NFS re-export.
>
> It's not clear to me from this description why re-export is
> the right tool for this job. Please explain why ACLs are not
> used in this case -- this is exactly what they are designed
> to do.

1. IT departments want better/harder/immutable isolation than ACLs
2. Linux NFSv4 only implements POSIX draft ACLs, not full Windows or
NFSv4 ACLs. So there is no proper way to prevent ACL editing,
rendering them useless in this case.

There is a reason why POSIX draft ACLs were abandoned - they are not
fine-grained enough for real-world usage outside the Linux universe.
As soon as interoperability is required, these things just bite you
HARD.

Also, just running more nfsd instances in parallel on the origin NFS server
is not a better option - remember the debate over non-2049 ports for nfsd?

>
> And again, clients of the re-export server need to mount it
> with local_lock. Apps can still use locking in that case,
> but the locks are not visible to apps on other clients. Your
> description does not explain why local_lock is not
> sufficient or feasible.

Because:
- it breaks applications running on more than one machine?
- it breaks use cases like NFS--->SMB bridges, because without locking
the typical Windows .NET application will refuse to write to a file
- it breaks even SIMPLE things like Microsoft Excel

Of course the happy echo "hello Linux-NFSv4-only world" >/nfs/file
will always work.

> > Of course no one needs the gov customers, so feel free to break locking.
>
>
> Please have a look at the patch description again: lock
> recovery does not work now, and cannot work without
> changes to the protocol. Isn't that a problem for such
> workloads?

Nope, because of UPS (Uninterruptible power supply). Either everything
is UP, or *everything* is DOWN. Boolean.

>
> In other words, locking is already broken on NFSv4 re-export,
> but the current situation can lead to silent data corruption.

Would storing the locking information in persistent files help, i.e.
files which persist across nfsd server restarts?

Ced
Chuck Lever III Oct. 30, 2024, 4:59 p.m. UTC | #8
> On Oct 30, 2024, at 12:37 PM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> 
> On Wed, 30 Oct 2024 at 17:15, Chuck Lever III <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Oct 30, 2024, at 10:55 AM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>> 
>>> On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>> 
>>>>> On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
>>>>> 
>>>>> Honestly, I don't know the usecase for re-exporting another server's
>>>>> NFS export in the first place. Is this someone trying to share NFS
>>>>> through a firewall? I've seen people share remote NFS exports via
>>>>> Samba in an attempt to avoid paying their NAS vendor for SMB support.
>>>>> (I think it's "standard equipment" now, but 10+ years ago? Not
>>>>> always...) But re-exporting another server's NFS exports? Haven't seen
>>>>> anyone do that in a while.
>>>> 
>>>> The "re-export" case is where there is a central repository
>>>> of data and branch offices that access that via a WAN. The
>>>> re-export servers cache some of that data locally so that
>>>> local clients have a fast persistent cache nearby.
>>>> 
>>>> This is also effective in cases where a small cluster of
>>>> clients want fast access to a pile of data that is
>>>> significantly larger than their own caches. Say, HPC or
>>>> animation, where the small cluster is working on a small
>>>> portion of the full data set, which is stored on a central
>>>> server.
>>>> 
>>> Another use case is "isolation", IT shares a filesystem to your
>>> department, and you need to re-export only a subset to another
>>> department or homeoffice. Part of such a scenario might also be policy
>>> related, e.g. IT shares you the full filesystem but will do NOTHING
>>> else, and any further compartmentalization must be done in your own
>>> department.
>>> This is the typical use case for gov NFS re-export.
>> 
>> It's not clear to me from this description why re-export is
>> the right tool for this job. Please explain why ACLs are not
>> used in this case -- this is exactly what they are designed
>> to do.
> 
> 1. IT departments want better/harder/immutable isolation than ACLs

So you want MAC, and the storage administrator won't set
that up for you on the NFS server. NFS doesn't do MAC
very well if at all.


> 2. Linux NFSv4 only implements POSIX draft ACLs, not full Windows or
> NFSv4 ACLs. So there is no proper way to prevent ACL editing,
> rendering them useless in this case.

Er. Linux NFSv4 stores the ACLs as POSIX draft, because
that's what Linux file systems can support. NFSD, via
NFSv4, makes these appear like NFSv4 ACLs.

But I think I understand.


> There is a reason why POSIX draft ACls were abandoned - they are not
> fine-granted enough for real world usage outside the Linux universe.
> As soon as interoperability is required these things just bite you
> HARD.

You, of course, have the ability to run some other NFS
server implementation that meets your security requirements
more fully.


> Also, just running more nfsd in parallel on the origin NFS server is
> not a better option - remember the debate of non-2049 ports for nfsd?

I'm not sure where this is going. Do you mean the storage
administrator would provide NFS service on alternate
ports that each expose a separate set of exports?

So the only option Linux has there is using containers or
libvirt. We've continued to privately discuss the ability
for NFSD to support a separate set of exports on alternate
ports, but it doesn't look feasible. The export management
infrastructure and user space tools would need to be
rewritten.


>> And again, clients of the re-export server need to mount it
>> with local_lock. Apps can still use locking in that case,
>> but the locks are not visible to apps on other clients. Your
>> description does not explain why local_lock is not
>> sufficient or feasible.
> 
> Because:
> - it breaks applications running on more than one machine?

Yes, obviously. Your description needs to mention that is
a requirement, since there are a lot of applications that
don't need locking across multiple clients.


> - it breaks use cases like NFS--->SMB bridges, because without locking
> the typical Windows .NET application will refuse to write to a file

That's a quagmire, and I don't think we can guarantee that
will work. Linux NFS doesn't support "deny" modes, for
example.


> - it breaks even SIMPLE things like Microsoft Excel

If you need SMB semantics, why not use Samba?

The upshot appears to be that this usage is a stack of
mismatched storage protocols that work around a bunch of
local IT bureaucracy. I'm trying to be sympathetic, but
it's hard to say that /anyone/ would fully support this.


> Of course the happy echo "hello Linux-NFSv4-only world" >/nfs/file
> will always work.
> 
>>> Of course no one needs the gov customers, so feel free to break locking.
>> 
>> 
>> Please have a look at the patch description again: lock
>> recovery does not work now, and cannot work without
>> changes to the protocol. Isn't that a problem for such
>> workloads?
> 
> Nope, because of UPS (Uninterruptible power supply). Either everything
> is UP, or *everything* is DOWN. Boolean.

Power outages are not the only reason lock recovery might
be necessary. Network partitions, re-export server
upgrades or reboots, etc. So I'm not hearing anything
to suggest this kind of workload is not impacted by
the current lock recovery problems.


>> In other words, locking is already broken on NFSv4 re-export,
>> but the current situation can lead to silent data corruption.
> 
> Would storing the locking information into persistent files help, ie.
> files which persist across nfsd server restarts?

Yes, but it would make things horribly slow.

And of course there would be a lot of coding involved
to get this to work.

What if we added an export option to allow the re-export
server to continue handling locking, but default it to
off (which is the safer option)?

--
Chuck Lever
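
As a concrete illustration of the opt-in export option floated above: a
new export flag could gate the check so locking stays disabled on
re-exports unless the administrator explicitly turns it back on. All
names and values below are invented for illustration; only
exportfs_lock_op_is_unsupported() comes from the posted patch:

  /* Hypothetical sketch only: an exports(5)-style flag that re-enables
   * lock handling on a re-export, default off. */
  #define NFSEXP_REEXPORT_LOCKS   0x1000000   /* invented name and value */

  /* Would be called from nfsd4_lock()/nfsd4_locku()/nfs4_set_delegation()
   * in place of the bare exportfs_lock_op_is_unsupported() check. */
  static bool nfsd_lock_ops_allowed(const struct svc_export *exp,
                                    const struct inode *inode)
  {
          if (exp->ex_flags & NFSEXP_REEXPORT_LOCKS)
                  return true;    /* admin explicitly opted in */
          return !exportfs_lock_op_is_unsupported(inode->i_sb->s_export_op);
  }

The cost of such a knob is exactly the recovery problem discussed in the
rest of the thread: locks taken through the re-export server still cannot
be reclaimed reliably after a reboot.
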
Rick Macklem Oct. 30, 2024, 10:48 p.m. UTC | #9
On Wed, Oct 30, 2024 at 10:08 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
>
>
> > On Oct 30, 2024, at 12:37 PM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> >
> > On Wed, 30 Oct 2024 at 17:15, Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Oct 30, 2024, at 10:55 AM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> >>>
> >>> On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>>>
> >>>>> On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
> >>>>>
> >>>>> Honestly, I don't know the usecase for re-exporting another server's
> >>>>> NFS export in the first place. Is this someone trying to share NFS
> >>>>> through a firewall? I've seen people share remote NFS exports via
> >>>>> Samba in an attempt to avoid paying their NAS vendor for SMB support.
> >>>>> (I think it's "standard equipment" now, but 10+ years ago? Not
> >>>>> always...) But re-exporting another server's NFS exports? Haven't seen
> >>>>> anyone do that in a while.
> >>>>
> >>>> The "re-export" case is where there is a central repository
> >>>> of data and branch offices that access that via a WAN. The
> >>>> re-export servers cache some of that data locally so that
> >>>> local clients have a fast persistent cache nearby.
> >>>>
> >>>> This is also effective in cases where a small cluster of
> >>>> clients want fast access to a pile of data that is
> >>>> significantly larger than their own caches. Say, HPC or
> >>>> animation, where the small cluster is working on a small
> >>>> portion of the full data set, which is stored on a central
> >>>> server.
> >>>>
> >>> Another use case is "isolation", IT shares a filesystem to your
> >>> department, and you need to re-export only a subset to another
> >>> department or homeoffice. Part of such a scenario might also be policy
> >>> related, e.g. IT shares you the full filesystem but will do NOTHING
> >>> else, and any further compartmentalization must be done in your own
> >>> department.
> >>> This is the typical use case for gov NFS re-export.
> >>
> >> It's not clear to me from this description why re-export is
> >> the right tool for this job. Please explain why ACLs are not
> >> used in this case -- this is exactly what they are designed
> >> to do.
> >
> > 1. IT departments want better/harder/immutable isolation than ACLs
>
> So you want MAC, and the storage administrator won't set
> that up for you on the NFS server. NFS doesn't do MAC
> very well if at all.
>
>
> > 2. Linux NFSv4 only implements POSIX draft ACLs, not full Windows or
> > NFSv4 ACLs. So there is no proper way to prevent ACL editing,
> > rendering them useless in this case.
>
> Er. Linux NFSv4 stores the ACLs as POSIX draft, because
> that's what Linux file systems can support. NFSD, via
> NFSv4, makes these appear like NFSv4 ACLs.
>
> But I think I understand.
>
>
> > There is a reason why POSIX draft ACls were abandoned - they are not
> > fine-granted enough for real world usage outside the Linux universe.
> > As soon as interoperability is required these things just bite you
> > HARD.
>
> You, of course, have the ability to run some other NFS
> server implementation that meets your security requirements
> more fully.
>
>
> > Also, just running more nfsd in parallel on the origin NFS server is
> > not a better option - remember the debate of non-2049 ports for nfsd?
>
> I'm not sure where this is going. Do you mean the storage
> administrator would provide NFS service on alternate
> ports that each expose a separate set of exports?
>
> So the only option Linux has there is using containers or
> libvirt. We've continued to privately discuss the ability
> for NFSD to support a separate set of exports on alternate
> ports, but it doesn't look feasible. The export management
> infrastructure and user space tools would need to be
> rewritten.
>
>
> >> And again, clients of the re-export server need to mount it
> >> with local_lock. Apps can still use locking in that case,
> >> but the locks are not visible to apps on other clients. Your
> >> description does not explain why local_lock is not
> >> sufficient or feasible.
> >
> > Because:
> > - it breaks applications running on more than one machine?
>
> Yes, obviously. Your description needs to mention that is
> a requirement, since there are a lot of applications that
> don't need locking across multiple clients.
>
>
> > - it breaks use cases like NFS--->SMB bridges, because without locking
> > the typical Windows .NET application will refuse to write to a file
>
> That's a quagmire, and I don't think we can guarantee that
> will work. Linux NFS doesn't support "deny" modes, for
> example.
>
>
> > - it breaks even SIMPLE things like Microsoft Excel
>
> If you need SMB semantics, why not use Samba?
>
> The upshot appears to be that this usage is a stack of
> mismatched storage protocols that work around a bunch of
> local IT bureaucracy. I'm trying to be sympathetic, but
> it's hard to say that /anyone/ would fully support this.
>
>
> > Of course the happy echo "hello Linux-NFSv4-only world" >/nfs/file
> > will always work.
> >
> >>> Of course no one needs the gov customers, so feel free to break locking.
> >>
> >>
> >> Please have a look at the patch description again: lock
> >> recovery does not work now, and cannot work without
> >> changes to the protocol. Isn't that a problem for such
> >> workloads?
> >
> > Nope, because of UPS (Uninterruptible power supply). Either everything
> > is UP, or *everything* is DOWN. Boolean.
>
> Power outages are not the only reason lock recovery might
> be necessary. Network partitions, re-export server
> upgrades or reboots, etc. So I'm not hearing anythying
> to suggest this kind of workload is not impacted by
> the current lock recovery problems.
>
>
> >> In other words, locking is already broken on NFSv4 re-export,
> >> but the current situation can lead to silent data corruption.
> >
> > Would storing the locking information into persistent files help, ie.
> > files which persist across nfsd server restarts?
>
> Yes, but it would make things horribly slow.
>
> And of course there would be a lot of coding involved
> to get this to work.
I suspect this suggestion might be a fair amount of code too
(and I am certainly not volunteering to write it), but I will mention it.

Another possibility would be to have the re-exporting NFSv4 server
just pass locking ops through to the backend NFSv4 server.
- It is roughly the inverse of what I did when I constructed a flex files
  pNFS server. The MDS did the locking ops and any I/O ops were
  passed through to the DS(s). Of course, it was hoped the client
  would use layouts and bypass the MDS for I/O.

rick

>
> What if we added an export option to allow the re-export
> server to continue handling locking, but default it to
> off (which is the safer option) ?
>
> --
> Chuck Lever
>
>
Jeff Layton Oct. 31, 2024, 11:43 a.m. UTC | #10
On Wed, 2024-10-30 at 15:48 -0700, Rick Macklem wrote:
> On Wed, Oct 30, 2024 at 10:08 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
> > 
> > 
> > 
> > 
> > 
> > > On Oct 30, 2024, at 12:37 PM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> > > 
> > > On Wed, 30 Oct 2024 at 17:15, Chuck Lever III <chuck.lever@oracle.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > On Oct 30, 2024, at 10:55 AM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> > > > > 
> > > > > On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@oracle.com> wrote:
> > > > > > 
> > > > > > > On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
> > > > > > > 
> > > > > > > Honestly, I don't know the usecase for re-exporting another server's
> > > > > > > NFS export in the first place. Is this someone trying to share NFS
> > > > > > > through a firewall? I've seen people share remote NFS exports via
> > > > > > > Samba in an attempt to avoid paying their NAS vendor for SMB support.
> > > > > > > (I think it's "standard equipment" now, but 10+ years ago? Not
> > > > > > > always...) But re-exporting another server's NFS exports? Haven't seen
> > > > > > > anyone do that in a while.
> > > > > > 
> > > > > > The "re-export" case is where there is a central repository
> > > > > > of data and branch offices that access that via a WAN. The
> > > > > > re-export servers cache some of that data locally so that
> > > > > > local clients have a fast persistent cache nearby.
> > > > > > 
> > > > > > This is also effective in cases where a small cluster of
> > > > > > clients want fast access to a pile of data that is
> > > > > > significantly larger than their own caches. Say, HPC or
> > > > > > animation, where the small cluster is working on a small
> > > > > > portion of the full data set, which is stored on a central
> > > > > > server.
> > > > > > 
> > > > > Another use case is "isolation", IT shares a filesystem to your
> > > > > department, and you need to re-export only a subset to another
> > > > > department or homeoffice. Part of such a scenario might also be policy
> > > > > related, e.g. IT shares you the full filesystem but will do NOTHING
> > > > > else, and any further compartmentalization must be done in your own
> > > > > department.
> > > > > This is the typical use case for gov NFS re-export.
> > > > 
> > > > It's not clear to me from this description why re-export is
> > > > the right tool for this job. Please explain why ACLs are not
> > > > used in this case -- this is exactly what they are designed
> > > > to do.
> > > 
> > > 1. IT departments want better/harder/immutable isolation than ACLs
> > 
> > So you want MAC, and the storage administrator won't set
> > that up for you on the NFS server. NFS doesn't do MAC
> > very well if at all.
> > 
> > 
> > > 2. Linux NFSv4 only implements POSIX draft ACLs, not full Windows or
> > > NFSv4 ACLs. So there is no proper way to prevent ACL editing,
> > > rendering them useless in this case.
> > 
> > Er. Linux NFSv4 stores the ACLs as POSIX draft, because
> > that's what Linux file systems can support. NFSD, via
> > NFSv4, makes these appear like NFSv4 ACLs.
> > 
> > But I think I understand.
> > 
> > 
> > > There is a reason why POSIX draft ACls were abandoned - they are not
> > > fine-granted enough for real world usage outside the Linux universe.
> > > As soon as interoperability is required these things just bite you
> > > HARD.
> > 
> > You, of course, have the ability to run some other NFS
> > server implementation that meets your security requirements
> > more fully.
> > 
> > 
> > > Also, just running more nfsd in parallel on the origin NFS server is
> > > not a better option - remember the debate of non-2049 ports for nfsd?
> > 
> > I'm not sure where this is going. Do you mean the storage
> > administrator would provide NFS service on alternate
> > ports that each expose a separate set of exports?
> > 
> > So the only option Linux has there is using containers or
> > libvirt. We've continued to privately discuss the ability
> > for NFSD to support a separate set of exports on alternate
> > ports, but it doesn't look feasible. The export management
> > infrastructure and user space tools would need to be
> > rewritten.
> > 
> > 
> > > > And again, clients of the re-export server need to mount it
> > > > with local_lock. Apps can still use locking in that case,
> > > > but the locks are not visible to apps on other clients. Your
> > > > description does not explain why local_lock is not
> > > > sufficient or feasible.
> > > 
> > > Because:
> > > - it breaks applications running on more than one machine?
> > 
> > Yes, obviously. Your description needs to mention that is
> > a requirement, since there are a lot of applications that
> > don't need locking across multiple clients.
> > 
> > 
> > > - it breaks use cases like NFS--->SMB bridges, because without locking
> > > the typical Windows .NET application will refuse to write to a file
> > 
> > That's a quagmire, and I don't think we can guarantee that
> > will work. Linux NFS doesn't support "deny" modes, for
> > example.
> > 
> > 
> > > - it breaks even SIMPLE things like Microsoft Excel
> > 
> > If you need SMB semantics, why not use Samba?
> > 
> > The upshot appears to be that this usage is a stack of
> > mismatched storage protocols that work around a bunch of
> > local IT bureaucracy. I'm trying to be sympathetic, but
> > it's hard to say that /anyone/ would fully support this.
> > 
> > 
> > > Of course the happy echo "hello Linux-NFSv4-only world" >/nfs/file
> > > will always work.
> > > 
> > > > > Of course no one needs the gov customers, so feel free to break locking.
> > > > 
> > > > 
> > > > Please have a look at the patch description again: lock
> > > > recovery does not work now, and cannot work without
> > > > changes to the protocol. Isn't that a problem for such
> > > > workloads?
> > > 
> > > Nope, because of UPS (Uninterruptible power supply). Either everything
> > > is UP, or *everything* is DOWN. Boolean.
> > 
> > Power outages are not the only reason lock recovery might
> > be necessary. Network partitions, re-export server
> > upgrades or reboots, etc. So I'm not hearing anythying
> > to suggest this kind of workload is not impacted by
> > the current lock recovery problems.
> > 
> > 
> > > > In other words, locking is already broken on NFSv4 re-export,
> > > > but the current situation can lead to silent data corruption.
> > > 
> > > Would storing the locking information into persistent files help, ie.
> > > files which persist across nfsd server restarts?
> > 
> > Yes, but it would make things horribly slow.
> > 
> > And of course there would be a lot of coding involved
> > to get this to work.
> I suspect this suggestion might be a fair amount of code too
> (and I am certainly not volunteering to write it), but I will mention it.
> 
> Another possibility would be to have the re-exporting NFSv4 server
> just pass locking ops through to the backend NFSv4 server.
> - It is roughly the inverse of what I did when I constructed a flex files
>   pNFS server. The MDS did the locking ops and any I/O ops. were
>   passed through to the DS(s). Of course, it was hoped the client
>   would use layouts and bypass the MDS for I/O.
> 

How do you handle reclaim in this case? IOW, suppose the backend server
crashes but the reexporter stays up. How do you coordinate the grace
periods between the two so that the client can reclaim its lock on the
backend?

> 
> > 
> > What if we added an export option to allow the re-export
> > server to continue handling locking, but default it to
> > off (which is the safer option) ?
> > 
> > --
> > Chuck Lever
> > 
> > 
>
Rick Macklem Oct. 31, 2024, 2:48 p.m. UTC | #11
On Thu, Oct 31, 2024 at 4:43 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Wed, 2024-10-30 at 15:48 -0700, Rick Macklem wrote:
> > On Wed, Oct 30, 2024 at 10:08 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
> > >
> > >
> > >
> > >
> > >
> > > > On Oct 30, 2024, at 12:37 PM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> > > >
> > > > On Wed, 30 Oct 2024 at 17:15, Chuck Lever III <chuck.lever@oracle.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > On Oct 30, 2024, at 10:55 AM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@oracle.com> wrote:
> > > > > > >
> > > > > > > > On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
> > > > > > > >
> > > > > > > > Honestly, I don't know the usecase for re-exporting another server's
> > > > > > > > NFS export in the first place. Is this someone trying to share NFS
> > > > > > > > through a firewall? I've seen people share remote NFS exports via
> > > > > > > > Samba in an attempt to avoid paying their NAS vendor for SMB support.
> > > > > > > > (I think it's "standard equipment" now, but 10+ years ago? Not
> > > > > > > > always...) But re-exporting another server's NFS exports? Haven't seen
> > > > > > > > anyone do that in a while.
> > > > > > >
> > > > > > > The "re-export" case is where there is a central repository
> > > > > > > of data and branch offices that access that via a WAN. The
> > > > > > > re-export servers cache some of that data locally so that
> > > > > > > local clients have a fast persistent cache nearby.
> > > > > > >
> > > > > > > This is also effective in cases where a small cluster of
> > > > > > > clients want fast access to a pile of data that is
> > > > > > > significantly larger than their own caches. Say, HPC or
> > > > > > > animation, where the small cluster is working on a small
> > > > > > > portion of the full data set, which is stored on a central
> > > > > > > server.
> > > > > > >
> > > > > > Another use case is "isolation", IT shares a filesystem to your
> > > > > > department, and you need to re-export only a subset to another
> > > > > > department or homeoffice. Part of such a scenario might also be policy
> > > > > > related, e.g. IT shares you the full filesystem but will do NOTHING
> > > > > > else, and any further compartmentalization must be done in your own
> > > > > > department.
> > > > > > This is the typical use case for gov NFS re-export.
> > > > >
> > > > > It's not clear to me from this description why re-export is
> > > > > the right tool for this job. Please explain why ACLs are not
> > > > > used in this case -- this is exactly what they are designed
> > > > > to do.
> > > >
> > > > 1. IT departments want better/harder/immutable isolation than ACLs
> > >
> > > So you want MAC, and the storage administrator won't set
> > > that up for you on the NFS server. NFS doesn't do MAC
> > > very well if at all.
> > >
> > >
> > > > 2. Linux NFSv4 only implements POSIX draft ACLs, not full Windows or
> > > > NFSv4 ACLs. So there is no proper way to prevent ACL editing,
> > > > rendering them useless in this case.
> > >
> > > Er. Linux NFSv4 stores the ACLs as POSIX draft, because
> > > that's what Linux file systems can support. NFSD, via
> > > NFSv4, makes these appear like NFSv4 ACLs.
> > >
> > > But I think I understand.
> > >
> > >
> > > > There is a reason why POSIX draft ACls were abandoned - they are not
> > > > fine-granted enough for real world usage outside the Linux universe.
> > > > As soon as interoperability is required these things just bite you
> > > > HARD.
> > >
> > > You, of course, have the ability to run some other NFS
> > > server implementation that meets your security requirements
> > > more fully.
> > >
> > >
> > > > Also, just running more nfsd in parallel on the origin NFS server is
> > > > not a better option - remember the debate of non-2049 ports for nfsd?
> > >
> > > I'm not sure where this is going. Do you mean the storage
> > > administrator would provide NFS service on alternate
> > > ports that each expose a separate set of exports?
> > >
> > > So the only option Linux has there is using containers or
> > > libvirt. We've continued to privately discuss the ability
> > > for NFSD to support a separate set of exports on alternate
> > > ports, but it doesn't look feasible. The export management
> > > infrastructure and user space tools would need to be
> > > rewritten.
> > >
> > >
> > > > > And again, clients of the re-export server need to mount it
> > > > > with local_lock. Apps can still use locking in that case,
> > > > > but the locks are not visible to apps on other clients. Your
> > > > > description does not explain why local_lock is not
> > > > > sufficient or feasible.
> > > >
> > > > Because:
> > > > - it breaks applications running on more than one machine?
> > >
> > > Yes, obviously. Your description needs to mention that is
> > > a requirement, since there are a lot of applications that
> > > don't need locking across multiple clients.
> > >
> > >
> > > > - it breaks use cases like NFS--->SMB bridges, because without locking
> > > > the typical Windows .NET application will refuse to write to a file
> > >
> > > That's a quagmire, and I don't think we can guarantee that
> > > will work. Linux NFS doesn't support "deny" modes, for
> > > example.
> > >
> > >
> > > > - it breaks even SIMPLE things like Microsoft Excel
> > >
> > > If you need SMB semantics, why not use Samba?
> > >
> > > The upshot appears to be that this usage is a stack of
> > > mismatched storage protocols that work around a bunch of
> > > local IT bureaucracy. I'm trying to be sympathetic, but
> > > it's hard to say that /anyone/ would fully support this.
> > >
> > >
> > > > Of course the happy echo "hello Linux-NFSv4-only world" >/nfs/file
> > > > will always work.
> > > >
> > > > > > Of course no one needs the gov customers, so feel free to break locking.
> > > > >
> > > > >
> > > > > Please have a look at the patch description again: lock
> > > > > recovery does not work now, and cannot work without
> > > > > changes to the protocol. Isn't that a problem for such
> > > > > workloads?
> > > >
> > > > Nope, because of UPS (Uninterruptible power supply). Either everything
> > > > is UP, or *everything* is DOWN. Boolean.
> > >
> > > Power outages are not the only reason lock recovery might
> > > be necessary. Network partitions, re-export server
> > > upgrades or reboots, etc. So I'm not hearing anythying
> > > to suggest this kind of workload is not impacted by
> > > the current lock recovery problems.
> > >
> > >
> > > > > In other words, locking is already broken on NFSv4 re-export,
> > > > > but the current situation can lead to silent data corruption.
> > > >
> > > > Would storing the locking information into persistent files help, ie.
> > > > files which persist across nfsd server restarts?
> > >
> > > Yes, but it would make things horribly slow.
> > >
> > > And of course there would be a lot of coding involved
> > > to get this to work.
> > I suspect this suggestion might be a fair amount of code too
> > (and I am certainly not volunteering to write it), but I will mention it.
> >
> > Another possibility would be to have the re-exporting NFSv4 server
> > just pass locking ops through to the backend NFSv4 server.
> > - It is roughly the inverse of what I did when I constructed a flex files
> >   pNFS server. The MDS did the locking ops and any I/O ops. were
> >   passed through to the DS(s). Of course, it was hoped the client
> >   would use layouts and bypass the MDS for I/O.
> >
>
> How do you handle reclaim in this case? IOW, suppose the backend server
> crashes but the reexporter stays up. How do you coordinate the grace
> periods between the two so that the client can reclaim its lock on the
> backend?
Well, I'm not saying it is trivial.
I think you would need to pass through all state operations:
ExchangeID, Open,...,Lock,LockU
- The tricky bit would be sessions, since the re-exporter would need to
   maintain sessions.
   --> Maybe the re-exporter would need to save the ClientID (from the
         backend nfsd) in non-volatile storage.

When the backend server crashes/reboots, the re-exporter would see
this as a failure (usually NFS4ERR_BAD_SESSION) and would pass
that to the client.
The only recovery RPC that would not be passed through would be
Create_session, although the re-exporter would do a Create_session
for connection(s) it has against the backend server.
I think something like that would work for the backend crash/recovery.

A crash of the re-exporter could be more of a problem, I think.
It would need to have the ClientID (stored in non-volatile storage)
so that it could do a Create_session with it against the backend server.
- It would also depend on the backend server being courteous, so that
  a re-exporter crash/reboot that takes a while such that the lease expires
  doesn't result in a loss of state on the backend server.

Anyhow, something like that.
Like I said, I'm not volunteering to code it, rick

>
> >
> > >
> > > What if we added an export option to allow the re-export
> > > server to continue handling locking, but default it to
> > > off (which is the safer option) ?
> > >
> > > --
> > > Chuck Lever
> > >
> > >
> >
>
> --
> Jeff Layton <jlayton@kernel.org>
Chuck Lever III Oct. 31, 2024, 3:01 p.m. UTC | #12
> On Oct 31, 2024, at 10:48 AM, Rick Macklem <rick.macklem@gmail.com> wrote:
> 
> On Thu, Oct 31, 2024 at 4:43 AM Jeff Layton <jlayton@kernel.org> wrote:
>> 
>> On Wed, 2024-10-30 at 15:48 -0700, Rick Macklem wrote:
>>> On Wed, Oct 30, 2024 at 10:08 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Oct 30, 2024, at 12:37 PM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>>>> 
>>>>> On Wed, 30 Oct 2024 at 17:15, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Oct 30, 2024, at 10:55 AM, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>>>>>> 
>>>>>>> On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>>>>> 
>>>>>>>>> On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@hcl-software.com> wrote:
>>>>>>>>> 
>>>>>>>>> Honestly, I don't know the usecase for re-exporting another server's
>>>>>>>>> NFS export in the first place. Is this someone trying to share NFS
>>>>>>>>> through a firewall? I've seen people share remote NFS exports via
>>>>>>>>> Samba in an attempt to avoid paying their NAS vendor for SMB support.
>>>>>>>>> (I think it's "standard equipment" now, but 10+ years ago? Not
>>>>>>>>> always...) But re-exporting another server's NFS exports? Haven't seen
>>>>>>>>> anyone do that in a while.
>>>>>>>> 
>>>>>>>> The "re-export" case is where there is a central repository
>>>>>>>> of data and branch offices that access that via a WAN. The
>>>>>>>> re-export servers cache some of that data locally so that
>>>>>>>> local clients have a fast persistent cache nearby.
>>>>>>>> 
>>>>>>>> This is also effective in cases where a small cluster of
>>>>>>>> clients want fast access to a pile of data that is
>>>>>>>> significantly larger than their own caches. Say, HPC or
>>>>>>>> animation, where the small cluster is working on a small
>>>>>>>> portion of the full data set, which is stored on a central
>>>>>>>> server.
>>>>>>>> 
>>>>>>> Another use case is "isolation", IT shares a filesystem to your
>>>>>>> department, and you need to re-export only a subset to another
>>>>>>> department or homeoffice. Part of such a scenario might also be policy
>>>>>>> related, e.g. IT shares you the full filesystem but will do NOTHING
>>>>>>> else, and any further compartmentalization must be done in your own
>>>>>>> department.
>>>>>>> This is the typical use case for gov NFS re-export.
>>>>>> 
>>>>>> It's not clear to me from this description why re-export is
>>>>>> the right tool for this job. Please explain why ACLs are not
>>>>>> used in this case -- this is exactly what they are designed
>>>>>> to do.
>>>>> 
>>>>> 1. IT departments want better/harder/immutable isolation than ACLs
>>>> 
>>>> So you want MAC, and the storage administrator won't set
>>>> that up for you on the NFS server. NFS doesn't do MAC
>>>> very well if at all.
>>>> 
>>>> 
>>>>> 2. Linux NFSv4 only implements POSIX draft ACLs, not full Windows or
>>>>> NFSv4 ACLs. So there is no proper way to prevent ACL editing,
>>>>> rendering them useless in this case.
>>>> 
>>>> Er. Linux NFSv4 stores the ACLs as POSIX draft, because
>>>> that's what Linux file systems can support. NFSD, via
>>>> NFSv4, makes these appear like NFSv4 ACLs.
>>>> 
>>>> But I think I understand.
>>>> 
>>>> 
>>>>> There is a reason why POSIX draft ACls were abandoned - they are not
>>>>> fine-granted enough for real world usage outside the Linux universe.
>>>>> As soon as interoperability is required these things just bite you
>>>>> HARD.
>>>> 
>>>> You, of course, have the ability to run some other NFS
>>>> server implementation that meets your security requirements
>>>> more fully.
>>>> 
>>>> 
>>>>> Also, just running more nfsd in parallel on the origin NFS server is
>>>>> not a better option - remember the debate of non-2049 ports for nfsd?
>>>> 
>>>> I'm not sure where this is going. Do you mean the storage
>>>> administrator would provide NFS service on alternate
>>>> ports that each expose a separate set of exports?
>>>> 
>>>> So the only option Linux has there is using containers or
>>>> libvirt. We've continued to privately discuss the ability
>>>> for NFSD to support a separate set of exports on alternate
>>>> ports, but it doesn't look feasible. The export management
>>>> infrastructure and user space tools would need to be
>>>> rewritten.
>>>> 
>>>> 
>>>>>> And again, clients of the re-export server need to mount it
>>>>>> with local_lock. Apps can still use locking in that case,
>>>>>> but the locks are not visible to apps on other clients. Your
>>>>>> description does not explain why local_lock is not
>>>>>> sufficient or feasible.
>>>>> 
>>>>> Because:
>>>>> - it breaks applications running on more than one machine?
>>>> 
>>>> Yes, obviously. Your description needs to mention that is
>>>> a requirement, since there are a lot of applications that
>>>> don't need locking across multiple clients.
>>>> 
>>>> 
>>>>> - it breaks use cases like NFS--->SMB bridges, because without locking
>>>>> the typical Windows .NET application will refuse to write to a file
>>>> 
>>>> That's a quagmire, and I don't think we can guarantee that
>>>> will work. Linux NFS doesn't support "deny" modes, for
>>>> example.
>>>> 
>>>> 
>>>>> - it breaks even SIMPLE things like Microsoft Excel
>>>> 
>>>> If you need SMB semantics, why not use Samba?
>>>> 
>>>> The upshot appears to be that this usage is a stack of
>>>> mismatched storage protocols that work around a bunch of
>>>> local IT bureaucracy. I'm trying to be sympathetic, but
>>>> it's hard to say that /anyone/ would fully support this.
>>>> 
>>>> 
>>>>> Of course the happy echo "hello Linux-NFSv4-only world" >/nfs/file
>>>>> will always work.
>>>>> 
>>>>>>> Of course no one needs the gov customers, so feel free to break locking.
>>>>>> 
>>>>>> 
>>>>>> Please have a look at the patch description again: lock
>>>>>> recovery does not work now, and cannot work without
>>>>>> changes to the protocol. Isn't that a problem for such
>>>>>> workloads?
>>>>> 
>>>>> Nope, because of UPS (Uninterruptible power supply). Either everything
>>>>> is UP, or *everything* is DOWN. Boolean.
>>>> 
>>>> Power outages are not the only reason lock recovery might
>>>> be necessary. Network partitions, re-export server
>>>> upgrades or reboots, etc. So I'm not hearing anything
>>>> to suggest this kind of workload is not impacted by
>>>> the current lock recovery problems.
>>>> 
>>>> 
>>>>>> In other words, locking is already broken on NFSv4 re-export,
>>>>>> but the current situation can lead to silent data corruption.
>>>>> 
>>>>> Would storing the locking information in persistent files help, i.e.,
>>>>> files which persist across nfsd server restarts?
>>>> 
>>>> Yes, but it would make things horribly slow.
>>>> 
>>>> And of course there would be a lot of coding involved
>>>> to get this to work.
>>> I suspect this suggestion might be a fair amount of code too
>>> (and I am certainly not volunteering to write it), but I will mention it.
>>> 
>>> Another possibility would be to have the re-exporting NFSv4 server
>>> just pass locking ops through to the backend NFSv4 server.
>>> - It is roughly the inverse of what I did when I constructed a flex files
>>>  pNFS server. The MDS did the locking ops and any I/O ops were
>>>  passed through to the DS(s). Of course, it was hoped the client
>>>  would use layouts and bypass the MDS for I/O.
>>> 
>> 
>> How do you handle reclaim in this case? IOW, suppose the backend server
>> crashes but the reexporter stays up. How do you coordinate the grace
>> periods between the two so that the client can reclaim its lock on the
>> backend?
> Well, I'm not saying it is trivial.
> I think you would need to pass through all state operations:
> ExchangeID, Open,...,Lock,LockU
> - The tricky bit would be sessions, since the re-exporter would need to
>   maintain sessions.
>   --> Maybe the re-exporter would need to save the ClientID (from the
>         backend nfsd) in non-volatile storage.
> 
> When the backend server crashes/reboots, the re-exporter would see
> this as a failure (usually NFS4ERR_BAD_SESSION) and would pass
> that to the client.
> The only recovery RPC that would not be passed through would be
> Create_session, although the re-exporter would do a Create_session
> for connection(s) it has against the backend server.
> I think something like that would work for the backend crash/recovery.

The backend server would be in grace, and the re-exporter
would be able to recover its lock state on the backend
server using normal state recovery. I think the re-
exporter would not need to expose the backend server's
crash to its own clients.


> A crash of the re-exporter could be more of a problem, I think.
> It would need to have the ClientID (stored in non-volatile storage)
> so that it could do a Create_session with it against the backend server.
> - It would also depend on the backend server being courteous, so that
> >  a re-exporter crash/reboot that takes a while such that the lease expires
>  doesn't result in a loss of state on the backend server.

The backend server would not be in grace after the re-export
server crashes. There's no way for the re-export server's
NFS client to recover its lock state from the backend server.

The re-export server recovers by re-learning lock state from
its own clients. The question is how the re-export server
could re-initialize this state in its local client of the
backend server.


--
Chuck Lever
Chuck Lever III Oct. 31, 2024, 3:14 p.m. UTC | #13
On Wed, Oct 23, 2024 at 11:58:46AM -0400, Mike Snitzer wrote:
> We do not and cannot support file locking with NFS reexport over
> NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport
> server reboot cannot allow clients to recover locks because the source
> NFS server has not rebooted, and so it is not in grace.  Since the
> source NFS server is not in grace, it cannot offer any guarantees that
> the file won't have been changed between the locks getting lost and
> any attempt to recover/reclaim them.  The same applies to delegations
> and any associated locks, so disallow them too.
> 
> Add EXPORT_OP_NOLOCKSUPPORT and exportfs_lock_op_is_unsupported(), set
> EXPORT_OP_NOLOCKSUPPORT in nfs_export_ops and check for it in
> nfsd4_lock(), nfsd4_locku() and nfs4_set_delegation().  Clients are
> not allowed to get file locks or delegations from a reexport server,
> any attempts will fail with operation not supported.
> 
> Update the "Reboot recovery" section accordingly in
> Documentation/filesystems/nfs/reexport.rst
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  Documentation/filesystems/nfs/reexport.rst | 10 +++++++---
>  fs/nfs/export.c                            |  3 ++-
>  fs/nfsd/nfs4state.c                        | 20 ++++++++++++++++++++
>  include/linux/exportfs.h                   | 14 ++++++++++++++
>  4 files changed, 43 insertions(+), 4 deletions(-)
> 
> v3: refine the patch header and reexport.rst to be clear that both
>     locks and delegations will fail against an NFS reexport server.
> 
> diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
> index ff9ae4a46530..044be965d75e 100644
> --- a/Documentation/filesystems/nfs/reexport.rst
> +++ b/Documentation/filesystems/nfs/reexport.rst
> @@ -26,9 +26,13 @@ Reboot recovery
>  ---------------
>  
>  The NFS protocol's normal reboot recovery mechanisms don't work for the
> -case when the reexport server reboots.  Clients will lose any locks
> -they held before the reboot, and further IO will result in errors.
> -Closing and reopening files should clear the errors.
> +case when the reexport server reboots because the source server has not
> +rebooted, and so it is not in grace.  Since the source server is not in
> +grace, it cannot offer any guarantees that the file won't have been
> +changed between the locks getting lost and any attempt to recover them.
> +The same applies to delegations and any associated locks.  Clients are
> +not allowed to get file locks or delegations from a reexport server, any
> +attempts will fail with operation not supported.
>  
>  Filehandle limits
>  -----------------
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index be686b8e0c54..2f001a0273bc 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -154,5 +154,6 @@ const struct export_operations nfs_export_ops = {
>  		 EXPORT_OP_CLOSE_BEFORE_UNLINK	|
>  		 EXPORT_OP_REMOTE_FS		|
>  		 EXPORT_OP_NOATOMIC_ATTR	|
> -		 EXPORT_OP_FLUSH_ON_CLOSE,
> +		 EXPORT_OP_FLUSH_ON_CLOSE	|
> +		 EXPORT_OP_NOLOCKSUPPORT,
>  };
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index ac1859c7cc9d..63297ea82e4e 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -5813,6 +5813,15 @@ nfs4_set_delegation(struct nfsd4_open *open, struct nfs4_ol_stateid *stp,
>  	if (!nf)
>  		return ERR_PTR(-EAGAIN);
>  
> +	/*
> +	 * File delegations and associated locks cannot be recovered if
> +	 * export is from NFS proxy server.
> +	 */
> +	if (exportfs_lock_op_is_unsupported(nf->nf_file->f_path.mnt->mnt_sb->s_export_op)) {
> +		nfsd_file_put(nf);
> +		return ERR_PTR(-EOPNOTSUPP);
> +	}
> +
>  	spin_lock(&state_lock);
>  	spin_lock(&fp->fi_lock);
>  	if (nfs4_delegation_exists(clp, fp))
> @@ -7917,6 +7926,11 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	}
>  	sb = cstate->current_fh.fh_dentry->d_sb;
>  
> +	if (exportfs_lock_op_is_unsupported(sb->s_export_op)) {
> +		status = nfserr_notsupp;
> +		goto out;
> +	}
> +
>  	if (lock->lk_is_new) {
>  		if (nfsd4_has_session(cstate))
>  			/* See rfc 5661 18.10.3: given clientid is ignored: */
> @@ -8266,6 +8280,12 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  		status = nfserr_lock_range;
>  		goto put_stateid;
>  	}
> +
> +	if (exportfs_lock_op_is_unsupported(nf->nf_file->f_path.mnt->mnt_sb->s_export_op)) {
> +		status = nfserr_notsupp;
> +		goto put_file;
> +	}
> +
>  	file_lock = locks_alloc_lock();
>  	if (!file_lock) {
>  		dprintk("NFSD: %s: unable to allocate lock!\n", __func__);
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 893a1d21dc1c..106fd590d323 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -247,6 +247,7 @@ struct export_operations {
>  						*/
>  #define EXPORT_OP_FLUSH_ON_CLOSE	(0x20) /* fs flushes file data on close */
>  #define EXPORT_OP_ASYNC_LOCK		(0x40) /* fs can do async lock request */
> +#define EXPORT_OP_NOLOCKSUPPORT		(0x80) /* no file locking support */
>  	unsigned long	flags;
>  };
>  
> @@ -263,6 +264,19 @@ exportfs_lock_op_is_async(const struct export_operations *export_ops)
>  	return export_ops->flags & EXPORT_OP_ASYNC_LOCK;
>  }
>  
> +/**
> + * exportfs_lock_op_is_unsupported() - export does not support file locking
> + * @export_ops:	the nfs export operations to check
> + *
> + * Returns true if the nfs export_operations structure has
> + * EXPORT_OP_NOLOCKSUPPORT set in its flags
> + */
> +static inline bool
> +exportfs_lock_op_is_unsupported(const struct export_operations *export_ops)
> +{
> +	return export_ops->flags & EXPORT_OP_NOLOCKSUPPORT;
> +}
> +
>  extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
>  				    int *max_len, struct inode *parent,
>  				    int flags);
> -- 
> 2.44.0
> 

There seems to be some controversy about this approach.

Also I think it would be nicer all around if we followed the usual
process for changes that introduce possible behavior regressions:

 - add the new behavior, make it optional, default old behavior
 - wait a few releases
 - change the default to new behavior

Lastly, there haven't been any user complaints about the current
situation of no lock recovery in the re-export case.

Jeff and I discussed this, and we plan to drop this one for 6.13 but
let the conversation continue. Mike, no action needed on your part
for the moment, but please stay tuned!

IMO having an export option (along the lines of "async/sync") that
is documented in a man page is going to be a better plan. But if we
find a way to deal with this situation without a new administrative
control, that would be even better.
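
For illustration only, such a per-export control might look something
like this in /etc/exports (the "nolocks" option name and the path are
hypothetical; no such exports(5) option exists today):

  # hypothetical: refuse NFSv4 locks and delegations on this re-export
  /srv/reexport  *(rw,sync,crossmnt,nolocks)

Clients of such an export would get "operation not supported" on lock
attempts, as with Mike's patch, but only where the administrator has
opted in.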
Rick Macklem Oct. 31, 2024, 4:02 p.m. UTC | #14
On Thu, Oct 31, 2024 at 8:01 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
> [ ... earlier discussion snipped ... ]
>
> The backend server would be in grace, and the re-exporter
> would be able to recover its lock state on the backend
> server using normal state recovery. I think the re-
> exporter would not need to expose the backend server's
> crash to its own clients.
For what I suggested, the re-exporting server does not hold any state,
except for sessions. (It essentially becomes like an NFSv3 stateless
server.) It would expose the backend server's crash/reboot to the
client, which would do the recovery.
(By pass through I mean "just repackage the arguments and do
the operation against the backend server" instead of doing the
operations in the re-exporter server. For example, it would have
a separate "struct nfsd4_operations" array with different functions
for open, lock, ...)

Sessions are the weird case and the re-exporter would have to
maintain session(s) for the client. On backend server reboot, the
re-exporter would see NFS4ERR_BAD_SESSION. It would then
nuke the session(s) for the client, so that it sees NFS4ERR_BAD_SESSION
as well and starts state recovery. Those state recovery ops would
be passed through to the backend server.

When the re-exporter reboots, it only needs to recover sessions.
It would reply NFS4ERR_BAD_SESSION to the client.
The client would do a Create_session using the backend server's
clientID. At that point, the re-exporter would know the clientID, which
it could use to Create_session against the backend server and then
it could create the session for the client side, assuming the Create_session
on the backend server worked ok.

rick

>
>
> > A crash of the re-exporter could be more of a problem, I think.
> > It would need to have the ClientID (stored in non-volatile storage)
> > so that it could do a Create_session with it against the backend server.
> > - It would also depend on the backend server being courteous, so that
> >  a re-exporter crash/reboot that takes a while such that the lease expires
> >  doesn't result in a loss of state on the backend server.
>
> The backend server would not be in grace after the re-export
> server crashes. There's no way for the re-export server's
> NFS client to recover its lock state from the backend server.
>
> The re-export server recovers by re-learning lock state from
> its own clients. The question is how the re-export server
> could re-initialize this state in its local client of the
> backend server.
>
>
> --
> Chuck Lever
>
>
Chuck Lever III Nov. 18, 2024, 6:57 p.m. UTC | #15
On Thu, Oct 31, 2024 at 11:14:51AM -0400, Chuck Lever wrote:
> On Wed, Oct 23, 2024 at 11:58:46AM -0400, Mike Snitzer wrote:
> > We do not and cannot support file locking with NFS reexport over
> > NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport

 [ ... patch snipped ... ]

> > diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
> > index ff9ae4a46530..044be965d75e 100644
> > --- a/Documentation/filesystems/nfs/reexport.rst
> > +++ b/Documentation/filesystems/nfs/reexport.rst
> > @@ -26,9 +26,13 @@ Reboot recovery
> >  ---------------
> >  
> >  The NFS protocol's normal reboot recovery mechanisms don't work for the
> > -case when the reexport server reboots.  Clients will lose any locks
> > -they held before the reboot, and further IO will result in errors.
> > -Closing and reopening files should clear the errors.
> > +case when the reexport server reboots because the source server has not
> > +rebooted, and so it is not in grace.  Since the source server is not in
> > +grace, it cannot offer any guarantees that the file won't have been
> > +changed between the locks getting lost and any attempt to recover them.
> > +The same applies to delegations and any associated locks.  Clients are
> > +not allowed to get file locks or delegations from a reexport server, any
> > +attempts will fail with operation not supported.
> >  
> >  Filehandle limits
> >  -----------------

Note for Mike:

Last sentence "Clients are not allowed to get ... delegations from a
reexport server" -- IIUC it's up to the re-export server to not hand
out delegations to its clients. Still, it's important to note that
NFSv4 delegation would not be available for re-exports.

See below for more: I'd like this paragraph to continue to discuss
the issue of OPEN and I/O behavior when the re-export server
restarts. The patch seems to redact that bit of detail.

Following is general discussion:


> There seems to be some controversy about this approach.
> 
> Also I think it would be nicer all around if we followed the usual
> process for changes that introduce possible behavior regressions:
> 
>  - add the new behavior, make it optional, default old behavior
>  - wait a few releases
>  - change the default to new behavior
> 
> Lastly, there haven't been any user complaints about the current
> situation of no lock recovery in the re-export case.
> 
> Jeff and I discussed this, and we plan to drop this one for 6.13 but
> let the conversation continue. Mike, no action needed on your part
> for the moment, but please stay tuned!
> 
> IMO having an export option (along the lines of "async/sync") that
> is documented in a man page is going to be a better plan. But if we
> find a way to deal with this situation without a new administrative
> control, that would be even better.

Proposed solutions so far:

- Disable NFS locking entirely on NFS re-export

- Implement full state pass-through for re-export

Some history of the NFSD design and the re-export issue is provided
here:

  http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export#reboot_recovery

Certain usage scenarios require that lock state be globally visible,
so disabling NFS locking on re-export mounts will need to be
considered carefully.

Assuming that NFSv4 LOCK operations are propagated to the back-end
server in today's NFSD, does it make sense to avoid code changes at
the moment, but more carefully document the configuration options
and their risks?

+++ In all following configurations, no state recovery occurs when
the re-export server restarts, as explained in
Documentation/filesystems/nfs/reexport.rst.

Mount options on the re-export server and clients:

* All default: open and lock state is propagated to the back-end
  server and is visible to all NFS clients.

* local_lock=all on the re-export server's mount of the back-end
  server: clients of that server all see the same set of locks, but
  these locks are not visible to the back-end server or any of its
  clients. Open state is visible everywhere.

* local_lock=all on the clients' NFS mounts of the re-export
  server: applications on NFS clients do not see locks set by
  applications on any other NFS clients. Open state is visible
  everywhere.
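
As a rough sketch of the second and third configurations above (host
names and paths are made up; unrelated mount options are omitted):

  # on the re-export server: locks taken against the back-end stay
  # local to the re-export server
  mount -t nfs -o local_lock=all backend:/export /srv/backend

  # or: on each client of the re-export server, locks stay local to
  # that one client
  mount -t nfs -o local_lock=all reexport:/srv/backend /mnt/data

local_lock is described in nfs(5); "all" covers both flock and POSIX
locks.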

When an NFS client of the re-export server OPENs a file, currently
that creates OPEN state on the re-export server, and I assume also
on the back-end server. That state cannot be recovered if the
re-export server restarts, but it also cannot be blocked by a mount
option.

Likewise, I assume the back-end server can hand out delegations to
the re-export server. If the re-export server restarts, how does it
recover those delegations? The re-export server could disable
delegation by blocking off its callback service, but should it?

What, if anything, is being done to further develop and regularly 
test NFS re-export in upstream kernels?

The reexport.rst file: This still reads more like design notes than
administrative documentation.  IMHO it should instead have a more
detailed description and disclaimer regarding what kind of manual
recovery is needed after a re-export server restart. That seems like
important information for administrators who think they might want
to deploy this solution. Maybe Documentation/ isn't the right place
for administrative documentation?

It might be prudent to (temporarily) label NFS re-export as
experimental use only, given its incompleteness and the long list
of caveats.
Daire Byrne Nov. 19, 2024, 12:37 a.m. UTC | #16
On Mon, 18 Nov 2024 at 18:57, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> [ ... patch and earlier discussion snipped ... ]
>
> What, if anything, is being done to further develop and regularly
> test NFS re-export in upstream kernels?
>
> The reexport.rst file: This still reads more like design notes than
> administrative documentation.  IMHO it should instead have a more
> detailed description and disclaimer regarding what kind of manual
> recovery is needed after a re-export server restart. That seems like
> important information for administrators who think they might want
> to deploy this solution. Maybe Documentation/ isn't the right place
> for administrative documentation?
>
> It might be prudent to (temporarily) label NFS re-export as
> experimental use only, given its incompleteness and the long list
> of caveats.

As someone who uses NFSv3 re-export extensively in production, I can't
comment much on the "correctness" of the current locking, but it is
"good enough" for us (we don't explicitly mount with local locks atm).

The unique thing about our workloads though is that other than maybe
the odd log file or home directory shell history file, a single
process always writes a new unique file and we never overwrite. We
have an asset management DB that determines the file paths to be
written and a batch system to run processes (i.e. a production
pipeline + render farm).

We also really try to avoid having either the origin backend server or
re-export server crash/reboot. But when, once a year or so, something
invariably does go wrong, we are willing to take the hit on broken
mounts, processes or corrupted files (just re-run the batch jobs).

Basically the upsides outweigh the downsides for our specific workloads.

Coupled with FS-Cache and a few TBs of storage, using a re-export
server is a very efficient way to serve files to many clients over a
bandwidth constrained and/or high latency WAN link. In the case of
high latency (e.g. global offices), we even do things like increase
actimeo and disable CTO to reduce repeat metadata round-trips to the
absolute minimum. Again, I think we have a unique workload that allows
for this.
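
To give a flavour of that (values are illustrative rather than a
recommendation, and "fsc" needs cachefilesd running on the re-export
server):

  # the re-export server's mount of the origin server over the WAN
  mount -t nfs -o vers=3,fsc,nocto,actimeo=3600 origin:/export /srv/origin

nocto and actimeo are the close-to-open and attribute-cache knobs
mentioned above; see nfs(5).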

If the locks will eventually be passed through to the backend server,
then I suspect we would still want a way to opt out to reduce WAN
latency overhead at the expense of locking correctness (maybe just
using local locks).

I think others with similar workloads are using it in this way too and
I know Google were maintaining a howto to help customers migrate
workloads to their cloud:

https://github.com/GoogleCloudPlatform/knfsd-cache-utils
https://cloud.google.com/architecture/deploy-nfs-caching-proxy-compute-engine

Although it seems like that specific project has gone a bit quiet of
late. They also helped get the reexport/crossmount fsidd helper merged
into nfs-utils.

I have also heard others say reexports are useful for "converting"
NFSv4 storage to NFSv3 (or vice-versa) for older non-NFSv4 clients or
servers, but I'm not sure how big a thing that is in this day and age.

I guess Netapp's "FlexCache" product is doing a similar thing to
reexporting and seems to lean heavily on NFSv4 and delegations to
achieve that? The latest version can even do write-back caching on
files (get lock first, write back later).

I could probably write a whole (longish) thread about the different
ways we currently use NFS re-exporting and some of the remaining
pitfalls if there is any interest in that...

Daire
diff mbox series

Patch

diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
index ff9ae4a46530..044be965d75e 100644
--- a/Documentation/filesystems/nfs/reexport.rst
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -26,9 +26,13 @@  Reboot recovery
 ---------------
 
 The NFS protocol's normal reboot recovery mechanisms don't work for the
-case when the reexport server reboots.  Clients will lose any locks
-they held before the reboot, and further IO will result in errors.
-Closing and reopening files should clear the errors.
+case when the reexport server reboots because the source server has not
+rebooted, and so it is not in grace.  Since the source server is not in
+grace, it cannot offer any guarantees that the file won't have been
+changed between the locks getting lost and any attempt to recover them.
+The same applies to delegations and any associated locks.  Clients are
+not allowed to get file locks or delegations from a reexport server, any
+attempts will fail with operation not supported.
 
 Filehandle limits
 -----------------
diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index be686b8e0c54..2f001a0273bc 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -154,5 +154,6 @@  const struct export_operations nfs_export_ops = {
 		 EXPORT_OP_CLOSE_BEFORE_UNLINK	|
 		 EXPORT_OP_REMOTE_FS		|
 		 EXPORT_OP_NOATOMIC_ATTR	|
-		 EXPORT_OP_FLUSH_ON_CLOSE,
+		 EXPORT_OP_FLUSH_ON_CLOSE	|
+		 EXPORT_OP_NOLOCKSUPPORT,
 };
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index ac1859c7cc9d..63297ea82e4e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -5813,6 +5813,15 @@  nfs4_set_delegation(struct nfsd4_open *open, struct nfs4_ol_stateid *stp,
 	if (!nf)
 		return ERR_PTR(-EAGAIN);
 
+	/*
+	 * File delegations and associated locks cannot be recovered if
+	 * export is from NFS proxy server.
+	 */
+	if (exportfs_lock_op_is_unsupported(nf->nf_file->f_path.mnt->mnt_sb->s_export_op)) {
+		nfsd_file_put(nf);
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
 	spin_lock(&state_lock);
 	spin_lock(&fp->fi_lock);
 	if (nfs4_delegation_exists(clp, fp))
@@ -7917,6 +7926,11 @@  nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	}
 	sb = cstate->current_fh.fh_dentry->d_sb;
 
+	if (exportfs_lock_op_is_unsupported(sb->s_export_op)) {
+		status = nfserr_notsupp;
+		goto out;
+	}
+
 	if (lock->lk_is_new) {
 		if (nfsd4_has_session(cstate))
 			/* See rfc 5661 18.10.3: given clientid is ignored: */
@@ -8266,6 +8280,12 @@  nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		status = nfserr_lock_range;
 		goto put_stateid;
 	}
+
+	if (exportfs_lock_op_is_unsupported(nf->nf_file->f_path.mnt->mnt_sb->s_export_op)) {
+		status = nfserr_notsupp;
+		goto put_file;
+	}
+
 	file_lock = locks_alloc_lock();
 	if (!file_lock) {
 		dprintk("NFSD: %s: unable to allocate lock!\n", __func__);
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 893a1d21dc1c..106fd590d323 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -247,6 +247,7 @@  struct export_operations {
 						*/
 #define EXPORT_OP_FLUSH_ON_CLOSE	(0x20) /* fs flushes file data on close */
 #define EXPORT_OP_ASYNC_LOCK		(0x40) /* fs can do async lock request */
+#define EXPORT_OP_NOLOCKSUPPORT		(0x80) /* no file locking support */
 	unsigned long	flags;
 };
 
@@ -263,6 +264,19 @@  exportfs_lock_op_is_async(const struct export_operations *export_ops)
 	return export_ops->flags & EXPORT_OP_ASYNC_LOCK;
 }
 
+/**
+ * exportfs_lock_op_is_unsupported() - export does not support file locking
+ * @export_ops:	the nfs export operations to check
+ *
+ * Returns true if the nfs export_operations structure has
+ * EXPORT_OP_NOLOCKSUPPORT set in its flags
+ */
+static inline bool
+exportfs_lock_op_is_unsupported(const struct export_operations *export_ops)
+{
+	return export_ops->flags & EXPORT_OP_NOLOCKSUPPORT;
+}
+
 extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
 				    int *max_len, struct inode *parent,
 				    int flags);