
[v11,00/20] nfs/nfsd: add support for localio

Message ID 20240702162831.91604-1-snitzer@kernel.org

Message

Mike Snitzer July 2, 2024, 4:28 p.m. UTC
Hi,

There seems to be consensus that these changes are worthwhile and have
been extensively iterated on.  I really appreciate the coordinated
development and review to this point.

I'd very much like these changes to land upstream as-is (unless review
teases out some show-stopper).  These changes have been tested fairly
extensively (via xfstests) at this point.

Can we now please provide formal review tags and merge these changes
through the NFS client tree for 6.11?

Changes since v10:
- Now that XFS will _not_ be patched with "xfs: enable WQ_MEM_RECLAIM
  on m_sync_workqueue", reintroduce "nfs/localio: use dedicated workqueues for
  filesystem read and write" (patch 18).  Also fixed it so that it passes
  xfstests generic/355.

FYI:
- I do not intend to rebase this series on top of NeilBrown's partial
  exploration of simplifying away the need for a "fake" svc_rqst
  (noble goals and happy to help those changes land upstream as an
  incremental improvement):
  https://marc.info/?l=linux-nfs&m=171980269529965&w=2

- In addition, tweaks to use nfsd_file_acquire_gc() instead of
  nfsd_file_acquire() aren't a priority.  Each incremental change
  carries with it the potential for regression and we need to lock down
  and stop churning.  Happy to explore as incremental improvement and
  optimization.

All review and comments are welcome!

Thanks,
Mike

My git tree is here:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/

This v11 is available as both branch nfs-localio-for-6.11 (always
tracks latest) and branch nfs-localio-for-6.11.v11

Mike Snitzer (10):
  nfs: factor out {encode,decode}_opaque_fixed to nfs_xdr.h
  nfs_common: add NFS LOCALIO auxiliary protocol enablement
  nfsd: add Kconfig options to allow localio to be enabled
  nfsd: manage netns reference in nfsd_open_local_fh
  nfsd: use percpu_ref to interlock nfsd_destroy_serv and nfsd_open_local_fh
  nfsd: implement server support for NFS_LOCALIO_PROGRAM
  nfs: fix nfs_localio_vfs_getattr() to properly support v4
  SUNRPC: remove call_allocate() BUG_ON if p_arglen=0 to allow RPC with void arg
  nfs: implement client support for NFS_LOCALIO_PROGRAM
  nfs: add Documentation/filesystems/nfs/localio.rst

NeilBrown (1):
  SUNRPC: replace program list with program array

Trond Myklebust (3):
  nfs: enable localio for non-pNFS I/O
  pnfs/flexfiles: enable localio for flexfiles I/O
  nfs/localio: use dedicated workqueues for filesystem read and write

Weston Andros Adamson (6):
  SUNRPC: add rpcauth_map_to_svc_cred_local
  nfsd: add "localio" support
  nfs: pass nfs_client to nfs_initiate_pgio
  nfs: pass descriptor thru nfs_initiate_pgio path
  nfs: pass struct file to nfs_init_pgio and nfs_init_commit
  nfs: add "localio" support

 Documentation/filesystems/nfs/localio.rst | 135 ++++
 fs/Kconfig                                |   3 +
 fs/nfs/Kconfig                            |  14 +
 fs/nfs/Makefile                           |   1 +
 fs/nfs/blocklayout/blocklayout.c          |   6 +-
 fs/nfs/client.c                           |  15 +-
 fs/nfs/filelayout/filelayout.c            |  16 +-
 fs/nfs/flexfilelayout/flexfilelayout.c    | 131 +++-
 fs/nfs/flexfilelayout/flexfilelayout.h    |   2 +
 fs/nfs/flexfilelayout/flexfilelayoutdev.c |   6 +
 fs/nfs/inode.c                            |  61 +-
 fs/nfs/internal.h                         |  61 +-
 fs/nfs/localio.c                          | 891 ++++++++++++++++++++++
 fs/nfs/nfs4xdr.c                          |  13 -
 fs/nfs/nfstrace.h                         |  61 ++
 fs/nfs/pagelist.c                         |  32 +-
 fs/nfs/pnfs.c                             |  24 +-
 fs/nfs/pnfs.h                             |   6 +-
 fs/nfs/pnfs_nfs.c                         |   2 +-
 fs/nfs/write.c                            |  13 +-
 fs/nfs_common/Makefile                    |   3 +
 fs/nfs_common/nfslocalio.c                |  74 ++
 fs/nfsd/Kconfig                           |  14 +
 fs/nfsd/Makefile                          |   1 +
 fs/nfsd/filecache.c                       |   2 +-
 fs/nfsd/localio.c                         | 329 ++++++++
 fs/nfsd/netns.h                           |  12 +-
 fs/nfsd/nfsctl.c                          |   2 +-
 fs/nfsd/nfsd.h                            |   2 +-
 fs/nfsd/nfssvc.c                          | 116 ++-
 fs/nfsd/trace.h                           |   3 +-
 fs/nfsd/vfs.h                             |   9 +
 include/linux/nfs.h                       |   9 +
 include/linux/nfs_fs.h                    |   2 +
 include/linux/nfs_fs_sb.h                 |  10 +
 include/linux/nfs_xdr.h                   |  20 +-
 include/linux/nfslocalio.h                |  42 +
 include/linux/sunrpc/auth.h               |   4 +
 include/linux/sunrpc/svc.h                |   7 +-
 net/sunrpc/auth.c                         |  15 +
 net/sunrpc/clnt.c                         |   1 -
 net/sunrpc/svc.c                          |  68 +-
 net/sunrpc/svc_xprt.c                     |   2 +-
 net/sunrpc/svcauth_unix.c                 |   3 +-
 44 files changed, 2089 insertions(+), 154 deletions(-)
 create mode 100644 Documentation/filesystems/nfs/localio.rst
 create mode 100644 fs/nfs/localio.c
 create mode 100644 fs/nfs_common/nfslocalio.c
 create mode 100644 fs/nfsd/localio.c
 create mode 100644 include/linux/nfslocalio.h

Comments

Chuck Lever July 2, 2024, 6:06 p.m. UTC | #1
> On Jul 2, 2024, at 12:28 PM, Mike Snitzer <snitzer@kernel.org> wrote:
> 
> Hi,
> 
> There seems to be consensus that these changes worthwhile and
> extensively iterated on.

I don't see a public consensus about "extensively iterated
on". The folks you talk to privately might believe that,
though.


> I'd very much like these changes to land upstream as-is (unless review
> teases out some show-stopper).  These changes have been tested fairly
> extensively (via xfstests) at this point.
> 
> Can we now please provide formal review tags and merge these changes
> through the NFS client tree for 6.11?

Contributors don't get to determine the kernel release where
their code lands; maintainers make that decision. You've stated
your preference, and we are trying to accommodate. But frankly,
the (server) changes don't stand up to close inspection yet.

One of the client maintainers has had years to live with this
work. But the server maintainers had their first look at this
just a few weeks ago, and this is not the only thing any of us
have on our plates at the moment. So you need to be patient.


> FYI:
> - I do not intend to rebase this series ontop of NeilBrown's partial
>  exploration of simplifying away the need for a "fake" svc_rqst
>  (noble goals and happy to help those changes land upstream as an
>  incremental improvement):
>  https://marc.info/?l=linux-nfs&m=171980269529965&w=2

Sorry, rebasing is going to be a requirement.

Again, as with the dprintk stuff, this is code that would get
reverted or replaced as soon as we merge. We don't knowingly
merge that kind of code; we fix it first.

To make it official, for v11 of this series:

  Nacked-by: Chuck Lever <chuck.lever@oracle.com>

I'll be much more ready to consider an Acked-by: once the
"fake svc_rqst" code has been replaced.


> - In addition, tweaks to use nfsd_file_acquire_gc() instead of
>  nfsd_file_acquire() aren't a priority.

The discussion has moved well beyond that now... IIUC the
preferred approach might be to hold the file open until the
local app is done with it. However, I'm still not convinced
there's a benefit to using the NFSD file cache vs. a plain
dentry_open().

Neil's clean-up might not need to add a new nfsd_file_acquire()
API if we go with plain dentry_open().

There are still interesting choices to make here before it
is merged, so IMO the choices around nfsd_file_acquire()
remain a priority for merge-readiness.


--
Chuck Lever
Mike Snitzer July 2, 2024, 6:32 p.m. UTC | #2
On Tue, Jul 02, 2024 at 06:06:09PM +0000, Chuck Lever III wrote:
> 
> 
> > On Jul 2, 2024, at 12:28 PM, Mike Snitzer <snitzer@kernel.org> wrote:
> > 
> > Hi,
> > 
> > There seems to be consensus that these changes worthwhile and
> > extensively iterated on.
> 
> I don't see a public consensus about "extensively iterated
> on". The folks you talk to privately might believe that,
> though.
> 
> 
> > I'd very much like these changes to land upstream as-is (unless review
> > teases out some show-stopper).  These changes have been tested fairly
> > extensively (via xfstests) at this point.
> > 
> > Can we now please provide formal review tags and merge these changes
> > through the NFS client tree for 6.11?
> 
> Contributors don't get to determine the kernel release where
> their code lands; maintainers make that decision. You've stated
> your preference, and we are trying to accommodate. But frankly,
> the (server) changes don't stand up to close inspection yet.
> 
> One of the client maintainers has had years to live with this
> work. But the server maintainers had their first look at this
> just a few weeks ago, and this is not the only thing any of us
> have on our plates at the moment. So you need to be patient.
> 
> 
> > FYI:
> > - I do not intend to rebase this series ontop of NeilBrown's partial
> >  exploration of simplifying away the need for a "fake" svc_rqst
> >  (noble goals and happy to help those changes land upstream as an
> >  incremental improvement):
> >  https://marc.info/?l=linux-nfs&m=171980269529965&w=2
> 
> Sorry, rebasing is going to be a requirement.

What?  You're imposing a rebase on completely unfinished and untested
code?  Any idea when Neil will post v2?  Or am I supposed to take his
partial first pass and fix it?

> Again, as with the dprintk stuff, this is code that would get
> reverted or replaced as soon as we merge. We don't knowingly
> merge that kind of code; we fix it first.

Nice rule, except there is merit in tested code landing without it
having to see last minute academic changes.  These aren't dprintk,
these are disruptive changes that aren't fully formed.  If they were
fully formed I wouldn't be resisting them.

> To make it official, for v11 of this series:
> 
>   Nacked-by: Chuck Lever <chuck.lever@oracle.com>

Thanks for that.

> I'll be much more ready to consider an Acked-by: once the
> "fake svc_rqst" code has been replaced.

If Neil completes his work I'll rebase.  But last time I rebased onto
his simplification of the localio protocol (to use an array and not
lists, nice changes, appreciated, but it took serious work on my part
to fold them in): the code immediately hit a trivial BUG_ON() in sunrpc.
So please be considerate of my time and the requirement for code to
actually work.

I'm fine with these changes not landing for 6.11 if warranted.  I just
seriously question the arbitrary nature of what constitutes necessary
change to allow inclusion.

> > - In addition, tweaks to use nfsd_file_acquire_gc() instead of
> >  nfsd_file_acquire() aren't a priority.
> 
> The discussion has moved well beyond that now... IIUC the
> preferred approach might be to hold the file open until the
> local app is done with it. However, I'm still not convinced
> there's a benefit to using the NFSD file cache vs. a plain
> dentry_open().

Saving an nfs_file to the open_context, etc.  All incremental
improvements (that need time to stick the landing).

Why do you think it is appropriate to cause upheaval in code that has
clearly drawn a line in the sand in terms of established fitness?

Eliding allocation of things and micro-optimizing can come later.  But
I guess I'll just have to agree to disagree with this approach.
Really feels like I'll be forced to keep both pieces when it breaks in
the near-term.

By all means layer on new improvements.  But this refusal to establish
a baseline out of fear that we _might_ change it: I don't even know
where to begin with that.

> Neil's clean-up might not need add a new nfsd_file_acquire()
> API if we go with plain dentry_open().
> 
> There are still interesting choices to make here before it
> is merged, so IMO the choices around nfsd_file_acquire()
> remain a priority for merge-readiness.

Maybe Neil will post a fully working v12 rebased on his changes.

Mike
Chuck Lever July 2, 2024, 8:10 p.m. UTC | #3
> On Jul 2, 2024, at 2:32 PM, Mike Snitzer <snitzer@kernel.org> wrote:
> 
> On Tue, Jul 02, 2024 at 06:06:09PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Jul 2, 2024, at 12:28 PM, Mike Snitzer <snitzer@kernel.org> wrote:
>>> 
>>> Hi,
>>> 
>>> There seems to be consensus that these changes worthwhile and
>>> extensively iterated on.
>> 
>> I don't see a public consensus about "extensively iterated
>> on". The folks you talk to privately might believe that,
>> though.
>> 
>> 
>>> I'd very much like these changes to land upstream as-is (unless review
>>> teases out some show-stopper).  These changes have been tested fairly
>>> extensively (via xfstests) at this point.
>>> 
>>> Can we now please provide formal review tags and merge these changes
>>> through the NFS client tree for 6.11?
>> 
>> Contributors don't get to determine the kernel release where
>> their code lands; maintainers make that decision. You've stated
>> your preference, and we are trying to accommodate. But frankly,
>> the (server) changes don't stand up to close inspection yet.
>> 
>> One of the client maintainers has had years to live with this
>> work. But the server maintainers had their first look at this
>> just a few weeks ago, and this is not the only thing any of us
>> have on our plates at the moment. So you need to be patient.
>> 
>> 
>>> FYI:
>>> - I do not intend to rebase this series ontop of NeilBrown's partial
>>> exploration of simplifying away the need for a "fake" svc_rqst
>>> (noble goals and happy to help those changes land upstream as an
>>> incremental improvement):
>>> https://marc.info/?l=linux-nfs&m=171980269529965&w=2
>> 
>> Sorry, rebasing is going to be a requirement.
> 
> What?  You're imposing a rebase on completely unfinished and untested
> code?  Any idea when Neil will post v2?  Or am I supposed to take his
> partial first pass and fix it?

Don't be ridiculous. Wait for Neil to post a working version.


>> Again, as with the dprintk stuff, this is code that would get
>> reverted or replaced as soon as we merge. We don't knowingly
>> merge that kind of code; we fix it first.
> 
> Nice rule, except there is merit in tested code landing without it
> having to see last minute academic changes.  These aren't dprintk,
> these are disruptive changes that aren't fully formed.  If they were
> fully formed I wouldn't be resisting them.

It's your server patch that isn't fully formed. Allocating
a fake svc_rqst outside of an svc thread context and adding
a work-around to avoid the cache lookup deferral is nothing
but a hacky smelly prototype. It's not merge-ready or -worthy.


> If Neil completes his work I'll rebase.  But last time I rebased to
> his simplification of the localio protocol (to use array and not
> lists, nice changes, appreciated but it took serious work on my part
> to fold them in): the code immediately BUG_ON()'d in sunrpc trivially.

You should be very grateful that Neil is writing your code
for you. He's already contributed much more than you have
any reason to expect from someone who is not employed by
Hammerspace.

And quite frankly, it is not reasonable to expect anyone's
freshly written code to be completely free of bugs. I'm
sorry it took you a little while to find the problem, but
it will become easier when you become more familiar with
the code base.


> So please be considerate of my time and the requirement for code to
> actually work.

I'll be considerate when you are considerate of our time and
stop patch bombing the list with tiny incremental changes,
demanding we "get the review done and merge it" before it
is ready.

Honestly, the work is proceeding quite unremarkably for a
new feature. The problem seems to be that you don't
understand why we're asking for (actually quite small)
changes before merging, and we're asking you to do that
work. Why are we asking you to do it?

It's because you are asking for /our/ time. But we don't
work for Hammerspace and do not have any particular interest
in localIO and have no real way to test the facility yet
(no, running fstests does not count as a full test).

It's your responsibility to get this code put together,
it's got to be your time and effort. You are getting paid
to deal with this. None of the rest of us are. No-one else
is asking for this feature.


>>> - In addition, tweaks to use nfsd_file_acquire_gc() instead of
>>> nfsd_file_acquire() aren't a priority.
>> 
>> The discussion has moved well beyond that now... IIUC the
>> preferred approach might be to hold the file open until the
>> local app is done with it. However, I'm still not convinced
>> there's a benefit to using the NFSD file cache vs. a plain
>> dentry_open().
> 
> Saving an nfs_file to open_context, etc.  All incremental improvement
> (that needs time to stick the landing).

You are still missing the point. The phony svc_rqst is being
passed into nfsd_file_acquire(). Either we have to fix
nfsd_file_acquire (as Neil did) or replace its use with
fh_verify() / dentry_open().
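
For illustration only, here is a minimal sketch of what the plain
dentry_open() alternative could look like.  This is not code from the
series; the function name is hypothetical and the path/cred arguments
are assumed to come from the export and filehandle lookup:

    #include <linux/fs.h>
    #include <linux/file.h>

    /*
     * Hypothetical sketch: open the file backing an already-verified
     * filehandle directly, without going through the nfsd file cache.
     * 'path' and 'cred' are assumed to have been produced by the export
     * and filehandle lookup; the caller fput()s the returned file when
     * the local I/O is done.
     */
    static struct file *localio_open_sketch(const struct path *path,
                                            const struct cred *cred,
                                            fmode_t may_flags)
    {
            int flags = (may_flags & FMODE_WRITE) ? O_RDWR : O_RDONLY;

            /* on success the returned file holds its own references */
            return dentry_open(path, flags | O_LARGEFILE, cred);
    }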

This is not about garbage collection, and hasn't been for
a while. It's about replacing unmergeable prototype code.

And sticking the landing? If a few good fstests results
are supposed to be good enough for us to merge your code
as it exists now, why aren't they good enough to verify
your code is OK to merge after a rebase?


--
Chuck Lever
NeilBrown July 3, 2024, 12:52 a.m. UTC | #4
On Wed, 03 Jul 2024, Mike Snitzer wrote:
> 
> Maybe Neil will post a fully working v12 rebased on his changes.

Maybe I will, but it won't be before Friday.

I too wonder about the unusual expectation of haste, and what its real
source is.

NeilBrown
Mike Snitzer July 3, 2024, 12:57 a.m. UTC | #5
I am an upstream Linux kernel maintainer too.  My ideals and approach
are different but they are my own ;)

The first localio RFC (that made it to list as v2) was posted on June
11.  I have tried to work well with you and everyone willing to help
and engage.  So for it to come to this exchange is unfortunate.

Development with excess rebases is just soul-sucking.  My v11's 0th
header certainly conveyed exhaustion with that aspect of how things have
gone as this series has evolved.

I clearly upset you by suggesting v11 is suitable to merge for 6.11.  I
really wasn't trying to be pushy.  I didn't think it controversial,
but I concede I didn't give you much to work with if/when you disagreed.
Sorry about painting you into a corner.

v11 is a solid basis to develop upon further.  I am all for iterating
further, am aware it is my burden to carry, and am hopeful we can get
localio staged in linux-next early during the 6.12 development window.
Let it soak (Anna's instinct was solid).

However, I'm hoping to avoid the hell of frequently rebasing on top of
refactored code that optimizes the approaches this v11 baseline
provides.

So I'd like to propose that I carry the v11 baseline in my git tree:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-next
And any changes (e.g. Neil's promising refactor to avoid needing
"fake" svc_rqst) can be based on 'nfs-localio-for-next' with
standalone incremental commits that can possibly get folded via a
final rebase once we're happy with the end result of the changes?

Thanks,
Mike
Mike Snitzer July 3, 2024, 1:13 a.m. UTC | #6
On Wed, Jul 03, 2024 at 10:52:20AM +1000, NeilBrown wrote:
> On Wed, 03 Jul 2024, Mike Snitzer wrote:
> > 
> > Maybe Neil will post a fully working v12 rebased on his changes.
> 
> Maybe I will, but it won't be before Friday.

No problem!  I can also just run with the first patchset you provided.

But I'm hopeful you're OK with us doing incremental changes to this
"v11" baseline:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-next

Please see the other reply I just sent for more context on why I hope
this works for you.  Happy to do a final rebase once the code is
settled.

> I too wonder about the unusual expectation of haste, and what its real
> source is.

Desire to settle on an approach is all, to allow development to settle
and ultimately move on to developing something else in NFS.

Mike
Christoph Hellwig July 3, 2024, 5:04 a.m. UTC | #7
On Tue, Jul 02, 2024 at 06:06:09PM +0000, Chuck Lever III wrote:
> To make it official, for v11 of this series:
> 
>   Nacked-by: Chuck Lever <chuck.lever@oracle.com>

We've also not even looked into tackling the whole memory reclaim
recursion problem that has historically made local loopback
network file systems an unsupported configuration.  We have an
ongoing discussion on the XFS list that really needs to go to fsdevel
and mm to make any progress first.  I see absolutely no chance of
solving that in this merge window.  I'm also a bit surprised and
shocked by the rush here.
Mike Snitzer July 3, 2024, 8:52 a.m. UTC | #8
On Tue, Jul 02, 2024 at 10:04:56PM -0700, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 06:06:09PM +0000, Chuck Lever III wrote:
> > To make it official, for v11 of this series:
> > 
> >   Nacked-by: Chuck Lever <chuck.lever@oracle.com>
> 
> We've also not even looked into tackling the whole memory reclaim
> recursion problem that have historically made local loopback
> network file system an unsupported configuration.  We've have an
> ongoing discussion on the XFS list that really needs to to fsdevel
> and mm to make any progress first.  I see absolutely not chance to
> solved that in this merge window.  I'm also a bit surprised and
> shocked by the rush here.

linux-nfs and linux-xfs both received those emails; there isn't some
secret game here:
https://marc.info/?l=linux-nfs&m=171976530216518&w=2
https://marc.info/?l=linux-xfs&m=171976530416523&w=2

I pivoted away from that (after Dave's helpful response) and back to
what I provided in almost every revision of this patchset (with
different header and code revisions), most recently patch 18 of
this v11 series: https://marc.info/?l=linux-nfs&m=171993773109538&w=2

And if spending the past 2 months discussing and developing in the
open is rushing things, I clearly need to slow down...

If only I had reason to think others were considering merging these
changes: https://marc.info/?l=linux-nfs&m=171942776105165&w=2

Ultimately I simply wanted to keep momentum up; I'm sure you can
relate to having a vision for phasing changes in without missing a
cycle.  But happy to just continue working it into the 6.12
development window.

I'll be sure to cc linux-fsdevel on future revision(s).
Christoph Hellwig July 3, 2024, 2:16 p.m. UTC | #9
On Wed, Jul 03, 2024 at 04:52:34AM -0400, Mike Snitzer wrote:
> Ultimately I simply wanted to keep momentum up, I'm sure you can
> relate to having a vision for phasing changes in without missing a
> cycle.  But happy to just continue working it into the 6.12
> development window.

It just feels really rushed to have something with cross-subsystem
communication going in past -rc6 in a US holiday week.  Sometimes
not rushing things too much will lead to much better results.
Mike Snitzer July 3, 2024, 3:11 p.m. UTC | #10
On Wed, Jul 03, 2024 at 07:16:31AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 04:52:34AM -0400, Mike Snitzer wrote:
> > Ultimately I simply wanted to keep momentum up, I'm sure you can
> > relate to having a vision for phasing changes in without missing a
> > cycle.  But happy to just continue working it into the 6.12
> > development window.
> 
> It just feels really rushed to have something with cross-subsystem
> communication going in past -rc6 in a US holiday week.  Sometimes
> not rushing things too much will lead to much better results.

Yes, I knew it to be very tight given the holiday.  I should've just
yielded to the reality of the calendar and there being some extra
changes needed (removing the "fake" svc_rqst in fs/nfsd/localio.c -- I was
hopeful that could be done incrementally after merge, but I digress).

I'll welcome any help you might offer to optimize localio as much as
possible (it doesn't need to be in the near term, whenever you might
have time to look).  Its current approach of using synchronous buffered
read_iter and write_iter, with active waiting, should be improved.

But Dave's idea to go full RMW to be page aligned will complicate a
forecasted NFS roadmap item to allow for: "do-not-cache capabilities,
so that the NFS server can turn off the buffer caching of files on
clients (force O_DIRECT-like writing/reading)".  But even that seems a
catch-22 given the NFS client doesn't enforce DIO alignment.
Christoph Hellwig July 3, 2024, 3:16 p.m. UTC | #11
I've started looking a bit at the code, and the architectural model
confuses me more than a bit.

A first thing that would be very helpful is an actual problem statement.
The only mention of a concrete use case is about containers, implying
that this is about a client in one container/namespace with the server
or servers in another container/namespace.  Is that the main use
case, or are there others?

I kinda deduce from that that the client and server probably do not
have the same view and access permissions to the underlying file
systems?  As this would defeat the use of NFS I suspect that is the
case, but it should probably be stated clearly somewhere.

Going from there I don't understand why we need multiple layers of
server bypass.  The normal way to do this in NFSv4 is to use pNFS
layout.

I.e. you add a pnfs localio layout that just does local reads
and writes for the I/O path.  We'd still need to find a good
in-kernel way to get the file structure, but compared to the two
separate layers of bypasses in the current code it should be
significantly simpler.
Christoph Hellwig July 3, 2024, 3:18 p.m. UTC | #12
On Wed, Jul 03, 2024 at 11:11:51AM -0400, Mike Snitzer wrote:
> Will welcome any help you might offer to optimize localio as much as
> possible (doesn't need to be in near-term, whenever you might have
> time to look).  Its current approach to use synchronous buffered
> read_iter and write_iter, with active waiting, should be improved.
> 
> But Dave's idea to go full RMW to be page aligned will complicate a
> forecasted NFS roadmap item to allow for: "do-not-cache capabilities,
> so that the NFS server can turn off the buffer caching of files on
> clients (force O_DIRECT-like writing/reading)".  But even that seems a
> catch-22 given the NFS client doesn't enforce DIO alignment.

As I just wrote in another mail I've now looked at the architecture,
and either I'm missing some unstated requirements, or the whole
architecture seems very overcomplicated and suboptimal.  If localio
actually were just a pNFS layout type you could trivially do
asynchronous direct I/O from
the layout driver, and bypass a lot of the complexity.  The actual way
to find the file struct still would be nasty, but I'll try to think of
something good for that.
Chuck Lever July 3, 2024, 3:24 p.m. UTC | #13
> On Jul 3, 2024, at 11:18 AM, Christoph Hellwig <hch@infradead.org> wrote:
> The actual way
> to find the file struct still would be nasty, but I'll try to think of
> something good for that.

It is that very code that I've asked to be replaced before this
series can be merged. We have a set of patches for improving
that aspect that Neil is working on now.

When Mike presented LOCALIO to me at LSF, my initial suggestion
was to use pNFS. I think Jeff had the same reaction. IMO the
design document should, as part of the problem statement,
explain why a pNFS-only solution is not workable.

I'm also concerned about applications in one container being
able to reach around existing mount namespace silos into the
NFS server container's file systems. Obviously the NFS protocol
has its own authorization that would grant permission for that
access, but via the network.


--
Chuck Lever
Chuck Lever July 3, 2024, 3:26 p.m. UTC | #14
> On Jul 3, 2024, at 4:52 AM, Mike Snitzer <snitzer@kernel.org> wrote:
> 
> And if spending the past 2 months discussing and developing in the
> open is rushing things, I clearly need to slow down...
> 
> If only I had reason to think others were considering merging these
> changes: https://marc.info/?l=linux-nfs&m=171942776105165&w=2

There is no mention of a particular kernel release in that
email, nor is there a promise that we could hit the v6.11
merge window.

In particular, I was asking how the series should be split
up, since it modifies two separately maintained subsystems.

I apologize for not asking this more clearly.


--
Chuck Lever
Mike Snitzer July 3, 2024, 3:28 p.m. UTC | #15
On Wed, Jul 03, 2024 at 08:16:00AM -0700, Christoph Hellwig wrote:
> I've stated looking a bit at the code, and the architectural model
> confuses me more than a bit.
> 
> A first thing that would be very helpful is an actual problem statement.
> The only mention of a concrete use case is about containers, implying
> that this about a client in one container/namespace with the server
> or the servers in another containers/namespace.  Is that the main use
> case, are there others?

Containers are a significant use case, but so is any client that might
need to access local storage efficiently (e.g. a GPU service running an
NFS client that needs to access NVMe on the same host).

> I kinda deduct from that that the client and server probably do not
> have the same view and access permissions to the underlying file
> systems?  As this would defeat the use of NFS I suspect that is the
> case, but it should probably be stated clearly somewhere.

I can tighten that up in the Documentation.

> Going from there I don't understand why we need multiple layers of
> server bypass.  The normal way to do this in NFSv4 is to use pNFS
> layout.
>
> I.e. you add a pnfs localio layout that just does local reads
> and writes for the I/O path.  We'd still need a way to find a good
> in-kernel way to get the file structure, but compared to the two
> separate layers of bypasses in the current code it should be
> significantly simpler.

Using a pNFS layout isn't viable because NFSv3 is very much in the mix
(flexfiles layout) for Hammerspace.
Christoph Hellwig July 3, 2024, 3:29 p.m. UTC | #16
On Wed, Jul 03, 2024 at 03:24:18PM +0000, Chuck Lever III wrote:
> I'm also concerned about applications in one container being
> able to reach around existing mount namespace silos into the
> NFS server container's file systems. Obviously the NFS protocol
> has its own authorization that would grant permission for that
> access, but via the network.

Yes.  One good way I can think of is to use SCM_RIGHTS to duplicate a file
descriptor over a unix socket.  For that we'd need a way to actually
create that unix socket first and I also don't think we currently have
support for using that in-kernel, but it's a well-known way to hand file
descriptors to other processes.  A big plus would be that this would
even work with non-kernel servers (or even clients for that matter)
as long as they run on the same kernel (including non-Linux kernels).
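
For readers who haven't used it, a minimal userspace sketch of the
SCM_RIGHTS mechanism referred to above: passing an open file descriptor
over an already-connected AF_UNIX socket.  The socket setup and any
in-kernel plumbing are not shown, and send_fd() is just an illustrative
helper name:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send 'fd' to the peer of the connected AF_UNIX socket 'sock'. */
    static int send_fd(int sock, int fd)
    {
            char byte = 0;
            struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
            union {                 /* keeps the cmsg buffer aligned */
                    char buf[CMSG_SPACE(sizeof(int))];
                    struct cmsghdr align;
            } u;
            struct msghdr msg = {
                    .msg_iov = &iov,
                    .msg_iovlen = 1,
                    .msg_control = u.buf,
                    .msg_controllen = sizeof(u.buf),
            };
            struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;   /* payload is a file descriptor */
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

            return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

The receiver calls recvmsg() and pulls the duplicated descriptor out of
CMSG_DATA() of the received control message.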
Mike Snitzer July 3, 2024, 3:36 p.m. UTC | #17
On Wed, Jul 03, 2024 at 03:24:18PM +0000, Chuck Lever III wrote:
> 
> 
> > On Jul 3, 2024, at 11:18 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > The actual way
> > to find the file struct still would be nasty, but I'll try to think of
> > something good for that.
> 
> It is that very code that I've asked to be replaced before this
> series can be merged. We have a set of patches for improving
> that aspect that Neil is working on now.
> 
> When Mike presented LOCALIO to me at LSF, my initial suggestion
> was to use pNFS. I think Jeff had the same reaction.

No, Jeff suggested using an O_TMPFILE-based thing for the localio
handshake.  But he had the benefit of knowing NFSv3 is important for the
intended localio use case, so I'm not aware of him having pNFS design
ideas.

> IMO the design document should, as part of the problem statement,
> explain why a pNFS-only solution is not workable.

Sure, I can add that.

I explained the NFSv3 requirement when we discussed at LSF.

> I'm also concerned about applications in one container being
> able to reach around existing mount namespace silos into the
> NFS server container's file systems. Obviously the NFS protocol
> has its own authorization that would grant permission for that
> access, but via the network.

Jeff also had concerns there (as did I), but we arrived at: NFS has the
ability to do it over the network, so doing it with localio is
ultimately "OK".  That said, localio isn't taking special action to
escape mount
namespaces (that I'm aware of) and in practice there are no
requirements to do so.
Mike Snitzer July 3, 2024, 3:37 p.m. UTC | #18
On Wed, Jul 03, 2024 at 03:26:30PM +0000, Chuck Lever III wrote:
> 
> 
> > On Jul 3, 2024, at 4:52 AM, Mike Snitzer <snitzer@kernel.org> wrote:
> > 
> > And if spending the past 2 months discussing and developing in the
> > open is rushing things, I clearly need to slow down...
> > 
> > If only I had reason to think others were considering merging these
> > changes: https://marc.info/?l=linux-nfs&m=171942776105165&w=2
> 
> There is no mention of a particular kernel release in that
> email, nor is there a promise that we could hit the v6.11
> merge window.
> 
> In particular, I was asking how the series should be split
> up, since it modifies two separately maintained subsystems.
> 
> I apologize for not asking this more clearly.

No problem, I just got ahead of myself.  Not your fault.
Jeff Layton July 3, 2024, 5:06 p.m. UTC | #19
On Wed, 2024-07-03 at 11:36 -0400, Mike Snitzer wrote:
> On Wed, Jul 03, 2024 at 03:24:18PM +0000, Chuck Lever III wrote:
> > 
> > 
> > > On Jul 3, 2024, at 11:18 AM, Christoph Hellwig
> > > <hch@infradead.org> wrote:
> > > The actual way
> > > to find the file struct still would be nasty, but I'll try to
> > > think of
> > > something good for that.
> > 
> > It is that very code that I've asked to be replaced before this
> > series can be merged. We have a set of patches for improving
> > that aspect that Neil is working on now.
> > 
> > When Mike presented LOCALIO to me at LSF, my initial suggestion
> > was to use pNFS. I think Jeff had the same reaction.
> 
> No, Jeff suggested using a O_TMPFILE based thing for localio
> handshake.  But he had the benefit of knowing NFSv3 important for the
> intended localio usecase, so I'm not aware of him having pNFS design
> ideas.
> 

The other problem with doing this is that if a server is running in a
container, how is it to know that the client is in a different container
on the same host, and hence that it can give out a localio layout? We'd
still need some way to detect that anyway, which would probably look a
lot like the localio protocol.

> > IMO the design document should, as part of the problem statement,
> > explain why a pNFS-only solution is not workable.
> 
> Sure, I can add that.
> 
> I explained the NFSv3 requirement when we discussed at LSF.
>
> > I'm also concerned about applications in one container being
> > able to reach around existing mount namespace silos into the
> > NFS server container's file systems. Obviously the NFS protocol
> > has its own authorization that would grant permission for that
> > access, but via the network.
> 
> Jeff also had concerns there (as did I) but we arrived at NFS having
> the ability to do it over network, so doing it with localio
> ultimately
> "OK".  That said, localio isn't taking special action to escape mount
> namespaces (that I'm aware of) and in practice there are no
> requirements to do so.

The one thing I think we need to ensure is that an unauthorized NFS
client on the same kernel can't use this to bypass export permission
checks.

IOW, suppose we have a client and server on the same host. The server
allows the client to access some of its exports, but not all. The rest
are restricted to only certain IP addresses.

Can the client use its localio access to bypass that since it's not
going across the network anymore? Maybe by using open_by_handle_at on
the NFS share on a guessed filehandle? I think we need to ensure that
that isn't possible.
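
For context, a minimal userspace sketch of the open_by_handle_at() usage
being worried about here.  The mount fd, handle type, and handle bytes
are purely hypothetical stand-ins for a guessed filehandle:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>

    /*
     * Sketch: try to open a file purely by handle.  A legitimate caller
     * obtains the handle via name_to_handle_at(); the concern above is a
     * caller that constructs or guesses the handle bytes instead.  The
     * call normally requires CAP_DAC_READ_SEARCH.
     */
    static int open_guessed_handle(int mount_fd)
    {
            struct file_handle *fh;
            int fd;

            fh = calloc(1, sizeof(*fh) + MAX_HANDLE_SZ);
            if (!fh)
                    return -1;
            fh->handle_bytes = MAX_HANDLE_SZ;
            fh->handle_type = 1;            /* hypothetical/guessed value */
            /* fh->f_handle[] would hold the guessed handle bytes */

            fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
            free(fh);
            return fd;
    }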

I wonder if it's also worthwhile to gate localio access on an export
option, just out of an abundance of caution.
Chuck Lever July 3, 2024, 5:19 p.m. UTC | #20
> On Jul 3, 2024, at 11:36 AM, Mike Snitzer <snitzer@kernel.org> wrote:
> 
> On Wed, Jul 03, 2024 at 03:24:18PM +0000, Chuck Lever III wrote:
> 
>> IMO the design document should, as part of the problem statement,
>> explain why a pNFS-only solution is not workable.
> 
> Sure, I can add that.
> 
> I explained the NFSv3 requirement when we discussed at LSF.

You explained it to me in a private conversation, although
there was a lot of "I don't know yet" in that discussion.

It needs to be (re)explained in a public forum because
reviewers keep bringing this question up.

I hope to see more than just "NFSv3 is in the mix". There
needs to be some explanation of why it is necessary to
support NFSv3 without the use of pNFS flexfile.


--
Chuck Lever
Mike Snitzer July 3, 2024, 7:04 p.m. UTC | #21
On Wed, Jul 03, 2024 at 05:19:06PM +0000, Chuck Lever III wrote:
> 
> 
> > On Jul 3, 2024, at 11:36 AM, Mike Snitzer <snitzer@kernel.org> wrote:
> > 
> > On Wed, Jul 03, 2024 at 03:24:18PM +0000, Chuck Lever III wrote:
> > 
> >> IMO the design document should, as part of the problem statement,
> >> explain why a pNFS-only solution is not workable.
> > 
> > Sure, I can add that.
> > 
> > I explained the NFSv3 requirement when we discussed at LSF.
> 
> You explained it to me in a private conversation, although
> there was a lot of "I don't know yet" in that discussion.

Those "I don't know yet" were in response to you asking why a pNFS
layout (like the block layout) is not possible to achieve localio.

The answer to that is: someone(s) could try that, but there is no
interest from me or my employer to resort to using block layout with
centralized mapping of which client and DS are local so that the pNFS
MDS could handout such pNFS block layouts.

That added MDS complexity can be avoided if the client and server have
autonomy to negotiate more performant access without a centralized
arbiter (hence the "localio" handshake).

> It needs to be (re)explained in a public forum because
> reviewers keep bringing this question up.

Sure.

> I hope to see more than just "NFSv3 is in the mix". There
> needs to be some explanation of why it is necessary to
> support NFSv3 without the use of pNFS flexfile.

Loaded question there; I'm not sure why you're leading with it being
invalid to decouple localio (leveraging client and server locality)
from pNFS.

NFS can realize benefits from localio being completely decoupled from
flexfiles and pNFS.  There are clear benefits with container use-cases
that don't use pNFS at all.

Just so happens that flexfiles ushers in the use of NFSv3.  Once the
client gets a flexfiles layout that points to an NFSv3 DS: the client
IO is issued in terms of NFSv3.  If the client happens to be on the
same host as the server then using localio is a win.
NeilBrown July 3, 2024, 9:35 p.m. UTC | #22
On Thu, 04 Jul 2024, Chuck Lever III wrote:
> 
> 
> > On Jul 3, 2024, at 11:36 AM, Mike Snitzer <snitzer@kernel.org> wrote:
> > 
> > On Wed, Jul 03, 2024 at 03:24:18PM +0000, Chuck Lever III wrote:
> > 
> >> IMO the design document should, as part of the problem statement,
> >> explain why a pNFS-only solution is not workable.
> > 
> > Sure, I can add that.
> > 
> > I explained the NFSv3 requirement when we discussed at LSF.
> 
> You explained it to me in a private conversation, although
> there was a lot of "I don't know yet" in that discussion.
> 
> It needs to be (re)explained in a public forum because
> reviewers keep bringing this question up.
> 
> I hope to see more than just "NFSv3 is in the mix". There
> needs to be some explanation of why it is necessary to
> support NFSv3 without the use of pNFS flexfile.
> 

My perspective is "of course NFSv3".
The core idea is to accelerate loop-back NFS, and unless we have decided
to deprecate NFSv3 (as I think we have decided to deprecate NFSv2),
NFSv3 support should be on the table.

If v3 support turns out to be particularly burdensome, then it's not a
"must have" for me, but it isn't at all clear to me that a pNFS approach
would have fewer problems - only different problems.

Just my 2c worth.

NeilBrown
Christoph Hellwig July 4, 2024, 5:49 a.m. UTC | #23
On Wed, Jul 03, 2024 at 11:28:55AM -0400, Mike Snitzer wrote:
> Containers is a significant usecase, but also any client that might
> need to access local storage efficiently (e.g. GPU service running NFS
> client that needs to access NVMe on same host).

Please explain that in terms of who talks to whom concretely using
the actual Linux and/or NFS entities.  The last sentence just sounds
like an AI-generated marketing whitepaper.

> I can tighten that up in the Documentation.

Please write up a coherent document for the use case and circle it
around.  It's kinda pointless to do code review if we don't have a
problem statement and use case.

> Using pNFS layout isn't viable because NFSv3 is very much in the mix
> (flexfiles layout) for Hammerspace.

Again, why and how.  We have a codebase that works entirely inside the
Linux kernel, and requires new code to be merged.  If we can't ask
people to use the current protocol (where current means a 14-year-old
RFC!), we have a problem.
Christoph Hellwig July 4, 2024, 5:55 a.m. UTC | #24
On Wed, Jul 03, 2024 at 03:04:05PM -0400, Mike Snitzer wrote:
> The answer to that is: someone(s) could try that, but there is no
> interest from me or my employer to resort to using block layout with
> centralized mapping of which client and DS are local so that the pNFS
> MDS could handout such pNFS block layouts.

Where did block layout suddenly come from?

> That added MDS complexity can be avoided if the client and server have
> autonomy to negotiate more performant access without a centralized
> arbiter (hence the "localio" handshake).

Doing a localio layout would actually be a lot simpler than the current
mess, so that argument goes the other way around.

> NFS can realize benefits from localio being completely decoupled from
> flexfiles and pNFS.

How about actually listing the benefits?

> There are clear benefits with container use-cases
> that don't use pNFS at all.

Well, the point would be to make them use pNFS, because pNFS is the
well known and proven way to bypass the main server in NFS.

> Just so happens that flexfiles ushers in the use of NFSv3.  Once the
> client gets a flexfiles layout that points to an NFSv3 DS: the client
> IO is issued in terms of NFSv3.  If the client happens to be on the
> same host as the server then using localio is a win.

I have no idea where flexfiles comes in here and why it matters.  The
Linux server does not even support flexfiles layouts.
Christoph Hellwig July 4, 2024, 6 a.m. UTC | #25
On Wed, Jul 03, 2024 at 01:06:51PM -0400, Jeff Layton wrote:
> The other problem with doing this is that if a server is running in a
> container, how is it to know that the client is in different container
> on the same host, and hence that it can give out a localio layout? We'd
> still need some way to detect that anyway, which would probably look a
> lot like the localio protocol.

We'll need some way to detect that client and server are capable
of the bypass.  And from the looks of it that's actually the hard and
complicated part, and we'll need that for any scheme.

And then we need a way to bypass the server for I/O, which currently is
rather complex in the patchset and would be almost trivial with a new
pNFS layout.

> Can the client use its localio access to bypass that since it's not
> going across the network anymore? Maybe by using open_by_handle_at on
> the NFS share on a guessed filehandle? I think we need to ensure that
> that isn't possible.

If a file system is shared by containers and users in containers have
the capability to use open_by_handle_at the security model is already
broken without NFS or localio involved.

> I wonder if it's also worthwhile to gate localio access on an export
> option, just out of an abundance of caution.

export and mount option.  We're speaking of a non-standard side-band
protocol here; there is no way that should be done without explicit
opt-in from both sides.
Christoph Hellwig July 4, 2024, 6:01 a.m. UTC | #26
On Wed, Jul 03, 2024 at 11:36:00AM -0400, Mike Snitzer wrote:
> > When Mike presented LOCALIO to me at LSF, my initial suggestion
> > was to use pNFS. I think Jeff had the same reaction.
> 
> No, Jeff suggested using a O_TMPFILE based thing for localio
> handshake.  But he had the benefit of knowing NFSv3 important for the
> intended localio usecase, so I'm not aware of him having pNFS design
> ideas.

How does O_TMPFILE fit in here?  NFS doesn't even support O_TMPFILE.
Jeff Layton July 4, 2024, 10:13 a.m. UTC | #27
On Wed, 2024-07-03 at 23:01 -0700, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 11:36:00AM -0400, Mike Snitzer wrote:
> > > When Mike presented LOCALIO to me at LSF, my initial suggestion
> > > was to use pNFS. I think Jeff had the same reaction.
> > 
> > No, Jeff suggested using a O_TMPFILE based thing for localio
> > handshake.  But he had the benefit of knowing NFSv3 important for the
> > intended localio usecase, so I'm not aware of him having pNFS design
> > ideas.
> 
> How does O_TMPFILE fit in here?  NFS doesn't even support O_TMPFILE.
> 

At LSF we were tossing around ideas about how to detect whether the
client and server were on the same host. My thinking was to have a
common fs (maybe even a tmpfs) that was exported by all of the servers
on the host and accessible by all of the containers on the host.

The client would then do an O_TMPFILE open in that tmpfs, write some
data to it (uuids or something) and determine the filehandle. Then it
could issue a v3 READ against the NFS server for that filehandle and if
it worked and the contents were as expected you could be sure you're on
the same host. The client could then just close the file and it would
be cleaned up.

The problem of course is that that requires having a fs that is
commonly accessible between all of the containers, which is a bit more
setup than is ideal.

The localio protocol (particularly with Neil's suggested improvements)
is really a better scheme I think.
Mike Snitzer July 4, 2024, 6:31 p.m. UTC | #28
On Wed, Jul 03, 2024 at 11:00:27PM -0700, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:06:51PM -0400, Jeff Layton wrote:
> > The other problem with doing this is that if a server is running in a
> > container, how is it to know that the client is in different container
> > on the same host, and hence that it can give out a localio layout? We'd
> > still need some way to detect that anyway, which would probably look a
> > lot like the localio protocol.

(NOTE: Jeff's message above was him stating flaws in his O_TMPFILE
idea that we discussed at LSF; the idea wasn't pursued for many
reasons.  And Jeff has stated he believes localio is better.)

> We'll need some way to detect that client and server are capable
> of the bypass.  And from all it looks that's actually the hard and
> complicated part, and we'll need that for any scheme.

Yes, hence the localio protocol that has wide buy-in.  To the point that I
ran with registering an NFS Program number with iana.org for the
effort.  My doing the localio protocol was born out of the requirement
to support NFSv3.

Neil's proposed refinement to add a localio auth_domain to the
nfsd_net and his proposed risk-averse handshake within the localio
protocol will both improve security.

> And then we need a way to bypass the server for I/O, which currently is
> rather complex in the patchset and would be almost trivial with a new
> pNFS layout.

Some new layout misses the entire point of having localio work for
NFSv3 and NFSv4.  NFSv3 is very ubiquitous.

And in this localio series, flexfiles is trained to use localio.
(Which you apparently don't recognize or care about because nfsd
doesn't have flexfiles server support).

> > Can the client use its localio access to bypass that since it's not
> > going across the network anymore? Maybe by using open_by_handle_at on
> > the NFS share on a guessed filehandle? I think we need to ensure that
> > that isn't possible.
> 
> If a file system is shared by containers and users in containers have
> the capability to use open_by_handle_at the security model is already
> broken without NFS or localio involved.

Containers deployed by things like podman.io and kubernetes are
perfectly happy to allow containers permission to drive knfsd threads
in the host kernel.  That this is foreign to you is odd.

An NFS client that happens to be on the host should work perfectly
fine too (if it has adequate permissions).

> > I wonder if it's also worthwhile to gate localio access on an export
> > option, just out of an abundance of caution.
> 
> export and mount option.  We're speaking a non-standard side band
> protocol here, there is no way that should be done without explicit
> opt-in from both sides.

That is already provided by existing controls: Kconfig options that
default to N, and the ability to disable the use of localio entirely
even if it is enabled in the Kconfig:
echo N > /sys/module/nfs/parameters/localio_enabled

And then on top of that you have to do a loopback NFS mount.
Christoph Hellwig July 5, 2024, 5:18 a.m. UTC | #29
On Thu, Jul 04, 2024 at 02:31:46PM -0400, Mike Snitzer wrote:
> Some new layout misses the entire point of having localio work for
> NFSv3 and NFSv4.  NFSv3 is very ubiquitous.

I'm getting tired of bringing up this "oh NFSv3" again and again without
any explanation of why it matters for communication inside the
same Linux kernel instance with a kernel that obviously requires
patching.  Why is running an obsolete protocol inside the same OS
instance required?  Maybe it is, but if so it needs a very good
explanation.

> And in this localio series, flexfiles is trained to use localio.
> (Which you apparently don't recognize or care about because nfsd
> doesn't have flexfiles server support).

And you fail to explain why it matters.  You are trying to sell this
code; you'd better have an explanation for why it's complicated and convoluted
as hell.  So far we are running in circles but there has been no clear
explanation of use cases.

> > > Can the client use its localio access to bypass that since it's not
> > > going across the network anymore? Maybe by using open_by_handle_at on
> > > the NFS share on a guessed filehandle? I think we need to ensure that
> > > that isn't possible.
> > 
> > If a file system is shared by containers and users in containers have
> > the capability to use open_by_handle_at the security model is already
> > broken without NFS or localio involved.
> 
> Containers deployed by things like podman.io and kubernetes are
> perfectly happy to allow containers permission to drive knfsd threads
> in the host kernel.  That this is foreign to you is odd.
> 
> An NFS client that happens to be on the host should work perfectly
> fine too (if it has adequate permissions).

Can you please stop the personal attacks?  I am just stating the fact
that IF the containers using the NFS mount have access to the exported
file systems and the privileges to use open-by-handle, there is nothing
nfsd can do about security, as the container has full access to the file
system anyway.  That's a fact, and how you deploy the various containers
is completely irrelevant.  It is also, in case you didn't notice it
last time, about the _client_ containers, as stated by me and the
original poster I replied to.

> > > I wonder if it's also worthwhile to gate localio access on an export
> > > option, just out of an abundance of caution.
> > 
> > export and mount option.  We're speaking a non-standard side band
> > protocol here, there is no way that should be done without explicit
> > opt-in from both sides.
> 
> That is already provided my existing controls.  With both Kconfig
> options that default to N, and the ability to disable the use of
> localio entirely even if enabled in the Kconfig:
> echo N > /sys/module/nfs/parameters/localio_enabled

And all of that is global and not per-mount or per-nfsd-instance, which
doesn't exactly scale to a multi-tenant container hosting setup.
Chuck Lever July 5, 2024, 1:35 p.m. UTC | #30
> On Jul 5, 2024, at 1:18 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Thu, Jul 04, 2024 at 02:31:46PM -0400, Mike Snitzer wrote:
>> Some new layout misses the entire point of having localio work for
>> NFSv3 and NFSv4.  NFSv3 is very ubiquitous.
> 
> I'm getting tird of bringing up this "oh NFSv3" again and again without
> any explanation of why that matters for communication insides the
> same Linux kernel instance with a kernel that obviously requires
> patching.  Why is running an obsolete protocol inside the same OS
> instance required.  Maybe it is, but if so it needs a very good
> explanation.

I agree: I think the requirement for NFSv3 in this situation
needs a clear justification. Both peers are recent vintage
Linux kernels; both peers can use NFSv4.x, there's no
explicit need for backwards compatibility in the use cases
that have been provided so far.

Generally I do agree with Neil's "why not NFSv3, we still
support it" argument. But with NFSv4, you get better locking
semantics, delegation, pNFS (possibly), and proper protocol
extensibility. There are really strong reasons to restrict
this facility to NFSv4.


--
Chuck Lever
Christoph Hellwig July 5, 2024, 1:39 p.m. UTC | #31
On Fri, Jul 05, 2024 at 01:35:18PM +0000, Chuck Lever III wrote:
> I agree: I think the requirement for NFSv3 in this situation
> needs a clear justification. Both peers are recent vintage
> Linux kernels; both peers can use NFSv4.x, there's no
> explicit need for backwards compatibility in the use cases
> that have been provided so far.

More importantly both peers are in fact the exact same Linux kernel
instance.  Which is the important point here - we are doing a bypass
for a kernel talking to itself, although a kernel suffering from
multiple personality (dis)order where the different sides might
expose very different system views.
Mike Snitzer July 5, 2024, 2:15 p.m. UTC | #32
On Fri, Jul 05, 2024 at 01:35:18PM +0000, Chuck Lever III wrote:
> 
> 
> > On Jul 5, 2024, at 1:18 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > On Thu, Jul 04, 2024 at 02:31:46PM -0400, Mike Snitzer wrote:
> >> Some new layout misses the entire point of having localio work for
> >> NFSv3 and NFSv4.  NFSv3 is very ubiquitous.
> > 
> > I'm getting tird of bringing up this "oh NFSv3" again and again without
> > any explanation of why that matters for communication insides the
> > same Linux kernel instance with a kernel that obviously requires
> > patching.  Why is running an obsolete protocol inside the same OS
> > instance required.  Maybe it is, but if so it needs a very good
> > explanation.
> 
> I agree: I think the requirement for NFSv3 in this situation
> needs a clear justification. Both peers are recent vintage
> Linux kernels; both peers can use NFSv4.x, there's no
> explicit need for backwards compatibility in the use cases
> that have been provided so far.
> 
> Generally I do agree with Neil's "why not NFSv3, we still
> support it" argument. But with NFSv4, you get better locking
> semantics, delegation, pNFS (possibly), and proper protocol
> extensibility. There are really strong reasons to restrict
> this facility to NFSv4.

NFSv3 is needed because NFSv3 is used to initiate IO to NFSv3 knfsd on
the same host.
Christoph Hellwig July 5, 2024, 2:18 p.m. UTC | #33
On Fri, Jul 05, 2024 at 10:15:46AM -0400, Mike Snitzer wrote:
> NFSv3 is needed because NFSv3 is used to initiate IO to NFSv3 knfsd on
> the same host.

That doesn't really bring us any further.  Why is it required?

I think we'll just need to stop this discussion until we have reasonable
documentation of the use cases and assumptions, because without that
we'll get hung up in dead loops.
Mike Snitzer July 5, 2024, 2:36 p.m. UTC | #34
On Fri, Jul 05, 2024 at 07:18:29AM -0700, Christoph Hellwig wrote:
> On Fri, Jul 05, 2024 at 10:15:46AM -0400, Mike Snitzer wrote:
> > NFSv3 is needed because NFSv3 is used to initiate IO to NFSv3 knfsd on
> > the same host.
> 
> That doesn't really bring is any further.  Why is it required?
> 
> I think we'll just need to stop this discussion until we have reasonable
> documentation of the use cases and assumptions, because without that
> we'll get hund up in dead loops.

It _really_ isn't material to the core capability that localio provides.
localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
containers).

Hammerspace needs localio to work with NFSv3 to assist with its "data
movers" that run on the host (using nfs and nfsd).

Please just remove yourself from the conversation if you cannot make
sense of this.  If you'd like to be involved, put the work in to
understand the code and be professional.
Chuck Lever July 5, 2024, 2:59 p.m. UTC | #35
> On Jul 5, 2024, at 10:36 AM, Mike Snitzer <snitzer@kernel.org> wrote:
> 
> On Fri, Jul 05, 2024 at 07:18:29AM -0700, Christoph Hellwig wrote:
>> On Fri, Jul 05, 2024 at 10:15:46AM -0400, Mike Snitzer wrote:
>>> NFSv3 is needed because NFSv3 is used to initiate IO to NFSv3 knfsd on
>>> the same host.
>> 
>> That doesn't really bring is any further.  Why is it required?
>> 
>> I think we'll just need to stop this discussion until we have reasonable
>> documentation of the use cases and assumptions, because without that
>> we'll get hund up in dead loops.
> 
> It _really_ isn't material to the core capability that localio provides.
> localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
> containers).
> 
> Hammerspace needs localio to work with NFSv3 to assist with its "data
> movers" that run on the host (using nfs and nfsd).
> 
> Please just remove yourself from the conversation if you cannot make
> sense of this.  If you'd like to be involved, put the work in to
> understand the code and be professional.

Sorry, I can't make sense of this either, and I find the
personal attack here completely inappropriate (and a bit
hypocritical, to be honest).

I have nothing else to contribute that you won't either
dismiss or treat as a personal attack, so I can't continue
this conversation.


--
Chuck Lever
Jeff Layton July 5, 2024, 6:59 p.m. UTC | #36
On Fri, 2024-07-05 at 10:36 -0400, Mike Snitzer wrote:
> On Fri, Jul 05, 2024 at 07:18:29AM -0700, Christoph Hellwig wrote:
> > On Fri, Jul 05, 2024 at 10:15:46AM -0400, Mike Snitzer wrote:
> > > NFSv3 is needed because NFSv3 is used to initiate IO to NFSv3
> > > knfsd on
> > > the same host.
> > 
> > That doesn't really bring is any further.  Why is it required?
> > 
> > I think we'll just need to stop this discussion until we have
> > reasonable
> > documentation of the use cases and assumptions, because without
> > that
> > we'll get hund up in dead loops.
> 
> It _really_ isn't material to the core capability that localio
> provides.
> localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
> containers).
> 
> Hammerspace needs localio to work with NFSv3 to assist with its "data
> movers" that run on the host (using nfs and nfsd).
> 
> Please just remove yourself from the conversation if you cannot make
> sense of this.  If you'd like to be involved, put the work in to
> understand the code and be professional.

I disagree wholeheartedly with this statement. Christoph has raised a
very valid point. You have _not_ articulated why v3 access is important
here.

I'm aware of why it is (at least to HS), and I think there are other
valid reasons to keep v3 in the mix (as Neil has pointed out). But,
that info should be in the cover letter and changelogs. Not everyone
has insight into this, and tbqh, my understanding could be wrong.

Let's do please try to keep the discussion civil.
NeilBrown July 5, 2024, 10:08 p.m. UTC | #37
On Fri, 05 Jul 2024, Christoph Hellwig wrote:
> On Thu, Jul 04, 2024 at 02:31:46PM -0400, Mike Snitzer wrote:
> > Some new layout misses the entire point of having localio work for
> > NFSv3 and NFSv4.  NFSv3 is very ubiquitous.
> 
> I'm getting tird of bringing up this "oh NFSv3" again and again without
> any explanation of why that matters for communication insides the
> same Linux kernel instance with a kernel that obviously requires
> patching.  Why is running an obsolete protocol inside the same OS
> instance required.  Maybe it is, but if so it needs a very good
> explanation.

I would like to see a good explanation for why NOT NFSv3.
I don't think NFSv3 is obsolete.  The first dictionary definition is "No
longer in use", which certainly doesn't apply.
I think "deprecated" is a more relevant term.  I believe that NFSv2 has
been deprecated.  I believe that NFSv4.0 should be deprecated.  But I
don't see any reason to consider NFSv3 to be deprecated.


> 
> > And in this localio series, flexfiles is trained to use localio.
> > (Which you apparently don't recognize or care about because nfsd
> > doesn't have flexfiles server support).
> 
> And you fail to explain why it matters.  You are trying to sell this
> code, you better have an explanation why it's complicated and convoluted
> as hell.  So far we are running in circles but there has been no clear
> explanation of use cases.

Please avoid sweeping statements like "complicated and convoluted"
without backing them up with specifics.
I don't particularly want to defend the current localio protocol, and I
certainly see a number of points which can and must be improved.  But it
isn't clear to me that the big picture is either complicated or
convoluted.  Please provide details.

Thanks,
NeilBrown
Mike Snitzer July 6, 2024, 3:58 a.m. UTC | #38
On Fri, Jul 05, 2024 at 02:59:31PM +0000, Chuck Lever III wrote:
> 
> 
> > On Jul 5, 2024, at 10:36 AM, Mike Snitzer <snitzer@kernel.org> wrote:
> > 
> > On Fri, Jul 05, 2024 at 07:18:29AM -0700, Christoph Hellwig wrote:
> >> On Fri, Jul 05, 2024 at 10:15:46AM -0400, Mike Snitzer wrote:
> >>> NFSv3 is needed because NFSv3 is used to initiate IO to NFSv3 knfsd on
> >>> the same host.
> >> 
> >> That doesn't really bring is any further.  Why is it required?
> >> 
> >> I think we'll just need to stop this discussion until we have reasonable
> >> documentation of the use cases and assumptions, because without that
> >> we'll get hund up in dead loops.
> > 
> > It _really_ isn't material to the core capability that localio provides.
> > localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
> > containers).
> > 
> > Hammerspace needs localio to work with NFSv3 to assist with its "data
> > movers" that run on the host (using nfs and nfsd).
> > 
> > Please just remove yourself from the conversation if you cannot make
> > sense of this.  If you'd like to be involved, put the work in to
> > understand the code and be professional.
> 
> Sorry, I can't make sense of this either, and I find the
> personal attack here completely inappropriate (and a bit
> hypocritical, to be honest).

Hi Chuck,

I'm out-gunned with this good-cop/bad-cop dynamic.  I was replying to
Christoph.  Who has taken to feign incapable of understanding localio
yet is perfectly OK with flexing like he is an authority on the topic.

He rallied to your Nacked-By with his chest puffed up and has
proceeded to baselessly shit-talk (did you miss his emails while we
slept last night?).  Yes, let's condone and encourage more of that!?
No, I won't abide such toxicity.  But thankfully Neil has since called
for him to stop.  Alas...

Earlier today I answered the question about "why NFSv3?" in simple
terms.  You and Christoph rejected it.  I'm _not_ being evasive.
There isn't a lot to it: "More efficient NFS in containers" _is_ the
answer.

But hopefully this email settles "why NFSv3?".  If not, please help me
(or others) further your understanding by reframing your NFSv3 concern
in terms other than "why NFSv3?".  Getting a bit like having to answer
"why is water wet?"

Why NFSv3?
----------

The localio feature improves IO performance when using NFSv3 and NFSv4
with containers.  Hammerspace has immediate need for the NFSv3
support, because its "data movers" use NFSv3, but NFSv4 support is
expected to be useful in the future.

Just because Hammerspace is very invested in pNFS doesn't mean all
aspects are framed in terms of it.

General statement:
------------------

I wrote maybe ~30% of the entire localio code as it stands at "v11";
that work was focused primarily on adding NFSv4 support and developing
the localio protocol, hooking it into NFS's client initialization and
teardown along with the server's (and vice versa, handling nfsd lifetime
for container applications: tearing down nfsd in a container while an
nfs client is still actively connected from the host).  Neil helped
refine the localio
protocol part, and he has also looked critically at many aspects and
has a great list of improvements that are needed.  Jeff provided
top-notch review of my initial use of SRCU and later the percpu refcnt
for interlocking with the client and server.
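
For anyone not tracking the earlier revisions, here is a minimal
sketch of that percpu_ref interlock pattern.  It is illustrative
only (the localio_* names are placeholders, not the series' actual
symbols), but it shows the shape of it: the client side takes a
cheap per-cpu reference before touching server state, and server
teardown kills the ref and waits for in-flight users to drain.

/* illustrative sketch - localio_* names are placeholders, not the series' symbols */
#include <linux/percpu-refcount.h>
#include <linux/completion.h>

static struct percpu_ref localio_ref;
static DECLARE_COMPLETION(localio_drained);

/* called once the last reference is dropped after percpu_ref_kill() */
static void localio_ref_release(struct percpu_ref *ref)
{
	complete(&localio_drained);
}

int localio_init(void)
{
	return percpu_ref_init(&localio_ref, localio_ref_release,
			       0, GFP_KERNEL);
}

/* fast path: only proceed if the server side is still live */
bool localio_enter(void)
{
	return percpu_ref_tryget_live(&localio_ref);
}

void localio_exit(void)
{
	percpu_ref_put(&localio_ref);
}

/* teardown: refuse new entries, wait for existing ones, then free */
void localio_shutdown(void)
{
	percpu_ref_kill(&localio_ref);
	wait_for_completion(&localio_drained);
	percpu_ref_exit(&localio_ref);
}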

My point: others wrote the majority of localio (years ago).  I'm just
trying to shepherd it upstream in an acceptable form.  And yes,
localio supporting both NFSv3 and NFSv4 is important to me,
Hammerspace and anyone who'd like more efficient IO with both NFSv3
and NFSv4 in containers.

Answering "Why NFSv3?" with questions:
--------------------------------------

1) Why wasn't general NFS localio bypass controversial 3 weeks ago?
Instead (given all inputs, NFSv3 support requirement being one of
them) the use of a "localio protocol" got broad consensus and buy-in
from you, Jeff, and Neil.

I _thought_ we all agreed localio was a worthwhile _auxiliary_
addition to Linux's NFS client and server (to give them awareness of
each other through nfs_common) regardless of NFS protocol version.
That is why I registered a localio RPC program number with IANA (at
your suggestion, you were cc'd when I applied for it, and you are
named on IANA.org along with Trond and myself for the program number
IANA assigned):
https://www.iana.org/assignments/rpc-program-numbers/rpc-program-numbers.txt

$ cat rpc-program-numbers.txt | egrep 'Snitzer|Myklebust|Lever'
   Linux Kernel Organization                  400122         nfslocalio                   [Mike_Snitzer][Trond_Myklebust][Chuck_Lever]
   [Chuck_Lever]           Chuck Lever           mailto:chuck.lever&oracle.com  2024-06-20
   [Mike_Snitzer]          Mike Snitzer          mailto:snitzer&kernel.org      2024-06-20
   [Trond_Myklebust]       Trond Myklebust       mailto:trondmy&hammerspace.com 2024-06-20

2) If we're introducing a general NFS localio bypass feature _and_
NFSv3 is important to the stakeholder proposing the feature _and_
NFSv3 support is easily implemented and supported: then why do you
have such concern about localio supporting NFSv3?

3) Why do you think NFSv3 is unworthy?  Is this just a bellwether for
broader opposition to flexfiles (and its encouraging more use of
NFSv3)?  Flexfiles is at the heart of NFSv3 use at Hammerspace.  I've
come to understand from you and Christoph that the lack of flexfiles
support in NFSD helps fuel dislike for flexfiles.  That's a lot for me
to unpack, and pretty far removed from "why NFSv3?", so I'd need more
context than I have if Hammerspace's use of flexfiles is what is
fueling your challenge of localio's NFSv3 support.

...

Reiterating and then expanding on my email above:

  localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
  containers).

  Hammerspace needs localio to work with NFSv3 to assist with its
  "data movers" that run on the host (using nfs [on host] and nfsd
  [within container]).

Now you can ask why _that_ is... but it really is pretty disjoint from
the simple matter of ensuring localio supports both NFSv3 and NFSv4.

I've shared that Hammerspace's "data movers" use NFSv3 currently; in
the future they could use NFSv4 as needed.  Hence the desire to
support localio with both NFSv3 and NFSv4.  [When I picked up the
localio code, NFSv4 wasn't even supported yet.]

I _hope_ I've now answered "why NFSv3?" clearly.

> I have nothing else to contribute that you won't either
> dismiss or treat as a personal attack, so I can't continue
> this conversation.

That isn't even a little bit fair... but I'm not taking the bait.

Neil has been wonderful to work with and I look forward to all future
work with him (localio and beyond).  I am not trying to do anything
out of line with this feature.  I am and have been actively working 
with you, Neil and Jeff for over a month now.  I've adapted and
learned, _with_ your and others' help, to the best of my ability.

I'm trying here, maybe you could say "I'm trying too hard".  Well I
just started a new job with Hammerspace after working for Red Hat for
the past 15 years (much of my time spent as the upstream Linux DM
maintainer -- but you know this).  I am a capable engineer and I've
proposed the upstreaming of a localio feature that would do well to
land upstream.  I've done so in a proficient way all things
considered, always happy to learn new things and improve.  I need to
work with you.  Hopefully well, and hopefully I can earn your respect,
please just know I'm merely trying to improve NFS.

Hammerspace would like to get all its Linux kernel NFS innovation
upstream.  And I'm trying to do that.  localio is my first task and
I've been working on it with focus for the past 2 months since joining
Hammerspace.  But you basically know all this, I said all of it to you
at LSF.

So if you know all these things (I _know_ you do), why are you
treating me in this way?  I feel like I'm caught in the middle of some
much bigger divide than anything I've been involved with, caused or
made privy to.

Guess the messenger gets shot sometimes.

Mike
NeilBrown July 6, 2024, 5:52 a.m. UTC | #39
On Sat, 06 Jul 2024, Mike Snitzer wrote:
> On Fri, Jul 05, 2024 at 02:59:31PM +0000, Chuck Lever III wrote:
> > 
> > 
> > > On Jul 5, 2024, at 10:36 AM, Mike Snitzer <snitzer@kernel.org> wrote:
> > > 
> > > On Fri, Jul 05, 2024 at 07:18:29AM -0700, Christoph Hellwig wrote:
> > >> On Fri, Jul 05, 2024 at 10:15:46AM -0400, Mike Snitzer wrote:
> > >>> NFSv3 is needed because NFSv3 is used to initiate IO to NFSv3 knfsd on
> > >>> the same host.
> > >> 
> > >> That doesn't really bring is any further.  Why is it required?
> > >> 
> > >> I think we'll just need to stop this discussion until we have reasonable
> > >> documentation of the use cases and assumptions, because without that
> > >> we'll get hund up in dead loops.
> > > 
> > > It _really_ isn't material to the core capability that localio provides.
> > > localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
> > > containers).
> > > 
> > > Hammerspace needs localio to work with NFSv3 to assist with its "data
> > > movers" that run on the host (using nfs and nfsd).
> > > 
> > > Please just remove yourself from the conversation if you cannot make
> > > sense of this.  If you'd like to be involved, put the work in to
> > > understand the code and be professional.
> > 
> > Sorry, I can't make sense of this either, and I find the
> > personal attack here completely inappropriate (and a bit
> > hypocritical, to be honest).
> 
> Hi Chuck,
> 
> I'm out-gunned with this good-cop/bad-cop dynamic.  I was replying to
> Christoph.  Who has taken to feign incapable of understanding localio
> yet is perfectly OK with flexing like he is an authority on the topic.

Ad Hominem doesn't achieve anything useful.  Please stick with technical
arguments.  (They are the only ones I understand).

NeilBrown
Christoph Hellwig July 6, 2024, 5:58 a.m. UTC | #40
On Fri, Jul 05, 2024 at 11:58:56PM -0400, Mike Snitzer wrote:
> I'm out-gunned with this good-cop/bad-cop dynamic.  I was replying to
> Christoph.  Who has taken to feign incapable of understanding localio
> yet is perfectly OK with flexing like he is an authority on the topic.

Hi Mike,

please take a few days off and relax, and then write an actual use case
and requirements document.  I'm out of this thread for now, but I'd
appreciate it if you'd just restart, assuming no one is acting in bad
faith, and try to explain what you are doing and why without getting
upset.
Christoph Hellwig July 6, 2024, 6:02 a.m. UTC | #41
On Sat, Jul 06, 2024 at 08:08:07AM +1000, NeilBrown wrote:
> I would like to see a good explanation for why NOT NFSv3.
> I don't think NFSv3 is obsolete.  The first dictionary is "No longer in
> use." which certainly doesn't apply.
> I think "deprecated" is a more relevant term.  I believe that NFSv2 has
> been deprecated.  I believe that NFSv4.0 should be deprecated.  But I
> don't see any reason to consider NFSv3 to be deprecated.

The obvious answer is that NFSv4.1/2 (which is really the same thing)
is the only version of NFS under development and open for new features
at the protocol level.  So from the standardization perspective NFSv3
is obsolete.

But the more important point is that NFSv4 has a built-in way to bypass
the server for I/O, namely pNFS.  And bypassing the server by directly
going to a local file system is the text book example for what pNFS
does.  So we'll need a really good argument why we need to reinvent
a different scheme for bypassing the server for I/O.  Maybe there is
a really good killer argument for doing that, but it needs to be clearly
stated and defended instead of assumed.
NeilBrown July 6, 2024, 6:37 a.m. UTC | #42
On Sat, 06 Jul 2024, Christoph Hellwig wrote:
> On Sat, Jul 06, 2024 at 08:08:07AM +1000, NeilBrown wrote:
> > I would like to see a good explanation for why NOT NFSv3.
> > I don't think NFSv3 is obsolete.  The first dictionary is "No longer in
> > use." which certainly doesn't apply.
> > I think "deprecated" is a more relevant term.  I believe that NFSv2 has
> > been deprecated.  I believe that NFSv4.0 should be deprecated.  But I
> > don't see any reason to consider NFSv3 to be deprecated.
> 
> The obvious answer is that NFSv4.1/2 (which is really the same thing)
> is the only version of NFS under development and open for new features
> at the protocol level.  So from the standardization perspective NFSv3
> is obsolete.

RFC-1813 is certainly obsolete from a standardization perspective - it
isn't even an IETF standard - only informational.  It can't be extended
with any hope of interoperability between implementations.

But we don't want interoperability between implementations.  We want to
enhance the internal workings of one particular implementation.  I don't
see that the standards status affects that choice.

> 
> But the more important point is that NFSv4 has a built-in way to bypass
> the server for I/O namely pNFS.  And bypassing the server by directly
> going to a local file system is the text book example for what pNFS
> does.  So we'll need a really good argument why we need to reinvented
> a different scheme for bypassing the server for I/O.  Maybe there is
> a really good killer argument for doing that, but it needs to be clearly
> stated and defended instead of assumed.

Could you provide a reference to the text book - or RFC - that describes
a pNFS DS protocol that completely bypasses the network, allowing the
client and MDS to determine if they are the same host and to potentially
do zero-copy IO.

If not, I will find it hard to understand your claim that it is "the
text book example".

Also, neither you nor I are in a position to assert what is needed for a
change to get accepted.  That is up to the people with M: in front of
their email address.  I believe that any counsel that either of us gives
will be considered with genuine interest, but making demands seems out of
place.

Thanks,
NeilBrown
Christoph Hellwig July 6, 2024, 6:42 a.m. UTC | #43
On Sat, Jul 06, 2024 at 04:37:22PM +1000, NeilBrown wrote:
> > a different scheme for bypassing the server for I/O.  Maybe there is
> > a really good killer argument for doing that, but it needs to be clearly
> > stated and defended instead of assumed.
> 
> Could you provide a reference to the text book - or RFC - that describes
> a pNFS DS protocol that completely bypasses the network, allowing the
> client and MDS to determine if they are the same host and to potentially
> do zero-copy IO.

I did not say that we have the exact same functionality available and
there is no work to do at all, just that it is the standard way to bypass
the server.

RFC 5662, RFC 5663 and RFC 8154 specify layouts that completely bypass
the network and require the client and server to find out that they talk
to the same storage device, and directly perform zero-copy I/O.
They do not require them to be on the same host, though.

> If not, I will find it hard to understand your claim that it is "the
> text book example".

pNFS is all about handing out grants to bypass the server for I/O.
That is exactly what localio is doing.
Mike Snitzer July 6, 2024, 1:05 p.m. UTC | #44
Earlier yesterday I answered the question about "why NFSv3?" in simple
terms.  Chuck and Christoph rejected it.  I'm _not_ being evasive.
There isn't a lot to it: "More efficient NFS in containers" _is_ the
answer.

But hopefully this email settles "why NFSv3?".  If not, please help me
(or others) further your understanding by reframing your NFSv3 concern
in terms other than "why NFSv3?".

Why NFSv3?
----------

The localio feature improves IO performance when using NFSv3 and NFSv4
with containers.  Hammerspace has immediate need for the NFSv3
support, because its "data movers" use NFSv3, but NFSv4 support is
expected to be useful in the future.

Why not use pNFS?
-----------------

Just because Hammerspace is very invested in pNFS doesn't mean all
aspects are framed in terms of it.

The complexity of a pNFS-based approach to addressing localio makes it
inferior to the proposed solution of an autonomous NFS client and
server rendezvous to allow for network bypass.  There is no need for
pNFS, and by not using pNFS the localio feature is accessible for
general NFSv3 and NFSv4 use.

Answering "Why NFSv3?" with questions:
--------------------------------------

1) Why wasn't general NFS localio bypass controversial 3 weeks ago?
Instead (given all inputs, NFSv3 support requirement being one of
them) the use of a "localio protocol" got broad consensus and buy-in
from Chuck, Jeff, and Neil.

I _thought_ we all agreed localio was a worthwhile _auxiliary_
addition to Linux's NFS client and server (to give them awareness of
each other through nfs_common) regardless of NFS protocol version.
That is why I registered a localio RPC program number with IANA, at
Chuck's suggestion, see:
https://www.iana.org/assignments/rpc-program-numbers/rpc-program-numbers.txt

$ cat rpc-program-numbers.txt | egrep 'Snitzer|Myklebust|Lever'
   Linux Kernel Organization                  400122         nfslocalio                   [Mike_Snitzer][Trond_Myklebust][Chuck_Lever]
   [Chuck_Lever]           Chuck Lever           mailto:chuck.lever&oracle.com  2024-06-20
   [Mike_Snitzer]          Mike Snitzer          mailto:snitzer&kernel.org      2024-06-20
   [Trond_Myklebust]       Trond Myklebust       mailto:trondmy&hammerspace.com 2024-06-20

2) If we're introducing a general NFS localio bypass feature _and_
NFSv3 is important to the stakeholder proposing the feature _and_
NFSv3 support is easily implemented and supported: then why do you
have such concern about localio supporting NFSv3?

3) Why do you think NFSv3 is unworthy?  Is this just a bellwether for
broader opposition to flexfiles (and its encouraging more use of
NFSv3)?  Flexfiles is at the heart of NFSv3 use at Hammerspace.  I've
come to understand from Chuck and Christoph that the lack of flexfiles
support in NFSD helps fuel dislike for flexfiles.  That's a lot for me
to unpack, and pretty far removed from "why NFSv3?", so I'd need more
context than I have if Hammerspace's use of flexfiles is what is
fueling your challenge of localio's NFSv3 support.

...

Reiterating and then expanding on my earlier succinct answer:

  localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
  containers).

  Hammerspace needs localio to work with NFSv3 to assist with its
  "data movers" that run on the host (using nfs [on host] and nfsd
  [within container]).

Now you can ask why _that_ is... but it really is pretty disjoint from
the simple matter of ensuring localio supports both NFSv3 and NFSv4.

I've shared that Hammerspace's "data movers" use NFSv3 currently; in
the future they could use NFSv4 as needed.  Hence the desire to
support localio with both NFSv3 and NFSv4.  [When I picked up the
localio code, NFSv4 wasn't even supported yet.]
Mike Snitzer July 6, 2024, 1:12 p.m. UTC | #45
On Fri, Jul 05, 2024 at 10:58:18PM -0700, Christoph Hellwig wrote:
> On Fri, Jul 05, 2024 at 11:58:56PM -0400, Mike Snitzer wrote:
> > I'm out-gunned with this good-cop/bad-cop dynamic.  I was replying to
> > Christoph.  Who has taken to feign incapable of understanding localio
> > yet is perfectly OK with flexing like he is an authority on the topic.
> 
> Hi Mike,
> 
> please take a few days off and relax, and then write an actual use case
> and requirements document.  I'm out of this thread for now, but I'd
> appreciate if you'd just restart, assuming no one is activing in bad
> faith and try to explain what you are doing and why without getting
> upset.

If you'd like reasonable people to not get upset with you, you might
try treating them with respect rather than gaslighting them.
Chuck Lever July 6, 2024, 4:58 p.m. UTC | #46
> On Jul 5, 2024, at 11:58 PM, Mike Snitzer <snitzer@kernel.org> wrote:
> 
> On Fri, Jul 05, 2024 at 02:59:31PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Jul 5, 2024, at 10:36 AM, Mike Snitzer <snitzer@kernel.org> wrote:
>>> 
>>> On Fri, Jul 05, 2024 at 07:18:29AM -0700, Christoph Hellwig wrote:
>>>> On Fri, Jul 05, 2024 at 10:15:46AM -0400, Mike Snitzer wrote:
>>>>> NFSv3 is needed because NFSv3 is used to initiate IO to NFSv3 knfsd on
>>>>> the same host.
>>>> 
>>>> That doesn't really bring is any further.  Why is it required?
>>>> 
>>>> I think we'll just need to stop this discussion until we have reasonable
>>>> documentation of the use cases and assumptions, because without that
>>>> we'll get hund up in dead loops.
>>> 
>>> It _really_ isn't material to the core capability that localio provides.
>>> localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
>>> containers).
>>> 
>>> Hammerspace needs localio to work with NFSv3 to assist with its "data
>>> movers" that run on the host (using nfs and nfsd).
>>> 
>>> Please just remove yourself from the conversation if you cannot make
>>> sense of this.  If you'd like to be involved, put the work in to
>>> understand the code and be professional.
>> 
>> Sorry, I can't make sense of this either, and I find the
>> personal attack here completely inappropriate (and a bit
>> hypocritical, to be honest).
> 
> Hi Chuck,
> 
> I'm out-gunned with this good-cop/bad-cop dynamic.  I was replying to
> Christoph.  Who has taken to feign incapable of understanding localio
> yet is perfectly OK with flexing like he is an authority on the topic.

Well let's try a reality test.

Christoph has authored an IETF RFC on pNFS. He's also contributed
the pNFS SCSI (and now NVMe) implementation in the Linux server
and client. He seems to know the code well enough to offer an
informed opinion.


> He rallied to your Nacked-By with his chest puffed up and has
> proceeded to baselessly shit-talk (did you miss his emails while we
> slept last night?).  Yes, let's condone and encourage more of that!?
> No, I won't abide such toxicity.  But thankfully Neil has since called
> for him to stop.  Alas...
> 
> Earlier today I answered the question about "why NFSv3?" in simple
> terms.  You and Christoph rejected it.  I'm _not_ being evassive.
> There isn't a lot to it: "More efficient NFS in containers" _is_ the
> answer.
> 
> But hopefully this email settles "why NFSv3?".  If not, please help me
> (or others) further your understanding by reframing your NFSv3 concern
> in terms other than "why NFSv3?".  Getting a bit like having to answer
> "why is water wet?"
> 
> Why NFSv3?
> ----------
> 
> The localio feature improves IO performance when using NFSv3 and NFSv4
> with containers.  Hammerspace has immediate need for the NFSv3
> support, because its "data movers" use NFSv3, but NFSv4 support is
> expected to be useful in the future.
> 
> Just because Hammerspace is very invested in pNFS doesn't mean all
> aspects are framed in terms of it.
> 
> General statement:
> ------------------
> 
> I wrote maybe ~30% of the entire localio code as it stands at "v11"
> and that was focused primarily on adding NFSv4 support and developing
> the localio protocol, hooking it into NFS's client initialization and
> teardown along with the server (and vice-versa, nfsd lifetime due to
> container applications: tearing down nfsd in container while nfs
> client actively connected from host).  Neil helped refine the localio
> protocol part, and he has also looked critically at many aspects and
> has a great list of improvements that are needed.  Jeff provided
> top-notch review of my initial use of SRCU and later the percpu refcnt
> for interlocking with the client and server.
> 
> My point: others wrote the majority of localio (years ago).  I'm just
> trying to shepherd it upstream in an acceptable form.  And yes,
> localio supporting both NFSv3 and NFSv4 is important to me,
> Hammerspace and anyone who'd like more efficient IO with both NFSv3
> and NFSv4 in containers.
> 
> Answering "Why NFSv3?" with questions:
> --------------------------------------
> 
> 1) Why wasn't general NFS localio bypass controversial 3 weeks ago?
> Instead (given all inputs, NFSv3 support requirement being one of
> them) the use of a "localio protocol" got broad consensus and buyin
> from you, Jeff, and Neil.
> 
> I _thought_ we all agreed localio was a worthwhile _auxilliary_
> addition to Linux's NFS client and server (to give them awareness of
> each other through nfs_common) regardless of NFS protocol version.
> That is why I registered a localio RPC program number with IANA (at
> your suggestion, you were cc'd when I applied for it, and you are
> named on IANA.org along with Trond and myself for the program number
> IANA assigned):
> https://www.iana.org/assignments/rpc-program-numbers/rpc-program-numbers.txt
> 
> $ cat rpc-program-numbers.txt | egrep 'Snitzer|Myklebust|Lever'
>   Linux Kernel Organization                  400122         nfslocalio                   [Mike_Snitzer][Trond_Myklebust][Chuck_Lever]
>   [Chuck_Lever]           Chuck Lever           mailto:chuck.lever&oracle.com  2024-06-20
>   [Mike_Snitzer]          Mike Snitzer          mailto:snitzer&kernel.org      2024-06-20
>   [Trond_Myklebust]       Trond Myklebust       mailto:trondmy&hammerspace.com 2024-06-20
> 
> 2) If we're introducing a general NFS localio bypass feature _and_
> NFSv3 is important to the stakeholder proposing the feature _and_
> NFSv3 support is easily implemented and supported: then why do you
> have such concern about localio supporting NFSv3?
> 
> 3) Why do you think NFSv3 unworthy?  Is this just a bellweather for
> broader opposition to flexfiles (and its encouraging more use of
> NFSv3)?  Flexfiles is at the heart of NFSv3 use at Hammerspace.  I've
> come to understand from you and Christoph that the lack of flexfiles
> support in NFSD helps fuel dislike for flexfiles.  That's a lot for me
> to unpack, and pretty far removed from "why NFSv3?", so I'd need more
> context than I have if Hammerspace's use of flexfiles is what is
> fueling your challenge of localio's NFSv3 support.
> 
> ...
> 
> Reiterating and then expanding on my email above:
> 
>  localio supporting NFSv3 is beneficial for NFSv3 users (NFSv3 in
>  containers).
> 
>  Hammerspace needs localio to work with NFSv3 to assist with its
>  "data movers" that run on the host (using nfs [on host] and nfsd
>  [within container]).
> 
> Now you can ask why _that_ is.. but it really is pretty disjoint from
> the simple matter of ensuring localio support both NFSv3 and NFSv4.
> 
> I've shared that Hammerspace's "data movers" use NFSv3 currently, in
> the future they could use NFSv4 as needed.  Hence the desire to
> support localio with both NFSv3 and NFSv4.  [when I picked up the
> localio code NFSv4 wasn't even supported yet].
> 
> I _hope_ I've now answered "why NFSv3?" clearly.
> 
>> I have nothing else to contribute that you won't either
>> dismiss or treat as a personal attack, so I can't continue
>> this conversation.
> 
> That isn't even a little bit fair... but not taking the bait.
> 
> Neil has been wonderful to work with and I look forward to all future
> work with him (localio and beyond).  I am not trying to do anything
> out of line with this feature.  I am and have been actively working 
> with you, Neil and Jeff for over a month now.  I've adapted and
> learned, _with_ your and others' help, to the best of my ability.
> 
> I'm trying here, maybe you could say "I'm trying too hard".  Well I
> just started a new job with Hammerspace after working for Red Hat for
> the past 15 years (much of my time spent as the upstream Linux DM
> maintainer -- but you know this).  I am a capable engineer and I've
> proposed the upstreaming of a localio feature that would do well to
> land upstream.  I've done so in a proficient way all things
> considered, always happy to learn new things and improve.  I need to
> work with you.  Hopefully well, and hopefully I can earn your respect,
> please just know I'm merely trying to improve NFS.
> 
> Hammerspace would like to get all its Linux kernel NFS innovation
> upstream.  And I'm trying to do that.  localio is my first task and
> I've been working on it with focus for the past 2 months since joining
> Hammerspace.  But you basically know all this, I said all of it to you
> at LSF.
> 
> So if you know all these things (I _know_ you do), why are you
> treating me in this way?  I feel like I'm caught in the middle of some
> much bigger divide than anything I've been involved with, caused or
> made privy to.

Yes, NFS is larger (much larger) and much older than the
Linux implementation of it. I'm sorry your new colleagues
have not seen fit to help you fit yourself into this
picture.


> Guess the messenger gets shot sometimes.

This is exactly the problem: your attitude of victimhood.
You act like our questions are personal attacks on you.

Answering "I don't know" or "I need to think about it" or
"Let me ask someone" or "Can you explain that further" is
perfectly fine. Acknowledging the areas where you need to
learn more is a quintessential part of being a professional
software engineer.

---

I have no strong feelings one way or another about flexfiles.

And, I remain a full-throated advocate of the NFSv3 support
in the Linux NFS stack.

If you get an impression from me, come talk to me first
to confirm it. Don't go talk to your colleagues; in
particular don't talk to the ones who like to spread a lot
of weird ideas about me. You're getting bad information.

---

Our question isn't "Why NFSv3?" It's: Can your design
document explain in detail how the one existing application
(the data mover) will use and benefit from loopback
acceleration? It needs to explain why the data mover
does not use possibly more suitable solutions like
NFSv4.2 COPY. Why are we going to the effort of adding
this sidecar instead of using facilities that are
already available?
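
For illustration, here is a minimal userspace sketch of what driving
server-side copy could look like from a data mover, using
copy_file_range(2); on an NFSv4.2 mount the Linux client can turn
this into a COPY operation so the data does not have to round-trip
through the application.  (The paths and error handling are
simplified, and whether COPY or a plain read/write fallback is used
depends on the client, mount and server.)

/* simplified sketch: copy <src> to <dst>; paths are illustrative */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}

	int in = open(argv[1], O_RDONLY);
	int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;

	if (in < 0 || out < 0 || fstat(in, &st) < 0) {
		perror("open/fstat");
		return 1;
	}

	/* let the kernel (and on NFSv4.2, the server) move the bytes */
	off_t remaining = st.st_size;
	while (remaining > 0) {
		ssize_t n = copy_file_range(in, NULL, out, NULL,
					    remaining, 0);
		if (n <= 0) {
			perror("copy_file_range");
			return 1;
		}
		remaining -= n;
	}
	return 0;
}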

We're not asking for a one sentence email. A one sentence
email is not "a response in simple terms". It is a petulant
dismissal of our request for more info.

We're asking for a real problem statement and use case,
in detail, in the design document, not in email.

(Go read the requests in this thread again. Honest, that's
all we're asking for).

---

But finally: We've asked repeatedly for very typical
changes and answers, and though sometimes we're met
with a positive response, other times we get a defensive
response, or "I don't feel like it" or "that's busy work"
or "you're wasting my time." That doesn't sound like the
spirit of co-operation that I would like to see from a
regular contributor, nor do I expect it from someone who
is also a Linux kernel maintainer who really ought to
know better.

So heed Christoph's excellent advice: go eat a Snickers.
Calm down. Breathe. None of the rest of us are anywhere
near as upset about this as you are right now.


--
Chuck Lever
Chuck Lever July 6, 2024, 5:15 p.m. UTC | #47
> On Jul 6, 2024, at 2:42 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Sat, Jul 06, 2024 at 04:37:22PM +1000, NeilBrown wrote:
>>> a different scheme for bypassing the server for I/O.  Maybe there is
>>> a really good killer argument for doing that, but it needs to be clearly
>>> stated and defended instead of assumed.
>> 
>> Could you provide a reference to the text book - or RFC - that describes
>> a pNFS DS protocol that completely bypasses the network, allowing the
>> client and MDS to determine if they are the same host and to potentially
>> do zero-copy IO.
> 
> I did not say that we have the exact same functionality available and
> there is no work to do at all, just that it is the standard way to bypass
> the server.
> 
> RFC 5662, RFC 5663 and RFC 8154 specify layouts that completely bypass
> the network and require the client and server to find out that they talk
> to the same storage devuce, and directly perform zero copy I/O.
> They do not require to be on the same host, though.
> 
>> If not, I will find it hard to understand your claim that it is "the
>> text book example".
> 
> pNFS is all about handing out grants to bypass the server for I/O.
> That is exactly what localio is doing.

In particular, Neil, a pNFS block/SCSI layout provides the
client with a set of device IDs. If the client is on the
same storage fabric as those devices, it can then access
those devices directly using SCSI commands rather than
going on the network [RFC8154].

This is equivalent to a loopback acceleration mechanism. If
the client and server are on the same host, then there are
natural ways to expose the devices to both peers, and the
existing pNFS protocol and SCSI Persistent Reservation
provide strong access authorization.

Both the Linux NFS client and server implement RFC 8154
well enough that this could be an alternative or even a
better solution than LOCALIO. The server stores an XFS
file system on the devices, and hands out layouts with
the device ID and LBAs of the extents where file content
is located.

The fly in this ointment is the need for NFSv3 support.

In an earlier email Mike mentioned that Hammerspace isn't
interested in providing a centrally managed directory of
block devices that could be utilized by the MDS to simply
inform the client of local devices. I don't think that's
the only possible solution for discovering the locality of
storage devices.


--
Chuck Lever
Mike Snitzer July 7, 2024, 12:42 a.m. UTC | #48
Chuck,

I think we can both agree there is no real benefit to us trading
punches and looking like fools more than we already have.

I say that not to impugn you or your position, but we look foolish.

I will pull my punches entirely (I really am; I could blow up
everything, and that'd really be career-limiting, heh).  My aim is to
say my piece in this reply and hopefully we can set this unfortunate
exchange to one side.

I'll learn what I can from it, maybe you will, maybe others will...

On Sat, Jul 06, 2024 at 04:58:50PM +0000, Chuck Lever III wrote:
> 
> 
> > On Jul 5, 2024, at 11:58 PM, Mike Snitzer <snitzer@kernel.org> wrote:
> > 
> > 
> > Hi Chuck,
> > 
> > I'm out-gunned with this good-cop/bad-cop dynamic.  I was replying to
> > Christoph.  Who has taken to feign incapable of understanding localio
> > yet is perfectly OK with flexing like he is an authority on the topic.
> 
> Well let's try a reality test.

Perception is reality, so your reality is different from mine.

Neil, Christoph and yourself all took my last "Who has taken to feign"
sentence above as some ad-hominem attack.  It wasn't; Christoph was
acting like the localio code is incomprehensible, for effect.
 
> Christoph has authored an IETF RFC on pNFS. He's also contributed
> the pNFS SCSI (and now NVMe) implementation in the Linux server
> and client. He seems to know the code well enough to offer an
> informed opinion.

I am _not_ questioning (and _never_ have questioned) Christoph's
extensive contributions, experience or intelligence.

I am calling out that Christoph didn't actually review the localio
code but proceeded to make extreme baseless negative judgments of it.

And you've glossed over that entirely and made it about "Christoph is
Christoph, who are you again?".  I'm well aware of who he is, but I'm
saying baseless negative judgments are also a very regular occurrence
from him when he decides he just wants to blow up what I am doing
(could be he has done this to others, I have no idea).  It isn't about
the technical details in the moment he does this; if he says it, it
must be true, and the aim is to taint my work.  Sad, because I thought
that dynamic died when I finally relented in our feud over NVMe vs DM
multipath (which spanned 2018 to 2021).

But damnit, it is happening all over again, now in the context of
localio. And you're buying it because he is parroting your concerns
about "why can't pNFS be used instead!?"

Neil won't get an answer after having called Christoph out on this
(unless Christoph now wants to try to make a liar out of me by
fabricating something well after his moment to do so has passed):
https://marc.info/?l=linux-nfs&m=172021727813076&w=2

If Christoph could arrest his propensity to do harm where it is not
warranted, I'd take every piece of critical technical feedback very
seriously and adjust the code as needed.  When it is about the code,
_actually_ about the code... Christoph rules.

He knows this, happy to say it (again): I respect his technical
ability, etc.  I do _not_ respect him making blanket statements about 
code without saying _why_ he arrived at his judgment.

<snip many words from me answering "why NFSv3?">

> > Hammerspace would like to get all its Linux kernel NFS innovation
> > upstream.  And I'm trying to do that.  localio is my first task and
> > I've been working on it with focus for the past 2 months since joining
> > Hammerspace.  But you basically know all this, I said all of it to you
> > at LSF.
> > 
> > So if you know all these things (I _know_ you do), why are you
> > treating me in this way?  I feel like I'm caught in the middle of some
> > much bigger divide than anything I've been involved with, caused or
> > made privy to.
> 
> Yes, NFS is larger (much larger) and much older than the
> Linux implementation of it. I'm sorry your new colleagues
> have not seen fit to help you fit yourself into this
> picture.

See, a non-sequitur that takes shots at Hammerspace isn't professional.

Over the numerous iterations of localio there have been a handful of
times you have taken to name-drop "Hammerspace" with a negative
connotation, simmering with contempt.  As if I and others should feel
shame by association.

They haven't done anything wrong here.  Trond hasn't done anything
wrong here.  Whatever your grievance with Hammerspace, when interacting
with me, please know I don't have any context for it.  It is immaterial
to me and I don't need to know.  If/when you have individual technical
problems with something Hammerspace is doing let's just work it
through without it needing to be this awkward elephant.

> > Guess the messenger gets shot sometimes.
> 
> This is exactly the problem: your attitude of victimhood.

Statements like that show you're going out of your way to make an
enemy of me with no _real_ basis.  But let me be clear: I have
absolutely been a victim of bullying and serious psychological attacks
over this past week, particularly repeated gaslighting by both you and
Christoph.  That you're so unaware of how you have spun a false
narrative and are now tag-teaming with Christoph to attack me further
has been apparent for all to see. Hopefully linux-nfs is low traffic ;)

You have obviously formed an unfavorable opinion of me; the apparent
turning point was my v11 patch header and the ensuing negative exchange. I
own being too pushy when seeking progress on localio v11's inclusion
for 6.11; I apologized for that.

Hopefully we can actually get past all of this.

> You act like our questions are personal attacks on you.

No, I act like your personal attacks are personal attacks.  Questions
and technical issues are always fair game.  I pushed back on _one_
of your NFSD requirements (tracepoints); sorry if that was some
indication that I'm a malcontent... but I'm not.  I can have a
technical opinion though, so I may make it known at times.

> Answering "I don't know" or "I need to think about it" or
> "Let me ask someone" or "Can you explain that further" is
> perfectly fine. Acknowledging the areas where you need to
> learn more is a quintessential part of being a professional
> software engineer.

I'm not afraid to admit when I don't know something (I've said it to
you when we met at LSF).  I'm in no way an expert in NFS; you are.  I
respect your command of NFS and welcome learning from you (like I have
to this point and hope to in the future).

> ---
> 
> I have no strong feelings one way or another about flexfiles.
> 
> And, I remain a full-throated advocate of the NFSv3 support
> in the Linux NFS stack.
> 
> If you get an impression from me, come talk to me first
> to confirm it. Don't go talk to your colleagues; in
> particular don't talk to the ones who like to spread a lot
> of weird ideas about me. You're getting bad information.

I haven't ever heard from anyone at Hammerspace about you.  And I
haven't sought insight about you either.  AFAIK you aren't on anyone's
mind within Hammerspace (other than mine, given that I'm actively
dealing with this unfortunate situation).

> ---
> 
> Our question isn't "Why NFSv3?" It's: Can your design
> document explain in detail how the one existing application
> (the data mover) will use and benefit from loopback
> acceleration? It needs to explain why the data mover
> does not use possibly more suitable solutions like
> NFSv4.2 COPY. Why are we going to the effort of adding
> this side car instead of using facilities that are
> already available?
> 
> We're not asking for a one sentence email. A one sentence
> email is not "a response in simple terms". It is a petulant
> dismissal of our request for more info.

Well, I in no way intended it to be petulant.  I was trying to reduce
the attack surface.  And if you'd take me at my word, that'd go a long
way.

Try to take a step back and trust me when I tell you something.
You're welcome to be unfulfilled by an answer and seek clarity, but I
promise you I'm not evasive or leaving information on the floor when
I answer questions.  If I don't know something I seek the details out.

But you and Christoph are making a simple line of development
(localio) into some referendum on design choices in layers you have no
charter to concern yourself with (Hammerspace's data movers). You
still cannot accept that localio is devoid of pNFS use case
requirements (as required by Hammerspace) but still see fit to
reengineer the feature in terms of pNFS.

Hammerspace simply wants to optimize an NFS client on a host
connecting to an NFSD running in a container on the same host.  It is
doing that in service to its distributed namespace that it is hosting
in disparate NFSD instances.  But the "data movers" are a sideband
remapping/reshaping/rebalancing service, completely disjoint from the
distributed pNFS namespace that the primary (flexfiles) clients are
accessing.

> We're asking for a real problem statement and use case,
> in detail, in the design document, not in email.
> 
> (Go read the requests in this thread again. Honest, that's
> all we're asking for).

I'll consult with Trond to try to see how he suggests appeasing your
request more than I already have.  But you really are departing
heavily from the narrow scope that localio covers.

> ---
> 
> But finally: We've asked repeatedly for very typical
> changes and answers, and though sometimes we're met
> with a positive response, other times we get a defensive
> response, or "I don't feel like it" or "that's busy work"
> or "you're wasting my time." That doesn't sound like the
> spirit of co-operation that I would like to see from a
> regular contributor, nor do I expect it from someone who
> is also a Linux kernel maintainer who really ought to
> know better.

Projecting attributes and actions onto me doesn't make them true. I
haven't done or said most of the things you just asserted there.

I pushed back on tracepoints, but in the end I removed all the
dprintk()s from the NFSD code.

> So heed Christoph's excellent advice: go eat a Snickers.
> Calm down. Breathe. None of the rest of us are anywhere
> near as upset about this as you are right now.

Your words say otherwise; they have become quite negatively charged, to
the point of doing harm, in the last week.  But please, let's just move
on.  I'm serious: no lasting harm done from my vantage point, and I can
move on if you can.

Thanks,
Mike
NeilBrown July 8, 2024, 4:03 a.m. UTC | #49
On Sat, 06 Jul 2024, Christoph Hellwig wrote:
> On Sat, Jul 06, 2024 at 04:37:22PM +1000, NeilBrown wrote:
> > > a different scheme for bypassing the server for I/O.  Maybe there is
> > > a really good killer argument for doing that, but it needs to be clearly
> > > stated and defended instead of assumed.
> > 
> > Could you provide a reference to the text book - or RFC - that describes
> > a pNFS DS protocol that completely bypasses the network, allowing the
> > client and MDS to determine if they are the same host and to potentially
> > do zero-copy IO.
> 
> I did not say that we have the exact same functionality available and
> there is no work to do at all, just that it is the standard way to bypass
> the server.

Sometimes what you don't say is important.  As you acknowledge there is
work to do.  Understanding how much work is involved is critical to
understanding that possible direction.

> 
> RFC 5662, RFC 5663 and RFC 8154 specify layouts that completely bypass
> the network and require the client and server to find out that they talk
> to the same storage devuce, and directly perform zero copy I/O.
> They do not require to be on the same host, though.

Thanks.

> 
> > If not, I will find it hard to understand your claim that it is "the
> > text book example".
> 
> pNFS is all about handing out grants to bypass the server for I/O.
> That is exactly what localio is doing.

Yes, there is clearly an alignment.

But pNFS is about handing out grants using standardised protocols that
support interoperability between distinct nodes, and possibly distinct
implementations.  localio doesn't need any of that.  It all exists in a
single implementation on a single node.  So in that sense there can be
expected to be different priorities.

Why should we pay the costs of pNFS when implementing localio?  That
question can only be answered if we have a good understanding of the
costs and benefits.  And that requires having a concrete proposal for
the "pNFS" option - if only a detailed sketch.

Just because pNFS could be part of the answer (which I don't dispute)
that doesn't necessarily mean it should be part of the answer.

Thanks,
NeilBrown
NeilBrown July 8, 2024, 4:10 a.m. UTC | #50
On Sun, 07 Jul 2024, Chuck Lever III wrote:
> 
> Both the Linux NFS client and server implement RFC 8154
> well enough that this could be an alternative or even a
> better solution than LOCALIO. The server stores an XFS
> file system on the devices, and hands out layouts with
> the device ID and LBAs of the extents where file content
> is located.
> 
> The fly in this ointment is the need for NFSv3 support.

Another fly in this ointment is that only XFS currently implements that
.map_blocks export_operation, so only it could be used as a server-side
filesystem.
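
For concreteness, this is roughly what a server-side filesystem has to
provide (beyond the usual filehandle hooks) before nfsd can hand out
block layouts for it.  The foofs_* names below are placeholders, not
real code - the actual prototypes live in include/linux/exportfs.h and
the only in-tree implementation today is in fs/xfs/xfs_pnfs.c
(xfs_fs_map_blocks() and friends):

#include <linux/exportfs.h>

/* foofs_* are hypothetical implementations of the exportfs prototypes */
static const struct export_operations foofs_export_ops = {
	.fh_to_dentry	= foofs_fh_to_dentry,	/* standard export hooks */
	.fh_to_parent	= foofs_fh_to_parent,

	/* pNFS block layout hooks - currently only XFS implements these */
	.get_uuid	= foofs_get_uuid,	/* identify the backing device */
	.map_blocks	= foofs_map_blocks,	/* file range -> device extents */
	.commit_blocks	= foofs_commit_blocks,	/* commit client-written extents */
};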

Maybe that would not be a barrier to Mike, but it does make it a lot
less interesting to me (not that I have a particular use case in mind,
but I just find "local bypass for NFSv4.1+ on XFS" less interesting than
"local bypass for NFS on Linux").

But my interest isn't a requirement of course.

> 
> In an earlier email Mike mentioned that Hammerspace isn't
> interested in providing a centrally managed directory of
> block devices that could be utilized by the MDS to simply
> inform the client of local devices. I don't think that's
> the only possible solution for discovering the locality of
> storage devices.

Could you sketch out an alternate solution so that it can be assessed
objectively? 

Thanks,
NeilBrown
Christoph Hellwig July 8, 2024, 9:37 a.m. UTC | #51
On Mon, Jul 08, 2024 at 02:03:02PM +1000, NeilBrown wrote:
> > I did not say that we have the exact same functionality available and
> > there is no work to do at all, just that it is the standard way to bypass
> > the server.
> 
> Sometimes what you don't say is important.  As you acknowledge there is
> work to do.  Understanding how much work is involved is critical to
> understanding that possible direction.

Of course there is.  I've never said we don't need to do any work;
I'm just asking why we are not using the existing infrastructure to do
it.

> But pNFS is about handing out grants using standardised protocols that
> support interoperability between distinct nodes, and possibly distinct
> implementations.  localio doesn't need any of that.  It all exists in a
> single implementation on a single node.  So in that sense there can be
> expected to be different priorities.
> 
> Why should we pay the costs of pNFS when implementing localio?

Why do you think we pay a cost for it?  From all I can tell it makes
the job simpler, especially if we want to do things like bypassing
the second page cache.

> That
> question can only be answered if we have a good understanding of the
> costs and benefits.  And that requires having a concrete proposal for
> the "pNFS" option - if only a detailed sketch.

I sketched the very sketchy sketch earlier - add a new localio
layout that does local file I/O.  The I/O side of that is pretty
trivial, and maybe I can find some time to write draft code.  The file
open side is just as horrible as in the current localio proposal,
and I could just reuse that for now, although I think the concept
of opening the file in the client context is fundamentally wrong
no matter how we skin the cat.
Christoph Hellwig July 8, 2024, 9:40 a.m. UTC | #52
On Sat, Jul 06, 2024 at 05:15:08PM +0000, Chuck Lever III wrote:
> In an earlier email Mike mentioned that Hammerspace isn't
> interested in providing a centrally managed directory of
> block devices that could be utilized by the MDS to simply
> inform the client of local devices. I don't think that's
> the only possible solution for discovering the locality of
> storage devices.

Btw, no matter what Hammerspace feels (and why that matters for Linux),
the block layout is not a good idea for bypassing the network between
supposedly isolated containers.  It completely bypasses the NFS
security model, which is a spectacularly bad idea if the clients aren't
fully trusted.  I've mentioned this before, but I absolutely do not
advocate for using the block layout as a network bypass here.

The concept of doing local file I/O from the client in cases where we
can do it is absolutely sensible.  I just don't think doing it in
a magic unmanaged layer is a good idea, and figuring out how to
pass the opened file from nfsd to the client without risking security
problems and creating painful layering violations needs to be solved.
Christoph Hellwig July 8, 2024, 9:46 a.m. UTC | #53
On Sat, Jul 06, 2024 at 09:12:44AM -0400, Mike Snitzer wrote:
> If you'd like reasonable people to not get upset with you, you might
> try treating them with respect rather than gaslighting them.

Let me just repeat the calm-down offer.  I'll do my side of it by
not replying to these threads for this week; after that
I'll forget all the accusations you're throwing at me, and we can
restart a purely technical discussion, ok?
Mike Snitzer July 8, 2024, 12:55 p.m. UTC | #54
On Mon, Jul 08, 2024 at 02:46:28AM -0700, Christoph Hellwig wrote:
> On Sat, Jul 06, 2024 at 09:12:44AM -0400, Mike Snitzer wrote:
> > If you'd like reasonable people to not get upset with you, you might
> > try treating them with respect rather than gaslighting them.
> 
> Let me just repeat the calm down offer.  I'll do my side of it by
> stopping from replying to these threads for this week now, after that
> I'll forget all the accusations you're throwing at me, and we restart
> a purely technical discussion, ok?

OK
Chuck Lever July 8, 2024, 2:41 p.m. UTC | #55
> On Jul 8, 2024, at 12:10 AM, NeilBrown <neilb@suse.de> wrote:
> 
> On Sun, 07 Jul 2024, Chuck Lever III wrote:
>> 
>> Both the Linux NFS client and server implement RFC 8154
>> well enough that this could be an alternative or even a
>> better solution than LOCALIO. The server stores an XFS
>> file system on the devices, and hands out layouts with
>> the device ID and LBAs of the extents where file content
>> is located.
>> 
>> The fly in this ointment is the need for NFSv3 support.
> 
> Another fly in this ointment is that only XFS currently implements that
> .map_blocks export_operation, so only it could be used as a server-side
> filesystem.

I agree that limiting loopback acceleration only to
XFS exports is an artificial and undesirable
constraint.
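
For reference, the server-side dependency Neil points to comes down to a
handful of export_operations hooks.  Below is a rough sketch of the
gating check, assuming the hook names as declared in
include/linux/exportfs.h; the helper function itself is made up for
illustration and is not actual nfsd code.

/*
 * nfsd can only offer the block-family pNFS layouts for an export whose
 * filesystem wires up the map_blocks/commit_blocks export_operations hooks
 * (plus get_uuid for the old block-volume layout).  Today only XFS
 * implements them, via xfs_fs_map_blocks() and friends.
 */
#include <linux/fs.h>
#include <linux/exportfs.h>

static bool can_offer_block_layout(struct super_block *sb)
{
	const struct export_operations *eop = sb->s_export_op;

	return eop && eop->map_blocks && eop->commit_blocks;
}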


> Maybe that would not be a barrier to Mike, but it does make it a lot
> less interesting to me (not that I have a particular use case in mind,
> but I just find "local bypass for NFSv4.1+ on XFS" less interesting than
> "local bypass for NFS on Linux").
> 
> But my interest isn't a requirement of course.

My focus for now is ensuring that whatever is merged
into NFSD can be audited, in a straightforward and
rigorous way, to give us confidence that it is secure
and respects the existing policies and boundaries that
are configured via export authorization and local
namespace settings.

The pNFS-based approaches have the advantage that a lot
of that auditing has already been done, and by a wider
set of reviewers than just the Linux community. That's
my only interest in possibly excluding NFSv3; I have no
other quibble with the high-level goal of supporting
NFSv3 for loopback acceleration.

Thanks everyone for a spirited and frank discussion.


>> In an earlier email Mike mentioned that Hammerspace isn't
>> interested in providing a centrally managed directory of
>> block devices that could be utilized by the MDS to simply
>> inform the client of local devices. I don't think that's
>> the only possible solution for discovering the locality of
>> storage devices.
> 
> Could you sketch out an alternate solution so that it can be assessed
> objectively?

I'll give it some thought.


--
Chuck Lever
NeilBrown July 10, 2024, 12:10 a.m. UTC | #56
On Mon, 08 Jul 2024, Christoph Hellwig wrote:
> On Mon, Jul 08, 2024 at 02:03:02PM +1000, NeilBrown wrote:
> > > I did not say that we have the exact same functionality available and
> > > there is no work to do at all, just that it is the standard way to bypass
> > > the server.
> > 
> > Sometimes what you don't say is important.  As you acknowledge there is
> > work to do.  Understanding how much work is involved is critical to
> > understanding that possible direction.
> 
> Of course there is.  I've never said we don't need to do any work,
> I'm just asking why we are not using the existing infrastructure to do
> it.
> 
> > But pNFS is about handing out grants using standardised protocols that
> > support interoperability between distinct nodes, and possibly distinct
> > implementations.  localio doesn't need any of that.  It all exists in a
> > single implementation on a single node.  So in that sense there can be
> > expected to be different priorities.
> > 
> > Why should we pay the costs of pNFS when implementing localio?
> 
> Why do you think we pay a cost for it?  From all I can tell it makes
> the job simpler, especially if we want to do things like bypassing
> the second page cache.
> 
> > That
> > question can only be answered if we have a good understanding of the
> > costs and benefits.  And that requires having a concrete proposal for
> > the "pNFS" option - if only a detailed sketch.
> 
> I sketched the very sketchy sketch earlier - add a new localio
> layout that does local file I/O.  The I/O side of that is pretty
> trivial, and maybe I can find some time to write draft code.  The file
> open side is just as horrible as in the current localio proposal,
> and I could just reuse that for now, although I think the concept
> of opening the file in the client context is fundamentally wrong
> no matter how we skin the cat.
> 

I had been assuming that you were proposing a new pNFS layout type with an
associated protocol extension, an RFC describing them, and navigation of
the IETF standards process.  These are (some of) the costs I was
thinking of.  Of course the IETF requires demonstration of
interoperability between multiple implementations, and as that is
impossible for localio, we would fail before we started.

But I now suspect that I guessed your intention wrongly (I'm rubbish at
guessing other people's intentions).  Your use of the word
"infrastructure" above and the sketchy sketch you provide (thanks) seems
to indicate you are only suggesting that we re-use some of the pnfs
abstractions and interfaces already implemented in the Linux NFS client
and server.

Is that what you mean?

If it is, then it isn't immediately clear to me that this would have to
be NFSv4 only.  The different versions already share code where that
makes sense.  Moving the pnfs code from the nfsv4 module to a new pnfs
module which nfsv3 also depends on might make sense.

I'm keen to know what you are really thinking.

Thanks,
NeilBrown