Message ID | 20190916204419.21717-1-trond.myklebust@hammerspace.com (mailing list archive) |
---|---|
Series | Various NFSv4 state error handling fixes |
Hi Trond,

This set of patches does not address the locking problem. It's actually
not the locking patch (which I thought it was, as I reverted it and
still had the issue). Without the whole patch series the unlock works
fine, so something in these new patches is responsible. Something is up
with these two patches:

NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU

If I remove either one separately, unlock fails, but if I remove both,
unlock works.

On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <trondmy@gmail.com> wrote:
>
> Various NFSv4 fixes to ensure we handle state errors correctly. In
> particular, we need to ensure that for COMPOUNDs like CLOSE and
> DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle the
> layout state errors so that a retry of either the LAYOUTRETURN, or
> the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> reply.
>
> Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do our
> best to still try to destroy the state on the server, in order to
> avoid causing state leakage.
>
> v2: Fix bug reports from Olga
> - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE when
>   doing fully serialised NFSv4.0.
> - Ensure LOCKU initialises the stateid correctly.
>
> Trond Myklebust (9):
>   pNFS: Ensure we do clear the return-on-close layout stateid on
>     fatal errors
>   NFSv4: Clean up pNFS return-on-close error handling
>   NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
>   NFSv4: Handle RPC level errors in LAYOUTRETURN
>   NFSv4: Add a helper to increment stateid seqids
>   pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the
>     state seqid
>   NFSv4: Fix OPEN_DOWNGRADE error handling
>   NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
>   NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
>
>  fs/nfs/nfs4_fs.h   |  11 ++-
>  fs/nfs/nfs4proc.c  | 204 ++++++++++++++++++++++++++++++---------------
>  fs/nfs/nfs4state.c |  16 ----
>  fs/nfs/pnfs.c      |  71 ++++++++++++++--
>  fs/nfs/pnfs.h      |  17 +++-
>  5 files changed, 229 insertions(+), 90 deletions(-)
>
> --
> 2.21.0
>
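One of the patches in the series quoted above, "NFSv4: Add a helper to
increment stateid seqids", is about bumping the 32-bit sequence counter
embedded in an NFSv4 stateid so that a retry after NFS4ERR_OLD_STATEID can
present a newer seqid. A minimal sketch of what such a helper might look
like, assuming the Linux client's nfs4_stateid layout with a big-endian
seqid field (the actual code in the series may differ):

```c
/* Sketch only: assumes nfs4_stateid exposes a __be32 seqid field, as in
 * the Linux NFS client's definition of the type.
 */
static void nfs4_stateid_seqid_inc(nfs4_stateid *s1)
{
	u32 seqid = be32_to_cpu(s1->seqid);

	/* A seqid of zero has a special meaning in NFSv4.1 ("match the
	 * current seqid"), so skip it if the counter wraps.
	 */
	if (++seqid == 0)
		++seqid;
	s1->seqid = cpu_to_be32(seqid);
}
```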
Hi Olga

On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> Hi Trond,
>
> This set of patches does not address the locking problem. It's
> actually not the locking patch (which I thought it was, as I reverted
> it and still had the issue). Without the whole patch series the
> unlock works fine, so something in these new patches is responsible.
> Something is up with these two patches:
> NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
>
> If I remove either one separately, unlock fails, but if I remove
> both, unlock works.

Can you describe how you are testing this, and perhaps provide
wireshark traces that show how we're triggering these problems?

Thanks!
Trond

> On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <trondmy@gmail.com>
> wrote:
> > [...]
Hi Olga

On Thu, 2019-09-19 at 09:14 -0400, Olga Kornievskaia wrote:
> Hi Trond,
>
> On Wed, Sep 18, 2019 at 9:49 PM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> > Hi Olga
> >
> > On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> > > [...]
> > > If I remove either one separately, unlock fails, but if I remove
> > > both, unlock works.
> >
> > Can you describe how you are testing this, and perhaps provide
> > wireshark traces that show how we're triggering these problems?
>
> I'm triggering by running "nfstest_lock --nfsversion 4.1 --runtest
> btest01" against either linux or ontap servers (while the test
> doesn't fail, on the network trace you can see unlock failing with
> bad_stateid). Network trace attached.
>
> But actually a simple test of open, lock, unlock does the trick
> (network trace attached).
> fd1 = open(RDWR)
> fcntl(fd1) (lock/unlock)

These traces really do not mesh with what I'm seeing using a simple
Connectathon lock test run. When I look at the wireshark output from
that, I see exactly two cases where the stateid arguments are both
zero, and those are both SETATTR, so expected.

All the LOCKU are showing up as non-zero stateids, and so I'm seeing no
BAD_STATEID or OLD_STATEID errors at all.

Is there something special about how your test is running?

Cheers
Trond

PS: Note: I do think I need a v3 of the LOCKU patch in order to add a
spinlock around the new copy of the lock stateid in
nfs4_alloc_unlockdata(). However I don't see how the missing spinlocks
could cause you to consistently be seeing an all-zero stateid argument.

> > [...]

Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space
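Olga's minimal reproducer described above (open the file read/write, take a
lock with fcntl(), then unlock) corresponds roughly to the following
user-space sketch; the mount point and file name are placeholders, and the
exact test she ran may differ:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct flock fl;
	/* Placeholder path on an NFSv4.1 mount. */
	int fd1 = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);

	if (fd1 < 0) {
		perror("open");
		return 1;
	}

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;	/* LOCK on the wire */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;		/* whole file */
	if (fcntl(fd1, F_SETLKW, &fl) < 0)
		perror("lock");

	fl.l_type = F_UNLCK;	/* LOCKU on the wire */
	if (fcntl(fd1, F_SETLK, &fl) < 0)
		perror("unlock");

	close(fd1);
	return 0;
}
```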
Hi Trond,

On Thu, Sep 19, 2019 at 7:42 PM Trond Myklebust
<trondmy@hammerspace.com> wrote:
>
> Hi Olga
>
> On Thu, 2019-09-19 at 09:14 -0400, Olga Kornievskaia wrote:
> > [...]
> > I'm triggering by running "nfstest_lock --nfsversion 4.1 --runtest
> > btest01" against either linux or ontap servers (while the test
> > doesn't fail, on the network trace you can see unlock failing with
> > bad_stateid). Network trace attached.
> >
> > But actually a simple test of open, lock, unlock does the trick
> > (network trace attached).
> > fd1 = open(RDWR)
> > fcntl(fd1) (lock/unlock)
>
> These traces really do not mesh with what I'm seeing using a simple
> Connectathon lock test run. When I look at the wireshark output from
> that, I see exactly two cases where the stateid arguments are both
> zero, and those are both SETATTR, so expected.
>
> All the LOCKU are showing up as non-zero stateids, and so I'm seeing
> no BAD_STATEID or OLD_STATEID errors at all.
>
> Is there something special about how your test is running?

There is nothing special that I can think of about my setup or how the
tests are run. I pull from your testing branch and build it (no extra
patches), then run the tests over 4.1 (default mount options) against a
linux server (typically the same kernel).

Is this patch series somewhere in your git branches? I've been testing
your testing branch (as I could see the v2 changes were in it). It's
not obvious to me what was changed in v3, so I can't tell whether the
testing branch has the right code.

> Cheers
> Trond
>
> PS: Note: I do think I need a v3 of the LOCKU patch in order to add a
> spinlock around the new copy of the lock stateid in
> nfs4_alloc_unlockdata(). However I don't see how the missing
> spinlocks could cause you to consistently be seeing an all-zero
> stateid argument.

I'm not testing with v3, so I don't think that's it.

> [...]
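The PS quoted above refers to taking a spinlock around the copy of the lock
stateid into the LOCKU arguments in nfs4_alloc_unlockdata(). A rough sketch
of the idea follows; the wrapper function here is hypothetical, while
nfs4_stateid_copy(), state->state_lock and lsp->ls_stateid are assumed to
exist as in the current NFS client, and the real v3 change may look
different:

```c
/* Sketch only: take the state_lock that protects the lock state while
 * copying the lock stateid, so the LOCKU setup code cannot observe a
 * partially updated stateid during a concurrent seqid bump.
 */
static void nfs4_copy_lock_stateid_locked(struct nfs4_state *state,
					  struct nfs4_lock_state *lsp,
					  nfs4_stateid *dst)
{
	spin_lock(&state->state_lock);
	nfs4_stateid_copy(dst, &lsp->ls_stateid);
	spin_unlock(&state->state_lock);
}
```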
On Fri, 2019-09-20 at 10:25 -0400, Olga Kornievskaia wrote:
> Hi Trond,
>
> [...]
>
> There is nothing special that I can think of about my setup or how
> the tests are run. I pull from your testing branch and build it (no
> extra patches), then run the tests over 4.1 (default mount options)
> against a linux server (typically the same kernel).
>
> Is this patch series somewhere in your git branches? I've been
> testing your testing branch (as I could see the v2 changes were in
> it). It's not obvious to me what was changed in v3, so I can't tell
> whether the testing branch has the right code.

I hadn't yet updated the testing branch with the v3 code. Pushed out
now as a forced-update.

Cheers
Trond