Message ID | 20230906010328.54634-1-trondmy@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Revert "SUNRPC: Fail faster on bad verifier" | expand |
> On Sep 5, 2023, at 9:03 PM, trondmy@kernel.org wrote: > > From: Trond Myklebust <trond.myklebust@hammerspace.com> > > This reverts commit 0701214cd6e66585a999b132eb72ae0489beb724. > > The premise of this commit was incorrect. There are exactly 2 cases > where rpcauth_checkverf() will return an error: > > 1) If there was an XDR decode problem (i.e. garbage data). > 2) If gss_validate() had a problem verifying the RPCSEC_GSS MIC. There's also the AUTH_TLS probe: https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7 That was the purpose of 0701214cd6e6. Reverting this commit is likely to cause problems when our TLS-capable client interacts with a server that knows nothing of AUTH_TLS. > In the second case, there are again 2 subcases: > > a) The GSS context expires, in which case gss_validate() will force a > new context negotiation on retry by invalidating the cred. > b) The sequence number check failed because an RPC call timed out, and > the client retransmitted the request using a new sequence number, > as required by RFC2203. > > In neither subcase is this a fatal error. > > Reported-by: Russell Cattelan <cattelan@thebarn.com> > Fixes: 0701214cd6e6 ("SUNRPC: Fail faster on bad verifier") > Cc: stable@vger.kernel.org > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> > --- > net/sunrpc/clnt.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c > index 12c46e129db8..5a7de7e55548 100644 > --- a/net/sunrpc/clnt.c > +++ b/net/sunrpc/clnt.c > @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task *task, struct xdr_stream *xdr) > > out_verifier: > trace_rpc_bad_verifier(task); > - goto out_err; > + goto out_garbage; > > out_msg_denied: > error = -EACCES; > -- > 2.41.0 > -- Chuck Lever
On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote: > > > > On Sep 5, 2023, at 9:03 PM, trondmy@kernel.org wrote: > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com> > > > > This reverts commit 0701214cd6e66585a999b132eb72ae0489beb724. > > > > The premise of this commit was incorrect. There are exactly 2 cases > > where rpcauth_checkverf() will return an error: > > > > 1) If there was an XDR decode problem (i.e. garbage data). > > 2) If gss_validate() had a problem verifying the RPCSEC_GSS MIC. > > There's also the AUTH_TLS probe: > > https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7 > > That was the purpose of 0701214cd6e6. > > Reverting this commit is likely to cause problems when our > TLS-capable client interacts with a server that knows > nothing of AUTH_TLS. The patch completely broke the semantics of the header validation code. There is no discussion about whether or not it needs to be reverted. If the TLS code needs special treatment, then a separate patch is needed to fix tls_validate() to return something that can be caught by rpc_decode_header and interpreted differently to the EIO and EACCES error codes currently being returned by RPCSEC_GSS, AUTH_SYS and others. > > In the second case, there are again 2 subcases: > > > > a) The GSS context expires, in which case gss_validate() will force > > a > > new context negotiation on retry by invalidating the cred. > > b) The sequence number check failed because an RPC call timed out, > > and > > the client retransmitted the request using a new sequence number, > > as required by RFC2203. > > > > In neither subcase is this a fatal error. > > > > Reported-by: Russell Cattelan <cattelan@thebarn.com> > > Fixes: 0701214cd6e6 ("SUNRPC: Fail faster on bad verifier") > > Cc: stable@vger.kernel.org > > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> > > --- > > net/sunrpc/clnt.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c > > index 12c46e129db8..5a7de7e55548 100644 > > --- a/net/sunrpc/clnt.c > > +++ b/net/sunrpc/clnt.c > > @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task *task, > > struct xdr_stream *xdr) > > > > out_verifier: > > trace_rpc_bad_verifier(task); > > - goto out_err; > > + goto out_garbage; > > > > out_msg_denied: > > error = -EACCES; > > -- > > 2.41.0 > > > > -- > Chuck Lever > >
> On Sep 6, 2023, at 10:33 AM, Trond Myklebust <trondmy@kernel.org> wrote: > > On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote: >> >> >>> On Sep 5, 2023, at 9:03 PM, trondmy@kernel.org wrote: >>> >>> From: Trond Myklebust <trond.myklebust@hammerspace.com> >>> >>> This reverts commit 0701214cd6e66585a999b132eb72ae0489beb724. >>> >>> The premise of this commit was incorrect. There are exactly 2 cases >>> where rpcauth_checkverf() will return an error: >>> >>> 1) If there was an XDR decode problem (i.e. garbage data). >>> 2) If gss_validate() had a problem verifying the RPCSEC_GSS MIC. >> >> There's also the AUTH_TLS probe: >> >> https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7 >> >> That was the purpose of 0701214cd6e6. >> >> Reverting this commit is likely to cause problems when our >> TLS-capable client interacts with a server that knows >> nothing of AUTH_TLS. > > The patch completely broke the semantics of the header validation code. If that were truly the case, it's amazing that the client has hobbled along for the past 14 months with no-one noticing the breakage until now. Seriously, though, treating a bad verifier as garbage args is not intuitive. If it's that critical there needs to be a comment in the code explaining why. > There is no discussion about whether or not it needs to be reverted. The patch description is wrong, though, to exclude AUTH_TLS. The reverting patch description claims to be an exhaustive list of all the cases, but it doesn't mention the AUTH_TLS case at all. > If the TLS code needs special treatment, then a separate patch is > needed to fix tls_validate() to return something that can be caught by > rpc_decode_header and interpreted differently to the EIO and EACCES > error codes currently being returned by RPCSEC_GSS, AUTH_SYS and > others. That could have been brought up when 0701214cd6e6 was first posted for review. Interesting that the decoder currently does not distinguish between EIO and EACCES. Thanks for the suggestion, I'll have a look. >>> In the second case, there are again 2 subcases: >>> >>> a) The GSS context expires, in which case gss_validate() will force >>> a >>> new context negotiation on retry by invalidating the cred. >>> b) The sequence number check failed because an RPC call timed out, >>> and >>> the client retransmitted the request using a new sequence number, >>> as required by RFC2203. >>> >>> In neither subcase is this a fatal error. >>> >>> Reported-by: Russell Cattelan <cattelan@thebarn.com> >>> Fixes: 0701214cd6e6 ("SUNRPC: Fail faster on bad verifier") >>> Cc: stable@vger.kernel.org >>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> >>> --- >>> net/sunrpc/clnt.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c >>> index 12c46e129db8..5a7de7e55548 100644 >>> --- a/net/sunrpc/clnt.c >>> +++ b/net/sunrpc/clnt.c >>> @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task *task, >>> struct xdr_stream *xdr) >>> >>> out_verifier: >>> trace_rpc_bad_verifier(task); >>> - goto out_err; >>> + goto out_garbage; >>> >>> out_msg_denied: >>> error = -EACCES; >>> -- >>> 2.41.0 >>> >> >> -- >> Chuck Lever >> >> > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@hammerspace.com -- Chuck Lever
On Wed, 2023-09-06 at 15:20 +0000, Chuck Lever III wrote: > > > > On Sep 6, 2023, at 10:33 AM, Trond Myklebust <trondmy@kernel.org> > > wrote: > > > > On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote: > > > > > > > > > > On Sep 5, 2023, at 9:03 PM, trondmy@kernel.org wrote: > > > > > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com> > > > > > > > > This reverts commit 0701214cd6e66585a999b132eb72ae0489beb724. > > > > > > > > The premise of this commit was incorrect. There are exactly 2 > > > > cases > > > > where rpcauth_checkverf() will return an error: > > > > > > > > 1) If there was an XDR decode problem (i.e. garbage data). > > > > 2) If gss_validate() had a problem verifying the RPCSEC_GSS > > > > MIC. > > > > > > There's also the AUTH_TLS probe: > > > > > > https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7 > > > > > > That was the purpose of 0701214cd6e6. > > > > > > Reverting this commit is likely to cause problems when our > > > TLS-capable client interacts with a server that knows > > > nothing of AUTH_TLS. > > > > The patch completely broke the semantics of the header validation > > code. > > If that were truly the case, it's amazing that the client > has hobbled along for the past 14 months with no-one > noticing the breakage until now. > > Seriously, though, treating a bad verifier as garbage args > is not intuitive. If it's that critical there needs to be > a comment in the code explaining why. > It is necessary because of the peculiarities of RPCSEC_GSS and the session semantics it implements. See https://datatracker.ietf.org/doc/html/rfc2203#section-5.3.3.1 and in particular, the paragraph discussing retransmissions by the client. > > There is no discussion about whether or not it needs to be > > reverted. > > The patch description is wrong, though, to exclude AUTH_TLS. > > The reverting patch description claims to be an exhaustive > list of all the cases, but it doesn't mention the AUTH_TLS > case at all. > > > > If the TLS code needs special treatment, then a separate patch is > > needed to fix tls_validate() to return something that can be caught > > by > > rpc_decode_header and interpreted differently to the EIO and EACCES > > error codes currently being returned by RPCSEC_GSS, AUTH_SYS and > > others. > > That could have been brought up when 0701214cd6e6 was first > posted for review. Interesting that the decoder currently > does not distinguish between EIO and EACCES. > > Thanks for the suggestion, I'll have a look. > Now that I look at it, I think your approach to satisfying RFC9289 is not correct. Since this is a transport level issue, why should we not just mark the xprt for disconnection, and then retry? It is entirely possible that some load balancer/floating IP has just moved the connection to some node that was not expecting to do TLS. The only case where that should not be assumed is the case where the error happens right at the very beginning of the mount, when disconnecting should normally suffice to trigger the RPC_TASK_SOFTCONN code anyway. > > > > > In the second case, there are again 2 subcases: > > > > > > > > a) The GSS context expires, in which case gss_validate() will > > > > force > > > > a > > > > new context negotiation on retry by invalidating the cred. > > > > b) The sequence number check failed because an RPC call timed > > > > out, > > > > and > > > > the client retransmitted the request using a new sequence > > > > number, > > > > as required by RFC2203. > > > > > > > > In neither subcase is this a fatal error. > > > > > > > > Reported-by: Russell Cattelan <cattelan@thebarn.com> > > > > Fixes: 0701214cd6e6 ("SUNRPC: Fail faster on bad verifier") > > > > Cc: stable@vger.kernel.org > > > > Signed-off-by: Trond Myklebust > > > > <trond.myklebust@hammerspace.com> > > > > --- > > > > net/sunrpc/clnt.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c > > > > index 12c46e129db8..5a7de7e55548 100644 > > > > --- a/net/sunrpc/clnt.c > > > > +++ b/net/sunrpc/clnt.c > > > > @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task *task, > > > > struct xdr_stream *xdr) > > > > > > > > out_verifier: > > > > trace_rpc_bad_verifier(task); > > > > - goto out_err; > > > > + goto out_garbage; > > > > > > > > out_msg_denied: > > > > error = -EACCES; > > > > -- > > > > 2.41.0 > > > > > > > > > > -- > > > Chuck Lever > > > > > > > > > > -- > > Trond Myklebust > > Linux NFS client maintainer, Hammerspace > > trond.myklebust@hammerspace.com > > > -- > Chuck Lever > >
> On Sep 6, 2023, at 12:18 PM, Trond Myklebust <trondmy@kernel.org> wrote: > > On Wed, 2023-09-06 at 15:20 +0000, Chuck Lever III wrote: >> >> >>> On Sep 6, 2023, at 10:33 AM, Trond Myklebust <trondmy@kernel.org> >>> wrote: >>> >>> On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote: >>>> >>>> >>>>> On Sep 5, 2023, at 9:03 PM, trondmy@kernel.org wrote: >>>>> >>>>> From: Trond Myklebust <trond.myklebust@hammerspace.com> >>>>> >>>>> This reverts commit 0701214cd6e66585a999b132eb72ae0489beb724. >>>>> >>>>> The premise of this commit was incorrect. There are exactly 2 >>>>> cases >>>>> where rpcauth_checkverf() will return an error: >>>>> >>>>> 1) If there was an XDR decode problem (i.e. garbage data). >>>>> 2) If gss_validate() had a problem verifying the RPCSEC_GSS >>>>> MIC. >>>> >>>> There's also the AUTH_TLS probe: >>>> >>>> https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7 >>>> >>>> That was the purpose of 0701214cd6e6. >>>> >>>> Reverting this commit is likely to cause problems when our >>>> TLS-capable client interacts with a server that knows >>>> nothing of AUTH_TLS. >>> >>> The patch completely broke the semantics of the header validation >>> code. >> >> If that were truly the case, it's amazing that the client >> has hobbled along for the past 14 months with no-one >> noticing the breakage until now. >> >> Seriously, though, treating a bad verifier as garbage args >> is not intuitive. If it's that critical there needs to be >> a comment in the code explaining why. >> > > It is necessary because of the peculiarities of RPCSEC_GSS and the > session semantics it implements. > See https://datatracker.ietf.org/doc/html/rfc2203#section-5.3.3.1 and > in particular, the paragraph discussing retransmissions by the client. Retrying is fine. But the counter in the client is called "garbage_retries". That's not what is going on the EACCES case, though the behavior is close enough -- it's code re-use (good) without appropriate documentation (bad). The decoder treats EIO and EACCES exactly the same way. Again, code reuse (good) without appropriate documentation (bad). I tried to address that in my RFC patch by adding a small explanatory comment and by adding an API contract for rpcauth_checkverf(). >>> There is no discussion about whether or not it needs to be >>> reverted. >> >> The patch description is wrong, though, to exclude AUTH_TLS. >> >> The reverting patch description claims to be an exhaustive >> list of all the cases, but it doesn't mention the AUTH_TLS >> case at all. >> >> >>> If the TLS code needs special treatment, then a separate patch is >>> needed to fix tls_validate() to return something that can be caught >>> by >>> rpc_decode_header and interpreted differently to the EIO and EACCES >>> error codes currently being returned by RPCSEC_GSS, AUTH_SYS and >>> others. >> >> That could have been brought up when 0701214cd6e6 was first >> posted for review. Interesting that the decoder currently >> does not distinguish between EIO and EACCES. >> >> Thanks for the suggestion, I'll have a look. >> > > Now that I look at it, I think your approach to satisfying RFC9289 is > not correct. I'm not following what aspect of the implementation is problematic. I'm going to assume you mean the implementation of opportunism. > Since this is a transport level issue, why should we not just mark the > xprt for disconnection, and then retry? It is entirely possible that > some load balancer/floating IP has just moved the connection to some > node that was not expecting to do TLS. Depending on the security policy chosen by the client's administrator, that could either be a security problem or a "don't care" situation. If the administrator wants the client to _require_ TLS, then connecting to a load balancer where TLS suddenly becomes unavailable after a reconnect is a hard error. This prevents STRIPTLS attacks. That's good security. If the administrator wants to allow operation to continue even if TLS is not available, then the client can recover by not using TLS. That's rather terrible security, but can be desirable to improve backward compatibility. > The only case where that should > not be assumed is the case where the error happens right at the very > beginning of the mount, when disconnecting should normally suffice to > trigger the RPC_TASK_SOFTCONN code anyway. If TLS goes away after a reconnect, that's a problem. Whether further operation should stop depends on the administrator's chosen security policy. The security policies are NFS-level settings (eg, mount options). RPC just indicates to NFS whether the traffic is protected or not. When NFS asks RPC to ensure the communication channel is protected, that means every reconnect is protected. Communication with that security policy cannot happen without protection. Trust me, the security community will have it no other way. If you need opportunism in this case, then I can add back the "xprtsec=auto" mount option, which you asked me to remove a while back. >>>>> In the second case, there are again 2 subcases: >>>>> >>>>> a) The GSS context expires, in which case gss_validate() will >>>>> force >>>>> a >>>>> new context negotiation on retry by invalidating the cred. >>>>> b) The sequence number check failed because an RPC call timed >>>>> out, >>>>> and >>>>> the client retransmitted the request using a new sequence >>>>> number, >>>>> as required by RFC2203. >>>>> >>>>> In neither subcase is this a fatal error. >>>>> >>>>> Reported-by: Russell Cattelan <cattelan@thebarn.com> >>>>> Fixes: 0701214cd6e6 ("SUNRPC: Fail faster on bad verifier") >>>>> Cc: stable@vger.kernel.org >>>>> Signed-off-by: Trond Myklebust >>>>> <trond.myklebust@hammerspace.com> >>>>> --- >>>>> net/sunrpc/clnt.c | 2 +- >>>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>>> >>>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c >>>>> index 12c46e129db8..5a7de7e55548 100644 >>>>> --- a/net/sunrpc/clnt.c >>>>> +++ b/net/sunrpc/clnt.c >>>>> @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task *task, >>>>> struct xdr_stream *xdr) >>>>> >>>>> out_verifier: >>>>> trace_rpc_bad_verifier(task); >>>>> - goto out_err; >>>>> + goto out_garbage; >>>>> >>>>> out_msg_denied: >>>>> error = -EACCES; >>>>> -- >>>>> 2.41.0 >>>>> >>>> >>>> -- >>>> Chuck Lever >>>> >>>> >>> >>> -- >>> Trond Myklebust >>> Linux NFS client maintainer, Hammerspace >>> trond.myklebust@hammerspace.com >> >> >> -- >> Chuck Lever -- Chuck Lever
On Wed, 2023-09-06 at 18:05 +0000, Chuck Lever III wrote: > > > > On Sep 6, 2023, at 12:18 PM, Trond Myklebust <trondmy@kernel.org> > > wrote: > > > > On Wed, 2023-09-06 at 15:20 +0000, Chuck Lever III wrote: > > > > > > > > > > On Sep 6, 2023, at 10:33 AM, Trond Myklebust > > > > <trondmy@kernel.org> > > > > wrote: > > > > > > > > On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote: > > > > > > > > > > > > > > > > On Sep 5, 2023, at 9:03 PM, trondmy@kernel.org wrote: > > > > > > > > > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com> > > > > > > > > > > > > This reverts commit > > > > > > 0701214cd6e66585a999b132eb72ae0489beb724. > > > > > > > > > > > > The premise of this commit was incorrect. There are exactly > > > > > > 2 > > > > > > cases > > > > > > where rpcauth_checkverf() will return an error: > > > > > > > > > > > > 1) If there was an XDR decode problem (i.e. garbage data). > > > > > > 2) If gss_validate() had a problem verifying the RPCSEC_GSS > > > > > > MIC. > > > > > > > > > > There's also the AUTH_TLS probe: > > > > > > > > > > https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7 > > > > > > > > > > That was the purpose of 0701214cd6e6. > > > > > > > > > > Reverting this commit is likely to cause problems when our > > > > > TLS-capable client interacts with a server that knows > > > > > nothing of AUTH_TLS. > > > > > > > > The patch completely broke the semantics of the header > > > > validation > > > > code. > > > > > > If that were truly the case, it's amazing that the client > > > has hobbled along for the past 14 months with no-one > > > noticing the breakage until now. > > > > > > Seriously, though, treating a bad verifier as garbage args > > > is not intuitive. If it's that critical there needs to be > > > a comment in the code explaining why. > > > > > > > It is necessary because of the peculiarities of RPCSEC_GSS and the > > session semantics it implements. > > See > > https://datatracker.ietf.org/doc/html/rfc2203#section-5.3.3.1 and > > in particular, the paragraph discussing retransmissions by the > > client. > > Retrying is fine. > > But the counter in the client is called "garbage_retries". > That's not what is going on the EACCES case, though the > behavior is close enough -- it's code re-use (good) without > appropriate documentation (bad). > > The decoder treats EIO and EACCES exactly the same way. > Again, code reuse (good) without appropriate documentation > (bad). > > I tried to address that in my RFC patch by adding a small > explanatory comment and by adding an API contract for > rpcauth_checkverf(). > > > > > > There is no discussion about whether or not it needs to be > > > > reverted. > > > > > > The patch description is wrong, though, to exclude AUTH_TLS. > > > > > > The reverting patch description claims to be an exhaustive > > > list of all the cases, but it doesn't mention the AUTH_TLS > > > case at all. > > > > > > > > > > If the TLS code needs special treatment, then a separate patch > > > > is > > > > needed to fix tls_validate() to return something that can be > > > > caught > > > > by > > > > rpc_decode_header and interpreted differently to the EIO and > > > > EACCES > > > > error codes currently being returned by RPCSEC_GSS, AUTH_SYS > > > > and > > > > others. > > > > > > That could have been brought up when 0701214cd6e6 was first > > > posted for review. Interesting that the decoder currently > > > does not distinguish between EIO and EACCES. > > > > > > Thanks for the suggestion, I'll have a look. > > > > > > > Now that I look at it, I think your approach to satisfying RFC9289 > > is > > not correct. > > I'm not following what aspect of the implementation is problematic. > I'm going to assume you mean the implementation of opportunism. > > > > Since this is a transport level issue, why should we not just mark > > the > > xprt for disconnection, and then retry? It is entirely possible > > that > > some load balancer/floating IP has just moved the connection to > > some > > node that was not expecting to do TLS. > > Depending on the security policy chosen by the client's > administrator, > that could either be a security problem or a "don't care" situation. > > If the administrator wants the client to _require_ TLS, then > connecting to a load balancer where TLS suddenly becomes unavailable > after a reconnect is a hard error. This prevents STRIPTLS attacks. > That's good security. > > If the administrator wants to allow operation to continue even if TLS > is not available, then the client can recover by not using TLS. > That's > rather terrible security, but can be desirable to improve backward > compatibility. > > > > The only case where that should > > not be assumed is the case where the error happens right at the > > very > > beginning of the mount, when disconnecting should normally suffice > > to > > trigger the RPC_TASK_SOFTCONN code anyway. > > If TLS goes away after a reconnect, that's a problem. Whether > further operation should stop depends on the administrator's > chosen security policy. > > The security policies are NFS-level settings (eg, mount options). > RPC just indicates to NFS whether the traffic is protected or not. > > When NFS asks RPC to ensure the communication channel is protected, > that means every reconnect is protected. Communication with that > security policy cannot happen without protection. > > Trust me, the security community will have it no other way. > > If you need opportunism in this case, then I can add back the > "xprtsec=auto" mount option, which you asked me to remove a while > back. I don't see how described behaviour would cause the operation to proceed without TLS. I'm saying disconnect and then retry TLS negotiation, and then eventually fail. > > > > > > > > In the second case, there are again 2 subcases: > > > > > > > > > > > > a) The GSS context expires, in which case gss_validate() > > > > > > will > > > > > > force > > > > > > a > > > > > > new context negotiation on retry by invalidating the > > > > > > cred. > > > > > > b) The sequence number check failed because an RPC call > > > > > > timed > > > > > > out, > > > > > > and > > > > > > the client retransmitted the request using a new sequence > > > > > > number, > > > > > > as required by RFC2203. > > > > > > > > > > > > In neither subcase is this a fatal error. > > > > > > > > > > > > Reported-by: Russell Cattelan <cattelan@thebarn.com> > > > > > > Fixes: 0701214cd6e6 ("SUNRPC: Fail faster on bad verifier") > > > > > > Cc: stable@vger.kernel.org > > > > > > Signed-off-by: Trond Myklebust > > > > > > <trond.myklebust@hammerspace.com> > > > > > > --- > > > > > > net/sunrpc/clnt.c | 2 +- > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c > > > > > > index 12c46e129db8..5a7de7e55548 100644 > > > > > > --- a/net/sunrpc/clnt.c > > > > > > +++ b/net/sunrpc/clnt.c > > > > > > @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task > > > > > > *task, > > > > > > struct xdr_stream *xdr) > > > > > > > > > > > > out_verifier: > > > > > > trace_rpc_bad_verifier(task); > > > > > > - goto out_err; > > > > > > + goto out_garbage; > > > > > > > > > > > > out_msg_denied: > > > > > > error = -EACCES; > > > > > > -- > > > > > > 2.41.0 > > > > > > > > > > > > > > > > -- > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > -- > > > > Trond Myklebust > > > > Linux NFS client maintainer, Hammerspace > > > > trond.myklebust@hammerspace.com > > > > > > > > > -- > > > Chuck Lever > > > -- > Chuck Lever > >
> On Sep 6, 2023, at 2:27 PM, Trond Myklebust <trondmy@kernel.org> wrote: > > On Wed, 2023-09-06 at 18:05 +0000, Chuck Lever III wrote: >> >> >>> On Sep 6, 2023, at 12:18 PM, Trond Myklebust <trondmy@kernel.org> >>> wrote: >>> >>> On Wed, 2023-09-06 at 15:20 +0000, Chuck Lever III wrote: >>>> >>>> >>>>> On Sep 6, 2023, at 10:33 AM, Trond Myklebust >>>>> <trondmy@kernel.org> >>>>> wrote: >>>>> >>>>> On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote: >>>>>> >>>>>> >>>>>>> On Sep 5, 2023, at 9:03 PM, trondmy@kernel.org wrote: >>>>>>> >>>>>>> From: Trond Myklebust <trond.myklebust@hammerspace.com> >>>>>>> >>>>>>> This reverts commit >>>>>>> 0701214cd6e66585a999b132eb72ae0489beb724. >>>>>>> >>>>>>> The premise of this commit was incorrect. There are exactly >>>>>>> 2 >>>>>>> cases >>>>>>> where rpcauth_checkverf() will return an error: >>>>>>> >>>>>>> 1) If there was an XDR decode problem (i.e. garbage data). >>>>>>> 2) If gss_validate() had a problem verifying the RPCSEC_GSS >>>>>>> MIC. >>>>>> >>>>>> There's also the AUTH_TLS probe: >>>>>> >>>>>> https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7 >>>>>> >>>>>> That was the purpose of 0701214cd6e6. >>>>>> >>>>>> Reverting this commit is likely to cause problems when our >>>>>> TLS-capable client interacts with a server that knows >>>>>> nothing of AUTH_TLS. >>>>> >>>>> The patch completely broke the semantics of the header >>>>> validation >>>>> code. >>>> >>>> If that were truly the case, it's amazing that the client >>>> has hobbled along for the past 14 months with no-one >>>> noticing the breakage until now. >>>> >>>> Seriously, though, treating a bad verifier as garbage args >>>> is not intuitive. If it's that critical there needs to be >>>> a comment in the code explaining why. >>>> >>> >>> It is necessary because of the peculiarities of RPCSEC_GSS and the >>> session semantics it implements. >>> See >>> https://datatracker.ietf.org/doc/html/rfc2203#section-5.3.3.1 and >>> in particular, the paragraph discussing retransmissions by the >>> client. >> >> Retrying is fine. >> >> But the counter in the client is called "garbage_retries". >> That's not what is going on the EACCES case, though the >> behavior is close enough -- it's code re-use (good) without >> appropriate documentation (bad). >> >> The decoder treats EIO and EACCES exactly the same way. >> Again, code reuse (good) without appropriate documentation >> (bad). >> >> I tried to address that in my RFC patch by adding a small >> explanatory comment and by adding an API contract for >> rpcauth_checkverf(). >> >> >>>>> There is no discussion about whether or not it needs to be >>>>> reverted. >>>> >>>> The patch description is wrong, though, to exclude AUTH_TLS. >>>> >>>> The reverting patch description claims to be an exhaustive >>>> list of all the cases, but it doesn't mention the AUTH_TLS >>>> case at all. >>>> >>>> >>>>> If the TLS code needs special treatment, then a separate patch >>>>> is >>>>> needed to fix tls_validate() to return something that can be >>>>> caught >>>>> by >>>>> rpc_decode_header and interpreted differently to the EIO and >>>>> EACCES >>>>> error codes currently being returned by RPCSEC_GSS, AUTH_SYS >>>>> and >>>>> others. >>>> >>>> That could have been brought up when 0701214cd6e6 was first >>>> posted for review. Interesting that the decoder currently >>>> does not distinguish between EIO and EACCES. >>>> >>>> Thanks for the suggestion, I'll have a look. >>>> >>> >>> Now that I look at it, I think your approach to satisfying RFC9289 >>> is >>> not correct. >> >> I'm not following what aspect of the implementation is problematic. >> I'm going to assume you mean the implementation of opportunism. >> >> >>> Since this is a transport level issue, why should we not just mark >>> the >>> xprt for disconnection, and then retry? It is entirely possible >>> that >>> some load balancer/floating IP has just moved the connection to >>> some >>> node that was not expecting to do TLS. >> >> Depending on the security policy chosen by the client's >> administrator, >> that could either be a security problem or a "don't care" situation. >> >> If the administrator wants the client to _require_ TLS, then >> connecting to a load balancer where TLS suddenly becomes unavailable >> after a reconnect is a hard error. This prevents STRIPTLS attacks. >> That's good security. >> >> If the administrator wants to allow operation to continue even if TLS >> is not available, then the client can recover by not using TLS. >> That's >> rather terrible security, but can be desirable to improve backward >> compatibility. >> >> >>> The only case where that should >>> not be assumed is the case where the error happens right at the >>> very >>> beginning of the mount, when disconnecting should normally suffice >>> to >>> trigger the RPC_TASK_SOFTCONN code anyway. >> >> If TLS goes away after a reconnect, that's a problem. Whether >> further operation should stop depends on the administrator's >> chosen security policy. >> >> The security policies are NFS-level settings (eg, mount options). >> RPC just indicates to NFS whether the traffic is protected or not. >> >> When NFS asks RPC to ensure the communication channel is protected, >> that means every reconnect is protected. Communication with that >> security policy cannot happen without protection. >> >> Trust me, the security community will have it no other way. >> >> If you need opportunism in this case, then I can add back the >> "xprtsec=auto" mount option, which you asked me to remove a while >> back. > > I don't see how described behaviour would cause the operation to > proceed without TLS. I'm saying disconnect and then retry TLS > negotiation, and then eventually fail. As I said above: >> I'm not following what aspect of the implementation is problematic. I still don't understand what you want changed. (Not arguing, just not understanding). One or the other side will disconnect if the probe fails. My memory might be failing, but I'm not aware of the client continuing to use a connection after sending an RPC_AUTH_TLS probe that fails. Also I'm not aware that RFC 9289 requires disconnection after a probe failure, though I believe that the Linux implementation does disconnect. By "continuing operation" I mean simply that either the client tries to reconnect with TLS (in the _requires_ case) or the client reconnects with no probe and no TLS (in the _opportunistic_ case). A TLS failure, after the initial contact with the server, will not cause a non-probe RPC to terminate. If we don't disconnect after a probe failure, I can look into changing that. But I think we already work that way. -- Chuck Lever
On 9/6/23 1:05 PM, Chuck Lever III wrote: > > >> On Sep 6, 2023, at 12:18 PM, Trond Myklebust <trondmy@kernel.org> wrote: >> >> On Wed, 2023-09-06 at 15:20 +0000, Chuck Lever III wrote: >>> >>> >>>> On Sep 6, 2023, at 10:33 AM, Trond Myklebust <trondmy@kernel.org> >>>> wrote: >>>> >>>> On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote: >>>>> >>>>> >>>>>> On Sep 5, 2023, at 9:03 PM, trondmy@kernel.org wrote: >>>>>> >>>>>> From: Trond Myklebust <trond.myklebust@hammerspace.com> >>>>>> >>>>>> This reverts commit 0701214cd6e66585a999b132eb72ae0489beb724. >>>>>> >>>>>> The premise of this commit was incorrect. There are exactly 2 >>>>>> cases >>>>>> where rpcauth_checkverf() will return an error: >>>>>> >>>>>> 1) If there was an XDR decode problem (i.e. garbage data). >>>>>> 2) If gss_validate() had a problem verifying the RPCSEC_GSS >>>>>> MIC. >>>>> >>>>> There's also the AUTH_TLS probe: >>>>> >>>>> https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7 >>>>> >>>>> That was the purpose of 0701214cd6e6. >>>>> >>>>> Reverting this commit is likely to cause problems when our >>>>> TLS-capable client interacts with a server that knows >>>>> nothing of AUTH_TLS. >>>> >>>> The patch completely broke the semantics of the header validation >>>> code. >>> >>> If that were truly the case, it's amazing that the client >>> has hobbled along for the past 14 months with no-one >>> noticing the breakage until now. >>> >>> Seriously, though, treating a bad verifier as garbage args >>> is not intuitive. If it's that critical there needs to be >>> a comment in the code explaining why. >>> >> >> It is necessary because of the peculiarities of RPCSEC_GSS and the >> session semantics it implements. >> See https://datatracker.ietf.org/doc/html/rfc2203#section-5.3.3.1 and >> in particular, the paragraph discussing retransmissions by the client. > > Retrying is fine. > > But the counter in the client is called "garbage_retries". > That's not what is going on the EACCES case, though the > behavior is close enough -- it's code re-use (good) without > appropriate documentation (bad). > > The decoder treats EIO and EACCES exactly the same way. > Again, code reuse (good) without appropriate documentation > (bad). > > I tried to address that in my RFC patch by adding a small > explanatory comment and by adding an API contract for > rpcauth_checkverf(). > So this has come out of discussion that Trond and I are having as what should happen when rpcauth_checkverf fails. The problem we are running into and fairly easy to repo with iptables drop <server ip> ; sleep 60 ;iptables accept <server ip> Our systems are currently running either ubuntu 18.04 4.15 based or ubutunu 22.04 5.15 based both exhibit problem and neither has the change. The problem happens before xdr_inline_decode and the switch statement so I don't think the change has any affect on the problem we trying to sort out where the gss checksum does not match due to the re-trans due to the RPC timeout. --Russell > >>>> There is no discussion about whether or not it needs to be >>>> reverted. >>> >>> The patch description is wrong, though, to exclude AUTH_TLS. >>> >>> The reverting patch description claims to be an exhaustive >>> list of all the cases, but it doesn't mention the AUTH_TLS >>> case at all. >>> >>> >>>> If the TLS code needs special treatment, then a separate patch is >>>> needed to fix tls_validate() to return something that can be caught >>>> by >>>> rpc_decode_header and interpreted differently to the EIO and EACCES >>>> error codes currently being returned by RPCSEC_GSS, AUTH_SYS and >>>> others. >>> >>> That could have been brought up when 0701214cd6e6 was first >>> posted for review. Interesting that the decoder currently >>> does not distinguish between EIO and EACCES. >>> >>> Thanks for the suggestion, I'll have a look. >>> >> >> Now that I look at it, I think your approach to satisfying RFC9289 is >> not correct. > > I'm not following what aspect of the implementation is problematic. > I'm going to assume you mean the implementation of opportunism. > > >> Since this is a transport level issue, why should we not just mark the >> xprt for disconnection, and then retry? It is entirely possible that >> some load balancer/floating IP has just moved the connection to some >> node that was not expecting to do TLS. > > Depending on the security policy chosen by the client's administrator, > that could either be a security problem or a "don't care" situation. > > If the administrator wants the client to _require_ TLS, then > connecting to a load balancer where TLS suddenly becomes unavailable > after a reconnect is a hard error. This prevents STRIPTLS attacks. > That's good security. > > If the administrator wants to allow operation to continue even if TLS > is not available, then the client can recover by not using TLS. That's > rather terrible security, but can be desirable to improve backward > compatibility. > > >> The only case where that should >> not be assumed is the case where the error happens right at the very >> beginning of the mount, when disconnecting should normally suffice to >> trigger the RPC_TASK_SOFTCONN code anyway. > > If TLS goes away after a reconnect, that's a problem. Whether > further operation should stop depends on the administrator's > chosen security policy. > > The security policies are NFS-level settings (eg, mount options). > RPC just indicates to NFS whether the traffic is protected or not. > > When NFS asks RPC to ensure the communication channel is protected, > that means every reconnect is protected. Communication with that > security policy cannot happen without protection. > > Trust me, the security community will have it no other way. > > If you need opportunism in this case, then I can add back the > "xprtsec=auto" mount option, which you asked me to remove a while > back. > > >>>>>> In the second case, there are again 2 subcases: >>>>>> >>>>>> a) The GSS context expires, in which case gss_validate() will >>>>>> force >>>>>> a >>>>>> new context negotiation on retry by invalidating the cred. >>>>>> b) The sequence number check failed because an RPC call timed >>>>>> out, >>>>>> and >>>>>> the client retransmitted the request using a new sequence >>>>>> number, >>>>>> as required by RFC2203. >>>>>> >>>>>> In neither subcase is this a fatal error. >>>>>> >>>>>> Reported-by: Russell Cattelan <cattelan@thebarn.com> >>>>>> Fixes: 0701214cd6e6 ("SUNRPC: Fail faster on bad verifier") >>>>>> Cc: stable@vger.kernel.org >>>>>> Signed-off-by: Trond Myklebust >>>>>> <trond.myklebust@hammerspace.com> >>>>>> --- >>>>>> net/sunrpc/clnt.c | 2 +- >>>>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c >>>>>> index 12c46e129db8..5a7de7e55548 100644 >>>>>> --- a/net/sunrpc/clnt.c >>>>>> +++ b/net/sunrpc/clnt.c >>>>>> @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task *task, >>>>>> struct xdr_stream *xdr) >>>>>> >>>>>> out_verifier: >>>>>> trace_rpc_bad_verifier(task); >>>>>> - goto out_err; >>>>>> + goto out_garbage; >>>>>> >>>>>> out_msg_denied: >>>>>> error = -EACCES; >>>>>> -- >>>>>> 2.41.0 >>>>>> >>>>> >>>>> -- >>>>> Chuck Lever >>>>> >>>>> >>>> >>>> -- >>>> Trond Myklebust >>>> Linux NFS client maintainer, Hammerspace >>>> trond.myklebust@hammerspace.com >>> >>> >>> -- >>> Chuck Lever > > > -- > Chuck Lever > >
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c index 12c46e129db8..5a7de7e55548 100644 --- a/net/sunrpc/clnt.c +++ b/net/sunrpc/clnt.c @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task *task, struct xdr_stream *xdr) out_verifier: trace_rpc_bad_verifier(task); - goto out_err; + goto out_garbage; out_msg_denied: error = -EACCES;