Message ID | 20210807170206.68768-1-dai.ngo@oracle.com
---|---
State | New, archived
Series | [1/1] rpcbind: Fix DoS vulnerability in rpcbind.
Hi Dai-

Thanks for posting. I agree this is a real bug,
but I'm going to quibble a little with the
details of your proposed solution.


> On Aug 7, 2021, at 1:02 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>
> Currently my_svc_run does not handle poll time allowing idle TCP

s/poll time/poll timeout/

> connections to remain ESTABLISHED indefinitely. When the number
> of connections reaches the limit the open file descriptors
> (ulimit -n) then accept(2) fails with EMFILE. Since libtirpc does
> not handle EMFILE returned from accept(2)

Just curious, should libtirpc be handling EMFILE?

> this get my_svc_run into
> a tight loop calling accept(2) resulting in the RPC service being
> down, it's no longer able to service any requests.
>
> Fix by add call to __svc_destroy_idle to my_svc_run to remove
> inactive connections when poll(2) returns timeout.
>
> Fixes: b2c9430f46c4 Use poll() instead of select() in svc_run()

I don't have this commit in my rpcbind repo.

I do have 44bf15b86861 ("rpcbind: don't use obsolete
svc_fdset interface of libtirpc"). IMO that's the one
you should list here.

Also, please Cc: Thorsten Kukuk <kukuk@thkukuk.de> on
future versions of this patch series in case he has
any comments.

> Signed-off-by: dai.ngo@oracle.com
> ---
> src/rpcb_svc_com.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/src/rpcb_svc_com.c b/src/rpcb_svc_com.c
> index 1743dadf5db7..cb33519010d3 100644
> --- a/src/rpcb_svc_com.c
> +++ b/src/rpcb_svc_com.c
> @@ -1048,6 +1048,8 @@ netbuffree(struct netbuf *ap)
> }
>
>
> +extern void __svc_destroy_idle(int, bool_t);

Your libtirpc patch does not add __svc_destroy_idle()
as part of the official libtirpc API. We really must
avoid adding "secret" library entry points (not to
mention, as SteveD pointed out, a SONAME change
would be required).

rpcbind used to call __svc_clean_idle(), which is
undocumented (as far as I can tell). 44bf15b86861
removed that call.

I think a better approach would be to duplicate your
proposed __svc_destroy_idle() code in rpcbind, since
rpcbind is not actually using the library's RPC
dispatcher.

That would get rid of the technical debt of calling
an undocumented library API.

The helper would need to call svc_vc_destroy()
rather than __xprt_unregister_unlocked() and
__svc_vc_dodestroy(). Also not clear to me that
rpcbind's my_svc_run() needs the protection of
holding svc_fd_lock, if rpcbind itself is not
multi-threaded?

The alternative would be to construct a fully
documented public API with man pages and header
declarations. I'm not terribly comfortable with
diverging from the historical Sun TI-RPC programming
interface, however.

> +
> void
> my_svc_run()
> {
> @@ -1076,6 +1078,7 @@ my_svc_run()
> 	 * other outside event) and not caused by poll().
> 	 */
> 	case 0:
> +		__svc_destroy_idle(30, FALSE);
> 		continue;
> 	default:
> 		/*
> --
> 2.26.2


--
Chuck Lever
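To make the direction Chuck suggests concrete, here is a rough sketch of what a caller-local idle sweep might look like; it is a sketch only, not code from either project. It assumes the poll loop records a last-activity timestamp for each accepted connection's file descriptor whenever poll() reports activity on it, and that the caller has some way to map an fd back to its SVCXPRT. The last_active[] table and xprt_for_fd() helper below are hypothetical placeholders; only SVC_DESTROY(), the standard TI-RPC svc_destroy() macro, is a real interface here, used in place of the library-internal svc_vc_destroy() Chuck mentions.

#include <sys/select.h>		/* FD_SETSIZE */
#include <poll.h>
#include <time.h>
#include <rpc/rpc.h>

#define IDLE_TIMEOUT	30	/* seconds, matching __svc_destroy_idle(30, FALSE) in the patch */

/* Hypothetical per-fd activity table, updated by the poll loop whenever
 * poll() reports activity on an accepted connection (never for listeners). */
static time_t last_active[FD_SETSIZE];

/* Hypothetical lookup: map a polled fd back to its SVCXPRT.  TI-RPC does
 * not export such a lookup, so a real caller would need its own bookkeeping. */
extern SVCXPRT *xprt_for_fd(int fd);

/* Destroy connections that have been idle longer than IDLE_TIMEOUT.
 * Meant to run when poll(2) returns 0 (timeout). */
static void
destroy_idle(struct pollfd *fds, int nfds)
{
	time_t now = time(NULL);
	int i;

	for (i = 0; i < nfds; i++) {
		int fd = fds[i].fd;
		SVCXPRT *xprt;

		if (fd < 0 || fd >= FD_SETSIZE || last_active[fd] == 0)
			continue;
		if (now - last_active[fd] < IDLE_TIMEOUT)
			continue;
		xprt = xprt_for_fd(fd);
		if (xprt != NULL)
			SVC_DESTROY(xprt);	/* tears down the transport and closes its fd */
		last_active[fd] = 0;
	}
}

The call site would mirror the posted patch: run the sweep from the case 0: (poll timeout) arm of my_svc_run().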
Hi Chuck,

On 8/8/21 12:36 PM, Chuck Lever III wrote:
> Hi Dai-
>
> Thanks for posting. I agree this is a real bug,
> but I'm going to quibble a little with the
> details of your proposed solution.
>
>
>> On Aug 7, 2021, at 1:02 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>
>> Currently my_svc_run does not handle poll time allowing idle TCP
> s/poll time/poll timeout/

thanks, will fix.

>
>> connections to remain ESTABLISHED indefinitely. When the number
>> of connections reaches the limit the open file descriptors
>> (ulimit -n) then accept(2) fails with EMFILE. Since libtirpc does
>> not handle EMFILE returned from accept(2)
> Just curious, should libtirpc be handling EMFILE?

Yes, if we don't handle EMFILE returned from accept in libtirpc
then svc_run gets into a tight loop trying to accept the same
connection:

poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND},
      {fd=5, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, ...
      {fd=36, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, ...],
      1019, 30000) = 1 ([{fd=4, revents=POLLIN|POLLRDNORM}])
accept(4, 0x7ffd4ac368f0, [128]) = -1 EMFILE (Too many open files)

This is what causes the RPC service to be down.

>
>
>> this get my_svc_run into
>> a tight loop calling accept(2) resulting in the RPC service being
>> down, it's no longer able to service any requests.
>>
>> Fix by add call to __svc_destroy_idle to my_svc_run to remove
>> inactive connections when poll(2) returns timeout.
>>
>> Fixes: b2c9430f46c4 Use poll() instead of select() in svc_run()
> I don't have this commit in my rpcbind repo.
>
> I do have 44bf15b86861 ("rpcbind: don't use obsolete
> svc_fdset interface of libtirpc"). IMO that's the one
> you should list here.

Sorry, my mistake, that is the commit for libtirpc. The commit
for rpcbind is the one you listed:

  44bf15b8 rpcbind: don't use obsolete svc_fdset interface of libtirpc

>
> Also, please Cc: Thorsten Kukuk <kukuk@thkukuk.de> on
> future versions of this patch series in case he has
> any comments.

will do.

>
>
>> Signed-off-by: dai.ngo@oracle.com
>> ---
>> src/rpcb_svc_com.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/src/rpcb_svc_com.c b/src/rpcb_svc_com.c
>> index 1743dadf5db7..cb33519010d3 100644
>> --- a/src/rpcb_svc_com.c
>> +++ b/src/rpcb_svc_com.c
>> @@ -1048,6 +1048,8 @@ netbuffree(struct netbuf *ap)
>> }
>>
>>
>> +extern void __svc_destroy_idle(int, bool_t);
> Your libtirpc patch does not add __svc_destroy_idle()
> as part of the official libtirpc API. We really must
> avoid adding "secret" library entry points (not to
> mention, as SteveD pointed out, a SONAME change
> would be required).
>
> rpcbind used to call __svc_clean_idle(), which is
> undocumented (as far as I can tell).

I tried to use the same mechanism as __svc_clean_idle, assuming
it was an acceptable approach.

> 44bf15b86861
> removed that call.

which is one of the bugs that causes this problem.

>
> I think a better approach would be to duplicate your
> proposed __svc_destroy_idle() code in rpcbind, since
> rpcbind is not actually using the library's RPC
> dispatcher.

then we have to do the same for nfs-utils. That means the
same code exists in 3 places: libtirpc, rpcbind and nfs-utils.
And any new consumer that uses its own my_svc_run will
need its own __svc_destroy_idle. It does not sound very
efficient.

>
> That would get rid of the technical debt of calling
> an undocumented library API.
>
> The helper would need to call svc_vc_destroy()
> rather than __xprt_unregister_unlocked() and
> __svc_vc_dodestroy(). Also not clear to me that
> rpcbind's my_svc_run() needs the protection of
> holding svc_fd_lock, if rpcbind itself is not
> multi-threaded?

The lock is held by __svc_destroy_idle, or __svc_clean_idle
before it was removed, in libtirpc. I think libtirpc is
multi-threaded.

>
>
> The alternative would be to construct a fully
> documented public API with man pages and header
> declarations. I'm not terribly comfortable with
> diverging from the historical Sun TI-RPC programming
> interface, however.

What I'm trying to do is to follow what __svc_clean_idle
used to do before it was removed. I can't totally re-use
__svc_clean_idle because it was written to work with
svc_fdset and select. The current svc_run uses poll so
the handling of idle connections is a little different.

-Dai

>
>
>> +
>> void
>> my_svc_run()
>> {
>> @@ -1076,6 +1078,7 @@ my_svc_run()
>> 	 * other outside event) and not caused by poll().
>> 	 */
>> 	case 0:
>> +		__svc_destroy_idle(30, FALSE);
>> 		continue;
>> 	default:
>> 		/*
>> --
>> 2.26.2
>>
> --
> Chuck Lever
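On the EMFILE question: independent of any idle sweep, an accept loop can at least avoid spinning when the descriptor limit is hit. One well-known pattern is to hold one descriptor in reserve and spend it to shed the pending connection; the sketch below illustrates that pattern in plain POSIX. It is not what libtirpc does today, and handle_new_connection() is a hypothetical placeholder.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/socket.h>

static int spare_fd = -1;	/* reserved descriptor, opened at startup */

void
reserve_spare_fd(void)
{
	spare_fd = open("/dev/null", O_RDONLY);
}

/* Hypothetical hook for whatever the server normally does with a new socket. */
extern void handle_new_connection(int fd);

/* Accept one connection from a listening socket.  On EMFILE, temporarily
 * give back the spare descriptor so the pending connection can be accepted
 * and immediately closed, instead of leaving the listener readable forever,
 * which is what produces the tight poll()/accept() loop in the strace above. */
void
accept_one(int listener)
{
	int fd = accept(listener, NULL, NULL);

	if (fd >= 0) {
		handle_new_connection(fd);
		return;
	}
	if (errno == EMFILE && spare_fd >= 0) {
		close(spare_fd);
		fd = accept(listener, NULL, NULL);
		if (fd >= 0)
			close(fd);	/* shed the connection */
		reserve_spare_fd();	/* re-arm the reserve */
	}
}

Shedding connections this way only keeps the listener from staying permanently readable; it does not address the pile-up of idle ESTABLISHED connections, which is what the idle sweep is for.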
> On Aug 8, 2021, at 5:18 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>
> On 8/8/21 12:36 PM, Chuck Lever III wrote:
>>
>>> Signed-off-by: dai.ngo@oracle.com
>>> ---
>>> src/rpcb_svc_com.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/src/rpcb_svc_com.c b/src/rpcb_svc_com.c
>>> index 1743dadf5db7..cb33519010d3 100644
>>> --- a/src/rpcb_svc_com.c
>>> +++ b/src/rpcb_svc_com.c
>>> @@ -1048,6 +1048,8 @@ netbuffree(struct netbuf *ap)
>>> }
>>>
>>>
>>> +extern void __svc_destroy_idle(int, bool_t);
>> Your libtirpc patch does not add __svc_destroy_idle()
>> as part of the official libtirpc API. We really must
>> avoid adding "secret" library entry points (not to
>> mention, as SteveD pointed out, a SONAME change
>> would be required).
>>
>> rpcbind used to call __svc_clean_idle(), which is
>> undocumented (as far as I can tell).
>
> I tried to use the same mechanism as __svc_clean_idle, assuming
> it was an acceptable approach.

Understood. I'm not quibbling with that mechanism, but with
the addition of another private API. Private APIs are very
difficult to maintain in the long run.

>> 44bf15b86861 removed that call.
>
> which is one of the bugs that causes this problem.
>
>> I think a better approach would be to duplicate your
>> proposed __svc_destroy_idle() code in rpcbind, since
>> rpcbind is not actually using the library's RPC
>> dispatcher.
>
> then we have to do the same for nfs-utils.

Exactly.

> That means the
> same code exists in 3 places: libtirpc, rpcbind and nfs-utils.
> And any new consumer that uses its own my_svc_run will
> need its own __svc_destroy_idle. It does not sound very
> efficient.

You mean you don't particularly care for the duplication of
the code. Fair enough. I don't have any problem copying this
small piece of code into callers that long ago went to the
effort of implementing their own polling loops. The code
duplication here isn't bothering me much.

If you see a way to convert these external callers to use
enough of the library svc APIs that they get to use
__svc_destroy_idle() automatically, that would, to me, be an
acceptable alternative.

>> That would get rid of the technical debt of calling
>> an undocumented library API.
>>
>> The helper would need to call svc_vc_destroy()
>> rather than __xprt_unregister_unlocked() and
>> __svc_vc_dodestroy(). Also not clear to me that
>> rpcbind's my_svc_run() needs the protection of
>> holding svc_fd_lock, if rpcbind itself is not
>> multi-threaded?
>
> The lock is held by __svc_destroy_idle, or __svc_clean_idle
> before it was removed, in libtirpc. I think libtirpc is
> multi-threaded.

libtirpc is multi-threaded, so __svc_destroy_idle() is correct
to take that lock. I'm suggesting that the copies of
destroy_idle() in the external callers would likely not need
to take this lock. I don't see any references to it in the
rpcbind source code, so it's probably not necessary for the
rpcbind version of the destroy_idle helper to take the lock.

Note that svc_fd_lock is not part of the public library API
either.

>> The alternative would be to construct a fully
>> documented public API with man pages and header
>> declarations. I'm not terribly comfortable with
>> diverging from the historical Sun TI-RPC programming
>> interface, however.
>
> What I'm trying to do is to follow what __svc_clean_idle
> used to do before it was removed.

Sure, I get that. That seems like a prudent approach.

> I can't totally re-use
> __svc_clean_idle because it was written to work with
> svc_fdset and select. The current svc_run uses poll so
> the handling of idle connections is a little different.

Introducing yet another private API should be off the table.

An alternative fix would be to add a public API that is
documented and appears in the headers. Again, I would rather
not diverge from the historical Sun TI-RPC library API, and
SteveD would rather not have to change the library's SONAME,
so I'm hesitant to add a public API too.

The bottom line is that adding a library API to handle idle
connections is the right architectural approach, all things
being equal. But the goal for this library is compatibility
with a historical API that is also available in other
operating systems. We have to stick with that API if we can.

--
Chuck Lever
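For comparison, the other option Chuck mentions, converting external callers to the library's own dispatcher, would look roughly like the standard TI-RPC server skeleton below: the program registers its service with svc_create() and hands the poll loop to svc_run(), so it inherits whatever idle-connection handling the library implements. The program number, version, and dispatch routine here are placeholders, and converting rpcbind this way would mean giving up everything my_svc_run() does besides polling.

#include <rpc/rpc.h>
#include <stdlib.h>

#define EXAMPLE_PROG	0x20000099	/* placeholder program number */
#define EXAMPLE_VERS	1

/* Placeholder dispatch routine for the example program. */
static void
example_dispatch(struct svc_req *rqstp, SVCXPRT *xprt)
{
	switch (rqstp->rq_proc) {
	case NULLPROC:
		svc_sendreply(xprt, (xdrproc_t)xdr_void, NULL);
		return;
	default:
		svcerr_noproc(xprt);
		return;
	}
}

int
main(void)
{
	/* Create listeners on all configured transports and register the
	 * service; the library owns the resulting SVCXPRTs. */
	if (svc_create(example_dispatch, EXAMPLE_PROG, EXAMPLE_VERS, "netpath") == 0)
		exit(1);

	/* The library's dispatcher runs the poll loop, so any idle-connection
	 * cleanup done inside libtirpc would apply automatically. */
	svc_run();
	exit(1);
}

rpcbind and nfs-utils keep their own loops precisely because they need more than this skeleton offers, which is why the duplicated-helper option is being discussed at all.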
diff --git a/src/rpcb_svc_com.c b/src/rpcb_svc_com.c
index 1743dadf5db7..cb33519010d3 100644
--- a/src/rpcb_svc_com.c
+++ b/src/rpcb_svc_com.c
@@ -1048,6 +1048,8 @@ netbuffree(struct netbuf *ap)
 }
 
 
+extern void __svc_destroy_idle(int, bool_t);
+
 void
 my_svc_run()
 {
@@ -1076,6 +1078,7 @@ my_svc_run()
 		 * other outside event) and not caused by poll().
 		 */
 		case 0:
+			__svc_destroy_idle(30, FALSE);
 			continue;
 		default:
 			/*
Currently my_svc_run does not handle poll time allowing idle TCP
connections to remain ESTABLISHED indefinitely. When the number
of connections reaches the limit the open file descriptors
(ulimit -n) then accept(2) fails with EMFILE. Since libtirpc does
not handle EMFILE returned from accept(2) this get my_svc_run into
a tight loop calling accept(2) resulting in the RPC service being
down, it's no longer able to service any requests.

Fix by add call to __svc_destroy_idle to my_svc_run to remove
inactive connections when poll(2) returns timeout.

Fixes: b2c9430f46c4 Use poll() instead of select() in svc_run()
Signed-off-by: dai.ngo@oracle.com
---
 src/rpcb_svc_com.c | 3 +++
 1 file changed, 3 insertions(+)