From patchwork Wed Jul 20 19:20:43 2022
X-Patchwork-Submitter: Attila Kovacs
X-Patchwork-Id: 12924414
Date: Wed, 20 Jul 2022 15:20:43 -0400
Subject: libtirpc deadlocks and race conditions patch
From: Attila Kovacs
To: libtirpc-devel@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
References: <76bfebd4-b4cd-4403-8a6d-676705bfdcf9@cfa.harvard.edu>
 <6b6c9e8b-deb1-f570-a082-2924eb1aef98@cfa.harvard.edu>
In-Reply-To: <6b6c9e8b-deb1-f570-a082-2924eb1aef98@cfa.harvard.edu>
X-Mailing-List: linux-nfs@vger.kernel.org

Hi,

Attached is the patch for the previously mentioned issues of
clnt_create() deadlocking on errors, and cond_signal() being
unprotected against asynchronous calls to clnt_destroy().

cheers,

-- A.


On 7/20/22 14:16, Attila Kovacs wrote:
>
> More mutex issues -- this time more likely to do with the connect
> deadlock we see:
>
> In clnt_dg.c, in clnt_dg_create():
>
> clnt_fd_lock is locked on L171, but then not released if jumping to
> the err1 label on an error (L175 and L180). This means that those
> errors will deadlock all further operations that require clnt_fd_lock
> access.
>
> The same happens in clnt_vc.c, in clnt_vc_create(), on lines 215, 222,
> and 230, respectively.
>
> -- A.
>
>
> On 7/20/22 12:09, Attila Kovacs wrote:
>> Hi Steve,
>>
>> We are using the tirpc library in an MT environment. We are
>> experiencing occasional deadlocks when calling clnt_create_timed()
>> concurrently from multiple threads. When this happens, all connecting
>> threads hang, with each thread taking up 100% CPU.
>>
>> I was peeking at the code (release 1.3.2), and some potential
>> problems I see are in how waiting / signaling is implemented on
>> cu->cu_fd_lock.cv in clnt_dg.c and clnt_vc.c.
>>
>> 1. In clnt_dg_freeres() and clnt_vc_freeres(), cond_signal() is
>> called after unlocking the mutex (clnt_fd_lock). The manual of
>> pthread_cond_signal() allows that, but mentions that for consistent
>> scheduling, cond_signal() should be called with the waiting mutex
>> locked (i.e. lock -> signal -> unlock, rather than lock -> unlock ->
>> signal).
>>
>> One of the dangers of signaling with the mutex unlocked is that in
>> MT, another thread can lock the mutex before the signal call is
>> made, and potentially destroy the condition variable (e.g. an
>> asynchronous call to clnt_*_destroy()). Thus, by the time
>> cond_signal() is called in clnt_*_freeres(), the condition variable
>> may no longer be valid.
>>
>> The proper sequence here should be:
>>
>>     [...]
>>     mutex_lock(&clnt_fd_lock);
>>     while (ct->ct_fd_lock->active)
>>         cond_wait(&ct->ct_fd_lock->cv, &clnt_fd_lock);
>>     xdrs->x_op = XDR_FREE;
>>     dummy = (*xdr_res)(xdrs, res_ptr);
>>     thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
>>     cond_signal(&ct->ct_fd_lock->cv);
>>     mutex_unlock(&clnt_fd_lock);
>>
>> 2. A similar issue exists in the macro release_fd_lock() in both the
>> vc and dg sources. Here again the sequence ought to be:
>>
>> #define release_fd_lock(fd_lock, mask) {    \
>>     mutex_lock(&clnt_fd_lock);    \
>>     fd_lock->active = FALSE;    \
>>     thr_sigsetmask(SIG_SETMASK, &(mask), (sigset_t *) NULL); \
>>     cond_signal(&fd_lock->cv);    \
>>     mutex_unlock(&clnt_fd_lock);    \
>> }
>>
>> 3. The use of cond_wait() in clnt_dg.c and clnt_vc.c is also
>> unprotected against the asynchronous destruction of the condition
>> variable, since cond_wait() releases the mutex as it enters the
>> waiting state. So, when it is called again in the while() loop, the
>> condition variable might no longer exist. Thus, inside the wait
>> loop, one should check that the client is still valid (not
>> destroyed) before calling cond_wait() again.
>>
>> I'm not sure if these particular issues are the cause of the
>> deadlocks we see, but I think they are problematic enough on their
>> own, and perhaps should be fixed in the next release.
>>
>> Thanks,
>>
>> -- Attila

diff --git a/src/clnt_dg.c b/src/clnt_dg.c
index b3d82e7..7c5d22e 100644
--- a/src/clnt_dg.c
+++ b/src/clnt_dg.c
@@ -101,9 +101,9 @@ extern mutex_t clnt_fd_lock;
 #define release_fd_lock(fd_lock, mask) {	\
 	mutex_lock(&clnt_fd_lock);	\
 	fd_lock->active = FALSE;	\
-	mutex_unlock(&clnt_fd_lock);	\
 	thr_sigsetmask(SIG_SETMASK, &(mask), NULL);	\
 	cond_signal(&fd_lock->cv);	\
+	mutex_unlock(&clnt_fd_lock);	\
 }
 
 static const char mem_err_clnt_dg[] = "clnt_dg_create: out of memory";
@@ -172,12 +172,15 @@ clnt_dg_create(fd, svcaddr, program, version, sendsz, recvsz)
 	if (dg_fd_locks == (fd_locks_t *) NULL) {
 		dg_fd_locks = fd_locks_init();
 		if (dg_fd_locks == (fd_locks_t *) NULL) {
+			mutex_unlock(&clnt_fd_lock);
 			goto err1;
 		}
 	}
 	fd_lock = fd_lock_create(fd, dg_fd_locks);
-	if (fd_lock == (fd_lock_t *) NULL)
+	if (fd_lock == (fd_lock_t *) NULL) {
+		mutex_unlock(&clnt_fd_lock);
 		goto err1;
+	}
 	mutex_unlock(&clnt_fd_lock);
 
 	thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
@@ -573,9 +576,9 @@ clnt_dg_freeres(cl, xdr_res, res_ptr)
 	cu->cu_fd_lock->active = TRUE;
 	xdrs->x_op = XDR_FREE;
 	dummy = (*xdr_res)(xdrs, res_ptr);
-	mutex_unlock(&clnt_fd_lock);
 	thr_sigsetmask(SIG_SETMASK, &mask, NULL);
 	cond_signal(&cu->cu_fd_lock->cv);
+	mutex_unlock(&clnt_fd_lock);
 	return (dummy);
 }
 
diff --git a/src/clnt_vc.c b/src/clnt_vc.c
index a07e297..3c73e65 100644
--- a/src/clnt_vc.c
+++ b/src/clnt_vc.c
@@ -153,9 +153,9 @@ extern mutex_t clnt_fd_lock;
 #define release_fd_lock(fd_lock, mask) {	\
 	mutex_lock(&clnt_fd_lock);	\
 	fd_lock->active = FALSE;	\
-	mutex_unlock(&clnt_fd_lock);	\
 	thr_sigsetmask(SIG_SETMASK, &(mask), (sigset_t *) NULL); \
 	cond_signal(&fd_lock->cv);	\
+	mutex_unlock(&clnt_fd_lock);	\
 }
 
 static const char clnt_vc_errstr[] = "%s : %s";
@@ -216,7 +216,9 @@ clnt_vc_create(fd, raddr, prog, vers, sendsz, recvsz)
 	if (vc_fd_locks == (fd_locks_t *) NULL) {
 		vc_fd_locks = fd_locks_init();
 		if (vc_fd_locks == (fd_locks_t *) NULL) {
-			struct rpc_createerr *ce = &get_rpc_createerr();
+			struct rpc_createerr *ce;
+			mutex_unlock(&clnt_fd_lock);
+			ce = &get_rpc_createerr();
 			ce->cf_stat = RPC_SYSTEMERROR;
 			ce->cf_error.re_errno = errno;
 			goto err;
@@ -224,7 +226,9 @@ clnt_vc_create(fd, raddr, prog, vers, sendsz, recvsz)
 	}
 	fd_lock = fd_lock_create(fd, vc_fd_locks);
 	if (fd_lock == (fd_lock_t *) NULL) {
-		struct rpc_createerr *ce = &get_rpc_createerr();
+		struct rpc_createerr *ce;
+		mutex_unlock(&clnt_fd_lock);
+		ce = &get_rpc_createerr();
 		ce->cf_stat = RPC_SYSTEMERROR;
 		ce->cf_error.re_errno = errno;
 		goto err;
@@ -495,9 +499,9 @@ clnt_vc_freeres(cl, xdr_res, res_ptr)
 	cond_wait(&ct->ct_fd_lock->cv, &clnt_fd_lock);
 	xdrs->x_op = XDR_FREE;
 	dummy = (*xdr_res)(xdrs, res_ptr);
-	mutex_unlock(&clnt_fd_lock);
 	thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
 	cond_signal(&ct->ct_fd_lock->cv);
+	mutex_unlock(&clnt_fd_lock);
 	return dummy;
 }
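
As a footnote on point 3 of the quoted discussion, which the attached
patch does not yet address: below is a small standalone sketch of the
shape such a fix could take. It is only an illustration under assumed
names; fd_lock_wait(), fd_lock_release() and the "destroyed" flag are
hypothetical, not existing libtirpc API. The idea is to re-check,
under clnt_fd_lock, that the handle has not been destroyed each time
cond_wait() returns, and to signal while the mutex is still held, as
the patch above already does in release_fd_lock().

/* cond_wait_sketch.c -- standalone illustration, not libtirpc code. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t clnt_fd_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
	pthread_cond_t cv;
	bool active;     /* an RPC call currently owns this fd */
	bool destroyed;  /* hypothetical: set by the destroy path, under clnt_fd_lock */
} fd_lock_t;

/* Wait until the fd is free; return false if the handle was destroyed
 * while we slept, so the caller can bail out instead of touching a
 * dead condition variable. */
bool
fd_lock_wait(fd_lock_t *fl)
{
	pthread_mutex_lock(&clnt_fd_lock);
	while (fl->active && !fl->destroyed)
		pthread_cond_wait(&fl->cv, &clnt_fd_lock);
	if (fl->destroyed) {
		pthread_mutex_unlock(&clnt_fd_lock);
		return false;
	}
	fl->active = true;   /* we own the fd now */
	pthread_mutex_unlock(&clnt_fd_lock);
	return true;
}

/* Release the fd: signal before unlocking, matching the
 * lock -> signal -> unlock ordering the patch introduces. */
void
fd_lock_release(fd_lock_t *fl)
{
	pthread_mutex_lock(&clnt_fd_lock);
	fl->active = false;
	pthread_cond_signal(&fl->cv);
	pthread_mutex_unlock(&clnt_fd_lock);
}

For such a check to be safe, the destroyed flag (or the whole
fd_lock_t) has to outlive the client handle, for instance by leaving
the entry in the dg_fd_locks / vc_fd_locks table until no waiter
references it; otherwise the test itself would read freed memory.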