From patchwork Fri Nov 1 21:39:50 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Hefty, Sean" X-Patchwork-Id: 3127721 Return-Path: X-Original-To: patchwork-linux-rdma@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.19.201]) by patchwork2.web.kernel.org (Postfix) with ESMTP id 048D7BEEB2 for ; Fri, 1 Nov 2013 21:40:05 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id C775720461 for ; Fri, 1 Nov 2013 21:40:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 7946520495 for ; Fri, 1 Nov 2013 21:40:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752305Ab3KAVkA (ORCPT ); Fri, 1 Nov 2013 17:40:00 -0400 Received: from mga02.intel.com ([134.134.136.20]:17861 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752153Ab3KAVj7 (ORCPT ); Fri, 1 Nov 2013 17:39:59 -0400 Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga101.jf.intel.com with ESMTP; 01 Nov 2013 14:39:59 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.93,535,1378882800"; d="scan'208";a="428493035" Received: from cst-linux.jf.intel.com ([10.23.221.72]) by orsmga002.jf.intel.com with ESMTP; 01 Nov 2013 14:39:58 -0700 From: sean.hefty@intel.com To: linux-rdma@vger.kernel.org, roland@purestorage.com Cc: Sean Hefty Subject: [PATCH] rdma/cm: Discard events for IDs not yet claimed by user space Date: Fri, 1 Nov 2013 14:39:50 -0700 Message-Id: <1383341990-4055-1-git-send-email-sean.hefty@intel.com> X-Mailer: git-send-email 1.7.3 Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-rdma@vger.kernel.org X-Spam-Status: No, score=-7.4 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Sean Hefty Problem reported by Avneesh Pant : >> It looks like we are triggering a bug in RDMA CM/UCM interaction. The bug specifically hits when we have an incoming connection request and the connecting process dies BEFORE the passive end of the connection can process the request i.e. it does not call rdma_get_cm_event() to retrieve the initial connection event. We were able to triage this further and have some additional information now. In the example below when P1 dies after issuing a connect request as the CM id is being destroyed all outstanding connects (to P2) are sent a reject message. We see this reject message being received on the passive end and the appropriate CM ID created for the initial connection message being retrieved in cm_match_req(). The problem is in the ucma_event_handler() code when this reject message is delivered to it and the initial connect message itself HAS NOT been delivered to the client. In fact the client has not even called rdma_cm_get_event() at this stage so we haven't allocated a new ctx in ucma_get_event() and updated the new connection CM_ID to point to the new UCMA context. This results in the reject message not being dropped in ucma_event_handler() for the new connection request as the (if (!ctx->uid)) block is skipped since the ctx it refers to is the listen CM id context which does have a valid UID associated with it (I believe the new CMID for the connection initially uses the listen CMID -> context when it is created in cma_new_conn_id). Thus the assumption that new events for a connection can get dropped in ucma_event_handler() is incorrect IF the initial connect request has not been retrieved in the first case. We end up getting a CM Reject event on the listen CM ID and our upper layer code asserts (in fact this event does not even have the listen_id set as that only gets set up librdmacm for connect requests). << The solution is to verify that the cm_id being reported in the event is the same as the cm_id referenced by the ucma context. A mismatch indicates that the ucma context corresponds to the listen. This fix was validated by using a modified version of the librdmacm that was able to verify the problem and see that the reject message was indeed dropped after this patch was applied. Signed-off-by: Sean Hefty --- drivers/infiniband/core/ucma.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index b0f189b..826016b 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -271,7 +271,7 @@ static int ucma_event_handler(struct rdma_cm_id *cm_id, goto out; } ctx->backlog--; - } else if (!ctx->uid) { + } else if (!ctx->uid || ctx->cm_id != cm_id) { /* * We ignore events for new connections until userspace has set * their context. This can only happen if an error occurs on a