From patchwork Sun Feb 4 09:57:38 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Sagi Grimberg
X-Patchwork-Id: 10199285
Return-Path:
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125])
 by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 64D65602CA
 for ; Sun, 4 Feb 2018 09:57:55 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
 by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3F6FA27F93
 for ; Sun, 4 Feb 2018 09:57:55 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
 id 33D4827FAC; Sun, 4 Feb 2018 09:57:55 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
 pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI
 autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
 by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A01FE27F93
 for ; Sun, 4 Feb 2018 09:57:54 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1750821AbeBDJ5o (ORCPT ); Sun, 4 Feb 2018 04:57:44 -0500
Received: from mail-wr0-f196.google.com ([209.85.128.196]:46481 "EHLO
 mail-wr0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1750818AbeBDJ5m (ORCPT ); Sun, 4 Feb 2018 04:57:42 -0500
Received: by mail-wr0-f196.google.com with SMTP id g21so26637176wrb.13
 for ; Sun, 04 Feb 2018 01:57:42 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:subject:to:cc:references:from:message-id:date
 :user-agent:mime-version:in-reply-to:content-language
 :content-transfer-encoding;
 bh=9ksPcuqDNLuJN0eaZqfgjMCEHgO47VBj3y/rch9FRjg=;
 b=t2sS7BPPlHHAA9TPyjRCj+vxI6undQfyy3l5NdeiEu/fUPC5X3oTiMikvUJCW4/Lw2
 fMcOUSXWug1kk0zHFAaL3pnGLl8fB6hyKw6PjaHzun6n+wC9bWkVy/bjuKOTq9fuNLjb
 aBrbCeKQflDF1Ad4hEwduakZtNPNoUol5uXQZXboAMEQw1TUlgj463Jwp7Feyqlnnf2G
 wvxQLGjKSMyesAJCXegiuNzT8Qx7mIRzdZod/2ejbbuWwZoQyyUjnJz+P7flbcMmquFJ
 s4PYF8XWHjL7YM5w2Cmw4AfM3tgttwWk1F4hBVPE9BY7sMKugNMP50gYnFJ5cV56nY+C
 fC+Q==
X-Gm-Message-State: AKwxytdI+bVvUOzcwoXMg1dyHrDKKyZvAI4rF2GHZAsyUbHI7rybuWYR
 71Tb46pknKtP3wooosbv3jNgVDgg
X-Google-Smtp-Source: AH8x225iqqBJ3LYRbNakpFCf7rBBdH9KRsRIB9oeMYOK9fYlvXAONYcRyQa/CHwygHOf+FothUUR3w==
X-Received: by 10.223.153.54 with SMTP id x51mr30522998wrb.210.1517738261330;
 Sun, 04 Feb 2018 01:57:41 -0800 (PST)
Received: from [192.168.64.117] (bzq-219-42-90.isdn.bezeqint.net. [62.219.42.90])
 by smtp.gmail.com with ESMTPSA id y52sm19191593wrb.52.2018.02.04.01.57.39
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Sun, 04 Feb 2018 01:57:40 -0800 (PST)
Subject: Re: Regression: Connect-X5 doesn't connect with NVME-of
To: Saeed Mahameed , Logan Gunthorpe ,
 "linux-rdma@vger.kernel.org"
Cc: Max Gurtovoy , Stephen Bates , linux-nvme , Christoph Hellwig
References: <66a5332c-01ee-7a39-8224-189fa52a7298@deltatee.com>
From: Sagi Grimberg
Message-ID: <0d629a68-a1fa-7297-e371-5abbc2dd5fe7@grimberg.me>
Date: Sun, 4 Feb 2018 11:57:38 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.5.0
MIME-Version: 1.0
In-Reply-To:
Content-Language: en-US
Sender: linux-rdma-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-rdma@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

>> Hello,

Hi Logan, thanks for reporting.

>> We've experienced a regression with using nvme-of and two Connect-X5s.
>> With v4.15 and v4.14.16 we see the following dmesgs when trying to
>> connect to the target:
>>
>>> [   43.732539] nvme nvme2: creating 16 I/O queues.
>>> [   44.072427] nvmet: adding queue 1 to ctrl 1.
>>> [   44.072553] nvmet: adding queue 2 to ctrl 1.
>>> [   44.072597] nvme nvme2: Connect command failed, error wo/DNR bit: -16402
>>> [   44.072609] nvme nvme2: failed to connect queue: 3 ret=-18
>>> [   44.075421] nvmet_rdma: freeing queue 2
>>> [   44.075792] nvmet_rdma: freeing queue 1
>>> [   44.264293] nvmet_rdma: freeing queue 3
>>> *snip*
>>
>> (on v4.15 there are additional error panics, likely due to some other
>> nvme-of error handling bugs)
>>
>> And nvme connect returns:
>>
>>> Failed to write to /dev/nvme-fabrics: Invalid cross-device link
>>
>> The two adapters are the same with the latest available firmware:
>>
>>>     transport:            InfiniBand (0)
>>>     fw_ver:                16.21.2010
>>>     vendor_id:            0x02c9
>>>     vendor_part_id:            4119
>>>     hw_ver:                0x0
>>>     board_id:            MT_0000000010
>>
>> We bisected and found that the commit that broke our setup is:
>>
>> 05e0cc84e00c net/mlx5: Fix get vector affinity helper function

I'm really bummed out about this... I seem to have missed it in my
review, and apparently it went in untested.

If we look at the patch, it clearly shows that the behavior changed:
mlx5_get_vector_affinity no longer adds the MLX5_EQ_VEC_COMP_BASE
offset as it did before.

The API assumes that completion vector 0 means the first _completion_
vector, which means ignoring the private/internal mlx5 vectors created
for stuff like port async events, fw commands and page requests...

What happens is that the consumer asked for the affinity mask of
completion vector 0 and got the async event vector instead, and the
skew continued from there, leading to unmapped block queues.

So I think this should make the problem go away:
---
		return NULL;
--

Can you verify that this fixes your problem?

Regardless, it looks like we also have a second bug in here such that
we still attempt to connect a queue which is unmapped and fail the
controller association when it fails.
This was not an option before because PCI_IRQ_AFFINITY guaranteed us
the cpu spread we need, which let us ignore this case, but that's
changed now.

We should either settle for fewer queues, or fall back to the default
mq_map for the queues that are left unmapped, or at least continue
forward without these unmapped queues (I think the former makes better
sense).
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index a0610427e168..b82c4ae92411 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1238,7 +1238,7 @@ mlx5_get_vector_affinity(struct mlx5_core_dev *dev, int vector)
 	int eqn;
 	int err;
 
-	err = mlx5_vector2eqn(dev, vector, &eqn, &irq);
+	err = mlx5_vector2eqn(dev, MLX5_EQ_VEC_COMP_BASE + vector, &eqn, &irq);
 	if (err)