From patchwork Sun Jul 2 11:56:56 2017
X-Patchwork-Submitter: Sagi Grimberg
X-Patchwork-Id: 9821115
Subject: Re: NVMe induced NULL deref in bt_iter()
From: Sagi Grimberg
To: Max Gurtovoy, Jens Axboe
Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
Date: Sun, 2 Jul 2017 14:56:56 +0300
Message-ID: <7138df5a-b1ce-7f46-281f-ae15172c61e5@grimberg.me>
In-Reply-To: <9afc0fd3-e598-dea9-a505-d8fa0f608d16@mellanox.com>
References: <9afc0fd3-e598-dea9-a505-d8fa0f608d16@mellanox.com>
List-ID: linux-block@vger.kernel.org

On 02/07/17 13:45, Max Gurtovoy wrote:
>
> On 6/30/2017 8:26 PM, Jens Axboe wrote:
>> Hi Max,
>
> Hi Jens,
>
>> I remembered you reporting this.
>> I think this is a regression introduced with the scheduling, since
>> ->rqs[] isn't static anymore. ->static_rqs[] is, but that's not
>> indexable by the tag we find. So I think we need to guard those with
>> a NULL check. The actual requests themselves are static, so we know
>> the memory itself isn't going away. But if we race with completion,
>> we could find a NULL there, validly.
>>
>> Since you could reproduce it, can you try the below?
>
> I can still repro the NULL deref with this patch applied.
>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d0be72ccb091..b856b2827157 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  		bitnr += tags->nr_reserved_tags;
>>  	rq = tags->rqs[bitnr];
>>
>> -	if (rq->q == hctx->queue)
>> +	if (rq && rq->q == hctx->queue)
>>  		iter_data->fn(hctx, rq, iter_data->data, reserved);
>>  	return true;
>>  }
>> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  	if (!reserved)
>>  		bitnr += tags->nr_reserved_tags;
>>  	rq = tags->rqs[bitnr];
>> -
>> -	iter_data->fn(rq, iter_data->data, reserved);
>> +	if (rq)
>> +		iter_data->fn(rq, iter_data->data, reserved);
>>  	return true;
>>  }
>
> see the attached file for dmesg output.
>
> output of gdb:
>
> (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
> 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs (./include/linux/sbitmap.h:234).
> 229
> 230             for (i = 0; i < sb->map_nr; i++) {
> 231                     struct sbitmap_word *word = &sb->map[i];
> 232                     unsigned int off, nr;
> 233
> 234                     if (!word->word)
> 235                             continue;
> 236
> 237                     nr = 0;
> 238                     off = i << sb->shift;
>
> when I change the "if (!word->word)" to "if (word && !word->word)"
> I get a NULL deref at "nr = find_next_bit(&word->word, word->depth, nr);".
> Seems like somehow word becomes NULL.
>
> Adding the linux-nvme guys too.
> Sagi has mentioned that this can be NULL only if we remove the tagset
> while I/O is trying to get a tag, and when killing the target we get into
> error recovery and periodic reconnects, which do _NOT_ include freeing
> the tagset, so this is probably the admin tagset.
>
> Sagi,
> you've mentioned a patch for centralizing the treatment of the admin
> tagset in the nvme core. I think I missed this patch, so can you please
> send a pointer to it and I'll check if it helps?

Hmm, in the above flow we should not be freeing the tag_set, not the
admin one either. The target keeps removing namespaces and finally
removes the subsystem, which generates an error recovery flow. What we
at least try to do is:

1. mark the rdma queues as not live
2. stop all the sw queues (admin and io)
3. fail inflight I/Os
4. restart all the sw queues (to fast fail until we recover)

We shouldn't be freeing the tagsets (although we might update them when
we recover and the cpu map has changed, which I don't think is happening
here).

However, I do see a difference between bt_tags_for_each and
blk_mq_flush_busy_ctxs: bt_tags_for_each checks that tags->rqs is not
NULL.

Unrelated to this, I think we should quiesce/unquiesce the admin_q
instead of stop/start, because quiesce respects the submission path
rcu [1]. It might hide the issue, but given that we never free the
tagset it seems like the bug is not in nvme-rdma (Max, can you see if
this makes the issue go away?)
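For context, the distinction matters because stopping the hw queues only
keeps them from being run again, while quiescing also waits for
submitters that are already inside the RCU-protected dispatch section.
A rough sketch of the idea only, not the actual block-layer
implementation (the sketch_* helper names are made up; the real code in
block/blk-mq.c also handles BLK_MQ_F_BLOCKING hctxs with SRCU):

#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/rcupdate.h>

/* Made-up helpers, for illustration only. */

static void sketch_stop_admin_queue(struct request_queue *q)
{
	/* Further runs of the hw queues become no-ops ... */
	blk_mq_stop_hw_queues(q);
	/*
	 * ... but a submitter already inside the dispatch path may still
	 * be dereferencing tags/requests when this returns.
	 */
}

static void sketch_quiesce_admin_queue(struct request_queue *q)
{
	blk_mq_stop_hw_queues(q);
	/*
	 * Wait for every dispatcher currently inside the RCU read-side
	 * section of the submission path to leave it; after this returns
	 * nothing is still issuing against the queue.
	 */
	synchronize_rcu();
}

The patch below switches the admin queue over to the quiesce/unquiesce
pair.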
[1]:

Tested-by: Max Gurtovoy
Reviewed-by: Max Gurtovoy
---
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index e3996db22738..094873a4ee38 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	if (ctrl->ctrl.queue_count > 1)
 		nvme_stop_queues(&ctrl->ctrl);
-	blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
+	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
 
 	/* We must take care of fastfail/requeue all our inflight requests */
 	if (ctrl->ctrl.queue_count > 1)
@@ -798,7 +798,8 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	 * queues are not a live anymore, so restart the queues to fail fast
 	 * new IO
 	 */
-	blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
+	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	blk_mq_kick_requeue_list(ctrl->ctrl.admin_q);
 	nvme_start_queues(&ctrl->ctrl);
 
 	nvme_rdma_reconnect_or_remove(ctrl);
@@ -1651,7 +1652,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl)
 	if (test_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags))
 		nvme_shutdown_ctrl(&ctrl->ctrl);
 
-	blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
+	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, nvme_cancel_request,
 				&ctrl->ctrl);
 	nvme_rdma_destroy_admin_queue(ctrl);
--
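As a usage note, the pattern the hunks above move the admin queue to can
be condensed into a small sketch. The helper below is made up, and it
folds into one function what the real driver spreads across
nvme_rdma_error_recovery_work() and nvme_rdma_shutdown_ctrl():

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

#include "nvme.h"	/* nvme_cancel_request(), as used in the hunks above */

/* Made-up helper illustrating the quiesce-based fencing pattern. */
static void sketch_fence_admin_queue(struct request_queue *admin_q,
				     struct blk_mq_tag_set *admin_tag_set,
				     void *cancel_data)
{
	/* Block new submissions and wait out in-flight dispatchers. */
	blk_mq_quiesce_queue(admin_q);

	/* Now it is safe to walk the tagset and fail the busy requests. */
	blk_mq_tagset_busy_iter(admin_tag_set, nvme_cancel_request,
				cancel_data);

	/* Let requeued/new commands flow again so they can fast-fail. */
	blk_mq_unquiesce_queue(admin_q);
	blk_mq_kick_requeue_list(admin_q);
}

In nvme-rdma the cancel_data argument would be &ctrl->ctrl, matching the
blk_mq_tagset_busy_iter() call in the shutdown hunk above.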