From patchwork Tue Feb 4 09:51:59 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Klaus Jensen X-Patchwork-Id: 11364325 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 71E5414E3 for ; Tue, 4 Feb 2020 10:00:41 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 364F0217BA for ; Tue, 4 Feb 2020 10:00:41 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=samsung.com header.i=@samsung.com header.b="Za2kui1M" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 364F0217BA Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=samsung.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Received: from localhost ([::1]:55430 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iyv0a-00041p-CO for patchwork-qemu-devel@patchwork.kernel.org; Tue, 04 Feb 2020 05:00:40 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:56029) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iyusr-0004dH-QB for qemu-devel@nongnu.org; Tue, 04 Feb 2020 04:52:45 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1iyuso-0007iF-GZ for qemu-devel@nongnu.org; Tue, 04 Feb 2020 04:52:41 -0500 Received: from mailout1.w1.samsung.com ([210.118.77.11]:35354) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1iyuso-0007GJ-2M for qemu-devel@nongnu.org; Tue, 04 Feb 2020 04:52:38 -0500 Received: from eucas1p1.samsung.com (unknown [182.198.249.206]) by mailout1.w1.samsung.com (KnoxPortal) with ESMTP id 20200204095227euoutp01d5d7bbc6a585df543c00c18ab0cc2238~wKkWCq5BJ3121331213euoutp01P for ; Tue, 4 Feb 2020 09:52:27 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 mailout1.w1.samsung.com 20200204095227euoutp01d5d7bbc6a585df543c00c18ab0cc2238~wKkWCq5BJ3121331213euoutp01P DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=samsung.com; s=mail20170921; t=1580809947; bh=trrwaADNo4IHDMLwfKX+4vxribtcVLcCzl1JNZUwQ5s=; h=From:To:CC:Subject:Date:In-Reply-To:References:From; b=Za2kui1M3Bt4nUBG0O2U/K4igLyIS6s9uIpSbixrUCx6TzZi4szZX/qv4lehTXvJc ej4/Mm6Wj4BWLNC5bMnvCfbIP50HNEmt5p2uQzYFM4oqtdQY4yUD4Q2+ZoJw68f+hu 2X0duSP4M1TRa19l7Fw222EaVtU1eNE8P9h5bUsw= Received: from eusmges2new.samsung.com (unknown [203.254.199.244]) by eucas1p2.samsung.com (KnoxPortal) with ESMTP id 20200204095227eucas1p219ad5ea7c11222d8e6d985a12a94d18b~wKkV2DG7E0112201122eucas1p2W; Tue, 4 Feb 2020 09:52:27 +0000 (GMT) Received: from eucas1p2.samsung.com ( [182.198.249.207]) by eusmges2new.samsung.com (EUCPMTA) with SMTP id 8F.EB.60679.ADE393E5; Tue, 4 Feb 2020 09:52:26 +0000 (GMT) Received: from eusmtrp2.samsung.com (unknown [182.198.249.139]) by eucas1p2.samsung.com (KnoxPortal) with ESMTPA id 20200204095226eucas1p2429f45a5e23fe6ed57dee293be5e1b44~wKkVgJ0hw2200522005eucas1p2G; Tue, 4 Feb 2020 09:52:26 +0000 (GMT) Received: from eusmgms2.samsung.com (unknown [182.198.249.180]) by eusmtrp2.samsung.com (KnoxPortal) with ESMTP id 20200204095226eusmtrp262ab7668e78f25a402b090967bcde028~wKkVfY8eU0489204892eusmtrp2J; Tue, 4 Feb 2020 09:52:26 +0000 (GMT) X-AuditID: cbfec7f4-0e5ff7000001ed07-3b-5e393edab2ec Received: from eusmtip2.samsung.com ( [203.254.199.222]) by eusmgms2.samsung.com (EUCPMTA) with SMTP id FC.6A.07950.ADE393E5; Tue, 4 Feb 2020 09:52:26 +0000 (GMT) Received: from CAMSVWEXC01.scsc.local (unknown [106.1.227.71]) by eusmtip2.samsung.com (KnoxPortal) with ESMTPA id 20200204095226eusmtip206771cdfbbb46a0b49654b22f9cf2f1e~wKkVWPC5u2765927659eusmtip2N; Tue, 4 Feb 2020 09:52:26 +0000 (GMT) Received: from CAMSVWEXC01.scsc.local (2002:6a01:e347::6a01:e347) by CAMSVWEXC01.scsc.local (2002:6a01:e347::6a01:e347) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Tue, 4 Feb 2020 09:52:25 +0000 Received: from apples.local (106.110.32.41) by CAMSVWEXC01.scsc.local (106.1.227.71) with Microsoft SMTP Server id 15.0.1320.4 via Frontend Transport; Tue, 4 Feb 2020 09:52:25 +0000 From: Klaus Jensen To: Subject: [PATCH v5 17/26] nvme: allow multiple aios per command Date: Tue, 4 Feb 2020 10:51:59 +0100 Message-ID: <20200204095208.269131-18-k.jensen@samsung.com> X-Mailer: git-send-email 2.25.0 In-Reply-To: <20200204095208.269131-1-k.jensen@samsung.com> MIME-Version: 1.0 X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFlrCKsWRmVeSWpSXmKPExsWy7djP87q37CzjDC5Nt7HY3N/OaLH/4DdW i0mHrjFaLLmYajHvlrLFrHftbBbHe3ewOLB7nNtxnt1j06pONo871/aweTy5tpnJ4/2+q2wB rFFcNimpOZllqUX6dglcGSuf/GEtuHuNseLv3BtMDYwTFjB2MXJySAiYSDxt3sfUxcjFISSw glHiavchFgjnC6PEt8V3GCGcz4wSl//PYe9i5ABrWf3dESK+nFHi/80dTHBF35ZMhGo/zSgx sXEClLMTKNN8hxlkI5uApsT2P/9ZQGwRAWmJ9quTWEGKmAV+M0rMbZzIBpIQFrCX+HviJlgD i4CKxJSGN2Dn8gpYS/ztfskOcbq8xOzG02D1nEDx/atOMEHUCEqcnPkEbAEzUE3z1tnMELaE xMEXL5hBlkkI7GKXeL37PBPEIBeJJROfMkPYwhKvjm+BWiAj8X/nfCaIhm5Gib4PX6G6ZzBK TF/2nQ0SHNYSfWdyIBocJW7fmAwNJT6JG28FIRbzSUzaNp0ZIswr0dEmBFGtJrGjaSvjBEbl WUjOnoXk7FlIzl7AyLyKUTy1tDg3PbXYKC+1XK84Mbe4NC9dLzk/dxMjMO2c/nf8yw7GXX+S DjEKcDAq8fCePGsRJ8SaWFZcmXuIUYKDWUmE97y+ZZwQb0piZVVqUX58UWlOavEhRmkOFiVx XuNFL2OFBNITS1KzU1MLUotgskwcnFINjNaGlZmSwtUVkhaTgl4r2MbOazs837P2wWK3DINc XveX+yLvfgqpk3x1vbLO7tWcZyXXG2W2KSfuWsfy8ss6ldID6xntGA16Tiw5tu3mvHzuZq9S 3/NCtWc+uBsv45wumNNf2dy55rFt9J5563jU8r7J/FzMduL/80+8L3/tC5n1gMNllcX/U0os xRmJhlrMRcWJAMKvkuk3AwAA X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFtrHIsWRmVeSWpSXmKPExsVy+t/xe7q37CzjDH72qlls7m9ntNh/8Bur xaRD1xgtllxMtZh3S9li1rt2NovjvTtYHNg9zu04z+6xaVUnm8eda3vYPJ5c28zk8X7fVbYA 1ig9m6L80pJUhYz84hJbpWhDCyM9Q0sLPSMTSz1DY/NYKyNTJX07m5TUnMyy1CJ9uwS9jJVP /rAW3L3GWPF37g2mBsYJCxi7GDk4JARMJFZ/d+xi5OIQEljKKPHo2nzWLkZOoLiMxKcrH9kh bGGJP9e62CCKPjJKvPjSwQThnGaUWL+qjRnC2QmU6dzACNLCJqApsf3PfxYQW0RAWqL96iSw scwCvxkl/vQFgtjCAvYSf0/cZAaxWQRUJKY0vAHr5RWwlvjb/RJqtbzE7MbTbCA2J1B8/6oT TCBnCwlYSbQe5YYoF5Q4OfMJC8R4eYnmrbOZIWwJiYMvXjBPYBSehaRsFpKyWUjKFjAyr2IU SS0tzk3PLTbSK07MLS7NS9dLzs/dxAiMu23Hfm7Zwdj1LvgQowAHoxIPr4ajRZwQa2JZcWXu IUYJDmYlEd7z+pZxQrwpiZVVqUX58UWlOanFhxhNgX6byCwlmpwPTAl5JfGGpobmFpaG5sbm xmYWSuK8HQIHY4QE0hNLUrNTUwtSi2D6mDg4pRoYY7z2c3R63mGve52wMv/bV5ENnv++X6ng MnBefTFf+3lzko34FM1epkuf9im8WfZGNfVFVvV91QqvAtf5XmZPt8bn/1klvZ+v3aSCMa57 79mP8318NM80TZLuNm5UTDnlHHZix+6jX+M/hkSta5vQWHpQ3v7cvvaIx+oC3BaLipZFdhwz TjqsxFKckWioxVxUnAgAtQa8QdECAAA= X-CMS-MailID: 20200204095226eucas1p2429f45a5e23fe6ed57dee293be5e1b44 X-Msg-Generator: CA X-RootMTR: 20200204095226eucas1p2429f45a5e23fe6ed57dee293be5e1b44 X-EPHeader: CA CMS-TYPE: 201P X-CMS-RootMailID: 20200204095226eucas1p2429f45a5e23fe6ed57dee293be5e1b44 References: <20200204095208.269131-1-k.jensen@samsung.com> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 210.118.77.11 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Kevin Wolf , Beata Michalska , qemu-devel@nongnu.org, Max Reitz , Klaus Jensen , Keith Busch , Javier Gonzalez Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" This refactors how the device issues asynchronous block backend requests. The NvmeRequest now holds a queue of NvmeAIOs that are associated with the command. This allows multiple aios to be issued for a command. Only when all requests have been completed will the device post a completion queue entry. Because the device is currently guaranteed to only issue a single aio request per command, the benefit is not immediately obvious. But this functionality is required to support metadata, the dataset management command and other features. Signed-off-by: Klaus Jensen Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 449 +++++++++++++++++++++++++++++++++--------- hw/block/nvme.h | 134 +++++++++++-- hw/block/trace-events | 8 + 3 files changed, 480 insertions(+), 111 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 334265efb21e..e97da35c4ca1 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -19,7 +19,8 @@ * -drive file=,if=none,id= * -device nvme,drive=,serial=,id=, \ * cmb_size_mb=, \ - * num_queues= + * num_queues=, \ + * mdts= * * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at * offset 0 in BAR2 and supports only WDS, RDS and SQS for now. @@ -57,6 +58,7 @@ } while (0) static void nvme_process_sq(void *opaque); +static void nvme_aio_cb(void *opaque, int ret); static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr) { @@ -341,6 +343,107 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, return status; } +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) +{ + NvmeNamespace *ns = req->ns; + + uint32_t len = req->nlb << nvme_ns_lbads(ns); + uint64_t prp1 = le64_to_cpu(cmd->prp1); + uint64_t prp2 = le64_to_cpu(cmd->prp2); + + return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req); +} + +static void nvme_aio_destroy(NvmeAIO *aio) +{ + g_free(aio); +} + +static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio, + NvmeAIOOp opc) +{ + aio->opc = opc; + + trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk), + aio->offset, aio->len, nvme_aio_opc_str(aio), req); + + if (req) { + QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry); + } +} + +static void nvme_aio(NvmeAIO *aio) +{ + BlockBackend *blk = aio->blk; + BlockAcctCookie *acct = &aio->acct; + BlockAcctStats *stats = blk_get_stats(blk); + + bool is_write, dma; + + switch (aio->opc) { + case NVME_AIO_OPC_NONE: + break; + + case NVME_AIO_OPC_FLUSH: + block_acct_start(stats, acct, 0, BLOCK_ACCT_FLUSH); + aio->aiocb = blk_aio_flush(blk, nvme_aio_cb, aio); + break; + + case NVME_AIO_OPC_WRITE_ZEROES: + block_acct_start(stats, acct, aio->len, BLOCK_ACCT_WRITE); + aio->aiocb = blk_aio_pwrite_zeroes(blk, aio->offset, aio->len, + BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio); + break; + + case NVME_AIO_OPC_READ: + case NVME_AIO_OPC_WRITE: + dma = aio->qsg != NULL; + is_write = (aio->opc == NVME_AIO_OPC_WRITE); + + block_acct_start(stats, acct, aio->len, + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ); + + if (dma) { + aio->aiocb = is_write ? + dma_blk_write(blk, aio->qsg, aio->offset, + BDRV_SECTOR_SIZE, nvme_aio_cb, aio) : + dma_blk_read(blk, aio->qsg, aio->offset, + BDRV_SECTOR_SIZE, nvme_aio_cb, aio); + + return; + } + + aio->aiocb = is_write ? + blk_aio_pwritev(blk, aio->offset, aio->iov, 0, + nvme_aio_cb, aio) : + blk_aio_preadv(blk, aio->offset, aio->iov, 0, + nvme_aio_cb, aio); + + break; + } +} + +static void nvme_rw_aio(BlockBackend *blk, uint64_t offset, NvmeRequest *req) +{ + NvmeAIO *aio; + size_t len = req->qsg.nsg > 0 ? req->qsg.size : req->iov.size; + + aio = g_new0(NvmeAIO, 1); + + *aio = (NvmeAIO) { + .blk = blk, + .offset = offset, + .len = len, + .req = req, + .qsg = &req->qsg, + .iov = &req->iov, + }; + + nvme_req_register_aio(req, aio, nvme_req_is_write(req) ? + NVME_AIO_OPC_WRITE : NVME_AIO_OPC_READ); + nvme_aio(aio); +} + static void nvme_post_cqes(void *opaque) { NvmeCQueue *cq = opaque; @@ -364,6 +467,7 @@ static void nvme_post_cqes(void *opaque) nvme_inc_cq_tail(cq); pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe, sizeof(req->cqe)); + nvme_req_clear(req); QTAILQ_INSERT_TAIL(&sq->req_list, req, entry); } if (cq->tail != cq->head) { @@ -374,8 +478,8 @@ static void nvme_post_cqes(void *opaque) static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req) { assert(cq->cqid == req->sq->cqid); - trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid, - req->status); + trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid, req->status); + QTAILQ_REMOVE(&req->sq->out_req_list, req, entry); QTAILQ_INSERT_TAIL(&cq->req_list, req, entry); timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500); @@ -460,135 +564,272 @@ static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type) } } -static void nvme_rw_cb(void *opaque, int ret) +static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len, + NvmeRequest *req) +{ + uint8_t mdts = n->params.mdts; + + if (mdts && len > n->page_size << mdts) { + trace_nvme_dev_err_mdts(nvme_cid(req), n->page_size << mdts, len); + return NVME_INVALID_FIELD | NVME_DNR; + } + + return NVME_SUCCESS; +} + +static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeRequest *req) +{ + NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd; + NvmeNamespace *ns = req->ns; + + uint16_t ctrl = le16_to_cpu(rw->control); + + if ((ctrl & NVME_RW_PRINFO_PRACT) && !(ns->id_ns.dps & DPS_TYPE_MASK)) { + trace_nvme_dev_err_prinfo(nvme_cid(req), ctrl); + return NVME_INVALID_FIELD | NVME_DNR; + } + + return NVME_SUCCESS; +} + +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba, + uint32_t nlb, NvmeRequest *req) +{ + NvmeNamespace *ns = req->ns; + uint64_t nsze = le64_to_cpu(ns->id_ns.nsze); + + if (unlikely((slba + nlb) > nsze)) { + block_acct_invalid(blk_get_stats(n->conf.blk), + nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ); + trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze); + return NVME_LBA_RANGE | NVME_DNR; + } + + return NVME_SUCCESS; +} + +static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req) +{ + NvmeNamespace *ns = req->ns; + size_t len = req->nlb << nvme_ns_lbads(ns); + uint16_t status; + + status = nvme_check_mdts(n, len, req); + if (status) { + return status; + } + + status = nvme_check_prinfo(n, req); + if (status) { + return status; + } + + status = nvme_check_bounds(n, req->slba, req->nlb, req); + if (status) { + return status; + } + + return NVME_SUCCESS; +} + +static void nvme_rw_cb(NvmeRequest *req, void *opaque) { - NvmeRequest *req = opaque; NvmeSQueue *sq = req->sq; NvmeCtrl *n = sq->ctrl; NvmeCQueue *cq = n->cq[sq->cqid]; - if (!ret) { - block_acct_done(blk_get_stats(n->conf.blk), &req->acct); - req->status = NVME_SUCCESS; - } else { - block_acct_failed(blk_get_stats(n->conf.blk), &req->acct); - req->status = NVME_INTERNAL_DEV_ERROR; - } - - if (req->qsg.nalloc) { - qemu_sglist_destroy(&req->qsg); - } - if (req->iov.nalloc) { - qemu_iovec_destroy(&req->iov); - } + trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid); nvme_enqueue_req_completion(cq, req); } -static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, - NvmeRequest *req) +static void nvme_aio_cb(void *opaque, int ret) { - block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0, - BLOCK_ACCT_FLUSH); - req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req); + NvmeAIO *aio = opaque; + NvmeRequest *req = aio->req; - return NVME_NO_COMPLETE; -} + BlockBackend *blk = aio->blk; + BlockAcctCookie *acct = &aio->acct; + BlockAcctStats *stats = blk_get_stats(blk); -static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, - NvmeRequest *req) -{ - NvmeRwCmd *rw = (NvmeRwCmd *)cmd; - const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas); - const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds; - uint64_t slba = le64_to_cpu(rw->slba); - uint32_t nlb = le16_to_cpu(rw->nlb) + 1; - uint64_t offset = slba << data_shift; - uint32_t count = nlb << data_shift; - - if (unlikely(slba + nlb > ns->id_ns.nsze)) { - trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze); - return NVME_LBA_RANGE | NVME_DNR; - } - - block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0, - BLOCK_ACCT_WRITE); - req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count, - BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req); - return NVME_NO_COMPLETE; -} - -static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, - NvmeRequest *req) -{ - NvmeRwCmd *rw = (NvmeRwCmd *)cmd; - uint32_t nlb = le32_to_cpu(rw->nlb) + 1; - uint64_t slba = le64_to_cpu(rw->slba); - uint64_t prp1 = le64_to_cpu(rw->prp1); - uint64_t prp2 = le64_to_cpu(rw->prp2); - - uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas); - uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds; - uint64_t data_size = (uint64_t)nlb << data_shift; - uint64_t data_offset = slba << data_shift; - int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0; - enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ; + Error *local_err = NULL; - trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba); + trace_nvme_dev_aio_cb(nvme_cid(req), aio, blk_name(blk), aio->offset, + nvme_aio_opc_str(aio), req); - if (unlikely((slba + nlb) > ns->id_ns.nsze)) { - block_acct_invalid(blk_get_stats(n->conf.blk), acct); - trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze); - return NVME_LBA_RANGE | NVME_DNR; + if (req) { + QTAILQ_REMOVE(&req->aio_tailq, aio, tailq_entry); } - if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) { - block_acct_invalid(blk_get_stats(n->conf.blk), acct); - return NVME_INVALID_FIELD | NVME_DNR; - } - - if (req->qsg.nsg > 0) { - block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size, - acct); - - req->aiocb = is_write ? - dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE, - nvme_rw_cb, req) : - dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE, - nvme_rw_cb, req); + if (!ret) { + block_acct_done(stats, acct); } else { - block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size, - acct); + block_acct_failed(stats, acct); - req->aiocb = is_write ? - blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb, - req) : - blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb, - req); + if (req) { + uint16_t status; + + switch (aio->opc) { + case NVME_AIO_OPC_READ: + status = NVME_UNRECOVERED_READ; + break; + case NVME_AIO_OPC_WRITE: + case NVME_AIO_OPC_WRITE_ZEROES: + status = NVME_WRITE_FAULT; + break; + default: + status = NVME_INTERNAL_DEV_ERROR; + break; + } + + trace_nvme_dev_err_aio(nvme_cid(req), aio, blk_name(blk), + aio->offset, nvme_aio_opc_str(aio), req, status); + + error_setg_errno(&local_err, -ret, "aio failed"); + error_report_err(local_err); + + /* + * An Internal Error trumps all other errors. For other errors, + * only set the first error encountered. Any additional errors will + * be recorded in the error information log page. + */ + if (!req->status || + nvme_status_is_error(status, NVME_INTERNAL_DEV_ERROR)) { + req->status = status; + } + } + } + + if (aio->cb) { + aio->cb(aio, aio->cb_arg, ret); + } + + if (req && QTAILQ_EMPTY(&req->aio_tailq)) { + if (req->cb) { + req->cb(req, req->cb_arg); + } else { + NvmeSQueue *sq = req->sq; + NvmeCtrl *n = sq->ctrl; + NvmeCQueue *cq = n->cq[sq->cqid]; + + nvme_enqueue_req_completion(cq, req); + } } + nvme_aio_destroy(aio); +} + +static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) +{ + NvmeAIO *aio = g_new0(NvmeAIO, 1); + + *aio = (NvmeAIO) { + .blk = n->conf.blk, + .req = req, + }; + + nvme_req_register_aio(req, aio, NVME_AIO_OPC_FLUSH); + nvme_aio(aio); + + return NVME_NO_COMPLETE; +} + +static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) +{ + NvmeAIO *aio; + + NvmeNamespace *ns = req->ns; + NvmeRwCmd *rw = (NvmeRwCmd *) cmd; + + int64_t offset; + size_t count; + uint16_t status; + + req->slba = le64_to_cpu(rw->slba); + req->nlb = le16_to_cpu(rw->nlb) + 1; + + trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid), + req->slba, req->nlb); + + status = nvme_check_bounds(n, req->slba, req->nlb, req); + if (unlikely(status)) { + block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE); + return status; + } + + offset = req->slba << nvme_ns_lbads(ns); + count = req->nlb << nvme_ns_lbads(ns); + + aio = g_new0(NvmeAIO, 1); + + *aio = (NvmeAIO) { + .blk = n->conf.blk, + .offset = offset, + .len = count, + .req = req, + }; + + nvme_req_register_aio(req, aio, NVME_AIO_OPC_WRITE_ZEROES); + nvme_aio(aio); + + return NVME_NO_COMPLETE; +} + +static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) +{ + NvmeRwCmd *rw = (NvmeRwCmd *) cmd; + NvmeNamespace *ns = req->ns; + int status; + + enum BlockAcctType acct = + nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ; + + req->nlb = le16_to_cpu(rw->nlb) + 1; + req->slba = le64_to_cpu(rw->slba); + + trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb, + req->nlb << nvme_ns_lbads(req->ns), req->slba); + + status = nvme_check_rw(n, req); + if (status) { + block_acct_invalid(blk_get_stats(n->conf.blk), acct); + return status; + } + + status = nvme_map(n, cmd, req); + if (status) { + block_acct_invalid(blk_get_stats(n->conf.blk), acct); + return status; + } + + nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req); + nvme_req_set_cb(req, nvme_rw_cb, NULL); + return NVME_NO_COMPLETE; } static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) { - NvmeNamespace *ns; uint32_t nsid = le32_to_cpu(cmd->nsid); + trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid), + cmd->opcode); + if (unlikely(nsid == 0 || nsid > n->num_namespaces)) { trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces); return NVME_INVALID_NSID | NVME_DNR; } - ns = &n->namespaces[nsid - 1]; + req->ns = &n->namespaces[nsid - 1]; + switch (cmd->opcode) { case NVME_CMD_FLUSH: - return nvme_flush(n, ns, cmd, req); + return nvme_flush(n, cmd, req); case NVME_CMD_WRITE_ZEROS: - return nvme_write_zeros(n, ns, cmd, req); + return nvme_write_zeros(n, cmd, req); case NVME_CMD_WRITE: case NVME_CMD_READ: - return nvme_rw(n, ns, cmd, req); + return nvme_rw(n, cmd, req); default: trace_nvme_dev_err_invalid_opc(cmd->opcode); return NVME_INVALID_OPCODE | NVME_DNR; @@ -612,6 +853,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd) NvmeRequest *req, *next; NvmeSQueue *sq; NvmeCQueue *cq; + NvmeAIO *aio; uint16_t qid = le16_to_cpu(c->qid); if (unlikely(!qid || nvme_check_sqid(n, qid))) { @@ -624,8 +866,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd) sq = n->sq[qid]; while (!QTAILQ_EMPTY(&sq->out_req_list)) { req = QTAILQ_FIRST(&sq->out_req_list); - assert(req->aiocb); - blk_aio_cancel(req->aiocb); + while (!QTAILQ_EMPTY(&req->aio_tailq)) { + aio = QTAILQ_FIRST(&req->aio_tailq); + assert(aio->aiocb); + blk_aio_cancel(aio->aiocb); + } } if (!nvme_check_cqid(n, sq->cqid)) { cq = n->cq[sq->cqid]; @@ -662,6 +907,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr, QTAILQ_INIT(&sq->out_req_list); for (i = 0; i < sq->size; i++) { sq->io_req[i].sq = sq; + QTAILQ_INIT(&(sq->io_req[i].aio_tailq)); QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry); } sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq); @@ -800,6 +1046,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) uint32_t numdl, numdu; uint64_t off, lpol, lpou; size_t len; + uint16_t status; numdl = (dw10 >> 16); numdu = (dw11 & 0xffff); @@ -815,6 +1062,11 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off); + status = nvme_check_mdts(n, len, req); + if (status) { + return status; + } + switch (lid) { case NVME_LOG_ERROR_INFO: if (!rae) { @@ -1348,7 +1600,7 @@ static void nvme_process_sq(void *opaque) req = QTAILQ_FIRST(&sq->req_list); QTAILQ_REMOVE(&sq->req_list, req, entry); QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry); - memset(&req->cqe, 0, sizeof(req->cqe)); + req->cqe.cid = cmd.cid; memcpy(&req->cmd, &cmd, sizeof(NvmeCmd)); @@ -1928,6 +2180,7 @@ static void nvme_init_ctrl(NvmeCtrl *n) id->ieee[0] = 0x00; id->ieee[1] = 0x02; id->ieee[2] = 0xb3; + id->mdts = params->mdts; id->ver = cpu_to_le32(NVME_SPEC_VER); id->oacs = cpu_to_le16(0); diff --git a/hw/block/nvme.h b/hw/block/nvme.h index d27baa9d5391..3319f8edd7e1 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -8,7 +8,8 @@ DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \ DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \ DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \ - DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64) + DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64), \ + DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7) typedef struct NvmeParams { char *serial; @@ -16,6 +17,7 @@ typedef struct NvmeParams { uint32_t cmb_size_mb; uint8_t aerl; uint32_t aer_max_queued; + uint8_t mdts; } NvmeParams; typedef struct NvmeAsyncEvent { @@ -23,17 +25,58 @@ typedef struct NvmeAsyncEvent { NvmeAerResult result; } NvmeAsyncEvent; -typedef struct NvmeRequest { - struct NvmeSQueue *sq; - BlockAIOCB *aiocb; - uint16_t status; - NvmeCqe cqe; - BlockAcctCookie acct; - QEMUSGList qsg; - QEMUIOVector iov; - NvmeCmd cmd; - QTAILQ_ENTRY(NvmeRequest)entry; -} NvmeRequest; +typedef struct NvmeRequest NvmeRequest; +typedef void NvmeRequestCompletionFunc(NvmeRequest *req, void *opaque); + +struct NvmeRequest { + struct NvmeSQueue *sq; + struct NvmeNamespace *ns; + + NvmeCqe cqe; + NvmeCmd cmd; + uint16_t status; + + uint64_t slba; + uint32_t nlb; + + QEMUSGList qsg; + QEMUIOVector iov; + + NvmeRequestCompletionFunc *cb; + void *cb_arg; + + QTAILQ_HEAD(, NvmeAIO) aio_tailq; + QTAILQ_ENTRY(NvmeRequest) entry; +}; + +static inline void nvme_req_clear(NvmeRequest *req) +{ + req->ns = NULL; + memset(&req->cqe, 0, sizeof(req->cqe)); + req->status = NVME_SUCCESS; + req->slba = req->nlb = 0x0; + req->cb = req->cb_arg = NULL; + + if (req->qsg.sg) { + qemu_sglist_destroy(&req->qsg); + } + + if (req->iov.iov) { + qemu_iovec_destroy(&req->iov); + } +} + +static inline void nvme_req_set_cb(NvmeRequest *req, + NvmeRequestCompletionFunc *cb, void *cb_arg) +{ + req->cb = cb; + req->cb_arg = cb_arg; +} + +static inline void nvme_req_clear_cb(NvmeRequest *req) +{ + req->cb = req->cb_arg = NULL; +} typedef struct NvmeSQueue { struct NvmeCtrl *ctrl; @@ -85,6 +128,60 @@ static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns) return 1 << nvme_ns_lbads(ns); } +typedef enum NvmeAIOOp { + NVME_AIO_OPC_NONE = 0x0, + NVME_AIO_OPC_FLUSH = 0x1, + NVME_AIO_OPC_READ = 0x2, + NVME_AIO_OPC_WRITE = 0x3, + NVME_AIO_OPC_WRITE_ZEROES = 0x4, +} NvmeAIOOp; + +typedef struct NvmeAIO NvmeAIO; +typedef void NvmeAIOCompletionFunc(NvmeAIO *aio, void *opaque, int ret); + +struct NvmeAIO { + NvmeRequest *req; + + NvmeAIOOp opc; + int64_t offset; + size_t len; + BlockBackend *blk; + BlockAIOCB *aiocb; + BlockAcctCookie acct; + + NvmeAIOCompletionFunc *cb; + void *cb_arg; + + QEMUSGList *qsg; + QEMUIOVector *iov; + + QTAILQ_ENTRY(NvmeAIO) tailq_entry; +}; + +static inline const char *nvme_aio_opc_str(NvmeAIO *aio) +{ + switch (aio->opc) { + case NVME_AIO_OPC_NONE: return "NVME_AIO_OP_NONE"; + case NVME_AIO_OPC_FLUSH: return "NVME_AIO_OP_FLUSH"; + case NVME_AIO_OPC_READ: return "NVME_AIO_OP_READ"; + case NVME_AIO_OPC_WRITE: return "NVME_AIO_OP_WRITE"; + case NVME_AIO_OPC_WRITE_ZEROES: return "NVME_AIO_OP_WRITE_ZEROES"; + default: return "NVME_AIO_OP_UNKNOWN"; + } +} + +static inline bool nvme_req_is_write(NvmeRequest *req) +{ + switch (req->cmd.opcode) { + case NVME_CMD_WRITE: + case NVME_CMD_WRITE_UNCOR: + case NVME_CMD_WRITE_ZEROS: + return true; + default: + return false; + } +} + #define TYPE_NVME "nvme" #define NVME(obj) \ OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME) @@ -139,10 +236,21 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns) static inline uint16_t nvme_cid(NvmeRequest *req) { if (req) { - return le16_to_cpu(req->cqe.cid); + return le16_to_cpu(req->cmd.cid); } return 0xffff; } +static inline bool nvme_status_is_error(uint16_t status, uint16_t err) +{ + /* strip DNR and MORE */ + return (status & 0xfff) == err; +} + +static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req) +{ + return req->sq->ctrl; +} + #endif /* HW_NVME_H */ diff --git a/hw/block/trace-events b/hw/block/trace-events index 77aa0da99ee0..90a57fb6099a 100644 --- a/hw/block/trace-events +++ b/hw/block/trace-events @@ -34,7 +34,12 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin" nvme_dev_irq_masked(void) "IRQ is masked" nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64"" nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d" +nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count %"PRIu64" opc \"%s\" req %p" +nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p" +nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8"" nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64"" +nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32"" +nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32"" nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16"" nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d" nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16"" @@ -75,6 +80,9 @@ nvme_dev_mmio_shutdown_set(void) "shutdown bit set" nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared" # nvme traces for error conditions +nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64"" +nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16"" +nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16"" nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size" nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64"" nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""