From patchwork Mon Oct 7 04:43:47 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 13824050 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3823443ABC for ; Mon, 7 Oct 2024 04:43:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728276236; cv=none; b=WbfvEE44oEj6vz7/AZWw6yLJnHIEwMaCzy21jePiZe4/jg5bfLPQOETtWYR+QZBKVX0UW+fHH6tW1DGssl05OJnDyzHGAFtQp49TyUmIzpUN/SyyZIcoSo+9qjM796VE50+7JKp/nYJabmj0nMfL64S34L3Xn0ypM9TyX0sMUfA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728276236; c=relaxed/simple; bh=vfUOd45XU/XxutZyW6nJEnB0x4D/3FMkZDnFt4MKos8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=MGd8Tnaj62XPqe5ceiR9MpeI/vbh3V8pQjgT76eac72qf87U5Rgmdg/DoxJ6eYfbOgKC1UW4HjhV3+lkvK0ML5SSLQEfiTW407TbXzTElVp6OpdnX9u0yNZjixBHjO0EqViXEHR4Qq1M0RuTouXLdaNXAuM2pw3t4UA5hAULJbk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=NSNNkHhC; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="NSNNkHhC" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 448ACC4CED0; Mon, 7 Oct 2024 04:43:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728276235; bh=vfUOd45XU/XxutZyW6nJEnB0x4D/3FMkZDnFt4MKos8=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=NSNNkHhCzpQBZtqIoBqgdOZYO2asdfp+8w7pJB4t/Z172qFdLAKMgcZk+fwq6HI+M 
hMpNxrZIr4f/bLTXI3ezRKpmNdSM7LfLzAY8IxS4FhaTiwp1A2wmSuBtM3KMrmrWC6 dtPCgf043bqSbstafIDWEBCVig8LC7wlb+Vz/NiHTgYEDuDYIYx2oT4CD54utsTLP3 q2+wNYCjio1fZFW7Jm3FaZogkufnpgl6h3Xa5O9NhgIy0tYaNNvt/1q7oNMr/NnMDo 1Z6xwVxJRUVNfzLpNXfpBcDYFxv8XPqLc0iRsrvYlJP5B2DFWJn5f4+o6I9lWnZy7Z mEZNaCgTOJLIg== From: Damien Le Moal To: linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v1 1/5] nvmet: rename and move nvmet_get_log_page_len() Date: Mon, 7 Oct 2024 13:43:47 +0900 Message-ID: <20241007044351.157912-2-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241007044351.157912-1-dlemoal@kernel.org> References: <20241007044351.157912-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 The code for nvmet_get_log_page_len() has no dependency on nvme target code and only depends on struct nvme_command. Move this helper function out of drivers/nvme/target/admin-cmd.c, rename it nvme_get_log_page_len(), and inline it as part of the generic definitions in include/linux/nvme.h. Apply the same modification to nvmet_get_log_page_offset(), which becomes nvme_get_log_page_offset().
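
For reference, the length computation done by the helper can be reproduced in userspace. The sketch below is only an illustration: log_page_len() is a hypothetical name, and it takes NUMDU and NUMDL as host-endian values, whereas the kernel helper reads them from struct nvme_command with le16_to_cpu().

```c
#include <stdint.h>

/*
 * Illustration only: userspace copy of the arithmetic in
 * nvme_get_log_page_len(). The arguments are assumed to be already
 * converted to host endianness.
 */
static uint32_t log_page_len(uint16_t numdu, uint16_t numdl)
{
	uint32_t len = numdu;

	len <<= 16;
	len += numdl;
	/* NUMD is a 0's based value: 0 means one dword */
	len += 1;
	/* NUMD counts dwords, the transfer length is in bytes */
	len *= sizeof(uint32_t);

	return len;
}
```

With NUMDU = 0 and NUMDL = 0x3ff, the command requests 0x400 dwords, that is, 4096 bytes.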
Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Sagi Grimberg --- drivers/nvme/target/admin-cmd.c | 20 +------------------- drivers/nvme/target/discovery.c | 4 ++-- drivers/nvme/target/nvmet.h | 3 --- include/linux/nvme.h | 19 +++++++++++++++++++ 4 files changed, 22 insertions(+), 24 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 954d4c074770..b20cd1dba207 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -12,19 +12,6 @@ #include #include "nvmet.h" -u32 nvmet_get_log_page_len(struct nvme_command *cmd) -{ - u32 len = le16_to_cpu(cmd->get_log_page.numdu); - - len <<= 16; - len += le16_to_cpu(cmd->get_log_page.numdl); - /* NUMD is a 0's based value */ - len += 1; - len *= sizeof(u32); - - return len; -} - static u32 nvmet_feat_data_len(struct nvmet_req *req, u32 cdw10) { switch (cdw10 & 0xff) { @@ -35,11 +22,6 @@ static u32 nvmet_feat_data_len(struct nvmet_req *req, u32 cdw10) } } -u64 nvmet_get_log_page_offset(struct nvme_command *cmd) -{ - return le64_to_cpu(cmd->get_log_page.lpo); -} - static void nvmet_execute_get_log_page_noop(struct nvmet_req *req) { nvmet_req_complete(req, nvmet_zero_sgl(req, 0, req->transfer_len)); @@ -319,7 +301,7 @@ static void nvmet_execute_get_log_page_ana(struct nvmet_req *req) static void nvmet_execute_get_log_page(struct nvmet_req *req) { - if (!nvmet_check_transfer_len(req, nvmet_get_log_page_len(req->cmd))) + if (!nvmet_check_transfer_len(req, nvme_get_log_page_len(req->cmd))) return; switch (req->cmd->get_log_page.lid) { diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c index 28843df5fa7c..71c94a54bcd8 100644 --- a/drivers/nvme/target/discovery.c +++ b/drivers/nvme/target/discovery.c @@ -163,8 +163,8 @@ static void nvmet_execute_disc_get_log_page(struct nvmet_req *req) const int entry_size = sizeof(struct nvmf_disc_rsp_page_entry); struct nvmet_ctrl *ctrl = req->sq->ctrl; struct 
nvmf_disc_rsp_page_hdr *hdr; - u64 offset = nvmet_get_log_page_offset(req->cmd); - size_t data_len = nvmet_get_log_page_len(req->cmd); + u64 offset = nvme_get_log_page_offset(req->cmd); + size_t data_len = nvme_get_log_page_len(req->cmd); size_t alloc_len; struct nvmet_subsys_link *p; struct nvmet_port *r; diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 190f55e6d753..6e9499268c28 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -541,9 +541,6 @@ u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf, size_t len); u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len); -u32 nvmet_get_log_page_len(struct nvme_command *cmd); -u64 nvmet_get_log_page_offset(struct nvme_command *cmd); - extern struct list_head *nvmet_ports; void nvmet_port_disc_changed(struct nvmet_port *port, struct nvmet_subsys *subsys); diff --git a/include/linux/nvme.h b/include/linux/nvme.h index b58d9405d65e..1f6d8cd0389a 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -10,6 +10,7 @@ #include #include #include +#include /* NQN names in commands fields specified one size */ #define NVMF_NQN_FIELD_LEN 256 @@ -1856,6 +1857,24 @@ static inline bool nvme_is_write(const struct nvme_command *cmd) return cmd->common.opcode & 1; } +static inline __u32 nvme_get_log_page_len(struct nvme_command *cmd) +{ + __u32 len = le16_to_cpu(cmd->get_log_page.numdu); + + len <<= 16; + len += le16_to_cpu(cmd->get_log_page.numdl); + /* NUMD is a 0's based value */ + len += 1; + len *= sizeof(__u32); + + return len; +} + +static inline __u64 nvme_get_log_page_offset(struct nvme_command *cmd) +{ + return le64_to_cpu(cmd->get_log_page.lpo); +} + enum { /* * Generic Command Status: From patchwork Mon Oct 7 04:43:48 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 13824051 Received: from smtp.kernel.org 
(aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EC20A43ABC for ; Mon, 7 Oct 2024 04:43:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728276238; cv=none; b=QYoOCiquz6UHV36DyVtyTc1k5pwpQscyoTYTeEZlQLR41S8psvJNzGptpw5WAFXE89WBN8wwiXxcRBeYxv2VFfheTIKvFny0P1IJ9R3ppzDo+p8gj88y671PxXWGEzvJvQXVK5tJke9acotlOA9M/v2C3qTk/Gec1gQrIEW80Yw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728276238; c=relaxed/simple; bh=ksVCnorPjWn7+AyFB7R0NeXDZPsR2eZvsmvQ/R3OnJE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=KGVuFiTdfD7hkANigCf5hbuqh7ayW01JGaKiz3RdFqnGkBD4Kss6nQPnNv7cUD8auo5haVU5r2hgHIorM4C6Sv2MRN3IAi06wvtqy148vSSte5OtJ16hsoC3ciQrQVcv8YIvDeSklhtm3coJ+uWG10qzxXK1asDMP5leXPKz1K0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=fI3mIYwa; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="fI3mIYwa" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1A49BC4CECF; Mon, 7 Oct 2024 04:43:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728276237; bh=ksVCnorPjWn7+AyFB7R0NeXDZPsR2eZvsmvQ/R3OnJE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=fI3mIYwaOgSApf8p5+di9x584jUdgCpNG/mpOyYjW5/o379m7Kt4/yxGUG4pTXyLv X34LeUK67xZFsdFgtTiXS8XagLNUba5vRr3Q5sN8a513B8YyzXLxYepuS55uHWhgf7 Ascrknql9LYK2ew1Im+gatjFsp3zBCnbSGmvkhqrmAuIsOybuU8O+FuEJpfBkDGpzz Gq1xd+qCyQLoznx1YTqeGmrBaOTLqDZ0UEg1JABH8LJtViL7xZKmC/JGB0At4IuWrq sjyi26kl99KZtPLVFzJW/03yRLNvRs+vPXvMBFBH7EOZq2zmMCF0qq77L4Nj+eOHPl 
aNwlcROvIC9vQ== From: Damien Le Moal To: linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v1 2/5] nvmef: export nvmf_create_ctrl() Date: Mon, 7 Oct 2024 13:43:48 +0900 Message-ID: <20241007044351.157912-3-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241007044351.157912-1-dlemoal@kernel.org> References: <20241007044351.157912-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Export nvmf_create_ctrl() to allow drivers to directly call this function instead of forcing the creation of a fabrics host controller with a write to /dev/nvme-fabrics. The export is restricted to the NVME_FABRICS namespace. Signed-off-by: Damien Le Moal --- drivers/nvme/host/fabrics.c | 4 ++-- drivers/nvme/host/fabrics.h | 1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 432efcbf9e2f..e3c990d50704 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -1276,8 +1276,7 @@ EXPORT_SYMBOL_GPL(nvmf_free_options); NVMF_OPT_FAIL_FAST_TMO | NVMF_OPT_DHCHAP_SECRET |\ NVMF_OPT_DHCHAP_CTRL_SECRET) -static struct nvme_ctrl * -nvmf_create_ctrl(struct device *dev, const char *buf) +struct nvme_ctrl *nvmf_create_ctrl(struct device *dev, const char *buf) { struct nvmf_ctrl_options *opts; struct nvmf_transport_ops *ops; @@ -1346,6 +1345,7 @@ nvmf_create_ctrl(struct device *dev, const char *buf) nvmf_free_options(opts); return ERR_PTR(ret); } +EXPORT_SYMBOL_NS_GPL(nvmf_create_ctrl, NVME_FABRICS); static const struct class nvmf_class = { .name = "nvme-fabrics", diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index 
21d75dc4a3a0..2dd3aeb8c53a 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -214,6 +214,7 @@ static inline unsigned int nvmf_nr_io_queues(struct nvmf_ctrl_options *opts) min(opts->nr_poll_queues, num_online_cpus()); } +struct nvme_ctrl *nvmf_create_ctrl(struct device *dev, const char *buf); int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val); int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val); int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val); From patchwork Mon Oct 7 04:43:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 13824052 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8A2EA43ABC for ; Mon, 7 Oct 2024 04:43:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728276239; cv=none; b=ZR7ErnheQfqfDuFD9Spd5YeqwwJs4Prmb1yCEhuUwIwOfoZxuz9LtQ1NpHboKb4tgSAKsDeZQu5JeyThIKXALOKYWm12H/V0ASi7Z3ddwwO9L4GvP4vmrzoI4YHpTzk2PIvdsILPAfifkcyOypkt1MrZi2eM+JQ5tm6Lr++7INQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728276239; c=relaxed/simple; bh=dnDNUE6zdmzn3HjRvxUKwPog/N4iHYyhVE47uHqaT/E=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=TvxZaYGIoWrZiAHuYx7X7308DaqHMAuIbK7dJ+yKAVC1dzHlqVKxRo/X1/f52T2mf1IerxO3+5f4RghPjJ9pTI1qQw2VVR0UnJV36LppgMqlwPgbwXPtcbqjN4cCSnx9A1b2HpSy/iKDPS50lLUV6Tg/mlySJ50r3uPcZKU6bI0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=UEFU6Wru; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: 
smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="UEFU6Wru" Received: by smtp.kernel.org (Postfix) with ESMTPSA id E27E8C4CED0; Mon, 7 Oct 2024 04:43:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728276239; bh=dnDNUE6zdmzn3HjRvxUKwPog/N4iHYyhVE47uHqaT/E=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=UEFU6WruypIZWIv6hkktGYhHs6rTC8JeiCCZ0U1pKty1Z5IvLOAiO+gcJcNzp7Okj 64+rbvj3o1gtqgmVLlJML6k81nZxqDTs5A9OjOGiykYhIjqYDpkDv2Z7Y8HGmM2aDq Foeccl7Q27A01qEfbjbQQb2qcQ9QD42TYFSNUo7heHX+Z+VYvW8W7eS0K40pq5wdwW mLhd89WHVmrT6hgWN12Dv0AUDTikyhezpL/0ArLafI8bZLRLloAMwxedNm2AvZHjNK XlNaxpQaDAtWbzAK97doVuER9ZM+bYWQsIvCf/jlSSB9nYRqv/F/prYxVFcJt0Et8c gDKO5GAyaGh2g== From: Damien Le Moal To: linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v1 3/5] nvmef: Introduce the NVMF_OPT_HIDDEN_NS option Date: Mon, 7 Oct 2024 13:43:49 +0900 Message-ID: <20241007044351.157912-4-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241007044351.157912-1-dlemoal@kernel.org> References: <20241007044351.157912-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Introduce the NVMe fabrics option NVMF_OPT_HIDDEN_NS ("hidden_ns") to allow a host controller to be created without any user visible or internally usable namespace devices. That is, if set, this option will result in the controller having no character device and no block device for any of its namespaces. This option should be used only when the NVMe controller will be managed with passthrough commands issued through the controller character device, either by the user or by another device driver.
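
To illustrate how a value-less option such as "hidden_ns" is picked out of a comma-separated controller options string, here is a minimal userspace sketch. It does not use the kernel match_table_t parser, and sketch_has_hidden_ns() is a hypothetical helper written only for this example; only the token name "hidden_ns" comes from this patch.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Illustration only: return true if the comma-separated option string
 * contains the bare token "hidden_ns". Tokens that merely contain the
 * substring (e.g. "nqn=hidden_ns_test") do not match.
 */
static bool sketch_has_hidden_ns(const char *buf)
{
	static const char tok[] = "hidden_ns";
	const char *p = buf;

	while (*p) {
		/* Length of the current token, up to the next ',' or the end */
		size_t len = strcspn(p, ",");

		if (len == strlen(tok) && !strncmp(p, tok, len))
			return true;

		p += len;
		if (*p == ',')
			p++;
	}

	return false;
}
```

An options string such as "transport=loop,nqn=testnqn,hidden_ns" (transport and NQN values chosen arbitrarily here) would then request a controller without namespace block and character devices.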
Signed-off-by: Damien Le Moal --- drivers/nvme/host/core.c | 17 ++++++++++++++--- drivers/nvme/host/fabrics.c | 7 ++++++- drivers/nvme/host/fabrics.h | 4 ++++ 3 files changed, 24 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index ba6508455e18..c7f0be39a30a 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1714,11 +1714,17 @@ static void nvme_enable_aen(struct nvme_ctrl *ctrl) queue_work(nvme_wq, &ctrl->async_event_work); } +static inline bool nvme_hidden_ns(struct nvme_ctrl *ctrl) +{ + return ctrl->opts && ctrl->opts->hidden_ns; +} + static int nvme_ns_open(struct nvme_ns *ns) { /* should never be called due to GENHD_FL_HIDDEN */ - if (WARN_ON_ONCE(nvme_ns_head_multipath(ns->head))) + if (WARN_ON_ONCE(nvme_ns_head_multipath(ns->head) || + nvme_hidden_ns(ns->ctrl))) goto fail; if (!nvme_get_ns(ns)) goto fail; @@ -3828,6 +3834,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info) disk->fops = &nvme_bdev_ops; disk->private_data = ns; + if (nvme_hidden_ns(ctrl)) + disk->flags |= GENHD_FL_HIDDEN; + ns->disk = disk; ns->queue = disk->queue; ns->ctrl = ctrl; @@ -3879,7 +3888,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info) if (device_add_disk(ctrl->device, ns->disk, nvme_ns_attr_groups)) goto out_cleanup_ns_from_list; - if (!nvme_ns_head_multipath(ns->head)) + if (!nvme_ns_head_multipath(ns->head) && + !nvme_hidden_ns(ctrl)) nvme_add_ns_cdev(ns); nvme_mpath_add_disk(ns, info->anagrpid); @@ -3945,7 +3955,8 @@ static void nvme_ns_remove(struct nvme_ns *ns) /* guarantee not available in head->list */ synchronize_srcu(&ns->head->srcu); - if (!nvme_ns_head_multipath(ns->head)) + if (!nvme_ns_head_multipath(ns->head) && + !nvme_hidden_ns(ns->ctrl)) nvme_cdev_del(&ns->cdev, &ns->cdev_device); del_gendisk(ns->disk); diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index e3c990d50704..64e95727ae2a 100644 --- 
a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -707,6 +707,7 @@ static const match_table_t opt_tokens = { #ifdef CONFIG_NVME_TCP_TLS { NVMF_OPT_TLS, "tls" }, #endif + { NVMF_OPT_HIDDEN_NS, "hidden_ns" }, { NVMF_OPT_ERR, NULL } }; @@ -1053,6 +1054,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, } opts->tls = true; break; + case NVMF_OPT_HIDDEN_NS: + opts->hidden_ns = true; + break; default: pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", p); @@ -1274,7 +1278,8 @@ EXPORT_SYMBOL_GPL(nvmf_free_options); NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\ NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_DISCOVERY |\ NVMF_OPT_FAIL_FAST_TMO | NVMF_OPT_DHCHAP_SECRET |\ - NVMF_OPT_DHCHAP_CTRL_SECRET) + NVMF_OPT_DHCHAP_CTRL_SECRET | \ + NVMF_OPT_HIDDEN_NS) struct nvme_ctrl *nvmf_create_ctrl(struct device *dev, const char *buf) { diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index 2dd3aeb8c53a..5388610e475d 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -66,6 +66,7 @@ enum { NVMF_OPT_TLS = 1 << 25, NVMF_OPT_KEYRING = 1 << 26, NVMF_OPT_TLS_KEY = 1 << 27, + NVMF_OPT_HIDDEN_NS = 1 << 28, }; /** @@ -108,6 +109,8 @@ enum { * @nr_poll_queues: number of queues for polling I/O * @tos: type of service * @fast_io_fail_tmo: Fast I/O fail timeout in seconds + * @fast_io_fail_tmo: Fast I/O fail timeout in seconds + * @hidden_ns: Hide block devices for the namespaces of the controller */ struct nvmf_ctrl_options { unsigned mask; @@ -133,6 +136,7 @@ struct nvmf_ctrl_options { bool disable_sqflow; bool hdr_digest; bool data_digest; + bool hidden_ns; unsigned int nr_write_queues; unsigned int nr_poll_queues; int tos; From patchwork Mon Oct 7 04:43:50 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 13824053 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org 
[10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BAC8F43ABC for ; Mon, 7 Oct 2024 04:44:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728276241; cv=none; b=DcZHPgswKlc/dAK/zBjg3ImQI9b+BbuzBtvtzLgQXtw95VXDQJ5hM98DKbgKYln/lQ1LzO2/puLoko+QyMrDAKVOuhzHL94FBPrL5uUT/zYm2E4U5h3Wn6Ap+nMvJPYipV74E1LJbRUFKce0jzdSzKbOu2eKqsU+NVSFQ3A66/0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728276241; c=relaxed/simple; bh=5FqzYMBXsHJHqrYUCVw93WdvNtxgA7Ip3UwNDdQmIvU=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=NLgdyHQyi5fzjeGfMnzyjD/pISCNa308GJLi8r8KStCuVf25W1ZP7VfiCWbxJgJJe+rVis38aBu+p9yRqlbZCNU7kRjpuWpttG5rEjs+N9uMA+Jj96ggDCcK09GTj2KFImqBZR1xNHXxNKpukULKdL/aHyHkkKIEMIoBArTpxw8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=AncZFMIE; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="AncZFMIE" Received: by smtp.kernel.org (Postfix) with ESMTPSA id B8A41C4CED1; Mon, 7 Oct 2024 04:43:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728276241; bh=5FqzYMBXsHJHqrYUCVw93WdvNtxgA7Ip3UwNDdQmIvU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=AncZFMIECNOUCd+81s/ffXwJJ58WEc+rnxh3UapZFuidn7jyTmRR8knHZsOfBeld1 TATAjoIkvWJkEpyvanDik8CG43uR49K+7i5H+DvWfrRmNcqWrgl/9fnGrafiD3J71l WayCKi07Bb1j415Vjs7tMmo1W+yByGWO/kTOyPYaw8tt9fLar4Sa7F/ViRpQB/Qz3S 6Peo0LoJjFNpQKj/ru9fiJNf/H73xRvrhnn2qIog9LieWnw1AQylm87GRG11+ekgwB r5/4bxFJDlBaLwq+iWtFRLdiImv32cZchTZQk0+ljK6ddTJBUuoMRMZObez7nBsiU0 BMTVDTLA7em9Q== From: Damien Le Moal To: 
linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v1 4/5] PCI: endpoint: Add NVMe endpoint function driver Date: Mon, 7 Oct 2024 13:43:50 +0900 Message-ID: <20241007044351.157912-5-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241007044351.157912-1-dlemoal@kernel.org> References: <20241007044351.157912-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Damien Le Moal

Add a Linux PCI Endpoint function driver that implements a PCIe NVMe controller for a local NVMe fabrics host controller. The NVMe endpoint function driver relies as much as possible on the NVMe fabrics driver for executing NVMe commands received from the host, to minimize NVMe command parsing. However, some admin commands must be modified to satisfy PCI transport specification constraints (e.g. queue management commands support and the optional SGL support).

The NVMe endpoint function driver is set up as follows:

1) Upon binding of the endpoint function driver to the endpoint controller (pci_epf_nvme_bind()), the function driver sets up BAR 0 for the NVMe PCI controller with enough doorbell space to accommodate up to PCI_EPF_NVME_MAX_NR_QUEUES (16) queue pairs. The DMA channels that will be used to exchange data with the host over the PCI link are also initialized.

2) The endpoint function driver then creates the NVMe host fabrics controller using nvmf_create_ctrl() (called from pci_epf_nvme_create_ctrl()), which connects the host controller to its target (e.g. a loop target with a file or block device or a TCP remote target).

3) Once the PCI link status is detected to be up, the endpoint controller initializes IRQ management and the BAR 0 content to advertise its capabilities. The capabilities of the fabrics controller are mostly used unmodified (pci_epf_nvme_init_ctrl_regs()). With that, the endpoint controller starts a delayed work to poll the BAR 0 registers and detect changes to the CC register.

4) When the PCI host enables the controller, pci_epf_nvme_enable_ctrl() is called to create the admin submission and completion queues and start the fabrics controller with nvme_start_ctrl(). The endpoint controller then starts a delayed work to poll the admin submission queue doorbell to detect commands from the PCI host.

5) Admin commands received from the PCI host are retrieved from the admin queue by mapping the queue memory to PCI memory space, copying the command locally using a struct pci_epf_nvme_cmd, and processing the command using pci_epf_nvme_process_admin_cmd().

6) I/O commands are similarly handled: each I/O submission queue uses a delayed work to poll the queue doorbell and, upon detection of a command being issued by the host, the I/O command is copied locally and processed using pci_epf_nvme_process_io_cmd().

I/O and admin commands are processed as follows:

1) A minimal parsing of the command is done to determine the command buffer size and data transfer direction. The command processing then continues using a command work scheduled using a per queue-pair high-priority workqueue (pci_epf_nvme_exec_cmd_work()).

2) The command execution work calls pci_epf_nvme_exec_cmd(), which retrieves and parses the command PRPs to determine the PCI address location of the command buffer segments, and retrieves the command data if the command is a write command. The command is then executed using the host fabrics controller by calling __nvme_submit_sync_cmd(). Once done, pci_epf_nvme_complete_cmd() is called to complete the command, after having transferred the command data back to the PCI host in the case of a read command.

3) pci_epf_nvme_complete_cmd() queues the command in a completion list for the completion queue of the command and schedules the queue completion work, which batches CQ entry transfers to the PCI host with the completion queue memory mapped to the host PCI address of the completion queue.

With this processing, most of the command parsing and handling is left to the NVMe fabrics code. The only NVMe specific parsing implemented in the endpoint driver is the command PRP parsing. Of note is that the current code does not support SGL (this capability is thus not advertised).

For data transfers, the endpoint driver relies by default on the DMA RX and TX channels of the hardware endpoint PCI controller. If no DMA channels are available, the NVMe endpoint function driver falls back to using MMIO, which degrades performance significantly but keeps the function working.

The BAR register polling work also monitors for controller-disable events (e.g. the PCI host reboots or shuts down). Such events trigger calls to pci_epf_nvme_disable_ctrl(), which drains, cleans up and destroys the local queue pairs.

The configuration and enablement of this NVMe endpoint function driver can be fully controlled using configfs, once an NVMe fabrics target is also set up. The available configfs parameters are:

- ctrl_opts: Fabrics controller connection arguments, as formatted for the nvme-cli "connect" command.
- dma_enable: Enable (default) or disable DMA data transfers.
- mdts_kb: Change the maximum data transfer size (default: 128 KB).

Early versions of this driver code were based on an RFC submission by Alan Mikhak (https://lwn.net/Articles/804369/). The code however has since been completely rewritten.
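
As an illustration of the PRP arithmetic mentioned above, the sketch below mirrors the pci_epf_nvme_prp_ofst() and pci_epf_nvme_prp_size() macros of this patch in plain userspace C. The function names sketch_prp_size() and sketch_nr_prps() are hypothetical and not part of the driver, and the memory page size (mps) is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustration only: number of bytes addressable through one PRP entry,
 * i.e. from the entry offset up to the end of its memory page.
 */
static size_t sketch_prp_size(uint64_t prp, size_t mps)
{
	return mps - (size_t)(prp & (mps - 1));
}

/*
 * Illustration only: number of PRP entries needed to describe a buffer
 * of len bytes starting at prp1.
 */
static unsigned int sketch_nr_prps(uint64_t prp1, size_t len, size_t mps)
{
	size_t first = sketch_prp_size(prp1, mps);

	if (len <= first)
		return 1;

	/* All entries after the first are page aligned and cover a full page */
	return 1 + (unsigned int)((len - first + mps - 1) / mps);
}
```

For example, with a 4096 B memory page size, an 8192 B transfer starting at PCI address 0x1200 covers 3584 B with the first entry and needs two more entries for the remaining 4608 B.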
Co-developed-by: Rick Wertenbroek Signed-off-by: Rick Wertenbroek Signed-off-by: Damien Le Moal --- MAINTAINERS | 7 + drivers/pci/endpoint/functions/Kconfig | 9 + drivers/pci/endpoint/functions/Makefile | 1 + drivers/pci/endpoint/functions/pci-epf-nvme.c | 2489 +++++++++++++++++ 4 files changed, 2506 insertions(+) create mode 100644 drivers/pci/endpoint/functions/pci-epf-nvme.c diff --git a/MAINTAINERS b/MAINTAINERS index c27f3190737f..c9b40621dbc1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16554,6 +16554,13 @@ S: Supported F: drivers/platform/x86/nvidia-wmi-ec-backlight.c F: include/linux/platform_data/x86/nvidia-wmi-ec-backlight.h +NVME ENDPOINT DRIVER +M: Damien Le Moal +L: linux-pci@vger.kernel.org +L: linux-nvme@lists.infradead.org +S: Supported +F: drivers/pci/endpoint/functions/pci-epf-nvme.c + NVM EXPRESS DRIVER M: Keith Busch M: Jens Axboe diff --git a/drivers/pci/endpoint/functions/Kconfig b/drivers/pci/endpoint/functions/Kconfig index 0c9cea0698d7..ea641d558fb8 100644 --- a/drivers/pci/endpoint/functions/Kconfig +++ b/drivers/pci/endpoint/functions/Kconfig @@ -47,3 +47,12 @@ config PCI_EPF_MHI devices such as SDX55. If in doubt, say "N" to disable Endpoint driver for MHI bus. + +config PCI_EPF_NVME + tristate "PCI Endpoint NVMe function driver" + depends on PCI_ENDPOINT && NVME_TARGET + help + Enable this configuration option to enable the NVMe PCI endpoint + function driver. + + If in doubt, say "N". 
diff --git a/drivers/pci/endpoint/functions/Makefile b/drivers/pci/endpoint/functions/Makefile index 696473fce50e..fe2d6cf8c502 100644 --- a/drivers/pci/endpoint/functions/Makefile +++ b/drivers/pci/endpoint/functions/Makefile @@ -7,3 +7,4 @@ obj-$(CONFIG_PCI_EPF_TEST) += pci-epf-test.o obj-$(CONFIG_PCI_EPF_NTB) += pci-epf-ntb.o obj-$(CONFIG_PCI_EPF_VNTB) += pci-epf-vntb.o obj-$(CONFIG_PCI_EPF_MHI) += pci-epf-mhi.o +obj-$(CONFIG_PCI_EPF_NVME) += pci-epf-nvme.o diff --git a/drivers/pci/endpoint/functions/pci-epf-nvme.c b/drivers/pci/endpoint/functions/pci-epf-nvme.c new file mode 100644 index 000000000000..16e01897e563 --- /dev/null +++ b/drivers/pci/endpoint/functions/pci-epf-nvme.c @@ -0,0 +1,2489 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * NVMe function driver for PCI Endpoint Framework + * + * Copyright (C) 2019 SiFive + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../../../nvme/host/nvme.h" +#include "../../../nvme/host/fabrics.h" +#include "../../../nvme/target/nvmet.h" + +/* + * Maximum number of queue pairs: the higher this number, the more mapping + * windows of the PCI endpoint controller will be used. To avoid exceeding the + * maximum number of mapping windows available (i.e. to avoid PCI space + * mapping failures), the maximum number of queue pairs should be limited to + * the number of mapping windows minus 2 (one window for IRQ issuing and one + * window for data transfers), divided by 2 (one mapping window for the SQ + * and one mapping window for the CQ). + */ +#define PCI_EPF_NVME_MAX_NR_QUEUES 16 + +/* + * Default maximum data transfer size: limit to 128 KB to avoid + * excessive local memory use for buffers. + */ +#define PCI_EPF_NVME_MDTS_KB 128 +#define PCI_EPF_NVME_MAX_MDTS_KB 1024 + +/* + * Queue flags. 
+ */ +#define PCI_EPF_NVME_QUEUE_IS_SQ (1U << 0) +#define PCI_EPF_NVME_QUEUE_LIVE (1U << 1) + +/* PRP manipulation macros */ +#define pci_epf_nvme_prp_addr(ctrl, prp) ((prp) & ~(ctrl)->mps_mask) +#define pci_epf_nvme_prp_ofst(ctrl, prp) ((prp) & (ctrl)->mps_mask) +#define pci_epf_nvme_prp_size(ctrl, prp) \ + ((size_t)((ctrl)->mps - pci_epf_nvme_prp_ofst(ctrl, prp))) + +static struct kmem_cache *epf_nvme_cmd_cache; + +struct pci_epf_nvme; + +/* + * Host PCI memory segment for admin and IO commands. + */ +struct pci_epf_nvme_segment { + phys_addr_t pci_addr; + size_t size; +}; + +/* + * Queue definition and mapping for the local PCI controller. + */ +struct pci_epf_nvme_queue { + struct pci_epf_nvme *epf_nvme; + + unsigned int qflags; + int ref; + + phys_addr_t pci_addr; + size_t pci_size; + struct pci_epc_map pci_map; + + u16 qid; + u16 cqid; + u16 size; + u16 depth; + u16 flags; + u16 vector; + u16 head; + u16 tail; + u16 phase; + u32 db; + + size_t qes; + + struct workqueue_struct *cmd_wq; + struct delayed_work work; + spinlock_t lock; + struct list_head list; +}; + +/* + * Local PCI controller exposed with the endpoint function. + */ +struct pci_epf_nvme_ctrl { + /* Fabrics host controller */ + struct nvme_ctrl *ctrl; + + /* Registers of the local PCI controller */ + void *reg; + u64 cap; + u32 vs; + u32 cc; + u32 csts; + u32 aqa; + u64 asq; + u64 acq; + + size_t adm_sqes; + size_t adm_cqes; + size_t io_sqes; + size_t io_cqes; + + size_t mps_shift; + size_t mps; + size_t mps_mask; + + size_t mdts; + + unsigned int nr_queues; + struct pci_epf_nvme_queue *sq; + struct pci_epf_nvme_queue *cq; +}; + +/* + * Descriptor of commands sent by the host. 
+ */ +struct pci_epf_nvme_cmd { + struct list_head link; + struct pci_epf_nvme *epf_nvme; + + int sqid; + int cqid; + unsigned int status; + struct nvme_ns *ns; + struct nvme_command cmd; + struct nvme_completion cqe; + + /* Internal buffer that we will transfer over PCI */ + size_t buffer_size; + void *buffer; + enum dma_data_direction dma_dir; + + /* + * Host PCI address segments: if nr_segs is 1, we use only "seg", + * otherwise, the segs array is allocated and used to store + * multiple segments. + */ + unsigned int nr_segs; + struct pci_epf_nvme_segment seg; + struct pci_epf_nvme_segment *segs; + + struct work_struct work; +}; + +/* + * EPF function private data representing our NVMe subsystem. + */ +struct pci_epf_nvme { + struct pci_epf *epf; + const struct pci_epc_features *epc_features; + + void *reg_bar; + size_t msix_table_offset; + + unsigned int irq_type; + unsigned int nr_vectors; + + unsigned int queue_count; + + struct pci_epf_nvme_ctrl ctrl; + bool ctrl_enabled; + + __le64 *prp_list_buf; + + struct dma_chan *dma_chan_tx; + struct dma_chan *dma_chan_rx; + struct mutex xfer_lock; + + struct mutex irq_lock; + + struct delayed_work reg_poll; + + /* Function configfs attributes */ + struct config_group group; + char *ctrl_opts_buf; + bool dma_enable; + size_t mdts_kb; +}; + +/* + * Read a 32-bits BAR register (equivalent to readl()). + */ +static inline u32 pci_epf_nvme_reg_read32(struct pci_epf_nvme_ctrl *ctrl, + u32 reg) +{ + __le32 *ctrl_reg = ctrl->reg + reg; + + return le32_to_cpu(READ_ONCE(*ctrl_reg)); +} + +/* + * Write a 32-bits BAR register (equivalent to writel()). + */ +static inline void pci_epf_nvme_reg_write32(struct pci_epf_nvme_ctrl *ctrl, + u32 reg, u32 val) +{ + __le32 *ctrl_reg = ctrl->reg + reg; + + WRITE_ONCE(*ctrl_reg, cpu_to_le32(val)); +} + +/* + * Read a 64-bits BAR register (equivalent to lo_hi_readq()). 
+ */ +static inline u64 pci_epf_nvme_reg_read64(struct pci_epf_nvme_ctrl *ctrl, + u32 reg) +{ + return (u64)pci_epf_nvme_reg_read32(ctrl, reg) | + ((u64)pci_epf_nvme_reg_read32(ctrl, reg + 4) << 32); +} + +/* + * Write a 64-bits BAR register (equivalent to lo_hi_writeq()). + */ +static inline void pci_epf_nvme_reg_write64(struct pci_epf_nvme_ctrl *ctrl, + u32 reg, u64 val) +{ + pci_epf_nvme_reg_write32(ctrl, reg, val & 0xFFFFFFFF); + pci_epf_nvme_reg_write32(ctrl, reg + 4, (val >> 32) & 0xFFFFFFFF); +} + +static inline bool pci_epf_nvme_ctrl_ready(struct pci_epf_nvme *epf_nvme) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + + if (!epf_nvme->ctrl_enabled) + return false; + return (ctrl->cc & NVME_CC_ENABLE) && (ctrl->csts & NVME_CSTS_RDY); +} + +struct pci_epf_nvme_dma_filter { + struct device *dev; + u32 dma_mask; +}; + +static bool pci_epf_nvme_dma_filter(struct dma_chan *chan, void *arg) +{ + struct pci_epf_nvme_dma_filter *filter = arg; + struct dma_slave_caps caps; + + memset(&caps, 0, sizeof(caps)); + dma_get_slave_caps(chan, &caps); + + return chan->device->dev == filter->dev && + (filter->dma_mask & caps.directions); +} + +static bool pci_epf_nvme_init_dma(struct pci_epf_nvme *epf_nvme) +{ + struct pci_epf *epf = epf_nvme->epf; + struct device *dev = &epf->dev; + struct pci_epf_nvme_dma_filter filter; + struct dma_chan *chan; + dma_cap_mask_t mask; + + mutex_init(&epf_nvme->xfer_lock); + mutex_init(&epf_nvme->irq_lock); + + dma_cap_zero(mask); + dma_cap_set(DMA_SLAVE, mask); + + filter.dev = epf->epc->dev.parent; + filter.dma_mask = BIT(DMA_DEV_TO_MEM); + + chan = dma_request_channel(mask, pci_epf_nvme_dma_filter, &filter); + if (!chan) + return false; + epf_nvme->dma_chan_rx = chan; + + filter.dma_mask = BIT(DMA_MEM_TO_DEV); + chan = dma_request_channel(mask, pci_epf_nvme_dma_filter, &filter); + if (!chan) { + dma_release_channel(epf_nvme->dma_chan_rx); + epf_nvme->dma_chan_rx = NULL; + return false; + } + epf_nvme->dma_chan_tx = chan; + + 
dev_info(dev, "DMA RX channel %s, maximum segment size %u B\n", + dma_chan_name(epf_nvme->dma_chan_rx), + dma_get_max_seg_size(epf_nvme->dma_chan_rx->device->dev)); + dev_info(dev, "DMA TX channel %s, maximum segment size %u B\n", + dma_chan_name(epf_nvme->dma_chan_tx), + dma_get_max_seg_size(epf_nvme->dma_chan_tx->device->dev)); + + return true; +} + +static void pci_epf_nvme_clean_dma(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + if (epf_nvme->dma_chan_tx) { + dma_release_channel(epf_nvme->dma_chan_tx); + epf_nvme->dma_chan_tx = NULL; + } + + if (epf_nvme->dma_chan_rx) { + dma_release_channel(epf_nvme->dma_chan_rx); + epf_nvme->dma_chan_rx = NULL; + } +} + +static void pci_epf_nvme_dma_callback(void *param) +{ + complete(param); +} + +static ssize_t pci_epf_nvme_dma_transfer(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_segment *seg, + enum dma_data_direction dir, void *buf) +{ + struct pci_epf *epf = epf_nvme->epf; + struct device *dma_dev = epf->epc->dev.parent; + struct dma_async_tx_descriptor *desc; + DECLARE_COMPLETION_ONSTACK(complete); + struct dma_slave_config sconf = {}; + struct device *dev = &epf->dev; + phys_addr_t dma_addr; + struct dma_chan *chan; + dma_cookie_t cookie; + int ret; + + switch (dir) { + case DMA_FROM_DEVICE: + chan = epf_nvme->dma_chan_rx; + sconf.direction = DMA_DEV_TO_MEM; + sconf.src_addr = seg->pci_addr; + break; + case DMA_TO_DEVICE: + chan = epf_nvme->dma_chan_tx; + sconf.direction = DMA_MEM_TO_DEV; + sconf.dst_addr = seg->pci_addr; + break; + default: + return -EINVAL; + } + + ret = dmaengine_slave_config(chan, &sconf); + if (ret) { + dev_err(dev, "Failed to configure DMA channel\n"); + return ret; + } + + dma_addr = dma_map_single(dma_dev, buf, seg->size, dir); + ret = dma_mapping_error(dma_dev, dma_addr); + if (ret) { + dev_err(dev, "Failed to map remote memory\n"); + return ret; + } + + desc = dmaengine_prep_slave_single(chan, dma_addr, + seg->size, sconf.direction, + DMA_CTRL_ACK 
| DMA_PREP_INTERRUPT); + if (!desc) { + dev_err(dev, "Failed to prepare DMA\n"); + ret = -EIO; + goto unmap; + } + + desc->callback = pci_epf_nvme_dma_callback; + desc->callback_param = &complete; + + cookie = dmaengine_submit(desc); + ret = dma_submit_error(cookie); + if (ret) { + dev_err(dev, "DMA submit failed %d\n", ret); + goto unmap; + } + + dma_async_issue_pending(chan); + ret = wait_for_completion_timeout(&complete, msecs_to_jiffies(1000)); + if (!ret) { + dev_err(dev, "DMA transfer timeout\n"); + dmaengine_terminate_sync(chan); + ret = -ETIMEDOUT; + goto unmap; + } + + ret = seg->size; + +unmap: + dma_unmap_single(dma_dev, dma_addr, seg->size, dir); + + return ret; +} + +static ssize_t pci_epf_nvme_mmio_transfer(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_segment *seg, + enum dma_data_direction dir, + void *buf) +{ + struct pci_epf *epf = epf_nvme->epf; + struct pci_epc_map map; + int ret; + + /* Map segment */ + ret = pci_epc_mem_map(epf->epc, epf->func_no, epf->vfunc_no, + seg->pci_addr, seg->size, &map); + if (ret) + return ret; + + switch (dir) { + case DMA_FROM_DEVICE: + memcpy_fromio(buf, map.virt_addr, map.pci_size); + ret = map.pci_size; + break; + case DMA_TO_DEVICE: + memcpy_toio(map.virt_addr, buf, map.pci_size); + ret = map.pci_size; + break; + default: + ret = -EINVAL; + break; + } + + pci_epc_mem_unmap(epf->epc, epf->func_no, epf->vfunc_no, &map); + + return ret; +} + +static int pci_epf_nvme_transfer(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_segment *seg, + enum dma_data_direction dir, void *buf) +{ + size_t size = seg->size; + int ret; + + while (size) { + /* + * Note: mmio transfers do not need serialization, but + * this is an easy way to prevent using too many mapped + * memory areas, which would lead to errors.
+ */ + mutex_lock(&epf_nvme->xfer_lock); + if (!epf_nvme->dma_enable) + ret = pci_epf_nvme_mmio_transfer(epf_nvme, seg, + dir, buf); + else + ret = pci_epf_nvme_dma_transfer(epf_nvme, seg, + dir, buf); + mutex_unlock(&epf_nvme->xfer_lock); + + if (ret < 0) + return ret; + + size -= ret; + buf += ret; + } + + return 0; +} + +static const char *pci_epf_nvme_cmd_name(struct pci_epf_nvme_cmd *epcmd) +{ + u8 opcode = epcmd->cmd.common.opcode; + + if (epcmd->sqid) + return nvme_get_opcode_str(opcode); + return nvme_get_admin_opcode_str(opcode); +} + +static inline struct pci_epf_nvme_cmd * +pci_epf_nvme_alloc_cmd(struct pci_epf_nvme *nvme) +{ + return kmem_cache_alloc(epf_nvme_cmd_cache, GFP_KERNEL); +} + +static void pci_epf_nvme_exec_cmd_work(struct work_struct *work); + +static void pci_epf_nvme_init_cmd(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd, + int sqid, int cqid) +{ + memset(epcmd, 0, sizeof(*epcmd)); + INIT_LIST_HEAD(&epcmd->link); + INIT_WORK(&epcmd->work, pci_epf_nvme_exec_cmd_work); + epcmd->epf_nvme = epf_nvme; + epcmd->sqid = sqid; + epcmd->cqid = cqid; + epcmd->status = NVME_SC_SUCCESS; + epcmd->dma_dir = DMA_NONE; +} + +static int pci_epf_nvme_alloc_cmd_buffer(struct pci_epf_nvme_cmd *epcmd) +{ + void *buffer; + + buffer = kmalloc(epcmd->buffer_size, GFP_KERNEL); + if (!buffer) { + epcmd->buffer_size = 0; + return -ENOMEM; + } + + if (!epcmd->sqid) + memset(buffer, 0, epcmd->buffer_size); + epcmd->buffer = buffer; + + return 0; +} + +static int pci_epf_nvme_alloc_cmd_segs(struct pci_epf_nvme_cmd *epcmd, + int nr_segs) +{ + struct pci_epf_nvme_segment *segs; + + /* Single segment case: use the command embedded structure */ + if (nr_segs == 1) { + epcmd->segs = &epcmd->seg; + epcmd->nr_segs = 1; + return 0; + } + + /* More than one segment needed: allocate an array */ + segs = kcalloc(nr_segs, sizeof(struct pci_epf_nvme_segment), GFP_KERNEL); + if (!segs) + return -ENOMEM; + + epcmd->nr_segs = nr_segs; + epcmd->segs = segs; + + return 
0; +} + +static void pci_epf_nvme_free_cmd(struct pci_epf_nvme_cmd *epcmd) +{ + if (epcmd->ns) + nvme_put_ns(epcmd->ns); + + kfree(epcmd->buffer); + + if (epcmd->segs && epcmd->segs != &epcmd->seg) + kfree(epcmd->segs); + + kmem_cache_free(epf_nvme_cmd_cache, epcmd); +} + +static void pci_epf_nvme_complete_cmd(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf_nvme_queue *cq; + unsigned long flags; + + if (!pci_epf_nvme_ctrl_ready(epf_nvme)) { + pci_epf_nvme_free_cmd(epcmd); + return; + } + + /* + * Add the command to the list of completed commands for the + * target cq and schedule the list processing. + */ + cq = &epf_nvme->ctrl.cq[epcmd->cqid]; + spin_lock_irqsave(&cq->lock, flags); + list_add_tail(&epcmd->link, &cq->list); + queue_delayed_work(cq->cmd_wq, &cq->work, 0); + spin_unlock_irqrestore(&cq->lock, flags); +} + +static int pci_epf_nvme_transfer_cmd_data(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf_nvme_segment *seg; + void *buf = epcmd->buffer; + size_t size = 0; + int i, ret; + + /* Transfer each segment of the command */ + for (i = 0; i < epcmd->nr_segs; i++) { + seg = &epcmd->segs[i]; + + if (size >= epcmd->buffer_size) { + dev_err(&epf_nvme->epf->dev, "Invalid transfer size\n"); + goto xfer_err; + } + + ret = pci_epf_nvme_transfer(epf_nvme, seg, epcmd->dma_dir, buf); + if (ret) + goto xfer_err; + + buf += seg->size; + size += seg->size; + } + + return 0; + +xfer_err: + epcmd->status = NVME_SC_DATA_XFER_ERROR | NVME_STATUS_DNR; + return -EIO; +} + +static void pci_epf_nvme_raise_irq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *cq) +{ + struct pci_epf *epf = epf_nvme->epf; + int ret; + + if (!(cq->qflags & NVME_CQ_IRQ_ENABLED)) + return; + + mutex_lock(&epf_nvme->irq_lock); + + switch (epf_nvme->irq_type) { + case PCI_IRQ_MSIX: + case PCI_IRQ_MSI: + ret = pci_epc_raise_irq(epf->epc, epf->func_no, epf->vfunc_no, + 
epf_nvme->irq_type, cq->vector + 1); + if (!ret) + break; + /* + * If we got an error, it is likely because the host is using + * legacy IRQs (e.g. BIOS, grub). + */ + fallthrough; + case PCI_IRQ_INTX: + ret = pci_epc_raise_irq(epf->epc, epf->func_no, epf->vfunc_no, + PCI_IRQ_INTX, 0); + break; + default: + WARN_ON_ONCE(1); + ret = -EINVAL; + break; + } + + if (ret) + dev_err(&epf->dev, "Raise IRQ failed %d\n", ret); + + mutex_unlock(&epf_nvme->irq_lock); +} + +/* + * Transfer a prp list from the host and return the number of prps. + */ +static int pci_epf_nvme_get_prp_list(struct pci_epf_nvme *epf_nvme, u64 prp, + size_t xfer_len) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + size_t nr_prps = (xfer_len + ctrl->mps_mask) >> ctrl->mps_shift; + struct pci_epf_nvme_segment seg; + int ret; + + /* + * Compute the number of PRPs required for the number of bytes to + * transfer (xfer_len). If this number overflows the memory page size + * with the PRP list pointer specified, only return the space available + * in the memory page, the last PRP in there will be a PRP list pointer + * to the remaining PRPs. + */ + seg.pci_addr = prp; + seg.size = min(pci_epf_nvme_prp_size(ctrl, prp), nr_prps << 3); + ret = pci_epf_nvme_transfer(epf_nvme, &seg, DMA_FROM_DEVICE, + epf_nvme->prp_list_buf); + if (ret) + return ret; + + return seg.size >> 3; +} + +static int pci_epf_nvme_cmd_parse_prp_list(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct nvme_command *cmd = &epcmd->cmd; + __le64 *prps = epf_nvme->prp_list_buf; + struct pci_epf_nvme_segment *seg; + size_t size = 0, ofst, prp_size, xfer_len; + size_t transfer_len = epcmd->buffer_size; + int nr_segs, nr_prps = 0; + phys_addr_t pci_addr; + int i = 0, ret; + u64 prp; + + /* + * Allocate segments for the command: this considers the worst case + * scenario where all prps are discontiguous, so get as many segments + * as we can have prps. 
In practice, most of the time, we will have + * far fewer segments than prps. + */ + prp = le64_to_cpu(cmd->common.dptr.prp1); + if (!prp) + goto invalid_field; + + ofst = pci_epf_nvme_prp_ofst(ctrl, prp); + nr_segs = (transfer_len + ofst + NVME_CTRL_PAGE_SIZE - 1) + >> NVME_CTRL_PAGE_SHIFT; + + ret = pci_epf_nvme_alloc_cmd_segs(epcmd, nr_segs); + if (ret) + goto internal; + + /* Set the first segment using prp1 */ + seg = &epcmd->segs[0]; + seg->pci_addr = prp; + seg->size = pci_epf_nvme_prp_size(ctrl, prp); + + size = seg->size; + pci_addr = prp + size; + nr_segs = 1; + + /* + * Now build the PCI address segments using the prp lists, starting + * from prp2. + */ + prp = le64_to_cpu(cmd->common.dptr.prp2); + if (!prp) + goto invalid_field; + + while (size < transfer_len) { + xfer_len = transfer_len - size; + + if (!nr_prps) { + /* Get the prp list */ + nr_prps = pci_epf_nvme_get_prp_list(epf_nvme, prp, + xfer_len); + if (nr_prps < 0) + goto internal; + + i = 0; + ofst = 0; + } + + /* Current entry */ + prp = le64_to_cpu(prps[i]); + if (!prp) + goto invalid_field; + + /* Did we reach the last prp entry of the list?
*/ + if (xfer_len > ctrl->mps && i == nr_prps - 1) { + /* We need more PRPs: prp is a list pointer */ + nr_prps = 0; + continue; + } + + /* Only the first prp is allowed to have an offset */ + if (pci_epf_nvme_prp_ofst(ctrl, prp)) + goto invalid_offset; + + if (prp != pci_addr) { + /* Discontiguous prp: new segment */ + nr_segs++; + if (WARN_ON_ONCE(nr_segs > epcmd->nr_segs)) + goto internal; + + seg++; + seg->pci_addr = prp; + seg->size = 0; + pci_addr = prp; + } + + prp_size = min_t(size_t, ctrl->mps, xfer_len); + seg->size += prp_size; + pci_addr += prp_size; + size += prp_size; + + i++; + } + + epcmd->nr_segs = nr_segs; + ret = 0; + + if (size != transfer_len) { + dev_err(&epf_nvme->epf->dev, + "PRPs transfer length mismatch %zu / %zu\n", + size, transfer_len); + goto internal; + } + + return 0; + +internal: + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + return -EINVAL; + +invalid_offset: + epcmd->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; + return -EINVAL; + +invalid_field: + epcmd->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; + return -EINVAL; +} + +static int pci_epf_nvme_cmd_parse_prp_simple(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct nvme_command *cmd = &epcmd->cmd; + size_t transfer_len = epcmd->buffer_size; + int ret, nr_segs = 1; + u64 prp1, prp2 = 0; + size_t prp1_size; + + /* prp1 */ + prp1 = le64_to_cpu(cmd->common.dptr.prp1); + prp1_size = pci_epf_nvme_prp_size(ctrl, prp1); + + /* For commands crossing a page boundary, we should have a valid prp2 */ + if (transfer_len > prp1_size) { + prp2 = le64_to_cpu(cmd->common.dptr.prp2); + if (!prp2) + goto invalid_field; + if (pci_epf_nvme_prp_ofst(ctrl, prp2)) + goto invalid_offset; + if (prp2 != prp1 + prp1_size) + nr_segs = 2; + } + + /* Create segments using the prps */ + ret = pci_epf_nvme_alloc_cmd_segs(epcmd, nr_segs); + if (ret) + goto internal; + + epcmd->segs[0].pci_addr = prp1; + if 
(nr_segs == 1) { + epcmd->segs[0].size = transfer_len; + } else { + epcmd->segs[0].size = prp1_size; + epcmd->segs[1].pci_addr = prp2; + epcmd->segs[1].size = transfer_len - prp1_size; + } + + return 0; + +invalid_offset: + epcmd->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; + return -EINVAL; + +invalid_field: + epcmd->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; + return -EINVAL; + +internal: + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + return ret; +} + +static int pci_epf_nvme_cmd_parse_dptr(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct nvme_command *cmd = &epcmd->cmd; + u64 prp1 = le64_to_cpu(cmd->common.dptr.prp1); + size_t ofst; + int ret; + + if (epcmd->buffer_size > ctrl->mdts) + goto invalid_field; + + /* We do not support SGL for now */ + if (epcmd->cmd.common.flags & NVME_CMD_SGL_ALL) + goto invalid_field; + + /* Get PCI address segments for the command using its prps */ + ofst = pci_epf_nvme_prp_ofst(ctrl, prp1); + if (ofst & 0x3) + goto invalid_offset; + + if (epcmd->buffer_size + ofst <= NVME_CTRL_PAGE_SIZE * 2) + ret = pci_epf_nvme_cmd_parse_prp_simple(epf_nvme, epcmd); + else + ret = pci_epf_nvme_cmd_parse_prp_list(epf_nvme, epcmd); + if (ret) + return ret; + + /* Get an internal buffer for the command */ + ret = pci_epf_nvme_alloc_cmd_buffer(epcmd); + if (ret) { + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + return ret; + } + + return 0; + +invalid_field: + epcmd->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; + return -EINVAL; + +invalid_offset: + epcmd->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; + return -EINVAL; +} + +static void pci_epf_nvme_exec_cmd(struct pci_epf_nvme_cmd *epcmd, + void (*post_exec_hook)(struct pci_epf_nvme_cmd *)) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct nvme_command *cmd = &epcmd->cmd; + struct request_queue *q; + int ret; + + if (epcmd->ns) + q = 
epcmd->ns->queue; + else + q = epf_nvme->ctrl.ctrl->admin_q; + + if (epcmd->buffer_size) { + /* Setup the command buffer */ + ret = pci_epf_nvme_cmd_parse_dptr(epcmd); + if (ret) + return; + + /* Get data from the host if needed */ + if (epcmd->dma_dir == DMA_FROM_DEVICE) { + ret = pci_epf_nvme_transfer_cmd_data(epcmd); + if (ret) + return; + } + } + + /* Synchronously execute the command */ + ret = __nvme_submit_sync_cmd(q, cmd, &epcmd->cqe.result, + epcmd->buffer, epcmd->buffer_size, + NVME_QID_ANY, 0); + if (ret < 0) + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + else if (ret > 0) + epcmd->status = ret; + + if (epcmd->status != NVME_SC_SUCCESS) { + dev_err(&epf_nvme->epf->dev, + "QID %d: submit command %s (0x%x) failed, status 0x%0x\n", + epcmd->sqid, pci_epf_nvme_cmd_name(epcmd), + epcmd->cmd.common.opcode, epcmd->status); + return; + } + + if (post_exec_hook) + post_exec_hook(epcmd); + + if (epcmd->buffer_size && epcmd->dma_dir == DMA_TO_DEVICE) + pci_epf_nvme_transfer_cmd_data(epcmd); +} + +static void pci_epf_nvme_exec_cmd_work(struct work_struct *work) +{ + struct pci_epf_nvme_cmd *epcmd = + container_of(work, struct pci_epf_nvme_cmd, work); + + pci_epf_nvme_exec_cmd(epcmd, NULL); + + pci_epf_nvme_complete_cmd(epcmd); +} + +static bool pci_epf_nvme_queue_response(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf *epf = epf_nvme->epf; + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf_nvme_queue *sq = &ctrl->sq[epcmd->sqid]; + struct pci_epf_nvme_queue *cq = &ctrl->cq[epcmd->cqid]; + struct nvme_completion *cqe = &epcmd->cqe; + + /* + * Do not try to complete commands if the controller is not ready + * anymore, e.g. after the host cleared CC.EN. 
+ */ + if (!pci_epf_nvme_ctrl_ready(epf_nvme) || + !(cq->qflags & PCI_EPF_NVME_QUEUE_LIVE)) + goto free_cmd; + + /* Check completion queue full state */ + cq->head = pci_epf_nvme_reg_read32(ctrl, cq->db); + if (cq->head == cq->tail + 1) + return false; + + /* Setup the completion entry */ + cqe->sq_id = cpu_to_le16(epcmd->sqid); + cqe->sq_head = cpu_to_le16(sq->head); + cqe->command_id = epcmd->cmd.common.command_id; + cqe->status = cpu_to_le16((epcmd->status << 1) | cq->phase); + + /* Post the completion entry */ + dev_dbg(&epf->dev, + "cq[%d]: %s status 0x%x, head %d, tail %d, phase %d\n", + epcmd->cqid, pci_epf_nvme_cmd_name(epcmd), + epcmd->status, cq->head, cq->tail, cq->phase); + + memcpy_toio(cq->pci_map.virt_addr + cq->tail * cq->qes, cqe, + sizeof(struct nvme_completion)); + + /* Advance the tail */ + cq->tail++; + if (cq->tail >= cq->depth) { + cq->tail = 0; + cq->phase ^= 1; + } + +free_cmd: + pci_epf_nvme_free_cmd(epcmd); + + return true; +} + +static int pci_epf_nvme_map_queue(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *q) +{ + struct pci_epf *epf = epf_nvme->epf; + int ret; + + ret = pci_epc_mem_map(epf->epc, epf->func_no, epf->vfunc_no, + q->pci_addr, q->pci_size, &q->pci_map); + if (ret) { + dev_err(&epf->dev, "Map %cQ %d failed %d\n", + q->qflags & PCI_EPF_NVME_QUEUE_IS_SQ ? 'S' : 'C', + q->qid, ret); + return ret; + } + + if (q->pci_map.pci_size < q->pci_size) { + dev_err(&epf->dev, "Partial %cQ %d mapping\n", + q->qflags & PCI_EPF_NVME_QUEUE_IS_SQ ? 
'S' : 'C', + q->qid); + pci_epc_mem_unmap(epf->epc, epf->func_no, epf->vfunc_no, + &q->pci_map); + return -ENOMEM; + } + + return 0; +} + +static inline void pci_epf_nvme_unmap_queue(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *q) +{ + struct pci_epf *epf = epf_nvme->epf; + + pci_epc_mem_unmap(epf->epc, epf->func_no, epf->vfunc_no, + &q->pci_map); +} + +static void pci_epf_nvme_delete_queue(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *q) +{ + struct pci_epf_nvme_cmd *epcmd; + + q->qflags &= ~PCI_EPF_NVME_QUEUE_LIVE; + + flush_workqueue(q->cmd_wq); + destroy_workqueue(q->cmd_wq); + q->cmd_wq = NULL; + + flush_delayed_work(&q->work); + cancel_delayed_work_sync(&q->work); + + while (!list_empty(&q->list)) { + epcmd = list_first_entry(&q->list, + struct pci_epf_nvme_cmd, link); + list_del_init(&epcmd->link); + pci_epf_nvme_free_cmd(epcmd); + } +} + +static void pci_epf_nvme_cq_work(struct work_struct *work); + +static int pci_epf_nvme_create_cq(struct pci_epf_nvme *epf_nvme, int qid, + int flags, int size, int vector, + phys_addr_t pci_addr) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf_nvme_queue *cq = &ctrl->cq[qid]; + struct pci_epf *epf = epf_nvme->epf; + + /* + * Increment the queue reference count: if the queue is already being + * used, we have nothing to do. 
+ */ + cq->ref++; + if (cq->ref > 1) + return 0; + + /* Setup the completion queue */ + cq->pci_addr = pci_addr; + cq->qid = qid; + cq->cqid = qid; + cq->size = size; + cq->flags = flags; + cq->depth = size + 1; + cq->vector = vector; + cq->head = 0; + cq->tail = 0; + cq->phase = 1; + cq->db = NVME_REG_DBS + (((qid * 2) + 1) * sizeof(u32)); + pci_epf_nvme_reg_write32(ctrl, cq->db, 0); + INIT_DELAYED_WORK(&cq->work, pci_epf_nvme_cq_work); + if (!qid) + cq->qes = ctrl->adm_cqes; + else + cq->qes = ctrl->io_cqes; + cq->pci_size = cq->qes * cq->depth; + + cq->cmd_wq = alloc_workqueue("cq%d_wq", WQ_HIGHPRI, 1, qid); + if (!cq->cmd_wq) { + dev_err(&epf->dev, "Create CQ %d cqe wq failed\n", qid); + memset(cq, 0, sizeof(*cq)); + return -ENOMEM; + } + + dev_dbg(&epf->dev, + "CQ %d: %d entries of %zu B, vector IRQ %d\n", + qid, cq->size, cq->qes, (int)cq->vector + 1); + + cq->qflags = PCI_EPF_NVME_QUEUE_LIVE; + + return 0; +} + +static void pci_epf_nvme_delete_cq(struct pci_epf_nvme *epf_nvme, int qid) +{ + struct pci_epf_nvme_queue *cq = &epf_nvme->ctrl.cq[qid]; + + if (cq->ref < 1) + return; + + cq->ref--; + if (cq->ref) + return; + + pci_epf_nvme_delete_queue(epf_nvme, cq); +} + +static void pci_epf_nvme_sq_work(struct work_struct *work); + +static int pci_epf_nvme_create_sq(struct pci_epf_nvme *epf_nvme, int qid, + int cqid, int flags, int size, + phys_addr_t pci_addr) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf_nvme_queue *sq = &ctrl->sq[qid]; + struct pci_epf_nvme_queue *cq = &ctrl->cq[cqid]; + struct pci_epf *epf = epf_nvme->epf; + + /* Setup the submission queue */ + sq->qflags = PCI_EPF_NVME_QUEUE_IS_SQ; + sq->pci_addr = pci_addr; + sq->ref = 1; + sq->qid = qid; + sq->cqid = cqid; + sq->size = size; + sq->flags = flags; + sq->depth = size + 1; + sq->head = 0; + sq->tail = 0; + sq->phase = 0; + sq->db = NVME_REG_DBS + (qid * 2 * sizeof(u32)); + pci_epf_nvme_reg_write32(ctrl, sq->db, 0); + INIT_DELAYED_WORK(&sq->work, pci_epf_nvme_sq_work); 
+ if (!qid) + sq->qes = ctrl->adm_sqes; + else + sq->qes = ctrl->io_sqes; + sq->pci_size = sq->qes * sq->depth; + + sq->cmd_wq = alloc_workqueue("sq%d_wq", WQ_HIGHPRI | WQ_UNBOUND, + min_t(int, sq->depth, WQ_MAX_ACTIVE), qid); + if (!sq->cmd_wq) { + dev_err(&epf->dev, "Create SQ %d cmd wq failed\n", qid); + memset(sq, 0, sizeof(*sq)); + return -ENOMEM; + } + + /* Get a reference on the completion queue */ + cq->ref++; + + dev_dbg(&epf->dev, + "SQ %d: %d queue entries of %zu B, CQ %d\n", + qid, size, sq->qes, cqid); + + sq->qflags |= PCI_EPF_NVME_QUEUE_LIVE; + + return 0; +} + +static void pci_epf_nvme_delete_sq(struct pci_epf_nvme *epf_nvme, int qid) +{ + struct pci_epf_nvme_queue *sq = &epf_nvme->ctrl.sq[qid]; + + if (!sq->ref) + return; + + sq->ref--; + if (WARN_ON_ONCE(sq->ref != 0)) + return; + + pci_epf_nvme_delete_queue(epf_nvme, sq); + + if (epf_nvme->ctrl.cq[sq->cqid].ref) + epf_nvme->ctrl.cq[sq->cqid].ref--; +} + +static void pci_epf_nvme_disable_ctrl(struct pci_epf_nvme *epf_nvme) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf *epf = epf_nvme->epf; + int qid; + + if (!epf_nvme->ctrl_enabled) + return; + + dev_info(&epf->dev, "Disabling controller\n"); + + /* + * Delete the submission queues first to release all references + * to the completion queues. This also stops polling for submissions + * and drains any pending command from the queue. 
+ */ + for (qid = 1; qid < ctrl->nr_queues; qid++) + pci_epf_nvme_delete_sq(epf_nvme, qid); + + for (qid = 1; qid < ctrl->nr_queues; qid++) + pci_epf_nvme_delete_cq(epf_nvme, qid); + + /* Unmap the admin queue last */ + pci_epf_nvme_delete_sq(epf_nvme, 0); + pci_epf_nvme_delete_cq(epf_nvme, 0); + + /* Tell the host we are done */ + ctrl->csts &= ~NVME_CSTS_RDY; + if (ctrl->cc & NVME_CC_SHN_NORMAL) { + ctrl->csts |= NVME_CSTS_SHST_CMPLT; + ctrl->cc &= ~NVME_CC_SHN_NORMAL; + } + ctrl->cc &= ~NVME_CC_ENABLE; + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CSTS, ctrl->csts); + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CC, ctrl->cc); + + epf_nvme->ctrl_enabled = false; +} + +static void pci_epf_nvme_delete_ctrl(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + + dev_info(&epf->dev, "Deleting controller\n"); + + if (ctrl->ctrl) { + nvme_put_ctrl(ctrl->ctrl); + ctrl->ctrl = NULL; + + ctrl->cc &= ~NVME_CC_SHN_NORMAL; + ctrl->csts |= NVME_CSTS_SHST_CMPLT; + } + + pci_epf_nvme_disable_ctrl(epf_nvme); + + ctrl->nr_queues = 0; + kfree(ctrl->cq); + ctrl->cq = NULL; + kfree(ctrl->sq); + ctrl->sq = NULL; +} + +static struct pci_epf_nvme_queue * +pci_epf_nvme_alloc_queues(struct pci_epf_nvme *epf_nvme, int nr_queues) +{ + struct pci_epf_nvme_queue *q; + int i; + + q = kcalloc(nr_queues, sizeof(struct pci_epf_nvme_queue), GFP_KERNEL); + if (!q) + return NULL; + + for (i = 0; i < nr_queues; i++) { + q[i].epf_nvme = epf_nvme; + spin_lock_init(&q[i].lock); + INIT_LIST_HEAD(&q[i].list); + } + + return q; +} + +static int pci_epf_nvme_create_ctrl(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + const struct pci_epc_features *features = epf_nvme->epc_features; + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct nvme_ctrl *fctrl; + int ret; + + /* We must have nvme fabrics options. 
*/ + if (!epf_nvme->ctrl_opts_buf) { + dev_err(&epf->dev, "No nvme fabrics options specified\n"); + return -EINVAL; + } + + /* Create the fabrics controller */ + fctrl = nvmf_create_ctrl(&epf->dev, epf_nvme->ctrl_opts_buf); + if (IS_ERR(fctrl)) { + dev_err(&epf->dev, "Create nvme fabrics controller failed\n"); + return PTR_ERR(fctrl); + } + + /* We only support IO controllers */ + if (fctrl->cntrltype != NVME_CTRL_IO) { + dev_err(&epf->dev, "Unsupported controller type\n"); + ret = -EINVAL; + goto out_delete_ctrl; + } + + dev_info(&epf->dev, "NVMe fabrics controller created, %u I/O queues\n", + fctrl->queue_count - 1); + + epf_nvme->queue_count = + min(fctrl->queue_count, PCI_EPF_NVME_MAX_NR_QUEUES); + if (features->msix_capable && epf->msix_interrupts) { + dev_info(&epf->dev, + "NVMe PCI controller supports MSI-X, %u vectors\n", + epf->msix_interrupts); + epf_nvme->queue_count = + min(epf_nvme->queue_count, epf->msix_interrupts); + } else if (features->msi_capable && epf->msi_interrupts) { + dev_info(&epf->dev, + "NVMe PCI controller supports MSI, %u vectors\n", + epf->msi_interrupts); + epf_nvme->queue_count = + min(epf_nvme->queue_count, epf->msi_interrupts); + } + + if (epf_nvme->queue_count < 2) { + dev_info(&epf->dev, "Invalid number of queues %u\n", + epf_nvme->queue_count); + ret = -EINVAL; + goto out_delete_ctrl; + } + + if (epf_nvme->queue_count != fctrl->queue_count) + dev_info(&epf->dev, "Limiting number of queues to %u\n", + epf_nvme->queue_count); + + dev_info(&epf->dev, "NVMe PCI controller: %u I/O queues\n", + epf_nvme->queue_count - 1); + + /* Allocate queues */ + ctrl->nr_queues = epf_nvme->queue_count; + ctrl->sq = pci_epf_nvme_alloc_queues(epf_nvme, ctrl->nr_queues); + if (!ctrl->sq) { + ret = -ENOMEM; + goto out_delete_ctrl; + } + + ctrl->cq = pci_epf_nvme_alloc_queues(epf_nvme, ctrl->nr_queues); + if (!ctrl->cq) { + ret = -ENOMEM; + goto out_delete_ctrl; + } + + epf_nvme->ctrl.ctrl = fctrl; + + return 0; + +out_delete_ctrl: + 
pci_epf_nvme_delete_ctrl(epf); + + return ret; +} + +static void pci_epf_nvme_init_ctrl_regs(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + + ctrl->reg = epf_nvme->reg_bar; + + /* Copy the fabrics controller capabilities as a base */ + ctrl->cap = ctrl->ctrl->cap; + + /* Contiguous Queues Required (CQR) */ + ctrl->cap |= 0x1ULL << 16; + + /* Set Doorbell stride to 4B (DSTRB) */ + ctrl->cap &= ~GENMASK(35, 32); + + /* Clear NVM Subsystem Reset Supported (NSSRS) */ + ctrl->cap &= ~(0x1ULL << 36); + + /* Clear Boot Partition Support (BPS) */ + ctrl->cap &= ~(0x1ULL << 45); + + /* Memory Page Size minimum (MPSMIN) = 4K */ + ctrl->cap |= (NVME_CTRL_PAGE_SHIFT - 12) << NVME_CC_MPS_SHIFT; + + /* Memory Page Size maximum (MPSMAX) = 4K */ + ctrl->cap |= (NVME_CTRL_PAGE_SHIFT - 12) << NVME_CC_MPS_SHIFT; + + /* Clear Persistent Memory Region Supported (PMRS) */ + ctrl->cap &= ~(0x1ULL << 56); + + /* Clear Controller Memory Buffer Supported (CMBS) */ + ctrl->cap &= ~(0x1ULL << 57); + + /* NVMe version supported */ + ctrl->vs = ctrl->ctrl->vs; + + /* Controller configuration */ + ctrl->cc = ctrl->ctrl->ctrl_config & (~NVME_CC_ENABLE); + + /* Controller Status (not ready) */ + ctrl->csts = 0; + + pci_epf_nvme_reg_write64(ctrl, NVME_REG_CAP, ctrl->cap); + pci_epf_nvme_reg_write32(ctrl, NVME_REG_VS, ctrl->vs); + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CSTS, ctrl->csts); + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CC, ctrl->cc); +} + +static void pci_epf_nvme_enable_ctrl(struct pci_epf_nvme *epf_nvme) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf *epf = epf_nvme->epf; + int ret; + + dev_info(&epf->dev, "Enabling controller\n"); + + ctrl->mdts = epf_nvme->mdts_kb * SZ_1K; + + ctrl->mps_shift = ((ctrl->cc >> NVME_CC_MPS_SHIFT) & 0xf) + 12; + ctrl->mps = 1UL << ctrl->mps_shift; + ctrl->mps_mask = ctrl->mps - 1; + + ctrl->adm_sqes = 1UL << NVME_ADM_SQES; + ctrl->adm_cqes = 
sizeof(struct nvme_completion); + ctrl->io_sqes = 1UL << ((ctrl->cc >> NVME_CC_IOSQES_SHIFT) & 0xf); + ctrl->io_cqes = 1UL << ((ctrl->cc >> NVME_CC_IOCQES_SHIFT) & 0xf); + + if (ctrl->io_sqes < sizeof(struct nvme_command)) { + dev_err(&epf->dev, "Unsupported IO sqes %zu (need %zu)\n", + ctrl->io_sqes, sizeof(struct nvme_command)); + return; + } + + if (ctrl->io_cqes < sizeof(struct nvme_completion)) { + dev_err(&epf->dev, "Unsupported IO cqes %zu (need %zu)\n", + ctrl->io_cqes, sizeof(struct nvme_completion)); + return; + } + + ctrl->aqa = pci_epf_nvme_reg_read32(ctrl, NVME_REG_AQA); + ctrl->asq = pci_epf_nvme_reg_read64(ctrl, NVME_REG_ASQ); + ctrl->acq = pci_epf_nvme_reg_read64(ctrl, NVME_REG_ACQ); + + /* + * Create the PCI controller admin submission and completion queues. + */ + ret = pci_epf_nvme_create_cq(epf_nvme, 0, + NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED, + (ctrl->aqa & 0x0fff0000) >> 16, 0, + ctrl->acq & GENMASK(63, 12)); + if (ret) + return; + + ret = pci_epf_nvme_create_sq(epf_nvme, 0, 0, NVME_QUEUE_PHYS_CONTIG, + ctrl->aqa & 0x0fff, + ctrl->asq & GENMASK(63, 12)); + if (ret) { + pci_epf_nvme_delete_cq(epf_nvme, 0); + return; + } + + nvme_start_ctrl(ctrl->ctrl); + + /* Tell the host we are now ready */ + ctrl->csts |= NVME_CSTS_RDY; + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CSTS, ctrl->csts); + + /* Start polling the admin submission queue */ + schedule_delayed_work(&ctrl->sq[0].work, msecs_to_jiffies(5)); + + epf_nvme->ctrl_enabled = true; +} + +static void pci_epf_nvme_process_create_cq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + int mqes = NVME_CAP_MQES(epf_nvme->ctrl.cap); + u16 cqid, cq_flags, qsize, vector; + int ret; + + cqid = le16_to_cpu(cmd->create_cq.cqid); + if (cqid >= epf_nvme->ctrl.nr_queues || epf_nvme->ctrl.cq[cqid].ref) { + epcmd->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; + return; + } + + cq_flags = le16_to_cpu(cmd->create_cq.cq_flags); + if (!(cq_flags &
NVME_QUEUE_PHYS_CONTIG)) { + epcmd->status = NVME_SC_INVALID_QUEUE | NVME_STATUS_DNR; + return; + } + + qsize = le16_to_cpu(cmd->create_cq.qsize); + if (!qsize || qsize > mqes) { + if (qsize > mqes) + dev_warn(&epf_nvme->epf->dev, + "Create CQ %d, qsize %d > mqes %d: buggy driver?\n", + cqid, (int)qsize, mqes); + epcmd->status = NVME_SC_QUEUE_SIZE | NVME_STATUS_DNR; + return; + } + + vector = le16_to_cpu(cmd->create_cq.irq_vector); + if (vector >= epf_nvme->nr_vectors) { + epcmd->status = NVME_SC_INVALID_VECTOR | NVME_STATUS_DNR; + return; + } + + ret = pci_epf_nvme_create_cq(epf_nvme, cqid, cq_flags, qsize, vector, + le64_to_cpu(cmd->create_cq.prp1)); + if (ret) + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; +} + +static void pci_epf_nvme_process_delete_cq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + u16 cqid; + + cqid = le16_to_cpu(cmd->delete_queue.qid); + if (!cqid || + cqid >= epf_nvme->ctrl.nr_queues || + !epf_nvme->ctrl.cq[cqid].ref) { + epcmd->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; + return; + } + + pci_epf_nvme_delete_cq(epf_nvme, cqid); +} + +static void pci_epf_nvme_process_create_sq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + int mqes = NVME_CAP_MQES(epf_nvme->ctrl.cap); + u16 sqid, cqid, sq_flags, qsize; + int ret; + + sqid = le16_to_cpu(cmd->create_sq.sqid); + if (!sqid || sqid >= epf_nvme->ctrl.nr_queues || + epf_nvme->ctrl.sq[sqid].ref) { + epcmd->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; + return; + } + + cqid = le16_to_cpu(cmd->create_sq.cqid); + if (!cqid || !epf_nvme->ctrl.cq[cqid].ref) { + epcmd->status = NVME_SC_CQ_INVALID | NVME_STATUS_DNR; + return; + } + + sq_flags = le16_to_cpu(cmd->create_sq.sq_flags); + if (!(sq_flags & NVME_QUEUE_PHYS_CONTIG)) { + epcmd->status = NVME_SC_INVALID_QUEUE | NVME_STATUS_DNR; + return; + } + + qsize = 
le16_to_cpu(cmd->create_sq.qsize); + if (!qsize || qsize > mqes) { + if (qsize > mqes) + dev_warn(&epf_nvme->epf->dev, + "Create SQ %d, qsize %d > mqes %d: buggy driver?\n", + sqid, (int)qsize, mqes); + epcmd->status = NVME_SC_QUEUE_SIZE | NVME_STATUS_DNR; + return; + } + + ret = pci_epf_nvme_create_sq(epf_nvme, sqid, cqid, sq_flags, qsize, + le64_to_cpu(cmd->create_sq.prp1)); + if (ret) { + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + return; + } + + /* Start polling the submission queue */ + schedule_delayed_work(&epf_nvme->ctrl.sq[sqid].work, 1); +} + +static void pci_epf_nvme_process_delete_sq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + u16 sqid; + + sqid = le16_to_cpu(cmd->delete_queue.qid); + if (!sqid || + sqid >= epf_nvme->ctrl.nr_queues || + !epf_nvme->ctrl.sq[sqid].ref) { + epcmd->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; + return; + } + + pci_epf_nvme_delete_sq(epf_nvme, sqid); +} + +static void pci_epf_nvme_identify_hook(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct nvme_command *cmd = &epcmd->cmd; + struct nvme_id_ctrl *id = epcmd->buffer; + unsigned int page_shift; + + if (cmd->identify.cns != NVME_ID_CNS_CTRL) + return; + + /* Set device vendor IDs */ + id->vid = cpu_to_le16(epf_nvme->epf->header->vendorid); + id->ssvid = id->vid; + + /* Set Maximum Data Transfer Size (MDTS) */ + page_shift = NVME_CAP_MPSMIN(epf_nvme->ctrl.ctrl->cap) + 12; + id->mdts = ilog2(epf_nvme->ctrl.mdts) - page_shift; + + /* Clear Controller Multi-Path I/O and Namespace Sharing Capabilities */ + id->cmic = 0; + + /* Do not report support for Autonomous Power State Transitions */ + id->apsta = 0; + + /* Indicate no support for SGLs */ + id->sgls = 0; +} + +static void pci_epf_nvme_get_log_hook(struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + struct nvme_effects_log *log = epcmd->buffer; + + if (cmd->get_log_page.lid 
!= NVME_LOG_CMD_EFFECTS) + return; + + /* + * ACS0 [Delete I/O Submission Queue ] 00000001 + * CSUPP+ LBCC- NCC- NIC- CCC- USS- No command restriction + */ + log->acs[0] |= cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); + + /* + * ACS1 [Create I/O Submission Queue ] 00000001 + * CSUPP+ LBCC- NCC- NIC- CCC- USS- No command restriction + */ + log->acs[1] |= cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); + + /* + * ACS4 [Delete I/O Completion Queue ] 00000001 + * CSUPP+ LBCC- NCC- NIC- CCC- USS- No command restriction + */ + log->acs[4] |= cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); + + /* + * ACS5 [Create I/O Completion Queue ] 00000001 + * CSUPP+ LBCC- NCC- NIC- CCC- USS- No command restriction + */ + log->acs[5] |= cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); +} + +static void pci_epf_nvme_process_admin_cmd(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + void (*post_exec_hook)(struct pci_epf_nvme_cmd *) = NULL; + struct nvme_command *cmd = &epcmd->cmd; + + switch (cmd->common.opcode) { + case nvme_admin_identify: + post_exec_hook = pci_epf_nvme_identify_hook; + epcmd->buffer_size = NVME_IDENTIFY_DATA_SIZE; + epcmd->dma_dir = DMA_TO_DEVICE; + break; + + case nvme_admin_get_log_page: + post_exec_hook = pci_epf_nvme_get_log_hook; + epcmd->buffer_size = nvme_get_log_page_len(cmd); + epcmd->dma_dir = DMA_TO_DEVICE; + break; + + case nvme_admin_async_event: + /* + * Async events are a pain to deal with as they get canceled + * only once we delete the fabrics controller, which happens + * after the epf function is deleted, thus causing access to + * freed memory or leaking of epcmd. So ignore these commands + * for now, which is fine. The host will simply never see any + * event. 
+ */ + pci_epf_nvme_free_cmd(epcmd); + return; + + case nvme_admin_set_features: + case nvme_admin_get_features: + case nvme_admin_abort_cmd: + break; + + case nvme_admin_create_cq: + pci_epf_nvme_process_create_cq(epf_nvme, epcmd); + goto complete; + + case nvme_admin_create_sq: + pci_epf_nvme_process_create_sq(epf_nvme, epcmd); + goto complete; + + case nvme_admin_delete_cq: + pci_epf_nvme_process_delete_cq(epf_nvme, epcmd); + goto complete; + + case nvme_admin_delete_sq: + pci_epf_nvme_process_delete_sq(epf_nvme, epcmd); + goto complete; + + default: + dev_err(&epf_nvme->epf->dev, + "Unhandled admin command %s (0x%02x)\n", + pci_epf_nvme_cmd_name(epcmd), cmd->common.opcode); + epcmd->status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; + goto complete; + } + + /* Synchronously execute the command */ + pci_epf_nvme_exec_cmd(epcmd, post_exec_hook); + +complete: + pci_epf_nvme_complete_cmd(epcmd); +} + +static inline size_t pci_epf_nvme_rw_data_len(struct pci_epf_nvme_cmd *epcmd) +{ + return ((u32)le16_to_cpu(epcmd->cmd.rw.length) + 1) << + epcmd->ns->head->lba_shift; +} + +static void pci_epf_nvme_process_io_cmd(struct pci_epf_nvme_cmd *epcmd, + struct pci_epf_nvme_queue *sq) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + + /* Get the command target namespace */ + epcmd->ns = nvme_find_get_ns(epf_nvme->ctrl.ctrl, + le32_to_cpu(epcmd->cmd.common.nsid)); + if (!epcmd->ns) { + epcmd->status = NVME_SC_INVALID_NS | NVME_STATUS_DNR; + goto complete; + } + + switch (epcmd->cmd.common.opcode) { + case nvme_cmd_read: + epcmd->buffer_size = pci_epf_nvme_rw_data_len(epcmd); + epcmd->dma_dir = DMA_TO_DEVICE; + break; + + case nvme_cmd_write: + epcmd->buffer_size = pci_epf_nvme_rw_data_len(epcmd); + epcmd->dma_dir = DMA_FROM_DEVICE; + break; + + case nvme_cmd_dsm: + epcmd->buffer_size = (le32_to_cpu(epcmd->cmd.dsm.nr) + 1) * + sizeof(struct nvme_dsm_range); + epcmd->dma_dir = DMA_FROM_DEVICE; + goto complete; + + case nvme_cmd_flush: + case nvme_cmd_write_zeroes: + 
break; + + default: + dev_err(&epf_nvme->epf->dev, + "Unhandled IO command %s (0x%02x)\n", + pci_epf_nvme_cmd_name(epcmd), + epcmd->cmd.common.opcode); + epcmd->status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; + goto complete; + } + + queue_work(sq->cmd_wq, &epcmd->work); + + return; + +complete: + pci_epf_nvme_complete_cmd(epcmd); +} + +static bool pci_epf_nvme_fetch_cmd(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *sq) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf_nvme_cmd *epcmd; + int ret; + + if (!(sq->qflags & PCI_EPF_NVME_QUEUE_LIVE)) + return false; + + sq->tail = pci_epf_nvme_reg_read32(ctrl, sq->db); + if (sq->head == sq->tail) + return false; + + ret = pci_epf_nvme_map_queue(epf_nvme, sq); + if (ret) + return false; + + while (sq->head != sq->tail) { + epcmd = pci_epf_nvme_alloc_cmd(epf_nvme); + if (!epcmd) + break; + + /* Get the NVMe command submitted by the host */ + pci_epf_nvme_init_cmd(epf_nvme, epcmd, sq->qid, sq->cqid); + memcpy_fromio(&epcmd->cmd, + sq->pci_map.virt_addr + sq->head * sq->qes, + sizeof(struct nvme_command)); + + dev_dbg(&epf_nvme->epf->dev, + "sq[%d]: head %d/%d, tail %d, command %s\n", + sq->qid, (int)sq->head, (int)sq->depth, + (int)sq->tail, pci_epf_nvme_cmd_name(epcmd)); + + sq->head++; + if (sq->head == sq->depth) + sq->head = 0; + + list_add_tail(&epcmd->link, &sq->list); + } + + pci_epf_nvme_unmap_queue(epf_nvme, sq); + + return !list_empty(&sq->list); +} + +static void pci_epf_nvme_sq_work(struct work_struct *work) +{ + struct pci_epf_nvme_queue *sq = + container_of(work, struct pci_epf_nvme_queue, work.work); + struct pci_epf_nvme *epf_nvme = sq->epf_nvme; + struct pci_epf_nvme_cmd *epcmd; + int n = 0; + + /* Process received commands */ + while (pci_epf_nvme_ctrl_ready(epf_nvme)) { + /* + * Try to get commands from the host. 
If we do not have any + * command yet, aggressively keep polling the SQ for at most + * 1 ms and fall back to rescheduling the SQ work if we still + * have not received any command after that. This hybrid + * spin-polling method significantly increases the IOPS, + * especially for shallow queue depth operation (e.g. QD=1). + * Note that this is done only for I/O commands. + */ + if (!pci_epf_nvme_fetch_cmd(epf_nvme, sq)) { + if (sq->qid && n < 100) { + usleep_range(5, 10); + n++; + continue; + } + break; + } + + while (!list_empty(&sq->list)) { + epcmd = list_first_entry(&sq->list, + struct pci_epf_nvme_cmd, link); + list_del_init(&epcmd->link); + + if (sq->qid) + pci_epf_nvme_process_io_cmd(epcmd, sq); + else + pci_epf_nvme_process_admin_cmd(epcmd); + } + } + + if (pci_epf_nvme_ctrl_ready(epf_nvme)) { + unsigned long poll_interval = 1; + + /* No need to aggressively poll the admin queue. */ + if (!sq->qid) + poll_interval = msecs_to_jiffies(5); + schedule_delayed_work(&sq->work, poll_interval); + } +} + +static void pci_epf_nvme_cq_work(struct work_struct *work) +{ + struct pci_epf_nvme_queue *cq = + container_of(work, struct pci_epf_nvme_queue, work.work); + struct pci_epf_nvme *epf_nvme = cq->epf_nvme; + struct pci_epf_nvme_cmd *epcmd; + bool reschedule = false; + unsigned long flags; + int ret, nr_cqe; + LIST_HEAD(list); + + spin_lock_irqsave(&cq->lock, flags); + + while (!list_empty(&cq->list)) { + + list_splice_tail_init(&cq->list, &list); + spin_unlock_irqrestore(&cq->lock, flags); + + nr_cqe = 0; + + ret = pci_epf_nvme_map_queue(epf_nvme, cq); + if (ret) { + reschedule = true; + goto reschedule; + } + + while (!list_empty(&list)) { + epcmd = list_first_entry(&list, + struct pci_epf_nvme_cmd, link); + list_del_init(&epcmd->link); + if (!pci_epf_nvme_queue_response(epcmd)) + break; + nr_cqe++; + } + + pci_epf_nvme_unmap_queue(epf_nvme, cq); + + if (nr_cqe && pci_epf_nvme_ctrl_ready(cq->epf_nvme)) + pci_epf_nvme_raise_irq(cq->epf_nvme, cq); + + 
spin_lock_irqsave(&cq->lock, flags); + } + + spin_unlock_irqrestore(&cq->lock, flags); + +reschedule: + if (reschedule) + schedule_delayed_work(&cq->work, 1); +} + +static void pci_epf_nvme_reg_poll(struct work_struct *work) +{ + struct pci_epf_nvme *epf_nvme = + container_of(work, struct pci_epf_nvme, reg_poll.work); + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + u32 old_cc; + + /* Set the controller register bar */ + ctrl->reg = epf_nvme->reg_bar; + if (!ctrl->reg) { + dev_err(&epf_nvme->epf->dev, "No register BAR set\n"); + goto again; + } + + /* Check CC.EN to determine what we need to do */ + old_cc = ctrl->cc; + ctrl->cc = pci_epf_nvme_reg_read32(ctrl, NVME_REG_CC); + + /* If not enabled yet, wait */ + if (!(old_cc & NVME_CC_ENABLE) && !(ctrl->cc & NVME_CC_ENABLE)) + goto again; + + /* If CC.EN was set by the host, enable the controller */ + if (!(old_cc & NVME_CC_ENABLE) && (ctrl->cc & NVME_CC_ENABLE)) { + pci_epf_nvme_enable_ctrl(epf_nvme); + goto again; + } + + /* If CC.EN was cleared by the host, disable the controller */ + if (((old_cc & NVME_CC_ENABLE) && !(ctrl->cc & NVME_CC_ENABLE)) || + ctrl->cc & NVME_CC_SHN_NORMAL) + pci_epf_nvme_disable_ctrl(epf_nvme); + +again: + schedule_delayed_work(&epf_nvme->reg_poll, msecs_to_jiffies(5)); +} + +static int pci_epf_nvme_configure_bar(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + const struct pci_epc_features *features = epf_nvme->epc_features; + size_t reg_size, reg_bar_size; + size_t msix_table_size = 0; + + /* + * The first free BAR will be our register BAR and per NVMe + * specifications, it must be BAR 0. 
+ */ + if (pci_epc_get_first_free_bar(features) != BAR_0) { + dev_err(&epf->dev, "BAR 0 is not free\n"); + return -EINVAL; + } + + /* Initialize BAR flags */ + if (features->bar[BAR_0].only_64bit) + epf->bar[BAR_0].flags |= PCI_BASE_ADDRESS_MEM_TYPE_64; + + /* + * Calculate the size of the register bar: NVMe registers first with + * enough space for the doorbells, followed by the MSI-X table + * if supported. + */ + reg_size = NVME_REG_DBS + (PCI_EPF_NVME_MAX_NR_QUEUES * 2 * sizeof(u32)); + reg_size = ALIGN(reg_size, 8); + + if (features->msix_capable) { + size_t pba_size; + + msix_table_size = PCI_MSIX_ENTRY_SIZE * epf->msix_interrupts; + epf_nvme->msix_table_offset = reg_size; + pba_size = ALIGN(DIV_ROUND_UP(epf->msix_interrupts, 8), 8); + + reg_size += msix_table_size + pba_size; + } + + reg_bar_size = ALIGN(reg_size, 4096); + + if (features->bar[BAR_0].type == BAR_FIXED) { + if (reg_bar_size > features->bar[BAR_0].fixed_size) { + dev_err(&epf->dev, + "Reg BAR 0 size %llu B too small, need %zu B\n", + features->bar[BAR_0].fixed_size, + reg_bar_size); + return -ENOMEM; + } + reg_bar_size = features->bar[BAR_0].fixed_size; + } + + epf_nvme->reg_bar = pci_epf_alloc_space(epf, reg_bar_size, BAR_0, + features, PRIMARY_INTERFACE); + if (!epf_nvme->reg_bar) { + dev_err(&epf->dev, "Allocate register BAR failed\n"); + return -ENOMEM; + } + memset(epf_nvme->reg_bar, 0, reg_bar_size); + + return 0; +} + +static void pci_epf_nvme_clear_bar(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + pci_epc_clear_bar(epf->epc, epf->func_no, epf->vfunc_no, + &epf->bar[BAR_0]); + pci_epf_free_space(epf, epf_nvme->reg_bar, BAR_0, PRIMARY_INTERFACE); + epf_nvme->reg_bar = NULL; +} + +static int pci_epf_nvme_init_irq(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + int ret; + + /* Enable MSI-X if supported, otherwise, use MSI */ + if (epf_nvme->epc_features->msix_capable && epf->msix_interrupts) { + ret = 
pci_epc_set_msix(epf->epc, epf->func_no, epf->vfunc_no, + epf->msix_interrupts, BAR_0, + epf_nvme->msix_table_offset); + if (ret) { + dev_err(&epf->dev, "MSI-X configuration failed\n"); + return ret; + } + + epf_nvme->nr_vectors = epf->msix_interrupts; + epf_nvme->irq_type = PCI_IRQ_MSIX; + + return 0; + } + + if (epf_nvme->epc_features->msi_capable && epf->msi_interrupts) { + ret = pci_epc_set_msi(epf->epc, epf->func_no, epf->vfunc_no, + epf->msi_interrupts); + if (ret) { + dev_err(&epf->dev, "MSI configuration failed\n"); + return ret; + } + + epf_nvme->nr_vectors = epf->msi_interrupts; + epf_nvme->irq_type = PCI_IRQ_MSI; + + return 0; + } + + /* MSI and MSI-X are not supported: fall back to INTX */ + epf_nvme->nr_vectors = 1; + epf_nvme->irq_type = PCI_IRQ_INTX; + + return 0; +} + +static int pci_epf_nvme_epc_init(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + int ret; + + if (epf->vfunc_no <= 1) { + /* Set device ID, class, etc */ + ret = pci_epc_write_header(epf->epc, epf->func_no, epf->vfunc_no, + epf->header); + if (ret) { + dev_err(&epf->dev, + "Write configuration header failed %d\n", ret); + return ret; + } + } + + /* Setup the PCIe BAR and enable interrupts */ + ret = pci_epc_set_bar(epf->epc, epf->func_no, epf->vfunc_no, + &epf->bar[BAR_0]); + if (ret) { + dev_err(&epf->dev, "Set BAR 0 failed\n"); + pci_epf_free_space(epf, epf_nvme->reg_bar, BAR_0, + PRIMARY_INTERFACE); + return ret; + } + + ret = pci_epf_nvme_init_irq(epf); + if (ret) + return ret; + + pci_epf_nvme_init_ctrl_regs(epf); + + if (!epf_nvme->epc_features->linkup_notifier) + schedule_delayed_work(&epf_nvme->reg_poll, msecs_to_jiffies(5)); + + return 0; +} + +static void pci_epf_nvme_epc_deinit(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + /* Stop polling BAR registers and disable the controller */ + cancel_delayed_work_sync(&epf_nvme->reg_poll); + + pci_epf_nvme_delete_ctrl(epf); + pci_epf_nvme_clean_dma(epf); + 
pci_epf_nvme_clear_bar(epf); +} + +static int pci_epf_nvme_link_up(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + dev_info(&epf->dev, "Link UP\n"); + + pci_epf_nvme_init_ctrl_regs(epf); + + /* Start polling the BAR registers to detect controller enable */ + schedule_delayed_work(&epf_nvme->reg_poll, 0); + + return 0; +} + +static int pci_epf_nvme_link_down(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + dev_info(&epf->dev, "Link DOWN\n"); + + /* Stop polling BAR registers and disable the controller */ + cancel_delayed_work_sync(&epf_nvme->reg_poll); + pci_epf_nvme_disable_ctrl(epf_nvme); + + return 0; +} + +static const struct pci_epc_event_ops pci_epf_nvme_event_ops = { + .epc_init = pci_epf_nvme_epc_init, + .epc_deinit = pci_epf_nvme_epc_deinit, + .link_up = pci_epf_nvme_link_up, + .link_down = pci_epf_nvme_link_down, +}; + +static int pci_epf_nvme_bind(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + const struct pci_epc_features *epc_features; + struct pci_epc *epc = epf->epc; + bool dma_supported; + int ret; + + if (!epc) { + dev_err(&epf->dev, "No endpoint controller\n"); + return -EINVAL; + } + + epc_features = pci_epc_get_features(epc, epf->func_no, epf->vfunc_no); + if (!epc_features) { + dev_err(&epf->dev, "epc_features not implemented\n"); + return -EOPNOTSUPP; + } + epf_nvme->epc_features = epc_features; + + ret = pci_epf_nvme_configure_bar(epf); + if (ret) + return ret; + + if (epf_nvme->dma_enable) { + dma_supported = pci_epf_nvme_init_dma(epf_nvme); + if (dma_supported) { + dev_info(&epf->dev, "DMA supported\n"); + } else { + dev_info(&epf->dev, + "DMA not supported, falling back to mmio\n"); + epf_nvme->dma_enable = false; + } + } else { + dev_info(&epf->dev, "DMA disabled\n"); + } + + /* Create the fabrics host controller */ + ret = pci_epf_nvme_create_ctrl(epf); + if (ret) + goto clean_dma; + + return 0; + +clean_dma: + 
pci_epf_nvme_clean_dma(epf); + pci_epf_nvme_clear_bar(epf); + + return ret; +} + +static void pci_epf_nvme_unbind(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + struct pci_epc *epc = epf->epc; + + cancel_delayed_work_sync(&epf_nvme->reg_poll); + + pci_epf_nvme_delete_ctrl(epf); + + if (epc->init_complete) { + pci_epf_nvme_clean_dma(epf); + pci_epf_nvme_clear_bar(epf); + } +} + +static struct pci_epf_header epf_nvme_pci_header = { + .vendorid = PCI_ANY_ID, + .deviceid = PCI_ANY_ID, + .progif_code = 0x02, /* NVM Express */ + .baseclass_code = PCI_BASE_CLASS_STORAGE, + .subclass_code = 0x08, /* Non-Volatile Memory controller */ + .interrupt_pin = PCI_INTERRUPT_INTA, +}; + +static int pci_epf_nvme_probe(struct pci_epf *epf, + const struct pci_epf_device_id *id) +{ + struct pci_epf_nvme *epf_nvme; + + epf_nvme = devm_kzalloc(&epf->dev, sizeof(*epf_nvme), GFP_KERNEL); + if (!epf_nvme) + return -ENOMEM; + + epf_nvme->epf = epf; + INIT_DELAYED_WORK(&epf_nvme->reg_poll, pci_epf_nvme_reg_poll); + + epf_nvme->prp_list_buf = devm_kzalloc(&epf->dev, NVME_CTRL_PAGE_SIZE, + GFP_KERNEL); + if (!epf_nvme->prp_list_buf) + return -ENOMEM; + + /* Set default attribute values */ + epf_nvme->dma_enable = true; + epf_nvme->mdts_kb = PCI_EPF_NVME_MDTS_KB; + + epf->event_ops = &pci_epf_nvme_event_ops; + epf->header = &epf_nvme_pci_header; + epf_set_drvdata(epf, epf_nvme); + + return 0; +} + +#define to_epf_nvme(epf_group) \ + container_of((epf_group), struct pci_epf_nvme, group) + +static ssize_t pci_epf_nvme_ctrl_opts_show(struct config_item *item, + char *page) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + + if (!epf_nvme->ctrl_opts_buf) + return 0; + + return sysfs_emit(page, "%s\n", epf_nvme->ctrl_opts_buf); +} + +#define PCI_EPF_NVME_OPT_HIDDEN_NS "hidden_ns" + +static ssize_t pci_epf_nvme_ctrl_opts_store(struct config_item *item, + const char *page, size_t len) +{ + struct 
config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + size_t opt_buf_size; + + /* Do not allow setting options when the function is already started */ + if (epf_nvme->ctrl.ctrl) + return -EBUSY; + + if (!len) + return -EINVAL; + + kfree(epf_nvme->ctrl_opts_buf); + + /* + * Make sure we have enough room to add the hidden_ns option + * if it is missing. + */ + opt_buf_size = len + strlen(PCI_EPF_NVME_OPT_HIDDEN_NS) + 2; + epf_nvme->ctrl_opts_buf = kzalloc(opt_buf_size, GFP_KERNEL); + if (!epf_nvme->ctrl_opts_buf) + return -ENOMEM; + + strscpy(epf_nvme->ctrl_opts_buf, page, opt_buf_size); + if (!strnstr(page, PCI_EPF_NVME_OPT_HIDDEN_NS, len)) + strncat(epf_nvme->ctrl_opts_buf, + "," PCI_EPF_NVME_OPT_HIDDEN_NS, opt_buf_size); + + dev_dbg(&epf_nvme->epf->dev, + "NVMe fabrics controller options: %s\n", + epf_nvme->ctrl_opts_buf); + + return len; +} + +CONFIGFS_ATTR(pci_epf_nvme_, ctrl_opts); + +static ssize_t pci_epf_nvme_dma_enable_show(struct config_item *item, + char *page) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + + return sysfs_emit(page, "%d\n", epf_nvme->dma_enable); +} + +static ssize_t pci_epf_nvme_dma_enable_store(struct config_item *item, + const char *page, size_t len) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + int ret; + + if (epf_nvme->ctrl_enabled) + return -EBUSY; + + ret = kstrtobool(page, &epf_nvme->dma_enable); + if (ret) + return ret; + + return len; +} + +CONFIGFS_ATTR(pci_epf_nvme_, dma_enable); + +static ssize_t pci_epf_nvme_mdts_kb_show(struct config_item *item, char *page) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + + return sysfs_emit(page, "%zu\n", epf_nvme->mdts_kb); +} + +static ssize_t pci_epf_nvme_mdts_kb_store(struct config_item *item, + const char *page, size_t len) +{ + struct 
config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + unsigned long mdts_kb; + int ret; + + if (epf_nvme->ctrl_enabled) + return -EBUSY; + + ret = kstrtoul(page, 0, &mdts_kb); + if (ret) + return ret; + if (!mdts_kb) + mdts_kb = PCI_EPF_NVME_MDTS_KB; + else if (mdts_kb > PCI_EPF_NVME_MAX_MDTS_KB) + mdts_kb = PCI_EPF_NVME_MAX_MDTS_KB; + + if (!is_power_of_2(mdts_kb)) + return -EINVAL; + + epf_nvme->mdts_kb = mdts_kb; + + return len; +} + +CONFIGFS_ATTR(pci_epf_nvme_, mdts_kb); + +static struct configfs_attribute *pci_epf_nvme_attrs[] = { + &pci_epf_nvme_attr_ctrl_opts, + &pci_epf_nvme_attr_dma_enable, + &pci_epf_nvme_attr_mdts_kb, + NULL, +}; + +static const struct config_item_type pci_epf_nvme_group_type = { + .ct_attrs = pci_epf_nvme_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct config_group *pci_epf_nvme_add_cfs(struct pci_epf *epf, + struct config_group *group) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + /* Add the NVMe target attributes */ + config_group_init_type_name(&epf_nvme->group, "nvme", + &pci_epf_nvme_group_type); + + return &epf_nvme->group; +} + +static const struct pci_epf_device_id pci_epf_nvme_ids[] = { + { .name = "pci_epf_nvme" }, + {}, +}; + +static struct pci_epf_ops pci_epf_nvme_ops = { + .bind = pci_epf_nvme_bind, + .unbind = pci_epf_nvme_unbind, + .add_cfs = pci_epf_nvme_add_cfs, +}; + +static struct pci_epf_driver epf_nvme_driver = { + .driver.name = "pci_epf_nvme", + .probe = pci_epf_nvme_probe, + .id_table = pci_epf_nvme_ids, + .ops = &pci_epf_nvme_ops, + .owner = THIS_MODULE, +}; + +static int __init pci_epf_nvme_init(void) +{ + int ret; + + epf_nvme_cmd_cache = kmem_cache_create("epf_nvme_cmd", + sizeof(struct pci_epf_nvme_cmd), + 0, SLAB_HWCACHE_ALIGN, NULL); + if (!epf_nvme_cmd_cache) + return -ENOMEM; + + ret = pci_epf_register_driver(&epf_nvme_driver); + if (ret) + goto out_cache; + + pr_info("Registered nvme EPF driver\n"); + + return 0; + +out_cache: + 
kmem_cache_destroy(epf_nvme_cmd_cache); + + pr_err("Register nvme EPF driver failed\n"); + + return ret; +} +module_init(pci_epf_nvme_init); + +static void __exit pci_epf_nvme_exit(void) +{ + pci_epf_unregister_driver(&epf_nvme_driver); + + kmem_cache_destroy(epf_nvme_cmd_cache); + + pr_info("Unregistered nvme EPF driver\n"); +} +module_exit(pci_epf_nvme_exit); + +MODULE_DESCRIPTION("PCI endpoint NVMe function driver"); +MODULE_AUTHOR("Damien Le Moal "); +MODULE_IMPORT_NS(NVME_TARGET_PASSTHRU); +MODULE_IMPORT_NS(NVME_FABRICS); +MODULE_LICENSE("GPL");

From patchwork Mon Oct 7 04:43:51 2024
X-Patchwork-Submitter: Damien Le Moal
X-Patchwork-Id: 13824054
From: Damien Le Moal
To: linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig, Sagi Grimberg, Manivannan Sadhasivam, Krzysztof Wilczyński, Kishon Vijay Abraham I, Bjorn Helgaas, Lorenzo Pieralisi, linux-pci@vger.kernel.org
Cc: Rick Wertenbroek, Niklas Cassel
Subject: [PATCH v1 5/5] PCI: endpoint: Document the NVMe endpoint function driver
Date: Mon, 7 Oct 2024 13:43:51 +0900
Message-ID: <20241007044351.157912-6-dlemoal@kernel.org>
In-Reply-To: <20241007044351.157912-1-dlemoal@kernel.org>
References: <20241007044351.157912-1-dlemoal@kernel.org>
X-Mailing-List: linux-pci@vger.kernel.org

Add the documentation files:
- Documentation/PCI/endpoint/pci-nvme-function.rst
- Documentation/PCI/endpoint/pci-nvme-howto.rst
- Documentation/PCI/endpoint/function/binding/pci-nvme.rst

These files respectively document the NVMe PCI endpoint function driver
internals, provide a user guide explaining how to set up an NVMe endpoint
device, and describe the NVMe endpoint function driver binding attributes.

Signed-off-by: Damien Le Moal
---
 .../endpoint/function/binding/pci-nvme.rst    |  34 ++++
 Documentation/PCI/endpoint/index.rst          |   3 +
 .../PCI/endpoint/pci-nvme-function.rst        | 151 ++++++++++++++
 Documentation/PCI/endpoint/pci-nvme-howto.rst | 190 ++++++++++++++++++
 MAINTAINERS                                   |   2 +
 5 files changed, 380 insertions(+)
 create mode 100644 Documentation/PCI/endpoint/function/binding/pci-nvme.rst
 create mode 100644 Documentation/PCI/endpoint/pci-nvme-function.rst
 create mode 100644 Documentation/PCI/endpoint/pci-nvme-howto.rst

diff --git a/Documentation/PCI/endpoint/function/binding/pci-nvme.rst b/Documentation/PCI/endpoint/function/binding/pci-nvme.rst
new file mode 100644
index 000000000000..d80293c08bcd
--- /dev/null
+++ b/Documentation/PCI/endpoint/function/binding/pci-nvme.rst
@@ -0,0 +1,34 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+PCI NVMe Endpoint Function
+==========================
+
+1) Create the function subdirectory pci_epf_nvme.0 in the
+pci_ep/functions/pci_epf_nvme directory of configfs.
+
+Standard EPF Configurable Fields:
+
+================   ===========================================================
+vendorid           Do not care (e.g. PCI_ANY_ID)
+deviceid           Do not care (e.g. PCI_ANY_ID)
+revid              Do not care
+progif_code        Must be 0x02 (NVM Express)
+baseclass_code     Must be 0x1 (PCI_BASE_CLASS_STORAGE)
+subclass_code      Must be 0x08 (Non-Volatile Memory controller)
+cache_line_size    Do not care
+subsys_vendor_id   Do not care (e.g. PCI_ANY_ID)
+subsys_id          Do not care (e.g. PCI_ANY_ID)
+msi_interrupts     At least equal to the number of queue pairs desired
+msix_interrupts    At least equal to the number of queue pairs desired
+interrupt_pin      Interrupt PIN to use if MSI and MSI-X are not supported
+================   ===========================================================
+
+The NVMe EPF specific configurable fields are in the nvme subdirectory of the
+directory created in step 1.
+
+================   ===========================================================
+ctrl_opts          NVMe target connection parameters
+dma_enable         Enable (1) or disable (0) DMA transfers; default = 1
+mdts_kb            Maximum data transfer size in KiB; default = 128
+================   ===========================================================
diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst
index 4d2333e7ae06..764f1e8f81f2 100644
--- a/Documentation/PCI/endpoint/index.rst
+++ b/Documentation/PCI/endpoint/index.rst
@@ -15,6 +15,9 @@ PCI Endpoint Framework
    pci-ntb-howto
    pci-vntb-function
    pci-vntb-howto
+   pci-nvme-function
+   pci-nvme-howto
 
    function/binding/pci-test
    function/binding/pci-ntb
+   function/binding/pci-nvme
diff --git a/Documentation/PCI/endpoint/pci-nvme-function.rst b/Documentation/PCI/endpoint/pci-nvme-function.rst
new file mode 100644
index 000000000000..ac8baa5556be
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-nvme-function.rst
@@ -0,0 +1,151 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+PCI NVMe Function
+=================
+
+:Author: Damien Le Moal
+
+The PCI NVMe endpoint function driver implements a PCIe NVMe controller for a
+local NVMe fabrics host controller. The fabrics controller target can use any
+of the transports supported by the NVMe driver. In practice, on small SBC
+boards equipped with a PCI endpoint controller, loop targets backed by files or
+block devices, or TCP targets for remote NVMe devices, can easily be used.
+ +Overview +======== + +The NVMe endpoint function driver relies as much as possible on the NVMe +fabrics driver for executing NVMe commands received from the PCI RC host to +minimize NVMe command parsing. However, some admin commands must be modified to +satisfy PCI transport specification constraints (e.g. queue management +commands support and the optional SGL support). + +Capabilities +------------ + +The NVMe capabilities exposed to the PCI RC host through the BAR 0 registers +are almost identical to the capabilities of the NVMe fabrics controller, with +some exceptions: + +1) NVMe-over-fabrics specifications mandate support for SGL. However, this + capability is not exposed as supported because the current NVMe endpoint + driver code does not support SGL. + +2) The NVMe endpoint function driver can expose a different MDTS (Maximum Data + Transfer Size) than the fabrics controller used. + +Maximum Number of Queue Pairs +----------------------------- + +Upon binding of the NVMe endpoint function driver to the endpoint controller, +BAR 0 is allocated with enough space to accommodate up to +PCI_EPF_NVME_MAX_NR_QUEUES (16) queue pairs. This relatively low number is +necessary to avoid running out of memory windows for mapping PCI addresses to +local endpoint controller memory. + +The number of memory windows necessary for operation is roughly at most: +1) One memory window for raising MSI/MSI-X interrupts +2) One memory window for command PRP and data transfers +3) One memory window for each submission queue +4) One memory window for each completion queue + +Given the highly asynchronous nature of the NVMe endpoint function driver +operation, the memory windows needed as described above will generally not be +used simultaneously, but that may happen. So a safe maximum number of queue +pairs that can be supported is equal to the maximum number of memory windows of +the endpoint controller minus two and divided by two. E.g. 
for an endpoint PCI +controller with 32 outbound memory windows available, up to 15 queue pairs can +be safely operated without any risk of getting PCI space mapping errors due to +the lack of memory windows. + +The NVMe endpoint function driver allows configuring the maximum number of +queue pairs through configfs. + +Command Execution +================= + +The NVMe endpoint function driver relies on several work items to process NVMe +commands issued by the PCI RC host. + +Register Poll Work +------------------ + +The register poll work is a delayed work used to poll for changes to the +controller state register. This is used to detect operations initiated by the +PCI host such as enabling or disabling the NVMe controller. The register poll +work is scheduled every 5 ms. + +Submission Queue Work +--------------------- + +Upon creation of submission queues, starting with the submission queue for +admin commands, a delayed work is created and scheduled for execution every +jiffy to poll for a submission queue doorbell to detect submission of commands +by the PCI host. + +When changes to a submission queue doorbell are detected by a submission queue +work, the work allocates a command structure to copy the NVMe command issued by +the PCI host and schedules processing of the command using the command work. + +Command Processing Work +----------------------- + +This per-NVMe command work is scheduled for execution when an NVMe command is +received from the host. This work will: + +1) Do minimal parsing of the NVMe command to determine if the command has a + data buffer. If it does, the PRP list for the command is retrieved to + identify the PCI address ranges used for the command data buffer. This can + lead to the command buffer being represented using several discontiguous + memory fragments. A local memory buffer is also allocated for local + execution of the command using the fabrics controller. 
+ +2) If the command is a write command (DMA direction from host to device), data + is transferred from the host to the local memory buffer of the command. This + is handled in a loop to process all fragments of the command buffer as well + as simultaneously handle PCI address mapping constraints of the PCI endpoint + controller. + +3) The command is then executed using the NVMe driver fabrics code. This blocks + the command work until the command execution completes. + +4) When the command completes, the command work schedules handling of the + command response using the completion queue work. + +Completion Queue Work +--------------------- + +This per-completion queue work is used to aggregate handling of responses to +completed commands in batches to avoid having to issue an IRQ for every +completed command. This work is scheduled every time a command completes and +does: + +1) Post a command completion entry for all completed commands. + +2) Update the completion queue doorbell. + +3) Raise an IRQ to signal the host that commands have completed. + +Configuration +============= + +The NVMe endpoint function driver can be fully controlled using configfs, once +an NVMe fabrics target is also set up. The available configfs parameters are: + + ctrl_opts + + Fabrics controller connection arguments, as formatted for + the nvme cli "connect" command. + + dma_enable + + Enable (default) or disable DMA data transfers. + + mdts_kb + + Change the maximum data transfer size (default: 128 KiB). + +See Documentation/PCI/endpoint/pci-nvme-howto.rst for a more detailed +description of these parameters and how to use them to configure an NVMe +endpoint function driver. diff --git a/Documentation/PCI/endpoint/pci-nvme-howto.rst b/Documentation/PCI/endpoint/pci-nvme-howto.rst new file mode 100644 index 000000000000..e377818dad5a --- /dev/null +++ b/Documentation/PCI/endpoint/pci-nvme-howto.rst @@ -0,0 +1,190 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +=========================================== +PCI NVMe Endpoint Function (EPF) User Guide +=========================================== + +:Author: Damien Le Moal + +This document is a guide to help users use the pci-epf-nvme function driver to +create PCIe NVMe controllers. For a high-level description of the NVMe function +driver internals, see Documentation/PCI/endpoint/pci-nvme-function.rst. + +Hardware and Kernel Requirements +================================ + +To use the NVMe PCI endpoint driver, at least one endpoint controller device +is required. + +To find the list of endpoint controller devices in the system:: + + # ls /sys/class/pci_epc/ + a40000000.pcie-ep + +If PCI_ENDPOINT_CONFIGFS is enabled:: + + # ls /sys/kernel/config/pci_ep/controllers + a40000000.pcie-ep + +Compiling the NVMe endpoint function driver depends on the target support of +the NVMe driver being enabled (CONFIG_NVME_TARGET). It is also recommended to +enable CONFIG_NVME_TARGET_LOOP to enable the use of loop targets (to use files +or block devices as storage for the NVMe target device). If the board used also +supports Ethernet, CONFIG_NVME_TCP can be set to enable the use of remote TCP +NVMe targets. + +To facilitate testing, enabling the null-blk driver (CONFIG_BLK_DEV_NULL_BLK) +is also recommended. With this, a simple setup using a null_blk block device +with an NVMe loop target can be used. + + +NVMe Endpoint Device +==================== + +Creating an NVMe endpoint device is a two step process. First, an NVMe target +device must be defined. Second, the NVMe endpoint device must be set up using +the defined NVMe target device. + +Creating an NVMe Target Device +------------------------------ + +Details about how to configure an NVMe target are outside the scope of this +document. The following only provides a simple example of a loop target setup +using a null_blk device for storage. 
+ +First, make sure that configfs is enabled:: + + # mount -t configfs none /sys/kernel/config + +Next, create a null_blk device (default settings give a 250 GB device without +memory backing). The block device created will be /dev/nullb0 by default:: + + # modprobe null_blk + # ls /dev/nullb0 + /dev/nullb0 + +The NVMe loop target driver must be loaded:: + + # modprobe nvme_loop + # lsmod | grep nvme + nvme_loop 16384 0 + nvmet 106496 1 nvme_loop + nvme_fabrics 28672 1 nvme_loop + nvme_core 131072 3 nvme_loop,nvmet,nvme_fabrics + +Now, create the NVMe loop target, starting with the NVMe subsystem, specifying +a maximum of 4 queue pairs:: + + # cd /sys/kernel/config/nvmet/subsystems + # mkdir pci_epf_nvme.0.nqn + # echo -n "Linux-pci-epf" > pci_epf_nvme.0.nqn/attr_model + # echo 4 > pci_epf_nvme.0.nqn/attr_qid_max + # echo 1 > pci_epf_nvme.0.nqn/attr_allow_any_host + +Next, create the target namespace using the null_blk block device:: + + # mkdir pci_epf_nvme.0.nqn/namespaces/1 + # echo -n "/dev/nullb0" > pci_epf_nvme.0.nqn/namespaces/1/device_path + # echo 1 > "pci_epf_nvme.0.nqn/namespaces/1/enable" + +Finally, create the target port and link it to the subsystem:: + + # cd /sys/kernel/config/nvmet/ports + # mkdir 1 + # echo -n "loop" > 1/addr_trtype + # ln -s /sys/kernel/config/nvmet/subsystems/pci_epf_nvme.0.nqn \ + 1/subsystems/pci_epf_nvme.0.nqn + + +Creating an NVMe Endpoint Device +-------------------------------- + +With the NVMe target ready for use, the NVMe PCI endpoint device can now be +created and enabled. 
The first step is to load the NVMe function driver:: + + # modprobe pci_epf_nvme + # ls /sys/kernel/config/pci_ep/functions + pci_epf_nvme + +Next, create function 0:: + + # cd /sys/kernel/config/pci_ep/functions/pci_epf_nvme + # mkdir pci_epf_nvme.0 + # ls pci_epf_nvme.0/ + baseclass_code msix_interrupts secondary + cache_line_size nvme subclass_code + deviceid primary subsys_id + interrupt_pin progif_code subsys_vendor_id + msi_interrupts revid vendorid + +Configure the function using any vendor ID and device ID:: + + # cd /sys/kernel/config/pci_ep/functions/pci_epf_nvme/pci_epf_nvme.0 + # echo 0x15b7 > vendorid + # echo 0x5fff > deviceid + # echo 32 > msix_interrupts + # echo -n "transport=loop,nqn=pci_epf_nvme.0.nqn,nr_io_queues=4" > \ + nvme/ctrl_opts + +The ctrl_opts attribute must be set using arguments equivalent to those used +for a normal NVMe target connection with the "nvme connect" command. For the +example above, the equivalent target connection command is:: + + # nvme connect --transport=loop --nqn=pci_epf_nvme.0.nqn --nr-io-queues=4 + +The endpoint function can then be bound to the endpoint controller and the +controller started:: + + # cd /sys/kernel/config/pci_ep + # ln -s functions/pci_epf_nvme/pci_epf_nvme.0 controllers/a40000000.pcie-ep/ + # echo 1 > controllers/a40000000.pcie-ep/start + +Kernel messages will show information as the NVMe target device and endpoint +device are created and connected. + +.. code-block:: text + + pci_epf_nvme: Registered nvme EPF driver + nvmet: adding nsid 1 to subsystem pci_epf_nvme.0.nqn + pci_epf_nvme pci_epf_nvme.0: DMA RX channel dma3chan2, maximum segment size 4294967295 B + pci_epf_nvme pci_epf_nvme.0: DMA TX channel dma3chan0, maximum segment size 4294967295 B + pci_epf_nvme pci_epf_nvme.0: DMA supported + nvmet: creating nvm controller 1 for subsystem pci_epf_nvme.0.nqn for NQN nqn.2014-08.org.nvmexpress:uuid:0aa34ec6-11c0-4b02-ac9b-e07dff4b5c84. + nvme nvme0: creating 4 I/O queues. 
+ nvme nvme0: new ctrl: "pci_epf_nvme.0.nqn" + pci_epf_nvme pci_epf_nvme.0: NVMe fabrics controller created, 4 I/O queues + pci_epf_nvme pci_epf_nvme.0: NVMe PCI controller supports MSI-X, 32 vectors + pci_epf_nvme pci_epf_nvme.0: NVMe PCI controller: 4 I/O queues + + +PCI RootComplex Host +==================== + +After booting the host, the NVMe endpoint device is discoverable as a PCI device:: + + # lspci -n + 0000:01:00.0 0108: 15b7:5fff + +And this device will be recognized as an NVMe device with a single namespace:: + + # lsblk + NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS + nullb0 254:0 0 250G 0 disk + nvme0n1 259:0 0 250G 0 disk + +The NVMe endpoint block device can then be used as any other regular NVMe +device. The nvme command line utility can be used to get more detailed +information about the endpoint device:: + + # nvme id-ctrl /dev/nvme0 + NVME Identify Controller: + vid : 0x15b7 + ssvid : 0x15b7 + sn : 0ec249554579a1d08fb5 + mn : Linux-pci-epf + fr : 6.12.0-r + rab : 6 + ieee : 000000 + cmic : 0 + mdts : 5 + ... diff --git a/MAINTAINERS b/MAINTAINERS index c9b40621dbc1..18eeaba01996 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16559,6 +16559,8 @@ M: Damien Le Moal L: linux-pci@vger.kernel.org L: linux-nvme@lists.infradead.org S: Supported +F: Documentation/PCI/endpoint/function/binding/pci-nvme.rst +F: Documentation/PCI/endpoint/pci-nvme-*.rst F: drivers/pci/endpoint/functions/pci-epf-nvme.c NVM EXPRESS DRIVER