diff mbox series

[v1,5/5] PCI: endpoint: Document the NVMe endpoint function driver

Message ID 20241007044351.157912-6-dlemoal@kernel.org (mailing list archive)
State Superseded
Headers show
Series NVMe PCI endpoint function driver | expand

Commit Message

Damien Le Moal Oct. 7, 2024, 4:43 a.m. UTC
Add the documentation files:
 - Documentation/PCI/endpoint/pci-nvme-function.rst
 - Documentation/PCI/endpoint/pci-nvme-howto.rst
 - Documentation/PCI/endpoint/function/binding/pci-nvme.rst

To respectively document the NVMe PCI endpoint function driver
internals, provide a user guide explaining how to set up an NVMe
endpoint device and describe the NVMe endpoint function driver binding
attributes.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 .../endpoint/function/binding/pci-nvme.rst    |  34 ++++
 Documentation/PCI/endpoint/index.rst          |   3 +
 .../PCI/endpoint/pci-nvme-function.rst        | 151 ++++++++++++++
 Documentation/PCI/endpoint/pci-nvme-howto.rst | 190 ++++++++++++++++++
 MAINTAINERS                                   |   2 +
 5 files changed, 380 insertions(+)
 create mode 100644 Documentation/PCI/endpoint/function/binding/pci-nvme.rst
 create mode 100644 Documentation/PCI/endpoint/pci-nvme-function.rst
 create mode 100644 Documentation/PCI/endpoint/pci-nvme-howto.rst
Patch

diff --git a/Documentation/PCI/endpoint/function/binding/pci-nvme.rst b/Documentation/PCI/endpoint/function/binding/pci-nvme.rst
new file mode 100644
index 000000000000..d80293c08bcd
--- /dev/null
+++ b/Documentation/PCI/endpoint/function/binding/pci-nvme.rst
@@ -0,0 +1,34 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+PCI NVMe Endpoint Function
+==========================
+
+1) Create the function subdirectory pci_epf_nvme.0 in the
+pci_ep/functions/pci_epf_nvme directory of configfs.
+
+Standard EPF Configurable Fields:
+
+================   ===========================================================
+vendorid           Do not care (e.g. PCI_ANY_ID)
+deviceid           Do not care (e.g. PCI_ANY_ID)
+revid              Do not care
+progif_code        Must be 0x02 (NVM Express)
+baseclass_code     Must be 0x01 (PCI_BASE_CLASS_STORAGE)
+subclass_code      Must be 0x08 (Non-Volatile Memory controller)
+cache_line_size    Do not care
+subsys_vendor_id   Do not care (e.g. PCI_ANY_ID)
+subsys_id          Do not care (e.g. PCI_ANY_ID)
+msi_interrupts     At least equal to the number of queue pairs desired
+msix_interrupts    At least equal to the number of queue pairs desired
+interrupt_pin      Interrupt PIN to use if MSI and MSI-X are not supported
+================   ===========================================================
+
+The NVMe EPF specific configurable fields are in the nvme subdirectory of the
+function directory created in step 1:
+
+================   ===========================================================
+ctrl_opts          NVMe target connection parameters
+dma_enable         Enable (1) or disable (0) DMA transfers; default = 1
+mdts_kb            Maximum data transfer size in KiB; default = 128
+================   ===========================================================
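As a sketch, these fields can be set from a shell as shown below. This assumes the function device pci_epf_nvme.0 has been created as described in step 1; the vendor/device IDs and the target NQN are placeholder example values:

```shell
# Example values only: IDs, interrupt count and NQN are placeholders.
cd /sys/kernel/config/pci_ep/functions/pci_epf_nvme/pci_epf_nvme.0
echo 0x15b7 > vendorid
echo 0x5fff > deviceid
echo 32 > msix_interrupts
# NVMe EPF specific fields live in the nvme subdirectory.
echo -n "transport=loop,nqn=pci_epf_nvme.0.nqn" > nvme/ctrl_opts
echo 1 > nvme/dma_enable
echo 128 > nvme/mdts_kb
```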
diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst
index 4d2333e7ae06..764f1e8f81f2 100644
--- a/Documentation/PCI/endpoint/index.rst
+++ b/Documentation/PCI/endpoint/index.rst
@@ -15,6 +15,9 @@  PCI Endpoint Framework
    pci-ntb-howto
    pci-vntb-function
    pci-vntb-howto
+   pci-nvme-function
+   pci-nvme-howto
 
    function/binding/pci-test
    function/binding/pci-ntb
+   function/binding/pci-nvme
diff --git a/Documentation/PCI/endpoint/pci-nvme-function.rst b/Documentation/PCI/endpoint/pci-nvme-function.rst
new file mode 100644
index 000000000000..ac8baa5556be
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-nvme-function.rst
@@ -0,0 +1,151 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+PCI NVMe Function
+=================
+
+:Author: Damien Le Moal <dlemoal@kernel.org>
+
+The PCI NVMe endpoint function driver implements a PCIe NVMe controller on top
+of a local NVMe fabrics host controller. The fabrics controller target can use
+any of the transports supported by the NVMe driver. In practice, on small SBC
+boards equipped with a PCI endpoint controller, loop targets backed by files or
+block devices, or TCP targets connected to remote NVMe devices, can easily be
+used.
+
+Overview
+========
+
+The NVMe endpoint function driver relies as much as possible on the NVMe
+fabrics driver for executing NVMe commands received from the PCI RC host, to
+minimize NVMe command parsing. However, some admin commands must be modified
+to satisfy PCI transport specification constraints (e.g. support for the queue
+management commands and the optional SGL support).
+
+Capabilities
+------------
+
+The NVMe capabilities exposed to the PCI RC host through the BAR 0 registers
+are almost identical to the capabilities of the NVMe fabrics controller, with
+some exceptions:
+
+1) NVMe-over-fabrics specifications mandate support for SGL. However, this
+   capability is not exposed as supported because the current NVMe endpoint
+   driver code does not support SGL.
+
+2) The NVMe endpoint function driver can expose a different MDTS (Maximum Data
+   Transfer Size) than the fabrics controller used.
+
+Maximum Number of Queue Pairs
+-----------------------------
+
+Upon binding of the NVMe endpoint function driver to the endpoint controller,
+BAR 0 is allocated with enough space to accommodate up to
+PCI_EPF_NVME_MAX_NR_QUEUES (16) queue pairs. This relatively low number is
+necessary to avoid running out of memory windows for mapping PCI addresses to
+local endpoint controller memory.
+
+The maximum number of memory windows necessary for operation is roughly:
+1) One memory window for raising MSI/MSI-X interrupts
+2) One memory window for command PRP and data transfers
+3) One memory window for each submission queue
+4) One memory window for each completion queue
+
+Given the highly asynchronous nature of the NVMe endpoint function driver
+operation, the memory windows described above will generally not all be used
+simultaneously, but that may happen. So a safe maximum number of queue pairs
+is equal to the maximum number of memory windows of the endpoint controller,
+minus two, divided by two. E.g. for an endpoint PCI controller with 32
+outbound memory windows available, up to (32 - 2) / 2 = 15 queue pairs can be
+safely operated without any risk of PCI space mapping errors due to the lack
+of memory windows.
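The rule of thumb above can be checked with simple shell arithmetic (the window count of 32 is an example value):

```shell
# Safe queue pair estimate: (outbound windows - 2 shared windows) / 2,
# since each queue pair needs one SQ window and one CQ window.
total_windows=32
echo $(( (total_windows - 2) / 2 ))
```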
+
+The NVMe endpoint function driver allows configuring the maximum number of
+queue pairs through configfs.
+
+Command Execution
+=================
+
+The NVMe endpoint function driver relies on several work items to process NVMe
+commands issued by the PCI RC host.
+
+Register Poll Work
+------------------
+
+The register poll work is a delayed work used to poll for changes to the
+controller state register. This is used to detect operations initiated by the
+PCI host, such as enabling or disabling the NVMe controller. The register poll
+work is scheduled every 5 ms.
+
+Submission Queue Work
+---------------------
+
+Upon creation of submission queues, starting with the submission queue for
+admin commands, a delayed work is created and scheduled for execution every
+jiffy to poll for a submission queue doorbell to detect submission of commands
+by the PCI host.
+
+When a doorbell change is detected by a submission queue work, the work
+allocates a command structure, copies the NVMe command issued by the PCI host
+into it, and schedules processing of the command using the command processing
+work.
+
+Command Processing Work
+-----------------------
+
+This per-command work is scheduled for execution when an NVMe command is
+received from the host. This work does the following:
+
+1) Minimal parsing of the NVMe command is done to determine if the command has
+   a data buffer. If it does, the PRP list for the command is retrieved to
+   identify the PCI address ranges used for the command data buffer. This can
+   lead to the command buffer being represented using several discontiguous
+   memory fragments. A local memory buffer is also allocated for local
+   execution of the command using the fabrics controller.
+
+2) If the command is a write command (DMA direction from host to device), data
+   is transferred from the host to the local memory buffer of the command. This
+   is handled in a loop to process all fragments of the command buffer as well
+   as simultaneously handle PCI address mapping constraints of the PCI endpoint
+   controller.
+
+3) The command is then executed using the NVMe driver fabrics code. This blocks
+   the command work until the command execution completes.
+
+4) When the command completes, the command work schedules handling of the
+   command response using the completion queue work.
+
+Completion Queue Work
+---------------------
+
+This per-completion queue work is used to aggregate handling of responses to
+completed commands in batches, to avoid having to raise an IRQ for every
+completed command. This work is scheduled every time a command completes and:
+
+1) Posts a completion entry for all completed commands.
+
+2) Updates the completion queue doorbell.
+
+3) Raises an IRQ to signal the host that commands have completed.
+
+Configuration
+=============
+
+The NVMe endpoint function driver can be fully controlled using configfs, once
+an NVMe fabrics target has also been set up. The available configfs parameters
+are:
+
+  ctrl_opts
+
+        Fabrics controller connection arguments, as formatted for
+        the nvme cli "connect" command.
+
+  dma_enable
+
+        Enable (default) or disable DMA data transfers.
+
+  mdts_kb
+
+        Change the maximum data transfer size (default: 128 KB).
+
+See Documentation/PCI/endpoint/pci-nvme-howto.rst for a more detailed
+description of these parameters and how to use them to configure an NVMe
+endpoint function driver.
diff --git a/Documentation/PCI/endpoint/pci-nvme-howto.rst b/Documentation/PCI/endpoint/pci-nvme-howto.rst
new file mode 100644
index 000000000000..e377818dad5a
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-nvme-howto.rst
@@ -0,0 +1,190 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+PCI NVMe Endpoint Function (EPF) User Guide
+===========================================
+
+:Author: Damien Le Moal <dlemoal@kernel.org>
+
+This document is a guide to help users use the pci-epf-nvme function driver to
+create PCIe NVMe controllers. For a high-level description of the NVMe function
+driver internals, see Documentation/PCI/endpoint/pci-nvme-function.rst.
+
+Hardware and Kernel Requirements
+================================
+
+To use the NVMe PCI endpoint driver, at least one endpoint controller device
+is required.
+
+To find the list of endpoint controller devices in the system::
+
+        # ls /sys/class/pci_epc/
+        a40000000.pcie-ep
+
+If PCI_ENDPOINT_CONFIGFS is enabled::
+
+        # ls /sys/kernel/config/pci_ep/controllers
+        a40000000.pcie-ep
+
+The NVMe endpoint function driver depends on the NVMe driver target support
+being enabled (CONFIG_NVME_TARGET). It is also recommended to enable
+CONFIG_NVME_TARGET_LOOP to allow the use of loop targets (to use files or
+block devices as storage for the NVMe target device). If the board used also
+supports Ethernet, CONFIG_NVME_TCP can be set to enable the use of remote TCP
+NVMe targets.
+
+To facilitate testing, enabling the null-blk driver (CONFIG_BLK_DEV_NULL_BLK)
+is also recommended. With this, a simple setup using a null_blk block device
+with an NVMe loop target can be used.
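As a sketch, the presence of these options in the running kernel can be checked as follows (the config file path is an assumption; some distributions expose /proc/config.gz instead):

```shell
grep -E "^CONFIG_(NVME_TARGET|NVME_TARGET_LOOP|NVME_TCP|BLK_DEV_NULL_BLK)=" \
        "/boot/config-$(uname -r)"
```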
+
+
+NVMe Endpoint Device
+====================
+
+Creating an NVMe endpoint device is a two-step process. First, an NVMe target
+device must be defined. Second, the NVMe endpoint device must be set up using
+the defined NVMe target device.
+
+Creating an NVMe Target Device
+------------------------------
+
+Details about how to configure an NVMe target are outside the scope of this
+document. The following only provides a simple example of a loop target setup
+using a null_blk device for storage.
+
+First, make sure that configfs is enabled::
+
+        # mount -t configfs none /sys/kernel/config
+
+Next, create a null_blk device (default settings give a 250 GB device without
+memory backing). The block device created will be /dev/nullb0 by default::
+
+        # modprobe null_blk
+        # ls /dev/nullb0
+        /dev/nullb0
+
+The NVMe loop target driver must be loaded::
+
+        # modprobe nvme_loop
+        # lsmod | grep nvme
+        nvme_loop              16384  0
+        nvmet                 106496  1 nvme_loop
+        nvme_fabrics           28672  1 nvme_loop
+        nvme_core             131072  3 nvme_loop,nvmet,nvme_fabrics
+
+Now, create the NVMe loop target, starting with the NVMe subsystem, specifying
+a maximum of 4 queue pairs::
+
+        # cd /sys/kernel/config/nvmet/subsystems
+        # mkdir pci_epf_nvme.0.nqn
+        # echo -n "Linux-pci-epf" > pci_epf_nvme.0.nqn/attr_model
+        # echo 4 > pci_epf_nvme.0.nqn/attr_qid_max
+        # echo 1 > pci_epf_nvme.0.nqn/attr_allow_any_host
+
+Next, create the target namespace using the null_blk block device::
+
+        # mkdir pci_epf_nvme.0.nqn/namespaces/1
+        # echo -n "/dev/nullb0" > pci_epf_nvme.0.nqn/namespaces/1/device_path
+        # echo 1 > "pci_epf_nvme.0.nqn/namespaces/1/enable"
+
+Finally, create the target port and link it to the subsystem::
+
+        # cd /sys/kernel/config/nvmet/ports
+        # mkdir 1
+        # echo -n "loop" > 1/addr_trtype
+        # ln -s /sys/kernel/config/nvmet/subsystems/pci_epf_nvme.0.nqn
+                1/subsystems/pci_epf_nvme.0.nqn
+
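The target setup steps above can be collected into a single script, shown here as a sketch using the same example names:

```shell
#!/bin/sh
# Sketch: NVMe loop target setup, combining the steps shown above.
set -e
cd /sys/kernel/config/nvmet/subsystems
mkdir pci_epf_nvme.0.nqn
echo -n "Linux-pci-epf" > pci_epf_nvme.0.nqn/attr_model
echo 4 > pci_epf_nvme.0.nqn/attr_qid_max
echo 1 > pci_epf_nvme.0.nqn/attr_allow_any_host
mkdir pci_epf_nvme.0.nqn/namespaces/1
echo -n "/dev/nullb0" > pci_epf_nvme.0.nqn/namespaces/1/device_path
echo 1 > pci_epf_nvme.0.nqn/namespaces/1/enable
cd /sys/kernel/config/nvmet/ports
mkdir 1
echo -n "loop" > 1/addr_trtype
ln -s /sys/kernel/config/nvmet/subsystems/pci_epf_nvme.0.nqn \
        1/subsystems/pci_epf_nvme.0.nqn
```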
+
+Creating an NVMe Endpoint Device
+--------------------------------
+
+With the NVMe target ready for use, the NVMe PCI endpoint device can now be
+created and enabled. The first step is to load the NVMe function driver::
+
+        # modprobe pci_epf_nvme
+        # ls /sys/kernel/config/pci_ep/functions
+        pci_epf_nvme
+
+Next, create function 0::
+
+        # cd /sys/kernel/config/pci_ep/functions/pci_epf_nvme
+        # mkdir pci_epf_nvme.0
+        # ls pci_epf_nvme.0/
+        baseclass_code    msix_interrupts   secondary
+        cache_line_size   nvme              subclass_code
+        deviceid          primary           subsys_id
+        interrupt_pin     progif_code       subsys_vendor_id
+        msi_interrupts    revid             vendorid
+
+Configure the function using any vendor ID and device ID::
+
+        # cd /sys/kernel/config/pci_ep/functions/pci_epf_nvme/pci_epf_nvme.0
+        # echo 0x15b7 > vendorid
+        # echo 0x5fff > deviceid
+        # echo 32 > msix_interrupts
+        # echo -n "transport=loop,nqn=pci_epf_nvme.0.nqn,nr_io_queues=4" > \
+                ctrl_opts
+
+The ctrl_opts attribute must be set using the same arguments as used for a
+normal NVMe target connection with the "nvme connect" command. For the example
+above, the equivalent target connection command is::
+
+        # nvme connect --transport=loop --nqn=pci_epf_nvme.0.nqn --nr-io-queues=4
+
+The endpoint function can then be bound to the endpoint controller and the
+controller started::
+
+        # cd /sys/kernel/config/pci_ep
+        # ln -s functions/pci_epf_nvme/pci_epf_nvme.0 controllers/a40000000.pcie-ep/
+        # echo 1 > controllers/a40000000.pcie-ep/start
+
+Kernel messages will show information as the NVMe target device and endpoint
+device are created and connected.
+
+.. code-block:: text
+
+        pci_epf_nvme: Registered nvme EPF driver
+        nvmet: adding nsid 1 to subsystem pci_epf_nvme.0.nqn
+        pci_epf_nvme pci_epf_nvme.0: DMA RX channel dma3chan2, maximum segment size 4294967295 B
+        pci_epf_nvme pci_epf_nvme.0: DMA TX channel dma3chan0, maximum segment size 4294967295 B
+        pci_epf_nvme pci_epf_nvme.0: DMA supported
+        nvmet: creating nvm controller 1 for subsystem pci_epf_nvme.0.nqn for NQN nqn.2014-08.org.nvmexpress:uuid:0aa34ec6-11c0-4b02-ac9b-e07dff4b5c84.
+        nvme nvme0: creating 4 I/O queues.
+        nvme nvme0: new ctrl: "pci_epf_nvme.0.nqn"
+        pci_epf_nvme pci_epf_nvme.0: NVMe fabrics controller created, 4 I/O queues
+        pci_epf_nvme pci_epf_nvme.0: NVMe PCI controller supports MSI-X, 32 vectors
+        pci_epf_nvme pci_epf_nvme.0: NVMe PCI controller: 4 I/O queues
+
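When the endpoint device is no longer needed, it can be torn down in the reverse order. This is a sketch; it assumes that, as with other endpoint function drivers, writing 0 to the start attribute stops the controller:

```shell
# Stop the endpoint controller, unbind and delete the function.
cd /sys/kernel/config/pci_ep
echo 0 > controllers/a40000000.pcie-ep/start
rm controllers/a40000000.pcie-ep/pci_epf_nvme.0
rmdir functions/pci_epf_nvme/pci_epf_nvme.0
```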
+
+PCI RootComplex Host
+====================
+
+After booting the host, the NVMe endpoint device will be discoverable as a PCI
+device::
+
+        # lspci -n
+        0000:01:00.0 0108: 15b7:5fff
+
+And this device will be recognized as an NVMe device with a single namespace::
+
+        # lsblk
+        NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
+        nullb0      254:0    0   250G  0 disk
+        nvme0n1     259:0    0   250G  0 disk
+
+The NVMe endpoint block device can then be used as any other regular NVMe
+device. The nvme command line utility can be used to get more detailed
+information about the endpoint device::
+
+        # nvme id-ctrl /dev/nvme0
+        NVME Identify Controller:
+        vid       : 0x15b7
+        ssvid     : 0x15b7
+        sn        : 0ec249554579a1d08fb5
+        mn        : Linux-pci-epf
+        fr        : 6.12.0-r
+        rab       : 6
+        ieee      : 000000
+        cmic      : 0
+        mdts      : 5
+        ...
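For instance, a quick read smoke test can be run against the namespace (device name as discovered above; run as root):

```shell
# Read a few MiB from the endpoint namespace and time the transfer.
dd if=/dev/nvme0n1 of=/dev/null bs=128k count=64
```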
diff --git a/MAINTAINERS b/MAINTAINERS
index c9b40621dbc1..18eeaba01996 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16559,6 +16559,8 @@  M:	Damien Le Moal <dlemoal@kernel.org>
 L:	linux-pci@vger.kernel.org
 L:	linux-nvme@lists.infradead.org
 S:	Supported
+F:	Documentation/PCI/endpoint/function/binding/pci-nvme.rst
+F:	Documentation/PCI/endpoint/pci-nvme-*.rst
 F:	drivers/pci/endpoint/functions/pci-epf-nvme.c
 
 NVM EXPRESS DRIVER