[v6,00/16] nvmet: add target passthru commands support

Message ID 20190725172335.6825-1-logang@deltatee.com (mailing list archive)

Message

Logan Gunthorpe July 25, 2019, 5:23 p.m. UTC
Hi,

Chaitanya has asked us to take on these patches as we have an
interest in getting them into upstream. To that end, we've done
a large amount of testing, bug fixes and cleanup.

Passthru support for nvmet allows users to export an entire
NVMe controller through NVMe-oF. When exported in this way (as opposed
to exporting each namespace as a block device), all the NVMe commands
are passed to the given controller unmodified, including most admin
commands and Vendor Unique Commands (VUCs). A passthru target will
expose all namespaces for a given device to the remote host.
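
To give a feel for what "passed to the given controller unmodified" means
in practice, here is a very rough sketch of the submission path: the
fabrics command carried by the nvmet request is wrapped in a block-layer
request on the passthru controller's queues and executed asynchronously.
The function names below are made up for illustration; the real code in
io-cmd-passthru.c additionally handles data mapping, admin-command
effects and a workqueue, and is considerably more involved:

#include <linux/blkdev.h>
#include "../host/nvme.h"
#include "nvmet.h"

static void nvmet_passthru_req_done(struct request *rq, blk_status_t status)
{
        struct nvmet_req *req = rq->end_io_data;

        /* Translate the backend completion into a fabrics response (simplified). */
        nvmet_req_complete(req, status ? NVME_SC_INTERNAL : NVME_SC_SUCCESS);
        blk_put_request(rq);
}

static u16 nvmet_passthru_submit(struct nvmet_req *req, struct nvme_ctrl *ctrl)
{
        /* Admin commands use the admin queue; I/O would use a namespace queue. */
        struct request_queue *q = ctrl->admin_q;
        struct request *rq;

        rq = nvme_alloc_request(q, req->cmd, BLK_MQ_REQ_NOWAIT, NVME_QID_ANY);
        if (IS_ERR(rq))
                return NVME_SC_INTERNAL;

        /* Mapping req->sg into the request (blk_rq_map_*) is omitted here. */

        rq->end_io_data = req;
        blk_execute_rq_nowait(q, NULL, rq, 0, nvmet_passthru_req_done);
        return 0;
}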

There are three major non-bugfix changes that we've done to the series:

1) Instead of using a separate special passthru subsystem in
   configfs simply add a passthru directory that's analogous to
   the existing namespace directories. The directories have
   very similar attributes to namespaces but are mutually exclusive.
   If a user enables a namespace, they can't then enable a
   passthru controller and vice versa. This simplifies the code
   required to implement passthru configfs and IMO creates a much
   clearer and more uniform interface.

2) Instead of taking a bare controller name (i.e. "nvme1"), take a
   full device path to the controller's char device. This is more
   consistent with the regular namespaces, which take a path, and
   also allows users to make use of udev rules and symlinks to
   manage their controllers instead of the potentially unstable
   device names. (A rough sketch of the path lookup follows this
   list.)

3) Implement block accounting for the passthru devices. This is so
   the target OS can still track device usage using /proc/diskstats.
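
For point 2 above, here is a rough sketch of how a configured path could
be resolved to the underlying char device. The function name is made up;
the actual cdev_get_by_path()/nvme_get_by_path() helpers in this series
also take references on the cdev and controller, which is where most of
the real work is:

#include <linux/namei.h>
#include <linux/fs.h>
#include <linux/stat.h>

static dev_t devt_from_path(const char *devpath)
{
        struct path path;
        struct inode *inode;
        dev_t devt = 0;

        if (kern_path(devpath, LOOKUP_FOLLOW, &path))
                return 0;

        inode = d_backing_inode(path.dentry);
        if (S_ISCHR(inode->i_mode))     /* must be a character device node */
                devt = inode->i_rdev;

        path_put(&path);
        return devt;    /* 0 means not found or not a char device */
}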

Besides these three changes, we've also found a large number of bugs
and crashes and did a bunch of testing with KASAN, lockdep and kmemleak.
A more complete list of changes is given below.

Additionally, we've written some new blktests to test the passthru
code. A branch is available here[1] and can be submitted once these
patches are upstream.

These patches are based off of v5.3-rc1 and a git branch is available
at [2].

Thanks,

Logan

[1] https://github.com/Eideticom/blktests nvmet_passthru
[2] https://github.com/sbates130272/linux-p2pmem/ nvmet_passthru_v6

--

v6 Changes (done by Logan):
  1. Rebased onto v5.3-rc1
  2. Rework configfs interface to simply be a passthru directory
     within the existing subsystem. The directory is similar to
     and consistent with a namespace directory.
  3. Have the configfs take a path instead of a bare controller name
  4. Renamed the main passthru file to io-cmd-passthru.c for consistency
     with the file and block-dev methods.
  5. Cleaned up all the CONFIG_NVME_TARGET_PASSTHRU usage to remove
     all the inline #ifdefs
  6. Restructured nvmet_passthru_make_request() a bit for clearer code
  7. Moved the nvme_find_get_ns() call into nvmet_passthru_execute_cmd(),
     since calling it in nvmet_req_init() causes a lockdep warning
     due to nvme_find_get_ns() being able to sleep.
  8. Added a check in nvmet_passthru_execute_cmd() to ensure we don't
     violate queue_max_segments or queue_max_hw_sectors, and overrode
     mdts so that hosts can't submit commands that exceed these limits
     (see the sketch after this changelog).
  9. Reworked the code which ensures there's only one subsystem per
     passthru controller to use an xarray instead of a list; this is
     simpler and also fixed some bugs triggered by disabling
     subsystems that weren't enabled.
 10. Removed the override of the target cntlid with the passthru cntlid;
     this seemed like a really bad idea, especially in the presence of
     mixed systems, as you could end up with two ctrlrs with the same
     cntlid. For now, commands that depend on cntlid are blacklisted.
 11. Implement block accounting for passthru so the target can track
     usage using /proc/diskstats
 12. A number of other minor bug fixes and cleanups
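
As a rough illustration of item 8, deriving a clamped MDTS from the
passthru device's queue limits could look like the sketch below. The
helper name is made up, and the real check in io-cmd-passthru.c also has
to account for queue_max_segments; this only shows the max_hw_sectors
half:

#include <linux/blkdev.h>
#include <linux/log2.h>
#include <linux/nvme.h>

/*
 * MDTS is a power of two in units of the controller's minimum memory
 * page size (2^(12 + CAP.MPSMIN)), with 0 meaning "no limit".  Derive
 * a value that keeps host I/O within the backend's max_hw_sectors.
 */
static u8 nvmet_passthru_mdts(u64 cap, struct request_queue *q)
{
        u32 page_shift = NVME_CAP_MPSMIN(cap) + 12;
        u64 max_bytes = (u64)queue_max_hw_sectors(q) << 9; /* 512B sectors */
        u64 pages = max_bytes >> page_shift;

        /* Shouldn't happen: max_hw_sectors is always at least one page. */
        if (!pages)
                return 1;

        /* ilog2() rounds down, which is the safe direction here. */
        return ilog2(pages);
}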

v5 Changes (not sent to list, from Chaitanya):
  1. Added workqueue for admin commands.
  2. Added kconfig option for the pass-thru feature.
  3. Restructured the parsing code as suggested: call
     nvmet_xxx_parse_cmd() from nvmet_passthru_parse_cmd().
  4. Used pass-thru instead of pt.
  5. Several cleanups; added comments at the appropriate locations.
  6. Minimized the code for checking the pass-thru ns across all the subsystems.
  7. Removed the delays in the ns related admin commands since I was
     not able to reproduce the previous bug.

v4 Changes:
  1. Add request polling interface to the block layer.
  2. Use request polling interface in the NVMe-oF target passthru code
     path.
  3. Add checks suggested by Sagi for creating one target ctrl per
     passthru ctrl.
  4. Don't enable the namespace if it belongs to the configured passthru
     ctrl.
  5. Adjust the code for the latest kernel.

v3 Changes:
  1. Split the addition of passthru command handlers and integration
     into two different patches since we add guards to create one target
     controller per passthru controller. This way it will be easier to
     review the code.
  2. Adjust the code for 4.18.

v2 Changes:
  1. Update the new nvme core controller find API naming and
     change the string comparison of the ctrl.
  2. Get rid of the newly added #defines for target ctrl values.
  3. Use the newly added structure members in the same patch where
     they are used. Aggregate the passthru command handling support
     and integration with nvmet-core into one patch.
  4. Introduce global NVMe Target subsystem list for connected and
     not connected subsystems on the target side.
  5. Add check when configuring the target ns and target
     passthru ctrl to allow only one target controller to be created
     for one passthru subsystem.
  6. Use the passthru ctrl cntlid when creating the
     target controller.

Chaitanya Kulkarni (6):
  nvme-core: export existing ctrl and ns interfaces
  nvmet: add return value to nvmet_add_async_event()
  nvmet-passthru: update KConfig with config passthru option
  nvmet-passthru: add passthru code to process commands
  nvmet-core: allow one host per passthru-ctrl
  nvmet-core: don't check the data len for pt-ctrl

Logan Gunthorpe (10):
  chardev: factor out cdev_lookup() helper
  chardev: introduce cdev_get_by_path()
  chardev: export cdev_put()
  nvme-core: introduce nvme_get_by_path()
  nvmet: make nvmet_copy_ns_identifier() non-static
  nvmet-passthru: add enable/disable helpers
  nvmet-configfs: introduce passthru configfs interface
  block: don't check blk_rq_is_passthrough() in blk_do_io_stat()
  block: call blk_account_io_start() in blk_execute_rq_nowait()
  nvmet-passthru: support block accounting

 block/blk-exec.c                      |   2 +
 block/blk-mq.c                        |   2 +-
 block/blk.h                           |   5 +-
 drivers/nvme/host/core.c              |  40 +-
 drivers/nvme/host/nvme.h              |   9 +
 drivers/nvme/target/Kconfig           |  10 +
 drivers/nvme/target/Makefile          |   1 +
 drivers/nvme/target/admin-cmd.c       |   4 +-
 drivers/nvme/target/configfs.c        |  99 ++++
 drivers/nvme/target/core.c            |  41 +-
 drivers/nvme/target/io-cmd-passthru.c | 681 ++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h           |  71 ++-
 fs/char_dev.c                         |  61 ++-
 include/linux/cdev.h                  |   1 +
 14 files changed, 1003 insertions(+), 24 deletions(-)
 create mode 100644 drivers/nvme/target/io-cmd-passthru.c

--
2.20.1

Comments

Hannes Reinecke July 26, 2019, 6:23 a.m. UTC | #1
Hi Logan,

On 7/25/19 7:23 PM, Logan Gunthorpe wrote:
> Hi,
> 
> Chaitanya has asked us to take on these patches as we have an
> interest in getting them into upstream. To that end, we've done
> a large amount of testing, bug fixes and cleanup.
> 
> Passthru support for nvmet allows users to export an entire
> NVMe controller through NVMe-oF. When exported in this way (as opposed
> to exporting each namespace as a block device), all the NVMe commands
> are passed to the given controller unmodified, including most admin
> commands and Vendor Unique Commands (VUCs). A passthru target will
> expose all namespaces for a given device to the remote host.
> 
In general I'm very much in favour of this, yet there are some issues
which I'm not quite clear about.

> There are three major non-bugfix changes that we've done to the series:
> 
> 1) Instead of using a separate special passthru subsystem in
>    configfs simply add a passthru directory that's analogous to
>    the existing namespace directories. The directories have
>    very similar attributes to namespaces but are mutually exclusive.
>    If a user enables a namespace, they can't then enable a
>    passthru controller and vice versa. This simplifies the code
>    required to implement passthru configfs and IMO creates a much
>    clearer and more uniform interface.
> 
How do you handle subsystem naming?
If you enable the 'passthru' device, the (nvmet) subsystem (and its
name) is already created. Yet the passthru device will have its own
internal subsystem naming, so if you're not extra careful you'll end up
with a nvmet subsystem which doesn't have any relationship with the
passthru subsystem, making addressing etc ... tricky.
Any thoughts about that?

Similarly: how do you propose to handle multipath devices?
Any NVMe with several paths will be enabling NVMe multipathing
automatically, presenting you with a single multipathed namespace.
How will these devices be treated?
Will the multipathed namespace be used for passthru?

Cheers,

Hannes
Logan Gunthorpe July 26, 2019, 5:07 p.m. UTC | #2
On 2019-07-26 12:23 a.m., Hannes Reinecke wrote:
> How do you handle subsystem naming?
> If you enable the 'passthru' device, the (nvmet) subsystem (and its
> name) is already created. Yet the passthru device will have its own
> internal subsystem naming, so if you're not extra careful you'll end up
> with a nvmet subsystem which doesn't have any relationship with the
> passthru subsystem, making addressing etc ... tricky.
> Any thoughts about that?

Well I can't say I have a great understanding of how multipath works, but...

I don't think it necessarily makes sense for the target subsysnqn and the
target's device nqn to be the same. It would be weird for a user to want
to use the same device and a passed through device (through a loop) as
part of the same subsystem. That being said, it's possible for the user
to use the subsysnqn from the passed through device for the name of the
subsys of the target. I tried this and it works except for the fact that
the device I'm passing through doesn't set id->cmic.

> Similarly: how do you propose to handle multipath devices?
> Any NVMe with several paths will be enabling NVMe multipathing
> automatically, presenting you with a single multipathed namespace.
> How will these devices be treated?

Well passthru works on the controller level not on the namespace level.
So it can't make use of the multipath handling on the target system.

The one case that I think makes sense to me, but that I don't know if we
can handle, is if the user had a couple of multipath-enabled controllers
with the same subsysnqn and wanted to passthru all of them to another
system and use multipath on the host with both controllers. This would
require having multiple target subsystems with the same name which I
don't think will work too well.

> Will the multipathed namespace be used for passthru?

Nope.

Honestly, I think the answer is if someone wants to use multipathed
controllers they should use regular NVMe-of as it doesn't really mesh
well with the passthru approach.

Logan
Sagi Grimberg July 26, 2019, 10:21 p.m. UTC | #3
>> How do you handle subsystem naming?
>> If you enable the 'passthru' device, the (nvmet) subsystem (and its
>> name) is already created. Yet the passthru device will have its own
>> internal subsystem naming, so if you're not extra careful you'll end up
>> with a nvmet subsystem which doesn't have any relationship with the
>> passthru subsystem, making addressing etc ... tricky.
>> Any thoughts about that?
> 
> Well I can't say I have a great understanding of how multipath works, but...

Why is this related to multipath?

> I don't think it necessarily makes sense for the target subsysnqn and the
> target's device nqn to be the same. It would be weird for a user to want
> to use the same device and a passed through device (through a loop) as
> part of the same subsystem. That being said, it's possible for the user
> to use the subsysnqn from the passed through device for the name of the
> subsys of the target. I tried this and it works except for the fact that
> the device I'm passing through doesn't set id->cmic.

I don't see why the subsystem nqn should be the same name. It's
just like any other nvmet subsystem, it just happens to have an nvme
controller in the backend (which it knows about). No reason to have
the same name IMO.

>> Similarly: how do you propose to handle multipath devices?
>> Any NVMe with several paths will be enabling NVMe multipathing
>> automatically, presenting you with a single multipathed namespace.
>> How will these devices be treated?
> 
> Well passthru works on the controller level not on the namespace level.
> So it can't make use of the multipath handling on the target system.

Why? if nvmet is capable, why shouldn't we support it?

> The one case that I think makes sense to me, but that I don't know if we
> can handle, is if the user had a couple of multipath-enabled controllers
> with the same subsysnqn

That is usually the case, there is no multipathing defined across NVM
subsystems (at least for now).

> and wanted to passthru all of them to another
> system and use multipath on the host with both controllers. This would
> require having multiple target subsystems with the same name which I
> don't think will work too well.

Don't understand why this is the case?

AFAICT, all nvmet needs to do is:
1. override cmic
2. allow allocating multiple controllers to the pt ctrl as long as the
hostnqn matches.
3. answer all the ana stuff.

What else is missing?
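
For what it's worth, point 1 by itself is essentially a one-bit override
in whatever path builds the Identify Controller data for the fabrics
host; a minimal sketch (helper name purely illustrative):

#include <linux/nvme.h>

/* CMIC bit 1: the NVM subsystem may contain two or more controllers. */
static void nvmet_passthru_force_multi_ctrl(struct nvme_id_ctrl *id)
{
        id->cmic |= 1 << 1;
}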

>> Will the multipathed namespace be used for passthru?
> 
> Nope.
> 
> Honestly, I think the answer is if someone wants to use multipathed
> controllers they should use regular NVMe-of as it doesn't really mesh
> well with the passthru approach.

Maybe I'm missing something, but they should be orthogonal... I know that
it's sort of not real passthru, but we are exposing an nvme device across
a fabric; I think it's reasonable to have some adjustments on top.
Logan Gunthorpe July 26, 2019, 10:37 p.m. UTC | #4
On 2019-07-26 4:21 p.m., Sagi Grimberg wrote:
>> I don't think it necessarily makes sense for the target subsysnqn and the
>> target's device nqn to be the same. It would be weird for a user to want
>> to use the same device and a passed through device (through a loop) as
>> part of the same subsystem. That being said, it's possible for the user
>> to use the subsysnqn from the passed through device for the name of the
>> subsys of the target. I tried this and it works except for the fact that
>> the device I'm passing through doesn't set id->cmic.
> 
> I don't see why the subsystem nqn should be the same name. It's
> just like any other nvmet subsystem, it just happens to have an nvme
> controller in the backend (which it knows about). No reason to have
> the same name IMO.

Agreed.

>>> Similarly: how do you propose to handle multipath devices?
>>> Any NVMe with several paths will be enabling NVMe multipathing
>>> automatically, presenting you with a single multipathed namespace.
>>> How will these devices be treated?
>>
>> Well passthru works on the controller level not on the namespace level.
>> So it can't make use of the multipath handling on the target system.
> 
> Why? if nvmet is capable, why shouldn't we support it?

I'm saying that passthru is exporting a specific controller and submits
commands (both admin and IO) straight to the nvme_ctrl's queues. It's
not exporting an nvme_subsys and I think it would be troublesome to do
so; for example, if the target receives an admin command, which ctrl of
the subsystem should it send it to? There's also no userspace handle for
a given subsystem; we'd maybe have to use the subsysnqn.

>> The one case that I think makes sense to me, but I don't know how if we
>> can handle, is if the user had a couple multipath enabled controllers
>> with the same subsynqn
> 
> That is usually the case, there is no multipathing defined across NVM
> subsystems (at least for now).
> 
>> and wanted to passthru all of them to another
>> system and use multipath on the host with both controllers. This would
>> require having multiple target subsystems with the same name which I
>> don't think will work too well.
> 
> Don't understand why this is the case?
> 
> AFAICT, all nvmet needs to do is:
> 1. override cmic
> 2. allow allocating multiple controllers to the pt ctrl as long as the
> hostnqn matches.
> 3. answer all the ana stuff.

But with this scheme the host will only see one controller and then the
target would have to make decisions on which ctrl to send any commands
to. Maybe it could be done for I/O but I don't see how it could be done
correctly for admin commands.

And from the host's perspective, having cmic set doesn't help anything
because we've limited the passthru code to only accept one connection
from one host so the host can only actually have one route to this
controller.

Logan
Sagi Grimberg July 26, 2019, 11:13 p.m. UTC | #5
>> Why? if nvmet is capable, why shouldn't we support it?
> 
> I'm saying that passthru is exporting a specific controller and submits
> commands (both admin and IO) straight to the nvme_ctrl's queues. It's
> not exporting an nvme_subsys and I think it would be troublesome to do
> so; for example, if the target receives an admin command, which ctrl of
> the subsystem should it send it to?

It's the same controller in the backend; what is the difference from
which fabrics controller the admin command came from?

> There's also no userspace handle for
> a given subsystem; we'd maybe have to use the subsysnqn.

Umm, not sure I understand what you mean.

>>> The one case that I think makes sense to me, but that I don't know if we
>>> can handle, is if the user had a couple of multipath-enabled controllers
>>> with the same subsysnqn
>>
>> That is usually the case, there is no multipathing defined across NVM
>> subsystems (at least for now).
>>
>>> and wanted to passthru all of them to another
>>> system and use multipath on the host with both controllers. This would
>>> require having multiple target subsystems with the same name which I
>>> don't think will work too well.
>>
>> Don't understand why this is the case?
>>
>> AFAICT, all nvmet needs to do is:
>> 1. override cmic
>> 2. allow allocating multiple controllers to the pt ctrl as long as the
>> hostnqn matches.
>> 3. answer all the ana stuff.
> 
> But with this scheme the host will only see one controller and then the
> target would have to make decisions on which ctrl to send any commands
> to. Maybe it could be done for I/O but I don't see how it could be done
> correctly for admin commands.

I haven't thought this through, so it's very possible that I'm missing
something, but why can't the host see multiple controllers if it has
more than one path to the target?

What specific admin commands are you concerned about? What exactly
would clash?

> And from the host's perspective, having cmic set doesn't help anything
> because we've limited the passthru code to only accept one connection
> from one host so the host can only actually have one route to this
> controller.

And I'm suggesting to allow more than a single controller given that all
controller allocations match a single hostnqn. It wouldn't make sense to
expose this controller to multiple hosts (although that might be doable
but definitely requires non-trivial infrastructure around it).

Look, when it comes to fabrics, multipath is a fundamental piece of the
puzzle. Not supporting multipathing significantly diminishes the value
of this in my mind (assuming this answers a real-world use-case).
Logan Gunthorpe July 27, 2019, 12:09 a.m. UTC | #6
On 2019-07-26 5:13 p.m., Sagi Grimberg wrote:
> 
>>> Why? if nvmet is capable, why shouldn't we support it?
>>
>> I'm saying that passthru is exporting a specific controller and submits
>> commands (both admin and IO) straight to the nvme_ctrl's queues. It's
>> not exporting an nvme_subsys and I think it would be troublesome to do
>> so; for example, if the target receives an admin command, which ctrl of
>> the subsystem should it send it to?
> 
> It's the same controller in the backend; what is the difference from
> which fabrics controller the admin command came from?

This is not my understanding. It's not really the same controller in the
backend, and there are admin commands that operate on the controller,
like the namespace attachment command, which takes a list of cntlids
(though admittedly that's not something I'm too familiar with because I
don't have any hardware to play around with). Though that command is
already a bit problematic for passthru because we have different cntlid
address spaces.

> I haven't thought this through, so it's very possible that I'm missing
> something, but why can't the host see multiple controllers if it has
> more than one path to the target?

Well a target controller is created for each connection. So if the host
wanted to see multiple controllers it would have to do multiple "nvme
connects" and some how need to address the individual controllers for
each connection. Right now a connect is based on subsysnqn which would
be the same for every multipath controller.

> What specific admin commands are you concerned about? What exactly
> would clash?

Namespace attach comes to mind.

> And I'm suggesting to allow more than a single controller given that all
> controller allocations match a single hostnqn. It wouldn't make sense to
> expose this controller to multiple hosts (although that might be doable
> but definitely requires non-trivial infrastructure around it).

> Look, when it comes to fabrics, multipath is a fundamental piece of the
> puzzle. Not supporting multipathing significantly diminishes the value
> of this in my mind (assuming this answers a real-world use-case).

I'd agree with that. But it's the multipath through different ports that
seems important for fabrics. ie. If I have a host with a path through
RDMA and a path through TCP they should both work and allow fail over.
This is quite orthogonal to passthru and would be easily supported if we
dropped the multiple hosts restriction (I'm not sure what the objection
really is to that).

This is different from multipath on say a multi-controller PCI device
and trying to expose both those controllers through passthru. This is
where the problems we are discussing come up. Supporting this is what is
hard, and I think the sensible answer is that if users want to do something
like that, they should use non-passthru NVMe-oF and the multipath code will
just work as designed.

Our real-world use case is to support our PCI device which has a bunch
of vendor unique commands and isn't likely to support multiple
controllers in the foreseeable future.

Logan
Stephen Bates July 27, 2019, 12:50 a.m. UTC | #7
> This is different from multipath on say a multi-controller PCI device
> and trying to expose both those controllers through passthru. This is
> where the problems we are discussing come up.

I *think* there is some confusion. I *think* Sagi is talking about network multi-path (i.e. the ability for the host to connect to a controller on the target via two different network paths that fail-over as needed). I *think* Logan is talking about multi-port PCIe NVMe devices that allow namespaces to be accessed via more than one controller over PCIe (dual-port NVMe SSDs being the most obvious example of this today).

> But it's the multipath through different ports that
>  seems important for fabrics. ie. If I have a host with a path through
>  RDMA and a path through TCP they should both work and allow fail over.

Yes, or even two paths that are both RDMA or both TCP but which take a different path through the network from host to target.

> Our real-world use case is to support our PCI device which has a bunch
> of vendor unique commands and isn't likely to support multiple
> controllers in the foreseeable future.

I think solving passthru for single-port PCIe controllers would be a good start.

Stephen
Sagi Grimberg July 29, 2019, 4:15 p.m. UTC | #8
>> This is different from multipath on say a multi-controller PCI device
>> and trying to expose both those controllers through passthru. This is
>> where the problems we are discussing come up.
> 
> I *think* there is some confusion. I *think* Sagi is talking about network multi-path (i.e. the ability for the host to connect to a controller on the target via two different network paths that fail-over as needed). I *think* Logan is talking about multi-port PCIe NVMe devices that allow namespaces to be accessed via more than one controller over PCIe (dual-port NVMe SSDs being the most obvious example of this today).

Yes, I was referring to fabrics multipathing which is somewhat
orthogonal to the backend pci multipathing (unless I'm missing
something).

>> But it's the multipath through different ports that
>>   seems important for fabrics. ie. If I have a host with a path through
>>   RDMA and a path through TCP they should both work and allow fail over.
> 
> Yes, or even two paths that are both RDMA or both TCP but which take a different path through the network from host to target.
> 
>> Our real-world use case is to support our PCI device which has a bunch
>> of vendor unique commands and isn't likely to support multiple
>> controllers in the foreseeable future.
> 
> I think solving passthru for single-port PCIe controllers would be a good start.

Me too.
Logan Gunthorpe July 29, 2019, 4:17 p.m. UTC | #9
On 2019-07-29 10:15 a.m., Sagi Grimberg wrote:
> 
>>> This is different from multipath on say a multi-controller PCI device
>>> and trying to expose both those controllers through passthru. This is
>>> where the problems we are discussing come up.
>>
>> I *think* there is some confusion. I *think* Sagi is talking about network multi-path (i.e. the ability for the host to connect to a controller on the target via two different network paths that fail-over as needed). I *think* Logan is talking about multi-port PCIe NVMe devices that allow namespaces to be accessed via more than one controller over PCIe (dual-port NVMe SSDs being the most obvious example of this today).
> 
> Yes, I was referring to fabrics multipathing which is somewhat
> orthogonal to the backend pci multipathing (unless I'm missing
> something).

Yes, so if we focus on the fabrics multipathing, the only issue I see is
that only one controller can be connected to a passthru target (I
believe this was at your request) so two paths simply cannot exist to
begin with. I can easily remove that restriction.

Logan