Message ID: 20240529160053.111531-1-aaptel@nvidia.com (mailing list archive)
Series: nvme-tcp receive offloads
On Wed, 29 May 2024 16:00:33 +0000 Aurelien Aptel wrote:
> These offloads are similar in nature to the packet-based NIC TLS offloads,
> which are already upstream (see net/tls/tls_device.c).
> You can read more about TLS offload here:
> https://www.kernel.org/doc/html/latest/networking/tls-offload.html

Which I had to send a fix for blindly again today:
https://lore.kernel.org/all/20240530232607.82686-1-kuba@kernel.org/
because there is no software model and none of the vendors apparently
cares enough to rigorously test it.

I'm not joking with the continuous tests.
FYI, I still absolutely detest this code. I know people want to
avoid the page copy for NVMe over TCP (or any TCP based storage
protocols for that matter), but having these weird vendor-specific
hooks all the way up into the application protocol is just horrible.

IETF has standardized a generic data placement protocol, which is
part of iWarp. Even if folks don't like RDMA it exists to solve
exactly these kinds of problems of data placement. And if we can't
arse folks into standard data placement methods we at least need it
vendor independent and without hooks into the actual protocol
driver.
On 31/05/2024 9:11, Christoph Hellwig wrote:
> FYI, I still absolutely detest this code. I know people want to
> avoid the page copy for NVMe over TCP (or any TCP based storage
> protocols for that matter), but having these weird vendor-specific
> hooks all the way up into the application protocol is just horrible.

I hoped for a transparent ddp offload as well, but I don't see how this
is possible.

> IETF has standardized a generic data placement protocol, which is
> part of iWarp. Even if folks don't like RDMA it exists to solve
> exactly these kinds of problems of data placement.

iWARP changes the wire protocol. Is your comment to just go make people
use iWARP instead of TCP? Or to extend NVMe/TCP to natively support DDP?
I think that the former is limiting, and the latter is unclear.

From what I understand, the offload engine uses the NVMe command-id as
the rkey (or stag) for ddp purposes.

> And if we can't
> arse folks into standard data placement methods we at least need it
> vendor independent and without hooks into the actual protocol
> driver.

That would be great, but what does a "vendor independent without hooks"
look like from your perspective? I'd love having this translate to
standard (and some new) socket operations, but I could not find a way
that this can be done given the current architecture.

Early on, I thought that enabling the queue offload could be modeled as
a setsockopt(), and nvme_tcp_setup_ddp() would be modeled as a new
recvmsg(MSG_DDP_BUFFER, iovec, tag), but where I got stuck was the whole
async teardown mechanism that the nic has. But if this is solvable, I
think such an interface is much better.

FWIW, I think that the benefit of this is worth having. I think that the
folks from NVIDIA are committed to supporting and evolving it.
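To make the socket-level idea sketched above a little more concrete, here is a rough sketch of what such an interface could look like. Everything in it is hypothetical: MSG_DDP_BUFFER, the SOL_ULP_DDP/ULP_DDP_ENABLE socket option and the cmsg layout do not exist anywhere today, and the sketch deliberately ignores the async teardown problem mentioned above.

```c
/*
 * Hypothetical socket-level DDP interface.  None of these flags,
 * sockopts or cmsg types exist; this only illustrates the shape of
 * the idea discussed above.
 */
#include <sys/socket.h>
#include <sys/uio.h>
#include <stdint.h>
#include <string.h>

#define SOL_ULP_DDP      300            /* hypothetical sockopt level */
#define ULP_DDP_ENABLE   1              /* hypothetical: offload this stream */
#define ULP_DDP_BUFFER   2              /* hypothetical cmsg type */
#define MSG_DDP_BUFFER   0x10000000     /* hypothetical recvmsg() flag */

struct ddp_buffer_req {                 /* hypothetical cmsg payload */
	uint32_t tag;                   /* e.g. the NVMe command_id */
};

/* "enabling the queue offload could be modeled as a setsockopt()" */
static int ddp_enable(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_ULP_DDP, ULP_DDP_ENABLE, &one, sizeof(one));
}

/* "nvme_tcp_setup_ddp() would be modeled as recvmsg(MSG_DDP_BUFFER, iovec, tag)" */
static int ddp_post_buffer(int fd, struct iovec *iov, int iovcnt, uint32_t tag)
{
	struct ddp_buffer_req req = { .tag = tag };
	char cbuf[CMSG_SPACE(sizeof(req))];
	struct msghdr msg = {
		.msg_iov        = iov,
		.msg_iovlen     = iovcnt,
		.msg_control    = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_ULP_DDP;
	cm->cmsg_type  = ULP_DDP_BUFFER;
	cm->cmsg_len   = CMSG_LEN(sizeof(req));
	memcpy(CMSG_DATA(cm), &req, sizeof(req));

	/* post the pages that payload tagged 'tag' should land in */
	return recvmsg(fd, &msg, MSG_DDP_BUFFER);
}
```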
On 31/05/2024 4:39, Jakub Kicinski wrote:
> On Wed, 29 May 2024 16:00:33 +0000 Aurelien Aptel wrote:
>> These offloads are similar in nature to the packet-based NIC TLS offloads,
>> which are already upstream (see net/tls/tls_device.c).
>> You can read more about TLS offload here:
>> https://www.kernel.org/doc/html/latest/networking/tls-offload.html
> Which I had to send a fix for blindly again today:
> https://lore.kernel.org/all/20240530232607.82686-1-kuba@kernel.org/
> because there is no software model and none of the vendors apparently
> cares enough to rigorously test it.
>
> I'm not joking with the continuous tests.

Are there any plans to address the testing concerns here?
On Mon, Jun 03, 2024 at 10:09:26AM +0300, Sagi Grimberg wrote:
>> IETF has standardized a generic data placement protocol, which is
>> part of iWarp. Even if folks don't like RDMA it exists to solve
>> exactly these kinds of problems of data placement.
>
> iWARP changes the wire protocol.

Compared to plain NVMe over TCP that's a bit of an understatement :)

> Is your comment to just go make people use iWARP instead of TCP?
> Or to extend NVMe/TCP to natively support DDP?

I don't know to be honest. In many ways just using RDMA instead of
NVMe/TCP would solve all the problems this is trying to solve, but
there are enough big customers that have religious concerns about
the use of RDMA.

So if people want to use something that looks non-RDMA but has the
same benefits, we have to reinvent it quite similarly under a different
name. Looking at DDP and what we can learn from it without bringing
the Verbs API along might be one way to do that.

Another would be to figure out what amount of similarity and what
amount of state we need in an on-the-wire protocol to have efficient
header splitting in the NIC, either hard coded or even better
downloadable using something like eBPF.

> That would be great, but what does a "vendor independent without hooks"
> look like from your perspective? I'd love having this translate to
> standard (and some new) socket operations, but I could not find a way
> that this can be done given the current architecture.

Any amount of calls into NIC/offload drivers from NVMe is a no-go.
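As a purely illustrative aside on the "downloadable" idea above: a per-stream header-split program could be as small as a parser that tells the NIC where the next payload starts and how long it is. No such hook exists today, and all structure and function names below are made up; only the NVMe/TCP common-header layout is taken from the spec.

```c
/*
 * Purely illustrative sketch of a downloadable header-split parser,
 * in the spirit of the eBPF idea above.  Nothing like this hook
 * exists; struct names and fields are hypothetical.
 */
struct hsplit_ctx {
	const unsigned char *hdr;	/* start of the next PDU in the stream */
	unsigned int hdr_avail;		/* contiguous bytes available to parse */
};

struct hsplit_result {
	unsigned int split_off;		/* bytes to keep in the header buffer */
	unsigned int payload_len;	/* bytes eligible for page-aligned placement */
};

static int nvme_tcp_hsplit(const struct hsplit_ctx *ctx,
			   struct hsplit_result *res)
{
	unsigned int plen;

	if (ctx->hdr_avail < 8)
		return -1;	/* need more bytes of the stream first */

	/*
	 * NVMe/TCP common header (8 bytes): pdu-type, flags, hlen, pdo,
	 * plen (32-bit little-endian, total PDU length).  Digests are
	 * ignored here for brevity.
	 */
	plen = ctx->hdr[4] | ctx->hdr[5] << 8 |
	       ctx->hdr[6] << 16 | (unsigned int)ctx->hdr[7] << 24;

	res->split_off   = ctx->hdr[3];		/* pdo: where the data begins */
	res->payload_len = plen - ctx->hdr[3];	/* data that can be split off */
	return 0;
}
```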
> Are there any plans to address the testing concerns here?
Yes. We are still discussing it internally.
On 10/06/2024 15:29, Christoph Hellwig wrote:
> On Mon, Jun 03, 2024 at 10:09:26AM +0300, Sagi Grimberg wrote:
>>> IETF has standardized a generic data placement protocol, which is
>>> part of iWarp. Even if folks don't like RDMA it exists to solve
>>> exactly these kinds of problems of data placement.
>> iWARP changes the wire protocol.
> Compared to plain NVMe over TCP that's a bit of an understatement :)

Yes :) The comment was that people want to use NVMe/TCP, and adding DDP
awareness inspired by iWARP would change the existing NVMe/TCP wire
protocol. This offload does not.

>> Is your comment to just go make people use iWARP instead of TCP?
>> Or to extend NVMe/TCP to natively support DDP?
> I don't know to be honest. In many ways just using RDMA instead of
> NVMe/TCP would solve all the problems this is trying to solve, but
> there are enough big customers that have religious concerns about
> the use of RDMA.
>
> So if people want to use something that looks non-RDMA but has the
> same benefits, we have to reinvent it quite similarly under a different
> name. Looking at DDP and what we can learn from it without bringing
> the Verbs API along might be one way to do that.
>
> Another would be to figure out what amount of similarity and what
> amount of state we need in an on-the-wire protocol to have efficient
> header splitting in the NIC, either hard coded or even better
> downloadable using something like eBPF.

From what I understand, this is what this offload is trying to do. It uses
the nvme command_id similar to how the read_stag is used in iWARP: it
tracks the NVMe/TCP PDUs to split headers from data transfers, and maps
the command_id to an internal MR for DMA purposes.

What I think you don't like about this is the interface that the offload
exposes to the TCP ulp driver (nvme-tcp in our case)?

>> That would be great, but what does a "vendor independent without hooks"
>> look like from your perspective? I'd love having this translate to
>> standard (and some new) socket operations, but I could not find a way
>> that this can be done given the current architecture.
> Any amount of calls into NIC/offload drivers from NVMe is a no-go.

Not following you here... *something* needs to program a buffer for DDP,
*something* needs to invalidate this buffer, and *something* needs to
declare a TCP stream as DDP-capable.

Unless what you're saying is that the interface needs to be generalized
to extend the standard socket operations (i.e.
[s|g]etsockopt/recvmsg/cmsghdr etc.)?
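For the sake of the discussion, the three operations enumerated above could be expressed as a small driver-facing ops table along these lines. The names and signatures below are hypothetical and are not the API proposed in this series; they only show the minimal amount of state involved.

```c
/*
 * Hypothetical sketch of the three hooks discussed above; names and
 * signatures are made up for illustration and do not reflect the
 * actual API proposed in this series.
 */
#include <linux/netdevice.h>
#include <linux/scatterlist.h>
#include <linux/types.h>
#include <net/sock.h>

struct ddp_buf {
	struct scatterlist *sgl;	/* destination pages for the payload */
	unsigned int nents;
	u32 tag;			/* e.g. the NVMe command_id */
};

struct ddp_offload_ops {
	/* declare a TCP stream as DDP capable */
	int  (*stream_enable)(struct net_device *dev, struct sock *sk);
	void (*stream_disable)(struct net_device *dev, struct sock *sk);

	/* program a buffer for DDP, keyed by the protocol-level tag */
	int  (*buf_setup)(struct net_device *dev, struct sock *sk,
			  struct ddp_buf *buf);

	/* invalidate the buffer; completion is async on real hardware */
	void (*buf_teardown)(struct net_device *dev, struct sock *sk, u32 tag);
};
```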
On Mon, Jun 10, 2024 at 05:30:34PM +0300, Sagi Grimberg wrote:
>> efficient header splitting in the NIC, either hard coded or even
>> better downloadable using something like eBPF.
>
> From what I understand, this is what this offload is trying to do. It uses
> the nvme command_id similar to how the read_stag is used in iWARP: it
> tracks the NVMe/TCP PDUs to split headers from data transfers, and maps
> the command_id to an internal MR for DMA purposes.
>
> What I think you don't like about this is the interface that the offload
> exposes to the TCP ulp driver (nvme-tcp in our case)?

I don't see why a memory registration is needed at all.

The by far biggest pain point when doing storage protocols (including
file systems) over IP based storage is the data copy on the receive
path because the payload is not aligned to a page boundary.

So we need to figure out a way that is as stateless as possible that
allows aligning the actual data payload on a page boundary in an
otherwise normal IP receive path.
On 11/06/2024 9:41, Christoph Hellwig wrote:
> On Mon, Jun 10, 2024 at 05:30:34PM +0300, Sagi Grimberg wrote:
>>> efficient header splitting in the NIC, either hard coded or even
>>> better downloadable using something like eBPF.
>> From what I understand, this is what this offload is trying to do. It uses
>> the nvme command_id similar to how the read_stag is used in iWARP: it
>> tracks the NVMe/TCP PDUs to split headers from data transfers, and maps
>> the command_id to an internal MR for DMA purposes.
>>
>> What I think you don't like about this is the interface that the offload
>> exposes to the TCP ulp driver (nvme-tcp in our case)?
> I don't see why a memory registration is needed at all.

I don't see how you can do it without memory registration.

> The by far biggest pain point when doing storage protocols (including
> file systems) over IP based storage is the data copy on the receive
> path because the payload is not aligned to a page boundary.
>
> So we need to figure out a way that is as stateless as possible that
> allows aligning the actual data payload on a page boundary in an
> otherwise normal IP receive path.

But the device gets payload from the network and needs a buffer to DMA
to. In order to DMA to the "correct" buffer it needs some sort of
pre-registration, expressed with a tag, that the device can infer by
some sort of stream inspection. The socket recv call from the ulp
happens at a later stage.

I am not sure I understand how the alignment assurance helps the NIC
DMA payload from the network to the "correct" buffer (i.e. userspace
doing an O_DIRECT read).
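A rough sketch of the ordering described above, reusing the hypothetical ops table from the earlier sketch; queue_offloaded_read() and send_command_pdu() are illustrative stand-ins, not the actual nvme-tcp code.

```c
/*
 * Illustrative flow of one offloaded read, using the hypothetical
 * ddp_offload_ops/ddp_buf sketched earlier.  Not the actual nvme-tcp
 * code; send_command_pdu() stands in for the normal PDU send path.
 */
static void send_command_pdu(struct sock *sk, u16 command_id);	/* hypothetical */

static int queue_offloaded_read(const struct ddp_offload_ops *ops,
				struct net_device *dev, struct sock *sk,
				struct scatterlist *sgl, unsigned int nents,
				u16 command_id)
{
	struct ddp_buf buf = {
		.sgl   = sgl,		/* host pages the payload should land in */
		.nents = nents,
		.tag   = command_id,	/* the tag the NIC matches on the wire */
	};
	int ret;

	/*
	 * 1) Program the destination buffer *before* the command goes out,
	 *    so that when the NIC parses a C2HData PDU carrying this
	 *    command_id it can DMA the payload straight into these pages.
	 */
	ret = ops->buf_setup(dev, sk, &buf);
	if (ret)
		return ret;		/* fall back to the regular copy path */

	/* 2) Send the NVMe command capsule PDU as usual. */
	send_command_pdu(sk, command_id);

	/*
	 * 3) Once the command completes, invalidate the mapping.  On real
	 *    hardware this teardown completes asynchronously, which is the
	 *    part that does not map cleanly onto a plain recvmsg() model.
	 */
	ops->buf_teardown(dev, sk, command_id);
	return 0;
}
```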
From: Christoph Hellwig
> Sent: 11 June 2024 07:42
>
> On Mon, Jun 10, 2024 at 05:30:34PM +0300, Sagi Grimberg wrote:
> >> efficient header splitting in the NIC, either hard coded or even
> >> better downloadable using something like eBPF.
> >
> > From what I understand, this is what this offload is trying to do. It uses
> > the nvme command_id similar to how the read_stag is used in iWARP: it
> > tracks the NVMe/TCP PDUs to split headers from data transfers, and maps
> > the command_id to an internal MR for DMA purposes.
> >
> > What I think you don't like about this is the interface that the offload
> > exposes to the TCP ulp driver (nvme-tcp in our case)?
>
> I don't see why a memory registration is needed at all.
>
> The by far biggest pain point when doing storage protocols (including
> file systems) over IP based storage is the data copy on the receive
> path because the payload is not aligned to a page boundary.

How much does the copy cost anyway?
If the hardware has merged the segments then it should be a single copy.

On x86 (does anyone care about anything else :-) 'rep movsb' with a
cache-line aligned destination runs at 64 bytes/clock.
(The source alignment doesn't matter at all.)

I guess it loads the source data into the D-cache; the target is
probably required anyway - or you wouldn't be doing a read.

	David

> So we need to figure out a way that is as stateless as possible that
> allows aligning the actual data payload on a page boundary in an
> otherwise normal IP receive path.
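As a back-of-envelope illustration of that figure (the clock speed and link rate are assumptions, not numbers from this thread): at 64 bytes/clock, a 3 GHz core streams roughly 192 GB/s with 'rep movsb', while a 100 Gb/s NIC delivers at most 12.5 GB/s of payload, so the copy itself costs on the order of 6-7% of one core at line rate. The cache footprint of touching the data twice is not counted in that estimate.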
Hi,

We have taken some time to review your documents and code and have had
several internal discussions regarding the CI topic. We truly appreciate
the benefits that a CI setup could bring. However, we believe that since
this feature primarily relies on nvme-tcp, it might achieve better
coverage and testing if integrated with blktests. Your design focuses on
the netdev layer, which we don't think is sufficient.

blktests/nvme is designed to test the entire nvme upstream
infrastructure, including nvme-tcp, and targets corner cases and bugs in
on-going development. Chaitanya, Shinichiro, Daniel and other
developers are actively developing blktests and running these tests in a
timely manner on the latest branch in the linux-nvme repo and the
for-next branch in the linux-block repo.

Again, we are open to providing NICs so that others can also test this
feature on an upstream kernel, to facilitate easier testing including
distros, as long as they are testing this feature on the upstream
kernel. This way we don't have to replicate the nvme-block storage
stack infra/tools/tests in a framework that is focused on netdev
development and yet achieve good coverage. What do you think?

Thanks
On Wed, 26 Jun 2024 18:21:54 +0300 Aurelien Aptel wrote:
> We have taken some time to review your documents and code and have had
> several internal discussions regarding the CI topic. We truly appreciate
> the benefits that a CI setup could bring. However, we believe that since
> this feature primarily relies on nvme-tcp, it might achieve better
> coverage and testing if integrated with blktests. Your design focuses on
> the netdev layer, which we don't think is sufficient.
>
> blktests/nvme is designed to test the entire nvme upstream
> infrastructure, including nvme-tcp, and targets corner cases and bugs in
> on-going development. Chaitanya, Shinichiro, Daniel and other
> developers are actively developing blktests and running these tests in a
> timely manner on the latest branch in the linux-nvme repo and the
> for-next branch in the linux-block repo.
>
> Again, we are open to providing NICs so that others can also test this
> feature on an upstream kernel, to facilitate easier testing including
> distros, as long as they are testing this feature on the upstream
> kernel. This way we don't have to replicate the nvme-block storage
> stack infra/tools/tests in a framework that is focused on netdev
> development and yet achieve good coverage. What do you think?

I'm not sure we're on the same page. The ask is to run the tests on
the netdev testing branch, at 12h cadence, and generate a simple JSON
file with results we can ingest into our reporting. Extra points for
reporting it to KCIDB. You mention "framework that is focused on
netdev", IDK what you mean.
Hi Jakub,

On 6/26/24 08:50, Jakub Kicinski wrote:
> On Wed, 26 Jun 2024 18:21:54 +0300 Aurelien Aptel wrote:
>> We have taken some time to review your documents and code and have had
>> several internal discussions regarding the CI topic. We truly appreciate
>> the benefits that a CI setup could bring. However, we believe that since
>> this feature primarily relies on nvme-tcp, it might achieve better
>> coverage and testing if integrated with blktests. Your design focuses on
>> the netdev layer, which we don't think is sufficient.
>>
>> blktests/nvme is designed to test the entire nvme upstream
>> infrastructure, including nvme-tcp, and targets corner cases and bugs in
>> on-going development. Chaitanya, Shinichiro, Daniel and other
>> developers are actively developing blktests and running these tests in a
>> timely manner on the latest branch in the linux-nvme repo and the
>> for-next branch in the linux-block repo.
>>
>> Again, we are open to providing NICs so that others can also test this
>> feature on an upstream kernel, to facilitate easier testing including
>> distros, as long as they are testing this feature on the upstream
>> kernel. This way we don't have to replicate the nvme-block storage
>> stack infra/tools/tests in a framework that is focused on netdev
>> development and yet achieve good coverage. What do you think?
> I'm not sure we're on the same page. The ask is to run the tests on
> the netdev testing branch, at 12h cadence, and generate a simple JSON
> file with results we can ingest into our reporting. Extra points for
> reporting it to KCIDB. You mention "framework that is focused on
> netdev", IDK what you mean.

Just to clarify, are you saying that you want us to:

1. Pull the latest changes from netdev current development branch
   every 12 hours.
2. Run blktests on the HEAD of that branch.
3. Gather those results into a JSON file.
4. Post it somewhere publicly accessible to you.

I didn't understand the "ingest into our reporting" part, can you
please clarify?

-ck
On Wed, 26 Jun 2024 19:34:50 +0000 Chaitanya Kulkarni wrote:
> > I'm not sure we're on the same page. The ask is to run the tests on
> > the netdev testing branch, at 12h cadence, and generate a simple JSON
> > file with results we can ingest into our reporting. Extra points for
> > reporting it to KCIDB. You mention "framework that is focused on
> > netdev", IDK what you mean.
>
> Just to clarify, are you saying that you want us to:
>
> 1. Pull the latest changes from netdev current development branch
>    every 12 hours.

Small but important difference - testing branch, not development branch.
There may be malicious code in that branch, since it's not fully
reviewed.

> 2. Run blktests on the HEAD of that branch.

Or some subset of blktests, if the whole run takes too long.

> 3. Gather those results into a JSON file.
> 4. Post it somewhere publicly accessible to you.

Yes.

> I didn't understand the "ingest into our reporting" part, can you
> please clarify?

You just need to publish the JSON files with results, periodically
(publish = expose somewhere we can fetch from over HTTP).
The ingestion part is on our end.
Jakub,

On 6/26/24 12:43, Jakub Kicinski wrote:
> On Wed, 26 Jun 2024 19:34:50 +0000 Chaitanya Kulkarni wrote:
>>> I'm not sure we're on the same page. The ask is to run the tests on
>>> the netdev testing branch, at 12h cadence, and generate a simple JSON
>>> file with results we can ingest into our reporting. Extra points for
>>> reporting it to KCIDB. You mention "framework that is focused on
>>> netdev", IDK what you mean.
>> Just to clarify, are you saying that you want us to:
>>
>> 1. Pull the latest changes from netdev current development branch
>>    every 12 hours.
> Small but important difference - testing branch, not development branch.
> There may be malicious code in that branch, since it's not fully
> reviewed.

Indeed, we will ask you for the right branch once we have things in
place ...

>> 2. Run blktests on the HEAD of that branch.
> Or some subset of blktests, if the whole run takes too long.

We need to target the nvme-tcp blktests; I believe we can't skip any of
those tests. If needed we can do some optimizations on the testing side
to speed things up ...

>> 3. Gather those results into a JSON file.
>> 4. Post it somewhere publicly accessible to you.
> Yes.

Sounds good ...

>> I didn't understand the "ingest into our reporting" part, can you
>> please clarify?
> You just need to publish the JSON files with results, periodically
> (publish = expose somewhere we can fetch from over HTTP).
> The ingestion part is on our end.

Thanks for the clarification, we will get back to you before the middle
of next week. We are actively building and running blktests, so we are
committed to testing.

My other question is: how can we make progress on getting this code
merged? (Assuming that we do all the above, but it will take some time
to build the infra.) IOW, can we get this code merged before we build
the above infrastructure?

If we can, which repo should it go through? nvme or netdev?

-ck
On Wed, 26 Jun 2024 20:10:41 +0000 Chaitanya Kulkarni wrote:
> My other question is: how can we make progress on getting this code
> merged? (Assuming that we do all the above, but it will take some time
> to build the infra.) IOW, can we get this code merged before we build
> the above infrastructure?
>
> If we can, which repo should it go through? nvme or netdev?

Let's wait for the testing.