Message ID | 20220928195510.165062-2-sagi@grimberg.me (mailing list archive)
---|---
State | New, archived
Series | [rfc] nvme: support io stats on the mpath device
Hi Sagi,

On 9/28/2022 10:55 PM, Sagi Grimberg wrote:
>> Our mpath stack device is just a shim that selects a bottom namespace and submits the bio to it without any fancy splitting. This also means that we don't clone the bio or have any context to the bio beyond submission. However it really sucks that we don't see the mpath device io stats.
>>
>> Given that the mpath device can't do that without adding some context to it, we let the bottom device do it on its behalf (somewhat similar to the approach taken in nvme_trace_bio_complete);

Can you please paste the output of the application that shows the benefit of this commit ?

> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/nvme/host/apple.c     |  2 +-
>  drivers/nvme/host/core.c      | 10 ++++++++++
>  drivers/nvme/host/fc.c        |  2 +-
>  drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
>  drivers/nvme/host/nvme.h      | 12 ++++++++++++
>  drivers/nvme/host/pci.c       |  2 +-
>  drivers/nvme/host/rdma.c      |  2 +-
>  drivers/nvme/host/tcp.c       |  2 +-
>  drivers/nvme/target/loop.c    |  2 +-
>  9 files changed, 46 insertions(+), 6 deletions(-)

Several questions:

1. I guess that for the non-mpath case we get this for free from the block layer for each bio ?
2. Now we have doubled the accounting, haven't we ?
3. Do you have some performance numbers (we're touching the fast path here) ?
4. Should we enable this by default ?

The implementation looks good.
> Hi Sagi,
>
> On 9/28/2022 10:55 PM, Sagi Grimberg wrote:
>> Our mpath stack device is just a shim that selects a bottom namespace and submits the bio to it without any fancy splitting. This also means that we don't clone the bio or have any context to the bio beyond submission. However it really sucks that we don't see the mpath device io stats.
>>
>> Given that the mpath device can't do that without adding some context to it, we let the bottom device do it on its behalf (somewhat similar to the approach taken in nvme_trace_bio_complete);
>
> Can you please paste the output of the application that shows the benefit of this commit ?

What do you mean? There is no noticeable effect on the application here. With this patch applied, /sys/block/nvmeXnY/stat is not zeroed out, so sysstat and friends can monitor IO stats, as well as other observability tools.

>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>>  drivers/nvme/host/apple.c     |  2 +-
>>  drivers/nvme/host/core.c      | 10 ++++++++++
>>  drivers/nvme/host/fc.c        |  2 +-
>>  drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
>>  drivers/nvme/host/nvme.h      | 12 ++++++++++++
>>  drivers/nvme/host/pci.c       |  2 +-
>>  drivers/nvme/host/rdma.c      |  2 +-
>>  drivers/nvme/host/tcp.c       |  2 +-
>>  drivers/nvme/target/loop.c    |  2 +-
>>  9 files changed, 46 insertions(+), 6 deletions(-)
>
> Several questions:
>
> 1. I guess that for the non-mpath case we get this for free from the block layer for each bio ?

blk-mq provides all IO stat accounting, hence it is on by default.

> 2. Now we have doubled the accounting, haven't we ?

Yes. But as I listed in the cover-letter, I've been getting complaints about how IO stats appear only for the hidden devices (blk-mq devices) and there is non-trivial logic to map that back to the mpath device, which can also depend on the path selection logic...

I think that this is very much justified; the observability experience sucks. IMO we should have done it since introducing nvme-multipath.

> 3. Do you have some performance numbers (we're touching the fast path here) ?

This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.

I don't have any performance numbers, other than on my laptop VM, which did not record any noticeable difference, and I don't expect it to.

> 4. Should we enable this by default ?

Yes. There is no reason why nvme-mpath should be the only block device that does not account and expose IO stats.
On 9/28/22 22:55, Sagi Grimberg wrote:
> Our mpath stack device is just a shim that selects a bottom namespace and submits the bio to it without any fancy splitting. This also means that we don't clone the bio or have any context to the bio beyond submission. However it really sucks that we don't see the mpath device io stats.
>
> Given that the mpath device can't do that without adding some context to it, we let the bottom device do it on its behalf (somewhat similar to the approach taken in nvme_trace_bio_complete);
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/nvme/host/apple.c     |  2 +-
>  drivers/nvme/host/core.c      | 10 ++++++++++
>  drivers/nvme/host/fc.c        |  2 +-
>  drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
>  drivers/nvme/host/nvme.h      | 12 ++++++++++++
>  drivers/nvme/host/pci.c       |  2 +-
>  drivers/nvme/host/rdma.c      |  2 +-
>  drivers/nvme/host/tcp.c       |  2 +-
>  drivers/nvme/target/loop.c    |  2 +-
>  9 files changed, 46 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
> index 5fc5ea196b40..6df4b8a5d8ab 100644
> --- a/drivers/nvme/host/apple.c
> +++ b/drivers/nvme/host/apple.c
> @@ -763,7 +763,7 @@ static blk_status_t apple_nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		goto out_free_cmd;
>  	}
>
> -	blk_mq_start_request(req);
> +	nvme_start_request(req);
>  	apple_nvme_submit_cmd(q, cmnd);
>  	return BLK_STS_OK;
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 9bacfd014e3d..f42e6e40d84b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -385,6 +385,8 @@ static inline void nvme_end_req(struct request *req)
>  	nvme_end_req_zoned(req);
>  	nvme_trace_bio_complete(req);
>  	blk_mq_end_request(req, status);
> +	if (req->cmd_flags & REQ_NVME_MPATH)
> +		nvme_mpath_end_request(req);

I guess the order should probably be reversed, because after blk_mq_end_request() the request may become invalid and create a UAF?
On 9/29/2022 12:59 PM, Sagi Grimberg wrote:
>
>> Hi Sagi,
>>
>> On 9/28/2022 10:55 PM, Sagi Grimberg wrote:
>>> Our mpath stack device is just a shim that selects a bottom namespace and submits the bio to it without any fancy splitting. This also means that we don't clone the bio or have any context to the bio beyond submission. However it really sucks that we don't see the mpath device io stats.
>>>
>>> Given that the mpath device can't do that without adding some context to it, we let the bottom device do it on its behalf (somewhat similar to the approach taken in nvme_trace_bio_complete);
>>
>> Can you please paste the output of the application that shows the benefit of this commit ?
>
> What do you mean? There is no noticeable effect on the application here. With this patch applied, /sys/block/nvmeXnY/stat is not zeroed out, so sysstat and friends can monitor IO stats, as well as other observability tools.

I meant the output of the iostat application/tool. This will show us the double accounting I mentioned below.

I guess it's the same situation we have today with /dev/dm-0 and its underlying devices, /dev/sdb and /dev/sdc for example. This should be explained IMO.

>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>> ---
>>>  drivers/nvme/host/apple.c     |  2 +-
>>>  drivers/nvme/host/core.c      | 10 ++++++++++
>>>  drivers/nvme/host/fc.c        |  2 +-
>>>  drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
>>>  drivers/nvme/host/nvme.h      | 12 ++++++++++++
>>>  drivers/nvme/host/pci.c       |  2 +-
>>>  drivers/nvme/host/rdma.c      |  2 +-
>>>  drivers/nvme/host/tcp.c       |  2 +-
>>>  drivers/nvme/target/loop.c    |  2 +-
>>>  9 files changed, 46 insertions(+), 6 deletions(-)
>>
>> Several questions:
>>
>> 1. I guess that for the non-mpath case we get this for free from the block layer for each bio ?
>
> blk-mq provides all IO stat accounting, hence it is on by default.
>
>> 2. Now we have doubled the accounting, haven't we ?
>
> Yes. But as I listed in the cover-letter, I've been getting complaints about how IO stats appear only for the hidden devices (blk-mq devices) and there is non-trivial logic to map that back to the mpath device, which can also depend on the path selection logic...
>
> I think that this is very much justified; the observability experience sucks. IMO we should have done it since introducing nvme-multipath.
>
>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>
> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>
> I don't have any performance numbers, other than on my laptop VM, which did not record any noticeable difference, and I don't expect it to.
>
>> 4. Should we enable this by default ?
>
> Yes. There is no reason why nvme-mpath should be the only block device that does not account and expose IO stats.
On Thu, Sep 29, 2022 at 12:59:46PM +0300, Sagi Grimberg wrote:
> > 3. Do you have some performance numbers (we're touching the fast path here) ?
>
> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.

It does add up, though, and some environments disable stats to skip the overhead. At a minimum, you need to add a check for blk_queue_io_stat() before assuming you need to account for stats.

Instead of duplicating the accounting, could you just have the stats file report the sum of its hidden devices?
On 9/29/22 3:59 AM, Sagi Grimberg wrote:
>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>
> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.

Is it? Enabling IO stats for normal devices has a very noticeable impact on performance at the higher end of the scale. So much so that I've contemplated how we can make this less expensive than it currently is.
On 9/29/22 4:04 AM, Sagi Grimberg wrote:
> index 9bacfd014e3d..f42e6e40d84b 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -385,6 +385,8 @@ static inline void nvme_end_req(struct request *req)
>>  	nvme_end_req_zoned(req);
>>  	nvme_trace_bio_complete(req);
>>  	blk_mq_end_request(req, status);
>> +	if (req->cmd_flags & REQ_NVME_MPATH)
>> +		nvme_mpath_end_request(req);
>
> I guess the order should probably be reversed, because after blk_mq_end_request() the request may become invalid and create a UAF?

Yes - blk_mq_end_request() will put the tag, and it could be reused by the time you call nvme_mpath_end_request(). It won't be a UAF as the requests are allocated upfront and never freed, but the state will be uncertain at that point.
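A minimal sketch of the reordering being discussed, against the hunk quoted above (this is an illustration, not the posted patch; the earlier lines of nvme_end_req() are abbreviated):

static inline void nvme_end_req(struct request *req)
{
	blk_status_t status = nvme_error_status(nvme_req(req)->status);

	/* ... zone append / trace handling as in the hunk above ... */
	nvme_end_req_zoned(req);
	nvme_trace_bio_complete(req);
	/* account while the request (and its tag) is still guaranteed valid */
	if (req->cmd_flags & REQ_NVME_MPATH)
		nvme_mpath_end_request(req);
	blk_mq_end_request(req, status);
}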
>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>
>> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>
> It does add up, though, and some environments disable stats to skip the overhead. At a minimum, you need to add a check for blk_queue_io_stat() before assuming you need to account for stats.
>
> Instead of duplicating the accounting, could you just have the stats file report the sum of its hidden devices?

Interesting...

How do you suggest we do that? .collect_stats() callout in fops?
>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>
>> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>
> Is it? Enabling IO stats for normal devices has a very noticeable impact on performance at the higher end of the scale.

Interesting, I didn't think this would be that noticeable. How much would you quantify the impact in terms of %?

I don't have any insight on this for blk-mq, probably because I've never seen any user turn IO stats off (or at least don't remember).

My (very limited) testing did not show any noticeable differences for nvme-loop. All I'm saying is that we need to have IO stats for the mpath device node. If there is a clever way to collect this from the hidden devices just for nvme, great, but we need to expose these stats.

> So much so that I've contemplated how we can make this less expensive than it currently is.

Then nvme-mpath would benefit from that as well.
>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>
>> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>
> It does add up, though, and some environments disable stats to skip the overhead. At a minimum, you need to add a check for blk_queue_io_stat() before assuming you need to account for stats.

But QUEUE_FLAG_IO_STAT is set by nvme-mpath itself? You mean disable IO stats at runtime?
On 9/29/22 10:25 AM, Sagi Grimberg wrote:
>
>>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>>
>>> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>>
>> Is it? Enabling IO stats for normal devices has a very noticeable impact on performance at the higher end of the scale.
>
> Interesting, I didn't think this would be that noticeable. How much would you quantify the impact in terms of %?

If we take it to the extreme - my usual peak benchmark, which is drive limited at 122M IOPS, runs at 113M IOPS if I have iostats enabled. If I lower the queue depth (128 -> 16), then peak goes from 46M to 44M. Not as dramatic, but still quite noticeable. This is just using a single thread on a single CPU core per drive, so not throwing tons of CPU at it.

Now, I have no idea how well nvme multipath currently scales or works. Would be interesting to test that separately. But if you were to double (or more, I guess 3x if you're doing the exposed device and then adding stats to at least two below?) the overhead, that'd certainly not be free.

> I don't have any insight on this for blk-mq, probably because I've never seen any user turn IO stats off (or at least don't remember).

Most people don't care, but some certainly do. As per the above, it's noticeable enough that it makes a difference if you're chasing latencies or peak performance.

> My (very limited) testing did not show any noticeable differences for nvme-loop. All I'm saying is that we need to have IO stats for the mpath device node. If there is a clever way to collect this from the hidden devices just for nvme, great, but we need to expose these stats.

From a previous message, sounds like that's just some qemu setup? Hard to measure anything there with precision in my experience, and it's not really peak performance territory either.

>> So much so that I've contemplated how we can make this less expensive than it currently is.
>
> Then nvme-mpath would benefit from that as well.

Yeah, it'd be a win all around for sure...
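(For reference, the figures above work out to roughly (122 - 113) / 122 ≈ 7% overhead at queue depth 128 and (46 - 44) / 46 ≈ 4% at queue depth 16.)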
On Thu, Sep 29, 2022 at 07:32:39PM +0300, Sagi Grimberg wrote:
> > > > 3. Do you have some performance numbers (we're touching the fast path here) ?
> > >
> > > This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
> >
> > It does add up, though, and some environments disable stats to skip the overhead. At a minimum, you need to add a check for blk_queue_io_stat() before assuming you need to account for stats.
>
> But QUEUE_FLAG_IO_STAT is set by nvme-mpath itself? You mean disable IO stats at runtime?

Yes, the user can disable it at any time. That actually makes things a bit tricky since it can be enabled at the start of an IO and disabled by the time it completes.
On Thu, Sep 29, 2022 at 07:14:23PM +0300, Sagi Grimberg wrote:
> > > > 3. Do you have some performance numbers (we're touching the fast path here) ?
> > >
> > > This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
> >
> > It does add up, though, and some environments disable stats to skip the overhead. At a minimum, you need to add a check for blk_queue_io_stat() before assuming you need to account for stats.
> >
> > Instead of duplicating the accounting, could you just have the stats file report the sum of its hidden devices?
>
> Interesting...
>
> How do you suggest we do that? .collect_stats() callout in fops?

Maybe, yeah. I think we'd need something to enumerate the HIDDEN disks that make up the multipath device. Only the low-level driver can do that right now, so perhaps either call into the driver to get all the block_device parts, or the gendisk needs to maintain a list of those parts itself.
>>>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>>>
>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>>>
>>> It does add up, though, and some environments disable stats to skip the overhead. At a minimum, you need to add a check for blk_queue_io_stat() before assuming you need to account for stats.
>>
>> But QUEUE_FLAG_IO_STAT is set by nvme-mpath itself? You mean disable IO stats at runtime?
>
> Yes, the user can disable it at any time. That actually makes things a bit tricky since it can be enabled at the start of an IO and disabled by the time it completes.

That is what blk_do_io_stat is for; we can definitely export that.
>>>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>>>
>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>>>
>>> It does add up, though, and some environments disable stats to skip the overhead. At a minimum, you need to add a check for blk_queue_io_stat() before assuming you need to account for stats.
>>>
>>> Instead of duplicating the accounting, could you just have the stats file report the sum of its hidden devices?
>>
>> Interesting...
>>
>> How do you suggest we do that? .collect_stats() callout in fops?
>
> Maybe, yeah. I think we'd need something to enumerate the HIDDEN disks that make up the multipath device. Only the low-level driver can do that right now, so perhaps either call into the driver to get all the block_device parts, or the gendisk needs to maintain a list of those parts itself.

I definitely don't think we want to propagate the device relationship to blk-mq. But a callback to the driver also seems very niche to nvme multipath, and it is also kinda messy to combine calculations like iops/bw/latency accurately, which depends on the submission distribution to the bottom devices, which we would need to track now.

I'm leaning towards just moving forward with this, taking the relatively small hit, and if people absolutely care about the extra latency, then they can disable it altogether (upper and/or bottom devices).
On 9/30/22 03:08, Jens Axboe wrote:
> On 9/29/22 10:25 AM, Sagi Grimberg wrote:
>>
>>>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>>>
>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>>>
>>> Is it? Enabling IO stats for normal devices has a very noticeable impact on performance at the higher end of the scale.
>>
>> Interesting, I didn't think this would be that noticeable. How much would you quantify the impact in terms of %?
>
> If we take it to the extreme - my usual peak benchmark, which is drive limited at 122M IOPS, runs at 113M IOPS if I have iostats enabled. If I lower the queue depth (128 -> 16), then peak goes from 46M to 44M. Not as dramatic, but still quite noticeable. This is just using a single thread on a single CPU core per drive, so not throwing tons of CPU at it.
>
> Now, I have no idea how well nvme multipath currently scales or works.

Should be pretty scalable and efficient. There is no bio cloning, and the only shared state is an srcu wrapping the submission path and path lookup.

> Would be interesting to test that separately. But if you were to double (or more, I guess 3x if you're doing the exposed device and then adding stats to at least two below?) the overhead, that'd certainly not be free.

It is not 3x. In the patch, nvme-multipath accounts separately from the bottom devices, so each request is accounted once for the bottom device and once for the upper device.

But again, my working assumption is that IO stats must be exposed for a nvme-multipath device (unless the user disabled them). So it is a matter of whether we take a simple approach, where nvme-multipath does "double" accounting, or we come up with a scheme that allows the driver to collect stats on behalf of the block layer and then add non-trivial logic to combine stats like iops/bw/latency accurately from the bottom devices.

My vote would be to go with the former.

>> I don't have any insight on this for blk-mq, probably because I've never seen any user turn IO stats off (or at least don't remember).
>
> Most people don't care, but some certainly do. As per the above, it's noticeable enough that it makes a difference if you're chasing latencies or peak performance.
>
>> My (very limited) testing did not show any noticeable differences for nvme-loop. All I'm saying is that we need to have IO stats for the mpath device node. If there is a clever way to collect this from the hidden devices just for nvme, great, but we need to expose these stats.
>
> From a previous message, sounds like that's just some qemu setup? Hard to measure anything there with precision in my experience, and it's not really peak performance territory either.

It's not qemu, it is null_blk exported over nvme-loop (an nvmet loop device). So it is faster, but definitely not something that can provide insight in the realm of real HW.
>> index 9bacfd014e3d..f42e6e40d84b 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -385,6 +385,8 @@ static inline void nvme_end_req(struct request *req)
>>>  	nvme_end_req_zoned(req);
>>>  	nvme_trace_bio_complete(req);
>>>  	blk_mq_end_request(req, status);
>>> +	if (req->cmd_flags & REQ_NVME_MPATH)
>>> +		nvme_mpath_end_request(req);
>>
>> I guess the order should probably be reversed, because after blk_mq_end_request() the request may become invalid and create a UAF?
>
> Yes - blk_mq_end_request() will put the tag, and it could be reused by the time you call nvme_mpath_end_request(). It won't be a UAF as the requests are allocated upfront and never freed, but the state will be uncertain at that point.

Will reverse that...
On 10/3/22 11:02, Sagi Grimberg wrote:
>
>>>>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>>>>
>>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by preemption disable. This is a very small price to pay for what we gain.
>>>>
>>>> It does add up, though, and some environments disable stats to skip the overhead. At a minimum, you need to add a check for blk_queue_io_stat() before assuming you need to account for stats.
>>>
>>> But QUEUE_FLAG_IO_STAT is set by nvme-mpath itself? You mean disable IO stats at runtime?
>>
>> Yes, the user can disable it at any time. That actually makes things a bit tricky since it can be enabled at the start of an IO and disabled by the time it completes.
>
> That is what blk_do_io_stat is for; we can definitely export that.

Actually, blk_do_io_stat refers to the bottom request queue. You're right, we can do the same trick for nvme.
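A rough sketch of what that trick could look like for nvme-mpath, assuming the decision is sampled on the mpath (head) queue at start time and remembered in the request; the NVME_MPATH_IO_STATS flag name here is made up for illustration:

void nvme_mpath_start_request(struct request *rq)
{
	struct nvme_ns *ns = rq->q->queuedata;
	struct gendisk *disk = ns->head->disk;

	/* sample the mpath queue's io_stat setting once, at start time */
	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
		return;

	nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;	/* hypothetical flag */
	nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0,
					blk_rq_bytes(rq) >> SECTOR_SHIFT,
					req_op(rq), jiffies);
}

void nvme_mpath_end_request(struct request *rq)
{
	struct nvme_ns *ns = rq->q->queuedata;

	/* only account if accounting was started, even if io_stat was toggled since */
	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
		return;
	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
			 nvme_req(rq)->start_time);
}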
On Mon, Oct 03, 2022 at 11:09:06AM +0300, Sagi Grimberg wrote:
>> make up the multipath device. Only the low-level driver can do that right now, so perhaps either call into the driver to get all the block_device parts, or the gendisk needs to maintain a list of those parts itself.
>
> I definitely don't think we want to propagate the device relationship to blk-mq. But a callback to the driver also seems very niche to nvme multipath, and it is also kinda messy to combine calculations like iops/bw/latency accurately, which depends on the submission distribution to the bottom devices, which we would need to track now.
>
> I'm leaning towards just moving forward with this, taking the relatively small hit, and if people absolutely care about the extra latency, then they can disable it altogether (upper and/or bottom devices).

So looking at the patches, I'm really not a big fan of the extra accounting calls, and especially the start_time field in the nvme_request, and even more so the special start/end calls in all the transport drivers.

The stats sysfs attributes already have entirely separate blk-mq vs bio based code paths. So I think having a block_device operation that replaces part_stat_read_all, which allows nvme to iterate over all paths and collect the numbers, would seem a lot nicer. There might be some caveats, like having to stash away the numbers for disappearing paths, though.
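To make the suggestion concrete, a very rough sketch of such a callout on the nvme-mpath side might look like the following. The function name, its hookup into the stat files, and the assumption that part_stat_read_all() would be made available to drivers are all hypothetical, and the "disappearing paths" caveat is ignored here:

static void nvme_ns_head_collect_stats(struct gendisk *disk,
				       struct disk_stats *stat)
{
	struct nvme_ns_head *head = disk->private_data;
	struct disk_stats tmp;
	struct nvme_ns *ns;
	int srcu_idx, grp;

	memset(stat, 0, sizeof(*stat));
	srcu_idx = srcu_read_lock(&head->srcu);
	list_for_each_entry_rcu(ns, &head->list, siblings) {
		/* sum the per-path (hidden disk) counters into the mpath view */
		part_stat_read_all(ns->disk->part0, &tmp);
		for (grp = 0; grp < NR_STAT_GROUPS; grp++) {
			stat->ios[grp]     += tmp.ios[grp];
			stat->sectors[grp] += tmp.sectors[grp];
			stat->merges[grp]  += tmp.merges[grp];
			stat->nsecs[grp]   += tmp.nsecs[grp];
		}
		stat->io_ticks += tmp.io_ticks;
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
}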
>>> make up the multipath device. Only the low-level driver can do that right now, so perhaps either call into the driver to get all the block_device parts, or the gendisk needs to maintain a list of those parts itself.
>>
>> I definitely don't think we want to propagate the device relationship to blk-mq. But a callback to the driver also seems very niche to nvme multipath, and it is also kinda messy to combine calculations like iops/bw/latency accurately, which depends on the submission distribution to the bottom devices, which we would need to track now.
>>
>> I'm leaning towards just moving forward with this, taking the relatively small hit, and if people absolutely care about the extra latency, then they can disable it altogether (upper and/or bottom devices).
>
> So looking at the patches, I'm really not a big fan of the extra accounting calls, and especially the start_time field in the nvme_request

Don't love it either.

> and even more so the special start/end calls in all the transport drivers.

The end is centralized and the start part is not sprinkled across the drivers. I don't think it's bad.

> The stats sysfs attributes already have entirely separate blk-mq vs bio based code paths. So I think having a block_device operation that replaces part_stat_read_all, which allows nvme to iterate over all paths and collect the numbers, would seem a lot nicer. There might be some caveats, like having to stash away the numbers for disappearing paths, though.

You think this is better? Really? I don't agree with you. I think it's better to pay a small cost than to do this very specialized thing that will only ever be used for nvme-mpath.
On Tue, Oct 25, 2022 at 06:58:19PM +0300, Sagi Grimberg wrote:
>> and even more so the special start/end calls in all the transport drivers.
>
> The end is centralized and the start part is not sprinkled across the drivers. I don't think it's bad.

Well. We need a new magic helper instead of blk_mq_start_request, and a new call to nvme_mpath_end_request in the lower driver, to support functionality in the multipath driver that sits above them. This is because of the hack of storing the start_time in the nvme_request, which is really owned by the lower driver, and that is quite a bit of a layering violation. If the multipath driver simply did the start and end itself, things would be a lot better.

The upside of that would be that it also accounts for the (tiny) overhead of the mpath driver. The big downside would be that we'd have to allocate memory just for the start_time, as nvme-multipath has no per-I/O data structure of its own.

In a way it would be nice to just have a start_time in the bio, which would clean up the interface a lot and also massively simplify the I/O accounting in md. But Jens might not be willing to grow the bio for this special case, even if some things in the bio seem even more obscure.

>> The stats sysfs attributes already have entirely separate blk-mq vs bio based code paths. So I think having a block_device operation that replaces part_stat_read_all, which allows nvme to iterate over all paths and collect the numbers, would seem a lot nicer. There might be some caveats, like having to stash away the numbers for disappearing paths, though.
>
> You think this is better? Really? I don't agree with you. I think it's better to pay a small cost than to do this very specialized thing that will only ever be used for nvme-mpath.

Yes, I think a callout at least conceptually would be much better.
diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index 5fc5ea196b40..6df4b8a5d8ab 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -763,7 +763,7 @@ static blk_status_t apple_nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		goto out_free_cmd;
 	}
 
-	blk_mq_start_request(req);
+	nvme_start_request(req);
 	apple_nvme_submit_cmd(q, cmnd);
 	return BLK_STS_OK;
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9bacfd014e3d..f42e6e40d84b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -385,6 +385,8 @@ static inline void nvme_end_req(struct request *req)
 	nvme_end_req_zoned(req);
 	nvme_trace_bio_complete(req);
 	blk_mq_end_request(req, status);
+	if (req->cmd_flags & REQ_NVME_MPATH)
+		nvme_mpath_end_request(req);
 }
 
 void nvme_complete_rq(struct request *req)
@@ -419,6 +421,14 @@ void nvme_complete_rq(struct request *req)
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
+void nvme_start_request(struct request *rq)
+{
+	if (rq->cmd_flags & REQ_NVME_MPATH)
+		nvme_mpath_start_request(rq);
+	blk_mq_start_request(rq);
+}
+EXPORT_SYMBOL_GPL(nvme_start_request);
+
 void nvme_complete_batch_req(struct request *req)
 {
 	trace_nvme_complete_rq(req);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 127abaf9ba5d..2cdcc7f5d0a9 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2744,7 +2744,7 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 	atomic_set(&op->state, FCPOP_STATE_ACTIVE);
 
 	if (!(op->flags & FCOP_FLAGS_AEN))
-		blk_mq_start_request(op->rq);
+		nvme_start_request(op->rq);
 
 	cmdiu->csn = cpu_to_be32(atomic_inc_return(&queue->csn));
 	ret = ctrl->lport->ops->fcp_io(&ctrl->lport->localport,
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 6ef497c75a16..9d2ff9ed8c6c 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -114,6 +114,23 @@ void nvme_failover_req(struct request *req)
 	kblockd_schedule_work(&ns->head->requeue_work);
 }
 
+void nvme_mpath_start_request(struct request *rq)
+{
+	struct nvme_ns *ns = rq->q->queuedata;
+
+	nvme_req(rq)->start_time = bdev_start_io_acct(ns->head->disk->part0,
+					blk_rq_bytes(rq) >> SECTOR_SHIFT,
+					req_op(rq), jiffies);
+}
+
+void nvme_mpath_end_request(struct request *rq)
+{
+	struct nvme_ns *ns = rq->q->queuedata;
+
+	bdev_end_io_acct(ns->head->disk->part0,
+			 req_op(rq), nvme_req(rq)->start_time);
+}
+
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
@@ -501,6 +518,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, head->disk->queue);
 	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, head->disk->queue);
+	blk_queue_flag_set(QUEUE_FLAG_IO_STAT, head->disk->queue);
 	/*
 	 * This assumes all controllers that refer to a namespace either
 	 * support poll queues or not.  That is not a strict guarantee,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2d5d44a73f26..08f94c404cd8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -162,6 +162,9 @@ struct nvme_request {
 	u8			retries;
 	u8			flags;
 	u16			status;
+#ifdef CONFIG_NVME_MULTIPATH
+	unsigned long		start_time;
+#endif
 	struct nvme_ctrl	*ctrl;
 };
 
@@ -753,6 +756,7 @@ static inline enum req_op nvme_req_op(struct nvme_command *cmd)
 }
 
 #define NVME_QID_ANY -1
+void nvme_start_request(struct request *rq);
 void nvme_init_request(struct request *req, struct nvme_command *cmd);
 void nvme_cleanup_cmd(struct request *req);
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req);
@@ -861,6 +865,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
 void nvme_mpath_revalidate_paths(struct nvme_ns *ns);
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
 void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
+void nvme_mpath_start_request(struct request *rq);
+void nvme_mpath_end_request(struct request *rq);
 
 static inline void nvme_trace_bio_complete(struct request *req)
 {
@@ -946,6 +952,12 @@ static inline void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
 static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 {
 }
+static inline void nvme_mpath_start_request(struct request *rq)
+{
+}
+static inline void nvme_mpath_end_request(struct request *rq)
+{
+}
 #endif /* CONFIG_NVME_MULTIPATH */
 
 int nvme_revalidate_zones(struct nvme_ns *ns);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3bdb97205699..e898b9e4e6e0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -929,7 +929,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 		goto out_unmap_data;
 	}
 
-	blk_mq_start_request(req);
+	nvme_start_request(req);
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 8e52d2362fa1..ab9d5a17704b 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2089,7 +2089,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		goto unmap_qe;
 
-	blk_mq_start_request(rq);
+	nvme_start_request(rq);
 
 	if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
 	    queue->pi_support &&
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 2524b5304bfb..a1df405de7f1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2461,7 +2461,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (unlikely(ret))
 		return ret;
 
-	blk_mq_start_request(rq);
+	nvme_start_request(rq);
 
 	nvme_tcp_queue_request(req, true, bd->last);
 
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9750a7fca268..c327615decc2 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -145,7 +145,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
-	blk_mq_start_request(req);
+	nvme_start_request(req);
 	iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
 	iod->req.port = queue->ctrl->port;
 	if (!nvmet_req_init(&iod->req, &queue->nvme_cq,
Our mpath stack device is just a shim that selects a bottom namespace and submits the bio to it without any fancy splitting. This also means that we don't clone the bio or have any context to the bio beyond submission. However it really sucks that we don't see the mpath device io stats.

Given that the mpath device can't do that without adding some context to it, we let the bottom device do it on its behalf (somewhat similar to the approach taken in nvme_trace_bio_complete);

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/apple.c     |  2 +-
 drivers/nvme/host/core.c      | 10 ++++++++++
 drivers/nvme/host/fc.c        |  2 +-
 drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
 drivers/nvme/host/nvme.h      | 12 ++++++++++++
 drivers/nvme/host/pci.c       |  2 +-
 drivers/nvme/host/rdma.c      |  2 +-
 drivers/nvme/host/tcp.c       |  2 +-
 drivers/nvme/target/loop.c    |  2 +-
 9 files changed, 46 insertions(+), 6 deletions(-)
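For context, the existing helper the commit message points to follows the same "bottom device acts on behalf of the mpath device" pattern; it looks roughly like this in drivers/nvme/host/nvme.h of that era (quoted from memory, shown only for comparison):

static inline void nvme_trace_bio_complete(struct request *req)
{
	struct nvme_ns *ns = req->q->queuedata;

	/*
	 * The bottom device completes the request, but the block trace
	 * event is attributed to the mpath (head) disk's queue.
	 */
	if (req->cmd_flags & REQ_NVME_MPATH && req->bio)
		trace_block_bio_complete(ns->head->disk->queue, req->bio);
}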