Message ID: 20210208085243.82367-10-jefflexu@linux.alibaba.com (mailing list archive)
State: New, archived
Series: dm: support IO polling
On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
> DM will iterate and poll all polling hardware queues of all target mq
> devices when polling IO for dm device. To mitigate the race introduced
> by iterating all target hw queues, a per-hw-queue flag is maintained

What is the per-hw-queue flag?

> to indicate whether this polling hw queue currently being polled on or
> not. Every polling hw queue is exclusive to one polling instance, i.e.,
> the polling instance will skip this polling hw queue if this hw queue
> currently is being polled by another polling instance, and start
> polling on the next hw queue.

I don't see such a skip in dm_poll_one_dev(), in which
queue_for_each_poll_hw_ctx() is called directly for polling all POLL
hctxs of the request queue, so can you explain a bit more about this
skip mechanism?

Even if such skipping is implemented, I'm not sure good performance can
be reached, because hctx polling may be done in a ping-pong style among
several CPUs. But a blk-mq hctx is supposed to have its own CPU
affinities.
On 2/9/21 11:11 AM, Ming Lei wrote:
> On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
>> DM will iterate and poll all polling hardware queues of all target mq
>> devices when polling IO for dm device. To mitigate the race introduced
>> by iterating all target hw queues, a per-hw-queue flag is maintained
>
> What is the per-hw-queue flag?

Sorry, I forgot to update the commit message as the implementation
changed. Actually this mechanism is implemented by patch 10 of this
patch set.

>> to indicate whether this polling hw queue currently being polled on or
>> not. Every polling hw queue is exclusive to one polling instance, i.e.,
>> the polling instance will skip this polling hw queue if this hw queue
>> currently is being polled by another polling instance, and start
>> polling on the next hw queue.
>
> I don't see such a skip in dm_poll_one_dev(), in which
> queue_for_each_poll_hw_ctx() is called directly for polling all POLL
> hctxs of the request queue, so can you explain a bit more about this
> skip mechanism?

It is implemented in patch 10 of this patch set. When spin_trylock()
fails, the polling instance will return immediately, instead of busy
waiting.

> Even if such skipping is implemented, I'm not sure good performance can
> be reached, because hctx polling may be done in a ping-pong style among
> several CPUs. But a blk-mq hctx is supposed to have its own CPU
> affinities.

Yes, the mechanism of iterating all hw queues can make the competition
worse.

If every underlying data device has **only** one polling hw queue, then
this ping-pong style polling still exists even if we implement a
split-bio tracking mechanism, i.e., recording the specific hw queue each
bio was enqueued into, because multiple polling instances have to
compete for the only polling hw queue.

But if multiple polling hw queues per device are reserved for multiple
polling instances (e.g., every underlying data device has 3 polling hw
queues when there are 3 polling instances), just as we practice for mq
polling, then the current implementation of iterating all hw queues will
indeed work in a ping-pong style, while this issue would not exist if an
accurate split-bio tracking mechanism could be implemented.

As for the performance, I cite the test results here, as summarized in
the cover letter
(https://lore.kernel.org/io-uring/20210208085243.82367-1-jefflexu@linux.alibaba.com/):

            | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
----------- | --------------- | -------------------- | ----
without opt | 318k            | 256k                 | ~-20%
with opt    | 314k            | 354k                 | ~13%

Here 'opt' refers to the optimization of patch 10, i.e., the skipping
mechanism. There are 3 polling instances (i.e., 3 CPUs) in this test
case.

Indeed the current implementation of iterating all hw queues is some
sort of compromise, as I found it really difficult to implement accurate
split-bio tracking while achieving high performance at the same time.
Thus I turned to optimizing the original implementation of iterating all
hw queues, as in patches 10 and 11.
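[Editor's note] The trylock-based skipping described above can be illustrated with a small user-space sketch. This is only an analogy, not the kernel code: `struct hw_queue`, `poll_hw_queue()` and `poll_all_queues()` are invented names, and a C11 `atomic_flag` stands in for the `spin_trylock()` of patch 10. The point it shows is that a polling instance that fails to take a queue's lock moves on to the next queue instead of busy waiting.

```c
#include <assert.h>
#include <stdatomic.h>

struct hw_queue {
	atomic_flag busy;	/* set while some instance is polling this queue */
	int completed;		/* completions waiting to be reaped */
};

/* Reap completions from one queue; returns how many were found. */
static int poll_hw_queue(struct hw_queue *q)
{
	int found = q->completed;

	q->completed = 0;
	return found;
}

/*
 * Iterate all polling hw queues, skipping any queue that another
 * polling instance currently holds, instead of busy waiting on it.
 */
int poll_all_queues(struct hw_queue *queues, int nr)
{
	int i, count = 0;

	for (i = 0; i < nr; i++) {
		/* the analogue of spin_trylock() failing: skip this queue */
		if (atomic_flag_test_and_set(&queues[i].busy))
			continue;
		count += poll_hw_queue(&queues[i]);
		atomic_flag_clear(&queues[i].busy);
	}
	return count;
}
```

A queue held by another instance is simply left alone; its completions are reaped by whichever instance holds it, which is exactly why the skipped instance can return immediately.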
On 2/9/21 2:13 PM, JeffleXu wrote:
>
> On 2/9/21 11:11 AM, Ming Lei wrote:
>> On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
>>> DM will iterate and poll all polling hardware queues of all target mq
>>> devices when polling IO for dm device. To mitigate the race introduced
>>> by iterating all target hw queues, a per-hw-queue flag is maintained
>>
>> What is the per-hw-queue flag?
>
> Sorry, I forgot to update the commit message as the implementation
> changed. Actually this mechanism is implemented by patch 10 of this
> patch set.
>
>>> to indicate whether this polling hw queue currently being polled on or
>>> not. Every polling hw queue is exclusive to one polling instance, i.e.,
>>> the polling instance will skip this polling hw queue if this hw queue
>>> currently is being polled by another polling instance, and start
>>> polling on the next hw queue.
>>
>> I don't see such a skip in dm_poll_one_dev(), in which
>> queue_for_each_poll_hw_ctx() is called directly for polling all POLL
>> hctxs of the request queue, so can you explain a bit more about this
>> skip mechanism?
>
> It is implemented in patch 10 of this patch set. When spin_trylock()
> fails, the polling instance will return immediately, instead of busy
> waiting.
>
>> Even if such skipping is implemented, I'm not sure good performance can
>> be reached, because hctx polling may be done in a ping-pong style among
>> several CPUs. But a blk-mq hctx is supposed to have its own CPU
>> affinities.
>
> Yes, the mechanism of iterating all hw queues can make the competition
> worse.
>
> If every underlying data device has **only** one polling hw queue, then
> this ping-pong style polling still exists even if we implement a
> split-bio tracking mechanism, i.e., recording the specific hw queue each
> bio was enqueued into, because multiple polling instances have to
> compete for the only polling hw queue.
>
> But if multiple polling hw queues per device are reserved for multiple
> polling instances (e.g., every underlying data device has 3 polling hw
> queues when there are 3 polling instances), just as we practice for mq
> polling, then the current implementation of iterating all hw queues will
> indeed work in a ping-pong style, while this issue would not exist if an
> accurate split-bio tracking mechanism could be implemented.

If we don't consider process migration, I could somehow avoid iterating
all hw queues while still staying within the framework of the current
implementation. For example, the CPU number of the IO-submitting process
could be stored in the cookie, and the polling routine would then only
iterate the hw queues to which the stored CPU number maps. Just a
tentative insight, though...
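[Editor's note] A minimal sketch of the cookie idea above: pack the submitting CPU into the high bits of the returned cookie so the poller can recover it later. The bit layout, the `qc_cookie_t` type and the helper names are all invented for illustration; the kernel's `blk_qc_t` uses its own encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cookie layout: high bits = submitting CPU, low bits = hw queue. */
typedef uint32_t qc_cookie_t;

#define COOKIE_CPU_SHIFT	16
#define COOKIE_QUEUE_MASK	0xffffu

static inline qc_cookie_t cookie_encode(unsigned int cpu, unsigned int qnum)
{
	return ((qc_cookie_t)cpu << COOKIE_CPU_SHIFT) | (qnum & COOKIE_QUEUE_MASK);
}

/* Poll side: recover the CPU the bio was submitted from. */
static inline unsigned int cookie_cpu(qc_cookie_t c)
{
	return c >> COOKIE_CPU_SHIFT;
}

static inline unsigned int cookie_queue(qc_cookie_t c)
{
	return c & COOKIE_QUEUE_MASK;
}
```

The polling routine would then restrict its iteration to the hw queues mapped to `cookie_cpu(cookie)` rather than walking every poll hctx.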
On Tue, Feb 09, 2021 at 02:13:38PM +0800, JeffleXu wrote:
>
> On 2/9/21 11:11 AM, Ming Lei wrote:
> > On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
> >> DM will iterate and poll all polling hardware queues of all target mq
> >> devices when polling IO for dm device. To mitigate the race introduced
> >> by iterating all target hw queues, a per-hw-queue flag is maintained
> >
> > What is the per-hw-queue flag?
>
> Sorry, I forgot to update the commit message as the implementation
> changed. Actually this mechanism is implemented by patch 10 of this
> patch set.

It is hard to associate patch 10's spin_trylock() with a per-hw-queue
flag. Also scsi's poll implementation is in progress, and scsi's poll
may not be implemented in this way.

> >> to indicate whether this polling hw queue currently being polled on or
> >> not. Every polling hw queue is exclusive to one polling instance, i.e.,
> >> the polling instance will skip this polling hw queue if this hw queue
> >> currently is being polled by another polling instance, and start
> >> polling on the next hw queue.
> >
> > I don't see such a skip in dm_poll_one_dev(), in which
> > queue_for_each_poll_hw_ctx() is called directly for polling all POLL
> > hctxs of the request queue, so can you explain a bit more about this
> > skip mechanism?
>
> It is implemented in patch 10 of this patch set. When spin_trylock()
> fails, the polling instance will return immediately, instead of busy
> waiting.
>
> > Even if such skipping is implemented, I'm not sure good performance can
> > be reached, because hctx polling may be done in a ping-pong style among
> > several CPUs. But a blk-mq hctx is supposed to have its own CPU
> > affinities.
>
> Yes, the mechanism of iterating all hw queues can make the competition
> worse.
>
> If every underlying data device has **only** one polling hw queue, then
> this ping-pong style polling still exists even if we implement a
> split-bio tracking mechanism, i.e., recording the specific hw queue each
> bio was enqueued into, because multiple polling instances have to
> compete for the only polling hw queue.
>
> But if multiple polling hw queues per device are reserved for multiple
> polling instances (e.g., every underlying data device has 3 polling hw
> queues when there are 3 polling instances), just as we practice for mq
> polling, then the current implementation of iterating all hw queues will
> indeed work in a ping-pong style, while this issue would not exist if an
> accurate split-bio tracking mechanism could be implemented.

In reality it could be possible to have one hw queue for each NUMA node.

And you may re-use blk_mq_map_queue() to get the proper hw queue for
polling.
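[Editor's note] The blk_mq_map_queue() suggestion above boils down to a per-CPU lookup table: one entry per possible CPU naming the hw queue that CPU should poll. The sketch below is a user-space stand-in, not the kernel API; `struct poll_queue_map` and its round-robin setup are invented for illustration (the real `struct blk_mq_queue_map::mq_map` is built from hardware topology, e.g. one poll queue per NUMA node).

```c
#include <assert.h>

/*
 * mq_map is a stand-in for struct blk_mq_queue_map::mq_map: one entry
 * per possible CPU, giving the index of the hw queue that CPU maps to.
 */
struct poll_queue_map {
	unsigned int nr_queues;
	unsigned int *mq_map;
};

/* Round-robin setup, a crude stand-in for topology-aware queue mapping. */
static void poll_map_init(struct poll_queue_map *m, unsigned int nr_cpus)
{
	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
		m->mq_map[cpu] = cpu % m->nr_queues;
}

/* The poll-side lookup: which hw queue should this CPU poll? */
static unsigned int poll_map_queue(const struct poll_queue_map *m,
				   unsigned int cpu)
{
	return m->mq_map[cpu];
}
```

With such a map, each polling instance would touch only "its" hw queue, which is what keeps polling off the ping-pong path discussed earlier in the thread.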
On 2/9/21 4:07 PM, Ming Lei wrote:
> On Tue, Feb 09, 2021 at 02:13:38PM +0800, JeffleXu wrote:
>>
>> On 2/9/21 11:11 AM, Ming Lei wrote:
>>> On Mon, Feb 08, 2021 at 04:52:41PM +0800, Jeffle Xu wrote:
>>>> DM will iterate and poll all polling hardware queues of all target mq
>>>> devices when polling IO for dm device. To mitigate the race introduced
>>>> by iterating all target hw queues, a per-hw-queue flag is maintained
>>>
>>> What is the per-hw-queue flag?
>>
>> Sorry, I forgot to update the commit message as the implementation
>> changed. Actually this mechanism is implemented by patch 10 of this
>> patch set.
>
> It is hard to associate patch 10's spin_trylock() with a per-hw-queue
> flag.

You're right, the commit message here is a mistake. Actually I had once
implemented a per-hw-queue flag, such as

```
struct blk_mq_hw_ctx {
	atomic_t busy;
	...
};
```

In that case the skipping mechanism was implemented in the block layer.
But later I refactored the code and moved the implementation into the
device driver layer, as described by patch 10, while forgetting to
update the commit message.

The reason why I implemented it in the device driver layer is that the
competition actually stems from the underlying device driver (e.g., the
nvme driver), as shown in the following snippet:

```
nvme_poll
	spin_lock(&nvmeq->cq_poll_lock);
	found = nvme_process_cq(nvmeq);
	spin_unlock(&nvmeq->cq_poll_lock);
```

It is @nvmeq->cq_poll_lock, i.e., the implementation of the underlying
device driver, that causes the competition. Thus maybe it is reasonable
to handle the competition issue in the device driver layer?

> Also scsi's poll implementation is in progress, and scsi's poll may
> not be implemented in this way.

Yes. The drawback of leaving the competition issue to the device driver
layer is that every device driver supporting polling needs to be
optimized individually.

Actually I have not yet taken a close look at the other two nvme
transports (drivers/nvme/host/tcp.c and drivers/nvme/host/rdma.c), which
also support polling.

>>>
>>>> to indicate whether this polling hw queue currently being polled on or
>>>> not. Every polling hw queue is exclusive to one polling instance, i.e.,
>>>> the polling instance will skip this polling hw queue if this hw queue
>>>> currently is being polled by another polling instance, and start
>>>> polling on the next hw queue.
>>>
>>> I don't see such a skip in dm_poll_one_dev(), in which
>>> queue_for_each_poll_hw_ctx() is called directly for polling all POLL
>>> hctxs of the request queue, so can you explain a bit more about this
>>> skip mechanism?
>>
>> It is implemented in patch 10 of this patch set. When spin_trylock()
>> fails, the polling instance will return immediately, instead of busy
>> waiting.
>>
>>> Even if such skipping is implemented, I'm not sure good performance can
>>> be reached, because hctx polling may be done in a ping-pong style among
>>> several CPUs. But a blk-mq hctx is supposed to have its own CPU
>>> affinities.
>>
>> Yes, the mechanism of iterating all hw queues can make the competition
>> worse.
>>
>> If every underlying data device has **only** one polling hw queue, then
>> this ping-pong style polling still exists even if we implement a
>> split-bio tracking mechanism, i.e., recording the specific hw queue each
>> bio was enqueued into, because multiple polling instances have to
>> compete for the only polling hw queue.
>>
>> But if multiple polling hw queues per device are reserved for multiple
>> polling instances (e.g., every underlying data device has 3 polling hw
>> queues when there are 3 polling instances), just as we practice for mq
>> polling, then the current implementation of iterating all hw queues will
>> indeed work in a ping-pong style, while this issue would not exist if an
>> accurate split-bio tracking mechanism could be implemented.
>
> In reality it could be possible to have one hw queue for each NUMA node.
>
> And you may re-use blk_mq_map_queue() to get the proper hw queue for
> polling.

Thanks. But the optimization I proposed in [1] may not work well when
the IO-submitting process migrates to another CPU halfway, i.e., the
process has submitted several split bios, then migrates to another CPU
and continues submitting the remaining split bios.

[1] https://lore.kernel.org/io-uring/20210208085243.82367-1-jefflexu@linux.alibaba.com/T/#m0d9a0e55e11874a70c6a3491f191289df72a84f8
On Mon, 8 Feb 2021, Jeffle Xu wrote:

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index c2945c90745e..8423f1347bb8 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1657,6 +1657,68 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>  	return BLK_QC_T_NONE;
>  }
>
> +static int dm_poll_one_md(struct mapped_device *md);
> +
> +static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
> +			   sector_t start, sector_t len, void *data)
> +{
> +	int i, *count = data;
> +	struct request_queue *q = bdev_get_queue(dev->bdev);
> +	struct blk_mq_hw_ctx *hctx;
> +
> +	if (queue_is_mq(q)) {
> +		if (!percpu_ref_tryget(&q->q_usage_counter))
> +			return 0;
> +
> +		queue_for_each_poll_hw_ctx(q, hctx, i)
> +			*count += blk_mq_poll_hctx(q, hctx);
> +
> +		percpu_ref_put(&q->q_usage_counter);
> +	} else
> +		*count += dm_poll_one_md(dev->bdev->bd_disk->private_data);

This is fragile, because in the future there may be other bio-based
drivers that support polling. You should check that "q" is really a
device mapper device before calling dm_poll_one_md on it.

Mikulas
On 2/19/21 10:17 PM, Mikulas Patocka wrote:
> On Mon, 8 Feb 2021, Jeffle Xu wrote:
>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index c2945c90745e..8423f1347bb8 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1657,6 +1657,68 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>>  	return BLK_QC_T_NONE;
>>  }
>>
>> +static int dm_poll_one_md(struct mapped_device *md);
>> +
>> +static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
>> +			   sector_t start, sector_t len, void *data)
>> +{
>> +	int i, *count = data;
>> +	struct request_queue *q = bdev_get_queue(dev->bdev);
>> +	struct blk_mq_hw_ctx *hctx;
>> +
>> +	if (queue_is_mq(q)) {
>> +		if (!percpu_ref_tryget(&q->q_usage_counter))
>> +			return 0;
>> +
>> +		queue_for_each_poll_hw_ctx(q, hctx, i)
>> +			*count += blk_mq_poll_hctx(q, hctx);
>> +
>> +		percpu_ref_put(&q->q_usage_counter);
>> +	} else
>> +		*count += dm_poll_one_md(dev->bdev->bd_disk->private_data);
>
> This is fragile, because in the future there may be other bio-based
> drivers that support polling. You should check that "q" is really a
> device mapper device before calling dm_poll_one_md on it.

Sorry, I missed this reply. Your advice makes sense; thanks.
On 2/19/21 10:17 PM, Mikulas Patocka wrote:
> On Mon, 8 Feb 2021, Jeffle Xu wrote:
>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index c2945c90745e..8423f1347bb8 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1657,6 +1657,68 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>>  	return BLK_QC_T_NONE;
>>  }
>>
>> +static int dm_poll_one_md(struct mapped_device *md);
>> +
>> +static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
>> +			   sector_t start, sector_t len, void *data)
>> +{
>> +	int i, *count = data;
>> +	struct request_queue *q = bdev_get_queue(dev->bdev);
>> +	struct blk_mq_hw_ctx *hctx;
>> +
>> +	if (queue_is_mq(q)) {
>> +		if (!percpu_ref_tryget(&q->q_usage_counter))
>> +			return 0;
>> +
>> +		queue_for_each_poll_hw_ctx(q, hctx, i)
>> +			*count += blk_mq_poll_hctx(q, hctx);
>> +
>> +		percpu_ref_put(&q->q_usage_counter);
>> +	} else
>> +		*count += dm_poll_one_md(dev->bdev->bd_disk->private_data);
>
> This is fragile, because in the future there may be other bio-based
> drivers that support polling. You should check that "q" is really a
> device mapper device before calling dm_poll_one_md on it.

This can be easily fixed by calling disk->fops->poll() recursively, such
as

```
if (queue_is_mq(q)) {
	...
} else {
	/* disk = disk of the underlying device */
	pdata->count += disk->fops->poll(q, pdata->cookie);
}
```

But meanwhile I realize that we would be calling disk->fops->poll()
recursively here, and thus may cause a stack overflow, similar to
submit_bio(), when the depth of the device stack is large enough. Maybe
a control structure like bio_list is needed here, to flatten the
recursion. Any thoughts?
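[Editor's note] The "flatten the recursion" idea above, analogous to how submit_bio() uses bio_list, can be sketched with an explicit worklist: instead of each bio-based layer recursing into its children's poll routines, devices to poll are pushed onto a list and drained in a loop, so stack usage stays constant regardless of stacking depth. Everything here (`struct dev_node`, the fixed-size worklist) is invented for illustration and is not the dm code.

```c
#include <assert.h>
#include <stddef.h>

struct dev_node {
	int is_mq;			/* leaf: poll its hw queues directly */
	int completions;		/* completions found on a leaf */
	struct dev_node **children;	/* NULL-terminated for bio-based devs */
};

#define WORKLIST_MAX 64

/* Iteratively walk the device stack: no recursion, bounded stack use. */
int poll_device_stack(struct dev_node *root)
{
	struct dev_node *work[WORKLIST_MAX];
	int top = 0, count = 0;

	work[top++] = root;
	while (top > 0) {
		struct dev_node *d = work[--top];

		if (d->is_mq) {
			/* leaf device: reap its completions */
			count += d->completions;
			continue;
		}
		/* bio-based device: defer its children to the worklist */
		for (struct dev_node **c = d->children; *c; c++)
			if (top < WORKLIST_MAX)
				work[top++] = *c;
	}
	return count;
}
```

The same bound applies however deep a dm-on-dm stack grows, which is exactly what the recursive disk->fops->poll() call cannot guarantee.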
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index aa37f3e82238..b090b4c9692d 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1764,6 +1764,19 @@ static int device_requires_stable_pages(struct dm_target *ti,
 	return blk_queue_stable_writes(q);
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+				   sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return !test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+}
+
+int dm_table_supports_poll(struct dm_table *t)
+{
+	return dm_table_all_devs_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1875,6 +1888,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 #endif
 
 	blk_queue_update_readahead(q);
+
+	/*
+	 * Check for request-based device is remained to
+	 * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+	 * For bio-based device, only set QUEUE_FLAG_POLL when all underlying
+	 * devices supporting polling.
+	 */
+	if (__table_type_bio_based(t->type)) {
+		if (dm_table_supports_poll(t))
+			blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+		else
+			blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+	}
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c2945c90745e..8423f1347bb8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1657,6 +1657,68 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
+static int dm_poll_one_md(struct mapped_device *md);
+
+static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
+			   sector_t start, sector_t len, void *data)
+{
+	int i, *count = data;
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+	struct blk_mq_hw_ctx *hctx;
+
+	if (queue_is_mq(q)) {
+		if (!percpu_ref_tryget(&q->q_usage_counter))
+			return 0;
+
+		queue_for_each_poll_hw_ctx(q, hctx, i)
+			*count += blk_mq_poll_hctx(q, hctx);
+
+		percpu_ref_put(&q->q_usage_counter);
+	} else
+		*count += dm_poll_one_md(dev->bdev->bd_disk->private_data);
+
+	return 0;
+}
+
+static int dm_poll_one_md(struct mapped_device *md)
+{
+	int i, srcu_idx, ret = 0;
+	struct dm_table *t;
+	struct dm_target *ti;
+
+	t = dm_get_live_table(md, &srcu_idx);
+
+	for (i = 0; i < dm_table_get_num_targets(t); i++) {
+		ti = dm_table_get_target(t, i);
+		ti->type->iterate_devices(ti, dm_poll_one_dev, &ret);
+	}
+
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
+static int dm_bio_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	struct gendisk *disk = queue_to_disk(q);
+	struct mapped_device *md = disk->private_data;
+
+	return dm_poll_one_md(md);
+}
+
+static bool dm_bio_poll_capable(struct gendisk *disk)
+{
+	int ret, srcu_idx;
+	struct mapped_device *md = disk->private_data;
+	struct dm_table *t;
+
+	t = dm_get_live_table(md, &srcu_idx);
+	ret = dm_table_supports_poll(t);
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -3049,6 +3111,8 @@ static const struct pr_ops dm_pr_ops = {
 };
 
 static const struct block_device_operations dm_blk_dops = {
+	.poll = dm_bio_poll,
+	.poll_capable = dm_bio_poll_capable,
 	.submit_bio = dm_submit_bio,
 	.open = dm_blk_open,
 	.release = dm_blk_close,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 61a66fb8ebb3..6a9de3fd0087 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -515,6 +515,7 @@ unsigned int dm_table_get_num_targets(struct dm_table *t);
 fmode_t dm_table_get_mode(struct dm_table *t);
 struct mapped_device *dm_table_get_md(struct dm_table *t);
 const char *dm_table_device_name(struct dm_table *t);
+int dm_table_supports_poll(struct dm_table *t);
 
 /*
  * Trigger an event.
DM will iterate and poll all polling hardware queues of all target mq
devices when polling IO for a dm device. To mitigate the race introduced
by iterating all target hw queues, a per-hw-queue flag is maintained to
indicate whether this polling hw queue is currently being polled on or
not. Every polling hw queue is exclusive to one polling instance, i.e.,
the polling instance will skip this polling hw queue if the hw queue is
currently being polled by another polling instance, and start polling on
the next hw queue.

IO polling is enabled when all underlying target devices are capable of
IO polling. The sanity check supports the stacked device model, in which
one dm device may be built upon another dm device. In this case, the
mapped device will check whether the underlying dm target device
supports IO polling.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 drivers/md/dm-table.c         | 26 ++++++++++++++
 drivers/md/dm.c               | 64 +++++++++++++++++++++++++++++++++++
 include/linux/device-mapper.h |  1 +
 3 files changed, 91 insertions(+)