Message ID | 20230731224617.8665-2-kch@nvidia.com (mailing list archive)
---|---
State | New, archived
Series | pmem: set QUEUE_FLAG_NOWAIT
Chaitanya Kulkarni <kch@nvidia.com> writes:

> Set the QUEUE_FLAG_NOWAIT. Following are the performance numbers with
> the io_uring fio engine for random read; note that the device has been
> fully populated with a randwrite workload before taking these numbers :-

I am slightly embarrassed to have to ask this question, but what are the
implications of setting this queue flag?  Is the submit_bio routine
expected to never block?  Is the I/O expected to be performed
asynchronously?  If either of those is the case, then pmem does not
qualify.

-Jeff
On Tue, Aug 01, 2023 at 11:23:36AM -0400, Jeff Moyer wrote:
> I am slightly embarrassed to have to ask this question, but what are the
> implications of setting this queue flag?  Is the submit_bio routine
> expected to never block?

Yes, at least not significantly.

> Is the I/O expected to be performed
> asynchronously?

Not necessarily if it is fast enough..
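For context, a minimal sketch of what the flag means to the block layer
(paraphrased, not the verbatim upstream code; the helper name is invented
for illustration): a REQ_NOWAIT bio is only handed to a driver whose queue
advertises QUEUE_FLAG_NOWAIT, otherwise it is failed back to the submitter
instead of being allowed to block inside ->submit_bio().

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Illustrative helper (name made up): roughly the check the block layer
 * applies at submission time.  A REQ_NOWAIT bio may only reach a driver
 * whose queue has QUEUE_FLAG_NOWAIT set; otherwise it is failed back to
 * the submitter rather than allowed to block.
 */
static bool nowait_bio_allowed(struct request_queue *q, struct bio *bio)
{
	return !(bio->bi_opf & REQ_NOWAIT) || blk_queue_nowait(q);
}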
Hi, Christoph,

Christoph Hellwig <hch@lst.de> writes:

> On Tue, Aug 01, 2023 at 11:23:36AM -0400, Jeff Moyer wrote:
>> I am slightly embarrassed to have to ask this question, but what are the
>> implications of setting this queue flag?  Is the submit_bio routine
>> expected to never block?
>
> Yes, at least not significantly.

If there happens to be poisoned memory, the write path can make an ACPI
device-specific method call.  That involves taking a mutex (see the call
chain starting at acpi_ex_enter_interpreter()).  I'm not sure how long a
DSM takes, but I doubt it's fast.

>> Is the I/O expected to be performed
>> asynchronously?
>
> Not necessarily if it is fast enough..

Clear as mud. :)  It's a memcpy on potentially "slow" memory.  So, the
amount of time spent copying depends on the speed of the CPU, the media,
and the size of the I/O.  I don't know off-hand what the upper bound
would be on today's pmem.

-Jeff
On 8/1/23 10:49, Jeff Moyer wrote:
> Hi, Christoph,
>
> Christoph Hellwig <hch@lst.de> writes:
>
>> On Tue, Aug 01, 2023 at 11:23:36AM -0400, Jeff Moyer wrote:
>>> I am slightly embarrassed to have to ask this question, but what are the
>>> implications of setting this queue flag?  Is the submit_bio routine
>>> expected to never block?
>> Yes, at least not significantly.
> If there happens to be poisoned memory, the write path can make an ACPI
> device-specific method call.  That involves taking a mutex (see the call
> chain starting at acpi_ex_enter_interpreter()).  I'm not sure how long a
> DSM takes, but I doubt it's fast.

For this case I can add a bio->bi_opf & REQ_NOWAIT check and return with
bio_wouldblock_error(); that will take care of the blocking case, e.g.
something like this (untested) as a prep patch for the write path :-

linux-block (pmem-nowait-on) # git diff
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ddd485c377eb..eff100f74357 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -179,12 +179,16 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 
 static blk_status_t pmem_do_write(struct pmem_device *pmem,
 			struct page *page, unsigned int page_off,
-			sector_t sector, unsigned int len)
+			sector_t sector, unsigned int len, struct bio *bio)
 {
 	phys_addr_t pmem_off = to_offset(pmem, sector);
 	void *pmem_addr = pmem->virt_addr + pmem_off;
 
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
+		if (bio && bio->bi_opf & REQ_NOWAIT) {
+			bio_wouldblock_error(bio);
+			return BLK_STS_AGAIN;
+		}
 		blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
 
 		if (rc != BLK_STS_OK)
@@ -217,7 +221,7 @@ static void pmem_submit_bio(struct bio *bio)
 	bio_for_each_segment(bvec, bio, iter) {
 		if (op_is_write(bio_op(bio)))
 			rc = pmem_do_write(pmem, bvec.bv_page, bvec.bv_offset,
-				iter.bi_sector, bvec.bv_len);
+				iter.bi_sector, bvec.bv_len, bio);
 		else
 			rc = pmem_do_read(pmem, bvec.bv_page, bvec.bv_offset,
 				iter.bi_sector, bvec.bv_len);
@@ -297,7 +301,7 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 
 	return blk_status_to_errno(pmem_do_write(pmem, ZERO_PAGE(0), 0,
 				PFN_PHYS(pgoff) >> SECTOR_SHIFT,
-				PAGE_SIZE));
+				PAGE_SIZE, NULL));
 }
 
 static long pmem_dax_direct_access(struct dax_device *dax_dev,

>>> Is the I/O expected to be performed
>>> asynchronously?
>> Not necessarily if it is fast enough..
> Clear as mud. :)  It's a memcpy on potentially "slow" memory.  So, the
> amount of time spent copying depends on the speed of the CPU, the media,
> and the size of the I/O.  I don't know off-hand what the upper bound
> would be on today's pmem.

The above scenario depends on the system, and I'm not sure we can
generalize it for all deployments.  If it does turn out to be a general
problem, we can always add a modparam in an incremental patch to disable
nowait, so the user has total control over whether to enable or disable
this ..

For the deployments that are not suffering from the "memcpy on
potentially slow memory" issue, this patch shows a clear performance win
from enabling NOWAIT for io_uring ..

-ck
Given that pmem simply loops over an arbitrarily large bio, I think we
also need a threshold below which to allow nowait I/O.  While it won't
block for giant I/Os, doing all of them in the submitter context isn't
exactly the idea behind nowait I/O.

I'm not really sure what a good threshold would be, though.

Btw, please also always add linux-block to the Cc list for block driver
patches that are even the slightest bit about the block layer interface.
Christoph Hellwig <hch@lst.de> writes:

> Given that pmem simply loops over an arbitrarily large bio, I think we
> also need a threshold below which to allow nowait I/O.  While it won't
> block for giant I/Os, doing all of them in the submitter context isn't
> exactly the idea behind nowait I/O.
>
> I'm not really sure what a good threshold would be, though.

There's no mention of the latency of the submission side in the
documentation for RWF_NOWAIT.  The man page says "Do not wait for data
which is not immediately available."  Data in pmem *is* immediately
available.  If we wanted to enforce a latency threshold for submission,
it would have to be configurable and, ideally, a part of the API.  I
don't think it's something we should even try to guarantee, though,
unless application writers are asking for it.

So, I think with the change to return -EAGAIN for writes to poisoned
memory, this patch is probably ok.  Chaitanya, could you send a v2?

Thanks,
Jeff
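To make the RWF_NOWAIT contract quoted above concrete, here is a small
illustrative userspace sketch (the device path and buffer size are
assumptions, not taken from this thread): the read either completes from
data that is immediately available or fails with EAGAIN, and the
application chooses its own fallback.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/dev/pmem0", O_RDONLY);	/* assumed device path */
	ssize_t ret;

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/* Fail with EAGAIN instead of waiting for data that is not
	 * immediately available. */
	ret = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
	if (ret < 0 && errno == EAGAIN) {
		/* Caller decides: retry without the flag, or hand the
		 * request off to a worker thread. */
		ret = preadv2(fd, &iov, 1, 0, 0);
	}
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes\n", ret);

	close(fd);
	return ret < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

io_uring relies on the same signal: if the inline nowait attempt would
block, the request is punted to a worker thread, which is presumably
where the large drop in context switches in the posted fio numbers comes
from once QUEUE_FLAG_NOWAIT lets the inline attempt proceed.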
On 8/2/23 6:30 AM, Christoph Hellwig wrote:
> Given that pmem simply loops over an arbitrarily large bio, I think we
> also need a threshold below which to allow nowait I/O.  While it won't
> block for giant I/Os, doing all of them in the submitter context isn't
> exactly the idea behind nowait I/O.

You can do a LOT of looping over a giant bio and still come out way
ahead compared to needing to punt to a different thread.  So I do think
it's the right choice.  But I'm making assumptions here on what it looks
like, as I haven't seen the patch...

> Btw, please also always add linux-block to the Cc list for block driver
> patches that are even the slightest bit about the block layer interface.

Indeed.  Particularly for these nowait changes, as some of them have
been pretty broken in the past.
On Wed, Aug 02, 2023 at 09:36:32AM -0600, Jens Axboe wrote:
> You can do a LOT of looping over a giant bio and still come out way
> ahead compared to needing to punt to a different thread.  So I do think
> it's the right choice.  But I'm making assumptions here on what it looks
> like, as I haven't seen the patch...

"a LOT" is relative.  A GB or two will block the submission thread for
quite a while.
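Purely to illustrate the kind of cutoff being discussed (a hypothetical
sketch: the constant, its 1 MiB value, and the exact placement are
invented here and were not proposed in any posted patch), a driver could
refuse nowait submission above some size and let the submitter fall back
to a blocking context:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/sizes.h>

/* Hypothetical cutoff, not part of any posted patch. */
#define PMEM_NOWAIT_MAX_BYTES	SZ_1M

static void pmem_submit_bio(struct bio *bio)
{
	/* Refuse to do very large copies in the nowait submission path. */
	if ((bio->bi_opf & REQ_NOWAIT) &&
	    bio->bi_iter.bi_size > PMEM_NOWAIT_MAX_BYTES) {
		bio_wouldblock_error(bio);	/* completes the bio with BLK_STS_AGAIN */
		return;
	}

	/* ... existing per-segment copy loop and bio_endio() ... */
}

Whether a fixed cutoff like this is the right interface, and what the
number should be, is exactly the open question in this sub-thread.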
On 8/2/23 05:30, Christoph Hellwig wrote:
> Given that pmem simply loops over an arbitrarily large bio, I think we
> also need a threshold below which to allow nowait I/O.  While it won't
> block for giant I/Os, doing all of them in the submitter context isn't
> exactly the idea behind nowait I/O.
>
> I'm not really sure what a good threshold would be, though.
>
> Btw, please also always add linux-block to the Cc list for block driver
> patches that are even the slightest bit about the block layer interface.

Will do ..

-ck
On 8/2/23 08:36, Jens Axboe wrote:
> On 8/2/23 6:30 AM, Christoph Hellwig wrote:
>> Given that pmem simply loops over an arbitrarily large bio, I think we
>> also need a threshold below which to allow nowait I/O.  While it won't
>> block for giant I/Os, doing all of them in the submitter context isn't
>> exactly the idea behind nowait I/O.
> You can do a LOT of looping over a giant bio and still come out way
> ahead compared to needing to punt to a different thread.  So I do think
> it's the right choice.  But I'm making assumptions here on what it looks
> like, as I haven't seen the patch...

I will send out the V3 soon and CC you and linux-block ...

>> Btw, please also always add linux-block to the Cc list for block driver
>> patches that are even the slightest bit about the block layer interface.
> Indeed.  Particularly for these nowait changes, as some of them have
> been pretty broken in the past.

Yes, I didn't forget; I've lately been dealing with a bunch of stupidity.
I'll send the zram and null_blk nowait changes soon with your comments
fixed ..

-ck
On 8/2/23 08:27, Jeff Moyer wrote:
> Christoph Hellwig <hch@lst.de> writes:
>
>> Given that pmem simply loops over an arbitrarily large bio, I think we
>> also need a threshold below which to allow nowait I/O.  While it won't
>> block for giant I/Os, doing all of them in the submitter context isn't
>> exactly the idea behind nowait I/O.
>>
>> I'm not really sure what a good threshold would be, though.
> There's no mention of the latency of the submission side in the
> documentation for RWF_NOWAIT.  The man page says "Do not wait for data
> which is not immediately available."  Data in pmem *is* immediately
> available.  If we wanted to enforce a latency threshold for submission,
> it would have to be configurable and, ideally, a part of the API.  I
> don't think it's something we should even try to guarantee, though,
> unless application writers are asking for it.

If I see a use case from an application writer that can only be solved
by the kernel providing this interface, I'll be happy to do that ..

> So, I think with the change to return -EAGAIN for writes to poisoned
> memory, this patch is probably ok.

I believe you mean the same incremental patch I've provided earlier ..

> Chaitanya, could you send a v2?

Sure ...

-ck
Chaitanya Kulkarni <chaitanyak@nvidia.com> writes:

> On 8/2/23 08:27, Jeff Moyer wrote:
>> So, I think with the change to return -EAGAIN for writes to poisoned
>> memory, this patch is probably ok.
>
> I believe you mean the same incremental patch I've provided earlier ..

Yes, sorry if that wasn't clear.

>> Chaitanya, could you send a v2?
>
> Sure ...

... and I guess I should have said v3. ;-)

Cheers,
Jeff
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 46e094e56159..ddd485c377eb 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -543,6 +543,7 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_max_hw_sectors(q, UINT_MAX);
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
 	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
+	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
 	if (pmem->pfn_flags & PFN_MAP)
 		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
Set the QUEUE_FLAG_NOWAIT. Following are the performance numbers with
the io_uring fio engine for random read; note that the device has been
fully populated with a randwrite workload before taking these numbers :-

linux-block (pmem-nowait-on) # grep IOPS pmem*fio | column -t
pmem-nowait-off-1.fio:  read:  IOPS=3683k,  BW=14.0GiB/s
pmem-nowait-off-2.fio:  read:  IOPS=3819k,  BW=14.6GiB/s
pmem-nowait-off-3.fio:  read:  IOPS=3999k,  BW=15.3GiB/s

pmem-nowait-on-1.fio:   read:  IOPS=5837k,  BW=22.3GiB/s
pmem-nowait-on-2.fio:   read:  IOPS=5936k,  BW=22.6GiB/s
pmem-nowait-on-3.fio:   read:  IOPS=5945k,  BW=22.7GiB/s

linux-block (pmem-nowait-on) # grep cpu pmem*fio | column -t
pmem-nowait-off-1.fio:  cpu  :  usr=7.09%,   sys=29.65%,  ctx=198742065
pmem-nowait-off-2.fio:  cpu  :  usr=6.89%,   sys=30.56%,  ctx=205817652
pmem-nowait-off-3.fio:  cpu  :  usr=6.86%,   sys=30.94%,  ctx=222627094

pmem-nowait-on-1.fio:   cpu  :  usr=10.58%,  sys=88.44%,  ctx=27181
pmem-nowait-on-2.fio:   cpu  :  usr=10.50%,  sys=87.75%,  ctx=25746
pmem-nowait-on-3.fio:   cpu  :  usr=10.67%,  sys=88.60%,  ctx=28261

linux-block (pmem-nowait-on) # grep slat pmem*fio | column -t
pmem-nowait-off-1.fio:  slat  (nsec):  min=432,   max=50847k,  avg=9324.69
pmem-nowait-off-2.fio:  slat  (nsec):  min=441,   max=52557k,  avg=9132.45
pmem-nowait-off-3.fio:  slat  (nsec):  min=430,   max=36113k,  avg=9132.63

pmem-nowait-on-1.fio:   slat  (nsec):  min=1393,  max=68090k,  avg=7615.31
pmem-nowait-on-2.fio:   slat  (nsec):  min=1222,  max=44137k,  avg=7493.77
pmem-nowait-on-3.fio:   slat  (nsec):  min=1493,  max=40100k,  avg=7486.36

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvdimm/pmem.c | 1 +
 1 file changed, 1 insertion(+)