Message ID | 1555051638-32354-1-git-send-email-lirongqing@baidu.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] nvdimm: pmem: always flush nvdimm for write request |
On Thu, Apr 11, 2019 at 11:47 PM Li RongQing <lirongqing@baidu.com> wrote:
>
> Flushing the nvdimm for write requests speeds up random writes, giving
> about a 20% performance improvement.
>
> Below are fio 4k random write results against the nvdimm as /dev/pmem0:
>
> Before:
> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
> After:
> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]

Interesting result. Another experiment proposed below...

> This also makes sure that the newly written data is durable.

It's overkill for durability because ADR already handles the necessary
write-queue flush at power-loss.

> Co-developed-by: Liang ZhiCheng <liangzhicheng@baidu.com>
> Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
> This test was done on Intel AEP.
>
>  drivers/nvdimm/pmem.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 1d432c5ed..9f8f25880 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -197,6 +197,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>  	unsigned long start;
>  	struct bio_vec bvec;
>  	struct bvec_iter iter;
> +	unsigned int op = bio_op(bio);
>  	struct pmem_device *pmem = q->queuedata;
>  	struct nd_region *nd_region = to_region(pmem);
>
> @@ -206,7 +207,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>  	do_acct = nd_iostat_start(bio, &start);
>  	bio_for_each_segment(bvec, bio, iter) {
>  		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> -				bvec.bv_offset, bio_op(bio), iter.bi_sector);
> +				bvec.bv_offset, op, iter.bi_sector);
>  		if (rc) {
>  			bio->bi_status = rc;
>  			break;
> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>  	if (do_acct)
>  		nd_iostat_end(bio, start);
>
> -	if (bio->bi_opf & REQ_FUA)
> +	if (bio->bi_opf & REQ_FUA || op_is_write(op))
>  		nvdimm_flush(nd_region);

One question is whether this benefit comes from just the write ordering
provided by the barrier, or from the WPQ flush itself. Can you try the
following and see if you get the same result?

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..10c24de52395 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -217,6 +217,8 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 
 	if (bio->bi_opf & REQ_FUA)
 		nvdimm_flush(nd_region);
+	else if (op_is_write(bio_op(bio)))
+		wmb();
 
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
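For context on that experiment: on a region with flush hints, nvdimm_flush()
does roughly the following. This is a simplified paraphrase of that era's
drivers/nvdimm/region_devs.c, with the per-CPU flush-hint index selection and
the no-hint fallback elided, so treat it as a sketch rather than the exact
code:

/*
 * Simplified sketch of a flush-hint based nvdimm_flush().  The real
 * implementation also hashes a per-CPU index so concurrent CPUs spread
 * their writes across multiple flush-hint addresses.
 */
void nvdimm_flush_sketch(struct nd_region *nd_region)
{
	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
	int i;

	/* make the just-copied pmem data globally visible first */
	wmb();

	/* one posted write per DIMM mapping that exposes a flush hint */
	for (i = 0; i < nd_region->ndr_mappings; i++)
		if (ndrd_get_flush_wpq(ndrd, i, 0))
			writeq(1, ndrd_get_flush_wpq(ndrd, i, 0));

	/* order the flush-hint writes themselves */
	wmb();
}

So the wmb()-only experiment keeps the ordering side of nvdimm_flush() while
dropping the flush-hint writes that actually trigger the WPQ flush, which is
what should separate the two explanations.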
>> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>>  	if (do_acct)
>>  		nd_iostat_end(bio, start);
>>
>> -	if (bio->bi_opf & REQ_FUA)
>> +	if (bio->bi_opf & REQ_FUA || op_is_write(op))
>>  		nvdimm_flush(nd_region);
...
>> Before:
>> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
>> After:
>> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
>>
>> -RongQing

Doing more work cannot be faster than doing less work, so something else
must be happening here.

Please post the full fio job file and how you invoke it (i.e., with numactl).

These tools help show what is happening on the CPUs and memory channels:
	perf top
	pcm.x
	pcm-memory.x -pmm
> -----Original Message-----
> From: Elliott, Robert (Servers) [mailto:elliott@hpe.com]
> Sent: April 14, 2019 11:21
> To: Li,Rongqing <lirongqing@baidu.com>; Dan Williams <dan.j.williams@intel.com>
> Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
> Subject: RE: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
>
>
> >> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> >>  	if (do_acct)
> >>  		nd_iostat_end(bio, start);
> >>
> >> -	if (bio->bi_opf & REQ_FUA)
> >> +	if (bio->bi_opf & REQ_FUA || op_is_write(op))
> >>  		nvdimm_flush(nd_region);
> ...
> >> Before:
> >> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
> >> After:
> >> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
> >>
> >> -RongQing
>
> Doing more work cannot be faster than doing less work, so something else
> must be happening here.

Dan Williams may know more.

> Please post the full fio job file and how you invoke it (i.e., with numactl).

The fio job file is below; we bind fio to CPUs and a NUMA node:

numactl --membind=0 taskset -c 2-24 ./fio test_io_raw

[global]
numjobs=32
direct=1
filename=/dev/pmem0.1
iodepth=32
ioengine=libaio
group_reporting=1
bs=4K
time_based=1

[write1]
rw=randwrite
runtime=60
stonewall

> These tools help show what is happening on the CPUs and memory channels:
> 	perf top

62.40%  [kernel]  [k] memcpy_flushcache
21.17%  [kernel]  [k] fput
 6.12%  [kernel]  [k] apic_timer_interrupt
 0.89%  [kernel]  [k] rq_qos_done_bio
 0.66%  [kernel]  [k] bio_endio
 0.44%  [kernel]  [k] aio_complete_rw
 0.39%  [kernel]  [k] blkdev_bio_end_io
 0.31%  [kernel]  [k] entry_SYSCALL_64
 0.26%  [kernel]  [k] bio_disassociate_task
 0.23%  [kernel]  [k] read_tsc
 0.21%  fio       [.] axmap_isset
 0.20%  [kernel]  [k] ktime_get_raw_ts64
 0.19%  [vdso]    [.] 0x7ffc475e2b30
 0.18%  [kernel]  [k] gup_pgd_range
 0.18%  [kernel]  [k] entry_SYSCALL_64_after_hwframe
 0.16%  [kernel]  [k] __audit_syscall_exit
 0.16%  [kernel]  [k] __x86_indirect_thunk_rax
 0.13%  [kernel]  [k] copy_user_enhanced_fast_string
 0.13%  [kernel]  [k] syscall_return_via_sysret
 0.12%  [kernel]  [k] preempt_count_add
 0.12%  [kernel]  [k] preempt_count_sub
 0.11%  [kernel]  [k] __x64_sys_clock_gettime
 0.11%  [kernel]  [k] tracer_hardirqs_off
 0.10%  [kernel]  [k] native_write_msr
 0.10%  [kernel]  [k] posix_get_monotonic_raw
 0.10%  fio       [.] get_io_u

> 	pcm.x

http://pasted.co/6fc93b42

> 	pcm-memory.x -pmm

http://pasted.co/d5c0c96b

If fio is not bound to CPUs and a NUMA node, the performance is much lower,
but this optimization helps in both cases; it sometimes gives about a 40%
improvement.

-Li
On Sun, Apr 14, 2019 at 8:26 PM Li,Rongqing <lirongqing@baidu.com> wrote:
>
> > -----Original Message-----
> > From: Elliott, Robert (Servers) [mailto:elliott@hpe.com]
> > Sent: April 14, 2019 11:21
> > To: Li,Rongqing <lirongqing@baidu.com>; Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
> > Subject: RE: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
> >
> > >> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> > >>  	if (do_acct)
> > >>  		nd_iostat_end(bio, start);
> > >>
> > >> -	if (bio->bi_opf & REQ_FUA)
> > >> +	if (bio->bi_opf & REQ_FUA || op_is_write(op))
> > >>  		nvdimm_flush(nd_region);
> > ...
> > >> Before:
> > >> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
> > >> After:
> > >> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
> > >>
> > >> -RongQing
> >
> > Doing more work cannot be faster than doing less work, so something else
> > must be happening here.
>
> Dan Williams may know more.

One thought is that back-pressure from awaiting write-posted-queue flush
completion causes full buffer writes to coalesce at the controller, i.e.
write-combining effects from the flush delay.
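One way to put a number on that back-pressure theory would be to time the
flush on the submission path. The following is a hypothetical instrumentation
sketch (not something posted in this thread) wrapped around the flush in
pmem_make_request(), using only stock kernel helpers:

	/*
	 * Hypothetical instrumentation: measure how long each post-write
	 * flush stalls the submitting CPU.  ktime_get_ns() and
	 * pr_info_ratelimited() are standard kernel facilities.
	 */
	u64 t0 = ktime_get_ns();

	if (bio->bi_opf & REQ_FUA || op_is_write(op))
		nvdimm_flush(nd_region);

	pr_info_ratelimited("pmem: post-write flush took %llu ns\n",
			    ktime_get_ns() - t0);

If each flush stalls the submitting CPU for a time that is significant
compared to the 4k copy itself, that supports the idea that the bandwidth
gain comes from writes batching up behind the flush rather than from the
flush being free.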
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1d432c5ed..9f8f25880 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -197,6 +197,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long start;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
+	unsigned int op = bio_op(bio);
 	struct pmem_device *pmem = q->queuedata;
 	struct nd_region *nd_region = to_region(pmem);
 
@@ -206,7 +207,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
 		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
-				bvec.bv_offset, bio_op(bio), iter.bi_sector);
+				bvec.bv_offset, op, iter.bi_sector);
 		if (rc) {
 			bio->bi_status = rc;
 			break;
@@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
-	if (bio->bi_opf & REQ_FUA)
+	if (bio->bi_opf & REQ_FUA || op_is_write(op))
 		nvdimm_flush(nd_region);
 
 	bio_endio(bio);