Message ID: 150153988620.49768.12914164179718467335.stgit@djiang5-desk3.ch.intel.com (mailing list archive)
State: New, archived
Dave Jiang <dave.jiang@intel.com> writes: > Adding DMA support for pmem blk reads. This provides signficant CPU > reduction with large memory reads with good performance. DMAs are triggered > with test against bio_multiple_segment(), so the small I/Os (4k or less?) > are still performed by the CPU in order to reduce latency. By default > the pmem driver will be using blk-mq with DMA. > > Numbers below are measured against pmem simulated via DRAM using > memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon > platform. Keep in mind the performance for actual persistent memory > will differ. > Fio 2.21 was used. > > 64k: 1 task queuedepth=1 > CPU Read: 7631 MB/s 99.7% CPU DMA Read: 2415 MB/s 54% CPU > CPU Write: 3552 MB/s 100% CPU DMA Write 2173 MB/s 54% CPU > > 64k: 16 tasks queuedepth=16 > CPU Read: 36800 MB/s 1593% CPU DMA Read: 29100 MB/s 607% CPU > CPU Write 20900 MB/s 1589% CPU DMA Write: 23400 MB/s 585% CPU > > 2M: 1 task queuedepth=1 > CPU Read: 6013 MB/s 99.3% CPU DMA Read: 7986 MB/s 59.3% CPU > CPU Write: 3579 MB/s 100% CPU DMA Write: 5211 MB/s 58.3% CPU > > 2M: 16 tasks queuedepth=16 > CPU Read: 18100 MB/s 1588% CPU DMA Read: 21300 MB/s 180.9% CPU > CPU Write: 14100 MB/s 1594% CPU DMA Write: 20400 MB/s 446.9% CPU > > Signed-off-by: Dave Jiang <dave.jiang@intel.com> > --- Hi Dave, The above table shows that there's a performance benefit for 2M transfers but a regression for 64k transfers, if we forget about the CPU utilization for a second. Would it be beneficial to have heuristics on the transfer size that decide when to use DMA and when not? You introduced this hunk: - rc = pmem_handle_cmd(cmd); + if (cmd->chan && bio_multiple_segments(req->bio)) + rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); + else + rc = pmem_handle_cmd(cmd); This utilizes DMA for bios with multiple segments, while single-segment bios use the old path; maybe the single/multi-segment logic can be amended to have something like: if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH) rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); else rc = pmem_handle_cmd(cmd); Just something worth considering IMHO. > + len = blk_rq_payload_bytes(req); > + page = virt_to_page(pmem_addr); > + off = (u64)pmem_addr & ~PAGE_MASK; off = offset_in_page(pmem_addr); ?
On 08/01/2017 12:34 AM, Johannes Thumshirn wrote: > Dave Jiang <dave.jiang@intel.com> writes: > >> Adding DMA support for pmem blk reads. This provides signficant CPU >> reduction with large memory reads with good performance. DMAs are triggered >> with test against bio_multiple_segment(), so the small I/Os (4k or less?) >> are still performed by the CPU in order to reduce latency. By default >> the pmem driver will be using blk-mq with DMA. >> >> Numbers below are measured against pmem simulated via DRAM using >> memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon >> platform. Keep in mind the performance for actual persistent memory >> will differ. >> Fio 2.21 was used. >> >> 64k: 1 task queuedepth=1 >> CPU Read: 7631 MB/s 99.7% CPU DMA Read: 2415 MB/s 54% CPU >> CPU Write: 3552 MB/s 100% CPU DMA Write 2173 MB/s 54% CPU >> >> 64k: 16 tasks queuedepth=16 >> CPU Read: 36800 MB/s 1593% CPU DMA Read: 29100 MB/s 607% CPU >> CPU Write 20900 MB/s 1589% CPU DMA Write: 23400 MB/s 585% CPU >> >> 2M: 1 task queuedepth=1 >> CPU Read: 6013 MB/s 99.3% CPU DMA Read: 7986 MB/s 59.3% CPU >> CPU Write: 3579 MB/s 100% CPU DMA Write: 5211 MB/s 58.3% CPU >> >> 2M: 16 tasks queuedepth=16 >> CPU Read: 18100 MB/s 1588% CPU DMA Read: 21300 MB/s 180.9% CPU >> CPU Write: 14100 MB/s 1594% CPU DMA Write: 20400 MB/s 446.9% CPU >> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com> >> --- > > Hi Dave, > > The above table shows that there's a performance benefit for 2M > transfers but a regression for 64k transfers, if we forget about the CPU > utilization for a second. Would it be beneficial to have heuristics on > the transfer size that decide when to use dma and when not? You > introduced this hunk: > > - rc = pmem_handle_cmd(cmd); > + if (cmd->chan && bio_multiple_segments(req->bio)) > + rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); > + else > + rc = pmem_handle_cmd(cmd); > > Which utilizes dma for bios with multiple segments and for single > segment bios you use the old path, maybe the single/multi segment logic > can be amended to have something like: > > if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH) > rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)); > else > rc = pmem_handle_cmd(cmd); Yes good idea. I will introduce a thresh parameter that's tune-able. > > Just something woth considering IMHO. > >> + len = blk_rq_payload_bytes(req); >> + page = virt_to_page(pmem_addr); >> + off = (u64)pmem_addr & ~PAGE_MASK; > > off = offset_in_page(pmem_addr); ? Thanks. Will fix.
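A minimal sketch of what such a tunable threshold could look like, based on the hunk quoted above; the parameter name (dma_thresh), its default, and its permissions are assumptions for illustration, not the posted patch:

/* Hypothetical knob: minimum number of bio segments before the DMA path is tried */
static int dma_thresh = 2;
module_param(dma_thresh, int, 0644);
MODULE_PARM_DESC(dma_thresh, "Minimum bio segment count required to use the DMA engine");

static int pmem_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct pmem_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
	struct request *req;
	int rc;

	req = cmd->rq = bd->rq;
	cmd->chan = dma_find_channel(DMA_MEMCPY);
	blk_mq_start_request(req);

	/* Small requests (or no DMA channel) keep using the CPU copy path */
	if (cmd->chan && bio_segments(req->bio) >= dma_thresh)
		rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
	else
		rc = pmem_handle_cmd(cmd);

	return rc < 0 ? BLK_MQ_RQ_QUEUE_ERROR : BLK_MQ_RQ_QUEUE_OK;
}

With a default of two segments this roughly matches the current bio_multiple_segments() behaviour, while still letting a benchmarking or regression-testing setup raise or lower the cutoff.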
On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote: > Dave Jiang <dave.jiang@intel.com> writes: > >> Adding DMA support for pmem blk reads. This provides signficant CPU >> reduction with large memory reads with good performance. DMAs are triggered >> with test against bio_multiple_segment(), so the small I/Os (4k or less?) >> are still performed by the CPU in order to reduce latency. By default >> the pmem driver will be using blk-mq with DMA. >> >> Numbers below are measured against pmem simulated via DRAM using >> memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon >> platform. Keep in mind the performance for actual persistent memory >> will differ. >> Fio 2.21 was used. >> >> 64k: 1 task queuedepth=1 >> CPU Read: 7631 MB/s 99.7% CPU DMA Read: 2415 MB/s 54% CPU >> CPU Write: 3552 MB/s 100% CPU DMA Write 2173 MB/s 54% CPU >> >> 64k: 16 tasks queuedepth=16 >> CPU Read: 36800 MB/s 1593% CPU DMA Read: 29100 MB/s 607% CPU >> CPU Write 20900 MB/s 1589% CPU DMA Write: 23400 MB/s 585% CPU >> >> 2M: 1 task queuedepth=1 >> CPU Read: 6013 MB/s 99.3% CPU DMA Read: 7986 MB/s 59.3% CPU >> CPU Write: 3579 MB/s 100% CPU DMA Write: 5211 MB/s 58.3% CPU >> >> 2M: 16 tasks queuedepth=16 >> CPU Read: 18100 MB/s 1588% CPU DMA Read: 21300 MB/s 180.9% CPU >> CPU Write: 14100 MB/s 1594% CPU DMA Write: 20400 MB/s 446.9% CPU >> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com> >> --- > > Hi Dave, > > The above table shows that there's a performance benefit for 2M > transfers but a regression for 64k transfers, if we forget about the CPU > utilization for a second. I don't think we can forget about cpu utilization, and I would expect most users would value cpu over small bandwidth increases. Especially when those numbers show efficiency like this +160% cpu +26% read bandwidth +171% cpu -10% write bandwidth > Would it be beneficial to have heuristics on > the transfer size that decide when to use dma and when not? You > introduced this hunk: > > - rc = pmem_handle_cmd(cmd); > + if (cmd->chan && bio_multiple_segments(req->bio)) > + rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); > + else > + rc = pmem_handle_cmd(cmd); > > Which utilizes dma for bios with multiple segments and for single > segment bios you use the old path, maybe the single/multi segment logic > can be amended to have something like: > > if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH) > rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)); > else > rc = pmem_handle_cmd(cmd); > > Just something woth considering IMHO. The thing we want to avoid most is having the cpu stall doing nothing waiting for memory access, so I think once the efficiency of adding more cpu goes non-linear it's better to let the dma engine handle that. The current heuristic seems to achieve that, but the worry is does it over-select the cpu for cases where requests could be merged into a bulkier i/o that is more suitable for dma. I'm not sure we want another tunable vs predictable / efficient behavior by default.
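For reference, the efficiency figures quoted above appear to be derived from the 64k / 16-task rows of Dave's table, comparing the CPU path against the DMA path:

Read:  1593% / 607% CPU ≈ 2.6x (~ +160% CPU) for 36800 / 29100 MB/s ≈ 1.26x (+26% bandwidth)
Write: 1589% / 585% CPU ≈ 2.7x (~ +171% CPU) for 20900 / 23400 MB/s ≈ 0.89x (-10% bandwidth)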
On Mon, Jul 31, 2017 at 03:24:46PM -0700, Dave Jiang wrote: > Adding DMA support for pmem blk reads. This provides signficant CPU > reduction with large memory reads with good performance. DMAs are triggered > with test against bio_multiple_segment(), so the small I/Os (4k or less?) > are still performed by the CPU in order to reduce latency. By default > the pmem driver will be using blk-mq with DMA. > > Numbers below are measured against pmem simulated via DRAM using > memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon > platform. Keep in mind the performance for actual persistent memory > will differ. > Fio 2.21 was used. > > 64k: 1 task queuedepth=1 > CPU Read: 7631 MB/s 99.7% CPU DMA Read: 2415 MB/s 54% CPU > CPU Write: 3552 MB/s 100% CPU DMA Write 2173 MB/s 54% CPU > > 64k: 16 tasks queuedepth=16 > CPU Read: 36800 MB/s 1593% CPU DMA Read: 29100 MB/s 607% CPU > CPU Write 20900 MB/s 1589% CPU DMA Write: 23400 MB/s 585% CPU > > 2M: 1 task queuedepth=1 > CPU Read: 6013 MB/s 99.3% CPU DMA Read: 7986 MB/s 59.3% CPU > CPU Write: 3579 MB/s 100% CPU DMA Write: 5211 MB/s 58.3% CPU > > 2M: 16 tasks queuedepth=16 > CPU Read: 18100 MB/s 1588% CPU DMA Read: 21300 MB/s 180.9% CPU > CPU Write: 14100 MB/s 1594% CPU DMA Write: 20400 MB/s 446.9% CPU > > Signed-off-by: Dave Jiang <dave.jiang@intel.com> > --- <> > +static void nd_pmem_dma_callback(void *data, > + const struct dmaengine_result *res) > +{ > + struct pmem_cmd *cmd = data; > + struct request *req = cmd->rq; > + struct request_queue *q = req->q; > + struct pmem_device *pmem = q->queuedata; > + struct nd_region *nd_region = to_region(pmem); > + struct device *dev = to_dev(pmem); > + int rc = 0; > + > + dev_dbg(dev, "%s()\n", __func__); Is this left-over debug, or did you mean to leave it in? > + > + if (res) { > + enum dmaengine_tx_result dma_err = res->result; > + > + switch (dma_err) { > + case DMA_TRANS_READ_FAILED: > + case DMA_TRANS_WRITE_FAILED: > + case DMA_TRANS_ABORTED: > + dev_dbg(dev, "bio failed\n"); > + rc = -ENXIO; > + break; > + case DMA_TRANS_NOERROR: > + default: > + break; > + } > + } > + > + if (req->cmd_flags & REQ_FUA) > + nvdimm_flush(nd_region); > + > + dev_dbg(dev, "ending request\n"); > + blk_mq_end_request(cmd->rq, rc); > +} > + > +static int pmem_handle_cmd_dma(struct pmem_cmd *cmd, bool is_write) > +{ > + struct request *req = cmd->rq; > + struct request_queue *q = req->q; > + struct pmem_device *pmem = q->queuedata; > + struct device *dev = to_dev(pmem); > + phys_addr_t pmem_off = blk_rq_pos(req) * 512 + pmem->data_offset; > + void *pmem_addr = pmem->virt_addr + pmem_off; > + struct nd_region *nd_region = to_region(pmem); > + size_t len; > + struct dma_device *dma = cmd->chan->device; > + struct dmaengine_unmap_data *unmap; > + dma_cookie_t cookie; > + struct dma_async_tx_descriptor *txd; > + struct page *page; > + unsigned int off; > + int rc; > + enum dma_data_direction dir; > + dma_addr_t dma_addr; > + > + if (req->cmd_flags & REQ_FLUSH) > + nvdimm_flush(nd_region); > + > + unmap = dmaengine_get_unmap_data(dma->dev, 2, GFP_NOWAIT); > + if (!unmap) { > + dev_dbg(dev, "failed to get dma unmap data\n"); > + rc = -ENOMEM; The value of 'rc' isn't used at all in the error paths at the end of this function. Instead it ends the mq request with -ENXIO and returns -ENXIO unconditionally. That code should probably use 'rc' instead ... (continued in next block) > + goto err; > + } > + > + /* > + * If reading from pmem, writing to scatterlist, > + * and if writing to pmem, reading from scatterlist. 
> + */ > + dir = is_write ? DMA_FROM_DEVICE : DMA_TO_DEVICE; > + cmd->sg_nents = blk_rq_map_sg(req->q, req, cmd->sg); > + if (cmd->sg_nents < 1) { > + rc = -EINVAL; > + goto err; > + } > + > + if (cmd->sg_nents > 128) { > + rc = -ENOMEM; > + dev_warn(dev, "Number of sg greater than allocated\n"); > + goto err; > + } > + > + rc = dma_map_sg(dma->dev, cmd->sg, cmd->sg_nents, dir); > + if (rc < 1) { > + rc = -ENXIO; > + goto err; > + } > + > + len = blk_rq_payload_bytes(req); > + page = virt_to_page(pmem_addr); > + off = (u64)pmem_addr & ~PAGE_MASK; > + dir = is_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE; > + dma_addr = dma_map_page(dma->dev, page, off, len, dir); > + if (dma_mapping_error(dma->dev, unmap->addr[0])) { > + dev_dbg(dma->dev, "src DMA mapping error\n"); > + goto err_unmap_sg; ... which means that these later gotos need to set 'rc'. This applies to the rest of the gotos in this function. > + } > + > + unmap->len = len; > + > + if (is_write) { > + unmap->addr[0] = dma_addr; > + unmap->addr[1] = (dma_addr_t)cmd->sg; > + unmap->to_cnt = 1; > + unmap->from_cnt = 0; > + dma_unmap_data_sg_from_nents(unmap, 2) = cmd->sg_nents; > + txd = dma->device_prep_dma_memcpy_from_sg(cmd->chan, dma_addr, > + cmd->sg, cmd->sg_nents, DMA_PREP_INTERRUPT); > + } else { > + unmap->addr[0] = (dma_addr_t)cmd->sg; > + unmap->addr[1] = dma_addr; > + unmap->from_cnt = 1; > + unmap->to_cnt = 0; > + dma_unmap_data_sg_to_nents(unmap, 2) = cmd->sg_nents; > + txd = dma->device_prep_dma_memcpy_to_sg(cmd->chan, cmd->sg, > + cmd->sg_nents, dma_addr, DMA_PREP_INTERRUPT); > + } > + > + if (!txd) { > + dev_dbg(dma->dev, "dma prep failed\n"); > + goto err_unmap_buffer; > + } > + > + txd->callback_result = nd_pmem_dma_callback; > + txd->callback_param = cmd; > + dma_set_unmap(txd, unmap); > + cookie = dmaengine_submit(txd); > + if (dma_submit_error(cookie)) { > + dev_dbg(dma->dev, "dma submit error\n"); > + goto err_set_unmap; > + } > + > + dmaengine_unmap_put(unmap); > + dma_async_issue_pending(cmd->chan); > + > + return 0; > + > +err_set_unmap: > + dmaengine_unmap_put(unmap); > +err_unmap_buffer: > + dma_unmap_page(dev, dma_addr, len, dir); > +err_unmap_sg: > + if (dir == DMA_TO_DEVICE) > + dir = DMA_FROM_DEVICE; > + else > + dir = DMA_TO_DEVICE; > + dma_unmap_sg(dev, cmd->sg, cmd->sg_nents, dir); > + dmaengine_unmap_put(unmap); > +err: > + blk_mq_end_request(cmd->rq, -ENXIO); > + return -ENXIO; > +}
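A sketch of the error-path shape this review asks for — each failure site records its reason in 'rc' before jumping, and the final label propagates that value rather than a hard-coded -ENXIO. Only the relevant lines are shown; the label names follow the patch, the rest is illustrative:

	dma_addr = dma_map_page(dma->dev, page, off, len, dir);
	if (dma_mapping_error(dma->dev, dma_addr)) {	/* check the address just mapped */
		dev_dbg(dma->dev, "src DMA mapping error\n");
		rc = -ENXIO;				/* record the reason before the goto */
		goto err_unmap_sg;
	}
	...
err_unmap_buffer:
	dma_unmap_page(dma->dev, dma_addr, len, dir);
err_unmap_sg:
	dma_unmap_sg(dma->dev, cmd->sg, cmd->sg_nents,
			is_write ? DMA_FROM_DEVICE : DMA_TO_DEVICE);
	dmaengine_unmap_put(unmap);
err:
	blk_mq_end_request(cmd->rq, rc);		/* propagate rc, not -ENXIO */
	return rc;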
On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote: > On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote: > > Dave Jiang <dave.jiang@intel.com> writes: > > > >> Adding DMA support for pmem blk reads. This provides signficant CPU > >> reduction with large memory reads with good performance. DMAs are triggered > >> with test against bio_multiple_segment(), so the small I/Os (4k or less?) > >> are still performed by the CPU in order to reduce latency. By default > >> the pmem driver will be using blk-mq with DMA. > >> > >> Numbers below are measured against pmem simulated via DRAM using > >> memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon > >> platform. Keep in mind the performance for actual persistent memory > >> will differ. > >> Fio 2.21 was used. > >> > >> 64k: 1 task queuedepth=1 > >> CPU Read: 7631 MB/s 99.7% CPU DMA Read: 2415 MB/s 54% CPU > >> CPU Write: 3552 MB/s 100% CPU DMA Write 2173 MB/s 54% CPU > >> > >> 64k: 16 tasks queuedepth=16 > >> CPU Read: 36800 MB/s 1593% CPU DMA Read: 29100 MB/s 607% CPU > >> CPU Write 20900 MB/s 1589% CPU DMA Write: 23400 MB/s 585% CPU > >> > >> 2M: 1 task queuedepth=1 > >> CPU Read: 6013 MB/s 99.3% CPU DMA Read: 7986 MB/s 59.3% CPU > >> CPU Write: 3579 MB/s 100% CPU DMA Write: 5211 MB/s 58.3% CPU > >> > >> 2M: 16 tasks queuedepth=16 > >> CPU Read: 18100 MB/s 1588% CPU DMA Read: 21300 MB/s 180.9% CPU > >> CPU Write: 14100 MB/s 1594% CPU DMA Write: 20400 MB/s 446.9% CPU > >> > >> Signed-off-by: Dave Jiang <dave.jiang@intel.com> > >> --- > > > > Hi Dave, > > > > The above table shows that there's a performance benefit for 2M > > transfers but a regression for 64k transfers, if we forget about the CPU > > utilization for a second. > > I don't think we can forget about cpu utilization, and I would expect > most users would value cpu over small bandwidth increases. Especially > when those numbers show efficiency like this > > +160% cpu +26% read bandwidth > +171% cpu -10% write bandwidth > > > Would it be beneficial to have heuristics on > > the transfer size that decide when to use dma and when not? You > > introduced this hunk: > > > > - rc = pmem_handle_cmd(cmd); > > + if (cmd->chan && bio_multiple_segments(req->bio)) > > + rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); > > + else > > + rc = pmem_handle_cmd(cmd); > > > > Which utilizes dma for bios with multiple segments and for single > > segment bios you use the old path, maybe the single/multi segment logic > > can be amended to have something like: > > > > if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH) > > rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)); > > else > > rc = pmem_handle_cmd(cmd); > > > > Just something woth considering IMHO. > > The thing we want to avoid most is having the cpu stall doing nothing > waiting for memory access, so I think once the efficiency of adding > more cpu goes non-linear it's better to let the dma engine handle > that. The current heuristic seems to achieve that, but the worry is > does it over-select the cpu for cases where requests could be merged > into a bulkier i/o that is more suitable for dma. > > I'm not sure we want another tunable vs predictable / efficient > behavior by default. Sorry for my late reply, but lots of automated performance regression CI tools _will_ report errors because of the performance drop. Yes these may be false positives, but it'll cost hours to resolve them. If we have a tunable it's way easier to set it correctly for the use cases. 
And please keep in mind that some people don't really care about CPU utilization but about maximum throughput. Thanks, Johannes
On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote: > On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote: >> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote: >> > Dave Jiang <dave.jiang@intel.com> writes: >> > >> >> Adding DMA support for pmem blk reads. This provides signficant CPU >> >> reduction with large memory reads with good performance. DMAs are triggered >> >> with test against bio_multiple_segment(), so the small I/Os (4k or less?) >> >> are still performed by the CPU in order to reduce latency. By default >> >> the pmem driver will be using blk-mq with DMA. >> >> >> >> Numbers below are measured against pmem simulated via DRAM using >> >> memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon >> >> platform. Keep in mind the performance for actual persistent memory >> >> will differ. >> >> Fio 2.21 was used. >> >> >> >> 64k: 1 task queuedepth=1 >> >> CPU Read: 7631 MB/s 99.7% CPU DMA Read: 2415 MB/s 54% CPU >> >> CPU Write: 3552 MB/s 100% CPU DMA Write 2173 MB/s 54% CPU >> >> >> >> 64k: 16 tasks queuedepth=16 >> >> CPU Read: 36800 MB/s 1593% CPU DMA Read: 29100 MB/s 607% CPU >> >> CPU Write 20900 MB/s 1589% CPU DMA Write: 23400 MB/s 585% CPU >> >> >> >> 2M: 1 task queuedepth=1 >> >> CPU Read: 6013 MB/s 99.3% CPU DMA Read: 7986 MB/s 59.3% CPU >> >> CPU Write: 3579 MB/s 100% CPU DMA Write: 5211 MB/s 58.3% CPU >> >> >> >> 2M: 16 tasks queuedepth=16 >> >> CPU Read: 18100 MB/s 1588% CPU DMA Read: 21300 MB/s 180.9% CPU >> >> CPU Write: 14100 MB/s 1594% CPU DMA Write: 20400 MB/s 446.9% CPU >> >> >> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com> >> >> --- >> > >> > Hi Dave, >> > >> > The above table shows that there's a performance benefit for 2M >> > transfers but a regression for 64k transfers, if we forget about the CPU >> > utilization for a second. >> >> I don't think we can forget about cpu utilization, and I would expect >> most users would value cpu over small bandwidth increases. Especially >> when those numbers show efficiency like this >> >> +160% cpu +26% read bandwidth >> +171% cpu -10% write bandwidth >> >> > Would it be beneficial to have heuristics on >> > the transfer size that decide when to use dma and when not? You >> > introduced this hunk: >> > >> > - rc = pmem_handle_cmd(cmd); >> > + if (cmd->chan && bio_multiple_segments(req->bio)) >> > + rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); >> > + else >> > + rc = pmem_handle_cmd(cmd); >> > >> > Which utilizes dma for bios with multiple segments and for single >> > segment bios you use the old path, maybe the single/multi segment logic >> > can be amended to have something like: >> > >> > if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH) >> > rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)); >> > else >> > rc = pmem_handle_cmd(cmd); >> > >> > Just something woth considering IMHO. >> >> The thing we want to avoid most is having the cpu stall doing nothing >> waiting for memory access, so I think once the efficiency of adding >> more cpu goes non-linear it's better to let the dma engine handle >> that. The current heuristic seems to achieve that, but the worry is >> does it over-select the cpu for cases where requests could be merged >> into a bulkier i/o that is more suitable for dma. >> >> I'm not sure we want another tunable vs predictable / efficient >> behavior by default. 
> > Sorry for my late reply, but lots of automated performance regression CI tools > _will_ report errors because of the performance drop. Yes these may be false > positives, but it'll cost hours to resolve them. If we have a tunable it's way > easier to set it correctly for the use cases. And please keep in mind some > people don't really case about CPU utilization but maxiumum throughput. Returning more cpu to the application may also increase bandwidth because we stop spinning on completion and spend more time issuing I/O. I'm also worried that this threshold cross-over point is workload and pmem media specific. The tunable should be there as an override, but I think we need simple/sane defaults out of the gate. A "no regressions whatsoever on any metric" policy effectively kills this effort as far as I can see. We otherwise need quite a bit more qualification of different workloads; these 3 fio runs are not enough.
On 08/03/2017 08:41 AM, Dan Williams wrote: > On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote: >> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote: >>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote: >>>> Dave Jiang <dave.jiang@intel.com> writes: >>>> >>>>> Adding DMA support for pmem blk reads. This provides signficant CPU >>>>> reduction with large memory reads with good performance. DMAs are triggered >>>>> with test against bio_multiple_segment(), so the small I/Os (4k or less?) >>>>> are still performed by the CPU in order to reduce latency. By default >>>>> the pmem driver will be using blk-mq with DMA. >>>>> >>>>> Numbers below are measured against pmem simulated via DRAM using >>>>> memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon >>>>> platform. Keep in mind the performance for actual persistent memory >>>>> will differ. >>>>> Fio 2.21 was used. >>>>> >>>>> 64k: 1 task queuedepth=1 >>>>> CPU Read: 7631 MB/s 99.7% CPU DMA Read: 2415 MB/s 54% CPU >>>>> CPU Write: 3552 MB/s 100% CPU DMA Write 2173 MB/s 54% CPU >>>>> >>>>> 64k: 16 tasks queuedepth=16 >>>>> CPU Read: 36800 MB/s 1593% CPU DMA Read: 29100 MB/s 607% CPU >>>>> CPU Write 20900 MB/s 1589% CPU DMA Write: 23400 MB/s 585% CPU >>>>> >>>>> 2M: 1 task queuedepth=1 >>>>> CPU Read: 6013 MB/s 99.3% CPU DMA Read: 7986 MB/s 59.3% CPU >>>>> CPU Write: 3579 MB/s 100% CPU DMA Write: 5211 MB/s 58.3% CPU >>>>> >>>>> 2M: 16 tasks queuedepth=16 >>>>> CPU Read: 18100 MB/s 1588% CPU DMA Read: 21300 MB/s 180.9% CPU >>>>> CPU Write: 14100 MB/s 1594% CPU DMA Write: 20400 MB/s 446.9% CPU >>>>> >>>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com> >>>>> --- >>>> >>>> Hi Dave, >>>> >>>> The above table shows that there's a performance benefit for 2M >>>> transfers but a regression for 64k transfers, if we forget about the CPU >>>> utilization for a second. >>> >>> I don't think we can forget about cpu utilization, and I would expect >>> most users would value cpu over small bandwidth increases. Especially >>> when those numbers show efficiency like this >>> >>> +160% cpu +26% read bandwidth >>> +171% cpu -10% write bandwidth >>> >>>> Would it be beneficial to have heuristics on >>>> the transfer size that decide when to use dma and when not? You >>>> introduced this hunk: >>>> >>>> - rc = pmem_handle_cmd(cmd); >>>> + if (cmd->chan && bio_multiple_segments(req->bio)) >>>> + rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); >>>> + else >>>> + rc = pmem_handle_cmd(cmd); >>>> >>>> Which utilizes dma for bios with multiple segments and for single >>>> segment bios you use the old path, maybe the single/multi segment logic >>>> can be amended to have something like: >>>> >>>> if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH) >>>> rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)); >>>> else >>>> rc = pmem_handle_cmd(cmd); >>>> >>>> Just something woth considering IMHO. >>> >>> The thing we want to avoid most is having the cpu stall doing nothing >>> waiting for memory access, so I think once the efficiency of adding >>> more cpu goes non-linear it's better to let the dma engine handle >>> that. The current heuristic seems to achieve that, but the worry is >>> does it over-select the cpu for cases where requests could be merged >>> into a bulkier i/o that is more suitable for dma. >>> >>> I'm not sure we want another tunable vs predictable / efficient >>> behavior by default. 
>> >> Sorry for my late reply, but lots of automated performance regression CI tools >> _will_ report errors because of the performance drop. Yes these may be false >> positives, but it'll cost hours to resolve them. If we have a tunable it's way >> easier to set it correctly for the use cases. And please keep in mind some >> people don't really case about CPU utilization but maxiumum throughput. > > Returning more cpu to the application may also increase bandwidth > because we stop spinning on completion and spend more time issuing > I/O. I'm also worried that this threshold cross-over point is workload > and pmem media specific. The tunable should be there as an override, > but I think we need simple/sane defaults out of the gate. A "no > regressions whatsoever on any metric" policy effectively kills this > effort as far as I can see. We otherwise need quite a bit more > qualification of different workloads, these 3 fio runs is not enough. > I can put in a tunable knob like Johannes suggested and leave the default for that to what it is currently. And we can adjust as necessary as we do more testing.
On Thu, Aug 3, 2017 at 9:12 AM, Dave Jiang <dave.jiang@intel.com> wrote: > > > On 08/03/2017 08:41 AM, Dan Williams wrote: >> On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote: >>> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote: >>>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote: >>>>> Dave Jiang <dave.jiang@intel.com> writes: >>>>> >>>>>> Adding DMA support for pmem blk reads. This provides signficant CPU >>>>>> reduction with large memory reads with good performance. DMAs are triggered >>>>>> with test against bio_multiple_segment(), so the small I/Os (4k or less?) >>>>>> are still performed by the CPU in order to reduce latency. By default >>>>>> the pmem driver will be using blk-mq with DMA. >>>>>> >>>>>> Numbers below are measured against pmem simulated via DRAM using >>>>>> memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon >>>>>> platform. Keep in mind the performance for actual persistent memory >>>>>> will differ. >>>>>> Fio 2.21 was used. >>>>>> >>>>>> 64k: 1 task queuedepth=1 >>>>>> CPU Read: 7631 MB/s 99.7% CPU DMA Read: 2415 MB/s 54% CPU >>>>>> CPU Write: 3552 MB/s 100% CPU DMA Write 2173 MB/s 54% CPU >>>>>> >>>>>> 64k: 16 tasks queuedepth=16 >>>>>> CPU Read: 36800 MB/s 1593% CPU DMA Read: 29100 MB/s 607% CPU >>>>>> CPU Write 20900 MB/s 1589% CPU DMA Write: 23400 MB/s 585% CPU >>>>>> >>>>>> 2M: 1 task queuedepth=1 >>>>>> CPU Read: 6013 MB/s 99.3% CPU DMA Read: 7986 MB/s 59.3% CPU >>>>>> CPU Write: 3579 MB/s 100% CPU DMA Write: 5211 MB/s 58.3% CPU >>>>>> >>>>>> 2M: 16 tasks queuedepth=16 >>>>>> CPU Read: 18100 MB/s 1588% CPU DMA Read: 21300 MB/s 180.9% CPU >>>>>> CPU Write: 14100 MB/s 1594% CPU DMA Write: 20400 MB/s 446.9% CPU >>>>>> >>>>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com> >>>>>> --- >>>>> >>>>> Hi Dave, >>>>> >>>>> The above table shows that there's a performance benefit for 2M >>>>> transfers but a regression for 64k transfers, if we forget about the CPU >>>>> utilization for a second. >>>> >>>> I don't think we can forget about cpu utilization, and I would expect >>>> most users would value cpu over small bandwidth increases. Especially >>>> when those numbers show efficiency like this >>>> >>>> +160% cpu +26% read bandwidth >>>> +171% cpu -10% write bandwidth >>>> >>>>> Would it be beneficial to have heuristics on >>>>> the transfer size that decide when to use dma and when not? You >>>>> introduced this hunk: >>>>> >>>>> - rc = pmem_handle_cmd(cmd); >>>>> + if (cmd->chan && bio_multiple_segments(req->bio)) >>>>> + rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); >>>>> + else >>>>> + rc = pmem_handle_cmd(cmd); >>>>> >>>>> Which utilizes dma for bios with multiple segments and for single >>>>> segment bios you use the old path, maybe the single/multi segment logic >>>>> can be amended to have something like: >>>>> >>>>> if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH) >>>>> rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)); >>>>> else >>>>> rc = pmem_handle_cmd(cmd); >>>>> >>>>> Just something woth considering IMHO. >>>> >>>> The thing we want to avoid most is having the cpu stall doing nothing >>>> waiting for memory access, so I think once the efficiency of adding >>>> more cpu goes non-linear it's better to let the dma engine handle >>>> that. The current heuristic seems to achieve that, but the worry is >>>> does it over-select the cpu for cases where requests could be merged >>>> into a bulkier i/o that is more suitable for dma. 
>>>> >>>> I'm not sure we want another tunable vs predictable / efficient >>>> behavior by default. >>> >>> Sorry for my late reply, but lots of automated performance regression CI tools >>> _will_ report errors because of the performance drop. Yes these may be false >>> positives, but it'll cost hours to resolve them. If we have a tunable it's way >>> easier to set it correctly for the use cases. And please keep in mind some >>> people don't really case about CPU utilization but maxiumum throughput. >> >> Returning more cpu to the application may also increase bandwidth >> because we stop spinning on completion and spend more time issuing >> I/O. I'm also worried that this threshold cross-over point is workload >> and pmem media specific. The tunable should be there as an override, >> but I think we need simple/sane defaults out of the gate. A "no >> regressions whatsoever on any metric" policy effectively kills this >> effort as far as I can see. We otherwise need quite a bit more >> qualification of different workloads, these 3 fio runs is not enough. >> > > I can put in a tunable knob like Johannes suggested and leave the > default for that to what it is currently. And we can adjust as necessary > as we do more testing. That's the problem: we need a multi-dimensional knob. It's not just transfer size that affects total delivered bandwidth.
On Thu, Aug 03, 2017 at 09:15:12AM -0700, Dan Williams wrote: > > I can put in a tunable knob like Johannes suggested and leave the > > default for that to what it is currently. And we can adjust as necessary > > as we do more testing. > > That's the problem, we need a multi-dimensional knob. It's not just > transfer-size that effects total delivered bandwidth. If Dan opposes it so much, I'm OK with no tunable. I'm just expressing my concern that we get reports like this one [1] as a result. Thanks, Johannes [1] http://marc.info/?l=linux-kernel&m=150175029403913&w=2
On Thu, Aug 3, 2017 at 11:07 PM, Johannes Thumshirn <jthumshirn@suse.de> wrote: > On Thu, Aug 03, 2017 at 09:15:12AM -0700, Dan Williams wrote: >> > I can put in a tunable knob like Johannes suggested and leave the >> > default for that to what it is currently. And we can adjust as necessary >> > as we do more testing. >> >> That's the problem, we need a multi-dimensional knob. It's not just >> transfer-size that effects total delivered bandwidth. > > If Dan opposes it so much, I'm OK with no tunable. I'm just expressing my > concern that we get reports like this [1] one here as a result. Sorry if I came off harsh. In that report from Mel he's showing a slowdown in a representative workload, not just a canned bandwidth benchmark. I'm worried that tuning for a single dimension on an fio run doesn't actually improve throughput delivered to an application.
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 347835b..02ca9f0 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -32,6 +32,8 @@ #include <linux/dax.h> #include <linux/nd.h> #include <linux/blk-mq.h> +#include <linux/dmaengine.h> +#include <linux/dma-mapping.h> #include "pmem.h" #include "pfn.h" #include "nd.h" @@ -41,12 +43,24 @@ enum { PMEM_Q_MQ = 1, }; -static int queue_mode = PMEM_Q_BIO; +static int queue_mode = PMEM_Q_MQ; module_param(queue_mode, int, 0444); MODULE_PARM_DESC(queue_mode, "Pmem Queue Mode"); +static int queue_depth = 128; +module_param(queue_depth, int, 0444); +MODULE_PARM_DESC(queue_depth, "I/O Queue Depth for multi queue mode"); + +/* typically maps to number of DMA channels/devices per socket */ +static int q_per_node = 8; +module_param(q_per_node, int, 0444); +MODULE_PARM_DESC(q_per_node, "Hardware queues per node\n"); + struct pmem_cmd { struct request *rq; + struct dma_chan *chan; + int sg_nents; + struct scatterlist sg[]; }; static struct device *to_dev(struct pmem_device *pmem) @@ -276,6 +290,160 @@ static void pmem_release_disk(void *__pmem) put_disk(pmem->disk); } +static void nd_pmem_dma_callback(void *data, + const struct dmaengine_result *res) +{ + struct pmem_cmd *cmd = data; + struct request *req = cmd->rq; + struct request_queue *q = req->q; + struct pmem_device *pmem = q->queuedata; + struct nd_region *nd_region = to_region(pmem); + struct device *dev = to_dev(pmem); + int rc = 0; + + dev_dbg(dev, "%s()\n", __func__); + + if (res) { + enum dmaengine_tx_result dma_err = res->result; + + switch (dma_err) { + case DMA_TRANS_READ_FAILED: + case DMA_TRANS_WRITE_FAILED: + case DMA_TRANS_ABORTED: + dev_dbg(dev, "bio failed\n"); + rc = -ENXIO; + break; + case DMA_TRANS_NOERROR: + default: + break; + } + } + + if (req->cmd_flags & REQ_FUA) + nvdimm_flush(nd_region); + + dev_dbg(dev, "ending request\n"); + blk_mq_end_request(cmd->rq, rc); +} + +static int pmem_handle_cmd_dma(struct pmem_cmd *cmd, bool is_write) +{ + struct request *req = cmd->rq; + struct request_queue *q = req->q; + struct pmem_device *pmem = q->queuedata; + struct device *dev = to_dev(pmem); + phys_addr_t pmem_off = blk_rq_pos(req) * 512 + pmem->data_offset; + void *pmem_addr = pmem->virt_addr + pmem_off; + struct nd_region *nd_region = to_region(pmem); + size_t len; + struct dma_device *dma = cmd->chan->device; + struct dmaengine_unmap_data *unmap; + dma_cookie_t cookie; + struct dma_async_tx_descriptor *txd; + struct page *page; + unsigned int off; + int rc; + enum dma_data_direction dir; + dma_addr_t dma_addr; + + if (req->cmd_flags & REQ_FLUSH) + nvdimm_flush(nd_region); + + unmap = dmaengine_get_unmap_data(dma->dev, 2, GFP_NOWAIT); + if (!unmap) { + dev_dbg(dev, "failed to get dma unmap data\n"); + rc = -ENOMEM; + goto err; + } + + /* + * If reading from pmem, writing to scatterlist, + * and if writing to pmem, reading from scatterlist. + */ + dir = is_write ? DMA_FROM_DEVICE : DMA_TO_DEVICE; + cmd->sg_nents = blk_rq_map_sg(req->q, req, cmd->sg); + if (cmd->sg_nents < 1) { + rc = -EINVAL; + goto err; + } + + if (cmd->sg_nents > 128) { + rc = -ENOMEM; + dev_warn(dev, "Number of sg greater than allocated\n"); + goto err; + } + + rc = dma_map_sg(dma->dev, cmd->sg, cmd->sg_nents, dir); + if (rc < 1) { + rc = -ENXIO; + goto err; + } + + len = blk_rq_payload_bytes(req); + page = virt_to_page(pmem_addr); + off = (u64)pmem_addr & ~PAGE_MASK; + dir = is_write ? 
DMA_TO_DEVICE : DMA_FROM_DEVICE; + dma_addr = dma_map_page(dma->dev, page, off, len, dir); + if (dma_mapping_error(dma->dev, unmap->addr[0])) { + dev_dbg(dma->dev, "src DMA mapping error\n"); + goto err_unmap_sg; + } + + unmap->len = len; + + if (is_write) { + unmap->addr[0] = dma_addr; + unmap->addr[1] = (dma_addr_t)cmd->sg; + unmap->to_cnt = 1; + unmap->from_cnt = 0; + dma_unmap_data_sg_from_nents(unmap, 2) = cmd->sg_nents; + txd = dma->device_prep_dma_memcpy_from_sg(cmd->chan, dma_addr, + cmd->sg, cmd->sg_nents, DMA_PREP_INTERRUPT); + } else { + unmap->addr[0] = (dma_addr_t)cmd->sg; + unmap->addr[1] = dma_addr; + unmap->from_cnt = 1; + unmap->to_cnt = 0; + dma_unmap_data_sg_to_nents(unmap, 2) = cmd->sg_nents; + txd = dma->device_prep_dma_memcpy_to_sg(cmd->chan, cmd->sg, + cmd->sg_nents, dma_addr, DMA_PREP_INTERRUPT); + } + + if (!txd) { + dev_dbg(dma->dev, "dma prep failed\n"); + goto err_unmap_buffer; + } + + txd->callback_result = nd_pmem_dma_callback; + txd->callback_param = cmd; + dma_set_unmap(txd, unmap); + cookie = dmaengine_submit(txd); + if (dma_submit_error(cookie)) { + dev_dbg(dma->dev, "dma submit error\n"); + goto err_set_unmap; + } + + dmaengine_unmap_put(unmap); + dma_async_issue_pending(cmd->chan); + + return 0; + +err_set_unmap: + dmaengine_unmap_put(unmap); +err_unmap_buffer: + dma_unmap_page(dev, dma_addr, len, dir); +err_unmap_sg: + if (dir == DMA_TO_DEVICE) + dir = DMA_FROM_DEVICE; + else + dir = DMA_TO_DEVICE; + dma_unmap_sg(dev, cmd->sg, cmd->sg_nents, dir); + dmaengine_unmap_put(unmap); +err: + blk_mq_end_request(cmd->rq, -ENXIO); + return -ENXIO; +} + static int pmem_handle_cmd(struct pmem_cmd *cmd) { struct request *req = cmd->rq; @@ -309,13 +477,17 @@ static int pmem_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd) { struct pmem_cmd *cmd = blk_mq_rq_to_pdu(bd->rq); + struct request *req; int rc; - cmd->rq = bd->rq; - - blk_mq_start_request(bd->rq); + req = cmd->rq = bd->rq; + cmd->chan = dma_find_channel(DMA_MEMCPY); + blk_mq_start_request(req); - rc = pmem_handle_cmd(cmd); + if (cmd->chan && bio_multiple_segments(req->bio)) + rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req))); + else + rc = pmem_handle_cmd(cmd); if (rc < 0) rc = BLK_MQ_RQ_QUEUE_ERROR; else @@ -348,10 +520,10 @@ static int pmem_attach_disk(struct device *dev, struct nd_pfn_sb *pfn_sb; struct pmem_device *pmem; struct resource pfn_res; - struct request_queue *q; struct gendisk *disk; void *addr; int rc; + struct dma_chan *chan; /* while nsio_rw_bytes is active, parse a pfn info block if present */ if (is_nd_pfn(dev)) { @@ -381,11 +553,20 @@ static int pmem_attach_disk(struct device *dev, } if (queue_mode == PMEM_Q_MQ) { + chan = dma_find_channel(DMA_MEMCPY); + if (!chan) { + queue_mode = PMEM_Q_BIO; + dev_warn(dev, "Forced back to PMEM_Q_BIO, no DMA\n"); + } + } + + if (queue_mode == PMEM_Q_MQ) { pmem->tag_set.ops = &pmem_mq_ops; - pmem->tag_set.nr_hw_queues = nr_online_nodes; - pmem->tag_set.queue_depth = 64; + pmem->tag_set.nr_hw_queues = nr_online_nodes * q_per_node; + pmem->tag_set.queue_depth = queue_depth; pmem->tag_set.numa_node = dev_to_node(dev); - pmem->tag_set.cmd_size = sizeof(struct pmem_cmd); + pmem->tag_set.cmd_size = sizeof(struct pmem_cmd) + + sizeof(struct scatterlist) * 128; pmem->tag_set.flags = BLK_MQ_F_SHOULD_MERGE; pmem->tag_set.driver_data = pmem; @@ -446,14 +627,17 @@ static int pmem_attach_disk(struct device *dev, return PTR_ERR(addr); pmem->virt_addr = addr; - blk_queue_write_cache(q, true, true); - blk_queue_make_request(q, 
pmem_make_request); - blk_queue_physical_block_size(q, PAGE_SIZE); - blk_queue_max_hw_sectors(q, UINT_MAX); - blk_queue_bounce_limit(q, BLK_BOUNCE_ANY); - queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q); - queue_flag_set_unlocked(QUEUE_FLAG_DAX, q); - q->queuedata = pmem; + blk_queue_write_cache(pmem->q, true, true); + blk_queue_physical_block_size(pmem->q, PAGE_SIZE); + if (queue_mode == PMEM_Q_MQ) { + blk_queue_max_hw_sectors(pmem->q, 0x200000); + blk_queue_max_segments(pmem->q, 128); + } else + blk_queue_max_hw_sectors(pmem->q, UINT_MAX); + blk_queue_bounce_limit(pmem->q, BLK_BOUNCE_ANY); + queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->q); + queue_flag_set_unlocked(QUEUE_FLAG_DAX, pmem->q); + pmem->q->queuedata = pmem; disk = alloc_disk_node(0, nid); if (!disk) @@ -588,15 +772,22 @@ static struct nd_device_driver nd_pmem_driver = { static int __init pmem_init(void) { + if (queue_mode == PMEM_Q_MQ) + dmaengine_get(); + return nd_driver_register(&nd_pmem_driver); } module_init(pmem_init); static void pmem_exit(void) { + if (queue_mode == PMEM_Q_MQ) + dmaengine_put(); + driver_unregister(&nd_pmem_driver.drv); } module_exit(pmem_exit); +MODULE_SOFTDEP("pre: dmaengine"); MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>"); MODULE_LICENSE("GPL v2");
Adding DMA support for pmem blk reads. This provides a significant CPU utilization reduction for large memory reads while maintaining good performance. DMAs are triggered by a test against bio_multiple_segments(), so small I/Os (4k or less) are still performed by the CPU in order to reduce latency. By default the pmem driver will use blk-mq with DMA.

Numbers below are measured against pmem simulated via DRAM using memmap=NN!SS. The DMA engine used is the ioatdma on an Intel Skylake Xeon platform. Keep in mind that the performance for actual persistent memory will differ. Fio 2.21 was used.

64k: 1 task queuedepth=1
CPU Read: 7631 MB/s 99.7% CPU    DMA Read: 2415 MB/s 54% CPU
CPU Write: 3552 MB/s 100% CPU    DMA Write: 2173 MB/s 54% CPU

64k: 16 tasks queuedepth=16
CPU Read: 36800 MB/s 1593% CPU   DMA Read: 29100 MB/s 607% CPU
CPU Write: 20900 MB/s 1589% CPU  DMA Write: 23400 MB/s 585% CPU

2M: 1 task queuedepth=1
CPU Read: 6013 MB/s 99.3% CPU    DMA Read: 7986 MB/s 59.3% CPU
CPU Write: 3579 MB/s 100% CPU    DMA Write: 5211 MB/s 58.3% CPU

2M: 16 tasks queuedepth=16
CPU Read: 18100 MB/s 1588% CPU   DMA Read: 21300 MB/s 180.9% CPU
CPU Write: 14100 MB/s 1594% CPU  DMA Write: 20400 MB/s 446.9% CPU

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/nvdimm/pmem.c | 225 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 208 insertions(+), 17 deletions(-)