[0/4] brd: use memcpy_[to|from]_page() in brd

Message ID 20230328195626.12075-1-kch@nvidia.com (mailing list archive)

Chaitanya Kulkarni March 28, 2023, 7:56 p.m. UTC
Hi,

From include/linux/highmem.h:
"kmap_atomic - Atomically map a page for temporary usage - Deprecated!"

Use memcpy_to_page() and memcpy_from_page() instead, since they do the
same job of mapping, copying, and unmapping, except that they use the
non-deprecated kmap_local_page() and kunmap_local(). kmap_local_page()
differs from kmap_atomic() in that it :-

* creates a per-thread mapping that is local to the CPU and not
  globally visible
* can be called from any context
* allows task preemption
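
As a rough illustration of the map, copy, unmap pattern described above,
here is a userspace model of the two helpers. It is a sketch only:
struct page, PAGE_SIZE, kmap_local_page() and kunmap_local() below are
stand-ins written so that the code builds outside the kernel, and
assert() stands in for the kernel's bounds check. The helper signatures
are meant to follow include/linux/highmem.h, but please check the tree
you are working on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Stand-in for the kernel's struct page: just a buffer of one page. */
struct page {
	unsigned char data[PAGE_SIZE];
};

/* Stand-in: in the kernel this creates a thread-local mapping. */
static void *kmap_local_page(struct page *page)
{
	return page->data;
}

/* Stand-in: in the kernel this tears the local mapping down. */
static void kunmap_local(const void *addr)
{
	(void)addr;
}

/* Map the page, copy len bytes out of it at offset, then unmap. */
static void memcpy_from_page(char *to, struct page *page,
			     size_t offset, size_t len)
{
	char *from = kmap_local_page(page);

	assert(offset + len <= PAGE_SIZE);	/* models the kernel's bounds check */
	memcpy(to, from + offset, len);
	kunmap_local(from);
}

/* Map the page, copy len bytes into it at offset, then unmap. */
static void memcpy_to_page(struct page *page, size_t offset,
			   const char *from, size_t len)
{
	char *to = kmap_local_page(page);

	assert(offset + len <= PAGE_SIZE);
	memcpy(to + offset, from, len);
	/* the in-kernel helper also flushes the dcache for the page here */
	kunmap_local(to);
}
```

A caller that used to open-code kmap_atomic(), memcpy() and
kunmap_atomic() collapses into a single helper call, which is where the
net reduction in lines in the diffstat comes from.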

A slight performance difference is observed with the new API on the one
arch I've tested, across two sets of runs :-

Set 1 (Average of 3 runs) :-
-----------------------------
* Latency (lower is better)   :- ~14 nsec lower with this patch series
* IOPS/BW (higher is better)  :- ~47k higher with this patch series
* CPU Usage (lower is better) :- approximately the same

Set 2 (Average of 3 runs) :-
-----------------------------
* Latency (lower is better)   :- ~9 nsec lower with this patch series
* IOPS/BW (higher is better)  :- ~23k higher with this patch series
* CPU Usage (lower is better) :- approximately the same
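
For anyone who wants to re-derive the deltas, a small sketch of the
arithmetic follows. The per-run values are copied from the fio output in
this message (average latency in nsec, read IOPS in thousands); the
names mean() and delta() and the array names are mine and purely
illustrative. delta() is the with-patch mean minus the baseline mean, so
a negative latency delta means the patched runs are faster.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* With-patch mean minus baseline mean. */
static double delta(const double *base, const double *patched, size_t n)
{
	return mean(patched, n) - mean(base, n);
}

/* Set 1: avg latency (nsec) and read IOPS (k) per run. */
static const double set1_lat_base[]     = { 2918.00, 2904.26, 2934.00 };
static const double set1_lat_patched[]  = { 2895.35, 2919.57, 2899.98 };
static const double set1_iops_base[]    = { 7504, 7525, 7441 };
static const double set1_iops_patched[] = { 7558, 7494, 7561 };

/* Set 2: avg latency (nsec) and read IOPS (k) per run. */
static const double set2_lat_base[]     = { 2879.71, 2905.51, 2843.13 };
static const double set2_lat_patched[]  = { 2867.75, 2867.36, 2866.50 };
static const double set2_iops_base[]    = { 7613, 7503, 7698 };
static const double set2_iops_patched[] = { 7623, 7623, 7637 };
```

Calling delta(set1_lat_base, set1_lat_patched, 3) gives the Set 1
latency delta, and likewise for the other pairs. With only three runs
per configuration and a per-run spread of the same order as the deltas,
these numbers are best read as "no measurable regression".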

Below is the test for the fio verification job and perf numbers on brd.

In case someone shows up with a performance regression on an arch that
I don't have access to, we can decide then whether to drop this series
or keep using the deprecated kernel API, but I think removing the
deprecated API is useful in the long term anyway.

-ck

Chaitanya Kulkarni (4):
  brd: use memcpy_to_page() in copy_to_brd()
  brd: use memcpy_to_page() in copy_to_brd()
  brd: use memcpy_from_page() in copy_from_brd()
  brd: use memcpy_from_page() in copy_from_brd()

 drivers/block/brd.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

#######################################################################
Testing with fio verification and randread workload on brd:-

linux-block (brd-memcpy) # sh test-brd-memcpy-perf.sh 
Switched to branch 'for-next'
Your branch is ahead of 'origin/for-next' by 274 commits.
  (use "git push" to publish your local commits)
+ umount /mnt/brd
umount: /mnt/brd: not mounted.
+ dmesg -c
+ modprobe -r brd
+ lsmod
+ grep brd
++ nproc
+ make -j 48 M=drivers/block modules
  CC [M]  drivers/block/brd.o
  MODPOST drivers/block/Module.symvers
  CC [M]  drivers/block/floppy.mod.o
  CC [M]  drivers/block/brd.mod.o
  CC [M]  drivers/block/loop.mod.o
  CC [M]  drivers/block/nbd.mod.o
  CC [M]  drivers/block/virtio_blk.mod.o
  CC [M]  drivers/block/xen-blkfront.mod.o
  CC [M]  drivers/block/xen-blkback/xen-blkback.mod.o
  CC [M]  drivers/block/drbd/drbd.mod.o
  CC [M]  drivers/block/rbd.mod.o
  CC [M]  drivers/block/mtip32xx/mtip32xx.mod.o
  CC [M]  drivers/block/zram/zram.mod.o
  CC [M]  drivers/block/null_blk/null_blk.mod.o
  LD [M]  drivers/block/brd.ko
  LD [M]  drivers/block/virtio_blk.ko
  LD [M]  drivers/block/floppy.ko
  LD [M]  drivers/block/xen-blkfront.ko
  LD [M]  drivers/block/mtip32xx/mtip32xx.ko
  LD [M]  drivers/block/drbd/drbd.ko
  LD [M]  drivers/block/nbd.ko
  LD [M]  drivers/block/xen-blkback/xen-blkback.ko
  LD [M]  drivers/block/null_blk/null_blk.ko
  LD [M]  drivers/block/rbd.ko
  LD [M]  drivers/block/loop.ko
  LD [M]  drivers/block/zram/zram.ko
+ HOST=drivers/block/brd.ko
++ uname -r
+ HOST_DEST=/lib/modules/6.3.0-rc4lblk+/kernel/drivers/block/null_blk/
+ cp drivers/block/brd.ko /lib/modules/6.3.0-rc4lblk+/kernel/drivers/block/null_blk//
+ ls -lrth /lib/modules/6.3.0-rc4lblk+/kernel/drivers/block/null_blk//brd.ko
-rw-r--r--. 1 root root 377K Mar 27 16:00 /lib/modules/6.3.0-rc4lblk+/kernel/drivers/block/null_blk//brd.ko
+ dmesg -c
write-and-verify: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
fio-3.27
Starting 1 process
Jobs: 1 (f=0): [f(1)][100.0%][r=1222MiB/s][r=313k IOPS][eta 00m:00s]                
write-and-verify: (groupid=0, jobs=1): err= 0: pid=3701: Mon Mar 27 16:07:51 2023
  read: IOPS=401k, BW=1565MiB/s (1641MB/s)(6470MiB/4135msec)
    slat (nsec): min=1082, max=117624, avg=1430.90, stdev=419.78
    clat (nsec): min=1122, max=158170, avg=37721.35, stdev=2449.84
     lat (usec): min=2, max=159, avg=39.20, stdev= 2.51
    clat percentiles (nsec):
     |  1.00th=[36096],  5.00th=[36096], 10.00th=[36608], 20.00th=[36608],
     | 30.00th=[36608], 40.00th=[37120], 50.00th=[37120], 60.00th=[37120],
     | 70.00th=[37632], 80.00th=[37632], 90.00th=[38656], 95.00th=[42752],
     | 99.00th=[46848], 99.50th=[49920], 99.90th=[59648], 99.95th=[65280],
     | 99.99th=[90624]
  write: IOPS=209k, BW=817MiB/s (856MB/s)(10.0GiB/12540msec); 0 zone resets
    slat (usec): min=2, max=130, avg= 4.18, stdev= 1.04
    clat (nsec): min=1152, max=297666, avg=72041.65, stdev=6856.78
     lat (usec): min=5, max=300, avg=76.27, stdev= 7.21
    clat percentiles (usec):
     |  1.00th=[   55],  5.00th=[   62], 10.00th=[   65], 20.00th=[   69],
     | 30.00th=[   71], 40.00th=[   72], 50.00th=[   73], 60.00th=[   74],
     | 70.00th=[   75], 80.00th=[   76], 90.00th=[   79], 95.00th=[   83],
     | 99.00th=[   91], 99.50th=[   97], 99.90th=[  122], 99.95th=[  130],
     | 99.99th=[  155]
   bw (  KiB/s): min=48776, max=1028502, per=96.45%, avg=806517.46, stdev=164544.29, samples=26
   iops        : min=12194, max=257125, avg=201629.42, stdev=41136.06, samples=26
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=38.63%
  lat (usec)   : 100=61.12%, 250=0.25%, 500=0.01%
  cpu          : usr=54.26%, sys=45.67%, ctx=20, majf=0, minf=38837
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1656350,2621440,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=1565MiB/s (1641MB/s), 1565MiB/s-1565MiB/s (1641MB/s-1641MB/s), io=6470MiB (6784MB), run=4135-4135msec
  WRITE: bw=817MiB/s (856MB/s), 817MiB/s-817MiB/s (856MB/s-856MB/s), io=10.0GiB (10.7GB), run=12540-12540msec

Disk stats (read/write):
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


#######################################################################
Performance numbers :-

* Set 1:-
----------------
* Avg Latency delta (lower is better) :- ~14 nsec lower with this patch series

linux-block (brd-memcpy) # grep -w "lat (nsec):" *brd*fio 
default-brd.1.fio:     lat (nsec): min=1363, max=4413.5k, avg=2918.00, stdev=1731.03
default-brd.2.fio:     lat (nsec): min=1393, max=4754.7k, avg=2904.26, stdev=1692.10
default-brd.3.fio:     lat (nsec): min=1393, max=4646.2k, avg=2934.00, stdev=1652.24
(2918.00+2904.26+2934.00)/3 = 2918

with-memcpy-brd.1.fio: lat (nsec): min=1413, max=1176.6k, avg=2895.35, stdev=1552.79
with-memcpy-brd.2.fio: lat (nsec): min=1393, max=647331,  avg=2919.57, stdev=1564.59
with-memcpy-brd.3.fio: lat (nsec): min=1393, max=1685.6k, avg=2899.98, stdev=1558.76
(2895.35+2919.57+2899.98)/3 = 2904

* Avg IOPS/BW delta (higher is better) :- ~47k higher with this patch series

linux-block (brd-memcpy) # grep IOPS *brd*fio
default-brd.1.fio:     read: IOPS=7504k, BW=28.6GiB/s (30.7GB/s)(1717GiB/60001msec)
default-brd.2.fio:     read: IOPS=7525k, BW=28.7GiB/s (30.8GB/s)(1722GiB/60002msec)
default-brd.3.fio:     read: IOPS=7441k, BW=28.4GiB/s (30.5GB/s)(1703GiB/60001msec)
(7504+7525+7441)/3 = 7490

with-memcpy-brd.1.fio: read: IOPS=7558k, BW=28.8GiB/s (31.0GB/s)(1730GiB/60002msec)
with-memcpy-brd.2.fio: read: IOPS=7494k, BW=28.6GiB/s (30.7GB/s)(1715GiB/60001msec)
with-memcpy-brd.3.fio: read: IOPS=7561k, BW=28.8GiB/s (31.0GB/s)(1731GiB/60001msec)
(7558+7494+7561)/3 = 7537


* Avg CPU Usage delta (lower is better) :- approximately the same 

linux-block (brd-memcpy) # grep cpu  *brd*fio
default-brd.1.fio:      cpu: usr=15.98%, sys=83.92%, ctx=2858, majf=0, minf=347
default-brd.2.fio:      cpu: usr=16.37%, sys=83.53%, ctx=2181, majf=0, minf=351
default-brd.3.fio:      cpu: usr=15.97%, sys=83.94%, ctx=2363, majf=0, minf=353
(83.92+83.53+83.94)/3 = 83

with-memcpy-brd.1.fio:  cpu: usr=16.48%, sys=83.42%, ctx=8127, majf=0, minf=348
with-memcpy-brd.2.fio:  cpu: usr=16.41%, sys=83.48%, ctx=9116, majf=0, minf=371
with-memcpy-brd.3.fio:  cpu: usr=16.38%, sys=83.52%, ctx=2361, majf=0, minf=360
(83.42+83.48+83.52)/3 = 83


* Set 2:-
---------------
* Avg Latency delta (lower is better) :- ~9 nsec lower with this patch series

linux-block (brd-memcpy) # grep -w "lat (nsec):" *brd*fio 
default-brd.1.fio:     lat (nsec): min=1362, max=895642, avg=2879.71, stdev=1554.52
default-brd.2.fio:     lat (nsec): min=1363, max=856197, avg=2905.51, stdev=1539.65
default-brd.3.fio:     lat (nsec): min=1362, max=1114.1k, avg=2843.13, stdev=1581.05
(2879.71+2905.51+2843.13)/3 = 2876

with-memcpy-brd.1.fio: lat (nsec): min=1362, max=1079.7k, avg=2867.75, stdev=1565.19
with-memcpy-brd.2.fio: lat (nsec): min=1362, max=1160.5k, avg=2867.36, stdev=1539.65
with-memcpy-brd.3.fio: lat (nsec): min=1343, max=859683, avg=2866.50, stdev=1546.11
(2867.75+2867.36+2866.50)/3 = 2867

* Avg IOPS/BW delta (higher is better) :- ~23k higher with this patch series

linux-block (brd-memcpy) # grep IOPS  *brd*fio
default-brd.1.fio:     read: IOPS=7613k, BW=29.0GiB/s (31.2GB/s)(1743GiB/60002msec)
default-brd.2.fio:     read: IOPS=7503k, BW=28.6GiB/s (30.7GB/s)(1717GiB/60002msec)
default-brd.3.fio:     read: IOPS=7698k, BW=29.4GiB/s (31.5GB/s)(1762GiB/60001msec)
(7613+7503+7698)/3 = 7604

with-memcpy-brd.1.fio: read: IOPS=7623k, BW=29.1GiB/s (31.2GB/s)(1745GiB/60002msec)
with-memcpy-brd.2.fio: read: IOPS=7623k, BW=29.1GiB/s (31.2GB/s)(1745GiB/60001msec)
with-memcpy-brd.3.fio: read: IOPS=7637k, BW=29.1GiB/s (31.3GB/s)(1748GiB/60001msec)
(7623+7623+7637)/3 = 7627


* Avg CPU Usage delta (lower is better) :- approximately the same 

linux-block (brd-memcpy) # grep cpu  *brd*fio
default-brd.1.fio:      cpu: usr=15.32%, sys=84.58%, ctx=1485, majf=0, minf=360
default-brd.2.fio:      cpu: usr=16.70%, sys=83.20%, ctx=1691, majf=0, minf=357
default-brd.3.fio:      cpu: usr=15.59%, sys=84.31%, ctx=1835, majf=0, minf=345
(84.58+83.20+84.31)/3 = 84

with-memcpy-brd.1.fio:  cpu: usr=15.84%, sys=84.06%, ctx=1800, majf=0, minf=350
with-memcpy-brd.2.fio:  cpu: usr=16.22%, sys=83.68%, ctx=1831, majf=0, minf=342
with-memcpy-brd.3.fio:  cpu: usr=15.79%, sys=84.11%, ctx=1689, majf=0, minf=341
(84.06+83.68+84.11)/3 = 83