Message ID: <20200407100007.3894-1-sjpark@amazon.com>
From: SeongJae Park <sjpark@amazon.com>
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC v6 0/7] Implement Data Access Monitoring-based Memory Operation Schemes
Date: Tue, 7 Apr 2020 11:59:59 +0200
From: SeongJae Park <sjpark@amazon.de>

DAMON[1] can be used as a primitive for data access-aware memory management
optimizations. That said, users who want such optimizations should run DAMON,
read the monitoring results, analyze them, plan a new memory management
scheme, and apply the new scheme by themselves. Such efforts will be
inevitable for some complicated optimizations. However, in many other cases,
the users would simply want the system to apply a memory management action to
a memory region of a specific size that shows a specific access frequency for
a specific amount of time. For example, "page out a memory region larger than
100 MiB that has shown only rare accesses for more than 2 minutes", or "do
not use THP for a memory region larger than 2 MiB that has been rarely
accessed for more than 1 second".

This RFC patchset extends DAMON to handle such data access monitoring-based
operation schemes. With this change, users can realize data access-aware
optimizations by simply specifying their schemes to DAMON.

Evaluations
===========

Setup
-----

On my personal QEMU/KVM based virtual machine on an Intel i7 host machine
running Ubuntu 18.04, I measure runtime and consumed system memory while
running various realistic workloads with several configurations. I use 13
and 12 workloads in the PARSEC3[3] and SPLASH-2X[4] benchmark suites,
respectively. I use my wrapper scripts[5] for setup and running of the
workloads. On top of this patchset, the DAMON-based operation schemes[6] are
also applied for this evaluation.

Measurement
~~~~~~~~~~~

For the measurement of the amount of consumed memory in system global scope,
I drop caches before starting each of the workloads and monitor 'MemFree' in
the '/proc/meminfo' file. To make the results more stable, I repeat the runs
5 times and average the results. The stdev, min, and max of the numbers among
the repeated runs are available in the appendix below.

Configurations
~~~~~~~~~~~~~~

The configurations I use are as below.

orig: Linux v5.5 with 'madvise' THP policy
rec:  'orig' plus DAMON running with the record feature
thp:  same as 'orig', but with the 'always' THP policy
ethp: 'orig' plus a DAMON operation scheme[6], 'efficient THP'
prcl: 'orig' plus a DAMON operation scheme, 'proactive reclaim[7]'

I use 'rec' for measurement of DAMON's overheads to target workloads and
system memory. The remaining configs, 'thp', 'ethp', and 'prcl', are for
measurement of DAMON's monitoring accuracy. 'ethp' and 'prcl' are simple
DAMON-based operation schemes developed as proofs of concept of DAMON.
'ethp' reduces the memory space waste of THP by using DAMON for decisions on
promotion and demotion of huge pages, while 'prcl' is similar to the original
proactive reclamation work[7]. Those are implemented as below:

# format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>
# ethp: Use huge pages if a region >2MB shows >5% access rate, use regular
# pages if a region >2MB shows <5% access rate for >1 second
2M null 5 null null null hugepage
2M null null 5 1s null nohugepage

# prcl: If a region >4KB shows <5% access rate for >5 seconds, page out.
4K null null 5 500ms null pageout

Note that both 'ethp' and 'prcl' are designed based only on my
straightforward intuition, because they are meant only as proofs of concept
of DAMON and its monitoring accuracy. In other words, they are not for
production; for production use, they should be tuned further.
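As an illustration of the format's semantics, a human-friendly scheme line
could be parsed into numeric fields as in the minimal Python sketch below.
This is only an editor-added sketch for clarity, not the actual
'tools/damon/_convert_damos.py' from this patchset; the helper names and the
exact unit handling are assumptions for illustration:

    # Minimal, hypothetical parser for the human-friendly scheme format:
    #   <min/max size> <min/max frequency (0-100)> <min/max age> <action>
    # 'null' means "no limit". Illustrative only; see
    # tools/damon/_convert_damos.py for the real converter.
    SIZE_UNITS = {'K': 1 << 10, 'M': 1 << 20, 'G': 1 << 30}
    TIME_UNITS_US = {'us': 1, 'ms': 1000, 's': 1000 * 1000,
                     'm': 60 * 1000 * 1000}

    def parse_size(text):
        """'2M' -> 2097152 bytes; 'null' -> None (no limit)."""
        if text == 'null':
            return None
        if text[-1] in SIZE_UNITS:
            return int(text[:-1]) * SIZE_UNITS[text[-1]]
        return int(text)

    def parse_time_us(text):
        """'1s' -> 1000000 microseconds; 'null' -> None (no limit)."""
        if text == 'null':
            return None
        # Check longer unit suffixes first so 'ms' wins over 'm' or 's'.
        for unit in sorted(TIME_UNITS_US, key=len, reverse=True):
            if text.endswith(unit):
                return int(text[:-len(unit)]) * TIME_UNITS_US[unit]
        return int(text)

    def parse_scheme(line):
        """Parse one scheme line into a dict of limits and an action."""
        f = line.split()
        if len(f) != 7:
            raise ValueError('expected 7 fields, got: %r' % line)
        return {
            'min_size': parse_size(f[0]), 'max_size': parse_size(f[1]),
            'min_freq': None if f[2] == 'null' else int(f[2]),
            'max_freq': None if f[3] == 'null' else int(f[3]),
            'min_age_us': parse_time_us(f[4]),
            'max_age_us': parse_time_us(f[5]),
            'action': f[6],
        }

    print(parse_scheme('2M null null 5 1s null nohugepage'))

For example, the 'nohugepage' line above parses to a minimum region size of
2 MiB, a maximum access frequency of 5, and a minimum age of one second,
matching the 'ethp' demotion rule.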
[1] "Redis latency problems troubleshooting", https://redis.io/topics/latency [2] "Disable Transparent Huge Pages (THP)", https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/ [3] "The PARSEC Becnhmark Suite", https://parsec.cs.princeton.edu/index.htm [4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x [5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu [6] "[RFC v4 0/7] Implement Data Access Monitoring-based Memory Operation Schemes", https://lore.kernel.org/linux-mm/20200303121406.20954-1-sjpark@amazon.com/ [7] "Proactively reclaiming idle memory", https://lwn.net/Articles/787611/ Results ------- Below two tables show the measurement results. The runtimes are in seconds while the memory usages are in KiB. Each configurations except 'orig' shows its overhead relative to 'orig' in percent within parenthesises. runtime orig rec (overhead) thp (overhead) ethp (overhead) prcl (overhead) parsec3/blackscholes 107.097 106.955 (-0.13) 106.352 (-0.70) 107.357 (0.24) 108.284 (1.11) parsec3/bodytrack 79.135 79.062 (-0.09) 78.996 (-0.18) 79.261 (0.16) 79.824 (0.87) parsec3/canneal 139.036 139.694 (0.47) 125.947 (-9.41) 131.071 (-5.73) 148.648 (6.91) parsec3/dedup 11.914 11.905 (-0.07) 11.729 (-1.55) 11.916 (0.02) 12.613 (5.87) parsec3/facesim 208.761 209.476 (0.34) 204.778 (-1.91) 206.157 (-1.25) 214.016 (2.52) parsec3/ferret 190.854 191.309 (0.24) 190.223 (-0.33) 190.821 (-0.02) 191.847 (0.52) parsec3/fluidanimate 211.317 213.798 (1.17) 208.883 (-1.15) 211.319 (0.00) 214.566 (1.54) parsec3/freqmine 288.672 290.547 (0.65) 288.310 (-0.13) 288.727 (0.02) 292.294 (1.25) parsec3/raytrace 118.692 119.443 (0.63) 118.625 (-0.06) 118.986 (0.25) 129.942 (9.48) parsec3/streamcluster 323.387 327.244 (1.19) 284.931 (-11.89) 290.604 (-10.14) 330.111 (2.08) parsec3/swaptions 154.304 154.891 (0.38) 154.373 (0.04) 155.226 (0.60) 155.338 (0.67) parsec3/vips 58.879 59.254 (0.64) 58.459 (-0.71) 59.029 (0.25) 59.761 (1.50) parsec3/x264 71.805 68.718 (-4.30) 67.262 (-6.33) 69.494 (-3.22) 71.291 (-0.72) splash2x/barnes 80.624 80.680 (0.07) 74.538 (-7.55) 78.363 (-2.80) 86.373 (7.13) splash2x/fft 33.462 33.285 (-0.53) 23.146 (-30.83) 33.306 (-0.47) 35.311 (5.53) splash2x/lu_cb 85.474 85.681 (0.24) 84.516 (-1.12) 85.525 (0.06) 87.267 (2.10) splash2x/lu_ncb 93.227 93.211 (-0.02) 90.939 (-2.45) 93.526 (0.32) 94.409 (1.27) splash2x/ocean_cp 44.348 44.668 (0.72) 42.920 (-3.22) 44.128 (-0.50) 45.785 (3.24) splash2x/ocean_ncp 81.234 81.275 (0.05) 51.441 (-36.67) 64.974 (-20.02) 94.207 (15.97) splash2x/radiosity 90.976 91.131 (0.17) 90.325 (-0.72) 91.395 (0.46) 97.867 (7.57) splash2x/radix 31.269 31.185 (-0.27) 25.103 (-19.72) 29.289 (-6.33) 37.713 (20.61) splash2x/raytrace 83.945 84.242 (0.35) 82.314 (-1.94) 83.334 (-0.73) 84.655 (0.85) splash2x/volrend 86.703 87.545 (0.97) 86.324 (-0.44) 86.717 (0.02) 87.925 (1.41) splash2x/water_nsquared 230.426 232.979 (1.11) 219.950 (-4.55) 224.474 (-2.58) 235.770 (2.32) splash2x/water_spatial 88.982 89.748 (0.86) 89.086 (0.12) 89.431 (0.50) 95.849 (7.72) total 2994.520 3007.910 (0.45) 2859.470 (-4.51) 2924.420 (-2.34) 3091.670 (3.24) memused.avg orig rec (overhead) thp (overhead) ethp (overhead) prcl (overhead) parsec3/blackscholes 1821479.200 1836018.600 (0.80) 1822020.600 (0.03) 1834214.200 (0.70) 1721607.800 (-5.48) parsec3/bodytrack 1418698.400 1434689.800 (1.13) 1419134.400 (0.03) 1430609.800 (0.84) 1433137.600 (1.02) parsec3/canneal 1045065.400 1052992.400 (0.76) 1042607.400 (-0.24) 1048730.400 (0.35) 1049446.000 (0.42) parsec3/dedup 
2387073.200 2425093.600 (1.59) 2398469.600 (0.48) 2416738.400 (1.24) 2433976.800 (1.96) parsec3/facesim 540075.800 554130.000 (2.60) 544759.400 (0.87) 553325.800 (2.45) 489255.600 (-9.41) parsec3/ferret 316932.800 331383.600 (4.56) 320355.800 (1.08) 331042.000 (4.45) 328275.600 (3.58) parsec3/fluidanimate 576466.400 587466.600 (1.91) 582737.000 (1.09) 582560.600 (1.06) 499228.800 (-13.40) parsec3/freqmine 985864.000 996351.800 (1.06) 990195.000 (0.44) 997435.400 (1.17) 809333.800 (-17.91) parsec3/raytrace 1749485.600 1753601.400 (0.24) 1744385.000 (-0.29) 1755230.400 (0.33) 1597574.400 (-8.68) parsec3/streamcluster 120976.200 133270.000 (10.16) 118688.200 (-1.89) 132846.800 (9.81) 133412.400 (10.28) parsec3/swaptions 14953.600 28689.400 (91.86) 15826.000 (5.83) 26803.000 (79.24) 27754.400 (85.60) parsec3/vips 2940086.400 2965866.800 (0.88) 2943217.200 (0.11) 2960823.600 (0.71) 2968121.000 (0.95) parsec3/x264 3179843.200 3186839.600 (0.22) 3175893.600 (-0.12) 3182023.400 (0.07) 3202598.000 (0.72) splash2x/barnes 1210899.200 1211648.600 (0.06) 1219328.800 (0.70) 1217686.000 (0.56) 1126669.000 (-6.96) splash2x/fft 9322834.800 9142039.200 (-1.94) 9183937.800 (-1.49) 9159042.800 (-1.76) 9321729.200 (-0.01) splash2x/lu_cb 515411.200 523698.400 (1.61) 521019.800 (1.09) 523047.400 (1.48) 461828.400 (-10.40) splash2x/lu_ncb 514869.000 525223.000 (2.01) 521820.600 (1.35) 522588.800 (1.50) 480118.400 (-6.75) splash2x/ocean_cp 3345433.400 3298946.800 (-1.39) 3377377.000 (0.95) 3289771.600 (-1.66) 3273329.800 (-2.16) splash2x/ocean_ncp 3902999.600 3873302.600 (-0.76) 7069853.000 (81.14) 4962220.800 (27.14) 3772835.600 (-3.33) splash2x/radiosity 1471551.000 1470698.600 (-0.06) 1481433.200 (0.67) 1466283.400 (-0.36) 838138.400 (-43.04) splash2x/radix 1700185.000 1674226.400 (-1.53) 1386397.600 (-18.46) 1544387.800 (-9.16) 1957567.600 (15.14) splash2x/raytrace 45493.800 57050.800 (25.40) 50134.000 (10.20) 60166.400 (32.25) 57634.000 (26.69) splash2x/volrend 150549.200 165190.600 (9.73) 151509.600 (0.64) 162845.000 (8.17) 161346.000 (7.17) splash2x/water_nsquared 46275.200 58483.600 (26.38) 71529.200 (54.57) 56770.200 (22.68) 59995.800 (29.65) splash2x/water_spatial 666577.200 672511.800 (0.89) 667422.200 (0.13) 674555.000 (1.20) 608374.000 (-8.73) total 39990000.000 39959400.000 (-0.08) 42819900.000 (7.08) 40891655.000 (2.25) 38813174.000 (-2.94) DAMON Overheads ~~~~~~~~~~~~~~~ In total, DAMON recording feature incurs 0.41% runtime overhead (up to 1.19% in worst case with 'parsec3/streamcluster') and -0.08% memory space overhead. For convenience test run of 'rec', I use a Python wrapper. The wrapper constantly consumes about 10-15MB of memory. This becomes high memory overhead if the target workload has small memory footprint. In detail, 10%, 91%, 25%, 9%, and 26% overheads shown for parsec3/streamcluster (125 MiB), parsec3/swaptions (15 MiB), splash2x/raytrace (45 MiB), splash2x/volrend (151 MiB), and splash2x/water_nsquared (46 MiB)). Nonetheless, the overheads are not from DAMON, but from the wrapper, and thus should be ignored. This fake memory overhead continues in 'ethp' and 'prcl', as those configurations are also using the Python wrapper. Efficient THP ~~~~~~~~~~~~~ THP 'always' enabled policy achieves 4.51% speedup but incurs 7.08% memory overhead. It achieves 36.67% speedup in best case, but 81.14% memory overhead in worst case. Interestingly, both the best and worst case are with 'splash2x/ocean_ncp'). 
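As a cross-check of the 68.22% and 51.88% figures (this is my reading of how
they follow from the 'total' rows of the tables above, not text from the
patchset), the arithmetic is:

    # Reproduce the 'ethp' summary numbers from the 'total' rows above.
    # Waste removed: how much of 'thp's extra memory use 'ethp' avoids.
    # Speedup kept: how much of 'thp's runtime gain 'ethp' preserves.
    thp_speedup, ethp_speedup = 4.51, 2.34    # % runtime reduction vs 'orig'
    thp_memuse, ethp_memuse = 7.08, 2.25      # % memory overhead vs 'orig'

    waste_removed = (thp_memuse - ethp_memuse) / thp_memuse * 100
    speedup_kept = ethp_speedup / thp_speedup * 100

    print('%.2f%% of THP memory waste removed' % waste_removed)  # 68.22
    print('%.2f%% of THP speedup preserved' % speedup_kept)      # 51.88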
Proactive Reclamation
~~~~~~~~~~~~~~~~~~~~~

As in the original work[7], I use a 'zram' swap device for this
configuration. In total, the 1-line implementation of proactive reclamation,
'prcl', incurs 3.24% runtime overhead while achieving a 2.94% reduction of
system memory usage. Note that because the memory usage is calculated from
'MemFree' in '/proc/meminfo', it includes SwapCached pages. As SwapCached
pages can be easily evicted, I also measured the resident set size (RSS) of
the workloads:

rss.avg                 orig rec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
parsec3/blackscholes    589877.400 591587.600 (0.29) 593797.000 (0.66) 591090.800 (0.21) 424841.800 (-27.98)
parsec3/bodytrack       32326.600 32289.800 (-0.11) 32284.000 (-0.13) 32249.600 (-0.24) 28931.800 (-10.50)
parsec3/canneal         839469.400 840116.600 (0.08) 838083.800 (-0.17) 837870.400 (-0.19) 833193.800 (-0.75)
parsec3/dedup           1194881.800 1207486.800 (1.05) 1217461.000 (1.89) 1225107.000 (2.53) 995459.400 (-16.69)
parsec3/facesim         311416.600 311812.800 (0.13) 314923.000 (1.13) 312525.200 (0.36) 195057.600 (-37.36)
parsec3/ferret          99787.800 99655.400 (-0.13) 101332.800 (1.55) 99820.400 (0.03) 93295.000 (-6.51)
parsec3/fluidanimate    531801.600 531784.800 (-0.00) 531775.400 (-0.00) 531928.600 (0.02) 432113.400 (-18.75)
parsec3/freqmine        552404.600 553054.400 (0.12) 555716.400 (0.60) 554045.600 (0.30) 157776.200 (-71.44)
parsec3/raytrace        894502.400 892753.600 (-0.20) 888306.200 (-0.69) 892790.600 (-0.19) 374962.600 (-58.08)
parsec3/streamcluster   110877.200 110846.400 (-0.03) 111255.400 (0.34) 111467.600 (0.53) 110063.400 (-0.73)
parsec3/swaptions       5637.600 5611.600 (-0.46) 5621.400 (-0.29) 5630.200 (-0.13) 4594.800 (-18.50)
parsec3/vips            31897.600 31803.800 (-0.29) 32336.400 (1.38) 32168.000 (0.85) 30496.800 (-4.39)
parsec3/x264            82068.400 81975.600 (-0.11) 83066.400 (1.22) 82656.400 (0.72) 80752.400 (-1.60)
splash2x/barnes         1210976.600 1215669.400 (0.39) 1224071.200 (1.08) 1219203.200 (0.68) 1047794.600 (-13.48)
splash2x/fft            9714139.000 9623503.600 (-0.93) 9523996.200 (-1.96) 9555242.400 (-1.64) 9050047.000 (-6.84)
splash2x/lu_cb          510368.800 510468.800 (0.02) 514496.800 (0.81) 510299.200 (-0.01) 445912.000 (-12.63)
splash2x/lu_ncb         510149.600 510325.600 (0.03) 513899.000 (0.73) 510331.200 (0.04) 465811.200 (-8.69)
splash2x/ocean_cp       3407224.400 3405827.200 (-0.04) 3437758.400 (0.90) 3394473.000 (-0.37) 3334869.600 (-2.12)
splash2x/ocean_ncp      3919511.200 3934023.000 (0.37) 7181317.200 (83.22) 5074390.600 (29.46) 3560788.200 (-9.15)
splash2x/radiosity      1474982.000 1476292.400 (0.09) 1485884.000 (0.74) 1474162.800 (-0.06) 695592.400 (-52.84)
splash2x/radix          1765313.200 1752605.000 (-0.72) 1440052.200 (-18.43) 1662186.600 (-5.84) 1888954.800 (7.00)
splash2x/raytrace       23277.600 23289.600 (0.05) 29185.600 (25.38) 26960.600 (15.82) 21139.400 (-9.19)
splash2x/volrend        44110.600 44069.200 (-0.09) 44321.600 (0.48) 44436.000 (0.74) 28610.400 (-35.14)
splash2x/water_nsquared 29412.800 29443.200 (0.10) 29470.000 (0.19) 29894.600 (1.64) 27927.800 (-5.05)
splash2x/water_spatial  655785.200 656694.400 (0.14) 655665.200 (-0.02) 656572.000 (0.12) 558691.000 (-14.81)
total                   28542100.000 28472900.000 (-0.24) 31386000.000 (9.96) 29467572.000 (3.24) 24887691.000 (-12.80)

In total, the resident sets are reduced by 12.80%. With parsec3/freqmine,
'prcl' reduces system memory usage by 17.91% and the resident set by 71.44%
while incurring only 1.25% runtime overhead.

Sequence Of Patches
===================

The patches are based on v5.6 plus the v8 DAMON patchset[1] and Minchan's
``do_madvise()`` patch[2]. Minchan's patch is necessary for the reuse of the
``madvise()`` code in DAMON. You can also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damos/rfc/v6

The web is also available:
https://github.com/sjp38/linux/releases/tag/damos/rfc/v6

[1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20200302193630.68771-2-minchan@kernel.org/

The first patch allows DAMON to reuse the ``madvise()`` code for the actions.
The second patch accounts for the age of each region. The third patch
implements the handling of the schemes in DAMON and exports a kernel space
programming interface for it. The fourth patch implements a debugfs interface
for privileged users and programs. The fifth and sixth patches add kunit
tests and selftests for these changes, respectively, and finally the seventh
patch adds human friendly 'schemes' support to the user space tool for DAMON.

Patch History
=============

Changes from RFC v5
(https://lore.kernel.org/linux-mm/20200330115042.17431-1-sjpark@amazon.com/)
- Rebase on the DAMON v8 patchset
- Update test results
- Fix DAMON userspace tool crash on signal handling
- Fix checkpatch warnings

Changes from RFC v4
(https://lore.kernel.org/linux-mm/20200303121406.20954-1-sjpark@amazon.com/)
- Handle CONFIG_ADVISE_SYSCALL
- Clean up code (Jonathan Cameron)
- Update test results
- Rebase on v5.6 + DAMON v7

Changes from RFC v3
(https://lore.kernel.org/linux-mm/20200225102300.23895-1-sjpark@amazon.com/)
- Add Reviewed-by from Brendan Higgins
- Code cleanup: modularize the madvise() call
- Fix a trivial bug in the wrapper python script
- Add more stable and detailed evaluation results with the updated ETHP scheme

Changes from RFC v2
(https://lore.kernel.org/linux-mm/20200218085309.18346-1-sjpark@amazon.com/)
- Fix the aging mechanism for better 'old region' selection
- Add more kunit tests and kselftests for this patchset
- Support more human friendly description and application of 'schemes'

Changes from RFC v1
(https://lore.kernel.org/linux-mm/20200210150921.32482-1-sjpark@amazon.com/)
- Properly adjust age accounting related properties after splitting, merging,
  and action applying

SeongJae Park (7):
  mm/madvise: Export do_madvise() to external GPL modules
  mm/damon: Account age of target regions
  mm/damon: Implement data access monitoring-based operation schemes
  mm/damon/schemes: Implement a debugfs interface
  mm/damon-test: Add kunit test case for regions age accounting
  mm/damon/selftests: Add 'schemes' debugfs tests
  damon/tools: Support more human friendly 'schemes' control

 include/linux/damon.h                        |  29 ++
 mm/damon-test.h                              |   5 +
 mm/damon.c                                   | 428 +++++++++++++++++-
 mm/madvise.c                                 |   1 +
 tools/damon/_convert_damos.py                | 126 ++++++
 tools/damon/_damon.py                        | 143 ++++++
 tools/damon/damo                             |   7 +
 tools/damon/record.py                        | 135 +-----
 tools/damon/schemes.py                       | 105 +++++
 .../testing/selftests/damon/debugfs_attrs.sh |  29 ++
 10 files changed, 879 insertions(+), 129 deletions(-)
 create mode 100755 tools/damon/_convert_damos.py
 create mode 100644 tools/damon/_damon.py
 create mode 100644 tools/damon/schemes.py
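For reference, a privileged program could drive the schemes debugfs interface
from the fourth patch along the lines of the Python sketch below. The file
path ('/sys/kernel/debug/damon/schemes') and the raw single-line field
encoding are my assumptions for illustration only; see the fourth patch and
'tools/damon/schemes.py' for the actual interface:

    # Hypothetical sketch: install DAMON operation schemes via debugfs.
    # The path and the raw field encoding are assumptions, not verified ABI.
    import os

    SCHEMES_FILE = '/sys/kernel/debug/damon/schemes'  # assumed location

    def apply_schemes(scheme_lines):
        """Write raw scheme lines, then read them back for confirmation."""
        with open(SCHEMES_FILE, 'w') as f:
            f.write('\n'.join(scheme_lines))
        with open(SCHEMES_FILE, 'r') as f:
            return f.read()

    if __name__ == '__main__':
        if not os.access(SCHEMES_FILE, os.W_OK):
            raise SystemExit('need root and DAMON schemes debugfs support')
        # An 'ethp'-like promotion rule, assuming a raw format of
        # <min_sz> <max_sz> <min_freq> <max_freq> <min_age> <max_age> <action>
        # with integer-encoded fields.
        print(apply_schemes(['2097152 0 5 0 0 0 0']))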