diff mbox series

[v3,09/11] Documentation/admin-guide/mm: Add a document for DAMON

Message ID 20200204062312.19913-10-sj38.park@gmail.com (mailing list archive)
State New, archived
Headers show
Series Introduce Data Access MONitor (DAMON) | expand

Commit Message

SeongJae Park Feb. 4, 2020, 6:23 a.m. UTC
From: SeongJae Park <sjpark@amazon.de>

This commit adds a simple document for DAMON under
`Documentation/admin-guide/mm`.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
---
 .../admin-guide/mm/data_access_monitor.rst    | 414 ++++++++++++++++++
 Documentation/admin-guide/mm/index.rst        |   1 +
 2 files changed, 415 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/data_access_monitor.rst
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/data_access_monitor.rst b/Documentation/admin-guide/mm/data_access_monitor.rst
new file mode 100644
index 000000000000..4d836c3866e2
--- /dev/null
+++ b/Documentation/admin-guide/mm/data_access_monitor.rst
@@ -0,0 +1,414 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+DAMON: Data Access MONitor
+==========================
+
+Introduction
+============
+
+Memory management decisions can normally be more efficient if finer data access
+information is available.  However, because finer information usually comes
+with higher overhead, most systems including Linux made a tradeoff: Forgive
+some wise decisions and use coarse information and/or light-weight heuristics.
+
+A number of experimental data access pattern awared memory management
+optimizations say the sacrifices are
+huge (2.55x slowdown).  However, none of those has successfully adopted to
+Linux kernel mainly due to the absence of a scalable and efficient data access
+monitoring mechanism.
+
+DAMON is a data access monitoring solution for the problem.  It is 1) accurate
+enough for the DRAM level memory management, 2) light-weight enough to be
+applied online, and 3) keeps predefined upper-bound overhead regardless of the
+size of target workloads (thus scalable).
+
+DAMON is implemented as a standalone kernel module and provides several simple
+interfaces.  Owing to that, though it has mainly designed for the kernel's
+memory management mechanisms, it can be also used for a wide range of user
+space programs and people.
+
+
+Frequently Asked Questions
+==========================
+
+Q: Why not integrated with perf?
+A: From the perspective of perf like profilers, DAMON can be thought of as a
+data source in kernel, like tracepoints, pressure stall information (psi), or
+idle page tracking.  Thus, it can be easily integrated with those.  However,
+this patchset doesn't provide a fancy perf integration because current step of
+DAMON development is focused on its core logic only.  That said, DAMON already
+provides two interfaces for user space programs, which based on debugfs and
+tracepoint, respectively.  Using the tracepoint interface, you can use DAMON
+with perf.  This patchset also provides the debugfs interface based user space
+tool for DAMON.  It can be used to record, visualize, and analyze data access
+pattern of target processes in a convenient way.
+
+Q: Why a new module, instead of extending perf or other tools?
+A: First, DAMON aims to be used by other programs including the kernel.
+Therefore, having dependency to specific tools like perf is not desirable.
+Second, because it need to be lightweight as much as possible so that it can be
+used online, any unnecessary overhead such as kernel - user space context
+switching cost should be avoided.  These are the two most biggest reasons why
+DAMON is implemented in the kernel space.  The idle page tracking subsystem
+would be the kernel module that most seems similar to DAMON.  However, it's own
+interface is not compatible with DAMON.  Also, the internal implementation of
+it has no common part to be reused by DAMON.
+
+Q: Can 'perf mem' provide the data required for DAMON?
+A: On the systems supporting 'perf mem', yes.  DAMON is using the PTE Accessed
+bits in low level.  Other H/W or S/W features that can be used for the purpose
+could be used.  However, as explained with above question, DAMON need to be
+implemented in the kernel space.
+
+
+Expected Use-cases
+==================
+
+A straightforward usecase of DAMON would be the program behavior analysis.
+With the DAMON output, users can confirm whether the program is running as
+intended or not.  This will be useful for debuggings and tests of design
+points.
+
+The monitored results can also be useful for counting the dynamic working set
+size of workloads.  For the administration of memory overcommitted systems or
+selection of the environments (e.g., containers providing different amount of
+memory) for your workloads, this will be useful.
+
+If you are a programmer, you can optimize your program by managing the memory
+based on the actual data access pattern.  For example, you can identify the
+dynamic hotness of your data using DAMON and call ``mlock()`` to keep your hot
+data in DRAM, or call ``madvise()`` with ``MADV_PAGEOUT`` to proactively
+reclaim cold data.  Even though your program is guaranteed to not encounter
+memory pressure, you can still improve the performance by applying the DAMON
+outputs for call of ``MADV_HUGEPAGE`` and ``MADV_NOHUGEPAGE``.  More creative
+optimizations would be possible.  Our evaluations of DAMON includes a
+straightforward optimization using the ``mlock()``.  Please refer to the below
+Evaluation section for more detail.
+
+As DAMON incurs very low overhead, such optimizations can be applied not only
+offline, but also online.  Also, there is no reason to limit such optimizations
+to the user space.  Several parts of the kernel's memory management mechanisms
+could be also optimized using DAMON. The reclamation, the THP (de)promotion
+decisions, and the compaction would be such a candidates.
+
+
+Mechanisms of DAMON
+===================
+
+
+Basic Access Check
+------------------
+
+DAMON basically reports what pages are how frequently accessed.  The report is
+passed to users in binary format via a ``result file`` which users can set it's
+path.  Note that the frequency is not an absolute number of accesses, but a
+relative frequency among the pages of the target workloads.
+
+Users can also control the resolution of the reports by setting two time
+intervals, ``sampling interval`` and ``aggregation interval``.  In detail,
+DAMON checks access to each page per ``sampling interval``, aggregates the
+results (counts the number of the accesses to each page), and reports the
+aggregated results per ``aggregation interval``.  For the access check of each
+page, DAMON uses the Accessed bits of PTEs.
+
+This is thus similar to the previously mentioned periodic access checks based
+mechanisms, which overhead is increasing as the size of the target process
+grows.
+
+
+Region Based Sampling
+---------------------
+
+To avoid the unbounded increase of the overhead, DAMON groups a number of
+adjacent pages that assumed to have same access frequencies into a region.  As
+long as the assumption (pages in a region have same access frequencies) is
+kept, only one page in the region is required to be checked.  Thus, for each
+``sampling interval``, DAMON randomly picks one page in each region and clears
+its Accessed bit.  After one more ``sampling interval``, DAMON reads the
+Accessed bit of the page and increases the access frequency of the region if
+the bit has set meanwhile.  Therefore, the monitoring overhead is controllable
+by setting the number of regions.  DAMON allows users to set the minimal and
+maximum number of regions for the trade-off.
+
+Except the assumption, this is almost same with the above-mentioned
+miniature-like static region based sampling.  In other words, this scheme
+cannot preserve the quality of the output if the assumption is not guaranteed.
+
+
+Adaptive Regions Adjustment
+---------------------------
+
+At the beginning of the monitoring, DAMON constructs the initial regions by
+evenly splitting the memory mapped address space of the process into the
+user-specified minimal number of regions.  In this initial state, the
+assumption is normally not kept and thus the quality could be low.  To keep the
+assumption as much as possible, DAMON adaptively merges and splits each region.
+For each ``aggregation interval``, it compares the access frequencies of
+adjacent regions and merges those if the frequency difference is small.  Then,
+after it reports and clears the aggregated access frequency of each region, it
+splits each region into two regions if the total number of regions is smaller
+than the half of the user-specified maximum number of regions.
+
+In this way, DAMON provides its best-effort quality and minimal overhead while
+keeping the bounds users set for their trade-off.
+
+
+Applying Dynamic Memory Mappings
+--------------------------------
+
+Only a number of small parts in the super-huge virtual address space of the
+processes is mapped to physical memory and accessed.  Thus, tracking the
+unmapped address regions is just wasteful.  However, tracking every memory
+mapping change might incur an overhead.  For the reason, DAMON applies the
+dynamic memory mapping changes to the tracking regions only for each of an
+user-specified time interval (``regions update interval``).
+
+
+``debugfs`` Interface
+=====================
+
+DAMON exports four files, ``attrs``, ``pids``, ``record``, and ``monitor_on``
+under its debugfs directory, ``<debugfs>/damon/``.
+
+Attributes
+----------
+
+Users can read and write the ``sampling interval``, ``aggregation interval``,
+``regions update interval``, and min/max number of monitoring target regions by
+reading from and writing to the ``attrs`` file.  For example, below commands
+set those values to 5 ms, 100 ms, 1,000 ms, 10, 1000 and check it again::
+
+    # cd <debugfs>/damon
+    # echo 5000 100000 1000000 10 1000 > attrs
+    # cat attrs
+    5000 100000 1000000 10 1000
+
+Target PIDs
+-----------
+
+Users can read and write the pids of current monitoring target processes by
+reading from and writing to the ``pids`` file.  For example, below commands set
+processes having pids 42 and 4242 as the processes to be monitored and check it
+again::
+
+    # cd <debugfs>/damon
+    # echo 42 4242 > pids
+    # cat pids
+    42 4242
+
+Note that setting the pids doesn't starts the monitoring.
+
+Record
+------
+
+DAMON support direct monitoring result record feature.  The recorded results
+are first written to a buffer and flushed to a file in batch.  Users can set
+the size of the buffer and the path to the result file by reading from and
+writing to the ``record`` file.  For example, below commands set the buffer to
+be 4 KiB and the result to be saved in ``/damon.data``.
+
+    # cd <debugfs>/damon
+    # echo "4096 /damon.data" > pids
+    # cat record
+    4096 /damon.data
+
+Turning On/Off
+--------------
+
+You can check current status, start and stop the monitoring by reading from and
+writing to the ``monitor_on`` file.  Writing ``on`` to the file starts DAMON to
+monitor the target processes with the attributes.  Writing ``off`` to the file
+stops DAMON.  DAMON also stops if every target processes is be terminated.
+Below example commands turn on, off, and check status of DAMON::
+
+    # cd <debugfs>/damon
+    # echo on > monitor_on
+    # echo off > monitor_on
+    # cat monitor_on
+    off
+
+Please note that you cannot write to the ``attrs`` and ``pids`` files while the
+monitoring is turned on.  If you write to the files while DAMON is running,
+``-EINVAL`` will be returned.
+
+
+User Space Tool for DAMON
+=========================
+
+There is a user space tool for DAMON, ``/tools/damon/damo``.  It provides
+another user interface which more convenient than the debugfs interface.
+Nevertheless, note that it is only aimed to be used for minimal reference of
+the DAMON's debugfs interfaces and for tests of the DAMON itself.  Based on the
+debugfs interface, you can create another cool and more convenient user space
+tools.
+
+The interface of the tool is basically subcommand based.  You can almost always
+use ``-h`` option to get help of the use of each subcommand.  Currently, it
+supports two subcommands, ``record`` and ``report``.
+
+
+Recording Data Access Pattern
+-----------------------------
+
+The ``record`` subcommand records the data access pattern of target process in
+a file (``./damon.data`` by default) using DAMON.  You can specifies the target
+as either pid or a command for an execution of the process.  Below example
+shows a command target usage::
+
+    # cd <kernel>/tools/damon/
+    # ./damo record "sleep 5"
+
+The tool will execute ``sleep 5`` by itself and record the data access patterns
+of the process.  Below example shows a pid target usage::
+
+    # sleep 5 &
+    # ./damo record `pidof sleep`
+
+You can set more detailed attributes and path to the recorded data file using
+optional arguments to the subcommand.  Use the ``-h`` option for more help.
+
+
+Analyzing Data Access Pattern
+-----------------------------
+
+The ``report`` subcommand reads a data access pattern record file (if not
+explicitly specified, reads ``./damon.data`` file if exists) and generates
+reports of various types.  You can specify what type of report you want using
+sub-subcommand to ``report`` subcommand.  For supported types, pass the ``-h``
+option to ``report`` subcommand.
+
+
+raw
+~~~
+
+``raw`` sub-subcommand simply transforms the record, which is storing the data
+access patterns in binary format to human readable text.  For example::
+
+    $ ./damo report raw
+    start_time:  193485829398
+    rel time:                0
+    nr_tasks:  1
+    pid:  1348
+    nr_regions:  4
+    560189609000-56018abce000(  22827008):  0
+    7fbdff59a000-7fbdffaf1a00(   5601792):  0
+    7fbdffaf1a00-7fbdffbb5000(    800256):  1
+    7ffea0dc0000-7ffea0dfd000(    249856):  0
+
+    rel time:        100000731
+    nr_tasks:  1
+    pid:  1348
+    nr_regions:  6
+    560189609000-56018abce000(  22827008):  0
+    7fbdff59a000-7fbdff8ce933(   3361075):  0
+    7fbdff8ce933-7fbdffaf1a00(   2240717):  1
+    7fbdffaf1a00-7fbdffb66d99(    480153):  0
+    7fbdffb66d99-7fbdffbb5000(    320103):  1
+    7ffea0dc0000-7ffea0dfd000(    249856):  0
+
+The first line shows recording started timestamp (nanosecond).  Records of data
+access patterns are following this.  Each record is sperated by a blank line.
+Each record first specifies the recorded time (``rel time``), number of
+monitored tasks in this record (``nr_tasks``).  Multiple number of records of
+data access pattern for each task continue.  Each data access pattern for each
+task shows first it's pid (``pid``) and number of monitored virtual address
+regions in this access pattern (``nr_regions``).  After that, each line shows
+start/end address, size, and number of monitored accesses to the region for
+each of the regions.
+
+
+heats
+~~~~~
+
+The ``raw`` type shows detailed information but it is exhaustive to manually
+read and analyzed.  For the reason, ``heats`` plots the data in heatmap form,
+using time as x-axis, virtual address as y-axis, and access frequency as
+z-axis.  Also, users set the resolution and start/end point of each axis via
+optional arguments.  For example::
+
+    $ ./damo report heats --tres 3 --ares 3
+    0               0               0.0
+    0               7609002         0.0
+    0               15218004        0.0
+    66112620851     0               0.0
+    66112620851     7609002         0.0
+    66112620851     15218004        0.0
+    132225241702    0               0.0
+    132225241702    7609002         0.0
+    132225241702    15218004        0.0
+
+This command shows the recorded access pattern of the ``sleep`` command using 3
+data points for each of time axis and address axis.  Therefore, it shows 9 data
+points in total.
+
+Users can easily converts this text output into heatmap image or other 3D
+representation using various tools such as 'gnuplot'.  ``raw`` sub-subcommand
+also provides 'gnuplot' based heatmap image creation.  For this, you can use
+``--heatmap`` option.  Also, note that because it uses 'gnuplot' internally, it
+will fail if 'gnuplot' is not installed on your system.  For example::
+
+    $ ./damo report heats --heatmap heatmap.png
+
+Creates ``heatmap.png`` file containing the heatmap image.  It supports
+``pdf``, ``png``, ``jpeg``, and ``svg``.
+
+For proper zoom in / zoom out, you need to see the layout of the record.  For
+that, use '--guide' option.  If the option is given, it will provide useful
+information about the records in the record file.  For example::
+
+    $ ./damo report heats --guide
+    pid:1348
+    time: 193485829398-198337863555 (4852034157)
+    region   0: 00000094564599762944-00000094564622589952 (22827008)
+    region   1: 00000140454009610240-00000140454016012288 (6402048)
+    region   2: 00000140731597193216-00000140731597443072 (249856)
+
+The output shows monitored regions (start and end addresses in byte) and
+monitored time duration (start and end time in nanosecond) of each target task.
+Therefore, it would be wise to plot only each region rather than plotting
+entire address space in one heatmap because the gaps between the regions are so
+huge in this case.
+
+
+wss
+~~~
+
+The ``wss`` type shows the distribution or time-varying working set sizes of
+the recorded workload using the records.  For example::
+
+    $ ./damo report wss
+    # <percentile> <wss>
+    # pid   1348
+    # avr:  66228
+    0       0
+    25      0
+    50      0
+    75      0
+    100     1920615
+
+Without any option, it shows the distribution of the working set sizes as
+above.  Basically it shows 0th, 25th, 50th, 75th, and 100th percentile and
+average of the measured working set sizes in the access pattern records.  In
+this case, the working set size was zero for 75th percentile but 1,920,615
+bytes in max and 66,228 in average.
+
+By setting the sort key of the percentile using '--sortby', you can also see
+how the working set size is chronologically changed.  For example::
+
+    $ ./damo report wss --sortby time
+    # <percentile> <wss>
+    # pid   1348
+    # avr:  66228
+    0       0
+    25      0
+    50      0
+    75      0
+    100     0
+
+The average is still 66,228.  And, because we sorted the working set using
+recorded time and the access is very short, we cannot show when the access
+made.
+
+Users can specify the resolution of the distribution (``--range``).  It also
+supports 'gnuplot' based simple visualization (``--plot``) of the distribution.
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 11db46448354..d3d0ba373eb6 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -27,6 +27,7 @@  the Linux memory management.
 
    concepts
    cma_debugfs
+   data_access_monitor
    hugetlbpage
    idle_page_tracking
    ksm