[net-next,v23,0/7] Replace page_frag with page_frag_cache (Part-1)

Message ID 20241028115343.3405838-1-linyunsheng@huawei.com (mailing list archive)

Yunsheng Lin Oct. 28, 2024, 11:53 a.m. UTC
This is part 1 of "Replace page_frag with page_frag_cache",
which mainly contains refactoring and optimization of the
page_frag API implementation before the actual replacement.

As discussed in [1], it is better to target the net-next
tree to get more testing, as all the callers of the page_frag
API are in networking, and the chance of conflicting with the
MM tree seems low since the page_frag implementation is quite
self-contained.

After [2], there are still two implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().

This patchset tries to unify the page frag implementation
by replacing page_frag with page_frag_cache for sk_page_frag()
first. net_high_order_alloc_disable_key for the implementation
in net/core/sock.c doesn't seem to matter much now, as pcp
also supports high-order pages:
commit 44042b449872 ("mm/page_alloc: allow high-order pages to
be stored on the per-cpu lists")

As the change is mostly related to networking, it targets
net-next. The rest of the page_frag users will be converted
in a follow-up patchset.

After this patchset:
1. Unify the page frag implementation by taking the best of
   the two existing implementations: we are able to save some
   space for 'page_frag_cache' API users, and avoid 'get_page()'
   for old 'page_frag' API users.
2. Future bugfixes and performance work can be done in one place,
   improving the maintainability of the page_frag implementation.

Kernel image size change:
    Linux Kernel   total |      text      data        bss
    ------------------------------------------------------
    after     45250307 |   27274279   17209996     766032
    before    45254134 |   27278118   17209984     766032
    delta        -3827 |      -3839        +12         +0

Performance validation:
1. Using the micro-benchmark ko added in patch 1 to test the
   aligned and non-aligned API performance impact for existing
   users, there is no noticeable performance degradation.
   Instead there seems to be a major performance boost for both
   the aligned and non-aligned API after switching to ptr_ring
   for testing: about 200% and 10% improvement respectively on
   an arm64 server, as below.

2. Using the below netcat test case, there is also a minor
   performance boost from replacing 'page_frag' with
   'page_frag_cache' after this patchset.
   server: taskset -c 32 nc -l -k 1234 > /dev/null
   client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

In order to avoid performance noise as much as possible, the
testing is done on a system without any other load, with enough
iterations to show that the data is stable; the complete testing
log is below:

perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
taskset -c 32 nc -l -k 1234 > /dev/null
perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

*After* this patchset:

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         17.758393      task-clock (msec)         #    0.004 CPUs utilized            ( +-  0.51% )
                 5      context-switches          #    0.293 K/sec                    ( +-  0.65% )
                 0      cpu-migrations            #    0.008 K/sec                    ( +- 17.21% )
                74      page-faults               #    0.004 M/sec                    ( +-  0.12% )
          46128650      cycles                    #    2.598 GHz                      ( +-  0.51% )
          60810511      instructions              #    1.32  insn per cycle           ( +-  0.04% )
          14764914      branches                  #  831.433 M/sec                    ( +-  0.04% )
             19281      branch-misses             #    0.13% of all branches          ( +-  0.13% )

       4.240273854 seconds time elapsed                                          ( +-  0.13% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         17.348690      task-clock (msec)         #    0.019 CPUs utilized            ( +-  0.66% )
                 5      context-switches          #    0.310 K/sec                    ( +-  0.84% )
                 0      cpu-migrations            #    0.009 K/sec                    ( +- 16.55% )
                74      page-faults               #    0.004 M/sec                    ( +-  0.11% )
          45065287      cycles                    #    2.598 GHz                      ( +-  0.66% )
          60755389      instructions              #    1.35  insn per cycle           ( +-  0.05% )
          14747865      branches                  #  850.085 M/sec                    ( +-  0.05% )
             19272      branch-misses             #    0.13% of all branches          ( +-  0.13% )

       0.935251375 seconds time elapsed                                          ( +-  0.07% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      16626.042731      task-clock (msec)         #    0.607 CPUs utilized            ( +-  0.03% )
           3291020      context-switches          #    0.198 M/sec                    ( +-  0.05% )
                 1      cpu-migrations            #    0.000 K/sec                    ( +-  0.50% )
                85      page-faults               #    0.005 K/sec                    ( +-  0.16% )
       30581044838      cycles                    #    1.839 GHz                      ( +-  0.05% )
       34962744631      instructions              #    1.14  insn per cycle           ( +-  0.01% )
        6483883671      branches                  #  389.984 M/sec                    ( +-  0.02% )
          99624551      branch-misses             #    1.54% of all branches          ( +-  0.17% )

      27.370305077 seconds time elapsed                                          ( +-  0.01% )


*Before* this patchset:

Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         21.587934      task-clock (msec)         #    0.005 CPUs utilized            ( +-  0.72% )
                 6      context-switches          #    0.281 K/sec                    ( +-  0.28% )
                 1      cpu-migrations            #    0.047 K/sec                    ( +-  0.50% )
                73      page-faults               #    0.003 M/sec                    ( +-  0.12% )
          56080697      cycles                    #    2.598 GHz                      ( +-  0.72% )
          61605150      instructions              #    1.10  insn per cycle           ( +-  0.05% )
          14950196      branches                  #  692.526 M/sec                    ( +-  0.05% )
             19410      branch-misses             #    0.13% of all branches          ( +-  0.18% )

       4.603530546 seconds time elapsed                                          ( +-  0.11% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         20.988297      task-clock (msec)         #    0.006 CPUs utilized            ( +-  0.81% )
                 7      context-switches          #    0.316 K/sec                    ( +-  0.54% )
                 1      cpu-migrations            #    0.048 K/sec                    ( +-  0.70% )
                73      page-faults               #    0.003 M/sec                    ( +-  0.11% )
          54512166      cycles                    #    2.597 GHz                      ( +-  0.81% )
          61440941      instructions              #    1.13  insn per cycle           ( +-  0.08% )
          14906043      branches                  #  710.207 M/sec                    ( +-  0.08% )
             19927      branch-misses             #    0.13% of all branches          ( +-  0.17% )

       3.438041238 seconds time elapsed                                          ( +-  1.11% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      17364.040855      task-clock (msec)         #    0.624 CPUs utilized            ( +-  0.02% )
           3340375      context-switches          #    0.192 M/sec                    ( +-  0.06% )
                 1      cpu-migrations            #    0.000 K/sec
                85      page-faults               #    0.005 K/sec                    ( +-  0.15% )
       32077623335      cycles                    #    1.847 GHz                      ( +-  0.03% )
       35121047596      instructions              #    1.09  insn per cycle           ( +-  0.01% )
        6519872824      branches                  #  375.481 M/sec                    ( +-  0.02% )
         101877022      branch-misses             #    1.56% of all branches          ( +-  0.14% )

      27.842745343 seconds time elapsed                                          ( +-  0.02% )


Note, ipv4-udp, ipv6-tcp and ipv6-udp are also tested with the below scripts:
nc -u -l -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -N -u 127.0.0.1 1234

nc -l6 -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -N ::1 1234

nc -l6 -k -u 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u -N ::1 1234

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Shuah Khan <skhan@linuxfoundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>

1. https://lore.kernel.org/all/add10dd4-7f5d-4aa1-aa04-767590f944e0@redhat.com/
2. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/

Change log:
V23:
   1. CC Andrew and MM ML explicitly.
   2. Split into two parts according to the discussion in v22, and this is
      the part-1.

V22:
   1. Fix some typo as noted by Bagas.
   2. Remove page_frag_cache_page_offset() as it is not really related to
      this patchset.

V21:
   1. Do renaming as suggested by Alexander.
   2. Filter out the test results of dmesg in script as suggested by
      Shuah.

V20:
   1. Rename skb_copy_to_page_nocache() to skb_add_frag_nocache().
   2. Define the PFMEMALLOC_BIT as the ORDER_MASK + 1 as suggested by
      Alexander.

V19:
   1. Rebased on latest net-next.
   2. Use wait_for_completion_timeout() instead of wait_for_completion()
      in page_frag_test.c

V18:
   1. Fix a typo in test_page_frag.sh pointed out by Alexander.
   2. Move some inline helpers into the .c file, use the ternary
      operator and move the getting of the size as suggested by
      Alexander.

V17:
   1. Add TEST_FILES in Makefile for test_page_frag.sh.

V16:
   1. Add test_page_frag.sh to handle page_frag_test.ko and add testing
      for prepare API.
   2. Move inline helpers not needed outside of page_frag_cache.c
      into page_frag_cache.c.
   3. Reset nc->offset when reusing an old page.

V15:
   1. Fix the compile error pointed out by Simon.
   2. Fix other mistakes in the new API naming and refactoring.

V14:
   1. Drop '_va' Renaming patch and use new API naming.
   2. Use new refactoring to enable more codes to be reusable.
   3. And other minor suggestions from Alexander.

V13:
   1. Move page_frag_test from mm/ to tools/testing/selftest/mm
   2. Use ptr_ring to replace ptr_pool for page_frag_test.c
   3. Retest based on the new testing ko, which shows a very different
      result than using ptr_pool.

V12:
   1. Do not treat page_frag_test ko as DEBUG feature.
   2. Make some improvement for the refactoring in patch 8.
   3. Some other minor improvement as Alexander's comment.

RFC v11:
   1. Fold 'page_frag_cache' moving change into patch 2.
   2. Optimize patch 3 according to the discussion in v9.

V10:
   1. Change Subject to "Replace page_frag with page_frag_cache for sk_page_frag()".
   2. Move 'struct page_frag_cache' to sched.h as suggested by Alexander.
   3. Rename skb_copy_to_page_nocache().
   4. Adjust change between patches to make it more reviewable as Alexander's comment.
   5. Use 'aligned_remaining' variable to generate virtual address as Alexander's
      comment.
   6. Some included header and typo fix as Alexander's comment.
   7. Add back the get_order() opt patch for xtensa arch

V9:
   1. Add a check for test_alloc_len and change the perm of module_param()
      to 0 as per Wang Wei's comment.
   2. Rebased on latest net-next.

V8: Remove patch 2 & 3 in V7, as free_unref_page() is changed to call
    pcp_allowed_order() and used in page_frag API recently in:
    commit 5b8d75913a0e ("mm: combine free_the_page() and free_unref_page()")

V7: Fix doc build warning and error.

V6:
   1. Fix some typo and compiler error for x86 pointed out by Jakub and
      Simon.
   2. Add two refactoring and optimization patches.

V5:
   1. Add page_frag_alloc_pg() API for the tls_device.c case and refactor
      some implementation; update the kernel bin size change as the bin
      size increased after that.
   2. Add ack from Mat.

RFC v4:
   1. Update doc according to Randy and Mat's suggestion.
   2. Change probe API to "probe" for a specific amount of available space,
      rather than "nonzero" space according to Mat's suggestion.
   3. Retest and update the test result.

v3:
   1. Use a new layout for 'struct page_frag_cache' based on the
      discussion with Alexander and other suggestions from him.
   2. Add probe API to address Mat's comment about the mptcp use case.
   3. Some doc updates according to Bagas' suggestion.

v2:
   1. Reorder the test module to patch 1.
   2. Split doc and maintainer updating into two patches.
   3. Refactor page_frag before moving it.
   4. Fix a typo and a 'static' warning in the test module.
   5. Add a patch for the xtensa arch to enable using get_order() in
      BUILD_BUG_ON().
   6. Add a test case and performance data for the socket code.

Yunsheng Lin (7):
  mm: page_frag: add a test module for page_frag
  mm: move the page fragment allocator from page_alloc into its own file
  mm: page_frag: use initial zero offset for page_frag_alloc_align()
  mm: page_frag: avoid caller accessing 'page_frag_cache' directly
  xtensa: remove the get_order() implementation
  mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
  mm: page_frag: use __alloc_pages() to replace alloc_pages_node()

 arch/xtensa/include/asm/page.h                |  18 --
 drivers/vhost/net.c                           |   2 +-
 include/linux/gfp.h                           |  22 --
 include/linux/mm_types.h                      |  18 --
 include/linux/mm_types_task.h                 |  21 ++
 include/linux/page_frag_cache.h               |  61 ++++++
 include/linux/skbuff.h                        |   1 +
 mm/Makefile                                   |   1 +
 mm/page_alloc.c                               | 136 ------------
 mm/page_frag_cache.c                          | 171 +++++++++++++++
 net/core/skbuff.c                             |   6 +-
 net/rxrpc/conn_object.c                       |   4 +-
 net/rxrpc/local_object.c                      |   4 +-
 net/sunrpc/svcsock.c                          |   6 +-
 tools/testing/selftests/mm/Makefile           |   3 +
 tools/testing/selftests/mm/page_frag/Makefile |  18 ++
 .../selftests/mm/page_frag/page_frag_test.c   | 198 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   8 +
 tools/testing/selftests/mm/test_page_frag.sh  | 175 ++++++++++++++++
 19 files changed, 665 insertions(+), 208 deletions(-)
 create mode 100644 include/linux/page_frag_cache.h
 create mode 100644 mm/page_frag_cache.c
 create mode 100644 tools/testing/selftests/mm/page_frag/Makefile
 create mode 100644 tools/testing/selftests/mm/page_frag/page_frag_test.c
 create mode 100755 tools/testing/selftests/mm/test_page_frag.sh

Comments

Alexander Duyck Oct. 28, 2024, 3:30 p.m. UTC | #1
On Mon, Oct 28, 2024 at 5:00 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> This is part 1 of "Replace page_frag with page_frag_cache",
> which mainly contain refactoring and optimization for the
> implementation of page_frag API before the replacing.
>
> ...
>

Is this actually the numbers for this patch set? Seems like you have
been using the same numbers for the last several releases. I can
understand the "before" being mostly the same, but since we have
factored out the refactor portion of it the numbers for the "after"
should have deviated, as I find it highly unlikely the numbers are
exactly the same down to the nanosecond from the previous patch set.

Also it wouldn't hurt to have an explanation for the 3.4->0.9 second
performance change, as the samples don't seem to match up with the
elapsed time data.
Yunsheng Lin Oct. 29, 2024, 9:36 a.m. UTC | #2
On 2024/10/28 23:30, Alexander Duyck wrote:

...

>>
>>
> 
> Is this actually the numbers for this patch set? Seems like you have
> been using the same numbers for the last several releases. I can

Yes, as the recent refactoring doesn't seem big enough to change the
perf data, it has been reused for the last several releases.

> understand the "before" being mostly the same, but since we have

As there was a rebase onto the latest net-next tree, even the 'before'
numbers might not be the same, as the testing seems sensitive to other
changes, like binary size changes and page allocator changes between
versions.

So both 'before' and 'after' need the same base kernel and config.

> factored out the refactor portion of it the numbers for the "after"
> should have deviated as I find it highly unlikely the numbers are
> exactly the same down to the nanosecond. from the previous patch set.
Below is the performance data for Part-1 with the latest net-next:

Before this patchset:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         17.990790      task-clock (msec)         #    0.003 CPUs utilized            ( +-  0.19% )
                 8      context-switches          #    0.444 K/sec                    ( +-  0.09% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
                81      page-faults               #    0.004 M/sec                    ( +-  0.09% )
          46712295      cycles                    #    2.596 GHz                      ( +-  0.19% )
          34466157      instructions              #    0.74  insn per cycle           ( +-  0.01% )
           8011755      branches                  #  445.325 M/sec                    ( +-  0.01% )
             39913      branch-misses             #    0.50% of all branches          ( +-  0.07% )

       6.382252558 seconds time elapsed                                          ( +-  0.07% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         17.638466      task-clock (msec)         #    0.003 CPUs utilized            ( +-  0.01% )
                 8      context-switches          #    0.451 K/sec                    ( +-  0.20% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +- 70.53% )
                81      page-faults               #    0.005 M/sec                    ( +-  0.08% )
          45794305      cycles                    #    2.596 GHz                      ( +-  0.01% )
          34435077      instructions              #    0.75  insn per cycle           ( +-  0.00% )
           8004416      branches                  #  453.805 M/sec                    ( +-  0.00% )
             39758      branch-misses             #    0.50% of all branches          ( +-  0.06% )

       5.328976590 seconds time elapsed                                          ( +-  0.60% )


After this patchset:
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         18.647432      task-clock (msec)         #    0.003 CPUs utilized            ( +-  1.11% )
                 8      context-switches          #    0.422 K/sec                    ( +-  0.36% )
                 0      cpu-migrations            #    0.005 K/sec                    ( +- 22.54% )
                81      page-faults               #    0.004 M/sec                    ( +-  0.08% )
          48418108      cycles                    #    2.597 GHz                      ( +-  1.11% )
          35889299      instructions              #    0.74  insn per cycle           ( +-  0.11% )
           8318363      branches                  #  446.086 M/sec                    ( +-  0.11% )
             19263      branch-misses             #    0.23% of all branches          ( +-  0.13% )

       5.624666079 seconds time elapsed                                          ( +-  0.07% )


 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         18.466768      task-clock (msec)         #    0.007 CPUs utilized            ( +-  1.23% )
                 8      context-switches          #    0.428 K/sec                    ( +-  0.26% )
                 0      cpu-migrations            #    0.002 K/sec                    ( +- 34.73% )
                81      page-faults               #    0.004 M/sec                    ( +-  0.09% )
          47949220      cycles                    #    2.597 GHz                      ( +-  1.23% )
          35859039      instructions              #    0.75  insn per cycle           ( +-  0.12% )
           8309086      branches                  #  449.948 M/sec                    ( +-  0.11% )
             19246      branch-misses             #    0.23% of all branches          ( +-  0.08% )

       2.573546035 seconds time elapsed                                          ( +-  0.04% )

> 
> Also it wouldn't hurt to have an explanation for the 3.4->0.9 second
> performance change as it seems like the samples don't seem to match up
> with the elapsed time data.

As there is also a 4.6->3.4 second elapsed-time change for the 'before'
part, I am not reading too much into that.

I am guessing some timing effect in the ptr_ring implementation or the
CPU cache causes the above change?

When I used the same CPU for both the pop and push threads, the
performance change no longer seems to exist, and neither does the
performance improvement:

After this patchset:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=0 test_alloc_len=12 nr_test=512000' (10 runs):

         13.293402      task-clock (msec)         #    0.002 CPUs utilized            ( +-  5.05% )
                 7      context-switches          #    0.534 K/sec                    ( +-  1.41% )
                 0      cpu-migrations            #    0.015 K/sec                    ( +-100.00% )
                80      page-faults               #    0.006 M/sec                    ( +-  0.38% )
          34494793      cycles                    #    2.595 GHz                      ( +-  5.05% )
           9663299      instructions              #    0.28  insn per cycle           ( +-  1.45% )
           1767284      branches                  #  132.944 M/sec                    ( +-  1.70% )
             19798      branch-misses             #    1.12% of all branches          ( +-  1.18% )

       8.119681413 seconds time elapsed                                          ( +-  0.01% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=0 test_alloc_len=12 nr_test=512000 test_align=1' (10 runs):

         12.289096      task-clock (msec)         #    0.002 CPUs utilized            ( +-  0.07% )
                 7      context-switches          #    0.570 K/sec                    ( +-  2.13% )
                 0      cpu-migrations            #    0.033 K/sec                    ( +- 66.67% )
                81      page-faults               #    0.007 M/sec                    ( +-  0.43% )
          31886319      cycles                    #    2.595 GHz                      ( +-  0.07% )
           9468850      instructions              #    0.30  insn per cycle           ( +-  0.06% )
           1723487      branches                  #  140.245 M/sec                    ( +-  0.05% )
             19263      branch-misses             #    1.12% of all branches          ( +-  0.47% )

       8.119686950 seconds time elapsed                                          ( +-  0.01% )

Before this patchset:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=0 test_alloc_len=12 nr_test=512000' (10 runs):

         13.320328      task-clock (msec)         #    0.002 CPUs utilized            ( +-  5.00% )
                 7      context-switches          #    0.541 K/sec                    ( +-  1.85% )
                 0      cpu-migrations            #    0.008 K/sec                    ( +-100.00% )
                80      page-faults               #    0.006 M/sec                    ( +-  0.36% )
          34572091      cycles                    #    2.595 GHz                      ( +-  5.01% )
           9664910      instructions              #    0.28  insn per cycle           ( +-  1.51% )
           1768276      branches                  #  132.750 M/sec                    ( +-  1.80% )
             19592      branch-misses             #    1.11% of all branches          ( +-  1.33% )

       8.119686381 seconds time elapsed                                          ( +-  0.01% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=0 test_alloc_len=12 nr_test=512000 test_align=1' (10 runs):

         12.306471      task-clock (msec)         #    0.002 CPUs utilized            ( +-  0.08% )
                 7      context-switches          #    0.585 K/sec                    ( +-  1.85% )
                 0      cpu-migrations            #    0.000 K/sec
                80      page-faults               #    0.007 M/sec                    ( +-  0.28% )
          31937686      cycles                    #    2.595 GHz                      ( +-  0.08% )
           9462218      instructions              #    0.30  insn per cycle           ( +-  0.08% )
           1721989      branches                  #  139.925 M/sec                    ( +-  0.07% )
             19114      branch-misses             #    1.11% of all branches          ( +-  0.31% )

       8.118897296 seconds time elapsed                                          ( +-  0.00% )
Alexander Duyck Oct. 29, 2024, 3:45 p.m. UTC | #3
On Tue, Oct 29, 2024 at 2:36 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/10/28 23:30, Alexander Duyck wrote:
>
> ...
>
> >>
> >>
> >
> > Are these actually the numbers for this patch set? Seems like you have
> > been using the same numbers for the last several releases. I can
>
> Yes, as the recent refactoring doesn't seem significant enough to
> change the perf data, it has been reused for the last several releases.
>
> > understand the "before" being mostly the same, but since we have
>
> As there is rebasing onto the latest net-next tree, even the 'before'
> numbers might not be the same, as the testing seems sensitive to other
> changes, like binary size and page allocator changes between versions.
>
> So the comparison might need the same kernel and config for both
> 'before' and 'after'.
>
> > factored out the refactor portion of it, the numbers for the "after"
> > should have deviated, as I find it highly unlikely the numbers are
> > exactly the same down to the nanosecond from the previous patch set.
> Below is the performance data for Part-1 with the latest net-next:
>
> Before this patchset:
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
>
>          17.990790      task-clock (msec)         #    0.003 CPUs utilized            ( +-  0.19% )
>                  8      context-switches          #    0.444 K/sec                    ( +-  0.09% )
>                  0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
>                 81      page-faults               #    0.004 M/sec                    ( +-  0.09% )
>           46712295      cycles                    #    2.596 GHz                      ( +-  0.19% )
>           34466157      instructions              #    0.74  insn per cycle           ( +-  0.01% )
>            8011755      branches                  #  445.325 M/sec                    ( +-  0.01% )
>              39913      branch-misses             #    0.50% of all branches          ( +-  0.07% )
>
>        6.382252558 seconds time elapsed                                          ( +-  0.07% )
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
>
>          17.638466      task-clock (msec)         #    0.003 CPUs utilized            ( +-  0.01% )
>                  8      context-switches          #    0.451 K/sec                    ( +-  0.20% )
>                  0      cpu-migrations            #    0.001 K/sec                    ( +- 70.53% )
>                 81      page-faults               #    0.005 M/sec                    ( +-  0.08% )
>           45794305      cycles                    #    2.596 GHz                      ( +-  0.01% )
>           34435077      instructions              #    0.75  insn per cycle           ( +-  0.00% )
>            8004416      branches                  #  453.805 M/sec                    ( +-  0.00% )
>              39758      branch-misses             #    0.50% of all branches          ( +-  0.06% )
>
>        5.328976590 seconds time elapsed                                          ( +-  0.60% )
>
>
> After this patchset:
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
>
>          18.647432      task-clock (msec)         #    0.003 CPUs utilized            ( +-  1.11% )
>                  8      context-switches          #    0.422 K/sec                    ( +-  0.36% )
>                  0      cpu-migrations            #    0.005 K/sec                    ( +- 22.54% )
>                 81      page-faults               #    0.004 M/sec                    ( +-  0.08% )
>           48418108      cycles                    #    2.597 GHz                      ( +-  1.11% )
>           35889299      instructions              #    0.74  insn per cycle           ( +-  0.11% )
>            8318363      branches                  #  446.086 M/sec                    ( +-  0.11% )
>              19263      branch-misses             #    0.23% of all branches          ( +-  0.13% )
>
>        5.624666079 seconds time elapsed                                          ( +-  0.07% )
>
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
>
>          18.466768      task-clock (msec)         #    0.007 CPUs utilized            ( +-  1.23% )
>                  8      context-switches          #    0.428 K/sec                    ( +-  0.26% )
>                  0      cpu-migrations            #    0.002 K/sec                    ( +- 34.73% )
>                 81      page-faults               #    0.004 M/sec                    ( +-  0.09% )
>           47949220      cycles                    #    2.597 GHz                      ( +-  1.23% )
>           35859039      instructions              #    0.75  insn per cycle           ( +-  0.12% )
>            8309086      branches                  #  449.948 M/sec                    ( +-  0.11% )
>              19246      branch-misses             #    0.23% of all branches          ( +-  0.08% )
>
>        2.573546035 seconds time elapsed                                          ( +-  0.04% )
>

Interesting. It doesn't look like much changed across most of the
metrics, other than the number of branch misses being reduced by just
over half.

> >
> > Also it wouldn't hurt to have an explanation for the 3.4->0.9 second
> > performance change as it seems like the samples don't seem to match up
> > with the elapsed time data.
>
> As there is also a 4.6->3.4 second elapsed-time change for the 'before'
> part, I am not reading too much into that.
>
> I am guessing some timing effect in the ptr_ring implementation or the
> CPU cache causes the above change?
>
> When I used the same CPU for both the pop and push threads, the
> performance change no longer seems to exist, and neither does the
> performance improvement:
>
> After this patchset:
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=0 test_alloc_len=12 nr_test=512000' (10 runs):
>
>          13.293402      task-clock (msec)         #    0.002 CPUs utilized            ( +-  5.05% )
>                  7      context-switches          #    0.534 K/sec                    ( +-  1.41% )
>                  0      cpu-migrations            #    0.015 K/sec                    ( +-100.00% )
>                 80      page-faults               #    0.006 M/sec                    ( +-  0.38% )
>           34494793      cycles                    #    2.595 GHz                      ( +-  5.05% )
>            9663299      instructions              #    0.28  insn per cycle           ( +-  1.45% )
>            1767284      branches                  #  132.944 M/sec                    ( +-  1.70% )
>              19798      branch-misses             #    1.12% of all branches          ( +-  1.18% )
>
>        8.119681413 seconds time elapsed                                          ( +-  0.01% )
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=0 test_alloc_len=12 nr_test=512000 test_align=1' (10 runs):
>
>          12.289096      task-clock (msec)         #    0.002 CPUs utilized            ( +-  0.07% )
>                  7      context-switches          #    0.570 K/sec                    ( +-  2.13% )
>                  0      cpu-migrations            #    0.033 K/sec                    ( +- 66.67% )
>                 81      page-faults               #    0.007 M/sec                    ( +-  0.43% )
>           31886319      cycles                    #    2.595 GHz                      ( +-  0.07% )
>            9468850      instructions              #    0.30  insn per cycle           ( +-  0.06% )
>            1723487      branches                  #  140.245 M/sec                    ( +-  0.05% )
>              19263      branch-misses             #    1.12% of all branches          ( +-  0.47% )
>
>        8.119686950 seconds time elapsed                                          ( +-  0.01% )
>
> Before this patchset:
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=0 test_alloc_len=12 nr_test=512000' (10 runs):
>
>          13.320328      task-clock (msec)         #    0.002 CPUs utilized            ( +-  5.00% )
>                  7      context-switches          #    0.541 K/sec                    ( +-  1.85% )
>                  0      cpu-migrations            #    0.008 K/sec                    ( +-100.00% )
>                 80      page-faults               #    0.006 M/sec                    ( +-  0.36% )
>           34572091      cycles                    #    2.595 GHz                      ( +-  5.01% )
>            9664910      instructions              #    0.28  insn per cycle           ( +-  1.51% )
>            1768276      branches                  #  132.750 M/sec                    ( +-  1.80% )
>              19592      branch-misses             #    1.11% of all branches          ( +-  1.33% )
>
>        8.119686381 seconds time elapsed                                          ( +-  0.01% )
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=0 test_alloc_len=12 nr_test=512000 test_align=1' (10 runs):
>
>          12.306471      task-clock (msec)         #    0.002 CPUs utilized            ( +-  0.08% )
>                  7      context-switches          #    0.585 K/sec                    ( +-  1.85% )
>                  0      cpu-migrations            #    0.000 K/sec
>                 80      page-faults               #    0.007 M/sec                    ( +-  0.28% )
>           31937686      cycles                    #    2.595 GHz                      ( +-  0.08% )
>            9462218      instructions              #    0.30  insn per cycle           ( +-  0.08% )
>            1721989      branches                  #  139.925 M/sec                    ( +-  0.07% )
>              19114      branch-misses             #    1.11% of all branches          ( +-  0.31% )
>
>        8.118897296 seconds time elapsed                                          ( +-  0.00% )

That isn't too surprising. Most likely you are at the mercy of the
scheduler, waiting for it to cycle back and forth between the producer
and the consumer in order to complete the test.
Jakub Kicinski Nov. 5, 2024, 11:57 p.m. UTC | #4
On Mon, 28 Oct 2024 19:53:35 +0800 Yunsheng Lin wrote:
> This is part 1 of "Replace page_frag with page_frag_cache",
> which mainly contain refactoring and optimization for the
> implementation of page_frag API before the replacing.

Looks like Alex is happy with all of these patches. Since
page_frag_cache is primarily used in networking I think it's
okay for us to apply it but I wanted to ask if anyone:
 - thinks this shouldn't go in;
 - needs more time to review;
 - prefers to take it via their own tree.
Alexander Duyck Nov. 8, 2024, 12:02 a.m. UTC | #5
On Tue, Nov 5, 2024 at 3:57 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 28 Oct 2024 19:53:35 +0800 Yunsheng Lin wrote:
> > This is part 1 of "Replace page_frag with page_frag_cache",
> > which mainly contain refactoring and optimization for the
> > implementation of page_frag API before the replacing.
>
> Looks like Alex is happy with all of these patches. Since
> page_frag_cache is primarily used in networking I think it's
> okay for us to apply it but I wanted to ask if anyone:
>  - thinks this shouldn't go in;
>  - needs more time to review;
>  - prefers to take it via their own tree.

Yeah. I was happy with the set. Just curious about the numbers as they
hadn't been updated, but I am satisfied with the numbers provided
after I pointed that out.

- Alex
patchwork-bot+netdevbpf@kernel.org Nov. 11, 2024, 10:20 p.m. UTC | #6
Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 28 Oct 2024 19:53:35 +0800 you wrote:
> This is part 1 of "Replace page_frag with page_frag_cache",
> which mainly contain refactoring and optimization for the
> implementation of page_frag API before the replacing.
> 
> As the discussion in [1], it would be better to target net-next
> tree to get more testing as all the callers page_frag API are
> in networking, and the chance of conflicting with MM tree seems
> low as implementation of page_frag API seems quite self-contained.
> 
> [...]

Here is the summary with links:
  - [net-next,v23,1/7] mm: page_frag: add a test module for page_frag
    https://git.kernel.org/netdev/net-next/c/7fef0dec415c
  - [net-next,v23,2/7] mm: move the page fragment allocator from page_alloc into its own file
    https://git.kernel.org/netdev/net-next/c/65941f10caf2
  - [net-next,v23,3/7] mm: page_frag: use initial zero offset for page_frag_alloc_align()
    https://git.kernel.org/netdev/net-next/c/8218f62c9c9b
  - [net-next,v23,4/7] mm: page_frag: avoid caller accessing 'page_frag_cache' directly
    https://git.kernel.org/netdev/net-next/c/3d18dfe69ce4
  - [net-next,v23,5/7] xtensa: remove the get_order() implementation
    https://git.kernel.org/netdev/net-next/c/49e302be73f1
  - [net-next,v23,6/7] mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
    https://git.kernel.org/netdev/net-next/c/0c3ce2f50261
  - [net-next,v23,7/7] mm: page_frag: use __alloc_pages() to replace alloc_pages_node()
    https://git.kernel.org/netdev/net-next/c/ec397ea00cb3

You are awesome, thank you!