[net-next,v2,00/10] Replace page_frag with page_frag_cache (Part-2)

Message ID: 20241206122533.3589947-1-linyunsheng@huawei.com

Message

Yunsheng Lin Dec. 6, 2024, 12:25 p.m. UTC
This is part 2 of "Replace page_frag with page_frag_cache",
which introduces the new API and replaces page_frag with
page_frag_cache for sk_page_frag().

Part 1 of "Replace page_frag with page_frag_cache" is in
[1].

After [2], there are still two implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().
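
To make the contrast concrete, below is a rough sketch of the two
usage patterns. The wrapper names rx_style_alloc()/tx_style_refill()
are made up for illustration; the helpers page_frag_alloc_align(),
sk_page_frag() and skb_page_frag_refill() are the existing kernel
ones, assuming the usual <linux/page_frag_cache.h>, <linux/skbuff.h>
and <net/sock.h> headers. Real users carry more error handling and
bookkeeping:

/* 1. rx-style user: carve a fragment out of the cache's current page */
static void *rx_style_alloc(struct page_frag_cache *nc, unsigned int fragsz)
{
        return page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, SMP_CACHE_BYTES);
}

/* 2. tx-style user: make sure sk_page_frag(sk) has 'fragsz' bytes free */
static struct page_frag *tx_style_refill(struct sock *sk, unsigned int fragsz)
{
        struct page_frag *pfrag = sk_page_frag(sk);

        if (!skb_page_frag_refill(fragsz, pfrag, sk->sk_allocation))
                return NULL;

        /* the caller then copies into page_address(pfrag->page) +
         * pfrag->offset, takes a page reference for the attached skb
         * frag and advances pfrag->offset */
        return pfrag;
}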

This patchset tries to unify the page frag implementation
by replacing page_frag with page_frag_cache for sk_page_frag()
first. net_high_order_alloc_disable_key for the implementation
in net/core/sock.c doesn't seem to matter that much now, as
high-order pages are also supported on the pcp lists:
commit 44042b449872 ("mm/page_alloc: allow high-order pages to
be stored on the per-cpu lists")

As the change is mostly related to networking, this targets
net-next. The rest of the page_frag users will be replaced in
a follow-up patchset.

After this patchset:
1. Unify the page frag implementation by taking the best of the
   two existing implementations: we are able to save some space
   for the 'page_frag_cache' API user, and avoid 'get_page()' for
   the old 'page_frag' API user.
2. Future bugfixes and performance optimizations can be done in one
   place, improving the maintainability of the page_frag implementation.

Performance validation for part2:
1. Using the micro-benchmark ko added in patch 1 to test the aligned
   and non-aligned API performance impact for the existing users, there
   seems to be about 20% performance degradation from refactoring
   page_frag to support the new API, which seems to nullify most of
   the performance gain in [3] of part 1.
2. Using the below netcat test case, there seems to be a minor
   performance gain from replacing 'page_frag' with 'page_frag_cache'
   using the new page_frag API after this patchset.
   server: taskset -c 32 nc -l -k 1234 > /dev/null
   client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

In order to avoid performance noise as much as possible, the testing
is done on a system without any other load and with enough iterations
to show the data is stable; the complete testing log is below:

perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
taskset -c 32 nc -l -k 1234 > /dev/null
perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

*After* this patchset:

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         18.753187      task-clock (msec)         #    0.003 CPUs utilized            ( +-  0.44% )
                 8      context-switches          #    0.422 K/sec                    ( +-  0.30% )
                 0      cpu-migrations            #    0.003 K/sec                    ( +- 32.09% )
                84      page-faults               #    0.004 M/sec                    ( +-  0.08% )
          48700826      cycles                    #    2.597 GHz                      ( +-  0.44% )
          62086543      instructions              #    1.27  insn per cycle           ( +-  0.03% )
          14869358      branches                  #  792.898 M/sec                    ( +-  0.03% )
             19639      branch-misses             #    0.13% of all branches          ( +-  0.60% )

       7.035285915 seconds time elapsed                                          ( +-  0.06% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         18.442151      task-clock (msec)         #    0.006 CPUs utilized            ( +-  0.01% )
                 8      context-switches          #    0.422 K/sec                    ( +-  0.40% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +- 57.44% )
                84      page-faults               #    0.005 M/sec                    ( +-  0.08% )
          47890149      cycles                    #    2.597 GHz                      ( +-  0.01% )
          60718325      instructions              #    1.27  insn per cycle           ( +-  0.00% )
          14570862      branches                  #  790.085 M/sec                    ( +-  0.00% )
             19613      branch-misses             #    0.13% of all branches          ( +-  0.12% )

       3.210892358 seconds time elapsed                                          ( +-  0.12% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      16824.017944      task-clock (msec)         #    0.621 CPUs utilized            ( +-  0.02% )
           2987954      context-switches          #    0.178 M/sec                    ( +-  0.04% )
                 1      cpu-migrations            #    0.000 K/sec
                93      page-faults               #    0.006 K/sec                    ( +-  0.09% )
       31982647267      cycles                    #    1.901 GHz                      ( +-  0.03% )
       38907812424      instructions              #    1.22  insn per cycle           ( +-  0.02% )
        7112328962      branches                  #  422.749 M/sec                    ( +-  0.03% )
          94789062      branch-misses             #    1.33% of all branches          ( +-  0.21% )

      27.104994660 seconds time elapsed                                          ( +-  0.03% )


*Before* this patchset:

Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         18.700051      task-clock (msec)         #    0.003 CPUs utilized            ( +-  1.04% )
                 8      context-switches          #    0.420 K/sec                    ( +-  0.31% )
                 0      cpu-migrations            #    0.019 K/sec                    ( +- 10.16% )
                81      page-faults               #    0.004 M/sec                    ( +-  0.09% )
          48548980      cycles                    #    2.596 GHz                      ( +-  1.04% )
          61857980      instructions              #    1.27  insn per cycle           ( +-  0.09% )
          14814201      branches                  #  792.201 M/sec                    ( +-  0.08% )
             42007      branch-misses             #    0.28% of all branches          ( +-  0.11% )

       5.565806266 seconds time elapsed                                          ( +-  0.08% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         18.468618      task-clock (msec)         #    0.007 CPUs utilized            ( +-  1.14% )
                 8      context-switches          #    0.422 K/sec                    ( +-  0.43% )
                 0      cpu-migrations            #    0.026 K/sec                    ( +-  7.89% )
                81      page-faults               #    0.004 M/sec                    ( +-  0.08% )
          47950150      cycles                    #    2.596 GHz                      ( +-  1.14% )
          61745530      instructions              #    1.29  insn per cycle           ( +-  0.09% )
          14787783      branches                  #  800.698 M/sec                    ( +-  0.08% )
             41734      branch-misses             #    0.28% of all branches          ( +-  0.09% )

       2.584180919 seconds time elapsed                                          ( +-  0.04% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      17105.617450      task-clock (msec)         #    0.599 CPUs utilized            ( +-  0.02% )
           2822654      context-switches          #    0.165 M/sec                    ( +-  0.03% )
                 1      cpu-migrations            #    0.000 K/sec                    ( +-  0.50% )
                93      page-faults               #    0.005 K/sec                    ( +-  0.09% )
       31819244033      cycles                    #    1.860 GHz                      ( +-  0.03% )
       37297412811      instructions              #    1.17  insn per cycle           ( +-  0.01% )
        6676699757      branches                  #  390.322 M/sec                    ( +-  0.01% )
         325102016      branch-misses             #    4.87% of all branches          ( +-  0.06% )

      28.568053622 seconds time elapsed                                          ( +-  0.02% )

Note, ipv4-udp, ipv6-tcp and ipv6-udp are also tested with the below scripts:
nc -u -l -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u 127.0.0.1 1234

nc -l6 -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc ::1 1234

nc -l6 -k -u 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u ::1 1234

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Shuah Khan <skhan@linuxfoundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>

1. https://lore.kernel.org/all/20241028115343.3405838-1-linyunsheng@huawei.com/
2. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/
3. https://lore.kernel.org/all/472a7a09-387f-480d-b66c-761e0b6192ef@huawei.com/

V2: Repost based on the latest net-next.

V1: Rebase on the latest net-next tree and redo the performance test.

RFC:
    1. CC Andrew and MM ML explicitly.
    2. Split into two parts according to the discussion in v22; this is
       part 2.
    3. Split the 'introduce new API' patch into more patches to make it
       more reviewable and easier to discuss.

Yunsheng Lin (10):
  mm: page_frag: some minor refactoring before adding new API
  net: rename skb_copy_to_page_nocache() helper
  mm: page_frag: update documentation for page_frag
  mm: page_frag: introduce page_frag_alloc_abort() related API
  mm: page_frag: introduce refill prepare & commit API
  mm: page_frag: introduce alloc_refill prepare & commit API
  mm: page_frag: introduce probe related API
  mm: page_frag: add testing for the newly added API
  net: replace page_frag with page_frag_cache
  mm: page_frag: add an entry in MAINTAINERS for page_frag

 Documentation/mm/page_frags.rst               | 207 ++++++++++-
 MAINTAINERS                                   |  12 +
 .../chelsio/inline_crypto/chtls/chtls.h       |   3 -
 .../chelsio/inline_crypto/chtls/chtls_io.c    | 101 ++----
 .../chelsio/inline_crypto/chtls/chtls_main.c  |   3 -
 drivers/net/tun.c                             |  47 ++-
 include/linux/page_frag_cache.h               | 330 +++++++++++++++++-
 include/linux/sched.h                         |   2 +-
 include/net/sock.h                            |  30 +-
 kernel/exit.c                                 |   3 +-
 kernel/fork.c                                 |   3 +-
 mm/page_frag_cache.c                          | 108 +++++-
 net/core/skbuff.c                             |  58 +--
 net/core/skmsg.c                              |  12 +-
 net/core/sock.c                               |  32 +-
 net/ipv4/ip_output.c                          |  28 +-
 net/ipv4/tcp.c                                |  26 +-
 net/ipv4/tcp_output.c                         |  25 +-
 net/ipv6/ip6_output.c                         |  28 +-
 net/kcm/kcmsock.c                             |  21 +-
 net/mptcp/protocol.c                          |  47 ++-
 net/tls/tls_device.c                          | 100 ++++--
 .../selftests/mm/page_frag/page_frag_test.c   |  76 +++-
 tools/testing/selftests/mm/run_vmtests.sh     |   4 +
 tools/testing/selftests/mm/test_page_frag.sh  |  27 ++
 25 files changed, 1045 insertions(+), 288 deletions(-)

Comments

Alexander Duyck Dec. 8, 2024, 9:34 p.m. UTC | #1
On Fri, Dec 6, 2024 at 4:32 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> This is part 2 of "Replace page_frag with page_frag_cache",
> which introduces the new API and replaces page_frag with
> page_frag_cache for sk_page_frag().
>
> The part 1 of "Replace page_frag with page_frag_cache" is in
> [1].
>
> After [2], there are still two implementations for page frag:
>
> 1. mm/page_alloc.c: net stack seems to be using it in the
>    rx part with 'struct page_frag_cache' and the main API
>    being page_frag_alloc_align().
> 2. net/core/sock.c: net stack seems to be using it in the
>    tx part with 'struct page_frag' and the main API being
>    skb_page_frag_refill().
>
> This patchset tries to unfiy the page frag implementation
> by replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seems matter that much now as pcp
> is also supported for high-order pages:
> commit 44042b449872 ("mm/page_alloc: allow high-order pages to
> be stored on the per-cpu lists")
>
> As the related change is mostly related to networking, so
> targeting the net-next. And will try to replace the rest
> of page_frag in the follow patchset.
>
> After this patchset:
> 1. Unify the page frag implementation by taking the best out of
>    two the existing implementations: we are able to save some space
>    for the 'page_frag_cache' API user, and avoid 'get_page()' for
>    the old 'page_frag' API user.
> 2. Future bugfix and performance can be done in one place, hence
>    improving maintainability of page_frag's implementation.
>
> Performance validation for part2:
> 1. Using micro-benchmark ko added in patch 1 to test aligned and
>    non-aligned API performance impact for the existing users, there
>    seems to be about 20% performance degradation for refactoring
>    page_frag to support the new API, which seems to nullify most of
>    the performance gain in [3] of part1.

So if I am understanding correctly then this is showing a 20%
performance degradation with this patchset. I would argue that it is
significant enough that it would be a blocking factor for this patch
set. I would suggest bisecting the patch set to identify where the
performance degradation has been added and see what we can do to
resolve it, and if nothing else document it in that patch so we can
identify the root cause for the slowdown.

> 2. Use the below netcat test case, there seems to be some minor
>    performance gain for replacing 'page_frag' with 'page_frag_cache'
>    using the new page_frag API after this patchset.
>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

This test would barely touch the page pool. The fact is, most of the
overhead for this would likely be things like TCP latency and data
copy, much more than the page allocation. As such, fluctuations here
are likely not related to your changes.
Yunsheng Lin Dec. 9, 2024, 11:42 a.m. UTC | #2
On 2024/12/9 5:34, Alexander Duyck wrote:

...

>>
>> Performance validation for part2:
>> 1. Using micro-benchmark ko added in patch 1 to test aligned and
>>    non-aligned API performance impact for the existing users, there
>>    seems to be about 20% performance degradation for refactoring
>>    page_frag to support the new API, which seems to nullify most of
>>    the performance gain in [3] of part1.
> 
> So if I am understanding correctly then this is showing a 20%
> performance degradation with this patchset. I would argue that it is
> significant enough that it would be a blocking factor for this patch
> set. I would suggest bisecting the patch set to identify where the
> performance degradation has been added and see what we can do to
> resolve it, and if nothing else document it in that patch so we can
> identify the root cause for the slowdown.

The only patch in this patchset affecting the performance of the
existing API seems to be patch 1; including only patch 1 shows the
same ~20% performance degradation as including the whole patchset:
mm: page_frag: some minor refactoring before adding new API

And the cause seems to be the binary size increase shown below, as the
performance degradation didn't seem to change much when I tried inlining
__page_frag_cache_commit_noref() by moving it to the header file:

./scripts/bloat-o-meter vmlinux_orig vmlinux
add/remove: 3/2 grow/shrink: 5/0 up/down: 920/-500 (420)
Function                                     old     new   delta
__page_frag_cache_prepare                      -     500    +500
__napi_alloc_frag_align                       68     180    +112
__netdev_alloc_skb                           488     596    +108
napi_alloc_skb                               556     624     +68
__netdev_alloc_frag_align                    196     252     +56
svc_tcp_sendmsg                              340     376     +36
__page_frag_cache_commit_noref                 -      32     +32
e843419@09a6_0000bd47_30                       -       8      +8
e843419@0369_000044ee_684                      8       -      -8
__page_frag_alloc_align                      492       -    -492
Total: Before=34719207, After=34719627, chg +0.00%

./scripts/bloat-o-meter page_frag_test_orig.ko page_frag_test.ko
add/remove: 0/0 grow/shrink: 2/0 up/down: 78/0 (78)
Function                                     old     new   delta
page_frag_push_thread                        508     580     +72
__UNIQUE_ID_vermagic367                       67      73      +6
Total: Before=4582, After=4660, chg +1.70%

Patch 1 refactors the common code from __page_frag_alloc_va_align()
into __page_frag_cache_prepare() and __page_frag_cache_commit(), so that
the new API can make use of them as much as possible.

Any better idea for reusing as much common code as possible while
avoiding the performance degradation?
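
For reference, the shape of that split is roughly as below. This is a
simplified sketch with made-up function names, reusing only the field
and helper names visible in the diffs later in this thread; the real
patch also handles alignment, pfmemalloc and refill/reuse details:

/* prepare: make sure fragsz fits, refilling the cache if needed, but
 * consume nothing yet; returns the page's virtual address or NULL. */
static void *prepare_sketch(struct page_frag_cache *nc, unsigned int fragsz,
                            gfp_t gfp_mask)
{
        /* refill/reuse logic elided; gfp_mask only used on refill */
        return encoded_page_decode_virt(nc->encoded_page);
}

/* commit: consume fragsz bytes and one pre-charged page reference */
static void *commit_sketch(struct page_frag_cache *nc, unsigned int fragsz)
{
        void *va = encoded_page_decode_virt(nc->encoded_page) + nc->offset;

        nc->offset += fragsz;
        nc->pagecnt_bias--;
        return va;
}

/* the existing alloc API then becomes prepare + commit, while the new
 * prepare/commit style API can call the two halves separately */
static void *alloc_via_prepare_commit(struct page_frag_cache *nc,
                                      unsigned int fragsz, gfp_t gfp_mask)
{
        return prepare_sketch(nc, fragsz, gfp_mask) ?
               commit_sketch(nc, fragsz) : NULL;
}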

> 
>> 2. Use the below netcat test case, there seems to be some minor
>>    performance gain for replacing 'page_frag' with 'page_frag_cache'
>>    using the new page_frag API after this patchset.
>>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
> 
> This test would barely touch the page pool. The fact is most of the

I am guessing you meant page_frag here?

> overhead for this would likely be things like TCP latency and data
> copy much more than the page allocation. As such fluctuations here are
> likely not related to your changes.

But it does tell us that the replacement does not seem to cause an
obvious regression, right?

I tried using a smaller MTU to amplify the impact of page allocation;
it seemed to give a similar result.
Alexander Duyck Dec. 9, 2024, 4:03 p.m. UTC | #3
On Mon, Dec 9, 2024 at 3:42 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/12/9 5:34, Alexander Duyck wrote:
>
> ...
>
> >>
> >> Performance validation for part2:
> >> 1. Using micro-benchmark ko added in patch 1 to test aligned and
> >>    non-aligned API performance impact for the existing users, there
> >>    seems to be about 20% performance degradation for refactoring
> >>    page_frag to support the new API, which seems to nullify most of
> >>    the performance gain in [3] of part1.
> >
> > So if I am understanding correctly then this is showing a 20%
> > performance degradation with this patchset. I would argue that it is
> > significant enough that it would be a blocking factor for this patch
> > set. I would suggest bisecting the patch set to identify where the
> > performance degradation has been added and see what we can do to
> > resolve it, and if nothing else document it in that patch so we can
> > identify the root cause for the slowdown.
>
> The only patch in this patchset affecting the performance of existing API
> seems to be patch 1, only including patch 1 does show ~20% performance
> degradation as including the whole patchset does:
> mm: page_frag: some minor refactoring before adding new API
>
> And the cause seems to be about the binary increasing as below, as the
> performance degradation didn't seems to change much when I tried inlining
> the __page_frag_cache_commit_noref() by moving it to the header file:
>
> ./scripts/bloat-o-meter vmlinux_orig vmlinux
> add/remove: 3/2 grow/shrink: 5/0 up/down: 920/-500 (420)
> Function                                     old     new   delta
> __page_frag_cache_prepare                      -     500    +500
> __napi_alloc_frag_align                       68     180    +112
> __netdev_alloc_skb                           488     596    +108
> napi_alloc_skb                               556     624     +68
> __netdev_alloc_frag_align                    196     252     +56
> svc_tcp_sendmsg                              340     376     +36
> __page_frag_cache_commit_noref                 -      32     +32
> e843419@09a6_0000bd47_30                       -       8      +8
> e843419@0369_000044ee_684                      8       -      -8
> __page_frag_alloc_align                      492       -    -492
> Total: Before=34719207, After=34719627, chg +0.00%
>
> ./scripts/bloat-o-meter page_frag_test_orig.ko page_frag_test.ko
> add/remove: 0/0 grow/shrink: 2/0 up/down: 78/0 (78)
> Function                                     old     new   delta
> page_frag_push_thread                        508     580     +72
> __UNIQUE_ID_vermagic367                       67      73      +6
> Total: Before=4582, After=4660, chg +1.70%

Other than code size, have you tried using perf to profile the
benchmark before and after? I suspect that would be telling about
which code changes are the most likely to be causing the issues.
Overall I don't think the size has increased all that much. I suspect
most of this is the fact that you are inlining more of the
functionality.

> Patch 1 is about refactoring common codes from __page_frag_alloc_va_align()
> to __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
> new API can make use of them as much as possible.
>
> Any better idea to reuse common codes as much as possible while avoiding
> the performance degradation as much as possible?
>
> >
> >> 2. Use the below netcat test case, there seems to be some minor
> >>    performance gain for replacing 'page_frag' with 'page_frag_cache'
> >>    using the new page_frag API after this patchset.
> >>    server: taskset -c 32 nc -l -k 1234 > /dev/null
> >>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
> >
> > This test would barely touch the page pool. The fact is most of the
>
> I am guessing you meant page_frag here?
>
> > overhead for this would likely be things like TCP latency and data
> > copy much more than the page allocation. As such fluctuations here are
> > likely not related to your changes.
>
> But it does tell us something that the replacing does not seems to
> cause obvious regression, right?

Not really. The fragment allocator is such a small portion of this
test that we could probably double the cost for it and it would still
be negligible.

> I tried using a smaller MTU to amplify the impact of page allocation,
> it seemed to have a similar result.

Not surprising. However, the network is likely only a small part of
this. I suspect if you ran a profile it would likely show the same.
Yunsheng Lin Dec. 10, 2024, 12:27 p.m. UTC | #4
On 2024/12/10 0:03, Alexander Duyck wrote:

...

> 
> Other than code size have you tried using perf to profile the
> benchmark before and after. I suspect that would be telling about
> which code changes are the most likely to be causing the issues.
> Overall I don't think the size has increased all that much. I suspect
> most of this is the fact that you are inlining more of the
> functionality.

It seems the testing result is very sensitive to code changes and
reorganization; using the patch at the end to avoid the problem of
'perf stat' not including data from the kernel thread seems to provide
more reasonable performance data.

The most obvious difference seems to be 'insn per cycle', and I am not
sure yet how to interpret the difference in the data below with regard
to the performance degradation.

With patch 1:
 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':

       5473.815250      task-clock (msec)         #    0.984 CPUs utilized
                18      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.022 K/sec
       14210894727      cycles                    #    2.596 GHz                      (92.78%)
       18903171767      instructions              #    1.33  insn per cycle           (92.82%)
        2997494420      branches                  #  547.606 M/sec                    (92.84%)
           7539978      branch-misses             #    0.25% of all branches          (92.84%)
        6291190031      L1-dcache-loads           # 1149.325 M/sec                    (92.78%)
          29874701      L1-dcache-load-misses     #    0.47% of all L1-dcache hits    (92.82%)
          57979668      LLC-loads                 #   10.592 M/sec                    (92.79%)
            347822      LLC-load-misses           #    0.01% of all LL-cache hits     (92.90%)
        5946042629      L1-icache-loads           # 1086.270 M/sec                    (92.91%)
            193877      L1-icache-load-misses                                         (92.91%)
        6820220221      dTLB-loads                # 1245.972 M/sec                    (92.91%)
            137999      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
        5947607438      iTLB-loads                # 1086.556 M/sec                    (92.91%)
               210      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       5.563068950 seconds time elapsed

Without patch 1:
root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':

       5306.644600      task-clock (msec)         #    0.984 CPUs utilized
                15      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.023 K/sec
       13776872322      cycles                    #    2.596 GHz                      (92.84%)
       13257649773      instructions              #    0.96  insn per cycle           (92.82%)
        2446901087      branches                  #  461.101 M/sec                    (92.91%)
           7172751      branch-misses             #    0.29% of all branches          (92.84%)
        5041456343      L1-dcache-loads           #  950.027 M/sec                    (92.84%)
          38418414      L1-dcache-load-misses     #    0.76% of all L1-dcache hits    (92.76%)
          65486400      LLC-loads                 #   12.340 M/sec                    (92.82%)
            191497      LLC-load-misses           #    0.01% of all LL-cache hits     (92.79%)
        4906456833      L1-icache-loads           #  924.587 M/sec                    (92.90%)
            175208      L1-icache-load-misses                                         (92.91%)
        5539879607      dTLB-loads                # 1043.952 M/sec                    (92.91%)
            140166      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
        4906685698      iTLB-loads                #  924.631 M/sec                    (92.91%)
               170      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       5.395104330 seconds time elapsed


Below is the perf data for the aligned API without patch 1. As the
non-aligned API above also uses test_alloc_len=12, theoretically the
performance should not be better than for the non-aligned API, since
the aligned API will do the aligning of fragsz based on SMP_CACHE_BYTES;
but the testing seems to show otherwise, and I am not sure how to
interpret that either:
perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1':

       2447.553100      task-clock (msec)         #    0.965 CPUs utilized
                 9      context-switches          #    0.004 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.050 K/sec
        6354149177      cycles                    #    2.596 GHz                      (92.81%)
        6467793726      instructions              #    1.02  insn per cycle           (92.76%)
        1120749183      branches                  #  457.906 M/sec                    (92.81%)
           7370402      branch-misses             #    0.66% of all branches          (92.81%)
        2847963759      L1-dcache-loads           # 1163.596 M/sec                    (92.76%)
          39439592      L1-dcache-load-misses     #    1.38% of all L1-dcache hits    (92.77%)
          42553468      LLC-loads                 #   17.386 M/sec                    (92.71%)
             95960      LLC-load-misses           #    0.01% of all LL-cache hits     (92.94%)
        2554887203      L1-icache-loads           # 1043.854 M/sec                    (92.97%)
            118902      L1-icache-load-misses                                         (92.97%)
        3365755289      dTLB-loads                # 1375.151 M/sec                    (92.97%)
             81401      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.97%)
        2554882937      iTLB-loads                # 1043.852 M/sec                    (92.97%)
               159      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.58%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       2.535085780 seconds time elapsed


> 
>> Patch 1 is about refactoring common codes from __page_frag_alloc_va_align()
>> to __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
>> new API can make use of them as much as possible.
>>
>> Any better idea to reuse common codes as much as possible while avoiding
>> the performance degradation as much as possible?
>>
>>>
>>>> 2. Use the below netcat test case, there seems to be some minor
>>>>    performance gain for replacing 'page_frag' with 'page_frag_cache'
>>>>    using the new page_frag API after this patchset.
>>>>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>>>>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>>>
>>> This test would barely touch the page pool. The fact is most of the
>>
>> I am guessing you meant page_frag here?
>>
>>> overhead for this would likely be things like TCP latency and data
>>> copy much more than the page allocation. As such fluctuations here are
>>> likely not related to your changes.
>>
>> But it does tell us something that the replacing does not seems to
>> cause obvious regression, right?
> 
> Not really. The fragment allocator is such a small portion of this
> test that we could probably double the cost for it and it would still
> be negligible.

The main benefit of replacing the old API seems to be the batching of
page->_refcount updates and avoiding some page_address() calls, but
there may be some overhead from unifying the page_frag API.
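
To spell out the batching part, below is a simplified illustration of
the pagecnt_bias scheme; the function names are made up and the refill
details only loosely follow mm/page_frag_cache.c:

static void refill_bias_sketch(struct page_frag_cache *nc, struct page *page)
{
        /* charge many references up front instead of one get_page() per
         * fragment; the encoded page also caches the virtual address so
         * later allocations skip page_address()/virt_to_page() */
        page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE - 1);
        nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE;
        nc->offset = 0;
}

static void *alloc_bias_sketch(struct page_frag_cache *nc, unsigned int fragsz)
{
        /* each fragment then only does a local decrement, no atomic op */
        void *va = encoded_page_decode_virt(nc->encoded_page) + nc->offset;

        nc->offset += fragsz;
        nc->pagecnt_bias--;
        return va;
}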

> 
>> I tried using a smaller MTU to amplify the impact of page allocation,
>> it seemed to have a similar result.
> 
> Not surprising. However the network is likely only a small part of
> this. I suspect if you ran a profile it would likely show the same.
> 

Patch for doing the push operation in the insmod process instead of
in the kernel thread, as 'perf stat' does not seem to include the data
of the kernel thread:
diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c
index e806c1866e36..a818431c38b8 100644
--- a/tools/testing/selftests/mm/page_frag/page_frag_test.c
+++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c
@@ -131,30 +131,39 @@ static int __init page_frag_test_init(void)
        init_completion(&wait);

        if (test_alloc_len > PAGE_SIZE || test_alloc_len <= 0 ||
-           !cpu_active(test_push_cpu) || !cpu_active(test_pop_cpu))
+           !cpu_active(test_pop_cpu))
                return -EINVAL;

        ret = ptr_ring_init(&ptr_ring, nr_objs, GFP_KERNEL);
        if (ret)
                return ret;

-       tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring,
-                                        test_push_cpu, "page_frag_push");
-       if (IS_ERR(tsk_push))
-               return PTR_ERR(tsk_push);
-
        tsk_pop = kthread_create_on_cpu(page_frag_pop_thread, &ptr_ring,
                                        test_pop_cpu, "page_frag_pop");
-       if (IS_ERR(tsk_pop)) {
-               kthread_stop(tsk_push);
+       if (IS_ERR(tsk_pop))
                return PTR_ERR(tsk_pop);
+
+       pr_info("test_push_cpu = %d\n", test_push_cpu);
+
+       if (test_push_cpu < 0)
+               goto skip_push_thread;
+
+       tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring,
+                                        test_push_cpu, "page_frag_push");
+       if (IS_ERR(tsk_push)) {
+               kthread_stop(tsk_pop);
+               return PTR_ERR(tsk_push);
        }

+skip_push_thread:
        start = ktime_get();
-       wake_up_process(tsk_push);
+       pr_info("waiting for test to complete\n");
        wake_up_process(tsk_pop);

-       pr_info("waiting for test to complete\n");
+       if (test_push_cpu < 0)
+               page_frag_push_thread(&ptr_ring);
+       else
+               wake_up_process(tsk_push);

        while (!wait_for_completion_timeout(&wait, msecs_to_jiffies(10000))) {
                /* exit if there is no progress for push or pop size */
Alexander Duyck Dec. 10, 2024, 3:58 p.m. UTC | #5
On Tue, Dec 10, 2024 at 4:27 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/12/10 0:03, Alexander Duyck wrote:
>
> ...
>
> >
> > Other than code size have you tried using perf to profile the
> > benchmark before and after. I suspect that would be telling about
> > which code changes are the most likely to be causing the issues.
> > Overall I don't think the size has increased all that much. I suspect
> > most of this is the fact that you are inlining more of the
> > functionality.
>
> It seems the testing result is very sensitive to code changing and
> reorganizing, as using the patch at the end to avoid the problem of
> 'perf stat' not including data from the kernel thread seems to provide
> more reasonable performance data.
>
> It seems the most obvious difference is 'insn per cycle' and I am not
> sure how to interpret the difference of below data for the performance
> degradation yet.
>
> With patch 1:
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':
>
>        5473.815250      task-clock (msec)         #    0.984 CPUs utilized
>                 18      context-switches          #    0.003 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.022 K/sec
>        14210894727      cycles                    #    2.596 GHz                      (92.78%)
>        18903171767      instructions              #    1.33  insn per cycle           (92.82%)
>         2997494420      branches                  #  547.606 M/sec                    (92.84%)
>            7539978      branch-misses             #    0.25% of all branches          (92.84%)
>         6291190031      L1-dcache-loads           # 1149.325 M/sec                    (92.78%)
>           29874701      L1-dcache-load-misses     #    0.47% of all L1-dcache hits    (92.82%)
>           57979668      LLC-loads                 #   10.592 M/sec                    (92.79%)
>             347822      LLC-load-misses           #    0.01% of all LL-cache hits     (92.90%)
>         5946042629      L1-icache-loads           # 1086.270 M/sec                    (92.91%)
>             193877      L1-icache-load-misses                                         (92.91%)
>         6820220221      dTLB-loads                # 1245.972 M/sec                    (92.91%)
>             137999      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
>         5947607438      iTLB-loads                # 1086.556 M/sec                    (92.91%)
>                210      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
>
>        5.563068950 seconds time elapsed
>
> Without patch 1:
> root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
> insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
>
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':
>
>        5306.644600      task-clock (msec)         #    0.984 CPUs utilized
>                 15      context-switches          #    0.003 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.023 K/sec
>        13776872322      cycles                    #    2.596 GHz                      (92.84%)
>        13257649773      instructions              #    0.96  insn per cycle           (92.82%)
>         2446901087      branches                  #  461.101 M/sec                    (92.91%)
>            7172751      branch-misses             #    0.29% of all branches          (92.84%)
>         5041456343      L1-dcache-loads           #  950.027 M/sec                    (92.84%)
>           38418414      L1-dcache-load-misses     #    0.76% of all L1-dcache hits    (92.76%)
>           65486400      LLC-loads                 #   12.340 M/sec                    (92.82%)
>             191497      LLC-load-misses           #    0.01% of all LL-cache hits     (92.79%)
>         4906456833      L1-icache-loads           #  924.587 M/sec                    (92.90%)
>             175208      L1-icache-load-misses                                         (92.91%)
>         5539879607      dTLB-loads                # 1043.952 M/sec                    (92.91%)
>             140166      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
>         4906685698      iTLB-loads                #  924.631 M/sec                    (92.91%)
>                170      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
>
>        5.395104330 seconds time elapsed
>
>
> Below is perf data for aligned API without patch 1, as above non-aligned
> API also use test_alloc_len as 12, theoretically the performance data
> should not be better than the non-aligned API as the aligned API will do
> the aligning of fragsz basing on SMP_CACHE_BYTES, but the testing seems
> to show otherwise and I am not sure how to interpret that too:
> perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
> insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
>
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1':
>
>        2447.553100      task-clock (msec)         #    0.965 CPUs utilized
>                  9      context-switches          #    0.004 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.050 K/sec
>         6354149177      cycles                    #    2.596 GHz                      (92.81%)
>         6467793726      instructions              #    1.02  insn per cycle           (92.76%)
>         1120749183      branches                  #  457.906 M/sec                    (92.81%)
>            7370402      branch-misses             #    0.66% of all branches          (92.81%)
>         2847963759      L1-dcache-loads           # 1163.596 M/sec                    (92.76%)
>           39439592      L1-dcache-load-misses     #    1.38% of all L1-dcache hits    (92.77%)
>           42553468      LLC-loads                 #   17.386 M/sec                    (92.71%)
>              95960      LLC-load-misses           #    0.01% of all LL-cache hits     (92.94%)
>         2554887203      L1-icache-loads           # 1043.854 M/sec                    (92.97%)
>             118902      L1-icache-load-misses                                         (92.97%)
>         3365755289      dTLB-loads                # 1375.151 M/sec                    (92.97%)
>              81401      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.97%)
>         2554882937      iTLB-loads                # 1043.852 M/sec                    (92.97%)
>                159      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.58%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
>
>        2.535085780 seconds time elapsed

I'm not sure perf stat will tell us much as it is really too high
level to give us much in the way of details. I would be more
interested in the output from perf record -g followed by a perf
report, or maybe even just a snapshot from perf top while the test is
running. That should show us where the CPU is spending most of its
time and what areas are hot in the before and after graphs.
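
Something along these lines, for example (the command lines are only
illustrative, reusing the module parameters from earlier in this
thread):

perf record -g -- taskset -c 0 insmod ./page_frag_test.ko \
        test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
perf report --no-children

# or a live snapshot of the push CPU while the test is running:
perf top -C 0 -g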
Yunsheng Lin Dec. 11, 2024, 12:52 p.m. UTC | #6
On 2024/12/10 23:58, Alexander Duyck wrote:

> 
> I'm not sure perf stat will tell us much as it is really too high
> level to give us much in the way of details. I would be more
> interested in the output from perf record -g followed by a perf
> report, or maybe even just a snapshot from perf top while the test is
> running. That should show us where the CPU is spending most of its
> time and what areas are hot in the before and after graphs.

It seems the bottleneck is on the freeing side: the page_frag_free()
function took up to about 50% CPU for the non-aligned API and 16% CPU
for the aligned API on the push CPU according to 'perf top'.

Using the below patch causes page_frag_free() to disappear from the
push CPU in 'perf top'; the new performance data is below:
Without patch 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):

         21.084113      task-clock (msec)         #    0.008 CPUs utilized            ( +-  1.59% )
                 7      context-switches          #    0.334 K/sec                    ( +-  1.25% )
                 1      cpu-migrations            #    0.031 K/sec                    ( +- 20.20% )
                78      page-faults               #    0.004 M/sec                    ( +-  0.26% )
          54748233      cycles                    #    2.597 GHz                      ( +-  1.59% )
          61637051      instructions              #    1.13  insn per cycle           ( +-  0.13% )
          14727268      branches                  #  698.501 M/sec                    ( +-  0.11% )
             20178      branch-misses             #    0.14% of all branches          ( +-  0.94% )

       2.637345524 seconds time elapsed                                          ( +-  0.19% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):

         19.669259      task-clock (msec)         #    0.009 CPUs utilized            ( +-  2.91% )
                 7      context-switches          #    0.356 K/sec                    ( +-  1.04% )
                 0      cpu-migrations            #    0.005 K/sec                    ( +- 68.82% )
                77      page-faults               #    0.004 M/sec                    ( +-  0.27% )
          51077447      cycles                    #    2.597 GHz                      ( +-  2.91% )
          58875368      instructions              #    1.15  insn per cycle           ( +-  4.47% )
          14040015      branches                  #  713.805 M/sec                    ( +-  4.68% )
             20150      branch-misses             #    0.14% of all branches          ( +-  0.64% )

       2.226539190 seconds time elapsed                                          ( +-  0.12% )

With patch 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):

         20.782788      task-clock (msec)         #    0.008 CPUs utilized            ( +-  0.09% )
                 7      context-switches          #    0.342 K/sec                    ( +-  0.97% )
                 1      cpu-migrations            #    0.031 K/sec                    ( +- 16.83% )
                78      page-faults               #    0.004 M/sec                    ( +-  0.31% )
          53967333      cycles                    #    2.597 GHz                      ( +-  0.08% )
          61577257      instructions              #    1.14  insn per cycle           ( +-  0.02% )
          14712140      branches                  #  707.900 M/sec                    ( +-  0.02% )
             20234      branch-misses             #    0.14% of all branches          ( +-  0.55% )

       2.677974457 seconds time elapsed                                          ( +-  0.15% )

root@(none):/home# perf stat -r 20 insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1

insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):

         20.420537      task-clock (msec)         #    0.009 CPUs utilized            ( +-  0.05% )
                 7      context-switches          #    0.345 K/sec                    ( +-  0.71% )
                 0      cpu-migrations            #    0.005 K/sec                    ( +-100.00% )
                77      page-faults               #    0.004 M/sec                    ( +-  0.23% )
          53038942      cycles                    #    2.597 GHz                      ( +-  0.05% )
          59965712      instructions              #    1.13  insn per cycle           ( +-  0.03% )
          14372507      branches                  #  703.826 M/sec                    ( +-  0.03% )
             20580      branch-misses             #    0.14% of all branches          ( +-  0.56% )

       2.287783171 seconds time elapsed                                          ( +-  0.12% )

It seems that the bottleneck is still the freeing side, so the above
result might not be as meaningful as it should be.

As we can't use more than one CPU for the freeing side without some
locking when using a single ptr_ring, it seems something more
complicated might need to be done in order to support more than one
CPU for the freeing side?

Before patch 1, __page_frag_alloc_align() took up to 3.62% of CPU
according to 'perf top'.
After patch 1, __page_frag_cache_prepare() and __page_frag_cache_commit_noref()
took up to 4.67% + 1.01% = 5.68%.
As those results are similar, I am not sure whether the CPU usage can
tell us much about the performance degradation here, as the degradation
seems to be quite large?

@@ -100,13 +100,20 @@ static int page_frag_push_thread(void *arg)
                if (!va)
                        continue;

-               ret = __ptr_ring_produce(ring, va);
-               if (ret) {
+               do {
+                       ret = __ptr_ring_produce(ring, va);
+                       if (!ret) {
+                               va = NULL;
+                               break;
+                       } else {
+                               cond_resched();
+                       }
+               } while (!force_exit);
+
+               if (va)
                        page_frag_free(va);
-                       cond_resched();
-               } else {
+               else
                        test_pushed++;
-               }
        }

        pr_info("page_frag push test thread exits on cpu %d\n",
Yunsheng Lin Dec. 13, 2024, 12:09 p.m. UTC | #7
On 2024/12/11 20:52, Yunsheng Lin wrote:
 > It seems that bottleneck is still the freeing side that the above
> result might not be as meaningful as it should be.

Through 'perf top' annotation, there seems to be about 70%+ CPU usage
in the atomic operation of put_page_testzero() in page_frag_free();
it was unexpected that the atomic operation had that much overhead :(

> 
> As we can't use more than one cpu for the free side without some
> lock using a single ptr_ring, it seems something more complicated
> might need to be done in order to support more than one CPU for the
> freeing side?
> 
> Before patch 1, __page_frag_alloc_align took up to 3.62% percent of
> CPU using 'perf top'.
> After patch 1, __page_frag_cache_prepare() and __page_frag_cache_commit_noref()
> took up to 4.67% + 1.01% = 5.68%.
> Having a similar result, I am not sure if the CPU usages is able tell us
> the performance degradation here as it seems to be quite large?
> 

And using 'struct page_frag' to pass the parameters seems to cause some
observable overhead as the testing is very low level; the performance
difference seems to be negligible when using the below patch to avoid
passing 'struct page_frag', with 3.62% and 3.27% CPU usage for
__page_frag_alloc_align() before patch 1 and __page_frag_cache_prepare()
after patch 1 respectively.

The new refactoring avoids some overhead for the old API, but might cause
some overhead for the new API as it is not able to skip the virt_to_page()
for the refilling and reusing cases, though that seems to be an unlikely
case. Or is there any better idea how to do the refactoring for unifying
the page_frag API?

diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 41a91df82631..b83e7655654e 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -39,8 +39,24 @@ static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)

 void page_frag_cache_drain(struct page_frag_cache *nc);
 void __page_frag_cache_drain(struct page *page, unsigned int count);
-void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
-			      gfp_t gfp_mask, unsigned int align_mask);
+void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
+				gfp_t gfp_mask, unsigned int align_mask);
+
+static inline void *__page_frag_alloc_align(struct page_frag_cache *nc,
+					    unsigned int fragsz, gfp_t gfp_mask,
+					    unsigned int align_mask)
+{
+	void *va;
+
+	va = __page_frag_cache_prepare(nc, fragsz, gfp_mask, align_mask);
+	if (likely(va)) {
+		va += nc->offset;
+		nc->offset += fragsz;
+		nc->pagecnt_bias--;
+	}
+
+	return va;
+}

 static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
 					  unsigned int fragsz, gfp_t gfp_mask,
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index 3f7a203d35c6..729309aee27a 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -90,9 +90,9 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);

-void *__page_frag_alloc_align(struct page_frag_cache *nc,
-			      unsigned int fragsz, gfp_t gfp_mask,
-			      unsigned int align_mask)
+void *__page_frag_cache_prepare(struct page_frag_cache *nc,
+				unsigned int fragsz, gfp_t gfp_mask,
+				unsigned int align_mask)
 {
 	unsigned long encoded_page = nc->encoded_page;
 	unsigned int size, offset;
@@ -151,12 +151,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
 		offset = 0;
 	}

-	nc->pagecnt_bias--;
-	nc->offset = offset + fragsz;
-
-	return encoded_page_decode_virt(encoded_page) + offset;
+	nc->offset = offset;
+	return encoded_page_decode_virt(encoded_page);
 }
-EXPORT_SYMBOL(__page_frag_alloc_align);
+EXPORT_SYMBOL(__page_frag_cache_prepare);

 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.