[00/12] introduce percpu block scan_hint

Message ID 20190228021839.55779-1-dennis@kernel.org (mailing list archive)

Dennis Zhou Feb. 28, 2019, 2:18 a.m. UTC
Hi everyone,

It was reported a while ago [1] that an increase in allocation alignment
requirement [2] caused the percpu memory allocator to do significantly
more work.

After spending quite a bit of time diving into it, it seems the crux was
the following:
  1) chunk management by free_bytes caused allocations to scan over
     chunks that could not fit due to fragmentation
  2) per block fragmentation required scanning from an early first_free
     bit causing allocations to repeat work
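
To illustrate point 1, here is a minimal userspace sketch, not code from the
series. It assumes a toy chunk with one flag per allocation unit; the names
toy_chunk, toy_free_units and toy_largest_free_run are hypothetical. It shows
how a chunk can report plenty of free space in total while its largest
contiguous free run is too small for the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy chunk: one flag per allocation unit, true = in use. */
struct toy_chunk {
	const bool *used;
	size_t nr_units;
};

/* Total number of free units, loosely analogous to free_bytes. */
static size_t toy_free_units(const struct toy_chunk *c)
{
	size_t i, n = 0;

	assert(c && c->used);
	for (i = 0; i < c->nr_units; i++)
		if (!c->used[i])
			n++;
	return n;
}

/* Longest run of consecutive free units, loosely analogous to contig_hint. */
static size_t toy_largest_free_run(const struct toy_chunk *c)
{
	size_t i, run = 0, best = 0;

	assert(c && c->used);
	for (i = 0; i < c->nr_units; i++) {
		run = c->used[i] ? 0 : run + 1;
		if (run > best)
			best = run;
	}
	return best;
}
```

With an alternating used/free pattern, half of the units are free but no
request larger than one unit can be satisfied, so choosing chunks by total
free space sends the allocator into chunks it then has to scan in vain.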

This series introduces a scan_hint for pcpu_block_md and merges the
paths used to manage the hints. The scan_hint represents the largest
known free area prior to the contig_hint. There are some caveats to
this. First, it may not necessarily be the largest area as we do partial
updates based on freeing of regions and failed scanning in
pcpu_alloc_area(). Second, if contig_hint == scan_hint, then
scan_hint_start > contig_hint_start is possible. This is necessary
for scan_hint discovery when refreshing the hint of a block.
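
A rough userspace sketch of how such a hint could shorten a scan is below. It
is an illustration under stated assumptions, not the code from the patches:
the struct toy_block_md and the function toy_next_scan_start are hypothetical,
all values are in bits, and the sketch assumes that no free area located
before the end of the scan_hint area is larger than scan_hint.

```c
#include <assert.h>

/* Hypothetical, simplified block metadata; offsets and sizes are in bits. */
struct toy_block_md {
	int first_free;        /* first free bit in the block */
	int contig_hint;       /* size of the largest known free area */
	int contig_hint_start; /* offset where that area starts */
	int scan_hint;         /* size of the known free area before it, 0 if none */
	int scan_hint_start;   /* offset where the scan_hint area starts */
};

/*
 * Pick the offset at which a scan for alloc_bits should begin.
 * The scan may skip past the scan_hint area only if a scan_hint exists,
 * it lies before the contig_hint, and the request cannot fit in it.
 * Otherwise fall back to the first free bit.
 */
static int toy_next_scan_start(const struct toy_block_md *b, int alloc_bits)
{
	assert(b && alloc_bits > 0);

	if (b->scan_hint &&
	    b->contig_hint_start > b->scan_hint_start &&
	    alloc_bits > b->scan_hint)
		return b->scan_hint_start + b->scan_hint;

	return b->first_free;
}
```

The second condition covers the caveat above: when contig_hint == scan_hint
and the scan_hint sits after the contig_hint, skipping past it would jump over
the contig_hint, so the sketch starts from first_free instead.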

A necessary change is to enforce a block to be the size of a page. This
lets the management of nr_empty_pop_pages be done by breaking and
making full contig_hints in the hint update paths. Previously, this was
done by piggybacking off of refreshing the chunk contig_hint, as it
performed a full scan and counted empty full pages.
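
The idea of counting empty pages at hint transitions can be sketched roughly
as follows. This is a simplified illustration, not the series' code: the names
toy_chunk_md, toy_block and toy_block_set_contig are hypothetical, the value
of TOY_BLOCK_BITS is an assumption, and the sketch ignores whether a page is
actually populated.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed number of allocation bits covered by one page-sized block. */
#define TOY_BLOCK_BITS 1024

/* Hypothetical chunk-level counter of completely free pages. */
struct toy_chunk_md {
	int nr_empty_pop_pages;
};

/* Hypothetical block with only the field this sketch needs. */
struct toy_block {
	int contig_hint; /* largest known free area in the block, in bits */
};

/*
 * Update a block's contig_hint and adjust the empty page count only when
 * the block crosses the "entirely free" boundary. No full rescan of the
 * chunk is needed because one block corresponds to exactly one page.
 */
static void toy_block_set_contig(struct toy_chunk_md *chunk,
				 struct toy_block *block, int new_hint)
{
	bool was_full, is_full;

	assert(new_hint >= 0 && new_hint <= TOY_BLOCK_BITS);

	was_full = block->contig_hint == TOY_BLOCK_BITS;
	is_full = new_hint == TOY_BLOCK_BITS;

	if (was_full && !is_full)
		chunk->nr_empty_pop_pages--; /* a full hint was broken */
	else if (!was_full && is_full)
		chunk->nr_empty_pop_pages++; /* a full hint was made */

	block->contig_hint = new_hint;
}
```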

The following are the results found using the workload provided in [3].

        branch        | time
       ------------------------
        5.0-rc7       | 69s
        [2] reverted  | 44s
        scan_hint     | 39s

The times above represent the approximate average across multiple runs.
I also tested a basic 1M 16-byte allocation pattern with no alignment
requirement, and times did not differ between 5.0-rc7 and scan_hint.

[1] https://lore.kernel.org/netdev/CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=KZ3HdcWPw6Q@mail.gmail.com/
[2] https://lore.kernel.org/netdev/20181116154329.247947-1-edumazet@google.com/
[3] https://lore.kernel.org/netdev/vbfzhrj9smb.fsf@mellanox.com/

This patchset contains the following 12 patches:
  0001-percpu-update-free-path-with-correct-new-free-region.patch
  0002-percpu-do-not-search-past-bitmap-when-allocating-an-.patch
  0003-percpu-introduce-helper-to-determine-if-two-regions-.patch
  0004-percpu-manage-chunks-based-on-contig_bits-instead-of.patch
  0005-percpu-relegate-chunks-unusable-when-failing-small-a.patch
  0006-percpu-set-PCPU_BITMAP_BLOCK_SIZE-to-PAGE_SIZE.patch
  0007-percpu-add-block-level-scan_hint.patch
  0008-percpu-remember-largest-area-skipped-during-allocati.patch
  0009-percpu-use-block-scan_hint-to-only-scan-forward.patch
  0010-percpu-make-pcpu_block_md-generic.patch
  0011-percpu-convert-chunk-hints-to-be-based-on-pcpu_block.patch
  0012-percpu-use-chunk-scan_hint-to-skip-some-scanning.patch

0001 fixes an issue where the chunk contig_hint was being updated
improperly with the new region's starting offset and possibly differing
contig_hint. 0002 fixes possibly scanning past the end of the bitmap.
0003 introduces a helper to do region overlap comparison. 0004 switches
to chunk management by contig_hint rather than free_bytes. 0005 moves
chunks that fail to allocate to the empty block list to prevent excess
scanning of chunks with small contig_hints and poor alignment.
0006 introduces the constraint PCPU_BITMAP_BLOCK_SIZE == PAGE_SIZE and
modifies nr_empty_pop_pages management to be a part of the hint updates.
0007-0009 introduce the percpu block scan_hint. 0010 makes pcpu_block_md
generic so chunk hints can be managed as a pcpu_block_md responsible
for more bits. 0011-0012 add chunk scan_hints.
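
As an illustration of the kind of region overlap comparison mentioned for
0003, here is a small sketch. It assumes half-open regions [a, b) and [x, y);
the name toy_region_overlap is hypothetical and the helper in the patch may
differ in name and details.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if the half-open regions [a, b) and [x, y) share at least
 * one position. Two such regions overlap exactly when each one starts
 * before the other one ends.
 */
static bool toy_region_overlap(int a, int b, int x, int y)
{
	assert(a <= b && x <= y);

	return a < y && x < b;
}
```

Regions that merely touch, such as [0, 4) and [4, 8), do not overlap under
this convention.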

This patchset is on top of percpu#master a3b22b9f11d9.

diffstats below:

Dennis Zhou (12):
  percpu: update free path with correct new free region
  percpu: do not search past bitmap when allocating an area
  percpu: introduce helper to determine if two regions overlap
  percpu: manage chunks based on contig_bits instead of free_bytes
  percpu: relegate chunks unusable when failing small allocations
  percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
  percpu: add block level scan_hint
  percpu: remember largest area skipped during allocation
  percpu: use block scan_hint to only scan forward
  percpu: make pcpu_block_md generic
  percpu: convert chunk hints to be based on pcpu_block_md
  percpu: use chunk scan_hint to skip some scanning

 include/linux/percpu.h |  12 +-
 mm/percpu-internal.h   |  15 +-
 mm/percpu-km.c         |   2 +-
 mm/percpu-stats.c      |   5 +-
 mm/percpu.c            | 547 +++++++++++++++++++++++++++++------------
 5 files changed, 404 insertions(+), 177 deletions(-)

Thanks,
Dennis

Vlad Buslov Feb. 28, 2019, 2:47 p.m. UTC | #1

Hi Dennis,

Thank you very much for doing this!

I applied the patches on top of current net-next and can confirm that the
tc filter insertion rate is significantly improved and is better than
the version with the offending commit simply reverted.

Regards,
Vlad

Dennis Zhou March 13, 2019, 8:19 p.m. UTC | #2

Applied to percpu/for-5.2.

Thanks,
Dennis