Message ID | 20190228021839.55779-1-dennis@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | introduce percpu block scan_hint | expand |
On Thu 28 Feb 2019 at 02:18, Dennis Zhou <dennis@kernel.org> wrote: > Hi everyone, > > It was reported a while [1] that an increase in allocation alignment > requirement [2] caused the percpu memory allocator to do significantly > more work. > > After spending quite a bit of time diving into it, it seems the crux was > the following: > 1) chunk management by free_bytes caused allocations to scan over > chunks that could not fit due to fragmentation > 2) per block fragmentation required scanning from an early first_free > bit causing allocations to repeat work > > This series introduces a scan_hint for pcpu_block_md and merges the > paths used to manage the hints. The scan_hint represents the largest > known free area prior to the contig_hint. There are some caveats to > this. First, it may not necessarily be the largest area as we do partial > updates based on freeing of regions and failed scanning in > pcpu_alloc_area(). Second, if contig_hint == scan_hint, then > scan_hint_start > contig_hint_start is possible. This is necessary > for scan_hint discovery when refreshing the hint of a block. > > A necessary change is to enforce a block to be the size of a page. This > let's the management of nr_empty_pop_pages to be done by breaking and > making full contig_hints in the hint update paths. Prior, this was done > by piggy backing off of refreshing the chunk contig_hint as it performed > a full scan and counting empty full pages. > > The following are the results found using the workload provided in [3]. > > branch | time > ------------------------ > 5.0-rc7 | 69s > [2] reverted | 44s > scan_hint | 39s > > The times above represent the approximate average across multiple runs. > I tested based on a basic 1M 16-byte allocation pattern with no > alignment requirement and times did not differ between 5.0-rc7 and > scan_hint. > > [1] https://lore.kernel.org/netdev/CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=KZ3HdcWPw6Q@mail.gmail.com/ > [2] https://lore.kernel.org/netdev/20181116154329.247947-1-edumazet@google.com/ > [3] https://lore.kernel.org/netdev/vbfzhrj9smb.fsf@mellanox.com/ > > This patchset contains the following 12 patches: > 0001-percpu-update-free-path-with-correct-new-free-region.patch > 0002-percpu-do-not-search-past-bitmap-when-allocating-an-.patch > 0003-percpu-introduce-helper-to-determine-if-two-regions-.patch > 0004-percpu-manage-chunks-based-on-contig_bits-instead-of.patch > 0005-percpu-relegate-chunks-unusable-when-failing-small-a.patch > 0006-percpu-set-PCPU_BITMAP_BLOCK_SIZE-to-PAGE_SIZE.patch > 0007-percpu-add-block-level-scan_hint.patch > 0008-percpu-remember-largest-area-skipped-during-allocati.patch > 0009-percpu-use-block-scan_hint-to-only-scan-forward.patch > 0010-percpu-make-pcpu_block_md-generic.patch > 0011-percpu-convert-chunk-hints-to-be-based-on-pcpu_block.patch > 0012-percpu-use-chunk-scan_hint-to-skip-some-scanning.patch > > 0001 fixes an issue where the chunk contig_hint was being updated > improperly with the new region's starting offset and possibly differing > contig_hint. 0002 fixes possibly scanning pass the end of the bitmap. > 0003 introduces a helper to do region overlap comparison. 0004 switches > to chunk management by contig_hint rather than free_bytes. 0005 moves > chunks that fail to allocate to the empty block list to prevent excess > scanning with of chunks with small contig_hints and poor alignment. > 0006 introduces the constraint PCPU_BITMAP_BLOCK_SIZE == PAGE_SIZE and > modifies nr_empty_pop_pages management to be a part of the hint updates. > 0007-0009 introduces percpu block scan_hint. 0010 makes pcpu_block_md > generic so chunk hints can be managed as a pcpu_block_md responsible > for more bits. 0011-0012 add chunk scan_hints. > > This patchset is on top of percpu#master a3b22b9f11d9. > > diffstats below: > > Dennis Zhou (12): > percpu: update free path with correct new free region > percpu: do not search past bitmap when allocating an area > percpu: introduce helper to determine if two regions overlap > percpu: manage chunks based on contig_bits instead of free_bytes > percpu: relegate chunks unusable when failing small allocations > percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE > percpu: add block level scan_hint > percpu: remember largest area skipped during allocation > percpu: use block scan_hint to only scan forward > percpu: make pcpu_block_md generic > percpu: convert chunk hints to be based on pcpu_block_md > percpu: use chunk scan_hint to skip some scanning > > include/linux/percpu.h | 12 +- > mm/percpu-internal.h | 15 +- > mm/percpu-km.c | 2 +- > mm/percpu-stats.c | 5 +- > mm/percpu.c | 547 +++++++++++++++++++++++++++++------------ > 5 files changed, 404 insertions(+), 177 deletions(-) > > Thanks, > Dennis Hi Dennis, Thank you very much for doing this! I applied the patches on top of current net-next and can confirm that tc filter insertion rate is significantly improved and is better compared to version with offending commit simply reverted. Regards, Vlad
On Wed, Feb 27, 2019 at 09:18:27PM -0500, Dennis Zhou wrote: > Hi everyone, > > It was reported a while [1] that an increase in allocation alignment > requirement [2] caused the percpu memory allocator to do significantly > more work. > > After spending quite a bit of time diving into it, it seems the crux was > the following: > 1) chunk management by free_bytes caused allocations to scan over > chunks that could not fit due to fragmentation > 2) per block fragmentation required scanning from an early first_free > bit causing allocations to repeat work > > This series introduces a scan_hint for pcpu_block_md and merges the > paths used to manage the hints. The scan_hint represents the largest > known free area prior to the contig_hint. There are some caveats to > this. First, it may not necessarily be the largest area as we do partial > updates based on freeing of regions and failed scanning in > pcpu_alloc_area(). Second, if contig_hint == scan_hint, then > scan_hint_start > contig_hint_start is possible. This is necessary > for scan_hint discovery when refreshing the hint of a block. > > A necessary change is to enforce a block to be the size of a page. This > let's the management of nr_empty_pop_pages to be done by breaking and > making full contig_hints in the hint update paths. Prior, this was done > by piggy backing off of refreshing the chunk contig_hint as it performed > a full scan and counting empty full pages. > > The following are the results found using the workload provided in [3]. > > branch | time > ------------------------ > 5.0-rc7 | 69s > [2] reverted | 44s > scan_hint | 39s > > The times above represent the approximate average across multiple runs. > I tested based on a basic 1M 16-byte allocation pattern with no > alignment requirement and times did not differ between 5.0-rc7 and > scan_hint. > > [1] https://lore.kernel.org/netdev/CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=KZ3HdcWPw6Q@mail.gmail.com/ > [2] https://lore.kernel.org/netdev/20181116154329.247947-1-edumazet@google.com/ > [3] https://lore.kernel.org/netdev/vbfzhrj9smb.fsf@mellanox.com/ > > This patchset contains the following 12 patches: > 0001-percpu-update-free-path-with-correct-new-free-region.patch > 0002-percpu-do-not-search-past-bitmap-when-allocating-an-.patch > 0003-percpu-introduce-helper-to-determine-if-two-regions-.patch > 0004-percpu-manage-chunks-based-on-contig_bits-instead-of.patch > 0005-percpu-relegate-chunks-unusable-when-failing-small-a.patch > 0006-percpu-set-PCPU_BITMAP_BLOCK_SIZE-to-PAGE_SIZE.patch > 0007-percpu-add-block-level-scan_hint.patch > 0008-percpu-remember-largest-area-skipped-during-allocati.patch > 0009-percpu-use-block-scan_hint-to-only-scan-forward.patch > 0010-percpu-make-pcpu_block_md-generic.patch > 0011-percpu-convert-chunk-hints-to-be-based-on-pcpu_block.patch > 0012-percpu-use-chunk-scan_hint-to-skip-some-scanning.patch > > 0001 fixes an issue where the chunk contig_hint was being updated > improperly with the new region's starting offset and possibly differing > contig_hint. 0002 fixes possibly scanning pass the end of the bitmap. > 0003 introduces a helper to do region overlap comparison. 0004 switches > to chunk management by contig_hint rather than free_bytes. 0005 moves > chunks that fail to allocate to the empty block list to prevent excess > scanning with of chunks with small contig_hints and poor alignment. > 0006 introduces the constraint PCPU_BITMAP_BLOCK_SIZE == PAGE_SIZE and > modifies nr_empty_pop_pages management to be a part of the hint updates. > 0007-0009 introduces percpu block scan_hint. 0010 makes pcpu_block_md > generic so chunk hints can be managed as a pcpu_block_md responsible > for more bits. 0011-0012 add chunk scan_hints. > > This patchset is on top of percpu#master a3b22b9f11d9. > > diffstats below: > > Dennis Zhou (12): > percpu: update free path with correct new free region > percpu: do not search past bitmap when allocating an area > percpu: introduce helper to determine if two regions overlap > percpu: manage chunks based on contig_bits instead of free_bytes > percpu: relegate chunks unusable when failing small allocations > percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE > percpu: add block level scan_hint > percpu: remember largest area skipped during allocation > percpu: use block scan_hint to only scan forward > percpu: make pcpu_block_md generic > percpu: convert chunk hints to be based on pcpu_block_md > percpu: use chunk scan_hint to skip some scanning > > include/linux/percpu.h | 12 +- > mm/percpu-internal.h | 15 +- > mm/percpu-km.c | 2 +- > mm/percpu-stats.c | 5 +- > mm/percpu.c | 547 +++++++++++++++++++++++++++++------------ > 5 files changed, 404 insertions(+), 177 deletions(-) > > Thanks, > Dennis Applied to percpu/for-5.2. Thanks, Dennis