Message ID | 161419301128.2718959.4838557038019199822.stgit@firesoul (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC,net-next,1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map | expand |
As a side-node, I didn't pick up the other patches as there is review feedback and I didn't have strong opinions either way. Patch 3 is curious though, it probably should be split out and sent separetly but still; On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote: > Avoid multiplication (imul) operations when accessing: > zone->free_area[order].nr_free > > This was really tricky to find. I was puzzled why perf reported that > rmqueue_bulk was using 44% of the time in an imul operation: > > ??? del_page_from_free_list(): > 44,54 ??? e2: imul $0x58,%rax,%rax > > This operation was generated (by compiler) because the struct free_area have > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use > and instead choose to use a more expensive imul, to find the offset into the > array free_area[]. > > The patch align struct free_area to a cache-line, which cause the > compiler avoid the imul operation. The imul operation is very fast on > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the > member 'nr_free' to be first element, which saves one 'add' operation. > > Looking up instruction latency this exchange a 3-cycle imul with a > 1-cycle shl, saving 2-cycles. It does trade some space to do this. > > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) > I'm having some trouble parsing this and matching it to the patch itself. First off, on my system (x86-64), the size of struct free area is 72, not 88 bytes. For either size, cache-aligning the structure is a big increase in the struct size. struct free_area { struct list_head free_list[4]; /* 0 64 */ /* --- cacheline 1 boundary (64 bytes) --- */ long unsigned int nr_free; /* 64 8 */ /* size: 72, cachelines: 2, members: 2 */ /* last cacheline: 8 bytes */ }; Are there other patches in the tree? What does pahole say? With gcc-9, I'm also not seeing the imul instruction outputted like you described in rmqueue_pcplist which inlines rmqueue_bulk. At the point where it calls get_page_from_free_area, it's using shl for the page list operation. This might be a compiler glitch but given that free_area is a different size, I'm less certain and wonder if something else is going on. Finally, moving nr_free to the end and cache aligning it will make the started of each free_list cache-aligned because of its location in the struct zone so what purpose does __pad_to_align_free_list serve?
On Thu, 25 Feb 2021 11:28:49 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > As a side-node, I didn't pick up the other patches as there is review > feedback and I didn't have strong opinions either way. Patch 3 is curious > though, it probably should be split out and sent separetly but still; > > On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote: > > Avoid multiplication (imul) operations when accessing: > > zone->free_area[order].nr_free > > > > This was really tricky to find. I was puzzled why perf reported that > > rmqueue_bulk was using 44% of the time in an imul operation: > > > > ??? del_page_from_free_list(): > > 44,54 ??? e2: imul $0x58,%rax,%rax > > > > This operation was generated (by compiler) because the struct free_area have > > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use > > and instead choose to use a more expensive imul, to find the offset into the > > array free_area[]. > > > > The patch align struct free_area to a cache-line, which cause the > > compiler avoid the imul operation. The imul operation is very fast on > > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the > > member 'nr_free' to be first element, which saves one 'add' operation. > > > > Looking up instruction latency this exchange a 3-cycle imul with a > > 1-cycle shl, saving 2-cycles. It does trade some space to do this. > > > > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) > > > > I'm having some trouble parsing this and matching it to the patch itself. > > First off, on my system (x86-64), the size of struct free area is 72, > not 88 bytes. For either size, cache-aligning the structure is a big > increase in the struct size. Yes, the increase in size is big. For the struct free_area 40 bytes for my case and 56 bytes for your case. The real problem is that this is multiplied by 11 (MAX_ORDER) and multiplied by number of zone structs (is it 5?). Thus, 56*11*5 = 3080 bytes. Thus, I'm not sure it is worth it! As I'm only saving 2-cycles, for something that depends on the compiler generating specific code. And the compiler can easily change, and "fix" this on-its-own in a later release, and then we are just wasting memory. I did notice this imul happens 45 times in mm/page_alloc.o, with this offset 0x58, but still this is likely not on hot-path. > struct free_area { > struct list_head free_list[4]; /* 0 64 */ > /* --- cacheline 1 boundary (64 bytes) --- */ > long unsigned int nr_free; /* 64 8 */ > > /* size: 72, cachelines: 2, members: 2 */ > /* last cacheline: 8 bytes */ > }; > > Are there other patches in the tree? What does pahole say? The size of size of struct free_area varies based on some CONFIG setting, as free_list[] array size is determined by MIGRATE_TYPES, which on my system is 5, and not 4 as on your system. struct list_head free_list[MIGRATE_TYPES]; CONFIG_CMA and CONFIG_MEMORY_ISOLATION both increase MIGRATE_TYPES with one. Thus, the array size can vary from 4 to 6. > With gcc-9, I'm also not seeing the imul instruction outputted like you > described in rmqueue_pcplist which inlines rmqueue_bulk. At the point > where it calls get_page_from_free_area, it's using shl for the page list > operation. This might be a compiler glitch but given that free_area is a > different size, I'm less certain and wonder if something else is going on. I think it is the size variation. > Finally, moving nr_free to the end and cache aligning it will make the > started of each free_list cache-aligned because of its location in the > struct zone so what purpose does __pad_to_align_free_list serve? The purpose of purpose of __pad_to_align_free_list is because struct list_head is 16 bytes, thus I wanted to align free_list to 16, given we already have wasted the space. Notice I added some more detailed notes in[1]: [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-optimisations
On Thu, Feb 25, 2021 at 04:16:33PM +0100, Jesper Dangaard Brouer wrote: > > On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote: > > > Avoid multiplication (imul) operations when accessing: > > > zone->free_area[order].nr_free > > > > > > This was really tricky to find. I was puzzled why perf reported that > > > rmqueue_bulk was using 44% of the time in an imul operation: > > > > > > ??? del_page_from_free_list(): > > > 44,54 ??? e2: imul $0x58,%rax,%rax > > > > > > This operation was generated (by compiler) because the struct free_area have > > > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use > > > and instead choose to use a more expensive imul, to find the offset into the > > > array free_area[]. > > > > > > The patch align struct free_area to a cache-line, which cause the > > > compiler avoid the imul operation. The imul operation is very fast on > > > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the > > > member 'nr_free' to be first element, which saves one 'add' operation. > > > > > > Looking up instruction latency this exchange a 3-cycle imul with a > > > 1-cycle shl, saving 2-cycles. It does trade some space to do this. > > > > > > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) > > > > > > > I'm having some trouble parsing this and matching it to the patch itself. > > > > First off, on my system (x86-64), the size of struct free area is 72, > > not 88 bytes. For either size, cache-aligning the structure is a big > > increase in the struct size. > > Yes, the increase in size is big. For the struct free_area 40 bytes for > my case and 56 bytes for your case. The real problem is that this is > multiplied by 11 (MAX_ORDER) and multiplied by number of zone structs > (is it 5?). Thus, 56*11*5 = 3080 bytes. > > Thus, I'm not sure it is worth it! As I'm only saving 2-cycles, for > something that depends on the compiler generating specific code. And > the compiler can easily change, and "fix" this on-its-own in a later > release, and then we are just wasting memory. > > I did notice this imul happens 45 times in mm/page_alloc.o, with this > offset 0x58, but still this is likely not on hot-path. > Yeah, I'm not convinced it's worth it. The benefit of 2 cycles is small and it's config-dependant. While some configurations will benefit, others do not but the increased consumption is universal. I think there are better ways to save 2 cycles in the page allocator and this seems like a costly micro-optimisation. > > <SNIP> > > > > With gcc-9, I'm also not seeing the imul instruction outputted like you > > described in rmqueue_pcplist which inlines rmqueue_bulk. At the point > > where it calls get_page_from_free_area, it's using shl for the page list > > operation. This might be a compiler glitch but given that free_area is a > > different size, I'm less certain and wonder if something else is going on. > > I think it is the size variation. > Yes. > > Finally, moving nr_free to the end and cache aligning it will make the > > started of each free_list cache-aligned because of its location in the > > struct zone so what purpose does __pad_to_align_free_list serve? > > The purpose of purpose of __pad_to_align_free_list is because struct > list_head is 16 bytes, thus I wanted to align free_list to 16, given we > already have wasted the space. > Ok, that's fair enough but it's also somewhat of a micro-optimisation as whether it helps or not depends on the architecture. I don't think I'll pick this up, certainly in the context of the bulk allocator but it's worth keeping in mind. It's an interesting corner case at least.
On Thu, 25 Feb 2021 15:38:15 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > On Thu, Feb 25, 2021 at 04:16:33PM +0100, Jesper Dangaard Brouer wrote: > > > On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote: > > > > Avoid multiplication (imul) operations when accessing: > > > > zone->free_area[order].nr_free > > > > > > > > This was really tricky to find. I was puzzled why perf reported that > > > > rmqueue_bulk was using 44% of the time in an imul operation: > > > > > > > > ??? del_page_from_free_list(): > > > > 44,54 ??? e2: imul $0x58,%rax,%rax > > > > > > > > This operation was generated (by compiler) because the struct free_area have > > > > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use > > > > and instead choose to use a more expensive imul, to find the offset into the > > > > array free_area[]. > > > > > > > > The patch align struct free_area to a cache-line, which cause the > > > > compiler avoid the imul operation. The imul operation is very fast on > > > > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the > > > > member 'nr_free' to be first element, which saves one 'add' operation. > > > > > > > > Looking up instruction latency this exchange a 3-cycle imul with a > > > > 1-cycle shl, saving 2-cycles. It does trade some space to do this. > > > > > > > > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) > > > > > > > > > > I'm having some trouble parsing this and matching it to the patch itself. > > > > > > First off, on my system (x86-64), the size of struct free area is 72, > > > not 88 bytes. For either size, cache-aligning the structure is a big > > > increase in the struct size. > > > > Yes, the increase in size is big. For the struct free_area 40 bytes for > > my case and 56 bytes for your case. The real problem is that this is > > multiplied by 11 (MAX_ORDER) and multiplied by number of zone structs > > (is it 5?). Thus, 56*11*5 = 3080 bytes. > > > > Thus, I'm not sure it is worth it! As I'm only saving 2-cycles, for > > something that depends on the compiler generating specific code. And > > the compiler can easily change, and "fix" this on-its-own in a later > > release, and then we are just wasting memory. > > > > I did notice this imul happens 45 times in mm/page_alloc.o, with this > > offset 0x58, but still this is likely not on hot-path. > > > > Yeah, I'm not convinced it's worth it. The benefit of 2 cycles is small and > it's config-dependant. While some configurations will benefit, others do > not but the increased consumption is universal. I think there are better > ways to save 2 cycles in the page allocator and this seems like a costly > micro-optimisation. > > > > <SNIP> > > > > > > With gcc-9, I'm also not seeing the imul instruction outputted like you > > > described in rmqueue_pcplist which inlines rmqueue_bulk. At the point > > > where it calls get_page_from_free_area, it's using shl for the page list > > > operation. This might be a compiler glitch but given that free_area is a > > > different size, I'm less certain and wonder if something else is going on. > > > > I think it is the size variation. > > > > Yes. > > > > Finally, moving nr_free to the end and cache aligning it will make the > > > started of each free_list cache-aligned because of its location in the > > > struct zone so what purpose does __pad_to_align_free_list serve? > > > > The purpose of purpose of __pad_to_align_free_list is because struct > > list_head is 16 bytes, thus I wanted to align free_list to 16, given we > > already have wasted the space. > > > > Ok, that's fair enough but it's also somewhat of a micro-optimisation as > whether it helps or not depends on the architecture. > > I don't think I'll pick this up, certainly in the context of the bulk > allocator but it's worth keeping in mind. It's an interesting corner case > at least. I fully agree. Lets drop this patch.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b593316bff3d..4d83201717e1 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -93,10 +93,12 @@ extern int page_group_by_mobility_disabled; #define get_pageblock_migratetype(page) \ get_pfnblock_flags_mask(page, page_to_pfn(page), MIGRATETYPE_MASK) +/* Aligned struct to make zone->free_area[order] access faster */ struct free_area { - struct list_head free_list[MIGRATE_TYPES]; unsigned long nr_free; -}; + unsigned long __pad_to_align_free_list; + struct list_head free_list[MIGRATE_TYPES]; +} ____cacheline_aligned_in_smp; static inline struct page *get_page_from_free_area(struct free_area *area, int migratetype)
Avoid multiplication (imul) operations when accessing: zone->free_area[order].nr_free This was really tricky to find. I was puzzled why perf reported that rmqueue_bulk was using 44% of the time in an imul operation: │ del_page_from_free_list(): 44,54 │ e2: imul $0x58,%rax,%rax This operation was generated (by compiler) because the struct free_area have size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use and instead choose to use a more expensive imul, to find the offset into the array free_area[]. The patch align struct free_area to a cache-line, which cause the compiler avoid the imul operation. The imul operation is very fast on modern Intel CPUs. To help fast-path that decrement 'nr_free' move the member 'nr_free' to be first element, which saves one 'add' operation. Looking up instruction latency this exchange a 3-cycle imul with a 1-cycle shl, saving 2-cycles. It does trade some space to do this. Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> --- include/linux/mmzone.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)