Message ID | 1599116482-7410-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
---|---
State | New, archived
Series | [v4,1/4] mm/pageblock: mitigation cmpxchg false sharing in pageblock flags

On Thu, Sep 03, 2020 at 03:01:20PM +0800, Alex Shi wrote:
> pageblock_flags is used as long. Since every pageblock's flags are just 4
> bits, a 'long' packs the flags of 8 (on 32-bit machines) or 16 pageblocks,
> so setting one pageblock's flags has to sync in cmpxchg with 7 or 15 other
> pageblocks' flags. That can cause long waits for the sync.
>
> If we change the pageblock_flags variable to char, we can use a char-size
> cmpxchg, which only syncs with 2 pageblocks' flags. That relieves the
> false sharing in cmpxchg.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Page block types were not known to change at high frequency that would
cause a measurable performance drop. If anything, the performance hit
from pageblocks is the lookup paths, which are a lot more frequent.

What was the workload you were running that altered pageblocks at a high
enough frequency for collisions to occur when updating adjacent
pageblocks?

On 03.09.20 09:01, Alex Shi wrote:
> pageblock_flags is used as long. Since every pageblock's flags are just 4
> bits, a 'long' packs the flags of 8 (on 32-bit machines) or 16 pageblocks,
> so setting one pageblock's flags has to sync in cmpxchg with 7 or 15 other
> pageblocks' flags. That can cause long waits for the sync.
>
> If we change the pageblock_flags variable to char, we can use a char-size
> cmpxchg, which only syncs with 2 pageblocks' flags. That relieves the
> false sharing in cmpxchg.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Could you please

1. Send a cover letter and indicate the changes between versions. I
   cannot find any in my mailbox or on -mm - is there any? (Also, is
   there a patch 4?)

2. Report proper performance numbers as requested, especially over
   multiple runs. This should go into patch 1/2. Are they buried
   somewhere?

3. Clarify what patch 3 does: Do we actually waste 8*sizeof(long) where
   we only need 4 bits?

Also, breaking stuff in patch 1 and fixing it in patch 3 is not
acceptable. This breaks git bisect. Skimming over the patches I think
this is the case.

I am not convinced yet that we need and want this. As Alex mentioned, we
touch pageblock flags while holding the zone lock already in most cases
- and as Mel mentions, updates should be rare.

On 2020/9/3 3:24 PM, Mel Gorman wrote:
> On Thu, Sep 03, 2020 at 03:01:20PM +0800, Alex Shi wrote:
>> pageblock_flags is used as long. Since every pageblock's flags are just 4
>> bits, a 'long' packs the flags of 8 (on 32-bit machines) or 16 pageblocks,
>> so setting one pageblock's flags has to sync in cmpxchg with 7 or 15 other
>> pageblocks' flags. That can cause long waits for the sync.
>>
>> If we change the pageblock_flags variable to char, we can use a char-size
>> cmpxchg, which only syncs with 2 pageblocks' flags. That relieves the
>> false sharing in cmpxchg.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>
> Page block types were not known to change at high frequency that would
> cause a measurable performance drop. If anything, the performance hit
> from pageblocks is the lookup paths, which are a lot more frequent.

Yes, it is not a hot path. But it is still a meaningful point to reduce
cmpxchg-level false sharing, which isn't logically right.

>
> What was the workload you were running that altered pageblocks at a high
> enough frequency for collisions to occur when updating adjacent
> pageblocks?
>

I have run thpscale with the 'always' defrag setting of THP. The Amean
stddev is much larger than the very small reduction in average run time.

But the remaining patch 4 shows the cmpxchg retries drop from thousands
to hundreds or fewer.

Subject: [PATCH v4 4/4] add cmpxchg tracing

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
---
 include/trace/events/pageblock.h | 30 ++++++++++++++++++++++++++++++
 mm/page_alloc.c                  |  4 ++++
 2 files changed, 34 insertions(+)
 create mode 100644 include/trace/events/pageblock.h

diff --git a/include/trace/events/pageblock.h b/include/trace/events/pageblock.h
new file mode 100644
index 000000000000..003c2d716f82
--- /dev/null
+++ b/include/trace/events/pageblock.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pageblock
+
+#if !defined(_TRACE_PAGEBLOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEBLOCK_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(hit_cmpxchg,
+
+	TP_PROTO(char byte),
+
+	TP_ARGS(byte),
+
+	TP_STRUCT__entry(
+		__field(char, byte)
+	),
+
+	TP_fast_assign(
+		__entry->byte = byte;
+	),
+
+	TP_printk("%d", __entry->byte)
+);
+
+#endif /* _TRACE_PAGE_ISOLATION_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b65d83d8be6..a6d7159295bc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -509,6 +509,9 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned
  * @pfn: The target page frame number
  * @mask: mask of bits that the caller is interested in
  */
+#define CREATE_TRACE_POINTS
+#include <trace/events/pageblock.h>
+
 void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 					unsigned long pfn,
 					unsigned long mask)
@@ -532,6 +535,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 		if (byte == old_byte)
 			break;
 		byte = old_byte;
+		trace_hit_cmpxchg(byte);
 	}
 }
 
--
1.8.3.1

On 2020/9/3 4:32 PM, Alex Shi wrote:
>>
> I have run thpscale with the 'always' defrag setting of THP. The Amean
> stddev is much larger than the very small reduction in average run time.
>
> But the remaining patch 4 shows the cmpxchg retries drop from thousands
> to hundreds or fewer.
>
> Subject: [PATCH v4 4/4] add cmpxchg tracing

Here is a typical result with the patchset:

 Performance counter stats for './run-mmtests.sh -c configs/config-workload-thpscale pageblock-c':

             9,564      compaction:mm_compaction_isolate_migratepages
             6,430      compaction:mm_compaction_isolate_freepages
             5,287      compaction:mm_compaction_migratepages
            45,299      compaction:mm_compaction_begin
            45,299      compaction:mm_compaction_end
            30,557      compaction:mm_compaction_try_to_compact_pages
            95,540      compaction:mm_compaction_finished
           149,379      compaction:mm_compaction_suitable
                 0      compaction:mm_compaction_deferred
                 0      compaction:mm_compaction_defer_compaction
             3,949      compaction:mm_compaction_defer_reset
                 0      compaction:mm_compaction_kcompactd_sleep
                 0      compaction:mm_compaction_wakeup_kcompactd
                 0      compaction:mm_compaction_kcompactd_wake
                68      pageblock:hit_cmpxchg

     113.570974583 seconds time elapsed

      14.664451000 seconds user
      96.847116000 seconds sys

Here is the 5.9-rc2 base kernel result:

 Performance counter stats for './run-mmtests.sh -c configs/config-workload-thpscale rc2-e':

            15,920      compaction:mm_compaction_isolate_migratepages
            20,523      compaction:mm_compaction_isolate_freepages
             9,752      compaction:mm_compaction_migratepages
            27,773      compaction:mm_compaction_begin
            27,773      compaction:mm_compaction_end
            16,391      compaction:mm_compaction_try_to_compact_pages
            62,809      compaction:mm_compaction_finished
            69,821      compaction:mm_compaction_suitable
                 0      compaction:mm_compaction_deferred
                 0      compaction:mm_compaction_defer_compaction
             7,875      compaction:mm_compaction_defer_reset
                 0      compaction:mm_compaction_kcompactd_sleep
                 0      compaction:mm_compaction_wakeup_kcompactd
                 0      compaction:mm_compaction_kcompactd_wake
             1,208      pageblock:hit_cmpxchg

     116.440414591 seconds time elapsed

      15.326913000 seconds user
     103.752758000 seconds sys

On 9/3/20 10:40 AM, Alex Shi wrote:
>
>
> On 2020/9/3 4:32 PM, Alex Shi wrote:
>>>
>> I have run thpscale with the 'always' defrag setting of THP. The Amean
>> stddev is much larger than the very small reduction in average run time.
>>
>> But the remaining patch 4 shows the cmpxchg retries drop from thousands
>> to hundreds or fewer.
>>
>> Subject: [PATCH v4 4/4] add cmpxchg tracing
>
>
> Here is a typical result with the patchset:
>
>  Performance counter stats for './run-mmtests.sh -c configs/config-workload-thpscale pageblock-c':
>
>              9,564      compaction:mm_compaction_isolate_migratepages
>              6,430      compaction:mm_compaction_isolate_freepages
>              5,287      compaction:mm_compaction_migratepages
>             45,299      compaction:mm_compaction_begin
>             45,299      compaction:mm_compaction_end
>             30,557      compaction:mm_compaction_try_to_compact_pages
>             95,540      compaction:mm_compaction_finished
>            149,379      compaction:mm_compaction_suitable
>                  0      compaction:mm_compaction_deferred
>                  0      compaction:mm_compaction_defer_compaction
>              3,949      compaction:mm_compaction_defer_reset
>                  0      compaction:mm_compaction_kcompactd_sleep
>                  0      compaction:mm_compaction_wakeup_kcompactd
>                  0      compaction:mm_compaction_kcompactd_wake
>                 68      pageblock:hit_cmpxchg
>
>      113.570974583 seconds time elapsed
>
>       14.664451000 seconds user
>       96.847116000 seconds sys
>
> Here is the 5.9-rc2 base kernel result:
>
>  Performance counter stats for './run-mmtests.sh -c configs/config-workload-thpscale rc2-e':
>
>             15,920      compaction:mm_compaction_isolate_migratepages
>             20,523      compaction:mm_compaction_isolate_freepages
>              9,752      compaction:mm_compaction_migratepages
>             27,773      compaction:mm_compaction_begin
>             27,773      compaction:mm_compaction_end
>             16,391      compaction:mm_compaction_try_to_compact_pages
>             62,809      compaction:mm_compaction_finished
>             69,821      compaction:mm_compaction_suitable
>                  0      compaction:mm_compaction_deferred
>                  0      compaction:mm_compaction_defer_compaction
>              7,875      compaction:mm_compaction_defer_reset
>                  0      compaction:mm_compaction_kcompactd_sleep
>                  0      compaction:mm_compaction_wakeup_kcompactd
>                  0      compaction:mm_compaction_kcompactd_wake
>              1,208      pageblock:hit_cmpxchg
>
>      116.440414591 seconds time elapsed
>
>       15.326913000 seconds user
>      103.752758000 seconds sys

The runs differ wildly in many of the other stats, so I'm not sure they
are really comparable. I guess you could show the fraction of
hit_cmpxchg out of all cmpxchg attempts. But there's also the danger of
the tracepoint widening the race window. In the end, what matters is how
these 1,208 retries contribute to runtime; I doubt they could really be
visible in a 100+ second run.

On 2020/9/3 4:19 PM, David Hildenbrand wrote:
> On 03.09.20 09:01, Alex Shi wrote:
>> pageblock_flags is used as long. Since every pageblock's flags are just 4
>> bits, a 'long' packs the flags of 8 (on 32-bit machines) or 16 pageblocks,
>> so setting one pageblock's flags has to sync in cmpxchg with 7 or 15 other
>> pageblocks' flags. That can cause long waits for the sync.
>>
>> If we change the pageblock_flags variable to char, we can use a char-size
>> cmpxchg, which only syncs with 2 pageblocks' flags. That relieves the
>> false sharing in cmpxchg.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>
> Could you please
>
> 1. Send a cover letter and indicate the changes between versions. I
> cannot find any in my mailbox or on -mm - is there any? (Also, is there
> a patch 4?)

Hi David,

Thanks for the comments! I thought a short patchset didn't need a cover
letter. Here it is:

cmpxchg is a lockless way to update data: it compares the target data
after the update and retries if the target changed during that action.
So we should compare exactly the size of the target data. If the data is
only a byte but cmpxchg compares a long word, that leads to
cmpxchg-level false sharing, causing more retries which lock memory more
often. That's a clear logical error and should be fixed.

v1, the initial version
v2, fix the missing armv6 cmpxchgb issue.
v3, fix more missing arch cmpxchgb issues on armv6, sh2, sparc32, xtensa
v4, redo the cmpxchgb fix via NO_CMPXCHG_BYTE in arch/Kconfig

>
> 2. Report proper performance numbers as requested, especially over
> multiple runs. This should go into patch 1/2. Are they buried somewhere?

There are some results posted at https://lkml.org/lkml/2020/8/30/95

>
> 3. Clarify what patch 3 does: Do we actually waste 8*sizeof(long) where
> we only need 4 bits?

No, nothing is wasted; the current kernel packs the 4 bits into a long,
which makes cmpxchg compare 8 pageblocks' flags when 7 of them are not
needed.

>
> Also, breaking stuff in patch 1 and fixing it in patch 3 is not
> acceptable. This breaks git bisect. Skimming over the patches I think
> this is the case.

Got it, thanks!

>
> I am not convinced yet that we need and want this. As Alex mentioned, we
> touch pageblock flags while holding the zone lock already in most cases
> - and as Mel mentions, updates should be rare.
>

Yes, not too many, but there are still a few, and each cmpxchg retry
locks memory, which I believe the less the better.

Thanks
Alex

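To make the false-sharing argument above concrete, the following is a
minimal userspace sketch of the same kind of retry loop, written with
plain C11 atomics rather than the kernel's cmpxchg(); the helper name
set_block_flags and the byte-wide bitmap layout are illustrative
assumptions, not kernel code. With a long-wide bitmap, the equivalent
loop would have to retry whenever any of the 7 or 15 co-resident
pageblocks' flags changed in between; with a byte-wide bitmap only the
single neighbouring pageblock sharing the byte can force a retry.

#include <stdatomic.h>

#define NR_PAGEBLOCK_BITS 4

/* Set the 4 flag bits of pageblock 'idx' in a byte-wide bitmap. */
static void set_block_flags(_Atomic unsigned char *bitmap,
			    unsigned long idx, unsigned char flags)
{
	unsigned long byte_idx = idx * NR_PAGEBLOCK_BITS / 8;
	unsigned int shift = (idx * NR_PAGEBLOCK_BITS) % 8;
	unsigned char mask = 0xf << shift;
	unsigned char old = atomic_load(&bitmap[byte_idx]);

	/*
	 * Classic cmpxchg retry loop: with a byte-wide target it only
	 * retries if the other pageblock sharing this byte changed; with
	 * a long-wide target it would also retry for changes to any of
	 * the 7 or 15 other pageblocks packed into the same long.
	 */
	while (!atomic_compare_exchange_weak(&bitmap[byte_idx], &old,
			(unsigned char)((old & ~mask) | (flags << shift))))
		;	/* 'old' was refreshed by the failed exchange */
}
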
On Thu, Sep 03, 2020 at 04:32:54PM +0800, Alex Shi wrote:
>
>
> On 2020/9/3 3:24 PM, Mel Gorman wrote:
> > On Thu, Sep 03, 2020 at 03:01:20PM +0800, Alex Shi wrote:
> >> pageblock_flags is used as long. Since every pageblock's flags are just 4
> >> bits, a 'long' packs the flags of 8 (on 32-bit machines) or 16 pageblocks,
> >> so setting one pageblock's flags has to sync in cmpxchg with 7 or 15 other
> >> pageblocks' flags. That can cause long waits for the sync.
> >>
> >> If we change the pageblock_flags variable to char, we can use a char-size
> >> cmpxchg, which only syncs with 2 pageblocks' flags. That relieves the
> >> false sharing in cmpxchg.
> >>
> >> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> >
> > Page block types were not known to change at high frequency that would
> > cause a measurable performance drop. If anything, the performance hit
> > from pageblocks is the lookup paths, which are a lot more frequent.
>
> Yes, it is not a hot path. But it is still a meaningful point to reduce
> cmpxchg-level false sharing, which isn't logically right.
>

Except there is no guarantee that false sharing was reduced. cmpxchg is
still used, except with the byte as the comparison for the old value,
and in some cases that width will still be 32-bit for the exchange. It
would be up to each architecture to see if that can be translated to a
better instruction, but it may not even matter. As the instruction will
be prefixed with the lock prefix, the bus will be locked, and bus
locking is probably on the cache line boundary, so there is a collision
anyway while the atomic update takes place.

End result -- reducing false sharing in this case is not guaranteed to
help performance and may not be detectable when it's a low-frequency
operation, but the code will behave differently depending on the
architecture and CPU family.

Your justification patch measured the number of times a cmpxchg was
retried, but it did not track how many pageblock updates there were or
how many potentially collided. As the workload is uncontrolled with
respect to pageblock updates, you do not know if the difference in
retries is due to a real reduction in collisions or a difference in the
number of pageblock updates that potentially collide. Either way,
because the frequency of the operation was so low relative to your
target load, any difference in performance would be indistinguishable
from noise.

I don't think it's worth the churn in this case for an impact that will
be very difficult to detect and variable across architectures and CPU
families.

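A rough illustration of the point above about exchange width: on an
architecture without a native byte-wide cmpxchg, the byte
compare-and-exchange is typically emulated on top of a 32-bit cmpxchg,
so the exchange still operates on (and contends for) the containing
word. This is a hedged, little-endian-only sketch in C11 atomics, not
code from any particular architecture; the function name
cmpxchg_byte_in_word is made up for the example.

#include <stddef.h>
#include <stdint.h>
#include <stdatomic.h>

/*
 * Emulate a byte-wide cmpxchg on byte 'n' of a bitmap stored as 32-bit
 * words.  Returns the byte value found at the location, as a real
 * cmpxchg would.  Byte 0 of a word is assumed to be its least
 * significant byte (little endian); big-endian handling is omitted.
 */
static uint8_t cmpxchg_byte_in_word(_Atomic uint32_t *words, size_t n,
				    uint8_t old, uint8_t newval)
{
	_Atomic uint32_t *wp = &words[n / 4];
	unsigned int shift = (n % 4) * 8;
	uint32_t mask = (uint32_t)0xff << shift;
	uint32_t cur = atomic_load(wp);

	for (;;) {
		uint8_t seen = (cur & mask) >> shift;

		if (seen != old)
			return seen;	/* compare failed, nothing stored */

		uint32_t desired = (cur & ~mask) | ((uint32_t)newval << shift);

		/* The exchange itself is still word-wide. */
		if (atomic_compare_exchange_weak(wp, &cur, desired))
			return old;
		/* 'cur' now holds the fresh word value; loop and retry. */
	}
}
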
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8379432f4f2f..be676e659fb7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -437,7 +437,7 @@ struct zone {
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
 	 * In SPARSEMEM, this map is stored in struct mem_section
 	 */
-	unsigned long		*pageblock_flags;
+	unsigned char		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
@@ -1159,7 +1159,7 @@ struct mem_section_usage {
 	DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
 #endif
 	/* See declaration of similar field in struct zone */
-	unsigned long pageblock_flags[0];
+	unsigned char pageblock_flags[0];
 };
 
 void subsection_map_init(unsigned long pfn, unsigned long nr_pages);
@@ -1212,7 +1212,7 @@ struct mem_section {
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
-static inline unsigned long *section_to_usemap(struct mem_section *ms)
+static inline unsigned char *section_to_usemap(struct mem_section *ms)
 {
 	return ms->usage->pageblock_flags;
 }
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index fff52ad370c1..d189441568eb 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -54,7 +54,7 @@ enum pageblock_bits {
 /* Forward declaration */
 struct page;
 
-unsigned long get_pfnblock_flags_mask(struct page *page,
+unsigned char get_pfnblock_flags_mask(struct page *page,
 				unsigned long pfn,
 				unsigned long mask);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fab5e97dc9ca..81e96d4d9c42 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -445,7 +445,7 @@ static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
 #endif
 
 /* Return a pointer to the bitmap storing bits affecting a block of pages */
-static inline unsigned long *get_pageblock_bitmap(struct page *page,
+static inline unsigned char *get_pageblock_bitmap(struct page *page,
 							unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
@@ -474,24 +474,24 @@ static inline int pfn_to_bitidx(struct page *page, unsigned long pfn)
  * Return: pageblock_bits flags
  */
 static __always_inline
-unsigned long __get_pfnblock_flags_mask(struct page *page,
+unsigned char __get_pfnblock_flags_mask(struct page *page,
 					unsigned long pfn,
 					unsigned long mask)
 {
-	unsigned long *bitmap;
-	unsigned long bitidx, word_bitidx;
-	unsigned long word;
+	unsigned char *bitmap;
+	unsigned long bitidx, byte_bitidx;
+	unsigned char byte;
 
 	bitmap = get_pageblock_bitmap(page, pfn);
 	bitidx = pfn_to_bitidx(page, pfn);
-	word_bitidx = bitidx / BITS_PER_LONG;
-	bitidx &= (BITS_PER_LONG-1);
+	byte_bitidx = bitidx / BITS_PER_BYTE;
+	bitidx &= (BITS_PER_BYTE-1);
 
-	word = bitmap[word_bitidx];
-	return (word >> bitidx) & mask;
+	byte = bitmap[byte_bitidx];
+	return (byte >> bitidx) & mask;
 }
 
-unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
+unsigned char get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
 					unsigned long mask)
 {
 	return __get_pfnblock_flags_mask(page, pfn, mask);
@@ -513,29 +513,29 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 					unsigned long pfn,
 					unsigned long mask)
 {
-	unsigned long *bitmap;
-	unsigned long bitidx, word_bitidx;
-	unsigned long old_word, word;
+	unsigned char *bitmap;
+	unsigned long bitidx, byte_bitidx;
+	unsigned char old_byte, byte;
 
 	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 	BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));
 
 	bitmap = get_pageblock_bitmap(page, pfn);
 	bitidx = pfn_to_bitidx(page, pfn);
-	word_bitidx = bitidx / BITS_PER_LONG;
-	bitidx &= (BITS_PER_LONG-1);
+	byte_bitidx = bitidx / BITS_PER_BYTE;
+	bitidx &= (BITS_PER_BYTE-1);
 
 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
 
 	mask <<= bitidx;
 	flags <<= bitidx;
 
-	word = READ_ONCE(bitmap[word_bitidx]);
+	byte = READ_ONCE(bitmap[byte_bitidx]);
 	for (;;) {
-		old_word = cmpxchg(&bitmap[word_bitidx], word, (word & ~mask) | flags);
-		if (word == old_word)
+		old_byte = cmpxchg(&bitmap[byte_bitidx], byte, (byte & ~mask) | flags);
+		if (byte == old_byte)
 			break;
-		word = old_word;
+		byte = old_byte;
 	}
 }
 
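For readers following the index arithmetic the patch changes, here is a
small standalone program (an editorial sketch, not kernel code; the
BITS_PER_LONG value assumes a 64-bit machine) showing which array
element and bit offset a pageblock's 4 flag bits land in under the old
long-based and new char-based layouts, and hence how many pageblocks
share one cmpxchg target: 16 per long on 64-bit (8 on 32-bit) versus 2
per char.

#include <stdio.h>

#define NR_PAGEBLOCK_BITS	4
#define BITS_PER_BYTE		8
#define BITS_PER_LONG		64	/* assuming a 64-bit machine */

int main(void)
{
	for (unsigned long block = 0; block < 4; block++) {
		unsigned long bitidx = block * NR_PAGEBLOCK_BITS;

		/* element index and in-element offset, old vs new scheme */
		printf("pageblock %lu: long[%lu] bit %2lu  vs  char[%lu] bit %lu\n",
		       block,
		       bitidx / BITS_PER_LONG, bitidx % BITS_PER_LONG,
		       bitidx / BITS_PER_BYTE, bitidx % BITS_PER_BYTE);
	}
	return 0;
}
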
pageblock_flags is used as long. Since every pageblock's flags are just 4
bits, a 'long' packs the flags of 8 (on 32-bit machines) or 16 pageblocks,
so setting one pageblock's flags has to sync in cmpxchg with 7 or 15 other
pageblocks' flags. That can cause long waits for the sync.

If we change the pageblock_flags variable to char, we can use a char-size
cmpxchg, which only syncs with 2 pageblocks' flags. That relieves the
false sharing in cmpxchg.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h          |  6 +++---
 include/linux/pageblock-flags.h |  2 +-
 mm/page_alloc.c                 | 38 +++++++++++++++++++-------------------
 3 files changed, 23 insertions(+), 23 deletions(-)