Message ID | 1597549677-7480-2-git-send-email-alex.shi@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] mm/pageblock: mitigation cmpxchg false sharing in pageblock flags | expand |
On Sun, Aug 16, 2020 at 11:47:57AM +0800, Alex Shi wrote: > Current pageblock_flags is only 4 bits, so it has to share a char size > in cmpxchg when get set, the false sharing cause perf drop. > > If we incrase the bits up to 8, false sharing would gone in cmpxchg. and > the only cost is half char per pageblock, which is half char per 128MB > on x86, 4 chars in 1 GB. I don't believe this patch has that effect, mostly because it still does cmpxchg() on words instead of bytes. But which functions would benefit? It seems to me this cmpxchg() is only called from the set_pageblock_migratetype() morass of functions, none of which are called in hot paths as far as I can make out. So are you just reasoning by analogy with the previous patch where you have measured a performance improvement, or did you send the wrong patch, or did I overlook a hot path that calls one of the pageblock migration functions?
在 2020/8/16 下午12:17, Matthew Wilcox 写道: > On Sun, Aug 16, 2020 at 11:47:57AM +0800, Alex Shi wrote: >> Current pageblock_flags is only 4 bits, so it has to share a char size >> in cmpxchg when get set, the false sharing cause perf drop. >> >> If we incrase the bits up to 8, false sharing would gone in cmpxchg. and >> the only cost is half char per pageblock, which is half char per 128MB >> on x86, 4 chars in 1 GB. > > I don't believe this patch has that effect, mostly because it still does > cmpxchg() on words instead of bytes. Hi Matthew, Thank a lot for comments! Sorry, I must overlook sth, would you like point out why the cmpxchg is still on words after patch 1 applied? > > But which functions would benefit? It seems to me this cmpxchg() is > only called from the set_pageblock_migratetype() morass of functions, > none of which are called in hot paths as far as I can make out. > > So are you just reasoning by analogy with the previous patch where you > have measured a performance improvement, or did you send the wrong patch, > or did I overlook a hot path that calls one of the pageblock migration > functions? > Uh, I am reading compaction.c and found the following commit introduced test_and_set_skip under a lock. It looks like the pagelock_flags setting has false sharing in cmpxchg. but I have no valid data on this yet. Thanks Alex e380bebe4771548 mm, compaction: keep migration source private to a single compaction instance if (!locked) { locked = compact_trylock_irqsave(zone_lru_lock(zone), &flags, cc); - if (!locked) + + /* Allow future scanning if the lock is contended */ + if (!locked) { + clear_pageblock_skip(page); break; + } + + /* Try get exclusive access under lock */ + if (!skip_updated) { + skip_updated = true; + if (test_and_set_skip(cc, page, low_pfn)) + goto isolate_abort; + }
On Sun, Aug 16, 2020 at 7:11 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: > > > > 在 2020/8/16 下午12:17, Matthew Wilcox 写道: > > On Sun, Aug 16, 2020 at 11:47:57AM +0800, Alex Shi wrote: > >> Current pageblock_flags is only 4 bits, so it has to share a char size > >> in cmpxchg when get set, the false sharing cause perf drop. > >> > >> If we incrase the bits up to 8, false sharing would gone in cmpxchg. and > >> the only cost is half char per pageblock, which is half char per 128MB > >> on x86, 4 chars in 1 GB. > > > > I don't believe this patch has that effect, mostly because it still does > > cmpxchg() on words instead of bytes. > > Hi Matthew, > > Thank a lot for comments! > > Sorry, I must overlook sth, would you like point out why the cmpxchg is still > on words after patch 1 applied? > I would take it one step further. You still have false sharing as the pageblocks bits still occupy the same cacheline so you are going to see them cache bouncing regardless. What it seems like you are attempting to address is the fact that multiple threads could all be attempting to update the same long value. As I pointed out for the migrate type it seems to be protected by the zone lock, but for compaction the skip bit doesn't have the same protection as there are some threads using the zone lock and others using the LRU lock. I'm still not sure it makes much of a difference though. > > > > But which functions would benefit? It seems to me this cmpxchg() is > > only called from the set_pageblock_migratetype() morass of functions, > > none of which are called in hot paths as far as I can make out. > > > > So are you just reasoning by analogy with the previous patch where you > > have measured a performance improvement, or did you send the wrong patch, > > or did I overlook a hot path that calls one of the pageblock migration > > functions? > > > > Uh, I am reading compaction.c and found the following commit introduced > test_and_set_skip under a lock. It looks like the pagelock_flags setting > has false sharing in cmpxchg. but I have no valid data on this yet. > > Thanks > Alex > > e380bebe4771548 mm, compaction: keep migration source private to a single compaction instance > > if (!locked) { > locked = compact_trylock_irqsave(zone_lru_lock(zone), > &flags, cc); > - if (!locked) > + > + /* Allow future scanning if the lock is contended */ > + if (!locked) { > + clear_pageblock_skip(page); > break; > + } > + > + /* Try get exclusive access under lock */ > + if (!skip_updated) { > + skip_updated = true; > + if (test_and_set_skip(cc, page, low_pfn)) > + goto isolate_abort; > + } > I'm not sure that is a good grounds for doubling the size of the pageblock flags. If you look further down in the code there are bits that are setting these bits without taking the lock. The assumption here is that by taking the lock the test_and_set_skip will be performed atomically since another thread cannot perform that while the zone lock is held. If you look in the function itself it only does anything if the skip bits are checked and if the page is the first page in the pageblock. I think you might be confusing some of my earlier comments. I still believe the 3% regression you reported with my patch is not directly related to the test_and_set_skip as the test you ran seems unlikely to trigger compaction. However with that said one of the advantages of using the locked section to perform these types of tests is that it reduces the number of times the test is run since it will only be on the first unlocked page in any batch of pages and the first page in the pageblock is always going to be handled without the lock held since it is the first page processed. Until we can get a test up such as thpscale that does a good job of stressing the compaction code I don't think we can rely on just observations to say if this is an improvement or not. Thanks. - Alex
在 2020/8/16 下午11:56, Alexander Duyck 写道: > On Sun, Aug 16, 2020 at 7:11 AM Alex Shi <alex.shi@linux.alibaba.com> wrote: >> >> >> >> 在 2020/8/16 下午12:17, Matthew Wilcox 写道: >>> On Sun, Aug 16, 2020 at 11:47:57AM +0800, Alex Shi wrote: >>>> Current pageblock_flags is only 4 bits, so it has to share a char size >>>> in cmpxchg when get set, the false sharing cause perf drop. >>>> >>>> If we incrase the bits up to 8, false sharing would gone in cmpxchg. and >>>> the only cost is half char per pageblock, which is half char per 128MB >>>> on x86, 4 chars in 1 GB. >>> >>> I don't believe this patch has that effect, mostly because it still does >>> cmpxchg() on words instead of bytes. >> >> Hi Matthew, >> >> Thank a lot for comments! >> >> Sorry, I must overlook sth, would you like point out why the cmpxchg is still >> on words after patch 1 applied? >> > > I would take it one step further. You still have false sharing as the > pageblocks bits still occupy the same cacheline so you are going to > see them cache bouncing regardless. Right, there 2 level false sharing here, cacheline and cmpxchg comparsion range. this patch could fix the cmpxchg level with a very cheap price. the cacheline size is too huge to resovle here. > > What it seems like you are attempting to address is the fact that > multiple threads could all be attempting to update the same long > value. As I pointed out for the migrate type it seems to be protected > by the zone lock, but for compaction the skip bit doesn't have the > same protection as there are some threads using the zone lock and > others using the LRU lock. I'm still not sure it makes much of a > difference though. It looks with this patch, lock are not needed anymore on the flags. > >>> >>> But which functions would benefit? It seems to me this cmpxchg() is >>> only called from the set_pageblock_migratetype() morass of functions, >>> none of which are called in hot paths as far as I can make out. >>> >>> So are you just reasoning by analogy with the previous patch where you >>> have measured a performance improvement, or did you send the wrong patch, >>> or did I overlook a hot path that calls one of the pageblock migration >>> functions? >>> >> >> Uh, I am reading compaction.c and found the following commit introduced >> test_and_set_skip under a lock. It looks like the pagelock_flags setting >> has false sharing in cmpxchg. but I have no valid data on this yet. >> >> Thanks >> Alex >> >> e380bebe4771548 mm, compaction: keep migration source private to a single compaction instance >> >> if (!locked) { >> locked = compact_trylock_irqsave(zone_lru_lock(zone), >> &flags, cc); >> - if (!locked) >> + >> + /* Allow future scanning if the lock is contended */ >> + if (!locked) { >> + clear_pageblock_skip(page); >> break; >> + } >> + >> + /* Try get exclusive access under lock */ >> + if (!skip_updated) { >> + skip_updated = true; >> + if (test_and_set_skip(cc, page, low_pfn)) >> + goto isolate_abort; >> + } >> > > I'm not sure that is a good grounds for doubling the size of the > pageblock flags. If you look further down in the code there are bits > that are setting these bits without taking the lock. The assumption > here is that by taking the lock the test_and_set_skip will be > performed atomically since another thread cannot perform that while > the zone lock is held. If you look in the function itself it only does > anything if the skip bits are checked and if the page is the first > page in the pageblock. > > I think you might be confusing some of my earlier comments. I still > believe the 3% regression you reported with my patch is not directly > related to the test_and_set_skip as the test you ran seems unlikely to > trigger compaction. However with that said one of the advantages of > using the locked section to perform these types of tests is that it > reduces the number of times the test is run since it will only be on > the first unlocked page in any batch of pages and the first page in > the pageblock is always going to be handled without the lock held > since it is the first page processed. > > Until we can get a test up such as thpscale that does a good job of > stressing the compaction code I don't think we can rely on just > observations to say if this is an improvement or not. I still struggle on thpscale meaningful running. But if the patch is clearly right in theory. Do we have to hang on a benchmark result? Thanks Alex
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index d189441568eb..556fc2c0b392 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -25,7 +25,7 @@ enum pageblock_bits { * Assume the bits will always align on a word. If this assumption * changes then get/set pageblock needs updating. */ - NR_PAGEBLOCK_BITS + NR_PAGEBLOCK_BITS = 8 }; #ifdef CONFIG_HUGETLB_PAGE diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 142803d1f49b..01c3fb822732 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -521,7 +521,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags, unsigned long bitidx, word_bitidx; unsigned char old_word, word; - BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); + BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8); BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits)); bitmap = get_pageblock_bitmap(page, pfn);
Current pageblock_flags is only 4 bits, so it has to share a char size in cmpxchg when get set, the false sharing cause perf drop. If we incrase the bits up to 8, false sharing would gone in cmpxchg. and the only cost is half char per pageblock, which is half char per 128MB on x86, 4 chars in 1 GB. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/pageblock-flags.h | 2 +- mm/page_alloc.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)