Message ID | 20240529111904.2069608-4-david@redhat.com (mailing list archive)
---|---
State | New
Series | mm: page_type, zsmalloc and page_mapcount_reset()
On (24/05/29 13:19), David Hildenbrand wrote:
> We won't be able to support 256 KiB base pages, which is acceptable.
[..]
> +config HAVE_ZSMALLOC
> +	def_bool y
> +	depends on MMU
> +	depends on PAGE_SIZE_LESS_THAN_256KB # we want <= 64 KiB

Can't really say that I'm happy with this, but if mm-folks are
fine then okay.

FWIW
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
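To make the limitation concrete: with the first-object offset stored in the
lower 16 bits of page_type, 65535 is the largest representable offset, so base
pages above 64 KiB cannot be described. A minimal userspace sketch (not kernel
code; the mask name is borrowed from the patch later in this thread):

	#include <assert.h>
	#include <stdio.h>

	/* Lower 16 bits of page_type hold the first-object offset. */
	#define FIRST_OBJ_PAGE_TYPE_MASK 0xffffu

	int main(void)
	{
		/* Every offset within a 64 KiB page (0..65535) fits the mask... */
		unsigned int max_offset_64k = 64 * 1024 - 1;
		assert((max_offset_64k & ~FIRST_OBJ_PAGE_TYPE_MASK) == 0);

		/* ...but offsets in a 256 KiB page would be silently truncated. */
		unsigned int offset_256k = 0x10000;	/* valid in a 256 KiB page */
		printf("0x%x stored as 0x%x\n", offset_256k,
		       offset_256k & FIRST_OBJ_PAGE_TYPE_MASK);	/* stored as 0x0 */
		return 0;
	}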
On Thu, May 30, 2024 at 02:01:23PM +0900, Sergey Senozhatsky wrote:
> On (24/05/29 13:19), David Hildenbrand wrote:
> > We won't be able to support 256 KiB base pages, which is acceptable.
> [..]
> > +config HAVE_ZSMALLOC
> > +	def_bool y
> > +	depends on MMU
> > +	depends on PAGE_SIZE_LESS_THAN_256KB # we want <= 64 KiB
>
> Can't really say that I'm happy with this, but if mm-folks are
> fine then okay.

I have an idea ...

We use 6 of the bits in the top byte of the page_type to enumerate
a type (ie value 0x80-0xbf) and then the remaining 24 bits are
available.  It's actually more efficient:

$ ./scripts/bloat-o-meter prev.o .build-debian/mm/filemap.o
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-40 (-40)
Function                                     old     new   delta
__filemap_add_folio                         1102    1098      -4
filemap_unaccount_folio                      455     446      -9
replace_page_cache_folio                     474     447     -27
Total: Before=41258, After=41218, chg -0.10%

(that's all from PG_hugetlb)

before:
    1406:	8b 46 30             	mov    0x30(%rsi),%eax
		mapcount = atomic_read(&folio->_mapcount) + 1;
    1409:	83 c0 01             	add    $0x1,%eax
		if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
    140c:	83 f8 81             	cmp    $0xffffff81,%eax
    140f:	7d 6c                	jge    147d <filemap_unaccount_folio+0x8d>
    1411:	8b 43 30             	mov    0x30(%rbx),%eax
    1414:	25 00 08 00 f0       	and    $0xf0000800,%eax
    1419:	3d 00 00 00 f0       	cmp    $0xf0000000,%eax
    141e:	74 4e                	je     146e <filemap_unaccount_folio+0x7e>

after:
    1406:	8b 46 30             	mov    0x30(%rsi),%eax
		mapcount = atomic_read(&folio->_mapcount) + 1;
    1409:	83 c0 01             	add    $0x1,%eax
		if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
    140c:	83 f8 81             	cmp    $0xffffff81,%eax
    140f:	7d 63                	jge    1474 <filemap_unaccount_folio+0x84>
		if (folio_test_hugetlb(folio))
    1411:	80 7b 33 84          	cmpb   $0x84,0x33(%rbx)
    1415:	74 4e                	je     1465 <filemap_unaccount_folio+0x75>

so we go from "mov, and, cmp, je" to just "cmpb, je", which must surely
be faster to execute as well as being more compact in the I$ (6 bytes vs 15).

Anyway, not tested but this is the patch I used to generate the above.
More for comment than application.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5265b3434b9e..4129d04ac812 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -942,24 +942,24 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
  * mistaken for a page type value.
  */
 
-#define PAGE_TYPE_BASE	0xf0000000
-/* Reserve	0x0000007f to catch underflows of _mapcount */
-#define PAGE_MAPCOUNT_RESERVE	-128
-#define PG_buddy	0x00000080
-#define PG_offline	0x00000100
-#define PG_table	0x00000200
-#define PG_guard	0x00000400
-#define PG_hugetlb	0x00000800
-#define PG_slab		0x00001000
-
-#define PageType(page, flag)						\
-	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
-#define folio_test_type(folio, flag)					\
-	((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+/* Reserve	0x0000007f to catch underflows of _mapcount */
+#define PAGE_MAPCOUNT_RESERVE	-128
+
+#define PG_buddy	0x80
+#define PG_offline	0x81
+#define PG_table	0x82
+#define PG_guard	0x83
+#define PG_hugetlb	0x84
+#define PG_slab		0x85
+
+#define PageType(page, type)						\
+	(((page)->page_type >> 24) == type)
+#define folio_test_type(folio, type)					\
+	(((folio)->page.page_type >> 24) == type)
 
 static inline int page_type_has_type(unsigned int page_type)
 {
-	return (int)page_type < PAGE_MAPCOUNT_RESERVE;
+	return ((int)page_type < 0) && (page_type < 0xc0000000);
 }
 
 static inline int page_has_type(const struct page *page)
@@ -975,12 +975,12 @@ static __always_inline bool folio_test_##fname(const struct folio *folio)\
 static __always_inline void __folio_set_##fname(struct folio *folio)	\
 {									\
 	VM_BUG_ON_FOLIO(!folio_test_type(folio, 0), folio);		\
-	folio->page.page_type &= ~PG_##lname;				\
+	folio->page.page_type = PG_##lname << 24;			\
 }									\
 static __always_inline void __folio_clear_##fname(struct folio *folio)	\
 {									\
 	VM_BUG_ON_FOLIO(!folio_test_##fname(folio), folio);		\
-	folio->page.page_type |= PG_##lname;				\
+	folio->page.page_type = 0xffffffff;				\
 }
 
 #define PAGE_TYPE_OPS(uname, lname, fname)				\
@@ -992,12 +992,12 @@ static __always_inline int Page##uname(const struct page *page)	\
 static __always_inline void __SetPage##uname(struct page *page)	\
 {									\
 	VM_BUG_ON_PAGE(!PageType(page, 0), page);			\
-	page->page_type &= ~PG_##lname;					\
+	page->page_type = PG_##lname << 24;				\
 }									\
 static __always_inline void __ClearPage##uname(struct page *page)	\
 {									\
 	VM_BUG_ON_PAGE(!Page##uname(page), page);			\
-	page->page_type |= PG_##lname;					\
+	page->page_type = 0xffffffff;					\
 }
 
 /*
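A rough standalone illustration of the encoding Matthew describes -- the type
enumerated in the top byte (0x80-0xbf), 24 bits of payload below it. The PT_*
names and helpers are invented for this sketch, not the proposed kernel API:

	#include <stdbool.h>
	#include <stdio.h>

	/* Illustrative values only: the enumerated type lives in the top byte. */
	#define PT_BUDDY	0x80
	#define PT_HUGETLB	0x84

	/* A type is present iff the top byte is in 0x80..0xbf. */
	static bool page_type_has_type(unsigned int page_type)
	{
		return (int)page_type < 0 && page_type < 0xc0000000u;
	}

	static bool page_type_is(unsigned int page_type, unsigned char type)
	{
		return (page_type >> 24) == type;	/* the single cmpb above */
	}

	int main(void)
	{
		unsigned int pt = ((unsigned int)PT_HUGETLB << 24) | 0x1234;

		printf("has a type: %d\n", page_type_has_type(pt));		/* 1 */
		printf("hugetlb:    %d\n", page_type_is(pt, PT_HUGETLB));	/* 1 */
		printf("buddy:      %d\n", page_type_is(pt, PT_BUDDY));	/* 0 */
		printf("payload:    0x%x\n", pt & 0x00ffffff);			/* 0x1234 */
		/* A plain mapcount (e.g. _mapcount == -1) is not a type. */
		printf("mapcount:   %d\n", page_type_has_type(0xffffffffu));	/* 0 */
		return 0;
	}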
On 31.05.24 16:27, Matthew Wilcox wrote:
> On Thu, May 30, 2024 at 02:01:23PM +0900, Sergey Senozhatsky wrote:
>> On (24/05/29 13:19), David Hildenbrand wrote:
>>> We won't be able to support 256 KiB base pages, which is acceptable.
>> [..]
>>> +config HAVE_ZSMALLOC
>>> +	def_bool y
>>> +	depends on MMU
>>> +	depends on PAGE_SIZE_LESS_THAN_256KB # we want <= 64 KiB
>>
>> Can't really say that I'm happy with this, but if mm-folks are
>> fine then okay.
> 
> I have an idea ...
> 
> We use 6 of the bits in the top byte of the page_type to enumerate
> a type (ie value 0x80-0xbf) and then the remaining 24 bits are
> available.  It's actually more efficient:
> 
> $ ./scripts/bloat-o-meter prev.o .build-debian/mm/filemap.o
> add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-40 (-40)
> Function                                     old     new   delta
> __filemap_add_folio                         1102    1098      -4
> filemap_unaccount_folio                      455     446      -9
> replace_page_cache_folio                     474     447     -27
> Total: Before=41258, After=41218, chg -0.10%
> 
> (that's all from PG_hugetlb)
> 
> before:
>     1406:	8b 46 30             	mov    0x30(%rsi),%eax
> 		mapcount = atomic_read(&folio->_mapcount) + 1;
>     1409:	83 c0 01             	add    $0x1,%eax
> 		if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
>     140c:	83 f8 81             	cmp    $0xffffff81,%eax
>     140f:	7d 6c                	jge    147d <filemap_unaccount_folio+0x8d>
>     1411:	8b 43 30             	mov    0x30(%rbx),%eax
>     1414:	25 00 08 00 f0       	and    $0xf0000800,%eax
>     1419:	3d 00 00 00 f0       	cmp    $0xf0000000,%eax
>     141e:	74 4e                	je     146e <filemap_unaccount_folio+0x7e>
> 
> after:
>     1406:	8b 46 30             	mov    0x30(%rsi),%eax
> 		mapcount = atomic_read(&folio->_mapcount) + 1;
>     1409:	83 c0 01             	add    $0x1,%eax
> 		if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
>     140c:	83 f8 81             	cmp    $0xffffff81,%eax
>     140f:	7d 63                	jge    1474 <filemap_unaccount_folio+0x84>
> 		if (folio_test_hugetlb(folio))
>     1411:	80 7b 33 84          	cmpb   $0x84,0x33(%rbx)
>     1415:	74 4e                	je     1465 <filemap_unaccount_folio+0x75>
> 
> so we go from "mov, and, cmp, je" to just "cmpb, je", which must surely
> be faster to execute as well as being more compact in the I$ (6 bytes vs 15).
> 
> Anyway, not tested but this is the patch I used to generate the above.
> More for comment than application.

Right, it's likely very similar to my previous proposal to use 8 bit
(uint8_t) for the type.

https://lore.kernel.org/all/00ba1dff-7c05-46e8-b0d9-a78ac1cfc198@redhat.com/

I would prefer if we would do that separately; unless someone is able to
raise why we care about zram + 256KiB that much right now. (claim: we don't)

> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5265b3434b9e..4129d04ac812 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -942,24 +942,24 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
>   * mistaken for a page type value.
>   */
> 
> -#define PAGE_TYPE_BASE	0xf0000000
> -/* Reserve	0x0000007f to catch underflows of _mapcount */
> -#define PAGE_MAPCOUNT_RESERVE	-128
> -#define PG_buddy	0x00000080
> -#define PG_offline	0x00000100
> -#define PG_table	0x00000200
> -#define PG_guard	0x00000400
> -#define PG_hugetlb	0x00000800
> -#define PG_slab		0x00001000
> -
> -#define PageType(page, flag)						\
> -	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> -#define folio_test_type(folio, flag)					\
> -	((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> +/* Reserve	0x0000007f to catch underflows of _mapcount */
> +#define PAGE_MAPCOUNT_RESERVE	-128
> +
> +#define PG_buddy	0x80
> +#define PG_offline	0x81
> +#define PG_table	0x82
> +#define PG_guard	0x83
> +#define PG_hugetlb	0x84
> +#define PG_slab		0x85

Hoping we can stop calling that PG_ ...
On Fri, 31 May 2024 16:32:04 +0200 David Hildenbrand <david@redhat.com> wrote:

> On 31.05.24 16:27, Matthew Wilcox wrote:
> > On Thu, May 30, 2024 at 02:01:23PM +0900, Sergey Senozhatsky wrote:
> >     1409:	83 c0 01             	add    $0x1,%eax
> > 		if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
> >     140c:	83 f8 81             	cmp    $0xffffff81,%eax
> >     140f:	7d 63                	jge    1474 <filemap_unaccount_folio+0x84>
> > 		if (folio_test_hugetlb(folio))
> >     1411:	80 7b 33 84          	cmpb   $0x84,0x33(%rbx)
> >     1415:	74 4e                	je     1465 <filemap_unaccount_folio+0x75>
> > 
> > so we go from "mov, and, cmp, je" to just "cmpb, je", which must surely
> > be faster to execute as well as being more compact in the I$ (6 bytes vs 15).
> > 
> > Anyway, not tested but this is the patch I used to generate the above.
> > More for comment than application.
> 
> Right, it's likely very similar to my previous proposal to use 8 bit
> (uint8_t) for the type.
> 
> https://lore.kernel.org/all/00ba1dff-7c05-46e8-b0d9-a78ac1cfc198@redhat.com/
> 
> I would prefer if we would do that separately; unless someone is able to
> raise why we care about zram + 256KiB that much right now. (claim: we don't)
> 

iow, "this is ok for now", yes?

I'd like to push this into mm-"stable" later this week.  Speak now or
forever hold your pieces.
On (24/06/25 15:33), Andrew Morton wrote:
> On Fri, 31 May 2024 16:32:04 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
> > On 31.05.24 16:27, Matthew Wilcox wrote:
> > > On Thu, May 30, 2024 at 02:01:23PM +0900, Sergey Senozhatsky wrote:
> > >     1409:	83 c0 01             	add    $0x1,%eax
> > > 		if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
> > >     140c:	83 f8 81             	cmp    $0xffffff81,%eax
> > >     140f:	7d 63                	jge    1474 <filemap_unaccount_folio+0x84>
> > > 		if (folio_test_hugetlb(folio))
> > >     1411:	80 7b 33 84          	cmpb   $0x84,0x33(%rbx)
> > >     1415:	74 4e                	je     1465 <filemap_unaccount_folio+0x75>
> > > 
> > > so we go from "mov, and, cmp, je" to just "cmpb, je", which must surely
> > > be faster to execute as well as being more compact in the I$ (6 bytes vs 15).
> > > 
> > > Anyway, not tested but this is the patch I used to generate the above.
> > > More for comment than application.
> > 
> > Right, it's likely very similar to my previous proposal to use 8 bit
> > (uint8_t) for the type.
> > 
> > https://lore.kernel.org/all/00ba1dff-7c05-46e8-b0d9-a78ac1cfc198@redhat.com/
> > 
> > I would prefer if we would do that separately; unless someone is able to
> > raise why we care about zram + 256KiB that much right now. (claim: we don't)
> 
> iow, "this is ok for now", yes?

Perhaps.  I'm not in position to claim that zram + 256KiB PAGE_SIZE is
irrelevant, but I'm also not in position to claim the opposite.

Matthew and David have ideas/proposals/patches to fix it should 256KiB
PAGE_SIZE become an issue.
On 26.06.24 06:41, Sergey Senozhatsky wrote:
> On (24/06/25 15:33), Andrew Morton wrote:
>> On Fri, 31 May 2024 16:32:04 +0200 David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 31.05.24 16:27, Matthew Wilcox wrote:
>>>> On Thu, May 30, 2024 at 02:01:23PM +0900, Sergey Senozhatsky wrote:
>>>>     1409:	83 c0 01             	add    $0x1,%eax
>>>> 		if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
>>>>     140c:	83 f8 81             	cmp    $0xffffff81,%eax
>>>>     140f:	7d 63                	jge    1474 <filemap_unaccount_folio+0x84>
>>>> 		if (folio_test_hugetlb(folio))
>>>>     1411:	80 7b 33 84          	cmpb   $0x84,0x33(%rbx)
>>>>     1415:	74 4e                	je     1465 <filemap_unaccount_folio+0x75>
>>>>
>>>> so we go from "mov, and, cmp, je" to just "cmpb, je", which must surely
>>>> be faster to execute as well as being more compact in the I$ (6 bytes vs 15).
>>>>
>>>> Anyway, not tested but this is the patch I used to generate the above.
>>>> More for comment than application.
>>>
>>> Right, it's likely very similar to my previous proposal to use 8 bit
>>> (uint8_t) for the type.
>>>
>>> https://lore.kernel.org/all/00ba1dff-7c05-46e8-b0d9-a78ac1cfc198@redhat.com/
>>>
>>> I would prefer if we would do that separately; unless someone is able to
>>> raise why we care about zram + 256KiB that much right now. (claim: we don't)
>>>
>>
>> iow, "this is ok for now", yes?
> 
> Perhaps. I'm not in position to claim that zram + 256KiB PAGE_SIZE is
> irrelevant, but I'm also not in position to claim the opposite.
> 
> Matthew and David have ideas/proposals/patches to fix it should 256KiB
> PAGE_SIZE become an issue.

Yes, let's keep it simple for now. There are various ways to handle that
if there is really the need to.

256KiB is not particularly common (quite the opposite, I would claim),
and a simple fix would be dedicating 18 instead of 16 bit.

Long-term, we should handle it more cleanly though, and there are also
various ways forward (store offset in page, separate allocation like
memdesc for metadata, etc.).

Messing with turning page types from flags into values should be a
separate effort, because it requires more care (e.g.,
PAGE_SLAB_MAPCOUNT_VALUE).
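For the record, the "dedicating 18 instead of 16 bit" fallback David mentions
is essentially a one-mask change; a hypothetical userspace sketch, with the
mask width and names invented for illustration:

	#include <assert.h>
	#include <stdint.h>

	/*
	 * Hypothetical 18-bit variant: 2^18 == 256 Ki, so offsets into a
	 * 256 KiB page would fit.
	 */
	#define FIRST_OBJ_OFFSET_MASK	0x3ffffu	/* lower 18 bits */

	static uint32_t set_first_obj_offset18(uint32_t page_type, uint32_t offset)
	{
		page_type &= ~FIRST_OBJ_OFFSET_MASK;
		return page_type | (offset & FIRST_OBJ_OFFSET_MASK);
	}

	int main(void)
	{
		/* The largest offset in a 256 KiB page round-trips unharmed. */
		uint32_t pt = set_first_obj_offset18(~0u, 256 * 1024 - 1);
		assert((pt & FIRST_OBJ_OFFSET_MASK) == 256 * 1024 - 1);
		return 0;
	}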
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 6aea609b795c2..40e035468de22 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -2,6 +2,7 @@
 config ZRAM
 	tristate "Compressed RAM block device support"
 	depends on BLOCK && SYSFS && MMU
+	depends on HAVE_ZSMALLOC
 	select ZSMALLOC
 	help
 	  Creates virtual block devices called /dev/zramX (X = 0, 1, ...).
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f060db808102c..3afcbfbb379ea 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -957,6 +957,7 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
 #define PG_guard	0x08000000
 #define PG_hugetlb	0x04000000
 #define PG_slab		0x02000000
+#define PG_zsmalloc	0x01000000
 #define PAGE_MAPCOUNT_RESERVE	(~0x0000ffff)
 
 #define PageType(page, flag)						\
@@ -1072,6 +1073,8 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
 FOLIO_TEST_FLAG_FALSE(hugetlb)
 #endif
 
+PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
+
 /**
  * PageHuge - Determine if the page belongs to hugetlbfs
  * @page: The page to test.
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a541..67dc18c94448d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -128,7 +128,7 @@ config ZSWAP_COMPRESSOR_DEFAULT
 choice
 	prompt "Default allocator"
 	depends on ZSWAP
-	default ZSWAP_ZPOOL_DEFAULT_ZSMALLOC if MMU
+	default ZSWAP_ZPOOL_DEFAULT_ZSMALLOC if HAVE_ZSMALLOC
 	default ZSWAP_ZPOOL_DEFAULT_ZBUD
 	help
 	  Selects the default allocator for the compressed cache for
@@ -154,6 +154,7 @@ config ZSWAP_ZPOOL_DEFAULT_Z3FOLD
 
 config ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
 	bool "zsmalloc"
+	depends on HAVE_ZSMALLOC
 	select ZSMALLOC
 	help
 	  Use the zsmalloc allocator as the default allocator.
@@ -186,10 +187,15 @@ config Z3FOLD
 	  page. It is a ZBUD derivative so the simplicity and determinism are
 	  still there.
 
+config HAVE_ZSMALLOC
+	def_bool y
+	depends on MMU
+	depends on PAGE_SIZE_LESS_THAN_256KB # we want <= 64 KiB
+
 config ZSMALLOC
 	tristate
 	prompt "N:1 compression allocator (zsmalloc)" if ZSWAP
-	depends on MMU
+	depends on HAVE_ZSMALLOC
 	help
 	  zsmalloc is a slab-based memory allocator designed to store
 	  pages of various compression levels efficiently. It achieves
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a2a5866473bb8..44e0171d60036 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -20,7 +20,8 @@
  *	page->index: links together all component pages of a zspage
  *		For the huge page, this is always 0, so we use this field
  *		to store handle.
- *	page->page_type: first object offset in a subpage of zspage
+ *	page->page_type: PG_zsmalloc, lower 16 bit locate the first object
+ *		offset in a subpage of a zspage
  *
  * Usage of struct page flags:
  *	PG_private: identifies the first component page
@@ -450,14 +451,28 @@ static inline struct page *get_first_page(struct zspage *zspage)
 	return first_page;
 }
 
+#define FIRST_OBJ_PAGE_TYPE_MASK	0xffff
+
+static inline void reset_first_obj_offset(struct page *page)
+{
+	VM_WARN_ON_ONCE(!PageZsmalloc(page));
+	page->page_type |= FIRST_OBJ_PAGE_TYPE_MASK;
+}
+
 static inline unsigned int get_first_obj_offset(struct page *page)
 {
-	return page->page_type;
+	VM_WARN_ON_ONCE(!PageZsmalloc(page));
+	return page->page_type & FIRST_OBJ_PAGE_TYPE_MASK;
 }
 
 static inline void set_first_obj_offset(struct page *page, unsigned int offset)
 {
-	page->page_type = offset;
+	/* With 16 bit available, we can support offsets into 64 KiB pages. */
+	BUILD_BUG_ON(PAGE_SIZE > SZ_64K);
+	VM_WARN_ON_ONCE(!PageZsmalloc(page));
+	VM_WARN_ON_ONCE(offset & ~FIRST_OBJ_PAGE_TYPE_MASK);
+	page->page_type &= ~FIRST_OBJ_PAGE_TYPE_MASK;
+	page->page_type |= offset & FIRST_OBJ_PAGE_TYPE_MASK;
 }
 
 static inline unsigned int get_freeobj(struct zspage *zspage)
@@ -791,8 +806,9 @@ static void reset_page(struct page *page)
 	__ClearPageMovable(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);
-	page_mapcount_reset(page);
 	page->index = 0;
+	reset_first_obj_offset(page);
+	__ClearPageZsmalloc(page);
 }
 
 static int trylock_zspage(struct zspage *zspage)
@@ -965,11 +981,13 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 		if (!page) {
 			while (--i >= 0) {
 				dec_zone_page_state(pages[i], NR_ZSPAGES);
+				__ClearPageZsmalloc(pages[i]);
 				__free_page(pages[i]);
 			}
 			cache_free_zspage(pool, zspage);
 			return NULL;
 		}
+		__SetPageZsmalloc(page);
 
 		inc_zone_page_state(page, NR_ZSPAGES);
 		pages[i] = page;
@@ -1754,6 +1772,9 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
 
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
+	/* We're committed, tell the world that this is a Zsmalloc page. */
+	__SetPageZsmalloc(newpage);
+
 	/* The page is locked, so this pointer must remain valid */
 	zspage = get_zspage(page);
 	pool = zspage->pool;
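To see the new helpers round-trip, a small userspace emulation of the hunks
above (struct page reduced to the one field that matters here; the behaviour
mirrors the patch, but the harness itself is invented):

	#include <assert.h>
	#include <stdint.h>

	#define PG_zsmalloc			0x01000000u
	#define FIRST_OBJ_PAGE_TYPE_MASK	0xffffu

	struct fake_page { uint32_t page_type; };	/* stand-in for struct page */

	int main(void)
	{
		struct fake_page page = { .page_type = ~0u };	/* _mapcount == -1 */

		/* __SetPageZsmalloc(): types are inverted flags, so clear the bit. */
		page.page_type &= ~PG_zsmalloc;

		/* set_first_obj_offset(page, 0x2a0): pack offset into low 16 bits. */
		page.page_type &= ~FIRST_OBJ_PAGE_TYPE_MASK;
		page.page_type |= 0x2a0 & FIRST_OBJ_PAGE_TYPE_MASK;

		/* get_first_obj_offset(): masking recovers the offset... */
		assert((page.page_type & FIRST_OBJ_PAGE_TYPE_MASK) == 0x2a0);
		/* ...while the type bit is untouched (still cleared == set). */
		assert((page.page_type & PG_zsmalloc) == 0);
		return 0;
	}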
Let's clean it up: use a proper page type and store our data (offset
into a page) in the lower 16 bit as documented.

We won't be able to support 256 KiB base pages, which is acceptable.
Teach Kconfig to handle that cleanly using a new CONFIG_HAVE_ZSMALLOC.

Based on this, we should do a proper "struct zsdesc" conversion, as
proposed in [1]. This removes the last _mapcount/page_type offender.

[1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/

Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/block/zram/Kconfig |  1 +
 include/linux/page-flags.h |  3 +++
 mm/Kconfig                 | 10 ++++++++--
 mm/zsmalloc.c              | 29 +++++++++++++++++++++++++----
 4 files changed, 37 insertions(+), 6 deletions(-)
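As a reader's aid, the page_type layout this patch ends up with, reconstructed
only from values visible in the hunks above:

	/*
	 * page_type after this patch.  Types are inverted flags, so
	 * __SetPageZsmalloc() *clears* PG_zsmalloc, and an untyped page
	 * reads as 0xffffffff (i.e. _mapcount == -1):
	 *
	 *   bits 31..16: type flag bits, PG_zsmalloc == 0x01000000 among
	 *                them; kept out of mapcount range by
	 *                PAGE_MAPCOUNT_RESERVE (~0x0000ffff)
	 *   bits 15..0:  type-private data -- for zsmalloc pages the first
	 *                object offset (FIRST_OBJ_PAGE_TYPE_MASK == 0xffff),
	 *                which is why PAGE_SIZE must stay <= 64 KiB
	 */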