Message ID | YSPwmNNuuQhXNToQ@casper.infradead.org (mailing list archive)
---|---
State | New
Series | [GIT,PULL] Memory folios for v5.15
On Mon, Aug 23, 2021 at 08:01:44PM +0100, Matthew Wilcox wrote:
> Hi Linus,
>
> I'm sending this pull request a few days before the merge window opens so you have time to think about it. I don't intend to make any further changes to the branch, so I've created the tag and signed it. It's been in Stephen's next tree for a few weeks with only minor problems (now addressed).
>
> The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems: some functions that take a struct page expect only a head page, while others expect the precise page containing a particular byte.
>
> This pull request converts just parts of the core MM and the page cache. For 5.16, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.17, multi-page folios should be ready.
>
> The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE.
>
> I'd like to thank all my reviewers who've offered review/ack tags:
>
> Christoph Hellwig <hch@lst.de>
> David Howells <dhowells@redhat.com>
> Jan Kara <jack@suse.cz>
> Jeff Layton <jlayton@kernel.org>
> Johannes Weiner <hannes@cmpxchg.org>

Just to clarify, I'm only on this list because I acked 3 smaller, independent memcg cleanup patches in this series. I have repeatedly expressed strong reservations over folios themselves.

The arguments for a better data interface between mm and filesystem in light of variable page sizes are plentiful and convincing. But from an MM point of view, it's far from clear where the delineation between the page and the folio is, and what the endgame is supposed to look like.

On one hand, the ambition appears to be to substitute folio for everything that could be a base page or a compound page, even inside core MM code. Since there are very few places in the MM code that expressly deal with tail pages in the first place, this amounts to a conversion of most MM code - including the LRU management, reclaim, rmap, migrate, swap, page fault code etc. - away from "the page".

However, this far exceeds the goal of a better mm-fs interface. And the value proposition of a full MM-internal conversion, including e.g. the less exposed anon page handling, is much more nebulous. It's been proposed to leave anon pages out, but IMO to keep that direction maintainable, the folio would have to be translated to a page quite early when entering MM code, rather than propagating it inward, in order to avoid huge, massively overlapping page and folio APIs.

It's also not clear to me that using the same abstraction for compound pages and the file cache object is future proof. It's evident from scalability issues in the allocator, reclaim, compaction, etc. that with current memory sizes and IO devices, we're hitting the limits of efficiently managing memory in 4k base pages per default. It's also clear that we'll continue to have a need for 4k cache granularity for quite a few workloads that work with large numbers of small files. I'm not sure how this could be resolved other than divorcing the idea of a (larger) base page from the idea of cache entries that can correspond, if necessary, to memory chunks smaller than a default page.

A longer thread on that can be found here:
https://lore.kernel.org/linux-fsdevel/YFja%2FLRC1NI6quL6@cmpxchg.org/

As an MM stakeholder, I don't think folios are the answer for MM code.
On Mon, Aug 23, 2021 at 2:25 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On one hand, the ambition appears to be to substitute folio for everything that could be a base page or a compound page even inside core MM code. Since there are very few places in the MM code that expressly deal with tail pages in the first place, this amounts to a conversion of most MM code - including the LRU management, reclaim, rmap, migrate, swap, page fault code etc. - away from "the page".

Yeah, honestly, I would have preferred to see this done the exact reverse way: make the rule be that "struct page" is always a head page, and anything that isn't a head page would be called something else.

Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.

That said, I see why Willy did it the way he did - it was easier to do it incrementally. But I do think it ends up with an end result that is kind of topsy-turvy, where the common "this is the core allocation" gets the odd "folio" name, and the simpler "page" name is for things that almost nobody should even care about.

I'd have personally preferred to call the head page just a "page", and other pages "subpage" or something like that. I think that would be much more intuitive than "folio/page".

              Linus
On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> On Mon, Aug 23, 2021 at 08:01:44PM +0100, Matthew Wilcox wrote:
>
> Just to clarify, I'm only on this list because I acked 3 smaller, independent memcg cleanup patches in this series. I have repeatedly expressed strong reservations over folios themselves.

I thought I'd addressed all your concerns. I'm sorry I misunderstood and did not intend to misrepresent your position.

> The arguments for a better data interface between mm and filesystem in light of variable page sizes are plentiful and convincing. But from an MM point of view, it's far from clear where the delineation between the page and the folio is, and what the endgame is supposed to look like.
>
> On one hand, the ambition appears to be to substitute folio for everything that could be a base page or a compound page even inside core MM code. Since there are very few places in the MM code that expressly deal with tail pages in the first place, this amounts to a conversion of most MM code - including the LRU management, reclaim, rmap, migrate, swap, page fault code etc. - away from "the page".

I would agree with all of those except the page fault code; I believe that needs to continue to work in terms of pages in order to support misaligned mappings.

> However, this far exceeds the goal of a better mm-fs interface. And the value proposition of a full MM-internal conversion, including e.g. the less exposed anon page handling, is much more nebulous. It's been proposed to leave anon pages out, but IMO to keep that direction maintainable, the folio would have to be translated to a page quite early when entering MM code, rather than propagating it inward, in order to avoid huge, massively overlapping page and folio APIs.

I only intend to leave anonymous memory out /for now/. My hope is that somebody else decides to work on it (and indeed Google have volunteered someone for the task).
> It's also not clear to me that using the same abstraction for compound pages and the file cache object is future proof. It's evident from scalability issues in the allocator, reclaim, compaction, etc. that with current memory sizes and IO devices, we're hitting the limits of efficiently managing memory in 4k base pages per default. It's also clear that we'll continue to have a need for 4k cache granularity for quite a few workloads that work with large numbers of small files. I'm not sure how this could be resolved other than divorcing the idea of a (larger) base page from the idea of cache entries that can correspond, if necessary, to memory chunks smaller than a default page.

That sounds to me exactly like folios, except for the naming. From the MM point of view, it's less churn to do it your way, but from the point of view of the rest of the kernel, there are going to be unexpected consequences. For example, btrfs didn't support page size != block size until just recently (and I'm not sure it's entirely fixed yet?)

And there's nobody working on your idea. At least nobody who has surfaced so far. The folio patch is here now.

Folios are also variable sized. For files which are small, we still only allocate 4kB to cache them. If the file is accessed entirely randomly, we only allocate 4kB chunks at a time. We only allocate larger folios when we think there is an advantage to doing so.

This benefit is retained if someone does come along to change PAGE_SIZE to 16KiB (or whatever). Folios can still be composed of multiple pages, no matter what the PAGE_SIZE is.

> A longer thread on that can be found here:
> https://lore.kernel.org/linux-fsdevel/YFja%2FLRC1NI6quL6@cmpxchg.org/
>
> As an MM stakeholder, I don't think folios are the answer for MM code.
On Mon, Aug 23, 2021 at 03:06:08PM -0700, Linus Torvalds wrote:
> On Mon, Aug 23, 2021 at 2:25 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On one hand, the ambition appears to be to substitute folio for everything that could be a base page or a compound page even inside core MM code. Since there are very few places in the MM code that expressly deal with tail pages in the first place, this amounts to a conversion of most MM code - including the LRU management, reclaim, rmap, migrate, swap, page fault code etc. - away from "the page".
>
> Yeah, honestly, I would have preferred to see this done the exact reverse way: make the rule be that "struct page" is always a head page, and anything that isn't a head page would be called something else.
>
> Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.
>
> That said, I see why Willy did it the way he did - it was easier to do it incrementally. But I do think it ends up with an end result that is kind of topsy-turvy, where the common "this is the core allocation" gets the odd "folio" name, and the simpler "page" name is for things that almost nobody should even care about.
>
> I'd have personally preferred to call the head page just a "page", and other pages "subpage" or something like that. I think that would be much more intuitive than "folio/page".

I'm trying to figure out how we can get there. To start, define

	struct mmu_page {
		union {
			struct page;
			struct {
				unsigned long flags;
				unsigned long compound_head;
				unsigned char compound_dtor;
				unsigned char compound_order;
				atomic_t compound_mapcount;
				unsigned int compound_nr;
			};
		};
	};

Now memmap becomes an array of struct mmu_page instead of struct page. We also need to sort out the type returned from the page cache APIs. Right now, it returns (effectively) the mmu_page. I think it _should_ return the (arbitrary order) struct page, but auditing every caller of every function is an inhuman job. I can't see how to get there from here without a ridiculous number of bugs. Maybe you can.
On Mon, Aug 23, 2021 at 03:06:08PM -0700, Linus Torvalds wrote:
> Yeah, honestly, I would have preferred to see this done the exact reverse way: make the rule be that "struct page" is always a head page, and anything that isn't a head page would be called something else.
>
> Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.

Actually, I think this is an advantage for folios. Maybe not for the core MM, which has always been _fairly_ careful to deal with compound pages properly. But for filesystem people, device drivers, etc: when people see a struct page, they think it's PAGE_SIZE bytes in size. And they're usually right, which is what makes things like THP so prone to "Oops, we missed a spot" bugs.

By contrast, if you see something which takes a struct folio and then works on PAGE_SIZE bytes, that's a sign there's something funny going on. There are a few of those still; for example kmap() can only map PAGE_SIZE bytes at a time.
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Yeah, honestly, I would have preferred to see this done the exact reverse way: make the rule be that "struct page" is always a head page, and anything that isn't a head page would be called something else.
> ...
> That said, I see why Willy did it the way he did - it was easier to do it incrementally. But I do think it ends up with an end result that is kind of topsy-turvy, where the common "this is the core allocation" gets the odd "folio" name, and the simpler "page" name is for things that almost nobody should even care about.

From a filesystem pov, it may be better done Willy's way. There's a lot of assumption that "struct page" corresponds to a PAGE_SIZE chunk of RAM and is equivalent to a hardware page, so using something other than struct page seems a better idea. It's easier to avoid the assumption if it's called something different.

We're dealing with variable-sized clusters of things that, in the future, could be, say, a combination of typical 4K pages and higher-order pages (depending on what the arch supports), so I think "page" is the wrong name to use.

There are some pieces, kmap being a prime example, that might be tricky to make handle a transparently variable-sized multipage object, so careful auditing will likely be required if we do stick with "struct page".

Further, there's the problem that there are a *lot* of places where filesystems access struct page members directly, rather than going through helper functions - and all of these need to be fixed. This is much easier to manage if we can get the compiler to do the catching. Hiding them all within struct page would require a humongous single patch.

One question does spring to mind, though: do filesystems even need to know about hardware pages at all? They need to be able to access source data or a destination buffer, but that can be stitched together from disparate chunks that have nothing to do with pages (eg. iov_iter); they need access to the pagecache, and may need somewhere to cache pieces of information; and they need to be able to pass chunks of pagecache, data or bufferage to crypto (scatterlists) and I/O routines (bio, skbuff) - but can we hide "paginess" from filesystems?

The main point where this matters, at the moment, is, I think, mmap - but could more of that be handled transparently by the VM?

> Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.

That's mostly because no one uses the term... yet. I've got used to it in building on top of Willy's patches and have no problem with it - apart from the fact that I would expect something more like a plural or a collective noun ("sheaf" or "ream" maybe?) - but at least the name is similar in length to "page". And it's handy for grepping ;-)

> I'd have personally preferred to call the head page just a "page", and other pages "subpage" or something like that. I think that would be much more intuitive than "folio/page".

As previously stated, I think we need to leave "struct page" as meaning "hardware page" and build some other concept on top for aggregation/buffering.

David
On Tue, Aug 24, 2021 at 04:54:27PM +0100, David Howells wrote:
> One question does spring to mind, though: do filesystems even need to know about hardware pages at all? They need to be able to access source data or a destination buffer, but that can be stitched together from disparate chunks that have nothing to do with pages (eg. iov_iter); they need access to the pagecache, and may need somewhere to cache pieces of information; and they need to be able to pass chunks of pagecache, data or bufferage to crypto (scatterlists) and I/O routines (bio, skbuff) - but can we hide "paginess" from filesystems?
>
> The main point where this matters, at the moment, is, I think, mmap - but could more of that be handled transparently by the VM?

It really depends on the filesystem. I just audited adfs, for example, and there is literally nothing in there that cares about struct page. It passes its arguments from ->readpage and ->writepage to block_*_full_page(); it uses cont_write_begin() for its ->write_begin; and it uses __set_page_dirty_buffers for its ->set_page_dirty.

Then there are filesystems like UFS which use struct page extensively in their directory handling. And NFS, which uses struct page throughout. Partly there's just better infrastructure for block-based filesystems (which you're fixing), and partly NFS is trying to perform better than a filesystem which exists for compatibility with a long-dead OS.

> > Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.
>
> That's mostly because no one uses the term... yet. I've got used to it in building on top of Willy's patches and have no problem with it - apart from the fact that I would expect something more like a plural or a collective noun ("sheaf" or "ream" maybe?) - but at least the name is similar in length to "page".
>
> And it's handy for grepping ;-)

If the only thing standing between this patch and the merge is s/folio/ream/g, I will do that. All three options are equally greppable (except for 'ream' as a substring of dream, stream, preamble, scream, whereami, and typos for remain).
On Tue, Aug 24, 2021 at 11:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> If the only thing standing between this patch and the merge is
> s/folio/ream/g,

I really don't think that helps. All the book-binding analogies are only confusing.

If anything, I'd make things more explicit. Stupid and straightforward. Maybe just "struct head_page" or something like that. Name it by what it *is*, not by analogies. None of this cute/clever stuff.

I think making it obvious and descriptive would be the much better approach, not some clever "book binders call a collection of pages XYZ".

              Linus
On Tue, Aug 24, 2021 at 11:26 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> If anything, I'd make things more explicit. Stupid and
> straightforward. Maybe just "struct head_page" or something like that.
> Name it by what it *is*, not by analogies.

Btw, just to clarify: I don't love "struct head_page" either. It looks clunky. But at least something like that would be a _straightforward_ clunky name.

Something like just "struct pages" would be less clunky, would still get the message across, but gets a bit too visually similar.

Naming is hard.

              Linus
On Mon, Aug 23, 2021 at 11:15:48PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> > However, this far exceeds the goal of a better mm-fs interface. And the value proposition of a full MM-internal conversion, including e.g. the less exposed anon page handling, is much more nebulous. It's been proposed to leave anon pages out, but IMO to keep that direction maintainable, the folio would have to be translated to a page quite early when entering MM code, rather than propagating it inward, in order to avoid huge, massively overlapping page and folio APIs.
>
> I only intend to leave anonymous memory out /for now/. My hope is that somebody else decides to work on it (and indeed Google have volunteered someone for the task).

Unlike the filesystem side, this seems like a lot of churn for very little tangible value. And it leaves us with an end result that nobody appears to be terribly excited about.

But the folio abstraction is too low-level to use JUST for file cache and NOT for anon. It's too close to the page layer itself and would duplicate too much of it to be maintainable side by side.

That's why I asked why it couldn't be a more abstract memory unit for managing file cache, with a clearer delineation between that and how the backing memory is implemented - 1 page, N pages, maybe just a part of a page later on - and not just be a different name for a head page. It appears David is asking the same in the parallel subthread.

> > It's also not clear to me that using the same abstraction for compound pages and the file cache object is future proof. It's evident from scalability issues in the allocator, reclaim, compaction, etc. that with current memory sizes and IO devices, we're hitting the limits of efficiently managing memory in 4k base pages per default. It's also clear that we'll continue to have a need for 4k cache granularity for quite a few workloads that work with large numbers of small files. I'm not sure how this could be resolved other than divorcing the idea of a (larger) base page from the idea of cache entries that can correspond, if necessary, to memory chunks smaller than a default page.
>
> That sounds to me exactly like folios, except for the naming.

Then I think you misunderstood me.

> From the MM point of view, it's less churn to do it your way, but from the point of view of the rest of the kernel, there are going to be unexpected consequences. For example, btrfs didn't support page size != block size until just recently (and I'm not sure it's entirely fixed yet?)
>
> And there's nobody working on your idea. At least nobody who has surfaced so far. The folio patch is here now.
>
> Folios are also variable sized. For files which are small, we still only allocate 4kB to cache them. If the file is accessed entirely randomly, we only allocate 4kB chunks at a time. We only allocate larger folios when we think there is an advantage to doing so.
>
> This benefit is retained if someone does come along to change PAGE_SIZE to 16KiB (or whatever). Folios can still be composed of multiple pages, no matter what the PAGE_SIZE is.

The folio doc says "It is at least as large as %PAGE_SIZE"; folio_order() says "A folio is composed of 2^order pages"; page_folio(), folio_pfn(), folio_nr_pages all encode an N:1 relationship. And yes, the name implies it too.

This is in direct conflict with what I'm talking about, where base page granularity could become coarser than file cache granularity.

Are we going to bump struct page to 2M soon? I don't know. Here is what I do know about 4k pages, though:

- It's a lot of transactional overhead to manage tens of gigs of memory in 4k pages. We're reclaiming, paging and swapping more than ever before in our DCs, because flash provides in abundance the low-latency IOPS required for that, and parking cold/warm workload memory on cheap flash saves expensive RAM. But we're continuously scanning thousands of pages per second to do this. There was also the RWF_UNCACHED thread around reclaim CPU overhead at the higher end of buffered IO rates. There is the fact that we have a pending proposal from Google to replace rmap because it's too CPU-intense when paging into compressed memory pools.

- It's a lot of internal fragmentation. Compaction is becoming the default method for allocating the majority of memory in our servers. This is a latency concern during page faults, and a predictability concern when we defer it to khugepaged collapsing.

- struct page is statically eating gigs of expensive memory on every single machine, when only some of our workloads would require this level of granularity for some of their memory. And that's *after* we're fighting over every bit in that structure.

Base page size becoming bigger than cache entries in the near future doesn't strike me as an exotic idea. The writing seems to be on the wall. But the folio appears full of assumptions that conflict with it.

Sure, the patch is here now. But how much time will all the churn buy us before we may need a do-over? Would clean, incremental changes to the cache entry abstraction even be possible after we have anon and all kinds of other compound page internals hanging off of it as well?

Wouldn't it make more sense to decouple filesystems from "paginess", as David puts it, now instead? Avoid the risk of doing it twice, avoid the more questionable churn inside mm code, avoid the confusing proximity to the page and its API in the long-term...
On Tue, Aug 24, 2021 at 11:31 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Unlike the filesystem side, this seems like a lot of churn for very little tangible value. And it leaves us with an end result that nobody appears to be terribly excited about.

Well, there is actually some fairly well documented tangible value: our page accessor helper functions spend an absolutely insane amount of effort and time on just checking "is this a head page", following the "compound_head" pointer, etc. Functions that *used* to be trivial - and are still used as if they were - generate nasty complex code.

I'm thinking things like "get_page()" - it increments the reference count of the page. It's just a single atomic increment, right? Wrong. It's still inlined, but it generates these incredible gyrations with testing the low bit of a field, doing two very different things based on whether it is set, and now we have that "is it close to overflow" test too (ok, that one is dependent on VM_DEBUG), so it actually generates two conditional branches, odd bit tests, lots of extra calls, etc. So "get_page()" should probably not be an inline function any more. And that's just the first thing I happened to look at.

I think we have those "head = compound_head(page)" calls all over the VM code. And no, that "look up the compound page header" is not necessarily the biggest part of it, but it's definitely one part of it. And if we had a "we know this page is a head page", all of that just goes away. And in a lot of cases, we *do* know that. Which is exactly the kind of static knowledge that the folio patches expose.

But it is a lot of churn. And it basically duplicates all our page functions, just to have those simplified versions. It's very core code, and while I appreciate the cleverness of the "folio" name, I do think it makes the end result perhaps subtler than it needs to be.

The one thing I do like about it is how it uses the type system to be incremental.
So I don't hate the patches. I think they are clever, I think they are likely worthwhile, but I also certainly don't love them.

              Linus
On Tue, Aug 24, 2021 at 11:26:30AM -0700, Linus Torvalds wrote:
> On Tue, Aug 24, 2021 at 11:17 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > If the only thing standing between this patch and the merge is
> > s/folio/ream/g,
>
> I really don't think that helps. All the book-binding analogies are only confusing.
>
> If anything, I'd make things more explicit. Stupid and straightforward. Maybe just "struct head_page" or something like that. Name it by what it *is*, not by analogies.

I don't mind calling it something entirely different. I mean, the word "slab" has nothing to do with memory or pages or anything. I just want something short and greppable. Choosing short words at random from /usr/share/dict/words:

belt gala claw ogre peck raft bowl moat cask deck rink toga
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Something like just "struct pages" would be less clunky, would still
> get the message across, but gets a bit too visually similar.

"page_group"? I would suggest "pgroup", but that's already taken. Maybe "page_set" with "pset" as a shorthand pointer name. Or "struct pset/pgset"?

I would prefer a short single-word name, as there's a good chance it's going to be prefixing a bunch of API functions.

If you don't mind straying a bit from something with the name "page" in it, then "chapter", "sheet" or "book"?

David
On Tue, Aug 24, 2021 at 12:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Choosing short words at random from /usr/share/dict/words:

I don't think you're getting my point. In fact, you're just making it WORSE.

"short" and "greppable" is not the main issue here. "understandable" and "follows other conventions" is.

And those "other conventions" are not "book binders in the 17th century". They are about operating system design.

So when you mention "slab" as a name example, that's not the argument you think it is. That's a real honest-to-goodness operating system convention name that doesn't exactly predate Linux, but is most certainly not new.

In fact, "slab" is a bad example for another reason: we don't actually really use it outside of the internal implementation of the slab cache. The name we actually *use* tends to be "kmalloc()" or similar, which most definitely has a CS history that goes back even further and is not at all confusing to anybody.

So no. This email just convinces me that you have ENTIRELY the wrong approach to naming, and is just making me more convinced that "folio" came from the wrong kind of thinking. Because "random short words" is absolutely the last thing you should look at.

              Linus
On Tue, Aug 24, 2021 at 12:11:49PM -0700, Linus Torvalds wrote:
> On Tue, Aug 24, 2021 at 12:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Choosing short words at random from /usr/share/dict/words:
>
> I don't think you're getting my point. In fact, you're just making it WORSE.
>
> "short" and "greppable" is not the main issue here. "understandable" and "follows other conventions" is.
>
> And those "other conventions" are not "book binders in the 17th century". They are about operating system design.
>
> So when you mention "slab" as a name example, that's not the argument you think it is. That's a real honest-to-goodness operating system convention name that doesn't exactly predate Linux, but is most certainly not new.

Sure, but at the time Jeff Bonwick chose it, it had no meaning in computer science or operating system design. Whatever name is chosen, we'll get used to it. I don't even care what name it is.

I want "short" because it ends up used everywhere. I don't want to be typing

	lock_hippopotamus(hippopotamus);

and I want greppable so it's not confused with something somebody else has already used as an identifier.
On Tue, Aug 24, 2021 at 12:11 PM David Howells <dhowells@redhat.com> wrote:
>
> "page_group"? I would suggest "pgroup", but that's already taken. Maybe
> "page_set" with "pset" as a shorthand pointer name. Or "struct pset/pgset"?

Please don't do the "shorthand" thing. Names like "pset" and "pgroup" are pure and utter garbage, and make no sense and describe nothing at all. If you want a pointer name and don't need a descriptive name because there is no ambiguity, you might as well just use 'p'. And if you want to make it clear that it's a collection of pages, you might as well use "pages".

Variable naming is one thing, and there's nothing wrong with variable names like 'i', 'p' and 'pages'. The variable name should come from the context, and 'a' and 'b' can make perfect sense (and 'new' and 'old' can be very good names that clarify what the usage is - C++ people can go pound sand, they mis-designed the language keywords).

But the *type* name should describe the type, and it sure shouldn't be anything like pset/pgroup.

Something like "page_group" or "pageset" sound reasonable to me as type names.

              Linus
On Tue, Aug 24, 2021 at 11:29:53AM -0700, Linus Torvalds wrote: > > Something like just "struct pages" would be less clunky, would still > get the message across, but gets a bit too visually similar. How about "struct mempages"? > Naming is hard. Indeed... - Ted
Theodore Ts'o <tytso@mit.edu> wrote:
> How about "struct mempages"?
Kind of redundant in this case?
David
Matthew Wilcox <willy@infradead.org> wrote: > Sure, but at the time Jeff Bonwick chose it, it had no meaning in > computer science or operating system design. Whatever name is chosen, > we'll get used to it. I don't even care what name it is. > > I want "short" because it ends up used everywhere. I don't want to > be typing > lock_hippopotamus(hippopotamus); > > and I want greppable so it's not confused with something somebody else > has already used as an identifier. Can you live with pageset? David
On Tue, Aug 24, 2021 at 12:25 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Something like "page_group" or "pageset" sound reasonable to me as type names. "pageset" is such a great name that we already use it, so I guess that doesn't work. Linus
On Tue, Aug 24, 2021 at 02:32:56PM -0400, Johannes Weiner wrote: > The folio doc says "It is at least as large as %PAGE_SIZE"; > folio_order() says "A folio is composed of 2^order pages"; > page_folio(), folio_pfn(), folio_nr_pages all encode a N:1 > relationship. And yes, the name implies it too. > > This is in direct conflict with what I'm talking about, where base > page granularity could become coarser than file cache granularity. That doesn't make any sense. A page is the fundamental unit of the mm. Why would we want to increase the granularity of page allocation and not increase the granularity of the file cache? > Are we going to bump struct page to 2M soon? I don't know. Here is > what I do know about 4k pages, though: > > - It's a lot of transactional overhead to manage tens of gigs of > memory in 4k pages. We're reclaiming, paging and swapping more than > ever before in our DCs, because flash provides in abundance the > low-latency IOPS required for that, and parking cold/warm workload > memory on cheap flash saves expensive RAM. But we're continously > scanning thousands of pages per second to do this. There was also > the RWF_UNCACHED thread around reclaim CPU overhead at the higher > end of buffered IO rates. There is the fact that we have a pending > proposal from Google to replace rmap because it's too CPU-intense > when paging into compressed memory pools. This seems like an argument for folios, not against them. If user memory (both anon and file) is being allocated in larger chunks, there are fewer pages to scan, less book-keeping to do, and all you're paying for that is I/O bandwidth. > - It's a lot of internal fragmentation. Compaction is becoming the > default method for allocating the majority of memory in our > servers. This is a latency concern during page faults, and a > predictability concern when we defer it to khugepaged collapsing. Again, the more memory that we allocate in higher-order chunks, the better this situation becomes. 
> - struct page is statically eating gigs of expensive memory on every > single machine, when only some of our workloads would require this > level of granularity for some of their memory. And that's *after* > we're fighting over every bit in that structure. That, folios does not help with. I have post-folio ideas about how to address that, but I can't realistically start working on them until folios are upstream.
On Tue, Aug 24, 2021 at 08:23:15PM +0100, Matthew Wilcox wrote: > > So when you mention "slab" as a name example, that's not the argument > > you think it is. That's a real honest-to-goodness operating system > > convention name that doesn't exactly predate Linux, but is most > > certainly not new. > > Sure, but at the time Jeff Bonwick chose it, it had no meaning in > computer science or operating system design. I think the big difference is that "slab" is mostly used as an internal name. In Linux it doesn't even leak out to the users, since we use kmem_cache_{create,alloc,free,destroy}(). So the "slab" doesn't even show up in the API. The problem is that whether we use struct head_page, or folio, or mempages, we're going to be in subsystem users' faces. And while people who are using it every day will eventually get used to anything, whether it's "folio" or "xmoqax", we should give a thought to newcomers to Linux file system code. If they see things like "read_folio()", they are going to be far more confused than by "read_pages()" or "read_mempages()". Sure, one impenetrable code word isn't that bad. But this is a case of death by a thousand cuts. At $WORK, one time when we had welcomed an intern to our group, I had to stop everyone each time they used an acronym or a codeword and ask them to define the term. It was really illuminating what an insider takes for granted, but when it's one cutesy codeword after another, with three or more such codewords in a sentence, it's *really* a less-than-great initial experience for a newcomer. So if someone sees "kmem_cache_alloc()", they can probably make a guess at what it means, and it's memorable once they learn it. Similarly, something like "head_page" or "mempages" is going to be a bit more obvious to a kernel newbie. So if we can make a tiny gesture towards comprehensibility, it would be good to do so while it's still easier to change the name. Cheers, - Ted
On Tue, Aug 24, 2021 at 12:38 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > "pageset" is such a great name that we already use it, so I guess that > doesn't work. Actually, maybe I can backtrack on that a bit. Maybe 'pageset' would work as a name. It's not used as a type right now, but the existing usage, where we do have those comments around 'struct per_cpu_pages', is actually not that different from the folio kind of thing. It has a list of "pages" that have a fixed order. So that existing 'pageset' user might actually fit in conceptually. The 'pageset' is only really used in comments and as part of a field name, and that use does seem to be kind of consistent with Willy's use of an "aligned allocation-group of pages". Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Something like "page_group" or "pageset" sound reasonable to me as type > > names. > > "pageset" is such a great name that we already use it, so I guess that > doesn't work. Heh. I tried grepping for "struct page_set" and that showed nothing. Maybe "pagegroup"? Here's a bunch of possible alternatives to set/group: https://en.wiktionary.org/wiki/Thesaurus:group Maybe consider it a sequence of pages, "struct pageseq"? page_aggregate sounds like a possibility, but it's quite long. Though from an fs point of view, I'd be okay hiding the fact that pages are involved. It's a buffer; a chunk of memory or chunk of pagecache with metadata - maybe something on that theme? David
On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: > On Tue, Aug 24, 2021 at 08:23:15PM +0100, Matthew Wilcox wrote: > > > So when you mention "slab" as a name example, that's not the argument > > > you think it is. That's a real honest-to-goodness operating system > > > convention name that doesn't exactly predate Linux, but is most > > > certainly not new. > > > > Sure, but at the time Jeff Bonwick chose it, it had no meaning in > > computer science or operating system design. > > I think the big difference is that "slab" is mostly used as an > internal name. In Linux it doesn't even leak out to the users, since > we use kmem_cache_{create,alloc,free,destroy}(). So the "slab" > doesn't even show up in the API. /proc/slabinfo /proc/sys/vm/min_slab_ratio /sys/kernel/slab include/linux/slab.h cpuset.memory_spread_slab failslab= slab_merge slab_max_order= $ git grep slab fs/ext4 |wc -l 30 (13 of which are slab.h) > The problem is whether we use struct head_page, or folio, or mempages, > we're going to be subsystem users' faces. And people who are using it > every day will eventually get used to anything, whether it's "folio" > or "xmoqax", we sould give a thought to newcomers to Linux file system > code. If they see things like "read_folio()", they are going to be > far more confused than "read_pages()" or "read_mempages()". > > Sure, one impenetrable code word isn't that bad. But this is a case > of a death by a thousand cuts. At $WORK, one time we had welcomed an > intern to our group, I had to stop everyone each time that they used > an acronym, or a codeword, and asked them to define the term. > > It was really illuminating what an insider takes for granted, but when > it's one cutsy codeword after another, with three or more such > codewords in a sentence, it's *really* a less-than-great initial > experience for a newcomer. 
> > So if someone sees "kmem_cache_alloc()", they can probably make a > guess what it means, and it's memorable once they learn it. > Similarly, something like "head_page", or "mempages" is going to a bit > more obvious to a kernel newbie. So if we can make a tiny gesture > towards comprehensibility, it would be good to do so while it's still > easier to change the name. I completely agree that it's good to use something which is not jargon, or is at least widely-understood jargon. And I loathe acronyms (you'll notice I haven't suggested a single one). Folio/ream/quire/sheaf were all attempts to get across "collection of pages". Another direction would be something that is associated with memory (but I don't have a good example). Or a non-English word (roman? seite? sidor?) We're going to end up with hpage, aren't we?
On Tue, Aug 24, 2021 at 08:34:47PM +0100, David Howells wrote: > Theodore Ts'o <tytso@mit.edu> wrote: > > > How about "struct mempages"? > > Kind of redundant in this case? I was looking for something which was visually different from "struct page", but was still reasonably short. Otherwise "struct pages" as Linus suggested would work for me. What do you think of "struct pageset"? Not quite as short as folios, but it's clearer. - Ted
On 8/24/21 9:35 PM, David Howells wrote: > Matthew Wilcox <willy@infradead.org> wrote: > >> Sure, but at the time Jeff Bonwick chose it, it had no meaning in >> computer science or operating system design. Whatever name is chosen, >> we'll get used to it. I don't even care what name it is. >> >> I want "short" because it ends up used everywhere. I don't want to >> be typing >> lock_hippopotamus(hippopotamus); >> >> and I want greppable so it's not confused with something somebody else >> has already used as an identifier. > > Can you live with pageset? Pagesets already exist in the page allocator internals. Yeah, could be renamed as it's not visible outside. > David > >
On 8/24/21 10:35 PM, Vlastimil Babka wrote: > On 8/24/21 9:35 PM, David Howells wrote: >> Matthew Wilcox <willy@infradead.org> wrote: >> >>> Sure, but at the time Jeff Bonwick chose it, it had no meaning in >>> computer science or operating system design. Whatever name is chosen, >>> we'll get used to it. I don't even care what name it is. >>> >>> I want "short" because it ends up used everywhere. I don't want to >>> be typing >>> lock_hippopotamus(hippopotamus); >>> >>> and I want greppable so it's not confused with something somebody else >>> has already used as an identifier. >> >> Can you live with pageset? > > Pagesets already exist in the page allocator internals. Yeah, could be > renamed as it's not visible outside. Should have read the rest of thread before replying. Maybe in the spirit of the discussion we could call it pageshed? /me hides >> David >> >> > >
Theodore Ts'o <tytso@mit.edu> wrote: > What do you think of "struct pageset"? Not quite as short as folios, > but it's clearer. Fine by me (I suggested page_set), and as Vlastimil points out, the current usage of the name could be renamed. David
On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: > The problem is whether we use struct head_page, or folio, or mempages, > we're going to be subsystem users' faces. And people who are using it > every day will eventually get used to anything, whether it's "folio" > or "xmoqax", we sould give a thought to newcomers to Linux file system > code. If they see things like "read_folio()", they are going to be > far more confused than "read_pages()" or "read_mempages()". Are they? It's not like page isn't some randomly made up term as well, just one that had a lot more time to spread. > So if someone sees "kmem_cache_alloc()", they can probably make a > guess what it means, and it's memorable once they learn it. > Similarly, something like "head_page", or "mempages" is going to a bit > more obvious to a kernel newbie. So if we can make a tiny gesture > towards comprehensibility, it would be good to do so while it's still > easier to change the name. All this sounds really weird to me. I doubt there is any name that nicely explains "structure used to manage arbitrary power of two units of memory in the kernel" very well. So I agree with willy here, let's pick something short and not clumsy. I initially found the folio name a little strange, but working with it I got used to it quickly. And all the other suggestions I've seen so far are significantly worse, especially all the odd compounds with page in it.
On Tue, Aug 24, 2021 at 11:59:52AM -0700, Linus Torvalds wrote: > But it is a lot of churn. And it basically duplicates all our page > functions, just to have those simplified versions. And It's very core > code, and while I appreciate the cleverness of the "folio" name, I do > think it makes the end result perhaps subtler than it needs to be. Maybe I'm biased by looking at the file system and pagecache side mostly, but if you look at the progress willy has been making, a lot of the relevant functionality will exist in either folio or page versions, not both. A lot of the duplication is to support the following: > The one thing I do like about it is how it uses the type system to be > incremental.
On 25/08/2021 08.32, Christoph Hellwig wrote: > On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: >> The problem is whether we use struct head_page, or folio, or mempages, >> we're going to be subsystem users' faces. And people who are using it >> every day will eventually get used to anything, whether it's "folio" >> or "xmoqax", we sould give a thought to newcomers to Linux file system >> code. If they see things like "read_folio()", they are going to be >> far more confused than "read_pages()" or "read_mempages()". > > Are they? It's not like page isn't some randomly made up term > as well, just one that had a lot more time to spread. > >> So if someone sees "kmem_cache_alloc()", they can probably make a >> guess what it means, and it's memorable once they learn it. >> Similarly, something like "head_page", or "mempages" is going to a bit >> more obvious to a kernel newbie. So if we can make a tiny gesture >> towards comprehensibility, it would be good to do so while it's still >> easier to change the name. > > All this sounds really weird to me. I doubt there is any name that > nicely explains "structure used to manage arbitrary power of two > units of memory in the kernel" very well. So I agree with willy here, > let's pick something short and not clumsy. I initially found the folio > name a little strange, but working with it I got used to it quickly. > And all the other uggestions I've seen s far are significantly worse, > especially all the odd compounds with page in it. > A comment from the peanut gallery: I find the name folio completely appropriate and easy to understand. Our vocabulary is already strongly inspired by words used in the world of printed text: the smallest unit of information is a char(acter) [ok, we usually call them bytes], a few characters make up a word, there's a number of words to each (cache) line, and a number of those is what makes up a page. So obviously a folio is something consisting of a few pages. 
Are the analogies perfect? Of course not. But they are actually quite apt; words, lines and pages don't universally have one size, but they do form a natural hierarchy describing how we organize information. Splitting a word across lines can slow down the reader so should be avoided... [sorry, couldn't resist]. Rasmus
On Wed, 2021-08-25 at 07:32 +0100, Christoph Hellwig wrote: > On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: > > The problem is whether we use struct head_page, or folio, or mempages, > > we're going to be subsystem users' faces. And people who are using it > > every day will eventually get used to anything, whether it's "folio" > > or "xmoqax", we sould give a thought to newcomers to Linux file system > > code. If they see things like "read_folio()", they are going to be > > far more confused than "read_pages()" or "read_mempages()". > > Are they? It's not like page isn't some randomly made up term > as well, just one that had a lot more time to spread. > Absolutely. "folio" is no worse than "page", we've just had more time to get used to "page". > > So if someone sees "kmem_cache_alloc()", they can probably make a > > guess what it means, and it's memorable once they learn it. > > Similarly, something like "head_page", or "mempages" is going to a bit > > more obvious to a kernel newbie. So if we can make a tiny gesture > > towards comprehensibility, it would be good to do so while it's still > > easier to change the name. > > All this sounds really weird to me. I doubt there is any name that > nicely explains "structure used to manage arbitrary power of two > units of memory in the kernel" very well. So I agree with willy here, > let's pick something short and not clumsy. I initially found the folio > name a little strange, but working with it I got used to it quickly. > And all the other uggestions I've seen s far are significantly worse, > especially all the odd compounds with page in it. Same here. Compound words are especially bad, as newbies will continually have to look at whether it's "page_set" or "pageset".
On Tue, 2021-08-24 at 22:32 +0100, David Howells wrote: > Theodore Ts'o <tytso@mit.edu> wrote: > > > What do you think of "struct pageset"? Not quite as short as folios, > > but it's clearer. > > Fine by me (I suggested page_set), and as Vlastimil points out, the current > usage of the name could be renamed. > I honestly fail to see how any of this is better than "folio". It's just a name, and "folio" has the advantage of being fairly unique. The greppability that Willy mentioned is a perk, but folio also doesn't sound similar to other words when discussing them verbally. That's another advantage. If I say "pageset" in a conversation, do I mean "struct pageset" or "a random set of pages"? If I say "folio", it's much more clear to what I'm referring. We've had a lot of time to get used to "page" as a term of art. We'd get used to folio too.
On Tue, Aug 24, 2021 at 08:44:01PM +0100, Matthew Wilcox wrote: > On Tue, Aug 24, 2021 at 02:32:56PM -0400, Johannes Weiner wrote: > > The folio doc says "It is at least as large as %PAGE_SIZE"; > > folio_order() says "A folio is composed of 2^order pages"; > > page_folio(), folio_pfn(), folio_nr_pages all encode a N:1 > > relationship. And yes, the name implies it too. > > > > This is in direct conflict with what I'm talking about, where base > > page granularity could become coarser than file cache granularity. > > That doesn't make any sense. A page is the fundamental unit of the > mm. Why would we want to increase the granularity of page allocation > and not increase the granularity of the file cache? I'm not sure why one should be tied to the other. The folio itself is based on the premise that a cache entry doesn't have to correspond to exactly one struct page. And I agree with that. I'm just wondering why it continues to imply a cache entry is at least one full page, rather than saying a cache entry is a set of bytes that can be backed however the MM sees fit. So that in case we do bump struct page size in the future we don't have to redo the filesystem interface again. I've listed reasons why 4k pages are increasingly the wrong choice for many allocations, reclaim and paging. We also know there is a need to maintain support for 4k cache entries. > > Are we going to bump struct page to 2M soon? I don't know. Here is > > what I do know about 4k pages, though: > > > > - It's a lot of transactional overhead to manage tens of gigs of > > memory in 4k pages. We're reclaiming, paging and swapping more than > > ever before in our DCs, because flash provides in abundance the > > low-latency IOPS required for that, and parking cold/warm workload > > memory on cheap flash saves expensive RAM. But we're continously > > scanning thousands of pages per second to do this. 
But we're continuously > > scanning thousands of pages per second to do this. There was also > > the RWF_UNCACHED thread around reclaim CPU overhead at the higher > > end of buffered IO rates. There is the fact that we have a pending > > proposal from Google to replace rmap because it's too CPU-intense > > when paging into compressed memory pools. > > This seems like an argument for folios, not against them. If user > memory (both anon and file) is being allocated in larger chunks, there > are fewer pages to scan, less book-keeping to do, and all you're paying > for that is I/O bandwidth. Well, it's an argument for huge pages, and we already have those in the form of THP. The problem with THP today is that the page allocator fragments the physical address space at the 4k granularity by default, and groups random allocations with no type information and rudimentary lifetime/reclaimability hints together. I'm having a hard time seeing 2M allocations scale as long as we do this. As opposed to making 2M the default block and using slab-style physical grouping by type and instantiation time for smaller cache entries - to improve the chances of physically contiguous reclaim. But because folios are compound/head pages first and foremost, they are inherently tied to being multiples of PAGE_SIZE. > > - It's a lot of internal fragmentation. Compaction is becoming the > > default method for allocating the majority of memory in our > > servers. This is a latency concern during page faults, and a > > predictability concern when we defer it to khugepaged collapsing. > > Again, the more memory that we allocate in higher-order chunks, the > better this situation becomes. It only needs 1 unfortunately placed 4k page out of 512 to mess up a 2M block indefinitely. And the page allocator has little awareness whether the 4k page it's handing out to somebody pairs well with the 4k page adjacent to it in terms of type and lifetime.
> > - struct page is statically eating gigs of expensive memory on every > > single machine, when only some of our workloads would require this > > level of granularity for some of their memory. And that's *after* > > we're fighting over every bit in that structure. > > That, folios does not help with. I have post-folio ideas about how > to address that, but I can't realistically start working on them > until folios are upstream. How would you reduce the memory overhead of struct page without losing necessary 4k granularity at the cache level? As long as folio implies that cache entries can't be smaller than a struct page? I appreciate folio is a big patchset and I don't mean to get too much into speculation about the future. But we're here in part because the filesystems have been too exposed to the backing memory implementation details. So all I'm saying is, if you're touching all the file cache interface now anyway, why not use the opportunity to properly disconnect it from the reality of pages, instead of making the compound page the new interface for filesystems. What's wrong with the idea of a struct cache_entry which can be embedded wherever we want: in a page, a folio or a pageset. Or in the future allocated on demand for <PAGE_SIZE entries, if need be. But actually have it be just a cache entry for the fs to read and write, not also a compound page and an anon page etc. all at the same time. Even today that would IMO delineate more clearly between the file cache data plane and the backing memory plane. It doesn't get in the way of also fixing the base-or-compound mess inside MM code with folio/pageset, either. And if down the line we change how the backing memory is implemented, the changes would be a more manageable scope inside MM proper. Anyway, I think I've asked all this before and don't mean to harp on it if people generally disagree that this is a concern.
On Wed, Aug 25, 2021 at 11:13:45AM -0400, Johannes Weiner wrote: > On Tue, Aug 24, 2021 at 08:44:01PM +0100, Matthew Wilcox wrote: > > On Tue, Aug 24, 2021 at 02:32:56PM -0400, Johannes Weiner wrote: > > > The folio doc says "It is at least as large as %PAGE_SIZE"; > > > folio_order() says "A folio is composed of 2^order pages"; > > > page_folio(), folio_pfn(), folio_nr_pages all encode a N:1 > > > relationship. And yes, the name implies it too. > > > > > > This is in direct conflict with what I'm talking about, where base > > > page granularity could become coarser than file cache granularity. > > > > That doesn't make any sense. A page is the fundamental unit of the > > mm. Why would we want to increase the granularity of page allocation > > and not increase the granularity of the file cache? > > I'm not sure why one should be tied to the other. The folio itself is > based on the premise that a cache entry doesn't have to correspond to > exactly one struct page. And I agree with that. I'm just wondering why > it continues to imply a cache entry is at least one full page, rather > than saying a cache entry is a set of bytes that can be backed however > the MM sees fit. So that in case we do bump struct page size in the > future we don't have to redo the filesystem interface again. > > I've listed reasons why 4k pages are increasingly the wrong choice for > many allocations, reclaim and paging. We also know there is a need to > maintain support for 4k cache entries. > > > > Are we going to bump struct page to 2M soon? I don't know. Here is > > > what I do know about 4k pages, though: > > > > > > - It's a lot of transactional overhead to manage tens of gigs of > > > memory in 4k pages. We're reclaiming, paging and swapping more than > > > ever before in our DCs, because flash provides in abundance the > > > low-latency IOPS required for that, and parking cold/warm workload > > > memory on cheap flash saves expensive RAM. 
But we're continously > > > scanning thousands of pages per second to do this. There was also > > > the RWF_UNCACHED thread around reclaim CPU overhead at the higher > > > end of buffered IO rates. There is the fact that we have a pending > > > proposal from Google to replace rmap because it's too CPU-intense > > > when paging into compressed memory pools. > > > > This seems like an argument for folios, not against them. If user > > memory (both anon and file) is being allocated in larger chunks, there > > are fewer pages to scan, less book-keeping to do, and all you're paying > > for that is I/O bandwidth. > > Well, it's an argument for huge pages, and we already have those in > the form of THP. > > The problem with THP today is that the page allocator fragments the > physical address space at the 4k granularity per default, and groups > random allocations with no type information and rudimentary > lifetime/reclaimability hints together. > > I'm having a hard time seeing 2M allocations scale as long as we do > this. As opposed to making 2M the default block and using slab-style > physical grouping by type and instantiation time for smaller cache > entries - to improve the chances of physically contiguous reclaim. > > But because folios are compound/head pages first and foremost, they > are inherently tied to being multiples of PAGE_SIZE. > > > > - It's a lot of internal fragmentation. Compaction is becoming the > > > default method for allocating the majority of memory in our > > > servers. This is a latency concern during page faults, and a > > > predictability concern when we defer it to khugepaged collapsing. > > > > Again, the more memory that we allocate in higher-order chunks, the > > better this situation becomes. > > It only needs 1 unfortunately placed 4k page out of 512 to mess up a > 2M block indefinitely. 
And the page allocator has little awareness > whether the 4k page it's handing out to somebody pairs well with the > 4k page adjacent to it in terms of type and lifetime. > > > > - struct page is statically eating gigs of expensive memory on every > > > single machine, when only some of our workloads would require this > > > level of granularity for some of their memory. And that's *after* > > > we're fighting over every bit in that structure. > > > > That, folios does not help with. I have post-folio ideas about how > > to address that, but I can't realistically start working on them > > until folios are upstream. > > How would you reduce the memory overhead of struct page without losing > necessary 4k granularity at the cache level? As long as folio implies > that cache entries can't be smaller than a struct page? > > I appreciate folio is a big patchset and I don't mean to get too much > into speculation about the future. > > But we're here in part because the filesystems have been too exposed > to the backing memory implementation details. So all I'm saying is, if > you're touching all the file cache interface now anyway, why not use > the opportunity to properly disconnect it from the reality of pages, > instead of making the compound page the new interface for filesystems. > > What's wrong with the idea of a struct cache_entry which can be > embedded wherever we want: in a page, a folio or a pageset. Or in the > future allocated on demand for <PAGE_SIZE entries, if need be. But > actually have it be just a cache entry for the fs to read and write, > not also a compound page and an anon page etc. all at the same time. Pardon my ignorance, but ... how would adding yet another layer help a filesystem? No matter how the software is structured, we have to set up and manage the (hardware) page state for programs, and we must keep that coherent with the file space mappings that we maintain. 
I already know how to deal with pages and dealing with "folios" seems about the same. Adding another layer of caching structures just adds another layer of cra^Wcoherency management for a filesystem to screw up. The folios change management of memory pages enough to disentangle the page/compound page confusion that exists now, and it seems like a reasonable means to supporting unreasonable things like copy on write storage for filesystems with a 56k block size. (And I'm sure I'll get tons of blowback for this, but XFS can manage space in weird units like that (configure the rt volume, set a 56k rt extent size, and all the allocations are multiples of 56k); if we ever wanted to support reflink on /that/ hot mess, it would be awesome to be able to say that we're only going to do 56k folios in the page cache for those files instead of the crazy writeback games that the prototype patchset does now.) --D > Even today that would IMO delineate more clearly between the file > cache data plane and the backing memory plane. It doesn't get in the > way of also fixing the base-or-compound mess inside MM code with > folio/pageset, either. > > And if down the line we change how the backing memory is implemented, > the changes would be a more manageable scope inside MM proper. > > Anyway, I think I've asked all this before and don't mean to harp on > it if people generally disagree that this is a concern.
On Wed, Aug 25, 2021 at 08:03:18AM -0400, Jeff Layton wrote: > On Wed, 2021-08-25 at 07:32 +0100, Christoph Hellwig wrote: > > On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: > > > The problem is whether we use struct head_page, or folio, or mempages, > > > we're going to be subsystem users' faces. And people who are using it > > > every day will eventually get used to anything, whether it's "folio" > > > or "xmoqax", we sould give a thought to newcomers to Linux file system > > > code. If they see things like "read_folio()", they are going to be > > > far more confused than "read_pages()" or "read_mempages()". > > > > Are they? It's not like page isn't some randomly made up term > > as well, just one that had a lot more time to spread. > > > > Absolutely. "folio" is no worse than "page", we've just had more time > to get used to "page". I /like/ the name 'folio'. My privileged education :P informed me (when Matthew talked to me the very first time about this patchset) that it's a wonderfully flexible word that describes both a collection of various pages and a single large sheet of paper folded in half. Or in the case of x86, folded in half nine times. That's *exactly* the usage that Matthew is proposing. English already had a word ready for us to use, so let's use it. --D (Well, ok, the one thing I dislike is that my brain keeps typing out 'fileio' instead of 'folio', but it's still better than struct xmoqax or remembering if we do camel_case or PotholeCase.) > > > So if someone sees "kmem_cache_alloc()", they can probably make a > > > guess what it means, and it's memorable once they learn it. > > > Similarly, something like "head_page", or "mempages" is going to a bit > > > more obvious to a kernel newbie. So if we can make a tiny gesture > > > towards comprehensibility, it would be good to do so while it's still > > > easier to change the name. > > > > All this sounds really weird to me. 
> > I doubt there is any name that
> > nicely explains "structure used to manage arbitrary power of two
> > units of memory in the kernel" very well. So I agree with willy here,
> > let's pick something short and not clumsy. I initially found the folio
> > name a little strange, but working with it I got used to it quickly.
> > And all the other suggestions I've seen so far are significantly worse,
> > especially all the odd compounds with page in it.
>
> Same here. Compound words are especially bad, as newbies will
> continually have to look at whether it's "page_set" or "pageset".
>
> --
> Jeff Layton <jlayton@kernel.org>
>
Excerpts from Christoph Hellwig's message of August 25, 2021 4:32 pm:
> On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote:
>> The problem is whether we use struct head_page, or folio, or mempages,
>> we're going to be in subsystem users' faces. And people who are using it
>> every day will eventually get used to anything, whether it's "folio"
>> or "xmoqax", we should give a thought to newcomers to Linux file system
>> code. If they see things like "read_folio()", they are going to be
>> far more confused than "read_pages()" or "read_mempages()".
>
> Are they? It's not like page isn't some randomly made up term
> as well, just one that had a lot more time to spread.
>
>> So if someone sees "kmem_cache_alloc()", they can probably make a
>> guess what it means, and it's memorable once they learn it.
>> Similarly, something like "head_page", or "mempages" is going to be a bit
>> more obvious to a kernel newbie. So if we can make a tiny gesture
>> towards comprehensibility, it would be good to do so while it's still
>> easier to change the name.
>
> All this sounds really weird to me. I doubt there is any name that
> nicely explains "structure used to manage arbitrary power of two
> units of memory in the kernel" very well.

Cluster is easily understandable to a filesystem developer as a
contiguous set of one or more units, probably aligned and sized to a
power of 2. The swap subsystem in mm uses it (maybe because it's disk
adjacent, but it does have page clusters), so mm developers would be
fine with it too. Sadly you might have to call it page_cluster to
avoid confusion with block clusters in fs, and then it gets a bit
long.

Superpage could be different enough from huge page, which implies one
page of a particular large size (even though some other OS might use
it for that): a superpage would be a super set of pages, which could
be 1 or more.

Thanks,
Nick
On Wed, Aug 25, 2021 at 12:02 PM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> On 25/08/2021 08.32, Christoph Hellwig wrote:
> > On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote:
> >> The problem is whether we use struct head_page, or folio, or mempages,
> >> we're going to be in subsystem users' faces. And people who are using it
> >> every day will eventually get used to anything, whether it's "folio"
> >> or "xmoqax", we should give a thought to newcomers to Linux file system
> >> code. If they see things like "read_folio()", they are going to be
> >> far more confused than "read_pages()" or "read_mempages()".
> >
> > Are they? It's not like page isn't some randomly made up term
> > as well, just one that had a lot more time to spread.
> >
> >> So if someone sees "kmem_cache_alloc()", they can probably make a
> >> guess what it means, and it's memorable once they learn it.
> >> Similarly, something like "head_page", or "mempages" is going to be a bit
> >> more obvious to a kernel newbie. So if we can make a tiny gesture
> >> towards comprehensibility, it would be good to do so while it's still
> >> easier to change the name.
> >
> > All this sounds really weird to me. I doubt there is any name that
> > nicely explains "structure used to manage arbitrary power of two
> > units of memory in the kernel" very well. So I agree with willy here,
> > let's pick something short and not clumsy. I initially found the folio
> > name a little strange, but working with it I got used to it quickly.
> > And all the other suggestions I've seen so far are significantly worse,
> > especially all the odd compounds with page in it.
>
> A comment from the peanut gallery: I find the name folio completely
> appropriate and easy to understand.
> Our vocabulary is already strongly
> inspired by words used in the world of printed text: the smallest unit
> of information is a char(acter) [ok, we usually call them bytes], a few
> characters make up a word, there's a number of words to each (cache)
> line, and a number of those is what makes up a page. So obviously a
> folio is something consisting of a few pages.
>
> Are the analogies perfect? Of course not. But they are actually quite
> apt; words, lines and pages don't universally have one size, but they do
> form a natural hierarchy describing how we organize information.
>
> Splitting a word across lines can slow down the reader so should be
> avoided... [sorry, couldn't resist].
>

And if we ever want to manage page cache using an arbitrary number of
contiguous folios, we can always saw them into a scroll ;-)

Thanks,
Amir.
Johannes Weiner <hannes@cmpxchg.org> wrote:

> But we're here in part because the filesystems have been too exposed
> to the backing memory implementation details. So all I'm saying is, if
> you're touching all the file cache interface now anyway, why not use
> the opportunity to properly disconnect it from the reality of pages,
> instead of making the compound page the new interface for filesystems.
>
> What's wrong with the idea of a struct cache_entry

Well, the name's already taken, though only in cifs. And we have a
*lot* of caches so just calling it "cache_entry" is kind of
unspecific.

> which can be
> embedded wherever we want: in a page, a folio or a pageset. Or in the
> future allocated on demand for <PAGE_SIZE entries, if need be. But
> actually have it be just a cache entry for the fs to read and write,
> not also a compound page and an anon page etc. all at the same time.
>
> Even today that would IMO delineate more clearly between the file
> cache data plane and the backing memory plane. It doesn't get in the
> way of also fixing the base-or-compound mess inside MM code with
> folio/pageset, either.

One thing I like about Willy's folio concept is that, as long as
everyone uses the proper accessor functions and macros, we can mostly
ignore the fact that they're 2^N sized/aligned and they're composed of
exact multiples of pages. What really matters are the correspondences
between folio size/alignment and medium/IO size/alignment, so you
could look on the folio as being a tool to disconnect the filesystem
from the concept of pages.

We could, in the future, in theory, allow the internal implementation
of a folio to shift from being a page array to being a kmalloc'd page
list or allow higher order units to be mixed in. The main thing we
have to stop people from doing is directly accessing the members of
the struct.

There are some tricky bits: kmap and mmapped page handling, for
example.
Some of this can be mitigated by making iov_iters handle folios (the
ITER_XARRAY type does, for example) and providing utilities to
populate scatterlists.

David
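[To illustrate the point about accessors above: the kernel really does
have folio_nr_pages() and folio_size() with these semantics, but the
two-field struct and the folio_offset() helper below are invented
stand-ins for this sketch, not the kernel's definitions. The idea is
only that callers never touch the representation directly, so it can
change underneath them:]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Simplified stand-in for the kernel's struct folio. Callers only see
 * the accessors below, so the "2^order contiguous base pages"
 * representation could later change without touching filesystem code. */
struct folio {
	unsigned int order;	/* folio spans 2^order base pages */
	unsigned long index;	/* position in the file, in base pages */
};

static inline unsigned long folio_nr_pages(const struct folio *folio)
{
	return 1UL << folio->order;
}

static inline unsigned long folio_size(const struct folio *folio)
{
	return PAGE_SIZE << folio->order;
}

/* Hypothetical helper: byte offset of file position 'pos' within the
 * folio that caches it. */
static inline unsigned long folio_offset(const struct folio *folio,
					 unsigned long long pos)
{
	return (unsigned long)(pos -
		(unsigned long long)folio->index * PAGE_SIZE);
}
```

A filesystem written against this surface never needs to know whether
the backing store is a page array, a kmalloc'd list, or something else.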
On Tue, Aug 24, 2021 at 12:48:13PM -0700, Linus Torvalds wrote:
> On Tue, Aug 24, 2021 at 12:38 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > "pageset" is such a great name that we already use it, so I guess that
> > doesn't work.
>
> Actually, maybe I can backtrack on that a bit.
>
> Maybe 'pageset' would work as a name. It's not used as a type right
> now, but the usage where we do have those comments around 'struct
> per_cpu_pages' are actually not that different from the folio kind of
> thing. It has a list of "pages" that have a fixed order.
>
> So that existing 'pageset' user might actually fit in conceptually.
> The 'pageset' is only really used in comments and as part of a field
> name, and the use does seem to be kind of consistent with Willy's
> use of an "aligned allocation-group of pages".

The 'pageset' in use in mm/page_alloc.c really seems to be more of a
pagelist than a pageset. The one concern I have about renaming it is
that we actually print the word 'pagesets' in /proc/zoneinfo. There's
also some infiniband driver that uses the word "pageset", which really
seems to mean "DMA range".

So if I rename the existing mm pageset to pagelist, and then modify
all these patches to call a folio a pageset, you'd take this patchset?
On Thu, Aug 26, 2021 at 09:58:06AM +0100, David Howells wrote:
> One thing I like about Willy's folio concept is that, as long as everyone uses
> the proper accessor functions and macros, we can mostly ignore the fact that
> they're 2^N sized/aligned and they're composed of exact multiples of pages.
> What really matters are the correspondences between folio size/alignment and
> medium/IO size/alignment, so you could look on the folio as being a tool to
> disconnect the filesystem from the concept of pages.
>
> We could, in the future, in theory, allow the internal implementation of a
> folio to shift from being a page array to being a kmalloc'd page list or
> allow higher order units to be mixed in. The main thing we have to stop
> people from doing is directly accessing the members of the struct.

In the current state of the folio patches, I agree with you. But
conceptually, folios are not disconnecting from the page beyond
PAGE_SIZE -> PAGE_SIZE * (1 << folio_order()). This is why I asked
what the intended endgame is. And I wonder if there is a bit of an
alignment issue between FS and MM people about the exact nature and
identity of this data structure.

At the current stage of conversion, folio is a more clearly delineated
API of what can be safely used from the FS for the interaction with
the page cache and memory management. And it looks still flexible to
make all sorts of changes, including how it's backed by memory.

Compared with the page, where parts of the API are for the FS, but
there are tons of members, functions, constants, and restrictions due
to the page's role inside MM core code. Things you shouldn't be using,
things you shouldn't be assuming from the fs side, but it's hard to
tell which is which, because struct page is a lot of things.

However, the MM narrative for folios is that they're an abstraction
for regular vs compound pages. This is rather generic.
Conceptually, it applies very broadly and deeply to MM core code:
anonymous memory handling, reclaim, swapping, even the slab allocator
uses them. If we follow through on this concept from the MM side - and
that seems to be the plan - it's inevitable that the folio API will
grow more MM-internal members, methods, as well as restrictions again
in the process. Except for the tail page bits, I don't see too much in
struct page that would not conceptually fit into this version of the
folio.

The cache_entry idea is really just to codify and retain that
domain-specific minimalism and clarity from the filesystem side. As
well as the flexibility around how backing memory is implemented,
which I think could come in handy soon, but isn't the sole reason.
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, Aug 26, 2021 at 09:58:06AM +0100, David Howells wrote:
> > One thing I like about Willy's folio concept is that, as long as everyone uses
> > the proper accessor functions and macros, we can mostly ignore the fact that
> > they're 2^N sized/aligned and they're composed of exact multiples of pages.
> > What really matters are the correspondences between folio size/alignment and
> > medium/IO size/alignment, so you could look on the folio as being a tool to
> > disconnect the filesystem from the concept of pages.
> >
> > We could, in the future, in theory, allow the internal implementation of a
> > folio to shift from being a page array to being a kmalloc'd page list or
> > allow higher order units to be mixed in. The main thing we have to stop
> > people from doing is directly accessing the members of the struct.
>
> In the current state of the folio patches, I agree with you. But
> conceptually, folios are not disconnecting from the page beyond
> PAGE_SIZE -> PAGE_SIZE * (1 << folio_order()). This is why I asked
> what the intended endgame is. And I wonder if there is a bit of an
> alignment issue between FS and MM people about the exact nature and
> identity of this data structure.

Possibly. I would guess there are a couple of reasons that on the MM
side particularly it's dealt with as a strict array of pages:
efficiency and mmap-related faults. It's most efficient to treat it as
an array of contiguous pages as that removes the need for indirection.
From the pov of mmap, faults happen along the lines of h/w page
divisions.

From an FS point of view, at minimum, I just need to know the state of
the folio. If a page fault dirties several folios, that's fine. If I
can find out that a folio was partially dirtied, that's useful, but
not critical.
I am a bit concerned about higher-order folios causing huge writes -
but I do realise that we might want to improve TLB/PT efficiency by
using larger entries and that that comes with consequences for mmapped
writes.

> At the current stage of conversion, folio is a more clearly delineated
> API of what can be safely used from the FS for the interaction with
> the page cache and memory management. And it looks still flexible to
> make all sorts of changes, including how it's backed by
> memory. Compared with the page, where parts of the API are for the FS,
> but there are tons of members, functions, constants, and restrictions
> due to the page's role inside MM core code. Things you shouldn't be
> using, things you shouldn't be assuming from the fs side, but it's
> hard to tell which is which, because struct page is a lot of things.

I definitely like the API cleanup that folios offer. However, I do
think Willy needs to better document the differences between some of
the functions, or at least when/where they should be used -
folio_mapping() and folio_file_mapping() being examples of this.

> However, the MM narrative for folios is that they're an abstraction
> for regular vs compound pages. This is rather generic. Conceptually,
> it applies very broadly and deeply to MM core code: anonymous memory
> handling, reclaim, swapping, even the slab allocator uses them. If we
> follow through on this concept from the MM side - and that seems to be
> the plan - it's inevitable that the folio API will grow more
> MM-internal members, methods, as well as restrictions again in the
> process. Except for the tail page bits, I don't see too much in struct
> page that would not conceptually fit into this version of the folio.
>
> The cache_entry idea is really just to codify and retain that
> domain-specific minimalism and clarity from the filesystem side.
> As
> well as the flexibility around how backing memory is implemented,
> which I think could come in handy soon, but isn't the sole reason.

I can see why you might want the clarification. However, at this
point, can you live with this set of folio patches? Can you live with
the name? Could you live with it if "folio" was changed to something
else?

I would really like to see this patchset get in. It's hanging over
changes I and others want to make that will conflict with Willy's
changes. If we can get the basic API of folios in now, that means I
can make my changes on top of them.

Thanks,
David
On Fri, Aug 27, 2021 at 06:03:25AM -0400, Johannes Weiner wrote:
> At the current stage of conversion, folio is a more clearly delineated
> API of what can be safely used from the FS for the interaction with
> the page cache and memory management. And it looks still flexible to
> make all sorts of changes, including how it's backed by
> memory. Compared with the page, where parts of the API are for the FS,
> but there are tons of members, functions, constants, and restrictions
> due to the page's role inside MM core code. Things you shouldn't be
> using, things you shouldn't be assuming from the fs side, but it's
> hard to tell which is which, because struct page is a lot of things.
>
> However, the MM narrative for folios is that they're an abstraction
> for regular vs compound pages. This is rather generic. Conceptually,
> it applies very broadly and deeply to MM core code: anonymous memory
> handling, reclaim, swapping, even the slab allocator uses them. If we
> follow through on this concept from the MM side - and that seems to be
> the plan - it's inevitable that the folio API will grow more
> MM-internal members, methods, as well as restrictions again in the
> process. Except for the tail page bits, I don't see too much in struct
> page that would not conceptually fit into this version of the folio.

So the superhypermegaultra ambitious version of this does something
like:

struct slab_page {
	unsigned long flags;
	union {
		struct list_head slab_list;
		struct {
			...
		};
	};
	struct kmem_cache *slab_cache;
	void *freelist;
	void *s_mem;
	unsigned int active;
	atomic_t _refcount;
	unsigned long memcg_data;
};

struct folio {
	... more or less as now ...
};

struct net_page {
	unsigned long flags;
	unsigned long pp_magic;
	struct page_pool *pp;
	unsigned long _pp_mapping_pad;
	unsigned long dma_addr[2];
	atomic_t _mapcount;
	atomic_t _refcount;
	unsigned long memcg_data;
};

struct page {
	union {
		struct folio folio;
		struct slab_page slab;
		struct net_page pool;
		...
	};
};

and then functions which only take one specific type of page use that
type. And the compiler will tell you that you can't pass a net_page
to a slab function, or vice versa. This is a lot more churn, and I'm
far from convinced that it's worth doing.

There's also the tricky "This page is mappable to userspace" kind of
functions, which (for example) includes vmalloc and net_page as well
as folios and random driver allocations, but shouldn't include slab
or page table pages. They're especially tricky because mapping to
userspace comes with rules around the use of the ->mapping field as
well as ->_mapcount.
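[The type-safety claim in the sketch above can be demonstrated with a
user-space toy. The field lists here are invented placeholders, not the
proposed kernel layout; the point is only that a function taking a
specific member type rejects the others at compile time:]

```c
#include <assert.h>
#include <stddef.h>

/* Toy versions of the per-use-case page types from the sketch above.
 * Fields are illustrative only. */
struct slab_page { unsigned long flags; void *freelist; };
struct net_page  { unsigned long flags; unsigned long dma_addr; };

/* One union of all the use cases, as in the sketch. */
struct page {
	union {
		struct slab_page slab;
		struct net_page pool;
	};
};

/* A function that only makes sense for slab pages takes the specific
 * type. Passing &page->pool here is a compile-time type error, which
 * is exactly the enforcement the sketch is after. */
static void *slab_first_free(struct slab_page *slab)
{
	return slab->freelist;
}
```

The churn cost is that every caller must name the right union member
instead of passing a bare struct page around.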
On Wed, Aug 25, 2021 at 05:45:55PM -0700, Darrick J. Wong wrote:
> Pardon my ignorance, but ... how would adding yet another layer help a
> filesystem? No matter how the software is structured, we have to set up
> and manage the (hardware) page state for programs, and we must keep that
> coherent with the file space mappings that we maintain. I already know
> how to deal with pages and dealing with "folios" seems about the same.
> Adding another layer of caching structures just adds another layer of
> cra^Wcoherency management for a filesystem to screw up.
>
> The folios change management of memory pages enough to disentangle the
> page/compound page confusion that exists now, and it seems like a
> reasonable means to supporting unreasonable things like copy on write
> storage for filesystems with a 56k block size.
>
> (And I'm sure I'll get tons of blowback for this, but XFS can manage
> space in weird units like that (configure the rt volume, set a 56k rt
> extent size, and all the allocations are multiples of 56k); if we ever
> wanted to support reflink on /that/ hot mess, it would be awesome to be
> able to say that we're only going to do 56k folios in the page cache for
> those files instead of the crazy writeback games that the prototype
> patchset does now.)

I'm guessing the reason you want 56k blocks is because with larger
filesystems and faster drives it would be a more reasonable unit for
managing this amount of data than 4k would be.

We have the same thoughts in MM and growing memory sizes. The DAX
stuff said from the start it won't be built on linear struct page
mappings anymore because we expect the memory modules to be too big to
manage them with such fine-grained granularity. But in practice, this
is more and more becoming true for DRAM as well. We don't want to
allocate gigabytes of struct page when on our servers only a very
small share of overall memory needs to be managed at this granularity.
Folio perpetuates the problem of the base page being the floor for
cache granularity, and so from an MM POV it doesn't allow us to scale
up to current memory sizes without horribly regressing certain
filesystem workloads that still need us to be able to scale down.

But there is something more important that I wish more MM people would
engage on: when you ask for 56k/2M/whatever buffers, the MM has to be
able to *allocate* them.

I'm assuming that while you certainly have preferences, you don't rely
too much on whether that memory is composed of a contiguous chunk of
4k pages, a single 56k page, a part of a 2M page, or maybe even
discontig 4k chunks with an SG API. You want to manage your disk space
one way, but you could afford the MM some flexibility to do the right
thing under different levels of memory load, and allow it to scale in
the direction it needs for its own purposes.

But if folios are also the low-level compound pages used throughout
the MM code, we're tying these fs allocations to the requirement of
being physically contiguous. This is a much more difficult allocation
problem. And from the MM side, we have a pretty poor track record of
serving contiguous memory larger than the base page size.

Since forever have non-MM people assumed that because the page
allocator takes an order argument you could make arbitrary 2^n
requests. When they inevitably complain that it doesn't work, even
under light loads, we tell them "lol order-0 or good luck".

Compaction has improved our ability to serve these requests, but only
*if you bring the time for defragmentation*. Many allocations don't.

THP has been around for years, but honestly it doesn't really work in
general purpose environments. Yeah, if you have some HPC number
cruncher that allocates all the anon at startup and then runs for
hours, it's fine. But in a more dynamic environment after some uptime,
the MM code just isn't able to produce these larger pages reliably and
within a reasonable deadline.
I'm assuming filesystem workloads won't bring the necessary patience
for this either. We've effectively declared bankruptcy on this
already. Many requests have been replaced with kvmalloc(), and THP has
been mostly relegated to the optimistic background tinkering of
khugepaged. You can't rely on it, so you need to structure your
expectations around it, and perform well when it isn't there. This
will apply to filesystems as well.

I really don't think it makes sense to discuss folios as the means for
enabling huge pages in the page cache, without also taking a long hard
look at the allocation model that is supposed to back them. Because
you can't make it happen without that. And this part isn't looking so
hot to me, tbh.

Willy says he has future ideas to make compound pages scale. But we
have years of history saying this is incredibly hard to achieve - and
it certainly wasn't for a lack of constant trying.

Decoupling the filesystems from struct page is a necessary step. I can
also see an argument for abstracting away compound pages to clean up
the compound_head() mess in all the helpers (although I'm still not
convinced the wholesale replacement of the page concept is the best
way to achieve this). But combining the two objectives, and making
compound pages the basis for huge page cache - after everything we
know about higher-order allocs - seems like a stretch to me.
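[The "gigabytes of struct page" cost mentioned earlier in the thread is
easy to check with back-of-the-envelope arithmetic. This is a sketch
assuming the common 64-byte struct page on 64-bit configs and 4 KiB base
pages; the helper name is invented for illustration:]

```c
#include <assert.h>

#define BASE_PAGE_SIZE   4096ULL
#define STRUCT_PAGE_SIZE 64ULL	/* typical sizeof(struct page) on 64-bit */

/* Bytes of memmap needed to describe 'ram_bytes' of memory at base-page
 * granularity: one struct page per 4 KiB page, i.e. 1/64 of RAM. */
static unsigned long long memmap_bytes(unsigned long long ram_bytes)
{
	return ram_bytes / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE;
}
```

For 1 TiB of RAM this works out to 16 GiB of memmap, matching the "my
1TB machine has 16GB occupied by memmap" figure quoted later in the
thread.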
On Fri, Aug 27, 2021 at 10:07:16AM -0400, Johannes Weiner wrote:
> We have the same thoughts in MM and growing memory sizes. The DAX
> stuff said from the start it won't be built on linear struct page
> mappings anymore because we expect the memory modules to be too big to
> manage them with such fine-grained granularity.

Well, I did. Then I left Intel, and Dan took over. Now we have a
struct page for each 4kB of PMEM. I'm not particularly happy about
this change of direction.

> But in practice, this
> is more and more becoming true for DRAM as well. We don't want to
> allocate gigabytes of struct page when on our servers only a very
> small share of overall memory needs to be managed at this granularity.

This is a much less compelling argument than you think. I had some
ideas along these lines and I took them to a performance analysis
group. They told me that for their workloads, doubling the amount of
DRAM in a system increased performance by ~10%. So increasing the
amount of DRAM by 1/63 is going to increase performance by 1/630 or
0.15%. There are more important performance wins to go after.

Even in the cloud space where increasing memory by 1/63 might increase
the number of VMs you can host by 1/63, how many PMs host as many as
63 VMs? ie does it really buy you anything? It sounds like a nice big
number ("My 1TB machine has 16GB occupied by memmap!"), but the real
benefit doesn't really seem to be there.

And of course, that assumes that you have enough other resources to
scale to 64/63 of your current workload; you might hit CPU, IO or some
other limit first.

> Folio perpetuates the problem of the base page being the floor for
> cache granularity, and so from an MM POV it doesn't allow us to scale
> up to current memory sizes without horribly regressing certain
> filesystem workloads that still need us to be able to scale down.

The mistake you're making is coupling "minimum mapping granularity"
with "minimum allocation granularity".
We can happily build a system which only allocates memory on 2MB
boundaries and yet lets you map that memory to userspace in 4kB
granules.

> I really don't think it makes sense to discuss folios as the means for
> enabling huge pages in the page cache, without also taking a long hard
> look at the allocation model that is supposed to back them. Because
> you can't make it happen without that. And this part isn't looking so
> hot to me, tbh.

Please, don't creep the scope of this project to "first, redesign the
memory allocator". This project is _if we can_, use larg(er) pages to
cache files. What Darrick is talking about is an entirely different
project that I haven't signed up for and won't.

> Willy says he has future ideas to make compound pages scale. But we
> have years of history saying this is incredibly hard to achieve - and
> it certainly wasn't for a lack of constant trying.

I genuinely don't understand. We have five primary users of memory in
Linux (once we're in a steady state after boot):

 - Anonymous memory
 - File-backed memory
 - Slab
 - Network buffers
 - Page tables

The relative importance of each one very much depends on your
workload. Slab already uses medium order pages and can be made to use
larger. Folios should give us large allocations of file-backed memory
and eventually anonymous memory. Network buffers seem to be headed
towards larger allocations too. Page tables will need some more
thought, but once we're no longer interleaving file cache pages, anon
pages and page tables, they become less of a problem to deal with.

Once everybody's allocating order-4 pages, order-4 pages become easy
to allocate. When everybody's allocating order-0 pages, order-4 pages
require the right 16 pages to come available, and that's really
freaking hard.
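[The last point, that order-4 is hard when everyone else allocates
order-0, can be illustrated with a deliberately crude fragmentation
model. This is an assumption-laden sketch, not how the buddy allocator
with compaction actually behaves: it just assumes each base page is free
independently with probability p, in which case an aligned order-n block
is entirely free with probability p^(2^n) - an exponential decay in the
order:]

```c
#include <assert.h>

/* Probability that an aligned 2^order block of base pages is entirely
 * free, under the (unrealistic) assumption that each base page is free
 * independently with probability p. Real allocators group allocations
 * by order and migrate pages to do much better, but the exponential
 * decay is the core of the problem being described. */
static double free_block_probability(double p, unsigned int order)
{
	double prob = 1.0;
	unsigned long i;

	for (i = 0; i < (1UL << order); i++)
		prob *= p;
	return prob;
}
```

With half of memory free (p = 0.5), an order-0 request succeeds half the
time in this model, while an aligned order-4 block (16 pages) is free
with probability 0.5^16, about 1.5e-5 - which is the "right 16 pages"
problem in numbers.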
On Fri, Aug 27, 2021 at 11:47 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Aug 27, 2021 at 10:07:16AM -0400, Johannes Weiner wrote:
> > We have the same thoughts in MM and growing memory sizes. The DAX
> > stuff said from the start it won't be built on linear struct page
> > mappings anymore because we expect the memory modules to be too big to
> > manage them with such fine-grained granularity.
>
> Well, I did. Then I left Intel, and Dan took over. Now we have a struct
> page for each 4kB of PMEM. I'm not particularly happy about this change
> of direction.

Page-less DAX left more problems than it solved. Meanwhile,
ZONE_DEVICE has spawned other useful things like peer-to-peer DMA.

I am more encouraged by efforts to make the 'struct page' overhead
disappear, first from Muchun Song for hugetlbfs and recently Joao
Martins for device-dax. If anything, I think 'struct page' for PMEM /
DAX *strengthens* the case for folios / better mechanisms to reduce
the overhead of tracking 4K pages.
On Fri, Aug 27, 2021 at 02:41:11PM -0700, Dan Williams wrote:
> On Fri, Aug 27, 2021 at 11:47 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Aug 27, 2021 at 10:07:16AM -0400, Johannes Weiner wrote:
> > > We have the same thoughts in MM and growing memory sizes. The DAX
> > > stuff said from the start it won't be built on linear struct page
> > > mappings anymore because we expect the memory modules to be too big to
> > > manage them with such fine-grained granularity.
> >
> > Well, I did. Then I left Intel, and Dan took over. Now we have a struct
> > page for each 4kB of PMEM. I'm not particularly happy about this change
> > of direction.
>
> Page-less DAX left more problems than it solved. Meanwhile,
> ZONE_DEVICE has spawned other useful things like peer-to-peer DMA.

ZONE_DEVICE has created more problems than it solved. Pageless memory
is a concept which still needs to be supported, and we could have made
a start on that five years ago. Instead you opted for the expeditious
solution.
On Mon, Aug 23, 2021 at 08:01:44PM +0100, Matthew Wilcox wrote:
> The following changes since commit f0eb870a84224c9bfde0dc547927e8df1be4267c:
>
>   Merge tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux (2021-07-18 11:27:25 -0700)
>
> are available in the Git repository at:
>
>   git://git.infradead.org/users/willy/pagecache.git tags/folio-5.15
>
> for you to fetch changes up to 1a90e9dae32ce26de43c1c5eddb3ecce27f2a640:
>
>   mm/writeback: Add folio_write_one (2021-08-15 23:04:07 -0400)

Running 'sed -i' across the patches and reapplying them got me this:

The following changes since commit f0eb870a84224c9bfde0dc547927e8df1be4267c:

  Merge tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux (2021-07-18 11:27:25 -0700)

are available in the Git repository at:

  git://git.infradead.org/users/willy/pagecache.git tags/pageset-5.15

for you to fetch changes up to dc185ab836d41729f15b2925a59c7dc29ae72377:

  mm/writeback: Add pageset_write_one (2021-08-27 22:52:26 -0400)

----------------------------------------------------------------
Pagesets

Add pagesets, a new type to represent either an order-0 page or the
head page of a compound page. This should be enough infrastructure
to support filesystems converting from pages to pagesets.
----------------------------------------------------------------
Matthew Wilcox (Oracle) (90):
      mm: Convert get_page_unless_zero() to return bool
      mm: Introduce struct pageset
      mm: Add pageset_pgdat(), pageset_zone() and pageset_zonenum()
      mm/vmstat: Add functions to account pageset statistics
      mm/debug: Add VM_BUG_ON_PAGESET() and VM_WARN_ON_ONCE_PAGESET()
      mm: Add pageset reference count functions
      mm: Add pageset_put()
      mm: Add pageset_get()
      mm: Add pageset_try_get_rcu()
      mm: Add pageset flag manipulation functions
      mm/lru: Add pageset LRU functions
      mm: Handle per-pageset private data
      mm/filemap: Add pageset_index(), pageset_file_page() and pageset_contains()
      mm/filemap: Add pageset_next_index()
      mm/filemap: Add pageset_pos() and pageset_file_pos()
      mm/util: Add pageset_mapping() and pageset_file_mapping()
      mm/filemap: Add pageset_unlock()
      mm/filemap: Add pageset_lock()
      mm/filemap: Add pageset_lock_killable()
      mm/filemap: Add __pageset_lock_async()
      mm/filemap: Add pageset_wait_locked()
      mm/filemap: Add __pageset_lock_or_retry()
      mm/swap: Add pageset_rotate_reclaimable()
      mm/filemap: Add pageset_end_writeback()
      mm/writeback: Add pageset_wait_writeback()
      mm/writeback: Add pageset_wait_stable()
      mm/filemap: Add pageset_wait_bit()
      mm/filemap: Add pageset_wake_bit()
      mm/filemap: Convert page wait queues to be pagesets
      mm/filemap: Add pageset private_2 functions
      fs/netfs: Add pageset fscache functions
      mm: Add pageset_mapped()
      mm: Add pageset_nid()
      mm/memcg: Remove 'page' parameter to mem_cgroup_charge_statistics()
      mm/memcg: Use the node id in mem_cgroup_update_tree()
      mm/memcg: Remove soft_limit_tree_node()
      mm/memcg: Convert memcg_check_events to take a node ID
      mm/memcg: Add pageset_memcg() and related functions
      mm/memcg: Convert commit_charge() to take a pageset
      mm/memcg: Convert mem_cgroup_charge() to take a pageset
      mm/memcg: Convert uncharge_page() to uncharge_pageset()
      mm/memcg: Convert mem_cgroup_uncharge() to take a pageset
      mm/memcg: Convert mem_cgroup_migrate() to take pagesets
mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to pageset mm/memcg: Add pageset_memcg_lock() and pageset_memcg_unlock() mm/memcg: Convert mem_cgroup_move_account() to use a pageset mm/memcg: Add pageset_lruvec() mm/memcg: Add pageset_lruvec_lock() and similar functions mm/memcg: Add pageset_lruvec_relock_irq() and pageset_lruvec_relock_irqsave() mm/workingset: Convert workingset_activation to take a pageset mm: Add pageset_pfn() mm: Add pageset_raw_mapping() mm: Add flush_dcache_pageset() mm: Add kmap_local_pageset() mm: Add arch_make_pageset_accessible() mm: Add pageset_young and pageset_idle mm/swap: Add pageset_activate() mm/swap: Add pageset_mark_accessed() mm/rmap: Add pageset_mkclean() mm/migrate: Add pageset_migrate_mapping() mm/migrate: Add pageset_migrate_flags() mm/migrate: Add pageset_migrate_copy() mm/writeback: Rename __add_wb_stat() to wb_stat_mod() flex_proportions: Allow N events instead of 1 mm/writeback: Change __wb_writeout_inc() to __wb_writeout_add() mm/writeback: Add __pageset_end_writeback() mm/writeback: Add pageset_start_writeback() mm/writeback: Add pageset_mark_dirty() mm/writeback: Add __pageset_mark_dirty() mm/writeback: Convert tracing writeback_page_template to pagesets mm/writeback: Add filemap_dirty_pageset() mm/writeback: Add pageset_account_cleaned() mm/writeback: Add pageset_cancel_dirty() mm/writeback: Add pageset_clear_dirty_for_io() mm/writeback: Add pageset_account_redirty() mm/writeback: Add pageset_redirty_for_writepage() mm/filemap: Add i_blocks_per_pageset() mm/filemap: Add pageset_mkwrite_check_truncate() mm/filemap: Add readahead_pageset() mm/workingset: Convert workingset_refault() to take a pageset mm: Add pageset_evictable() mm/lru: Convert __pagevec_lru_add_fn to take a pageset mm/lru: Add pageset_add_lru() mm/page_alloc: Add pageset allocation functions mm/filemap: Add filemap_alloc_pageset mm/filemap: Add filemap_add_pageset() mm/filemap: Convert mapping_get_entry to return a pageset mm/filemap: Add 
filemap_get_pageset mm/filemap: Add FGP_STABLE mm/writeback: Add pageset_write_one Documentation/core-api/cachetlb.rst | 6 + Documentation/core-api/mm-api.rst | 5 + Documentation/filesystems/netfs_library.rst | 2 + arch/arc/include/asm/cacheflush.h | 1 + arch/arm/include/asm/cacheflush.h | 1 + arch/mips/include/asm/cacheflush.h | 2 + arch/nds32/include/asm/cacheflush.h | 1 + arch/nios2/include/asm/cacheflush.h | 3 +- arch/parisc/include/asm/cacheflush.h | 3 +- arch/sh/include/asm/cacheflush.h | 3 +- arch/xtensa/include/asm/cacheflush.h | 3 +- fs/afs/write.c | 9 +- fs/cachefiles/rdwr.c | 16 +- fs/io_uring.c | 2 +- fs/jfs/jfs_metapage.c | 1 + include/asm-generic/cacheflush.h | 6 + include/linux/backing-dev.h | 6 +- include/linux/flex_proportions.h | 9 +- include/linux/gfp.h | 22 +- include/linux/highmem-internal.h | 11 + include/linux/highmem.h | 37 ++ include/linux/huge_mm.h | 15 - include/linux/ksm.h | 4 +- include/linux/memcontrol.h | 231 ++++++----- include/linux/migrate.h | 4 + include/linux/mm.h | 239 +++++++++--- include/linux/mm_inline.h | 103 +++-- include/linux/mm_types.h | 77 ++++ include/linux/mmdebug.h | 20 + include/linux/netfs.h | 77 ++-- include/linux/page-flags.h | 267 +++++++++---- include/linux/page_idle.h | 99 +++-- include/linux/page_owner.h | 8 +- include/linux/page_ref.h | 158 +++++++- include/linux/pagemap.h | 585 ++++++++++++++++++---------- include/linux/rmap.h | 10 +- include/linux/swap.h | 17 +- include/linux/vmstat.h | 113 +++++- include/linux/writeback.h | 9 +- include/trace/events/pagemap.h | 46 ++- include/trace/events/writeback.h | 28 +- kernel/bpf/verifier.c | 2 +- kernel/events/uprobes.c | 3 +- lib/flex_proportions.c | 28 +- mm/Makefile | 2 +- mm/compaction.c | 4 +- mm/filemap.c | 575 +++++++++++++-------------- mm/huge_memory.c | 7 +- mm/hugetlb.c | 2 +- mm/internal.h | 36 +- mm/khugepaged.c | 8 +- mm/ksm.c | 34 +- mm/memcontrol.c | 358 +++++++++-------- mm/memory-failure.c | 2 +- mm/memory.c | 20 +- mm/mempolicy.c | 10 + 
mm/memremap.c | 2 +- mm/migrate.c | 189 +++++---- mm/mlock.c | 3 +- mm/page-writeback.c | 477 +++++++++++++---------- mm/page_alloc.c | 14 +- mm/page_io.c | 4 +- mm/page_owner.c | 10 +- mm/pageset-compat.c | 142 +++++++ mm/rmap.c | 14 +- mm/shmem.c | 7 +- mm/swap.c | 197 +++++----- mm/swap_state.c | 2 +- mm/swapfile.c | 8 +- mm/userfaultfd.c | 2 +- mm/util.c | 111 +++--- mm/vmscan.c | 8 +- mm/workingset.c | 52 +-- 73 files changed, 2900 insertions(+), 1692 deletions(-) create mode 100644 mm/pageset-compat.c
On Fri, Aug 27, 2021 at 07:44:29PM +0100, Matthew Wilcox wrote:
> On Fri, Aug 27, 2021 at 10:07:16AM -0400, Johannes Weiner wrote:
> > We have the same thoughts in MM and growing memory sizes. The DAX
> > stuff said from the start it won't be built on linear struct page
> > mappings anymore because we expect the memory modules to be too big
> > to manage them with such fine-grained granularity.
>
> Well, I did. Then I left Intel, and Dan took over. Now we have a struct
> page for each 4kB of PMEM. I'm not particularly happy about this change
> of direction.
>
> > But in practice, this is more and more becoming true for DRAM as
> > well. We don't want to allocate gigabytes of struct page when on our
> > servers only a very small share of overall memory needs to be managed
> > at this granularity.
>
> This is a much less compelling argument than you think. I had some
> ideas along these lines and I took them to a performance analysis group.
> They told me that for their workloads, doubling the amount of DRAM in a
> system increased performance by ~10%. So increasing the amount of DRAM
> by 1/63 is going to increase performance by 1/630 or 0.15%. There are
> more important performance wins to go after.

Well, that's kind of obvious. Once a configuration is balanced for
CPU, memory, IO, network etc, adding sticks of RAM doesn't help;
neither will freeing some memory here and there. The short term isn't
where this matters.

It matters rather a lot, though, when we design and purchase the
hardware. RAM is becoming a larger share of overall machine cost, so
at-scale deployments like ours are under more pressure than ever to
provision it tightly. When we configure our systems we look at the
workloads' resource consumption ratios, as well as the kernel
overhead, and then we need to buy capacity accordingly.

> Even in the cloud space where increasing memory by 1/63 might increase
> the number of VMs you can host by 1/63, how many PMs host as many as
> 63 VMs? ie does it really buy you anything? It sounds like a nice big
> number ("My 1TB machine has 16GB occupied by memmap!"), but the real
> benefit doesn't really seem to be there. And of course, that assumes
> that you have enough other resources to scale to 64/63 of your current
> workload; you might hit CPU, IO or some other limit first.

A lot of DC hosts nowadays are in a direct pipeline for handling user
requests, which are highly parallelizable.

They are much smaller, and there are a lot more of them than there are
VMs in the world. The per-request and per-host margins are thinner,
and the compute-to-memory ratio is more finely calibrated than when
you're renting out large VMs that don't neatly divide up the machine.

Right now, we're averaging ~1G of RAM per CPU thread for most of our
hosts. You don't need a very large system - certainly not in the TB
ballpark - where struct page takes up the memory budget of entire CPU
threads. So now we have to spec memory for it, and spend additional
capex and watts, or we'll end up leaving those CPU threads stranded.

You're certainly right that there are configurations that likely won't
care much - especially more legacy, big-iron style stuff that isn't
quite as parallelized and as thinly provisioned. But you can't make
the argument that nobody will miss 16G in a 1TB host that has the CPU
concurrency and the parallel work to match it.

> > Folio perpetuates the problem of the base page being the floor for
> > cache granularity, and so from an MM POV it doesn't allow us to scale
> > up to current memory sizes without horribly regressing certain
> > filesystem workloads that still need us to be able to scale down.
>
> The mistake you're making is coupling "minimum mapping granularity" with
> "minimum allocation granularity". We can happily build a system which
> only allocates memory on 2MB boundaries and yet lets you map that memory
> to userspace in 4kB granules.

Yeah, but I want to do it without allocating 4k granule descriptors
statically at boot time for the entirety of available memory.

> > I really don't think it makes sense to discuss folios as the means
> > for enabling huge pages in the page cache, without also taking a long
> > hard look at the allocation model that is supposed to back them.
> > Because you can't make it happen without that. And this part isn't
> > looking so hot to me, tbh.
>
> Please, don't creep the scope of this project to "first, redesign
> the memory allocator". This project is _if we can_, use larg(er)
> pages to cache files. What Darrick is talking about is an entirely
> different project that I haven't signed up for and won't.

I never said the allocator needs to be fixed first. I've only been
advocating to remove (or keep out) unnecessary allocation assumptions
from folio to give us the flexibility to fix the allocator later on.

> > Willy says he has future ideas to make compound pages scale. But we
> > have years of history saying this is incredibly hard to achieve - and
> > it certainly wasn't for a lack of constant trying.
>
> I genuinely don't understand. We have five primary users of memory
> in Linux (once we're in a steady state after boot):
>
> - Anonymous memory
> - File-backed memory
> - Slab
> - Network buffers
> - Page tables
>
> The relative importance of each one very much depends on your workload.
> Slab already uses medium order pages and can be made to use larger.
> Folios should give us large allocations of file-backed memory and
> eventually anonymous memory. Network buffers seem to be headed towards
> larger allocations too. Page tables will need some more thought, but
> once we're no longer interleaving file cache pages, anon pages and
> page tables, they become less of a problem to deal with.
>
> Once everybody's allocating order-4 pages, order-4 pages become easy
> to allocate. When everybody's allocating order-0 pages, order-4 pages
> require the right 16 pages to come available, and that's really freaking
> hard.

Well yes, once (and iff) everybody is doing that. But for the
foreseeable future we're expecting to stay in a world where the
*majority* of memory is in larger chunks, while we continue to see 4k
cache entries, anon pages, and corresponding ptes, yes?

Memory is dominated by larger allocations from the main workloads, but
we'll continue to have a base system that does logging, package
upgrades, IPC stuff, has small config files, small libraries, small
executables. It'll be a while until we can raise the floor on those
much smaller allocations - if ever.

So we need a system to manage them living side by side.

The slab allocator has proven to be an excellent solution to this
problem, because the mailing lists are not flooded with OOM reports
where smaller allocations fragmented the 4k page space. And even large
temporary slab explosions (inodes, dentries etc.) are usually pushed
back with fairly reasonable CPU overhead.

The same really cannot be said for the untyped page allocator and the
various solutions we've had to address fragmentation after the fact.

Again, I'm not saying any of this needs to be actually *fixed* MM-side
to enable the huge page cache in the filesystems. I'd be more than
happy to go ahead with the "cache descriptor" aspect of the folio.

All I'm saying is we shouldn't double down on compound pages and tie
the filesystems to that anchor, just for that false synergy between
the new cache descriptor and fixing the compound_head() mess.
On Mon, Aug 30, 2021 at 01:32:55PM -0400, Johannes Weiner wrote:
> A lot of DC hosts nowadays are in a direct pipeline for handling user
> requests, which are highly parallelizable.
>
> They are much smaller, and there are a lot more of them than there are
> VMs in the world. The per-request and per-host margins are thinner,
> and the compute-to-memory ratio is more finely calibrated than when
> you're renting out large VMs that don't neatly divide up the machine.
>
> Right now, we're averaging ~1G of RAM per CPU thread for most of our
> hosts. You don't need a very large system - certainly not in the TB
> ballpark - where struct page takes up the memory budget of entire CPU
> threads. So now we have to spec memory for it, and spend additional
> capex and watts, or we'll end up leaving those CPU threads stranded.

So you're noticing at the level of a 64 thread machine (something like
a dual-socket Xeon Gold 5318H, which would have 2x18x2 = 72 threads).
Things certainly have changed, then.

> > The mistake you're making is coupling "minimum mapping granularity" with
> > "minimum allocation granularity". We can happily build a system which
> > only allocates memory on 2MB boundaries and yet lets you map that memory
> > to userspace in 4kB granules.
>
> Yeah, but I want to do it without allocating 4k granule descriptors
> statically at boot time for the entirety of available memory.

Even that is possible when bumping the PAGE_SIZE to 16kB. It needs a
bit of fiddling:

static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
			unsigned long addr, struct page *page, pgprot_t prot)
{
	if (!pte_none(*pte))
		return -EBUSY;
	/* Ok, finally just insert the thing.. */
	get_page(page);
	inc_mm_counter_fast(mm, mm_counter_file(page));
	page_add_file_rmap(page, false);
	set_pte_at(mm, addr, pte, mk_pte(page, prot));
	return 0;
}

mk_pte() assumes that a struct page refers to a single pte. If we
revamped it to take (page, offset, prot), it could construct the
appropriate pte for the offset within that page.

---

Independent of _that_, the biggest problem we face (I think) in getting
rid of memmap is that it offers the pfn_to_page() lookup. If we move to a
dynamically allocated descriptor for our arbitrarily-sized memory objects,
we need a tree to store them in. Given the trees we currently have,
our best bet is probably the radix tree, but I dislike its glass jaws.
I'm hoping that (again) the maple tree becomes stable soon enough for
us to dynamically allocate memory descriptors and store them in it.
And that we don't discover a bootstrapping problem between kmalloc()
(for tree nodes) and memmap (to look up the page associated with a node).

But that's all a future problem and if we can't even take a first step
to decouple filesystems from struct page then working towards that would
be wasted effort.

> > > Willy says he has future ideas to make compound pages scale. But we
> > > have years of history saying this is incredibly hard to achieve - and
> > > it certainly wasn't for a lack of constant trying.
> >
> > I genuinely don't understand. We have five primary users of memory
> > in Linux (once we're in a steady state after boot):
> >
> > - Anonymous memory
> > - File-backed memory
> > - Slab
> > - Network buffers
> > - Page tables
> >
> > The relative importance of each one very much depends on your workload.
> > Slab already uses medium order pages and can be made to use larger.
> > Folios should give us large allocations of file-backed memory and
> > eventually anonymous memory. Network buffers seem to be headed towards
> > larger allocations too. Page tables will need some more thought, but
> > once we're no longer interleaving file cache pages, anon pages and
> > page tables, they become less of a problem to deal with.
> >
> > Once everybody's allocating order-4 pages, order-4 pages become easy
> > to allocate. When everybody's allocating order-0 pages, order-4 pages
> > require the right 16 pages to come available, and that's really freaking
> > hard.
>
> Well yes, once (and iff) everybody is doing that. But for the
> foreseeable future we're expecting to stay in a world where the
> *majority* of memory is in larger chunks, while we continue to see 4k
> cache entries, anon pages, and corresponding ptes, yes?

No. 4k page table entries are demanded by the architecture, and there's
little we can do about that. We can allocate them in larger chunks, but
let's not solve that problem in this email. I can see a world where anon
memory is managed (by default, opportunistically) in larger chunks within
a year. Maybe six months if somebody really works hard on it.

> Memory is dominated by larger allocations from the main workloads, but
> we'll continue to have a base system that does logging, package
> upgrades, IPC stuff, has small config files, small libraries, small
> executables. It'll be a while until we can raise the floor on those
> much smaller allocations - if ever.
>
> So we need a system to manage them living side by side.
>
> The slab allocator has proven to be an excellent solution to this
> problem, because the mailing lists are not flooded with OOM reports
> where smaller allocations fragmented the 4k page space. And even large
> temporary slab explosions (inodes, dentries etc.) are usually pushed
> back with fairly reasonable CPU overhead.

You may not see the bug reports, but they exist. Right now, we have
a service that is echoing 2 to drop_caches every hour on systems which
are lightly loaded, otherwise the dcache swamps the entire machine and
takes hours or days to come back under control.
On Mon, Aug 30, 2021 at 07:22:25PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 30, 2021 at 01:32:55PM -0400, Johannes Weiner wrote:
> > > The mistake you're making is coupling "minimum mapping granularity" with
> > > "minimum allocation granularity". We can happily build a system which
> > > only allocates memory on 2MB boundaries and yet lets you map that memory
> > > to userspace in 4kB granules.
> >
> > Yeah, but I want to do it without allocating 4k granule descriptors
> > statically at boot time for the entirety of available memory.
>
> Even that is possible when bumping the PAGE_SIZE to 16kB. It needs a
> bit of fiddling:
>
> static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
> 			unsigned long addr, struct page *page, pgprot_t prot)
> {
> 	if (!pte_none(*pte))
> 		return -EBUSY;
> 	/* Ok, finally just insert the thing.. */
> 	get_page(page);
> 	inc_mm_counter_fast(mm, mm_counter_file(page));
> 	page_add_file_rmap(page, false);
> 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
> 	return 0;
> }
>
> mk_pte() assumes that a struct page refers to a single pte. If we
> revamped it to take (page, offset, prot), it could construct the
> appropriate pte for the offset within that page.

Right, page tables only need a pfn. The struct page is for us to
maintain additional state about the object.

For the objects that are subpage sized, we should be able to hold that
state (shrinker lru linkage, referenced bit, dirtiness, ...) inside
ad-hoc allocated descriptors.

Descriptors which could well be what struct folio {} is today, IMO. As
long as it doesn't innately assume, or will assume, in the API the
1:1+ mapping to struct page that is inherent to the compound page.

> Independent of _that_, the biggest problem we face (I think) in getting
> rid of memmap is that it offers the pfn_to_page() lookup. If we move to a
> dynamically allocated descriptor for our arbitrarily-sized memory objects,
> we need a tree to store them in. Given the trees we currently have,
> our best bet is probably the radix tree, but I dislike its glass jaws.
> I'm hoping that (again) the maple tree becomes stable soon enough for
> us to dynamically allocate memory descriptors and store them in it.
> And that we don't discover a bootstrapping problem between kmalloc()
> (for tree nodes) and memmap (to look up the page associated with a node).
>
> But that's all a future problem and if we can't even take a first step
> to decouple filesystems from struct page then working towards that would
> be wasted effort.

Agreed. Again, I'm just advocating to keep the doors open on that, and
avoid the situation where the filesystem folks run off and convert to
a flexible folio data structure, and the MM people run off and convert
all compound pages to folio and in the process hardcode assumptions
and turn it basically into struct page again that can't easily change.

> > > > Willy says he has future ideas to make compound pages scale. But we
> > > > have years of history saying this is incredibly hard to achieve - and
> > > > it certainly wasn't for a lack of constant trying.
> > >
> > > I genuinely don't understand. We have five primary users of memory
> > > in Linux (once we're in a steady state after boot):
> > >
> > > - Anonymous memory
> > > - File-backed memory
> > > - Slab
> > > - Network buffers
> > > - Page tables
> > >
> > > The relative importance of each one very much depends on your workload.
> > > Slab already uses medium order pages and can be made to use larger.
> > > Folios should give us large allocations of file-backed memory and
> > > eventually anonymous memory. Network buffers seem to be headed towards
> > > larger allocations too. Page tables will need some more thought, but
> > > once we're no longer interleaving file cache pages, anon pages and
> > > page tables, they become less of a problem to deal with.
> > >
> > > Once everybody's allocating order-4 pages, order-4 pages become easy
> > > to allocate. When everybody's allocating order-0 pages, order-4 pages
> > > require the right 16 pages to come available, and that's really freaking
> > > hard.
> >
> > Well yes, once (and iff) everybody is doing that. But for the
> > foreseeable future we're expecting to stay in a world where the
> > *majority* of memory is in larger chunks, while we continue to see 4k
> > cache entries, anon pages, and corresponding ptes, yes?
>
> No. 4k page table entries are demanded by the architecture, and there's
> little we can do about that.

I wasn't claiming otherwise..?

> > Memory is dominated by larger allocations from the main workloads, but
> > we'll continue to have a base system that does logging, package
> > upgrades, IPC stuff, has small config files, small libraries, small
> > executables. It'll be a while until we can raise the floor on those
> > much smaller allocations - if ever.
> >
> > So we need a system to manage them living side by side.
> >
> > The slab allocator has proven to be an excellent solution to this
> > problem, because the mailing lists are not flooded with OOM reports
> > where smaller allocations fragmented the 4k page space. And even large
> > temporary slab explosions (inodes, dentries etc.) are usually pushed
> > back with fairly reasonable CPU overhead.
>
> You may not see the bug reports, but they exist. Right now, we have
> a service that is echoing 2 to drop_caches every hour on systems which
> are lightly loaded, otherwise the dcache swamps the entire machine and
> takes hours or days to come back under control.

Sure, but compare that to the number of complaints about higher-order
allocations failing or taking too long (THP in the fault path e.g.)...

Typegrouping isn't infallible for fighting fragmentation, but it seems
to be good enough for most cases. Unlike the buddy allocator.
On Mon, Aug 30, 2021 at 04:27:04PM -0400, Johannes Weiner wrote:
> Right, page tables only need a pfn. The struct page is for us to
> maintain additional state about the object.
>
> For the objects that are subpage sized, we should be able to hold that
> state (shrinker lru linkage, referenced bit, dirtiness, ...) inside
> ad-hoc allocated descriptors.
>
> Descriptors which could well be what struct folio {} is today, IMO. As
> long as it doesn't innately assume, or will assume, in the API the
> 1:1+ mapping to struct page that is inherent to the compound page.

Maybe this is where we fundamentally disagree. I don't think there's
any point in *managing* memory in a different size from that in which
it is *allocated*. There's no point in tracking dirtiness, LRU
position, locked, etc, etc in different units from allocation size.
The point of tracking all these things is so we can allocate and free
memory. If a 'cache descriptor' reaches the end of the LRU and should
be reclaimed, that's wasted effort in tracking if the rest of the
'cache descriptor' is dirty and heavily in use. So a 'cache descriptor'
should always be at least a 'struct page' in size (assuming you're
using 'struct page' to mean "the size of the smallest allocation unit
from the page allocator")

> > > > I genuinely don't understand. We have five primary users of memory
> > > > in Linux (once we're in a steady state after boot):
> > > >
> > > > - Anonymous memory
> > > > - File-backed memory
> > > > - Slab
> > > > - Network buffers
> > > > - Page tables
> > > >
> > > > The relative importance of each one very much depends on your workload.
> > > > Slab already uses medium order pages and can be made to use larger.
> > > > Folios should give us large allocations of file-backed memory and
> > > > eventually anonymous memory. Network buffers seem to be headed towards
> > > > larger allocations too. Page tables will need some more thought, but
> > > > once we're no longer interleaving file cache pages, anon pages and
> > > > page tables, they become less of a problem to deal with.
> > > >
> > > > Once everybody's allocating order-4 pages, order-4 pages become easy
> > > > to allocate. When everybody's allocating order-0 pages, order-4 pages
> > > > require the right 16 pages to come available, and that's really freaking
> > > > hard.
> > >
> > > Well yes, once (and iff) everybody is doing that. But for the
> > > foreseeable future we're expecting to stay in a world where the
> > > *majority* of memory is in larger chunks, while we continue to see 4k
> > > cache entries, anon pages, and corresponding ptes, yes?
> >
> > No. 4k page table entries are demanded by the architecture, and there's
> > little we can do about that.
>
> I wasn't claiming otherwise..?

You snipped the part of my paragraph that made the 'No' make sense.
I'm agreeing that page tables will continue to be a problem, but
everything else (page cache, anon, networking, slab) I expect to be
using higher order allocations within the next year.

> > > The slab allocator has proven to be an excellent solution to this
> > > problem, because the mailing lists are not flooded with OOM reports
> > > where smaller allocations fragmented the 4k page space. And even large
> > > temporary slab explosions (inodes, dentries etc.) are usually pushed
> > > back with fairly reasonable CPU overhead.
> >
> > You may not see the bug reports, but they exist. Right now, we have
> > a service that is echoing 2 to drop_caches every hour on systems which
> > are lightly loaded, otherwise the dcache swamps the entire machine and
> > takes hours or days to come back under control.
>
> Sure, but compare that to the number of complaints about higher-order
> allocations failing or taking too long (THP in the fault path e.g.)...

Oh, we have those bug reports too ...

> Typegrouping isn't infallible for fighting fragmentation, but it seems
> to be good enough for most cases. Unlike the buddy allocator.

You keep saying that the buddy allocator isn't given enough information
to do any better, but I think it is. Page cache and anon memory are
marked with GFP_MOVABLE. Slab, network and page tables aren't. Is there
a reason that isn't enough?

I think something that might actually help is if we added a pair of new
GFP flags, __GFP_FAST and __GFP_DENSE. Dense allocations are those which
are expected to live for a long time, and so the page allocator should
try to group them with other dense allocations. Slab and page tables
should use DENSE, along with things like superblocks, or fs bitmaps where
the speed of allocation is almost unimportant, but attempting to keep
them out of the way of other allocations is useful. Fast allocations
are for allocations which should not live for very long. The speed of
allocation dominates, and it's OK if the allocation gets in the way of
defragmentation for a while.

An example of another allocator that could care about DENSE vs FAST
would be vmalloc. Today, it does:

	if (array_size > PAGE_SIZE) {
		area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
					area->caller);
	} else {
		area->pages = kmalloc_node(array_size, nested_gfp, node);
	}

That's actually pretty bad; if you have, say, a 768kB vmalloc space,
you need a 12kB array. We currently allocate 16kB for the array, when we
could use alloc_pages_exact() to free the 4kB we're never going to use.
If this is GFP_DENSE, we know it's a long-lived allocation and we can
let somebody else use the extra 4kB. If it's not, it's probably not
worth bothering with.
On 8/30/21 23:38, Matthew Wilcox wrote:
> I think something that might actually help is if we added a pair of new
> GFP flags, __GFP_FAST and __GFP_DENSE. Dense allocations are those which
> are expected to live for a long time, and so the page allocator should
> try to group them with other dense allocations. Slab and page tables
> should use DENSE, along with things like superblocks, or fs bitmaps where
> the speed of allocation is almost unimportant, but attempting to keep
> them out of the way of other allocations is useful. Fast allocations
> are for allocations which should not live for very long. The speed of
> allocation dominates, and it's OK if the allocation gets in the way of
> defragmentation for a while.

Note we used to have GFP_TEMPORARY, but it didn't really work out:
https://lwn.net/Articles/732107/

> An example of another allocator that could care about DENSE vs FAST
> would be vmalloc. Today, it does:
>
> 	if (array_size > PAGE_SIZE) {
> 		area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> 					area->caller);
> 	} else {
> 		area->pages = kmalloc_node(array_size, nested_gfp, node);
> 	}
>
> That's actually pretty bad; if you have, say, a 768kB vmalloc space,
> you need a 12kB array. We currently allocate 16kB for the array, when we
> could use alloc_pages_exact() to free the 4kB we're never going to use.
> If this is GFP_DENSE, we know it's a long-lived allocation and we can
> let somebody else use the extra 4kB. If it's not, it's probably not
> worth bothering with.
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Mon, Aug 30, 2021 at 07:22:25PM +0100, Matthew Wilcox wrote:
>> On Mon, Aug 30, 2021 at 01:32:55PM -0400, Johannes Weiner wrote:
>> > > The mistake you're making is coupling "minimum mapping granularity" with
>> > > "minimum allocation granularity". We can happily build a system which
>> > > only allocates memory on 2MB boundaries and yet lets you map that memory
>> > > to userspace in 4kB granules.
>> >
>> > Yeah, but I want to do it without allocating 4k granule descriptors
>> > statically at boot time for the entirety of available memory.
>>
>> Even that is possible when bumping the PAGE_SIZE to 16kB. It needs a
>> bit of fiddling:
>>
>> static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
>> 			unsigned long addr, struct page *page, pgprot_t prot)
>> {
>> 	if (!pte_none(*pte))
>> 		return -EBUSY;
>> 	/* Ok, finally just insert the thing.. */
>> 	get_page(page);
>> 	inc_mm_counter_fast(mm, mm_counter_file(page));
>> 	page_add_file_rmap(page, false);
>> 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
>> 	return 0;
>> }
>>
>> mk_pte() assumes that a struct page refers to a single pte. If we
>> revamped it to take (page, offset, prot), it could construct the
>> appropriate pte for the offset within that page.
>
> Right, page tables only need a pfn. The struct page is for us to
> maintain additional state about the object.
>
> For the objects that are subpage sized, we should be able to hold that
> state (shrinker lru linkage, referenced bit, dirtiness, ...) inside
> ad-hoc allocated descriptors.
>
> Descriptors which could well be what struct folio {} is today, IMO. As
> long as it doesn't innately assume, or will assume, in the API the
> 1:1+ mapping to struct page that is inherent to the compound page.

struct buffer_head any one?

I am being silly but when you say you want something that isn't a page
for caching that could be less than a page in size, it really sounds
like you want struct buffer_head.

The only actual problem I am aware of with struct buffer_head is that
it is a block device abstraction and does not map well to other
situations. Which makes network filesystems unable to use struct
buffer_head.

Eric
On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 24, 2021 at 08:23:15PM +0100, Matthew Wilcox wrote:
>
> So if someone sees "kmem_cache_alloc()", they can probably make a
> guess what it means, and it's memorable once they learn it.
> Similarly, something like "head_page", or "mempages" is going to be a bit
> more obvious to a kernel newbie. So if we can make a tiny gesture
> towards comprehensibility, it would be good to do so while it's still
> easier to change the name.

Talking about being newbie friendly, how about we'll just add a piece
of documentation along with the new type for a change?

Something along those lines (I'm sure willy can add several more
sentences for Folio description)

diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst
index 30e8fbed6914..b5b39ebe67cf 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -30,6 +30,29 @@ Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
 helpers that allow the conversion from PFN to `struct page` and vice
 versa.
 
+Pages
+-----
+
+Each physical page frame in the system is represented by a `struct page`.
+This structure aggregates several types, each corresponding to a
+particular usage of a page frame, such as anonymous memory, SLAB caches,
+file-backed memory etc. These types are defined within unions in the struct
+page to reduce memory footprint of the memory map.
+
+The actual type of the particular instance of struct page is determined by
+values of the fields shared between the different types and can be queried
+using page flag operations defined in ``include/linux/page-flags.h``
+
+Folios
+------
+
+For many use cases, single page frame granularity is too small. In such
+cases a contiguous range of memory can be referred to by a `struct folio`.
+
+A folio is a physically, virtually and logically contiguous range of
+bytes. It is a power-of-two in size, and it is aligned to that same
+power-of-two. It is at least as large as PAGE_SIZE.
+
 FLATMEM
 =======
On Mon, Aug 30, 2021 at 10:38:20PM +0100, Matthew Wilcox wrote: > On Mon, Aug 30, 2021 at 04:27:04PM -0400, Johannes Weiner wrote: > > Right, page tables only need a pfn. The struct page is for us to > > maintain additional state about the object. > > > > For the objects that are subpage sized, we should be able to hold that > > state (shrinker lru linkage, referenced bit, dirtiness, ...) inside > > ad-hoc allocated descriptors. > > > > Descriptors which could well be what struct folio {} is today, IMO. As > > long as it doesn't innately assume, or will assume, in the API the > > 1:1+ mapping to struct page that is inherent to the compound page. > > Maybe this is where we fundamentally disagree. I don't think there's > any point in *managing* memory in a different size from that in which it > is *allocated*. There's no point in tracking dirtiness, LRU position, > locked, etc, etc in different units from allocation size. The point of > tracking all these things is so we can allocate and free memory. If > a 'cache descriptor' reaches the end of the LRU and should be reclaimed, > that's wasted effort in tracking if the rest of the 'cache descriptor' > is dirty and heavily in use. So a 'cache descriptor' should always be > at least a 'struct page' in size (assuming you're using 'struct page' > to mean "the size of the smallest allocation unit from the page > allocator") First off, we've been doing this with the slab shrinker for decades. Second, you'll still be doing this when you track 4k struct pages in a system that is trying to serve primarily higher-order pages. Whether you free N cache descriptors to free a page, or free N pages to free a compound page, it's the same thing. You won't avoid this problem. > > > > Well yes, once (and iff) everybody is doing that. 
But for the > > > > foreseeable future we're expecting to stay in a world where the > > > > *majority* of memory is in larger chunks, while we continue to see 4k > > > > cache entries, anon pages, and corresponding ptes, yes? > > > > > > No. 4k page table entries are demanded by the architecture, and there's > > > little we can do about that. > > > > I wasn't claiming otherwise..? > > You snipped the part of my paragraph that made the 'No' make sense. > I'm agreeing that page tables will continue to be a problem, but > everything else (page cache, anon, networking, slab) I expect to be > using higher order allocations within the next year. Some, maybe, but certainly not all of them. I'd like to remind you of this analysis that Al did on the linux source tree with various page sizes: https://lore.kernel.org/linux-mm/YGVUobKUMUtEy1PS@zeniv-ca.linux.org.uk/ Page size Footprint 4Kb 1128Mb 8Kb 1324Mb 16Kb 1764Mb 32Kb 2739Mb 64Kb 4832Mb 128Kb 9191Mb 256Kb 18062Mb 512Kb 35883Mb 1Mb 71570Mb 2Mb 142958Mb Even just going to 32k more than doubles the cache footprint of this one repo. This is a no-go from a small-file scalability POV. I think my point stands: for the foreseeable future, we're going to continue to see demand for 4k cache entries as well as an increasing demand for 2M blocks in the page cache and for anonymous mappings. We're going to need an allocation model that can handle this. Luckily, we already do... > > > > The slab allocator has proven to be an excellent solution to this > > > > problem, because the mailing lists are not flooded with OOM reports > > > > where smaller allocations fragmented the 4k page space. And even large > > > > temporary slab explosions (inodes, dentries etc.) are usually pushed > > > > back with fairly reasonable CPU overhead. > > > > > > You may not see the bug reports, but they exist. 
Right now, we have > > > a service that is echoing 2 to drop_caches every hour on systems which > > > are lightly loaded, otherwise the dcache swamps the entire machine and > > > takes hours or days to come back under control. > > > > Sure, but compare that to the number of complaints about higher-order > > allocations failing or taking too long (THP in the fault path e.g.)... > > Oh, we have those bug reports too ... > > > Typegrouping isn't infallible for fighting fragmentation, but it seems > > to be good enough for most cases. Unlike the buddy allocator. > > You keep saying that the buddy allocator isn't given enough information to > do any better, but I think it is. Page cache and anon memory are marked > with GFP_MOVABLE. Slab, network and page tables aren't. Is there a > reason that isn't enough? Anon and cache don't have the same lifetime, and anon isn't reclaimable without swap. Yes, movable means we don't have to reclaim them, but background reclaim happens anyway due to the watermarks, and if that doesn't produce contiguous blocks by itself already then compaction has to run on top of that. This is where we tend to see the allocation latencies that prohibit THP allocations during page faults. I would say the same is true for page tables allocated alongside network buffers and unreclaimable slab pages. I.e. a burst in short-lived network buffer allocations being interleaved with long-lived page table allocations. Ongoing concurrency scaling is going to increase the likelihood of those happening. > I think something that might actually help is if we added a pair of new > GFP flags, __GFP_FAST and __GFP_DENSE. Dense allocations are those which > are expected to live for a long time, and so the page allocator should > try to group them with other dense allocations. Slab and page tables > should use DENSE, You're really just recreating a crappier, less maintainable version of the object packing that *slab already does*. 
It's *slab* that is supposed to deal with internal fragmentation, not the page allocator. The page allocator is good at cranking out uniform, slightly big memory blocks. The slab allocator is good at subdividing those into smaller objects, neatly packed and grouped to facilitate contiguous reclaim, while providing detailed breakdowns of per-type memory usage and internal fragmentation to the user and to kernel developers. [ And introspection and easy reporting from production are *really important*, because fragmentation issues develop over timelines that extend the usual testing horizon of kernel developers. ] By trying to make compound pages the norm, you're making internal fragmentation a first-class problem of the page allocator. This conflates the problem space between slab and the page allocator and it forces you to duplicate large parts of the solution. This is not about whether it's technically achievable. It's about making an incomprehensible mess of the allocator layering and having to solve a difficult MM problem in two places. Because you're trying to make compound pages into something they were never meant to be. They're fine for the odd optimistic allocation that can either wait forever to defragment or fall back gracefully. But there is just no way these things are going to be the maintainable route for transitioning to a larger page size. As long as this is your ambition with the folio, I'm sorry but it's a NAK from me.
On 1 Sep 2021, at 13:43, Johannes Weiner wrote: > On Mon, Aug 30, 2021 at 10:38:20PM +0100, Matthew Wilcox wrote: >> On Mon, Aug 30, 2021 at 04:27:04PM -0400, Johannes Weiner wrote: >>> Right, page tables only need a pfn. The struct page is for us to >>> maintain additional state about the object. >>> >>> For the objects that are subpage sized, we should be able to hold that >>> state (shrinker lru linkage, referenced bit, dirtiness, ...) inside >>> ad-hoc allocated descriptors. >>> >>> Descriptors which could well be what struct folio {} is today, IMO. As >>> long as it doesn't innately assume, or will assume, in the API the >>> 1:1+ mapping to struct page that is inherent to the compound page. >> >> Maybe this is where we fundamentally disagree. I don't think there's >> any point in *managing* memory in a different size from that in which it >> is *allocated*. There's no point in tracking dirtiness, LRU position, >> locked, etc, etc in different units from allocation size. The point of >> tracking all these things is so we can allocate and free memory. If >> a 'cache descriptor' reaches the end of the LRU and should be reclaimed, >> that's wasted effort in tracking if the rest of the 'cache descriptor' >> is dirty and heavily in use. So a 'cache descriptor' should always be >> at least a 'struct page' in size (assuming you're using 'struct page' >> to mean "the size of the smallest allocation unit from the page >> allocator") > > First off, we've been doing this with the slab shrinker for decades. > > Second, you'll still be doing this when you track 4k struct pages in a > system that is trying to serve primarily higher-order pages. Whether > you free N cache descriptors to free a page, or free N pages to free a > compound page, it's the same thing. You won't avoid this problem. > >>>>> Well yes, once (and iff) everybody is doing that. 
>>>>> But for the foreseeable future we're expecting to stay in a world
>>>>> where the *majority* of memory is in larger chunks, while we continue
>>>>> to see 4k cache entries, anon pages, and corresponding ptes, yes?
>>>>
>>>> No. 4k page table entries are demanded by the architecture, and there's
>>>> little we can do about that.
>>>
>>> I wasn't claiming otherwise..?
>>
>> You snipped the part of my paragraph that made the 'No' make sense.
>> I'm agreeing that page tables will continue to be a problem, but
>> everything else (page cache, anon, networking, slab) I expect to be
>> using higher order allocations within the next year.
>
> Some, maybe, but certainly not all of them. I'd like to remind you of
> this analysis that Al did on the linux source tree with various page
> sizes:
>
> https://lore.kernel.org/linux-mm/YGVUobKUMUtEy1PS@zeniv-ca.linux.org.uk/
>
> 	Page size	Footprint
> 	4Kb		1128Mb
> 	8Kb		1324Mb
> 	16Kb		1764Mb
> 	32Kb		2739Mb
> 	64Kb		4832Mb
> 	128Kb		9191Mb
> 	256Kb		18062Mb
> 	512Kb		35883Mb
> 	1Mb		71570Mb
> 	2Mb		142958Mb
>
> Even just going to 32k more than doubles the cache footprint of this
> one repo. This is a no-go from a small-file scalability POV.
>
> I think my point stands: for the foreseeable future, we're going to
> continue to see demand for 4k cache entries as well as an increasing
> demand for 2M blocks in the page cache and for anonymous mappings.
>
> We're going to need an allocation model that can handle this. Luckily,
> we already do...
>
>>>>> The slab allocator has proven to be an excellent solution to this
>>>>> problem, because the mailing lists are not flooded with OOM reports
>>>>> where smaller allocations fragmented the 4k page space. And even large
>>>>> temporary slab explosions (inodes, dentries etc.) are usually pushed
>>>>> back with fairly reasonable CPU overhead.
>>>>
>>>> You may not see the bug reports, but they exist.
Right now, we have >>>> a service that is echoing 2 to drop_caches every hour on systems which >>>> are lightly loaded, otherwise the dcache swamps the entire machine and >>>> takes hours or days to come back under control. >>> >>> Sure, but compare that to the number of complaints about higher-order >>> allocations failing or taking too long (THP in the fault path e.g.)... >> >> Oh, we have those bug reports too ... >> >>> Typegrouping isn't infallible for fighting fragmentation, but it seems >>> to be good enough for most cases. Unlike the buddy allocator. >> >> You keep saying that the buddy allocator isn't given enough information to >> do any better, but I think it is. Page cache and anon memory are marked >> with GFP_MOVABLE. Slab, network and page tables aren't. Is there a >> reason that isn't enough? > > Anon and cache don't have the same lifetime, and anon isn't > reclaimable without swap. Yes, movable means we don't have to reclaim > them, but background reclaim happens anyway due to the watermarks, and > if that doesn't produce contiguous blocks by itself already then > compaction has to run on top of that. This is where we tend to see the > allocation latencies that prohibit THP allocations during page faults. > > I would say the same is true for page tables allocated alongside > network buffers and unreclaimable slab pages. I.e. a burst in > short-lived network buffer allocations being interleaved with > long-lived page table allocations. Ongoing concurrency scaling is > going to increase the likelihood of those happening. > >> I think something that might actually help is if we added a pair of new >> GFP flags, __GFP_FAST and __GFP_DENSE. Dense allocations are those which >> are expected to live for a long time, and so the page allocator should >> try to group them with other dense allocations. 
Slab and page tables >> should use DENSE, > > You're really just recreating a crappier, less maintainable version of > the object packing that *slab already does*. > > It's *slab* that is supposed to deal with internal fragmentation, not > the page allocator. > > The page allocator is good at cranking out uniform, slightly big > memory blocks. The slab allocator is good at subdividing those into > smaller objects, neatly packed and grouped to facilitate contiguous > reclaim, while providing detailed breakdowns of per-type memory usage > and internal fragmentation to the user and to kernel developers. > > [ And introspection and easy reporting from production are *really > important*, because fragmentation issues develop over timelines that > extend the usual testing horizon of kernel developers. ] Initially, I thought it was a great idea to bump PAGE_SIZE to 2MB and use slab allocator like method for <2MB pages. But as I think about it more, I fail to see how it solves the existing fragmentation issues compared to our existing method, pageblock, since IMHO the fundamental issue of fragmentation in page allocation comes from mixing moveable and unmoveable pages in one pageblock, which does not exist in current slab allocation. There is no mix of reclaimable and unreclaimable objects in slab allocation, right? In my mind, reclaimable object is an analog of moveable page and unreclaimable object is an analog of unmoveable page. In addition, pageblock with different migrate types resembles how slab groups objects, so what is new in using slab instead of pageblock? My key question is do we allow mixing moveable sub-2MB data chunks with unmoveable sub-2MB data chunks in your new slab-like allocation method? If yes, how would kernel reclaim an order-0 (2MB) page that has an unmoveable sub-2MB data chunk? Isn’t it the same fragmentation situation we are facing nowadays when kernel tries to allocate a 2MB page but finds every 2MB pageblock has an unmoveable page? 
If no, why wouldn’t kernel do the same for pageblock? If kernel disallows page allocation fallbacks, so that unmoveable pages and moveable pages will not sit in a single pageblock, compaction and reclaim should be able to get a 2MB free page most of the time. And this would be a much smaller change, right? Let me know if I miss anything. -- Best Regards, Yan, Zi
On 9/2/21 17:13, Zi Yan wrote:
>> You're really just recreating a crappier, less maintainable version of
>> the object packing that *slab already does*.
>>
>> It's *slab* that is supposed to deal with internal fragmentation, not
>> the page allocator.
>>
>> The page allocator is good at cranking out uniform, slightly big
>> memory blocks. The slab allocator is good at subdividing those into
>> smaller objects, neatly packed and grouped to facilitate contiguous
>> reclaim, while providing detailed breakdowns of per-type memory usage
>> and internal fragmentation to the user and to kernel developers.
>>
>> [ And introspection and easy reporting from production are *really
>> important*, because fragmentation issues develop over timelines that
>> extend the usual testing horizon of kernel developers. ]
>
> Initially, I thought it was a great idea to bump PAGE_SIZE to 2MB and
> use slab allocator like method for <2MB pages. But as I think about it
> more, I fail to see how it solves the existing fragmentation issues
> compared to our existing method, pageblock, since IMHO the fundamental
> issue of fragmentation in page allocation comes from mixing moveable
> and unmoveable pages in one pageblock, which does not exist in current
> slab allocation. There is no mix of reclaimable and unreclaimable objects
> in slab allocation, right?

AFAICS that's correct. Slab caches can in general merge, as that
decreases memory usage (with the tradeoff of potentially mixing objects
with different lifetimes more). But SLAB_RECLAIM_ACCOUNT (a flag for
reclaimable caches) is part of SLAB_MERGE_SAME, so caches can only merge
if they are both reclaimable or not.

> In my mind, reclaimable object is an analog
> of moveable page and unreclaimable object is an analog of unmoveable page.

More precisely it resembles reclaimable and unreclaimable pages. Movable
pages can also be migrated, but slab objects cannot.
> In addition, pageblock with different migrate types resembles how
> slab groups objects, so what is new in using slab instead of pageblock?

Slab would be more strict in not allowing the merge. At the page
allocator level, if memory is exhausted, eventually a page of any type
can be allocated from a pageblock of any other type as part of the
fallback. The only really strict mechanism is the movable zone.

> My key question is do we allow mixing moveable sub-2MB data chunks with
> unmoveable sub-2MB data chunks in your new slab-like allocation method?
>
> If yes, how would kernel reclaim an order-0 (2MB) page that has an
> unmoveable sub-2MB data chunk? Isn’t it the same fragmentation situation
> we are facing nowadays when kernel tries to allocate a 2MB page but finds
> every 2MB pageblock has an unmoveable page?

Yes, any scheme where all pages are not movable can theoretically
degrade to a situation where at one moment all memory is allocated by
the unmovable pages, and later almost all pages were freed, but leaving
one unmovable page in each pageblock.

> If no, why wouldn’t kernel do the same for pageblock? If kernel disallows
> page allocation fallbacks, so that unmoveable pages and moveable pages
> will not sit in a single pageblock, compaction and reclaim should be able
> to get a 2MB free page most of the time. And this would be a much smaller
> change, right?

If we did that restriction of fallbacks, it would indeed be as strict as
slab is, but things could still degrade to unmovable pages scattered all
over the pageblocks as mentioned above. But since it's so similar to
slabs, the same thing could happen with slabs today, and I don't recall
reports of that happening massively? But of course slabs are not all 2MB
large, serving 4k pages.

> Let me know if I miss anything.
>
>
> --
> Best Regards,
> Yan, Zi
>
So what is the result here? Not having folios (with that or another name) is really going to set back making progress on sane support for huge pages. Both in the pagecache but also for other places like direct I/O.
On 9/9/21 14:43, Christoph Hellwig wrote: > So what is the result here? Not having folios (with that or another > name) is really going to set back making progress on sane support for > huge pages. Both in the pagecache but also for other places like direct > I/O. Yeah, the silence doesn't seem actionable. If naming is the issue, I believe Matthew had also a branch where it was renamed to pageset. If it's the unclear future evolution wrt supporting subpages of large pages, should we just do nothing until somebody turns that hypothetical future into code and we see whether it works or not?
On Thu, Sep 09, 2021 at 03:56:54PM +0200, Vlastimil Babka wrote:
> On 9/9/21 14:43, Christoph Hellwig wrote:
> > So what is the result here? Not having folios (with that or another
> > name) is really going to set back making progress on sane support for
> > huge pages. Both in the pagecache but also for other places like direct
> > I/O.

From my end, I have no objections to using the current shape of Willy's
data structure as a cache descriptor for the filesystem API:

struct foo {
	/* private: don't document the anon union */
	union {
		struct {
	/* public: */
			unsigned long flags;
			struct list_head lru;
			struct address_space *mapping;
			pgoff_t index;
			void *private;
			atomic_t _mapcount;
			atomic_t _refcount;
#ifdef CONFIG_MEMCG
			unsigned long memcg_data;
#endif
	/* private: the union with struct page is transitional */
		};
		struct page page;
	};
};

I also have no general objection to a *separate* folio or pageset or
whatever data structure to address the compound page mess inside VM
code. With its own cost/benefit analysis. For whatever is left after
the filesystems have been sorted out.

My objection is simply to one shared abstraction for both. There is
ample evidence from years of hands-on production experience that
compound pages aren't the way toward scalable and maintainable larger
page sizes from the MM side. And it's anything but obvious or
self-evident that just because struct page worked for both roles that
the same is true for compound pages.

Willy says it'll work out, I say it won't. We don't have code to prove
this either way right now. Why expose the filesystems to this gamble?

Nothing prevents us from putting a 'struct pageset pageset' or 'struct
folio folio' into a cache descriptor like above later on, right?

[ And IMO, the fact that filesystem people are currently exposed to,
and blocked on, mindnumbing internal MM discussions just further
strengthens the argument to disconnect the page cache frontend from
the memory allocation backend.
The fs folks don't care - and really shouldn't care - about any of this. I understand the frustration. ] Can we go ahead with the cache descriptor for now, and keep the door open on how they are backed from the MM side? We should be able to answer this without going too deep into MM internals. In the short term, this would unblock the fs people. In the longer term this would allow the fs people to focus on fs problems, and MM people to solve MM problems. > Yeah, the silence doesn't seem actionable. If naming is the issue, I believe > Matthew had also a branch where it was renamed to pageset. If it's the > unclear future evolution wrt supporting subpages of large pages, should we > just do nothing until somebody turns that hypothetical future into code and > we see whether it works or not? Folio or pageset works for compound pages, but implies unnecessary implementation details for a variable-sized cache descriptor, IMO. I don't love the name folio for compound pages, but I think it's actually hazardous for the filesystem API. To move forward with the filesystem bits, can we: 1. call it something - anything - that isn't tied to the page, or the nature of multiple pages? fsmem, fsblock, cachemem, cachent, I don't care too deeply and would rather have a less snappy name than a clever misleading one, 2. make things like folio_order(), folio_nr_pages(), folio_page() page_folio() private API in mm/internal.h, to acknowledge that these are current implementation details, not promises on how the cache entry will forever be backed in the future? 3. remove references to physical contiguity, PAGE_SIZE, anonymous pages - and really anything else that nobody has explicitly asked for yet - from the kerneldoc; generally keep things specced to what we need now, and not create dependencies against speculative future ambitions that may or may not pan out, 4. 
separate and/or table the bits that are purely about compound pages inside MM code and not relevant for the fs interface - things like the workingset.c and swap.c conversions (page_folio() usage seems like a good indicator for where it permeated too deeply into MM core code which then needs to translate back up again)?
On Thu, Sep 09, 2021 at 02:16:39PM -0400, Johannes Weiner wrote: > My objection is simply to one shared abstraction for both. There is > ample evidence from years of hands-on production experience that > compound pages aren't the way toward scalable and maintainable larger > page sizes from the MM side. And it's anything but obvious or > self-evident that just because struct page worked for both roles that > the same is true for compound pages. I object to this requirement. The folio work has been going on for almost a year now, and you come in AT THE END OF THE MERGE WINDOW to ask for it to do something entirely different from what it's supposed to be doing. If you'd asked for this six months ago -- maybe. But now is completely unreasonable. I don't think it's a good thing to try to do. I think that your "let's use slab for this" idea is bonkers and doesn't work. And I really object to you getting in the way of my patchset which has actual real-world performance advantages in order to whine about how bad the type system is in Linux without doing anything to help with it. Do something. Or stop standing in the way. Either works for me.
On 9/9/21 06:56, Vlastimil Babka wrote: > On 9/9/21 14:43, Christoph Hellwig wrote: >> So what is the result here? Not having folios (with that or another >> name) is really going to set back making progress on sane support for >> huge pages. Both in the pagecache but also for other places like direct >> I/O. > > Yeah, the silence doesn't seem actionable. If naming is the issue, I believe > Matthew had also a branch where it was renamed to pageset. If it's the > unclear future evolution wrt supporting subpages of large pages, should we > just do nothing until somebody turns that hypothetical future into code and > we see whether it works or not? > When I saw Matthew's proposal to rename folio --> pageset, my reaction was, "OK, this is a huge win!". Because: * The new name addressed Linus' concerns about naming, which unblocks it there, and * The new name seems to meet all of the criteria of the "folio" name, including even grep-ability, after a couple of tiny page_set and pageset cases are renamed--AND it also meets Linus' criteria for self-describing names. So I didn't want to add noise to that thread, but now that there is still some doubt about this, I'll pop up and suggest: do the huge 's/folio/pageset/g', and of course the associated renaming of the conflicting existing pageset and page_set cases, and then maybe it goes in. thanks,
On Thu, Sep 09, 2021 at 12:17:00PM -0700, John Hubbard wrote: > On 9/9/21 06:56, Vlastimil Babka wrote: > > On 9/9/21 14:43, Christoph Hellwig wrote: > > > So what is the result here? Not having folios (with that or another > > > name) is really going to set back making progress on sane support for > > > huge pages. Both in the pagecache but also for other places like direct > > > I/O. > > > > Yeah, the silence doesn't seem actionable. If naming is the issue, I believe > > Matthew had also a branch where it was renamed to pageset. If it's the > > unclear future evolution wrt supporting subpages of large pages, should we > > just do nothing until somebody turns that hypothetical future into code and > > we see whether it works or not? > > > > When I saw Matthew's proposal to rename folio --> pageset, my reaction was, > "OK, this is a huge win!". Because: > > * The new name addressed Linus' concerns about naming, which unblocks it > there, and > > * The new name seems to meet all of the criteria of the "folio" name, > including even grep-ability, after a couple of tiny page_set and pageset > cases are renamed--AND it also meets Linus' criteria for self-describing > names. > > So I didn't want to add noise to that thread, but now that there is still > some doubt about this, I'll pop up and suggest: do the huge > 's/folio/pageset/g', and of course the associated renaming of the conflicting > existing pageset and page_set cases, and then maybe it goes in. So I've done that. https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/tags/pageset-5.15 I sent it to Linus almost two weeks ago: https://lore.kernel.org/linux-mm/YSmtjVTqR9%2F4W1aq@casper.infradead.org/ Still nothing, so I presume he's still thinking about it.
On Thu, Sep 09, 2021 at 07:44:22PM +0100, Matthew Wilcox wrote: > On Thu, Sep 09, 2021 at 02:16:39PM -0400, Johannes Weiner wrote: > > My objection is simply to one shared abstraction for both. There is > > ample evidence from years of hands-on production experience that > > compound pages aren't the way toward scalable and maintainable larger > > page sizes from the MM side. And it's anything but obvious or > > self-evident that just because struct page worked for both roles that > > the same is true for compound pages. > > I object to this requirement. The folio work has been going on for almost > a year now, and you come in AT THE END OF THE MERGE WINDOW to ask for it > to do something entirely different from what it's supposed to be doing. > If you'd asked for this six months ago -- maybe. But now is completely > unreasonable. I asked for exactly this exactly six months ago. On March 22nd, I wrote this re: the filesystem interfacing: : So I think transitioning away from ye olde page is a great idea. I : wonder this: have we mapped out the near future of the VM enough to : say that the folio is the right abstraction? : : What does 'folio' mean when it corresponds to either a single page or : some slab-type object with no dedicated page? : : If we go through with all the churn now anyway, IMO it makes at least : sense to ditch all association and conceptual proximity to the : hardware page or collections thereof. Simply say it's some length of : memory, and keep thing-to-page translations out of the public API from : the start. I mean, is there a good reason to keep this baggage? It's not my fault you consistently dismissed and pushed past this question and then send a pull request anyway. > I don't think it's a good thing to try to do. I think that your "let's > use slab for this" idea is bonkers and doesn't work. Based on what exactly? You can't think it's that bonkers when you push for replicating slab-like grouping in the page allocator. 
Anyway, it was never about how larger pages will pan out in MM. It was about keeping some flexibility around the backing memory for cache entries, given that this is still an unsolved problem. This is not a crazy or unreasonable request, it's the prudent thing to do given the amount of open-ended churn and disruptiveness of your patches. It seems you're not interested in engaging in this argument. You prefer to go off on tangents and speculations about how the page allocator will work in the future, with seemingly little production experience about what does and doesn't work in real life; and at the same time dismiss the experience of people that deal with MM problems hands-on on millions of machines & thousands of workloads every day. > And I really object to you getting in the way of my patchset which > has actual real-world performance advantages So? You've gotten in the way of patches that removed unnecessary compound_head() calls and would have immediately provided some of these same advantages without hurting anybody - because the folio will eventually solve them all anyway. We all balance immediate payoff against what we think will be the right thing longer term. Anyway, if you think I'm bonkers, just ignore me. If not, maybe lay off the rhetoric, engage in a good-faith discussion and actually address my feedback?
Ugh. I'm not dealing with this shit. I'm supposed to be on holiday. I've been checking in to see what needs to happen for folios to be merged. But now I'm just fucking done. I shan't be checking my email until September 19th. Merge the folio branch, merge the pageset branch, or don't merge anything. I don't fucking care any more. On Thu, Sep 09, 2021 at 06:03:17PM -0400, Johannes Weiner wrote: > On Thu, Sep 09, 2021 at 07:44:22PM +0100, Matthew Wilcox wrote: > > On Thu, Sep 09, 2021 at 02:16:39PM -0400, Johannes Weiner wrote: > > > My objection is simply to one shared abstraction for both. There is > > > ample evidence from years of hands-on production experience that > > > compound pages aren't the way toward scalable and maintainable larger > > > page sizes from the MM side. And it's anything but obvious or > > > self-evident that just because struct page worked for both roles that > > > the same is true for compound pages. > > > > I object to this requirement. The folio work has been going on for almost > > a year now, and you come in AT THE END OF THE MERGE WINDOW to ask for it > > to do something entirely different from what it's supposed to be doing. > > If you'd asked for this six months ago -- maybe. But now is completely > > unreasonable. > > I asked for exactly this exactly six months ago. > > On March 22nd, I wrote this re: the filesystem interfacing: > > : So I think transitioning away from ye olde page is a great idea. I > : wonder this: have we mapped out the near future of the VM enough to > : say that the folio is the right abstraction? > : > : What does 'folio' mean when it corresponds to either a single page or > : some slab-type object with no dedicated page? > : > : If we go through with all the churn now anyway, IMO it makes at least > : sense to ditch all association and conceptual proximity to the > : hardware page or collections thereof. 
Simply say it's some length of > : memory, and keep thing-to-page translations out of the public API from > : the start. I mean, is there a good reason to keep this baggage? > > It's not my fault you consistently dismissed and pushed past this > question and then send a pull request anyway. > > > I don't think it's a good thing to try to do. I think that your "let's > > use slab for this" idea is bonkers and doesn't work. > > Based on what exactly? > > You can't think it's that bonkers when you push for replicating > slab-like grouping in the page allocator. > > Anyway, it was never about how larger pages will pan out in MM. It was > about keeping some flexibility around the backing memory for cache > entries, given that this is still an unsolved problem. This is not a > crazy or unreasonable request, it's the prudent thing to do given the > amount of open-ended churn and disruptiveness of your patches. > > It seems you're not interested in engaging in this argument. You > prefer to go off on tangents and speculations about how the page > allocator will work in the future, with seemingly little production > experience about what does and doesn't work in real life; and at the > same time dismiss the experience of people that deal with MM problems > hands-on on millions of machines & thousands of workloads every day. > > > And I really object to you getting in the way of my patchset which > > has actual real-world performance advantages > > So? You've gotten in the way of patches that removed unnecessary > compound_head() call and would have immediately provided some of these > same advantages without hurting anybody - because the folio will > eventually solve them all anyway. > > We all balance immediate payoff against what we think will be the > right thing longer term. > > Anyway, if you think I'm bonkers, just ignore me. If not, maybe lay > off the rhetorics, engage in a good-faith discussion and actually > address my feedback?
So I've been following the folio discussion, and it seems like the discussion has gone off the rails a bit partly just because struct page is such a mess and has been so overused, and we all want to see that cleaned up but we're not being clear about what that means. I was just talking with Johannes off list, and I thought I'd recap that discussion as well as other talks with Matthew and see if I can lay something out that everyone agrees with. Some background: For some years now, the overhead of dealing with 4k pages in the page cache has gotten really, really painful. Any time we're doing buffered IO, we end up walking a radix tree to get to the cached page, then doing a memcpy to or from that page - which quite conveniently blows away the CPU cache - then walking the radix tree to look up the next page, often touching locks along the way that are no longer in cache - it's really bad. We've been hacking around this - the btrfs people have a vectorized buffered write path, and also this is what my generic_file_buffered_read() patches were about, batching up the page cache lookups - but really these are hacks that make our core IO paths even more complicated, when the right answer that's been staring all of us filesystem people in the face for years has been that it's 2021 and dealing with cached data in 4k chunks (when block based filesystems are a thing of the past!) is abject stupidity. So we need to be moving to larger, variable sized allocations for cached data. We NEED this, this HAS TO HAPPEN - spend some time really digging into profiles, and looking at actual application usage, this is the #1 thing that's killing our performance in the IO paths. Remember, us developers tend to be benchmarking things like direct IO and small random IOs because we're looking at the whole IO path, but most reads and writes are buffered, and they're already in cache, and they're mostly big and sequential.
I emphasize this because a lot of us have really been waiting rather urgently for Willy's work to go in, and there will no doubt be a lot more downstream filesystem work to be done to fully take advantage of it and we're waiting on this stuff to get merged so we can actually start testing and profiling the brave new world and seeing what to work on next. As an aside, before this there have been quite a few attempts at using hugepages to deal with these issues, and they're all _fucking gross_, because they all do if (normal page) else if (hugepage), and they all cut and paste filemap.c code because no one (rightly) wanted to add their abortions to the main IO paths. But look around the kernel and see how many times you can find core filemap.c code duplicated elsewhere... Anyways, Willy's work is going to let us delete all that crap. So: this all means that filesystem code needs to start working in larger, variable sized units, which today means - compound pages. Hence, the folio work started out as a wrapper around compound pages. So, one objection to folios has been that they leak too many MM details out into the filesystem code. To that we must point out: all the code that's going to be using folios is right now using struct page - this isn't leaking out new details and making things worse, this is actually (potentially!) a step in the right direction, by moving some users of struct page to a new type that is actually created for a specific purpose. I think a lot of the acrimony in this discussion came precisely from this mess; Johannes and the other MM people would like to see this situation improved so that they have more freedom to reengineer and improve things on their side. One particularly noteworthy idea was having struct page refer to multiple hardware pages, and using slab/slub for larger allocations.
In my view, the primary reason for making this change isn't the memory overhead to struct page (though reducing that would be nice); it's that the slab allocator is _significantly_ faster than the buddy allocator (the buddy allocator isn't percpu!) and as average allocation sizes increase, this is hurting us more and more over time. So we should listen to the MM people. Fortunately, Matthew made a big step in the right direction by making folios a new type. Right now, struct folio is not separately allocated - it's just unionized/overlayed with struct page - but perhaps in the future they could be separately allocated. I don't think that is a remotely realistic goal for _this_ patch series given the amount of code that touches struct page (think: writeback code, LRU list code, page fault handlers!) - but I think that's a goal we could keep in mind going forward. We should also be clear on what _exactly_ folios are for, so they don't become the new dumping ground for everyone to stash their crap. They're to be a new core abstraction, and we should endeavor to keep our core data structures _small_, and _simple_. So: no scatter gather. A folio should just represent a single buffer of physically contiguous memory - vmap is slow, kmap_atomic() only works on single pages, we do _not_ want to make filesystem code jump through hoops to deal with anything else. The buffers should probably be power of two sized, as that's what the buddy allocator likes to give us - that doesn't necessarily have to be baked into the design, but I can't see us ever actually wanting non power of two sized allocations. Q: But what about fragmentation? Won't these allocations fail sometimes? Yes, and that's OK. The relevant filesystem code is all changing to handle variable sized allocations, so it's completely fine if we fail a 256k allocation and we have to fall back to whatever is available.
But also keep in mind that switching the biggest consumer of kernel side memory to larger allocations is going to do more than anything else to help prevent memory from getting fragmented in the first place. We _want_ this. Q: Oh yeah, but what again are folios for, exactly? Folios are for cached filesystem data which (importantly) may be mapped to userspace. So when MM people see a new data structure come up with new references to page size - there's a very good reason for that, which is that we need to be allocating in multiples of the hardware page size if we're going to be able to map it to userspace and have PTEs point to it. So going forward, if the MM people want struct page to refer to multiple hardware pages - this shouldn't prevent that, and folios will refer to multiples of the _hardware_ page size, not the struct page size. Also - all the filesystem code that's being converted tends to talk and think in units of pages. So going forward, it would be a nice cleanup to get rid of as many of those references as possible and just talk in terms of bytes (e.g. I have generally been trying to get rid of references to PAGE_SIZE in bcachefs wherever reasonable, for other reasons) - those cleanups are probably for another patch series, and in the interests of getting this patch series merged with the fewest introduced bugs possible we probably want the current helpers. ------------- That's my recap, I hope I haven't missed anything. The TL;DR is: * struct page is a mess; yes, we know. We're all living with that pain. * This isn't our ultimate end goal (nothing ever is!) - but it's probably along the right path. * Going forward: maybe struct folio should be separately allocated. That will entail a lot more work so it's not appropriate for this patch series, but I think it's a goal that would make everyone happy. * We should probably think and talk more concretely about what our end goals are.
Getting away from struct page is something that comes up again and again - DAX is another notable (and acrimonious) area where this has come up. Also, page->mapping and page->index make sharing cached data in different files (think: reflink, snapshots) pretty much non-starters. I'm going to publicly float one of my own ideas here: maybe entries in the page cache radix tree don't have to be just a single pointer/ulong. If those entries were bigger, perhaps some things would fit better there than in either struct page/folio. Excessive PAGE_SIZE usage: -------------------------- Another thing that keeps coming up is - indiscriminate use of PAGE_SIZE makes it hard, especially when we're reviewing new code, to tell what's a legitimate use or not. When it's tied to the hardware page size (as folios are), it's probably legitimate, but PAGE_SIZE is _way_ overused. Partly this was because historically slab had to be used for small allocations and the buddy allocator, __get_free_pages(), had to be used for larger allocations. This is still somewhat the case - slab can go up to something like 128k, but there's still a hard cap on allocation size with kmalloc(). Perhaps the MM people could look into lifting this restriction, so that kmalloc() could be used for any sized physically contiguous allocation that the system could satisfy? If we had this, then it would make it more practical to go through and refactor existing code that uses __get_free_pages() and convert it to kmalloc(), without having to stare at code and figure out if it's safe. And that's my $.02
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> So we should listen to the MM people.
Count me here.
I think the problem with folio is that everybody reads their own
hopes and dreams into it and gets disappointed when they see that
their somewhat related problem doesn't get magically fixed with folio.
Folio started as a way to relieve the pain of dealing with compound pages.
It provides a unified view of base pages and compound pages. That's it.
It is the required groundwork for wider adoption of compound pages in the
page cache. But it will also be useful for anon THP and hugetlb.
Based on the adoption rate and the resulting code, the new abstraction has
nice downstream effects. It may be suitable for more than it was initially
intended for. That's great.
But if it doesn't solve your problem... well, sorry...
The patchset makes a nice step forward and cuts back on the mess I created on
the way to huge-tmpfs.
I would be glad to see the patchset upstream.
On Sat 11-09-21 04:23:24, Kirill A. Shutemov wrote: > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > > So we should listen to the MM people. > > Count me here. > > I think the problem with folio is that everybody wants to read in her/his > hopes and dreams into it and gets disappointed when see their somewhat > related problem doesn't get magically fixed with folio. > > Folio started as a way to relief pain from dealing with compound pages. > It provides an unified view on base pages and compound pages. That's it. > > It is required ground work for wider adoption of compound pages in page > cache. But it also will be useful for anon THP and hugetlb. > > Based on adoption rate and resulting code, the new abstraction has nice > downstream effects. It may be suitable for more than it was intended for > initially. That's great. > > But if it doesn't solve your problem... well, sorry... > > The patchset makes a nice step forward and cuts back on mess I created on > the way to huge-tmpfs. > > I would be glad to see the patchset upstream. I do agree here. While the points Johannes brought up are relevant and worth thinking about, I also see a clear advantage in what folio (or whatever $name) is bringing. The compound page handling is just a mess and a source of practical problems and bugs. This really requires some systematic approach to deal with it. The proposed type system is definitely a good way to approach it. Johannes is not happy about having the type still refer to page units but I haven't seen an example where that leads to worse or harder-to-maintain code so far. The evolution is likely not going to stop at the current type system but I haven't seen any specifics to prove it would stand in the way. The existing code (fs or other subsystems interacting with MM) is going to require quite a lot of changes to move away from the struct page notion, but I do not see folios adding a fundamental blocker there.
All that being said, not only do I see folios as a step in the right direction to address the compound page mess, it is also code that already exists and gives some real advantages. I haven't heard anybody subscribing to a different approach and providing an implementation in a foreseeable future, so I would rather go with this approach than keep dealing with the existing code long term.
On Mon, Sep 13, 2021 at 01:32:30PM +0200, Michal Hocko wrote: > The existing code (fs or other subsystem interacting with MM) is > going to require quite a lot of changes to move away from struct > page notion but I do not see folios to add fundamental blocker > there. The current folio seems to do quite a bit of that work, actually. But it'll be undone when the MM conversion matures the data structure into the full-blown new page. It's not about hopes and dreams, it's the simple fact that the patches do something now that seems very valuable, but which we'll lose again over time. And avoiding that is a relatively minor adjustment at this time compared to a much larger one later on. So yeah, it's not really a blocker. It's just a missed opportunity to lastingly disentangle struct page's multiple roles when touching all the relevant places anyway. It's also (needlessly) betting that compound pages can be made into a scalable, reliable, and predictable allocation model, and proliferating them into fs/ based on that. These patches, and all the ones that will need to follow to finish the conversion, are exceptionally expensive. It would have been nice to get more out of this disruption than to identify the relatively few places that genuinely need compound_head(), and having a datatype for N contiguous pages. Is there merit in solving those problems? Sure. Is it a robust, forward-looking direction for the MM space that justifies the cost of these and later patches? You seem to think so, I don't. It doesn't look like we'll agree on this. But I think I've made my points several times now, so I'll defer to Linus and Andrew.
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > One particularly noteworthy idea was having struct page refer to > multiple hardware pages, and using slab/slub for larger > alloctions. In my view, the primary reason for making this change > isn't the memory overhead to struct page (though reducing that would > be nice); Don't underestimate this, however. Picture the near future Willy describes, where we don't bump struct page size yet but serve most cache with compound huge pages. On x86, it would mean that the average page cache entry has 512 mapping pointers, 512 index members, 512 private pointers, 1024 LRU list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate flags, 512 memcg pointers etc. - you get the idea. This is a ton of memory. I think this doesn't get more traction because it's memory we've always allocated, and we're simply more sensitive to regressions than long-standing pain. But nevertheless this is a pretty low-hanging fruit. The folio makes a great first step moving those into a separate data structure, opening the door to one day realizing these savings. Even when some MM folks say this was never the intent behind the patches, I think this is going to matter significantly, if not more so, later on. > Fortunately, Matthew made a big step in the right direction by making folios a > new type. Right now, struct folio is not separately allocated - it's just > unionized/overlayed with struct page - but perhaps in the future they could be > separately allocated. I don't think that is a remotely realistic goal for _this_ > patch series given the amount of code that touches struct page (thing: writeback > code, LRU list code, page fault handlers!) - but I think that's a goal we could > keep in mind going forward. Yeah, agreed. Not doable out of the gate, but retaining the ability to allocate the "cache entry descriptor" bits - mapping, index etc. - on-demand would be a huge benefit down the road for the above reason. 
For that they would have to be in - and stay in - their own type. > We should also be clear on what _exactly_ folios are for, so they don't become > the new dumping ground for everyone to stash their crap. They're to be a new > core abstraction, and we should endeaver to keep our core data structures > _small_, and _simple_. Right. struct page is a lot of things and anything but simple and obvious today. struct folio in its current state does a good job separating some of that stuff out. However, when we think about *which* of the struct page mess the folio wants to address, I think that bias toward recent pain over much bigger long-standing pain strikes again. The compound page proliferation is new, and we're sensitive to the ambiguity it created between head and tail pages. It's added some compound_head() in lower-level accessor functions that are not necessary for many contexts. The folio type safety will help clean that up, and this is great. However, there is a much bigger, systematic type ambiguity in the MM world that we've just gotten used to over the years: anon vs file vs shmem vs slab vs ... 
- Many places rely on context to say "if we get here, it must be
  anon/file", and then unsafely access overloaded member elements:
  page->mapping, PG_readahead, PG_swapcache, PG_private

- On the other hand, we also have low-level accessor functions that
  disambiguate the type and impose checks on contexts that may or may
  not actually need them - not unlike compound_head() in PageActive():

	struct address_space *folio_mapping(struct folio *folio)
	{
		struct address_space *mapping;

		/* This happens if someone calls flush_dcache_page on slab page */
		if (unlikely(folio_test_slab(folio)))
			return NULL;

		if (unlikely(folio_test_swapcache(folio)))
			return swap_address_space(folio_swap_entry(folio));

		mapping = folio->mapping;
		if ((unsigned long)mapping & PAGE_MAPPING_ANON)
			return NULL;

		return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
	}

  Then we go identify places that say "we know it's at least not a
  slab page!" and convert them to page_mapping_file() which IS safe to
  use with anon. Or we say "we know this MUST be a file page" and just
  access the (unsafe) mapping pointer directly.

- We have a singular page lock, but what it guards depends on what
  type of page we're dealing with. For a cache page it protects
  uptodate and the mapping. For an anon page it protects swap state.

  A lot of us can remember the rules if we try, but the code doesn't
  help and it gets really tricky when dealing with multiple types of
  pages simultaneously. Even mature code like reclaim just serializes
  the operation instead of protecting data - the writeback checks and
  the page table reference tests don't seem to need page lock.

  When the cgroup folks wrote the initial memory controller, they just
  added their own page-scope lock to protect page->memcg even though
  the page lock would have covered what it needed.
- shrink_page_list() uses page_mapping() in the first half of the
  function to tell whether the page is anon or file, but halfway
  through we do this:

	/* Adding to swap updated mapping */
	mapping = page_mapping(page);

  and then use PageAnon() to disambiguate the page type.

- At activate_locked:, we check PG_swapcache directly on the page and
  rely on it doing the right thing for anon, file, and shmem pages.
  But this flag is PG_owner_priv_1 and actually used by the filesystem
  for something else. I guess PG_checked pages currently don't make it
  this far in reclaim, or we'd crash somewhere in try_to_free_swap().

  I suppose we're also never calling page_mapping() on PageChecked
  filesystem pages right now, because it would return a swap mapping
  before testing whether this is a file page. You know, because shmem.

These are just a few examples from an MM perspective. I'm sure the FS folks have their own stories and examples about pitfalls in dealing with struct page members. We're so used to this that we don't realize how much bigger and pervasive this lack of typing is than the compound page thing. I'm not saying the compound page mess isn't worth fixing. It is. I'm saying if we started with a file page or cache entry abstraction we'd solve not only the huge page cache, but also set us up for a MUCH more comprehensive cleanup in MM code and MM/FS interaction that makes the tailpage cleanup pale in comparison - at the same amount of churn, since folio would also touch all of these places.
Hello everyone, I am an outsider following the discussion here on this subject. Can we not go upstream with the current state of development? Optimizations will always come later, and so will new kernel releases. I cannot assess the risk, but I think a decision must be made. Damian On Wed, 15. Sep 11:40, Johannes Weiner wrote: > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > > One particularly noteworthy idea was having struct page refer to > > multiple hardware pages, and using slab/slub for larger > > alloctions. In my view, the primary reason for making this change > > isn't the memory overhead to struct page (though reducing that would > > be nice); > > Don't underestimate this, however. > > Picture the near future Willy describes, where we don't bump struct > page size yet but serve most cache with compound huge pages. > > On x86, it would mean that the average page cache entry has 512 > mapping pointers, 512 index members, 512 private pointers, 1024 LRU > list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate > flags, 512 memcg pointers etc. - you get the idea. > > This is a ton of memory. I think this doesn't get more traction > because it's memory we've always allocated, and we're simply more > sensitive to regressions than long-standing pain. But nevertheless > this is a pretty low-hanging fruit. > > The folio makes a great first step moving those into a separate data > structure, opening the door to one day realizing these savings. Even > when some MM folks say this was never the intent behind the patches, I > think this is going to matter significantly, if not more so, later on. > > > Fortunately, Matthew made a big step in the right direction by making folios a > > new type. Right now, struct folio is not separately allocated - it's just > > unionized/overlayed with struct page - but perhaps in the future they could be > > separately allocated.
I don't think that is a remotely realistic goal for _this_ > > patch series given the amount of code that touches struct page (thing: writeback > > code, LRU list code, page fault handlers!) - but I think that's a goal we could > > keep in mind going forward. > > Yeah, agreed. Not doable out of the gate, but retaining the ability to > allocate the "cache entry descriptor" bits - mapping, index etc. - > on-demand would be a huge benefit down the road for the above reason. > > For that they would have to be in - and stay in - their own type. > > > We should also be clear on what _exactly_ folios are for, so they don't become > > the new dumping ground for everyone to stash their crap. They're to be a new > > core abstraction, and we should endeaver to keep our core data structures > > _small_, and _simple_. > > Right. struct page is a lot of things and anything but simple and > obvious today. struct folio in its current state does a good job > separating some of that stuff out. > > However, when we think about *which* of the struct page mess the folio > wants to address, I think that bias toward recent pain over much > bigger long-standing pain strikes again. > > The compound page proliferation is new, and we're sensitive to the > ambiguity it created between head and tail pages. It's added some > compound_head() in lower-level accessor functions that are not > necessary for many contexts. The folio type safety will help clean > that up, and this is great. > > However, there is a much bigger, systematic type ambiguity in the MM > world that we've just gotten used to over the years: anon vs file vs > shmem vs slab vs ... 
>
> - Many places rely on context to say "if we get here, it must be
>   anon/file", and then unsafely access overloaded member elements:
>   page->mapping, PG_readahead, PG_swapcache, PG_private
>
> - On the other hand, we also have low-level accessor functions that
>   disambiguate the type and impose checks on contexts that may or may
>   not actually need them - not unlike compound_head() in PageActive():
>
>	struct address_space *folio_mapping(struct folio *folio)
>	{
>		struct address_space *mapping;
>
>		/* This happens if someone calls flush_dcache_page on slab page */
>		if (unlikely(folio_test_slab(folio)))
>			return NULL;
>
>		if (unlikely(folio_test_swapcache(folio)))
>			return swap_address_space(folio_swap_entry(folio));
>
>		mapping = folio->mapping;
>		if ((unsigned long)mapping & PAGE_MAPPING_ANON)
>			return NULL;
>
>		return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
>	}
>
>   Then we go identify places that say "we know it's at least not a
>   slab page!" and convert them to page_mapping_file() which IS safe to
>   use with anon. Or we say "we know this MUST be a file page" and just
>   access the (unsafe) mapping pointer directly.
>
> - We have a singular page lock, but what it guards depends on what
>   type of page we're dealing with. For a cache page it protects
>   uptodate and the mapping. For an anon page it protects swap state.
>
>   A lot of us can remember the rules if we try, but the code doesn't
>   help and it gets really tricky when dealing with multiple types of
>   pages simultaneously. Even mature code like reclaim just serializes
>   the operation instead of protecting data - the writeback checks and
>   the page table reference tests don't seem to need page lock.
>
>   When the cgroup folks wrote the initial memory controller, they just
>   added their own page-scope lock to protect page->memcg even though
>   the page lock would have covered what it needed.
> > - shrink_page_list() uses page_mapping() in the first half of the > function to tell whether the page is anon or file, but halfway > through we do this: > > /* Adding to swap updated mapping */ > mapping = page_mapping(page); > > and then use PageAnon() to disambiguate the page type. > > - At activate_locked:, we check PG_swapcache directly on the page and > rely on it doing the right thing for anon, file, and shmem pages. > But this flag is PG_owner_priv_1 and actually used by the filesystem > for something else. I guess PG_checked pages currently don't make it > this far in reclaim, or we'd crash somewhere in try_to_free_swap(). > > I suppose we're also never calling page_mapping() on PageChecked > filesystem pages right now, because it would return a swap mapping > before testing whether this is a file page. You know, because shmem. > > These are just a few examples from an MM perspective. I'm sure the FS > folks have their own stories and examples about pitfalls in dealing > with struct page members. > > We're so used to this that we don't realize how much bigger and > pervasive this lack of typing is than the compound page thing. > > I'm not saying the compound page mess isn't worth fixing. It is. > > I'm saying if we started with a file page or cache entry abstraction > we'd solve not only the huge page cache, but also set us up for a MUCH > more comprehensive cleanup in MM code and MM/FS interaction that makes > the tailpage cleanup pale in comparison. For the same amount of churn, > since folio would also touch all of these places. >
On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote: > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > > One particularly noteworthy idea was having struct page refer to > > multiple hardware pages, and using slab/slub for larger > > alloctions. In my view, the primary reason for making this change > > isn't the memory overhead to struct page (though reducing that would > > be nice); > > Don't underestimate this, however. > > Picture the near future Willy describes, where we don't bump struct > page size yet but serve most cache with compound huge pages. > > On x86, it would mean that the average page cache entry has 512 > mapping pointers, 512 index members, 512 private pointers, 1024 LRU > list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate > flags, 512 memcg pointers etc. - you get the idea. > > This is a ton of memory. I think this doesn't get more traction > because it's memory we've always allocated, and we're simply more > sensitive to regressions than long-standing pain. But nevertheless > this is a pretty low-hanging fruit. > > The folio makes a great first step moving those into a separate data > structure, opening the door to one day realizing these savings. Even > when some MM folks say this was never the intent behind the patches, I > think this is going to matter significantly, if not more so, later on. So ... I chatted with Kent the other day, who suggested to me that maybe the point you're really after is that you want to increase the hw page size to reduce overhead while retaining the ability to hand out parts of those larger pages to the page cache, and folios don't get us there? > > Fortunately, Matthew made a big step in the right direction by making folios a > > new type. Right now, struct folio is not separately allocated - it's just > > unionized/overlayed with struct page - but perhaps in the future they could be > > separately allocated. 
I don't think that is a remotely realistic goal for _this_ > > patch series given the amount of code that touches struct page (thing: writeback > > code, LRU list code, page fault handlers!) - but I think that's a goal we could > > keep in mind going forward. > > Yeah, agreed. Not doable out of the gate, but retaining the ability to > allocate the "cache entry descriptor" bits - mapping, index etc. - > on-demand would be a huge benefit down the road for the above reason. > > For that they would have to be in - and stay in - their own type. > > > We should also be clear on what _exactly_ folios are for, so they don't become > > the new dumping ground for everyone to stash their crap. They're to be a new > > core abstraction, and we should endeaver to keep our core data structures > > _small_, and _simple_. > > Right. struct page is a lot of things and anything but simple and > obvious today. struct folio in its current state does a good job > separating some of that stuff out. > > However, when we think about *which* of the struct page mess the folio > wants to address, I think that bias toward recent pain over much > bigger long-standing pain strikes again. > > The compound page proliferation is new, and we're sensitive to the > ambiguity it created between head and tail pages. It's added some > compound_head() in lower-level accessor functions that are not > necessary for many contexts. The folio type safety will help clean > that up, and this is great. > > However, there is a much bigger, systematic type ambiguity in the MM > world that we've just gotten used to over the years: anon vs file vs > shmem vs slab vs ... 
> > - Many places rely on context to say "if we get here, it must be > anon/file", and then unsafely access overloaded member elements: > page->mapping, PG_readahead, PG_swapcache, PG_private > > - On the other hand, we also have low-level accessor functions that > disambiguate the type and impose checks on contexts that may or may > not actually need them - not unlike compound_head() in PageActive(): > > struct address_space *folio_mapping(struct folio *folio) > { > struct address_space *mapping; > > /* This happens if someone calls flush_dcache_page on slab page */ > if (unlikely(folio_test_slab(folio))) > return NULL; > > if (unlikely(folio_test_swapcache(folio))) > return swap_address_space(folio_swap_entry(folio)); > > mapping = folio->mapping; > if ((unsigned long)mapping & PAGE_MAPPING_ANON) > return NULL; > > return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS); > } > > Then we go identify places that say "we know it's at least not a > slab page!" and convert them to page_mapping_file() which IS safe to > use with anon. Or we say "we know this MUST be a file page" and just > access the (unsafe) mapping pointer directly. > > - We have a singular page lock, but what it guards depends on what > type of page we're dealing with. For a cache page it protects > uptodate and the mapping. For an anon page it protects swap state. > > A lot of us can remember the rules if we try, but the code doesn't > help and it gets really tricky when dealing with multiple types of > pages simultaneously. Even mature code like reclaim just serializes > the operation instead of protecting data - the writeback checks and > the page table reference tests don't seem to need page lock. > > When the cgroup folks wrote the initial memory controller, they just > added their own page-scope lock to protect page->memcg even though > the page lock would have covered what it needed. 
> > - shrink_page_list() uses page_mapping() in the first half of the > function to tell whether the page is anon or file, but halfway > through we do this: > > /* Adding to swap updated mapping */ > mapping = page_mapping(page); > > and then use PageAnon() to disambiguate the page type. > > - At activate_locked:, we check PG_swapcache directly on the page and > rely on it doing the right thing for anon, file, and shmem pages. > But this flag is PG_owner_priv_1 and actually used by the filesystem > for something else. I guess PG_checked pages currently don't make it > this far in reclaim, or we'd crash somewhere in try_to_free_swap(). > > I suppose we're also never calling page_mapping() on PageChecked > filesystem pages right now, because it would return a swap mapping > before testing whether this is a file page. You know, because shmem. (Yes, it would be helpful to fix these ambiguities, because I feel like discussions about all these other non-pagecache uses of memory keep coming up on fsdevel and the code /really/ doesn't help me figure out what everyone's talking about before the discussion moves on...) > These are just a few examples from an MM perspective. I'm sure the FS > folks have their own stories and examples about pitfalls in dealing > with struct page members. We do, and I thought we were making good progress pushing a lot of that into the fs/iomap/ library. With fs iomap, disk filesystems pass space mapping data to the iomap functions and let them deal with pages (or folios). IOWs, filesystems don't deal with pages directly anymore, and folios sounded like an easy transition (for a filesystem) to whatever comes next. At some point it would be nice to get fscrypt and fsverity hooked up so that we could move ext4 further off of buffer heads. I don't know how we proceed from here -- there's quite a bit of filesystems work that depended on the folios series actually landing. 
Given that Linus has neither pulled it, rejected it, nor told willy what to do, and the folio series now has a NAK on it, I can't even start on how to proceed from here. --D > We're so used to this that we don't realize how much bigger and > pervasive this lack of typing is than the compound page thing. > > I'm not saying the compound page mess isn't worth fixing. It is. > > I'm saying if we started with a file page or cache entry abstraction > we'd solve not only the huge page cache, but also set us up for a MUCH > more comprehensive cleanup in MM code and MM/FS interaction that makes > the tailpage cleanup pale in comparison. For the same amount of churn, > since folio would also touch all of these places.
On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote: > On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote: > > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > > > One particularly noteworthy idea was having struct page refer to > > > multiple hardware pages, and using slab/slub for larger > > > alloctions. In my view, the primary reason for making this change > > > isn't the memory overhead to struct page (though reducing that would > > > be nice); > > > > Don't underestimate this, however. > > > > Picture the near future Willy describes, where we don't bump struct > > page size yet but serve most cache with compound huge pages. > > > > On x86, it would mean that the average page cache entry has 512 > > mapping pointers, 512 index members, 512 private pointers, 1024 LRU > > list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate > > flags, 512 memcg pointers etc. - you get the idea. > > > > This is a ton of memory. I think this doesn't get more traction > > because it's memory we've always allocated, and we're simply more > > sensitive to regressions than long-standing pain. But nevertheless > > this is a pretty low-hanging fruit. > > > > The folio makes a great first step moving those into a separate data > > structure, opening the door to one day realizing these savings. Even > > when some MM folks say this was never the intent behind the patches, I > > think this is going to matter significantly, if not more so, later on. > > So ... I chatted with Kent the other day, who suggested to me that maybe > the point you're really after is that you want to increase the hw page > size to reduce overhead while retaining the ability to hand out parts of > those larger pages to the page cache, and folios don't get us there? Yes, that's one of the points. 
It's exporting the huge page model we've been using for anonymous memory to the filesystems, even though that model has shown significant limitations in practice: it doesn't work well out of the box, the necessary configuration is painful and complicated, and even when done correctly it still has high allocation latencies. It's much more "handtuned HPC workload" than "general purpose feature". Fixing this is an open problem. I don't know for sure if we need to increase the page size for that, but neither does anybody else. This is simply work and experiments that haven't been done on the MM side. Exposing the filesystems to that implementation now subjects them to the risk of a near-term do-over, and puts a significantly higher barrier on fixing the allocation model down the line. There isn't a technical reason for coupling the filesystems this tightly to the allocation model. It's just that the filesystem people would like a size-agnostic cache object, and some MM folks would like to clean up the compound page mess, and folio tries to do both of these things at once. > > > Fortunately, Matthew made a big step in the right direction by making folios a > > > new type. Right now, struct folio is not separately allocated - it's just > > > unionized/overlayed with struct page - but perhaps in the future they could be > > > separately allocated. 
> > > > > We should also be clear on what _exactly_ folios are for, so they don't become > > > the new dumping ground for everyone to stash their crap. They're to be a new > > > core abstraction, and we should endeaver to keep our core data structures > > > _small_, and _simple_. > > > > Right. struct page is a lot of things and anything but simple and > > obvious today. struct folio in its current state does a good job > > separating some of that stuff out. > > > > However, when we think about *which* of the struct page mess the folio > > wants to address, I think that bias toward recent pain over much > > bigger long-standing pain strikes again. > > > > The compound page proliferation is new, and we're sensitive to the > > ambiguity it created between head and tail pages. It's added some > > compound_head() in lower-level accessor functions that are not > > necessary for many contexts. The folio type safety will help clean > > that up, and this is great. > > > > However, there is a much bigger, systematic type ambiguity in the MM > > world that we've just gotten used to over the years: anon vs file vs > > shmem vs slab vs ... 
> > > > - Many places rely on context to say "if we get here, it must be > > anon/file", and then unsafely access overloaded member elements: > > page->mapping, PG_readahead, PG_swapcache, PG_private > > > > - On the other hand, we also have low-level accessor functions that > > disambiguate the type and impose checks on contexts that may or may > > not actually need them - not unlike compound_head() in PageActive(): > > > > struct address_space *folio_mapping(struct folio *folio) > > { > > struct address_space *mapping; > > > > /* This happens if someone calls flush_dcache_page on slab page */ > > if (unlikely(folio_test_slab(folio))) > > return NULL; > > > > if (unlikely(folio_test_swapcache(folio))) > > return swap_address_space(folio_swap_entry(folio)); > > > > mapping = folio->mapping; > > if ((unsigned long)mapping & PAGE_MAPPING_ANON) > > return NULL; > > > > return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS); > > } > > > > Then we go identify places that say "we know it's at least not a > > slab page!" and convert them to page_mapping_file() which IS safe to > > use with anon. Or we say "we know this MUST be a file page" and just > > access the (unsafe) mapping pointer directly. > > > > - We have a singular page lock, but what it guards depends on what > > type of page we're dealing with. For a cache page it protects > > uptodate and the mapping. For an anon page it protects swap state. > > > > A lot of us can remember the rules if we try, but the code doesn't > > help and it gets really tricky when dealing with multiple types of > > pages simultaneously. Even mature code like reclaim just serializes > > the operation instead of protecting data - the writeback checks and > > the page table reference tests don't seem to need page lock. > > > > When the cgroup folks wrote the initial memory controller, they just > > added their own page-scope lock to protect page->memcg even though > > the page lock would have covered what it needed. 
> > > > - shrink_page_list() uses page_mapping() in the first half of the > > function to tell whether the page is anon or file, but halfway > > through we do this: > > > > /* Adding to swap updated mapping */ > > mapping = page_mapping(page); > > > > and then use PageAnon() to disambiguate the page type. > > > > - At activate_locked:, we check PG_swapcache directly on the page and > > rely on it doing the right thing for anon, file, and shmem pages. > > But this flag is PG_owner_priv_1 and actually used by the filesystem > > for something else. I guess PG_checked pages currently don't make it > > this far in reclaim, or we'd crash somewhere in try_to_free_swap(). > > > > I suppose we're also never calling page_mapping() on PageChecked > > filesystem pages right now, because it would return a swap mapping > > before testing whether this is a file page. You know, because shmem. > > (Yes, it would be helpful to fix these ambiguities, because I feel like > discussions about all these other non-pagecache uses of memory keep > coming up on fsdevel and the code /really/ doesn't help me figure out > what everyone's talking about before the discussion moves on...) Excellent. However, after listening to Kent and other filesystem folks, I think it's important to point out that the folio is not a dedicated page cache page descriptor that will address any of the above examples. The MM POV (and the justification for both the acks and the naks of the patchset) is that it's a generic, untyped compound page abstraction, which applies to file, anon, slab, networking pages. Certainly, the folio patches as of right now also convert anon page handling to the folio. If followed to its conclusion, the folio will have plenty of members and API functions for non-pagecache users and look pretty much like struct page today, just with a dynamic size. I know Kent was surprised by this. 
I know Dave Chinner suggested to call it "cache page" or "cage" early on, which also suggests an understanding of a *dedicated* cache page descriptor. I don't think the ambiguous folio name and the ambiguous union with the page helped in any way in aligning fs and mm folks on what this thing is actually supposed to be! I agree with what I think the filesystems want: instead of an untyped, variable-sized block of memory, I think we should have a typed page cache descriptor. That would work better for the filesystems, and I think would also work better for the MM code down the line and fix the above examples. The headpage/tailpage cleanup would come free with that. > > These are just a few examples from an MM perspective. I'm sure the FS > > folks have their own stories and examples about pitfalls in dealing > > with struct page members. > > We do, and I thought we were making good progress pushing a lot of that > into the fs/iomap/ library. With fs iomap, disk filesystems pass space > mapping data to the iomap functions and let them deal with pages (or > folios). IOWs, filesystems don't deal with pages directly anymore, and > folios sounded like an easy transition (for a filesystem) to whatever > comes next. At some point it would be nice to get fscrypt and fsverity > hooked up so that we could move ext4 further off of buffer heads. > > I don't know how we proceed from here -- there's quite a bit of > filesystems work that depended on the folios series actually landing. > Given that Linus has neither pulled it, rejected it, or told willy what > to do, and the folio series now has a NAK on it, I can't even start on > how to proceed from here. I think divide and conquer is the way forward. The crux of the matter is that folio is trying to 1) replace struct page as the filesystem interface to the MM and 2) replace struct page as the internal management object for file and anon, and conceptually also slab & networking pages all at the same time. 
As you can guess, goals 1) and 2) have vastly different scopes. Replacing struct page in the filesystem isn't very controversial, and filesystem folks seem uniformly ready to go. I agree. Replacing struct page in MM code is much less clear cut. We have some people who say it'll be great, some people who say we can probably figure out open questions down the line, and we have some people who have expressed doubts that all this churn will ever be worth it. I think it's worth replacing, but not with an untyped compound thing. It's sh*tty that the filesystem people are acutely blocked on large-scope, long-term MM discussions they don't care about. It's also sh*tty that these MM discussions are rushed by folks who aren't familiar with, or don't much care about, the MM internals. This friction isn't necessary. The folio conversion is an incremental process. It's not like everything in MM code has been fully converted already - some stuff deals with the folio, most stuff with the page. An easy way forward that I see is to split this large, open-ended project into more digestible pieces. E.g. separate 1) and 2): merge a "size-agnostic cache page" type now; give MM folks the time they need to figure out how and if they want to replace struct page internally. That's why I suggested to drop the anon page conversion bits in swap.c, workingset.c, memcontrol.c etc, and just focus on the uncontroversial page cache bits for now.
Johannes Weiner <hannes@cmpxchg.org> wrote: > I know Kent was surprised by this. I know Dave Chinner suggested to > call it "cache page" or "cage" early on, which also suggests an > understanding of a *dedicated* cache page descriptor. If we are aiming to get pages out of the view of the filesystem, then we should probably not include "page" in the name. "Data cache" would seem obvious, but we already have that concept for the CPU. How about something like "struct content" and rename i_pages to i_content? David
On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote: > On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote: > > On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote: > > > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > The MM POV (and the justification for both the acks and the naks of > the patchset) is that it's a generic, untyped compound page > abstraction, which applies to file, anon, slab, networking > pages. Certainly, the folio patches as of right now also convert anon > page handling to the folio. If followed to its conclusion, the folio > will have plenty of members and API functions for non-pagecache users > and look pretty much like struct page today, just with a dynamic size. > > I know Kent was surprised by this. I know Dave Chinner suggested to > call it "cache page" or "cage" early on, which also suggests an > understanding of a *dedicated* cache page descriptor. Don't take a flippant comment I made in a bikeshed discussion as any sort of representation of what I think about this current situation. I've largely been silent because of your history of yelling incoherently in response to anything I say that you don't agree with. But now you've explicitly drawn me into this discussion, I'll point out that I'm one of very few people in the wider Linux mm/fs community who has any *direct experience* with the cache handle based architecture being advocated for here. I don't agree with your assertion that cache handle based objects are the way forward, so please read and try to understand what I've just put a couple of hours into writing before you start shouting. Please? --- Ok, so this cache page descriptor/handle/object architecture has been implemented in other operating systems. It's the solution that Irix implemented back in the early _1990s_ via its chunk cache. I've talked about this a few times in the past 15 years, so I guess I'll talk about it again. 
eg at LSFMM 2014 where I said "we don't really want to go down that path" in reference to supporting sector sizes > PAGE_SIZE: https://lwn.net/Articles/592101/ So, in more gory detail, here's why I don't think we really want to go down that path..... The Irix chunk cache sat between the low layer global, disk address indexed buffer cache[1] and the high layer per-mm-context page cache used for mmap(). A "chunk" was a variable sized object indexed by file offset on a per-inode AVL tree - basically the same caching architecture as our current per-inode mapping tree uses to index pages. But unlike the Linux page cache, these chunks were an extension of the low level buffer cache. Hence they were also indexed by physical disk address and the life-cycle was managed by the buffer cache shrinker rather than the mm-based page cache reclaim algorithms. Chunks were built from page cache pages, and pages pointed back to the chunk that they belonged to. Chunks needed their own locking. IO was done based on chunks, not pages. Filesystems decided the size of chunks, not the page cache. Pages attached to chunks could be of any hardware supported size - the only limitation was that all pages attached to a chunk had to be the same size. A large hardware page in the page cache could be mapped by multiple smaller chunks. A chunk made up of multiple hardware pages could vmap its contents if the user needed contiguous access.[2] Chunks were largely unaware of ongoing mmap operations. When a page fault hit a page that had no associated chunk (e.g. one originally populated into the page cache by a read fault into a hole, or a cached page whose chunk the buffer cache had torn down), a new chunk had to be built. The code needed to handle partially populated chunks in this sort of situation was really, really nasty, as it required interacting with the filesystem and having the filesystem take locks and call back up into the page cache to build the new chunk in the IO path. 
Similarly, dirty page state from page faults needed to be propagated down to the chunks, because dirty tracking for writeback was done at the chunk level, not the page cache level. This was *really* nasty, because if the page didn't have a chunk already built, it couldn't be built in a write fault context. Hence sweeping dirty page state to the IO subsystem was handled periodically by a pdflush daemon, which could work with the filesystem to build new (dirty) chunks and insert them into the chunk cache for writeback. Similar problems will have to be considered during design for Linux because the dirty tracking in Linux for writeback is done at the per-inode mapping tree level. Hence things like ->page_mkwrite are going to have to dig through the page to the cached chunk and mark the chunk dirty rather than the page. Whether deadlocks are going to have to be worked around is an open question; I don't have answers to these concerns because nobody is proposing an architecture detailed enough to explore these situations. This also leads to really interesting questions about how page and chunk state w.r.t. IO is kept coherent. e.g. if we are not tracking IO state on individual page cache pages, how do we ensure all the pages stay stable when IO is being done to a block device that requires stable pages? Along similar lines: what's the interlock mechanism that we'll use to ensure that IO or truncate can lock out per-page accesses if the filesystem IO paths no longer directly interact with page state any more? I also wonder how we will manage cached chunks if the filesystem currently relies on page level locking for atomicity, concurrency and existence guarantees (e.g. ext4 buffered IO)? IOWs, it is extremely likely that there will still be situations where we have to blast directly through the cache handle abstraction to manipulate the objects behind the abstraction so that we can make specific functionality work correctly, without regressions and/or efficiently. 
Hence the biggest issue that a chunk-like cache handle introduces is the complex multi-dimensional state update interactions. These will require more complex locking and that locking will be required to work in arbitrary orders for operations to be performed safely and atomically. e.g. IO needs inode->chunk->page order, whilst page migration/compaction needs page->chunk->inode order. Page migration and compaction on Irix had some unfixable deadlocks in rare corner cases because of locking inversion problems between filesystems, chunks, pages and mm contexts. I don't see any fundamental difference in Linux architecture that makes me think that it will be any different.[3] I've got a war chest full of chunk cache related data corruption bugs on Irix that were crazy hard to reproduce and even more difficult to fix. At least half the bugs I had to fix in the chunk cache over 3-4 years as maintainer were data corruption bugs resulting from inconsistencies in multi-object state updates. I've got a whole 'nother barrel full of problem cases that revolve around memory reclaim, too. The cache handles really need to pin the pages that back them, and so we can't really do access optimised per-page based reclaim of file-backed pages anymore. The Irix chunk cache had its own LRUs and shrinker[4] to manage life-cycles of chunks under memory pressure, and the mm code had its own independent page cache shrinker. Hence pages didn't get freed until both the chunk cache and the page cache released the pages they had references to. IOWs, we're going to end up needing to reclaim cache handles before we can do page reclaim. This needs careful thought and will likely need a complete redesign of the vmscan.c algorithms to work properly. I really, really don't want to see awful layer violations like bufferhead reclaim getting hacked into the low layer page reclaim algorithms happen ever again. We're still paying the price for that. 
And given the way Linux uses the mapping tree for keeping stuff like per-page working set refault information after the pages have been removed from the page cache, I really struggle to see how functionality like this can be supported with a chunk based cache index that doesn't actually have direct tracking of individual page access and reclaim behaviour. We're also going to need a range-based indexing mechanism for the mapping tree if we want to avoid the inefficiencies that mapping large objects into the XArray requires. We'll need an rcu-aware tree of some kind, be it a btree, maple tree or something else, so that we can maintain lockless lookups of cache objects. That infrastructure doesn't exist yet, either. And on that note, it is worth keeping in mind that one of the reasons that the current linux page cache architecture scales better for single files than the Irix architecture ever did is because the Irix chunk cache could not be made lockless. The requirements for atomic multi-dimensional indexing updates and coherent, atomic multi-object state changes could never be solved in a lockless manner. It was not for lack of trying or talent; people way smarter than me couldn't solve that problem. So there's an open question as to whether we can maintain existing lockless algorithms when a chunk cache is layered over the top of the page cache. IOWs, I see significant, fundamental problems that chunk cache architectures suffer from. I know there are inherent problems with state coherency, locking, complexity in the IO path, etc. Some of these problems will not be discovered until the implementation is well under way. Some of these problems may well be unsolvable, too. And until there's an actual model proposed of how everything will interact and work, we can't actually do any of this architectural analysis to determine if it might work or not. The chunk cache proposal is really just a grand thought experiment at this point in time. 
OTOH, folios have none of these problems and are here right now. Sure, they have their own issues, but we can see them for what they are given the code is already out there, and pretty much everyone sees them as a big step forwards. Folios don't prevent a chunk cache from being implemented. In fact, to make folios highly efficient, we have to do things a chunk cache would also require to be implemented. e.g. range-based cache indexing. Unlike a chunk cache, folios don't depend on this being done first - they stand alone without those changes, and will only improve from making them. IOWs, you can't use the "folios being mapped 512 times into the mapping tree" as a reason the chunk cache is better - the chunk cache also requires this same problem to be solved, but the chunk cache needs efficient range lookups done *before* it is implemented, not provided afterwards as an optimisation. IOWs, if we want to move towards a chunk cache, the first step is to move to folios to allow large objects in the page cache. Then we can implement a lock-less range based index mechanism for the mapping tree. Then we can look to replace folios with a typed cache handle without having to worry about all the whacky multi-object coherency problems because they only need to point to a single folio. Then we can work out all the memory reclaim issues, locking issues, sort out the API that filesystems use instead of folios, etc that need to be done when cache handles are introduced. And once we've worked through all that, then we can add support for multiple folios within a single cache object and discover all the really hard problems that this exposes. At this point, the cache objects are no longer dependent on folios to provide objects > PAGE_SIZE to the filesystems, and we can start to remove folios from the mm code and replace them with something else that the cache handle uses to provide the backing store to the filesystems... 
Seriously, I have given a lot of thought over the years to a chunk cache for Linux. Right now, a chunk cache is a solution looking for a problem to solve. Unless there's an overall architectural mm plan that is being worked towards that requires a chunk cache, then I just don't see the justification for doing all this work, because the first two steps above get filesystems everything they are currently asking for. Everything else past that is really just an experiment...

> I agree with what I think the filesystems want: instead of an untyped,
> variable-sized block of memory, I think we should have a typed page
> cache descriptor.

I don't think that's what fs devs want at all. It's what you think fs devs want. If you'd been listening to us the same way that Willy has been for the past year, maybe you'd have a different opinion.

Indeed, we don't actually need a new page cache abstraction. fs/iomap already provides filesystems with a complete, efficient page cache abstraction that only requires filesystems to provide block mapping services. Filesystems using iomap do not interact with the page cache at all. And David Howells is working with Willy and all the network fs devs to build an equivalent generic netfs page cache abstraction based on folios that is supported by the major netfs client implementations in the kernel.

IOWs, fs devs don't need a new page cache abstraction - we've got our own abstractions tailored directly to our needs. What we need are API cleanups, consistency in object access mechanisms and dynamic object size support to simplify and fill out the feature set of the abstractions we've already built.

So many fs developers are pushing *hard* for folios because it provides what we've been asking for individually over the last few years. Willy has done a great job of working with the fs developers and getting feedback at every step of the process, and you see that in the amount of work in progress that is already based on folios.
And it provides those cleanups and new functionality without changing or invalidating any of the knowledge we collectively hold about how the page cache works. That's _pure gold_ right there.

In summary:

If you don't know anything about the architecture and limitations of the XFS buffer cache (also read the footnotes), you'd do very well to pay heed to what I've said in this email, considering the direct relevance its history has to the alternative cache handle proposal being made here. We also need to consider the evidence that filesystems do not actually need a new page cache abstraction - they just need the existing page cache to be able to index objects larger than PAGE_SIZE.

So with all that in mind, I consider folios (or whatever we call them) to be the best stepping stone towards a PAGE_SIZE independent future that we currently have. Folios don't prevent us from introducing a cache handle based architecture if we have a compelling reason to do so in the future, nor do they stop anyone working on such infrastructure in parallel if it really is necessary. But the reality is that we don't need such a fundamental architectural change to provide the functionality that folios provide us with _right now_.

Folios are not perfect, but they are here and they solve many issues we need solved. We're never going to have a perfect solution that everyone agrees with, so the real question is "are folios good enough?". To me the answer is a resounding yes.

Cheers,

Dave.

[1] fs/xfs/xfs_buf.c is an example of a high performance, handle based, variable object size cache that abstracts away the details of the data store being allocated from slab, discontiguous pages, contiguous pages or vmapped memory. It is basically a two-decade-old re-implementation of the Irix low layer global disk-addressed buffer cache, modernised and tailored directly to the needs of XFS metadata caching.

[2] Keep in mind that the xfs_buf cache used to be page cache backed.
The page cache provided the caching and memory reclaim infrastructure to the xfs_buf handles - and so we do actually have recent direct experience on Linux with the architecture you are proposing here. This architecture proved to have major limitations in performance, multi-object state coherency and cache residency prioritisation. It really sucked with systems that had 64kB page sizes and 4kB metadata block sizes, and ....

So we went back to the old Irix way of managing the cache - our own buffer based LRUs and aging mechanisms, with memory reclaim run by shrinkers based on buffer-type priorities. We use bulk page allocation for buffers that are >= PAGE_SIZE, and slab allocation for those < PAGE_SIZE. That's exactly what you are suggesting we do with 2MB sized base pages, but without having to care about mmap() at all.
On Fri, Sep 17, 2021 at 03:24:40PM +1000, Dave Chinner wrote:
> Folios are not perfect, but they are here and they solve many issues
> we need solved. We're never going to have a perfect solution that
> everyone agrees with, so the real question is "are folios good
> enough?". To me the answer is a resounding yes.

Besides agreeing with all you said, the other important part is: even if we were to eventually go with Johannes' grand plans (which I disagree with in many aspects), what is the harm in doing folios now?

Despite all the fuss, the pending folio PR does nothing but add type safety to compound pages. Which is something we badly need, no matter what kind of other caching grand plans people have.
On Fri, Sep 17, 2021 at 03:24:40PM +1000, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote:
> > I agree with what I think the filesystems want: instead of an untyped,
> > variable-sized block of memory, I think we should have a typed page
> > cache descriptor.
>
> I don't think that's what fs devs want at all. It's what you think
> fs devs want. If you'd been listening to us the same way that Willy
> has been for the past year, maybe you'd have a different opinion.

I was going off of Darrick's remarks about non-pagecache uses, Kent's remarks about simple and obvious core data structures, and yes, your suggestion of "cache page". But I think you may have overinterpreted what I meant by cache descriptor:

> Indeed, we don't actually need a new page cache abstraction.

I didn't suggest to change what the folio currently already is for the page cache. I asked to keep anon pages out of it (and in the future potentially other random stuff that is using compound pages). It doesn't have any bearing on how it presents to you on the filesystem side, other than that it isn't as overloaded as struct page is with non-pagecache stuff.

A full-on disconnect between the cache entry descriptor and the page is something that came up during speculation on how the MM will be able to effectively raise the page size and meet scalability requirements on modern hardware - and in that context I do appreciate you providing background information on the chunk cache, which will be valuable to inform *that* discussion. But it isn't what I suggested as the immediate action to unblock the folio merge.

> The fact that so many fs developers are pushing *hard* for folios is
> that it provides what we've been asking for individually over the last
> few years.

I'm not sure filesystem people are pushing hard for non-pagecache stuff to be in the folio.
> Willy has done a great job of working with the fs developers and
> getting feedback at every step of the process, and you see that in
> the amount of work in progress that is already based on folios.

And that's great, but the folio is blocked on MM questions:

1. Is the folio a good descriptor for all uses of anon and file pages inside MM code way beyond the page cache layer YOU care about?

2. Are compound pages a scalable, future-proof allocation strategy?

For some people the answers are yes, for others they are a no.

For 1), the value proposition is to clean up the relatively recent head/tail page confusion. And though everybody agrees that there is value in that, it's a LOT of churn for what it does. Several people have pointed this out, and AFAICS this is the most common reason for people that have expressed doubt or hesitation over the patches.

In an attempt to address this, I pointed out the cleanup opportunities that would open up by using separate anon and file folio types instead of one type for both. Nothing more. No intermediate thing, no chunk cache. Doesn't affect you. Just taking Willy's concept of type safety and applying it to file and anon instead of page vs compound page.

- It wouldn't change anything for fs people from the current folio patchset (except maybe the name)

- It would accomplish the head/tail page cleanup the same way, since just like a folio, a "file folio" could also never be a tail page

- It would take the same solution folio prescribes to the compound page issue (explicit typing to get rid of useless checks, lookups and subtle bugs) and solve way more instances of this all over MM code, thereby hopefully boosting the value proposition and making *that part* of the patches a clearer win for the MM subsystem

This is a question directed at MM people, not filesystem people. It doesn't pertain to you at all.
And if MM people agree or want to keep discussing it, the relatively minor action item for the folio patch is the same: drop the partial anon-to-folio conversion bits inside MM code for now and move on.

For 2), nobody knows the answer to this. Nobody. Anybody who claims to do so is full of sh*t. Maybe compound pages work out, maybe they don't. We can talk a million years about larger page sizes, how to handle internal fragmentation, the difficulties of implementing a chunk cache, but it's completely irrelevant because it's speculative.

We know there are multiple page sizes supported by the hardware and the smallest supported one is no longer the most dominant one. We do not know for sure yet how the MM is internally going to lay out its type system so that the allocator, mmap, page reclaim etc. can be CPU efficient and the descriptors be memory efficient. Nobody's "grand plan" here is any more viable, tested or proven than anybody else's.

My question for fs folks is simply this: as long as you can pass a folio to kmap and mmap and it knows what to do with it, is there any filesystem relevant requirement that the folio map to 1 or more literal "struct page", and that folio_page(), folio_nr_pages() etc. be part of the public API? Or can we keep this translation layer private to MM code? And will page_folio() be required for anything beyond the transitional period away from pages?

Can we move things not used outside of MM into mm/internal.h, mark the transitional bits of the public API as such, and move on?

The unproductive vitriol, personal attacks and dismissiveness over relatively minor asks and RFCs from the subsystem that is the most impacted by this patchset is just nuts.
On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> I didn't suggest to change what the folio currently already is for the
> page cache. I asked to keep anon pages out of it (and in the future
> potentially other random stuff that is using compound pages).

It would mean that anon-THP cannot benefit from the work Willy did with folios. Anon-THP is the most active user of compound pages at the moment and it also suffers from the compound_head() plague. You ask to exclude anon-THP citing *possible* future benefits for pagecache.

Sorry, but this doesn't sound fair to me.

We already had a similar experiment with PAGE_CACHE_SIZE. It was introduced with the hope of having PAGE_CACHE_SIZE != PAGE_SIZE one day. It never happened and only caused confusion on the border between pagecache-specific code and generic code that handled both file and anon pages.

If you want to limit usage of the new type to pagecache, the burden is on you to prove that it is useful and not just dead weight.
Snipped, reordered:

On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> 2. Are compound pages a scalable, future-proof allocation strategy?
>
> For 2), nobody knows the answer to this. Nobody. Anybody who claims to
> do so is full of sh*t. Maybe compound pages work out, maybe they
> don't. We can talk a million years about larger page sizes, how to
> handle internal fragmentation, the difficulties of implementing a
> chunk cache, but it's completely irrelevant because it's speculative.

Calling it compound pages here is a misnomer, and it confuses the discussion. The question is really about whether we should start using higher order allocations for data in the page cache, and perhaps a better way of framing that question is: should we continue to fragment all our page cache allocations up front into individual pages? But I don't think this is really the blocker.

> 1. Is the folio a good descriptor for all uses of anon and file pages
> inside MM code way beyond the page cache layer YOU care about?
>
> For some people the answers are yes, for others they are a no.

The anon page conversion does seem to be where all the disagreement is coming from. So my ask, to everyone involved, is: if anonymous pages are dropped from the folio patches, do we have any other real objections to the patch series?

It's an open question as to how much anonymous pages are like file pages, and, if we continue down the route of splitting up struct page into separate types, whether anonymous pages should be converted at the same time as file pages. Also, it appears even file pages aren't fully converted to folios in Willy's patch set - grepping around reveals plenty of references to struct page left in fs/.

I think that even if anonymous pages are going to become folios, it's a pretty reasonable ask for that to wait a cycle or two and see how the conversion of file pages fully plays out.
Also: it's become pretty clear to me that we have crappy communications between MM developers and filesystem developers.

Internally both teams have solid communications - I know in filesystem land we all talk to each other and are pretty good at working collaboratively, and it sounds like the MM team also has good internal communications. But we seem to have some problems with tackling issues that cross over between FS and MM land, or awkwardly sit between them.

Perhaps this is something we could try to address when picking conference topics in the future. Johannes also mentioned a monthly group call the MM devs schedule - I wonder if it would be useful to get something similar going between MM and interested parties in filesystem land.
On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > I didn't suggest to change what the folio currently already is for the
> > page cache. I asked to keep anon pages out of it (and in the future
> > potentially other random stuff that is using compound pages).
>
> It would mean that anon-THP cannot benefit from the work Willy did with
> folios. Anon-THP is the most active user of compound pages at the moment
> and it also suffers from the compound_head() plague. You ask to exclude
> anon-THP citing *possible* future benefits for pagecache.
>
> Sorry, but this doesn't sound fair to me.

I'm less concerned with what's fair than figuring out what the consensus is so we can move forward. I agree that anonymous THPs could benefit greatly from conversion to folios - but looking at the code it doesn't look like much of that has been done yet.

I understand you've had some input into the folio patches, so maybe you'd be best able to answer while Matthew is away - would it be fair to say that, in the interests of moving forward, anonymous pages could be split out for now? That way the MM people gain time to come to their own consensus and we can still unblock the FS work that's already been done on top of folios.
On Fri, Sep 17, 2021 at 05:17:09PM -0400, Kent Overstreet wrote:
> On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > > I didn't suggest to change what the folio currently already is for the
> > > page cache. I asked to keep anon pages out of it (and in the future
> > > potentially other random stuff that is using compound pages).
> >
> > It would mean that anon-THP cannot benefit from the work Willy did with
> > folios. Anon-THP is the most active user of compound pages at the moment
> > and it also suffers from the compound_head() plague. You ask to exclude
> > anon-THP citing *possible* future benefits for pagecache.
> >
> > Sorry, but this doesn't sound fair to me.
>
> I'm less concerned with what's fair than figuring out what the consensus
> is so we can move forward. I agree that anonymous THPs could benefit
> greatly from conversion to folios - but looking at the code it doesn't
> look like much of that has been done yet.
>
> I understand you've had some input into the folio patches, so maybe you'd
> be best able to answer while Matthew is away - would it be fair to say
> that, in the interests of moving forward, anonymous pages could be split
> out for now? That way the MM people gain time to come to their own
> consensus and we can still unblock the FS work that's already been done
> on top of folios.

I can't answer for Matthew.

The anon conversion patchset doesn't exist yet (but it is in the plans), so there's nothing to split out. Once someone comes up with such a patchset, they have to sell it upstream on its own merit. Possible future efforts should not block the code at hand.

"Talk is cheap. Show me the code."
On Sat, Sep 18, 2021 at 01:02:09AM +0300, Kirill A. Shutemov wrote:
> I can't answer for Matthew.
>
> The anon conversion patchset doesn't exist yet (but it is in the plans),
> so there's nothing to split out. Once someone comes up with such a
> patchset, they have to sell it upstream on its own merit.

Perhaps we've been operating under some incorrect assumptions then. If the current patch series doesn't actually touch anonymous pages - the patch series does touch code in e.g. mm/swap.c, but looking closer it might just be due to the (mis)organization of the current code - maybe there aren't any real objections left?
On Fri, Sep 17, 2021 at 05:13:10PM -0400, Kent Overstreet wrote:
> Also: it's become pretty clear to me that we have crappy
> communications between MM developers and filesystem
> developers.

I think one of the challenges has been the lack of an LSF/MM since 2019. And it may be that having *some* kind of ad hoc technical discussion, given that LSF/MM in 2021 is not happening, might be a good thing. I'm sure if we asked nicely, we could use the LPC infrastructure to set up something, assuming we can find a mutually agreeable day or dates.

> Internally both teams have solid communications - I know
> in filesystem land we all talk to each other and are pretty good at
> working collaboratively, and it sounds like the MM team also has good
> internal communications. But we seem to have some problems with
> tackling issues that cross over between FS and MM land, or awkwardly
> sit between them.

That's a bit of an over-generalization; it seems like we've uncovered that some of the disagreements are between different parts of the MM community over the suitability of folios for anonymous pages. And it's interesting, because I don't really consider Willy to be one of "the FS folks" --- and he has been quite diligent about reaching out to a number of folks in the FS community about our needs, and it's clear that this has been really, really helpful.

There's no question that we've had for many years some difficulties in the code paths that sit between FS and MM, and I'd claim that it's not just because of communications, but the relative lack of effort that was focused in that area. The fact that Willy has spent the last 9 months working on FS / MM interactions has been really great, and I hope it continues.
That being said, it sounds like there are issues internal to the MM devs that still need to be ironed out, and at the risk of throwing the anon-THP folks under the bus, if we can land at least some portion of the folio commits, it seems like that would be a step in the right direction.

Cheers,

- Ted
On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > I didn't suggest to change what the folio currently already is for the
> > page cache. I asked to keep anon pages out of it (and in the future
> > potentially other random stuff that is using compound pages).
>
> It would mean that anon-THP cannot benefit from the work Willy did with
> folios. Anon-THP is the most active user of compound pages at the moment
> and it also suffers from the compound_head() plague. You ask to exclude
> anon-THP citing *possible* future benefits for pagecache.
>
> Sorry, but this doesn't sound fair to me.

Hold on, Kirill. I'm not saying we shouldn't fix anonthp. But let's clarify the actual code in question in this specific patchset. You say anonthp cannot benefit from folio, but in the other email you say this patchset isn't doing the conversion yet.

The code I'm specifically referring to here is the conversion of some code that encounters both anon and file pages - swap.c, memcontrol.c, workingset.c, and a few other places. It's a small part of the folio patches, but it's a big deal for the MM code conceptually.

I'm requesting to drop those and just keep the page cache bits. Not because I think anonthp shouldn't be fixed, but because I think we're not in agreement yet on how they should be fixed. And it's somewhat independent of fixing the page cache interface now, which people are waiting on much more desperately and acutely than we inside MM are waiting for a struct page cleanup. It's not good to hold them up while we argue.

Dropping the anon bits isn't final. Depending on how our discussion turns out, we can still put them in later or we can put in something new. The important thing is that the uncontroversial page cache bits aren't held up any longer while we figure it out.

> If you want to limit usage of the new type to pagecache, the burden is
> on you to prove that it is useful and not just dead weight.
I'm not asking to add anything to the folio patches, just to remove some bits around the edges. And for the page cache bits: I think we have a rather large number of folks really wanting those. Now.

Again, I think we should fix anonthp. But I also think we should really look at struct page more broadly. And I think we should have that discussion inside a forum of MM people that truly care. I'm just trying to unblock the fs folks at this point and merge what we can now.
On 9/17/21 6:25 PM, Theodore Ts'o wrote:
> On Fri, Sep 17, 2021 at 05:13:10PM -0400, Kent Overstreet wrote:
>> Also: it's become pretty clear to me that we have crappy
>> communications between MM developers and filesystem
>> developers.
>
> I think one of the challenges has been the lack of an LSF/MM since
> 2019. And it may be that having *some* kind of ad hoc technical
> discussion given that LSF/MM in 2021 is not happening might be a good
> thing. I'm sure if we asked nicely, we could use the LPC
> infrastructure to set up something, assuming we can find a mutually
> agreeable day or dates.

We have a slot for this in the FS MC, first slot actually, so hopefully we can get things hashed out there.

Thanks,

Josef
On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> My question for fs folks is simply this: as long as you can pass a
> folio to kmap and mmap and it knows what to do with it, is there any
> filesystem relevant requirement that the folio map to 1 or more
> literal "struct page", and that folio_page(), folio_nr_pages() etc be
> part of the public API?

In the short term, yes, we need those things in the public API. In the long term, not so much.

We need something in the public API that tells us the offset and size of the folio. Lots of page cache code currently does stuff like calculate the size or iteration counts based on the difference of page->index values (i.e. number of pages) and iterate page by page. A direct conversion of such algorithms increments by folio_nr_pages() instead of 1. So stuff like this is definitely necessary as public APIs in the initial conversion.

Let's face it, folio_nr_pages() is a huge improvement on directly exposing THP/compound page interfaces to filesystems and leaving them to work it out for themselves. So even in the short term, these API members represent a major step forward in mm API cleanliness.

As for long term, everything in the page cache API needs to transition to byte offsets and byte counts instead of units of PAGE_SIZE and page->index. That's a more complex transition, but AFAIA that's part of the future work Willy intends to do with folios and the folio API. Once we get away from accounting and tracking everything as units of struct page, all the public facing APIs that use those units can go away.

It's fairly slow to do this, because we have so much code that is doing stuff like converting file offsets between byte counts and page counts and vice versa. And it's not necessary to do in an initial conversion to folios, either.
But once everything in the page cache indexing API moves to byte ranges, the need to count pages, use page counts as ranges, iterate by page index, etc. all goes away and hence those APIs can also go away.

As for converting between folios and pages, we'll need those sorts of APIs for the foreseeable future because low level storage layers and hardware use pages for their scatter gather arrays, and at some point we've got to expose those pages from behind the folio API. Even if we replace struct page with some other hardware page descriptor, we're still going to need such translation APIs at some point in the stack....

> Or can we keep this translation layer private
> to MM code? And will page_folio() be required for anything beyond the
> transitional period away from pages?

No idea, but as per above I think it's a largely irrelevant concern for the foreseeable future because pages will be here for a long time yet.

> Can we move things not used outside of MM into mm/internal.h, mark the
> transitional bits of the public API as such, and move on?

Sure, but that's up to you to do as a patch set on top of Willy's folio trees if you think it improves the status quo. Write the patches and present them for review just like everyone else does, and they can be discussed on their merits in that context rather than being presented as a reason for blocking current progress on folios.

Cheers,

Dave.
On Sat, Sep 18, 2021 at 11:04:40AM +1000, Dave Chinner wrote:
> As for long term, everything in the page cache API needs to
> transition to byte offsets and byte counts instead of units of
> PAGE_SIZE and page->index. That's a more complex transition, but
> AFAIA that's part of the future work Willy intends to do with
> folios and the folio API. Once we get away from accounting and
> tracking everything as units of struct page, all the public facing
> APIs that use those units can go away.

Probably 95% of the places we use page->index and page->mapping aren't necessary because we've already got that information from the context we're in, and removing them would be a useful cleanup. If we've already got that from context (e.g. we're looking up the page in the page cache via i_pages), eliminating the page->index or page->mapping use means we're getting rid of a data dependency, so it's good for performance. But more importantly, those (much fewer) places in the code where we actually _do_ need page->index and page->mapping are really important places to be able to find, because they're interesting boundaries between different components in the VM.
On Sat, Sep 18, 2021 at 12:51:50AM -0400, Kent Overstreet wrote:
> On Sat, Sep 18, 2021 at 11:04:40AM +1000, Dave Chinner wrote:
> > As for long term, everything in the page cache API needs to
> > transition to byte offsets and byte counts instead of units of
> > PAGE_SIZE and page->index. That's a more complex transition, but
> > AFAIA that's part of the future work Willy intends to do with
> > folios and the folio API. Once we get away from accounting and
> > tracking everything as units of struct page, all the public facing
> > APIs that use those units can go away.
>
> Probably 95% of the places we use page->index and page->mapping aren't
> necessary because we've already got that information from the context
> we're in, and removing them would be a useful cleanup.

*nod*

> If we've already got that from context (e.g. we're looking up the page
> in the page cache via i_pages), eliminating the page->index or
> page->mapping use means we're getting rid of a data dependency, so it's
> good for performance. But more importantly, those (much fewer) places in
> the code where we actually _do_ need page->index and page->mapping are
> really important places to be able to find, because they're interesting
> boundaries between different components in the VM.

*nod*

This is where infrastructure like write_cache_pages() is problematic. It's not actually a component of the VM - it's core page cache/filesystem API functionality - but the implementation is determined by the fact that there is no clear abstraction between the page cache and the VM, and so while the filesystem side of the API is byte-range based, the VM side is struct page based and the impedance mismatch has to be handled in the page cache implementation.

Folios are definitely pointing out issues like this whilst, IMO, demonstrating that an abstraction like folios is also a necessary first step to addressing the problems they make obvious...

Cheers,

Dave.
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> Q: Oh yeah, but what again are folios for, exactly?
>
> Folios are for cached filesystem data which (importantly) may be mapped to
> userspace.
>
> So when MM people see a new data structure come up with new references to
> page size - there's a very good reason for that, which is that we need to
> be allocating in multiples of the hardware page size if we're going to be
> able to map it to userspace and have PTEs point to it.
>
> So going forward, if the MM people want struct page to refer to multiple
> hardware pages - this shouldn't prevent that, and folios will refer to
> multiples of the _hardware_ page size, not struct page pagesize.
>
> Also - all the filesystem code that's being converted tends to talk and
> think in units of pages. So going forward, it would be a nice cleanup to
> get rid of as many of those references as possible and just talk in terms
> of bytes (e.g. I have generally been trying to get rid of references to
> PAGE_SIZE in bcachefs wherever reasonable, for other reasons) - those
> cleanups are probably for another patch series, and in the interests of
> getting this patch series merged with the fewest introduced bugs possible
> we probably want the current helpers.

I'd like to thank those who reached out off-list. Some of you know I've had trouble with depression in the past, and I'd like to reassure you that that's not a problem at the moment. I had a good holiday, and I was able to keep from thinking about folios most of the time.

I'd also like to thank those who engaged in the discussion while I was gone. A lot of good points have been made. I don't think the normal style of replying to each email individually makes a lot of sense at this point, so I'll make some general comments instead. I'll respond to the process issues on the other thread.
I agree with the feeling a lot of people have expressed, that struct page is massively overloaded and we would do much better with stronger typing. I like it when the compiler catches bugs for me. Disentangling struct page is something I've been working on for a while, and folios are a step in that direction (in that they remove the two types of tail page from the universe of possibilities).

I don't believe it is realistic to disentangle file pages and anon pages from each other. Thanks to swap and shmem, both file pages and anon pages need to be able to be moved in and out of the swap cache. The swap cache shares a lot of code with the page cache, so changing how the swap cache works is also tricky.

What I do believe is possible is something Kent hinted at: treating anon pages more like file pages. I also believe that shmem should be able to write pages to swap without moving the pages into the swap cache first. But these two things are just beliefs. I haven't tried to verify them and they may come to nothing.

I also want to split out slab_page and page_table_page from struct page. I don't intend to convert either of those to folios. I do want to make struct page dynamically allocated (and have for a while). There are some complicating factors ...

There are two primary places where we need to map from a physical address to a "memory descriptor". The one that most people care about is get_user_pages(). We have a page table entry and need to increment the refcount on the head page, possibly mark the head page dirty, but also return the subpage of any compound page we find. The one that far fewer people care about is memory-failure.c; we also need to find the head page to determine what kind of memory has been affected, but we need to mark the subpage as HWPoison. Both of these need to be careful to not confuse tail and non-tail pages.

So yes, we need to use folios for anything that's mappable to userspace.
That's not just anon & file pages but also network pools, graphics card memory and vmalloc memory. Eventually, I think struct page actually goes down to a union of a few words of padding, along with ->compound_head. Because that's all we're guaranteed is actually there; everything else is only there in head pages. There are a lot of places that should use folios which the current patchset doesn't convert. I prioritised filesystems because we've got ~60 filesystems to convert, and working on the filesystems can proceed in parallel with working on the rest of the MM. Also, if I converted the entire MM at once, there would be complaints that a 600 patch series was unreviewable. So here we are, there's a bunch of compatibility code that indicates areas which still need to be converted. I'm sure I've missed things, but I've been working on this email all day and wanted to send it out before going to sleep.
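The compound_head encoding that keeps coming up in this thread can be modelled in a few lines of userspace C. This is a simplified sketch, not the real kernel definitions: the names mirror the kernel's, but the structs are reduced to the fields under discussion. The key property is that tail pages store a pointer to their head page with bit 0 set, so any page can be resolved to its head, while a folio is by construction never a tail page.

```c
#include <assert.h>
#include <stdbool.h>

struct page {
	unsigned long compound_head;	/* bit 0 set => this is a tail page */
	unsigned long flags;
};

struct folio {
	struct page page;		/* never a tail page */
};

static bool PageTail(struct page *page)
{
	return page->compound_head & 1;
}

static struct page *compound_head(struct page *page)
{
	if (PageTail(page))
		return (struct page *)(page->compound_head - 1);
	return page;
}

/* The folio conversion: any page, head or tail, maps to its folio. */
static struct folio *page_folio(struct page *page)
{
	return (struct folio *)compound_head(page);
}

/* Mark pages 1..2^order-1 as tails of pages[0]. */
static void prep_compound_page(struct page *pages, unsigned int order)
{
	unsigned long nr = 1UL << order;

	for (unsigned long i = 1; i < nr; i++)
		pages[i].compound_head = (unsigned long)&pages[0] | 1;
}
```

Under this model, a function taking a struct folio * documents at the type level that it cannot receive a tail page, which is exactly the ambiguity the conversion removes.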
On Fri, Sep 17, 2021 at 07:15:40PM -0400, Johannes Weiner wrote: > The code I'm specifically referring to here is the conversion of some > code that encounters both anon and file pages - swap.c, memcontrol.c, > workingset.c, and a few other places. It's a small part of the folio > patches, but it's a big deal for the MM code conceptually. Hard to say without actually trying, but my worry here is that this may lead to code duplication to separate the file and anon code paths. I dunno.
Just a note upfront: This discussion is now about whether folios are suitable for anon pages as well. I'd like to reiterate that regardless of the outcome of this discussion I think we should probably move ahead with the page cache bits, since people are specifically blocked on those and there is no dependency on the anon stuff, as the conversion is incremental. On Mon, Sep 20, 2021 at 03:17:15AM +0100, Matthew Wilcox wrote: > I don't believe it is realistic to disentangle file pages and anon > pages from each other. Thanks to swap and shmem, both file pages and > anon pages need to be able to be moved in and out of the swap cache. Yes, the swapcache is actually shared code and needs a shared type. However, once swap and shmem are fully operating on *typed* anon and file pages, there are no possible routes of admission for tail pages into the swapcache:

	vmscan: add_to_swap_cache(anon_page->page);
	shmem: delete_from_swap_cache(file_page->page);

and so the justification for replacing page with folio *below* those entry points to address tailpage confusion becomes nil: there is no confusion. Move the anon bits to anon_page and leave the shared bits in page. That's 912 lines of swap_state.c we could mostly leave alone. The same is true for the LRU code in swap.c. Conceptually, already no tailpages *should* make it onto the LRU. Once the high-level page instantiation functions - add_to_page_cache_lru, do_anonymous_page - have type safety, you really do not need to worry about tail pages deep in the LRU code. 1155 more lines of swap.c. And when you've ensured that tail pages can't make it onto the LRU, that takes care of the entire page reclaim code as well; converting it wholesale to folio again would provide little additional value. 4707 lines of vmscan.c. And with the page instantiation functions typed, nobody can pass tail pages into memcg charging code, either. 7509 lines of memcontrol.c. 
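The typed entry points described above can be sketched in a few lines. Everything here is hypothetical: anon_page and file_page are the wrapper types proposed in this email, not kernel code, and the helper names are invented for illustration. The point is structural: the shared swapcache implementation keeps taking a plain page, but because only the typed wrappers can call into it, a tail page has no route in.

```c
#include <assert.h>

struct page { unsigned long compound_head; };
struct anon_page { struct page page; };	/* hypothetical typed wrappers */
struct file_page { struct page page; };

static int swapcache_entries;

/* Shared implementation, common to anon and file. */
static int __add_to_swap_cache(struct page *page)
{
	swapcache_entries++;
	return 0;
}

/* Typed admission points: tail pages have no anon_page/file_page
 * wrapper, so they cannot reach the shared code by construction. */
static int add_anon_to_swap_cache(struct anon_page *anon)
{
	return __add_to_swap_cache(&anon->page);
}

static int add_shmem_to_swap_cache(struct file_page *file)
{
	return __add_to_swap_cache(&file->page);
}
```

The compiler, rather than runtime compound_head() calls, enforces the "no tail pages below this line" invariant.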
But back to your generic swapcache example: beyond the swapcache and the page LRU management, there really isn't a great deal of code that is currently truly type-agnostic and generic like that. And the rest could actually benefit from being typed more tightly to bring out what is actually going on. The anon_page->page relationship may look familiar too. It's a natural type hierarchy between superclass and subclasses that is common in object oriented languages: page has attributes and methods that are generic and shared; anon_page and file_page encode where their implementation differs. A type system like that would set us up for a lot of clarification and generalization of the MM code. For example it would immediately highlight when "generic" code is trying to access type-specific stuff that maybe it shouldn't, and thus help/force us refactor - something that a shared, flat folio type would not. And again, higher-level types would take care of the tail page confusion in many (most?) places automatically. > The swap cache shares a lot of code with the page cache, so changing > how the swap cache works is also tricky. The overlap is actually fairly small right now. Add and delete routines are using the raw xarray functions. Lookups use the most minimal version of find_get_page(), which wouldn't be a big deal to open-code until swapcache and pagecache would *actually* be unified. > What I do believe is possible is something Kent hinted at; treating > anon pages more like file pages. I also believe that shmem should > be able to write pages to swap without moving the pages into the > swap cache first. But these two things are just beliefs. I haven't > tried to verify them and they may come to nothing. Treating anon and file pages the same where possible makes sense. It's simple: the more code that can be made truly generic and be shared between subclasses, the better. 
However, for that we first have to identify what parts actually are generic, and what parts are falsely shared and shoehorned into equivalency due to being crammed into the same overloaded structure. For example, page->mapping for file is an address_space and the page's membership in that tree structure is protected by the page lock. page->mapping for anon is... not that. The pointer itself is ad-hoc typed to point to an anon_vma instead. And anon_vmas behave completely differently from a page's pagecache state. The *swapcache* state of an anon page is actually much closer to what the pagecache state of a file page is. And since it would be nice to share more of the swapcache and pagecache *implementation*, it makes sense that the relevant page attributes would correspond as well. (Yeah, page->mapping and page->index are used "the same way" for rmap, but that's a much smaller, read-only case. And when you look at how "generic" the rmap code is - with its foo_file and foo_anon functions, and PageAnon() checks, and conditional page locking in the shared bits-- the attribute sharing at the page level really did nothing to help the implementation be more generic.) It really should be something like:

	struct page {
		/* pagecache/swapcache state */
		struct address_space *address_space;
		pgoff_t index;
		lock_t lock;
	}

	struct file_page {
		struct page;
	}

	struct anon_page {
		struct page;
		struct anon_vma *anon_vma;
		pgoff_t offset;
	};

to recognize the difference in anon vs file rmapping and locking, while recognizing the similarity between swapcache and pagecache. A shared folio would perpetuate false equivalencies between anon and file which make it difficult to actually split out and refactor what *should* be generic vs what should be type-specific. And instead lead to more "generic" code littered with FolioAnon() conditionals. 
And in the name of tail page cleanup it would churn through thousands of lines of code where there is no conceptual confusion about tail pages to begin with. Proper type inheritance would allow us to encode how things actually are implemented right now and would be a great first step in identifying what needs to be done in order to share more code. And it would take care of so many places re: tail pages that it's a legitimate question to ask: how many places would actually be *left* that *need* to deal with tail pages? Couldn't we bubble compound_head() and friends into these few select places and be done? > I also want to split out slab_page and page_table_page from struct page. > I don't intend to convert either of those to folios. > > I do want to make struct page dynamically allocated (and have for > a while). There are some complicating factors ... > > There are two primary places where we need to map from a physical > address to a "memory descriptor". The one that most people care about > is get_user_pages(). We have a page table entry and need to increment > the refcount on the head page, possibly mark the head page dirty, but > also return the subpage of any compound page we find. The one that far > fewer people care about is memory-failure.c; we also need to find the > head page to determine what kind of memory has been affected, but we > need to mark the subpage as HWPoison. > > Both of these need to be careful to not confuse tail and non-tail pages. That makes sense. But gup() as an interface to the rest of the kernel is rather strange: It's not a generic page table walker that can take a callback argument to deal with whatever the page table points to. It also doesn't return properly typed objects: it returns struct page which is currently a wildcard for whatever people cram into it. > So yes, we need to use folios for anything that's mappable to userspace. 
> That's not just anon & file pages but also network pools, graphics card > memory and vmalloc memory. Eventually, I think struct page actually goes > down to a union of a few words of padding, along with ->compound_head. > Because that's all we're guaranteed is actually there; everything else > is only there in head pages. (Side question: if GUP can return tail pages, how does that map to folios?) Anyway, I don't see that folio for everything mappable is the obvious conclusion, because it doesn't address what is really weird about gup. While a folio interface would clean up the head and tail page issue, it maintains the incentive of cramming everything that people want to mmap and gup into the same wildcard type struct. And still leave the bigger problem of ad-hoc typing that wildcard ("What is the thing that was returned? Anon? File? GPU memory?") to the user. I think rather than cramming everything that can be mmapped into folio for the purpose of GUP and tail pages - even when these objects have otherwise little in common - it would make more sense to reconsider how GUP as an interface deals with typing. Some options:

a) Make it a higher-order function that leaves typing fully to the provided callback. This makes it clear (and greppable) which functions need to be wary about tail pages, and type inference in general.

b) Create an intermediate mmap type that can map to one of the higher-order types like anon or file, but never to a tail page. This sounds like what you want struct page to be long-term. But this sort of depends on the side question above - what if the pte maps a tail page?

c) Provide a stricter interface for known higher-order types (get_anon_pages...). Supporting new types means adding more entry functions, which IMO is preferable to cramming more stuff into a wildcard struct folio.

d) A hybrid of a) and c) to safely cover common cases, while allowing "i know what i'm doing" uses. 
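Option a) above can be made concrete with a small sketch. Every name here is invented for illustration - walk_user_mappings(), the callback type, the error value - and this is not what the kernel's get_user_pages() looks like; it only shows the shape of a higher-order walker that leaves typing decisions to the caller.

```c
#include <assert.h>

struct page { unsigned long compound_head; };

/* Caller-supplied callback: gets each mapped page, decides what it is. */
typedef int (*gup_cb_t)(struct page *page, void *priv);

/* Hypothetical higher-order walker: no typing policy of its own. */
static int walk_user_mappings(struct page **mapped, int nr,
			      gup_cb_t cb, void *priv)
{
	for (int i = 0; i < nr; i++) {
		int err = cb(mapped[i], priv);
		if (err)
			return err;	/* callback can abort the walk */
	}
	return 0;
}

/* Example callback: count pages, refusing tail pages loudly. */
static int count_nontail(struct page *page, void *priv)
{
	if (page->compound_head & 1)
		return -1;	/* tail page: made the caller's problem */
	(*(int *)priv)++;
	return 0;
}
```

The greppable property the email mentions falls out naturally: every callback passed to the walker is a place that must have an answer for tail pages.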
In summary, I think a page type hierarchy would do wonders to clean up anon and file page implementations, and encourage and enable more code sharing down the line, while taking care of tail pages as well. This leaves the question of how many places are actually *left* to deal with tail pages in MM. Folios are based on the premise that the confusion is simply everywhere, and that everything needs to be converted first to be safe. This is convenient because it means we never have to identify which parts truly *do* need tailpage handling, truly *need* the compound_head() lookups. Yes, compound_head() has to go from generic page flags testers. But as per the examples at the top, I really don't think we need to convert every crevice of the MM code to folio before we can be reasonably sure that removing it is safe. I really want to see a better ballpark analysis of what parts need to deal with tail pages to justify all this churn for them.
On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > This discussion is now about whether folio are suitable for anon pages > as well. I'd like to reiterate that regardless of the outcome of this > discussion I think we should probably move ahead with the page cache > bits, since people are specifically blocked on those and there is no > dependency on the anon stuff, as the conversion is incremental. So you withdraw your NAK for the 5.15 pull request which is now four weeks old and has utterly missed the merge window? > and so the justification for replacing page with folio *below* those > entry points to address tailpage confusion becomes nil: there is no > confusion. Move the anon bits to anon_page and leave the shared bits > in page. That's 912 lines of swap_state.c we could mostly leave alone. Your argument seems to be based on "minimising churn". Which is certainly a goal that one could have, but I think in this case is actually harmful. There are hundreds, maybe thousands, of functions throughout the kernel (certainly throughout filesystems) which assume that a struct page is PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, but tracking them all down is a never-ending task as new ones will be added as fast as they can be removed. > The same is true for the LRU code in swap.c. Conceptually, already no > tailpages *should* make it onto the LRU. Once the high-level page > instantiation functions - add_to_page_cache_lru, do_anonymous_page - > have type safety, you really do not need to worry about tail pages > deep in the LRU code. 1155 more lines of swap.c. It's actually impossible in practice as well as conceptually. The list LRU is in the union with compound_head, so you cannot put a tail page onto the LRU. But yet we call compound_head() on every one of them multiple times because our current type system does not allow us to express "this is not a tail page". > The anon_page->page relationship may look familiar too. 
It's a natural > type hierarchy between superclass and subclasses that is common in > object oriented languages: page has attributes and methods that are > generic and shared; anon_page and file_page encode where their > implementation differs. > > A type system like that would set us up for a lot of clarification and > generalization of the MM code. For example it would immediately > highlight when "generic" code is trying to access type-specific stuff > that maybe it shouldn't, and thus help/force us refactor - something > that a shared, flat folio type would not. If you want to try your hand at splitting out anon_folio from folio later, be my guest. I've just finished splitting out 'slab' from page, and I'll post it later. I don't think that splitting anon_folio from folio is worth doing, but will not stand in your way. I do think that splitting tail pages from non-tail pages is worthwhile, and that's what this patchset does.
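The union Matthew refers to above ("the list LRU is in the union with compound_head") can be shown in a reduced form. This is a simplified sketch of the real struct page layout, cut down to the two overlapping members under discussion: because the LRU linkage and compound_head share storage, and list pointers are word-aligned (bit 0 clear), a page that is on an LRU list cannot simultaneously be a tail page.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

struct page {
	unsigned long flags;
	union {
		struct list_head lru;		/* head pages on an LRU list */
		unsigned long compound_head;	/* tail pages: head | 1 */
	};
};
```

So the invariant is enforced by the memory layout itself; the thread's disagreement is only over whether the remaining compound_head() calls on LRU pages are defensive noise or necessary until the types say so.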
On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote: > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > > and so the justification for replacing page with folio *below* those > > entry points to address tailpage confusion becomes nil: there is no > > confusion. Move the anon bits to anon_page and leave the shared bits > > in page. That's 912 lines of swap_state.c we could mostly leave alone. > > Your argument seems to be based on "minimising churn". Which is certainly > a goal that one could have, but I think in this case is actually harmful. > There are hundreds, maybe thousands, of functions throughout the kernel > (certainly throughout filesystems) which assume that a struct page is > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, > but tracking them all down is a never-ending task as new ones will be > added as fast as they can be removed. Yet it's only file backed pages that are actually changing in behaviour right now - folios don't _have_ to be the tool to fix that elsewhere, for anon, for network pools, for slab. > > The anon_page->page relationship may look familiar too. It's a natural > > type hierarchy between superclass and subclasses that is common in > > object oriented languages: page has attributes and methods that are > > generic and shared; anon_page and file_page encode where their > > implementation differs. > > > > A type system like that would set us up for a lot of clarification and > > generalization of the MM code. For example it would immediately > > highlight when "generic" code is trying to access type-specific stuff > > that maybe it shouldn't, and thus help/force us refactor - something > > that a shared, flat folio type would not. > > If you want to try your hand at splitting out anon_folio from folio > later, be my guest. I've just finished splitting out 'slab' from page, > and I'll post it later. 
I don't think that splitting anon_folio from > folio is worth doing, but will not stand in your way. I do think that > splitting tail pages from non-tail pages is worthwhile, and that's what > this patchset does. Eesh, we can and should hold ourselves to a higher standard in our technical discussions. Let's not let past misfortune (and yes, folios missing 5.15 _was_ unfortunate and shouldn't have happened) colour our perceptions and keep us from having productive working relationships going forward. The points Johannes is bringing up are valid and pertinent and deserve to be discussed. If you're still trying to sell folios as the be all, end all solution for anything using compound pages, I think you should be willing to make the argument that that really is the _right_ solution - not just that it was the one easiest for you to implement. Actual code might make this discussion more concrete and clearer. Could you post your slab conversion?
On Tue, Sep 21, 2021 at 05:11:09PM -0400, Kent Overstreet wrote: > On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote: > > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > > > and so the justification for replacing page with folio *below* those > > > entry points to address tailpage confusion becomes nil: there is no > > > confusion. Move the anon bits to anon_page and leave the shared bits > > > in page. That's 912 lines of swap_state.c we could mostly leave alone. > > > > Your argument seems to be based on "minimising churn". Which is certainly > > a goal that one could have, but I think in this case is actually harmful. > > There are hundreds, maybe thousands, of functions throughout the kernel > > (certainly throughout filesystems) which assume that a struct page is > > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, > > but tracking them all down is a never-ending task as new ones will be > > added as fast as they can be removed. > > Yet it's only file backed pages that are actually changing in behaviour right > now - folios don't _have_ to be the tool to fix that elsewhere, for anon, for > network pools, for slab. > > > > The anon_page->page relationship may look familiar too. It's a natural > > > type hierarchy between superclass and subclasses that is common in > > > object oriented languages: page has attributes and methods that are > > > generic and shared; anon_page and file_page encode where their > > > implementation differs. > > > > > > A type system like that would set us up for a lot of clarification and > > > generalization of the MM code. For example it would immediately > > > highlight when "generic" code is trying to access type-specific stuff > > > that maybe it shouldn't, and thus help/force us refactor - something > > > that a shared, flat folio type would not. > > > > If you want to try your hand at splitting out anon_folio from folio > > later, be my guest. 
I've just finished splitting out 'slab' from page, > > and I'll post it later. I don't think that splitting anon_folio from > > folio is worth doing, but will not stand in your way. I do think that > > splitting tail pages from non-tail pages is worthwhile, and that's what > > this patchset does. > > Eesh, we can and should hold ourselves to a higher standard in our technical > discussions. > > Let's not let past misfourtune (and yes, folios missing 5.15 _was_ unfortunate > and shouldn't have happened) colour our perceptions and keep us from having > productive working relationships going forward. The points Johannes is bringing > up are valid and pertinent and deserve to be discussed. > > If you're still trying to sell folios as the be all, end all solution for > anything using compound pages, I think you should be willing to make the > argument that that really is the _right_ solution - not just that it was the one > easiest for you to implement. > > Actual code might make this discussion more concrete and clearer. Could you post > your slab conversion? Linus, I'd also like to humbly and publicly request that, despite it being past the merge window and a breach of our normal process, folios still be merged for 5.15. Or failing that, that they're the first thing in for 5.16. The reason for my request is that: - folios, at least in filesystem land, solve pressing problems and much work has been done on top of them assuming they go in, and the filesystem people seem to be pretty unanimous that we both want and need this - the public process and discussion has been a trainwreck. 
We're effectively arguing about the future of struct page, which is a "boiling the oceans" type issue, and the amount of mess that needs to be cleaned up makes it hard for parties working in different areas of the code with different interests and concerns to see the areas where we really do have common interests and goals - it's become apparent that there haven't been any real objections to the code that was queued up for 5.15. There _are_ very real discussions and points of contention still to be decided and resolved for the work beyond file backed pages, but those discussions were what derailed the more modest, and more badly needed, work that affects everyone in filesystem land - And, last but not least: it would really help with the frustration levels that have been making these discussions extraordinarily difficult. I think this whole thing has been showing that our process has some weak points where hopefully we'll do better in the future, but in the meantime - Matthew has been doing good and badly needed work, and he has my vote of confidence. I don't necessarily fully agree with _everything_ he wants to do with folios - I'm not writing a blank check here - but he's someone I can work with and want to continue to work with. Johannes too, for that matter. Thanks and regards, Kent
On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote: > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > > This discussion is now about whether folio are suitable for anon pages > > as well. I'd like to reiterate that regardless of the outcome of this > > discussion I think we should probably move ahead with the page cache > > bits, since people are specifically blocked on those and there is no > > dependency on the anon stuff, as the conversion is incremental. > > So you withdraw your NAK for the 5.15 pull request which is now four > weeks old and has utterly missed the merge window? Once you drop the bits that convert shared anon and file infrastructure, yes. Because we haven't discussed yet, nor agree on, that folio are the way forward for anon pages. > > and so the justification for replacing page with folio *below* those > > entry points to address tailpage confusion becomes nil: there is no > > confusion. Move the anon bits to anon_page and leave the shared bits > > in page. That's 912 lines of swap_state.c we could mostly leave alone. > > Your argument seems to be based on "minimising churn". Which is certainly > a goal that one could have, but I think in this case is actually harmful. > There are hundreds, maybe thousands, of functions throughout the kernel > (certainly throughout filesystems) which assume that a struct page is > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, > but tracking them all down is a never-ending task as new ones will be > added as fast as they can be removed. What does that have to do with anon pages? > > The same is true for the LRU code in swap.c. Conceptually, already no > > tailpages *should* make it onto the LRU. Once the high-level page > > instantiation functions - add_to_page_cache_lru, do_anonymous_page - > > have type safety, you really do not need to worry about tail pages > > deep in the LRU code. 1155 more lines of swap.c. 
> > It's actually impossible in practice as well as conceptually. The list > LRU is in the union with compound_head, so you cannot put a tail page > onto the LRU. But yet we call compound_head() on every one of them > multiple times because our current type system does not allow us to > express "this is not a tail page". No, because we haven't identified *who actually needs* these calls and move them up and out of the low-level helpers. It was a mistake to add them there, yes. But they were added recently for rather few callers. And we've had people send patches already to move them where they are actually needed. Of course converting *absolutely everybody else* to not-tailpage instead will also fix the problem... I just don't agree that this is an appropriate response to the issue. Asking again: who conceptually deals with tail pages in MM? LRU and reclaim don't. The page cache doesn't. Compaction doesn't. Migration doesn't. All these data structures and operations are structured around headpages, because that's the logical unit they operate on. The notable exception, of course, are the page tables because they map the pfns of tail pages. But is that it? Does it come down to page table walkers encountering pte-mapped tailpages? And needing compound_head() before calling mark_page_accessed() or set_page_dirty()? We couldn't fix vm_normal_page() to handle this? And switch khugepaged to a new vm_raw_page() or whatever? It should be possible to answer this question as part of the case for converting tens of thousands of lines of code to folio.
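The alternative floated here - bubble compound_head() up into the few select places that can actually see pte-mapped tail pages - can be sketched as follows. The function names are illustrative stubs, not the kernel's implementations; the point is only the shape: resolve the head once at the walker's entry point, so everything below it circulates head pages only.

```c
#include <assert.h>

struct page {
	unsigned long compound_head;	/* bit 0 set => tail page */
	int accessed, dirty;
};

static struct page *compound_head(struct page *page)
{
	if (page->compound_head & 1)
		return (struct page *)(page->compound_head - 1);
	return page;
}

/* Stubs standing in for mark_page_accessed()/set_page_dirty(). */
static void mark_page_accessed(struct page *page) { page->accessed++; }
static void set_page_dirty(struct page *page)     { page->dirty++; }

/* Hypothetical page-table-walker entry point: the one place that can
 * encounter a pte-mapped tail page does the head lookup... */
static void fault_touch_page(struct page *pte_page)
{
	struct page *head = compound_head(pte_page);

	/* ...and from here down, only head pages are passed around. */
	mark_page_accessed(head);
	set_page_dirty(head);
}
```

Whether the set of such entry points really is this small is exactly the "ballpark analysis" being asked for in this email.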
On Tue, Sep 21, 2021 at 05:11:09PM -0400, Kent Overstreet wrote: > On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote: > > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > > > and so the justification for replacing page with folio *below* those > > > entry points to address tailpage confusion becomes nil: there is no > > > confusion. Move the anon bits to anon_page and leave the shared bits > > > in page. That's 912 lines of swap_state.c we could mostly leave alone. > > > > Your argument seems to be based on "minimising churn". Which is certainly > > a goal that one could have, but I think in this case is actually harmful. > > There are hundreds, maybe thousands, of functions throughout the kernel > > (certainly throughout filesystems) which assume that a struct page is > > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, > > but tracking them all down is a never-ending task as new ones will be > > added as fast as they can be removed. > > Yet it's only file backed pages that are actually changing in behaviour right > now - folios don't _have_ to be the tool to fix that elsewhere, for anon, for > network pools, for slab. The point (I think) Johannes is making is that some of the patches in this series touch code paths which are used by both anon and file pages. And it's those he's objecting to. > > If you want to try your hand at splitting out anon_folio from folio > > later, be my guest. I've just finished splitting out 'slab' from page, > > and I'll post it later. I don't think that splitting anon_folio from > > folio is worth doing, but will not stand in your way. I do think that > > splitting tail pages from non-tail pages is worthwhile, and that's what > > this patchset does. > > Eesh, we can and should hold ourselves to a higher standard in our technical > discussions. 
> > Let's not let past misfourtune (and yes, folios missing 5.15 _was_ unfortunate > and shouldn't have happened) colour our perceptions and keep us from having > productive working relationships going forward. The points Johannes is bringing > up are valid and pertinent and deserve to be discussed. > > If you're still trying to sell folios as the be all, end all solution for > anything using compound pages, I think you should be willing to make the > argument that that really is the _right_ solution - not just that it was the one > easiest for you to implement. Starting from the principle that the type of a pointer should never be wrong, GUP can convert from a PTE to a struct page. We need a name for the head page that GUP converts to, and my choice for that name is folio. A folio needs a refcount, a lock bit and a dirty bit. By the way, I think I see a path to: struct page { unsigned long compound_head; }; which will reduce the overhead of struct page from 64 bytes to 8. That should solve one of Johannes' problems. > Actual code might make this discussion more concrete and clearer. Could you post > your slab conversion? It's a bit big and deserves to be split into multiple patches. It's on top of folio-5.15. It also only really works for SLUB right now; CONFIG_SLAB doesn't compile yet. It does pass xfstests with CONFIG_SLUB ;-) I'm not entirely convinced I've done the right thing with page_memcg_check(). There's probably other things wrong with it, I was banging it out during gaps between sessions at Plumbers. 
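A quick back-of-envelope for the "64 bytes down to 8" remark above: with 4KiB base pages, the memmap overhead per GiB of RAM scales linearly with the descriptor size. The helper below is just arithmetic for this thread, not kernel code.

```c
#include <assert.h>

/* Bytes of memmap needed per GiB of RAM, for a given struct page size,
 * assuming 4KiB base pages (262144 pages per GiB). */
static unsigned long memmap_bytes_per_gib(unsigned long descriptor_size)
{
	unsigned long pages_per_gib = (1UL << 30) / 4096;

	return pages_per_gib * descriptor_size;
}
```

At 64 bytes per struct page that works out to 16 MiB of memmap per GiB (about 1.6% of memory); at 8 bytes, 2 MiB per GiB (about 0.2%).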
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ddeaba947eb3..5f3d2efeb88b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -981,7 +981,7 @@ static void __meminit free_pagetable(struct page *page, int order)
 
 	if (PageReserved(page)) {
 		__ClearPageReserved(page);
-		magic = (unsigned long)page->freelist;
+		magic = page->index;
 		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
 			while (nr_pages--)
 				put_page_bootmem(page++);
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 2bc8b1f69c93..cc35d010fa94 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -30,7 +30,7 @@ void put_page_bootmem(struct page *page);
  */
 static inline void free_bootmem_page(struct page *page)
 {
-	unsigned long magic = (unsigned long)page->freelist;
+	unsigned long magic = page->index;
 
 	/*
 	 * The reserve_bootmem_region sets the reserved flag on bootmem
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index dd874a1ee862..59c860295618 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -188,11 +188,11 @@ static __always_inline size_t kasan_metadata_size(struct kmem_cache *cache)
 	return 0;
 }
 
-void __kasan_poison_slab(struct page *page);
-static __always_inline void kasan_poison_slab(struct page *page)
+void __kasan_poison_slab(struct slab *slab);
+static __always_inline void kasan_poison_slab(struct slab *slab)
 {
 	if (kasan_enabled())
-		__kasan_poison_slab(page);
+		__kasan_poison_slab(slab);
 }
 
 void __kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
@@ -317,7 +317,7 @@ static inline void kasan_cache_create(struct kmem_cache *cache,
 					slab_flags_t *flags) {}
 static inline void kasan_cache_create_kmalloc(struct kmem_cache *cache) {}
 static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; }
-static inline void kasan_poison_slab(struct page *page) {}
+static inline void kasan_poison_slab(struct slab *slab) {}
 static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
 					void *object) {}
 static inline void kasan_poison_object_data(struct kmem_cache *cache,
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 562b27167c9e..1c0b3b95bdd7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -546,41 +546,39 @@ static inline bool folio_memcg_kmem(struct folio *folio)
 }
 
 /*
- * page_objcgs - get the object cgroups vector associated with a page
- * @page: a pointer to the page struct
+ * slab_objcgs - get the object cgroups vector associated with a slab
+ * @slab: a pointer to the slab struct
  *
- * Returns a pointer to the object cgroups vector associated with the page,
- * or NULL. This function assumes that the page is known to have an
- * associated object cgroups vector. It's not safe to call this function
- * against pages, which might have an associated memory cgroup: e.g.
- * kernel stack pages.
+ * Returns a pointer to the object cgroups vector associated with the slab,
+ * or NULL. This function assumes that the slab is known to have an
+ * associated object cgroups vector.
  */
-static inline struct obj_cgroup **page_objcgs(struct page *page)
+static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
 {
-	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+	unsigned long memcg_data = READ_ONCE(slab->memcg_data);
 
-	VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS), page);
-	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
+	VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS), &slab->page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, &slab->page);
 
 	return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
 
 /*
- * page_objcgs_check - get the object cgroups vector associated with a page
- * @page: a pointer to the page struct
+ * slab_objcgs_check - get the object cgroups vector associated with a slab
+ * @slab: a pointer to the slab struct
  *
- * Returns a pointer to the object cgroups vector associated with the page,
- * or NULL. This function is safe to use if the page can be directly associated
+ * Returns a pointer to the object cgroups vector associated with the slab,
+ * or NULL. This function is safe to use if the slab can be directly associated
  * with a memory cgroup.
  */
-static inline struct obj_cgroup **page_objcgs_check(struct page *page)
+static inline struct obj_cgroup **slab_objcgs_check(struct slab *slab)
 {
-	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+	unsigned long memcg_data = READ_ONCE(slab->memcg_data);
 
 	if (!memcg_data || !(memcg_data & MEMCG_DATA_OBJCGS))
 		return NULL;
 
-	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, &slab->page);
 
 	return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
@@ -591,12 +589,12 @@ static inline bool folio_memcg_kmem(struct folio *folio)
 	return false;
 }
 
-static inline struct obj_cgroup **page_objcgs(struct page *page)
+static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
 {
 	return NULL;
 }
 
-static inline struct obj_cgroup **page_objcgs_check(struct page *page)
+static inline struct obj_cgroup **slab_objcgs_check(struct slab *slab)
 {
 	return NULL;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1066afc9a06d..6db4d64ebe6d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -109,33 +109,6 @@ struct page {
 			 */
 			unsigned long dma_addr[2];
 		};
-		struct {	/* slab, slob and slub */
-			union {
-				struct list_head slab_list;
-				struct {	/* Partial pages */
-					struct page *next;
-#ifdef CONFIG_64BIT
-					int pages;	/* Nr of pages left */
-					int pobjects;	/* Approximate count */
-#else
-					short int pages;
-					short int pobjects;
-#endif
-				};
-			};
-			struct kmem_cache *slab_cache; /* not slob */
-			/* Double-word boundary */
-			void *freelist;		/* first free object */
-			union {
-				void *s_mem;	/* slab: first object */
-				unsigned long counters;		/* SLUB */
-				struct {			/* SLUB */
-					unsigned inuse:16;
-					unsigned objects:15;
-					unsigned frozen:1;
-				};
-			};
-		};
 		struct {	/* Tail pages of compound page */
 			unsigned long compound_head;	/* Bit zero is set */
@@ -199,9 +172,6 @@ struct page {
 	 * which are currently stored here.
 */
 	unsigned int page_type;
-
-	unsigned int active;		/* SLAB */
-	int units;			/* SLOB */
 };
 /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
@@ -231,6 +201,59 @@ struct page {
 #endif
 } _struct_page_alignment;
+struct slab {
+	union {
+		struct {
+			unsigned long flags;
+			union {
+				struct list_head slab_list;
+				struct {	/* Partial pages */
+					struct slab *next;
+#ifdef CONFIG_64BIT
+					int slabs;	/* Nr of slabs left */
+					int pobjects;	/* Approximate count */
+#else
+					short int slabs;
+					short int pobjects;
+#endif
+				};
+			};
+			struct kmem_cache *slab_cache; /* not slob */
+			/* Double-word boundary */
+			void *freelist;		/* first free object */
+			union {
+				void *s_mem;	/* slab: first object */
+				unsigned long counters;	/* SLUB */
+				struct {	/* SLUB */
+					unsigned inuse:16;
+					unsigned objects:15;
+					unsigned frozen:1;
+				};
+			};
+
+			union {
+				unsigned int active;	/* SLAB */
+				int units;		/* SLOB */
+			};
+			atomic_t _refcount;
+#ifdef CONFIG_MEMCG
+			unsigned long memcg_data;
+#endif
+		};
+		struct page page;
+	};
+};
+
+#define SLAB_MATCH(pg, sl) \
+	static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
+SLAB_MATCH(flags, flags);
+SLAB_MATCH(compound_head, slab_list);
+SLAB_MATCH(_refcount, _refcount);
+#ifdef CONFIG_MEMCG
+SLAB_MATCH(memcg_data, memcg_data);
+#endif
+#undef SLAB_MATCH
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b48bc214fe89..a21d14fec973 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -167,6 +167,8 @@ enum pageflags {
 	/* Remapped by swiotlb-xen. */
 	PG_xen_remapped = PG_owner_priv_1,
+	/* SLAB / SLUB / SLOB */
+	PG_pfmemalloc = PG_active,
 	/* SLOB */
 	PG_slob_free = PG_private,
@@ -193,6 +195,25 @@ static inline unsigned long _compound_head(const struct page *page)
 #define compound_head(page)	((typeof(page))_compound_head(page))
+/**
+ * page_slab - Converts from page to slab.
+ * @p: The page. + * + * This function cannot be called on a NULL pointer. It can be called + * on a non-slab page; the caller should check is_slab() to be sure + * that the slab really is a slab. + * + * Return: The slab which contains this page. + */ +#define page_slab(p) (_Generic((p), \ + const struct page *: (const struct slab *)_compound_head(p), \ + struct page *: (struct slab *)_compound_head(p))) + +static inline bool is_slab(struct slab *slab) +{ + return test_bit(PG_slab, &slab->flags); +} + /** * page_folio - Converts from page to folio. * @p: The page. @@ -921,34 +942,6 @@ extern bool is_free_buddy_page(struct page *page); __PAGEFLAG(Isolated, isolated, PF_ANY); -/* - * If network-based swap is enabled, sl*b must keep track of whether pages - * were allocated from pfmemalloc reserves. - */ -static inline int PageSlabPfmemalloc(struct page *page) -{ - VM_BUG_ON_PAGE(!PageSlab(page), page); - return PageActive(page); -} - -static inline void SetPageSlabPfmemalloc(struct page *page) -{ - VM_BUG_ON_PAGE(!PageSlab(page), page); - SetPageActive(page); -} - -static inline void __ClearPageSlabPfmemalloc(struct page *page) -{ - VM_BUG_ON_PAGE(!PageSlab(page), page); - __ClearPageActive(page); -} - -static inline void ClearPageSlabPfmemalloc(struct page *page) -{ - VM_BUG_ON_PAGE(!PageSlab(page), page); - ClearPageActive(page); -} - #ifdef CONFIG_MMU #define __PG_MLOCKED (1UL << PG_mlocked) #else diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index 3aa5e1e73ab6..f1bfcb10f5e0 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -87,11 +87,11 @@ struct kmem_cache { struct kmem_cache_node *node[MAX_NUMNODES]; }; -static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, +static inline void *nearest_obj(struct kmem_cache *cache, struct slab *slab, void *x) { - void *object = x - (x - page->s_mem) % cache->size; - void *last_object = page->s_mem + (cache->num - 1) * cache->size; + void *object = x - (x - 
slab->s_mem) % cache->size; + void *last_object = slab->s_mem + (cache->num - 1) * cache->size; if (unlikely(object > last_object)) return last_object; @@ -106,16 +106,16 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, * reciprocal_divide(offset, cache->reciprocal_buffer_size) */ static inline unsigned int obj_to_index(const struct kmem_cache *cache, - const struct page *page, void *obj) + const struct slab *slab, void *obj) { - u32 offset = (obj - page->s_mem); + u32 offset = (obj - slab->s_mem); return reciprocal_divide(offset, cache->reciprocal_buffer_size); } -static inline int objs_per_slab_page(const struct kmem_cache *cache, - const struct page *page) +static inline int objs_per_slab(const struct kmem_cache *cache, + const struct slab *slab) { - if (is_kfence_address(page_address(page))) + if (is_kfence_address(slab_address(slab))) return 1; return cache->num; } diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index dcde82a4434c..7394c959dc5f 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -43,9 +43,9 @@ enum stat_item { struct kmem_cache_cpu { void **freelist; /* Pointer to next available object */ unsigned long tid; /* Globally unique transaction id */ - struct page *page; /* The slab from which we are allocating */ + struct slab *slab; /* The slab from which we are allocating */ #ifdef CONFIG_SLUB_CPU_PARTIAL - struct page *partial; /* Partially allocated frozen slabs */ + struct slab *partial; /* Partially allocated frozen slabs */ #endif #ifdef CONFIG_SLUB_STATS unsigned stat[NR_SLUB_STAT_ITEMS]; @@ -159,16 +159,16 @@ static inline void sysfs_slab_release(struct kmem_cache *s) } #endif -void object_err(struct kmem_cache *s, struct page *page, +void object_err(struct kmem_cache *s, struct slab *slab, u8 *object, char *reason); void *fixup_red_left(struct kmem_cache *s, void *p); -static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, +static inline void 
*nearest_obj(struct kmem_cache *cache, struct slab *slab, void *x) { - void *object = x - (x - page_address(page)) % cache->size; - void *last_object = page_address(page) + - (page->objects - 1) * cache->size; + void *object = x - (x - slab_address(slab)) % cache->size; + void *last_object = slab_address(slab) + + (slab->objects - 1) * cache->size; void *result = (unlikely(object > last_object)) ? last_object : object; result = fixup_red_left(cache, result); @@ -184,16 +184,16 @@ static inline unsigned int __obj_to_index(const struct kmem_cache *cache, } static inline unsigned int obj_to_index(const struct kmem_cache *cache, - const struct page *page, void *obj) + const struct slab *slab, void *obj) { if (is_kfence_address(obj)) return 0; - return __obj_to_index(cache, page_address(page), obj); + return __obj_to_index(cache, slab_address(slab), obj); } -static inline int objs_per_slab_page(const struct kmem_cache *cache, - const struct page *page) +static inline int objs_per_slab(const struct kmem_cache *cache, + const struct slab *slab) { - return page->objects; + return slab->objects; } #endif /* _LINUX_SLUB_DEF_H */ diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c index 5b152dba7344..cf8f62c59b0a 100644 --- a/mm/bootmem_info.c +++ b/mm/bootmem_info.c @@ -15,7 +15,7 @@ void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) { - page->freelist = (void *)type; + page->index = type; SetPagePrivate(page); set_page_private(page, info); page_ref_inc(page); @@ -23,14 +23,13 @@ void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) void put_page_bootmem(struct page *page) { - unsigned long type; + unsigned long type = page->index; - type = (unsigned long) page->freelist; BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE); if (page_ref_dec_return(page) == 1) { - page->freelist = NULL; + page->index = 0; ClearPagePrivate(page); set_page_private(page, 0); INIT_LIST_HEAD(&page->lru); 
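The mm_types.h hunk above introduces `struct slab` as a type that overlays `struct page`, with `SLAB_MATCH()` static asserts pinning the shared fields (`flags`, `slab_list` over `compound_head`, `_refcount`, `memcg_data`) to identical offsets. A minimal userspace sketch of that overlay-plus-assert pattern — toy struct names, not the kernel's actual layout:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real struct page / struct slab are far larger. */
struct toy_page {
	unsigned long flags;
	void *compound_head;
	int _refcount;
};

struct toy_slab {
	union {
		struct {
			unsigned long flags;	/* must overlay toy_page.flags */
			void *slab_list;	/* must overlay compound_head */
			int _refcount;		/* must overlay _refcount */
		};
		struct toy_page page;	/* same memory, viewed as a page */
	};
};

/* The SLAB_MATCH pattern: break the build if an overlay field drifts. */
#define SLAB_MATCH(pg, sl) \
	static_assert(offsetof(struct toy_page, pg) == \
		      offsetof(struct toy_slab, sl), "offset mismatch")
SLAB_MATCH(flags, flags);
SLAB_MATCH(compound_head, slab_list);
SLAB_MATCH(_refcount, _refcount);
#undef SLAB_MATCH
```

Because the check is a compile-time `static_assert`, a mismatched overlay fails the build rather than corrupting page state at runtime; a write through one view of the union is visible through the other.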
diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 2baf121fb8c5..a8b9a7822b9f 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -247,8 +247,9 @@ struct kasan_free_meta *kasan_get_free_meta(struct kmem_cache *cache, } #endif -void __kasan_poison_slab(struct page *page) +void __kasan_poison_slab(struct slab *slab) { + struct page *page = &slab->page; unsigned long i; for (i = 0; i < compound_nr(page); i++) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c954fda9d7f4..c21b9a63fb4a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2842,16 +2842,16 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg) */ #define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT) -int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, +int memcg_alloc_slab_obj_cgroups(struct slab *slab, struct kmem_cache *s, gfp_t gfp, bool new_page) { - unsigned int objects = objs_per_slab_page(s, page); + unsigned int objects = objs_per_slab(s, slab); unsigned long memcg_data; void *vec; gfp &= ~OBJCGS_CLEAR_MASK; vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp, - page_to_nid(page)); + slab_nid(slab)); if (!vec) return -ENOMEM; @@ -2862,8 +2862,8 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, * it's memcg_data, no synchronization is required and * memcg_data can be simply assigned. */ - page->memcg_data = memcg_data; - } else if (cmpxchg(&page->memcg_data, 0, memcg_data)) { + slab->memcg_data = memcg_data; + } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) { /* * If the slab page is already in use, somebody can allocate * and assign obj_cgroups in parallel. 
In this case the existing @@ -2891,38 +2891,39 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, */ struct mem_cgroup *mem_cgroup_from_obj(void *p) { - struct page *page; + struct slab *slab; if (mem_cgroup_disabled()) return NULL; - page = virt_to_head_page(p); + slab = virt_to_slab(p); /* * Slab objects are accounted individually, not per-page. * Memcg membership data for each individual object is saved in - * the page->obj_cgroups. + * the slab->obj_cgroups. */ - if (page_objcgs_check(page)) { + if (slab_objcgs_check(slab)) { struct obj_cgroup *objcg; unsigned int off; - off = obj_to_index(page->slab_cache, page, p); - objcg = page_objcgs(page)[off]; + off = obj_to_index(slab->slab_cache, slab, p); + objcg = slab_objcgs(slab)[off]; if (objcg) return obj_cgroup_memcg(objcg); return NULL; } + /* I am pretty sure this is wrong */ /* - * page_memcg_check() is used here, because page_has_obj_cgroups() + * page_memcg_check() is used here, because slab_has_obj_cgroups() * check above could fail because the object cgroups vector wasn't set * at that moment, but it can be set concurrently. - * page_memcg_check(page) will guarantee that a proper memory + * page_memcg_check() will guarantee that a proper memory * cgroup pointer or NULL will be returned. 
*/ - return page_memcg_check(page); + return page_memcg_check(&slab->page); } __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) diff --git a/mm/slab.h b/mm/slab.h index f997fd5e42c8..1c6311fd7060 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -5,6 +5,69 @@ * Internal slab definitions */ +static inline void *slab_address(const struct slab *slab) +{ + return page_address(&slab->page); +} + +static inline struct pglist_data *slab_pgdat(const struct slab *slab) +{ + return page_pgdat(&slab->page); +} + +static inline int slab_nid(const struct slab *slab) +{ + return page_to_nid(&slab->page); +} + +static inline struct slab *virt_to_slab(const void *addr) +{ + struct page *page = virt_to_page(addr); + + return page_slab(page); +} + +static inline bool SlabMulti(const struct slab *slab) +{ + return test_bit(PG_head, &slab->flags); +} + +static inline int slab_order(const struct slab *slab) +{ + if (!SlabMulti(slab)) + return 0; + return (&slab->page)[1].compound_order; +} + +static inline size_t slab_size(const struct slab *slab) +{ + return PAGE_SIZE << slab_order(slab); +} + +/* + * If network-based swap is enabled, sl*b must keep track of whether pages + * were allocated from pfmemalloc reserves. 
+ */ +static inline bool SlabPfmemalloc(const struct slab *slab) +{ + return test_bit(PG_pfmemalloc, &slab->flags); +} + +static inline void SetSlabPfmemalloc(struct slab *slab) +{ + set_bit(PG_pfmemalloc, &slab->flags); +} + +static inline void __ClearSlabPfmemalloc(struct slab *slab) +{ + __clear_bit(PG_pfmemalloc, &slab->flags); +} + +static inline void ClearSlabPfmemalloc(struct slab *slab) +{ + clear_bit(PG_pfmemalloc, &slab->flags); +} + #ifdef CONFIG_SLOB /* * Common fields provided in kmem_cache by all slab allocators @@ -245,15 +308,15 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla } #ifdef CONFIG_MEMCG_KMEM -int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, +int memcg_alloc_slab_obj_cgroups(struct slab *slab, struct kmem_cache *s, gfp_t gfp, bool new_page); void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr); -static inline void memcg_free_page_obj_cgroups(struct page *page) +static inline void memcg_free_slab_obj_cgroups(struct slab *slab) { - kfree(page_objcgs(page)); - page->memcg_data = 0; + kfree(slab_objcgs(slab)); + slab->memcg_data = 0; } static inline size_t obj_full_size(struct kmem_cache *s) @@ -298,7 +361,7 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, size_t size, void **p) { - struct page *page; + struct slab *slab; unsigned long off; size_t i; @@ -307,19 +370,19 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, for (i = 0; i < size; i++) { if (likely(p[i])) { - page = virt_to_head_page(p[i]); + slab = virt_to_slab(p[i]); - if (!page_objcgs(page) && - memcg_alloc_page_obj_cgroups(page, s, flags, + if (!slab_objcgs(slab) && + memcg_alloc_slab_obj_cgroups(slab, s, flags, false)) { obj_cgroup_uncharge(objcg, obj_full_size(s)); continue; } - off = obj_to_index(s, page, p[i]); + off = obj_to_index(s, slab, p[i]); obj_cgroup_get(objcg); - page_objcgs(page)[off] = objcg; - 
mod_objcg_state(objcg, page_pgdat(page), + slab_objcgs(slab)[off] = objcg; + mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), obj_full_size(s)); } else { obj_cgroup_uncharge(objcg, obj_full_size(s)); @@ -334,7 +397,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig, struct kmem_cache *s; struct obj_cgroup **objcgs; struct obj_cgroup *objcg; - struct page *page; + struct slab *slab; unsigned int off; int i; @@ -345,24 +408,24 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig, if (unlikely(!p[i])) continue; - page = virt_to_head_page(p[i]); - objcgs = page_objcgs(page); + slab = virt_to_slab(p[i]); + objcgs = slab_objcgs(slab); if (!objcgs) continue; if (!s_orig) - s = page->slab_cache; + s = slab->slab_cache; else s = s_orig; - off = obj_to_index(s, page, p[i]); + off = obj_to_index(s, slab, p[i]); objcg = objcgs[off]; if (!objcg) continue; objcgs[off] = NULL; obj_cgroup_uncharge(objcg, obj_full_size(s)); - mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s), + mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), -obj_full_size(s)); obj_cgroup_put(objcg); } @@ -374,14 +437,14 @@ static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr) return NULL; } -static inline int memcg_alloc_page_obj_cgroups(struct page *page, +static inline int memcg_alloc_slab_obj_cgroups(struct slab *slab, struct kmem_cache *s, gfp_t gfp, bool new_page) { return 0; } -static inline void memcg_free_page_obj_cgroups(struct page *page) +static inline void memcg_free_slab_obj_cgroups(struct slab *slab) { } @@ -407,33 +470,33 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, static inline struct kmem_cache *virt_to_cache(const void *obj) { - struct page *page; + struct slab *slab; - page = virt_to_head_page(obj); - if (WARN_ONCE(!PageSlab(page), "%s: Object is not a Slab page!\n", + slab = virt_to_slab(obj); + if (WARN_ONCE(!is_slab(slab), "%s: Object is not a Slab page!\n", __func__)) return NULL; - 
return page->slab_cache; + return slab->slab_cache; } -static __always_inline void account_slab_page(struct page *page, int order, +static __always_inline void account_slab(struct slab *slab, int order, struct kmem_cache *s, gfp_t gfp) { if (memcg_kmem_enabled() && (s->flags & SLAB_ACCOUNT)) - memcg_alloc_page_obj_cgroups(page, s, gfp, true); + memcg_alloc_slab_obj_cgroups(slab, s, gfp, true); - mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), + mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s), PAGE_SIZE << order); } -static __always_inline void unaccount_slab_page(struct page *page, int order, +static __always_inline void unaccount_slab(struct slab *slab, int order, struct kmem_cache *s) { if (memcg_kmem_enabled()) - memcg_free_page_obj_cgroups(page); + memcg_free_slab_obj_cgroups(slab); - mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), + mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s), -(PAGE_SIZE << order)); } @@ -635,7 +698,7 @@ static inline void debugfs_slab_release(struct kmem_cache *s) { } #define KS_ADDRS_COUNT 16 struct kmem_obj_info { void *kp_ptr; - struct page *kp_page; + struct slab *kp_slab; void *kp_objp; unsigned long kp_data_offset; struct kmem_cache *kp_slab_cache; @@ -643,7 +706,7 @@ struct kmem_obj_info { void *kp_stack[KS_ADDRS_COUNT]; void *kp_free_stack[KS_ADDRS_COUNT]; }; -void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page); +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab); #endif #endif /* MM_SLAB_H */ diff --git a/mm/slab_common.c b/mm/slab_common.c index 1c673c323baf..d0d843cb7cf1 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -585,18 +585,18 @@ void kmem_dump_obj(void *object) { char *cp = IS_ENABLED(CONFIG_MMU) ? 
"" : "/vmalloc"; int i; - struct page *page; + struct slab *slab; unsigned long ptroffset; struct kmem_obj_info kp = { }; if (WARN_ON_ONCE(!virt_addr_valid(object))) return; - page = virt_to_head_page(object); - if (WARN_ON_ONCE(!PageSlab(page))) { + slab = virt_to_slab(object); + if (WARN_ON_ONCE(!is_slab(slab))) { pr_cont(" non-slab memory.\n"); return; } - kmem_obj_info(&kp, object, page); + kmem_obj_info(&kp, object, slab); if (kp.kp_slab_cache) pr_cont(" slab%s %s", cp, kp.kp_slab_cache->name); else diff --git a/mm/slub.c b/mm/slub.c index 090fa14628f9..c3b84bd61400 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -47,7 +47,7 @@ * Lock order: * 1. slab_mutex (Global Mutex) * 2. node->list_lock - * 3. slab_lock(page) (Only on some arches and for debugging) + * 3. slab_lock(slab) (Only on some arches and for debugging) * * slab_mutex * @@ -56,17 +56,17 @@ * * The slab_lock is only used for debugging and on arches that do not * have the ability to do a cmpxchg_double. It only protects: - * A. page->freelist -> List of object free in a page - * B. page->inuse -> Number of objects in use - * C. page->objects -> Number of objects in page - * D. page->frozen -> frozen state + * A. slab->freelist -> List of object free in a slab + * B. slab->inuse -> Number of objects in use + * C. slab->objects -> Number of objects in slab + * D. slab->frozen -> frozen state * * If a slab is frozen then it is exempt from list management. It is not * on any list except per cpu partial list. The processor that froze the - * slab is the one who can perform list operations on the page. Other + * slab is the one who can perform list operations on the slab. Other * processors may put objects onto the freelist but the processor that * froze the slab is the only one that can retrieve the objects from the - * page's freelist. + * slab's freelist. * * The list_lock protects the partial and full list on each node and * the partial slab counter. 
If taken then no new slabs may be added or @@ -94,10 +94,10 @@ * cannot scan all objects. * * Slabs are freed when they become empty. Teardown and setup is - * minimal so we rely on the page allocators per cpu caches for + * minimal so we rely on the slab allocators per cpu caches for * fast frees and allocs. * - * page->frozen The slab is frozen and exempt from list processing. + * slab->frozen The slab is frozen and exempt from list processing. * This means that the slab is dedicated to a purpose * such as satisfying allocations for a specific * processor. Objects may be freed in the slab while @@ -192,7 +192,7 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s) #define OO_SHIFT 16 #define OO_MASK ((1 << OO_SHIFT) - 1) -#define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */ +#define MAX_OBJS_PER_PAGE 32767 /* since slab.objects is u15 */ /* Internal SLUB flags */ /* Poison object */ @@ -357,22 +357,20 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x) } /* - * Per slab locking using the pagelock + * Per slab locking using the slablock */ -static __always_inline void slab_lock(struct page *page) +static __always_inline void slab_lock(struct slab *slab) { - VM_BUG_ON_PAGE(PageTail(page), page); - bit_spin_lock(PG_locked, &page->flags); + bit_spin_lock(PG_locked, &slab->flags); } -static __always_inline void slab_unlock(struct page *page) +static __always_inline void slab_unlock(struct slab *slab) { - VM_BUG_ON_PAGE(PageTail(page), page); - __bit_spin_unlock(PG_locked, &page->flags); + __bit_spin_unlock(PG_locked, &slab->flags); } /* Interrupts must be disabled (for the fallback code to work right) */ -static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page, +static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab, void *freelist_old, unsigned long counters_old, void *freelist_new, unsigned long counters_new, const char *n) @@ -381,22 +379,22 @@ static inline bool 
__cmpxchg_double_slab(struct kmem_cache *s, struct page *page #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) if (s->flags & __CMPXCHG_DOUBLE) { - if (cmpxchg_double(&page->freelist, &page->counters, + if (cmpxchg_double(&slab->freelist, &slab->counters, freelist_old, counters_old, freelist_new, counters_new)) return true; } else #endif { - slab_lock(page); - if (page->freelist == freelist_old && - page->counters == counters_old) { - page->freelist = freelist_new; - page->counters = counters_new; - slab_unlock(page); + slab_lock(slab); + if (slab->freelist == freelist_old && + slab->counters == counters_old) { + slab->freelist = freelist_new; + slab->counters = counters_new; + slab_unlock(slab); return true; } - slab_unlock(page); + slab_unlock(slab); } cpu_relax(); @@ -409,7 +407,7 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page return false; } -static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, +static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab, void *freelist_old, unsigned long counters_old, void *freelist_new, unsigned long counters_new, const char *n) @@ -417,7 +415,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) if (s->flags & __CMPXCHG_DOUBLE) { - if (cmpxchg_double(&page->freelist, &page->counters, + if (cmpxchg_double(&slab->freelist, &slab->counters, freelist_old, counters_old, freelist_new, counters_new)) return true; @@ -427,16 +425,16 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, unsigned long flags; local_irq_save(flags); - slab_lock(page); - if (page->freelist == freelist_old && - page->counters == counters_old) { - page->freelist = freelist_new; - page->counters = counters_new; - slab_unlock(page); + slab_lock(slab); + if (slab->freelist == freelist_old && + 
slab->counters == counters_old) { + slab->freelist = freelist_new; + slab->counters = counters_new; + slab_unlock(slab); local_irq_restore(flags); return true; } - slab_unlock(page); + slab_unlock(slab); local_irq_restore(flags); } @@ -475,24 +473,24 @@ static inline bool slab_add_kunit_errors(void) { return false; } #endif /* - * Determine a map of object in use on a page. + * Determine a map of object in use on a slab. * - * Node listlock must be held to guarantee that the page does + * Node listlock must be held to guarantee that the slab does * not vanish from under us. */ -static unsigned long *get_map(struct kmem_cache *s, struct page *page) +static unsigned long *get_map(struct kmem_cache *s, struct slab *slab) __acquires(&object_map_lock) { void *p; - void *addr = page_address(page); + void *addr = slab_address(slab); VM_BUG_ON(!irqs_disabled()); spin_lock(&object_map_lock); - bitmap_zero(object_map, page->objects); + bitmap_zero(object_map, slab->objects); - for (p = page->freelist; p; p = get_freepointer(s, p)) + for (p = slab->freelist; p; p = get_freepointer(s, p)) set_bit(__obj_to_index(s, addr, p), object_map); return object_map; @@ -552,19 +550,19 @@ static inline void metadata_access_disable(void) * Object debugging */ -/* Verify that a pointer has an address that is valid within a slab page */ +/* Verify that a pointer has an address that is valid within a slab */ static inline int check_valid_pointer(struct kmem_cache *s, - struct page *page, void *object) + struct slab *slab, void *object) { void *base; if (!object) return 1; - base = page_address(page); + base = slab_address(slab); object = kasan_reset_tag(object); object = restore_red_left(s, object); - if (object < base || object >= base + page->objects * s->size || + if (object < base || object >= base + slab->objects * s->size || (object - base) % s->size) { return 0; } @@ -675,11 +673,11 @@ void print_tracking(struct kmem_cache *s, void *object) print_track("Freed", get_track(s, object, 
 				TRACK_FREE), pr_time);
 }
 
-static void print_page_info(struct page *page)
+static void print_slab_info(struct slab *slab)
 {
 	pr_err("Slab 0x%p objects=%u used=%u fp=0x%p flags=%#lx(%pGp)\n",
-	       page, page->objects, page->inuse, page->freelist,
-	       page->flags, &page->flags);
+	       slab, slab->objects, slab->inuse, slab->freelist,
+	       slab->flags, &slab->flags);
 }
 
@@ -713,12 +711,12 @@ static void slab_fix(struct kmem_cache *s, char *fmt, ...)
 	va_end(args);
 }
 
-static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
+static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
 			       void **freelist, void *nextfree)
 {
 	if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
-	    !check_valid_pointer(s, page, nextfree) && freelist) {
-		object_err(s, page, *freelist, "Freechain corrupt");
+	    !check_valid_pointer(s, slab, nextfree) && freelist) {
+		object_err(s, slab, *freelist, "Freechain corrupt");
 		*freelist = NULL;
 		slab_fix(s, "Isolate corrupted freechain");
 		return true;
@@ -727,14 +725,14 @@ static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
 	return false;
 }
 
-static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
+static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
 {
 	unsigned int off;	/* Offset of last byte */
-	u8 *addr = page_address(page);
+	u8 *addr = slab_address(slab);
 
 	print_tracking(s, p);
 
-	print_page_info(page);
+	print_slab_info(slab);
 
 	pr_err("Object 0x%p @offset=%tu fp=0x%p\n\n",
 	       p, p - addr, get_freepointer(s, p));
@@ -766,18 +764,18 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 	dump_stack();
 }
 
-void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct slab *slab,
 		u8 *object, char *reason)
 {
 	if (slab_add_kunit_errors())
 		return;
 
 	slab_bug(s, "%s", reason);
-	print_trailer(s, page, object);
+	print_trailer(s, slab, object);
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
-static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
+static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
 			const char *fmt, ...)
 {
 	va_list args;
@@ -790,7 +788,7 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
 	slab_bug(s, "%s", buf);
-	print_page_info(page);
+	print_slab_info(slab);
 	dump_stack();
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
@@ -818,13 +816,13 @@ static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
 	memset(from, data, to - from);
 }
 
-static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
+static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
 				  u8 *object, char *what,
 				  u8 *start, unsigned int value, unsigned int bytes)
 {
 	u8 *fault;
 	u8 *end;
-	u8 *addr = page_address(page);
+	u8 *addr = slab_address(slab);
 
 	metadata_access_enable();
 	fault = memchr_inv(kasan_reset_tag(start), value, bytes);
@@ -843,7 +841,7 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
 	pr_err("0x%p-0x%p @offset=%tu. First byte 0x%x instead of 0x%x\n",
 					fault, end - 1, fault - addr,
 					fault[0], value);
-	print_trailer(s, page, object);
+	print_trailer(s, slab, object);
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 
 skip_bug_print:
@@ -889,7 +887,7 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
  * may be used with merged slabcaches.
  */
-static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p)
+static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
 {
 	unsigned long off = get_info_end(s);	/* The end of info */
@@ -902,12 +900,12 @@ static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p)
 	if (size_from_object(s) == off)
 		return 1;
 
-	return check_bytes_and_report(s, page, p, "Object padding",
+	return check_bytes_and_report(s, slab, p, "Object padding",
 			p + off, POISON_INUSE, size_from_object(s) - off);
 }
 
-/* Check the pad bytes at the end of a slab page */
-static int slab_pad_check(struct kmem_cache *s, struct page *page)
+/* Check the pad bytes at the end of a slab */
+static int slab_pad_check(struct kmem_cache *s, struct slab *slab)
 {
 	u8 *start;
 	u8 *fault;
@@ -919,8 +917,8 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
 	if (!(s->flags & SLAB_POISON))
 		return 1;
 
-	start = page_address(page);
-	length = page_size(page);
+	start = slab_address(slab);
+	length = slab_size(slab);
 	end = start + length;
 	remainder = length % s->size;
 	if (!remainder)
@@ -935,7 +933,7 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
 	while (end > fault && end[-1] == POISON_INUSE)
 		end--;
 
-	slab_err(s, page, "Padding overwritten. 0x%p-0x%p @offset=%tu",
+	slab_err(s, slab, "Padding overwritten. 0x%p-0x%p @offset=%tu",
 			fault, end - 1, fault - start);
 	print_section(KERN_ERR, "Padding ", pad, remainder);
@@ -943,23 +941,23 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
 	return 0;
 }
 
-static int check_object(struct kmem_cache *s, struct page *page,
+static int check_object(struct kmem_cache *s, struct slab *slab,
 			void *object, u8 val)
 {
 	u8 *p = object;
 	u8 *endobject = object + s->object_size;
 
 	if (s->flags & SLAB_RED_ZONE) {
-		if (!check_bytes_and_report(s, page, object, "Left Redzone",
+		if (!check_bytes_and_report(s, slab, object, "Left Redzone",
 			object - s->red_left_pad, val, s->red_left_pad))
 			return 0;
 
-		if (!check_bytes_and_report(s, page, object, "Right Redzone",
+		if (!check_bytes_and_report(s, slab, object, "Right Redzone",
 			endobject, val, s->inuse - s->object_size))
 			return 0;
 	} else {
 		if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
-			check_bytes_and_report(s, page, p, "Alignment padding",
+			check_bytes_and_report(s, slab, p, "Alignment padding",
 				endobject, POISON_INUSE,
 				s->inuse - s->object_size);
 		}
@@ -967,15 +965,15 @@ static int check_object(struct kmem_cache *s, struct page *page,
 	if (s->flags & SLAB_POISON) {
 		if (val != SLUB_RED_ACTIVE && (s->flags & __OBJECT_POISON) &&
-			(!check_bytes_and_report(s, page, p, "Poison", p,
+			(!check_bytes_and_report(s, slab, p, "Poison", p,
 					POISON_FREE, s->object_size - 1) ||
-			 !check_bytes_and_report(s, page, p, "End Poison",
+			 !check_bytes_and_report(s, slab, p, "End Poison",
 				p + s->object_size - 1, POISON_END, 1)))
 			return 0;
 		/*
 		 * check_pad_bytes cleans up on its own.
 		 */
-		check_pad_bytes(s, page, p);
+		check_pad_bytes(s, slab, p);
 	}
 
 	if (!freeptr_outside_object(s) && val == SLUB_RED_ACTIVE)
@@ -986,8 +984,8 @@ static int check_object(struct kmem_cache *s, struct page *page,
 		return 1;
 
 	/* Check free pointer validity */
-	if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
-		object_err(s, page, p, "Freepointer corrupt");
+	if (!check_valid_pointer(s, slab, get_freepointer(s, p))) {
+		object_err(s, slab, p, "Freepointer corrupt");
 		/*
 		 * No choice but to zap it and thus lose the remainder
 		 * of the free objects in this slab. May cause
@@ -999,57 +997,57 @@ static int check_object(struct kmem_cache *s, struct page *page,
 	return 1;
 }
 
-static int check_slab(struct kmem_cache *s, struct page *page)
+static int check_slab(struct kmem_cache *s, struct slab *slab)
 {
 	int maxobj;
 
 	VM_BUG_ON(!irqs_disabled());
 
-	if (!PageSlab(page)) {
-		slab_err(s, page, "Not a valid slab page");
+	if (!is_slab(slab)) {
+		slab_err(s, slab, "Not a valid slab");
 		return 0;
 	}
 
-	maxobj = order_objects(compound_order(page), s->size);
-	if (page->objects > maxobj) {
-		slab_err(s, page, "objects %u > max %u",
-			page->objects, maxobj);
+	maxobj = order_objects(slab_order(slab), s->size);
+	if (slab->objects > maxobj) {
+		slab_err(s, slab, "objects %u > max %u",
+			slab->objects, maxobj);
 		return 0;
 	}
-	if (page->inuse > page->objects) {
-		slab_err(s, page, "inuse %u > max %u",
-			page->inuse, page->objects);
+	if (slab->inuse > slab->objects) {
+		slab_err(s, slab, "inuse %u > max %u",
+			slab->inuse, slab->objects);
 		return 0;
 	}
 	/* Slab_pad_check fixes things up after itself */
-	slab_pad_check(s, page);
+	slab_pad_check(s, slab);
 	return 1;
 }
 
 /*
- * Determine if a certain object on a page is on the freelist. Must hold the
+ * Determine if a certain object on a slab is on the freelist. Must hold the
  * slab lock to guarantee that the chains are in a consistent state.
  */
-static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
+static int on_freelist(struct kmem_cache *s, struct slab *slab, void *search)
 {
 	int nr = 0;
 	void *fp;
 	void *object = NULL;
 	int max_objects;
 
-	fp = page->freelist;
-	while (fp && nr <= page->objects) {
+	fp = slab->freelist;
+	while (fp && nr <= slab->objects) {
 		if (fp == search)
 			return 1;
-		if (!check_valid_pointer(s, page, fp)) {
+		if (!check_valid_pointer(s, slab, fp)) {
 			if (object) {
-				object_err(s, page, object,
+				object_err(s, slab, object,
 					"Freechain corrupt");
 				set_freepointer(s, object, NULL);
 			} else {
-				slab_err(s, page, "Freepointer corrupt");
-				page->freelist = NULL;
-				page->inuse = page->objects;
+				slab_err(s, slab, "Freepointer corrupt");
+				slab->freelist = NULL;
+				slab->inuse = slab->objects;
 				slab_fix(s, "Freelist cleared");
 				return 0;
 			}
@@ -1060,34 +1058,34 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
 		nr++;
 	}
 
-	max_objects = order_objects(compound_order(page), s->size);
+	max_objects = order_objects(slab_order(slab), s->size);
 	if (max_objects > MAX_OBJS_PER_PAGE)
 		max_objects = MAX_OBJS_PER_PAGE;
 
-	if (page->objects != max_objects) {
-		slab_err(s, page, "Wrong number of objects. Found %d but should be %d",
-			page->objects, max_objects);
-		page->objects = max_objects;
+	if (slab->objects != max_objects) {
+		slab_err(s, slab, "Wrong number of objects. Found %d but should be %d",
+			slab->objects, max_objects);
+		slab->objects = max_objects;
 		slab_fix(s, "Number of objects adjusted");
 	}
-	if (page->inuse != page->objects - nr) {
-		slab_err(s, page, "Wrong object count. Counter is %d but counted were %d",
-			page->inuse, page->objects - nr);
-		page->inuse = page->objects - nr;
+	if (slab->inuse != slab->objects - nr) {
+		slab_err(s, slab, "Wrong object count. Counter is %d but counted were %d",
+			slab->inuse, slab->objects - nr);
+		slab->inuse = slab->objects - nr;
 		slab_fix(s, "Object count adjusted");
 	}
 	return search == NULL;
 }
 
-static void trace(struct kmem_cache *s, struct page *page, void *object,
+static void trace(struct kmem_cache *s, struct slab *slab, void *object,
 								int alloc)
 {
 	if (s->flags & SLAB_TRACE) {
 		pr_info("TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
 			s->name,
 			alloc ? "alloc" : "free",
-			object, page->inuse,
-			page->freelist);
+			object, slab->inuse,
+			slab->freelist);
 
 		if (!alloc)
 			print_section(KERN_INFO, "Object ", (void *)object,
@@ -1101,22 +1099,22 @@ static void trace(struct kmem_cache *s, struct page *page, void *object,
  * Tracking of fully allocated slabs for debugging purposes.
  */
 static void add_full(struct kmem_cache *s,
-	struct kmem_cache_node *n, struct page *page)
+	struct kmem_cache_node *n, struct slab *slab)
 {
 	if (!(s->flags & SLAB_STORE_USER))
 		return;
 
 	lockdep_assert_held(&n->list_lock);
-	list_add(&page->slab_list, &n->full);
+	list_add(&slab->slab_list, &n->full);
 }
 
-static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page)
+static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct slab *slab)
 {
 	if (!(s->flags & SLAB_STORE_USER))
 		return;
 
 	lockdep_assert_held(&n->list_lock);
-	list_del(&page->slab_list);
+	list_del(&slab->slab_list);
 }
 
 /* Tracking of the number of slabs for debugging purposes */
@@ -1156,7 +1154,7 @@ static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects)
 }
 
 /* Object debug checks for alloc/free paths */
-static void setup_object_debug(struct kmem_cache *s, struct page *page,
+static void setup_object_debug(struct kmem_cache *s, struct slab *slab,
 								void *object)
 {
 	if (!kmem_cache_debug_flags(s, SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON))
@@ -1167,90 +1165,90 @@ static void setup_object_debug(struct kmem_cache *s, struct page *page,
 }
 
 static
-void setup_page_debug(struct kmem_cache *s, struct page *page, void *addr)
+void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr)
 {
 	if (!kmem_cache_debug_flags(s, SLAB_POISON))
 		return;
 
 	metadata_access_enable();
-	memset(kasan_reset_tag(addr), POISON_INUSE, page_size(page));
+	memset(kasan_reset_tag(addr), POISON_INUSE, slab_size(slab));
 	metadata_access_disable();
 }
 
 static inline int alloc_consistency_checks(struct kmem_cache *s,
-					struct page *page, void *object)
+					struct slab *slab, void *object)
 {
-	if (!check_slab(s, page))
+	if (!check_slab(s, slab))
 		return 0;
 
-	if (!check_valid_pointer(s, page, object)) {
-		object_err(s, page, object, "Freelist Pointer check fails");
+	if (!check_valid_pointer(s, slab, object)) {
+		object_err(s, slab, object, "Freelist Pointer check fails");
 		return 0;
 	}
 
-	if (!check_object(s, page, object, SLUB_RED_INACTIVE))
+	if (!check_object(s, slab, object, SLUB_RED_INACTIVE))
 		return 0;
 
 	return 1;
 }
 
 static noinline int alloc_debug_processing(struct kmem_cache *s,
-					struct page *page,
+					struct slab *slab,
 					void *object, unsigned long addr)
 {
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-		if (!alloc_consistency_checks(s, page, object))
+		if (!alloc_consistency_checks(s, slab, object))
 			goto bad;
 	}
 
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
-	trace(s, page, object, 1);
+	trace(s, slab, object, 1);
 	init_object(s, object, SLUB_RED_ACTIVE);
 	return 1;
 
 bad:
-	if (PageSlab(page)) {
+	if (is_slab(slab)) {
 		/*
-		 * If this is a slab page then lets do the best we can
+		 * If this is a slab then let's do the best we can
 		 * to avoid issues in the future. Marking all objects
 		 * as used avoids touching the remaining objects.
 		 */
 		slab_fix(s, "Marking all objects used");
-		page->inuse = page->objects;
-		page->freelist = NULL;
+		slab->inuse = slab->objects;
+		slab->freelist = NULL;
 	}
 	return 0;
 }
 
 static inline int free_consistency_checks(struct kmem_cache *s,
-		struct page *page, void *object, unsigned long addr)
+		struct slab *slab, void *object, unsigned long addr)
 {
-	if (!check_valid_pointer(s, page, object)) {
-		slab_err(s, page, "Invalid object pointer 0x%p", object);
+	if (!check_valid_pointer(s, slab, object)) {
+		slab_err(s, slab, "Invalid object pointer 0x%p", object);
 		return 0;
 	}
 
-	if (on_freelist(s, page, object)) {
-		object_err(s, page, object, "Object already free");
+	if (on_freelist(s, slab, object)) {
+		object_err(s, slab, object, "Object already free");
 		return 0;
 	}
 
-	if (!check_object(s, page, object, SLUB_RED_ACTIVE))
+	if (!check_object(s, slab, object, SLUB_RED_ACTIVE))
 		return 0;
 
-	if (unlikely(s != page->slab_cache)) {
-		if (!PageSlab(page)) {
-			slab_err(s, page, "Attempt to free object(0x%p) outside of slab",
+	if (unlikely(s != slab->slab_cache)) {
+		if (!is_slab(slab)) {
+			slab_err(s, slab, "Attempt to free object(0x%p) outside of slab",
 				 object);
-		} else if (!page->slab_cache) {
+		} else if (!slab->slab_cache) {
 			pr_err("SLUB <none>: no slab for object 0x%p.\n",
 			       object);
 			dump_stack();
 		} else
-			object_err(s, page, object,
-					"page slab pointer corrupt.");
+			object_err(s, slab, object,
+					"slab pointer corrupt.");
 		return 0;
	}
	return 1;
 }
@@ -1258,21 +1256,21 @@ static inline int free_consistency_checks(struct kmem_cache *s,
 
 /* Supports checking bulk free of a constructed freelist */
 static noinline int free_debug_processing(
-	struct kmem_cache *s, struct page *page,
+	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,
 	unsigned long addr)
 {
-	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
 	void *object = head;
 	int cnt = 0;
 	unsigned long flags;
 	int ret = 0;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	slab_lock(page);
+	slab_lock(slab);
 
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-		if (!check_slab(s, page))
+		if (!check_slab(s, slab))
 			goto out;
 	}
 
@@ -1280,13 +1278,13 @@ static noinline int free_debug_processing(
 	cnt++;
 
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-		if (!free_consistency_checks(s, page, object, addr))
+		if (!free_consistency_checks(s, slab, object, addr))
 			goto out;
 	}
 
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
-	trace(s, page, object, 0);
+	trace(s, slab, object, 0);
 	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
 	init_object(s, object, SLUB_RED_INACTIVE);
 
@@ -1299,10 +1297,10 @@ static noinline int free_debug_processing(
 
 out:
 	if (cnt != bulk_cnt)
-		slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
+		slab_err(s, slab, "Bulk freelist count(%d) invalid(%d)\n",
 			 bulk_cnt, cnt);
 
-	slab_unlock(page);
+	slab_unlock(slab);
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	if (!ret)
 		slab_fix(s, "Object at 0x%p not freed", object);
@@ -1514,26 +1512,26 @@ slab_flags_t kmem_cache_flags(unsigned int object_size,
 }
 #else /* !CONFIG_SLUB_DEBUG */
 static inline void setup_object_debug(struct kmem_cache *s,
-			struct page *page, void *object) {}
+			struct slab *slab, void *object) {}
 static inline
-void setup_page_debug(struct kmem_cache *s, struct page *page, void *addr) {}
+void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	struct page *page, void *object, unsigned long addr) { return 0; }
+	struct slab *slab, void *object, unsigned long addr) { return 0; }
 
 static inline int free_debug_processing(
-	struct kmem_cache *s, struct page *page,
+	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,
 	unsigned long addr) { return 0; }
 
-static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
+static inline int slab_pad_check(struct kmem_cache *s, struct slab *slab)
 			{ return 1; }
-static inline int check_object(struct kmem_cache *s, struct page *page,
+static inline int check_object(struct kmem_cache *s, struct slab *slab,
 			void *object, u8 val) { return 1; }
 static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
-					struct page *page) {}
+					struct slab *slab) {}
 static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n,
-					struct page *page) {}
+					struct slab *slab) {}
 slab_flags_t kmem_cache_flags(unsigned int object_size,
 	slab_flags_t flags, const char *name)
 {
@@ -1552,7 +1550,7 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node,
 static inline void dec_slabs_node(struct kmem_cache *s, int node,
 							int objects) {}
 
-static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
+static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
 			       void **freelist, void *nextfree)
 {
 	return false;
@@ -1662,10 +1660,10 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
 	return *head != NULL;
 }
 
-static void *setup_object(struct kmem_cache *s, struct page *page,
+static void *setup_object(struct kmem_cache *s, struct slab *slab,
 				void *object)
 {
-	setup_object_debug(s, page, object);
+	setup_object_debug(s, slab, object);
 	object = kasan_init_slab_obj(s, object);
 	if (unlikely(s->ctor)) {
 		kasan_unpoison_object_data(s, object);
@@ -1678,18 +1676,25 @@ static void *setup_object(struct kmem_cache *s, struct page *page,
 /*
  * Slab allocation and freeing
  */
-static inline struct page *alloc_slab_page(struct kmem_cache *s,
+static inline struct slab *alloc_slab(struct kmem_cache *s,
 		gfp_t flags, int node, struct kmem_cache_order_objects oo)
 {
 	struct page *page;
+	struct slab *slab;
 	unsigned int order = oo_order(oo);
 
 	if (node == NUMA_NO_NODE)
 		page = alloc_pages(flags, order);
 	else
 		page = __alloc_pages_node(node, flags, order);
+	if (!page)
+		return NULL;
 
-	return page;
+	__SetPageSlab(page);
+	slab = (struct slab *)page;
+	if (page_is_pfmemalloc(page))
+		SetSlabPfmemalloc(slab);
+	return slab;
 }
 
 #ifdef CONFIG_SLAB_FREELIST_RANDOM
@@ -1710,7 +1715,7 @@ static int init_cache_random_seq(struct kmem_cache *s)
 		return err;
 	}
 
-	/* Transform to an offset on the set of pages */
+	/* Transform to an offset on the set of slabs */
 	if (s->random_seq) {
 		unsigned int i;
 
@@ -1734,54 +1739,54 @@ static void __init init_freelist_randomization(void)
 }
 
 /* Get the next entry on the pre-computed freelist randomized */
-static void *next_freelist_entry(struct kmem_cache *s, struct page *page,
+static void *next_freelist_entry(struct kmem_cache *s, struct slab *slab,
 				unsigned long *pos, void *start,
-				unsigned long page_limit,
+				unsigned long slab_limit,
 				unsigned long freelist_count)
 {
 	unsigned int idx;
 
 	/*
-	 * If the target page allocation failed, the number of objects on the
-	 * page might be smaller than the usual size defined by the cache.
+	 * If the target slab allocation failed, the number of objects on the
+	 * slab might be smaller than the usual size defined by the cache.
 	 */
 	do {
 		idx = s->random_seq[*pos];
 		*pos += 1;
 		if (*pos >= freelist_count)
 			*pos = 0;
-	} while (unlikely(idx >= page_limit));
+	} while (unlikely(idx >= slab_limit));
 
 	return (char *)start + idx;
 }
 
 /* Shuffle the single linked freelist based on a random pre-computed sequence */
-static bool shuffle_freelist(struct kmem_cache *s, struct page *page)
+static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 {
 	void *start;
 	void *cur;
 	void *next;
-	unsigned long idx, pos, page_limit, freelist_count;
+	unsigned long idx, pos, slab_limit, freelist_count;
 
-	if (page->objects < 2 || !s->random_seq)
+	if (slab->objects < 2 || !s->random_seq)
 		return false;
 
 	freelist_count = oo_objects(s->oo);
 	pos = get_random_int() % freelist_count;
 
-	page_limit = page->objects * s->size;
-	start = fixup_red_left(s, page_address(page));
+	slab_limit = slab->objects * s->size;
+	start = fixup_red_left(s, slab_address(slab));
 
 	/* First entry is used as the base of the freelist */
-	cur = next_freelist_entry(s, page, &pos, start, page_limit,
+	cur = next_freelist_entry(s, slab, &pos, start, slab_limit,
 				freelist_count);
-	cur = setup_object(s, page, cur);
-	page->freelist = cur;
+	cur = setup_object(s, slab, cur);
+	slab->freelist = cur;
 
-	for (idx = 1; idx < page->objects; idx++) {
-		next = next_freelist_entry(s, page, &pos, start, page_limit,
+	for (idx = 1; idx < slab->objects; idx++) {
+		next = next_freelist_entry(s, slab, &pos, start, slab_limit,
 			freelist_count);
-		next = setup_object(s, page, next);
+		next = setup_object(s, slab, next);
 		set_freepointer(s, cur, next);
 		cur = next;
 	}
@@ -1795,15 +1800,15 @@ static inline int init_cache_random_seq(struct kmem_cache *s)
 	return 0;
 }
 static inline void init_freelist_randomization(void) { }
-static inline bool shuffle_freelist(struct kmem_cache *s, struct page *page)
+static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 {
 	return false;
 }
 #endif /* CONFIG_SLAB_FREELIST_RANDOM */
 
-static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
-	struct page *page;
+	struct slab *slab;
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
 	void *start, *p, *next;
@@ -1825,65 +1830,62 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
 		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~(__GFP_RECLAIM|__GFP_NOFAIL);
 
-	page = alloc_slab_page(s, alloc_gfp, node, oo);
-	if (unlikely(!page)) {
+	slab = alloc_slab(s, alloc_gfp, node, oo);
+	if (unlikely(!slab)) {
 		oo = s->min;
 		alloc_gfp = flags;
 		/*
 		 * Allocation may have failed due to fragmentation.
 		 * Try a lower order alloc if possible
 		 */
-		page = alloc_slab_page(s, alloc_gfp, node, oo);
-		if (unlikely(!page))
+		slab = alloc_slab(s, alloc_gfp, node, oo);
+		if (unlikely(!slab))
 			goto out;
 		stat(s, ORDER_FALLBACK);
 	}
 
-	page->objects = oo_objects(oo);
+	slab->objects = oo_objects(oo);
 
-	account_slab_page(page, oo_order(oo), s, flags);
+	account_slab(slab, oo_order(oo), s, flags);
 
-	page->slab_cache = s;
-	__SetPageSlab(page);
-	if (page_is_pfmemalloc(page))
-		SetPageSlabPfmemalloc(page);
+	slab->slab_cache = s;
 
-	kasan_poison_slab(page);
+	kasan_poison_slab(slab);
 
-	start = page_address(page);
+	start = slab_address(slab);
 
-	setup_page_debug(s, page, start);
+	setup_slab_debug(s, slab, start);
 
-	shuffle = shuffle_freelist(s, page);
+	shuffle = shuffle_freelist(s, slab);
 
 	if (!shuffle) {
 		start = fixup_red_left(s, start);
-		start = setup_object(s, page, start);
-		page->freelist = start;
-		for (idx = 0, p = start; idx < page->objects - 1; idx++) {
+		start = setup_object(s, slab, start);
+		slab->freelist = start;
+		for (idx = 0, p = start; idx < slab->objects - 1; idx++) {
			next = p + s->size;
-			next = setup_object(s, page, next);
+			next = setup_object(s, slab, next);
 			set_freepointer(s, p, next);
 			p = next;
 		}
 		set_freepointer(s, p, NULL);
 	}
 
-	page->inuse = page->objects;
-	page->frozen = 1;
+	slab->inuse = slab->objects;
+	slab->frozen = 1;
 
 out:
 	if (gfpflags_allow_blocking(flags))
 		local_irq_disable();
-	if (!page)
+	if (!slab)
 		return NULL;
 
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	inc_slabs_node(s, slab_nid(slab), slab->objects);
 
-	return page;
+	return slab;
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
 	if (unlikely(flags & GFP_SLAB_BUG_MASK))
 		flags = kmalloc_fix_flags(flags);
 
@@ -1892,76 +1894,77 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
 }
 
-static void __free_slab(struct kmem_cache *s, struct page *page)
+static void __free_slab(struct kmem_cache *s, struct slab *slab)
 {
-	int order = compound_order(page);
-	int pages = 1 << order;
+	struct page *page = &slab->page;
+	int order = slab_order(slab);
+	int slabs = 1 << order;
 
 	if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) {
 		void *p;
 
-		slab_pad_check(s, page);
-		for_each_object(p, s, page_address(page),
-						page->objects)
-			check_object(s, page, p, SLUB_RED_INACTIVE);
+		slab_pad_check(s, slab);
+		for_each_object(p, s, slab_address(slab),
+						slab->objects)
+			check_object(s, slab, p, SLUB_RED_INACTIVE);
 	}
 
-	__ClearPageSlabPfmemalloc(page);
+	__ClearSlabPfmemalloc(slab);
 	__ClearPageSlab(page);
-	/* In union with page->mapping where page allocator expects NULL */
-	page->slab_cache = NULL;
+	page->mapping = NULL;
 	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += pages;
-	unaccount_slab_page(page, order, s);
-	__free_pages(page, order);
+		current->reclaim_state->reclaimed_slab += slabs;
+	unaccount_slab(slab, order, s);
+	put_page(page);
 }
 
 static void rcu_free_slab(struct rcu_head *h)
 {
 	struct page *page = container_of(h, struct page, rcu_head);
+	struct slab *slab = (struct slab *)page;
 
-	__free_slab(page->slab_cache, page);
+	__free_slab(slab->slab_cache, slab);
 }
 
-static void free_slab(struct kmem_cache *s, struct page *page)
+static void free_slab(struct kmem_cache *s, struct slab *slab)
 {
 	if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) {
-		call_rcu(&page->rcu_head, rcu_free_slab);
+		call_rcu(&slab->page.rcu_head, rcu_free_slab);
 	} else
-		__free_slab(s, page);
+		__free_slab(s, slab);
 }
 
-static void discard_slab(struct kmem_cache *s, struct page *page)
+static void discard_slab(struct kmem_cache *s, struct slab *slab)
 {
-	dec_slabs_node(s, page_to_nid(page), page->objects);
-	free_slab(s, page);
+	dec_slabs_node(s, slab_nid(slab), slab->objects);
+	free_slab(s, slab);
 }
 
 /*
  * Management of partially allocated slabs.
  */
 static inline void
-__add_partial(struct kmem_cache_node *n, struct page *page, int tail)
+__add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
 {
 	n->nr_partial++;
 	if (tail == DEACTIVATE_TO_TAIL)
-		list_add_tail(&page->slab_list, &n->partial);
+		list_add_tail(&slab->slab_list, &n->partial);
 	else
-		list_add(&page->slab_list, &n->partial);
+		list_add(&slab->slab_list, &n->partial);
 }
 
 static inline void add_partial(struct kmem_cache_node *n,
-				struct page *page, int tail)
+				struct slab *slab, int tail)
 {
 	lockdep_assert_held(&n->list_lock);
-	__add_partial(n, page, tail);
+	__add_partial(n, slab, tail);
 }
 
 static inline void remove_partial(struct kmem_cache_node *n,
-					struct page *page)
+					struct slab *slab)
 {
 	lockdep_assert_held(&n->list_lock);
-	list_del(&page->slab_list);
+	list_del(&slab->slab_list);
 	n->nr_partial--;
 }
 
@@ -1972,12 +1975,12 @@ static inline void remove_partial(struct kmem_cache_node *n,
  * Returns a list of objects or NULL if it fails.
  */
 static inline void *acquire_slab(struct kmem_cache *s,
-		struct kmem_cache_node *n, struct page *page,
+		struct kmem_cache_node *n, struct slab *slab,
 		int mode, int *objects)
 {
 	void *freelist;
 	unsigned long counters;
-	struct page new;
+	struct slab new;
 
 	lockdep_assert_held(&n->list_lock);
 
@@ -1986,12 +1989,12 @@ static inline void *acquire_slab(struct kmem_cache *s,
 	 * The old freelist is the list of objects for the
 	 * per cpu allocation list.
 	 */
-	freelist = page->freelist;
-	counters = page->counters;
+	freelist = slab->freelist;
+	counters = slab->counters;
 	new.counters = counters;
 	*objects = new.objects - new.inuse;
 	if (mode) {
-		new.inuse = page->objects;
+		new.inuse = slab->objects;
 		new.freelist = NULL;
 	} else {
 		new.freelist = freelist;
@@ -2000,19 +2003,19 @@ static inline void *acquire_slab(struct kmem_cache *s,
 	VM_BUG_ON(new.frozen);
 	new.frozen = 1;
 
-	if (!__cmpxchg_double_slab(s, page,
+	if (!__cmpxchg_double_slab(s, slab,
 			freelist, counters,
 			new.freelist, new.counters,
 			"acquire_slab"))
 		return NULL;
 
-	remove_partial(n, page);
+	remove_partial(n, slab);
 	WARN_ON(!freelist);
 	return freelist;
 }
 
-static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
-static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
+static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
+static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
 
 /*
  * Try to allocate a partial slab from a specific node.
@@ -2020,7 +2023,7 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
  */
 static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 				struct kmem_cache_cpu *c, gfp_t flags)
 {
-	struct page *page, *page2;
+	struct slab *slab, *slab2;
 	void *object = NULL;
 	unsigned int available = 0;
 	int objects;
@@ -2035,23 +2038,23 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		return NULL;
 
 	spin_lock(&n->list_lock);
-	list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
+	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		void *t;
 
-		if (!pfmemalloc_match(page, flags))
+		if (!pfmemalloc_match(slab, flags))
 			continue;
 
-		t = acquire_slab(s, n, page, object == NULL, &objects);
+		t = acquire_slab(s, n, slab, object == NULL, &objects);
 		if (!t)
 			break;
 
 		available += objects;
 		if (!object) {
-			c->page = page;
+			c->slab = slab;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
-			put_cpu_partial(s, page, 0);
+			put_cpu_partial(s, slab, 0);
 			stat(s, CPU_PARTIAL_NODE);
 		}
 		if (!kmem_cache_has_cpu_partial(s)
@@ -2064,7 +2067,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 }
 
 /*
- * Get a page from somewhere. Search in increasing NUMA distances.
+ * Get a slab from somewhere. Search in increasing NUMA distances.
  */
 static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 		struct kmem_cache_cpu *c)
@@ -2128,7 +2131,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 }
 
 /*
- * Get a partial page, lock it and return it.
+ * Get a partial slab, lock it and return it.
  */
 static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 		struct kmem_cache_cpu *c)
@@ -2218,19 +2221,19 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
 /*
  * Remove the cpu slab
  */
-static void deactivate_slab(struct kmem_cache *s, struct page *page,
+static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 				void *freelist, struct kmem_cache_cpu *c)
 {
 	enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
-	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
 	int lock = 0, free_delta = 0;
 	enum slab_modes l = M_NONE, m = M_NONE;
 	void *nextfree, *freelist_iter, *freelist_tail;
 	int tail = DEACTIVATE_TO_HEAD;
-	struct page new;
-	struct page old;
+	struct slab new;
+	struct slab old;
 
-	if (page->freelist) {
+	if (slab->freelist) {
 		stat(s, DEACTIVATE_REMOTE_FREES);
 		tail = DEACTIVATE_TO_TAIL;
 	}
@@ -2249,7 +2252,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 		 * 'freelist_iter' is already corrupted. So isolate all objects
 		 * starting at 'freelist_iter' by skipping them.
 		 */
-		if (freelist_corrupted(s, page, &freelist_iter, nextfree))
+		if (freelist_corrupted(s, slab, &freelist_iter, nextfree))
 			break;
 
 		freelist_tail = freelist_iter;
@@ -2259,25 +2262,25 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 	}
 
 	/*
-	 * Stage two: Unfreeze the page while splicing the per-cpu
-	 * freelist to the head of page's freelist.
+	 * Stage two: Unfreeze the slab while splicing the per-cpu
+	 * freelist to the head of slab's freelist.
 	 *
-	 * Ensure that the page is unfrozen while the list presence
+	 * Ensure that the slab is unfrozen while the list presence
 	 * reflects the actual number of objects during unfreeze.
 	 *
 	 * We setup the list membership and then perform a cmpxchg
-	 * with the count. If there is a mismatch then the page
-	 * is not unfrozen but the page is on the wrong list.
+	 * with the count. If there is a mismatch then the slab
+	 * is not unfrozen but the slab is on the wrong list.
 	 *
 	 * Then we restart the process which may have to remove
-	 * the page from the list that we just put it on again
+	 * the slab from the list that we just put it on again
 	 * because the number of objects in the slab may have
 	 * changed.
 	 */
 redo:
 
-	old.freelist = READ_ONCE(page->freelist);
-	old.counters = READ_ONCE(page->counters);
+	old.freelist = READ_ONCE(slab->freelist);
+	old.counters = READ_ONCE(slab->counters);
 	VM_BUG_ON(!old.frozen);
 
 	/* Determine target state of the slab */
@@ -2299,7 +2302,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 			lock = 1;
 			/*
 			 * Taking the spinlock removes the possibility
-			 * that acquire_slab() will see a slab page that
+			 * that acquire_slab() will see a slab that
 			 * is frozen
 			 */
 			spin_lock(&n->list_lock);
@@ -2319,18 +2322,18 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 
 	if (l != m) {
 		if (l == M_PARTIAL)
-			remove_partial(n, page);
+			remove_partial(n, slab);
 		else if (l == M_FULL)
-			remove_full(s, n, page);
+			remove_full(s, n, slab);
 
 		if (m == M_PARTIAL)
-			add_partial(n, page, tail);
+			add_partial(n, slab, tail);
 		else if (m == M_FULL)
-			add_full(s, n, page);
+			add_full(s, n, slab);
 	}
 
 	l = m;
-	if (!__cmpxchg_double_slab(s, page,
+	if (!__cmpxchg_double_slab(s, slab,
 				old.freelist, old.counters,
 				new.freelist, new.counters,
 				"unfreezing slab"))
@@ -2345,11 +2348,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 		stat(s, DEACTIVATE_FULL);
 	else if (m == M_FREE) {
 		stat(s, DEACTIVATE_EMPTY);
-		discard_slab(s, page);
+		discard_slab(s, slab);
 		stat(s, FREE_SLAB);
 	}
 
-	c->page = NULL;
+	c->slab = NULL;
 	c->freelist = NULL;
 }
 
@@ -2365,15 +2368,15 @@ static void unfreeze_partials(struct kmem_cache *s,
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
-	struct page *page, *discard_page = NULL;
+	struct slab *slab, *next_slab = NULL;
 
-	while ((page = slub_percpu_partial(c))) {
-		struct page new;
-		struct page old;
+	while ((slab = slub_percpu_partial(c))) {
+		struct slab new;
+		struct slab old;
 
-		slub_set_percpu_partial(c, page);
+		slub_set_percpu_partial(c, slab);
 
-		n2 = get_node(s, page_to_nid(page));
+		n2 = get_node(s, slab_nid(slab));
 		if (n != n2) {
 			if (n)
 				spin_unlock(&n->list_lock);
@@ -2384,8 +2387,8 @@ static void unfreeze_partials(struct kmem_cache *s,
 
 		do {
 
-			old.freelist = page->freelist;
-			old.counters = page->counters;
+			old.freelist = slab->freelist;
+			old.counters = slab->counters;
 			VM_BUG_ON(!old.frozen);
 
 			new.counters = old.counters;
@@ -2393,16 +2396,16 @@ static void unfreeze_partials(struct kmem_cache *s,
 			new.frozen = 0;
 
-		} while (!__cmpxchg_double_slab(s, page,
+		} while (!__cmpxchg_double_slab(s, slab,
 				old.freelist, old.counters,
 				new.freelist, new.counters,
 				"unfreezing slab"));
 
 		if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
-			page->next = discard_page;
-			discard_page = page;
+			slab->next = next_slab;
+			next_slab = slab;
 		} else {
-			add_partial(n, page, DEACTIVATE_TO_TAIL);
+			add_partial(n, slab, DEACTIVATE_TO_TAIL);
 			stat(s, FREE_ADD_PARTIAL);
 		}
 	}
@@ -2410,40 +2413,40 @@ static void unfreeze_partials(struct kmem_cache *s,
 	if (n)
 		spin_unlock(&n->list_lock);
 
-	while (discard_page) {
-		page = discard_page;
-		discard_page = discard_page->next;
+	while (next_slab) {
+		slab = next_slab;
+		next_slab = next_slab->next;
 
 		stat(s, DEACTIVATE_EMPTY);
-		discard_slab(s, page);
+		discard_slab(s, slab);
 		stat(s, FREE_SLAB);
 	}
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
 /*
- * Put a page that was just frozen (in __slab_free|get_partial_node) into a
- * partial page slot if available.
+ * Put a slab that was just frozen (in __slab_free|get_partial_node) into a
+ * partial slab slot if available.
  *
  * If we did not find a slot then simply move all the partials to the
  * per node partial list.
  */
-static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
+static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
-	struct page *oldpage;
-	int pages;
+	struct slab *oldslab;
+	int slabs;
 	int pobjects;
 
 	preempt_disable();
 	do {
-		pages = 0;
+		slabs = 0;
 		pobjects = 0;
-		oldpage = this_cpu_read(s->cpu_slab->partial);
+		oldslab = this_cpu_read(s->cpu_slab->partial);
 
-		if (oldpage) {
-			pobjects = oldpage->pobjects;
-			pages = oldpage->pages;
+		if (oldslab) {
+			pobjects = oldslab->pobjects;
+			slabs = oldslab->slabs;
 			if (drain && pobjects > slub_cpu_partial(s)) {
 				unsigned long flags;
 				/*
@@ -2453,22 +2456,22 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 				local_irq_save(flags);
 				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
 				local_irq_restore(flags);
-				oldpage = NULL;
+				oldslab = NULL;
 				pobjects = 0;
-				pages = 0;
+				slabs = 0;
 				stat(s, CPU_PARTIAL_DRAIN);
 			}
 		}
 
-		pages++;
-		pobjects += page->objects - page->inuse;
+		slabs++;
+		pobjects += slab->objects - slab->inuse;
 
-		page->pages = pages;
-		page->pobjects = pobjects;
-		page->next = oldpage;
+		slab->slabs = slabs;
+		slab->pobjects = pobjects;
+		slab->next = oldslab;
 
-	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
-								!= oldpage);
+	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldslab, slab)
+								!= oldslab);
 
 	if (unlikely(!slub_cpu_partial(s))) {
 		unsigned long flags;
@@ -2483,7 +2486,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	stat(s, CPUSLAB_FLUSH);
-	deactivate_slab(s, c->page, c->freelist, c);
+	deactivate_slab(s, c->slab, c->freelist, c);
 	c->tid = next_tid(c->tid);
 }
 
@@ -2497,7 +2500,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
-	if (c->page)
+	if (c->slab)
 		flush_slab(s, c);
unfreeze_partials(s, c); @@ -2515,7 +2518,7 @@ static bool has_cpu_slab(int cpu, void *info) struct kmem_cache *s = info; struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu); - return c->page || slub_percpu_partial(c); + return c->slab || slub_percpu_partial(c); } static void flush_all(struct kmem_cache *s) @@ -2546,19 +2549,19 @@ static int slub_cpu_dead(unsigned int cpu) * Check if the objects in a per cpu structure fit numa * locality expectations. */ -static inline int node_match(struct page *page, int node) +static inline int node_match(struct slab *slab, int node) { #ifdef CONFIG_NUMA - if (node != NUMA_NO_NODE && page_to_nid(page) != node) + if (node != NUMA_NO_NODE && slab_nid(slab) != node) return 0; #endif return 1; } #ifdef CONFIG_SLUB_DEBUG -static int count_free(struct page *page) +static int count_free(struct slab *slab) { - return page->objects - page->inuse; + return slab->objects - slab->inuse; } static inline unsigned long node_nr_objs(struct kmem_cache_node *n) @@ -2569,15 +2572,15 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n) #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) static unsigned long count_partial(struct kmem_cache_node *n, - int (*get_count)(struct page *)) + int (*get_count)(struct slab *)) { unsigned long flags; unsigned long x = 0; - struct page *page; + struct slab *slab; spin_lock_irqsave(&n->list_lock, flags); - list_for_each_entry(page, &n->partial, slab_list) - x += get_count(page); + list_for_each_entry(slab, &n->partial, slab_list) + x += get_count(slab); spin_unlock_irqrestore(&n->list_lock, flags); return x; } @@ -2625,7 +2628,7 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags, { void *freelist; struct kmem_cache_cpu *c = *pc; - struct page *page; + struct slab *slab; WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO)); @@ -2634,62 +2637,62 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags, if (freelist) return freelist; - page = new_slab(s, 
flags, node); - if (page) { + slab = new_slab(s, flags, node); + if (slab) { c = raw_cpu_ptr(s->cpu_slab); - if (c->page) + if (c->slab) flush_slab(s, c); /* - * No other reference to the page yet so we can + * No other reference to the slab yet so we can * muck around with it freely without cmpxchg */ - freelist = page->freelist; - page->freelist = NULL; + freelist = slab->freelist; + slab->freelist = NULL; stat(s, ALLOC_SLAB); - c->page = page; + c->slab = slab; *pc = c; } return freelist; } -static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags) +static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags) { - if (unlikely(PageSlabPfmemalloc(page))) + if (unlikely(SlabPfmemalloc(slab))) return gfp_pfmemalloc_allowed(gfpflags); return true; } /* - * Check the page->freelist of a page and either transfer the freelist to the - * per cpu freelist or deactivate the page. + * Check the slab->freelist of a slab and either transfer the freelist to the + * per cpu freelist or deactivate the slab. * - * The page is still frozen if the return value is not NULL. + * The slab is still frozen if the return value is not NULL. * - * If this function returns NULL then the page has been unfrozen. + * If this function returns NULL then the slab has been unfrozen. * * This function must be called with interrupt disabled. 
*/ -static inline void *get_freelist(struct kmem_cache *s, struct page *page) +static inline void *get_freelist(struct kmem_cache *s, struct slab *slab) { - struct page new; + struct slab new; unsigned long counters; void *freelist; do { - freelist = page->freelist; - counters = page->counters; + freelist = slab->freelist; + counters = slab->counters; new.counters = counters; VM_BUG_ON(!new.frozen); - new.inuse = page->objects; + new.inuse = slab->objects; new.frozen = freelist != NULL; - } while (!__cmpxchg_double_slab(s, page, + } while (!__cmpxchg_double_slab(s, slab, freelist, counters, NULL, new.counters, "get_freelist")); @@ -2711,7 +2714,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page) * * And if we were unable to get a new slab from the partial slab lists then * we need to allocate a new slab. This is the slowest path since it involves * a call to the page allocator and the setup of a new slab. * * Version of __slab_alloc to use when we know that interrupts are * already disabled (which is the case for bulk allocation). 
@@ -2720,12 +2723,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, unsigned long addr, struct kmem_cache_cpu *c) { void *freelist; - struct page *page; + struct slab *slab; stat(s, ALLOC_SLOWPATH); - page = c->page; - if (!page) { + slab = c->slab; + if (!slab) { /* * if the node is not online or has no normal memory, just * ignore the node constraint @@ -2737,7 +2740,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, } redo: - if (unlikely(!node_match(page, node))) { + if (unlikely(!node_match(slab, node))) { /* * same as above but node_match() being false already * implies node != NUMA_NO_NODE @@ -2747,18 +2750,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, goto redo; } else { stat(s, ALLOC_NODE_MISMATCH); - deactivate_slab(s, page, c->freelist, c); + deactivate_slab(s, slab, c->freelist, c); goto new_slab; } } /* - * By rights, we should be searching for a slab page that was + * By rights, we should be searching for a slab that was * PFMEMALLOC but right now, we are losing the pfmemalloc - * information when the page leaves the per-cpu allocator + * information when the slab leaves the per-cpu allocator */ - if (unlikely(!pfmemalloc_match(page, gfpflags))) { - deactivate_slab(s, page, c->freelist, c); + if (unlikely(!pfmemalloc_match(slab, gfpflags))) { + deactivate_slab(s, slab, c->freelist, c); goto new_slab; } @@ -2767,10 +2770,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, if (freelist) goto load_freelist; - freelist = get_freelist(s, page); + freelist = get_freelist(s, slab); if (!freelist) { - c->page = NULL; + c->slab = NULL; stat(s, DEACTIVATE_BYPASS); goto new_slab; } @@ -2780,10 +2783,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, load_freelist: /* * freelist is pointing to the list of objects to be used. - * page is pointing to the page from which the objects are obtained. 
- * That page must be frozen for per cpu allocations to work. + * slab is pointing to the slab from which the objects are obtained. + * That slab must be frozen for per cpu allocations to work. */ - VM_BUG_ON(!c->page->frozen); + VM_BUG_ON(!c->slab->frozen); c->freelist = get_freepointer(s, freelist); c->tid = next_tid(c->tid); return freelist; @@ -2791,8 +2794,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, new_slab: if (slub_percpu_partial(c)) { - page = c->page = slub_percpu_partial(c); - slub_set_percpu_partial(c, page); + slab = c->slab = slub_percpu_partial(c); + slub_set_percpu_partial(c, slab); stat(s, CPU_PARTIAL_ALLOC); goto redo; } @@ -2804,16 +2807,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, return NULL; } - page = c->page; - if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags))) + slab = c->slab; + if (likely(!kmem_cache_debug(s) && pfmemalloc_match(slab, gfpflags))) goto load_freelist; /* Only entered in the debug case */ if (kmem_cache_debug(s) && - !alloc_debug_processing(s, page, freelist, addr)) + !alloc_debug_processing(s, slab, freelist, addr)) goto new_slab; /* Slab failed checks. Next slab needed */ - deactivate_slab(s, page, get_freepointer(s, freelist), c); + deactivate_slab(s, slab, get_freepointer(s, freelist), c); return freelist; } @@ -2869,7 +2872,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, { void *object; struct kmem_cache_cpu *c; - struct page *page; + struct slab *slab; unsigned long tid; struct obj_cgroup *objcg = NULL; bool init = false; @@ -2902,9 +2905,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, /* * Irqless object alloc/free algorithm used here depends on sequence * of fetching cpu_slab's data. 
tid should be fetched before anything - * on c to guarantee that object and page associated with previous tid + * on c to guarantee that object and slab associated with previous tid * won't be used with current tid. If we fetch tid first, object and - * page could be one associated with next tid and our alloc/free + * slab could be one associated with next tid and our alloc/free * request will be failed. In this case, we will retry. So, no problem. */ barrier(); @@ -2917,8 +2920,8 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, */ object = c->freelist; - page = c->page; - if (unlikely(!object || !page || !node_match(page, node))) { + slab = c->slab; + if (unlikely(!object || !slab || !node_match(slab, node))) { object = __slab_alloc(s, gfpflags, node, addr, c); } else { void *next_object = get_freepointer_safe(s, object); @@ -3020,17 +3023,17 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); * have a longer lifetime than the cpu slabs in most processing loads. * * So we still attempt to reduce cache line usage. Just take the slab - * lock and free the item. If there is no additional partial page + * lock and free the item. If there is no additional partial slab * handling required then we can return immediately. 
*/ -static void __slab_free(struct kmem_cache *s, struct page *page, +static void __slab_free(struct kmem_cache *s, struct slab *slab, void *head, void *tail, int cnt, unsigned long addr) { void *prior; int was_frozen; - struct page new; + struct slab new; unsigned long counters; struct kmem_cache_node *n = NULL; unsigned long flags; @@ -3041,7 +3044,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, return; if (kmem_cache_debug(s) && - !free_debug_processing(s, page, head, tail, cnt, addr)) + !free_debug_processing(s, slab, head, tail, cnt, addr)) return; do { @@ -3049,8 +3052,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page, spin_unlock_irqrestore(&n->list_lock, flags); n = NULL; } - prior = page->freelist; - counters = page->counters; + prior = slab->freelist; + counters = slab->counters; set_freepointer(s, tail, prior); new.counters = counters; was_frozen = new.frozen; @@ -3069,7 +3072,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } else { /* Needs to be taken off a list */ - n = get_node(s, page_to_nid(page)); + n = get_node(s, slab_nid(slab)); /* * Speculatively acquire the list_lock. * If the cmpxchg does not succeed then we may @@ -3083,7 +3086,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } } - } while (!cmpxchg_double_slab(s, page, + } while (!cmpxchg_double_slab(s, slab, prior, counters, head, new.counters, "__slab_free")); @@ -3098,10 +3101,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page, stat(s, FREE_FROZEN); } else if (new.frozen) { /* - * If we just froze the page then put it onto the + * If we just froze the slab then put it onto the * per cpu partial list. */ - put_cpu_partial(s, page, 1); + put_cpu_partial(s, slab, 1); stat(s, CPU_PARTIAL_FREE); } @@ -3116,8 +3119,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page, * then add it. 
*/ if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) { - remove_full(s, n, page); - add_partial(n, page, DEACTIVATE_TO_TAIL); + remove_full(s, n, slab); + add_partial(n, slab, DEACTIVATE_TO_TAIL); stat(s, FREE_ADD_PARTIAL); } spin_unlock_irqrestore(&n->list_lock, flags); @@ -3128,16 +3131,16 @@ static void __slab_free(struct kmem_cache *s, struct page *page, /* * Slab on the partial list. */ - remove_partial(n, page); + remove_partial(n, slab); stat(s, FREE_REMOVE_PARTIAL); } else { /* Slab must be on the full list */ - remove_full(s, n, page); + remove_full(s, n, slab); } spin_unlock_irqrestore(&n->list_lock, flags); stat(s, FREE_SLAB); - discard_slab(s, page); + discard_slab(s, slab); } /* @@ -3152,11 +3155,11 @@ static void __slab_free(struct kmem_cache *s, struct page *page, * with all sorts of special processing. * * Bulk free of a freelist with several objects (all pointing to the - * same page) possible by specifying head and tail ptr, plus objects + * same slab) possible by specifying head and tail ptr, plus objects * count (cnt). Bulk free indicated by tail pointer being set. */ static __always_inline void do_slab_free(struct kmem_cache *s, - struct page *page, void *head, void *tail, + struct slab *slab, void *head, void *tail, int cnt, unsigned long addr) { void *tail_obj = tail ? 
: head; @@ -3180,7 +3183,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s, /* Same with comment on barrier() in slab_alloc_node() */ barrier(); - if (likely(page == c->page)) { + if (likely(slab == c->slab)) { void **freelist = READ_ONCE(c->freelist); set_freepointer(s, tail_obj, freelist); @@ -3195,11 +3198,11 @@ static __always_inline void do_slab_free(struct kmem_cache *s, } stat(s, FREE_FASTPATH); } else - __slab_free(s, page, head, tail_obj, cnt, addr); + __slab_free(s, slab, head, tail_obj, cnt, addr); } -static __always_inline void slab_free(struct kmem_cache *s, struct page *page, +static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab, void *head, void *tail, int cnt, unsigned long addr) { @@ -3208,13 +3211,13 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page, * to remove objects, whose reuse must be delayed. */ if (slab_free_freelist_hook(s, &head, &tail)) - do_slab_free(s, page, head, tail, cnt, addr); + do_slab_free(s, slab, head, tail, cnt, addr); } #ifdef CONFIG_KASAN_GENERIC void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr) { - do_slab_free(cache, virt_to_head_page(x), x, NULL, 1, addr); + do_slab_free(cache, virt_to_slab(x), x, NULL, 1, addr); } #endif @@ -3223,13 +3226,13 @@ void kmem_cache_free(struct kmem_cache *s, void *x) s = cache_from_obj(s, x); if (!s) return; - slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_); + slab_free(s, virt_to_slab(x), x, NULL, 1, _RET_IP_); trace_kmem_cache_free(_RET_IP_, x, s->name); } EXPORT_SYMBOL(kmem_cache_free); struct detached_freelist { - struct page *page; + struct slab *slab; void *tail; void *freelist; int cnt; @@ -3239,8 +3242,8 @@ struct detached_freelist { /* * This function progressively scans the array with free objects (with * a limited look ahead) and extract objects belonging to the same - * page. It builds a detached freelist directly within the given - * page/objects. 
This can happen without any need for + * slab. It builds a detached freelist directly within the given + * slab/objects. This can happen without any need for * synchronization, because the objects are owned by running process. * The freelist is build up as a single linked list in the objects. * The idea is, that this detached freelist can then be bulk @@ -3255,10 +3258,10 @@ int build_detached_freelist(struct kmem_cache *s, size_t size, size_t first_skipped_index = 0; int lookahead = 3; void *object; - struct page *page; + struct slab *slab; /* Always re-init detached_freelist */ - df->page = NULL; + df->slab = NULL; do { object = p[--size]; @@ -3268,18 +3271,18 @@ int build_detached_freelist(struct kmem_cache *s, size_t size, if (!object) return 0; - page = virt_to_head_page(object); + slab = virt_to_slab(object); if (!s) { /* Handle kalloc'ed objects */ - if (unlikely(!PageSlab(page))) { - BUG_ON(!PageCompound(page)); + if (unlikely(!is_slab(slab))) { + BUG_ON(!SlabMulti(slab)); kfree_hook(object); - __free_pages(page, compound_order(page)); + put_page(&slab->page); p[size] = NULL; /* mark object processed */ return size; } /* Derive kmem_cache from object */ - df->s = page->slab_cache; + df->s = slab->slab_cache; } else { df->s = cache_from_obj(s, object); /* Support for memcg */ } @@ -3292,7 +3295,7 @@ int build_detached_freelist(struct kmem_cache *s, size_t size, } /* Start new detached freelist */ - df->page = page; + df->slab = slab; set_freepointer(df->s, object, NULL); df->tail = object; df->freelist = object; @@ -3304,8 +3307,8 @@ int build_detached_freelist(struct kmem_cache *s, size_t size, if (!object) continue; /* Skip processed objects */ - /* df->page is always set at this point */ - if (df->page == virt_to_head_page(object)) { + /* df->slab is always set at this point */ + if (df->slab == virt_to_slab(object)) { /* Opportunity build freelist */ set_freepointer(df->s, object, df->freelist); df->freelist = object; @@ -3337,10 +3340,10 @@ void 
kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) struct detached_freelist df; size = build_detached_freelist(s, size, p, &df); - if (!df.page) + if (!df.slab) continue; - slab_free(df.s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_); + slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt, _RET_IP_); } while (likely(size)); } EXPORT_SYMBOL(kmem_cache_free_bulk); @@ -3435,7 +3438,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk); */ /* - * Minimum / Maximum order of slab pages. This influences locking overhead + * Minimum / Maximum order of slabs. This influences locking overhead * and slab fragmentation. A higher order reduces the number of partial slabs * and increases the number of allocations possible without having to * take the list_lock. @@ -3449,7 +3452,7 @@ static unsigned int slub_min_objects; * * The order of allocation has significant impact on performance and other * system components. Generally order 0 allocations should be preferred since * order 0 does not cause fragmentation in the page allocator. Larger objects * be problematic to put into order 0 slabs because there may be too much * unused space left. We go to a higher order if more than 1/16th of the slab * would be wasted. @@ -3461,15 +3464,15 @@ static unsigned int slub_min_objects; * * slub_max_order specifies the order where we begin to stop considering the * number of objects in a slab as critical. If we reach slub_max_order then - * we try to keep the page order as low as possible. So we accept more waste - * of space in favor of a small page order. + * we try to keep the slab order as low as possible. So we accept more waste + * of space in favor of a small slab order. * * Higher order allocations also allow the placement of more objects in a * slab and thereby reduce object handling overhead. 
If the user has * requested a higher minimum order then we start with that one instead of * the smallest order which will fit the object. */ -static inline unsigned int slab_order(unsigned int size, +static inline unsigned int calc_slab_order(unsigned int size, unsigned int min_objects, unsigned int max_order, unsigned int fract_leftover) { @@ -3533,7 +3536,7 @@ static inline int calculate_order(unsigned int size) fraction = 16; while (fraction >= 4) { - order = slab_order(size, min_objects, + order = calc_slab_order(size, min_objects, slub_max_order, fraction); if (order <= slub_max_order) return order; @@ -3546,14 +3549,14 @@ static inline int calculate_order(unsigned int size) * We were unable to place multiple objects in a slab. Now * lets see if we can place a single object there. */ - order = slab_order(size, 1, slub_max_order, 1); + order = calc_slab_order(size, 1, slub_max_order, 1); if (order <= slub_max_order) return order; /* * Doh this slab cannot be placed using slub_max_order. 
*/ - order = slab_order(size, 1, MAX_ORDER, 1); + order = calc_slab_order(size, 1, MAX_ORDER, 1); if (order < MAX_ORDER) return order; return -ENOSYS; @@ -3605,38 +3608,38 @@ static struct kmem_cache *kmem_cache_node; */ static void early_kmem_cache_node_alloc(int node) { - struct page *page; + struct slab *slab; struct kmem_cache_node *n; BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node)); - page = new_slab(kmem_cache_node, GFP_NOWAIT, node); + slab = new_slab(kmem_cache_node, GFP_NOWAIT, node); - BUG_ON(!page); - if (page_to_nid(page) != node) { + BUG_ON(!slab); + if (slab_nid(slab) != node) { pr_err("SLUB: Unable to allocate memory from node %d\n", node); pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n"); } - n = page->freelist; + n = slab->freelist; BUG_ON(!n); #ifdef CONFIG_SLUB_DEBUG init_object(kmem_cache_node, n, SLUB_RED_ACTIVE); init_tracking(kmem_cache_node, n); #endif n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false); - page->freelist = get_freepointer(kmem_cache_node, n); - page->inuse = 1; - page->frozen = 0; + slab->freelist = get_freepointer(kmem_cache_node, n); + slab->inuse = 1; + slab->frozen = 0; kmem_cache_node->node[node] = n; init_kmem_cache_node(n); - inc_slabs_node(kmem_cache_node, node, page->objects); + inc_slabs_node(kmem_cache_node, node, slab->objects); /* * No locks need to be taken here as it has just been * initialized and there is no concurrent access. */ - __add_partial(n, page, DEACTIVATE_TO_HEAD); + __add_partial(n, slab, DEACTIVATE_TO_HEAD); } static void free_kmem_cache_nodes(struct kmem_cache *s) @@ -3894,8 +3897,8 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags) #endif /* - * The larger the object size is, the more pages we want on the partial + * The larger the object size is, the more slabs we want on the partial * list to avoid pounding the page allocator excessively. 
*/ set_min_partial(s, ilog2(s->size) / 2); @@ -3922,19 +3925,19 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags) return -EINVAL; } -static void list_slab_objects(struct kmem_cache *s, struct page *page, +static void list_slab_objects(struct kmem_cache *s, struct slab *slab, const char *text) { #ifdef CONFIG_SLUB_DEBUG - void *addr = page_address(page); + void *addr = slab_address(slab); unsigned long *map; void *p; - slab_err(s, page, text, s->name); - slab_lock(page); + slab_err(s, slab, text, s->name); + slab_lock(slab); - map = get_map(s, page); - for_each_object(p, s, addr, page->objects) { + map = get_map(s, slab); + for_each_object(p, s, addr, slab->objects) { if (!test_bit(__obj_to_index(s, addr, p), map)) { pr_err("Object 0x%p @offset=%tu\n", p, p - addr); @@ -3942,7 +3945,7 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page, } } put_map(map); - slab_unlock(page); + slab_unlock(slab); #endif } @@ -3954,23 +3957,23 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page, static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) { LIST_HEAD(discard); - struct page *page, *h; + struct slab *slab, *h; BUG_ON(irqs_disabled()); spin_lock_irq(&n->list_lock); - list_for_each_entry_safe(page, h, &n->partial, slab_list) { - if (!page->inuse) { - remove_partial(n, page); - list_add(&page->slab_list, &discard); + list_for_each_entry_safe(slab, h, &n->partial, slab_list) { + if (!slab->inuse) { + remove_partial(n, slab); + list_add(&slab->slab_list, &discard); } else { - list_slab_objects(s, page, + list_slab_objects(s, slab, "Objects remaining in %s on __kmem_cache_shutdown()"); } } spin_unlock_irq(&n->list_lock); - list_for_each_entry_safe(page, h, &discard, slab_list) - discard_slab(s, page); + list_for_each_entry_safe(slab, h, &discard, slab_list) + discard_slab(s, slab); } bool __kmem_cache_empty(struct kmem_cache *s) @@ -4003,31 +4006,31 @@ int __kmem_cache_shutdown(struct kmem_cache *s) 
} #ifdef CONFIG_PRINTK -void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page) +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab) { void *base; int __maybe_unused i; unsigned int objnr; void *objp; void *objp0; - struct kmem_cache *s = page->slab_cache; + struct kmem_cache *s = slab->slab_cache; struct track __maybe_unused *trackp; kpp->kp_ptr = object; - kpp->kp_page = page; + kpp->kp_slab = slab; kpp->kp_slab_cache = s; - base = page_address(page); + base = slab_address(slab); objp0 = kasan_reset_tag(object); #ifdef CONFIG_SLUB_DEBUG objp = restore_red_left(s, objp0); #else objp = objp0; #endif - objnr = obj_to_index(s, page, objp); + objnr = obj_to_index(s, slab, objp); kpp->kp_data_offset = (unsigned long)((char *)objp0 - (char *)objp); objp = base + s->size * objnr; kpp->kp_objp = objp; - if (WARN_ON_ONCE(objp < base || objp >= base + page->objects * s->size || (objp - base) % s->size) || + if (WARN_ON_ONCE(objp < base || objp >= base + slab->objects * s->size || (objp - base) % s->size) || !(s->flags & SLAB_STORE_USER)) return; #ifdef CONFIG_SLUB_DEBUG @@ -4115,8 +4118,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node) unsigned int order = get_order(size); flags |= __GFP_COMP; page = alloc_pages_node(node, flags, order); if (page) { ptr = page_address(page); mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B, PAGE_SIZE << order); @@ -4165,7 +4168,7 @@ EXPORT_SYMBOL(__kmalloc_node); * Returns NULL if check passes, otherwise const char * to name of cache * to indicate an error. 
 */
-void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
+void __check_heap_object(const void *ptr, unsigned long n, struct slab *slab,
 			 bool to_user)
 {
 	struct kmem_cache *s;
@@ -4176,18 +4179,18 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
 	ptr = kasan_reset_tag(ptr);
 
 	/* Find object and usable object size. */
-	s = page->slab_cache;
+	s = slab->slab_cache;
 
 	/* Reject impossible pointers. */
-	if (ptr < page_address(page))
-		usercopy_abort("SLUB object not in SLUB page?!", NULL,
+	if (ptr < slab_address(slab))
+		usercopy_abort("SLUB object not in SLUB slab?!", NULL,
 			       to_user, 0, n);
 
 	/* Find offset within object. */
 	if (is_kfence)
 		offset = ptr - kfence_object_start(ptr);
 	else
-		offset = (ptr - page_address(page)) % s->size;
+		offset = (ptr - slab_address(slab)) % s->size;
 
 	/* Adjust for redzone and reject if within the redzone. */
 	if (!is_kfence && kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
@@ -4222,25 +4225,25 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
 
 size_t __ksize(const void *object)
 {
-	struct page *page;
+	struct slab *slab;
 
 	if (unlikely(object == ZERO_SIZE_PTR))
 		return 0;
 
-	page = virt_to_head_page(object);
+	slab = virt_to_slab(object);
 
-	if (unlikely(!PageSlab(page))) {
-		WARN_ON(!PageCompound(page));
-		return page_size(page);
+	if (unlikely(!is_slab(slab))) {
+		WARN_ON(!SlabMulti(slab));
+		return slab_size(slab);
 	}
 
-	return slab_ksize(page->slab_cache);
+	return slab_ksize(slab->slab_cache);
 }
 EXPORT_SYMBOL(__ksize);
 
 void kfree(const void *x)
 {
-	struct page *page;
+	struct slab *slab;
 	void *object = (void *)x;
 
 	trace_kfree(_RET_IP_, x);
@@ -4248,18 +4251,19 @@ void kfree(const void *x)
 	if (unlikely(ZERO_OR_NULL_PTR(x)))
 		return;
 
-	page = virt_to_head_page(x);
-	if (unlikely(!PageSlab(page))) {
-		unsigned int order = compound_order(page);
+	slab = virt_to_slab(x);
+	if (unlikely(!is_slab(slab))) {
+		unsigned int order = slab_order(slab);
+		struct page *page = &slab->page;
 
-		BUG_ON(!PageCompound(page));
+		BUG_ON(!SlabMulti(slab));
 		kfree_hook(object);
 		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
 				      -(PAGE_SIZE << order));
-		__free_pages(page, order);
+		put_page(page);
 		return;
 	}
-	slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
+	slab_free(slab->slab_cache, slab, object, NULL, 1, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
 
@@ -4279,8 +4283,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	int node;
 	int i;
 	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
+	struct slab *slab;
+	struct slab *t;
 	struct list_head discard;
 	struct list_head promote[SHRINK_PROMOTE_MAX];
 	unsigned long flags;
@@ -4298,22 +4302,22 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * Build lists of slabs to discard or promote.
 		 *
 		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
+		 * list_lock. slab->inuse here is the upper limit.
 		 */
-		list_for_each_entry_safe(page, t, &n->partial, slab_list) {
-			int free = page->objects - page->inuse;
+		list_for_each_entry_safe(slab, t, &n->partial, slab_list) {
+			int free = slab->objects - slab->inuse;
 
-			/* Do not reread page->inuse */
+			/* Do not reread slab->inuse */
 			barrier();
 
 			/* We do not keep full slabs on the list */
 			BUG_ON(free <= 0);
 
-			if (free == page->objects) {
-				list_move(&page->slab_list, &discard);
+			if (free == slab->objects) {
+				list_move(&slab->slab_list, &discard);
 				n->nr_partial--;
 			} else if (free <= SHRINK_PROMOTE_MAX)
-				list_move(&page->slab_list, promote + free - 1);
+				list_move(&slab->slab_list, promote + free - 1);
 		}
 
 		/*
@@ -4326,8 +4330,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
 		/* Release empty slabs */
-		list_for_each_entry_safe(page, t, &discard, slab_list)
-			discard_slab(s, page);
+		list_for_each_entry_safe(slab, t, &discard, slab_list)
+			discard_slab(s, slab);
 
 		if (slabs_node(s, node))
 			ret = 1;
@@ -4461,7 +4465,7 @@ static struct notifier_block slab_memory_callback_nb = {
 
 /*
  * Used for early kmem_cache structures that were allocated using
- * the page allocator. Allocate them properly then fix up the pointers
+ * the slab allocator. Allocate them properly then fix up the pointers
  * that may be pointing to the wrong kmem_cache structure.
  */
@@ -4480,7 +4484,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	 */
 	__flush_cpu_slab(s, smp_processor_id());
 	for_each_kmem_cache_node(s, node, n) {
-		struct page *p;
+		struct slab *p;
 
 		list_for_each_entry(p, &n->partial, slab_list)
 			p->slab_cache = s;
@@ -4656,54 +4660,54 @@ EXPORT_SYMBOL(__kmalloc_node_track_caller);
 #endif
 
 #ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
+static int count_inuse(struct slab *slab)
 {
-	return page->inuse;
+	return slab->inuse;
 }
 
-static int count_total(struct page *page)
+static int count_total(struct slab *slab)
 {
-	return page->objects;
+	return slab->objects;
 }
 #endif
 
 #ifdef CONFIG_SLUB_DEBUG
-static void validate_slab(struct kmem_cache *s, struct page *page)
+static void validate_slab(struct kmem_cache *s, struct slab *slab)
 {
 	void *p;
-	void *addr = page_address(page);
+	void *addr = slab_address(slab);
 	unsigned long *map;
 
-	slab_lock(page);
+	slab_lock(slab);
 
-	if (!check_slab(s, page) || !on_freelist(s, page, NULL))
+	if (!check_slab(s, slab) || !on_freelist(s, slab, NULL))
 		goto unlock;
 
 	/* Now we know that a valid freelist exists */
-	map = get_map(s, page);
-	for_each_object(p, s, addr, page->objects) {
+	map = get_map(s, slab);
+	for_each_object(p, s, addr, slab->objects) {
 		u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
 			 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
 
-		if (!check_object(s, page, p, val))
+		if (!check_object(s, slab, p, val))
 			break;
 	}
 	put_map(map);
 unlock:
-	slab_unlock(page);
+	slab_unlock(slab);
 }
 
 static int validate_slab_node(struct kmem_cache *s,
 		struct kmem_cache_node *n)
 {
 	unsigned long count = 0;
-	struct page *page;
+	struct slab *slab;
 	unsigned long flags;
 
 	spin_lock_irqsave(&n->list_lock, flags);
 
-	list_for_each_entry(page, &n->partial, slab_list) {
-		validate_slab(s, page);
+	list_for_each_entry(slab, &n->partial, slab_list) {
+		validate_slab(s, slab);
 		count++;
 	}
 	if (count != n->nr_partial) {
@@ -4715,8 +4719,8 @@ static int validate_slab_node(struct kmem_cache *s,
 	if (!(s->flags & SLAB_STORE_USER))
 		goto out;
 
-	list_for_each_entry(page, &n->full, slab_list) {
-		validate_slab(s, page);
+	list_for_each_entry(slab, &n->full, slab_list) {
+		validate_slab(s, slab);
 		count++;
 	}
 	if (count != atomic_long_read(&n->nr_slabs)) {
@@ -4838,7 +4842,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 			cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 		}
-		node_set(page_to_nid(virt_to_page(track)), l->nodes);
+		node_set(slab_nid(virt_to_slab(track)), l->nodes);
 		return 1;
 	}
 
@@ -4869,19 +4873,19 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 	cpumask_clear(to_cpumask(l->cpus));
 	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 	nodes_clear(l->nodes);
-	node_set(page_to_nid(virt_to_page(track)), l->nodes);
+	node_set(slab_nid(virt_to_slab(track)), l->nodes);
 	return 1;
 }
 
 static void process_slab(struct loc_track *t, struct kmem_cache *s,
-		struct page *page, enum track_item alloc)
+		struct slab *slab, enum track_item alloc)
 {
-	void *addr = page_address(page);
+	void *addr = slab_address(slab);
 	void *p;
 	unsigned long *map;
 
-	map = get_map(s, page);
-	for_each_object(p, s, addr, page->objects)
+	map = get_map(s, slab);
+	for_each_object(p, s, addr, slab->objects)
 		if (!test_bit(__obj_to_index(s, addr, p), map))
 			add_location(t, s, get_track(s, p, alloc));
 	put_map(map);
@@ -4924,32 +4928,32 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 			int node;
-			struct page *page;
+			struct slab *slab;
 
-			page = READ_ONCE(c->page);
-			if (!page)
+			slab = READ_ONCE(c->slab);
+			if (!slab)
 				continue;
 
-			node = page_to_nid(page);
+			node = slab_nid(slab);
 			if (flags & SO_TOTAL)
-				x = page->objects;
+				x = slab->objects;
 			else if (flags & SO_OBJECTS)
-				x = page->inuse;
+				x = slab->inuse;
 			else
 				x = 1;
 
 			total += x;
 			nodes[node] += x;
 
-			page = slub_percpu_partial_read_once(c);
-			if (page) {
-				node = page_to_nid(page);
+			slab = slub_percpu_partial_read_once(c);
+			if (slab) {
+				node = slab_nid(slab);
 				if (flags & SO_TOTAL)
 					WARN_ON_ONCE(1);
 				else if (flags & SO_OBJECTS)
 					WARN_ON_ONCE(1);
 				else
-					x = page->pages;
+					x = slab->slabs;
 				total += x;
 				nodes[node] += x;
 			}
@@ -5146,31 +5150,31 @@ SLAB_ATTR_RO(objects_partial);
 static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 {
 	int objects = 0;
-	int pages = 0;
+	int slabs = 0;
 	int cpu;
 	int len = 0;
 
 	for_each_online_cpu(cpu) {
-		struct page *page;
+		struct slab *slab;
 
-		page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
+		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
 
-		if (page) {
-			pages += page->pages;
-			objects += page->pobjects;
+		if (slab) {
+			slabs += slab->slabs;
+			objects += slab->pobjects;
 		}
 	}
 
-	len += sysfs_emit_at(buf, len, "%d(%d)", objects, pages);
+	len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);
 
 #ifdef CONFIG_SMP
 	for_each_online_cpu(cpu) {
-		struct page *page;
+		struct slab *slab;
 
-		page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
-		if (page)
+		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
+		if (slab)
 			len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
-					     cpu, page->pobjects, page->pages);
+					     cpu, slab->pobjects, slab->slabs);
 	}
 #endif
 	len += sysfs_emit_at(buf, len, "\n");
@@ -5825,16 +5829,16 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep)
 		for_each_kmem_cache_node(s, node, n) {
 			unsigned long flags;
-			struct page *page;
+			struct slab *slab;
 
 			if (!atomic_long_read(&n->nr_slabs))
 				continue;
 
 			spin_lock_irqsave(&n->list_lock, flags);
-			list_for_each_entry(page, &n->partial, slab_list)
-				process_slab(t, s, page, alloc);
-			list_for_each_entry(page, &n->full, slab_list)
-				process_slab(t, s, page, alloc);
+			list_for_each_entry(slab, &n->partial, slab_list)
+				process_slab(t, s, slab, alloc);
+			list_for_each_entry(slab, &n->full, slab_list)
+				process_slab(t, s, slab, alloc);
 			spin_unlock_irqrestore(&n->list_lock, flags);
 		}
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 6326cdf36c4f..2b1099c986c6 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -750,7 +750,7 @@ static void free_map_bootmem(struct page *memmap)
 		>> PAGE_SHIFT;
 
 	for (i = 0; i < nr_pages; i++, page++) {
-		magic = (unsigned long) page->freelist;
+		magic = page->index;
 
 		BUG_ON(magic == NODE_INFO);
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 68e8831068f4..0661dc09e11b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -17,7 +17,7 @@
  *
  * Usage of struct page fields:
  *	page->private: points to zspage
- *	page->freelist(index): links together all component pages of a zspage
+ *	page->index: links together all component pages of a zspage
  *		For the huge page, this is always 0, so we use this field
  *		to store handle.
  *	page->units: first object offset in a subpage of zspage
@@ -827,7 +827,7 @@ static struct page *get_next_page(struct page *page)
 	if (unlikely(PageHugeObject(page)))
 		return NULL;
 
-	return page->freelist;
+	return (struct page *)page->index;
 }
 
 /**
@@ -901,7 +901,7 @@ static void reset_page(struct page *page)
 	set_page_private(page, 0);
 	page_mapcount_reset(page);
 	ClearPageHugeObject(page);
-	page->freelist = NULL;
+	page->index = 0;
 }
 
 static int trylock_zspage(struct zspage *zspage)
@@ -1027,7 +1027,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
 
 	/*
 	 * Allocate individual pages and link them together as:
-	 * 1. all pages are linked together using page->freelist
+	 * 1. all pages are linked together using page->index
	 * 2. each sub-page point to zspage using page->private
 	 *
 	 * we set PG_private to identify the first page (i.e. no other sub-page
@@ -1036,7 +1036,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
 	for (i = 0; i < nr_pages; i++) {
 		page = pages[i];
 		set_page_private(page, (unsigned long)zspage);
-		page->freelist = NULL;
+		page->index = 0;
 		if (i == 0) {
 			zspage->first_page = page;
 			SetPagePrivate(page);
@@ -1044,7 +1044,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
 				class->pages_per_zspage == 1))
 			SetPageHugeObject(page);
 		} else {
-			prev_page->freelist = page;
+			prev_page->index = (unsigned long)page;
 		}
 		prev_page = page;
 	}
On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
> - it's become apparent that there haven't been any real objections to the code
> that was queued up for 5.15. There _are_ very real discussions and points of
> contention still to be decided and resolved for the work beyond file backed
> pages, but those discussions were what derailed the more modest, and more
> badly needed, work that affects everyone in filesystem land

Unfortunately, I think this is a result of me wanting to discuss a way
forward rather than a way back.

To clarify: I do very much object to the code as currently queued up,
and not just to a vague future direction.

The patches add and convert a lot of complicated code to provision for
a future we do not agree on. The indirections it adds, and the hybrid
state it leaves the tree in, make it directly more difficult to work
with and understand the MM code base. Stuff that isn't needed for
exposing folios to the filesystems.

As Willy has repeatedly expressed a take-it-or-leave-it attitude in
response to my feedback, I'm not excited about merging this now and
potentially leaving quite a bit of cleanup work to others if the
downstream discussions don't go to his liking.

Here is the roughly annotated pull request:

mm: Convert get_page_unless_zero() to return bool
mm: Introduce struct folio
mm: Add folio_pgdat(), folio_zone() and folio_zonenum()
mm/vmstat: Add functions to account folio statistics

	Used internally and not *really* needed for filesystem
	folios... There are a couple of callsites in
	mm/page-writeback.c so I suppose it's ok.

mm/debug: Add VM_BUG_ON_FOLIO() and VM_WARN_ON_ONCE_FOLIO()
mm: Add folio reference count functions
mm: Add folio_put()
mm: Add folio_get()
mm: Add folio_try_get_rcu()
mm: Add folio flag manipulation functions
mm/lru: Add folio LRU functions

	The LRU code is used by anon and file and not needed for the
	filesystem API. And as discussed, there is generally no
	ambiguity of tail pages on the LRU list.
mm: Handle per-folio private data
mm/filemap: Add folio_index(), folio_file_page() and folio_contains()
mm/filemap: Add folio_next_index()
mm/filemap: Add folio_pos() and folio_file_pos()
mm/util: Add folio_mapping() and folio_file_mapping()
mm/filemap: Add folio_unlock()
mm/filemap: Add folio_lock()
mm/filemap: Add folio_lock_killable()
mm/filemap: Add __folio_lock_async()
mm/filemap: Add folio_wait_locked()
mm/filemap: Add __folio_lock_or_retry()
mm/swap: Add folio_rotate_reclaimable()

	More LRU code, although this one is only used by
	page-writeback... I suppose.

mm/filemap: Add folio_end_writeback()
mm/writeback: Add folio_wait_writeback()
mm/writeback: Add folio_wait_stable()
mm/filemap: Add folio_wait_bit()
mm/filemap: Add folio_wake_bit()
mm/filemap: Convert page wait queues to be folios
mm/filemap: Add folio private_2 functions
fs/netfs: Add folio fscache functions
mm: Add folio_mapped()
mm: Add folio_nid()
mm/memcg: Remove 'page' parameter to mem_cgroup_charge_statistics()
mm/memcg: Use the node id in mem_cgroup_update_tree()
mm/memcg: Remove soft_limit_tree_node()
mm/memcg: Convert memcg_check_events to take a node ID

	These are nice cleanups, unrelated to folios. Ack.

mm/memcg: Add folio_memcg() and related functions
mm/memcg: Convert commit_charge() to take a folio
mm/memcg: Convert mem_cgroup_charge() to take a folio
mm/memcg: Convert uncharge_page() to uncharge_folio()
mm/memcg: Convert mem_cgroup_uncharge() to take a folio
mm/memcg: Convert mem_cgroup_migrate() to take folios
mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio
mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock()
mm/memcg: Convert mem_cgroup_move_account() to use a folio
mm/memcg: Add folio_lruvec()
mm/memcg: Add folio_lruvec_lock() and similar functions
mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave()
mm/workingset: Convert workingset_activation to take a folio

	This is all anon+file stuff, not needed for filesystem folios.
	As per the other email, no conceptual entry point for tail
	pages into either subsystem, so no ambiguity around the
	necessity of any compound_head() calls, directly or
	indirectly. It's easy to rule out wholesale, so there is no
	justification for incrementally annotating every single use
	of the page. NAK.

mm: Add folio_pfn()
mm: Add folio_raw_mapping()
mm: Add flush_dcache_folio()
mm: Add kmap_local_folio()
mm: Add arch_make_folio_accessible()
mm: Add folio_young and folio_idle
mm/swap: Add folio_activate()
mm/swap: Add folio_mark_accessed()

	This is anon+file aging stuff, not needed.

mm/rmap: Add folio_mkclean()
mm/migrate: Add folio_migrate_mapping()
mm/migrate: Add folio_migrate_flags()
mm/migrate: Add folio_migrate_copy()

	More anon+file conversion, not needed.

mm/writeback: Rename __add_wb_stat() to wb_stat_mod()
flex_proportions: Allow N events instead of 1
mm/writeback: Change __wb_writeout_inc() to __wb_writeout_add()
mm/writeback: Add __folio_end_writeback()
mm/writeback: Add folio_start_writeback()
mm/writeback: Add folio_mark_dirty()
mm/writeback: Add __folio_mark_dirty()
mm/writeback: Convert tracing writeback_page_template to folios
mm/writeback: Add filemap_dirty_folio()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_redirty_for_writepage()
mm/filemap: Add i_blocks_per_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add readahead_folio()
mm/workingset: Convert workingset_refault() to take a folio

	Anon+file, not needed. NAK.

mm: Add folio_evictable()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm/lru: Add folio_add_lru()

	LRU code, not needed.
mm/page_alloc: Add folio allocation functions
mm/filemap: Add filemap_alloc_folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_get_folio
mm/filemap: Add FGP_STABLE
mm/writeback: Add folio_write_one

I'm counting about a thousand lines of contentious code that clearly
aren't necessary for exposing folios to the filesystems. The rest of
these are pagecache and writeback. It's still a ton of (internal) code
converted to folios that has conceptually little to no ambiguity about
head and tail pages.

As per the other email I still think it would have been good to have a
high-level discussion about the *legitimate* entry points and data
structures that will continue to deal with tail pages down the
line. To scope the actual problem that is being addressed by this
inverted/whitelist approach - so we don't annotate the entire world
just to box in a handful of page table walkers... But oh well. Not a
hill I care to die on at this point...
On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote:
> On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
> > - it's become apparent that there haven't been any real objections to the code
> > that was queued up for 5.15. There _are_ very real discussions and points of
> > contention still to be decided and resolved for the work beyond file backed
> > pages, but those discussions were what derailed the more modest, and more
> > badly needed, work that affects everyone in filesystem land
>
> Unfortunately, I think this is a result of me wanting to discuss a way
> forward rather than a way back.
>
> To clarify: I do very much object to the code as currently queued up,
> and not just to a vague future direction.
>
> The patches add and convert a lot of complicated code to provision for
> a future we do not agree on. The indirections it adds, and the hybrid
> state it leaves the tree in, make it directly more difficult to work
> with and understand the MM code base. Stuff that isn't needed for
> exposing folios to the filesystems.
>
> As Willy has repeatedly expressed a take-it-or-leave-it attitude in
> response to my feedback, I'm not excited about merging this now and
> potentially leaving quite a bit of cleanup work to others if the
> downstream discussion don't go to his liking.
>
> Here is the roughly annotated pull request:

Thanks for breaking this out, Johannes.

So: mm/filemap.c and mm/page-writeback.c - I disagree about folios not
really being needed there. Those files really belong more in fs/ than
mm/, and the code in those files needs folios the most - especially
filemap.c, a lot of those algorithms have to change from block based
to extent based, making the analogy with filesystems.
I think it makes sense to drop the mm/lru stuff, as well as the
mm/memcg, mm/migrate and mm/workingset and mm/swap stuff that you
object to - that is, the code paths that are for both file + anonymous
pages, unless Matthew has technical reasons why that would break the
rest of the patch set.

And then, we really should have a pow wow and figure out what our
options are going forward. I think we have some agreement now that not
everything is going to be a folio going forwards (Matthew already
split out his slab conversion to a new type) - so if anonymous pages
aren't becoming folios, we should prototype some stuff and see where
that helps and hurts us.

> As per the other email I still think it would have been good to have a
> high-level discussion about the *legitimate* entry points and data
> structures that will continue to deal with tail pages down the
> line. To scope the actual problem that is being addressed by this
> inverted/whitelist approach - so we don't annotate the entire world
> just to box in a handful of page table walkers...

That discussion can still happen... and there's still the potential to
get a lot more done if we're breaking open struct page and coming up
with new types. I got Matthew on board with what you wanted, re: using
the slab allocator for larger allocations.
On Wed, Sep 22, 2021 at 11:46:04AM -0400, Kent Overstreet wrote:
> On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote:
> > On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
> > > - it's become apparent that there haven't been any real objections to the code
> > > that was queued up for 5.15. There _are_ very real discussions and points of
> > > contention still to be decided and resolved for the work beyond file backed
> > > pages, but those discussions were what derailed the more modest, and more
> > > badly needed, work that affects everyone in filesystem land
> >
> > Unfortunately, I think this is a result of me wanting to discuss a way
> > forward rather than a way back.
> >
> > To clarify: I do very much object to the code as currently queued up,
> > and not just to a vague future direction.
> >
> > The patches add and convert a lot of complicated code to provision for
> > a future we do not agree on. The indirections it adds, and the hybrid
> > state it leaves the tree in, make it directly more difficult to work
> > with and understand the MM code base. Stuff that isn't needed for
> > exposing folios to the filesystems.
> >
> > As Willy has repeatedly expressed a take-it-or-leave-it attitude in
> > response to my feedback, I'm not excited about merging this now and
> > potentially leaving quite a bit of cleanup work to others if the
> > downstream discussion don't go to his liking.

We're at a take-it-or-leave-it point for this pull request. The time
for discussion was *MONTHS* ago.

> > Here is the roughly annotated pull request:
>
> Thanks for breaking this out, Johannes.
>
> So: mm/filemap.c and mm/page-writeback.c - I disagree about folios not really
> being needed there. Those files really belong more in fs/ than mm/, and the code
> in those files needs folios the most - especially filemap.c, a lot of those
> algorithms have to change from block based to extent based, making the analogy
> with filesystems.
>
> I think it makes sense to drop the mm/lru stuff, as well as the mm/memcg,
> mm/migrate and mm/workingset and mm/swap stuff that you object to - that is, the
> code paths that are for both file + anonymous pages, unless Matthew has
> technical reasons why that would break the rest of the patch set.

Conceptually, it breaks the patch set. Anywhere that we convert back
from a folio to a page, the guarantee of folios is weakened (and
possibly violated). I don't think it makes sense from a practical
point of view either; it's re-adding compound_head() calls that just
don't need to be there.

> That discussion can still happen... and there's still the potential to get a lot
> more done if we're breaking open struct page and coming up with new types. I got
> Matthew on board with what you wanted, re: using the slab allocator for larger
> allocations

Wait, no, you didn't. I think it's a terrible idea. It's just
completely orthogonal to this patch set, so I don't want to talk about
it.
> On Sep 22, 2021, at 12:26 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Sep 22, 2021 at 11:46:04AM -0400, Kent Overstreet wrote:
>> On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote:
>>> On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
>>>> - it's become apparent that there haven't been any real objections to the code
>>>> that was queued up for 5.15. There _are_ very real discussions and points of
>>>> contention still to be decided and resolved for the work beyond file backed
>>>> pages, but those discussions were what derailed the more modest, and more
>>>> badly needed, work that affects everyone in filesystem land
>>>
>>> Unfortunately, I think this is a result of me wanting to discuss a way
>>> forward rather than a way back.
>>>
>>> To clarify: I do very much object to the code as currently queued up,
>>> and not just to a vague future direction.
>>>
>>> The patches add and convert a lot of complicated code to provision for
>>> a future we do not agree on. The indirections it adds, and the hybrid
>>> state it leaves the tree in, make it directly more difficult to work
>>> with and understand the MM code base. Stuff that isn't needed for
>>> exposing folios to the filesystems.
>>>
>>> As Willy has repeatedly expressed a take-it-or-leave-it attitude in
>>> response to my feedback, I'm not excited about merging this now and
>>> potentially leaving quite a bit of cleanup work to others if the
>>> downstream discussion don't go to his liking.
>
> We're at a take-it-or-leave-it point for this pull request. The time
> for discussion was *MONTHS* ago.
>

I’ll admit I’m not impartial, but my fundamental goal is moving the
patches forward.
Given folios will need long term maintenance, engagement, and
iteration throughout mm/, take-it-or-leave-it pulls seem like a recipe
for future conflict, and more importantly, bugs.

I’d much rather work it out now.

-chris

On Wed, Sep 22, 2021 at 04:56:16PM +0000, Chris Mason wrote:
> > On Sep 22, 2021, at 12:26 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > We're at a take-it-or-leave-it point for this pull request. The time
> > for discussion was *MONTHS* ago.
>
> I’ll admit I’m not impartial, but my fundamental goal is moving the
> patches forward. Given folios will need long term maintenance,
> engagement, and iteration throughout mm/, take-it-or-leave-it pulls
> seem like a recipe for future conflict, and more importantly, bugs.
>
> I’d much rather work it out now.

That's the nature of a pull request. It's binary -- either it's pulled
or it's rejected. Well, except that Linus has opted for silence,
leaving me in limbo. I have no idea what he's thinking. I don't know
if he agrees with Johannes. I don't know what needs to change for
Linus to like this series enough to pull it (either now or in the 5.16
merge window). And that makes me frustrated. This is over a year of
work from me and others, and it's being held up over concerns which
seem to me to be entirely insubstantial (the name "folio"? really?
and even my change to use "pageset" was met with silence from Linus.)

I agree with Kent & Johannes that struct page is a mess. I agree that
cleaning it up will bring many benefits. I've even started a design
document here: https://kernelnewbies.org/MemoryTypes

I do see some advantages to splitting out anon memory descriptors from
file memory descriptors, but there is also plenty of code which
handles both types in the same way. I see the requests to continue to
use struct page to mean a "memory descriptor which is either anon or
file", but I really think that's the wrong approach. A struct page
should represent /a page/ of memory. Otherwise we're just confusing
people. I know it's a confusion we've had since compound pages were
introduced, what, 25+ years ago, but that expediency has overstayed
its welcome.

The continued silence from Linus is really driving me to despair.

I'm sorry I've been so curt with some of the requests. I really am
willing to change things; I wasn't planning on doing anything with
slab until Kent prodded me to do it.
But equally, I strongly believe that everything I've done here is a step towards the things that everybody wants, and I'm frustrated that it's being perceived as a step away, or even to the side of what people want. So ... if any of you have Linus' ear. Maybe you're at a conference with him later this week. Please, just get him to tell me what I need to do to make him happy with this patchset.
On Wed, Sep 22, 2021 at 08:54:11PM +0100, Matthew Wilcox wrote:
> That's the nature of a pull request. It's binary -- either it's pulled or
> it's rejected. Well, except that Linus has opted for silence, leaving
> me in limbo. I have no idea what he's thinking. I don't know if he
> agrees with Johannes. I don't know what needs to change for Linus to
> like this series enough to pull it (either now or in the 5.16 merge
> window). And that makes me frustrated. This is over a year of work
> from me and others, and it's being held up over concerns which seem to
> me to be entirely insubstantial (the name "folio"? really? and even
> my change to use "pageset" was met with silence from Linus.)

People bikeshed the naming when they're uncomfortable with what's
being proposed and have nothing substantive to say, and people are
uncomfortable with what's being proposed when there's clear
disagreement between major stakeholders who aren't working with each
other.

And the utterly ridiculous part of this whole fiasco is that you and
Johannes have a LOT of common ground regarding the larger picture of
what we do with the struct page mess, but you two keep digging in your
heels purely because you're convinced that you can't work with each
other, so you need to either route around each other or be as forceful
as possible to get what you want. You're convinced you're not
listening to each other, but even that isn't true, because when I pass
ideas back and forth between you and they come from "not Matthew" or
"not Johannes" you both listen and incorporate them just fine.

We can't have a process where major stakeholders are trying to
actively sabotage each other's efforts, which is pretty close to where
we're at now. You two just need to learn to work with each other.
On Wed, Sep 22, 2021 at 12:56 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> The continued silence from Linus is really driving me to despair.

No need to despair. The silence isn't some "deep" thing. What happened
is literally that I wasn't 100% happy with the naming, but didn't hate
the patches, and still don't. But when there is still active
discussion about them during the merge window, I'm just not going to
merge them.

The silence literally is just due to that - not participating in the
discussion for the simple reason that I had no hugely strong opinions
on my side - but also simply because there is no way I'd merge this
for 5.15 simply exactly _because_ of this discussion.

Normally I get to clean up my inbox the week after the merge window,
but the -Werror things kept my attention for one extra week, and so my
mailbox has been a disaster area as a result. So only today does my
inbox start to look reasonable again after the merge window (not
because of the extra email during the merge window, but simply because
the merge window causes me to ignore non-pull emails, and then I need
to go back and check the other stuff afterwards).

So I'm not particularly unhappy with the patchset. I understand where
it is coming from, I have no huge technical disagreement with it
personally. That said, I'm not hugely _enthused_ about the mm side of
it either, which is why I also wouldn't just override the discussion
and say "that's it, I'm merging it". I basically wanted to see if it
led somewhere. I'm not convinced it led anywhere, but that didn't
really change things for me, except for the "yeah, I'm not merging
something core like this while it's under active discussion" part.

              Linus
On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote:

...

> +/**
> + * page_slab - Converts from page to slab.
> + * @p: The page.
> + *
> + * This function cannot be called on a NULL pointer. It can be called
> + * on a non-slab page; the caller should check is_slab() to be sure
> + * that the slab really is a slab.
> + *
> + * Return: The slab which contains this page.
> + */
> +#define page_slab(p) (_Generic((p), \
> +	const struct page *: (const struct slab *)_compound_head(p), \
> +	struct page *: (struct slab *)_compound_head(p)))
> +
> +static inline bool is_slab(struct slab *slab)
> +{
> +	return test_bit(PG_slab, &slab->flags);
> +}
> +

I'm sorry, I don't have a dog in this fight and conceptually I think
folios are a good idea...

But for this work, having a call which returns if a 'struct slab'
really is a 'struct slab' seems odd and, well, IMHO, wrong. Why can't
page_slab() return NULL if there is no slab containing that page?

Ira
On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote: > On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote: > > +/** > > + * page_slab - Converts from page to slab. > > + * @p: The page. > > + * > > + * This function cannot be called on a NULL pointer. It can be called > > + * on a non-slab page; the caller should check is_slab() to be sure > > + * that the slab really is a slab. > > + * > > + * Return: The slab which contains this page. > > + */ > > +#define page_slab(p) (_Generic((p), \ > > + const struct page *: (const struct slab *)_compound_head(p), \ > > + struct page *: (struct slab *)_compound_head(p))) > > + > > +static inline bool is_slab(struct slab *slab) > > +{ > > + return test_bit(PG_slab, &slab->flags); > > +} > > + > > I'm sorry, I don't have a dog in this fight and conceptually I think folios are > a good idea... > > But for this work, having a call which returns if a 'struct slab' really is a > 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return > NULL if there is no slab containing that page? No, this is a good question. The way slub works right now is that if you ask for a "large" allocation, it does: flags |= __GFP_COMP; page = alloc_pages_node(node, flags, order); and returns page_address(page) (eventually; the code is more complex) So when you call kfree(), it uses the PageSlab flag to determine if the allocation was "large" or not: page = virt_to_head_page(x); if (unlikely(!PageSlab(page))) { free_nonslab_page(page, object); return; } slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); Now, you could say that this is a bad way to handle things, and every allocation from slab should have PageSlab set, and it should use one of the many other bits in page->flags to indicate whether it's a large allocation or not. I may have feelings in that direction myself. But I don't think I should be changing that in this patch. Maybe calling this function is_slab() is the confusing thing. 
Perhaps it should be called SlabIsLargeAllocation(). Not sure.
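The kfree() dispatch Matthew walks through above can be condensed into a tiny
standalone model. This is an illustrative sketch only, not kernel code: the
struct and flag below are invented stand-ins, and the real kfree() does
considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Toy model of the kfree() dispatch described above.  PG_slab on the
 * head page is what distinguishes a slab-managed object from a "large"
 * allocation handed straight to the page allocator.  All names here
 * are illustrative stand-ins, not the kernel's real definitions.
 */
#define PG_slab	(1UL << 0)

struct toy_slab {
	unsigned long flags;		/* stands in for page->flags */
};

static bool toy_is_slab(const struct toy_slab *slab)
{
	return slab->flags & PG_slab;
}

/* Which free path would a toy kfree() take for this allocation? */
static const char *toy_kfree_path(const struct toy_slab *slab)
{
	if (!toy_is_slab(slab))
		return "free_nonslab_page";	/* page-allocator-backed */
	return "slab_free";			/* slab-managed object */
}
```

This also makes Ira's objection concrete: the flag test answers "was this a
slab allocation at all?", not "is this struct really a slab?".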
On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote:
> On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
> > - it's become apparent that there haven't been any real objections to the code
> > that was queued up for 5.15. There _are_ very real discussions and points of
> > contention still to be decided and resolved for the work beyond file backed
> > pages, but those discussions were what derailed the more modest, and more
> > badly needed, work that affects everyone in filesystem land
>
> Unfortunately, I think this is a result of me wanting to discuss a way
> forward rather than a way back.
>
> To clarify: I do very much object to the code as currently queued up,
> and not just to a vague future direction.
>
> The patches add and convert a lot of complicated code to provision for
> a future we do not agree on. The indirections it adds, and the hybrid
> state it leaves the tree in, make it directly more difficult to work
> with and understand the MM code base. Stuff that isn't needed for
> exposing folios to the filesystems.

I think something we need is an alternate view - anon_folio, perhaps - and an
idea of what that would look like. Because you've been saying you don't think
file pages and anonymous pages are similar enough to be the same type - so if
they're not, how's the code that works on both types of pages going to change
to accommodate that?

Do we have if (file_folio) else if (anon_folio) both doing the same thing, but
operating on different types? Some sort of subclassing going on?

I was agreeing with you that slab/network pools etc. shouldn't be folios - that
folios shouldn't be a replacement for compound pages. But I think we're going
to need a serious alternative proposal for anonymous pages if you're still
against them becoming folios, especially because according to Kirill they're
already working on that (and you have to admit transhuge pages did introduce a
mess that they will help with...)
On Thu, Sep 23, 2021 at 01:42:17AM -0400, Kent Overstreet wrote: > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote: > > > - it's become apparent that there haven't been any real objections to the code > > > that was queued up for 5.15. There _are_ very real discussions and points of > > > contention still to be decided and resolved for the work beyond file backed > > > pages, but those discussions were what derailed the more modest, and more > > > badly needed, work that affects everyone in filesystem land > > > > Unfortunately, I think this is a result of me wanting to discuss a way > > forward rather than a way back. > > > > To clarify: I do very much object to the code as currently queued up, > > and not just to a vague future direction. > > > > The patches add and convert a lot of complicated code to provision for > > a future we do not agree on. The indirections it adds, and the hybrid > > state it leaves the tree in, make it directly more difficult to work > > with and understand the MM code base. Stuff that isn't needed for > > exposing folios to the filesystems. > > I think something we need is an alternate view - anon_folio, perhaps - and an > idea of what that would look like. Because you've been saying you don't think > file pages and anymous pages are similar enough to be the same time - so if > they're not, how's the code that works on both types of pages going to change to > accomadate that? > > Do we have if (file_folio) else if (anon_folio) both doing the same thing, but > operating on different types? Some sort of subclassing going on? Yeah, with subclassing and a generic type for shared code. 
I outlined that earlier in the thread:

https://lore.kernel.org/all/YUo20TzAlqz8Tceg@cmpxchg.org/

So you have anon_page and file_page being subclasses of page - similar
to how filesystems have subclasses that inherit from struct inode - to
help refactor what is generic, what isn't, and highlight what should be.

Whether we do anon_page and file_page inheriting from struct page, or
anon_folio and file_folio inheriting from struct folio - either would
work of course. Again I think it comes down to the value proposition
of folio as a means to clean up compound pages inside the MM code.

It's pretty uncontroversial that we want PAGE_SIZE assumptions gone
from the filesystems, networking, drivers and other random code. The
argument for MM code is a different one. We seem to be discussing the
folio abstraction as a binary thing for the Linux kernel, rather than
a selectively applied tool, and I think it prevents us from doing
proper one-by-one cost/benefit analyses on the areas of application.

I suggested the anon/file split as an RFC to sidestep the cost/benefit
question of doing the massive folio change in MM just to clean up the
compound pages; taking the idea of redoing the page typing, just in a
way that would maybe benefit MM code more broadly and obviously.

> I was agreeing with you that slab/network pools etc. shouldn't be folios - that
> folios shouldn't be a replacement for compound pages. But I think we're going to
> need a serious alternative proposal for anonymous pages if you're still against
> them becoming folios, especially because according to Kirill they're already
> working on that (and you have to admit transhuge pages did introduce a mess that
> they will help with...)

I think we need a better analysis of that mess and a concept of where
tailpages are and should be, if that is the justification for the MM
conversion.

The motivation is that we have a ton of compound_head() calls in
places we don't need them. No argument there, I think.
But the explanation for going with whitelisting - the most invasive
approach possible (and which leaves more than one person "unenthused"
about that part of the patches) - is that it's difficult and
error-prone to identify which ones are necessary and which ones are
not. And maybe that we'll continue to have a widespread hybrid
existence of head and tail pages that will continue to require
clarification.

But that seems to be an article of faith. It's implied by the
approach, but this may or may not be the case.

I certainly think it used to be messier in the past. But strides have
been made already to narrow the channels through which tail pages can
actually enter the code. Certainly we can rule out entire MM
subsystems and simply declare their compound_head() usage unnecessary
with little risk or ambiguity.

Then the question becomes which ones are legit. Whether anybody
outside the page allocator ever needs to *see* a tailpage struct page
to begin with. (Arguably that bit in __split_huge_page_tail() could be
a page allocator function; the pte handling is pfn-based except for
the mapcount management, which could be encapsulated; the collapse code
uses vm_normal_page() but follows it quickly by compound_head() - and
arguably a tailpage generally isn't a "normal" vm page, so a new
pfn_to_normal_page() could encapsulate the compound_head()).

Because if not, seeing struct page in MM code isn't nearly as
ambiguous as is being implied. You would never have to worry about
it - unless you are in fact the page allocator.

So if this problem could be solved by making tail pages an
encapsulated page_alloc thing, and chasing down the rest of
find_subpage() callers (which needs to happen anyway), I don't think a
wholesale folio conversion of this subsystem would be justified.

A more in-depth analysis of where and how we need to deal with
tailpages - laying out the data structures that hold them and code
entry points for them - would go a long way for making the case for
folios.
And might convince reluctant people to get behind the effort. Or show that we don't need it. Either way, it seems like a win-win. But I do think the onus for explaining why the particular approach was chosen against much less invasive options is on the person pushing the changes. And it should be more detailed than "we all know it sucks".
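The inode-style subclassing Johannes proposes can be sketched in a few lines
of C. This is a hypothetical illustration of the pattern only, not proposed
kernel code; all of the field names below are invented.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the proposed inheritance pattern: as with struct inode and
 * its filesystem subclasses, the generic struct is embedded as the
 * first member so that pointer conversions work in both directions.
 * Every field here is invented for illustration.
 */
struct toy_page {
	unsigned long flags;		/* generic state shared by all */
};

struct toy_file_page {
	struct toy_page page;		/* must be the first member */
	void *mapping;			/* file-cache-specific state */
};

struct toy_anon_page {
	struct toy_page page;		/* must be the first member */
	void *anon_vma;			/* anon-specific state */
};

/* Generic code operates on the shared base type... */
static unsigned long toy_page_flags(const struct toy_page *page)
{
	return page->flags;
}

/* ...and typed code converts explicitly at the boundary. */
static struct toy_file_page *toy_page_to_file(struct toy_page *page)
{
	return (struct toy_file_page *)page;	/* caller checked the type */
}
```

The cast is only valid because the base struct sits at offset zero - the same
layout trick container_of()-style code in the kernel relies on.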
On Thu, Sep 23, 2021 at 02:00:46PM -0400, Johannes Weiner wrote: > On Thu, Sep 23, 2021 at 01:42:17AM -0400, Kent Overstreet wrote: > > I think something we need is an alternate view - anon_folio, perhaps - and an > > idea of what that would look like. Because you've been saying you don't think > > file pages and anymous pages are similar enough to be the same time - so if > > they're not, how's the code that works on both types of pages going to change to > > accomadate that? > > > > Do we have if (file_folio) else if (anon_folio) both doing the same thing, but > > operating on different types? Some sort of subclassing going on? > > Yeah, with subclassing and a generic type for shared code. I outlined > that earlier in the thread: > > https://lore.kernel.org/all/YUo20TzAlqz8Tceg@cmpxchg.org/ > > So you have anon_page and file_page being subclasses of page - similar > to how filesystems have subclasses that inherit from struct inode - to > help refactor what is generic, what isn't, and highlight what should be. I'm with you there. I don't understand anon pages well enough to know whether splitting them out from file pages is good or bad. I had assumed that if it were worth doing, they would have gained their own named members in the page union, but perhaps that didn't happen in order to keep the complexity of the union down? > Whether we do anon_page and file_page inheriting from struct page, or > anon_folio and file_folio inheriting from struct folio - either would > work of course. Again I think it comes down to the value proposition > of folio as a means to clean up compound pages inside the MM code. > It's pretty uncontroversial that we want PAGE_SIZE assumptions gone > from the filesystems, networking, drivers and other random code. The > argument for MM code is a different one. 
> We seem to be discussing the
> folio abstraction as a binary thing for the Linux kernel, rather than
> a selectively applied tool, and I think it prevents us from doing
> proper one-by-one cost/benefit analyses on the areas of application.

I wasn't originally planning on doing nearly as much as Kent has
opened me up to. Slab seems like a clear win to split out. Page
tables seem like they will be too. I'd like to get to these structs:

struct page {
	unsigned long flags;
	unsigned long compound_head;
	union {
		struct {	/* First tail page only */
			unsigned char compound_dtor;
			unsigned char compound_order;
			atomic_t compound_mapcount;
			unsigned int compound_nr;
		};
		struct {	/* Second tail page only */
			atomic_t hpage_pinned_refcount;
			struct list_head deferred_list;
		};
		unsigned long padding1[5];
	};
	unsigned int padding2[2];
#ifdef CONFIG_MEMCG
	unsigned long padding3;
#endif
#ifdef WANT_PAGE_VIRTUAL
	void *virtual;
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
};

struct slab {
	... slab specific stuff here ...
};

struct page_table {
	... pgtable stuff here ...
};

struct folio {
	unsigned long flags;
	union {
		struct {
			struct list_head lru;
			struct address_space *mapping;
			pgoff_t index;
			void *private;
		};
		struct {
			... net pool here ...
		};
		struct {
			... zone device here ...
		};
	};
	atomic_t _mapcount;
	atomic_t _refcount;
#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif
};

ie a 'struct page' contains no information on its own. You have to go
to the compound_head page (cast to the appropriate type) to find the
information.

What Kent is proposing is exciting, but I think further off.

> > I was agreeing with you that slab/network pools etc. shouldn't be folios - that
> > folios shouldn't be a replacement for compound pages.
But I think we're going to > > need a serious alternative proposal for anonymous pages if you're still against > > them becoming folios, especially because according to Kirill they're already > > working on that (and you have to admit transhuge pages did introduce a mess that > > they will help with...) > > I think we need a better analysis of that mess and a concept where > tailpages are and should be, if that is the justification for the MM > conversion. > > The motivation is that we have a ton of compound_head() calls in > places we don't need them. No argument there, I think. > > But the explanation for going with whitelisting - the most invasive > approach possible (and which leaves more than one person "unenthused" > about that part of the patches) - is that it's difficult and error > prone to identify which ones are necessary and which ones are not. And > maybe that we'll continue to have a widespread hybrid existence of > head and tail pages that will continue to require clarification. > > But that seems to be an article of faith. It's implied by the > approach, but this may or may not be the case. > > I certainly think it used to be messier in the past. But strides have > been made already to narrow the channels through which tail pages can > actually enter the code. Certainly we can rule out entire MM > subsystems and simply declare their compound_head() usage unnecessary > with little risk or ambiguity. > > Then the question becomes which ones are legit. Whether anybody > outside the page allocator ever needs to *see* a tailpage struct page > to begin with. (Arguably that bit in __split_huge_page_tail() could be > a page allocator function; the pte handling is pfn-based except for > the mapcount management which could be encapsulated; the collapse code > uses vm_normal_page() but follows it quickly by compound_head() - and > arguably a tailpage generally isn't a "normal" vm page, so a new > pfn_to_normal_page() could encapsulate the compound_head()). 
> Because
> if not, seeing struct page in MM code isn't nearly as ambiguous as is
> being implied. You would never have to worry about it - unless you are
> in fact the page allocator.
>
> So if this problem could be solved by making tail pages an
> encapsulated page_alloc thing, and chasing down the rest of
> find_subpage() callers (which needs to happen anyway), I don't think a
> wholesale folio conversion of this subsystem would be justified.
>
> A more in-depth analyses of where and how we need to deal with
> tailpages - laying out the data structures that hold them and code
> entry points for them - would go a long way for making the case for
> folios. And might convince reluctant people to get behind the effort.

OK. So filesystems still need to deal with pages in some places. One
place is at the bottom of the filesystem, where memory gets packaged
into BIOs or SKBs to eventually participate in DMA:

struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

That could become a folio (or Christoph's preferred option, a
phys_addr_t), but this is really an entirely different role for struct
page; it's just carrying the address of some memory for I/O to happen
to. Nobody looks at the contents of the struct page until it goes back
to the filesystem, at which point it clears the writeback bit or marks
it uptodate.

The other place that definitely still needs to be a struct page is

struct vm_fault {
...
	struct page *page;	/* ->fault handlers should return a
				 * page here, unless VM_FAULT_NOPAGE
				 * is set (which is also implied by
				 * VM_FAULT_ERROR).
				 */
...
};

Most filesystems use filemap_fault(), which handles this, but this
affects device drivers too. We can't return a folio here because we
need to know which page corresponds to the address that took the fault.
We can deduce it for filesystems, because we know how folios are allocated for the page cache, but device drivers can map memory absolutely arbitrarily, so there's no way to reconstruct that information. Again, this could be a physical address (or a pfn), but we have it as a page because it's locked and we're going to unlock it after mapping it. So this is actually a place where we'll need to get a page from the filesystem, convert to a folio and call folio operations on it. This is one of the reasons that lock_page() / unlock_page() contain the embedded compound_head() today.
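The "we can deduce it for filesystems" point comes down to simple index
arithmetic, because page-cache folios cover a contiguous, naturally aligned
range of file offsets. A minimal sketch under that assumption - the types and
names below are simplified stand-ins, not kernel definitions:

```c
#include <assert.h>

/*
 * Toy model of recovering the faulting subpage from a folio: a
 * page-cache folio covers a contiguous range of page-cache indices
 * starting at folio->index, so the file offset that faulted pins down
 * the exact page within it.  Simplified stand-in types, not kernel code.
 */
#define TOY_PAGE_SHIFT	12

struct toy_folio {
	unsigned long index;	/* first page-cache index covered */
	unsigned int order;	/* folio spans 1 << order pages */
};

/* Index of the subpage (within the folio) backing byte offset @pos. */
static unsigned long toy_folio_subpage(const struct toy_folio *folio,
				       unsigned long pos)
{
	unsigned long index = pos >> TOY_PAGE_SHIFT;

	assert(index >= folio->index);
	assert(index < folio->index + (1UL << folio->order));
	return index - folio->index;
}
```

A driver mapping arbitrary memory has no such index relationship, which is
exactly why the deduction only works for the page cache.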
On Thu, Sep 23, 2021 at 02:00:46PM -0400, Johannes Weiner wrote:
> Yeah, with subclassing and a generic type for shared code. I outlined
> that earlier in the thread:
>
> https://lore.kernel.org/all/YUo20TzAlqz8Tceg@cmpxchg.org/
>
> So you have anon_page and file_page being subclasses of page - similar
> to how filesystems have subclasses that inherit from struct inode - to
> help refactor what is generic, what isn't, and highlight what should be.
>
> Whether we do anon_page and file_page inheriting from struct page, or
> anon_folio and file_folio inheriting from struct folio - either would
> work of course.

If we go that route, my preference would be for completely separate
anon_folio and file_folio types - separately allocated when we get
there, both completely their own thing.

I think even in languages that have it, data inheritance is kind of
evil and I prefer to avoid it - even if that means having code that
does if (anon_folio) else if (file_folio) where both branches do the
exact same thing.

For the LRU lists we might be able to create a new type wrapping a
list head, and embed that in both file_folio and anon_folio, and pass
that type to the LRU code. I'm just spitballing ideas though, you know
that code better than I do.

> Again I think it comes down to the value proposition
> of folio as a means to clean up compound pages inside the MM code.
> It's pretty uncontroversial that we want PAGE_SIZE assumptions gone
> from the filesystems, networking, drivers and other random code. The
> argument for MM code is a different one. We seem to be discussing the
> folio abstraction as a binary thing for the Linux kernel, rather than
> a selectively applied tool, and I think it prevents us from doing
> proper one-by-one cost/benefit analyses on the areas of application.
> > I suggested the anon/file split as an RFC to sidestep the cost/benefit > question of doing the massive folio change in MM just to cleanup the > compound pages; takeing the idea of redoing the page typing, just in a > way that would maybe benefit MM code more broadly and obviously. It's not just compound pages though - THPs introduced a lot of if (normal page) else if (hugepage) stuff that needs to be cleaned up. Also, by enabling arbitrary size compound pages for anonymous memory, this is going to help with memory fragmentation - right now, the situation for anonymous pages is all or nothing, normal page or hugepage, and since most of the time it ends up being normal pages we end up fragmenting memory unnecessarily. I don't think it'll have anywhere near the performance impact for anonymous pages as it will for file pages, but we should still see some performance gains too. That's all true though whether or not anonymous pages end up using the same type as folios though, so it's not an argument either way. > I think we need a better analysis of that mess and a concept where > tailpages are and should be, if that is the justification for the MM > conversion. > > The motivation is that we have a ton of compound_head() calls in > places we don't need them. No argument there, I think. I don't think that's the main motivation at this point, though. See the struct page proposal document I wrote last night - several of the ideas in there are yours. The compound vs. tail page confusion is just one of many birds we can kill with this stone. I'd really love to hear your thoughts on that document btw - I want to know if we're on the same page and if I accurately captured your ideas and if you've got more to add. 
> But the explanation for going with whitelisting - the most invasive > approach possible (and which leaves more than one person "unenthused" > about that part of the patches) - is that it's difficult and error > prone to identify which ones are necessary and which ones are not. And > maybe that we'll continue to have a widespread hybrid existence of > head and tail pages that will continue to require clarification. > > But that seems to be an article of faith. It's implied by the > approach, but this may or may not be the case. > > I certainly think it used to be messier in the past. But strides have > been made already to narrow the channels through which tail pages can > actually enter the code. Certainly we can rule out entire MM > subsystems and simply declare their compound_head() usage unnecessary > with little risk or ambiguity. This sounds like we're not using assertions nearly enough. The primary use of assertions isn't to catch where we've fucked and don't have a way to recover - the right way to think of assertions is that they're for documenting invariants in a way that can't go out of date, like comments can. They're almost as good as doing it with the type system. > Then the question becomes which ones are legit. Whether anybody > outside the page allocator ever needs to *see* a tailpage struct page > to begin with. (Arguably that bit in __split_huge_page_tail() could be > a page allocator function; the pte handling is pfn-based except for > the mapcount management which could be encapsulated; the collapse code > uses vm_normal_page() but follows it quickly by compound_head() - and > arguably a tailpage generally isn't a "normal" vm page, so a new > pfn_to_normal_page() could encapsulate the compound_head()). Because > if not, seeing struct page in MM code isn't nearly as ambiguous as is > being implied. You would never have to worry about it - unless you are > in fact the page allocator. 
> > So if this problem could be solved by making tail pages an > encapsulated page_alloc thing, and chasing down the rest of > find_subpage() callers (which needs to happen anyway), I don't think a > wholesale folio conversion of this subsystem would be justified. > > A more in-depth analyses of where and how we need to deal with > tailpages - laying out the data structures that hold them and code > entry points for them - would go a long way for making the case for > folios. And might convince reluctant people to get behind the effort. Alternately - imagine we get to the struct page proposal I laid out. What code is still going to deal with struct page, and which code is going to change to working with some subtype of page?
On Thu, Sep 23, 2021 at 04:41:04AM +0100, Matthew Wilcox wrote: > On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote: > > On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote: > > > +/** > > > + * page_slab - Converts from page to slab. > > > + * @p: The page. > > > + * > > > + * This function cannot be called on a NULL pointer. It can be called > > > + * on a non-slab page; the caller should check is_slab() to be sure > > > + * that the slab really is a slab. > > > + * > > > + * Return: The slab which contains this page. > > > + */ > > > +#define page_slab(p) (_Generic((p), \ > > > + const struct page *: (const struct slab *)_compound_head(p), \ > > > + struct page *: (struct slab *)_compound_head(p))) > > > + > > > +static inline bool is_slab(struct slab *slab) > > > +{ > > > + return test_bit(PG_slab, &slab->flags); > > > +} > > > + > > > > I'm sorry, I don't have a dog in this fight and conceptually I think folios are > > a good idea... > > > > But for this work, having a call which returns if a 'struct slab' really is a > > 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return > > NULL if there is no slab containing that page? > > No, this is a good question. > > The way slub works right now is that if you ask for a "large" allocation, > it does: > > flags |= __GFP_COMP; > page = alloc_pages_node(node, flags, order); > > and returns page_address(page) (eventually; the code is more complex) > So when you call kfree(), it uses the PageSlab flag to determine if the > allocation was "large" or not: > > page = virt_to_head_page(x); > if (unlikely(!PageSlab(page))) { > free_nonslab_page(page, object); > return; > } > slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); > > Now, you could say that this is a bad way to handle things, and every > allocation from slab should have PageSlab set, Yea basically. So what makes 'struct slab' different from 'struct page' in an order 0 allocation? 
Am I correct in deducing that PG_slab is not set in that case? > and it should use one of > the many other bits in page->flags to indicate whether it's a large > allocation or not. Isn't the fact that it is a compound page enough to know that? > I may have feelings in that direction myself. > But I don't think I should be changing that in this patch. > > Maybe calling this function is_slab() is the confusing thing. > Perhaps it should be called SlabIsLargeAllocation(). Not sure. Well that makes a lot more sense to me from an API standpoint but checking PG_slab is still likely to raise some eyebrows. Regardless I like the fact that the community is at least attempting to fix stuff like this. Because adding types like this make it easier for people like me to understand what is going on. Ira
On Thu, Sep 23, 2021 at 03:12:41PM -0700, Ira Weiny wrote: > On Thu, Sep 23, 2021 at 04:41:04AM +0100, Matthew Wilcox wrote: > > On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote: > > > On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote: > > > > +/** > > > > + * page_slab - Converts from page to slab. > > > > + * @p: The page. > > > > + * > > > > + * This function cannot be called on a NULL pointer. It can be called > > > > + * on a non-slab page; the caller should check is_slab() to be sure > > > > + * that the slab really is a slab. > > > > + * > > > > + * Return: The slab which contains this page. > > > > + */ > > > > +#define page_slab(p) (_Generic((p), \ > > > > + const struct page *: (const struct slab *)_compound_head(p), \ > > > > + struct page *: (struct slab *)_compound_head(p))) > > > > + > > > > +static inline bool is_slab(struct slab *slab) > > > > +{ > > > > + return test_bit(PG_slab, &slab->flags); > > > > +} > > > > + > > > > > > I'm sorry, I don't have a dog in this fight and conceptually I think folios are > > > a good idea... > > > > > > But for this work, having a call which returns if a 'struct slab' really is a > > > 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return > > > NULL if there is no slab containing that page? > > > > No, this is a good question. 
> > > > The way slub works right now is that if you ask for a "large" allocation, > > it does: > > > > flags |= __GFP_COMP; > > page = alloc_pages_node(node, flags, order); > > > > and returns page_address(page) (eventually; the code is more complex) > > So when you call kfree(), it uses the PageSlab flag to determine if the > > allocation was "large" or not: > > > > page = virt_to_head_page(x); > > if (unlikely(!PageSlab(page))) { > > free_nonslab_page(page, object); > > return; > > } > > slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); > > > > Now, you could say that this is a bad way to handle things, and every > > allocation from slab should have PageSlab set, > > Yea basically. > > So what makes 'struct slab' different from 'struct page' in an order 0 > allocation? Am I correct in deducing that PG_slab is not set in that case? You might mean a couple of different things by that question, so let me say some things which are true (on x86) but might not answer your question: If you kmalloc(4095) bytes, it comes from a slab. That slab would usually be an order-3 allocation. If that order-3 allocation fails, slab might go as low as an order-0 allocation, but PageSlab will always be set on that head/base page because the allocation is smaller than two pages. If you kmalloc(8193) bytes, slub throws up its hands and does an allocation from the page allocator. So it allocates an order-2 page, does not set PG_slab on it, but PG_head is set on the head page and PG_tail is set on all three tail pages. > > and it should use one of > > the many other bits in page->flags to indicate whether it's a large > > allocation or not. > > Isn't the fact that it is a compound page enough to know that? No -- regular slab allocations have PG_head set. But it could use, eg, slab->slab_cache == NULL to distinguish page allocations from slab allocations. > > I may have feelings in that direction myself. > > But I don't think I should be changing that in this patch. 
> >
> > Maybe calling this function is_slab() is the confusing thing.
> > Perhaps it should be called SlabIsLargeAllocation(). Not sure.
>
> Well that makes a lot more sense to me from an API standpoint but checking
> PG_slab is still likely to raise some eyebrows.

Yeah. Here's what I have right now:

+static inline bool SlabMultiPage(const struct slab *slab)
+{
+	return test_bit(PG_head, &slab->flags);
+}
+
+/* Did this allocation come from the page allocator instead of slab? */
+static inline bool SlabPageAllocation(const struct slab *slab)
+{
+	return !test_bit(PG_slab, &slab->flags);
+}

> Regardless I like the fact that the community is at least attempting to fix
> stuff like this. Because adding types like this make it easier for people like
> me to understand what is going on.

Yes, I dislike that 'struct page' is so hard to understand, and so easy
to misuse. It's a very weak type.
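The size cutoff Matthew walks through (kmalloc(4095) stays slab-backed,
kmalloc(8193) falls through to the page allocator) can be modeled in a couple
of lines. A hypothetical illustration, with constants mirroring the x86
defaults discussed above:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the slub size cutoff described above: requests up to
 * two pages are served from slab caches (so PG_slab is set on the
 * head/base page), while anything larger goes straight to the page
 * allocator as a compound page without PG_slab.  The constants mirror
 * the x86 defaults from the thread; the rest is illustrative.
 */
#define TOY_PAGE_SIZE			4096UL
#define TOY_KMALLOC_MAX_CACHE_SIZE	(2 * TOY_PAGE_SIZE)	/* 8192 */

/* Would a toy kmalloc() of @size leave PG_slab set on the result? */
static bool toy_kmalloc_sets_pg_slab(unsigned long size)
{
	return size <= TOY_KMALLOC_MAX_CACHE_SIZE;
}
```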
On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> One one hand, the ambition appears to substitute folio for everything
> that could be a base page or a compound page even inside core MM
> code. Since there are very few places in the MM code that expressly
> deal with tail pages in the first place, this amounts to a conversion
> of most MM code - including the LRU management, reclaim, rmap,
> migrate, swap, page fault code etc. - away from "the page".
>
> However, this far exceeds the goal of a better mm-fs interface. And
> the value proposition of a full MM-internal conversion, including
> e.g. the less exposed anon page handling, is much more nebulous. It's
> been proposed to leave anon pages out, but IMO to keep that direction
> maintainable, the folio would have to be translated to a page quite
> early when entering MM code, rather than propagating it inward, in
> order to avoid huge, massively overlapping page and folio APIs.

Here's an example where our current confusion between "any page"
and "head page" at least produces confusing behaviour, if not an
outright bug, isolate_migratepages_block():

		page = pfn_to_page(low_pfn);
...
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			goto isolate_fail;
		}

compound_order() does not expect a tail page; it returns 0 unless it's
a head page. I think what we actually want to do here is:

		if (!cc->alloc_contig) {
			struct page *head = compound_head(page);
			if (PageHead(head)) {
				const unsigned int order = compound_order(head);

				low_pfn |= (1UL << order) - 1;
				goto isolate_fail;
			}
		}

Not earth-shattering; not even necessarily a bug. But it's an example
of the way the code reads is different from how the code is executed,
and that's potentially dangerous. Having a different type for tail
and not-tail pages prevents the muddy thinking that can lead to
tail pages being passed to compound_order().
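The hazard is easy to reproduce in miniature: compound_order() only answers
for a head page, so a pfn walker that lands on a tail page and forgets
compound_head() computes a zero-length skip. The sketch below uses stand-in
types, not the kernel's struct page:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Miniature reproduction of the isolate_migratepages_block() hazard:
 * compound_order() reports 0 for anything that isn't a head page, so
 * the walker's skip distance silently collapses to zero on tail pages.
 * The page model here is a stand-in for illustration only.
 */
struct toy_page {
	bool head;			/* PG_head */
	unsigned int order;		/* meaningful on head pages only */
	struct toy_page *compound_head;
};

static unsigned int toy_compound_order(const struct toy_page *page)
{
	return page->head ? page->order : 0;	/* tail pages report 0 */
}

/* The pfn increment as the walker computes it: (1UL << order) - 1. */
static unsigned long toy_skip(const struct toy_page *page)
{
	return (1UL << toy_compound_order(page)) - 1;
}
```

With distinct head/tail types, the tail-page call would simply not compile,
which is the point Matthew is making.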
On Tue, Oct 05, 2021 at 02:52:01PM +0100, Matthew Wilcox wrote: > On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote: > > One one hand, the ambition appears to substitute folio for everything > > that could be a base page or a compound page even inside core MM > > code. Since there are very few places in the MM code that expressly > > deal with tail pages in the first place, this amounts to a conversion > > of most MM code - including the LRU management, reclaim, rmap, > > migrate, swap, page fault code etc. - away from "the page". > > > > However, this far exceeds the goal of a better mm-fs interface. And > > the value proposition of a full MM-internal conversion, including > > e.g. the less exposed anon page handling, is much more nebulous. It's > > been proposed to leave anon pages out, but IMO to keep that direction > > maintainable, the folio would have to be translated to a page quite > > early when entering MM code, rather than propagating it inward, in > > order to avoid huge, massively overlapping page and folio APIs. > > Here's an example where our current confusion between "any page" > and "head page" at least produces confusing behaviour, if not an > outright bug, isolate_migratepages_block(): > > page = pfn_to_page(low_pfn); > ... > if (PageCompound(page) && !cc->alloc_contig) { > const unsigned int order = compound_order(page); > > if (likely(order < MAX_ORDER)) > low_pfn += (1UL << order) - 1; > goto isolate_fail; > } > > compound_order() does not expect a tail page; it returns 0 unless it's > a head page. I think what we actually want to do here is: > > if (!cc->alloc_contig) { > struct page *head = compound_head(page); > if (PageHead(head)) { > const unsigned int order = compound_order(head); > > low_pfn |= (1UL << order) - 1; > goto isolate_fail; > } > } > > Not earth-shattering; not even necessarily a bug. But it's an example > of the way the code reads is different from how the code is executed, > and that's potentially dangerous. 
Having a different type for tail > and not-tail pages prevents the muddy thinking that can lead to > tail pages being passed to compound_order(). Thanks for digging this up. I agree the second version is much better. My question is still whether the extensive folio whitelisting of everybody else is the best way to bring those codepaths to light. The above isn't totally random. That code is a pfn walker which translates from the basepage address space to an ambiguous struct page object. There are more of those, but we can easily identify them: all uses of pfn_to_page() and virt_to_page() indicate that the code needs an audit for how exactly they're using the returned page. The above instance of such a walker wants to deal with a higher-level VM object: a thing that can be on the LRU, can be locked, etc. For those instances the pattern is clear that the pfn_to_page() always needs to be paired with a compound_head() before handling the page. I had mentioned in the other subthread a pfn_to_normal_page() to streamline this pattern, clarify intent, and mark the finished audit. Another class are page table walkers resolving to an ambiguous struct page right now. Those are also easy to identify, and AFAICS they all want headpages, which is why I had mentioned a central compound_head() in vm_normal_page(). Are there other such classes that I'm missing? Because it seems to me there are two and they both have rather clear markers for where the disambiguation needs to happen - and central helpers to put them in! And it makes sense: almost nobody *actually* needs to access the tail members of struct page. This suggests a pushdown and early filtering in a few central translation/lookup helpers would work to completely disambiguate remaining struct page usage inside MM code. There *are* a few weird struct page usages left, like bio and sparse, and you mentioned vm_fault as well in the other subthread. 
But it really seems these want converting away from arbitrary struct page to either something like phys_addr_t or a proper headpage anyway. Maybe a tuple of headpage and subpage index in the fault case. Because even after a full folio conversion of everybody else, those would be quite weird in their use of an ambiguous struct page! Which struct members are safe to access? What does it mean to lock a tailpage? Etc. But it's possible I'm missing something. Are there entry points that are difficult to identify both conceptually and code-wise? And which couldn't be pushed down to resolve to headpages quite early? Those I think would make the argument for the folio in the MM implementation.
On 05.10.21 19:29, Johannes Weiner wrote: > On Tue, Oct 05, 2021 at 02:52:01PM +0100, Matthew Wilcox wrote: >> On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote: >>> One one hand, the ambition appears to substitute folio for everything >>> that could be a base page or a compound page even inside core MM >>> code. Since there are very few places in the MM code that expressly >>> deal with tail pages in the first place, this amounts to a conversion >>> of most MM code - including the LRU management, reclaim, rmap, >>> migrate, swap, page fault code etc. - away from "the page". >>> >>> However, this far exceeds the goal of a better mm-fs interface. And >>> the value proposition of a full MM-internal conversion, including >>> e.g. the less exposed anon page handling, is much more nebulous. It's >>> been proposed to leave anon pages out, but IMO to keep that direction >>> maintainable, the folio would have to be translated to a page quite >>> early when entering MM code, rather than propagating it inward, in >>> order to avoid huge, massively overlapping page and folio APIs. >> >> Here's an example where our current confusion between "any page" >> and "head page" at least produces confusing behaviour, if not an >> outright bug, isolate_migratepages_block(): >> >> page = pfn_to_page(low_pfn); >> ... >> if (PageCompound(page) && !cc->alloc_contig) { >> const unsigned int order = compound_order(page); >> >> if (likely(order < MAX_ORDER)) >> low_pfn += (1UL << order) - 1; >> goto isolate_fail; >> } >> >> compound_order() does not expect a tail page; it returns 0 unless it's >> a head page. I think what we actually want to do here is: >> >> if (!cc->alloc_contig) { >> struct page *head = compound_head(page); >> if (PageHead(head)) { >> const unsigned int order = compound_order(head); >> >> low_pfn |= (1UL << order) - 1; >> goto isolate_fail; >> } >> } >> >> Not earth-shattering; not even necessarily a bug. 
But it's an example >> of the way the code reads is different from how the code is executed, >> and that's potentially dangerous. Having a different type for tail >> and not-tail pages prevents the muddy thinking that can lead to >> tail pages being passed to compound_order(). > > Thanks for digging this up. I agree the second version is much better. > > My question is still whether the extensive folio whitelisting of > everybody else is the best way to bring those codepaths to light. > > The above isn't totally random. That code is a pfn walker which > translates from the basepage address space to an ambiguous struct page > object. There are more of those, but we can easily identify them: all > uses of pfn_to_page() and virt_to_page() indicate that the code needs > an audit for how exactly they're using the returned page. +pfn_to_online_page()
On Tue, Oct 05, 2021 at 01:29:43PM -0400, Johannes Weiner wrote: > On Tue, Oct 05, 2021 at 02:52:01PM +0100, Matthew Wilcox wrote: > > On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote: > > > One one hand, the ambition appears to substitute folio for everything > > > that could be a base page or a compound page even inside core MM > > > code. Since there are very few places in the MM code that expressly > > > deal with tail pages in the first place, this amounts to a conversion > > > of most MM code - including the LRU management, reclaim, rmap, > > > migrate, swap, page fault code etc. - away from "the page". > > > > > > However, this far exceeds the goal of a better mm-fs interface. And > > > the value proposition of a full MM-internal conversion, including > > > e.g. the less exposed anon page handling, is much more nebulous. It's > > > been proposed to leave anon pages out, but IMO to keep that direction > > > maintainable, the folio would have to be translated to a page quite > > > early when entering MM code, rather than propagating it inward, in > > > order to avoid huge, massively overlapping page and folio APIs. > > > > Here's an example where our current confusion between "any page" > > and "head page" at least produces confusing behaviour, if not an > > outright bug, isolate_migratepages_block(): > > > > page = pfn_to_page(low_pfn); > > ... > > if (PageCompound(page) && !cc->alloc_contig) { > > const unsigned int order = compound_order(page); > > > > if (likely(order < MAX_ORDER)) > > low_pfn += (1UL << order) - 1; > > goto isolate_fail; > > } > > > > compound_order() does not expect a tail page; it returns 0 unless it's > > a head page. 
I think what we actually want to do here is: > > > > if (!cc->alloc_contig) { > > struct page *head = compound_head(page); > > if (PageHead(head)) { > > const unsigned int order = compound_order(head); > > > > low_pfn |= (1UL << order) - 1; > > goto isolate_fail; > > } > > } > > > > Not earth-shattering; not even necessarily a bug. But it's an example > > of the way the code reads is different from how the code is executed, > > and that's potentially dangerous. Having a different type for tail > > and not-tail pages prevents the muddy thinking that can lead to > > tail pages being passed to compound_order(). > > Thanks for digging this up. I agree the second version is much better. > > My question is still whether the extensive folio whitelisting of > everybody else is the best way to bring those codepaths to light. Outside of core MM developers, I'm not sure that a lot of people know that a struct page might represent 2^n pages of memory. Even architecture maintainers seem to be pretty fuzzy on what flush_dcache_page() does for compound pages: https://lore.kernel.org/linux-arch/20200818150736.GQ17456@casper.infradead.org/ I know this change is a massive pain, but I do think we're better off in a world where 'struct page' really refers to one page of memory, and we have some other name for a memory descriptor that refers to 2^n pages of memory. > The above isn't totally random. That code is a pfn walker which > translates from the basepage address space to an ambiguous struct page > object. There are more of those, but we can easily identify them: all > uses of pfn_to_page() and virt_to_page() indicate that the code needs > an audit for how exactly they're using the returned page. Right; it's not random at all. I ran across it while trying to work out how zsmalloc interacts with memory compaction. It just seemed like a particularly compelling example because it's not part of some random driver, it's a relatively important part of the MM. 
And if such a place has this kind of ambiguity, everything else must surely be worse. > The above instance of such a walker wants to deal with a higher-level > VM object: a thing that can be on the LRU, can be locked, etc. For > those instances the pattern is clear that the pfn_to_page() always > needs to be paired with a compound_head() before handling the page. I > had mentioned in the other subthread a pfn_to_normal_page() to > streamline this pattern, clarify intent, and mark the finished audit. > > Another class are page table walkers resolving to an ambiguous struct > page right now. Those are also easy to identify, and AFAICS they all > want headpages, which is why I had mentioned a central compound_head() > in vm_normal_page(). > > Are there other such classes that I'm missing? Because it seems to me > there are two and they both have rather clear markers for where the > disambiguation needs to happen - and central helpers to put them in! > > And it makes sense: almost nobody *actually* needs to access the tail > members of struct page. This suggests a pushdown and early filtering > in a few central translation/lookup helpers would work to completely > disambiguate remaining struct page usage inside MM code. The end goal (before you started talking about shrinking memmap) was to rename page->mapping, page->index, page->lru and page->private, so you can't look at members of struct page any more. struct page would still have ->compound_head, but anything else would require conversion to folio first. Now that you've put the "dynamically allocate the memory descriptor" idea in my head, that rename becomes a deletion, and struct page itself shrinks down to a single pointer. > There *are* a few weird struct page usages left, like bio and sparse, > and you mentioned vm_fault as well in the other subthread. But it > really seems these want converting away from arbitrary struct page to > either something like phys_addr_t or a proper headpage anyway.

Maybe a > tuple of headpage and subpage index in the fault case. Because even > after a full folio conversion of everybody else, those would be quite > weird in their use of an ambiguous struct page! Which struct members > are safe to access? What does it mean to lock a tailpage? Etc. If you think converting the MM from struct page to struct folio is bad, a lot of churn, etc, you're going to be amazed at how much churn it'll be to convert all of block and networking from struct page to phys_addr_t! I'm not saying it's not worth doing, or it'll never be done, but that's a five year project. And I have no idea how to migrate to it gracefully. > But it's possible I'm missing something. Are there entry points that > are difficult to identify both conceptually and code-wise? And which > couldn't be pushed down to resolve to headpages quite early? Those I > think would make the argument for the folio in the MM implementation. The approach I took with folio was to justify their appearance by showing how they could remove all these hidden calls to compound_head(). So I went bottom-up. Doing the slub conversion, I went in the opposite direction; start out by converting the top layers from virt_to_head_page() to use virt_to_slab(). Then simply call slab_page() when calling any function which hasn't yet been converted. At each step, we get better and better type safety because every place that gets converted knows it's being passed a head page and doesn't have to worry about whether it might be passed a tail page. Doing it in this direction doesn't let us remove the hidden calls to compound_head() until the very end of the conversion, but people don't seem to be particularly moved by all this wasted i-cache anyway. I can look at doing this for the page cache, but we kind of need agreement that separating the types is where we're going, and what we're going to end up calling both types. Slab was easy; Bonwick decided what we were going to call the memory descriptor ;-)
On Tue, Oct 05, 2021 at 07:30:06PM +0100, Matthew Wilcox wrote: > Outside of core MM developers, I'm not sure that a lot of people > know that a struct page might represent 2^n pages of memory. Even > architecture maintainers seem to be pretty fuzzy on what > flush_dcache_page() does for compound pages: > https://lore.kernel.org/linux-arch/20200818150736.GQ17456@casper.infradead.org/ I definitely second that opinion. A final outcome where we still have struct page referring to all kinds of different things feels like a missed opportunity to me. Jason
On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > mm/memcg: Add folio_memcg() and related functions > mm/memcg: Convert commit_charge() to take a folio > mm/memcg: Convert mem_cgroup_charge() to take a folio > mm/memcg: Convert uncharge_page() to uncharge_folio() > mm/memcg: Convert mem_cgroup_uncharge() to take a folio > mm/memcg: Convert mem_cgroup_migrate() to take folios > mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio > mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock() > mm/memcg: Convert mem_cgroup_move_account() to use a folio > mm/memcg: Add folio_lruvec() > mm/memcg: Add folio_lruvec_lock() and similar functions > mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave() > mm/workingset: Convert workingset_activation to take a folio > > This is all anon+file stuff, not needed for filesystem > folios. No, that's not true. A number of these functions are called from filesystem code. mem_cgroup_track_foreign_dirty() is only called from filesystem code. We at the very least need wrappers like folio_cgroup_charge(), and folio_memcg_lock(). > As per the other email, no conceptual entry point for > tail pages into either subsystem, so no ambiguity > around the necessity of any compound_head() calls, > directly or indirectly. It's easy to rule out > wholesale, so there is no justification for > incrementally annotating every single use of the page. The justification is that we can remove all those hidden calls to compound_head(). Hundreds of bytes of text spread throughout this file. > mm: Add folio_young and folio_idle > mm/swap: Add folio_activate() > mm/swap: Add folio_mark_accessed() > > This is anon+file aging stuff, not needed. Again, very much needed. Take a look at pagecache_get_page(). In Linus' tree today, it calls if (page_is_idle(page)) clear_page_idle(page); So either we need wrappers (which are needlessly complicated thanks to how page_is_idle() is defined) or we just convert it. 
> mm/rmap: Add folio_mkclean() > > mm/migrate: Add folio_migrate_mapping() > mm/migrate: Add folio_migrate_flags() > mm/migrate: Add folio_migrate_copy() > > More anon+file conversion, not needed. As far as I can tell, anon never calls any of these three functions. anon calls migrate_page(), which calls migrate_page_move_mapping(), but several filesystems do call these individual functions. > mm/lru: Add folio_add_lru() > > LRU code, not needed. Again, we need folio_add_lru() for filemap. This one's more tractable as a wrapper function.
On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > mm/lru: Add folio LRU functions > > The LRU code is used by anon and file and not needed > for the filesystem API. > > And as discussed, there is generally no ambiguity of > tail pages on the LRU list. One of the assumptions you're making is that the current code is suitable for folios. One of the things that happens in this patch is: - update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); + update_lru_size(lruvec, lru, folio_zonenum(folio), + folio_nr_pages(folio)); static inline long folio_nr_pages(struct folio *folio) { return compound_nr(&folio->page); } vs #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline int thp_nr_pages(struct page *page) { VM_BUG_ON_PGFLAGS(PageTail(page), page); if (PageHead(page)) return HPAGE_PMD_NR; return 1; } #else static inline int thp_nr_pages(struct page *page) { VM_BUG_ON_PGFLAGS(PageTail(page), page); return 1; } #endif So if you want to leave all the LRU code using pages, all the uses of thp_nr_pages() need to be converted to compound_nr(). Or maybe not all of them; I don't know which ones might be safe to leave as thp_nr_pages(). That's one of the reasons I went with a whitelist approach.
On Sat, Oct 16, 2021 at 04:28:23AM +0100, Matthew Wilcox wrote: > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > mm/memcg: Add folio_memcg() and related functions > > mm/memcg: Convert commit_charge() to take a folio > > mm/memcg: Convert mem_cgroup_charge() to take a folio > > mm/memcg: Convert uncharge_page() to uncharge_folio() > > mm/memcg: Convert mem_cgroup_uncharge() to take a folio > > mm/memcg: Convert mem_cgroup_migrate() to take folios > > mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio > > mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock() > > mm/memcg: Convert mem_cgroup_move_account() to use a folio > > mm/memcg: Add folio_lruvec() > > mm/memcg: Add folio_lruvec_lock() and similar functions > > mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave() > > mm/workingset: Convert workingset_activation to take a folio > > > > This is all anon+file stuff, not needed for filesystem > > folios. > > No, that's not true. A number of these functions are called from > filesystem code. mem_cgroup_track_foreign_dirty() is only called > from filesystem code. We at the very least need wrappers like > folio_cgroup_charge(), and folio_memcg_lock(). Well, a handful of exceptions don't refute the broader point. No objection from me to convert mem_cgroup_track_foreign_dirty(). No objection to add a mem_cgroup_charge_folio(). But I insist on the subsystem prefix, because that's in line with how we're charging a whole bunch of other different things (swap, skmem, etc.). It'll also match a mem_cgroup_charge_anon() if we agree to an anon type. folio_memcg_lock() sounds good to me. > > As per the other email, no conceptual entry point for > > tail pages into either subsystem, so no ambiguity > > around the necessity of any compound_head() calls, > > directly or indirectly. It's easy to rule out > > wholesale, so there is no justification for > > incrementally annotating every single use of the page. 
> > The justification is that we can remove all those hidden calls to > compound_head(). Hundreds of bytes of text spread throughout this file. I find this line of argument highly disingenuous. No new type is necessary to remove these calls inside MM code. Migrate them into the callsites and remove the 99.9% very obviously bogus ones. The process is the same whether you switch to a new type or not. (I'll send more patches like the PageSlab() ones to that effect. It's easy. The only reason nobody has bothered removing those until now is that nobody reported regressions when they were added.) But typesafety is an entirely different argument. And to reiterate the main point of contention on these patches: there is no consensus among MM people how (or whether) we want MM-internal typesafety for pages. Personally, I think we do, but I don't think head vs tail is the most important or the most error-prone aspect of the many identities struct page can have. In most cases it's not even in the top 5 of questions I have about the page when I see it in a random MM context (outside of the very few places that do virt_to_page or pfn_to_page). Therefore "folio" is not a very poignant way to name the object that is passed around in most MM code. struct anon_page and struct file_page would be way more descriptive and would imply the head/tail aspect. Anyway, the email you are responding to was an offer to split the uncontroversial "large pages backing filesystems" part from the controversial "MM-internal typesafety" discussion. Several people in both the fs space and the mm space have now asked to do this to move ahead. Since you have stated in another subthread that you "want to get back to working on large pages in the page cache," and you never wanted to get involved that deeply in the struct page subtyping efforts, it's not clear to me why you are not taking this offer.
> > mm: Add folio_young and folio_idle > > mm/swap: Add folio_activate() > > mm/swap: Add folio_mark_accessed() > > > > This is anon+file aging stuff, not needed. > > Again, very much needed. Take a look at pagecache_get_page(). In Linus' > tree today, it calls if (page_is_idle(page)) clear_page_idle(page); > So either we need wrappers (which are needlessly complicated thanks to > how page_is_idle() is defined) or we just convert it. I'm not sure I understand the complication. That you'd have to do if (page_is_idle(folio->page)) clear_page_idle(folio->page) inside code in mm/? It's either that, or a) generic code shared with anon pages has to do: if (folio_is_idle(page->folio)) clear_folio_idle(page->folio) which is weird, or b) both types work with their own wrappers: if (page_is_idle(page)) clear_page_idle(page) if (folio_is_idle(folio)) clear_folio_idle(folio) and it's not obvious at all that they are in fact tracking the same state. State which is exported to userspace through the "page_idle" feature. Doing the folio->page translation in mm/-private code, and keeping this a page interface, is by far the most preferable solution. > > mm/rmap: Add folio_mkclean() > > > > mm/migrate: Add folio_migrate_mapping() > > mm/migrate: Add folio_migrate_flags() > > mm/migrate: Add folio_migrate_copy() > > > > More anon+file conversion, not needed. > > As far as I can tell, anon never calls any of these three functions. > anon calls migrate_page(), which calls migrate_page_move_mapping(), > but several filesystems do call these individual functions. In the current series, migrate_page_move_mapping() has been replaced, and anon pages go through them: int folio_migrate_mapping(struct address_space *mapping, struct folio *newfolio, struct folio *folio, int extra_count) { [...] 
if (!mapping) { /* Anonymous page without mapping */ if (folio_ref_count(folio) != expected_count) return -EAGAIN; /* No turning back from here */ newfolio->index = folio->index; newfolio->mapping = folio->mapping; if (folio_test_swapbacked(folio)) __folio_set_swapbacked(newfolio); That's what I'm objecting to. I'm not objecting to adding these to the filesystem interface as thin folio->page wrappers that call the page implementation. > > mm/lru: Add folio_add_lru() > > > > LRU code, not needed. > > Again, we need folio_add_lru() for filemap. This one's more > tractable as a wrapper function. Please don't quote selectively to the point of it being misleading. The original block my statement applied to was this: mm: Add folio_evictable() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm/lru: Add folio_add_lru() which goes way beyond just being filesystem-interfacing. I have no objection to a cache interface function for adding a folio to the LRU (a wrapper to encapsulate the folio->page transition). However, like with the memcg code above, the API is called lru_cache: we have had lru_cache_add_file() and lru_cache_add_anon() in the past, so lru_cache_add_folio() seems more appropriate - especially as long as we still have one for pages (and maybe later one for anon pages). --- All that to say, adding folio as a new type for file headpages with API functions like this: mem_cgroup_charge_folio() lru_cache_add_folio() now THAT would be an incremental change to the kernel code. And if that new type proves to be a great idea, we can do the same for anon - whether with a shared type or with separate types. And if it does end up the same type, in the interfaces and in the implementation, we can merge mem_cgroup_charge_page() # generic bits mem_cgroup_charge_folio() # file bits mem_cgroup_charge_anon() # anon bits back into a single function, just like we've done it already for the anon and file variants of those functions that we have had before. 
And if we then want to rename that function to something we agree is more appropriate, we can do that as yet another step. That would actually be incremental refactoring.
On Sat, Oct 16, 2021 at 08:07:40PM +0100, Matthew Wilcox wrote: > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > mm/lru: Add folio LRU functions > > > > The LRU code is used by anon and file and not needed > > for the filesystem API. > > > > And as discussed, there is generally no ambiguity of > > tail pages on the LRU list. > > One of the assumptions you're making is that the current code is suitable > for folios. One of the things that happens in this patch is: > > - update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); > + update_lru_size(lruvec, lru, folio_zonenum(folio), > + folio_nr_pages(folio)); > > static inline long folio_nr_pages(struct folio *folio) > { > return compound_nr(&folio->page); > } > > vs > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > static inline int thp_nr_pages(struct page *page) > { > VM_BUG_ON_PGFLAGS(PageTail(page), page); > if (PageHead(page)) > return HPAGE_PMD_NR; > return 1; > } > #else > static inline int thp_nr_pages(struct page *page) > { > VM_BUG_ON_PGFLAGS(PageTail(page), page); > return 1; > } > #endif > > So if you want to leave all the LRU code using pages, all the uses of > thp_nr_pages() need to be converted to compound_nr(). Or maybe not all > of them; I don't know which ones might be safe to leave as thp_nr_pages(). > That's one of the reasons I went with a whitelist approach. All of them. The only compound pages that can exist on the LRUs are THPs, and the only THP pages that can exist on the LRUs are compound. There is no plausible scenario where those two functions would disagree in the LRU code. Or elsewhere in the kernel, for that matter. Where would thp_nr_pages() returning compound_nr() ever be wrong? How else are we implementing THPs? I'm not sure that would make sense.
On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > I find this line of argument highly disingenuous. > > No new type is necessary to remove these calls inside MM code. Migrate > them into the callsites and remove the 99.9% very obviously bogus > ones. The process is the same whether you switch to a new type or not. Conversely, I don't see "leave all LRU code as struct page, and ignore anonymous pages" to be a serious counterargument. I got that you really don't want anonymous pages to be folios from the call Friday, but I haven't been getting anything that looks like a serious counterproposal from you. Think about what our goal is: we want to get to a world where our types describe unambiguously how our data is used. That means working towards - getting rid of type punning - struct fields that are only used for a single purpose Leaving all the LRU code as struct page means leaving a shit ton of type punning in place, and you aren't outlining any alternate ways of dealing with that. As long as all the LRU code is using struct page, that halts efforts towards separately allocating these types and making struct page smaller (which was one of your stated goals as well!), and it would leave a big mess in place for god knows how long. It's been a massive effort for Willy to get this far, who knows when someone else with the requisite skillset would be summoning up the energy to deal with that - I don't see you or I doing it. Meanwhile: we've got people working on using folios for anonymous pages to solve some major problems - it cleans up all of the if (normalpage) else if (hugepage) mess - it'll _majorly_ help with our memory fragmentation problems, as I recently outlined. As long as we've got a very bimodal distribution in our allocation sizes where the peaks are at order 0 and HUGEPAGE_ORDER, we're going to have problems allocating hugepages. 
If anonymous + file memory can be arbitrary-sized compound pages, we'll end up with more of a Poisson distribution in our allocation sizes, and a _great deal_ of our difficulties with memory fragmentation are going to be alleviated. - and on architectures that support merging of TLB entries, folios for anonymous memory are going to get us some major performance improvements due to reduced TLB pressure, same as hugepages but without nearly as much memory fragmentation pain And on top of all that, file and anonymous pages are just more alike than they are different. As I keep saying, the sane incremental approach to splitting up struct page into different dedicated types is to follow the union of structs. I get that you REALLY REALLY don't want file and anonymous pages to be the same type, but what you're asking just isn't incremental, it's asking for one big refactoring to be done at the same time as another. > (I'll send more patches like the PageSlab() ones to that effect. It's > easy. The only reason nobody has bothered removing those until now is > that nobody reported regressions when they were added.) I was also pretty frustrated by your response to Willy's struct slab patches. You claim to be all in favour of introducing more type safety and splitting struct page up into multiple types, but on the basis of one objection - that his patches start marking tail slab pages as PageSlab (and I agree with your objection, FWIW) - instead of just asking for that to be changed, or posting a patch that made that change to his series, you said in effect that we shouldn't be doing any of the struct slab stuff by posting your own much more limited refactoring, that was only targeted at the compound_head() issue, which we all agree is a distraction and not the real issue. Why are you letting yourself get distracted by that? 
I'm not really sure what you want, Johannes, besides the fact that you really don't want file and anon pages to be the same type - but I don't see how that gives us a route forwards on the fronts I just outlined.
On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > On Sat, Oct 16, 2021 at 04:28:23AM +0100, Matthew Wilcox wrote: > > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > > mm/memcg: Add folio_memcg() and related functions > > > mm/memcg: Convert commit_charge() to take a folio > > > mm/memcg: Convert mem_cgroup_charge() to take a folio > > > mm/memcg: Convert uncharge_page() to uncharge_folio() > > > mm/memcg: Convert mem_cgroup_uncharge() to take a folio > > > mm/memcg: Convert mem_cgroup_migrate() to take folios > > > mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio > > > mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock() > > > mm/memcg: Convert mem_cgroup_move_account() to use a folio > > > mm/memcg: Add folio_lruvec() > > > mm/memcg: Add folio_lruvec_lock() and similar functions > > > mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave() > > > mm/workingset: Convert workingset_activation to take a folio > > > > > > This is all anon+file stuff, not needed for filesystem > > > folios. > > > > No, that's not true. A number of these functions are called from > > filesystem code. mem_cgroup_track_foreign_dirty() is only called > > from filesystem code. We at the very least need wrappers like > > folio_cgroup_charge(), and folio_memcg_lock(). > > Well, a handful of exceptions don't refute the broader point. > > No objection from me to convert mem_cgroup_track_foreign_dirty(). > > No objection to add a mem_cgroup_charge_folio(). But I insist on the > subsystem prefix, because that's in line with how we're charging a > whole bunch of other different things (swap, skmem, etc.). It'll also > match a mem_cgroup_charge_anon() if we agree to an anon type. I don't care about the name; I'll change that. 
I still don't get when you want mem_cgroup_foo() and when you want memcg_foo() > > > As per the other email, no conceptual entry point for > > > tail pages into either subsystem, so no ambiguity > > > around the necessity of any compound_head() calls, > > > directly or indirectly. It's easy to rule out > > > wholesale, so there is no justification for > > > incrementally annotating every single use of the page. > > > > The justification is that we can remove all those hidden calls to > > compound_head(). Hundreds of bytes of text spread throughout this file. > > I find this line of argument highly disingenuous. > > No new type is necessary to remove these calls inside MM code. Migrate > them into the callsites and remove the 99.9% very obviously bogus > ones. The process is the same whether you switch to a new type or not. > > (I'll send more patches like the PageSlab() ones to that effect. It's > easy. The only reason nobody has bothered removing those until now is > that nobody reported regressions when they were added.) That kind of change is actively dangerous. Today, you can call PageSlab() on a tail page, and it returns true. After your patch, it returns false. Sure, there's a debug check in there that's enabled on about 0.1% of all kernel builds, but I bet most people won't notice. We're not able to catch these kinds of mistakes at review time: https://lore.kernel.org/linux-mm/20211001024105.3217339-1-willy@infradead.org/ which means it escaped the eagle eyes of (at least): Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> I don't say that to shame these people. We need the compiler's help here. 
If we're removing the ability to ask whether a tail page belongs to the slab allocator, we have to have the compiler warn us. I have a feeling your patch also breaks tools/vm/page-types.c > But typesafety is an entirely different argument. And to reiterate the > main point of contention on these patches: there is no consensus among > MM people how (or whether) we want MM-internal typesafety for pages. I don't think there will ever be consensus as long as you don't take the concerns of other MM developers seriously. On Friday's call, several people working on using large pages for anon memory told you that using folios for anon memory would make their lives easier, and you didn't care. > Personally, I think we do, but I don't think head vs tail is the most > important or the most error-prone aspect of the many identities struct > page can have. In most cases it's not even in the top 5 of questions I > have about the page when I see it in a random MM context (outside of > the very few places that do virt_to_page or pfn_to_page). Therefore > "folio" is not a very poignant way to name the object that is passed > around in most MM code. struct anon_page and struct file_page would be > way more descriptive and would imply the head/tail aspect. I get it that you want to split out anon pages from other types of pages. I'm not against there being a struct anon_folio { struct folio f; }; which marks functions or regions of functions that only deal with anon memory. But we need _a_ type which represents "the head page of a compound page or an order-0 page". And that's what folio is. Maybe we also want struct file_folio. I don't see the need for it myself, but maybe I'm wrong. > Anyway, the email you are responding to was an offer to split the > uncontroversial "large pages backing filesystems" part from the > controversial "MM-internal typesafety" discussion. Several people in > both the fs space and the mm space have now asked to do this to move > ahead.
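The tail-page hazard and the type-safety argument above can be made concrete with a small userspace sketch. Everything below is a mock: the struct layouts, the ->head field, and the _mock/_today/_stripped helper names are illustrative stand-ins for the kernel's compound_head()/PageSlab()/struct folio machinery, not the real implementation.

```c
#include <assert.h>

#define PG_slab 7UL

/* Userspace mock, not kernel code: ->head stands in for the
 * compound_head() linkage the kernel encodes in page->compound_head. */
struct page {
	unsigned long flags;
	struct page *head;	/* NULL for head (and order-0) pages */
};

static struct page *compound_head_mock(struct page *p)
{
	return p->head ? p->head : p;
}

/* Today: the flag test hides a compound_head() call, so asking a
 * tail page "are you slab?" returns the head page's answer. */
static int PageSlab_today(struct page *p)
{
	return (compound_head_mock(p)->flags >> PG_slab) & 1;
}

/* With the hidden call removed: the same question on a tail page
 * silently returns false - the behavior change being debated. */
static int PageSlab_stripped(struct page *p)
{
	return (p->flags >> PG_slab) & 1;
}

/* The folio approach: a type that is never a tail page by
 * construction, so the ambiguity cannot arise.  Handing a raw
 * struct page * to folio_test_slab_mock() is a compile-time error. */
struct folio {
	struct page page;
};

static struct folio *page_folio_mock(struct page *p)
{
	return (struct folio *)compound_head_mock(p);
}

static int folio_test_slab_mock(struct folio *folio)
{
	return (folio->page.flags >> PG_slab) & 1;
}
```

With a two-page compound page whose head has PG_slab set, PageSlab_today() on the tail says yes, PageSlab_stripped() says no, and the folio variant forces the head lookup into the one conversion point where the compiler can see it.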
Since you have stated in another subthread that you "want to > get back to working on large pages in the page cache," and you never > wanted to get involved that deeply in the struct page subtyping > efforts, it's not clear to me why you are not taking this offer. I am. This email was written after trying to do just this. I dropped the patches you were opposed to and looked at the result. It's not good. You seem wedded to this idea that "folios are just for file backed memory", and that's not my proposal at all. folios are for everything. Maybe we specialise out other types of memory later, or during, or instead of converting something to use folios, but folios are an utterly generic concept.
On Mon, Oct 18, 2021 at 02:12:32PM -0400, Kent Overstreet wrote: > On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > > I find this line of argument highly disingenuous. > > > > No new type is necessary to remove these calls inside MM code. Migrate > > them into the callsites and remove the 99.9% very obviously bogus > > ones. The process is the same whether you switch to a new type or not. > > Conversely, I don't see "leave all LRU code as struct page, and ignore anonymous > pages" to be a serious counterargument. I got that you really don't want > anonymous pages to be folios from the call Friday, but I haven't been getting > anything that looks like a serious counterproposal from you. > > Think about what our goal is: we want to get to a world where our types describe > unambiguously how our data is used. That means working towards > - getting rid of type punning > - struct fields that are only used for a single purpose How is a common type inheritance model with a generic page type and subclasses not a counterproposal? And one which actually accomplishes those two things you're saying, as opposed to a shared folio where even 'struct address_space *mapping' is a total lie type-wise? Plus, really, what's the *alternative* to doing that anyway? How are we going to implement code that operates on folios and other subtypes of the page alike? And deal with attributes and properties that are shared among them all? Willy's original answer to that was that folio is just *going* to be all these things - file, anon, slab, network, rando driver stuff. But since that wasn't very popular, would not get rid of type punning and overloaded members, would get rid of efficiently allocating descriptor memory etc. - what *is* the alternative now to common properties between split out subtypes? I'm not *against* what you and Willy are saying. I have *genuinely zero idea what* you are saying.
> Leaving all the LRU code as struct page means leaving a shit ton of type punning > in place, and you aren't outlining any alternate ways of dealing with that. As > long as all the LRU code is using struct page, that halts efforts towards > separately allocating these types and making struct page smaller (which was one > of your stated goals as well!), and it would leave a big mess in place for god > knows how long. I don't follow either of these claims. Converting to a shared anon/file folio makes almost no dent in the existing type punning we have, because head/tail page disambiguation is a tiny part of the type inference we do on struct page. And leaving the LRU linkage in the struct page doesn't get in the way of allocating separate subtype descriptors. All these types need a list_head anyway, from anon to file to slab to the buddy allocator. Maybe anon, file, slab don't need it at the 4k granularity all the time, but the buddy allocator does anyway as long as it's 4k based, and I'm sure you don't want to be allocating a new buddy descriptor every time we're splitting a larger page block into a smaller one? I really have no idea how that would even work. > It's been a massive effort for Willy to get this far, who knows when > someone else with the requisite skillset would be summoning up the > energy to deal with that - I don't see you or I doing it. > > Meanwhile: we've got people working on using folios for anonymous pages to solve > some major problems > > - it cleans up all of the if (normalpage) else if (hugepage) mess No it doesn't. > - it'll _majorly_ help with our memory fragmentation problems, as I recently > outlined. As long as we've got a very bimodal distribution in our allocation > sizes where the peaks are at order 0 and HUGEPAGE_ORDER, we're going to have > problems allocating hugepages.
If anonymous + file memory can be arbitrary > sized compound pages, we'll end up with more of a Poisson distribution in our > allocation sizes, and a _great deal_ of our difficulties with memory > fragmentation are going to be alleviated. > > - and on architectures that support merging of TLB entries, folios for > anonymous memory are going to get us some major performance improvements due > to reduced TLB pressure, same as hugepages but without nearly as much memory > fragmentation pain It doesn't do those, either. It's a new name for head pages, that's it. Converting to arbitrary-order huge pages needs to rework assumptions around what THP pages mean in various places of the code. Mainly the page table code. Presumably. We don't have anything even resembling a proposal for what this is all going to look like implementation-wise. How does changing the name help with this? How does not having the new name get in the way of it? > And on top of all that, file and anonymous pages are just more alike than they > are different. I don't know what you're basing this on, and you can't just keep making this claim without showing code to actually unify them. They have some stuff in common, and some stuff is deeply different. All of this screams class & subclass. Meanwhile you and Willy just keep coming up with hacks on how we can somehow work around this fact and contort the types to work out anyway. You yourself said that folio including slab and other random stuff is a bonkers idea. But that means we need to deal with properties that are going to be shared between subtypes, and I'm the only one who has come up with a remotely coherent proposal on how to do that. > > (I'll send more patches like the PageSlab() ones to that effect. It's > easy. The only reason nobody has bothered removing those until now is > that nobody reported regressions when they were added.) > > I was also pretty frustrated by your response to Willy's struct slab patches.
> > You claim to be all in favour of introducing more type safety and splitting > struct page up into multiple types, but on the basis of one objection - that his > patches start marking tail slab pages as PageSlab (and I agree with your > objection, FWIW) - instead of just asking for that to be changed, or posting a > patch that made that change to his series, you said in effect that we shouldn't > be doing any of the struct slab stuff by posting your own much more limited > refactoring, that was only targeted at the compound_head() issue, which we all > agree is a distraction and not the real issue. Why are you letting yourself get > distracted by that? Kent, you can't be serious. I actually did exactly what you suggested I should have done. The struct slab patches are the right thing to do. I had one minor concern (which you seem to share) and suggested a small cleanup. Willy worried about this cleanup adding a needless compound_head() call, so *I sent patches to eliminate this call and allow this cleanup and the struct slab patches to go ahead.* My patches are to unblock Willy's. He then moved the goal posts and started talking about prefetching, but that isn't my fault. I was collaborating and putting my own time and effort where my mouth is. Can you please debug your own approach to reading these conversations?
On Mon, Oct 18, 2021 at 07:28:13PM +0100, Matthew Wilcox wrote: > On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > > On Sat, Oct 16, 2021 at 04:28:23AM +0100, Matthew Wilcox wrote: > > > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > > > As per the other email, no conceptual entry point for > > > > tail pages into either subsystem, so no ambiguity > > > > around the necessity of any compound_head() calls, > > > > directly or indirectly. It's easy to rule out > > > > wholesale, so there is no justification for > > > > incrementally annotating every single use of the page. > > > > > > The justification is that we can remove all those hidden calls to > > > compound_head(). Hundreds of bytes of text spread throughout this file. > > > > I find this line of argument highly disingenuous. > > > > No new type is necessary to remove these calls inside MM code. Migrate > > them into the callsites and remove the 99.9% very obviously bogus > > ones. The process is the same whether you switch to a new type or not. > > > > (I'll send more patches like the PageSlab() ones to that effect. It's > > easy. The only reason nobody has bothered removing those until now is > > that nobody reported regressions when they were added.) > > That kind of change is actively dangerous. Today, you can call > PageSlab() on a tail page, and it returns true. After your patch, > it returns false. Sure, there's a debug check in there that's enabled > on about 0.1% of all kernel builds, but I bet most people won't notice. 
> > We're not able to catch these kinds of mistakes at review time: > https://lore.kernel.org/linux-mm/20211001024105.3217339-1-willy@infradead.org/ > > which means it escaped the eagle eyes of (at least): > Signed-off-by: Andrey Konovalov <andreyknvl@google.com> > Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> > Reviewed-by: Dmitry Vyukov <dvyukov@google.com> > Cc: Christoph Lameter <cl@linux.com> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Will Deacon <will.deacon@arm.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > > I don't say that to shame these people. We need the compiler's help > here. If we're removing the ability to ask for whether a tail page > belongs to the slab allocator, we have to have the compiler warn us. > > I have a feeling your patch also breaks tools/vm/page-types.c As Hugh said in the meeting in response to this, "you'll look at kernel code for any amount of time, you'll find bugs". I already pointed out dangerous code from anon/file confusion somewhere in this thread. None of that is a reason not to fix it. But it should inform the approach on how we fix it. I'm not against type safety, I'm for incremental changes. And replacing an enormous subset of struct page users with an unproven new type and loosely defined interaction with other page subtypes is just not that. > > But typesafety is an entirely different argument. And to reiterate the > > main point of contention on these patches: there is no concensus among > > MM people how (or whether) we want MM-internal typesafety for pages. > > I don't think there will ever be consensus as long as you don't take > the concerns of other MM developers seriously. On Friday's call, several > people working on using large pages for anon memory told you that using > folios for anon memory would make their lives easier, and you didn't care. Nope, one person claimed that it would help, and I asked how. 
Not because I'm against typesafety, but because I wanted to know if there is an aspect in there that would specifically benefit from a shared folio type. I don't remember there being one, and I'm not against type safety for anon pages. What several people *did* say at this meeting was whether you could drop the anon stuff for now until we have consensus. > > Anyway, the email you are responding to was an offer to split the > > uncontroversial "large pages backing filesystems" part from the > > controversial "MM-internal typesafety" discussion. Several people in > > both the fs space and the mm space have now asked to do this to move > > ahead. Since you have stated in another subthread that you "want to > > get back to working on large pages in the page cache," and you never > > wanted to get involved that deeply in the struct page subtyping > > efforts, it's not clear to me why you are not taking this offer. > > I am. This email was written after trying to do just this. I dropped > the patches you were opposed to and looked at the result. It's not good. > > You seem wedded to this idea that "folios are just for file backed > memory", and that's not my proposal at all. folios are for everything. > Maybe we specialise out other types of memory later, or during, or > instead of converting something to use folios, but folios are an utterly > generic concept. That train left the station when several people said slab should not be in the folio. Once that happened, you could no longer say it'll work itself out around the edges. Now it needs a real approach to coordinating with other subtypes, including shared properties and implementation between them. The "simple" folio approach only works when it really is a wholesale replacement for *everything* that page is right now - modulo PAGE_SIZE and modulo compound tail. But it isn't that anymore, is it? Folio can't be everything and only some subtypes simultaneously. 
So when you say folio is for everything, is struct slab dead? If not, what is the relationship between them? How do you query shared properties? There really is no coherent proposal right now. These patches start an open-ended conversion in a nebulous direction. All I'm saying is: start with a reasonable, delineated scope (page cache), and if that trial balloon works out we can do the next one with lessons learned from the first. Maybe that will converge to the "simple" folio for all compound subtypes, maybe we'll move more toward explicit subtyping that implies the head/tail thing anyway. What is even the counterargument to that?
On Mon, Oct 18, 2021 at 05:56:34PM -0400, Johannes Weiner wrote: > > I don't think there will ever be consensus as long as you don't take > > the concerns of other MM developers seriously. On Friday's call, several > > people working on using large pages for anon memory told you that using > > folios for anon memory would make their lives easier, and you didn't care. > > Nope, one person claimed that it would help, and I asked how. Not > because I'm against typesafety, but because I wanted to know if there > is an aspect in there that would specifically benefit from a shared > folio type. I don't remember there being one, and I'm not against type > safety for anon pages. > > What several people *did* say at this meeting was whether you could > drop the anon stuff for now until we have consensus. My read on the meeting was that most people had nothing against the anon stuff, but asked if Willy could drop the anon parts to get past your objections and move forward. You were the only person who was vocal against including the anon parts. (Hugh nodded to some of your points, but I don't really know his position on folios in general and the anon stuff in particular). For the record: I think folios have to be applied, including the anon bits. They are useful and address long-standing issues with compound pages. Any future type-safety work can be done on top of it. I know it's not a democracy and we don't count votes here, but we have been dragging this out for months and aren't getting closer to consensus. At some point "disagree and commit" has to be considered.
On Tue, Oct 19, 2021 at 02:16:27AM +0300, Kirill A. Shutemov wrote: > On Mon, Oct 18, 2021 at 05:56:34PM -0400, Johannes Weiner wrote: > > > I don't think there will ever be consensus as long as you don't take > > > the concerns of other MM developers seriously. On Friday's call, several > > > people working on using large pages for anon memory told you that using > > > folios for anon memory would make their lives easier, and you didn't care. > > > > Nope, one person claimed that it would help, and I asked how. Not > > because I'm against typesafety, but because I wanted to know if there > > is an aspect in there that would specifically benefit from a shared > > folio type. I don't remember there being one, and I'm not against type > > safety for anon pages. > > > > What several people *did* say at this meeting was whether you could > > drop the anon stuff for now until we have consensus. > > My read on the meeting was that most people had nothing against anon > stuff, but asked if Willy could drop anon parts to get past your > objections to move forward. > > You were the only person who was vocal against including anon parts. (Hugh > nodded to some of your points, but I don't really know his position on > folios in general and anon stuff in particular). Nobody likes to be the crazy person on the soapbox, so I asked Hugh in private a few weeks back. Quoting him, with permission: : To the first and second order of approximation, you have been : speaking for me: but in a much more informed and constructive and : coherent and rational way than I would have managed myself. It's a broad and open-ended proposal with far-reaching consequences, and not everybody has the time (or foolhardiness) to engage on that. I wouldn't count silence as approval - just like I don't see approval as a sign that a person took a hard look at all the implications.
My only effort from the start has been working out unanswered questions in this proposal: Are compound pages the reliable, scalable, and memory-efficient way to do bigger page sizes? What's the scope of remaining tail pages where typesafety will continue to be missing? How do we implement code and properties shared by folios and non-folio types (like mmap/fault code for folios and network and driver pages)? There are no satisfying answers to any of these questions, but that also isn't very surprising: it's a huge scope. Lack of answers isn't failure, it's just a sign that the step size is too large and too dependent on a speculative future. It would have been great to whittle things down to a more incremental and concrete first step, which would have allowed us to keep testing the project against reality as we go through the myriad uses and corner cases of struct page that no single person can keep straight in their head. I'm grateful for the struct slab spinoff, I think it's exactly all of the above. I'm in full support of it and have dedicated time, effort and patches to help work out kinks that immediately and inevitably surfaced around the slab<->page boundary. I only hoped we could do the same for file pages first, learn from that, and then do anon pages; if they come out looking the same in the process, a unified folio would be a great trailing refactoring step. But alas here we are months later at the same impasse with the same open questions, and still talking in circles about speculative code. I don't have more time to invest in this, and I'm tired of the vitriol and ad hominems both in public and in private channels. I'm not really sure how to exit this. The reasons for my NAK are still there. But I will no longer argue or stand in the way of the patches.
On Mon, Oct 18, 2021 at 04:45:59PM -0400, Johannes Weiner wrote: > On Mon, Oct 18, 2021 at 02:12:32PM -0400, Kent Overstreet wrote: > > On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > > > I find this line of argument highly disingenuous. > > > > > > No new type is necessary to remove these calls inside MM code. Migrate > > > them into the callsites and remove the 99.9% very obviously bogus > > > ones. The process is the same whether you switch to a new type or not. > > > > Conversely, I don't see "leave all LRU code as struct page, and ignore anonymous > > pages" to be a serious counterargument. I got that you really don't want > > anonymous pages to be folios from the call Friday, but I haven't been getting > > anything that looks like a serious counterproposal from you. > > > > Think about what our goal is: we want to get to a world where our types describe > > unambiguously how our data is used. That means working towards > > - getting rid of type punning > > - struct fields that are only used for a single purpose > > How is a common type inheritance model with a generic page type and > subclasses not a counterproposal? > > And one which actually accomplishes those two things you're saying, as > opposed to a shared folio where even 'struct address_space *mapping' > is a total lie type-wise? > > Plus, really, what's the *alternative* to doing that anyway? How are > we going to implement code that operates on folios and other subtypes > of the page alike? And deal with attributes and properties that are > shared among them all? Willy's original answer to that was that folio > is just *going* to be all these things - file, anon, slab, network, > rando driver stuff.
> > I'm not *against* what you and Willy are saying. I have *genuinely > zero idea what* you are saying. So we were starting to talk more concretely last night about the splitting of struct page into multiple types, and what that means for page->lru. The basic process I've had in mind for splitting struct page up into multiple types is: create a new type for each struct in the union-of-structs, change code to refer to that type instead of struct page, then - importantly - delete those members from the union-of-structs in struct page. E.g. for struct slab, after Willy's struct slab patches, we want to delete that stuff from struct page - otherwise we've introduced new type punning where code can refer to the same members via struct page and struct slab, and it's also completely necessary in order to separately allocate these new structs and slim down struct page. Roughly what I've been envisioning for folios is that the struct in the union-of-structs with lru, mapping & index - that's what turns into folios. Note that we have a bunch of code using page->lru, page->mapping, and page->index that really shouldn't be. The buddy allocator uses page->lru for freelists, and it shouldn't be, but there's a straightforward solution for that: we need to create a new struct in the union-of-structs for free pages, and confine the buddy allocator to that (it'll be a nice cleanup, right now it's overloading both page->lru and page->private which makes no sense, and it'll give us a nice place to stick some other things). Other things that need to be fixed: - page->lru is used by the old .readpages interface for the list of pages we're doing reads to; Matthew converted most filesystems to his new and improved .readahead which thankfully no longer uses page->lru, but there's still a few filesystems that need to be converted - it looks like cifs and erofs, not sure what's going on with fs/cachefiles/. 
We need help from the maintainers of those filesystems to get that conversion done, this is holding up future cleanups. - page->mapping and page->index are used for entirely random purposes by some driver code - drivers/net/ethernet/sun/niu.c looks to be using page->mapping for a singly linked list (!). - unrelated, but worth noting: there's a fair amount of filesystem code that uses page->mapping and page->index and doesn't need to because it has it from context - it's both a performance improvement and a cleanup to change that code to not get it from the page. Basically, we need to get to a point where each field in struct page is used for one and just one thing, but that's going to take some time. You've been noting that page->mapping is used for different things depending on whether it's a file page or an anonymous page, and I agree that that's not ideal - but it's one that I'm much less concerned about because a field being used for two different things that are both core and related concepts in the kernel is less bad than fields that are used as dumping grounds for whatever is convenient - file & anon overloading page->mapping is just not the most pressing issue to me. Also, let's look at what file & anonymous pages share: - they're both mapped to userspace - they both need page->mapcount - they both share the lru code - they both need page->lru page->lru is the real decider for me, because getting rid of non-lru uses of that field looks very achievable to me, and once it's done it's one of the fields we want to delete from struct page and move to struct folio. If we leave the lru code using struct page, it creates a real problem for this approach - it means we won't be able to delete the folio struct from the union-of-structs in struct page. I'm not sure what our path forward would be. That's my resistance to trying to separate file & anon at this point. 
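The union-of-structs split Kent describes can be sketched in a few lines of userspace C. This is only an illustration of the pattern - give each user of struct page a named view of the shared storage instead of overloading ->lru and ->private - and the names free, buddy_list, and order are hypothetical, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Sketch only: one union-of-structs where each identity of the page
 * gets its own named view of the same storage. */
struct page {
	unsigned long flags;
	union {
		struct {			/* file/anon ("folio") view */
			struct list_head lru;
			void *mapping;
			unsigned long index;
		};
		struct {			/* hypothetical buddy view */
			struct list_head buddy_list;	/* was ->lru */
			unsigned int order;		/* was ->private */
		} free;
	};
};
```

The two views alias the same bytes - no memory grows - but code must now name which identity it means, which is exactly what makes deleting a view (and eventually allocating it separately) possible.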
I'm definitely not saying we shouldn't separate file & anon in the future - I don't have an opinion on whether or not it should be done, and if we do want to do that I'd want to think about doing it by embedding a "struct lru_object" into both file_folio and anon_folio and having the lru code refer to that instead of struct page - embedding an object is generally preferable to inheritance. I want to say - and I don't think I've been clear enough about this - my objection to trying to split up file & anonymous pages into separate types isn't so much based on any deep philosophical reasons (I have some ideas for making anonymous pages more like file pages that I would like to attempt, but I also heard you when you said you'd tried to do that in the past and it hadn't worked out) - my objection is that I think it would very much get in the way of shorter-term cleanups that are much more pressing. This is what I've been referring to when I've been talking about following the union-of-structs in splitting up struct page - I'm just trying to be practical. Another relevant thing we've been talking about is consolidating the types of pages that can be mapped into userspace. Right now we've got driver code mapping all sorts of rando pages into userspace, and this isn't good - pages in theory have this abstract interface that they implement, and pages mapped into userspace have a bigger and more complicated interface - i.e. a_ops.set_page_dirty; any page mapped into userspace can have this called on it via the O_DIRECT read path, and possibly other things. Right now we have drivers allocating vmalloc() memory and then mapping it into userspace, which is just bizarre - what chunk of code really owns that page, and is implementing that interface? vmalloc, or the driver? What I'd like to see happen is for those to get switched to some sort of internal device or inode, something that the driver owns and has an a_ops struct - at this point they'd just be normal file pages.
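The "embed, don't inherit" idea can be sketched like this. struct lru_object, file_folio, and anon_folio are all hypothetical names from the paragraph above, not existing kernel types; container_of_mock mirrors the kernel's container_of() pattern:

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* The piece the LRU code would operate on, embedded in each subtype
 * rather than inherited from a common base. */
struct lru_object {
	struct list_head lru;
};

struct file_folio {
	struct lru_object lo;
	void *mapping;
	unsigned long index;
};

struct anon_folio {
	struct lru_object lo;
	void *anon_vma;
};

/* LRU code takes only a struct lru_object *; a caller that knows the
 * subtype recovers it with the usual container_of() trick. */
#define container_of_mock(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct lru_object *file_folio_lru(struct file_folio *f)
{
	return &f->lo;
}
```

The shared code never sees a file_folio or anon_folio, so neither subtype lies about fields it doesn't have - which is the advantage of embedding over a single shared folio type here.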
The reason drivers are mapping vmalloc() memory into userspace is so they can get it into a contiguous kernel side memory mapping, but they could also be doing that by calling vmap() on existing pages - I think that would be much cleaner. I have no idea if this approach works for network pool pages or how those would be used, I haven't gotten that far - if someone can chime in about those that would be great. But, the end goal I'm envisioning is a world where _only_ bog standard file & anonymous pages are mapped to userspace - then _mapcount can be deleted from struct page and only needs to live in struct folio. Anyways, that's another thing to consider when thinking about whether file & anonymous pages should be the same type.
On Tue, Oct 19, 2021 at 12:11:35PM -0400, Kent Overstreet wrote: > On Mon, Oct 18, 2021 at 04:45:59PM -0400, Johannes Weiner wrote: > > On Mon, Oct 18, 2021 at 02:12:32PM -0400, Kent Overstreet wrote: > > > On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > > > > I find this line of argument highly disingenuous. > > > > > > > > No new type is necessary to remove these calls inside MM code. Migrate > > > > them into the callsites and remove the 99.9% very obviously bogus > > > > ones. The process is the same whether you switch to a new type or not. > > > > > > Conversely, I don't see "leave all LRU code as struct page, and ignore anonymous > > > pages" to be a serious counterargument. I got that you really don't want > > > anonymous pages to be folios from the call Friday, but I haven't been getting > > > anything that looks like a serious counterproposal from you. > > > > > > Think about what our goal is: we want to get to a world where our types describe > > > unambiguously how our data is used. That means working towards > > > - getting rid of type punning > > > - struct fields that are only used for a single purpose > > > > How is a common type inheritance model with a generic page type and > > subclasses not a counterproposal? > > > > And one which actually accomplishes those two things you're saying, as > > opposed to a shared folio where even 'struct address_space *mapping' > > is a total lie type-wise? > > > > Plus, really, what's the *alternative* to doing that anyway? How are > > we going to implement code that operates on folios and other subtypes > > of the page alike? And deal with attributes and properties that are > > shared among them all? Willy's original answer to that was that folio > > is just *going* to be all these things - file, anon, slab, network, > > rando driver stuff.
But since that wasn't very popular, would not get > > rid of type punning and overloaded members, would get rid of > > efficiently allocating descriptor memory etc.- what *is* the > > alternative now to common properties between split out subtypes? > > > > I'm not *against* what you and Willy are saying. I have *genuinely > > zero idea what* you are saying. > > So we were starting to talk more concretely last night about the splitting of > struct page into multiple types, and what that means for page->lru. > > The basic process I've had in mind for splitting struct page up into multiple > types is: create a new type for each struct in the union-of-structs, change code > to refer to that type instead of struct page, then - importantly - delete those > members from the union-of-structs in struct page. > > E.g. for struct slab, after Willy's struct slab patches, we want to delete that > stuff from struct page - otherwise we've introduced new type punning where code > can refer to the same members via struct page and struct slab, and it's also > completely necessary in order to separately allocate these new structs and slim > down struct page. > > Roughly what I've been envisioning for folios is that the struct in the > union-of-structs with lru, mapping & index - that's what turns into folios. > > Note that we have a bunch of code using page->lru, page->mapping, and > page->index that really shouldn't be. The buddy allocator uses page->lru for > freelists, and it shouldn't be, but there's a straightforward solution for that: > we need to create a new struct in the union-of-structs for free pages, and > confine the buddy allocator to that (it'll be a nice cleanup, right now it's > overloading both page->lru and page->private which makes no sense, and it'll > give us a nice place to stick some other things). 
> > Other things that need to be fixed: > > - page->lru is used by the old .readpages interface for the list of pages we're > doing reads to; Matthew converted most filesystems to his new and improved > .readahead which thankfully no longer uses page->lru, but there's still a few > filesystems that need to be converted - it looks like cifs and erofs, not > sure what's going on with fs/cachefiles/. We need help from the maintainers > of those filesystems to get that conversion done, this is holding up future > cleanups. The reason for using page->lru for non-LRU pages was simply that the page struct is already there, and it's an effective way to organize a variable number of temporary pages with no memory overhead beyond the page structure itself. Another benefit is that such non-LRU pages can be immediately picked off the list and added into the page cache without any pain (thus ->lru can be reused for real LRU usage). This approach was used to maximize performance (so that pages can be shared flexibly within the same read request, with no overhead beyond memory allocation/free from/to the buddy allocator) and to minimise the extra footprint. I'm fine with moving to some other way if a similar field can be used like this. Yet if no such field remains, I'm also very glad to write a patch to get rid of such usage, but I wish it could be merged _only_ together with the real final transformation; otherwise it still takes the extra memory of the old page structure and sacrifices overall performance for end users (..and thus has no benefit at all.) Thanks, Gao Xiang
On Wed, Oct 20, 2021 at 01:06:04AM +0800, Gao Xiang wrote: > On Tue, Oct 19, 2021 at 12:11:35PM -0400, Kent Overstreet wrote: > > Other things that need to be fixed: > > > > - page->lru is used by the old .readpages interface for the list of pages we're > > doing reads to; Matthew converted most filesystems to his new and improved > > .readahead which thankfully no longer uses page->lru, but there's still a few > > filesystems that need to be converted - it looks like cifs and erofs, not > > sure what's going on with fs/cachefiles/. We need help from the maintainers > > of those filesystems to get that conversion done, this is holding up future > > cleanups. > > The reason why using page->lru for non-LRU pages was just because the > page struct is already there and it's an effective way to organize > variable temporary pages without any extra memory overhead other than > page structure itself. Another benefits is that such non-LRU pages can > be immediately picked from the list and added into page cache without > any pain (thus ->lru can be reused for real lru usage). > > In order to maximize the performance (so that pages can be shared in > the same read request flexibly without extra overhead rather than > memory allocation/free from/to the buddy allocator) and minimise extra > footprint, this way was used. I'm pretty fine to transfer into some > other way instead if some similar field can be used in this way. > > Yet if no such field anymore, I'm also very glad to write a patch to > get rid of such usage, but I wish it could be merged _only_ with the > real final transformation together otherwise it still takes the extra > memory of the old page structure and sacrifices the overall performance > to end users (..thus has no benefits at all.) I haven't dived in to clean up erofs because I don't have a way to test it, and I haven't taken the time to understand exactly what it's doing. 
The old ->readpages interface gave you pages linked together on ->lru and this code seems to have been written in that era, when you would add pages to the page cache yourself. In the new scheme, the pages get added to the page cache for you, and then you take care of filling them (and marking them uptodate if the read succeeds). There's now readahead_expand() which you can call to add extra pages to the cache if the readahead request isn't compressed-block aligned. Of course, it may not succeed if we're out of memory or there were already pages in the cache. It looks like this will be quite a large change to how erofs handles compressed blocks, but if you're open to taking this on, I'd be very happy.
On Tue, Oct 19, 2021 at 12:11:35PM -0400, Kent Overstreet wrote: > I have no idea if this approach works for network pool pages or how those would > be used, I haven't gotten that far - if someone can chime in about those that Generally the driver goal is to create a shared memory buffer between kernel and user space. The two broadly common patterns are to have userspace call mmap() and the kernel side return the kernel pages from there - getting them from some kernel allocator. Or, userspace allocates the buffer and the kernel driver does pin_user_pages() to import the pages into its address space. I think it is quite feasible to provide some simple library API to manage the shared buffer through the mmap approach, and if that library wants to allocate inodes, folios and whatnot, it should be possible. It would help this idea to see Christoph's cleanup series go forward: https://lore.kernel.org/all/20200508153634.249933-1-hch@lst.de/ As it makes it a lot easier for drivers to get inodes in the first place. > would be great. But, the end goal I'm envisioning is a world where _only_ bog > standard file & anonymous pages are mapped to userspace - then _mapcount can be > deleted from struct page and only needs to live in struct folio. There has been a lot of work in the past years on mapping ZONE_DEVICE pages into userspace. Today FSDAX is kind of a mashup of a file page and a device page, but other cases are less obvious, especially DEVICE_COHERENT. Jason
Hi Matthew, On Tue, Oct 19, 2021 at 06:34:19PM +0100, Matthew Wilcox wrote: > On Wed, Oct 20, 2021 at 01:06:04AM +0800, Gao Xiang wrote: > > On Tue, Oct 19, 2021 at 12:11:35PM -0400, Kent Overstreet wrote: > > > Other things that need to be fixed: > > > > > > - page->lru is used by the old .readpages interface for the list of pages we're > > > doing reads to; Matthew converted most filesystems to his new and improved > > > .readahead which thankfully no longer uses page->lru, but there's still a few > > > filesystems that need to be converted - it looks like cifs and erofs, not > > > sure what's going on with fs/cachefiles/. We need help from the maintainers > > > of those filesystems to get that conversion done, this is holding up future > > > cleanups. > > > > The reason why using page->lru for non-LRU pages was just because the > > page struct is already there and it's an effective way to organize > > variable temporary pages without any extra memory overhead other than > > page structure itself. Another benefits is that such non-LRU pages can > > be immediately picked from the list and added into page cache without > > any pain (thus ->lru can be reused for real lru usage). > > > > In order to maximize the performance (so that pages can be shared in > > the same read request flexibly without extra overhead rather than > > memory allocation/free from/to the buddy allocator) and minimise extra > > footprint, this way was used. I'm pretty fine to transfer into some > > other way instead if some similar field can be used in this way. > > > > Yet if no such field anymore, I'm also very glad to write a patch to > > get rid of such usage, but I wish it could be merged _only_ with the > > real final transformation together otherwise it still takes the extra > > memory of the old page structure and sacrifices the overall performance > > to end users (..thus has no benefits at all.) 
> > I haven't dived in to clean up erofs because I don't have a way to test > it, and I haven't taken the time to understand exactly what it's doing. Actually, I don't think it would be a real cleanup, given the current page structure design. > > The old ->readpages interface gave you pages linked together on ->lru > and this code seems to have been written in that era, when you would > add pages to the page cache yourself. > > In the new scheme, the pages get added to the page cache for you, and > then you take care of filling them (and marking them uptodate if the > read succeeds). There's now readahead_expand() which you can call to add > extra pages to the cache if the readahead request isn't compressed-block > aligned. Of course, it may not succeed if we're out of memory or there > were already pages in the cache. Hmmm, these temporary pages in the list may be (re)used later for the page cache, or used as temporary compressed pages for some I/O, or as an lz4 decompression buffer (technically an lz77 sliding window) to temporarily hold some decompressed data within the same read request (because some pages are already mapped and we cannot expose the decompression process to userspace, among other reasons). It all works by recycling: these temporary pages may finally go into some file's page cache, or be recycled for several temporary uses many times and finally freed to the buddy allocator. > > It looks like this will be quite a large change to how erofs handles > compressed blocks, but if you're open to taking this on, I'd be very happy. For ->lru itself, the change is quite small, but it sacrifices performance. Still, I'm very glad to do it once a decision about this ->lru field is made. Thanks, Gao Xiang
Kent Overstreet <kent.overstreet@gmail.com> wrote: > > - page->lru is used by the old .readpages interface for the list of pages we're > doing reads to; Matthew converted most filesystems to his new and improved > .readahead which thankfully no longer uses page->lru, but there's still a few > filesystems that need to be converted - it looks like cifs and erofs, not > sure what's going on with fs/cachefiles/. We need help from the maintainers > of those filesystems to get that conversion done, this is holding up future > cleanups. fscache and cachefiles should be taken care of by my patchset here: https://lore.kernel.org/r/163363935000.1980952.15279841414072653108.stgit@warthog.procyon.org.uk https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-remove-old-io With that 9p, afs and ceph use netfs lib to handle readpage, readahead and part of write_begin. nfs and cifs do their own wrangling of readpages/readahead, but will call out to the cache directly to handle each page individually. At some point, cifs will hopefully be converted to use netfs lib. David
On Tue, Oct 19, 2021 at 11:16:18AM -0400, Johannes Weiner wrote: > My only effort from the start has been working out unanswered > questions in this proposal: Are compound pages the reliable, scalable, > and memory-efficient way to do bigger page sizes? What's the scope of > remaining tailpages where typesafety will continue to lack? How do we > implement code and properties shared by folios and non-folio types > (like mmap/fault code for folio and network and driver pages)? I don't think those questions need to be answered before proceeding with this patchset. They're interesting questions, to be sure, but to a large extent they're orthogonal to the changes here. I look forward to continuing to work on those problems while filesystems and the VFS continue to be converted to use folios. > I'm not really sure how to exit this. The reasons for my NAK are still > there. But I will no longer argue or stand in the way of the patches. Thank you. I appreciate that.
On 19.10.21 17:16, Johannes Weiner wrote: > On Tue, Oct 19, 2021 at 02:16:27AM +0300, Kirill A. Shutemov wrote: >> On Mon, Oct 18, 2021 at 05:56:34PM -0400, Johannes Weiner wrote: >>>> I don't think there will ever be consensus as long as you don't take >>>> the concerns of other MM developers seriously. On Friday's call, several >>>> people working on using large pages for anon memory told you that using >>>> folios for anon memory would make their lives easier, and you didn't care. >>> >>> Nope, one person claimed that it would help, and I asked how. Not >>> because I'm against typesafety, but because I wanted to know if there >>> is an aspect in there that would specifically benefit from a shared >>> folio type. I don't remember there being one, and I'm not against type >>> safety for anon pages. >>> >>> What several people *did* say at this meeting was whether you could >>> drop the anon stuff for now until we have consensus. >> >> My read on the meeting was that most of people had nothing against anon >> stuff, but asked if Willy could drop anon parts to get past your >> objections to move forward. >> >> You was the only person who was vocal against including anon pars. (Hugh >> nodded to some of your points, but I don't really know his position on >> folios in general and anon stuff in particular). > > Nobody likes to be the crazy person on the soapbox, so I asked Hugh in > private a few weeks back. Quoting him, with permission: > > : To the first and second order of approximation, you have been > : speaking for me: but in a much more informed and constructive and > : coherent and rational way than I would have managed myself. > > It's a broad and open-ended proposal with far reaching consequences, > and not everybody has the time (or foolhardiness) to engage on that. I > wouldn't count silence as approval - just like I don't see approval as > a sign that a person took a hard look at all the implications. 
> > My only effort from the start has been working out unanswered > questions in this proposal: Are compound pages the reliable, scalable, > and memory-efficient way to do bigger page sizes? What's the scope of > remaining tailpages where typesafety will continue to lack? How do we > implement code and properties shared by folios and non-folio types > (like mmap/fault code for folio and network and driver pages)? > > There are no satisfying answers to any of these questions, but that > also isn't very surprising: it's a huge scope. Lack of answers isn't > failure, it's just a sign that the step size is too large and too > dependent on a speculative future. It would have been great to whittle > things down to a more incremental and concrete first step, which would > have allowed us to keep testing the project against reality as we go > through all the myriad of uses and cornercases of struct page that no > single person can keep straight in their head. > > I'm grateful for the struct slab spinoff, I think it's exactly all of > the above. I'm in full support of it and have dedicated time, effort > and patches to help work out kinks that immediately and inevitably > surfaced around the slab<->page boundary. > > I only hoped we could do the same for file pages first, learn from > that, and then do anon pages; if they come out looking the same in the > process, a unified folio would be a great trailing refactoring step. > > But alas here we are months later at the same impasse with the same > open questions, and still talking in circles about speculative code. > I don't have more time to invest into this, and I'm tired of the > vitriol and ad-hominems both in public and in private channels. Thanks Johannes for defending your position and I can understand that you are running out of motivation+energy to defend further. 
For the record: I was happy to see the slab refactoring, although I raised some points regarding how to access properties that belong in the "struct page". As raised elsewhere, I'd also be more comfortable seeing small incremental changes/cleanups that are consistent even without having decided on an ultimate end-goal -- this includes folios. I'd be happy to see file-backed THP gaining their own, dedicated type first ("struct $whatever"), before generalizing it to folios. I'm writing this message solely to back your "not everybody has the time (or foolhardiness) to engage on that. I wouldn't count silence as approval.". While I do have the capacity to review smaller, incremental steps (see struct slab), I don't have the time+energy to grasp the full folio picture. So I also second "it's a huge scope. [...] it's just a sign that the step size is too large and too dependent on a speculative future." My 2 cents on this topic.
On Wed, Oct 20, 2021 at 09:50:58AM +0200, David Hildenbrand wrote: > For the records: I was happy to see the slab refactoring, although I > raised some points regarding how to access properties that belong into > the "struct page". I thought the slab discussion was quite productive. Unfortunately, none of our six (!) slab maintainers had anything to say about it. So I think it's pointless to proceed unless one of them weighs in and says "I'd be interested in merging something along these lines once these problems are addressed". > As raised elsewhere, I'd also be more comfortable > seeing small incremental changes/cleanups that are consistent even > without having decided on an ultimate end-goal -- this includes folios. > I'd be happy to see file-backed THP gaining their own, dedicated type > first ("struct $whatever"), before generalizing it to folios. I am genuinely confused by this. Folios are non-tail pages. That's all. There's no "ultimate end-goal". It's just a new type that lets the compiler (and humans!) know that this isn't a tail page. Some people want to take this further, and split off special types from struct page. I think that's a great idea. I'm even willing to help. But there are all kinds of places in the kernel where we handle generic pages of almost any type, and so regardless of how much we end up splitting off from struct page, we're still going to want the concept of folio. I get that in some parts of the MM, we can just assume that any struct page is a non-tail page. But that's not the case in the filemap APIs; they're pretty much all defined to return the precise page which contains the specific byte. I think that's a mistake, and I'm working to fix it. But until it is all fixed [1], having a type which says "this is not a tail page" is, frankly, essential. 
[1] which is a gargantuan job because I'm not just dealing with mm/filemap.c, but also with ~90 filesystems and things sufficiently like filesystems to have an address_space_operations of their own, including graphics drivers.
On Tue, Oct 19, 2021 at 11:16:18AM -0400, Johannes Weiner wrote: > On Tue, Oct 19, 2021 at 02:16:27AM +0300, Kirill A. Shutemov wrote: > > On Mon, Oct 18, 2021 at 05:56:34PM -0400, Johannes Weiner wrote: > > > > I don't think there will ever be consensus as long as you don't take > > > > the concerns of other MM developers seriously. On Friday's call, several > > > > people working on using large pages for anon memory told you that using > > > > folios for anon memory would make their lives easier, and you didn't care. > > > > > > Nope, one person claimed that it would help, and I asked how. Not > > > because I'm against typesafety, but because I wanted to know if there > > > is an aspect in there that would specifically benefit from a shared > > > folio type. I don't remember there being one, and I'm not against type > > > safety for anon pages. > > > > > > What several people *did* say at this meeting was whether you could > > > drop the anon stuff for now until we have consensus. > > > > My read on the meeting was that most of people had nothing against anon > > stuff, but asked if Willy could drop anon parts to get past your > > objections to move forward. > > > > You was the only person who was vocal against including anon pars. (Hugh > > nodded to some of your points, but I don't really know his position on > > folios in general and anon stuff in particular). > > Nobody likes to be the crazy person on the soapbox, so I asked Hugh in > private a few weeks back. Quoting him, with permission: > > : To the first and second order of approximation, you have been > : speaking for me: but in a much more informed and constructive and > : coherent and rational way than I would have managed myself. > > It's a broad and open-ended proposal with far reaching consequences, > and not everybody has the time (or foolhardiness) to engage on that. 
I > wouldn't count silence as approval - just like I don't see approval as > a sign that a person took a hard look at all the implications. > > My only effort from the start has been working out unanswered > questions in this proposal: Are compound pages the reliable, scalable, > and memory-efficient way to do bigger page sizes? What's the scope of > remaining tailpages where typesafety will continue to lack? How do we > implement code and properties shared by folios and non-folio types > (like mmap/fault code for folio and network and driver pages)? > > There are no satisfying answers to any of these questions, but that > also isn't very surprising: it's a huge scope. Lack of answers isn't > failure, it's just a sign that the step size is too large and too > dependent on a speculative future. It would have been great to whittle > things down to a more incremental and concrete first step, which would > have allowed us to keep testing the project against reality as we go > through all the myriad of uses and cornercases of struct page that no > single person can keep straight in their head. > > I'm grateful for the struct slab spinoff, I think it's exactly all of > the above. I'm in full support of it and have dedicated time, effort > and patches to help work out kinks that immediately and inevitably > surfaced around the slab<->page boundary. Thank you for at least (belatedly) voicing your appreciation of the struct slab patches, that much wasn't at all clear to me or Matthew during the initial discussion. > I only hoped we could do the same for file pages first, learn from > that, and then do anon pages; if they come out looking the same in the > process, a unified folio would be a great trailing refactoring step. > > But alas here we are months later at the same impasse with the same > open questions, and still talking in circles about speculative code. 
> I don't have more time to invest into this, and I'm tired of the > vitriol and ad-hominems both in public and in private channels. > > I'm not really sure how to exit this. The reasons for my NAK are still > there. But I will no longer argue or stand in the way of the patches. Johannes, what I gathered from the meeting on Friday is that all you seem to care about at this point is whether or not file and anonymous pages are the same type. You got most of what you wanted regarding the direction of folios - they're no longer targeted at all compound pages! We're working on breaking struct page up into multiple types! But I'm frustrated by you disengaging like this, after I went to a lot of effort to bring you and your ideas into the discussion, but... if you're going to stubbornly cling to this point and refuse to hear other ideas the way you have been, I honestly don't know what to tell you. And after all this it's hard to see the wider issues with struct page actually getting tackled. Shame.
On Wed, Oct 20, 2021 at 01:54:20AM +0800, Gao Xiang wrote: > On Tue, Oct 19, 2021 at 06:34:19PM +0100, Matthew Wilcox wrote: > > It looks like this will be quite a large change to how erofs handles > > compressed blocks, but if you're open to taking this on, I'd be very happy. > > For ->lru, it's quite small, but it sacrifices the performance. Yet I'm > very glad to do if some decision of this ->lru field is determined. I would be very appreciative if you were willing to do the work, and I know others would be too. These kinds of cleanups may seem small individually, but they make a _very_ real difference when we're looking kernel-wide at how feasible these struct page changes are - and even if they don't happen, it really helps the understandability of the code if we can move towards a single struct field always being used for a single purpose in our core data types.
On 20.10.21 19:26, Matthew Wilcox wrote: > On Wed, Oct 20, 2021 at 09:50:58AM +0200, David Hildenbrand wrote: >> For the records: I was happy to see the slab refactoring, although I >> raised some points regarding how to access properties that belong into >> the "struct page". > > I thought the slab discussion was quite productive. Unfortunately, > none of our six (!) slab maintainers had anything to say about it. So I > think it's pointless to proceed unless one of them weighs in and says > "I'd be interested in merging something along these lines once these > problems are addressed". Yes, that's really unfortunate ... :( > >> As raised elsewhere, I'd also be more comfortable >> seeing small incremental changes/cleanups that are consistent even >> without having decided on an ultimate end-goal -- this includes folios. >> I'd be happy to see file-backed THP gaining their own, dedicated type >> first ("struct $whatever"), before generalizing it to folios. > > I am genuinely confused by this. > > Folios are non-tail pages. That's all. There's no "ultimate end-goal". > It's just a new type that lets the compiler (and humans!) know that this > isn't a tail page. > > Some people want to take this further, and split off special types from > struct page. I think that's a great idea. I'm even willing to help. > But there are all kinds of places in the kernel where we handle generic > pages of almost any type, and so regardless of how much we end up > splitting off from struct page, we're still going to want the concept > of folio. And I guess that generic mechanism is where the controversy starts and where people start having different expectation. IMHO you can tell that from the whole "naming" discussion/controversy. I always thought, why not call it "struct compound_page" until I think someone commented that it might not be a compound page but only a single base page somewhere. 
But I got tired (most probably just like you) of reading all the wild ideas and all the side discussions. Nobody can follow all that. If we limited this to "this is an anon THP" and called it "struct anon_thp", I assume the end result would be significantly easier. An anon THP only makes sense with more than one page; otherwise it's simply a base page and has to be treated completely differently by most MM code (esp. THP splitting). Similarly, call it "struct filemap" (bad name) and define it as either a single page or, for a compound page, its head page (what you call a folio). Let's think about this (and this is something that might happen for real): assume we have to add a field for handling something about anon THP to the struct page (let's assume in the head page for simplicity). Where would we add it? To "struct folio", exposing it to all other folios that don't really need it because it's so special? To "struct page", where it actually doesn't belong after all the discussions? And if we had to move that field into a tail page, it would get even more "tricky". Of course, we could let all special types inherit from "struct folio", which inherits from "struct page" ... but I am not convinced that we actually want that. After all, we're C programmers ;) But enough with another side-discussion :) Yes, the types are something I think is very reasonable to have now that we've discussed it, and I think they're a valuable result of the whole discussion. I consider them the cleaner, smaller step. > > I get that in some parts of the MM, we can just assume that any struct > page is a non-tail page. But that's not the case in the filemap APIs; > they're pretty much all defined to return the precise page which contains > the specific byte. I think that's a mistake, and I'm working to fix it. > But until it is all fixed [1], having a type which says "this is not a > tail page" is, frankly, essential. 
I can completely understand that the filemap API wants and needs such a concept. I think having some way to do that for the filemap API is very much desired.
On Wed, Oct 20, 2021 at 08:04:56PM +0200, David Hildenbrand wrote: > real): assume we have to add a field for handling something about anon > THP in the struct page (let's assume in the head page for simplicity). > Where would we add it? To "struct folio" and expose it to all other > folios that don't really need it because it's so special? To "struct > page" where it actually doesn't belong after all the discussions? And if > we would have to move that field it into a tail page, it would get even > more "tricky". > > Of course, we could let all special types inherit from "struct folio", > which inherit from "struct page" ... but I am not convinced that we > actually want that. After all, we're C programmers ;) > > But enough with another side-discussion :) FYI, with my block and direct I/O developer hat on I really, really want to have the folio for both file and anon pages. Because to make the get_user_pages path a _lot_ more efficient it should store folios. And to make that work I need them to work for file and anon pages because for get_user_pages and related code they are treated exactly the same.
On 21.10.21 08:51, Christoph Hellwig wrote: > On Wed, Oct 20, 2021 at 08:04:56PM +0200, David Hildenbrand wrote: >> real): assume we have to add a field for handling something about anon >> THP in the struct page (let's assume in the head page for simplicity). >> Where would we add it? To "struct folio" and expose it to all other >> folios that don't really need it because it's so special? To "struct >> page" where it actually doesn't belong after all the discussions? And if >> we would have to move that field it into a tail page, it would get even >> more "tricky". >> >> Of course, we could let all special types inherit from "struct folio", >> which inherit from "struct page" ... but I am not convinced that we >> actually want that. After all, we're C programmers ;) >> >> But enough with another side-discussion :) > > FYI, with my block and direct I/O developer hat on I really, really > want to have the folio for both file and anon pages. Because to make > the get_user_pages path a _lot_ more efficient it should store folios. > And to make that work I need them to work for file and anon pages > because for get_user_pages and related code they are treated exactly > the same. Thanks, I can understand that. And IMHO that would be even possible with split types; the function prototype will simply have to look a little more fancy instead of replacing "struct page" by "struct folio". :)
On Thu, Oct 21, 2021 at 09:21:17AM +0200, David Hildenbrand wrote: > On 21.10.21 08:51, Christoph Hellwig wrote: > > FYI, with my block and direct I/O developer hat on I really, really > > want to have the folio for both file and anon pages. Because to make > > the get_user_pages path a _lot_ more efficient it should store folios. > > And to make that work I need them to work for file and anon pages > > because for get_user_pages and related code they are treated exactly > > the same. ++ > Thanks, I can understand that. And IMHO that would be even possible with > split types; the function prototype will simply have to look a little > more fancy instead of replacing "struct page" by "struct folio". :) Possible yes, but might it be a little premature to split them?
On 21.10.21 14:03, Kent Overstreet wrote: > On Thu, Oct 21, 2021 at 09:21:17AM +0200, David Hildenbrand wrote: >> On 21.10.21 08:51, Christoph Hellwig wrote: >>> FYI, with my block and direct I/O developer hat on I really, really >>> want to have the folio for both file and anon pages. Because to make >>> the get_user_pages path a _lot_ more efficient it should store folios. >>> And to make that work I need them to work for file and anon pages >>> because for get_user_pages and related code they are treated exactly >>> the same. > > ++ > >> Thanks, I can understand that. And IMHO that would be even possible with >> split types; the function prototype will simply have to look a little >> more fancy instead of replacing "struct page" by "struct folio". :) > > Possible yes, but might it be a little premature to split them? Personally, I think it's the right thing to do to introduce something limited like "struct filemap" (again, bad name, i.e., folio restricted to the filemap API) first and avoid introducing a generic folio thingy. So I'd even consider going with folios all the way premature. But I assume what to consider premature and what not depends on the point of view already. And maybe that's the biggest point where we all disagree. Anyhow, what I don't quite understand is the following: as the first important goal, we want to improve the filemap API; that's a noble goal and I highly appreciate Willy's work. To improve the API, there is absolutely no need to introduce a generic folio. Yet we argue about whether a generic folio vs. a filemap-specific folio is the right thing to do as a first step. My opinion after all the discussions: use a dedicated type with a clear name to solve the immediate filemap API issue. Leave the remainder alone for now. Less code to touch, less subsystems to involve (well, still a lot), less people to upset, less discussions to have, faster review, faster upstream, faster progress. A small but reasonable step. 
But maybe I'm just living in a dream world :)
On Thu, Oct 21, 2021 at 02:35:32PM +0200, David Hildenbrand wrote: > My opinion after all the discussions: use a dedicate type with a clear > name to solve the immediate filemap API issue. Leave the remainder alone > for now. Less code to touch, less subsystems to involve (well, still a > lot), less people to upset, less discussions to have, faster review, > faster upstream, faster progress. A small but reasonable step. I don't get it. I mean I'm not the MM expert, I've only been touching most areas of it occasionally for the last 20 years, but anon and file pages have way more in common both in terms of use cases and implementation than what is different (unlike some of the other (ab)uses of struct page). What is the point of splitting it now when there are tons of use cases where they are used absolutely interchangeably both in consumers of the API and the implementation?
On Thu, Oct 21, 2021 at 02:35:32PM +0200, David Hildenbrand wrote: > My opinion after all the discussions: use a dedicate type with a clear > name to solve the immediate filemap API issue. Leave the remainder alone > for now. Less code to touch, less subsystems to involve (well, still a > lot), less people to upset, less discussions to have, faster review, > faster upstream, faster progress. A small but reasonable step. I didn't change anything I didn't need to. File pages go onto the LRU list, so I need to change the LRU code to handle arbitrary-sized folios instead of pages which are either order-0 or order-9. Every function that I convert in this patchset is either used by another function in this patchset, or by the fs/iomap conversion that I have staged for the next merge window after folios goes in.
On 21.10.21 14:38, Christoph Hellwig wrote: > On Thu, Oct 21, 2021 at 02:35:32PM +0200, David Hildenbrand wrote: >> My opinion after all the discussions: use a dedicate type with a clear >> name to solve the immediate filemap API issue. Leave the remainder alone >> for now. Less code to touch, less subsystems to involve (well, still a >> lot), less people to upset, less discussions to have, faster review, >> faster upstream, faster progress. A small but reasonable step. > > I don't get it. I mean I'm not the MM expert, I've only been touching > most areas of it occasionally for the last 20 years, but anon and file > pages have way more in common both in terms of use cases and You most certainly have way more MM expertise than me ;) I'm just a random MM developer, so everybody can feel free to just ignore what I'm saying here. I didn't NACK anything, I just consider a lot of things that Johannes raised reasonable. > implementation than what is different (unlike some of the other (ab)uses > of struct page). What is the point of splitting it now when there are > tons of use cases where they are used absolutely interchangable both > in consumers of the API and the implementation? I guess in an ideal world, we'd have multiple abstractions. We could clearly express for a function what type it expects. We'd have a type for something passed on the filemap API. We'd have a type for anon THP (or even just an anon page). We'd have a type that abstracts both. With that in mind, and not planning with what we'll actually end up with, to me it makes perfect sense to teach the filemap API to consume the expected type first. And I am not convinced that the folio as is ("not a tail page") is the right abstraction we actually want to pass around in places where we expect either anon or file pages -- or only anon pages or only file pages. Again, my 2 cents.
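[Editorial note: the "split types with fancier prototypes" idea above can be illustrated in plain C. This is a minimal userspace sketch under stated assumptions; all type and function names here (file_mem, anon_mem, lru_flags) are hypothetical, invented for illustration, and do not exist in the kernel.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: each wrapper is a distinct C type containing the
 * same underlying struct page, so the compiler rejects passing a
 * file_mem where an anon_mem is expected. */
struct page { unsigned long flags; };

struct file_mem { struct page page; };	/* page cache pages */
struct anon_mem { struct page page; };	/* anonymous pages */

/* A function that genuinely handles both kinds takes the "fancier"
 * prototype: both types spelled out, exactly one of them non-NULL,
 * instead of a single bare struct page argument. */
static unsigned long lru_flags(struct file_mem *fm, struct anon_mem *am)
{
	struct page *p = fm ? &fm->page : &am->page;
	return p->flags;
}
```

A tagged union or a shared intermediate type would be alternative ways to express the same dual-type prototype.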
On Wed, Oct 20, 2021 at 01:39:10PM -0400, Kent Overstreet wrote: > Thank you for at least (belatedly) voicing your appreciation of the struct slab > patches, that much wasn't at all clear to me or Matthew during the initial > discussion. The first sentence I wrote in response to that series is: "I like this whole patch series, but I think for memcg this is a particularly nice cleanup." - https://lore.kernel.org/all/YWRwrka5h4Q5buca@cmpxchg.org/ The second email I wrote started with: "This looks great to me. It's a huge step in disentangling struct page, and it's already showing very cool downstream effects in somewhat unexpected places like the memory cgroup controller." - https://lore.kernel.org/all/YWSZctm%2F2yxu19BV@cmpxchg.org/ Then I sent a pageflag cleanup series specifically to help improve the clarity of the struct slab split a bit. Truly ambiguous stuff..? > > I only hoped we could do the same for file pages first, learn from > > that, and then do anon pages; if they come out looking the same in the > > process, a unified folio would be a great trailing refactoring step. > > > > But alas here we are months later at the same impasse with the same > > open questions, and still talking in circles about speculative code. > > I don't have more time to invest into this, and I'm tired of the > > vitriol and ad-hominems both in public and in private channels. > > > > I'm not really sure how to exit this. The reasons for my NAK are still > > there. But I will no longer argue or stand in the way of the patches. > > Johannes, what I gathered from the meeting on Friday is that all you seem to > care about at this point is whether or not file and anonymous pages are the same > type. No. I'm going to bow out because - as the above confirms again - the communication around these patches is utterly broken. But I'm not leaving on a misrepresentation of my stance after having spent months thinking about these patches and their implications. 
Here is my summary of the discussion, and my conclusion: The premise of the folio was initially to simply be a type that says: I'm the headpage for one or more pages. Never a tailpage. Cool. However, after we talked about what that actually means, we seem to have some consensus on the following: 1) If folio is to be a generic headpage, it'll be the new dumping ground for slab, network, drivers etc. Nobody is psyched about this, hence the idea to split the page into subtypes which already resulted in the struct slab patches. 2) If higher-order allocations are going to be the norm, it's wasteful to statically allocate full descriptors at a 4k granularity. Hence the push to eliminate overloading and do on-demand allocation of necessary descriptor space. I think that's accurate, but for the record: is there anybody who disagrees with this and insists that struct folio should continue to be the dumping ground for all kinds of memory types? Let's assume the answer is "no" for now and move on. If folios are NOT the common headpage type, it begs two questions: 1) What subtype(s) of page SHOULD it represent? This is somewhat unclear at this time. Some say file+anon. It's also been suggested everything userspace-mappable, but that would again bring back major type punning. Who knows? Vocal proponents of the folio type have made conflicting statements on this, which certainly gives me pause. 2) What IS the common type used for attributes and code shared between subtypes? For example: if a folio is anon+file, then the code that maps memory to userspace needs a generic type in order to map both folios and network pages. Same as the page table walkers, and things like GUP. Will this common type be struct page? Something new? Are we going to duplicate the implementation for each subtype? Another example: GUP can return tailpages. I don't see how it could return folio with even its most generic definition of "headpage". 
(But bottomline, it's not clear how folio can be the universal headpage type and simultaneously avoid being the type dumping ground that the page was. Maybe I'm not creative enough?) Anyway. I can even be convinced that we can figure out the exact fault lines along which we split the page down the road. My worry is more about 2). A shared type and generic code is likely to emerge regardless of how we split it. Think about it, the only world in which that isn't true would be one in which either a) page subtypes are all the same, or b) the subtypes have nothing in common and both are clearly bogus. I think we're being overly dismissive of this question. It seems to me that *the core challenge* in splitting out the various subtypes of struct page is to properly identify the generic domain and private domains of the subtypes, and then clearly and consistently implement boundaries! If this isn't a deliberate effort, things will get messy and confusing quickly. These boundary quirks were the first thing that showed up in the struct slab patches, and finding a clean and intuitive fix didn't seem trivial to agree on (to my own surprise.) So. All of the above leads me to these conclusions: Once you acknowledge the need for a shared abstraction layer, forcing a binary choice between anon and file doesn't make sense: they have some stuff in common, and some stuff is different. Some code can be shared naturally, some cannot. This isn't unlike the VFS inode and the various fs-specific inode types. It's a chance for the code to finally reflect the sizable but incomplete overlap of the two. And once you need a model for generic and private attributes and code anyway, doing just file at first - even if it isn't along a substruct boundary - becomes a more reasonable, smaller step for splitting things out of the page. Just the fs interface and page cache bits, as opposed to also reclaim, lru, migration, memcg, all at once. 
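[Editorial note: the VFS analogy above refers to a concrete kernel pattern: a generic struct inode is embedded inside each filesystem-specific inode, and container_of() recovers the private type. The following is a reduced userspace model; the struct members are simplified stand-ins, not the real layouts.]

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct inode {				/* generic domain: shared code */
	unsigned long ino;
};

struct ext4_inode_info {		/* private domain: fs-specific */
	unsigned long extra_state;
	struct inode vfs_inode;		/* embedded generic part */
};

/* Generic code passes struct inode around; fs code recovers its own
 * type from the embedded member. The same shape could carry a generic
 * page core shared by file- and anon-specific types. */
static struct ext4_inode_info *EXT4_I(struct inode *inode)
{
	return container_of(inode, struct ext4_inode_info, vfs_inode);
}
```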
Obviously, because it's a smaller step, it won't go as far toward shrinking struct page and separately allocatable descriptors. But it also doesn't work against that effort. And there are still a ton of bootstrapping questions around separately allocating descriptors anyway. So it strikes me as an acceptable tradeoff for now. There is something else that the smaller step would be great for: doing file first would force us to properly deal with the generic vs private domain delineation, and come up with a sound strategy for it. With private file code and shared anon/file code. And it would do so inside a much smaller and deliberate changeset, where we could give it the proper attention. As opposed to letting it emerge ad-hoc and drowning out the case-by-case decisions in huge, churny series. So that's my ACTUAL stance. (For completeness, here are the other considerations I mentioned in the past: I don't think compound page allocations are a good path to larger page sizes, based on the THP experience at FB, Google's THP experience, and testimony from other people who have worked on fragmentation and compaction; but I'm willing to punt on that pending more data. I also don't think the head/tailpage question is interesting enough to make it the central identity of the object we're passing around MM code. Or that we need a new type to get rid of bogus compound_head() calls. But whatever at this point.) Counterarguments I've heard to the above: Wouldn't a generic struct page layer eat into the goal of shrinking struct page down to two words? Well sure, but if all that's left in it at the end is a pointer, a list_head and some flags used by every subtype, we've done pretty well on that front. It's all tradeoffs. Also, way too many cornercases to be thinking in absolutes already. Would it give up type safety in the LRU code? Not really, if all additions are through typed headpages. 
We don't need to worry about tailpages in that code, the same way we don't need to check PageReserved() in there: there is no plausible route for such pages. Don't you want tailpage safety in anon code? I'm not against that, but it's not like the current folio patches provide it. They just set up a direction (without MM consensus). Either way, it'd happen later on. Why are my eyes glazing over when I read all this? Well, mine glazed over writing all this. struct page is a lot of stuff, and IMO these patches touch too much of it at once. Anyway, that's my exhaustive take on things.
On Thu, Oct 21, 2021 at 05:37:41PM -0400, Johannes Weiner wrote: > Here is my summary of the discussion, and my conclusion: Thank you for this. It's the clearest, most useful post on this thread, including my own. It really highlights the substantial points that should be discussed. > The premise of the folio was initially to simply be a type that says: > I'm the headpage for one or more pages. Never a tailpage. Cool. > > However, after we talked about what that actually means, we seem to > have some consensus on the following: > > 1) If folio is to be a generic headpage, it'll be the new > dumping ground for slab, network, drivers etc. Nobody is > psyched about this, hence the idea to split the page into > subtypes which already resulted in the struct slab patches. > > 2) If higher-order allocations are going to be the norm, it's > wasteful to statically allocate full descriptors at a 4k > granularity. Hence the push to eliminate overloading and do > on-demand allocation of necessary descriptor space. > > I think that's accurate, but for the record: is there anybody who > disagrees with this and insists that struct folio should continue to > be the dumping ground for all kinds of memory types? I think there's a useful distinction to be drawn between "where we're going with this patchset", "where we're going in the next six-twelve months" and "where we're going eventually". I think we have minor differences of opinion on the answers to those questions, and they can be resolved as we go, instead of up-front. My answer to that question is that, while this full conversion is not part of this patch, struct folio is logically: struct folio { ... almost everything that's currently in struct page ... 
};

struct page {
	unsigned long flags;
	unsigned long compound_head;
	union {
		struct {	/* First tail page only */
			unsigned char compound_dtor;
			unsigned char compound_order;
			atomic_t compound_mapcount;
			unsigned int compound_nr;
		};
		struct {	/* Second tail page only */
			atomic_t hpage_pinned_refcount;
			struct list_head deferred_list;
		};
		unsigned long padding1[4];
	};
	unsigned int padding2[2];
#ifdef CONFIG_MEMCG
	unsigned long padding3;
#endif
#ifdef WANT_PAGE_VIRTUAL
	void *virtual;
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
};

(I'm open to being told I have some of that wrong, eg maybe _last_cpupid is actually part of struct folio and isn't a per-page property at all)

I'd like to get there in the next year. I think dynamically allocating memory descriptors is more than a year out.

Now, as far as struct folio being a dumping ground, I would like to split other things out from struct folio. Let me address that below.

> Let's assume the answer is "no" for now and move on.
>
> If folios are NOT the common headpage type, it begs two questions:
>
> 1) What subtype(s) of page SHOULD it represent?
>
> This is somewhat unclear at this time. Some say file+anon. It's also
> been suggested everything userspace-mappable, but that would again
> bring back major type punning. Who knows?
>
> Vocal proponents of the folio type have made conflicting statements
> on this, which certainly gives me pause.
>
> 2) What IS the common type used for attributes and code shared
> between subtypes?
>
> For example: if a folio is anon+file, then the code that maps memory
> to userspace needs a generic type in order to map both folios and
> network pages. Same as the page table walkers, and things like GUP.
>
> Will this common type be struct page? Something new? Are we going to
> duplicate the implementation for each subtype?
>
> Another example: GUP can return tailpages. I don't see how it could
> return folio with even its most generic definition of "headpage".
> > (But bottomline, it's not clear how folio can be the universal > headpage type and simultaneously avoid being the type dumping ground > that the page was. Maybe I'm not creative enough?) This whole section is predicated on "If it is NOT the headpage type", but I think this is a great list of why it _should_ be the generic headpage type. To answer a question in here; GUP should continue to return precise pages because that's what its callers expect. But we should have a better interface than GUP which returns a rather more compressed list (something like today's biovec). > Anyway. I can even be convinved that we can figure out the exact fault > lines along which we split the page down the road. > > My worry is more about 2). A shared type and generic code is likely to > emerge regardless of how we split it. Think about it, the only world > in which that isn't true would be one in which either > > a) page subtypes are all the same, or > b) the subtypes have nothing in common > > and both are clearly bogus. Amen! I'm convinced that pgtable, slab and zsmalloc uses of struct page can all be split out into their own types instead of being folios. They have little-to-nothing in common with anon+file; they can't be mapped into userspace and they can't be on the LRU. The only situation you can find them in is something like compaction which walks PFNs. I don't think we can split out ZONE_DEVICE and netpool into their own types. While they can't be on the LRU, they can be mapped to userspace, like random device drivers. So they can be found by GUP, and we want (need) to be able to go to folio from there in order to get, lock and set a folio as dirty. Also, they have a mapcount as well as a refcount. The real question, I think, is whether it's worth splitting anon & file pages out from generic pages. 
I can see arguments for it, but I can also see arguments against it (whether it's two types: lru_mem and folio, three types: anon_mem, file_mem and folio or even four types: ksm_mem, anon_mem and file_mem). I don't think a compelling argument has been made either way. Perhaps you could comment on how you'd see separate anon_mem and file_mem types working for the memcg code? Would you want to have separate lock_anon_memcg() and lock_file_memcg(), or would you want them to be cast to a common type like lock_folio_memcg()?

P.S. One variant we haven't explored is separating type specialisation from finding the head page. eg, instead of having

	struct slab *slab = page_slab(page);

we could have:

	struct slab *slab = folio_slab(page_folio(page));

I don't think it's particularly worth doing, but Kent mused about it at one point.
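[Editorial note: the two-step variant in the P.S., separating "find the head page" from "specialise the type", can be modelled with toy structs. The compound_head encoding below (head address with the low bit set for tail pages) mirrors the kernel's scheme, but the layouts and the page_folio()/folio_slab() bodies are simplified illustrations, not the real implementations.]

```c
#include <assert.h>

struct page {
	unsigned long flags;
	unsigned long compound_head;	/* (head page address | 1) if tail */
};
struct folio { struct page page; };
struct slab { struct folio folio; };

/* Step 1: resolve any page, head or tail, to its head page. */
static struct folio *page_folio(struct page *p)
{
	if (p->compound_head & 1)
		p = (struct page *)(p->compound_head - 1);
	return (struct folio *)p;
}

/* Step 2: pure type specialisation; no head-page walk involved. */
static struct slab *folio_slab(struct folio *f)
{
	return (struct slab *)f;
}
```

Composing the two as folio_slab(page_folio(page)) makes the head lookup explicit at the call site, whereas a single page_slab(page) would hide it.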
On 22.10.21 03:52, Matthew Wilcox wrote: > On Thu, Oct 21, 2021 at 05:37:41PM -0400, Johannes Weiner wrote: >> Here is my summary of the discussion, and my conclusion: > > Thank you for this. It's the clearest, most useful post on this thread, > including my own. It really highlights the substantial points that > should be discussed. > >> The premise of the folio was initially to simply be a type that says: >> I'm the headpage for one or more pages. Never a tailpage. Cool. >> >> However, after we talked about what that actually means, we seem to >> have some consensus on the following: >> >> 1) If folio is to be a generic headpage, it'll be the new >> dumping ground for slab, network, drivers etc. Nobody is >> psyched about this, hence the idea to split the page into >> subtypes which already resulted in the struct slab patches. >> >> 2) If higher-order allocations are going to be the norm, it's >> wasteful to statically allocate full descriptors at a 4k >> granularity. Hence the push to eliminate overloading and do >> on-demand allocation of necessary descriptor space. >> >> I think that's accurate, but for the record: is there anybody who >> disagrees with this and insists that struct folio should continue to >> be the dumping ground for all kinds of memory types? > > I think there's a useful distinction to be drawn between "where we're > going with this patchset", "where we're going in the next six-twelve > months" and "where we're going eventually". I think we have minor > differences of opinion on the answers to those questions, and they can > be resolved as we go, instead of up-front. > > My answer to that question is that, while this full conversion is not > part of this patch, struct folio is logically: > > struct folio { > ... almost everything that's currently in struct page ... 
> }; > > struct page { > unsigned long flags; > unsigned long compound_head; > union { > struct { /* First tail page only */ > unsigned char compound_dtor; > unsigned char compound_order; > atomic_t compound_mapcount; > unsigned int compound_nr; > }; > struct { /* Second tail page only */ > atomic_t hpage_pinned_refcount; > struct list_head deferred_list; > }; > unsigned long padding1[4]; > }; > unsigned int padding2[2]; > #ifdef CONFIG_MEMCG > unsigned long padding3; > #endif > #ifdef WANT_PAGE_VIRTUAL > void *virtual; > #endif > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > int _last_cpupid; > #endif > }; > > (I'm open to being told I have some of that wrong, eg maybe _last_cpupid > is actually part of struct folio and isn't a per-page property at all) > > I'd like to get there in the next year. I think dynamically allocating > memory descriptors is more than a year out. > > Now, as far as struct folio being a dumping group, I would like to > split other things out from struct folio. Let me address that below. > >> Let's assume the answer is "no" for now and move on. >> >> If folios are NOT the common headpage type, it begs two questions: >> >> 1) What subtype(s) of page SHOULD it represent? >> >> This is somewhat unclear at this time. Some say file+anon. >> It's also been suggested everything userspace-mappable, but >> that would again bring back major type punning. Who knows? >> >> Vocal proponents of the folio type have made conflicting >> statements on this, which certainly gives me pause. >> >> 2) What IS the common type used for attributes and code shared >> between subtypes? >> >> For example: if a folio is anon+file, then the code that >> maps memory to userspace needs a generic type in order to >> map both folios and network pages. Same as the page table >> walkers, and things like GUP. >> >> Will this common type be struct page? Something new? Are we >> going to duplicate the implementation for each subtype? >> >> Another example: GUP can return tailpages. 
I don't see how >> it could return folio with even its most generic definition >> of "headpage". >> >> (But bottomline, it's not clear how folio can be the universal >> headpage type and simultaneously avoid being the type dumping ground >> that the page was. Maybe I'm not creative enough?) > > This whole section is predicated on "If it is NOT the headpage type", > but I think this is a great list of why it _should_ be the generic > headpage type. > > To answer a questions in here; GUP should continue to return precise > pages because that's what its callers expect. But we should have a > better interface than GUP which returns a rather more compressed list > (something like today's biovec). > >> Anyway. I can even be convinved that we can figure out the exact fault >> lines along which we split the page down the road. >> >> My worry is more about 2). A shared type and generic code is likely to >> emerge regardless of how we split it. Think about it, the only world >> in which that isn't true would be one in which either >> >> a) page subtypes are all the same, or >> b) the subtypes have nothing in common >> >> and both are clearly bogus. > > Amen! > > I'm convinced that pgtable, slab and zsmalloc uses of struct page can all > be split out into their own types instead of being folios. They have > little-to-nothing in common with anon+file; they can't be mapped into > userspace and they can't be on the LRU. The only situation you can find > them in is something like compaction which walks PFNs. > > I don't think we can split out ZONE_DEVICE and netpool into their own > types. While they can't be on the LRU, they can be mapped to userspace, > like random device drivers. So they can be found by GUP, and we want > (need) to be able to go to folio from there in order to get, lock and > set a folio as dirty. Also, they have a mapcount as well as a refcount. > > The real question, I think, is whether it's worth splitting anon & file > pages out from generic pages. 
I can see arguments for it, but I can also > see arguments against it (whether it's two types: lru_mem and folio, > three types: anon_mem, file_mem and folio or even four types: ksm_mem, > anon_mem and file_mem). I don't think a compelling argument has been > made either way. > > Perhaps you could comment on how you'd see separate anon_mem and > file_mem types working for the memcg code? Would you want to have > separate lock_anon_memcg() and lock_file_memcg(), or would you want > them to be cast to a common type like lock_folio_memcg()?

FWIW, something like this would roughly express what I've been mumbling about:

 anon_mem   file_mem
     |          |
     -----|------
       lru_mem       slab
          |           |
          -------------
                |
              page

I wouldn't include folios in this picture, because IMHO folios as of now are actually what we want to be "lru_mem", just with a much clearer name+description (again, IMHO).

Going from file_mem -> page is easy, just casting pointers. Going from page -> file_mem requires going to the head page if it's a compound page.

But we expect most interfaces to pass around a proper type (e.g., lru_mem) instead of a page, which avoids having to look up the compound head page. And each function can express which type it actually wants to consume. The filemap API wants to consume file_mem, so it should use that.

And IMHO, with something above in mind and not having a clue which additional layers we'll really need, or which additional leaves we want to have, we would start with the leaves (e.g., file_mem, anon_mem, slab) and work our way towards the root. Just like we already started with slab. Maybe that makes sense.
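[Editorial note: the hierarchy in the drawing above maps naturally onto embedded structs, where going from file_mem to page is, as stated, just casting pointers. A minimal sketch with the same illustrative names; none of these types exist in the kernel.]

```c
#include <assert.h>

struct page { unsigned long flags; };
struct lru_mem { struct page page; };	/* anything on the LRU */
struct file_mem { struct lru_mem lru; };
struct anon_mem { struct lru_mem lru; };

/* Walking toward the root of the hierarchy is a pointer adjustment... */
static struct page *lru_mem_page(struct lru_mem *lm)
{
	return &lm->page;
}

/* ...and LRU code consumes lru_mem, accepting either subtype via the
 * embedded member; the filemap API would consume file_mem directly. */
static unsigned long lru_mem_flags(struct lru_mem *lm)
{
	return lru_mem_page(lm)->flags;
}
```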
On Fri, Oct 22, 2021 at 09:59:05AM +0200, David Hildenbrand wrote: > something like this would roughly express what I've been mumbling about: > > anon_mem file_mem > | | > ------|------ > lru_mem slab > | | > ------------- > | > page > > I wouldn't include folios in this picture, because IMHO folios as of now > are actually what we want to be "lru_mem", just which a much clearer > name+description (again, IMHO). I think folios are a superset of lru_mem. To enhance your drawing:

page
    folio
        lru_mem
            anon_mem
                ksm
            file_mem
        netpool
        devmem
        zonedev
    slab
    pgtable
    buddy
    zsmalloc
    vmalloc

I have a little list of memory types here: https://kernelnewbies.org/MemoryTypes Let me know if anything is missing. > Going from file_mem -> page is easy, just casting pointers. > Going from page -> file_mem requires going to the head page if it's a > compound page. > > But we expect most interfaces to pass around a proper type (e.g., > lru_mem) instead of a page, which avoids having to lookup the compund > head page. And each function can express which type it actually wants to > consume. The filmap API wants to consume file_mem, so it should use that. > > And IMHO, with something above in mind and not having a clue which > additional layers we'll really need, or which additional leaves we want > to have, we would start with the leaves (e.g., file_mem, anon_mem, slab) > and work our way towards the root. That assumes that the "root" layers already handle compound pages properly. For example, nothing in mm/page-writeback.c does; it assumes everything is an order-0 page. So working in the opposite direction makes sense because it tells us what has already been converted and is thus safe to call. And starting with file_mem makes the supposition that it's worth splitting file_mem from anon_mem. I believe that's one or two steps further than it's worth, but I can be convinced otherwise. 
For example, do we have examples of file pages being passed to routines that expect anon pages? Most routines that I've looked at expect to see both file & anon pages, and treat them either identically or do slightly different things. But those are just the functions I've looked at; your experience may be quite different.
On 22.10.21 15:01, Matthew Wilcox wrote: > On Fri, Oct 22, 2021 at 09:59:05AM +0200, David Hildenbrand wrote: >> something like this would roughly express what I've been mumbling about: >> >> anon_mem file_mem >> | | >> ------|------ >> lru_mem slab >> | | >> ------------- >> | >> page >> >> I wouldn't include folios in this picture, because IMHO folios as of now >> are actually what we want to be "lru_mem", just which a much clearer >> name+description (again, IMHO). > > I think folios are a superset of lru_mem. To enhance your drawing: > In the picture below we want "folio" to be the abstraction of "mappable into user space", after reading your link below and reading your graph, correct? Like calling it "user_mem" instead. Because any of these types would imply that we're looking at the head page (if it's a compound page). And we could (or even already have?) have other types that cannot be mapped to user space that are actually a compound page. > page > folio > lru_mem > anon_mem > ksm > file_mem > netpool > devmem > zonedev > slab > pgtable > buddy > zsmalloc > vmalloc > > I have a little list of memory types here: > https://kernelnewbies.org/MemoryTypes > > Let me know if anything is missing. hugetlbfs pages might deserve a dedicated type, right? > >> Going from file_mem -> page is easy, just casting pointers. >> Going from page -> file_mem requires going to the head page if it's a >> compound page. >> >> But we expect most interfaces to pass around a proper type (e.g., >> lru_mem) instead of a page, which avoids having to lookup the compund >> head page. And each function can express which type it actually wants to >> consume. The filmap API wants to consume file_mem, so it should use that. >> >> And IMHO, with something above in mind and not having a clue which >> additional layers we'll really need, or which additional leaves we want >> to have, we would start with the leaves (e.g., file_mem, anon_mem, slab) >> and work our way towards the root. 
Just like we already started with slab. > > That assumes that the "root" layers already handle compound pages > properly. For example, nothing in mm/page-writeback.c does; it assumes > everything is an order-0 page. So working in the opposite direction > makes sense because it tells us what has already been converted and is > thus safe to call. Right, as long as the lower layers receive a "struct page", they have to assume it's "anything" -- IOW a random base page. We need some temporary logic when transitioning from "typed" code into "struct page" code that doesn't talk compound pages yet, I agree. And I think the different types used actually would tell us what has been converted and what not. Whenever you have to go from type -> "struct page" we have to be very careful. > > And starting with file_mem makes the supposition that it's worth splitting > file_mem from anon_mem. I believe that's one or two steps further than > it's worth, but I can be convinced otherwise. For example, do we have > examples of file pages being passed to routines that expect anon pages? That would be a BUG, so I hope we don't have it ;) > Most routines that I've looked at expect to see both file & anon pages, Right, many of them do. Which tells me that they share a common type in many places. Let's consider LRU code

static inline int folio_is_file_lru(struct folio *folio)
{
	return !folio_swapbacked(folio);
}

I would say we don't really want to pass folios here. We actually want to pass something reasonable, like "lru_mem". But yes, it's just doing what "struct page" used to do via page_is_file_lru(). Let's consider

folio_wait_writeback(struct folio *folio)

Do we actually want to pass in a folio here? Would we actually want to pass in lru_mem here or even something else? > and treat them either identically or do slightly different things. > But those are just the functions I've looked at; your experience may be > quite different. I assume when it comes to LRU, writeback, ...
the behavior is very similar or at least the current functions just decide internally what to do based on e.g., ..._is_file_lru(). I don't know if it's best to keep hiding that functionality within an abstracted type or just provide two separate functions for anon and file. folios mostly mimic what the old struct page used to do, introducing similar functions. Maybe the reason we branch off within these functions is because it just made sense when passing around "struct page" and not having something clearer at hand that let the caller do the branch. For the cases of LRU I looked at, it somewhat makes sense to just do it internally. Looking at some core MM code, like mm/huge_memory.c, and seeing all the PageAnon() specializations, having a dedicated anon_mem type might be valuable. But at this point it's hard to tell if splitting up these functions would actually be desirable. We're knee-deep in the type discussion now and I appreciate it. I can understand that folios are currently really just a "not a tail page" concept and mimic a lot of what we already inherited from the old "struct page" world.
On Fri, Oct 22, 2021 at 04:40:24PM +0200, David Hildenbrand wrote: > On 22.10.21 15:01, Matthew Wilcox wrote: > > On Fri, Oct 22, 2021 at 09:59:05AM +0200, David Hildenbrand wrote: > >> something like this would roughly express what I've been mumbling about:
> >>
> >>  anon_mem   file_mem
> >>     |           |
> >>     ------|------
> >>  lru_mem           slab
> >>     |                |
> >>     ------------------
> >>              |
> >>            page
> >>
> >> I wouldn't include folios in this picture, because IMHO folios as of now > >> are actually what we want to be "lru_mem", just with a much clearer > >> name+description (again, IMHO). > > > > I think folios are a superset of lru_mem. To enhance your drawing: > > > > In the picture below we want "folio" to be the abstraction of "mappable > into user space", after reading your link below and reading your graph, > correct? Like calling it "user_mem" instead. Hmm. Actually, we want a new layer in the ontology:

page
   folio
      mappable
         lru_mem
            anon_mem
               ksm
            file_mem
         netpool
         devmem
         zonedev
         vmalloc
         zsmalloc
         dmapool
         devmem (*)
      slab
      pgtable
      buddy

(*) yes, devmem appears twice; some is mappable and some is not

The ontology is kind of confusing because *every* page is part of a folio. Sometimes it's a folio of one page (eg vmalloc). Which means that it's legitimate to call page_folio() on a slab page and then call folio_test_slab(). It's not the direction we want to go though. We're also inconsistent about whether we consider an entire compound page / folio the thing which is mapped, or whether each individual page in the compound page / folio can be mapped. See how differently file-THP and anon-THP are handled in rmap, for example. I think that was probably a mistake. > Because any of these types would imply that we're looking at the head > page (if it's a compound page). And we could (or even already have?) > have other types that cannot be mapped to user space that are actually a > compound page. Sure, slabs are compound pages which cannot be mapped to userspace.
> > I have a little list of memory types here: > > https://kernelnewbies.org/MemoryTypes > > > > Let me know if anything is missing. > > hugetlbfs pages might deserve a dedicated type, right? Not sure. Aren't they just file pages (albeit sometimes treated specially, which is one of the mistakes we need to fix)? > > And starting with file_mem makes the supposition that it's worth splitting > > file_mem from anon_mem. I believe that's one or two steps further than > > it's worth, but I can be convinced otherwise. For example, do we have > > examples of file pages being passed to routines that expect anon pages? > > That would be a BUG, so I hope we don't have it ;) Right. I'm asking, did we fix any bugs in the last year or two that were caused by this kind of mismatch and would be prevented by using a different type? There's about half a dozen bugs we've had in the last year that were caused by passing tail pages to functions that were expecting head pages. I can think of one problem we have, which is that (for a few filesystems which have opted into this), we can pass an anon page into ->readpage() and we've had problems with those filesystems then mishandling the anon page. The solution to this problem is not to pass an lru_mem to readpage, but to use a different fs operation to read swap pages. > Let's consider folio_wait_writeback(struct folio *folio) > > Do we actually want to pass in a folio here? Would we actually want to > pass in lru_mem here or even something else? Well, let's look at the callers (for simplicity, look at Linus' current tree). Other than the ones in filesystems which we can assume have file pages, mm/migrate.c has __unmap_and_move(). What type should migrate_pages() have and pass around? > Looking at some core MM code, like mm/huge_memory.c, and seeing all the > PageAnon() specializations, having a dedicated anon_mem type might be > valuable. But at this point it's hard to tell if splitting up these > functions would actually be desirable. 
Yes. That's my point; it *might* be desirable. I have no objections to it, but the people doing the work need to show the benefits. I'm showing the benefits to folios -- fewer bugs, smaller code, larger pages in the page cache leading to faster systems. I acknowledge the costs in terms of churn. You can see folios as a first step to disentangling some of the users of struct page. It certainly won't be the last step. But I'd really like to stop having theoretical discussions of memory types and get on with writing code. If that means we modify the fs APIs again in twelve months to replace folios with file_mem, well, I'm OK with that.
On Sat, Oct 23, 2021 at 03:22:35AM +0100, Matthew Wilcox wrote: > You can see folios as a first step to disentangling some of the users > of struct page. It certainly won't be the last step. But I'd really > like to stop having theoretical discussions of memory types and get on > with writing code. Agreed. I think folios are really important to sort out the mess around compound pages ASAP. I'm a lot more lukewarm on the other splits. Yes, struct page is a mess, but I'm not sure creating gazillions of new types solves that mess. Getting rid of a bunch of the crazy optimizations that abuse struct page fields might be a better first step - or rather after the first step of folios, which fix real bugs in compound handling and do enable sane handling of compound pages in the page cache. > If that means we modify the fs APIs again in twelve > months to replace folios with file_mem, well, I'm OK with that. I suspect we won't even need that so quickly, if at all, but I'd rather have a little more churn than block this important work forever.
>> In the picture below we want "folio" to be the abstraction of "mappable >> into user space", after reading your link below and reading your graph, >> correct? Like calling it "user_mem" instead. > > Hmm. Actually, we want a new layer in the ontology:
>
> page
>    folio
>       mappable
>          lru_mem
>             anon_mem
>                ksm
>             file_mem
>          netpool
>          devmem
>          zonedev
>          vmalloc
>          zsmalloc
>          dmapool
>          devmem (*)
>       slab
>       pgtable
>       buddy
>
> (*) yes, devmem appears twice; some is mappable and some is not
>
I mostly agree, to 99%, with the above and I think that's a valuable outcome of the discussion. What I don't yet understand is why we would require the type "folio" at all. This will be my last question: you're the folio expert, which interfaces do you think would, with the above done right, actually consume a folio such that we would consequently need it? I would assume that there would be no real need for them. Say we have "struct lru_mem" and we want to test if it's an anon_mem for example to upcast. Say the function to perform the check is something called "lru_mem_test_anon()" for example. Instead of

folio_test_anon(lru_mem_to_folio())

we'd do

_PageAnon(lru_mem_to_page())

whereby _PageAnon() is just a variant that does no implicit compound head lookup -- however you would want to call that. Because we know that lru_mem doesn't point to a tail page. I imagine the same would hold for any other type of accesses that go via a page type, except that we might not always go directly via the "struct page" but instead via a casted type (e.g., cast file_mem -> lru_mem and call the corresponding helper that implements the magic). > The ontology is kind of confusing because *every* page is part of a > folio. Sometimes it's a folio of one page (eg vmalloc). Which means > that it's legitimate to call page_folio() on a slab page and then call > folio_test_slab(). It's not the direction we want to go though. That tackles part of the problem I'm having with having a dedicated "folio" type in the picture above.
A folio is literally *any page* as long as it's not a tail page :) > > We're also inconsistent about whether we consider an entire compound > page / folio the thing which is mapped, or whether each individual page > in the compound page / folio can be mapped. See how differently file-THP > and anon-THP are handled in rmap, for example. I think that was probably > a mistake. Yes. And whenever I think about "why do we want to split both types" the thought that keeps dominating is "splitting and migrating anon THP is just very different from any other THP". > >> Because any of these types would imply that we're looking at the head >> page (if it's a compound page). And we could (or even already have?) >> have other types that cannot be mapped to user space that are actually a >> compound page. > > Sure, slabs are compound pages which cannot be mapped to userspace. > >>> I have a little list of memory types here: >>> https://kernelnewbies.org/MemoryTypes >>> >>> Let me know if anything is missing. >> >> hugetlbfs pages might deserve a dedicated type, right? > > Not sure. Aren't they just file pages (albeit sometimes treated > specially, which is one of the mistakes we need to fix)? From all the special-casing in core-mm and remembering that they make excessive use of compound-tail members, my impression was that they might look like file pages but are in many cases very different. <offtopic> Just for the records, I could imagine a type spanning multiple struct pages, to handle the cases right now that actually store data in tail page metadata. Like having "struct hugetlb" that is actually X*sizeof(struct page) and instead of all these crazy compound tail page lookups, we'd just be able to reference the relevant members via "struct hugetlb" directly. We can do that for types we know are actually compound pages of a certain size -- like hugetlbfs. </offtopic> > >>> And starting with file_mem makes the supposition that it's worth splitting >>> file_mem from anon_mem. 
I believe that's one or two steps further than >>> it's worth, but I can be convinced otherwise. For example, do we have >>> examples of file pages being passed to routines that expect anon pages? >> >> That would be a BUG, so I hope we don't have it ;) > > Right. I'm asking, did we fix any bugs in the last year or two that > were caused by this kind of mismatch and would be prevented by using > a different type? There's about half a dozen bugs we've had in the > last year that were caused by passing tail pages to functions that > were expecting head pages. For my part, I don't recall either writing (well, it's not my area of expertise) or reviewing such patches. I do assume that many type checks catch that early during testing. I do recall reviewing some patches that remove setting page flags on (IIRC) anon pages that just don't make any sense, but were not harmful. <example> I keep stumbling over type checks that I think might just be due to old cruft we're dragging along, due to the way we for example extended THP. Like __split_huge_page(). I can spot two PageAnon(head) calls which end up looking up the head page again. Then, we call remap_page(), which doesn't make any sense for !PageAnon(), thus we end up doing a third call to PageAnon(head). In __split_huge_page_tail() we check PageAnon(head) again for every invocation. I'm not saying that we should rewrite __split_huge_page() completely, or that this cannot be cleaned up differently. I'm rather imagining that splitting out a "struct anon_mem" might turn things cleaner and avoid many of the type checks and consequently also more head page lookups. Again, this is most probably a bad example, I just wanted to share something that I noticed. </example> Passing "struct page *" to random functions just has to let these functions
* Eventually lookup or at least verify that it's not a tail page
* Eventually lookup or at least verify that it's the right type.
And some functions do the same lookup over and over again. > > I can think of one problem we have, which is that (for a few filesystems > which have opted into this), we can pass an anon page into ->readpage() > and we've had problems with those filesystems then mishandling the > anon page. The solution to this problem is not to pass an lru_mem to > readpage, but to use a different fs operation to read swap pages. Interesting example! > >> Let's consider folio_wait_writeback(struct folio *folio) >> >> Do we actually want to pass in a folio here? Would we actually want to >> pass in lru_mem here or even something else? > > Well, let's look at the callers (for simplicity, look at Linus' > current tree). Other than the ones in filesystems which we can assume > have file pages, mm/migrate.c has __unmap_and_move(). What type should > migrate_pages() have and pass around? That's an interesting point. Ideally it should deal with an abstract type "struct migratable", which would include lru and !lru migratable pages (e.g., balloon compaction). The current function name indicates that we're working on pages ("migrate_pages") :) so the upcast would have to happen internally unless we'd change the interface or even split it up ("migrate_lru_mems()"). But yes, that's an interesting case. > >> Looking at some core MM code, like mm/huge_memory.c, and seeing all the >> PageAnon() specializations, having a dedicated anon_mem type might be >> valuable. But at this point it's hard to tell if splitting up these >> functions would actually be desirable. > > Yes. That's my point; it *might* be desirable. I have no objections to > it, but the people doing the work need to show the benefits. I'm showing > the benefits to folios -- fewer bugs, smaller code, larger pages in the > page cache leading to faster systems. I acknowledge the costs in terms > of churn. See my bad example above.
From the "bitwise" discussion I get the feeling that some people care about type safety (including me) :) > > You can see folios as a first step to disentangling some of the users > of struct page. It certainly won't be the last step. But I'd really > like to stop having theoretical discussions of memory types and get on > with writing code. If that means we modify the fs APIs again in twelve > months to replace folios with file_mem, well, I'm OK with that. I know, the crowd is screaming "we want folios, we need folios, get out of the way". I know that the *compound page* handling is a mess and that we want something to change that. The point I am making is that folios are not necessarily what we *need*. Types as discussed above are really just the basic idea of a folio lifted to the next level, one that avoids not only any kind of PageTail checks but also any kind of type checks we have splattered all over the place. IMHO that's a huge win when it comes to code readability and maintainability. This also tackles the point Johannes made: folios being the dumping ground for everything. And he has a point, because folios are really just "not tail pages", so consequently they will 99% just mimic what "struct page" does, and we all know what that means. Your patches introduce the concept of folio across many layers and your point is to eventually clean up later and eventually remove it from all layers again. I can understand that approach, yet I am at least asking the question if this is the right order to do this. And again, I am not blocking this, I think cleaning up compound pages is very nice. I'm asking questions to see how the concept of folios would fit in long-term and if it would be required at all if types are done right. And I think a valuable result of this discussion, at least to me, is that:
* I can understand why we want (many parts of) the filemap API to consume an abstracted type instead of file_mem and anon_mem.
* I understand that compound pages are a fact and properly teaching the different layers and subsystems how to handle them cleanly is not something radical. It's just the natural and clean thing to do.
* I believe types as discussed above are realistic and comparatively easy to add. I believe they are much more realistic than a bunch of other ideas I heard throughout the last couple of months.
I acknowledge that defragmentation is a real problem, though. But it has been and most probably will remain a different problem than just getting compound page handling right. Again, I appreciate this discussion. I know you're sick and tired of folio discussions, so I'll stop asking questions.
On Sat, Oct 23, 2021 at 11:58:42AM +0200, David Hildenbrand wrote: > I know, the crowd is screaming "we want folios, we need folios, get out > of the way". I know that the *compound page* handling is a mess and that > we want something to change that. The point I am making is that folios > are not necessarily what we *need*. > > Types as discussed above are really just using the basic idea of a folio > lifted to the next level that not only avoid any kind of PageTail checks > but also any kind of type checks we have splattered all over the place. > IMHO that's a huge win when it comes to code readability and > maintainability. This also tackles the point Johannes made: folios being > the dumping ground for everything. And he has a point, because folios > are really just "not tail pages", so consequently they will 99% just > mimic what "struct page" does, and we all know what that means. Look, even if folios go this direction of being the compound page replacement, the "new dumping ground" argument is just completely bogus. In introducing new types and type safety for struct page, it's not reasonable to try to solve everything at once - we don't know what an ideal end solution is going to look like, we can't see that far ahead. What is a reasonable approach is looking for where the fault lines are in the way struct page is used now, cutting along those lines, looking at the result, then cutting it up some more. If the first new type still inherits most of the mess in struct page but it solves real problems, that's not a failure, that's normal incremental progress! -------- More than that, I think you and Johannes heard what I was saying about imagining what the ideal end solution would look like with infinite refactoring and you two have been running way too far with that idea - the stuff you guys are talking about sounds overengineered to me - inheritance hierarchies before we've introduced the first new type!
The point of such thought experiments is to imagine how simple things could be - and also to not take such thought experiments too seriously, because when we start refactoring real world code, that's when we discover what's actually _possible_. I ran into a major roadblock when I tried converting buddy allocator freelists to radix trees: freeing a page may require allocating a new page for the radix tree freelist, which is fine normally - we're freeing a page after all - but not if it's highmem. So right now I'm not sure if getting struct page down to two words is even possible. Oh well. > Your patches introduce the concept of folio across many layers and your > point is to eventually clean up later and eventually remove it from all > layers again. I can understand that approach, yet I am at least asking > the question if this is the right order to do this. > > And again, I am not blocking this, I think cleaning up compound pages is > very nice. I'm asking questions to see how the concept of folios would > fit in long-term and if it would be required at all if types are done right. I'm also not really seeing the need to introduce folios as a replacement for all of compound pages, though - I think limiting it to file & anon and using the union-of-structs in struct page as the fault lines for introducing new types would be the reasonable thing to do. The struct slab patches were great, it's a real shame that the slab maintainers have been completely absent. Also, introducing new types to describe our current uses of struct page isn't the only thing we should be doing - as we do that, that will (is!) uncover a lot of places where our ontology of struct page uses is just nonsensical (all the types of pages mapped into userspace!) - and part of our mission should be to clean those up. That does turn things into a much bigger project than what Matthew signed up for, but we shouldn't all be sitting on the sidelines here...
On Sat, Oct 23, 2021 at 12:00:38PM -0400, Kent Overstreet wrote: > I ran into a major roadblock when I tried converting buddy allocator freelists > to radix trees: freeing a page may require allocating a new page for the radix > tree freelist, which is fine normally - we're freeing a page after all - but not > if it's highmem. So right now I'm not sure if getting struct page down to two > words is even possible. Oh well. I have a design in mind that I think avoids the problem. It's somewhat based on Bonwick's vmem paper, but not exactly. I need to write it up. > > Your patches introduce the concept of folio across many layers and your > > point is to eventually clean up later and eventually remove it from all > > layers again. I can understand that approach, yet I am at least asking > > the question if this is the right order to do this. > > > > And again, I am not blocking this, I think cleaning up compound pages is > > very nice. I'm asking questions to see how the concept of folios would > > fit in long-term and if it would be required at all if types are done right. > > I'm also not really seeing the need to introduce folios as a replacement for all > of compound pages, though - I think limiting it to file & anon and using the > union-of-structs in struct page as the fault lines for introducing new types > would be the reasonable thing to do. The struct slab patches were great, it's a > real shame that the slab maintainers have been completely absent. Right. Folios are for unspecialised head pages. If we decide to specialise further in the future, that's great! I think David misunderstood me slightly; I don't know that specialising file + anon pages (the aforementioned lru_mem) is the right approach. It might be! But it needs someone to try it, and find the advantages & disadvantages. > Also introducing new types to be describing our current using of struct page > isn't the only thing we should be doing - as we do that, that will (is!) 
uncover > a lot of places where our ontology of struct page uses is just nonsensical (all > the types of pages mapped into userspace!) - and part of our mission should be > to clean those up. > > That does turn things into a much bigger project than what Matthew signed up > for, but we shouldn't all be sitting on the sidelines here... I'm happy to help. Indeed I may take on some of these sub-projects myself. I just don't want the perfect to be the enemy of the good.
On Sat, Oct 23, 2021 at 10:41:41PM +0100, Matthew Wilcox wrote: > On Sat, Oct 23, 2021 at 12:00:38PM -0400, Kent Overstreet wrote: > > I ran into a major roadblock when I tried converting buddy allocator freelists > > to radix trees: freeing a page may require allocating a new page for the radix > > tree freelist, which is fine normally - we're freeing a page after all - but not > > if it's highmem. So right now I'm not sure if getting struct page down to two > > words is even possible. Oh well. > > I have a design in mind that I think avoids the problem. It's somewhat > based on Bonwick's vmem paper, but not exactly. I need to write it up. I am intrigued... Care to drop some hints? > Right. Folios are for unspecialised head pages. If we decide > to specialise further in the future, that's great! I think David > misunderstood me slightly; I don't know that specialising file + anon > pages (the aforementioned lru_mem) is the right approach. It might be! > But it needs someone to try it, and find the advantages & disadvantages. Well, that's where your current patches are basically headed, aren't they? As I understand it, it's just file and some of the anon code that's converted so far. Are you thinking more along the lines of converting everything that can be mapped to userspace to folios? I think that would make a lot of sense given that converting the weird things to file pages isn't likely to happen any time soon, and it would let us convert gup() to return folios, as Christoph noted. > > > Also introducing new types to be describing our current using of struct page > > isn't the only thing we should be doing - as we do that, that will (is!) uncover > > a lot of places where our ontology of struct page uses is just nonsensical (all > > the types of pages mapped into userspace!) - and part of our mission should be > > to clean those up.
> > > > That does turn things into a much bigger project than what Matthew signed up > > for, but we shouldn't all be sitting on the sidelines here... > > I'm happy to help. Indeed I may take on some of these sub-projects > myself. I just don't want the perfect to be the enemy of the good. Agreed!
On Fri, Oct 22, 2021 at 02:52:31AM +0100, Matthew Wilcox wrote: > > Anyway. I can even be convinved that we can figure out the exact fault > > lines along which we split the page down the road. > > > > My worry is more about 2). A shared type and generic code is likely to > > emerge regardless of how we split it. Think about it, the only world > > in which that isn't true would be one in which either > > > > a) page subtypes are all the same, or > > b) the subtypes have nothing in common > > > > and both are clearly bogus. > > Amen! > > I'm convinced that pgtable, slab and zsmalloc uses of struct page can all > be split out into their own types instead of being folios. They have > little-to-nothing in common with anon+file; they can't be mapped into > userspace and they can't be on the LRU. The only situation you can find > them in is something like compaction which walks PFNs. They can all be accounted to a cgroup. pgtables are tracked the same as other __GFP_ACCOUNT pages (pipe buffers and kernel stacks right now from a quick grep, but as you can guess that's open-ended). So if those all aren't folios, the generic type and the interfacing object for memcg and accounting would continue to be the page. > Perhaps you could comment on how you'd see separate anon_mem and > file_mem types working for the memcg code? Would you want to have > separate lock_anon_memcg() and lock_file_memcg(), or would you want > them to be cast to a common type like lock_folio_memcg()? That should be lock_<generic>_memcg() since it actually serializes and protects the same thing for all subtypes (unlike lock_page()!). The memcg interface is fully type agnostic nowadays, but it also needs to be able to handle any subtype. It should continue to interface with the broadest, most generic definition of "chunk of memory". 
Notably it does not do tail pages (and I don't see how it ever would), so it could in theory use the folio - but only if the folio is really the systematic replacement of absolutely *everything* that isn't a tail page - including pgtables, kernel stack, pipe buffers, and all other random alloc_page() calls spread throughout the code base. Not just conceptually, but an actual wholesale replacement of struct page throughout allocation sites. I'm not sure that's realistic. So I'm thinking struct page will likely be the interfacing object for memcg for the foreseeable future.
On Mon, Oct 25, 2021 at 11:35:25AM -0400, Johannes Weiner wrote: > On Fri, Oct 22, 2021 at 02:52:31AM +0100, Matthew Wilcox wrote: > > > Anyway. I can even be convinved that we can figure out the exact fault > > > lines along which we split the page down the road. > > > > > > My worry is more about 2). A shared type and generic code is likely to > > > emerge regardless of how we split it. Think about it, the only world > > > in which that isn't true would be one in which either > > > > > > a) page subtypes are all the same, or > > > b) the subtypes have nothing in common > > > > > > and both are clearly bogus. > > > > Amen! > > > > I'm convinced that pgtable, slab and zsmalloc uses of struct page can all > > be split out into their own types instead of being folios. They have > > little-to-nothing in common with anon+file; they can't be mapped into > > userspace and they can't be on the LRU. The only situation you can find > > them in is something like compaction which walks PFNs. > > They can all be accounted to a cgroup. pgtables are tracked the same > as other __GFP_ACCOUNT pages (pipe buffers and kernel stacks right now > from a quick grep, but as you can guess that's open-ended). Oh, this is good information! > So if those all aren't folios, the generic type and the interfacing > object for memcg and accounting would continue to be the page. > > > Perhaps you could comment on how you'd see separate anon_mem and > > file_mem types working for the memcg code? Would you want to have > > separate lock_anon_memcg() and lock_file_memcg(), or would you want > > them to be cast to a common type like lock_folio_memcg()? > > That should be lock_<generic>_memcg() since it actually serializes and > protects the same thing for all subtypes (unlike lock_page()!). > > The memcg interface is fully type agnostic nowadays, but it also needs > to be able to handle any subtype. It should continue to interface with > the broadest, most generic definition of "chunk of memory". 
Some of the memory descriptors might prefer to keep their memcg_data at a different offset from the start of the struct. Can we accommodate that, or do we ever get handed a specialised memory descriptor and then have to cast back to an unspecialised descriptor?

(The LRU list would be an example of this; the list_head must be at the same offset in all memory descriptors which use the LRU list.)
On Mon, Oct 25, 2021 at 11:35:25AM -0400, Johannes Weiner wrote:
> Notably it does not do tailpages (and I don't see how it ever would),
> so it could in theory use the folio - but only if the folio is really
> the systematic replacement of absolutely *everything* that isn't a
> tailpage - including pgtables, kernel stack, pipe buffers, and all
> other random alloc_page() calls spread throughout the code base. Not
> just conceptually, but an actual wholesale replacement of struct page
> throughout allocation sites.
>
> I'm not sure that's realistic. So I'm thinking struct page will likely
> be the interfacing object for memcg for the foreseeable future.

Interesting. We were also just discussing how in the block layer, bvecs can currently point to multiple pages - this is the multipage bvec work that Ming did. It made bio segment merging a lot cheaper by moving it from the layer that maps bvecs to sglists up to bio_add_page(), and got rid of the need for segment counting.

But with the upper layers transitioning to compound pages - i.e. keeping contiguous stuff together as a unit - we're going to want to switch bvecs to pointing to compound pages, and ditch all the code that breaks up a bvec into individual 4k pages when we iterate over them; we also won't need or want any kind of page/segment merging anymore, which is really cool.

But since bios can do IO to/from basically any type of memory, this is another argument in favor of folios becoming the replacement for all or essentially all compound pages. The alternative would be changing bvecs to only point to head pages, which I do think would be completely workable with appropriate assertions. We don't want to prevent doing block IO to/from slab memory - there's a lot of places where we do block IO to memory that isn't exposed to userspace (e.g. filesystem metadata, other weirder paths) - so if bvecs point to folios, then at least slab needs to be a subtype of folios, and folios need to be all or most compound pages.
I've been against folios being the replacement for all compound pages because this is C, trying to do a lot with types is a pain in the ass, and I think in general nested inheritance hierarchies tend to not be the way to go. But I'm definitely keeping an open mind...