Message ID | YSPwmNNuuQhXNToQ@casper.infradead.org (mailing list archive)
---|---
State | New
Series | [GIT,PULL] Memory folios for v5.15
On Mon, Aug 23, 2021 at 08:01:44PM +0100, Matthew Wilcox wrote:
> Hi Linus,
>
> I'm sending this pull request a few days before the merge window opens so you have time to think about it. I don't intend to make any further changes to the branch, so I've created the tag and signed it. It's been in Stephen's next tree for a few weeks with only minor problems (now addressed).
>
> The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems: some functions that take a struct page expect only a head page, while others expect the precise page containing a particular byte.
>
> This pull request converts just parts of the core MM and the page cache. For 5.16, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.17, multi-page folios should be ready.
>
> The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE.
>
> I'd like to thank all my reviewers who've offered review/ack tags:
>
> Christoph Hellwig <hch@lst.de>
> David Howells <dhowells@redhat.com>
> Jan Kara <jack@suse.cz>
> Jeff Layton <jlayton@kernel.org>
> Johannes Weiner <hannes@cmpxchg.org>

Just to clarify, I'm only on this list because I acked 3 smaller, independent memcg cleanup patches in this series. I have repeatedly expressed strong reservations over folios themselves.

The arguments for a better data interface between mm and filesystem in light of variable page sizes are plentiful and convincing. But from an MM point of view, it's far from clear where the delineation between the page and the folio is, and what the endgame is supposed to look like.

On one hand, the ambition appears to be to substitute folio for everything that could be a base page or a compound page, even inside core MM code. Since there are very few places in the MM code that expressly deal with tail pages in the first place, this amounts to a conversion of most MM code - including the LRU management, reclaim, rmap, migrate, swap, page fault code etc. - away from "the page".

However, this far exceeds the goal of a better mm-fs interface. And the value proposition of a full MM-internal conversion, including e.g. the less exposed anon page handling, is much more nebulous. It's been proposed to leave anon pages out, but IMO to keep that direction maintainable, the folio would have to be translated to a page quite early when entering MM code, rather than propagating it inward, in order to avoid huge, massively overlapping page and folio APIs.

It's also not clear to me that using the same abstraction for compound pages and the file cache object is future proof. It's evident from scalability issues in the allocator, reclaim, compaction, etc. that with current memory sizes and IO devices, we're hitting the limits of efficiently managing memory in 4k base pages per default. It's also clear that we'll continue to have a need for 4k cache granularity for quite a few workloads that work with large numbers of small files. I'm not sure how this could be resolved other than divorcing the idea of a (larger) base page from the idea of cache entries that can correspond, if necessary, to memory chunks smaller than a default page.

A longer thread on that can be found here:
https://lore.kernel.org/linux-fsdevel/YFja%2FLRC1NI6quL6@cmpxchg.org/

As an MM stakeholder, I don't think folios are the answer for MM code.
On Mon, Aug 23, 2021 at 2:25 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On one hand, the ambition appears to be to substitute folio for everything that could be a base page or a compound page even inside core MM code. Since there are very few places in the MM code that expressly deal with tail pages in the first place, this amounts to a conversion of most MM code - including the LRU management, reclaim, rmap, migrate, swap, page fault code etc. - away from "the page".

Yeah, honestly, I would have preferred to see this done the exact reverse way: make the rule be that "struct page" is always a head page, and anything that isn't a head page would be called something else.

Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.

That said, I see why Willy did it the way he did - it was easier to do it incrementally. But I do think it ends up with an end result that is kind of topsy-turvy, where the common "this is the core allocation" gets the odd "folio" name, and the simpler "page" name is for things that almost nobody should even care about.

I'd have personally preferred to call the head page just a "page", and other pages "subpage" or something like that. I think that would be much more intuitive than "folio/page".

              Linus
On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> On Mon, Aug 23, 2021 at 08:01:44PM +0100, Matthew Wilcox wrote:
>
> Just to clarify, I'm only on this list because I acked 3 smaller, independent memcg cleanup patches in this series. I have repeatedly expressed strong reservations over folios themselves.

I thought I'd addressed all your concerns. I'm sorry I misunderstood and did not intend to misrepresent your position.

> The arguments for a better data interface between mm and filesystem in light of variable page sizes are plentiful and convincing. But from an MM point of view, it's far from clear where the delineation between the page and the folio is, and what the endgame is supposed to look like.
>
> On one hand, the ambition appears to be to substitute folio for everything that could be a base page or a compound page even inside core MM code. Since there are very few places in the MM code that expressly deal with tail pages in the first place, this amounts to a conversion of most MM code - including the LRU management, reclaim, rmap, migrate, swap, page fault code etc. - away from "the page".

I would agree with all of those except the page fault code; I believe that needs to continue to work in terms of pages in order to support misaligned mappings.

> However, this far exceeds the goal of a better mm-fs interface. And the value proposition of a full MM-internal conversion, including e.g. the less exposed anon page handling, is much more nebulous. It's been proposed to leave anon pages out, but IMO to keep that direction maintainable, the folio would have to be translated to a page quite early when entering MM code, rather than propagating it inward, in order to avoid huge, massively overlapping page and folio APIs.

I only intend to leave anonymous memory out /for now/. My hope is that somebody else decides to work on it (and indeed Google have volunteered someone for the task).
> It's also not clear to me that using the same abstraction for compound pages and the file cache object is future proof. It's evident from scalability issues in the allocator, reclaim, compaction, etc. that with current memory sizes and IO devices, we're hitting the limits of efficiently managing memory in 4k base pages per default. It's also clear that we'll continue to have a need for 4k cache granularity for quite a few workloads that work with large numbers of small files. I'm not sure how this could be resolved other than divorcing the idea of a (larger) base page from the idea of cache entries that can correspond, if necessary, to memory chunks smaller than a default page.

That sounds to me exactly like folios, except for the naming. From the MM point of view, it's less churn to do it your way, but from the point of view of the rest of the kernel, there are going to be unexpected consequences. For example, btrfs didn't support page size != block size until just recently (and I'm not sure it's entirely fixed yet?)

And there's nobody working on your idea. At least nobody who has surfaced so far. The folio patch is here now.

Folios are also variable sized. For files which are small, we still only allocate 4kB to cache them. If the file is accessed entirely randomly, we only allocate 4kB chunks at a time. We only allocate larger folios when we think there is an advantage to doing so.

This benefit is retained if someone does come along to change PAGE_SIZE to 16KiB (or whatever). Folios can still be composed of multiple pages, no matter what the PAGE_SIZE is.

> A longer thread on that can be found here:
> https://lore.kernel.org/linux-fsdevel/YFja%2FLRC1NI6quL6@cmpxchg.org/
>
> As an MM stakeholder, I don't think folios are the answer for MM code.
On Mon, Aug 23, 2021 at 03:06:08PM -0700, Linus Torvalds wrote:
> On Mon, Aug 23, 2021 at 2:25 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On one hand, the ambition appears to be to substitute folio for everything that could be a base page or a compound page even inside core MM code. Since there are very few places in the MM code that expressly deal with tail pages in the first place, this amounts to a conversion of most MM code - including the LRU management, reclaim, rmap, migrate, swap, page fault code etc. - away from "the page".
>
> Yeah, honestly, I would have preferred to see this done the exact reverse way: make the rule be that "struct page" is always a head page, and anything that isn't a head page would be called something else.
>
> Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.
>
> That said, I see why Willy did it the way he did - it was easier to do it incrementally. But I do think it ends up with an end result that is kind of topsy-turvy, where the common "this is the core allocation" gets the odd "folio" name, and the simpler "page" name is for things that almost nobody should even care about.
>
> I'd have personally preferred to call the head page just a "page", and other pages "subpage" or something like that. I think that would be much more intuitive than "folio/page".

I'm trying to figure out how we can get there. To start, define

	struct mmu_page {
		union {
			struct page;
			struct {
				unsigned long flags;
				unsigned long compound_head;
				unsigned char compound_dtor;
				unsigned char compound_order;
				atomic_t compound_mapcount;
				unsigned int compound_nr;
			};
		};
	};

Now memmap becomes an array of struct mmu_page instead of struct page. We also need to sort out the type returned from the page cache APIs. Right now, it returns (effectively) the mmu_page. I think it _should_ return the (arbitrary order) struct page, but auditing every caller of every function is an inhuman job. I can't see how to get there from here without a ridiculous number of bugs. Maybe you can.
On Mon, Aug 23, 2021 at 03:06:08PM -0700, Linus Torvalds wrote:
> Yeah, honestly, I would have preferred to see this done the exact reverse way: make the rule be that "struct page" is always a head page, and anything that isn't a head page would be called something else.
>
> Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.

Actually, I think this is an advantage for folios. Maybe not for the core MM, which has always been _fairly_ careful to deal with compound pages properly. But for filesystem people, device drivers, etc: when people see a struct page, they think it's PAGE_SIZE bytes in size. And they're usually right, which is what makes things like THP so prone to "Oops, we missed a spot" bugs.

By contrast, if you see something which takes a struct folio and then works on PAGE_SIZE bytes, that's a sign there's something funny going on. There are a few of those still; for example kmap() can only map PAGE_SIZE bytes at a time.
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Yeah, honestly, I would have preferred to see this done the exact reverse way: make the rule be that "struct page" is always a head page, and anything that isn't a head page would be called something else.
> ...
> That said, I see why Willy did it the way he did - it was easier to do it incrementally. But I do think it ends up with an end result that is kind of topsy-turvy, where the common "this is the core allocation" gets the odd "folio" name, and the simpler "page" name is for things that almost nobody should even care about.

From a filesystem pov, it may be better done Willy's way. There's a lot of assumption that "struct page" corresponds to a PAGE_SIZE chunk of RAM and is equivalent to a hardware page, so using something other than struct page seems a better idea. It's easier to avoid the assumption if it's called something different.

We're dealing with variable-sized clusters of things that, in the future, could be, say, a combination of typical 4K pages and higher-order pages (depending on what the arch supports), so I think "page" is the wrong name to use.

There are some pieces, kmap being a prime example, that might be tricky to make handle a transparently variable-sized multipage object, so careful auditing will likely be required if we do stick with "struct page".

Further, there's the problem that there are a *lot* of places where filesystems access struct page members directly, rather than going through helper functions - and all of these need to be fixed. This is much easier to manage if we can get the compiler to do the catching. Hiding them all within struct page would require a humongous single patch.

One question does spring to mind, though: do filesystems even need to know about hardware pages at all? They need to be able to access source data or a destination buffer, but that can be stitched together from disparate chunks that have nothing to do with pages (eg. iov_iter); they need access to the pagecache, and may need somewhere to cache pieces of information; and they need to be able to pass chunks of pagecache, data or bufferage to crypto (scatterlists) and I/O routines (bio, skbuff) - but can we hide "paginess" from filesystems?

The main point where this matters, at the moment, is, I think, mmap - but could more of that be handled transparently by the VM?

> Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.

That's mostly because no one uses the term... yet. I've got used to it in building on top of Willy's patches and have no problem with it - apart from the fact that I would expect something more like a plural or a collective noun ("sheaf" or "ream" maybe?) - but at least the name is similar in length to "page". And it's handy for grepping ;-)

> I'd have personally preferred to call the head page just a "page", and other pages "subpage" or something like that. I think that would be much more intuitive than "folio/page".

As previously stated, I think we need to leave "struct page" as meaning "hardware page" and build some other concept on top for aggregation/buffering.

David
On Tue, Aug 24, 2021 at 04:54:27PM +0100, David Howells wrote:
> One question does spring to mind, though: do filesystems even need to know about hardware pages at all? They need to be able to access source data or a destination buffer, but that can be stitched together from disparate chunks that have nothing to do with pages (eg. iov_iter); they need access to the pagecache, and may need somewhere to cache pieces of information; and they need to be able to pass chunks of pagecache, data or bufferage to crypto (scatterlists) and I/O routines (bio, skbuff) - but can we hide "paginess" from filesystems?
>
> The main point where this matters, at the moment, is, I think, mmap - but could more of that be handled transparently by the VM?

It really depends on the filesystem. I just audited adfs, for example, and there is literally nothing in there that cares about struct page. It passes its arguments from ->readpage and ->writepage to block_*_full_page(); it uses cont_write_begin() for its ->write_begin; and it uses __set_page_dirty_buffers for its ->set_page_dirty.

Then there are filesystems like UFS which use struct page extensively in their directory handling. And NFS, which uses struct page throughout. Partly there's just better infrastructure for block-based filesystems (which you're fixing), and partly NFS is trying to perform better than a filesystem which exists for compatibility with a long-dead OS.

> > Because, as you say, head pages are the norm. And "folio" may be a clever term, but it's not very natural. Certainly not at all as intuitive or common as "page" as a name in the industry.
>
> That's mostly because no one uses the term... yet. I've got used to it in building on top of Willy's patches and have no problem with it - apart from the fact that I would expect something more like a plural or a collective noun ("sheaf" or "ream" maybe?) - but at least the name is similar in length to "page".
>
> And it's handy for grepping ;-)

If the only thing standing between this patch and the merge is s/folio/ream/g, I will do that. All three options are equally greppable (except for 'ream' as a substring of dream, stream, preamble, scream, whereami, and typos for remain).
On Tue, Aug 24, 2021 at 11:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> If the only thing standing between this patch and the merge is
> s/folio/ream/g,

I really don't think that helps. All the book-binding analogies are only confusing.

If anything, I'd make things more explicit. Stupid and straightforward. Maybe just "struct head_page" or something like that. Name it by what it *is*, not by analogies. None of this cute/clever stuff.

I think making it obvious and descriptive would be the much better approach, not some clever "book binders call a collection of pages XYZ".

              Linus
On Tue, Aug 24, 2021 at 11:26 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> If anything, I'd make things more explicit. Stupid and
> straightforward. Maybe just "struct head_page" or something like that.
> Name it by what it *is*, not by analogies.

Btw, just to clarify: I don't love "struct head_page" either. It looks clunky. But at least something like that would be a _straightforward_ clunky name.

Something like just "struct pages" would be less clunky, would still get the message across, but gets a bit too visually similar.

Naming is hard.

              Linus
On Mon, Aug 23, 2021 at 11:15:48PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> > However, this far exceeds the goal of a better mm-fs interface. And the value proposition of a full MM-internal conversion, including e.g. the less exposed anon page handling, is much more nebulous. It's been proposed to leave anon pages out, but IMO to keep that direction maintainable, the folio would have to be translated to a page quite early when entering MM code, rather than propagating it inward, in order to avoid huge, massively overlapping page and folio APIs.
>
> I only intend to leave anonymous memory out /for now/. My hope is that somebody else decides to work on it (and indeed Google have volunteered someone for the task).

Unlike the filesystem side, this seems like a lot of churn for very little tangible value. And it leaves us with an end result that nobody appears to be terribly excited about.

But the folio abstraction is too low-level to use JUST for file cache and NOT for anon. It's too close to the page layer itself and would duplicate too much of it to be maintainable side by side.

That's why I asked why it couldn't be a more abstract memory unit for managing file cache, with a clearer delineation between that and how the backing memory is implemented - 1 page, N pages, maybe just a part of a page later on - and not just be a different name for a head page. It appears David is asking the same in the parallel subthread.

> > It's also not clear to me that using the same abstraction for compound pages and the file cache object is future proof. It's evident from scalability issues in the allocator, reclaim, compaction, etc. that with current memory sizes and IO devices, we're hitting the limits of efficiently managing memory in 4k base pages per default. It's also clear that we'll continue to have a need for 4k cache granularity for quite a few workloads that work with large numbers of small files. I'm not sure how this could be resolved other than divorcing the idea of a (larger) base page from the idea of cache entries that can correspond, if necessary, to memory chunks smaller than a default page.
>
> That sounds to me exactly like folios, except for the naming.

Then I think you misunderstood me.

> From the MM point of view, it's less churn to do it your way, but from the point of view of the rest of the kernel, there are going to be unexpected consequences. For example, btrfs didn't support page size != block size until just recently (and I'm not sure it's entirely fixed yet?)
>
> And there's nobody working on your idea. At least nobody who has surfaced so far. The folio patch is here now.
>
> Folios are also variable sized. For files which are small, we still only allocate 4kB to cache them. If the file is accessed entirely randomly, we only allocate 4kB chunks at a time. We only allocate larger folios when we think there is an advantage to doing so.
>
> This benefit is retained if someone does come along to change PAGE_SIZE to 16KiB (or whatever). Folios can still be composed of multiple pages, no matter what the PAGE_SIZE is.

The folio doc says "It is at least as large as %PAGE_SIZE"; folio_order() says "A folio is composed of 2^order pages"; page_folio(), folio_pfn(), folio_nr_pages all encode an N:1 relationship. And yes, the name implies it too.

This is in direct conflict with what I'm talking about, where base page granularity could become coarser than file cache granularity.

Are we going to bump struct page to 2M soon? I don't know. Here is what I do know about 4k pages, though:

- It's a lot of transactional overhead to manage tens of gigs of memory in 4k pages. We're reclaiming, paging and swapping more than ever before in our DCs, because flash provides in abundance the low-latency IOPS required for that, and parking cold/warm workload memory on cheap flash saves expensive RAM. But we're continuously scanning thousands of pages per second to do this. There was also the RWF_UNCACHED thread around reclaim CPU overhead at the higher end of buffered IO rates. There is the fact that we have a pending proposal from Google to replace rmap because it's too CPU-intense when paging into compressed memory pools.

- It's a lot of internal fragmentation. Compaction is becoming the default method for allocating the majority of memory in our servers. This is a latency concern during page faults, and a predictability concern when we defer it to khugepaged collapsing.

- struct page is statically eating gigs of expensive memory on every single machine, when only some of our workloads would require this level of granularity for some of their memory. And that's *after* we're fighting over every bit in that structure.

Base page size becoming bigger than cache entries in the near future doesn't strike me as an exotic idea. The writing seems to be on the wall. But the folio appears full of assumptions that conflict with it.

Sure, the patch is here now. But how much time will all the churn buy us before we may need a do-over? Would clean, incremental changes to the cache entry abstraction even be possible after we have anon and all kinds of other compound page internals hanging off of it as well?

Wouldn't it make more sense to decouple filesystems from "paginess", as David puts it, now instead? Avoid the risk of doing it twice, avoid the more questionable churn inside mm code, avoid the confusing proximity to the page and its API in the long-term...
On Tue, Aug 24, 2021 at 11:31 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Unlike the filesystem side, this seems like a lot of churn for very little tangible value. And it leaves us with an end result that nobody appears to be terribly excited about.

Well, there is actually some fairly well documented tangible value: our page accessor helper functions spend an absolutely insane amount of effort and time on just checking "is this a head page", following the "compound_head" pointer, etc. Functions that *used* to be trivial - and are still used as if they were - generate nasty complex code.

I'm thinking things like "get_page()" - it increments the reference count of the page. It's just a single atomic increment, right? Wrong. It's still inlined, but it generates these incredible gyrations with testing the low bit of a field, doing two very different things based on whether it is set, and now we have that "is it close to overflow" test too (ok, that one is dependent on VM_DEBUG), so it actually generates two conditional branches, odd bit tests, lots of extra calls, etc. So "get_page()" should probably not be an inline function any more. And that's just the first thing I happened to look at.

I think we have those "head = compound_head(page)" calls all over the VM code. And no, that "look up the compound page header" is not necessarily the biggest part of it, but it's definitely one part of it. And if we had a "we know this page is a head page", all of that just goes away. And in a lot of cases, we *do* know that. Which is exactly the kind of static knowledge that the folio patches expose.

But it is a lot of churn. And it basically duplicates all our page functions, just to have those simplified versions. It's very core code, and while I appreciate the cleverness of the "folio" name, I do think it makes the end result perhaps subtler than it needs to be.

The one thing I do like about it is how it uses the type system to be incremental.
So I don't hate the patches. I think they are clever, I think they are likely worthwhile, but I also certainly don't love them.

              Linus
On Tue, Aug 24, 2021 at 11:26:30AM -0700, Linus Torvalds wrote:
> On Tue, Aug 24, 2021 at 11:17 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > If the only thing standing between this patch and the merge is
> > s/folio/ream/g,
>
> I really don't think that helps. All the book-binding analogies are only confusing.
>
> If anything, I'd make things more explicit. Stupid and straightforward. Maybe just "struct head_page" or something like that. Name it by what it *is*, not by analogies.

I don't mind calling it something entirely different. I mean, the word "slab" has nothing to do with memory or pages or anything. I just want something short and greppable. Choosing short words at random from /usr/share/dict/words:

belt gala claw ogre peck raft bowl moat cask deck rink toga
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Something like just "struct pages" would be less clunky, would still
> get the message across, but gets a bit too visually similar.

"page_group"? I would suggest "pgroup", but that's already taken. Maybe "page_set" with "pset" as a shorthand pointer name. Or "struct pset/pgset"?

I would prefer a short single-word name, as there's a good chance it's going to be prefixing a bunch of API functions.

If you don't mind straying a bit from something with the name "page" in it, then "chapter", "sheet" or "book"?

David
On Tue, Aug 24, 2021 at 12:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Choosing short words at random from /usr/share/dict/words:

I don't think you're getting my point. In fact, you're just making it WORSE.

"short" and "greppable" is not the main issue here. "understandable" and "follows other conventions" is.

And those "other conventions" are not "book binders in the 17th century". They are about operating system design.

So when you mention "slab" as a name example, that's not the argument you think it is. That's a real honest-to-goodness operating system convention name that doesn't exactly predate Linux, but is most certainly not new.

In fact, "slab" is a bad example for another reason: we don't actually really use it outside of the internal implementation of the slab cache. The name we actually *use* tends to be "kmalloc()" or similar, which most definitely has a CS history that goes back even further and is not at all confusing to anybody.

So no. This email just convinces me that you have ENTIRELY the wrong approach to naming, and is just making me more convinced that "folio" came from the wrong kind of thinking. Because "random short words" is absolutely the last thing you should look at.

              Linus
On Tue, Aug 24, 2021 at 12:11:49PM -0700, Linus Torvalds wrote:
> On Tue, Aug 24, 2021 at 12:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Choosing short words at random from /usr/share/dict/words:
>
> I don't think you're getting my point. In fact, you're just making it WORSE.
>
> "short" and "greppable" is not the main issue here. "understandable" and "follows other conventions" is.
>
> And those "other conventions" are not "book binders in the 17th century". They are about operating system design.
>
> So when you mention "slab" as a name example, that's not the argument you think it is. That's a real honest-to-goodness operating system convention name that doesn't exactly predate Linux, but is most certainly not new.

Sure, but at the time Jeff Bonwick chose it, it had no meaning in computer science or operating system design. Whatever name is chosen, we'll get used to it. I don't even care what name it is.

I want "short" because it ends up used everywhere. I don't want to be typing

	lock_hippopotamus(hippopotamus);

and I want greppable so it's not confused with something somebody else has already used as an identifier.
On Tue, Aug 24, 2021 at 12:11 PM David Howells <dhowells@redhat.com> wrote:
>
> "page_group"? I would suggest "pgroup", but that's already taken. Maybe
> "page_set" with "pset" as a shorthand pointer name. Or "struct pset/pgset"?

Please don't do the "shorthand" thing. Names like "pset" and "pgroup" are pure and utter garbage, and make no sense and describe nothing at all. If you want a pointer name and don't need a descriptive name because there is no ambiguity, you might as well just use 'p'. And if you want to make it clear that it's a collection of pages, you might as well use "pages".

Variable naming is one thing, and there's nothing wrong with variable names like 'i', 'p' and 'pages'. The variable name should come from the context, and 'a' and 'b' can make perfect sense (and 'new' and 'old' can be very good names that clarify what the usage is - C++ people can go pound sand, they mis-designed the language keywords).

But the *type* name should describe the type, and it sure shouldn't be anything like pset/pgroup.

Something like "page_group" or "pageset" sound reasonable to me as type names.

              Linus
On Tue, Aug 24, 2021 at 11:29:53AM -0700, Linus Torvalds wrote: > > Something like just "struct pages" would be less clunky, would still > get the message across, but gets a bit too visually similar. How about "struct mempages"? > Naming is hard. Indeed... - Ted
Theodore Ts'o <tytso@mit.edu> wrote:
> How about "struct mempages"?
Kind of redundant in this case?
David
Matthew Wilcox <willy@infradead.org> wrote: > Sure, but at the time Jeff Bonwick chose it, it had no meaning in > computer science or operating system design. Whatever name is chosen, > we'll get used to it. I don't even care what name it is. > > I want "short" because it ends up used everywhere. I don't want to > be typing > lock_hippopotamus(hippopotamus); > > and I want greppable so it's not confused with something somebody else > has already used as an identifier. Can you live with pageset? David
On Tue, Aug 24, 2021 at 12:25 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Something like "page_group" or "pageset" sound reasonable to me as type names. "pageset" is such a great name that we already use it, so I guess that doesn't work. Linus
On Tue, Aug 24, 2021 at 02:32:56PM -0400, Johannes Weiner wrote: > The folio doc says "It is at least as large as %PAGE_SIZE"; > folio_order() says "A folio is composed of 2^order pages"; > page_folio(), folio_pfn(), folio_nr_pages all encode a N:1 > relationship. And yes, the name implies it too. > > This is in direct conflict with what I'm talking about, where base > page granularity could become coarser than file cache granularity. That doesn't make any sense. A page is the fundamental unit of the mm. Why would we want to increase the granularity of page allocation and not increase the granularity of the file cache? > Are we going to bump struct page to 2M soon? I don't know. Here is > what I do know about 4k pages, though: > > - It's a lot of transactional overhead to manage tens of gigs of > memory in 4k pages. We're reclaiming, paging and swapping more than > ever before in our DCs, because flash provides in abundance the > low-latency IOPS required for that, and parking cold/warm workload > memory on cheap flash saves expensive RAM. But we're continously > scanning thousands of pages per second to do this. There was also > the RWF_UNCACHED thread around reclaim CPU overhead at the higher > end of buffered IO rates. There is the fact that we have a pending > proposal from Google to replace rmap because it's too CPU-intense > when paging into compressed memory pools. This seems like an argument for folios, not against them. If user memory (both anon and file) is being allocated in larger chunks, there are fewer pages to scan, less book-keeping to do, and all you're paying for that is I/O bandwidth. > - It's a lot of internal fragmentation. Compaction is becoming the > default method for allocating the majority of memory in our > servers. This is a latency concern during page faults, and a > predictability concern when we defer it to khugepaged collapsing. Again, the more memory that we allocate in higher-order chunks, the better this situation becomes. 
> - struct page is statically eating gigs of expensive memory on every > single machine, when only some of our workloads would require this > level of granularity for some of their memory. And that's *after* > we're fighting over every bit in that structure. That, folios does not help with. I have post-folio ideas about how to address that, but I can't realistically start working on them until folios are upstream.
On Tue, Aug 24, 2021 at 08:23:15PM +0100, Matthew Wilcox wrote: > > So when you mention "slab" as a name example, that's not the argument > > you think it is. That's a real honest-to-goodness operating system > > convention name that doesn't exactly predate Linux, but is most > > certainly not new. > > Sure, but at the time Jeff Bonwick chose it, it had no meaning in > computer science or operating system design. I think the big difference is that "slab" is mostly used as an internal name. In Linux it doesn't even leak out to the users, since we use kmem_cache_{create,alloc,free,destroy}(). So the "slab" doesn't even show up in the API. The problem is that whether we use struct head_page, or folio, or mempages, we're going to be in subsystem users' faces. And while people who are using it every day will eventually get used to anything, whether it's "folio" or "xmoqax", we should give a thought to newcomers to Linux file system code. If they see things like "read_folio()", they are going to be far more confused than by "read_pages()" or "read_mempages()". Sure, one impenetrable code word isn't that bad. But this is a case of death by a thousand cuts. At $WORK, one time when we had welcomed an intern to our group, I had to stop everyone each time they used an acronym or a codeword and ask them to define the term. It was really illuminating what an insider takes for granted, but when it's one cutesy codeword after another, with three or more such codewords in a sentence, it's *really* a less-than-great initial experience for a newcomer. So if someone sees "kmem_cache_alloc()", they can probably make a guess at what it means, and it's memorable once they learn it. Similarly, something like "head_page" or "mempages" is going to be a bit more obvious to a kernel newbie. So if we can make a tiny gesture towards comprehensibility, it would be good to do so while it's still easier to change the name. Cheers, - Ted
On Tue, Aug 24, 2021 at 12:38 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > "pageset" is such a great name that we already use it, so I guess that > doesn't work. Actually, maybe I can backtrack on that a bit. Maybe 'pageset' would work as a name. It's not used as a type right now, but the existing usage, where we do have those comments around 'struct per_cpu_pages', is actually not that different from the folio kind of thing. It has a list of "pages" that have a fixed order. So that existing 'pageset' user might actually fit in conceptually. The 'pageset' is only really used in comments and as part of a field name, and that use does seem to be kind of consistent with Willy's use of an "aligned allocation-group of pages". Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Something like "page_group" or "pageset" sound reasonable to me as type > > names. > > "pageset" is such a great name that we already use it, so I guess that > doesn't work. Heh. I tried grepping for "struct page_set" and that showed nothing. Maybe "pagegroup"? Here's a bunch of possible alternatives to set/group: https://en.wiktionary.org/wiki/Thesaurus:group Maybe consider it a sequence of pages, "struct pageseq"? page_aggregate sounds like a possibility, but it's quite long. Though from an fs point of view, I'd be okay hiding the fact that pages are involved. It's a buffer; a chunk of memory or chunk of pagecache with metadata - maybe something on that theme? David
On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: > On Tue, Aug 24, 2021 at 08:23:15PM +0100, Matthew Wilcox wrote: > > > So when you mention "slab" as a name example, that's not the argument > > > you think it is. That's a real honest-to-goodness operating system > > > convention name that doesn't exactly predate Linux, but is most > > > certainly not new. > > > > Sure, but at the time Jeff Bonwick chose it, it had no meaning in > > computer science or operating system design. > > I think the big difference is that "slab" is mostly used as an > internal name. In Linux it doesn't even leak out to the users, since > we use kmem_cache_{create,alloc,free,destroy}(). So the "slab" > doesn't even show up in the API. /proc/slabinfo /proc/sys/vm/min_slab_ratio /sys/kernel/slab include/linux/slab.h cpuset.memory_spread_slab failslab= slab_merge slab_max_order= $ git grep slab fs/ext4 |wc -l 30 (13 of which are slab.h) > The problem is whether we use struct head_page, or folio, or mempages, > we're going to be subsystem users' faces. And people who are using it > every day will eventually get used to anything, whether it's "folio" > or "xmoqax", we sould give a thought to newcomers to Linux file system > code. If they see things like "read_folio()", they are going to be > far more confused than "read_pages()" or "read_mempages()". > > Sure, one impenetrable code word isn't that bad. But this is a case > of a death by a thousand cuts. At $WORK, one time we had welcomed an > intern to our group, I had to stop everyone each time that they used > an acronym, or a codeword, and asked them to define the term. > > It was really illuminating what an insider takes for granted, but when > it's one cutsy codeword after another, with three or more such > codewords in a sentence, it's *really* a less-than-great initial > experience for a newcomer. 
> > So if someone sees "kmem_cache_alloc()", they can probably make a > guess what it means, and it's memorable once they learn it. > Similarly, something like "head_page", or "mempages" is going to a bit > more obvious to a kernel newbie. So if we can make a tiny gesture > towards comprehensibility, it would be good to do so while it's still > easier to change the name. I completely agree that it's good to use something which is not jargon, or is at least widely-understood jargon. And I loathe acronyms (you'll notice I haven't suggested a single one). Folio/ream/quire/sheaf were all attempts to get across "collection of pages". Another direction would be something that is associated with memory (but I don't have a good example). Or a non-English word (roman? seite? sidor?) We're going to end up with hpage, aren't we?
On Tue, Aug 24, 2021 at 08:34:47PM +0100, David Howells wrote: > Theodore Ts'o <tytso@mit.edu> wrote: > > > How about "struct mempages"? > > Kind of redundant in this case? I was looking for something which was visually different from "struct page", but was still reasonably short. Otherwise "struct pages" as Linus suggested would work for me. What do you think of "struct pageset"? Not quite as short as folios, but it's clearer. - Ted
On 8/24/21 9:35 PM, David Howells wrote: > Matthew Wilcox <willy@infradead.org> wrote: > >> Sure, but at the time Jeff Bonwick chose it, it had no meaning in >> computer science or operating system design. Whatever name is chosen, >> we'll get used to it. I don't even care what name it is. >> >> I want "short" because it ends up used everywhere. I don't want to >> be typing >> lock_hippopotamus(hippopotamus); >> >> and I want greppable so it's not confused with something somebody else >> has already used as an identifier. > > Can you live with pageset? Pagesets already exist in the page allocator internals. Yeah, could be renamed as it's not visible outside. > David > >
On 8/24/21 10:35 PM, Vlastimil Babka wrote: > On 8/24/21 9:35 PM, David Howells wrote: >> Matthew Wilcox <willy@infradead.org> wrote: >> >>> Sure, but at the time Jeff Bonwick chose it, it had no meaning in >>> computer science or operating system design. Whatever name is chosen, >>> we'll get used to it. I don't even care what name it is. >>> >>> I want "short" because it ends up used everywhere. I don't want to >>> be typing >>> lock_hippopotamus(hippopotamus); >>> >>> and I want greppable so it's not confused with something somebody else >>> has already used as an identifier. >> >> Can you live with pageset? > > Pagesets already exist in the page allocator internals. Yeah, could be > renamed as it's not visible outside. Should have read the rest of thread before replying. Maybe in the spirit of the discussion we could call it pageshed? /me hides >> David >> >> > >
Theodore Ts'o <tytso@mit.edu> wrote: > What do you think of "struct pageset"? Not quite as short as folios, > but it's clearer. Fine by me (I suggested page_set), and as Vlastimil points out, the current usage of the name could be renamed. David
On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: > The problem is whether we use struct head_page, or folio, or mempages, > we're going to be subsystem users' faces. And people who are using it > every day will eventually get used to anything, whether it's "folio" > or "xmoqax", we sould give a thought to newcomers to Linux file system > code. If they see things like "read_folio()", they are going to be > far more confused than "read_pages()" or "read_mempages()". Are they? It's not like page isn't some randomly made up term as well, just one that had a lot more time to spread. > So if someone sees "kmem_cache_alloc()", they can probably make a > guess what it means, and it's memorable once they learn it. > Similarly, something like "head_page", or "mempages" is going to a bit > more obvious to a kernel newbie. So if we can make a tiny gesture > towards comprehensibility, it would be good to do so while it's still > easier to change the name. All this sounds really weird to me. I doubt there is any name that nicely explains "structure used to manage arbitrary power of two units of memory in the kernel" very well. So I agree with willy here, let's pick something short and not clumsy. I initially found the folio name a little strange, but working with it I got used to it quickly. And all the other suggestions I've seen so far are significantly worse, especially all the odd compounds with page in it.
On Tue, Aug 24, 2021 at 11:59:52AM -0700, Linus Torvalds wrote: > But it is a lot of churn. And it basically duplicates all our page > functions, just to have those simplified versions. And It's very core > code, and while I appreciate the cleverness of the "folio" name, I do > think it makes the end result perhaps subtler than it needs to be. Maybe I'm biased by looking at the file system and pagecache side mostly, but if you look at the progress willy has been making, a lot of the relevant functionality will exist in either folio or page versions, not both. A lot of the duplication is to support the following: > The one thing I do like about it is how it uses the type system to be > incremental.
On 25/08/2021 08.32, Christoph Hellwig wrote: > On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: >> The problem is whether we use struct head_page, or folio, or mempages, >> we're going to be subsystem users' faces. And people who are using it >> every day will eventually get used to anything, whether it's "folio" >> or "xmoqax", we sould give a thought to newcomers to Linux file system >> code. If they see things like "read_folio()", they are going to be >> far more confused than "read_pages()" or "read_mempages()". > > Are they? It's not like page isn't some randomly made up term > as well, just one that had a lot more time to spread. > >> So if someone sees "kmem_cache_alloc()", they can probably make a >> guess what it means, and it's memorable once they learn it. >> Similarly, something like "head_page", or "mempages" is going to a bit >> more obvious to a kernel newbie. So if we can make a tiny gesture >> towards comprehensibility, it would be good to do so while it's still >> easier to change the name. > > All this sounds really weird to me. I doubt there is any name that > nicely explains "structure used to manage arbitrary power of two > units of memory in the kernel" very well. So I agree with willy here, > let's pick something short and not clumsy. I initially found the folio > name a little strange, but working with it I got used to it quickly. > And all the other uggestions I've seen s far are significantly worse, > especially all the odd compounds with page in it. > A comment from the peanut gallery: I find the name folio completely appropriate and easy to understand. Our vocabulary is already strongly inspired by words used in the world of printed text: the smallest unit of information is a char(acter) [ok, we usually call them bytes], a few characters make up a word, there's a number of words to each (cache) line, and a number of those is what makes up a page. So obviously a folio is something consisting of a few pages. 
Are the analogies perfect? Of course not. But they are actually quite apt; words, lines and pages don't universally have one size, but they do form a natural hierarchy describing how we organize information. Splitting a word across lines can slow down the reader so should be avoided... [sorry, couldn't resist]. Rasmus
On Wed, 2021-08-25 at 07:32 +0100, Christoph Hellwig wrote: > On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: > > The problem is whether we use struct head_page, or folio, or mempages, > > we're going to be subsystem users' faces. And people who are using it > > every day will eventually get used to anything, whether it's "folio" > > or "xmoqax", we sould give a thought to newcomers to Linux file system > > code. If they see things like "read_folio()", they are going to be > > far more confused than "read_pages()" or "read_mempages()". > > Are they? It's not like page isn't some randomly made up term > as well, just one that had a lot more time to spread. > Absolutely. "folio" is no worse than "page", we've just had more time to get used to "page". > > So if someone sees "kmem_cache_alloc()", they can probably make a > > guess what it means, and it's memorable once they learn it. > > Similarly, something like "head_page", or "mempages" is going to a bit > > more obvious to a kernel newbie. So if we can make a tiny gesture > > towards comprehensibility, it would be good to do so while it's still > > easier to change the name. > > All this sounds really weird to me. I doubt there is any name that > nicely explains "structure used to manage arbitrary power of two > units of memory in the kernel" very well. So I agree with willy here, > let's pick something short and not clumsy. I initially found the folio > name a little strange, but working with it I got used to it quickly. > And all the other uggestions I've seen s far are significantly worse, > especially all the odd compounds with page in it. Same here. Compound words are especially bad, as newbies will continually have to look at whether it's "page_set" or "pageset".
On Tue, 2021-08-24 at 22:32 +0100, David Howells wrote: > Theodore Ts'o <tytso@mit.edu> wrote: > > > What do you think of "struct pageset"? Not quite as short as folios, > > but it's clearer. > > Fine by me (I suggested page_set), and as Vlastimil points out, the current > usage of the name could be renamed. > I honestly fail to see how any of this is better than "folio". It's just a name, and "folio" has the advantage of being fairly unique. The greppability that Willy mentioned is a perk, but folio also doesn't sound similar to other words when discussing them verbally. That's another advantage. If I say "pageset" in a conversation, do I mean "struct pageset" or "a random set of pages"? If I say "folio", it's much more clear to what I'm referring. We've had a lot of time to get used to "page" as a term of art. We'd get used to folio too.
On Tue, Aug 24, 2021 at 08:44:01PM +0100, Matthew Wilcox wrote: > On Tue, Aug 24, 2021 at 02:32:56PM -0400, Johannes Weiner wrote: > > The folio doc says "It is at least as large as %PAGE_SIZE"; > > folio_order() says "A folio is composed of 2^order pages"; > > page_folio(), folio_pfn(), folio_nr_pages all encode a N:1 > > relationship. And yes, the name implies it too. > > > > This is in direct conflict with what I'm talking about, where base > > page granularity could become coarser than file cache granularity. > > That doesn't make any sense. A page is the fundamental unit of the > mm. Why would we want to increase the granularity of page allocation > and not increase the granularity of the file cache? I'm not sure why one should be tied to the other. The folio itself is based on the premise that a cache entry doesn't have to correspond to exactly one struct page. And I agree with that. I'm just wondering why it continues to imply a cache entry is at least one full page, rather than saying a cache entry is a set of bytes that can be backed however the MM sees fit. So that in case we do bump struct page size in the future we don't have to redo the filesystem interface again. I've listed reasons why 4k pages are increasingly the wrong choice for many allocations, reclaim and paging. We also know there is a need to maintain support for 4k cache entries. > > Are we going to bump struct page to 2M soon? I don't know. Here is > > what I do know about 4k pages, though: > > > > - It's a lot of transactional overhead to manage tens of gigs of > > memory in 4k pages. We're reclaiming, paging and swapping more than > > ever before in our DCs, because flash provides in abundance the > > low-latency IOPS required for that, and parking cold/warm workload > > memory on cheap flash saves expensive RAM. But we're continously > > scanning thousands of pages per second to do this. 
But we're continuously > > scanning thousands of pages per second to do this. There was also > > the RWF_UNCACHED thread around reclaim CPU overhead at the higher > > end of buffered IO rates. There is the fact that we have a pending > > proposal from Google to replace rmap because it's too CPU-intense > > when paging into compressed memory pools. > > This seems like an argument for folios, not against them. If user > memory (both anon and file) is being allocated in larger chunks, there > are fewer pages to scan, less book-keeping to do, and all you're paying > for that is I/O bandwidth. Well, it's an argument for huge pages, and we already have those in the form of THP. The problem with THP today is that the page allocator fragments the physical address space at the 4k granularity by default, and groups random allocations with no type information and rudimentary lifetime/reclaimability hints together. I'm having a hard time seeing 2M allocations scale as long as we do this. As opposed to making 2M the default block and using slab-style physical grouping by type and instantiation time for smaller cache entries - to improve the chances of physically contiguous reclaim. But because folios are compound/head pages first and foremost, they are inherently tied to being multiples of PAGE_SIZE. > > - It's a lot of internal fragmentation. Compaction is becoming the > > default method for allocating the majority of memory in our > > servers. This is a latency concern during page faults, and a > > predictability concern when we defer it to khugepaged collapsing. > > Again, the more memory that we allocate in higher-order chunks, the > better this situation becomes. It only needs 1 unfortunately placed 4k page out of 512 to mess up a 2M block indefinitely. And the page allocator has little awareness whether the 4k page it's handing out to somebody pairs well with the 4k page adjacent to it in terms of type and lifetime.
> > - struct page is statically eating gigs of expensive memory on every > > single machine, when only some of our workloads would require this > > level of granularity for some of their memory. And that's *after* > > we're fighting over every bit in that structure. > > That, folios does not help with. I have post-folio ideas about how > to address that, but I can't realistically start working on them > until folios are upstream. How would you reduce the memory overhead of struct page without losing necessary 4k granularity at the cache level? As long as folio implies that cache entries can't be smaller than a struct page? I appreciate folio is a big patchset and I don't mean to get too much into speculation about the future. But we're here in part because the filesystems have been too exposed to the backing memory implementation details. So all I'm saying is, if you're touching all the file cache interface now anyway, why not use the opportunity to properly disconnect it from the reality of pages, instead of making the compound page the new interface for filesystems. What's wrong with the idea of a struct cache_entry which can be embedded wherever we want: in a page, a folio or a pageset. Or in the future allocated on demand for <PAGE_SIZE entries, if need be. But actually have it be just a cache entry for the fs to read and write, not also a compound page and an anon page etc. all at the same time. Even today that would IMO delineate more clearly between the file cache data plane and the backing memory plane. It doesn't get in the way of also fixing the base-or-compound mess inside MM code with folio/pageset, either. And if down the line we change how the backing memory is implemented, the changes would be a more manageable scope inside MM proper. Anyway, I think I've asked all this before and don't mean to harp on it if people generally disagree that this is a concern.
On Wed, Aug 25, 2021 at 11:13:45AM -0400, Johannes Weiner wrote: > On Tue, Aug 24, 2021 at 08:44:01PM +0100, Matthew Wilcox wrote: > > On Tue, Aug 24, 2021 at 02:32:56PM -0400, Johannes Weiner wrote: > > > The folio doc says "It is at least as large as %PAGE_SIZE"; > > > folio_order() says "A folio is composed of 2^order pages"; > > > page_folio(), folio_pfn(), folio_nr_pages all encode a N:1 > > > relationship. And yes, the name implies it too. > > > > > > This is in direct conflict with what I'm talking about, where base > > > page granularity could become coarser than file cache granularity. > > > > That doesn't make any sense. A page is the fundamental unit of the > > mm. Why would we want to increase the granularity of page allocation > > and not increase the granularity of the file cache? > > I'm not sure why one should be tied to the other. The folio itself is > based on the premise that a cache entry doesn't have to correspond to > exactly one struct page. And I agree with that. I'm just wondering why > it continues to imply a cache entry is at least one full page, rather > than saying a cache entry is a set of bytes that can be backed however > the MM sees fit. So that in case we do bump struct page size in the > future we don't have to redo the filesystem interface again. > > I've listed reasons why 4k pages are increasingly the wrong choice for > many allocations, reclaim and paging. We also know there is a need to > maintain support for 4k cache entries. > > > > Are we going to bump struct page to 2M soon? I don't know. Here is > > > what I do know about 4k pages, though: > > > > > > - It's a lot of transactional overhead to manage tens of gigs of > > > memory in 4k pages. We're reclaiming, paging and swapping more than > > > ever before in our DCs, because flash provides in abundance the > > > low-latency IOPS required for that, and parking cold/warm workload > > > memory on cheap flash saves expensive RAM. 
But we're continously > > > scanning thousands of pages per second to do this. There was also > > > the RWF_UNCACHED thread around reclaim CPU overhead at the higher > > > end of buffered IO rates. There is the fact that we have a pending > > > proposal from Google to replace rmap because it's too CPU-intense > > > when paging into compressed memory pools. > > > > This seems like an argument for folios, not against them. If user > > memory (both anon and file) is being allocated in larger chunks, there > > are fewer pages to scan, less book-keeping to do, and all you're paying > > for that is I/O bandwidth. > > Well, it's an argument for huge pages, and we already have those in > the form of THP. > > The problem with THP today is that the page allocator fragments the > physical address space at the 4k granularity per default, and groups > random allocations with no type information and rudimentary > lifetime/reclaimability hints together. > > I'm having a hard time seeing 2M allocations scale as long as we do > this. As opposed to making 2M the default block and using slab-style > physical grouping by type and instantiation time for smaller cache > entries - to improve the chances of physically contiguous reclaim. > > But because folios are compound/head pages first and foremost, they > are inherently tied to being multiples of PAGE_SIZE. > > > > - It's a lot of internal fragmentation. Compaction is becoming the > > > default method for allocating the majority of memory in our > > > servers. This is a latency concern during page faults, and a > > > predictability concern when we defer it to khugepaged collapsing. > > > > Again, the more memory that we allocate in higher-order chunks, the > > better this situation becomes. > > It only needs 1 unfortunately placed 4k page out of 512 to mess up a > 2M block indefinitely. 
And the page allocator has little awareness > whether the 4k page it's handing out to somebody pairs well with the > 4k page adjacent to it in terms of type and lifetime. > > > > - struct page is statically eating gigs of expensive memory on every > > > single machine, when only some of our workloads would require this > > > level of granularity for some of their memory. And that's *after* > > > we're fighting over every bit in that structure. > > > > That, folios does not help with. I have post-folio ideas about how > > to address that, but I can't realistically start working on them > > until folios are upstream. > > How would you reduce the memory overhead of struct page without losing > necessary 4k granularity at the cache level? As long as folio implies > that cache entries can't be smaller than a struct page? > > I appreciate folio is a big patchset and I don't mean to get too much > into speculation about the future. > > But we're here in part because the filesystems have been too exposed > to the backing memory implementation details. So all I'm saying is, if > you're touching all the file cache interface now anyway, why not use > the opportunity to properly disconnect it from the reality of pages, > instead of making the compound page the new interface for filesystems. > > What's wrong with the idea of a struct cache_entry which can be > embedded wherever we want: in a page, a folio or a pageset. Or in the > future allocated on demand for <PAGE_SIZE entries, if need be. But > actually have it be just a cache entry for the fs to read and write, > not also a compound page and an anon page etc. all at the same time. Pardon my ignorance, but ... how would adding yet another layer help a filesystem? No matter how the software is structured, we have to set up and manage the (hardware) page state for programs, and we must keep that coherent with the file space mappings that we maintain. 
I already know how to deal with pages and dealing with "folios" seems about the same. Adding another layer of caching structures just adds another layer of cra^Wcoherency management for a filesystem to screw up. The folios change management of memory pages enough to disentangle the page/compound page confusion that exists now, and it seems like a reasonable means to supporting unreasonable things like copy on write storage for filesystems with a 56k block size. (And I'm sure I'll get tons of blowback for this, but XFS can manage space in weird units like that (configure the rt volume, set a 56k rt extent size, and all the allocations are multiples of 56k); if we ever wanted to support reflink on /that/ hot mess, it would be awesome to be able to say that we're only going to do 56k folios in the page cache for those files instead of the crazy writeback games that the prototype patchset does now.) --D > Even today that would IMO delineate more clearly between the file > cache data plane and the backing memory plane. It doesn't get in the > way of also fixing the base-or-compound mess inside MM code with > folio/pageset, either. > > And if down the line we change how the backing memory is implemented, > the changes would be a more manageable scope inside MM proper. > > Anyway, I think I've asked all this before and don't mean to harp on > it if people generally disagree that this is a concern.
On Wed, Aug 25, 2021 at 08:03:18AM -0400, Jeff Layton wrote: > On Wed, 2021-08-25 at 07:32 +0100, Christoph Hellwig wrote: > > On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote: > > > The problem is whether we use struct head_page, or folio, or mempages, > > > we're going to be subsystem users' faces. And people who are using it > > > every day will eventually get used to anything, whether it's "folio" > > > or "xmoqax", we sould give a thought to newcomers to Linux file system > > > code. If they see things like "read_folio()", they are going to be > > > far more confused than "read_pages()" or "read_mempages()". > > > > Are they? It's not like page isn't some randomly made up term > > as well, just one that had a lot more time to spread. > > > > Absolutely. "folio" is no worse than "page", we've just had more time > to get used to "page". I /like/ the name 'folio'. My privileged education :P informed me (when Matthew talked to me the very first time about this patchset) that it's a wonderfully flexible word that describes both a collection of various pages and a single large sheet of paper folded in half. Or in the case of x86, folded in half nine times. That's *exactly* the usage that Matthew is proposing. English already had a word ready for us to use, so let's use it. --D (Well, ok, the one thing I dislike is that my brain keeps typing out 'fileio' instead of 'folio', but it's still better than struct xmoqax or remembering if we do camel_case or PotholeCase.) > > > So if someone sees "kmem_cache_alloc()", they can probably make a > > > guess what it means, and it's memorable once they learn it. > > > Similarly, something like "head_page", or "mempages" is going to a bit > > > more obvious to a kernel newbie. So if we can make a tiny gesture > > > towards comprehensibility, it would be good to do so while it's still > > > easier to change the name. > > > > All this sounds really weird to me. 
> > I doubt there is any name that
> > nicely explains "structure used to manage arbitrary power of two
> > units of memory in the kernel" very well. So I agree with willy here,
> > let's pick something short and not clumsy. I initially found the folio
> > name a little strange, but working with it I got used to it quickly.
> > And all the other suggestions I've seen so far are significantly worse,
> > especially all the odd compounds with page in it.
>
> Same here. Compound words are especially bad, as newbies will
> continually have to look at whether it's "page_set" or "pageset".
>
> --
> Jeff Layton <jlayton@kernel.org>
>
Excerpts from Christoph Hellwig's message of August 25, 2021 4:32 pm:
> On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote:
>> The problem is whether we use struct head_page, or folio, or mempages,
>> we're going to be in subsystem users' faces. And people who are using it
>> every day will eventually get used to anything, whether it's "folio"
>> or "xmoqax", we should give a thought to newcomers to Linux file system
>> code. If they see things like "read_folio()", they are going to be
>> far more confused than "read_pages()" or "read_mempages()".
>
> Are they? It's not like page isn't some randomly made up term
> as well, just one that had a lot more time to spread.
>
>> So if someone sees "kmem_cache_alloc()", they can probably make a
>> guess what it means, and it's memorable once they learn it.
>> Similarly, something like "head_page", or "mempages" is going to be a bit
>> more obvious to a kernel newbie. So if we can make a tiny gesture
>> towards comprehensibility, it would be good to do so while it's still
>> easier to change the name.
>
> All this sounds really weird to me. I doubt there is any name that
> nicely explains "structure used to manage arbitrary power of two
> units of memory in the kernel" very well.

Cluster is easily understandable to a filesystem developer as a
contiguous set of one or more units, probably aligned and sized to a
power of 2. The swap subsystem in mm uses it (maybe because it's disk
adjacent, but it does have page clusters), so mm developers would be
fine with it too. Sadly you might have to call it page_cluster to
avoid confusion with block clusters in fs, and then it gets a bit
long.

Superpage could be different enough from huge page, which implies one
page of a particular large size (even though some other OS might use
it for that): a superpage would be a super set of pages, which could
be 1 or more.

Thanks,
Nick
On Wed, Aug 25, 2021 at 12:02 PM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> On 25/08/2021 08.32, Christoph Hellwig wrote:
> > On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote:
> >> The problem is whether we use struct head_page, or folio, or mempages,
> >> we're going to be in subsystem users' faces. And people who are using it
> >> every day will eventually get used to anything, whether it's "folio"
> >> or "xmoqax", we should give a thought to newcomers to Linux file system
> >> code. If they see things like "read_folio()", they are going to be
> >> far more confused than "read_pages()" or "read_mempages()".
> >
> > Are they? It's not like page isn't some randomly made up term
> > as well, just one that had a lot more time to spread.
> >
> >> So if someone sees "kmem_cache_alloc()", they can probably make a
> >> guess what it means, and it's memorable once they learn it.
> >> Similarly, something like "head_page", or "mempages" is going to be a bit
> >> more obvious to a kernel newbie. So if we can make a tiny gesture
> >> towards comprehensibility, it would be good to do so while it's still
> >> easier to change the name.
> >
> > All this sounds really weird to me. I doubt there is any name that
> > nicely explains "structure used to manage arbitrary power of two
> > units of memory in the kernel" very well. So I agree with willy here,
> > let's pick something short and not clumsy. I initially found the folio
> > name a little strange, but working with it I got used to it quickly.
> > And all the other suggestions I've seen so far are significantly worse,
> > especially all the odd compounds with page in it.
>
> A comment from the peanut gallery: I find the name folio completely
> appropriate and easy to understand.
> Our vocabulary is already strongly
> inspired by words used in the world of printed text: the smallest unit
> of information is a char(acter) [ok, we usually call them bytes], a few
> characters make up a word, there's a number of words to each (cache)
> line, and a number of those is what makes up a page. So obviously a
> folio is something consisting of a few pages.
>
> Are the analogies perfect? Of course not. But they are actually quite
> apt; words, lines and pages don't universally have one size, but they do
> form a natural hierarchy describing how we organize information.
>
> Splitting a word across lines can slow down the reader so should be
> avoided... [sorry, couldn't resist].
>

And if we ever want to manage page cache using an arbitrary number of
contiguous folios, we can always saw them into a scroll ;-)

Thanks,
Amir.
Johannes Weiner <hannes@cmpxchg.org> wrote:

> But we're here in part because the filesystems have been too exposed
> to the backing memory implementation details. So all I'm saying is, if
> you're touching all the file cache interface now anyway, why not use
> the opportunity to properly disconnect it from the reality of pages,
> instead of making the compound page the new interface for filesystems.
>
> What's wrong with the idea of a struct cache_entry

Well, the name's already taken, though only in cifs. And we have a
*lot* of caches so just calling it "cache_entry" is kind of
unspecific.

> which can be
> embedded wherever we want: in a page, a folio or a pageset. Or in the
> future allocated on demand for <PAGE_SIZE entries, if need be. But
> actually have it be just a cache entry for the fs to read and write,
> not also a compound page and an anon page etc. all at the same time.
>
> Even today that would IMO delineate more clearly between the file
> cache data plane and the backing memory plane. It doesn't get in the
> way of also fixing the base-or-compound mess inside MM code with
> folio/pageset, either.

One thing I like about Willy's folio concept is that, as long as
everyone uses the proper accessor functions and macros, we can mostly
ignore the fact that they're 2^N sized/aligned and they're composed of
exact multiples of pages. What really matters are the correspondences
between folio size/alignment and medium/IO size/alignment, so you
could look on the folio as being a tool to disconnect the filesystem
from the concept of pages.

We could, in the future, in theory, allow the internal implementation
of a folio to shift from being a page array to being a kmalloc'd page
list or allow higher order units to be mixed in. The main thing we
have to stop people from doing is directly accessing the members of
the struct.

There are some tricky bits: kmap and mmapped page handling, for
example.
Some of this can be mitigated by making iov_iters handle folios (the
ITER_XARRAY type does, for example) and providing utilities to
populate scatterlists.

David
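[To illustrate the point about accessors above: the kernel really does
have folio_nr_pages() and folio_size() with these semantics, but the
two-field struct and the folio_offset() helper below are invented
stand-ins for this sketch, not the kernel's definitions. The idea is
only that callers never touch the representation directly, so it can
change underneath them:]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Simplified stand-in for the kernel's struct folio. Callers only see
 * the accessors below, so the "2^order contiguous base pages"
 * representation could later change without touching filesystem code. */
struct folio {
	unsigned int order;	/* folio spans 2^order base pages */
	unsigned long index;	/* position in the file, in base pages */
};

static inline unsigned long folio_nr_pages(const struct folio *folio)
{
	return 1UL << folio->order;
}

static inline unsigned long folio_size(const struct folio *folio)
{
	return PAGE_SIZE << folio->order;
}

/* Hypothetical helper: byte offset of file position 'pos' within the
 * folio that caches it. */
static inline unsigned long folio_offset(const struct folio *folio,
					 unsigned long long pos)
{
	return (unsigned long)(pos -
		(unsigned long long)folio->index * PAGE_SIZE);
}
```

A filesystem written against this surface never needs to know whether
the backing store is a page array, a kmalloc'd list, or something else.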
On Tue, Aug 24, 2021 at 12:48:13PM -0700, Linus Torvalds wrote:
> On Tue, Aug 24, 2021 at 12:38 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > "pageset" is such a great name that we already use it, so I guess that
> > doesn't work.
>
> Actually, maybe I can backtrack on that a bit.
>
> Maybe 'pageset' would work as a name. It's not used as a type right
> now, but the usage where we do have those comments around 'struct
> per_cpu_pages' are actually not that different from the folio kind of
> thing. It has a list of "pages" that have a fixed order.
>
> So that existing 'pageset' user might actually fit in conceptually.
> The 'pageset' is only really used in comments and as part of a field
> name, and the use does seem to be kind of consistent with Willy's
> use of an "aligned allocation-group of pages".

The 'pageset' in use in mm/page_alloc.c really seems to be more of a
pagelist than a pageset. The one concern I have about renaming it is
that we actually print the word 'pagesets' in /proc/zoneinfo. There's
also some infiniband driver that uses the word "pageset", which really
seems to mean "DMA range".

So if I rename the existing mm pageset to pagelist, and then modify
all these patches to call a folio a pageset, you'd take this patchset?
On Thu, Aug 26, 2021 at 09:58:06AM +0100, David Howells wrote:
> One thing I like about Willy's folio concept is that, as long as everyone uses
> the proper accessor functions and macros, we can mostly ignore the fact that
> they're 2^N sized/aligned and they're composed of exact multiples of pages.
> What really matters are the correspondences between folio size/alignment and
> medium/IO size/alignment, so you could look on the folio as being a tool to
> disconnect the filesystem from the concept of pages.
>
> We could, in the future, in theory, allow the internal implementation of a
> folio to shift from being a page array to being a kmalloc'd page list or
> allow higher order units to be mixed in. The main thing we have to stop
> people from doing is directly accessing the members of the struct.

In the current state of the folio patches, I agree with you. But
conceptually, folios are not disconnecting from the page beyond
PAGE_SIZE -> PAGE_SIZE * (1 << folio_order()). This is why I asked
what the intended endgame is. And I wonder if there is a bit of an
alignment issue between FS and MM people about the exact nature and
identity of this data structure.

At the current stage of conversion, folio is a more clearly delineated
API of what can be safely used from the FS for the interaction with
the page cache and memory management. And it looks still flexible to
make all sorts of changes, including how it's backed by memory.

Compared with the page, where parts of the API are for the FS, but
there are tons of members, functions, constants, and restrictions due
to the page's role inside MM core code. Things you shouldn't be using,
things you shouldn't be assuming from the fs side, but it's hard to
tell which is which, because struct page is a lot of things.

However, the MM narrative for folios is that they're an abstraction
for regular vs compound pages. This is rather generic.
Conceptually, it applies very broadly and deeply to MM core code:
anonymous memory handling, reclaim, swapping, even the slab allocator
uses them. If we follow through on this concept from the MM side - and
that seems to be the plan - it's inevitable that the folio API will
grow more MM-internal members, methods, as well as restrictions again
in the process. Except for the tail page bits, I don't see too much in
struct page that would not conceptually fit into this version of the
folio.

The cache_entry idea is really just to codify and retain that
domain-specific minimalism and clarity from the filesystem side. As
well as the flexibility around how backing memory is implemented,
which I think could come in handy soon, but isn't the sole reason.
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, Aug 26, 2021 at 09:58:06AM +0100, David Howells wrote:
> > One thing I like about Willy's folio concept is that, as long as everyone uses
> > the proper accessor functions and macros, we can mostly ignore the fact that
> > they're 2^N sized/aligned and they're composed of exact multiples of pages.
> > What really matters are the correspondences between folio size/alignment and
> > medium/IO size/alignment, so you could look on the folio as being a tool to
> > disconnect the filesystem from the concept of pages.
> >
> > We could, in the future, in theory, allow the internal implementation of a
> > folio to shift from being a page array to being a kmalloc'd page list or
> > allow higher order units to be mixed in. The main thing we have to stop
> > people from doing is directly accessing the members of the struct.
>
> In the current state of the folio patches, I agree with you. But
> conceptually, folios are not disconnecting from the page beyond
> PAGE_SIZE -> PAGE_SIZE * (1 << folio_order()). This is why I asked
> what the intended endgame is. And I wonder if there is a bit of an
> alignment issue between FS and MM people about the exact nature and
> identity of this data structure.

Possibly. I would guess there are a couple of reasons that on the MM
side particularly it's dealt with as a strict array of pages:
efficiency and mmap-related faults. It's most efficient to treat it as
an array of contiguous pages as that removes the need for indirection.
From the pov of mmap, faults happen along the lines of h/w page
divisions.

From an FS point of view, at minimum, I just need to know the state of
the folio. If a page fault dirties several folios, that's fine. If I
can find out that a folio was partially dirtied, that's useful, but
not critical.
I am a bit concerned about higher-order folios causing huge writes -
but I do realise that we might want to improve TLB/PT efficiency by
using larger entries and that that comes with consequences for mmapped
writes.

> At the current stage of conversion, folio is a more clearly delineated
> API of what can be safely used from the FS for the interaction with
> the page cache and memory management. And it looks still flexible to
> make all sorts of changes, including how it's backed by
> memory. Compared with the page, where parts of the API are for the FS,
> but there are tons of members, functions, constants, and restrictions
> due to the page's role inside MM core code. Things you shouldn't be
> using, things you shouldn't be assuming from the fs side, but it's
> hard to tell which is which, because struct page is a lot of things.

I definitely like the API cleanup that folios offer. However, I do
think Willy needs to better document the differences between some of
the functions, or at least when/where they should be used -
folio_mapping() and folio_file_mapping() being examples of this.

> However, the MM narrative for folios is that they're an abstraction
> for regular vs compound pages. This is rather generic. Conceptually,
> it applies very broadly and deeply to MM core code: anonymous memory
> handling, reclaim, swapping, even the slab allocator uses them. If we
> follow through on this concept from the MM side - and that seems to be
> the plan - it's inevitable that the folio API will grow more
> MM-internal members, methods, as well as restrictions again in the
> process. Except for the tail page bits, I don't see too much in struct
> page that would not conceptually fit into this version of the folio.
>
> The cache_entry idea is really just to codify and retain that
> domain-specific minimalism and clarity from the filesystem side.
> As
> well as the flexibility around how backing memory is implemented,
> which I think could come in handy soon, but isn't the sole reason.

I can see why you might want the clarification. However, at this
point, can you live with this set of folio patches? Can you live with
the name? Could you live with it if "folio" was changed to something
else?

I would really like to see this patchset get in. It's hanging over
changes I and others want to make that will conflict with Willy's
changes. If we can get the basic API of folios in now, that means I
can make my changes on top of them.

Thanks,
David
On Fri, Aug 27, 2021 at 06:03:25AM -0400, Johannes Weiner wrote:
> At the current stage of conversion, folio is a more clearly delineated
> API of what can be safely used from the FS for the interaction with
> the page cache and memory management. And it looks still flexible to
> make all sorts of changes, including how it's backed by
> memory. Compared with the page, where parts of the API are for the FS,
> but there are tons of members, functions, constants, and restrictions
> due to the page's role inside MM core code. Things you shouldn't be
> using, things you shouldn't be assuming from the fs side, but it's
> hard to tell which is which, because struct page is a lot of things.
>
> However, the MM narrative for folios is that they're an abstraction
> for regular vs compound pages. This is rather generic. Conceptually,
> it applies very broadly and deeply to MM core code: anonymous memory
> handling, reclaim, swapping, even the slab allocator uses them. If we
> follow through on this concept from the MM side - and that seems to be
> the plan - it's inevitable that the folio API will grow more
> MM-internal members, methods, as well as restrictions again in the
> process. Except for the tail page bits, I don't see too much in struct
> page that would not conceptually fit into this version of the folio.

So the superhypermegaultra ambitious version of this does something
like:

struct slab_page {
	unsigned long flags;
	union {
		struct list_head slab_list;
		struct {
			...
		};
	};
	struct kmem_cache *slab_cache;
	void *freelist;
	void *s_mem;
	unsigned int active;
	atomic_t _refcount;
	unsigned long memcg_data;
};

struct folio {
	... more or less as now ...
};

struct net_page {
	unsigned long flags;
	unsigned long pp_magic;
	struct page_pool *pp;
	unsigned long _pp_mapping_pad;
	unsigned long dma_addr[2];
	atomic_t _mapcount;
	atomic_t _refcount;
	unsigned long memcg_data;
};

struct page {
	union {
		struct folio folio;
		struct slab_page slab;
		struct net_page pool;
		...
	};
};

and then functions which only take one specific type of page use that
type. And the compiler will tell you that you can't pass a net_page
to a slab function, or vice versa. This is a lot more churn, and I'm
far from convinced that it's worth doing.

There's also the tricky "This page is mappable to userspace" kind of
functions, which (for example) includes vmalloc and net_page as well
as folios and random driver allocations, but shouldn't include slab
or page table pages. They're especially tricky because mapping to
userspace comes with rules around the use of the ->mapping field as
well as ->_mapcount.
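[The type-safety claim in the sketch above can be demonstrated with a
user-space toy. The field lists here are invented placeholders, not the
proposed kernel layout; the point is only that a function taking a
specific member type rejects the others at compile time:]

```c
#include <assert.h>
#include <stddef.h>

/* Toy versions of the per-use-case page types from the sketch above.
 * Fields are illustrative only. */
struct slab_page { unsigned long flags; void *freelist; };
struct net_page  { unsigned long flags; unsigned long dma_addr; };

/* One union of all the use cases, as in the sketch. */
struct page {
	union {
		struct slab_page slab;
		struct net_page pool;
	};
};

/* A function that only makes sense for slab pages takes the specific
 * type. Passing &page->pool here is a compile-time type error, which
 * is exactly the enforcement the sketch is after. */
static void *slab_first_free(struct slab_page *slab)
{
	return slab->freelist;
}
```

The churn cost is that every caller must name the right union member
instead of passing a bare struct page around.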
On Wed, Aug 25, 2021 at 05:45:55PM -0700, Darrick J. Wong wrote:
> Pardon my ignorance, but ... how would adding yet another layer help a
> filesystem? No matter how the software is structured, we have to set up
> and manage the (hardware) page state for programs, and we must keep that
> coherent with the file space mappings that we maintain. I already know
> how to deal with pages and dealing with "folios" seems about the same.
> Adding another layer of caching structures just adds another layer of
> cra^Wcoherency management for a filesystem to screw up.
>
> The folios change management of memory pages enough to disentangle the
> page/compound page confusion that exists now, and it seems like a
> reasonable means to supporting unreasonable things like copy on write
> storage for filesystems with a 56k block size.
>
> (And I'm sure I'll get tons of blowback for this, but XFS can manage
> space in weird units like that (configure the rt volume, set a 56k rt
> extent size, and all the allocations are multiples of 56k); if we ever
> wanted to support reflink on /that/ hot mess, it would be awesome to be
> able to say that we're only going to do 56k folios in the page cache for
> those files instead of the crazy writeback games that the prototype
> patchset does now.)

I'm guessing the reason you want 56k blocks is because with larger
filesystems and faster drives it would be a more reasonable unit for
managing this amount of data than 4k would be.

We have the same thoughts in MM and growing memory sizes. The DAX
stuff said from the start it won't be built on linear struct page
mappings anymore because we expect the memory modules to be too big to
manage them with such fine-grained granularity. But in practice, this
is more and more becoming true for DRAM as well. We don't want to
allocate gigabytes of struct page when on our servers only a very
small share of overall memory needs to be managed at this granularity.
Folio perpetuates the problem of the base page being the floor for
cache granularity, and so from an MM POV it doesn't allow us to scale
up to current memory sizes without horribly regressing certain
filesystem workloads that still need us to be able to scale down.

But there is something more important that I wish more MM people would
engage on: when you ask for 56k/2M/whatever buffers, the MM has to be
able to *allocate* them.

I'm assuming that while you certainly have preferences, you don't rely
too much on whether that memory is composed of a contiguous chunk of
4k pages, a single 56k page, a part of a 2M page, or maybe even
discontig 4k chunks with an SG API. You want to manage your disk space
one way, but you could afford the MM some flexibility to do the right
thing under different levels of memory load, and allow it to scale in
the direction it needs for its own purposes.

But if folios are also the low-level compound pages used throughout
the MM code, we're tying these fs allocations to the requirement of
being physically contiguous. This is a much more difficult allocation
problem. And from the MM side, we have a pretty poor track record of
serving contiguous memory larger than the base page size.

Since forever have non-MM people assumed that because the page
allocator takes an order argument you could make arbitrary 2^n
requests. When they inevitably complain that it doesn't work, even
under light loads, we tell them "lol order-0 or good luck".

Compaction has improved our ability to serve these requests, but only
*if you bring the time for defragmentation*. Many allocations don't.

THP has been around for years, but honestly it doesn't really work in
general purpose environments. Yeah, if you have some HPC number
cruncher that allocates all the anon at startup and then runs for
hours, it's fine. But in a more dynamic environment after some uptime,
the MM code just isn't able to produce these larger pages reliably and
within a reasonable deadline.
I'm assuming filesystem workloads won't bring the necessary patience
for this either. We've effectively declared bankruptcy on this
already. Many requests have been replaced with kvmalloc(), and THP has
been mostly relegated to the optimistic background tinkering of
khugepaged. You can't rely on it, so you need to structure your
expectations around it, and perform well when it isn't there. This
will apply to filesystems as well.

I really don't think it makes sense to discuss folios as the means for
enabling huge pages in the page cache, without also taking a long hard
look at the allocation model that is supposed to back them. Because
you can't make it happen without that. And this part isn't looking so
hot to me, tbh.

Willy says he has future ideas to make compound pages scale. But we
have years of history saying this is incredibly hard to achieve - and
it certainly wasn't for a lack of constant trying.

Decoupling the filesystems from struct page is a necessary step. I can
also see an argument for abstracting away compound pages to clean up
the compound_head() mess in all the helpers (although I'm still not
convinced the wholesale replacement of the page concept is the best
way to achieve this). But combining the two objectives, and making
compound pages the basis for huge page cache - after everything we
know about higher-order allocs - seems like a stretch to me.
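[The "gigabytes of struct page" cost mentioned earlier in the thread is
easy to check with back-of-the-envelope arithmetic. This is a sketch
assuming the common 64-byte struct page on 64-bit configs and 4 KiB base
pages; the helper name is invented for illustration:]

```c
#include <assert.h>

#define BASE_PAGE_SIZE   4096ULL
#define STRUCT_PAGE_SIZE 64ULL	/* typical sizeof(struct page) on 64-bit */

/* Bytes of memmap needed to describe 'ram_bytes' of memory at base-page
 * granularity: one struct page per 4 KiB page, i.e. 1/64 of RAM. */
static unsigned long long memmap_bytes(unsigned long long ram_bytes)
{
	return ram_bytes / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE;
}
```

For 1 TiB of RAM this works out to 16 GiB of memmap, matching the "my
1TB machine has 16GB occupied by memmap" figure quoted later in the
thread.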
On Fri, Aug 27, 2021 at 10:07:16AM -0400, Johannes Weiner wrote:
> We have the same thoughts in MM and growing memory sizes. The DAX
> stuff said from the start it won't be built on linear struct page
> mappings anymore because we expect the memory modules to be too big to
> manage them with such fine-grained granularity.

Well, I did. Then I left Intel, and Dan took over. Now we have a
struct page for each 4kB of PMEM. I'm not particularly happy about
this change of direction.

> But in practice, this
> is more and more becoming true for DRAM as well. We don't want to
> allocate gigabytes of struct page when on our servers only a very
> small share of overall memory needs to be managed at this granularity.

This is a much less compelling argument than you think. I had some
ideas along these lines and I took them to a performance analysis
group. They told me that for their workloads, doubling the amount of
DRAM in a system increased performance by ~10%. So increasing the
amount of DRAM by 1/63 is going to increase performance by 1/630 or
0.15%. There are more important performance wins to go after.

Even in the cloud space where increasing memory by 1/63 might increase
the number of VMs you can host by 1/63, how many PMs host as many as
63 VMs? ie does it really buy you anything? It sounds like a nice big
number ("My 1TB machine has 16GB occupied by memmap!"), but the real
benefit doesn't really seem to be there.

And of course, that assumes that you have enough other resources to
scale to 64/63 of your current workload; you might hit CPU, IO or some
other limit first.

> Folio perpetuates the problem of the base page being the floor for
> cache granularity, and so from an MM POV it doesn't allow us to scale
> up to current memory sizes without horribly regressing certain
> filesystem workloads that still need us to be able to scale down.

The mistake you're making is coupling "minimum mapping granularity"
with "minimum allocation granularity".
We can happily build a system which only allocates memory on 2MB
boundaries and yet lets you map that memory to userspace in 4kB
granules.

> I really don't think it makes sense to discuss folios as the means for
> enabling huge pages in the page cache, without also taking a long hard
> look at the allocation model that is supposed to back them. Because
> you can't make it happen without that. And this part isn't looking so
> hot to me, tbh.

Please, don't creep the scope of this project to "first, redesign the
memory allocator". This project is _if we can_, use larg(er) pages to
cache files. What Darrick is talking about is an entirely different
project that I haven't signed up for and won't.

> Willy says he has future ideas to make compound pages scale. But we
> have years of history saying this is incredibly hard to achieve - and
> it certainly wasn't for a lack of constant trying.

I genuinely don't understand. We have five primary users of memory in
Linux (once we're in a steady state after boot):

 - Anonymous memory
 - File-backed memory
 - Slab
 - Network buffers
 - Page tables

The relative importance of each one very much depends on your
workload. Slab already uses medium order pages and can be made to use
larger. Folios should give us large allocations of file-backed memory
and eventually anonymous memory. Network buffers seem to be headed
towards larger allocations too. Page tables will need some more
thought, but once we're no longer interleaving file cache pages, anon
pages and page tables, they become less of a problem to deal with.

Once everybody's allocating order-4 pages, order-4 pages become easy
to allocate. When everybody's allocating order-0 pages, order-4 pages
require the right 16 pages to come available, and that's really
freaking hard.
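[The last point, that order-4 is hard when everyone else allocates
order-0, can be illustrated with a deliberately crude fragmentation
model. This is an assumption-laden sketch, not how the buddy allocator
with compaction actually behaves: it just assumes each base page is free
independently with probability p, in which case an aligned order-n block
is entirely free with probability p^(2^n) - an exponential decay in the
order:]

```c
#include <assert.h>

/* Probability that an aligned 2^order block of base pages is entirely
 * free, under the (unrealistic) assumption that each base page is free
 * independently with probability p. Real allocators group allocations
 * by order and migrate pages to do much better, but the exponential
 * decay is the core of the problem being described. */
static double free_block_probability(double p, unsigned int order)
{
	double prob = 1.0;
	unsigned long i;

	for (i = 0; i < (1UL << order); i++)
		prob *= p;
	return prob;
}
```

With half of memory free (p = 0.5), an order-0 request succeeds half the
time in this model, while an aligned order-4 block (16 pages) is free
with probability 0.5^16, about 1.5e-5 - which is the "right 16 pages"
problem in numbers.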
On Fri, Aug 27, 2021 at 11:47 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Aug 27, 2021 at 10:07:16AM -0400, Johannes Weiner wrote:
> > We have the same thoughts in MM and growing memory sizes. The DAX
> > stuff said from the start it won't be built on linear struct page
> > mappings anymore because we expect the memory modules to be too big to
> > manage them with such fine-grained granularity.
>
> Well, I did. Then I left Intel, and Dan took over. Now we have a struct
> page for each 4kB of PMEM. I'm not particularly happy about this change
> of direction.

Page-less DAX left more problems than it solved. Meanwhile,
ZONE_DEVICE has spawned other useful things like peer-to-peer DMA.

I am more encouraged by efforts to make the 'struct page' overhead
disappear, first from Muchun Song for hugetlbfs and recently Joao
Martins for device-dax. If anything, I think 'struct page' for PMEM /
DAX *strengthens* the case for folios / better mechanisms to reduce
the overhead of tracking 4K pages.
On Fri, Aug 27, 2021 at 02:41:11PM -0700, Dan Williams wrote:
> On Fri, Aug 27, 2021 at 11:47 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Aug 27, 2021 at 10:07:16AM -0400, Johannes Weiner wrote:
> > > We have the same thoughts in MM and growing memory sizes. The DAX
> > > stuff said from the start it won't be built on linear struct page
> > > mappings anymore because we expect the memory modules to be too big to
> > > manage them with such fine-grained granularity.
> >
> > Well, I did. Then I left Intel, and Dan took over. Now we have a struct
> > page for each 4kB of PMEM. I'm not particularly happy about this change
> > of direction.
>
> Page-less DAX left more problems than it solved. Meanwhile,
> ZONE_DEVICE has spawned other useful things like peer-to-peer DMA.

ZONE_DEVICE has created more problems than it solved. Pageless memory
is a concept which still needs to be supported, and we could have made
a start on that five years ago. Instead you opted for the expeditious
solution.
On Mon, Aug 23, 2021 at 08:01:44PM +0100, Matthew Wilcox wrote:
> The following changes since commit f0eb870a84224c9bfde0dc547927e8df1be4267c:
>
>   Merge tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux (2021-07-18 11:27:25 -0700)
>
> are available in the Git repository at:
>
>   git://git.infradead.org/users/willy/pagecache.git tags/folio-5.15
>
> for you to fetch changes up to 1a90e9dae32ce26de43c1c5eddb3ecce27f2a640:
>
>   mm/writeback: Add folio_write_one (2021-08-15 23:04:07 -0400)

Running 'sed -i' across the patches and reapplying them got me this:

The following changes since commit f0eb870a84224c9bfde0dc547927e8df1be4267c:

  Merge tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux (2021-07-18 11:27:25 -0700)

are available in the Git repository at:

  git://git.infradead.org/users/willy/pagecache.git tags/pageset-5.15

for you to fetch changes up to dc185ab836d41729f15b2925a59c7dc29ae72377:

  mm/writeback: Add pageset_write_one (2021-08-27 22:52:26 -0400)

----------------------------------------------------------------
Pagesets

Add pagesets, a new type to represent either an order-0 page or the
head page of a compound page. This should be enough infrastructure
to support filesystems converting from pages to pagesets.
----------------------------------------------------------------
Matthew Wilcox (Oracle) (90):
      mm: Convert get_page_unless_zero() to return bool
      mm: Introduce struct pageset
      mm: Add pageset_pgdat(), pageset_zone() and pageset_zonenum()
      mm/vmstat: Add functions to account pageset statistics
      mm/debug: Add VM_BUG_ON_PAGESET() and VM_WARN_ON_ONCE_PAGESET()
      mm: Add pageset reference count functions
      mm: Add pageset_put()
      mm: Add pageset_get()
      mm: Add pageset_try_get_rcu()
      mm: Add pageset flag manipulation functions
      mm/lru: Add pageset LRU functions
      mm: Handle per-pageset private data
      mm/filemap: Add pageset_index(), pageset_file_page() and pageset_contains()
      mm/filemap: Add pageset_next_index()
      mm/filemap: Add pageset_pos() and pageset_file_pos()
      mm/util: Add pageset_mapping() and pageset_file_mapping()
      mm/filemap: Add pageset_unlock()
      mm/filemap: Add pageset_lock()
      mm/filemap: Add pageset_lock_killable()
      mm/filemap: Add __pageset_lock_async()
      mm/filemap: Add pageset_wait_locked()
      mm/filemap: Add __pageset_lock_or_retry()
      mm/swap: Add pageset_rotate_reclaimable()
      mm/filemap: Add pageset_end_writeback()
      mm/writeback: Add pageset_wait_writeback()
      mm/writeback: Add pageset_wait_stable()
      mm/filemap: Add pageset_wait_bit()
      mm/filemap: Add pageset_wake_bit()
      mm/filemap: Convert page wait queues to be pagesets
      mm/filemap: Add pageset private_2 functions
      fs/netfs: Add pageset fscache functions
      mm: Add pageset_mapped()
      mm: Add pageset_nid()
      mm/memcg: Remove 'page' parameter to mem_cgroup_charge_statistics()
      mm/memcg: Use the node id in mem_cgroup_update_tree()
      mm/memcg: Remove soft_limit_tree_node()
      mm/memcg: Convert memcg_check_events to take a node ID
      mm/memcg: Add pageset_memcg() and related functions
      mm/memcg: Convert commit_charge() to take a pageset
      mm/memcg: Convert mem_cgroup_charge() to take a pageset
      mm/memcg: Convert uncharge_page() to uncharge_pageset()
      mm/memcg: Convert mem_cgroup_uncharge() to take a pageset
      mm/memcg: Convert mem_cgroup_migrate() to take pagesets
mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to pageset mm/memcg: Add pageset_memcg_lock() and pageset_memcg_unlock() mm/memcg: Convert mem_cgroup_move_account() to use a pageset mm/memcg: Add pageset_lruvec() mm/memcg: Add pageset_lruvec_lock() and similar functions mm/memcg: Add pageset_lruvec_relock_irq() and pageset_lruvec_relock_irqsave() mm/workingset: Convert workingset_activation to take a pageset mm: Add pageset_pfn() mm: Add pageset_raw_mapping() mm: Add flush_dcache_pageset() mm: Add kmap_local_pageset() mm: Add arch_make_pageset_accessible() mm: Add pageset_young and pageset_idle mm/swap: Add pageset_activate() mm/swap: Add pageset_mark_accessed() mm/rmap: Add pageset_mkclean() mm/migrate: Add pageset_migrate_mapping() mm/migrate: Add pageset_migrate_flags() mm/migrate: Add pageset_migrate_copy() mm/writeback: Rename __add_wb_stat() to wb_stat_mod() flex_proportions: Allow N events instead of 1 mm/writeback: Change __wb_writeout_inc() to __wb_writeout_add() mm/writeback: Add __pageset_end_writeback() mm/writeback: Add pageset_start_writeback() mm/writeback: Add pageset_mark_dirty() mm/writeback: Add __pageset_mark_dirty() mm/writeback: Convert tracing writeback_page_template to pagesets mm/writeback: Add filemap_dirty_pageset() mm/writeback: Add pageset_account_cleaned() mm/writeback: Add pageset_cancel_dirty() mm/writeback: Add pageset_clear_dirty_for_io() mm/writeback: Add pageset_account_redirty() mm/writeback: Add pageset_redirty_for_writepage() mm/filemap: Add i_blocks_per_pageset() mm/filemap: Add pageset_mkwrite_check_truncate() mm/filemap: Add readahead_pageset() mm/workingset: Convert workingset_refault() to take a pageset mm: Add pageset_evictable() mm/lru: Convert __pagevec_lru_add_fn to take a pageset mm/lru: Add pageset_add_lru() mm/page_alloc: Add pageset allocation functions mm/filemap: Add filemap_alloc_pageset mm/filemap: Add filemap_add_pageset() mm/filemap: Convert mapping_get_entry to return a pageset mm/filemap: Add 
filemap_get_pageset mm/filemap: Add FGP_STABLE mm/writeback: Add pageset_write_one Documentation/core-api/cachetlb.rst | 6 + Documentation/core-api/mm-api.rst | 5 + Documentation/filesystems/netfs_library.rst | 2 + arch/arc/include/asm/cacheflush.h | 1 + arch/arm/include/asm/cacheflush.h | 1 + arch/mips/include/asm/cacheflush.h | 2 + arch/nds32/include/asm/cacheflush.h | 1 + arch/nios2/include/asm/cacheflush.h | 3 +- arch/parisc/include/asm/cacheflush.h | 3 +- arch/sh/include/asm/cacheflush.h | 3 +- arch/xtensa/include/asm/cacheflush.h | 3 +- fs/afs/write.c | 9 +- fs/cachefiles/rdwr.c | 16 +- fs/io_uring.c | 2 +- fs/jfs/jfs_metapage.c | 1 + include/asm-generic/cacheflush.h | 6 + include/linux/backing-dev.h | 6 +- include/linux/flex_proportions.h | 9 +- include/linux/gfp.h | 22 +- include/linux/highmem-internal.h | 11 + include/linux/highmem.h | 37 ++ include/linux/huge_mm.h | 15 - include/linux/ksm.h | 4 +- include/linux/memcontrol.h | 231 ++++++----- include/linux/migrate.h | 4 + include/linux/mm.h | 239 +++++++++--- include/linux/mm_inline.h | 103 +++-- include/linux/mm_types.h | 77 ++++ include/linux/mmdebug.h | 20 + include/linux/netfs.h | 77 ++-- include/linux/page-flags.h | 267 +++++++++---- include/linux/page_idle.h | 99 +++-- include/linux/page_owner.h | 8 +- include/linux/page_ref.h | 158 +++++++- include/linux/pagemap.h | 585 ++++++++++++++++++---------- include/linux/rmap.h | 10 +- include/linux/swap.h | 17 +- include/linux/vmstat.h | 113 +++++- include/linux/writeback.h | 9 +- include/trace/events/pagemap.h | 46 ++- include/trace/events/writeback.h | 28 +- kernel/bpf/verifier.c | 2 +- kernel/events/uprobes.c | 3 +- lib/flex_proportions.c | 28 +- mm/Makefile | 2 +- mm/compaction.c | 4 +- mm/filemap.c | 575 +++++++++++++-------------- mm/huge_memory.c | 7 +- mm/hugetlb.c | 2 +- mm/internal.h | 36 +- mm/khugepaged.c | 8 +- mm/ksm.c | 34 +- mm/memcontrol.c | 358 +++++++++-------- mm/memory-failure.c | 2 +- mm/memory.c | 20 +- mm/mempolicy.c | 10 + 
mm/memremap.c | 2 +- mm/migrate.c | 189 +++++---- mm/mlock.c | 3 +- mm/page-writeback.c | 477 +++++++++++++---------- mm/page_alloc.c | 14 +- mm/page_io.c | 4 +- mm/page_owner.c | 10 +- mm/pageset-compat.c | 142 +++++++ mm/rmap.c | 14 +- mm/shmem.c | 7 +- mm/swap.c | 197 +++++----- mm/swap_state.c | 2 +- mm/swapfile.c | 8 +- mm/userfaultfd.c | 2 +- mm/util.c | 111 +++--- mm/vmscan.c | 8 +- mm/workingset.c | 52 +-- 73 files changed, 2900 insertions(+), 1692 deletions(-) create mode 100644 mm/pageset-compat.c
On Fri, Aug 27, 2021 at 07:44:29PM +0100, Matthew Wilcox wrote:
> On Fri, Aug 27, 2021 at 10:07:16AM -0400, Johannes Weiner wrote:
> > We have the same thoughts in MM and growing memory sizes. The DAX
> > stuff said from the start it won't be built on linear struct page
> > mappings anymore because we expect the memory modules to be too big
> > to manage them with such fine-grained granularity.
>
> Well, I did. Then I left Intel, and Dan took over. Now we have a struct
> page for each 4kB of PMEM. I'm not particularly happy about this change
> of direction.
>
> > But in practice, this is more and more becoming true for DRAM as
> > well. We don't want to allocate gigabytes of struct page when on our
> > servers only a very small share of overall memory needs to be managed
> > at this granularity.
>
> This is a much less compelling argument than you think. I had some
> ideas along these lines and I took them to a performance analysis group.
> They told me that for their workloads, doubling the amount of DRAM in a
> system increased performance by ~10%. So increasing the amount of DRAM
> by 1/63 is going to increase performance by 1/630 or 0.15%. There are
> more important performance wins to go after.

Well, that's kind of obvious. Once a configuration is balanced for
CPU, memory, IO, network etc, adding sticks of RAM doesn't help;
neither will freeing some memory here and there. The short term isn't
where this matters.

It matters rather a lot, though, when we design and purchase the
hardware. RAM is becoming a larger share of overall machine cost, so
at-scale deployments like ours are under more pressure than ever to
provision it tightly. When we configure our systems we look at the
workloads' resource consumption ratios, as well as the kernel
overhead, and then we need to buy capacity accordingly.

> Even in the cloud space where increasing memory by 1/63 might increase
> the number of VMs you can host by 1/63, how many PMs host as many as
> 63 VMs? ie does it really buy you anything? It sounds like a nice big
> number ("My 1TB machine has 16GB occupied by memmap!"), but the real
> benefit doesn't really seem to be there. And of course, that assumes
> that you have enough other resources to scale to 64/63 of your current
> workload; you might hit CPU, IO or some other limit first.

A lot of DC hosts nowadays are in a direct pipeline for handling user
requests, which are highly parallelizable.

They are much smaller, and there are a lot more of them than there are
VMs in the world. The per-request and per-host margins are thinner,
and the compute-to-memory ratio is more finely calibrated than when
you're renting out large VMs that don't neatly divide up the machine.

Right now, we're averaging ~1G of RAM per CPU thread for most of our
hosts. You don't need a very large system - certainly not in the TB
ballpark - where struct page takes up the memory budget of entire CPU
threads. So now we have to spec memory for it, and spend additional
capex and watts, or we'll end up leaving those CPU threads stranded.

You're certainly right that there are configurations that likely won't
care much - especially more legacy, big-iron style stuff that isn't
quite as parallelized and as thinly provisioned. But you can't make
the argument that nobody will miss 16G in a 1TB host that has the CPU
concurrency and the parallel work to match it.

> > Folio perpetuates the problem of the base page being the floor for
> > cache granularity, and so from an MM POV it doesn't allow us to scale
> > up to current memory sizes without horribly regressing certain
> > filesystem workloads that still need us to be able to scale down.
>
> The mistake you're making is coupling "minimum mapping granularity" with
> "minimum allocation granularity". We can happily build a system which
> only allocates memory on 2MB boundaries and yet lets you map that memory
> to userspace in 4kB granules.

Yeah, but I want to do it without allocating 4k granule descriptors
statically at boot time for the entirety of available memory.

> > I really don't think it makes sense to discuss folios as the means
> > for enabling huge pages in the page cache, without also taking a long
> > hard look at the allocation model that is supposed to back them.
> > Because you can't make it happen without that. And this part isn't
> > looking so hot to me, tbh.
>
> Please, don't creep the scope of this project to "first, redesign
> the memory allocator". This project is _if we can_, use larg(er)
> pages to cache files. What Darrick is talking about is an entirely
> different project that I haven't signed up for and won't.

I never said the allocator needs to be fixed first. I've only been
advocating to remove (or keep out) unnecessary allocation assumptions
from folio to give us the flexibility to fix the allocator later on.

> > Willy says he has future ideas to make compound pages scale. But we
> > have years of history saying this is incredibly hard to achieve - and
> > it certainly wasn't for a lack of constant trying.
>
> I genuinely don't understand. We have five primary users of memory
> in Linux (once we're in a steady state after boot):
>
> - Anonymous memory
> - File-backed memory
> - Slab
> - Network buffers
> - Page tables
>
> The relative importance of each one very much depends on your workload.
> Slab already uses medium order pages and can be made to use larger.
> Folios should give us large allocations of file-backed memory and
> eventually anonymous memory. Network buffers seem to be headed towards
> larger allocations too. Page tables will need some more thought, but
> once we're no longer interleaving file cache pages, anon pages and
> page tables, they become less of a problem to deal with.
>
> Once everybody's allocating order-4 pages, order-4 pages become easy
> to allocate. When everybody's allocating order-0 pages, order-4 pages
> require the right 16 pages to come available, and that's really freaking
> hard.

Well yes, once (and iff) everybody is doing that. But for the
foreseeable future we're expecting to stay in a world where the
*majority* of memory is in larger chunks, while we continue to see 4k
cache entries, anon pages, and corresponding ptes, yes?

Memory is dominated by larger allocations from the main workloads, but
we'll continue to have a base system that does logging, package
upgrades, IPC stuff, has small config files, small libraries, small
executables. It'll be a while until we can raise the floor on those
much smaller allocations - if ever.

So we need a system to manage them living side by side.

The slab allocator has proven to be an excellent solution to this
problem, because the mailing lists are not flooded with OOM reports
where smaller allocations fragmented the 4k page space. And even large
temporary slab explosions (inodes, dentries etc.) are usually pushed
back with fairly reasonable CPU overhead.

The same really cannot be said for the untyped page allocator and the
various solutions we've had to address fragmentation after the fact.

Again, I'm not saying any of this needs to be actually *fixed* MM-side
to enable the huge page cache in the filesystems. I'd be more than
happy to go ahead with the "cache descriptor" aspect of the folio.

All I'm saying is we shouldn't double down on compound pages and tie
the filesystems to that anchor, just for that false synergy between
the new cache descriptor and fixing the compound_head() mess.
On Mon, Aug 30, 2021 at 01:32:55PM -0400, Johannes Weiner wrote:
> A lot of DC hosts nowadays are in a direct pipeline for handling user
> requests, which are highly parallelizable.
>
> They are much smaller, and there are a lot more of them than there are
> VMs in the world. The per-request and per-host margins are thinner,
> and the compute-to-memory ratio is more finely calibrated than when
> you're renting out large VMs that don't neatly divide up the machine.
>
> Right now, we're averaging ~1G of RAM per CPU thread for most of our
> hosts. You don't need a very large system - certainly not in the TB
> ballpark - where struct page takes up the memory budget of entire CPU
> threads. So now we have to spec memory for it, and spend additional
> capex and watts, or we'll end up leaving those CPU threads stranded.

So you're noticing at the level of a 64 thread machine (something like
a dual-socket Xeon Gold 5318H, which would have 2x18x2 = 72 threads).
Things certainly have changed, then.

> > The mistake you're making is coupling "minimum mapping granularity" with
> > "minimum allocation granularity". We can happily build a system which
> > only allocates memory on 2MB boundaries and yet lets you map that memory
> > to userspace in 4kB granules.
>
> Yeah, but I want to do it without allocating 4k granule descriptors
> statically at boot time for the entirety of available memory.

Even that is possible when bumping the PAGE_SIZE to 16kB. It needs a
bit of fiddling:

static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
			unsigned long addr, struct page *page, pgprot_t prot)
{
	if (!pte_none(*pte))
		return -EBUSY;
	/* Ok, finally just insert the thing.. */
	get_page(page);
	inc_mm_counter_fast(mm, mm_counter_file(page));
	page_add_file_rmap(page, false);
	set_pte_at(mm, addr, pte, mk_pte(page, prot));
	return 0;
}

mk_pte() assumes that a struct page refers to a single pte. If we
revamped it to take (page, offset, prot), it could construct the
appropriate pte for the offset within that page.

---

Independent of _that_, the biggest problem we face (I think) in getting
rid of memmap is that it offers the pfn_to_page() lookup. If we move to a
dynamically allocated descriptor for our arbitrarily-sized memory objects,
we need a tree to store them in. Given the trees we currently have,
our best bet is probably the radix tree, but I dislike its glass jaws.
I'm hoping that (again) the maple tree becomes stable soon enough for
us to dynamically allocate memory descriptors and store them in it.
And that we don't discover a bootstrapping problem between kmalloc()
(for tree nodes) and memmap (to look up the page associated with a node).

But that's all a future problem and if we can't even take a first step
to decouple filesystems from struct page then working towards that would
be wasted effort.

> > > Willy says he has future ideas to make compound pages scale. But we
> > > have years of history saying this is incredibly hard to achieve - and
> > > it certainly wasn't for a lack of constant trying.
> >
> > I genuinely don't understand. We have five primary users of memory
> > in Linux (once we're in a steady state after boot):
> >
> > - Anonymous memory
> > - File-backed memory
> > - Slab
> > - Network buffers
> > - Page tables
> >
> > The relative importance of each one very much depends on your workload.
> > Slab already uses medium order pages and can be made to use larger.
> > Folios should give us large allocations of file-backed memory and
> > eventually anonymous memory. Network buffers seem to be headed towards
> > larger allocations too. Page tables will need some more thought, but
> > once we're no longer interleaving file cache pages, anon pages and
> > page tables, they become less of a problem to deal with.
> >
> > Once everybody's allocating order-4 pages, order-4 pages become easy
> > to allocate. When everybody's allocating order-0 pages, order-4 pages
> > require the right 16 pages to come available, and that's really freaking
> > hard.
>
> Well yes, once (and iff) everybody is doing that. But for the
> foreseeable future we're expecting to stay in a world where the
> *majority* of memory is in larger chunks, while we continue to see 4k
> cache entries, anon pages, and corresponding ptes, yes?

No. 4k page table entries are demanded by the architecture, and there's
little we can do about that. We can allocate them in larger chunks, but
let's not solve that problem in this email. I can see a world where anon
memory is managed (by default, opportunistically) in larger chunks within
a year. Maybe six months if somebody really works hard on it.

> Memory is dominated by larger allocations from the main workloads, but
> we'll continue to have a base system that does logging, package
> upgrades, IPC stuff, has small config files, small libraries, small
> executables. It'll be a while until we can raise the floor on those
> much smaller allocations - if ever.
>
> So we need a system to manage them living side by side.
>
> The slab allocator has proven to be an excellent solution to this
> problem, because the mailing lists are not flooded with OOM reports
> where smaller allocations fragmented the 4k page space. And even large
> temporary slab explosions (inodes, dentries etc.) are usually pushed
> back with fairly reasonable CPU overhead.

You may not see the bug reports, but they exist. Right now, we have
a service that is echoing 2 to drop_caches every hour on systems which
are lightly loaded, otherwise the dcache swamps the entire machine and
takes hours or days to come back under control.
On Mon, Aug 30, 2021 at 07:22:25PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 30, 2021 at 01:32:55PM -0400, Johannes Weiner wrote:
> > > The mistake you're making is coupling "minimum mapping granularity" with
> > > "minimum allocation granularity". We can happily build a system which
> > > only allocates memory on 2MB boundaries and yet lets you map that memory
> > > to userspace in 4kB granules.
> >
> > Yeah, but I want to do it without allocating 4k granule descriptors
> > statically at boot time for the entirety of available memory.
>
> Even that is possible when bumping the PAGE_SIZE to 16kB. It needs a
> bit of fiddling:
>
> static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
> 			unsigned long addr, struct page *page, pgprot_t prot)
> {
> 	if (!pte_none(*pte))
> 		return -EBUSY;
> 	/* Ok, finally just insert the thing.. */
> 	get_page(page);
> 	inc_mm_counter_fast(mm, mm_counter_file(page));
> 	page_add_file_rmap(page, false);
> 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
> 	return 0;
> }
>
> mk_pte() assumes that a struct page refers to a single pte. If we
> revamped it to take (page, offset, prot), it could construct the
> appropriate pte for the offset within that page.

Right, page tables only need a pfn. The struct page is for us to
maintain additional state about the object.

For the objects that are subpage sized, we should be able to hold that
state (shrinker lru linkage, referenced bit, dirtiness, ...) inside
ad-hoc allocated descriptors.

Descriptors which could well be what struct folio {} is today, IMO. As
long as it doesn't innately assume, or will assume, in the API the
1:1+ mapping to struct page that is inherent to the compound page.

> Independent of _that_, the biggest problem we face (I think) in getting
> rid of memmap is that it offers the pfn_to_page() lookup. If we move to a
> dynamically allocated descriptor for our arbitrarily-sized memory objects,
> we need a tree to store them in. Given the trees we currently have,
> our best bet is probably the radix tree, but I dislike its glass jaws.
> I'm hoping that (again) the maple tree becomes stable soon enough for
> us to dynamically allocate memory descriptors and store them in it.
> And that we don't discover a bootstrapping problem between kmalloc()
> (for tree nodes) and memmap (to look up the page associated with a node).
>
> But that's all a future problem and if we can't even take a first step
> to decouple filesystems from struct page then working towards that would
> be wasted effort.

Agreed. Again, I'm just advocating to keep the doors open on that, and
avoid the situation where the filesystem folks run off and convert to
a flexible folio data structure, and the MM people run off and convert
all compound pages to folio and in the process hardcode assumptions
and turn it basically into struct page again that can't easily change.

> > > > Willy says he has future ideas to make compound pages scale. But we
> > > > have years of history saying this is incredibly hard to achieve - and
> > > > it certainly wasn't for a lack of constant trying.
> > >
> > > I genuinely don't understand. We have five primary users of memory
> > > in Linux (once we're in a steady state after boot):
> > >
> > > - Anonymous memory
> > > - File-backed memory
> > > - Slab
> > > - Network buffers
> > > - Page tables
> > >
> > > The relative importance of each one very much depends on your workload.
> > > Slab already uses medium order pages and can be made to use larger.
> > > Folios should give us large allocations of file-backed memory and
> > > eventually anonymous memory. Network buffers seem to be headed towards
> > > larger allocations too. Page tables will need some more thought, but
> > > once we're no longer interleaving file cache pages, anon pages and
> > > page tables, they become less of a problem to deal with.
> > >
> > > Once everybody's allocating order-4 pages, order-4 pages become easy
> > > to allocate. When everybody's allocating order-0 pages, order-4 pages
> > > require the right 16 pages to come available, and that's really freaking
> > > hard.
> >
> > Well yes, once (and iff) everybody is doing that. But for the
> > foreseeable future we're expecting to stay in a world where the
> > *majority* of memory is in larger chunks, while we continue to see 4k
> > cache entries, anon pages, and corresponding ptes, yes?
>
> No. 4k page table entries are demanded by the architecture, and there's
> little we can do about that.

I wasn't claiming otherwise..?

> > Memory is dominated by larger allocations from the main workloads, but
> > we'll continue to have a base system that does logging, package
> > upgrades, IPC stuff, has small config files, small libraries, small
> > executables. It'll be a while until we can raise the floor on those
> > much smaller allocations - if ever.
> >
> > So we need a system to manage them living side by side.
> >
> > The slab allocator has proven to be an excellent solution to this
> > problem, because the mailing lists are not flooded with OOM reports
> > where smaller allocations fragmented the 4k page space. And even large
> > temporary slab explosions (inodes, dentries etc.) are usually pushed
> > back with fairly reasonable CPU overhead.
>
> You may not see the bug reports, but they exist. Right now, we have
> a service that is echoing 2 to drop_caches every hour on systems which
> are lightly loaded, otherwise the dcache swamps the entire machine and
> takes hours or days to come back under control.

Sure, but compare that to the number of complaints about higher-order
allocations failing or taking too long (THP in the fault path e.g.)...

Typegrouping isn't infallible for fighting fragmentation, but it seems
to be good enough for most cases. Unlike the buddy allocator.
On Mon, Aug 30, 2021 at 04:27:04PM -0400, Johannes Weiner wrote:
> Right, page tables only need a pfn. The struct page is for us to
> maintain additional state about the object.
>
> For the objects that are subpage sized, we should be able to hold that
> state (shrinker lru linkage, referenced bit, dirtiness, ...) inside
> ad-hoc allocated descriptors.
>
> Descriptors which could well be what struct folio {} is today, IMO. As
> long as it doesn't innately assume, or will assume, in the API the
> 1:1+ mapping to struct page that is inherent to the compound page.

Maybe this is where we fundamentally disagree. I don't think there's
any point in *managing* memory in a different size from that in which
it is *allocated*. There's no point in tracking dirtiness, LRU
position, locked, etc, etc in different units from allocation size.
The point of tracking all these things is so we can allocate and free
memory. If a 'cache descriptor' reaches the end of the LRU and should
be reclaimed, that's wasted effort in tracking if the rest of the
'cache descriptor' is dirty and heavily in use. So a 'cache descriptor'
should always be at least a 'struct page' in size (assuming you're
using 'struct page' to mean "the size of the smallest allocation unit
from the page allocator")

> > > > I genuinely don't understand. We have five primary users of memory
> > > > in Linux (once we're in a steady state after boot):
> > > >
> > > > - Anonymous memory
> > > > - File-backed memory
> > > > - Slab
> > > > - Network buffers
> > > > - Page tables
> > > >
> > > > The relative importance of each one very much depends on your workload.
> > > > Slab already uses medium order pages and can be made to use larger.
> > > > Folios should give us large allocations of file-backed memory and
> > > > eventually anonymous memory. Network buffers seem to be headed towards
> > > > larger allocations too. Page tables will need some more thought, but
> > > > once we're no longer interleaving file cache pages, anon pages and
> > > > page tables, they become less of a problem to deal with.
> > > >
> > > > Once everybody's allocating order-4 pages, order-4 pages become easy
> > > > to allocate. When everybody's allocating order-0 pages, order-4 pages
> > > > require the right 16 pages to come available, and that's really freaking
> > > > hard.
> > >
> > > Well yes, once (and iff) everybody is doing that. But for the
> > > foreseeable future we're expecting to stay in a world where the
> > > *majority* of memory is in larger chunks, while we continue to see 4k
> > > cache entries, anon pages, and corresponding ptes, yes?
> >
> > No. 4k page table entries are demanded by the architecture, and there's
> > little we can do about that.
>
> I wasn't claiming otherwise..?

You snipped the part of my paragraph that made the 'No' make sense.
I'm agreeing that page tables will continue to be a problem, but
everything else (page cache, anon, networking, slab) I expect to be
using higher order allocations within the next year.

> > > The slab allocator has proven to be an excellent solution to this
> > > problem, because the mailing lists are not flooded with OOM reports
> > > where smaller allocations fragmented the 4k page space. And even large
> > > temporary slab explosions (inodes, dentries etc.) are usually pushed
> > > back with fairly reasonable CPU overhead.
> >
> > You may not see the bug reports, but they exist. Right now, we have
> > a service that is echoing 2 to drop_caches every hour on systems which
> > are lightly loaded, otherwise the dcache swamps the entire machine and
> > takes hours or days to come back under control.
>
> Sure, but compare that to the number of complaints about higher-order
> allocations failing or taking too long (THP in the fault path e.g.)...

Oh, we have those bug reports too ...

> Typegrouping isn't infallible for fighting fragmentation, but it seems
> to be good enough for most cases. Unlike the buddy allocator.

You keep saying that the buddy allocator isn't given enough information
to do any better, but I think it is. Page cache and anon memory are
marked with GFP_MOVABLE. Slab, network and page tables aren't. Is there
a reason that isn't enough?

I think something that might actually help is if we added a pair of new
GFP flags, __GFP_FAST and __GFP_DENSE. Dense allocations are those which
are expected to live for a long time, and so the page allocator should
try to group them with other dense allocations. Slab and page tables
should use DENSE, along with things like superblocks, or fs bitmaps where
the speed of allocation is almost unimportant, but attempting to keep
them out of the way of other allocations is useful. Fast allocations
are for allocations which should not live for very long. The speed of
allocation dominates, and it's OK if the allocation gets in the way of
defragmentation for a while.

An example of another allocator that could care about DENSE vs FAST
would be vmalloc. Today, it does:

	if (array_size > PAGE_SIZE) {
		area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
					area->caller);
	} else {
		area->pages = kmalloc_node(array_size, nested_gfp, node);
	}

That's actually pretty bad; if you have, say, a 768kB vmalloc space,
you need a 12kB array. We currently allocate 16kB for the array, when we
could use alloc_pages_exact() to free the 4kB we're never going to use.
If this is GFP_DENSE, we know it's a long-lived allocation and we can
let somebody else use the extra 4kB. If it's not, it's probably not
worth bothering with.
On 8/30/21 23:38, Matthew Wilcox wrote:
> I think something that might actually help is if we added a pair of new
> GFP flags, __GFP_FAST and __GFP_DENSE. Dense allocations are those which
> are expected to live for a long time, and so the page allocator should
> try to group them with other dense allocations. Slab and page tables
> should use DENSE, along with things like superblocks, or fs bitmaps where
> the speed of allocation is almost unimportant, but attempting to keep
> them out of the way of other allocations is useful. Fast allocations
> are for allocations which should not live for very long. The speed of
> allocation dominates, and it's OK if the allocation gets in the way of
> defragmentation for a while.

Note we used to have GFP_TEMPORARY, but it didn't really work out:
https://lwn.net/Articles/732107/

> An example of another allocator that could care about DENSE vs FAST
> would be vmalloc. Today, it does:
>
> 	if (array_size > PAGE_SIZE) {
> 		area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> 					area->caller);
> 	} else {
> 		area->pages = kmalloc_node(array_size, nested_gfp, node);
> 	}
>
> That's actually pretty bad; if you have, say, a 768kB vmalloc space,
> you need a 12kB array. We currently allocate 16kB for the array, when we
> could use alloc_pages_exact() to free the 4kB we're never going to use.
> If this is GFP_DENSE, we know it's a long-lived allocation and we can
> let somebody else use the extra 4kB. If it's not, it's probably not
> worth bothering with.
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Mon, Aug 30, 2021 at 07:22:25PM +0100, Matthew Wilcox wrote:
>> On Mon, Aug 30, 2021 at 01:32:55PM -0400, Johannes Weiner wrote:
>> > > The mistake you're making is coupling "minimum mapping granularity" with
>> > > "minimum allocation granularity". We can happily build a system which
>> > > only allocates memory on 2MB boundaries and yet lets you map that memory
>> > > to userspace in 4kB granules.
>> >
>> > Yeah, but I want to do it without allocating 4k granule descriptors
>> > statically at boot time for the entirety of available memory.
>>
>> Even that is possible when bumping the PAGE_SIZE to 16kB. It needs a
>> bit of fiddling:
>>
>> static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
>> 			unsigned long addr, struct page *page, pgprot_t prot)
>> {
>> 	if (!pte_none(*pte))
>> 		return -EBUSY;
>> 	/* Ok, finally just insert the thing.. */
>> 	get_page(page);
>> 	inc_mm_counter_fast(mm, mm_counter_file(page));
>> 	page_add_file_rmap(page, false);
>> 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
>> 	return 0;
>> }
>>
>> mk_pte() assumes that a struct page refers to a single pte. If we
>> revamped it to take (page, offset, prot), it could construct the
>> appropriate pte for the offset within that page.
>
> Right, page tables only need a pfn. The struct page is for us to
> maintain additional state about the object.
>
> For the objects that are subpage sized, we should be able to hold that
> state (shrinker lru linkage, referenced bit, dirtiness, ...) inside
> ad-hoc allocated descriptors.
>
> Descriptors which could well be what struct folio {} is today, IMO. As
> long as it doesn't innately assume, or will assume, in the API the
> 1:1+ mapping to struct page that is inherent to the compound page.

struct buffer_head any one?

I am being silly but when you say you want something that isn't a page
for caching that could be less than a page in size, it really sounds
like you want struct buffer_head.

The only actual problem I am aware of with struct buffer_head is that
it is a block device abstraction and does not map well to other
situations. Which makes network filesystems unable to use struct
buffer_head.

Eric
On Tue, Aug 24, 2021 at 03:44:48PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 24, 2021 at 08:23:15PM +0100, Matthew Wilcox wrote:
>
> So if someone sees "kmem_cache_alloc()", they can probably make a
> guess what it means, and it's memorable once they learn it.
> Similarly, something like "head_page", or "mempages" is going to be a bit
> more obvious to a kernel newbie. So if we can make a tiny gesture
> towards comprehensibility, it would be good to do so while it's still
> easier to change the name.

Talking about being newbie friendly, how about we'll just add a piece
of documentation along with the new type for a change?

Something along those lines (I'm sure willy can add several more
sentences for Folio description)

diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst
index 30e8fbed6914..b5b39ebe67cf 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -30,6 +30,29 @@ Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
 helpers that allow the conversion from PFN to `struct page` and vice
 versa.
 
+Pages
+-----
+
+Each physical page frame in the system is represented by a `struct page`.
+This structure aggregates several types, each corresponding to a
+particular usage of a page frame, such as anonymous memory, SLAB caches,
+file-backed memory etc. These types are defined within unions in the struct
+page to reduce memory footprint of the memory map.
+
+The actual type of the particular instance of struct page is determined by
+values of the fields shared between the different types and can be queried
+using page flag operations defined in ``include/linux/page-flags.h``
+
+Folios
+------
+
+For many use cases, single page frame granularity is too small. In such
+cases a contiguous range of memory can be referred to by a `struct folio`.
+
+A folio is a physically, virtually and logically contiguous range of
+bytes. It is a power-of-two in size, and it is aligned to that same
+power-of-two. It is at least as large as PAGE_SIZE.
+
 FLATMEM
 =======
On Mon, Aug 30, 2021 at 10:38:20PM +0100, Matthew Wilcox wrote: > On Mon, Aug 30, 2021 at 04:27:04PM -0400, Johannes Weiner wrote: > > Right, page tables only need a pfn. The struct page is for us to > > maintain additional state about the object. > > > > For the objects that are subpage sized, we should be able to hold that > > state (shrinker lru linkage, referenced bit, dirtiness, ...) inside > > ad-hoc allocated descriptors. > > > > Descriptors which could well be what struct folio {} is today, IMO. As > > long as it doesn't innately assume, or will assume, in the API the > > 1:1+ mapping to struct page that is inherent to the compound page. > > Maybe this is where we fundamentally disagree. I don't think there's > any point in *managing* memory in a different size from that in which it > is *allocated*. There's no point in tracking dirtiness, LRU position, > locked, etc, etc in different units from allocation size. The point of > tracking all these things is so we can allocate and free memory. If > a 'cache descriptor' reaches the end of the LRU and should be reclaimed, > that's wasted effort in tracking if the rest of the 'cache descriptor' > is dirty and heavily in use. So a 'cache descriptor' should always be > at least a 'struct page' in size (assuming you're using 'struct page' > to mean "the size of the smallest allocation unit from the page > allocator") First off, we've been doing this with the slab shrinker for decades. Second, you'll still be doing this when you track 4k struct pages in a system that is trying to serve primarily higher-order pages. Whether you free N cache descriptors to free a page, or free N pages to free a compound page, it's the same thing. You won't avoid this problem. > > > > Well yes, once (and iff) everybody is doing that. 
But for the > > > > foreseeable future we're expecting to stay in a world where the > > > > *majority* of memory is in larger chunks, while we continue to see 4k > > > > cache entries, anon pages, and corresponding ptes, yes? > > > > > > No. 4k page table entries are demanded by the architecture, and there's > > > little we can do about that. > > > > I wasn't claiming otherwise..? > > You snipped the part of my paragraph that made the 'No' make sense. > I'm agreeing that page tables will continue to be a problem, but > everything else (page cache, anon, networking, slab) I expect to be > using higher order allocations within the next year. Some, maybe, but certainly not all of them. I'd like to remind you of this analysis that Al did on the linux source tree with various page sizes: https://lore.kernel.org/linux-mm/YGVUobKUMUtEy1PS@zeniv-ca.linux.org.uk/ Page size Footprint 4Kb 1128Mb 8Kb 1324Mb 16Kb 1764Mb 32Kb 2739Mb 64Kb 4832Mb 128Kb 9191Mb 256Kb 18062Mb 512Kb 35883Mb 1Mb 71570Mb 2Mb 142958Mb Even just going to 32k more than doubles the cache footprint of this one repo. This is a no-go from a small-file scalability POV. I think my point stands: for the foreseeable future, we're going to continue to see demand for 4k cache entries as well as an increasing demand for 2M blocks in the page cache and for anonymous mappings. We're going to need an allocation model that can handle this. Luckily, we already do... > > > > The slab allocator has proven to be an excellent solution to this > > > > problem, because the mailing lists are not flooded with OOM reports > > > > where smaller allocations fragmented the 4k page space. And even large > > > > temporary slab explosions (inodes, dentries etc.) are usually pushed > > > > back with fairly reasonable CPU overhead. > > > > > > You may not see the bug reports, but they exist. 
Right now, we have > > > a service that is echoing 2 to drop_caches every hour on systems which > > > are lightly loaded, otherwise the dcache swamps the entire machine and > > > takes hours or days to come back under control. > > > > Sure, but compare that to the number of complaints about higher-order > > allocations failing or taking too long (THP in the fault path e.g.)... > > Oh, we have those bug reports too ... > > > Typegrouping isn't infallible for fighting fragmentation, but it seems > > to be good enough for most cases. Unlike the buddy allocator. > > You keep saying that the buddy allocator isn't given enough information to > do any better, but I think it is. Page cache and anon memory are marked > with GFP_MOVABLE. Slab, network and page tables aren't. Is there a > reason that isn't enough? Anon and cache don't have the same lifetime, and anon isn't reclaimable without swap. Yes, movable means we don't have to reclaim them, but background reclaim happens anyway due to the watermarks, and if that doesn't produce contiguous blocks by itself already then compaction has to run on top of that. This is where we tend to see the allocation latencies that prohibit THP allocations during page faults. I would say the same is true for page tables allocated alongside network buffers and unreclaimable slab pages. I.e. a burst in short-lived network buffer allocations being interleaved with long-lived page table allocations. Ongoing concurrency scaling is going to increase the likelihood of those happening. > I think something that might actually help is if we added a pair of new > GFP flags, __GFP_FAST and __GFP_DENSE. Dense allocations are those which > are expected to live for a long time, and so the page allocator should > try to group them with other dense allocations. Slab and page tables > should use DENSE, You're really just recreating a crappier, less maintainable version of the object packing that *slab already does*. 
It's *slab* that is supposed to deal with internal fragmentation, not the page allocator. The page allocator is good at cranking out uniform, slightly big memory blocks. The slab allocator is good at subdividing those into smaller objects, neatly packed and grouped to facilitate contiguous reclaim, while providing detailed breakdowns of per-type memory usage and internal fragmentation to the user and to kernel developers. [ And introspection and easy reporting from production are *really important*, because fragmentation issues develop over timelines that extend the usual testing horizon of kernel developers. ] By trying to make compound pages the norm, you're making internal fragmentation a first-class problem of the page allocator. This conflates the problem space between slab and the page allocator and it forces you to duplicate large parts of the solution. This is not about whether it's technically achievable. It's about making an incomprehensible mess of the allocator layering and having to solve a difficult MM problem in two places. Because you're trying to make compound pages into something they were never meant to be. They're fine for the odd optimistic allocation that can either wait forever to defragment or fall back gracefully. But there is just no way these things are going to be the maintainable route for transitioning to a larger page size. As long as this is your ambition with the folio, I'm sorry but it's a NAK from me.
On 1 Sep 2021, at 13:43, Johannes Weiner wrote: > On Mon, Aug 30, 2021 at 10:38:20PM +0100, Matthew Wilcox wrote: >> On Mon, Aug 30, 2021 at 04:27:04PM -0400, Johannes Weiner wrote: >>> Right, page tables only need a pfn. The struct page is for us to >>> maintain additional state about the object. >>> >>> For the objects that are subpage sized, we should be able to hold that >>> state (shrinker lru linkage, referenced bit, dirtiness, ...) inside >>> ad-hoc allocated descriptors. >>> >>> Descriptors which could well be what struct folio {} is today, IMO. As >>> long as it doesn't innately assume, or will assume, in the API the >>> 1:1+ mapping to struct page that is inherent to the compound page. >> >> Maybe this is where we fundamentally disagree. I don't think there's >> any point in *managing* memory in a different size from that in which it >> is *allocated*. There's no point in tracking dirtiness, LRU position, >> locked, etc, etc in different units from allocation size. The point of >> tracking all these things is so we can allocate and free memory. If >> a 'cache descriptor' reaches the end of the LRU and should be reclaimed, >> that's wasted effort in tracking if the rest of the 'cache descriptor' >> is dirty and heavily in use. So a 'cache descriptor' should always be >> at least a 'struct page' in size (assuming you're using 'struct page' >> to mean "the size of the smallest allocation unit from the page >> allocator") > > First off, we've been doing this with the slab shrinker for decades. > > Second, you'll still be doing this when you track 4k struct pages in a > system that is trying to serve primarily higher-order pages. Whether > you free N cache descriptors to free a page, or free N pages to free a > compound page, it's the same thing. You won't avoid this problem. > >>>>> Well yes, once (and iff) everybody is doing that. 
>>>>> But for the foreseeable future we're expecting to stay in a world
>>>>> where the *majority* of memory is in larger chunks, while we continue
>>>>> to see 4k cache entries, anon pages, and corresponding ptes, yes?
>>>>
>>>> No. 4k page table entries are demanded by the architecture, and there's
>>>> little we can do about that.
>>>
>>> I wasn't claiming otherwise..?
>>
>> You snipped the part of my paragraph that made the 'No' make sense.
>> I'm agreeing that page tables will continue to be a problem, but
>> everything else (page cache, anon, networking, slab) I expect to be
>> using higher order allocations within the next year.
>
> Some, maybe, but certainly not all of them. I'd like to remind you of
> this analysis that Al did on the linux source tree with various page
> sizes:
>
> https://lore.kernel.org/linux-mm/YGVUobKUMUtEy1PS@zeniv-ca.linux.org.uk/
>
> 	Page size	Footprint
> 	4Kb		1128Mb
> 	8Kb		1324Mb
> 	16Kb		1764Mb
> 	32Kb		2739Mb
> 	64Kb		4832Mb
> 	128Kb		9191Mb
> 	256Kb		18062Mb
> 	512Kb		35883Mb
> 	1Mb		71570Mb
> 	2Mb		142958Mb
>
> Even just going to 32k more than doubles the cache footprint of this
> one repo. This is a no-go from a small-file scalability POV.
>
> I think my point stands: for the foreseeable future, we're going to
> continue to see demand for 4k cache entries as well as an increasing
> demand for 2M blocks in the page cache and for anonymous mappings.
>
> We're going to need an allocation model that can handle this. Luckily,
> we already do...
>
>>>>> The slab allocator has proven to be an excellent solution to this
>>>>> problem, because the mailing lists are not flooded with OOM reports
>>>>> where smaller allocations fragmented the 4k page space. And even large
>>>>> temporary slab explosions (inodes, dentries etc.) are usually pushed
>>>>> back with fairly reasonable CPU overhead.
>>>>
>>>> You may not see the bug reports, but they exist.
Right now, we have >>>> a service that is echoing 2 to drop_caches every hour on systems which >>>> are lightly loaded, otherwise the dcache swamps the entire machine and >>>> takes hours or days to come back under control. >>> >>> Sure, but compare that to the number of complaints about higher-order >>> allocations failing or taking too long (THP in the fault path e.g.)... >> >> Oh, we have those bug reports too ... >> >>> Typegrouping isn't infallible for fighting fragmentation, but it seems >>> to be good enough for most cases. Unlike the buddy allocator. >> >> You keep saying that the buddy allocator isn't given enough information to >> do any better, but I think it is. Page cache and anon memory are marked >> with GFP_MOVABLE. Slab, network and page tables aren't. Is there a >> reason that isn't enough? > > Anon and cache don't have the same lifetime, and anon isn't > reclaimable without swap. Yes, movable means we don't have to reclaim > them, but background reclaim happens anyway due to the watermarks, and > if that doesn't produce contiguous blocks by itself already then > compaction has to run on top of that. This is where we tend to see the > allocation latencies that prohibit THP allocations during page faults. > > I would say the same is true for page tables allocated alongside > network buffers and unreclaimable slab pages. I.e. a burst in > short-lived network buffer allocations being interleaved with > long-lived page table allocations. Ongoing concurrency scaling is > going to increase the likelihood of those happening. > >> I think something that might actually help is if we added a pair of new >> GFP flags, __GFP_FAST and __GFP_DENSE. Dense allocations are those which >> are expected to live for a long time, and so the page allocator should >> try to group them with other dense allocations. 
Slab and page tables >> should use DENSE, > > You're really just recreating a crappier, less maintainable version of > the object packing that *slab already does*. > > It's *slab* that is supposed to deal with internal fragmentation, not > the page allocator. > > The page allocator is good at cranking out uniform, slightly big > memory blocks. The slab allocator is good at subdividing those into > smaller objects, neatly packed and grouped to facilitate contiguous > reclaim, while providing detailed breakdowns of per-type memory usage > and internal fragmentation to the user and to kernel developers. > > [ And introspection and easy reporting from production are *really > important*, because fragmentation issues develop over timelines that > extend the usual testing horizon of kernel developers. ] Initially, I thought it was a great idea to bump PAGE_SIZE to 2MB and use slab allocator like method for <2MB pages. But as I think about it more, I fail to see how it solves the existing fragmentation issues compared to our existing method, pageblock, since IMHO the fundamental issue of fragmentation in page allocation comes from mixing moveable and unmoveable pages in one pageblock, which does not exist in current slab allocation. There is no mix of reclaimable and unreclaimable objects in slab allocation, right? In my mind, reclaimable object is an analog of moveable page and unreclaimable object is an analog of unmoveable page. In addition, pageblock with different migrate types resembles how slab groups objects, so what is new in using slab instead of pageblock? My key question is do we allow mixing moveable sub-2MB data chunks with unmoveable sub-2MB data chunks in your new slab-like allocation method? If yes, how would kernel reclaim an order-0 (2MB) page that has an unmoveable sub-2MB data chunk? Isn’t it the same fragmentation situation we are facing nowadays when kernel tries to allocate a 2MB page but finds every 2MB pageblock has an unmoveable page? 
If no, why wouldn’t kernel do the same for pageblock? If kernel disallows page allocation fallbacks, so that unmoveable pages and moveable pages will not sit in a single pageblock, compaction and reclaim should be able to get a 2MB free page most of the time. And this would be a much smaller change, right? Let me know if I miss anything. -- Best Regards, Yan, Zi
On 9/2/21 17:13, Zi Yan wrote:
>> You're really just recreating a crappier, less maintainable version of
>> the object packing that *slab already does*.
>>
>> It's *slab* that is supposed to deal with internal fragmentation, not
>> the page allocator.
>>
>> The page allocator is good at cranking out uniform, slightly big
>> memory blocks. The slab allocator is good at subdividing those into
>> smaller objects, neatly packed and grouped to facilitate contiguous
>> reclaim, while providing detailed breakdowns of per-type memory usage
>> and internal fragmentation to the user and to kernel developers.
>>
>> [ And introspection and easy reporting from production are *really
>> important*, because fragmentation issues develop over timelines that
>> extend the usual testing horizon of kernel developers. ]
>
> Initially, I thought it was a great idea to bump PAGE_SIZE to 2MB and
> use slab allocator like method for <2MB pages. But as I think about it
> more, I fail to see how it solves the existing fragmentation issues
> compared to our existing method, pageblock, since IMHO the fundamental
> issue of fragmentation in page allocation comes from mixing moveable
> and unmoveable pages in one pageblock, which does not exist in current
> slab allocation. There is no mix of reclaimable and unreclaimable objects
> in slab allocation, right?

AFAICS that's correct. Slab caches can in general merge, as that
decreases memory usage (with the tradeoff of potentially mixing objects
with different lifetimes more). But SLAB_RECLAIM_ACCOUNT (a flag for
reclaimable caches) is part of SLAB_MERGE_SAME, so caches can only merge
if they are both reclaimable or not.

> In my mind, reclaimable object is an analog
> of moveable page and unreclaimable object is an analog of unmoveable page.

More precisely it resembles reclaimable and unreclaimable pages. Movable
pages can also be migrated, but slab objects cannot.
> In addition, pageblock with different migrate types resembles how
> slab groups objects, so what is new in using slab instead of pageblock?

Slab would be more strict in not allowing the merge. At the page
allocator level, if memory is exhausted, eventually a page of any type
can be allocated from a pageblock of any other type as part of the
fallback. The only really strict mechanism is the movable zone.

> My key question is do we allow mixing moveable sub-2MB data chunks with
> unmoveable sub-2MB data chunks in your new slab-like allocation method?
>
> If yes, how would kernel reclaim an order-0 (2MB) page that has an
> unmoveable sub-2MB data chunk? Isn’t it the same fragmentation situation
> we are facing nowadays when kernel tries to allocate a 2MB page but finds
> every 2MB pageblock has an unmoveable page?

Yes, any scheme where all pages are not movable can theoretically
degrade to a situation where at one moment all memory is allocated by
the unmovable pages, and later almost all pages were freed, but leaving
one unmovable page in each pageblock.

> If no, why wouldn’t kernel do the same for pageblock? If kernel disallows
> page allocation fallbacks, so that unmoveable pages and moveable pages
> will not sit in a single pageblock, compaction and reclaim should be able
> to get a 2MB free page most of the time. And this would be a much smaller
> change, right?

If we did that restriction of fallbacks, it would indeed be as strict as
slab is, but things could still degrade to unmovable pages scattered all
over the pageblocks as mentioned above. But since it's so similar to
slabs, the same thing could happen with slabs today, and I don't recall
reports of that happening massively? But of course slabs are not all 2MB
large, serving 4k pages.

> Let me know if I miss anything.
>
>
> --
> Best Regards,
> Yan, Zi
>
So what is the result here? Not having folios (with that or another name) is really going to set back making progress on sane support for huge pages. Both in the pagecache but also for other places like direct I/O.
On 9/9/21 14:43, Christoph Hellwig wrote: > So what is the result here? Not having folios (with that or another > name) is really going to set back making progress on sane support for > huge pages. Both in the pagecache but also for other places like direct > I/O. Yeah, the silence doesn't seem actionable. If naming is the issue, I believe Matthew had also a branch where it was renamed to pageset. If it's the unclear future evolution wrt supporting subpages of large pages, should we just do nothing until somebody turns that hypothetical future into code and we see whether it works or not?
On Thu, Sep 09, 2021 at 03:56:54PM +0200, Vlastimil Babka wrote:
> On 9/9/21 14:43, Christoph Hellwig wrote:
> > So what is the result here? Not having folios (with that or another
> > name) is really going to set back making progress on sane support for
> > huge pages. Both in the pagecache but also for other places like direct
> > I/O.

From my end, I have no objections to using the current shape of Willy's
data structure as a cache descriptor for the filesystem API:

struct foo {
	/* private: don't document the anon union */
	union {
		struct {
	/* public: */
			unsigned long flags;
			struct list_head lru;
			struct address_space *mapping;
			pgoff_t index;
			void *private;
			atomic_t _mapcount;
			atomic_t _refcount;
#ifdef CONFIG_MEMCG
			unsigned long memcg_data;
#endif
	/* private: the union with struct page is transitional */
		};
		struct page page;
	};
};

I also have no general objection to a *separate* folio or pageset or
whatever data structure to address the compound page mess inside VM
code. With its own cost/benefit analysis. For whatever is left after
the filesystems have been sorted out.

My objection is simply to one shared abstraction for both. There is
ample evidence from years of hands-on production experience that
compound pages aren't the way toward scalable and maintainable larger
page sizes from the MM side. And it's anything but obvious or
self-evident that just because struct page worked for both roles that
the same is true for compound pages.

Willy says it'll work out, I say it won't. We don't have code to prove
this either way right now. Why expose the filesystems to this gamble?

Nothing prevents us from putting a 'struct pageset pageset' or 'struct
folio folio' into a cache descriptor like above later on, right?

[ And IMO, the fact that filesystem people are currently exposed to,
and blocked on, mindnumbing internal MM discussions just further
strengthens the argument to disconnect the page cache frontend from
the memory allocation backend.
The fs folks don't care - and really shouldn't care - about any of this. I understand the frustration. ] Can we go ahead with the cache descriptor for now, and keep the door open on how they are backed from the MM side? We should be able to answer this without going too deep into MM internals. In the short term, this would unblock the fs people. In the longer term this would allow the fs people to focus on fs problems, and MM people to solve MM problems. > Yeah, the silence doesn't seem actionable. If naming is the issue, I believe > Matthew had also a branch where it was renamed to pageset. If it's the > unclear future evolution wrt supporting subpages of large pages, should we > just do nothing until somebody turns that hypothetical future into code and > we see whether it works or not? Folio or pageset works for compound pages, but implies unnecessary implementation details for a variable-sized cache descriptor, IMO. I don't love the name folio for compound pages, but I think it's actually hazardous for the filesystem API. To move forward with the filesystem bits, can we: 1. call it something - anything - that isn't tied to the page, or the nature of multiple pages? fsmem, fsblock, cachemem, cachent, I don't care too deeply and would rather have a less snappy name than a clever misleading one, 2. make things like folio_order(), folio_nr_pages(), folio_page() page_folio() private API in mm/internal.h, to acknowledge that these are current implementation details, not promises on how the cache entry will forever be backed in the future? 3. remove references to physical contiguity, PAGE_SIZE, anonymous pages - and really anything else that nobody has explicitly asked for yet - from the kerneldoc; generally keep things specced to what we need now, and not create dependencies against speculative future ambitions that may or may not pan out, 4. 
separate and/or table the bits that are purely about compound pages inside MM code and not relevant for the fs interface - things like the workingset.c and swap.c conversions (page_folio() usage seems like a good indicator for where it permeated too deeply into MM core code which then needs to translate back up again)?
On Thu, Sep 09, 2021 at 02:16:39PM -0400, Johannes Weiner wrote: > My objection is simply to one shared abstraction for both. There is > ample evidence from years of hands-on production experience that > compound pages aren't the way toward scalable and maintainable larger > page sizes from the MM side. And it's anything but obvious or > self-evident that just because struct page worked for both roles that > the same is true for compound pages. I object to this requirement. The folio work has been going on for almost a year now, and you come in AT THE END OF THE MERGE WINDOW to ask for it to do something entirely different from what it's supposed to be doing. If you'd asked for this six months ago -- maybe. But now is completely unreasonable. I don't think it's a good thing to try to do. I think that your "let's use slab for this" idea is bonkers and doesn't work. And I really object to you getting in the way of my patchset which has actual real-world performance advantages in order to whine about how bad the type system is in Linux without doing anything to help with it. Do something. Or stop standing in the way. Either works for me.
On 9/9/21 06:56, Vlastimil Babka wrote: > On 9/9/21 14:43, Christoph Hellwig wrote: >> So what is the result here? Not having folios (with that or another >> name) is really going to set back making progress on sane support for >> huge pages. Both in the pagecache but also for other places like direct >> I/O. > > Yeah, the silence doesn't seem actionable. If naming is the issue, I believe > Matthew had also a branch where it was renamed to pageset. If it's the > unclear future evolution wrt supporting subpages of large pages, should we > just do nothing until somebody turns that hypothetical future into code and > we see whether it works or not? > When I saw Matthew's proposal to rename folio --> pageset, my reaction was, "OK, this is a huge win!". Because: * The new name addressed Linus' concerns about naming, which unblocks it there, and * The new name seems to meet all of the criteria of the "folio" name, including even grep-ability, after a couple of tiny page_set and pageset cases are renamed--AND it also meets Linus' criteria for self-describing names. So I didn't want to add noise to that thread, but now that there is still some doubt about this, I'll pop up and suggest: do the huge 's/folio/pageset/g', and of course the associated renaming of the conflicting existing pageset and page_set cases, and then maybe it goes in. thanks,
On Thu, Sep 09, 2021 at 12:17:00PM -0700, John Hubbard wrote: > On 9/9/21 06:56, Vlastimil Babka wrote: > > On 9/9/21 14:43, Christoph Hellwig wrote: > > > So what is the result here? Not having folios (with that or another > > > name) is really going to set back making progress on sane support for > > > huge pages. Both in the pagecache but also for other places like direct > > > I/O. > > > > Yeah, the silence doesn't seem actionable. If naming is the issue, I believe > > Matthew had also a branch where it was renamed to pageset. If it's the > > unclear future evolution wrt supporting subpages of large pages, should we > > just do nothing until somebody turns that hypothetical future into code and > > we see whether it works or not? > > > > When I saw Matthew's proposal to rename folio --> pageset, my reaction was, > "OK, this is a huge win!". Because: > > * The new name addressed Linus' concerns about naming, which unblocks it > there, and > > * The new name seems to meet all of the criteria of the "folio" name, > including even grep-ability, after a couple of tiny page_set and pageset > cases are renamed--AND it also meets Linus' criteria for self-describing > names. > > So I didn't want to add noise to that thread, but now that there is still > some doubt about this, I'll pop up and suggest: do the huge > 's/folio/pageset/g', and of course the associated renaming of the conflicting > existing pageset and page_set cases, and then maybe it goes in. So I've done that. https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/tags/pageset-5.15 I sent it to Linus almost two weeks ago: https://lore.kernel.org/linux-mm/YSmtjVTqR9%2F4W1aq@casper.infradead.org/ Still nothing, so I presume he's still thinking about it.
On Thu, Sep 09, 2021 at 07:44:22PM +0100, Matthew Wilcox wrote: > On Thu, Sep 09, 2021 at 02:16:39PM -0400, Johannes Weiner wrote: > > My objection is simply to one shared abstraction for both. There is > > ample evidence from years of hands-on production experience that > > compound pages aren't the way toward scalable and maintainable larger > > page sizes from the MM side. And it's anything but obvious or > > self-evident that just because struct page worked for both roles that > > the same is true for compound pages. > > I object to this requirement. The folio work has been going on for almost > a year now, and you come in AT THE END OF THE MERGE WINDOW to ask for it > to do something entirely different from what it's supposed to be doing. > If you'd asked for this six months ago -- maybe. But now is completely > unreasonable. I asked for exactly this exactly six months ago. On March 22nd, I wrote this re: the filesystem interfacing: : So I think transitioning away from ye olde page is a great idea. I : wonder this: have we mapped out the near future of the VM enough to : say that the folio is the right abstraction? : : What does 'folio' mean when it corresponds to either a single page or : some slab-type object with no dedicated page? : : If we go through with all the churn now anyway, IMO it makes at least : sense to ditch all association and conceptual proximity to the : hardware page or collections thereof. Simply say it's some length of : memory, and keep thing-to-page translations out of the public API from : the start. I mean, is there a good reason to keep this baggage? It's not my fault you consistently dismissed and pushed past this question and then send a pull request anyway. > I don't think it's a good thing to try to do. I think that your "let's > use slab for this" idea is bonkers and doesn't work. Based on what exactly? You can't think it's that bonkers when you push for replicating slab-like grouping in the page allocator. 
Anyway, it was never about how larger pages will pan out in MM. It was about keeping some flexibility around the backing memory for cache entries, given that this is still an unsolved problem. This is not a crazy or unreasonable request, it's the prudent thing to do given the amount of open-ended churn and disruptiveness of your patches. It seems you're not interested in engaging in this argument. You prefer to go off on tangents and speculations about how the page allocator will work in the future, with seemingly little production experience about what does and doesn't work in real life; and at the same time dismiss the experience of people that deal with MM problems hands-on on millions of machines & thousands of workloads every day. > And I really object to you getting in the way of my patchset which > has actual real-world performance advantages So? You've gotten in the way of patches that removed unnecessary compound_head() calls and would have immediately provided some of these same advantages without hurting anybody - because the folio will eventually solve them all anyway. We all balance immediate payoff against what we think will be the right thing longer term. Anyway, if you think I'm bonkers, just ignore me. If not, maybe lay off the rhetoric, engage in a good-faith discussion and actually address my feedback?
Ugh. I'm not dealing with this shit. I'm supposed to be on holiday. I've been checking in to see what needs to happen for folios to be merged. But now I'm just fucking done. I shan't be checking my email until September 19th. Merge the folio branch, merge the pageset branch, or don't merge anything. I don't fucking care any more. On Thu, Sep 09, 2021 at 06:03:17PM -0400, Johannes Weiner wrote: > On Thu, Sep 09, 2021 at 07:44:22PM +0100, Matthew Wilcox wrote: > > On Thu, Sep 09, 2021 at 02:16:39PM -0400, Johannes Weiner wrote: > > > My objection is simply to one shared abstraction for both. There is > > > ample evidence from years of hands-on production experience that > > > compound pages aren't the way toward scalable and maintainable larger > > > page sizes from the MM side. And it's anything but obvious or > > > self-evident that just because struct page worked for both roles that > > > the same is true for compound pages. > > > > I object to this requirement. The folio work has been going on for almost > > a year now, and you come in AT THE END OF THE MERGE WINDOW to ask for it > > to do something entirely different from what it's supposed to be doing. > > If you'd asked for this six months ago -- maybe. But now is completely > > unreasonable. > > I asked for exactly this exactly six months ago. > > On March 22nd, I wrote this re: the filesystem interfacing: > > : So I think transitioning away from ye olde page is a great idea. I > : wonder this: have we mapped out the near future of the VM enough to > : say that the folio is the right abstraction? > : > : What does 'folio' mean when it corresponds to either a single page or > : some slab-type object with no dedicated page? > : > : If we go through with all the churn now anyway, IMO it makes at least > : sense to ditch all association and conceptual proximity to the > : hardware page or collections thereof. 
Simply say it's some length of > : memory, and keep thing-to-page translations out of the public API from > : the start. I mean, is there a good reason to keep this baggage? > > It's not my fault you consistently dismissed and pushed past this > question and then send a pull request anyway. > > > I don't think it's a good thing to try to do. I think that your "let's > > use slab for this" idea is bonkers and doesn't work. > > Based on what exactly? > > You can't think it's that bonkers when you push for replicating > slab-like grouping in the page allocator. > > Anyway, it was never about how larger pages will pan out in MM. It was > about keeping some flexibility around the backing memory for cache > entries, given that this is still an unsolved problem. This is not a > crazy or unreasonable request, it's the prudent thing to do given the > amount of open-ended churn and disruptiveness of your patches. > > It seems you're not interested in engaging in this argument. You > prefer to go off on tangents and speculations about how the page > allocator will work in the future, with seemingly little production > experience about what does and doesn't work in real life; and at the > same time dismiss the experience of people that deal with MM problems > hands-on on millions of machines & thousands of workloads every day. > > > And I really object to you getting in the way of my patchset which > > has actual real-world performance advantages > > So? You've gotten in the way of patches that removed unnecessary > compound_head() call and would have immediately provided some of these > same advantages without hurting anybody - because the folio will > eventually solve them all anyway. > > We all balance immediate payoff against what we think will be the > right thing longer term. > > Anyway, if you think I'm bonkers, just ignore me. If not, maybe lay > off the rhetorics, engage in a good-faith discussion and actually > address my feedback?
So I've been following the folio discussion, and it seems like the discussion has gone off the rails a bit partly just because struct page is such a mess and has been so overused, and we all want to see that cleaned up but we're not being clear about what that means. I was just talking with Johannes off list, and I thought I'd recap that discussion as well as other talks with Matthew and see if I can lay something out that everyone agrees with. Some background: For some years now, the overhead of dealing with 4k pages in the page cache has gotten really, really painful. Any time we're doing buffered IO, we end up walking a radix tree to get to the cached page, then doing a memcpy to or from that page - which quite conveniently blows away the CPU cache - then walking the radix tree to look up the next page, often touching locks along the way that are no longer in cache - it's really bad. We've been hacking around this - the btrfs people have a vectorized buffered write path, and also this is what my generic_file_buffered_read() patches were about, batching up the page cache lookups - but really these are hacks that make our core IO paths even more complicated, when the right answer that's been staring all of us filesystem people in the face for years has been that it's 2021 and dealing with cached data in 4k chunks (when block based filesystems are a thing of the past!) is abject stupidity. So we need to be moving to larger, variable sized allocations for cached data. We NEED this, this HAS TO HAPPEN - spend some time really digging into profiles, and looking at actual application usage, this is the #1 thing that's killing our performance in the IO paths. Remember, us developers tend to be benchmarking things like direct IO and small random IOs because we're looking at the whole IO path, but most reads and writes are buffered, and they're already in cache, and they're mostly big and sequential.
I emphasize this because a lot of us have really been waiting rather urgently for Willy's work to go in, and there will no doubt be a lot more downstream filesystem work to be done to fully take advantage of it and we're waiting on this stuff to get merged so we can actually start testing and profiling the brave new world and seeing what to work on next. As an aside, before this there have been quite a few attempts at using hugepages to deal with these issues, and they're all _fucking gross_, because they all do if (normal page) else if (hugepage), and they all cut and paste filemap.c code because no one (rightly) wanted to add their abortions to the main IO paths. But look around the kernel and see how many times you can find core filemap.c code duplicated elsewhere... Anyways, Willy's work is going to let us delete all that crap. So: this all means that filesystem code needs to start working in larger, variable sized units, which today means - compound pages. Hence, the folio work started out as a wrapper around compound pages. So, one objection to folios has been that they leak too many MM details out into the filesystem code. To that we must point out: all the code that's going to be using folios is right now using struct page - this isn't leaking out new details and making things worse, this is actually (potentially!) a step in the right direction, by moving some users of struct page to a new type that is actually created for a specific purpose. I think a lot of the acrimony in this discussion came precisely from this mess; Johannes and the other MM people would like to see this situation improved so that they have more freedom to reengineer and improve things on their side. One particularly noteworthy idea was having struct page refer to multiple hardware pages, and using slab/slub for larger allocations.
In my view, the primary reason for making this change isn't the memory overhead to struct page (though reducing that would be nice); it's that the slab allocator is _significantly_ faster than the buddy allocator (the buddy allocator isn't percpu!) and as average allocation sizes increase, this is hurting us more and more over time. So we should listen to the MM people. Fortunately, Matthew made a big step in the right direction by making folios a new type. Right now, struct folio is not separately allocated - it's just unionized/overlayed with struct page - but perhaps in the future they could be separately allocated. I don't think that is a remotely realistic goal for _this_ patch series given the amount of code that touches struct page (think: writeback code, LRU list code, page fault handlers!) - but I think that's a goal we could keep in mind going forward. We should also be clear on what _exactly_ folios are for, so they don't become the new dumping ground for everyone to stash their crap. They're to be a new core abstraction, and we should endeavor to keep our core data structures _small_, and _simple_. So: no scatter gather. A folio should just represent a single buffer of physically contiguous memory - vmap is slow, kmap_atomic() only works on single pages, we do _not_ want to make filesystem code jump through hoops to deal with anything else. The buffers should probably be power of two sized, as that's what the buddy allocator likes to give us - that doesn't necessarily have to be baked into the design, but I can't see us ever actually wanting non power of two sized allocations. Q: But what about fragmentation? Won't these allocations fail sometimes? Yes, and that's OK. The relevant filesystem code is all changing to handle variable sized allocations, so it's completely fine if we fail a 256k allocation and we have to fall back to whatever is available.
But also keep in mind that switching the biggest consumer of kernel side memory to larger allocations is going to do more than anything else to help prevent memory from getting fragmented in the first place. We _want_ this. Q: Oh yeah, but what again are folios for, exactly? Folios are for cached filesystem data which (importantly) may be mapped to userspace. So when MM people see a new data structure come up with new references to page size - there's a very good reason for that, which is that we need to be allocating in multiples of the hardware page size if we're going to be able to map it to userspace and have PTEs point to it. So going forward, if the MM people want struct page to refer to multiple hardware pages - this shouldn't prevent that, and folios will refer to multiples of the _hardware_ page size, not the struct page size. Also - all the filesystem code that's being converted tends to talk and think in units of pages. So going forward, it would be a nice cleanup to get rid of as many of those references as possible and just talk in terms of bytes (e.g. I have generally been trying to get rid of references to PAGE_SIZE in bcachefs wherever reasonable, for other reasons) - those cleanups are probably for another patch series, and in the interests of getting this patch series merged with the fewest introduced bugs possible we probably want the current helpers. ------------- That's my recap, I hope I haven't missed anything. The TL;DR is: * struct page is a mess; yes, we know. We're all living with that pain. * This isn't our ultimate end goal (nothing ever is!) - but it's probably along the right path. * Going forward: maybe struct folio should be separately allocated. That will entail a lot more work so it's not appropriate for this patch series, but I think it's a goal that would make everyone happy. * We should probably think and talk more concretely about what our end goals are.
Getting away from struct page is something that comes up again and again - DAX is another notable (and acrimonious) area where this has come up. Also, page->mapping and page->index make sharing cached data in different files (think: reflink, snapshots) pretty much non-starters. I'm going to publicly float one of my own ideas here: maybe entries in the page cache radix tree don't have to be just a single pointer/ulong. If those entries were bigger, perhaps some things would fit better there than in either struct page/folio. Excessive PAGE_SIZE usage: -------------------------- Another thing that keeps coming up is - indiscriminate use of PAGE_SIZE makes it hard, especially when we're reviewing new code, to tell what's a legitimate use or not. When it's tied to the hardware page size (as folios are), it's probably legitimate, but PAGE_SIZE is _way_ overused. Partly this was because historically slab had to be used for small allocations and the buddy allocator, __get_free_pages(), had to be used for larger allocations. This is still somewhat the case - slab can go up to something like 128k, but there's still a hard cap on allocation size with kmalloc(). Perhaps the MM people could look into lifting this restriction, so that kmalloc() could be used for any sized physically contiguous allocation that the system could satisfy? If we had this, then it would make it more practical to go through and refactor existing code that uses __get_free_pages() and convert it to kmalloc(), without having to stare at code and figure out if it's safe. And that's my $.02
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> So we should listen to the MM people.
Count me here.
I think the problem with folio is that everybody reads their own
hopes and dreams into it and gets disappointed when they see that
their somewhat related problem doesn't get magically fixed with folio.
Folio started as a way to relieve the pain of dealing with compound pages.
It provides a unified view of base pages and compound pages. That's it.
It is the required groundwork for wider adoption of compound pages in the
page cache. But it will also be useful for anon THP and hugetlb.
Based on the adoption rate and the resulting code, the new abstraction has
nice downstream effects. It may be suitable for more than it was initially
intended for. That's great.
But if it doesn't solve your problem... well, sorry...
The patchset makes a nice step forward and cuts back on the mess I created on
the way to huge-tmpfs.
I would be glad to see the patchset upstream.
On Sat 11-09-21 04:23:24, Kirill A. Shutemov wrote: > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > > So we should listen to the MM people. > > Count me here. > > I think the problem with folio is that everybody wants to read in her/his > hopes and dreams into it and gets disappointed when see their somewhat > related problem doesn't get magically fixed with folio. > > Folio started as a way to relief pain from dealing with compound pages. > It provides an unified view on base pages and compound pages. That's it. > > It is required ground work for wider adoption of compound pages in page > cache. But it also will be useful for anon THP and hugetlb. > > Based on adoption rate and resulting code, the new abstraction has nice > downstream effects. It may be suitable for more than it was intended for > initially. That's great. > > But if it doesn't solve your problem... well, sorry... > > The patchset makes a nice step forward and cuts back on mess I created on > the way to huge-tmpfs. > > I would be glad to see the patchset upstream. I do agree here. While the points Johannes brought up are relevant and worth thinking about, I also see a clear advantage in what folio (or whatever $name) is bringing. The compound page handling is just a mess and a source of practical problems and bugs. This really requires some systematic approach to deal with it. The proposed type system is definitely a good way to approach it. Johannes is not happy about having the type still refer to page units but I haven't seen an example where that leads to worse or harder-to-maintain code so far. The evolution is likely not going to stop at the current type system but I haven't seen any specifics to prove it would stand in the way. The existing code (fs or other subsystems interacting with MM) is going to require quite a lot of changes to move away from the struct page notion, but I do not see folios adding a fundamental blocker there.
All that being said, not only do I see folios as a step in the right direction to address the compound page mess, it is also code that already exists and gives some real advantages. I haven't heard anybody subscribing to a different approach and providing an implementation in a foreseeable future, so I would rather go with this approach than keep dealing with the existing code long term.
On Mon, Sep 13, 2021 at 01:32:30PM +0200, Michal Hocko wrote: > The existing code (fs or other subsystem interacting with MM) is > going to require quite a lot of changes to move away from struct > page notion but I do not see folios to add fundamental blocker > there. The current folio seems to do quite a bit of that work, actually. But it'll be undone when the MM conversion matures the data structure into the full-blown new page. It's not about hopes and dreams, it's the simple fact that the patches do something now that seems very valuable, but which we'll lose again over time. And avoiding that is a relatively minor adjustment at this time compared to a much larger one later on. So yeah, it's not really a blocker. It's just a missed opportunity to lastingly disentangle struct page's multiple roles when touching all the relevant places anyway. It's also (needlessly) betting that compound pages can be made into a scalable, reliable, and predictable allocation model, and proliferating them into fs/ based on that. These patches, and all the ones that will need to follow to finish the conversion, are exceptionally expensive. It would have been nice to get more out of this disruption than to identify the relatively few places that genuinely need compound_head(), and having a datatype for N contiguous pages. Is there merit in solving those problems? Sure. Is it a robust, forward-looking direction for the MM space that justifies the cost of these and later patches? You seem to think so, I don't. It doesn't look like we'll agree on this. But I think I've made my points several times now, so I'll defer to Linus and Andrew.
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > One particularly noteworthy idea was having struct page refer to > multiple hardware pages, and using slab/slub for larger > alloctions. In my view, the primary reason for making this change > isn't the memory overhead to struct page (though reducing that would > be nice); Don't underestimate this, however. Picture the near future Willy describes, where we don't bump struct page size yet but serve most cache with compound huge pages. On x86, it would mean that the average page cache entry has 512 mapping pointers, 512 index members, 512 private pointers, 1024 LRU list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate flags, 512 memcg pointers etc. - you get the idea. This is a ton of memory. I think this doesn't get more traction because it's memory we've always allocated, and we're simply more sensitive to regressions than long-standing pain. But nevertheless this is a pretty low-hanging fruit. The folio makes a great first step moving those into a separate data structure, opening the door to one day realizing these savings. Even when some MM folks say this was never the intent behind the patches, I think this is going to matter significantly, if not more so, later on. > Fortunately, Matthew made a big step in the right direction by making folios a > new type. Right now, struct folio is not separately allocated - it's just > unionized/overlayed with struct page - but perhaps in the future they could be > separately allocated. I don't think that is a remotely realistic goal for _this_ > patch series given the amount of code that touches struct page (thing: writeback > code, LRU list code, page fault handlers!) - but I think that's a goal we could > keep in mind going forward. Yeah, agreed. Not doable out of the gate, but retaining the ability to allocate the "cache entry descriptor" bits - mapping, index etc. - on-demand would be a huge benefit down the road for the above reason. 
For that they would have to be in - and stay in - their own type. > We should also be clear on what _exactly_ folios are for, so they don't become > the new dumping ground for everyone to stash their crap. They're to be a new > core abstraction, and we should endeaver to keep our core data structures > _small_, and _simple_. Right. struct page is a lot of things and anything but simple and obvious today. struct folio in its current state does a good job separating some of that stuff out. However, when we think about *which* of the struct page mess the folio wants to address, I think that bias toward recent pain over much bigger long-standing pain strikes again. The compound page proliferation is new, and we're sensitive to the ambiguity it created between head and tail pages. It's added some compound_head() in lower-level accessor functions that are not necessary for many contexts. The folio type safety will help clean that up, and this is great. However, there is a much bigger, systematic type ambiguity in the MM world that we've just gotten used to over the years: anon vs file vs shmem vs slab vs ... 
- Many places rely on context to say "if we get here, it must be
  anon/file", and then unsafely access overloaded member elements:
  page->mapping, PG_readahead, PG_swapcache, PG_private

- On the other hand, we also have low-level accessor functions that
  disambiguate the type and impose checks on contexts that may or may
  not actually need them - not unlike compound_head() in PageActive():

	struct address_space *folio_mapping(struct folio *folio)
	{
		struct address_space *mapping;

		/* This happens if someone calls flush_dcache_page on slab page */
		if (unlikely(folio_test_slab(folio)))
			return NULL;

		if (unlikely(folio_test_swapcache(folio)))
			return swap_address_space(folio_swap_entry(folio));

		mapping = folio->mapping;
		if ((unsigned long)mapping & PAGE_MAPPING_ANON)
			return NULL;

		return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
	}

  Then we go identify places that say "we know it's at least not a
  slab page!" and convert them to page_mapping_file() which IS safe to
  use with anon. Or we say "we know this MUST be a file page" and just
  access the (unsafe) mapping pointer directly.

- We have a singular page lock, but what it guards depends on what
  type of page we're dealing with. For a cache page it protects
  uptodate and the mapping. For an anon page it protects swap state.

  A lot of us can remember the rules if we try, but the code doesn't
  help and it gets really tricky when dealing with multiple types of
  pages simultaneously. Even mature code like reclaim just serializes
  the operation instead of protecting data - the writeback checks and
  the page table reference tests don't seem to need page lock.

  When the cgroup folks wrote the initial memory controller, they just
  added their own page-scope lock to protect page->memcg even though
  the page lock would have covered what it needed.
- shrink_page_list() uses page_mapping() in the first half of the
  function to tell whether the page is anon or file, but halfway
  through we do this:

	/* Adding to swap updated mapping */
	mapping = page_mapping(page);

  and then use PageAnon() to disambiguate the page type.

- At activate_locked:, we check PG_swapcache directly on the page and
  rely on it doing the right thing for anon, file, and shmem pages.
  But this flag is PG_owner_priv_1 and actually used by the filesystem
  for something else. I guess PG_checked pages currently don't make it
  this far in reclaim, or we'd crash somewhere in try_to_free_swap().

  I suppose we're also never calling page_mapping() on PageChecked
  filesystem pages right now, because it would return a swap mapping
  before testing whether this is a file page. You know, because shmem.

These are just a few examples from an MM perspective. I'm sure the FS folks have their own stories and examples about pitfalls in dealing with struct page members. We're so used to this that we don't realize how much bigger and pervasive this lack of typing is than the compound page thing. I'm not saying the compound page mess isn't worth fixing. It is. I'm saying if we started with a file page or cache entry abstraction we'd solve not only the huge page cache, but also set us up for a MUCH more comprehensive cleanup in MM code and MM/FS interaction that makes the tailpage cleanup pale in comparison - at the same amount of churn, since folio would also touch all of these places.
Hello everyone, I am an outsider following the discussion here on this subject. Can we not go upstream with the current state of development? Optimizations will always come later, and so will new kernel releases. I cannot assess the risk, but I think a decision must be made. Damian On Wed, 15. Sep 11:40, Johannes Weiner wrote: > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > > One particularly noteworthy idea was having struct page refer to > > multiple hardware pages, and using slab/slub for larger > > alloctions. In my view, the primary reason for making this change > > isn't the memory overhead to struct page (though reducing that would > > be nice); > > Don't underestimate this, however. > > Picture the near future Willy describes, where we don't bump struct > page size yet but serve most cache with compound huge pages. > > On x86, it would mean that the average page cache entry has 512 > mapping pointers, 512 index members, 512 private pointers, 1024 LRU > list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate > flags, 512 memcg pointers etc. - you get the idea. > > This is a ton of memory. I think this doesn't get more traction > because it's memory we've always allocated, and we're simply more > sensitive to regressions than long-standing pain. But nevertheless > this is a pretty low-hanging fruit. > > The folio makes a great first step moving those into a separate data > structure, opening the door to one day realizing these savings. Even > when some MM folks say this was never the intent behind the patches, I > think this is going to matter significantly, if not more so, later on. > > > Fortunately, Matthew made a big step in the right direction by making folios a > > new type. Right now, struct folio is not separately allocated - it's just > > unionized/overlayed with struct page - but perhaps in the future they could be > > separately allocated.
I don't think that is a remotely realistic goal for _this_ > > patch series given the amount of code that touches struct page (thing: writeback > > code, LRU list code, page fault handlers!) - but I think that's a goal we could > > keep in mind going forward. > > Yeah, agreed. Not doable out of the gate, but retaining the ability to > allocate the "cache entry descriptor" bits - mapping, index etc. - > on-demand would be a huge benefit down the road for the above reason. > > For that they would have to be in - and stay in - their own type. > > > We should also be clear on what _exactly_ folios are for, so they don't become > > the new dumping ground for everyone to stash their crap. They're to be a new > > core abstraction, and we should endeaver to keep our core data structures > > _small_, and _simple_. > > Right. struct page is a lot of things and anything but simple and > obvious today. struct folio in its current state does a good job > separating some of that stuff out. > > However, when we think about *which* of the struct page mess the folio > wants to address, I think that bias toward recent pain over much > bigger long-standing pain strikes again. > > The compound page proliferation is new, and we're sensitive to the > ambiguity it created between head and tail pages. It's added some > compound_head() in lower-level accessor functions that are not > necessary for many contexts. The folio type safety will help clean > that up, and this is great. > > However, there is a much bigger, systematic type ambiguity in the MM > world that we've just gotten used to over the years: anon vs file vs > shmem vs slab vs ... 
>
> - Many places rely on context to say "if we get here, it must be
>   anon/file", and then unsafely access overloaded member elements:
>   page->mapping, PG_readahead, PG_swapcache, PG_private
>
> - On the other hand, we also have low-level accessor functions that
>   disambiguate the type and impose checks on contexts that may or may
>   not actually need them - not unlike compound_head() in PageActive():
>
>	struct address_space *folio_mapping(struct folio *folio)
>	{
>		struct address_space *mapping;
>
>		/* This happens if someone calls flush_dcache_page on slab page */
>		if (unlikely(folio_test_slab(folio)))
>			return NULL;
>
>		if (unlikely(folio_test_swapcache(folio)))
>			return swap_address_space(folio_swap_entry(folio));
>
>		mapping = folio->mapping;
>		if ((unsigned long)mapping & PAGE_MAPPING_ANON)
>			return NULL;
>
>		return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
>	}
>
>   Then we go identify places that say "we know it's at least not a
>   slab page!" and convert them to page_mapping_file() which IS safe to
>   use with anon. Or we say "we know this MUST be a file page" and just
>   access the (unsafe) mapping pointer directly.
>
> - We have a singular page lock, but what it guards depends on what
>   type of page we're dealing with. For a cache page it protects
>   uptodate and the mapping. For an anon page it protects swap state.
>
>   A lot of us can remember the rules if we try, but the code doesn't
>   help and it gets really tricky when dealing with multiple types of
>   pages simultaneously. Even mature code like reclaim just serializes
>   the operation instead of protecting data - the writeback checks and
>   the page table reference tests don't seem to need page lock.
>
>   When the cgroup folks wrote the initial memory controller, they just
>   added their own page-scope lock to protect page->memcg even though
>   the page lock would have covered what it needed.
> > - shrink_page_list() uses page_mapping() in the first half of the > function to tell whether the page is anon or file, but halfway > through we do this: > > /* Adding to swap updated mapping */ > mapping = page_mapping(page); > > and then use PageAnon() to disambiguate the page type. > > - At activate_locked:, we check PG_swapcache directly on the page and > rely on it doing the right thing for anon, file, and shmem pages. > But this flag is PG_owner_priv_1 and actually used by the filesystem > for something else. I guess PG_checked pages currently don't make it > this far in reclaim, or we'd crash somewhere in try_to_free_swap(). > > I suppose we're also never calling page_mapping() on PageChecked > filesystem pages right now, because it would return a swap mapping > before testing whether this is a file page. You know, because shmem. > > These are just a few examples from an MM perspective. I'm sure the FS > folks have their own stories and examples about pitfalls in dealing > with struct page members. > > We're so used to this that we don't realize how much bigger and > pervasive this lack of typing is than the compound page thing. > > I'm not saying the compound page mess isn't worth fixing. It is. > > I'm saying if we started with a file page or cache entry abstraction > we'd solve not only the huge page cache, but also set us up for a MUCH > more comprehensive cleanup in MM code and MM/FS interaction that makes > the tailpage cleanup pale in comparison. For the same amount of churn, > since folio would also touch all of these places. >
On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote: > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > > One particularly noteworthy idea was having struct page refer to > > multiple hardware pages, and using slab/slub for larger > > alloctions. In my view, the primary reason for making this change > > isn't the memory overhead to struct page (though reducing that would > > be nice); > > Don't underestimate this, however. > > Picture the near future Willy describes, where we don't bump struct > page size yet but serve most cache with compound huge pages. > > On x86, it would mean that the average page cache entry has 512 > mapping pointers, 512 index members, 512 private pointers, 1024 LRU > list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate > flags, 512 memcg pointers etc. - you get the idea. > > This is a ton of memory. I think this doesn't get more traction > because it's memory we've always allocated, and we're simply more > sensitive to regressions than long-standing pain. But nevertheless > this is a pretty low-hanging fruit. > > The folio makes a great first step moving those into a separate data > structure, opening the door to one day realizing these savings. Even > when some MM folks say this was never the intent behind the patches, I > think this is going to matter significantly, if not more so, later on. So ... I chatted with Kent the other day, who suggested to me that maybe the point you're really after is that you want to increase the hw page size to reduce overhead while retaining the ability to hand out parts of those larger pages to the page cache, and folios don't get us there? > > Fortunately, Matthew made a big step in the right direction by making folios a > > new type. Right now, struct folio is not separately allocated - it's just > > unionized/overlayed with struct page - but perhaps in the future they could be > > separately allocated. 
I don't think that is a remotely realistic goal for _this_ > > patch series given the amount of code that touches struct page (thing: writeback > > code, LRU list code, page fault handlers!) - but I think that's a goal we could > > keep in mind going forward. > > Yeah, agreed. Not doable out of the gate, but retaining the ability to > allocate the "cache entry descriptor" bits - mapping, index etc. - > on-demand would be a huge benefit down the road for the above reason. > > For that they would have to be in - and stay in - their own type. > > > We should also be clear on what _exactly_ folios are for, so they don't become > > the new dumping ground for everyone to stash their crap. They're to be a new > > core abstraction, and we should endeaver to keep our core data structures > > _small_, and _simple_. > > Right. struct page is a lot of things and anything but simple and > obvious today. struct folio in its current state does a good job > separating some of that stuff out. > > However, when we think about *which* of the struct page mess the folio > wants to address, I think that bias toward recent pain over much > bigger long-standing pain strikes again. > > The compound page proliferation is new, and we're sensitive to the > ambiguity it created between head and tail pages. It's added some > compound_head() in lower-level accessor functions that are not > necessary for many contexts. The folio type safety will help clean > that up, and this is great. > > However, there is a much bigger, systematic type ambiguity in the MM > world that we've just gotten used to over the years: anon vs file vs > shmem vs slab vs ... 
> > - Many places rely on context to say "if we get here, it must be > anon/file", and then unsafely access overloaded member elements: > page->mapping, PG_readahead, PG_swapcache, PG_private > > - On the other hand, we also have low-level accessor functions that > disambiguate the type and impose checks on contexts that may or may > not actually need them - not unlike compound_head() in PageActive(): > > struct address_space *folio_mapping(struct folio *folio) > { > struct address_space *mapping; > > /* This happens if someone calls flush_dcache_page on slab page */ > if (unlikely(folio_test_slab(folio))) > return NULL; > > if (unlikely(folio_test_swapcache(folio))) > return swap_address_space(folio_swap_entry(folio)); > > mapping = folio->mapping; > if ((unsigned long)mapping & PAGE_MAPPING_ANON) > return NULL; > > return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS); > } > > Then we go identify places that say "we know it's at least not a > slab page!" and convert them to page_mapping_file() which IS safe to > use with anon. Or we say "we know this MUST be a file page" and just > access the (unsafe) mapping pointer directly. > > - We have a singular page lock, but what it guards depends on what > type of page we're dealing with. For a cache page it protects > uptodate and the mapping. For an anon page it protects swap state. > > A lot of us can remember the rules if we try, but the code doesn't > help and it gets really tricky when dealing with multiple types of > pages simultaneously. Even mature code like reclaim just serializes > the operation instead of protecting data - the writeback checks and > the page table reference tests don't seem to need page lock. > > When the cgroup folks wrote the initial memory controller, they just > added their own page-scope lock to protect page->memcg even though > the page lock would have covered what it needed. 
> > - shrink_page_list() uses page_mapping() in the first half of the > function to tell whether the page is anon or file, but halfway > through we do this: > > /* Adding to swap updated mapping */ > mapping = page_mapping(page); > > and then use PageAnon() to disambiguate the page type. > > - At activate_locked:, we check PG_swapcache directly on the page and > rely on it doing the right thing for anon, file, and shmem pages. > But this flag is PG_owner_priv_1 and actually used by the filesystem > for something else. I guess PG_checked pages currently don't make it > this far in reclaim, or we'd crash somewhere in try_to_free_swap(). > > I suppose we're also never calling page_mapping() on PageChecked > filesystem pages right now, because it would return a swap mapping > before testing whether this is a file page. You know, because shmem. (Yes, it would be helpful to fix these ambiguities, because I feel like discussions about all these other non-pagecache uses of memory keep coming up on fsdevel and the code /really/ doesn't help me figure out what everyone's talking about before the discussion moves on...) > These are just a few examples from an MM perspective. I'm sure the FS > folks have their own stories and examples about pitfalls in dealing > with struct page members. We do, and I thought we were making good progress pushing a lot of that into the fs/iomap/ library. With fs iomap, disk filesystems pass space mapping data to the iomap functions and let them deal with pages (or folios). IOWs, filesystems don't deal with pages directly anymore, and folios sounded like an easy transition (for a filesystem) to whatever comes next. At some point it would be nice to get fscrypt and fsverity hooked up so that we could move ext4 further off of buffer heads. I don't know how we proceed from here -- there's quite a bit of filesystems work that depended on the folios series actually landing. 
Given that Linus has neither pulled it, rejected it, nor told willy what to do, and the folio series now has a NAK on it, I can't even start on how to proceed from here. --D > We're so used to this that we don't realize how much bigger and > pervasive this lack of typing is than the compound page thing. > > I'm not saying the compound page mess isn't worth fixing. It is. > > I'm saying if we started with a file page or cache entry abstraction > we'd solve not only the huge page cache, but also set us up for a MUCH > more comprehensive cleanup in MM code and MM/FS interaction that makes > the tailpage cleanup pale in comparison. For the same amount of churn, > since folio would also touch all of these places.
On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote: > On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote: > > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > > > One particularly noteworthy idea was having struct page refer to > > > multiple hardware pages, and using slab/slub for larger > > > alloctions. In my view, the primary reason for making this change > > > isn't the memory overhead to struct page (though reducing that would > > > be nice); > > > > Don't underestimate this, however. > > > > Picture the near future Willy describes, where we don't bump struct > > page size yet but serve most cache with compound huge pages. > > > > On x86, it would mean that the average page cache entry has 512 > > mapping pointers, 512 index members, 512 private pointers, 1024 LRU > > list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate > > flags, 512 memcg pointers etc. - you get the idea. > > > > This is a ton of memory. I think this doesn't get more traction > > because it's memory we've always allocated, and we're simply more > > sensitive to regressions than long-standing pain. But nevertheless > > this is a pretty low-hanging fruit. > > > > The folio makes a great first step moving those into a separate data > > structure, opening the door to one day realizing these savings. Even > > when some MM folks say this was never the intent behind the patches, I > > think this is going to matter significantly, if not more so, later on. > > So ... I chatted with Kent the other day, who suggested to me that maybe > the point you're really after is that you want to increase the hw page > size to reduce overhead while retaining the ability to hand out parts of > those larger pages to the page cache, and folios don't get us there? Yes, that's one of the points. 
It's exporting the huge page model we've been using for anonymous memory to the filesystems, even though that model has shown significant limitations in practice: it doesn't work well out of the box, the necessary configuration is painful and complicated, and even when done correctly it still has high allocation latencies. It's much more "handtuned HPC workload" than "general purpose feature". Fixing this is an open problem. I don't know for sure if we need to increase the page size for that, but neither does anybody else. This is simply work and experiments that haven't been done on the MM side. Exposing the filesystems to that implementation now subjects them to the risk of a near-term do-over, and puts a significantly higher barrier on fixing the allocation model down the line. There isn't a technical reason for coupling the filesystems this tightly to the allocation model. It's just that the filesystem people would like a size-agnostic cache object, and some MM folks would like to clean up the compound page mess, and folio tries to do both of these things at once. > > > Fortunately, Matthew made a big step in the right direction by making folios a > > > new type. Right now, struct folio is not separately allocated - it's just > > > unionized/overlayed with struct page - but perhaps in the future they could be > > > separately allocated. 
> > > > > We should also be clear on what _exactly_ folios are for, so they don't become > > > the new dumping ground for everyone to stash their crap. They're to be a new > > > core abstraction, and we should endeaver to keep our core data structures > > > _small_, and _simple_. > > > > Right. struct page is a lot of things and anything but simple and > > obvious today. struct folio in its current state does a good job > > separating some of that stuff out. > > > > However, when we think about *which* of the struct page mess the folio > > wants to address, I think that bias toward recent pain over much > > bigger long-standing pain strikes again. > > > > The compound page proliferation is new, and we're sensitive to the > > ambiguity it created between head and tail pages. It's added some > > compound_head() in lower-level accessor functions that are not > > necessary for many contexts. The folio type safety will help clean > > that up, and this is great. > > > > However, there is a much bigger, systematic type ambiguity in the MM > > world that we've just gotten used to over the years: anon vs file vs > > shmem vs slab vs ... 
> > > > - Many places rely on context to say "if we get here, it must be > > anon/file", and then unsafely access overloaded member elements: > > page->mapping, PG_readahead, PG_swapcache, PG_private > > > > - On the other hand, we also have low-level accessor functions that > > disambiguate the type and impose checks on contexts that may or may > > not actually need them - not unlike compound_head() in PageActive(): > > > > struct address_space *folio_mapping(struct folio *folio) > > { > > struct address_space *mapping; > > > > /* This happens if someone calls flush_dcache_page on slab page */ > > if (unlikely(folio_test_slab(folio))) > > return NULL; > > > > if (unlikely(folio_test_swapcache(folio))) > > return swap_address_space(folio_swap_entry(folio)); > > > > mapping = folio->mapping; > > if ((unsigned long)mapping & PAGE_MAPPING_ANON) > > return NULL; > > > > return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS); > > } > > > > Then we go identify places that say "we know it's at least not a > > slab page!" and convert them to page_mapping_file() which IS safe to > > use with anon. Or we say "we know this MUST be a file page" and just > > access the (unsafe) mapping pointer directly. > > > > - We have a singular page lock, but what it guards depends on what > > type of page we're dealing with. For a cache page it protects > > uptodate and the mapping. For an anon page it protects swap state. > > > > A lot of us can remember the rules if we try, but the code doesn't > > help and it gets really tricky when dealing with multiple types of > > pages simultaneously. Even mature code like reclaim just serializes > > the operation instead of protecting data - the writeback checks and > > the page table reference tests don't seem to need page lock. > > > > When the cgroup folks wrote the initial memory controller, they just > > added their own page-scope lock to protect page->memcg even though > > the page lock would have covered what it needed. 
> > > > - shrink_page_list() uses page_mapping() in the first half of the > > function to tell whether the page is anon or file, but halfway > > through we do this: > > > > /* Adding to swap updated mapping */ > > mapping = page_mapping(page); > > > > and then use PageAnon() to disambiguate the page type. > > > > - At activate_locked:, we check PG_swapcache directly on the page and > > rely on it doing the right thing for anon, file, and shmem pages. > > But this flag is PG_owner_priv_1 and actually used by the filesystem > > for something else. I guess PG_checked pages currently don't make it > > this far in reclaim, or we'd crash somewhere in try_to_free_swap(). > > > > I suppose we're also never calling page_mapping() on PageChecked > > filesystem pages right now, because it would return a swap mapping > > before testing whether this is a file page. You know, because shmem. > > (Yes, it would be helpful to fix these ambiguities, because I feel like > discussions about all these other non-pagecache uses of memory keep > coming up on fsdevel and the code /really/ doesn't help me figure out > what everyone's talking about before the discussion moves on...) Excellent. However, after listening to Kent and other filesystem folks, I think it's important to point out that the folio is not a dedicated page cache page descriptor that will address any of the above examples. The MM POV (and the justification for both the acks and the naks of the patchset) is that it's a generic, untyped compound page abstraction, which applies to file, anon, slab, networking pages. Certainly, the folio patches as of right now also convert anon page handling to the folio. If followed to its conclusion, the folio will have plenty of members and API functions for non-pagecache users and look pretty much like struct page today, just with a dynamic size. I know Kent was surprised by this. 
I know Dave Chinner suggested to call it "cache page" or "cage" early on, which also suggests an understanding of a *dedicated* cache page descriptor. I don't think the ambiguous folio name and the ambiguous union with the page helped in any way in aligning fs and mm folks on what this thing is actually supposed to be! I agree with what I think the filesystems want: instead of an untyped, variable-sized block of memory, I think we should have a typed page cache descriptor. That would work better for the filesystems, and I think would also work better for the MM code down the line and fix the above examples. The headpage/tailpage cleanup would come free with that. > > These are just a few examples from an MM perspective. I'm sure the FS > > folks have their own stories and examples about pitfalls in dealing > > with struct page members. > > We do, and I thought we were making good progress pushing a lot of that > into the fs/iomap/ library. With fs iomap, disk filesystems pass space > mapping data to the iomap functions and let them deal with pages (or > folios). IOWs, filesystems don't deal with pages directly anymore, and > folios sounded like an easy transition (for a filesystem) to whatever > comes next. At some point it would be nice to get fscrypt and fsverity > hooked up so that we could move ext4 further off of buffer heads. > > I don't know how we proceed from here -- there's quite a bit of > filesystems work that depended on the folios series actually landing. > Given that Linus has neither pulled it, rejected it, or told willy what > to do, and the folio series now has a NAK on it, I can't even start on > how to proceed from here. I think divide and conquer is the way forward. The crux of the matter is that folio is trying to 1) replace struct page as the filesystem interface to the MM and 2) replace struct page as the internal management object for file and anon, and conceptually also slab & networking pages all at the same time. 
As you can guess, goals 1) and 2) have vastly different scopes. Replacing struct page in the filesystem isn't very controversial, and filesystem folks seem uniformly ready to go. I agree. Replacing struct page in MM code is much less clear cut. We have some people who say it'll be great, some people who say we can probably figure out open questions down the line, and we have some people who have expressed doubts that all this churn will ever be worth it. I think it's worth replacing, but not with an untyped compound thing. It's sh*tty that the filesystem people are acutely blocked on large-scope, long-term MM discussions they don't care about. It's also sh*tty that these MM discussions are rushed by folks who aren't familiar with, or don't much care about, the MM internals. This friction isn't necessary. The folio conversion is an incremental process. It's not like everything in MM code has been fully converted already - some stuff deals with the folio, most stuff with the page. An easy way forward that I see is to split this large, open-ended project into more digestible pieces. E.g. separate 1) and 2): merge a "size-agnostic cache page" type now; give MM folks the time they need to figure out how and if they want to replace struct page internally. That's why I suggested to drop the anon page conversion bits in swap.c, workingset.c, memcontrol.c etc, and just focus on the uncontroversial page cache bits for now.
Johannes Weiner <hannes@cmpxchg.org> wrote: > I know Kent was surprised by this. I know Dave Chinner suggested to > call it "cache page" or "cage" early on, which also suggests an > understanding of a *dedicated* cache page descriptor. If we are aiming to get pages out of the view of the filesystem, then we should probably not include "page" in the name. "Data cache" would seem obvious, but we already have that concept for the CPU. How about something like "struct content" and rename i_pages to i_content? David
On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote: > On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote: > > On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote: > > > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote: > The MM POV (and the justification for both the acks and the naks of > the patchset) is that it's a generic, untyped compound page > abstraction, which applies to file, anon, slab, networking > pages. Certainly, the folio patches as of right now also convert anon > page handling to the folio. If followed to its conclusion, the folio > will have plenty of members and API functions for non-pagecache users > and look pretty much like struct page today, just with a dynamic size. > > I know Kent was surprised by this. I know Dave Chinner suggested to > call it "cache page" or "cage" early on, which also suggests an > understanding of a *dedicated* cache page descriptor. Don't take a flippant comment I made in a bikeshed discussion as any sort of representation of what I think about this current situation. I've largely been silent because of your history of yelling incoherently in response to anything I say that you don't agree with. But now you've explicitly drawn me into this discussion, I'll point out that I'm one of very few people in the wider Linux mm/fs community who has any *direct experience* with the cache handle based architecture being advocated for here. I don't agree with your assertion that cache handle based objects are the way forward, so please read and try to understand what I've just put a couple of hours into writing before you start shouting. Please? --- Ok, so this cache page descriptor/handle/object architecture has been implemented in other operating systems. It's the solution that Irix implemented back in the early _1990s_ via its chunk cache. I've talked about this a few times in the past 15 years, so I guess I'll talk about it again. 
eg at LSFMM 2014 where I said "we don't really want to go down that path" in reference to supporting sector sizes > PAGE_SIZE: https://lwn.net/Articles/592101/ So, in more gory detail, here's why I don't think we really want to go down that path..... The Irix chunk cache sat between the low layer global, disk address indexed buffer cache[1] and the high layer per-mm-context page cache used for mmap(). A "chunk" was a variable sized object indexed by file offset on a per-inode AVL tree - basically the same caching architecture as our current per-inode mapping tree uses to index pages. But unlike the Linux page cache, these chunks were an extension of the low level buffer cache. Hence they were also indexed by physical disk address and the life-cycle was managed by the buffer cache shrinker rather than the mm-based page cache reclaim algorithms. Chunks were built from page cache pages, and pages pointed back to the chunk that they belonged to. Chunks needed their own locking. IO was done based on chunks, not pages. Filesystems decided the size of chunks, not the page cache. Pages attached to chunks could be of any hardware supported size - the only limitation was that all pages attached to a chunk had to be the same size. A large hardware page in the page cache could be mapped by multiple smaller chunks. A chunk made up of multiple hardware pages could vmap its contents if the user needed contiguous access.[2] Chunks were largely unaware of ongoing mmap operations. When a page fault hit a page that had no associated chunk (e.g. one originally populated into the page cache by a read fault into a hole, or a cached page whose chunk the buffer cache had torn down), a new chunk had to be built. The code needed to handle partially populated chunks in this sort of situation was really, really nasty, as it required interacting with the filesystem and having the filesystem take locks and call back up into the page cache to build the new chunk in the IO path. 
Similarly, dirty page state from page faults needed to be propagated down to the chunks, because dirty tracking for writeback was done at the chunk level, not the page cache level. This was *really* nasty, because if the page didn't have a chunk already built, it couldn't be built in a write fault context. Hence sweeping dirty page state to the IO subsystem was handled periodically by a pdflush daemon, which could work with the filesystem to build new (dirty) chunks and insert them into the chunk cache for writeback. Similar problems will have to be considered during design for Linux because the dirty tracking in Linux for writeback is done at the per-inode mapping tree level. Hence things like ->page_mkwrite are going to have to dig through the page to the cached chunk and mark the chunk dirty rather than the page. Whether deadlocks are going to have to be worked around is an open question; I don't have answers to these concerns because nobody is proposing an architecture detailed enough to explore these situations. This also leads to really interesting questions about how page and chunk state w.r.t. IO is kept coherent. e.g. if we are not tracking IO state on individual page cache pages, how do we ensure all the pages stay stable when IO is being done to a block device that requires stable pages? Along similar lines: what's the interlock mechanism that we'll use to ensure that IO or truncate can lock out per-page accesses if the filesystem IO paths no longer directly interact with page state any more? I also wonder how we will manage cached chunks if the filesystem currently relies on page level locking for atomicity, concurrency and existence guarantees (e.g. ext4 buffered IO)? IOWs, it is extremely likely that there will still be situations where we have to blast directly through the cache handle abstraction to manipulate the objects behind the abstraction so that we can make specific functionality work correctly, without regressions and/or efficiently. 
Hence the biggest issue that a chunk-like cache handle introduces is the complex multi-dimensional state update interactions. These will require more complex locking and that locking will be required to work in arbitrary orders for operations to be performed safely and atomically. e.g. IO needs inode->chunk->page order, whilst page migration/compaction needs page->chunk->inode order. Page migration and compaction on Irix had some unfixable deadlocks in rare corner cases because of locking inversion problems between filesystems, chunks, pages and mm contexts. I don't see any fundamental difference in Linux architecture that makes me think that it will be any different.[3] I've got a war chest full of chunk cache related data corruption bugs on Irix that were crazy hard to reproduce and even more difficult to fix. At least half the bugs I had to fix in the chunk cache over 3-4 years as maintainer were data corruption bugs resulting from inconsistencies in multi-object state updates. I've got a whole 'nother barrel full of problem cases that revolve around memory reclaim, too. The cache handles really need to pin the pages that back them, and so we can't really do access optimised per-page based reclaim of file-backed pages anymore. The Irix chunk cache had its own LRUs and shrinker[4] to manage life-cycles of chunks under memory pressure, and the mm code had its own independent page cache shrinker. Hence pages didn't get freed until both the chunk cache and the page cache released the pages they had references to. IOWs, we're going to end up needing to reclaim cache handles before we can do page reclaim. This needs careful thought and will likely need a complete redesign of the vmscan.c algorithms to work properly. I really, really don't want to see awful layer violations like bufferhead reclaim getting hacked into the low layer page reclaim algorithms happen ever again. We're still paying the price for that. 
And given the way Linux uses the mapping tree for keeping stuff like per-page working set refault information after the pages have been removed from the page cache, I really struggle to see how functionality like this can be supported with a chunk based cache index that doesn't actually have direct tracking of individual page access and reclaim behaviour. We're also going to need a range-based indexing mechanism for the mapping tree if we want to avoid the inefficiencies that mapping large objects into the XArray requires. We'll need an rcu-aware tree of some kind, be it a btree, maple tree or something else, so that we can maintain lockless lookups of cache objects. That infrastructure doesn't exist yet, either. And on that note, it is worth keeping in mind that one of the reasons that the current linux page cache architecture scales better for single files than the Irix architecture ever did is because the Irix chunk cache could not be made lockless. The requirements for atomic multi-dimensional indexing updates and coherent, atomic multi-object state changes could never be solved in a lockless manner. It was not for lack of trying or talent; people way smarter than me couldn't solve that problem. So there's an open question as to whether we can maintain existing lockless algorithms when a chunk cache is layered over the top of the page cache. IOWs, I see significant, fundamental problems that chunk cache architectures suffer from. I know there are inherent problems with state coherency, locking, complexity in the IO path, etc. Some of these problems will not be discovered until the implementation is well under way. Some of these problems may well be unsolvable, too. And until there's an actual model proposed of how everything will interact and work, we can't actually do any of this architectural analysis to determine if it might work or not. The chunk cache proposal is really just a grand thought experiment at this point in time. 
OTOH, folios have none of these problems and are here right now. Sure, they have their own issues, but we can see them for what they are given the code is already out there, and pretty much everyone sees them as a big step forwards. Folios don't prevent a chunk cache from being implemented. In fact, to make folios highly efficient, we have to do things a chunk cache would also require to be implemented. e.g. range-based cache indexing. Unlike a chunk cache, folios don't depend on this being done first - they stand alone without those changes, and will only improve from making them. IOWs, you can't use the "folios being mapped 512 times into the mapping tree" as a reason the chunk cache is better - the chunk cache also requires this same problem to be solved, but the chunk cache needs efficient range lookups done *before* it is implemented, not provided afterwards as an optimisation. IOWs, if we want to move towards a chunk cache, the first step is to move to folios to allow large objects in the page cache. Then we can implement a lock-less range based index mechanism for the mapping tree. Then we can look to replace folios with a typed cache handle without having to worry about all the whacky multi-object coherency problems because they only need to point to a single folio. Then we can work out all the memory reclaim issues, locking issues, sort out the API that filesystems use instead of folios, etc that need to be done when cache handles are introduced. And once we've worked through all that, then we can add support for multiple folios within a single cache object and discover all the really hard problems that this exposes. At this point, the cache objects are no longer dependent on folios to provide objects > PAGE_SIZE to the filesystems, and we can start to remove folios from the mm code and replace them with something else that the cache handle uses to provide the backing store to the filesystems... 
Seriously, I have given a lot of thought over the years to a chunk cache for Linux. Right now, a chunk cache is a solution looking for a problem to solve. Unless there's an overall architectural mm plan that is being worked towards that requires a chunk cache, then I just don't see the justification for doing all this work, because the first two steps above get filesystems everything they are currently asking for. Everything else past that is really just an experiment...

> I agree with what I think the filesystems want: instead of an untyped,
> variable-sized block of memory, I think we should have a typed page
> cache descriptor.

I don't think that's what fs devs want at all. It's what you think fs devs want. If you'd been listening to us the same way that Willy has been for the past year, maybe you'd have a different opinion.

Indeed, we don't actually need a new page cache abstraction. fs/iomap already provides filesystems with a complete, efficient page cache abstraction that only requires filesystems to provide block mapping services. Filesystems using iomap do not interact with the page cache at all. And David Howells is working with Willy and all the network fs devs to build an equivalent generic netfs page cache abstraction based on folios that is supported by the major netfs client implementations in the kernel.

IOWs, fs devs don't need a new page cache abstraction - we've got our own abstractions tailored directly to our needs. What we need are API cleanups, consistency in object access mechanisms and dynamic object size support to simplify and fill out the feature set of the abstractions we've already built.

So many fs developers are pushing *hard* for folios because it provides what we've been asking for individually over the last few years. Willy has done a great job of working with the fs developers and getting feedback at every step of the process, and you see that in the amount of work in progress that is already based on folios.
And it provides those cleanups and new functionality without changing or invalidating any of the knowledge we collectively hold about how the page cache works. That's _pure gold_ right there.

In summary:

If you don't know anything about the architecture and limitations of the XFS buffer cache (also read the footnotes), you'd do very well to pay heed to what I've said in this email, considering the direct relevance its history has to the alternative cache handle proposal being made here. We also need to consider the evidence that filesystems do not actually need a new page cache abstraction - they just need the existing page cache to be able to index objects larger than PAGE_SIZE.

So with all that in mind, I consider folios (or whatever we call them) to be the best stepping stone towards a PAGE_SIZE independent future that we currently have. Folios don't prevent us from introducing a cache handle based architecture if we have a compelling reason to do so in the future, nor do they stop anyone working on such infrastructure in parallel if it really is necessary. But the reality is that we don't need such a fundamental architectural change to provide the functionality that folios provide us with _right now_.

Folios are not perfect, but they are here and they solve many issues we need solved. We're never going to have a perfect solution that everyone agrees with, so the real question is "are folios good enough?". To me the answer is a resounding yes.

Cheers,

Dave.

[1] fs/xfs/xfs_buf.c is an example of a high performance, handle based, variable object size cache that abstracts away the details of the data store being allocated from slab, discontiguous pages, contiguous pages or vmapped memory. It is basically a two-decade-old re-implementation of the Irix low layer global disk-addressed buffer cache, modernised and tailored directly to the needs of XFS metadata caching.

[2] Keep in mind that the xfs_buf cache used to be page cache backed.
The page cache provided the caching and memory reclaim infrastructure to the xfs_buf handles - and so we do actually have recent direct experience on Linux with the architecture you are proposing here. This architecture proved to have major limitations in performance, multi-object state coherency and cache residency prioritisation. It really sucked with systems that had 64kB page sizes and 4kB metadata block sizes, and ....

So we went back to the old Irix way of managing the cache - our own buffer based LRUs and aging mechanisms, with memory reclaim run by shrinkers based on buffer-type priorities. We use bulk page allocation for buffers that are >= PAGE_SIZE, and slab allocation for those < PAGE_SIZE. That's exactly what you are suggesting we do with 2MB sized base pages, but without having to care about mmap() at all.
On Fri, Sep 17, 2021 at 03:24:40PM +1000, Dave Chinner wrote:
> Folios are not perfect, but they are here and they solve many issues
> we need solved. We're never going to have a perfect solution that
> everyone agrees with, so the real question is "are folios good
> enough?". To me the answer is a resounding yes.

Besides agreeing with all you said, the other important part is: even if we were to eventually go with Johannes' grand plans (which I disagree with in many aspects), what is the harm in doing folios now?

Despite all the fuss, the pending folio PR does nothing but add type safety to compound pages. Which is something we badly need, no matter what kind of other caching grand plans people have.
On Fri, Sep 17, 2021 at 03:24:40PM +1000, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote:
> > I agree with what I think the filesystems want: instead of an untyped,
> > variable-sized block of memory, I think we should have a typed page
> > cache descriptor.
>
> I don't think that's what fs devs want at all. It's what you think
> fs devs want. If you'd been listening to us the same way that Willy
> has been for the past year, maybe you'd have a different opinion.

I was going off of Darrick's remarks about non-pagecache uses, Kent's remarks about simple and obvious core data structures, and yes, your suggestion of "cache page". But I think you may have overinterpreted what I meant by cache descriptor:

> Indeed, we don't actually need a new page cache abstraction.

I didn't suggest to change what the folio currently already is for the page cache. I asked to keep anon pages out of it (and in the future potentially other random stuff that is using compound pages). It doesn't have any bearing on how it presents to you on the filesystem side, other than that it isn't as overloaded as struct page is with non-pagecache stuff.

A full-on disconnect between the cache entry descriptor and the page is something that came up during speculation on how the MM will be able to effectively raise the page size and meet scalability requirements on modern hardware - and in that context I do appreciate you providing background information on the chunk cache, which will be valuable to inform *that* discussion. But it isn't what I suggested as the immediate action to unblock the folio merge.

> The fact that so many fs developers are pushing *hard* for folios is
> that it provides what we've been asking for individually over the last
> few years.

I'm not sure filesystem people are pushing hard for non-pagecache stuff to be in the folio.
> Willy has done a great job of working with the fs developers and
> getting feedback at every step of the process, and you see that in
> the amount of work in progress that is already based on folios.

And that's great, but the folio is blocked on MM questions:

1. Is the folio a good descriptor for all uses of anon and file pages inside MM code way beyond the page cache layer YOU care about?

2. Are compound pages a scalable, future-proof allocation strategy?

For some people the answers are yes, for others they are a no.

For 1), the value proposition is to clean up the relatively recent head/tail page confusion. And though everybody agrees that there is value in that, it's a LOT of churn for what it does. Several people have pointed this out, and AFAICS this is the most common reason for people that have expressed doubt or hesitation over the patches.

In an attempt to address this, I pointed out the cleanup opportunities that would open up by using separate anon and file folio types instead of one type for both. Nothing more. No intermediate thing, no chunk cache. Doesn't affect you. Just taking Willy's concept of type safety and applying it to file and anon instead of page vs compound page.

- It wouldn't change anything for fs people from the current folio patchset (except maybe the name)

- It would accomplish the head/tail page cleanup the same way, since just like a folio, a "file folio" could also never be a tail page

- It would take the same solution folio prescribes to the compound page issue (explicit typing to get rid of useless checks, lookups and subtle bugs) and solve way more instances of this all over MM code, thereby hopefully boosting the value proposition and making *that part* of the patches a clearer win for the MM subsystem

This is a question directed at MM people, not filesystem people. It doesn't pertain to you at all.
And if MM people agree or want to keep discussing it, the relatively minor action item for the folio patch is the same: drop the partial anon-to-folio conversion bits inside MM code for now and move on.

For 2), nobody knows the answer to this. Nobody. Anybody who claims to do so is full of sh*t. Maybe compound pages work out, maybe they don't. We can talk a million years about larger page sizes, how to handle internal fragmentation, the difficulties of implementing a chunk cache, but it's completely irrelevant because it's speculative.

We know there are multiple page sizes supported by the hardware and the smallest supported one is no longer the most dominant one. We do not know for sure yet how the MM is internally going to lay out its type system so that the allocator, mmap, page reclaim etc. can be CPU efficient and the descriptors be memory efficient. Nobody's "grand plan" here is any more viable, tested or proven than anybody else's.

My question for fs folks is simply this: as long as you can pass a folio to kmap and mmap and it knows what to do with it, is there any filesystem relevant requirement that the folio map to 1 or more literal "struct page", and that folio_page(), folio_nr_pages() etc. be part of the public API? Or can we keep this translation layer private to MM code? And will page_folio() be required for anything beyond the transitional period away from pages?

Can we move things not used outside of MM into mm/internal.h, mark the transitional bits of the public API as such, and move on?

The unproductive vitriol, personal attacks and dismissiveness over relatively minor asks and RFCs from the subsystem that is the most impacted by this patchset is just nuts.
On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> I didn't suggest to change what the folio currently already is for the
> page cache. I asked to keep anon pages out of it (and in the future
> potentially other random stuff that is using compound pages).

It would mean that anon-THP cannot benefit from the work Willy did with folios. Anon-THP is the most active user of compound pages at the moment and it also suffers from the compound_head() plague. You ask to exclude anon-THP citing *possible* future benefits for pagecache.

Sorry, but this doesn't sound fair to me.

We already had a similar experiment with PAGE_CACHE_SIZE. It was introduced with the hope of having PAGE_CACHE_SIZE != PAGE_SIZE one day. It never happened and only caused confusion on the border between pagecache-specific code and generic code that handled both file and anon pages.

If you want to limit usage of the new type to pagecache, the burden is on you to prove that it is useful and not just dead weight.
Snipped, reordered:

On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> 2. Are compound pages a scalable, future-proof allocation strategy?
>
> For 2), nobody knows the answer to this. Nobody. Anybody who claims to
> do so is full of sh*t. Maybe compound pages work out, maybe they
> don't. We can talk a million years about larger page sizes, how to
> handle internal fragmentation, the difficulties of implementing a
> chunk cache, but it's completely irrelevant because it's speculative.

Calling it compound pages here is a misnomer, and it confuses the discussion. The question is really about whether we should start using higher order allocations for data in the page cache, and perhaps a better way of framing that question is: should we continue to fragment all our page cache allocations up front into individual pages? But I don't think this is really the blocker.

> 1. Is the folio a good descriptor for all uses of anon and file pages
> inside MM code way beyond the page cache layer YOU care about?
>
> For some people the answers are yes, for others they are a no.

The anon page conversion does seem to be where all the disagreement is coming from. So my ask, to everyone involved, is: if anonymous pages are dropped from the folio patches, do we have any other real objections to the patch series?

It's an open question as to how much anonymous pages are like file pages, and, if we continue down the route of splitting up struct page into separate types, whether anonymous pages should be converted at the same time as file pages. Also, it appears even file pages aren't fully converted to folios in Willy's patch set - grepping around reveals plenty of references to struct page left in fs/.

I think that even if anonymous pages are going to become folios, it's a pretty reasonable ask for that to wait a cycle or two and see how the conversion of file pages fully plays out.
Also: it's become pretty clear to me that we have crappy communications between MM developers and filesystem developers.

Internally both teams have solid communications - I know in filesystem land we all talk to each other and are pretty good at working collaboratively, and it sounds like the MM team also has good internal communications. But we seem to have some problems with tackling issues that cross over between FS and MM land, or awkwardly sit between them.

Perhaps this is something we could try to address when picking conference topics in the future. Johannes also mentioned a monthly group call the MM devs schedule - I wonder if it would be useful to get something similar going between MM and interested parties in filesystem land.
On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > I didn't suggest to change what the folio currently already is for the
> > page cache. I asked to keep anon pages out of it (and in the future
> > potentially other random stuff that is using compound pages).
>
> It would mean that anon-THP cannot benefit from the work Willy did with
> folios. Anon-THP is the most active user of compound pages at the moment
> and it also suffers from the compound_head() plague. You ask to exclude
> anon-THP citing *possible* future benefits for pagecache.
>
> Sorry, but this doesn't sound fair to me.

I'm less concerned with what's fair than figuring out what the consensus is so we can move forward. I agree that anonymous THPs could benefit greatly from conversion to folios - but looking at the code it doesn't look like much of that has been done yet.

I understand you've had some input into the folio patches, so maybe you'd be best able to answer while Matthew is away - would it be fair to say that, in the interests of moving forward, anonymous pages could be split out for now? That way the MM people gain time to come to their own consensus and we can still unblock the FS work that's already been done on top of folios.
On Fri, Sep 17, 2021 at 05:17:09PM -0400, Kent Overstreet wrote:
> On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > > I didn't suggest to change what the folio currently already is for the
> > > page cache. I asked to keep anon pages out of it (and in the future
> > > potentially other random stuff that is using compound pages).
> >
> > It would mean that anon-THP cannot benefit from the work Willy did with
> > folios. Anon-THP is the most active user of compound pages at the moment
> > and it also suffers from the compound_head() plague. You ask to exclude
> > anon-THP citing *possible* future benefits for pagecache.
> >
> > Sorry, but this doesn't sound fair to me.
>
> I'm less concerned with what's fair than figuring out what the consensus
> is so we can move forward. I agree that anonymous THPs could benefit
> greatly from conversion to folios - but looking at the code it doesn't
> look like much of that has been done yet.
>
> I understand you've had some input into the folio patches, so maybe you'd
> be best able to answer while Matthew is away - would it be fair to say
> that, in the interests of moving forward, anonymous pages could be split
> out for now? That way the MM people gain time to come to their own
> consensus and we can still unblock the FS work that's already been done
> on top of folios.

I can't answer for Matthew.

The anon conversion patchset doesn't exist yet (but it is in the plans), so there's nothing to split out. Once someone comes up with such a patchset, they have to sell it upstream on its own merit. Possible future efforts should not block the code at hand.

"Talk is cheap. Show me the code."
On Sat, Sep 18, 2021 at 01:02:09AM +0300, Kirill A. Shutemov wrote:
> I can't answer for Matthew.
>
> The anon conversion patchset doesn't exist yet (but it is in the plans),
> so there's nothing to split out. Once someone comes up with such a
> patchset, they have to sell it upstream on its own merit.

Perhaps we've been operating under some incorrect assumptions then. If the current patch series doesn't actually touch anonymous pages - the patch series does touch code in e.g. mm/swap.c, but looking closer it might just be due to the (mis)organization of the current code - maybe there aren't any real objections left?
On Fri, Sep 17, 2021 at 05:13:10PM -0400, Kent Overstreet wrote:
> Also: it's become pretty clear to me that we have crappy
> communications between MM developers and filesystem
> developers.

I think one of the challenges has been the lack of an LSF/MM since 2019. And it may be that having *some* kind of ad hoc technical discussion, given that LSF/MM in 2021 is not happening, might be a good thing. I'm sure if we asked nicely, we could use the LPC infrastructure to set up something, assuming we can find a mutually agreeable day or dates.

> Internally both teams have solid communications - I know
> in filesystem land we all talk to each other and are pretty good at
> working collaboratively, and it sounds like the MM team also has good
> internal communications. But we seem to have some problems with
> tackling issues that cross over between FS and MM land, or awkwardly
> sit between them.

That's a bit of an over-generalization; it seems like we've uncovered that some of the disagreements are between different parts of the MM community over the suitability of folios for anonymous pages. And it's interesting, because I don't really consider Willy to be one of "the FS folks" --- and he has been quite diligent about reaching out to a number of folks in the FS community about our needs, and it's clear that this has been really, really helpful.

There's no question that we've had for many years some difficulties in the code paths that sit between FS and MM, and I'd claim that it's not just because of communications, but the relative lack of effort that was focused in that area. The fact that Willy has spent the last 9 months working on FS / MM interactions has been really great, and I hope it continues.
That being said, it sounds like there are issues internal to the MM devs that still need to be ironed out, and at the risk of throwing the anon-THP folks under the bus, if we can land at least some portion of the folio commits, it seems like that would be a step in the right direction.

Cheers,

- Ted
On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > I didn't suggest to change what the folio currently already is for the
> > page cache. I asked to keep anon pages out of it (and in the future
> > potentially other random stuff that is using compound pages).
>
> It would mean that anon-THP cannot benefit from the work Willy did with
> folios. Anon-THP is the most active user of compound pages at the moment
> and it also suffers from the compound_head() plague. You ask to exclude
> anon-THP citing *possible* future benefits for pagecache.
>
> Sorry, but this doesn't sound fair to me.

Hold on, Kirill. I'm not saying we shouldn't fix anonthp. But let's clarify the actual code in question in this specific patchset. You say anonthp cannot benefit from folio, but in the other email you say this patchset isn't doing the conversion yet.

The code I'm specifically referring to here is the conversion of some code that encounters both anon and file pages - swap.c, memcontrol.c, workingset.c, and a few other places. It's a small part of the folio patches, but it's a big deal for the MM code conceptually.

I'm requesting to drop those and just keep the page cache bits. Not because I think anonthp shouldn't be fixed, but because I think we're not in agreement yet on how they should be fixed. And it's somewhat independent of fixing the page cache interface now, which people are waiting on much more desperately and acutely than we inside MM are waiting for a struct page cleanup. It's not good to hold them up while we argue.

Dropping the anon bits isn't final. Depending on how our discussion turns out, we can still put them in later or we can put in something new. The important thing is that the uncontroversial page cache bits aren't held up any longer while we figure it out.

> If you want to limit usage of the new type to pagecache, the burden is
> on you to prove that it is useful and not just dead weight.
I'm not asking to add anything to the folio patches, just to remove some bits around the edges. And for the page cache bits: I think we have a rather large number of folks really wanting those. Now.

Again, I think we should fix anonthp. But I also think we should really look at struct page more broadly. And I think we should have that discussion inside a forum of MM people that truly care. I'm just trying to unblock the fs folks at this point and merge what we can now.
On 9/17/21 6:25 PM, Theodore Ts'o wrote:
> On Fri, Sep 17, 2021 at 05:13:10PM -0400, Kent Overstreet wrote:
>> Also: it's become pretty clear to me that we have crappy
>> communications between MM developers and filesystem
>> developers.
>
> I think one of the challenges has been the lack of an LSF/MM since
> 2019. And it may be that having *some* kind of ad hoc technical
> discussion given that LSF/MM in 2021 is not happening might be a good
> thing. I'm sure if we asked nicely, we could use the LPC
> infrastructure to set up something, assuming we can find a mutually
> agreeable day or dates.

We have a slot for this in the FS MC, first slot actually, so hopefully we can get things hashed out there.

Thanks,

Josef
On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> My question for fs folks is simply this: as long as you can pass a
> folio to kmap and mmap and it knows what to do with it, is there any
> filesystem relevant requirement that the folio map to 1 or more
> literal "struct page", and that folio_page(), folio_nr_pages() etc be
> part of the public API?

In the short term, yes, we need those things in the public API. In the long term, not so much.

We need something in the public API that tells us the offset and size of the folio. Lots of page cache code currently does stuff like calculate the size or iteration counts based on the difference of page->index values (i.e. number of pages) and iterate page by page. A direct conversion of such algorithms increments by folio_nr_pages() instead of 1. So stuff like this is definitely necessary as public APIs in the initial conversion.

Let's face it, folio_nr_pages() is a huge improvement on directly exposing THP/compound page interfaces to filesystems and leaving them to work it out for themselves. So even in the short term, these API members represent a major step forward in mm API cleanliness.

As for long term, everything in the page cache API needs to transition to byte offsets and byte counts instead of units of PAGE_SIZE and page->index. That's a more complex transition, but AFAIA that's part of the future work Willy intends to do with folios and the folio API. Once we get away from accounting and tracking everything as units of struct page, all the public facing APIs that use those units can go away.

It's fairly slow to do this, because we have so much code that is doing stuff like converting file offsets between byte counts and page counts and vice versa. And it's not necessary to do in an initial conversion to folios, either.
But once everything in the page cache indexing API moves to byte ranges, the need to count pages, use page counts as ranges, iterate by page index, etc. all goes away and hence those APIs can also go away.

As for converting between folios and pages, we'll need those sorts of APIs for the foreseeable future because low level storage layers and hardware use pages for their scatter gather arrays, and at some point we've got to expose those pages from behind the folio API. Even if we replace struct page with some other hardware page descriptor, we're still going to need such translation APIs at some point in the stack....

> Or can we keep this translation layer private
> to MM code? And will page_folio() be required for anything beyond the
> transitional period away from pages?

No idea, but as per above I think it's a largely irrelevant concern for the foreseeable future because pages will be here for a long time yet.

> Can we move things not used outside of MM into mm/internal.h, mark the
> transitional bits of the public API as such, and move on?

Sure, but that's up to you to do as a patch set on top of Willy's folio trees if you think it improves the status quo. Write the patches and present them for review just like everyone else does, and they can be discussed on their merits in that context rather than being presented as a reason for blocking current progress on folios.

Cheers,

Dave.
On Sat, Sep 18, 2021 at 11:04:40AM +1000, Dave Chinner wrote:
> As for long term, everything in the page cache API needs to
> transition to byte offsets and byte counts instead of units of
> PAGE_SIZE and page->index. That's a more complex transition, but
> AFAIA that's part of the future work Willy intends to do with
> folios and the folio API. Once we get away from accounting and
> tracking everything as units of struct page, all the public facing
> APIs that use those units can go away.

Probably 95% of the places we use page->index and page->mapping aren't necessary because we've already got that information from the context we're in, and removing them would be a useful cleanup. If we've already got that from context (e.g. we're looking up the page in the page cache via i_pages), eliminating the page->index or page->mapping use means we're getting rid of a data dependency, so it's good for performance. But more importantly, those (much fewer) places in the code where we actually _do_ need page->index and page->mapping are really important places to be able to find, because they're interesting boundaries between different components in the VM.
On Sat, Sep 18, 2021 at 12:51:50AM -0400, Kent Overstreet wrote:
> On Sat, Sep 18, 2021 at 11:04:40AM +1000, Dave Chinner wrote:
> > As for long term, everything in the page cache API needs to
> > transition to byte offsets and byte counts instead of units of
> > PAGE_SIZE and page->index. That's a more complex transition, but
> > AFAIA that's part of the future work Willy intends to do with
> > folios and the folio API. Once we get away from accounting and
> > tracking everything as units of struct page, all the public facing
> > APIs that use those units can go away.
>
> Probably 95% of the places we use page->index and page->mapping aren't
> necessary because we've already got that information from the context
> we're in, and removing them would be a useful cleanup.

*nod*

> If we've already got that from context (e.g. we're looking up the page
> in the page cache via i_pages), eliminating the page->index or
> page->mapping use means we're getting rid of a data dependency, so it's
> good for performance. But more importantly, those (much fewer) places in
> the code where we actually _do_ need page->index and page->mapping are
> really important places to be able to find, because they're interesting
> boundaries between different components in the VM.

*nod*

This is where infrastructure like write_cache_pages() is problematic. It's not actually a component of the VM - it's core page cache/filesystem API functionality - but the implementation is determined by the fact that there is no clear abstraction between the page cache and the VM, and so while the filesystem side of the API is byte-range based, the VM side is struct page based and the impedance mismatch has to be handled in the page cache implementation.

Folios are definitely pointing out issues like this whilst, IMO, demonstrating that an abstraction like folios is also a necessary first step to addressing the problems they make obvious...

Cheers,

Dave.
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> Q: Oh yeah, but what again are folios for, exactly?
>
> Folios are for cached filesystem data which (importantly) may be mapped to
> userspace.
>
> So when MM people see a new data structure come up with new references to
> page size - there's a very good reason for that, which is that we need to
> be allocating in multiples of the hardware page size if we're going to be
> able to map it to userspace and have PTEs point to it.
>
> So going forward, if the MM people want struct page to refer to multiple
> hardware pages - this shouldn't prevent that, and folios will refer to
> multiples of the _hardware_ page size, not struct page pagesize.
>
> Also - all the filesystem code that's being converted tends to talk and
> think in units of pages. So going forward, it would be a nice cleanup to
> get rid of as many of those references as possible and just talk in terms
> of bytes (e.g. I have generally been trying to get rid of references to
> PAGE_SIZE in bcachefs wherever reasonable, for other reasons) - those
> cleanups are probably for another patch series, and in the interests of
> getting this patch series merged with the fewest introduced bugs possible
> we probably want the current helpers.

I'd like to thank those who reached out off-list. Some of you know I've had trouble with depression in the past, and I'd like to reassure you that that's not a problem at the moment. I had a good holiday, and I was able to keep from thinking about folios most of the time.

I'd also like to thank those who engaged in the discussion while I was gone. A lot of good points have been made. I don't think the normal style of replying to each email individually makes a lot of sense at this point, so I'll make some general comments instead. I'll respond to the process issues on the other thread.
I agree with the feeling a lot of people have expressed, that struct page is massively overloaded and we would do much better with stronger typing. I like it when the compiler catches bugs for me. Disentangling struct page is something I've been working on for a while, and folios are a step in that direction (in that they remove the two types of tail page from the universe of possibilities).

I don't believe it is realistic to disentangle file pages and anon pages from each other. Thanks to swap and shmem, both file pages and anon pages need to be able to be moved in and out of the swap cache. The swap cache shares a lot of code with the page cache, so changing how the swap cache works is also tricky.

What I do believe is possible is something Kent hinted at: treating anon pages more like file pages. I also believe that shmem should be able to write pages to swap without moving the pages into the swap cache first. But these two things are just beliefs. I haven't tried to verify them and they may come to nothing.

I also want to split out slab_page and page_table_page from struct page. I don't intend to convert either of those to folios. I do want to make struct page dynamically allocated (and have for a while). There are some complicating factors ...

There are two primary places where we need to map from a physical address to a "memory descriptor". The one that most people care about is get_user_pages(). We have a page table entry and need to increment the refcount on the head page, possibly mark the head page dirty, but also return the subpage of any compound page we find. The one that far fewer people care about is memory-failure.c; we also need to find the head page to determine what kind of memory has been affected, but we need to mark the subpage as HWPoison. Both of these need to be careful to not confuse tail and non-tail pages.

So yes, we need to use folios for anything that's mappable to userspace.
That's not just anon & file pages but also network pools, graphics card memory and vmalloc memory. Eventually, I think struct page actually goes down to a union of a few words of padding, along with ->compound_head. Because that's all we're guaranteed is actually there; everything else is only there in head pages. There are a lot of places that should use folios which the current patchset doesn't convert. I prioritised filesystems because we've got ~60 filesystems to convert, and working on the filesystems can proceed in parallel with working on the rest of the MM. Also, if I converted the entire MM at once, there would be complaints that a 600 patch series was unreviewable. So here we are, there's a bunch of compatibility code that indicates areas which still need to be converted. I'm sure I've missed things, but I've been working on this email all day and wanted to send it out before going to sleep.
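The compound_head encoding that keeps coming up in this thread can be modelled in a few lines of userspace C. This is a simplified sketch, not the real kernel definitions: the names mirror the kernel's, but the structs are reduced to the fields under discussion. The key property is that tail pages store a pointer to their head page with bit 0 set, so any page can be resolved to its head, while a folio is by construction never a tail page.

```c
#include <assert.h>
#include <stdbool.h>

struct page {
	unsigned long compound_head;	/* bit 0 set => this is a tail page */
	unsigned long flags;
};

struct folio {
	struct page page;		/* never a tail page */
};

static bool PageTail(struct page *page)
{
	return page->compound_head & 1;
}

static struct page *compound_head(struct page *page)
{
	if (PageTail(page))
		return (struct page *)(page->compound_head - 1);
	return page;
}

/* The folio conversion: any page, head or tail, maps to its folio. */
static struct folio *page_folio(struct page *page)
{
	return (struct folio *)compound_head(page);
}

/* Mark pages 1..2^order-1 as tails of pages[0]. */
static void prep_compound_page(struct page *pages, unsigned int order)
{
	unsigned long nr = 1UL << order;

	for (unsigned long i = 1; i < nr; i++)
		pages[i].compound_head = (unsigned long)&pages[0] | 1;
}
```

Under this model, a function taking a struct folio * documents at the type level that it cannot receive a tail page, which is exactly the ambiguity the conversion removes.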
On Fri, Sep 17, 2021 at 07:15:40PM -0400, Johannes Weiner wrote: > The code I'm specifically referring to here is the conversion of some > code that encounters both anon and file pages - swap.c, memcontrol.c, > workingset.c, and a few other places. It's a small part of the folio > patches, but it's a big deal for the MM code conceptually. Hard to say without actually trying, but my worry here is that this may lead to code duplication to separate the file and anon code paths. I dunno.
Just a note upfront: This discussion is now about whether folios are suitable for anon pages as well. I'd like to reiterate that regardless of the outcome of this discussion I think we should probably move ahead with the page cache bits, since people are specifically blocked on those and there is no dependency on the anon stuff, as the conversion is incremental. On Mon, Sep 20, 2021 at 03:17:15AM +0100, Matthew Wilcox wrote: > I don't believe it is realistic to disentangle file pages and anon > pages from each other. Thanks to swap and shmem, both file pages and > anon pages need to be able to be moved in and out of the swap cache. Yes, the swapcache is actually shared code and needs a shared type. However, once swap and shmem are fully operating on *typed* anon and file pages, there are no possible routes of admission for tail pages into the swapcache:

	vmscan: add_to_swap_cache(anon_page->page);
	shmem: delete_from_swap_cache(file_page->page);

and so the justification for replacing page with folio *below* those entry points to address tailpage confusion becomes nil: there is no confusion. Move the anon bits to anon_page and leave the shared bits in page. That's 912 lines of swap_state.c we could mostly leave alone. The same is true for the LRU code in swap.c. Conceptually, already no tailpages *should* make it onto the LRU. Once the high-level page instantiation functions - add_to_page_cache_lru, do_anonymous_page - have type safety, you really do not need to worry about tail pages deep in the LRU code. 1155 more lines of swap.c. And when you've ensured that tail pages can't make it onto the LRU, that takes care of the entire page reclaim code as well; converting it wholesale to folio again would provide little additional value. 4707 lines of vmscan.c. And with the page instantiation functions typed, nobody can pass tail pages into memcg charging code, either. 7509 lines of memcontrol.c. 
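The typed entry points described above can be sketched in a few lines. Everything here is hypothetical: anon_page and file_page are the wrapper types proposed in this email, not kernel code, and the helper names are invented for illustration. The point is structural: the shared swapcache implementation keeps taking a plain page, but because only the typed wrappers can call into it, a tail page has no route in.

```c
#include <assert.h>

struct page { unsigned long compound_head; };
struct anon_page { struct page page; };	/* hypothetical typed wrappers */
struct file_page { struct page page; };

static int swapcache_entries;

/* Shared implementation, common to anon and file. */
static int __add_to_swap_cache(struct page *page)
{
	swapcache_entries++;
	return 0;
}

/* Typed admission points: tail pages have no anon_page/file_page
 * wrapper, so they cannot reach the shared code by construction. */
static int add_anon_to_swap_cache(struct anon_page *anon)
{
	return __add_to_swap_cache(&anon->page);
}

static int add_shmem_to_swap_cache(struct file_page *file)
{
	return __add_to_swap_cache(&file->page);
}
```

The compiler, rather than runtime compound_head() calls, enforces the "no tail pages below this line" invariant.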
But back to your generic swapcache example: beyond the swapcache and the page LRU management, there really isn't a great deal of code that is currently truly type-agnostic and generic like that. And the rest could actually benefit from being typed more tightly to bring out what is actually going on. The anon_page->page relationship may look familiar too. It's a natural type hierarchy between superclass and subclasses that is common in object oriented languages: page has attributes and methods that are generic and shared; anon_page and file_page encode where their implementation differs. A type system like that would set us up for a lot of clarification and generalization of the MM code. For example it would immediately highlight when "generic" code is trying to access type-specific stuff that maybe it shouldn't, and thus help/force us refactor - something that a shared, flat folio type would not. And again, higher-level types would take care of the tail page confusion in many (most?) places automatically. > The swap cache shares a lot of code with the page cache, so changing > how the swap cache works is also tricky. The overlap is actually fairly small right now. Add and delete routines are using the raw xarray functions. Lookups use the most minimal version of find_get_page(), which wouldn't be a big deal to open-code until swapcache and pagecache would *actually* be unified. > What I do believe is possible is something Kent hinted at; treating > anon pages more like file pages. I also believe that shmem should > be able to write pages to swap without moving the pages into the > swap cache first. But these two things are just beliefs. I haven't > tried to verify them and they may come to nothing. Treating anon and file pages the same where possible makes sense. It's simple: the more code that can be made truly generic and be shared between subclasses, the better. 
However, for that we first have to identify what parts actually are generic, and what parts are falsely shared and shoehorned into equivalency due to being crammed into the same overloaded structure. For example, page->mapping for file is an address_space and the page's membership in that tree structure is protected by the page lock. page->mapping for anon is... not that. The pointer itself is ad-hoc typed to point to an anon_vma instead. And anon_vmas behave completely differently from a page's pagecache state. The *swapcache* state of an anon page is actually much closer to what the pagecache state of a file page is. And since it would be nice to share more of the swapcache and pagecache *implementation*, it makes sense that the relevant page attributes would correspond as well. (Yeah, page->mapping and page->index are used "the same way" for rmap, but that's a much smaller, read-only case. And when you look at how "generic" the rmap code is - with its foo_file and foo_anon functions, and PageAnon() checks, and conditional page locking in the shared bits-- the attribute sharing at the page level really did nothing to help the implementation be more generic.) It really should be something like:

	struct page {
		/* pagecache/swapcache state */
		struct address_space *address_space;
		pgoff_t index;
		lock_t lock;
	}

	struct file_page {
		struct page;
	}

	struct anon_page {
		struct page;
		struct anon_vma *anon_vma;
		pgoff_t offset;
	};

to recognize the difference in anon vs file rmapping and locking, while recognizing the similarity between swapcache and pagecache. A shared folio would perpetuate false equivalencies between anon and file which make it difficult to actually split out and refactor what *should* be generic vs what should be type-specific. And instead lead to more "generic" code littered with FolioAnon() conditionals. 
And in the name of tail page cleanup it would churn through thousands of lines of code where there is no conceptual confusion about tail pages to begin with. Proper type inheritance would allow us to encode how things actually are implemented right now and would be a great first step in identifying what needs to be done in order to share more code. And it would take care of so many places re: tail pages that it's a legitimate question to ask: how many places would actually be *left* that *need* to deal with tail pages? Couldn't we bubble compound_head() and friends into these few select places and be done? > I also want to split out slab_page and page_table_page from struct page. > I don't intend to convert either of those to folios. > > I do want to make struct page dynamically allocated (and have for > a while). There are some complicating factors ... > > There are two primary places where we need to map from a physical > address to a "memory descriptor". The one that most people care about > is get_user_pages(). We have a page table entry and need to increment > the refcount on the head page, possibly mark the head page dirty, but > also return the subpage of any compound page we find. The one that far > fewer people care about is memory-failure.c; we also need to find the > head page to determine what kind of memory has been affected, but we > need to mark the subpage as HWPoison. > > Both of these need to be careful to not confuse tail and non-tail pages. That makes sense. But gup() as an interface to the rest of the kernel is rather strange: It's not a generic page table walker that can take a callback argument to deal with whatever the page table points to. It also doesn't return properly typed objects: it returns struct page which is currently a wildcard for whatever people cram into it. > So yes, we need to use folios for anything that's mappable to userspace. 
> That's not just anon & file pages but also network pools, graphics card > memory and vmalloc memory. Eventually, I think struct page actually goes > down to a union of a few words of padding, along with ->compound_head. > Because that's all we're guaranteed is actually there; everything else > is only there in head pages. (Side question: if GUP can return tail pages, how does that map to folios?) Anyway, I don't see that folio for everything mappable is the obvious conclusion, because it doesn't address what is really weird about gup. While a folio interface would clean up the head and tail page issue, it maintains the incentive of cramming everything that people want to mmap and gup into the same wildcard type struct. And still leave the bigger problem of ad-hoc typing that wildcard ("What is the thing that was returned? Anon? File? GPU memory?") to the user. I think rather than cramming everything that can be mmapped into folio for the purpose of GUP and tail pages - even when these objects have otherwise little in common - it would make more sense to reconsider how GUP as an interface deals with typing. Some options:

a) Make it a higher-order function that leaves typing fully to the provided callback. This makes it clear (and greppable) which functions need to be wary about tail pages, and type inference in general.

b) Create an intermediate mmap type that can map to one of the higher-order types like anon or file, but never to a tail page. This sounds like what you want struct page to be long-term. But this sort of depends on the side question above - what if the pte maps a tail page?

c) Provide a stricter interface for known higher-order types (get_anon_pages...). Supporting new types means adding more entry functions, which IMO is preferable to cramming more stuff into a wildcard struct folio.

d) A hybrid of a) and c) to safely cover common cases, while allowing "i know what i'm doing" uses. 
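Option a) above can be made concrete with a small sketch. Every name here is invented for illustration - walk_user_mappings(), the callback type, the error value - and this is not what the kernel's get_user_pages() looks like; it only shows the shape of a higher-order walker that leaves typing decisions to the caller.

```c
#include <assert.h>

struct page { unsigned long compound_head; };

/* Caller-supplied callback: gets each mapped page, decides what it is. */
typedef int (*gup_cb_t)(struct page *page, void *priv);

/* Hypothetical higher-order walker: no typing policy of its own. */
static int walk_user_mappings(struct page **mapped, int nr,
			      gup_cb_t cb, void *priv)
{
	for (int i = 0; i < nr; i++) {
		int err = cb(mapped[i], priv);
		if (err)
			return err;	/* callback can abort the walk */
	}
	return 0;
}

/* Example callback: count pages, refusing tail pages loudly. */
static int count_nontail(struct page *page, void *priv)
{
	if (page->compound_head & 1)
		return -1;	/* tail page: made the caller's problem */
	(*(int *)priv)++;
	return 0;
}
```

The greppable property the email mentions falls out naturally: every callback passed to the walker is a place that must have an answer for tail pages.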
In summary, I think a page type hierarchy would do wonders to clean up anon and file page implementations, and encourage and enable more code sharing down the line, while taking care of tail pages as well. This leaves the question of how many places are actually *left* to deal with tail pages in MM. Folios are based on the premise that the confusion is simply everywhere, and that everything needs to be converted first to be safe. This is convenient because it means we never have to identify which parts truly *do* need tailpage handling, truly *need* the compound_head() lookups. Yes, compound_head() has to go from generic page flags testers. But as per the examples at the top, I really don't think we need to convert every crevice of the MM code to folio before we can be reasonably sure that removing it is safe. I really want to see a better ballpark analysis of what parts need to deal with tail pages to justify all this churn for them.
On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > This discussion is now about whether folio are suitable for anon pages > as well. I'd like to reiterate that regardless of the outcome of this > discussion I think we should probably move ahead with the page cache > bits, since people are specifically blocked on those and there is no > dependency on the anon stuff, as the conversion is incremental. So you withdraw your NAK for the 5.15 pull request which is now four weeks old and has utterly missed the merge window? > and so the justification for replacing page with folio *below* those > entry points to address tailpage confusion becomes nil: there is no > confusion. Move the anon bits to anon_page and leave the shared bits > in page. That's 912 lines of swap_state.c we could mostly leave alone. Your argument seems to be based on "minimising churn". Which is certainly a goal that one could have, but I think in this case is actually harmful. There are hundreds, maybe thousands, of functions throughout the kernel (certainly throughout filesystems) which assume that a struct page is PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, but tracking them all down is a never-ending task as new ones will be added as fast as they can be removed. > The same is true for the LRU code in swap.c. Conceptually, already no > tailpages *should* make it onto the LRU. Once the high-level page > instantiation functions - add_to_page_cache_lru, do_anonymous_page - > have type safety, you really do not need to worry about tail pages > deep in the LRU code. 1155 more lines of swap.c. It's actually impossible in practice as well as conceptually. The list LRU is in the union with compound_head, so you cannot put a tail page onto the LRU. But yet we call compound_head() on every one of them multiple times because our current type system does not allow us to express "this is not a tail page". > The anon_page->page relationship may look familiar too. 
It's a natural > type hierarchy between superclass and subclasses that is common in > object oriented languages: page has attributes and methods that are > generic and shared; anon_page and file_page encode where their > implementation differs. > > A type system like that would set us up for a lot of clarification and > generalization of the MM code. For example it would immediately > highlight when "generic" code is trying to access type-specific stuff > that maybe it shouldn't, and thus help/force us refactor - something > that a shared, flat folio type would not. If you want to try your hand at splitting out anon_folio from folio later, be my guest. I've just finished splitting out 'slab' from page, and I'll post it later. I don't think that splitting anon_folio from folio is worth doing, but will not stand in your way. I do think that splitting tail pages from non-tail pages is worthwhile, and that's what this patchset does.
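The union Matthew refers to above ("the list LRU is in the union with compound_head") can be shown in a reduced form. This is a simplified sketch of the real struct page layout, cut down to the two overlapping members under discussion: because the LRU linkage and compound_head share storage, and list pointers are word-aligned (bit 0 clear), a page that is on an LRU list cannot simultaneously be a tail page.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

struct page {
	unsigned long flags;
	union {
		struct list_head lru;		/* head pages on an LRU list */
		unsigned long compound_head;	/* tail pages: head | 1 */
	};
};
```

So the invariant is enforced by the memory layout itself; the thread's disagreement is only over whether the remaining compound_head() calls on LRU pages are defensive noise or necessary until the types say so.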
On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote: > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > > and so the justification for replacing page with folio *below* those > > entry points to address tailpage confusion becomes nil: there is no > > confusion. Move the anon bits to anon_page and leave the shared bits > > in page. That's 912 lines of swap_state.c we could mostly leave alone. > > Your argument seems to be based on "minimising churn". Which is certainly > a goal that one could have, but I think in this case is actually harmful. > There are hundreds, maybe thousands, of functions throughout the kernel > (certainly throughout filesystems) which assume that a struct page is > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, > but tracking them all down is a never-ending task as new ones will be > added as fast as they can be removed. Yet it's only file backed pages that are actually changing in behaviour right now - folios don't _have_ to be the tool to fix that elsewhere, for anon, for network pools, for slab. > > The anon_page->page relationship may look familiar too. It's a natural > > type hierarchy between superclass and subclasses that is common in > > object oriented languages: page has attributes and methods that are > > generic and shared; anon_page and file_page encode where their > > implementation differs. > > > > A type system like that would set us up for a lot of clarification and > > generalization of the MM code. For example it would immediately > > highlight when "generic" code is trying to access type-specific stuff > > that maybe it shouldn't, and thus help/force us refactor - something > > that a shared, flat folio type would not. > > If you want to try your hand at splitting out anon_folio from folio > later, be my guest. I've just finished splitting out 'slab' from page, > and I'll post it later. 
I don't think that splitting anon_folio from > folio is worth doing, but will not stand in your way. I do think that > splitting tail pages from non-tail pages is worthwhile, and that's what > this patchset does. Eesh, we can and should hold ourselves to a higher standard in our technical discussions. Let's not let past misfortune (and yes, folios missing 5.15 _was_ unfortunate and shouldn't have happened) colour our perceptions and keep us from having productive working relationships going forward. The points Johannes is bringing up are valid and pertinent and deserve to be discussed. If you're still trying to sell folios as the be all, end all solution for anything using compound pages, I think you should be willing to make the argument that that really is the _right_ solution - not just that it was the one easiest for you to implement. Actual code might make this discussion more concrete and clearer. Could you post your slab conversion?
On Tue, Sep 21, 2021 at 05:11:09PM -0400, Kent Overstreet wrote: > On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote: > > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > > > and so the justification for replacing page with folio *below* those > > > entry points to address tailpage confusion becomes nil: there is no > > > confusion. Move the anon bits to anon_page and leave the shared bits > > > in page. That's 912 lines of swap_state.c we could mostly leave alone. > > > > Your argument seems to be based on "minimising churn". Which is certainly > > a goal that one could have, but I think in this case is actually harmful. > > There are hundreds, maybe thousands, of functions throughout the kernel > > (certainly throughout filesystems) which assume that a struct page is > > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, > > but tracking them all down is a never-ending task as new ones will be > > added as fast as they can be removed. > > Yet it's only file backed pages that are actually changing in behaviour right > now - folios don't _have_ to be the tool to fix that elsewhere, for anon, for > network pools, for slab. > > > > The anon_page->page relationship may look familiar too. It's a natural > > > type hierarchy between superclass and subclasses that is common in > > > object oriented languages: page has attributes and methods that are > > > generic and shared; anon_page and file_page encode where their > > > implementation differs. > > > > > > A type system like that would set us up for a lot of clarification and > > > generalization of the MM code. For example it would immediately > > > highlight when "generic" code is trying to access type-specific stuff > > > that maybe it shouldn't, and thus help/force us refactor - something > > > that a shared, flat folio type would not. > > > > If you want to try your hand at splitting out anon_folio from folio > > later, be my guest. 
I've just finished splitting out 'slab' from page, > > and I'll post it later. I don't think that splitting anon_folio from > > folio is worth doing, but will not stand in your way. I do think that > > splitting tail pages from non-tail pages is worthwhile, and that's what > > this patchset does. > > Eesh, we can and should hold ourselves to a higher standard in our technical > discussions. > > Let's not let past misfourtune (and yes, folios missing 5.15 _was_ unfortunate > and shouldn't have happened) colour our perceptions and keep us from having > productive working relationships going forward. The points Johannes is bringing > up are valid and pertinent and deserve to be discussed. > > If you're still trying to sell folios as the be all, end all solution for > anything using compound pages, I think you should be willing to make the > argument that that really is the _right_ solution - not just that it was the one > easiest for you to implement. > > Actual code might make this discussion more concrete and clearer. Could you post > your slab conversion? Linus, I'd also like to humbly and publicly request that, despite it being past the merge window and a breach of our normal process, folios still be merged for 5.15. Or failing that, that they're the first thing in for 5.16. The reason for my request is that: - folios, at least in filesystem land, solve pressing problems and much work has been done on top of them assuming they go in, and the filesystem people seem to be pretty unanimous that we both want and need this - the public process and discussion has been a trainwreck. 
We're effectively arguing about the future of struct page, which is a "boiling the oceans" type issue, and the amount of mess that needs to be cleaned up makes it hard for parties working in different areas of the code with different interests and concerns to see the areas where we really do have common interests and goals - it's become apparent that there haven't been any real objections to the code that was queued up for 5.15. There _are_ very real discussions and points of contention still to be decided and resolved for the work beyond file backed pages, but those discussions were what derailed the more modest, and more badly needed, work that affects everyone in filesystem land - And, last but not least: it would really help with the frustration levels that have been making these discussions extraordinarily difficult. I think this whole thing has been showing that our process has some weak points where hopefully we'll do better in the future, but in the meantime - Matthew has been doing good and badly needed work, and he has my vote of confidence. I don't necessarily fully agree with _everything_ he wants to do with folios - I'm not writing a blank check here - but he's someone I can work with and want to continue to work with. Johannes too, for that matter. Thanks and regards, Kent
On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote: > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > > This discussion is now about whether folio are suitable for anon pages > > as well. I'd like to reiterate that regardless of the outcome of this > > discussion I think we should probably move ahead with the page cache > > bits, since people are specifically blocked on those and there is no > > dependency on the anon stuff, as the conversion is incremental. > > So you withdraw your NAK for the 5.15 pull request which is now four > weeks old and has utterly missed the merge window? Once you drop the bits that convert shared anon and file infrastructure, yes. Because we haven't discussed yet, nor agree on, that folio are the way forward for anon pages. > > and so the justification for replacing page with folio *below* those > > entry points to address tailpage confusion becomes nil: there is no > > confusion. Move the anon bits to anon_page and leave the shared bits > > in page. That's 912 lines of swap_state.c we could mostly leave alone. > > Your argument seems to be based on "minimising churn". Which is certainly > a goal that one could have, but I think in this case is actually harmful. > There are hundreds, maybe thousands, of functions throughout the kernel > (certainly throughout filesystems) which assume that a struct page is > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, > but tracking them all down is a never-ending task as new ones will be > added as fast as they can be removed. What does that have to do with anon pages? > > The same is true for the LRU code in swap.c. Conceptually, already no > > tailpages *should* make it onto the LRU. Once the high-level page > > instantiation functions - add_to_page_cache_lru, do_anonymous_page - > > have type safety, you really do not need to worry about tail pages > > deep in the LRU code. 1155 more lines of swap.c. 
> > It's actually impossible in practice as well as conceptually. The list > LRU is in the union with compound_head, so you cannot put a tail page > onto the LRU. But yet we call compound_head() on every one of them > multiple times because our current type system does not allow us to > express "this is not a tail page". No, because we haven't identified *who actually needs* these calls and move them up and out of the low-level helpers. It was a mistake to add them there, yes. But they were added recently for rather few callers. And we've had people send patches already to move them where they are actually needed. Of course converting *absolutely everybody else* to not-tailpage instead will also fix the problem... I just don't agree that this is an appropriate response to the issue. Asking again: who conceptually deals with tail pages in MM? LRU and reclaim don't. The page cache doesn't. Compaction doesn't. Migration doesn't. All these data structures and operations are structured around headpages, because that's the logical unit they operate on. The notable exception, of course, are the page tables because they map the pfns of tail pages. But is that it? Does it come down to page table walkers encountering pte-mapped tailpages? And needing compound_head() before calling mark_page_accessed() or set_page_dirty()? We couldn't fix vm_normal_page() to handle this? And switch khugepaged to a new vm_raw_page() or whatever? It should be possible to answer this question as part of the case for converting tens of thousands of lines of code to folio.
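The alternative floated here - bubble compound_head() up into the few select places that can actually see pte-mapped tail pages - can be sketched as follows. The function names are illustrative stubs, not the kernel's implementations; the point is only the shape: resolve the head once at the walker's entry point, so everything below it circulates head pages only.

```c
#include <assert.h>

struct page {
	unsigned long compound_head;	/* bit 0 set => tail page */
	int accessed, dirty;
};

static struct page *compound_head(struct page *page)
{
	if (page->compound_head & 1)
		return (struct page *)(page->compound_head - 1);
	return page;
}

/* Stubs standing in for mark_page_accessed()/set_page_dirty(). */
static void mark_page_accessed(struct page *page) { page->accessed++; }
static void set_page_dirty(struct page *page)     { page->dirty++; }

/* Hypothetical page-table-walker entry point: the one place that can
 * encounter a pte-mapped tail page does the head lookup... */
static void fault_touch_page(struct page *pte_page)
{
	struct page *head = compound_head(pte_page);

	/* ...and from here down, only head pages are passed around. */
	mark_page_accessed(head);
	set_page_dirty(head);
}
```

Whether the set of such entry points really is this small is exactly the "ballpark analysis" being asked for in this email.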
On Tue, Sep 21, 2021 at 05:11:09PM -0400, Kent Overstreet wrote: > On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote: > > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote: > > > and so the justification for replacing page with folio *below* those > > > entry points to address tailpage confusion becomes nil: there is no > > > confusion. Move the anon bits to anon_page and leave the shared bits > > > in page. That's 912 lines of swap_state.c we could mostly leave alone. > > > > Your argument seems to be based on "minimising churn". Which is certainly > > a goal that one could have, but I think in this case is actually harmful. > > There are hundreds, maybe thousands, of functions throughout the kernel > > (certainly throughout filesystems) which assume that a struct page is > > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that, > > but tracking them all down is a never-ending task as new ones will be > > added as fast as they can be removed. > > Yet it's only file backed pages that are actually changing in behaviour right > now - folios don't _have_ to be the tool to fix that elsewhere, for anon, for > network pools, for slab. The point (I think) Johannes is making is that some of the patches in this series touch code paths which are used by both anon and file pages. And it's those he's objecting to. > > If you want to try your hand at splitting out anon_folio from folio > > later, be my guest. I've just finished splitting out 'slab' from page, > > and I'll post it later. I don't think that splitting anon_folio from > > folio is worth doing, but will not stand in your way. I do think that > > splitting tail pages from non-tail pages is worthwhile, and that's what > > this patchset does. > > Eesh, we can and should hold ourselves to a higher standard in our technical > discussions. 
> > Let's not let past misfourtune (and yes, folios missing 5.15 _was_ unfortunate > and shouldn't have happened) colour our perceptions and keep us from having > productive working relationships going forward. The points Johannes is bringing > up are valid and pertinent and deserve to be discussed. > > If you're still trying to sell folios as the be all, end all solution for > anything using compound pages, I think you should be willing to make the > argument that that really is the _right_ solution - not just that it was the one > easiest for you to implement. Starting from the principle that the type of a pointer should never be wrong, GUP can convert from a PTE to a struct page. We need a name for the head page that GUP converts to, and my choice for that name is folio. A folio needs a refcount, a lock bit and a dirty bit. By the way, I think I see a path to: struct page { unsigned long compound_head; }; which will reduce the overhead of struct page from 64 bytes to 8. That should solve one of Johannes' problems. > Actual code might make this discussion more concrete and clearer. Could you post > your slab conversion? It's a bit big and deserves to be split into multiple patches. It's on top of folio-5.15. It also only really works for SLUB right now; CONFIG_SLAB doesn't compile yet. It does pass xfstests with CONFIG_SLUB ;-) I'm not entirely convinced I've done the right thing with page_memcg_check(). There's probably other things wrong with it, I was banging it out during gaps between sessions at Plumbers. 
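A quick back-of-envelope for the "64 bytes down to 8" remark above: with 4KiB base pages, the memmap overhead per GiB of RAM scales linearly with the descriptor size. The helper below is just arithmetic for this thread, not kernel code.

```c
#include <assert.h>

/* Bytes of memmap needed per GiB of RAM, for a given struct page size,
 * assuming 4KiB base pages (262144 pages per GiB). */
static unsigned long memmap_bytes_per_gib(unsigned long descriptor_size)
{
	unsigned long pages_per_gib = (1UL << 30) / 4096;

	return pages_per_gib * descriptor_size;
}
```

At 64 bytes per struct page that works out to 16 MiB of memmap per GiB (about 1.6% of memory); at 8 bytes, 2 MiB per GiB (about 0.2%).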
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ddeaba947eb3..5f3d2efeb88b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -981,7 +981,7 @@ static void __meminit free_pagetable(struct page *page, int order)
 
 	if (PageReserved(page)) {
 		__ClearPageReserved(page);
-		magic = (unsigned long)page->freelist;
+		magic = page->index;
 		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
 			while (nr_pages--)
 				put_page_bootmem(page++);
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 2bc8b1f69c93..cc35d010fa94 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -30,7 +30,7 @@ void put_page_bootmem(struct page *page);
  */
 static inline void free_bootmem_page(struct page *page)
 {
-	unsigned long magic = (unsigned long)page->freelist;
+	unsigned long magic = page->index;
 
 	/*
 	 * The reserve_bootmem_region sets the reserved flag on bootmem
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index dd874a1ee862..59c860295618 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -188,11 +188,11 @@ static __always_inline size_t kasan_metadata_size(struct kmem_cache *cache)
 	return 0;
 }
 
-void __kasan_poison_slab(struct page *page);
-static __always_inline void kasan_poison_slab(struct page *page)
+void __kasan_poison_slab(struct slab *slab);
+static __always_inline void kasan_poison_slab(struct slab *slab)
 {
 	if (kasan_enabled())
-		__kasan_poison_slab(page);
+		__kasan_poison_slab(slab);
 }
 
 void __kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
@@ -317,7 +317,7 @@ static inline void kasan_cache_create(struct kmem_cache *cache,
 					slab_flags_t *flags) {}
 static inline void kasan_cache_create_kmalloc(struct kmem_cache *cache) {}
 static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; }
-static inline void kasan_poison_slab(struct page *page) {}
+static inline void kasan_poison_slab(struct slab *slab) {}
 static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
 					void *object) {}
 static inline void kasan_poison_object_data(struct kmem_cache *cache,
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 562b27167c9e..1c0b3b95bdd7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -546,41 +546,39 @@ static inline bool folio_memcg_kmem(struct folio *folio)
 }
 
 /*
- * page_objcgs - get the object cgroups vector associated with a page
- * @page: a pointer to the page struct
+ * slab_objcgs - get the object cgroups vector associated with a slab
+ * @slab: a pointer to the slab struct
  *
- * Returns a pointer to the object cgroups vector associated with the page,
- * or NULL. This function assumes that the page is known to have an
- * associated object cgroups vector. It's not safe to call this function
- * against pages, which might have an associated memory cgroup: e.g.
- * kernel stack pages.
+ * Returns a pointer to the object cgroups vector associated with the slab,
+ * or NULL. This function assumes that the slab is known to have an
+ * associated object cgroups vector.
  */
-static inline struct obj_cgroup **page_objcgs(struct page *page)
+static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
 {
-	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+	unsigned long memcg_data = READ_ONCE(slab->memcg_data);
 
-	VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS), page);
-	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
+	VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS), &slab->page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, &slab->page);
 
 	return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
 
 /*
- * page_objcgs_check - get the object cgroups vector associated with a page
- * @page: a pointer to the page struct
+ * slab_objcgs_check - get the object cgroups vector associated with a slab
+ * @slab: a pointer to the slab struct
  *
- * Returns a pointer to the object cgroups vector associated with the page,
- * or NULL. This function is safe to use if the page can be directly associated
+ * Returns a pointer to the object cgroups vector associated with the slab,
+ * or NULL. This function is safe to use if the slab can be directly associated
  * with a memory cgroup.
  */
-static inline struct obj_cgroup **page_objcgs_check(struct page *page)
+static inline struct obj_cgroup **slab_objcgs_check(struct slab *slab)
 {
-	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+	unsigned long memcg_data = READ_ONCE(slab->memcg_data);
 
 	if (!memcg_data || !(memcg_data & MEMCG_DATA_OBJCGS))
 		return NULL;
 
-	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, &slab->page);
 
 	return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
@@ -591,12 +589,12 @@ static inline bool folio_memcg_kmem(struct folio *folio)
 	return false;
 }
 
-static inline struct obj_cgroup **page_objcgs(struct page *page)
+static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
 {
 	return NULL;
 }
 
-static inline struct obj_cgroup **page_objcgs_check(struct page *page)
+static inline struct obj_cgroup **slab_objcgs_check(struct slab *slab)
 {
 	return NULL;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1066afc9a06d..6db4d64ebe6d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -109,33 +109,6 @@ struct page {
 			 */
 			unsigned long dma_addr[2];
 		};
-		struct {	/* slab, slob and slub */
-			union {
-				struct list_head slab_list;
-				struct {	/* Partial pages */
-					struct page *next;
-#ifdef CONFIG_64BIT
-					int pages;	/* Nr of pages left */
-					int pobjects;	/* Approximate count */
-#else
-					short int pages;
-					short int pobjects;
-#endif
-				};
-			};
-			struct kmem_cache *slab_cache; /* not slob */
-			/* Double-word boundary */
-			void *freelist;		/* first free object */
-			union {
-				void *s_mem;	/* slab: first object */
-				unsigned long counters;		/* SLUB */
-				struct {			/* SLUB */
-					unsigned inuse:16;
-					unsigned objects:15;
-					unsigned frozen:1;
-				};
-			};
-		};
 		struct {	/* Tail pages of compound page */
 			unsigned long compound_head;	/* Bit zero is set */
@@ -199,9 +172,6 @@ struct page {
 	 * which are currently stored here.
 */
 	unsigned int page_type;
-
-	unsigned int active;		/* SLAB */
-	int units;			/* SLOB */
 };
 /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
@@ -231,6 +201,59 @@ struct page {
 #endif
 } _struct_page_alignment;
+struct slab {
+	union {
+		struct {
+			unsigned long flags;
+			union {
+				struct list_head slab_list;
+				struct {	/* Partial pages */
+					struct slab *next;
+#ifdef CONFIG_64BIT
+					int slabs;	/* Nr of slabs left */
+					int pobjects;	/* Approximate count */
+#else
+					short int slabs;
+					short int pobjects;
+#endif
+				};
+			};
+			struct kmem_cache *slab_cache; /* not slob */
+			/* Double-word boundary */
+			void *freelist;		/* first free object */
+			union {
+				void *s_mem;	/* slab: first object */
+				unsigned long counters;	/* SLUB */
+				struct {	/* SLUB */
+					unsigned inuse:16;
+					unsigned objects:15;
+					unsigned frozen:1;
+				};
+			};
+
+			union {
+				unsigned int active;	/* SLAB */
+				int units;		/* SLOB */
+			};
+			atomic_t _refcount;
+#ifdef CONFIG_MEMCG
+			unsigned long memcg_data;
+#endif
+		};
+		struct page page;
+	};
+};
+
+#define SLAB_MATCH(pg, sl) \
+	static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
+SLAB_MATCH(flags, flags);
+SLAB_MATCH(compound_head, slab_list);
+SLAB_MATCH(_refcount, _refcount);
+#ifdef CONFIG_MEMCG
+SLAB_MATCH(memcg_data, memcg_data);
+#endif
+#undef SLAB_MATCH
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b48bc214fe89..a21d14fec973 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -167,6 +167,8 @@ enum pageflags {
 	/* Remapped by swiotlb-xen. */
 	PG_xen_remapped = PG_owner_priv_1,
+	/* SLAB / SLUB / SLOB */
+	PG_pfmemalloc = PG_active,
 	/* SLOB */
 	PG_slob_free = PG_private,
@@ -193,6 +195,25 @@ static inline unsigned long _compound_head(const struct page *page)
 #define compound_head(page)	((typeof(page))_compound_head(page))
+/**
+ * page_slab - Converts from page to slab.
+ * @p: The page. + * + * This function cannot be called on a NULL pointer. It can be called + * on a non-slab page; the caller should check is_slab() to be sure + * that the slab really is a slab. + * + * Return: The slab which contains this page. + */ +#define page_slab(p) (_Generic((p), \ + const struct page *: (const struct slab *)_compound_head(p), \ + struct page *: (struct slab *)_compound_head(p))) + +static inline bool is_slab(struct slab *slab) +{ + return test_bit(PG_slab, &slab->flags); +} + /** * page_folio - Converts from page to folio. * @p: The page. @@ -921,34 +942,6 @@ extern bool is_free_buddy_page(struct page *page); __PAGEFLAG(Isolated, isolated, PF_ANY); -/* - * If network-based swap is enabled, sl*b must keep track of whether pages - * were allocated from pfmemalloc reserves. - */ -static inline int PageSlabPfmemalloc(struct page *page) -{ - VM_BUG_ON_PAGE(!PageSlab(page), page); - return PageActive(page); -} - -static inline void SetPageSlabPfmemalloc(struct page *page) -{ - VM_BUG_ON_PAGE(!PageSlab(page), page); - SetPageActive(page); -} - -static inline void __ClearPageSlabPfmemalloc(struct page *page) -{ - VM_BUG_ON_PAGE(!PageSlab(page), page); - __ClearPageActive(page); -} - -static inline void ClearPageSlabPfmemalloc(struct page *page) -{ - VM_BUG_ON_PAGE(!PageSlab(page), page); - ClearPageActive(page); -} - #ifdef CONFIG_MMU #define __PG_MLOCKED (1UL << PG_mlocked) #else diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index 3aa5e1e73ab6..f1bfcb10f5e0 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -87,11 +87,11 @@ struct kmem_cache { struct kmem_cache_node *node[MAX_NUMNODES]; }; -static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, +static inline void *nearest_obj(struct kmem_cache *cache, struct slab *slab, void *x) { - void *object = x - (x - page->s_mem) % cache->size; - void *last_object = page->s_mem + (cache->num - 1) * cache->size; + void *object = x - (x - 
slab->s_mem) % cache->size; + void *last_object = slab->s_mem + (cache->num - 1) * cache->size; if (unlikely(object > last_object)) return last_object; @@ -106,16 +106,16 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, * reciprocal_divide(offset, cache->reciprocal_buffer_size) */ static inline unsigned int obj_to_index(const struct kmem_cache *cache, - const struct page *page, void *obj) + const struct slab *slab, void *obj) { - u32 offset = (obj - page->s_mem); + u32 offset = (obj - slab->s_mem); return reciprocal_divide(offset, cache->reciprocal_buffer_size); } -static inline int objs_per_slab_page(const struct kmem_cache *cache, - const struct page *page) +static inline int objs_per_slab(const struct kmem_cache *cache, + const struct slab *slab) { - if (is_kfence_address(page_address(page))) + if (is_kfence_address(slab_address(slab))) return 1; return cache->num; } diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index dcde82a4434c..7394c959dc5f 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -43,9 +43,9 @@ enum stat_item { struct kmem_cache_cpu { void **freelist; /* Pointer to next available object */ unsigned long tid; /* Globally unique transaction id */ - struct page *page; /* The slab from which we are allocating */ + struct slab *slab; /* The slab from which we are allocating */ #ifdef CONFIG_SLUB_CPU_PARTIAL - struct page *partial; /* Partially allocated frozen slabs */ + struct slab *partial; /* Partially allocated frozen slabs */ #endif #ifdef CONFIG_SLUB_STATS unsigned stat[NR_SLUB_STAT_ITEMS]; @@ -159,16 +159,16 @@ static inline void sysfs_slab_release(struct kmem_cache *s) } #endif -void object_err(struct kmem_cache *s, struct page *page, +void object_err(struct kmem_cache *s, struct slab *slab, u8 *object, char *reason); void *fixup_red_left(struct kmem_cache *s, void *p); -static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, +static inline void 
*nearest_obj(struct kmem_cache *cache, struct slab *slab, void *x) { - void *object = x - (x - page_address(page)) % cache->size; - void *last_object = page_address(page) + - (page->objects - 1) * cache->size; + void *object = x - (x - slab_address(slab)) % cache->size; + void *last_object = slab_address(slab) + + (slab->objects - 1) * cache->size; void *result = (unlikely(object > last_object)) ? last_object : object; result = fixup_red_left(cache, result); @@ -184,16 +184,16 @@ static inline unsigned int __obj_to_index(const struct kmem_cache *cache, } static inline unsigned int obj_to_index(const struct kmem_cache *cache, - const struct page *page, void *obj) + const struct slab *slab, void *obj) { if (is_kfence_address(obj)) return 0; - return __obj_to_index(cache, page_address(page), obj); + return __obj_to_index(cache, slab_address(slab), obj); } -static inline int objs_per_slab_page(const struct kmem_cache *cache, - const struct page *page) +static inline int objs_per_slab(const struct kmem_cache *cache, + const struct slab *slab) { - return page->objects; + return slab->objects; } #endif /* _LINUX_SLUB_DEF_H */ diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c index 5b152dba7344..cf8f62c59b0a 100644 --- a/mm/bootmem_info.c +++ b/mm/bootmem_info.c @@ -15,7 +15,7 @@ void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) { - page->freelist = (void *)type; + page->index = type; SetPagePrivate(page); set_page_private(page, info); page_ref_inc(page); @@ -23,14 +23,13 @@ void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) void put_page_bootmem(struct page *page) { - unsigned long type; + unsigned long type = page->index; - type = (unsigned long) page->freelist; BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE); if (page_ref_dec_return(page) == 1) { - page->freelist = NULL; + page->index = 0; ClearPagePrivate(page); set_page_private(page, 0); INIT_LIST_HEAD(&page->lru); 
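The mm_types.h hunk above introduces `struct slab` as a type that overlays `struct page`, with `SLAB_MATCH()` static asserts pinning the shared fields (`flags`, `slab_list` over `compound_head`, `_refcount`, `memcg_data`) to identical offsets. A minimal userspace sketch of that overlay-plus-assert pattern — toy struct names, not the kernel's actual layout:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real struct page / struct slab are far larger. */
struct toy_page {
	unsigned long flags;
	void *compound_head;
	int _refcount;
};

struct toy_slab {
	union {
		struct {
			unsigned long flags;	/* must overlay toy_page.flags */
			void *slab_list;	/* must overlay compound_head */
			int _refcount;		/* must overlay _refcount */
		};
		struct toy_page page;	/* same memory, viewed as a page */
	};
};

/* The SLAB_MATCH pattern: break the build if an overlay field drifts. */
#define SLAB_MATCH(pg, sl) \
	static_assert(offsetof(struct toy_page, pg) == \
		      offsetof(struct toy_slab, sl), "offset mismatch")
SLAB_MATCH(flags, flags);
SLAB_MATCH(compound_head, slab_list);
SLAB_MATCH(_refcount, _refcount);
#undef SLAB_MATCH
```

Because the check is a compile-time `static_assert`, a mismatched overlay fails the build rather than corrupting page state at runtime; a write through one view of the union is visible through the other.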
diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 2baf121fb8c5..a8b9a7822b9f 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -247,8 +247,9 @@ struct kasan_free_meta *kasan_get_free_meta(struct kmem_cache *cache, } #endif -void __kasan_poison_slab(struct page *page) +void __kasan_poison_slab(struct slab *slab) { + struct page *page = &slab->page; unsigned long i; for (i = 0; i < compound_nr(page); i++) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c954fda9d7f4..c21b9a63fb4a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2842,16 +2842,16 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg) */ #define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT) -int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, +int memcg_alloc_slab_obj_cgroups(struct slab *slab, struct kmem_cache *s, gfp_t gfp, bool new_page) { - unsigned int objects = objs_per_slab_page(s, page); + unsigned int objects = objs_per_slab(s, slab); unsigned long memcg_data; void *vec; gfp &= ~OBJCGS_CLEAR_MASK; vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp, - page_to_nid(page)); + slab_nid(slab)); if (!vec) return -ENOMEM; @@ -2862,8 +2862,8 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, * it's memcg_data, no synchronization is required and * memcg_data can be simply assigned. */ - page->memcg_data = memcg_data; - } else if (cmpxchg(&page->memcg_data, 0, memcg_data)) { + slab->memcg_data = memcg_data; + } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) { /* * If the slab page is already in use, somebody can allocate * and assign obj_cgroups in parallel. 
In this case the existing @@ -2891,38 +2891,39 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, */ struct mem_cgroup *mem_cgroup_from_obj(void *p) { - struct page *page; + struct slab *slab; if (mem_cgroup_disabled()) return NULL; - page = virt_to_head_page(p); + slab = virt_to_slab(p); /* * Slab objects are accounted individually, not per-page. * Memcg membership data for each individual object is saved in - * the page->obj_cgroups. + * the slab->obj_cgroups. */ - if (page_objcgs_check(page)) { + if (slab_objcgs_check(slab)) { struct obj_cgroup *objcg; unsigned int off; - off = obj_to_index(page->slab_cache, page, p); - objcg = page_objcgs(page)[off]; + off = obj_to_index(slab->slab_cache, slab, p); + objcg = slab_objcgs(slab)[off]; if (objcg) return obj_cgroup_memcg(objcg); return NULL; } + /* I am pretty sure this is wrong */ /* - * page_memcg_check() is used here, because page_has_obj_cgroups() + * page_memcg_check() is used here, because slab_has_obj_cgroups() * check above could fail because the object cgroups vector wasn't set * at that moment, but it can be set concurrently. - * page_memcg_check(page) will guarantee that a proper memory + * page_memcg_check() will guarantee that a proper memory * cgroup pointer or NULL will be returned. 
*/ - return page_memcg_check(page); + return page_memcg_check(&slab->page); } __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) diff --git a/mm/slab.h b/mm/slab.h index f997fd5e42c8..1c6311fd7060 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -5,6 +5,69 @@ * Internal slab definitions */ +static inline void *slab_address(const struct slab *slab) +{ + return page_address(&slab->page); +} + +static inline struct pglist_data *slab_pgdat(const struct slab *slab) +{ + return page_pgdat(&slab->page); +} + +static inline int slab_nid(const struct slab *slab) +{ + return page_to_nid(&slab->page); +} + +static inline struct slab *virt_to_slab(const void *addr) +{ + struct page *page = virt_to_page(addr); + + return page_slab(page); +} + +static inline bool SlabMulti(const struct slab *slab) +{ + return test_bit(PG_head, &slab->flags); +} + +static inline int slab_order(const struct slab *slab) +{ + if (!SlabMulti(slab)) + return 0; + return (&slab->page)[1].compound_order; +} + +static inline size_t slab_size(const struct slab *slab) +{ + return PAGE_SIZE << slab_order(slab); +} + +/* + * If network-based swap is enabled, sl*b must keep track of whether pages + * were allocated from pfmemalloc reserves. 
+ */ +static inline bool SlabPfmemalloc(const struct slab *slab) +{ + return test_bit(PG_pfmemalloc, &slab->flags); +} + +static inline void SetSlabPfmemalloc(struct slab *slab) +{ + set_bit(PG_pfmemalloc, &slab->flags); +} + +static inline void __ClearSlabPfmemalloc(struct slab *slab) +{ + __clear_bit(PG_pfmemalloc, &slab->flags); +} + +static inline void ClearSlabPfmemalloc(struct slab *slab) +{ + clear_bit(PG_pfmemalloc, &slab->flags); +} + #ifdef CONFIG_SLOB /* * Common fields provided in kmem_cache by all slab allocators @@ -245,15 +308,15 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla } #ifdef CONFIG_MEMCG_KMEM -int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, +int memcg_alloc_slab_obj_cgroups(struct slab *slab, struct kmem_cache *s, gfp_t gfp, bool new_page); void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr); -static inline void memcg_free_page_obj_cgroups(struct page *page) +static inline void memcg_free_slab_obj_cgroups(struct slab *slab) { - kfree(page_objcgs(page)); - page->memcg_data = 0; + kfree(slab_objcgs(slab)); + slab->memcg_data = 0; } static inline size_t obj_full_size(struct kmem_cache *s) @@ -298,7 +361,7 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, size_t size, void **p) { - struct page *page; + struct slab *slab; unsigned long off; size_t i; @@ -307,19 +370,19 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, for (i = 0; i < size; i++) { if (likely(p[i])) { - page = virt_to_head_page(p[i]); + slab = virt_to_slab(p[i]); - if (!page_objcgs(page) && - memcg_alloc_page_obj_cgroups(page, s, flags, + if (!slab_objcgs(slab) && + memcg_alloc_slab_obj_cgroups(slab, s, flags, false)) { obj_cgroup_uncharge(objcg, obj_full_size(s)); continue; } - off = obj_to_index(s, page, p[i]); + off = obj_to_index(s, slab, p[i]); obj_cgroup_get(objcg); - page_objcgs(page)[off] = objcg; - 
mod_objcg_state(objcg, page_pgdat(page), + slab_objcgs(slab)[off] = objcg; + mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), obj_full_size(s)); } else { obj_cgroup_uncharge(objcg, obj_full_size(s)); @@ -334,7 +397,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig, struct kmem_cache *s; struct obj_cgroup **objcgs; struct obj_cgroup *objcg; - struct page *page; + struct slab *slab; unsigned int off; int i; @@ -345,24 +408,24 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig, if (unlikely(!p[i])) continue; - page = virt_to_head_page(p[i]); - objcgs = page_objcgs(page); + slab = virt_to_slab(p[i]); + objcgs = slab_objcgs(slab); if (!objcgs) continue; if (!s_orig) - s = page->slab_cache; + s = slab->slab_cache; else s = s_orig; - off = obj_to_index(s, page, p[i]); + off = obj_to_index(s, slab, p[i]); objcg = objcgs[off]; if (!objcg) continue; objcgs[off] = NULL; obj_cgroup_uncharge(objcg, obj_full_size(s)); - mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s), + mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), -obj_full_size(s)); obj_cgroup_put(objcg); } @@ -374,14 +437,14 @@ static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr) return NULL; } -static inline int memcg_alloc_page_obj_cgroups(struct page *page, +static inline int memcg_alloc_slab_obj_cgroups(struct slab *slab, struct kmem_cache *s, gfp_t gfp, bool new_page) { return 0; } -static inline void memcg_free_page_obj_cgroups(struct page *page) +static inline void memcg_free_slab_obj_cgroups(struct slab *slab) { } @@ -407,33 +470,33 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, static inline struct kmem_cache *virt_to_cache(const void *obj) { - struct page *page; + struct slab *slab; - page = virt_to_head_page(obj); - if (WARN_ONCE(!PageSlab(page), "%s: Object is not a Slab page!\n", + slab = virt_to_slab(obj); + if (WARN_ONCE(!is_slab(slab), "%s: Object is not a Slab page!\n", __func__)) return NULL; - 
return page->slab_cache; + return slab->slab_cache; } -static __always_inline void account_slab_page(struct page *page, int order, +static __always_inline void account_slab(struct slab *slab, int order, struct kmem_cache *s, gfp_t gfp) { if (memcg_kmem_enabled() && (s->flags & SLAB_ACCOUNT)) - memcg_alloc_page_obj_cgroups(page, s, gfp, true); + memcg_alloc_slab_obj_cgroups(slab, s, gfp, true); - mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), + mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s), PAGE_SIZE << order); } -static __always_inline void unaccount_slab_page(struct page *page, int order, +static __always_inline void unaccount_slab(struct slab *slab, int order, struct kmem_cache *s) { if (memcg_kmem_enabled()) - memcg_free_page_obj_cgroups(page); + memcg_free_slab_obj_cgroups(slab); - mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), + mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s), -(PAGE_SIZE << order)); } @@ -635,7 +698,7 @@ static inline void debugfs_slab_release(struct kmem_cache *s) { } #define KS_ADDRS_COUNT 16 struct kmem_obj_info { void *kp_ptr; - struct page *kp_page; + struct slab *kp_slab; void *kp_objp; unsigned long kp_data_offset; struct kmem_cache *kp_slab_cache; @@ -643,7 +706,7 @@ struct kmem_obj_info { void *kp_stack[KS_ADDRS_COUNT]; void *kp_free_stack[KS_ADDRS_COUNT]; }; -void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page); +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab); #endif #endif /* MM_SLAB_H */ diff --git a/mm/slab_common.c b/mm/slab_common.c index 1c673c323baf..d0d843cb7cf1 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -585,18 +585,18 @@ void kmem_dump_obj(void *object) { char *cp = IS_ENABLED(CONFIG_MMU) ? 
"" : "/vmalloc"; int i; - struct page *page; + struct slab *slab; unsigned long ptroffset; struct kmem_obj_info kp = { }; if (WARN_ON_ONCE(!virt_addr_valid(object))) return; - page = virt_to_head_page(object); - if (WARN_ON_ONCE(!PageSlab(page))) { + slab = virt_to_slab(object); + if (WARN_ON_ONCE(!is_slab(slab))) { pr_cont(" non-slab memory.\n"); return; } - kmem_obj_info(&kp, object, page); + kmem_obj_info(&kp, object, slab); if (kp.kp_slab_cache) pr_cont(" slab%s %s", cp, kp.kp_slab_cache->name); else diff --git a/mm/slub.c b/mm/slub.c index 090fa14628f9..c3b84bd61400 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -47,7 +47,7 @@ * Lock order: * 1. slab_mutex (Global Mutex) * 2. node->list_lock - * 3. slab_lock(page) (Only on some arches and for debugging) + * 3. slab_lock(slab) (Only on some arches and for debugging) * * slab_mutex * @@ -56,17 +56,17 @@ * * The slab_lock is only used for debugging and on arches that do not * have the ability to do a cmpxchg_double. It only protects: - * A. page->freelist -> List of object free in a page - * B. page->inuse -> Number of objects in use - * C. page->objects -> Number of objects in page - * D. page->frozen -> frozen state + * A. slab->freelist -> List of object free in a slab + * B. slab->inuse -> Number of objects in use + * C. slab->objects -> Number of objects in slab + * D. slab->frozen -> frozen state * * If a slab is frozen then it is exempt from list management. It is not * on any list except per cpu partial list. The processor that froze the - * slab is the one who can perform list operations on the page. Other + * slab is the one who can perform list operations on the slab. Other * processors may put objects onto the freelist but the processor that * froze the slab is the only one that can retrieve the objects from the - * page's freelist. + * slab's freelist. * * The list_lock protects the partial and full list on each node and * the partial slab counter. 
If taken then no new slabs may be added or @@ -94,10 +94,10 @@ * cannot scan all objects. * * Slabs are freed when they become empty. Teardown and setup is - * minimal so we rely on the page allocators per cpu caches for + * minimal so we rely on the slab allocators per cpu caches for * fast frees and allocs. * - * page->frozen The slab is frozen and exempt from list processing. + * slab->frozen The slab is frozen and exempt from list processing. * This means that the slab is dedicated to a purpose * such as satisfying allocations for a specific * processor. Objects may be freed in the slab while @@ -192,7 +192,7 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s) #define OO_SHIFT 16 #define OO_MASK ((1 << OO_SHIFT) - 1) -#define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */ +#define MAX_OBJS_PER_PAGE 32767 /* since slab.objects is u15 */ /* Internal SLUB flags */ /* Poison object */ @@ -357,22 +357,20 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x) } /* - * Per slab locking using the pagelock + * Per slab locking using the slablock */ -static __always_inline void slab_lock(struct page *page) +static __always_inline void slab_lock(struct slab *slab) { - VM_BUG_ON_PAGE(PageTail(page), page); - bit_spin_lock(PG_locked, &page->flags); + bit_spin_lock(PG_locked, &slab->flags); } -static __always_inline void slab_unlock(struct page *page) +static __always_inline void slab_unlock(struct slab *slab) { - VM_BUG_ON_PAGE(PageTail(page), page); - __bit_spin_unlock(PG_locked, &page->flags); + __bit_spin_unlock(PG_locked, &slab->flags); } /* Interrupts must be disabled (for the fallback code to work right) */ -static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page, +static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab, void *freelist_old, unsigned long counters_old, void *freelist_new, unsigned long counters_new, const char *n) @@ -381,22 +379,22 @@ static inline bool 
__cmpxchg_double_slab(struct kmem_cache *s, struct page *page #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) if (s->flags & __CMPXCHG_DOUBLE) { - if (cmpxchg_double(&page->freelist, &page->counters, + if (cmpxchg_double(&slab->freelist, &slab->counters, freelist_old, counters_old, freelist_new, counters_new)) return true; } else #endif { - slab_lock(page); - if (page->freelist == freelist_old && - page->counters == counters_old) { - page->freelist = freelist_new; - page->counters = counters_new; - slab_unlock(page); + slab_lock(slab); + if (slab->freelist == freelist_old && + slab->counters == counters_old) { + slab->freelist = freelist_new; + slab->counters = counters_new; + slab_unlock(slab); return true; } - slab_unlock(page); + slab_unlock(slab); } cpu_relax(); @@ -409,7 +407,7 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page return false; } -static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, +static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab, void *freelist_old, unsigned long counters_old, void *freelist_new, unsigned long counters_new, const char *n) @@ -417,7 +415,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) if (s->flags & __CMPXCHG_DOUBLE) { - if (cmpxchg_double(&page->freelist, &page->counters, + if (cmpxchg_double(&slab->freelist, &slab->counters, freelist_old, counters_old, freelist_new, counters_new)) return true; @@ -427,16 +425,16 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, unsigned long flags; local_irq_save(flags); - slab_lock(page); - if (page->freelist == freelist_old && - page->counters == counters_old) { - page->freelist = freelist_new; - page->counters = counters_new; - slab_unlock(page); + slab_lock(slab); + if (slab->freelist == freelist_old && + 
slab->counters == counters_old) { + slab->freelist = freelist_new; + slab->counters = counters_new; + slab_unlock(slab); local_irq_restore(flags); return true; } - slab_unlock(page); + slab_unlock(slab); local_irq_restore(flags); } @@ -475,24 +473,24 @@ static inline bool slab_add_kunit_errors(void) { return false; } #endif /* - * Determine a map of object in use on a page. + * Determine a map of object in use on a slab. * - * Node listlock must be held to guarantee that the page does + * Node listlock must be held to guarantee that the slab does * not vanish from under us. */ -static unsigned long *get_map(struct kmem_cache *s, struct page *page) +static unsigned long *get_map(struct kmem_cache *s, struct slab *slab) __acquires(&object_map_lock) { void *p; - void *addr = page_address(page); + void *addr = slab_address(slab); VM_BUG_ON(!irqs_disabled()); spin_lock(&object_map_lock); - bitmap_zero(object_map, page->objects); + bitmap_zero(object_map, slab->objects); - for (p = page->freelist; p; p = get_freepointer(s, p)) + for (p = slab->freelist; p; p = get_freepointer(s, p)) set_bit(__obj_to_index(s, addr, p), object_map); return object_map; @@ -552,19 +550,19 @@ static inline void metadata_access_disable(void) * Object debugging */ -/* Verify that a pointer has an address that is valid within a slab page */ +/* Verify that a pointer has an address that is valid within a slab */ static inline int check_valid_pointer(struct kmem_cache *s, - struct page *page, void *object) + struct slab *slab, void *object) { void *base; if (!object) return 1; - base = page_address(page); + base = slab_address(slab); object = kasan_reset_tag(object); object = restore_red_left(s, object); - if (object < base || object >= base + page->objects * s->size || + if (object < base || object >= base + slab->objects * s->size || (object - base) % s->size) { return 0; } @@ -675,11 +673,11 @@ void print_tracking(struct kmem_cache *s, void *object) print_track("Freed", get_track(s, object, 
 				TRACK_FREE), pr_time);
 }
 
-static void print_page_info(struct page *page)
+static void print_slab_info(struct slab *slab)
 {
 	pr_err("Slab 0x%p objects=%u used=%u fp=0x%p flags=%#lx(%pGp)\n",
-	       page, page->objects, page->inuse, page->freelist,
-	       page->flags, &page->flags);
+	       slab, slab->objects, slab->inuse, slab->freelist,
+	       slab->flags, &slab->flags);
 }
 
@@ -713,12 +711,12 @@ static void slab_fix(struct kmem_cache *s, char *fmt, ...)
 	va_end(args);
 }
 
-static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
+static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
 			       void **freelist, void *nextfree)
 {
 	if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
-	    !check_valid_pointer(s, page, nextfree) && freelist) {
-		object_err(s, page, *freelist, "Freechain corrupt");
+	    !check_valid_pointer(s, slab, nextfree) && freelist) {
+		object_err(s, slab, *freelist, "Freechain corrupt");
 		*freelist = NULL;
 		slab_fix(s, "Isolate corrupted freechain");
 		return true;
@@ -727,14 +725,14 @@ static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
 	return false;
 }
 
-static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
+static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
 {
 	unsigned int off;	/* Offset of last byte */
-	u8 *addr = page_address(page);
+	u8 *addr = slab_address(slab);
 
 	print_tracking(s, p);
 
-	print_page_info(page);
+	print_slab_info(slab);
 
 	pr_err("Object 0x%p @offset=%tu fp=0x%p\n\n",
 	       p, p - addr, get_freepointer(s, p));
@@ -766,18 +764,18 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 	dump_stack();
 }
 
-void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct slab *slab,
 		u8 *object, char *reason)
 {
 	if (slab_add_kunit_errors())
 		return;
 
 	slab_bug(s, "%s", reason);
-	print_trailer(s, page, object);
+	print_trailer(s, slab, object);
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
-static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
+static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
 			const char *fmt, ...)
 {
 	va_list args;
@@ -790,7 +788,7 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
 	slab_bug(s, "%s", buf);
-	print_page_info(page);
+	print_slab_info(slab);
 	dump_stack();
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
@@ -818,13 +816,13 @@ static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
 	memset(from, data, to - from);
 }
 
-static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
+static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
 				  u8 *object, char *what,
 				  u8 *start, unsigned int value, unsigned int bytes)
 {
 	u8 *fault;
 	u8 *end;
-	u8 *addr = page_address(page);
+	u8 *addr = slab_address(slab);
 
 	metadata_access_enable();
 	fault = memchr_inv(kasan_reset_tag(start), value, bytes);
@@ -843,7 +841,7 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
 	pr_err("0x%p-0x%p @offset=%tu. First byte 0x%x instead of 0x%x\n",
 					fault, end - 1, fault - addr,
 					fault[0], value);
-	print_trailer(s, page, object);
+	print_trailer(s, slab, object);
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 
 skip_bug_print:
@@ -889,7 +887,7 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
  * may be used with merged slabcaches.
  */
-static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p)
+static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
 {
 	unsigned long off = get_info_end(s);	/* The end of info */
@@ -902,12 +900,12 @@ static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p)
 	if (size_from_object(s) == off)
 		return 1;
 
-	return check_bytes_and_report(s, page, p, "Object padding",
+	return check_bytes_and_report(s, slab, p, "Object padding",
 			p + off, POISON_INUSE, size_from_object(s) - off);
 }
 
-/* Check the pad bytes at the end of a slab page */
-static int slab_pad_check(struct kmem_cache *s, struct page *page)
+/* Check the pad bytes at the end of a slab */
+static int slab_pad_check(struct kmem_cache *s, struct slab *slab)
 {
 	u8 *start;
 	u8 *fault;
@@ -919,8 +917,8 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
 	if (!(s->flags & SLAB_POISON))
 		return 1;
 
-	start = page_address(page);
-	length = page_size(page);
+	start = slab_address(slab);
+	length = slab_size(slab);
 	end = start + length;
 	remainder = length % s->size;
 	if (!remainder)
@@ -935,7 +933,7 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
 	while (end > fault && end[-1] == POISON_INUSE)
 		end--;
 
-	slab_err(s, page, "Padding overwritten. 0x%p-0x%p @offset=%tu",
+	slab_err(s, slab, "Padding overwritten. 0x%p-0x%p @offset=%tu",
 			fault, end - 1, fault - start);
 	print_section(KERN_ERR, "Padding ", pad, remainder);
@@ -943,23 +941,23 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
 	return 0;
 }
 
-static int check_object(struct kmem_cache *s, struct page *page,
+static int check_object(struct kmem_cache *s, struct slab *slab,
 			void *object, u8 val)
 {
 	u8 *p = object;
 	u8 *endobject = object + s->object_size;
 
 	if (s->flags & SLAB_RED_ZONE) {
-		if (!check_bytes_and_report(s, page, object, "Left Redzone",
+		if (!check_bytes_and_report(s, slab, object, "Left Redzone",
 			object - s->red_left_pad, val, s->red_left_pad))
 			return 0;
 
-		if (!check_bytes_and_report(s, page, object, "Right Redzone",
+		if (!check_bytes_and_report(s, slab, object, "Right Redzone",
 			endobject, val, s->inuse - s->object_size))
 			return 0;
 	} else {
 		if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
-			check_bytes_and_report(s, page, p, "Alignment padding",
+			check_bytes_and_report(s, slab, p, "Alignment padding",
 				endobject, POISON_INUSE,
 				s->inuse - s->object_size);
 		}
@@ -967,15 +965,15 @@ static int check_object(struct kmem_cache *s, struct page *page,
 	if (s->flags & SLAB_POISON) {
 		if (val != SLUB_RED_ACTIVE && (s->flags & __OBJECT_POISON) &&
-			(!check_bytes_and_report(s, page, p, "Poison", p,
+			(!check_bytes_and_report(s, slab, p, "Poison", p,
 					POISON_FREE, s->object_size - 1) ||
-			 !check_bytes_and_report(s, page, p, "End Poison",
+			 !check_bytes_and_report(s, slab, p, "End Poison",
 				p + s->object_size - 1, POISON_END, 1)))
 			return 0;
 		/*
 		 * check_pad_bytes cleans up on its own.
 		 */
-		check_pad_bytes(s, page, p);
+		check_pad_bytes(s, slab, p);
 	}
 
 	if (!freeptr_outside_object(s) && val == SLUB_RED_ACTIVE)
@@ -986,8 +984,8 @@ static int check_object(struct kmem_cache *s, struct page *page,
 		return 1;
 
 	/* Check free pointer validity */
-	if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
-		object_err(s, page, p, "Freepointer corrupt");
+	if (!check_valid_pointer(s, slab, get_freepointer(s, p))) {
+		object_err(s, slab, p, "Freepointer corrupt");
 		/*
 		 * No choice but to zap it and thus lose the remainder
 		 * of the free objects in this slab. May cause
@@ -999,57 +997,57 @@ static int check_object(struct kmem_cache *s, struct page *page,
 	return 1;
 }
 
-static int check_slab(struct kmem_cache *s, struct page *page)
+static int check_slab(struct kmem_cache *s, struct slab *slab)
 {
 	int maxobj;
 
 	VM_BUG_ON(!irqs_disabled());
 
-	if (!PageSlab(page)) {
-		slab_err(s, page, "Not a valid slab page");
+	if (!is_slab(slab)) {
+		slab_err(s, slab, "Not a valid slab");
 		return 0;
 	}
 
-	maxobj = order_objects(compound_order(page), s->size);
-	if (page->objects > maxobj) {
-		slab_err(s, page, "objects %u > max %u",
-			page->objects, maxobj);
+	maxobj = order_objects(slab_order(slab), s->size);
+	if (slab->objects > maxobj) {
+		slab_err(s, slab, "objects %u > max %u",
+			slab->objects, maxobj);
 		return 0;
 	}
-	if (page->inuse > page->objects) {
-		slab_err(s, page, "inuse %u > max %u",
-			page->inuse, page->objects);
+	if (slab->inuse > slab->objects) {
+		slab_err(s, slab, "inuse %u > max %u",
+			slab->inuse, slab->objects);
 		return 0;
 	}
 	/* Slab_pad_check fixes things up after itself */
-	slab_pad_check(s, page);
+	slab_pad_check(s, slab);
 	return 1;
 }
 
 /*
- * Determine if a certain object on a page is on the freelist. Must hold the
+ * Determine if a certain object on a slab is on the freelist. Must hold the
  * slab lock to guarantee that the chains are in a consistent state.
  */
-static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
+static int on_freelist(struct kmem_cache *s, struct slab *slab, void *search)
 {
 	int nr = 0;
 	void *fp;
 	void *object = NULL;
 	int max_objects;
 
-	fp = page->freelist;
-	while (fp && nr <= page->objects) {
+	fp = slab->freelist;
+	while (fp && nr <= slab->objects) {
 		if (fp == search)
 			return 1;
-		if (!check_valid_pointer(s, page, fp)) {
+		if (!check_valid_pointer(s, slab, fp)) {
 			if (object) {
-				object_err(s, page, object,
+				object_err(s, slab, object,
 					"Freechain corrupt");
 				set_freepointer(s, object, NULL);
 			} else {
-				slab_err(s, page, "Freepointer corrupt");
-				page->freelist = NULL;
-				page->inuse = page->objects;
+				slab_err(s, slab, "Freepointer corrupt");
+				slab->freelist = NULL;
+				slab->inuse = slab->objects;
 				slab_fix(s, "Freelist cleared");
 				return 0;
 			}
@@ -1060,34 +1058,34 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
 		nr++;
 	}
 
-	max_objects = order_objects(compound_order(page), s->size);
+	max_objects = order_objects(slab_order(slab), s->size);
 	if (max_objects > MAX_OBJS_PER_PAGE)
 		max_objects = MAX_OBJS_PER_PAGE;
 
-	if (page->objects != max_objects) {
-		slab_err(s, page, "Wrong number of objects. Found %d but should be %d",
-			page->objects, max_objects);
-		page->objects = max_objects;
+	if (slab->objects != max_objects) {
+		slab_err(s, slab, "Wrong number of objects. Found %d but should be %d",
+			slab->objects, max_objects);
+		slab->objects = max_objects;
 		slab_fix(s, "Number of objects adjusted");
 	}
-	if (page->inuse != page->objects - nr) {
-		slab_err(s, page, "Wrong object count. Counter is %d but counted were %d",
-			page->inuse, page->objects - nr);
-		page->inuse = page->objects - nr;
+	if (slab->inuse != slab->objects - nr) {
+		slab_err(s, slab, "Wrong object count. Counter is %d but counted were %d",
+			slab->inuse, slab->objects - nr);
+		slab->inuse = slab->objects - nr;
 		slab_fix(s, "Object count adjusted");
 	}
 	return search == NULL;
 }
 
-static void trace(struct kmem_cache *s, struct page *page, void *object,
+static void trace(struct kmem_cache *s, struct slab *slab, void *object,
 								int alloc)
 {
 	if (s->flags & SLAB_TRACE) {
 		pr_info("TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
 			s->name,
 			alloc ? "alloc" : "free",
-			object, page->inuse,
-			page->freelist);
+			object, slab->inuse,
+			slab->freelist);
 
 		if (!alloc)
 			print_section(KERN_INFO, "Object ", (void *)object,
@@ -1101,22 +1099,22 @@ static void trace(struct kmem_cache *s, struct page *page, void *object,
  * Tracking of fully allocated slabs for debugging purposes.
  */
 static void add_full(struct kmem_cache *s,
-	struct kmem_cache_node *n, struct page *page)
+	struct kmem_cache_node *n, struct slab *slab)
 {
 	if (!(s->flags & SLAB_STORE_USER))
 		return;
 
 	lockdep_assert_held(&n->list_lock);
-	list_add(&page->slab_list, &n->full);
+	list_add(&slab->slab_list, &n->full);
 }
 
-static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page)
+static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct slab *slab)
 {
 	if (!(s->flags & SLAB_STORE_USER))
 		return;
 
 	lockdep_assert_held(&n->list_lock);
-	list_del(&page->slab_list);
+	list_del(&slab->slab_list);
 }
 
 /* Tracking of the number of slabs for debugging purposes */
@@ -1156,7 +1154,7 @@ static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects)
 }
 
 /* Object debug checks for alloc/free paths */
-static void setup_object_debug(struct kmem_cache *s, struct page *page,
+static void setup_object_debug(struct kmem_cache *s, struct slab *slab,
 								void *object)
 {
 	if (!kmem_cache_debug_flags(s, SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON))
@@ -1167,90 +1165,90 @@ static void setup_object_debug(struct kmem_cache *s, struct page *page,
 }
 
 static
-void setup_page_debug(struct kmem_cache *s, struct page *page, void *addr)
+void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr)
 {
 	if (!kmem_cache_debug_flags(s, SLAB_POISON))
 		return;
 
 	metadata_access_enable();
-	memset(kasan_reset_tag(addr), POISON_INUSE, page_size(page));
+	memset(kasan_reset_tag(addr), POISON_INUSE, slab_size(slab));
 	metadata_access_disable();
 }
 
 static inline int alloc_consistency_checks(struct kmem_cache *s,
-					struct page *page, void *object)
+					struct slab *slab, void *object)
 {
-	if (!check_slab(s, page))
+	if (!check_slab(s, slab))
 		return 0;
 
-	if (!check_valid_pointer(s, page, object)) {
-		object_err(s, page, object, "Freelist Pointer check fails");
+	if (!check_valid_pointer(s, slab, object)) {
+		object_err(s, slab, object, "Freelist Pointer check fails");
 		return 0;
 	}
 
-	if (!check_object(s, page, object, SLUB_RED_INACTIVE))
+	if (!check_object(s, slab, object, SLUB_RED_INACTIVE))
 		return 0;
 
 	return 1;
 }
 
 static noinline int alloc_debug_processing(struct kmem_cache *s,
-					struct page *page,
+					struct slab *slab,
 					void *object, unsigned long addr)
 {
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-		if (!alloc_consistency_checks(s, page, object))
+		if (!alloc_consistency_checks(s, slab, object))
 			goto bad;
 	}
 
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
-	trace(s, page, object, 1);
+	trace(s, slab, object, 1);
 	init_object(s, object, SLUB_RED_ACTIVE);
 	return 1;
 
 bad:
-	if (PageSlab(page)) {
+	if (is_slab(slab)) {
 		/*
-		 * If this is a slab page then lets do the best we can
+		 * If this is a slab then let's do the best we can
 		 * to avoid issues in the future. Marking all objects
 		 * as used avoids touching the remaining objects.
 		 */
 		slab_fix(s, "Marking all objects used");
-		page->inuse = page->objects;
-		page->freelist = NULL;
+		slab->inuse = slab->objects;
+		slab->freelist = NULL;
 	}
 	return 0;
 }
 
 static inline int free_consistency_checks(struct kmem_cache *s,
-		struct page *page, void *object, unsigned long addr)
+		struct slab *slab, void *object, unsigned long addr)
 {
-	if (!check_valid_pointer(s, page, object)) {
-		slab_err(s, page, "Invalid object pointer 0x%p", object);
+	if (!check_valid_pointer(s, slab, object)) {
+		slab_err(s, slab, "Invalid object pointer 0x%p", object);
 		return 0;
 	}
 
-	if (on_freelist(s, page, object)) {
-		object_err(s, page, object, "Object already free");
+	if (on_freelist(s, slab, object)) {
+		object_err(s, slab, object, "Object already free");
 		return 0;
 	}
 
-	if (!check_object(s, page, object, SLUB_RED_ACTIVE))
+	if (!check_object(s, slab, object, SLUB_RED_ACTIVE))
 		return 0;
 
-	if (unlikely(s != page->slab_cache)) {
-		if (!PageSlab(page)) {
-			slab_err(s, page, "Attempt to free object(0x%p) outside of slab",
+	if (unlikely(s != slab->slab_cache)) {
+		if (!is_slab(slab)) {
+			slab_err(s, slab, "Attempt to free object(0x%p) outside of slab",
 				 object);
-		} else if (!page->slab_cache) {
+		} else if (!slab->slab_cache) {
 			pr_err("SLUB <none>: no slab for object 0x%p.\n",
 			       object);
 			dump_stack();
 		} else
-			object_err(s, page, object,
-					"page slab pointer corrupt.");
+			object_err(s, slab, object,
+					"slab pointer corrupt.");
 		return 0;
	}
	return 1;
 }
@@ -1258,21 +1256,21 @@ static inline int free_consistency_checks(struct kmem_cache *s,
 
 /* Supports checking bulk free of a constructed freelist */
 static noinline int free_debug_processing(
-	struct kmem_cache *s, struct page *page,
+	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,
 	unsigned long addr)
 {
-	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
 	void *object = head;
 	int cnt = 0;
 	unsigned long flags;
 	int ret = 0;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	slab_lock(page);
+	slab_lock(slab);
 
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-		if (!check_slab(s, page))
+		if (!check_slab(s, slab))
 			goto out;
 	}
 
@@ -1280,13 +1278,13 @@ static noinline int free_debug_processing(
 	cnt++;
 
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-		if (!free_consistency_checks(s, page, object, addr))
+		if (!free_consistency_checks(s, slab, object, addr))
 			goto out;
 	}
 
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
-	trace(s, page, object, 0);
+	trace(s, slab, object, 0);
 	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
 	init_object(s, object, SLUB_RED_INACTIVE);
 
@@ -1299,10 +1297,10 @@ static noinline int free_debug_processing(
 
 out:
 	if (cnt != bulk_cnt)
-		slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
+		slab_err(s, slab, "Bulk freelist count(%d) invalid(%d)\n",
 			 bulk_cnt, cnt);
 
-	slab_unlock(page);
+	slab_unlock(slab);
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	if (!ret)
 		slab_fix(s, "Object at 0x%p not freed", object);
@@ -1514,26 +1512,26 @@ slab_flags_t kmem_cache_flags(unsigned int object_size,
 }
 #else /* !CONFIG_SLUB_DEBUG */
 static inline void setup_object_debug(struct kmem_cache *s,
-			struct page *page, void *object) {}
+			struct slab *slab, void *object) {}
 static inline
-void setup_page_debug(struct kmem_cache *s, struct page *page, void *addr) {}
+void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	struct page *page, void *object, unsigned long addr) { return 0; }
+	struct slab *slab, void *object, unsigned long addr) { return 0; }
 
 static inline int free_debug_processing(
-	struct kmem_cache *s, struct page *page,
+	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,
 	unsigned long addr) { return 0; }
 
-static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
+static inline int slab_pad_check(struct kmem_cache *s, struct slab *slab)
 			{ return 1; }
-static inline int check_object(struct kmem_cache *s, struct page *page,
+static inline int check_object(struct kmem_cache *s, struct slab *slab,
 			void *object, u8 val) { return 1; }
 static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
-					struct page *page) {}
+					struct slab *slab) {}
 static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n,
-					struct page *page) {}
+					struct slab *slab) {}
 slab_flags_t kmem_cache_flags(unsigned int object_size,
 	slab_flags_t flags, const char *name)
 {
@@ -1552,7 +1550,7 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node,
 static inline void dec_slabs_node(struct kmem_cache *s, int node,
 							int objects) {}
 
-static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
+static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
 			       void **freelist, void *nextfree)
 {
 	return false;
@@ -1662,10 +1660,10 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
 	return *head != NULL;
 }
 
-static void *setup_object(struct kmem_cache *s, struct page *page,
+static void *setup_object(struct kmem_cache *s, struct slab *slab,
 				void *object)
 {
-	setup_object_debug(s, page, object);
+	setup_object_debug(s, slab, object);
 	object = kasan_init_slab_obj(s, object);
 	if (unlikely(s->ctor)) {
 		kasan_unpoison_object_data(s, object);
@@ -1678,18 +1676,25 @@ static void *setup_object(struct kmem_cache *s, struct page *page,
 /*
  * Slab allocation and freeing
  */
-static inline struct page *alloc_slab_page(struct kmem_cache *s,
+static inline struct slab *alloc_slab(struct kmem_cache *s,
 		gfp_t flags, int node, struct kmem_cache_order_objects oo)
 {
 	struct page *page;
+	struct slab *slab;
 	unsigned int order = oo_order(oo);
 
 	if (node == NUMA_NO_NODE)
 		page = alloc_pages(flags, order);
 	else
 		page = __alloc_pages_node(node, flags, order);
+	if (!page)
+		return NULL;
 
-	return page;
+	__SetPageSlab(page);
+	slab = (struct slab *)page;
+	if (page_is_pfmemalloc(page))
+		SetSlabPfmemalloc(slab);
+	return slab;
 }
 
 #ifdef CONFIG_SLAB_FREELIST_RANDOM
@@ -1710,7 +1715,7 @@ static int init_cache_random_seq(struct kmem_cache *s)
 		return err;
 	}
 
-	/* Transform to an offset on the set of pages */
+	/* Transform to an offset on the set of slabs */
 	if (s->random_seq) {
 		unsigned int i;
 
@@ -1734,54 +1739,54 @@ static void __init init_freelist_randomization(void)
 }
 
 /* Get the next entry on the pre-computed freelist randomized */
-static void *next_freelist_entry(struct kmem_cache *s, struct page *page,
+static void *next_freelist_entry(struct kmem_cache *s, struct slab *slab,
 				unsigned long *pos, void *start,
-				unsigned long page_limit,
+				unsigned long slab_limit,
 				unsigned long freelist_count)
 {
 	unsigned int idx;
 
 	/*
-	 * If the target page allocation failed, the number of objects on the
-	 * page might be smaller than the usual size defined by the cache.
+	 * If the target slab allocation failed, the number of objects on the
+	 * slab might be smaller than the usual size defined by the cache.
 	 */
 	do {
 		idx = s->random_seq[*pos];
 		*pos += 1;
 		if (*pos >= freelist_count)
 			*pos = 0;
-	} while (unlikely(idx >= page_limit));
+	} while (unlikely(idx >= slab_limit));
 
 	return (char *)start + idx;
 }
 
 /* Shuffle the single linked freelist based on a random pre-computed sequence */
-static bool shuffle_freelist(struct kmem_cache *s, struct page *page)
+static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 {
 	void *start;
 	void *cur;
 	void *next;
-	unsigned long idx, pos, page_limit, freelist_count;
+	unsigned long idx, pos, slab_limit, freelist_count;
 
-	if (page->objects < 2 || !s->random_seq)
+	if (slab->objects < 2 || !s->random_seq)
 		return false;
 
 	freelist_count = oo_objects(s->oo);
 	pos = get_random_int() % freelist_count;
 
-	page_limit = page->objects * s->size;
-	start = fixup_red_left(s, page_address(page));
+	slab_limit = slab->objects * s->size;
+	start = fixup_red_left(s, slab_address(slab));
 
 	/* First entry is used as the base of the freelist */
-	cur = next_freelist_entry(s, page, &pos, start, page_limit,
+	cur = next_freelist_entry(s, slab, &pos, start, slab_limit,
 				freelist_count);
-	cur = setup_object(s, page, cur);
-	page->freelist = cur;
+	cur = setup_object(s, slab, cur);
+	slab->freelist = cur;
 
-	for (idx = 1; idx < page->objects; idx++) {
-		next = next_freelist_entry(s, page, &pos, start, page_limit,
+	for (idx = 1; idx < slab->objects; idx++) {
+		next = next_freelist_entry(s, slab, &pos, start, slab_limit,
 			freelist_count);
-		next = setup_object(s, page, next);
+		next = setup_object(s, slab, next);
 		set_freepointer(s, cur, next);
 		cur = next;
 	}
@@ -1795,15 +1800,15 @@ static inline int init_cache_random_seq(struct kmem_cache *s)
 	return 0;
 }
 static inline void init_freelist_randomization(void) { }
-static inline bool shuffle_freelist(struct kmem_cache *s, struct page *page)
+static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 {
 	return false;
 }
 #endif /* CONFIG_SLAB_FREELIST_RANDOM */
 
-static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
-	struct page *page;
+	struct slab *slab;
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
 	void *start, *p, *next;
@@ -1825,65 +1830,62 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
 		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~(__GFP_RECLAIM|__GFP_NOFAIL);
 
-	page = alloc_slab_page(s, alloc_gfp, node, oo);
-	if (unlikely(!page)) {
+	slab = alloc_slab(s, alloc_gfp, node, oo);
+	if (unlikely(!slab)) {
 		oo = s->min;
 		alloc_gfp = flags;
 		/*
 		 * Allocation may have failed due to fragmentation.
 		 * Try a lower order alloc if possible
 		 */
-		page = alloc_slab_page(s, alloc_gfp, node, oo);
-		if (unlikely(!page))
+		slab = alloc_slab(s, alloc_gfp, node, oo);
+		if (unlikely(!slab))
 			goto out;
 		stat(s, ORDER_FALLBACK);
 	}
 
-	page->objects = oo_objects(oo);
+	slab->objects = oo_objects(oo);
 
-	account_slab_page(page, oo_order(oo), s, flags);
+	account_slab(slab, oo_order(oo), s, flags);
 
-	page->slab_cache = s;
-	__SetPageSlab(page);
-	if (page_is_pfmemalloc(page))
-		SetPageSlabPfmemalloc(page);
+	slab->slab_cache = s;
 
-	kasan_poison_slab(page);
+	kasan_poison_slab(slab);
 
-	start = page_address(page);
+	start = slab_address(slab);
 
-	setup_page_debug(s, page, start);
+	setup_slab_debug(s, slab, start);
 
-	shuffle = shuffle_freelist(s, page);
+	shuffle = shuffle_freelist(s, slab);
 
 	if (!shuffle) {
 		start = fixup_red_left(s, start);
-		start = setup_object(s, page, start);
-		page->freelist = start;
-		for (idx = 0, p = start; idx < page->objects - 1; idx++) {
+		start = setup_object(s, slab, start);
+		slab->freelist = start;
+		for (idx = 0, p = start; idx < slab->objects - 1; idx++) {
			next = p + s->size;
-			next = setup_object(s, page, next);
+			next = setup_object(s, slab, next);
 			set_freepointer(s, p, next);
 			p = next;
 		}
 		set_freepointer(s, p, NULL);
 	}
 
-	page->inuse = page->objects;
-	page->frozen = 1;
+	slab->inuse = slab->objects;
+	slab->frozen = 1;
 
 out:
 	if (gfpflags_allow_blocking(flags))
 		local_irq_disable();
-	if (!page)
+	if (!slab)
 		return NULL;
 
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	inc_slabs_node(s, slab_nid(slab), slab->objects);
 
-	return page;
+	return slab;
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
 	if (unlikely(flags & GFP_SLAB_BUG_MASK))
 		flags = kmalloc_fix_flags(flags);
 
@@ -1892,76 +1894,77 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
 }
 
-static void __free_slab(struct kmem_cache *s, struct page *page)
+static void __free_slab(struct kmem_cache *s, struct slab *slab)
 {
-	int order = compound_order(page);
-	int pages = 1 << order;
+	struct page *page = &slab->page;
+	int order = slab_order(slab);
+	int slabs = 1 << order;
 
 	if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) {
 		void *p;
 
-		slab_pad_check(s, page);
-		for_each_object(p, s, page_address(page),
-						page->objects)
-			check_object(s, page, p, SLUB_RED_INACTIVE);
+		slab_pad_check(s, slab);
+		for_each_object(p, s, slab_address(slab),
+						slab->objects)
+			check_object(s, slab, p, SLUB_RED_INACTIVE);
 	}
 
-	__ClearPageSlabPfmemalloc(page);
+	__ClearSlabPfmemalloc(slab);
 	__ClearPageSlab(page);
-	/* In union with page->mapping where page allocator expects NULL */
-	page->slab_cache = NULL;
+	page->mapping = NULL;
 	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += pages;
-	unaccount_slab_page(page, order, s);
-	__free_pages(page, order);
+		current->reclaim_state->reclaimed_slab += slabs;
+	unaccount_slab(slab, order, s);
+	put_page(page);
 }
 
 static void rcu_free_slab(struct rcu_head *h)
 {
 	struct page *page = container_of(h, struct page, rcu_head);
+	struct slab *slab = (struct slab *)page;
 
-	__free_slab(page->slab_cache, page);
+	__free_slab(slab->slab_cache, slab);
 }
 
-static void free_slab(struct kmem_cache *s, struct page *page)
+static void free_slab(struct kmem_cache *s, struct slab *slab)
 {
 	if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) {
-		call_rcu(&page->rcu_head, rcu_free_slab);
+		call_rcu(&slab->page.rcu_head, rcu_free_slab);
 	} else
-		__free_slab(s, page);
+		__free_slab(s, slab);
 }
 
-static void discard_slab(struct kmem_cache *s, struct page *page)
+static void discard_slab(struct kmem_cache *s, struct slab *slab)
 {
-	dec_slabs_node(s, page_to_nid(page), page->objects);
-	free_slab(s, page);
+	dec_slabs_node(s, slab_nid(slab), slab->objects);
+	free_slab(s, slab);
 }
 
 /*
  * Management of partially allocated slabs.
  */
 static inline void
-__add_partial(struct kmem_cache_node *n, struct page *page, int tail)
+__add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
 {
 	n->nr_partial++;
 	if (tail == DEACTIVATE_TO_TAIL)
-		list_add_tail(&page->slab_list, &n->partial);
+		list_add_tail(&slab->slab_list, &n->partial);
 	else
-		list_add(&page->slab_list, &n->partial);
+		list_add(&slab->slab_list, &n->partial);
 }
 
 static inline void add_partial(struct kmem_cache_node *n,
-				struct page *page, int tail)
+				struct slab *slab, int tail)
 {
 	lockdep_assert_held(&n->list_lock);
-	__add_partial(n, page, tail);
+	__add_partial(n, slab, tail);
 }
 
 static inline void remove_partial(struct kmem_cache_node *n,
-					struct page *page)
+					struct slab *slab)
 {
 	lockdep_assert_held(&n->list_lock);
-	list_del(&page->slab_list);
+	list_del(&slab->slab_list);
 	n->nr_partial--;
 }
 
@@ -1972,12 +1975,12 @@ static inline void remove_partial(struct kmem_cache_node *n,
  * Returns a list of objects or NULL if it fails.
  */
 static inline void *acquire_slab(struct kmem_cache *s,
-		struct kmem_cache_node *n, struct page *page,
+		struct kmem_cache_node *n, struct slab *slab,
 		int mode, int *objects)
 {
 	void *freelist;
 	unsigned long counters;
-	struct page new;
+	struct slab new;
 
 	lockdep_assert_held(&n->list_lock);
 
@@ -1986,12 +1989,12 @@ static inline void *acquire_slab(struct kmem_cache *s,
 	 * The old freelist is the list of objects for the
 	 * per cpu allocation list.
 	 */
-	freelist = page->freelist;
-	counters = page->counters;
+	freelist = slab->freelist;
+	counters = slab->counters;
 	new.counters = counters;
 	*objects = new.objects - new.inuse;
 	if (mode) {
-		new.inuse = page->objects;
+		new.inuse = slab->objects;
 		new.freelist = NULL;
 	} else {
 		new.freelist = freelist;
@@ -2000,19 +2003,19 @@ static inline void *acquire_slab(struct kmem_cache *s,
 	VM_BUG_ON(new.frozen);
 	new.frozen = 1;
 
-	if (!__cmpxchg_double_slab(s, page,
+	if (!__cmpxchg_double_slab(s, slab,
 			freelist, counters,
 			new.freelist, new.counters,
 			"acquire_slab"))
 		return NULL;
 
-	remove_partial(n, page);
+	remove_partial(n, slab);
 	WARN_ON(!freelist);
 	return freelist;
 }
 
-static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
-static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
+static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
+static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
 
 /*
  * Try to allocate a partial slab from a specific node.
@@ -2020,7 +2023,7 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
  */
 static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 				struct kmem_cache_cpu *c, gfp_t flags)
 {
-	struct page *page, *page2;
+	struct slab *slab, *slab2;
 	void *object = NULL;
 	unsigned int available = 0;
 	int objects;
@@ -2035,23 +2038,23 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		return NULL;
 
 	spin_lock(&n->list_lock);
-	list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
+	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		void *t;
 
-		if (!pfmemalloc_match(page, flags))
+		if (!pfmemalloc_match(slab, flags))
 			continue;
 
-		t = acquire_slab(s, n, page, object == NULL, &objects);
+		t = acquire_slab(s, n, slab, object == NULL, &objects);
 		if (!t)
 			break;
 
 		available += objects;
 		if (!object) {
-			c->page = page;
+			c->slab = slab;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
-			put_cpu_partial(s, page, 0);
+			put_cpu_partial(s, slab, 0);
 			stat(s, CPU_PARTIAL_NODE);
 		}
 		if (!kmem_cache_has_cpu_partial(s)
@@ -2064,7 +2067,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 }
 
 /*
- * Get a page from somewhere. Search in increasing NUMA distances.
+ * Get a slab from somewhere. Search in increasing NUMA distances.
  */
 static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 		struct kmem_cache_cpu *c)
@@ -2128,7 +2131,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 }
 
 /*
- * Get a partial page, lock it and return it.
+ * Get a partial slab, lock it and return it.
  */
 static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 		struct kmem_cache_cpu *c)
@@ -2218,19 +2221,19 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
 /*
  * Remove the cpu slab
  */
-static void deactivate_slab(struct kmem_cache *s, struct page *page,
+static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 				void *freelist, struct kmem_cache_cpu *c)
 {
 	enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
-	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
 	int lock = 0, free_delta = 0;
 	enum slab_modes l = M_NONE, m = M_NONE;
 	void *nextfree, *freelist_iter, *freelist_tail;
 	int tail = DEACTIVATE_TO_HEAD;
-	struct page new;
-	struct page old;
+	struct slab new;
+	struct slab old;
 
-	if (page->freelist) {
+	if (slab->freelist) {
 		stat(s, DEACTIVATE_REMOTE_FREES);
 		tail = DEACTIVATE_TO_TAIL;
 	}
@@ -2249,7 +2252,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 		 * 'freelist_iter' is already corrupted. So isolate all objects
 		 * starting at 'freelist_iter' by skipping them.
 		 */
-		if (freelist_corrupted(s, page, &freelist_iter, nextfree))
+		if (freelist_corrupted(s, slab, &freelist_iter, nextfree))
 			break;
 
 		freelist_tail = freelist_iter;
@@ -2259,25 +2262,25 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 	}
 
 	/*
-	 * Stage two: Unfreeze the page while splicing the per-cpu
-	 * freelist to the head of page's freelist.
+	 * Stage two: Unfreeze the slab while splicing the per-cpu
+	 * freelist to the head of slab's freelist.
 	 *
-	 * Ensure that the page is unfrozen while the list presence
+	 * Ensure that the slab is unfrozen while the list presence
 	 * reflects the actual number of objects during unfreeze.
 	 *
 	 * We setup the list membership and then perform a cmpxchg
-	 * with the count. If there is a mismatch then the page
-	 * is not unfrozen but the page is on the wrong list.
+	 * with the count. If there is a mismatch then the slab
+	 * is not unfrozen but the slab is on the wrong list.
 	 *
 	 * Then we restart the process which may have to remove
-	 * the page from the list that we just put it on again
+	 * the slab from the list that we just put it on again
 	 * because the number of objects in the slab may have
 	 * changed.
 	 */
 redo:
 
-	old.freelist = READ_ONCE(page->freelist);
-	old.counters = READ_ONCE(page->counters);
+	old.freelist = READ_ONCE(slab->freelist);
+	old.counters = READ_ONCE(slab->counters);
 	VM_BUG_ON(!old.frozen);
 
 	/* Determine target state of the slab */
@@ -2299,7 +2302,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 			lock = 1;
 			/*
 			 * Taking the spinlock removes the possibility
-			 * that acquire_slab() will see a slab page that
+			 * that acquire_slab() will see a slab that
 			 * is frozen
 			 */
 			spin_lock(&n->list_lock);
@@ -2319,18 +2322,18 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 
 	if (l != m) {
 		if (l == M_PARTIAL)
-			remove_partial(n, page);
+			remove_partial(n, slab);
 		else if (l == M_FULL)
-			remove_full(s, n, page);
+			remove_full(s, n, slab);
 
 		if (m == M_PARTIAL)
-			add_partial(n, page, tail);
+			add_partial(n, slab, tail);
 		else if (m == M_FULL)
-			add_full(s, n, page);
+			add_full(s, n, slab);
 	}
 
 	l = m;
-	if (!__cmpxchg_double_slab(s, page,
+	if (!__cmpxchg_double_slab(s, slab,
 				old.freelist, old.counters,
 				new.freelist, new.counters,
 				"unfreezing slab"))
@@ -2345,11 +2348,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 		stat(s, DEACTIVATE_FULL);
 	else if (m == M_FREE) {
 		stat(s, DEACTIVATE_EMPTY);
-		discard_slab(s, page);
+		discard_slab(s, slab);
 		stat(s, FREE_SLAB);
 	}
 
-	c->page = NULL;
+	c->slab = NULL;
 	c->freelist = NULL;
 }
 
@@ -2365,15 +2368,15 @@ static void unfreeze_partials(struct kmem_cache *s,
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
-	struct page *page, *discard_page = NULL;
+	struct slab *slab, *next_slab = NULL;
 
-	while ((page = slub_percpu_partial(c))) {
-		struct page new;
-		struct page old;
+	while ((slab = slub_percpu_partial(c))) {
+		struct slab new;
+		struct slab old;
 
-		slub_set_percpu_partial(c, page);
+		slub_set_percpu_partial(c, slab);
 
-		n2 = get_node(s, page_to_nid(page));
+		n2 = get_node(s, slab_nid(slab));
 		if (n != n2) {
 			if (n)
 				spin_unlock(&n->list_lock);
@@ -2384,8 +2387,8 @@ static void unfreeze_partials(struct kmem_cache *s,
 
 		do {
 
-			old.freelist = page->freelist;
-			old.counters = page->counters;
+			old.freelist = slab->freelist;
+			old.counters = slab->counters;
 			VM_BUG_ON(!old.frozen);
 
 			new.counters = old.counters;
@@ -2393,16 +2396,16 @@ static void unfreeze_partials(struct kmem_cache *s,
 			new.frozen = 0;
 
-		} while (!__cmpxchg_double_slab(s, page,
+		} while (!__cmpxchg_double_slab(s, slab,
 				old.freelist, old.counters,
 				new.freelist, new.counters,
 				"unfreezing slab"));
 
 		if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
-			page->next = discard_page;
-			discard_page = page;
+			slab->next = next_slab;
+			next_slab = slab;
 		} else {
-			add_partial(n, page, DEACTIVATE_TO_TAIL);
+			add_partial(n, slab, DEACTIVATE_TO_TAIL);
 			stat(s, FREE_ADD_PARTIAL);
 		}
 	}
@@ -2410,40 +2413,40 @@ static void unfreeze_partials(struct kmem_cache *s,
 	if (n)
 		spin_unlock(&n->list_lock);
 
-	while (discard_page) {
-		page = discard_page;
-		discard_page = discard_page->next;
+	while (next_slab) {
+		slab = next_slab;
+		next_slab = next_slab->next;
 
 		stat(s, DEACTIVATE_EMPTY);
-		discard_slab(s, page);
+		discard_slab(s, slab);
 		stat(s, FREE_SLAB);
 	}
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
 /*
- * Put a page that was just frozen (in __slab_free|get_partial_node) into a
- * partial page slot if available.
+ * Put a slab that was just frozen (in __slab_free|get_partial_node) into a
+ * partial slab slot if available.
  *
  * If we did not find a slot then simply move all the partials to the
  * per node partial list.
  */
-static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
+static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
-	struct page *oldpage;
-	int pages;
+	struct slab *oldslab;
+	int slabs;
 	int pobjects;
 
 	preempt_disable();
 	do {
-		pages = 0;
+		slabs = 0;
 		pobjects = 0;
-		oldpage = this_cpu_read(s->cpu_slab->partial);
+		oldslab = this_cpu_read(s->cpu_slab->partial);
 
-		if (oldpage) {
-			pobjects = oldpage->pobjects;
-			pages = oldpage->pages;
+		if (oldslab) {
+			pobjects = oldslab->pobjects;
+			slabs = oldslab->slabs;
 			if (drain && pobjects > slub_cpu_partial(s)) {
 				unsigned long flags;
 				/*
@@ -2453,22 +2456,22 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 				local_irq_save(flags);
 				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
 				local_irq_restore(flags);
-				oldpage = NULL;
+				oldslab = NULL;
 				pobjects = 0;
-				pages = 0;
+				slabs = 0;
 				stat(s, CPU_PARTIAL_DRAIN);
 			}
 		}
 
-		pages++;
-		pobjects += page->objects - page->inuse;
+		slabs++;
+		pobjects += slab->objects - slab->inuse;
 
-		page->pages = pages;
-		page->pobjects = pobjects;
-		page->next = oldpage;
+		slab->slabs = slabs;
+		slab->pobjects = pobjects;
+		slab->next = oldslab;
 
-	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
-								!= oldpage);
+	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldslab, slab)
+								!= oldslab);
 
 	if (unlikely(!slub_cpu_partial(s))) {
 		unsigned long flags;
@@ -2483,7 +2486,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	stat(s, CPUSLAB_FLUSH);
-	deactivate_slab(s, c->page, c->freelist, c);
+	deactivate_slab(s, c->slab, c->freelist, c);
 	c->tid = next_tid(c->tid);
 }
 
@@ -2497,7 +2500,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
-	if (c->page)
+	if (c->slab)
 		flush_slab(s, c);
unfreeze_partials(s, c); @@ -2515,7 +2518,7 @@ static bool has_cpu_slab(int cpu, void *info) struct kmem_cache *s = info; struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu); - return c->page || slub_percpu_partial(c); + return c->slab || slub_percpu_partial(c); } static void flush_all(struct kmem_cache *s) @@ -2546,19 +2549,19 @@ static int slub_cpu_dead(unsigned int cpu) * Check if the objects in a per cpu structure fit numa * locality expectations. */ -static inline int node_match(struct page *page, int node) +static inline int node_match(struct slab *slab, int node) { #ifdef CONFIG_NUMA - if (node != NUMA_NO_NODE && page_to_nid(page) != node) + if (node != NUMA_NO_NODE && slab_nid(slab) != node) return 0; #endif return 1; } #ifdef CONFIG_SLUB_DEBUG -static int count_free(struct page *page) +static int count_free(struct slab *slab) { - return page->objects - page->inuse; + return slab->objects - slab->inuse; } static inline unsigned long node_nr_objs(struct kmem_cache_node *n) @@ -2569,15 +2572,15 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n) #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) static unsigned long count_partial(struct kmem_cache_node *n, - int (*get_count)(struct page *)) + int (*get_count)(struct slab *)) { unsigned long flags; unsigned long x = 0; - struct page *page; + struct slab *slab; spin_lock_irqsave(&n->list_lock, flags); - list_for_each_entry(page, &n->partial, slab_list) - x += get_count(page); + list_for_each_entry(slab, &n->partial, slab_list) + x += get_count(slab); spin_unlock_irqrestore(&n->list_lock, flags); return x; } @@ -2625,7 +2628,7 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags, { void *freelist; struct kmem_cache_cpu *c = *pc; - struct page *page; + struct slab *slab; WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO)); @@ -2634,62 +2637,62 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags, if (freelist) return freelist; - page = new_slab(s, 
flags, node); - if (page) { + slab = new_slab(s, flags, node); + if (slab) { c = raw_cpu_ptr(s->cpu_slab); - if (c->page) + if (c->slab) flush_slab(s, c); /* - * No other reference to the page yet so we can + * No other reference to the slab yet so we can * muck around with it freely without cmpxchg */ - freelist = page->freelist; - page->freelist = NULL; + freelist = slab->freelist; + slab->freelist = NULL; stat(s, ALLOC_SLAB); - c->page = page; + c->slab = slab; *pc = c; } return freelist; } -static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags) +static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags) { - if (unlikely(PageSlabPfmemalloc(page))) + if (unlikely(SlabPfmemalloc(slab))) return gfp_pfmemalloc_allowed(gfpflags); return true; } /* - * Check the page->freelist of a page and either transfer the freelist to the - * per cpu freelist or deactivate the page. + * Check the slab->freelist of a slab and either transfer the freelist to the + * per cpu freelist or deactivate the slab. * - * The page is still frozen if the return value is not NULL. + * The slab is still frozen if the return value is not NULL. * - * If this function returns NULL then the page has been unfrozen. + * If this function returns NULL then the slab has been unfrozen. * * This function must be called with interrupt disabled. 
*/ -static inline void *get_freelist(struct kmem_cache *s, struct page *page) +static inline void *get_freelist(struct kmem_cache *s, struct slab *slab) { - struct page new; + struct slab new; unsigned long counters; void *freelist; do { - freelist = page->freelist; - counters = page->counters; + freelist = slab->freelist; + counters = slab->counters; new.counters = counters; VM_BUG_ON(!new.frozen); - new.inuse = page->objects; + new.inuse = slab->objects; new.frozen = freelist != NULL; - } while (!__cmpxchg_double_slab(s, page, + } while (!__cmpxchg_double_slab(s, slab, freelist, counters, NULL, new.counters, "get_freelist")); @@ -2711,7 +2714,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page) * * And if we were unable to get a new slab from the partial slab lists then * we need to allocate a new slab. This is the slowest path since it involves * a call to the page allocator and the setup of a new slab. * * Version of __slab_alloc to use when we know that interrupts are * already disabled (which is the case for bulk allocation). 
@@ -2720,12 +2723,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, unsigned long addr, struct kmem_cache_cpu *c) { void *freelist; - struct page *page; + struct slab *slab; stat(s, ALLOC_SLOWPATH); - page = c->page; - if (!page) { + slab = c->slab; + if (!slab) { /* * if the node is not online or has no normal memory, just * ignore the node constraint @@ -2737,7 +2740,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, } redo: - if (unlikely(!node_match(page, node))) { + if (unlikely(!node_match(slab, node))) { /* * same as above but node_match() being false already * implies node != NUMA_NO_NODE @@ -2747,18 +2750,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, goto redo; } else { stat(s, ALLOC_NODE_MISMATCH); - deactivate_slab(s, page, c->freelist, c); + deactivate_slab(s, slab, c->freelist, c); goto new_slab; } } /* - * By rights, we should be searching for a slab page that was + * By rights, we should be searching for a slab that was * PFMEMALLOC but right now, we are losing the pfmemalloc - * information when the page leaves the per-cpu allocator + * information when the slab leaves the per-cpu allocator */ - if (unlikely(!pfmemalloc_match(page, gfpflags))) { - deactivate_slab(s, page, c->freelist, c); + if (unlikely(!pfmemalloc_match(slab, gfpflags))) { + deactivate_slab(s, slab, c->freelist, c); goto new_slab; } @@ -2767,10 +2770,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, if (freelist) goto load_freelist; - freelist = get_freelist(s, page); + freelist = get_freelist(s, slab); if (!freelist) { - c->page = NULL; + c->slab = NULL; stat(s, DEACTIVATE_BYPASS); goto new_slab; } @@ -2780,10 +2783,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, load_freelist: /* * freelist is pointing to the list of objects to be used. - * page is pointing to the page from which the objects are obtained. 
- * That page must be frozen for per cpu allocations to work. + * slab is pointing to the slab from which the objects are obtained. + * That slab must be frozen for per cpu allocations to work. */ - VM_BUG_ON(!c->page->frozen); + VM_BUG_ON(!c->slab->frozen); c->freelist = get_freepointer(s, freelist); c->tid = next_tid(c->tid); return freelist; @@ -2791,8 +2794,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, new_slab: if (slub_percpu_partial(c)) { - page = c->page = slub_percpu_partial(c); - slub_set_percpu_partial(c, page); + slab = c->slab = slub_percpu_partial(c); + slub_set_percpu_partial(c, slab); stat(s, CPU_PARTIAL_ALLOC); goto redo; } @@ -2804,16 +2807,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, return NULL; } - page = c->page; - if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags))) + slab = c->slab; + if (likely(!kmem_cache_debug(s) && pfmemalloc_match(slab, gfpflags))) goto load_freelist; /* Only entered in the debug case */ if (kmem_cache_debug(s) && - !alloc_debug_processing(s, page, freelist, addr)) + !alloc_debug_processing(s, slab, freelist, addr)) goto new_slab; /* Slab failed checks. Next slab needed */ - deactivate_slab(s, page, get_freepointer(s, freelist), c); + deactivate_slab(s, slab, get_freepointer(s, freelist), c); return freelist; } @@ -2869,7 +2872,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, { void *object; struct kmem_cache_cpu *c; - struct page *page; + struct slab *slab; unsigned long tid; struct obj_cgroup *objcg = NULL; bool init = false; @@ -2902,9 +2905,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, /* * Irqless object alloc/free algorithm used here depends on sequence * of fetching cpu_slab's data. 
tid should be fetched before anything - * on c to guarantee that object and page associated with previous tid + * on c to guarantee that object and slab associated with previous tid * won't be used with current tid. If we fetch tid first, object and - * page could be one associated with next tid and our alloc/free + * slab could be one associated with next tid and our alloc/free * request will be failed. In this case, we will retry. So, no problem. */ barrier(); @@ -2917,8 +2920,8 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, */ object = c->freelist; - page = c->page; - if (unlikely(!object || !page || !node_match(page, node))) { + slab = c->slab; + if (unlikely(!object || !slab || !node_match(slab, node))) { object = __slab_alloc(s, gfpflags, node, addr, c); } else { void *next_object = get_freepointer_safe(s, object); @@ -3020,17 +3023,17 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace); * have a longer lifetime than the cpu slabs in most processing loads. * * So we still attempt to reduce cache line usage. Just take the slab - * lock and free the item. If there is no additional partial page + * lock and free the item. If there is no additional partial slab * handling required then we can return immediately. 
*/ -static void __slab_free(struct kmem_cache *s, struct page *page, +static void __slab_free(struct kmem_cache *s, struct slab *slab, void *head, void *tail, int cnt, unsigned long addr) { void *prior; int was_frozen; - struct page new; + struct slab new; unsigned long counters; struct kmem_cache_node *n = NULL; unsigned long flags; @@ -3041,7 +3044,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, return; if (kmem_cache_debug(s) && - !free_debug_processing(s, page, head, tail, cnt, addr)) + !free_debug_processing(s, slab, head, tail, cnt, addr)) return; do { @@ -3049,8 +3052,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page, spin_unlock_irqrestore(&n->list_lock, flags); n = NULL; } - prior = page->freelist; - counters = page->counters; + prior = slab->freelist; + counters = slab->counters; set_freepointer(s, tail, prior); new.counters = counters; was_frozen = new.frozen; @@ -3069,7 +3072,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } else { /* Needs to be taken off a list */ - n = get_node(s, page_to_nid(page)); + n = get_node(s, slab_nid(slab)); /* * Speculatively acquire the list_lock. * If the cmpxchg does not succeed then we may @@ -3083,7 +3086,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, } } - } while (!cmpxchg_double_slab(s, page, + } while (!cmpxchg_double_slab(s, slab, prior, counters, head, new.counters, "__slab_free")); @@ -3098,10 +3101,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page, stat(s, FREE_FROZEN); } else if (new.frozen) { /* - * If we just froze the page then put it onto the + * If we just froze the slab then put it onto the * per cpu partial list. */ - put_cpu_partial(s, page, 1); + put_cpu_partial(s, slab, 1); stat(s, CPU_PARTIAL_FREE); } @@ -3116,8 +3119,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page, * then add it. 
*/ if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) { - remove_full(s, n, page); - add_partial(n, page, DEACTIVATE_TO_TAIL); + remove_full(s, n, slab); + add_partial(n, slab, DEACTIVATE_TO_TAIL); stat(s, FREE_ADD_PARTIAL); } spin_unlock_irqrestore(&n->list_lock, flags); @@ -3128,16 +3131,16 @@ static void __slab_free(struct kmem_cache *s, struct page *page, /* * Slab on the partial list. */ - remove_partial(n, page); + remove_partial(n, slab); stat(s, FREE_REMOVE_PARTIAL); } else { /* Slab must be on the full list */ - remove_full(s, n, page); + remove_full(s, n, slab); } spin_unlock_irqrestore(&n->list_lock, flags); stat(s, FREE_SLAB); - discard_slab(s, page); + discard_slab(s, slab); } /* @@ -3152,11 +3155,11 @@ static void __slab_free(struct kmem_cache *s, struct page *page, * with all sorts of special processing. * * Bulk free of a freelist with several objects (all pointing to the - * same page) possible by specifying head and tail ptr, plus objects + * same slab) possible by specifying head and tail ptr, plus objects * count (cnt). Bulk free indicated by tail pointer being set. */ static __always_inline void do_slab_free(struct kmem_cache *s, - struct page *page, void *head, void *tail, + struct slab *slab, void *head, void *tail, int cnt, unsigned long addr) { void *tail_obj = tail ? 
: head; @@ -3180,7 +3183,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s, /* Same with comment on barrier() in slab_alloc_node() */ barrier(); - if (likely(page == c->page)) { + if (likely(slab == c->slab)) { void **freelist = READ_ONCE(c->freelist); set_freepointer(s, tail_obj, freelist); @@ -3195,11 +3198,11 @@ static __always_inline void do_slab_free(struct kmem_cache *s, } stat(s, FREE_FASTPATH); } else - __slab_free(s, page, head, tail_obj, cnt, addr); + __slab_free(s, slab, head, tail_obj, cnt, addr); } -static __always_inline void slab_free(struct kmem_cache *s, struct page *page, +static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab, void *head, void *tail, int cnt, unsigned long addr) { @@ -3208,13 +3211,13 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page, * to remove objects, whose reuse must be delayed. */ if (slab_free_freelist_hook(s, &head, &tail)) - do_slab_free(s, page, head, tail, cnt, addr); + do_slab_free(s, slab, head, tail, cnt, addr); } #ifdef CONFIG_KASAN_GENERIC void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr) { - do_slab_free(cache, virt_to_head_page(x), x, NULL, 1, addr); + do_slab_free(cache, virt_to_slab(x), x, NULL, 1, addr); } #endif @@ -3223,13 +3226,13 @@ void kmem_cache_free(struct kmem_cache *s, void *x) s = cache_from_obj(s, x); if (!s) return; - slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_); + slab_free(s, virt_to_slab(x), x, NULL, 1, _RET_IP_); trace_kmem_cache_free(_RET_IP_, x, s->name); } EXPORT_SYMBOL(kmem_cache_free); struct detached_freelist { - struct page *page; + struct slab *slab; void *tail; void *freelist; int cnt; @@ -3239,8 +3242,8 @@ struct detached_freelist { /* * This function progressively scans the array with free objects (with * a limited look ahead) and extract objects belonging to the same - * page. It builds a detached freelist directly within the given - * page/objects. 
This can happen without any need for + * slab. It builds a detached freelist directly within the given + * slab/objects. This can happen without any need for * synchronization, because the objects are owned by running process. * The freelist is build up as a single linked list in the objects. * The idea is, that this detached freelist can then be bulk @@ -3255,10 +3258,10 @@ int build_detached_freelist(struct kmem_cache *s, size_t size, size_t first_skipped_index = 0; int lookahead = 3; void *object; - struct page *page; + struct slab *slab; /* Always re-init detached_freelist */ - df->page = NULL; + df->slab = NULL; do { object = p[--size]; @@ -3268,18 +3271,18 @@ int build_detached_freelist(struct kmem_cache *s, size_t size, if (!object) return 0; - page = virt_to_head_page(object); + slab = virt_to_slab(object); if (!s) { /* Handle kalloc'ed objects */ - if (unlikely(!PageSlab(page))) { - BUG_ON(!PageCompound(page)); + if (unlikely(!is_slab(slab))) { + BUG_ON(!SlabMulti(slab)); kfree_hook(object); - __free_pages(page, compound_order(page)); + put_page(&slab->page); p[size] = NULL; /* mark object processed */ return size; } /* Derive kmem_cache from object */ - df->s = page->slab_cache; + df->s = slab->slab_cache; } else { df->s = cache_from_obj(s, object); /* Support for memcg */ } @@ -3292,7 +3295,7 @@ int build_detached_freelist(struct kmem_cache *s, size_t size, } /* Start new detached freelist */ - df->page = page; + df->slab = slab; set_freepointer(df->s, object, NULL); df->tail = object; df->freelist = object; @@ -3304,8 +3307,8 @@ int build_detached_freelist(struct kmem_cache *s, size_t size, if (!object) continue; /* Skip processed objects */ - /* df->page is always set at this point */ - if (df->page == virt_to_head_page(object)) { + /* df->slab is always set at this point */ + if (df->slab == virt_to_slab(object)) { /* Opportunity build freelist */ set_freepointer(df->s, object, df->freelist); df->freelist = object; @@ -3337,10 +3340,10 @@ void 
kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) struct detached_freelist df; size = build_detached_freelist(s, size, p, &df); - if (!df.page) + if (!df.slab) continue; - slab_free(df.s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_); + slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt, _RET_IP_); } while (likely(size)); } EXPORT_SYMBOL(kmem_cache_free_bulk); @@ -3435,7 +3438,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk); */ /* - * Minimum / Maximum order of slab pages. This influences locking overhead + * Minimum / Maximum order of slabs. This influences locking overhead * and slab fragmentation. A higher order reduces the number of partial slabs * and increases the number of allocations possible without having to * take the list_lock. @@ -3449,7 +3452,7 @@ static unsigned int slub_min_objects; * * The order of allocation has significant impact on performance and other * system components. Generally order 0 allocations should be preferred since * order 0 does not cause fragmentation in the page allocator. Larger objects * be problematic to put into order 0 slabs because there may be too much * unused space left. We go to a higher order if more than 1/16th of the slab * would be wasted. @@ -3461,15 +3464,15 @@ static unsigned int slub_min_objects; * * slub_max_order specifies the order where we begin to stop considering the * number of objects in a slab as critical. If we reach slub_max_order then - * we try to keep the page order as low as possible. So we accept more waste - * of space in favor of a small page order. + * we try to keep the slab order as low as possible. So we accept more waste + * of space in favor of a small slab order. * * Higher order allocations also allow the placement of more objects in a * slab and thereby reduce object handling overhead. 
If the user has * requested a higher minimum order then we start with that one instead of * the smallest order which will fit the object. */ -static inline unsigned int slab_order(unsigned int size, +static inline unsigned int calc_slab_order(unsigned int size, unsigned int min_objects, unsigned int max_order, unsigned int fract_leftover) { @@ -3533,7 +3536,7 @@ static inline int calculate_order(unsigned int size) fraction = 16; while (fraction >= 4) { - order = slab_order(size, min_objects, + order = calc_slab_order(size, min_objects, slub_max_order, fraction); if (order <= slub_max_order) return order; @@ -3546,14 +3549,14 @@ static inline int calculate_order(unsigned int size) * We were unable to place multiple objects in a slab. Now * lets see if we can place a single object there. */ - order = slab_order(size, 1, slub_max_order, 1); + order = calc_slab_order(size, 1, slub_max_order, 1); if (order <= slub_max_order) return order; /* * Doh this slab cannot be placed using slub_max_order. 
*/ - order = slab_order(size, 1, MAX_ORDER, 1); + order = calc_slab_order(size, 1, MAX_ORDER, 1); if (order < MAX_ORDER) return order; return -ENOSYS; @@ -3605,38 +3608,38 @@ static struct kmem_cache *kmem_cache_node; */ static void early_kmem_cache_node_alloc(int node) { - struct page *page; + struct slab *slab; struct kmem_cache_node *n; BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node)); - page = new_slab(kmem_cache_node, GFP_NOWAIT, node); + slab = new_slab(kmem_cache_node, GFP_NOWAIT, node); - BUG_ON(!page); - if (page_to_nid(page) != node) { + BUG_ON(!slab); + if (slab_nid(slab) != node) { pr_err("SLUB: Unable to allocate memory from node %d\n", node); pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n"); } - n = page->freelist; + n = slab->freelist; BUG_ON(!n); #ifdef CONFIG_SLUB_DEBUG init_object(kmem_cache_node, n, SLUB_RED_ACTIVE); init_tracking(kmem_cache_node, n); #endif n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false); - page->freelist = get_freepointer(kmem_cache_node, n); - page->inuse = 1; - page->frozen = 0; + slab->freelist = get_freepointer(kmem_cache_node, n); + slab->inuse = 1; + slab->frozen = 0; kmem_cache_node->node[node] = n; init_kmem_cache_node(n); - inc_slabs_node(kmem_cache_node, node, page->objects); + inc_slabs_node(kmem_cache_node, node, slab->objects); /* * No locks need to be taken here as it has just been * initialized and there is no concurrent access. */ - __add_partial(n, page, DEACTIVATE_TO_HEAD); + __add_partial(n, slab, DEACTIVATE_TO_HEAD); } static void free_kmem_cache_nodes(struct kmem_cache *s) @@ -3894,8 +3897,8 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags) #endif /* - * The larger the object size is, the more pages we want on the partial + * The larger the object size is, the more slabs we want on the partial * list to avoid pounding the page allocator excessively. 
*/ set_min_partial(s, ilog2(s->size) / 2); @@ -3922,19 +3925,19 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags) return -EINVAL; } -static void list_slab_objects(struct kmem_cache *s, struct page *page, +static void list_slab_objects(struct kmem_cache *s, struct slab *slab, const char *text) { #ifdef CONFIG_SLUB_DEBUG - void *addr = page_address(page); + void *addr = slab_address(slab); unsigned long *map; void *p; - slab_err(s, page, text, s->name); - slab_lock(page); + slab_err(s, slab, text, s->name); + slab_lock(slab); - map = get_map(s, page); - for_each_object(p, s, addr, page->objects) { + map = get_map(s, slab); + for_each_object(p, s, addr, slab->objects) { if (!test_bit(__obj_to_index(s, addr, p), map)) { pr_err("Object 0x%p @offset=%tu\n", p, p - addr); @@ -3942,7 +3945,7 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page, } } put_map(map); - slab_unlock(page); + slab_unlock(slab); #endif } @@ -3954,23 +3957,23 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page, static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) { LIST_HEAD(discard); - struct page *page, *h; + struct slab *slab, *h; BUG_ON(irqs_disabled()); spin_lock_irq(&n->list_lock); - list_for_each_entry_safe(page, h, &n->partial, slab_list) { - if (!page->inuse) { - remove_partial(n, page); - list_add(&page->slab_list, &discard); + list_for_each_entry_safe(slab, h, &n->partial, slab_list) { + if (!slab->inuse) { + remove_partial(n, slab); + list_add(&slab->slab_list, &discard); } else { - list_slab_objects(s, page, + list_slab_objects(s, slab, "Objects remaining in %s on __kmem_cache_shutdown()"); } } spin_unlock_irq(&n->list_lock); - list_for_each_entry_safe(page, h, &discard, slab_list) - discard_slab(s, page); + list_for_each_entry_safe(slab, h, &discard, slab_list) + discard_slab(s, slab); } bool __kmem_cache_empty(struct kmem_cache *s) @@ -4003,31 +4006,31 @@ int __kmem_cache_shutdown(struct kmem_cache *s) 
} #ifdef CONFIG_PRINTK -void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page) +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab) { void *base; int __maybe_unused i; unsigned int objnr; void *objp; void *objp0; - struct kmem_cache *s = page->slab_cache; + struct kmem_cache *s = slab->slab_cache; struct track __maybe_unused *trackp; kpp->kp_ptr = object; - kpp->kp_page = page; + kpp->kp_slab = slab; kpp->kp_slab_cache = s; - base = page_address(page); + base = slab_address(slab); objp0 = kasan_reset_tag(object); #ifdef CONFIG_SLUB_DEBUG objp = restore_red_left(s, objp0); #else objp = objp0; #endif - objnr = obj_to_index(s, page, objp); + objnr = obj_to_index(s, slab, objp); kpp->kp_data_offset = (unsigned long)((char *)objp0 - (char *)objp); objp = base + s->size * objnr; kpp->kp_objp = objp; - if (WARN_ON_ONCE(objp < base || objp >= base + page->objects * s->size || (objp - base) % s->size) || + if (WARN_ON_ONCE(objp < base || objp >= base + slab->objects * s->size || (objp - base) % s->size) || !(s->flags & SLAB_STORE_USER)) return; #ifdef CONFIG_SLUB_DEBUG @@ -4115,8 +4118,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node) unsigned int order = get_order(size); flags |= __GFP_COMP; page = alloc_pages_node(node, flags, order); if (page) { ptr = page_address(page); mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B, PAGE_SIZE << order); @@ -4165,7 +4168,7 @@ EXPORT_SYMBOL(__kmalloc_node); * Returns NULL if check passes, otherwise const char * to name of cache * to indicate an error. 
 */
-void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
+void __check_heap_object(const void *ptr, unsigned long n, struct slab *slab,
 			 bool to_user)
 {
 	struct kmem_cache *s;
@@ -4176,18 +4179,18 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
 	ptr = kasan_reset_tag(ptr);
 
 	/* Find object and usable object size. */
-	s = page->slab_cache;
+	s = slab->slab_cache;
 
 	/* Reject impossible pointers. */
-	if (ptr < page_address(page))
-		usercopy_abort("SLUB object not in SLUB page?!", NULL,
+	if (ptr < slab_address(slab))
+		usercopy_abort("SLUB object not in SLUB slab?!", NULL,
 			       to_user, 0, n);
 
 	/* Find offset within object. */
 	if (is_kfence)
 		offset = ptr - kfence_object_start(ptr);
 	else
-		offset = (ptr - page_address(page)) % s->size;
+		offset = (ptr - slab_address(slab)) % s->size;
 
 	/* Adjust for redzone and reject if within the redzone. */
 	if (!is_kfence && kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
@@ -4222,25 +4225,25 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
 
 size_t __ksize(const void *object)
 {
-	struct page *page;
+	struct slab *slab;
 
 	if (unlikely(object == ZERO_SIZE_PTR))
 		return 0;
 
-	page = virt_to_head_page(object);
+	slab = virt_to_slab(object);
 
-	if (unlikely(!PageSlab(page))) {
-		WARN_ON(!PageCompound(page));
-		return page_size(page);
+	if (unlikely(!is_slab(slab))) {
+		WARN_ON(!SlabMulti(slab));
+		return slab_size(slab);
 	}
 
-	return slab_ksize(page->slab_cache);
+	return slab_ksize(slab->slab_cache);
 }
 EXPORT_SYMBOL(__ksize);
 
 void kfree(const void *x)
 {
-	struct page *page;
+	struct slab *slab;
 	void *object = (void *)x;
 
 	trace_kfree(_RET_IP_, x);
@@ -4248,18 +4251,19 @@ void kfree(const void *x)
 	if (unlikely(ZERO_OR_NULL_PTR(x)))
 		return;
 
-	page = virt_to_head_page(x);
-	if (unlikely(!PageSlab(page))) {
-		unsigned int order = compound_order(page);
+	slab = virt_to_slab(x);
+	if (unlikely(!is_slab(slab))) {
+		unsigned int order = slab_order(slab);
+		struct page *page = &slab->page;
 
-		BUG_ON(!PageCompound(page));
+		BUG_ON(!SlabMulti(slab));
 		kfree_hook(object);
 		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
 				      -(PAGE_SIZE << order));
-		__free_pages(page, order);
+		put_page(page);
 		return;
 	}
-	slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
+	slab_free(slab->slab_cache, slab, object, NULL, 1, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
 
@@ -4279,8 +4283,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	int node;
 	int i;
 	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
+	struct slab *slab;
+	struct slab *t;
 	struct list_head discard;
 	struct list_head promote[SHRINK_PROMOTE_MAX];
 	unsigned long flags;
@@ -4298,22 +4302,22 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * Build lists of slabs to discard or promote.
 		 *
 		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
+		 * list_lock. slab->inuse here is the upper limit.
 		 */
-		list_for_each_entry_safe(page, t, &n->partial, slab_list) {
-			int free = page->objects - page->inuse;
+		list_for_each_entry_safe(slab, t, &n->partial, slab_list) {
+			int free = slab->objects - slab->inuse;
 
-			/* Do not reread page->inuse */
+			/* Do not reread slab->inuse */
 			barrier();
 
 			/* We do not keep full slabs on the list */
 			BUG_ON(free <= 0);
 
-			if (free == page->objects) {
-				list_move(&page->slab_list, &discard);
+			if (free == slab->objects) {
+				list_move(&slab->slab_list, &discard);
 				n->nr_partial--;
 			} else if (free <= SHRINK_PROMOTE_MAX)
-				list_move(&page->slab_list, promote + free - 1);
+				list_move(&slab->slab_list, promote + free - 1);
 		}
 
 		/*
@@ -4326,8 +4330,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
 		/* Release empty slabs */
-		list_for_each_entry_safe(page, t, &discard, slab_list)
-			discard_slab(s, page);
+		list_for_each_entry_safe(slab, t, &discard, slab_list)
+			discard_slab(s, slab);
 
 		if (slabs_node(s, node))
 			ret = 1;
@@ -4461,7 +4465,7 @@ static struct notifier_block slab_memory_callback_nb = {
 
 /*
  * Used for early kmem_cache structures that were allocated using
- * the page allocator. Allocate them properly then fix up the pointers
+ * the slab allocator. Allocate them properly then fix up the pointers
  * that may be pointing to the wrong kmem_cache structure.
  */
@@ -4480,7 +4484,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	 */
 	__flush_cpu_slab(s, smp_processor_id());
 	for_each_kmem_cache_node(s, node, n) {
-		struct page *p;
+		struct slab *p;
 
 		list_for_each_entry(p, &n->partial, slab_list)
 			p->slab_cache = s;
@@ -4656,54 +4660,54 @@ EXPORT_SYMBOL(__kmalloc_node_track_caller);
 #endif
 
 #ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
+static int count_inuse(struct slab *slab)
 {
-	return page->inuse;
+	return slab->inuse;
 }
 
-static int count_total(struct page *page)
+static int count_total(struct slab *slab)
 {
-	return page->objects;
+	return slab->objects;
 }
 #endif
 
 #ifdef CONFIG_SLUB_DEBUG
-static void validate_slab(struct kmem_cache *s, struct page *page)
+static void validate_slab(struct kmem_cache *s, struct slab *slab)
 {
 	void *p;
-	void *addr = page_address(page);
+	void *addr = slab_address(slab);
 	unsigned long *map;
 
-	slab_lock(page);
+	slab_lock(slab);
 
-	if (!check_slab(s, page) || !on_freelist(s, page, NULL))
+	if (!check_slab(s, slab) || !on_freelist(s, slab, NULL))
 		goto unlock;
 
 	/* Now we know that a valid freelist exists */
-	map = get_map(s, page);
-	for_each_object(p, s, addr, page->objects) {
+	map = get_map(s, slab);
+	for_each_object(p, s, addr, slab->objects) {
 		u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
 			 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
 
-		if (!check_object(s, page, p, val))
+		if (!check_object(s, slab, p, val))
 			break;
 	}
 	put_map(map);
 unlock:
-	slab_unlock(page);
+	slab_unlock(slab);
 }
 
 static int validate_slab_node(struct kmem_cache *s,
 		struct kmem_cache_node *n)
 {
 	unsigned long count = 0;
-	struct page *page;
+	struct slab *slab;
 	unsigned long flags;
 
 	spin_lock_irqsave(&n->list_lock, flags);
 
-	list_for_each_entry(page, &n->partial, slab_list) {
-		validate_slab(s, page);
+	list_for_each_entry(slab, &n->partial, slab_list) {
+		validate_slab(s, slab);
 		count++;
 	}
 	if (count != n->nr_partial) {
@@ -4715,8 +4719,8 @@ static int validate_slab_node(struct kmem_cache *s,
 	if (!(s->flags & SLAB_STORE_USER))
 		goto out;
 
-	list_for_each_entry(page, &n->full, slab_list) {
-		validate_slab(s, page);
+	list_for_each_entry(slab, &n->full, slab_list) {
+		validate_slab(s, slab);
 		count++;
 	}
 	if (count != atomic_long_read(&n->nr_slabs)) {
@@ -4838,7 +4842,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 			cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 		}
-		node_set(page_to_nid(virt_to_page(track)), l->nodes);
+		node_set(slab_nid(virt_to_slab(track)), l->nodes);
 		return 1;
 	}
 
@@ -4869,19 +4873,19 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
 	cpumask_clear(to_cpumask(l->cpus));
 	cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
 	nodes_clear(l->nodes);
-	node_set(page_to_nid(virt_to_page(track)), l->nodes);
+	node_set(slab_nid(virt_to_slab(track)), l->nodes);
 	return 1;
 }
 
 static void process_slab(struct loc_track *t, struct kmem_cache *s,
-		struct page *page, enum track_item alloc)
+		struct slab *slab, enum track_item alloc)
 {
-	void *addr = page_address(page);
+	void *addr = slab_address(slab);
 	void *p;
 	unsigned long *map;
 
-	map = get_map(s, page);
-	for_each_object(p, s, addr, page->objects)
+	map = get_map(s, slab);
+	for_each_object(p, s, addr, slab->objects)
 		if (!test_bit(__obj_to_index(s, addr, p), map))
 			add_location(t, s, get_track(s, p, alloc));
 	put_map(map);
@@ -4924,32 +4928,32 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 			int node;
-			struct page *page;
+			struct slab *slab;
 
-			page = READ_ONCE(c->page);
-			if (!page)
+			slab = READ_ONCE(c->slab);
+			if (!slab)
 				continue;
 
-			node = page_to_nid(page);
+			node = slab_nid(slab);
 			if (flags & SO_TOTAL)
-				x = page->objects;
+				x = slab->objects;
 			else if (flags & SO_OBJECTS)
-				x = page->inuse;
+				x = slab->inuse;
 			else
 				x = 1;
 
 			total += x;
 			nodes[node] += x;
 
-			page = slub_percpu_partial_read_once(c);
-			if (page) {
-				node = page_to_nid(page);
+			slab = slub_percpu_partial_read_once(c);
+			if (slab) {
+				node = slab_nid(slab);
 				if (flags & SO_TOTAL)
 					WARN_ON_ONCE(1);
 				else if (flags & SO_OBJECTS)
 					WARN_ON_ONCE(1);
 				else
-					x = page->pages;
+					x = slab->slabs;
 				total += x;
 				nodes[node] += x;
 			}
@@ -5146,31 +5150,31 @@ SLAB_ATTR_RO(objects_partial);
 static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 {
 	int objects = 0;
-	int pages = 0;
+	int slabs = 0;
 	int cpu;
 	int len = 0;
 
 	for_each_online_cpu(cpu) {
-		struct page *page;
+		struct slab *slab;
 
-		page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
+		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
 
-		if (page) {
-			pages += page->pages;
-			objects += page->pobjects;
+		if (slab) {
+			slabs += slab->slabs;
+			objects += slab->pobjects;
 		}
 	}
 
-	len += sysfs_emit_at(buf, len, "%d(%d)", objects, pages);
+	len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);
 
 #ifdef CONFIG_SMP
 	for_each_online_cpu(cpu) {
-		struct page *page;
+		struct slab *slab;
 
-		page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
-		if (page)
+		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
+		if (slab)
 			len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
-					     cpu, page->pobjects, page->pages);
+					     cpu, slab->pobjects, slab->slabs);
 	}
 #endif
 	len += sysfs_emit_at(buf, len, "\n");
@@ -5825,16 +5829,16 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep)
 		for_each_kmem_cache_node(s, node, n) {
 			unsigned long flags;
-			struct page *page;
+			struct slab *slab;
 
 			if (!atomic_long_read(&n->nr_slabs))
 				continue;
 
 			spin_lock_irqsave(&n->list_lock, flags);
-			list_for_each_entry(page, &n->partial, slab_list)
-				process_slab(t, s, page, alloc);
-			list_for_each_entry(page, &n->full, slab_list)
-				process_slab(t, s, page, alloc);
+			list_for_each_entry(slab, &n->partial, slab_list)
+				process_slab(t, s, slab, alloc);
+			list_for_each_entry(slab, &n->full, slab_list)
+				process_slab(t, s, slab, alloc);
 			spin_unlock_irqrestore(&n->list_lock, flags);
 		}
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 6326cdf36c4f..2b1099c986c6 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -750,7 +750,7 @@ static void free_map_bootmem(struct page *memmap)
 		>> PAGE_SHIFT;
 
 	for (i = 0; i < nr_pages; i++, page++) {
-		magic = (unsigned long) page->freelist;
+		magic = page->index;
 
 		BUG_ON(magic == NODE_INFO);
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 68e8831068f4..0661dc09e11b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -17,7 +17,7 @@
  *
  * Usage of struct page fields:
  *	page->private: points to zspage
- *	page->freelist(index): links together all component pages of a zspage
+ *	page->index: links together all component pages of a zspage
  *		For the huge page, this is always 0, so we use this field
  *		to store handle.
  *	page->units: first object offset in a subpage of zspage
@@ -827,7 +827,7 @@ static struct page *get_next_page(struct page *page)
 	if (unlikely(PageHugeObject(page)))
 		return NULL;
 
-	return page->freelist;
+	return (struct page *)page->index;
 }
 
 /**
@@ -901,7 +901,7 @@ static void reset_page(struct page *page)
 	set_page_private(page, 0);
 	page_mapcount_reset(page);
 	ClearPageHugeObject(page);
-	page->freelist = NULL;
+	page->index = 0;
 }
 
 static int trylock_zspage(struct zspage *zspage)
@@ -1027,7 +1027,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
 
 	/*
 	 * Allocate individual pages and link them together as:
-	 * 1. all pages are linked together using page->freelist
+	 * 1. all pages are linked together using page->index
	 * 2. each sub-page point to zspage using page->private
 	 *
 	 * we set PG_private to identify the first page (i.e. no other sub-page
@@ -1036,7 +1036,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
 	for (i = 0; i < nr_pages; i++) {
 		page = pages[i];
 		set_page_private(page, (unsigned long)zspage);
-		page->freelist = NULL;
+		page->index = 0;
 		if (i == 0) {
 			zspage->first_page = page;
 			SetPagePrivate(page);
@@ -1044,7 +1044,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
 				class->pages_per_zspage == 1))
 			SetPageHugeObject(page);
 		} else {
-			prev_page->freelist = page;
+			prev_page->index = (unsigned long)page;
 		}
 		prev_page = page;
 	}
On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
> - it's become apparent that there haven't been any real objections to the code
> that was queued up for 5.15. There _are_ very real discussions and points of
> contention still to be decided and resolved for the work beyond file backed
> pages, but those discussions were what derailed the more modest, and more
> badly needed, work that affects everyone in filesystem land

Unfortunately, I think this is a result of me wanting to discuss a way
forward rather than a way back.

To clarify: I do very much object to the code as currently queued up,
and not just to a vague future direction.

The patches add and convert a lot of complicated code to provision for
a future we do not agree on. The indirections it adds, and the hybrid
state it leaves the tree in, make it directly more difficult to work
with and understand the MM code base. Stuff that isn't needed for
exposing folios to the filesystems.

As Willy has repeatedly expressed a take-it-or-leave-it attitude in
response to my feedback, I'm not excited about merging this now and
potentially leaving quite a bit of cleanup work to others if the
downstream discussions don't go to his liking.

Here is the roughly annotated pull request:

mm: Convert get_page_unless_zero() to return bool
mm: Introduce struct folio
mm: Add folio_pgdat(), folio_zone() and folio_zonenum()
mm/vmstat: Add functions to account folio statistics

	Used internally and not *really* needed for filesystem
	folios... There are a couple of callsites in
	mm/page-writeback.c so I suppose it's ok.

mm/debug: Add VM_BUG_ON_FOLIO() and VM_WARN_ON_ONCE_FOLIO()
mm: Add folio reference count functions
mm: Add folio_put()
mm: Add folio_get()
mm: Add folio_try_get_rcu()
mm: Add folio flag manipulation functions
mm/lru: Add folio LRU functions

	The LRU code is used by anon and file and not needed for the
	filesystem API. And as discussed, there is generally no
	ambiguity of tail pages on the LRU list.
mm: Handle per-folio private data
mm/filemap: Add folio_index(), folio_file_page() and folio_contains()
mm/filemap: Add folio_next_index()
mm/filemap: Add folio_pos() and folio_file_pos()
mm/util: Add folio_mapping() and folio_file_mapping()
mm/filemap: Add folio_unlock()
mm/filemap: Add folio_lock()
mm/filemap: Add folio_lock_killable()
mm/filemap: Add __folio_lock_async()
mm/filemap: Add folio_wait_locked()
mm/filemap: Add __folio_lock_or_retry()
mm/swap: Add folio_rotate_reclaimable()

	More LRU code, although this one is only used by
	page-writeback... I suppose.

mm/filemap: Add folio_end_writeback()
mm/writeback: Add folio_wait_writeback()
mm/writeback: Add folio_wait_stable()
mm/filemap: Add folio_wait_bit()
mm/filemap: Add folio_wake_bit()
mm/filemap: Convert page wait queues to be folios
mm/filemap: Add folio private_2 functions
fs/netfs: Add folio fscache functions
mm: Add folio_mapped()
mm: Add folio_nid()
mm/memcg: Remove 'page' parameter to mem_cgroup_charge_statistics()
mm/memcg: Use the node id in mem_cgroup_update_tree()
mm/memcg: Remove soft_limit_tree_node()
mm/memcg: Convert memcg_check_events to take a node ID

	These are nice cleanups, unrelated to folios. Ack.

mm/memcg: Add folio_memcg() and related functions
mm/memcg: Convert commit_charge() to take a folio
mm/memcg: Convert mem_cgroup_charge() to take a folio
mm/memcg: Convert uncharge_page() to uncharge_folio()
mm/memcg: Convert mem_cgroup_uncharge() to take a folio
mm/memcg: Convert mem_cgroup_migrate() to take folios
mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio
mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock()
mm/memcg: Convert mem_cgroup_move_account() to use a folio
mm/memcg: Add folio_lruvec()
mm/memcg: Add folio_lruvec_lock() and similar functions
mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave()
mm/workingset: Convert workingset_activation to take a folio

	This is all anon+file stuff, not needed for filesystem folios.
	As per the other email, no conceptual entry point for tail
	pages into either subsystem, so no ambiguity around the
	necessity of any compound_head() calls, directly or
	indirectly. It's easy to rule out wholesale, so there is no
	justification for incrementally annotating every single use
	of the page. NAK.

mm: Add folio_pfn()
mm: Add folio_raw_mapping()
mm: Add flush_dcache_folio()
mm: Add kmap_local_folio()
mm: Add arch_make_folio_accessible()
mm: Add folio_young and folio_idle
mm/swap: Add folio_activate()
mm/swap: Add folio_mark_accessed()

	This is anon+file aging stuff, not needed.

mm/rmap: Add folio_mkclean()
mm/migrate: Add folio_migrate_mapping()
mm/migrate: Add folio_migrate_flags()
mm/migrate: Add folio_migrate_copy()

	More anon+file conversion, not needed.

mm/writeback: Rename __add_wb_stat() to wb_stat_mod()
flex_proportions: Allow N events instead of 1
mm/writeback: Change __wb_writeout_inc() to __wb_writeout_add()
mm/writeback: Add __folio_end_writeback()
mm/writeback: Add folio_start_writeback()
mm/writeback: Add folio_mark_dirty()
mm/writeback: Add __folio_mark_dirty()
mm/writeback: Convert tracing writeback_page_template to folios
mm/writeback: Add filemap_dirty_folio()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_redirty_for_writepage()
mm/filemap: Add i_blocks_per_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add readahead_folio()
mm/workingset: Convert workingset_refault() to take a folio

	Anon+file, not needed. NAK.

mm: Add folio_evictable()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm/lru: Add folio_add_lru()

	LRU code, not needed.
mm/page_alloc: Add folio allocation functions
mm/filemap: Add filemap_alloc_folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_get_folio
mm/filemap: Add FGP_STABLE
mm/writeback: Add folio_write_one

I'm counting about a thousand lines of contentious code that clearly
aren't necessary for exposing folios to the filesystems. The rest of
these are pagecache and writeback. It's still a ton of (internal) code
converted to folios that has conceptually little to no ambiguity about
head and tail pages.

As per the other email I still think it would have been good to have a
high-level discussion about the *legitimate* entry points and data
structures that will continue to deal with tail pages down the
line. To scope the actual problem that is being addressed by this
inverted/whitelist approach - so we don't annotate the entire world
just to box in a handful of page table walkers... But oh well. Not a
hill I care to die on at this point...
On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote:
> On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
> > - it's become apparent that there haven't been any real objections to the code
> > that was queued up for 5.15. There _are_ very real discussions and points of
> > contention still to be decided and resolved for the work beyond file backed
> > pages, but those discussions were what derailed the more modest, and more
> > badly needed, work that affects everyone in filesystem land
>
> Unfortunately, I think this is a result of me wanting to discuss a way
> forward rather than a way back.
>
> To clarify: I do very much object to the code as currently queued up,
> and not just to a vague future direction.
>
> The patches add and convert a lot of complicated code to provision for
> a future we do not agree on. The indirections it adds, and the hybrid
> state it leaves the tree in, make it directly more difficult to work
> with and understand the MM code base. Stuff that isn't needed for
> exposing folios to the filesystems.
>
> As Willy has repeatedly expressed a take-it-or-leave-it attitude in
> response to my feedback, I'm not excited about merging this now and
> potentially leaving quite a bit of cleanup work to others if the
> downstream discussion don't go to his liking.
>
> Here is the roughly annotated pull request:

Thanks for breaking this out, Johannes.

So: mm/filemap.c and mm/page-writeback.c - I disagree about folios not
really being needed there. Those files really belong more in fs/ than
mm/, and the code in those files needs folios the most - especially
filemap.c, a lot of those algorithms have to change from block based
to extent based, making the analogy with filesystems.
I think it makes sense to drop the mm/lru stuff, as well as the
mm/memcg, mm/migrate and mm/workingset and mm/swap stuff that you
object to - that is, the code paths that are for both file + anonymous
pages, unless Matthew has technical reasons why that would break the
rest of the patch set.

And then, we really should have a pow wow and figure out what our
options are going forward. I think we have some agreement now that not
everything is going to be a folio going forwards (Matthew already
split out his slab conversion to a new type) - so if anonymous pages
aren't becoming folios, we should prototype some stuff and see where
that helps and hurts us.

> As per the other email I still think it would have been good to have a
> high-level discussion about the *legitimate* entry points and data
> structures that will continue to deal with tail pages down the
> line. To scope the actual problem that is being addressed by this
> inverted/whitelist approach - so we don't annotate the entire world
> just to box in a handful of page table walkers...

That discussion can still happen... and there's still the potential to
get a lot more done if we're breaking open struct page and coming up
with new types. I got Matthew on board with what you wanted, re: using
the slab allocator for larger allocations.
On Wed, Sep 22, 2021 at 11:46:04AM -0400, Kent Overstreet wrote:
> On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote:
> > On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
> > > - it's become apparent that there haven't been any real objections to the code
> > > that was queued up for 5.15. There _are_ very real discussions and points of
> > > contention still to be decided and resolved for the work beyond file backed
> > > pages, but those discussions were what derailed the more modest, and more
> > > badly needed, work that affects everyone in filesystem land
> >
> > Unfortunately, I think this is a result of me wanting to discuss a way
> > forward rather than a way back.
> >
> > To clarify: I do very much object to the code as currently queued up,
> > and not just to a vague future direction.
> >
> > The patches add and convert a lot of complicated code to provision for
> > a future we do not agree on. The indirections it adds, and the hybrid
> > state it leaves the tree in, make it directly more difficult to work
> > with and understand the MM code base. Stuff that isn't needed for
> > exposing folios to the filesystems.
> >
> > As Willy has repeatedly expressed a take-it-or-leave-it attitude in
> > response to my feedback, I'm not excited about merging this now and
> > potentially leaving quite a bit of cleanup work to others if the
> > downstream discussion don't go to his liking.

We're at a take-it-or-leave-it point for this pull request. The time
for discussion was *MONTHS* ago.

> > Here is the roughly annotated pull request:
>
> Thanks for breaking this out, Johannes.
>
> So: mm/filemap.c and mm/page-writeback.c - I disagree about folios not really
> being needed there. Those files really belong more in fs/ than mm/, and the code
> in those files needs folios the most - especially filemap.c, a lot of those
> algorithms have to change from block based to extent based, making the analogy
> with filesystems.
>
> I think it makes sense to drop the mm/lru stuff, as well as the mm/memcg,
> mm/migrate and mm/workingset and mm/swap stuff that you object to - that is, the
> code paths that are for both file + anonymous pages, unless Matthew has
> technical reasons why that would break the rest of the patch set.

Conceptually, it breaks the patch set. Anywhere that we convert back
from a folio to a page, the guarantee of folios is weakened (and
possibly violated). I don't think it makes sense from a practical
point of view either; it's re-adding compound_head() calls that just
don't need to be there.

> That discussion can still happen... and there's still the potential to get a lot
> more done if we're breaking open struct page and coming up with new types. I got
> Matthew on board with what you wanted, re: using the slab allocator for larger
> allocations

Wait, no, you didn't. I think it's a terrible idea. It's just
completely orthogonal to this patch set, so I don't want to talk about
it.
> On Sep 22, 2021, at 12:26 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Sep 22, 2021 at 11:46:04AM -0400, Kent Overstreet wrote:
>> On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote:
>>> On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
>>>> - it's become apparent that there haven't been any real objections to the code
>>>> that was queued up for 5.15. There _are_ very real discussions and points of
>>>> contention still to be decided and resolved for the work beyond file backed
>>>> pages, but those discussions were what derailed the more modest, and more
>>>> badly needed, work that affects everyone in filesystem land
>>>
>>> Unfortunately, I think this is a result of me wanting to discuss a way
>>> forward rather than a way back.
>>>
>>> To clarify: I do very much object to the code as currently queued up,
>>> and not just to a vague future direction.
>>>
>>> The patches add and convert a lot of complicated code to provision for
>>> a future we do not agree on. The indirections it adds, and the hybrid
>>> state it leaves the tree in, make it directly more difficult to work
>>> with and understand the MM code base. Stuff that isn't needed for
>>> exposing folios to the filesystems.
>>>
>>> As Willy has repeatedly expressed a take-it-or-leave-it attitude in
>>> response to my feedback, I'm not excited about merging this now and
>>> potentially leaving quite a bit of cleanup work to others if the
>>> downstream discussion don't go to his liking.
>
> We're at a take-it-or-leave-it point for this pull request. The time
> for discussion was *MONTHS* ago.
>

I’ll admit I’m not impartial, but my fundamental goal is moving the
patches forward.
Given folios will need long term maintenance, engagement, and
iteration throughout mm/, take-it-or-leave-it pulls seem like a recipe
for future conflict, and more importantly, bugs.

I’d much rather work it out now.

-chris

On Wed, Sep 22, 2021 at 04:56:16PM +0000, Chris Mason wrote:
> > On Sep 22, 2021, at 12:26 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > We're at a take-it-or-leave-it point for this pull request. The time
> > for discussion was *MONTHS* ago.
>
> I’ll admit I’m not impartial, but my fundamental goal is moving the
> patches forward. Given folios will need long term maintenance,
> engagement, and iteration throughout mm/, take-it-or-leave-it pulls
> seem like a recipe for future conflict, and more importantly, bugs.
>
> I’d much rather work it out now.

That's the nature of a pull request. It's binary -- either it's pulled
or it's rejected. Well, except that Linus has opted for silence,
leaving me in limbo. I have no idea what he's thinking. I don't know
if he agrees with Johannes. I don't know what needs to change for
Linus to like this series enough to pull it (either now or in the 5.16
merge window). And that makes me frustrated. This is over a year of
work from me and others, and it's being held up over concerns which
seem to me to be entirely insubstantial (the name "folio"? really?
and even my change to use "pageset" was met with silence from Linus.)

I agree with Kent & Johannes that struct page is a mess. I agree that
cleaning it up will bring many benefits. I've even started a design
document here: https://kernelnewbies.org/MemoryTypes

I do see some advantages to splitting out anon memory descriptors from
file memory descriptors, but there is also plenty of code which
handles both types in the same way. I see the requests to continue to
use struct page to mean a "memory descriptor which is either anon or
file", but I really think that's the wrong approach. A struct page
should represent /a page/ of memory. Otherwise we're just confusing
people. I know it's a confusion we've had since compound pages were
introduced, what, 25+ years ago, but that expediency has overstayed
its welcome.

The continued silence from Linus is really driving me to despair.

I'm sorry I've been so curt with some of the requests. I really am
willing to change things; I wasn't planning on doing anything with
slab until Kent prodded me to do it.
But equally, I strongly believe that everything I've done here is a step towards the things that everybody wants, and I'm frustrated that it's being perceived as a step away, or even to the side of what people want. So ... if any of you have Linus' ear. Maybe you're at a conference with him later this week. Please, just get him to tell me what I need to do to make him happy with this patchset.
On Wed, Sep 22, 2021 at 08:54:11PM +0100, Matthew Wilcox wrote:
> That's the nature of a pull request. It's binary -- either it's pulled or
> it's rejected. Well, except that Linus has opted for silence, leaving
> me in limbo. I have no idea what he's thinking. I don't know if he
> agrees with Johannes. I don't know what needs to change for Linus to
> like this series enough to pull it (either now or in the 5.16 merge
> window). And that makes me frustrated. This is over a year of work
> from me and others, and it's being held up over concerns which seem to
> me to be entirely insubstantial (the name "folio"? really? and even
> my change to use "pageset" was met with silence from Linus.)

People bikeshed the naming when they're uncomfortable with what's
being proposed and have nothing substantive to say, and people are
uncomfortable with what's being proposed when there's clear
disagreement between major stakeholders who aren't working with each
other.

And the utterly ridiculous part of this whole fiasco is that you and
Johannes have a LOT of common ground regarding the larger picture of
what we do with the struct page mess, but you two keep digging in your
heels purely because you're convinced that you can't work with each
other, so you need to either route around each other or be as forceful
as possible to get what you want. You're convinced you're not
listening to each other, but even that isn't true, because when I pass
ideas back and forth between you and they come from "not Matthew" or
"not Johannes" you both listen and incorporate them just fine.

We can't have a process where major stakeholders are trying to
actively sabotage each other's efforts, which is pretty close to where
we're at now. You two just need to learn to work with each other.
On Wed, Sep 22, 2021 at 12:56 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> The continued silence from Linus is really driving me to despair.

No need to despair. The silence isn't some "deep" thing. What happened
is literally that I wasn't 100% happy with the naming, but didn't hate
the patches, and still don't. But when there is still active
discussion about them during the merge window, I'm just not going to
merge them.

The silence literally is just due to that - not participating in the
discussion for the simple reason that I had no hugely strong opinions
on my side - but also simply because there is no way I'd merge this
for 5.15 simply exactly _because_ of this discussion.

Normally I get to clean up my inbox the week after the merge window,
but the -Werror things kept my attention for one extra week, and so my
mailbox has been a disaster area as a result. So only today does my
inbox start to look reasonable again after the merge window (not
because of the extra email during the merge window, but simply because
the merge window causes me to ignore non-pull emails, and then I need
to go back and check the other stuff afterwards).

So I'm not particularly unhappy with the patchset. I understand where
it is coming from, I have no huge technical disagreement with it
personally. That said, I'm not hugely _enthused_ about the mm side of
it either, which is why I also wouldn't just override the discussion
and say "that's it, I'm merging it". I basically wanted to see if it
led somewhere. I'm not convinced it led anywhere, but that didn't
really change things for me, except for the "yeah, I'm not merging
something core like this while it's under active discussion" part.

              Linus
On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote:

...

> +/**
> + * page_slab - Converts from page to slab.
> + * @p: The page.
> + *
> + * This function cannot be called on a NULL pointer. It can be called
> + * on a non-slab page; the caller should check is_slab() to be sure
> + * that the slab really is a slab.
> + *
> + * Return: The slab which contains this page.
> + */
> +#define page_slab(p) (_Generic((p), \
> +	const struct page *: (const struct slab *)_compound_head(p), \
> +	struct page *: (struct slab *)_compound_head(p)))
> +
> +static inline bool is_slab(struct slab *slab)
> +{
> +	return test_bit(PG_slab, &slab->flags);
> +}
> +

I'm sorry, I don't have a dog in this fight and conceptually I think
folios are a good idea...

But for this work, having a call which returns if a 'struct slab'
really is a 'struct slab' seems odd and, well, IMHO, wrong. Why can't
page_slab() return NULL if there is no slab containing that page?

Ira
On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote: > On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote: > > +/** > > + * page_slab - Converts from page to slab. > > + * @p: The page. > > + * > > + * This function cannot be called on a NULL pointer. It can be called > > + * on a non-slab page; the caller should check is_slab() to be sure > > + * that the slab really is a slab. > > + * > > + * Return: The slab which contains this page. > > + */ > > +#define page_slab(p) (_Generic((p), \ > > + const struct page *: (const struct slab *)_compound_head(p), \ > > + struct page *: (struct slab *)_compound_head(p))) > > + > > +static inline bool is_slab(struct slab *slab) > > +{ > > + return test_bit(PG_slab, &slab->flags); > > +} > > + > > I'm sorry, I don't have a dog in this fight and conceptually I think folios are > a good idea... > > But for this work, having a call which returns if a 'struct slab' really is a > 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return > NULL if there is no slab containing that page? No, this is a good question. The way slub works right now is that if you ask for a "large" allocation, it does: flags |= __GFP_COMP; page = alloc_pages_node(node, flags, order); and returns page_address(page) (eventually; the code is more complex) So when you call kfree(), it uses the PageSlab flag to determine if the allocation was "large" or not: page = virt_to_head_page(x); if (unlikely(!PageSlab(page))) { free_nonslab_page(page, object); return; } slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); Now, you could say that this is a bad way to handle things, and every allocation from slab should have PageSlab set, and it should use one of the many other bits in page->flags to indicate whether it's a large allocation or not. I may have feelings in that direction myself. But I don't think I should be changing that in this patch. Maybe calling this function is_slab() is the confusing thing. 
Perhaps it should be called SlabIsLargeAllocation(). Not sure.
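The kfree() dispatch Matthew walks through above can be condensed into a tiny
standalone model. This is an illustrative sketch only, not kernel code: the
struct and flag below are invented stand-ins, and the real kfree() does
considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Toy model of the kfree() dispatch described above.  PG_slab on the
 * head page is what distinguishes a slab-managed object from a "large"
 * allocation handed straight to the page allocator.  All names here
 * are illustrative stand-ins, not the kernel's real definitions.
 */
#define PG_slab	(1UL << 0)

struct toy_slab {
	unsigned long flags;		/* stands in for page->flags */
};

static bool toy_is_slab(const struct toy_slab *slab)
{
	return slab->flags & PG_slab;
}

/* Which free path would a toy kfree() take for this allocation? */
static const char *toy_kfree_path(const struct toy_slab *slab)
{
	if (!toy_is_slab(slab))
		return "free_nonslab_page";	/* page-allocator-backed */
	return "slab_free";			/* slab-managed object */
}
```

This also makes Ira's objection concrete: the flag test answers "was this a
slab allocation at all?", not "is this struct really a slab?".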
On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote:
> On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote:
> > - it's become apparent that there haven't been any real objections to the code
> > that was queued up for 5.15. There _are_ very real discussions and points of
> > contention still to be decided and resolved for the work beyond file backed
> > pages, but those discussions were what derailed the more modest, and more
> > badly needed, work that affects everyone in filesystem land
>
> Unfortunately, I think this is a result of me wanting to discuss a way
> forward rather than a way back.
>
> To clarify: I do very much object to the code as currently queued up,
> and not just to a vague future direction.
>
> The patches add and convert a lot of complicated code to provision for
> a future we do not agree on. The indirections it adds, and the hybrid
> state it leaves the tree in, make it directly more difficult to work
> with and understand the MM code base. Stuff that isn't needed for
> exposing folios to the filesystems.

I think something we need is an alternate view - anon_folio, perhaps - and an
idea of what that would look like. Because you've been saying you don't think
file pages and anonymous pages are similar enough to be the same type - so if
they're not, how's the code that works on both types of pages going to change
to accommodate that?

Do we have if (file_folio) else if (anon_folio) both doing the same thing, but
operating on different types? Some sort of subclassing going on?

I was agreeing with you that slab/network pools etc. shouldn't be folios - that
folios shouldn't be a replacement for compound pages. But I think we're going
to need a serious alternative proposal for anonymous pages if you're still
against them becoming folios, especially because according to Kirill they're
already working on that (and you have to admit transhuge pages did introduce a
mess that they will help with...)
On Thu, Sep 23, 2021 at 01:42:17AM -0400, Kent Overstreet wrote: > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > On Tue, Sep 21, 2021 at 05:22:54PM -0400, Kent Overstreet wrote: > > > - it's become apparent that there haven't been any real objections to the code > > > that was queued up for 5.15. There _are_ very real discussions and points of > > > contention still to be decided and resolved for the work beyond file backed > > > pages, but those discussions were what derailed the more modest, and more > > > badly needed, work that affects everyone in filesystem land > > > > Unfortunately, I think this is a result of me wanting to discuss a way > > forward rather than a way back. > > > > To clarify: I do very much object to the code as currently queued up, > > and not just to a vague future direction. > > > > The patches add and convert a lot of complicated code to provision for > > a future we do not agree on. The indirections it adds, and the hybrid > > state it leaves the tree in, make it directly more difficult to work > > with and understand the MM code base. Stuff that isn't needed for > > exposing folios to the filesystems. > > I think something we need is an alternate view - anon_folio, perhaps - and an > idea of what that would look like. Because you've been saying you don't think > file pages and anymous pages are similar enough to be the same time - so if > they're not, how's the code that works on both types of pages going to change to > accomadate that? > > Do we have if (file_folio) else if (anon_folio) both doing the same thing, but > operating on different types? Some sort of subclassing going on? Yeah, with subclassing and a generic type for shared code. 
I outlined that earlier in the thread:

https://lore.kernel.org/all/YUo20TzAlqz8Tceg@cmpxchg.org/

So you have anon_page and file_page being subclasses of page - similar
to how filesystems have subclasses that inherit from struct inode - to
help refactor what is generic, what isn't, and highlight what should be.

Whether we do anon_page and file_page inheriting from struct page, or
anon_folio and file_folio inheriting from struct folio - either would
work of course. Again I think it comes down to the value proposition
of folio as a means to clean up compound pages inside the MM code.

It's pretty uncontroversial that we want PAGE_SIZE assumptions gone
from the filesystems, networking, drivers and other random code. The
argument for MM code is a different one. We seem to be discussing the
folio abstraction as a binary thing for the Linux kernel, rather than
a selectively applied tool, and I think it prevents us from doing
proper one-by-one cost/benefit analyses on the areas of application.

I suggested the anon/file split as an RFC to sidestep the cost/benefit
question of doing the massive folio change in MM just to clean up the
compound pages; taking the idea of redoing the page typing, just in a
way that would maybe benefit MM code more broadly and obviously.

> I was agreeing with you that slab/network pools etc. shouldn't be folios - that
> folios shouldn't be a replacement for compound pages. But I think we're going to
> need a serious alternative proposal for anonymous pages if you're still against
> them becoming folios, especially because according to Kirill they're already
> working on that (and you have to admit transhuge pages did introduce a mess that
> they will help with...)

I think we need a better analysis of that mess and a concept of where
tailpages are and should be, if that is the justification for the MM
conversion.

The motivation is that we have a ton of compound_head() calls in
places we don't need them. No argument there, I think.
But the explanation for going with whitelisting - the most invasive
approach possible (and which leaves more than one person "unenthused"
about that part of the patches) - is that it's difficult and
error-prone to identify which ones are necessary and which ones are
not. And maybe that we'll continue to have a widespread hybrid
existence of head and tail pages that will continue to require
clarification.

But that seems to be an article of faith. It's implied by the
approach, but this may or may not be the case.

I certainly think it used to be messier in the past. But strides have
been made already to narrow the channels through which tail pages can
actually enter the code. Certainly we can rule out entire MM
subsystems and simply declare their compound_head() usage unnecessary
with little risk or ambiguity.

Then the question becomes which ones are legit. Whether anybody
outside the page allocator ever needs to *see* a tailpage struct page
to begin with. (Arguably that bit in __split_huge_page_tail() could be
a page allocator function; the pte handling is pfn-based except for
the mapcount management, which could be encapsulated; the collapse code
uses vm_normal_page() but follows it quickly by compound_head() - and
arguably a tailpage generally isn't a "normal" vm page, so a new
pfn_to_normal_page() could encapsulate the compound_head()).

Because if not, seeing struct page in MM code isn't nearly as
ambiguous as is being implied. You would never have to worry about
it - unless you are in fact the page allocator.

So if this problem could be solved by making tail pages an
encapsulated page_alloc thing, and chasing down the rest of
find_subpage() callers (which needs to happen anyway), I don't think a
wholesale folio conversion of this subsystem would be justified.

A more in-depth analysis of where and how we need to deal with
tailpages - laying out the data structures that hold them and code
entry points for them - would go a long way for making the case for
folios.
And might convince reluctant people to get behind the effort. Or show that we don't need it. Either way, it seems like a win-win. But I do think the onus for explaining why the particular approach was chosen against much less invasive options is on the person pushing the changes. And it should be more detailed than "we all know it sucks".
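The inode-style subclassing Johannes proposes can be sketched in a few lines
of C. This is a hypothetical illustration of the pattern only, not proposed
kernel code; all of the field names below are invented.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the proposed inheritance pattern: as with struct inode and
 * its filesystem subclasses, the generic struct is embedded as the
 * first member so that pointer conversions work in both directions.
 * Every field here is invented for illustration.
 */
struct toy_page {
	unsigned long flags;		/* generic state shared by all */
};

struct toy_file_page {
	struct toy_page page;		/* must be the first member */
	void *mapping;			/* file-cache-specific state */
};

struct toy_anon_page {
	struct toy_page page;		/* must be the first member */
	void *anon_vma;			/* anon-specific state */
};

/* Generic code operates on the shared base type... */
static unsigned long toy_page_flags(const struct toy_page *page)
{
	return page->flags;
}

/* ...and typed code converts explicitly at the boundary. */
static struct toy_file_page *toy_page_to_file(struct toy_page *page)
{
	return (struct toy_file_page *)page;	/* caller checked the type */
}
```

The cast is only valid because the base struct sits at offset zero - the same
layout trick container_of()-style code in the kernel relies on.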
On Thu, Sep 23, 2021 at 02:00:46PM -0400, Johannes Weiner wrote: > On Thu, Sep 23, 2021 at 01:42:17AM -0400, Kent Overstreet wrote: > > I think something we need is an alternate view - anon_folio, perhaps - and an > > idea of what that would look like. Because you've been saying you don't think > > file pages and anymous pages are similar enough to be the same time - so if > > they're not, how's the code that works on both types of pages going to change to > > accomadate that? > > > > Do we have if (file_folio) else if (anon_folio) both doing the same thing, but > > operating on different types? Some sort of subclassing going on? > > Yeah, with subclassing and a generic type for shared code. I outlined > that earlier in the thread: > > https://lore.kernel.org/all/YUo20TzAlqz8Tceg@cmpxchg.org/ > > So you have anon_page and file_page being subclasses of page - similar > to how filesystems have subclasses that inherit from struct inode - to > help refactor what is generic, what isn't, and highlight what should be. I'm with you there. I don't understand anon pages well enough to know whether splitting them out from file pages is good or bad. I had assumed that if it were worth doing, they would have gained their own named members in the page union, but perhaps that didn't happen in order to keep the complexity of the union down? > Whether we do anon_page and file_page inheriting from struct page, or > anon_folio and file_folio inheriting from struct folio - either would > work of course. Again I think it comes down to the value proposition > of folio as a means to clean up compound pages inside the MM code. > It's pretty uncontroversial that we want PAGE_SIZE assumptions gone > from the filesystems, networking, drivers and other random code. The > argument for MM code is a different one. 
> We seem to be discussing the
> folio abstraction as a binary thing for the Linux kernel, rather than
> a selectively applied tool, and I think it prevents us from doing
> proper one-by-one cost/benefit analyses on the areas of application.

I wasn't originally planning on doing nearly as much as Kent has
opened me up to. Slab seems like a clear win to split out. Page
tables seem like they will be too. I'd like to get to these structs:

struct page {
	unsigned long flags;
	unsigned long compound_head;
	union {
		struct {	/* First tail page only */
			unsigned char compound_dtor;
			unsigned char compound_order;
			atomic_t compound_mapcount;
			unsigned int compound_nr;
		};
		struct {	/* Second tail page only */
			atomic_t hpage_pinned_refcount;
			struct list_head deferred_list;
		};
		unsigned long padding1[5];
	};
	unsigned int padding2[2];
#ifdef CONFIG_MEMCG
	unsigned long padding3;
#endif
#ifdef WANT_PAGE_VIRTUAL
	void *virtual;
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
};

struct slab {
	... slab specific stuff here ...
};

struct page_table {
	... pgtable stuff here ...
};

struct folio {
	unsigned long flags;
	union {
		struct {
			struct list_head lru;
			struct address_space *mapping;
			pgoff_t index;
			void *private;
		};
		struct {
			... net pool here ...
		};
		struct {
			... zone device here ...
		};
	};
	atomic_t _mapcount;
	atomic_t _refcount;
#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif
};

ie a 'struct page' contains no information on its own. You have to go
to the compound_head page (cast to the appropriate type) to find the
information.

What Kent is proposing is exciting, but I think further off.

> > I was agreeing with you that slab/network pools etc. shouldn't be folios - that
> > folios shouldn't be a replacement for compound pages.
But I think we're going to > > need a serious alternative proposal for anonymous pages if you're still against > > them becoming folios, especially because according to Kirill they're already > > working on that (and you have to admit transhuge pages did introduce a mess that > > they will help with...) > > I think we need a better analysis of that mess and a concept where > tailpages are and should be, if that is the justification for the MM > conversion. > > The motivation is that we have a ton of compound_head() calls in > places we don't need them. No argument there, I think. > > But the explanation for going with whitelisting - the most invasive > approach possible (and which leaves more than one person "unenthused" > about that part of the patches) - is that it's difficult and error > prone to identify which ones are necessary and which ones are not. And > maybe that we'll continue to have a widespread hybrid existence of > head and tail pages that will continue to require clarification. > > But that seems to be an article of faith. It's implied by the > approach, but this may or may not be the case. > > I certainly think it used to be messier in the past. But strides have > been made already to narrow the channels through which tail pages can > actually enter the code. Certainly we can rule out entire MM > subsystems and simply declare their compound_head() usage unnecessary > with little risk or ambiguity. > > Then the question becomes which ones are legit. Whether anybody > outside the page allocator ever needs to *see* a tailpage struct page > to begin with. (Arguably that bit in __split_huge_page_tail() could be > a page allocator function; the pte handling is pfn-based except for > the mapcount management which could be encapsulated; the collapse code > uses vm_normal_page() but follows it quickly by compound_head() - and > arguably a tailpage generally isn't a "normal" vm page, so a new > pfn_to_normal_page() could encapsulate the compound_head()). 
> Because
> if not, seeing struct page in MM code isn't nearly as ambiguous as is
> being implied. You would never have to worry about it - unless you are
> in fact the page allocator.
>
> So if this problem could be solved by making tail pages an
> encapsulated page_alloc thing, and chasing down the rest of
> find_subpage() callers (which needs to happen anyway), I don't think a
> wholesale folio conversion of this subsystem would be justified.
>
> A more in-depth analyses of where and how we need to deal with
> tailpages - laying out the data structures that hold them and code
> entry points for them - would go a long way for making the case for
> folios. And might convince reluctant people to get behind the effort.

OK. So filesystems still need to deal with pages in some places. One
place is at the bottom of the filesystem, where memory gets packaged
into BIOs or SKBs to eventually participate in DMA:

struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

That could become a folio (or Christoph's preferred option, a
phys_addr_t), but this is really an entirely different role for struct
page; it's just carrying the address of some memory for I/O to happen
to. Nobody looks at the contents of the struct page until it goes back
to the filesystem, at which point it clears the writeback bit or marks
it uptodate.

The other place that definitely still needs to be a struct page is

struct vm_fault {
...
	struct page *page;	/* ->fault handlers should return a
				 * page here, unless VM_FAULT_NOPAGE
				 * is set (which is also implied by
				 * VM_FAULT_ERROR).
				 */
...
};

Most filesystems use filemap_fault(), which handles this, but this
affects device drivers too. We can't return a folio here because we
need to know which page corresponds to the address that took the fault.
We can deduce it for filesystems, because we know how folios are allocated for the page cache, but device drivers can map memory absolutely arbitrarily, so there's no way to reconstruct that information. Again, this could be a physical address (or a pfn), but we have it as a page because it's locked and we're going to unlock it after mapping it. So this is actually a place where we'll need to get a page from the filesystem, convert to a folio and call folio operations on it. This is one of the reasons that lock_page() / unlock_page() contain the embedded compound_head() today.
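The "we can deduce it for filesystems" point comes down to simple index
arithmetic, because page-cache folios cover a contiguous, naturally aligned
range of file offsets. A minimal sketch under that assumption - the types and
names below are simplified stand-ins, not kernel definitions:

```c
#include <assert.h>

/*
 * Toy model of recovering the faulting subpage from a folio: a
 * page-cache folio covers a contiguous range of page-cache indices
 * starting at folio->index, so the file offset that faulted pins down
 * the exact page within it.  Simplified stand-in types, not kernel code.
 */
#define TOY_PAGE_SHIFT	12

struct toy_folio {
	unsigned long index;	/* first page-cache index covered */
	unsigned int order;	/* folio spans 1 << order pages */
};

/* Index of the subpage (within the folio) backing byte offset @pos. */
static unsigned long toy_folio_subpage(const struct toy_folio *folio,
				       unsigned long pos)
{
	unsigned long index = pos >> TOY_PAGE_SHIFT;

	assert(index >= folio->index);
	assert(index < folio->index + (1UL << folio->order));
	return index - folio->index;
}
```

A driver mapping arbitrary memory has no such index relationship, which is
exactly why the deduction only works for the page cache.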
On Thu, Sep 23, 2021 at 02:00:46PM -0400, Johannes Weiner wrote:
> Yeah, with subclassing and a generic type for shared code. I outlined
> that earlier in the thread:
>
> https://lore.kernel.org/all/YUo20TzAlqz8Tceg@cmpxchg.org/
>
> So you have anon_page and file_page being subclasses of page - similar
> to how filesystems have subclasses that inherit from struct inode - to
> help refactor what is generic, what isn't, and highlight what should be.
>
> Whether we do anon_page and file_page inheriting from struct page, or
> anon_folio and file_folio inheriting from struct folio - either would
> work of course.

If we go that route, my preference would be for completely separate
anon_folio and file_folio types - separately allocated when we get
there, both completely their own thing.

I think even in languages that have it, data inheritance is kind of
evil and I prefer to avoid it - even if that means having code that
does if (anon_folio) else if (file_folio) where both branches do the
exact same thing.

For the LRU lists we might be able to create a new type wrapping a
list head, and embed that in both file_folio and anon_folio, and pass
that type to the LRU code. I'm just spitballing ideas though, you know
that code better than I do.

> Again I think it comes down to the value proposition
> of folio as a means to clean up compound pages inside the MM code.
> It's pretty uncontroversial that we want PAGE_SIZE assumptions gone
> from the filesystems, networking, drivers and other random code. The
> argument for MM code is a different one. We seem to be discussing the
> folio abstraction as a binary thing for the Linux kernel, rather than
> a selectively applied tool, and I think it prevents us from doing
> proper one-by-one cost/benefit analyses on the areas of application.
> > I suggested the anon/file split as an RFC to sidestep the cost/benefit > question of doing the massive folio change in MM just to cleanup the > compound pages; takeing the idea of redoing the page typing, just in a > way that would maybe benefit MM code more broadly and obviously. It's not just compound pages though - THPs introduced a lot of if (normal page) else if (hugepage) stuff that needs to be cleaned up. Also, by enabling arbitrary size compound pages for anonymous memory, this is going to help with memory fragmentation - right now, the situation for anonymous pages is all or nothing, normal page or hugepage, and since most of the time it ends up being normal pages we end up fragmenting memory unnecessarily. I don't think it'll have anywhere near the performance impact for anonymous pages as it will for file pages, but we should still see some performance gains too. That's all true though whether or not anonymous pages end up using the same type as folios though, so it's not an argument either way. > I think we need a better analysis of that mess and a concept where > tailpages are and should be, if that is the justification for the MM > conversion. > > The motivation is that we have a ton of compound_head() calls in > places we don't need them. No argument there, I think. I don't think that's the main motivation at this point, though. See the struct page proposal document I wrote last night - several of the ideas in there are yours. The compound vs. tail page confusion is just one of many birds we can kill with this stone. I'd really love to hear your thoughts on that document btw - I want to know if we're on the same page and if I accurately captured your ideas and if you've got more to add. 
> But the explanation for going with whitelisting - the most invasive > approach possible (and which leaves more than one person "unenthused" > about that part of the patches) - is that it's difficult and error > prone to identify which ones are necessary and which ones are not. And > maybe that we'll continue to have a widespread hybrid existence of > head and tail pages that will continue to require clarification. > > But that seems to be an article of faith. It's implied by the > approach, but this may or may not be the case. > > I certainly think it used to be messier in the past. But strides have > been made already to narrow the channels through which tail pages can > actually enter the code. Certainly we can rule out entire MM > subsystems and simply declare their compound_head() usage unnecessary > with little risk or ambiguity. This sounds like we're not using assertions nearly enough. The primary use of assertions isn't to catch where we've fucked and don't have a way to recover - the right way to think of assertions is that they're for documenting invariants in a way that can't go out of date, like comments can. They're almost as good as doing it with the type system. > Then the question becomes which ones are legit. Whether anybody > outside the page allocator ever needs to *see* a tailpage struct page > to begin with. (Arguably that bit in __split_huge_page_tail() could be > a page allocator function; the pte handling is pfn-based except for > the mapcount management which could be encapsulated; the collapse code > uses vm_normal_page() but follows it quickly by compound_head() - and > arguably a tailpage generally isn't a "normal" vm page, so a new > pfn_to_normal_page() could encapsulate the compound_head()). Because > if not, seeing struct page in MM code isn't nearly as ambiguous as is > being implied. You would never have to worry about it - unless you are > in fact the page allocator. 
> > So if this problem could be solved by making tail pages an > encapsulated page_alloc thing, and chasing down the rest of > find_subpage() callers (which needs to happen anyway), I don't think a > wholesale folio conversion of this subsystem would be justified. > > A more in-depth analyses of where and how we need to deal with > tailpages - laying out the data structures that hold them and code > entry points for them - would go a long way for making the case for > folios. And might convince reluctant people to get behind the effort. Alternately - imagine we get to the struct page proposal I laid out. What code is still going to deal with struct page, and which code is going to change to working with some subtype of page?
On Thu, Sep 23, 2021 at 04:41:04AM +0100, Matthew Wilcox wrote: > On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote: > > On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote: > > > +/** > > > + * page_slab - Converts from page to slab. > > > + * @p: The page. > > > + * > > > + * This function cannot be called on a NULL pointer. It can be called > > > + * on a non-slab page; the caller should check is_slab() to be sure > > > + * that the slab really is a slab. > > > + * > > > + * Return: The slab which contains this page. > > > + */ > > > +#define page_slab(p) (_Generic((p), \ > > > + const struct page *: (const struct slab *)_compound_head(p), \ > > > + struct page *: (struct slab *)_compound_head(p))) > > > + > > > +static inline bool is_slab(struct slab *slab) > > > +{ > > > + return test_bit(PG_slab, &slab->flags); > > > +} > > > + > > > > I'm sorry, I don't have a dog in this fight and conceptually I think folios are > > a good idea... > > > > But for this work, having a call which returns if a 'struct slab' really is a > > 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return > > NULL if there is no slab containing that page? > > No, this is a good question. > > The way slub works right now is that if you ask for a "large" allocation, > it does: > > flags |= __GFP_COMP; > page = alloc_pages_node(node, flags, order); > > and returns page_address(page) (eventually; the code is more complex) > So when you call kfree(), it uses the PageSlab flag to determine if the > allocation was "large" or not: > > page = virt_to_head_page(x); > if (unlikely(!PageSlab(page))) { > free_nonslab_page(page, object); > return; > } > slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); > > Now, you could say that this is a bad way to handle things, and every > allocation from slab should have PageSlab set, Yea basically. So what makes 'struct slab' different from 'struct page' in an order 0 allocation? 
Am I correct in deducing that PG_slab is not set in that case? > and it should use one of > the many other bits in page->flags to indicate whether it's a large > allocation or not. Isn't the fact that it is a compound page enough to know that? > I may have feelings in that direction myself. > But I don't think I should be changing that in this patch. > > Maybe calling this function is_slab() is the confusing thing. > Perhaps it should be called SlabIsLargeAllocation(). Not sure. Well that makes a lot more sense to me from an API standpoint but checking PG_slab is still likely to raise some eyebrows. Regardless I like the fact that the community is at least attempting to fix stuff like this. Because adding types like this make it easier for people like me to understand what is going on. Ira
On Thu, Sep 23, 2021 at 03:12:41PM -0700, Ira Weiny wrote: > On Thu, Sep 23, 2021 at 04:41:04AM +0100, Matthew Wilcox wrote: > > On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote: > > > On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote: > > > > +/** > > > > + * page_slab - Converts from page to slab. > > > > + * @p: The page. > > > > + * > > > > + * This function cannot be called on a NULL pointer. It can be called > > > > + * on a non-slab page; the caller should check is_slab() to be sure > > > > + * that the slab really is a slab. > > > > + * > > > > + * Return: The slab which contains this page. > > > > + */ > > > > +#define page_slab(p) (_Generic((p), \ > > > > + const struct page *: (const struct slab *)_compound_head(p), \ > > > > + struct page *: (struct slab *)_compound_head(p))) > > > > + > > > > +static inline bool is_slab(struct slab *slab) > > > > +{ > > > > + return test_bit(PG_slab, &slab->flags); > > > > +} > > > > + > > > > > > I'm sorry, I don't have a dog in this fight and conceptually I think folios are > > > a good idea... > > > > > > But for this work, having a call which returns if a 'struct slab' really is a > > > 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return > > > NULL if there is no slab containing that page? > > > > No, this is a good question. 
> > > > The way slub works right now is that if you ask for a "large" allocation, > > it does: > > > > flags |= __GFP_COMP; > > page = alloc_pages_node(node, flags, order); > > > > and returns page_address(page) (eventually; the code is more complex) > > So when you call kfree(), it uses the PageSlab flag to determine if the > > allocation was "large" or not: > > > > page = virt_to_head_page(x); > > if (unlikely(!PageSlab(page))) { > > free_nonslab_page(page, object); > > return; > > } > > slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_); > > > > Now, you could say that this is a bad way to handle things, and every > > allocation from slab should have PageSlab set, > > Yea basically. > > So what makes 'struct slab' different from 'struct page' in an order 0 > allocation? Am I correct in deducing that PG_slab is not set in that case? You might mean a couple of different things by that question, so let me say some things which are true (on x86) but might not answer your question: If you kmalloc(4095) bytes, it comes from a slab. That slab would usually be an order-3 allocation. If that order-3 allocation fails, slab might go as low as an order-0 allocation, but PageSlab will always be set on that head/base page because the allocation is smaller than two pages. If you kmalloc(8193) bytes, slub throws up its hands and does an allocation from the page allocator. So it allocates an order-2 page, does not set PG_slab on it, but PG_head is set on the head page and PG_tail is set on all three tail pages. > > and it should use one of > > the many other bits in page->flags to indicate whether it's a large > > allocation or not. > > Isn't the fact that it is a compound page enough to know that? No -- regular slab allocations have PG_head set. But it could use, eg, slab->slab_cache == NULL to distinguish page allocations from slab allocations. > > I may have feelings in that direction myself. > > But I don't think I should be changing that in this patch. 
> >
> > Maybe calling this function is_slab() is the confusing thing.
> > Perhaps it should be called SlabIsLargeAllocation(). Not sure.
>
> Well that makes a lot more sense to me from an API standpoint but checking
> PG_slab is still likely to raise some eyebrows.

Yeah. Here's what I have right now:

+static inline bool SlabMultiPage(const struct slab *slab)
+{
+	return test_bit(PG_head, &slab->flags);
+}
+
+/* Did this allocation come from the page allocator instead of slab? */
+static inline bool SlabPageAllocation(const struct slab *slab)
+{
+	return !test_bit(PG_slab, &slab->flags);
+}

> Regardless I like the fact that the community is at least attempting to fix
> stuff like this. Because adding types like this make it easier for people like
> me to understand what is going on.

Yes, I dislike that 'struct page' is so hard to understand, and so easy
to misuse. It's a very weak type.
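The size cutoff Matthew walks through (kmalloc(4095) stays slab-backed,
kmalloc(8193) falls through to the page allocator) can be modeled in a couple
of lines. A hypothetical illustration, with constants mirroring the x86
defaults discussed above:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the slub size cutoff described above: requests up to
 * two pages are served from slab caches (so PG_slab is set on the
 * head/base page), while anything larger goes straight to the page
 * allocator as a compound page without PG_slab.  The constants mirror
 * the x86 defaults from the thread; the rest is illustrative.
 */
#define TOY_PAGE_SIZE			4096UL
#define TOY_KMALLOC_MAX_CACHE_SIZE	(2 * TOY_PAGE_SIZE)	/* 8192 */

/* Would a toy kmalloc() of @size leave PG_slab set on the result? */
static bool toy_kmalloc_sets_pg_slab(unsigned long size)
{
	return size <= TOY_KMALLOC_MAX_CACHE_SIZE;
}
```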
On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> One one hand, the ambition appears to substitute folio for everything
> that could be a base page or a compound page even inside core MM
> code. Since there are very few places in the MM code that expressly
> deal with tail pages in the first place, this amounts to a conversion
> of most MM code - including the LRU management, reclaim, rmap,
> migrate, swap, page fault code etc. - away from "the page".
>
> However, this far exceeds the goal of a better mm-fs interface. And
> the value proposition of a full MM-internal conversion, including
> e.g. the less exposed anon page handling, is much more nebulous. It's
> been proposed to leave anon pages out, but IMO to keep that direction
> maintainable, the folio would have to be translated to a page quite
> early when entering MM code, rather than propagating it inward, in
> order to avoid huge, massively overlapping page and folio APIs.

Here's an example where our current confusion between "any page"
and "head page" at least produces confusing behaviour, if not an
outright bug, isolate_migratepages_block():

		page = pfn_to_page(low_pfn);
...
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			goto isolate_fail;
		}

compound_order() does not expect a tail page; it returns 0 unless it's
a head page. I think what we actually want to do here is:

		if (!cc->alloc_contig) {
			struct page *head = compound_head(page);
			if (PageHead(head)) {
				const unsigned int order = compound_order(head);

				low_pfn |= (1UL << order) - 1;
				goto isolate_fail;
			}
		}

Not earth-shattering; not even necessarily a bug. But it's an example
of the way the code reads is different from how the code is executed,
and that's potentially dangerous. Having a different type for tail
and not-tail pages prevents the muddy thinking that can lead to
tail pages being passed to compound_order().
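The hazard is easy to reproduce in miniature: compound_order() only answers
for a head page, so a pfn walker that lands on a tail page and forgets
compound_head() computes a zero-length skip. The sketch below uses stand-in
types, not the kernel's struct page:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Miniature reproduction of the isolate_migratepages_block() hazard:
 * compound_order() reports 0 for anything that isn't a head page, so
 * the walker's skip distance silently collapses to zero on tail pages.
 * The page model here is a stand-in for illustration only.
 */
struct toy_page {
	bool head;			/* PG_head */
	unsigned int order;		/* meaningful on head pages only */
	struct toy_page *compound_head;
};

static unsigned int toy_compound_order(const struct toy_page *page)
{
	return page->head ? page->order : 0;	/* tail pages report 0 */
}

/* The pfn increment as the walker computes it: (1UL << order) - 1. */
static unsigned long toy_skip(const struct toy_page *page)
{
	return (1UL << toy_compound_order(page)) - 1;
}
```

With distinct head/tail types, the tail-page call would simply not compile,
which is the point Matthew is making.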
On Tue, Oct 05, 2021 at 02:52:01PM +0100, Matthew Wilcox wrote: > On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote: > > One one hand, the ambition appears to substitute folio for everything > > that could be a base page or a compound page even inside core MM > > code. Since there are very few places in the MM code that expressly > > deal with tail pages in the first place, this amounts to a conversion > > of most MM code - including the LRU management, reclaim, rmap, > > migrate, swap, page fault code etc. - away from "the page". > > > > However, this far exceeds the goal of a better mm-fs interface. And > > the value proposition of a full MM-internal conversion, including > > e.g. the less exposed anon page handling, is much more nebulous. It's > > been proposed to leave anon pages out, but IMO to keep that direction > > maintainable, the folio would have to be translated to a page quite > > early when entering MM code, rather than propagating it inward, in > > order to avoid huge, massively overlapping page and folio APIs. > > Here's an example where our current confusion between "any page" > and "head page" at least produces confusing behaviour, if not an > outright bug, isolate_migratepages_block(): > > page = pfn_to_page(low_pfn); > ... > if (PageCompound(page) && !cc->alloc_contig) { > const unsigned int order = compound_order(page); > > if (likely(order < MAX_ORDER)) > low_pfn += (1UL << order) - 1; > goto isolate_fail; > } > > compound_order() does not expect a tail page; it returns 0 unless it's > a head page. I think what we actually want to do here is: > > if (!cc->alloc_contig) { > struct page *head = compound_head(page); > if (PageHead(head)) { > const unsigned int order = compound_order(head); > > low_pfn |= (1UL << order) - 1; > goto isolate_fail; > } > } > > Not earth-shattering; not even necessarily a bug. But it's an example > of the way the code reads is different from how the code is executed, > and that's potentially dangerous. 
Having a different type for tail > and not-tail pages prevents the muddy thinking that can lead to > tail pages being passed to compound_order(). Thanks for digging this up. I agree the second version is much better. My question is still whether the extensive folio whitelisting of everybody else is the best way to bring those codepaths to light. The above isn't totally random. That code is a pfn walker which translates from the basepage address space to an ambiguous struct page object. There are more of those, but we can easily identify them: all uses of pfn_to_page() and virt_to_page() indicate that the code needs an audit for how exactly they're using the returned page. The above instance of such a walker wants to deal with a higher-level VM object: a thing that can be on the LRU, can be locked, etc. For those instances the pattern is clear that the pfn_to_page() always needs to be paired with a compound_head() before handling the page. I had mentioned in the other subthread a pfn_to_normal_page() to streamline this pattern, clarify intent, and mark the finished audit. Another class are page table walkers resolving to an ambiguous struct page right now. Those are also easy to identify, and AFAICS they all want headpages, which is why I had mentioned a central compound_head() in vm_normal_page(). Are there other such classes that I'm missing? Because it seems to me there are two and they both have rather clear markers for where the disambiguation needs to happen - and central helpers to put them in! And it makes sense: almost nobody *actually* needs to access the tail members of struct page. This suggests a pushdown and early filtering in a few central translation/lookup helpers would work to completely disambiguate remaining struct page usage inside MM code. There *are* a few weird struct page usages left, like bio and sparse, and you mentioned vm_fault as well in the other subthread. 
But it really seems these want converting away from arbitrary struct page to either something like phys_addr_t or a proper headpage anyway. Maybe a tuple of headpage and subpage index in the fault case. Because even after a full folio conversion of everybody else, those would be quite weird in their use of an ambiguous struct page! Which struct members are safe to access? What does it mean to lock a tailpage? Etc. But it's possible I'm missing something. Are there entry points that are difficult to identify both conceptually and code-wise? And which couldn't be pushed down to resolve to headpages quite early? Those I think would make the argument for the folio in the MM implementation.
On 05.10.21 19:29, Johannes Weiner wrote: > On Tue, Oct 05, 2021 at 02:52:01PM +0100, Matthew Wilcox wrote: >> On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote: >>> One one hand, the ambition appears to substitute folio for everything >>> that could be a base page or a compound page even inside core MM >>> code. Since there are very few places in the MM code that expressly >>> deal with tail pages in the first place, this amounts to a conversion >>> of most MM code - including the LRU management, reclaim, rmap, >>> migrate, swap, page fault code etc. - away from "the page". >>> >>> However, this far exceeds the goal of a better mm-fs interface. And >>> the value proposition of a full MM-internal conversion, including >>> e.g. the less exposed anon page handling, is much more nebulous. It's >>> been proposed to leave anon pages out, but IMO to keep that direction >>> maintainable, the folio would have to be translated to a page quite >>> early when entering MM code, rather than propagating it inward, in >>> order to avoid huge, massively overlapping page and folio APIs. >> >> Here's an example where our current confusion between "any page" >> and "head page" at least produces confusing behaviour, if not an >> outright bug, isolate_migratepages_block(): >> >> page = pfn_to_page(low_pfn); >> ... >> if (PageCompound(page) && !cc->alloc_contig) { >> const unsigned int order = compound_order(page); >> >> if (likely(order < MAX_ORDER)) >> low_pfn += (1UL << order) - 1; >> goto isolate_fail; >> } >> >> compound_order() does not expect a tail page; it returns 0 unless it's >> a head page. I think what we actually want to do here is: >> >> if (!cc->alloc_contig) { >> struct page *head = compound_head(page); >> if (PageHead(head)) { >> const unsigned int order = compound_order(head); >> >> low_pfn |= (1UL << order) - 1; >> goto isolate_fail; >> } >> } >> >> Not earth-shattering; not even necessarily a bug. 
But it's an example >> of the way the code reads is different from how the code is executed, >> and that's potentially dangerous. Having a different type for tail >> and not-tail pages prevents the muddy thinking that can lead to >> tail pages being passed to compound_order(). > > Thanks for digging this up. I agree the second version is much better. > > My question is still whether the extensive folio whitelisting of > everybody else is the best way to bring those codepaths to light. > > The above isn't totally random. That code is a pfn walker which > translates from the basepage address space to an ambiguous struct page > object. There are more of those, but we can easily identify them: all > uses of pfn_to_page() and virt_to_page() indicate that the code needs > an audit for how exactly they're using the returned page. +pfn_to_online_page()
On Tue, Oct 05, 2021 at 01:29:43PM -0400, Johannes Weiner wrote: > On Tue, Oct 05, 2021 at 02:52:01PM +0100, Matthew Wilcox wrote: > > On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote: > > > One one hand, the ambition appears to substitute folio for everything > > > that could be a base page or a compound page even inside core MM > > > code. Since there are very few places in the MM code that expressly > > > deal with tail pages in the first place, this amounts to a conversion > > > of most MM code - including the LRU management, reclaim, rmap, > > > migrate, swap, page fault code etc. - away from "the page". > > > > > > However, this far exceeds the goal of a better mm-fs interface. And > > > the value proposition of a full MM-internal conversion, including > > > e.g. the less exposed anon page handling, is much more nebulous. It's > > > been proposed to leave anon pages out, but IMO to keep that direction > > > maintainable, the folio would have to be translated to a page quite > > > early when entering MM code, rather than propagating it inward, in > > > order to avoid huge, massively overlapping page and folio APIs. > > > > Here's an example where our current confusion between "any page" > > and "head page" at least produces confusing behaviour, if not an > > outright bug, isolate_migratepages_block(): > > > > page = pfn_to_page(low_pfn); > > ... > > if (PageCompound(page) && !cc->alloc_contig) { > > const unsigned int order = compound_order(page); > > > > if (likely(order < MAX_ORDER)) > > low_pfn += (1UL << order) - 1; > > goto isolate_fail; > > } > > > > compound_order() does not expect a tail page; it returns 0 unless it's > > a head page. 
I think what we actually want to do here is: > > > > if (!cc->alloc_contig) { > > struct page *head = compound_head(page); > > if (PageHead(head)) { > > const unsigned int order = compound_order(head); > > > > low_pfn |= (1UL << order) - 1; > > goto isolate_fail; > > } > > } > > > > Not earth-shattering; not even necessarily a bug. But it's an example > > of the way the code reads is different from how the code is executed, > > and that's potentially dangerous. Having a different type for tail > > and not-tail pages prevents the muddy thinking that can lead to > > tail pages being passed to compound_order(). > > Thanks for digging this up. I agree the second version is much better. > > My question is still whether the extensive folio whitelisting of > everybody else is the best way to bring those codepaths to light. Outside of core MM developers, I'm not sure that a lot of people know that a struct page might represent 2^n pages of memory. Even architecture maintainers seem to be pretty fuzzy on what flush_dcache_page() does for compound pages: https://lore.kernel.org/linux-arch/20200818150736.GQ17456@casper.infradead.org/ I know this change is a massive pain, but I do think we're better off in a world where 'struct page' really refers to one page of memory, and we have some other name for a memory descriptor that refers to 2^n pages of memory. > The above isn't totally random. That code is a pfn walker which > translates from the basepage address space to an ambiguous struct page > object. There are more of those, but we can easily identify them: all > uses of pfn_to_page() and virt_to_page() indicate that the code needs > an audit for how exactly they're using the returned page. Right; it's not random at all. I ran across it while trying to work out how zsmalloc interacts with memory compaction. It just seemed like a particularly compelling example because it's not part of some random driver, it's a relatively important part of the MM. 
And if such a place has this kind of ambiguity, everything else must surely be worse. > The above instance of such a walker wants to deal with a higher-level > VM object: a thing that can be on the LRU, can be locked, etc. For > those instances the pattern is clear that the pfn_to_page() always > needs to be paired with a compound_head() before handling the page. I > had mentioned in the other subthread a pfn_to_normal_page() to > streamline this pattern, clarify intent, and mark the finished audit. > > Another class are page table walkers resolving to an ambiguous struct > page right now. Those are also easy to identify, and AFAICS they all > want headpages, which is why I had mentioned a central compound_head() > in vm_normal_page(). > > Are there other such classes that I'm missing? Because it seems to me > there are two and they both have rather clear markers for where the > disambiguation needs to happen - and central helpers to put them in! > > And it makes sense: almost nobody *actually* needs to access the tail > members of struct page. This suggests a pushdown and early filtering > in a few central translation/lookup helpers would work to completely > disambiguate remaining struct page usage inside MM code. The end goal (before you started talking about shrinking memmap) was to rename page->mapping, page->index, page->lru and page->private, so you can't look at members of struct page any more. struct page would still have ->compound_head, but anything else would require conversion to folio first. Now that you've put the "dynamically allocate the memory descriptor" idea in my head, that rename becomes a deletion, and struct page itself shrinks down to a single pointer. > There *are* a few weird struct page usages left, like bio and sparse, > and you mentioned vm_fault as well in the other subthread. But it > really seems these want converting away from arbitrary struct page to > either something like phys_addr_t or a proper headpage anyway.

Maybe a > tuple of headpage and subpage index in the fault case. Because even > after a full folio conversion of everybody else, those would be quite > weird in their use of an ambiguous struct page! Which struct members > are safe to access? What does it mean to lock a tailpage? Etc. If you think converting the MM from struct page to struct folio is bad, a lot of churn, etc, you're going to be amazed at how much churn it'll be to convert all of block and networking from struct page to phys_addr_t! I'm not saying it's not worth doing, or it'll never be done, but that's a five year project. And I have no idea how to migrate to it gracefully. > But it's possible I'm missing something. Are there entry points that > are difficult to identify both conceptually and code-wise? And which > couldn't be pushed down to resolve to headpages quite early? Those I > think would make the argument for the folio in the MM implementation. The approach I took with folio was to justify their appearance by showing how they could remove all these hidden calls to compound_head(). So I went bottom-up. Doing the slub conversion, I went in the opposite direction; start out by converting the top layers from virt_to_head_page() to use virt_to_slab(). Then simply call slab_page() when calling any function which hasn't yet been converted. At each step, we get better and better type safety because every place that gets converted knows it's being passed a head page and doesn't have to worry about whether it might be passed a tail page. Doing it in this direction doesn't let us remove the hidden calls to compound_head() until the very end of the conversion, but people don't seem to be particularly moved by all this wasted i-cache anyway. I can look at doing this for the page cache, but we kind of need agreement that separating the types is where we're going, and what we're going to end up calling both types. Slab was easy; Bonwick decided what we were going to call the memory descriptor ;-)
On Tue, Oct 05, 2021 at 07:30:06PM +0100, Matthew Wilcox wrote: > Outside of core MM developers, I'm not sure that a lot of people > know that a struct page might represent 2^n pages of memory. Even > architecture maintainers seem to be pretty fuzzy on what > flush_dcache_page() does for compound pages: > https://lore.kernel.org/linux-arch/20200818150736.GQ17456@casper.infradead.org/ I definitely second that opinion. A final outcome where we still have struct page referring to all kinds of different things feels like a missed opportunity to me. Jason
On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > mm/memcg: Add folio_memcg() and related functions > mm/memcg: Convert commit_charge() to take a folio > mm/memcg: Convert mem_cgroup_charge() to take a folio > mm/memcg: Convert uncharge_page() to uncharge_folio() > mm/memcg: Convert mem_cgroup_uncharge() to take a folio > mm/memcg: Convert mem_cgroup_migrate() to take folios > mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio > mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock() > mm/memcg: Convert mem_cgroup_move_account() to use a folio > mm/memcg: Add folio_lruvec() > mm/memcg: Add folio_lruvec_lock() and similar functions > mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave() > mm/workingset: Convert workingset_activation to take a folio > > This is all anon+file stuff, not needed for filesystem > folios. No, that's not true. A number of these functions are called from filesystem code. mem_cgroup_track_foreign_dirty() is only called from filesystem code. We at the very least need wrappers like folio_cgroup_charge(), and folio_memcg_lock(). > As per the other email, no conceptual entry point for > tail pages into either subsystem, so no ambiguity > around the necessity of any compound_head() calls, > directly or indirectly. It's easy to rule out > wholesale, so there is no justification for > incrementally annotating every single use of the page. The justification is that we can remove all those hidden calls to compound_head(). Hundreds of bytes of text spread throughout this file. > mm: Add folio_young and folio_idle > mm/swap: Add folio_activate() > mm/swap: Add folio_mark_accessed() > > This is anon+file aging stuff, not needed. Again, very much needed. Take a look at pagecache_get_page(). In Linus' tree today, it calls if (page_is_idle(page)) clear_page_idle(page); So either we need wrappers (which are needlessly complicated thanks to how page_is_idle() is defined) or we just convert it. 
> mm/rmap: Add folio_mkclean() > > mm/migrate: Add folio_migrate_mapping() > mm/migrate: Add folio_migrate_flags() > mm/migrate: Add folio_migrate_copy() > > More anon+file conversion, not needed. As far as I can tell, anon never calls any of these three functions. anon calls migrate_page(), which calls migrate_page_move_mapping(), but several filesystems do call these individual functions. > mm/lru: Add folio_add_lru() > > LRU code, not needed. Again, we need folio_add_lru() for filemap. This one's more tractable as a wrapper function.
On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > mm/lru: Add folio LRU functions > > The LRU code is used by anon and file and not needed > for the filesystem API. > > And as discussed, there is generally no ambiguity of > tail pages on the LRU list. One of the assumptions you're making is that the current code is suitable for folios. One of the things that happens in this patch is: - update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); + update_lru_size(lruvec, lru, folio_zonenum(folio), + folio_nr_pages(folio)); static inline long folio_nr_pages(struct folio *folio) { return compound_nr(&folio->page); } vs #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline int thp_nr_pages(struct page *page) { VM_BUG_ON_PGFLAGS(PageTail(page), page); if (PageHead(page)) return HPAGE_PMD_NR; return 1; } #else static inline int thp_nr_pages(struct page *page) { VM_BUG_ON_PGFLAGS(PageTail(page), page); return 1; } #endif So if you want to leave all the LRU code using pages, all the uses of thp_nr_pages() need to be converted to compound_nr(). Or maybe not all of them; I don't know which ones might be safe to leave as thp_nr_pages(). That's one of the reasons I went with a whitelist approach.
On Sat, Oct 16, 2021 at 04:28:23AM +0100, Matthew Wilcox wrote: > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > mm/memcg: Add folio_memcg() and related functions > > mm/memcg: Convert commit_charge() to take a folio > > mm/memcg: Convert mem_cgroup_charge() to take a folio > > mm/memcg: Convert uncharge_page() to uncharge_folio() > > mm/memcg: Convert mem_cgroup_uncharge() to take a folio > > mm/memcg: Convert mem_cgroup_migrate() to take folios > > mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio > > mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock() > > mm/memcg: Convert mem_cgroup_move_account() to use a folio > > mm/memcg: Add folio_lruvec() > > mm/memcg: Add folio_lruvec_lock() and similar functions > > mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave() > > mm/workingset: Convert workingset_activation to take a folio > > > > This is all anon+file stuff, not needed for filesystem > > folios. > > No, that's not true. A number of these functions are called from > filesystem code. mem_cgroup_track_foreign_dirty() is only called > from filesystem code. We at the very least need wrappers like > folio_cgroup_charge(), and folio_memcg_lock(). Well, a handful of exceptions don't refute the broader point. No objection from me to convert mem_cgroup_track_foreign_dirty(). No objection to add a mem_cgroup_charge_folio(). But I insist on the subsystem prefix, because that's in line with how we're charging a whole bunch of other different things (swap, skmem, etc.). It'll also match a mem_cgroup_charge_anon() if we agree to an anon type. folio_memcg_lock() sounds good to me. > > As per the other email, no conceptual entry point for > > tail pages into either subsystem, so no ambiguity > > around the necessity of any compound_head() calls, > > directly or indirectly. It's easy to rule out > > wholesale, so there is no justification for > > incrementally annotating every single use of the page. 
> > The justification is that we can remove all those hidden calls to > compound_head(). Hundreds of bytes of text spread throughout this file. I find this line of argument highly disingenuous. No new type is necessary to remove these calls inside MM code. Migrate them into the callsites and remove the 99.9% very obviously bogus ones. The process is the same whether you switch to a new type or not. (I'll send more patches like the PageSlab() ones to that effect. It's easy. The only reason nobody has bothered removing those until now is that nobody reported regressions when they were added.) But typesafety is an entirely different argument. And to reiterate the main point of contention on these patches: there is no consensus among MM people how (or whether) we want MM-internal typesafety for pages. Personally, I think we do, but I don't think head vs tail is the most important or the most error-prone aspect of the many identities struct page can have. In most cases it's not even in the top 5 of questions I have about the page when I see it in a random MM context (outside of the very few places that do virt_to_page or pfn_to_page). Therefore "folio" is not a very poignant way to name the object that is passed around in most MM code. struct anon_page and struct file_page would be way more descriptive and would imply the head/tail aspect. Anyway, the email you are responding to was an offer to split the uncontroversial "large pages backing filesystems" part from the controversial "MM-internal typesafety" discussion. Several people in both the fs space and the mm space have now asked to do this to move ahead. Since you have stated in another subthread that you "want to get back to working on large pages in the page cache," and you never wanted to get involved that deeply in the struct page subtyping efforts, it's not clear to me why you are not taking this offer.
> > mm: Add folio_young and folio_idle > > mm/swap: Add folio_activate() > > mm/swap: Add folio_mark_accessed() > > > > This is anon+file aging stuff, not needed. > > Again, very much needed. Take a look at pagecache_get_page(). In Linus' > tree today, it calls if (page_is_idle(page)) clear_page_idle(page); > So either we need wrappers (which are needlessly complicated thanks to > how page_is_idle() is defined) or we just convert it. I'm not sure I understand the complication. That you'd have to do if (page_is_idle(folio->page)) clear_page_idle(folio->page) inside code in mm/? It's either that, or a) generic code shared with anon pages has to do: if (folio_is_idle(page->folio)) clear_folio_idle(page->folio) which is weird, or b) both types work with their own wrappers: if (page_is_idle(page)) clear_page_idle(page) if (folio_is_idle(folio)) clear_folio_idle(folio) and it's not obvious at all that they are in fact tracking the same state. State which is exported to userspace through the "page_idle" feature. Doing the folio->page translation in mm/-private code, and keeping this a page interface, is by far the most preferable solution. > > mm/rmap: Add folio_mkclean() > > > > mm/migrate: Add folio_migrate_mapping() > > mm/migrate: Add folio_migrate_flags() > > mm/migrate: Add folio_migrate_copy() > > > > More anon+file conversion, not needed. > > As far as I can tell, anon never calls any of these three functions. > anon calls migrate_page(), which calls migrate_page_move_mapping(), > but several filesystems do call these individual functions. In the current series, migrate_page_move_mapping() has been replaced, and anon pages go through them: int folio_migrate_mapping(struct address_space *mapping, struct folio *newfolio, struct folio *folio, int extra_count) { [...] 
if (!mapping) { /* Anonymous page without mapping */ if (folio_ref_count(folio) != expected_count) return -EAGAIN; /* No turning back from here */ newfolio->index = folio->index; newfolio->mapping = folio->mapping; if (folio_test_swapbacked(folio)) __folio_set_swapbacked(newfolio); That's what I'm objecting to. I'm not objecting to adding these to the filesystem interface as thin folio->page wrappers that call the page implementation. > > mm/lru: Add folio_add_lru() > > > > LRU code, not needed. > > Again, we need folio_add_lru() for filemap. This one's more > tractable as a wrapper function. Please don't quote selectively to the point of it being misleading. The original block my statement applied to was this: mm: Add folio_evictable() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm/lru: Add folio_add_lru() which goes way beyond just being filesystem-interfacing. I have no objection to a cache interface function for adding a folio to the LRU (a wrapper to encapsulate the folio->page transition). However, like with the memcg code above, the API is called lru_cache: we have had lru_cache_add_file() and lru_cache_add_anon() in the past, so lru_cache_add_folio() seems more appropriate - especially as long as we still have one for pages (and maybe later one for anon pages). --- All that to say, adding folio as a new type for file headpages with API functions like this: mem_cgroup_charge_folio() lru_cache_add_folio() now THAT would be an incremental change to the kernel code. And if that new type proves to be a great idea, we can do the same for anon - whether with a shared type or with separate types. And if it does end up the same type, in the interfaces and in the implementation, we can merge mem_cgroup_charge_page() # generic bits mem_cgroup_charge_folio() # file bits mem_cgroup_charge_anon() # anon bits back into a single function, just like we've done it already for the anon and file variants of those functions that we have had before. 
And if we then want to rename that function to something we agree is more appropriate, we can do that as yet another step. That would actually be incremental refactoring.
On Sat, Oct 16, 2021 at 08:07:40PM +0100, Matthew Wilcox wrote: > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > mm/lru: Add folio LRU functions > > > > The LRU code is used by anon and file and not needed > > for the filesystem API. > > > > And as discussed, there is generally no ambiguity of > > tail pages on the LRU list. > > One of the assumptions you're making is that the current code is suitable > for folios. One of the things that happens in this patch is: > > - update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); > + update_lru_size(lruvec, lru, folio_zonenum(folio), > + folio_nr_pages(folio)); > > static inline long folio_nr_pages(struct folio *folio) > { > return compound_nr(&folio->page); > } > > vs > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > static inline int thp_nr_pages(struct page *page) > { > VM_BUG_ON_PGFLAGS(PageTail(page), page); > if (PageHead(page)) > return HPAGE_PMD_NR; > return 1; > } > #else > static inline int thp_nr_pages(struct page *page) > { > VM_BUG_ON_PGFLAGS(PageTail(page), page); > return 1; > } > #endif > > So if you want to leave all the LRU code using pages, all the uses of > thp_nr_pages() need to be converted to compound_nr(). Or maybe not all > of them; I don't know which ones might be safe to leave as thp_nr_pages(). > That's one of the reasons I went with a whitelist approach. All of them. The only compound pages that can exist on the LRUs are THPs, and the only THP pages that can exist on the LRUs are compound. There is no plausible scenario where those two functions would disagree in the LRU code. Or elsewhere in the kernel, for that matter. Where would thp_nr_pages() returning compound_nr() ever be wrong? How else are we implementing THPs? I'm not sure that would make sense.
On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > I find this line of argument highly disingenuous. > > No new type is necessary to remove these calls inside MM code. Migrate > them into the callsites and remove the 99.9% very obviously bogus > ones. The process is the same whether you switch to a new type or not. Conversely, I don't see "leave all LRU code as struct page, and ignore anonymous pages" to be a serious counterargument. I got that you really don't want anonymous pages to be folios from the call Friday, but I haven't been getting anything that looks like a serious counterproposal from you. Think about what our goal is: we want to get to a world where our types describe unambiguously how our data is used. That means working towards - getting rid of type punning - struct fields that are only used for a single purpose Leaving all the LRU code as struct page means leaving a shit ton of type punning in place, and you aren't outlining any alternate ways of dealing with that. As long as all the LRU code is using struct page, that halts efforts towards separately allocating these types and making struct page smaller (which was one of your stated goals as well!), and it would leave a big mess in place for god knows how long. It's been a massive effort for Willy to get this far, who knows when someone else with the requisite skillset would be summoning up the energy to deal with that - I don't see you or I doing it. Meanwhile: we've got people working on using folios for anonymous pages to solve some major problems - it cleans up all of the if (normalpage) else if (hugepage) mess - it'll _majorly_ help with our memory fragmentation problems, as I recently outlined. As long as we've got a very bimodal distribution in our allocation sizes where the peaks are at order 0 and HUGEPAGE_ORDER, we're going to have problems allocating hugepages. 
If anonymous + file memory can be arbitrary-sized compound pages, we'll end up with more of a Poisson distribution in our allocation sizes, and a _great deal_ of our difficulties with memory fragmentation are going to be alleviated. - and on architectures that support merging of TLB entries, folios for anonymous memory are going to get us some major performance improvements due to reduced TLB pressure, same as hugepages but without nearly as much memory fragmentation pain And on top of all that, file and anonymous pages are just more alike than they are different. As I keep saying, the sane incremental approach to splitting up struct page into different dedicated types is to follow the union of structs. I get that you REALLY REALLY don't want file and anonymous pages to be the same type, but what you're asking just isn't incremental, it's asking for one big refactoring to be done at the same time as another. > (I'll send more patches like the PageSlab() ones to that effect. It's > easy. The only reason nobody has bothered removing those until now is > that nobody reported regressions when they were added.) I was also pretty frustrated by your response to Willy's struct slab patches. You claim to be all in favour of introducing more type safety and splitting struct page up into multiple types, but on the basis of one objection - that his patches start marking tail slab pages as PageSlab (and I agree with your objection, FWIW) - instead of just asking for that to be changed, or posting a patch that made that change to his series, you said in effect that we shouldn't be doing any of the struct slab stuff by posting your own much more limited refactoring, that was only targeted at the compound_head() issue, which we all agree is a distraction and not the real issue. Why are you letting yourself get distracted by that? 
I'm not really sure what you want, Johannes, besides the fact that you really don't want file and anon pages to be the same type - but I don't see how that gives us a route forwards on the fronts I just outlined.
On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > On Sat, Oct 16, 2021 at 04:28:23AM +0100, Matthew Wilcox wrote: > > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > > mm/memcg: Add folio_memcg() and related functions > > > mm/memcg: Convert commit_charge() to take a folio > > > mm/memcg: Convert mem_cgroup_charge() to take a folio > > > mm/memcg: Convert uncharge_page() to uncharge_folio() > > > mm/memcg: Convert mem_cgroup_uncharge() to take a folio > > > mm/memcg: Convert mem_cgroup_migrate() to take folios > > > mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio > > > mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock() > > > mm/memcg: Convert mem_cgroup_move_account() to use a folio > > > mm/memcg: Add folio_lruvec() > > > mm/memcg: Add folio_lruvec_lock() and similar functions > > > mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave() > > > mm/workingset: Convert workingset_activation to take a folio > > > > > > This is all anon+file stuff, not needed for filesystem > > > folios. > > > > No, that's not true. A number of these functions are called from > > filesystem code. mem_cgroup_track_foreign_dirty() is only called > > from filesystem code. We at the very least need wrappers like > > folio_cgroup_charge(), and folio_memcg_lock(). > > Well, a handful of exceptions don't refute the broader point. > > No objection from me to convert mem_cgroup_track_foreign_dirty(). > > No objection to add a mem_cgroup_charge_folio(). But I insist on the > subsystem prefix, because that's in line with how we're charging a > whole bunch of other different things (swap, skmem, etc.). It'll also > match a mem_cgroup_charge_anon() if we agree to an anon type. I don't care about the name; I'll change that. 
I still don't get when you want mem_cgroup_foo() and when you want memcg_foo() > > > As per the other email, no conceptual entry point for > > > tail pages into either subsystem, so no ambiguity > > > around the necessity of any compound_head() calls, > > > directly or indirectly. It's easy to rule out > > > wholesale, so there is no justification for > > > incrementally annotating every single use of the page. > > > > The justification is that we can remove all those hidden calls to > > compound_head(). Hundreds of bytes of text spread throughout this file. > > I find this line of argument highly disingenuous. > > No new type is necessary to remove these calls inside MM code. Migrate > them into the callsites and remove the 99.9% very obviously bogus > ones. The process is the same whether you switch to a new type or not. > > (I'll send more patches like the PageSlab() ones to that effect. It's > easy. The only reason nobody has bothered removing those until now is > that nobody reported regressions when they were added.) That kind of change is actively dangerous. Today, you can call PageSlab() on a tail page, and it returns true. After your patch, it returns false. Sure, there's a debug check in there that's enabled on about 0.1% of all kernel builds, but I bet most people won't notice. We're not able to catch these kinds of mistakes at review time: https://lore.kernel.org/linux-mm/20211001024105.3217339-1-willy@infradead.org/ which means it escaped the eagle eyes of (at least): Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> I don't say that to shame these people. We need the compiler's help here. 
If we're removing the ability to ask whether a tail page belongs to the slab allocator, we have to have the compiler warn us. I have a feeling your patch also breaks tools/vm/page-types.c > But typesafety is an entirely different argument. And to reiterate the > main point of contention on these patches: there is no consensus among > MM people how (or whether) we want MM-internal typesafety for pages. I don't think there will ever be consensus as long as you don't take the concerns of other MM developers seriously. On Friday's call, several people working on using large pages for anon memory told you that using folios for anon memory would make their lives easier, and you didn't care. > Personally, I think we do, but I don't think head vs tail is the most > important or the most error-prone aspect of the many identities struct > page can have. In most cases it's not even in the top 5 of questions I > have about the page when I see it in a random MM context (outside of > the very few places that do virt_to_page or pfn_to_page). Therefore > "folio" is not a very poignant way to name the object that is passed > around in most MM code. struct anon_page and struct file_page would be > way more descriptive and would imply the head/tail aspect. I get it that you want to split out anon pages from other types of pages. I'm not against there being a struct anon_folio { struct folio f; }; which marks functions or regions of functions that only deal with anon memory. But we need _a_ type which represents "the head page of a compound page or an order-0 page". And that's what folio is. Maybe we also want struct file_folio. I don't see the need for it myself, but maybe I'm wrong. > Anyway, the email you are responding to was an offer to split the > uncontroversial "large pages backing filesystems" part from the > controversial "MM-internal typesafety" discussion. Several people in > both the fs space and the mm space have now asked to do this to move > ahead.
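The tail-page hazard and the type-safety argument above can be made concrete with a small userspace sketch. Everything below is a mock: the struct layouts, the ->head field, and the _mock/_today/_stripped helper names are illustrative stand-ins for the kernel's compound_head()/PageSlab()/struct folio machinery, not the real implementation.

```c
#include <assert.h>

#define PG_slab 7UL

/* Userspace mock, not kernel code: ->head stands in for the
 * compound_head() linkage the kernel encodes in page->compound_head. */
struct page {
	unsigned long flags;
	struct page *head;	/* NULL for head (and order-0) pages */
};

static struct page *compound_head_mock(struct page *p)
{
	return p->head ? p->head : p;
}

/* Today: the flag test hides a compound_head() call, so asking a
 * tail page "are you slab?" returns the head page's answer. */
static int PageSlab_today(struct page *p)
{
	return (compound_head_mock(p)->flags >> PG_slab) & 1;
}

/* With the hidden call removed: the same question on a tail page
 * silently returns false - the behavior change being debated. */
static int PageSlab_stripped(struct page *p)
{
	return (p->flags >> PG_slab) & 1;
}

/* The folio approach: a type that is never a tail page by
 * construction, so the ambiguity cannot arise.  Handing a raw
 * struct page * to folio_test_slab_mock() is a compile-time error. */
struct folio {
	struct page page;
};

static struct folio *page_folio_mock(struct page *p)
{
	return (struct folio *)compound_head_mock(p);
}

static int folio_test_slab_mock(struct folio *folio)
{
	return (folio->page.flags >> PG_slab) & 1;
}
```

With a two-page compound page whose head has PG_slab set, PageSlab_today() on the tail says yes, PageSlab_stripped() says no, and the folio variant forces the head lookup into the one conversion point where the compiler can see it.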
Since you have stated in another subthread that you "want to > get back to working on large pages in the page cache," and you never > wanted to get involved that deeply in the struct page subtyping > efforts, it's not clear to me why you are not taking this offer. I am. This email was written after trying to do just this. I dropped the patches you were opposed to and looked at the result. It's not good. You seem wedded to this idea that "folios are just for file backed memory", and that's not my proposal at all. folios are for everything. Maybe we specialise out other types of memory later, or during, or instead of converting something to use folios, but folios are an utterly generic concept.
On Mon, Oct 18, 2021 at 02:12:32PM -0400, Kent Overstreet wrote: > On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > > I find this line of argument highly disingenuous. > > > > No new type is necessary to remove these calls inside MM code. Migrate > > them into the callsites and remove the 99.9% very obviously bogus > > ones. The process is the same whether you switch to a new type or not. > > Conversely, I don't see "leave all LRU code as struct page, and ignore anonymous > pages" to be a serious counterargument. I got that you really don't want > anonymous pages to be folios from the call Friday, but I haven't been getting > anything that looks like a serious counterproposal from you. > > Think about what our goal is: we want to get to a world where our types describe > unambiguously how our data is used. That means working towards > - getting rid of type punning > - struct fields that are only used for a single purpose How is a common type inheritance model with a generic page type and subclasses not a counterproposal? And one which actually accomplishes those two things you're saying, as opposed to a shared folio where even 'struct address_space *mapping' is a total lie type-wise? Plus, really, what's the *alternative* to doing that anyway? How are we going to implement code that operates on folios and other subtypes of the page alike? And deal with attributes and properties that are shared among them all? Willy's original answer to that was that folio is just *going* to be all these things - file, anon, slab, network, rando driver stuff. But since that wasn't very popular, would not get rid of type punning and overloaded members, would get rid of efficiently allocating descriptor memory etc. - what *is* the alternative now to common properties between split out subtypes? I'm not *against* what you and Willy are saying. I have *genuinely zero idea what* you are saying.
> Leaving all the LRU code as struct page means leaving a shit ton of type punning > in place, and you aren't outlining any alternate ways of dealing with that. As > long as all the LRU code is using struct page, that halts efforts towards > separately allocating these types and making struct page smaller (which was one > of your stated goals as well!), and it would leave a big mess in place for god > knows how long. I don't follow either of these claims. Converting to a shared anon/file folio makes almost no dent in the existing type punning we have, because head/tail page disambiguation is a tiny part of the type inference we do on struct page. And leaving the LRU linkage in the struct page doesn't get in the way of allocating separate subtype descriptors. All these types need a list_head anyway, from anon to file to slab to the buddy allocator. Maybe anon, file, slab don't need it at the 4k granularity all the time, but the buddy allocator does anyway as long as it's 4k based, and I'm sure you don't want to be allocating a new buddy descriptor every time we're splitting a larger page block into a smaller one? I really have no idea how that would even work. > It's been a massive effort for Willy to get this far, who knows when > someone else with the requisite skillset would be summoning up the > energy to deal with that - I don't see you or I doing it. > > Meanwhile: we've got people working on using folios for anonymous pages to solve > some major problems > > - it cleans up all of the if (normalpage) else if (hugepage) mess No it doesn't. > - it'll _majorly_ help with our memory fragmentation problems, as I recently > outlined. As long as we've got a very bimodal distribution in our allocation > sizes where the peaks are at order 0 and HUGEPAGE_ORDER, we're going to have > problems allocating hugepages.
If anonymous + file memory can be arbitrary > sized compound pages, we'll end up with more of a Poisson distribution in our > allocation sizes, and a _great deal_ of our difficulties with memory > fragmentation are going to be alleviated. > > - and on architectures that support merging of TLB entries, folios for > anonymous memory are going to get us some major performance improvements due > to reduced TLB pressure, same as hugepages but without nearly as much memory > fragmentation pain It doesn't do those, either. It's a new name for head pages, that's it. Converting to arbitrary-order huge pages needs to rework assumptions around what THP pages mean in various places of the code. Mainly the page table code. Presumably. We don't have anything even resembling a proposal for what this is all going to look like implementation-wise. How does changing the name help with this? How does not having the new name get in the way of it? > And on top of all that, file and anonymous pages are just more alike than they > are different. I don't know what you're basing this on, and you can't just keep making this claim without showing code to actually unify them. They have some stuff in common, and some stuff is deeply different. All of this screams class & subclass. Meanwhile you and Willy just keep coming up with hacks on how we can somehow work around this fact and contort the types to work out anyway. You yourself said that folio including slab and other random stuff is a bonkers idea. But that means we need to deal with properties that are going to be shared between subtypes, and I'm the only one who has come up with a remotely coherent proposal on how to do that. > > (I'll send more patches like the PageSlab() ones to that effect. It's > easy. The only reason nobody has bothered removing those until now is > that nobody reported regressions when they were added.) > > I was also pretty frustrated by your response to Willy's struct slab patches.
> > You claim to be all in favour of introducing more type safety and splitting > struct page up into multiple types, but on the basis of one objection - that his > patches start marking tail slab pages as PageSlab (and I agree with your > objection, FWIW) - instead of just asking for that to be changed, or posting a > patch that made that change to his series, you said in effect that we shouldn't > be doing any of the struct slab stuff by posting your own much more limited > refactoring, that was only targeted at the compound_head() issue, which we all > agree is a distraction and not the real issue. Why are you letting yourself get > distracted by that? Kent, you can't be serious. I actually did exactly what you suggested I should have done. The struct slab patches are the right thing to do. I had one minor concern (which you seem to share) and suggested a small cleanup. Willy worried about this cleanup adding a needless compound_head() call, so *I sent patches to eliminate this call and allow this cleanup and the struct slab patches to go ahead.* My patches are to unblock Willy's. He then moved the goal posts and started talking about prefetching, but that isn't my fault. I was collaborating and putting my own time and effort where my mouth is. Can you please debug your own approach to reading these conversations?
On Mon, Oct 18, 2021 at 07:28:13PM +0100, Matthew Wilcox wrote: > On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > > On Sat, Oct 16, 2021 at 04:28:23AM +0100, Matthew Wilcox wrote: > > > On Wed, Sep 22, 2021 at 11:08:58AM -0400, Johannes Weiner wrote: > > > > As per the other email, no conceptual entry point for > > > > tail pages into either subsystem, so no ambiguity > > > > around the necessity of any compound_head() calls, > > > > directly or indirectly. It's easy to rule out > > > > wholesale, so there is no justification for > > > > incrementally annotating every single use of the page. > > > > > > The justification is that we can remove all those hidden calls to > > > compound_head(). Hundreds of bytes of text spread throughout this file. > > > > I find this line of argument highly disingenuous. > > > > No new type is necessary to remove these calls inside MM code. Migrate > > them into the callsites and remove the 99.9% very obviously bogus > > ones. The process is the same whether you switch to a new type or not. > > > > (I'll send more patches like the PageSlab() ones to that effect. It's > > easy. The only reason nobody has bothered removing those until now is > > that nobody reported regressions when they were added.) > > That kind of change is actively dangerous. Today, you can call > PageSlab() on a tail page, and it returns true. After your patch, > it returns false. Sure, there's a debug check in there that's enabled > on about 0.1% of all kernel builds, but I bet most people won't notice. 
> > We're not able to catch these kinds of mistakes at review time: > https://lore.kernel.org/linux-mm/20211001024105.3217339-1-willy@infradead.org/ > > which means it escaped the eagle eyes of (at least): > Signed-off-by: Andrey Konovalov <andreyknvl@google.com> > Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> > Reviewed-by: Dmitry Vyukov <dvyukov@google.com> > Cc: Christoph Lameter <cl@linux.com> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Will Deacon <will.deacon@arm.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > > I don't say that to shame these people. We need the compiler's help > here. If we're removing the ability to ask for whether a tail page > belongs to the slab allocator, we have to have the compiler warn us. > > I have a feeling your patch also breaks tools/vm/page-types.c As Hugh said in the meeting in response to this, "you'll look at kernel code for any amount of time, you'll find bugs". I already pointed out dangerous code from anon/file confusion somewhere in this thread. None of that is a reason not to fix it. But it should inform the approach on how we fix it. I'm not against type safety, I'm for incremental changes. And replacing an enormous subset of struct page users with an unproven new type and loosely defined interaction with other page subtypes is just not that. > > But typesafety is an entirely different argument. And to reiterate the > > main point of contention on these patches: there is no concensus among > > MM people how (or whether) we want MM-internal typesafety for pages. > > I don't think there will ever be consensus as long as you don't take > the concerns of other MM developers seriously. On Friday's call, several > people working on using large pages for anon memory told you that using > folios for anon memory would make their lives easier, and you didn't care. Nope, one person claimed that it would help, and I asked how. 
Not because I'm against typesafety, but because I wanted to know if there is an aspect in there that would specifically benefit from a shared folio type. I don't remember there being one, and I'm not against type safety for anon pages. What several people *did* say at this meeting was whether you could drop the anon stuff for now until we have consensus. > > Anyway, the email you are responding to was an offer to split the > > uncontroversial "large pages backing filesystems" part from the > > controversial "MM-internal typesafety" discussion. Several people in > > both the fs space and the mm space have now asked to do this to move > > ahead. Since you have stated in another subthread that you "want to > > get back to working on large pages in the page cache," and you never > > wanted to get involved that deeply in the struct page subtyping > > efforts, it's not clear to me why you are not taking this offer. > > I am. This email was written after trying to do just this. I dropped > the patches you were opposed to and looked at the result. It's not good. > > You seem wedded to this idea that "folios are just for file backed > memory", and that's not my proposal at all. folios are for everything. > Maybe we specialise out other types of memory later, or during, or > instead of converting something to use folios, but folios are an utterly > generic concept. That train left the station when several people said slab should not be in the folio. Once that happened, you could no longer say it'll work itself out around the edges. Now it needs a real approach to coordinating with other subtypes, including shared properties and implementation between them. The "simple" folio approach only works when it really is a wholesale replacement for *everything* that page is right now - modulo PAGE_SIZE and modulo compound tail. But it isn't that anymore, is it? Folio can't be everything and only some subtypes simultaneously. 
So when you say folio is for everything, is struct slab dead? If not, what is the relationship between them? How do you query shared properties? There really is no coherent proposal right now. These patches start an open-ended conversion in a nebulous direction. All I'm saying is: start with a reasonable, delineated scope (page cache), and if that trial balloon works out we can do the next one with lessons learned from the first. Maybe that will converge to the "simple" folio for all compound subtypes, maybe we'll move more toward explicit subtyping that implies the head/tail thing anyway. What is even the counterargument to that?
On Mon, Oct 18, 2021 at 05:56:34PM -0400, Johannes Weiner wrote: > > I don't think there will ever be consensus as long as you don't take > > the concerns of other MM developers seriously. On Friday's call, several > > people working on using large pages for anon memory told you that using > > folios for anon memory would make their lives easier, and you didn't care. > > Nope, one person claimed that it would help, and I asked how. Not > because I'm against typesafety, but because I wanted to know if there > is an aspect in there that would specifically benefit from a shared > folio type. I don't remember there being one, and I'm not against type > safety for anon pages. > > What several people *did* say at this meeting was whether you could > drop the anon stuff for now until we have consensus. My read on the meeting was that most people had nothing against the anon stuff, but asked if Willy could drop the anon parts to get past your objections and move forward. You were the only person who was vocal against including the anon parts. (Hugh nodded to some of your points, but I don't really know his position on folios in general and the anon stuff in particular). For the record: I think folios have to be applied, including the anon bits. They are useful and address long-standing issues with compound pages. Any future type-safety work can be done on top of it. I know it's not a democracy and we don't count votes here, but we have been dragging this out for months and aren't getting closer to consensus. At some point "disagree and commit" has to be considered.
On Tue, Oct 19, 2021 at 02:16:27AM +0300, Kirill A. Shutemov wrote: > On Mon, Oct 18, 2021 at 05:56:34PM -0400, Johannes Weiner wrote: > > > I don't think there will ever be consensus as long as you don't take > > > the concerns of other MM developers seriously. On Friday's call, several > > > people working on using large pages for anon memory told you that using > > > folios for anon memory would make their lives easier, and you didn't care. > > > > Nope, one person claimed that it would help, and I asked how. Not > > because I'm against typesafety, but because I wanted to know if there > > is an aspect in there that would specifically benefit from a shared > > folio type. I don't remember there being one, and I'm not against type > > safety for anon pages. > > > > What several people *did* say at this meeting was whether you could > > drop the anon stuff for now until we have consensus. > > My read on the meeting was that most people had nothing against anon > stuff, but asked if Willy could drop anon parts to get past your > objections to move forward. > > You were the only person who was vocal against including anon parts. (Hugh > nodded to some of your points, but I don't really know his position on > folios in general and anon stuff in particular). Nobody likes to be the crazy person on the soapbox, so I asked Hugh in private a few weeks back. Quoting him, with permission: : To the first and second order of approximation, you have been : speaking for me: but in a much more informed and constructive and : coherent and rational way than I would have managed myself. It's a broad and open-ended proposal with far-reaching consequences, and not everybody has the time (or foolhardiness) to engage on that. I wouldn't count silence as approval - just like I don't see approval as a sign that a person took a hard look at all the implications.
My only effort from the start has been working out unanswered questions in this proposal: Are compound pages the reliable, scalable, and memory-efficient way to do bigger page sizes? What's the scope of remaining tail pages where typesafety will continue to be missing? How do we implement code and properties shared by folios and non-folio types (like mmap/fault code for folios and network and driver pages)? There are no satisfying answers to any of these questions, but that also isn't very surprising: it's a huge scope. Lack of answers isn't failure, it's just a sign that the step size is too large and too dependent on a speculative future. It would have been great to whittle things down to a more incremental and concrete first step, which would have allowed us to keep testing the project against reality as we go through the myriad uses and corner cases of struct page that no single person can keep straight in their head. I'm grateful for the struct slab spinoff, I think it's exactly all of the above. I'm in full support of it and have dedicated time, effort and patches to help work out kinks that immediately and inevitably surfaced around the slab<->page boundary. I only hoped we could do the same for file pages first, learn from that, and then do anon pages; if they come out looking the same in the process, a unified folio would be a great trailing refactoring step. But alas here we are months later at the same impasse with the same open questions, and still talking in circles about speculative code. I don't have more time to invest in this, and I'm tired of the vitriol and ad hominems both in public and in private channels. I'm not really sure how to exit this. The reasons for my NAK are still there. But I will no longer argue or stand in the way of the patches.
On Mon, Oct 18, 2021 at 04:45:59PM -0400, Johannes Weiner wrote: > On Mon, Oct 18, 2021 at 02:12:32PM -0400, Kent Overstreet wrote: > > On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > > > I find this line of argument highly disingenuous. > > > > > > No new type is necessary to remove these calls inside MM code. Migrate > > > them into the callsites and remove the 99.9% very obviously bogus > > > ones. The process is the same whether you switch to a new type or not. > > > > Conversely, I don't see "leave all LRU code as struct page, and ignore anonymous > > pages" to be a serious counterargument. I got that you really don't want > > anonymous pages to be folios from the call Friday, but I haven't been getting > > anything that looks like a serious counterproposal from you. > > > > Think about what our goal is: we want to get to a world where our types describe > > unambiguously how our data is used. That means working towards > > - getting rid of type punning > > - struct fields that are only used for a single purpose > > How is a common type inheritance model with a generic page type and > subclasses not a counterproposal? > > And one which actually accomplishes those two things you're saying, as > opposed to a shared folio where even 'struct address_space *mapping' > is a total lie type-wise? > > Plus, really, what's the *alternative* to doing that anyway? How are > we going to implement code that operates on folios and other subtypes > of the page alike? And deal with attributes and properties that are > shared among them all? Willy's original answer to that was that folio > is just *going* to be all these things - file, anon, slab, network, > rando driver stuff.
> > I'm not *against* what you and Willy are saying. I have *genuinely > zero idea what* you are saying. So we were starting to talk more concretely last night about the splitting of struct page into multiple types, and what that means for page->lru. The basic process I've had in mind for splitting struct page up into multiple types is: create a new type for each struct in the union-of-structs, change code to refer to that type instead of struct page, then - importantly - delete those members from the union-of-structs in struct page. E.g. for struct slab, after Willy's struct slab patches, we want to delete that stuff from struct page - otherwise we've introduced new type punning where code can refer to the same members via struct page and struct slab, and it's also completely necessary in order to separately allocate these new structs and slim down struct page. Roughly what I've been envisioning for folios is that the struct in the union-of-structs with lru, mapping & index - that's what turns into folios. Note that we have a bunch of code using page->lru, page->mapping, and page->index that really shouldn't be. The buddy allocator uses page->lru for freelists, and it shouldn't be, but there's a straightforward solution for that: we need to create a new struct in the union-of-structs for free pages, and confine the buddy allocator to that (it'll be a nice cleanup, right now it's overloading both page->lru and page->private which makes no sense, and it'll give us a nice place to stick some other things). Other things that need to be fixed: - page->lru is used by the old .readpages interface for the list of pages we're doing reads to; Matthew converted most filesystems to his new and improved .readahead which thankfully no longer uses page->lru, but there's still a few filesystems that need to be converted - it looks like cifs and erofs, not sure what's going on with fs/cachefiles/. 
We need help from the maintainers of those filesystems to get that conversion done, this is holding up future cleanups. - page->mapping and page->index are used for entirely random purposes by some driver code - drivers/net/ethernet/sun/niu.c looks to be using page->mapping for a singly linked list (!). - unrelated, but worth noting: there's a fair amount of filesystem code that uses page->mapping and page->index and doesn't need to because it has it from context - it's both a performance improvement and a cleanup to change that code to not get it from the page. Basically, we need to get to a point where each field in struct page is used for one and just one thing, but that's going to take some time. You've been noting that page->mapping is used for different things depending on whether it's a file page or an anonymous page, and I agree that that's not ideal - but it's one that I'm much less concerned about because a field being used for two different things that are both core and related concepts in the kernel is less bad than fields that are used as dumping grounds for whatever is convenient - file & anon overloading page->mapping is just not the most pressing issue to me. Also, let's look at what file & anonymous pages share: - they're both mapped to userspace - they both need page->mapcount - they both share the lru code - they both need page->lru page->lru is the real decider for me, because getting rid of non-lru uses of that field looks very achievable to me, and once it's done it's one of the fields we want to delete from struct page and move to struct folio. If we leave the lru code using struct page, it creates a real problem for this approach - it means we won't be able to delete the folio struct from the union-of-structs in struct page. I'm not sure what our path forward would be. That's my resistance to trying to separate file & anon at this point. 
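The union-of-structs split Kent describes can be sketched in a few lines of userspace C. This is only an illustration of the pattern - give each user of struct page a named view of the shared storage instead of overloading ->lru and ->private - and the names free, buddy_list, and order are hypothetical, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Sketch only: one union-of-structs where each identity of the page
 * gets its own named view of the same storage. */
struct page {
	unsigned long flags;
	union {
		struct {			/* file/anon ("folio") view */
			struct list_head lru;
			void *mapping;
			unsigned long index;
		};
		struct {			/* hypothetical buddy view */
			struct list_head buddy_list;	/* was ->lru */
			unsigned int order;		/* was ->private */
		} free;
	};
};
```

The two views alias the same bytes - no memory grows - but code must now name which identity it means, which is exactly what makes deleting a view (and eventually allocating it separately) possible.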
I'm definitely not saying we shouldn't separate file & anon in the future - I don't have an opinion on whether or not it should be done, and if we do want to do that I'd want to think about doing it by embedding a "struct lru_object" into both file_folio and anon_folio and having the lru code refer to that instead of struct page - embedding an object is generally preferable to inheritance. I want to say - and I don't think I've been clear enough about this - my objection to trying to split up file & anonymous pages into separate types isn't so much based on any deep philosophical reasons (I have some ideas for making anonymous pages more like file pages that I would like to attempt, but I also heard you when you said you'd tried to do that in the past and it hadn't worked out) - my objection is that I think it would very much get in the way of shorter-term cleanups that are much more pressing. This is what I've been referring to when I've been talking about following the union-of-structs in splitting up struct page - I'm just trying to be practical. Another relevant thing we've been talking about is consolidating the types of pages that can be mapped into userspace. Right now we've got driver code mapping all sorts of rando pages into userspace, and this isn't good - pages in theory have this abstract interface that they implement, and pages mapped into userspace have a bigger and more complicated interface - i.e. a_ops.set_page_dirty; any page mapped into userspace can have this called on it via the O_DIRECT read path, and possibly other things. Right now we have drivers allocating vmalloc() memory and then mapping it into userspace, which is just bizarre - what chunk of code really owns that page, and is implementing that interface? vmalloc, or the driver? What I'd like to see happen is for those to get switched to some sort of internal device or inode, something that the driver owns and has an a_ops struct - at this point they'd just be normal file pages.
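The "embed, don't inherit" idea can be sketched like this. struct lru_object, file_folio, and anon_folio are all hypothetical names from the paragraph above, not existing kernel types; container_of_mock mirrors the kernel's container_of() pattern:

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* The piece the LRU code would operate on, embedded in each subtype
 * rather than inherited from a common base. */
struct lru_object {
	struct list_head lru;
};

struct file_folio {
	struct lru_object lo;
	void *mapping;
	unsigned long index;
};

struct anon_folio {
	struct lru_object lo;
	void *anon_vma;
};

/* LRU code takes only a struct lru_object *; a caller that knows the
 * subtype recovers it with the usual container_of() trick. */
#define container_of_mock(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct lru_object *file_folio_lru(struct file_folio *f)
{
	return &f->lo;
}
```

The shared code never sees a file_folio or anon_folio, so neither subtype lies about fields it doesn't have - which is the advantage of embedding over a single shared folio type here.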
The reason drivers are mapping vmalloc() memory into userspace is so they can get it into a contiguous kernel side memory mapping, but they could also be doing that by calling vmap() on existing pages - I think that would be much cleaner. I have no idea if this approach works for network pool pages or how those would be used, I haven't gotten that far - if someone can chime in about those that would be great. But, the end goal I'm envisioning is a world where _only_ bog standard file & anonymous pages are mapped to userspace - then _mapcount can be deleted from struct page and only needs to live in struct folio. Anyways, that's another thing to consider when thinking about whether file & anonymous pages should be the same type.
On Tue, Oct 19, 2021 at 12:11:35PM -0400, Kent Overstreet wrote: > On Mon, Oct 18, 2021 at 04:45:59PM -0400, Johannes Weiner wrote: > > On Mon, Oct 18, 2021 at 02:12:32PM -0400, Kent Overstreet wrote: > > > On Mon, Oct 18, 2021 at 12:47:37PM -0400, Johannes Weiner wrote: > > > > I find this line of argument highly disingenuous. > > > > > > > > No new type is necessary to remove these calls inside MM code. Migrate > > > > them into the callsites and remove the 99.9% very obviously bogus > > > > ones. The process is the same whether you switch to a new type or not. > > > > > > Conversely, I don't see "leave all LRU code as struct page, and ignore anonymous > > > pages" to be a serious counterargument. I got that you really don't want > > > anonymous pages to be folios from the call Friday, but I haven't been getting > > > anything that looks like a serious counterproposal from you. > > > > > > Think about what our goal is: we want to get to a world where our types describe > > > unambiguously how our data is used. That means working towards > > > - getting rid of type punning > > > - struct fields that are only used for a single purpose > > > > How is a common type inheritance model with a generic page type and > > subclasses not a counterproposal? > > > > And one which actually accomplishes those two things you're saying, as > > opposed to a shared folio where even 'struct address_space *mapping' > > is a total lie type-wise? > > > > Plus, really, what's the *alternative* to doing that anyway? How are > > we going to implement code that operates on folios and other subtypes > > of the page alike? And deal with attributes and properties that are > > shared among them all? Willy's original answer to that was that folio > > is just *going* to be all these things - file, anon, slab, network, > > rando driver stuff.
But since that wasn't very popular, would not get > > rid of type punning and overloaded members, would get rid of > > efficiently allocating descriptor memory etc.- what *is* the > > alternative now to common properties between split out subtypes? > > > > I'm not *against* what you and Willy are saying. I have *genuinely > > zero idea what* you are saying. > > So we were starting to talk more concretely last night about the splitting of > struct page into multiple types, and what that means for page->lru. > > The basic process I've had in mind for splitting struct page up into multiple > types is: create a new type for each struct in the union-of-structs, change code > to refer to that type instead of struct page, then - importantly - delete those > members from the union-of-structs in struct page. > > E.g. for struct slab, after Willy's struct slab patches, we want to delete that > stuff from struct page - otherwise we've introduced new type punning where code > can refer to the same members via struct page and struct slab, and it's also > completely necessary in order to separately allocate these new structs and slim > down struct page. > > Roughly what I've been envisioning for folios is that the struct in the > union-of-structs with lru, mapping & index - that's what turns into folios. > > Note that we have a bunch of code using page->lru, page->mapping, and > page->index that really shouldn't be. The buddy allocator uses page->lru for > freelists, and it shouldn't be, but there's a straightforward solution for that: > we need to create a new struct in the union-of-structs for free pages, and > confine the buddy allocator to that (it'll be a nice cleanup, right now it's > overloading both page->lru and page->private which makes no sense, and it'll > give us a nice place to stick some other things). 
> > Other things that need to be fixed: > > - page->lru is used by the old .readpages interface for the list of pages we're > doing reads to; Matthew converted most filesystems to his new and improved > .readahead which thankfully no longer uses page->lru, but there's still a few > filesystems that need to be converted - it looks like cifs and erofs, not > sure what's going on with fs/cachefiles/. We need help from the maintainers > of those filesystems to get that conversion done, this is holding up future > cleanups. The reason for using page->lru for non-LRU pages was simply that the page struct is already there, and it's an effective way to organize a variable number of temporary pages with no memory overhead beyond the page structure itself. Another benefit is that such non-LRU pages can be immediately picked off the list and added into the page cache without any pain (thus ->lru can be reused for real LRU usage). This approach was used to maximize performance (so that pages can be shared flexibly within the same read request, with no overhead beyond memory allocation/free from/to the buddy allocator) and to minimise the extra footprint. I'm fine with moving to some other way if a similar field can be used like this. Yet if no such field remains, I'm also very glad to write a patch to get rid of such usage, but I wish it could be merged _only_ together with the real final transformation; otherwise it still takes the extra memory of the old page structure and sacrifices overall performance for end users (..and thus has no benefit at all.) Thanks, Gao Xiang
On Wed, Oct 20, 2021 at 01:06:04AM +0800, Gao Xiang wrote: > On Tue, Oct 19, 2021 at 12:11:35PM -0400, Kent Overstreet wrote: > > Other things that need to be fixed: > > > > - page->lru is used by the old .readpages interface for the list of pages we're > > doing reads to; Matthew converted most filesystems to his new and improved > > .readahead which thankfully no longer uses page->lru, but there's still a few > > filesystems that need to be converted - it looks like cifs and erofs, not > > sure what's going on with fs/cachefiles/. We need help from the maintainers > > of those filesystems to get that conversion done, this is holding up future > > cleanups. > > The reason why using page->lru for non-LRU pages was just because the > page struct is already there and it's an effective way to organize > variable temporary pages without any extra memory overhead other than > page structure itself. Another benefits is that such non-LRU pages can > be immediately picked from the list and added into page cache without > any pain (thus ->lru can be reused for real lru usage). > > In order to maximize the performance (so that pages can be shared in > the same read request flexibly without extra overhead rather than > memory allocation/free from/to the buddy allocator) and minimise extra > footprint, this way was used. I'm pretty fine to transfer into some > other way instead if some similar field can be used in this way. > > Yet if no such field anymore, I'm also very glad to write a patch to > get rid of such usage, but I wish it could be merged _only_ with the > real final transformation together otherwise it still takes the extra > memory of the old page structure and sacrifices the overall performance > to end users (..thus has no benefits at all.) I haven't dived in to clean up erofs because I don't have a way to test it, and I haven't taken the time to understand exactly what it's doing. 
The old ->readpages interface gave you pages linked together on ->lru and this code seems to have been written in that era, when you would add pages to the page cache yourself. In the new scheme, the pages get added to the page cache for you, and then you take care of filling them (and marking them uptodate if the read succeeds). There's now readahead_expand() which you can call to add extra pages to the cache if the readahead request isn't compressed-block aligned. Of course, it may not succeed if we're out of memory or there were already pages in the cache. It looks like this will be quite a large change to how erofs handles compressed blocks, but if you're open to taking this on, I'd be very happy.
On Tue, Oct 19, 2021 at 12:11:35PM -0400, Kent Overstreet wrote: > I have no idea if this approach works for network pool pages or how those would > be used, I haven't gotten that far - if someone can chime in about those that Generally the driver goal is to create a shared memory buffer between kernel and user space. The two broadly common patterns are to have userspace call mmap() and the kernel side return the kernel pages from there - getting them from some kernel allocator. Or, userspace allocates the buffer and the kernel driver does pin_user_pages() to import the pages into its address space. I think it is quite feasible to provide some simple library API to manage the shared buffer through the mmap approach, and if that library wants to allocate inodes, folios and whatnot, it should be possible. It would help this idea to see Christoph's cleanup series go forward: https://lore.kernel.org/all/20200508153634.249933-1-hch@lst.de/ As it makes it a lot easier for drivers to get inodes in the first place. > would be great. But, the end goal I'm envisioning is a world where _only_ bog > standard file & anonymous pages are mapped to userspace - then _mapcount can be > deleted from struct page and only needs to live in struct folio. There has been a lot of work in the past years on mapping ZONE_DEVICE pages into userspace. Today FSDAX is kind of a mashup of a file page and a device page, but other cases are less obvious, especially DEVICE_COHERENT. Jason
Hi Matthew, On Tue, Oct 19, 2021 at 06:34:19PM +0100, Matthew Wilcox wrote: > On Wed, Oct 20, 2021 at 01:06:04AM +0800, Gao Xiang wrote: > > On Tue, Oct 19, 2021 at 12:11:35PM -0400, Kent Overstreet wrote: > > > Other things that need to be fixed: > > > > > > - page->lru is used by the old .readpages interface for the list of pages we're > > > doing reads to; Matthew converted most filesystems to his new and improved > > > .readahead which thankfully no longer uses page->lru, but there's still a few > > > filesystems that need to be converted - it looks like cifs and erofs, not > > > sure what's going on with fs/cachefiles/. We need help from the maintainers > > > of those filesystems to get that conversion done, this is holding up future > > > cleanups. > > > > The reason why using page->lru for non-LRU pages was just because the > > page struct is already there and it's an effective way to organize > > variable temporary pages without any extra memory overhead other than > > page structure itself. Another benefits is that such non-LRU pages can > > be immediately picked from the list and added into page cache without > > any pain (thus ->lru can be reused for real lru usage). > > > > In order to maximize the performance (so that pages can be shared in > > the same read request flexibly without extra overhead rather than > > memory allocation/free from/to the buddy allocator) and minimise extra > > footprint, this way was used. I'm pretty fine to transfer into some > > other way instead if some similar field can be used in this way. > > > > Yet if no such field anymore, I'm also very glad to write a patch to > > get rid of such usage, but I wish it could be merged _only_ with the > > real final transformation together otherwise it still takes the extra > > memory of the old page structure and sacrifices the overall performance > > to end users (..thus has no benefits at all.) 
> > I haven't dived in to clean up erofs because I don't have a way to test > it, and I haven't taken the time to understand exactly what it's doing. Actually, I don't think it would be a real cleanup, given the current page structure design. > > The old ->readpages interface gave you pages linked together on ->lru > and this code seems to have been written in that era, when you would > add pages to the page cache yourself. > > In the new scheme, the pages get added to the page cache for you, and > then you take care of filling them (and marking them uptodate if the > read succeeds). There's now readahead_expand() which you can call to add > extra pages to the cache if the readahead request isn't compressed-block > aligned. Of course, it may not succeed if we're out of memory or there > were already pages in the cache. Hmmm, these temporary pages in the list may be (re)used later for the page cache, or used as temporary compressed pages for some I/O, or as an lz4 decompression buffer (technically an lz77 sliding window) to temporarily hold some decompressed data within the same read request (because some pages are already mapped and we cannot expose the decompression process to userspace, among other reasons). It all works by recycling: these temporary pages may finally go into some file's page cache, or be recycled for several temporary uses many times and finally freed to the buddy allocator. > > It looks like this will be quite a large change to how erofs handles > compressed blocks, but if you're open to taking this on, I'd be very happy. For ->lru itself, the change is quite small, but it sacrifices performance. Still, I'm very glad to do it once a decision about this ->lru field is made. Thanks, Gao Xiang
Kent Overstreet <kent.overstreet@gmail.com> wrote: > > - page->lru is used by the old .readpages interface for the list of pages we're > doing reads to; Matthew converted most filesystems to his new and improved > .readahead which thankfully no longer uses page->lru, but there's still a few > filesystems that need to be converted - it looks like cifs and erofs, not > sure what's going on with fs/cachefiles/. We need help from the maintainers > of those filesystems to get that conversion done, this is holding up future > cleanups. fscache and cachefiles should be taken care of by my patchset here: https://lore.kernel.org/r/163363935000.1980952.15279841414072653108.stgit@warthog.procyon.org.uk https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-remove-old-io With that 9p, afs and ceph use netfs lib to handle readpage, readahead and part of write_begin. nfs and cifs do their own wrangling of readpages/readahead, but will call out to the cache directly to handle each page individually. At some point, cifs will hopefully be converted to use netfs lib. David
On Tue, Oct 19, 2021 at 11:16:18AM -0400, Johannes Weiner wrote: > My only effort from the start has been working out unanswered > questions in this proposal: Are compound pages the reliable, scalable, > and memory-efficient way to do bigger page sizes? What's the scope of > remaining tailpages where typesafety will continue to lack? How do we > implement code and properties shared by folios and non-folio types > (like mmap/fault code for folio and network and driver pages)? I don't think those questions need to be answered before proceeding with this patchset. They're interesting questions, to be sure, but to a large extent they're orthogonal to the changes here. I look forward to continuing to work on those problems while filesystems and the VFS continue to be converted to use folios. > I'm not really sure how to exit this. The reasons for my NAK are still > there. But I will no longer argue or stand in the way of the patches. Thank you. I appreciate that.
On 19.10.21 17:16, Johannes Weiner wrote: > On Tue, Oct 19, 2021 at 02:16:27AM +0300, Kirill A. Shutemov wrote: >> On Mon, Oct 18, 2021 at 05:56:34PM -0400, Johannes Weiner wrote: >>>> I don't think there will ever be consensus as long as you don't take >>>> the concerns of other MM developers seriously. On Friday's call, several >>>> people working on using large pages for anon memory told you that using >>>> folios for anon memory would make their lives easier, and you didn't care. >>> >>> Nope, one person claimed that it would help, and I asked how. Not >>> because I'm against typesafety, but because I wanted to know if there >>> is an aspect in there that would specifically benefit from a shared >>> folio type. I don't remember there being one, and I'm not against type >>> safety for anon pages. >>> >>> What several people *did* say at this meeting was whether you could >>> drop the anon stuff for now until we have consensus. >> >> My read on the meeting was that most of people had nothing against anon >> stuff, but asked if Willy could drop anon parts to get past your >> objections to move forward. >> >> You was the only person who was vocal against including anon pars. (Hugh >> nodded to some of your points, but I don't really know his position on >> folios in general and anon stuff in particular). > > Nobody likes to be the crazy person on the soapbox, so I asked Hugh in > private a few weeks back. Quoting him, with permission: > > : To the first and second order of approximation, you have been > : speaking for me: but in a much more informed and constructive and > : coherent and rational way than I would have managed myself. > > It's a broad and open-ended proposal with far reaching consequences, > and not everybody has the time (or foolhardiness) to engage on that. I > wouldn't count silence as approval - just like I don't see approval as > a sign that a person took a hard look at all the implications. 
> > My only effort from the start has been working out unanswered > questions in this proposal: Are compound pages the reliable, scalable, > and memory-efficient way to do bigger page sizes? What's the scope of > remaining tailpages where typesafety will continue to lack? How do we > implement code and properties shared by folios and non-folio types > (like mmap/fault code for folio and network and driver pages)? > > There are no satisfying answers to any of these questions, but that > also isn't very surprising: it's a huge scope. Lack of answers isn't > failure, it's just a sign that the step size is too large and too > dependent on a speculative future. It would have been great to whittle > things down to a more incremental and concrete first step, which would > have allowed us to keep testing the project against reality as we go > through all the myriad of uses and cornercases of struct page that no > single person can keep straight in their head. > > I'm grateful for the struct slab spinoff, I think it's exactly all of > the above. I'm in full support of it and have dedicated time, effort > and patches to help work out kinks that immediately and inevitably > surfaced around the slab<->page boundary. > > I only hoped we could do the same for file pages first, learn from > that, and then do anon pages; if they come out looking the same in the > process, a unified folio would be a great trailing refactoring step. > > But alas here we are months later at the same impasse with the same > open questions, and still talking in circles about speculative code. > I don't have more time to invest into this, and I'm tired of the > vitriol and ad-hominems both in public and in private channels. Thanks Johannes for defending your position and I can understand that you are running out of motivation+energy to defend further. 
For the record: I was happy to see the slab refactoring, although I raised some points regarding how to access properties that belong in the "struct page". As raised elsewhere, I'd also be more comfortable seeing small incremental changes/cleanups that are consistent even without having decided on an ultimate end-goal -- this includes folios. I'd be happy to see file-backed THP gaining their own, dedicated type first ("struct $whatever"), before generalizing it to folios. I'm writing this message solely to back your "not everybody has the time (or foolhardiness) to engage on that. I wouldn't count silence as approval.". While I do have the capacity to review smaller, incremental steps (see struct slab), I don't have the time+energy to grasp the full folio picture. So I also second "it's a huge scope. [...] it's just a sign that the step size is too large and too dependent on a speculative future." My 2 cents on this topic.
On Wed, Oct 20, 2021 at 09:50:58AM +0200, David Hildenbrand wrote: > For the records: I was happy to see the slab refactoring, although I > raised some points regarding how to access properties that belong into > the "struct page". I thought the slab discussion was quite productive. Unfortunately, none of our six (!) slab maintainers had anything to say about it. So I think it's pointless to proceed unless one of them weighs in and says "I'd be interested in merging something along these lines once these problems are addressed". > As raised elsewhere, I'd also be more comfortable > seeing small incremental changes/cleanups that are consistent even > without having decided on an ultimate end-goal -- this includes folios. > I'd be happy to see file-backed THP gaining their own, dedicated type > first ("struct $whatever"), before generalizing it to folios. I am genuinely confused by this. Folios are non-tail pages. That's all. There's no "ultimate end-goal". It's just a new type that lets the compiler (and humans!) know that this isn't a tail page. Some people want to take this further, and split off special types from struct page. I think that's a great idea. I'm even willing to help. But there are all kinds of places in the kernel where we handle generic pages of almost any type, and so regardless of how much we end up splitting off from struct page, we're still going to want the concept of folio. I get that in some parts of the MM, we can just assume that any struct page is a non-tail page. But that's not the case in the filemap APIs; they're pretty much all defined to return the precise page which contains the specific byte. I think that's a mistake, and I'm working to fix it. But until it is all fixed [1], having a type which says "this is not a tail page" is, frankly, essential. 
[1] which is a gargantuan job because I'm not just dealing with mm/filemap.c, but also with ~90 filesystems and things sufficiently like filesystems to have an address_space_operations of their own, including graphics drivers.
On Tue, Oct 19, 2021 at 11:16:18AM -0400, Johannes Weiner wrote: > On Tue, Oct 19, 2021 at 02:16:27AM +0300, Kirill A. Shutemov wrote: > > On Mon, Oct 18, 2021 at 05:56:34PM -0400, Johannes Weiner wrote: > > > > I don't think there will ever be consensus as long as you don't take > > > > the concerns of other MM developers seriously. On Friday's call, several > > > > people working on using large pages for anon memory told you that using > > > > folios for anon memory would make their lives easier, and you didn't care. > > > > > > Nope, one person claimed that it would help, and I asked how. Not > > > because I'm against typesafety, but because I wanted to know if there > > > is an aspect in there that would specifically benefit from a shared > > > folio type. I don't remember there being one, and I'm not against type > > > safety for anon pages. > > > > > > What several people *did* say at this meeting was whether you could > > > drop the anon stuff for now until we have consensus. > > > > My read on the meeting was that most of people had nothing against anon > > stuff, but asked if Willy could drop anon parts to get past your > > objections to move forward. > > > > You was the only person who was vocal against including anon pars. (Hugh > > nodded to some of your points, but I don't really know his position on > > folios in general and anon stuff in particular). > > Nobody likes to be the crazy person on the soapbox, so I asked Hugh in > private a few weeks back. Quoting him, with permission: > > : To the first and second order of approximation, you have been > : speaking for me: but in a much more informed and constructive and > : coherent and rational way than I would have managed myself. > > It's a broad and open-ended proposal with far reaching consequences, > and not everybody has the time (or foolhardiness) to engage on that. 
I > wouldn't count silence as approval - just like I don't see approval as > a sign that a person took a hard look at all the implications. > > My only effort from the start has been working out unanswered > questions in this proposal: Are compound pages the reliable, scalable, > and memory-efficient way to do bigger page sizes? What's the scope of > remaining tailpages where typesafety will continue to lack? How do we > implement code and properties shared by folios and non-folio types > (like mmap/fault code for folio and network and driver pages)? > > There are no satisfying answers to any of these questions, but that > also isn't very surprising: it's a huge scope. Lack of answers isn't > failure, it's just a sign that the step size is too large and too > dependent on a speculative future. It would have been great to whittle > things down to a more incremental and concrete first step, which would > have allowed us to keep testing the project against reality as we go > through all the myriad of uses and cornercases of struct page that no > single person can keep straight in their head. > > I'm grateful for the struct slab spinoff, I think it's exactly all of > the above. I'm in full support of it and have dedicated time, effort > and patches to help work out kinks that immediately and inevitably > surfaced around the slab<->page boundary. Thank you for at least (belatedly) voicing your appreciation of the struct slab patches, that much wasn't at all clear to me or Matthew during the initial discussion. > I only hoped we could do the same for file pages first, learn from > that, and then do anon pages; if they come out looking the same in the > process, a unified folio would be a great trailing refactoring step. > > But alas here we are months later at the same impasse with the same > open questions, and still talking in circles about speculative code. 
> I don't have more time to invest into this, and I'm tired of the > vitriol and ad-hominems both in public and in private channels. > > I'm not really sure how to exit this. The reasons for my NAK are still > there. But I will no longer argue or stand in the way of the patches. Johannes, what I gathered from the meeting on Friday is that all you seem to care about at this point is whether or not file and anonymous pages are the same type. You got most of what you wanted regarding the direction of folios - they're no longer targeted at all compound pages! We're working on breaking struct page up into multiple types! But I'm frustrated by you disengaging like this, after I went to a lot of effort to bring you and your ideas into the discussion, but... if you're going to stubbornly cling to this point and refuse to hear other ideas the way you have been, I honestly don't know what to tell you. And after all this it's hard to see the wider issues with struct page actually getting tackled. Shame.
On Wed, Oct 20, 2021 at 01:54:20AM +0800, Gao Xiang wrote: > On Tue, Oct 19, 2021 at 06:34:19PM +0100, Matthew Wilcox wrote: > > It looks like this will be quite a large change to how erofs handles > > compressed blocks, but if you're open to taking this on, I'd be very happy. > > For ->lru, it's quite small, but it sacrifices the performance. Yet I'm > very glad to do if some decision of this ->lru field is determined. I would be very appreciative if you were willing to do the work, and I know others would be too. These kinds of cleanups may seem small individually, but they make a _very_ real difference when we're looking kernel-wide at how feasible these struct page changes are - and even if they don't happen, it really helps the understandability of the code if we can move towards a single struct field always being used for a single purpose in our core data types.
On 20.10.21 19:26, Matthew Wilcox wrote: > On Wed, Oct 20, 2021 at 09:50:58AM +0200, David Hildenbrand wrote: >> For the records: I was happy to see the slab refactoring, although I >> raised some points regarding how to access properties that belong into >> the "struct page". > > I thought the slab discussion was quite productive. Unfortunately, > none of our six (!) slab maintainers had anything to say about it. So I > think it's pointless to proceed unless one of them weighs in and says > "I'd be interested in merging something along these lines once these > problems are addressed". Yes, that's really unfortunate ... :( > >> As raised elsewhere, I'd also be more comfortable >> seeing small incremental changes/cleanups that are consistent even >> without having decided on an ultimate end-goal -- this includes folios. >> I'd be happy to see file-backed THP gaining their own, dedicated type >> first ("struct $whatever"), before generalizing it to folios. > > I am genuinely confused by this. > > Folios are non-tail pages. That's all. There's no "ultimate end-goal". > It's just a new type that lets the compiler (and humans!) know that this > isn't a tail page. > > Some people want to take this further, and split off special types from > struct page. I think that's a great idea. I'm even willing to help. > But there are all kinds of places in the kernel where we handle generic > pages of almost any type, and so regardless of how much we end up > splitting off from struct page, we're still going to want the concept > of folio. And I guess that generic mechanism is where the controversy starts and where people start having different expectation. IMHO you can tell that from the whole "naming" discussion/controversy. I always thought, why not call it "struct compound_page" until I think someone commented that it might not be a compound page but only a single base page somewhere. 
But I got tired (most probably just like you) of reading all the wild ideas and all the side discussions. Nobody can follow all that. If we limited this to "this is an anon THP" and called it "struct anon_thp", I assume the end result would be significantly easier. An anon THP only makes sense with more than one page; otherwise it's simply a base page and has to be treated completely differently by most MM code (esp. THP splitting). Similarly, call it "struct filemap" (bad name) and define it as either a single page or, for a compound page, its head page (what you call a folio). Let's think about this (and this is something that might happen for real): assume we have to add a field for handling something about anon THP to the struct page (let's assume in the head page for simplicity). Where would we add it? To "struct folio", exposing it to all other folios that don't really need it because it's so special? To "struct page", where it actually doesn't belong after all the discussions? And if we had to move that field into a tail page, it would get even more "tricky". Of course, we could let all special types inherit from "struct folio", which inherits from "struct page" ... but I am not convinced that we actually want that. After all, we're C programmers ;) But enough with another side-discussion :) Yes, the types are something I think is very reasonable to have now that we've discussed it, and I think they're a valuable result of the whole discussion. I consider them the cleaner, smaller step. > > I get that in some parts of the MM, we can just assume that any struct > page is a non-tail page. But that's not the case in the filemap APIs; > they're pretty much all defined to return the precise page which contains > the specific byte. I think that's a mistake, and I'm working to fix it. > But until it is all fixed [1], having a type which says "this is not a > tail page" is, frankly, essential. 
I can completely understand that the filemap API wants and needs such a concept. I think having some way to do that for the filemap API is very much desired.
On Wed, Oct 20, 2021 at 08:04:56PM +0200, David Hildenbrand wrote: > real): assume we have to add a field for handling something about anon > THP in the struct page (let's assume in the head page for simplicity). > Where would we add it? To "struct folio" and expose it to all other > folios that don't really need it because it's so special? To "struct > page" where it actually doesn't belong after all the discussions? And if > we would have to move that field it into a tail page, it would get even > more "tricky". > > Of course, we could let all special types inherit from "struct folio", > which inherit from "struct page" ... but I am not convinced that we > actually want that. After all, we're C programmers ;) > > But enough with another side-discussion :) FYI, with my block and direct I/O developer hat on I really, really want to have the folio for both file and anon pages. Because to make the get_user_pages path a _lot_ more efficient it should store folios. And to make that work I need them to work for file and anon pages because for get_user_pages and related code they are treated exactly the same.
On 21.10.21 08:51, Christoph Hellwig wrote: > On Wed, Oct 20, 2021 at 08:04:56PM +0200, David Hildenbrand wrote: >> real): assume we have to add a field for handling something about anon >> THP in the struct page (let's assume in the head page for simplicity). >> Where would we add it? To "struct folio" and expose it to all other >> folios that don't really need it because it's so special? To "struct >> page" where it actually doesn't belong after all the discussions? And if >> we would have to move that field it into a tail page, it would get even >> more "tricky". >> >> Of course, we could let all special types inherit from "struct folio", >> which inherit from "struct page" ... but I am not convinced that we >> actually want that. After all, we're C programmers ;) >> >> But enough with another side-discussion :) > > FYI, with my block and direct I/O developer hat on I really, really > want to have the folio for both file and anon pages. Because to make > the get_user_pages path a _lot_ more efficient it should store folios. > And to make that work I need them to work for file and anon pages > because for get_user_pages and related code they are treated exactly > the same. Thanks, I can understand that. And IMHO that would be even possible with split types; the function prototype will simply have to look a little more fancy instead of replacing "struct page" by "struct folio". :)
On Thu, Oct 21, 2021 at 09:21:17AM +0200, David Hildenbrand wrote: > On 21.10.21 08:51, Christoph Hellwig wrote: > > FYI, with my block and direct I/O developer hat on I really, really > > want to have the folio for both file and anon pages. Because to make > > the get_user_pages path a _lot_ more efficient it should store folios. > > And to make that work I need them to work for file and anon pages > > because for get_user_pages and related code they are treated exactly > > the same. ++ > Thanks, I can understand that. And IMHO that would be even possible with > split types; the function prototype will simply have to look a little > more fancy instead of replacing "struct page" by "struct folio". :) Possible yes, but might it be a little premature to split them?
On 21.10.21 14:03, Kent Overstreet wrote: > On Thu, Oct 21, 2021 at 09:21:17AM +0200, David Hildenbrand wrote: >> On 21.10.21 08:51, Christoph Hellwig wrote: >>> FYI, with my block and direct I/O developer hat on I really, really >>> want to have the folio for both file and anon pages. Because to make >>> the get_user_pages path a _lot_ more efficient it should store folios. >>> And to make that work I need them to work for file and anon pages >>> because for get_user_pages and related code they are treated exactly >>> the same. > > ++ > >> Thanks, I can understand that. And IMHO that would be even possible with >> split types; the function prototype will simply have to look a little >> more fancy instead of replacing "struct page" by "struct folio". :) > > Possible yes, but might it be a little premature to split them? Personally, I think it's the right thing to do to introduce something limited like "struct filemap" (again, bad name, i.e., folio restricted to the filemap API) first and avoid introducing a generic folio thingy. So I'd even consider going with folios all the way premature. But I assume what to consider premature and what not depends on the point of view already. And maybe that's the biggest point where we all disagree. Anyhow, what I don't quite understand is the following: as the first important goal, we want to improve the filemap API; that's a noble goal and I highly appreciate Willy's work. To improve the API, there is absolutely no need to introduce a generic folio. Yet we argue about whether a generic folio vs. a filemap-specific folio is the right thing to do as a first step. My opinion after all the discussions: use a dedicated type with a clear name to solve the immediate filemap API issue. Leave the remainder alone for now. Less code to touch, less subsystems to involve (well, still a lot), less people to upset, less discussions to have, faster review, faster upstream, faster progress. A small but reasonable step. 
But maybe I'm just living in a dream world :)
On Thu, Oct 21, 2021 at 02:35:32PM +0200, David Hildenbrand wrote: > My opinion after all the discussions: use a dedicate type with a clear > name to solve the immediate filemap API issue. Leave the remainder alone > for now. Less code to touch, less subsystems to involve (well, still a > lot), less people to upset, less discussions to have, faster review, > faster upstream, faster progress. A small but reasonable step. I don't get it. I mean I'm not the MM expert, I've only been touching most areas of it occasionally for the last 20 years, but anon and file pages have way more in common both in terms of use cases and implementation than what is different (unlike some of the other (ab)uses of struct page). What is the point of splitting it now when there are tons of use cases where they are used absolutely interchangeably both in consumers of the API and the implementation?
On Thu, Oct 21, 2021 at 02:35:32PM +0200, David Hildenbrand wrote: > My opinion after all the discussions: use a dedicate type with a clear > name to solve the immediate filemap API issue. Leave the remainder alone > for now. Less code to touch, less subsystems to involve (well, still a > lot), less people to upset, less discussions to have, faster review, > faster upstream, faster progress. A small but reasonable step. I didn't change anything I didn't need to. File pages go onto the LRU list, so I need to change the LRU code to handle arbitrary-sized folios instead of pages which are either order-0 or order-9. Every function that I convert in this patchset is either used by another function in this patchset, or by the fs/iomap conversion that I have staged for the next merge window after folios goes in.
On 21.10.21 14:38, Christoph Hellwig wrote: > On Thu, Oct 21, 2021 at 02:35:32PM +0200, David Hildenbrand wrote: >> My opinion after all the discussions: use a dedicate type with a clear >> name to solve the immediate filemap API issue. Leave the remainder alone >> for now. Less code to touch, less subsystems to involve (well, still a >> lot), less people to upset, less discussions to have, faster review, >> faster upstream, faster progress. A small but reasonable step. > > I don't get it. I mean I'm not the MM expert, I've only been touching > most areas of it occasionally for the last 20 years, but anon and file > pages have way more in common both in terms of use cases and You most certainly have way more MM expertise than me ;) I'm just a random MM developer, so everybody can feel free to just ignore what I'm saying here. I didn't NACK anything, I just consider a lot of things that Johannes raised reasonable. > implementation than what is different (unlike some of the other (ab)uses > of struct page). What is the point of splitting it now when there are > tons of use cases where they are used absolutely interchangable both > in consumers of the API and the implementation? I guess in an ideal world, we'd have multiple abstractions. We could clearly express for a function what type it expects. We'd have a type for something passed on the filemap API. We'd have a type for anon THP (or even just an anon page). We'd have a type that abstracts both. With that in mind, and not planning with what we'll actually end up with, to me it makes perfect sense to teach the filemap API to consume the expected type first. And I am not convinced that the folio as is ("not a tail page") is the right abstraction we actually want to pass around in places where we expect either anon or file pages -- or only anon pages or only file pages. Again, my 2 cents.
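[Editorial note: the "split types with fancier prototypes" idea above can be illustrated in plain C. This is a minimal userspace sketch under stated assumptions; all type and function names here (file_mem, anon_mem, lru_flags) are hypothetical, invented for illustration, and do not exist in the kernel.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: each wrapper is a distinct C type containing the
 * same underlying struct page, so the compiler rejects passing a
 * file_mem where an anon_mem is expected. */
struct page { unsigned long flags; };

struct file_mem { struct page page; };	/* page cache pages */
struct anon_mem { struct page page; };	/* anonymous pages */

/* A function that genuinely handles both kinds takes the "fancier"
 * prototype: both types spelled out, exactly one of them non-NULL,
 * instead of a single bare struct page argument. */
static unsigned long lru_flags(struct file_mem *fm, struct anon_mem *am)
{
	struct page *p = fm ? &fm->page : &am->page;
	return p->flags;
}
```

A tagged union or a shared intermediate type would be alternative ways to express the same dual-type prototype.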
On Wed, Oct 20, 2021 at 01:39:10PM -0400, Kent Overstreet wrote: > Thank you for at least (belatedly) voicing your appreciation of the struct slab > patches, that much wasn't at all clear to me or Matthew during the initial > discussion. The first sentence I wrote in response to that series is: "I like this whole patch series, but I think for memcg this is a particularly nice cleanup." - https://lore.kernel.org/all/YWRwrka5h4Q5buca@cmpxchg.org/ The second email I wrote started with: "This looks great to me. It's a huge step in disentangling struct page, and it's already showing very cool downstream effects in somewhat unexpected places like the memory cgroup controller." - https://lore.kernel.org/all/YWSZctm%2F2yxu19BV@cmpxchg.org/ Then I sent a pageflag cleanup series specifically to help improve the clarity of the struct slab split a bit. Truly ambiguous stuff..? > > I only hoped we could do the same for file pages first, learn from > > that, and then do anon pages; if they come out looking the same in the > > process, a unified folio would be a great trailing refactoring step. > > > > But alas here we are months later at the same impasse with the same > > open questions, and still talking in circles about speculative code. > > I don't have more time to invest into this, and I'm tired of the > > vitriol and ad-hominems both in public and in private channels. > > > > I'm not really sure how to exit this. The reasons for my NAK are still > > there. But I will no longer argue or stand in the way of the patches. > > Johannes, what I gathered from the meeting on Friday is that all you seem to > care about at this point is whether or not file and anonymous pages are the same > type. No. I'm going to bow out because - as the above confirms again - the communication around these patches is utterly broken. But I'm not leaving on a misrepresentation of my stance after having spent months thinking about these patches and their implications. 
Here is my summary of the discussion, and my conclusion: The premise of the folio was initially to simply be a type that says: I'm the headpage for one or more pages. Never a tailpage. Cool. However, after we talked about what that actually means, we seem to have some consensus on the following: 1) If folio is to be a generic headpage, it'll be the new dumping ground for slab, network, drivers etc. Nobody is psyched about this, hence the idea to split the page into subtypes which already resulted in the struct slab patches. 2) If higher-order allocations are going to be the norm, it's wasteful to statically allocate full descriptors at a 4k granularity. Hence the push to eliminate overloading and do on-demand allocation of necessary descriptor space. I think that's accurate, but for the record: is there anybody who disagrees with this and insists that struct folio should continue to be the dumping ground for all kinds of memory types? Let's assume the answer is "no" for now and move on. If folios are NOT the common headpage type, it begs two questions: 1) What subtype(s) of page SHOULD it represent? This is somewhat unclear at this time. Some say file+anon. It's also been suggested everything userspace-mappable, but that would again bring back major type punning. Who knows? Vocal proponents of the folio type have made conflicting statements on this, which certainly gives me pause. 2) What IS the common type used for attributes and code shared between subtypes? For example: if a folio is anon+file, then the code that maps memory to userspace needs a generic type in order to map both folios and network pages. Same as the page table walkers, and things like GUP. Will this common type be struct page? Something new? Are we going to duplicate the implementation for each subtype? Another example: GUP can return tailpages. I don't see how it could return folio with even its most generic definition of "headpage". 
(But bottomline, it's not clear how folio can be the universal headpage type and simultaneously avoid being the type dumping ground that the page was. Maybe I'm not creative enough?) Anyway. I can even be convinced that we can figure out the exact fault lines along which we split the page down the road. My worry is more about 2). A shared type and generic code is likely to emerge regardless of how we split it. Think about it, the only world in which that isn't true would be one in which either a) page subtypes are all the same, or b) the subtypes have nothing in common and both are clearly bogus. I think we're being overly dismissive of this question. It seems to me that *the core challenge* in splitting out the various subtypes of struct page is to properly identify the generic domain and private domains of the subtypes, and then clearly and consistently implement boundaries! If this isn't a deliberate effort, things will get messy and confusing quickly. These boundary quirks were the first thing that showed up in the struct slab patches, and finding a clean and intuitive fix didn't seem trivial to agree on (to my own surprise.) So. All of the above leads me to these conclusions: Once you acknowledge the need for a shared abstraction layer, forcing a binary choice between anon and file doesn't make sense: they have some stuff in common, and some stuff is different. Some code can be shared naturally, some cannot. This isn't unlike the VFS inode and the various fs-specific inode types. It's a chance for the code to finally reflect the sizable but incomplete overlap of the two. And once you need a model for generic and private attributes and code anyway, doing just file at first - even if it isn't along a substruct boundary - becomes a more reasonable, smaller step for splitting things out of the page. Just the fs interface and page cache bits, as opposed to also reclaim, lru, migration, memcg, all at once. 
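[Editorial note: the VFS analogy above refers to a concrete kernel pattern: a generic struct inode is embedded inside each filesystem-specific inode, and container_of() recovers the private type. The following is a reduced userspace model; the struct members are simplified stand-ins, not the real layouts.]

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct inode {				/* generic domain: shared code */
	unsigned long ino;
};

struct ext4_inode_info {		/* private domain: fs-specific */
	unsigned long extra_state;
	struct inode vfs_inode;		/* embedded generic part */
};

/* Generic code passes struct inode around; fs code recovers its own
 * type from the embedded member. The same shape could carry a generic
 * page core shared by file- and anon-specific types. */
static struct ext4_inode_info *EXT4_I(struct inode *inode)
{
	return container_of(inode, struct ext4_inode_info, vfs_inode);
}
```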
Obviously, because it's a smaller step, it won't go as far toward shrinking struct page and separately allocatable descriptors. But it also doesn't work against that effort. And there are still a ton of bootstrapping questions around separately allocating descriptors anyway. So it strikes me as an acceptable tradeoff for now. There is something else that the smaller step would be great for: doing file first would force us to properly deal with the generic vs private domain delineation, and come up with a sound strategy for it. With private file code and shared anon/file code. And it would do so inside a much smaller and deliberate changeset, where we could give it the proper attention. As opposed to letting it emerge ad-hoc and drowning out the case-by-case decisions in huge, churny series. So that's my ACTUAL stance. (For completeness, here are the other considerations I mentioned in the past: I don't think compound page allocations are a good path to larger page sizes, based on the THP experience at FB, Google's THP experience, and testimony from other people who have worked on fragmentation and compaction; but I'm willing to punt on that pending more data. I also don't think the head/tailpage question is interesting enough to make it the central identity of the object we're passing around MM code. Or that we need a new type to get rid of bogus compound_head() calls. But whatever at this point.) Counterarguments I've heard to the above: Wouldn't a generic struct page layer eat into the goal of shrinking struct page down to two words? Well sure, but if all that's left in it at the end is a pointer, a list_head and some flags used by every subtype, we've done pretty well on that front. It's all tradeoffs. Also, way too many cornercases to be thinking in absolutes already. Would it give up type safety in the LRU code? Not really, if all additions are through typed headpages. 
We don't need to worry about tailpages in that code, the same way we don't need to check PageReserved() in there: there is no plausible route for such pages. Don't you want tailpage safety in anon code? I'm not against that, but it's not like the current folio patches provide it. They just set up a direction (without MM consensus). Either way, it'd happen later on. Why are my eyes glazing over when I read all this? Well, mine glazed over writing all this. struct page is a lot of stuff, and IMO these patches touch too much of it at once. Anyway, that's my exhaustive take on things.
On Thu, Oct 21, 2021 at 05:37:41PM -0400, Johannes Weiner wrote: > Here is my summary of the discussion, and my conclusion: Thank you for this. It's the clearest, most useful post on this thread, including my own. It really highlights the substantial points that should be discussed. > The premise of the folio was initially to simply be a type that says: > I'm the headpage for one or more pages. Never a tailpage. Cool. > > However, after we talked about what that actually means, we seem to > have some consensus on the following: > > 1) If folio is to be a generic headpage, it'll be the new > dumping ground for slab, network, drivers etc. Nobody is > psyched about this, hence the idea to split the page into > subtypes which already resulted in the struct slab patches. > > 2) If higher-order allocations are going to be the norm, it's > wasteful to statically allocate full descriptors at a 4k > granularity. Hence the push to eliminate overloading and do > on-demand allocation of necessary descriptor space. > > I think that's accurate, but for the record: is there anybody who > disagrees with this and insists that struct folio should continue to > be the dumping ground for all kinds of memory types? I think there's a useful distinction to be drawn between "where we're going with this patchset", "where we're going in the next six-twelve months" and "where we're going eventually". I think we have minor differences of opinion on the answers to those questions, and they can be resolved as we go, instead of up-front. My answer to that question is that, while this full conversion is not part of this patch, struct folio is logically: struct folio { ... almost everything that's currently in struct page ... 
};

struct page {
	unsigned long flags;
	unsigned long compound_head;
	union {
		struct {	/* First tail page only */
			unsigned char compound_dtor;
			unsigned char compound_order;
			atomic_t compound_mapcount;
			unsigned int compound_nr;
		};
		struct {	/* Second tail page only */
			atomic_t hpage_pinned_refcount;
			struct list_head deferred_list;
		};
		unsigned long padding1[4];
	};
	unsigned int padding2[2];
#ifdef CONFIG_MEMCG
	unsigned long padding3;
#endif
#ifdef WANT_PAGE_VIRTUAL
	void *virtual;
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
};

(I'm open to being told I have some of that wrong, eg maybe _last_cpupid is actually part of struct folio and isn't a per-page property at all)

I'd like to get there in the next year. I think dynamically allocating memory descriptors is more than a year out.

Now, as far as struct folio being a dumping ground, I would like to split other things out from struct folio. Let me address that below.

> Let's assume the answer is "no" for now and move on.
>
> If folios are NOT the common headpage type, it begs two questions:
>
> 1) What subtype(s) of page SHOULD it represent?
>
> This is somewhat unclear at this time. Some say file+anon. It's also
> been suggested everything userspace-mappable, but that would again
> bring back major type punning. Who knows?
>
> Vocal proponents of the folio type have made conflicting statements
> on this, which certainly gives me pause.
>
> 2) What IS the common type used for attributes and code shared
> between subtypes?
>
> For example: if a folio is anon+file, then the code that maps memory
> to userspace needs a generic type in order to map both folios and
> network pages. Same as the page table walkers, and things like GUP.
>
> Will this common type be struct page? Something new? Are we going to
> duplicate the implementation for each subtype?
>
> Another example: GUP can return tailpages. I don't see how it could
> return folio with even its most generic definition of "headpage".
> > (But bottomline, it's not clear how folio can be the universal > headpage type and simultaneously avoid being the type dumping ground > that the page was. Maybe I'm not creative enough?) This whole section is predicated on "If it is NOT the headpage type", but I think this is a great list of why it _should_ be the generic headpage type. To answer a question in here; GUP should continue to return precise pages because that's what its callers expect. But we should have a better interface than GUP which returns a rather more compressed list (something like today's biovec). > Anyway. I can even be convinved that we can figure out the exact fault > lines along which we split the page down the road. > > My worry is more about 2). A shared type and generic code is likely to > emerge regardless of how we split it. Think about it, the only world > in which that isn't true would be one in which either > > a) page subtypes are all the same, or > b) the subtypes have nothing in common > > and both are clearly bogus. Amen! I'm convinced that pgtable, slab and zsmalloc uses of struct page can all be split out into their own types instead of being folios. They have little-to-nothing in common with anon+file; they can't be mapped into userspace and they can't be on the LRU. The only situation you can find them in is something like compaction which walks PFNs. I don't think we can split out ZONE_DEVICE and netpool into their own types. While they can't be on the LRU, they can be mapped to userspace, like random device drivers. So they can be found by GUP, and we want (need) to be able to go to folio from there in order to get, lock and set a folio as dirty. Also, they have a mapcount as well as a refcount. The real question, I think, is whether it's worth splitting anon & file pages out from generic pages. 
I can see arguments for it, but I can also see arguments against it (whether it's two types: lru_mem and folio, three types: anon_mem, file_mem and folio or even four types: ksm_mem, anon_mem and file_mem). I don't think a compelling argument has been made either way. Perhaps you could comment on how you'd see separate anon_mem and file_mem types working for the memcg code? Would you want to have separate lock_anon_memcg() and lock_file_memcg(), or would you want them to be cast to a common type like lock_folio_memcg()?

P.S. One variant we haven't explored is separating type specialisation from finding the head page. eg, instead of having

	struct slab *slab = page_slab(page);

we could have:

	struct slab *slab = folio_slab(page_folio(page));

I don't think it's particularly worth doing, but Kent mused about it at one point.
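[Editorial note: the two-step variant in the P.S., separating "find the head page" from "specialise the type", can be modelled with toy structs. The compound_head encoding below (head address with the low bit set for tail pages) mirrors the kernel's scheme, but the layouts and the page_folio()/folio_slab() bodies are simplified illustrations, not the real implementations.]

```c
#include <assert.h>

struct page {
	unsigned long flags;
	unsigned long compound_head;	/* (head page address | 1) if tail */
};
struct folio { struct page page; };
struct slab { struct folio folio; };

/* Step 1: resolve any page, head or tail, to its head page. */
static struct folio *page_folio(struct page *p)
{
	if (p->compound_head & 1)
		p = (struct page *)(p->compound_head - 1);
	return (struct folio *)p;
}

/* Step 2: pure type specialisation; no head-page walk involved. */
static struct slab *folio_slab(struct folio *f)
{
	return (struct slab *)f;
}
```

Composing the two as folio_slab(page_folio(page)) makes the head lookup explicit at the call site, whereas a single page_slab(page) would hide it.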
On 22.10.21 03:52, Matthew Wilcox wrote: > On Thu, Oct 21, 2021 at 05:37:41PM -0400, Johannes Weiner wrote: >> Here is my summary of the discussion, and my conclusion: > > Thank you for this. It's the clearest, most useful post on this thread, > including my own. It really highlights the substantial points that > should be discussed. > >> The premise of the folio was initially to simply be a type that says: >> I'm the headpage for one or more pages. Never a tailpage. Cool. >> >> However, after we talked about what that actually means, we seem to >> have some consensus on the following: >> >> 1) If folio is to be a generic headpage, it'll be the new >> dumping ground for slab, network, drivers etc. Nobody is >> psyched about this, hence the idea to split the page into >> subtypes which already resulted in the struct slab patches. >> >> 2) If higher-order allocations are going to be the norm, it's >> wasteful to statically allocate full descriptors at a 4k >> granularity. Hence the push to eliminate overloading and do >> on-demand allocation of necessary descriptor space. >> >> I think that's accurate, but for the record: is there anybody who >> disagrees with this and insists that struct folio should continue to >> be the dumping ground for all kinds of memory types? > > I think there's a useful distinction to be drawn between "where we're > going with this patchset", "where we're going in the next six-twelve > months" and "where we're going eventually". I think we have minor > differences of opinion on the answers to those questions, and they can > be resolved as we go, instead of up-front. > > My answer to that question is that, while this full conversion is not > part of this patch, struct folio is logically: > > struct folio { > ... almost everything that's currently in struct page ... 
> }; > > struct page { > unsigned long flags; > unsigned long compound_head; > union { > struct { /* First tail page only */ > unsigned char compound_dtor; > unsigned char compound_order; > atomic_t compound_mapcount; > unsigned int compound_nr; > }; > struct { /* Second tail page only */ > atomic_t hpage_pinned_refcount; > struct list_head deferred_list; > }; > unsigned long padding1[4]; > }; > unsigned int padding2[2]; > #ifdef CONFIG_MEMCG > unsigned long padding3; > #endif > #ifdef WANT_PAGE_VIRTUAL > void *virtual; > #endif > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > int _last_cpupid; > #endif > }; > > (I'm open to being told I have some of that wrong, eg maybe _last_cpupid > is actually part of struct folio and isn't a per-page property at all) > > I'd like to get there in the next year. I think dynamically allocating > memory descriptors is more than a year out. > > Now, as far as struct folio being a dumping group, I would like to > split other things out from struct folio. Let me address that below. > >> Let's assume the answer is "no" for now and move on. >> >> If folios are NOT the common headpage type, it begs two questions: >> >> 1) What subtype(s) of page SHOULD it represent? >> >> This is somewhat unclear at this time. Some say file+anon. >> It's also been suggested everything userspace-mappable, but >> that would again bring back major type punning. Who knows? >> >> Vocal proponents of the folio type have made conflicting >> statements on this, which certainly gives me pause. >> >> 2) What IS the common type used for attributes and code shared >> between subtypes? >> >> For example: if a folio is anon+file, then the code that >> maps memory to userspace needs a generic type in order to >> map both folios and network pages. Same as the page table >> walkers, and things like GUP. >> >> Will this common type be struct page? Something new? Are we >> going to duplicate the implementation for each subtype? >> >> Another example: GUP can return tailpages. 
I don't see how >> it could return folio with even its most generic definition >> of "headpage". >> >> (But bottomline, it's not clear how folio can be the universal >> headpage type and simultaneously avoid being the type dumping ground >> that the page was. Maybe I'm not creative enough?) > > This whole section is predicated on "If it is NOT the headpage type", > but I think this is a great list of why it _should_ be the generic > headpage type. > > To answer a questions in here; GUP should continue to return precise > pages because that's what its callers expect. But we should have a > better interface than GUP which returns a rather more compressed list > (something like today's biovec). > >> Anyway. I can even be convinved that we can figure out the exact fault >> lines along which we split the page down the road. >> >> My worry is more about 2). A shared type and generic code is likely to >> emerge regardless of how we split it. Think about it, the only world >> in which that isn't true would be one in which either >> >> a) page subtypes are all the same, or >> b) the subtypes have nothing in common >> >> and both are clearly bogus. > > Amen! > > I'm convinced that pgtable, slab and zsmalloc uses of struct page can all > be split out into their own types instead of being folios. They have > little-to-nothing in common with anon+file; they can't be mapped into > userspace and they can't be on the LRU. The only situation you can find > them in is something like compaction which walks PFNs. > > I don't think we can split out ZONE_DEVICE and netpool into their own > types. While they can't be on the LRU, they can be mapped to userspace, > like random device drivers. So they can be found by GUP, and we want > (need) to be able to go to folio from there in order to get, lock and > set a folio as dirty. Also, they have a mapcount as well as a refcount. > > The real question, I think, is whether it's worth splitting anon & file > pages out from generic pages. 
I can see arguments for it, but I can also > see arguments against it (whether it's two types: lru_mem and folio, > three types: anon_mem, file_mem and folio or even four types: ksm_mem, > anon_mem and file_mem). I don't think a compelling argument has been > made either way. > > Perhaps you could comment on how you'd see separate anon_mem and > file_mem types working for the memcg code? Would you want to have > separate lock_anon_memcg() and lock_file_memcg(), or would you want > them to be cast to a common type like lock_folio_memcg()?

FWIW, something like this would roughly express what I've been mumbling about:

 anon_mem   file_mem
     |          |
     -----|------
       lru_mem       slab
          |           |
          -------------
                |
              page

I wouldn't include folios in this picture, because IMHO folios as of now are actually what we want to be "lru_mem", just with a much clearer name+description (again, IMHO).

Going from file_mem -> page is easy, just casting pointers. Going from page -> file_mem requires going to the head page if it's a compound page.

But we expect most interfaces to pass around a proper type (e.g., lru_mem) instead of a page, which avoids having to look up the compound head page. And each function can express which type it actually wants to consume. The filemap API wants to consume file_mem, so it should use that.

And IMHO, with something above in mind and not having a clue which additional layers we'll really need, or which additional leaves we want to have, we would start with the leaves (e.g., file_mem, anon_mem, slab) and work our way towards the root. Just like we already started with slab. Maybe that makes sense.
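[Editorial note: the hierarchy in the drawing above maps naturally onto embedded structs, where going from file_mem to page is, as stated, just casting pointers. A minimal sketch with the same illustrative names; none of these types exist in the kernel.]

```c
#include <assert.h>

struct page { unsigned long flags; };
struct lru_mem { struct page page; };	/* anything on the LRU */
struct file_mem { struct lru_mem lru; };
struct anon_mem { struct lru_mem lru; };

/* Walking toward the root of the hierarchy is a pointer adjustment... */
static struct page *lru_mem_page(struct lru_mem *lm)
{
	return &lm->page;
}

/* ...and LRU code consumes lru_mem, accepting either subtype via the
 * embedded member; the filemap API would consume file_mem directly. */
static unsigned long lru_mem_flags(struct lru_mem *lm)
{
	return lru_mem_page(lm)->flags;
}
```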
On Fri, Oct 22, 2021 at 09:59:05AM +0200, David Hildenbrand wrote: > something like this would roughly express what I've been mumbling about: > > anon_mem file_mem > | | > ------|------ > lru_mem slab > | | > ------------- > | > page > > I wouldn't include folios in this picture, because IMHO folios as of now > are actually what we want to be "lru_mem", just which a much clearer > name+description (again, IMHO). I think folios are a superset of lru_mem. To enhance your drawing:

page
    folio
        lru_mem
            anon_mem
                ksm
            file_mem
        netpool
        devmem
        zonedev
    slab
    pgtable
    buddy
    zsmalloc
    vmalloc

I have a little list of memory types here: https://kernelnewbies.org/MemoryTypes Let me know if anything is missing. > Going from file_mem -> page is easy, just casting pointers. > Going from page -> file_mem requires going to the head page if it's a > compound page. > > But we expect most interfaces to pass around a proper type (e.g., > lru_mem) instead of a page, which avoids having to lookup the compund > head page. And each function can express which type it actually wants to > consume. The filmap API wants to consume file_mem, so it should use that. > > And IMHO, with something above in mind and not having a clue which > additional layers we'll really need, or which additional leaves we want > to have, we would start with the leaves (e.g., file_mem, anon_mem, slab) > and work our way towards the root. That assumes that the "root" layers already handle compound pages properly. For example, nothing in mm/page-writeback.c does; it assumes everything is an order-0 page. So working in the opposite direction makes sense because it tells us what has already been converted and is thus safe to call. And starting with file_mem makes the supposition that it's worth splitting file_mem from anon_mem. I believe that's one or two steps further than it's worth, but I can be convinced otherwise. 
For example, do we have examples of file pages being passed to routines that expect anon pages? Most routines that I've looked at expect to see both file & anon pages, and treat them either identically or do slightly different things. But those are just the functions I've looked at; your experience may be quite different.
On 22.10.21 15:01, Matthew Wilcox wrote: > On Fri, Oct 22, 2021 at 09:59:05AM +0200, David Hildenbrand wrote: >> something like this would roughly express what I've been mumbling about: >> >> anon_mem file_mem >> | | >> ------|------ >> lru_mem slab >> | | >> ------------- >> | >> page >> >> I wouldn't include folios in this picture, because IMHO folios as of now >> are actually what we want to be "lru_mem", just which a much clearer >> name+description (again, IMHO). > > I think folios are a superset of lru_mem. To enhance your drawing: > In the picture below we want "folio" to be the abstraction of "mappable into user space", after reading your link below and reading your graph, correct? Like calling it "user_mem" instead. Because any of these types would imply that we're looking at the head page (if it's a compound page). And we could (or even already have?) have other types that cannot be mapped to user space that are actually a compound page. > page > folio > lru_mem > anon_mem > ksm > file_mem > netpool > devmem > zonedev > slab > pgtable > buddy > zsmalloc > vmalloc > > I have a little list of memory types here: > https://kernelnewbies.org/MemoryTypes > > Let me know if anything is missing. hugetlbfs pages might deserve a dedicated type, right? > >> Going from file_mem -> page is easy, just casting pointers. >> Going from page -> file_mem requires going to the head page if it's a >> compound page. >> >> But we expect most interfaces to pass around a proper type (e.g., >> lru_mem) instead of a page, which avoids having to lookup the compund >> head page. And each function can express which type it actually wants to >> consume. The filmap API wants to consume file_mem, so it should use that. >> >> And IMHO, with something above in mind and not having a clue which >> additional layers we'll really need, or which additional leaves we want >> to have, we would start with the leaves (e.g., file_mem, anon_mem, slab) >> and work our way towards the root. 
Just like we already started with slab. > > That assumes that the "root" layers already handle compound pages > properly. For example, nothing in mm/page-writeback.c does; it assumes > everything is an order-0 page. So working in the opposite direction > makes sense because it tells us what has already been converted and is > thus safe to call. Right, as long as the lower layers receive a "struct page", they have to assume it's "anything" -- IOW a random base page. We need some temporary logic when transitioning from "typed" code into "struct page" code that doesn't talk compound pages yet, I agree. And I think the different types used actually would tell us what has been converted and what not. Whenever you have to go from type -> "struct page" we have to be very careful. > > And starting with file_mem makes the supposition that it's worth splitting > file_mem from anon_mem. I believe that's one or two steps further than > it's worth, but I can be convinced otherwise. For example, do we have > examples of file pages being passed to routines that expect anon pages? That would be a BUG, so I hope we don't have it ;) > Most routines that I've looked at expect to see both file & anon pages, Right, many of them do. Which tells me that they share a common type in many places. Let's consider LRU code

static inline int folio_is_file_lru(struct folio *folio)
{
	return !folio_swapbacked(folio);
}

I would say we don't really want to pass folios here. We actually want to pass something reasonable, like "lru_mem". But yes, it's just doing what "struct page" used to do via page_is_file_lru(). Let's consider

folio_wait_writeback(struct folio *folio)

Do we actually want to pass in a folio here? Would we actually want to pass in lru_mem here or even something else? > and treat them either identically or do slightly different things. > But those are just the functions I've looked at; your experience may be > quite different. I assume when it comes to LRU, writeback, ...
the behavior is very similar or at least the current functions just decide internally what to do based on e.g., ..._is_file_lru(). I don't know if it's best to keep hiding that functionality within an abstracted type or just provide two separate functions for anon and file. folios mostly mimic what the old struct page used to do, introducing similar functions. Maybe the reason we branch off within these functions is because it just made sense when passing around "struct page" and not having something clearer at hand that let the caller do the branch. For the cases of LRU I looked at, it somewhat makes sense to just do it internally. Looking at some core MM code, like mm/huge_memory.c, and seeing all the PageAnon() specializations, having a dedicated anon_mem type might be valuable. But at this point it's hard to tell if splitting up these functions would actually be desirable. We're knee-deep in the type discussion now and I appreciate it. I can understand that folios are currently really just a "not a tail page" concept and mimic a lot of what we already inherited from the old "struct page" world.
On Fri, Oct 22, 2021 at 04:40:24PM +0200, David Hildenbrand wrote: > On 22.10.21 15:01, Matthew Wilcox wrote: > > On Fri, Oct 22, 2021 at 09:59:05AM +0200, David Hildenbrand wrote: > >> something like this would roughly express what I've been mumbling about:
> >>
> >>  anon_mem   file_mem
> >>     |           |
> >>     ------|------
> >>  lru_mem           slab
> >>     |                |
> >>     ------------------
> >>              |
> >>            page
> >>
> >> I wouldn't include folios in this picture, because IMHO folios as of now > >> are actually what we want to be "lru_mem", just with a much clearer > >> name+description (again, IMHO). > > > > I think folios are a superset of lru_mem. To enhance your drawing: > > > > In the picture below we want "folio" to be the abstraction of "mappable > into user space", after reading your link below and reading your graph, > correct? Like calling it "user_mem" instead. Hmm. Actually, we want a new layer in the ontology:

page
   folio
      mappable
         lru_mem
            anon_mem
               ksm
            file_mem
         netpool
         devmem
         zonedev
         vmalloc
         zsmalloc
         dmapool
         devmem (*)
      slab
      pgtable
      buddy

(*) yes, devmem appears twice; some is mappable and some is not

The ontology is kind of confusing because *every* page is part of a folio. Sometimes it's a folio of one page (eg vmalloc). Which means that it's legitimate to call page_folio() on a slab page and then call folio_test_slab(). It's not the direction we want to go though. We're also inconsistent about whether we consider an entire compound page / folio the thing which is mapped, or whether each individual page in the compound page / folio can be mapped. See how differently file-THP and anon-THP are handled in rmap, for example. I think that was probably a mistake. > Because any of these types would imply that we're looking at the head > page (if it's a compound page). And we could (or even already have?) > have other types that cannot be mapped to user space that are actually a > compound page. Sure, slabs are compound pages which cannot be mapped to userspace.
> > I have a little list of memory types here: > > https://kernelnewbies.org/MemoryTypes > > > > Let me know if anything is missing. > > hugetlbfs pages might deserve a dedicated type, right? Not sure. Aren't they just file pages (albeit sometimes treated specially, which is one of the mistakes we need to fix)? > > And starting with file_mem makes the supposition that it's worth splitting > > file_mem from anon_mem. I believe that's one or two steps further than > > it's worth, but I can be convinced otherwise. For example, do we have > > examples of file pages being passed to routines that expect anon pages? > > That would be a BUG, so I hope we don't have it ;) Right. I'm asking, did we fix any bugs in the last year or two that were caused by this kind of mismatch and would be prevented by using a different type? There's about half a dozen bugs we've had in the last year that were caused by passing tail pages to functions that were expecting head pages. I can think of one problem we have, which is that (for a few filesystems which have opted into this), we can pass an anon page into ->readpage() and we've had problems with those filesystems then mishandling the anon page. The solution to this problem is not to pass an lru_mem to readpage, but to use a different fs operation to read swap pages. > Let's consider folio_wait_writeback(struct folio *folio) > > Do we actually want to pass in a folio here? Would we actually want to > pass in lru_mem here or even something else? Well, let's look at the callers (for simplicity, look at Linus' current tree). Other than the ones in filesystems which we can assume have file pages, mm/migrate.c has __unmap_and_move(). What type should migrate_pages() have and pass around? > Looking at some core MM code, like mm/huge_memory.c, and seeing all the > PageAnon() specializations, having a dedicated anon_mem type might be > valuable. But at this point it's hard to tell if splitting up these > functions would actually be desirable. 
Yes. That's my point; it *might* be desirable. I have no objections to it, but the people doing the work need to show the benefits. I'm showing the benefits to folios -- fewer bugs, smaller code, larger pages in the page cache leading to faster systems. I acknowledge the costs in terms of churn. You can see folios as a first step to disentangling some of the users of struct page. It certainly won't be the last step. But I'd really like to stop having theoretical discussions of memory types and get on with writing code. If that means we modify the fs APIs again in twelve months to replace folios with file_mem, well, I'm OK with that.
On Sat, Oct 23, 2021 at 03:22:35AM +0100, Matthew Wilcox wrote: > You can see folios as a first step to disentangling some of the users > of struct page. It certainly won't be the last step. But I'd really > like to stop having theoretical discussions of memory types and get on > with writing code. Agreed. I think folios are really important to sort out the mess around compound pages ASAP. I'm a lot more lukewarm on the other splits. Yes, struct page is a mess, but I'm not sure creating gazillions of new types solves that mess. Getting rid of a bunch of the crazy optimizations that abuse struct page fields might be a better first step - or rather after the first step of folios, which fix real bugs in compound handling and do enable sane handling of compound pages in the page cache. > If that means we modify the fs APIs again in twelve > months to replace folios with file_mem, well, I'm OK with that. I suspect we won't even need that so quickly, if at all, but I'd rather have a little more churn than block this important work forever.
>> In the picture below we want "folio" to be the abstraction of "mappable >> into user space", after reading your link below and reading your graph, >> correct? Like calling it "user_mem" instead. > > Hmm. Actually, we want a new layer in the ontology:
>
> page
>    folio
>       mappable
>          lru_mem
>             anon_mem
>                ksm
>             file_mem
>          netpool
>          devmem
>          zonedev
>          vmalloc
>          zsmalloc
>          dmapool
>          devmem (*)
>       slab
>       pgtable
>       buddy
>
> (*) yes, devmem appears twice; some is mappable and some is not
>
I mostly agree, to 99%, with the above and I think that's a valuable outcome of the discussion. What I don't yet understand is why we would require the type "folio" at all. This will be my last question: you're the folio expert, which interfaces do you think would, with the above done right, actually consume a folio such that we would consequently need it? I would assume that there would be no real need for them. Say we have "struct lru_mem" and we want to test if it's an anon_mem for example to upcast. Say the function to perform the check is something called "lru_mem_test_anon()" for example. Instead of

folio_test_anon(lru_mem_to_folio())

we'd do

_PageAnon(lru_mem_to_page())

whereby _PageAnon() is just a variant that does no implicit compound head lookup -- however you would want to call that. Because we know that lru_mem doesn't point to a tail page. I imagine the same would hold for any other type of accesses that go via a page type, except that we might not always go directly via the "struct page" but instead via a casted type (e.g., cast file_mem -> lru_mem and call the corresponding helper that implements the magic). > The ontology is kind of confusing because *every* page is part of a > folio. Sometimes it's a folio of one page (eg vmalloc). Which means > that it's legitimate to call page_folio() on a slab page and then call > folio_test_slab(). It's not the direction we want to go though. That tackles part of the problem I'm having with having a dedicated "folio" type in the picture above.
A folio is literally *any page* as long as it's not a tail page :) > > We're also inconsistent about whether we consider an entire compound > page / folio the thing which is mapped, or whether each individual page > in the compound page / folio can be mapped. See how differently file-THP > and anon-THP are handled in rmap, for example. I think that was probably > a mistake. Yes. And whenever I think about "why do we want to split both types" the thought that keeps dominating is "splitting and migrating anon THP is just very different from any other THP". > >> Because any of these types would imply that we're looking at the head >> page (if it's a compound page). And we could (or even already have?) >> have other types that cannot be mapped to user space that are actually a >> compound page. > > Sure, slabs are compound pages which cannot be mapped to userspace. > >>> I have a little list of memory types here: >>> https://kernelnewbies.org/MemoryTypes >>> >>> Let me know if anything is missing. >> >> hugetlbfs pages might deserve a dedicated type, right? > > Not sure. Aren't they just file pages (albeit sometimes treated > specially, which is one of the mistakes we need to fix)? From all the special-casing in core-mm and remembering that they make excessive use of compound-tail members, my impression was that they might look like file pages but are in many cases very different. <offtopic> Just for the records, I could imagine a type spanning multiple struct pages, to handle the cases right now that actually store data in tail page metadata. Like having "struct hugetlb" that is actually X*sizeof(struct page) and instead of all these crazy compound tail page lookups, we'd just be able to reference the relevant members via "struct hugetlb" directly. We can do that for types we know are actually compound pages of a certain size -- like hugetlbfs. </offtopic> > >>> And starting with file_mem makes the supposition that it's worth splitting >>> file_mem from anon_mem. 
I believe that's one or two steps further than >>> it's worth, but I can be convinced otherwise. For example, do we have >>> examples of file pages being passed to routines that expect anon pages? >> >> That would be a BUG, so I hope we don't have it ;) > > Right. I'm asking, did we fix any bugs in the last year or two that > were caused by this kind of mismatch and would be prevented by using > a different type? There's about half a dozen bugs we've had in the > last year that were caused by passing tail pages to functions that > were expecting head pages. For my part, I don't recall either writing (well, it's not my area of expertise) or reviewing such patches. I do assume that many type checks catch that early during testing. I do recall reviewing some patches that remove setting page flags on (IIRC) anon pages that just don't make any sense, but were not harmful. <example> I keep stumbling over type checks that I think might just be due to old cruft we're dragging along, due to the way we for example extended THP. Like __split_huge_page(). I can spot two PageAnon(head) calls which end up looking up the head page again. Then, we call remap_page(), which doesn't make any sense for !PageAnon(), thus we end up doing a third call to PageAnon(head). In __split_huge_page_tail() we check PageAnon(head) again for every invocation. I'm not saying that we should rewrite __split_huge_page() completely, or that this cannot be cleaned up differently. I'm rather imagining that splitting out a "struct anon_mem" might turn things cleaner and avoid many of the type checks and consequently also more head page lookups. Again, this is most probably a bad example, I just wanted to share something that I noticed. </example> Passing "struct page *" to random functions just has to let these functions
* Eventually lookup or at least verify that it's not a tail page
* Eventually lookup or at least verify that it's the right type.
And some functions do the same lookup over and over again. > > I can think of one problem we have, which is that (for a few filesystems > which have opted into this), we can pass an anon page into ->readpage() > and we've had problems with those filesystems then mishandling the > anon page. The solution to this problem is not to pass an lru_mem to > readpage, but to use a different fs operation to read swap pages. Interesting example! > >> Let's consider folio_wait_writeback(struct folio *folio) >> >> Do we actually want to pass in a folio here? Would we actually want to >> pass in lru_mem here or even something else? > > Well, let's look at the callers (for simplicity, look at Linus' > current tree). Other than the ones in filesystems which we can assume > have file pages, mm/migrate.c has __unmap_and_move(). What type should > migrate_pages() have and pass around? That's an interesting point. Ideally it should deal with an abstract type "struct migratable", which would include lru and !lru migratable pages (e.g., balloon compaction). The current function name indicates that we're working on pages ("migrate_pages") :) so the upcast would have to happen internally unless we'd change the interface or even split it up ("migrate_lru_mems()"). But yes, that's an interesting case. > >> Looking at some core MM code, like mm/huge_memory.c, and seeing all the >> PageAnon() specializations, having a dedicated anon_mem type might be >> valuable. But at this point it's hard to tell if splitting up these >> functions would actually be desirable. > > Yes. That's my point; it *might* be desirable. I have no objections to > it, but the people doing the work need to show the benefits. I'm showing > the benefits to folios -- fewer bugs, smaller code, larger pages in the > page cache leading to faster systems. I acknowledge the costs in terms > of churn. See my bad example above.
From the "bitwise" discussion I get the feeling that some people care about type safety (including me) :) > > You can see folios as a first step to disentangling some of the users > of struct page. It certainly won't be the last step. But I'd really > like to stop having theoretical discussions of memory types and get on > with writing code. If that means we modify the fs APIs again in twelve > months to replace folios with file_mem, well, I'm OK with that. I know, the crowd is screaming "we want folios, we need folios, get out of the way". I know that the *compound page* handling is a mess and that we want something to change that. The point I am making is that folios are not necessarily what we *need*. Types as discussed above are really just the basic idea of a folio lifted to the next level, one that avoids not only any kind of PageTail checks but also any kind of type checks we have splattered all over the place. IMHO that's a huge win when it comes to code readability and maintainability. This also tackles the point Johannes made: folios being the dumping ground for everything. And he has a point, because folios are really just "not tail pages", so consequently they will 99% just mimic what "struct page" does, and we all know what that means. Your patches introduce the concept of folio across many layers and your point is to eventually clean up later and eventually remove it from all layers again. I can understand that approach, yet I am at least asking the question if this is the right order to do this. And again, I am not blocking this, I think cleaning up compound pages is very nice. I'm asking questions to see how the concept of folios would fit in long-term and if it would be required at all if types are done right. And I think a valuable result of this discussion, at least to me, is that:
* I can understand why we want (many parts of) the filemap API to consume an abstracted type instead of file_mem and anon_mem.
* I understand that compound pages are a fact and properly teaching the different layers and subsystems how to handle them cleanly is not something radical. It's just the natural and clean thing to do.
* I believe types as discussed above are realistic and comparatively easy to add. I believe they are much more realistic than a bunch of other ideas I heard throughout the last couple of months.
I acknowledge that defragmentation is a real problem, though. But it has been and most probably will remain a different problem than just getting compound page handling right. Again, I appreciate this discussion. I know you're sick and tired of folio discussions, so I'll stop asking questions.
On Sat, Oct 23, 2021 at 11:58:42AM +0200, David Hildenbrand wrote: > I know, the crowd is screaming "we want folios, we need folios, get out > of the way". I know that the *compound page* handling is a mess and that > we want something to change that. The point I am making is that folios > are not necessarily what we *need*. > > Types as discussed above are really just using the basic idea of a folio > lifted to the next level that not only avoid any kind of PageTail checks > but also any kind of type checks we have splattered all over the place. > IMHO that's a huge win when it comes to code readability and > maintainability. This also tackles the point Johannes made: folios being > the dumping ground for everything. And he has a point, because folios > are really just "not tail pages", so consequently they will 99% just > mimic what "struct page" does, and we all know what that means. Look, even if folios go this direction of being the compound page replacement, the "new dumping ground" argument is just completely bogus. In introducing new types and type safety for struct page, it's not reasonable to try to solve everything at once - we don't know what an ideal end solution is going to look like, we can't see that far ahead. What is a reasonable approach is looking for where the fault lines are in the way struct page is used now, cutting along those lines, looking at the result, then cutting it up some more. If the first new type still inherits most of the mess in struct page but it solves real problems, that's not a failure, that's normal incremental progress! -------- More than that, I think you and Johannes heard what I was saying about imagining what the ideal end solution would look like with infinite refactoring and you two have been running way too far with that idea - the stuff you guys are talking about sounds overengineered to me - inheritance hierarchies before we've introduced the first new type!
The point of such thought experiments is to imagine how simple things could be - and also to not take such thought experiments too seriously, because when we start refactoring real world code, that's when we discover what's actually _possible_. I ran into a major roadblock when I tried converting buddy allocator freelists to radix trees: freeing a page may require allocating a new page for the radix tree freelist, which is fine normally - we're freeing a page after all - but not if it's highmem. So right now I'm not sure if getting struct page down to two words is even possible. Oh well. > Your patches introduce the concept of folio across many layers and your > point is to eventually clean up later and eventually remove it from all > layers again. I can understand that approach, yet I am at least asking > the question if this is the right order to do this. > > And again, I am not blocking this, I think cleaning up compound pages is > very nice. I'm asking questions to see how the concept of folios would > fit in long-term and if it would be required at all if types are done right. I'm also not really seeing the need to introduce folios as a replacement for all of compound pages, though - I think limiting it to file & anon and using the union-of-structs in struct page as the fault lines for introducing new types would be the reasonable thing to do. The struct slab patches were great, it's a real shame that the slab maintainers have been completely absent. Also, introducing new types to describe our current uses of struct page isn't the only thing we should be doing - as we do that, that will (is!) uncover a lot of places where our ontology of struct page uses is just nonsensical (all the types of pages mapped into userspace!) - and part of our mission should be to clean those up. That does turn things into a much bigger project than what Matthew signed up for, but we shouldn't all be sitting on the sidelines here...
On Sat, Oct 23, 2021 at 12:00:38PM -0400, Kent Overstreet wrote: > I ran into a major roadblock when I tried converting buddy allocator freelists > to radix trees: freeing a page may require allocating a new page for the radix > tree freelist, which is fine normally - we're freeing a page after all - but not > if it's highmem. So right now I'm not sure if getting struct page down to two > words is even possible. Oh well. I have a design in mind that I think avoids the problem. It's somewhat based on Bonwick's vmem paper, but not exactly. I need to write it up. > > Your patches introduce the concept of folio across many layers and your > > point is to eventually clean up later and eventually remove it from all > > layers again. I can understand that approach, yet I am at least asking > > the question if this is the right order to do this. > > > > And again, I am not blocking this, I think cleaning up compound pages is > > very nice. I'm asking questions to see how the concept of folios would > > fit in long-term and if it would be required at all if types are done right. > > I'm also not really seeing the need to introduce folios as a replacement for all > of compound pages, though - I think limiting it to file & anon and using the > union-of-structs in struct page as the fault lines for introducing new types > would be the reasonable thing to do. The struct slab patches were great, it's a > real shame that the slab maintainers have been completely absent. Right. Folios are for unspecialised head pages. If we decide to specialise further in the future, that's great! I think David misunderstood me slightly; I don't know that specialising file + anon pages (the aforementioned lru_mem) is the right approach. It might be! But it needs someone to try it, and find the advantages & disadvantages. > Also introducing new types to be describing our current using of struct page > isn't the only thing we should be doing - as we do that, that will (is!) 
uncover > a lot of places where our ontology of struct page uses is just nonsensical (all > the types of pages mapped into userspace!) - and part of our mission should be > to clean those up. > > That does turn things into a much bigger project than what Matthew signed up > for, but we shouldn't all be sitting on the sidelines here... I'm happy to help. Indeed I may take on some of these sub-projects myself. I just don't want the perfect to be the enemy of the good.
On Sat, Oct 23, 2021 at 10:41:41PM +0100, Matthew Wilcox wrote: > On Sat, Oct 23, 2021 at 12:00:38PM -0400, Kent Overstreet wrote: > > I ran into a major roadblock when I tried converting buddy allocator freelists > > to radix trees: freeing a page may require allocating a new page for the radix > > tree freelist, which is fine normally - we're freeing a page after all - but not > > if it's highmem. So right now I'm not sure if getting struct page down to two > > words is even possible. Oh well. > > I have a design in mind that I think avoids the problem. It's somewhat > based on Bonwick's vmem paper, but not exactly. I need to write it up. I am intrigued... Care to drop some hints? > Right. Folios are for unspecialised head pages. If we decide > to specialise further in the future, that's great! I think David > misunderstood me slightly; I don't know that specialising file + anon > pages (the aforementioned lru_mem) is the right approach. It might be! > But it needs someone to try it, and find the advantages & disadvantages. Well, that's where your current patches are basically headed, aren't they? As I understand it, it's just file and some of the anon code that's converted so far. Are you thinking more along the lines of converting everything that can be mapped to userspace to folios? I think that would make a lot of sense given that converting the weird things to file pages isn't likely to happen any time soon, and it would let us convert gup() to return folios, as Christoph noted. > > > Also introducing new types to be describing our current using of struct page > > isn't the only thing we should be doing - as we do that, that will (is!) uncover > > a lot of places where our ontology of struct page uses is just nonsensical (all > > the types of pages mapped into userspace!) - and part of our mission should be > > to clean those up.
> > > > That does turn things into a much bigger project than what Matthew signed up > > for, but we shouldn't all be sitting on the sidelines here... > > I'm happy to help. Indeed I may take on some of these sub-projects > myself. I just don't want the perfect to be the enemy of the good. Agreed!
On Fri, Oct 22, 2021 at 02:52:31AM +0100, Matthew Wilcox wrote: > > Anyway. I can even be convinved that we can figure out the exact fault > > lines along which we split the page down the road. > > > > My worry is more about 2). A shared type and generic code is likely to > > emerge regardless of how we split it. Think about it, the only world > > in which that isn't true would be one in which either > > > > a) page subtypes are all the same, or > > b) the subtypes have nothing in common > > > > and both are clearly bogus. > > Amen! > > I'm convinced that pgtable, slab and zsmalloc uses of struct page can all > be split out into their own types instead of being folios. They have > little-to-nothing in common with anon+file; they can't be mapped into > userspace and they can't be on the LRU. The only situation you can find > them in is something like compaction which walks PFNs. They can all be accounted to a cgroup. pgtables are tracked the same as other __GFP_ACCOUNT pages (pipe buffers and kernel stacks right now from a quick grep, but as you can guess that's open-ended). So if those all aren't folios, the generic type and the interfacing object for memcg and accounting would continue to be the page. > Perhaps you could comment on how you'd see separate anon_mem and > file_mem types working for the memcg code? Would you want to have > separate lock_anon_memcg() and lock_file_memcg(), or would you want > them to be cast to a common type like lock_folio_memcg()? That should be lock_<generic>_memcg() since it actually serializes and protects the same thing for all subtypes (unlike lock_page()!). The memcg interface is fully type agnostic nowadays, but it also needs to be able to handle any subtype. It should continue to interface with the broadest, most generic definition of "chunk of memory". 
Notably it does not do tail pages (and I don't see how it ever would), so it could in theory use the folio - but only if the folio is really the systematic replacement of absolutely *everything* that isn't a tail page - including pgtables, kernel stack, pipe buffers, and all other random alloc_page() calls spread throughout the code base. Not just conceptually, but an actual wholesale replacement of struct page throughout allocation sites. I'm not sure that's realistic. So I'm thinking struct page will likely be the interfacing object for memcg for the foreseeable future.
On Mon, Oct 25, 2021 at 11:35:25AM -0400, Johannes Weiner wrote: > On Fri, Oct 22, 2021 at 02:52:31AM +0100, Matthew Wilcox wrote: > > > Anyway. I can even be convinved that we can figure out the exact fault > > > lines along which we split the page down the road. > > > > > > My worry is more about 2). A shared type and generic code is likely to > > > emerge regardless of how we split it. Think about it, the only world > > > in which that isn't true would be one in which either > > > > > > a) page subtypes are all the same, or > > > b) the subtypes have nothing in common > > > > > > and both are clearly bogus. > > > > Amen! > > > > I'm convinced that pgtable, slab and zsmalloc uses of struct page can all > > be split out into their own types instead of being folios. They have > > little-to-nothing in common with anon+file; they can't be mapped into > > userspace and they can't be on the LRU. The only situation you can find > > them in is something like compaction which walks PFNs. > > They can all be accounted to a cgroup. pgtables are tracked the same > as other __GFP_ACCOUNT pages (pipe buffers and kernel stacks right now > from a quick grep, but as you can guess that's open-ended). Oh, this is good information! > So if those all aren't folios, the generic type and the interfacing > object for memcg and accounting would continue to be the page. > > > Perhaps you could comment on how you'd see separate anon_mem and > > file_mem types working for the memcg code? Would you want to have > > separate lock_anon_memcg() and lock_file_memcg(), or would you want > > them to be cast to a common type like lock_folio_memcg()? > > That should be lock_<generic>_memcg() since it actually serializes and > protects the same thing for all subtypes (unlike lock_page()!). > > The memcg interface is fully type agnostic nowadays, but it also needs > to be able to handle any subtype. It should continue to interface with > the broadest, most generic definition of "chunk of memory". 
Some of the memory descriptors might prefer to keep their memcg_data at a different offset from the start of the struct. Can we accommodate that, or do we ever get handed a specialised memory descriptor and then have to cast back to an unspecialised descriptor?

(The LRU list would be an example of this; the list_head must be at the same offset in all memory descriptors which use the LRU list.)
On Mon, Oct 25, 2021 at 11:35:25AM -0400, Johannes Weiner wrote:
> Notably it does not do tailpages (and I don't see how it ever would),
> so it could in theory use the folio - but only if the folio is really
> the systematic replacement of absolutely *everything* that isn't a
> tailpage - including pgtables, kernel stack, pipe buffers, and all
> other random alloc_page() calls spread throughout the code base. Not
> just conceptually, but an actual wholesale replacement of struct page
> throughout allocation sites.
>
> I'm not sure that's realistic. So I'm thinking struct page will likely
> be the interfacing object for memcg for the foreseeable future.

Interesting. We were also just discussing how in the block layer, bvecs can currently point to multiple pages - this is the multipage bvec work that Ming did. It made bio segment merging a lot cheaper by moving it from the layer that maps bvecs to sglists up to bio_add_page(), and got rid of the need for segment counting.

But with the upper layers transitioning to compound pages - i.e. keeping contiguous stuff together as a unit - we're going to want to switch bvecs to pointing to compound pages, and ditch all the code that breaks up a bvec into individual 4k pages when we iterate over them; we also won't need or want any kind of page/segment merging anymore, which is really cool.

But since bios can do IO to/from basically any type of memory, this is another argument in favor of folios becoming the replacement for all or essentially all compound pages. The alternative would be changing bvecs to only point to head pages, which I do think would be completely workable with appropriate assertions. We don't want to prevent doing block IO to/from slab memory - there's a lot of places where we do block IO to memory that isn't exposed to userspace (e.g. filesystem metadata, other weirder paths) - so if bvecs point to folios, then at least slab needs to be a subtype of folios, and folios need to be all or most compound pages.
I've been against folios being the replacement for all compound pages because this is C, trying to do a lot with types is a pain in the ass, and I think in general nested inheritance hierarchies tend to not be the way to go. But I'm definitely keeping an open mind...