Message ID | 20240704043132.28501-1-osalvador@suse.de (mailing list archive) |
---|---|
Headers | show |
Series | hugetlb pagewalk unification | expand |
On Thu, Jul 04, 2024 at 06:30:47AM +0200, Oscar Salvador wrote: > Hi all, > > During Peter's talk at the LSFMM, it was agreed that one of the things > that need to be done in order to further integrate hugetlb into mm core, > is to unify generic and hugetlb pagewalkers. > I started with this one, which is unifying hugetlb into generic > pagewalk, instead of having its hugetlb_entry entries. > Which means that pmd_entry/pte_entry(for cont-pte) entries will also deal with > hugetlb vmas as well, and so will new pud_entry entries since hugetlb can be > pud mapped (devm pages as well but we seem not to care about those with > the exception of hmm code). > > The outcome is this RFC. Just dropping the git tree, in case someone finds it more suitable: https://github.com/leberus/linux/ hugetlb-unification-pagewalk
On 04.07.24 06:30, Oscar Salvador wrote: > Hi all, > > During Peter's talk at the LSFMM, it was agreed that one of the things > that need to be done in order to further integrate hugetlb into mm core, > is to unify generic and hugetlb pagewalkers. > I started with this one, which is unifying hugetlb into generic > pagewalk, instead of having its hugetlb_entry entries. > Which means that pmd_entry/pte_entry(for cont-pte) entries will also deal with > hugetlb vmas as well, and so will new pud_entry entries since hugetlb can be > pud mapped (devm pages as well but we seem not to care about those with > the exception of hmm code). > > The outcome is this RFC. First of all, a good step into the right direction, but maybe not what we want long-term. So I'm questioning whether we want this intermediate approach. walk_page_range() and friends are simply not a good design (e.g., indirect function calls). There are roughly two categories of page table walkers we have: 1) We actually only want to walk present folios (to be precise, page ranges of folios). We should look into moving away from the walk the page walker API where possible, and have something better that directly gives us the folio (page ranges). Any PTE batching would be done internally. 2) We want to deal with non-present folios as well (swp entries and all kinds of other stuff). We should maybe implement our custom page table walker and move away from walk_page_range(). We are not walking "pages" after all but everything else included :) Then, there is a subset of 1) where we only want to walk to a single address (a single folio). I'm working on that right now to get rid of follow_page() and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still remain a bit special, but I'm afraid we cannot hide that completely.
Hey, David, On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote: > There are roughly two categories of page table walkers we have: > > 1) We actually only want to walk present folios (to be precise, page > ranges of folios). We should look into moving away from the walk the > page walker API where possible, and have something better that > directly gives us the folio (page ranges). Any PTE batching would be > done internally. > > 2) We want to deal with non-present folios as well (swp entries and all > kinds of other stuff). We should maybe implement our custom page > table walker and move away from walk_page_range(). We are not walking > "pages" after all but everything else included :) > > Then, there is a subset of 1) where we only want to walk to a single address > (a single folio). I'm working on that right now to get rid of follow_page() > and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still > remain a bit special, but I'm afraid we cannot hide that completely. Maybe you are talking about the generic concept of "page table walker", not walk_page_range() explicitly? I'd agree if it's about the generic concept. For example, follow_page() definitely is tailored for getting the page/folio. But just to mention Oscar's series is only working on the page_walk API itself. What I see so far is most of the walk_page API users aren't described above - most of them do not fall into category 1) at all, if any. And they either need to fetch something from the pgtable where having the folio isn't enough, or modify the pgtable for different reasons. A generic pgtable walker looks still wanted at some point, but it can be too involved to be introduced together with this "remove hugetlb_entry" effort. To me, that future work is not yet about "get the folio, ignore the pgtable", but about how to abstract different layers of pgtables, so the caller may get a generic concept of "one pgtable entry" with the level/size information attached, and process it at a single place / hook, and perhaps hopefully even work with a device pgtable, as long as it's a radix tree. [Adding Jason into the loop too. PS: Oscar, please consider copying Jason for the works too; Jason provided great lots of useful discussions in the past on relevant topics] Thanks,
On 04.07.24 16:30, Peter Xu wrote: > Hey, David, > Hi! > On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote: >> There are roughly two categories of page table walkers we have: >> >> 1) We actually only want to walk present folios (to be precise, page >> ranges of folios). We should look into moving away from the walk the >> page walker API where possible, and have something better that >> directly gives us the folio (page ranges). Any PTE batching would be >> done internally. >> >> 2) We want to deal with non-present folios as well (swp entries and all >> kinds of other stuff). We should maybe implement our custom page >> table walker and move away from walk_page_range(). We are not walking >> "pages" after all but everything else included :) >> >> Then, there is a subset of 1) where we only want to walk to a single address >> (a single folio). I'm working on that right now to get rid of follow_page() >> and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still >> remain a bit special, but I'm afraid we cannot hide that completely. > > Maybe you are talking about the generic concept of "page table walker", not > walk_page_range() explicitly? > > I'd agree if it's about the generic concept. For example, follow_page() > definitely is tailored for getting the page/folio. But just to mention > Oscar's series is only working on the page_walk API itself. What I see so > far is most of the walk_page API users aren't described above - most of > them do not fall into category 1) at all, if any. And they either need to > fetch something from the pgtable where having the folio isn't enough, or > modify the pgtable for different reasons. Right, but having 1) does not imply that we won't be having access to the page table entry in an abstracted form, the folio is simply the primary source of information that these users care about. 2) is an extension of 1), but walking+exposing all (or most) other page table entries as well in some form, which is certainly harder to get right. Taking a look at some examples: * madvise_cold_or_pageout_pte_range() only cares about present folios. * madvise_free_pte_range() only cares about present folios. * break_ksm_ops() only cares about present folios. * mlock_walk_ops() only cares about present folios. * damon_mkold_ops() only cares about present folios. * damon_young_ops() only cares about present folios. There are certainly other page_walk API users that are more involved and need to do way more magic, which fall into category 2). In particular things like swapin_walk_ops(), hmm_walk_ops() and most fs/proc/task_mmu.c. Likely there are plenty of them. Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there even is left in using walk_page_range() :) > > A generic pgtable walker looks still wanted at some point, but it can be > too involved to be introduced together with this "remove hugetlb_entry" > effort. My thinking was if "remove hugetlb_entry" cannot wait for "remove page_walk", because we found a reasonable way to do it better and convert the individual users. Maybe it can't. I've not given up hope that we can end up with something better and clearer than the current page_walk API :) > > To me, that future work is not yet about "get the folio, ignore the > pgtable", but about how to abstract different layers of pgtables, so the > caller may get a generic concept of "one pgtable entry" with the level/size > information attached, and process it at a single place / hook, and perhaps > hopefully even work with a device pgtable, as long as it's a radix tree. To me 2) is an extension of 1). My thinking is that we can start with 1) without having to are about all details of 2). If we have to make it as generic that we can walk any page table layout out there in this world, I'm not so sure.
On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote: > On 04.07.24 16:30, Peter Xu wrote: > > Hey, David, > > > > Hi! > > > On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote: > > > There are roughly two categories of page table walkers we have: > > > > > > 1) We actually only want to walk present folios (to be precise, page > > > ranges of folios). We should look into moving away from the walk the > > > page walker API where possible, and have something better that > > > directly gives us the folio (page ranges). Any PTE batching would be > > > done internally. > > > > > > 2) We want to deal with non-present folios as well (swp entries and all > > > kinds of other stuff). We should maybe implement our custom page > > > table walker and move away from walk_page_range(). We are not walking > > > "pages" after all but everything else included :) > > > > > > Then, there is a subset of 1) where we only want to walk to a single address > > > (a single folio). I'm working on that right now to get rid of follow_page() > > > and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still > > > remain a bit special, but I'm afraid we cannot hide that completely. > > > > Maybe you are talking about the generic concept of "page table walker", not > > walk_page_range() explicitly? > > > > I'd agree if it's about the generic concept. For example, follow_page() > > definitely is tailored for getting the page/folio. But just to mention > > Oscar's series is only working on the page_walk API itself. What I see so > > far is most of the walk_page API users aren't described above - most of > > them do not fall into category 1) at all, if any. And they either need to > > fetch something from the pgtable where having the folio isn't enough, or > > modify the pgtable for different reasons. > > Right, but having 1) does not imply that we won't be having access to the > page table entry in an abstracted form, the folio is simply the primary > source of information that these users care about. 2) is an extension of 1), > but walking+exposing all (or most) other page table entries as well in some > form, which is certainly harder to get right. > > Taking a look at some examples: > > * madvise_cold_or_pageout_pte_range() only cares about present folios. > * madvise_free_pte_range() only cares about present folios. > * break_ksm_ops() only cares about present folios. > * mlock_walk_ops() only cares about present folios. > * damon_mkold_ops() only cares about present folios. > * damon_young_ops() only cares about present folios. > > There are certainly other page_walk API users that are more involved and > need to do way more magic, which fall into category 2). In particular things > like swapin_walk_ops(), hmm_walk_ops() and most fs/proc/task_mmu.c. Likely > there are plenty of them. > > > Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there > even is left in using walk_page_range() :) Hmm, I need to confess from a quick look I didn't yet see why the current page_walk API won't work under p4d there.. it could be that I missed some details. > > > > > A generic pgtable walker looks still wanted at some point, but it can be > > too involved to be introduced together with this "remove hugetlb_entry" > > effort. > > My thinking was if "remove hugetlb_entry" cannot wait for "remove > page_walk", because we found a reasonable way to do it better and convert > the individual users. Maybe it can't. > > I've not given up hope that we can end up with something better and clearer > than the current page_walk API :) Oh so you meant you have plan to rewrite some of the page_walk API users to use the new API you plan to propose? It looks fine by me. I assume anything new will already taking hugetlb folios into account, so it'll "just work" and actually reduce number of patches here, am I right? If it still needs time to land, I think it's also fine that it's done on top of Oscar's. So it may boil down to the schedule in that case, and we may also want to know how Oscar sees this. > > > > > To me, that future work is not yet about "get the folio, ignore the > > pgtable", but about how to abstract different layers of pgtables, so the > > caller may get a generic concept of "one pgtable entry" with the level/size > > information attached, and process it at a single place / hook, and perhaps > > hopefully even work with a device pgtable, as long as it's a radix tree. > > To me 2) is an extension of 1). My thinking is that we can start with 1) > without having to are about all details of 2). If we have to make it as > generic that we can walk any page table layout out there in this world, I'm > not so sure. I still see a hope there, after all the radix pgtable is indeed a common abstraction and it looks to me a lot of things share that structure. IIUC one challenge of it is being fast. So.. I don't know. But I'll be more than happy to see it come if someone can work it out, and it just sounds very nice too if some chunk of code can be run the same for mm/, kvm/ and iommu/. Thanks,
On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote: > My thinking was if "remove hugetlb_entry" cannot wait for "remove > page_walk", because we found a reasonable way to do it better and convert > the individual users. Maybe it can't. > > I've not given up hope that we can end up with something better and clearer > than the current page_walk API :) Hi David, I agree that the current page_walk might be a bit convoluted, and that the indirect functions approach is a bit of a hassle. Having said that, let me clarify something. Although this patchset touches the page_walk API wrt. getting rid of hugetlb special casing all over the place, my goal was not as focused on the page_walk as it was on the hugetlb code to gain hability to be interpreted on PUD/PMD level. One of the things, among other things, that helped in creating this mess/duplication we have wrt. hugetlb code vs mm core is that hugetlb __always__ operates on ptes, which means that we cannot rely on the mm core to do the right thing, and we need a bunch of hugetlb-pte functions that knows about their thing, so we lean on that. IMHO, that was a mistake to start with, but I was not around when it was introduced and maybe there were good reasons to deal with that the way it is done. But, the thing is that my ultimate goal, is for hugetlb code to be able to deal with PUD/PMD (pte and cont-pte is already dealt with) just like mm core does for THP (PUD is not supported by THP, but you get me), and that is not that difficult to do, as this patchset tries to prove. Of course, for hugetlb to gain the hability to operate on PUD/PMD, this means we need to add a fairly amount of code. e.g: for operating hugepages on PUD level, code for markers on PUD/PMD level for uffd/poison stuff (only dealt on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc. Basically, almost all we did for PMD-* stuff we need it for PUD as well, and that will be around when THP gains support for PUD if it ever does (I guess that in a few years if memory capacity keeps increasing). E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with is_hwpoison_entry Yes, it is a hassle to have more code around, but IMO, this new code will help us in 1) move away from __always__ operate on ptes 2) ease integrate hugetlb code into mm core. I will keep working on this patchset not because of pagewalk savings, but because I think it will help us in have hugetlb more mm-core ready, since the current pagewalk has to test that a hugetlb page can be properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD, hwpoison entries for swp on PUD/PMD, pud invalidating, etc. If that gets accomplished, I think that a fair amount of code that lives in hugetlb.c can be deleted/converted as less special casing will be needed. I might be wrong and maybe I will hit a brick wall, but hopefully not.
On Mon, Jul 08, 2024 at 10:18:30AM +0200, Oscar Salvador wrote: > IMHO, that was a mistake to start with, but I was not around when it was > introduced and maybe there were good reasons to deal with that the way > it is done. It is a trade off, either you have to write out a lot of duplicated code for every level or you have this sort of level agnostic design. > But, the thing is that my ultimate goal, is for hugetlb code to be able > to deal with PUD/PMD (pte and cont-pte is already dealt with) just like > mm core does for THP (PUD is not supported by THP, but you get me), and > that is not that difficult to do, as this patchset tries to prove. IMHO we need to get to an API that can understand everything in a page table. Having two APIs that are both disjoint is the problematic bit. Improving the pud/pmd/etc API is a good direction Nobody has explored it, but generalizing to a 'non-level' API could also be a direction. 'non-level' means it works more like the huge API where the level is not part of the function names but somehow the level is encoded by the values/state/something. This is appealing for things like page_walk where we have all these per-level ops which are kind of pointless code duplication. I've been doing some experiments on the iommu page table side on both these directions and so far I haven't come to something that is really great :\ > Of course, for hugetlb to gain the hability to operate on PUD/PMD, this > means we need to add a fairly amount of code. e.g: for operating > hugepages on PUD level, code for markers on PUD/PMD level for > uffd/poison stuff (only dealt > on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc. > Basically, almost all we did for PMD-* stuff we need it for PUD as well, > and that will be around when THP gains support for PUD if it ever does > (I guess that in a few years if memory capacity keeps increasing). Right, this is the general pain of the mm's design is we have to duplicate so much stuff N-wise for each level, even though in alot of cases it isn't different for each level. > I will keep working on this patchset not because of pagewalk savings, > but because I think it will help us in have hugetlb more mm-core ready, > since the current pagewalk has to test that a hugetlb page can be > properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD, > hwpoison entries for swp on PUD/PMD, pud invalidating, etc. Right, it would be nice if the page walk ops didn't have to touch huge stuff at all. pagewalk ops, as they are today, should just work with pud/pmd/pte normal functions in all cases. Jason
On Thu, Jul 04, 2024 at 10:30:14AM -0400, Peter Xu wrote: > Hey, David, > > On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote: > > There are roughly two categories of page table walkers we have: > > > > 1) We actually only want to walk present folios (to be precise, page > > ranges of folios). We should look into moving away from the walk the > > page walker API where possible, and have something better that > > directly gives us the folio (page ranges). Any PTE batching would be > > done internally. This seems like a good direction for some users as well to me. If we can reduce the number of places touching the pud/pmd/pte APIs that is a nice abstraction to reach toward. It naturally would remove hugepte users too. Jason
On 08.07.24 10:18, Oscar Salvador wrote: > On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote: >> My thinking was if "remove hugetlb_entry" cannot wait for "remove >> page_walk", because we found a reasonable way to do it better and convert >> the individual users. Maybe it can't. >> >> I've not given up hope that we can end up with something better and clearer >> than the current page_walk API :) > > Hi David, > Hi! > I agree that the current page_walk might be a bit convoluted, and that the > indirect functions approach is a bit of a hassle. > Having said that, let me clarify something. > > Although this patchset touches the page_walk API wrt. getting rid of > hugetlb special casing all over the place, my goal was not as focused on > the page_walk as it was on the hugetlb code to gain hability to be > interpreted on PUD/PMD level. I understand that. And it would all be easier+more straight forward if we wouldn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that works similar, but different to "ordinary" cont-pte for thp. I'm sure you stumbled over the set_huge_pte_at() on arm64 for example. If we, at one point *don't* use these hugetlb functions right now to modify hugetlb entries, we might be in trouble. That's why I think we should maybe invest our time and effort in having a new pagewalker that will just batch such things naturally, and users that can operate on that naturally. For example: a hugetlb cont-pte-mapped folio will just naturally be reported as a "fully mapped folio", just like a THP would be if mapped in a compatible way. Yes, this requires more work, but as raised in some patches here, working on individual PTEs/PMDs for hugetlb is problematic. You have to batch every operation, to essentially teach ordinary code to do what the hugetlb_* special code would have done on cont-pte/cont-pmd things. (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code on how many entries we can batch, like we do in folio_pte_batch(); point is that we want to batch also on architectures where we don't have such bits, and prepare for architectures that implement various sizes of batching; IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong approach. Again, folio_pte_batch() is where we tackled the problem differently from the THP perspective) > > One of the things, among other things, that helped in creating this > mess/duplication we have wrt. hugetlb code vs mm core is that hugetlb > __always__ operates on ptes, which means that we cannot rely on the mm > core to do the right thing, and we need a bunch of hugetlb-pte functions > that knows about their thing, so we lean on that. > > IMHO, that was a mistake to start with, but I was not around when it was > introduced and maybe there were good reasons to deal with that the way > it is done. > But, the thing is that my ultimate goal, is for hugetlb code to be able > to deal with PUD/PMD (pte and cont-pte is already dealt with) just like > mm core does for THP (PUD is not supported by THP, but you get me), and > that is not that difficult to do, as this patchset tries to prove. > > Of course, for hugetlb to gain the hability to operate on PUD/PMD, this > means we need to add a fairly amount of code. e.g: for operating > hugepages on PUD level, code for markers on PUD/PMD level for > uffd/poison stuff (only dealt > on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc. > Basically, almost all we did for PMD-* stuff we need it for PUD as well, > and that will be around when THP gains support for PUD if it ever does > (I guess that in a few years if memory capacity keeps increasing). > > E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with > is_hwpoison_entry > > Yes, it is a hassle to have more code around, but IMO, this new code > will help us in 1) move away from __always__ operate on ptes 2) ease > integrate hugetlb code into mm core. > > I will keep working on this patchset not because of pagewalk savings, > but because I think it will help us in have hugetlb more mm-core ready, > since the current pagewalk has to test that a hugetlb page can be > properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD, > hwpoison entries for swp on PUD/PMD, pud invalidating, etc. > > If that gets accomplished, I think that a fair amount of code that lives > in hugetlb.c can be deleted/converted as less special casing will be needed. > > I might be wrong and maybe I will hit a brick wall, but hopefully not. I have an idea for a better page table walker API that would try batching most entries (under one PTL), and walkers can just register for the types they want. Hoping I will find some time to at least scetch the user interface soon. That doesn't mean that this should block your work, but the cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't particularly like where this is going.
On Wed, Jul 10, 2024 at 05:52:43AM +0200, David Hildenbrand wrote: > I understand that. And it would all be easier+more straight forward if we > wouldn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that works > similar, but different to "ordinary" cont-pte for thp. > > I'm sure you stumbled over the set_huge_pte_at() on arm64 for example. If > we, at one point *don't* use these hugetlb functions right now to modify > hugetlb entries, we might be in trouble. > > That's why I think we should maybe invest our time and effort in having a > new pagewalker that will just batch such things naturally, and users that > can operate on that naturally. For example: a hugetlb cont-pte-mapped folio > will just naturally be reported as a "fully mapped folio", just like a THP > would be if mapped in a compatible way. > > Yes, this requires more work, but as raised in some patches here, working on > individual PTEs/PMDs for hugetlb is problematic. > > You have to batch every operation, to essentially teach ordinary code to do > what the hugetlb_* special code would have done on cont-pte/cont-pmd things. > > > (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code > on how many entries we can batch, like we do in folio_pte_batch(); point is > that we want to batch also on architectures where we don't have such bits, > and prepare for architectures that implement various sizes of batching; > IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong > approach. Again, folio_pte_batch() is where we tackled the problem > differently from the THP perspective) I must say I did not check folio_pte_batch() and I am totally ignorant of what/how it does things. I will have a look. > I have an idea for a better page table walker API that would try batching > most entries (under one PTL), and walkers can just register for the types > they want. Hoping I will find some time to at least scetch the user > interface soon. > > That doesn't mean that this should block your work, but the > cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't > particularly like where this is going. Ok, let me take a step back then. Previous versions of that RFC did not handle cont-{pte-pmd} wide in the open, so let me go back to the drawing board and come up with something that does not fiddle with cont- stuff in that way. I might post here a small diff just to see if we are on the same page. As usual, thanks a lot for your comments David!
>> (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code >> on how many entries we can batch, like we do in folio_pte_batch(); point is >> that we want to batch also on architectures where we don't have such bits, >> and prepare for architectures that implement various sizes of batching; >> IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong >> approach. Again, folio_pte_batch() is where we tackled the problem >> differently from the THP perspective) > > I must say I did not check folio_pte_batch() and I am totally ignorant > of what/how it does things. > I will have a look. > >> I have an idea for a better page table walker API that would try batching >> most entries (under one PTL), and walkers can just register for the types >> they want. Hoping I will find some time to at least scetch the user >> interface soon. >> >> That doesn't mean that this should block your work, but the >> cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't >> particularly like where this is going. > > Ok, let me take a step back then. > Previous versions of that RFC did not handle cont-{pte-pmd} wide in the > open, so let me go back to the drawing board and come up with something > that does not fiddle with cont- stuff in that way. > > I might post here a small diff just to see if we are on the same page. > > As usual, thanks a lot for your comments David! Feel free to reach out to discuss ways forward. I think we should (a) move to the automatic cont-pte setting as done for THPs via set_ptes(). (b) Batching PTE updates at all relevant places, so we get no change in behavior: cont-pte bit will remain set. (c) Likely remove the use of cont-pte bits in hugetlb code for anything that is not a present folio (i.e., where automatic cont-pte bit setting would never set it). Migration entries might require thought (we can easily batch to achieve the same thing, but the behavior of hugetlb likely differs to the generic way of handling migration entries on multiple ptes: reference the folio vs. the respective subpages of the folio). Then we are essentially replacing what the current hugetlb_ functions do by the common way it is being done for THP (which does not exist for cont-pmd yet ... ). The only real alternative is special casing hugetlb all over the place to still call the hugetlb_* functions. :( it would all be easier without that hugetlb cont-pte/cont-pmd usage. CCing Ryan so he's aware.
On Thu, Jul 11, 2024 at 02:15:38AM +0200, David Hildenbrand wrote: > > > (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code > > > on how many entries we can batch, like we do in folio_pte_batch(); point is > > > that we want to batch also on architectures where we don't have such bits, > > > and prepare for architectures that implement various sizes of batching; > > > IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong > > > approach. Again, folio_pte_batch() is where we tackled the problem > > > differently from the THP perspective) > > > > I must say I did not check folio_pte_batch() and I am totally ignorant > > of what/how it does things. > > I will have a look. > > > > > I have an idea for a better page table walker API that would try batching > > > most entries (under one PTL), and walkers can just register for the types > > > they want. Hoping I will find some time to at least scetch the user > > > interface soon. > > > > > > That doesn't mean that this should block your work, but the > > > cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't > > > particularly like where this is going. > > > > Ok, let me take a step back then. > > Previous versions of that RFC did not handle cont-{pte-pmd} wide in the > > open, so let me go back to the drawing board and come up with something > > that does not fiddle with cont- stuff in that way. > > > > I might post here a small diff just to see if we are on the same page. > > > > As usual, thanks a lot for your comments David! > > Feel free to reach out to discuss ways forward. I think we should > > (a) move to the automatic cont-pte setting as done for THPs via > set_ptes(). > (b) Batching PTE updates at all relevant places, so we get no change in > behavior: cont-pte bit will remain set. > (c) Likely remove the use of cont-pte bits in hugetlb code for anything > that is not a present folio (i.e., where automatic cont-pte bit > setting would never set it). Migration entries might require > thought (we can easily batch to achieve the same thing, but the > behavior of hugetlb likely differs to the generic way of handling > migration entries on multiple ptes: reference the folio vs. > the respective subpages of the folio). Uhm, I see, but I am bit confused. Although related, this seems orthogonal to this series and more like for a next-thing to do, right? It is true that this series tries to handle cont-{pmd,pte} in the pagewalk api for hugetlb vmas, but in order to raise less eye brows I can come up with a way not to do that for now, so we do not fiddle with cont-stuff in this series. Or am I misunderstanding you?
On 11.07.24 06:48, Oscar Salvador wrote: > On Thu, Jul 11, 2024 at 02:15:38AM +0200, David Hildenbrand wrote: >>>> (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code >>>> on how many entries we can batch, like we do in folio_pte_batch(); point is >>>> that we want to batch also on architectures where we don't have such bits, >>>> and prepare for architectures that implement various sizes of batching; >>>> IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong >>>> approach. Again, folio_pte_batch() is where we tackled the problem >>>> differently from the THP perspective) >>> >>> I must say I did not check folio_pte_batch() and I am totally ignorant >>> of what/how it does things. >>> I will have a look. >>> >>>> I have an idea for a better page table walker API that would try batching >>>> most entries (under one PTL), and walkers can just register for the types >>>> they want. Hoping I will find some time to at least scetch the user >>>> interface soon. >>>> >>>> That doesn't mean that this should block your work, but the >>>> cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't >>>> particularly like where this is going. >>> >>> Ok, let me take a step back then. >>> Previous versions of that RFC did not handle cont-{pte-pmd} wide in the >>> open, so let me go back to the drawing board and come up with something >>> that does not fiddle with cont- stuff in that way. >>> >>> I might post here a small diff just to see if we are on the same page. >>> >>> As usual, thanks a lot for your comments David! >> >> Feel free to reach out to discuss ways forward. I think we should >> >> (a) move to the automatic cont-pte setting as done for THPs via >> set_ptes(). >> (b) Batching PTE updates at all relevant places, so we get no change in >> behavior: cont-pte bit will remain set. >> (c) Likely remove the use of cont-pte bits in hugetlb code for anything >> that is not a present folio (i.e., where automatic cont-pte bit >> setting would never set it). Migration entries might require >> thought (we can easily batch to achieve the same thing, but the >> behavior of hugetlb likely differs to the generic way of handling >> migration entries on multiple ptes: reference the folio vs. >> the respective subpages of the folio). > > Uhm, I see, but I am bit confused. > Although related, this seems orthogonal to this series and more like for > a next-thing to do, right? Well, yes and no. The thing is, that the cont-pte/cont-pmd stuff is not as easy to handle like the PMD/PUD stuff, and sorting that out sounds like some "pain". That's the ugly part of hugetlb, where it's simply ... quite different :( > > It is true that this series tries to handle cont-{pmd,pte} in the > pagewalk api for hugetlb vmas, but in order to raise less eye brows I > can come up with a way not to do that for now, so we do not fiddle with > cont-stuff in this series. > > > Or am I misunderstanding you? I can answer once I know more details about the approach you have in mind :)