Message ID: 20201120064325.34492-1-songmuchun@bytedance.com (mailing list archive)
Series: Free some vmemmap pages of hugetlb page
On Fri 20-11-20 14:43:04, Muchun Song wrote:
[...]

Thanks for improving the cover letter and providing some numbers. I have
only glanced through the patchset because I didn't really have more time
to dive deeply into it.

Overall it looks promising. To summarize: I would prefer not to have the
feature enablement controlled by a compile time option, and the kernel
command line option should be opt-in. I also do not like that freeing
the pool can trigger the oom killer, or even shut the system down if no
oom victim is eligible.

One thing that I didn't really get to think hard about is the effect of
vmemmap manipulation on pfn walkers. pfn_to_page can be invalid when
racing with the split. How do we enforce that this won't blow up?

I have also asked in a previous version whether the vmemmap manipulation
should really be unconditional, e.g. for short-lived hugetlb pages
allocated from the buddy allocator directly rather than for a pool.
Maybe it should be restricted to pool allocations, as those are
considered long term; the overhead would then be amortized and the
freeing path restrictions easier to understand.

> Documentation/admin-guide/kernel-parameters.txt |   9 +
> Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
> arch/x86/include/asm/hugetlb.h                  |  17 +
> arch/x86/include/asm/pgtable_64_types.h         |   8 +
> arch/x86/mm/init_64.c                           |   7 +-
> fs/Kconfig                                      |  14 +
> include/linux/bootmem_info.h                    |  78 +++
> include/linux/hugetlb.h                         |  19 +
> include/linux/hugetlb_cgroup.h                  |  15 +-
> include/linux/memory_hotplug.h                  |  27 -
> mm/Makefile                                     |   2 +
> mm/bootmem_info.c                               | 124 ++++
> mm/hugetlb.c                                    | 163 ++++-
> mm/hugetlb_vmemmap.c                            | 765 ++++++++++++++++++++++++
> mm/hugetlb_vmemmap.h                            | 103 ++++

I will need to look closer, but I suspect that a non-trivial part of the
vmemmap manipulation really belongs in mm/sparse-vmemmap.c, because the
split and remapping shouldn't really be hugetlb specific. Sure, hugetlb
knows how to split, but all the splitting should be implemented in
vmemmap proper.

> mm/memory_hotplug.c                             | 116 ----
> mm/sparse.c                                     |   5 +-
> 17 files changed, 1295 insertions(+), 180 deletions(-)
> create mode 100644 include/linux/bootmem_info.h
> create mode 100644 mm/bootmem_info.c
> create mode 100644 mm/hugetlb_vmemmap.c
> create mode 100644 mm/hugetlb_vmemmap.h

Thanks!
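A minimal sketch of the walker pattern this concern is about (illustrative
only; the function below is generic and not code from the series):

	/*
	 * Illustrative: a generic pfn walker on a SPARSEMEM_VMEMMAP
	 * kernel. Nothing pins the vmemmap backing the struct page
	 * between pfn_to_page() and the dereference below.
	 */
	static void walk_pfn_range(unsigned long start_pfn, unsigned long end_pfn)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page = pfn_to_page(pfn);

			/*
			 * If the hugetlb vmemmap split/remap runs here,
			 * the memory backing *page may already be back in
			 * the buddy allocator and reused, so the read
			 * below returns unrelated data instead of
			 * faulting.
			 */
			if (PageReserved(page))
				continue;
			/* ... inspect further page state ... */
		}
	}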
On 20.11.20 09:42, Michal Hocko wrote:
> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> [...]
>
> One thing that I didn't really get to think hard about is the effect of
> vmemmap manipulation on pfn walkers. pfn_to_page can be invalid when
> racing with the split. How do we enforce that this won't blow up?

I have the same concerns - the sections are online the whole time and
anybody with pfn_to_online_page() can grab them.

I think we have similar issues with memory offlining when removing the
vmemmap; it's just very hard to trigger, and we can easily protect
against it by grabbing the memhotplug lock.

I once discussed with Dan using rcu to protect the SECTION_IS_ONLINE
bit, to make sure anybody who did a pfn_to_online_page() stopped using
the page. Of course, such an approach is not easy to use in this
context, where the sections stay online the whole time ... we would
have to protect the vmemmap table entries using rcu or similar, which
could get quite ugly.

To keep things easy, maybe simply never allow these hugetlb pages to be
freed again for now? If they were reserved during boot and the vmemmap
condensed, then just let them stick around for all eternity. Once we
have a safe approach for modifying an online vmemmap, we can enable
this freeing, and eventually also dynamically manage vmemmaps for
runtime-allocated huge pages.
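The memhotplug-lock protection mentioned above looks roughly like this (a
sketch, assuming the walker may sleep; it guards against vmemmap removal at
hot-remove time, not against a remap that happens while the section stays
online):

	static void walk_pfn_range_locked(unsigned long start_pfn,
					  unsigned long end_pfn)
	{
		unsigned long pfn;

		get_online_mems();	/* blocks memory hot-remove */
		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page = pfn_to_online_page(pfn);

			if (!page)	/* hole or offline section */
				continue;
			/* safe against hot-remove of the vmemmap here */
		}
		put_online_mems();
	}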
On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
> On 20.11.20 09:42, Michal Hocko wrote:
> [...]
>
> I have the same concerns - the sections are online the whole time and
> anybody with pfn_to_online_page() can grab them.
>
> I think we have similar issues with memory offlining when removing the
> vmemmap; it's just very hard to trigger, and we can easily protect
> against it by grabbing the memhotplug lock.

I am not sure we can/want to extend memory hotplug locking to all pfn
walkers. But you are right that the underlying problem is similar, just
much harder to trigger, because vmemmaps are only removed when the
physical memory is hot-removed, and that happens very seldom. Maybe it
will happen more with virtualization usecases. But this work makes it
even more tricky. If a pfn walker races with a hot-remove, it would
just blow up when accessing the unmapped physical address space. With
this feature, a pfn walker would instead grab a real struct page that
has been reused for some unpredictable purpose under its feet. Any
failure would be silent and hard to debug.

[...]
> To keep things easy, maybe simply never allow these hugetlb pages to be
> freed again for now? If they were reserved during boot and the vmemmap
> condensed, then just let them stick around for all eternity.

Not sure I understand. Do you propose to only free those vmemmap pages
when the pool is initialized during boot and never allow them to be
freed afterwards? That would certainly make it safer and maybe even
simpler wrt implementation.
On 20.11.20 10:39, Michal Hocko wrote:
> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
> [...]
>
> I am not sure we can/want to extend memory hotplug locking to all pfn
> walkers. [...] Any failure would be silent and hard to debug.

Right, we don't want the memory hotplug locking, thus the discussions
regarding rcu. Luckily, so far I have never seen a BUG report regarding
this - maybe because the window between memory offlining
(offline_pages()) and the memory/vmemmap getting removed
(try_remove_memory()) is just too long. Someone would have to sleep
after pfn_to_online_page() for quite a while to trigger it.

> [...]
>> To keep things easy, maybe simply never allow these hugetlb pages to
>> be freed again for now? If they were reserved during boot and the
>> vmemmap condensed, then just let them stick around for all eternity.
>
> Not sure I understand. Do you propose to only free those vmemmap pages
> when the pool is initialized during boot and never allow them to be
> freed afterwards? That would certainly make it safer and maybe even
> simpler wrt implementation.

Exactly, let's keep it simple for now. I guess most use cases of this
(virtualization, databases, ...) will allocate hugepages during boot
and never free them.
On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> [...]
>
> Overall it looks promising. To summarize: I would prefer not to have the
> feature enablement controlled by a compile time option, and the kernel
> command line option should be opt-in. I also do not like that freeing
> the pool can trigger the oom killer, or even shut the system down if no
> oom victim is eligible.

Hi Michal,

I have replied to you about those questions on the other mail thread.

Thanks.

> One thing that I didn't really get to think hard about is the effect of
> vmemmap manipulation on pfn walkers. pfn_to_page can be invalid when
> racing with the split. How do we enforce that this won't blow up?

This feature depends on CONFIG_SPARSEMEM_VMEMMAP; in this case
pfn_to_page can work. The return value of pfn_to_page is actually the
address of the pfn's struct page. I cannot figure out where the problem
is. Can you describe the problem in detail please? Thanks.

> I have also asked in a previous version whether the vmemmap manipulation
> should really be unconditional, e.g. for short-lived hugetlb pages
> allocated from the buddy allocator directly rather than for a pool.
> Maybe it should be restricted to pool allocations, as those are
> considered long term; the overhead would then be amortized and the
> freeing path restrictions easier to understand.

Yeah, I agree with you. This can be an optimization. We can add it to
the todo list and implement it in the future. The patch series is
already huge as it is.

> [...]
>
> Thanks!
On Fri 20-11-20 20:40:46, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > One thing that I didn't really get to think hard about is the effect of
> > vmemmap manipulation on pfn walkers. pfn_to_page can be invalid when
> > racing with the split. How do we enforce that this won't blow up?
>
> This feature depends on CONFIG_SPARSEMEM_VMEMMAP; in this case
> pfn_to_page can work. The return value of pfn_to_page is actually the
> address of the pfn's struct page. I cannot figure out where the problem
> is. Can you describe the problem in detail please? Thanks.

A struct page returned by pfn_to_page might become invalid right after
it is returned, because the vmemmap backing it could get freed up and
the respective memory released to the page allocator and reused for
something else. See?

> > I have also asked in a previous version whether the vmemmap manipulation
> > should really be unconditional. [...]
>
> Yeah, I agree with you. This can be an optimization. We can add it to
> the todo list and implement it in the future. The patch series is
> already huge as it is.

Yes, the patchset is large, and the primary aim should be reducing
functionality to make it smaller in the first incarnation, especially
when it is tricky to implement. Releasing vmemmap-sparse hugepages back
to the allocator is one of those things. Do you really need it for your
usecase?
On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> A struct page returned by pfn_to_page might become invalid right after
> it is returned, because the vmemmap backing it could get freed up and
> the respective memory released to the page allocator and reused for
> something else. See?

If the HugeTLB page is already allocated from the buddy allocator, can
the struct page of the HugeTLB page be freed? Does this case exist? If
yes, how can the HugeTLB page be freed back to the buddy allocator
(when we cannot access the struct page)?
On 11/20/20 1:43 AM, David Hildenbrand wrote:
> On 20.11.20 10:39, Michal Hocko wrote:
> [...]
>> Not sure I understand. Do you propose to only free those vmemmap pages
>> when the pool is initialized during boot and never allow them to be
>> freed afterwards? That would certainly make it safer and maybe even
>> simpler wrt implementation.
>
> Exactly, let's keep it simple for now. I guess most use cases of this
> (virtualization, databases, ...) will allocate hugepages during boot
> and never free them.

Not sure I agree with that last statement. Database and virtualization
use cases from my employer allocate hugetlb pages after boot. It is
shortly after boot, but still not from the boot/kernel command line.

Somewhat related, but not exactly addressing this issue ...

One idea discussed in a previous patch set was to disable PMD/huge page
mapping of the vmemmap if this feature is enabled. That would eliminate
a bunch of the complex code doing page table manipulation. It does not
address the issue of struct pages going away, which is being discussed
here, but it could be a way to simplify the first version of this code.
If this is going to be an 'opt in' feature as previously suggested,
then eliminating the PMD/huge page vmemmap mapping may be acceptable.
My guess is that sysadmins would only 'opt in' if they expect most of
system memory to be used by hugetlb pages. We certainly have database
and virtualization use cases where this is true.
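For context, the arithmetic behind the savings at stake, as a standalone
illustration (assumes x86-64 with 4K base pages and a 64-byte struct page;
"keep 2 vmemmap pages per hugepage" matches the diagram later in this
thread):

	#include <stdio.h>

	int main(void)
	{
		const unsigned long page_size = 4096, struct_page_size = 64;
		const unsigned long hpage_sizes[] = { 2UL << 20, 1UL << 30 };

		for (int i = 0; i < 2; i++) {
			unsigned long nr_pages = hpage_sizes[i] / page_size;
			unsigned long vmemmap = nr_pages * struct_page_size / page_size;

			/* 2MB: 8 vmemmap pages, ~6 freeable; 1GB: 4096, ~4094 */
			printf("%4lu MB hugepage: %4lu vmemmap pages, ~%lu freeable\n",
			       hpage_sizes[i] >> 20, vmemmap, vmemmap - 2);
		}
		return 0;
	}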
On 20.11.20 18:45, Mike Kravetz wrote:
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> [...]
>> Exactly, let's keep it simple for now. I guess most use cases of this
>> (virtualization, databases, ...) will allocate hugepages during boot
>> and never free them.
>
> Not sure I agree with that last statement. Database and virtualization
> use cases from my employer allocate hugetlb pages after boot. It is
> shortly after boot, but still not from the boot/kernel command line.

Right, but the ones that care about this optimization could be
converted for now, I assume? I mean, we are talking about "opt-in" from
sysadmins, so requiring a different cmdline parameter does not sound
too weird to me. And it should simplify a first version quite a lot.
The more I think about this, the more I believe doing these vmemmap
modifications after boot is very dangerous.

> Somewhat related, but not exactly addressing this issue ...
>
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of the vmemmap if this feature is enabled. [...]

It sounds like a hack to me, and it does not fully solve the problem.
But yeah, it's a simplification.
On Sat, Nov 21, 2020 at 1:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> [...]
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of the vmemmap if this feature is enabled. That would eliminate
> a bunch of the complex code doing page table manipulation. It does not
> address the issue of struct pages going away, which is being discussed
> here, but it could be a way to simplify the first version of this code.
> [...]

Hi Mike,

Yeah, I agree with you that the first version of this feature should be
simple. I can do that (disable PMD/huge page mapping of the vmemmap) in
the next version of the patch. But I have another question: what is the
problem when the struct pages go away? I have not understood the issues
discussed here; I hope you can explain them to me. Thanks.
On Fri 20-11-20 09:45:12, Mike Kravetz wrote:
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> [...]
> Not sure I agree with that last statement. Database and virtualization
> use cases from my employer allocate hugetlb pages after boot. It is
> shortly after boot, but still not from the boot/kernel command line.

Is there any strong reason for that?

> Somewhat related, but not exactly addressing this issue ...
>
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of the vmemmap if this feature is enabled. [...]

Would this simplify the code considerably? I mean, the vmemmap page
tables will need to be updated anyway, so that code has to stay. The
PMD entry split shouldn't be the most complex part of that operation.
On the other hand, dropping large pages for all vmemmaps will likely
have a performance impact.
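For reference, the kind of page table manipulation being weighed here, as a
minimal sketch (split_vmemmap_pmd is an assumed name, not the series' code):
one 2MB PMD mapping of the vmemmap is replaced by a PTE table covering the
same 512 frames, so individual 4K vmemmap pages can later be remapped and
freed.

	static int split_vmemmap_pmd(pmd_t *pmd, unsigned long start)
	{
		unsigned long addr, pfn = pmd_pfn(*pmd);
		struct page *pte_page = alloc_page(GFP_KERNEL);
		pte_t *new;

		if (!pte_page)
			return -ENOMEM;
		new = page_address(pte_page);

		/* Fill a PTE table mapping exactly what the PMD mapped. */
		for (addr = start; addr < start + PMD_SIZE;
		     addr += PAGE_SIZE, pfn++)
			new[(addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)] =
				pfn_pte(pfn, PAGE_KERNEL);

		/* Publish the table, then drop the stale huge TLB entry. */
		smp_wmb();
		pmd_populate_kernel(&init_mm, pmd, new);
		flush_tlb_kernel_range(start, start + PMD_SIZE);
		return 0;
	}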
On Fri 20-11-20 23:44:26, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > A struct page returned by pfn_to_page might become invalid right after
> > it is returned, because the vmemmap backing it could get freed up and
> > the respective memory released to the page allocator and reused for
> > something else. See?
>
> If the HugeTLB page is already allocated from the buddy allocator, can
> the struct page of the HugeTLB page be freed? Does this case exist?

Nope, struct pages only ever get deallocated when the respective memory
(they describe) is hot-removed via hotplug.

> If yes, how can the HugeTLB page be freed back to the buddy allocator
> (when we cannot access the struct page)?

But I do not follow how that relates to my concern above.
On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > If yes, how can the HugeTLB page be freed back to the buddy allocator
> > (when we cannot access the struct page)?
>
> But I do not follow how that relates to my concern above.

Sorry, I must have misunderstood your concerns.

  vmemmap pages                 page frame
  +-----------+   mapping to   +-----------+
  |           | -------------> |     0     |
  +-----------+                +-----------+
  |           | -------------> |     1     |
  +-----------+                +-----------+
  |           | -------------> |     2     |
  +-----------+                +-----------+
  |           | -------------> |     3     |
  +-----------+                +-----------+
  |           | -------------> |     4     |
  +-----------+                +-----------+
  |           | -------------> |     5     |
  +-----------+                +-----------+
  |           | -------------> |     6     |
  +-----------+                +-----------+
  |           | -------------> |     7     |
  +-----------+                +-----------+

In this patch series, we free page frames 2-7 to the buddy allocator.
Do you mean that pfn_to_page can return an invalid value when the pfn
falls into page frames 2-7? Thanks.

--
Yours,
Muchun
On Mon 23-11-20 16:53:53, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > But I do not follow how that relates to my concern above.
>
> [...]
>
> In this patch series, we free page frames 2-7 to the buddy allocator.
> Do you mean that pfn_to_page can return an invalid value when the pfn
> falls into page frames 2-7? Thanks.

No, I really mean that pfn_to_page will give you a struct page pointer
from pages which you release from the vmemmap page tables. Those pages
might get reused as soon as they are freed to the page allocator.
On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> No, I really mean that pfn_to_page will give you a struct page pointer
> from pages which you release from the vmemmap page tables. Those pages
> might get reused as soon as they are freed to the page allocator.

We will remap vmemmap pages 2-7 (virtual addresses) to page frame 1,
and then free page frames 2-7 to the buddy allocator. Accesses through
the struct page pointers returned by pfn_to_page will then be served by
page frame 1. I think there is no problem here. Thanks.
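Spelled out, the remap step described above might look like this (a sketch
with assumed names; mapping the shared frame read-only is an assumption,
and the old frames must not be freed before the TLB flush):

	static void remap_vmemmap_tail_pages(pte_t *pte, unsigned long start,
					     unsigned long end,
					     struct page *reuse,
					     struct list_head *free_list)
	{
		unsigned long addr;

		for (addr = start; addr < end; addr += PAGE_SIZE, pte++) {
			struct page *old = pte_page(*pte);

			/* Alias this piece of vmemmap onto the shared frame. */
			set_pte(pte, mk_pte(reuse, PAGE_KERNEL_RO));
			list_add(&old->lru, free_list);
		}
		/*
		 * Only after the flush, when no CPU can still reach the
		 * old translations, may the caller free the frames on
		 * free_list to the buddy allocator.
		 */
		flush_tlb_kernel_range(start, end);
	}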
On Mon 23-11-20 18:36:33, Muchun Song wrote:
> [...]
> We will remap vmemmap pages 2-7 (virtual addresses) to page frame 1,
> and then free page frames 2-7 to the buddy allocator.

And this doesn't really happen in an atomic fashion from the pfn
walker's POV, right? So it is very well possible that

	struct page *page = pfn_to_page();
	// remapping happens here
	// page content is no longer valid because its backing memory can be
	// reused for whatever purpose.
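The window, spelled out (an illustration assuming the remap-flush-free
ordering sketched earlier):

	pfn walker                        hugetlb vmemmap freeing
	----------                        -----------------------
	page = pfn_to_page(pfn);
	/* points into a tail
	   vmemmap page */
	                                  remap tail PTEs to shared frame
	                                  flush TLB
	                                  free old frames to buddy
	READ page->flags;    /* now served by the shared frame, i.e. it
	                        describes some other tail page, not this
	                        pfn - no fault, just a wrong answer */
	get_page_unless_zero(page);  /* a write through the alias: corrupts
	                                the shared frame, or faults if it
	                                was mapped read-only */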
On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> And this doesn't really happen in an atomic fashion from the pfn
> walker's POV, right? So it is very well possible that

Yeah, you are right. But it may not be a problem for HugeTLB pages,
because in most cases we only read the tail struct page to get the head
struct page through compound_head() when the pfn is within a HugeTLB
range. Right?

> 	struct page *page = pfn_to_page();
> 	// remapping happens here
> 	// page content is no longer valid because its backing memory can be

If we only read page->compound_head, the content is still valid,
because the value of compound_head is the same in every tail struct
page of a HugeTLB page.

> 	// reused for whatever purpose.
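The access pattern relied on above, as a sketch:

	/*
	 * A walker that only uses a tail struct page to find its head.
	 * After the remap, all tail vmemmap slots alias the shared frame,
	 * and the compound_head word read there still points at the real
	 * head struct page.
	 */
	static struct page *hugetlb_walker_head(unsigned long pfn)
	{
		struct page *page = pfn_to_page(pfn);

		/* compound_head() uses READ_ONCE(page->compound_head) */
		return compound_head(page);
	}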
On Mon 23-11-20 19:16:18, Muchun Song wrote: > On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 23-11-20 18:36:33, Muchun Song wrote: > > > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 23-11-20 16:53:53, Muchun Song wrote: > > > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote: > > > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote: > > > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote: > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > Thanks for improving the cover letter and providing some numbers. I have > > > > > > > > > > only glanced through the patchset because I didn't really have more time > > > > > > > > > > to dive depply into them. > > > > > > > > > > > > > > > > > > > > Overall it looks promissing. To summarize. I would prefer to not have > > > > > > > > > > the feature enablement controlled by compile time option and the kernel > > > > > > > > > > command line option should be opt-in. I also do not like that freeing > > > > > > > > > > the pool can trigger the oom killer or even shut the system down if no > > > > > > > > > > oom victim is eligible. > > > > > > > > > > > > > > > > > > Hi Michal, > > > > > > > > > > > > > > > > > > I have replied to you about those questions on the other mail thread. > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > One thing that I didn't really get to think hard about is what is the > > > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be > > > > > > > > > > invalid when racing with the split. How do we enforce that this won't > > > > > > > > > > blow up? > > > > > > > > > > > > > > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP, > > > > > > > > > in this case, the pfn_to_page can work. The return value of the > > > > > > > > > pfn_to_page is actually the address of it's struct page struct. > > > > > > > > > I can not figure out where the problem is. Can you describe the > > > > > > > > > problem in detail please? Thanks. > > > > > > > > > > > > > > > > struct page returned by pfn_to_page might get invalid right when it is > > > > > > > > returned because vmemmap could get freed up and the respective memory > > > > > > > > released to the page allocator and reused for something else. See? > > > > > > > > > > > > > > If the HugeTLB page is already allocated from the buddy allocator, > > > > > > > the struct page of the HugeTLB can be freed? Does this exist? > > > > > > > > > > > > Nope, struct pages only ever get deallocated when the respective memory > > > > > > (they describe) is hotremoved via hotplug. > > > > > > > > > > > > > If yes, how to free the HugeTLB page to the buddy allocator > > > > > > > (cannot access the struct page)? > > > > > > > > > > > > But I do not follow how that relates to my concern above. > > > > > > > > > > Sorry. I shouldn't understand your concerns. 
[...]
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that
>
> Yeah, you are right. But it may not be a problem for HugeTLB pages,
> because in most cases we only read the tail struct page and get the
> head struct page through compound_head() when the pfn is within
> a HugeTLB range. Right?

Many pfn walkers would encounter the head page first and then skip over
the rest. Those should be reasonably safe. But there is no guarantee, and
the fact that you need a valid page->compound_head, which might get
scribbled over once you have the struct page, makes this extremely
subtle.
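The head-first skip pattern described here typically looks something
like the following (an illustrative sketch, not any particular in-tree
walker):

	unsigned long pfn = start_pfn;

	while (pfn < end_pfn) {
		struct page *page = pfn_to_page(pfn);

		if (PageHuge(page)) {
			struct page *head = compound_head(page);

			/* skip the whole compound page in one step */
			pfn = page_to_pfn(head) + (1UL << compound_order(head));
			continue;
		}
		/* ... process an order-0 page ... */
		pfn++;
	}

The subtlety is the PageHuge()/compound_head() pair: if the vmemmap
backing "page" were remapped between pfn_to_page() and those reads, the
walker would read whatever now backs that virtual address.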
On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 19:16:18, Muchun Song wrote:
[...]
> Many pfn walkers would encounter the head page first and then skip over
> the rest. Those should be reasonably safe. But there is no guarantee, and
> the fact that you need a valid page->compound_head, which might get
> scribbled over once you have the struct page, makes this extremely
> subtle.

In this patch series, we can guarantee that page->compound_head is
always valid, because we reuse the first tail page: vmemmap pages 2-7
are remapped to the frame backing vmemmap page 1, so every remapped
tail struct page still carries the same compound_head value. Maybe you
need to look closer at this series. Thanks.

>
> --
> SUSE Labs

--
Yours,
Muchun
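A much-simplified sketch of the remap being described (hypothetical
code for illustration only; the helper name is invented, and the
series' real implementation in mm/hugetlb_vmemmap.c is considerably
more involved):

	#include <linux/mm.h>
	#include <linux/pgtable.h>
	#include <asm/tlbflush.h>

	/*
	 * Alias vmemmap pages 2..7 of a 2MB HugeTLB page onto the frame
	 * backing vmemmap page 1 (the first tail page), then return the
	 * old frames to the buddy allocator.  Assumes "start" is the
	 * 8-page-aligned vmemmap address of the HugeTLB page and that
	 * all eight PTEs live in one PTE table; error handling omitted.
	 */
	static void vmemmap_remap_reuse(unsigned long start)
	{
		pte_t *ptep = pte_offset_kernel(pmd_off_k(start), start);
		struct page *reuse = pte_page(ptep[1]);
		struct page *old[6];
		int i;

		for (i = 2; i < 8; i++) {
			old[i - 2] = pte_page(ptep[i]);
			/* readers now see a read-only mirror of page 1 */
			set_pte_at(&init_mm, start + i * PAGE_SIZE, &ptep[i],
				   mk_pte(reuse, PAGE_KERNEL_RO));
		}
		flush_tlb_kernel_range(start, start + 8 * PAGE_SIZE);

		for (i = 0; i < 6; i++)
			free_reserved_page(old[i]);
	}

Because every mirrored page repeats the first tail page's contents, a
compound_head read through any remapped address still points at the
head page; that is the guarantee referred to above.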
On Mon 23-11-20 20:07:23, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> In this patch series, we can guarantee that page->compound_head is
> always valid, because we reuse the first tail page. Maybe you need to
> look closer at this series. Thanks.

I must be really terrible at explaining my concern. Let me try one last
time. It is really _irrelevant_ what you do with tail pages. The
underlying problem is that you are changing struct pages under users
without any synchronization. What used to be a valid struct page will
turn into garbage as soon as you remap the vmemmap page tables.
On Mon, Nov 23, 2020 at 8:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-11-20 20:07:23, Muchun Song wrote:
[...]
> I must be really terrible at explaining my concern. Let me try one last
> time. It is really _irrelevant_ what you do with tail pages. The
> underlying problem is that you are changing struct pages under users
> without any synchronization. What used to be a valid struct page will
> turn into garbage as soon as you remap the vmemmap page tables.

Thank you very much for your patient explanation. So if the pfn walkers
always try to get the head struct page through compound_head() when they
encounter a tail struct page, there will be no concerns. Do you agree?

> --
> Michal Hocko
> SUSE Labs
On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > frame 1. And then we free page frames 2-7 to the buddy allocator.
>
> And this doesn't really happen in an atomic fashion from the pfn walker
> POV, right? So it is very well possible that
>
> struct page *page = pfn_to_page();
> // remapping happens here
> // page content is no longer valid because its backing memory can be
> // reused for whatever purpose.

pfn_to_page() returns you a virtual address. That virtual address
remains a valid pointer to exactly the same contents; it's just that
the page tables change to point to a different struct page which has
the same compound_head().
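Annotating the race sketch with that observation (illustrative comments
only):

	struct page *page = pfn_to_page(pfn);	/* a vmemmap virtual address */
	/* remapping happens here */
	/*
	 * The pointer value is unchanged and still dereferenceable: the
	 * PTE behind it now points at the mirror of the first tail page,
	 * whose compound_head field is identical.
	 */
	struct page *head = compound_head(page);	/* same head as before */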
On Mon 23-11-20 20:40:40, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 8:18 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> Thank you very much for your patient explanation. So if the pfn walkers
> always try to get the head struct page through compound_head() when they
> encounter a tail struct page, there will be no concerns. Do you agree?

No, I do not agree. Please read again. The content of the struct page
might be complete garbage at any time after pfn_to_page returns a
struct page. So there is no valid compound_head anymore.
On Mon, Nov 23, 2020 at 8:45 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
[...]
> pfn_to_page() returns you a virtual address. That virtual address
> remains a valid pointer to exactly the same contents; it's just that
> the page tables change to point to a different struct page which has
> the same compound_head().

I agree with you.

Hi Michal, maybe you need to look at this.
On Mon 23-11-20 12:45:13, Matthew Wilcox wrote:
> On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
[...]
> pfn_to_page() returns you a virtual address. That virtual address
> remains a valid pointer to exactly the same contents; it's just that
> the page tables change to point to a different struct page which has
> the same compound_head().

You are right. I have managed to completely confuse myself. Sorry about
the noise!
On 11/22/20 11:38 PM, Michal Hocko wrote:
> On Fri 20-11-20 09:45:12, Mike Kravetz wrote:
>> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> [...]
>>>> Not sure I understand. Do you propose to only free those vmemmap pages
>>>> when the pool is initialized during boot time and never allow them to
>>>> be freed up? That would certainly make it safer and maybe even simpler
>>>> wrt implementation.
>>>
>>> Exactly, let's keep it simple for now. I guess most use cases of this
>>> (virtualization, databases, ...) will allocate hugepages during boot
>>> and never free them.
>>
>> Not sure if I agree with that last statement. Database and virtualization
>> use cases from my employer allocate hugetlb pages after boot. It is
>> shortly after boot, but still not from the boot/kernel command line.
>
> Is there any strong reason for that?

The reason I have been given is that it is preferable to have SW compute
the number of needed huge pages after boot based on total memory, rather
than have a sysadmin calculate the number and add a boot parameter.

>> Somewhat related, but not exactly addressing this issue ...
>>
>> One idea discussed in a previous patch set was to disable PMD/huge page
>> mapping of vmemmap if this feature was enabled. This would eliminate a
>> bunch of the complex code doing page table manipulation. It does not
>> address the issue of struct page pages going away which is being
>> discussed here, but it could be a way to simplify the first version of
>> this code. If this is going to be an 'opt in' feature as previously
>> suggested, then eliminating the PMD/huge page vmemmap mapping may be
>> acceptable. My guess is that sysadmins would only 'opt in' if they
>> expect most of system memory to be used by hugetlb pages. We certainly
>> have database and virtualization use cases where this is true.
>
> Would this simplify the code considerably? I mean, the vmemmap page
> tables will need to be updated anyway. So that code has to stay. PMD
> entry splitting shouldn't be the most complex part of that operation. On
> the other hand dropping large pages for all vmemmaps will likely have a
> performance impact.

I agree with your points. This was just one way in which the patch set
could be simplified.
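The compute-after-boot approach can be as small as the following
userspace sketch (hypothetical tooling, assuming 2MB huge pages and a
90%-of-RAM target; a real deployment would be more careful about
rounding and reserved memory):

	#include <stdio.h>
	#include <sys/sysinfo.h>

	int main(void)
	{
		struct sysinfo si;
		unsigned long long bytes, nr;
		FILE *f;

		if (sysinfo(&si) != 0)
			return 1;

		/* size the pool from the RAM that actually came up */
		bytes = (unsigned long long)si.totalram * si.mem_unit;
		nr = bytes / 10 * 9 / (2ULL << 20);	/* 90%, in 2MB pages */

		f = fopen("/proc/sys/vm/nr_hugepages", "w");
		if (!f)
			return 1;
		fprintf(f, "%llu\n", nr);
		return fclose(f) == 0 ? 0 : 1;
	}

Because it sizes the pool from whatever memory is actually present,
this avoids hardcoding a count that silently becomes wrong when the
hardware changes, which is exactly the failure mode described below.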
On Mon, Nov 23, 2020 at 01:52:13PM -0800, Mike Kravetz wrote:
> On 11/22/20 11:38 PM, Michal Hocko wrote:
[...]
> > Is there any strong reason for that?
>
> The reason I have been given is that it is preferable to have SW compute
> the number of needed huge pages after boot based on total memory, rather
> than have a sysadmin calculate the number and add a boot parameter.

Oh, I remember this bug! I think it was posted publicly, even.

If the sysadmin configures, say, 90% of the RAM to be hugepages and then
a DIMM fails and the sysadmin doesn't remember to adjust the boot
parameter, Linux does some pretty horrible things and the symptom is
"Linux doesn't boot".