Message ID | 20210409120957.GM3697@techsingularity.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove | expand |
On Fri 09-04-21 13:09:57, Mel Gorman wrote: > zone_pcp_reset allegedly protects against a race with drain_pages > using local_irq_save but this is bogus. local_irq_save only operates > on the local CPU. If memory hotplug is running on CPU A and drain_pages > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and > offers no protection. Yes, the synchronization aspect is bogus indeed. > This patch reorders memory hotremove such that the PCP structures > relevant to the zone are no longer reachable by the time the structures > are freed. With this reordering, no protection is required to prevent > a use-after-free and the IRQs can be left enabled. zone_pcp_reset is > renamed to zone_pcp_destroy to make it clear that the per-cpu structures > are deleted when the function returns. Wouldn't it be much easier to simply not destroy/reset pcp of an empty zone at all? The whole point of this exercise seems to be described in 340175b7d14d5. setup_zone_pageset can check for an already allocated pcp and simply reinitialize it.
On Fri 09-04-21 14:42:58, Michal Hocko wrote: > On Fri 09-04-21 13:09:57, Mel Gorman wrote: > > zone_pcp_reset allegedly protects against a race with drain_pages > > using local_irq_save but this is bogus. local_irq_save only operates > > on the local CPU. If memory hotplug is running on CPU A and drain_pages > > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and > > offers no protection. > > Yes, the synchronization aspect is bogus indeed. > > > This patch reorders memory hotremove such that the PCP structures > > relevant to the zone are no longer reachable by the time the structures > > are freed. With this reordering, no protection is required to prevent > > a use-after-free and the IRQs can be left enabled. zone_pcp_reset is > > renamed to zone_pcp_destroy to make it clear that the per-cpu structures > > are deleted when the function returns. > > Wouldn't it be much easier to simply not destroy/reset pcp of an empty > zone at all? The whole point of this exercise seems to be described in > 340175b7d14d5. setup_zone_pageset can check for an already allocated pcp > and simply reinitialize it. I meant this diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 754026a9164d..7169342f5474 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1810,7 +1818,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) init_per_zone_wmark_min(); if (!populated_zone(zone)) { - zone_pcp_reset(zone); build_all_zonelists(NULL); } else zone_pcp_update(zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e6a602e82860..b0fdda77e570 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6496,7 +6496,13 @@ void __meminit setup_zone_pageset(struct zone *zone) struct per_cpu_pageset *p; int cpu; - zone->pageset = alloc_percpu(struct per_cpu_pageset); + /* + * zone could have gone completely offline during memory hotplug + * when the pgdat is left behind for simplicity. On a next onlining + * we do not need to reallocate pcp state. + */ + if (!zone->pageset) + zone->pageset = alloc_percpu(struct per_cpu_pageset); for_each_possible_cpu(cpu) { p = per_cpu_ptr(zone->pageset, cpu); pageset_init(p); @@ -8803,25 +8809,6 @@ void zone_pcp_enable(struct zone *zone) mutex_unlock(&pcp_batch_high_lock); } -void zone_pcp_reset(struct zone *zone) -{ - unsigned long flags; - int cpu; - struct per_cpu_pageset *pset; - - /* avoid races with drain_pages() */ - local_irq_save(flags); - if (zone->pageset != &boot_pageset) { - for_each_online_cpu(cpu) { - pset = per_cpu_ptr(zone->pageset, cpu); - drain_zonestat(zone, pset); - } - free_percpu(zone->pageset); - zone->pageset = &boot_pageset; - } - local_irq_restore(flags); -} - #ifdef CONFIG_MEMORY_HOTREMOVE /* * All pages in the range must be in a single zone, must not contain holes,
On Fri, Apr 09, 2021 at 02:48:12PM +0200, Michal Hocko wrote: > On Fri 09-04-21 14:42:58, Michal Hocko wrote: > > On Fri 09-04-21 13:09:57, Mel Gorman wrote: > > > zone_pcp_reset allegedly protects against a race with drain_pages > > > using local_irq_save but this is bogus. local_irq_save only operates > > > on the local CPU. If memory hotplug is running on CPU A and drain_pages > > > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and > > > offers no protection. > > > > Yes, the synchronization aspect is bogus indeed. > > > > > This patch reorders memory hotremove such that the PCP structures > > > relevant to the zone are no longer reachable by the time the structures > > > are freed. With this reordering, no protection is required to prevent > > > a use-after-free and the IRQs can be left enabled. zone_pcp_reset is > > > renamed to zone_pcp_destroy to make it clear that the per-cpu structures > > > are deleted when the function returns. > > > > Wouldn't it be much easier to simply not destroy/reset pcp of an empty > > zone at all? The whole point of this exercise seems to be described in > > 340175b7d14d5. setup_zone_pageset can check for an already allocated pcp > > and simply reinitialize it. > > I meant this > It might be simplier but if the intention is to free as much memory as possible during hot-remove, it seems wasteful to leave the per-cpu structures behind if we do not have to. If a problem with my patch can be spotted then I'm happy to go with an alternative fix but there are two minor issues with your proposed fix. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e6a602e82860..b0fdda77e570 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6496,7 +6496,13 @@ void __meminit setup_zone_pageset(struct zone *zone) > struct per_cpu_pageset *p; > int cpu; > > - zone->pageset = alloc_percpu(struct per_cpu_pageset); > + /* > + * zone could have gone completely offline during memory hotplug > + * when the pgdat is left behind for simplicity. On a next onlining > + * we do not need to reallocate pcp state. > + */ > + if (!zone->pageset) > + zone->pageset = alloc_percpu(struct per_cpu_pageset); Should be "if (zone->pageset != &boot_pageset)" ? > for_each_possible_cpu(cpu) { > p = per_cpu_ptr(zone->pageset, cpu); > pageset_init(p); > @@ -8803,25 +8809,6 @@ void zone_pcp_enable(struct zone *zone) > mutex_unlock(&pcp_batch_high_lock); > } > > -void zone_pcp_reset(struct zone *zone) > -{ > - unsigned long flags; > - int cpu; > - struct per_cpu_pageset *pset; > - > - /* avoid races with drain_pages() */ > - local_irq_save(flags); > - if (zone->pageset != &boot_pageset) { > - for_each_online_cpu(cpu) { > - pset = per_cpu_ptr(zone->pageset, cpu); > - drain_zonestat(zone, pset); > - } > - free_percpu(zone->pageset); > - zone->pageset = &boot_pageset; > - } > - local_irq_restore(flags); > -} > - zone_pcp_reset still needs to exist to drain the remaining vmstats or it'll break 5a883813845a ("memory-hotplug: fix zone stat mismatch").
On Fri 09-04-21 14:42:21, Mel Gorman wrote: > On Fri, Apr 09, 2021 at 02:48:12PM +0200, Michal Hocko wrote: > > On Fri 09-04-21 14:42:58, Michal Hocko wrote: > > > On Fri 09-04-21 13:09:57, Mel Gorman wrote: > > > > zone_pcp_reset allegedly protects against a race with drain_pages > > > > using local_irq_save but this is bogus. local_irq_save only operates > > > > on the local CPU. If memory hotplug is running on CPU A and drain_pages > > > > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and > > > > offers no protection. > > > > > > Yes, the synchronization aspect is bogus indeed. > > > > > > > This patch reorders memory hotremove such that the PCP structures > > > > relevant to the zone are no longer reachable by the time the structures > > > > are freed. With this reordering, no protection is required to prevent > > > > a use-after-free and the IRQs can be left enabled. zone_pcp_reset is > > > > renamed to zone_pcp_destroy to make it clear that the per-cpu structures > > > > are deleted when the function returns. > > > > > > Wouldn't it be much easier to simply not destroy/reset pcp of an empty > > > zone at all? The whole point of this exercise seems to be described in > > > 340175b7d14d5. setup_zone_pageset can check for an already allocated pcp > > > and simply reinitialize it. > > > > I meant this > > > > It might be simplier but if the intention is to free as much memory > as possible during hot-remove, it seems wasteful to leave the per-cpu > structures behind if we do not have to. We do leave the whole pgdat behind. I do not think pagesets really do matter. > If a problem with my patch can > be spotted then I'm happy to go with an alternative fix but there are > two minor issues with your proposed fix. I will not insist but this code has proven to bitrot and I just find it much simpler to drop it altogether rather than conserve it in some form. Not something I would insist though. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index e6a602e82860..b0fdda77e570 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -6496,7 +6496,13 @@ void __meminit setup_zone_pageset(struct zone *zone) > > struct per_cpu_pageset *p; > > int cpu; > > > > - zone->pageset = alloc_percpu(struct per_cpu_pageset); > > + /* > > + * zone could have gone completely offline during memory hotplug > > + * when the pgdat is left behind for simplicity. On a next onlining > > + * we do not need to reallocate pcp state. > > + */ > > + if (!zone->pageset) > > + zone->pageset = alloc_percpu(struct per_cpu_pageset); > > Should be "if (zone->pageset != &boot_pageset)" ? Memory hotplug really wants the NULL check. it doesn't use boot_pageset (if we drop rest to boot_pageset). But you are right that the boot time initialization first sets boot_pageset (zone_pcp_init) and initializes real pagesets later (setup_per_cpu_pageset). But this can be handled at the memory hotplug layer I believe diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 754026a9164d..1cadfec323fc 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -883,7 +883,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, */ if (!populated_zone(zone)) { need_zonelists_rebuild = 1; - setup_zone_pageset(zone); + if (!zone->pageset) + setup_zone_pageset(zone); } > > for_each_possible_cpu(cpu) { > > p = per_cpu_ptr(zone->pageset, cpu); > > pageset_init(p); > > @@ -8803,25 +8809,6 @@ void zone_pcp_enable(struct zone *zone) > > mutex_unlock(&pcp_batch_high_lock); > > } > > > > -void zone_pcp_reset(struct zone *zone) > > -{ > > - unsigned long flags; > > - int cpu; > > - struct per_cpu_pageset *pset; > > - > > - /* avoid races with drain_pages() */ > > - local_irq_save(flags); > > - if (zone->pageset != &boot_pageset) { > > - for_each_online_cpu(cpu) { > > - pset = per_cpu_ptr(zone->pageset, cpu); > > - drain_zonestat(zone, pset); > > - } > > - free_percpu(zone->pageset); > > - zone->pageset = &boot_pageset; > > - } > > - local_irq_restore(flags); > > -} > > - > > zone_pcp_reset still needs to exist to drain the remaining vmstats or > it'll break 5a883813845a ("memory-hotplug: fix zone stat > mismatch"). Are you sure we are reseting vmstats in the hotremove. I do not see anything like that. Maybe this was needed at the time. I will double check.
On Fri, Apr 09, 2021 at 04:37:59PM +0200, Michal Hocko wrote: > > > > Yes, the synchronization aspect is bogus indeed. > > > > > > > > > This patch reorders memory hotremove such that the PCP structures > > > > > relevant to the zone are no longer reachable by the time the structures > > > > > are freed. With this reordering, no protection is required to prevent > > > > > a use-after-free and the IRQs can be left enabled. zone_pcp_reset is > > > > > renamed to zone_pcp_destroy to make it clear that the per-cpu structures > > > > > are deleted when the function returns. > > > > > > > > Wouldn't it be much easier to simply not destroy/reset pcp of an empty > > > > zone at all? The whole point of this exercise seems to be described in > > > > 340175b7d14d5. setup_zone_pageset can check for an already allocated pcp > > > > and simply reinitialize it. > > > > > > I meant this > > > > > > > It might be simplier but if the intention is to free as much memory > > as possible during hot-remove, it seems wasteful to leave the per-cpu > > structures behind if we do not have to. > > We do leave the whole pgdat behind. I do not think pagesets really do > matter. > Probably not given that zone and node hotplug is fragile to say the least with a low success rate. Still, it seems wasteful if we can preserve the freeing part and the alternative fix is more subtle than it appears. > > If a problem with my patch can > > be spotted then I'm happy to go with an alternative fix but there are > > two minor issues with your proposed fix. > > I will not insist but this code has proven to bitrot and I just find it > much simpler to drop it altogether rather than conserve it in some form. > Not something I would insist though. > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index e6a602e82860..b0fdda77e570 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -6496,7 +6496,13 @@ void __meminit setup_zone_pageset(struct zone *zone) > > > struct per_cpu_pageset *p; > > > int cpu; > > > > > > - zone->pageset = alloc_percpu(struct per_cpu_pageset); > > > + /* > > > + * zone could have gone completely offline during memory hotplug > > > + * when the pgdat is left behind for simplicity. On a next onlining > > > + * we do not need to reallocate pcp state. > > > + */ > > > + if (!zone->pageset) > > > + zone->pageset = alloc_percpu(struct per_cpu_pageset); > > > > Should be "if (zone->pageset != &boot_pageset)" ? > > Memory hotplug really wants the NULL > check. it doesn't use boot_pageset (if we drop rest to boot_pageset). > But you are right that the boot time initialization first sets > boot_pageset (zone_pcp_init) and initializes real pagesets later > (setup_per_cpu_pageset). But this can be handled at the memory hotplug > layer I believe > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 754026a9164d..1cadfec323fc 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -883,7 +883,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, > */ > if (!populated_zone(zone)) { > need_zonelists_rebuild = 1; > - setup_zone_pageset(zone); > + if (!zone->pageset) > + setup_zone_pageset(zone); > } > I think this will be fragile because it makes assumptions on how boot works -- specifically that zone->pageset is left alone for unpopulated zones. This will only work for zones in unpopulated *nodes* during boot because of this sequence free_area_init_core(pgdat) for (j = 0; j < MAX_NR_ZONES; j++) { ... zone_init_internals(zone) } -> zone_init_internals(zone) -> zone_pcp_init(zone) zone->pageset = &boot_pageset; I think a NULL check will mean that hotplugging a previously unpopulated zone will run the risk of using boot_pageset. That will likely explode because IRQs disabled does not protect boot_pageset, it only works early in boot or during the initial hotplug when it's not visible yet. > > > for_each_possible_cpu(cpu) { > > > p = per_cpu_ptr(zone->pageset, cpu); > > > pageset_init(p); > > > @@ -8803,25 +8809,6 @@ void zone_pcp_enable(struct zone *zone) > > > mutex_unlock(&pcp_batch_high_lock); > > > } > > > > > > -void zone_pcp_reset(struct zone *zone) > > > -{ > > > - unsigned long flags; > > > - int cpu; > > > - struct per_cpu_pageset *pset; > > > - > > > - /* avoid races with drain_pages() */ > > > - local_irq_save(flags); > > > - if (zone->pageset != &boot_pageset) { > > > - for_each_online_cpu(cpu) { > > > - pset = per_cpu_ptr(zone->pageset, cpu); > > > - drain_zonestat(zone, pset); > > > - } > > > - free_percpu(zone->pageset); > > > - zone->pageset = &boot_pageset; > > > - } > > > - local_irq_restore(flags); > > > -} > > > - > > > > zone_pcp_reset still needs to exist to drain the remaining vmstats or > > it'll break 5a883813845a ("memory-hotplug: fix zone stat > > mismatch"). > > Are you sure we are reseting vmstats in the hotremove. I do not see > anything like that. Maybe this was needed at the time. I will double > check. zone_pcp_reset calls drain_zonestat to apply the per-cpu vmstat deltas to the atomic per-zone and global stats. If anything, the minimal "fix" is to simply delete IRQ disable/enable on the grounds that IRQs protect nothing and assume the existing hotplug paths guarantees the PCP cannot be used after zone_pcp_enable(). That should be the case already because all the pages have been freed and there is nothing to even put into the PCPs but I worried that the PCP structure itself might still be reachable even if it's useless which is why I freed the structure once they could not be reached via zonelists.
>>> zone_pcp_reset still needs to exist to drain the remaining vmstats or >>> it'll break 5a883813845a ("memory-hotplug: fix zone stat >>> mismatch"). >> >> Are you sure we are reseting vmstats in the hotremove. I do not see >> anything like that. Maybe this was needed at the time. I will double >> check. > > zone_pcp_reset calls drain_zonestat to apply the per-cpu vmstat deltas > to the atomic per-zone and global stats. > > If anything, the minimal "fix" is to simply delete IRQ disable/enable on > the grounds that IRQs protect nothing and assume the existing hotplug > paths guarantees the PCP cannot be used after zone_pcp_enable(). That ^ that sounds sane to me > should be the case already because all the pages have been freed and > there is nothing to even put into the PCPs but I worried that the PCP > structure itself might still be reachable even if it's useless which is > why I freed the structure once they could not be reached via zonelists. >
On Fri 09-04-21 16:12:59, Mel Gorman wrote: [...] > If anything, the minimal "fix" is to simply delete IRQ disable/enable on > the grounds that IRQs protect nothing and assume the existing hotplug > paths guarantees the PCP cannot be used after zone_pcp_enable(). That > should be the case already because all the pages have been freed and > there is nothing to even put into the PCPs but I worried that the PCP > structure itself might still be reachable even if it's useless which is > why I freed the structure once they could not be reached via zonelists. OK. Let's do that for now and I will put a follow up on my todo list. Thanks!
diff --git a/mm/internal.h b/mm/internal.h index 09adf152a10b..cc34ce4461b7 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -203,7 +203,7 @@ extern void free_unref_page(struct page *page); extern void free_unref_page_list(struct list_head *list); extern void zone_pcp_update(struct zone *zone); -extern void zone_pcp_reset(struct zone *zone); +extern void zone_pcp_destroy(struct zone *zone); extern void zone_pcp_disable(struct zone *zone); extern void zone_pcp_enable(struct zone *zone); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0cdbbfbc5757..3d059c9f9c2d 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1687,12 +1687,16 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages; spin_unlock_irqrestore(&zone->lock, flags); - zone_pcp_enable(zone); - /* removal success */ adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); zone->present_pages -= nr_pages; + /* + * Restore PCP after managed pages has been updated. Unpopulated + * zones PCP structures will remain unusable. + */ + zone_pcp_enable(zone); + pgdat_resize_lock(zone->zone_pgdat, &flags); zone->zone_pgdat->node_present_pages -= nr_pages; pgdat_resize_unlock(zone->zone_pgdat, &flags); @@ -1700,8 +1704,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) init_per_zone_wmark_min(); if (!populated_zone(zone)) { - zone_pcp_reset(zone); build_all_zonelists(NULL); + zone_pcp_destroy(zone); } else zone_pcp_update(zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5e8aedb64b57..d6c3db853552 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8946,18 +8946,29 @@ void zone_pcp_disable(struct zone *zone) void zone_pcp_enable(struct zone *zone) { - __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch); + /* + * If the zone is populated, restore the high and batch counts. + * If unpopulated, leave the high and batch count as 0 and 1 + * respectively as done by zone_pcp_disable. The per-cpu + * structures will later be freed by zone_pcp_destroy. + */ + if (populated_zone(zone)) + __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch); + mutex_unlock(&pcp_batch_high_lock); } -void zone_pcp_reset(struct zone *zone) +/* + * Called when a zone has been hot-removed. At this point, the PCP has been + * drained, disabled and the zone is removed from the zonelists so the + * structures are no longer in use. PCP was disabled/drained by + * zone_pcp_disable. This function will drain any remaining vmstat deltas. + */ +void zone_pcp_destroy(struct zone *zone) { - unsigned long flags; int cpu; struct per_cpu_pageset *pset; - /* avoid races with drain_pages() */ - local_irq_save(flags); if (zone->pageset != &boot_pageset) { for_each_online_cpu(cpu) { pset = per_cpu_ptr(zone->pageset, cpu); @@ -8966,7 +8977,6 @@ void zone_pcp_reset(struct zone *zone) free_percpu(zone->pageset); zone->pageset = &boot_pageset; } - local_irq_restore(flags); } #ifdef CONFIG_MEMORY_HOTREMOVE
zone_pcp_reset allegedly protects against a race with drain_pages using local_irq_save but this is bogus. local_irq_save only operates on the local CPU. If memory hotplug is running on CPU A and drain_pages is running on CPU B, disabling IRQs on CPU A does not affect CPU B and offers no protection. This patch reorders memory hotremove such that the PCP structures relevant to the zone are no longer reachable by the time the structures are freed. With this reordering, no protection is required to prevent a use-after-free and the IRQs can be left enabled. zone_pcp_reset is renamed to zone_pcp_destroy to make it clear that the per-cpu structures are deleted when the function returns. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- mm/internal.h | 2 +- mm/memory_hotplug.c | 10 +++++++--- mm/page_alloc.c | 22 ++++++++++++++++------ 3 files changed, 24 insertions(+), 10 deletions(-)