Message ID | 20210412120842.GY3697@techsingularity.net (mailing list archive) |
---|---|
State | New, archived |
Series | [v2,resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove |
On 12.04.21 14:08, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
>
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> Resending for email address correction and adding lists
>
> Changelog since v1
> o Minimal fix
>
>  mm/page_alloc.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5e8aedb64b57..9bf0db982f14 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8952,12 +8952,9 @@ void zone_pcp_enable(struct zone *zone)
>
>  void zone_pcp_reset(struct zone *zone)
>  {
> -        unsigned long flags;
>          int cpu;
>          struct per_cpu_pageset *pset;
>
> -        /* avoid races with drain_pages() */
> -        local_irq_save(flags);
>          if (zone->pageset != &boot_pageset) {
>                  for_each_online_cpu(cpu) {
>                          pset = per_cpu_ptr(zone->pageset, cpu);
> @@ -8966,7 +8963,6 @@ void zone_pcp_reset(struct zone *zone)
>                  free_percpu(zone->pageset);
>                  zone->pageset = &boot_pageset;
>          }
> -        local_irq_restore(flags);
>  }
>
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>

This was originally introduced by 340175b7d14d ("mm/hotplug: free
zone->pageset when a zone becomes empty"). I wonder why it was ever
added. Could it be that drain_pages() could have been called from an
interrupt handler (e.g., on concurrent CPU hotunplug of the current
CPU?) back then while we are just about to remove it ourselves?

Anyhow, if we need some protection after setting the zone unpopulated
(setting present_pages = 0), it doesn't seem like disabling local IRQs
is the right thing to do.

Reviewed-by: David Hildenbrand <david@redhat.com>
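David's hypothesis is plausible: before the drain path was moved to a
workqueue ("mm, page_alloc: drain per-cpu pages from workqueue context";
this pre-4.11 behaviour is an assumption of this sketch, not something
stated in the thread), drain_all_pages() ran drain_local_pages() from an
IPI, i.e. in interrupt context, roughly:

        /* Sketch of the historical drain path, reconstructed from memory:
         * every CPU with pages on its PCP lists runs drain_local_pages()
         * in hardirq context via an IPI. */
        on_each_cpu_mask(&cpus_with_pcps,
                         (smp_call_func_t)drain_local_pages, zone, 1);

Under that scheme, local_irq_save() in zone_pcp_reset() really did exclude
the drain IPI on the local CPU; it just never excluded drains running
concurrently on other CPUs, which is exactly the gap the patch description
points out.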
On 4/12/21 2:08 PM, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
>
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Yeah the irq disabling here is clearly bogus, so:

Acked-by: Vlastimil Babka <vbabka@suse.cz>

But I think Michal has a point that we might be best off leaving the
pagesets around, in a future change. I have some doubts that even with
your reordering of the reset/destroy after zonelist rebuild in v1 they
can't be reachable. We have no protection between zonelist rebuild and
zonelist traversal, and that's why we just leave pgdats around.

So I can imagine a task racing with memory hotremove might see
watermarks as ok in get_page_from_freelist() for the zone and proceed to
try_this_zone:, then get stalled/scheduled out while hotremove rebuilds
the zonelist and destroys the pcplists, then the first task is resumed
and proceeds with rmqueue_pcplist().

So that's very rare, thus not urgent, and this patch doesn't make it
less rare, so it's not a reason to block it.

> ---
> Resending for email address correction and adding lists
>
> Changelog since v1
> o Minimal fix
>
>  mm/page_alloc.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5e8aedb64b57..9bf0db982f14 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8952,12 +8952,9 @@ void zone_pcp_enable(struct zone *zone)
>
>  void zone_pcp_reset(struct zone *zone)
>  {
> -        unsigned long flags;
>          int cpu;
>          struct per_cpu_pageset *pset;
>
> -        /* avoid races with drain_pages() */
> -        local_irq_save(flags);
>          if (zone->pageset != &boot_pageset) {
>                  for_each_online_cpu(cpu) {
>                          pset = per_cpu_ptr(zone->pageset, cpu);
> @@ -8966,7 +8963,6 @@ void zone_pcp_reset(struct zone *zone)
>                  free_percpu(zone->pageset);
>                  zone->pageset = &boot_pageset;
>          }
> -        local_irq_restore(flags);
>  }
>
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>
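The window Vlastimil describes is a classic check-then-use race. A
minimal userspace analogue -- all names are illustrative stand-ins for
the kernel structures, not kernel code -- which AddressSanitizer can
catch as a use-after-free when the timing lines up:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct pageset { int count; };

static struct pageset boot_pageset;      /* always-valid fallback */
static struct pageset *zone_pageset;     /* stands in for zone->pageset */

static void *allocator(void *arg)
{
        struct pageset *pset = zone_pageset; /* "watermark check" passes */
        usleep(1000);                        /* stalled at try_this_zone: */
        printf("count=%d\n", pset->count);   /* "rmqueue_pcplist": UAF */
        return NULL;
}

static void *hotremove(void *arg)
{
        struct pageset *old = zone_pageset;
        zone_pageset = &boot_pageset;        /* what zone_pcp_reset() does */
        free(old);                           /* what free_percpu() does */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        zone_pageset = malloc(sizeof(*zone_pageset));
        zone_pageset->count = 42;
        pthread_create(&a, NULL, allocator, NULL);
        pthread_create(&b, NULL, hotremove, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

The pointer read is deliberately unsynchronized, mirroring the lack of
protection between zonelist traversal and rebuild; build with
gcc -fsanitize=address -pthread to observe the report.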
On Mon, Apr 12, 2021 at 02:40:18PM +0200, Vlastimil Babka wrote:
> On 4/12/21 2:08 PM, Mel Gorman wrote:
> > zone_pcp_reset allegedly protects against a race with drain_pages
> > using local_irq_save but this is bogus. local_irq_save only operates
> > on the local CPU. If memory hotplug is running on CPU A and drain_pages
> > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> > offers no protection.
> >
> > This patch deletes IRQ disable/enable on the grounds that IRQs protect
> > nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> > used after zone_pcp_enable(). That should be the case already because all
> > the pages have been freed and there is no page to put on the PCP lists.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Yeah the irq disabling here is clearly bogus, so:
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>

Thanks!

> But I think Michal has a point that we might be best off leaving the
> pagesets around, in a future change. I have some doubts that even with
> your reordering of the reset/destroy after zonelist rebuild in v1 they
> can't be reachable. We have no protection between zonelist rebuild and
> zonelist traversal, and that's why we just leave pgdats around.
>
> So I can imagine a task racing with memory hotremove might see
> watermarks as ok in get_page_from_freelist() for the zone and proceed to
> try_this_zone:, then get stalled/scheduled out while hotremove rebuilds
> the zonelist and destroys the pcplists, then the first task is resumed
> and proceeds with rmqueue_pcplist().
>
> So that's very rare, thus not urgent, and this patch doesn't make it
> less rare, so it's not a reason to block it.
>

After v1 of the patch, the race was reduced to the point between the
zone watermark check and the rmqueue_pcplist but yes, it still existed.
Closing it completely was either complex or expensive. Setting
zone->pageset = &boot_pageset before the free would shrink the race
further but that still leaves a potential memory ordering issue.

While fixable, it's either complex, expensive or both so yes, just
leaving the pageset structures in place would be much more
straight-forward assuming the structures were not allocated in the zone
that is being hot-removed. As things stand, I had trouble even testing
zone hot-remove as there were always a few pages left behind and I did
not chase down why. The focus was getting rid of the bogus
local_irq_save() because it was clearly wrong, offered a false sense of
safety, and was the last problematic local_irq_save() user in
page_alloc.c once local_lock is used to protect the PCP structures.
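The "publish before free" narrowing Mel mentions could look roughly like
the sketch below. This is not the applied patch, only an illustration of
the idea and of why it merely shrinks the window: nothing orders the
pointer store against a reader that already loaded the old value.

        /* Hypothetical reordering inside zone_pcp_reset(); cpu is
         * declared as in the function shown in the diff. */
        if (zone->pageset != &boot_pageset) {
                struct per_cpu_pageset __percpu *old = zone->pageset;

                /* Publish the always-valid fallback first... */
                zone->pageset = &boot_pageset;
                /* ...but without a barrier or an RCU grace period a
                 * concurrent reader may still hold 'old', hence the
                 * residual memory ordering issue noted above. */
                for_each_online_cpu(cpu)
                        drain_zonestat(zone, per_cpu_ptr(old, cpu));
                free_percpu(old);
        }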
On 12.04.21 16:08, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 02:40:18PM +0200, Vlastimil Babka wrote:
>> On 4/12/21 2:08 PM, Mel Gorman wrote:
>>> zone_pcp_reset allegedly protects against a race with drain_pages
>>> using local_irq_save but this is bogus. local_irq_save only operates
>>> on the local CPU. If memory hotplug is running on CPU A and drain_pages
>>> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
>>> offers no protection.
>>>
>>> This patch deletes IRQ disable/enable on the grounds that IRQs protect
>>> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
>>> used after zone_pcp_enable(). That should be the case already because all
>>> the pages have been freed and there is no page to put on the PCP lists.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>
>> Yeah the irq disabling here is clearly bogus, so:
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>
>
> Thanks!
>
>> But I think Michal has a point that we might be best off leaving the
>> pagesets around, in a future change. I have some doubts that even with
>> your reordering of the reset/destroy after zonelist rebuild in v1 they
>> can't be reachable. We have no protection between zonelist rebuild and
>> zonelist traversal, and that's why we just leave pgdats around.
>>
>> So I can imagine a task racing with memory hotremove might see
>> watermarks as ok in get_page_from_freelist() for the zone and proceed to
>> try_this_zone:, then get stalled/scheduled out while hotremove rebuilds
>> the zonelist and destroys the pcplists, then the first task is resumed
>> and proceeds with rmqueue_pcplist().
>>
>> So that's very rare, thus not urgent, and this patch doesn't make it
>> less rare, so it's not a reason to block it.
>>
>
> After v1 of the patch, the race was reduced to the point between the
> zone watermark check and the rmqueue_pcplist but yes, it still existed.
> Closing it completely was either complex or expensive. Setting
> zone->pageset = &boot_pageset before the free would shrink the race
> further but that still leaves a potential memory ordering issue.
>
> While fixable, it's either complex, expensive or both so yes, just
> leaving the pageset structures in place would be much more
> straight-forward assuming the structures were not allocated in the zone
> that is being hot-removed. As things stand, I had trouble even testing
> zone hot-remove as there were always a few pages left behind and I did
> not chase down why.

Can you elaborate? I can reliably trigger zone present pages going to 0
by just hotplugging a DIMM, onlining the memory block devices to the
MOVABLE zone, followed by offlining the memory block again.
On Mon, Apr 12, 2021 at 04:12:11PM +0200, David Hildenbrand wrote:
> > After v1 of the patch, the race was reduced to the point between the
> > zone watermark check and the rmqueue_pcplist but yes, it still existed.
> > Closing it completely was either complex or expensive. Setting
> > zone->pageset = &boot_pageset before the free would shrink the race
> > further but that still leaves a potential memory ordering issue.
> >
> > While fixable, it's either complex, expensive or both so yes, just
> > leaving the pageset structures in place would be much more
> > straight-forward assuming the structures were not allocated in the zone
> > that is being hot-removed. As things stand, I had trouble even testing
> > zone hot-remove as there were always a few pages left behind and I did
> > not chase down why.
>
> Can you elaborate? I can reliably trigger zone present pages going to 0
> by just hotplugging a DIMM, onlining the memory block devices to the
> MOVABLE zone, followed by offlining the memory block again.
>

For the machine I was testing on, I tried offlining all memory within
a zone on a NUMA machine. Even if I used movable_zone to create a zone
or numa=fake to create multiple fake nodes and zones, there were always
either reserved or pinned pages preventing the full zone from being
removed.
On 12.04.21 17:27, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 04:12:11PM +0200, David Hildenbrand wrote:
>>> After v1 of the patch, the race was reduced to the point between the
>>> zone watermark check and the rmqueue_pcplist but yes, it still existed.
>>> Closing it completely was either complex or expensive. Setting
>>> zone->pageset = &boot_pageset before the free would shrink the race
>>> further but that still leaves a potential memory ordering issue.
>>>
>>> While fixable, it's either complex, expensive or both so yes, just
>>> leaving the pageset structures in place would be much more
>>> straight-forward assuming the structures were not allocated in the zone
>>> that is being hot-removed. As things stand, I had trouble even testing
>>> zone hot-remove as there were always a few pages left behind and I did
>>> not chase down why.
>>
>> Can you elaborate? I can reliably trigger zone present pages going to 0
>> by just hotplugging a DIMM, onlining the memory block devices to the
>> MOVABLE zone, followed by offlining the memory block again.
>>
>
> For the machine I was testing on, I tried offlining all memory within
> a zone on a NUMA machine. Even if I used movable_zone to create a zone
> or numa=fake to create multiple fake nodes and zones, there were always
> either reserved or pinned pages preventing the full zone from being
> removed.

What can happen is that memblock allocations are still placed into the
MOVABLE zone -- even with "movable_node" IIRC. Memory hot(un)plug is
usually best tested in QEMU via pc-dimm devices.
On Mon 12-04-21 13:08:42, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
>
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.

Yes, that is the case since ec6e8c7e0314 ("mm, page_alloc: disable
pcplists during memory offline"). Prior to this commit the behavior was
undefined but full zone/node hotremove is rare enough that an existing
race was likely never observed.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> Resending for email address correction and adding lists
>
> Changelog since v1
> o Minimal fix
>
>  mm/page_alloc.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5e8aedb64b57..9bf0db982f14 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8952,12 +8952,9 @@ void zone_pcp_enable(struct zone *zone)
>
>  void zone_pcp_reset(struct zone *zone)
>  {
> -        unsigned long flags;
>          int cpu;
>          struct per_cpu_pageset *pset;
>
> -        /* avoid races with drain_pages() */
> -        local_irq_save(flags);
>          if (zone->pageset != &boot_pageset) {
>                  for_each_online_cpu(cpu) {
>                          pset = per_cpu_ptr(zone->pageset, cpu);
> @@ -8966,7 +8963,6 @@ void zone_pcp_reset(struct zone *zone)
>                  free_percpu(zone->pageset);
>                  zone->pageset = &boot_pageset;
>          }
> -        local_irq_restore(flags);
>  }
>
>  #ifdef CONFIG_MEMORY_HOTREMOVE
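For readers without ec6e8c7e0314 at hand, the guarantee Michal refers to
comes from bracketing the offline operation so the PCP lists stay empty.
Paraphrased from that commit (a sketch of the mechanism, not a verbatim
quote):

        void zone_pcp_disable(struct zone *zone)
        {
                mutex_lock(&pcp_batch_high_lock);
                /* high = 0, batch = 1: freed pages bypass the PCP lists */
                __zone_set_pageset_high_and_batch(zone, 0, 1);
                /* flush whatever was already sitting on them */
                __drain_all_pages(zone, true);
        }

With the lists pinned empty for the whole offline window and only
re-enabled by zone_pcp_enable(), there is nothing for zone_pcp_reset()
to race against, which is why the deleted IRQ fiddling protected
nothing.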
On Mon 12-04-21 14:40:18, Vlastimil Babka wrote:
> On 4/12/21 2:08 PM, Mel Gorman wrote:
> > zone_pcp_reset allegedly protects against a race with drain_pages
> > using local_irq_save but this is bogus. local_irq_save only operates
> > on the local CPU. If memory hotplug is running on CPU A and drain_pages
> > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> > offers no protection.
> >
> > This patch deletes IRQ disable/enable on the grounds that IRQs protect
> > nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> > used after zone_pcp_enable(). That should be the case already because all
> > the pages have been freed and there is no page to put on the PCP lists.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Yeah the irq disabling here is clearly bogus, so:
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> But I think Michal has a point that we might be best off leaving the
> pagesets around, in a future change. I have some doubts that even with
> your reordering of the reset/destroy after zonelist rebuild in v1 they
> can't be reachable. We have no protection between zonelist rebuild and
> zonelist traversal, and that's why we just leave pgdats around.
>
> So I can imagine a task racing with memory hotremove might see
> watermarks as ok in get_page_from_freelist() for the zone and proceed to
> try_this_zone:, then get stalled/scheduled out while hotremove rebuilds
> the zonelist and destroys the pcplists, then the first task is resumed
> and proceeds with rmqueue_pcplist().
>
> So that's very rare, thus not urgent, and this patch doesn't make it
> less rare, so it's not a reason to block it.

Completely agreed here. Not an urgent thing to work on but something to
look into long term.
On 4/12/21 4:08 PM, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 02:40:18PM +0200, Vlastimil Babka wrote:
>> On 4/12/21 2:08 PM, Mel Gorman wrote:
>
> the pageset structures in place would be much more straight-forward
> assuming the structures were not allocated in the zone that is being
> hot-removed.

I would expect this is not possible, at least for ZONE_MOVABLE, as the
percpu allocations should be GFP_KERNEL. And it's not realistic to
expect offlining to succeed at all without using ZONE_MOVABLE.

AFAIK even Oscar's work on using the node to self-contain its own
structures is only applicable to struct pages, not percpu allocations?
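Spelling out Vlastimil's point: the pageset cannot sit in the zone being
removed because per-cpu chunks are populated with GFP_KERNEL, and
gfp_zone(GFP_KERNEL) never selects ZONE_MOVABLE -- only the
__GFP_HIGHMEM|__GFP_MOVABLE combination does. A sketch (the explicit-gfp
call is illustrative; the actual setup path uses plain alloc_percpu(),
which is GFP_KERNEL internally):

        /* Backing pages come from ZONE_NORMAL (or the DMA zones),
         * never ZONE_MOVABLE, because no __GFP_MOVABLE is passed: */
        struct per_cpu_pageset __percpu *p =
                alloc_percpu_gfp(struct per_cpu_pageset, GFP_KERNEL);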
On Tue, Apr 13, 2021 at 11:36:08AM +0200, Vlastimil Babka wrote:
> On 4/12/21 4:08 PM, Mel Gorman wrote:
> > On Mon, Apr 12, 2021 at 02:40:18PM +0200, Vlastimil Babka wrote:
> >> On 4/12/21 2:08 PM, Mel Gorman wrote:
> >
> > the pageset structures in place would be much more straight-forward
> > assuming the structures were not allocated in the zone that is being
> > hot-removed.
>
> I would expect this is not possible, at least for ZONE_MOVABLE, as the
> percpu allocations should be GFP_KERNEL.

True.

> And it's not realistic to expect offlining to
> succeed at all without using ZONE_MOVABLE.
>
> AFAIK even Oscar's work on using the node to self-contain its own
> structures is only applicable to struct pages, not percpu allocations?

That I don't know as I didn't check, although in general it would be
somewhat unfortunate if per-cpu structures were remote. It wouldn't be
critical given that they'll be in cache, assuming the per-cpu structures
are not straddling cache lines.
On Tue 13-04-21 11:36:08, Vlastimil Babka wrote:
[...]
> AFAIK even Oscar's work on using the node to self-contain its own
> structures is only applicable to struct pages, not percpu allocations?

Correct. Teaching the pcp storage to live in the movable zone sounds
like a large undertaking to me. Not sure this is worth it TBH. Even the
idea of synchronizing pcp access with memory hotplug makes for a decent
headache.
On Mon, Apr 12, 2021 at 01:08:42PM +0100, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
>
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Oscar Salvador <osalvador@suse.de>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e8aedb64b57..9bf0db982f14 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8952,12 +8952,9 @@ void zone_pcp_enable(struct zone *zone)
 
 void zone_pcp_reset(struct zone *zone)
 {
-        unsigned long flags;
         int cpu;
         struct per_cpu_pageset *pset;
 
-        /* avoid races with drain_pages() */
-        local_irq_save(flags);
         if (zone->pageset != &boot_pageset) {
                 for_each_online_cpu(cpu) {
                         pset = per_cpu_ptr(zone->pageset, cpu);
@@ -8966,7 +8963,6 @@ void zone_pcp_reset(struct zone *zone)
                 free_percpu(zone->pageset);
                 zone->pageset = &boot_pageset;
         }
-        local_irq_restore(flags);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
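For reference, zone_pcp_reset() as it reads with the patch applied,
reconstructed from the two hunks above. The drain_zonestat() call and
the loop's closing brace fall in the context elided between the hunks
and are filled in from the mainline source of that period:

void zone_pcp_reset(struct zone *zone)
{
        int cpu;
        struct per_cpu_pageset *pset;

        if (zone->pageset != &boot_pageset) {
                for_each_online_cpu(cpu) {
                        pset = per_cpu_ptr(zone->pageset, cpu);
                        drain_zonestat(zone, pset);
                }
                free_percpu(zone->pageset);
                zone->pageset = &boot_pageset;
        }
}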
zone_pcp_reset allegedly protects against a race with drain_pages
using local_irq_save but this is bogus. local_irq_save only operates
on the local CPU. If memory hotplug is running on CPU A and drain_pages
is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
offers no protection.

This patch deletes IRQ disable/enable on the grounds that IRQs protect
nothing and assumes the existing hotplug paths guarantee the PCP cannot be
used after zone_pcp_enable(). That should be the case already because all
the pages have been freed and there is no page to put on the PCP lists.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
Resending for email address correction and adding lists

Changelog since v1
o Minimal fix

 mm/page_alloc.c | 4 ----
 1 file changed, 4 deletions(-)