Message ID | 20241203094732.200195-5-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/page_alloc: gfp flags cleanups for alloc_contig_*() | expand |
On 12/3/24 10:47, David Hildenbrand wrote: > It's all a bit complicated for alloc_contig_range(). For example, we don't > support many flags, so let's start bailing out on unsupported > ones -- ignoring the placement hints, as we are already given the range > to allocate. > > While we currently set cc.gfp_mask, in __alloc_contig_migrate_range() we > simply create yet another GFP mask whereby we ignore the reclaim flags > specify by the caller. That looks very inconsistent. > > Let's clean it up, constructing the gfp flags used for > compaction/migration exactly once. Update the documentation of the > gfp_mask parameter for alloc_contig_range() and alloc_contig_pages(). > > Acked-by: Zi Yan <ziy@nvidia.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> > + /* > + * Flags to control page compaction/migration/reclaim, to free up our > + * page range. Migratable pages are movable, __GFP_MOVABLE is implied > + * for them. > + * > + * Traditionally we always had __GFP_HARDWALL|__GFP_RETRY_MAYFAIL set, > + * keep doing that to not degrade callers. > + */ Wonder if we could revisit that eventually. Why limit migration targets by cpuset via __GFP_HARDWALL if we were not called with __GFP_HARDWALL? And why weaken the attempts with __GFP_RETRY_MAYFAIL if we didn't specify it? Unless I'm missing something, cc->gfp is only checked for __GFP_FS and __GFP_NOWARN in few places, so it's mostly migration_target_control the callers could meaningfully influence. > + *gfp_cc_mask = (gfp_mask & (reclaim_mask | cc_action_mask)) | > + __GFP_HARDWALL | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL; > + return 0; > +} > + > /** > * alloc_contig_range() -- tries to allocate given range of pages > * @start: start PFN to allocate > @@ -6398,7 +6431,9 @@ static void split_free_pages(struct list_head *list) > * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks > * in range must have the same migratetype and it must > * be either of the two. > - * @gfp_mask: GFP mask to use during compaction > + * @gfp_mask: GFP mask. Node/zone/placement hints are ignored; only some > + * action and reclaim modifiers are supported. Reclaim modifiers > + * control allocation behavior during compaction/migration/reclaim. > * > * The PFN range does not have to be pageblock aligned. The PFN range must > * belong to a single zone. > @@ -6424,11 +6459,14 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, > .mode = MIGRATE_SYNC, > .ignore_skip_hint = true, > .no_set_skip_hint = true, > - .gfp_mask = current_gfp_context(gfp_mask), > .alloc_contig = true, > }; > INIT_LIST_HEAD(&cc.migratepages); > > + gfp_mask = current_gfp_context(gfp_mask); > + if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) > + return -EINVAL; > + > /* > * What we do here is we mark all pageblocks in range as > * MIGRATE_ISOLATE. Because pageblock and max order pages may > @@ -6571,7 +6609,9 @@ static bool zone_spans_last_pfn(const struct zone *zone, > /** > * alloc_contig_pages() -- tries to find and allocate contiguous range of pages > * @nr_pages: Number of contiguous pages to allocate > - * @gfp_mask: GFP mask to limit search and used during compaction > + * @gfp_mask: GFP mask. Node/zone/placement hints limit the search; only some > + * action and reclaim modifiers are supported. Reclaim modifiers > + * control allocation behavior during compaction/migration/reclaim. > * @nid: Target node > * @nodemask: Mask for other possible nodes > *
On 03.12.24 14:55, Vlastimil Babka wrote: > On 12/3/24 10:47, David Hildenbrand wrote: >> It's all a bit complicated for alloc_contig_range(). For example, we don't >> support many flags, so let's start bailing out on unsupported >> ones -- ignoring the placement hints, as we are already given the range >> to allocate. >> >> While we currently set cc.gfp_mask, in __alloc_contig_migrate_range() we >> simply create yet another GFP mask whereby we ignore the reclaim flags >> specify by the caller. That looks very inconsistent. >> >> Let's clean it up, constructing the gfp flags used for >> compaction/migration exactly once. Update the documentation of the >> gfp_mask parameter for alloc_contig_range() and alloc_contig_pages(). >> >> Acked-by: Zi Yan <ziy@nvidia.com> >> Signed-off-by: David Hildenbrand <david@redhat.com> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz> > >> + /* >> + * Flags to control page compaction/migration/reclaim, to free up our >> + * page range. Migratable pages are movable, __GFP_MOVABLE is implied >> + * for them. >> + * >> + * Traditionally we always had __GFP_HARDWALL|__GFP_RETRY_MAYFAIL set, >> + * keep doing that to not degrade callers. >> + */ > > Wonder if we could revisit that eventually. Why limit migration targets by > cpuset via __GFP_HARDWALL if we were not called with __GFP_HARDWALL? And why > weaken the attempts with __GFP_RETRY_MAYFAIL if we didn't specify it? See below. > > Unless I'm missing something, cc->gfp is only checked for __GFP_FS and > __GFP_NOWARN in few places, so it's mostly migration_target_control the > callers could meaningfully influence. Note the fist change in the file, where we now use the mask instead of coming up with another one out of the blue. :) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ce7589a4ec01..54594cc4f650 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6294,7 +6294,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, int ret = 0; struct migration_target_control mtc = { .nid = zone_to_nid(cc->zone), - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + .gfp_mask = cc->gfp_mask, .reason = MR_CONTIG_RANGE, }; GFP_USER contains __GFP_HARDWALL. I am not sure if that matters here, but likely the thing we are assuming here is that we are migrating a page, and usually, these are user allocation (except maybe balloon and some other non-lru movable things). The __GFP_RETRY_MAYFAIL should be moved to relevant callers a some point, __GFP_HARDWALL, I really don't know ... Thanks!
On 12/3/24 15:12, David Hildenbrand wrote: > On 03.12.24 14:55, Vlastimil Babka wrote: >> On 12/3/24 10:47, David Hildenbrand wrote: >>> It's all a bit complicated for alloc_contig_range(). For example, we don't >>> support many flags, so let's start bailing out on unsupported >>> ones -- ignoring the placement hints, as we are already given the range >>> to allocate. >>> >>> While we currently set cc.gfp_mask, in __alloc_contig_migrate_range() we >>> simply create yet another GFP mask whereby we ignore the reclaim flags >>> specify by the caller. That looks very inconsistent. >>> >>> Let's clean it up, constructing the gfp flags used for >>> compaction/migration exactly once. Update the documentation of the >>> gfp_mask parameter for alloc_contig_range() and alloc_contig_pages(). >>> >>> Acked-by: Zi Yan <ziy@nvidia.com> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >> >> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> >> >>> + /* >>> + * Flags to control page compaction/migration/reclaim, to free up our >>> + * page range. Migratable pages are movable, __GFP_MOVABLE is implied >>> + * for them. >>> + * >>> + * Traditionally we always had __GFP_HARDWALL|__GFP_RETRY_MAYFAIL set, >>> + * keep doing that to not degrade callers. >>> + */ >> >> Wonder if we could revisit that eventually. Why limit migration targets by >> cpuset via __GFP_HARDWALL if we were not called with __GFP_HARDWALL? And why >> weaken the attempts with __GFP_RETRY_MAYFAIL if we didn't specify it? > > See below. > >> >> Unless I'm missing something, cc->gfp is only checked for __GFP_FS and >> __GFP_NOWARN in few places, so it's mostly migration_target_control the >> callers could meaningfully influence. > > Note the fist change in the file, where we now use the mask instead of coming up > with another one out of the blue. :) I know. What I wanted to say - cc->gfp is on its own only checked in few places, but now since we also translate it to migration_target_control's gfp_mask, it's mostly that part the caller might influence with the passed flags. But we still impose own additions to it, limiting that influence. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index ce7589a4ec01..54594cc4f650 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6294,7 +6294,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, > int ret = 0; > struct migration_target_control mtc = { > .nid = zone_to_nid(cc->zone), > - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, > + .gfp_mask = cc->gfp_mask, > .reason = MR_CONTIG_RANGE, > }; > > GFP_USER contains __GFP_HARDWALL. I am not sure if that matters here, but Yeah wonder if GFP_USER was used specifically for that part, or just randomly :) > likely the thing we are assuming here is that we are migrating a page, and > usually, these are user allocation (except maybe balloon and some other non-lru > movable things). Yeah and user allocations obey cpuset and mempolicies etc. But these are likely somebody elses allocations that were done according to their policies. With our migration we might be actually violating those, which probably can't be helped (is at least migration within the same node preferred? hmm). But it doesn't seem to me that our caller's restrictions (if those exist, would be enforced by __GFP_HARDWALL) are that relevant for somebody else's pages? > The __GFP_RETRY_MAYFAIL should be moved to relevant callers a some point, > __GFP_HARDWALL, I really don't know ... > > Thanks! >
On 3 Dec 2024, at 9:24, Vlastimil Babka wrote: > On 12/3/24 15:12, David Hildenbrand wrote: >> On 03.12.24 14:55, Vlastimil Babka wrote: >>> On 12/3/24 10:47, David Hildenbrand wrote: >>>> It's all a bit complicated for alloc_contig_range(). For example, we don't >>>> support many flags, so let's start bailing out on unsupported >>>> ones -- ignoring the placement hints, as we are already given the range >>>> to allocate. >>>> >>>> While we currently set cc.gfp_mask, in __alloc_contig_migrate_range() we >>>> simply create yet another GFP mask whereby we ignore the reclaim flags >>>> specify by the caller. That looks very inconsistent. >>>> >>>> Let's clean it up, constructing the gfp flags used for >>>> compaction/migration exactly once. Update the documentation of the >>>> gfp_mask parameter for alloc_contig_range() and alloc_contig_pages(). >>>> >>>> Acked-by: Zi Yan <ziy@nvidia.com> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> >>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> >>> >>>> + /* >>>> + * Flags to control page compaction/migration/reclaim, to free up our >>>> + * page range. Migratable pages are movable, __GFP_MOVABLE is implied >>>> + * for them. >>>> + * >>>> + * Traditionally we always had __GFP_HARDWALL|__GFP_RETRY_MAYFAIL set, >>>> + * keep doing that to not degrade callers. >>>> + */ >>> >>> Wonder if we could revisit that eventually. Why limit migration targets by >>> cpuset via __GFP_HARDWALL if we were not called with __GFP_HARDWALL? And why >>> weaken the attempts with __GFP_RETRY_MAYFAIL if we didn't specify it? >> >> See below. >> >>> >>> Unless I'm missing something, cc->gfp is only checked for __GFP_FS and >>> __GFP_NOWARN in few places, so it's mostly migration_target_control the >>> callers could meaningfully influence. >> >> Note the fist change in the file, where we now use the mask instead of coming up >> with another one out of the blue. :) > > I know. What I wanted to say - cc->gfp is on its own only checked in few > places, but now since we also translate it to migration_target_control's > gfp_mask, it's mostly that part the caller might influence with the passed > flags. But we still impose own additions to it, limiting that influence. > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index ce7589a4ec01..54594cc4f650 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -6294,7 +6294,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, >> int ret = 0; >> struct migration_target_control mtc = { >> .nid = zone_to_nid(cc->zone), >> - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, >> + .gfp_mask = cc->gfp_mask, >> .reason = MR_CONTIG_RANGE, >> }; >> >> GFP_USER contains __GFP_HARDWALL. I am not sure if that matters here, but > > Yeah wonder if GFP_USER was used specifically for that part, or just randomly :) > >> likely the thing we are assuming here is that we are migrating a page, and >> usually, these are user allocation (except maybe balloon and some other non-lru >> movable things). > > Yeah and user allocations obey cpuset and mempolicies etc. But these are > likely somebody elses allocations that were done according to their > policies. With our migration we might be actually violating those, which > probably can't be helped (is at least migration within the same node > preferred? hmm). But it doesn't seem to me that our caller's restrictions > (if those exist, would be enforced by __GFP_HARDWALL) are that relevant for > somebody else's pages? Yeah, I was wondering why current_gfp_context() is used to adjust gfp_mask, since current context might not be relevant. But I see it is used in the original code, so I did not ask. If current context is irrelevant w.r.t the to-be-migrated pages, should current_gfp_context() part be removed? Ideally, to respect the to-be-migrated page’s gfp, we might need to go through rmap to find its corresponding vma and possible task struct, but that seems overly complicated. Best Regards, Yan, Zi
On 03.12.24 16:49, Zi Yan wrote: > On 3 Dec 2024, at 9:24, Vlastimil Babka wrote: > >> On 12/3/24 15:12, David Hildenbrand wrote: >>> On 03.12.24 14:55, Vlastimil Babka wrote: >>>> On 12/3/24 10:47, David Hildenbrand wrote: >>>>> It's all a bit complicated for alloc_contig_range(). For example, we don't >>>>> support many flags, so let's start bailing out on unsupported >>>>> ones -- ignoring the placement hints, as we are already given the range >>>>> to allocate. >>>>> >>>>> While we currently set cc.gfp_mask, in __alloc_contig_migrate_range() we >>>>> simply create yet another GFP mask whereby we ignore the reclaim flags >>>>> specify by the caller. That looks very inconsistent. >>>>> >>>>> Let's clean it up, constructing the gfp flags used for >>>>> compaction/migration exactly once. Update the documentation of the >>>>> gfp_mask parameter for alloc_contig_range() and alloc_contig_pages(). >>>>> >>>>> Acked-by: Zi Yan <ziy@nvidia.com> >>>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>> >>>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> >>>> >>>>> + /* >>>>> + * Flags to control page compaction/migration/reclaim, to free up our >>>>> + * page range. Migratable pages are movable, __GFP_MOVABLE is implied >>>>> + * for them. >>>>> + * >>>>> + * Traditionally we always had __GFP_HARDWALL|__GFP_RETRY_MAYFAIL set, >>>>> + * keep doing that to not degrade callers. >>>>> + */ >>>> >>>> Wonder if we could revisit that eventually. Why limit migration targets by >>>> cpuset via __GFP_HARDWALL if we were not called with __GFP_HARDWALL? And why >>>> weaken the attempts with __GFP_RETRY_MAYFAIL if we didn't specify it? >>> >>> See below. >>> >>>> >>>> Unless I'm missing something, cc->gfp is only checked for __GFP_FS and >>>> __GFP_NOWARN in few places, so it's mostly migration_target_control the >>>> callers could meaningfully influence. >>> >>> Note the fist change in the file, where we now use the mask instead of coming up >>> with another one out of the blue. :) >> >> I know. What I wanted to say - cc->gfp is on its own only checked in few >> places, but now since we also translate it to migration_target_control's >> gfp_mask, it's mostly that part the caller might influence with the passed >> flags. But we still impose own additions to it, limiting that influence. >> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index ce7589a4ec01..54594cc4f650 100644 >>> --- a/mm/page_alloc.c >>> +++ b/mm/page_alloc.c >>> @@ -6294,7 +6294,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, >>> int ret = 0; >>> struct migration_target_control mtc = { >>> .nid = zone_to_nid(cc->zone), >>> - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, >>> + .gfp_mask = cc->gfp_mask, >>> .reason = MR_CONTIG_RANGE, >>> }; >>> >>> GFP_USER contains __GFP_HARDWALL. I am not sure if that matters here, but >> >> Yeah wonder if GFP_USER was used specifically for that part, or just randomly :) >> >>> likely the thing we are assuming here is that we are migrating a page, and >>> usually, these are user allocation (except maybe balloon and some other non-lru >>> movable things). >> >> Yeah and user allocations obey cpuset and mempolicies etc. But these are >> likely somebody elses allocations that were done according to their >> policies. With our migration we might be actually violating those, which >> probably can't be helped (is at least migration within the same node >> preferred? hmm). But it doesn't seem to me that our caller's restrictions >> (if those exist, would be enforced by __GFP_HARDWALL) are that relevant for >> somebody else's pages? > > Yeah, I was wondering why current_gfp_context() is used to adjust gfp_mask, > since current context might not be relevant. But I see it is used in > the original code, so I did not ask. If current context is irrelevant w.r.t > the to-be-migrated pages, should current_gfp_context() part be removed? Please see how current_gfp_context() is only concerned (excluding the __GFP_MOVABLE thing we unconditionally set ...) about reclaim flags. This part make absolute sense to respect here. So that is something different than __GFP_HARDWALL that *we so far unconditionally set* and is not a "reclaim" flag.
On 03.12.24 15:24, Vlastimil Babka wrote: > On 12/3/24 15:12, David Hildenbrand wrote: >> On 03.12.24 14:55, Vlastimil Babka wrote: >>> On 12/3/24 10:47, David Hildenbrand wrote: >>>> It's all a bit complicated for alloc_contig_range(). For example, we don't >>>> support many flags, so let's start bailing out on unsupported >>>> ones -- ignoring the placement hints, as we are already given the range >>>> to allocate. >>>> >>>> While we currently set cc.gfp_mask, in __alloc_contig_migrate_range() we >>>> simply create yet another GFP mask whereby we ignore the reclaim flags >>>> specify by the caller. That looks very inconsistent. >>>> >>>> Let's clean it up, constructing the gfp flags used for >>>> compaction/migration exactly once. Update the documentation of the >>>> gfp_mask parameter for alloc_contig_range() and alloc_contig_pages(). >>>> >>>> Acked-by: Zi Yan <ziy@nvidia.com> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> >>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> >>> >>>> + /* >>>> + * Flags to control page compaction/migration/reclaim, to free up our >>>> + * page range. Migratable pages are movable, __GFP_MOVABLE is implied >>>> + * for them. >>>> + * >>>> + * Traditionally we always had __GFP_HARDWALL|__GFP_RETRY_MAYFAIL set, >>>> + * keep doing that to not degrade callers. >>>> + */ >>> >>> Wonder if we could revisit that eventually. Why limit migration targets by >>> cpuset via __GFP_HARDWALL if we were not called with __GFP_HARDWALL? And why >>> weaken the attempts with __GFP_RETRY_MAYFAIL if we didn't specify it? >> >> See below. >> >>> >>> Unless I'm missing something, cc->gfp is only checked for __GFP_FS and >>> __GFP_NOWARN in few places, so it's mostly migration_target_control the >>> callers could meaningfully influence. >> >> Note the fist change in the file, where we now use the mask instead of coming up >> with another one out of the blue. :) > > I know. What I wanted to say - cc->gfp is on its own only checked in few > places, but now since we also translate it to migration_target_control's > gfp_mask, it's mostly that part the caller might influence with the passed > flags. But we still impose own additions to it, limiting that influence. > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index ce7589a4ec01..54594cc4f650 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -6294,7 +6294,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, >> int ret = 0; >> struct migration_target_control mtc = { >> .nid = zone_to_nid(cc->zone), >> - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, >> + .gfp_mask = cc->gfp_mask, >> .reason = MR_CONTIG_RANGE, >> }; >> >> GFP_USER contains __GFP_HARDWALL. I am not sure if that matters here, but > > Yeah wonder if GFP_USER was used specifically for that part, or just randomly :) > >> likely the thing we are assuming here is that we are migrating a page, and >> usually, these are user allocation (except maybe balloon and some other non-lru >> movable things). > > Yeah and user allocations obey cpuset and mempolicies etc. But these are > likely somebody elses allocations that were done according to their > policies. With our migration we might be actually violating those, which > probably can't be helped (is at least migration within the same node > preferred? hmm). I would hope that we handle memory policies somehow (via VMAs? not sure). cpuset? I have no idea. But it doesn't seem to me that our caller's restrictions > (if those exist, would be enforced by __GFP_HARDWALL) are that relevant for > somebody else's pages? It was always set using "GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL", and I removed the same flag combination in #2 from memory offline code, and we do have the exact same thing in do_migrate_range() in mm/memory_hotplug.c. We should investigate if__GFP_HARDWALL is the right thing to use here, and if we can get rid of that by switching to GFP_KERNEL in all these places. I can look into it + send a follow-up patch. Thanks!
On 12/3/24 20:19, David Hildenbrand wrote: > On 03.12.24 15:24, Vlastimil Babka wrote: >> On 12/3/24 15:12, David Hildenbrand wrote: >>> On 03.12.24 14:55, Vlastimil Babka wrote: >>> likely the thing we are assuming here is that we are migrating a page, and >>> usually, these are user allocation (except maybe balloon and some other non-lru >>> movable things). >> >> Yeah and user allocations obey cpuset and mempolicies etc. But these are >> likely somebody elses allocations that were done according to their >> policies. With our migration we might be actually violating those, which >> probably can't be helped (is at least migration within the same node >> preferred? hmm). > > I would hope that we handle memory policies somehow (via VMAs? not > sure). cpuset? I have no idea. They are handled when allocating, but then the info is lost, the allocation doesn't carry its effective nodemask. But that's really a separate issue that just occured to me. > But it doesn't seem to me that our caller's restrictions >> (if those exist, would be enforced by __GFP_HARDWALL) are that relevant for >> somebody else's pages? > > It was always set using "GFP_USER | __GFP_MOVABLE | > __GFP_RETRY_MAYFAIL", and I removed the same flag combination in #2 from > memory offline code, and we do have the exact same thing in > do_migrate_range() in mm/memory_hotplug.c. Yeah I agree a refactoring patch shouldn't change the existing behavior... > We should investigate if__GFP_HARDWALL is the right thing to use here, > and if we can get rid of that by switching to GFP_KERNEL in all these > places. > > I can look into it + send a follow-up patch. ...but it's a great opportunity to start questioning it and possibly change it as a follow up :) Thanks! > Thanks! >
On Tue, Dec 03, 2024 at 08:19:02PM +0100, David Hildenbrand wrote: > It was always set using "GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL", > and I removed the same flag combination in #2 from memory offline code, and > we do have the exact same thing in do_migrate_range() in > mm/memory_hotplug.c. > > We should investigate if__GFP_HARDWALL is the right thing to use here, and > if we can get rid of that by switching to GFP_KERNEL in all these places. Why would not we want __GFP_HARDWALL set? Without it, we could potentially migrate the page to a node which is not part of the cpuset of the task that originally allocated it, thus violating the policy? Is not that a problem?
On Tue, Dec 03, 2024 at 10:47:30AM +0100, David Hildenbrand wrote: > It's all a bit complicated for alloc_contig_range(). For example, we don't > support many flags, so let's start bailing out on unsupported > ones -- ignoring the placement hints, as we are already given the range > to allocate. > > While we currently set cc.gfp_mask, in __alloc_contig_migrate_range() we > simply create yet another GFP mask whereby we ignore the reclaim flags > specify by the caller. That looks very inconsistent. > > Let's clean it up, constructing the gfp flags used for > compaction/migration exactly once. Update the documentation of the > gfp_mask parameter for alloc_contig_range() and alloc_contig_pages(). > > Acked-by: Zi Yan <ziy@nvidia.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de>
On 12/4/24 09:59, Oscar Salvador wrote: > On Tue, Dec 03, 2024 at 08:19:02PM +0100, David Hildenbrand wrote: >> It was always set using "GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL", >> and I removed the same flag combination in #2 from memory offline code, and >> we do have the exact same thing in do_migrate_range() in >> mm/memory_hotplug.c. >> >> We should investigate if__GFP_HARDWALL is the right thing to use here, and >> if we can get rid of that by switching to GFP_KERNEL in all these places. > > Why would not we want __GFP_HARDWALL set? > Without it, we could potentially migrate the page to a node which is not > part of the cpuset of the task that originally allocated it, thus violating the > policy? Is not that a problem? The task doing the alloc_contig_range() will likely not be the same task as the one that originally allocated the page, so its policy would be different, no? So even with __GFP_HARDWALL we might be already migrating outside the original tasks's constraint? Am I missing something?
On Wed, Dec 04, 2024 at 10:03:28AM +0100, Vlastimil Babka wrote: > On 12/4/24 09:59, Oscar Salvador wrote: > > On Tue, Dec 03, 2024 at 08:19:02PM +0100, David Hildenbrand wrote: > >> It was always set using "GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL", > >> and I removed the same flag combination in #2 from memory offline code, and > >> we do have the exact same thing in do_migrate_range() in > >> mm/memory_hotplug.c. > >> > >> We should investigate if__GFP_HARDWALL is the right thing to use here, and > >> if we can get rid of that by switching to GFP_KERNEL in all these places. > > > > Why would not we want __GFP_HARDWALL set? > > Without it, we could potentially migrate the page to a node which is not > > part of the cpuset of the task that originally allocated it, thus violating the > > policy? Is not that a problem? > > The task doing the alloc_contig_range() will likely not be the same task as > the one that originally allocated the page, so its policy would be > different, no? So even with __GFP_HARDWALL we might be already migrating > outside the original tasks's constraint? Am I missing something? Yes, that is right, I thought we derive the policy from the old page somehow when migrating it, but reading the code does not seem to be the case. Looking at prepare_alloc_pages(), if !ac->nodemask, which would be the case here, we would get the policy from the current task (alloc_contig_range()) when cpusets are enabled. So yes, I am a bit puzzled why __GFP_HARDWALL was chosen in the first place.
On 04.12.24 10:15, Oscar Salvador wrote: > On Wed, Dec 04, 2024 at 10:03:28AM +0100, Vlastimil Babka wrote: >> On 12/4/24 09:59, Oscar Salvador wrote: >>> On Tue, Dec 03, 2024 at 08:19:02PM +0100, David Hildenbrand wrote: >>>> It was always set using "GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL", >>>> and I removed the same flag combination in #2 from memory offline code, and >>>> we do have the exact same thing in do_migrate_range() in >>>> mm/memory_hotplug.c. >>>> >>>> We should investigate if__GFP_HARDWALL is the right thing to use here, and >>>> if we can get rid of that by switching to GFP_KERNEL in all these places. >>> >>> Why would not we want __GFP_HARDWALL set? >>> Without it, we could potentially migrate the page to a node which is not >>> part of the cpuset of the task that originally allocated it, thus violating the >>> policy? Is not that a problem? >> >> The task doing the alloc_contig_range() will likely not be the same task as >> the one that originally allocated the page, so its policy would be >> different, no? So even with __GFP_HARDWALL we might be already migrating >> outside the original tasks's constraint? Am I missing something? > > Yes, that is right, I thought we derive the policy from the old page > somehow when migrating it, but reading the code does not seem to be the > case. > > Looking at prepare_alloc_pages(), if !ac->nodemask, which would be the > case here, we would get the policy from the current task > (alloc_contig_range()) when cpusets are enabled. > > So yes, I am a bit puzzled why __GFP_HARDWALL was chosen in the first > place. I suspect because "GFP_USER" felt like the appropriate thing to do. Before: commit f90b1d2f1aaaa40c6519a32e69615edc25bb97d5 Author: Paul Jackson <pj@sgi.com> Date: Tue Sep 6 15:18:10 2005 -0700 [PATCH] cpusets: new __GFP_HARDWALL flag Add another GFP flag: __GFP_HARDWALL. GFP_USER and GFP_KERNEL were the same thing. But memory offlining/alloc_contig were added later.
On Wed, Dec 04, 2024 at 10:28:39AM +0100, David Hildenbrand wrote: > On 04.12.24 10:15, Oscar Salvador wrote: > > On Wed, Dec 04, 2024 at 10:03:28AM +0100, Vlastimil Babka wrote: > > > On 12/4/24 09:59, Oscar Salvador wrote: > > > > On Tue, Dec 03, 2024 at 08:19:02PM +0100, David Hildenbrand wrote: > > > > > It was always set using "GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL", > > > > > and I removed the same flag combination in #2 from memory offline code, and > > > > > we do have the exact same thing in do_migrate_range() in > > > > > mm/memory_hotplug.c. > > > > > > > > > > We should investigate if__GFP_HARDWALL is the right thing to use here, and > > > > > if we can get rid of that by switching to GFP_KERNEL in all these places. > > > > > > > > Why would not we want __GFP_HARDWALL set? > > > > Without it, we could potentially migrate the page to a node which is not > > > > part of the cpuset of the task that originally allocated it, thus violating the > > > > policy? Is not that a problem? > > > > > > The task doing the alloc_contig_range() will likely not be the same task as > > > the one that originally allocated the page, so its policy would be > > > different, no? So even with __GFP_HARDWALL we might be already migrating > > > outside the original tasks's constraint? Am I missing something? > > > > Yes, that is right, I thought we derive the policy from the old page > > somehow when migrating it, but reading the code does not seem to be the > > case. > > > > Looking at prepare_alloc_pages(), if !ac->nodemask, which would be the > > case here, we would get the policy from the current task > > (alloc_contig_range()) when cpusets are enabled. > > > > So yes, I am a bit puzzled why __GFP_HARDWALL was chosen in the first > > place. > > I suspect because "GFP_USER" felt like the appropriate thing to do. Looking back at when the whole contiguous allocator patchset was posted, it seems that it kinda copied what memory-offline code was doing, which was migrating pages with GFP_HIGHUSER_MOVABLE (hotremove_migrate_alloc()). Then, the HIGHMEM modifier was dropped due to HIGHMEM restrictions on some systems, ending up with GFP_USER.
On 04.12.24 11:04, Oscar Salvador wrote: > On Wed, Dec 04, 2024 at 10:28:39AM +0100, David Hildenbrand wrote: >> On 04.12.24 10:15, Oscar Salvador wrote: >>> On Wed, Dec 04, 2024 at 10:03:28AM +0100, Vlastimil Babka wrote: >>>> On 12/4/24 09:59, Oscar Salvador wrote: >>>>> On Tue, Dec 03, 2024 at 08:19:02PM +0100, David Hildenbrand wrote: >>>>>> It was always set using "GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL", >>>>>> and I removed the same flag combination in #2 from memory offline code, and >>>>>> we do have the exact same thing in do_migrate_range() in >>>>>> mm/memory_hotplug.c. >>>>>> >>>>>> We should investigate if__GFP_HARDWALL is the right thing to use here, and >>>>>> if we can get rid of that by switching to GFP_KERNEL in all these places. >>>>> >>>>> Why would not we want __GFP_HARDWALL set? >>>>> Without it, we could potentially migrate the page to a node which is not >>>>> part of the cpuset of the task that originally allocated it, thus violating the >>>>> policy? Is not that a problem? >>>> >>>> The task doing the alloc_contig_range() will likely not be the same task as >>>> the one that originally allocated the page, so its policy would be >>>> different, no? So even with __GFP_HARDWALL we might be already migrating >>>> outside the original tasks's constraint? Am I missing something? >>> >>> Yes, that is right, I thought we derive the policy from the old page >>> somehow when migrating it, but reading the code does not seem to be the >>> case. >>> >>> Looking at prepare_alloc_pages(), if !ac->nodemask, which would be the >>> case here, we would get the policy from the current task >>> (alloc_contig_range()) when cpusets are enabled. >>> >>> So yes, I am a bit puzzled why __GFP_HARDWALL was chosen in the first >>> place. >> >> I suspect because "GFP_USER" felt like the appropriate thing to do. > > Looking back at when the whole contiguous allocator patchset was posted, > it seems that it kinda copied what memory-offline code was doing, which > was migrating pages with GFP_HIGHUSER_MOVABLE (hotremove_migrate_alloc()). > > Then, the HIGHMEM modifier was dropped due to HIGHMEM restrictions on > some systems, ending up with GFP_USER. Looking at some other migration_target_control users, some of them also shouldn't be setting GFP_USER->HARDWALL either I think. Essentially, whenever we are migrating a page that is not guaranteed to be "ours" in the context of the caller. mm/damon/paddr.c:__damon_pa_migrate_folio_list() for example, which obtained the addresses by scanning a chunk of physical address space. For others it's less clear: soft_offline_in_use_page() may be called either using madvise() from process context, but also using sysfs using a PFN.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ce7589a4ec01..54594cc4f650 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6294,7 +6294,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, int ret = 0; struct migration_target_control mtc = { .nid = zone_to_nid(cc->zone), - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + .gfp_mask = cc->gfp_mask, .reason = MR_CONTIG_RANGE, }; struct page *page; @@ -6390,6 +6390,39 @@ static void split_free_pages(struct list_head *list) } } +static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) +{ + const gfp_t reclaim_mask = __GFP_IO | __GFP_FS | __GFP_RECLAIM; + const gfp_t action_mask = __GFP_COMP | __GFP_RETRY_MAYFAIL | __GFP_NOWARN; + const gfp_t cc_action_mask = __GFP_RETRY_MAYFAIL | __GFP_NOWARN; + + /* + * We are given the range to allocate; node, mobility and placement + * hints are irrelevant at this point. We'll simply ignore them. + */ + gfp_mask &= ~(GFP_ZONEMASK | __GFP_RECLAIMABLE | __GFP_WRITE | + __GFP_HARDWALL | __GFP_THISNODE | __GFP_MOVABLE); + + /* + * We only support most reclaim flags (but not NOFAIL/NORETRY), and + * selected action flags. + */ + if (gfp_mask & ~(reclaim_mask | action_mask)) + return -EINVAL; + + /* + * Flags to control page compaction/migration/reclaim, to free up our + * page range. Migratable pages are movable, __GFP_MOVABLE is implied + * for them. + * + * Traditionally we always had __GFP_HARDWALL|__GFP_RETRY_MAYFAIL set, + * keep doing that to not degrade callers. + */ + *gfp_cc_mask = (gfp_mask & (reclaim_mask | cc_action_mask)) | + __GFP_HARDWALL | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL; + return 0; +} + /** * alloc_contig_range() -- tries to allocate given range of pages * @start: start PFN to allocate @@ -6398,7 +6431,9 @@ static void split_free_pages(struct list_head *list) * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks * in range must have the same migratetype and it must * be either of the two. - * @gfp_mask: GFP mask to use during compaction + * @gfp_mask: GFP mask. Node/zone/placement hints are ignored; only some + * action and reclaim modifiers are supported. Reclaim modifiers + * control allocation behavior during compaction/migration/reclaim. * * The PFN range does not have to be pageblock aligned. The PFN range must * belong to a single zone. @@ -6424,11 +6459,14 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, .mode = MIGRATE_SYNC, .ignore_skip_hint = true, .no_set_skip_hint = true, - .gfp_mask = current_gfp_context(gfp_mask), .alloc_contig = true, }; INIT_LIST_HEAD(&cc.migratepages); + gfp_mask = current_gfp_context(gfp_mask); + if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) + return -EINVAL; + /* * What we do here is we mark all pageblocks in range as * MIGRATE_ISOLATE. Because pageblock and max order pages may @@ -6571,7 +6609,9 @@ static bool zone_spans_last_pfn(const struct zone *zone, /** * alloc_contig_pages() -- tries to find and allocate contiguous range of pages * @nr_pages: Number of contiguous pages to allocate - * @gfp_mask: GFP mask to limit search and used during compaction + * @gfp_mask: GFP mask. Node/zone/placement hints limit the search; only some + * action and reclaim modifiers are supported. Reclaim modifiers + * control allocation behavior during compaction/migration/reclaim. * @nid: Target node * @nodemask: Mask for other possible nodes *