Message ID: 20240521-mm-hotplug-sync-v1-1-6d53706c1ba8@google.com
State: New
Series: Clean up hotplug zone data synchronization
On Wed, May 22, 2024 at 04:09:41PM +0200, David Hildenbrand wrote:
> On 21.05.24 14:57, Brendan Jackman wrote:
> The old seqlock guaranteed that we would have obtained consistent values
> here. start + spanned_pages defines a range. For example, growing a zone to
> the beginning implies that both values must be changed.
>
> I do wonder if it might be better to instead have zone->zone_start_pfn and
> zone->zone_end_pfn. That way, both can be changed individually, not
> requiring adjustment of both to grow/shrink a zone at the beginning.

Thanks, this is a good point.

So basically the fact that spanned_pages is "once or eventually"
correct is certainly not enough, because it only has meaning with
reference to zone_start_pfn. I didn't realise this because of my
spontaneous inspiration to believe that zone_start_pfn was fixed.

By the way, some noob questions: am I OK with my assumption that it's
fine for reader code to operate on zone spans that are both stale and
"from the future"? Thinking abstractly, I guess that seeing a stale
value when racing with offline_pages is roughly the same as seeing a
value "from the future" when racing with online_pages?

Also, is it ever possible for pages to get removed and then added back
and end up in a different zone than before?
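For reference, a minimal sketch of the two layouts being compared here.
zone_start_pfn and spanned_pages are the real struct zone fields; a stored
zone_end_pfn is the hypothetical alternative David suggests:

/*
 * Sketch only; the real fields live in struct zone (include/linux/mmzone.h).
 *
 * Current layout: the span is (start, length). Growing the zone downwards
 * must update both fields together, so a lockless reader can see a torn
 * (new start, old length) pair -- a range that neither the old nor the new
 * zone state ever spanned.
 */
struct span_as_length {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;	/* end = start + spanned_pages */
};

/*
 * Suggested alternative: the span is (start, end). Growing or shrinking at
 * either boundary writes exactly one field, so any mix of old/new values a
 * reader sees is still a plausible intermediate range.
 */
struct span_as_bounds {
	unsigned long zone_start_pfn;
	unsigned long zone_end_pfn;	/* hypothetical stored field */
};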
On 22.05.24 16:27, Brendan Jackman wrote:
> On Wed, May 22, 2024 at 04:09:41PM +0200, David Hildenbrand wrote:
>> On 21.05.24 14:57, Brendan Jackman wrote:
>> The old seqlock guaranteed that we would have obtained consistent values
>> here. start + spanned_pages defines a range. For example, growing a zone to
>> the beginning implies that both values must be changed.
>>
>> I do wonder if it might be better to instead have zone->zone_start_pfn and
>> zone->zone_end_pfn. That way, both can be changed individually, not
>> requiring adjustment of both to grow/shrink a zone at the beginning.
>
> Thanks, this is a good point.
>
> So basically the fact that spanned_pages is "once or eventually"
> correct is certainly not enough, because it only has meaning with
> reference to zone_start_pfn. I didn't realise this because of my
> spontaneous inspiration to believe that zone_start_pfn was fixed.

Right, it isn't.

>
> By the way, some noob questions: am I OK with my assumption that it's
> fine for reader code to operate on zone spans that are both stale and
> "from the future"? Thinking abstractly, I guess that seeing a stale
> value when racing with offline_pages is roughly the same as seeing a
> value "from the future" when racing with online_pages?

Right. PFN walkers should be using pfn_to_online_page(), where races are
possible but barely seen in practice.

Zone handlers like mm/compaction.c can likely deal with races, although it
might all be cleaner (and safer?) when using start+end. I recall it also
relies on pfn_to_online_page().

Regarding page_outside_zone_boundaries(), it should be fine if we can read
start+end atomically; that way we would not accidentally report "page
outside ..." when changing the start address. I think with your current
patch that might happen (although likely extremely hard to trigger) when
growing the zone at the start, reducing zone_start_pfn.

>
> Also, is it ever possible for pages to get removed and then added back
> and end up in a different zone than before?

Yes. Changing between MOVABLE and NORMAL is possible and can easily be
triggered by offlining+re-onlining memory blocks.
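The pfn_to_online_page() pattern referred to above looks roughly like this;
a sketch with a hypothetical function name, not code quoted from any
particular walker:

/*
 * Sketch: each PFN in the range is validated through pfn_to_online_page()
 * rather than trusting the zone span alone.
 */
static void walk_pfn_range_sketch(unsigned long start_pfn,
				  unsigned long end_pfn)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		if (!page)
			continue;	/* hole, or section currently offline */

		/* ... operate on page; its zone/node come from page->flags ... */
	}
}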
On Wed, May 22, 2024 at 05:24:17PM +0200, David Hildenbrand wrote:
> On 22.05.24 16:27, Brendan Jackman wrote:
> > On Wed, May 22, 2024 at 04:09:41PM +0200, David Hildenbrand wrote:
> > By the way, some noob questions: am I OK with my assumption that it's
> > fine for reader code to operate on zone spans that are both stale and
> > "from the future"? Thinking abstractly, I guess that seeing a stale
> > value when racing with offline_pages is roughly the same as seeing a
> > value "from the future" when racing with online_pages?
>
> Right. PFN walkers should be using pfn_to_online_page(), where races are
> possible but barely seen in practice.
>
> Zone handlers like mm/compaction.c can likely deal with races, although it
> might all be cleaner (and safer?) when using start+end. I recall it also
> relies on pfn_to_online_page().
>
> Regarding page_outside_zone_boundaries(), it should be fine if we can read
> start+end atomically; that way we would not accidentally report "page
> outside ..." when changing the start address. I think with your current
> patch that might happen (although likely extremely hard to trigger) when
> growing the zone at the start, reducing zone_start_pfn.

Thanks a lot, this is very helpful.

> > Also, is it ever possible for pages to get removed and then added back
> > and end up in a different zone than before?
>
> Yes. Changing between MOVABLE and NORMAL is possible and can easily be
> triggered by offlining+re-onlining memory blocks.

So, even if we make it impossible to see a totally bogus zone span,
you can observe a stale/futuristic span which currently contains pages
from a different zone?

That seems to imply you could look up a page from a PFN within
zone A's apparent span, lock zone A and assume you can safely modify
the freelist the page is on, but actually that page is now in zone B.

So for example:

1. compact_zone() sets cc->free_pfn based on zone_end_pfn
2. isolate_freepages() sets isolate_start_pfn = cc->free_pfn
3. isolate_freepages_block() looks up a page based on that PFN
4. ... then takes the cc->zone lock
5. ... then calls __isolate_free_page(), which removes the page from
   whatever freelist it's on.

Is anything stopping step 5 from modifying a zone that wasn't locked
in step 4?
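For concreteness, one illustration of the kind of re-check that could close
such a window. zone_still_matches() is hypothetical and not claimed to exist
in the compaction code; it only assumes the existing page_zone() helper and
the cc->zone field of struct compact_control (mm/internal.h):

/*
 * Hypothetical helper: detect, under the zone lock, a page that changed
 * zones between the PFN lookup and the locking. The page's authoritative
 * zone comes from its flags via page_zone().
 */
static bool zone_still_matches(struct compact_control *cc, struct page *page)
{
	lockdep_assert_held(&cc->zone->lock);
	return page_zone(page) == cc->zone;
}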
On 24.05.24 at 14:02, Brendan Jackman wrote:
> On Wed, May 22, 2024 at 05:24:17PM +0200, David Hildenbrand wrote:
>> On 22.05.24 16:27, Brendan Jackman wrote:
>>> On Wed, May 22, 2024 at 04:09:41PM +0200, David Hildenbrand wrote:
>
>>> By the way, some noob questions: am I OK with my assumption that it's
>>> fine for reader code to operate on zone spans that are both stale and
>>> "from the future"? Thinking abstractly, I guess that seeing a stale
>>> value when racing with offline_pages is roughly the same as seeing a
>>> value "from the future" when racing with online_pages?
>>
>> Right. PFN walkers should be using pfn_to_online_page(), where races are
>> possible but barely seen in practice.
>>
>> Zone handlers like mm/compaction.c can likely deal with races, although it
>> might all be cleaner (and safer?) when using start+end. I recall it also
>> relies on pfn_to_online_page().
>>
>> Regarding page_outside_zone_boundaries(), it should be fine if we can read
>> start+end atomically; that way we would not accidentally report "page
>> outside ..." when changing the start address. I think with your current
>> patch that might happen (although likely extremely hard to trigger) when
>> growing the zone at the start, reducing zone_start_pfn.
>
> Thanks a lot, this is very helpful.
>
>>> Also, is it ever possible for pages to get removed and then added back
>>> and end up in a different zone than before?
>>
>> Yes. Changing between MOVABLE and NORMAL is possible and can easily be
>> triggered by offlining+re-onlining memory blocks.
>
> So, even if we make it impossible to see a totally bogus zone span,
> you can observe a stale/futuristic span which currently contains pages
> from a different zone?

Yes. Note that zones/nodes can easily overlap, so a page being spanned by
another zone is common and supported already.

>
> That seems to imply you could look up a page from a PFN within
> zone A's apparent span, lock zone A and assume you can safely modify
> the freelist the page is on, but actually that page is now in zone B.

That's why we always obtain the zone/node from the page itself (stored in
the page flags). This data can only change when offlining+re-onlining
memory (and pfn_to_online_page() would refuse to hand out the page while
it is temporarily offline).

There were discussions around using RCU to improve pfn_to_online_page()
racing with memory offlining, but the motivation to do that has been
rather small: we barely see such races in practice. Memory
offlining+re-onlining simply takes too long.

>
> So for example:
>
> 1. compact_zone() sets cc->free_pfn based on zone_end_pfn
> 2. isolate_freepages() sets isolate_start_pfn = cc->free_pfn
> 3. isolate_freepages_block() looks up a page based on that PFN
> 4. ... then takes the cc->zone lock
> 5. ... then calls __isolate_free_page(), which removes the page from
>    whatever freelist it's on.
>
> Is anything stopping step 5 from modifying a zone that wasn't locked
> in step 4?

Likely it's that overlapping zones already exist and are handled
accordingly.
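"From the page itself" here means the zone and node id bits stored in
page->flags. A simplified sketch of the lookup, assuming only the existing
page_zonenum(), page_to_nid() and NODE_DATA() helpers (the real code in
include/linux/mm.h also handles configurations where the node id is not
kept in page->flags):

static inline struct zone *page_zone_sketch(const struct page *page)
{
	enum zone_type zid = page_zonenum(page);	/* zone bits in flags */
	int nid = page_to_nid(page);			/* node bits in flags */

	return &NODE_DATA(nid)->node_zones[zid];
}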
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7a9ff464608d..f9577e67e5ee 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -141,31 +141,7 @@ bool mhp_supports_memmap_on_memory(void);
 /*
  * Zone resizing functions
- *
- * Note: any attempt to resize a zone should has pgdat_resize_lock()
- * zone_span_writelock() both held. This ensure the size of a zone
- * can't be changed while pgdat_resize_lock() held.
  */
-static inline unsigned zone_span_seqbegin(struct zone *zone)
-{
-	return read_seqbegin(&zone->span_seqlock);
-}
-static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
-{
-	return read_seqretry(&zone->span_seqlock, iv);
-}
-static inline void zone_span_writelock(struct zone *zone)
-{
-	write_seqlock(&zone->span_seqlock);
-}
-static inline void zone_span_writeunlock(struct zone *zone)
-{
-	write_sequnlock(&zone->span_seqlock);
-}
-static inline void zone_seqlock_init(struct zone *zone)
-{
-	seqlock_init(&zone->span_seqlock);
-}
 extern void adjust_present_page_count(struct page *page,
 				      struct memory_group *group,
 				      long nr_pages);
@@ -251,17 +227,6 @@ static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat)
 	___page; \
 })
 
-static inline unsigned zone_span_seqbegin(struct zone *zone)
-{
-	return 0;
-}
-static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
-{
-	return 0;
-}
-static inline void zone_span_writelock(struct zone *zone) {}
-static inline void zone_span_writeunlock(struct zone *zone) {}
-static inline void zone_seqlock_init(struct zone *zone) {}
 
 static inline int try_online_node(int nid)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..194ef7fed9d6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -14,7 +14,6 @@
 #include <linux/threads.h>
 #include <linux/numa.h>
 #include <linux/init.h>
-#include <linux/seqlock.h>
 #include <linux/nodemask.h>
 #include <linux/pageblock-flags.h>
 #include <linux/page-flags-layout.h>
@@ -896,18 +895,11 @@ struct zone {
 	 *
 	 * Locking rules:
 	 *
-	 * zone_start_pfn and spanned_pages are protected by span_seqlock.
-	 * It is a seqlock because it has to be read outside of zone->lock,
-	 * and it is done in the main allocator path. But, it is written
-	 * quite infrequently.
-	 *
-	 * The span_seq lock is declared along with zone->lock because it is
-	 * frequently read in proximity to zone->lock. It's good to
-	 * give them a chance of being in the same cacheline.
-	 *
-	 * Write access to present_pages at runtime should be protected by
-	 * mem_hotplug_begin/done(). Any reader who can't tolerant drift of
-	 * present_pages should use get_online_mems() to get a stable value.
+	 * Besides system initialization functions, memory-hotplug is the only
+	 * user that can change zone's {spanned,present} pages at runtime, and
+	 * it does so by holding the mem_hotplug_lock lock. Any readers who
+	 * can't tolerate drift values should use {get,put}_online_mems to get
+	 * a stable value.
 	 */
 	atomic_long_t		managed_pages;
 	unsigned long		spanned_pages;
@@ -930,11 +922,6 @@ struct zone {
 	unsigned long		nr_isolate_pageblock;
 #endif
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-	/* see spanned/present_pages for more description */
-	seqlock_t		span_seqlock;
-#endif
-
 	int initialized;
 
 	/* Write-intensive fields used from the page allocator */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f72b852bd5b8..c725618aeb58 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1383,7 +1383,6 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
 	zone->name = zone_names[idx];
 	zone->zone_pgdat = NODE_DATA(nid);
 	spin_lock_init(&zone->lock);
-	zone_seqlock_init(zone);
 	zone_pcp_init(zone);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..5116a2b9ea6e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,16 +426,12 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
 	int ret;
-	unsigned seq;
 	unsigned long pfn = page_to_pfn(page);
 	unsigned long sp, start_pfn;
 
-	do {
-		seq = zone_span_seqbegin(zone);
-		start_pfn = zone->zone_start_pfn;
-		sp = zone->spanned_pages;
-		ret = !zone_spans_pfn(zone, pfn);
-	} while (zone_span_seqretry(zone, seq));
+	start_pfn = zone->zone_start_pfn;
+	sp = READ_ONCE(zone->spanned_pages);
+	ret = !zone_spans_pfn(zone, pfn);
 
 	if (ret)
 		pr_err("page 0x%lx outside node %d zone %s [ 0x%lx - 0x%lx ]\n",
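As a usage note, a minimal sketch of the reader discipline the new struct
zone comment describes. walk_span_stable() is a hypothetical function,
assuming only the existing get_online_mems()/put_online_mems() and
zone_end_pfn() helpers:

static void walk_span_stable(struct zone *zone)
{
	unsigned long pfn, start_pfn, end_pfn;

	get_online_mems();	/* hotplug cannot resize zones while held */
	start_pfn = zone->zone_start_pfn;
	end_pfn = zone_end_pfn(zone);	/* start + spanned_pages */
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		/* ... the span stays stable until put_online_mems() ... */
	}
	put_online_mems();
}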