Message ID: 20160126000639.358.89668.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State: Superseded
On Mon, Jan 25, 2016 at 04:06:40PM -0800, Dan Williams wrote:
> It appears devices requiring ZONE_DMA are still prevalent (see link
> below). For this reason the proposal to require turning off ZONE_DMA to
> enable ZONE_DEVICE is untenable in the short term. We want a single
> kernel image to be able to support legacy devices as well as next
> generation persistent memory platforms.
>
> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
> of ZONE_DMA at init (->init_spanned_pages) and use that information in
> is_zone_device_page() to differentiate pages allocated via
> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>
> Note that this also teaches the memory hot remove path that the zone may
> not have sections for all pfn spans (->zone_dyn_start_pfn).
>
> A user visible implication of this change is potentially an unexpectedly
> high "spanned" value in /proc/zoneinfo for the DMA zone.
>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Jerome Glisse <j.glisse@gmail.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
> Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>

It should actually be Reported-by: Mark <markk@clara.co.uk>

Hi Mark,
Can you please test this patch available at
https://patchwork.kernel.org/patch/8116991/
in your setup?

regards
sudip
On Mon, Jan 25, 2016 at 10:00 PM, Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
> On Mon, Jan 25, 2016 at 04:06:40PM -0800, Dan Williams wrote:
>> It appears devices requiring ZONE_DMA are still prevalent (see link
>> below). For this reason the proposal to require turning off ZONE_DMA to
>> enable ZONE_DEVICE is untenable in the short term. We want a single
>> kernel image to be able to support legacy devices as well as next
>> generation persistent memory platforms.
>>
>> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
>> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
>> of ZONE_DMA at init (->init_spanned_pages) and use that information in
>> is_zone_device_page() to differentiate pages allocated via
>> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
>> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>>
>> Note that this also teaches the memory hot remove path that the zone may
>> not have sections for all pfn spans (->zone_dyn_start_pfn).
>>
>> A user visible implication of this change is potentially an unexpectedly
>> high "spanned" value in /proc/zoneinfo for the DMA zone.
>>
>> Cc: H. Peter Anvin <hpa@zytor.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Jerome Glisse <j.glisse@gmail.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
>> Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
>> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
>
> It should actually be Reported-by: Mark <markk@clara.co.uk>
>
> Hi Mark,
> Can you please test this patch available at
> https://patchwork.kernel.org/patch/8116991/
> in your setup?

Note this patch is on top of 4.5-rc1 and is likely not suitable for a
-stable backport to 4.3/4.4. For 4.3 and 4.4, distributions that want
to support legacy devices should leave ZONE_DEVICE disabled as it is by
default.
On Tue, January 26, 2016 06:00, Sudip Mukherjee wrote:
> On Mon, Jan 25, 2016 at 04:06:40PM -0800, Dan Williams wrote:
>> It appears devices requiring ZONE_DMA are still prevalent (see link
>> below). For this reason the proposal to require turning off ZONE_DMA to
>> enable ZONE_DEVICE is untenable in the short term. We want a single
>> kernel image to be able to support legacy devices as well as next
>> generation persistent memory platforms.
>>
>> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
>> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
>> of ZONE_DMA at init (->init_spanned_pages) and use that information in
>> is_zone_device_page() to differentiate pages allocated via
>> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
>> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>>
>> Note that this also teaches the memory hot remove path that the zone may
>> not have sections for all pfn spans (->zone_dyn_start_pfn).
>>
>> A user visible implication of this change is potentially an unexpectedly
>> high "spanned" value in /proc/zoneinfo for the DMA zone.
>>
>> Cc: H. Peter Anvin <hpa@zytor.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Jerome Glisse <j.glisse@gmail.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
>> Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
>> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
>
> It should actually be Reported-by: Mark <markk@clara.co.uk>
>
> Hi Mark,
> Can you please test this patch available at
> https://patchwork.kernel.org/patch/8116991/
> in your setup?

I applied that patch to 4.5-rc1 and it seems to work. At least, there is
no error message in dmesg output any more. I didn't actually try using
the parallel port (need to find a parallel printer cable). Presumably a
parallel printer would work whether DMA is used or not, just slower and
using more CPU time in the PIO case. Also, I don't have any hardware
that needs CONFIG_ZONE_DEVICE.

The config file I used to compile the kernel can be downloaded from
https://www.mediafire.com/?1do33bkko41ypo3 if anyone feels like taking
a look.

Perhaps someone with one of the affected PCI sound cards could also test
the patch, since those presumably don't work/build at all without it.
Hopefully someone else has a PC with native parallel port to confirm the
fix. (Native floppy controller may be another affected device.)

Mark
On 26.1.2016 1:06, Dan Williams wrote:
> It appears devices requiring ZONE_DMA are still prevalent (see link
> below). For this reason the proposal to require turning off ZONE_DMA to
> enable ZONE_DEVICE is untenable in the short term. We want a single
> kernel image to be able to support legacy devices as well as next
> generation persistent memory platforms.
>
> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
> of ZONE_DMA at init (->init_spanned_pages) and use that information in
> is_zone_device_page() to differentiate pages allocated via
> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>
> Note that this also teaches the memory hot remove path that the zone may
> not have sections for all pfn spans (->zone_dyn_start_pfn).
>
> A user visible implication of this change is potentially an unexpectedly
> high "spanned" value in /proc/zoneinfo for the DMA zone.

[+CC Joonsoo, Laura]

Sounds like quite a hack :( Would it be possible to extend the bits
encoding zone? Potentially, ZONE_CMA could be added one day...

> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Jerome Glisse <j.glisse@gmail.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
> Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/mm.h     | 46 ++++++++++++++++++++++++++++++++--------------
>  include/linux/mmzone.h | 24 ++++++++++++++++++++----
>  mm/Kconfig             |  1 -
>  mm/memory_hotplug.c    | 15 +++++++++++----
>  mm/page_alloc.c        |  9 ++++++---
>  5 files changed, 69 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f1cd22f2df1a..b4bccd3d3c41 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -664,12 +664,44 @@ static inline enum zone_type page_zonenum(const struct page *page)
>  	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
>  }
>
> +#ifdef NODE_NOT_IN_PAGE_FLAGS
> +extern int page_to_nid(const struct page *page);
> +#else
> +static inline int page_to_nid(const struct page *page)
> +{
> +	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
> +}
> +#endif
> +
> +static inline struct zone *page_zone(const struct page *page)
> +{
> +	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
> +}
> +
>  #ifdef CONFIG_ZONE_DEVICE
>  void get_zone_device_page(struct page *page);
>  void put_zone_device_page(struct page *page);
>  static inline bool is_zone_device_page(const struct page *page)
>  {
> +#ifndef CONFIG_ZONE_DMA
>  	return page_zonenum(page) == ZONE_DEVICE;
> +#else /* ZONE_DEVICE == ZONE_DMA */
> +	struct zone *zone;
> +
> +	if (page_zonenum(page) != ZONE_DEVICE)
> +		return false;
> +
> +	/*
> +	 * If ZONE_DEVICE is aliased with ZONE_DMA we need to check
> +	 * whether this was a dynamically allocated page from
> +	 * devm_memremap_pages() by checking against the size of
> +	 * ZONE_DMA at boot.
> +	 */
> +	zone = page_zone(page);
> +	if (page_to_pfn(page) <= zone_end_pfn_boot(zone))
> +		return false;
> +	return true;
> +#endif
>  }
>  #else
>  static inline void get_zone_device_page(struct page *page)
> @@ -735,15 +767,6 @@ static inline int zone_to_nid(struct zone *zone)
>  #endif
>  }
>
> -#ifdef NODE_NOT_IN_PAGE_FLAGS
> -extern int page_to_nid(const struct page *page);
> -#else
> -static inline int page_to_nid(const struct page *page)
> -{
> -	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
> -}
> -#endif
> -
>  #ifdef CONFIG_NUMA_BALANCING
>  static inline int cpu_pid_to_cpupid(int cpu, int pid)
>  {
> @@ -857,11 +880,6 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>
> -static inline struct zone *page_zone(const struct page *page)
> -{
> -	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
> -}
> -
>  #ifdef SECTION_IN_PAGE_FLAGS
>  static inline void set_page_section(struct page *page, unsigned long section)
>  {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 33bb1b19273e..a0ef09b7f893 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -288,6 +288,13 @@ enum zone_type {
>  	 */
>  	ZONE_DMA,
>  #endif
> +#ifdef CONFIG_ZONE_DEVICE
> +#ifndef CONFIG_ZONE_DMA
> +	ZONE_DEVICE,
> +#else
> +	ZONE_DEVICE = ZONE_DMA,
> +#endif
> +#endif
>  #ifdef CONFIG_ZONE_DMA32
>  	/*
>  	 * x86_64 needs two ZONE_DMAs because it supports devices that are
> @@ -314,11 +321,7 @@ enum zone_type {
>  	ZONE_HIGHMEM,
>  #endif
>  	ZONE_MOVABLE,
> -#ifdef CONFIG_ZONE_DEVICE
> -	ZONE_DEVICE,
> -#endif
>  	__MAX_NR_ZONES
> -
>  };
>
>  #ifndef __GENERATING_BOUNDS_H
> @@ -379,12 +382,19 @@ struct zone {
>
>  	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
>  	unsigned long zone_start_pfn;
> +	/* first dynamically added pfn of the zone */
> +	unsigned long zone_dyn_start_pfn;
>
>  	/*
>  	 * spanned_pages is the total pages spanned by the zone, including
>  	 * holes, which is calculated as:
>  	 *	spanned_pages = zone_end_pfn - zone_start_pfn;
>  	 *
> +	 * init_spanned_pages is the boot/init time total pages spanned
> +	 * by the zone for differentiating statically assigned vs
> +	 * dynamically hot added memory to a zone.
> +	 *	init_spanned_pages = init_zone_end_pfn - zone_start_pfn;
> +	 *
>  	 * present_pages is physical pages existing within the zone, which
>  	 * is calculated as:
>  	 *	present_pages = spanned_pages - absent_pages(pages in holes);
> @@ -423,6 +433,7 @@ struct zone {
>  	 */
>  	unsigned long		managed_pages;
>  	unsigned long		spanned_pages;
> +	unsigned long		init_spanned_pages;
>  	unsigned long		present_pages;
>
>  	const char		*name;
> @@ -546,6 +557,11 @@ static inline unsigned long zone_end_pfn(const struct zone *zone)
>  	return zone->zone_start_pfn + zone->spanned_pages;
>  }
>
> +static inline unsigned long zone_end_pfn_boot(const struct zone *zone)
> +{
> +	return zone->zone_start_pfn + zone->init_spanned_pages;
> +}
> +
>  static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
>  {
>  	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 97a4e06b15c0..08a92a9c8fbd 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -652,7 +652,6 @@ config IDLE_PAGE_TRACKING
>  config ZONE_DEVICE
>  	bool "Device memory (pmem, etc...) hotplug support" if EXPERT
>  	default !ZONE_DMA
> -	depends on !ZONE_DMA
>  	depends on MEMORY_HOTPLUG
>  	depends on MEMORY_HOTREMOVE
>  	depends on X86_64 #arch_add_memory() comprehends device memory
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4af58a3a8ffa..c3f0ff45bd47 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,8 @@ static void __meminit grow_zone_span(struct zone *zone, unsigned long start_pfn,
>
>  	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
>  				zone->zone_start_pfn;
> +	if (!zone->zone_dyn_start_pfn || start_pfn < zone->zone_dyn_start_pfn)
> +		zone->zone_dyn_start_pfn = start_pfn;
>
>  	zone_span_writeunlock(zone);
>  }
> @@ -601,8 +603,9 @@ static int find_biggest_section_pfn(int nid, struct zone *zone,
>  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  			     unsigned long end_pfn)
>  {
> -	unsigned long zone_start_pfn = zone->zone_start_pfn;
> +	unsigned long zone_start_pfn = zone->zone_dyn_start_pfn;
>  	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
> +	bool dyn_zone = zone->zone_start_pfn == zone_start_pfn;
>  	unsigned long zone_end_pfn = z;
>  	unsigned long pfn;
>  	struct mem_section *ms;
> @@ -619,7 +622,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
>  						zone_end_pfn);
>  		if (pfn) {
> -			zone->zone_start_pfn = pfn;
> +			if (dyn_zone)
> +				zone->zone_start_pfn = pfn;
> +			zone->zone_dyn_start_pfn = pfn;
>  			zone->spanned_pages = zone_end_pfn - pfn;
>  		}
>  	} else if (zone_end_pfn == end_pfn) {
> @@ -661,8 +666,10 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  	}
>
>  	/* The zone has no valid section */
> -	zone->zone_start_pfn = 0;
> -	zone->spanned_pages = 0;
> +	if (dyn_zone)
> +		zone->zone_start_pfn = 0;
> +	zone->zone_dyn_start_pfn = 0;
> +	zone->spanned_pages = zone->init_spanned_pages;
>  	zone_span_writeunlock(zone);
>  }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 63358d9f9aa9..2d8b1d602ff3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -209,6 +209,10 @@ EXPORT_SYMBOL(totalram_pages);
>  static char * const zone_names[MAX_NR_ZONES] = {
>  #ifdef CONFIG_ZONE_DMA
>  	 "DMA",
> +#else
> +#ifdef CONFIG_ZONE_DEVICE
> +	 "Device",
> +#endif
>  #endif
>  #ifdef CONFIG_ZONE_DMA32
>  	 "DMA32",
> @@ -218,9 +222,6 @@ static char * const zone_names[MAX_NR_ZONES] = {
>  	 "HighMem",
>  #endif
>  	 "Movable",
> -#ifdef CONFIG_ZONE_DEVICE
> -	 "Device",
> -#endif
>  };
>
>  compound_page_dtor * const compound_page_dtors[] = {
> @@ -5082,6 +5083,8 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
>  						  node_start_pfn, node_end_pfn,
>  						  zholes_size);
>  		zone->spanned_pages = size;
> +		zone->init_spanned_pages = size;
> +		zone->zone_dyn_start_pfn = 0;
>  		zone->present_pages = real_size;
>
>  		totalpages += size;
On Tue, Jan 26, 2016 at 1:42 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 26.1.2016 1:06, Dan Williams wrote:
>> It appears devices requiring ZONE_DMA are still prevalent (see link
>> below). For this reason the proposal to require turning off ZONE_DMA to
>> enable ZONE_DEVICE is untenable in the short term. We want a single
>> kernel image to be able to support legacy devices as well as next
>> generation persistent memory platforms.
>>
>> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
>> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
>> of ZONE_DMA at init (->init_spanned_pages) and use that information in
>> is_zone_device_page() to differentiate pages allocated via
>> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
>> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>>
>> Note that this also teaches the memory hot remove path that the zone may
>> not have sections for all pfn spans (->zone_dyn_start_pfn).
>>
>> A user visible implication of this change is potentially an unexpectedly
>> high "spanned" value in /proc/zoneinfo for the DMA zone.
>
> [+CC Joonsoo, Laura]
>
> Sounds like quite a hack :(

Indeed...

> Would it be possible to extend the bits encoding
> zone? Potentially, ZONE_CMA could be added one day...

Not without impacting the ability to quickly lookup the numa node and
parent section for a page. See ZONES_WIDTH, NODES_WIDTH, and
SECTIONS_WIDTH. My initial implementation of ZONE_DEVICE ran into this
conflict when ZONES_SHIFT is > 2, and I fell back to cannibalizing
ZONE_DMA.
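[For illustration, the following minimal user-space sketch mirrors how
page_zonenum() and page_to_nid() (visible in the patch above) unpack the
zone and node from the upper bits of page->flags. The field widths here
are assumptions for a typical 64-bit build, not authoritative; the
kernel derives the real ZONES_WIDTH and NODES_WIDTH from the config in
include/linux/page-flags-layout.h.]

#include <stdio.h>

/* Assumed widths for illustration only; the kernel computes these at
 * build time.  Two zone bits encode at most four zone numbers, which
 * is why a fifth zone (ZONE_DEVICE) forces either ZONES_SHIFT=3 or
 * the ZONE_DMA aliasing this patch implements. */
#define ZONES_WIDTH     2
#define NODES_WIDTH     10
#define BITS_PER_LONG   64

#define NODES_PGSHIFT   (BITS_PER_LONG - NODES_WIDTH)
#define ZONES_PGSHIFT   (NODES_PGSHIFT - ZONES_WIDTH)
#define ZONES_MASK      ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK      ((1UL << NODES_WIDTH) - 1)

int main(void)
{
        unsigned long flags = 0;
        unsigned long zone = 3, nid = 5;        /* sample values */

        /* pack node and zone into the top of a page->flags-style word */
        flags |= nid << NODES_PGSHIFT;
        flags |= zone << ZONES_PGSHIFT;

        /* unpack, mirroring page_zonenum()/page_to_nid() */
        printf("zone=%lu nid=%lu\n",
               (flags >> ZONES_PGSHIFT) & ZONES_MASK,
               (flags >> NODES_PGSHIFT) & NODES_MASK);
        return 0;
}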
On Mon, 25 Jan 2016 16:06:40 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> It appears devices requiring ZONE_DMA are still prevalent (see link
> below). For this reason the proposal to require turning off ZONE_DMA to
> enable ZONE_DEVICE is untenable in the short term.

More than "short term". When can we ever nuke ZONE_DMA?

This was a pretty big goof - the removal of ZONE_DMA whizzed straight
past my attention, alas. In fact I never noticed the patch at all
until I got some conflicts in -next a few weeks later (wasn't cc'ed).
And then I didn't read the changelog closely enough.

> We want a single
> kernel image to be able to support legacy devices as well as next
> generation persistent memory platforms.

yup.

> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
> of ZONE_DMA at init (->init_spanned_pages) and use that information in
> is_zone_device_page() to differentiate pages allocated via
> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>
> Note that this also teaches the memory hot remove path that the zone may
> not have sections for all pfn spans (->zone_dyn_start_pfn).
>
> A user visible implication of this change is potentially an unexpectedly
> high "spanned" value in /proc/zoneinfo for the DMA zone.

Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
Is it possible to just use ZONES_SHIFT=3?

Also, this "dynamically added pfn of the zone" thing is a new concept
and I think it should be more completely documented somewhere in the
code.
On Tue, Jan 26, 2016 at 2:11 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 25 Jan 2016 16:06:40 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>
>> It appears devices requiring ZONE_DMA are still prevalent (see link
>> below). For this reason the proposal to require turning off ZONE_DMA to
>> enable ZONE_DEVICE is untenable in the short term.
>
> More than "short term". When can we ever nuke ZONE_DMA?

I'm assuming at some point these legacy devices will die off or move
to something attached over a more capable bus like USB?

> This was a pretty big goof - the removal of ZONE_DMA whizzed straight
> past my attention, alas. In fact I never noticed the patch at all
> until I got some conflicts in -next a few weeks later (wasn't cc'ed).
> And then I didn't read the changelog closely enough.

I endeavor to never surprise you again...

To be clear the patch did not disable ZONE_DMA by default, but it was
indeed a goof to assume that ZONE_DMA was less prevalent than it turns
out to be.

>> We want a single
>> kernel image to be able to support legacy devices as well as next
>> generation persistent memory platforms.
>
> yup.
>
>> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
>> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
>> of ZONE_DMA at init (->init_spanned_pages) and use that information in
>> is_zone_device_page() to differentiate pages allocated via
>> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
>> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>>
>> Note that this also teaches the memory hot remove path that the zone may
>> not have sections for all pfn spans (->zone_dyn_start_pfn).
>>
>> A user visible implication of this change is potentially an unexpectedly
>> high "spanned" value in /proc/zoneinfo for the DMA zone.
>
> Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
> Is it possible to just use ZONES_SHIFT=3?

Last I tried I hit this warning in mm/memory.c

#warning Unfortunate NUMA and NUMA Balancing config, growing
page-frame for last_cpupid.

> Also, this "dynamically added pfn of the zone" thing is a new concept
> and I think it should be more completely documented somewhere in the
> code.

Ok, I'll take a look.
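[The warning fires when the packed fields no longer fit alongside the
page flag bits. A sketch of the kind of fit test the kernel performs in
page-flags-layout.h follows; every number below is hypothetical, chosen
so the budget is exactly full at ZONES_SHIFT=2, since the real values
depend on NR_CPUS, NODES_SHIFT, and the sparsemem layout.]

#include <stdio.h>

/* last_cpupid stays in page->flags only if sections + zones + nodes +
 * last_cpupid bits fit in what the page flags leave free.  All the
 * inputs below are hypothetical. */
static int last_cpupid_fits(int sections, int zones, int nodes,
                            int last_cpupid, int nr_pageflags)
{
        return sections + zones + nodes + last_cpupid <=
                64 - nr_pageflags;
}

int main(void)
{
        int sections = 10, nodes = 8, last_cpupid = 8 + 13,
            nr_pageflags = 23;

        for (int zones = 2; zones <= 3; zones++)
                printf("ZONES_SHIFT=%d: last_cpupid %s\n", zones,
                       last_cpupid_fits(sections, zones, nodes,
                                        last_cpupid, nr_pageflags) ?
                       "fits in page->flags" :
                       "does not fit; struct page grows");
        return 0;
}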
On Tue, 26 Jan 2016 14:33:48 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> >> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
> >> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
> >> of ZONE_DMA at init (->init_spanned_pages) and use that information in
> >> is_zone_device_page() to differentiate pages allocated via
> >> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
> >> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
> >>
> >> Note that this also teaches the memory hot remove path that the zone may
> >> not have sections for all pfn spans (->zone_dyn_start_pfn).
> >>
> >> A user visible implication of this change is potentially an unexpectedly
> >> high "spanned" value in /proc/zoneinfo for the DMA zone.
> >
> > Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
> > Is it possible to just use ZONES_SHIFT=3?
>
> Last I tried I hit this warning in mm/memory.c
>
> #warning Unfortunate NUMA and NUMA Balancing config, growing
> page-frame for last_cpupid.

Well yes, it may take a bit of work - perhaps salvaging a bit from
somewhere else if poss. But that might provide a better overall
solution so could you please have a think?
On Tue, Jan 26, 2016 at 2:51 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 26 Jan 2016 14:33:48 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>
>> >> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
>> >> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
>> >> of ZONE_DMA at init (->init_spanned_pages) and use that information in
>> >> is_zone_device_page() to differentiate pages allocated via
>> >> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
>> >> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>> >>
>> >> Note that this also teaches the memory hot remove path that the zone may
>> >> not have sections for all pfn spans (->zone_dyn_start_pfn).
>> >>
>> >> A user visible implication of this change is potentially an unexpectedly
>> >> high "spanned" value in /proc/zoneinfo for the DMA zone.
>> >
>> > Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
>> > Is it possible to just use ZONES_SHIFT=3?
>>
>> Last I tried I hit this warning in mm/memory.c
>>
>> #warning Unfortunate NUMA and NUMA Balancing config, growing
>> page-frame for last_cpupid.
>
> Well yes, it may take a bit of work - perhaps salvaging a bit from
> somewhere else if poss. But that might provide a better overall
> solution so could you please have a think?
>

Will do, especially since other efforts are feeling the pinch on the
MAX_NR_ZONES limitation.
Hello,

On Tue, Jan 26, 2016 at 03:11:36PM -0800, Dan Williams wrote:
> On Tue, Jan 26, 2016 at 2:51 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Tue, 26 Jan 2016 14:33:48 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> >> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
> >> >> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
> >> >> of ZONE_DMA at init (->init_spanned_pages) and use that information in
> >> >> is_zone_device_page() to differentiate pages allocated via
> >> >> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
> >> >> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
> >> >>
> >> >> Note that this also teaches the memory hot remove path that the zone may
> >> >> not have sections for all pfn spans (->zone_dyn_start_pfn).
> >> >>
> >> >> A user visible implication of this change is potentially an unexpectedly
> >> >> high "spanned" value in /proc/zoneinfo for the DMA zone.
> >> >
> >> > Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
> >> > Is it possible to just use ZONES_SHIFT=3?
> >>
> >> Last I tried I hit this warning in mm/memory.c
> >>
> >> #warning Unfortunate NUMA and NUMA Balancing config, growing
> >> page-frame for last_cpupid.
> >
> > Well yes, it may take a bit of work - perhaps salvaging a bit from
> > somewhere else if poss. But that might provide a better overall
> > solution so could you please have a think?
> >
>
> Will do, especially since other efforts are feeling the pinch on the
> MAX_NR_ZONES limitation.

Please refer my previous attempt to add a new zone, ZONE_CMA.

https://lkml.org/lkml/2015/2/12/84

It salvages a bit from SECTION_WIDTH by increasing section size.
Similarly, I guess we can reduce NODE_WIDTH if needed although
it could cause to reduce maximum node size.

Thanks.
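[The SECTION_WIDTH salvage works because the section number stored in
page->flags shrinks by one bit every time the section size doubles. A
small sketch of that arithmetic, using assumed x86_64 sparsemem
constants; treat them as illustrative rather than authoritative, since
the real values live in the arch headers.]

#include <stdio.h>

int main(void)
{
        /* Assumed: 46 physical address bits and 128MB sections are
         * typical x86_64 sparsemem settings of this era. */
        int max_physmem_bits = 46;

        for (int section_size_bits = 27; section_size_bits <= 28;
             section_size_bits++)
                printf("%4dMB sections -> SECTIONS_SHIFT = %d\n",
                       1 << (section_size_bits - 20),
                       max_physmem_bits - section_size_bits);

        /* Doubling the section size frees one page->flags bit that a
         * new zone such as ZONE_CMA or ZONE_DEVICE could consume. */
        return 0;
}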
On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> Hello,
>
> On Tue, Jan 26, 2016 at 03:11:36PM -0800, Dan Williams wrote:
>> On Tue, Jan 26, 2016 at 2:51 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>> > On Tue, 26 Jan 2016 14:33:48 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>> >
>> >> >> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
>> >> >> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
>> >> >> of ZONE_DMA at init (->init_spanned_pages) and use that information in
>> >> >> is_zone_device_page() to differentiate pages allocated via
>> >> >> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
>> >> >> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
>> >> >>
>> >> >> Note that this also teaches the memory hot remove path that the zone may
>> >> >> not have sections for all pfn spans (->zone_dyn_start_pfn).
>> >> >>
>> >> >> A user visible implication of this change is potentially an unexpectedly
>> >> >> high "spanned" value in /proc/zoneinfo for the DMA zone.
>> >> >
>> >> > Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
>> >> > Is it possible to just use ZONES_SHIFT=3?
>> >>
>> >> Last I tried I hit this warning in mm/memory.c
>> >>
>> >> #warning Unfortunate NUMA and NUMA Balancing config, growing
>> >> page-frame for last_cpupid.
>> >
>> > Well yes, it may take a bit of work - perhaps salvaging a bit from
>> > somewhere else if poss. But that might provide a better overall
>> > solution so could you please have a think?
>> >
>>
>> Will do, especially since other efforts are feeling the pinch on the
>> MAX_NR_ZONES limitation.
>
> Please refer my previous attempt to add a new zone, ZONE_CMA.
>
> https://lkml.org/lkml/2015/2/12/84
>
> It salvages a bit from SECTION_WIDTH by increasing section size.
> Similarly, I guess we can reduce NODE_WIDTH if needed although
> it could cause to reduce maximum node size.

Dave pointed out to me that LAST__PID_SHIFT might be a better
candidate to reduce to 7 bits. That field is for storing pids which
are already bigger than 8 bits. If it is relying on the fact that
pids don't rollover very often then likely the impact of 7-bits
instead of 8 will be minimal.
On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
> On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > Hello,
> >
> > On Tue, Jan 26, 2016 at 03:11:36PM -0800, Dan Williams wrote:
> >> On Tue, Jan 26, 2016 at 2:51 PM, Andrew Morton
> >> <akpm@linux-foundation.org> wrote:
> >> > On Tue, 26 Jan 2016 14:33:48 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> >> >
> >> >> >> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
> >> >> >> to maintain a unique zone number for ZONE_DEVICE. Record the geometry
> >> >> >> of ZONE_DMA at init (->init_spanned_pages) and use that information in
> >> >> >> is_zone_device_page() to differentiate pages allocated via
> >> >> >> devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
> >> >> >> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
> >> >> >>
> >> >> >> Note that this also teaches the memory hot remove path that the zone may
> >> >> >> not have sections for all pfn spans (->zone_dyn_start_pfn).
> >> >> >>
> >> >> >> A user visible implication of this change is potentially an unexpectedly
> >> >> >> high "spanned" value in /proc/zoneinfo for the DMA zone.
> >> >> >
> >> >> > Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
> >> >> > Is it possible to just use ZONES_SHIFT=3?
> >> >>
> >> >> Last I tried I hit this warning in mm/memory.c
> >> >>
> >> >> #warning Unfortunate NUMA and NUMA Balancing config, growing
> >> >> page-frame for last_cpupid.
> >> >
> >> > Well yes, it may take a bit of work - perhaps salvaging a bit from
> >> > somewhere else if poss. But that might provide a better overall
> >> > solution so could you please have a think?
> >> >
> >>
> >> Will do, especially since other efforts are feeling the pinch on the
> >> MAX_NR_ZONES limitation.
> >
> > Please refer my previous attempt to add a new zone, ZONE_CMA.
> >
> > https://lkml.org/lkml/2015/2/12/84
> >
> > It salvages a bit from SECTION_WIDTH by increasing section size.
> > Similarly, I guess we can reduce NODE_WIDTH if needed although
> > it could cause to reduce maximum node size.
>
> Dave pointed out to me that LAST__PID_SHIFT might be a better
> candidate to reduce to 7 bits. That field is for storing pids which
> are already bigger than 8 bits. If it is relying on the fact that
> pids don't rollover very often then likely the impact of 7-bits
> instead of 8 will be minimal.

Hmm... I'm not sure it's possible or not, but, it looks not a general
solution. It will solve your problem because you are using 64 bit arch
but other 32 bit archs can't get the benefit.

Thanks.
On Tue, Jan 26, 2016 at 6:15 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
>> On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
[..]
>> > Please refer my previous attempt to add a new zone, ZONE_CMA.
>> >
>> > https://lkml.org/lkml/2015/2/12/84
>> >
>> > It salvages a bit from SECTION_WIDTH by increasing section size.
>> > Similarly, I guess we can reduce NODE_WIDTH if needed although
>> > it could cause to reduce maximum node size.
>>
>> Dave pointed out to me that LAST__PID_SHIFT might be a better
>> candidate to reduce to 7 bits. That field is for storing pids which
>> are already bigger than 8 bits. If it is relying on the fact that
>> pids don't rollover very often then likely the impact of 7-bits
>> instead of 8 will be minimal.
>
> Hmm... I'm not sure it's possible or not, but, it looks not a general
> solution. It will solve your problem because you are using 64 bit arch
> but other 32 bit archs can't get the benefit.

This is where the ZONE_CMA and ZONE_DEVICE efforts diverge.
ZONE_DEVICE is meant to enable DMA access to hundreds of gigabytes of
persistent memory. A 64-bit-only limitation for ZONE_DEVICE is
reasonable.
On Tue, Jan 26, 2016 at 07:23:59PM -0800, Dan Williams wrote:
> On Tue, Jan 26, 2016 at 6:15 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
> >> On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> [..]
> >> > Please refer my previous attempt to add a new zone, ZONE_CMA.
> >> >
> >> > https://lkml.org/lkml/2015/2/12/84
> >> >
> >> > It salvages a bit from SECTION_WIDTH by increasing section size.
> >> > Similarly, I guess we can reduce NODE_WIDTH if needed although
> >> > it could cause to reduce maximum node size.
> >>
> >> Dave pointed out to me that LAST__PID_SHIFT might be a better
> >> candidate to reduce to 7 bits. That field is for storing pids which
> >> are already bigger than 8 bits. If it is relying on the fact that
> >> pids don't rollover very often then likely the impact of 7-bits
> >> instead of 8 will be minimal.
> >
> > Hmm... I'm not sure it's possible or not, but, it looks not a general
> > solution. It will solve your problem because you are using 64 bit arch
> > but other 32 bit archs can't get the benefit.
>
> This is where the ZONE_CMA and ZONE_DEVICE efforts diverge.
> ZONE_DEVICE is meant to enable DMA access to hundreds of gigabytes of
> persistent memory. A 64-bit-only limitation for ZONE_DEVICE is
> reasonable.

Yes, but, my point is that if someone need another zone like as
ZONE_CMA, they couldn't get the benefit from this change. They need to
re-investigate what bits they can reduce and need to re-do all things.

If it is implemented more generally at this time, it can relieve their
burden and less churn the code. It would be helpful for maintainability.

Thanks.
On Tue, Jan 26, 2016 at 7:52 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> On Tue, Jan 26, 2016 at 07:23:59PM -0800, Dan Williams wrote:
>> On Tue, Jan 26, 2016 at 6:15 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
>> > On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
>> >> On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
>> [..]
>> >> > Please refer my previous attempt to add a new zone, ZONE_CMA.
>> >> >
>> >> > https://lkml.org/lkml/2015/2/12/84
>> >> >
>> >> > It salvages a bit from SECTION_WIDTH by increasing section size.
>> >> > Similarly, I guess we can reduce NODE_WIDTH if needed although
>> >> > it could cause to reduce maximum node size.
>> >>
>> >> Dave pointed out to me that LAST__PID_SHIFT might be a better
>> >> candidate to reduce to 7 bits. That field is for storing pids which
>> >> are already bigger than 8 bits. If it is relying on the fact that
>> >> pids don't rollover very often then likely the impact of 7-bits
>> >> instead of 8 will be minimal.
>> >
>> > Hmm... I'm not sure it's possible or not, but, it looks not a general
>> > solution. It will solve your problem because you are using 64 bit arch
>> > but other 32 bit archs can't get the benefit.
>>
>> This is where the ZONE_CMA and ZONE_DEVICE efforts diverge.
>> ZONE_DEVICE is meant to enable DMA access to hundreds of gigabytes of
>> persistent memory. A 64-bit-only limitation for ZONE_DEVICE is
>> reasonable.
>
> Yes, but, my point is that if someone need another zone like as
> ZONE_CMA, they couldn't get the benefit from this change. They need to
> re-investigate what bits they can reduce and need to re-do all things.
>
> If it is implemented more generally at this time, it can relieve their
> burden and less churn the code. It would be helpful for maintainability.

I agree in principle that finding a 32-bit compatible solution is
desirable, but it simply may not be feasible.

For now, I'll help with auditing the existing bits so we can enumerate
the tradeoffs.

Hmm, one tradeoff that comes to mind for 32-bit is sacrificing
ZONE_HIGHMEM for ZONE_CMA. Are there configurations that need both
enabled? If a platform needs highmem it really should be using a
64-bit kernel (if possible); desire for ZONE_CMA might be a nice
encouragement to lessen the prevalence of highmem.
On Tue, Jan 26, 2016 at 08:26:24PM -0800, Dan Williams wrote:
> On Tue, Jan 26, 2016 at 7:52 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > On Tue, Jan 26, 2016 at 07:23:59PM -0800, Dan Williams wrote:
> >> On Tue, Jan 26, 2016 at 6:15 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> >> > On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
> >> >> On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> >> [..]
> >> >> > Please refer my previous attempt to add a new zone, ZONE_CMA.
> >> >> >
> >> >> > https://lkml.org/lkml/2015/2/12/84
> >> >> >
> >> >> > It salvages a bit from SECTION_WIDTH by increasing section size.
> >> >> > Similarly, I guess we can reduce NODE_WIDTH if needed although
> >> >> > it could cause to reduce maximum node size.
> >> >>
> >> >> Dave pointed out to me that LAST__PID_SHIFT might be a better
> >> >> candidate to reduce to 7 bits. That field is for storing pids which
> >> >> are already bigger than 8 bits. If it is relying on the fact that
> >> >> pids don't rollover very often then likely the impact of 7-bits
> >> >> instead of 8 will be minimal.
> >> >
> >> > Hmm... I'm not sure it's possible or not, but, it looks not a general
> >> > solution. It will solve your problem because you are using 64 bit arch
> >> > but other 32 bit archs can't get the benefit.
> >>
> >> This is where the ZONE_CMA and ZONE_DEVICE efforts diverge.
> >> ZONE_DEVICE is meant to enable DMA access to hundreds of gigabytes of
> >> persistent memory. A 64-bit-only limitation for ZONE_DEVICE is
> >> reasonable.
> >
> > Yes, but, my point is that if someone need another zone like as
> > ZONE_CMA, they couldn't get the benefit from this change. They need to
> > re-investigate what bits they can reduce and need to re-do all things.
> >
> > If it is implemented more generally at this time, it can relieve their
> > burden and less churn the code. It would be helpful for maintainability.
>
> I agree in principle that finding a 32-bit compatible solution is
> desirable, but it simply may not be feasible.

Okay.

> For now, I'll help with auditing the existing bits so we can enumerate
> the tradeoffs.

Thanks! :)

> Hmm, one tradeoff that comes to mind for 32-bit is sacrificing
> ZONE_HIGHMEM for ZONE_CMA. Are there configurations that need both
> enabled? If a platform needs highmem it really should be using a
> 64-bit kernel (if possible); desire for ZONE_CMA might be a nice
> encouragement to lessen the prevalence of highmem.

I guess that it's not possible. There are many systems that need both.

I haven't thought deeply about it, but there is another option for
ZONE_CMA. It can share ZONE_MOVABLE because their characteristic is
roughly the same in view of MM. I will think more.

Thanks.
On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
> >> Will do, especially since other efforts are feeling the pinch on the
> >> MAX_NR_ZONES limitation.
> >
> > Please refer my previous attempt to add a new zone, ZONE_CMA.
> >
> > https://lkml.org/lkml/2015/2/12/84
> >
> > It salvages a bit from SECTION_WIDTH by increasing section size.
> > Similarly, I guess we can reduce NODE_WIDTH if needed although
> > it could cause to reduce maximum node size.
>
> Dave pointed out to me that LAST__PID_SHIFT might be a better
> candidate to reduce to 7 bits. That field is for storing pids which
> are already bigger than 8 bits. If it is relying on the fact that
> pids don't rollover very often then likely the impact of 7-bits
> instead of 8 will be minimal.

It's not relying on the fact pids don't roll over very often. The
information is used by automatic NUMA balancing to detect if multiple
accesses to data are from the same task or not. Reducing the number of
bits it uses increases the chance that two tasks will both think they
are the data owner and keep migrating it.
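[To make the aliasing risk concrete, here is a minimal user-space
sketch of what truncating pids to the stored width does. The 8-bit
width mirrors LAST__PID_SHIFT; the helper name and sample pids are made
up for illustration.]

#include <stdio.h>

#define PID_BITS 8      /* mirrors LAST__PID_SHIFT; the proposal is 7 */
#define PID_MASK ((1 << PID_BITS) - 1)

/* truncate a full pid to the width the cpupid encoding can store */
static int stored_pid(int pid)
{
        return pid & PID_MASK;
}

int main(void)
{
        /* two distinct tasks whose pids collide once truncated */
        int pid_a = 1000;
        int pid_b = 1000 + (1 << PID_BITS);

        printf("pid %d and pid %d are both stored as %d -> %s\n",
               pid_a, pid_b, stored_pid(pid_a),
               stored_pid(pid_a) == stored_pid(pid_b) ?
               "NUMA balancing sees one task" : "distinct");

        /* each bit removed doubles the chance of such a collision,
         * i.e. two tasks both believing they own the data */
        return 0;
}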
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f1cd22f2df1a..b4bccd3d3c41 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -664,12 +664,44 @@ static inline enum zone_type page_zonenum(const struct page *page)
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+extern int page_to_nid(const struct page *page);
+#else
+static inline int page_to_nid(const struct page *page)
+{
+	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
+}
+#endif
+
+static inline struct zone *page_zone(const struct page *page)
+{
+	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 void get_zone_device_page(struct page *page);
 void put_zone_device_page(struct page *page);
 static inline bool is_zone_device_page(const struct page *page)
 {
+#ifndef CONFIG_ZONE_DMA
 	return page_zonenum(page) == ZONE_DEVICE;
+#else /* ZONE_DEVICE == ZONE_DMA */
+	struct zone *zone;
+
+	if (page_zonenum(page) != ZONE_DEVICE)
+		return false;
+
+	/*
+	 * If ZONE_DEVICE is aliased with ZONE_DMA we need to check
+	 * whether this was a dynamically allocated page from
+	 * devm_memremap_pages() by checking against the size of
+	 * ZONE_DMA at boot.
+	 */
+	zone = page_zone(page);
+	if (page_to_pfn(page) <= zone_end_pfn_boot(zone))
+		return false;
+	return true;
+#endif
 }
 #else
 static inline void get_zone_device_page(struct page *page)
@@ -735,15 +767,6 @@ static inline int zone_to_nid(struct zone *zone)
 #endif
 }
 
-#ifdef NODE_NOT_IN_PAGE_FLAGS
-extern int page_to_nid(const struct page *page);
-#else
-static inline int page_to_nid(const struct page *page)
-{
-	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
-}
-#endif
-
 #ifdef CONFIG_NUMA_BALANCING
 static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
@@ -857,11 +880,6 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-static inline struct zone *page_zone(const struct page *page)
-{
-	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
-}
-
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 33bb1b19273e..a0ef09b7f893 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -288,6 +288,13 @@ enum zone_type {
 	 */
 	ZONE_DMA,
 #endif
+#ifdef CONFIG_ZONE_DEVICE
+#ifndef CONFIG_ZONE_DMA
+	ZONE_DEVICE,
+#else
+	ZONE_DEVICE = ZONE_DMA,
+#endif
+#endif
 #ifdef CONFIG_ZONE_DMA32
 	/*
 	 * x86_64 needs two ZONE_DMAs because it supports devices that are
@@ -314,11 +321,7 @@ enum zone_type {
 	ZONE_HIGHMEM,
 #endif
 	ZONE_MOVABLE,
-#ifdef CONFIG_ZONE_DEVICE
-	ZONE_DEVICE,
-#endif
 	__MAX_NR_ZONES
-
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -379,12 +382,19 @@ struct zone {
 
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long zone_start_pfn;
+	/* first dynamically added pfn of the zone */
+	unsigned long zone_dyn_start_pfn;
 
 	/*
 	 * spanned_pages is the total pages spanned by the zone, including
 	 * holes, which is calculated as:
 	 *	spanned_pages = zone_end_pfn - zone_start_pfn;
 	 *
+	 * init_spanned_pages is the boot/init time total pages spanned
+	 * by the zone for differentiating statically assigned vs
+	 * dynamically hot added memory to a zone.
+	 *	init_spanned_pages = init_zone_end_pfn - zone_start_pfn;
+	 *
 	 * present_pages is physical pages existing within the zone, which
 	 * is calculated as:
 	 *	present_pages = spanned_pages - absent_pages(pages in holes);
@@ -423,6 +433,7 @@ struct zone {
 	 */
 	unsigned long		managed_pages;
 	unsigned long		spanned_pages;
+	unsigned long		init_spanned_pages;
 	unsigned long		present_pages;
 
 	const char		*name;
@@ -546,6 +557,11 @@ static inline unsigned long zone_end_pfn(const struct zone *zone)
 	return zone->zone_start_pfn + zone->spanned_pages;
 }
 
+static inline unsigned long zone_end_pfn_boot(const struct zone *zone)
+{
+	return zone->zone_start_pfn + zone->init_spanned_pages;
+}
+
 static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
 {
 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
diff --git a/mm/Kconfig b/mm/Kconfig
index 97a4e06b15c0..08a92a9c8fbd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -652,7 +652,6 @@ config IDLE_PAGE_TRACKING
 config ZONE_DEVICE
 	bool "Device memory (pmem, etc...) hotplug support" if EXPERT
 	default !ZONE_DMA
-	depends on !ZONE_DMA
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on X86_64 #arch_add_memory() comprehends device memory
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4af58a3a8ffa..c3f0ff45bd47 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -300,6 +300,8 @@ static void __meminit grow_zone_span(struct zone *zone, unsigned long start_pfn,
 
 	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
 				zone->zone_start_pfn;
+	if (!zone->zone_dyn_start_pfn || start_pfn < zone->zone_dyn_start_pfn)
+		zone->zone_dyn_start_pfn = start_pfn;
 
 	zone_span_writeunlock(zone);
 }
@@ -601,8 +603,9 @@ static int find_biggest_section_pfn(int nid, struct zone *zone,
 static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 			     unsigned long end_pfn)
 {
-	unsigned long zone_start_pfn = zone->zone_start_pfn;
+	unsigned long zone_start_pfn = zone->zone_dyn_start_pfn;
 	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
+	bool dyn_zone = zone->zone_start_pfn == zone_start_pfn;
 	unsigned long zone_end_pfn = z;
 	unsigned long pfn;
 	struct mem_section *ms;
@@ -619,7 +622,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
 						zone_end_pfn);
 		if (pfn) {
-			zone->zone_start_pfn = pfn;
+			if (dyn_zone)
+				zone->zone_start_pfn = pfn;
+			zone->zone_dyn_start_pfn = pfn;
 			zone->spanned_pages = zone_end_pfn - pfn;
 		}
 	} else if (zone_end_pfn == end_pfn) {
@@ -661,8 +666,10 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	}
 
 	/* The zone has no valid section */
-	zone->zone_start_pfn = 0;
-	zone->spanned_pages = 0;
+	if (dyn_zone)
+		zone->zone_start_pfn = 0;
+	zone->zone_dyn_start_pfn = 0;
+	zone->spanned_pages = zone->init_spanned_pages;
 	zone_span_writeunlock(zone);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63358d9f9aa9..2d8b1d602ff3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -209,6 +209,10 @@ EXPORT_SYMBOL(totalram_pages);
 static char * const zone_names[MAX_NR_ZONES] = {
 #ifdef CONFIG_ZONE_DMA
 	 "DMA",
+#else
+#ifdef CONFIG_ZONE_DEVICE
+	 "Device",
+#endif
 #endif
 #ifdef CONFIG_ZONE_DMA32
 	 "DMA32",
@@ -218,9 +222,6 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "HighMem",
 #endif
 	 "Movable",
-#ifdef CONFIG_ZONE_DEVICE
-	 "Device",
-#endif
 };
 
 compound_page_dtor * const compound_page_dtors[] = {
@@ -5082,6 +5083,8 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
 						  node_start_pfn, node_end_pfn,
 						  zholes_size);
 		zone->spanned_pages = size;
+		zone->init_spanned_pages = size;
+		zone->zone_dyn_start_pfn = 0;
 		zone->present_pages = real_size;
 
 		totalpages += size;
It appears devices requiring ZONE_DMA are still prevalent (see link
below). For this reason the proposal to require turning off ZONE_DMA to
enable ZONE_DEVICE is untenable in the short term. We want a single
kernel image to be able to support legacy devices as well as next
generation persistent memory platforms.

Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
to maintain a unique zone number for ZONE_DEVICE. Record the geometry
of ZONE_DMA at init (->init_spanned_pages) and use that information in
is_zone_device_page() to differentiate pages allocated via
devm_memremap_pages() vs true ZONE_DMA pages. Otherwise, use the
simpler definition of is_zone_device_page() when ZONE_DMA is turned off.

Note that this also teaches the memory hot remove path that the zone may
not have sections for all pfn spans (->zone_dyn_start_pfn).

A user visible implication of this change is potentially an unexpectedly
high "spanned" value in /proc/zoneinfo for the DMA zone.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h     | 46 ++++++++++++++++++++++++++++++++--------------
 include/linux/mmzone.h | 24 ++++++++++++++++++++----
 mm/Kconfig             |  1 -
 mm/memory_hotplug.c    | 15 +++++++++++----
 mm/page_alloc.c        |  9 ++++++---
 5 files changed, 69 insertions(+), 26 deletions(-)