
[RFC] mm: support CONFIG_ZONE_DEVICE + CONFIG_ZONE_DMA

Message ID 20160126000639.358.89668.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State Superseded

Commit Message

Dan Williams Jan. 26, 2016, 12:06 a.m. UTC
It appears devices requiring ZONE_DMA are still prevalent (see link
below).  For this reason the proposal to require turning off ZONE_DMA to
enable ZONE_DEVICE is untenable in the short term.  We want a single
kernel image to be able to support legacy devices as well as next
generation persistent memory platforms.

Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
to maintain a unique zone number for ZONE_DEVICE.  Record the geometry
of ZONE_DMA at init (->init_spanned_pages) and use that information in
is_zone_device_page() to differentiate pages allocated via
devm_memremap_pages() vs true ZONE_DMA pages.  Otherwise, use the
simpler definition of is_zone_device_page() when ZONE_DMA is turned off.

Note that this also teaches the memory hot remove path that the zone may
not have sections for all pfn spans (->zone_dyn_start_pfn).

A user visible implication of this change is potentially an unexpectedly
high "spanned" value in /proc/zoneinfo for the DMA zone.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h     |   46 ++++++++++++++++++++++++++++++++--------------
 include/linux/mmzone.h |   24 ++++++++++++++++++++----
 mm/Kconfig             |    1 -
 mm/memory_hotplug.c    |   15 +++++++++++----
 mm/page_alloc.c        |    9 ++++++---
 5 files changed, 69 insertions(+), 26 deletions(-)
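
A minimal userspace sketch of the disambiguation rule described in the
commit message, with field names mirroring the patch below; the zone
geometry and pfn values are made-up examples:

#include <stdbool.h>
#include <stdio.h>

struct zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;      /* current span; grows on hotplug */
	unsigned long init_spanned_pages; /* span recorded at boot */
};

/* boot-time end of the zone, as in the patch's zone_end_pfn_boot() */
static unsigned long zone_end_pfn_boot(const struct zone *zone)
{
	return zone->zone_start_pfn + zone->init_spanned_pages;
}

/* device pages are exactly those hot-added beyond the boot-time span */
static bool is_zone_device_pfn(const struct zone *zone, unsigned long pfn)
{
	return pfn > zone_end_pfn_boot(zone);
}

int main(void)
{
	/* ZONE_DMA spanned pfns [1, 4096) at boot; pretend
	 * devm_memremap_pages() later grew the span to cover pmem */
	struct zone dma = { 1, 1 << 20, 4095 };

	printf("pfn 0x800:    %s\n",
	       is_zone_device_pfn(&dma, 0x800) ? "ZONE_DEVICE" : "ZONE_DMA");
	printf("pfn 0x100000: %s\n",
	       is_zone_device_pfn(&dma, 0x100000) ? "ZONE_DEVICE" : "ZONE_DMA");
	return 0;
}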

Comments

Sudip Mukherjee Jan. 26, 2016, 6 a.m. UTC | #1
On Mon, Jan 25, 2016 at 04:06:40PM -0800, Dan Williams wrote:
> [..]
> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>

It should actually be Reported-by: Mark <markk@clara.co.uk>

Hi Mark,
Can you please test this patch available at https://patchwork.kernel.org/patch/8116991/
in your setup?

regards
sudip
Dan Williams Jan. 26, 2016, 5:07 p.m. UTC | #2
On Mon, Jan 25, 2016 at 10:00 PM, Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
> On Mon, Jan 25, 2016 at 04:06:40PM -0800, Dan Williams wrote:
>> [..]
>> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
>
> It should actually be Reported-by: Mark <markk@clara.co.uk>
>
> Hi Mark,
> Can you please test this patch available at https://patchwork.kernel.org/patch/8116991/
> in your setup?

Note this patch is on top of 4.5-rc1 and is likely not suitable for
-stable backport to 4.3/4.4.  For 4.3 and 4.4, distributions that want
to support legacy devices should leave ZONE_DEVICE disabled as it is
by default.
Mark Jan. 26, 2016, 7:10 p.m. UTC | #3
On Tue, January 26, 2016 06:00, Sudip Mukherjee wrote:
> On Mon, Jan 25, 2016 at 04:06:40PM -0800, Dan Williams wrote:
>> [..]
>> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
>
> It should actually be Reported-by: Mark <markk@clara.co.uk>
>
> Hi Mark,
> Can you please test this patch available at
> https://patchwork.kernel.org/patch/8116991/
> in your setup?

I applied that patch to 4.5-rc1 and it seems to work. At least, there is
no error message in dmesg output any more. I didn't actually try using the
parallel port (need to find a parallel printer cable). Presumably a
parallel printer would work whether DMA is used or not, just slower and
using more CPU time in the PIO case. Also, I don't have any hardware that
needs CONFIG_ZONE_DEVICE.

The config file I used to compile the kernel can be downloaded from
https://www.mediafire.com/?1do33bkko41ypo3
if anyone feels like taking a look.

Perhaps someone with one of the affected PCI sound cards could also test
the patch, since those presumably don't work/build at all without it.
Hopefully someone else has a PC with a native parallel port to confirm the
fix. (Native floppy controller may be another affected device.)


Mark
Vlastimil Babka Jan. 26, 2016, 9:42 p.m. UTC | #4
On 26.1.2016 1:06, Dan Williams wrote:
> [..]

[+CC Joonsoo, Laura]

Sounds like quite a hack :( Would it be possible to extend the bits encoding
zone? Potentially, ZONE_CMA could be added one day...
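
For scale, the cost behind the question: ZONES_SHIFT is effectively
ceil(log2(MAX_NR_ZONES)), so a fifth zone needs a third bit.  A toy
calculation with an illustrative x86_64-style zone list:

#include <stdio.h>

int main(void)
{
	const char *zones[] = { "DMA", "DMA32", "Normal", "Movable", "Device" };

	for (int nr_zones = 4; nr_zones <= 5; nr_zones++) {
		int shift = 0;

		while ((1 << shift) < nr_zones)	/* ceil(log2(nr_zones)) */
			shift++;
		printf("%d zones (up to \"%s\") -> ZONES_SHIFT=%d\n",
		       nr_zones, zones[nr_zones - 1], shift);
	}
	return 0;
}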

> [..]
Dan Williams Jan. 26, 2016, 9:48 p.m. UTC | #5
On Tue, Jan 26, 2016 at 1:42 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 26.1.2016 1:06, Dan Williams wrote:
>> [..]
>
> [+CC Joonsoo, Laura]
>
> Sounds like quite a hack :(

Indeed...

> Would it be possible to extend the bits encoding
> zone? Potentially, ZONE_CMA could be added one day...

Not without impacting the ability to quickly look up the NUMA node and
parent section for a page.  See ZONES_WIDTH, NODES_WIDTH, and
SECTIONS_WIDTH.

My initial implementation of ZONE_DEVICE ran into this conflict when
ZONES_SHIFT is > 2, and I fell back to cannibalizing ZONE_DMA.
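
A back-of-the-envelope sketch of that conflict: the inequality mirrors
the kernel's last_cpupid layout rule, but all the widths below are
assumed example values for a classic-sparsemem NUMA config, not taken
from any particular tree:

#include <stdio.h>

#define BITS_PER_LONG     64
#define NR_PAGEFLAGS      22		/* assumed number of page flags */
#define SECTIONS_WIDTH    (46 - 27)	/* classic sparsemem: assumed
					   MAX_PHYSMEM_BITS - SECTION_SIZE_BITS */
#define NODES_SHIFT        6		/* assumed CONFIG_NODES_SHIFT */
#define LAST_CPUPID_SHIFT (8 + 7)	/* assumed LAST__PID + cpu bits */

int main(void)
{
	int room = BITS_PER_LONG - NR_PAGEFLAGS;

	for (int zones_width = 2; zones_width <= 3; zones_width++) {
		int used = SECTIONS_WIDTH + zones_width + NODES_SHIFT +
			   LAST_CPUPID_SHIFT;

		printf("ZONES_WIDTH=%d: %d bits needed, %d available -> %s\n",
		       zones_width, used, room,
		       used <= room ? "last_cpupid fits in page->flags"
				    : "last_cpupid grows the page frame");
	}
	return 0;
}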
Andrew Morton Jan. 26, 2016, 10:11 p.m. UTC | #6
On Mon, 25 Jan 2016 16:06:40 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> It appears devices requiring ZONE_DMA are still prevalent (see link
> below).  For this reason the proposal to require turning off ZONE_DMA to
> enable ZONE_DEVICE is untenable in the short term.

More than "short term".  When can we ever nuke ZONE_DMA?

This was a pretty big goof - the removal of ZONE_DMA whizzed straight
past my attention, alas.  In fact I never noticed the patch at all
until I got some conflicts in -next a few weeks later (wasn't cc'ed). 
And then I didn't read the changelog closely enough.

>  We want a single
> kernel image to be able to support legacy devices as well as next
> generation persistent memory platforms.

yup.
 
> Towards this end, alias ZONE_DMA and ZONE_DEVICE to work around needing
> to maintain a unique zone number for ZONE_DEVICE.  Record the geometry
> of ZONE_DMA at init (->init_spanned_pages) and use that information in
> is_zone_device_page() to differentiate pages allocated via
> devm_memremap_pages() vs true ZONE_DMA pages.  Otherwise, use the
> simpler definition of is_zone_device_page() when ZONE_DMA is turned off.
> 
> Note that this also teaches the memory hot remove path that the zone may
> not have sections for all pfn spans (->zone_dyn_start_pfn).
> 
> A user visible implication of this change is potentially an unexpectedly
> high "spanned" value in /proc/zoneinfo for the DMA zone.

Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes? 
Is it possible to just use ZONES_SHIFT=3?

Also, this "dynamically added pfn of the zone" thing is a new concept
and I think it should be more completely documented somewhere in the
code.
Dan Williams Jan. 26, 2016, 10:33 p.m. UTC | #7
On Tue, Jan 26, 2016 at 2:11 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 25 Jan 2016 16:06:40 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>
>> It appears devices requiring ZONE_DMA are still prevalent (see link
>> below).  For this reason the proposal to require turning off ZONE_DMA to
>> enable ZONE_DEVICE is untenable in the short term.
>
> More than "short term".  When can we ever nuke ZONE_DMA?

I'm assuming at some point these legacy devices will die off or move
to something attached over a more capable bus like USB?

> This was a pretty big goof - the removal of ZONE_DMA whizzed straight
> past my attention, alas.  In fact I never noticed the patch at all
> until I got some conflicts in -next a few weeks later (wasn't cc'ed).
> And then I didn't read the changelog closely enough.

I endeavor to never surprise you again...

To be clear, the patch did not disable ZONE_DMA by default, but it was
indeed a goof to assume that ZONE_DMA was less prevalent than it turns
out to be.

>>  We want a single
>> kernel image to be able to support legacy devices as well as next
>> generation persistent memory platforms.
>
> yup.
>
>> [..]
>
> Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
> Is it possible to just use ZONES_SHIFT=3?

Last I tried I hit this warning in mm/memory.c

#warning Unfortunate NUMA and NUMA Balancing config, growing
page-frame for last_cpupid.

> Also, this "dynamically added pfn of the zone" thing is a new concept
> and I think it should be more completely documented somewhere in the
> code.

Ok, I'll take a look.
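
One possible shape for that documentation, inferred only from the
semantics visible in this patch (illustrative, not wording from any
later revision):

struct zone_span_fields {	/* annotated excerpt, illustrative only */
	/* first pfn of the zone, whether boot memory or hot-added */
	unsigned long zone_start_pfn;
	/*
	 * First pfn hot-added to the zone (e.g. by
	 * devm_memremap_pages()); zero when the zone contains no
	 * hot-added memory.  pfns below zone_start_pfn +
	 * init_spanned_pages were present at boot, so hot-remove
	 * must never shrink the zone below that boundary.
	 */
	unsigned long zone_dyn_start_pfn;
	/* pages spanned at boot; fixed after init */
	unsigned long init_spanned_pages;
	/* pages currently spanned, including hot-added ranges */
	unsigned long spanned_pages;
};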
Andrew Morton Jan. 26, 2016, 10:51 p.m. UTC | #8
On Tue, 26 Jan 2016 14:33:48 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> >> [..]
> >
> > Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
> > Is it possible to just use ZONES_SHIFT=3?
> 
> Last I tried I hit this warning in mm/memory.c
> 
> #warning Unfortunate NUMA and NUMA Balancing config, growing
> page-frame for last_cpupid.

Well yes, it may take a bit of work - perhaps salvaging a bit from
somewhere else if poss.  But that might provide a better overall
solution so could you please have a think?
Dan Williams Jan. 26, 2016, 11:11 p.m. UTC | #9
On Tue, Jan 26, 2016 at 2:51 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 26 Jan 2016 14:33:48 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>
>> >> [..]
>> >
>> > Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
>> > Is it possible to just use ZONES_SHIFT=3?
>>
>> Last I tried I hit this warning in mm/memory.c
>>
>> #warning Unfortunate NUMA and NUMA Balancing config, growing
>> page-frame for last_cpupid.
>
> Well yes, it may take a bit of work - perhaps salvaging a bit from
> somewhere else if poss.  But that might provide a better overall
> solution so could you please have a think?
>

Will do, especially since other efforts are feeling the pinch on the
MAX_NR_ZONES limitation.
Joonsoo Kim Jan. 27, 2016, 1:18 a.m. UTC | #10
Hello,

On Tue, Jan 26, 2016 at 03:11:36PM -0800, Dan Williams wrote:
> On Tue, Jan 26, 2016 at 2:51 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Tue, 26 Jan 2016 14:33:48 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> >> [..]
> >> >
> >> > Well, all these icky tricks are to avoid increasing ZONES_SHIFT, yes?
> >> > Is it possible to just use ZONES_SHIFT=3?
> >>
> >> Last I tried I hit this warning in mm/memory.c
> >>
> >> #warning Unfortunate NUMA and NUMA Balancing config, growing
> >> page-frame for last_cpupid.
> >
> > Well yes, it may take a bit of work - perhaps salvaging a bit from
> > somewhere else if poss.  But that might provide a better overall
> > solution so could you please have a think?
> >
> 
> Will do, especially since other efforts are feeling the pinch on the
> MAX_NR_ZONES limitation.

Please refer to my previous attempt to add a new zone, ZONE_CMA.

https://lkml.org/lkml/2015/2/12/84

It salvages a bit from SECTION_WIDTH by increasing the section size.
Similarly, I guess we can reduce NODE_WIDTH if needed, although
it would reduce the maximum node size.

Thanks.
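
The SECTION_WIDTH point in numbers: with classic sparsemem, the section
number stored in page->flags needs MAX_PHYSMEM_BITS - SECTION_SIZE_BITS
bits, so doubling the section size frees exactly one bit.  The
constants are illustrative x86_64-style values, not any specific
config:

#include <stdio.h>

#define MAX_PHYSMEM_BITS 46	/* assumed physical address space */

int main(void)
{
	for (int size_bits = 27; size_bits <= 28; size_bits++)
		printf("SECTION_SIZE_BITS=%d (%d MiB sections) -> SECTIONS_WIDTH=%d\n",
		       size_bits, 1 << (size_bits - 20),
		       MAX_PHYSMEM_BITS - size_bits);
	return 0;
}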
Dan Williams Jan. 27, 2016, 1:37 a.m. UTC | #11
On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> Hello,
>
> On Tue, Jan 26, 2016 at 03:11:36PM -0800, Dan Williams wrote:
>> [..]
>> >
>> > Well yes, it may take a bit of work - perhaps salvaging a bit from
>> > somewhere else if poss.  But that might provide a better overall
>> > solution so could you please have a think?
>> >
>>
>> Will do, especially since other efforts are feeling the pinch on the
>> MAX_NR_ZONES limitation.
>
> Please refer to my previous attempt to add a new zone, ZONE_CMA.
>
> https://lkml.org/lkml/2015/2/12/84
>
> It salvages a bit from SECTION_WIDTH by increasing the section size.
> Similarly, I guess we can reduce NODE_WIDTH if needed, although
> it would reduce the maximum node size.

Dave pointed out to me that LAST__PID_SHIFT might be a better
candidate to reduce to 7 bits.  That field is for storing pids which
are already bigger than 8 bits.  If it is relying on the fact that
pids don't rollover very often then likely the impact of 7-bits
instead of 8 will be minimal.
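
A toy illustration of the trade-off: stored pids are folded modulo
2^shift, so the (arbitrary) example pids below stay distinct with 8
bits but alias with 7:

#include <stdio.h>

static unsigned int fold_pid(unsigned int pid, int shift)
{
	return pid & ((1u << shift) - 1);	/* keep the low bits only */
}

int main(void)
{
	unsigned int a = 1193, b = 1321;	/* arbitrary example pids */

	for (int shift = 8; shift >= 7; shift--)
		printf("shift=%d: pid %u -> %u, pid %u -> %u%s\n",
		       shift, a, fold_pid(a, shift), b, fold_pid(b, shift),
		       fold_pid(a, shift) == fold_pid(b, shift) ?
				"  (collision)" : "");
	return 0;
}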
Joonsoo Kim Jan. 27, 2016, 2:15 a.m. UTC | #12
On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
> [..]
>
> Dave pointed out to me that LAST__PID_SHIFT might be a better
> candidate to reduce to 7 bits.  That field is for storing pids which
> are already bigger than 8 bits.  If it is relying on the fact that
> pids don't rollover very often then likely the impact of 7-bits
> instead of 8 will be minimal.

Hmm... I'm not sure whether it's possible or not, but it doesn't look
like a general solution. It will solve your problem because you are
using a 64-bit arch, but other 32-bit arches can't get the benefit.

Thanks.
Dan Williams Jan. 27, 2016, 3:23 a.m. UTC | #13
On Tue, Jan 26, 2016 at 6:15 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
>> On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
[..]
>
> Hmm... I'm not sure whether it's possible or not, but it doesn't look
> like a general solution. It will solve your problem because you are
> using a 64-bit arch, but other 32-bit arches can't get the benefit.

This is where the ZONE_CMA and ZONE_DEVICE efforts diverge.
ZONE_DEVICE is meant to enable DMA access to hundreds of gigabytes of
persistent memory.  A 64-bit-only limitation for ZONE_DEVICE is
reasonable.
Joonsoo Kim Jan. 27, 2016, 3:52 a.m. UTC | #14
On Tue, Jan 26, 2016 at 07:23:59PM -0800, Dan Williams wrote:
> [..]
> 
> This is where the ZONE_CMA and ZONE_DEVICE efforts diverge.
> ZONE_DEVICE is meant to enable DMA access to hundreds of gigabytes of
> persistent memory.  A 64-bit-only limitation for ZONE_DEVICE is
> reasonable.

Yes, but my point is that if someone needs another zone, such as
ZONE_CMA, they couldn't get the benefit from this change. They would
need to re-investigate which bits they can reduce and re-do all of
this work.

If it is implemented more generally at this time, it would relieve
their burden and churn the code less. It would be helpful for
maintainability.

Thanks.
Dan Williams Jan. 27, 2016, 4:26 a.m. UTC | #15
On Tue, Jan 26, 2016 at 7:52 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
>> [..]
>>
>> This is where the ZONE_CMA and ZONE_DEVICE efforts diverge.
>> ZONE_DEVICE is meant to enable DMA access to hundreds of gigabytes of
>> persistent memory.  A 64-bit-only limitation for ZONE_DEVICE is
>> reasonable.
>
> Yes, but my point is that if someone needs another zone, such as
> ZONE_CMA, they couldn't get the benefit from this change. They would
> need to re-investigate which bits they can reduce and re-do all of
> this work.
>
> If it is implemented more generally at this time, it would relieve
> their burden and churn the code less. It would be helpful for
> maintainability.

I agree in principle that finding a 32-bit compatible solution is
desirable, but it simply may not be feasible.

For now, I'll help with auditing the existing bits so we can enumerate
the tradeoffs.

Hmm, one tradeoff that comes to mind for 32-bit is sacrificing
ZONE_HIGHMEM for ZONE_CMA.  Are there configurations that need both
enabled?  If a platform needs highmem it really should be using a
64-bit kernel (if possible); desire for ZONE_CMA might be a nice
encouragement to lessen the prevalence of highmem.
Joonsoo Kim Jan. 27, 2016, 5:52 a.m. UTC | #16
On Tue, Jan 26, 2016 at 08:26:24PM -0800, Dan Williams wrote:
> On Tue, Jan 26, 2016 at 7:52 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > On Tue, Jan 26, 2016 at 07:23:59PM -0800, Dan Williams wrote:
> >> On Tue, Jan 26, 2016 at 6:15 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> >> > On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
> >> >> On Tue, Jan 26, 2016 at 5:18 PM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> >> [..]
> >> >> > Please refer my previous attempt to add a new zone, ZONE_CMA.
> >> >> >
> >> >> > https://lkml.org/lkml/2015/2/12/84
> >> >> >
> >> >> > It salvages a bit from SECTION_WIDTH by increasing section size.
> >> >> > Similarly, I guess we can reduce NODE_WIDTH if needed although
> >> >> > it could cause to reduce maximum node size.
> >> >>
> >> >> Dave pointed out to me that LAST__PID_SHIFT might be a better
> >> >> candidate to reduce to 7 bits.  That field is for storing pids which
> >> >> are already bigger than 8 bits.  If it is relying on the fact that
> >> >> pids don't rollover very often then likely the impact of 7-bits
> >> >> instead of 8 will be minimal.
> >> >
> >> > Hmm... I'm not sure it's possible or not, but, it looks not a general
> >> > solution. It will solve your problem because you are using 64 bit arch
> >> > but other 32 bit archs can't get the benefit.
> >>
> >> This is where the ZONE_CMA and ZONE_DEVICE efforts diverge.
> >> ZONE_DEVICE is meant to enable DMA access to hundreds of gigagbytes of
> >> persistent memory.  A 64-bit-only limitation for ZONE_DEVICE is
> >> reasonable.
> >
> > Yes, but, my point is that if someone need another zone like as
> > ZONE_CMA, they couldn't get the benefit from this change. They need to
> > re-investigate what bits they can reduce and need to re-do all things.
> >
> > If it is implemented more generally at this time, it can relieve their
> > burden and less churn the code. It would be helpful for maintainability.
> 
> I agree in principle that finding a 32-bit compatible solution is
> desirable, but it simply may not be feasible.

Okay.

> 
> For now, I'll help with auditing the existing bits so we can enumerate
> the tradeoffs.

Thanks! :)

> Hmm, one tradeoff that comes to mind for 32-bit is sacrificing
> ZONE_HIGHMEM for ZONE_CMA.  Are there configurations that need both
> enabled?  If a platform needs highmem it really should be using a
> 64-bit kernel (if possible); desire for ZONE_CMA might be a nice
> encouragement to lessen the prevalence of highmem.

I guess that it's not possible. There are many systems that need
both.

I haven't thought about it deeply, but there is another option for
ZONE_CMA. It can share ZONE_MOVABLE, because their characteristics
are roughly the same from the MM's point of view. I will think more.

Thanks.
Mel Gorman Jan. 27, 2016, 7:46 a.m. UTC | #17
On Tue, Jan 26, 2016 at 05:37:38PM -0800, Dan Williams wrote:
> [..]
> 
> Dave pointed out to me that LAST__PID_SHIFT might be a better
> candidate to reduce to 7 bits.  That field is for storing pids which
> are already bigger than 8 bits.  If it is relying on the fact that
> pids don't rollover very often then likely the impact of 7-bits
> instead of 8 will be minimal.

It's not relying on the fact that pids don't roll over very often. The
information is used by automatic NUMA balancing to detect if multiple
accesses to data are from the same task or not. Reducing the number of
bits it uses increases the chance that two tasks will both think they are
the data owner and keep migrating it.
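
A toy model of that failure mode, reusing the example pids from the
earlier sketch; this is purely illustrative, not the kernel's cpupid
code. The page remembers the folded id of its last accessor, and a
fault from the same folded id looks private, i.e. like a migration
candidate:

#include <stdio.h>

/* returns 1 when the fault looks like the previous accessor (private) */
static int looks_private(unsigned int *last, unsigned int pid, int bits)
{
	unsigned int folded = pid & ((1u << bits) - 1);
	int same = (*last == folded);

	*last = folded;
	return same;
}

int main(void)
{
	unsigned int pids[] = { 1193, 1321 };	/* alias in 7 bits, not 8 */

	for (int bits = 8; bits >= 7; bits--) {
		unsigned int last = ~0u;	/* no accessor recorded yet */
		int private_faults = 0;

		for (int fault = 0; fault < 8; fault++)
			private_faults +=
				looks_private(&last, pids[fault & 1], bits);
		printf("%d-bit pid field: %d of 8 alternating faults look private\n",
		       bits, private_faults);
	}
	return 0;
}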

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f1cd22f2df1a..b4bccd3d3c41 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -664,12 +664,44 @@  static inline enum zone_type page_zonenum(const struct page *page)
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+extern int page_to_nid(const struct page *page);
+#else
+static inline int page_to_nid(const struct page *page)
+{
+	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
+}
+#endif
+
+static inline struct zone *page_zone(const struct page *page)
+{
+	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 void get_zone_device_page(struct page *page);
 void put_zone_device_page(struct page *page);
 static inline bool is_zone_device_page(const struct page *page)
 {
+#ifndef CONFIG_ZONE_DMA
 	return page_zonenum(page) == ZONE_DEVICE;
+#else /* ZONE_DEVICE == ZONE_DMA */
+	struct zone *zone;
+
+	if (page_zonenum(page) != ZONE_DEVICE)
+		return false;
+
+	/*
+	 * If ZONE_DEVICE is aliased with ZONE_DMA we need to check
+	 * whether this was a dynamically allocated page from
+	 * devm_memremap_pages() by checking against the size of
+	 * ZONE_DMA at boot.
+	 */
+	zone = page_zone(page);
+	if (page_to_pfn(page) <= zone_end_pfn_boot(zone))
+		return false;
+	return true;
+#endif
 }
 #else
 static inline void get_zone_device_page(struct page *page)
@@ -735,15 +767,6 @@  static inline int zone_to_nid(struct zone *zone)
 #endif
 }
 
-#ifdef NODE_NOT_IN_PAGE_FLAGS
-extern int page_to_nid(const struct page *page);
-#else
-static inline int page_to_nid(const struct page *page)
-{
-	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
-}
-#endif
-
 #ifdef CONFIG_NUMA_BALANCING
 static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
@@ -857,11 +880,6 @@  static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-static inline struct zone *page_zone(const struct page *page)
-{
-	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
-}
-
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 33bb1b19273e..a0ef09b7f893 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -288,6 +288,13 @@  enum zone_type {
 	 */
 	ZONE_DMA,
 #endif
+#ifdef CONFIG_ZONE_DEVICE
+#ifndef CONFIG_ZONE_DMA
+	ZONE_DEVICE,
+#else
+	ZONE_DEVICE = ZONE_DMA,
+#endif
+#endif
 #ifdef CONFIG_ZONE_DMA32
 	/*
 	 * x86_64 needs two ZONE_DMAs because it supports devices that are
@@ -314,11 +321,7 @@  enum zone_type {
 	ZONE_HIGHMEM,
 #endif
 	ZONE_MOVABLE,
-#ifdef CONFIG_ZONE_DEVICE
-	ZONE_DEVICE,
-#endif
 	__MAX_NR_ZONES
-
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -379,12 +382,19 @@  struct zone {
 
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
+	/* first dynamically added pfn of the zone */
+	unsigned long		zone_dyn_start_pfn;
 
 	/*
 	 * spanned_pages is the total pages spanned by the zone, including
 	 * holes, which is calculated as:
 	 * 	spanned_pages = zone_end_pfn - zone_start_pfn;
 	 *
+	 * init_spanned_pages is the boot/init time total pages spanned
+	 * by the zone for differentiating statically assigned vs
+	 * dynamically hot added memory to a zone.
+	 * 	init_spanned_pages = init_zone_end_pfn - zone_start_pfn;
+	 *
 	 * present_pages is physical pages existing within the zone, which
 	 * is calculated as:
 	 *	present_pages = spanned_pages - absent_pages(pages in holes);
@@ -423,6 +433,7 @@  struct zone {
 	 */
 	unsigned long		managed_pages;
 	unsigned long		spanned_pages;
+	unsigned long		init_spanned_pages;
 	unsigned long		present_pages;
 
 	const char		*name;
@@ -546,6 +557,11 @@  static inline unsigned long zone_end_pfn(const struct zone *zone)
 	return zone->zone_start_pfn + zone->spanned_pages;
 }
 
+static inline unsigned long zone_end_pfn_boot(const struct zone *zone)
+{
+	return zone->zone_start_pfn + zone->init_spanned_pages;
+}
+
 static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
 {
 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
diff --git a/mm/Kconfig b/mm/Kconfig
index 97a4e06b15c0..08a92a9c8fbd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -652,7 +652,6 @@  config IDLE_PAGE_TRACKING
 config ZONE_DEVICE
 	bool "Device memory (pmem, etc...) hotplug support" if EXPERT
 	default !ZONE_DMA
-	depends on !ZONE_DMA
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on X86_64 #arch_add_memory() comprehends device memory
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4af58a3a8ffa..c3f0ff45bd47 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -300,6 +300,8 @@  static void __meminit grow_zone_span(struct zone *zone, unsigned long start_pfn,
 
 	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
 				zone->zone_start_pfn;
+	if (!zone->zone_dyn_start_pfn || start_pfn < zone->zone_dyn_start_pfn)
+		zone->zone_dyn_start_pfn = start_pfn;
 
 	zone_span_writeunlock(zone);
 }
@@ -601,8 +603,9 @@  static int find_biggest_section_pfn(int nid, struct zone *zone,
 static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 			     unsigned long end_pfn)
 {
-	unsigned long zone_start_pfn = zone->zone_start_pfn;
+	unsigned long zone_start_pfn = zone->zone_dyn_start_pfn;
 	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
+	bool dyn_zone = zone->zone_start_pfn == zone_start_pfn;
 	unsigned long zone_end_pfn = z;
 	unsigned long pfn;
 	struct mem_section *ms;
@@ -619,7 +622,9 @@  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
 						zone_end_pfn);
 		if (pfn) {
-			zone->zone_start_pfn = pfn;
+			if (dyn_zone)
+				zone->zone_start_pfn = pfn;
+			zone->zone_dyn_start_pfn = pfn;
 			zone->spanned_pages = zone_end_pfn - pfn;
 		}
 	} else if (zone_end_pfn == end_pfn) {
@@ -661,8 +666,10 @@  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	}
 
 	/* The zone has no valid section */
-	zone->zone_start_pfn = 0;
-	zone->spanned_pages = 0;
+	if (dyn_zone)
+		zone->zone_start_pfn = 0;
+	zone->zone_dyn_start_pfn = 0;
+	zone->spanned_pages = zone->init_spanned_pages;
 	zone_span_writeunlock(zone);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63358d9f9aa9..2d8b1d602ff3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -209,6 +209,10 @@  EXPORT_SYMBOL(totalram_pages);
 static char * const zone_names[MAX_NR_ZONES] = {
 #ifdef CONFIG_ZONE_DMA
 	 "DMA",
+#else
+#ifdef CONFIG_ZONE_DEVICE
+	 "Device",
+#endif
 #endif
 #ifdef CONFIG_ZONE_DMA32
 	 "DMA32",
@@ -218,9 +222,6 @@  static char * const zone_names[MAX_NR_ZONES] = {
 	 "HighMem",
 #endif
 	 "Movable",
-#ifdef CONFIG_ZONE_DEVICE
-	 "Device",
-#endif
 };
 
 compound_page_dtor * const compound_page_dtors[] = {
@@ -5082,6 +5083,8 @@  static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
 						  node_start_pfn, node_end_pfn,
 						  zholes_size);
 		zone->spanned_pages = size;
+		zone->init_spanned_pages = size;
+		zone->zone_dyn_start_pfn = 0;
 		zone->present_pages = real_size;
 
 		totalpages += size;