Message ID: 155552635098.2015392.5460028594173939000.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State: New, archived
Series: mm: Sub-section memory hotplug support
On Wed, 2019-04-17 at 11:39 -0700, Dan Williams wrote:
> Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
> section active bitmask, each bit representing 2MB (SECTION_SIZE (128M) /
> map_active bitmask length (64)). If it turns out that 2MB is too large
> of an active tracking granularity it is trivial to increase the size of
> the map_active bitmap.
>
> The implications of a partially populated section is that pfn_valid()
> needs to go beyond a valid_section() check and read the sub-section
> active ranges from the bitmask.
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Hi Dan,

I am still going through the patchset but:

> +static unsigned long section_active_mask(unsigned long pfn,
> +                unsigned long nr_pages)
> +{
> +        int idx_start, idx_size;
> +        phys_addr_t start, size;
> +
> +        if (!nr_pages)
> +                return 0;
> +
> +        start = PFN_PHYS(pfn);
> +        size = PFN_PHYS(min(nr_pages, PAGES_PER_SECTION
> +                        - (pfn & ~PAGE_SECTION_MASK)));

We already picked the lowest value in section_active_init(), didn't we?
This min() operation seems redundant to me here.

> +        size = ALIGN(size, SECTION_ACTIVE_SIZE);
> +
> +        idx_start = section_active_index(start);
> +        idx_size = section_active_index(size);
> +
> +        if (idx_size == 0)
> +                return -1;
> +        return ((1UL << idx_size) - 1) << idx_start;
> +}
> +
> +void section_active_init(unsigned long pfn, unsigned long nr_pages)
> +{
> +        int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> +        int i, start_sec = pfn_to_section_nr(pfn);
> +
> +        if (!nr_pages)
> +                return;
> +
> +        for (i = start_sec; i <= end_sec; i++) {
> +                struct mem_section *ms;
> +                unsigned long mask;
> +                unsigned long pfns;

s/pfns/nr_pfns/ instead?

> +                pfns = min(nr_pages, PAGES_PER_SECTION
> +                                - (pfn & ~PAGE_SECTION_MASK));
> +                mask = section_active_mask(pfn, pfns);
> +
> +                ms = __nr_to_section(i);
> +                pr_debug("%s: sec: %d mask: %#018lx\n", __func__, i, mask);
> +                ms->usage->map_active = mask;
> +
> +                pfn += pfns;
> +                nr_pages -= pfns;
> +        }
> +}

Although the code is not very complicated, it could use some comments
here and there, I think.

> +
>  /* Record a memory area against a node. */
>  void __init memory_present(int nid, unsigned long start, unsigned long end)
>  {
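To make the mask arithmetic above concrete, here is a small standalone check (an editor's illustration assuming x86_64 constants, not code from the series): a range starting 8M into a 128M section and covering 16M occupies sub-sections 4..11, so the mask is 0xff0.

#include <assert.h>

#define SECTION_ACTIVE_SIZE (2UL << 20) /* 128M section / 64 mask bits = 2M */

int main(void)
{
        unsigned long start = 8UL << 20;  /* range offset within the section */
        unsigned long size = 16UL << 20;  /* range length, already 2M-aligned */
        int idx_start = start / SECTION_ACTIVE_SIZE;  /* = 4 */
        int idx_size = size / SECTION_ACTIVE_SIZE;    /* = 8 */

        /* same expression section_active_mask() ends with */
        assert((((1UL << idx_size) - 1) << idx_start) == 0xff0);
        return 0;
}

A range covering the whole section lands in the idx_size == 0 branch instead (128M wraps to a section offset of 0) and returns -1, i.e. all 64 bits set.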
On Wed, Apr 17, 2019 at 11:39:11AM -0700, Dan Williams wrote:
> Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
> section active bitmask, each bit representing 2MB (SECTION_SIZE (128M) /
> map_active bitmask length (64)). If it turns out that 2MB is too large
> of an active tracking granularity it is trivial to increase the size of
> the map_active bitmap.
>
> The implications of a partially populated section is that pfn_valid()
> needs to go beyond a valid_section() check and read the sub-section
> active ranges from the bitmask.
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

[...]

> +static unsigned long section_active_mask(unsigned long pfn,
> +                unsigned long nr_pages)
> +{
> +        int idx_start, idx_size;
> +        phys_addr_t start, size;
> +
> +        if (!nr_pages)
> +                return 0;
> +
> +        start = PFN_PHYS(pfn);
> +        size = PFN_PHYS(min(nr_pages, PAGES_PER_SECTION
> +                        - (pfn & ~PAGE_SECTION_MASK)));
> +        size = ALIGN(size, SECTION_ACTIVE_SIZE);

I am probably missing something, and this is more a question than
anything else, but: is there a reason for shifting pfn and pages to get
the address and the size? Could we not operate on pfn/pages directly,
so we do not have to shift every time (even for pfn_section_valid()
calls)? Something like:

#define SUB_SECTION_ACTIVE_PAGES (SECTION_ACTIVE_SIZE / PAGE_SIZE)

static inline int section_active_index(unsigned long pfn)
{
        return (pfn & ~(PAGE_SECTION_MASK)) / SUB_SECTION_ACTIVE_PAGES;
}

> +
> +        idx_start = section_active_index(start);
> +        idx_size = section_active_index(size);
> +
> +        if (idx_size == 0)
> +                return -1;

What about turning that into something more intuitive? Since -1 here
represents a full section, we could define something like:

#define FULL_SECTION (-1UL)

Or a better name; it is just that I find "-1" not really easy to
interpret.

> +        return ((1UL << idx_size) - 1) << idx_start;
> +}
> +
> +void section_active_init(unsigned long pfn, unsigned long nr_pages)
> +{
> +        int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> +        int i, start_sec = pfn_to_section_nr(pfn);
> +
> +        if (!nr_pages)
> +                return;
> +
> +        for (i = start_sec; i <= end_sec; i++) {
> +                struct mem_section *ms;
> +                unsigned long mask;
> +                unsigned long pfns;
> +
> +                pfns = min(nr_pages, PAGES_PER_SECTION
> +                                - (pfn & ~PAGE_SECTION_MASK));
> +                mask = section_active_mask(pfn, pfns);
> +
> +                ms = __nr_to_section(i);
> +                pr_debug("%s: sec: %d mask: %#018lx\n", __func__, i, mask);
> +                ms->usage->map_active = mask;
> +
> +                pfn += pfns;
> +                nr_pages -= pfns;
> +        }
> +}
> +
>  /* Record a memory area against a node. */
>  void __init memory_present(int nid, unsigned long start, unsigned long end)
>  {
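Folding both of the reviewer's suggestions together might look like the sketch below (an editor's sketch, not code from the series; SUB_SECTION_ACTIVE_PAGES and FULL_SECTION are the reviewer's proposed names): the mask is computed purely in pfn units, and the full-section case becomes explicit.

#define SUB_SECTION_ACTIVE_PAGES (SECTION_ACTIVE_SIZE / PAGE_SIZE)
#define FULL_SECTION (-1UL)

static inline int section_active_index(unsigned long pfn)
{
        return (pfn & ~PAGE_SECTION_MASK) / SUB_SECTION_ACTIVE_PAGES;
}

static unsigned long section_active_mask(unsigned long pfn,
                unsigned long nr_pages)
{
        int idx_start, idx_size;

        if (!nr_pages)
                return 0;

        /* clamp to this section, then round up to whole sub-sections */
        nr_pages = min(nr_pages, PAGES_PER_SECTION
                        - (pfn & ~PAGE_SECTION_MASK));
        nr_pages = ALIGN(nr_pages, SUB_SECTION_ACTIVE_PAGES);

        idx_start = section_active_index(pfn);
        idx_size = nr_pages / SUB_SECTION_ACTIVE_PAGES;

        /* full section; also avoids the undefined 1UL << 64 */
        if (idx_size == BITS_PER_LONG)
                return FULL_SECTION;
        return ((1UL << idx_size) - 1) << idx_start;
}

This would keep pfn_section_valid() shift-free as well, since it could index by pfn directly.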
On Wed, Apr 17, 2019 at 2:53 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
> section active bitmask, each bit representing 2MB (SECTION_SIZE (128M) /
> map_active bitmask length (64)). If it turns out that 2MB is too large
> of an active tracking granularity it is trivial to increase the size of
> the map_active bitmap.

Please mention that this is 2M on Intel, and 16M on Arm64.

>
> The implications of a partially populated section is that pfn_valid()
> needs to go beyond a valid_section() check and read the sub-section
> active ranges from the bitmask.
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/mmzone.h |   29 ++++++++++++++++++++++++++++-
>  mm/page_alloc.c        |    4 +++-
>  mm/sparse.c            |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 79 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6726fc175b51..cffde898e345 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1175,6 +1175,8 @@ struct mem_section_usage {
>          unsigned long pageblock_flags[0];
>  };
>
> +void section_active_init(unsigned long pfn, unsigned long nr_pages);
> +
>  struct page;
>  struct page_ext;
>  struct mem_section {
> @@ -1312,12 +1314,36 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn)
>
>  extern int __highest_present_section_nr;
>
> +static inline int section_active_index(phys_addr_t phys)
> +{
> +        return (phys & ~(PA_SECTION_MASK)) / SECTION_ACTIVE_SIZE;

How about also defining SECTION_ACTIVE_SHIFT like this:

/* BITS_PER_LONG = 2^6 */
#define BITS_PER_LONG_SHIFT 6
#define SECTION_ACTIVE_SHIFT (SECTION_SIZE_BITS - BITS_PER_LONG_SHIFT)
#define SECTION_ACTIVE_SIZE (1 << SECTION_ACTIVE_SHIFT)

The return above would become:

return (phys & ~(PA_SECTION_MASK)) >> SECTION_ACTIVE_SHIFT;

> +}
> +
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> +{
> +        int idx = section_active_index(PFN_PHYS(pfn));
> +
> +        return !!(ms->usage->map_active & (1UL << idx));
> +}
> +#else
> +static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> +{
> +        return 1;
> +}
> +#endif
> +
>  #ifndef CONFIG_HAVE_ARCH_PFN_VALID
>  static inline int pfn_valid(unsigned long pfn)
>  {
> +        struct mem_section *ms;
> +
>          if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>                  return 0;
> -        return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
> +        ms = __nr_to_section(pfn_to_section_nr(pfn));
> +        if (!valid_section(ms))
> +                return 0;
> +        return pfn_section_valid(ms, pfn);
>  }
>  #endif
>
> @@ -1349,6 +1375,7 @@ void sparse_init(void);
>  #define sparse_init()   do {} while (0)
>  #define sparse_index_init(_sec, _nid)   do {} while (0)
>  #define pfn_present pfn_valid
> +#define section_active_init(_pfn, _nr_pages) do {} while (0)
>  #endif /* CONFIG_SPARSEMEM */
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f671401a7c0b..c9ad28a78018 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7273,10 +7273,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>
>          /* Print out the early node map */
>          pr_info("Early memory node ranges\n");
> -        for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
> +        for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>                  pr_info("  node %3d: [mem %#018Lx-%#018Lx]\n", nid,
>                          (u64)start_pfn << PAGE_SHIFT,
>                          ((u64)end_pfn << PAGE_SHIFT) - 1);
> +                section_active_init(start_pfn, end_pfn - start_pfn);
> +        }
>
>          /* Initialise every node */
>          mminit_verify_pageflags_layout();
> diff --git a/mm/sparse.c b/mm/sparse.c
> index f87de7ad32c8..5ef2f884c4e1 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -210,6 +210,54 @@ static inline unsigned long first_present_section_nr(void)
>          return next_present_section_nr(-1);
>  }
>
> +static unsigned long section_active_mask(unsigned long pfn,
> +                unsigned long nr_pages)
> +{
> +        int idx_start, idx_size;
> +        phys_addr_t start, size;
> +
> +        if (!nr_pages)
> +                return 0;
> +
> +        start = PFN_PHYS(pfn);
> +        size = PFN_PHYS(min(nr_pages, PAGES_PER_SECTION
> +                        - (pfn & ~PAGE_SECTION_MASK)));
> +        size = ALIGN(size, SECTION_ACTIVE_SIZE);
> +
> +        idx_start = section_active_index(start);
> +        idx_size = section_active_index(size);
> +
> +        if (idx_size == 0)
> +                return -1;
> +        return ((1UL << idx_size) - 1) << idx_start;
> +}
> +
> +void section_active_init(unsigned long pfn, unsigned long nr_pages)
> +{
> +        int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> +        int i, start_sec = pfn_to_section_nr(pfn);
> +
> +        if (!nr_pages)
> +                return;
> +
> +        for (i = start_sec; i <= end_sec; i++) {
> +                struct mem_section *ms;
> +                unsigned long mask;
> +                unsigned long pfns;
> +
> +                pfns = min(nr_pages, PAGES_PER_SECTION
> +                                - (pfn & ~PAGE_SECTION_MASK));
> +                mask = section_active_mask(pfn, pfns);
> +
> +                ms = __nr_to_section(i);
> +                pr_debug("%s: sec: %d mask: %#018lx\n", __func__, i, mask);
> +                ms->usage->map_active = mask;
> +
> +                pfn += pfns;
> +                nr_pages -= pfns;
> +        }
> +}

For some reason the above code is confusing to me. It seems all the
code is supposed to do is set all map_active to -1, and trim the first
and last sections (which can be the same section, of course). So, I
would replace the above two functions with one function like this:

void section_active_init(unsigned long pfn, unsigned long nr_pages)
{
        int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
        int i, idx, start_sec = pfn_to_section_nr(pfn);
        struct mem_section *ms;

        if (!nr_pages)
                return;

        for (i = start_sec; i <= end_sec; i++) {
                ms = __nr_to_section(i);
                ms->usage->map_active = ~0ul;
        }

        /* Might need to trim active pfns from the beginning and end */
        idx = section_active_index(PFN_PHYS(pfn));
        ms = __nr_to_section(start_sec);
        ms->usage->map_active &= (~0ul << idx);

        idx = section_active_index(PFN_PHYS(pfn + nr_pages - 1));
        ms = __nr_to_section(end_sec);
        ms->usage->map_active &= (~0ul >> (BITS_PER_LONG - idx - 1));
}
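As an aside on Pavel's 2M/16M numbers (an editor's note, using the SECTION_SIZE_BITS values current at the time of this thread): the per-bit granularity is simply SECTION_SIZE / BITS_PER_LONG, which these compile-time checks spell out.

/* granularity per map_active bit = SECTION_SIZE / BITS_PER_LONG */
_Static_assert((1UL << 27) / 64 == (2UL << 20),
               "x86_64: SECTION_SIZE_BITS = 27 -> 2M per bit");
_Static_assert((1UL << 30) / 64 == (16UL << 20),
               "arm64: SECTION_SIZE_BITS = 30 -> 16M per bit");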
On Thu, May 2, 2019 at 9:12 AM Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Wed, Apr 17, 2019 at 2:53 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
> > section active bitmask, each bit representing 2MB (SECTION_SIZE (128M) /
> > map_active bitmask length (64)). If it turns out that 2MB is too large
> > of an active tracking granularity it is trivial to increase the size of
> > the map_active bitmap.
>
> Please mention that this is 2M on Intel, and 16M on Arm64.
>
[...]
>
> For some reason the above code is confusing to me. It seems all the
> code is supposed to do is set all map_active to -1, and trim the first
> and last sections (which can be the same section, of course). So, I
> would replace the above two functions with one function like this:
>
> void section_active_init(unsigned long pfn, unsigned long nr_pages)
> {
>         int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
>         int i, idx, start_sec = pfn_to_section_nr(pfn);
>         struct mem_section *ms;
>
>         if (!nr_pages)
>                 return;
>
>         for (i = start_sec; i <= end_sec; i++) {
>                 ms = __nr_to_section(i);
>                 ms->usage->map_active = ~0ul;
>         }
>
>         /* Might need to trim active pfns from the beginning and end */
>         idx = section_active_index(PFN_PHYS(pfn));
>         ms = __nr_to_section(start_sec);
>         ms->usage->map_active &= (~0ul << idx);
>
>         idx = section_active_index(PFN_PHYS(pfn + nr_pages - 1));
>         ms = __nr_to_section(end_sec);
>         ms->usage->map_active &= (~0ul >> (BITS_PER_LONG - idx - 1));
> }

I like the cleanup, but one of the fixes in v7 resulted in the
realization that a given section may be populated twice at init time.
For example, enabling that pr_debug() yields:

    section_active_init: sec: 12 mask: 0x00000003ffffffff
    section_active_init: sec: 12 mask: 0xe000000000000000

So, the implementation can't blindly clear bits based on the current
parameters. However, I'm switching this code over to use bitmap_*()
helpers which should help with the readability.
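A rough sketch of what that bitmap_*() rework could look like (an editor's sketch, assuming map_active stays a single unsigned long as in this patch; the actual v8 code may differ): the key change is OR-ing new bits in with bitmap_set(), so a second init call covering section 12 accumulates bits instead of overwriting the first call's mask.

void section_active_init(unsigned long pfn, unsigned long nr_pages)
{
        while (nr_pages) {
                /* pages of this range that fall into the current section */
                unsigned long pfns = min(nr_pages, PAGES_PER_SECTION
                                - (pfn & ~PAGE_SECTION_MASK));
                struct mem_section *ms = __pfn_to_section(pfn);
                int start = section_active_index(PFN_PHYS(pfn));
                int end = section_active_index(PFN_PHYS(pfn + pfns - 1));

                /*
                 * OR the new sub-sections in; an earlier call covering
                 * another part of this section keeps its bits.
                 */
                bitmap_set(&ms->usage->map_active, start, end - start + 1);

                pfn += pfns;
                nr_pages -= pfns;
        }
}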
On Sat, May 4, 2019, 3:26 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Thu, May 2, 2019 at 9:12 AM Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> >
[...]
>
> I like the cleanup, but one of the fixes in v7 resulted in the
> realization that a given section may be populated twice at init time.
> For example, enabling that pr_debug() yields:
>
>     section_active_init: sec: 12 mask: 0x00000003ffffffff
>     section_active_init: sec: 12 mask: 0xe000000000000000
>
> So, the implementation can't blindly clear bits based on the current
> parameters. However, I'm switching this code over to use bitmap_*()
> helpers which should help with the readability.

Yes, bitmap_* will help, and I assume you will also make active_map
size scalable?

I will take another look at version 8.

Thank you,
Pasha
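On the "scalable" question, one direction would be to size the bitmap from the section geometry instead of hard-coding a single unsigned long (an editor's sketch with assumed names, not code from this series):

/*
 * Sketch: derive the map size from the section and sub-section
 * geometry so architectures with larger sections, or a finer
 * granularity, get a bigger bitmap automatically.
 */
#define SUBSECTION_SHIFT 21     /* 2M sub-sections, as in this patch */
#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

struct mem_section_usage {
        DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
        /* See declaration of similar field in struct zone */
        unsigned long pageblock_flags[0];
};

With DECLARE_BITMAP() the map is still one word on x86_64 (64 sub-sections of 2M per 128M section), but no longer limited to BITS_PER_LONG entries.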
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6726fc175b51..cffde898e345 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1175,6 +1175,8 @@ struct mem_section_usage {
 	unsigned long pageblock_flags[0];
 };
 
+void section_active_init(unsigned long pfn, unsigned long nr_pages);
+
 struct page;
 struct page_ext;
 struct mem_section {
@@ -1312,12 +1314,36 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn)
 
 extern int __highest_present_section_nr;
 
+static inline int section_active_index(phys_addr_t phys)
+{
+	return (phys & ~(PA_SECTION_MASK)) / SECTION_ACTIVE_SIZE;
+}
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+	int idx = section_active_index(PFN_PHYS(pfn));
+
+	return !!(ms->usage->map_active & (1UL << idx));
+}
+#else
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+	return 1;
+}
+#endif
+
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
 static inline int pfn_valid(unsigned long pfn)
 {
+	struct mem_section *ms;
+
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
-	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+	ms = __nr_to_section(pfn_to_section_nr(pfn));
+	if (!valid_section(ms))
+		return 0;
+	return pfn_section_valid(ms, pfn);
 }
 #endif
 
@@ -1349,6 +1375,7 @@ void sparse_init(void);
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)	do {} while (0)
 #define pfn_present pfn_valid
+#define section_active_init(_pfn, _nr_pages) do {} while (0)
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f671401a7c0b..c9ad28a78018 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7273,10 +7273,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 
 	/* Print out the early node map */
 	pr_info("Early memory node ranges\n");
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
 		pr_info("  node %3d: [mem %#018Lx-%#018Lx]\n", nid,
 			(u64)start_pfn << PAGE_SHIFT,
 			((u64)end_pfn << PAGE_SHIFT) - 1);
+		section_active_init(start_pfn, end_pfn - start_pfn);
+	}
 
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
diff --git a/mm/sparse.c b/mm/sparse.c
index f87de7ad32c8..5ef2f884c4e1 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -210,6 +210,54 @@ static inline unsigned long first_present_section_nr(void)
 	return next_present_section_nr(-1);
 }
 
+static unsigned long section_active_mask(unsigned long pfn,
+		unsigned long nr_pages)
+{
+	int idx_start, idx_size;
+	phys_addr_t start, size;
+
+	if (!nr_pages)
+		return 0;
+
+	start = PFN_PHYS(pfn);
+	size = PFN_PHYS(min(nr_pages, PAGES_PER_SECTION
+				- (pfn & ~PAGE_SECTION_MASK)));
+	size = ALIGN(size, SECTION_ACTIVE_SIZE);
+
+	idx_start = section_active_index(start);
+	idx_size = section_active_index(size);
+
+	if (idx_size == 0)
+		return -1;
+	return ((1UL << idx_size) - 1) << idx_start;
+}
+
+void section_active_init(unsigned long pfn, unsigned long nr_pages)
+{
+	int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
+	int i, start_sec = pfn_to_section_nr(pfn);
+
+	if (!nr_pages)
+		return;
+
+	for (i = start_sec; i <= end_sec; i++) {
+		struct mem_section *ms;
+		unsigned long mask;
+		unsigned long pfns;
+
+		pfns = min(nr_pages, PAGES_PER_SECTION
+				- (pfn & ~PAGE_SECTION_MASK));
+		mask = section_active_mask(pfn, pfns);
+
+		ms = __nr_to_section(i);
+		pr_debug("%s: sec: %d mask: %#018lx\n", __func__, i, mask);
+		ms->usage->map_active = mask;
+
+		pfn += pfns;
+		nr_pages -= pfns;
+	}
+}
+
 /* Record a memory area against a node. */
 void __init memory_present(int nid, unsigned long start, unsigned long end)
 {
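For readers who want to poke at the mask values, here is a minimal userspace replica of the math above (an editor's illustration assuming x86_64 constants; it mirrors section_active_mask() but is not kernel code). The two prints reproduce the mask values seen in Dan's pr_debug output.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27	/* 128M sections on x86_64 */
#define PAGES_PER_SECTION	(1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))
#define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION - 1))
#define PA_SECTION_MASK		(~((1UL << SECTION_SIZE_BITS) - 1))
#define SECTION_ACTIVE_SIZE	(1UL << (SECTION_SIZE_BITS - 6))	/* 2M */
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((uint64_t)(a) - 1))
#define MIN(a, b)		((a) < (b) ? (a) : (b))

static int section_active_index(uint64_t phys)
{
	return (phys & ~PA_SECTION_MASK) / SECTION_ACTIVE_SIZE;
}

/* same steps as the patch: clamp to the section, align, build the mask */
static unsigned long section_active_mask(unsigned long pfn,
		unsigned long nr_pages)
{
	uint64_t start, size;
	int idx_start, idx_size;

	if (!nr_pages)
		return 0;
	start = (uint64_t)pfn << PAGE_SHIFT;
	size = (uint64_t)MIN(nr_pages, PAGES_PER_SECTION
			- (pfn & ~PAGE_SECTION_MASK)) << PAGE_SHIFT;
	size = ALIGN(size, SECTION_ACTIVE_SIZE);

	idx_start = section_active_index(start);
	idx_size = section_active_index(size);
	if (idx_size == 0)
		return -1;	/* full section */
	return ((1UL << idx_size) - 1) << idx_start;
}

int main(void)
{
	/* 68M at a section boundary: bits 0..33 -> 0x00000003ffffffff */
	printf("%#018lx\n", section_active_mask(0, (68UL << 20) >> PAGE_SHIFT));
	/* 6M at the top of a section: bits 61..63 -> 0xe000000000000000 */
	printf("%#018lx\n", section_active_mask((122UL << 20) >> PAGE_SHIFT,
						(6UL << 20) >> PAGE_SHIFT));
	return 0;
}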
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
section active bitmask, each bit representing 2MB (SECTION_SIZE (128M) /
map_active bitmask length (64)). If it turns out that 2MB is too large
of an active tracking granularity, it is trivial to increase the size of
the map_active bitmap.

The implication of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and read the sub-section
active ranges from the bitmask.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mmzone.h |   29 ++++++++++++++++++++++++++++-
 mm/page_alloc.c        |    4 +++-
 mm/sparse.c            |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 79 insertions(+), 2 deletions(-)