Message ID | 20200623094258.6705-1-richard.weiyang@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm/spase: never partially remove memmap for early section | expand |
On 23.06.20 11:42, Wei Yang wrote: > For early sections, we assumes its memmap will never be partially > removed. But current behavior breaks this. > > Let's correct it. > > Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") > Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> > --- > mm/sparse.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/sparse.c b/mm/sparse.c > index b2b9a3e34696..1a0069f492f5 100644 > --- a/mm/sparse.c > +++ b/mm/sparse.c > @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, > ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; > } > > - if (section_is_early && memmap) > - free_map_bootmem(memmap); > - else > + if (!section_is_early) > depopulate_section_memmap(pfn, nr_pages, altmap); > + else if (memmap) > + free_map_bootmem(memmap); > > if (empty) > ms->section_mem_map = (unsigned long)NULL; > Agreed, that's what pfn_valid() and section_activate() expect. "If we hot-add memory into such a section then we do not need to populate the memmap and can simply reuse what is already there." - this is the case when hot-adding sub-sections into partially populated early sections, and has to be the case when re-hot-adding after hot-removing. Acked-by: David Hildenbrand <david@redhat.com> I am also not convinced that the complicated sparse_decode_mem_map() handling in that function is required - ms->section_mem_map & SECTION_MAP_MASK is sufficient for this use case of removing the memmap of a full early section once empty.
On Tue, Jun 23, 2020 at 02:44:02PM +0200, David Hildenbrand wrote: >On 23.06.20 11:42, Wei Yang wrote: >> For early sections, we assumes its memmap will never be partially >> removed. But current behavior breaks this. >> >> Let's correct it. >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> >> --- >> mm/sparse.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/mm/sparse.c b/mm/sparse.c >> index b2b9a3e34696..1a0069f492f5 100644 >> --- a/mm/sparse.c >> +++ b/mm/sparse.c >> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, >> ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; >> } >> >> - if (section_is_early && memmap) >> - free_map_bootmem(memmap); >> - else >> + if (!section_is_early) >> depopulate_section_memmap(pfn, nr_pages, altmap); >> + else if (memmap) >> + free_map_bootmem(memmap); >> >> if (empty) >> ms->section_mem_map = (unsigned long)NULL; >> > >Agreed, that's what pfn_valid() and section_activate() expect. > >"If we hot-add memory into such a section then we do not need to >populate the memmap and can simply reuse what is already there." - this >is the case when hot-adding sub-sections into partially populated early >sections, and has to be the case when re-hot-adding after hot-removing. > >Acked-by: David Hildenbrand <david@redhat.com> > > >I am also not convinced that the complicated sparse_decode_mem_map() >handling in that function is required - ms->section_mem_map & >SECTION_MAP_MASK is sufficient for this use case of removing the memmap >of a full early section once empty. > You mean remove this line? memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); Then what to passed to free_map_bootmem() ? >-- >Thanks, > >David / dhildenb
On 23.06.20 15:02, Wei Yang wrote: > On Tue, Jun 23, 2020 at 02:44:02PM +0200, David Hildenbrand wrote: >> On 23.06.20 11:42, Wei Yang wrote: >>> For early sections, we assumes its memmap will never be partially >>> removed. But current behavior breaks this. >>> >>> Let's correct it. >>> >>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> >>> --- >>> mm/sparse.c | 6 +++--- >>> 1 file changed, 3 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/sparse.c b/mm/sparse.c >>> index b2b9a3e34696..1a0069f492f5 100644 >>> --- a/mm/sparse.c >>> +++ b/mm/sparse.c >>> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, >>> ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; >>> } >>> >>> - if (section_is_early && memmap) >>> - free_map_bootmem(memmap); >>> - else >>> + if (!section_is_early) >>> depopulate_section_memmap(pfn, nr_pages, altmap); >>> + else if (memmap) >>> + free_map_bootmem(memmap); >>> >>> if (empty) >>> ms->section_mem_map = (unsigned long)NULL; >>> >> >> Agreed, that's what pfn_valid() and section_activate() expect. >> >> "If we hot-add memory into such a section then we do not need to >> populate the memmap and can simply reuse what is already there." - this >> is the case when hot-adding sub-sections into partially populated early >> sections, and has to be the case when re-hot-adding after hot-removing. >> >> Acked-by: David Hildenbrand <david@redhat.com> >> >> >> I am also not convinced that the complicated sparse_decode_mem_map() >> handling in that function is required - ms->section_mem_map & >> SECTION_MAP_MASK is sufficient for this use case of removing the memmap >> of a full early section once empty. >> > > You mean remove this line? > > memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); > > Then what to passed to free_map_bootmem() ? Never mind, I misread something, sparse_decode_mem_map() is indeed necessary.
On Tue 23-06-20 17:42:58, Wei Yang wrote: > For early sections, we assumes its memmap will never be partially > removed. But current behavior breaks this. > > Let's correct it. > > Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") > Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Can a user trigger this or is this a theoretical bug? > --- > mm/sparse.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/sparse.c b/mm/sparse.c > index b2b9a3e34696..1a0069f492f5 100644 > --- a/mm/sparse.c > +++ b/mm/sparse.c > @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, > ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; > } > > - if (section_is_early && memmap) > - free_map_bootmem(memmap); > - else > + if (!section_is_early) This begs a comment. > depopulate_section_memmap(pfn, nr_pages, altmap); > + else if (memmap) > + free_map_bootmem(memmap); > > if (empty) > ms->section_mem_map = (unsigned long)NULL; > -- > 2.20.1 (Apple Git-117) >
On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote: >On Tue 23-06-20 17:42:58, Wei Yang wrote: >> For early sections, we assumes its memmap will never be partially >> removed. But current behavior breaks this. >> >> Let's correct it. >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> > >Can a user trigger this or is this a theoretical bug? > I don't expect to have a non-section aligned system, so this is a theoretical bug to me. >> --- >> mm/sparse.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/mm/sparse.c b/mm/sparse.c >> index b2b9a3e34696..1a0069f492f5 100644 >> --- a/mm/sparse.c >> +++ b/mm/sparse.c >> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, >> ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; >> } >> >> - if (section_is_early && memmap) >> - free_map_bootmem(memmap); >> - else >> + if (!section_is_early) > >This begs a comment. > Like: /* Only depopulate sub-section memmap for non early section. */ Looks good to you? >> depopulate_section_memmap(pfn, nr_pages, altmap); >> + else if (memmap) >> + free_map_bootmem(memmap); >> >> if (empty) >> ms->section_mem_map = (unsigned long)NULL; >> -- >> 2.20.1 (Apple Git-117) >> > >-- >Michal Hocko >SUSE Labs
On Tue, Jun 23, 2020 at 2:43 AM Wei Yang <richard.weiyang@linux.alibaba.com> wrote: > > For early sections, we assumes its memmap will never be partially > removed. But current behavior breaks this. Where do we assume that? The primary use case for this was mapping pmem that collides with System-RAM in the same 128MB section. That collision will certainly be depopulated on-demand depending on the state of the pmem device. So, I'm not understanding the problem or the benefit of this change.
On Tue, Jun 23, 2020 at 05:21:06PM -0700, Dan Williams wrote: >On Tue, Jun 23, 2020 at 2:43 AM Wei Yang ><richard.weiyang@linux.alibaba.com> wrote: >> >> For early sections, we assumes its memmap will never be partially >> removed. But current behavior breaks this. > >Where do we assume that? > >The primary use case for this was mapping pmem that collides with >System-RAM in the same 128MB section. That collision will certainly be >depopulated on-demand depending on the state of the pmem device. So, >I'm not understanding the problem or the benefit of this change. Hi, Dan There is a discussion in the thread you just replied: mm/shuffle: don't move pages between zones and don't read garbage memmaps Besides this, the comment in section_activate() says: * The early init code does not consider partially populated * initial sections, it simply assumes that memory will never be * referenced. If we hot-add memory into such a section then we * do not need to populate the memmap and can simply reuse what * is already there. Per my understanding, if we hot-add then hot-remove the sub-section, we may not have a valid memmep for this part sub-section? Because we depopulate it at hot-remove while we don't populate it when hot-add. Is my understanding correct?
On 06/23/20 at 05:21pm, Dan Williams wrote: > On Tue, Jun 23, 2020 at 2:43 AM Wei Yang > <richard.weiyang@linux.alibaba.com> wrote: > > > > For early sections, we assumes its memmap will never be partially > > removed. But current behavior breaks this. > > Where do we assume that? > > The primary use case for this was mapping pmem that collides with > System-RAM in the same 128MB section. That collision will certainly be > depopulated on-demand depending on the state of the pmem device. So, > I'm not understanding the problem or the benefit of this change. I was also confused when review this patch, the patch log is a little short and simple. From the current code, with SPARSE_VMEMMAP enabled, we do build memmap for the whole memory section during boot, even though some of them may be partially populated. We just mark the subsection map for present pages. Later, if pmem device is mapped into the partially boot memory section, we just fill the relevant subsection map, do return directly, w/o building the memmap for it, in section_activate(). Because the memmap for the unpresent RAM part have been there. I guess this is what Wei is trying to do to keep the behaviour be consistent for pmem device adding, or pmem device removing and later adding again. Please correct me if I am wrong. To me, fixing it looks good. But a clear doc or code comment is necessary so that people can understand the code with less time. Leaving it as is doesn't cause harm. I personally tend to choose the former. paging_init() ->sparse_init() ->sparse_init_nid() { ... for_each_present_section_nr(pnum_begin, pnum) { ... map = __populate_section_memmap(pfn, PAGES_PER_SECTION, nid, NULL); ... } } ... ->zone_sizes_init() ->free_area_init() { for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { subsection_map_init(start_pfn, end_pfn - start_pfn); } { __add_pages() ->sparse_add_section() ->section_activate() { ... fill_subsection_map(); if (nr_pages < PAGES_PER_SECTION && early_section(ms)) <----------********* return pfn_to_page(pfn); ... } >
On 06/24/20 at 09:47am, Baoquan He wrote: > On 06/23/20 at 05:21pm, Dan Williams wrote: > > On Tue, Jun 23, 2020 at 2:43 AM Wei Yang > > <richard.weiyang@linux.alibaba.com> wrote: > > > > > > For early sections, we assumes its memmap will never be partially > > > removed. But current behavior breaks this. > > > > Where do we assume that? > > > > The primary use case for this was mapping pmem that collides with > > System-RAM in the same 128MB section. That collision will certainly be > > depopulated on-demand depending on the state of the pmem device. So, > > I'm not understanding the problem or the benefit of this change. > > I was also confused when review this patch, the patch log is a little > short and simple. From the current code, with SPARSE_VMEMMAP enabled, we > do build memmap for the whole memory section during boot, even though > some of them may be partially populated. We just mark the subsection map > for present pages. > > Later, if pmem device is mapped into the partially boot memory section, > we just fill the relevant subsection map, do return directly, w/o building > the memmap for it, in section_activate(). Because the memmap for the > unpresent RAM part have been there. I guess this is what Wei is trying to > do to keep the behaviour be consistent for pmem device adding, or OK, from Wei's reply I realized this patch is a necessary fix. If we depoluate the partial memmap for pmem removing, the later pmem re-adding won't have a valid memmap. > pmem device removing and later adding again. > > Please correct me if I am wrong. > > To me, fixing it looks good. But a clear doc or code comment is > necessary so that people can understand the code with less time. > Leaving it as is doesn't cause harm. I personally tend to choose > the former. > > paging_init() > ->sparse_init() > ->sparse_init_nid() > { > ... > for_each_present_section_nr(pnum_begin, pnum) { > ... > map = __populate_section_memmap(pfn, PAGES_PER_SECTION, > nid, NULL); > ... > } > } > ... > ->zone_sizes_init() > ->free_area_init() > { > for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { > subsection_map_init(start_pfn, end_pfn - start_pfn); > } > { > > __add_pages() > ->sparse_add_section() > ->section_activate() > { > ... > fill_subsection_map(); > if (nr_pages < PAGES_PER_SECTION && early_section(ms)) <----------********* > return pfn_to_page(pfn); > ... > } > > > >
On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote: >On 06/23/20 at 05:21pm, Dan Williams wrote: >> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang >> <richard.weiyang@linux.alibaba.com> wrote: >> > >> > For early sections, we assumes its memmap will never be partially >> > removed. But current behavior breaks this. >> >> Where do we assume that? >> >> The primary use case for this was mapping pmem that collides with >> System-RAM in the same 128MB section. That collision will certainly be >> depopulated on-demand depending on the state of the pmem device. So, >> I'm not understanding the problem or the benefit of this change. > >I was also confused when review this patch, the patch log is a little >short and simple. From the current code, with SPARSE_VMEMMAP enabled, we >do build memmap for the whole memory section during boot, even though >some of them may be partially populated. We just mark the subsection map >for present pages. > >Later, if pmem device is mapped into the partially boot memory section, >we just fill the relevant subsection map, do return directly, w/o building >the memmap for it, in section_activate(). Because the memmap for the >unpresent RAM part have been there. I guess this is what Wei is trying to >do to keep the behaviour be consistent for pmem device adding, or >pmem device removing and later adding again. > >Please correct me if I am wrong. You are right here. > >To me, fixing it looks good. But a clear doc or code comment is >necessary so that people can understand the code with less time. >Leaving it as is doesn't cause harm. I personally tend to choose >the former. > The former is to add a clear doc? > paging_init() > ->sparse_init() > ->sparse_init_nid() > { > ... > for_each_present_section_nr(pnum_begin, pnum) { > ... > map = __populate_section_memmap(pfn, PAGES_PER_SECTION, > nid, NULL); > ... > } > } > ... > ->zone_sizes_init() > ->free_area_init() > { > for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { > subsection_map_init(start_pfn, end_pfn - start_pfn); > } > { > > __add_pages() > ->sparse_add_section() > ->section_activate() > { > ... > fill_subsection_map(); > if (nr_pages < PAGES_PER_SECTION && early_section(ms)) <----------********* > return pfn_to_page(pfn); > ... > } >>
On 06/24/20 at 11:46am, Wei Yang wrote: > On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote: > >On 06/23/20 at 05:21pm, Dan Williams wrote: > >> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang > >> <richard.weiyang@linux.alibaba.com> wrote: > >> > > >> > For early sections, we assumes its memmap will never be partially > >> > removed. But current behavior breaks this. > >> > >> Where do we assume that? > >> > >> The primary use case for this was mapping pmem that collides with > >> System-RAM in the same 128MB section. That collision will certainly be > >> depopulated on-demand depending on the state of the pmem device. So, > >> I'm not understanding the problem or the benefit of this change. > > > >I was also confused when review this patch, the patch log is a little > >short and simple. From the current code, with SPARSE_VMEMMAP enabled, we > >do build memmap for the whole memory section during boot, even though > >some of them may be partially populated. We just mark the subsection map > >for present pages. > > > >Later, if pmem device is mapped into the partially boot memory section, > >we just fill the relevant subsection map, do return directly, w/o building > >the memmap for it, in section_activate(). Because the memmap for the > >unpresent RAM part have been there. I guess this is what Wei is trying to > >do to keep the behaviour be consistent for pmem device adding, or > >pmem device removing and later adding again. > > > >Please correct me if I am wrong. > > You are right here. > > > > >To me, fixing it looks good. But a clear doc or code comment is > >necessary so that people can understand the code with less time. > >Leaving it as is doesn't cause harm. I personally tend to choose > >the former. > > > > The former is to add a clear doc? Sorry for the confusion. The former means the fix in your patch. Maybe a improved log and some code comment adding can make it more perfect. > > > paging_init() > > ->sparse_init() > > ->sparse_init_nid() > > { > > ... > > for_each_present_section_nr(pnum_begin, pnum) { > > ... > > map = __populate_section_memmap(pfn, PAGES_PER_SECTION, > > nid, NULL); > > ... > > } > > } > > ... > > ->zone_sizes_init() > > ->free_area_init() > > { > > for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { > > subsection_map_init(start_pfn, end_pfn - start_pfn); > > } > > { > > > > __add_pages() > > ->sparse_add_section() > > ->section_activate() > > { > > ... > > fill_subsection_map(); > > if (nr_pages < PAGES_PER_SECTION && early_section(ms)) <----------********* > > return pfn_to_page(pfn); > > ... > > } > >> > > -- > Wei Yang > Help you, Help me >
On Wed, Jun 24, 2020 at 11:52:36AM +0800, Baoquan He wrote: >On 06/24/20 at 11:46am, Wei Yang wrote: >> On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote: >> >On 06/23/20 at 05:21pm, Dan Williams wrote: >> >> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang >> >> <richard.weiyang@linux.alibaba.com> wrote: >> >> > >> >> > For early sections, we assumes its memmap will never be partially >> >> > removed. But current behavior breaks this. >> >> >> >> Where do we assume that? >> >> >> >> The primary use case for this was mapping pmem that collides with >> >> System-RAM in the same 128MB section. That collision will certainly be >> >> depopulated on-demand depending on the state of the pmem device. So, >> >> I'm not understanding the problem or the benefit of this change. >> > >> >I was also confused when review this patch, the patch log is a little >> >short and simple. From the current code, with SPARSE_VMEMMAP enabled, we >> >do build memmap for the whole memory section during boot, even though >> >some of them may be partially populated. We just mark the subsection map >> >for present pages. >> > >> >Later, if pmem device is mapped into the partially boot memory section, >> >we just fill the relevant subsection map, do return directly, w/o building >> >the memmap for it, in section_activate(). Because the memmap for the >> >unpresent RAM part have been there. I guess this is what Wei is trying to >> >do to keep the behaviour be consistent for pmem device adding, or >> >pmem device removing and later adding again. >> > >> >Please correct me if I am wrong. >> >> You are right here. >> >> > >> >To me, fixing it looks good. But a clear doc or code comment is >> >necessary so that people can understand the code with less time. >> >Leaving it as is doesn't cause harm. I personally tend to choose >> >the former. >> > >> >> The former is to add a clear doc? > >Sorry for the confusion. The former means the fix in your patch. Maybe a >improved log and some code comment adding can make it more perfect. > Sure, I would try to add more log and comments, in case you have some good suggestion, just let me know :)
On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote: >On Tue 23-06-20 17:42:58, Wei Yang wrote: >> For early sections, we assumes its memmap will never be partially >> removed. But current behavior breaks this. >> >> Let's correct it. >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> > >Can a user trigger this or is this a theoretical bug? Let me rewrite the changelog a little. Look forward any comments. For early sections, its memmap is handled specially even sub-section is enabled. The memmap could only be populated as a whole. Quoted from the comment of section_activate(): * The early init code does not consider partially populated * initial sections, it simply assumes that memory will never be * referenced. If we hot-add memory into such a section then we * do not need to populate the memmap and can simply reuse what * is already there. While current section_deactivate() breaks this rule. When hot-remove a sub-section, section_deactivate() would depopulate its memmap. The consequence is if we hot-add this subsection again, its memmap never get proper populated. > >> --- >> mm/sparse.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/mm/sparse.c b/mm/sparse.c >> index b2b9a3e34696..1a0069f492f5 100644 >> --- a/mm/sparse.c >> +++ b/mm/sparse.c >> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, >> ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; >> } >> >> - if (section_is_early && memmap) >> - free_map_bootmem(memmap); >> - else >> + if (!section_is_early) > >This begs a comment. > >> depopulate_section_memmap(pfn, nr_pages, altmap); >> + else if (memmap) >> + free_map_bootmem(memmap); >> >> if (empty) >> ms->section_mem_map = (unsigned long)NULL; >> -- >> 2.20.1 (Apple Git-117) >> > >-- >Michal Hocko >SUSE Labs
On 23.06.20 17:18, Michal Hocko wrote: > On Tue 23-06-20 17:42:58, Wei Yang wrote: >> For early sections, we assumes its memmap will never be partially >> removed. But current behavior breaks this. >> >> Let's correct it. >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> > > Can a user trigger this or is this a theoretical bug? I tried to reproduce it, but somehow I get unexpected behavior. With a hacked QEMU I can get $ cat /proc/iomem [...] 100000000-143ffffff : System RAM 144000000-343dfffff : Persistent Memory 144000000-1441fffff : namespace0.0 144200000-144ffffff : dax0.0 After $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M I get $ cat /proc/iomem [...] 100000000-143ffffff : System RAM 144000000-343dfffff : Persistent Memory 144000000-1441fffff : namespace0.0 144200000-144ffffff : dax0.0 I can trigger remove+re-add via $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M So we clearly have an overlap between System RAM and dax0.0 within a section. However, I never get early_section() to trigger in section_activate()/section_deactivate() ? That's unexpected Definitely something seems to go wrong after the first "ndctl create-namespace". Using a random page walker: [root@localhost ~]# ./page-types [ 387.019229] general protection fault, probably for non-canonical address 0xfdfdfdfdfdfdfdfc: 0000 [#1] SMP NOPTI [ 387.020172] CPU: 17 PID: 1314 Comm: page-types Kdump: loaded Tainted: G W 5.8.0-rc2 #20 [ 387.021015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 [ 387.022056] RIP: 0010:stable_page_flags+0x27/0x3f0 [ 387.022519] Code: 00 00 00 66 66 66 66 90 48 85 ff 0f 84 d2 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48f [ 387.024291] RSP: 0018:ffff9f8781057e58 EFLAGS: 00010202 [ 387.024775] RAX: fdfdfdfdfdfdfdfc RBX: fdfdfdfdfdfdfdfd RCX: 00007ffc4f4f1f78 [ 387.025423] RDX: 0000000000000001 RSI: 0000002000000000 RDI: ffffc590c5100000 [ 387.026052] RBP: ffffc590c5100000 R08: 0000000000000001 R09: 0000000000000000 [ 387.026696] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffc4f4f1f80 [ 387.027324] R13: 0000000000040000 R14: 00007ffc4f4d1f80 R15: ffffffffa1577ee0 [ 387.027974] FS: 00007f91a0f9c580(0000) GS:ffff888b3f7c0000(0000) knlGS:0000000000000000 [ 387.028699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 387.029223] CR2: 0000000000449b00 CR3: 000000007a7fc000 CR4: 00000000000006e0 [ 387.029864] Call Trace: [ 387.030108] kpageflags_read+0xcc/0x160 [ 387.030473] proc_reg_read+0x53/0x80 [ 387.030809] vfs_read+0x9d/0x150 [ 387.031114] ksys_pread64+0x65/0xa0 [ 387.031449] do_syscall_64+0x4d/0x90 [ 387.031783] entry_SYSCALL_64_after_hwframe+0x44/0xa9
On 24.06.20 09:48, David Hildenbrand wrote: > On 23.06.20 17:18, Michal Hocko wrote: >> On Tue 23-06-20 17:42:58, Wei Yang wrote: >>> For early sections, we assumes its memmap will never be partially >>> removed. But current behavior breaks this. >>> >>> Let's correct it. >>> >>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> >> >> Can a user trigger this or is this a theoretical bug? > > I tried to reproduce it, but somehow I get unexpected behavior. > With a hacked QEMU I can get > > $ cat /proc/iomem > [...] > 100000000-143ffffff : System RAM > 144000000-343dfffff : Persistent Memory > 144000000-1441fffff : namespace0.0 > 144200000-144ffffff : dax0.0 > > After > $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M > > I get > > $ cat /proc/iomem > [...] > 100000000-143ffffff : System RAM > 144000000-343dfffff : Persistent Memory > 144000000-1441fffff : namespace0.0 > 144200000-144ffffff : dax0.0 > > I can trigger remove+re-add via > $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M > > So we clearly have an overlap between System RAM and dax0.0 within a section. > However, I never get early_section() to trigger in > section_activate()/section_deactivate() ? That's unexpected ... booting the correct kernel I can see the printk's I added ... I'll continue to poke it with a stick.
On Wed, Jun 24, 2020 at 09:48:43AM +0200, David Hildenbrand wrote: >On 23.06.20 17:18, Michal Hocko wrote: >> On Tue 23-06-20 17:42:58, Wei Yang wrote: >>> For early sections, we assumes its memmap will never be partially >>> removed. But current behavior breaks this. >>> >>> Let's correct it. >>> >>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> >> >> Can a user trigger this or is this a theoretical bug? > >I tried to reproduce it, but somehow I get unexpected behavior. >With a hacked QEMU I can get David, Thanks for your effort. Would you mind sharing your qemu command line, so that I can have a try at my side? > >$ cat /proc/iomem >[...] >100000000-143ffffff : System RAM >144000000-343dfffff : Persistent Memory > 144000000-1441fffff : namespace0.0 > 144200000-144ffffff : dax0.0 > >After >$ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M > >I get > >$ cat /proc/iomem >[...] >100000000-143ffffff : System RAM >144000000-343dfffff : Persistent Memory > 144000000-1441fffff : namespace0.0 > 144200000-144ffffff : dax0.0 > The memory layout seems not changed. >I can trigger remove+re-add via >$ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M Do we need to change the mode to force the reconfig? > >So we clearly have an overlap between System RAM and dax0.0 within a section. >However, I never get early_section() to trigger in >section_activate()/section_deactivate() ? That's unexpected > > >Definitely something seems to go wrong after the first "ndctl create-namespace". >Using a random page walker: > >[root@localhost ~]# ./page-types What is this page-types? >[ 387.019229] general protection fault, probably for non-canonical address 0xfdfdfdfdfdfdfdfc: 0000 [#1] SMP NOPTI >[ 387.020172] CPU: 17 PID: 1314 Comm: page-types Kdump: loaded Tainted: G W 5.8.0-rc2 #20 >[ 387.021015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 >[ 387.022056] RIP: 0010:stable_page_flags+0x27/0x3f0 >[ 387.022519] Code: 00 00 00 66 66 66 66 90 48 85 ff 0f 84 d2 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48f >[ 387.024291] RSP: 0018:ffff9f8781057e58 EFLAGS: 00010202 >[ 387.024775] RAX: fdfdfdfdfdfdfdfc RBX: fdfdfdfdfdfdfdfd RCX: 00007ffc4f4f1f78 >[ 387.025423] RDX: 0000000000000001 RSI: 0000002000000000 RDI: ffffc590c5100000 >[ 387.026052] RBP: ffffc590c5100000 R08: 0000000000000001 R09: 0000000000000000 >[ 387.026696] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffc4f4f1f80 >[ 387.027324] R13: 0000000000040000 R14: 00007ffc4f4d1f80 R15: ffffffffa1577ee0 >[ 387.027974] FS: 00007f91a0f9c580(0000) GS:ffff888b3f7c0000(0000) knlGS:0000000000000000 >[ 387.028699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >[ 387.029223] CR2: 0000000000449b00 CR3: 000000007a7fc000 CR4: 00000000000006e0 >[ 387.029864] Call Trace: >[ 387.030108] kpageflags_read+0xcc/0x160 >[ 387.030473] proc_reg_read+0x53/0x80 >[ 387.030809] vfs_read+0x9d/0x150 >[ 387.031114] ksys_pread64+0x65/0xa0 >[ 387.031449] do_syscall_64+0x4d/0x90 >[ 387.031783] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > >-- >Thanks, > >David / dhildenb
On 24.06.20 10:13, Wei Yang wrote: > On Wed, Jun 24, 2020 at 09:48:43AM +0200, David Hildenbrand wrote: >> On 23.06.20 17:18, Michal Hocko wrote: >>> On Tue 23-06-20 17:42:58, Wei Yang wrote: >>>> For early sections, we assumes its memmap will never be partially >>>> removed. But current behavior breaks this. >>>> >>>> Let's correct it. >>>> >>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >>>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> >>> >>> Can a user trigger this or is this a theoretical bug? >> >> I tried to reproduce it, but somehow I get unexpected behavior. >> With a hacked QEMU I can get > > David, > > Thanks for your effort. Would you mind sharing your qemu command line, so that > I can have a try at my side? The following change to QEMU: diff --git a/hw/i386/pc.c b/hw/i386/pc.c index 51b3050d01..c6a78d83c0 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms, } machine->device_memory->base = - ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB); + 0x100000000ULL + x86ms->above_4g_mem_size; if (pcmc->enforce_aligned_dimm) { /* size device region assuming 1G page max alignment per slot */ Then you can use the following QEMU cmdline: #! /bin/bash sudo x86_64-softmmu/qemu-system-x86_64 \ --enable-kvm \ -m 4160M,maxmem=20G,slots=1 \ -smp sockets=2,cores=16 \ -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \ -machine pc,nvdimm \ -nographic \ -hda /home/dhildenb/git/Fedora-Cloud-Base-31-1.9.x86_64.qcow2 \ -nodefaults \ -net nic -net user \ -chardev stdio,nosignal,id=serial \ -device isa-serial,chardev=serial \ -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \ -mon chardev=monitor,mode=readline \ -object memory-backend-ram,id=mem0,size=8G \ -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k > >> >> $ cat /proc/iomem >> [...] >> 100000000-143ffffff : System RAM >> 144000000-343dfffff : Persistent Memory >> 144000000-1441fffff : namespace0.0 >> 144200000-144ffffff : dax0.0 >> >> After >> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M >> >> I get >> >> $ cat /proc/iomem >> [...] >> 100000000-143ffffff : System RAM >> 144000000-343dfffff : Persistent Memory >> 144000000-1441fffff : namespace0.0 >> 144200000-144ffffff : dax0.0 >> > > The memory layout seems not changed. Yeah, sorry, int he first example, dax0.0 should be missing (on a fresh QEMU instance instead of after a reboot). > >> I can trigger remove+re-add via >> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M > > Do we need to change the mode to force the reconfig? No, that's sufficient. > >> >> So we clearly have an overlap between System RAM and dax0.0 within a section. >> However, I never get early_section() to trigger in >> section_activate()/section_deactivate() ? That's unexpected >> >> >> Definitely something seems to go wrong after the first "ndctl create-namespace". >> Using a random page walker: >> >> [root@localhost ~]# ./page-types > > What is this page-types? > >> [ 387.019229] general protection fault, probably for non-canonical address 0xfdfdfdfdfdfdfdfc: 0000 [#1] SMP NOPTI >> [ 387.020172] CPU: 17 PID: 1314 Comm: page-types Kdump: loaded Tainted: G W 5.8.0-rc2 #20 >> [ 387.021015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 >> [ 387.022056] RIP: 0010:stable_page_flags+0x27/0x3f0 >> [ 387.022519] Code: 00 00 00 66 66 66 66 90 48 85 ff 0f 84 d2 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48f >> [ 387.024291] RSP: 0018:ffff9f8781057e58 EFLAGS: 00010202 >> [ 387.024775] RAX: fdfdfdfdfdfdfdfc RBX: fdfdfdfdfdfdfdfd RCX: 00007ffc4f4f1f78 >> [ 387.025423] RDX: 0000000000000001 RSI: 0000002000000000 RDI: ffffc590c5100000 >> [ 387.026052] RBP: ffffc590c5100000 R08: 0000000000000001 R09: 0000000000000000 >> [ 387.026696] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffc4f4f1f80 >> [ 387.027324] R13: 0000000000040000 R14: 00007ffc4f4d1f80 R15: ffffffffa1577ee0 >> [ 387.027974] FS: 00007f91a0f9c580(0000) GS:ffff888b3f7c0000(0000) knlGS:0000000000000000 >> [ 387.028699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 387.029223] CR2: 0000000000449b00 CR3: 000000007a7fc000 CR4: 00000000000006e0 >> [ 387.029864] Call Trace: >> [ 387.030108] kpageflags_read+0xcc/0x160 >> [ 387.030473] proc_reg_read+0x53/0x80 >> [ 387.030809] vfs_read+0x9d/0x150 >> [ 387.031114] ksys_pread64+0x65/0xa0 >> [ 387.031449] do_syscall_64+0x4d/0x90 >> [ 387.031783] entry_SYSCALL_64_after_hwframe+0x44/0xa9 It's a tool located in Linux tools/vm/page-types. So with two printk's, I can spot: [ 24.185558] section_deactivate(): depopulate_section_memmap [ 24.202689] pmem0: detected capacity change from 0 to 16777216 [ 24.229747] section_activate(): early_section, not populating memmap [ 24.229777] memmap_init_zone_device initialised 3584 pages in 0ms But nothing actually breaks .... because *drummroll* we use huge pages in the vmemmap, so the partial depopulate will not actually depopulate anything here. Huge page is 2M, the memmap of 128MB sections is exactly 2MB == one hugepages. Trying to depopulate a fraction (e.g., 16MB) of that won't do anything. Now, forcing a CPU without hugepages - PSE (QEMU: "-cpu host,pse=off)", I can trigger via ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M [ 18.044973] pmem0: detected capacity change from 0 to 16777216 [ 18.073813] BUG: unable to handle page fault for address: ffffec73c51000b4 [ 18.076236] #PF: supervisor write access in kernel mode [ 18.077211] #PF: error_code(0x0002) - not-present page [ 18.077943] PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0 [ 18.078551] Oops: 0002 [#1] SMP NOPTI [ 18.078915] CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W 5.8.0-rc2+ #24 [ 18.079718] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 [ 18.080750] RIP: 0010:memmap_init_zone+0x154/0x1c2 [ 18.081213] Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f [ 18.082902] RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282 [ 18.083395] RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000 [ 18.084057] RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080 [ 18.084704] RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708 [ 18.085369] R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004 [ 18.086020] R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000 [ 18.086670] FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000 [ 18.087417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 18.087965] CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0 [ 18.088623] Call Trace: [ 18.088884] move_pfn_range_to_zone+0x128/0x150 [ 18.089313] memremap_pages+0x4e4/0x5a0 [ 18.089681] devm_memremap_pages+0x1e/0x60 [ 18.090081] dev_dax_probe+0x69/0x160 [device_dax] [ 18.090533] really_probe+0x298/0x3c0 [ 18.090896] driver_probe_device+0xe1/0x150 [ 18.091303] ? driver_allows_async_probing+0x50/0x50 [ 18.091780] bus_for_each_drv+0x7e/0xc0 [ 18.092154] __device_attach+0xdf/0x160 [ 18.092527] bus_probe_device+0x8e/0xa0 [ 18.092916] device_add+0x3b9/0x740 [ 18.093258] __devm_create_dev_dax+0x127/0x1c0 [ 18.093686] __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core] [ 18.094200] dax_pmem_probe+0xc/0x1b [dax_pmem] [ 18.094632] nvdimm_bus_probe+0x69/0x1c0 [libnvdimm] [ 18.095109] really_probe+0x147/0x3c0 [ 18.095470] driver_probe_device+0xe1/0x150 [ 18.095883] device_driver_attach+0x53/0x60 [ 18.096289] bind_store+0xd1/0x110 [ 18.096627] kernfs_fop_write+0xce/0x1b0 [ 18.097017] vfs_write+0xb6/0x1a0 [ 18.097350] ksys_write+0x5f/0xe0 [ 18.097681] do_syscall_64+0x4d/0x90 [ 18.098041] entry_SYSCALL_64_after_hwframe+0x44/0xa9
On Wed 24-06-20 10:41:18, David Hildenbrand wrote: [...] > But nothing actually breaks .... because *drummroll* we use huge pages in the vmemmap, > so the partial depopulate will not actually depopulate anything here. Huge page is 2M, > the memmap of 128MB sections is exactly 2MB == one hugepages. Trying to depopulate a > fraction (e.g., 16MB) of that won't do anything. > > > Now, forcing a CPU without hugepages - PSE (QEMU: "-cpu host,pse=off)", I can trigger > via ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M OK, now I am following! Thanks a lot. That is something to be mentioned in the changelog. Small pages might be used if the hotplug happens on a system with fragmented memory so this is something we do care about. Thanks a lot David!
On 24.06.20 05:56, Wei Yang wrote: > On Wed, Jun 24, 2020 at 11:52:36AM +0800, Baoquan He wrote: >> On 06/24/20 at 11:46am, Wei Yang wrote: >>> On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote: >>>> On 06/23/20 at 05:21pm, Dan Williams wrote: >>>>> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang >>>>> <richard.weiyang@linux.alibaba.com> wrote: >>>>>> >>>>>> For early sections, we assumes its memmap will never be partially >>>>>> removed. But current behavior breaks this. >>>>> >>>>> Where do we assume that? >>>>> >>>>> The primary use case for this was mapping pmem that collides with >>>>> System-RAM in the same 128MB section. That collision will certainly be >>>>> depopulated on-demand depending on the state of the pmem device. So, >>>>> I'm not understanding the problem or the benefit of this change. >>>> >>>> I was also confused when review this patch, the patch log is a little >>>> short and simple. From the current code, with SPARSE_VMEMMAP enabled, we >>>> do build memmap for the whole memory section during boot, even though >>>> some of them may be partially populated. We just mark the subsection map >>>> for present pages. >>>> >>>> Later, if pmem device is mapped into the partially boot memory section, >>>> we just fill the relevant subsection map, do return directly, w/o building >>>> the memmap for it, in section_activate(). Because the memmap for the >>>> unpresent RAM part have been there. I guess this is what Wei is trying to >>>> do to keep the behaviour be consistent for pmem device adding, or >>>> pmem device removing and later adding again. >>>> >>>> Please correct me if I am wrong. >>> >>> You are right here. >>> >>>> >>>> To me, fixing it looks good. But a clear doc or code comment is >>>> necessary so that people can understand the code with less time. >>>> Leaving it as is doesn't cause harm. I personally tend to choose >>>> the former. >>>> >>> >>> The former is to add a clear doc? >> >> Sorry for the confusion. The former means the fix in your patch. Maybe a >> improved log and some code comment adding can make it more perfect. >> > > Sure, I would try to add more log and comments, in case you have some good > suggestion, just let me know :) > We have documented this is section_activate() and pfn_valid() sufficiently. Maybe add a pointer like /* * The memmap of early sections is always fully populated. See * section_activate() and pfn_valid() . */
On Tue, Jun 23, 2020 at 11:14 PM Wei Yang <richard.weiyang@linux.alibaba.com> wrote: > > On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote: > >On Tue 23-06-20 17:42:58, Wei Yang wrote: > >> For early sections, we assumes its memmap will never be partially > >> removed. But current behavior breaks this. > >> > >> Let's correct it. > >> > >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") > >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> > > > >Can a user trigger this or is this a theoretical bug? > > Let me rewrite the changelog a little. Look forward any comments. > > For early sections, its memmap is handled specially even sub-section is > enabled. The memmap could only be populated as a whole. > > Quoted from the comment of section_activate(): > > * The early init code does not consider partially populated > * initial sections, it simply assumes that memory will never be > * referenced. If we hot-add memory into such a section then we > * do not need to populate the memmap and can simply reuse what > * is already there. > > While current section_deactivate() breaks this rule. When hot-remove a > sub-section, section_deactivate() would depopulate its memmap. The > consequence is if we hot-add this subsection again, its memmap never get > proper populated. Ok, forgive the latency as re-fetched this logic into my mental cache. So what I was remembering was the initial state of the code that special cased early sections, and that still seems to be the case in pfn_valid(). IIRC early_sections / bootmem are blocked from being removed entirely. Partial / subsection removals are ok.
On Wed, Jun 24, 2020 at 09:10:09AM -0700, Dan Williams wrote: >On Tue, Jun 23, 2020 at 11:14 PM Wei Yang ><richard.weiyang@linux.alibaba.com> wrote: >> >> On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote: >> >On Tue 23-06-20 17:42:58, Wei Yang wrote: >> >> For early sections, we assumes its memmap will never be partially >> >> removed. But current behavior breaks this. >> >> >> >> Let's correct it. >> >> >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >> >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> >> > >> >Can a user trigger this or is this a theoretical bug? >> >> Let me rewrite the changelog a little. Look forward any comments. >> >> For early sections, its memmap is handled specially even sub-section is >> enabled. The memmap could only be populated as a whole. >> >> Quoted from the comment of section_activate(): >> >> * The early init code does not consider partially populated >> * initial sections, it simply assumes that memory will never be >> * referenced. If we hot-add memory into such a section then we >> * do not need to populate the memmap and can simply reuse what >> * is already there. >> >> While current section_deactivate() breaks this rule. When hot-remove a >> sub-section, section_deactivate() would depopulate its memmap. The >> consequence is if we hot-add this subsection again, its memmap never get >> proper populated. > >Ok, forgive the latency as re-fetched this logic into my mental cache. >So what I was remembering was the initial state of the code that >special cased early sections, and that still seems to be the case in >pfn_valid(). IIRC early_sections / bootmem are blocked from being >removed entirely. Partial / subsection removals are ok. Would you mind giving more words? Partial subsection removal is ok, so no need to fix this?
On Wed, Jun 24, 2020 at 10:51:08AM +0200, David Hildenbrand wrote: >On 24.06.20 05:56, Wei Yang wrote: >> On Wed, Jun 24, 2020 at 11:52:36AM +0800, Baoquan He wrote: >>> On 06/24/20 at 11:46am, Wei Yang wrote: >>>> On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote: >>>>> On 06/23/20 at 05:21pm, Dan Williams wrote: >>>>>> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang >>>>>> <richard.weiyang@linux.alibaba.com> wrote: >>>>>>> >>>>>>> For early sections, we assumes its memmap will never be partially >>>>>>> removed. But current behavior breaks this. >>>>>> >>>>>> Where do we assume that? >>>>>> >>>>>> The primary use case for this was mapping pmem that collides with >>>>>> System-RAM in the same 128MB section. That collision will certainly be >>>>>> depopulated on-demand depending on the state of the pmem device. So, >>>>>> I'm not understanding the problem or the benefit of this change. >>>>> >>>>> I was also confused when review this patch, the patch log is a little >>>>> short and simple. From the current code, with SPARSE_VMEMMAP enabled, we >>>>> do build memmap for the whole memory section during boot, even though >>>>> some of them may be partially populated. We just mark the subsection map >>>>> for present pages. >>>>> >>>>> Later, if pmem device is mapped into the partially boot memory section, >>>>> we just fill the relevant subsection map, do return directly, w/o building >>>>> the memmap for it, in section_activate(). Because the memmap for the >>>>> unpresent RAM part have been there. I guess this is what Wei is trying to >>>>> do to keep the behaviour be consistent for pmem device adding, or >>>>> pmem device removing and later adding again. >>>>> >>>>> Please correct me if I am wrong. >>>> >>>> You are right here. >>>> >>>>> >>>>> To me, fixing it looks good. But a clear doc or code comment is >>>>> necessary so that people can understand the code with less time. >>>>> Leaving it as is doesn't cause harm. I personally tend to choose >>>>> the former. >>>>> >>>> >>>> The former is to add a clear doc? >>> >>> Sorry for the confusion. The former means the fix in your patch. Maybe a >>> improved log and some code comment adding can make it more perfect. >>> >> >> Sure, I would try to add more log and comments, in case you have some good >> suggestion, just let me know :) >> > >We have documented this is section_activate() and pfn_valid() >sufficiently. Maybe add a pointer like > >/* > * The memmap of early sections is always fully populated. See > * section_activate() and pfn_valid() . > */ Thanks, I have added this above the "if" check. > >-- >Thanks, > >David / dhildenb
On Wed, Jun 24, 2020 at 3:06 PM Wei Yang <richard.weiyang@linux.alibaba.com> wrote: > > On Wed, Jun 24, 2020 at 09:10:09AM -0700, Dan Williams wrote: > >On Tue, Jun 23, 2020 at 11:14 PM Wei Yang > ><richard.weiyang@linux.alibaba.com> wrote: > >> > >> On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote: > >> >On Tue 23-06-20 17:42:58, Wei Yang wrote: > >> >> For early sections, we assumes its memmap will never be partially > >> >> removed. But current behavior breaks this. > >> >> > >> >> Let's correct it. > >> >> > >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") > >> >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> > >> > > >> >Can a user trigger this or is this a theoretical bug? > >> > >> Let me rewrite the changelog a little. Look forward any comments. > >> > >> For early sections, its memmap is handled specially even sub-section is > >> enabled. The memmap could only be populated as a whole. > >> > >> Quoted from the comment of section_activate(): > >> > >> * The early init code does not consider partially populated > >> * initial sections, it simply assumes that memory will never be > >> * referenced. If we hot-add memory into such a section then we > >> * do not need to populate the memmap and can simply reuse what > >> * is already there. > >> > >> While current section_deactivate() breaks this rule. When hot-remove a > >> sub-section, section_deactivate() would depopulate its memmap. The > >> consequence is if we hot-add this subsection again, its memmap never get > >> proper populated. > > > >Ok, forgive the latency as re-fetched this logic into my mental cache. > >So what I was remembering was the initial state of the code that > >special cased early sections, and that still seems to be the case in > >pfn_valid(). IIRC early_sections / bootmem are blocked from being > >removed entirely. Partial / subsection removals are ok. > > Would you mind giving more words? Partial subsection removal is ok, so no need > to fix this? Early sections establish a memmap for the full section. There's conceptually nothing wrong with unplugging the non-system-RAM portion of the memmap, but it would need to be careful, at least on x86, to map the partial section with PTEs instead of PMDs. So, you are right that there is a mismatch here, but I think the comprehensive fix is to allow early sections to be partially depopulated/repopulated rather than have section_activate() and section_deacticate() special case early sections. The special casing is problematic in retrospect as section_deactivate() can't be maintained without understand special rules in section_activate().
On Wed, Jun 24, 2020 at 10:41:18AM +0200, David Hildenbrand wrote: >On 24.06.20 10:13, Wei Yang wrote: >> On Wed, Jun 24, 2020 at 09:48:43AM +0200, David Hildenbrand wrote: >>> On 23.06.20 17:18, Michal Hocko wrote: >>>> On Tue 23-06-20 17:42:58, Wei Yang wrote: >>>>> For early sections, we assumes its memmap will never be partially >>>>> removed. But current behavior breaks this. >>>>> >>>>> Let's correct it. >>>>> >>>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >>>>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> >>>> >>>> Can a user trigger this or is this a theoretical bug? >>> >>> I tried to reproduce it, but somehow I get unexpected behavior. >>> With a hacked QEMU I can get >> >> David, >> >> Thanks for your effort. Would you mind sharing your qemu command line, so that >> I can have a try at my side? > >The following change to QEMU: > >diff --git a/hw/i386/pc.c b/hw/i386/pc.c >index 51b3050d01..c6a78d83c0 100644 >--- a/hw/i386/pc.c >+++ b/hw/i386/pc.c >@@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms, > } > > machine->device_memory->base = >- ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB); >+ 0x100000000ULL + x86ms->above_4g_mem_size; > > if (pcmc->enforce_aligned_dimm) { > /* size device region assuming 1G page max alignment per slot */ > > >Then you can use the following QEMU cmdline: > >#! /bin/bash >sudo x86_64-softmmu/qemu-system-x86_64 \ > --enable-kvm \ > -m 4160M,maxmem=20G,slots=1 \ > -smp sockets=2,cores=16 \ > -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \ > -machine pc,nvdimm \ > -nographic \ > -hda /home/dhildenb/git/Fedora-Cloud-Base-31-1.9.x86_64.qcow2 \ > -nodefaults \ > -net nic -net user \ > -chardev stdio,nosignal,id=serial \ > -device isa-serial,chardev=serial \ > -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \ > -mon chardev=monitor,mode=readline \ > -object memory-backend-ram,id=mem0,size=8G \ > -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k > > >> >>> >>> $ cat /proc/iomem >>> [...] >>> 100000000-143ffffff : System RAM >>> 144000000-343dfffff : Persistent Memory >>> 144000000-1441fffff : namespace0.0 >>> 144200000-144ffffff : dax0.0 >>> >>> After >>> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M >>> >>> I get >>> >>> $ cat /proc/iomem >>> [...] >>> 100000000-143ffffff : System RAM >>> 144000000-343dfffff : Persistent Memory >>> 144000000-1441fffff : namespace0.0 >>> 144200000-144ffffff : dax0.0 >>> >> >> The memory layout seems not changed. > >Yeah, sorry, int he first example, dax0.0 should be missing (on a fresh >QEMU instance instead of after a reboot). > >> >>> I can trigger remove+re-add via >>> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M >> >> Do we need to change the mode to force the reconfig? > >No, that's sufficient. > >> >>> >>> So we clearly have an overlap between System RAM and dax0.0 within a section. >>> However, I never get early_section() to trigger in >>> section_activate()/section_deactivate() ? That's unexpected >>> >>> >>> Definitely something seems to go wrong after the first "ndctl create-namespace". >>> Using a random page walker: >>> >>> [root@localhost ~]# ./page-types >> >> What is this page-types? >> >>> [ 387.019229] general protection fault, probably for non-canonical address 0xfdfdfdfdfdfdfdfc: 0000 [#1] SMP NOPTI >>> [ 387.020172] CPU: 17 PID: 1314 Comm: page-types Kdump: loaded Tainted: G W 5.8.0-rc2 #20 >>> [ 387.021015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 >>> [ 387.022056] RIP: 0010:stable_page_flags+0x27/0x3f0 >>> [ 387.022519] Code: 00 00 00 66 66 66 66 90 48 85 ff 0f 84 d2 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48f >>> [ 387.024291] RSP: 0018:ffff9f8781057e58 EFLAGS: 00010202 >>> [ 387.024775] RAX: fdfdfdfdfdfdfdfc RBX: fdfdfdfdfdfdfdfd RCX: 00007ffc4f4f1f78 >>> [ 387.025423] RDX: 0000000000000001 RSI: 0000002000000000 RDI: ffffc590c5100000 >>> [ 387.026052] RBP: ffffc590c5100000 R08: 0000000000000001 R09: 0000000000000000 >>> [ 387.026696] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffc4f4f1f80 >>> [ 387.027324] R13: 0000000000040000 R14: 00007ffc4f4d1f80 R15: ffffffffa1577ee0 >>> [ 387.027974] FS: 00007f91a0f9c580(0000) GS:ffff888b3f7c0000(0000) knlGS:0000000000000000 >>> [ 387.028699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 387.029223] CR2: 0000000000449b00 CR3: 000000007a7fc000 CR4: 00000000000006e0 >>> [ 387.029864] Call Trace: >>> [ 387.030108] kpageflags_read+0xcc/0x160 >>> [ 387.030473] proc_reg_read+0x53/0x80 >>> [ 387.030809] vfs_read+0x9d/0x150 >>> [ 387.031114] ksys_pread64+0x65/0xa0 >>> [ 387.031449] do_syscall_64+0x4d/0x90 >>> [ 387.031783] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > >It's a tool located in Linux tools/vm/page-types. > > >So with two printk's, I can spot: > >[ 24.185558] section_deactivate(): depopulate_section_memmap >[ 24.202689] pmem0: detected capacity change from 0 to 16777216 >[ 24.229747] section_activate(): early_section, not populating memmap >[ 24.229777] memmap_init_zone_device initialised 3584 pages in 0ms > >But nothing actually breaks .... because *drummroll* we use huge pages in the vmemmap, >so the partial depopulate will not actually depopulate anything here. Huge page is 2M, >the memmap of 128MB sections is exactly 2MB == one hugepages. Trying to depopulate a >fraction (e.g., 16MB) of that won't do anything. > Thanks David. I miss some knowledge here about the vmemmap and 2MB. Would you minding pointing me the location which affects this behavior? I don't see how depopulate_section_memmap would be affect by this. > >Now, forcing a CPU without hugepages - PSE (QEMU: "-cpu host,pse=off)", I can trigger >via ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M > >[ 18.044973] pmem0: detected capacity change from 0 to 16777216 >[ 18.073813] BUG: unable to handle page fault for address: ffffec73c51000b4 >[ 18.076236] #PF: supervisor write access in kernel mode >[ 18.077211] #PF: error_code(0x0002) - not-present page >[ 18.077943] PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0 >[ 18.078551] Oops: 0002 [#1] SMP NOPTI >[ 18.078915] CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W 5.8.0-rc2+ #24 >[ 18.079718] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 >[ 18.080750] RIP: 0010:memmap_init_zone+0x154/0x1c2 >[ 18.081213] Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f >[ 18.082902] RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282 >[ 18.083395] RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000 >[ 18.084057] RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080 >[ 18.084704] RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708 >[ 18.085369] R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004 >[ 18.086020] R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000 >[ 18.086670] FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000 >[ 18.087417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >[ 18.087965] CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0 >[ 18.088623] Call Trace: >[ 18.088884] move_pfn_range_to_zone+0x128/0x150 >[ 18.089313] memremap_pages+0x4e4/0x5a0 >[ 18.089681] devm_memremap_pages+0x1e/0x60 >[ 18.090081] dev_dax_probe+0x69/0x160 [device_dax] >[ 18.090533] really_probe+0x298/0x3c0 >[ 18.090896] driver_probe_device+0xe1/0x150 >[ 18.091303] ? driver_allows_async_probing+0x50/0x50 >[ 18.091780] bus_for_each_drv+0x7e/0xc0 >[ 18.092154] __device_attach+0xdf/0x160 >[ 18.092527] bus_probe_device+0x8e/0xa0 >[ 18.092916] device_add+0x3b9/0x740 >[ 18.093258] __devm_create_dev_dax+0x127/0x1c0 >[ 18.093686] __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core] >[ 18.094200] dax_pmem_probe+0xc/0x1b [dax_pmem] >[ 18.094632] nvdimm_bus_probe+0x69/0x1c0 [libnvdimm] >[ 18.095109] really_probe+0x147/0x3c0 >[ 18.095470] driver_probe_device+0xe1/0x150 >[ 18.095883] device_driver_attach+0x53/0x60 >[ 18.096289] bind_store+0xd1/0x110 >[ 18.096627] kernfs_fop_write+0xce/0x1b0 >[ 18.097017] vfs_write+0xb6/0x1a0 >[ 18.097350] ksys_write+0x5f/0xe0 >[ 18.097681] do_syscall_64+0x4d/0x90 >[ 18.098041] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > I would put this into changelog. > >-- >Thanks, > >David / dhildenb
On Wed, Jun 24, 2020 at 03:20:59PM -0700, Dan Williams wrote: >On Wed, Jun 24, 2020 at 3:06 PM Wei Yang ><richard.weiyang@linux.alibaba.com> wrote: >> >> On Wed, Jun 24, 2020 at 09:10:09AM -0700, Dan Williams wrote: >> >On Tue, Jun 23, 2020 at 11:14 PM Wei Yang >> ><richard.weiyang@linux.alibaba.com> wrote: >> >> >> >> On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote: >> >> >On Tue 23-06-20 17:42:58, Wei Yang wrote: >> >> >> For early sections, we assumes its memmap will never be partially >> >> >> removed. But current behavior breaks this. >> >> >> >> >> >> Let's correct it. >> >> >> >> >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") >> >> >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> >> >> > >> >> >Can a user trigger this or is this a theoretical bug? >> >> >> >> Let me rewrite the changelog a little. Look forward any comments. >> >> >> >> For early sections, its memmap is handled specially even sub-section is >> >> enabled. The memmap could only be populated as a whole. >> >> >> >> Quoted from the comment of section_activate(): >> >> >> >> * The early init code does not consider partially populated >> >> * initial sections, it simply assumes that memory will never be >> >> * referenced. If we hot-add memory into such a section then we >> >> * do not need to populate the memmap and can simply reuse what >> >> * is already there. >> >> >> >> While current section_deactivate() breaks this rule. When hot-remove a >> >> sub-section, section_deactivate() would depopulate its memmap. The >> >> consequence is if we hot-add this subsection again, its memmap never get >> >> proper populated. >> > >> >Ok, forgive the latency as re-fetched this logic into my mental cache. >> >So what I was remembering was the initial state of the code that >> >special cased early sections, and that still seems to be the case in >> >pfn_valid(). IIRC early_sections / bootmem are blocked from being >> >removed entirely. Partial / subsection removals are ok. >> >> Would you mind giving more words? Partial subsection removal is ok, so no need >> to fix this? > >Early sections establish a memmap for the full section. There's >conceptually nothing wrong with unplugging the non-system-RAM portion >of the memmap, but it would need to be careful, at least on x86, to >map the partial section with PTEs instead of PMDs. > >So, you are right that there is a mismatch here, but I think the >comprehensive fix is to allow early sections to be partially >depopulated/repopulated rather than have section_activate() and >section_deacticate() special case early sections. The special casing >is problematic in retrospect as section_deactivate() can't be >maintained without understand special rules in section_activate(). Hmm... This means we need to adjust pfn_valid() too, which always return true for early sections.
On Wed, Jun 24, 2020 at 3:44 PM Wei Yang <richard.weiyang@linux.alibaba.com> wrote: [..] > >So, you are right that there is a mismatch here, but I think the > >comprehensive fix is to allow early sections to be partially > >depopulated/repopulated rather than have section_activate() and > >section_deacticate() special case early sections. The special casing > >is problematic in retrospect as section_deactivate() can't be > >maintained without understand special rules in section_activate(). > > Hmm... This means we need to adjust pfn_valid() too, which always return true > for early sections. Right, rather than carry workarounds in 3 locations, and the bug that has resulted from then getting out of sync, just teach early section mapping to allow for the subsection populate/depopulate.
> Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: > > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang > <richard.weiyang@linux.alibaba.com> wrote: > [..] >>> So, you are right that there is a mismatch here, but I think the >>> comprehensive fix is to allow early sections to be partially >>> depopulated/repopulated rather than have section_activate() and >>> section_deacticate() special case early sections. The special casing >>> is problematic in retrospect as section_deactivate() can't be >>> maintained without understand special rules in section_activate(). >> >> Hmm... This means we need to adjust pfn_valid() too, which always return true >> for early sections. > > Right, rather than carry workarounds in 3 locations, and the bug that > has resulted from then getting out of sync, just teach early section > mapping to allow for the subsection populate/depopulate. > I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. At least my ack stands.
On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote: > > > > > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: > > > > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang > > <richard.weiyang@linux.alibaba.com> wrote: > > [..] > >>> So, you are right that there is a mismatch here, but I think the > >>> comprehensive fix is to allow early sections to be partially > >>> depopulated/repopulated rather than have section_activate() and > >>> section_deacticate() special case early sections. The special casing > >>> is problematic in retrospect as section_deactivate() can't be > >>> maintained without understand special rules in section_activate(). > >> > >> Hmm... This means we need to adjust pfn_valid() too, which always return true > >> for early sections. > > > > Right, rather than carry workarounds in 3 locations, and the bug that > > has resulted from then getting out of sync, just teach early section > > mapping to allow for the subsection populate/depopulate. > > > > I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. Agree, yes, let's do the simple fix first for 5.8 and the special-case elimination work later.
On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote: >On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote: >> >> >> >> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: >> > >> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang >> > <richard.weiyang@linux.alibaba.com> wrote: >> > [..] >> >>> So, you are right that there is a mismatch here, but I think the >> >>> comprehensive fix is to allow early sections to be partially >> >>> depopulated/repopulated rather than have section_activate() and >> >>> section_deacticate() special case early sections. The special casing >> >>> is problematic in retrospect as section_deactivate() can't be >> >>> maintained without understand special rules in section_activate(). >> >> >> >> Hmm... This means we need to adjust pfn_valid() too, which always return true >> >> for early sections. >> > >> > Right, rather than carry workarounds in 3 locations, and the bug that >> > has resulted from then getting out of sync, just teach early section >> > mapping to allow for the subsection populate/depopulate. >> > >> >> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. > >Agree, yes, let's do the simple fix first for 5.8 and the special-case >elimination work later. Ok, let me send v2 with detailed change log and a comment in code first. Thanks all for your time.
On Thu, Jun 25, 2020 at 07:53:37AM +0200, David Hildenbrand wrote: > > >> Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: >> >> On Wed, Jun 24, 2020 at 3:44 PM Wei Yang >> <richard.weiyang@linux.alibaba.com> wrote: >> [..] >>>> So, you are right that there is a mismatch here, but I think the >>>> comprehensive fix is to allow early sections to be partially >>>> depopulated/repopulated rather than have section_activate() and >>>> section_deacticate() special case early sections. The special casing >>>> is problematic in retrospect as section_deactivate() can't be >>>> maintained without understand special rules in section_activate(). >>> >>> Hmm... This means we need to adjust pfn_valid() too, which always return true >>> for early sections. >> >> Right, rather than carry workarounds in 3 locations, and the bug that >> has resulted from then getting out of sync, just teach early section >> mapping to allow for the subsection populate/depopulate. >> > >I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. > Hi, David, Which part of pfn_to_online_page() needs to be changed? pfn_valid_within() would call pfn_valid() to check the pfn first. This looks enough for me. I may not follow you at this part. >At least my ack stands.
> Am 26.06.2020 um 00:40 schrieb Wei Yang <richard.weiyang@linux.alibaba.com>: > > On Thu, Jun 25, 2020 at 07:53:37AM +0200, David Hildenbrand wrote: >> >> >>>> Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: >>> >>> On Wed, Jun 24, 2020 at 3:44 PM Wei Yang >>> <richard.weiyang@linux.alibaba.com> wrote: >>> [..] >>>>> So, you are right that there is a mismatch here, but I think the >>>>> comprehensive fix is to allow early sections to be partially >>>>> depopulated/repopulated rather than have section_activate() and >>>>> section_deacticate() special case early sections. The special casing >>>>> is problematic in retrospect as section_deactivate() can't be >>>>> maintained without understand special rules in section_activate(). >>>> >>>> Hmm... This means we need to adjust pfn_valid() too, which always return true >>>> for early sections. >>> >>> Right, rather than carry workarounds in 3 locations, and the bug that >>> has resulted from then getting out of sync, just teach early section >>> mapping to allow for the subsection populate/depopulate. >>> >> >> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. >> > > Hi, David, > > Which part of pfn_to_online_page() needs to be changed? pfn_valid_within() > would call pfn_valid() to check the pfn first. This looks enough for me. Not for all configurations. For some (e.g., x86 iirc) it‘s just a nop.
On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote: >On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote: >> >> >> >> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: >> > >> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang >> > <richard.weiyang@linux.alibaba.com> wrote: >> > [..] >> >>> So, you are right that there is a mismatch here, but I think the >> >>> comprehensive fix is to allow early sections to be partially >> >>> depopulated/repopulated rather than have section_activate() and >> >>> section_deacticate() special case early sections. The special casing >> >>> is problematic in retrospect as section_deactivate() can't be >> >>> maintained without understand special rules in section_activate(). >> >> >> >> Hmm... This means we need to adjust pfn_valid() too, which always return true >> >> for early sections. >> > >> > Right, rather than carry workarounds in 3 locations, and the bug that >> > has resulted from then getting out of sync, just teach early section >> > mapping to allow for the subsection populate/depopulate. >> > >> >> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. > >Agree, yes, let's do the simple fix first for 5.8 and the special-case >elimination work later. Dan, A quick test shows this is not a simple task. First, early sections don't set subsection bitmap, which is necessary for the hot-add/remove. To properly set subsection bitmap, we need to know how many subsections in early section. While current code doesn't has a alignment requirement for last early section. We mark the whole last early section as present. I don't find a way to enable this.
On Mon, Jun 29, 2020 at 1:34 AM Wei Yang <richard.weiyang@linux.alibaba.com> wrote: > > On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote: > >On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> > >> > >> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: > >> > > >> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang > >> > <richard.weiyang@linux.alibaba.com> wrote: > >> > [..] > >> >>> So, you are right that there is a mismatch here, but I think the > >> >>> comprehensive fix is to allow early sections to be partially > >> >>> depopulated/repopulated rather than have section_activate() and > >> >>> section_deacticate() special case early sections. The special casing > >> >>> is problematic in retrospect as section_deactivate() can't be > >> >>> maintained without understand special rules in section_activate(). > >> >> > >> >> Hmm... This means we need to adjust pfn_valid() too, which always return true > >> >> for early sections. > >> > > >> > Right, rather than carry workarounds in 3 locations, and the bug that > >> > has resulted from then getting out of sync, just teach early section > >> > mapping to allow for the subsection populate/depopulate. > >> > > >> > >> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. > > > >Agree, yes, let's do the simple fix first for 5.8 and the special-case > >elimination work later. > > Dan, > > A quick test shows this is not a simple task. Thanks for taking a look... > First, early sections don't set subsection bitmap, which is necessary for the > hot-add/remove. > > To properly set subsection bitmap, we need to know how many subsections in > early section. While current code doesn't has a alignment requirement for > last early section. We mark the whole last early section as present. I was thinking that the subsection map does not need to be accurate on initial setup, it only needs to be accurate after the first removal. However, that would result in new special casing that somewhat defeats the purpose. The hardest part is potentially breaking up a PMD mapping of the page array into a series of PTE mappings without disturbing in-flight pfn_to_page() users. > I don't find a way to enable this. While I don't like that this bug crept into the mismatched special casing of early sections, I'm now coming around to the same opinion. I.e. that making the memmap for early sections permanent is a simpler mechanism to maintain.
On Mon, Jun 29, 2020 at 03:13:25PM -0700, Dan Williams wrote: >On Mon, Jun 29, 2020 at 1:34 AM Wei Yang ><richard.weiyang@linux.alibaba.com> wrote: >> >> On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote: >> >On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote: >> >> >> >> >> >> >> >> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: >> >> > >> >> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang >> >> > <richard.weiyang@linux.alibaba.com> wrote: >> >> > [..] >> >> >>> So, you are right that there is a mismatch here, but I think the >> >> >>> comprehensive fix is to allow early sections to be partially >> >> >>> depopulated/repopulated rather than have section_activate() and >> >> >>> section_deacticate() special case early sections. The special casing >> >> >>> is problematic in retrospect as section_deactivate() can't be >> >> >>> maintained without understand special rules in section_activate(). >> >> >> >> >> >> Hmm... This means we need to adjust pfn_valid() too, which always return true >> >> >> for early sections. >> >> > >> >> > Right, rather than carry workarounds in 3 locations, and the bug that >> >> > has resulted from then getting out of sync, just teach early section >> >> > mapping to allow for the subsection populate/depopulate. >> >> > >> >> >> >> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. >> > >> >Agree, yes, let's do the simple fix first for 5.8 and the special-case >> >elimination work later. >> >> Dan, >> >> A quick test shows this is not a simple task. > >Thanks for taking a look... > >> First, early sections don't set subsection bitmap, which is necessary for the >> hot-add/remove. >> >> To properly set subsection bitmap, we need to know how many subsections in >> early section. While current code doesn't has a alignment requirement for >> last early section. We mark the whole last early section as present. > >I was thinking that the subsection map does not need to be accurate on >initial setup, it only needs to be accurate after the first removal. >However, that would result in new special casing that somewhat defeats >the purpose. The hardest part is potentially breaking up a PMD mapping >of the page array into a series of PTE mappings without disturbing >in-flight pfn_to_page() users. > >> I don't find a way to enable this. > >While I don't like that this bug crept into the mismatched special >casing of early sections, I'm now coming around to the same opinion. >I.e. that making the memmap for early sections permanent is a simpler >mechanism to maintain. I think so ...
On 30.06.20 00:58, Wei Yang wrote: > On Mon, Jun 29, 2020 at 03:13:25PM -0700, Dan Williams wrote: >> On Mon, Jun 29, 2020 at 1:34 AM Wei Yang >> <richard.weiyang@linux.alibaba.com> wrote: >>> >>> On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote: >>>> On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> >>>>> >>>>>> Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>: >>>>>> >>>>>> On Wed, Jun 24, 2020 at 3:44 PM Wei Yang >>>>>> <richard.weiyang@linux.alibaba.com> wrote: >>>>>> [..] >>>>>>>> So, you are right that there is a mismatch here, but I think the >>>>>>>> comprehensive fix is to allow early sections to be partially >>>>>>>> depopulated/repopulated rather than have section_activate() and >>>>>>>> section_deacticate() special case early sections. The special casing >>>>>>>> is problematic in retrospect as section_deactivate() can't be >>>>>>>> maintained without understand special rules in section_activate(). >>>>>>> >>>>>>> Hmm... This means we need to adjust pfn_valid() too, which always return true >>>>>>> for early sections. >>>>>> >>>>>> Right, rather than carry workarounds in 3 locations, and the bug that >>>>>> has resulted from then getting out of sync, just teach early section >>>>>> mapping to allow for the subsection populate/depopulate. >>>>>> >>>>> >>>>> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well. >>>> >>>> Agree, yes, let's do the simple fix first for 5.8 and the special-case >>>> elimination work later. >>> >>> Dan, >>> >>> A quick test shows this is not a simple task. >> >> Thanks for taking a look... >> >>> First, early sections don't set subsection bitmap, which is necessary for the >>> hot-add/remove. >>> >>> To properly set subsection bitmap, we need to know how many subsections in >>> early section. While current code doesn't has a alignment requirement for >>> last early section. We mark the whole last early section as present. >> >> I was thinking that the subsection map does not need to be accurate on >> initial setup, it only needs to be accurate after the first removal. >> However, that would result in new special casing that somewhat defeats >> the purpose. The hardest part is potentially breaking up a PMD mapping >> of the page array into a series of PTE mappings without disturbing >> in-flight pfn_to_page() users. >> >>> I don't find a way to enable this. >> >> While I don't like that this bug crept into the mismatched special >> casing of early sections, I'm now coming around to the same opinion. >> I.e. that making the memmap for early sections permanent is a simpler >> mechanism to maintain. > > I think so ... > Yes, and I think having to replace quite some pfn_valid_within() - nops - by pfn_valid() just to handle one corner case might not be worth it. At least for now.
diff --git a/mm/sparse.c b/mm/sparse.c index b2b9a3e34696..1a0069f492f5 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; } - if (section_is_early && memmap) - free_map_bootmem(memmap); - else + if (!section_is_early) depopulate_section_memmap(pfn, nr_pages, altmap); + else if (memmap) + free_map_bootmem(memmap); if (empty) ms->section_mem_map = (unsigned long)NULL;
For early sections, we assumes its memmap will never be partially removed. But current behavior breaks this. Let's correct it. Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> --- mm/sparse.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)