Message ID | 20191028094825.21448-1-aneesh.kumar@linux.ibm.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [RFC,1/4] libnvdimm/namespace: Make namespace size validation arch dependent | expand |
> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c > index 377712e85605..2e661a08dae5 100644 > --- a/arch/powerpc/lib/pmem.c > +++ b/arch/powerpc/lib/pmem.c > @@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size) > unsigned long start = (unsigned long) addr; > flush_dcache_range(start, start + size); > } > -EXPORT_SYMBOL(arch_wb_cache_pmem); > +EXPORT_SYMBOL_GPL(arch_wb_cache_pmem); Unrelated change? Ira
On Mon, Oct 28, 2019 at 2:48 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > The page size used to map the namespace is arch dependent. For example > architectures like ppc64 use 16MB page size for direct-mapping. If the namespace > size is not aligned to the mapping page size, we can observe kernel crash > during namespace init and destroy. > > This is due to kernel doing partial map/unmap of the resource range > > BUG: Unable to handle kernel data access at 0xc001000406000000 > Faulting instruction address: 0xc000000000090790 > NIP [c000000000090790] arch_add_memory+0xc0/0x130 > LR [c000000000090744] arch_add_memory+0x74/0x130 > Call Trace: > arch_add_memory+0x74/0x130 (unreliable) > memremap_pages+0x74c/0xa30 > devm_memremap_pages+0x3c/0xa0 > pmem_attach_disk+0x188/0x770 > nvdimm_bus_probe+0xd8/0x470 > really_probe+0x148/0x570 > driver_probe_device+0x19c/0x1d0 > device_driver_attach+0xcc/0x100 > bind_store+0x134/0x1c0 > drv_attr_store+0x44/0x60 > sysfs_kf_write+0x74/0xc0 > kernfs_fop_write+0x1b4/0x290 > __vfs_write+0x3c/0x70 > vfs_write+0xd0/0x260 > ksys_write+0xdc/0x130 > system_call+0x5c/0x68 > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > --- > arch/arm64/mm/flush.c | 11 +++++++++++ > arch/powerpc/lib/pmem.c | 21 +++++++++++++++++++-- > arch/x86/mm/pageattr.c | 12 ++++++++++++ > include/linux/libnvdimm.h | 1 + > 4 files changed, 43 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c > index ac485163a4a7..90c54c600023 100644 > --- a/arch/arm64/mm/flush.c > +++ b/arch/arm64/mm/flush.c > @@ -91,4 +91,15 @@ void arch_invalidate_pmem(void *addr, size_t size) > __inval_dcache_area(addr, size); > } > EXPORT_SYMBOL_GPL(arch_invalidate_pmem); > + > +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) > +{ > + u32 remainder; > + > + div_u64_rem(size, PAGE_SIZE * ndr_mappings, &remainder); > + if (remainder) > + return PAGE_SIZE * ndr_mappings; > + return 0; > +} > +EXPORT_SYMBOL_GPL(arch_validate_namespace_size); > #endif > diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c > index 377712e85605..2e661a08dae5 100644 > --- a/arch/powerpc/lib/pmem.c > +++ b/arch/powerpc/lib/pmem.c > @@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size) > unsigned long start = (unsigned long) addr; > flush_dcache_range(start, start + size); > } > -EXPORT_SYMBOL(arch_wb_cache_pmem); > +EXPORT_SYMBOL_GPL(arch_wb_cache_pmem); > > void arch_invalidate_pmem(void *addr, size_t size) > { > unsigned long start = (unsigned long) addr; > flush_dcache_range(start, start + size); > } > -EXPORT_SYMBOL(arch_invalidate_pmem); > +EXPORT_SYMBOL_GPL(arch_invalidate_pmem); > + > +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) > +{ > + u32 remainder; > + unsigned long linear_map_size; > + > + if (radix_enabled()) > + linear_map_size = PAGE_SIZE; > + else > + linear_map_size = (1UL << mmu_psize_defs[mmu_linear_psize].shift); This seems more a "supported_alignments" problem, and less a namespace size or PAGE_SIZE problem, because if the starting address is misaligned this size validation can still succeed when it shouldn't. One problem is that __size_store() does not validate the size against the namespace alignment. However, the next problem is that alignment is a property of the pfn device, but not the raw namespace. I think this alignment constraint should be captured by exposing "align" and "supported_alignments" at the namespace level as the minimum alignment. The pfn level alignment could then be an additional alignment constraint, but ndctl would likely set them to the same value. The "* ndr_mappings" constraint should be left out of the interface, because that's a side effect of supporting namespace-type-blk aliasing which no platform seems to do in practice. If for some strange reason it's need in the future I'd rather expose the "aliasing" property rather than fold it into the align / supported_aligns interface.
On 10/29/19 4:38 AM, Dan Williams wrote: > On Mon, Oct 28, 2019 at 2:48 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> The page size used to map the namespace is arch dependent. For example >> architectures like ppc64 use 16MB page size for direct-mapping. If the namespace >> size is not aligned to the mapping page size, we can observe kernel crash >> during namespace init and destroy. >> >> This is due to kernel doing partial map/unmap of the resource range >> >> BUG: Unable to handle kernel data access at 0xc001000406000000 >> Faulting instruction address: 0xc000000000090790 >> NIP [c000000000090790] arch_add_memory+0xc0/0x130 >> LR [c000000000090744] arch_add_memory+0x74/0x130 >> Call Trace: >> arch_add_memory+0x74/0x130 (unreliable) >> memremap_pages+0x74c/0xa30 >> devm_memremap_pages+0x3c/0xa0 >> pmem_attach_disk+0x188/0x770 >> nvdimm_bus_probe+0xd8/0x470 >> really_probe+0x148/0x570 >> driver_probe_device+0x19c/0x1d0 >> device_driver_attach+0xcc/0x100 >> bind_store+0x134/0x1c0 >> drv_attr_store+0x44/0x60 >> sysfs_kf_write+0x74/0xc0 >> kernfs_fop_write+0x1b4/0x290 >> __vfs_write+0x3c/0x70 >> vfs_write+0xd0/0x260 >> ksys_write+0xdc/0x130 >> system_call+0x5c/0x68 >> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >> --- >> arch/arm64/mm/flush.c | 11 +++++++++++ >> arch/powerpc/lib/pmem.c | 21 +++++++++++++++++++-- >> arch/x86/mm/pageattr.c | 12 ++++++++++++ >> include/linux/libnvdimm.h | 1 + >> 4 files changed, 43 insertions(+), 2 deletions(-) >> >> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c >> index ac485163a4a7..90c54c600023 100644 >> --- a/arch/arm64/mm/flush.c >> +++ b/arch/arm64/mm/flush.c >> @@ -91,4 +91,15 @@ void arch_invalidate_pmem(void *addr, size_t size) >> __inval_dcache_area(addr, size); >> } >> EXPORT_SYMBOL_GPL(arch_invalidate_pmem); >> + >> +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) >> +{ >> + u32 remainder; >> + >> + div_u64_rem(size, PAGE_SIZE * ndr_mappings, &remainder); >> + if (remainder) >> + return PAGE_SIZE * ndr_mappings; >> + return 0; >> +} >> +EXPORT_SYMBOL_GPL(arch_validate_namespace_size); >> #endif >> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c >> index 377712e85605..2e661a08dae5 100644 >> --- a/arch/powerpc/lib/pmem.c >> +++ b/arch/powerpc/lib/pmem.c >> @@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size) >> unsigned long start = (unsigned long) addr; >> flush_dcache_range(start, start + size); >> } >> -EXPORT_SYMBOL(arch_wb_cache_pmem); >> +EXPORT_SYMBOL_GPL(arch_wb_cache_pmem); >> >> void arch_invalidate_pmem(void *addr, size_t size) >> { >> unsigned long start = (unsigned long) addr; >> flush_dcache_range(start, start + size); >> } >> -EXPORT_SYMBOL(arch_invalidate_pmem); >> +EXPORT_SYMBOL_GPL(arch_invalidate_pmem); >> + >> +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) >> +{ >> + u32 remainder; >> + unsigned long linear_map_size; >> + >> + if (radix_enabled()) >> + linear_map_size = PAGE_SIZE; >> + else >> + linear_map_size = (1UL << mmu_psize_defs[mmu_linear_psize].shift); > > This seems more a "supported_alignments" problem, and less a namespace > size or PAGE_SIZE problem, because if the starting address is > misaligned this size validation can still succeed when it shouldn't. > Isn't supported_alignments an indication of how user want the namespace to be mapped to applications? Ie, with the above restrictions we can still do both 64K and 16M mapping of the namespace to userspace. Also for supported alignment the huge page mapping is further dependent on the THP feature. The restrictions here are mostly w.r.t the direct-mapping page size used by some architecture. > One problem is that __size_store() does not validate the size against > the namespace alignment. > > However, the next problem is that alignment is a property of the pfn > device, but not the raw namespace. I think this alignment constraint > should be captured by exposing "align" and "supported_alignments" at > the namespace level as the minimum alignment. The pfn level alignment > could then be an additional alignment constraint, but ndctl would > likely set them to the same value. > > The "* ndr_mappings" constraint should be left out of the interface, > because that's a side effect of supporting namespace-type-blk aliasing > which no platform seems to do in practice. If for some strange reason > it's need in the future I'd rather expose the "aliasing" property > rather than fold it into the align / supported_aligns interface. > -aneesh
On 10/29/19 2:51 AM, Ira Weiny wrote: >> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c >> index 377712e85605..2e661a08dae5 100644 >> --- a/arch/powerpc/lib/pmem.c >> +++ b/arch/powerpc/lib/pmem.c >> @@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size) >> unsigned long start = (unsigned long) addr; >> flush_dcache_range(start, start + size); >> } >> -EXPORT_SYMBOL(arch_wb_cache_pmem); >> +EXPORT_SYMBOL_GPL(arch_wb_cache_pmem); > > Unrelated change? > > Yes, will split that as a separate patch. -aneesh
On Mon, Oct 28, 2019 at 9:35 PM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > On 10/29/19 4:38 AM, Dan Williams wrote: > > On Mon, Oct 28, 2019 at 2:48 AM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > >> > >> The page size used to map the namespace is arch dependent. For example > >> architectures like ppc64 use 16MB page size for direct-mapping. If the namespace > >> size is not aligned to the mapping page size, we can observe kernel crash > >> during namespace init and destroy. > >> > >> This is due to kernel doing partial map/unmap of the resource range > >> > >> BUG: Unable to handle kernel data access at 0xc001000406000000 > >> Faulting instruction address: 0xc000000000090790 > >> NIP [c000000000090790] arch_add_memory+0xc0/0x130 > >> LR [c000000000090744] arch_add_memory+0x74/0x130 > >> Call Trace: > >> arch_add_memory+0x74/0x130 (unreliable) > >> memremap_pages+0x74c/0xa30 > >> devm_memremap_pages+0x3c/0xa0 > >> pmem_attach_disk+0x188/0x770 > >> nvdimm_bus_probe+0xd8/0x470 > >> really_probe+0x148/0x570 > >> driver_probe_device+0x19c/0x1d0 > >> device_driver_attach+0xcc/0x100 > >> bind_store+0x134/0x1c0 > >> drv_attr_store+0x44/0x60 > >> sysfs_kf_write+0x74/0xc0 > >> kernfs_fop_write+0x1b4/0x290 > >> __vfs_write+0x3c/0x70 > >> vfs_write+0xd0/0x260 > >> ksys_write+0xdc/0x130 > >> system_call+0x5c/0x68 > >> > >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > >> --- > >> arch/arm64/mm/flush.c | 11 +++++++++++ > >> arch/powerpc/lib/pmem.c | 21 +++++++++++++++++++-- > >> arch/x86/mm/pageattr.c | 12 ++++++++++++ > >> include/linux/libnvdimm.h | 1 + > >> 4 files changed, 43 insertions(+), 2 deletions(-) > >> > >> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c > >> index ac485163a4a7..90c54c600023 100644 > >> --- a/arch/arm64/mm/flush.c > >> +++ b/arch/arm64/mm/flush.c > >> @@ -91,4 +91,15 @@ void arch_invalidate_pmem(void *addr, size_t size) > >> __inval_dcache_area(addr, size); > >> } > >> EXPORT_SYMBOL_GPL(arch_invalidate_pmem); > >> + > >> +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) > >> +{ > >> + u32 remainder; > >> + > >> + div_u64_rem(size, PAGE_SIZE * ndr_mappings, &remainder); > >> + if (remainder) > >> + return PAGE_SIZE * ndr_mappings; > >> + return 0; > >> +} > >> +EXPORT_SYMBOL_GPL(arch_validate_namespace_size); > >> #endif > >> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c > >> index 377712e85605..2e661a08dae5 100644 > >> --- a/arch/powerpc/lib/pmem.c > >> +++ b/arch/powerpc/lib/pmem.c > >> @@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size) > >> unsigned long start = (unsigned long) addr; > >> flush_dcache_range(start, start + size); > >> } > >> -EXPORT_SYMBOL(arch_wb_cache_pmem); > >> +EXPORT_SYMBOL_GPL(arch_wb_cache_pmem); > >> > >> void arch_invalidate_pmem(void *addr, size_t size) > >> { > >> unsigned long start = (unsigned long) addr; > >> flush_dcache_range(start, start + size); > >> } > >> -EXPORT_SYMBOL(arch_invalidate_pmem); > >> +EXPORT_SYMBOL_GPL(arch_invalidate_pmem); > >> + > >> +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) > >> +{ > >> + u32 remainder; > >> + unsigned long linear_map_size; > >> + > >> + if (radix_enabled()) > >> + linear_map_size = PAGE_SIZE; > >> + else > >> + linear_map_size = (1UL << mmu_psize_defs[mmu_linear_psize].shift); > > > > This seems more a "supported_alignments" problem, and less a namespace > > size or PAGE_SIZE problem, because if the starting address is > > misaligned this size validation can still succeed when it shouldn't. > > > > > Isn't supported_alignments an indication of how user want the namespace > to be mapped to applications? Ie, with the above restrictions we can > still do both 64K and 16M mapping of the namespace to userspace. True, for the pfn device and the device-dax mapping size, but I'm suggesting adding another instance of alignment control at the raw namespace level. That would need to be disconnected from the device-dax page mapping granularity. > > Also for supported alignment the huge page mapping is further dependent > on the THP feature. > > The restrictions here are mostly w.r.t the direct-mapping page size used > by some architecture. Right, that's a base requirement for the namespace that can be independent of the user page mapping size for device-dax.
On 10/29/19 11:00 AM, Dan Williams wrote: > On Mon, Oct 28, 2019 at 9:35 PM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> On 10/29/19 4:38 AM, Dan Williams wrote: >>> On Mon, Oct 28, 2019 at 2:48 AM Aneesh Kumar K.V >>> <aneesh.kumar@linux.ibm.com> wrote: >>>> >>>> The page size used to map the namespace is arch dependent. For example >>>> architectures like ppc64 use 16MB page size for direct-mapping. If the namespace >>>> size is not aligned to the mapping page size, we can observe kernel crash >>>> during namespace init and destroy. >>>> >>>> This is due to kernel doing partial map/unmap of the resource range >>>> >>>> BUG: Unable to handle kernel data access at 0xc001000406000000 >>>> Faulting instruction address: 0xc000000000090790 >>>> NIP [c000000000090790] arch_add_memory+0xc0/0x130 >>>> LR [c000000000090744] arch_add_memory+0x74/0x130 >>>> Call Trace: >>>> arch_add_memory+0x74/0x130 (unreliable) >>>> memremap_pages+0x74c/0xa30 >>>> devm_memremap_pages+0x3c/0xa0 >>>> pmem_attach_disk+0x188/0x770 >>>> nvdimm_bus_probe+0xd8/0x470 >>>> really_probe+0x148/0x570 >>>> driver_probe_device+0x19c/0x1d0 >>>> device_driver_attach+0xcc/0x100 >>>> bind_store+0x134/0x1c0 >>>> drv_attr_store+0x44/0x60 >>>> sysfs_kf_write+0x74/0xc0 >>>> kernfs_fop_write+0x1b4/0x290 >>>> __vfs_write+0x3c/0x70 >>>> vfs_write+0xd0/0x260 >>>> ksys_write+0xdc/0x130 >>>> system_call+0x5c/0x68 >>>> >>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >>>> --- >>>> arch/arm64/mm/flush.c | 11 +++++++++++ >>>> arch/powerpc/lib/pmem.c | 21 +++++++++++++++++++-- >>>> arch/x86/mm/pageattr.c | 12 ++++++++++++ >>>> include/linux/libnvdimm.h | 1 + >>>> 4 files changed, 43 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c >>>> index ac485163a4a7..90c54c600023 100644 >>>> --- a/arch/arm64/mm/flush.c >>>> +++ b/arch/arm64/mm/flush.c >>>> @@ -91,4 +91,15 @@ void arch_invalidate_pmem(void *addr, size_t size) >>>> __inval_dcache_area(addr, size); >>>> } >>>> EXPORT_SYMBOL_GPL(arch_invalidate_pmem); >>>> + >>>> +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) >>>> +{ >>>> + u32 remainder; >>>> + >>>> + div_u64_rem(size, PAGE_SIZE * ndr_mappings, &remainder); >>>> + if (remainder) >>>> + return PAGE_SIZE * ndr_mappings; >>>> + return 0; >>>> +} >>>> +EXPORT_SYMBOL_GPL(arch_validate_namespace_size); >>>> #endif >>>> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c >>>> index 377712e85605..2e661a08dae5 100644 >>>> --- a/arch/powerpc/lib/pmem.c >>>> +++ b/arch/powerpc/lib/pmem.c >>>> @@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size) >>>> unsigned long start = (unsigned long) addr; >>>> flush_dcache_range(start, start + size); >>>> } >>>> -EXPORT_SYMBOL(arch_wb_cache_pmem); >>>> +EXPORT_SYMBOL_GPL(arch_wb_cache_pmem); >>>> >>>> void arch_invalidate_pmem(void *addr, size_t size) >>>> { >>>> unsigned long start = (unsigned long) addr; >>>> flush_dcache_range(start, start + size); >>>> } >>>> -EXPORT_SYMBOL(arch_invalidate_pmem); >>>> +EXPORT_SYMBOL_GPL(arch_invalidate_pmem); >>>> + >>>> +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) >>>> +{ >>>> + u32 remainder; >>>> + unsigned long linear_map_size; >>>> + >>>> + if (radix_enabled()) >>>> + linear_map_size = PAGE_SIZE; >>>> + else >>>> + linear_map_size = (1UL << mmu_psize_defs[mmu_linear_psize].shift); >>> >>> This seems more a "supported_alignments" problem, and less a namespace >>> size or PAGE_SIZE problem, because if the starting address is >>> misaligned this size validation can still succeed when it shouldn't. >>> >> >> >> Isn't supported_alignments an indication of how user want the namespace >> to be mapped to applications? Ie, with the above restrictions we can >> still do both 64K and 16M mapping of the namespace to userspace. > > True, for the pfn device and the device-dax mapping size, but I'm > suggesting adding another instance of alignment control at the raw > namespace level. That would need to be disconnected from the > device-dax page mapping granularity. > Can you explain what you mean by raw namespace level ? We don't have multiple values against which we need to check the alignment of namespace start and namespace size. If you can outline how and where you would like to enforce that check I can start working on it. -aneesh
On Wed, Oct 30, 2019 at 10:35 PM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: [..] > > True, for the pfn device and the device-dax mapping size, but I'm > > suggesting adding another instance of alignment control at the raw > > namespace level. That would need to be disconnected from the > > device-dax page mapping granularity. > > > > Can you explain what you mean by raw namespace level ? We don't have > multiple values against which we need to check the alignment of > namespace start and namespace size. > > If you can outline how and where you would like to enforce that check I > can start working on it. > What I mean is that the process of setting up a pfn namespace goes something like this in shell script form: 1/ echo $size > /sys/bus/nd/devices/$namespace/size 2/ echo $namespace > /sys/bus/nd/devices/$pfn/namespace 3/ echo $pfn_align > /sys/bus/nd/devices/$pfn/align What I'm suggesting is add an optional 0th step that does: echo $raw_align > /sys/bus/nd/devices/$namespace/align Where the raw align needs to be needs to be max($pfn_align, arch_mapping_granulariy). So on powerpc where PAGE_SIZE < arch_mapping_granulariy, the following: cat /sys/bus/nd/devices/$namespace/supported_aligns ...would show the same output as: cat /sys/bus/nd/devices/$pfn/align ...but with any alignment choice less than arch_mapping_granulariy removed. All that said, the x86 vmemmap_populate() falls back to use small pages in some case to get around this constraint. Can't powerpc do the same? It would seem to be less work than the above proposal.
On 10/31/19 12:00 PM, Dan Williams wrote: > On Wed, Oct 30, 2019 at 10:35 PM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: > [..] > > > All that said, the x86 vmemmap_populate() falls back to use small > pages in some case to get around this constraint. Can't powerpc do the > same? It would seem to be less work than the above proposal. > ppc64 supports two translation mode (radix and hash). We can do the above with radix translation. With hash we use just one page size( in this specific case 16MB) for direct-map mapping. -aneesh
On Thu, Oct 31, 2019 at 1:38 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > On 10/31/19 12:00 PM, Dan Williams wrote: > > On Wed, Oct 30, 2019 at 10:35 PM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > > [..] > > > > > > > All that said, the x86 vmemmap_populate() falls back to use small > > pages in some case to get around this constraint. Can't powerpc do the > > same? It would seem to be less work than the above proposal. > > > > ppc64 supports two translation mode (radix and hash). We can do the > above with radix translation. With hash we use just one page size( in > this specific case 16MB) for direct-map mapping. Ok, if it's truly a hard constraint then yes, more infrastructure is needed to expose that constraint to the namespace provisioning flow.
Dan Williams <dan.j.williams@intel.com> writes: > On Wed, Oct 30, 2019 at 10:35 PM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: > [..] >> > True, for the pfn device and the device-dax mapping size, but I'm >> > suggesting adding another instance of alignment control at the raw >> > namespace level. That would need to be disconnected from the >> > device-dax page mapping granularity. >> > >> >> Can you explain what you mean by raw namespace level ? We don't have >> multiple values against which we need to check the alignment of >> namespace start and namespace size. >> >> If you can outline how and where you would like to enforce that check I >> can start working on it. >> > > What I mean is that the process of setting up a pfn namespace goes > something like this in shell script form: > > 1/ echo $size > /sys/bus/nd/devices/$namespace/size > 2/ echo $namespace > /sys/bus/nd/devices/$pfn/namespace > 3/ echo $pfn_align > /sys/bus/nd/devices/$pfn/align > > What I'm suggesting is add an optional 0th step that does: > > echo $raw_align > /sys/bus/nd/devices/$namespace/align > > Where the raw align needs to be needs to be max($pfn_align, > arch_mapping_granulariy). I started looking at this and was wondering about userspace being aware of the direct-map mapping size. We can figure that out by parsing kernel log [ 0.000000] Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24 But I am not sure we want to do that. There is not set of raw_align value to select. What we need to make sure is the below. 1) While creating a namespace we need to make sure that namespace size is multiple of direct-map mapping size. If we ensure that size is aligned, we should also get the namespace start to be aligned? 2) While initialzing a namespace by scanning label area, we need to make sure every namespace in the region satisfy the above requirement. If not we should mark the region disabled. > > So on powerpc where PAGE_SIZE < arch_mapping_granulariy, the following: > > cat /sys/bus/nd/devices/$namespace/supported_aligns > > ...would show the same output as: > > cat /sys/bus/nd/devices/$pfn/align > > ...but with any alignment choice less than arch_mapping_granulariy removed. > I am not sure why we need to do that. For example: even if we have direct-map mapping size as PAGE_SIZE (64K), we still should allow an user mapping with hugepage size (16M)? -aneesh
Hi Dan, "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > Dan Williams <dan.j.williams@intel.com> writes: > >> On Wed, Oct 30, 2019 at 10:35 PM Aneesh Kumar K.V >> <aneesh.kumar@linux.ibm.com> wrote: >> [..] >>> > True, for the pfn device and the device-dax mapping size, but I'm >>> > suggesting adding another instance of alignment control at the raw >>> > namespace level. That would need to be disconnected from the >>> > device-dax page mapping granularity. >>> > >>> >>> Can you explain what you mean by raw namespace level ? We don't have >>> multiple values against which we need to check the alignment of >>> namespace start and namespace size. >>> >>> If you can outline how and where you would like to enforce that check I >>> can start working on it. >>> >> >> What I mean is that the process of setting up a pfn namespace goes >> something like this in shell script form: >> >> 1/ echo $size > /sys/bus/nd/devices/$namespace/size >> 2/ echo $namespace > /sys/bus/nd/devices/$pfn/namespace >> 3/ echo $pfn_align > /sys/bus/nd/devices/$pfn/align >> >> What I'm suggesting is add an optional 0th step that does: >> >> echo $raw_align > /sys/bus/nd/devices/$namespace/align >> >> Where the raw align needs to be needs to be max($pfn_align, >> arch_mapping_granulariy). > > > I started looking at this and was wondering about userspace being aware > of the direct-map mapping size. We can figure that out by parsing kernel > log > > [ 0.000000] Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24 > > > But I am not sure we want to do that. There is not set of raw_align > value to select. What we need to make sure is the below. > > 1) While creating a namespace we need to make sure that namespace size > is multiple of direct-map mapping size. If we ensure that > size is aligned, we should also get the namespace start to be aligned? > > 2) While initialzing a namespace by scanning label area, we need to make > sure every namespace in the region satisfy the above requirement. If not > we should mark the region disabled. > > >> >> So on powerpc where PAGE_SIZE < arch_mapping_granulariy, the following: >> >> cat /sys/bus/nd/devices/$namespace/supported_aligns >> >> ...would show the same output as: >> >> cat /sys/bus/nd/devices/$pfn/align >> >> ...but with any alignment choice less than arch_mapping_granulariy removed. >> > > I am not sure why we need to do that. For example: even if we have > direct-map mapping size as PAGE_SIZE (64K), we still should allow an user > mapping with hugepage size (16M)? > Considering the direct-map map size is not going to be user selectable, do you agree that we can skip the above step 0 configuration you suggested. The changes proposed in the patch series essentially does the rest. 1) It validate the size against the arch specific limit during namespace creation. (part of step 1) 2) It also disable initializing a region if it find the size not correctly aligned as per the platform requirement. 3) Direct map mapping size is different from supported_alignment for a namespace. The supported alignment controls what possible PAGE SIZE user want the namespace to be mapped to user space. With the above do you think the current patch series is good? -aneesh
On Sat, Nov 16, 2019 at 4:15 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > > Hi Dan, > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > > > Dan Williams <dan.j.williams@intel.com> writes: > > > >> On Wed, Oct 30, 2019 at 10:35 PM Aneesh Kumar K.V > >> <aneesh.kumar@linux.ibm.com> wrote: > >> [..] > >>> > True, for the pfn device and the device-dax mapping size, but I'm > >>> > suggesting adding another instance of alignment control at the raw > >>> > namespace level. That would need to be disconnected from the > >>> > device-dax page mapping granularity. > >>> > > >>> > >>> Can you explain what you mean by raw namespace level ? We don't have > >>> multiple values against which we need to check the alignment of > >>> namespace start and namespace size. > >>> > >>> If you can outline how and where you would like to enforce that check I > >>> can start working on it. > >>> > >> > >> What I mean is that the process of setting up a pfn namespace goes > >> something like this in shell script form: > >> > >> 1/ echo $size > /sys/bus/nd/devices/$namespace/size > >> 2/ echo $namespace > /sys/bus/nd/devices/$pfn/namespace > >> 3/ echo $pfn_align > /sys/bus/nd/devices/$pfn/align > >> > >> What I'm suggesting is add an optional 0th step that does: > >> > >> echo $raw_align > /sys/bus/nd/devices/$namespace/align > >> > >> Where the raw align needs to be needs to be max($pfn_align, > >> arch_mapping_granulariy). > > > > > > I started looking at this and was wondering about userspace being aware > > of the direct-map mapping size. We can figure that out by parsing kernel > > log > > > > [ 0.000000] Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24 > > > > > > But I am not sure we want to do that. There is not set of raw_align > > value to select. What we need to make sure is the below. > > > > 1) While creating a namespace we need to make sure that namespace size > > is multiple of direct-map mapping size. If we ensure that > > size is aligned, we should also get the namespace start to be aligned? > > > > 2) While initialzing a namespace by scanning label area, we need to make > > sure every namespace in the region satisfy the above requirement. If not > > we should mark the region disabled. > > > > > >> > >> So on powerpc where PAGE_SIZE < arch_mapping_granulariy, the following: > >> > >> cat /sys/bus/nd/devices/$namespace/supported_aligns > >> > >> ...would show the same output as: > >> > >> cat /sys/bus/nd/devices/$pfn/align > >> > >> ...but with any alignment choice less than arch_mapping_granulariy removed. > >> > > > > I am not sure why we need to do that. For example: even if we have > > direct-map mapping size as PAGE_SIZE (64K), we still should allow an user > > mapping with hugepage size (16M)? > > > > > Considering the direct-map map size is not going to be user selectable, > do you agree that we can skip the above step 0 configuration you > suggested. > > The changes proposed in the patch series essentially does the rest. > > 1) It validate the size against the arch specific limit during > namespace creation. (part of step 1) This validation is a surprise failure to ndctl. > 2) It also disable initializing a region if it find the size not > correctly aligned as per the platform requirement. There needs to be a way for the user to discover the correct alignment that the kernel will accept. > 3) Direct map mapping size is different from supported_alignment for a > namespace. The supported alignment controls what possible PAGE SIZE user want the > namespace to be mapped to user space. No, the namespace alignment is different than the page mapping size. The alignment is only interpreted as a mapping size at the device-dax level, otherwise at the raw namespace level it's just an arbitrary alignment. > With the above do you think the current patch series is good? I don't think we've quite converged on a solution.
Dan Williams <dan.j.williams@intel.com> writes: > On Sat, Nov 16, 2019 at 4:15 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> .... >> >> Considering the direct-map map size is not going to be user selectable, >> do you agree that we can skip the above step 0 configuration you >> suggested. >> >> The changes proposed in the patch series essentially does the rest. >> >> 1) It validate the size against the arch specific limit during >> namespace creation. (part of step 1) > > This validation is a surprise failure to ndctl. > >> 2) It also disable initializing a region if it find the size not >> correctly aligned as per the platform requirement. > > There needs to be a way for the user to discover the correct alignment > that the kernel will accept. > >> 3) Direct map mapping size is different from supported_alignment for a >> namespace. The supported alignment controls what possible PAGE SIZE user want the >> namespace to be mapped to user space. > > No, the namespace alignment is different than the page mapping size. > The alignment is only interpreted as a mapping size at the device-dax > level, otherwise at the raw namespace level it's just an arbitrary > alignment. > >> With the above do you think the current patch series is good? > > I don't think we've quite converged on a solution. How about we make it a property of seed device. ie, we add `supported_size_align` RO attribute to the seed device. ndctl can use this to validate the size value. So this now becomes step0 sys/bus/nd/devices/region0> cat btt0.0/supported_size_align 16777216 /sys/bus/nd/devices/region0> cat pfn0.0/supported_size_align 16777216 /sys/bus/nd/devices/region0> cat dax0.0/supported_size_align 16777216 /sys/bus/nd/devices/region0> We follow that up with validating the size value written to size attribute(step 1). While initializing the namespaces already present in a region we again validate the size and if not properly aligned we mark the region disabled. -aneesh
On Mon, Nov 18, 2019 at 1:52 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > Dan Williams <dan.j.williams@intel.com> writes: > > > On Sat, Nov 16, 2019 at 4:15 AM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > >> > > .... > > > >> > >> Considering the direct-map map size is not going to be user selectable, > >> do you agree that we can skip the above step 0 configuration you > >> suggested. > >> > >> The changes proposed in the patch series essentially does the rest. > >> > >> 1) It validate the size against the arch specific limit during > >> namespace creation. (part of step 1) > > > > This validation is a surprise failure to ndctl. > > > >> 2) It also disable initializing a region if it find the size not > >> correctly aligned as per the platform requirement. > > > > There needs to be a way for the user to discover the correct alignment > > that the kernel will accept. > > > >> 3) Direct map mapping size is different from supported_alignment for a > >> namespace. The supported alignment controls what possible PAGE SIZE user want the > >> namespace to be mapped to user space. > > > > No, the namespace alignment is different than the page mapping size. > > The alignment is only interpreted as a mapping size at the device-dax > > level, otherwise at the raw namespace level it's just an arbitrary > > alignment. > > > >> With the above do you think the current patch series is good? > > > > I don't think we've quite converged on a solution. > > How about we make it a property of seed device. ie, > we add `supported_size_align` RO attribute to the seed device. ndctl can > use this to validate the size value. So this now becomes step0 > > sys/bus/nd/devices/region0> cat btt0.0/supported_size_align > 16777216 > /sys/bus/nd/devices/region0> cat pfn0.0/supported_size_align > 16777216 > /sys/bus/nd/devices/region0> cat dax0.0/supported_size_align > 16777216 Why on those devices and not namespace0.0? > We follow that up with validating the size value written to size > attribute(step 1). > > While initializing the namespaces already present in a region we again > validate the size and if not properly aligned we mark the region > disabled. The region might have a mix of namespaces, some aligned and some not, only the misaligned namespaces should fail to enable. The region should otherwise enable successfully.
On 11/19/19 11:28 PM, Dan Williams wrote: > On Mon, Nov 18, 2019 at 1:52 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> Dan Williams <dan.j.williams@intel.com> writes: >> >>> On Sat, Nov 16, 2019 at 4:15 AM Aneesh Kumar K.V >>> <aneesh.kumar@linux.ibm.com> wrote: >>>> >> >> .... >> >> >>>> >>>> Considering the direct-map map size is not going to be user selectable, >>>> do you agree that we can skip the above step 0 configuration you >>>> suggested. >>>> >>>> The changes proposed in the patch series essentially does the rest. >>>> >>>> 1) It validate the size against the arch specific limit during >>>> namespace creation. (part of step 1) >>> >>> This validation is a surprise failure to ndctl. >>> >>>> 2) It also disable initializing a region if it find the size not >>>> correctly aligned as per the platform requirement. >>> >>> There needs to be a way for the user to discover the correct alignment >>> that the kernel will accept. >>> >>>> 3) Direct map mapping size is different from supported_alignment for a >>>> namespace. The supported alignment controls what possible PAGE SIZE user want the >>>> namespace to be mapped to user space. >>> >>> No, the namespace alignment is different than the page mapping size. >>> The alignment is only interpreted as a mapping size at the device-dax >>> level, otherwise at the raw namespace level it's just an arbitrary >>> alignment. >>> >>>> With the above do you think the current patch series is good? >>> >>> I don't think we've quite converged on a solution. >> >> How about we make it a property of seed device. ie, >> we add `supported_size_align` RO attribute to the seed device. ndctl can >> use this to validate the size value. So this now becomes step0 >> >> sys/bus/nd/devices/region0> cat btt0.0/supported_size_align >> 16777216 >> /sys/bus/nd/devices/region0> cat pfn0.0/supported_size_align >> 16777216 >> /sys/bus/nd/devices/region0> cat dax0.0/supported_size_align >> 16777216 > > Why on those devices and not namespace0.0? sure. > >> We follow that up with validating the size value written to size >> attribute(step 1). >> >> While initializing the namespaces already present in a region we again >> validate the size and if not properly aligned we mark the region >> disabled. > > The region might have a mix of namespaces, some aligned and some not, > only the misaligned namespaces should fail to enable. The region > should otherwise enable successfully. > One misaligned namespace would mean, we get other namespace resource start addr wrongly aligned. If we allow regions to be enabled with namespace with wrong size, user would find further namespace creation in that regions failing due to wrongly aligned resource start. IMHO that is a confusing user experience. -aneesh
On Tue, Nov 19, 2019 at 7:19 PM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > On 11/19/19 11:28 PM, Dan Williams wrote: > > On Mon, Nov 18, 2019 at 1:52 AM Aneesh Kumar K.V > > <aneesh.kumar@linux.ibm.com> wrote: > >> > >> Dan Williams <dan.j.williams@intel.com> writes: > >> > >>> On Sat, Nov 16, 2019 at 4:15 AM Aneesh Kumar K.V > >>> <aneesh.kumar@linux.ibm.com> wrote: > >>>> > >> > >> .... > >> > >> > >>>> > >>>> Considering the direct-map map size is not going to be user selectable, > >>>> do you agree that we can skip the above step 0 configuration you > >>>> suggested. > >>>> > >>>> The changes proposed in the patch series essentially does the rest. > >>>> > >>>> 1) It validate the size against the arch specific limit during > >>>> namespace creation. (part of step 1) > >>> > >>> This validation is a surprise failure to ndctl. > >>> > >>>> 2) It also disable initializing a region if it find the size not > >>>> correctly aligned as per the platform requirement. > >>> > >>> There needs to be a way for the user to discover the correct alignment > >>> that the kernel will accept. > >>> > >>>> 3) Direct map mapping size is different from supported_alignment for a > >>>> namespace. The supported alignment controls what possible PAGE SIZE user want the > >>>> namespace to be mapped to user space. > >>> > >>> No, the namespace alignment is different than the page mapping size. > >>> The alignment is only interpreted as a mapping size at the device-dax > >>> level, otherwise at the raw namespace level it's just an arbitrary > >>> alignment. > >>> > >>>> With the above do you think the current patch series is good? > >>> > >>> I don't think we've quite converged on a solution. > >> > >> How about we make it a property of seed device. ie, > >> we add `supported_size_align` RO attribute to the seed device. ndctl can > >> use this to validate the size value. So this now becomes step0 > >> > >> sys/bus/nd/devices/region0> cat btt0.0/supported_size_align > >> 16777216 > >> /sys/bus/nd/devices/region0> cat pfn0.0/supported_size_align > >> 16777216 > >> /sys/bus/nd/devices/region0> cat dax0.0/supported_size_align > >> 16777216 > > > > Why on those devices and not namespace0.0? > > sure. > > > > >> We follow that up with validating the size value written to size > >> attribute(step 1). > >> > >> While initializing the namespaces already present in a region we again > >> validate the size and if not properly aligned we mark the region > >> disabled. > > > > The region might have a mix of namespaces, some aligned and some not, > > only the misaligned namespaces should fail to enable. The region > > should otherwise enable successfully. > > > > One misaligned namespace would mean, we get other namespace resource > start addr wrongly aligned. If we allow regions to be enabled with > namespace with wrong size, user would find further namespace creation in > that regions failing due to wrongly aligned resource start. IMHO that is > a confusing user experience. > Why would one wrongly aligned namespace prevent other namespaces from being aligned? There's no requirement that consecutive namespaces are allocated contiguously. Also consider a namespace that starts misaligned, but ends aligned. That subsequent namespace can be enabled without issue.
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c index ac485163a4a7..90c54c600023 100644 --- a/arch/arm64/mm/flush.c +++ b/arch/arm64/mm/flush.c @@ -91,4 +91,15 @@ void arch_invalidate_pmem(void *addr, size_t size) __inval_dcache_area(addr, size); } EXPORT_SYMBOL_GPL(arch_invalidate_pmem); + +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) +{ + u32 remainder; + + div_u64_rem(size, PAGE_SIZE * ndr_mappings, &remainder); + if (remainder) + return PAGE_SIZE * ndr_mappings; + return 0; +} +EXPORT_SYMBOL_GPL(arch_validate_namespace_size); #endif diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c index 377712e85605..2e661a08dae5 100644 --- a/arch/powerpc/lib/pmem.c +++ b/arch/powerpc/lib/pmem.c @@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size) unsigned long start = (unsigned long) addr; flush_dcache_range(start, start + size); } -EXPORT_SYMBOL(arch_wb_cache_pmem); +EXPORT_SYMBOL_GPL(arch_wb_cache_pmem); void arch_invalidate_pmem(void *addr, size_t size) { unsigned long start = (unsigned long) addr; flush_dcache_range(start, start + size); } -EXPORT_SYMBOL(arch_invalidate_pmem); +EXPORT_SYMBOL_GPL(arch_invalidate_pmem); + +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) +{ + u32 remainder; + unsigned long linear_map_size; + + if (radix_enabled()) + linear_map_size = PAGE_SIZE; + else + linear_map_size = (1UL << mmu_psize_defs[mmu_linear_psize].shift); + + div_u64_rem(size, linear_map_size * ndr_mappings, &remainder); + if (remainder) + return linear_map_size * ndr_mappings; + return 0; +} +EXPORT_SYMBOL_GPL(arch_validate_namespace_size); /* * CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE symbols diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 0d09cc5aad61..023329d7dfac 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -310,6 +310,18 @@ void arch_invalidate_pmem(void *addr, size_t size) } EXPORT_SYMBOL_GPL(arch_invalidate_pmem); +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size) +{ + u32 remainder; + + div_u64_rem(size, PAGE_SIZE * ndr_mappings, &remainder); + if (remainder) + return PAGE_SIZE * ndr_mappings; + return 0; +} +EXPORT_SYMBOL_GPL(arch_validate_namespace_size); + + static void __cpa_flush_all(void *arg) { unsigned long cache = (unsigned long)arg; diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h index b6eddf912568..e2f8387d9ef4 100644 --- a/include/linux/libnvdimm.h +++ b/include/linux/libnvdimm.h @@ -291,4 +291,5 @@ static inline void arch_invalidate_pmem(void *addr, size_t size) } #endif +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size); #endif /* __LIBNVDIMM_H__ */
The page size used to map the namespace is arch dependent. For example architectures like ppc64 use 16MB page size for direct-mapping. If the namespace size is not aligned to the mapping page size, we can observe kernel crash during namespace init and destroy. This is due to kernel doing partial map/unmap of the resource range BUG: Unable to handle kernel data access at 0xc001000406000000 Faulting instruction address: 0xc000000000090790 NIP [c000000000090790] arch_add_memory+0xc0/0x130 LR [c000000000090744] arch_add_memory+0x74/0x130 Call Trace: arch_add_memory+0x74/0x130 (unreliable) memremap_pages+0x74c/0xa30 devm_memremap_pages+0x3c/0xa0 pmem_attach_disk+0x188/0x770 nvdimm_bus_probe+0xd8/0x470 really_probe+0x148/0x570 driver_probe_device+0x19c/0x1d0 device_driver_attach+0xcc/0x100 bind_store+0x134/0x1c0 drv_attr_store+0x44/0x60 sysfs_kf_write+0x74/0xc0 kernfs_fop_write+0x1b4/0x290 __vfs_write+0x3c/0x70 vfs_write+0xd0/0x260 ksys_write+0xdc/0x130 system_call+0x5c/0x68 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> --- arch/arm64/mm/flush.c | 11 +++++++++++ arch/powerpc/lib/pmem.c | 21 +++++++++++++++++++-- arch/x86/mm/pageattr.c | 12 ++++++++++++ include/linux/libnvdimm.h | 1 + 4 files changed, 43 insertions(+), 2 deletions(-)