Message ID | 20240709002757.2431399-1-scott@os.amperecomputing.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] resource: limit request_free_mem_region based on arch_get_mappable_range |
D Scott Phillips wrote:
> On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order
> of struct page size to dimension region"), the amdgpu driver could trip
> over the warning of:
>
> `WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));`
>
> in vmemmap_populate()[1]. After that commit, it becomes a translation fault
> and panic[2].
>
> The cause is that the amdgpu driver allocates some unused space from
> iomem_resource and claims it as MEMORY_DEVICE_PRIVATE and
> devm_memremap_pages() it. An address above those backed by the arm64
> vmemmap is picked.
>
> Limit request_free_mem_region() so that only addresses within the
> arch_get_mappable_range() can be chosen as device private addresses.

It seems odd that devm_request_free_mem_region() needs to be careful
about this restriction. The caller passes in the resource tree that is
the bounds of valid address ranges. This change assumes that the caller
wants to be restricted to vmemmap-capable address ranges beyond the
restrictions it already requested in the passed-in @base argument. That
restriction may be true with respect to request_mem_region(), but not
necessarily for other users of get_free_mem_region() like
alloc_free_mem_region().

So, 2 questions / change request options:

1/ Preferred: Is there a possibility for the AMD driver to trim the
resource it is passing to be bound by arch_get_mappable_range()? For CXL
this is achieved by inserting CXL aperture windows into the resource
tree.

In the future, what happens in the MEMORY_DEVICE_PUBLIC case when the
memory address is picked by a hardware aperture on the device? It occurs
to me that if that aperture is communicated to the device via some
platform mechanism (to honor arch_get_mappable_range() restrictions),
then maybe the same should be done here.

I have always cringed at the request_free_mem_region() implementation
playing fast and loose with the platform memory map. Maybe this episode
is a sign that these constraints need more formal handling in the
resource tree.

I.e. IORES_DESC_DEVICE_PRIVATE_MEMORY becomes a platform-communicated
aperture rather than hoping that unused portions of iomem_resource can
be repurposed like this.

2/ If option 1/ proves too difficult, can you rework the consideration
of @base to be gated by @desc? Something like:

	if (desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
		base_start = max(base->start, arch_get_mappable_range().start);
		base_end = min(base->end, arch_get_mappable_range().end);
	} else {
		base_start = base->start;
		base_end = base->end;
	}

...to localize the consideration of arch_get_mappable_range() only to
the places it is absolutely required?
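[Editorial note: a rough sketch of what option 2/ above could look like
inside kernel/resource.c. The helper name gfr_clamp_base() and the way it
would be wired into get_free_mem_region()'s gfr_start()/gfr_continue()
helpers are assumptions for illustration, not an actual patch.]

/* Sketch only: clamp the search window for device-private requests. */
static void gfr_clamp_base(struct resource *base, unsigned long desc,
			   resource_size_t *start, resource_size_t *end)
{
	*start = base->start;
	*end = base->end;

	if (desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
		struct range mappable = arch_get_mappable_range();

		*start = max_t(resource_size_t, *start, mappable.start);
		*end = min_t(resource_size_t, *end, mappable.end);
	}
}

The gfr_start()/gfr_continue() walk would then bound itself with
*start/*end instead of base->start/base->end, leaving other
get_free_mem_region() callers untouched.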
Dan Williams <dan.j.williams@intel.com> writes:
[..]
> So, 2 questions / change request options:
>
> 1/ Preferred: Is there a possibility for the AMD driver to trim the
> resource it is passing to be bound by arch_get_mappable_range()? For CXL
> this is achieved by inserting CXL aperture windows into the resource
> tree.
[..]

Hi Dan, sorry for my incredibly delayed response, I lost your message to
a filter on my end :(

I'm happy to work toward your preferred approach here, though I'm not
sure I know how to achieve it. I think I understand how cxl is keeping
device_private_memory out, but I don't think I understand the resource
system well enough to see how amdgpu can make a properly trimmed
resource for request_free_mem_region.
My novice attempt would be something like:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 8ee3d07ffbdfa..d84de6d66ac45 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -1038,7 +1039,14 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
 		pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
 		pgmap->type = MEMORY_DEVICE_COHERENT;
 	} else {
-		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+		struct range mappable;
+		struct resource root;
+
+		mappable = arch_get_mappable_range();
+		root.start = mappable.start;
+		root.end = mappable.end;
+		root.child = iomem_resource.child;
+		res = devm_request_free_mem_region(adev->dev, &root, size);
 		if (IS_ERR(res))
 			return PTR_ERR(res);
 		pgmap->range.start = res->start;

Apart from this being wrong with respect to resource_lock, is that sort
of the idea? or am I missing the sensible way to hoist the vmemmap range
into iomem_resource? or maybe I'm just totally off in the weeds.
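[Editorial note: for context on the "insert an aperture into the resource
tree" direction Dan prefers, here is a heavily simplified sketch of that
pattern, loosely modeled on how CXL publishes firmware-described windows.
Every name and address below is made up for illustration.]

static struct resource example_private_window = {
	.name  = "example device private window",	/* hypothetical */
	.start = 0x100000000000ULL,	/* would come from platform firmware */
	.end   = 0x1000ffffffffULL,
	.flags = IORESOURCE_MEM,
};

static int example_publish_window(void)
{
	/*
	 * Publish a platform-communicated aperture in iomem_resource so
	 * that later allocations come out of a range the platform (and
	 * the arch vmemmap) can actually back, rather than from arbitrary
	 * free space in the tree.
	 */
	return insert_resource(&iomem_resource, &example_private_window);
}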
D Scott Phillips wrote:
[..]
> Hi Dan, sorry for my incredibly delayed response, I lost your message to
> a filter on my end :(
>
> I'm happy to work toward your preferred approach here, though I'm not
> sure I know how to achieve it. I think I understand how cxl is keeping
> device_private_memory out, but I don't think I understand the resource
> system well enough to see how amdgpu can make a properly trimmed
> resource for request_free_mem_region. My novice attempt would be
> something like:
[..]
> Apart from this being wrong with respect to resource_lock, is that sort
> of the idea? or am I missing the sensible way to hoist the vmemmap range
> into iomem_resource? or maybe I'm just totally off in the weeds.

You have the right idea; however, I think a better solution has appeared
in the meantime. See this recent fix from Thomas regarding collisions
between KASLR and request_free_mem_region():

http://lore.kernel.org/172418629773.2215.4158024254077335422.tip-bot2@tip-bot2

...in that case KASLR is limiting the maximum possible usable address
range in which request_free_mem_region() can play. For this
arch_get_mappable_range() restriction, can you adjust the new
@physmem_end variable for the same effect?
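[Editorial note: per Dan's description, the fix linked above exposes a
physical-address ceiling (@physmem_end / PHYSMEM_END) that the resource
allocator honors. Assuming that interface, one conceivable way for an
architecture to fold arch_get_mappable_range() into it is a clamp along
these lines; this is illustrative only and not the merged change.]

/*
 * Illustrative only: clamp the generic physical-address ceiling to the
 * range the architecture can actually map (i.e. cover with struct
 * pages), so the free-region search in kernel/resource.c never wanders
 * above it. Whether this ends up as a macro, a variable, or something
 * else entirely is up to the referenced series.
 */
#define PHYSMEM_END	(arch_get_mappable_range().end)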
Dan Williams <dan.j.williams@intel.com> writes:
> D Scott Phillips wrote:
[..]
> You have the right idea; however, I think a better solution has appeared
> in the meantime. See this recent fix from Thomas regarding collisions
> between KASLR and request_free_mem_region():
>
> http://lore.kernel.org/172418629773.2215.4158024254077335422.tip-bot2@tip-bot2
>
> ...in that case KASLR is limiting the maximum possible usable address
> range in which request_free_mem_region() can play. For this
> arch_get_mappable_range() restriction, can you adjust the new
> @physmem_end variable for the same effect?

Oh perfect, yes, I think that very directly addresses my problem; I'll
handle it that way. Thanks for the pointer, Dan.
diff --git a/kernel/resource.c b/kernel/resource.c
index fcbca39dbc450..6f256aa0191b4 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1832,25 +1832,28 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size,
 	if (flags & GFR_DESCENDING) {
 		resource_size_t end;
 
-		end = min_t(resource_size_t, base->end,
+		end = min3(base->end, arch_get_mappable_range().end,
 			    (1ULL << MAX_PHYSMEM_BITS) - 1);
 		return end - size + 1;
 	}
 
-	return ALIGN(base->start, align);
+	return ALIGN(max_t(resource_size_t, base->start,
+			   arch_get_mappable_range().start), align);
 }
 
 static bool gfr_continue(struct resource *base, resource_size_t addr,
 			 resource_size_t size, unsigned long flags)
 {
+
 	if (flags & GFR_DESCENDING)
-		return addr > size && addr >= base->start;
+		return addr > size && addr >= base->start &&
+		       addr >= arch_get_mappable_range().start;
 	/*
 	 * In the ascend case be careful that the last increment by
 	 * @size did not wrap 0.
 	 */
 	return addr > addr - size &&
-	       addr <= min_t(resource_size_t, base->end,
+	       addr <= min3(base->end, arch_get_mappable_range().end,
 			     (1ULL << MAX_PHYSMEM_BITS) - 1);
 }
 
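[Editorial note: to make the effect of the clamping above concrete, here
is a small standalone illustration of how the first candidate window of
the descending search moves once arch_get_mappable_range().end is folded
into the limit. This is not kernel code, and the bit widths are made-up
example values (52-bit MAX_PHYSMEM_BITS, 48-bit mappable end).]

#include <stdio.h>
#include <stdint.h>

#define EXAMPLE_MAX_PHYSMEM_END	((1ULL << 52) - 1)	/* (1 << MAX_PHYSMEM_BITS) - 1 */
#define EXAMPLE_MAPPABLE_END	((1ULL << 48) - 1)	/* arch_get_mappable_range().end */

static uint64_t min3_u64(uint64_t a, uint64_t b, uint64_t c)
{
	uint64_t m = a < b ? a : b;

	return m < c ? m : c;
}

int main(void)
{
	uint64_t base_end = ~0ULL;	/* iomem_resource.end on a 64-bit machine */
	uint64_t size = 1ULL << 30;	/* a 1 GiB device-private request */
	uint64_t old_end, new_end;

	/* Before the patch: only MAX_PHYSMEM_BITS bounds the search. */
	old_end = base_end < EXAMPLE_MAX_PHYSMEM_END ? base_end : EXAMPLE_MAX_PHYSMEM_END;
	/* After the patch: the mappable (vmemmap-capable) end bounds it too. */
	new_end = min3_u64(base_end, EXAMPLE_MAPPABLE_END, EXAMPLE_MAX_PHYSMEM_END);

	printf("first candidate before: [%#llx, %#llx]\n",
	       (unsigned long long)(old_end - size + 1), (unsigned long long)old_end);
	printf("first candidate after:  [%#llx, %#llx]\n",
	       (unsigned long long)(new_end - size + 1), (unsigned long long)new_end);
	return 0;
}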
On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order
of struct page size to dimension region"), the amdgpu driver could trip
over the warning of:

`WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));`

in vmemmap_populate()[1]. After that commit, it becomes a translation fault
and panic[2].

The cause is that the amdgpu driver allocates some unused space from
iomem_resource and claims it as MEMORY_DEVICE_PRIVATE and
devm_memremap_pages() it. An address above those backed by the arm64
vmemmap is picked.

Limit request_free_mem_region() so that only addresses within the
arch_get_mappable_range() can be chosen as device private addresses.

[1]: Call trace:
      vmemmap_populate+0x30/0x48
      __populate_section_memmap+0x40/0x90
      sparse_add_section+0xfc/0x3e8
      __add_pages+0xb4/0x168
      pagemap_range+0x300/0x410
      memremap_pages+0x184/0x2d8
      devm_memremap_pages+0x30/0x90
      kgd2kfd_init_zone_device+0xe0/0x1f0 [amdgpu]
      amdgpu_device_ip_init+0x674/0x888 [amdgpu]
      amdgpu_device_init+0x7bc/0xed8 [amdgpu]
      amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu]
      amdgpu_pci_probe+0x194/0x580 [amdgpu]
      local_pci_probe+0x48/0xb8
      work_for_cpu_fn+0x24/0x40
      process_one_work+0x170/0x3e0
      worker_thread+0x2ac/0x3e0
      kthread+0xf4/0x108
      ret_from_fork+0x10/0x20

[2]: Unable to handle kernel paging request at virtual address 000001ffa6000034
     Mem abort info:
       ESR = 0x0000000096000044
       EC = 0x25: DABT (current EL), IL = 32 bits
       SET = 0, FnV = 0
       EA = 0, S1PTW = 0
       FSC = 0x04: level 0 translation fault
     Data abort info:
       ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
       CM = 0, WnR = 1, TnD = 0, TagAccess = 0
       GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
     user pgtable: 4k pages, 48-bit VAs, pgdp=000008000287c000
     [000001ffa6000034] pgd=0000000000000000, p4d=0000000000000000
     Call trace:
      __init_zone_device_page.constprop.0+0x2c/0xa8
      memmap_init_zone_device+0xf0/0x210
      pagemap_range+0x1e0/0x410
      memremap_pages+0x18c/0x2e0
      devm_memremap_pages+0x30/0x90
      kgd2kfd_init_zone_device+0xf0/0x200 [amdgpu]
      amdgpu_device_ip_init+0x674/0x888 [amdgpu]
      amdgpu_device_init+0x7a4/0xea0 [amdgpu]
      amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu]
      amdgpu_pci_probe+0x1a0/0x560 [amdgpu]
      local_pci_probe+0x48/0xb8
      work_for_cpu_fn+0x24/0x40
      process_one_work+0x170/0x3e0
      worker_thread+0x2ac/0x3e0
      kthread+0xf4/0x108
      ret_from_fork+0x10/0x20

Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com>
---
Link to v1: https://lore.kernel.org/all/20240703210707.1986816-1-scott@os.amperecomputing.com/

Changes since v1:
 - Change from fiddling the architecture's MAX_PHYSMEM_BITS to checking
   arch_get_mappable_range().

 kernel/resource.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)
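[Editorial note: a minimal sketch of the allocation pattern the commit
message describes, in the spirit of the amdgpu/KFD path in the call
traces but not a copy of it. The function name and example_pgmap_ops are
placeholders; a real MEMORY_DEVICE_PRIVATE user supplies callbacks such
as .migrate_to_ram and .page_free there.]

/* Hypothetical driver-provided callbacks; empty here for brevity. */
static const struct dev_pagemap_ops example_pgmap_ops;

static int example_init_device_private(struct device *dev,
					struct dev_pagemap *pgmap,
					unsigned long size)
{
	struct resource *res;
	void *addr;

	/* Grab an unused chunk of the physical address space... */
	res = devm_request_free_mem_region(dev, &iomem_resource, size);
	if (IS_ERR(res))
		return PTR_ERR(res);

	/* ...and hand it to ZONE_DEVICE as device-private memory. */
	pgmap->type = MEMORY_DEVICE_PRIVATE;
	pgmap->range.start = res->start;
	pgmap->range.end = res->end;
	pgmap->nr_range = 1;
	pgmap->ops = &example_pgmap_ops;

	addr = devm_memremap_pages(dev, pgmap);
	return PTR_ERR_OR_ZERO(addr);
}

Without the clamping in this patch, the range handed back on arm64 can
sit above what the kernel's vmemmap covers, which is what trips the
WARN_ON (or, after 32697ff38287, the translation fault) shown above.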