
[v2] resource: limit request_free_mem_region based on arch_get_mappable_range

Message ID 20240709002757.2431399-1-scott@os.amperecomputing.com (mailing list archive)
State New, archived

Commit Message

D Scott Phillips July 9, 2024, 12:27 a.m. UTC
On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order
of struct page size to dimension region"), the amdgpu driver could trip
over the warning of:

`WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));`

in vmemmap_populate()[1]. After that commit, it becomes a translation fault
and panic[2].

The cause is that the amdgpu driver allocates some unused space from
iomem_resource, claims it as MEMORY_DEVICE_PRIVATE, and passes it to
devm_memremap_pages(). An address above the range backed by the arm64
vmemmap can be picked.

Limit request_free_mem_region() so that only addresses within the
arch_get_mappable_range() can be chosen as device private addresses.

[1]: Call trace:
      vmemmap_populate+0x30/0x48
      __populate_section_memmap+0x40/0x90
      sparse_add_section+0xfc/0x3e8
      __add_pages+0xb4/0x168
      pagemap_range+0x300/0x410
      memremap_pages+0x184/0x2d8
      devm_memremap_pages+0x30/0x90
      kgd2kfd_init_zone_device+0xe0/0x1f0 [amdgpu]
      amdgpu_device_ip_init+0x674/0x888 [amdgpu]
      amdgpu_device_init+0x7bc/0xed8 [amdgpu]
      amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu]
      amdgpu_pci_probe+0x194/0x580 [amdgpu]
      local_pci_probe+0x48/0xb8
      work_for_cpu_fn+0x24/0x40
      process_one_work+0x170/0x3e0
      worker_thread+0x2ac/0x3e0
      kthread+0xf4/0x108
      ret_from_fork+0x10/0x20

[2]: Unable to handle kernel paging request at virtual address 000001ffa6000034
     Mem abort info:
       ESR = 0x0000000096000044
       EC = 0x25: DABT (current EL), IL = 32 bits
       SET = 0, FnV = 0
       EA = 0, S1PTW = 0
       FSC = 0x04: level 0 translation fault
     Data abort info:
       ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
       CM = 0, WnR = 1, TnD = 0, TagAccess = 0
       GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
     user pgtable: 4k pages, 48-bit VAs, pgdp=000008000287c000
     [000001ffa6000034] pgd=0000000000000000, p4d=0000000000000000
     Call trace:
      __init_zone_device_page.constprop.0+0x2c/0xa8
      memmap_init_zone_device+0xf0/0x210
      pagemap_range+0x1e0/0x410
      memremap_pages+0x18c/0x2e0
      devm_memremap_pages+0x30/0x90
      kgd2kfd_init_zone_device+0xf0/0x200 [amdgpu]
      amdgpu_device_ip_init+0x674/0x888 [amdgpu]
      amdgpu_device_init+0x7a4/0xea0 [amdgpu]
      amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu]
      amdgpu_pci_probe+0x1a0/0x560 [amdgpu]
      local_pci_probe+0x48/0xb8
      work_for_cpu_fn+0x24/0x40
      process_one_work+0x170/0x3e0
      worker_thread+0x2ac/0x3e0
      kthread+0xf4/0x108
      ret_from_fork+0x10/0x20

Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com>
---
Link to v1: https://lore.kernel.org/all/20240703210707.1986816-1-scott@os.amperecomputing.com/
Changes since v1:
 - Change from fiddling the architecture's MAX_PHYSMEM_BITS to checking
   arch_get_mappable_range().

 kernel/resource.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Dan Williams July 9, 2024, 1:31 a.m. UTC | #1
D Scott Phillips wrote:
> On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order
> of struct page size to dimension region"), the amdgpu driver could trip
> over the warning of:
> 
> `WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));`
> 
> in vmemmap_populate()[1]. After that commit, it becomes a translation fault
> and panic[2].
> 
> The cause is that the amdgpu driver allocates some unused space from
> iomem_resource and claims it as MEMORY_DEVICE_PRIVATE and
> devm_memremap_pages() it. An address above those backed by the arm64
> vmemmap is picked.
> 
> Limit request_free_mem_region() so that only addresses within the
> arch_get_mappable_range() can be chosen as device private addresses.

It seems odd that devm_request_free_mem_region() needs to be careful
about this restriction. The caller passes in the resource tree that is
the bounds of valid address ranges. This change assumes that the caller
wants to be restricted to vmemmap capable address ranges beyond the
restrictions it already requested in the passed in @base argument. That
restriction may be true with respect to request_mem_region(), but not
necessarily other users of get_free_mem_region() like
alloc_free_mem_region().

So, 2 questions / change request options:

1/ Preferred: Is there a possibility for the AMD driver to trim the
resource it is passing to be bound by arch_get_mappable_range()? For CXL
this is achieved by inserting CXL aperture windows into the resource
tree.

In the future what happens in the MEMORY_DEVICE_PUBLIC case when the
memory address is picked by a hardware aperture on the device? It occurs
to me if that aperture is communicated to the device via some platform
mechanism (to honor arch_get_mappable_range() restrictions), then maybe
the same should be done here.

I have always cringed at the request_free_mem_region() implementation
playing fast and loose with the platform memory map. Maybe this episode
is a sign that these constraints need more formal handling in the
resource tree.

I.e. IORES_DESC_DEVICE_PRIVATE_MEMORY becomes a platform communicated
aperture rather than hoping that unused portions of iomem_resource can
be repurposed like this.


2/ If option 1/ proves too difficult, can you rework the consideration
of @base to be gated by @desc?

Something like:

if (desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
	base_start = max(base->start, arch_get_mappable_range().start);
	base_end = min(base->end, arch_get_mappable_range().end);
} else {
	base_start = base->start;
	base_end = base->end;
}

...to localize the consideration of arch_get_mappable_range() only to
the places it is absolutely required?
D Scott Phillips Aug. 28, 2024, 2:57 a.m. UTC | #2
Dan Williams <dan.j.williams@intel.com> writes:

> D Scott Phillips wrote:
>> On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order
>> of struct page size to dimension region"), the amdgpu driver could trip
>> over the warning of:
>> 
>> `WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));`
>> 
>> in vmemmap_populate()[1]. After that commit, it becomes a translation fault
>> and panic[2].
>> 
>> The cause is that the amdgpu driver allocates some unused space from
>> iomem_resource and claims it as MEMORY_DEVICE_PRIVATE and
>> devm_memremap_pages() it. An address above those backed by the arm64
>> vmemmap is picked.
>> 
>> Limit request_free_mem_region() so that only addresses within the
>> arch_get_mappable_range() can be chosen as device private addresses.
>
> It seems odd that devm_request_free_mem_region() needs to be careful
> about this restriction. The caller passes in the resource tree that is
> the bounds of valid address ranges. This change assumes that the caller
> wants to be restricted to vmemmap capable address ranges beyond the
> restrictions it already requested in the passed in @base argument. That
> restriction may be true with respect to request_mem_region(), but not
> necessarily other users of get_free_mem_region() like
> alloc_free_mem_region().
>
> So, 2 questions / change request options:
>
> 1/ Preferred: Is there a possibility for the AMD driver to trim the
> resource it is passing to be bound by arch_get_mappable_range()? For CXL
> this is achieved by inserting CXL aperture windows into the resource
> tree.
>
> In the future what happens in the MEMORY_DEVICE_PUBLIC case when the
> memory address is picked by a hardware aperture on the device? It occurs
> to me if that aperture is communicated to the device via some platform
> mechanism (to honor arch_get_mappable_range() restrictions), then maybe
> the same should be done here.
>
> I have always cringed at the request_free_mem_region() implementation
> playing fast and loose with the platform memory map. Maybe this episode
> is a sign that these constraints need more formal handling in the
> resource tree.
>
> I.e. IORES_DESC_DEVICE_PRIVATE_MEMORY becomes a platform communicated
> aperture rather than hoping that unused portions of iomem_resource can
> be repurposed like this.

Hi Dan, sorry for my incredibly delayed response, I lost your message to
a filter on my end :(

I'm happy to work toward your preferred approach here, though I'm not
sure I know how to achieve it. I think I understand how cxl is keeping
device_private_memory out, but I don't think I understand the resource
system well enough to see how amdgpu can make a properly trimmed
resource for request_free_mem_region. My novice attempt would be
something like:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 8ee3d07ffbdfa..d84de6d66ac45 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -1038,7 +1039,14 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
                pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
                pgmap->type = MEMORY_DEVICE_COHERENT;
        } else {
-               res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+               struct range mappable;
+               struct resource root;
+
+               mappable = arch_get_mappable_range();
+               root.start = mappable.start;
+               root.end = mappable.end;
+               root.child = iomem_resource.child;
+               res = devm_request_free_mem_region(adev->dev, &root, size);
                if (IS_ERR(res))
                        return PTR_ERR(res);
                pgmap->range.start = res->start;

Apart from this being wrong with respect to resource_lock, is that sort
of the idea? or am I missing the sensible way to hoist the vmemmap range
into iomem_resource? or maybe I'm just totally off in the weeds.
Dan Williams Aug. 29, 2024, 12:08 a.m. UTC | #3
D Scott Phillips wrote:
[..]
> Hi Dan, sorry for my incredibly delayed response, I lost your message to
> a filter on my end :(
> 
> I'm happy to work toward your preferred approach here, though I'm not
> sure I know how to achieve it. I think I understand how cxl is keeping
> device_private_memory out, but I don't think I understand the resource
> system well enough to see how amdgpu can make a properly trimmed
> resource for request_free_mem_region. My novice attempt would be
> something like:
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index 8ee3d07ffbdfa..d84de6d66ac45 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -1038,7 +1039,14 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
>                 pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
>                 pgmap->type = MEMORY_DEVICE_COHERENT;
>         } else {
> -               res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
> +               struct range mappable;
> +               struct resource root;
> +
> +               mappable = arch_get_mappable_range();
> +               root.start = mappable.start;
> +               root.end = mappable.end;
> +               root.child = iomem_resource.child;
> +               res = devm_request_free_mem_region(adev->dev, &root, size);
>                 if (IS_ERR(res))
>                         return PTR_ERR(res);
>                 pgmap->range.start = res->start;
> 
> Apart from this being wrong with respect to resource_lock, is that sort
> of the idea? or am I missing the sensible way to hoist the vmemmap range
> into iomem_resource? or maybe I'm just totally off in the weeds.

You have the right idea, however, I think a better solution has appeared
in the meantime. See this recent fix from Thomas regarding collisions
between KASLR and request_free_mem_region():

http://lore.kernel.org/172418629773.2215.4158024254077335422.tip-bot2@tip-bot2

...in that case KASLR is limiting the maximum possible usable address
range that request_free_mem_region() can play in. For this
arch_get_mappable_range() restriction, can you adjust the new
@physmem_end variable for the same effect?
D Scott Phillips Aug. 29, 2024, 5:19 p.m. UTC | #4
Dan Williams <dan.j.williams@intel.com> writes:

> D Scott Phillips wrote:
> [..]
>> Hi Dan, sorry for my incredibly delayed response, I lost your message to
>> a filter on my end :(
>> 
>> I'm happy to work toward your preferred approach here, though I'm not
>> sure I know how to achieve it. I think I understand how cxl is keeping
>> device_private_memory out, but I don't think I understand the resource
>> system well enough to see how amdgpu can make a properly trimmed
>> resource for request_free_mem_region. My novice attempt would be
>> something like:
>> 
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> index 8ee3d07ffbdfa..d84de6d66ac45 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> @@ -1038,7 +1039,14 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
>>                 pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
>>                 pgmap->type = MEMORY_DEVICE_COHERENT;
>>         } else {
>> -               res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
>> +               struct range mappable;
>> +               struct resource root;
>> +
>> +               mappable = arch_get_mappable_range();
>> +               root.start = mappable.start;
>> +               root.end = mappable.end;
>> +               root.child = iomem_resource.child;
>> +               res = devm_request_free_mem_region(adev->dev, &root, size);
>>                 if (IS_ERR(res))
>>                         return PTR_ERR(res);
>>                 pgmap->range.start = res->start;
>> 
>> Apart from this being wrong with respect to resource_lock, is that sort
>> of the idea? or am I missing the sensible way to hoist the vmemmap range
>> into iomem_resource? or maybe I'm just totally off in the weeds.
>
> You have the right idea, however, I think a better solution has appeared
> in the meantime. See this recent fix from Thomas regarding collisions
> between KASLR and request_free_mem_region():
>
> http://lore.kernel.org/172418629773.2215.4158024254077335422.tip-bot2@tip-bot2
>
> ...in that case KASLR is limiting the maximum possible usable address
> range that request_free_mem_region() can play. For this
> arch_get_mappable_range() restriction can you adjust the new
> @physmem_end variable for the same effect?

Oh perfect, yes I think that very directly addresses my problem, I'll
handle it that way. Thanks for the pointer Dan.

Patch

diff --git a/kernel/resource.c b/kernel/resource.c
index fcbca39dbc450..6f256aa0191b4 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1832,25 +1832,28 @@  static resource_size_t gfr_start(struct resource *base, resource_size_t size,
 	if (flags & GFR_DESCENDING) {
 		resource_size_t end;
 
-		end = min_t(resource_size_t, base->end,
+		end = min3(base->end, arch_get_mappable_range().end,
 			    (1ULL << MAX_PHYSMEM_BITS) - 1);
 		return end - size + 1;
 	}
 
-	return ALIGN(base->start, align);
+	return ALIGN(max_t(resource_size_t, base->start,
+			   arch_get_mappable_range().start), align);
 }
 
 static bool gfr_continue(struct resource *base, resource_size_t addr,
 			 resource_size_t size, unsigned long flags)
 {
+
 	if (flags & GFR_DESCENDING)
-		return addr > size && addr >= base->start;
+		return addr > size && addr >= base->start &&
+		       addr >= arch_get_mappable_range().start;
 	/*
 	 * In the ascend case be careful that the last increment by
 	 * @size did not wrap 0.
 	 */
 	return addr > addr - size &&
-	       addr <= min_t(resource_size_t, base->end,
+	       addr <= min3(base->end, arch_get_mappable_range().end,
 			     (1ULL << MAX_PHYSMEM_BITS) - 1);
 }