Message ID | 20221228163102.468-1-mario.limonciello@amd.com (mailing list archive) |
---|---|
Series | Recover from failure to probe GPU |
Patches 1-10 are:
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

On Wed, Dec 28, 2022 at 11:31 AM Mario Limonciello <mario.limonciello@amd.com> wrote:
>
> One of the first things that KMS drivers do during initialization is
> destroy the system firmware framebuffer by means of
> `drm_aperture_remove_conflicting_pci_framebuffers`.
>
> This means that if for any reason the GPU failed to probe, the user
> will be stuck with at best a screen frozen at the last thing that
> was shown before the KMS driver continued its probe.
>
> The problem is most pronounced when new GPU support is introduced
> because users will need to have a recent linux-firmware snapshot
> on their system when they boot a kernel with matching support.
>
> However the problem is further exacerbated in the case of amdgpu because
> it has migrated to "IP discovery" where amdgpu will attempt to load
> on "ALL" AMD GPUs even if the driver is missing support for IP blocks
> contained in that GPU.
>
> IP discovery requires some probing and isn't run until after the
> framebuffer has been destroyed.
>
> This means a situation can occur where a user purchases a new GPU not
> yet supported by a distribution and when booting the installer it will
> "freeze" even if the distribution doesn't have the matching kernel support
> for those IP blocks.
>
> The perfect example of this is Ubuntu 22.10 and the new dGPUs just
> launched by AMD. The installation media ships with kernel 5.19 (which
> has IP discovery) but the amdgpu support for those IP blocks landed in
> kernel 6.0. The matching linux-firmware was released after 22.10's launch.
> The screen will freeze without nomodeset. Even if a user manages to install
> and then upgrades to kernel 6.0 after install, they'll still have the
> problem of missing firmware, and the same experience.
>
> This is quite jarring for users, particularly if they don't know
> that they have to use "nomodeset" to install.
>
> To help the situation, make changes to GPU discovery:
> 1) Delay releasing the firmware framebuffer until after IP discovery has
> completed. This will help the situation of an older kernel that doesn't
> yet support the IP blocks probing a new GPU.
> 2) Request loading all PSP, VCN, SDMA, MES and GC microcode into memory
> during IP discovery. This will help the situation of a new enough kernel for
> the IP discovery phase to otherwise pass but missing microcode from
> linux-firmware.git.
>
> Not all requested firmware will be loaded during IP discovery as some of it
> will require larger driver architecture changes. For example SMU firmware
> isn't loaded on certain products, but that's not known until later on when
> the early_init phase of the SMU load occurs.
>
> v1->v2:
> * Take the suggestion from v1 thread to delay the framebuffer release until
> IP discovery is done. This patch is CC'd to stable so that older stable
> kernels with IP discovery won't try to probe unknown IP.
> * Drop changes to drm aperture.
> * Fetch SDMA, VCN, MES, GC and PSP microcode during IP discovery.
>
> Mario Limonciello (11):
> drm/amd: Delay removal of the firmware framebuffer
> drm/amd: Add a legacy mapping to "amdgpu_ucode_ip_version_decode"
> drm/amd: Convert SMUv11 microcode init to use
> `amdgpu_ucode_ip_version_decode`
> drm/amd: Convert SMU v13 to use `amdgpu_ucode_ip_version_decode`
> drm/amd: Request SDMA microcode during IP discovery
> drm/amd: Request VCN microcode during IP discovery
> drm/amd: Request MES microcode during IP discovery
> drm/amd: Request GFX9 microcode during IP discovery
> drm/amd: Request GFX10 microcode during IP discovery
> drm/amd: Request GFX11 microcode during IP discovery
> drm/amd: Request PSP microcode during IP discovery
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 590 +++++++++++++++++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 6 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 9 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 208 ++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 85 +--
> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 180 +-----
> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 64 +-
> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 143 +----
> drivers/gpu/drm/amd/amdgpu/mes_v10_1.c | 28 -
> drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 25 +-
> drivers/gpu/drm/amd/amdgpu/psp_v10_0.c | 106 +---
> drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 165 +----
> drivers/gpu/drm/amd/amdgpu/psp_v12_0.c | 102 +--
> drivers/gpu/drm/amd/amdgpu/psp_v13_0.c | 82 ---
> drivers/gpu/drm/amd/amdgpu/psp_v13_0_4.c | 36 --
> drivers/gpu/drm/amd/amdgpu/psp_v3_1.c | 36 --
> drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 61 +-
> drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 42 +-
> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 65 +-
> drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 30 +-
> .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 35 +-
> .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 12 +-
> 25 files changed, 919 insertions(+), 1203 deletions(-)
>
>
> base-commit: de9a71e391a92841582ca3008e7b127a0b8ccf41
> --
> 2.34.1
>
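
For readers following the thread, the reordering described in the cover letter can be illustrated with a minimal userspace sketch. This is not amdgpu code: every name below (`fake_gpu`, `fake_ip_discovery`, `probe_old_order`, and so on) is hypothetical, and the real driver paths are considerably more involved. The sketch only models the order of the three steps (framebuffer removal, IP discovery, microcode request) and what state the firmware framebuffer is left in when a step fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a GPU being probed; not an amdgpu structure. */
struct fake_gpu {
    bool ip_blocks_supported; /* does this kernel know the GPU's IP blocks? */
    bool microcode_present;   /* is matching linux-firmware installed? */
    bool firmware_fb_alive;   /* is the firmware framebuffer still usable? */
};

/* Stand-in for IP discovery: fails when the IP blocks are unknown. */
static int fake_ip_discovery(const struct fake_gpu *gpu)
{
    return gpu->ip_blocks_supported ? 0 : -1;
}

/* Stand-in for requesting microcode: fails when the files are missing. */
static int fake_request_microcode(const struct fake_gpu *gpu)
{
    return gpu->microcode_present ? 0 : -1;
}

/* Stand-in for removing the conflicting firmware framebuffer. */
static void fake_remove_firmware_fb(struct fake_gpu *gpu)
{
    gpu->firmware_fb_alive = false;
}

/* Old ordering: the framebuffer is destroyed before anything can fail. */
int probe_old_order(struct fake_gpu *gpu)
{
    assert(gpu != NULL);
    fake_remove_firmware_fb(gpu);
    if (fake_ip_discovery(gpu) != 0)
        return -1; /* screen is already frozen at this point */
    return fake_request_microcode(gpu);
}

/* Proposed ordering: discovery and microcode checks run first. */
int probe_new_order(struct fake_gpu *gpu)
{
    assert(gpu != NULL);
    if (fake_ip_discovery(gpu) != 0)
        return -1; /* firmware framebuffer untouched */
    if (fake_request_microcode(gpu) != 0)
        return -1; /* firmware framebuffer untouched */
    fake_remove_firmware_fb(gpu);
    return 0;
}
```

With `ip_blocks_supported` set to false, both functions return a nonzero value, but only `probe_new_order()` leaves `firmware_fb_alive` set, which corresponds to the user keeping a working console instead of a frozen screen.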
On 12/28/2022 10:00 PM, Mario Limonciello wrote:
> [...]
>
> To help the situation, make changes to GPU discovery:
> 1) Delay releasing the firmware framebuffer until after IP discovery has
> completed. This will help the situation of an older kernel that doesn't
> yet support the IP blocks probing a new GPU.
> 2) Request loading all PSP, VCN, SDMA, MES and GC microcode into memory
> during IP discovery. This will help the situation of a new enough kernel for
> the IP discovery phase to otherwise pass but missing microcode from
> linux-firmware.git.
>
> Not all requested firmware will be loaded during IP discovery as some of it
> will require larger driver architecture changes. For example SMU firmware
> isn't loaded on certain products, but that's not known until later on when
> the early_init phase of the SMU load occurs.
>
> v1->v2:
> * Take the suggestion from v1 thread to delay the framebuffer release until
> IP discovery is done. This patch is CC'd to stable so that older stable
> kernels with IP discovery won't try to probe unknown IP.
> * Drop changes to drm aperture.
> * Fetch SDMA, VCN, MES, GC and PSP microcode during IP discovery.
>

What is the gain here in just checking if firmware files are available?
It can fail anywhere during sw_init and it's the same situation.

Restricting IP FWs to IP-specific files looks better to me than
centralizing and creating interdependencies.

Thanks,
Lijo

> [...]
On Tue, Jan 3, 2023 at 5:10 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
>
> On 12/28/2022 10:00 PM, Mario Limonciello wrote:
> > [...]
> >
> > To help the situation, make changes to GPU discovery:
> > 1) Delay releasing the firmware framebuffer until after IP discovery has
> > completed. This will help the situation of an older kernel that doesn't
> > yet support the IP blocks probing a new GPU.
> > 2) Request loading all PSP, VCN, SDMA, MES and GC microcode into memory
> > during IP discovery. This will help the situation of a new enough kernel for
> > the IP discovery phase to otherwise pass but missing microcode from
> > linux-firmware.git.
> >
> > [...]
> >
> > v1->v2:
> > * Take the suggestion from v1 thread to delay the framebuffer release until
> > IP discovery is done. This patch is CC'd to stable so that older stable
> > kernels with IP discovery won't try to probe unknown IP.
> > * Drop changes to drm aperture.
> > * Fetch SDMA, VCN, MES, GC and PSP microcode during IP discovery.
> >
>
> What is the gain here in just checking if firmware files are available?
> It can fail anywhere during sw_init and it's the same situation.

Other failures are presumably a bug or hardware issue. The missing
firmware would be a common issue when chips are first launched.

Thinking about it a bit more, another option might be to move the calls
to request_firmware() into the IP-specific early_init() functions and
then move the drm_aperture release after early_init(). That would keep
the firmware handling in the IPs and should still happen early enough
that we haven't messed with the hardware yet.

Alex

>
> Restricting IP FWs to IP-specific files looks better to me than
> centralizing and creating interdependencies.
>
> Thanks,
> Lijo
>
> > [...]
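
The alternative floated in the last reply (each IP block requests its own firmware in its early_init(), and the framebuffer is released only after every early_init() has succeeded) can be sketched in the same userspace style. This is a rough model under stated assumptions, not the driver's implementation: `sketch_ip_block`, `sketch_device`, `sketch_early_init()` and `sketch_device_init()` are hypothetical names, and the `firmware_available` flag stands in for whatever a real `request_firmware()` call would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-IP state; each block owns its own firmware check. */
struct sketch_ip_block {
    const char *name;
    bool firmware_available; /* stand-in for a firmware request succeeding */
};

/* Hypothetical device: a list of IP blocks plus the framebuffer state. */
struct sketch_device {
    struct sketch_ip_block *blocks;
    size_t num_blocks;
    bool firmware_fb_alive;
};

/* Per-IP early init: only checks firmware, does not touch hardware. */
static int sketch_early_init(const struct sketch_ip_block *ip)
{
    return ip->firmware_available ? 0 : -1;
}

/*
 * Run every block's early init first. The firmware framebuffer is
 * released only once all of them have succeeded.
 */
int sketch_device_init(struct sketch_device *dev)
{
    assert(dev != NULL);
    for (size_t i = 0; i < dev->num_blocks; i++) {
        if (sketch_early_init(&dev->blocks[i]) != 0)
            return -1; /* bail out with the framebuffer still intact */
    }
    dev->firmware_fb_alive = false; /* stand-in for the aperture release */
    return 0;
}
```

The design point this models is that the firmware knowledge stays inside each IP block, as Lijo prefers, while the single framebuffer release still happens after all of the checks that are likely to fail on newly launched hardware.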