Message ID | 1465847120-17924-1-git-send-email-alexander.deucher@amd.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, 13 Jun 2016 15:45:20 -0400 Alex Deucher <alexdeucher@gmail.com> wrote: > When executing in a PCI passthrough based virtuzliation environment, the > hypervisor will usually attempt to send a PCIe bus reset signal to the > ASIC when the VM reboots. In this scenario, the card is not correctly > initialized, but we still consider it to be posted. Therefore, in a > passthrough based environemnt we should always post the card to guarantee > it is in a good state for driver initialization. > > Ported from amdgpu commit: > amdgpu: fix asic initialization for virtualized environments > > Cc: Andres Rodriguez <andres.rodriguez@amd.com> > Cc: Alex Williamson <alex.williamson@redhat.com> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com> > Cc: stable@vger.kernel.org > --- > drivers/gpu/drm/radeon/radeon_device.c | 21 +++++++++++++++++++++ > 1 file changed, 21 insertions(+) Thanks, I expect it's an improvement, though it's always a bit disappointing when a driver starts modifying its behavior based on what might be a transient feature of the platform, in this case a hypervisor platform. For instance, why does our bus reset and video ROM execution result in a different state than a physical BIOS doing the same? Can't this condition occur regardless of a hypervisor, perhaps a rare hot-add of a GPU, a bare metal kexec reboot, or perhaps simply a system BIOS optimized to post a limited set of devices. Detection based on some state of the device rather than an expectation based on what the device is running on seems preferable. I suspect Andres' patch for amdgpu only affects newer devices, which pretty much all suffer reset issues, at least under QEMU/VFIO, but I wonder how this patch affects existing working devices, like 6, 7, and some 8-series. Anyway, if this is the solution to the poor behavior we've seen with assigned AMD cards, maybe someone could request the same for the closed drivers, including Windows. Thanks, Alex > diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c > index e61c763..21c44b2 100644 > --- a/drivers/gpu/drm/radeon/radeon_device.c > +++ b/drivers/gpu/drm/radeon/radeon_device.c > @@ -630,6 +630,23 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc) > /* > * GPU helpers function. > */ > + > +/** > + * radeon_device_is_virtual - check if we are running is a virtual environment > + * > + * Check if the asic has been passed through to a VM (all asics). > + * Used at driver startup. > + * Returns true if virtual or false if not. > + */ > +static bool radeon_device_is_virtual(void) > +{ > +#ifdef CONFIG_X86 > + return boot_cpu_has(X86_FEATURE_HYPERVISOR); > +#else > + return false; > +#endif > +} > + > /** > * radeon_card_posted - check if the hw has already been initialized > * > @@ -643,6 +660,10 @@ bool radeon_card_posted(struct radeon_device *rdev) > { > uint32_t reg; > > + /* for pass through, always force asic_init */ > + if (radeon_device_is_virtual()) > + return false; > + > /* required for EFI mode on macbook2,1 which uses an r5xx asic */ > if (efi_enabled(EFI_BOOT) && > (rdev->pdev->subsystem_vendor == PCI_VENDOR_ID_APPLE) &&
On Mon, Jun 13, 2016 at 4:10 PM, Alex Williamson <alex.williamson@redhat.com> wrote: > On Mon, 13 Jun 2016 15:45:20 -0400 > Alex Deucher <alexdeucher@gmail.com> wrote: > >> When executing in a PCI passthrough based virtuzliation environment, the >> hypervisor will usually attempt to send a PCIe bus reset signal to the >> ASIC when the VM reboots. In this scenario, the card is not correctly >> initialized, but we still consider it to be posted. Therefore, in a >> passthrough based environemnt we should always post the card to guarantee >> it is in a good state for driver initialization. >> >> Ported from amdgpu commit: >> amdgpu: fix asic initialization for virtualized environments >> >> Cc: Andres Rodriguez <andres.rodriguez@amd.com> >> Cc: Alex Williamson <alex.williamson@redhat.com> >> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> >> Cc: stable@vger.kernel.org >> --- >> drivers/gpu/drm/radeon/radeon_device.c | 21 +++++++++++++++++++++ >> 1 file changed, 21 insertions(+) > > Thanks, I expect it's an improvement, though it's always a bit > disappointing when a driver starts modifying its behavior based on > what might be a transient feature of the platform, in this case a > hypervisor platform. For instance, why does our bus reset and video > ROM execution result in a different state than a physical BIOS doing > the same? Can't this condition occur regardless of a hypervisor, Just doing a pci reset is not enough on newer cards. The hw handling pci resets changed in CI and more of the logic moved to the driver. That does a limited reset, but not the registers that the driver checks to determine whether or not the asic has been posted so the driver skips posting and leaves the hw in a bad reset state. > perhaps a rare hot-add of a GPU, a bare metal kexec reboot, or perhaps > simply a system BIOS optimized to post a limited set of devices. We can tell if a card has never been posted and properly post it. Where it's tricky is when a card has been posted and has subsequently been pci reset on CI and newer hw. I'm not sure of a good way to detect this particular scenario. Generally this is mainly done for qemu/kvm. > Detection based on some state of the device rather than an expectation > based on what the device is running on seems preferable. I suspect > Andres' patch for amdgpu only affects newer devices, which pretty much > all suffer reset issues, at least under QEMU/VFIO, but I wonder how this > patch affects existing working devices, like 6, 7, and some 8-series. Posting the asic at init time should be safe on all asics. > Anyway, if this is the solution to the poor behavior we've seen with > assigned AMD cards, maybe someone could request the same for the closed > drivers, including Windows. Thanks, The closed drivers already do this. Alex > > Alex > >> diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c >> index e61c763..21c44b2 100644 >> --- a/drivers/gpu/drm/radeon/radeon_device.c >> +++ b/drivers/gpu/drm/radeon/radeon_device.c >> @@ -630,6 +630,23 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc) >> /* >> * GPU helpers function. >> */ >> + >> +/** >> + * radeon_device_is_virtual - check if we are running is a virtual environment >> + * >> + * Check if the asic has been passed through to a VM (all asics). >> + * Used at driver startup. >> + * Returns true if virtual or false if not. >> + */ >> +static bool radeon_device_is_virtual(void) >> +{ >> +#ifdef CONFIG_X86 >> + return boot_cpu_has(X86_FEATURE_HYPERVISOR); >> +#else >> + return false; >> +#endif >> +} >> + >> /** >> * radeon_card_posted - check if the hw has already been initialized >> * >> @@ -643,6 +660,10 @@ bool radeon_card_posted(struct radeon_device *rdev) >> { >> uint32_t reg; >> >> + /* for pass through, always force asic_init */ >> + if (radeon_device_is_virtual()) >> + return false; >> + >> /* required for EFI mode on macbook2,1 which uses an r5xx asic */ >> if (efi_enabled(EFI_BOOT) && >> (rdev->pdev->subsystem_vendor == PCI_VENDOR_ID_APPLE) && >
On Wed, 15 Jun 2016 02:23:37 -0400 Alex Deucher <alexdeucher@gmail.com> wrote: > On Mon, Jun 13, 2016 at 4:10 PM, Alex Williamson > <alex.williamson@redhat.com> wrote: > > On Mon, 13 Jun 2016 15:45:20 -0400 > > Alex Deucher <alexdeucher@gmail.com> wrote: > > > >> When executing in a PCI passthrough based virtuzliation environment, the > >> hypervisor will usually attempt to send a PCIe bus reset signal to the > >> ASIC when the VM reboots. In this scenario, the card is not correctly > >> initialized, but we still consider it to be posted. Therefore, in a > >> passthrough based environemnt we should always post the card to guarantee > >> it is in a good state for driver initialization. > >> > >> Ported from amdgpu commit: > >> amdgpu: fix asic initialization for virtualized environments > >> > >> Cc: Andres Rodriguez <andres.rodriguez@amd.com> > >> Cc: Alex Williamson <alex.williamson@redhat.com> > >> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> > >> Cc: stable@vger.kernel.org > >> --- > >> drivers/gpu/drm/radeon/radeon_device.c | 21 +++++++++++++++++++++ > >> 1 file changed, 21 insertions(+) > > > > Thanks, I expect it's an improvement, though it's always a bit > > disappointing when a driver starts modifying its behavior based on > > what might be a transient feature of the platform, in this case a > > hypervisor platform. For instance, why does our bus reset and video > > ROM execution result in a different state than a physical BIOS doing > > the same? Can't this condition occur regardless of a hypervisor, > > Just doing a pci reset is not enough on newer cards. The hw handling > pci resets changed in CI and more of the logic moved to the driver. Gag, please relay my disapproval to your hardware folks. > That does a limited reset, but not the registers that the driver > checks to determine whether or not the asic has been posted so the > driver skips posting and leaves the hw in a bad reset state. > > > perhaps a rare hot-add of a GPU, a bare metal kexec reboot, or perhaps > > simply a system BIOS optimized to post a limited set of devices. > > We can tell if a card has never been posted and properly post it. > Where it's tricky is when a card has been posted and has subsequently > been pci reset on CI and newer hw. I'm not sure of a good way to > detect this particular scenario. Generally this is mainly done for > qemu/kvm. How do you tell if a card has never been posted? Is it something we could easily toggle after a bus reset? > > Detection based on some state of the device rather than an expectation > > based on what the device is running on seems preferable. I suspect > > Andres' patch for amdgpu only affects newer devices, which pretty much > > all suffer reset issues, at least under QEMU/VFIO, but I wonder how this > > patch affects existing working devices, like 6, 7, and some 8-series. > > Posting the asic at init time should be safe on all asics. > > > Anyway, if this is the solution to the poor behavior we've seen with > > assigned AMD cards, maybe someone could request the same for the closed > > drivers, including Windows. Thanks, > > The closed drivers already do this. Hmm, that's not terribly encouraging then since the majority of users are running Windows guests for the purpose of creating a gaming VM and still experiencing reset issues with the closed drivers there. Thanks, Alex
On Wed, Jun 15, 2016 at 12:45 PM, Alex Williamson <alex.williamson@redhat.com> wrote: > On Wed, 15 Jun 2016 02:23:37 -0400 > Alex Deucher <alexdeucher@gmail.com> wrote: > >> On Mon, Jun 13, 2016 at 4:10 PM, Alex Williamson >> <alex.williamson@redhat.com> wrote: >> > On Mon, 13 Jun 2016 15:45:20 -0400 >> > Alex Deucher <alexdeucher@gmail.com> wrote: >> > >> >> When executing in a PCI passthrough based virtuzliation environment, the >> >> hypervisor will usually attempt to send a PCIe bus reset signal to the >> >> ASIC when the VM reboots. In this scenario, the card is not correctly >> >> initialized, but we still consider it to be posted. Therefore, in a >> >> passthrough based environemnt we should always post the card to guarantee >> >> it is in a good state for driver initialization. >> >> >> >> Ported from amdgpu commit: >> >> amdgpu: fix asic initialization for virtualized environments >> >> >> >> Cc: Andres Rodriguez <andres.rodriguez@amd.com> >> >> Cc: Alex Williamson <alex.williamson@redhat.com> >> >> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> >> >> Cc: stable@vger.kernel.org >> >> --- >> >> drivers/gpu/drm/radeon/radeon_device.c | 21 +++++++++++++++++++++ >> >> 1 file changed, 21 insertions(+) >> > >> > Thanks, I expect it's an improvement, though it's always a bit >> > disappointing when a driver starts modifying its behavior based on >> > what might be a transient feature of the platform, in this case a >> > hypervisor platform. For instance, why does our bus reset and video >> > ROM execution result in a different state than a physical BIOS doing >> > the same? Can't this condition occur regardless of a hypervisor, >> >> Just doing a pci reset is not enough on newer cards. The hw handling >> pci resets changed in CI and more of the logic moved to the driver. > > Gag, please relay my disapproval to your hardware folks. > >> That does a limited reset, but not the registers that the driver >> checks to determine whether or not the asic has been posted so the >> driver skips posting and leaves the hw in a bad reset state. >> >> > perhaps a rare hot-add of a GPU, a bare metal kexec reboot, or perhaps >> > simply a system BIOS optimized to post a limited set of devices. >> >> We can tell if a card has never been posted and properly post it. >> Where it's tricky is when a card has been posted and has subsequently >> been pci reset on CI and newer hw. I'm not sure of a good way to >> detect this particular scenario. Generally this is mainly done for >> qemu/kvm. > > How do you tell if a card has never been posted? Is it something we > could easily toggle after a bus reset? We check CONFIG_MEMSIZE which is a scratch register set by the asic_init command table to tell the driver how much vram is on the board. > >> > Detection based on some state of the device rather than an expectation >> > based on what the device is running on seems preferable. I suspect >> > Andres' patch for amdgpu only affects newer devices, which pretty much >> > all suffer reset issues, at least under QEMU/VFIO, but I wonder how this >> > patch affects existing working devices, like 6, 7, and some 8-series. >> >> Posting the asic at init time should be safe on all asics. >> >> > Anyway, if this is the solution to the poor behavior we've seen with >> > assigned AMD cards, maybe someone could request the same for the closed >> > drivers, including Windows. Thanks, >> >> The closed drivers already do this. > > Hmm, that's not terribly encouraging then since the majority of users > are running Windows guests for the purpose of creating a gaming VM and > still experiencing reset issues with the closed drivers there. Thanks, I'll have to check with the windows team to see how much validation they do with the windows driver as a qemu/kvm guest. It could be that they don't properly detect that as a virtual case. Alex
> -----Original Message----- > From: Alex Deucher [mailto:alexdeucher@gmail.com] > Sent: June-15-16 1:00 PM > To: Alex Williamson > Cc: Maling list - DRI developers; Deucher, Alexander; Rodriguez, Andres; for > 3.8 > Subject: Re: [PATCH] drm/radeon: fix asic initialization for virtualized > environments > > On Wed, Jun 15, 2016 at 12:45 PM, Alex Williamson > <alex.williamson@redhat.com> wrote: > > On Wed, 15 Jun 2016 02:23:37 -0400 > > Alex Deucher <alexdeucher@gmail.com> wrote: > > > >> On Mon, Jun 13, 2016 at 4:10 PM, Alex Williamson > >> <alex.williamson@redhat.com> wrote: > >> > On Mon, 13 Jun 2016 15:45:20 -0400 > >> > Alex Deucher <alexdeucher@gmail.com> wrote: > >> > > >> >> When executing in a PCI passthrough based virtuzliation > >> >> environment, the hypervisor will usually attempt to send a PCIe > >> >> bus reset signal to the ASIC when the VM reboots. In this > >> >> scenario, the card is not correctly initialized, but we still > >> >> consider it to be posted. Therefore, in a passthrough based > >> >> environemnt we should always post the card to guarantee it is in a > good state for driver initialization. > >> >> > >> >> Ported from amdgpu commit: > >> >> amdgpu: fix asic initialization for virtualized environments > >> >> > >> >> Cc: Andres Rodriguez <andres.rodriguez@amd.com> > >> >> Cc: Alex Williamson <alex.williamson@redhat.com> > >> >> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> > >> >> Cc: stable@vger.kernel.org > >> >> --- > >> >> drivers/gpu/drm/radeon/radeon_device.c | 21 > +++++++++++++++++++++ > >> >> 1 file changed, 21 insertions(+) > >> > > >> > Thanks, I expect it's an improvement, though it's always a bit > >> > disappointing when a driver starts modifying its behavior based on > >> > what might be a transient feature of the platform, in this case a > >> > hypervisor platform. For instance, why does our bus reset and > >> > video ROM execution result in a different state than a physical > >> > BIOS doing the same? Can't this condition occur regardless of a > >> > hypervisor, > >> > >> Just doing a pci reset is not enough on newer cards. The hw handling > >> pci resets changed in CI and more of the logic moved to the driver. > > > > Gag, please relay my disapproval to your hardware folks. > > > >> That does a limited reset, but not the registers that the driver > >> checks to determine whether or not the asic has been posted so the > >> driver skips posting and leaves the hw in a bad reset state. > >> > >> > perhaps a rare hot-add of a GPU, a bare metal kexec reboot, or > >> > perhaps simply a system BIOS optimized to post a limited set of devices. > >> > >> We can tell if a card has never been posted and properly post it. > >> Where it's tricky is when a card has been posted and has subsequently > >> been pci reset on CI and newer hw. I'm not sure of a good way to > >> detect this particular scenario. Generally this is mainly done for > >> qemu/kvm. > > > > How do you tell if a card has never been posted? Is it something we > > could easily toggle after a bus reset? > > We check CONFIG_MEMSIZE which is a scratch register set by the asic_init > command table to tell the driver how much vram is on the board. > Yeah, detecting a specific virtualization environment is something that I don't really like. Specially since there isn't a nice generic way to do this for all environments (i.e. in this patch I used x86 specific functionality). If there is a generic approach that works on multiple host CPU architectures do let me know. Another approach I considered was adding an option, amdgpu.always_post that would shift the responsibility onto the user. But I don't think it's really fair to have that expectation, and at some point you start falling into config hell. As Alex mentioned, we don't really have a good mechanism for detecting when we need to post due to how the HW handles PCIe bus reset. Hopefully we can get this fixed for future ASICs. > > > >> > Detection based on some state of the device rather than an > >> > expectation based on what the device is running on seems > >> > preferable. I suspect Andres' patch for amdgpu only affects newer > >> > devices, which pretty much all suffer reset issues, at least under > >> > QEMU/VFIO, but I wonder how this patch affects existing working > devices, like 6, 7, and some 8-series. > >> > >> Posting the asic at init time should be safe on all asics. > >> > >> > Anyway, if this is the solution to the poor behavior we've seen > >> > with assigned AMD cards, maybe someone could request the same for > >> > the closed drivers, including Windows. Thanks, > >> > >> The closed drivers already do this. > > > > Hmm, that's not terribly encouraging then since the majority of users > > are running Windows guests for the purpose of creating a gaming VM and > > still experiencing reset issues with the closed drivers there. > > Thanks, > > I'll have to check with the windows team to see how much validation they do > with the windows driver as a qemu/kvm guest. It could be that they don't > properly detect that as a virtual case. > > Alex
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index e61c763..21c44b2 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -630,6 +630,23 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc) /* * GPU helpers function. */ + +/** + * radeon_device_is_virtual - check if we are running is a virtual environment + * + * Check if the asic has been passed through to a VM (all asics). + * Used at driver startup. + * Returns true if virtual or false if not. + */ +static bool radeon_device_is_virtual(void) +{ +#ifdef CONFIG_X86 + return boot_cpu_has(X86_FEATURE_HYPERVISOR); +#else + return false; +#endif +} + /** * radeon_card_posted - check if the hw has already been initialized * @@ -643,6 +660,10 @@ bool radeon_card_posted(struct radeon_device *rdev) { uint32_t reg; + /* for pass through, always force asic_init */ + if (radeon_device_is_virtual()) + return false; + /* required for EFI mode on macbook2,1 which uses an r5xx asic */ if (efi_enabled(EFI_BOOT) && (rdev->pdev->subsystem_vendor == PCI_VENDOR_ID_APPLE) &&
When executing in a PCI passthrough based virtuzliation environment, the hypervisor will usually attempt to send a PCIe bus reset signal to the ASIC when the VM reboots. In this scenario, the card is not correctly initialized, but we still consider it to be posted. Therefore, in a passthrough based environemnt we should always post the card to guarantee it is in a good state for driver initialization. Ported from amdgpu commit: amdgpu: fix asic initialization for virtualized environments Cc: Andres Rodriguez <andres.rodriguez@amd.com> Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org --- drivers/gpu/drm/radeon/radeon_device.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+)