Message ID | 20221010184812.1576601-1-daniele.ceraolospurio@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] drm/i915/huc: bump timeout for delayed load and reduce print verbosity | expand |
On 10/10/2022 11:48, Daniele Ceraolo Spurio wrote: > We're observing sporadic HuC delayed load timeouts in CI, due to mei_pxp > binding completing later than we expected. HuC is still loaded when the > bind occurs, but in the meantime i915 has started allowing submission to > the VCS engines even if HuC is not there. > In most of the cases I've observed, the timeout was due to the > init/resume of another driver between i915 and mei hitting errors and > thus adding an extra delay, but HuC was still loaded before userspace > could submit, because the whole resume process time was increased by the > delays. > > Given that there is no upper bound to the delay that can be introduced > by other drivers, I've reached the following compromise with the media > team: > > 1) i915 is going to bump the timeout to 5s, to reduce the probability > of reaching it. We still expect HuC to be loaded before userspace > starts submitting, so increasing the timeout should have no impact on > normal operations, but in case something weird happens we don't want to > stall video submissions for too long. > > 2) The media driver will cope with the failing submissions that manage > to go through between i915 init/resume complete and HuC loading, if any > ever happen. This could cause a small corruption of video playback > immediately after a resume (we should be safe on boot because the media > driver polls the HUC_STATUS ioctl before starting submissions). > > Since we're accepting the timeout as a valid outcome, I'm also reducing > the print verbosity from error to notice. > > v2: use separate prints for MEI GSC and MEI PXP init timeouts (John) > > References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033 > Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a fence") > Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > Cc: Tony Ye <tony.ye@intel.com> > Cc: John Harrison <john.c.harrison@intel.com> > --- > drivers/gpu/drm/i915/gt/uc/intel_huc.c | 14 ++++++++++---- > 1 file changed, 10 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c b/drivers/gpu/drm/i915/gt/uc/intel_huc.c > index 4d1cc383b681..41c032ab34b3 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c > @@ -52,10 +52,12 @@ > * guaranteed for this to happen during boot, so the big timeout is a safety net > * that we never expect to need. > * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs to be resumed > - * and/or reset, this can take longer. > + * and/or reset, this can take longer. Note that the kernel might schedule > + * other work between the i915 init/resume and the MEI one, which can add to > + * the delay. > */ > #define GSC_INIT_TIMEOUT_MS 10000 > -#define PXP_INIT_TIMEOUT_MS 2000 > +#define PXP_INIT_TIMEOUT_MS 5000 > > static int sw_fence_dummy_notify(struct i915_sw_fence *sf, > enum i915_sw_fence_notify state) > @@ -104,8 +106,12 @@ static enum hrtimer_restart huc_delayed_load_timer_callback(struct hrtimer *hrti > struct intel_huc *huc = container_of(hrtimer, struct intel_huc, delayed_load.timer); > > if (!intel_huc_is_authenticated(huc)) { > - drm_err(&huc_to_gt(huc)->i915->drm, > - "timed out waiting for GSC init to load HuC\n"); > + if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_GSC) > + drm_notice(&huc_to_gt(huc)->i915->drm, > + "timed out waiting for MEI GSC init to load HuC\n"); > + else if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_PXP) > + drm_notice(&huc_to_gt(huc)->i915->drm, > + "timed out waiting for MEI PXP init to load HuC\n"); Hmm. It feels wrong to assume that the status can only be GSC or PXP. Granted, it certainly should be one of those two values if we get here. But it just seems like bad practice to have a partial if/elsif ladder for an enum value instead of a switch with a default path. What I meant originally was just have a single print that says 'timed out waiting for MEI GSC/PXP to load...', maybe with the status value being printed. I don't think it is critically important to differentiate between the two, I just meant that it was wrong to explicitly state GSC when it might not be. John. > > __gsc_init_error(huc); > }
On 10/10/2022 3:50 PM, John Harrison wrote: > On 10/10/2022 11:48, Daniele Ceraolo Spurio wrote: >> We're observing sporadic HuC delayed load timeouts in CI, due to mei_pxp >> binding completing later than we expected. HuC is still loaded when the >> bind occurs, but in the meantime i915 has started allowing submission to >> the VCS engines even if HuC is not there. >> In most of the cases I've observed, the timeout was due to the >> init/resume of another driver between i915 and mei hitting errors and >> thus adding an extra delay, but HuC was still loaded before userspace >> could submit, because the whole resume process time was increased by the >> delays. >> >> Given that there is no upper bound to the delay that can be introduced >> by other drivers, I've reached the following compromise with the media >> team: >> >> 1) i915 is going to bump the timeout to 5s, to reduce the probability >> of reaching it. We still expect HuC to be loaded before userspace >> starts submitting, so increasing the timeout should have no impact on >> normal operations, but in case something weird happens we don't want to >> stall video submissions for too long. >> >> 2) The media driver will cope with the failing submissions that manage >> to go through between i915 init/resume complete and HuC loading, if any >> ever happen. This could cause a small corruption of video playback >> immediately after a resume (we should be safe on boot because the media >> driver polls the HUC_STATUS ioctl before starting submissions). >> >> Since we're accepting the timeout as a valid outcome, I'm also reducing >> the print verbosity from error to notice. >> >> v2: use separate prints for MEI GSC and MEI PXP init timeouts (John) >> >> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033 >> Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a >> fence") >> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> >> Cc: Tony Ye <tony.ye@intel.com> >> Cc: John Harrison <john.c.harrison@intel.com> >> --- >> drivers/gpu/drm/i915/gt/uc/intel_huc.c | 14 ++++++++++---- >> 1 file changed, 10 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c >> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c >> index 4d1cc383b681..41c032ab34b3 100644 >> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c >> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c >> @@ -52,10 +52,12 @@ >> * guaranteed for this to happen during boot, so the big timeout is >> a safety net >> * that we never expect to need. >> * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs to >> be resumed >> - * and/or reset, this can take longer. >> + * and/or reset, this can take longer. Note that the kernel might >> schedule >> + * other work between the i915 init/resume and the MEI one, which >> can add to >> + * the delay. >> */ >> #define GSC_INIT_TIMEOUT_MS 10000 >> -#define PXP_INIT_TIMEOUT_MS 2000 >> +#define PXP_INIT_TIMEOUT_MS 5000 >> static int sw_fence_dummy_notify(struct i915_sw_fence *sf, >> enum i915_sw_fence_notify state) >> @@ -104,8 +106,12 @@ static enum hrtimer_restart >> huc_delayed_load_timer_callback(struct hrtimer *hrti >> struct intel_huc *huc = container_of(hrtimer, struct intel_huc, >> delayed_load.timer); >> if (!intel_huc_is_authenticated(huc)) { >> - drm_err(&huc_to_gt(huc)->i915->drm, >> - "timed out waiting for GSC init to load HuC\n"); >> + if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_GSC) >> + drm_notice(&huc_to_gt(huc)->i915->drm, >> + "timed out waiting for MEI GSC init to load HuC\n"); >> + else if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_PXP) >> + drm_notice(&huc_to_gt(huc)->i915->drm, >> + "timed out waiting for MEI PXP init to load HuC\n"); > Hmm. It feels wrong to assume that the status can only be GSC or PXP. > Granted, it certainly should be one of those two values if we get > here. But it just seems like bad practice to have a partial if/elsif > ladder for an enum value instead of a switch with a default path. What > I meant originally was just have a single print that says 'timed out > waiting for MEI GSC/PXP to load...', maybe with the status value being > printed. I don't think it is critically important to differentiate > between the two, I just meant that it was wrong to explicitly state > GSC when it might not be. It is guaranteed that the state is one of those 2 in this callback. The only other option is an error state, which we set from the __gsc_init_error() below. I can make it an unconditional else, or add an else MISSING_CASE(status) if you think that works better, but given the errors we've seen I'd prefer to keep the prints separate for extra clarity. Daniele > > John. > >> __gsc_init_error(huc); >> } >
On 10/10/2022 15:57, Ceraolo Spurio, Daniele wrote: > On 10/10/2022 3:50 PM, John Harrison wrote: >> On 10/10/2022 11:48, Daniele Ceraolo Spurio wrote: >>> We're observing sporadic HuC delayed load timeouts in CI, due to >>> mei_pxp >>> binding completing later than we expected. HuC is still loaded when the >>> bind occurs, but in the meantime i915 has started allowing >>> submission to >>> the VCS engines even if HuC is not there. >>> In most of the cases I've observed, the timeout was due to the >>> init/resume of another driver between i915 and mei hitting errors and >>> thus adding an extra delay, but HuC was still loaded before userspace >>> could submit, because the whole resume process time was increased by >>> the >>> delays. >>> >>> Given that there is no upper bound to the delay that can be introduced >>> by other drivers, I've reached the following compromise with the media >>> team: >>> >>> 1) i915 is going to bump the timeout to 5s, to reduce the probability >>> of reaching it. We still expect HuC to be loaded before userspace >>> starts submitting, so increasing the timeout should have no impact on >>> normal operations, but in case something weird happens we don't want to >>> stall video submissions for too long. >>> >>> 2) The media driver will cope with the failing submissions that manage >>> to go through between i915 init/resume complete and HuC loading, if any >>> ever happen. This could cause a small corruption of video playback >>> immediately after a resume (we should be safe on boot because the media >>> driver polls the HUC_STATUS ioctl before starting submissions). >>> >>> Since we're accepting the timeout as a valid outcome, I'm also reducing >>> the print verbosity from error to notice. >>> >>> v2: use separate prints for MEI GSC and MEI PXP init timeouts (John) >>> >>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033 >>> Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a >>> fence") >>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> >>> Cc: Tony Ye <tony.ye@intel.com> >>> Cc: John Harrison <john.c.harrison@intel.com> >>> --- >>> drivers/gpu/drm/i915/gt/uc/intel_huc.c | 14 ++++++++++---- >>> 1 file changed, 10 insertions(+), 4 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c >>> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c >>> index 4d1cc383b681..41c032ab34b3 100644 >>> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c >>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c >>> @@ -52,10 +52,12 @@ >>> * guaranteed for this to happen during boot, so the big timeout >>> is a safety net >>> * that we never expect to need. >>> * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs >>> to be resumed >>> - * and/or reset, this can take longer. >>> + * and/or reset, this can take longer. Note that the kernel might >>> schedule >>> + * other work between the i915 init/resume and the MEI one, which >>> can add to >>> + * the delay. >>> */ >>> #define GSC_INIT_TIMEOUT_MS 10000 >>> -#define PXP_INIT_TIMEOUT_MS 2000 >>> +#define PXP_INIT_TIMEOUT_MS 5000 >>> static int sw_fence_dummy_notify(struct i915_sw_fence *sf, >>> enum i915_sw_fence_notify state) >>> @@ -104,8 +106,12 @@ static enum hrtimer_restart >>> huc_delayed_load_timer_callback(struct hrtimer *hrti >>> struct intel_huc *huc = container_of(hrtimer, struct >>> intel_huc, delayed_load.timer); >>> if (!intel_huc_is_authenticated(huc)) { >>> - drm_err(&huc_to_gt(huc)->i915->drm, >>> - "timed out waiting for GSC init to load HuC\n"); >>> + if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_GSC) >>> + drm_notice(&huc_to_gt(huc)->i915->drm, >>> + "timed out waiting for MEI GSC init to load >>> HuC\n"); >>> + else if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_PXP) >>> + drm_notice(&huc_to_gt(huc)->i915->drm, >>> + "timed out waiting for MEI PXP init to load >>> HuC\n"); >> Hmm. It feels wrong to assume that the status can only be GSC or PXP. >> Granted, it certainly should be one of those two values if we get >> here. But it just seems like bad practice to have a partial if/elsif >> ladder for an enum value instead of a switch with a default path. >> What I meant originally was just have a single print that says 'timed >> out waiting for MEI GSC/PXP to load...', maybe with the status value >> being printed. I don't think it is critically important to >> differentiate between the two, I just meant that it was wrong to >> explicitly state GSC when it might not be. > > It is guaranteed that the state is one of those 2 in this callback. > The only other option is an error state, which we set from the > __gsc_init_error() below. I can make it an unconditional else, or add > an else MISSING_CASE(status) if you think that works better, but given > the errors we've seen I'd prefer to keep the prints separate for extra > clarity. I agree a third state is theoretically impossible. Currently. But yeah the MISSING_CASE thing works if you think two separate prints is beneficial. John. > > Daniele > >> >> John. >> >>> __gsc_init_error(huc); >>> } >> >
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c b/drivers/gpu/drm/i915/gt/uc/intel_huc.c index 4d1cc383b681..41c032ab34b3 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c @@ -52,10 +52,12 @@ * guaranteed for this to happen during boot, so the big timeout is a safety net * that we never expect to need. * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs to be resumed - * and/or reset, this can take longer. + * and/or reset, this can take longer. Note that the kernel might schedule + * other work between the i915 init/resume and the MEI one, which can add to + * the delay. */ #define GSC_INIT_TIMEOUT_MS 10000 -#define PXP_INIT_TIMEOUT_MS 2000 +#define PXP_INIT_TIMEOUT_MS 5000 static int sw_fence_dummy_notify(struct i915_sw_fence *sf, enum i915_sw_fence_notify state) @@ -104,8 +106,12 @@ static enum hrtimer_restart huc_delayed_load_timer_callback(struct hrtimer *hrti struct intel_huc *huc = container_of(hrtimer, struct intel_huc, delayed_load.timer); if (!intel_huc_is_authenticated(huc)) { - drm_err(&huc_to_gt(huc)->i915->drm, - "timed out waiting for GSC init to load HuC\n"); + if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_GSC) + drm_notice(&huc_to_gt(huc)->i915->drm, + "timed out waiting for MEI GSC init to load HuC\n"); + else if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_PXP) + drm_notice(&huc_to_gt(huc)->i915->drm, + "timed out waiting for MEI PXP init to load HuC\n"); __gsc_init_error(huc); }
We're observing sporadic HuC delayed load timeouts in CI, due to mei_pxp binding completing later than we expected. HuC is still loaded when the bind occurs, but in the meantime i915 has started allowing submission to the VCS engines even if HuC is not there. In most of the cases I've observed, the timeout was due to the init/resume of another driver between i915 and mei hitting errors and thus adding an extra delay, but HuC was still loaded before userspace could submit, because the whole resume process time was increased by the delays. Given that there is no upper bound to the delay that can be introduced by other drivers, I've reached the following compromise with the media team: 1) i915 is going to bump the timeout to 5s, to reduce the probability of reaching it. We still expect HuC to be loaded before userspace starts submitting, so increasing the timeout should have no impact on normal operations, but in case something weird happens we don't want to stall video submissions for too long. 2) The media driver will cope with the failing submissions that manage to go through between i915 init/resume complete and HuC loading, if any ever happen. This could cause a small corruption of video playback immediately after a resume (we should be safe on boot because the media driver polls the HUC_STATUS ioctl before starting submissions). Since we're accepting the timeout as a valid outcome, I'm also reducing the print verbosity from error to notice. v2: use separate prints for MEI GSC and MEI PXP init timeouts (John) References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033 Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a fence") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Tony Ye <tony.ye@intel.com> Cc: John Harrison <john.c.harrison@intel.com> --- drivers/gpu/drm/i915/gt/uc/intel_huc.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-)