Message ID | 1361876716-8625-10-git-send-email-mika.kuoppala@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 02/26/2013 03:05 AM, Mika Kuoppala wrote: > For arb-robustness, every context needs to have it's own > reset state tracking. Default context will be handled in a identical > way as the no-context case in further down in the patch set. > For no-context case, the reset state will be stored in > the file_priv part. > > v2: handle default context inside get_reset_state This isn't the interface we want. I already sent you the patches for Mesa, and you seem to have completely ignored them. Moreover, this interface provides no mechanism to query for its existence (other than relying on the kernel version), and no method to deprecate it. Based on e-mail discussions, I think danvet agrees with me here. Putting guilty / innocent counting in kernel puts policy decisions in the kernel that belong with the user space API that implements them. Putting these choices in the kernel significantly decreases how "future proof" the interface is. Since any kernel/user interface has to be kept forever, this is just asking for maintenance trouble down the road. Also, it's difficult (for me) to tell from the rest of the series whether or not a context that was not affected by a reset (had no batches in flight at all) can still observe that a reset occurred. From the GL point of view, once a context has observed a reset, it's garbage. Knowing that it saw 1 reset or 43,000,000 resets is the same. Since it's a count, GL will have to query the values before any rendering happens or it won't know whether the value 6 means there was a reset or not. The right interface is a set of flags indicating the state the context was in when it observed a reset. As this patch series stands, Mesa will not use this interface. > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> > --- > drivers/gpu/drm/i915/i915_drv.h | 4 ++++ > drivers/gpu/drm/i915/i915_gem_context.c | 34 +++++++++++++++++++++++++++++++ > 2 files changed, 38 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > index 1c67fb2..2cc5817 100644 > --- a/drivers/gpu/drm/i915/i915_drv.h > +++ b/drivers/gpu/drm/i915/i915_drv.h > @@ -1684,6 +1684,10 @@ int i915_switch_context(struct intel_ring_buffer *ring, > struct drm_file *file, int to_id, > struct i915_hw_context **ctx); > void i915_gem_context_free(struct kref *ctx_ref); > +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring, > + struct drm_file *file, > + u32 id, > + struct ctx_reset_state **rs); > int i915_gem_context_create_ioctl(struct drm_device *dev, void *data, > struct drm_file *file); > int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data, > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c > index cba54fb..1b14a06 100644 > --- a/drivers/gpu/drm/i915/i915_gem_context.c > +++ b/drivers/gpu/drm/i915/i915_gem_context.c > @@ -304,6 +304,40 @@ static int context_idr_cleanup(int id, void *p, void *data) > return 0; > } > > +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring, > + struct drm_file *file, > + u32 id, > + struct ctx_reset_state **rs) > +{ > + struct drm_i915_private *dev_priv = ring->dev->dev_private; > + struct drm_i915_file_private *file_priv = file->driver_priv; > + struct i915_hw_context *to; > + > + if (dev_priv->hw_contexts_disabled) > + return -ENOENT; > + > + if (ring->id != RCS) > + return -EINVAL; > + > + if (rs == NULL) > + return -EINVAL; > + > + if (file == NULL) > + return -EINVAL; > + > + if (id == DEFAULT_CONTEXT_ID) { > + *rs = &file_priv->reset_state; > + } else { > + to = i915_gem_context_get(file->driver_priv, id); > + if (to == NULL) > + return -ENOENT; > + > + *rs = &to->reset_state; > + } > + > + return 0; > +} > + > void i915_gem_context_close(struct drm_device *dev, struct drm_file *file) > { > struct drm_i915_file_private *file_priv = file->driver_priv; >
On 02/26/2013 05:47 PM, Ian Romanick wrote: > On 02/26/2013 03:05 AM, Mika Kuoppala wrote: >> For arb-robustness, every context needs to have it's own >> reset state tracking. Default context will be handled in a identical >> way as the no-context case in further down in the patch set. >> For no-context case, the reset state will be stored in >> the file_priv part. >> >> v2: handle default context inside get_reset_state > > This isn't the interface we want. I already sent you the patches for > Mesa, and you seem to have completely ignored them. Moreover, this > interface provides no mechanism to query for its existence (other than > relying on the kernel version), and no method to deprecate it. > > Based on e-mail discussions, I think danvet agrees with me here. > > Putting guilty / innocent counting in kernel puts policy decisions in > the kernel that belong with the user space API that implements them. > Putting these choices in the kernel significantly decreases how "future > proof" the interface is. Since any kernel/user interface has to be kept > forever, this is just asking for maintenance trouble down the road. > > Also, it's difficult (for me) to tell from the rest of the series > whether or not a context that was not affected by a reset (had no > batches in flight at all) can still observe that a reset occurred. > > From the GL point of view, once a context has observed a reset, it's > garbage. Knowing that it saw 1 reset or 43,000,000 resets is the same. > Since it's a count, GL will have to query the values before any > rendering happens or it won't know whether the value 6 means there was a > reset or not. > > The right interface is a set of flags indicating the state the context > was in when it observed a reset. As this patch series stands, Mesa will > not use this interface. I almost forgot... it seems like there is a lot of code in this series to support GPUs where we don't support (either in the existing kernel driver or in the hardware at all) hardware contexts. Is this really necessary? Is anyone going to implement ARB_robustness for 845? I have no intention to do so. It seems much better to simplify this code to only support the GPUs we are actually going to support in user-mode. If there are no users, the code is untested. Repeat Keith's mantra: Any code that isn't tested is broken. :) >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> >> --- >> drivers/gpu/drm/i915/i915_drv.h | 4 ++++ >> drivers/gpu/drm/i915/i915_gem_context.c | 34 >> +++++++++++++++++++++++++++++++ >> 2 files changed, 38 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/i915_drv.h >> b/drivers/gpu/drm/i915/i915_drv.h >> index 1c67fb2..2cc5817 100644 >> --- a/drivers/gpu/drm/i915/i915_drv.h >> +++ b/drivers/gpu/drm/i915/i915_drv.h >> @@ -1684,6 +1684,10 @@ int i915_switch_context(struct >> intel_ring_buffer *ring, >> struct drm_file *file, int to_id, >> struct i915_hw_context **ctx); >> void i915_gem_context_free(struct kref *ctx_ref); >> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring, >> + struct drm_file *file, >> + u32 id, >> + struct ctx_reset_state **rs); >> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data, >> struct drm_file *file); >> int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data, >> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c >> b/drivers/gpu/drm/i915/i915_gem_context.c >> index cba54fb..1b14a06 100644 >> --- a/drivers/gpu/drm/i915/i915_gem_context.c >> +++ b/drivers/gpu/drm/i915/i915_gem_context.c >> @@ -304,6 +304,40 @@ static int context_idr_cleanup(int id, void *p, >> void *data) >> return 0; >> } >> >> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring, >> + struct drm_file *file, >> + u32 id, >> + struct ctx_reset_state **rs) >> +{ >> + struct drm_i915_private *dev_priv = ring->dev->dev_private; >> + struct drm_i915_file_private *file_priv = file->driver_priv; >> + struct i915_hw_context *to; >> + >> + if (dev_priv->hw_contexts_disabled) >> + return -ENOENT; >> + >> + if (ring->id != RCS) >> + return -EINVAL; >> + >> + if (rs == NULL) >> + return -EINVAL; >> + >> + if (file == NULL) >> + return -EINVAL; >> + >> + if (id == DEFAULT_CONTEXT_ID) { >> + *rs = &file_priv->reset_state; >> + } else { >> + to = i915_gem_context_get(file->driver_priv, id); >> + if (to == NULL) >> + return -ENOENT; >> + >> + *rs = &to->reset_state; >> + } >> + >> + return 0; >> +} >> + >> void i915_gem_context_close(struct drm_device *dev, struct drm_file >> *file) >> { >> struct drm_i915_file_private *file_priv = file->driver_priv; >> >
On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote: > On 02/26/2013 03:05 AM, Mika Kuoppala wrote: > >For arb-robustness, every context needs to have it's own > >reset state tracking. Default context will be handled in a identical > >way as the no-context case in further down in the patch set. > >For no-context case, the reset state will be stored in > >the file_priv part. > > > >v2: handle default context inside get_reset_state > > This isn't the interface we want. I already sent you the patches > for Mesa, and you seem to have completely ignored them. Moreover, > this interface provides no mechanism to query for its existence > (other than relying on the kernel version), and no method to > deprecate it. It's existence, and the existence of any successor, is the only test you need to check for the interface. Testing the flag field up front also lets you know if individual flags are known prior to use. > Based on e-mail discussions, I think danvet agrees with me here. > > Putting guilty / innocent counting in kernel puts policy decisions > in the kernel that belong with the user space API that implements > them. Putting these choices in the kernel significantly decreases > how "future proof" the interface is. Since any kernel/user > interface has to be kept forever, this is just asking for > maintenance trouble down the road. I think you have the policy standpoint reversed. > Also, it's difficult (for me) to tell from the rest of the series > whether or not a context that was not affected by a reset (had no > batches in flight at all) can still observe that a reset occurred. Yes, the context can observe that the global reset increased, its did not. > From the GL point of view, once a context has observed a reset, it's > garbage. Knowing that it saw 1 reset or 43,000,000 resets is the > same. Since it's a count, GL will have to query the values before > any rendering happens or it won't know whether the value 6 means > there was a reset or not. > > The right interface is a set of flags indicating the state the > context was in when it observed a reset. As this patch series > stands, Mesa will not use this interface. This interface conforms to your specifications and interface that you last posted to the list and were revised on list. There are two substantive differences; there is *less* policy in the kernel than before, and the breadcrumb of last reset state was overlooked. In order to make the reset status work in a multi-threaded environment the kernel exists in, we have to assume that there will be multiple actors on the reset status, and they will be comparing the state that they are interested in. The set of flags that GL understands is deducible from this interface. -Chris
Ian Romanick <idr@freedesktop.org> writes: > On 02/26/2013 03:05 AM, Mika Kuoppala wrote: >> For arb-robustness, every context needs to have it's own >> reset state tracking. Default context will be handled in a identical >> way as the no-context case in further down in the patch set. >> For no-context case, the reset state will be stored in >> the file_priv part. >> >> v2: handle default context inside get_reset_state > > This isn't the interface we want. I already sent you the patches for > Mesa, and you seem to have completely ignored them. Moreover, this > interface provides no mechanism to query for its existence (other than > relying on the kernel version), and no method to deprecate it. > > Based on e-mail discussions, I think danvet agrees with me here. > > Putting guilty / innocent counting in kernel puts policy decisions in > the kernel that belong with the user space API that implements them. > Putting these choices in the kernel significantly decreases how "future > proof" the interface is. Since any kernel/user interface has to be kept > forever, this is just asking for maintenance trouble down the road. > > Also, it's difficult (for me) to tell from the rest of the series > whether or not a context that was not affected by a reset (had no > batches in flight at all) can still observe that a reset occurred. > > From the GL point of view, once a context has observed a reset, it's > garbage. Knowing that it saw 1 reset or 43,000,000 resets is the same. > Since it's a count, GL will have to query the values before any > rendering happens or it won't know whether the value 6 means there was a > reset or not. > > The right interface is a set of flags indicating the state the context > was in when it observed a reset. As this patch series stands, Mesa will > not use this interface. There is a little bit of misunderstanding in here I think. I haven't ignored your input. I abandoned the statistics based ioctl interface based on your input. This commit in question adds only internal interface to query the context reset state. My intention was to redo the ioctl interface on top of this internal bookkeepping. The ioctl interface will be made according to this commit: http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=ef97f4092c703afd49b3516b15dfbfcd27d4bc97 Thanks, -Mika >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> >> --- >> drivers/gpu/drm/i915/i915_drv.h | 4 ++++ >> drivers/gpu/drm/i915/i915_gem_context.c | 34 +++++++++++++++++++++++++++++++ >> 2 files changed, 38 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h >> index 1c67fb2..2cc5817 100644 >> --- a/drivers/gpu/drm/i915/i915_drv.h >> +++ b/drivers/gpu/drm/i915/i915_drv.h >> @@ -1684,6 +1684,10 @@ int i915_switch_context(struct intel_ring_buffer *ring, >> struct drm_file *file, int to_id, >> struct i915_hw_context **ctx); >> void i915_gem_context_free(struct kref *ctx_ref); >> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring, >> + struct drm_file *file, >> + u32 id, >> + struct ctx_reset_state **rs); >> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data, >> struct drm_file *file); >> int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data, >> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c >> index cba54fb..1b14a06 100644 >> --- a/drivers/gpu/drm/i915/i915_gem_context.c >> +++ b/drivers/gpu/drm/i915/i915_gem_context.c >> @@ -304,6 +304,40 @@ static int context_idr_cleanup(int id, void *p, void *data) >> return 0; >> } >> >> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring, >> + struct drm_file *file, >> + u32 id, >> + struct ctx_reset_state **rs) >> +{ >> + struct drm_i915_private *dev_priv = ring->dev->dev_private; >> + struct drm_i915_file_private *file_priv = file->driver_priv; >> + struct i915_hw_context *to; >> + >> + if (dev_priv->hw_contexts_disabled) >> + return -ENOENT; >> + >> + if (ring->id != RCS) >> + return -EINVAL; >> + >> + if (rs == NULL) >> + return -EINVAL; >> + >> + if (file == NULL) >> + return -EINVAL; >> + >> + if (id == DEFAULT_CONTEXT_ID) { >> + *rs = &file_priv->reset_state; >> + } else { >> + to = i915_gem_context_get(file->driver_priv, id); >> + if (to == NULL) >> + return -ENOENT; >> + >> + *rs = &to->reset_state; >> + } >> + >> + return 0; >> +} >> + >> void i915_gem_context_close(struct drm_device *dev, struct drm_file *file) >> { >> struct drm_i915_file_private *file_priv = file->driver_priv; >>
On Tue, 26 Feb 2013 17:47:12 -0800 Ian Romanick <idr@freedesktop.org> wrote: > On 02/26/2013 03:05 AM, Mika Kuoppala wrote: > > For arb-robustness, every context needs to have it's own > > reset state tracking. Default context will be handled in a identical > > way as the no-context case in further down in the patch set. > > For no-context case, the reset state will be stored in > > the file_priv part. > > > > v2: handle default context inside get_reset_state > > This isn't the interface we want. I already sent you the patches for > Mesa, and you seem to have completely ignored them. Moreover, this > interface provides no mechanism to query for its existence (other than > relying on the kernel version), and no method to deprecate it. Before you get all wound up about this, note that you're replying to a kernel internal function, not the ioctl that will be exposed to Mesa...
On 02/27/2013 01:13 AM, Chris Wilson wrote: > On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote: >> On 02/26/2013 03:05 AM, Mika Kuoppala wrote: >>> For arb-robustness, every context needs to have it's own >>> reset state tracking. Default context will be handled in a identical >>> way as the no-context case in further down in the patch set. >>> For no-context case, the reset state will be stored in >>> the file_priv part. >>> >>> v2: handle default context inside get_reset_state >> >> This isn't the interface we want. I already sent you the patches >> for Mesa, and you seem to have completely ignored them. Moreover, >> this interface provides no mechanism to query for its existence >> (other than relying on the kernel version), and no method to >> deprecate it. > > It's existence, and the existence of any successor, is the only test you > need to check for the interface. Testing the flag field up front also > lets you know if individual flags are known prior to use. > >> Based on e-mail discussions, I think danvet agrees with me here. >> >> Putting guilty / innocent counting in kernel puts policy decisions >> in the kernel that belong with the user space API that implements >> them. Putting these choices in the kernel significantly decreases >> how "future proof" the interface is. Since any kernel/user >> interface has to be kept forever, this is just asking for >> maintenance trouble down the road. > > I think you have the policy standpoint reversed. Can you elaborate? I've been out of the kernel loop for a long time, but I always thought policy and mechanism were supposed to be separated. In the case of drivers with a user-mode component, the mechanism was in the kernel (where else could it be?), and policy decisions were in user-mode. This is often at odds with keeping the interfaces between kernel and user thin, so that may be where my misunderstanding is. >> Also, it's difficult (for me) to tell from the rest of the series >> whether or not a context that was not affected by a reset (had no >> batches in flight at all) can still observe that a reset occurred. > > Yes, the context can observe that the global reset increased, its did > not. See below for a couple reasons why I think this may be a mistake. >> From the GL point of view, once a context has observed a reset, it's >> garbage. Knowing that it saw 1 reset or 43,000,000 resets is the >> same. Since it's a count, GL will have to query the values before >> any rendering happens or it won't know whether the value 6 means >> there was a reset or not. >> >> The right interface is a set of flags indicating the state the >> context was in when it observed a reset. As this patch series >> stands, Mesa will not use this interface. > > This interface conforms to your specifications and interface that you > last posted to the list and were revised on list. There are two > substantive differences; there is *less* policy in the kernel than > before, and the breadcrumb of last reset state was overlooked. > > In order to make the reset status work in a multi-threaded environment > the kernel exists in, we have to assume that there will be multiple > actors on the reset status, and they will be comparing the state that > they are interested in. The set of flags that GL understands is > deducible from this interface. We had some discussion on the list about a proposed interface last year. We had a really good discussion, and both you and Daniel provided some really valuable feedback. The biggest piece of feedback that both of you provided was to simplify. I took that advice to heart. The original interface was clunky and over complicated. If memory serves, that discussion was on the order of 8 to 10 months ago. Quite a lot has happened since then. The biggest thing that happened is I started implementing the interface we discussed in the kernel, libdrm, and Mesa. In that process I found a number of problems with even the simplified interface. Back in September I used this new data to simplify the interface even further. This allowed both the kernel and the user-mode code to be much less complex. Right about that same time the whole Mesa team got super busy. First we had OpenGL 3.1, then we had OpenGL ES 3.0. During that time all of my ARB_robustness work languished. I didn't want to send another round of not-fully-baked ideas (or patches) to the list, so they just sat. And sat. And sat. :( At FOSDEM earlier this month, I discussed this with Jesse. I told him that there was approximately epsilon probability that I would get back to this work. As a result, I was going to have to throw it over the wall to the kernel team. At no point did he mention that anyone was already working on this. About a week ago Daniel pulled me into a discussion with Mika about the ARB_robustness work. I dug out my old code, paged back in the memory of the work from off-line storage, and provided a bunch of feedback to Mika and Daniel. After I sent around my explanation of why I abandoned the counting interface and circulated my work, there was no response from Mika or anyone on that team. I had a brief discussion with Daniel on IRC, and he seemed to agree with my sentiments and like the direction of my code. Then this patch series appeared on the list, and here we are. There are two requirements in the ARB_robustness extension (and additional layered extensions) that I believe cause problems with all of the proposed counting interfaces. 1. GL contexts that are not affected by a reset (i.e., didn't lose any objects, didn't lose any rendering, etc.) should not observe it. This is the big one that all the browser vendors want. If you have 5 tabs open, a reset caused by a malicious WebGL app in one tab shouldn't, if possible, force them to rebuild all of their state for all the other tabs. 2. After a GL context observes a reset, it is garbage. It must be destroyed and a new one created. This means we only need to know that a reset of any variety affected the context. In addition, I'm concerned about having one GL context be able to observe, in any manner, resets that did not affect it. Given that people have figured out how to get encryption keys by observing cache timings, it's not impossible for a malicious program to use information about reset counts to do something. Leaving a potential gap like this in an extension that's used to improved security seems... like we're just asking for a Phoronix article. We don't have any requirement to expose this information, so I don't think we should. We also don't have any requirement to expose this functionality on anything pre-Sandy Bridge. We may want, at some point, to expose it on Ironlake. Based on that, it seems reasonable to tie the availability of reset notifications to hardware contexts. Since we won't test it, Mesa certainly isn't going to expose it on 8xx or 915. Based on this, a simplified interface became obvious: for each hardware context, track a mask of kinds of resets that have affected that context. I defined three bits in the mask: 1. The hw context had a batch executing when the reset occurred. 2. The hw context had a batch pending when the reset occurred. Hopefully in the future we could make most occurrences of this go away by re-queuing the batch. It may also be possible to do this in user-mode, but that may be more difficult. 3. "Other." There's also the potential to add other bits to communicate information about lost buffers, textures, etc. This could be used to create a layered extension that lets applications transfer data from the dead context to the new context. Browsers may not be too interested in this, but I could imagine compositors or other kinds of applications benefiting. We may never implement that, but it's easier to communicate this through a set of flags than through a set of counts. The guilty / innocent counts in this patch series lose information. We can probably reconstruct the current flags from the counts, but we would have to do some refactoring (or additional counting) to get other flags in the future. All of this would make the kernel code more complex. Why reconstruct the data you want when you could just track that data in the first place? Since the functionality in not available on all hardware, I also added a query to determine the availability. This has the added benefit that, should this interface prove to be insufficient in the future, we could deprecate it by having the query always return false. All software involved has to be able the handle the case where the interface is not available, so I don't think this should run awry of the rules about kernel/user interface breakage. Please correct me if I'm wrong. The Mesa patches are in my robustness2 branch: http://cgit.freedesktop.org/~idr/mesa/log/?h=robustness2 As Mika pointed out, the patch that makes use of the new kernel interface is: http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=ef97f4092c703afd49b3516b15dfbfcd27d4bc97 In addition, http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=6f73cbeecb90b0a08b0f016e19a16f6f50a1a99c changes to brw_context.c to use intel_get_boolean to determine the availability of the kernel query. I'd really like to only expose __DRI2_ROBUSTNESS when the kernel query exists, but I don't see a way to do that with the DRI extension mechanism. The libdrm patches are in master of: http://cgit.freedesktop.org/~idr/drm I had attacked the problem from the opposite end from Mika. My initial implementation always reported "other" because a lot of other work needed to happen to differentiate the other cases. It's good to see that work progressing. My really, really, really incomplete kernel patch is at: http://people.freedesktop.org/~idr/robust.patch This patch was against some kernel from early September 2012. It may not apply or build at this point. I believe I was able to observe a reset by having an app run a shader with an infinite loop and poll glGetGraphicsResetStatusARB. I couldn't find that code, so I may be misremembering. With the exception of the I915_PARAM_HAS_RESET_QUERY handling, nearly all of this patch is probably useless at this point.
On Wed, Feb 27, 2013 at 05:15:08PM -0800, Ian Romanick wrote: > On 02/27/2013 01:13 AM, Chris Wilson wrote: > >On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote: > >>On 02/26/2013 03:05 AM, Mika Kuoppala wrote: > >>>For arb-robustness, every context needs to have it's own > >>>reset state tracking. Default context will be handled in a identical > >>>way as the no-context case in further down in the patch set. > >>>For no-context case, the reset state will be stored in > >>>the file_priv part. > >>> > >>>v2: handle default context inside get_reset_state > >> > >>This isn't the interface we want. I already sent you the patches > >>for Mesa, and you seem to have completely ignored them. Moreover, > >>this interface provides no mechanism to query for its existence > >>(other than relying on the kernel version), and no method to > >>deprecate it. > > > >It's existence, and the existence of any successor, is the only test you > >need to check for the interface. Testing the flag field up front also > >lets you know if individual flags are known prior to use. > > > >>Based on e-mail discussions, I think danvet agrees with me here. > >> > >>Putting guilty / innocent counting in kernel puts policy decisions > >>in the kernel that belong with the user space API that implements > >>them. Putting these choices in the kernel significantly decreases > >>how "future proof" the interface is. Since any kernel/user > >>interface has to be kept forever, this is just asking for > >>maintenance trouble down the road. > > > >I think you have the policy standpoint reversed. > > Can you elaborate? I've been out of the kernel loop for a long > time, but I always thought policy and mechanism were supposed to be > separated. In the case of drivers with a user-mode component, the > mechanism was in the kernel (where else could it be?), and policy > decisions were in user-mode. This is often at odds with keeping the > interfaces between kernel and user thin, so that may be where my > misunderstanding is. Right, the idea is to keep policy out of the kernel. I disagree that your suggestion succeeds in doing this. [snip to details] > There are two requirements in the ARB_robustness extension (and > additional layered extensions) that I believe cause problems with > all of the proposed counting interfaces. > > 1. GL contexts that are not affected by a reset (i.e., didn't lose > any objects, didn't lose any rendering, etc.) should not observe it. > This is the big one that all the browser vendors want. If you have > 5 tabs open, a reset caused by a malicious WebGL app in one tab > shouldn't, if possible, force them to rebuild all of their state for > all the other tabs. > > 2. After a GL context observes a reset, it is garbage. It must be > destroyed and a new one created. This means we only need to know > that a reset of any variety affected the context. For me observes implies the ability for the context to inspect global system state, whereas I think you mean if the context and its associated state is affected by a reset. > In addition, I'm concerned about having one GL context be able to > observe, in any manner, resets that did not affect it. Given that > people have figured out how to get encryption keys by observing > cache timings, it's not impossible for a malicious program to use > information about reset counts to do something. Leaving a potential > gap like this in an extension that's used to improved security > seems... like we're just asking for a Phoronix article. We don't > have any requirement to expose this information, so I don't think we > should. So we should not do minor+major pagefault tracking either? I only suggested that you could read the global state because I thought you implied you wanted it. From the display server POV, global state is the most useful as I am more interested in whether I can reliably use the GPU and prevent a DoS from a misbehaving set of clients. > We also don't have any requirement to expose this functionality on > anything pre-Sandy Bridge. We may want, at some point, to expose it > on Ironlake. Based on that, it seems reasonable to tie the > availability of reset notifications to hardware contexts. Since we > won't test it, Mesa certainly isn't going to expose it on 8xx or > 915. From a design standpoint hw contexts are just one function of the context object and not the reverse. Ben has been rightfully arguing that we need contexts in the kernel for far more than supporting logical hw contexts. > Based on this, a simplified interface became obvious: for each > hardware context, track a mask of kinds of resets that have affected > that context. I defined three bits in the mask: The problem is that you added an atomic-read-and-reset into the ioctl. I objected to that policy when Mika presented it earlier as it prevents concurrent accesses and so imposes an explicit usage model. > 1. The hw context had a batch executing when the reset occurred. > > 2. The hw context had a batch pending when the reset occurred. > Hopefully in the future we could make most occurrences of this go > away by re-queuing the batch. It may also be possible to do this in > user-mode, but that may be more difficult. > > 3. "Other." > > There's also the potential to add other bits to communicate > information about lost buffers, textures, etc. This could be used > to create a layered extension that lets applications transfer data > from the dead context to the new context. Browsers may not be too > interested in this, but I could imagine compositors or other kinds > of applications benefiting. We may never implement that, but it's > easier to communicate this through a set of flags than through a set > of counts. > > The guilty / innocent counts in this patch series lose information. Disagree. The flags represent a lose of information, as you explained earlier. > We can probably reconstruct the current flags from the counts, but > we would have to do some refactoring (or additional counting) to get > other flags in the future. All of this would make the kernel code > more complex. Why reconstruct the data you want when you could just > track that data in the first place? Because you are constructing policy decision based on raw information. > Since the functionality in not available on all hardware, I also > added a query to determine the availability. This has the added > benefit that, should this interface prove to be insufficient in the > future, we could deprecate it by having the query always return > false. That offers no more information than just probing the ioctl. > All software involved has to be able the handle the case > where the interface is not available, so I don't think this should > run awry of the rules about kernel/user interface breakage. Please > correct me if I'm wrong. [snip patch locations] The biggest change you made was to reduce the counts to a set of flags, to make it harder for an attacker to analyse the reset frequency. Instead that attacker can just poll the flags and coarsely reconstruct the counts... If that is their goal. Instead it transfers userspace policy into the kernel, which is what we were seeking to avoid, right? -Chris
On 02/28/2013 03:14 AM, Chris Wilson wrote: > On Wed, Feb 27, 2013 at 05:15:08PM -0800, Ian Romanick wrote: >> On 02/27/2013 01:13 AM, Chris Wilson wrote: >>> On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote: >>>> On 02/26/2013 03:05 AM, Mika Kuoppala wrote: >>>>> For arb-robustness, every context needs to have it's own >>>>> reset state tracking. Default context will be handled in a identical >>>>> way as the no-context case in further down in the patch set. >>>>> For no-context case, the reset state will be stored in >>>>> the file_priv part. >>>>> >>>>> v2: handle default context inside get_reset_state >>>> >>>> This isn't the interface we want. I already sent you the patches >>>> for Mesa, and you seem to have completely ignored them. Moreover, >>>> this interface provides no mechanism to query for its existence >>>> (other than relying on the kernel version), and no method to >>>> deprecate it. >>> >>> It's existence, and the existence of any successor, is the only test you >>> need to check for the interface. Testing the flag field up front also >>> lets you know if individual flags are known prior to use. >>> >>>> Based on e-mail discussions, I think danvet agrees with me here. >>>> >>>> Putting guilty / innocent counting in kernel puts policy decisions >>>> in the kernel that belong with the user space API that implements >>>> them. Putting these choices in the kernel significantly decreases >>>> how "future proof" the interface is. Since any kernel/user >>>> interface has to be kept forever, this is just asking for >>>> maintenance trouble down the road. >>> >>> I think you have the policy standpoint reversed. >> >> Can you elaborate? I've been out of the kernel loop for a long >> time, but I always thought policy and mechanism were supposed to be >> separated. In the case of drivers with a user-mode component, the >> mechanism was in the kernel (where else could it be?), and policy >> decisions were in user-mode. This is often at odds with keeping the >> interfaces between kernel and user thin, so that may be where my >> misunderstanding is. > > Right, the idea is to keep policy out of the kernel. I disagree that > your suggestion succeeds in doing this. I was trying to put the mechanism for determining what things happened in the kernel and the policy for interpreting what those things mean in user-mode. Ignoring some bugs / misimplementation in my patch (which I try to address below), I thought I was doing that. User-mode gets some information about the state it was in when the reset occurred. It then uses that state to decide the value to return to the application for glGetGraphicsResetStatusARB. After reading the bits below where I talk about other problems in the kernel code I posted, maybe you can address specific ways my proposal fails. I want any glaring interface deficiencies fixed before any code goes in the kernel... I fully understand the pain, on both sides, of shipping a kernel interface only to have to replace it in a year. > [snip to details] > >> There are two requirements in the ARB_robustness extension (and >> additional layered extensions) that I believe cause problems with >> all of the proposed counting interfaces. >> >> 1. GL contexts that are not affected by a reset (i.e., didn't lose >> any objects, didn't lose any rendering, etc.) should not observe it. >> This is the big one that all the browser vendors want. If you have >> 5 tabs open, a reset caused by a malicious WebGL app in one tab >> shouldn't, if possible, force them to rebuild all of their state for >> all the other tabs. >> >> 2. After a GL context observes a reset, it is garbage. It must be >> destroyed and a new one created. This means we only need to know >> that a reset of any variety affected the context. > > For me observes implies the ability for the context to inspect global > system state, whereas I think you mean if the context and its associated > state is affected by a reset. I had thought about explicitly defining what I meant by a couple terms in my previous e-mail, but it was already getting quite long. I guess I should have. :( Here's what I meant: A. "Affect the context" means the reset causes the context to lose some data. It may be rendering that was queued (so framebuffer data is lost), it may be the contents of a buffer, or it may be something else. Either way, the state of some data is not what the application expects it to be because of the reset. B. "Observe the reset" means the GL context gets some data to know that the reset happened. From the application point of view, this means glGetGraphicsResetStatusARB returns a value other than GL_NO_ERROR. What I was trying the describe in point 1 in my previous message (above) is that B should occur if and only if A occurs. Once B occurs, the application has to create a new context... recompile all of its shaders, reload all of its textures, reload all of its vertex data, etc. We don't want to make apps do that any more often than absolutely necessary. >> In addition, I'm concerned about having one GL context be able to >> observe, in any manner, resets that did not affect it. Given that >> people have figured out how to get encryption keys by observing >> cache timings, it's not impossible for a malicious program to use >> information about reset counts to do something. Leaving a potential >> gap like this in an extension that's used to improved security >> seems... like we're just asking for a Phoronix article. We don't >> have any requirement to expose this information, so I don't think we >> should. > > So we should not do minor+major pagefault tracking either? I only > suggested that you could read the global state because I thought you > implied you wanted it. From the display server POV, global state is the > most useful as I am more interested in whether I can reliably use the GPU > and prevent a DoS from a misbehaving set of clients. In the X model, this would be easy to solve. We would make the total-number-of-resets query only available to be DRM master. As we move away from that model, it becomes more difficult. It seems that the usage is different enough that having separate queries for the global reset count and the per-context resets feels right. That still leaves the potential information leakage issue, and I'm honestly not sure what the right answer is there. Perhaps we could add that query after we've gotten some feedback from an actual security expert. There may be other options that we haven't considered yet. Maybe something like advertising the GPU reset "load average." Dunno. >> We also don't have any requirement to expose this functionality on >> anything pre-Sandy Bridge. We may want, at some point, to expose it >> on Ironlake. Based on that, it seems reasonable to tie the >> availability of reset notifications to hardware contexts. Since we >> won't test it, Mesa certainly isn't going to expose it on 8xx or >> 915. > >>From a design standpoint hw contexts are just one function of the context > object and not the reverse. Ben has been rightfully arguing that we need > contexts in the kernel for far more than supporting logical hw contexts. That's a fair point, and I also tend to agree with Ben. I think it may also be an orthogonal issue. As things currently stand (at least as far as I can tell) kernel contexts are tied to logical hw contexts. Since we don't have a requirement for reset query support on older GPUs, it doesn't seem to make sense to put extra code in the kernel, that nobody will use, to do it. If at some point the linkage between kernel contexts and hw contexts, and we can get a bunch of this mostly for free, then it makes sense to add the support then. Every single time I've written "future" code like that, it has either been broken or it turns out to not be what I need when the future arrives. I know I'm not the only one with that experience. >> Based on this, a simplified interface became obvious: for each >> hardware context, track a mask of kinds of resets that have affected >> that context. I defined three bits in the mask: > > The problem is that you added an atomic-read-and-reset into the ioctl. I > objected to that policy when Mika presented it earlier as it prevents > concurrent accesses and so imposes an explicit usage model. I may have mentioned that my kernel patch was very early work and should be taken with a grain of salt. :) You probably noticed that the patch was generated by diff and not by git-format-patch. I hadn't even committed any of that to my local tree. Given its shabby state, I was reluctant to even send it out. However, if I'm going to be critical of another person's work, I have a responsibility to be fully open. I too came to the conclusion that the atomic-read-and-reset part was a bad idea... I had forgotten that I even coded it that way, or I would have mentioned that in my previous message. The other thing in my kernel patch that I think is a bad idea (as you note below) is that processes can query the reset state of HW contexts associated with different fds. That coupled with the atomic-read-and-reset is completely disastrous. I think this should be tightened up for all the reasons we've both mentioned. >> 1. The hw context had a batch executing when the reset occurred. >> >> 2. The hw context had a batch pending when the reset occurred. >> Hopefully in the future we could make most occurrences of this go >> away by re-queuing the batch. It may also be possible to do this in >> user-mode, but that may be more difficult. >> >> 3. "Other." >> >> There's also the potential to add other bits to communicate >> information about lost buffers, textures, etc. This could be used >> to create a layered extension that lets applications transfer data >> from the dead context to the new context. Browsers may not be too >> interested in this, but I could imagine compositors or other kinds >> of applications benefiting. We may never implement that, but it's >> easier to communicate this through a set of flags than through a set >> of counts. >> >> The guilty / innocent counts in this patch series lose information. > > Disagree. The flags represent a lose of information, as you explained > earlier. With the counts in this series, user-mode can't tell the difference between a reset that didn't affect the context and one that did. Once a GL context observes a reset (i.e., when glGetGraphicsResetStatusARB returns non-GL_NO_ERROR), it is dead. The only numbers that matter in that case are zero and not-zero. If we assume that applications using this interface will query resets on a periodic basis, and both the major Linux browser vendors have told me that they do, the count will either be zero or some very small number. Given that, there's very little information lost, and the information that is lost is not important to the consumers of that information. Now, this isn't the case with the display manager use case. That case clearly wants counts, and it wants counts over time. Like I said above, I think this is a different enough use case that having a separate mechanism may be the right approach. >> We can probably reconstruct the current flags from the counts, but >> we would have to do some refactoring (or additional counting) to get >> other flags in the future. All of this would make the kernel code >> more complex. Why reconstruct the data you want when you could just >> track that data in the first place? > > Because you are constructing policy decision based on raw information. > >> Since the functionality in not available on all hardware, I also >> added a query to determine the availability. This has the added >> benefit that, should this interface prove to be insufficient in the >> future, we could deprecate it by having the query always return >> false. > > That offers no more information than just probing the ioctl. There are a couple other bits of kernel functionality that Mesa queries using this method before using (e.g., GEN7 SOL reset). Since this was, as far as I could tell, similarly querying kernel support for some hardware functionality, I assumed it should have a similar sort of query. In the future, what metric should I use to determine which is which? Also, what does "probing the ioctl" mean exactly? Just call it and see whether you get ENOSYS? Doesn't that run afoul of the requirement to never remove interfaces? >> All software involved has to be able the handle the case >> where the interface is not available, so I don't think this should >> run awry of the rules about kernel/user interface breakage. Please >> correct me if I'm wrong. > > [snip patch locations] > > The biggest change you made was to reduce the counts to a set of flags, > to make it harder for an attacker to analyse the reset frequency. > Instead that attacker can just poll the flags and coarsely reconstruct > the counts... If that is their goal. Instead it transfers userspace > policy into the kernel, which is what we were seeking to avoid, right? > -Chris
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 1c67fb2..2cc5817 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1684,6 +1684,10 @@ int i915_switch_context(struct intel_ring_buffer *ring, struct drm_file *file, int to_id, struct i915_hw_context **ctx); void i915_gem_context_free(struct kref *ctx_ref); +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring, + struct drm_file *file, + u32 id, + struct ctx_reset_state **rs); int i915_gem_context_create_ioctl(struct drm_device *dev, void *data, struct drm_file *file); int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data, diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c index cba54fb..1b14a06 100644 --- a/drivers/gpu/drm/i915/i915_gem_context.c +++ b/drivers/gpu/drm/i915/i915_gem_context.c @@ -304,6 +304,40 @@ static int context_idr_cleanup(int id, void *p, void *data) return 0; } +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring, + struct drm_file *file, + u32 id, + struct ctx_reset_state **rs) +{ + struct drm_i915_private *dev_priv = ring->dev->dev_private; + struct drm_i915_file_private *file_priv = file->driver_priv; + struct i915_hw_context *to; + + if (dev_priv->hw_contexts_disabled) + return -ENOENT; + + if (ring->id != RCS) + return -EINVAL; + + if (rs == NULL) + return -EINVAL; + + if (file == NULL) + return -EINVAL; + + if (id == DEFAULT_CONTEXT_ID) { + *rs = &file_priv->reset_state; + } else { + to = i915_gem_context_get(file->driver_priv, id); + if (to == NULL) + return -ENOENT; + + *rs = &to->reset_state; + } + + return 0; +} + void i915_gem_context_close(struct drm_device *dev, struct drm_file *file) { struct drm_i915_file_private *file_priv = file->driver_priv;
For arb-robustness, every context needs to have it's own reset state tracking. Default context will be handled in a identical way as the no-context case in further down in the patch set. For no-context case, the reset state will be stored in the file_priv part. v2: handle default context inside get_reset_state Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> --- drivers/gpu/drm/i915/i915_drv.h | 4 ++++ drivers/gpu/drm/i915/i915_gem_context.c | 34 +++++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+)