Message ID | 20190814202027.18735-2-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Series | hmm & mmu_notifier debug/lockdep annotations |
On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
>
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
>
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
>
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
>
> ...
>
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>  				pr_info("%pS callback failed with %d in %sblockable context.\n",
>  					mn->ops->invalidate_range_start, _ret,
>  					!mmu_notifier_range_blockable(range) ? "non-" : "");
> +				WARN_ON(mmu_notifier_range_blockable(range) ||
> +					ret != -EAGAIN);
>  				ret = _ret;
>  			}
>  		}

A problem with WARN_ON(a || b) is that if it triggers, we don't know
whether it was because of a or because of b. Or both. So I'd suggest

	WARN_ON(a);
	WARN_ON(b);
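The difference between the combined and the split check can be sketched in user space. This is only an illustration under stated assumptions, not kernel code: `MODEL_WARN_ON`, `check_combined`, `check_split` and `self_check` are hypothetical names, and the macro imitates just one property of WARN_ON(), namely that every call site reports its own location and expression. The second condition is written against the callback's return value, as in the later reply that uses `_ret`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Number of warnings emitted since the last reset (model state only). */
static int warn_count;

/*
 * Hypothetical user-space stand-in for WARN_ON(): prints the failing
 * expression and its line, counts the hit, and evaluates to the truth
 * value of the condition.
 */
#define MODEL_WARN_ON(cond)                                                  \
	((cond) ? (fprintf(stderr, "WARN at line %d: %s\n", __LINE__, #cond), \
		   ++warn_count, true)                                        \
		: false)

/* Combined form: a single report, whichever condition was violated. */
int check_combined(bool blockable, int cb_ret)
{
	warn_count = 0;
	(void)MODEL_WARN_ON(blockable || cb_ret != -EAGAIN);
	return warn_count;
}

/* Split form: one report per violated condition, each with its own line. */
int check_split(bool blockable, int cb_ret)
{
	warn_count = 0;
	(void)MODEL_WARN_ON(blockable);
	(void)MODEL_WARN_ON(cb_ret != -EAGAIN);
	return warn_count;
}

/* Small usage example: a failing callback seen in a blockable context. */
void self_check(void)
{
	/* Both conditions are violated, but the combined form reports once. */
	assert(check_combined(true, -ENOMEM) == 1);
	/* The split form reports each violation separately. */
	assert(check_split(true, -ENOMEM) == 2);
	/* The one permitted failure: -EAGAIN in a non-blockable context. */
	assert(check_split(false, -EAGAIN) == 0);
}
```

With the split form, the printed expression and line number identify which condition fired, without cross-referencing the preceding pr_info line.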
On Wed, Aug 14, 2019 at 03:14:47PM -0700, Andrew Morton wrote:

> On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> > Just a bit of paranoia, since if we start pushing this deep into
> > callchains it's hard to spot all places where an mmu notifier
> > implementation might fail when it's not allowed to.
> >
> > Inspired by some confusion we had discussing i915 mmu notifiers and
> > whether we could use the newly-introduced return value to handle some
> > corner cases. Until we realized that these are only for when a task
> > has been killed by the oom reaper.
> >
> > An alternative approach would be to split the callback into two
> > versions, one with the int return value, and the other with void
> > return value like in older kernels. But that's a lot more churn for
> > fairly little gain I think.
> >
> > Summary from the m-l discussion on why we want something at warning
> > level: This allows automated tooling in CI to catch bugs without
> > humans having to look at everything. If we just upgrade the existing
> > pr_info to a pr_warn, then we'll have false positives. And as-is, no
> > one will ever spot the problem since it's lost in the massive amounts
> > of overall dmesg noise.
> >
> > ...
> >
> > +++ b/mm/mmu_notifier.c
> > @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> >  				pr_info("%pS callback failed with %d in %sblockable context.\n",
> >  					mn->ops->invalidate_range_start, _ret,
> >  					!mmu_notifier_range_blockable(range) ? "non-" : "");
> > +				WARN_ON(mmu_notifier_range_blockable(range) ||
> > +					ret != -EAGAIN);
> >  				ret = _ret;
> >  			}
> >  		}
>
> A problem with WARN_ON(a || b) is that if it triggers, we don't know
> whether it was because of a or because of b. Or both. So I'd suggest
>
> 	WARN_ON(a);
> 	WARN_ON(b);
>

Well, we did just make a pr_info right above with the value of
blockable, that seems enough to tell the cases apart?

But you are generally right, the full logic:

	if (_ret) {
		if (WARN_ON(mmu_notifier_range_blockable(range)))
			continue;
		WARN_ON(_ret != -EAGAIN);
		ret = -EAGAIN;
		break;
	}

would force correct API contract up the call chain once we detect a
broken driver..

But at some point it does feel like a bit much debugging logic to have
in a production code path, as this should never happen and is just to
discourage wrong driver behaviors during driver development.

If we like this version then:

Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>

Also - I have a bunch of other patches to mmu notifiers for hmm.git,
so when everyone agrees I can grab this to avoid conflicts.

Thanks,
Jason
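The "full logic" in this reply can be modelled in user space to see how it pushes the contract up the call chain. This is a rough sketch under stated assumptions, not the kernel implementation: `model_cb`, `model_warn_on`, `model_invalidate_range_start` and the three sample callbacks are hypothetical names, an array of function pointers stands in for the list of registered notifiers, and a counter stands in for WARN_ON().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a notifier's invalidate_range_start callback. */
typedef int (*model_cb)(bool blockable);

/* Number of contract violations detected (stands in for WARN_ON splats). */
int model_warnings;

/* Counts a violation and returns the condition, like WARN_ON() does. */
static bool model_warn_on(bool cond)
{
	if (cond)
		model_warnings++;
	return cond;
}

/* Well-behaved callback: never fails. */
int cb_ok(bool blockable)
{
	(void)blockable;
	return 0;
}

/* Well-behaved callback: fails with -EAGAIN only when it may not block. */
int cb_eagain_when_atomic(bool blockable)
{
	return blockable ? 0 : -EAGAIN;
}

/* Broken callback: fails in every context, with the wrong error code. */
int cb_broken(bool blockable)
{
	(void)blockable;
	return -ENOMEM;
}

/* Model of the loop body proposed in the reply above. */
int model_invalidate_range_start(const model_cb *cbs, size_t n, bool blockable)
{
	int ret = 0;

	assert(cbs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		int _ret = cbs[i](blockable);

		if (_ret) {
			/* Blockable callers cannot handle failure: warn, then ignore it. */
			if (model_warn_on(blockable))
				continue;
			/* Non-blockable: only -EAGAIN is allowed; anything else warns. */
			model_warn_on(_ret != -EAGAIN);
			/* The caller only ever sees 0 or -EAGAIN. */
			ret = -EAGAIN;
			break;
		}
	}
	return ret;
}
```

In this model, a broken callback produces a warning in either context, but the caller still observes only the two outcomes the contract permits: 0 in a blockable context, and 0 or -EAGAIN in a non-blockable one.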
On 8/14/19 3:14 PM, Andrew Morton wrote:

> On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
>> Just a bit of paranoia, since if we start pushing this deep into
>> callchains it's hard to spot all places where an mmu notifier
>> implementation might fail when it's not allowed to.
>>
>> Inspired by some confusion we had discussing i915 mmu notifiers and
>> whether we could use the newly-introduced return value to handle some
>> corner cases. Until we realized that these are only for when a task
>> has been killed by the oom reaper.
>>
>> An alternative approach would be to split the callback into two
>> versions, one with the int return value, and the other with void
>> return value like in older kernels. But that's a lot more churn for
>> fairly little gain I think.
>>
>> Summary from the m-l discussion on why we want something at warning
>> level: This allows automated tooling in CI to catch bugs without
>> humans having to look at everything. If we just upgrade the existing
>> pr_info to a pr_warn, then we'll have false positives. And as-is, no
>> one will ever spot the problem since it's lost in the massive amounts
>> of overall dmesg noise.
>>
>> ...
>>
>> --- a/mm/mmu_notifier.c
>> +++ b/mm/mmu_notifier.c
>> @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>>  				pr_info("%pS callback failed with %d in %sblockable context.\n",
>>  					mn->ops->invalidate_range_start, _ret,
>>  					!mmu_notifier_range_blockable(range) ? "non-" : "");
>> +				WARN_ON(mmu_notifier_range_blockable(range) ||
>> +					ret != -EAGAIN);
>>  				ret = _ret;
>>  			}
>>  		}
>
> A problem with WARN_ON(a || b) is that if it triggers, we don't know
> whether it was because of a or because of b. Or both. So I'd suggest
>
> 	WARN_ON(a);
> 	WARN_ON(b);
>

This won't quite work. It is OK to have
mmu_notifier_range_blockable(range) be true or false.
sync_cpu_device_pagetables() shouldn't return
-EAGAIN unless blockable is true.
On Wed, Aug 14, 2019 at 10:20:23PM +0200, Daniel Vetter wrote:

> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
>
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
>
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
>
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
>
> v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
> the problematic case (Michal Hocko).
>
> v3: Rebase on top of Glisse's arg rework.
>
> v4: More rebase on top of Glisse reworking everything.
>
> v5: Fixup rebase damage and also catch failures != EAGAIN for
> !blockable (Jason). Also go back to WARN_ON as requested by Jason, so
> automatic checkers can easily catch bugs by setting panic_on_warn.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  mm/mmu_notifier.c | 2 ++
>  1 file changed, 2 insertions(+)

Applied to hmm.git, thanks

Jason
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b5670620aea0..16f1cbc775d0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 				pr_info("%pS callback failed with %d in %sblockable context.\n",
 					mn->ops->invalidate_range_start, _ret,
 					!mmu_notifier_range_blockable(range) ? "non-" : "");
+				WARN_ON(mmu_notifier_range_blockable(range) ||
+					ret != -EAGAIN);
 				ret = _ret;
 			}
 		}
Just a bit of paranoia, since if we start pushing this deep into
callchains it's hard to spot all places where an mmu notifier
implementation might fail when it's not allowed to.

Inspired by some confusion we had discussing i915 mmu notifiers and
whether we could use the newly-introduced return value to handle some
corner cases. Until we realized that these are only for when a task
has been killed by the oom reaper.

An alternative approach would be to split the callback into two
versions, one with the int return value, and the other with void
return value like in older kernels. But that's a lot more churn for
fairly little gain I think.

Summary from the m-l discussion on why we want something at warning
level: This allows automated tooling in CI to catch bugs without
humans having to look at everything. If we just upgrade the existing
pr_info to a pr_warn, then we'll have false positives. And as-is, no
one will ever spot the problem since it's lost in the massive amounts
of overall dmesg noise.

v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
the problematic case (Michal Hocko).

v3: Rebase on top of Glisse's arg rework.

v4: More rebase on top of Glisse reworking everything.

v5: Fixup rebase damage and also catch failures != EAGAIN for
!blockable (Jason). Also go back to WARN_ON as requested by Jason, so
automatic checkers can easily catch bugs by setting panic_on_warn.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 mm/mmu_notifier.c | 2 ++
 1 file changed, 2 insertions(+)