Message ID | 20181122165106.18238-2-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Series | RFC: mmu notifier debug checks |
Quoting Daniel Vetter (2018-11-22 16:51:04) > Just a bit of paranoia, since if we start pushing this deep into > callchains it's hard to spot all places where an mmu notifier > implementation might fail when it's not allowed to. Most callers could handle the failure correctly. It looks like the failure was not propagated for convenience. -Chris
Am 22.11.18 um 17:51 schrieb Daniel Vetter: > Just a bit of paranoia, since if we start pushing this deep into > callchains it's hard to spot all places where an mmu notifier > implementation might fail when it's not allowed to. > > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: "Christian König" <christian.koenig@amd.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch> > Cc: "Jérôme Glisse" <jglisse@redhat.com> > Cc: linux-mm@kvack.org > Cc: Paolo Bonzini <pbonzini@redhat.com> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Acked-by: Christian König <christian.koenig@amd.com> > --- > mm/mmu_notifier.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > index 5119ff846769..59e102589a25 100644 > --- a/mm/mmu_notifier.c > +++ b/mm/mmu_notifier.c > @@ -190,6 +190,8 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, > pr_info("%pS callback failed with %d in %sblockable context.\n", > mn->ops->invalidate_range_start, _ret, > !blockable ? "non-" : ""); > + WARN(blockable,"%pS callback failure not allowed\n", > + mn->ops->invalidate_range_start); > ret = _ret; > } > }
On Thu, Nov 22, 2018 at 04:53:34PM +0000, Chris Wilson wrote: > Quoting Daniel Vetter (2018-11-22 16:51:04) > > Just a bit of paranoia, since if we start pushing this deep into > > callchains it's hard to spot all places where an mmu notifier > > implementation might fail when it's not allowed to. > > Most callers could handle the failure correctly. It looks like the > failure was not propagated for convenience. I have no idea whether the mm is semantically ok if pte shootdown doesn't work for all sorts of strange reasons. From the commit that introduced the error code it sounded like this was very much only ok in the limited case of an already killed process, in the oom killer path, where it's really only about trying to free any kind of memory. And where the process is gone already, so semantics of what exactly happens don't matter that much anymore. And even if a lot more paths could support some kind of error recovery (they'd need to restart stuff, at least for your i915 patch to work I think), as long as we have paths where that's not allowed I think it's good to catch any bugs where a nonzero errno is erroneously returned. -Daniel
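The contract described here (a callback may fail only in the non-blockable OOM-killer path; in a blockable context it has to wait and succeed) can be sketched as a small userspace model. The function name, the `busy` flag, and the choice of `-EAGAIN` as the error are assumptions for illustration; the real `invalidate_range_start` callback also receives the notifier, the `mm`, and the address range.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of a driver's invalidate_range_start
 * callback. `busy` stands in for "the driver would have to sleep to
 * finish the PTE shootdown" (e.g. wait for in-flight device work).
 */
static int example_invalidate_range_start(bool blockable, bool busy)
{
	if (!busy)
		return 0;       /* nothing to wait for: succeed in any context */

	if (!blockable)
		return -EAGAIN; /* may not sleep: failing is allowed (OOM path) */

	/*
	 * Blockable context: a real driver would sleep here until the work
	 * is done. Failing is not allowed, so this path always succeeds.
	 */
	return 0;
}
```

Under this model, the debug check in the patch only ever fires for a callback that breaks the last branch, i.e. returns an error even though it was allowed to block.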
On Fri 23-11-18 09:49:34, Daniel Vetter wrote: > On Thu, Nov 22, 2018 at 04:53:34PM +0000, Chris Wilson wrote: > > Quoting Daniel Vetter (2018-11-22 16:51:04) > > > Just a bit of paranoia, since if we start pushing this deep into > > > callchains it's hard to spot all places where an mmu notifier > > > implementation might fail when it's not allowed to. > > > > Most callers could handle the failure correctly. It looks like the > > failure was not propagated for convenience. > > I have no idea whether the mm is semantically ok if pte shootdown doesn't > work for all sorts of strange reasons. From the commit that introduced the > error code it sounded like this was very much only ok in the limited case > of an already killed process, in the oom killer path, where it's really > only about trying to free any kind of memory. And where the process is > gone already, so semantics of what exactly happens don't matter that much > anymore. Yes this was indeed the case. There is still the exit path which would do the rest of the work so we are not leaving anything behind.
On Thu 22-11-18 17:51:04, Daniel Vetter wrote: > Just a bit of paranoia, since if we start pushing this deep into > callchains it's hard to spot all places where an mmu notifier > implementation might fail when it's not allowed to. What does WARN give you more than the existing pr_info? Is really backtrace that interesting? > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: "Christian König" <christian.koenig@amd.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch> > Cc: "Jérôme Glisse" <jglisse@redhat.com> > Cc: linux-mm@kvack.org > Cc: Paolo Bonzini <pbonzini@redhat.com> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > --- > mm/mmu_notifier.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > index 5119ff846769..59e102589a25 100644 > --- a/mm/mmu_notifier.c > +++ b/mm/mmu_notifier.c > @@ -190,6 +190,8 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, > pr_info("%pS callback failed with %d in %sblockable context.\n", > mn->ops->invalidate_range_start, _ret, > !blockable ? "non-" : ""); > + WARN(blockable,"%pS callback failure not allowed\n", > + mn->ops->invalidate_range_start); > ret = _ret; > } > } > -- > 2.19.1 >
On Fri, Nov 23, 2018 at 12:15:57PM +0100, Michal Hocko wrote: > On Thu 22-11-18 17:51:04, Daniel Vetter wrote: > > Just a bit of paranoia, since if we start pushing this deep into > > callchains it's hard to spot all places where an mmu notifier > > implementation might fail when it's not allowed to. > > What does WARN give you more than the existing pr_info? Is really > backtrace that interesting? Automated tools have to ignore everything at info level (there's too much of that). I guess I could do something like if (blockable) pr_warn(...) else pr_info(...) WARN() is simply my goto tool for getting something at warning level dumped into dmesg. But I think the pr_warn with the callback function should be enough indeed. If you wonder where all the info level stuff happens that we have to ignore: suspend/resume is a primary culprit (fairly important for gfx/desktops), but there's a bunch of other places. Even if we ignore everything at info and below we still need filters because some drivers are a bit too trigger-happy (i915 definitely included I guess, so everyone contributes to this problem). Cheers, Daniel
On Fri 23-11-18 13:30:57, Daniel Vetter wrote: > On Fri, Nov 23, 2018 at 12:15:57PM +0100, Michal Hocko wrote: > > On Thu 22-11-18 17:51:04, Daniel Vetter wrote: > > > Just a bit of paranoia, since if we start pushing this deep into > > > callchains it's hard to spot all places where an mmu notifier > > > implementation might fail when it's not allowed to. > > > > What does WARN give you more than the existing pr_info? Is really > > backtrace that interesting? > > Automated tools have to ignore everything at info level (there's too much > of that). I guess I could do something like > > if (blockable) > pr_warn(...) > else > pr_info(...) > > WARN() is simply my goto tool for getting something at warning level > dumped into dmesg. But I think the pr_warn with the callback function > should be enough indeed. I wouldn't mind s@pr_info@pr_warn@ > If you wonder where all the info level stuff happens that we have to > ignore: suspend/resume is a primary culprit (fairly important for > gfx/desktops), but there's a bunch of other places. Even if we ignore > everything at info and below we still need filters because some drivers > are a bit too trigger-happy (i915 definitely included I guess, so everyone > contributes to this problem). Thanks for the clarification.
On Fri, Nov 23, 2018 at 1:43 PM Michal Hocko <mhocko@kernel.org> wrote: > On Fri 23-11-18 13:30:57, Daniel Vetter wrote: > > On Fri, Nov 23, 2018 at 12:15:57PM +0100, Michal Hocko wrote: > > > On Thu 22-11-18 17:51:04, Daniel Vetter wrote: > > > > Just a bit of paranoia, since if we start pushing this deep into > > > > callchains it's hard to spot all places where an mmu notifier > > > > implementation might fail when it's not allowed to. > > > > > > What does WARN give you more than the existing pr_info? Is really > > > backtrace that interesting? > > > > Automated tools have to ignore everything at info level (there's too much > > of that). I guess I could do something like > > > > if (blockable) > > pr_warn(...) > > else > > pr_info(...) > > > > WARN() is simply my goto tool for getting something at warning level > > dumped into dmesg. But I think the pr_warn with the callback function > > should be enough indeed. > > I wouldn't mind s@pr_info@pr_warn@ Well that's too much, because then it would misfire in the oom testcase, where failing is ok (desirable even, we want to avoid blocking after all). So it needs to be a switch (or else we need to filter it in results, and that's a bit of a maintenance headache from a CI pov). -Daniel > > If you wonder where all the info level stuff happens that we have to > > ignore: suspend/resume is a primary culprit (fairly important for > > gfx/desktops), but there's a bunch of other places. Even if we ignore > > everything at info and below we still need filters because some drivers > > are a bit too trigger-happy (i915 definitely included I guess, so everyone > > contributes to this problem). > > Thanks for the clarification. > -- > Michal Hocko > SUSE Labs
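The switch proposed here (warning level only when a failure is a bug, info level in the OOM path where failing is expected) can be modeled in userspace roughly as follows. `failure_level()`, `report_callback_failure()`, and the `REPORT_*` values are hypothetical names; in the kernel the two branches would be `pr_warn()` and `pr_info()` calls inside `__mmu_notifier_invalidate_range_start()`.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the two kernel log levels under discussion. */
enum report_level { REPORT_INFO, REPORT_WARN };

/*
 * Blockable context: a failing callback is a bug, so report it at
 * warning level, where CI tooling that filters out info messages sees it.
 * Non-blockable context (OOM path): failing is expected, so stay at info.
 */
static enum report_level failure_level(bool blockable)
{
	return blockable ? REPORT_WARN : REPORT_INFO;
}

/* Userspace analogue of the pr_warn()/pr_info() switch. */
static void report_callback_failure(const char *cb_name, int err, bool blockable)
{
	const char *tag = failure_level(blockable) == REPORT_WARN ? "warn" : "info";

	fprintf(stderr, "[%s] %s callback failed with %d in %sblockable context.\n",
		tag, cb_name, err, !blockable ? "non-" : "");
}
```

Compared with the `WARN()` in the patch, this keeps one message per failure that names the callback, without dumping a backtrace.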
On Fri 23-11-18 14:15:11, Daniel Vetter wrote: > On Fri, Nov 23, 2018 at 1:43 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Fri 23-11-18 13:30:57, Daniel Vetter wrote: > > > On Fri, Nov 23, 2018 at 12:15:57PM +0100, Michal Hocko wrote: > > > > On Thu 22-11-18 17:51:04, Daniel Vetter wrote: > > > > > Just a bit of paranoia, since if we start pushing this deep into > > > > > callchains it's hard to spot all places where an mmu notifier > > > > > implementation might fail when it's not allowed to. > > > > > > > > What does WARN give you more than the existing pr_info? Is really > > > > backtrace that interesting? > > > > > > Automated tools have to ignore everything at info level (there's too much > > > of that). I guess I could do something like > > > > > > if (blockable) > > > pr_warn(...) > > > else > > > pr_info(...) > > > > > > WARN() is simply my goto tool for getting something at warning level > > > dumped into dmesg. But I think the pr_warn with the callback function > > > should be enough indeed. > > > > I wouldn't mind s@pr_info@pr_warn@ > > Well that's too much, because then it would misfire in the oom > testcase, where failing is ok (desirable even, we want to avoid > blocking after all). So it needs to be a switch (or else we need to > filter it in results, and that's a bit of a maintenance headache from a > CI pov). I thought failures should be rare enough that warning about them can actually be useful. E.g. in the oom case we can live with the failure because we want to release _some_ memory, but knowing about a callback that prevents us from going the full way might be interesting. But I do not really feel strongly about this. I find WARN a bit of an abuse because the trace is unlikely to help us much. If you want to make the verbosity depend on the blockable context then I will surely not stand in the way.
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 5119ff846769..59e102589a25 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -190,6 +190,8 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				pr_info("%pS callback failed with %d in %sblockable context.\n",
 					mn->ops->invalidate_range_start, _ret,
 					!blockable ? "non-" : "");
+				WARN(blockable,"%pS callback failure not allowed\n",
+				     mn->ops->invalidate_range_start);
 				ret = _ret;
 			}
 		}
Just a bit of paranoia, since if we start pushing this deep into callchains it's hard to spot all places where an mmu notifier implementation might fail when it's not allowed to.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 mm/mmu_notifier.c | 2 ++
 1 file changed, 2 insertions(+)
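The control flow of the patched loop can be modeled in userspace: every callback runs, each failure is logged, the last error is returned, and only a failure in a blockable context counts as the bug the new `WARN()` flags. The callback type, `warn_count`, and the sample callbacks are hypothetical simplifications; the kernel iterates over the registered notifiers rather than an array, and the real callback takes more arguments than the `blockable` flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical simplified callback type: only the blockable flag is modeled. */
typedef int (*range_start_fn)(bool blockable);

/* Number of times the model would have triggered WARN(blockable, ...). */
static int warn_count;

static int model_invalidate_range_start(const range_start_fn *cbs, size_t n,
					bool blockable)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		int _ret = cbs[i](blockable);

		if (_ret) {
			printf("callback %zu failed with %d in %sblockable context.\n",
			       i, _ret, !blockable ? "non-" : "");
			if (blockable)
				warn_count++;  /* failure not allowed here */
			ret = _ret;            /* keep going, report the last error */
		}
	}
	return ret;
}

/* Sample callbacks: one well-behaved, one that (wrongly) fails everywhere. */
static int cb_ok(bool blockable)
{
	(void)blockable;
	return 0;
}

static int cb_always_fails(bool blockable)
{
	(void)blockable;
	return -EAGAIN;
}
```

Running both callbacks non-blockable returns `-EAGAIN` without counting a warning; running them blockable returns the same error but counts one, which is exactly the case the patch wants to make loud.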