Message ID | 20241025174113.554-5-Yunxiang.Li@amd.com (mailing list archive) |
---|---|
State | New, archived |
Series | rework bo mem stats tracking |
On 25.10.24 at 19:41, Yunxiang Li wrote:
> Add a helper to check if the memory stats is zero, this will be used to
> check for memory accounting errors.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

In theory I would need to upstream that through the drm-misc-next channel,
but I think that's simple enough that we can take it through
amd-staging-drm-next.

Regards,
Christian.

> ---
>  drivers/gpu/drm/drm_file.c | 9 +++++++++
>  include/drm/drm_file.h     | 1 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index 714e42b051080..75ed701d80f74 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -859,6 +859,15 @@ static void print_size(struct drm_printer *p, const char *stat,
>  	drm_printf(p, "drm-%s-%s:\t%llu%s\n", stat, region, sz, units[u]);
>  }
>
> +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats) {
> +	return (stats->shared == 0 &&
> +		stats->private == 0 &&
> +		stats->resident == 0 &&
> +		stats->purgeable == 0 &&
> +		stats->active == 0);
> +}
> +EXPORT_SYMBOL(drm_memory_stats_is_zero);
> +
>  /**
>   * drm_print_memory_stats - A helper to print memory stats
>   * @p: The printer to print output to
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index ab230d3af138d..7f91e35d027d9 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -477,6 +477,7 @@ struct drm_memory_stats {
>
>  enum drm_gem_object_status;
>
> +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats);
>  void drm_print_memory_stats(struct drm_printer *p,
>  			    const struct drm_memory_stats *stats,
>  			    enum drm_gem_object_status supported_status,
On 25/10/2024 18:41, Yunxiang Li wrote:
> Add a helper to check if the memory stats is zero, this will be used to
> check for memory accounting errors.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>  drivers/gpu/drm/drm_file.c | 9 +++++++++
>  include/drm/drm_file.h     | 1 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index 714e42b051080..75ed701d80f74 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -859,6 +859,15 @@ static void print_size(struct drm_printer *p, const char *stat,
>  	drm_printf(p, "drm-%s-%s:\t%llu%s\n", stat, region, sz, units[u]);
>  }
>
> +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats) {
> +	return (stats->shared == 0 &&
> +		stats->private == 0 &&
> +		stats->resident == 0 &&
> +		stats->purgeable == 0 &&
> +		stats->active == 0);
> +}

Could use mem_is_zero() for some value of source/binary compactness.

> +EXPORT_SYMBOL(drm_memory_stats_is_zero);
> +

I am not a huge fan of adding this as an interface as the only caller
appears to be a sanity check in amdgpu_vm_fini():

	if (!amdgpu_vm_stats_is_zero(vm))
		dev_err(adev->dev, "VM memory stats is non-zero when fini\n");

But I guess there is some value in sanity checking since amdgpu does not
have a notion of debug only code (compiled at production and exercised via
a test suite).

I do suggest demoting the dev_err to notice log level; that would suffice
and be more accurate.

Regards,

Tvrtko

>  /**
>   * drm_print_memory_stats - A helper to print memory stats
>   * @p: The printer to print output to
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index ab230d3af138d..7f91e35d027d9 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -477,6 +477,7 @@ struct drm_memory_stats {
>
>  enum drm_gem_object_status;
>
> +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats);
>  void drm_print_memory_stats(struct drm_printer *p,
>  			    const struct drm_memory_stats *stats,
>  			    enum drm_gem_object_status supported_status,
[AMD Official Use Only - AMD Internal Distribution Only]

> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Thursday, November 7, 2024 5:41
> On 25/10/2024 18:41, Yunxiang Li wrote:
> > Add a helper to check if the memory stats is zero, this will be used
> > to check for memory accounting errors.
> >
> > Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> > ---
> >  drivers/gpu/drm/drm_file.c | 9 +++++++++
> >  include/drm/drm_file.h     | 1 +
> >  2 files changed, 10 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> > index 714e42b051080..75ed701d80f74 100644
> > --- a/drivers/gpu/drm/drm_file.c
> > +++ b/drivers/gpu/drm/drm_file.c
> > @@ -859,6 +859,15 @@ static void print_size(struct drm_printer *p, const char *stat,
> >  	drm_printf(p, "drm-%s-%s:\t%llu%s\n", stat, region, sz, units[u]);
> >  }
> >
> > +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats) {
> > +	return (stats->shared == 0 &&
> > +		stats->private == 0 &&
> > +		stats->resident == 0 &&
> > +		stats->purgeable == 0 &&
> > +		stats->active == 0);
> > +}
>
> Could use mem_is_zero() for some value of source/binary compactness.

Yeah, the patch set started out with that when it was just a function in
amdgpu, but Christian didn't like it.

> > +EXPORT_SYMBOL(drm_memory_stats_is_zero);
> > +
>
> I am not a huge fan of adding this as an interface as the only caller
> appears to be a sanity check in amdgpu_vm_fini():
>
>	if (!amdgpu_vm_stats_is_zero(vm))
>		dev_err(adev->dev, "VM memory stats is non-zero when fini\n");
>
> But I guess there is some value in sanity checking since amdgpu does not
> have a notion of debug only code (compiled at production and exercised
> via a test suite).
>
> I do suggest demoting the dev_err to notice log level; that would suffice
> and be more accurate.

I think it's very important to have a check like this when we have a known
invariant, especially in this case where there's stat tracking code spread
out everywhere and we have very little chance of catching a bug right when
it happens. And since whenever this check fails we know for sure there is
a bug, I don't see the harm in keeping it as an error.

Now that I think about it, I probably want to have the process & task name
in here to aid in reproduction.

Teddy
On 07/11/2024 14:17, Li, Yunxiang (Teddy) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Sent: Thursday, November 7, 2024 5:41
>> On 25/10/2024 18:41, Yunxiang Li wrote:
>>> Add a helper to check if the memory stats is zero, this will be used
>>> to check for memory accounting errors.
>>>
>>> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>>> ---
>>>  drivers/gpu/drm/drm_file.c | 9 +++++++++
>>>  include/drm/drm_file.h     | 1 +
>>>  2 files changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>>> index 714e42b051080..75ed701d80f74 100644
>>> --- a/drivers/gpu/drm/drm_file.c
>>> +++ b/drivers/gpu/drm/drm_file.c
>>> @@ -859,6 +859,15 @@ static void print_size(struct drm_printer *p, const char *stat,
>>>  	drm_printf(p, "drm-%s-%s:\t%llu%s\n", stat, region, sz, units[u]);
>>>  }
>>>
>>> +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats) {
>>> +	return (stats->shared == 0 &&
>>> +		stats->private == 0 &&
>>> +		stats->resident == 0 &&
>>> +		stats->purgeable == 0 &&
>>> +		stats->active == 0);
>>> +}
>>
>> Could use mem_is_zero() for some value of source/binary compactness.
>
> Yeah, the patch set started out with that when it was just a function in
> amdgpu, but Christian didn't like it.

Okay, I don't feel so strongly about the implementation details.

>>> +EXPORT_SYMBOL(drm_memory_stats_is_zero);
>>> +
>>
>> I am not a huge fan of adding this as an interface as the only caller
>> appears to be a sanity check in amdgpu_vm_fini():
>>
>>	if (!amdgpu_vm_stats_is_zero(vm))
>>		dev_err(adev->dev, "VM memory stats is non-zero when fini\n");
>>
>> But I guess there is some value in sanity checking since amdgpu does not
>> have a notion of debug only code (compiled at production and exercised
>> via a test suite).
>>
>> I do suggest demoting the dev_err to notice log level; that would
>> suffice and be more accurate.
>
> I think it's very important to have a check like this when we have a
> known invariant, especially in this case where there's stat tracking code
> spread out everywhere and we have very little chance of catching a bug
> right when it happens. And since whenever this check fails we know for
> sure there is a bug, I don't see the harm in keeping it as an error.

It would indeed be a programming error if it can happen, but from the
point of view of a driver and system log I think a warning is actually
right.

Regards,

Tvrtko

>
> Now that I think about it, I probably want to have the process & task
> name in here to aid in reproduction.
>
> Teddy
On 07.11.24 at 15:43, Tvrtko Ursulin wrote:
> On 07/11/2024 14:17, Li, Yunxiang (Teddy) wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>> Sent: Thursday, November 7, 2024 5:41
>>> On 25/10/2024 18:41, Yunxiang Li wrote:
>>>> Add a helper to check if the memory stats is zero, this will be used
>>>> to check for memory accounting errors.
>>>>
>>>> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>>>> ---
>>>>  drivers/gpu/drm/drm_file.c | 9 +++++++++
>>>>  include/drm/drm_file.h     | 1 +
>>>>  2 files changed, 10 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>>>> index 714e42b051080..75ed701d80f74 100644
>>>> --- a/drivers/gpu/drm/drm_file.c
>>>> +++ b/drivers/gpu/drm/drm_file.c
>>>> @@ -859,6 +859,15 @@ static void print_size(struct drm_printer *p, const char *stat,
>>>>  	drm_printf(p, "drm-%s-%s:\t%llu%s\n", stat, region, sz, units[u]);
>>>>  }
>>>>
>>>> +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats) {
>>>> +	return (stats->shared == 0 &&
>>>> +		stats->private == 0 &&
>>>> +		stats->resident == 0 &&
>>>> +		stats->purgeable == 0 &&
>>>> +		stats->active == 0);
>>>> +}
>>>
>>> Could use mem_is_zero() for some value of source/binary compactness.
>>
>> Yeah, the patch set started out with that when it was just a function
>> in amdgpu, but Christian didn't like it.
>
> Okay, I don't feel so strongly about the implementation details.

mem_is_zero() just has the tendency to randomly fail when the compiler
adds padding in between fields.

>>>> +EXPORT_SYMBOL(drm_memory_stats_is_zero);
>>>> +
>>>
>>> I am not a huge fan of adding this as an interface as the only caller
>>> appears to be a sanity check in amdgpu_vm_fini():
>>>
>>>	if (!amdgpu_vm_stats_is_zero(vm))
>>>		dev_err(adev->dev, "VM memory stats is non-zero when fini\n");
>>>
>>> But I guess there is some value in sanity checking since amdgpu does
>>> not have a notion of debug only code (compiled at production and
>>> exercised via a test suite).
>>>
>>> I do suggest demoting the dev_err to notice log level; that would
>>> suffice and be more accurate.
>>
>> I think it's very important to have a check like this when we have a
>> known invariant, especially in this case where there's stat tracking
>> code spread out everywhere and we have very little chance of catching
>> a bug right when it happens. And since whenever this check fails we
>> know for sure there is a bug, I don't see the harm in keeping it as
>> an error.
>
> It would indeed be a programming error if it can happen, but from the
> point of view of a driver and system log I think a warning is actually
> right.

Yeah, agreed. An error usually means you have either done something wrong
or your data is corrupted because something bad happened (failed disk
etc.). If the stats are nonsense, that is annoying but not fatal, so not
really an error.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>>
>> Now that I think about it, I probably want to have the process & task
>> name in here to aid in reproduction.
>>
>> Teddy
diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 714e42b051080..75ed701d80f74 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -859,6 +859,15 @@ static void print_size(struct drm_printer *p, const char *stat,
 	drm_printf(p, "drm-%s-%s:\t%llu%s\n", stat, region, sz, units[u]);
 }
 
+int drm_memory_stats_is_zero(const struct drm_memory_stats *stats) {
+	return (stats->shared == 0 &&
+		stats->private == 0 &&
+		stats->resident == 0 &&
+		stats->purgeable == 0 &&
+		stats->active == 0);
+}
+EXPORT_SYMBOL(drm_memory_stats_is_zero);
+
 /**
  * drm_print_memory_stats - A helper to print memory stats
  * @p: The printer to print output to
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index ab230d3af138d..7f91e35d027d9 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -477,6 +477,7 @@ struct drm_memory_stats {
 
 enum drm_gem_object_status;
 
+int drm_memory_stats_is_zero(const struct drm_memory_stats *stats);
 void drm_print_memory_stats(struct drm_printer *p,
 			    const struct drm_memory_stats *stats,
 			    enum drm_gem_object_status supported_status,
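The helper in this diff compares the counters one field at a time rather than scanning the raw bytes. The thread gives the reason: a mem_is_zero()-style scan can fail when the compiler adds padding between fields. The following is a minimal userspace sketch of that effect, not kernel code. The struct `padded_stats` and the functions `padded_stats_is_zero()`, `bytes_are_zero()` and `fill_with_dirty_padding()` are hypothetical names made up for illustration, and the mixed-width layout is chosen on purpose to force padding on common ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical struct, NOT the real struct drm_memory_stats: the small
 * first member makes common ABIs insert padding before 'bytes'.
 */
struct padded_stats {
	uint8_t  flags;
	uint64_t bytes;
};

/* Field-wise check, the same shape as the helper in the patch. */
static int padded_stats_is_zero(const struct padded_stats *s)
{
	return s->flags == 0 && s->bytes == 0;
}

/* Byte-wise check, a userspace stand-in for a mem_is_zero()-style scan. */
static int bytes_are_zero(const void *p, size_t n)
{
	const unsigned char *b = p;
	size_t i;

	for (i = 0; i < n; i++)
		if (b[i])
			return 0;
	return 1;
}

/*
 * Build an object whose named fields are all zero while every padding
 * byte holds 0xFF, simulating stale bytes in a struct that was never
 * cleared with memset().
 */
static void fill_with_dirty_padding(struct padded_stats *s)
{
	unsigned char raw[sizeof(*s)];

	memset(raw, 0xFF, sizeof(raw));
	memset(raw + offsetof(struct padded_stats, flags), 0, sizeof(s->flags));
	memset(raw + offsetof(struct padded_stats, bytes), 0, sizeof(s->bytes));
	memcpy(s, raw, sizeof(raw));
}
```

After `fill_with_dirty_padding(&s)`, the field-wise check reports zero while the byte-wise scan over `sizeof(s)` does not, because the scan also sees the padding bytes. The thread does not show the layout of struct drm_memory_stats, so this sketch makes no claim about whether that struct has padding today; it only shows why a field-wise comparison stays correct if the layout changes.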
Add a helper to check if the memory stats is zero, this will be used to
check for memory accounting errors.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/drm_file.c | 9 +++++++++
 include/drm/drm_file.h     | 1 +
 2 files changed, 10 insertions(+)
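To make the "memory accounting errors" use case concrete, here is a small userspace sketch of the invariant discussed in the thread: every counter that was incremented must have been decremented again by teardown. It is not the amdgpu code. `struct mem_stats`, `account_add()`, `account_remove()` and `stats_fini_check()` are hypothetical names, and the `uint64_t` field types are an assumption; only the five field names are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in carrying the five counters the patch compares.
 * Field types are assumed; the real definition lives in
 * include/drm/drm_file.h.
 */
struct mem_stats {
	uint64_t shared;
	uint64_t private;
	uint64_t resident;
	uint64_t purgeable;
	uint64_t active;
};

/* Same logic as the helper in the patch, on the stand-in struct. */
static int mem_stats_is_zero(const struct mem_stats *s)
{
	return s->shared == 0 && s->private == 0 && s->resident == 0 &&
	       s->purgeable == 0 && s->active == 0;
}

/* Hypothetical hook: a private buffer of 'size' bytes becomes resident. */
static void account_add(struct mem_stats *s, uint64_t size)
{
	s->private += size;
	s->resident += size;
}

/* Hypothetical hook: the same buffer is released again. */
static void account_remove(struct mem_stats *s, uint64_t size)
{
	s->private -= size;
	s->resident -= size;
}

/*
 * Teardown check: returns 0 when the books balance and -1 when some
 * counter is still non-zero, which means an add/remove pair was missed.
 */
static int stats_fini_check(const struct mem_stats *s)
{
	return mem_stats_is_zero(s) ? 0 : -1;
}
```

A caller would run `stats_fini_check()` once at teardown and log on a non-zero result. Which log level that message should use is the open question in the thread, so the sketch returns a status and leaves the logging to the caller.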