Message ID | 20220830214919.53220-1-surenb@google.com (mailing list archive) |
---|---|
Headers | show |
Series | Code tagging framework and applications | expand |
On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > =========================== > Code tagging framework > =========================== > Code tag is a structure identifying a specific location in the source code > which is generated at compile time and can be embedded in an application- > specific structure. Several applications of code tagging are included in > this RFC, such as memory allocation tracking, dynamic fault injection, > latency tracking and improved error code reporting. > Basically, it takes the old trick of "define a special elf section for > objects of a given type so that we can iterate over them at runtime" and > creates a proper library for it. I might be super dense this morning, but what!? I've skimmed through the set and I don't think I get it. What does this provide that ftrace/kprobes don't already allow?
On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote: > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > > =========================== > > Code tagging framework > > =========================== > > Code tag is a structure identifying a specific location in the source code > > which is generated at compile time and can be embedded in an application- > > specific structure. Several applications of code tagging are included in > > this RFC, such as memory allocation tracking, dynamic fault injection, > > latency tracking and improved error code reporting. > > Basically, it takes the old trick of "define a special elf section for > > objects of a given type so that we can iterate over them at runtime" and > > creates a proper library for it. > > I might be super dense this morning, but what!? I've skimmed through the > set and I don't think I get it. > > What does this provide that ftrace/kprobes don't already allow? You're kidding, right?
On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote: > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote: > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > > > =========================== > > > Code tagging framework > > > =========================== > > > Code tag is a structure identifying a specific location in the source code > > > which is generated at compile time and can be embedded in an application- > > > specific structure. Several applications of code tagging are included in > > > this RFC, such as memory allocation tracking, dynamic fault injection, > > > latency tracking and improved error code reporting. > > > Basically, it takes the old trick of "define a special elf section for > > > objects of a given type so that we can iterate over them at runtime" and > > > creates a proper library for it. > > > > I might be super dense this morning, but what!? I've skimmed through the > > set and I don't think I get it. > > > > What does this provide that ftrace/kprobes don't already allow? > > You're kidding, right? It's a valid question. From the description, it main addition that would be hard to do with ftrace or probes is catching where an error code is returned. A secondary addition would be catching all historical state and not just state since the tracing started. It's also unclear *who* would enable this. It looks like it would mostly have value during the development stage of an embedded platform to track kernel memory usage on a per-application basis in an environment where it may be difficult to setup tracing and tracking. Would it ever be enabled in production? Would a distribution ever enable this? If it's enabled, any overhead cannot be disabled/enabled at run or boot time so anyone enabling this would carry the cost without never necessarily consuming the data. It might be an ease-of-use thing. Gathering the information from traces is tricky and would need combining multiple different elements and that is development effort but not impossible. Whatever asking for an explanation as to why equivalent functionality cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
On Wed 31-08-22 11:19:48, Mel Gorman wrote: > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote: > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote: > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > > > > =========================== > > > > Code tagging framework > > > > =========================== > > > > Code tag is a structure identifying a specific location in the source code > > > > which is generated at compile time and can be embedded in an application- > > > > specific structure. Several applications of code tagging are included in > > > > this RFC, such as memory allocation tracking, dynamic fault injection, > > > > latency tracking and improved error code reporting. > > > > Basically, it takes the old trick of "define a special elf section for > > > > objects of a given type so that we can iterate over them at runtime" and > > > > creates a proper library for it. > > > > > > I might be super dense this morning, but what!? I've skimmed through the > > > set and I don't think I get it. > > > > > > What does this provide that ftrace/kprobes don't already allow? > > > > You're kidding, right? > > It's a valid question. From the description, it main addition that would > be hard to do with ftrace or probes is catching where an error code is > returned. A secondary addition would be catching all historical state and > not just state since the tracing started. > > It's also unclear *who* would enable this. It looks like it would mostly > have value during the development stage of an embedded platform to track > kernel memory usage on a per-application basis in an environment where it > may be difficult to setup tracing and tracking. Would it ever be enabled > in production? Would a distribution ever enable this? If it's enabled, any > overhead cannot be disabled/enabled at run or boot time so anyone enabling > this would carry the cost without never necessarily consuming the data. > > It might be an ease-of-use thing. Gathering the information from traces > is tricky and would need combining multiple different elements and that > is development effort but not impossible. > > Whatever asking for an explanation as to why equivalent functionality > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. Fully agreed and this is especially true for a change this size 77 files changed, 3406 insertions(+), 703 deletions(-)
On Wed, Aug 31, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote: > > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote: > > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > > > > > =========================== > > > > > Code tagging framework > > > > > =========================== > > > > > Code tag is a structure identifying a specific location in the source code > > > > > which is generated at compile time and can be embedded in an application- > > > > > specific structure. Several applications of code tagging are included in > > > > > this RFC, such as memory allocation tracking, dynamic fault injection, > > > > > latency tracking and improved error code reporting. > > > > > Basically, it takes the old trick of "define a special elf section for > > > > > objects of a given type so that we can iterate over them at runtime" and > > > > > creates a proper library for it. > > > > > > > > I might be super dense this morning, but what!? I've skimmed through the > > > > set and I don't think I get it. > > > > > > > > What does this provide that ftrace/kprobes don't already allow? > > > > > > You're kidding, right? > > > > It's a valid question. From the description, it main addition that would > > be hard to do with ftrace or probes is catching where an error code is > > returned. A secondary addition would be catching all historical state and > > not just state since the tracing started. > > > > It's also unclear *who* would enable this. It looks like it would mostly > > have value during the development stage of an embedded platform to track > > kernel memory usage on a per-application basis in an environment where it > > may be difficult to setup tracing and tracking. Would it ever be enabled > > in production? Would a distribution ever enable this? If it's enabled, any > > overhead cannot be disabled/enabled at run or boot time so anyone enabling > > this would carry the cost without never necessarily consuming the data. Thank you for the question. For memory tracking my intent is to have a mechanism that can be enabled in the field testing (pre-production testing on a large population of internal users). The issue that we are often facing is when some memory leaks are happening in the field but very hard to reproduce locally. We get a bugreport from the user which indicates it but often has not enough information to track it. Note that quite often these leaks/issues happen in the drivers, so even simply finding out where they came from is a big help. The way I envision this mechanism to be used is to enable the basic memory tracking in the field tests and have a user space process collecting the allocation statistics periodically (say once an hour). Once it detects some counter growing infinitely or atypically (the definition of this is left to the user space) it can enable context capturing only for that specific location, still keeping the overhead to the minimum but getting more information about potential issues. Collected stats and contexts are then attached to the bugreport and we get more visibility into the issue when we receive it. The goal is to provide a mechanism with low enough overhead that it can be enabled all the time during these field tests without affecting the device's performance profiles. Tracing is very cheap when it's disabled but having it enabled all the time would introduce higher overhead than the counter manipulations. My apologies, I should have clarified all this in this cover letter from the beginning. As for other applications, maybe I'm not such an advanced user of tracing but I think only the latency tracking application might be done with tracing, assuming we have all the right tracepoints but I don't see how we would use tracing for fault injections and descriptive error codes. Again, I might be mistaken. Thanks, Suren. > > > > It might be an ease-of-use thing. Gathering the information from traces > > is tricky and would need combining multiple different elements and that > > is development effort but not impossible. > > > > Whatever asking for an explanation as to why equivalent functionality > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > Fully agreed and this is especially true for a change this size > 77 files changed, 3406 insertions(+), 703 deletions(-) > > -- > Michal Hocko > SUSE Labs
On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote: > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote: > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote: > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > > > > =========================== > > > > Code tagging framework > > > > =========================== > > > > Code tag is a structure identifying a specific location in the source code > > > > which is generated at compile time and can be embedded in an application- > > > > specific structure. Several applications of code tagging are included in > > > > this RFC, such as memory allocation tracking, dynamic fault injection, > > > > latency tracking and improved error code reporting. > > > > Basically, it takes the old trick of "define a special elf section for > > > > objects of a given type so that we can iterate over them at runtime" and > > > > creates a proper library for it. > > > > > > I might be super dense this morning, but what!? I've skimmed through the > > > set and I don't think I get it. > > > > > > What does this provide that ftrace/kprobes don't already allow? > > > > You're kidding, right? > > It's a valid question. From the description, it main addition that would > be hard to do with ftrace or probes is catching where an error code is > returned. A secondary addition would be catching all historical state and > not just state since the tracing started. Catching all historical state is pretty important in the case of memory allocation accounting, don't you think? Also, ftrace can drop events. Not really ideal if under system load your memory accounting numbers start to drift. > It's also unclear *who* would enable this. It looks like it would mostly > have value during the development stage of an embedded platform to track > kernel memory usage on a per-application basis in an environment where it > may be difficult to setup tracing and tracking. Would it ever be enabled > in production? Would a distribution ever enable this? If it's enabled, any > overhead cannot be disabled/enabled at run or boot time so anyone enabling > this would carry the cost without never necessarily consuming the data. The whole point of this is to be cheap enough to enable in production - especially the latency tracing infrastructure. There's a lot of value to always-on system visibility infrastructure, so that when a live machine starts to do something wonky the data is already there. What we've built here this is _far_ cheaper than anything that could be done with ftrace. > It might be an ease-of-use thing. Gathering the information from traces > is tricky and would need combining multiple different elements and that > is development effort but not impossible. > > Whatever asking for an explanation as to why equivalent functionality > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. I think perhaps some of the expectation should be on the "ftrace for everything!" people to explain a: how their alternative could be even built and b: how it would compare in terms of performance and ease of use. Look, I've been a tracing user for many years, and it has its uses, but some of the claims I've been hearing from tracing/bpf people when any alternative tooling is proposed sound like vaporware and bullshitting.
On Wed, Aug 31, 2022 at 8:28 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Aug 31, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote: > > > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote: > > > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > > > > > > =========================== > > > > > > Code tagging framework > > > > > > =========================== > > > > > > Code tag is a structure identifying a specific location in the source code > > > > > > which is generated at compile time and can be embedded in an application- > > > > > > specific structure. Several applications of code tagging are included in > > > > > > this RFC, such as memory allocation tracking, dynamic fault injection, > > > > > > latency tracking and improved error code reporting. > > > > > > Basically, it takes the old trick of "define a special elf section for > > > > > > objects of a given type so that we can iterate over them at runtime" and > > > > > > creates a proper library for it. > > > > > > > > > > I might be super dense this morning, but what!? I've skimmed through the > > > > > set and I don't think I get it. > > > > > > > > > > What does this provide that ftrace/kprobes don't already allow? > > > > > > > > You're kidding, right? > > > > > > It's a valid question. From the description, it main addition that would > > > be hard to do with ftrace or probes is catching where an error code is > > > returned. A secondary addition would be catching all historical state and > > > not just state since the tracing started. > > > > > > It's also unclear *who* would enable this. It looks like it would mostly > > > have value during the development stage of an embedded platform to track > > > kernel memory usage on a per-application basis in an environment where it > > > may be difficult to setup tracing and tracking. Would it ever be enabled > > > in production? Would a distribution ever enable this? If it's enabled, any > > > overhead cannot be disabled/enabled at run or boot time so anyone enabling > > > this would carry the cost without never necessarily consuming the data. > > Thank you for the question. > For memory tracking my intent is to have a mechanism that can be enabled in > the field testing (pre-production testing on a large population of > internal users). > The issue that we are often facing is when some memory leaks are happening > in the field but very hard to reproduce locally. We get a bugreport > from the user > which indicates it but often has not enough information to track it. Note that > quite often these leaks/issues happen in the drivers, so even simply finding out > where they came from is a big help. > The way I envision this mechanism to be used is to enable the basic memory > tracking in the field tests and have a user space process collecting > the allocation > statistics periodically (say once an hour). Once it detects some counter growing > infinitely or atypically (the definition of this is left to the user > space) it can enable > context capturing only for that specific location, still keeping the > overhead to the > minimum but getting more information about potential issues. Collected stats and > contexts are then attached to the bugreport and we get more visibility > into the issue > when we receive it. > The goal is to provide a mechanism with low enough overhead that it > can be enabled > all the time during these field tests without affecting the device's > performance profiles. > Tracing is very cheap when it's disabled but having it enabled all the > time would > introduce higher overhead than the counter manipulations. > My apologies, I should have clarified all this in this cover letter > from the beginning. > > As for other applications, maybe I'm not such an advanced user of > tracing but I think only > the latency tracking application might be done with tracing, assuming > we have all the > right tracepoints but I don't see how we would use tracing for fault > injections and > descriptive error codes. Again, I might be mistaken. Sorry about the formatting of my reply. Forgot to reconfigure the editor on the new machine. > > Thanks, > Suren. > > > > > > > It might be an ease-of-use thing. Gathering the information from traces > > > is tricky and would need combining multiple different elements and that > > > is development effort but not impossible. > > > > > > Whatever asking for an explanation as to why equivalent functionality > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > > > Fully agreed and this is especially true for a change this size > > 77 files changed, 3406 insertions(+), 703 deletions(-) > > > > -- > > Michal Hocko > > SUSE Labs
On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > Whatever asking for an explanation as to why equivalent functionality > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > Fully agreed and this is especially true for a change this size > 77 files changed, 3406 insertions(+), 703 deletions(-) In the case of memory allocation accounting, you flat cannot do this with ftrace - you could maybe do a janky version that isn't fully accurate, much slower, more complicated for the developer to understand and debug and more complicated for the end user. But please, I invite anyone who's actually been doing this with ftrace to demonstrate otherwise. Ftrace just isn't the right tool for the job here - we're talking about adding per callsite accounting to some of the fastest fast paths in the kernel. And the size of the changes for memory allocation accounting are much more reasonable: 33 files changed, 623 insertions(+), 99 deletions(-) The code tagging library should exist anyways, it's been open coded half a dozen times in the kernel already. And once we've got that, the time stats code is _also_ far simpler than doing it with ftrace would be. If anyone here has successfully debugged latency issues with ftrace, I'd really like to hear it. Again, for debugging latency issues you want something that can always be on, and that's not cheap with ftrace - and never mind the hassle of correlating start and end wait trace events, builting up histograms, etc. - that's all handled here. Cheap, simple, easy to use. What more could you want?
On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > > Whatever asking for an explanation as to why equivalent functionality > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > > > Fully agreed and this is especially true for a change this size > > 77 files changed, 3406 insertions(+), 703 deletions(-) > > In the case of memory allocation accounting, you flat cannot do this with ftrace > - you could maybe do a janky version that isn't fully accurate, much slower, > more complicated for the developer to understand and debug and more complicated > for the end user. > > But please, I invite anyone who's actually been doing this with ftrace to > demonstrate otherwise. > > Ftrace just isn't the right tool for the job here - we're talking about adding > per callsite accounting to some of the fastest fast paths in the kernel. > > And the size of the changes for memory allocation accounting are much more > reasonable: > 33 files changed, 623 insertions(+), 99 deletions(-) > > The code tagging library should exist anyways, it's been open coded half a dozen > times in the kernel already. > > And once we've got that, the time stats code is _also_ far simpler than doing it > with ftrace would be. If anyone here has successfully debugged latency issues > with ftrace, I'd really like to hear it. Again, for debugging latency issues you > want something that can always be on, and that's not cheap with ftrace - and > never mind the hassle of correlating start and end wait trace events, builting > up histograms, etc. - that's all handled here. > > Cheap, simple, easy to use. What more could you want? > This is very interesting work! Do you have any data about the overhead this introduces, especially in a production environment? I am especially interested in memory allocations tracking and detecting leaks. (Sorry if you already posted this kind of data somewhere that I missed)
On Wed, Aug 31, 2022 at 1:56 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > > > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > > > Whatever asking for an explanation as to why equivalent functionality > > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > > > > > Fully agreed and this is especially true for a change this size > > > 77 files changed, 3406 insertions(+), 703 deletions(-) > > > > In the case of memory allocation accounting, you flat cannot do this with ftrace > > - you could maybe do a janky version that isn't fully accurate, much slower, > > more complicated for the developer to understand and debug and more complicated > > for the end user. > > > > But please, I invite anyone who's actually been doing this with ftrace to > > demonstrate otherwise. > > > > Ftrace just isn't the right tool for the job here - we're talking about adding > > per callsite accounting to some of the fastest fast paths in the kernel. > > > > And the size of the changes for memory allocation accounting are much more > > reasonable: > > 33 files changed, 623 insertions(+), 99 deletions(-) > > > > The code tagging library should exist anyways, it's been open coded half a dozen > > times in the kernel already. > > > > And once we've got that, the time stats code is _also_ far simpler than doing it > > with ftrace would be. If anyone here has successfully debugged latency issues > > with ftrace, I'd really like to hear it. Again, for debugging latency issues you > > want something that can always be on, and that's not cheap with ftrace - and > > never mind the hassle of correlating start and end wait trace events, builting > > up histograms, etc. - that's all handled here. > > > > Cheap, simple, easy to use. What more could you want? > > > > This is very interesting work! Do you have any data about the overhead > this introduces, especially in a production environment? I am > especially interested in memory allocations tracking and detecting > leaks. I had the numbers for my previous implementation, before we started using the lazy percpu counters but that would not apply to the new implementation. I'll rerun the measurements and will post the exact numbers in a day or so. > (Sorry if you already posted this kind of data somewhere that I missed)
On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > =========================== > Code tagging framework > =========================== > Code tag is a structure identifying a specific location in the source code > which is generated at compile time and can be embedded in an application- > specific structure. Several applications of code tagging are included in > this RFC, such as memory allocation tracking, dynamic fault injection, > latency tracking and improved error code reporting. > Basically, it takes the old trick of "define a special elf section for > objects of a given type so that we can iterate over them at runtime" and > creates a proper library for it. > > =========================== > Memory allocation tracking > =========================== > The goal for using codetags for memory allocation tracking is to minimize > performance and memory overhead. By recording only the call count and > allocation size, the required operations are kept at the minimum while > collecting statistics for every allocation in the codebase. With that > information, if users are interested in mode detailed context for a > specific allocation, they can enable more in-depth context tracking, > which includes capturing the pid, tgid, task name, allocation size, > timestamp and call stack for every allocation at the specified code > location. > Memory allocation tracking is implemented in two parts: > > part1: instruments page and slab allocators to record call count and total > memory allocated at every allocation in the source code. Every time an > allocation is performed by an instrumented allocator, the codetag at that > location increments its call and size counters. Every time the memory is > freed these counters are decremented. To decrement the counters upon free, > allocated object needs a reference to its codetag. Page allocators use > page_ext to record this reference while slab allocators use memcg_data of > the slab page. > The data is exposed to the user space via a read-only debugfs file called > alloc_tags. Hi Suren, I just posted a patch [1] and reading through your changelog and seeing your PoC, I think we have some kind of overlap. My patchset aims to give you the stacktrace <-> relationship information and it is achieved by a little amount of extra code mostly in page_owner.c/ and lib/stackdepot. Of course, your works seems to be more complete wrt. the information you get. I CCed you in case you want to have a look [1] https://lkml.org/lkml/2022/9/1/36 Thanks
On Wed, Aug 31, 2022 at 9:52 PM Oscar Salvador <osalvador@suse.de> wrote: > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > > =========================== > > Code tagging framework > > =========================== > > Code tag is a structure identifying a specific location in the source code > > which is generated at compile time and can be embedded in an application- > > specific structure. Several applications of code tagging are included in > > this RFC, such as memory allocation tracking, dynamic fault injection, > > latency tracking and improved error code reporting. > > Basically, it takes the old trick of "define a special elf section for > > objects of a given type so that we can iterate over them at runtime" and > > creates a proper library for it. > > > > =========================== > > Memory allocation tracking > > =========================== > > The goal for using codetags for memory allocation tracking is to minimize > > performance and memory overhead. By recording only the call count and > > allocation size, the required operations are kept at the minimum while > > collecting statistics for every allocation in the codebase. With that > > information, if users are interested in mode detailed context for a > > specific allocation, they can enable more in-depth context tracking, > > which includes capturing the pid, tgid, task name, allocation size, > > timestamp and call stack for every allocation at the specified code > > location. > > Memory allocation tracking is implemented in two parts: > > > > part1: instruments page and slab allocators to record call count and total > > memory allocated at every allocation in the source code. Every time an > > allocation is performed by an instrumented allocator, the codetag at that > > location increments its call and size counters. Every time the memory is > > freed these counters are decremented. To decrement the counters upon free, > > allocated object needs a reference to its codetag. Page allocators use > > page_ext to record this reference while slab allocators use memcg_data of > > the slab page. > > The data is exposed to the user space via a read-only debugfs file called > > alloc_tags. > > Hi Suren, > > I just posted a patch [1] and reading through your changelog and seeing your PoC, > I think we have some kind of overlap. > My patchset aims to give you the stacktrace <-> relationship information and it is > achieved by a little amount of extra code mostly in page_owner.c/ and lib/stackdepot. > > Of course, your works seems to be more complete wrt. the information you get. > > I CCed you in case you want to have a look > > [1] https://lkml.org/lkml/2022/9/1/36 Hi Oscar, Thanks for the note. I'll take a look most likely on Friday and will follow up with you. Thanks, Suren. > > Thanks > > > -- > Oscar Salvador > SUSE Labs
On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote: > It's also unclear *who* would enable this. It looks like it would mostly > have value during the development stage of an embedded platform to track > kernel memory usage on a per-application basis in an environment where it > may be difficult to setup tracing and tracking. Would it ever be enabled > in production? Afaict this is developer only; it is all unconditional code. > Would a distribution ever enable this? I would sincerely hope not. Because: > If it's enabled, any overhead cannot be disabled/enabled at run or > boot time so anyone enabling this would carry the cost without never > necessarily consuming the data. this.
On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote: > Also, ftrace can drop events. Not really ideal if under system load your memory > accounting numbers start to drift. You could attach custom handlers to tracepoints. If you were to replace these unconditional code hooks of yours with tracepoints then you could conditionally (say at boot) register custom handlers that do the accounting you want. Nobody is mandating you use the ftrace ringbuffer to consume tracepoints. Many people these days attach eBPF scripts to them and do whatever they want.
On Wed 31-08-22 15:01:54, Kent Overstreet wrote: > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > > Whatever asking for an explanation as to why equivalent functionality > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > > > Fully agreed and this is especially true for a change this size > > 77 files changed, 3406 insertions(+), 703 deletions(-) > > In the case of memory allocation accounting, you flat cannot do this with ftrace > - you could maybe do a janky version that isn't fully accurate, much slower, > more complicated for the developer to understand and debug and more complicated > for the end user. > > But please, I invite anyone who's actually been doing this with ftrace to > demonstrate otherwise. > > Ftrace just isn't the right tool for the job here - we're talking about adding > per callsite accounting to some of the fastest fast paths in the kernel. > > And the size of the changes for memory allocation accounting are much more > reasonable: > 33 files changed, 623 insertions(+), 99 deletions(-) > > The code tagging library should exist anyways, it's been open coded half a dozen > times in the kernel already. > > And once we've got that, the time stats code is _also_ far simpler than doing it > with ftrace would be. If anyone here has successfully debugged latency issues > with ftrace, I'd really like to hear it. Again, for debugging latency issues you > want something that can always be on, and that's not cheap with ftrace - and > never mind the hassle of correlating start and end wait trace events, builting > up histograms, etc. - that's all handled here. > > Cheap, simple, easy to use. What more could you want? A big ad on a banner. But more seriously. This patchset is _huge_ and touching a lot of different areas. It will be not only hard to review but even harder to maintain longterm. So it is completely reasonable to ask for potential alternatives with a smaller code footprint. I am pretty sure you are aware of that workflow. So I find Peter's question completely appropriate while your response to that not so much! Maybe ftrace is not the right tool for the intented job. Maybe there are other ways and it would be really great to show that those have been evaluated and they are not suitable for a), b) and c) reasons. E.g. Oscar has been working on extending page_ext to track number of allocations for specific calltrace[1]. Is this 1:1 replacement? No! But it can help in environments where page_ext can be enabled and it is completely non-intrusive to the MM code. If the page_ext overhead is not desirable/acceptable then I am sure there are other options. E.g. kprobes/LivePatching framework can hook into functions and alter their behavior. So why not use that for data collection? Has this been evaluated at all? And please note that I am not claiming the presented work is approaching the problem from a wrong direction. It might very well solve multiple problems in a single go _but_ the long term code maintenance burden really has to to be carefully evaluated and if we can achieve a reasonable subset of the functionality with an existing infrastructure then I would be inclined to sacrifice some portions with a considerably smaller code footprint. [1] http://lkml.kernel.org/r/20220901044249.4624-1-osalvador@suse.de
On 9/1/22 09:05, Peter Zijlstra wrote: >> Also, ftrace can drop events. Not really ideal if under system load your memory >> accounting numbers start to drift. > You could attach custom handlers to tracepoints. If you were to replace > these unconditional code hooks of yours with tracepoints then you could > conditionally (say at boot) register custom handlers that do the > accounting you want. That is strategy in RV (kernel/trace/rv/). It is in C, but I am also adding support for monitors in bpf. The osnoise/timerlat tracers work this way too, and they are enabled on Fedora/Red Hat/SUSE... production. They will also be enabled in Ubuntu and Debian (the interwebs say). The overhead of attaching code to tracepoints (or any "attachable thing") and processing data in kernel is often lower than consuming it in user-space. Obviously, when it is possible, e.g., when you respect locking rules, etc. This paper (the basis for RV) shows a little comparison: https://bristot.me/wp-content/uploads/2019/09/paper.pdf By doing so, we also avoid problems of losing events... and you can also generate other events from your attached code. (It is also way easier to convince a maintainer to add a tracepoints or a trace events than to add arbitrary code... ;-) -- Daniel
On Thu, Sep 01, 2022 at 09:05:36AM +0200, Peter Zijlstra wrote: > On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote: > > > Also, ftrace can drop events. Not really ideal if under system load your memory > > accounting numbers start to drift. > > You could attach custom handlers to tracepoints. If you were to replace > these unconditional code hooks of yours with tracepoints then you could > conditionally (say at boot) register custom handlers that do the > accounting you want. > > Nobody is mandating you use the ftrace ringbuffer to consume tracepoints. > Many people these days attach eBPF scripts to them and do whatever they > want. Look at kernel/trace/blktrace.c for a fine in-kernel !BFP example of this.
On 31.08.22 21:01, Kent Overstreet wrote: > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: >> On Wed 31-08-22 11:19:48, Mel Gorman wrote: >>> Whatever asking for an explanation as to why equivalent functionality >>> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. >> >> Fully agreed and this is especially true for a change this size >> 77 files changed, 3406 insertions(+), 703 deletions(-) > > In the case of memory allocation accounting, you flat cannot do this with ftrace > - you could maybe do a janky version that isn't fully accurate, much slower, > more complicated for the developer to understand and debug and more complicated > for the end user. > > But please, I invite anyone who's actually been doing this with ftrace to > demonstrate otherwise. > > Ftrace just isn't the right tool for the job here - we're talking about adding > per callsite accounting to some of the fastest fast paths in the kernel. > > And the size of the changes for memory allocation accounting are much more > reasonable: > 33 files changed, 623 insertions(+), 99 deletions(-) > > The code tagging library should exist anyways, it's been open coded half a dozen > times in the kernel already. Hi Kent, independent of the other discussions, if it's open coded already, does it make sense to factor that already-open-coded part out independently of the remainder of the full series here? [I didn't immediately spot if this series also attempts already to replace that open-coded part]
On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote: > On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote: > > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote: > > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote: > > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: > > > > > =========================== > > > > > Code tagging framework > > > > > =========================== > > > > > Code tag is a structure identifying a specific location in the source code > > > > > which is generated at compile time and can be embedded in an application- > > > > > specific structure. Several applications of code tagging are included in > > > > > this RFC, such as memory allocation tracking, dynamic fault injection, > > > > > latency tracking and improved error code reporting. > > > > > Basically, it takes the old trick of "define a special elf section for > > > > > objects of a given type so that we can iterate over them at runtime" and > > > > > creates a proper library for it. > > > > > > > > I might be super dense this morning, but what!? I've skimmed through the > > > > set and I don't think I get it. > > > > > > > > What does this provide that ftrace/kprobes don't already allow? > > > > > > You're kidding, right? > > > > It's a valid question. From the description, it main addition that would > > be hard to do with ftrace or probes is catching where an error code is > > returned. A secondary addition would be catching all historical state and > > not just state since the tracing started. > > Catching all historical state is pretty important in the case of memory > allocation accounting, don't you think? > Not always. If the intent is to catch a memory leak that gets worse over time, early boot should be sufficient. Sure, there might be drivers that leak memory allocated at init but if it's not a growing leak, it doesn't matter. > Also, ftrace can drop events. Not really ideal if under system load your memory > accounting numbers start to drift. > As pointed out elsewhere, attaching to the tracepoint and recording relevant state is an option other than trying to parse a raw ftrace feed. For memory leaks, there are already tracepoints for page allocation and free that could be used to track allocations that are not freed at a given point in time. There is also the kernel memory leak detector although I never had reason to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html) and it sounds like it would be expensive. > > It's also unclear *who* would enable this. It looks like it would mostly > > have value during the development stage of an embedded platform to track > > kernel memory usage on a per-application basis in an environment where it > > may be difficult to setup tracing and tracking. Would it ever be enabled > > in production? Would a distribution ever enable this? If it's enabled, any > > overhead cannot be disabled/enabled at run or boot time so anyone enabling > > this would carry the cost without never necessarily consuming the data. > > The whole point of this is to be cheap enough to enable in production - > especially the latency tracing infrastructure. There's a lot of value to > always-on system visibility infrastructure, so that when a live machine starts > to do something wonky the data is already there. > Sure, there is value but nothing stops the tracepoints being attached as a boot-time service where interested. For latencies, there is already bpf examples for tracing individual function latency over time e.g. https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although I haven't used it recently. Live parsing of ftrace is possible, albeit expensive. https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl tracks counts of high-order allocations and dumps a report on interrupt as an example of live parsing ftrace and only recording interesting state. It's not tracking state you are interested in but it demonstrates it is possible to rely on ftrace alone and monitor from userspace. It's bit-rotted but can be fixed with diff --git a/monitors/watch-highorder.pl b/monitors/watch-highorder.pl index 8c80ae79e556..fd0d477427df 100755 --- a/monitors/watch-highorder.pl +++ b/monitors/watch-highorder.pl @@ -52,7 +52,7 @@ my $regex_pagealloc; # Static regex used. Specified like this for readability and for use with /o # (process_pid) (cpus ) ( time ) (tpoint ) (details) -my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9. ]*):\s*([a-zA-Z_]*):\s*(.*)'; my $regex_statname = '[-0-9]*\s\((.*)\).*'; my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; @@ -73,6 +73,7 @@ sub generate_traceevent_regex { $regex =~ s/%p/\([0-9a-f]*\)/g; $regex =~ s/%d/\([-0-9]*\)/g; $regex =~ s/%lu/\([0-9]*\)/g; + $regex =~ s/%lx/\([0-9a-zA-Z]*\)/g; $regex =~ s/%s/\([A-Z_|]*\)/g; $regex =~ s/\(REC->gfp_flags\).*/REC->gfp_flags/; $regex =~ s/\",.*//; Example output 3 instances order=2 normal gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ZERO => trace_event_raw_event_mm_page_alloc+0x7d/0xc0 <ffffffffb1caeccd> => __alloc_pages+0x188/0x250 <ffffffffb1cee8a8> => kmalloc_large_node+0x3f/0x80 <ffffffffb1d1cd3f> => __kmalloc_node+0x321/0x420 <ffffffffb1d22351> => kvmalloc_node+0x46/0xe0 <ffffffffb1ca4906> => ttm_sg_tt_init+0x88/0xb0 [ttm] <ffffffffc03a02c8> => amdgpu_ttm_tt_create+0x4f/0x80 [amdgpu] <ffffffffc04cff0f> => ttm_tt_create+0x59/0x90 [ttm] <ffffffffc03a03b9> => ttm_bo_handle_move_mem+0x7e/0x1c0 [ttm] <ffffffffc03a0d9e> => ttm_bo_validate+0xc5/0x140 [ttm] <ffffffffc03a2095> => ttm_bo_init_reserved+0x17b/0x200 [ttm] <ffffffffc03a228b> => amdgpu_bo_create+0x1a3/0x470 [amdgpu] <ffffffffc04d36c3> => amdgpu_bo_create_user+0x34/0x60 [amdgpu] <ffffffffc04d39c4> => amdgpu_gem_create_ioctl+0x131/0x3a0 [amdgpu] <ffffffffc04d94f1> => drm_ioctl_kernel+0xb5/0x140 <ffffffffb21652c5> => drm_ioctl+0x224/0x3e0 <ffffffffb2165574> => amdgpu_drm_ioctl+0x49/0x80 [amdgpu] <ffffffffc04bd2d9> => __x64_sys_ioctl+0x8a/0xc0 <ffffffffb1d7c2da> => do_syscall_64+0x5c/0x90 <ffffffffb253016c> => entry_SYSCALL_64_after_hwframe+0x63/0xcd <ffffffffb260009b> 3 instances order=1 normal gfp_flags=GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT => trace_event_raw_event_mm_page_alloc+0x7d/0xc0 <ffffffffb1caeccd> => __alloc_pages+0x188/0x250 <ffffffffb1cee8a8> => __folio_alloc+0x17/0x50 <ffffffffb1cef1a7> => vma_alloc_folio+0x8f/0x350 <ffffffffb1d11e4f> => __handle_mm_fault+0xa1e/0x1120 <ffffffffb1cc80ee> => handle_mm_fault+0xb2/0x280 <ffffffffb1cc88a2> => do_user_addr_fault+0x1b9/0x690 <ffffffffb1a89949> => exc_page_fault+0x67/0x150 <ffffffffb2534627> => asm_exc_page_fault+0x22/0x30 <ffffffffb2600b62> It's not tracking leaks because that is not what I was intrested in at the time but could using the same method and recording PFNs that were allocated, their call site but not freed. These days, this approach may be a bit unexpected but it was originally written 13 years ago. It could have been done with systemtap back then but my recollection was that it was difficult to keep systemtap working with rc kernels. > What we've built here this is _far_ cheaper than anything that could be done > with ftrace. > > > It might be an ease-of-use thing. Gathering the information from traces > > is tricky and would need combining multiple different elements and that > > is development effort but not impossible. > > > > Whatever asking for an explanation as to why equivalent functionality > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > I think perhaps some of the expectation should be on the "ftrace for > everything!" people to explain a: how their alternative could be even built and > b: how it would compare in terms of performance and ease of use. > The ease of use is a criticism as there is effort required to develop the state tracking of in-kernel event be it from live parsing ftrace, attaching to tracepoints with systemtap/bpf/whatever and the like. The main disadvantage with an in-kernel implementation is three-fold. First, it doesn't work with older kernels without backports. Second, if something slightly different it needed then it's a kernel rebuild. Third, if the option is not enabled in the deployed kernel config then you are relying on the end user being willing to deploy a custom kernel. The initial investment in doing memory leak tracking or latency tracking by attaching to tracepoints is significant but it works with older kernels up to a point and is less sensitive to the kernel config options selected as features like ftrace are often selected.
On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote: > On 31.08.22 21:01, Kent Overstreet wrote: > > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > >> On Wed 31-08-22 11:19:48, Mel Gorman wrote: > >>> Whatever asking for an explanation as to why equivalent functionality > >>> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > >> > >> Fully agreed and this is especially true for a change this size > >> 77 files changed, 3406 insertions(+), 703 deletions(-) > > > > In the case of memory allocation accounting, you flat cannot do this with ftrace > > - you could maybe do a janky version that isn't fully accurate, much slower, > > more complicated for the developer to understand and debug and more complicated > > for the end user. > > > > But please, I invite anyone who's actually been doing this with ftrace to > > demonstrate otherwise. > > > > Ftrace just isn't the right tool for the job here - we're talking about adding > > per callsite accounting to some of the fastest fast paths in the kernel. > > > > And the size of the changes for memory allocation accounting are much more > > reasonable: > > 33 files changed, 623 insertions(+), 99 deletions(-) > > > > The code tagging library should exist anyways, it's been open coded half a dozen > > times in the kernel already. > > Hi Kent, > > independent of the other discussions, if it's open coded already, does > it make sense to factor that already-open-coded part out independently > of the remainder of the full series here? It's discussed in the cover letter, that is exactly how the patch series is structured. > [I didn't immediately spot if this series also attempts already to > replace that open-coded part] Uh huh. Honestly, some days it feels like lkml is just as bad as slashdot, with people wanting to get in their two cents without actually reading...
On Thu, Sep 01, 2022 at 09:00:17AM +0200, Peter Zijlstra wrote: > On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote: > > > It's also unclear *who* would enable this. It looks like it would mostly > > have value during the development stage of an embedded platform to track > > kernel memory usage on a per-application basis in an environment where it > > may be difficult to setup tracing and tracking. Would it ever be enabled > > in production? > > Afaict this is developer only; it is all unconditional code. > > > Would a distribution ever enable this? > > I would sincerely hope not. Because: > > > If it's enabled, any overhead cannot be disabled/enabled at run or > > boot time so anyone enabling this would carry the cost without never > > necessarily consuming the data. > > this. We could make it a boot parameter, with the alternatives infrastructure - with a bit of refactoring there'd be a single function call to nop out, and then we could also drop the elf sections as well, so that when built in but disabled the overhead would be practically nil.
On 01.09.22 16:23, Kent Overstreet wrote: > On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote: >> On 31.08.22 21:01, Kent Overstreet wrote: >>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: >>>> On Wed 31-08-22 11:19:48, Mel Gorman wrote: >>>>> Whatever asking for an explanation as to why equivalent functionality >>>>> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. >>>> >>>> Fully agreed and this is especially true for a change this size >>>> 77 files changed, 3406 insertions(+), 703 deletions(-) >>> >>> In the case of memory allocation accounting, you flat cannot do this with ftrace >>> - you could maybe do a janky version that isn't fully accurate, much slower, >>> more complicated for the developer to understand and debug and more complicated >>> for the end user. >>> >>> But please, I invite anyone who's actually been doing this with ftrace to >>> demonstrate otherwise. >>> >>> Ftrace just isn't the right tool for the job here - we're talking about adding >>> per callsite accounting to some of the fastest fast paths in the kernel. >>> >>> And the size of the changes for memory allocation accounting are much more >>> reasonable: >>> 33 files changed, 623 insertions(+), 99 deletions(-) >>> >>> The code tagging library should exist anyways, it's been open coded half a dozen >>> times in the kernel already. >> >> Hi Kent, >> >> independent of the other discussions, if it's open coded already, does >> it make sense to factor that already-open-coded part out independently >> of the remainder of the full series here? > > It's discussed in the cover letter, that is exactly how the patch series is > structured. Skimming over the patches (that I was CCed on) and skimming over the cover letter, I got the impression that everything after patch 7 is introducing something new instead of refactoring something out. > >> [I didn't immediately spot if this series also attempts already to >> replace that open-coded part] > > Uh huh. > > Honestly, some days it feels like lkml is just as bad as slashdot, with people > wanting to get in their two cents without actually reading... ... and of course you had to reply like that. I should just have learned from my last upstream experience with you and kept you on my spam list. Thanks, bye
On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 31-08-22 15:01:54, Kent Overstreet wrote: > > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > > > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > > > Whatever asking for an explanation as to why equivalent functionality > > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > > > > > Fully agreed and this is especially true for a change this size > > > 77 files changed, 3406 insertions(+), 703 deletions(-) > > > > In the case of memory allocation accounting, you flat cannot do this with ftrace > > - you could maybe do a janky version that isn't fully accurate, much slower, > > more complicated for the developer to understand and debug and more complicated > > for the end user. > > > > But please, I invite anyone who's actually been doing this with ftrace to > > demonstrate otherwise. > > > > Ftrace just isn't the right tool for the job here - we're talking about adding > > per callsite accounting to some of the fastest fast paths in the kernel. > > > > And the size of the changes for memory allocation accounting are much more > > reasonable: > > 33 files changed, 623 insertions(+), 99 deletions(-) > > > > The code tagging library should exist anyways, it's been open coded half a dozen > > times in the kernel already. > > > > And once we've got that, the time stats code is _also_ far simpler than doing it > > with ftrace would be. If anyone here has successfully debugged latency issues > > with ftrace, I'd really like to hear it. Again, for debugging latency issues you > > want something that can always be on, and that's not cheap with ftrace - and > > never mind the hassle of correlating start and end wait trace events, builting > > up histograms, etc. - that's all handled here. > > > > Cheap, simple, easy to use. What more could you want? > > A big ad on a banner. But more seriously. > > This patchset is _huge_ and touching a lot of different areas. It will > be not only hard to review but even harder to maintain longterm. So > it is completely reasonable to ask for potential alternatives with a > smaller code footprint. I am pretty sure you are aware of that workflow. The patchset is huge because it introduces a reusable part (the first 6 patches introducing code tagging) and 6 different applications in very different areas of the kernel. We wanted to present all of them in the RFC to show the variety of cases this mechanism can be reused for. If the code tagging is accepted, each application can be posted separately to the appropriate group of people. Hopefully that makes it easier to review. Those first 6 patches are not that big and are quite isolated IMHO: include/linux/codetag.h | 83 ++++++++++ include/linux/lazy-percpu-counter.h | 67 ++++++++ include/linux/module.h | 1 + kernel/module/internal.h | 1 - kernel/module/main.c | 4 + lib/Kconfig | 3 + lib/Kconfig.debug | 4 + lib/Makefile | 3 + lib/codetag.c | 248 ++++++++++++++++++++++++++++ lib/lazy-percpu-counter.c | 141 ++++++++++++++++ lib/string_helpers.c | 3 +- scripts/kallsyms.c | 13 ++ > > So I find Peter's question completely appropriate while your response to > that not so much! Maybe ftrace is not the right tool for the intented > job. Maybe there are other ways and it would be really great to show > that those have been evaluated and they are not suitable for a), b) and > c) reasons. That's fair. For memory tracking I looked into using kmemleak and page_owner which can't match the required functionality at an overhead acceptable for production and pre-production testing environments. traces + BPF I haven't evaluated myself but heard from other members of my team who tried using that in production environment with poor results. I'll try to get more specific information on that. > > E.g. Oscar has been working on extending page_ext to track number of > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But > it can help in environments where page_ext can be enabled and it is > completely non-intrusive to the MM code. Thanks for pointing out this work. I'll need to review and maybe profile it before making any claims. > > If the page_ext overhead is not desirable/acceptable then I am sure > there are other options. E.g. kprobes/LivePatching framework can hook > into functions and alter their behavior. So why not use that for data > collection? Has this been evaluated at all? I'm not sure how I can hook into say alloc_pages() to find out where it was called from without capturing the call stack (which would introduce an overhead at every allocation). Would love to discuss this or other alternatives if they can be done with low enough overhead. Thanks, Suren. > > And please note that I am not claiming the presented work is approaching > the problem from a wrong direction. It might very well solve multiple > problems in a single go _but_ the long term code maintenance burden > really has to to be carefully evaluated and if we can achieve a > reasonable subset of the functionality with an existing infrastructure > then I would be inclined to sacrifice some portions with a considerably > smaller code footprint. > > [1] http://lkml.kernel.org/r/20220901044249.4624-1-osalvador@suse.de > > -- > Michal Hocko > SUSE Labs
On Thu, Sep 1, 2022 at 8:07 AM David Hildenbrand <david@redhat.com> wrote: > > On 01.09.22 16:23, Kent Overstreet wrote: > > On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote: > >> On 31.08.22 21:01, Kent Overstreet wrote: > >>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > >>>> On Wed 31-08-22 11:19:48, Mel Gorman wrote: > >>>>> Whatever asking for an explanation as to why equivalent functionality > >>>>> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > >>>> > >>>> Fully agreed and this is especially true for a change this size > >>>> 77 files changed, 3406 insertions(+), 703 deletions(-) > >>> > >>> In the case of memory allocation accounting, you flat cannot do this with ftrace > >>> - you could maybe do a janky version that isn't fully accurate, much slower, > >>> more complicated for the developer to understand and debug and more complicated > >>> for the end user. > >>> > >>> But please, I invite anyone who's actually been doing this with ftrace to > >>> demonstrate otherwise. > >>> > >>> Ftrace just isn't the right tool for the job here - we're talking about adding > >>> per callsite accounting to some of the fastest fast paths in the kernel. > >>> > >>> And the size of the changes for memory allocation accounting are much more > >>> reasonable: > >>> 33 files changed, 623 insertions(+), 99 deletions(-) > >>> > >>> The code tagging library should exist anyways, it's been open coded half a dozen > >>> times in the kernel already. > >> > >> Hi Kent, > >> > >> independent of the other discussions, if it's open coded already, does > >> it make sense to factor that already-open-coded part out independently > >> of the remainder of the full series here? > > > > It's discussed in the cover letter, that is exactly how the patch series is > > structured. > > Skimming over the patches (that I was CCed on) and skimming over the > cover letter, I got the impression that everything after patch 7 is > introducing something new instead of refactoring something out. Hi David, Yes, you are right, the RFC does incorporate lots of parts which can be considered separately. They are sent together to present the overall scope of the proposal but I do intend to send them separately once we decide if it's worth working on. Thanks, Suren. > > > > >> [I didn't immediately spot if this series also attempts already to > >> replace that open-coded part] > > > > Uh huh. > > > > Honestly, some days it feels like lkml is just as bad as slashdot, with people > > wanting to get in their two cents without actually reading... > > ... and of course you had to reply like that. I should just have learned > from my last upstream experience with you and kept you on my spam list. > > Thanks, bye > > -- > Thanks, > > David / dhildenb >
On Thu, Sep 01, 2022 at 05:07:06PM +0200, David Hildenbrand wrote: > Skimming over the patches (that I was CCed on) and skimming over the > cover letter, I got the impression that everything after patch 7 is > introducing something new instead of refactoring something out. You skimmed over the dynamic debug patch then...
On Thu, Sep 01, 2022 at 12:05:01PM +0100, Mel Gorman wrote: > As pointed out elsewhere, attaching to the tracepoint and recording relevant > state is an option other than trying to parse a raw ftrace feed. For memory > leaks, there are already tracepoints for page allocation and free that could > be used to track allocations that are not freed at a given point in time. Page allocation tracepoints are not sufficient for what we're trying to do here, and a substantial amount of effort in this patchset has gone into just getting the hooking locations right - our memory allocation interfaces are not trivial. That's something people should keep in mind when commenting on the size of this patchset, since that's effort that would have to be spent for /any/ complete solution, be in tracepoint based or no. Additionally, we need to be able to write assertions that verify that our hook locations are correct, that allocations or frees aren't getting double counted or missed - highly necessary given the maze of nested memory allocation interfaces we have (i.e. slab.h), and it's something a tracepoint based implementation would have to account for - otherwise, a tool isn't very useful if you can't trust the numbers it's giving you. And then you have to correlate the allocate and free events, so that you know which allocate callsite to decrement the amount freed from. How would you plan on doing that with tracepoints? > There is also the kernel memory leak detector although I never had reason > to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html) > and it sounds like it would be expensive. Kmemleak is indeed expensive, and in the past I've had issues with it not catching everything (I've noticed the kmemleak annotations growing, so maybe this is less of an issue than it was). And this is a more complete solution (though not something that could strictly replace kmemleak): strict memory leaks aren't the only issue, it's also drivers unexpectedly consuming more memory than expected. I'll bet you a beer that when people have had this awhile, we're going to have a bunch of bugs discovered and fixed along the lines of "oh hey, this driver wasn't supposed to be using this 1 MB of memory, I never noticed that before". > > > It's also unclear *who* would enable this. It looks like it would mostly > > > have value during the development stage of an embedded platform to track > > > kernel memory usage on a per-application basis in an environment where it > > > may be difficult to setup tracing and tracking. Would it ever be enabled > > > in production? Would a distribution ever enable this? If it's enabled, any > > > overhead cannot be disabled/enabled at run or boot time so anyone enabling > > > this would carry the cost without never necessarily consuming the data. > > > > The whole point of this is to be cheap enough to enable in production - > > especially the latency tracing infrastructure. There's a lot of value to > > always-on system visibility infrastructure, so that when a live machine starts > > to do something wonky the data is already there. > > > > Sure, there is value but nothing stops the tracepoints being attached as > a boot-time service where interested. For latencies, there is already > bpf examples for tracing individual function latency over time e.g. > https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although > I haven't used it recently. So this is cool, I'll check it out today. Tracing of /function/ latency is definitely something you'd want tracing/kprobes for - that's way more practical than any code tagging-based approach. And if the output is reliable and useful I could definitely see myself using this, thank you. But for data collection where it makes sense to annotate in the source code where the data collection points are, I see the code-tagging based approach as simpler - it cuts out a whole bunch of indirection. The diffstat on the code tagging time stats patch is 8 files changed, 233 insertions(+), 6 deletions(-) And that includes hooking wait.h - this is really simple, easy stuff. The memory allocation tracking patches are more complicated because we've got a ton of memory allocation interfaces and we're aiming for strict correctness there - because that tool needs strict correctness in order to be useful. > Live parsing of ftrace is possible, albeit expensive. > https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl > tracks counts of high-order allocations and dumps a report on interrupt as > an example of live parsing ftrace and only recording interesting state. It's > not tracking state you are interested in but it demonstrates it is possible > to rely on ftrace alone and monitor from userspace. It's bit-rotted but > can be fixed with Yeah, if this is as far as people have gotten with ftrace on memory allocations than I don't think tracing is credible here, sorry. > The ease of use is a criticism as there is effort required to develop > the state tracking of in-kernel event be it from live parsing ftrace, > attaching to tracepoints with systemtap/bpf/whatever and the like. The > main disadvantage with an in-kernel implementation is three-fold. First, > it doesn't work with older kernels without backports. Second, if something > slightly different it needed then it's a kernel rebuild. Third, if the > option is not enabled in the deployed kernel config then you are relying > on the end user being willing to deploy a custom kernel. The initial > investment in doing memory leak tracking or latency tracking by attaching > to tracepoints is significant but it works with older kernels up to a point > and is less sensitive to the kernel config options selected as features > like ftrace are often selected. The next version of this patch set is going to use the alternatives mechanism to add a boot parameter. I'm not interested in backporting to older kernels - eesh. People on old enterprise kernels don't always get all the new shiny things :)
On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote: > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > So I find Peter's question completely appropriate while your response to > > that not so much! Maybe ftrace is not the right tool for the intented > > job. Maybe there are other ways and it would be really great to show > > that those have been evaluated and they are not suitable for a), b) and > > c) reasons. > > That's fair. > For memory tracking I looked into using kmemleak and page_owner which > can't match the required functionality at an overhead acceptable for > production and pre-production testing environments. Being more specific would be really helpful. Especially when your cover letter suggests that you rely on page_owner/memcg metadata as well to match allocation and their freeing parts. > traces + BPF I > haven't evaluated myself but heard from other members of my team who > tried using that in production environment with poor results. I'll try > to get more specific information on that. That would be helpful as well. > > E.g. Oscar has been working on extending page_ext to track number of > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But > > it can help in environments where page_ext can be enabled and it is > > completely non-intrusive to the MM code. > > Thanks for pointing out this work. I'll need to review and maybe > profile it before making any claims. > > > > > If the page_ext overhead is not desirable/acceptable then I am sure > > there are other options. E.g. kprobes/LivePatching framework can hook > > into functions and alter their behavior. So why not use that for data > > collection? Has this been evaluated at all? > > I'm not sure how I can hook into say alloc_pages() to find out where > it was called from without capturing the call stack (which would > introduce an overhead at every allocation). Would love to discuss this > or other alternatives if they can be done with low enough overhead. Yes, tracking back the call trace would be really needed. The question is whether this is really prohibitively expensive. How much overhead are we talking about? There is no free lunch here, really. You either have the overhead during runtime when the feature is used or on the source code level for all the future development (with a maze of macros and wrappers). Thanks!
On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote: > > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > So I find Peter's question completely appropriate while your response to > > > that not so much! Maybe ftrace is not the right tool for the intented > > > job. Maybe there are other ways and it would be really great to show > > > that those have been evaluated and they are not suitable for a), b) and > > > c) reasons. > > > > That's fair. > > For memory tracking I looked into using kmemleak and page_owner which > > can't match the required functionality at an overhead acceptable for > > production and pre-production testing environments. > > Being more specific would be really helpful. Especially when your cover > letter suggests that you rely on page_owner/memcg metadata as well to > match allocation and their freeing parts. kmemleak is known to be slow and it's even documented [1], so I hope I can skip that part. For page_owner to provide the comparable information we would have to capture the call stacks for all page allocations unlike our proposal which allows to do that selectively for specific call sites. I'll post the overhead numbers of call stack capturing once I'm finished with profiling the latest code, hopefully sometime tomorrow, in the worst case after the long weekend. > > > traces + BPF I > > haven't evaluated myself but heard from other members of my team who > > tried using that in production environment with poor results. I'll try > > to get more specific information on that. > > That would be helpful as well. Ack. > > > > E.g. Oscar has been working on extending page_ext to track number of > > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But > > > it can help in environments where page_ext can be enabled and it is > > > completely non-intrusive to the MM code. > > > > Thanks for pointing out this work. I'll need to review and maybe > > profile it before making any claims. > > > > > > > > If the page_ext overhead is not desirable/acceptable then I am sure > > > there are other options. E.g. kprobes/LivePatching framework can hook > > > into functions and alter their behavior. So why not use that for data > > > collection? Has this been evaluated at all? > > > > I'm not sure how I can hook into say alloc_pages() to find out where > > it was called from without capturing the call stack (which would > > introduce an overhead at every allocation). Would love to discuss this > > or other alternatives if they can be done with low enough overhead. > > Yes, tracking back the call trace would be really needed. The question > is whether this is really prohibitively expensive. How much overhead are > we talking about? There is no free lunch here, really. You either have > the overhead during runtime when the feature is used or on the source > code level for all the future development (with a maze of macros and > wrappers). Will post the overhead numbers soon. What I hear loud and clear is that we need a kernel command-line kill switch that mitigates the overhead for having this feature. That seems to be the main concern. Thanks, Suren. [1] https://docs.kernel.org/dev-tools/kmemleak.html#limitations-and-drawbacks > > Thanks! > -- > Michal Hocko > SUSE Labs
On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote: > kmemleak is known to be slow and it's even documented [1], so I hope I > can skip that part. For page_owner to provide the comparable > information we would have to capture the call stacks for all page > allocations unlike our proposal which allows to do that selectively > for specific call sites. I'll post the overhead numbers of call stack > capturing once I'm finished with profiling the latest code, hopefully > sometime tomorrow, in the worst case after the long weekend. To expand on this further: we're stashing a pointer to the alloc_tag, which is defined at the allocation callsite. That's how we're able to decrement the proper counter on free, and why this beats any tracing based approach - with tracing you'd instead have to correlate allocate/free events. Ouch. > > Yes, tracking back the call trace would be really needed. The question > > is whether this is really prohibitively expensive. How much overhead are > > we talking about? There is no free lunch here, really. You either have > > the overhead during runtime when the feature is used or on the source > > code level for all the future development (with a maze of macros and > > wrappers). The full call stack is really not what you want in most applications - that's what people think they want at first, and why page_owner works the way it does, but it turns out that then combining all the different but related stack traces _sucks_ (so why were you saving them in the first place?), and then you have to do a separate memory allocate for each stack track, which destroys performance. > > Will post the overhead numbers soon. > What I hear loud and clear is that we need a kernel command-line kill > switch that mitigates the overhead for having this feature. That seems > to be the main concern. > Thanks, After looking at this more I don't think we should commit just yet - there's some tradeoffs to be evaluated, and maybe the thing to do first will be to see if we can cut down on the (huge!) number of allocation interfaces before adding more complexity. The ideal approach, from a performance POV, would be to pass a pointer to the alloc tag to kmalloc() et. all, and then we'd have the actual accounting code in one place and use a jump label to skip over it when this feature is disabled. However, there are _many, many_ wrapper functions in our allocation code, and this approach is going to make the plumbing for the hooks quite a bit bigger than what we have now - and then, do we want to have this extra alloc_tag parameter that's not used when CONFIG_ALLOC_TAGGING=n? It's a tiny cost for an extra unused parameter, but it's a cost - or do we get rid of that with some extra macro hackery (eww, gross)? If we do the boot parameter before submission, I think we'll have something that's maybe not strictly ideal from a performance POV when CONFIG_ALLOC_TAGGING=y but boot parameter=n, but it'll introduce the minimum amount of macro insanity. What we should be able to do pretty easily is discard the alloc_tag structs when the boot parameter is disabled, because they're in special elf sections and we already do that (e.g. for .init).
On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote: > On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > > > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > > > Whatever asking for an explanation as to why equivalent functionality > > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > > > > > Fully agreed and this is especially true for a change this size > > > 77 files changed, 3406 insertions(+), 703 deletions(-) > > > > In the case of memory allocation accounting, you flat cannot do this with ftrace > > - you could maybe do a janky version that isn't fully accurate, much slower, > > more complicated for the developer to understand and debug and more complicated > > for the end user. > > > > But please, I invite anyone who's actually been doing this with ftrace to > > demonstrate otherwise. > > > > Ftrace just isn't the right tool for the job here - we're talking about adding > > per callsite accounting to some of the fastest fast paths in the kernel. > > > > And the size of the changes for memory allocation accounting are much more > > reasonable: > > 33 files changed, 623 insertions(+), 99 deletions(-) > > > > The code tagging library should exist anyways, it's been open coded half a dozen > > times in the kernel already. > > > > And once we've got that, the time stats code is _also_ far simpler than doing it > > with ftrace would be. If anyone here has successfully debugged latency issues > > with ftrace, I'd really like to hear it. Again, for debugging latency issues you > > want something that can always be on, and that's not cheap with ftrace - and > > never mind the hassle of correlating start and end wait trace events, builting > > up histograms, etc. - that's all handled here. > > > > Cheap, simple, easy to use. What more could you want? > > > > This is very interesting work! Do you have any data about the overhead > this introduces, especially in a production environment? I am > especially interested in memory allocations tracking and detecting > leaks. +1 I think the question whether it indeed can be always turned on in the production or not is the main one. If not, the advantage over ftrace/bpf/... is not that obvious. Otherwise it will be indeed a VERY useful thing. Also, there is a lot of interesting stuff within this patchset, which might be useful elsewhere. So thanks to Kent and Suren for this work! Thanks!
On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote: > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote: > > This is very interesting work! Do you have any data about the overhead > > this introduces, especially in a production environment? I am > > especially interested in memory allocations tracking and detecting > > leaks. > > +1 > > I think the question whether it indeed can be always turned on in the production > or not is the main one. If not, the advantage over ftrace/bpf/... is not that > obvious. Otherwise it will be indeed a VERY useful thing. Low enough overhead to run in production was my primary design goal. Stats are kept in a struct that's defined at the callsite. So this adds _no_ pointer chasing to the allocation path, unless we've switch to percpu counters at that callsite (see the lazy percpu counters patch), where we need to deref one percpu pointer to save an atomic. Then we need to stash a pointer to the alloc_tag, so that kfree() can find it. For slab allocations this uses the same storage area as memcg, so for allocations that are using that we won't be touching any additional cachelines. (I wanted the pointer to the alloc_tag to be stored inline with the allocation, but that would've caused alignment difficulties). Then there's a pointer deref introduced to the kfree() path, to get back to the original alloc_tag and subtract the allocation from that callsite. That one won't be free, and with percpu counters we've got another dependent load too - hmm, it might be worth benchmarking with just atomics, skipping the percpu counters. So the overhead won't be zero, I expect it'll show up in some synthetic benchmarks, but yes I do definitely expect this to be worth enabling in production in many scenarios.
On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote: > On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote: > > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote: > > > This is very interesting work! Do you have any data about the overhead > > > this introduces, especially in a production environment? I am > > > especially interested in memory allocations tracking and detecting > > > leaks. > > > > +1 > > > > I think the question whether it indeed can be always turned on in the production > > or not is the main one. If not, the advantage over ftrace/bpf/... is not that > > obvious. Otherwise it will be indeed a VERY useful thing. > > Low enough overhead to run in production was my primary design goal. > > Stats are kept in a struct that's defined at the callsite. So this adds _no_ > pointer chasing to the allocation path, unless we've switch to percpu counters > at that callsite (see the lazy percpu counters patch), where we need to deref > one percpu pointer to save an atomic. > > Then we need to stash a pointer to the alloc_tag, so that kfree() can find it. > For slab allocations this uses the same storage area as memcg, so for > allocations that are using that we won't be touching any additional cachelines. > (I wanted the pointer to the alloc_tag to be stored inline with the allocation, > but that would've caused alignment difficulties). > > Then there's a pointer deref introduced to the kfree() path, to get back to the > original alloc_tag and subtract the allocation from that callsite. That one > won't be free, and with percpu counters we've got another dependent load too - > hmm, it might be worth benchmarking with just atomics, skipping the percpu > counters. > > So the overhead won't be zero, I expect it'll show up in some synthetic > benchmarks, but yes I do definitely expect this to be worth enabling in > production in many scenarios. I'm somewhat sceptical, but I usually am. And in this case I'll be really happy to be wrong. On a bright side, maybe most of the overhead will come from few allocations, so an option to explicitly exclude them will do the trick. I'd suggest to run something like iperf on a fast hardware. And maybe some io_uring stuff too. These are two places which were historically most sensitive to the (kernel) memory accounting speed. Thanks!
On Thu, Sep 1, 2022 at 3:54 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote: > > On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote: > > > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote: > > > > This is very interesting work! Do you have any data about the overhead > > > > this introduces, especially in a production environment? I am > > > > especially interested in memory allocations tracking and detecting > > > > leaks. > > > > > > +1 > > > > > > I think the question whether it indeed can be always turned on in the production > > > or not is the main one. If not, the advantage over ftrace/bpf/... is not that > > > obvious. Otherwise it will be indeed a VERY useful thing. > > > > Low enough overhead to run in production was my primary design goal. > > > > Stats are kept in a struct that's defined at the callsite. So this adds _no_ > > pointer chasing to the allocation path, unless we've switch to percpu counters > > at that callsite (see the lazy percpu counters patch), where we need to deref > > one percpu pointer to save an atomic. > > > > Then we need to stash a pointer to the alloc_tag, so that kfree() can find it. > > For slab allocations this uses the same storage area as memcg, so for > > allocations that are using that we won't be touching any additional cachelines. > > (I wanted the pointer to the alloc_tag to be stored inline with the allocation, > > but that would've caused alignment difficulties). > > > > Then there's a pointer deref introduced to the kfree() path, to get back to the > > original alloc_tag and subtract the allocation from that callsite. That one > > won't be free, and with percpu counters we've got another dependent load too - > > hmm, it might be worth benchmarking with just atomics, skipping the percpu > > counters. > > > > So the overhead won't be zero, I expect it'll show up in some synthetic > > benchmarks, but yes I do definitely expect this to be worth enabling in > > production in many scenarios. > > I'm somewhat sceptical, but I usually am. And in this case I'll be really happy > to be wrong. > > On a bright side, maybe most of the overhead will come from few allocations, > so an option to explicitly exclude them will do the trick. > > I'd suggest to run something like iperf on a fast hardware. And maybe some > io_uring stuff too. These are two places which were historically most sensitive > to the (kernel) memory accounting speed. Thanks for the suggestions, Roman. I'll see how I can get this done. I'll have to find someone with access to fast hardware (Android is not great for that) and backporting the patchset to the supported kernel version. Will do my best. Thanks, Suren. > > Thanks! > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote: > I'd suggest to run something like iperf on a fast hardware. And maybe some > io_uring stuff too. These are two places which were historically most sensitive > to the (kernel) memory accounting speed. I'm getting wildly inconsistent results with iperf. io_uring-echo-server and rust_echo_bench gets me: Benchmarking: 127.0.0.1:12345 50 clients, running 512 bytes, 60 sec. Without alloc tagging: 120547 request/sec With: 116748 request/sec https://github.com/frevib/io_uring-echo-server https://github.com/haraldh/rust_echo_bench How's that look to you? Close enough? :)
On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote: > On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote: > > I'd suggest to run something like iperf on a fast hardware. And maybe some > > io_uring stuff too. These are two places which were historically most sensitive > > to the (kernel) memory accounting speed. > > I'm getting wildly inconsistent results with iperf. > > io_uring-echo-server and rust_echo_bench gets me: > Benchmarking: 127.0.0.1:12345 > 50 clients, running 512 bytes, 60 sec. > > Without alloc tagging: 120547 request/sec > With: 116748 request/sec > > https://github.com/frevib/io_uring-echo-server > https://github.com/haraldh/rust_echo_bench > > How's that look to you? Close enough? :) Yes, this looks good (a bit too good). I'm not that familiar with io_uring, Jens and Pavel should have a better idea what and how to run (I know they've workarounded the kernel memory accounting because of the performance in the past, this is why I suspect it might be an issue here as well). This is a recent optimization on the networking side: https://lore.kernel.org/linux-mm/20220825000506.239406-1-shakeelb@google.com/ Maybe you can try to repeat this experiment. Thanks!
On Thu, Sep 01, 2022 at 06:04:46PM -0700, Roman Gushchin wrote: > On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote: > > On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote: > > > I'd suggest to run something like iperf on a fast hardware. And maybe some > > > io_uring stuff too. These are two places which were historically most sensitive > > > to the (kernel) memory accounting speed. > > > > I'm getting wildly inconsistent results with iperf. > > > > io_uring-echo-server and rust_echo_bench gets me: > > Benchmarking: 127.0.0.1:12345 > > 50 clients, running 512 bytes, 60 sec. > > > > Without alloc tagging: 120547 request/sec > > With: 116748 request/sec > > > > https://github.com/frevib/io_uring-echo-server > > https://github.com/haraldh/rust_echo_bench > > > > How's that look to you? Close enough? :) > > Yes, this looks good (a bit too good). Eh, I was hoping for better :) > I'm not that familiar with io_uring, Jens and Pavel should have a better idea > what and how to run (I know they've workarounded the kernel memory accounting > because of the performance in the past, this is why I suspect it might be an > issue here as well). > > This is a recent optimization on the networking side: > https://lore.kernel.org/linux-mm/20220825000506.239406-1-shakeelb@google.com/ > > Maybe you can try to repeat this experiment. I'd be more interested in a synthetic benchmark, if you know of any.
On 9/1/22 7:04 PM, Roman Gushchin wrote: > On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote: >> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote: >>> I'd suggest to run something like iperf on a fast hardware. And maybe some >>> io_uring stuff too. These are two places which were historically most sensitive >>> to the (kernel) memory accounting speed. >> >> I'm getting wildly inconsistent results with iperf. >> >> io_uring-echo-server and rust_echo_bench gets me: >> Benchmarking: 127.0.0.1:12345 >> 50 clients, running 512 bytes, 60 sec. >> >> Without alloc tagging: 120547 request/sec >> With: 116748 request/sec >> >> https://github.com/frevib/io_uring-echo-server >> https://github.com/haraldh/rust_echo_bench >> >> How's that look to you? Close enough? :) > > Yes, this looks good (a bit too good). > > I'm not that familiar with io_uring, Jens and Pavel should have a better idea > what and how to run (I know they've workarounded the kernel memory accounting > because of the performance in the past, this is why I suspect it might be an > issue here as well). io_uring isn't alloc+free intensive on a per request basis anymore, it would not be a good benchmark if the goal is to check for regressions in that area.
On Fri, Sep 02, 2022 at 06:02:12AM -0600, Jens Axboe wrote: > On 9/1/22 7:04 PM, Roman Gushchin wrote: > > On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote: > >> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote: > >>> I'd suggest to run something like iperf on a fast hardware. And maybe some > >>> io_uring stuff too. These are two places which were historically most sensitive > >>> to the (kernel) memory accounting speed. > >> > >> I'm getting wildly inconsistent results with iperf. > >> > >> io_uring-echo-server and rust_echo_bench gets me: > >> Benchmarking: 127.0.0.1:12345 > >> 50 clients, running 512 bytes, 60 sec. > >> > >> Without alloc tagging: 120547 request/sec > >> With: 116748 request/sec > >> > >> https://github.com/frevib/io_uring-echo-server > >> https://github.com/haraldh/rust_echo_bench > >> > >> How's that look to you? Close enough? :) > > > > Yes, this looks good (a bit too good). > > > > I'm not that familiar with io_uring, Jens and Pavel should have a better idea > > what and how to run (I know they've workarounded the kernel memory accounting > > because of the performance in the past, this is why I suspect it might be an > > issue here as well). > > io_uring isn't alloc+free intensive on a per request basis anymore, it > would not be a good benchmark if the goal is to check for regressions in > that area. Good to know. The benchmark is still a TCP benchmark though, so still useful. Matthew suggested while true; do echo 1 >/tmp/foo; rm /tmp/foo; done I ran that on tmpfs, and the numbers with and without alloc tagging were statistically equal - there was a fair amount of variation, it wasn't a super controlled test, anywhere from 38-41 seconds with 100000 iterations (and alloc tagging was some of the faster runs). But with memcg off, it ran in 32-33 seconds. We're piggybacking on the same mechanism memcg uses for stashing per-object pointers, so it looks like that's the bigger cost.
On 9/2/22 1:48 PM, Kent Overstreet wrote: > On Fri, Sep 02, 2022 at 06:02:12AM -0600, Jens Axboe wrote: >> On 9/1/22 7:04 PM, Roman Gushchin wrote: >>> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote: >>>> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote: >>>>> I'd suggest to run something like iperf on a fast hardware. And maybe some >>>>> io_uring stuff too. These are two places which were historically most sensitive >>>>> to the (kernel) memory accounting speed. >>>> >>>> I'm getting wildly inconsistent results with iperf. >>>> >>>> io_uring-echo-server and rust_echo_bench gets me: >>>> Benchmarking: 127.0.0.1:12345 >>>> 50 clients, running 512 bytes, 60 sec. >>>> >>>> Without alloc tagging: 120547 request/sec >>>> With: 116748 request/sec >>>> >>>> https://github.com/frevib/io_uring-echo-server >>>> https://github.com/haraldh/rust_echo_bench >>>> >>>> How's that look to you? Close enough? :) >>> >>> Yes, this looks good (a bit too good). >>> >>> I'm not that familiar with io_uring, Jens and Pavel should have a better idea >>> what and how to run (I know they've workarounded the kernel memory accounting >>> because of the performance in the past, this is why I suspect it might be an >>> issue here as well). >> >> io_uring isn't alloc+free intensive on a per request basis anymore, it >> would not be a good benchmark if the goal is to check for regressions in >> that area. > > Good to know. The benchmark is still a TCP benchmark though, so still useful. > > Matthew suggested > while true; do echo 1 >/tmp/foo; rm /tmp/foo; done > > I ran that on tmpfs, and the numbers with and without alloc tagging were > statistically equal - there was a fair amount of variation, it wasn't a super > controlled test, anywhere from 38-41 seconds with 100000 iterations (and alloc > tagging was some of the faster runs). > > But with memcg off, it ran in 32-33 seconds. We're piggybacking on the same > mechanism memcg uses for stashing per-object pointers, so it looks like that's > the bigger cost. I've complained about memcg accounting before, the slowness of it is why io_uring works around it by caching. Anything we account we try NOT do in the fast path because of it, the slowdown is considerable. You care about efficiency now? I thought that was relegated to irrelevant 10M IOPS cases.
On Fri, Sep 02, 2022 at 01:53:53PM -0600, Jens Axboe wrote: > I've complained about memcg accounting before, the slowness of it is why > io_uring works around it by caching. Anything we account we try NOT do > in the fast path because of it, the slowdown is considerable. I'm with you on that, it definitely raises an eyebrow. > You care about efficiency now? I thought that was relegated to > irrelevant 10M IOPS cases. I always did, it's just not the only thing I care about.
On 9/2/22 2:05 PM, Kent Overstreet wrote: > On Fri, Sep 02, 2022 at 01:53:53PM -0600, Jens Axboe wrote: >> I've complained about memcg accounting before, the slowness of it is why >> io_uring works around it by caching. Anything we account we try NOT do >> in the fast path because of it, the slowdown is considerable. > > I'm with you on that, it definitely raises an eyebrow. > >> You care about efficiency now? I thought that was relegated to >> irrelevant 10M IOPS cases. > > I always did, it's just not the only thing I care about. It's not the only thing anyone cares about.
On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote: > > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > So I find Peter's question completely appropriate while your response to > > > that not so much! Maybe ftrace is not the right tool for the intented > > > job. Maybe there are other ways and it would be really great to show > > > that those have been evaluated and they are not suitable for a), b) and > > > c) reasons. > > > > That's fair. > > For memory tracking I looked into using kmemleak and page_owner which > > can't match the required functionality at an overhead acceptable for > > production and pre-production testing environments. > > Being more specific would be really helpful. Especially when your cover > letter suggests that you rely on page_owner/memcg metadata as well to > match allocation and their freeing parts. > > > traces + BPF I > > haven't evaluated myself but heard from other members of my team who > > tried using that in production environment with poor results. I'll try > > to get more specific information on that. > > That would be helpful as well. > > > > E.g. Oscar has been working on extending page_ext to track number of > > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But > > > it can help in environments where page_ext can be enabled and it is > > > completely non-intrusive to the MM code. > > > > Thanks for pointing out this work. I'll need to review and maybe > > profile it before making any claims. > > > > > > > > If the page_ext overhead is not desirable/acceptable then I am sure > > > there are other options. E.g. kprobes/LivePatching framework can hook > > > into functions and alter their behavior. So why not use that for data > > > collection? Has this been evaluated at all? > > > > I'm not sure how I can hook into say alloc_pages() to find out where > > it was called from without capturing the call stack (which would > > introduce an overhead at every allocation). Would love to discuss this > > or other alternatives if they can be done with low enough overhead. > > Yes, tracking back the call trace would be really needed. The question > is whether this is really prohibitively expensive. How much overhead are > we talking about? There is no free lunch here, really. You either have > the overhead during runtime when the feature is used or on the source > code level for all the future development (with a maze of macros and > wrappers). As promised, I profiled a simple code that repeatedly makes 10 allocations/frees in a loop and measured overheads of code tagging, call stack capturing and tracing+BPF for page and slab allocations. Summary: Page allocations (overheads are compared to get_free_pages() duration): 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) 8.8% lookup_page_ext 1237% call stack capture 139% tracepoint with attached empty BPF program Slab allocations (overheads are compared to __kmalloc() duration): With CONFIG_MEMCG_KMEM=y 39% Codetag counter manipulations(__lazy_percpu_counter_add + __alloc_tag_add) 55% get_slab_tag_ref 3.9% __ksize 3027% call stack capture 397% tracepoint with attached empty BPF program With CONFIG_MEMCG_KMEM=n 26% Codetag counter manipulation(__lazy_percpu_counter_add + __alloc_tag_add) 72% get_slab_tag_ref 7.4% __ksize 2789% call stack capture 345% tracepoint with attached empty BPF program Details: _get_free_pages is used as page allocation duration baseline __kmalloc is used as slab allocation duration baseline 1. Profile with instrumented page allocator |--50.13%--my__get_free_page | | | |--38.99%--_get_free_pages | | | | | |--34.75%--__alloc_pages | | | | | | | |--27.59%--get_page_from_freelist | | | | | --3.98%--_alloc_pages | | | | | --0.53%--policy_node | | | |--3.45%--lookup_page_ext | | | |--1.59%--__lazy_percpu_counter_add | | | | | --0.80%--pcpu_alloc | | memset_orig | | | --1.06%--__alloc_tag_add | | | --0.80%--__lazy_percpu_counter_add | |--35.28%--free_unref_page | | | |--23.08%--_raw_spin_unlock_irqrestore | | | |--2.39%--preempt_count_add | | | | | --0.80%--in_lock_functions | | | |--1.59%--free_pcp_prepare | | | |--1.33%--preempt_count_sub | | | --0.80%--check_preemption_disabled | |--4.24%--__free_pages | --1.59%--free_pages 2. Profile with non-instrumented page allocator and call stack capturing |--84.18%--my__get_free_page | | | --83.91%--stack_depot_capture_stack | | | |--77.99%--stack_trace_save | | | | | --77.53%--arch_stack_walk | | | | | |--37.17%--unwind_next_frame | | | | | | | |--8.44%--__orc_find | | | | | | |--10.57%-stack_trace_consume_entry | | | | | --9.64%--unwind_get_return_address | | | --5.78%--__stack_depot_save | |--6.78%--__get_free_pages | | | |--5.85%--__alloc_pages | | | | | --3.86%--get_page_from_freelist | | | | | --1.33%--_raw_spin_unlock_irqrestore | | | --0.80%--alloc_pages | |--5.19%--free_unref_page | | | |--2.73%--_raw_spin_unlock_irqrestore | | | --0.60%--free_pcp_prepare | --0.73%--__free_pages 3. Profile with non-instrumented page allocator and BPF attached to tracepoint |--42.42%--my__get_free_page | | | --38.53%--perf_trace_kmem_alloc | | | |--25.76%--perf_trace_run_bpf_submit | | | | | |--21.86%--trace_call_bpf | | | | | | | |--4.76%--migrate_enable | | | | | | | |--4.55%--migrate_disable | | | | | | | |--3.03%--check_preemption_disabled | | | | | | | |--0.65%--__this_cpu_preempt_check | | | | | | | --0.65%--__rcu_read_unlock | | | | | --0.87%--check_preemption_disabled | | | |--8.01%--perf_trace_buf_alloc | | | | | |--3.68%--perf_swevent_get_recursion_context | | | | | | | --0.87%--check_preemption_disabled | | | | | --1.30%--check_preemption_disabled | | | --0.87%--check_preemption_disabled | |--27.71%--__get_free_pages | | | |--23.38%--__alloc_pages | | | | | --17.75%--get_page_from_freelist | | | | | |--8.66%--_raw_spin_unlock_irqrestore | | | | | | | --1.95%--preempt_count_sub | | | | | --1.08%--preempt_count_add | | | --4.33%--alloc_pages | | | |--0.87%--policy_node | | | --0.65%--policy_nodemask | |--15.37%--free_unref_page | | | |--6.71%--_raw_spin_unlock_irqrestore | | | |--1.52%--check_preemption_disabled | | | |--0.65%--free_pcp_prepare | | | --0.65%--preempt_count_add |--4.98%--__free_pages 4. Profile with instrumented slab allocator CONFIG_MEMCG_KMEM=y |--51.28%--my__get_free_page | | | |--21.79%--__kmalloc | | | | | |--3.42%--memcg_slab_post_alloc_hook | | | | | |--1.71%--kmalloc_slab | | | | | --0.85%--should_failslab | | | |--11.97%--get_slab_tag_ref | | | |--5.56%--__alloc_tag_add | | | | | --2.56%--__lazy_percpu_counter_add | | | |--2.99%--__lazy_percpu_counter_add | | | --0.85%--__ksize | --35.90%--kfree | |--13.68%--get_slab_tag_ref | |--6.41%--__alloc_tag_sub | | | --4.70%--__lazy_percpu_counter_add | --2.14%--__ksize 5. Profile with non-instrumented slab allocator and call stack capturing CONFIG_MEMCG_KMEM=y |--91.50%--my__get_free_page | | | --91.13%--stack_depot_capture_stack | | | |--85.48%--stack_trace_save | | | | | --85.12%--arch_stack_walk | | | | | |--40.54%--unwind_next_frame | | | | | |--14.30%--__unwind_start | | | | | |--11.95%-unwind_get_return_address | | | | | --10.48%-stack_trace_consume_entry | | | --4.99%--__stack_depot_save | | | --0.66%--filter_irq_stacks | |--3.01%--__kmalloc | |--2.05%--kfree 6. Profile with non-instrumented slab allocator and BPF attached to a tracepoint CONFIG_MEMCG_KMEM=y |--72.39%--__kmalloc | | | |--57.84%--perf_trace_kmem_alloc | | | | | |--38.06%--perf_trace_run_bpf_submit | | | | | | | --33.96%--trace_call_bpf | | | | | | | |--10.07%--migrate_disable | | | | | | | |--4.85%--migrate_enable | | | | | | | |--4.10%--check_preemption_disabled | | | | | | | |--1.87%--__rcu_read_unlock | | | | | | | --0.75%--__rcu_read_lock | | | | | --9.70%--perf_trace_buf_alloc | | | | | |--2.99%--perf_swevent_get_recursion_context | | | | | |--1.12%--check_preemption_disabled | | | | | --0.75%--debug_smp_processor_id | | | |--2.24%--kmalloc_slab | | | |--1.49%--memcg_slab_post_alloc_hook | | | --1.12%--__cond_resched | |--7.84%--kfree 7. Profile with instrumented slab allocator CONFIG_MEMCG_KMEM=n |--49.39%--my__get_free_page | | | |--22.04%--__kmalloc | | | | | |--3.27%--kmalloc_slab | | | | | --0.82%--asm_sysvec_apic_timer_interrupt | | sysvec_apic_timer_interrupt | | __irq_exit_rcu | | __softirqentry_text_start | | | |--15.92%--get_slab_tag_ref | | | |--3.27%--__alloc_tag_add | | | | | --2.04%--__lazy_percpu_counter_add | | | --2.45%--__lazy_percpu_counter_add | |--35.51%--kfree | | | |--13.88%--get_slab_tag_ref | | | |--11.84%--__alloc_tag_sub | | | | | --5.31%--__lazy_percpu_counter_add | | | --1.63%--__ksize 8. Profile with non-instrumented slab allocator and call stack capturing CONFIG_MEMCG_KMEM=n |--91.70%--my__get_free_page | | | --91.48%--stack_depot_capture_stack | | | |--85.29%--stack_trace_save | | | | | --85.07%--arch_stack_walk | | | | | |--45.23%--unwind_next_frame | | | | | |--12.89%--__unwind_start | | | | | |--10.20%-unwind_get_return_address | | | | | --10.12%-stack_trace_consume_entry | | | --5.75%--__stack_depot_save | | | --0.87%--filter_irq_stacks | |--3.28%--__kmalloc | --1.89%--kfree 9. Profile with non-instrumented slab allocator and BPF attached to a tracepoint CONFIG_MEMCG_KMEM=n |--71.65%--__kmalloc | | | |--55.56%--perf_trace_kmem_alloc | | | | | |--38.31%--perf_trace_run_bpf_submit | | | | | | | |--31.80%--trace_call_bpf | | | | | | | | | |--9.96%--migrate_enable | | | | | | | | | |--4.98%--migrate_disable | | | | | | | | | |--1.92%--check_preemption_disabled | | | | | | | | | |--1.92%--__rcu_read_unlock | | | | | | | | | --1.15%--__rcu_read_lock | | | | | | | --0.77%--check_preemption_disabled | | | | | --11.11%--perf_trace_buf_alloc | | | | | --4.98%--perf_swevent_get_recursion_context | | | | | --1.53%--check_preemption_disabled | | | |--2.68%--kmalloc_slab | | | --1.15%--__cond_resched | --9.58%--kfree > > Thanks! > -- > Michal Hocko > SUSE Labs
On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote: > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote: [...] > > Yes, tracking back the call trace would be really needed. The question > > is whether this is really prohibitively expensive. How much overhead are > > we talking about? There is no free lunch here, really. You either have > > the overhead during runtime when the feature is used or on the source > > code level for all the future development (with a maze of macros and > > wrappers). > > As promised, I profiled a simple code that repeatedly makes 10 > allocations/frees in a loop and measured overheads of code tagging, > call stack capturing and tracing+BPF for page and slab allocations. > Summary: > > Page allocations (overheads are compared to get_free_pages() duration): > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > 8.8% lookup_page_ext > 1237% call stack capture > 139% tracepoint with attached empty BPF program Yes, I am not surprised that the call stack capturing is really expensive comparing to the allocator fast path (which is really highly optimized and I suspect that with 10 allocation/free loop you mostly get your memory from the pcp lists). Is this overhead still _that_ visible for somehow less microoptimized workloads which have to take slow paths as well? Also what kind of stack unwinder is configured (I guess ORC)? This is not my area but from what I remember the unwinder overhead varies between ORC and FP. And just to make it clear. I do realize that an overhead from the stack unwinding is unavoidable. And code tagging would logically have lower overhead as it performs much less work. But the main point is whether our existing stack unwiding approach is really prohibitively expensive to be used for debugging purposes on production systems. I might misremember but I recall people having bigger concerns with page_owner memory footprint than the actual stack unwinder overhead.
On Thu 01-09-22 16:15:02, Kent Overstreet wrote: > On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote: > > kmemleak is known to be slow and it's even documented [1], so I hope I > > can skip that part. For page_owner to provide the comparable > > information we would have to capture the call stacks for all page > > allocations unlike our proposal which allows to do that selectively > > for specific call sites. I'll post the overhead numbers of call stack > > capturing once I'm finished with profiling the latest code, hopefully > > sometime tomorrow, in the worst case after the long weekend. > > To expand on this further: we're stashing a pointer to the alloc_tag, which is > defined at the allocation callsite. That's how we're able to decrement the > proper counter on free, and why this beats any tracing based approach - with > tracing you'd instead have to correlate allocate/free events. Ouch. > > > > Yes, tracking back the call trace would be really needed. The question > > > is whether this is really prohibitively expensive. How much overhead are > > > we talking about? There is no free lunch here, really. You either have > > > the overhead during runtime when the feature is used or on the source > > > code level for all the future development (with a maze of macros and > > > wrappers). > > The full call stack is really not what you want in most applications - that's > what people think they want at first, and why page_owner works the way it does, > but it turns out that then combining all the different but related stack traces > _sucks_ (so why were you saving them in the first place?), and then you have to > do a separate memory allocate for each stack track, which destroys performance. I do agree that the full stack trace is likely not what you need. But the portion of the stack that you need is not really clear because the relevant part might be on a different level of the calltrace depending on the allocation site. Take this as an example: {traverse, seq_read_iter, single_open_size}->seq_buf_alloc -> kvmalloc -> kmalloc This whole part of the stack is not really all that interesting and you would have to allocate pretty high at the API layer to catch something useful. And please remember that seq_file interface is heavily used in throughout the kernel. I wouldn't suspect seq_file itself to be buggy, that is well exercised code but its users can botch things and that is where the leak would happen. There are many other examples like that where the allocation is done at a lib/infrastructure layer (sysfs framework, mempools network pool allocators and whatnot). We do care about those users, really. Ad-hoc pool allocators built on top of the core MM allocators are not really uncommon. And I am really skeptical we really want to add all the tagging source code level changes to each and every one of them. This is really my main concern about this whole work. Not only it adds a considerable maintenance burden to the core MM because it adds on top of our existing allocator layers complexity but it would need to spread beyond MM to be useful because it is usually outside of MM where leaks happen.
On Mon, 5 Sept 2022 at 10:12, Michal Hocko <mhocko@suse.com> wrote: > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote: > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > Yes, tracking back the call trace would be really needed. The question > > > is whether this is really prohibitively expensive. How much overhead are > > > we talking about? There is no free lunch here, really. You either have > > > the overhead during runtime when the feature is used or on the source > > > code level for all the future development (with a maze of macros and > > > wrappers). > > > > As promised, I profiled a simple code that repeatedly makes 10 > > allocations/frees in a loop and measured overheads of code tagging, > > call stack capturing and tracing+BPF for page and slab allocations. > > Summary: > > > > Page allocations (overheads are compared to get_free_pages() duration): > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > > 8.8% lookup_page_ext > > 1237% call stack capture > > 139% tracepoint with attached empty BPF program > > Yes, I am not surprised that the call stack capturing is really > expensive comparing to the allocator fast path (which is really highly > optimized and I suspect that with 10 allocation/free loop you mostly get > your memory from the pcp lists). Is this overhead still _that_ visible > for somehow less microoptimized workloads which have to take slow paths > as well? > > Also what kind of stack unwinder is configured (I guess ORC)? This is > not my area but from what I remember the unwinder overhead varies > between ORC and FP. > > And just to make it clear. I do realize that an overhead from the stack > unwinding is unavoidable. And code tagging would logically have lower > overhead as it performs much less work. But the main point is whether > our existing stack unwiding approach is really prohibitively expensive > to be used for debugging purposes on production systems. I might > misremember but I recall people having bigger concerns with page_owner > memory footprint than the actual stack unwinder overhead. This is just to point out that we've also been looking at cheaper collection of the stack trace (for KASAN and other sanitizers). The cheapest way to unwind the stack would be a system with "shadow call stack" enabled. With compiler support it's available on arm64, see CONFIG_SHADOW_CALL_STACK. For x86 the hope is that at one point the kernel will support CET, which newer Intel and AMD CPUs support. Collecting the call stack would then be a simple memcpy.
On Sun, 4 Sep 2022 18:32:58 -0700 Suren Baghdasaryan <surenb@google.com> wrote: > Page allocations (overheads are compared to get_free_pages() duration): > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > 8.8% lookup_page_ext > 1237% call stack capture > 139% tracepoint with attached empty BPF program Have you tried tracepoint with custom callback? static void my_callback(void *data, unsigned long call_site, const void *ptr, struct kmem_cache *s, size_t bytes_req, size_t bytes_alloc, gfp_t gfp_flags) { struct my_data_struct *my_data = data; { do whatever } } [..] register_trace_kmem_alloc(my_callback, my_data); Now the my_callback function will be called directly every time the kmem_alloc tracepoint is hit. This avoids that perf and BPF overhead. -- Steve
On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote: > > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote: > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > Yes, tracking back the call trace would be really needed. The question > > > is whether this is really prohibitively expensive. How much overhead are > > > we talking about? There is no free lunch here, really. You either have > > > the overhead during runtime when the feature is used or on the source > > > code level for all the future development (with a maze of macros and > > > wrappers). > > > > As promised, I profiled a simple code that repeatedly makes 10 > > allocations/frees in a loop and measured overheads of code tagging, > > call stack capturing and tracing+BPF for page and slab allocations. > > Summary: > > > > Page allocations (overheads are compared to get_free_pages() duration): > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > > 8.8% lookup_page_ext > > 1237% call stack capture > > 139% tracepoint with attached empty BPF program > > Yes, I am not surprised that the call stack capturing is really > expensive comparing to the allocator fast path (which is really highly > optimized and I suspect that with 10 allocation/free loop you mostly get > your memory from the pcp lists). Is this overhead still _that_ visible > for somehow less microoptimized workloads which have to take slow paths > as well? Correct, it's a comparison with the allocation fast path, so in a sense represents the worst case scenario. However at the same time the measurements are fair because they measure the overheads against the same meaningful baseline, therefore can be used for comparison. > > Also what kind of stack unwinder is configured (I guess ORC)? This is > not my area but from what I remember the unwinder overhead varies > between ORC and FP. I used whatever is default and didn't try other mechanisms. Don't think the difference would be orders of magnitude better though. > > And just to make it clear. I do realize that an overhead from the stack > unwinding is unavoidable. And code tagging would logically have lower > overhead as it performs much less work. But the main point is whether > our existing stack unwiding approach is really prohibitively expensive > to be used for debugging purposes on production systems. I might > misremember but I recall people having bigger concerns with page_owner > memory footprint than the actual stack unwinder overhead. That's one of those questions which are very difficult to answer (if even possible) because that would depend on the use scenario. If the workload allocates frequently then adding the overhead will likely affect it, otherwise might not be even noticeable. In general, in pre-production testing we try to minimize the difference in performance and memory profiles between the software we are testing and the production one. From that point of view, the smaller the overhead, the better. I know it's kinda obvious but unfortunately I have no better answer to that question. For the memory overhead, in my early internal proposal with assumption of 10000 instrumented allocation call sites, I've made some calculations for an 8GB 8-core system (quite typical for Android) and ended up with the following: per-cpu counters atomic counters page_ext references 16MB 16MB slab object references 10.5MB 10.5MB alloc_tags 900KB 312KB Total memory overhead 27.4MB 26.8MB so, about 0.34% of the total memory. Our implementation has changed since then and the number might not be completely correct but it should be in the ballpark. I just checked the number of instrumented calls that we currently have in the 6.0-rc3 built with defconfig and it's 165 page allocation and 2684 slab allocation sites. I readily accept that we are probably missing some allocations and additional modules can also contribute to these numbers but my guess it's still less than 10000 that I used in my calculations. I don't claim that 0.34% overhead is low enough to be always acceptable, just posting the numbers to provide some reference points. > -- > Michal Hocko > SUSE Labs
On Mon, Sep 5, 2022 at 1:58 AM Marco Elver <elver@google.com> wrote: > > On Mon, 5 Sept 2022 at 10:12, Michal Hocko <mhocko@suse.com> wrote: > > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote: > > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote: > > [...] > > > > Yes, tracking back the call trace would be really needed. The question > > > > is whether this is really prohibitively expensive. How much overhead are > > > > we talking about? There is no free lunch here, really. You either have > > > > the overhead during runtime when the feature is used or on the source > > > > code level for all the future development (with a maze of macros and > > > > wrappers). > > > > > > As promised, I profiled a simple code that repeatedly makes 10 > > > allocations/frees in a loop and measured overheads of code tagging, > > > call stack capturing and tracing+BPF for page and slab allocations. > > > Summary: > > > > > > Page allocations (overheads are compared to get_free_pages() duration): > > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > > > 8.8% lookup_page_ext > > > 1237% call stack capture > > > 139% tracepoint with attached empty BPF program > > > > Yes, I am not surprised that the call stack capturing is really > > expensive comparing to the allocator fast path (which is really highly > > optimized and I suspect that with 10 allocation/free loop you mostly get > > your memory from the pcp lists). Is this overhead still _that_ visible > > for somehow less microoptimized workloads which have to take slow paths > > as well? > > > > Also what kind of stack unwinder is configured (I guess ORC)? This is > > not my area but from what I remember the unwinder overhead varies > > between ORC and FP. > > > > And just to make it clear. I do realize that an overhead from the stack > > unwinding is unavoidable. And code tagging would logically have lower > > overhead as it performs much less work. But the main point is whether > > our existing stack unwiding approach is really prohibitively expensive > > to be used for debugging purposes on production systems. I might > > misremember but I recall people having bigger concerns with page_owner > > memory footprint than the actual stack unwinder overhead. > > This is just to point out that we've also been looking at cheaper > collection of the stack trace (for KASAN and other sanitizers). The > cheapest way to unwind the stack would be a system with "shadow call > stack" enabled. With compiler support it's available on arm64, see > CONFIG_SHADOW_CALL_STACK. For x86 the hope is that at one point the > kernel will support CET, which newer Intel and AMD CPUs support. > Collecting the call stack would then be a simple memcpy. Thanks for the note Marco! I'll check out the CONFIG_SHADOW_CALL_STACK on Android.
On Mon, Sep 5, 2022 at 8:06 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Sun, 4 Sep 2022 18:32:58 -0700 > Suren Baghdasaryan <surenb@google.com> wrote: > > > Page allocations (overheads are compared to get_free_pages() duration): > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > > 8.8% lookup_page_ext > > 1237% call stack capture > > 139% tracepoint with attached empty BPF program > > Have you tried tracepoint with custom callback? > > static void my_callback(void *data, unsigned long call_site, > const void *ptr, struct kmem_cache *s, > size_t bytes_req, size_t bytes_alloc, > gfp_t gfp_flags) > { > struct my_data_struct *my_data = data; > > { do whatever } > } > > [..] > register_trace_kmem_alloc(my_callback, my_data); > > Now the my_callback function will be called directly every time the > kmem_alloc tracepoint is hit. > > This avoids that perf and BPF overhead. Haven't tried that yet but will do. Thanks for the reference code! > > -- Steve
On Aug 31, 2022, at 3:19 AM, Mel Gorman <mgorman@suse.de> wrote: > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote: >> On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote: >>> On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote: >>>> =========================== >>>> Code tagging framework >>>> =========================== >>>> Code tag is a structure identifying a specific location in the source code >>>> which is generated at compile time and can be embedded in an application- >>>> specific structure. Several applications of code tagging are included in >>>> this RFC, such as memory allocation tracking, dynamic fault injection, >>>> latency tracking and improved error code reporting. >>>> Basically, it takes the old trick of "define a special elf section for >>>> objects of a given type so that we can iterate over them at runtime" and >>>> creates a proper library for it. >>> >>> I might be super dense this morning, but what!? I've skimmed through the >>> set and I don't think I get it. >>> >>> What does this provide that ftrace/kprobes don't already allow? >> >> You're kidding, right? > > It's a valid question. From the description, it main addition that would > be hard to do with ftrace or probes is catching where an error code is > returned. A secondary addition would be catching all historical state and > not just state since the tracing started. > > It's also unclear *who* would enable this. It looks like it would mostly > have value during the development stage of an embedded platform to track > kernel memory usage on a per-application basis in an environment where it > may be difficult to setup tracing and tracking. Would it ever be enabled > in production? Would a distribution ever enable this? If it's enabled, any > overhead cannot be disabled/enabled at run or boot time so anyone enabling > this would carry the cost without never necessarily consuming the data. > > It might be an ease-of-use thing. Gathering the information from traces > is tricky and would need combining multiple different elements and that > is development effort but not impossible. > > Whatever asking for an explanation as to why equivalent functionality > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. I would note that I have a solution in the making (which pretty much works) for this matter, and does not require any kernel changes. It produces a call stack that leads to the code that lead to syscall failure. The way it works is by using seccomp to trap syscall failures, and then setting ftrace function filters and kprobes on conditional branches, indirect branch targets and function returns. Using symbolic execution, backtracking is performed and the condition that lead to the failure is then pin-pointed. I hope to share the code soon.
On Mon, 5 Sep 2022 11:44:55 -0700 Nadav Amit <nadav.amit@gmail.com> wrote: > I would note that I have a solution in the making (which pretty much works) > for this matter, and does not require any kernel changes. It produces a > call stack that leads to the code that lead to syscall failure. > > The way it works is by using seccomp to trap syscall failures, and then > setting ftrace function filters and kprobes on conditional branches, > indirect branch targets and function returns. Ooh nifty! > > Using symbolic execution, backtracking is performed and the condition that > lead to the failure is then pin-pointed. > > I hope to share the code soon. Looking forward to it. -- Steve
On Mon, Sep 05, 2022 at 11:08:21AM -0700, Suren Baghdasaryan wrote: > On Mon, Sep 5, 2022 at 8:06 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Sun, 4 Sep 2022 18:32:58 -0700 > > Suren Baghdasaryan <surenb@google.com> wrote: > > > > > Page allocations (overheads are compared to get_free_pages() duration): > > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > > > 8.8% lookup_page_ext > > > 1237% call stack capture > > > 139% tracepoint with attached empty BPF program > > > > Have you tried tracepoint with custom callback? > > > > static void my_callback(void *data, unsigned long call_site, > > const void *ptr, struct kmem_cache *s, > > size_t bytes_req, size_t bytes_alloc, > > gfp_t gfp_flags) > > { > > struct my_data_struct *my_data = data; > > > > { do whatever } > > } > > > > [..] > > register_trace_kmem_alloc(my_callback, my_data); > > > > Now the my_callback function will be called directly every time the > > kmem_alloc tracepoint is hit. > > > > This avoids that perf and BPF overhead. > > Haven't tried that yet but will do. Thanks for the reference code! Is it really worth the effort of benchmarking tracing API overhead here? The main cost of a tracing based approach is going to to be the data structure for remembering outstanding allocations so that free events can be matched to the appropriate callsite. Regardless of whether it's done with BFP or by attaching to the tracepoints directly, that's going to be the main overhead.
On Mon, 5 Sep 2022 16:42:29 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > > Haven't tried that yet but will do. Thanks for the reference code! > > Is it really worth the effort of benchmarking tracing API overhead here? > > The main cost of a tracing based approach is going to to be the data structure > for remembering outstanding allocations so that free events can be matched to > the appropriate callsite. Regardless of whether it's done with BFP or by > attaching to the tracepoints directly, that's going to be the main overhead. The point I was making here is that you do not need your own hooking mechanism. You can get the information directly by attaching to the tracepoint. > > static void my_callback(void *data, unsigned long call_site, > > const void *ptr, struct kmem_cache *s, > > size_t bytes_req, size_t bytes_alloc, > > gfp_t gfp_flags) > > { > > struct my_data_struct *my_data = data; > > > > { do whatever } > > } The "do whatever" is anything you want to do. Or is the data structure you create with this approach going to be too much overhead? How hard is it for a hash or binary search lookup? -- Steve
On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote: > This is really my main concern about this whole work. Not only it adds a > considerable maintenance burden to the core MM because [citation needed] > it adds on top of > our existing allocator layers complexity but it would need to spread beyond > MM to be useful because it is usually outside of MM where leaks happen. If you want the tracking to happen at a different level of the call stack, just call _kmalloc() directly and call alloc_tag_add()/sub() yourself.
On Mon, Sep 05, 2022 at 06:16:50PM -0400, Steven Rostedt wrote: > On Mon, 5 Sep 2022 16:42:29 -0400 > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > Haven't tried that yet but will do. Thanks for the reference code! > > > > Is it really worth the effort of benchmarking tracing API overhead here? > > > > The main cost of a tracing based approach is going to to be the data structure > > for remembering outstanding allocations so that free events can be matched to > > the appropriate callsite. Regardless of whether it's done with BFP or by > > attaching to the tracepoints directly, that's going to be the main overhead. > > The point I was making here is that you do not need your own hooking > mechanism. You can get the information directly by attaching to the > tracepoint. > > > > static void my_callback(void *data, unsigned long call_site, > > > const void *ptr, struct kmem_cache *s, > > > size_t bytes_req, size_t bytes_alloc, > > > gfp_t gfp_flags) > > > { > > > struct my_data_struct *my_data = data; > > > > > > { do whatever } > > > } > > The "do whatever" is anything you want to do. > > Or is the data structure you create with this approach going to be too much > overhead? How hard is it for a hash or binary search lookup? If you don't think it's hard, go ahead and show us.
On Mon 05-09-22 19:46:49, Kent Overstreet wrote: > On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote: > > This is really my main concern about this whole work. Not only it adds a > > considerable maintenance burden to the core MM because > > [citation needed] I thought this was clear from the email content (the part you haven't quoted here). But let me be explicit one more time for you. I hope we can agree that in order for this kind of tracking to be useful you need to cover _callers_ of the allocator or in the ideal world the users/owner of the tracked memory (the later is sometimes much harder/impossible to track when the memory is handed over from one peer to another). It is not particularly useful IMO to see that a large portion of the memory has been allocated by say vmalloc or kvmalloc, right? How much does it really tell you that a lot of memory has been allocated by kvmalloc or vmalloc? Yet, neither of the two is handled by the proposed tracking and it would require additional code to be added and _maintained_ to cover them. But that would be still far from complete, we have bulk allocator, mempools etc. If that was not enough some of those allocators are used by library code like seq_file, networking pools, module loader and whatnot. So this grows and effectively doubles the API space for many allocators as they need both normal API and the one which can pass the tracking context down the path to prevent double tracking. Right? This in my book is a considerable maintenance burden. And especially for the MM subsystem this means additional burden because we have a very rich allocators APIs. You are absolutely right that processing stack traces is PITA but that allows to see the actual callers irrespectively how many layers of indirection or library code it goes. > > it adds on top of > > our existing allocator layers complexity but it would need to spread beyond > > MM to be useful because it is usually outside of MM where leaks happen. > > If you want the tracking to happen at a different level of the call stack, just > call _kmalloc() directly and call alloc_tag_add()/sub() yourself. As pointed above this just scales poorly and adds to the API space. Not to mention that direct use of alloc_tag_add can just confuse layers below which rely on the same thing. Hope this makes it clearer.
On Mon 05-09-22 11:03:35, Suren Baghdasaryan wrote: > On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote: > > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote: > > [...] > > > > Yes, tracking back the call trace would be really needed. The question > > > > is whether this is really prohibitively expensive. How much overhead are > > > > we talking about? There is no free lunch here, really. You either have > > > > the overhead during runtime when the feature is used or on the source > > > > code level for all the future development (with a maze of macros and > > > > wrappers). > > > > > > As promised, I profiled a simple code that repeatedly makes 10 > > > allocations/frees in a loop and measured overheads of code tagging, > > > call stack capturing and tracing+BPF for page and slab allocations. > > > Summary: > > > > > > Page allocations (overheads are compared to get_free_pages() duration): > > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > > > 8.8% lookup_page_ext > > > 1237% call stack capture > > > 139% tracepoint with attached empty BPF program > > > > Yes, I am not surprised that the call stack capturing is really > > expensive comparing to the allocator fast path (which is really highly > > optimized and I suspect that with 10 allocation/free loop you mostly get > > your memory from the pcp lists). Is this overhead still _that_ visible > > for somehow less microoptimized workloads which have to take slow paths > > as well? > > Correct, it's a comparison with the allocation fast path, so in a > sense represents the worst case scenario. However at the same time the > measurements are fair because they measure the overheads against the > same meaningful baseline, therefore can be used for comparison. Yes, I am not saying it is an unfair comparision. It is just not a particularly practical one for real life situations. So I am not sure you can draw many conclusions from that. Or let me put it differently. There is not real point comparing the code tagging and stack unwiding approaches because the later is simply more complex because it collects more state. The main question is whether that additional state collection is too expensive to be practically used. > > Also what kind of stack unwinder is configured (I guess ORC)? This is > > not my area but from what I remember the unwinder overhead varies > > between ORC and FP. > > I used whatever is default and didn't try other mechanisms. Don't > think the difference would be orders of magnitude better though. > > > > > And just to make it clear. I do realize that an overhead from the stack > > unwinding is unavoidable. And code tagging would logically have lower > > overhead as it performs much less work. But the main point is whether > > our existing stack unwiding approach is really prohibitively expensive > > to be used for debugging purposes on production systems. I might > > misremember but I recall people having bigger concerns with page_owner > > memory footprint than the actual stack unwinder overhead. > > That's one of those questions which are very difficult to answer (if > even possible) because that would depend on the use scenario. If the > workload allocates frequently then adding the overhead will likely > affect it, otherwise might not be even noticeable. In general, in > pre-production testing we try to minimize the difference in > performance and memory profiles between the software we are testing > and the production one. From that point of view, the smaller the > overhead, the better. I know it's kinda obvious but unfortunately I > have no better answer to that question. This is clear but it doesn't really tell whether the existing tooling is unusable for _your_ or any specific scenarios. Because when we are talking about adding quite a lot of code and make our allocators APIs more complicated to track the state then we should carefully weigh the benefit and the cost. As replied to other email I am really skeptical this patchset is at the final stage and the more allocators get covered the more code we have to maintain. So there must be a very strong reason to add it. > For the memory overhead, in my early internal proposal with assumption > of 10000 instrumented allocation call sites, I've made some > calculations for an 8GB 8-core system (quite typical for Android) and > ended up with the following: > > per-cpu counters atomic counters > page_ext references 16MB 16MB > slab object references 10.5MB 10.5MB > alloc_tags 900KB 312KB > Total memory overhead 27.4MB 26.8MB I do not really think this is all that interesting because the major memory overhead contributors (page_ext and objcg are going to be there with other approaches that want to match alloc and free as that clearly requires to store the allocator objects somewhere). > so, about 0.34% of the total memory. Our implementation has changed > since then and the number might not be completely correct but it > should be in the ballpark. > I just checked the number of instrumented calls that we currently have > in the 6.0-rc3 built with defconfig and it's 165 page allocation and > 2684 slab allocation sites. I readily accept that we are probably > missing some allocations and additional modules can also contribute to > these numbers but my guess it's still less than 10000 that I used in > my calculations. yes, in the current implementation you are missing most indirect users of the page allocator as stated elsewhere so the usefulness can be really limited. A better coverege will not increase the memory consumption much but it will add an additional maintenance burden that will scale with different usecases.
On Tue, Sep 6, 2022 at 1:01 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 05-09-22 11:03:35, Suren Baghdasaryan wrote: > > On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote: > > > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote: > > > [...] > > > > > Yes, tracking back the call trace would be really needed. The question > > > > > is whether this is really prohibitively expensive. How much overhead are > > > > > we talking about? There is no free lunch here, really. You either have > > > > > the overhead during runtime when the feature is used or on the source > > > > > code level for all the future development (with a maze of macros and > > > > > wrappers). > > > > > > > > As promised, I profiled a simple code that repeatedly makes 10 > > > > allocations/frees in a loop and measured overheads of code tagging, > > > > call stack capturing and tracing+BPF for page and slab allocations. > > > > Summary: > > > > > > > > Page allocations (overheads are compared to get_free_pages() duration): > > > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add) > > > > 8.8% lookup_page_ext > > > > 1237% call stack capture > > > > 139% tracepoint with attached empty BPF program > > > > > > Yes, I am not surprised that the call stack capturing is really > > > expensive comparing to the allocator fast path (which is really highly > > > optimized and I suspect that with 10 allocation/free loop you mostly get > > > your memory from the pcp lists). Is this overhead still _that_ visible > > > for somehow less microoptimized workloads which have to take slow paths > > > as well? > > > > Correct, it's a comparison with the allocation fast path, so in a > > sense represents the worst case scenario. However at the same time the > > measurements are fair because they measure the overheads against the > > same meaningful baseline, therefore can be used for comparison. > > Yes, I am not saying it is an unfair comparision. It is just not a > particularly practical one for real life situations. So I am not sure > you can draw many conclusions from that. Or let me put it differently. > There is not real point comparing the code tagging and stack unwiding > approaches because the later is simply more complex because it collects > more state. The main question is whether that additional state > collection is too expensive to be practically used. You asked me to provide the numbers in one of your replies, that's what I did. > > > > Also what kind of stack unwinder is configured (I guess ORC)? This is > > > not my area but from what I remember the unwinder overhead varies > > > between ORC and FP. > > > > I used whatever is default and didn't try other mechanisms. Don't > > think the difference would be orders of magnitude better though. > > > > > > > > And just to make it clear. I do realize that an overhead from the stack > > > unwinding is unavoidable. And code tagging would logically have lower > > > overhead as it performs much less work. But the main point is whether > > > our existing stack unwiding approach is really prohibitively expensive > > > to be used for debugging purposes on production systems. I might > > > misremember but I recall people having bigger concerns with page_owner > > > memory footprint than the actual stack unwinder overhead. > > > > That's one of those questions which are very difficult to answer (if > > even possible) because that would depend on the use scenario. If the > > workload allocates frequently then adding the overhead will likely > > affect it, otherwise might not be even noticeable. In general, in > > pre-production testing we try to minimize the difference in > > performance and memory profiles between the software we are testing > > and the production one. From that point of view, the smaller the > > overhead, the better. I know it's kinda obvious but unfortunately I > > have no better answer to that question. > > This is clear but it doesn't really tell whether the existing tooling is > unusable for _your_ or any specific scenarios. Because when we are > talking about adding quite a lot of code and make our allocators APIs > more complicated to track the state then we should carefully weigh the > benefit and the cost. As replied to other email I am really skeptical > this patchset is at the final stage and the more allocators get covered > the more code we have to maintain. So there must be a very strong reason > to add it. The patchset is quite complete at this point. Instrumenting new allocators takes 3 lines of code, see how kmalloc_hooks macro is used in https://lore.kernel.org/all/20220830214919.53220-17-surenb@google.com/ > > > For the memory overhead, in my early internal proposal with assumption > > of 10000 instrumented allocation call sites, I've made some > > calculations for an 8GB 8-core system (quite typical for Android) and > > ended up with the following: > > > > per-cpu counters atomic counters > > page_ext references 16MB 16MB > > slab object references 10.5MB 10.5MB > > alloc_tags 900KB 312KB > > Total memory overhead 27.4MB 26.8MB > > I do not really think this is all that interesting because the major > memory overhead contributors (page_ext and objcg are going to be there > with other approaches that want to match alloc and free as that clearly > requires to store the allocator objects somewhere). You mentioned that memory consumption in the page_owner approach was more important overhead, so I provided the numbers for that part of the discussion. > > > so, about 0.34% of the total memory. Our implementation has changed > > since then and the number might not be completely correct but it > > should be in the ballpark. > > I just checked the number of instrumented calls that we currently have > > in the 6.0-rc3 built with defconfig and it's 165 page allocation and > > 2684 slab allocation sites. I readily accept that we are probably > > missing some allocations and additional modules can also contribute to > > these numbers but my guess it's still less than 10000 that I used in > > my calculations. > > yes, in the current implementation you are missing most indirect users > of the page allocator as stated elsewhere so the usefulness can be > really limited. A better coverege will not increase the memory > consumption much but it will add an additional maintenance burden that > will scale with different usecases. Your comments in the last two letters about needing the stack tracing and covering indirect users of the allocators makes me think that you missed my reply here: https://lore.kernel.org/all/CAJuCfpGZ==v0HGWBzZzHTgbo4B_ZBe6V6U4T_788LVWj8HhCRQ@mail.gmail.com/. I messed up with formatting but hopefully it's still readable. The idea of having two stage tracking - first one very cheap and the second one more in-depth I think should address your concerns about indirect users. Thanks, Suren. > -- > Michal Hocko > SUSE Labs
On Tue, Sep 06, 2022 at 09:23:31AM +0200, Michal Hocko wrote: > On Mon 05-09-22 19:46:49, Kent Overstreet wrote: > > On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote: > > > This is really my main concern about this whole work. Not only it adds a > > > considerable maintenance burden to the core MM because > > > > [citation needed] > > I thought this was clear from the email content (the part you haven't > quoted here). But let me be explicit one more time for you. > > I hope we can agree that in order for this kind of tracking to be useful > you need to cover _callers_ of the allocator or in the ideal world > the users/owner of the tracked memory (the later is sometimes much > harder/impossible to track when the memory is handed over from one peer > to another). > > It is not particularly useful IMO to see that a large portion of the > memory has been allocated by say vmalloc or kvmalloc, right? How > much does it really tell you that a lot of memory has been allocated > by kvmalloc or vmalloc? Yet, neither of the two is handled by the > proposed tracking and it would require additional code to be added and > _maintained_ to cover them. But that would be still far from complete, > we have bulk allocator, mempools etc. Of course - and even a light skimming of the patch set would see it does indeed address this. We still have to do vmalloc and percpu memory allocations, but slab is certainly handled and that's the big one. > As pointed above this just scales poorly and adds to the API space. Not > to mention that direct use of alloc_tag_add can just confuse layers > below which rely on the same thing. It might help you make your case if you'd say something about what you'd like better. Otherwise, saying "code has to be maintained" is a little bit like saying water is wet, and we're all engineers here, I think we know that :)
On Tue 06-09-22 14:20:58, Kent Overstreet wrote: [...] > Otherwise, saying "code has to be maintained" is a little bit like saying water > is wet, and we're all engineers here, I think we know that :) Hmm, it seems that further discussion doesn't really make much sense here. I know how to use my time better.
On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote: > Hmm, it seems that further discussion doesn't really make much sense > here. I know how to use my time better. Just a thought, but I generally find it more productive to propose ideas than to just be disparaging. Cheers, Kent
On Wed, 7 Sep 2022 09:04:28 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote: > > Hmm, it seems that further discussion doesn't really make much sense > > here. I know how to use my time better. > > Just a thought, but I generally find it more productive to propose ideas than to > just be disparaging. > But it's not Michal's job to do so. He's just telling you that the given feature is not worth the burden. He's telling you the issues that he has with the patch set. It's the submitter's job to address those concerns and not the maintainer's to tell you how to make it better. When Linus tells us that a submission is crap, we don't ask him how to make it less crap, we listen to why he called it crap, and then rewrite to be not so crappy. If we cannot figure it out, it doesn't get in. -- Steve
On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote: > On Wed, 7 Sep 2022 09:04:28 -0400 > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote: > > > Hmm, it seems that further discussion doesn't really make much sense > > > here. I know how to use my time better. > > > > Just a thought, but I generally find it more productive to propose ideas than to > > just be disparaging. > > > > But it's not Michal's job to do so. He's just telling you that the given > feature is not worth the burden. He's telling you the issues that he has > with the patch set. It's the submitter's job to address those concerns and > not the maintainer's to tell you how to make it better. > > When Linus tells us that a submission is crap, we don't ask him how to make > it less crap, we listen to why he called it crap, and then rewrite to be > not so crappy. If we cannot figure it out, it doesn't get in. When Linus tells someone a submission is crap, he _always_ has a sound, and _specific_ technical justification for doing so. "This code is going to be a considerable maintenance burden" is vapid, and lazy. It's the kind of feedback made by someone who has looked at the number of lines of code a patch touches and not much more.
On Wed, Sep 7, 2022 at 11:35 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote: > > On Wed, 7 Sep 2022 09:04:28 -0400 > > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote: > > > > Hmm, it seems that further discussion doesn't really make much sense > > > > here. I know how to use my time better. > > > > > > Just a thought, but I generally find it more productive to propose ideas than to > > > just be disparaging. > > > > > > > But it's not Michal's job to do so. He's just telling you that the given > > feature is not worth the burden. He's telling you the issues that he has > > with the patch set. It's the submitter's job to address those concerns and > > not the maintainer's to tell you how to make it better. > > > > When Linus tells us that a submission is crap, we don't ask him how to make > > it less crap, we listen to why he called it crap, and then rewrite to be > > not so crappy. If we cannot figure it out, it doesn't get in. > > When Linus tells someone a submission is crap, he _always_ has a sound, and > _specific_ technical justification for doing so. > > "This code is going to be a considerable maintenance burden" is vapid, and lazy. > It's the kind of feedback made by someone who has looked at the number of lines > of code a patch touches and not much more. I would really appreciate if everyone could please stick to the technical side of the conversation. That way we can get some constructive feedback. Everything else is not helpful and at best is a distraction. Maintenance burden is a price we pay and I think it's the prerogative of the maintainers to take that into account. Our job is to prove that the price is worth paying. > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Wed, Sep 07, 2022 at 11:49:37PM -0700, Suren Baghdasaryan wrote: > I would really appreciate if everyone could please stick to the > technical side of the conversation. That way we can get some > constructive feedback. Everything else is not helpful and at best is a > distraction. > Maintenance burden is a price we pay and I think it's the prerogative > of the maintainers to take that into account. Our job is to prove that > the price is worth paying. Well said. I'd also like to add - slab.h does look pretty overgrown and messy. We've grown a _lot_ of special purpose memory allocation interfaces, and I think it probably is time to try and wrangle that back. The API complexity isn't just an issue for this patch - it's an issue for anything that has to wrap and plumb through memory allocation interfaces. It's a pain point for the Rust people, and also comes in e.g. the mempool API. I think we should keep going with the memalloc_no*_save()/restore() approach, and extend it to other things: - memalloc_nowait_save() - memalloc_highpri_save() (these two get you GFP_ATOMIC). Also, I don't think these all need to be separate functions, we could have memalloc_gfp_apply() memalloc_gfp_restore() which simply takes a gfp flags argument and applies it to the current PF_MEMALLOC flags. We've had long standing bugs where vmalloc() can't correctly take gfp flags because some of the allocations it does for page tables don't have it correctly plumbed through; switching to the memalloc_*_(save|restore) is something people have been wanting in order to fix this - for years. Actually following through and completing this would let us kill the gfp flags arguments to our various memory allocators entirely. I think we can do the same thing with the numa node parameter - kill kmalloc_node() et. all, move it to task_struct with a set of save/restore functions. There's probably other things we can do to simplify slab.h if we look more. I've been hoping to start pushing patches for some of this stuff - it's going to be some time before I can get to it though, can only handle so many projects in flight at a time :)
On Thu 08-09-22 02:35:48, Kent Overstreet wrote: > On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote: > > On Wed, 7 Sep 2022 09:04:28 -0400 > > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote: > > > > Hmm, it seems that further discussion doesn't really make much sense > > > > here. I know how to use my time better. > > > > > > Just a thought, but I generally find it more productive to propose ideas than to > > > just be disparaging. > > > > > > > But it's not Michal's job to do so. He's just telling you that the given > > feature is not worth the burden. He's telling you the issues that he has > > with the patch set. It's the submitter's job to address those concerns and > > not the maintainer's to tell you how to make it better. > > > > When Linus tells us that a submission is crap, we don't ask him how to make > > it less crap, we listen to why he called it crap, and then rewrite to be > > not so crappy. If we cannot figure it out, it doesn't get in. > > When Linus tells someone a submission is crap, he _always_ has a sound, and > _specific_ technical justification for doing so. > > "This code is going to be a considerable maintenance burden" is vapid, and lazy. > It's the kind of feedback made by someone who has looked at the number of lines > of code a patch touches and not much more. Then you have probably missed a huge part of my emails. Please re-read. If those arguments are not clear, feel free to ask for clarification. Reducing the whole my reasoning and objections to the sentence above and calling that vapid and lazy is not only unfair but also disrespectful.
On Thu, Sep 08, 2022 at 09:12:45AM +0200, Michal Hocko wrote: > Then you have probably missed a huge part of my emails. Please > re-read. If those arguments are not clear, feel free to ask for > clarification. Reducing the whole my reasoning and objections to the > sentence above and calling that vapid and lazy is not only unfair but > also disrespectful. What, where you complained about slab's page allocations showing up in the profile instead of slab, and I pointed out to you that actually each and every slab call is instrumented, and you're just seeing some double counting (that we will no doubt fix?) Or when you complained about allocation sites where it should actually be the caller that should be instrumented, and I pointed out that it'd be quite easy to simply change that code to use _kmalloc() and slab_tag_add() directly, if it becomes an issue. Of course, if we got that far, we'd have this code to thank for telling us where to look! Did I miss anything?
On Thu 08-09-22 03:29:50, Kent Overstreet wrote: > On Thu, Sep 08, 2022 at 09:12:45AM +0200, Michal Hocko wrote: > > Then you have probably missed a huge part of my emails. Please > > re-read. If those arguments are not clear, feel free to ask for > > clarification. Reducing the whole my reasoning and objections to the > > sentence above and calling that vapid and lazy is not only unfair but > > also disrespectful. > > What, where you complained about slab's page allocations showing up in the > profile instead of slab, and I pointed out to you that actually each and every > slab call is instrumented, and you're just seeing some double counting (that we > will no doubt fix?) > > Or when you complained about allocation sites where it should actually be the > caller that should be instrumented, and I pointed out that it'd be quite easy to > simply change that code to use _kmalloc() and slab_tag_add() directly, if it > becomes an issue. > > Of course, if we got that far, we'd have this code to thank for telling us where > to look! > > Did I miss anything? Feel free to reponse to specific arguments as I wrote them. I won't repeat them again. Sure we can discuss how important/relevant those are. And that _can_ be a productive discussion.