Message ID | 20230501165450.15352-1-surenb@google.com (mailing list archive) |
---|---|
Headers | show |
Series | Memory allocation profiling | expand |
On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote: > Performance overhead: > To evaluate performance we implemented an in-kernel test executing > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > affinity set to a specific CPU to minimize the noise. Below is performance > comparison between the baseline kernel, profiling when enabled, profiling > when disabled (nomem_profiling=y) and (for comparison purposes) baseline > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT: > > kmalloc pgalloc > Baseline (6.3-rc7) 9.200s 31.050s > profiling disabled 9.800 (+6.52%) 32.600 (+4.99%) > profiling enabled 12.500 (+35.87%) 39.010 (+25.60%) > memcg_kmem enabled 41.400 (+350.00%) 70.600 (+127.38%) Hm, this makes me think we have a regression with memcg_kmem in one of the recent releases. When I measured it a couple of years ago, the overhead was definitely within 100%. Do you understand what makes the your profiling drastically faster than kmem? Thanks!
On Mon, May 1, 2023 at 10:47 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote: > > Performance overhead: > > To evaluate performance we implemented an in-kernel test executing > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > affinity set to a specific CPU to minimize the noise. Below is performance > > comparison between the baseline kernel, profiling when enabled, profiling > > when disabled (nomem_profiling=y) and (for comparison purposes) baseline > > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT: > > > > kmalloc pgalloc > > Baseline (6.3-rc7) 9.200s 31.050s > > profiling disabled 9.800 (+6.52%) 32.600 (+4.99%) > > profiling enabled 12.500 (+35.87%) 39.010 (+25.60%) > > memcg_kmem enabled 41.400 (+350.00%) 70.600 (+127.38%) > > Hm, this makes me think we have a regression with memcg_kmem in one of > the recent releases. When I measured it a couple of years ago, the overhead > was definitely within 100%. > > Do you understand what makes the your profiling drastically faster than kmem? I haven't profiled or looked into kmem overhead closely but I can do that. I just wanted to see how the overhead compares with the existing accounting mechanisms. For kmalloc, the overhead is low because after we create the vector of slab_ext objects (which is the same as what memcg_kmem does), memory profiling just increments a lazy counter (which in many cases would be a per-cpu counter). memcg_kmem operates on cgroup hierarchy with additional overhead associated with that. I'm guessing that's the reason for the big difference between these mechanisms but, I didn't look into the details to understand memcg_kmem performance. > > Thanks!
On Mon, May 01, 2023 at 11:08:05AM -0700, Suren Baghdasaryan wrote: > On Mon, May 1, 2023 at 10:47 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote: > > > Performance overhead: > > > To evaluate performance we implemented an in-kernel test executing > > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > > affinity set to a specific CPU to minimize the noise. Below is performance > > > comparison between the baseline kernel, profiling when enabled, profiling > > > when disabled (nomem_profiling=y) and (for comparison purposes) baseline > > > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT: > > > > > > kmalloc pgalloc > > > Baseline (6.3-rc7) 9.200s 31.050s > > > profiling disabled 9.800 (+6.52%) 32.600 (+4.99%) > > > profiling enabled 12.500 (+35.87%) 39.010 (+25.60%) > > > memcg_kmem enabled 41.400 (+350.00%) 70.600 (+127.38%) > > > > Hm, this makes me think we have a regression with memcg_kmem in one of > > the recent releases. When I measured it a couple of years ago, the overhead > > was definitely within 100%. > > > > Do you understand what makes the your profiling drastically faster than kmem? > > I haven't profiled or looked into kmem overhead closely but I can do > that. I just wanted to see how the overhead compares with the existing > accounting mechanisms. It's a good idea and I generally think that +25-35% for kmalloc/pgalloc should be ok for the production use, which is great! In the reality, most workloads are not that sensitive to the speed of memory allocation. > > For kmalloc, the overhead is low because after we create the vector of > slab_ext objects (which is the same as what memcg_kmem does), memory > profiling just increments a lazy counter (which in many cases would be > a per-cpu counter). So does kmem (this is why I'm somewhat surprised by the difference). > memcg_kmem operates on cgroup hierarchy with > additional overhead associated with that. I'm guessing that's the > reason for the big difference between these mechanisms but, I didn't > look into the details to understand memcg_kmem performance. I suspect recent rt-related changes and also the wide usage of rcu primitives in the kmem code. I'll try to look closer as well. Thanks!
On Mon, May 01, 2023 at 11:14:45AM -0700, Roman Gushchin wrote: > It's a good idea and I generally think that +25-35% for kmalloc/pgalloc > should be ok for the production use, which is great! > In the reality, most workloads are not that sensitive to the speed of > memory allocation. :) My main takeaway has been "the slub fast path is _really_ fast". No disabling of preemption, no atomic instructions, just a non locked double word cmpxchg - it's a slick piece of work. > > For kmalloc, the overhead is low because after we create the vector of > > slab_ext objects (which is the same as what memcg_kmem does), memory > > profiling just increments a lazy counter (which in many cases would be > > a per-cpu counter). > > So does kmem (this is why I'm somewhat surprised by the difference). > > > memcg_kmem operates on cgroup hierarchy with > > additional overhead associated with that. I'm guessing that's the > > reason for the big difference between these mechanisms but, I didn't > > look into the details to understand memcg_kmem performance. > > I suspect recent rt-related changes and also the wide usage of > rcu primitives in the kmem code. I'll try to look closer as well. Happy to give you something to compare against :)
On Mon, May 01, 2023 at 03:37:58PM -0400, Kent Overstreet wrote: > On Mon, May 01, 2023 at 11:14:45AM -0700, Roman Gushchin wrote: > > It's a good idea and I generally think that +25-35% for kmalloc/pgalloc > > should be ok for the production use, which is great! > > In the reality, most workloads are not that sensitive to the speed of > > memory allocation. > > :) > > My main takeaway has been "the slub fast path is _really_ fast". No > disabling of preemption, no atomic instructions, just a non locked > double word cmpxchg - it's a slick piece of work. > > > > For kmalloc, the overhead is low because after we create the vector of > > > slab_ext objects (which is the same as what memcg_kmem does), memory > > > profiling just increments a lazy counter (which in many cases would be > > > a per-cpu counter). > > > > So does kmem (this is why I'm somewhat surprised by the difference). > > > > > memcg_kmem operates on cgroup hierarchy with > > > additional overhead associated with that. I'm guessing that's the > > > reason for the big difference between these mechanisms but, I didn't > > > look into the details to understand memcg_kmem performance. > > > > I suspect recent rt-related changes and also the wide usage of > > rcu primitives in the kmem code. I'll try to look closer as well. > > Happy to give you something to compare against :) To be fair, it's not an apple-to-apple comparison, because: 1) memcgs are organized in a tree, these days usually with at least 3 layers, 2) memcgs are dynamic. In theory a task can be moved to a different memcg while performing a (very slow) allocation, and the original memcg can be released. To prevent this we have to perform a lot of operations which you can happily avoid. That said, there is clearly a place for optimization, so thank you for indirectly bringing this up. Thanks!
On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote: > Memory allocation profiling infrastructure provides a low overhead > mechanism to make all kernel allocations in the system visible. It can be > used to monitor memory usage, track memory hotspots, detect memory leaks, > identify memory regressions. > > To keep the overhead to the minimum, we record only allocation sizes for > every allocation in the codebase. With that information, if users are > interested in more detailed context for a specific allocation, they can > enable in-depth context tracking, which includes capturing the pid, tgid, > task name, allocation size, timestamp and call stack for every allocation > at the specified code location. [...] > Implementation utilizes a more generic concept of code tagging, introduced > as part of this patchset. Code tag is a structure identifying a specific > location in the source code which is generated at compile time and can be > embedded in an application-specific structure. A number of applications > for code tagging have been presented in the original RFC [1]. > Code tagging uses the old trick of "define a special elf section for > objects of a given type so that we can iterate over them at runtime" and > creates a proper library for it. > > To profile memory allocations, we instrument page, slab and percpu > allocators to record total memory allocated in the associated code tag at > every allocation in the codebase. Every time an allocation is performed by > an instrumented allocator, the code tag at that location increments its > counter by allocation size. Every time the memory is freed the counter is > decremented. To decrement the counter upon freeing, allocated object needs > a reference to its code tag. Page allocators use page_ext to record this > reference while slab allocators use memcg_data (renamed into more generic > slabobj_ext) of the slab page. [...] > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/ [...] > 70 files changed, 2765 insertions(+), 554 deletions(-) Sorry for cutting the cover considerably but I believe I have quoted the most important/interesting parts here. The approach is not fundamentally different from the previous version [1] and there was a significant discussion around this approach. The cover letter doesn't summarize nor deal with concerns expressed previous AFAICS. So let me bring those up back. At least those I find the most important: - This is a big change and it adds a significant maintenance burden because each allocation entry point needs to be handled specifically. The cost will grow with the intended coverage especially there when allocation is hidden in a library code. - It has been brought up that this is duplicating functionality already available via existing tracing infrastructure. You should make it very clear why that is not suitable for the job - We already have page_owner infrastructure that provides allocation tracking data. Why it cannot be used/extended? Thanks!
On Wed, May 03, 2023 at 09:25:29AM +0200, Michal Hocko wrote: > On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote: > > Memory allocation profiling infrastructure provides a low overhead > > mechanism to make all kernel allocations in the system visible. It can be > > used to monitor memory usage, track memory hotspots, detect memory leaks, > > identify memory regressions. > > > > To keep the overhead to the minimum, we record only allocation sizes for > > every allocation in the codebase. With that information, if users are > > interested in more detailed context for a specific allocation, they can > > enable in-depth context tracking, which includes capturing the pid, tgid, > > task name, allocation size, timestamp and call stack for every allocation > > at the specified code location. > [...] > > Implementation utilizes a more generic concept of code tagging, introduced > > as part of this patchset. Code tag is a structure identifying a specific > > location in the source code which is generated at compile time and can be > > embedded in an application-specific structure. A number of applications > > for code tagging have been presented in the original RFC [1]. > > Code tagging uses the old trick of "define a special elf section for > > objects of a given type so that we can iterate over them at runtime" and > > creates a proper library for it. > > > > To profile memory allocations, we instrument page, slab and percpu > > allocators to record total memory allocated in the associated code tag at > > every allocation in the codebase. Every time an allocation is performed by > > an instrumented allocator, the code tag at that location increments its > > counter by allocation size. Every time the memory is freed the counter is > > decremented. To decrement the counter upon freeing, allocated object needs > > a reference to its code tag. Page allocators use page_ext to record this > > reference while slab allocators use memcg_data (renamed into more generic > > slabobj_ext) of the slab page. > [...] > > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/ > [...] > > 70 files changed, 2765 insertions(+), 554 deletions(-) > > Sorry for cutting the cover considerably but I believe I have quoted the > most important/interesting parts here. The approach is not fundamentally > different from the previous version [1] and there was a significant > discussion around this approach. The cover letter doesn't summarize nor > deal with concerns expressed previous AFAICS. So let me bring those up > back. At least those I find the most important: We covered this previously, I'll just be giving the same answers I did before: > - This is a big change and it adds a significant maintenance burden > because each allocation entry point needs to be handled specifically. > The cost will grow with the intended coverage especially there when > allocation is hidden in a library code. We've made this as clean and simple as posssible: a single new macro invocation per allocation function, no calling convention changes (that would indeed have been a lot of churn!) > - It has been brought up that this is duplicating functionality already > available via existing tracing infrastructure. You should make it very > clear why that is not suitable for the job Tracing people _claimed_ this, but never demonstrated it. Tracepoints exist but the tooling that would consume them to provide this kind of information does not exist; it would require maintaining an index of _every outstanding allocation_ so that frees could be accounted correctly - IOW, it would be _drastically_ higher overhead, so not at all comparable. > - We already have page_owner infrastructure that provides allocation > tracking data. Why it cannot be used/extended? Page owner is also very high overhead, and the output is not very user friendly (tracking full call stack means many related overhead gets split, not generally what you want), and it doesn't cover slab. This tracks _all_ memory allocations - slab, page, vmalloc, percpu.
On Wed 03-05-23 03:34:21, Kent Overstreet wrote: > On Wed, May 03, 2023 at 09:25:29AM +0200, Michal Hocko wrote: > > On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote: > > > Memory allocation profiling infrastructure provides a low overhead > > > mechanism to make all kernel allocations in the system visible. It can be > > > used to monitor memory usage, track memory hotspots, detect memory leaks, > > > identify memory regressions. > > > > > > To keep the overhead to the minimum, we record only allocation sizes for > > > every allocation in the codebase. With that information, if users are > > > interested in more detailed context for a specific allocation, they can > > > enable in-depth context tracking, which includes capturing the pid, tgid, > > > task name, allocation size, timestamp and call stack for every allocation > > > at the specified code location. > > [...] > > > Implementation utilizes a more generic concept of code tagging, introduced > > > as part of this patchset. Code tag is a structure identifying a specific > > > location in the source code which is generated at compile time and can be > > > embedded in an application-specific structure. A number of applications > > > for code tagging have been presented in the original RFC [1]. > > > Code tagging uses the old trick of "define a special elf section for > > > objects of a given type so that we can iterate over them at runtime" and > > > creates a proper library for it. > > > > > > To profile memory allocations, we instrument page, slab and percpu > > > allocators to record total memory allocated in the associated code tag at > > > every allocation in the codebase. Every time an allocation is performed by > > > an instrumented allocator, the code tag at that location increments its > > > counter by allocation size. Every time the memory is freed the counter is > > > decremented. To decrement the counter upon freeing, allocated object needs > > > a reference to its code tag. Page allocators use page_ext to record this > > > reference while slab allocators use memcg_data (renamed into more generic > > > slabobj_ext) of the slab page. > > [...] > > > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/ > > [...] > > > 70 files changed, 2765 insertions(+), 554 deletions(-) > > > > Sorry for cutting the cover considerably but I believe I have quoted the > > most important/interesting parts here. The approach is not fundamentally > > different from the previous version [1] and there was a significant > > discussion around this approach. The cover letter doesn't summarize nor > > deal with concerns expressed previous AFAICS. So let me bring those up > > back. At least those I find the most important: > > We covered this previously, I'll just be giving the same answers I did > before: Your answers have shown your insight into tracing is very limited. I have a clear recollection there were many suggestions on how to get what you need and willingness to help out. Repeating your previous position will not help much to be honest with you. > > - This is a big change and it adds a significant maintenance burden > > because each allocation entry point needs to be handled specifically. > > The cost will grow with the intended coverage especially there when > > allocation is hidden in a library code. > > We've made this as clean and simple as posssible: a single new macro > invocation per allocation function, no calling convention changes (that > would indeed have been a lot of churn!) That doesn't really make the concern any less relevant. I believe you and Suren have made a great effort to reduce the churn as much as possible but looking at the diffstat the code changes are clearly there and you have to convince the rest of the community that this maintenance overhead is really worth it. The above statement hasn't helped to convinced me to be honest. > > - It has been brought up that this is duplicating functionality already > > available via existing tracing infrastructure. You should make it very > > clear why that is not suitable for the job > > Tracing people _claimed_ this, but never demonstrated it. The burden is on you and Suren. You are proposing the implement an alternative tracing infrastructure. > Tracepoints > exist but the tooling that would consume them to provide this kind of > information does not exist; Any reasons why an appropriate tooling cannot be developed? > it would require maintaining an index of > _every outstanding allocation_ so that frees could be accounted > correctly - IOW, it would be _drastically_ higher overhead, so not at > all comparable. Do you have any actual data points to prove your claim? > > - We already have page_owner infrastructure that provides allocation > > tracking data. Why it cannot be used/extended? > > Page owner is also very high overhead, Is there any data to prove that claim? I would be really surprised that page_owner would give higher overhead than page tagging with profiling enabled (there is an allocation for each allocation request!!!). We can discuss the bare bone page tagging comparision to page_owner because of the full stack unwinding but is that overhead really prohibitively costly? Can we reduce that by trimming the unwinder information? > and the output is not very user > friendly (tracking full call stack means many related overhead gets > split, not generally what you want), and it doesn't cover slab. Is this something we cannot do anything about? Have you explored any potential ways? > This tracks _all_ memory allocations - slab, page, vmalloc, percpu.
On Wed, May 03, 2023 at 09:51:49AM +0200, Michal Hocko wrote: > Your answers have shown your insight into tracing is very limited. I > have a clear recollection there were many suggestions on how to get what > you need and willingness to help out. Repeating your previous position > will not help much to be honest with you. Please enlighten us, oh wise one. > > > - It has been brought up that this is duplicating functionality already > > > available via existing tracing infrastructure. You should make it very > > > clear why that is not suitable for the job > > > > Tracing people _claimed_ this, but never demonstrated it. > > The burden is on you and Suren. You are proposing the implement an > alternative tracing infrastructure. No, we're still waiting on the tracing people to _demonstrate_, not claim, that this is at all possible in a comparable way with tracing. It's not on us to make your argument for you, and before making accusations about honesty you should try to be more honest yourself. The expectations you're trying to level have never been the norm in the kernel community, sorry. When there's a technical argument about the best way to do something, _code wins_ and we've got working code to do something that hasn't been possible previously. There's absolutely no rule that "tracing has to be the one and only tool for kernel visibility". I'm considering the tracing discussion closed until someone in the pro-tracing camp shows something new. > > > - We already have page_owner infrastructure that provides allocation > > > tracking data. Why it cannot be used/extended? > > > > Page owner is also very high overhead, > > Is there any data to prove that claim? I would be really surprised that > page_owner would give higher overhead than page tagging with profiling > enabled (there is an allocation for each allocation request!!!). We can > discuss the bare bone page tagging comparision to page_owner because of > the full stack unwinding but is that overhead really prohibitively costly? > Can we reduce that by trimming the unwinder information? Honestly, this isn't terribly relevant, because as noted before page owner is limited to just page allocations. > > > and the output is not very user > > friendly (tracking full call stack means many related overhead gets > > split, not generally what you want), and it doesn't cover slab. > > Is this something we cannot do anything about? Have you explored any > potential ways? > > > This tracks _all_ memory allocations - slab, page, vmalloc, percpu. Michel, the discussions with you seem to perpetually go in circles; it's clear you're negative on the patchset, you keep raising the same objections while refusing to concede a single point. I believe I've answered enough, so I'll leave off further discussions with you.
On Wed, 3 May 2023 09:51:49 +0200 Michal Hocko <mhocko@suse.com> wrote: > On Wed 03-05-23 03:34:21, Kent Overstreet wrote: >[...] > > We've made this as clean and simple as posssible: a single new macro > > invocation per allocation function, no calling convention changes (that > > would indeed have been a lot of churn!) > > That doesn't really make the concern any less relevant. I believe you > and Suren have made a great effort to reduce the churn as much as > possible but looking at the diffstat the code changes are clearly there > and you have to convince the rest of the community that this maintenance > overhead is really worth it. I believe this is the crucial point. I have my own concerns about the use of preprocessor macros, which goes against the basic idea of a code tagging framework (patch 13/40). AFAICS the CODE_TAG_INIT macro must be expanded on the same source code line as the tagged code, which makes it hard to use without further macros (unless you want to make the source code unreadable beyond imagination). That's why all allocation functions must be converted to macros. If anyone ever wants to use this code tagging framework for something else, they will also have to convert relevant functions to macros, slowly changing the kernel to a minefield where local identifiers, struct, union and enum tags, field names and labels must avoid name conflict with a tagged function. For now, I have to remember that alloc_pages is forbidden, but the list may grow. FWIW I can see some occurences of "alloc_pages" under arch/ which are not renamed by patch 19/40 of this series. For instance, does the kernel build for s390x after applying the patch series? New code may also work initially, but explode after adding an #include later... HOWEVER, if the rest of the community agrees that the added value of code tagging is worth all these potential risks, I can live with it. Petr T
On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > On Wed, 3 May 2023 09:51:49 +0200 > Michal Hocko <mhocko@suse.com> wrote: > > > On Wed 03-05-23 03:34:21, Kent Overstreet wrote: > >[...] > > > We've made this as clean and simple as posssible: a single new macro > > > invocation per allocation function, no calling convention changes (that > > > would indeed have been a lot of churn!) > > > > That doesn't really make the concern any less relevant. I believe you > > and Suren have made a great effort to reduce the churn as much as > > possible but looking at the diffstat the code changes are clearly there > > and you have to convince the rest of the community that this maintenance > > overhead is really worth it. > > I believe this is the crucial point. > > I have my own concerns about the use of preprocessor macros, which goes > against the basic idea of a code tagging framework (patch 13/40). > AFAICS the CODE_TAG_INIT macro must be expanded on the same source code > line as the tagged code, which makes it hard to use without further > macros (unless you want to make the source code unreadable beyond > imagination). That's why all allocation functions must be converted to > macros. > > If anyone ever wants to use this code tagging framework for something > else, they will also have to convert relevant functions to macros, > slowly changing the kernel to a minefield where local identifiers, > struct, union and enum tags, field names and labels must avoid name > conflict with a tagged function. For now, I have to remember that > alloc_pages is forbidden, but the list may grow. No, we've got other code tagging applications (that have already been posted!) and they don't "convert functions to macros" in the way this patchset does - they do introduce new macros, but as new identifiers, which we do all the time. This was simply the least churny way to hook memory allocations.
On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > If anyone ever wants to use this code tagging framework for something > else, they will also have to convert relevant functions to macros, > slowly changing the kernel to a minefield where local identifiers, > struct, union and enum tags, field names and labels must avoid name > conflict with a tagged function. For now, I have to remember that > alloc_pages is forbidden, but the list may grow. Also, since you're not actually a kernel contributor yet... It's not really good decorum to speculate in code review about things that can be answered by just reading the code. If you're going to comment, please do the necessary work to make sure you're saying something that makes sense.
On Wed, 3 May 2023 05:54:43 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > > On Wed, 3 May 2023 09:51:49 +0200 > > Michal Hocko <mhocko@suse.com> wrote: > > > > > On Wed 03-05-23 03:34:21, Kent Overstreet wrote: > > >[...] > > > > We've made this as clean and simple as posssible: a single new macro > > > > invocation per allocation function, no calling convention changes (that > > > > would indeed have been a lot of churn!) > > > > > > That doesn't really make the concern any less relevant. I believe you > > > and Suren have made a great effort to reduce the churn as much as > > > possible but looking at the diffstat the code changes are clearly there > > > and you have to convince the rest of the community that this maintenance > > > overhead is really worth it. > > > > I believe this is the crucial point. > > > > I have my own concerns about the use of preprocessor macros, which goes > > against the basic idea of a code tagging framework (patch 13/40). > > AFAICS the CODE_TAG_INIT macro must be expanded on the same source code > > line as the tagged code, which makes it hard to use without further > > macros (unless you want to make the source code unreadable beyond > > imagination). That's why all allocation functions must be converted to > > macros. > > > > If anyone ever wants to use this code tagging framework for something > > else, they will also have to convert relevant functions to macros, > > slowly changing the kernel to a minefield where local identifiers, > > struct, union and enum tags, field names and labels must avoid name > > conflict with a tagged function. For now, I have to remember that > > alloc_pages is forbidden, but the list may grow. > > No, we've got other code tagging applications (that have already been > posted!) and they don't "convert functions to macros" in the way this > patchset does - they do introduce new macros, but as new identifiers, > which we do all the time. Yes, new all-lowercase macros which do not expand to a single identifier are still added under include/linux. It's unfortunate IMO, but it's a fact of life. You have a point here. > This was simply the least churny way to hook memory allocations. This is a bold statement. You certainly know what you plan to do, but other people keep coming up with ideas... Like, anyone would like to tag semaphore use: up() and down()? Don't get me wrong. I can see how the benefits of code tagging, and I agree that my concerns are not very strong. I just want that the consequences are understood and accepted, and they don't take us by surprise. Petr T
On Wed, 3 May 2023 05:57:15 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > > If anyone ever wants to use this code tagging framework for something > > else, they will also have to convert relevant functions to macros, > > slowly changing the kernel to a minefield where local identifiers, > > struct, union and enum tags, field names and labels must avoid name > > conflict with a tagged function. For now, I have to remember that > > alloc_pages is forbidden, but the list may grow. > > Also, since you're not actually a kernel contributor yet... I see, I've been around only since 2007... https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a97468024fb5b6eccee2a67a7796485c829343a Petr T
On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote: > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > > If anyone ever wants to use this code tagging framework for > > something > > else, they will also have to convert relevant functions to macros, > > slowly changing the kernel to a minefield where local identifiers, > > struct, union and enum tags, field names and labels must avoid name > > conflict with a tagged function. For now, I have to remember that > > alloc_pages is forbidden, but the list may grow. > > Also, since you're not actually a kernel contributor yet... You have an amazing talent for being wrong. But even if you were actually right about this, it would be an ad hominem personal attack on a new contributor which crosses the line into unacceptable behaviour on the list and runs counter to our code of conduct. James
On Wed, 3 May 2023 04:05:08 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > > The burden is on you and Suren. You are proposing the implement an > > alternative tracing infrastructure. > > No, we're still waiting on the tracing people to _demonstrate_, not > claim, that this is at all possible in a comparable way with tracing. It's not my job to do your work for you! I gave you hints on how you can do this with attaching to existing trace events and your response was "If you don't think it's hard, go ahead and show us." No! I'm too busy with my own work to do free work for you! https://lore.kernel.org/all/20220905235007.sc4uk6illlog62fl@kmo-framework/ I know it's easier to create something from scratch that you fully know, than to work with an existing infrastructure that you need to spend effort and learn to make it do what you want. But by recreating the work, you now pass the burden onto everyone else that needs to learn what you did. Not to mention, we would likely have multiple ways to do the same thing. Sorry, but that's not how an open source community is suppose to work. -- Steve
On Wed, May 3, 2023 at 5:34 AM James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote: > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > > > If anyone ever wants to use this code tagging framework for > > > something > > > else, they will also have to convert relevant functions to macros, > > > slowly changing the kernel to a minefield where local identifiers, > > > struct, union and enum tags, field names and labels must avoid name > > > conflict with a tagged function. For now, I have to remember that > > > alloc_pages is forbidden, but the list may grow. > > > > Also, since you're not actually a kernel contributor yet... > > You have an amazing talent for being wrong. But even if you were > actually right about this, it would be an ad hominem personal attack on > a new contributor which crosses the line into unacceptable behaviour on > the list and runs counter to our code of conduct. Kent, I asked you before and I'm asking you again. Please focus on the technical discussion and stop personal attacks. That is extremely counter-productive. > > James > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote: > > Memory allocation profiling infrastructure provides a low overhead > > mechanism to make all kernel allocations in the system visible. It can be > > used to monitor memory usage, track memory hotspots, detect memory leaks, > > identify memory regressions. > > > > To keep the overhead to the minimum, we record only allocation sizes for > > every allocation in the codebase. With that information, if users are > > interested in more detailed context for a specific allocation, they can > > enable in-depth context tracking, which includes capturing the pid, tgid, > > task name, allocation size, timestamp and call stack for every allocation > > at the specified code location. > [...] > > Implementation utilizes a more generic concept of code tagging, introduced > > as part of this patchset. Code tag is a structure identifying a specific > > location in the source code which is generated at compile time and can be > > embedded in an application-specific structure. A number of applications > > for code tagging have been presented in the original RFC [1]. > > Code tagging uses the old trick of "define a special elf section for > > objects of a given type so that we can iterate over them at runtime" and > > creates a proper library for it. > > > > To profile memory allocations, we instrument page, slab and percpu > > allocators to record total memory allocated in the associated code tag at > > every allocation in the codebase. Every time an allocation is performed by > > an instrumented allocator, the code tag at that location increments its > > counter by allocation size. Every time the memory is freed the counter is > > decremented. To decrement the counter upon freeing, allocated object needs > > a reference to its code tag. Page allocators use page_ext to record this > > reference while slab allocators use memcg_data (renamed into more generic > > slabobj_ext) of the slab page. > [...] > > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/ > [...] > > 70 files changed, 2765 insertions(+), 554 deletions(-) > > Sorry for cutting the cover considerably but I believe I have quoted the > most important/interesting parts here. The approach is not fundamentally > different from the previous version [1] and there was a significant > discussion around this approach. The cover letter doesn't summarize nor > deal with concerns expressed previous AFAICS. So let me bring those up > back. Thanks for summarizing! > At least those I find the most important: > - This is a big change and it adds a significant maintenance burden > because each allocation entry point needs to be handled specifically. > The cost will grow with the intended coverage especially there when > allocation is hidden in a library code. Do you mean with more allocations in the codebase more codetags will be generated? Is that the concern? Or maybe as you commented in another patch that context capturing feature does not limit how many stacks will be captured? > - It has been brought up that this is duplicating functionality already > available via existing tracing infrastructure. You should make it very > clear why that is not suitable for the job I experimented with using tracing with _RET_IP_ to implement this accounting. The major issue is the _RET_IP_ to codetag lookup runtime overhead which is orders of magnitude higher than proposed code tagging approach. With code tagging proposal, that link is resolved at compile time. Since we want this mechanism deployed in production, we want to keep the overhead to the absolute minimum. You asked me before how much overhead would be tolerable and the answer will always be "as small as possible". This is especially true for slab allocators which are ridiculously fast and regressing them would be very noticable (due to the frequent use). There is another issue, which I think can be solved in a smart way but will either affect performance or would require more memory. With the tracing approach we don't know beforehand how many individual allocation sites exist, so we have to allocate code tags (or similar structures for counting) at runtime vs compile time. We can be smart about it and allocate in batches or even preallocate more than we need beforehand but, as I said, it will require some kind of compromise. I understand that code tagging creates additional maintenance burdens but I hope it also produces enough benefits that people will want this. The cost is also hopefully amortized when additional applications like the ones we presented in RFC [1] are built using the same framework. > - We already have page_owner infrastructure that provides allocation > tracking data. Why it cannot be used/extended? 1. The overhead. 2. Covers only page allocators. I didn't think about extending the page_owner approach to slab allocators but I suspect it would not be trivial. I don't see attaching an owner to every slab object to be a scalable solution. The overhead would again be of concern here. I should point out that there was one important technical concern about lack of a kill switch for this feature, which was an issue for distributions that can't disable the CONFIG flag. In this series we addressed that concern. [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/ Thanks, Suren. > > Thanks! > -- > Michal Hocko > SUSE Labs
On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote: > On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote: > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > > > If anyone ever wants to use this code tagging framework for > > > something > > > else, they will also have to convert relevant functions to macros, > > > slowly changing the kernel to a minefield where local identifiers, > > > struct, union and enum tags, field names and labels must avoid name > > > conflict with a tagged function. For now, I have to remember that > > > alloc_pages is forbidden, but the list may grow. > > > > Also, since you're not actually a kernel contributor yet... > > You have an amazing talent for being wrong. But even if you were > actually right about this, it would be an ad hominem personal attack on > a new contributor which crosses the line into unacceptable behaviour on > the list and runs counter to our code of conduct. ...Err, what? That was intended _in no way_ as a personal attack. If I was mistaken I do apologize, but lately I've run across quite a lot of people offering review feedback to patches I post that turn out to have 0 or 10 patches in the kernel, and - to be blunt - a pattern of offering feedback in strong language with a presumption of experience that takes a lot to respond to adequately on a technical basis. I don't think a suggestion to spend a bit more time reading code instead of speculating is out of order! We could all, put more effort into how we offer review feedback.
On Wed, May 03, 2023 at 12:26:27PM +0200, Petr Tesařík wrote: > On Wed, 3 May 2023 05:57:15 -0400 > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > > > If anyone ever wants to use this code tagging framework for something > > > else, they will also have to convert relevant functions to macros, > > > slowly changing the kernel to a minefield where local identifiers, > > > struct, union and enum tags, field names and labels must avoid name > > > conflict with a tagged function. For now, I have to remember that > > > alloc_pages is forbidden, but the list may grow. > > > > Also, since you're not actually a kernel contributor yet... > > I see, I've been around only since 2007... > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a97468024fb5b6eccee2a67a7796485c829343a My sincere apologies :) I'd searched for your name and email and found nothing, whoops.
On Wed, May 03, 2023 at 11:28:06AM -0400, Kent Overstreet wrote: > On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote: > > On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote: > > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > > > > If anyone ever wants to use this code tagging framework for > > > > something > > > > else, they will also have to convert relevant functions to macros, > > > > slowly changing the kernel to a minefield where local identifiers, > > > > struct, union and enum tags, field names and labels must avoid name > > > > conflict with a tagged function. For now, I have to remember that > > > > alloc_pages is forbidden, but the list may grow. > > > > > > Also, since you're not actually a kernel contributor yet... > > > > You have an amazing talent for being wrong. But even if you were > > actually right about this, it would be an ad hominem personal attack on > > a new contributor which crosses the line into unacceptable behaviour on > > the list and runs counter to our code of conduct. > > ...Err, what? That was intended _in no way_ as a personal attack. > As an outside observer, I can assure you that absolutely came across as a personal attack, and the precise kind that puts people off from contributing. I should know as a hobbyist contributor myself. > If I was mistaken I do apologize, but lately I've run across quite a lot > of people offering review feedback to patches I post that turn out to > have 0 or 10 patches in the kernel, and - to be blunt - a pattern of > offering feedback in strong language with a presumption of experience > that takes a lot to respond to adequately on a technical basis. > I, who may very well not merit being considered a contributor of significant merit in your view, have had such 'drive-by' commentary on some of my patches by precisely this type of person, and at no time felt the need to question whether they were a true Scotsman or not. It's simply not productive. > I don't think a suggestion to spend a bit more time reading code instead > of speculating is out of order! We could all, put more effort into how > we offer review feedback. It's the means by which you say it that counts for everything. If you feel the technical comments might not be merited on a deeper level, perhaps ask a broader question, or even don't respond at all? There are other means available. It's remarkable the impact comments like the one you made can have on contributors, certainly those of us who are not maintainers and are naturally plagued with imposter syndrome, so I would ask you on a human level to try to be a little more considerate. By all means address technical issues as robustly as you feel appropriate, that is after all the purpose of code review, but just take a step back and perhaps find the 'cuddlier' side of yourself when not addressing technical things :)
On Wed, 2023-05-03 at 11:28 -0400, Kent Overstreet wrote: > On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote: > > On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote: > > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote: > > > > If anyone ever wants to use this code tagging framework for > > > > something else, they will also have to convert relevant > > > > functions to macros, slowly changing the kernel to a minefield > > > > where local identifiers, struct, union and enum tags, field > > > > names and labels must avoid name conflict with a tagged > > > > function. For now, I have to remember that alloc_pages is > > > > forbidden, but the list may grow. > > > > > > Also, since you're not actually a kernel contributor yet... > > > > You have an amazing talent for being wrong. But even if you were > > actually right about this, it would be an ad hominem personal > > attack on a new contributor which crosses the line into > > unacceptable behaviour on the list and runs counter to our code of > > conduct. > > ...Err, what? That was intended _in no way_ as a personal attack. Your reply went on to say "If you're going to comment, please do the necessary work to make sure you're saying something that makes sense." That is a personal attack belittling the person involved and holding them up for general contempt on the mailing list. This is exactly how we should *not* treat newcomers. > If I was mistaken I do apologize, but lately I've run across quite a > lot of people offering review feedback to patches I post that turn > out to have 0 or 10 patches in the kernel, and - to be blunt - a > pattern of offering feedback in strong language with a presumption of > experience that takes a lot to respond to adequately on a technical > basis. A synopsis of the feedback is that using macros to attach trace tags pollutes the global function namespace of the kernel. That's a valid observation and merits a technical not a personal response. James
On Wed, May 03, 2023 at 04:37:36PM +0100, Lorenzo Stoakes wrote: > As an outside observer, I can assure you that absolutely came across as a > personal attack, and the precise kind that puts people off from > contributing. I should know as a hobbyist contributor myself. > > > If I was mistaken I do apologize, but lately I've run across quite a lot > > of people offering review feedback to patches I post that turn out to > > have 0 or 10 patches in the kernel, and - to be blunt - a pattern of > > offering feedback in strong language with a presumption of experience > > that takes a lot to respond to adequately on a technical basis. > > > > I, who may very well not merit being considered a contributor of > significant merit in your view, have had such 'drive-by' commentary on some > of my patches by precisely this type of person, and at no time felt the > need to question whether they were a true Scotsman or not. It's simply not > productive. > > > I don't think a suggestion to spend a bit more time reading code instead > > of speculating is out of order! We could all, put more effort into how > > we offer review feedback. > > It's the means by which you say it that counts for everything. If you feel > the technical comments might not be merited on a deeper level, perhaps ask > a broader question, or even don't respond at all? There are other means > available. > > It's remarkable the impact comments like the one you made can have on > contributors, certainly those of us who are not maintainers and are > naturally plagued with imposter syndrome, so I would ask you on a human > level to try to be a little more considerate. > > By all means address technical issues as robustly as you feel appropriate, > that is after all the purpose of code review, but just take a step back and > perhaps find the 'cuddlier' side of yourself when not addressing technical > things :) Thanks for your reply, it's level headed and appreciated. But I personally value directness, and I see quite a few people in this thread going all out on the tone policing - but look, without the directness the confusion (that Petr is not actually a new contributor) never would've been cleared up. Food for thought, perhaps?
On Wed, 3 May 2023 08:09:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote: > There is another issue, which I think can be solved in a smart way but > will either affect performance or would require more memory. With the > tracing approach we don't know beforehand how many individual > allocation sites exist, so we have to allocate code tags (or similar > structures for counting) at runtime vs compile time. We can be smart > about it and allocate in batches or even preallocate more than we need > beforehand but, as I said, it will require some kind of compromise. This approach is actually quite common, especially since tagging every instance is usually overkill, as if you trace function calls in a running kernel, you will find that only a small percentage of the kernel ever executes. It's possible that you will be allocating a lot of tags that will never be used. If run time allocation is possible, that is usually the better approach. -- Steve
Hello, Kent. On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote: > No, we're still waiting on the tracing people to _demonstrate_, not > claim, that this is at all possible in a comparable way with tracing. So, we (meta) happen to do stuff like this all the time in the fleet to hunt down tricky persistent problems like memory leaks, ref leaks, what-have-you. In recent kernels, with kprobe and BPF, our ability to debug these sorts of problems has improved a great deal. Below, I'm attaching a bcc script I used to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks can follow the same pattern. There are of course some pros and cons to this approach: Pros: * The framework doesn't really have any runtime overhead, so we can have it deployed in the entire fleet and debug wherever problem is. * It's fully flexible and programmable which enables non-trivial filtering and summarizing to be done inside kernel w/ BPF as necessary, which is pretty handy for tracking high frequency events. * BPF is pretty performant. Dedicated built-in kernel code can do better of course but BPF's jit compiled code & its data structures are fast enough. I don't remember any time this was a problem. Cons: * BPF has some learning curve. Also the fact that what it provides is a wide open field rather than something scoped out for a specific problem can make it seem a bit daunting at the beginning. * Because tracking starts when the script starts running, it doesn't know anything which has happened upto that point, so you gotta pay attention to handling e.g. handling frees which don't match allocs. It's kinda annoying but not a huge problem usually. There are ways to build in BPF progs into the kernel and load it early but I haven't experiemnted with it yet personally. I'm not necessarily against adding dedicated memory debugging mechanism but do wonder whether the extra benefits would be enough to justify the code and maintenance overhead. Oh, a bit of delta but for anyone who's more interested in debugging problems like this, while I tend to go for bcc (https://github.com/iovisor/bcc) for this sort of problems. Others prefer to write against libbpf directly or use bpftrace (https://github.com/iovisor/bpftrace). Thanks. #!/usr/bin/env bcc-py import bcc import time import datetime import argparse import os import sys import errno description = """ Record vmalloc/vfrees and trigger on unmatched vfree """ bpf_source = """ #include <uapi/linux/ptrace.h> #include <linux/vmalloc.h> struct vmalloc_rec { unsigned long ptr; int last_alloc_stkid; int last_free_stkid; int this_stkid; bool allocated; }; BPF_STACK_TRACE(stacks, 8192); BPF_HASH(vmallocs, unsigned long, struct vmalloc_rec, 131072); BPF_ARRAY(dup_free, struct vmalloc_rec, 1); int kpret_vmalloc_node_range(struct pt_regs *ctx) { unsigned long ptr = PT_REGS_RC(ctx); uint32_t zkey = 0; struct vmalloc_rec rec_init = { }; struct vmalloc_rec *rec; int stkid; if (!ptr) return 0; stkid = stacks.get_stackid(ctx, 0); rec_init.ptr = ptr; rec_init.last_alloc_stkid = -1; rec_init.last_free_stkid = -1; rec_init.this_stkid = -1; rec = vmallocs.lookup_or_init(&ptr, &rec_init); rec->allocated = true; rec->last_alloc_stkid = stkid; return 0; } int kp_vfree(struct pt_regs *ctx, const void *addr) { unsigned long ptr = (unsigned long)addr; uint32_t zkey = 0; struct vmalloc_rec rec_init = { }; struct vmalloc_rec *rec; int stkid; stkid = stacks.get_stackid(ctx, 0); rec_init.ptr = ptr; rec_init.last_alloc_stkid = -1; rec_init.last_free_stkid = -1; rec_init.this_stkid = -1; rec = vmallocs.lookup_or_init(&ptr, &rec_init); if (!rec->allocated && rec->last_alloc_stkid >= 0) { rec->this_stkid = stkid; dup_free.update(&zkey, rec); } rec->allocated = false; rec->last_free_stkid = stkid; return 0; } """ bpf = bcc.BPF(text=bpf_source) bpf.attach_kretprobe(event="__vmalloc_node_range", fn_name="kpret_vmalloc_node_range"); bpf.attach_kprobe(event="vfree", fn_name="kp_vfree"); bpf.attach_kprobe(event="vfree_atomic", fn_name="kp_vfree"); stacks = bpf["stacks"] vmallocs = bpf["vmallocs"] dup_free = bpf["dup_free"] last_dup_free_ptr = dup_free[0].ptr def print_stack(stkid): for addr in stacks.walk(stkid): sym = bpf.ksym(addr) print(' {}'.format(sym)) def print_dup(dup): print('allocated={} ptr={}'.format(dup.allocated, hex(dup.ptr))) if (dup.last_alloc_stkid >= 0): print('last_alloc_stack: ') print_stack(dup.last_alloc_stkid) if (dup.last_free_stkid >= 0): print('last_free_stack: ') print_stack(dup.last_free_stkid) if (dup.this_stkid >= 0): print('this_stack: ') print_stack(dup.this_stkid) while True: time.sleep(1) if dup_free[0].ptr != last_dup_free_ptr: print('\nDUP_FREE:') print_dup(dup_free[0]) last_dup_free_ptr = dup_free[0].ptr
On Wed, May 3, 2023 at 9:28 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 3 May 2023 08:09:28 -0700 > Suren Baghdasaryan <surenb@google.com> wrote: > > > There is another issue, which I think can be solved in a smart way but > > will either affect performance or would require more memory. With the > > tracing approach we don't know beforehand how many individual > > allocation sites exist, so we have to allocate code tags (or similar > > structures for counting) at runtime vs compile time. We can be smart > > about it and allocate in batches or even preallocate more than we need > > beforehand but, as I said, it will require some kind of compromise. > > This approach is actually quite common, especially since tagging every > instance is usually overkill, as if you trace function calls in a running > kernel, you will find that only a small percentage of the kernel ever > executes. It's possible that you will be allocating a lot of tags that will > never be used. If run time allocation is possible, that is usually the > better approach. True but the memory overhead should not be prohibitive here. As a ballpark number, on my machine I see there are 4838 individual allocation locations and each codetag structure is 32 bytes, so that's 152KB. > > -- Steve
On Wed, May 3, 2023 at 9:35 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, Kent. > > On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote: > > No, we're still waiting on the tracing people to _demonstrate_, not > > claim, that this is at all possible in a comparable way with tracing. > > So, we (meta) happen to do stuff like this all the time in the fleet to hunt > down tricky persistent problems like memory leaks, ref leaks, what-have-you. > In recent kernels, with kprobe and BPF, our ability to debug these sorts of > problems has improved a great deal. Below, I'm attaching a bcc script I used > to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks > can follow the same pattern. Thanks for sharing, Tejun! > > There are of course some pros and cons to this approach: > > Pros: > > * The framework doesn't really have any runtime overhead, so we can have it > deployed in the entire fleet and debug wherever problem is. Do you mean it has no runtime overhead when disabled? If so, do you know what's the overhead when enabled? I want to understand if that's truly a viable solution to track all allocations (including slab) all the time. Thanks, Suren. > > * It's fully flexible and programmable which enables non-trivial filtering > and summarizing to be done inside kernel w/ BPF as necessary, which is > pretty handy for tracking high frequency events. > > * BPF is pretty performant. Dedicated built-in kernel code can do better of > course but BPF's jit compiled code & its data structures are fast enough. > I don't remember any time this was a problem. > > Cons: > > * BPF has some learning curve. Also the fact that what it provides is a wide > open field rather than something scoped out for a specific problem can > make it seem a bit daunting at the beginning. > > * Because tracking starts when the script starts running, it doesn't know > anything which has happened upto that point, so you gotta pay attention to > handling e.g. handling frees which don't match allocs. It's kinda annoying > but not a huge problem usually. There are ways to build in BPF progs into > the kernel and load it early but I haven't experiemnted with it yet > personally. > > I'm not necessarily against adding dedicated memory debugging mechanism but > do wonder whether the extra benefits would be enough to justify the code and > maintenance overhead. > > Oh, a bit of delta but for anyone who's more interested in debugging > problems like this, while I tend to go for bcc > (https://github.com/iovisor/bcc) for this sort of problems. Others prefer to > write against libbpf directly or use bpftrace > (https://github.com/iovisor/bpftrace). > > Thanks. > > #!/usr/bin/env bcc-py > > import bcc > import time > import datetime > import argparse > import os > import sys > import errno > > description = """ > Record vmalloc/vfrees and trigger on unmatched vfree > """ > > bpf_source = """ > #include <uapi/linux/ptrace.h> > #include <linux/vmalloc.h> > > struct vmalloc_rec { > unsigned long ptr; > int last_alloc_stkid; > int last_free_stkid; > int this_stkid; > bool allocated; > }; > > BPF_STACK_TRACE(stacks, 8192); > BPF_HASH(vmallocs, unsigned long, struct vmalloc_rec, 131072); > BPF_ARRAY(dup_free, struct vmalloc_rec, 1); > > int kpret_vmalloc_node_range(struct pt_regs *ctx) > { > unsigned long ptr = PT_REGS_RC(ctx); > uint32_t zkey = 0; > struct vmalloc_rec rec_init = { }; > struct vmalloc_rec *rec; > int stkid; > > if (!ptr) > return 0; > > stkid = stacks.get_stackid(ctx, 0); > > rec_init.ptr = ptr; > rec_init.last_alloc_stkid = -1; > rec_init.last_free_stkid = -1; > rec_init.this_stkid = -1; > > rec = vmallocs.lookup_or_init(&ptr, &rec_init); > rec->allocated = true; > rec->last_alloc_stkid = stkid; > return 0; > } > > int kp_vfree(struct pt_regs *ctx, const void *addr) > { > unsigned long ptr = (unsigned long)addr; > uint32_t zkey = 0; > struct vmalloc_rec rec_init = { }; > struct vmalloc_rec *rec; > int stkid; > > stkid = stacks.get_stackid(ctx, 0); > > rec_init.ptr = ptr; > rec_init.last_alloc_stkid = -1; > rec_init.last_free_stkid = -1; > rec_init.this_stkid = -1; > > rec = vmallocs.lookup_or_init(&ptr, &rec_init); > if (!rec->allocated && rec->last_alloc_stkid >= 0) { > rec->this_stkid = stkid; > dup_free.update(&zkey, rec); > } > > rec->allocated = false; > rec->last_free_stkid = stkid; > return 0; > } > """ > > bpf = bcc.BPF(text=bpf_source) > bpf.attach_kretprobe(event="__vmalloc_node_range", fn_name="kpret_vmalloc_node_range"); > bpf.attach_kprobe(event="vfree", fn_name="kp_vfree"); > bpf.attach_kprobe(event="vfree_atomic", fn_name="kp_vfree"); > > stacks = bpf["stacks"] > vmallocs = bpf["vmallocs"] > dup_free = bpf["dup_free"] > last_dup_free_ptr = dup_free[0].ptr > > def print_stack(stkid): > for addr in stacks.walk(stkid): > sym = bpf.ksym(addr) > print(' {}'.format(sym)) > > def print_dup(dup): > print('allocated={} ptr={}'.format(dup.allocated, hex(dup.ptr))) > if (dup.last_alloc_stkid >= 0): > print('last_alloc_stack: ') > print_stack(dup.last_alloc_stkid) > if (dup.last_free_stkid >= 0): > print('last_free_stack: ') > print_stack(dup.last_free_stkid) > if (dup.this_stkid >= 0): > print('this_stack: ') > print_stack(dup.this_stkid) > > while True: > time.sleep(1) > > if dup_free[0].ptr != last_dup_free_ptr: > print('\nDUP_FREE:') > print_dup(dup_free[0]) > last_dup_free_ptr = dup_free[0].ptr > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote: > Hello, Kent. > > On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote: > > No, we're still waiting on the tracing people to _demonstrate_, not > > claim, that this is at all possible in a comparable way with tracing. > > So, we (meta) happen to do stuff like this all the time in the fleet to hunt > down tricky persistent problems like memory leaks, ref leaks, what-have-you. > In recent kernels, with kprobe and BPF, our ability to debug these sorts of > problems has improved a great deal. Below, I'm attaching a bcc script I used > to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks > can follow the same pattern. > > There are of course some pros and cons to this approach: > > Pros: > > * The framework doesn't really have any runtime overhead, so we can have it > deployed in the entire fleet and debug wherever problem is. > > * It's fully flexible and programmable which enables non-trivial filtering > and summarizing to be done inside kernel w/ BPF as necessary, which is > pretty handy for tracking high frequency events. > > * BPF is pretty performant. Dedicated built-in kernel code can do better of > course but BPF's jit compiled code & its data structures are fast enough. > I don't remember any time this was a problem. You're still going to have the inherent overhead a separate index of outstanding memory allocations, so that frees can be decremented to the correct callsite. The BPF approach is going to be _way_ higher overhead if you try to use it as a general profiler, like this is.
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote: > Hello, Kent. > > On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote: > > No, we're still waiting on the tracing people to _demonstrate_, not > > claim, that this is at all possible in a comparable way with tracing. > > So, we (meta) happen to do stuff like this all the time in the fleet to hunt > down tricky persistent problems like memory leaks, ref leaks, what-have-you. > In recent kernels, with kprobe and BPF, our ability to debug these sorts of > problems has improved a great deal. Below, I'm attaching a bcc script I used > to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks > can follow the same pattern. > > There are of course some pros and cons to this approach: > > Pros: > > * The framework doesn't really have any runtime overhead, so we can have it > deployed in the entire fleet and debug wherever problem is. > > * It's fully flexible and programmable which enables non-trivial filtering > and summarizing to be done inside kernel w/ BPF as necessary, which is > pretty handy for tracking high frequency events. > > * BPF is pretty performant. Dedicated built-in kernel code can do better of > course but BPF's jit compiled code & its data structures are fast enough. > I don't remember any time this was a problem. > > Cons: > > * BPF has some learning curve. Also the fact that what it provides is a wide > open field rather than something scoped out for a specific problem can > make it seem a bit daunting at the beginning. > > * Because tracking starts when the script starts running, it doesn't know > anything which has happened upto that point, so you gotta pay attention to > handling e.g. handling frees which don't match allocs. It's kinda annoying > but not a huge problem usually. There are ways to build in BPF progs into > the kernel and load it early but I haven't experiemnted with it yet > personally. > > I'm not necessarily against adding dedicated memory debugging mechanism but > do wonder whether the extra benefits would be enough to justify the code and > maintenance overhead. > > Oh, a bit of delta but for anyone who's more interested in debugging > problems like this, while I tend to go for bcc > (https://github.com/iovisor/bcc) for this sort of problems. Others prefer to > write against libbpf directly or use bpftrace > (https://github.com/iovisor/bpftrace). Do you have example output? TBH I'm skeptical that it's even possible to do full memory allocation profiling with tracing/bpf, due to recursive memory allocations and needing an index of outstanding allcations.
On Wed, 3 May 2023 10:40:42 -0700 Suren Baghdasaryan <surenb@google.com> wrote: > > This approach is actually quite common, especially since tagging every > > instance is usually overkill, as if you trace function calls in a running > > kernel, you will find that only a small percentage of the kernel ever > > executes. It's possible that you will be allocating a lot of tags that will > > never be used. If run time allocation is possible, that is usually the > > better approach. > > True but the memory overhead should not be prohibitive here. As a > ballpark number, on my machine I see there are 4838 individual > allocation locations and each codetag structure is 32 bytes, so that's > 152KB. If it's not that big, then allocating at runtime should not be an issue either. If runtime allocation can make it less intrusive to the code, that would be more rationale to do so. -- Steve
Hello, Suren. On Wed, May 03, 2023 at 10:42:11AM -0700, Suren Baghdasaryan wrote: > > * The framework doesn't really have any runtime overhead, so we can have it > > deployed in the entire fleet and debug wherever problem is. > > Do you mean it has no runtime overhead when disabled? Yes, that's what I meant. > If so, do you know what's the overhead when enabled? I want to > understand if that's truly a viable solution to track all allocations > (including slab) all the time. (cc'ing Alexei and Andrii who know a lot better than me) I don't have enough concrete benchmark data on the hand to answer definitively but hopefully what my general impresison would help. We attach BPF programs to both per-packet and per-IO paths. They obviously aren't free but their overhead isn't signficantly higher than building in the same thing in C code. Once loaded, BPF progs are jit compiled into native code. The generated code will be a bit worse than regularly compiled C code but those are really micro differences. There's some bridging code to jump into BPF but again negligible / acceptable even in the hottest paths. In terms of execution overhead, I don't think there is a signficant disadvantage to doing these things in BPF. Bigger differences would likely be in tracking data structures and locking around them. One can definitely better integrate tracking into alloc / free paths piggybacking on existing locking and whatnot. That said, BPF hashtable is pretty fast and BPF is constantly improving in terms of data structure support. It really depends on the workload and how much overhead one considers acceptable and I'm sure persistent global tracking can be done more efficiently with built-in C code. That said, done right, the overhead difference most likely isn't gonna be orders of magnitude but more like in the realm of tens of percents, if that. So, it doesn't nullify the benefits a dedicated mechansim can bring but does change the conversation quite a bit. Is the extra code justifiable given that most of what it enables is already possible using a more generic mechanism, albeit at a bit higher cost? That may well be the case but it does raise the bar. Thanks.
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote: > Hello, Kent. > > On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote: > > No, we're still waiting on the tracing people to _demonstrate_, not > > claim, that this is at all possible in a comparable way with tracing. > > So, we (meta) happen to do stuff like this all the time in the fleet to hunt > down tricky persistent problems like memory leaks, ref leaks, what-have-you. > In recent kernels, with kprobe and BPF, our ability to debug these sorts of > problems has improved a great deal. Below, I'm attaching a bcc script I used > to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks > can follow the same pattern. > > There are of course some pros and cons to this approach: > > Pros: > > * The framework doesn't really have any runtime overhead, so we can have it > deployed in the entire fleet and debug wherever problem is. > > * It's fully flexible and programmable which enables non-trivial filtering > and summarizing to be done inside kernel w/ BPF as necessary, which is > pretty handy for tracking high frequency events. > > * BPF is pretty performant. Dedicated built-in kernel code can do better of > course but BPF's jit compiled code & its data structures are fast enough. > I don't remember any time this was a problem. > > Cons: > > * BPF has some learning curve. Also the fact that what it provides is a wide > open field rather than something scoped out for a specific problem can > make it seem a bit daunting at the beginning. > > * Because tracking starts when the script starts running, it doesn't know > anything which has happened upto that point, so you gotta pay attention to > handling e.g. handling frees which don't match allocs. It's kinda annoying > but not a huge problem usually. There are ways to build in BPF progs into > the kernel and load it early but I haven't experiemnted with it yet > personally. Yeah, early loading is definitely important, especially before module loading etc. One common usecase is that we see a machine in the wild with a high amount of kernel memory disappearing somewhere that isn't voluntarily reported in vmstat/meminfo. Reproducing it isn't always practical. Something that records early and always (with acceptable runtime overhead) would be the holy grail. Matching allocs to frees is doable using the pfn as the key for pages, and virtual addresses for slab objects. The biggest issue I had when I tried with bpf was losing updates to the map. IIRC there is some trylocking going on to avoid deadlocks from nested contexts (alloc interrupted, interrupt frees). It doesn't sound like an unsolvable problem, though. Another minor thing was the stack trace map exploding on a basically infinite number of unique interrupt stacks. This could probably also be solved by extending the trace extraction API to cut the frames off at the context switch boundary. Taking a step back though, given the multitude of allocation sites in the kernel, it's a bit odd that the only accounting we do is the tiny fraction of voluntary vmstat/meminfo reporting. We try to cover the biggest consumers with this of course, but it's always going to be incomplete and is maintenance overhead too. There are on average several gigabytes in unknown memory (total - known vmstats) on our machines. It's difficult to detect regressions easily. And it's per definition the unexpected cornercases that are the trickiest to track down. So it might be doable with BPF, but it does feel like the kernel should do a better job of tracking out of the box and without requiring too much plumbing and somewhat fragile kernel allocation API tracking and probing from userspace.
On Wed, May 3, 2023 at 11:03 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 3 May 2023 10:40:42 -0700 > Suren Baghdasaryan <surenb@google.com> wrote: > > > > This approach is actually quite common, especially since tagging every > > > instance is usually overkill, as if you trace function calls in a running > > > kernel, you will find that only a small percentage of the kernel ever > > > executes. It's possible that you will be allocating a lot of tags that will > > > never be used. If run time allocation is possible, that is usually the > > > better approach. > > > > True but the memory overhead should not be prohibitive here. As a > > ballpark number, on my machine I see there are 4838 individual > > allocation locations and each codetag structure is 32 bytes, so that's > > 152KB. > > If it's not that big, then allocating at runtime should not be an issue > either. If runtime allocation can make it less intrusive to the code, that > would be more rationale to do so. As I noted, this issue is minor since we can be smart about how we allocate these entries. The main issue is the performance overhead. The kmalloc path is extremely fast and very hot. Even adding a per-cpu increment in our patchset has a 35% overhead. Adding an additional lookup here would prevent us from having it enabled all the time in production. > > -- Steve
On Wed, May 03, 2023 at 02:03:37PM -0400, Steven Rostedt wrote: > On Wed, 3 May 2023 10:40:42 -0700 > Suren Baghdasaryan <surenb@google.com> wrote: > > > > This approach is actually quite common, especially since tagging every > > > instance is usually overkill, as if you trace function calls in a running > > > kernel, you will find that only a small percentage of the kernel ever > > > executes. It's possible that you will be allocating a lot of tags that will > > > never be used. If run time allocation is possible, that is usually the > > > better approach. > > > > True but the memory overhead should not be prohibitive here. As a > > ballpark number, on my machine I see there are 4838 individual > > allocation locations and each codetag structure is 32 bytes, so that's > > 152KB. > > If it's not that big, then allocating at runtime should not be an issue > either. If runtime allocation can make it less intrusive to the code, that > would be more rationale to do so. We're more optimizing for runtime overhead - a major goal of this patchset was to be cheap enough to be always on, we've got too many debugging features that are really useful, but too expensive to have on all the time. Doing more runtime allocation would add another pointer fetch to the fast paths - and I don't see how it would even be possible to runtime allocate the codetag struct itself. We already do runtime allocation of percpu counters; see the lazy percpu counter patch.
Hello, On Wed, May 03, 2023 at 02:07:26PM -0400, Johannes Weiner wrote: ... > > * Because tracking starts when the script starts running, it doesn't know > > anything which has happened upto that point, so you gotta pay attention to > > handling e.g. handling frees which don't match allocs. It's kinda annoying > > but not a huge problem usually. There are ways to build in BPF progs into > > the kernel and load it early but I haven't experiemnted with it yet > > personally. > > Yeah, early loading is definitely important, especially before module > loading etc. > > One common usecase is that we see a machine in the wild with a high > amount of kernel memory disappearing somewhere that isn't voluntarily > reported in vmstat/meminfo. Reproducing it isn't always > practical. Something that records early and always (with acceptable > runtime overhead) would be the holy grail. > > Matching allocs to frees is doable using the pfn as the key for pages, > and virtual addresses for slab objects. > > The biggest issue I had when I tried with bpf was losing updates to > the map. IIRC there is some trylocking going on to avoid deadlocks > from nested contexts (alloc interrupted, interrupt frees). It doesn't > sound like an unsolvable problem, though. (cc'ing Alexei and Andrii) This is the same thing that I hit with sched_ext. BPF plugged it for struct_ops but I wonder whether it can be done for specific maps / progs - ie. just declare that a given map or prog is not to be accessed from NMI and bypass the trylock deadlock avoidance mechanism. But, yeah, this should be addressed from BPF side. > Another minor thing was the stack trace map exploding on a basically > infinite number of unique interrupt stacks. This could probably also > be solved by extending the trace extraction API to cut the frames off > at the context switch boundary. > > Taking a step back though, given the multitude of allocation sites in > the kernel, it's a bit odd that the only accounting we do is the tiny > fraction of voluntary vmstat/meminfo reporting. We try to cover the > biggest consumers with this of course, but it's always going to be > incomplete and is maintenance overhead too. There are on average > several gigabytes in unknown memory (total - known vmstats) on our > machines. It's difficult to detect regressions easily. And it's per > definition the unexpected cornercases that are the trickiest to track > down. So it might be doable with BPF, but it does feel like the kernel > should do a better job of tracking out of the box and without > requiring too much plumbing and somewhat fragile kernel allocation API > tracking and probing from userspace. Yeah, easy / default visibility argument does make sense to me. Thanks.
Hello, On Wed, May 03, 2023 at 01:51:23PM -0400, Kent Overstreet wrote: > Do you have example output? Not right now. It's from many months ago. It's just a script I could find easily. > TBH I'm skeptical that it's even possible to do full memory allocation > profiling with tracing/bpf, due to recursive memory allocations and > needing an index of outstanding allcations. There are some issues e.g. w/ lossy updates which should be fixed from BPF side but we do run BPF on every single packet and IO on most of our machines, so basing this argument on whether tracking all memory allocations from BPF is possible is probably not a winning strategy for this proposal. Thanks.
On Wed, May 03, 2023 at 08:19:24AM -1000, Tejun Heo wrote: > > Taking a step back though, given the multitude of allocation sites in > > the kernel, it's a bit odd that the only accounting we do is the tiny > > fraction of voluntary vmstat/meminfo reporting. We try to cover the > > biggest consumers with this of course, but it's always going to be > > incomplete and is maintenance overhead too. There are on average > > several gigabytes in unknown memory (total - known vmstats) on our > > machines. It's difficult to detect regressions easily. And it's per > > definition the unexpected cornercases that are the trickiest to track > > down. So it might be doable with BPF, but it does feel like the kernel > > should do a better job of tracking out of the box and without > > requiring too much plumbing and somewhat fragile kernel allocation API > > tracking and probing from userspace. > > Yeah, easy / default visibility argument does make sense to me. So, a bit of addition here. If this is the thrust, the debugfs part seems rather redundant, right? That's trivially obtainable with tracing / bpf and in a more flexible and performant manner. Also, are we happy with recording just single depth for persistent tracking? Thanks.
On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote: > > Yeah, easy / default visibility argument does make sense to me. > > So, a bit of addition here. If this is the thrust, the debugfs part seems > rather redundant, right? That's trivially obtainable with tracing / bpf and > in a more flexible and performant manner. Also, are we happy with recording > just single depth for persistent tracking? Not sure what you're envisioning? I'd consider the debugfs interface pretty integral; it's much more discoverable for users, and it's hardly any code out of the whole patchset. Single depth was discussed previously. It's what makes it cheap enough to be always-on (saving stack traces is expensive!), and I find the output much more usable than e.g. page owner.
On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote: > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote: > > > Yeah, easy / default visibility argument does make sense to me. > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems > > rather redundant, right? That's trivially obtainable with tracing / bpf and > > in a more flexible and performant manner. Also, are we happy with recording > > just single depth for persistent tracking? > > Not sure what you're envisioning? > > I'd consider the debugfs interface pretty integral; it's much more > discoverable for users, and it's hardly any code out of the whole > patchset. You can do the same thing with a bpftrace one liner tho. That's rather difficult to beat. Thanks.
On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote: > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote: > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote: > > > > Yeah, easy / default visibility argument does make sense to me. > > > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems > > > rather redundant, right? That's trivially obtainable with tracing / bpf and > > > in a more flexible and performant manner. Also, are we happy with recording > > > just single depth for persistent tracking? > > > > Not sure what you're envisioning? > > > > I'd consider the debugfs interface pretty integral; it's much more > > discoverable for users, and it's hardly any code out of the whole > > patchset. > > You can do the same thing with a bpftrace one liner tho. That's rather > difficult to beat. Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and allocations.ctx under debugfs. I meant allocations.ctx is redundant. Thanks.
On Wed, May 3, 2023 at 12:09 PM Tejun Heo <tj@kernel.org> wrote: > > On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote: > > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote: > > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote: > > > > > Yeah, easy / default visibility argument does make sense to me. > > > > > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems > > > > rather redundant, right? That's trivially obtainable with tracing / bpf and > > > > in a more flexible and performant manner. Also, are we happy with recording > > > > just single depth for persistent tracking? IIUC, by single depth you mean no call stack capturing? If so, that's the idea behind the context capture feature so that we can enable it on specific allocations only after we determine there is something interesting there. So, with low-cost persistent tracking we can determine the suspects and then pay some more to investigate those suspects in more detail. > > > > > > Not sure what you're envisioning? > > > > > > I'd consider the debugfs interface pretty integral; it's much more > > > discoverable for users, and it's hardly any code out of the whole > > > patchset. > > > > You can do the same thing with a bpftrace one liner tho. That's rather > > difficult to beat. debugfs seemed like a natural choice for such information. If another interface is more appropriate I'm happy to explore that. > > Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and > allocations.ctx under debugfs. I meant allocations.ctx is redundant. Do you mean that we could display allocation context in debugfs/allocations file (for the allocations which we explicitly enabled context capturing)? > > Thanks. > > -- > tejun
Hello, On Wed, May 03, 2023 at 12:41:08PM -0700, Suren Baghdasaryan wrote: > On Wed, May 3, 2023 at 12:09 PM Tejun Heo <tj@kernel.org> wrote: > > > > On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote: > > > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote: > > > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote: > > > > > > Yeah, easy / default visibility argument does make sense to me. > > > > > > > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems > > > > > rather redundant, right? That's trivially obtainable with tracing / bpf and > > > > > in a more flexible and performant manner. Also, are we happy with recording > > > > > just single depth for persistent tracking? > > IIUC, by single depth you mean no call stack capturing? Yes. > If so, that's the idea behind the context capture feature so that we > can enable it on specific allocations only after we determine there is > something interesting there. So, with low-cost persistent tracking we > can determine the suspects and then pay some more to investigate those > suspects in more detail. Yeah, I was wondering whether it'd be useful to have that configurable so that it'd be possible for a user to say "I'm okay with the cost, please track more context per allocation". Given that tracking the immediate caller is already a huge improvement and narrowing it down from there using existing tools shouldn't be that difficult, I don't think this is a blocker in any way. It just bothers me a bit that the code is structured so that source line is the main abstraction. > > > > Not sure what you're envisioning? > > > > > > > > I'd consider the debugfs interface pretty integral; it's much more > > > > discoverable for users, and it's hardly any code out of the whole > > > > patchset. > > > > > > You can do the same thing with a bpftrace one liner tho. That's rather > > > difficult to beat. > > debugfs seemed like a natural choice for such information. If another > interface is more appropriate I'm happy to explore that. > > > > > Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and > > allocations.ctx under debugfs. I meant allocations.ctx is redundant. > > Do you mean that we could display allocation context in > debugfs/allocations file (for the allocations which we explicitly > enabled context capturing)? Sorry about the fumbled communication. Here's what I mean: * Improving memory allocation visibility makes sense to me. To me, a more natural place for that feels like /proc/allocations next to other memory info files rather than under debugfs. * The default visibility provided by "allocations" provides something which is more difficult or at least cumbersome to obtain using existing tracing tools. However, what's provided by "allocations.ctx" can be trivially obtained using kprobe and BPF and seems redundant. Thanks.
Hello, On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote: > > If so, that's the idea behind the context capture feature so that we > > can enable it on specific allocations only after we determine there is > > something interesting there. So, with low-cost persistent tracking we > > can determine the suspects and then pay some more to investigate those > > suspects in more detail. > > Yeah, I was wondering whether it'd be useful to have that configurable so > that it'd be possible for a user to say "I'm okay with the cost, please > track more context per allocation". Given that tracking the immediate caller > is already a huge improvement and narrowing it down from there using > existing tools shouldn't be that difficult, I don't think this is a blocker > in any way. It just bothers me a bit that the code is structured so that > source line is the main abstraction. Another related question. So, the reason for macro'ing stuff is needed is because you want to print the line directly from kernel, right? Is that really necessary? Values from __builtin_return_address() can easily be printed out as function+offset from kernel which already gives most of the necessary information for triaging and mapping that back to source line from userspace isn't difficult. Wouldn't using __builtin_return_address() make the whole thing a lot simpler? Thanks.
On Wed, May 3, 2023 at 6:35 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, Kent. > > On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote: > > No, we're still waiting on the tracing people to _demonstrate_, not > > claim, that this is at all possible in a comparable way with tracing. > > So, we (meta) happen to do stuff like this all the time in the fleet to hunt > down tricky persistent problems like memory leaks, ref leaks, what-have-you. > In recent kernels, with kprobe and BPF, our ability to debug these sorts of > problems has improved a great deal. Below, I'm attaching a bcc script I used > to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks > can follow the same pattern. > For leaks there is example bcc https://github.com/iovisor/bcc/blob/master/tools/memleak.py
On Wed, May 3, 2023 at 12:49 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Wed, May 03, 2023 at 12:41:08PM -0700, Suren Baghdasaryan wrote: > > On Wed, May 3, 2023 at 12:09 PM Tejun Heo <tj@kernel.org> wrote: > > > > > > On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote: > > > > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote: > > > > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote: > > > > > > > Yeah, easy / default visibility argument does make sense to me. > > > > > > > > > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems > > > > > > rather redundant, right? That's trivially obtainable with tracing / bpf and > > > > > > in a more flexible and performant manner. Also, are we happy with recording > > > > > > just single depth for persistent tracking? > > > > IIUC, by single depth you mean no call stack capturing? > > Yes. > > > If so, that's the idea behind the context capture feature so that we > > can enable it on specific allocations only after we determine there is > > something interesting there. So, with low-cost persistent tracking we > > can determine the suspects and then pay some more to investigate those > > suspects in more detail. > > Yeah, I was wondering whether it'd be useful to have that configurable so > that it'd be possible for a user to say "I'm okay with the cost, please > track more context per allocation". I assume by "more context per allocation" you mean for a specific allocation, not for all allocations. So, in a sense you are asking if the context capture feature can be dropped from this series and implemented using some other means. Is that right? > Given that tracking the immediate caller > is already a huge improvement and narrowing it down from there using > existing tools shouldn't be that difficult, I don't think this is a blocker > in any way. It just bothers me a bit that the code is structured so that > source line is the main abstraction. > > > > > > Not sure what you're envisioning? > > > > > > > > > > I'd consider the debugfs interface pretty integral; it's much more > > > > > discoverable for users, and it's hardly any code out of the whole > > > > > patchset. > > > > > > > > You can do the same thing with a bpftrace one liner tho. That's rather > > > > difficult to beat. > > > > debugfs seemed like a natural choice for such information. If another > > interface is more appropriate I'm happy to explore that. > > > > > > > > Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and > > > allocations.ctx under debugfs. I meant allocations.ctx is redundant. > > > > Do you mean that we could display allocation context in > > debugfs/allocations file (for the allocations which we explicitly > > enabled context capturing)? > > Sorry about the fumbled communication. Here's what I mean: > > * Improving memory allocation visibility makes sense to me. To me, a more > natural place for that feels like /proc/allocations next to other memory > info files rather than under debugfs. TBH I would love that if this approach is acceptable. > > * The default visibility provided by "allocations" provides something which > is more difficult or at least cumbersome to obtain using existing tracing > tools. However, what's provided by "allocations.ctx" can be trivially > obtained using kprobe and BPF and seems redundant. Hmm. That might be a good way forward. Since context capture has already high performance overhead, maybe choosing not the most performant but more generic solution is the right answer here. I'll need to think about it some more but thanks for the idea! > > Thanks. > > -- > tejun
On Wed, May 03, 2023 at 01:08:40PM -0700, Suren Baghdasaryan wrote: > On Wed, May 3, 2023 at 12:49 PM Tejun Heo <tj@kernel.org> wrote: > > * Improving memory allocation visibility makes sense to me. To me, a more > > natural place for that feels like /proc/allocations next to other memory > > info files rather than under debugfs. > > TBH I would love that if this approach is acceptable. Ack
On Wed, May 3, 2023 at 1:00 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote: > > > If so, that's the idea behind the context capture feature so that we > > > can enable it on specific allocations only after we determine there is > > > something interesting there. So, with low-cost persistent tracking we > > > can determine the suspects and then pay some more to investigate those > > > suspects in more detail. > > > > Yeah, I was wondering whether it'd be useful to have that configurable so > > that it'd be possible for a user to say "I'm okay with the cost, please > > track more context per allocation". Given that tracking the immediate caller > > is already a huge improvement and narrowing it down from there using > > existing tools shouldn't be that difficult, I don't think this is a blocker > > in any way. It just bothers me a bit that the code is structured so that > > source line is the main abstraction. > > Another related question. So, the reason for macro'ing stuff is needed is > because you want to print the line directly from kernel, right? The main reason is because we want to inject a code tag at the location of the call. If we have a code tag injected at every allocation call, then finding the allocation counter (code tag) to operate takes no time. > Is that > really necessary? Values from __builtin_return_address() can easily be > printed out as function+offset from kernel which already gives most of the > necessary information for triaging and mapping that back to source line from > userspace isn't difficult. Wouldn't using __builtin_return_address() make > the whole thing a lot simpler? If we do that we have to associate that address with the allocation counter at runtime on the first allocation and look it up on all following allocations. That introduces the overhead which we are trying to avoid by using macros. > > Thanks. > > -- > tejun
Hello, On Wed, May 03, 2023 at 01:08:40PM -0700, Suren Baghdasaryan wrote: > > Yeah, I was wondering whether it'd be useful to have that configurable so > > that it'd be possible for a user to say "I'm okay with the cost, please > > track more context per allocation". > > I assume by "more context per allocation" you mean for a specific > allocation, not for all allocations. > So, in a sense you are asking if the context capture feature can be > dropped from this series and implemented using some other means. Is > that right? Oh, no, what I meant was whether it'd make sense to allow enable richer tracking (e.g. record deeper into callstack) for all allocations. For targeted tracking, it seems that the kernel already has everything needed. But this is more of an idle thought and the immediate caller tracking is already a big improvement in terms of visibility, so no need to be hung up on this part of discussion at all. Thanks.
Hello, On Wed, May 03, 2023 at 01:14:57PM -0700, Suren Baghdasaryan wrote: > On Wed, May 3, 2023 at 1:00 PM Tejun Heo <tj@kernel.org> wrote: > > Another related question. So, the reason for macro'ing stuff is needed is > > because you want to print the line directly from kernel, right? > > The main reason is because we want to inject a code tag at the > location of the call. If we have a code tag injected at every > allocation call, then finding the allocation counter (code tag) to > operate takes no time. > > > Is that > > really necessary? Values from __builtin_return_address() can easily be > > printed out as function+offset from kernel which already gives most of the > > necessary information for triaging and mapping that back to source line from > > userspace isn't difficult. Wouldn't using __builtin_return_address() make > > the whole thing a lot simpler? > > If we do that we have to associate that address with the allocation > counter at runtime on the first allocation and look it up on all > following allocations. That introduces the overhead which we are > trying to avoid by using macros. I see. I'm a bit skeptical about the performance angle given that the hot path can be probably made really cheap even with lookups. In most cases, it's just gonna be an extra pointer deref and a few more arithmetics. That can show up in microbenchmarks but it's not gonna be much. The benefit of going that route would be the tracking thing being mostly self contained. That said, it's nice to not have to worry about allocating tracking slots and managing hash table, so no strong opinion. Thanks.
On Wed, May 03, 2023 at 04:25:30PM -1000, Tejun Heo wrote: > I see. I'm a bit skeptical about the performance angle given that the hot > path can be probably made really cheap even with lookups. In most cases, > it's just gonna be an extra pointer deref and a few more arithmetics. That > can show up in microbenchmarks but it's not gonna be much. The benefit of > going that route would be the tracking thing being mostly self contained. The only way to do it with a single additional pointer deref would be with a completely statically sized hash table without chaining - it'd have to be open addressing. More realistically you're looking at ~3 dependent loads.
On Wed, May 3, 2023 at 7:25 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Wed, May 03, 2023 at 01:14:57PM -0700, Suren Baghdasaryan wrote: > > On Wed, May 3, 2023 at 1:00 PM Tejun Heo <tj@kernel.org> wrote: > > > Another related question. So, the reason for macro'ing stuff is needed is > > > because you want to print the line directly from kernel, right? > > > > The main reason is because we want to inject a code tag at the > > location of the call. If we have a code tag injected at every > > allocation call, then finding the allocation counter (code tag) to > > operate takes no time. > > > > > Is that > > > really necessary? Values from __builtin_return_address() can easily be > > > printed out as function+offset from kernel which already gives most of the > > > necessary information for triaging and mapping that back to source line from > > > userspace isn't difficult. Wouldn't using __builtin_return_address() make > > > the whole thing a lot simpler? > > > > If we do that we have to associate that address with the allocation > > counter at runtime on the first allocation and look it up on all > > following allocations. That introduces the overhead which we are > > trying to avoid by using macros. > > I see. I'm a bit skeptical about the performance angle given that the hot > path can be probably made really cheap even with lookups. In most cases, > it's just gonna be an extra pointer deref and a few more arithmetics. That > can show up in microbenchmarks but it's not gonna be much. The benefit of > going that route would be the tracking thing being mostly self contained. I'm in the process of rerunning the tests to compare the overhead on the latest kernel but I don't expect that to be cheap compared to kmalloc(). > > That said, it's nice to not have to worry about allocating tracking slots > and managing hash table, so no strong opinion. > > Thanks. > > -- > tejun > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Wed, 3 May 2023 13:14:57 -0700 Suren Baghdasaryan <surenb@google.com> wrote: > On Wed, May 3, 2023 at 1:00 PM Tejun Heo <tj@kernel.org> wrote: > > > > Hello, > > > > On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote: > > > > If so, that's the idea behind the context capture feature so that we > > > > can enable it on specific allocations only after we determine there is > > > > something interesting there. So, with low-cost persistent tracking we > > > > can determine the suspects and then pay some more to investigate those > > > > suspects in more detail. > > > > > > Yeah, I was wondering whether it'd be useful to have that configurable so > > > that it'd be possible for a user to say "I'm okay with the cost, please > > > track more context per allocation". Given that tracking the immediate caller > > > is already a huge improvement and narrowing it down from there using > > > existing tools shouldn't be that difficult, I don't think this is a blocker > > > in any way. It just bothers me a bit that the code is structured so that > > > source line is the main abstraction. > > > > Another related question. So, the reason for macro'ing stuff is needed is > > because you want to print the line directly from kernel, right? > > The main reason is because we want to inject a code tag at the > location of the call. If we have a code tag injected at every > allocation call, then finding the allocation counter (code tag) to > operate takes no time. Another consequence is that each source code location gets its own tag. The compiler can no longer apply common subexpression elimination (because the tag is different). I have some doubts that there are any places where CSE could be applied to allocation calls, but in general, this is one more difference to using _RET_IP_. Petr T > > Is that > > really necessary? Values from __builtin_return_address() can easily be > > printed out as function+offset from kernel which already gives most of the > > necessary information for triaging and mapping that back to source line from > > userspace isn't difficult. Wouldn't using __builtin_return_address() make > > the whole thing a lot simpler? > > If we do that we have to associate that address with the allocation > counter at runtime on the first allocation and look it up on all > following allocations. That introduces the overhead which we are > trying to avoid by using macros. > > > > > Thanks. > > > > -- > > tejun >
On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote: > On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@suse.com> wrote: [...] > Thanks for summarizing! > > > At least those I find the most important: > > - This is a big change and it adds a significant maintenance burden > > because each allocation entry point needs to be handled specifically. > > The cost will grow with the intended coverage especially there when > > allocation is hidden in a library code. > > Do you mean with more allocations in the codebase more codetags will > be generated? Is that the concern? No. I am mostly concerned about the _maintenance_ overhead. For the bare tracking (without profiling and thus stack traces) only those allocations that are directly inlined into the consumer are really of any use. That increases the code impact of the tracing because any relevant allocation location has to go through the micro surgery. e.g. is it really interesting to know that there is a likely memory leak in seq_file proper doing and allocation? No as it is the specific implementation using seq_file that is leaking most likely. There are other examples like that See? > Or maybe as you commented in > another patch that context capturing feature does not limit how many > stacks will be captured? That is a memory overhead which can be really huge and it would be nice to be more explicit about that in the cover letter. It is a downside for sure but not something that has a code maintenance impact and it is an opt-in so it can be enabled only when necessary. Quite honestly, though, the more I look into context capturing part it seems to me that there is much more to be reconsidered there and if you really want to move forward with the code tagging part then you should drop that for now. It would make the whole series smaller and easier to digest. > > - It has been brought up that this is duplicating functionality already > > available via existing tracing infrastructure. You should make it very > > clear why that is not suitable for the job > > I experimented with using tracing with _RET_IP_ to implement this > accounting. The major issue is the _RET_IP_ to codetag lookup runtime > overhead which is orders of magnitude higher than proposed code > tagging approach. With code tagging proposal, that link is resolved at > compile time. Since we want this mechanism deployed in production, we > want to keep the overhead to the absolute minimum. > You asked me before how much overhead would be tolerable and the > answer will always be "as small as possible". This is especially true > for slab allocators which are ridiculously fast and regressing them > would be very noticable (due to the frequent use). It would have been more convincing if you had some numbers at hands. E.g. this is a typical workload we are dealing with. With the compile time tags we are able to learn this with that much of cost. With a dynamic tracing we are able to learn this much with that cost. See? As small as possible is a rather vague term that different people will have a very different idea about. > There is another issue, which I think can be solved in a smart way but > will either affect performance or would require more memory. With the > tracing approach we don't know beforehand how many individual > allocation sites exist, so we have to allocate code tags (or similar > structures for counting) at runtime vs compile time. We can be smart > about it and allocate in batches or even preallocate more than we need > beforehand but, as I said, it will require some kind of compromise. I have tried our usual distribution config (only vmlinux without modules so the real impact will be larger as we build a lot of stuff into modules) just to get an idea: text data bss dec hex filename 28755345 17040322 19845124 65640791 3e99957 vmlinux.before 28867168 17571838 19386372 65825378 3ec6a62 vmlinux.after Less than 1% for text 3% for data. This is not all that terrible for an initial submission and a more dynamic approach could be added later. E.g. with a smaller pre-allocated hash table that could be expanded lazily. Anyway not something I would be losing sleep over. This can always be improved later on. > I understand that code tagging creates additional maintenance burdens > but I hope it also produces enough benefits that people will want > this. The cost is also hopefully amortized when additional > applications like the ones we presented in RFC [1] are built using the > same framework. TBH I am much more concerned about the maintenance burden on the MM side than the actual code tagging itslef which is much more self contained. I haven't seen other potential applications of the same infrastructure and maybe the code impact would be much smaller than in the MM proper. Our allocator API is really hairy and convoluted. > > - We already have page_owner infrastructure that provides allocation > > tracking data. Why it cannot be used/extended? > > 1. The overhead. Do you have any numbers? > 2. Covers only page allocators. Yes this sucks. > > I didn't think about extending the page_owner approach to slab > allocators but I suspect it would not be trivial. I don't see > attaching an owner to every slab object to be a scalable solution. The > overhead would again be of concern here. This would have been a nice argument to mention in the changelog so that we know that you have considered that option at least. Why should I (as a reviewer) wild guess that? > I should point out that there was one important technical concern > about lack of a kill switch for this feature, which was an issue for > distributions that can't disable the CONFIG flag. In this series we > addressed that concern. Thanks, that is certainly appreciated. I haven't looked deeper into that part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING implies unconditional page_ext and therefore the memory overhead assosiated with that. There seems to be a killswitch nomem_profiling but from a quick look it doesn't seem to disable page_ext allocations. I might be missing something there of course. Having a highlevel describtion for that would be really nice as well. > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote: > > On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > Thanks for summarizing! > > > > > At least those I find the most important: > > > - This is a big change and it adds a significant maintenance burden > > > because each allocation entry point needs to be handled specifically. > > > The cost will grow with the intended coverage especially there when > > > allocation is hidden in a library code. > > > > Do you mean with more allocations in the codebase more codetags will > > be generated? Is that the concern? > > No. I am mostly concerned about the _maintenance_ overhead. For the > bare tracking (without profiling and thus stack traces) only those > allocations that are directly inlined into the consumer are really > of any use. That increases the code impact of the tracing because any > relevant allocation location has to go through the micro surgery. > > e.g. is it really interesting to know that there is a likely memory > leak in seq_file proper doing and allocation? No as it is the specific > implementation using seq_file that is leaking most likely. There are > other examples like that See? Yes, I see that. One level tracking does not provide all the information needed to track such issues. Something more informative would cost more. That's why our proposal is to have a light-weight mechanism to get a high level picture and then be able to zoom into a specific area using context capture. If you have ideas to improve this, I'm open to suggestions. > > > Or maybe as you commented in > > another patch that context capturing feature does not limit how many > > stacks will be captured? > > That is a memory overhead which can be really huge and it would be nice > to be more explicit about that in the cover letter. It is a downside for > sure but not something that has a code maintenance impact and it is an > opt-in so it can be enabled only when necessary. You are right, I'll add that into the cover letter. > > Quite honestly, though, the more I look into context capturing part it > seems to me that there is much more to be reconsidered there and if you > really want to move forward with the code tagging part then you should > drop that for now. It would make the whole series smaller and easier to > digest. Sure, I don't see an issue with removing that for now and refining the mechanism before posting again. > > > > - It has been brought up that this is duplicating functionality already > > > available via existing tracing infrastructure. You should make it very > > > clear why that is not suitable for the job > > > > I experimented with using tracing with _RET_IP_ to implement this > > accounting. The major issue is the _RET_IP_ to codetag lookup runtime > > overhead which is orders of magnitude higher than proposed code > > tagging approach. With code tagging proposal, that link is resolved at > > compile time. Since we want this mechanism deployed in production, we > > want to keep the overhead to the absolute minimum. > > You asked me before how much overhead would be tolerable and the > > answer will always be "as small as possible". This is especially true > > for slab allocators which are ridiculously fast and regressing them > > would be very noticable (due to the frequent use). > > It would have been more convincing if you had some numbers at hands. > E.g. this is a typical workload we are dealing with. With the compile > time tags we are able to learn this with that much of cost. With a dynamic > tracing we are able to learn this much with that cost. See? As small as > possible is a rather vague term that different people will have a very > different idea about. I'm rerunning my tests with the latest kernel to collect the comparison data. I profiled these solutions before but the kernel changed since then, so I need to update them. > > > There is another issue, which I think can be solved in a smart way but > > will either affect performance or would require more memory. With the > > tracing approach we don't know beforehand how many individual > > allocation sites exist, so we have to allocate code tags (or similar > > structures for counting) at runtime vs compile time. We can be smart > > about it and allocate in batches or even preallocate more than we need > > beforehand but, as I said, it will require some kind of compromise. > > I have tried our usual distribution config (only vmlinux without modules > so the real impact will be larger as we build a lot of stuff into > modules) just to get an idea: > text data bss dec hex filename > 28755345 17040322 19845124 65640791 3e99957 vmlinux.before > 28867168 17571838 19386372 65825378 3ec6a62 vmlinux.after > > Less than 1% for text 3% for data. This is not all that terrible > for an initial submission and a more dynamic approach could be added > later. E.g. with a smaller pre-allocated hash table that could be > expanded lazily. Anyway not something I would be losing sleep over. This > can always be improved later on. Ah, right. I should have mentioned this overhead too. Thanks for keeping me honest. > > I understand that code tagging creates additional maintenance burdens > > but I hope it also produces enough benefits that people will want > > this. The cost is also hopefully amortized when additional > > applications like the ones we presented in RFC [1] are built using the > > same framework. > > TBH I am much more concerned about the maintenance burden on the MM side > than the actual code tagging itslef which is much more self contained. I > haven't seen other potential applications of the same infrastructure and > maybe the code impact would be much smaller than in the MM proper. Our > allocator API is really hairy and convoluted. Yes, other applications are much smaller and cleaner. MM allocation code is quite complex indeed. > > > > - We already have page_owner infrastructure that provides allocation > > > tracking data. Why it cannot be used/extended? > > > > 1. The overhead. > > Do you have any numbers? Will post once my tests are completed. > > > 2. Covers only page allocators. > > Yes this sucks. > > > > I didn't think about extending the page_owner approach to slab > > allocators but I suspect it would not be trivial. I don't see > > attaching an owner to every slab object to be a scalable solution. The > > overhead would again be of concern here. > > This would have been a nice argument to mention in the changelog so that > we know that you have considered that option at least. Why should I (as > a reviewer) wild guess that? Sorry, It's hard to remember all the decisions, discussions and conclusions when working on a feature over a long time period. I'll include more information about that. > > > I should point out that there was one important technical concern > > about lack of a kill switch for this feature, which was an issue for > > distributions that can't disable the CONFIG flag. In this series we > > addressed that concern. > > Thanks, that is certainly appreciated. I haven't looked deeper into that > part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING > implies unconditional page_ext and therefore the memory overhead > assosiated with that. There seems to be a killswitch nomem_profiling but > from a quick look it doesn't seem to disable page_ext allocations. I > might be missing something there of course. Having a highlevel > describtion for that would be really nice as well. Right, will add a description of that as well. We eliminate the runtime overhead but not the memory one. However I believe it's also doable using page_ext_operations.need callback. Will look into it. Thanks, Suren. > > > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/ > > -- > Michal Hocko > SUSE Labs
On Thu 04-05-23 08:08:13, Suren Baghdasaryan wrote: > On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > e.g. is it really interesting to know that there is a likely memory > > leak in seq_file proper doing and allocation? No as it is the specific > > implementation using seq_file that is leaking most likely. There are > > other examples like that See? > > Yes, I see that. One level tracking does not provide all the > information needed to track such issues. Something more informative > would cost more. That's why our proposal is to have a light-weight > mechanism to get a high level picture and then be able to zoom into a > specific area using context capture. If you have ideas to improve > this, I'm open to suggestions. Well, I think that a more scalable approach would be to not track in callers but in the allocator itself. The full stack trace might not be all that important or interesting and maybe even increase the overall overhead but a partial one with a configurable depth would sound more interesting to me. A per cache hastable indexed by stack trace reference and extending slab metadata to store the reference for kfree path won't be free but the overhead might be just acceptable. If the stack unwinding is really too expensive for tracking another option would be to add code tags dynamically to the compiled kernel without any actual code changes. I can imagine the tracing infrastructure could be used for that or maybe even consider compiler plugins to inject code for functions marked as allocators. So the kernel could be instrumented even without eny userspace tooling required by users directly.
On Sun, May 07, 2023 at 12:27:18PM +0200, Michal Hocko wrote: > On Thu 04-05-23 08:08:13, Suren Baghdasaryan wrote: > > On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > e.g. is it really interesting to know that there is a likely memory > > > leak in seq_file proper doing and allocation? No as it is the specific > > > implementation using seq_file that is leaking most likely. There are > > > other examples like that See? > > > > Yes, I see that. One level tracking does not provide all the > > information needed to track such issues. Something more informative > > would cost more. That's why our proposal is to have a light-weight > > mechanism to get a high level picture and then be able to zoom into a > > specific area using context capture. If you have ideas to improve > > this, I'm open to suggestions. > > Well, I think that a more scalable approach would be to not track in > callers but in the allocator itself. The full stack trace might not be > all that important or interesting and maybe even increase the overall > overhead but a partial one with a configurable depth would sound more > interesting to me. A per cache hastable indexed by stack trace reference > and extending slab metadata to store the reference for kfree path won't > be free but the overhead might be just acceptable. How would you propose to annotate what call chains need what depth of stack trace recorded? How would you propose to make this performant?
On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote: > No. I am mostly concerned about the _maintenance_ overhead. For the > bare tracking (without profiling and thus stack traces) only those > allocations that are directly inlined into the consumer are really > of any use. That increases the code impact of the tracing because any > relevant allocation location has to go through the micro surgery. > > e.g. is it really interesting to know that there is a likely memory > leak in seq_file proper doing and allocation? No as it is the specific > implementation using seq_file that is leaking most likely. There are > other examples like that See? So this is a rather strange usage of "maintenance overhead" :) But it's something we thought of. If we had to plumb around a _RET_IP_ parameter, or a codetag pointer, it would be a hassle annotating the correct callsite. Instead, alloc_hooks() wraps a memory allocation function and stashes a pointer to a codetag in task_struct for use by the core slub/buddy allocator code. That means that in your example, to move tracking to a given seq_file function, we just: - hook the seq_file function with alloc_hooks - change the seq_file function to call non-hooked memory allocation functions. > It would have been more convincing if you had some numbers at hands. > E.g. this is a typical workload we are dealing with. With the compile > time tags we are able to learn this with that much of cost. With a dynamic > tracing we are able to learn this much with that cost. See? As small as > possible is a rather vague term that different people will have a very > different idea about. Engineers don't prototype and benchmark everything as a matter of course, we're expected to have the rough equivealent of a CS education and an understanding of big O notation, cache architecture, etc. The slub fast path is _really_ fast - double word non locked cmpxchg. That's what we're trying to compete with. Adding a big globally accessible hash table is going to tank performance compared to that. I believe the numbers we already posted speak for themselves. We're considerably faster than memcg, fast enough to run in production. I'm not going to be switching to a design that significantly regresses performance, sorry :) > TBH I am much more concerned about the maintenance burden on the MM side > than the actual code tagging itslef which is much more self contained. I > haven't seen other potential applications of the same infrastructure and > maybe the code impact would be much smaller than in the MM proper. Our > allocator API is really hairy and convoluted. You keep saying "maintenance burden", but this is a criticism that can be directed at _any_ patchset that adds new code; it's generally understood that that is the accepted cost for new functionality. If you have specific concerns where you think we did something that makes the code harder to maintain, _please point them out in the appropriate patch_. I don't think you'll find too much - the instrumentation in the allocators simply generalizes what memcg was already doing, and the hooks themselves are a bit boilerplaty but hardly the sort of thing people will be tripping over later. TL;DR - put up or shut up :)
On Sun, 7 May 2023 13:20:55 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote: > > No. I am mostly concerned about the _maintenance_ overhead. For the > > bare tracking (without profiling and thus stack traces) only those > > allocations that are directly inlined into the consumer are really > > of any use. That increases the code impact of the tracing because any > > relevant allocation location has to go through the micro surgery. > > > > e.g. is it really interesting to know that there is a likely memory > > leak in seq_file proper doing and allocation? No as it is the specific > > implementation using seq_file that is leaking most likely. There are > > other examples like that See? > > So this is a rather strange usage of "maintenance overhead" :) > > But it's something we thought of. If we had to plumb around a _RET_IP_ > parameter, or a codetag pointer, it would be a hassle annotating the > correct callsite. > > Instead, alloc_hooks() wraps a memory allocation function and stashes a > pointer to a codetag in task_struct for use by the core slub/buddy > allocator code. > > That means that in your example, to move tracking to a given seq_file > function, we just: > - hook the seq_file function with alloc_hooks > - change the seq_file function to call non-hooked memory allocation > functions. > > > It would have been more convincing if you had some numbers at hands. > > E.g. this is a typical workload we are dealing with. With the compile > > time tags we are able to learn this with that much of cost. With a dynamic > > tracing we are able to learn this much with that cost. See? As small as > > possible is a rather vague term that different people will have a very > > different idea about. > > Engineers don't prototype and benchmark everything as a matter of > course, we're expected to have the rough equivealent of a CS education > and an understanding of big O notation, cache architecture, etc. > > The slub fast path is _really_ fast - double word non locked cmpxchg. > That's what we're trying to compete with. Adding a big globally > accessible hash table is going to tank performance compared to that. > > I believe the numbers we already posted speak for themselves. We're > considerably faster than memcg, fast enough to run in production. > > I'm not going to be switching to a design that significantly regresses > performance, sorry :) > > > TBH I am much more concerned about the maintenance burden on the MM side > > than the actual code tagging itslef which is much more self contained. I > > haven't seen other potential applications of the same infrastructure and > > maybe the code impact would be much smaller than in the MM proper. Our > > allocator API is really hairy and convoluted. > > You keep saying "maintenance burden", but this is a criticism that can > be directed at _any_ patchset that adds new code; it's generally > understood that that is the accepted cost for new functionality. > > If you have specific concerns where you think we did something that > makes the code harder to maintain, _please point them out in the > appropriate patch_. I don't think you'll find too much - the > instrumentation in the allocators simply generalizes what memcg was > already doing, and the hooks themselves are a bit boilerplaty but hardly > the sort of thing people will be tripping over later. > > TL;DR - put up or shut up :) Your email would have been much better if you left the above line out. :-/ Comments like the above do not go over well via text. Even if you add the ":)" Back to the comment about this being a burden. I just applied all the patches and did a diff (much easier than to wade through 40 patches!) One thing we need to get rid of, and this isn't your fault but this series is extending it, is the use of the damn underscores to differentiate functions. This is one of the abominations of the early Linux kernel code base. I admit, I'm guilty of this too. But today I have learned and avoid it at all cost. Underscores are meaningless and error prone, not to mention confusing to people coming onboard. Let's use something that has some meaning. What's the difference between: _kmem_cache_alloc_node() and __kmem_cache_alloc_node()? And if every allocation function requires a double hook, that is a maintenance burden. We do this for things like system calls, but there's a strong rationale for that. I'm guessing that Michal's concern is that he and other mm maintainers will need to make sure any new allocation function has this double call and is done properly. This isn't just new code that needs to be maintained, it's something that needs to be understood when adding any new interface to page allocations. It's true that all new code has a maintenance burden, and unless the maintainer feels the burden is worth their time, they have the right to complain about it. I've given talks about how to get code into open source projects, and the title is "Commits are pulled and never pushed". Where basically I talk about convincing the maintainers that they want your change, and not by pushing it because you want it. -- Steve
On Sun, May 07, 2023 at 04:55:38PM -0400, Steven Rostedt wrote: > > TL;DR - put up or shut up :) > > Your email would have been much better if you left the above line out. :-/ > Comments like the above do not go over well via text. Even if you add the ":)" I stand by that comment :) > Back to the comment about this being a burden. I just applied all the > patches and did a diff (much easier than to wade through 40 patches!) > > One thing we need to get rid of, and this isn't your fault but this > series is extending it, is the use of the damn underscores to > differentiate functions. This is one of the abominations of the early > Linux kernel code base. I admit, I'm guilty of this too. But today I > have learned and avoid it at all cost. Underscores are meaningless and > error prone, not to mention confusing to people coming onboard. Let's > use something that has some meaning. > > What's the difference between: > > _kmem_cache_alloc_node() and __kmem_cache_alloc_node()? > > And if every allocation function requires a double hook, that is a > maintenance burden. We do this for things like system calls, but > there's a strong rationale for that. The underscore is a legitimate complaint - I brought this up in development, not sure why it got lost. We'll do something better with a consistent suffix, perhaps kmem_cache_alloc_noacct(). > I'm guessing that Michal's concern is that he and other mm maintainers > will need to make sure any new allocation function has this double > call and is done properly. This isn't just new code that needs to be > maintained, it's something that needs to be understood when adding any > new interface to page allocations. Well, isn't that part of the problem then? We're _this far_ into the thread and still guessing on what Michal's "maintenance concerns" are? Regarding your specific concern: My main design consideration was making sure every allocation gets accounted somewhere; we don't want a memory allocation profiling system where it's possible for allocations to be silently not tracked! There's warnings in the core allocators if they see an allocation without an alloc tag, and in testing we chased down everything we found. So if anyone later creates a new memory allocation interface and forgets to hook it, they'll see the same warning - but perhaps we could improve the warning message so it says exactly what needs to be done (wrap the allocation in an alloc_hooks() call). > It's true that all new code has a maintenance burden, and unless the > maintainer feels the burden is worth their time, they have the right to > complain about it. Sure, but complaints should say what they're complaining about. Complaints so vague they could be levelled at any patchset don't do anything for the discussion.
On Sun, 7 May 2023 17:53:09 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > The underscore is a legitimate complaint - I brought this up in > development, not sure why it got lost. We'll do something better with a > consistent suffix, perhaps kmem_cache_alloc_noacct(). Would "_noprofile()" be a better name. I'm not sure what "acct" means. -- Steve
On Sun, May 07, 2023 at 06:09:11PM -0400, Steven Rostedt wrote: > On Sun, 7 May 2023 17:53:09 -0400 > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > The underscore is a legitimate complaint - I brought this up in > > development, not sure why it got lost. We'll do something better with a > > consistent suffix, perhaps kmem_cache_alloc_noacct(). > > Would "_noprofile()" be a better name. I'm not sure what "acct" means. account - but _noprofile() is probably better. Didn't suggest it at first because of an association in my head with CPU profiling, but considering we already renamed the feature to memory allocation profiling... :)
On Sun, 7 May 2023 13:20:55 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote: > > No. I am mostly concerned about the _maintenance_ overhead. For the > > bare tracking (without profiling and thus stack traces) only those > > allocations that are directly inlined into the consumer are really > > of any use. That increases the code impact of the tracing because any > > relevant allocation location has to go through the micro surgery. > > > > e.g. is it really interesting to know that there is a likely memory > > leak in seq_file proper doing and allocation? No as it is the specific > > implementation using seq_file that is leaking most likely. There are > > other examples like that See? > > So this is a rather strange usage of "maintenance overhead" :) > > But it's something we thought of. If we had to plumb around a _RET_IP_ > parameter, or a codetag pointer, it would be a hassle annotating the > correct callsite. > > Instead, alloc_hooks() wraps a memory allocation function and stashes a > pointer to a codetag in task_struct for use by the core slub/buddy > allocator code. > > That means that in your example, to move tracking to a given seq_file > function, we just: > - hook the seq_file function with alloc_hooks Thank you. That's exactly what I was trying to point out. So you hook seq_buf_alloc(), just to find out it's called from traverse(), which is not very helpful either. So, you hook traverse(), which sounds quite generic. Yes, you're lucky, because it is a static function, and the identifier is not actually used anywhere else (right now), but each time you want to hook something, you must make sure it does not conflict with any other identifier in the kernel... Petr T
On Mon, May 08, 2023 at 05:52:06PM +0200, Petr Tesařík wrote: > On Sun, 7 May 2023 13:20:55 -0400 > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote: > > > No. I am mostly concerned about the _maintenance_ overhead. For the > > > bare tracking (without profiling and thus stack traces) only those > > > allocations that are directly inlined into the consumer are really > > > of any use. That increases the code impact of the tracing because any > > > relevant allocation location has to go through the micro surgery. > > > > > > e.g. is it really interesting to know that there is a likely memory > > > leak in seq_file proper doing and allocation? No as it is the specific > > > implementation using seq_file that is leaking most likely. There are > > > other examples like that See? > > > > So this is a rather strange usage of "maintenance overhead" :) > > > > But it's something we thought of. If we had to plumb around a _RET_IP_ > > parameter, or a codetag pointer, it would be a hassle annotating the > > correct callsite. > > > > Instead, alloc_hooks() wraps a memory allocation function and stashes a > > pointer to a codetag in task_struct for use by the core slub/buddy > > allocator code. > > > > That means that in your example, to move tracking to a given seq_file > > function, we just: > > - hook the seq_file function with alloc_hooks > > Thank you. That's exactly what I was trying to point out. So you hook > seq_buf_alloc(), just to find out it's called from traverse(), which > is not very helpful either. So, you hook traverse(), which sounds quite > generic. Yes, you're lucky, because it is a static function, and the > identifier is not actually used anywhere else (right now), but each > time you want to hook something, you must make sure it does not > conflict with any other identifier in the kernel... Cscope makes quick and easy work of this kind of stuff.
On Mon, 8 May 2023 11:57:10 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Mon, May 08, 2023 at 05:52:06PM +0200, Petr Tesařík wrote: > > On Sun, 7 May 2023 13:20:55 -0400 > > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > > On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote: > > > > No. I am mostly concerned about the _maintenance_ overhead. For the > > > > bare tracking (without profiling and thus stack traces) only those > > > > allocations that are directly inlined into the consumer are really > > > > of any use. That increases the code impact of the tracing because any > > > > relevant allocation location has to go through the micro surgery. > > > > > > > > e.g. is it really interesting to know that there is a likely memory > > > > leak in seq_file proper doing and allocation? No as it is the specific > > > > implementation using seq_file that is leaking most likely. There are > > > > other examples like that See? > > > > > > So this is a rather strange usage of "maintenance overhead" :) > > > > > > But it's something we thought of. If we had to plumb around a _RET_IP_ > > > parameter, or a codetag pointer, it would be a hassle annotating the > > > correct callsite. > > > > > > Instead, alloc_hooks() wraps a memory allocation function and stashes a > > > pointer to a codetag in task_struct for use by the core slub/buddy > > > allocator code. > > > > > > That means that in your example, to move tracking to a given seq_file > > > function, we just: > > > - hook the seq_file function with alloc_hooks > > > > Thank you. That's exactly what I was trying to point out. So you hook > > seq_buf_alloc(), just to find out it's called from traverse(), which > > is not very helpful either. So, you hook traverse(), which sounds quite > > generic. Yes, you're lucky, because it is a static function, and the > > identifier is not actually used anywhere else (right now), but each > > time you want to hook something, you must make sure it does not > > conflict with any other identifier in the kernel... > > Cscope makes quick and easy work of this kind of stuff. Sure, although AFAIK the index does not cover all possible config options (so non-x86 arch code is often forgotten). However, that's the less important part. What do you do if you need to hook something that does conflict with an existing identifier? Petr T
On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote: > Sure, although AFAIK the index does not cover all possible config > options (so non-x86 arch code is often forgotten). However, that's the > less important part. > > What do you do if you need to hook something that does conflict with an > existing identifier? As already happens in this patchset, rename the other identifier. But this is C, we avoid these kinds of conflicts already because the language has no namespacing - it's going to be a pretty rare situtaion going forward. Most of the hooking that will be done is done with this patchset, and there was only one identifier that needed to be renamed.
On Mon, 8 May 2023 12:28:52 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote: > > Sure, although AFAIK the index does not cover all possible config > > options (so non-x86 arch code is often forgotten). However, that's the > > less important part. > > > > What do you do if you need to hook something that does conflict with an > > existing identifier? > > As already happens in this patchset, rename the other identifier. > > But this is C, we avoid these kinds of conflicts already because the > language has no namespacing This statement is not accurate, but I agree there's not much. Refer to section 6.2.3 of ISO/IEC9899:2018 (Name spaces of identifiers). More importantly, macros also interfere with identifier scoping, e.g. you cannot even have a local variable with the same name as a macro. That's why I dislike macros so much. But since there's no clear policy regarding macros in the kernel, I'm merely showing a downside; it's perfectly fine to write kernel code like this as long as the maintainers agree that the limitation is acceptable and outweighed by the benefits. Petr T > it's going to be a pretty rare situtaion > going forward. Most of the hooking that will be done is done with this > patchset, and there was only one identifier that needed to be renamed. >
On Mon, May 08, 2023 at 08:59:39PM +0200, Petr Tesařík wrote: > On Mon, 8 May 2023 12:28:52 -0400 > Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote: > > > Sure, although AFAIK the index does not cover all possible config > > > options (so non-x86 arch code is often forgotten). However, that's the > > > less important part. > > > > > > What do you do if you need to hook something that does conflict with an > > > existing identifier? > > > > As already happens in this patchset, rename the other identifier. > > > > But this is C, we avoid these kinds of conflicts already because the > > language has no namespacing > > This statement is not accurate, but I agree there's not much. Refer to > section 6.2.3 of ISO/IEC9899:2018 (Name spaces of identifiers). > > More importantly, macros also interfere with identifier scoping, e.g. > you cannot even have a local variable with the same name as a macro. > That's why I dislike macros so much. Shadowing a global identifier like that would at best be considered poor style, so I don't see this as a major downside. > But since there's no clear policy regarding macros in the kernel, I'm > merely showing a downside; it's perfectly fine to write kernel code > like this as long as the maintainers agree that the limitation is > acceptable and outweighed by the benefits. Macros do have lots of tricky downsides, but in general we're not shy about using them for things that can't be done otherwise; see wait_event(), all of tracing... I think we could in general do a job of making the macros _themselves_ more managable, when writing things that need to be macros I'll often have just the wrapper as a macro and write the bulk as inline functions. See the generic radix tree code for example. Reflection is a major use case for macros, and the underlying mechanism here - code tagging - is something worth talking about more, since it's codifying something that's been done ad-hoc in the kernel for a long time and something we hope to refactor other existing code to use, including tracing - I've got a patch already written to convert the dynamic debug code to code tagging; it's a nice -200 loc cleanup. Regarding the alloc_hooks() macro itself specifically, I've got more plans for it. I have another patch series after this one that implements code tagging based fault injection, which is _far_ more ergonomic to use than our existing fault injection capabilities (and this matters! Fault injection is a really important tool for getting good test coverage, but tools that are a pain in the ass to use don't get used) - and with the alloc_hooks() macro already in place, we'll be able to turn _every individual memory allocation callsite_ into a distinct, individually selectable fault injection point - which is something our existing fault injection framework attempts at but doesn't really manage. If we can get this in, it'll make it really easy to write unit tests that iterate over every memory allocation site in a given module, individually telling them to fail, run a workload until they hit, and verify that the code path being tested was executed. It'll nicely complement the fuzz testing capabilities that we've been working on, particularly in filesystem land.