Message ID | 20240212213922.783301-1-surenb@google.com (mailing list archive)
---|---
Series | Memory allocation profiling
On Mon, Feb 12, 2024 at 4:39 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Memory allocation profiling, v3 and final:
>
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.
>
> We're aiming to get this in the next merge window, for 6.9. The feedback
> we've gotten has been that even out of tree this patchset has already
> been useful, and there's a significant amount of other work gated on the
> code tagging functionality included in this patchset [2].
>
> Example output:
> root@moria-kvm:~# sort -h /proc/allocinfo|tail
>   3.11MiB     2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
>   3.52MiB      225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
>   3.75MiB      960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
>   4.00MiB        2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
>   10.5MiB      168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
>   14.0MiB     3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
>   26.8MiB     6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
>   64.5MiB    98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
>   98.7MiB    25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
>    125MiB     7357 mm/slub.c:2201 module:slub func:alloc_slab_page

This kind of memory profiling would be an incredible asset in cloud
environments. Over the past year, we've encountered several kernel
memory overhead issues. Two particularly severe cases involved
excessively large IOMMU page tables (20GB per machine) and IOVA
magazines (up to 8GB). Considering thousands of machines were affected,
the cumulative memory waste was huge.

While we eventually resolved these issues with custom kernel profiling
hacks (some based on this series) and kdump analysis, comprehensive
memory profiling would have significantly accelerated the diagnostic
process, pinpointing the precise source of the allocations.
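[Each /proc/allocinfo line pairs a total size and a call count with one
call site, so it can be aggregated further with ordinary text tools. A
minimal sketch, assuming the v3-era format shown above (size in column 1,
call count in column 2, and a "module:<name>" tag present); field
positions may differ in other versions of the series:

	# Sum allocation call counts per module from /proc/allocinfo.
	awk '{
		for (i = 3; i <= NF; i++)
			if ($i ~ /^module:/)
				count[$i] += $2
	}
	END {
		for (m in count)
			print count[m], m
	}' /proc/allocinfo | sort -rn | head
]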
On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.

What's the plan for things like devm_kmalloc() and similar relatively
simple wrappers? I was thinking it would be possible to reimplement at
least devm_kmalloc() with a size- and flags-changing helper a while back:

https://lore.kernel.org/all/202309111428.6F36672F57@keescook/

I suspect it could be possible to adapt the alloc_hooks wrapper in this
series similarly:

#define alloc_hooks_prep(_do_alloc, _do_prepare, _do_finish,	\
			 ctx, size, flags)			\
({								\
	DEFINE_ALLOC_TAG(_alloc_tag, _old);			\
	ssize_t _size = (size);					\
	size_t _usable = _size;					\
	gfp_t _flags = (flags);					\
	typeof(_do_alloc(_size, _flags)) _res;			\
								\
	_res = _do_prepare(ctx, &_size, &_flags);		\
	if (!IS_ERR_OR_NULL(_res))				\
		_res = _do_alloc(_size, _flags);		\
	if (!IS_ERR_OR_NULL(_res))				\
		_res = _do_finish(ctx, _usable, _size, _flags, _res); \
	_res;							\
})

#define devm_kmalloc(dev, size, flags)				\
	alloc_hooks_prep(kmalloc, devm_alloc_prep, devm_alloc_finish, \
			 dev, size, flags)

And devm_alloc_prep() and devm_alloc_finish() adapted from the URL
above.

And _do_finish instances could be marked with __realloc_size(2)

-Kees
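[To make the last point concrete, the annotation might look as follows
on a finish helper. The prototype is hypothetical, inferred only from
how the macro above invokes _do_finish(ctx, _usable, _size, _flags,
_res); it is not code from the series or from the linked patch:

	/*
	 * Hypothetical prototype.  The point is __realloc_size(2): it
	 * tells the compiler that the returned object's usable size is
	 * given by argument 2 (_usable), so dynamic object-size checks
	 * keep working through the wrapper.
	 */
	void *devm_alloc_finish(struct device *dev, size_t usable,
				ssize_t size, gfp_t flags, void *p)
				__realloc_size(2);
]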
On Mon, Feb 12, 2024 at 4:29 PM Kees Cook <keescook@chromium.org> wrote:
>
> What's the plan for things like devm_kmalloc() and similar relatively
> simple wrappers? I was thinking it would be possible to reimplement at
> least devm_kmalloc() with a size- and flags-changing helper a while back:
>
> https://lore.kernel.org/all/202309111428.6F36672F57@keescook/
>
> [...]
>
> And _do_finish instances could be marked with __realloc_size(2)

devm_kmalloc() is definitely a great candidate to account separately.
Looks like it's currently using alloc_dr()->kmalloc_node_track_caller(),
so this series will account the internal kmalloc_node_track_caller()
allocation. We can easily apply alloc_hooks to devm_kmalloc() and
friends and replace the kmalloc_node_track_caller() call inside
alloc_dr() with kmalloc_node_track_caller_noprof(). That will move
accounting directly to devm_kmalloc().
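[A sketch of the change described above, following the series'
convention of renaming the callee to a _noprof variant and accounting at
the outermost wrapper. This is illustrative only, not a patch from the
series, and the alloc_dr() body is abbreviated:

	/* include/linux/device.h: account at the devm_kmalloc() call site. */
	void *devm_kmalloc_noprof(struct device *dev, size_t size, gfp_t gfp);
	#define devm_kmalloc(...) alloc_hooks(devm_kmalloc_noprof(__VA_ARGS__))

	/* drivers/base/devres.c */
	static __always_inline struct devres *alloc_dr(dr_release_t release,
						       size_t size, gfp_t gfp,
						       int nid)
	{
		size_t tot_size;
		struct devres *dr;

		if (unlikely(check_add_overflow(sizeof(*dr), size, &tot_size)))
			return NULL;

		/*
		 * _noprof: this internal call site no longer defines its
		 * own tag, so the allocation is charged to the tag set by
		 * alloc_hooks() at the devm_kmalloc() caller instead.
		 */
		dr = kmalloc_node_track_caller_noprof(tot_size, gfp, nid);
		if (unlikely(!dr))
			return NULL;

		/* ... initialize the devres node as before ... */
		return dr;
	}
]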
On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
[...]
> We're aiming to get this in the next merge window, for 6.9. The feedback
> we've gotten has been that even out of tree this patchset has already
> been useful, and there's a significant amount of other work gated on the
> code tagging functionality included in this patchset [2].

I suspect it will not come as a surprise that I really dislike the
implementation proposed here. I will not repeat my arguments, I have
done so on several occasions already.

Anyway, I didn't go as far as to nak it even though I _strongly_ believe
this debugging feature will add a maintenance overhead for a very long
time. I can live with all the downsides of the proposed implementation
_as long as_ there is a wider agreement from the MM community, as this
is where the maintenance cost will be paid. So far I have not seen
(m)any acks by MM developers, so aiming for the next merge window is
more than a little rushed.

> 81 files changed, 2126 insertions(+), 695 deletions(-)
On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> this debugging feature will add a maintenance overhead for a very long
> time. I can live with all the downsides of the proposed implementation
> _as long as_ there is a wider agreement from the MM community, as this
> is where the maintenance cost will be paid. So far I have not seen
> (m)any acks by MM developers, so aiming for the next merge window is
> more than a little rushed.

We tried other previously proposed approaches and all have their
downsides without making maintenance much easier. Your position is
understandable and I think it's fair. Let's see if others see more
benefit than cost here.
Thanks,
Suren.
On 13.02.24 22:58, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <mhocko@suse.com> wrote:
>> [...]
>> So far I have not seen (m)any acks by MM developers, so aiming for the
>> next merge window is more than a little rushed.
>
> We tried other previously proposed approaches and all have their
> downsides without making maintenance much easier. Your position is
> understandable and I think it's fair. Let's see if others see more
> benefit than cost here.

Would it make sense to discuss that at LSF/MM once again, especially
covering why proposed alternatives did not work out? LSF/MM is not "too
far" away (May).

I recall that the last LSF/MM session on this topic was a bit
unfortunate (IMHO not as productive as it could have been). Maybe we can
finally reach a consensus on this.
On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> Would it make sense to discuss that at LSF/MM once again, especially
> covering why proposed alternatives did not work out? LSF/MM is not "too
> far" away (May).
>
> I recall that the last LSF/MM session on this topic was a bit
> unfortunate (IMHO not as productive as it could have been). Maybe we can
> finally reach a consensus on this.

I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
need to see a serious proposal - what we had at the last LSF was people
jumping in with half-baked alternative proposals that very much hadn't
been thought through, and I see no need to repeat that.

Like I mentioned, there's other work gated on this patchset; if people
want to hold this up for more discussion they'd better be putting forth
something to discuss.
On 13.02.24 23:09, Kent Overstreet wrote:
> I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> need to see a serious proposal - what we had at the last LSF was people
> jumping in with half-baked alternative proposals that very much hadn't
> been thought through, and I see no need to repeat that.
>
> Like I mentioned, there's other work gated on this patchset; if people
> want to hold this up for more discussion they'd better be putting forth
> something to discuss.

I'm thinking of ways on how to achieve Michal's request: "as long as
there is a wider agreement from the MM community". If we can achieve
that without LSF, great! (A bi-weekly MM meeting might also be an
option.)
On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
> I'm thinking of ways on how to achieve Michal's request: "as long as
> there is a wider agreement from the MM community". If we can achieve
> that without LSF, great! (A bi-weekly MM meeting might also be an
> option.)

A meeting wouldn't be out of the question, _if_ there is an agenda, but:
what's that coffee mug say? I just survived another meeting that
could've been an email?

What exactly is the outcome we're looking for?

Is there info that people are looking for? I think we summed things up
pretty well in the cover letter; if there are specifics that people want
to discuss, that's why we emailed the series out.

There's people in this thread who've used this patchset in production
and diagnosed real issues (gigabytes of memory gone missing, I heard the
other day); I'm personally looking for them to chime in on this thread
(Johannes, Pasha).
If it's just grumbling about "maintenance overhead" we need to get past - well, people are going to have to accept that we can't deliver features without writing code, and I'm confident that the hooking in particular is about as clean as it's going to get, _regardless_ of toolchain support; and moreover it addresses what's been historically a pretty gaping hole in our ability to profile and understand the code we write.
On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <david@redhat.com> wrote:
>
> I'm thinking of ways on how to achieve Michal's request: "as long as
> there is a wider agreement from the MM community". If we can achieve
> that without LSF, great! (A bi-weekly MM meeting might also be an
> option.)

There will be a maintenance burden even with the cleanest proposed
approach. We worked hard to make the patchset as clean as possible, and
if the benefits still don't outweigh the maintenance cost then we should
probably stop trying. At LSF/MM I would rather discuss functional
issues/requirements/improvements than alternative approaches to
instrumenting allocators. I'm happy to arrange a separate meeting with
MM folks if that would help to progress on the cost/benefit decision.
On 13.02.24 23:30, Suren Baghdasaryan wrote:
> There will be a maintenance burden even with the cleanest proposed
> approach.

Yes.

> We worked hard to make the patchset as clean as possible, and if the
> benefits still don't outweigh the maintenance cost then we should
> probably stop trying.

Indeed.

> At LSF/MM I would rather discuss functional
> issues/requirements/improvements than alternative approaches to
> instrumenting allocators. I'm happy to arrange a separate meeting with
> MM folks if that would help to progress on the cost/benefit decision.

Note that I am only proposing ways forward.

If you think you can easily achieve what Michal requested without all
that, good.

My past experience was that LSF/MM / bi-weekly MM meetings were
extremely helpful to reach consensus.
On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
> On 13.02.24 23:30, Suren Baghdasaryan wrote:
>> [...]
>
> Note that I am only proposing ways forward.
>
> If you think you can easily achieve what Michal requested without all
> that, good.

He requested something?
On 13.02.24 23:50, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
>> [...]
>>
>> If you think you can easily achieve what Michal requested without all
>> that, good.
>
> He requested something?

This won't get merged without acks from MM people.
On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
>> [...]
>>
>> If you think you can easily achieve what Michal requested without all
>> that, good.
>
> He requested something?

Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
possible until the compiler feature is developed and deployed. And it
would still require changes to the headers, so I don't think it's worth
delaying the feature for years.
On 13.02.24 23:59, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
>> [...]
>>
>> He requested something?
>
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> would still require changes to the headers, so I don't think it's worth
> delaying the feature for years.

I was talking about this: "I can live with all the downsides of the
proposed implementation _as long as_ there is a wider agreement from the
MM community as this is where the maintenance cost will be paid. So far
I have not seen (m)any acks by MM developers."

I certainly cannot be motivated at this point to review and ack this;
unfortunately, too much negative energy around here.
On Tue, Feb 13, 2024 at 02:59:11PM -0800, Suren Baghdasaryan wrote:
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> would still require changes to the headers, so I don't think it's worth
> delaying the feature for years.

Hang on, let's look at the actual code. This is what instrumenting an
allocation function looks like:

#define krealloc_array(...) alloc_hooks(krealloc_array_noprof(__VA_ARGS__))

IOW, we have to:
- rename krealloc_array to krealloc_array_noprof
- replace krealloc_array with a one-line wrapper macro

Is this really all we're getting worked up over?

The renaming we need regardless, because the thing that makes this
approach efficient enough to run in production is that we account at
_one_ point in the callstack; we don't save entire backtraces. And thus
we need to explicitly annotate which one that is, which means we need
_noprof() versions of functions for when the accounting is done by an
outer wrapper (e.g. mempool).

And, as I keep saying: that alloc_hooks() macro will also get us _per
callsite fault injection points_, and we really need that because - if
you guys have been paying attention to other threads - whenever moving
more stuff to PF_MEMALLOC_* flags comes up (including adding
PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
not being testable keeps coming up.
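[For context, the alloc_hooks() wrapper referred to here has roughly the
following shape. This is a simplified sketch; the exact macro differs
between versions of the series:

	/*
	 * DEFINE_ALLOC_TAG() is assumed to emit one static struct
	 * alloc_tag per call site into a dedicated section and make it
	 * the current tag (saving the previous one in _old); the
	 * slab/page allocator core then charges the allocation to the
	 * current tag.
	 */
	#define alloc_hooks(_do_alloc)					\
	({								\
		DEFINE_ALLOC_TAG(_alloc_tag, _old); /* per-callsite tag */ \
		typeof(_do_alloc) _res = _do_alloc; /* real allocation */  \
		alloc_tag_restore(&_alloc_tag, _old); /* pop the tag */    \
		_res;							\
	})

Because every wrapped call site owns a static tag, the same spot is a
natural place to hang a per-callsite fault injection knob, which is the
point made above.]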
On Tue, Feb 13, 2024 at 05:29:03PM -0500, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
>> Would it make sense to discuss that at LSF/MM once again, especially
>> covering why proposed alternatives did not work out? LSF/MM is not
>> "too far" away (May).

You want to stall this effort for *three months* to schedule a meeting?

I would love to have better profiling of memory allocations inside XFS
so that we can answer questions like "What's going on with these
allocation stalls?" or "What memory is getting used, and where?" more
quickly than we can now.

Right now I get to stare at tracepoints and printk crap until my eyes
bleed, and maybe dchinner comes to my rescue and figures out what's
going on sooner than I do. More often we just never figure it out
because only the customer can reproduce the problem, the reams of data
produced by ftrace are unmanageable, and BPF isn't always available.

I'm not thrilled by the large increase in macro crap in the allocation
paths, but I don't know of a better way to instrument things. Our
attempts to use _RET_IP_ in XFS to instrument its code paths have never
worked quite right w.r.t. inlined functions and whatnot.

>> I recall that the last LSF/MM session on this topic was a bit
>> unfortunate (IMHO not as productive as it could have been). Maybe we
>> can finally reach a consensus on this.

From my outsider's perspective, nine months have gone by since the last
LSF. Who has come up with a cleaner/better/faster way to do what Suren
and Kent have done? Were those code changes integrated into this
patchset? Or why not? Most of what I saw in 2023 involved compiler
changes (great; now I have to wait until RHEL 11/Debian 14 to use it)
and/or still utilized fugly macros.

Recalling all the way back to suggestions made in 2022, who wrote the
prototype for doing this via ftrace? Or BPF? How well did that go for
counting allocation events and the like? I saw Tejun saying something
about how they use BPF aggressively inside Meta, but that was about it.
Were any of those solutions significantly better than what's in front
of us here?

I get it, a giant patch forcing everyone to know the difference between
alloc_foo and alloc_foo_noprof offends my (your?) stylistic
sensibilities. On the other side, making analysis easier during customer
escalations means we kernel people get data, answers, and solutions
sooner instead of watching all our time get eaten up on L4 support and
backporting hell.

> A meeting wouldn't be out of the question, _if_ there is an agenda,
> but: what's that coffee mug say? I just survived another meeting that
> could've been an email?

I congratulate you on your memory of my kitchen mug. Yes, that's what it
says.

> What exactly is the outcome we're looking for?
>
> Is there info that people are looking for? I think we summed things up
> pretty well in the cover letter; if there are specifics that people
> want to discuss, that's why we emailed the series out.
>
> There's people in this thread who've used this patchset in production
> and diagnosed real issues (gigabytes of memory gone missing, I heard
> the other day); I'm personally looking for them to chime in on this
> thread (Johannes, Pasha).
>
> If it's just grumbling about "maintenance overhead" we need to get past
> - well, people are going to have to accept that we can't deliver
> features without writing code, and I'm confident that the hooking in
> particular is about as clean as it's going to get, _regardless_ of
> toolchain support; and moreover it addresses what's been historically a
> pretty gaping hole in our ability to profile and understand the code we
> write.

Are you and Suren willing to pay whatever maintenance overhead there is?

--D
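[To illustrate the _RET_IP_ problem mentioned above: _RET_IP_ is just
__builtin_return_address(0), so what it identifies depends on the
compiler's inlining decisions. A minimal sketch; the helper below is
hypothetical, not actual XFS code:

	#include <linux/kernel.h> /* _RET_IP_, trace_printk() */

	/* Hypothetical tracing helper -- not from the XFS tree. */
	static inline void trace_my_alloc(size_t size)
	{
		/*
		 * When this helper is a real call, _RET_IP_ points into
		 * its immediate caller, which is the desired call site.
		 * But if the compiler inlines it into foo(), _RET_IP_
		 * becomes foo()'s own return address -- i.e. foo()'s
		 * caller -- and the attribution silently jumps a frame
		 * depending on optimization.  A static per-callsite tag,
		 * as in this series, has no such ambiguity.
		 */
		trace_printk("alloc %zu from %pS\n", size, (void *)_RET_IP_);
	}
]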
On Wed, Feb 14, 2024 at 12:02:30AM +0100, David Hildenbrand wrote:
> On 13.02.24 23:59, Suren Baghdasaryan wrote:
>> [...]
>
> I was talking about this: "I can live with all the downsides of the
> proposed implementation _as long as_ there is a wider agreement from
> the MM community as this is where the maintenance cost will be paid. So
> far I have not seen (m)any acks by MM developers."
>
> I certainly cannot be motivated at this point to review and ack this;
> unfortunately, too much negative energy around here.

David, this kind of reaction is exactly why I was telling Andrew I was
going to submit this as a direct pull request to Linus.

This is an important feature; if we can't stay focused on the technical
side and get it done, that's what I'll do.
On 14.02.24 00:12, Kent Overstreet wrote:
> On Wed, Feb 14, 2024 at 12:02:30AM +0100, David Hildenbrand wrote:
>> [...]
>>
>> I certainly cannot be motivated at this point to review and ack this;
>> unfortunately, too much negative energy around here.
>
> David, this kind of reaction is exactly why I was telling Andrew I was
> going to submit this as a direct pull request to Linus.
>
> This is an important feature; if we can't stay focused on the technical
> side and get it done, that's what I'll do.

Kent, I started this with "Would it make sense" in an attempt to help
Suren and you to finally make progress with this, one way or the other.
I know that there were ways in the past to get the MM community to agree
on such things.

I tried to be helpful, finding ways *not having to* bypass the MM
community to get MM stuff merged. The reply I got is mostly negative
energy.

So you don't need my help here, understood. But I will fight against any
attempts to bypass the MM community.
On Tue, Feb 13, 2024 at 03:11:15PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 05:29:03PM -0500, Kent Overstreet wrote:
> > On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
[...]
> > > > > Would it make sense to discuss that at LSF/MM once again, especially
> > > > > covering why proposed alternatives did not work out? LSF/MM is not "too far"
> > > > > away (May).
>
> You want to stall this effort for *three months* to schedule a meeting?
>
> I would love to have better profiling of memory allocations inside XFS
> so that we can answer questions like "What's going on with these
> allocation stalls?" or "What memory is getting used, and where?" more
> quickly than we can now.
>
> Right now I get to stare at tracepoints and printk crap until my eyes
> bleed, and maybe dchinner comes to my rescue and figures out what's
> going on sooner than I do. More often we just never figure it out
> because only the customer can reproduce the problem, the reams of data
> produced by ftrace is unmanageable, and BPF isn't always available.
>
> I'm not thrilled by the large increase in macro crap in the allocation
> paths, but I don't know of a better way to instrument things. Our
> attempts to use _RET_IP in XFS to instrument its code paths have never
> worked quite right w.r.t. inlined functions and whatnot.
>
> > > > > I recall that the last LSF/MM session on this topic was a bit unfortunate
> > > > > (IMHO not as productive as it could have been). Maybe we can finally reach a
> > > > > consensus on this.
>
> From my outsider's perspective, nine months have gone by since the last
> LSF. Who has come up with a cleaner/better/faster way to do what Suren
> and Kent have done? Were those code changes integrated into this
> patchset? Or why not?
>
> Most of what I saw in 2023 involved compiler changes (great; now I have
> to wait until RHEL 11/Debian 14 to use it) and/or still utilize fugly
> macros.
>
> Recalling all the way back to suggestions made in 2022, who wrote the
> prototype for doing this via ftrace? Or BPF? How well did that go for
> counting allocation events and the like? I saw Tejun saying something
> about how they use BPF aggressively inside Meta, but that was about it.
>
> Were any of those solutions significantly better than what's in front of
> us here?
>
> I get it, a giant patch forcing everyone to know the difference between
> alloc_foo and alloc_foo_noprof offends my (yours?) stylistic
> sensibilities. On the other side, making analysis easier during
> customer escalations means we kernel people get data, answers, and
> solutions sooner instead of watching all our time get eaten up on L4
> support and backporting hell.
>
> > > > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > > > need to see a serious proposal - what we had at the last LSF was people
> > > > jumping in with half baked alternative proposals that very much hadn't
> > > > been thought through, and I see no need to repeat that.
> > > >
> > > > Like I mentioned, there's other work gated on this patchset; if people
> > > > want to hold this up for more discussion they better be putting forth
> > > > something to discuss.
> > >
> > > I'm thinking of ways on how to achieve Michal's request: "as long as there
> > > is a wider agreement from the MM community". If we can achieve that without
> > > LSF, great! (a bi-weekly MM meeting might also be an option)
> >
> > A meeting wouldn't be out of the question, _if_ there is an agenda, but:
> >
> > What's that coffee mug say? I just survived another meeting that
> > could've been an email?
>
> I congratulate you on your memory of my kitchen mug. Yes, that's what
> it says.
>
> > What exactly is the outcome we're looking for?
> >
> > Is there info that people are looking for? I think we summed things up
> > pretty well in the cover letter; if there are specifics that people
> > want to discuss, that's why we emailed the series out.
> >
> > There's people in this thread who've used this patchset in production
> > and diagnosed real issues (gigabytes of memory gone missing, I heard the
> > other day); I'm personally looking for them to chime in on this thread
> > (Johannes, Pasha).
> >
> > If it's just grumbling about "maintenance overhead" we need to get past
> > - well, people are going to have to accept that we can't deliver
> > features without writing code, and I'm confident that the hooking in
> > particular is about as clean as it's going to get, _regardless_ of
> > toolchain support; and moreover it addresses what's been historically a
> > pretty gaping hole in our ability to profile and understand the code we
> > write.
>
> Are you and Suren willing to pay whatever maintenance overhead there is?

I'm still wondering what this supposed "maintenance overhead" is going
to be...

As I use this patch series I occasionally notice places where a bunch of
memory is being accounted to one line of code when it would be better
accounted to a caller - but then it's just a couple lines of code to fix
that. You switch that callsite to the _noprof() version of whatever
allocation it's doing, then add an alloc_hooks() wrapper at the place
you do want it accounted.

That doesn't really feel like overhead to me, just the normal tweaking
of your tools to get the most out of them.
I will continue to do that for the code I'm looking at, yes. If other people are doing that too, it'll be because they're also using memory allocation profiling and finding it valuable. I did notice earlier that we're still lacking documentation in the Documentation/ directory; the workflow for "how you shift accounting to the right spot" is something that should go in there.
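To make that workflow concrete, here is a minimal sketch of the fixup Kent
describes. The helper foo_create() and struct foo are made up for
illustration; kmalloc_noprof() and alloc_hooks() follow the naming pattern
used by the series:

	/* Hypothetical helper: before the fixup, every allocation it makes
	 * is accounted to the single kmalloc() line inside it, hiding the
	 * real consumers in /proc/allocinfo. */

	/* Step 1: switch the internal callsite to the _noprof() variant and
	 * rename the helper, so it no longer claims the allocation itself. */
	static inline struct foo *foo_create_noprof(gfp_t gfp)
	{
		return kmalloc_noprof(sizeof(struct foo), gfp);
	}

	/* Step 2: re-add foo_create() as an alloc_hooks() wrapper, so a
	 * separate codetag is created at - and the memory accounted to -
	 * each caller of foo_create(). */
	#define foo_create(_gfp)	alloc_hooks(foo_create_noprof(_gfp))

Each callsite of foo_create() then shows up as its own file:line entry in
/proc/allocinfo.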
On Tue, Feb 13, 2024 at 3:22 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.02.24 00:12, Kent Overstreet wrote:
[...]
>
> Kent, I started this with "Would it make sense" in an attempt to help
> Suren and you to finally make progress with this, one way or the other.
> I know that there were ways in the past to get the MM community to agree
> on such things.
>
> I tried to be helpful, finding ways *not having to* bypass the MM
> community to get MM stuff merged.
>
> The reply I got is mostly negative energy.
>
> So you don't need my help here, understood.
>
> But I will fight against any attempts to bypass the MM community.

Well, I'm definitely not trying to bypass the MM community, that's why
this patchset is posted. Not sure why people can't voice their opinion
on the benefit/cost balance of the patchset over the email... But if a
meeting would be more productive I'm happy to set it up.

>
> --
> Cheers,
>
> David / dhildenb
>
> > I tried to be helpful, finding ways *not having to* bypass the MM > > community to get MM stuff merged. > > > > The reply I got is mostly negative energy. > > > > So you don't need my help here, understood. > > > > But I will fight against any attempts to bypass the MM community. > > Well, I'm definitely not trying to bypass the MM community, that's why > this patchset is posted. Not sure why people can't voice their opinion > on the benefit/cost balance of the patchset over the email... But if a > meeting would be more productive I'm happy to set it up. Discussing these concerns during the next available MM Alignment session makes sense. At a minimum, Suren and Kent can present their reasons for believing the current approach is superior to the previously proposed alternatives. However, delaying the discussion and feature merge until after LSF/MM seems unnecessary. As I mentioned earlier in this thread, we've already leveraged the concepts within this feature to debug unexplained memory overhead, saving us many terabytes of memory. This was just the initial benefit; we haven't even explored its full potential to track every allocation path. Pasha
On Tue, Feb 13, 2024 at 06:54:09PM -0500, Pasha Tatashin wrote:
> > > I tried to be helpful, finding ways *not having to* bypass the MM
> > > community to get MM stuff merged.
> > >
> > > The reply I got is mostly negative energy.
> > >
> > > So you don't need my help here, understood.
> > >
> > > But I will fight against any attempts to bypass the MM community.
> >
> > Well, I'm definitely not trying to bypass the MM community, that's why
> > this patchset is posted. Not sure why people can't voice their opinion
> > on the benefit/cost balance of the patchset over the email... But if a
> > meeting would be more productive I'm happy to set it up.
>
> Discussing these concerns during the next available MM Alignment
> session makes sense. At a minimum, Suren and Kent can present their
> reasons for believing the current approach is superior to the
> previously proposed alternatives.

Hang on though: I believe we did so adequately within this thread -
both in the cover letter, and where I further outlined exactly what the
hooks need to do and cited the exact code.

Nobody seems to be complaining about the specifics, so I'm not sure
what would be on the agenda?
I'll do a more thorough code review, but before the discussion gets too
sidetracked, I wanted to add my POV on the overall merit of the
direction that is being proposed here.

I have backported and used this code for debugging production issues
before. Logging into a random host with an unfamiliar workload and
being able to get a reliable, comprehensive list of kernel memory
consumers is one of the coolest things I have seen in a long time.
This is a huge improvement to sysadmin quality of life.

It's also a huge improvement for MM developers. We're the first points
of contact for memory regressions that can be caused by pretty much any
driver or subsystem in the kernel.

I encourage anybody who is undecided on whether this is worth doing to
build a kernel with these patches applied and run it on their own
machine. I think you'll be surprised what you'll find - and how myopic
and uninformative /proc/meminfo feels in comparison to this. Did you
know there is a lot more to modern filesystems than the VFS objects we
are currently tracking? :) Then imagine what this looks like on a
production host running a complex mix of filesystems, enterprise
networking, bpf programs, gpus and accelerators etc.

Backporting the code to a slightly older production kernel wasn't too
difficult. The instrumentation layering is explicit, clean, and fairly
centralized, so resolving minor conflicts around the _noprof renames
and the wrappers was pretty straight-forward.

When we talk about maintenance cost, a fair shake would be to weigh it
against the cost and reliability of our current method: evaluating
consumers in the kernel on a case-by-case basis and annotating the
alloc/free sites by hand; then quibbling with the MM community about
whether that consumer is indeed significant enough to warrant an entry
in /proc/meminfo, and what the catchiest name for the stat would be.

I think we can agree that this is vastly less scalable and more
burdensome than central annotations around a handful of mostly static
allocator entry points. Especially considering the rate of change in
the kernel as a whole, and that not everybody will think of the
comprehensive MM picture when writing a random driver. And I think
that's generous - we don't even have the network stack in meminfo.

So I think what we do now isn't working. In the Meta fleet, at any
given time the p50 for unaccounted kernel memory is several gigabytes
per host. The p99 is between 15% and 30% of total memory. That's a
looot of opaque resource usage we have to accept on faith.

For hunting down regressions, all it takes is one untracked consumer in
the kernel to really throw a wrench into things. It's difficult to find
in the noise with tracing, and if it's not growing after an initial
allocation spike, you're pretty much out of luck finding it at all.
Raise your hand if you've written a drgn script to walk pfns and try to
guess consumers from the state of struct page :)

I agree we should discuss how the annotations are implemented on a
technical basis, but my take is that we need something like this. In a
codebase of our size, I don't think the allocator should be handing out
memory without some basic implied tracking of where it's going. It's a
liability for production environments, and it can hide bad memory
management decisions in drivers and other subsystems for a very long
time.
On 14.02.24 00:28, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 3:22 PM David Hildenbrand <david@redhat.com> wrote:
[...]
>
> Well, I'm definitely not trying to bypass the MM community, that's why
> this patchset is posted. Not sure why people can't voice their opinion
> on the benefit/cost balance of the patchset over the email... But if a
> meeting would be more productive I'm happy to set it up.

If you can get the acks without any additional meetings, great. The
replies from Pasha and Johannes are encouraging, let's hope core
memory-allocator people will voice their opinion here.

If you come to the conclusion that another meeting would help get
maintainers' attention and sort out some of the remaining concerns,
feel free to schedule a meeting with Dave R. I suspect only the slot
next week is already taken. In the past, we also had "special" meetings
just for things to make progress faster. If you're looking for ideas on
what the agenda of such a meeting could look like, I'll happily discuss
that with you off-list.

v2 was more than 3 months ago. If it's really about minor details here,
waiting another 3 months for LSF/MM is indeed not reasonable.

Myself, I'll be happy not having to sit through another LSF/MM session
of that kind.
The level of drama is exceptional and I'm hoping it won't be the new norm in the MM space. Good luck!
On 2/14/24 00:08, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 02:59:11PM -0800, Suren Baghdasaryan wrote:
>> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
>> <kent.overstreet@linux.dev> wrote:
[...]
>> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
>> possible until the compiler feature is developed and deployed. And it
>> still would require changes to the headers, so don't think it's worth
>> delaying the feature for years.
>
> Hang on, let's look at the actual code.
>
> This is what instrumenting an allocation function looks like:
>
> #define krealloc_array(...)	alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
>
> IOW, we have to:
>  - rename krealloc_array to krealloc_array_noprof
>  - replace krealloc_array with a one wrapper macro call
>
> Is this really all we're getting worked up over?
>
> The renaming we need regardless, because the thing that makes this
> approach efficient enough to run in production is that we account at
> _one_ point in the callstack, we don't save entire backtraces.
>
> And thus we need to explicitly annotate which one that is; which means
> we need _noprof() versions of functions for when the accounting is done
> by an outer wrapper (e.g. mempool).
>
> And, as I keep saying: that alloc_hooks() macro will also get us _per
> callsite fault injection points_, and we really need that because - if
> you guys have been paying attention to other threads - whenever moving
> more stuff to PF_MEMALLOC_* flags comes up (including adding
> PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
> not being testable keeps coming up.

How exactly do you envision the fault injection to help here? The proposals
are about scoping via a process flag, and the process may then call just
about anything under that scope. So if our tool is per callsite fault
injection points, how do we know which callsites to enable to focus the
fault injection on the particular scope?
On Tue 13-02-24 14:59:11, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
[...]
> > > If you think you can easily achieve what Michal requested without all that,
> > > good.
> >
> > He requested something?
>
> Yes, a cleaner instrumentation.

Nope, not really. You have indicated you want to target this version
for the _next_ merge window without any acks, really. If you want to go
forward with this then you should gain support from the MM community at
least.

Why? Because the whole macro layering is adding maintenance cost for MM
people. I have expressed why I absolutely hate the additional macro
layer. We have been through similar layers of macros in other areas
(not to mention the page allocator interface itself) and it has
_always_ turned out a bad idea long term. I do not see why this case
should be any different.

The whole kernel is moving to a dynamic tracing realm and now we are
going to build a statically macro-based tracing infrastructure which
will need tweaking anytime real memory consumers are one layer up the
existing macro infrastructure (do not forget quite a lot of allocations
are in library functions) and/or we need to modify the allocator API in
some way. Call me unimpressed!

Now, I fully recognize that the solution doesn't really have to be
perfect in order to be useful. Hence I never NAKed it even though I
really _dislike_ the approach. I expected you would grow community
support over time if this is indeed the only feasible approach, but
that is not reflected in the series posted here. If you find that
support I will not stand in the way.

> Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> still would require changes to the headers, so don't think it's worth
> delaying the feature for years.

I am pretty sure you have invested non-trivial time into evaluating
other ways, yet your cover letter is rather modest about any details:

: - looked at alternate hooking methods.
:   There were suggestions on alternate methods (compiler attribute,
:   trampolines), but they wouldn't have made the patchset any cleaner
:   (we still need to have different function versions for accounting vs. no
:   accounting to control at which point in a call chain the accounting
:   happens), and they would have added a dependency on toolchain
:   support.

First immediate question would be: What about page_owner? I do remember
the runtime overhead being discussed but I do not really remember any
actual numbers outside of artificial workloads. Has this been
investigated? Is our stack unwinder the problem? Etc.

Also, what are the biggest obstacles to efficiently tracking
allocations via our tracing infrastructure? Has this been investigated?
What were the conclusions?
On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
[...]
> I agree we should discuss how the annotations are implemented on a
> technical basis, but my take is that we need something like this.

I do not think there is any disagreement on usefulness of a better
memory allocation tracking. At least for me the primary problem is the
implementation. At LSF/MM last year we heard that existing tracing
infrastructure hasn't really been explored much. Cover letter doesn't
really talk much about those alternatives so it is really hard to
evaluate whether the proposed solution is indeed our best way to
approach this.

> In a codebase of our size, I don't think the allocator should be
> handing out memory without some basic implied tracking of where it's
> going. It's a liability for production environments, and it can hide
> bad memory management decisions in drivers and other subsystems for a
> very long time.

Fully agreed! It is quite common to see oom reports with a large
portion of memory unaccounted, and this really presents an additional
cost on the debugging side.
On Tue, Feb 13, 2024 at 06:08:45PM -0500, Kent Overstreet wrote: > This is what instrumenting an allocation function looks like: > > #define krealloc_array(...) alloc_hooks(krealloc_array_noprof(__VA_ARGS__)) > > IOW, we have to: > - rename krealloc_array to krealloc_array_noprof > - replace krealloc_array with a one wrapper macro call > > Is this really all we're getting worked up over? > > The renaming we need regardless, because the thing that makes this > approach efficient enough to run in production is that we account at > _one_ point in the callstack, we don't save entire backtraces. I'm probably going to regret getting involved in this thread, but since Suren already decided to put me on the cc ... There might be a way to do it without renaming. We have a bit of the linker script called SCHED_TEXT which lets us implement in_sched_functions(). ie we could have the equivalent of include/linux/sched/debug.h:#define __sched __section(".sched.text") perhaps #define __memalloc __section(".memalloc.text") which would do all the necessary magic to know where the backtrace should stop.
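For what it's worth, the predicate Matthew is alluding to could look
roughly like the sketch below, modeled on in_sched_functions(). The
.memalloc.text section and its boundary symbols are hypothetical and would
need matching vmlinux.lds.h plumbing:

	/* Hypothetical boundary symbols the linker script would emit
	 * around a new .memalloc.text section (by analogy with SCHED_TEXT). */
	extern char __memalloc_text_start[], __memalloc_text_end[];

	#define __memalloc __section(".memalloc.text")

	/* Analogue of in_sched_functions(): a backtrace walker could use
	 * this to skip frames belonging to annotated allocator wrappers
	 * and stop at the real caller. */
	static inline bool in_memalloc_functions(unsigned long addr)
	{
		return addr >= (unsigned long)__memalloc_text_start &&
		       addr <  (unsigned long)__memalloc_text_end;
	}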
On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> [...]
> > I agree we should discuss how the annotations are implemented on a
> > technical basis, but my take is that we need something like this.
>
> I do not think there is any disagreement on usefulness of a better
> memory allocation tracking. At least for me the primary problem is the
> implementation. At LSF/MM last year we heard that existing tracing
> infrastructure hasn't really been explored much. Cover letter doesn't
> really talk much about those alternatives so it is really hard to
> evaluate whether the proposed solution is indeed our best way to
> approach this.

Michal, we covered this before.

To do this with tracing you'd have to build up data structures
separately, in userspace, that would mirror the allocator's data
structures; you would have to track every single allocation so that you
could match up the free event to the place it was allocated.

Even if it could be built, which I doubt, it'd be completely non-viable
because the performance would be terrible.

Like I said, we covered all this before; if you're going to spend so
much time on these threads you really should be making a better attempt
at keeping up with what's been talked about.
On Wed, Feb 14, 2024 at 03:00:00PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 13, 2024 at 06:08:45PM -0500, Kent Overstreet wrote:
> > This is what instrumenting an allocation function looks like:
> >
> > #define krealloc_array(...)	alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
> >
> > IOW, we have to:
> >  - rename krealloc_array to krealloc_array_noprof
> >  - replace krealloc_array with a one wrapper macro call
> >
> > Is this really all we're getting worked up over?
> >
> > The renaming we need regardless, because the thing that makes this
> > approach efficient enough to run in production is that we account at
> > _one_ point in the callstack, we don't save entire backtraces.
>
> I'm probably going to regret getting involved in this thread, but since
> Suren already decided to put me on the cc ...
>
> There might be a way to do it without renaming. We have a bit of the
> linker script called SCHED_TEXT which lets us implement
> in_sched_functions(). ie we could have the equivalent of
>
> include/linux/sched/debug.h:#define __sched	__section(".sched.text")
>
> perhaps #define __memalloc __section(".memalloc.text")
> which would do all the necessary magic to know where the backtrace
> should stop.

Could we please try to get through the cover letter before proposing
alternatives? I already explained there why we need the renaming.

In addition, you can't create the per-callsite codetag with linker
magic; you need the macro for that.

Instead of citing myself again, I'm just going to post what I was
working on last night for the documentation directory:

.. SPDX-License-Identifier: GPL-2.0

===========================
MEMORY ALLOCATION PROFILING
===========================

Low overhead (suitable for production) accounting of all memory allocations,
tracked by file and line number.

Usage:
kconfig options:
 - CONFIG_MEM_ALLOC_PROFILING
 - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
 - CONFIG_MEM_ALLOC_PROFILING_DEBUG
   adds warnings for allocations that weren't accounted because of a
   missing annotation

sysctl:
  /proc/sys/vm/mem_profiling

Runtime info:
  /proc/allocinfo

Example output:
  root@moria-kvm:~# sort -h /proc/allocinfo|tail
   3.11MiB     2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
   3.52MiB      225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
   3.75MiB      960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
   4.00MiB        2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
   10.5MiB      168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
   14.0MiB     3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
   26.8MiB     6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
   64.5MiB    98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
   98.7MiB    25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
    125MiB     7357 mm/slub.c:2201 module:slub func:alloc_slab_page

Theory of operation:

Memory allocation profiling builds off of code tagging, which is a library
for declaring static structs (that typically describe a file and line number
in some way, hence code tagging) and then finding and operating on them at
runtime - i.e. iterating over them to print them in debugfs/procfs.

To add accounting for an allocation call, we replace it with a macro
invocation, alloc_hooks(), that
 - declares a code tag
 - stashes a pointer to it in task_struct
 - calls the real allocation function
 - and finally, restores the task_struct alloc tag pointer to its previous
   value.
This allows for alloc_hooks() calls to be nested, with the most recent
one taking effect. This is important for allocations internal to the
mm/ code that do not properly belong to the outer allocation context
and should be counted separately: for example, slab object extension
vectors, or when the slab allocates pages from the page allocator.

Thus, proper usage requires determining which function in an allocation
call stack should be tagged. There are many helper functions that
essentially wrap e.g. kmalloc() and do a little more work, then are
called in multiple places; we'll generally want the accounting to
happen in the callers of these helpers, not in the helpers themselves.

To fix up a given helper, for example foo(), do the following:
 - switch its allocation call to the _noprof() version, e.g. kmalloc_noprof()
 - rename it to foo_noprof()
 - define a macro version of foo() like so:

   #define foo(...)	alloc_hooks(foo_noprof(__VA_ARGS__))
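For readers following along, a simplified sketch of what an alloc_hooks()
invocation expands to, paraphrased from the series (the helper names track
the patchset, but the details are abbreviated):

	/* Simplified from the series: declares the static codetag for this
	 * file:line and swaps it into current->alloc_tag around the real
	 * allocation call. */
	#define alloc_hooks(_do_alloc)					\
	({								\
		typeof(_do_alloc) _res;					\
		/* static struct alloc_tag in the "alloc_tags" section,	\
		 * plus _old = alloc_tag_save(&_alloc_tag) to stash the	\
		 * outer tag */						\
		DEFINE_ALLOC_TAG(_alloc_tag, _old);			\
									\
		_res = _do_alloc;	/* e.g. kmalloc_noprof(...) */	\
		/* restore the outer context so nested alloc_hooks()	\
		 * calls keep accounting to their own callsites */	\
		alloc_tag_restore(&_alloc_tag, _old);			\
		_res;							\
	})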
On Wed 14-02-24 10:01:14, Kent Overstreet wrote:
> On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> > On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> > [...]
> > > I agree we should discuss how the annotations are implemented on a
> > > technical basis, but my take is that we need something like this.
> >
> > I do not think there is any disagreement on usefulness of a better
> > memory allocation tracking. At least for me the primary problem is the
> > implementation. At LSF/MM last year we heard that existing tracing
> > infrastructure hasn't really been explored much. Cover letter doesn't
> > really talk much about those alternatives so it is really hard to
> > evaluate whether the proposed solution is indeed our best way to
> > approach this.
>
> Michal, we covered this before.

It is a good practice to summarize previous discussions in the cover
letter. Especially when there are different approaches discussed over a
longer time period or when the topic is controversial.

I do not see anything like that here. Neither for the existing tracing
infrastructure, page_owner, nor the performance concerns discussed
before, etc. Look, I do not want to nitpick or insist on formalisms,
but having those data points laid out would make any further discussion
much smoother.
On Wed, Feb 14, 2024 at 05:02:28PM +0100, Michal Hocko wrote:
> On Wed 14-02-24 10:01:14, Kent Overstreet wrote:
> > On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> > > On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> > > [...]
> > > > I agree we should discuss how the annotations are implemented on a
> > > > technical basis, but my take is that we need something like this.
> > >
> > > I do not think there is any disagreement on usefulness of a better
> > > memory allocation tracking. At least for me the primary problem is the
> > > implementation. At LSF/MM last year we heard that existing tracing
> > > infrastructure hasn't really been explored much. Cover letter doesn't
> > > really talk much about those alternatives so it is really hard to
> > > evaluate whether the proposed solution is indeed our best way to
> > > approach this.
> >
> > Michal, we covered this before.
>
> It is a good practice to summarize previous discussions in the cover
> letter. Especially when there are different approaches discussed over a
> longer time period or when the topic is controversial.
>
> I do not see anything like that here. Neither for the existing tracing
> infrastructure, page_owner, nor the performance concerns discussed
> before, etc. Look, I do not want to nitpick or insist on formalisms,
> but having those data points laid out would make any further discussion
> much smoother.

You don't want to nitpick???

Look, you've been consistently sidestepping the technical discussion;
it seems all you want to talk about is process or "your nack". If we're
going to have a technical discussion, it's incumbent upon all of us to
/keep the focus on the technical/; that is everyone's responsibility.

I'm not going to write a 20 page cover letter and recap every dead end
that was proposed. That would be a lot of useless crap for everyone to
wade through. I'm going to summarize the important stuff, and keep the
focus on what we're doing and documenting it.

If you want to take part in a discussion, it's your responsibility to
be reading with comprehension and finding useful things to say.

You gotta stop with this derailing garbage.
On Wed 14-02-24 11:17:20, Kent Overstreet wrote:
[...]
> You gotta stop with this derailing garbage.
It is always a pleasure talking to you Kent, but let me give you some
advice (free of charge, of course). Let Suren talk; the chances for a
civilized and productive discussion are much higher!
I do not have much more to add to the discussion. My point stands: find
support from the MM community if you want to proceed with this work.
On Wed, Feb 14, 2024 at 11:20:26AM +0100, Vlastimil Babka wrote:
> On 2/14/24 00:08, Kent Overstreet wrote:
> > And, as I keep saying: that alloc_hooks() macro will also get us _per
> > callsite fault injection points_, and we really need that because - if
> > you guys have been paying attention to other threads - whenever moving
> > more stuff to PF_MEMALLOC_* flags comes up (including adding
> > PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
> > not being testable keeps coming up.
>
> How exactly do you envision the fault injection to help here? The proposals
> are about scoping via a process flag, and the process may then call just
> about anything under that scope. So if our tool is per callsite fault
> injection points, how do we know which callsites to enable to focus the
> fault injection on the particular scope?

So the question with fault injection is - how do we integrate it into
our existing tests? We need fault injection that we can integrate into
our existing tests because that's the only way to get the code coverage
we need - writing new tests that cover all the error paths isn't going
to happen, and wouldn't work as well anyways.

But the trouble with injecting memory allocation failures is that
they'll result in errors bubbling up to userspace, and in unpredictable
ways.

We _definitely_ cannot enable random memory allocation faults for the
entire kernel at runtime - or rather we _could_, and that would
actually be great to do as a side project; but that's not something we
can do in our existing automated tests because the results will be
completely unpredictable. If we did that the goal would be to just make
sure the kernel doesn't explode - but what we actually want is for our
automated pass/fail tests to still pass; we need to constrain what will
fail.

So we need at a minimum to be able to only enable memory allocation
failures for the code we're interested in testing (file/module) -
enabling memory allocation failures in some other random subsystem
we're not developing or looking at isn't what we want.

Beyond that, it's very much subsystem dependent. For bcachefs, my main
strategy has been to flip on random (1%) memory allocation failures
after the filesystem has mounted. During startup, we do a ton of
allocations (I cover those with separate tests), but after startup we
should be able to run normally in the presence of allocation failures
without ever returning an error to userspace - so that's what I'm
trying to test.
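Per-callsite fault injection isn't part of this series yet; purely as a
sketch of the idea Kent describes, the existing hook point could gate the
real allocation on a per-codetag knob. Every name below beyond the
alloc_hooks() machinery itself is hypothetical:

	/* Hypothetical: each codetag grows a fault-injection knob, toggled
	 * via debugfs per file/module, so failures can be constrained to
	 * the code under test rather than the whole kernel. */
	struct alloc_tag_fault {
		bool	enabled;	/* enabled for this callsite? */
		u8	probability;	/* fail this percentage of calls */
	};

	#define alloc_hooks_inject(_do_alloc)				\
	({								\
		typeof(_do_alloc) _res;					\
		DEFINE_ALLOC_TAG(_alloc_tag, _old);			\
									\
		if (should_fail_alloc_tag(&_alloc_tag))	/* hypothetical */ \
			_res = NULL;	/* inject failure here */	\
		else							\
			_res = _do_alloc;				\
		alloc_tag_restore(&_alloc_tag, _old);			\
		_res;							\
	})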
On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:

> > > If you think you can easily achieve what Michal requested without all that,
> > > good.
> >
> > He requested something?
>
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> still would require changes to the headers, so don't think it's worth
> delaying the feature for years.

Can we please be told much more about this compiler feature?
Description of what it is, what it does, how it will affect this kernel
feature, etc.

Who is developing it and when can we expect it to become available?

Will we be able to migrate to it without back-compatibility concerns?
(I think "you need quite recent gcc for memory profiling" is
reasonable).

Because: if the maintainability issues which Michal describes will be
significantly addressed with the gcc support then we're kinda reviewing
the wrong patchset. Yes, it may be a maintenance burden initially, but
at some (yet to be revealed) time in the future, this will be addressed
with the gcc support?
On Wed, Feb 14, 2024 at 8:31 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 14-02-24 11:17:20, Kent Overstreet wrote:
> [...]
> > You gotta stop with this derailing garbage.
>
> It is always a pleasure talking to you Kent, but let me give you some
> advice (free of charge, of course). Let Suren talk; the chances for a
> civilized and productive discussion are much higher!

Every time I wake up to a new drama... Sorry, I won't follow up on this
one, so as not to feed the fire.

>
> I do not have much more to add to the discussion. My point stands: find
> support from the MM community if you want to proceed with this work.
> --
> Michal Hocko
> SUSE Labs
On Wed, Feb 14, 2024 at 8:55 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > If you think you can easily achieve what Michal requested without all that,
> > > > good.
> > >
> > > He requested something?
> >
> > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > possible until the compiler feature is developed and deployed. And it
> > still would require changes to the headers, so don't think it's worth
> > delaying the feature for years.
>
> Can we please be told much more about this compiler feature?
> Description of what it is, what it does, how it will affect this kernel
> feature, etc.

Sure. The compiler support will be in the form of a new __attribute__;
a simplified example:

// generate data for the wrapper
static void _alloc_tag() {
    static struct alloc_tag _alloc_tag __section ("alloc_tags") = {
        .ct = CODE_TAG_INIT,
        .counter = 0 };
}

static inline int wrapper (const char *name, int x,
                           int (*callee) (const char *, int),
                           struct alloc_tag *callsite_data)
{
    callsite_data->counter++;
    printf ("Call #%d from %s:%d (%s)\n", callsite_data->counter,
            callsite_data->ct.filename, callsite_data->ct.lineno,
            callsite_data->ct.function);
    int ret = callee (name, x);
    printf ("Returned: %d\n", ret);
    return ret;
}

__attribute__((annotate("callsite_wrapped_by", wrapper, _alloc_tag)))
int foo(const char* name, int x);

int foo(const char* name, int x)
{
    printf ("Hello %s, %d!\n", name, x);
    return x;
}

This lets us attach the wrapper to a function without changing its
name, while preserving the namespace (it applies only to functions with
that name, not everything else). Note that we will still need _noprof
versions of the allocators.

> Who is developing it and when can we expect it to become available?

Aleksei Vetrov (google) with the help of Nick Desaulniers (google).
Both are CC'ed on this email. After several iterations Aleksei has a
POC which we are evaluating
(https://github.com/llvm/llvm-project/compare/main...noxwell:llvm-project:callsite-wrapper-tree-transform).
Once it's in good shape we are going to engage with the Clang and GCC
communities to get it upstreamed. When it will become available and
when the distributions will pick it up is anybody's guess. Upstreaming
is usually a lengthy process.

> Will we be able to migrate to it without back-compatibility concerns?
> (I think "you need quite recent gcc for memory profiling" is
> reasonable).

The migration should be quite straight-forward, replacing the macros
with functions with that attribute.

> Because: if the maintainability issues which Michal describes will be
> significantly addressed with the gcc support then we're kinda reviewing
> the wrong patchset. Yes, it may be a maintenance burden initially, but
> at some (yet to be revealed) time in the future, this will be addressed
> with the gcc support?

That's what I'm aiming for. I just don't want this placed on hold until
the compiler support is widely available, which might take years.
On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote:
> On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > If you think you can easily achieve what Michal requested without all that,
> > > > good.
> > >
> > > He requested something?
> >
> > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > possible until the compiler feature is developed and deployed. And it
> > still would require changes to the headers, so don't think it's worth
> > delaying the feature for years.
>
> Can we please be told much more about this compiler feature?
> Description of what it is, what it does, how it will affect this kernel
> feature, etc.
>
> Who is developing it and when can we expect it to become available?
>
> Will we be able to migrate to it without back-compatibility concerns?
> (I think "you need quite recent gcc for memory profiling" is
> reasonable).
>
> Because: if the maintainability issues which Michal describes will be
> significantly addressed with the gcc support then we're kinda reviewing
> the wrong patchset. Yes, it may be a maintenance burden initially, but
> at some (yet to be revealed) time in the future, this will be addressed
> with the gcc support?

Even if we had compiler magic, after considering it more I don't think
the patchset would be improved by it - I would still prefer to stick
with the macro approach.

There's also a lot of unresolved questions about whether the compiler
approach would even end up being what we need; we need macro expansion
to happen in the caller of the allocation function, and that's another
level of hooking that I don't think the compiler people are even
considering yet, since cpp runs before the main part of the compiler;
if C macros worked and were implemented more like Rust macros I'm sure
it could be done - in fact, I think this could all be done in Rust
_without_ any new compiler support - but in C, this is a lot to ask.

Let's look at the instrumentation again. There's two steps:
 - Renaming the original function to _noprof
 - Adding a hooked version of the original function.

We need to do the renaming regardless of what approach we take in order
to correctly handle allocations that happen inside the context of an
existing alloc tag hook but should not be accounted to the outer
context; we do that by selecting the alloc_foo() or alloc_foo_noprof()
version as appropriate. It's important to get this right; consider slab
object extension vectors or the slab allocator allocating pages from
the page allocator.

Second step, adding a hooked version of the original function. We do
that with

  #define alloc_foo(...) alloc_hooks(alloc_foo_noprof(__VA_ARGS__))

That's pretty clean, if you ask me. The only way to make it more
succinct would be if it were possible for a C macro to define a new
macro; then it could be just

  alloc_fn(alloc_foo);

But honestly, the former is probably preferable anyways from a
ctags/cscope POV.
On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> Memory allocation, v3 and final:
Would be nice to have --base added to cover letter. The very first patch
can't be applied on today's Linux Next.
On Wed, Feb 14, 2024 at 10:44 AM Andy Shevchenko <andy@black.fi.intel.com> wrote:
>
> On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> > Memory allocation, v3 and final:
>
> Would be nice to have --base added to cover letter. The very first patch
> can't be applied on today's Linux Next.

Sorry about that. It was based on Linus' ToT at the time of posting
(7521f258ea30 Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm). I also applied
it to mm-unstable with only one trivial merge conflict in one of the
patches.

>
> --
> With Best Regards,
> Andy Shevchenko
>
>
On Mon, 2024-02-12 at 13:38 -0800, Suren Baghdasaryan wrote:
> Memory allocation, v3 and final:
>
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.
>
[...]
>
> Notes:
>
> [1]: Overhead
> To measure the overhead we are comparing the following configurations:
> (1) Baseline with CONFIG_MEMCG_KMEM=n
> (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
> (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
> (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
> (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
>

Thanks for the work on this patchset; it is quite useful.
A clarification question on the data:

I assume Config (2), (3) and (4) have CONFIG_MEMCG_KMEM=n, right?
If so, do you have similar data for config (2), (3) and (4) but with
CONFIG_MEMCG_KMEM=y for comparison with (5)?

Tim

> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise.
Below are results > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > 56 core Intel Xeon: > > kmalloc pgalloc > (1 baseline) 6.764s 16.902s > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) >
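The benchmark module itself is not included in the series; a rough sketch of what the kmalloc/kfree half of such an in-kernel test might look like, based purely on the description above (the function name, iteration count and reporting are illustrative guesses, not the actual test code):

    static noinline void bench_kmalloc(unsigned long iters)
    {
    	u64 start = ktime_get_ns();
    	unsigned long i;
    	size_t size;

    	for (i = 0; i < iters; i++) {
    		/* allocation sizes growing from 8 to 240 bytes */
    		for (size = 8; size <= 240; size += 8) {
    			void *p = kmalloc(size, GFP_KERNEL);

    			if (WARN_ON(!p))
    				return;
    			kfree(p);
    		}
    	}
    	pr_info("kmalloc/kfree: %llu ns\n", ktime_get_ns() - start);
    }

Per the setup described above, this would run pinned to a single CPU with the frequency fixed at maximum, so the deltas between configurations are attributable to the hooks rather than to scheduler or cpufreq noise.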
On Wed, Feb 14, 2024 at 10:54 AM Tim Chen <tim.c.chen@linux.intel.com> wrote: > > On Mon, 2024-02-12 at 13:38 -0800, Suren Baghdasaryan wrote: > > Memory allocation, v3 and final: > > > > Overview: > > Low overhead [1] per-callsite memory allocation profiling. Not just for debug > > kernels, overhead low enough to be deployed in production. > > > > We're aiming to get this in the next merge window, for 6.9. The feedback > > we've gotten has been that even out of tree this patchset has already > > been useful, and there's a significant amount of other work gated on the > > code tagging functionality included in this patchset [2]. > > > > Example output: > > root@moria-kvm:~# sort -h /proc/allocinfo|tail > > 3.11MiB 2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode > > 3.52MiB 225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node > > 3.75MiB 960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext > > 4.00MiB 2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio > > 10.5MiB 168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs > > 14.0MiB 3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof > > 26.8MiB 6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof > > 64.5MiB 98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init > > 98.7MiB 25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof > > 125MiB 7357 mm/slub.c:2201 module:slub func:alloc_slab_page > > > > Since v2: > > - tglx noticed a circular header dependency between sched.h and percpu.h; > > a bunch of header cleanups were merged into 6.8 to ameliorate this [3]. > > > > - a number of improvements, moving alloc_hooks() annotations to the > > correct place for better tracking (mempool), and bugfixes. > > > > - looked at alternate hooking methods. > > There were suggestions on alternate methods (compiler attribute, > > trampolines), but they wouldn't have made the patchset any cleaner > > (we still need to have different function versions for accounting vs. no > > accounting to control at which point in a call chain the accounting > > happens), and they would have added a dependency on toolchain > > support. > > > > Usage: > > kconfig options: > > - CONFIG_MEM_ALLOC_PROFILING > > - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT > > - CONFIG_MEM_ALLOC_PROFILING_DEBUG > > adds warnings for allocations that weren't accounted because of a > > missing annotation > > > > sysctl: > > /proc/sys/vm/mem_profiling > > > > Runtime info: > > /proc/allocinfo > > > > Notes: > > > > [1]: Overhead > > To measure the overhead we are comparing the following configurations: > > (1) Baseline with CONFIG_MEMCG_KMEM=n > > (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && > > CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) > > (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && > > CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) > > (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && > > CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) > > (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT > > > > Thanks for the work on this patchset; it is quite useful. > A clarification question on the data: > > I assume Config (2), (3) and (4) have CONFIG_MEMCG_KMEM=n, right? Yes, correct. > If so, do you have similar data for config (2), (3) and (4) but with > CONFIG_MEMCG_KMEM=y for comparison with (5)? 
I have data for these additional configs (I didn't think they were that important): (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y > > Tim > > > Performance overhead: > > To evaluate performance we implemented an in-kernel test executing > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > affinity set to a specific CPU to minimize the noise. Below are results > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > > 56 core Intel Xeon: > > > > kmalloc pgalloc > > (1 baseline) 6.764s 16.902s > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) (6) shows a bit better performance than (5) but it's probably noise. I would expect them to be roughly the same. Hope this helps. > > > >
On Wed, Feb 14, 2024 at 9:52 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote: > > On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > If you think you can easily achieve what Michal requested without all that, > > > > > good. > > > > > > > > He requested something? > > > > > > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not > > > possible until the compiler feature is developed and deployed. And it > > > still would require changes to the headers, so don't think it's worth > > > delaying the feature for years. > > > > Can we please be told much more about this compiler feature? > > Description of what it is, what it does, how it will affect this kernel > > feature, etc. > > > > Who is developing it and when can we expect it to become available? > > > > Will we be able to migrate to it without back-compatibility concerns? > > (I think "you need quite recent gcc for memory profiling" is > > reasonable). > > > > > > > > Because: if the maintainability issues which Michal describes will be > > significantly addressed with the gcc support then we're kinda reviewing > > the wrong patchset. Yes, it may be a maintenance burden initially, but > > at some (yet to be revealed) time in the future, this will be addressed > > with the gcc support? > > Even if we had compiler magic, after considering it more I don't think > the patchset would be improved by it - I would still prefer to stick > with the macro approach. > > There's also a lot of unresolved questions about whether the compiler > approach would even end up being what we need; we need macro expansion to > happen in the caller of the allocation function For the record, that's what this attribute will be doing. So it should cover our use case. > , and that's another > level of hooking that I don't think the compiler people are even > considering yet, since cpp runs before the main part of the compiler; if > C macros worked and were implemented more like Rust macros I'm sure it > could be done - in fact, I think this could all be done in Rust > _without_ any new compiler support - but in C, this is a lot to ask. > > Let's look at the instrumentation again. There's two steps: > > - Renaming the original function to _noprof > - Adding a hooked version of the original function. > > We need to do the renaming regardless of what approach we take in order > to correctly handle allocations that happen inside the context of an > existing alloc tag hook but should not be accounted to the outer > context; we do that by selecting the alloc_foo() or alloc_foo_noprof() > version as appropriate. > > It's important to get this right; consider slab object extension > vectors or the slab allocator allocating pages from the page allocator. > > Second step, adding a hooked version of the original function. We do > that with > > #define alloc_foo(...) alloc_hooks(alloc_foo_noprof(__VA_ARGS__)) > > That's pretty clean, if you ask me. The only way to make it more succinct > would be if it were possible for a C macro to define a new macro, then it > could be just > > alloc_fn(alloc_foo); > > But honestly, the former is probably preferable anyways from a ctags/cscope POV.
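To picture what the alloc_hooks() wrapper in that second step does, here is a simplified sketch based on the theory-of-operation description later in this thread; alloc_tag_restore() stands in for the series' actual save/restore plumbing and is illustrative, not the in-tree definition:

    /*
     * Sketch only: declare a static code tag for this call site, make it
     * the task's current tag around the real allocation, then restore
     * the previously active (outer) tag.
     */
    #define alloc_hooks(_do_alloc)					\
    ({									\
    	DEFINE_ALLOC_TAG(_alloc_tag, _old);	/* tag + saved pointer */\
    	typeof(_do_alloc) _res = _do_alloc;	/* tag active here */	\
    	alloc_tag_restore(&_alloc_tag, _old);	/* back to outer ctx */	\
    	_res;								\
    })

Because the expansion happens at the call site of alloc_foo(), the tag's file:line identifies the caller, which is exactly what /proc/allocinfo reports.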
On Wed, Feb 14, 2024 at 11:24:23AM -0800, Suren Baghdasaryan wrote: > On Wed, Feb 14, 2024 at 9:52 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote: > > > On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > If you think you can easily achieve what Michal requested without all that, > > > > > > good. > > > > > > > > > > He requested something? > > > > > > > > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not > > > > possible until the compiler feature is developed and deployed. And it > > > > still would require changes to the headers, so don't think it's worth > > > > delaying the feature for years. > > > > > > Can we please be told much more about this compiler feature? > > > Description of what it is, what it does, how it will affect this kernel > > > feature, etc. > > > > > > Who is developing it and when can we expect it to become available? > > > > > > Will we be able to migrate to it without back-compatibility concerns? > > > (I think "you need quite recent gcc for memory profiling" is > > > reasonable). > > > > > > > > > > > > Because: if the maintainability issues which Michal describes will be > > > significantly addressed with the gcc support then we're kinda reviewing > > > the wrong patchset. Yes, it may be a maintenance burden initially, but > > > at some (yet to be revealed) time in the future, this will be addressed > > > with the gcc support? > > > > Even if we had compiler magic, after considering it more I don't think > > the patchset would be improved by it - I would still prefer to stick > > with the macro approach. > > > > There's also a lot of unresolved questions about whether the compiler > > approach would even end up being what we need; we need macro expansion to > > happen in the caller of the allocation function > > For the record, that's what this attribute will be doing. So it should > cover our use case. That wasn't clear in the meeting we had the other day; all that was discussed there was the attribute syntax, as I recall. So say that does work out (and I don't think that's a given; if I were a compiler person I don't think I'd be interested in this strange half macro, half inline function beast); all that would accomplish is getting rid of the need for the renaming - the _noprof() versions of functions. So then how do you distinguish where in the callstack the accounting happens? If you say "it happens at the outermost wrapper", then what happens is - Extra overhead for all the inner wrapper invocations, where they have to now check "actually, we already have an alloc tag, don't do anything". That's a cost, and given how much time we spent shaving cycles and branches during development it's not one we want. - Inner allocations that shouldn't be accounted to the outer context are now a major problem, because they will silently be accounted there and never noticed. With our approach, inner allocations are by default (i.e. when we haven't switched them to the _noprof() variant) accounted to their own alloc tag; that way, when we're reading the /proc/allocinfo output, we can examine them and check if they should be collapsed to the outer context. With the compiler approach they won't be seen. So no, we still don't want the compiler approach.
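To make the attribution rules concrete, consider a hypothetical inner helper (the names here are illustrative, not from the series):

    void *inner_alloc(size_t n)
    {
    	/* Plain kmalloc(): this call site gets its own tag and its own
    	 * row in /proc/allocinfo - the safe, reviewable default. */
    	return kmalloc(n, GFP_KERNEL);
    }

    void *inner_alloc_collapsed(size_t n)
    {
    	/* _noprof() variant: accounted to whatever outer alloc_hooks()
    	 * context is active, i.e. collapsed into the caller's tag. */
    	return kmalloc_noprof(n, GFP_KERNEL);
    }

With the macro approach a reviewer chooses between these two lines explicitly; the objection above is that an attribute which always hooks the outermost call would make the second behavior the silent default.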
> > > Performance overhead: > > > To evaluate performance we implemented an in-kernel test executing > > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > > affinity set to a specific CPU to minimize the noise. Below are results > > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > > > 56 core Intel Xeon: > > > > > > kmalloc pgalloc > > > (1 baseline) 6.764s 16.902s > > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) > > (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) > (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) I think these numbers are very interesting for folks that already use memcg. Specifically, the difference between 6 & 7, which seems to be ~0.85% and ~14.25%. IIUC, this means that the extra overhead is relatively much lower if someone is already using memcgs. > > (6) shows a bit better performance than (5) but it's probably noise. I > would expect them to be roughly the same. Hope this helps. > > > > > > > >
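For reference, those figures follow from the table above: (13.446 - 13.332) / 13.332 is roughly 0.86% for kmalloc and (54.963 - 48.105) / 48.105 is roughly 14.3% for pgalloc.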
On Wed, Feb 14, 2024 at 12:17 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > Performance overhead: > > > > To evaluate performance we implemented an in-kernel test executing > > > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > > > affinity set to a specific CPU to minimize the noise. Below are results > > > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > > > > 56 core Intel Xeon: > > > > > > > > kmalloc pgalloc > > > > (1 baseline) 6.764s 16.902s > > > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > > > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > > > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > > > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) > > > > (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) > > (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) > > I think these numbers are very interesting for folks that already use > memcg. Specifically, the difference between 6 & 7, which seems to be > ~0.85% and ~14.25%. IIUC, this means that the extra overhead is > relatively much lower if someone is already using memcgs. Well, yes, percentage-wise it's much lower. If you look at the absolute difference between 6 & 7 vs 2 & 3, it's quite close. > > > > > (6) shows a bit better performance than (5) but it's probably noise. I > > would expect them to be roughly the same. Hope this helps. > > > > > > > > > > > >
On Wed, 2024-02-14 at 12:30 -0800, Suren Baghdasaryan wrote: > On Wed, Feb 14, 2024 at 12:17 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > Performance overhead: > > > > > To evaluate performance we implemented an in-kernel test executing > > > > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > > > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > > > > affinity set to a specific CPU to minimize the noise. Below are results > > > > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > > > > > 56 core Intel Xeon: > > > > > > > > > > kmalloc pgalloc > > > > > (1 baseline) 6.764s 16.902s > > > > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > > > > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > > > > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > > > > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) > > > > > > (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) > > > (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) > > > > I think these numbers are very interesting for folks that already use > > memcg. Specifically, the difference between 6 & 7, which seems to be > > ~0.85% and ~14.25%. IIUC, this means that the extra overhead is > > relatively much lower if someone is already using memcgs. > > Well, yes, percentage-wise it's much lower. If you look at the > absolute difference between 6 & 7 vs 2 & 3, it's quite close. > > > > > > > > > (6) shows a bit better performance than (5) but it's probably noise. I > > > would expect them to be roughly the same. Hope this helps. > > > > Thanks for the data. It does show that when memcg is already enabled, turning on profiling does not cost much extra overhead percentage-wise. Tim
On Mon, 12 Feb 2024, Suren Baghdasaryan <surenb@google.com> wrote: > Memory allocation, v3 and final: > > Overview: > Low overhead [1] per-callsite memory allocation profiling. Not just for debug > kernels, overhead low enough to be deployed in production. > > We're aiming to get this in the next merge window, for 6.9. The feedback > we've gotten has been that even out of tree this patchset has already > been useful, and there's a significant amount of other work gated on the > code tagging functionality included in this patchset [2]. I wonder if it wouldn't be too much trouble to write at least a brief overview document under Documentation/ describing what this is all about? Even as follow-up. People seeing the patch series have the benefit of the cover letter and the commit messages, but that's hardly documentation. We have all these great frameworks and tools but their discoverability to kernel developers isn't always all that great. BR, Jani.
On Fri, Feb 16, 2024 at 10:38:00AM +0200, Jani Nikula wrote: > On Mon, 12 Feb 2024, Suren Baghdasaryan <surenb@google.com> wrote: > > Memory allocation, v3 and final: > > > > Overview: > > Low overhead [1] per-callsite memory allocation profiling. Not just for debug > > kernels, overhead low enough to be deployed in production. > > > > We're aiming to get this in the next merge window, for 6.9. The feedback > > we've gotten has been that even out of tree this patchset has already > > been useful, and there's a significant amount of other work gated on the > > code tagging functionality included in this patchset [2]. > > I wonder if it wouldn't be too much trouble to write at least a brief > overview document under Documentation/ describing what this is all > about? Even as follow-up. People seeing the patch series have the > benefit of the cover letter and the commit messages, but that's hardly > documentation. > > We have all these great frameworks and tools but their discoverability > to kernel developers isn't always all that great. commit f589b48789de4b8f77bfc70b9f3ab2013c01eaf2 Author: Kent Overstreet <kent.overstreet@linux.dev> Date: Wed Feb 14 01:13:04 2024 -0500 memprofiling: Documentation Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> diff --git a/Documentation/mm/allocation-profiling.rst b/Documentation/mm/allocation-profiling.rst new file mode 100644 index 000000000000..d906e9360279 --- /dev/null +++ b/Documentation/mm/allocation-profiling.rst @@ -0,0 +1,68 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +MEMORY ALLOCATION PROFILING +=========================== + +Low overhead (suitable for production) accounting of all memory allocations, +tracked by file and line number. + +Usage: +kconfig options: + - CONFIG_MEM_ALLOC_PROFILING + - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT + - CONFIG_MEM_ALLOC_PROFILING_DEBUG + adds warnings for allocations that weren't accounted because of a + missing annotation + +sysctl: + /proc/sys/vm/mem_profiling + +Runtime info: + /proc/allocinfo + +Example output: + root@moria-kvm:~# sort -h /proc/allocinfo|tail + 3.11MiB 2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode + 3.52MiB 225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node + 3.75MiB 960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext + 4.00MiB 2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio + 10.5MiB 168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs + 14.0MiB 3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof + 26.8MiB 6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof + 64.5MiB 98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init + 98.7MiB 25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof + 125MiB 7357 mm/slub.c:2201 module:slub func:alloc_slab_page + + +Theory of operation: + +Memory allocation profiling builds off of code tagging, which is a library for +declaring static structs (that typically describe a file and line number in +some way, hence code tagging) and then finding and operating on them at runtime +- i.e. iterating over them to print them in debugfs/procfs. + +To add accounting for an allocation call, we replace it with a macro +invocation, alloc_hooks(), that + - declares a code tag + - stashes a pointer to it in task_struct + - calls the real allocation function + - and finally, restores the task_struct alloc tag pointer to its previous value. 
+ +This allows for alloc_hooks() calls to be nested, with the most recent one +taking effect. This is important for allocations internal to the mm/ code that +do not properly belong to the outer allocation context and should be counted +separately: for example, slab object extension vectors, or when the slab +allocates pages from the page allocator. + +Thus, proper usage requires determining which function in an allocation call +stack should be tagged. There are many helper functions that essentially wrap +e.g. kmalloc() and do a little more work, then are called in multiple places; +we'll generally want the accounting to happen in the callers of these helpers, +not in the helpers themselves. + +To fix up a given helper, for example foo(), do the following: + - switch its allocation call to the _noprof() version, e.g. kmalloc_noprof() + - rename it to foo_noprof() + - define a macro version of foo() like so: + #define foo(...) alloc_hooks(foo_noprof(__VA_ARGS__))
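Putting the document's three steps together for a small made-up helper (widget_alloc() and struct widget are purely illustrative; only the pattern comes from the documentation above):

    /* Before: every caller is accounted to the kmalloc() line inside the
     * helper, so all callers share a single /proc/allocinfo row. */
    struct widget *widget_alloc(gfp_t flags)
    {
    	return kmalloc(sizeof(struct widget), flags);
    }

    /* Steps 1 and 2: switch to the _noprof() allocator and rename. */
    struct widget *widget_alloc_noprof(gfp_t flags)
    {
    	return kmalloc_noprof(sizeof(struct widget), flags);
    }

    /* Step 3: the old name becomes the hooked macro, so each caller of
     * widget_alloc() is now tagged and accounted individually. */
    #define widget_alloc(...) alloc_hooks(widget_alloc_noprof(__VA_ARGS__))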
On Fri, 16 Feb 2024, Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Fri, Feb 16, 2024 at 10:38:00AM +0200, Jani Nikula wrote: >> I wonder if it wouldn't be too much trouble to write at least a brief >> overview document under Documentation/ describing what this is all >> about? Even as follow-up. People seeing the patch series have the >> benefit of the cover letter and the commit messages, but that's hardly >> documentation. >> >> We have all these great frameworks and tools but their discoverability >> to kernel developers isn't always all that great. > > commit f589b48789de4b8f77bfc70b9f3ab2013c01eaf2 > Author: Kent Overstreet <kent.overstreet@linux.dev> > Date: Wed Feb 14 01:13:04 2024 -0500 > > memprofiling: Documentation > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Thanks! Wasn't part of this series and I wasn't aware it existed. BR, Jani.