Message ID: 20240327155751.3536-1-shikemeng@huaweicloud.com
Series: Improve visibility of writeback
On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> This series tries to improve visibility of writeback.
Well... why? Is anyone usefully using the existing instrumentation?
What is to be gained by expanding it further? What is the case for
adding this code?
I don't recall hearing of anyone using the existing debug
instrumentation so perhaps we should remove it!
Also, I hit a build error and a pile of warnings with an arm
allnoconfig build.
on 3/28/2024 1:40 AM, Andrew Morton wrote:
> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
>> This series tries to improve visibility of writeback.
>
> Well... why? Is anyone usefully using the existing instrumentation?
> What is to be gained by expanding it further? What is the case for
> adding this code?
>
> I don't recall hearing of anyone using the existing debug
> instrumentation so perhaps we should remove it!

Hi Andrew, this was discussed in [1]. In short, I use the debug files to
test the changes in the submitted patchset [1], and the wb_monitor.py
script was suggested by Tejun in [2] to improve visibility of writeback.

> Also, I hit a build error and a pile of warnings with an arm
> allnoconfig build.

Sorry for this, I only tested on x86. I will look into this and fix the
build problem in the next version.

[1] https://lore.kernel.org/lkml/44e3b910-8b52-5583-f8a9-37105bf5e5b6@huaweicloud.com/
[2] https://lore.kernel.org/lkml/a747dc7d-f24a-08bd-d969-d3fb35e151b7@huaweicloud.com/
[3] https://lore.kernel.org/lkml/ZcUsOb_fyvYr-zZ-@slm.duckdns.org/
on 3/28/2024 9:59 AM, Kemeng Shi wrote:
>
> on 3/28/2024 1:40 AM, Andrew Morton wrote:
>> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>>
>>> This series tries to improve visibility of writeback.
>>
>> Well... why? Is anyone usefully using the existing instrumentation?
>> What is to be gained by expanding it further? What is the case for
>> adding this code?
>>
>> I don't recall hearing of anyone using the existing debug
>> instrumentation so perhaps we should remove it!
>
> Hi Andrew, this was discussed in [1]. In short, I use the debug files to
> test the changes in the submitted patchset [1], and the wb_monitor.py
> script was suggested by Tejun in [2] to improve visibility of writeback.
>
>> Also, I hit a build error and a pile of warnings with an arm
>> allnoconfig build.

With an arm allnoconfig build on an up-to-date mm-unstable branch, I don't
hit any build error, only some warnings like the following:

...
mm/page-writeback.c: In function ‘cgwb_calc_thresh’:
mm/page-writeback.c:906:13: warning: ‘writeback’ is used uninitialized in this function [-Wuninitialized]
  906 |         mdtc.dirty += writeback;
      |         ~~~~~~~~~~~^~~~~~~~~~~~
In file included from ./include/linux/kernel.h:28,
                 from mm/page-writeback.c:15:
./include/linux/minmax.h:46:54: warning: ‘filepages’ is used uninitialized in this function [-Wuninitialized]
   46 | #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? (x) : (y))
      |                                                      ^
mm/page-writeback.c:898:16: note: ‘filepages’ was declared here
  898 |         unsigned long filepages, headroom, writeback;
      |                       ^~~~~~~~~
In file included from ./include/linux/kernel.h:28,
                 from mm/page-writeback.c:15:
./include/linux/minmax.h:46:54: warning: ‘headroom’ is used uninitialized in this function [-Wuninitialized]
   46 | #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? (x) : (y))
      |                                                      ^
mm/page-writeback.c:898:27: note: ‘headroom’ was declared here
  898 |         unsigned long filepages, headroom, writeback;
      |                                  ^~~~~~~~
...

The only reason I can think of is that I also applied patchset [1] for the
build. I mentioned patchset [1] in the cover letter, but I forgot to note
the dependency on that patchset. If this is the reason for the build error,
I will send a new set based on mm-unstable in the next version.

Thanks,
Kemeng

[1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

>>
> Sorry for this, I only tested on x86. I will look into this and fix the
> build problem in the next version.
>
> [1] https://lore.kernel.org/lkml/44e3b910-8b52-5583-f8a9-37105bf5e5b6@huaweicloud.com/
> [2] https://lore.kernel.org/lkml/a747dc7d-f24a-08bd-d969-d3fb35e151b7@huaweicloud.com/
> [3] https://lore.kernel.org/lkml/ZcUsOb_fyvYr-zZ-@slm.duckdns.org/
>
On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> > This series tries to improve visibility of writeback.
>
> Well... why? Is anyone usefully using the existing instrumentation?
> What is to be gained by expanding it further? What is the case for
> adding this code?
>
> I don't recall hearing of anyone using the existing debug
> instrumentation so perhaps we should remove it!

Remove debug instrumentation!? Surely you jest?

I generally don't hear from users of my code when things are working; I
only expect to hear from people when it's not.
On Thu, 28 Mar 2024 15:15:03 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> > On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> >
> > > This series tries to improve visibility of writeback.
> >
> > Well... why? Is anyone usefully using the existing instrumentation?
> > What is to be gained by expanding it further? What is the case for
> > adding this code?
> >
> > I don't recall hearing of anyone using the existing debug
> > instrumentation so perhaps we should remove it!
>
> Remove debug instrumentation!? Surely you jest?

Absolutely not. Any code in the kernel should have ongoing
justification for remaining there. If no such justification exists,
out it goes.
On Wed, Mar 27, 2024 at 11:57:45PM +0800, Kemeng Shi wrote:
> v1->v2:
> -Send cleanup to wq_monitor.py separately.
> -Add patch to avoid use after free of bdi.
> -Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
> -Use rcu walk to avoid use after free.
> -Add debug output to each related patches.
>
> This series tries to improve visibility of writeback. Patch 1 makes
> /sys/kernel/debug/bdi/xxx/stats show writeback info of the whole bdi
> instead of only writeback info in the root cgroup. Patch 2 adds a new
> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per-wb writeback
> info. Patch 4 adds wb_monitor.py to monitor basic writeback info
> of a running system; more info could be added on demand. The rest of the
> patches are some random cleanups. More details can be found in the
> respective patches. Thanks!
> This series is on top of patchset [1].
>
> [1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

Not bad.

I've been trying to improve our ability to debug latency issues - stalls
of all sorts. While you're looking at all this code, do you think you
could find some places to collect useful latency numbers?

fs/bcachefs/time_stats.c has some code that's going to be moving out to
lib/ at some point, after I switch it to MAD; if you could hook that up
as well to a few points we could see at a glance if there are stalls
happening in the writeback path.

>
> The following domain hierarchy is tested:
>               global domain (320G)
>               /                  \
>   cgroup domain1 (10G)      cgroup domain2 (10G)
>               |                   |
>      bdi     wb1                 wb2
>
> /* all writeback info of bdi is successfully collected */
> # cat /sys/kernel/debug/bdi/252:16/stats
> BdiWriteback:                448 kB
> BdiReclaimable:          1303904 kB
> BdiDirtyThresh:        189914124 kB
> DirtyThresh:           195337564 kB
> BackgroundThresh:       32516508 kB
> BdiDirtied:              3591392 kB
> BdiWritten:              2287488 kB
> BdiWriteBandwidth:        322248 kBps
> b_dirty:                       0
> b_io:                          0
> b_more_io:                     2
> b_dirty_time:                  0
> bdi_list:                      1
> state:                         1
>
> /* per wb writeback info is collected */
> # cat /sys/kernel/debug/bdi/252:16/wb_stats
> WbCgIno:                     1
> WbWriteback:                 0 kB
> WbReclaimable:               0 kB
> WbDirtyThresh:               0 kB
> WbDirtied:                   0 kB
> WbWritten:                   0 kB
> WbWriteBandwidth:       102400 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> state:                       1
> WbCgIno:                  4284
> WbWriteback:               448 kB
> WbReclaimable:          818944 kB
> WbDirtyThresh:         3096524 kB
> WbDirtied:             2266880 kB
> WbWritten:             1447936 kB
> WbWriteBandwidth:       214036 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   1
> b_dirty_time:                0
> state:                       5
> WbCgIno:                  4325
> WbWriteback:               224 kB
> WbReclaimable:          819392 kB
> WbDirtyThresh:         2920088 kB
> WbDirtied:             2551808 kB
> WbWritten:             1732416 kB
> WbWriteBandwidth:       201832 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   1
> b_dirty_time:                0
> state:                       5
>
> /* monitor writeback info */
> # ./wb_monitor.py 252:16 -c
>                   writeback  reclaimable   dirtied   written   avg_bw
> 252:16_1                  0            0         0         0   102400
> 252:16_4284             672       820064   9230368   8410304   685612
> 252:16_4325             896       819840  10491264   9671648   652348
> 252:16                 1568      1639904  19721632  18081952  1440360
>
>                   writeback  reclaimable   dirtied   written   avg_bw
> 252:16_1                  0            0         0         0   102400
> 252:16_4284             672       820064   9230368   8410304   685612
> 252:16_4325             896       819840  10491264   9671648   652348
> 252:16                 1568      1639904  19721632  18081952  1440360
> ...
>
> Kemeng Shi (6):
>   writeback: protect race between bdi release and bdi_debug_stats_show
>   writeback: collect stats of all wb of bdi in bdi_debug_stats_show
>   writeback: support retrieving per group debug writeback stats of bdi
>   writeback: add wb_monitor.py script to monitor writeback info on bdi
>   writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
>   writeback: define GDTC_INIT_NO_WB to null
>
>  include/linux/writeback.h     |   1 +
>  mm/backing-dev.c              | 203 ++++++++++++++++++++++++++++++----
>  mm/page-writeback.c           |  31 ++++--
>  tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++
>  4 files changed, 378 insertions(+), 29 deletions(-)
>  create mode 100644 tools/writeback/wb_monitor.py
>
> --
> 2.30.0
>
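The wb_stats file quoted above is plain "Key: value [unit]" text, so it is easy to consume from a script. Below is a minimal, hypothetical Python sketch (not the wb_monitor.py shipped with the series) that parses the per-wb blocks shown in that example output; the field names and the 252:16 path are taken from the quoted output and are otherwise assumptions.

#!/usr/bin/env python3
# Hedged sketch: parse per-wb stats into dicts, one per wb.
# Assumes each per-wb block starts with a "WbCgIno:" line, as in the
# example output quoted above.
import sys

def parse_wb_stats(path):
    wbs, cur = [], None
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            fields = rest.split()
            if not fields:
                continue
            if key == "WbCgIno":        # a new per-wb block begins here
                cur = {}
                wbs.append(cur)
            if cur is not None:
                cur[key] = int(fields[0])
    return wbs

if __name__ == "__main__":
    # Default path matches the example above; pass another bdi as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/sys/kernel/debug/bdi/252:16/wb_stats"
    for wb in parse_wb_stats(path):
        print(f"cgroup inode {wb['WbCgIno']}: "
              f"writeback={wb.get('WbWriteback', 0)} kB, "
              f"written={wb.get('WbWritten', 0)} kB")

Run it as root (debugfs is normally root-only); the same parsing approach extends to the bdi-wide stats file, whose keys differ only in the Bdi prefix.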
Hello, Kent.

On Thu, Mar 28, 2024 at 03:24:35PM -0400, Kent Overstreet wrote:
> fs/bcachefs/time_stats.c has some code that's going to be moving out to
> lib/ at some point, after I switch it to MAD; if you could hook that up
> as well to a few points we could see at a glance if there are stalls
> happening in the writeback path.

Using BPF (whether through bcc or bpftrace) is likely a better approach
for this sort of detailed instrumentation. Fixed debug information is
useful, and it's also a common occurrence that it doesn't quite reveal
the full picture of what one's trying to understand, and one needs to
dig a bit deeper or wider, aggregate data in a different way, or
whatever.

So, rather than adding more fixed infrastructure, I'd suggest adding
places which can easily be instrumented using the existing tools (they
are really great once you get used to them), whether that's tracepoints
or just strategically placed noinline functions.

Thanks.
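For reference, here is a minimal sketch of the kind of ad-hoc bcc instrumentation being suggested: a log2 latency histogram of how long wb_workfn() (the writeback worker entry point) runs per invocation. The choice of wb_workfn as a kprobe target, and its availability for kprobing on a given kernel, are assumptions made for illustration, not something this series adds.

#!/usr/bin/env python3
# Hedged bcc sketch: histogram of wb_workfn() runtimes.
from time import sleep
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();   /* low 32 bits: thread id */
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;                            /* missed the entry probe */
    dist.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="wb_workfn", fn_name="trace_entry")
b.attach_kretprobe(event="wb_workfn", fn_name="trace_return")

print("Tracing wb_workfn()... hit Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
print()
b["dist"].print_log2_hist("usecs")

Nothing here is compiled into the kernel; the probe exists only while the script is running, which is the trade-off being argued about in the rest of the thread.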
On Thu, Mar 28, 2024 at 12:23:52PM -0700, Andrew Morton wrote:
> On Thu, 28 Mar 2024 15:15:03 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> > On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> > > On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> > >
> > > > This series tries to improve visibility of writeback.
> > >
> > > Well... why? Is anyone usefully using the existing instrumentation?
> > > What is to be gained by expanding it further? What is the case for
> > > adding this code?
> > >
> > > I don't recall hearing of anyone using the existing debug
> > > instrumentation so perhaps we should remove it!
> >
> > Remove debug instrumentation!? Surely you jest?
>
> Absolutely not. Any code in the kernel should have ongoing
> justification for remaining there. If no such justification exists,
> out it goes.

Certainly, but this isn't remotely a case where I'd expect to be getting
that kind of feedback. Debugging instrumentation is very much a case
where no one notices it 99% of the time, but when you need it you
_really_ need it.

Not having it can turn a 10 minute "oh, that thing is acting wonky -
it's because your system is overloaded/your drive is wonky/x subsystem
sucks, we know about it and we're working on it" into a weeklong
bughunt, burning up expensive engineer time pointlessly.

To debug complex systems efficiently, in production, in the wild, we
need to be able to see what's going on - we need more of this stuff.

Not to say that this couldn't use more work - perhaps additional focus
on what kinds of issues we expect to need to debug with this, what the
numbers mean and are useful for, documentation on how this relates to
writeback internals, etc.
On Thu, Mar 28, 2024 at 09:31:57AM -1000, Tejun Heo wrote:
> Hello, Kent.
>
> On Thu, Mar 28, 2024 at 03:24:35PM -0400, Kent Overstreet wrote:
> > fs/bcachefs/time_stats.c has some code that's going to be moving out to
> > lib/ at some point, after I switch it to MAD; if you could hook that up
> > as well to a few points we could see at a glance if there are stalls
> > happening in the writeback path.
>
> Using BPF (whether through bcc or bpftrace) is likely a better approach
> for this sort of detailed instrumentation. Fixed debug information is
> useful, and it's also a common occurrence that it doesn't quite reveal
> the full picture of what one's trying to understand, and one needs to
> dig a bit deeper or wider, aggregate data in a different way, or
> whatever.
>
> So, rather than adding more fixed infrastructure, I'd suggest adding
> places which can easily be instrumented using the existing tools (they
> are really great once you get used to them), whether that's tracepoints
> or just strategically placed noinline functions.

Collecting latency numbers at various key places is _enormously_ useful.
The hard part is deciding where it's useful to collect; that requires
intimate knowledge of the code. Once you're defining those collection
points statically, doing it with BPF is just another useless layer of
indirection.

The time stats stuff I wrote is _really_ cheap, and you really want this
stuff always on so that you've actually got the data you need when
you're bughunting.
Hello,

On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> Collecting latency numbers at various key places is _enormously_ useful.
> The hard part is deciding where it's useful to collect; that requires
> intimate knowledge of the code. Once you're defining those collection
> points statically, doing it with BPF is just another useless layer of
> indirection.

Given how much flexibility helps with debugging, claiming it useless is a
stretch.

> The time stats stuff I wrote is _really_ cheap, and you really want this
> stuff always on so that you've actually got the data you need when
> you're bughunting.

For some stats and some use cases, always being available is useful and
building fixed infra for them makes sense. For other stats and other use
cases, flexibility is pretty useful too (e.g. what if you want a percentile
distribution which is filtered by some criteria?). They aren't mutually
exclusive, and I'm not sure bdi wb instrumentation is on top of enough
people's minds.

As for overhead, BPF instrumentation can be _really_ cheap too. We often run
these programs per packet.

Thanks.
On Thu, Mar 28, 2024 at 09:46:39AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > Collecting latency numbers at various key places is _enormously_ useful.
> > The hard part is deciding where it's useful to collect; that requires
> > intimate knowledge of the code. Once you're defining those collection
> > points statically, doing it with BPF is just another useless layer of
> > indirection.
>
> Given how much flexibility helps with debugging, claiming it useless is a
> stretch.

Well, what would it add?

> > The time stats stuff I wrote is _really_ cheap, and you really want this
> > stuff always on so that you've actually got the data you need when
> > you're bughunting.
>
> For some stats and some use cases, always being available is useful and
> building fixed infra for them makes sense. For other stats and other use
> cases, flexibility is pretty useful too (e.g. what if you want a percentile
> distribution which is filtered by some criteria?). They aren't mutually
> exclusive, and I'm not sure bdi wb instrumentation is on top of enough
> people's minds.
>
> As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> these programs per packet.

The main things I want are just
- elapsed time since last writeback IO completed, so we can see at a
  glance if it's stalled
- time stats on writeback io initiation to completion

The main value of this one will be tracking down tail latency issues and
finding out where in the stack they originate.
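As a rough illustration of the first item in that list, the hedged bcc sketch below reports how long it has been since the last writeback work item completed, using the existing writeback:writeback_written tracepoint. That tracepoint fires per writeback work item rather than per IO, so it only approximates "time since the last writeback IO completed"; its presence on a given kernel is an assumption taken from mainline's include/trace/events/writeback.h, not something added by this series.

#!/usr/bin/env python3
# Hedged bcc sketch: seconds since the most recent writeback_written event.
import ctypes as ct
import time
from bcc import BPF

prog = r"""
BPF_ARRAY(last_done, u64, 1);

TRACEPOINT_PROBE(writeback, writeback_written) {
    int idx = 0;
    u64 ts = bpf_ktime_get_ns();
    last_done.update(&idx, &ts);
    return 0;
}
"""

b = BPF(text=prog)
print("Watching writeback completions, Ctrl-C to stop.")
try:
    while True:
        time.sleep(5)
        ts = b["last_done"][ct.c_int(0)].value
        if ts == 0:
            print("no writeback work has completed yet")
        else:
            # bpf_ktime_get_ns() and time.monotonic_ns() both use CLOCK_MONOTONIC
            age = (time.monotonic_ns() - ts) / 1e9
            print(f"last writeback completion {age:.1f}s ago")
except KeyboardInterrupt:
    pass

A steadily growing age while dirty pages keep piling up would be the "at a glance it's stalled" signal; the second item (initiation-to-completion stats) would need pairing the writeback_start and writeback_written events or hooking lower in the block layer.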
Hello,

On Thu, Mar 28, 2024 at 03:55:32PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > > Collecting latency numbers at various key places is _enormously_ useful.
> > > The hard part is deciding where it's useful to collect; that requires
> > > intimate knowledge of the code. Once you're defining those collection
> > > points statically, doing it with BPF is just another useless layer of
> > > indirection.
> >
> > Given how much flexibility helps with debugging, claiming it useless is a
> > stretch.
>
> Well, what would it add?

It depends on the case but here's an example. If I'm seeing occasional tail
latency spikes, I'd want to know whether there's any correlation with
specific types or sizes of IOs and if so who's issuing them and why. With
BPF, you can detect those conditions to tag and capture where exactly those
IOs are coming from and aggregate the result however you like across
thousands of machines in production without anyone noticing. That's useful,
no?

Also, an actual percentile distribution is almost always a lot more
insightful than more coarsely aggregated numbers. We can't add all that to
fixed infra. In most cases not because the runtime overhead would be too
high but because the added interface and code complexity and maintenance
overhead isn't justifiable given how niche, ad hoc and varied these use
cases get.

> > > The time stats stuff I wrote is _really_ cheap, and you really want this
> > > stuff always on so that you've actually got the data you need when
> > > you're bughunting.
> >
> > For some stats and some use cases, always being available is useful and
> > building fixed infra for them makes sense. For other stats and other use
> > cases, flexibility is pretty useful too (e.g. what if you want a percentile
> > distribution which is filtered by some criteria?). They aren't mutually
> > exclusive, and I'm not sure bdi wb instrumentation is on top of enough
> > people's minds.
> >
> > As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> > these programs per packet.
>
> The main things I want are just
> - elapsed time since last writeback IO completed, so we can see at a
>   glance if it's stalled
> - time stats on writeback io initiation to completion
>
> The main value of this one will be tracking down tail latency issues and
> finding out where in the stack they originate.

Yeah, I mean, if always keeping those numbers around is useful for a wide
enough number of users and cases, sure, go ahead and add fixed infra. I'm
not quite sure bdi wb stats fall in that bucket given how little attention
it usually gets.

Thanks.
On Thu, Mar 28, 2024 at 10:13:27AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2024 at 03:55:32PM -0400, Kent Overstreet wrote:
> > > On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > > > Collecting latency numbers at various key places is _enormously_ useful.
> > > > The hard part is deciding where it's useful to collect; that requires
> > > > intimate knowledge of the code. Once you're defining those collection
> > > > points statically, doing it with BPF is just another useless layer of
> > > > indirection.
> > >
> > > Given how much flexibility helps with debugging, claiming it useless is a
> > > stretch.
> >
> > Well, what would it add?
>
> It depends on the case but here's an example. If I'm seeing occasional tail
> latency spikes, I'd want to know whether there's any correlation with
> specific types or sizes of IOs and if so who's issuing them and why. With
> BPF, you can detect those conditions to tag and capture where exactly those
> IOs are coming from and aggregate the result however you like across
> thousands of machines in production without anyone noticing. That's useful,
> no?

That's cool, but really esoteric. We need to be able to answer basic
questions and build an overall picture of what the system is doing
without having to reach for the big stuff.

Most users are never going to touch tracing, let alone BPF; that's too
much setup. But I can and do regularly tell users "check this, this and
this" and debug things on that basis without ever touching their
machine. And basic latency numbers are really easy for users to
understand, which makes them doubly worthwhile to collect and make
visible.

> Also, an actual percentile distribution is almost always a lot more
> insightful than more coarsely aggregated numbers. We can't add all that to
> fixed infra. In most cases not because the runtime overhead would be too
> high but because the added interface and code complexity and maintenance
> overhead isn't justifiable given how niche, ad hoc and varied these use
> cases get.

You can't calculate percentiles accurately and robustly in one pass -
that only works if your input data obeys a nice statistical
distribution, and the cases we care about are the ones where it doesn't.

> > > > The time stats stuff I wrote is _really_ cheap, and you really want this
> > > > stuff always on so that you've actually got the data you need when
> > > > you're bughunting.
> > >
> > > For some stats and some use cases, always being available is useful and
> > > building fixed infra for them makes sense. For other stats and other use
> > > cases, flexibility is pretty useful too (e.g. what if you want a percentile
> > > distribution which is filtered by some criteria?). They aren't mutually
> > > exclusive, and I'm not sure bdi wb instrumentation is on top of enough
> > > people's minds.
> > >
> > > As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> > > these programs per packet.
> >
> > The main things I want are just
> > - elapsed time since last writeback IO completed, so we can see at a
> >   glance if it's stalled
> > - time stats on writeback io initiation to completion
> >
> > The main value of this one will be tracking down tail latency issues and
> > finding out where in the stack they originate.
>
> Yeah, I mean, if always keeping those numbers around is useful for a wide
> enough number of users and cases, sure, go ahead and add fixed infra. I'm
> not quite sure bdi wb stats fall in that bucket given how little attention
> it usually gets.

I think it should be getting a lot more attention given that memory
reclaim and writeback are generally implicated whenever a user complains
about their system going out to lunch.
Hello, Kent.

On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> Most users are never going to touch tracing, let alone BPF; that's too
> much setup. But I can and do regularly tell users "check this, this and
> this" and debug things on that basis without ever touching their
> machine.

I think this is where the disconnect is. It's not difficult to set up at
all. Nowadays, in most distros, it comes down to something like running
"pacman -S bcc" and then running "/usr/share/bcc/tools/biolatpcts" with
these params, or running this script I'm attaching. It is a significant
boost when debugging many different kernel issues. I strongly suggest
giving it a try and getting used to it rather than resisting it.

Thanks.
On Thu, Mar 28, 2024 at 10:46:33AM -1000, Tejun Heo wrote:
> Hello, Kent.
>
> On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> > Most users are never going to touch tracing, let alone BPF; that's too
> > much setup. But I can and do regularly tell users "check this, this and
> > this" and debug things on that basis without ever touching their
> > machine.
>
> I think this is where the disconnect is. It's not difficult to set up at
> all. Nowadays, in most distros, it comes down to something like running
> "pacman -S bcc" and then running "/usr/share/bcc/tools/biolatpcts" with
> these params, or running this script I'm attaching. It is a significant
> boost when debugging many different kernel issues. I strongly suggest
> giving it a try and getting used to it rather than resisting it.

The disconnect is with you, Tejun. I use tracing all the time, and it's
not going to be much of a discussion if you're not going to formulate a
response.
on 3/29/2024 3:24 AM, Kent Overstreet wrote:
> On Wed, Mar 27, 2024 at 11:57:45PM +0800, Kemeng Shi wrote:
>> v1->v2:
>> -Send cleanup to wq_monitor.py separately.
>> -Add patch to avoid use after free of bdi.
>> -Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
>> -Use rcu walk to avoid use after free.
>> -Add debug output to each related patches.
>>
>> This series tries to improve visibility of writeback. Patch 1 makes
>> /sys/kernel/debug/bdi/xxx/stats show writeback info of the whole bdi
>> instead of only writeback info in the root cgroup. Patch 2 adds a new
>> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per-wb writeback
>> info. Patch 4 adds wb_monitor.py to monitor basic writeback info
>> of a running system; more info could be added on demand. The rest of the
>> patches are some random cleanups. More details can be found in the
>> respective patches. Thanks!
>> This series is on top of patchset [1].
>>
>> [1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
>
> Not bad.
>
Hi Kent,

> I've been trying to improve our ability to debug latency issues - stalls
> of all sorts. While you're looking at all this code, do you think you
> could find some places to collect useful latency numbers?

I would like to do that to collect more useful info for writeback.

> fs/bcachefs/time_stats.c has some code that's going to be moving out to
> lib/ at some point, after I switch it to MAD; if you could hook that up
> as well to a few points we could see at a glance if there are stalls
> happening in the writeback path.

I see that Tejun recommends using BPF. I don't know much about BPF or the
new approach in time_stats.c. Personally, I think it's better to use the
new approach after the work of moving the code to lib/ is merged. Then I
would like to submit a patchset to discuss using it in writeback. Would
this make sense to you? Looking forward to your reply!

Thanks,
Kemeng

>> [...]
On Thu 28-03-24 10:46:33, Tejun Heo wrote:
> Hello, Kent.
>
> On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> > Most users are never going to touch tracing, let alone BPF; that's too
> > much setup. But I can and do regularly tell users "check this, this and
> > this" and debug things on that basis without ever touching their
> > machine.
>
> I think this is where the disconnect is. It's not difficult to set up at
> all. Nowadays, in most distros, it comes down to something like running
> "pacman -S bcc" and then running "/usr/share/bcc/tools/biolatpcts" with
> these params, or running this script I'm attaching. It is a significant
> boost when debugging many different kernel issues. I strongly suggest
> giving it a try and getting used to it rather than resisting it.

Yeah, BPF is great and I use it, but to fill in some cases from practice:
there are sysadmins refusing to install bcc or run your BPF scripts on
their systems due to company regulations, their personal fear, or
whatever. So debugging with what you can achieve from a shell is still
the thing quite often.

								Honza
Hello,

On Wed, Apr 03, 2024 at 06:27:16PM +0200, Jan Kara wrote:
> Yeah, BPF is great and I use it, but to fill in some cases from practice:
> there are sysadmins refusing to install bcc or run your BPF scripts on
> their systems due to company regulations, their personal fear, or
> whatever. So debugging with what you can achieve from a shell is still
> the thing quite often.

Yeah, I mean, this happens with anything new. Tracing itself took quite a
while to be adopted widely. BPF, bcc, and bpftrace are all still pretty new
and it's likely that the adoption line will keep shifting for quite a
while. Besides, even with all the new gizmos there definitely are cases
where the good ol' cat interface makes sense.

So, if the static interface makes sense, we add it, but we should keep in
mind that the trade-offs for adding such static infrastructure, especially
for the ones which aren't *widely* useful, are rather quickly shifting in
the less favorable direction.

Thanks.
On Wed, Apr 03, 2024 at 08:44:05AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 03, 2024 at 06:27:16PM +0200, Jan Kara wrote:
> > Yeah, BPF is great and I use it, but to fill in some cases from practice:
> > there are sysadmins refusing to install bcc or run your BPF scripts on
> > their systems due to company regulations, their personal fear, or
> > whatever. So debugging with what you can achieve from a shell is still
> > the thing quite often.
>
> Yeah, I mean, this happens with anything new. Tracing itself took quite a
> while to be adopted widely. BPF, bcc, and bpftrace are all still pretty new
> and it's likely that the adoption line will keep shifting for quite a
> while. Besides, even with all the new gizmos there definitely are cases
> where the good ol' cat interface makes sense.
>
> So, if the static interface makes sense, we add it, but we should keep in
> mind that the trade-offs for adding such static infrastructure, especially
> for the ones which aren't *widely* useful, are rather quickly shifting in
> the less favorable direction.

A lot of our static debug infrastructure isn't that useful because it
just sucks. Every time I hit a sysfs or procfs file that's just a single
integer, and nothing else, when clearly there's internal structure and
description that needs to be there, I die a little inside. It's lazy and
amateurish.

I regularly debug things in bcachefs over IRC in about 5-10 minutes of
asking people to check various files and pastebin them - this is my
normal process, I pretty much never have to ssh in and touch the actual
machines.

That's how it should be if you just make a point of making your internal
state easy to view and introspect, but when I'm debugging issues that
run into the wider block layer, or memory reclaim, we often hit a wall.

Writeback throttling was buggy for _months_, no visibility or
introspection or concerns for debugging, and that's a small chunk of
code. io_uring - had to disable it. I _still_ have people bringing
issues to me that are clearly memory reclaim related but I don't have
the tools.

It's not like any of this code exports much in the way of useful
tracepoints either, but tracepoints often just aren't what you want;
what you want is just to be able to see internal state (_without_ having
to use a debugger, because that's completely impractical outside highly
controlled environments) - and tracing is also never the first thing you
want to reach for when you have a user asking you "hey, this thing went
wonky, what's it doing?" - tracing automatically turns it into a multi
step process of decide what you want to look at, run the workload more
to collect data, iterate.

Think more about "what would make code easier to debug" and less about
"how do I shove this round peg through the square tracing/BPF slot".
There's _way_ more we could be doing that would just make our lives
easier.
Hello,

On Wed, Apr 03, 2024 at 03:06:56PM -0400, Kent Overstreet wrote:
...
> That's how it should be if you just make a point of making your internal
> state easy to view and introspect, but when I'm debugging issues that
> run into the wider block layer, or memory reclaim, we often hit a wall.

Try drgn:

  https://drgn.readthedocs.io/en/latest/

I've been adding drgn scripts under the tools/ directory for
introspection. They're easy to write, deploy and ask users to run.

> Writeback throttling was buggy for _months_, no visibility or
> introspection or concerns for debugging, and that's a small chunk of
> code. io_uring - had to disable it. I _still_ have people bringing
> issues to me that are clearly memory reclaim related but I don't have
> the tools.
>
> It's not like any of this code exports much in the way of useful
> tracepoints either, but tracepoints often just aren't what you want;
> what you want is just to be able to see internal state (_without_ having
> to use a debugger, because that's completely impractical outside highly
> controlled environments) - and tracing is also never the first thing you
> want to reach for when you have a user asking you "hey, this thing went
> wonky, what's it doing?" - tracing automatically turns it into a multi
> step process of decide what you want to look at, run the workload more
> to collect data, iterate.
>
> Think more about "what would make code easier to debug" and less about
> "how do I shove this round peg through the square tracing/BPF slot".
> There's _way_ more we could be doing that would just make our lives
> easier.

Maybe it'd help to classify visibility into the following categories:

1. Current state introspection.
2. Dynamic behavior tracing.
3. Accumulative behavior profiling.

drgn is great for #1. Tracing and BPF stuff is great for #2, especially
when things get complicated. #3 is the trickiest. Static stuff is useful
in a lot of cases but BPF can also be useful in other cases.

I agree that it's all about using the right tool for the problem.
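To make the drgn suggestion concrete, here is a hedged sketch of a current-state introspection script that walks the global bdi_list and prints a little per-wb state. The symbol and member names (bdi_list, wb_list, bdi_node, wb->stat) and the stat indices mirror current mainline sources and are assumptions about the kernel being inspected; the `prog` object is provided by the drgn CLI when the script is run as `drgn script.py`.

#!/usr/bin/env drgn
# Hedged drgn sketch: dump per-wb state straight from a live kernel.
from drgn.helpers.linux.list import list_for_each_entry
from drgn.helpers.linux.percpu import percpu_counter_sum

# Mirrors enum wb_stat_item (assumed): WB_RECLAIMABLE=0, WB_WRITEBACK=1.
WB_RECLAIMABLE, WB_WRITEBACK = 0, 1

for bdi in list_for_each_entry(
        "struct backing_dev_info", prog["bdi_list"].address_of_(), "bdi_list"):
    print(bdi.dev_name.string_().decode())
    for wb in list_for_each_entry(
            "struct bdi_writeback", bdi.wb_list.address_of_(), "bdi_node"):
        print("  wb state=%#x reclaimable=%d pages writeback=%d pages" % (
            wb.state.value_(),
            percpu_counter_sum(wb.stat[WB_RECLAIMABLE].address_of_()),
            percpu_counter_sum(wb.stat[WB_WRITEBACK].address_of_())))

Because drgn reads kernel memory read-only (via /proc/kcore plus debug info), a script like this can be handed to a user and run against a wedged machine without changing its behavior, which is the "category 1" use case above.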
On Wed, Apr 03, 2024 at 09:21:37AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 03, 2024 at 03:06:56PM -0400, Kent Overstreet wrote:
> ...
> > That's how it should be if you just make a point of making your internal
> > state easy to view and introspect, but when I'm debugging issues that
> > run into the wider block layer, or memory reclaim, we often hit a wall.
>
> Try drgn:
>
>   https://drgn.readthedocs.io/en/latest/
>
> I've been adding drgn scripts under the tools/ directory for
> introspection. They're easy to write, deploy and ask users to run.

Which is still inferior to simply writing to_text() functions for all
your objects and exposing them under sysfs/debugfs. Plus, it's a whole
new language/system for both devs and users to learn.

And having to_text() functions makes your log and error messages way
better.

"But what about code size/overhead?" - bullshit, we're talking about a
couple percent of .text for the code itself; we blow more memory on
permanent dentries/inodes due to the way our virtual filesystems work,
but that's more of a problem for tracefs.

> > Writeback throttling was buggy for _months_, no visibility or
> > introspection or concerns for debugging, and that's a small chunk of
> > code. io_uring - had to disable it. I _still_ have people bringing
> > issues to me that are clearly memory reclaim related but I don't have
> > the tools.
> >
> > It's not like any of this code exports much in the way of useful
> > tracepoints either, but tracepoints often just aren't what you want;
> > what you want is just to be able to see internal state (_without_ having
> > to use a debugger, because that's completely impractical outside highly
> > controlled environments) - and tracing is also never the first thing you
> > want to reach for when you have a user asking you "hey, this thing went
> > wonky, what's it doing?" - tracing automatically turns it into a multi
> > step process of decide what you want to look at, run the workload more
> > to collect data, iterate.
> >
> > Think more about "what would make code easier to debug" and less about
> > "how do I shove this round peg through the square tracing/BPF slot".
> > There's _way_ more we could be doing that would just make our lives
> > easier.
>
> Maybe it'd help to classify visibility into the following categories:
>
> 1. Current state introspection.
> 2. Dynamic behavior tracing.
> 3. Accumulative behavior profiling.
>
> drgn is great for #1. Tracing and BPF stuff is great for #2, especially
> when things get complicated. #3 is the trickiest. Static stuff is useful
> in a lot of cases but BPF can also be useful in other cases.
>
> I agree that it's all about using the right tool for the problem.

Yeah, and you guys are all about the nerdiest and most overengineered
tools and ignoring the basics. Get the simple stuff right, /then/ if
there's stuff you still can't do, that's when you start looking at the
more complicated stuff.