
[v2,0/6] Improve visibility of writeback

Message ID 20240327155751.3536-1-shikemeng@huaweicloud.com (mailing list archive)

Message

Kemeng Shi March 27, 2024, 3:57 p.m. UTC
v1->v2:
-Send the cleanup to wq_monitor.py separately.
-Add a patch to avoid use-after-free of the bdi.
-Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
-Use an RCU walk to avoid use-after-free.
-Add debug output to each related patch.

This series tries to improve the visibility of writeback. Patch 1 makes
/sys/kernel/debug/bdi/xxx/stats show writeback info for the whole bdi
instead of only the writeback info of the root cgroup. Patch 2 adds a new
debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per-wb writeback
info. Patch 4 adds wb_monitor.py to monitor basic writeback info
of a running system; more info can be added on demand. The remaining
patches are assorted cleanups. More details can be found in the respective
patches. Thanks!
This series is on top of patchset [1].

[1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* all writeback info of bdi is successfully collected */
# cat /sys/kernel/debug/bdi/252:16/stats:
BdiWriteback:              448 kB
BdiReclaimable:        1303904 kB
BdiDirtyThresh:      189914124 kB
DirtyThresh:         195337564 kB
BackgroundThresh:     32516508 kB
BdiDirtied:            3591392 kB
BdiWritten:            2287488 kB
BdiWriteBandwidth:      322248 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   2
b_dirty_time:                0
bdi_list:                    1
state:                       1

/* per wb writeback info is collected */
# cat /sys/kernel/debug/bdi/252:16/wb_stats:
WbCgIno:                    1
WbWriteback:                0 kB
WbReclaimable:              0 kB
WbDirtyThresh:              0 kB
WbDirtied:                  0 kB
WbWritten:                  0 kB
WbWriteBandwidth:      102400 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  0
b_dirty_time:               0
state:                      1
WbCgIno:                 4284
WbWriteback:              448 kB
WbReclaimable:         818944 kB
WbDirtyThresh:        3096524 kB
WbDirtied:            2266880 kB
WbWritten:            1447936 kB
WbWriteBandwidth:      214036 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5
WbCgIno:                 4325
WbWriteback:              224 kB
WbReclaimable:         819392 kB
WbDirtyThresh:        2920088 kB
WbDirtied:            2551808 kB
WbWritten:            1732416 kB
WbWriteBandwidth:      201832 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5

/* monitor writeback info */
# ./wb_monitor.py 252:16 -c
                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360


                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360
...
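
As a rough illustration (this is not the wb_monitor.py added in patch 4), the
per-wb output above can be consumed with a few lines of Python: each block in
wb_stats starts with a WbCgIno: line, so a monitor only needs to split on that
key and strip the kB/kBps units:

#!/usr/bin/env python3
# Minimal sketch, not the actual wb_monitor.py from patch 4: parse the
# per-wb blocks from /sys/kernel/debug/bdi/<dev>/wb_stats shown above.
# Each block begins with a "WbCgIno:" line; values may carry a kB/kBps
# unit which is stripped here.
import sys

def read_wb_stats(dev):
    wbs = []
    cur = None
    with open(f"/sys/kernel/debug/bdi/{dev}/wb_stats") as f:
        for line in f:
            key, _, value = line.partition(":")
            fields = value.split()
            if not fields:
                continue
            if key == "WbCgIno":        # a new per-wb block begins here
                cur = {}
                wbs.append(cur)
            if cur is not None:
                cur[key] = int(fields[0])
    return wbs

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "252:16"
    for wb in read_wb_stats(dev):
        print(f"cgroup inode {wb['WbCgIno']}: "
              f"writeback {wb['WbWriteback']} kB, "
              f"avg_bw {wb['WbWriteBandwidth']} kBps")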

Kemeng Shi (6):
  writeback: protect race between bdi release and bdi_debug_stats_show
  writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  writeback: support retrieving per group debug writeback stats of bdi
  writeback: add wb_monitor.py script to monitor writeback info on bdi
  writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
  writeback: define GDTC_INIT_NO_WB to null

 include/linux/writeback.h     |   1 +
 mm/backing-dev.c              | 203 ++++++++++++++++++++++++++++++----
 mm/page-writeback.c           |  31 ++++--
 tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++
 4 files changed, 378 insertions(+), 29 deletions(-)
 create mode 100644 tools/writeback/wb_monitor.py

Comments

Andrew Morton March 27, 2024, 5:40 p.m. UTC | #1
On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:

> This series tries to improve visibility of writeback.

Well...  why?  Is anyone usefully using the existing instrumentation? 
What is to be gained by expanding it further?  What is the case for
adding this code?

I don't recall hearing of anyone using the existing debug
instrumentation so perhaps we should remove it!

Also, I hit a build error and a pile of warnings with an arm
allnoconfig build.
Kemeng Shi March 28, 2024, 1:59 a.m. UTC | #2
on 3/28/2024 1:40 AM, Andrew Morton wrote:
> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> 
>> This series tries to improve visibility of writeback.
> 
> Well...  why?  Is anyone usefully using the existing instrumentation? 
> What is to be gained by expanding it further?  What is the case for
> adding this code?
> 
> I don't recall hearing of anyone using the existing debug
> instrumentation so perhaps we should remove it!
Hi Andrew, this was discussed in [1]. In short, I use the
debug files to test changes in the submitted patchset [1]. The
wb_monitor.py script was suggested by Tejun in [2] to improve
visibility of writeback.
> 
> Also, I hit a build error and a pile of warnings with an arm
> allnoconfig build.
> 
Sorry for this, I only tested on x86. I will look into this and
fix the build problem in the next version.

[1] https://lore.kernel.org/lkml/44e3b910-8b52-5583-f8a9-37105bf5e5b6@huaweicloud.com/
[2] https://lore.kernel.org/lkml/a747dc7d-f24a-08bd-d969-d3fb35e151b7@huaweicloud.com/
[3] https://lore.kernel.org/lkml/ZcUsOb_fyvYr-zZ-@slm.duckdns.org/
Kemeng Shi March 28, 2024, 8:23 a.m. UTC | #3
on 3/28/2024 9:59 AM, Kemeng Shi wrote:
> 
> on 3/28/2024 1:40 AM, Andrew Morton wrote:
>> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>>
>>> This series tries to improve visibility of writeback.
>>
>> Well...  why?  Is anyone usefully using the existing instrumentation? 
>> What is to be gained by expanding it further?  What is the case for
>> adding this code?
>>
>> I don't recall hearing of anyone using the existing debug
>> instrumentation so perhaps we should remove it!
> Hi Andrew, this was discussed in [1]. In short, I use the
> debug files to test change in submit patchset [1]. The
> wb_monitor.py is suggested by Tejun in [2] to improve
> visibility of writeback.
>>
>> Also, I hit a build error and a pile of warnings with an arm
>> allnoconfig build.
With an arm allnoconfig build on the up-to-date mm-unstable branch, I don't
hit any build error, only some warnings like the following:
...
mm/page-writeback.c: In function ‘cgwb_calc_thresh’:
mm/page-writeback.c:906:13: warning: ‘writeback’ is used uninitialized in this function [-Wuninitialized]
  906 |  mdtc.dirty += writeback;
      |  ~~~~~~~~~~~^~~~~~~~~~~~
In file included from ./include/linux/kernel.h:28,
                 from mm/page-writeback.c:15:
./include/linux/minmax.h:46:54: warning: ‘filepages’ is used uninitialized in this function [-Wuninitialized]
   46 | #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? (x) : (y))
      |                                                      ^
mm/page-writeback.c:898:16: note: ‘filepages’ was declared here
  898 |  unsigned long filepages, headroom, writeback;
      |                ^~~~~~~~~
In file included from ./include/linux/kernel.h:28,
                 from mm/page-writeback.c:15:
./include/linux/minmax.h:46:54: warning: ‘headroom’ is used uninitialized in this function [-Wuninitialized]
   46 | #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? (x) : (y))
      |                                                      ^
mm/page-writeback.c:898:27: note: ‘headroom’ was declared here
  898 |  unsigned long filepages, headroom, writeback;
      |                           ^~~~~~~~
...

The only reason I can think of is that I also applied patchset [1]
for the build. I mentioned patchset [1] in the cover letter, but I forgot
to call out the dependency on that patchset.
If this is the cause of the build error, I will send a new
set based on mm-unstable in the next version.

Thanks,
Kemeng

[1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
>>
> Sorry for this, I only tested on x86. I will look into this and
> fix the build problem in next version.
> 
> [1] https://lore.kernel.org/lkml/44e3b910-8b52-5583-f8a9-37105bf5e5b6@huaweicloud.com/
> [2] https://lore.kernel.org/lkml/a747dc7d-f24a-08bd-d969-d3fb35e151b7@huaweicloud.com/
> [3] https://lore.kernel.org/lkml/ZcUsOb_fyvYr-zZ-@slm.duckdns.org/
>
Kent Overstreet March 28, 2024, 7:15 p.m. UTC | #4
On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> 
> > This series tries to improve visibility of writeback.
> 
> Well...  why?  Is anyone usefully using the existing instrumentation? 
> What is to be gained by expanding it further?  What is the case for
> adding this code?
> 
> I don't recall hearing of anyone using the existing debug
> instrumentation so perhaps we should remove it!

Remove debug instrumentation!? Surely you jest?

I generally don't hear from users of my code when things are working, I
only expect to hear from people when it's not.
Andrew Morton March 28, 2024, 7:23 p.m. UTC | #5
On Thu, 28 Mar 2024 15:15:03 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> > On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> > 
> > > This series tries to improve visibility of writeback.
> > 
> > Well...  why?  Is anyone usefully using the existing instrumentation? 
> > What is to be gained by expanding it further?  What is the case for
> > adding this code?
> > 
> > I don't recall hearing of anyone using the existing debug
> > instrumentation so perhaps we should remove it!
> 
> Remove debug instrumentation!? Surely you jest?

Absolutely not.  Any code in the kernel should have ongoing
justification for remaining there.  If no such justification exists,
out it goes.
Kent Overstreet March 28, 2024, 7:24 p.m. UTC | #6
On Wed, Mar 27, 2024 at 11:57:45PM +0800, Kemeng Shi wrote:
> v1->v2:
> -Send cleanup to wq_monitor.py separately.
> -Add patch to avoid use after free of bdi.
> -Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
> -Use rcu walk to avoid use after free.
> -Add debug output to each related patches.
> 
> This series tries to improve visibility of writeback. Patch 1 make
> /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
> instead of only writeback info in root cgroup. Patch 2 add a new
> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
> info. Patch 4 add wb_monitor.py to monitor basic writeback info
> of running system, more info could be added on demand. Rest patches
> are some random cleanups. More details can be found in respective
> patches. Thanks!
> This series is on top of patchset [1].
> 
> [1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

Not bad

I've been trying to improve our ability to debug latency issues - stalls
of all sorts. While you're looking at all this code, do you think you
could find some places to collect useful latency numbers?

fs/bcachefs/time_stats.c has some code that's going to be moving out to
lib/ at some point, after I switch it to MAD; if you could hook that up
as well to a few points we could see at a glance if there are stalls
happening in the writeback path.

Tejun Heo March 28, 2024, 7:31 p.m. UTC | #7
Hello, Kent.

On Thu, Mar 28, 2024 at 03:24:35PM -0400, Kent Overstreet wrote:
> fs/bcachefs/time_stats.c has some code that's going to be moving out to
> lib/ at some point, after I switch it to MAD; if you could hook that up
> as well to a few points we could see at a glance if there are stalls
> happening in the writeback path.

Using BPF (whether through bcc or bpftrace) is likely a better approach for
this sort of detailed instrumentation. Fixed debug information is useful, but
it's also a common occurrence that it doesn't quite reveal the full picture
of what one is trying to understand, and one needs to dig a bit deeper or wider,
aggregate data in a different way, or whatever.

So, rather than adding more fixed infrastructure, I'd suggest adding places
which can easily be instrumented using the existing tools (they are really
great once you get used to them) whether that's tracepoints or just
strategically placed noinline functions.

Thanks.
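
As a rough illustration of the kind of ad-hoc instrumentation described above,
the existing writeback:writeback_start and writeback:writeback_written
tracepoints can be hooked from a short bcc script to histogram how long each
writeback pass takes, with no new kernel code. This is only a sketch: keying
by worker task id assumes both events for one pass fire in the same task.

#!/usr/bin/env python3
# Sketch: latency histogram of writeback passes using bcc and the
# existing writeback tracepoints. Keying by task id is a simplifying
# assumption (start/written for one pass fire in the same worker task).
from time import sleep
from bcc import BPF

prog = r"""
BPF_HASH(start, u64, u64);
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(writeback, writeback_start) {
    u64 id = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&id, &ts);
    return 0;
}

TRACEPOINT_PROBE(writeback, writeback_written) {
    u64 id = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&id);
    if (tsp) {
        u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
        dist.increment(bpf_log2l(delta_us));
        start.delete(&id);
    }
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing writeback passes... hit Ctrl-C to end.")
try:
    sleep(999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")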
Kent Overstreet March 28, 2024, 7:36 p.m. UTC | #8
On Thu, Mar 28, 2024 at 12:23:52PM -0700, Andrew Morton wrote:
> On Thu, 28 Mar 2024 15:15:03 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> > > On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> > > 
> > > > This series tries to improve visibility of writeback.
> > > 
> > > Well...  why?  Is anyone usefully using the existing instrumentation? 
> > > What is to be gained by expanding it further?  What is the case for
> > > adding this code?
> > > 
> > > I don't recall hearing of anyone using the existing debug
> > > instrumentation so perhaps we should remove it!
> > 
> > Remove debug instrumentation!? Surely you jest?
> 
> Absolutely not.  Any code in the kernel should have ongoing
> justification for remaining there.  If no such justification exists,
> out it goes.

Certainly, but this isn't remotely a case where I'd expect to be getting
that kind of feedback. Debugging instrumentation is very much a case
where no one notices it 99% of the time, but when you need it you
_really_ need it.

Not having it can turn a 10 minute "oh, that thing is acting wonky -
it's because your system is overloaded/your drive is wonky/x subsystem
sucks, we know about it and we're working on it" into a weeklong
bughunt, burning up expensive engineer time pointlessly.

To debug complex systems efficiently, in production, in the wild, we
need to be able to see what's going on - we need more of this stuff.

Not to say that this couldn't use more work - perhaps additional focus
on what kinds of issues we expect to need to debug with this, what the
numbers mean and are useful for, documentation on how this relates to
writeback internals, etc.
Kent Overstreet March 28, 2024, 7:40 p.m. UTC | #9
On Thu, Mar 28, 2024 at 09:31:57AM -1000, Tejun Heo wrote:
> Hello, Kent.
> 
> On Thu, Mar 28, 2024 at 03:24:35PM -0400, Kent Overstreet wrote:
> > fs/bcachefs/time_stats.c has some code that's going to be moving out to
> > lib/ at some point, after I switch it to MAD; if you could hook that up
> > as well to a few points we could see at a glance if there are stalls
> > happening in the writeback path.
> 
> Using BPF (whether through bcc or bpftrace) is likely a better approach for
> this sort of detailed instrumentation. Fixed debug information is useful and
> it's also a common occurrence that they don't quite reveal the full picture
> of what one's trying to understand and one needs to dig a bit deeper, wider,
> aggregate data in a different way, or whatever.
> 
> So, rather than adding more fixed infrastructure, I'd suggest adding places
> which can easily be instrumented using the existing tools (they are really
> great once you get used to them) whether that's tracepoints or just
> strategically placed noinline functions.

Collecting latency numbers at various key places is _enormously_ useful.
The hard part is deciding where it's useful to collect; that requires
intimate knowledge of the code. Once you're defining those collection
points statically, doing it with BPF is just another useless layer of
indirection.

The time stats stuff I wrote is _really_ cheap, and you really want this
stuff always on so that you've actually got the data you need when
you're bughunting.
Tejun Heo March 28, 2024, 7:46 p.m. UTC | #10
Hello,

On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> Collecting latency numbers at various key places is _enormously_ useful.
> The hard part is deciding where it's useful to collect; that requires
> intimate knowledge of the code. Once you're defining those collection
> points statically, doing it with BPF is just another useless layer of
> indirection.

Given how much flexibility helps with debugging, claiming it useless is a
stretch.

> The time stats stuff I wrote is _really_ cheap, and you really want this
> stuff always on so that you've actually got the data you need when
> you're bughunting.

For some stats and some use cases, always being available is useful and
building fixed infra for them makes sense. For other stats and other use
cases, flexibility is pretty useful too (e.g. what if you want percentile
distribution which is filtered by some criteria?). They aren't mutually
exclusive and I'm not sure bdi wb instrumentation is on top of enough
people's minds.

As for overhead, BPF instrumentation can be _really_ cheap too. We often run
these programs per packet.

Thanks.
Kent Overstreet March 28, 2024, 7:55 p.m. UTC | #11
On Thu, Mar 28, 2024 at 09:46:39AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > Collecting latency numbers at various key places is _enormously_ useful.
> > The hard part is deciding where it's useful to collect; that requires
> > intimate knowledge of the code. Once you're defining those collection
> > points statically, doing it with BPF is just another useless layer of
> > indirection.
> 
> Given how much flexibility helps with debugging, claiming it useless is a
> stretch.

Well, what would it add?

> > The time stats stuff I wrote is _really_ cheap, and you really want this
> > stuff always on so that you've actually got the data you need when
> > you're bughunting.
> 
> For some stats and some use cases, always being available is useful and
> building fixed infra for them makes sense. For other stats and other use
> cases, flexibility is pretty useful too (e.g. what if you want percentile
> distribution which is filtered by some criteria?). They aren't mutually
> exclusive and I'm not sure bdi wb instrumentation is on top of enough
> people's minds.
> 
> As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> these programs per packet.

The main things I want are just
 - elapsed time since last writeback IO completed, so we can see at a
   glance if it's stalled
 - time stats on writeback io initiation to completion

The main value of this one will be tracking down tail latency issues and
finding out where in the stack they originate.
Tejun Heo March 28, 2024, 8:13 p.m. UTC | #12
Hello,

On Thu, Mar 28, 2024 at 03:55:32PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > > Collecting latency numbers at various key places is _enormously_ useful.
> > > The hard part is deciding where it's useful to collect; that requires
> > > intimate knowledge of the code. Once you're defining those collection
> > > points statically, doing it with BPF is just another useless layer of
> > > indirection.
> > 
> > Given how much flexibility helps with debugging, claiming it useless is a
> > stretch.
> 
> Well, what would it add?

It depends on the case but here's an example. If I'm seeing occasional tail
latency spikes, I'd want to know whether there's any correlation with
specific types or sizes of IOs and if so who's issuing them and why. With
BPF, you can detect those conditions to tag and capture where exactly those
IOs are coming from and aggregate the result however you like across
thousands of machines in production without anyone noticing. That's useful,
no?

Also, actual percentile distribution is almost always a lot more insightful
than more coarsely aggregated numbers. We can't add all that to fixed infra,
in most cases not because runtime overhead would be too high but because
the added interface and code complexity and maintenance overhead isn't
justifiable given how niche, ad hoc and varied these use cases get.

> > > The time stats stuff I wrote is _really_ cheap, and you really want this
> > > stuff always on so that you've actually got the data you need when
> > > you're bughunting.
> > 
> > For some stats and some use cases, always being available is useful and
> > building fixed infra for them makes sense. For other stats and other use
> > cases, flexibility is pretty useful too (e.g. what if you want percentile
> > distribution which is filtered by some criteria?). They aren't mutually
> > exclusive and I'm not sure bdi wb instrumentation is on top of enough
> > people's minds.
> > 
> > As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> > these programs per packet.
> 
> The main things I want are just
>  - elapsed time since last writeback IO completed, so we can see at a
>    glance if it's stalled
>  - time stats on writeback io initiation to completion
> 
> The main value of this one will be tracking down tail latency issues and
> finding out where in the stack they originate.

Yeah, I mean, if always keeping those numbers around is useful for a wide
enough number of users and cases, sure, go ahead and add fixed infra. I'm
not quite sure bdi wb stats fall in that bucket given how little attention
it usually gets.

Thanks.
Kent Overstreet March 28, 2024, 8:22 p.m. UTC | #13
On Thu, Mar 28, 2024 at 10:13:27AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 28, 2024 at 03:55:32PM -0400, Kent Overstreet wrote:
> > > On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > > > Collecting latency numbers at various key places is _enormously_ useful.
> > > > The hard part is deciding where it's useful to collect; that requires
> > > > intimate knowledge of the code. Once you're defining those collection
> > > > points statically, doing it with BPF is just another useless layer of
> > > > indirection.
> > > 
> > > Given how much flexibility helps with debugging, claiming it useless is a
> > > stretch.
> > 
> > Well, what would it add?
> 
> It depends on the case but here's an example. If I'm seeing occasional tail
> latency spikes, I'd want to know whether there's any correlation with
> specific types or sizes of IOs and if so who's issuing them and why. With
> BPF, you can detect those conditions to tag and capture where exactly those
> IOs are coming from and aggregate the result however you like across
> thousands of machines in production without anyone noticing. That's useful,
> no?

That's cool, but really esoteric. We need to be able to answer basic
questions and build an overall picture of what the system is doing
without having to reach for the big stuff.

Most users are never going to touch tracing, let alone BPF; that's too
much setup. But I can and do regularly tell users "check this, this and
this" and debug things on that basis without ever touching their
machine.

And basic latency numbers are really easy for users to understand, which
makes them doubly worthwhile to collect and make visible.

> Also, actual percentile distribution is almost always a lot more insightful
> than more coarsely aggregated numbers. We can't add all that to fixed infra.
> In most cases not because runtime overhead would be too high but because
> the added interface and code complexity and maintenance overhead isn't
> justifiable given how niche, ad hoc and varied these use cases get.

You can't calculate percentiles accurately and robustly in one pass -
that only works if your input data obeys a nice statistical
distribution, and the cases we care about are the ones where it doesn't.

> 
> > > > The time stats stuff I wrote is _really_ cheap, and you really want this
> > > > stuff always on so that you've actually got the data you need when
> > > > you're bughunting.
> > > 
> > > For some stats and some use cases, always being available is useful and
> > > building fixed infra for them makes sense. For other stats and other use
> > > cases, flexibility is pretty useful too (e.g. what if you want percentile
> > > distribution which is filtered by some criteria?). They aren't mutually
> > > exclusive and I'm not sure bdi wb instrumentation is on top of enough
> > > people's minds.
> > > 
> > > As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> > > these programs per packet.
> > 
> > The main things I want are just
> >  - elapsed time since last writeback IO completed, so we can see at a
> >    glance if it's stalled
> >  - time stats on writeback io initiation to completion
> > 
> > The main value of this one will be tracking down tail latency issues and
> > finding out where in the stack they originate.
> 
> Yeah, I mean, if always keeping those numbers around is useful for wide
> enough number of users and cases, sure, go ahead and add fixed infra. I'm
> not quite sure bdi wb stats fall in that bucket given how little attention
> it usually gets.

I think it should be getting a lot more attention given that memory
reclaim and writeback are generally implicated whenever a user complains
about their system going out to lunch.
Tejun Heo March 28, 2024, 8:46 p.m. UTC | #14
Hello, Kent.

On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> Most users are never going to touch tracing, let alone BPF; that's too
> much setup. But I can and do regularly tell users "check this, this and
> this" and debug things on that basis without ever touching their
> machine.

I think this is where the disconnect is. It's not difficult to set up at
all. Nowadays, in most distros, it comes down to something like run "pacman
-S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
run this script I'm attaching. It is a significant boost when debugging many
different kernel issues. I strongly suggest giving it a try and getting used
to it rather than resisting it.

Thanks.
Kent Overstreet March 28, 2024, 8:53 p.m. UTC | #15
On Thu, Mar 28, 2024 at 10:46:33AM -1000, Tejun Heo wrote:
> Hello, Kent.
> 
> On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> > Most users are never going to touch tracing, let alone BPF; that's too
> > much setup. But I can and do regularly tell users "check this, this and
> > this" and debug things on that basis without ever touching their
> > machine.
> 
> I think this is where the disconnect is. It's not difficult to set up at
> all. Nowadays, in most distros, it comes down to something like run "pacman
> -S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
> run this script I'm attaching. It is a significant boost when debugging many
> different kernel issues. I strongly suggest giving it a try and getting used
> to it rather than resisting it.

The disconnect is with you, Tejun. I use tracing all the time, and it's
not going to be much of a discussion if you're not going to formulate a
response.
Kemeng Shi April 3, 2024, 6:56 a.m. UTC | #16
on 3/29/2024 3:24 AM, Kent Overstreet wrote:
> On Wed, Mar 27, 2024 at 11:57:45PM +0800, Kemeng Shi wrote:
>> v1->v2:
>> -Send cleanup to wq_monitor.py separately.
>> -Add patch to avoid use after free of bdi.
>> -Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
>> -Use rcu walk to avoid use after free.
>> -Add debug output to each related patches.
>>
>> This series tries to improve visibility of writeback. Patch 1 make
>> /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
>> instead of only writeback info in root cgroup. Patch 2 add a new
>> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
>> info. Patch 4 add wb_monitor.py to monitor basic writeback info
>> of running system, more info could be added on demand. Rest patches
>> are some random cleanups. More details can be found in respective
>> patches. Thanks!
>> This series is on top of patchset [1].
>>
>> [1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
> 
> Not bad
> 
Hi Kent,
> I've been trying to improve our ability to debug latency issues - stalls
> of all sorts. While you're looking at all this code, do you think you
> could find some places to collect useful latency numbers?
I would like to do it to collect more useful info for writeback.
> 
> fs/bcachefs/time_stats.c has some code that's going to be moving out to
> lib/ at some point, after I switch it to MAD; if you could hook that up
> as well to a few points we could see at a glance if there are stalls
> happening in the writeback path.
I see that Tejun recommends using BPF. I don't know much about BPF or the
new approach in time_stats.c. Personally, I think it's better
to use the new approach after the work of moving the code to lib/ is merged.
Then I would like to submit a patchset to discuss using it in writeback.
Would this make sense to you? Looking forward to your reply!

Thanks,
Kemeng
Jan Kara April 3, 2024, 4:27 p.m. UTC | #17
On Thu 28-03-24 10:46:33, Tejun Heo wrote:
> Hello, Kent.
> 
> On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> > Most users are never going to touch tracing, let alone BPF; that's too
> > much setup. But I can and do regularly tell users "check this, this and
> > this" and debug things on that basis without ever touching their
> > machine.
> 
> I think this is where the disconnect is. It's not difficult to set up at
> all. Nowadays, in most distros, it comes down to something like run "pacman
> -S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
> run this script I'm attaching. It is a signficant boost when debugging many
> different kernel issues. I strongly suggest giving it a try and getting used
> to it rather than resisting it.

Yeah, BPF is great and I use it but to fill in some cases from practice,
there are sysadmins refusing to install bcc or run your BPF scripts on
their systems due to company regulations, their personal fear, or whatever.
So debugging with what you can achieve from a shell is still the thing
quite often.

								Honza
Tejun Heo April 3, 2024, 6:44 p.m. UTC | #18
Hello,

On Wed, Apr 03, 2024 at 06:27:16PM +0200, Jan Kara wrote:
> Yeah, BPF is great and I use it but to fill in some cases from practice,
> there are sysadmins refusing to install bcc or run your BPF scripts on
> their systems due to company regulations, their personal fear, or whatever.
> So debugging with what you can achieve from a shell is still the thing
> quite often.

Yeah, I mean, this happens with anything new. Tracing itself took quite a
while to be adopted widely. BPF, bcc, bpftrace are all still pretty new and
it's likely that the adoption line will keep shifting for quite a while.
Besides, even with all the new gizmos there definitely are cases where the
good ol' cat interface makes sense.

So, if the static interface makes sense, we add it but we should keep in
mind that the trade-offs for adding such static infrastructure, especially
for the ones which aren't *widely* useful, are rather quickly shifting in
the less favorable direction.

Thanks.
Kent Overstreet April 3, 2024, 7:06 p.m. UTC | #19
On Wed, Apr 03, 2024 at 08:44:05AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 03, 2024 at 06:27:16PM +0200, Jan Kara wrote:
> > Yeah, BPF is great and I use it but to fill in some cases from practice,
> > there are sysadmins refusing to install bcc or run your BPF scripts on
> > their systems due to company regulations, their personal fear, or whatever.
> > So debugging with what you can achieve from a shell is still the thing
> > quite often.
> 
> Yeah, I mean, this happens with anything new. Tracing itself took quite a
> while to be adopted widely. BPF, bcc, bpftrace are all still pretty new and
> it's likely that the adoption line will keep shifting for quite a while.
> Besides, even with all the new gizmos there definitely are cases where good
> ol' cat interface makes sense.
> 
> So, if the static interface makes sense, we add it but we should keep in
> mind that the trade-offs for adding such static infrastructure, especially
> for the ones which aren't *widely* useful, are rather quickly shifting in
> the less favorable direction.

A lot of our static debug infrastructure isn't that useful because it
just sucks.

Every time I hit a sysfs or procfs file that's just a single integer,
and nothing else, when clearly there's internal structure and
description that needs to be there, I die a little inside. It's lazy and
amateurish.

I regularly debug things in bcachefs over IRC in about 5-10 minutes of
asking to check various files and pastebin them - this is my normal
process, I pretty much never have to ssh and touch the actual machines.

That's how it should be if you just make a point of making your internal
state easy to view and introspect, but when I'm debugging issues that
run into the wider block layer, or memory reclaim, we often hit a wall.

Writeback throttling was buggy for _months_, no visibility or
introspection or concerns for debugging, and that's a small chunk of
code. io_uring - had to disable it. I _still_ have people bringing
issues to me that are clearly memory reclaim related but I don't have
the tools.

It's not like any of this code exports much in the way of useful
tracepoints either, but tracepoints often just aren't what you want;
what you want is just to be able to see internal state (_without_ having to
use a debugger, because that's completely impractical outside highly
controlled environments) - and tracing is also never the first thing you
want to reach for when you have a user asking you "hey, this thing went
wonky, what's it doing?" - tracing automatically turns it into a multi-step
process of deciding what you want to look at, running the workload more
to collect data, and iterating.

Think more about "what would make code easier to debug" and less about
"how do I shove this round peg through the square tracing/BPF slot".
There's _way_ more we could be doing that would just make our lives
easier.
Tejun Heo April 3, 2024, 7:21 p.m. UTC | #20
Hello,

On Wed, Apr 03, 2024 at 03:06:56PM -0400, Kent Overstreet wrote:
...
> That's how it should be if you just make a point of making your internal
> state easy to view and introspect, but when I'm debugging issues that
> run into the wider block layer, or memory reclaim, we often hit a wall.

Try drgn:

  https://drgn.readthedocs.io/en/latest/

I've been adding drgn scripts under tools/ directory for introspection.
They're easy to write, deploy and ask users to run.

> Writeback throttling was buggy for _months_, no visibility or
> introspection or concerns for debugging, and that's a small chunk of
> code. io_uring - had to disable it. I _still_ have people bringing
> issues to me that are clearly memory reclaim related but I don't have
> the tools.
> 
> It's not like any of this code exports much in the way of useful
> tracepoints either, but tracepoints often just aren't what you want;
> what you want just to be able to see internal state (_without_ having to
> use a debugger, because that's completely impractical outside highly
> controlled environments) - and tracing is also never the first thing you
> want to reach for when you have a user asking you "hey, this thing went
> wonky, what's it doing?" - tracing automatically turns it into a multi
> step process of decide what you want to look at, run the workload more
> to collect data, iterate.
> 
> Think more about "what would make code easier to debug" and less about
> "how do I shove this round peg through the square tracing/BPF slot".
> There's _way_ more we could be doing that would just make our lives
> easier.

Maybe it'd help to classify visibility into the following categories:

1. Current state introspection.
2. Dynamic behavior tracing.
3. Accumulative behavior profiling.

drgn is great for #1. Tracing and BPF stuff is great for #2 especially when
things get complicated. #3 is the trickiest. Static stuff is useful in a lot
of cases but BPF can also be useful in other cases.

I agree that it's all about using the right tool for the problem.
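
To make the drgn suggestion above concrete, here is a minimal sketch of such a
script: it walks the live kernel's bdi_list and dumps per-wb write bandwidth
with no new kernel-side interface. The struct and field names used
(backing_dev_info.dev_name, backing_dev_info.wb_list, bdi_writeback.bdi_node,
bdi_writeback.avg_write_bandwidth) and the 4K page size are assumptions based
on current mm/backing-dev.c and may need adjusting per kernel; run it as
"drgn ./wb_drgn.py" against a kernel with debug info available.

#!/usr/bin/env python3
# Illustrative drgn sketch: walk bdi_list in the running kernel and print
# the smoothed write bandwidth of every wb of every bdi. Field names and
# PAGE_KB are assumptions; adjust for the kernel being inspected.
from drgn.helpers.linux.list import list_for_each_entry

PAGE_KB = 4  # assumes 4K pages; bandwidth counters are kept in pages/sec

for bdi in list_for_each_entry("struct backing_dev_info",
                               prog["bdi_list"].address_of_(), "bdi_list"):
    print(bdi.dev_name.string_().decode())
    for wb in list_for_each_entry("struct bdi_writeback",
                                  bdi.wb_list.address_of_(), "bdi_node"):
        bw_kbps = wb.avg_write_bandwidth.value_() * PAGE_KB
        print(f"  wb at {hex(wb.value_())}: avg_write_bandwidth {bw_kbps} kBps")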
Kent Overstreet April 3, 2024, 10:24 p.m. UTC | #21
On Wed, Apr 03, 2024 at 09:21:37AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 03, 2024 at 03:06:56PM -0400, Kent Overstreet wrote:
> ...
> > That's how it should be if you just make a point of making your internal
> > state easy to view and introspect, but when I'm debugging issues that
> > run into the wider block layer, or memory reclaim, we often hit a wall.
> 
> Try drgn:
> 
>   https://drgn.readthedocs.io/en/latest/
> 
> I've been adding drgn scripts under tools/ directory for introspection.
> They're easy to write, deploy and ask users to run.

Which is still inferior to simply writing to_text() functions for all
your objects and exposing them under sysfs/debugfs.

Plus, it's a whole new language/system for both devs and users to
learn.

And having to_text() functions makes your log and error messages way
better.

"But what about code size/overhead?" - bullshit, we're talking about a
couple percent of .text for the code itself; we blow more memory on
permanent dentries/inodes due to the way our virtual filesystems work
but that's more of a problem for tracefs.

> > Writeback throttling was buggy for _months_, no visibility or
> > introspection or concerns for debugging, and that's a small chunk of
> > code. io_uring - had to disable it. I _still_ have people bringing
> > issues to me that are clearly memory reclaim related but I don't have
> > the tools.
> > 
> > It's not like any of this code exports much in the way of useful
> > tracepoints either, but tracepoints often just aren't what you want;
> > what you want just to be able to see internal state (_without_ having to
> > use a debugger, because that's completely impractical outside highly
> > controlled environments) - and tracing is also never the first thing you
> > want to reach for when you have a user asking you "hey, this thing went
> > wonky, what's it doing?" - tracing automatically turns it into a multi
> > step process of decide what you want to look at, run the workload more
> > to collect data, iterate.
> > 
> > Think more about "what would make code easier to debug" and less about
> > "how do I shove this round peg through the square tracing/BPF slot".
> > There's _way_ more we could be doing that would just make our lives
> > easier.
> 
> Maybe it'd help classifying visibility into the following categories:
> 
> 1. Current state introspection.
> 2. Dynamic behavior tracing.
> 3. Accumulative behavior profiling.
> 
> drgn is great for #1. Tracing and BPF stuff is great for #2 especially when
> things get complicated. #3 is the trickiest. Static stuff is useful in a lot
> of cases but BPF can also be useful in other cases.
> 
> I agree that it's all about using the right tool for the problem.

Yeah, and you guys are all about the nerdiest and most overengineered
tools and ignoring the basics. Get the simple stuff right, /then/ if
there's stuff you still can't do, that's when you start looking at the
more complicated stuff.