mbox series

[0/2] tracing: Have trace_pid_list be a sparse array

Message ID 20210924033547.939554938@goodmis.org (mailing list archive)
Headers show
Series tracing: Have trace_pid_list be a sparse array | expand

Message

Steven Rostedt Sept. 24, 2021, 3:35 a.m. UTC
When the trace_pid_list was created, the default pid max was 32768.
Creating a bitmask that can hold one bit for all 32768 took up 4096 (one
page). Having a one page bitmask was not much of a problem, and that was
used for mapping pids. But today, systems are bigger and can run more
tasks, and now the default pid_max is usually set to 4194304. Which means
to handle that many pids requires 524288 bytes. Worse yet, the pid_max can
be set to 2^30 (1073741824 or 1G) which would take 134217728 (128M) of
memory to store this array.

Since the pid_list array is very sparsely populated, it is a huge waste of
memory to store all possible bits for each pid when most will not be set.

Instead, use a page table scheme to store the array, and allow this to
handle up to 32 bit pids.

The pid_mask will start out with 1024 entries for the first 10 MSB bits.
This will cost 4K for 32 bit architectures and 8K for 64 bit. Each of
these will have a 1024 array to store the next 10 bits of the pid (another
4 or 8K). These will hold an 512 byte bitmask (which will cover the LSB 12
  bits or 4096 bits).
 
When the trace_pid_list is allocated, it will have the 4/8K upper bits
allocated, and then it will allocate a cache for the next upper chunks and
the lower chunks (default 6 of each). Then when a bit is "set", these
chunks will be pulled from the free list and added to the array. If the
free list gets down to a lever (default 2), it will trigger an irqwork
that will refill the cache back up.

On clearing a bit, if the clear causes the bitmask to be zero, that chunk
will then be placed back into the free cache for later use, keeping the
need to allocate more down to a minimum.


Steven Rostedt (VMware) (2):
      tracing: Place trace_pid_list logic into abstract functions
      tracing: Create a sparse bitmask for pid filtering

----
 kernel/trace/Makefile       |   1 +
 kernel/trace/ftrace.c       |   6 +-
 kernel/trace/pid_list.c     | 551 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.c        |  78 +++----
 kernel/trace/trace.h        |  14 +-
 kernel/trace/trace_events.c |   6 +-
 6 files changed, 595 insertions(+), 61 deletions(-)
 create mode 100644 kernel/trace/pid_list.c

Comments

Steven Rostedt Sept. 24, 2021, 4:07 a.m. UTC | #1
On Thu, 23 Sep 2021 23:35:47 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> The pid_mask will start out with 1024 entries for the first 10 MSB bits.
> This will cost 4K for 32 bit architectures and 8K for 64 bit. Each of
> these will have a 1024 array to store the next 10 bits of the pid (another
> 4 or 8K). These will hold an 512 byte bitmask (which will cover the LSB 12
>   bits or 4096 bits).

Thinking about this more, I should adjust this split.

Currently, this is a 10,10,12 split, but since the upper chunks are
arrays of pointers, and the lower chunk is a bitmask, and that pids
tend to be close together, I should make the lower split bigger.

As a 4K page is 32768 bits (2^15), the lower split should be 15 bits,
not 12. This will probably allocate better.

Of course 32 - 15 is 17. So maybe to keep it simple, by having the two
upper chunks still the same in size, I could have it be 14 bits for the
lower (2048 bytes). And since pid_max can only be 2^30, we don't even
need to cover the full 32 bits.

30 - 14 = 16 = 8 * 2.

Then I can make the upper chunks cover 8 bits (arrays of 256 pointers)
and the lower chunks cove 14 bits. This would coincidentally make all
chunks 2K in size (on 64 bit machines).

Hmm, in that case, I can make the lower and upper chunks use the same
memory and not separate them. They are unions after all. But that may
be unfair for 32 bit machines, as the upper chunks are only going to be
1K in size for them (256 * 4).

-- Steve
Eugene Syromyatnikov Sept. 24, 2021, 10:51 a.m. UTC | #2
On Fri, Sep 24, 2021 at 6:07 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 23 Sep 2021 23:35:47 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > The pid_mask will start out with 1024 entries for the first 10 MSB bits.
> > This will cost 4K for 32 bit architectures and 8K for 64 bit. Each of
> > these will have a 1024 array to store the next 10 bits of the pid (another
> > 4 or 8K). These will hold an 512 byte bitmask (which will cover the LSB 12
> >   bits or 4096 bits).
>
> Thinking about this more, I should adjust this split.
>
> Currently, this is a 10,10,12 split, but since the upper chunks are
> arrays of pointers, and the lower chunk is a bitmask, and that pids
> tend to be close together, I should make the lower split bigger.
>
> As a 4K page is 32768 bits (2^15), the lower split should be 15 bits,
> not 12. This will probably allocate better.
>
> Of course 32 - 15 is 17. So maybe to keep it simple, by having the two
> upper chunks still the same in size, I could have it be 14 bits for the
> lower (2048 bytes). And since pid_max can only be 2^30, we don't even
> need to cover the full 32 bits.
>
> 30 - 14 = 16 = 8 * 2.
>
> Then I can make the upper chunks cover 8 bits (arrays of 256 pointers)
> and the lower chunks cove 14 bits. This would coincidentally make all
> chunks 2K in size (on 64 bit machines).
>
> Hmm, in that case, I can make the lower and upper chunks use the same
> memory and not separate them. They are unions after all. But that may
> be unfair for 32 bit machines, as the upper chunks are only going to be
> 1K in size for them (256 * 4).

Note that there is only one top-level chunk (so its size doesn't
matter much), (usually) about one middle-tier chunk (except for some
heavy cases, since pids are allocated linearly), and quite some
lower-tier bitset leaves.  So I'd optimise towards smaller leaves at
the expense of middle-tier (and especially top-tier) chunk size
(especially considering the fact that in the kernel, buddy allocator
is used), like 12-8-12 or something like that, but I have no factual
basis for arguing about specific split.  Also, I cannot resist from
noticing that this reminds me an awful lot of XArray and [1].  Maybe,
some wrapper around XArray would do?

[1] https://gitlab.com/strace/strace/-/raw/master/src/trie.h

>
> -- Steve
>



--
Eugene Syromyatnikov
mailto:evgsyr@gmail.com
xmpp:esyr@jabber.{ru|org}
Steven Rostedt Sept. 24, 2021, 1:16 p.m. UTC | #3
On Fri, 24 Sep 2021 12:51:07 +0200
Eugene Syromyatnikov <evgsyr@gmail.com> wrote:

Hi Eugene,

> Note that there is only one top-level chunk (so its size doesn't
> matter much), (usually) about one middle-tier chunk (except for some
> heavy cases, since pids are allocated linearly), and quite some
> lower-tier bitset leaves.  So I'd optimise towards smaller leaves at
> the expense of middle-tier (and especially top-tier) chunk size
> (especially considering the fact that in the kernel, buddy allocator
> is used), like 12-8-12 or something like that, but I have no factual

What I really like about my 8 8 14 split I have, it makes everything 2K in
size on 64 bit machines (1K 1K 2K for 32 bit, but who cares ;-)

 1 << 8 * 8 bytes = 2K   // top tiers are pointers to lower tiers
 1 << 14 bits = 2K       // lower tier only cares about bits

This means they will likely all be allocated in the same slab.

I'm optimizing the top tiers for size, because they are likely to be empty.
Why add memory for something that will never be used, and can't be removed.
Note, the middle and lower tiers can be reused when they go empty, which is
a likely use case (at least when I test this using hackbench).

> basis for arguing about specific split.  Also, I cannot resist from
> noticing that this reminds me an awful lot of XArray and [1].  Maybe,
> some wrapper around XArray would do?
> 
> [1] https://gitlab.com/strace/strace/-/raw/master/src/trie.h
> 

I looked into xarray and it appears to be optimized for storing something,
where as I'm just interested in a sparse bitmask.

Thanks for the review.

Note, I'll be posting a v3 soon because I found if I echo 1<<30 into
set_event_pid, it adds 0 (because it only looks at the bottom 30 bits).
It should really return -EINVAL.

-- Steve
Masami Hiramatsu (Google) Sept. 24, 2021, 1:35 p.m. UTC | #4
On Fri, 24 Sep 2021 09:16:27 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 24 Sep 2021 12:51:07 +0200
> Eugene Syromyatnikov <evgsyr@gmail.com> wrote:
> 
> Hi Eugene,
> 
> > Note that there is only one top-level chunk (so its size doesn't
> > matter much), (usually) about one middle-tier chunk (except for some
> > heavy cases, since pids are allocated linearly), and quite some
> > lower-tier bitset leaves.  So I'd optimise towards smaller leaves at
> > the expense of middle-tier (and especially top-tier) chunk size
> > (especially considering the fact that in the kernel, buddy allocator
> > is used), like 12-8-12 or something like that, but I have no factual
> 
> What I really like about my 8 8 14 split I have, it makes everything 2K in
> size on 64 bit machines (1K 1K 2K for 32 bit, but who cares ;-)
> 
>  1 << 8 * 8 bytes = 2K   // top tiers are pointers to lower tiers
>  1 << 14 bits = 2K       // lower tier only cares about bits
> 
> This means they will likely all be allocated in the same slab.
> 
> I'm optimizing the top tiers for size, because they are likely to be empty.
> Why add memory for something that will never be used, and can't be removed.
> Note, the middle and lower tiers can be reused when they go empty, which is
> a likely use case (at least when I test this using hackbench).
> 
> > basis for arguing about specific split.  Also, I cannot resist from
> > noticing that this reminds me an awful lot of XArray and [1].  Maybe,
> > some wrapper around XArray would do?
> > 
> > [1] https://gitlab.com/strace/strace/-/raw/master/src/trie.h
> > 
> 
> I looked into xarray and it appears to be optimized for storing something,
> where as I'm just interested in a sparse bitmask.

I guess he suggested that store the bitmask in xarray. Anyway, both are
OK to me. This is needed for reducing the memory.

Thank you,

> 
> Thanks for the review.
> 
> Note, I'll be posting a v3 soon because I found if I echo 1<<30 into
> set_event_pid, it adds 0 (because it only looks at the bottom 30 bits).
> It should really return -EINVAL.
> 
> -- Steve
Eugene Syromyatnikov Sept. 24, 2021, 2:18 p.m. UTC | #5
On Fri, Sep 24, 2021 at 3:35 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 24 Sep 2021 09:16:27 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> > I'm optimizing the top tiers for size, because they are likely to be empty.
> > Why add memory for something that will never be used, and can't be removed.
> > Note, the middle and lower tiers can be reused when they go empty, which is
> > a likely use case (at least when I test this using hackbench).

Makes sense;  I was thinking about worse case scenarios—tracing
thousands+ processes, but those probably not as common and important.

> > I looked into xarray and it appears to be optimized for storing something,
> > where as I'm just interested in a sparse bitmask.
>
> I guess he suggested that store the bitmask in xarray. Anyway, both are
> OK to me. This is needed for reducing the memory.

Yes, the idea was to store pointers to bitset leaves in XArray and
leverage its radix tree implementation, at cost of somewhat lesser
efficiency (since XArray indices are longs and thus it employs more
intermediate levels on 64-bit architectures).