Message ID: 20200907134055.2878499-1-elver@google.com (mailing list archive)
Series: KFENCE: A low-overhead sampling-based memory safety error detector
On 9/7/20 3:40 PM, Marco Elver wrote: > This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a > low-overhead sampling-based memory safety error detector of heap > use-after-free, invalid-free, and out-of-bounds access errors. This > series enables KFENCE for the x86 and arm64 architectures, and adds > KFENCE hooks to the SLAB and SLUB allocators. > > KFENCE is designed to be enabled in production kernels, and has near > zero performance overhead. Compared to KASAN, KFENCE trades performance > for precision. The main motivation behind KFENCE's design, is that with > enough total uptime KFENCE will detect bugs in code paths not typically > exercised by non-production test workloads. One way to quickly achieve a > large enough total uptime is when the tool is deployed across a large > fleet of machines. Looks nice! > KFENCE objects each reside on a dedicated page, at either the left or > right page boundaries. The pages to the left and right of the object > page are "guard pages", whose attributes are changed to a protected > state, and cause page faults on any attempted access to them. Such page > faults are then intercepted by KFENCE, which handles the fault > gracefully by reporting a memory access error. > > Guarded allocations are set up based on a sample interval (can be set > via kfence.sample_interval). After expiration of the sample interval, a > guarded allocation from the KFENCE object pool is returned to the main > allocator (SLAB or SLUB). At this point, the timer is reset, and the > next allocation is set up after the expiration of the interval. > > To enable/disable a KFENCE allocation through the main allocator's > fast-path without overhead, KFENCE relies on static branches via the > static keys infrastructure. The static branch is toggled to redirect the > allocation to KFENCE. Toggling a static branch is AFAIK quite disruptive (PeterZ will probably tell you better), and with the default 100ms sample interval, I'd think it's not good to toggle it so often? Did you measure what performance would you get, if the static key was only for long-term toggling the whole feature on and off (boot time or even runtime), but the decisions "am I in a sample interval right now?" would be normal tests behind this static key? Thanks. > We have verified by running synthetic benchmarks (sysbench I/O, > hackbench) that a kernel with KFENCE is performance-neutral compared to > a non-KFENCE baseline kernel. > > KFENCE is inspired by GWP-ASan [1], a userspace tool with similar > properties. The name "KFENCE" is a homage to the Electric Fence Malloc > Debugger [2]. 
>
> For more details, see Documentation/dev-tools/kfence.rst added in the
> series -- also viewable here:
>
>       https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst
>
> [1] http://llvm.org/docs/GwpAsan.html
> [2] https://linux.die.net/man/3/efence
>
> Alexander Potapenko (6):
>   mm: add Kernel Electric-Fence infrastructure
>   x86, kfence: enable KFENCE for x86
>   mm, kfence: insert KFENCE hooks for SLAB
>   mm, kfence: insert KFENCE hooks for SLUB
>   kfence, kasan: make KFENCE compatible with KASAN
>   kfence, kmemleak: make KFENCE compatible with KMEMLEAK
>
> Marco Elver (4):
>   arm64, kfence: enable KFENCE for ARM64
>   kfence, lockdep: make KFENCE compatible with lockdep
>   kfence, Documentation: add KFENCE documentation
>   kfence: add test suite
>
>  Documentation/dev-tools/index.rst  |   1 +
>  Documentation/dev-tools/kfence.rst | 285 +++++++++++
>  MAINTAINERS                        |  11 +
>  arch/arm64/Kconfig                 |   1 +
>  arch/arm64/include/asm/kfence.h    |  39 ++
>  arch/arm64/mm/fault.c              |   4 +
>  arch/x86/Kconfig                   |   2 +
>  arch/x86/include/asm/kfence.h      |  60 +++
>  arch/x86/mm/fault.c                |   4 +
>  include/linux/kfence.h             | 174 +++++++
>  init/main.c                        |   2 +
>  kernel/locking/lockdep.c           |   8 +
>  lib/Kconfig.debug                  |   1 +
>  lib/Kconfig.kfence                 |  70 +++
>  mm/Makefile                        |   1 +
>  mm/kasan/common.c                  |   7 +
>  mm/kfence/Makefile                 |   6 +
>  mm/kfence/core.c                   | 730 +++++++++++++++++++++++++++
>  mm/kfence/kfence-test.c            | 777 +++++++++++++++++++++++++++++
>  mm/kfence/kfence.h                 | 104 ++++
>  mm/kfence/report.c                 | 201 ++++++++
>  mm/kmemleak.c                      |  11 +
>  mm/slab.c                          |  46 +-
>  mm/slab_common.c                   |   6 +-
>  mm/slub.c                          |  72 ++-
>  25 files changed, 2591 insertions(+), 32 deletions(-)
>  create mode 100644 Documentation/dev-tools/kfence.rst
>  create mode 100644 arch/arm64/include/asm/kfence.h
>  create mode 100644 arch/x86/include/asm/kfence.h
>  create mode 100644 include/linux/kfence.h
>  create mode 100644 lib/Kconfig.kfence
>  create mode 100644 mm/kfence/Makefile
>  create mode 100644 mm/kfence/core.c
>  create mode 100644 mm/kfence/kfence-test.c
>  create mode 100644 mm/kfence/kfence.h
>  create mode 100644 mm/kfence/report.c
>
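For illustration, the pool layout described in the cover letter might be sketched roughly as follows. This is a sketch only, assuming a statically allocated pool; the identifiers (__kfence_pool, kfence_object_addr, KFENCE_NR_OBJECTS) are placeholders, not the code from the series:

/* Illustrative sketch of the layout, not the actual mm/kfence/core.c code. */
#define KFENCE_NR_OBJECTS	255	/* default CONFIG_KFENCE_NUM_OBJECTS */

/*
 * Pages alternate guard/object:
 *   [guard] [obj 0] [guard] [obj 1] [guard] ... [obj N-1] [guard]
 * so N objects need (N + 1) * 2 pages. Guard pages have their attributes
 * changed to a protected state, so any access to them faults.
 */
static char __kfence_pool[(KFENCE_NR_OBJECTS + 1) * 2 * PAGE_SIZE]
	__aligned(PAGE_SIZE);

/* Cheap pointer-range check against the statically allocated pool. */
static bool is_kfence_address(const void *addr)
{
	return (unsigned long)addr - (unsigned long)__kfence_pool <
	       sizeof(__kfence_pool);
}

/*
 * Place object i flush against the left or right edge of its (odd-numbered)
 * page, so that one of the two neighbouring guard pages immediately traps an
 * out-of-bounds access in that direction.
 */
static void *kfence_object_addr(int i, size_t size, bool align_right)
{
	char *page = __kfence_pool + (2 * i + 1) * PAGE_SIZE;

	return align_right ? page + PAGE_SIZE - size : page;
}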
> Toggling a static branch is AFAIK quite disruptive (PeterZ will probably tell > you better), and with the default 100ms sample interval, I'd think it's not good > to toggle it so often? Did you measure what performance would you get, if the > static key was only for long-term toggling the whole feature on and off (boot > time or even runtime), but the decisions "am I in a sample interval right now?" > would be normal tests behind this static key? Thanks. 100ms is the default that we use for testing, but for production it should be fine to pick a longer interval (e.g. 1 second or more). We haven't noticed any performance impact with neither 100ms nor bigger values. Regarding using normal branches, they are quite expensive. E.g. at some point we used to have a branch in slab_free() to check whether the freed object belonged to KFENCE pool. When the pool address was taken from memory, this resulted in some non-zero performance penalty. As for enabling the whole feature at runtime, our intention is to let the users have it enabled by default, otherwise someone will need to tell every machine in the fleet when the feature is to be enabled.
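For readers following the static-branch debate, a minimal sketch of how such a sampling gate can be wired up is shown below. The identifiers (kfence_allocation_key, toggle_allocation_gate, __kfence_alloc) are assumptions for illustration, not the series' actual implementation; in this sketch the key is toggled only twice per interval (once by the timer, once by the sampled allocation), which is consistent with the benchmarking observations discussed in the thread:

/* Sketch only; assumed identifiers, not the patch series itself. */
DEFINE_STATIC_KEY_FALSE(kfence_allocation_key);

static unsigned long kfence_sample_interval = 100;	/* ms, kfence.sample_interval */

void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags);

/*
 * Inlined into the SLAB/SLUB allocation fast path: while the static key is
 * off this compiles to a patched-out jump (effectively a NOP), so the fast
 * path pays essentially nothing.
 */
static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size,
					  gfp_t flags)
{
	if (static_branch_unlikely(&kfence_allocation_key))
		return __kfence_alloc(s, size, flags);
	return NULL;
}

static void toggle_allocation_gate(struct work_struct *work);
static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);

static void toggle_allocation_gate(struct work_struct *work)
{
	/*
	 * Arm the gate: the next fast-path allocation is redirected to KFENCE.
	 * In this sketch, __kfence_alloc() disables the key again once it has
	 * taken its sample, so the branch stays off for the rest of the
	 * interval, then the timer re-arms it.
	 */
	static_branch_enable(&kfence_allocation_key);
	schedule_delayed_work(&kfence_timer,
			      msecs_to_jiffies(kfence_sample_interval));
}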
On 9/8/20 2:16 PM, Alexander Potapenko wrote: >> Toggling a static branch is AFAIK quite disruptive (PeterZ will probably tell >> you better), and with the default 100ms sample interval, I'd think it's not good >> to toggle it so often? Did you measure what performance would you get, if the >> static key was only for long-term toggling the whole feature on and off (boot >> time or even runtime), but the decisions "am I in a sample interval right now?" >> would be normal tests behind this static key? Thanks. > > 100ms is the default that we use for testing, but for production it > should be fine to pick a longer interval (e.g. 1 second or more). > We haven't noticed any performance impact with neither 100ms nor bigger values. Hmm, I see. > Regarding using normal branches, they are quite expensive. > E.g. at some point we used to have a branch in slab_free() to check > whether the freed object belonged to KFENCE pool. > When the pool address was taken from memory, this resulted in some > non-zero performance penalty. Well yeah, if the checks involve extra cache misses, that adds up. But AFAICS you can't avoid that kind of checks with static key anyway (am I looking right at is_kfence_address()?) because some kfence-allocated objects will exist even after the sampling period ended, right? So AFAICS kfence_alloc() is the only user of the static key and I wonder if it really makes such difference there. > As for enabling the whole feature at runtime, our intention is to let > the users have it enabled by default, otherwise someone will need to > tell every machine in the fleet when the feature is to be enabled. Sure, but I guess there are tools that make it no difference in effort between 1 machine and fleet. I'll try to explain my general purpose distro-kernel POV. What I like e.g. about debug_pagealloc and page_owner (and contributed to that state of these features) is that a distro kernel can be shipped with them compiled in, but they are static-key disabled thus have no overhead, until a user enables them on boot, without a need to replace the kernel with a debug one first. Users can enable them for their own debugging, or when asked by somebody from the distro assisting with the debugging. I think KFENCE has similar potential and could work the same way - compiled in always, but a static key would eliminate everything, even the is_kfence_address() checks, until it became enabled (but then it would probably be a one-way street for the rest of the kernel's uptime). Some distro users would decide to enable it always, some not, but could be advised to when needed. So the existing static key could be repurposed for this, or if it's really worth having the current one to control just the sampling period, then there would be two? Thanks. >> > We have verified by running synthetic benchmarks (sysbench I/O, >> > hackbench) that a kernel with KFENCE is performance-neutral compared to >> > a non-KFENCE baseline kernel. >> > >> > KFENCE is inspired by GWP-ASan [1], a userspace tool with similar >> > properties. The name "KFENCE" is a homage to the Electric Fence Malloc >> > Debugger [2]. 
On 9/7/20 6:40 AM, Marco Elver wrote: > KFENCE is designed to be enabled in production kernels, and has near > zero performance overhead. Compared to KASAN, KFENCE trades performance > for precision. Could you talk a little bit about where you expect folks to continue to use KASAN? How would a developer or a tester choose which one to use? > KFENCE objects each reside on a dedicated page, at either the left or > right page boundaries. The pages to the left and right of the object > page are "guard pages", whose attributes are changed to a protected > state, and cause page faults on any attempted access to them. Such page > faults are then intercepted by KFENCE, which handles the fault > gracefully by reporting a memory access error. How much memory overhead does this end up having? I know it depends on the object size and so forth. But, could you give some real-world examples of memory consumption? Also, what's the worst case? Say I have a ton of worst-case-sized (32b) slab objects. Will I notice?
On Tue, Sep 08, 2020 at 04:40PM +0200, Vlastimil Babka wrote: > On 9/8/20 2:16 PM, Alexander Potapenko wrote: > >> Toggling a static branch is AFAIK quite disruptive (PeterZ will probably tell > >> you better), and with the default 100ms sample interval, I'd think it's not good > >> to toggle it so often? Did you measure what performance would you get, if the > >> static key was only for long-term toggling the whole feature on and off (boot > >> time or even runtime), but the decisions "am I in a sample interval right now?" > >> would be normal tests behind this static key? Thanks. > > > > 100ms is the default that we use for testing, but for production it > > should be fine to pick a longer interval (e.g. 1 second or more). > > We haven't noticed any performance impact with neither 100ms nor bigger values. > > Hmm, I see. To add to this, we initially also weren't sure what the results would be toggling the static branches at varying intervals. In the end we were pleasantly surprised, and our benchmarking results always proved there is no noticeable slowdown above 100ms (somewhat noticeable in the range of 1-10ms but it's tolerable if you wanted to go there). I think we were initially, just like you might be, deceived about the time scales here. 100ms is a really long time for a computer. > > Regarding using normal branches, they are quite expensive. > > E.g. at some point we used to have a branch in slab_free() to check > > whether the freed object belonged to KFENCE pool. > > When the pool address was taken from memory, this resulted in some > > non-zero performance penalty. > > Well yeah, if the checks involve extra cache misses, that adds up. But AFAICS > you can't avoid that kind of checks with static key anyway (am I looking right > at is_kfence_address()?) because some kfence-allocated objects will exist even > after the sampling period ended, right? > So AFAICS kfence_alloc() is the only user of the static key and I wonder if it > really makes such difference there. The really important bit here is to differentiate between fast-paths and slow-paths! We insert kfence_alloc() into the allocator fast-paths, which is where the majority of cost would be. On the other hand, the major user of is_kfence_address(), kfence_free(), is only inserted into the slow-path. As a result, is_kfence_address() usage has negligible cost (esp. if the statically allocated pool is used) -- we benchmarked this quite extensively. > > As for enabling the whole feature at runtime, our intention is to let > > the users have it enabled by default, otherwise someone will need to > > tell every machine in the fleet when the feature is to be enabled. > > Sure, but I guess there are tools that make it no difference in effort between 1 > machine and fleet. > > I'll try to explain my general purpose distro-kernel POV. What I like e.g. about > debug_pagealloc and page_owner (and contributed to that state of these features) > is that a distro kernel can be shipped with them compiled in, but they are > static-key disabled thus have no overhead, until a user enables them on boot, > without a need to replace the kernel with a debug one first. Users can enable > them for their own debugging, or when asked by somebody from the distro > assisting with the debugging. > > I think KFENCE has similar potential and could work the same way - compiled in > always, but a static key would eliminate everything, even the > is_kfence_address() checks, [ See my answer for the cost of is_kfence_address() above. 
In short, until we add is_kfence_address() to fast-paths, introducing yet another static branch would be premature optimization. ] > until it became enabled (but then it would probably > be a one-way street for the rest of the kernel's uptime). Some distro users > would decide to enable it always, some not, but could be advised to when needed. > So the existing static key could be repurposed for this, or if it's really worth > having the current one to control just the sampling period, then there would be two? You can already do this. Just set CONFIG_KFENCE_SAMPLE_INTERVAL=0. When you decide to enable it, set kfence.sample_interval=<somenumber> as a boot parameter. I'll add something to that effect into Documentation/dev-tools/kfence.rst. Thanks, -- Marco
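To illustrate the fast-path/slow-path split Marco describes, here is a hedged sketch of where the free-side hook might sit, reusing the is_kfence_address() range check sketched earlier; slab_free_slowpath_example and __kfence_free are placeholder names, not the actual SLAB/SLUB hunks from the series:

/* Placeholder names; a sketch of slow-path hook placement, not the series' diff. */
void __kfence_free(void *addr);

static void slab_free_slowpath_example(struct kmem_cache *s, void *object)
{
	/*
	 * is_kfence_address() is just a pointer-range compare against the
	 * (statically allocated) pool, so paying for it here, outside the
	 * allocation fast path, is negligible.
	 */
	if (unlikely(is_kfence_address(object))) {
		__kfence_free(object);	/* return the object to the KFENCE pool */
		return;
	}

	/* ... normal SLAB/SLUB freeing continues here ... */
}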
On Tue, Sep 08, 2020 at 07:52AM -0700, Dave Hansen wrote: > On 9/7/20 6:40 AM, Marco Elver wrote: > > KFENCE is designed to be enabled in production kernels, and has near > > zero performance overhead. Compared to KASAN, KFENCE trades performance > > for precision. > > Could you talk a little bit about where you expect folks to continue to > use KASAN? How would a developer or a tester choose which one to use? We mention some of this in Documentation/dev-tools/kfence.rst: In the kernel, several tools exist to debug memory access errors, and in particular KASAN can detect all bug classes that KFENCE can detect. While KASAN is more precise, relying on compiler instrumentation, this comes at a performance cost. We want to highlight that KASAN and KFENCE are complementary, with different target environments. For instance, KASAN is the better debugging-aid, where a simple reproducer exists: due to the lower chance to detect the error, it would require more effort using KFENCE to debug. Deployments at scale, however, would benefit from using KFENCE to discover bugs due to code paths not exercised by test cases or fuzzers. If you can afford to use KASAN, continue using KASAN. Usually this only applies to test environments. If you have kernels for production use, and cannot enable KASAN for the obvious cost reasons, you could consider KFENCE. I'll try to make this clearer, maybe summarizing what I said here in Documentation as well. > > KFENCE objects each reside on a dedicated page, at either the left or > > right page boundaries. The pages to the left and right of the object > > page are "guard pages", whose attributes are changed to a protected > > state, and cause page faults on any attempted access to them. Such page > > faults are then intercepted by KFENCE, which handles the fault > > gracefully by reporting a memory access error. > > How much memory overhead does this end up having? I know it depends on > the object size and so forth. But, could you give some real-world > examples of memory consumption? Also, what's the worst case? Say I > have a ton of worst-case-sized (32b) slab objects. Will I notice? KFENCE objects are limited (default 255). If we exhaust KFENCE's memory pool, no more KFENCE allocations will occur. Documentation/dev-tools/kfence.rst gives a formula to calculate the KFENCE pool size: The total memory dedicated to the KFENCE memory pool can be computed as:: ( #objects + 1 ) * 2 * PAGE_SIZE Using the default config, and assuming a page size of 4 KiB, results in dedicating 2 MiB to the KFENCE memory pool. Does that clarify this point? Or anything else that could help clarify this? Thanks, -- Marco
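As a quick worked example of the formula quoted above, assuming the default of 255 objects and 4 KiB pages (a standalone illustration, not code from the series):

#include <stdio.h>

int main(void)
{
	const unsigned long objects = 255;	/* default CONFIG_KFENCE_NUM_OBJECTS */
	const unsigned long page_size = 4096;	/* 4 KiB pages assumed */
	unsigned long pool = (objects + 1) * 2 * page_size;

	/* (255 + 1) * 2 * 4096 = 2097152 bytes = 2 MiB */
	printf("KFENCE pool: %lu bytes (%lu MiB)\n", pool, pool >> 20);
	return 0;
}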
On 9/8/20 5:31 PM, Marco Elver wrote: >> >> How much memory overhead does this end up having? I know it depends on >> the object size and so forth. But, could you give some real-world >> examples of memory consumption? Also, what's the worst case? Say I >> have a ton of worst-case-sized (32b) slab objects. Will I notice? > > KFENCE objects are limited (default 255). If we exhaust KFENCE's memory > pool, no more KFENCE allocations will occur. > Documentation/dev-tools/kfence.rst gives a formula to calculate the > KFENCE pool size: > > The total memory dedicated to the KFENCE memory pool can be computed as:: > > ( #objects + 1 ) * 2 * PAGE_SIZE > > Using the default config, and assuming a page size of 4 KiB, results in > dedicating 2 MiB to the KFENCE memory pool. > > Does that clarify this point? Or anything else that could help clarify > this? Hmm did you observe that with this limit, a long-running system would eventually converge to KFENCE memory pool being filled with long-aged objects, so there would be no space to sample new ones? > Thanks, > -- Marco >
On 9/8/20 8:31 AM, Marco Elver wrote: ... > If you can afford to use KASAN, continue using KASAN. Usually this only > applies to test environments. If you have kernels for production use, > and cannot enable KASAN for the obvious cost reasons, you could consider > KFENCE. That's a really nice, succinct way to put it. You might even want to consider putting this in the Kconfig help text. >>> KFENCE objects each reside on a dedicated page, at either the left or >>> right page boundaries. The pages to the left and right of the object >>> page are "guard pages", whose attributes are changed to a protected >>> state, and cause page faults on any attempted access to them. Such page >>> faults are then intercepted by KFENCE, which handles the fault >>> gracefully by reporting a memory access error. >> >> How much memory overhead does this end up having? I know it depends on >> the object size and so forth. But, could you give some real-world >> examples of memory consumption? Also, what's the worst case? Say I >> have a ton of worst-case-sized (32b) slab objects. Will I notice? > > KFENCE objects are limited (default 255). If we exhaust KFENCE's memory > pool, no more KFENCE allocations will occur. > Documentation/dev-tools/kfence.rst gives a formula to calculate the > KFENCE pool size: > > The total memory dedicated to the KFENCE memory pool can be computed as:: > > ( #objects + 1 ) * 2 * PAGE_SIZE > > Using the default config, and assuming a page size of 4 KiB, results in > dedicating 2 MiB to the KFENCE memory pool. > > Does that clarify this point? Or anything else that could help clarify > this? That clears it up, thanks! I would suggest adding a tiny nugget about this in the cover letter, just saying that the worst-case memory consumption on x86 is ~2M.
On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote: > On 9/8/20 5:31 PM, Marco Elver wrote: > >> > >> How much memory overhead does this end up having? I know it depends on > >> the object size and so forth. But, could you give some real-world > >> examples of memory consumption? Also, what's the worst case? Say I > >> have a ton of worst-case-sized (32b) slab objects. Will I notice? > > > > KFENCE objects are limited (default 255). If we exhaust KFENCE's memory > > pool, no more KFENCE allocations will occur. > > Documentation/dev-tools/kfence.rst gives a formula to calculate the > > KFENCE pool size: > > > > The total memory dedicated to the KFENCE memory pool can be computed as:: > > > > ( #objects + 1 ) * 2 * PAGE_SIZE > > > > Using the default config, and assuming a page size of 4 KiB, results in > > dedicating 2 MiB to the KFENCE memory pool. > > > > Does that clarify this point? Or anything else that could help clarify > > this? > > Hmm did you observe that with this limit, a long-running system would eventually > converge to KFENCE memory pool being filled with long-aged objects, so there > would be no space to sample new ones? Sure, that's a possibility. But remember that we're not trying to deterministically detect bugs on 1 system (if you wanted that, you should use KASAN), but a fleet of machines! The non-determinism of which allocations will end up in KFENCE, will ensure we won't end up with a fleet of machines of identical allocations. That's exactly what we're after. Even if we eventually exhaust the pool, you'll still detect bugs if there are any. If you are overly worried, either the sample interval or number of available objects needs to be tweaked to be larger. The default of 255 is quite conservative, and even using something larger on a modern system is hardly noticeable. Choosing a sample interval & number of objects should also factor in how many machines you plan to deploy this on. Monitoring /sys/kernel/debug/kfence/stats can help you here. Thanks, -- Marco
On Tue, Sep 8, 2020 at 5:56 PM Marco Elver <elver@google.com> wrote: > > On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote: > > On 9/8/20 5:31 PM, Marco Elver wrote: > > >> > > >> How much memory overhead does this end up having? I know it depends on > > >> the object size and so forth. But, could you give some real-world > > >> examples of memory consumption? Also, what's the worst case? Say I > > >> have a ton of worst-case-sized (32b) slab objects. Will I notice? > > > > > > KFENCE objects are limited (default 255). If we exhaust KFENCE's memory > > > pool, no more KFENCE allocations will occur. > > > Documentation/dev-tools/kfence.rst gives a formula to calculate the > > > KFENCE pool size: > > > > > > The total memory dedicated to the KFENCE memory pool can be computed as:: > > > > > > ( #objects + 1 ) * 2 * PAGE_SIZE > > > > > > Using the default config, and assuming a page size of 4 KiB, results in > > > dedicating 2 MiB to the KFENCE memory pool. > > > > > > Does that clarify this point? Or anything else that could help clarify > > > this? > > > > Hmm did you observe that with this limit, a long-running system would eventually > > converge to KFENCE memory pool being filled with long-aged objects, so there > > would be no space to sample new ones? > > Sure, that's a possibility. But remember that we're not trying to > deterministically detect bugs on 1 system (if you wanted that, you > should use KASAN), but a fleet of machines! The non-determinism of which > allocations will end up in KFENCE, will ensure we won't end up with a > fleet of machines of identical allocations. That's exactly what we're > after. Even if we eventually exhaust the pool, you'll still detect bugs > if there are any. > > If you are overly worried, either the sample interval or number of > available objects needs to be tweaked to be larger. The default of 255 > is quite conservative, and even using something larger on a modern > system is hardly noticeable. Choosing a sample interval & number of > objects should also factor in how many machines you plan to deploy this > on. Monitoring /sys/kernel/debug/kfence/stats can help you here. Hi Marco, I reviewed patches and they look good to me (minus some local comments that I've left). The main question/concern I have is what Vlastimil mentioned re long-aged objects. Is the default sample interval values reasonable for typical workloads? Do we have any guidelines on choosing the sample interval? Should it depend on workload/use pattern? By "reasonable" I mean if the pool will last long enough to still sample something after hours/days? Have you tried any experiments with some workload (both short-lived processes and long-lived processes/namespaces) capturing state of the pool? It can make sense to do to better understand dynamics. I suspect that the rate may need to be orders of magnitude lower. Also I am wondering about the boot process (both kernel and init). It's both inherently almost the same for the whole population of machines and inherently produces persistent objects. Should we lower the rate for the first minute of uptime? Or maybe make it proportional to uptime? I feel it's quite an important aspect. We can have this awesome idea and implementation, but radically lower its utility by using bad sampling value (which will have silent "failure mode" -- no bugs detected). But to make it clear: all of this does not conflict with the merge of the first version. Just having tunable sampling interval is good enough. 
We will get the ultimate understanding only when we start using it widely anyway.
On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov <dvyukov@google.com> wrote: > On Tue, Sep 8, 2020 at 5:56 PM Marco Elver <elver@google.com> wrote: > > On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote: [...] > > > Hmm did you observe that with this limit, a long-running system would eventually > > > converge to KFENCE memory pool being filled with long-aged objects, so there > > > would be no space to sample new ones? > > > > Sure, that's a possibility. But remember that we're not trying to > > deterministically detect bugs on 1 system (if you wanted that, you > > should use KASAN), but a fleet of machines! The non-determinism of which > > allocations will end up in KFENCE, will ensure we won't end up with a > > fleet of machines of identical allocations. That's exactly what we're > > after. Even if we eventually exhaust the pool, you'll still detect bugs > > if there are any. > > > > If you are overly worried, either the sample interval or number of > > available objects needs to be tweaked to be larger. The default of 255 > > is quite conservative, and even using something larger on a modern > > system is hardly noticeable. Choosing a sample interval & number of > > objects should also factor in how many machines you plan to deploy this > > on. Monitoring /sys/kernel/debug/kfence/stats can help you here. > > Hi Marco, > > I reviewed patches and they look good to me (minus some local comments > that I've left). Thank you. > The main question/concern I have is what Vlastimil mentioned re > long-aged objects. > Is the default sample interval values reasonable for typical > workloads? Do we have any guidelines on choosing the sample interval? > Should it depend on workload/use pattern? As I hinted at before, the sample interval & number of objects needs to depend on: - number of machines, - workload, - acceptable overhead (performance, memory). However, workload can vary greatly, and something more dynamic may be needed. We do have the option to monitor /sys/kernel/debug/kfence/stats and even change the sample interval at runtime, e.g. from a user space tool that checks the currently used objects, and as the pool is closer to exhausted, starts increasing /sys/module/kfence/parameters/sample_interval. Of course, if we figure out the best dynamic policy, we can add this policy into the kernel. But I don't think it makes sense to hard-code such a policy right now. > By "reasonable" I mean if the pool will last long enough to still > sample something after hours/days? Have you tried any experiments with > some workload (both short-lived processes and long-lived > processes/namespaces) capturing state of the pool? It can make sense > to do to better understand dynamics. I suspect that the rate may need > to be orders of magnitude lower. Yes, the current default sample interval is a lower bound, and is also a reasonable default for testing. I expect real deployments to use much higher sample intervals (lower rate). So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that allocated KFENCE objects isn't artificially capped): -- With a mostly vanilla config + KFENCE (sample interval 100 ms), after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects (total allocations >600). Those aren't always the same objects, with roughly ~2 allocations/frees per second. -- Then running sysbench I/O benchmark, KFENCE objects allocated peak at 82. During the benchmark, allocations/frees per second are closer to 10-15. 
After the benchmark, the KFENCE objects allocated remain at 82, and allocations/frees per second fall back to ~2. -- For the same system, changing the sample interval to 1 ms (echo 1 > /sys/module/kfence/parameters/sample_interval), and re-running the benchmark gives me: KFENCE objects allocated peak at exactly 500, with ~500 allocations/frees per second. After that, allocated KFENCE objects dropped a little to 496, and allocations/frees per second fell back to ~2. -- The long-lived objects are due to caches, and just running 'echo 1 > /proc/sys/vm/drop_caches' reduced allocated KFENCE objects back to 45. > Also I am wondering about the boot process (both kernel and init). > It's both inherently almost the same for the whole population of > machines and inherently produces persistent objects. Should we lower > the rate for the first minute of uptime? Or maybe make it proportional > to uptime? It should depend on current usage, which is dependent on the workload. I don't think uptime helps much, as seen above. If we imagine a user space tool that tweaks this for us, we can initialize KFENCE with a very large sample interval, and once booted, this user space tool/script adjusts /sys/module/kfence/parameters/sample_interval. At the very least, I think I'll just make /sys/module/kfence/parameters/sample_interval root-writable unconditionally, so that we can experiment with such a tool. Lowering the rate for the first minute of uptime might also be an option, although if we do that, we can also just move kfence_init() to the end of start_kernel(). IMHO, I think it still makes sense to sample normally during boot, because who knows how those allocations are used with different workloads once the kernel is live. With a sample interval of 1000 ms (which is closer to what we probably want in production), I see no more than 20 KFENCE objects allocated after boot. I think we can live with that. > I feel it's quite an important aspect. We can have this awesome idea > and implementation, but radically lower its utility by using bad > sampling value (which will have silent "failure mode" -- no bugs > detected). As a first step, I think monitoring the entire fleet here is key here (collect /sys/kernel/debug/kfence/stats). Essentially, as long as allocations/frees per second remains >0, we're probably fine, even if we always run at max. KFENCE objects allocated. An improvement over allocations/frees per second >0 would be dynamically tweaking sample_interval based on how close we get to max KFENCE objects allocated. Yet another option is to skip KFENCE allocations based on the memcache name, e.g. for those caches dedicated to long-lived allocations. > But to make it clear: all of this does not conflict with the merge of > the first version. Just having tunable sampling interval is good > enough. We will get the ultimate understanding only when we start > using it widely anyway. Thanks, -- Marco
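Since Marco floats the idea of a user-space tool that watches /sys/kernel/debug/kfence/stats and adjusts /sys/module/kfence/parameters/sample_interval, a hedged sketch of such a tool follows. The stats file format is not specified in this thread, so the parse_used_objects() helper and the back-off policy are assumptions for illustration only:

#include <stdio.h>
#include <unistd.h>

#define STATS_PATH    "/sys/kernel/debug/kfence/stats"
#define INTERVAL_PATH "/sys/module/kfence/parameters/sample_interval"

static long parse_used_objects(void)
{
	/* Placeholder: derive "currently used objects" from the stats file. */
	FILE *f = fopen(STATS_PATH, "r");
	long used = 0;

	/* ... parse f into 'used' (format assumed, not documented here) ... */
	if (f)
		fclose(f);
	return used;
}

static void set_sample_interval(long ms)
{
	FILE *f = fopen(INTERVAL_PATH, "w");

	if (f) {
		fprintf(f, "%ld\n", ms);
		fclose(f);
	}
}

int main(void)
{
	const long pool_objects = 255;	/* CONFIG_KFENCE_NUM_OBJECTS */
	long interval_ms = 500;

	for (;;) {
		/* Back off sampling as the pool approaches exhaustion. */
		if (parse_used_objects() > (9 * pool_objects) / 10)
			interval_ms *= 2;
		set_sample_interval(interval_ms);
		sleep(60);
	}
	return 0;
}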
On Fri, Sep 11, 2020 at 2:03 PM Marco Elver <elver@google.com> wrote: > > On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov <dvyukov@google.com> wrote: > > On Tue, Sep 8, 2020 at 5:56 PM Marco Elver <elver@google.com> wrote: > > > On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote: > [...] > > > > Hmm did you observe that with this limit, a long-running system would eventually > > > > converge to KFENCE memory pool being filled with long-aged objects, so there > > > > would be no space to sample new ones? > > > > > > Sure, that's a possibility. But remember that we're not trying to > > > deterministically detect bugs on 1 system (if you wanted that, you > > > should use KASAN), but a fleet of machines! The non-determinism of which > > > allocations will end up in KFENCE, will ensure we won't end up with a > > > fleet of machines of identical allocations. That's exactly what we're > > > after. Even if we eventually exhaust the pool, you'll still detect bugs > > > if there are any. > > > > > > If you are overly worried, either the sample interval or number of > > > available objects needs to be tweaked to be larger. The default of 255 > > > is quite conservative, and even using something larger on a modern > > > system is hardly noticeable. Choosing a sample interval & number of > > > objects should also factor in how many machines you plan to deploy this > > > on. Monitoring /sys/kernel/debug/kfence/stats can help you here. > > > > Hi Marco, > > > > I reviewed patches and they look good to me (minus some local comments > > that I've left). > > Thank you. > > > The main question/concern I have is what Vlastimil mentioned re > > long-aged objects. > > Is the default sample interval values reasonable for typical > > workloads? Do we have any guidelines on choosing the sample interval? > > Should it depend on workload/use pattern? > > As I hinted at before, the sample interval & number of objects needs > to depend on: > - number of machines, > - workload, > - acceptable overhead (performance, memory). > > However, workload can vary greatly, and something more dynamic may be > needed. We do have the option to monitor > /sys/kernel/debug/kfence/stats and even change the sample interval at > runtime, e.g. from a user space tool that checks the currently used > objects, and as the pool is closer to exhausted, starts increasing > /sys/module/kfence/parameters/sample_interval. > > Of course, if we figure out the best dynamic policy, we can add this > policy into the kernel. But I don't think it makes sense to hard-code > such a policy right now. > > > By "reasonable" I mean if the pool will last long enough to still > > sample something after hours/days? Have you tried any experiments with > > some workload (both short-lived processes and long-lived > > processes/namespaces) capturing state of the pool? It can make sense > > to do to better understand dynamics. I suspect that the rate may need > > to be orders of magnitude lower. > > Yes, the current default sample interval is a lower bound, and is also > a reasonable default for testing. I expect real deployments to use > much higher sample intervals (lower rate). > > So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that > allocated KFENCE objects isn't artificially capped): > > -- With a mostly vanilla config + KFENCE (sample interval 100 ms), > after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects > (total allocations >600). Those aren't always the same objects, with > roughly ~2 allocations/frees per second. 
> > -- Then running sysbench I/O benchmark, KFENCE objects allocated peak > at 82. During the benchmark, allocations/frees per second are closer > to 10-15. After the benchmark, the KFENCE objects allocated remain at > 82, and allocations/frees per second fall back to ~2. > > -- For the same system, changing the sample interval to 1 ms (echo 1 > > /sys/module/kfence/parameters/sample_interval), and re-running the > benchmark gives me: KFENCE objects allocated peak at exactly 500, with > ~500 allocations/frees per second. After that, allocated KFENCE > objects dropped a little to 496, and allocations/frees per second fell > back to ~2. > > -- The long-lived objects are due to caches, and just running 'echo 1 > > /proc/sys/vm/drop_caches' reduced allocated KFENCE objects back to > 45. Interesting. What type of caches is this? If there is some type of cache that caches particularly lots of sampled objects, we could potentially change the cache to release sampled objects eagerly. > > Also I am wondering about the boot process (both kernel and init). > > It's both inherently almost the same for the whole population of > > machines and inherently produces persistent objects. Should we lower > > the rate for the first minute of uptime? Or maybe make it proportional > > to uptime? > > It should depend on current usage, which is dependent on the workload. > I don't think uptime helps much, as seen above. If we imagine a user > space tool that tweaks this for us, we can initialize KFENCE with a > very large sample interval, and once booted, this user space > tool/script adjusts /sys/module/kfence/parameters/sample_interval. > > At the very least, I think I'll just make > /sys/module/kfence/parameters/sample_interval root-writable > unconditionally, so that we can experiment with such a tool. > > Lowering the rate for the first minute of uptime might also be an > option, although if we do that, we can also just move kfence_init() to > the end of start_kernel(). IMHO, I think it still makes sense to > sample normally during boot, because who knows how those allocations > are used with different workloads once the kernel is live. With a > sample interval of 1000 ms (which is closer to what we probably want > in production), I see no more than 20 KFENCE objects allocated after > boot. I think we can live with that. > > > I feel it's quite an important aspect. We can have this awesome idea > > and implementation, but radically lower its utility by using bad > > sampling value (which will have silent "failure mode" -- no bugs > > detected). > > As a first step, I think monitoring the entire fleet here is key here > (collect /sys/kernel/debug/kfence/stats). Essentially, as long as > allocations/frees per second remains >0, we're probably fine, even if > we always run at max. KFENCE objects allocated. > > An improvement over allocations/frees per second >0 would be > dynamically tweaking sample_interval based on how close we get to max > KFENCE objects allocated. > > Yet another option is to skip KFENCE allocations based on the memcache > name, e.g. for those caches dedicated to long-lived allocations. > > > But to make it clear: all of this does not conflict with the merge of > > the first version. Just having tunable sampling interval is good > > enough. We will get the ultimate understanding only when we start > > using it widely anyway. > > Thanks, > -- Marco
On Fri, 11 Sep 2020 at 15:10, Dmitry Vyukov <dvyukov@google.com> wrote: > On Fri, Sep 11, 2020 at 2:03 PM Marco Elver <elver@google.com> wrote: > > On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov <dvyukov@google.com> wrote: [...] > > > By "reasonable" I mean if the pool will last long enough to still > > > sample something after hours/days? Have you tried any experiments with > > > some workload (both short-lived processes and long-lived > > > processes/namespaces) capturing state of the pool? It can make sense > > > to do to better understand dynamics. I suspect that the rate may need > > > to be orders of magnitude lower. > > > > Yes, the current default sample interval is a lower bound, and is also > > a reasonable default for testing. I expect real deployments to use > > much higher sample intervals (lower rate). > > > > So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that > > allocated KFENCE objects isn't artificially capped): > > > > -- With a mostly vanilla config + KFENCE (sample interval 100 ms), > > after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects > > (total allocations >600). Those aren't always the same objects, with > > roughly ~2 allocations/frees per second. > > > > -- Then running sysbench I/O benchmark, KFENCE objects allocated peak > > at 82. During the benchmark, allocations/frees per second are closer > > to 10-15. After the benchmark, the KFENCE objects allocated remain at > > 82, and allocations/frees per second fall back to ~2. > > > > -- For the same system, changing the sample interval to 1 ms (echo 1 > > > /sys/module/kfence/parameters/sample_interval), and re-running the > > benchmark gives me: KFENCE objects allocated peak at exactly 500, with > > ~500 allocations/frees per second. After that, allocated KFENCE > > objects dropped a little to 496, and allocations/frees per second fell > > back to ~2. > > > > -- The long-lived objects are due to caches, and just running 'echo 1 > > > /proc/sys/vm/drop_caches' reduced allocated KFENCE objects back to > > 45. > > Interesting. What type of caches is this? If there is some type of > cache that caches particularly lots of sampled objects, we could > potentially change the cache to release sampled objects eagerly. The 2 major users of KFENCE objects for that workload are 'buffer_head' and 'bio-0'. If we want to deal with those, I guess there are 2 options: 1. More complex, but more precise: make the users of them check is_kfence_address() and release their buffers earlier. 2. Simpler, generic solution: make KFENCE stop return allocations for non-kmalloc_caches memcaches after more than ~90% of the pool is exhausted. This assumes that creators of long-lived objects usually set up their own memcaches. I'm currently inclined to go for (2). Thanks, -- Marco
On Fri, 11 Sep 2020 at 15:33, Marco Elver <elver@google.com> wrote: > On Fri, 11 Sep 2020 at 15:10, Dmitry Vyukov <dvyukov@google.com> wrote: > > On Fri, Sep 11, 2020 at 2:03 PM Marco Elver <elver@google.com> wrote: > > > On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov <dvyukov@google.com> wrote: > [...] > > > > By "reasonable" I mean if the pool will last long enough to still > > > > sample something after hours/days? Have you tried any experiments with > > > > some workload (both short-lived processes and long-lived > > > > processes/namespaces) capturing state of the pool? It can make sense > > > > to do to better understand dynamics. I suspect that the rate may need > > > > to be orders of magnitude lower. > > > > > > Yes, the current default sample interval is a lower bound, and is also > > > a reasonable default for testing. I expect real deployments to use > > > much higher sample intervals (lower rate). > > > > > > So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that > > > allocated KFENCE objects isn't artificially capped): > > > > > > -- With a mostly vanilla config + KFENCE (sample interval 100 ms), > > > after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects > > > (total allocations >600). Those aren't always the same objects, with > > > roughly ~2 allocations/frees per second. > > > > > > -- Then running sysbench I/O benchmark, KFENCE objects allocated peak > > > at 82. During the benchmark, allocations/frees per second are closer > > > to 10-15. After the benchmark, the KFENCE objects allocated remain at > > > 82, and allocations/frees per second fall back to ~2. > > > > > > -- For the same system, changing the sample interval to 1 ms (echo 1 > > > > /sys/module/kfence/parameters/sample_interval), and re-running the > > > benchmark gives me: KFENCE objects allocated peak at exactly 500, with > > > ~500 allocations/frees per second. After that, allocated KFENCE > > > objects dropped a little to 496, and allocations/frees per second fell > > > back to ~2. > > > > > > -- The long-lived objects are due to caches, and just running 'echo 1 > > > > /proc/sys/vm/drop_caches' reduced allocated KFENCE objects back to > > > 45. > > > > Interesting. What type of caches is this? If there is some type of > > cache that caches particularly lots of sampled objects, we could > > potentially change the cache to release sampled objects eagerly. > > The 2 major users of KFENCE objects for that workload are > 'buffer_head' and 'bio-0'. > > If we want to deal with those, I guess there are 2 options: > > 1. More complex, but more precise: make the users of them check > is_kfence_address() and release their buffers earlier. > > 2. Simpler, generic solution: make KFENCE stop return allocations for > non-kmalloc_caches memcaches after more than ~90% of the pool is > exhausted. This assumes that creators of long-lived objects usually > set up their own memcaches. > > I'm currently inclined to go for (2). Ok, after some offline chat, we determined that (2) would be premature and we can't really say if kmalloc should have precedence if we reach some usage threshold. So for now, let's just leave as-is and start with the recommendation to monitor and adjust based on usage, fleet size, etc. Thanks, -- Marco