
[00/15] kasan: x86: arm64: risc-v: KASAN tag-based mode for x86

Message ID cover.1738686764.git.maciej.wieczor-retman@intel.com (mailing list archive)

Message

Maciej Wieczor-Retman Feb. 4, 2025, 5:33 p.m. UTC
======= Introduction
The patchset aims to add a KASAN tag-based mode for the x86 architecture
with the help of the new CPU feature called Linear Address Masking
(LAM). The main improvement introduced by the series is 4x lower memory
usage compared to KASAN's generic mode, currently the only mode
available on x86.

There are two logical parts to this series. The first one attempts to
add a new memory saving mechanism called "dense mode" to the generic
part of the tag-based KASAN code. The second one focuses on implementing
and enabling the tag-based mode for the x86 architecture by using LAM.

======= How does the KASAN tag-based mode work?
When enabled, memory accesses and allocations are augmented by the
compiler during kernel compilation. Instrumentation functions are added
to each memory allocation and each pointer dereference.

The allocation related functions generate a random tag and save it in
two places: in the shadow memory that maps to the allocated memory, and
in the top bits of the pointer that points to the allocated memory.
Storing the tag in the top bits of the pointer is possible because of
Top-Byte Ignore (TBI) on the arm64 architecture and LAM on x86.

The access related functions compare the tag stored in the pointer with
the one stored in shadow memory. If the tags don't match, an invalid
access (for example an out-of-bounds error) must have occurred, and an
error report is generated.

The general idea for the tag-based mode is very well explained in the
series with the original implementation [1].

[1] https://lore.kernel.org/all/cover.1544099024.git.andreyknvl@google.com/

======= What is the new "dense mode"?
To save memory further, the dense mode is introduced. Normally one
shadow byte stores one tag, and that tag covers one granule of allocated
memory, which is 16 bytes. In the dense mode, one tag still covers 16
bytes of allocated memory but is shortened from 8 bits to 4 bits, which
makes it possible to store two tags in one shadow memory byte.

=== Example:
The example below shows what the shadow memory looks like after
allocating 48 bytes of memory in both the normal tag-based mode and the
dense mode. The contents of shadow memory are overlaid onto the address
offsets they relate to in the allocated kernel memory. Each cell
|        | symbolizes one byte of shadow memory.

= The regular tag-based mode:
- Randomly generated 8-bit tag equals 0xAB.
- 0xFE is the tag that symbolizes unallocated memory.

Shadow memory contents:           |  0xAB  |  0xAB  |  0xAB  |  0xFE  |
Shadow memory address offsets:    0        1        2        3        4
Allocated memory address offsets: 0        16       32       48       64

= The dense tag-based mode:
- Randomly generated 4-bit tag equals 0xC.
- 0xE is the tag that symbolizes unallocated memory.

Shadow memory contents:           |0xC 0xC |0xC 0xE |0xE 0xE |0xE 0xE |
Shadow memory address offsets:    0        1        2        3        4
Allocated memory address offsets: 0        32       64       96       128

=== Dense mode benefits summary
For the small price of a couple of bit shifts, the dense mode uses only
half the memory compared to the current arm64 tag-based mode, while
still preserving the 16-byte tag granularity, which allows catching
out-of-bounds accesses at smaller offsets.

======= Differences summary compared to the arm64 tag-based mode
- Tag width:
	- Tag width influences the chance of a tag mismatch due to two
	  tags from different allocations having the same value. The
	  bigger the possible range of tag values the lower the chance
	  of that happening.
	- Shortening the tag width from 8 bits to 4 helps with memory
	  usage but also increases the chance of not reporting an
	  error: with 4-bit tags, two unrelated allocations have a ~7%
	  (roughly 1/16) chance of receiving the same tag.

- TBI and LAM
	- TBI in arm64 allows for storing metadata in the top 8 bits of
	  the virtual address.
	- LAM in x86 allows storing tags in bits [62:57] of the pointer.
	  To maximize memory savings the tag width is reduced to bits
	  [60:57].

======= Testing
Checked all the KUnit tests for both software tags and generic KASAN
after making the changes.

In generic mode the results were:

kasan: pass:59 fail:0 skip:13 total:72
Totals: pass:59 fail:0 skip:13 total:72
ok 1 kasan

and for software tags:

kasan: pass:63 fail:0 skip:9 total:72
Totals: pass:63 fail:0 skip:9 total:72
ok 1 kasan

======= Benchmarks
All tests were run on a Sierra Forest server platform with 512 GB of
memory. The only differences between the tests were kernel options:
	- CONFIG_KASAN
	- CONFIG_KASAN_GENERIC
	- CONFIG_KASAN_SW_TAGS
	- CONFIG_KASAN_INLINE [1]
	- CONFIG_KASAN_OUTLINE [1]

Used memory in GB after boot [2][3]:
* 14 for clean kernel
* 91 / 90 for generic KASAN (inline/outline)
* 31 for tag-based KASAN

Boot time (until login prompt):
* 03:48 for clean kernel
* 08:02 / 09:45 for generic KASAN (inline/outline)
* 08:50 for dense tag-based KASAN
* 04:50 for dense tag-based KASAN with stacktrace disabled [4]

Compilation time comparison (10 cores):
* 7:27 for clean kernel
* 8:21/7:44 for generic KASAN (inline/outline)
* 7:41 for tag-based KASAN

Network performance [5]:
* 13.7 Gbits/sec for clean kernel
* 2.25 Gbits/sec for generic KASAN inline
* 1.50 Gbits/sec for generic KASAN outline
* 1.55 Gbits/sec for dense tag-based KASAN
* 2.86 Gbits/sec for dense tag-based KASAN with stacktrace disabled

[1] Based on the hwasan and asan compiler parameters used in
scripts/Makefile.kasan, the inline/outline choice has a bigger impact on
the generic mode than on the tag-based mode. In the generic mode,
inlining increases the kernel image size but improves performance. In
the tag-based mode, choosing outline only un-inlines some code portions
for debugging purposes, and no real difference is visible in performance
or kernel image size.

[2] Used "grep MemAvailable /proc/meminfo" and subtracted that value
from the total memory of the system. Initially I wanted to use "grep
Slab", as in the cover letter for the arm64 tag-based series, but
because the tests were run on a system with 512 GB of RAM and memory
usage was split up more between different categories, this better shows
the memory savings.

[3] If the 14 GB from the clean build is subtracted from the KASAN
measurements, the tag-based mode uses about 4x less additional memory
than the generic mode.

[4] Memory allocation and freeing performance suffers heavily from saving
stacktraces that can be later displayed in error reports.

[5] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.

======= Compilation
Clang was used to compile the series (make LLVM=1), since gcc doesn't
seem to support KASAN tag-based compiler instrumentation on x86.

======= Dependencies
The series is based on the risc-v series [1] that's currently in review.
Because of this, for the time being it only applies cleanly on top of
the 6.12 mainline kernel. I will rebase onto the newest kernel once the
risc-v series is also rebased.

[1] https://lore.kernel.org/all/20241022015913.3524425-1-samuel.holland@sifive.com/

Maciej Wieczor-Retman (15):
  kasan: Allocation enhancement for dense tag-based mode
  kasan: Tag checking with dense tag-based mode
  kasan: Vmalloc dense tag-based mode support
  kasan: arm64: x86: risc-v: Make special tags arch specific
  x86: Add arch specific kasan functions
  x86: Reset tag for virtual to physical address conversions
  mm: Pcpu chunk address tag reset
  x86: Physical address comparisons in fill_p*d/pte
  x86: Physical address comparison in current_mm pgd check
  x86: KASAN raw shadow memory PTE init
  x86: LAM initialization
  x86: Minimal SLAB alignment
  x86: runtime_const used for KASAN_SHADOW_END
  x86: Make software tag-based kasan available
  kasan: Add mitigation and debug modes

 Documentation/arch/x86/x86_64/mm.rst |  6 +-
 MAINTAINERS                          |  2 +-
 arch/arm64/include/asm/kasan-tags.h  |  9 +++
 arch/riscv/include/asm/kasan-tags.h  | 12 ++++
 arch/riscv/include/asm/kasan.h       |  4 --
 arch/x86/Kconfig                     | 11 +++-
 arch/x86/boot/compressed/misc.h      |  2 +
 arch/x86/include/asm/kasan-tags.h    |  9 +++
 arch/x86/include/asm/kasan.h         | 50 +++++++++++++--
 arch/x86/include/asm/page.h          | 17 +++--
 arch/x86/include/asm/page_64.h       |  2 +-
 arch/x86/kernel/head_64.S            |  3 +
 arch/x86/kernel/setup.c              |  2 +
 arch/x86/kernel/vmlinux.lds.S        |  1 +
 arch/x86/mm/init.c                   |  3 +
 arch/x86/mm/init_64.c                |  8 +--
 arch/x86/mm/kasan_init_64.c          | 24 +++++--
 arch/x86/mm/physaddr.c               |  1 +
 arch/x86/mm/tlb.c                    |  2 +-
 include/linux/kasan-tags.h           | 12 +++-
 include/linux/kasan.h                | 94 +++++++++++++++++++++++-----
 include/linux/mm.h                   |  6 +-
 include/linux/page-flags-layout.h    |  7 +--
 lib/Kconfig.kasan                    | 49 +++++++++++++++
 mm/kasan/Makefile                    |  3 +
 mm/kasan/dense.c                     | 83 ++++++++++++++++++++++++
 mm/kasan/kasan.h                     | 27 +-------
 mm/kasan/report.c                    |  6 +-
 mm/kasan/report_sw_tags.c            | 12 ++--
 mm/kasan/shadow.c                    | 47 ++++++++++----
 mm/kasan/sw_tags.c                   |  8 +++
 mm/kasan/tags.c                      |  5 ++
 mm/percpu-vm.c                       |  2 +-
 33 files changed, 432 insertions(+), 97 deletions(-)
 create mode 100644 arch/arm64/include/asm/kasan-tags.h
 create mode 100644 arch/riscv/include/asm/kasan-tags.h
 create mode 100644 arch/x86/include/asm/kasan-tags.h
 create mode 100644 mm/kasan/dense.c

Comments

Christoph Lameter (Ampere) Feb. 4, 2025, 6:58 p.m. UTC | #1
ARM64 supports MTE which is hardware support for tagging 16 byte granules
and verification of tags in pointers all in hardware and on some platforms
with *no* performance penalty since the tag is stored in the ECC areas of
DRAM and verified at the same time as the ECC.

Could we get support for that? This would allow us to enable tag checking
in production systems without performance penalty and no memory overhead.
Dave Hansen Feb. 4, 2025, 9:05 p.m. UTC | #2
On 2/4/25 10:58, Christoph Lameter (Ampere) wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
> 
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

At least on the Intel side, there's no trajectory for doing something
like the MTE architecture for memory tagging. The DRAM "ECC" area is in
very high demand and if anything things are moving away from using ECC
"bits" for anything other than actual ECC. Even the MKTME+integrity
(used for TDX) metadata is probably going to find a new home at some point.

This shouldn't be a surprise to anyone on cc here. If it is, you should
probably be reaching out to Intel over your normal channels.
Jessica Clarke Feb. 4, 2025, 11:36 p.m. UTC | #3
On 4 Feb 2025, at 18:58, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
> 
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

It’s not “no performance penalty”, there is a cost to tracking the MTE
tags for checking. In asynchronous (or asymmetric) mode that’s not too
bad, but in synchronous mode there is a significant overhead even with
ECC. Normally on a store, once you’ve translated it and have the data,
you can buffer it up and defer the actual write until some time later.
If you hit in the L1 cache then that will probably be quite soon, but
if you miss then you have to wait for the data to come back from lower
levels of the hierarchy, potentially all the way out to DRAM. Or if you
have a write-around cache then you just send it out to the next level
when it’s ready. But now, if you have synchronous MTE, you cannot
retire your store instruction until you know what the tag for the
location you’re storing to is; effectively you have to wait until you
can do the full cache lookup, and potentially miss, until it can
retire. This puts pressure on the various microarchitectural structures
that track instructions as they get executed, as instructions are now
in flight for longer. Yes, it may well be that it is quicker for the
memory controller to get the tags from ECC bits than via some other
means, but you’re already paying many many cycles at that point, with
the relevant store being stuck unable to retire (and thus every
instruction after it in the instruction stream) that whole time, and no
write allocate or write around schemes can help you, because you
fundamentally have to wait for the tags to be read before you know if
the instruction is going to trap.

Now, you can choose to not use synchronous mode due to that overhead,
but that’s nuance that isn’t considered by your reply here and has some
consequences.

Jess
Christoph Lameter (Ampere) Feb. 5, 2025, 6:51 p.m. UTC | #5
On Tue, 4 Feb 2025, Jessica Clarke wrote:

> It’s not “no performance penalty”, there is a cost to tracking the MTE
> tags for checking. In asynchronous (or asymmetric) mode that’s not too


On Ampere Processor hardware there is no penalty since the logic is
built into the usual read/write paths. This is by design. There may be a
penalty on other platforms that cannot do this.
Christoph Lameter (Ampere) Feb. 5, 2025, 6:59 p.m. UTC | #6
On Tue, 4 Feb 2025, Dave Hansen wrote:

> > Could we get support for that? This would allow us to enable tag checking
> > in production systems without performance penalty and no memory overhead.
>
> At least on the Intel side, there's no trajectory for doing something
> like the MTE architecture for memory tagging. The DRAM "ECC" area is in
> very high demand and if anything things are moving away from using ECC
> "bits" for anything other than actual ECC. Even the MKTME+integrity
> (used for TDX) metadata is probably going to find a new home at some point.
>
> This shouldn't be a surprise to anyone on cc here. If it is, you should
> probably be reaching out to Intel over your normal channels.

Intel was a competitor for our company and AFAICT has issues all over
the place with performance given its conservative stance on technology.
But we do not test against Intel anymore. Can someone from AMD say
something?

MTE tagging is part of the processor standard for ARM64 and Linux will
need to support the 16 byte tagging feature one way or another even if
Intel does not like it. And AFAICT hardware tagging support is a critical
security feature for the future.
Ard Biesheuvel Feb. 5, 2025, 11:04 p.m. UTC | #7
On Wed, 5 Feb 2025 at 20:31, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
>
> MTE tagging is part of the processor standard for ARM64 and Linux will
> need to support the 16 byte tagging feature one way or another even if
> Intel does not like it. And AFAICT hardware tagging support is a critical
> security feature for the future.
>

Can you explain what you feel is lacking in the existing MTE support
in KAsan (enabled when selecting CONFIG_KASAN_HW_TAGS)?
Andrey Konovalov Feb. 5, 2025, 11:40 p.m. UTC | #8
On Tue, Feb 4, 2025 at 6:34 PM Maciej Wieczor-Retman
<maciej.wieczor-retman@intel.com> wrote:
>
> ======= Introduction
> The patchset aims to add a KASAN tag-based mode for the x86 architecture
> with the help of the new CPU feature called Linear Address Masking
> (LAM). Main improvement introduced by the series is 4x lower memory
> usage compared to KASAN's generic mode, the only currently available
> mode on x86.
>
> There are two logical parts to this series. The first one attempts to
> add a new memory saving mechanism called "dense mode" to the generic
> part of the tag-based KASAN code. The second one focuses on implementing
> and enabling the tag-based mode for the x86 architecture by using LAM.

Hi Maciej,

Awesome work! Great to see SW_TAGS mode supported on x86!

I started reviewing the patches, but this is somewhat complicated, as
the dense mode changes are squashed together with the generic ones for
x86 support. Could you please split this series into 2? Or at least
reorder the patches so that everything needed for basic x86 support
comes first and can be reviewed and tested separately.

I will post the comments for things I noted so far, including for the
dense mode changes, but I'll take a closer look after the split.

Also feel free to drop the dependency on that risc-v series, as it
doesn't get updated very often. But up to you.

And please also update all affected parts of Documentation/dev-tools/kasan.rst.

Thank you!
Jessica Clarke Feb. 6, 2025, 1:05 a.m. UTC | #9
On 5 Feb 2025, at 18:51, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> 
> On Tue, 4 Feb 2025, Jessica Clarke wrote:
> 
>> It’s not “no performance penalty”, there is a cost to tracking the MTE
>> tags for checking. In asynchronous (or asymmetric) mode that’s not too
> 
> 
> On Ampere Processor hardware there is no penalty since the logic is build
> into the usual read/write paths. This is by design. There may be on other
> platforms that cannot do this.

You helpfully cut out all the explanation of where the performance
penalty comes from. But if it’s as you say I can only assume your
design chooses to stall all stores until they have actually written, in
which case you have a performance cost compared with hardware that
omitted MTE or optimises for non-synchronous MTE. The literature on MTE
agrees that it is not no penalty (but can be low penalty). I don’t
really want to have some big debate here about the ins and outs of MTE,
it’s not the place for it, but I will stand up and point out that
claiming MTE to be “no performance penalty” is misrepresentative of the
truth

Jess
Maciej Wieczor-Retman Feb. 6, 2025, 10:40 a.m. UTC | #10
Hello Andrey!

On 2025-02-06 at 00:40:59 +0100, Andrey Konovalov wrote:
>On Tue, Feb 4, 2025 at 6:34 PM Maciej Wieczor-Retman
><maciej.wieczor-retman@intel.com> wrote:
>>
>> ======= Introduction
>> The patchset aims to add a KASAN tag-based mode for the x86 architecture
>> with the help of the new CPU feature called Linear Address Masking
>> (LAM). Main improvement introduced by the series is 4x lower memory
>> usage compared to KASAN's generic mode, the only currently available
>> mode on x86.
>>
>> There are two logical parts to this series. The first one attempts to
>> add a new memory saving mechanism called "dense mode" to the generic
>> part of the tag-based KASAN code. The second one focuses on implementing
>> and enabling the tag-based mode for the x86 architecture by using LAM.
>
>Hi Maciej,
>
>Awesome work! Great to see SW_TAGS mode supported on x86!

Glad to hear that, it was a lot of fun to work on :)

>
>I started reviewing the patches, but this is somewhat complicated, as
>the dense mode changes are squashed together with the generic ones for
>x86 support. Could you please split this series into 2? Or at least
>reorder the patches so that everything needed for basic x86 support
>comes first and can be reviewed and tested separately.

I'll try reordering first and see if it looks nice. Since the dense mode would
make some parts arch specific I think it's better to have the two parts in one
series for easier reference. But if it turns out more convoluted I'll just split
it as you suggested.

>
>I will post the comments for things I noted so far, including for the
>dense mode changes, but I'll take a closer look after the split.
>
>Also feel free to drop the dependency on that risc-v series, as it
>doesn't get updated very often. But up to you.

Okay, I was mostly interested in the patch that redefines KASAN_SHADOW_END as
KASAN_SHADOW_OFFSET and then gets shadow addresses by using a signed offset. But
I suppose I can just take that patch and prepend my series with that? (after
applying your comments from that series)

>
>And please also update all affected parts of Documentation/dev-tools/kasan.rst.

Right, thanks for the reminder :)

>
>Thank you!
Andrey Konovalov Feb. 6, 2025, 6:10 p.m. UTC | #11
On Thu, Feb 6, 2025 at 11:41 AM Maciej Wieczor-Retman
<maciej.wieczor-retman@intel.com> wrote:
>
> >I started reviewing the patches, but this is somewhat complicated, as
> >the dense mode changes are squashed together with the generic ones for
> >x86 support. Could you please split this series into 2? Or at least
> >reorder the patches so that everything needed for basic x86 support
> >comes first and can be reviewed and tested separately.
>
> I'll try reordering first and see if it looks nice. Since the dense mode would
> make some parts arch specific I think it's better to have the two parts in one
> series for easier reference. But if it turns out more convoluted I'll just split
> it as you suggested.

Yes, please do. I also think if you split the series, we can land the
basic x86 support fairly quickly, or at least I can do the review and
give the ack from the KASAN side. For the dense mode part, I'd like to
also hear the opinion of other KASAN developers wrt the overall
design.

> >Also feel free to drop the dependency on that risc-v series, as it
> >doesn't get updated very often. But up to you.
>
> Okay, I was mostly interested in the patch that redefines KASAN_SHADOW_END as
> KASAN_SHADOW_OFFSET and then gets shadow addresses by using a signed offset. But
> I suppose I can just take that patch and prepend my series with that? (after
> applying your comments from that series)

Sounds good to me!
Christoph Lameter (Ampere) Feb. 6, 2025, 7:11 p.m. UTC | #12
On Thu, 6 Feb 2025, Jessica Clarke wrote:

> On 5 Feb 2025, at 18:51, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> > On Ampere Processor hardware there is no penalty since the logic is build
> > into the usual read/write paths. This is by design. There may be on other
> > platforms that cannot do this.
>
> You helpfully cut out all the explanation of where the performance
> penalty comes from. But if it’s as you say I can only assume your
> design chooses to stall all stores until they have actually written, in
> which case you have a performance cost compared with hardware that
> omitted MTE or optimises for non-synchronous MTE. The literature on MTE
> agrees that it is not no penalty (but can be low penalty). I don’t
> really want to have some big debate here about the ins and outs of MTE,
> it’s not the place for it, but I will stand up and point out that
> claiming MTE to be “no performance penalty” is misrepresentative of the
> truth

I cannot share details since this information has not been released to
the public yet. I hear that a whitepaper will be coming soon to explain
this feature. The AmpereOne processors were released a couple of months
ago.

I also see that KASAN_HW_TAGS exists, but this means that the tags can
only be used with CONFIG_KASAN, which is a kernel configuration for
debug purposes.

What we are interested in is a *production* implementation with minimal
software overhead that will be the default on ARM64 if the appropriate
hardware is detected. That in turn will hopefully allow other software
instrumentation that is currently used to keep small objects secure and in
turn creates overhead.
Dave Hansen Feb. 6, 2025, 9:41 p.m. UTC | #13
On 2/6/25 11:11, Christoph Lameter (Ampere) wrote:
> I also see that KASAN_HW_TAGS exist but this means that the tags can only
> be used with CONFIG_KASAN which is a kernel configuration for debug
> purposes.
> 
> What we are interested in is a *production* implementation with minimal
> software overhead that will be the default on ARM64 if the appropriate
> hardware is detected. 

Ahh, interesting. I'd assumed that once folks had in-hardware tag checks
that they'd just turn on CONFIG_KASAN and be happy.  Guess not!

> That in turn will hopefully allow other software instrumentation
> that is currently used to keep small objects secure and in turn
> creates overhead.
OK, so KASAN as-is is too broad. Are you saying that the kernel
_currently_ have "software instrumentation" like SLAB
redzoning/poisoning and you'd like to see MTE used to replace those?

Are you just interested in small objects?  What counts as small?  I
assume it's anything roughly <PAGE_SIZE.
Andrey Konovalov Feb. 6, 2025, 10:56 p.m. UTC | #14
On Thu, Feb 6, 2025 at 8:21 PM 'Christoph Lameter (Ampere)' via
kasan-dev <kasan-dev@googlegroups.com> wrote:
>
> I cannot share details since this information has not been released to be
> public yet. I hear that a whitepaper will be coming soon to explain this
> feature. The AmpereOne processors have been released a couple of months
> ago.
>
> I also see that KASAN_HW_TAGS exist but this means that the tags can only
> be used with CONFIG_KASAN which is a kernel configuration for debug
> purposes.
>
> What we are interested in is a *production* implementation with minimal
> software overhead that will be the default on ARM64 if the appropriate
> hardware is detected. That in turn will hopefully allow other software
> instrumentation that is currently used to keep small objects secure and in
> turn creates overhead.

Is there anything specific CONFIG_KASAN + CONFIG_KASAN_HW_TAGS do that
is not good enough for a production environment?

The last time I did some perf tests (a year+ ago on Pixel 8, I
believe), the two expensive parts of CONFIG_KASAN_HW_TAGS were:

1. Collecting stack traces. Thus, this can now be disabled via
kernel.stacktrace=off. And there's a tracking bug to add a
production-grade implementation [1];

2. Assigning memory tags to large allocations, specifically page_alloc
allocations with large orders (AFAIR it was specifically assigning
the tags, not checking them). Thus, this can now be controlled via
kasan.page_alloc.sample(.order).

There's definitely room for optimization and additional config options
that cut down KASAN checks (for example, disabling tag checking of
mempool allocations; although arguably, people might want to have this
in a production environment.)

Otherwise, it's unclear to me what a new production-grade MTE
implementation would do different compared to KASAN_HW_TAGS. But if
there's something, we can just adjust KASAN_HW_TAGS instead.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=211785
Maciej Wieczor-Retman Feb. 7, 2025, 7:41 a.m. UTC | #15
On 2025-02-06 at 13:41:29 -0800, Dave Hansen wrote:
>On 2/6/25 11:11, Christoph Lameter (Ampere) wrote:
>> I also see that KASAN_HW_TAGS exist but this means that the tags can only
>> be used with CONFIG_KASAN which is a kernel configuration for debug
>> purposes.
>> 
>> What we are interested in is a *production* implementation with minimal
>> software overhead that will be the default on ARM64 if the appropriate
>> hardware is detected. 
>
>Ahh, interesting. I'd assumed that once folks had in-hardware tag checks
>that they'd just turn on CONFIG_KASAN and be happy.  Guess not!
>
>> That in turn will hopefully allow other software instrumentation
>> that is currently used to keep small objects secure and in turn
>> creates overhead.
>OK, so KASAN as-is is too broad. Are you saying that the kernel
>_currently_ have "software instrumentation" like SLAB
>redzoning/poisoning and you'd like to see MTE used to replace those?

I share Andrey's opinion that in the hardware KASAN mode (with MTE on
arm64), after disabling stacktraces (which in my tests in the software
tag-based mode took up ~90% of the allocation time for small kmalloc()
calls) and tweaking the bigger allocations, there doesn't seem to be
anything left in KASAN that'd be slowing things down.

Obviously this series deals with the tag-based mode which will suffer from all
the software instrumentation penalties to performance. So while it's still a
debugging feature at least it gains 2x-4x memory savings over the generic mode
already present on x86.

>
>Are you just interested in small objects?  What counts as small?  I
>assume it's anything roughly <PAGE_SIZE.

Would disabling vmalloc instrumentation achieve something like this? That is
tweakable during compilation.
