Message ID: cover.1738686764.git.maciej.wieczor-retman@intel.com (mailing list archive)
Series: kasan: x86: arm64: risc-v: KASAN tag-based mode for x86
ARM64 supports MTE which is hardware support for tagging 16 byte granules
and verification of tags in pointers all in hardware and on some platforms
with *no* performance penalty since the tag is stored in the ECC areas of
DRAM and verified at the same time as the ECC.

Could we get support for that? This would allow us to enable tag checking
in production systems without performance penalty and no memory overhead.
On 2/4/25 10:58, Christoph Lameter (Ampere) wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
>
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

At least on the Intel side, there's no trajectory for doing something
like the MTE architecture for memory tagging. The DRAM "ECC" area is in
very high demand and if anything things are moving away from using ECC
"bits" for anything other than actual ECC. Even the MKTME+integrity
(used for TDX) metadata is probably going to find a new home at some point.

This shouldn't be a surprise to anyone on cc here. If it is, you should
probably be reaching out to Intel over your normal channels.
On 4 Feb 2025, at 18:58, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
>
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

It’s not “no performance penalty”, there is a cost to tracking the MTE
tags for checking. In asynchronous (or asymmetric) mode that’s not too
bad, but in synchronous mode there is a significant overhead even with
ECC.

Normally on a store, once you’ve translated it and have the data, you
can buffer it up and defer the actual write until some time later. If
you hit in the L1 cache then that will probably be quite soon, but if
you miss then you have to wait for the data to come back from lower
levels of the hierarchy, potentially all the way out to DRAM. Or if you
have a write-around cache then you just send it out to the next level
when it’s ready. But now, if you have synchronous MTE, you cannot retire
your store instruction until you know what the tag for the location
you’re storing to is; effectively you have to wait until you can do the
full cache lookup, and potentially miss, before it can retire. This puts
pressure on the various microarchitectural structures that track
instructions as they get executed, as instructions are now in flight for
longer.

Yes, it may well be that it is quicker for the memory controller to get
the tags from ECC bits than via some other means, but you’re already
paying many many cycles at that point, with the relevant store stuck
unable to retire (and thus every instruction after it in the instruction
stream) that whole time, and no write-allocate or write-around scheme
can help you, because you fundamentally have to wait for the tags to be
read before you know whether the instruction is going to trap.

Now, you can choose not to use synchronous mode due to that overhead,
but that’s nuance that isn’t considered by your reply here and has some
consequences.

Jess
On Tue, 4 Feb 2025, Jessica Clarke wrote:
> It’s not “no performance penalty”, there is a cost to tracking the MTE
> tags for checking. In asynchronous (or asymmetric) mode that’s not too

On Ampere Processor hardware there is no penalty since the logic is
built into the usual read/write paths. This is by design. There may be a
penalty on other platforms that cannot do this.
On Tue, 4 Feb 2025, Dave Hansen wrote:
> > Could we get support for that? This would allow us to enable tag checking
> > in production systems without performance penalty and no memory overhead.
>
> At least on the Intel side, there's no trajectory for doing something
> like the MTE architecture for memory tagging. The DRAM "ECC" area is in
> very high demand and if anything things are moving away from using ECC
> "bits" for anything other than actual ECC. Even the MKTME+integrity
> (used for TDX) metadata is probably going to find a new home at some point.
>
> This shouldn't be a surprise to anyone on cc here. If it is, you should
> probably be reaching out to Intel over your normal channels.

Intel was a competitor for our company and AFAICT has issues all over
the place with performance given its conservative stance on technology.
But we do not test against Intel anymore. Can someone from AMD say
something?

MTE tagging is part of the processor standard for ARM64 and Linux will
need to support the 16 byte tagging feature one way or another even if
Intel does not like it. And AFAICT hardware tagging support is a
critical security feature for the future.
On Wed, 5 Feb 2025 at 20:31, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
>
> MTE tagging is part of the processor standard for ARM64 and Linux will
> need to support the 16 byte tagging feature one way or another even if
> Intel does not like it. And AFAICT hardware tagging support is a critical
> security feature for the future.
>

Can you explain what you feel is lacking in the existing MTE support in
KAsan (enabled when selecting CONFIG_KASAN_HW_TAGS)?
On Tue, Feb 4, 2025 at 6:34 PM Maciej Wieczor-Retman
<maciej.wieczor-retman@intel.com> wrote:
>
> ======= Introduction
> The patchset aims to add a KASAN tag-based mode for the x86 architecture
> with the help of the new CPU feature called Linear Address Masking
> (LAM). Main improvement introduced by the series is 4x lower memory
> usage compared to KASAN's generic mode, the only currently available
> mode on x86.
>
> There are two logical parts to this series. The first one attempts to
> add a new memory saving mechanism called "dense mode" to the generic
> part of the tag-based KASAN code. The second one focuses on implementing
> and enabling the tag-based mode for the x86 architecture by using LAM.

Hi Maciej,

Awesome work! Great to see SW_TAGS mode supported on x86!

I started reviewing the patches, but this is somewhat complicated, as
the dense mode changes are squashed together with the generic ones for
x86 support. Could you please split this series into 2? Or at least
reorder the patches so that everything needed for basic x86 support
comes first and can be reviewed and tested separately.

I will post the comments for things I noted so far, including for the
dense mode changes, but I'll take a closer look after the split.

Also feel free to drop the dependency on that risc-v series, as it
doesn't get updated very often. But up to you.

And please also update all affected parts of
Documentation/dev-tools/kasan.rst.

Thank you!
On 5 Feb 2025, at 18:51, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
>
> On Tue, 4 Feb 2025, Jessica Clarke wrote:
>
>> It’s not “no performance penalty”, there is a cost to tracking the MTE
>> tags for checking. In asynchronous (or asymmetric) mode that’s not too
>
> On Ampere Processor hardware there is no penalty since the logic is build
> into the usual read/write paths. This is by design. There may be on other
> platforms that cannot do this.

You helpfully cut out all the explanation of where the performance
penalty comes from. But if it’s as you say, I can only assume your
design chooses to stall all stores until they have actually written, in
which case you have a performance cost compared with hardware that
omitted MTE or optimises for non-synchronous MTE. The literature on MTE
agrees that it is not no penalty (but can be low penalty). I don’t
really want to have some big debate here about the ins and outs of MTE,
it’s not the place for it, but I will stand up and point out that
claiming MTE to be “no performance penalty” is misrepresentative of the
truth.

Jess
Hello Andrey!

On 2025-02-06 at 00:40:59 +0100, Andrey Konovalov wrote:
>On Tue, Feb 4, 2025 at 6:34 PM Maciej Wieczor-Retman
><maciej.wieczor-retman@intel.com> wrote:
>>
>> ======= Introduction
>> The patchset aims to add a KASAN tag-based mode for the x86 architecture
>> with the help of the new CPU feature called Linear Address Masking
>> (LAM). Main improvement introduced by the series is 4x lower memory
>> usage compared to KASAN's generic mode, the only currently available
>> mode on x86.
>>
>> There are two logical parts to this series. The first one attempts to
>> add a new memory saving mechanism called "dense mode" to the generic
>> part of the tag-based KASAN code. The second one focuses on implementing
>> and enabling the tag-based mode for the x86 architecture by using LAM.
>
>Hi Maciej,
>
>Awesome work! Great to see SW_TAGS mode supported on x86!

Glad to hear that, it was a lot of fun to work on :)

>
>I started reviewing the patches, but this is somewhat complicated, as
>the dense mode changes are squashed together with the generic ones for
>x86 support. Could you please split this series into 2? Or at least
>reorder the patches so that everything needed for basic x86 support
>comes first and can be reviewed and tested separately.

I'll try reordering first and see if it looks nice. Since the dense mode
would make some parts arch specific I think it's better to have the two
parts in one series for easier reference. But if it turns out more
convoluted I'll just split it as you suggested.

>
>I will post the comments for things I noted so far, including for the
>dense mode changes, but I'll take a closer look after the split.
>
>Also feel free to drop the dependency on that risc-v series, as it
>doesn't get updated very often. But up to you.

Okay, I was mostly interested in the patch that redefines
KASAN_SHADOW_END as KASAN_SHADOW_OFFSET and then gets shadow addresses
by using a signed offset. But I suppose I can just take that patch and
prepend my series with that? (after applying your comments from that
series)

>
>And please also update all affected parts of Documentation/dev-tools/kasan.rst.

Right, thanks for the reminder :)

>
>Thank you!
On Thu, Feb 6, 2025 at 11:41 AM Maciej Wieczor-Retman
<maciej.wieczor-retman@intel.com> wrote:
>
> >I started reviewing the patches, but this is somewhat complicated, as
> >the dense mode changes are squashed together with the generic ones for
> >x86 support. Could you please split this series into 2? Or at least
> >reorder the patches so that everything needed for basic x86 support
> >comes first and can be reviewed and tested separately.
>
> I'll try reordering first and see if it looks nice. Since the dense mode would
> make some parts arch specific I think it's better to have the two parts in one
> series for easier reference. But if it turns out more convoluted I'll just split
> it as you suggested.

Yes, please do. I also think if you split the series, we can land the
basic x86 support fairly quickly, or at least I can do the review and
give the ack from the KASAN side. For the dense mode part, I'd like to
also hear the opinion of other KASAN developers wrt the overall design.

> >Also feel free to drop the dependency on that risc-v series, as it
> >doesn't get updated very often. But up to you.
>
> Okay, I was mostly interested in the patch that redefines KASAN_SHADOW_END as
> KASAN_SHADOW_OFFSET and then gets shadow addresses by using a signed offset. But
> I suppose I can just take that patch and prepend my series with that? (after
> applying your comments from that series)

Sounds good to me!
On Thu, 6 Feb 2025, Jessica Clarke wrote:
> On 5 Feb 2025, at 18:51, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> > On Ampere Processor hardware there is no penalty since the logic is build
> > into the usual read/write paths. This is by design. There may be on other
> > platforms that cannot do this.
>
> You helpfully cut out all the explanation of where the performance
> penalty comes from. But if it’s as you say I can only assume your
> design chooses to stall all stores until they have actually written, in
> which case you have a performance cost compared with hardware that
> omitted MTE or optimises for non-synchronous MTE. The literature on MTE
> agrees that it is not no penalty (but can be low penalty). I don’t
> really want to have some big debate here about the ins and outs of MTE,
> it’s not the place for it, but I will stand up and point out that
> claiming MTE to be “no performance penalty” is misrepresentative of the
> truth

I cannot share details since this information has not been made public
yet. I hear that a whitepaper will be coming soon to explain this
feature. The AmpereOne processors were released a couple of months ago.

I also see that KASAN_HW_TAGS exists, but this means that the tags can
only be used with CONFIG_KASAN which is a kernel configuration for debug
purposes.

What we are interested in is a *production* implementation with minimal
software overhead that will be the default on ARM64 if the appropriate
hardware is detected. That in turn will hopefully let us drop other
software instrumentation that is currently used to keep small objects
secure and in turn creates overhead.
On 2/6/25 11:11, Christoph Lameter (Ampere) wrote:
> I also see that KASAN_HW_TAGS exist but this means that the tags can only
> be used with CONFIG_KASAN which is a kernel configuration for debug
> purposes.
>
> What we are interested in is a *production* implementation with minimal
> software overhead that will be the default on ARM64 if the appropriate
> hardware is detected.

Ahh, interesting. I'd assumed that once folks had in-hardware tag checks
that they'd just turn on CONFIG_KASAN and be happy. Guess not!

> That in turn will hopefully allow other software instrumentation
> that is currently used to keep small objects secure and in turn
> creates overhead.

OK, so KASAN as-is is too broad. Are you saying that the kernel
_currently_ has "software instrumentation" like SLAB
redzoning/poisoning and you'd like to see MTE used to replace those?

Are you just interested in small objects? What counts as small? I
assume it's anything roughly <PAGE_SIZE.
On Thu, Feb 6, 2025 at 8:21 PM 'Christoph Lameter (Ampere)' via
kasan-dev <kasan-dev@googlegroups.com> wrote:
>
> I cannot share details since this information has not been released to be
> public yet. I hear that a whitepaper will be coming soon to explain this
> feature. The AmpereOne processors have been released a couple of months
> ago.
>
> I also see that KASAN_HW_TAGS exist but this means that the tags can only
> be used with CONFIG_KASAN which is a kernel configuration for debug
> purposes.
>
> What we are interested in is a *production* implementation with minimal
> software overhead that will be the default on ARM64 if the appropriate
> hardware is detected. That in turn will hopefully allow other software
> instrumentation that is currently used to keep small objects secure and in
> turn creates overhead.

Is there anything specific CONFIG_KASAN + CONFIG_KASAN_HW_TAGS do that
is not good enough for a production environment?

The last time I did some perf tests (a year+ ago on Pixel 8, I believe),
the two expensive parts of CONFIG_KASAN_HW_TAGS were:

1. Collecting stack traces. This can now be disabled via
kasan.stacktrace=off. And there's a tracking bug to add a
production-grade implementation [1];

2. Assigning memory tags to large allocations, specifically page_alloc
allocations with large orders (AFAIR it was specifically assigning the
tags, not checking them). This can now be controlled via
kasan.page_alloc.sample(.order).

There's definitely room for optimization and additional config options
that cut down KASAN checks (for example, disabling tag checking of
mempool allocations; although arguably, people might want to have this
in a production environment.)

Otherwise, it's unclear to me what a new production-grade MTE
implementation would do differently compared to KASAN_HW_TAGS. But if
there's something, we can just adjust KASAN_HW_TAGS instead.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=211785
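[Editorial note: the tunables Andrey mentions are kernel command-line parameters documented in Documentation/dev-tools/kasan.rst. A hedged example of how they might be combined for a low-overhead production deployment; the sample values are illustrative, not recommendations:]

```shell
# Example kernel command line fragment (values illustrative):
# keep HW_TAGS KASAN enabled, use the lower-overhead asymmetric mode,
# skip stack trace collection, and only tag 1 in 16 page_alloc
# allocations of order 3 and above.
kasan=on kasan.mode=asymm kasan.stacktrace=off \
    kasan.page_alloc.sample=16 kasan.page_alloc.sample.order=3
```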
On 2025-02-06 at 13:41:29 -0800, Dave Hansen wrote:
>On 2/6/25 11:11, Christoph Lameter (Ampere) wrote:
>> I also see that KASAN_HW_TAGS exist but this means that the tags can only
>> be used with CONFIG_KASAN which is a kernel configuration for debug
>> purposes.
>>
>> What we are interested in is a *production* implementation with minimal
>> software overhead that will be the default on ARM64 if the appropriate
>> hardware is detected.
>
>Ahh, interesting. I'd assumed that once folks had in-hardware tag checks
>that they'd just turn on CONFIG_KASAN and be happy. Guess not!
>
>> That in turn will hopefully allow other software instrumentation
>> that is currently used to keep small objects secure and in turn
>> creates overhead.
>
>OK, so KASAN as-is is too broad. Are you saying that the kernel
>_currently_ have "software instrumentation" like SLAB
>redzoning/poisoning and you'd like to see MTE used to replace those?

I share Andrey's opinion that in hardware KASAN mode (with MTE on
arm64), after disabling stacktraces (which in my tests in software
tag-based mode took up ~90% of the allocation time for small kmalloc()
calls) and tweaking the bigger allocations, there doesn't seem to be
anything more left in KASAN that'd be slowing things down.

Obviously this series deals with the tag-based mode which will suffer
from all the software instrumentation penalties to performance. So while
it's still a debugging feature, at least it gains 2x-4x memory savings
over the generic mode already present on x86.

>
>Are you just interested in small objects? What counts as small? I
>assume it's anything roughly <PAGE_SIZE.

Would disabling vmalloc instrumentation achieve something like this?
That is tweakable during compilation.
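[Editorial note: the compile-time knob Maciej refers to is the CONFIG_KASAN_VMALLOC Kconfig option. A hypothetical .config fragment for the setup being discussed (software tag-based KASAN, which on x86 would require this series and LAM-capable hardware), shown only to make the options concrete:]

```shell
# Hypothetical .config fragment (illustrative, not from the series):
# software tag-based KASAN with vmalloc instrumentation compiled out.
CONFIG_KASAN=y
CONFIG_KASAN_SW_TAGS=y
CONFIG_KASAN_INLINE=y
# CONFIG_KASAN_VMALLOC is not set
```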