Message ID | 20241112082600.298035-1-song@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | Make inode storage available to tracing prog | expand |
> On Nov 12, 2024, at 12:25 AM, Song Liu <song@kernel.org> wrote: > > bpf inode local storage can be useful beyond LSM programs. For example, > bcc/libbpf-tools file* can use inode local storage to simplify the logic. > This set makes inode local storage available to tracing program. > > 1/4 is missing change for bpf task local storage. 2/4 move inode local > storage from security blob to inode. > > Similar to task local storage in tracing program, it is necessary to add > recursion prevention logic for inode local storage. Patch 3/4 adds such > logic, and 4/4 add a test for the recursion prevention logic. > > Song Liu (4): > bpf: lsm: Remove hook to bpf_task_storage_free > bpf: Make bpf inode storage available to tracing program > bpf: Add recursion prevention logic for inode storage > selftest/bpf: Test inode local storage recursion prevention I accidentally sent some older .patch files together with this set. Please ignore this version. I will resend v2. Thanks, Song > > fs/inode.c | 1 + > include/linux/bpf.h | 9 + > include/linux/bpf_lsm.h | 29 --- > include/linux/fs.h | 4 + > kernel/bpf/Makefile | 3 +- > kernel/bpf/bpf_inode_storage.c | 185 +++++++++++++----- > kernel/bpf/bpf_lsm.c | 4 - > kernel/trace/bpf_trace.c | 8 + > security/bpf/hooks.c | 7 - > tools/testing/selftests/bpf/DENYLIST.s390x | 1 + > .../bpf/prog_tests/inode_local_storage.c | 72 +++++++ > .../bpf/progs/inode_storage_recursion.c | 90 +++++++++ > 12 files changed, 320 insertions(+), 93 deletions(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/inode_local_storage.c > create mode 100644 tools/testing/selftests/bpf/progs/inode_storage_recursion.c > > -- > 2.43.5
On 11/12/2024 12:25 AM, Song Liu wrote: > bpf inode local storage can be useful beyond LSM programs. For example, > bcc/libbpf-tools file* can use inode local storage to simplify the logic. > This set makes inode local storage available to tracing program. Mixing the storage and scope of LSM data and tracing data leaves all sorts of opportunities for abuse. Add inode data for tracing if you can get the patch accepted, but do not move the LSM data out of i_security. Moving the LSM data would break the integrity (such that there is) of the LSM model. > > 1/4 is missing change for bpf task local storage. 2/4 move inode local > storage from security blob to inode. > > Similar to task local storage in tracing program, it is necessary to add > recursion prevention logic for inode local storage. Patch 3/4 adds such > logic, and 4/4 add a test for the recursion prevention logic. > > Song Liu (4): > bpf: lsm: Remove hook to bpf_task_storage_free > bpf: Make bpf inode storage available to tracing program > bpf: Add recursion prevention logic for inode storage > selftest/bpf: Test inode local storage recursion prevention > > fs/inode.c | 1 + > include/linux/bpf.h | 9 + > include/linux/bpf_lsm.h | 29 --- > include/linux/fs.h | 4 + > kernel/bpf/Makefile | 3 +- > kernel/bpf/bpf_inode_storage.c | 185 +++++++++++++----- > kernel/bpf/bpf_lsm.c | 4 - > kernel/trace/bpf_trace.c | 8 + > security/bpf/hooks.c | 7 - > tools/testing/selftests/bpf/DENYLIST.s390x | 1 + > .../bpf/prog_tests/inode_local_storage.c | 72 +++++++ > .../bpf/progs/inode_storage_recursion.c | 90 +++++++++ > 12 files changed, 320 insertions(+), 93 deletions(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/inode_local_storage.c > create mode 100644 tools/testing/selftests/bpf/progs/inode_storage_recursion.c > > -- > 2.43.5 >
Hi Casey, Thanks for your input. > On Nov 12, 2024, at 10:09 AM, Casey Schaufler <casey@schaufler-ca.com> wrote: > > On 11/12/2024 12:25 AM, Song Liu wrote: >> bpf inode local storage can be useful beyond LSM programs. For example, >> bcc/libbpf-tools file* can use inode local storage to simplify the logic. >> This set makes inode local storage available to tracing program. > > Mixing the storage and scope of LSM data and tracing data leaves all sorts > of opportunities for abuse. Add inode data for tracing if you can get the > patch accepted, but do not move the LSM data out of i_security. Moving > the LSM data would break the integrity (such that there is) of the LSM > model. I honestly don't see how this would cause any issues. Each bpf inode storage maps are independent of each other, and the bpf local storage is designed to handle multiple inode storage maps properly. Therefore, if the user decide to stick with only LSM hooks, there isn't any behavior change. OTOH, if the user decides some tracing hooks (on tracepoints, etc.) are needed, making a inode storage map available for both tracing programs and LSM programs would help simplify the logic. (Alternatively, the tracing programs need to store per inode data in a hash map, and the LSM program would read that instead of the inode storage map.) Does this answer the question and address the concerns? Thanks, Song > >> >> 1/4 is missing change for bpf task local storage. 2/4 move inode local >> storage from security blob to inode. >> >> Similar to task local storage in tracing program, it is necessary to add >> recursion prevention logic for inode local storage. Patch 3/4 adds such >> logic, and 4/4 add a test for the recursion prevention logic. >> >> Song Liu (4): >> bpf: lsm: Remove hook to bpf_task_storage_free >> bpf: Make bpf inode storage available to tracing program >> bpf: Add recursion prevention logic for inode storage >> selftest/bpf: Test inode local storage recursion prevention [...]
On 11/12/2024 10:44 AM, Song Liu wrote: > Hi Casey, > > Thanks for your input. > >> On Nov 12, 2024, at 10:09 AM, Casey Schaufler <casey@schaufler-ca.com> wrote: >> >> On 11/12/2024 12:25 AM, Song Liu wrote: >>> bpf inode local storage can be useful beyond LSM programs. For example, >>> bcc/libbpf-tools file* can use inode local storage to simplify the logic. >>> This set makes inode local storage available to tracing program. >> Mixing the storage and scope of LSM data and tracing data leaves all sorts >> of opportunities for abuse. Add inode data for tracing if you can get the >> patch accepted, but do not move the LSM data out of i_security. Moving >> the LSM data would break the integrity (such that there is) of the LSM >> model. > I honestly don't see how this would cause any issues. Each bpf inode > storage maps are independent of each other, and the bpf local storage is > designed to handle multiple inode storage maps properly. Therefore, if > the user decide to stick with only LSM hooks, there isn't any behavior > change. OTOH, if the user decides some tracing hooks (on tracepoints, > etc.) are needed, making a inode storage map available for both tracing > programs and LSM programs would help simplify the logic. (Alternatively, > the tracing programs need to store per inode data in a hash map, and > the LSM program would read that instead of the inode storage map.) > > Does this answer the question and address the concerns? First off, I had no question. No, this does not address my concern. LSM data should be kept in and managed by the LSMs. We're making an effort to make the LSM infrastructure more consistent. Moving some of the LSM data into an LSM specific field in the inode structure goes against this. What you're proposing is a one-off clever optimization hack. We have too many of those already. > > Thanks, > Song > >>> 1/4 is missing change for bpf task local storage. 2/4 move inode local >>> storage from security blob to inode. >>> >>> Similar to task local storage in tracing program, it is necessary to add >>> recursion prevention logic for inode local storage. Patch 3/4 adds such >>> logic, and 4/4 add a test for the recursion prevention logic. >>> >>> Song Liu (4): >>> bpf: lsm: Remove hook to bpf_task_storage_free >>> bpf: Make bpf inode storage available to tracing program >>> bpf: Add recursion prevention logic for inode storage >>> selftest/bpf: Test inode local storage recursion prevention > [...] >
> On Nov 12, 2024, at 5:10 PM, Casey Schaufler <casey@schaufler-ca.com> wrote: > > On 11/12/2024 10:44 AM, Song Liu wrote: >> Hi Casey, >> >> Thanks for your input. >> >>> On Nov 12, 2024, at 10:09 AM, Casey Schaufler <casey@schaufler-ca.com> wrote: >>> >>> On 11/12/2024 12:25 AM, Song Liu wrote: >>>> bpf inode local storage can be useful beyond LSM programs. For example, >>>> bcc/libbpf-tools file* can use inode local storage to simplify the logic. >>>> This set makes inode local storage available to tracing program. >>> Mixing the storage and scope of LSM data and tracing data leaves all sorts >>> of opportunities for abuse. Add inode data for tracing if you can get the >>> patch accepted, but do not move the LSM data out of i_security. Moving >>> the LSM data would break the integrity (such that there is) of the LSM >>> model. >> I honestly don't see how this would cause any issues. Each bpf inode >> storage maps are independent of each other, and the bpf local storage is >> designed to handle multiple inode storage maps properly. Therefore, if >> the user decide to stick with only LSM hooks, there isn't any behavior >> change. OTOH, if the user decides some tracing hooks (on tracepoints, >> etc.) are needed, making a inode storage map available for both tracing >> programs and LSM programs would help simplify the logic. (Alternatively, >> the tracing programs need to store per inode data in a hash map, and >> the LSM program would read that instead of the inode storage map.) >> >> Does this answer the question and address the concerns? > > First off, I had no question. No, this does not address my concern. > LSM data should be kept in and managed by the LSMs. We're making an > effort to make the LSM infrastructure more consistent. Could you provide more information on the definition of "more consistent" LSM infrastructure? BPF LSM programs have full access to regular BPF maps (hash map, array, etc.). There was never a separation of LSM data vs. other data. AFAICT, other LSMs also use kzalloc and similar APIs for memory allocation. So data separation is not a goal for any LSM, right? Thanks, Song > Moving some of > the LSM data into an LSM specific field in the inode structure goes > against this. What you're proposing is a one-off clever optimization > hack. We have too many of those already.
On 11/12/2024 5:37 PM, Song Liu wrote: > >> On Nov 12, 2024, at 5:10 PM, Casey Schaufler <casey@schaufler-ca.com> wrote: >> >> On 11/12/2024 10:44 AM, Song Liu wrote: >>> Hi Casey, >>> >>> Thanks for your input. >>> >>>> On Nov 12, 2024, at 10:09 AM, Casey Schaufler <casey@schaufler-ca.com> wrote: >>>> >>>> On 11/12/2024 12:25 AM, Song Liu wrote: >>>>> bpf inode local storage can be useful beyond LSM programs. For example, >>>>> bcc/libbpf-tools file* can use inode local storage to simplify the logic. >>>>> This set makes inode local storage available to tracing program. >>>> Mixing the storage and scope of LSM data and tracing data leaves all sorts >>>> of opportunities for abuse. Add inode data for tracing if you can get the >>>> patch accepted, but do not move the LSM data out of i_security. Moving >>>> the LSM data would break the integrity (such that there is) of the LSM >>>> model. >>> I honestly don't see how this would cause any issues. Each bpf inode >>> storage maps are independent of each other, and the bpf local storage is >>> designed to handle multiple inode storage maps properly. Therefore, if >>> the user decide to stick with only LSM hooks, there isn't any behavior >>> change. OTOH, if the user decides some tracing hooks (on tracepoints, >>> etc.) are needed, making a inode storage map available for both tracing >>> programs and LSM programs would help simplify the logic. (Alternatively, >>> the tracing programs need to store per inode data in a hash map, and >>> the LSM program would read that instead of the inode storage map.) >>> >>> Does this answer the question and address the concerns? >> First off, I had no question. No, this does not address my concern. >> LSM data should be kept in and managed by the LSMs. We're making an >> effort to make the LSM infrastructure more consistent. > Could you provide more information on the definition of "more > consistent" LSM infrastructure? We're doing several things. The management of security blobs (e.g. inode->i_security) has been moved out of the individual modules and into the infrastructure. The use of a u32 secid is being replaced with a more general lsm_prop structure, except where networking code won't allow it. A good deal of work has gone into making the return values of LSM hooks consistent. Some of this was done as part of the direct call change, and some in support of LSM stacking. There are also some hardening changes that aren't ready for prime-time, but that are in the works. There have been concerns about the potential expoitability of the LSM infrastructure, and we're serious about addressing those. > > BPF LSM programs have full access to regular BPF maps (hash map, > array, etc.). There was never a separation of LSM data vs. other > data. > > AFAICT, other LSMs also use kzalloc and similar APIs for memory > allocation. So data separation is not a goal for any LSM, right? > > Thanks, > Song > >> Moving some of >> the LSM data into an LSM specific field in the inode structure goes >> against this. What you're proposing is a one-off clever optimization >> hack. We have too many of those already. > >
On Wed, Nov 13, 2024 at 10:06 AM Casey Schaufler <casey@schaufler-ca.com> wrote: > > On 11/12/2024 5:37 PM, Song Liu wrote: [...] > > Could you provide more information on the definition of "more > > consistent" LSM infrastructure? > > We're doing several things. The management of security blobs > (e.g. inode->i_security) has been moved out of the individual > modules and into the infrastructure. The use of a u32 secid is > being replaced with a more general lsm_prop structure, except > where networking code won't allow it. A good deal of work has > gone into making the return values of LSM hooks consistent. Thanks for the information. Unifying per-object memory usage of different LSMs makes sense. However, I don't think we are limiting any LSM to only use memory from the lsm_blobs. The LSMs still have the freedom to use other memory allocators. BPF inode local storage, just like other BPF maps, is a way to manage memory. BPF LSM programs have full access to BPF maps. So I don't think it makes sense to say this BPF map is used by tracing, so we should not allow LSM to use it. Does this make sense? Song > Some of this was done as part of the direct call change, and some > in support of LSM stacking. There are also some hardening changes > that aren't ready for prime-time, but that are in the works. > There have been concerns about the potential expoitability of the > LSM infrastructure, and we're serious about addressing those.
On Wed, Nov 13, 2024 at 10:57:05AM -0800, Song Liu wrote: Good morning, I hope the week is going well for everyone. > On Wed, Nov 13, 2024 at 10:06???AM Casey Schaufler <casey@schaufler-ca.com> wrote: > > > > On 11/12/2024 5:37 PM, Song Liu wrote: > [...] > > > Could you provide more information on the definition of "more > > > consistent" LSM infrastructure? > > > > We're doing several things. The management of security blobs > > (e.g. inode->i_security) has been moved out of the individual > > modules and into the infrastructure. The use of a u32 secid is > > being replaced with a more general lsm_prop structure, except > > where networking code won't allow it. A good deal of work has > > gone into making the return values of LSM hooks consistent. > > Thanks for the information. Unifying per-object memory usage of > different LSMs makes sense. However, I don't think we are limiting > any LSM to only use memory from the lsm_blobs. The LSMs still > have the freedom to use other memory allocators. BPF inode > local storage, just like other BPF maps, is a way to manage > memory. BPF LSM programs have full access to BPF maps. So > I don't think it makes sense to say this BPF map is used by tracing, > so we should not allow LSM to use it. > > Does this make sense? As involved bystanders, some questions and thoughts that may help further the discussion. With respect to inode specific storage, the currently accepted pattern in the LSM world is roughly as follows: The LSM initialization code, at boot, computes the total amount of storage needed by all of the LSM's that are requesting inode specific storage. A single pointer to that 'blob' of storage is included in the inode structure. In an include file, an inline function similar to the following is declared, whose purpose is to return the location inside of the allocated storage or 'LSM inode blob' where a particular LSM's inode specific data structure is located: static inline struct tsem_inode *tsem_inode(struct inode *inode) { return inode->i_security + tsem_blob_sizes.lbs_inode; } In an LSM's implementation code, the function gets used in something like the following manner: static int tsem_inode_alloc_security(struct inode *inode) { struct tsem_inode *tsip = tsem_inode(inode); /* Do something with the structure pointed to by tsip. */ } Christian appears to have already chimed in and indicated that there is no appetite to add another pointer member to the inode structure. So, if this were to proceed forward, is it proposed that there will be a 'flag day' requirement to have each LSM that uses inode specific storage implement a security_inode_alloc() event handler that creates an LSM specific BPF map key/value pair for that inode? Which, in turn, would require that the accessor functions be converted to use a bpf key request to return the LSM specific information for that inode? A flag day event is always somewhat of a concern, but the larger concern may be the substitution of simple pointer arithmetic for a body of more complex code. One would assume with something like this, that there may be a need for a shake-out period to determine what type of potential regressions the more complex implementation may generate, with regressions in security sensitive code always a concern. In a larger context. Given that the current implementation works on simple pointer arithmetic over a common block of storage, there is not much of a safety guarantee that one LSM couldn't interfere with the inode storage of another LSM. However, using a generic BPF construct such as a map, would presumably open the level of influence over LSM specific inode storage to a much larger audience, presumably any BPF program that would be loaded. The LSM inode information is obviously security sensitive, which I presume would be be the motivation for Casey's concern that a 'mistake by a BPF programmer could cause the whole system to blow up', which in full disclosure is only a rough approximation of his statement. We obviously can't speak directly to Casey's concerns. Casey, any specific technical comments on the challenges of using a common inode specific storage architecture? Song, FWIW going forward. I don't know how closely you follow LSM development, but we believe an unbiased observer would conclude that there is some degree of reticence about BPF's involvement with the LSM infrastructure by some of the core LSM maintainers, that in turn makes these types of conversations technically sensitive. > Song We will look forward to your thoughts on the above. Have a good week. As always, Dr. Greg The Quixote Project - Flailing at the Travails of Cybersecurity https://github.com/Quixote-Project
On 11/14/2024 8:36 AM, Dr. Greg wrote: > On Wed, Nov 13, 2024 at 10:57:05AM -0800, Song Liu wrote: > > Good morning, I hope the week is going well for everyone. > >> On Wed, Nov 13, 2024 at 10:06???AM Casey Schaufler <casey@schaufler-ca.com> wrote: >>> On 11/12/2024 5:37 PM, Song Liu wrote: >> [...] >>>> Could you provide more information on the definition of "more >>>> consistent" LSM infrastructure? >>> We're doing several things. The management of security blobs >>> (e.g. inode->i_security) has been moved out of the individual >>> modules and into the infrastructure. The use of a u32 secid is >>> being replaced with a more general lsm_prop structure, except >>> where networking code won't allow it. A good deal of work has >>> gone into making the return values of LSM hooks consistent. >> Thanks for the information. Unifying per-object memory usage of >> different LSMs makes sense. However, I don't think we are limiting >> any LSM to only use memory from the lsm_blobs. The LSMs still >> have the freedom to use other memory allocators. BPF inode >> local storage, just like other BPF maps, is a way to manage >> memory. BPF LSM programs have full access to BPF maps. So >> I don't think it makes sense to say this BPF map is used by tracing, >> so we should not allow LSM to use it. >> >> Does this make sense? > As involved bystanders, some questions and thoughts that may help > further the discussion. > > With respect to inode specific storage, the currently accepted pattern > in the LSM world is roughly as follows: > > The LSM initialization code, at boot, computes the total amount of > storage needed by all of the LSM's that are requesting inode specific > storage. A single pointer to that 'blob' of storage is included in > the inode structure. > > In an include file, an inline function similar to the following is > declared, whose purpose is to return the location inside of the > allocated storage or 'LSM inode blob' where a particular LSM's inode > specific data structure is located: > > static inline struct tsem_inode *tsem_inode(struct inode *inode) > { > return inode->i_security + tsem_blob_sizes.lbs_inode; > } > > In an LSM's implementation code, the function gets used in something > like the following manner: > > static int tsem_inode_alloc_security(struct inode *inode) > { > struct tsem_inode *tsip = tsem_inode(inode); > > /* Do something with the structure pointed to by tsip. */ > } > > Christian appears to have already chimed in and indicated that there > is no appetite to add another pointer member to the inode structure. > > So, if this were to proceed forward, is it proposed that there will be > a 'flag day' requirement to have each LSM that uses inode specific > storage implement a security_inode_alloc() event handler that creates > an LSM specific BPF map key/value pair for that inode? > > Which, in turn, would require that the accessor functions be converted > to use a bpf key request to return the LSM specific information for > that inode? > > A flag day event is always somewhat of a concern, but the larger > concern may be the substitution of simple pointer arithmetic for a > body of more complex code. One would assume with something like this, > that there may be a need for a shake-out period to determine what type > of potential regressions the more complex implementation may generate, > with regressions in security sensitive code always a concern. > > In a larger context. Given that the current implementation works on > simple pointer arithmetic over a common block of storage, there is not > much of a safety guarantee that one LSM couldn't interfere with the > inode storage of another LSM. However, using a generic BPF construct > such as a map, would presumably open the level of influence over LSM > specific inode storage to a much larger audience, presumably any BPF > program that would be loaded. > > The LSM inode information is obviously security sensitive, which I > presume would be be the motivation for Casey's concern that a 'mistake > by a BPF programmer could cause the whole system to blow up', which in > full disclosure is only a rough approximation of his statement. > > We obviously can't speak directly to Casey's concerns. Casey, any > specific technical comments on the challenges of using a common inode > specific storage architecture? My objection to using a union for the BPF and LSM pointer is based on the observation that a lot of modern programmers don't know what a union does. The BPF programmer would see that there are two ways to accomplish their task, one for CONFIG_SECURITY=y and the other for when it isn't. The second is much simpler. Not understanding how kernel configuration works, nor being "real" C language savvy, the programmer installs code using the simpler interfaces on a Redhat system. The SELinux inode data is compromised by the BPF code, which thinks the data is its own. Hilarity ensues. > > Song, FWIW going forward. I don't know how closely you follow LSM > development, but we believe an unbiased observer would conclude that > there is some degree of reticence about BPF's involvement with the LSM > infrastructure by some of the core LSM maintainers, that in turn makes > these types of conversations technically sensitive. > >> Song > We will look forward to your thoughts on the above. > > Have a good week. > > As always, > Dr. Greg > > The Quixote Project - Flailing at the Travails of Cybersecurity > https://github.com/Quixote-Project
Hi Dr. Greg, Thanks for your input on this. > On Nov 14, 2024, at 8:36 AM, Dr. Greg <greg@enjellic.com> wrote: > > On Wed, Nov 13, 2024 at 10:57:05AM -0800, Song Liu wrote: > > Good morning, I hope the week is going well for everyone. > >> On Wed, Nov 13, 2024 at 10:06???AM Casey Schaufler <casey@schaufler-ca.com> wrote: >>> >>> On 11/12/2024 5:37 PM, Song Liu wrote: >> [...] >>>> Could you provide more information on the definition of "more >>>> consistent" LSM infrastructure? >>> >>> We're doing several things. The management of security blobs >>> (e.g. inode->i_security) has been moved out of the individual >>> modules and into the infrastructure. The use of a u32 secid is >>> being replaced with a more general lsm_prop structure, except >>> where networking code won't allow it. A good deal of work has >>> gone into making the return values of LSM hooks consistent. >> >> Thanks for the information. Unifying per-object memory usage of >> different LSMs makes sense. However, I don't think we are limiting >> any LSM to only use memory from the lsm_blobs. The LSMs still >> have the freedom to use other memory allocators. BPF inode >> local storage, just like other BPF maps, is a way to manage >> memory. BPF LSM programs have full access to BPF maps. So >> I don't think it makes sense to say this BPF map is used by tracing, >> so we should not allow LSM to use it. >> >> Does this make sense? > > As involved bystanders, some questions and thoughts that may help > further the discussion. > > With respect to inode specific storage, the currently accepted pattern > in the LSM world is roughly as follows: > > The LSM initialization code, at boot, computes the total amount of > storage needed by all of the LSM's that are requesting inode specific > storage. A single pointer to that 'blob' of storage is included in > the inode structure. > > In an include file, an inline function similar to the following is > declared, whose purpose is to return the location inside of the > allocated storage or 'LSM inode blob' where a particular LSM's inode > specific data structure is located: > > static inline struct tsem_inode *tsem_inode(struct inode *inode) > { > return inode->i_security + tsem_blob_sizes.lbs_inode; > } > > In an LSM's implementation code, the function gets used in something > like the following manner: > > static int tsem_inode_alloc_security(struct inode *inode) > { > struct tsem_inode *tsip = tsem_inode(inode); > > /* Do something with the structure pointed to by tsip. */ > } Yes, I am fully aware how most LSMs allocate and use these inode/task/etc. storage. > Christian appears to have already chimed in and indicated that there > is no appetite to add another pointer member to the inode structure. If I understand Christian correctly, his concern comes from the size of inode, and thus the impact on memory footprint and CPU cache usage of all the inode in the system. While we got easier time adding a pointer to other data structures, for example socket, I personally acknowledge Christian's concern and I am motivated to make changes to reduce the size of inode. > So, if this were to proceed forward, is it proposed that there will be > a 'flag day' requirement to have each LSM that uses inode specific > storage implement a security_inode_alloc() event handler that creates > an LSM specific BPF map key/value pair for that inode? > > Which, in turn, would require that the accessor functions be converted > to use a bpf key request to return the LSM specific information for > that inode? I never thought about asking other LSMs to make any changes. At the moment, none of the BPF maps are available to none BPF code. > A flag day event is always somewhat of a concern, but the larger > concern may be the substitution of simple pointer arithmetic for a > body of more complex code. One would assume with something like this, > that there may be a need for a shake-out period to determine what type > of potential regressions the more complex implementation may generate, > with regressions in security sensitive code always a concern. > > In a larger context. Given that the current implementation works on > simple pointer arithmetic over a common block of storage, there is not > much of a safety guarantee that one LSM couldn't interfere with the > inode storage of another LSM. However, using a generic BPF construct > such as a map, would presumably open the level of influence over LSM > specific inode storage to a much larger audience, presumably any BPF > program that would be loaded. To be honest, I think bpf maps provide much better data isolation than a common block of storage. The creator of each bpf map has _full control_ who can access the map. The only exception is with CAP_SYS_ADMIN, where the root user can access all bpf maps in the system. I don't think this has any security concern over the common block of storage, as the root user can easily probe any data in the common block of storage via /proc/kcore. > > The LSM inode information is obviously security sensitive, which I > presume would be be the motivation for Casey's concern that a 'mistake > by a BPF programmer could cause the whole system to blow up', which in > full disclosure is only a rough approximation of his statement. > > We obviously can't speak directly to Casey's concerns. Casey, any > specific technical comments on the challenges of using a common inode > specific storage architecture? > > Song, FWIW going forward. I don't know how closely you follow LSM > development, but we believe an unbiased observer would conclude that > there is some degree of reticence about BPF's involvement with the LSM > infrastructure by some of the core LSM maintainers, that in turn makes > these types of conversations technically sensitive. I think I indeed got much more push back than I would imagine. However, as always, I value everyone's perspective and I am willing make reasonable changes to address valid concerns. Thanks, Song
> On Nov 14, 2024, at 9:29 AM, Casey Schaufler <casey@schaufler-ca.com> wrote: [...] >> >> >> The LSM inode information is obviously security sensitive, which I >> presume would be be the motivation for Casey's concern that a 'mistake >> by a BPF programmer could cause the whole system to blow up', which in >> full disclosure is only a rough approximation of his statement. >> >> We obviously can't speak directly to Casey's concerns. Casey, any >> specific technical comments on the challenges of using a common inode >> specific storage architecture? > > My objection to using a union for the BPF and LSM pointer is based > on the observation that a lot of modern programmers don't know what > a union does. The BPF programmer would see that there are two ways > to accomplish their task, one for CONFIG_SECURITY=y and the other > for when it isn't. The second is much simpler. Not understanding > how kernel configuration works, nor being "real" C language savvy, > the programmer installs code using the simpler interfaces on a > Redhat system. The SELinux inode data is compromised by the BPF > code, which thinks the data is its own. Hilarity ensues. There must be some serious misunderstanding here. So let me explain the idea again. With CONFIG_SECURITY=y, the code will work the same as right now. BPF inode storage uses i_security, just as any other LSMs. With CONFIG_SECURITY=n, i_security does not exist, so the bpf inode storage will use i_bpf_storage. Since this is a CONFIG_, all the logic got sorted out at compile time. Thus the user API (for user space and for bpf programs) stays the same. Actually, I can understand the concern with union. Although, the logic is set at kernel compile time, it is still possible for kernel source code to use i_bpf_storage when CONFIG_SECURITY is enabled. (Yes, I guess now I finally understand the concern). We can address this with something like following: #ifdef CONFIG_SECURITY void *i_security; #elif CONFIG_BPF_SYSCALL struct bpf_local_storage __rcu *i_bpf_storage; #endif This will help catch all misuse of the i_bpf_storage at compile time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. Does this make sense? Thanks, Song
On Thu, 2024-11-14 at 18:08 +0000, Song Liu wrote: > > > > On Nov 14, 2024, at 9:29 AM, Casey Schaufler > > <casey@schaufler-ca.com> wrote: > > [...] > > > > > > > > > > The LSM inode information is obviously security sensitive, which > > > I > > > presume would be be the motivation for Casey's concern that a > > > 'mistake > > > by a BPF programmer could cause the whole system to blow up', > > > which in > > > full disclosure is only a rough approximation of his statement. > > > > > > We obviously can't speak directly to Casey's concerns. Casey, > > > any > > > specific technical comments on the challenges of using a common > > > inode > > > specific storage architecture? > > > > My objection to using a union for the BPF and LSM pointer is based > > on the observation that a lot of modern programmers don't know what > > a union does. The BPF programmer would see that there are two ways > > to accomplish their task, one for CONFIG_SECURITY=y and the other > > for when it isn't. The second is much simpler. Not understanding > > how kernel configuration works, nor being "real" C language savvy, > > the programmer installs code using the simpler interfaces on a > > Redhat system. The SELinux inode data is compromised by the BPF > > code, which thinks the data is its own. Hilarity ensues. > > There must be some serious misunderstanding here. So let me > explain the idea again. > > With CONFIG_SECURITY=y, the code will work the same as right now. > BPF inode storage uses i_security, just as any other LSMs. > > With CONFIG_SECURITY=n, i_security does not exist, so the bpf > inode storage will use i_bpf_storage. > > Since this is a CONFIG_, all the logic got sorted out at compile > time. Thus the user API (for user space and for bpf programs) > stays the same. > > > Actually, I can understand the concern with union. Although, > the logic is set at kernel compile time, it is still possible > for kernel source code to use i_bpf_storage when > CONFIG_SECURITY is enabled. (Yes, I guess now I finally understand > the concern). > > We can address this with something like following: > > #ifdef CONFIG_SECURITY > void *i_security; > #elif CONFIG_BPF_SYSCALL > struct bpf_local_storage __rcu *i_bpf_storage; > #endif > > This will help catch all misuse of the i_bpf_storage at compile > time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. > > Does this make sense? Got to say I'm with Casey here, this will generate horrible and failure prone code. Since effectively you're making i_security always present anyway, simply do that and also pull the allocation code out of security.c in a way that it's always available? That way you don't have to special case the code depending on whether CONFIG_SECURITY is defined. Effectively this would give everyone a generic way to attach some memory area to an inode. I know it's more complex than this because there are LSM hooks that run from security_inode_alloc() but if you can make it work generically, I'm sure everyone will benefit. Regards, James
Hi James, Thanks for your input! > On Nov 14, 2024, at 1:49 PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote: [...] >> >> Actually, I can understand the concern with union. Although, >> the logic is set at kernel compile time, it is still possible >> for kernel source code to use i_bpf_storage when >> CONFIG_SECURITY is enabled. (Yes, I guess now I finally understand >> the concern). >> >> We can address this with something like following: >> >> #ifdef CONFIG_SECURITY >> void *i_security; >> #elif CONFIG_BPF_SYSCALL >> struct bpf_local_storage __rcu *i_bpf_storage; >> #endif >> >> This will help catch all misuse of the i_bpf_storage at compile >> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. >> >> Does this make sense? > > Got to say I'm with Casey here, this will generate horrible and failure > prone code. Yes, as I described in another email in the thread [1], this turned out to cause more troubles than I thought. > Since effectively you're making i_security always present anyway, > simply do that and also pull the allocation code out of security.c in a > way that it's always available? I think this is a very good idea. If folks agree with this approach, I am more than happy to draft patch for this. Thanks again, Song > That way you don't have to special > case the code depending on whether CONFIG_SECURITY is defined. > Effectively this would give everyone a generic way to attach some > memory area to an inode. I know it's more complex than this because > there are LSM hooks that run from security_inode_alloc() but if you can > make it work generically, I'm sure everyone will benefit. [1] https://lore.kernel.org/linux-fsdevel/86C65B85-8167-4D04-BFF5-40FD4F3407A4@fb.com/
Hi Christian, James and Jan, > On Nov 14, 2024, at 1:49 PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote: [...] >> >> We can address this with something like following: >> >> #ifdef CONFIG_SECURITY >> void *i_security; >> #elif CONFIG_BPF_SYSCALL >> struct bpf_local_storage __rcu *i_bpf_storage; >> #endif >> >> This will help catch all misuse of the i_bpf_storage at compile >> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. >> >> Does this make sense? > > Got to say I'm with Casey here, this will generate horrible and failure > prone code. > > Since effectively you're making i_security always present anyway, > simply do that and also pull the allocation code out of security.c in a > way that it's always available? That way you don't have to special > case the code depending on whether CONFIG_SECURITY is defined. > Effectively this would give everyone a generic way to attach some > memory area to an inode. I know it's more complex than this because > there are LSM hooks that run from security_inode_alloc() but if you can > make it work generically, I'm sure everyone will benefit. On a second thought, I think making i_security generic is not the right solution for "BPF inode storage in tracing use cases". This is because i_security serves a very specific use case: it points to a piece of memory whose size is calculated at system boot time. If some of the supported LSMs is not enabled by the lsm= kernel arg, the kernel will not allocate memory in i_security for them. The only way to change lsm= is to reboot the system. BPF LSM programs can be disabled at the boot time, which fits well in i_security. However, BPF tracing programs cannot be disabled at boot time (even we change the code to make it possible, we are not likely to disable BPF tracing). IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some BPF tracing programs to load at some point of time, and these programs may use BPF inode storage. Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory always will be attached to i_security (maybe under a different name, say, i_generic) of every inode. In this case, we should really add i_bpf_storage directly to the inode, because another pointer jump via i_generic gives nothing but overhead. Does this make sense? Or did I misunderstand the suggestion? Thanks, Song
On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote: > Hi Christian, James and Jan, Good morning, I hope the day is starting well for everyone. > > On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > [...] > > >> > >> We can address this with something like following: > >> > >> #ifdef CONFIG_SECURITY > >> void *i_security; > >> #elif CONFIG_BPF_SYSCALL > >> struct bpf_local_storage __rcu *i_bpf_storage; > >> #endif > >> > >> This will help catch all misuse of the i_bpf_storage at compile > >> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. > >> > >> Does this make sense? > > > > Got to say I'm with Casey here, this will generate horrible and failure > > prone code. > > > > Since effectively you're making i_security always present anyway, > > simply do that and also pull the allocation code out of security.c in a > > way that it's always available? That way you don't have to special > > case the code depending on whether CONFIG_SECURITY is defined. > > Effectively this would give everyone a generic way to attach some > > memory area to an inode. I know it's more complex than this because > > there are LSM hooks that run from security_inode_alloc() but if you can > > make it work generically, I'm sure everyone will benefit. > On a second thought, I think making i_security generic is not > the right solution for "BPF inode storage in tracing use cases". > > This is because i_security serves a very specific use case: it > points to a piece of memory whose size is calculated at system > boot time. If some of the supported LSMs is not enabled by the > lsm= kernel arg, the kernel will not allocate memory in > i_security for them. The only way to change lsm= is to reboot > the system. BPF LSM programs can be disabled at the boot time, > which fits well in i_security. However, BPF tracing programs > cannot be disabled at boot time (even we change the code to > make it possible, we are not likely to disable BPF tracing). > IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some > BPF tracing programs to load at some point of time, and these > programs may use BPF inode storage. > > Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory > always will be attached to i_security (maybe under a different > name, say, i_generic) of every inode. In this case, we should > really add i_bpf_storage directly to the inode, because another > pointer jump via i_generic gives nothing but overhead. > > Does this make sense? Or did I misunderstand the suggestion? There is a colloquialism that seems relevant here: "Pick your poison". In the greater interests of the kernel, it seems that a generic mechanism for attaching per inode information is the only realistic path forward, unless Christian changes his position on expanding the size of struct inode. There are two pathways forward. 1.) Attach a constant size 'blob' of storage to each inode. This is a similar approach to what the LSM uses where each blob is sized as follows: S = U * sizeof(void *) Where U is the number of sub-systems that have a desire to use inode specific storage. Each sub-system uses it's pointer slot to manage any additional storage that it desires to attach to the inode. This has the obvious advantage of O(1) cost complexity for any sub-system that wants to access its inode specific storage. The disadvantage, as you note, is that it wastes memory if a sub-system does not elect to attach per inode information, for example the tracing infrastructure. This disadvantage is parried by the fact that it reduces the size of the inode proper by 24 bytes (4 pointers down to 1) and allows future extensibility without colliding with the interests and desires of the VFS maintainers. 2.) Implement key/value mapping for inode specific storage. The key would be a sub-system specific numeric value that returns a pointer the sub-system uses to manage its inode specific memory for a particular inode. A participating sub-system in turn uses its identifier to register an inode specific pointer for its sub-system. This strategy loses O(1) lookup complexity but reduces total memory consumption and only imposes memory costs for inodes when a sub-system desires to use inode specific storage. Approach 2 requires the introduction of generic infrastructure that allows an inode's key/value mappings to be located, presumably based on the inode's pointer value. We could probably just resurrect the old IMA iint code for this purpose. In the end it comes down to a rather standard trade-off in this business, memory vs. execution cost. We would posit that option 2 is the only viable scheme if the design metric is overall good for the Linux kernel eco-system. > Thanks, > Song Have a good day. As always, Dr. Greg The Quixote Project - Flailing at the Travails of Cybersecurity https://github.com/Quixote-Project
On 11/19/2024 4:27 AM, Dr. Greg wrote: > On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote: > >> Hi Christian, James and Jan, > Good morning, I hope the day is starting well for everyone. > >>> On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote: >> [...] >> >>>> We can address this with something like following: >>>> >>>> #ifdef CONFIG_SECURITY >>>> void *i_security; >>>> #elif CONFIG_BPF_SYSCALL >>>> struct bpf_local_storage __rcu *i_bpf_storage; >>>> #endif >>>> >>>> This will help catch all misuse of the i_bpf_storage at compile >>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. >>>> >>>> Does this make sense? >>> Got to say I'm with Casey here, this will generate horrible and failure >>> prone code. >>> >>> Since effectively you're making i_security always present anyway, >>> simply do that and also pull the allocation code out of security.c in a >>> way that it's always available? That way you don't have to special >>> case the code depending on whether CONFIG_SECURITY is defined. >>> Effectively this would give everyone a generic way to attach some >>> memory area to an inode. I know it's more complex than this because >>> there are LSM hooks that run from security_inode_alloc() but if you can >>> make it work generically, I'm sure everyone will benefit. >> On a second thought, I think making i_security generic is not >> the right solution for "BPF inode storage in tracing use cases". >> >> This is because i_security serves a very specific use case: it >> points to a piece of memory whose size is calculated at system >> boot time. If some of the supported LSMs is not enabled by the >> lsm= kernel arg, the kernel will not allocate memory in >> i_security for them. The only way to change lsm= is to reboot >> the system. BPF LSM programs can be disabled at the boot time, >> which fits well in i_security. However, BPF tracing programs >> cannot be disabled at boot time (even we change the code to >> make it possible, we are not likely to disable BPF tracing). >> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some >> BPF tracing programs to load at some point of time, and these >> programs may use BPF inode storage. >> >> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory >> always will be attached to i_security (maybe under a different >> name, say, i_generic) of every inode. In this case, we should >> really add i_bpf_storage directly to the inode, because another >> pointer jump via i_generic gives nothing but overhead. >> >> Does this make sense? Or did I misunderstand the suggestion? > There is a colloquialism that seems relevant here: "Pick your poison". > > In the greater interests of the kernel, it seems that a generic > mechanism for attaching per inode information is the only realistic > path forward, unless Christian changes his position on expanding > the size of struct inode. > > There are two pathways forward. > > 1.) Attach a constant size 'blob' of storage to each inode. > > This is a similar approach to what the LSM uses where each blob is > sized as follows: > > S = U * sizeof(void *) > > Where U is the number of sub-systems that have a desire to use inode > specific storage. I can't tell for sure, but it looks like you don't understand how LSM i_security blobs are used. It is *not* the case that each LSM gets a pointer in the i_security blob. Each LSM that wants storage tells the infrastructure at initialization time how much space it wants in the blob. That can be a pointer, but usually it's a struct with flags, pointers and even lists. > Each sub-system uses it's pointer slot to manage any additional > storage that it desires to attach to the inode. Again, an LSM may choose to do it that way, but most don't. SELinux and Smack need data on every inode. It makes much more sense to put it directly in the blob than to allocate a separate chunk for every inode. > This has the obvious advantage of O(1) cost complexity for any > sub-system that wants to access its inode specific storage. > > The disadvantage, as you note, is that it wastes memory if a > sub-system does not elect to attach per inode information, for example > the tracing infrastructure. To be clear, that disadvantage only comes up if the sub-system uses inode data on an occasional basis. If it never uses inode data there is no need to have a pointer to it. > This disadvantage is parried by the fact that it reduces the size of > the inode proper by 24 bytes (4 pointers down to 1) and allows future > extensibility without colliding with the interests and desires of the > VFS maintainers. You're adding a level of indirection. Even I would object based on the performance impact. > 2.) Implement key/value mapping for inode specific storage. > > The key would be a sub-system specific numeric value that returns a > pointer the sub-system uses to manage its inode specific memory for a > particular inode. > > A participating sub-system in turn uses its identifier to register an > inode specific pointer for its sub-system. > > This strategy loses O(1) lookup complexity but reduces total memory > consumption and only imposes memory costs for inodes when a sub-system > desires to use inode specific storage. SELinux and Smack use an inode blob for every inode. The performance regression boggles the mind. Not to mention the additional complexity of managing the memory. > Approach 2 requires the introduction of generic infrastructure that > allows an inode's key/value mappings to be located, presumably based > on the inode's pointer value. We could probably just resurrect the > old IMA iint code for this purpose. > > In the end it comes down to a rather standard trade-off in this > business, memory vs. execution cost. > > We would posit that option 2 is the only viable scheme if the design > metric is overall good for the Linux kernel eco-system. No. Really, no. You need look no further than secmarks to understand how a key based blob allocation scheme leads to tears. Keys are fine in the case where use of data is sparse. They have no place when data use is the norm. >> Thanks, >> Song > Have a good day. > > As always, > Dr. Greg > > The Quixote Project - Flailing at the Travails of Cybersecurity > https://github.com/Quixote-Project >
> On Nov 19, 2024, at 10:14 AM, Casey Schaufler <casey@schaufler-ca.com> wrote: > > On 11/19/2024 4:27 AM, Dr. Greg wrote: >> On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote: >> >>> Hi Christian, James and Jan, >> Good morning, I hope the day is starting well for everyone. >> >>>> On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote: >>> [...] >>> >>>>> We can address this with something like following: >>>>> >>>>> #ifdef CONFIG_SECURITY >>>>> void *i_security; >>>>> #elif CONFIG_BPF_SYSCALL >>>>> struct bpf_local_storage __rcu *i_bpf_storage; >>>>> #endif >>>>> >>>>> This will help catch all misuse of the i_bpf_storage at compile >>>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. >>>>> >>>>> Does this make sense? >>>> Got to say I'm with Casey here, this will generate horrible and failure >>>> prone code. >>>> >>>> Since effectively you're making i_security always present anyway, >>>> simply do that and also pull the allocation code out of security.c in a >>>> way that it's always available? That way you don't have to special >>>> case the code depending on whether CONFIG_SECURITY is defined. >>>> Effectively this would give everyone a generic way to attach some >>>> memory area to an inode. I know it's more complex than this because >>>> there are LSM hooks that run from security_inode_alloc() but if you can >>>> make it work generically, I'm sure everyone will benefit. >>> On a second thought, I think making i_security generic is not >>> the right solution for "BPF inode storage in tracing use cases". >>> >>> This is because i_security serves a very specific use case: it >>> points to a piece of memory whose size is calculated at system >>> boot time. If some of the supported LSMs is not enabled by the >>> lsm= kernel arg, the kernel will not allocate memory in >>> i_security for them. The only way to change lsm= is to reboot >>> the system. BPF LSM programs can be disabled at the boot time, >>> which fits well in i_security. However, BPF tracing programs >>> cannot be disabled at boot time (even we change the code to >>> make it possible, we are not likely to disable BPF tracing). >>> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some >>> BPF tracing programs to load at some point of time, and these >>> programs may use BPF inode storage. >>> >>> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory >>> always will be attached to i_security (maybe under a different >>> name, say, i_generic) of every inode. In this case, we should >>> really add i_bpf_storage directly to the inode, because another >>> pointer jump via i_generic gives nothing but overhead. >>> >>> Does this make sense? Or did I misunderstand the suggestion? >> There is a colloquialism that seems relevant here: "Pick your poison". >> >> In the greater interests of the kernel, it seems that a generic >> mechanism for attaching per inode information is the only realistic >> path forward, unless Christian changes his position on expanding >> the size of struct inode. >> >> There are two pathways forward. >> >> 1.) Attach a constant size 'blob' of storage to each inode. >> >> This is a similar approach to what the LSM uses where each blob is >> sized as follows: >> >> S = U * sizeof(void *) >> >> Where U is the number of sub-systems that have a desire to use inode >> specific storage. > > I can't tell for sure, but it looks like you don't understand how > LSM i_security blobs are used. It is *not* the case that each LSM > gets a pointer in the i_security blob. Each LSM that wants storage > tells the infrastructure at initialization time how much space it > wants in the blob. That can be a pointer, but usually it's a struct > with flags, pointers and even lists. > >> Each sub-system uses it's pointer slot to manage any additional >> storage that it desires to attach to the inode. > > Again, an LSM may choose to do it that way, but most don't. > SELinux and Smack need data on every inode. It makes much more sense > to put it directly in the blob than to allocate a separate chunk > for every inode. AFAICT, i_security is somehow unique in the way that its size is calculated at boot time. I guess we will just keep most LSM users behind. > >> This has the obvious advantage of O(1) cost complexity for any >> sub-system that wants to access its inode specific storage. >> >> The disadvantage, as you note, is that it wastes memory if a >> sub-system does not elect to attach per inode information, for example >> the tracing infrastructure. > > To be clear, that disadvantage only comes up if the sub-system uses > inode data on an occasional basis. If it never uses inode data there > is no need to have a pointer to it. > >> This disadvantage is parried by the fact that it reduces the size of >> the inode proper by 24 bytes (4 pointers down to 1) and allows future >> extensibility without colliding with the interests and desires of the >> VFS maintainers. > > You're adding a level of indirection. Even I would object based on > the performance impact. > >> 2.) Implement key/value mapping for inode specific storage. >> >> The key would be a sub-system specific numeric value that returns a >> pointer the sub-system uses to manage its inode specific memory for a >> particular inode. >> >> A participating sub-system in turn uses its identifier to register an >> inode specific pointer for its sub-system. >> >> This strategy loses O(1) lookup complexity but reduces total memory >> consumption and only imposes memory costs for inodes when a sub-system >> desires to use inode specific storage. > > SELinux and Smack use an inode blob for every inode. The performance > regression boggles the mind. Not to mention the additional complexity > of managing the memory. > >> Approach 2 requires the introduction of generic infrastructure that >> allows an inode's key/value mappings to be located, presumably based >> on the inode's pointer value. We could probably just resurrect the >> old IMA iint code for this purpose. >> >> In the end it comes down to a rather standard trade-off in this >> business, memory vs. execution cost. >> >> We would posit that option 2 is the only viable scheme if the design >> metric is overall good for the Linux kernel eco-system. > > No. Really, no. You need look no further than secmarks to understand > how a key based blob allocation scheme leads to tears. Keys are fine > in the case where use of data is sparse. They have no place when data > use is the norm. OTOH, I think some on-demand key-value storage makes sense for many other use cases, such as BPF (LSM and tracing), file lock, fanotify, etc. Overall, I think we have 3 types storages attached to inode: 1. Embedded in struct inode, gated by CONFIG_*. 2. Behind i_security (or maybe call it a different name if we can find other uses for it). The size is calculated at boot time. 3. Behind a key-value storage. To evaluate these categories, we have: Speed: 1 > 2 > 3 Flexibility: 3 > 2 > 1 We don't really have 3 right now. I think the direction is to create it. BPF inode storage is a key-value store. If we can get another user for 3, in addition to BPF inode storage, it should be a net win. Does this sound like a viable path forward? Thanks, Song
On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote: Good morning, I hope the day is goning well for everyone. > On 11/19/2024 4:27 AM, Dr. Greg wrote: > > On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote: > > > >> Hi Christian, James and Jan, > > Good morning, I hope the day is starting well for everyone. > > > >>> On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > >> [...] > >> > >>>> We can address this with something like following: > >>>> > >>>> #ifdef CONFIG_SECURITY > >>>> void *i_security; > >>>> #elif CONFIG_BPF_SYSCALL > >>>> struct bpf_local_storage __rcu *i_bpf_storage; > >>>> #endif > >>>> > >>>> This will help catch all misuse of the i_bpf_storage at compile > >>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. > >>>> > >>>> Does this make sense? > >>> Got to say I'm with Casey here, this will generate horrible and failure > >>> prone code. > >>> > >>> Since effectively you're making i_security always present anyway, > >>> simply do that and also pull the allocation code out of security.c in a > >>> way that it's always available? That way you don't have to special > >>> case the code depending on whether CONFIG_SECURITY is defined. > >>> Effectively this would give everyone a generic way to attach some > >>> memory area to an inode. I know it's more complex than this because > >>> there are LSM hooks that run from security_inode_alloc() but if you can > >>> make it work generically, I'm sure everyone will benefit. > >> On a second thought, I think making i_security generic is not > >> the right solution for "BPF inode storage in tracing use cases". > >> > >> This is because i_security serves a very specific use case: it > >> points to a piece of memory whose size is calculated at system > >> boot time. If some of the supported LSMs is not enabled by the > >> lsm= kernel arg, the kernel will not allocate memory in > >> i_security for them. The only way to change lsm= is to reboot > >> the system. BPF LSM programs can be disabled at the boot time, > >> which fits well in i_security. However, BPF tracing programs > >> cannot be disabled at boot time (even we change the code to > >> make it possible, we are not likely to disable BPF tracing). > >> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some > >> BPF tracing programs to load at some point of time, and these > >> programs may use BPF inode storage. > >> > >> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory > >> always will be attached to i_security (maybe under a different > >> name, say, i_generic) of every inode. In this case, we should > >> really add i_bpf_storage directly to the inode, because another > >> pointer jump via i_generic gives nothing but overhead. > >> > >> Does this make sense? Or did I misunderstand the suggestion? > > There is a colloquialism that seems relevant here: "Pick your poison". > > > > In the greater interests of the kernel, it seems that a generic > > mechanism for attaching per inode information is the only realistic > > path forward, unless Christian changes his position on expanding > > the size of struct inode. > > > > There are two pathways forward. > > > > 1.) Attach a constant size 'blob' of storage to each inode. > > > > This is a similar approach to what the LSM uses where each blob is > > sized as follows: > > > > S = U * sizeof(void *) > > > > Where U is the number of sub-systems that have a desire to use inode > > specific storage. > I can't tell for sure, but it looks like you don't understand how > LSM i_security blobs are used. It is *not* the case that each LSM > gets a pointer in the i_security blob. Each LSM that wants storage > tells the infrastructure at initialization time how much space it > wants in the blob. That can be a pointer, but usually it's a struct > with flags, pointers and even lists. I can state unequivocably for everyone's benefit, that as a team, we have an intimate understanding of how LSM i_security blobs are used. It was 0500 in the morning when I wrote the reply and I had personally been working for 22 hours straight, so my apologies for being imprecise. I should not have specified sizeof(void *), I should have written sizeof(allocation), for lack of a better syntax. Also for the record, in a universal allocation scheme, when I say sub-system I mean any implementation that would make use of per inode information. So the LSM, bpf tracing et.al., could all be considered sub-systems that would register at boot time for a section of the arena. > > Each sub-system uses it's pointer slot to manage any additional > > storage that it desires to attach to the inode. > Again, an LSM may choose to do it that way, but most don't. SELinux > and Smack need data on every inode. It makes much more sense to put > it directly in the blob than to allocate a separate chunk for every > inode. See my correction above. > > This has the obvious advantage of O(1) cost complexity for any > > sub-system that wants to access its inode specific storage. > > > > The disadvantage, as you note, is that it wastes memory if a > > sub-system does not elect to attach per inode information, for example > > the tracing infrastructure. > To be clear, that disadvantage only comes up if the sub-system uses > inode data on an occasional basis. If it never uses inode data there > is no need to have a pointer to it. I think we all agree on that, therein lies the rub with a common arena architecture, which is why I indicated in my earlier e-mail that this comes down to engineering trade-off decisions. That is why there would be a probable assumption that such sub-systems would only request a pointer per arena slot and use that to reference a dynamically allocated structure. If, as a group, we are really concerned about inode memory consumption the assumption would be that the maintainers would have to whine about sparse consumers requesting a structure sized allocation rather than a pointer sized allocation. > > This disadvantage is parried by the fact that it reduces the size of > > the inode proper by 24 bytes (4 pointers down to 1) and allows future > > extensibility without colliding with the interests and desires of the > > VFS maintainers. > You're adding a level of indirection. Even I would object based on > the performance impact. I'm not sure that a thorough and complete analysis of the costs associated with a sub-system touching inode local storage would support the notion of a tangible performance hit. The pointer in an arena slot would presumably be a pointer to a data structure that a sub-system allocates at inode creation time. After computing the arena slot address in classic style (i_arena + offset) the sub-system uses the pointer at that location to dereference its structure elements or to find subordinate members in its arena. If we take SMACK as an example, the smack inode contains three pointers and a scalar. So, if there is a need to access storage behind one of those pointers, there is an additional indirection hit. The three pointers are each to a structure (smack_known) that has three list pointers and a mutex lock inside of it. The SeLinux inode has a back pointer to the sponsoring inode, a list head, a spinlock and some scalars. So there is lots of potential indirection and locking going on with access to inode local storage. To extend further, for everyone thinking about this from an engineering perspective. A common arena model where everyone asks for a structure sized blob is inherently cache pessimal. Unless you are the first blob in the arena you are going to need to hit another cache-line in order to start the indirection process for your structure. A pointer based arena architecture would allow up to eight sub-systems to get their inode storage pointer for the cost of a single cache-line fetch. Let me offer another line of thinking on this drawn from the discussion above. A further optimization in the single pointer arena model is for the LSM to place a pointer to a standard LSM sized memory blob in its pointer slot on behalf of all the individual LSM's. All of the individual participating LSM's take that pointer and do the offset calculation into the LSM arena for that inode as they normally would. So there would seem to be a lot of engineering issues to consider that are beyond the simple predicate that indirection is bad. See, I do understand how the LSM arena model works. > > 2.) Implement key/value mapping for inode specific storage. > > > > The key would be a sub-system specific numeric value that returns a > > pointer the sub-system uses to manage its inode specific memory for a > > particular inode. > > > > A participating sub-system in turn uses its identifier to register an > > inode specific pointer for its sub-system. > > > > This strategy loses O(1) lookup complexity but reduces total memory > > consumption and only imposes memory costs for inodes when a sub-system > > desires to use inode specific storage. > SELinux and Smack use an inode blob for every inode. The performance > regression boggles the mind. Not to mention the additional > complexity of managing the memory. I guess we would have to measure the performance impacts to understand their level of mind boggliness. My first thought is that we hear a huge amount of fanfare about BPF being a game changer for tracing and network monitoring. Given current networking speeds, if its ability to manage storage needed for it purposes are truely abysmal the industry wouldn't be finding the technology useful. Beyond that. As I noted above, the LSM could be an independent subscriber. The pointer to register would come from the the kmem_cache allocator as it does now, so that cost is idempotent with the current implementation. The pointer registration would also be a single instance cost. So the primary cost differential over the common arena model will be the complexity costs associated with lookups in a red/black tree, if we used the old IMA integrity cache as an example implementation. As I noted above, these per inode local storage structures are complex in of themselves, including lists and locks. If touching an inode involves locking and walking lists and the like it would seem that those performance impacts would quickly swamp an r/b lookup cost. > > Approach 2 requires the introduction of generic infrastructure that > > allows an inode's key/value mappings to be located, presumably based > > on the inode's pointer value. We could probably just resurrect the > > old IMA iint code for this purpose. > > > > In the end it comes down to a rather standard trade-off in this > > business, memory vs. execution cost. > > > > We would posit that option 2 is the only viable scheme if the design > > metric is overall good for the Linux kernel eco-system. > No. Really, no. You need look no further than secmarks to understand > how a key based blob allocation scheme leads to tears. Keys are fine > in the case where use of data is sparse. They have no place when data > use is the norm. Then it would seem that we need to get everyone to agree that we can get by with using two pointers in struct inode. One for uses best served by common arena allocation and one for a key/pointer mapping, and then convert the sub-systems accordingly. Or alternately, getting everyone to agree that allocating a mininum of eight additional bytes for every subscriber to private inode data isn't the end of the world, even if use of the resource is sparse. Of course, experience would suggest, that getting everyone in this community to agree on something is roughly akin to throwing a hand grenade into a chicken coop with an expectation that all of the chickens will fly out in a uniform flock formation. As always, Dr. Greg The Quixote Project - Flailing at the Travails of Cybersecurity https://github.com/Quixote-Project
Hi Dr. Greg, Thanks for your input! > On Nov 20, 2024, at 8:54 AM, Dr. Greg <greg@enjellic.com> wrote: > > On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote: [...] > >>> 2.) Implement key/value mapping for inode specific storage. >>> >>> The key would be a sub-system specific numeric value that returns a >>> pointer the sub-system uses to manage its inode specific memory for a >>> particular inode. >>> >>> A participating sub-system in turn uses its identifier to register an >>> inode specific pointer for its sub-system. >>> >>> This strategy loses O(1) lookup complexity but reduces total memory >>> consumption and only imposes memory costs for inodes when a sub-system >>> desires to use inode specific storage. > >> SELinux and Smack use an inode blob for every inode. The performance >> regression boggles the mind. Not to mention the additional >> complexity of managing the memory. > > I guess we would have to measure the performance impacts to understand > their level of mind boggliness. > > My first thought is that we hear a huge amount of fanfare about BPF > being a game changer for tracing and network monitoring. Given > current networking speeds, if its ability to manage storage needed for > it purposes are truely abysmal the industry wouldn't be finding the > technology useful. > > Beyond that. > > As I noted above, the LSM could be an independent subscriber. The > pointer to register would come from the the kmem_cache allocator as it > does now, so that cost is idempotent with the current implementation. > The pointer registration would also be a single instance cost. > > So the primary cost differential over the common arena model will be > the complexity costs associated with lookups in a red/black tree, if > we used the old IMA integrity cache as an example implementation. > > As I noted above, these per inode local storage structures are complex > in of themselves, including lists and locks. If touching an inode > involves locking and walking lists and the like it would seem that > those performance impacts would quickly swamp an r/b lookup cost. bpf local storage is designed to be an arena like solution that works for multiple bpf maps (and we don't know how many of maps we need ahead of time). Therefore, we may end up doing what you suggested earlier: every LSM should use bpf inode storage. ;) I am only 90% kidding. > >>> Approach 2 requires the introduction of generic infrastructure that >>> allows an inode's key/value mappings to be located, presumably based >>> on the inode's pointer value. We could probably just resurrect the >>> old IMA iint code for this purpose. >>> >>> In the end it comes down to a rather standard trade-off in this >>> business, memory vs. execution cost. >>> >>> We would posit that option 2 is the only viable scheme if the design >>> metric is overall good for the Linux kernel eco-system. > >> No. Really, no. You need look no further than secmarks to understand >> how a key based blob allocation scheme leads to tears. Keys are fine >> in the case where use of data is sparse. They have no place when data >> use is the norm. > > Then it would seem that we need to get everyone to agree that we can > get by with using two pointers in struct inode. One for uses best > served by common arena allocation and one for a key/pointer mapping, > and then convert the sub-systems accordingly. > > Or alternately, getting everyone to agree that allocating a mininum of > eight additional bytes for every subscriber to private inode data > isn't the end of the world, even if use of the resource is sparse. Christian suggested we can use an inode_addon structure, which is similar to this idea. It won't work well in all contexts, though. So it is not as good as other bpf local storage (task, sock, cgroup). Thanks, Song > > Of course, experience would suggest, that getting everyone in this > community to agree on something is roughly akin to throwing a hand > grenade into a chicken coop with an expectation that all of the > chickens will fly out in a uniform flock formation. > > As always, > Dr. Greg > > The Quixote Project - Flailing at the Travails of Cybersecurity > https://github.com/Quixote-Project