Message ID | 1503686.1591113304@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [GIT,PULL] General notification queue and key notifications | expand |
On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote: > > [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak > and > Ian Kent have been working on making libmount use them, > preparatory to > working on systemd: > > https://github.com/karelzak/util-linux/commits/topic/fsinfo > > https://github.com/raven-au/util-linux/commits/topic/fsinfo.public > > Development has stalled briefly due to other commitments, so I'm > not > sure I can ask you to pull those parts of the series for > now. Christian > Brauner would like to use them in lxc, but hasn't started. > ]] Linus, Just so your aware of what has been done and where we are at here's a summary. Karel has done quite a bit of work on libmount (at this stage it's getting hold of the mount information, aka. fsinfo()) and most of what I have done is included in that too which you can see in Karel's repo above). You can see a couple of bug fixes and a little bit of new code present in my repo which hasn't been sent over to Karel yet. This infrastructure is essential before notifications work is started which is where we will see the most improvement. It turns out that while systemd uses libmount it has it's own notifications handling sub-system as it deals with several event types, not just mount information, in the same area. So, unfortunately, changes will need to be made there as well as in libmount, more so than the trivial changes to use fsinfo() via libmount. That's where we are at the moment and I will get back to it once I've dealt with a few things I postponed to work on libmount. If you would like a more detailed account of what we have found I can provide that too. Is there anything else you would like from me or Karel? Ian
On Wed, 2020-06-03 at 10:15 +0800, Ian Kent wrote: > On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote: > > [[ With regard to the mount/sb notifications and fsinfo(), Karel > > Zak > > and > > Ian Kent have been working on making libmount use them, > > preparatory to > > working on systemd: > > > > https://github.com/karelzak/util-linux/commits/topic/fsinfo > > > > https://github.com/raven-au/util-linux/commits/topic/fsinfo.public > > > > Development has stalled briefly due to other commitments, so I'm > > not > > sure I can ask you to pull those parts of the series for > > now. Christian > > Brauner would like to use them in lxc, but hasn't started. > > ]] > > Linus, > > Just so your aware of what has been done and where we are at here's > a summary. > > Karel has done quite a bit of work on libmount (at this stage it's > getting hold of the mount information, aka. fsinfo()) and most of > what I have done is included in that too which you can see in Karel's > repo above). You can see a couple of bug fixes and a little bit of > new code present in my repo which hasn't been sent over to Karel > yet. > > This infrastructure is essential before notifications work is started > which is where we will see the most improvement. > > It turns out that while systemd uses libmount it has it's own > notifications handling sub-system as it deals with several event > types, not just mount information, in the same area. So, > unfortunately, > changes will need to be made there as well as in libmount, more so > than the trivial changes to use fsinfo() via libmount. > > That's where we are at the moment and I will get back to it once > I've dealt with a few things I postponed to work on libmount. > > If you would like a more detailed account of what we have found I > can provide that too. > > Is there anything else you would like from me or Karel? I think there's a bit more I should say about this. One reason work hasn't progressed further on this is I spent quite a bit of time looking at the affects of using fsinfo(). My testing was done by using a large autofs direct mount map of 20000 entries which means that at autofs startup 20000 autofs mounts must be done and at autofs shutdown those 20000 mounts must be umounted. Not very scientific but something to use to get a feel for the affect of our changes. Initially just using fsinfo() to load all the mount entries was done to see how that would perform. This was done in a way that required no modifications to library user code but didn't get much improvement. Next loading all the mount ids (alone) for mount entry traversal was done and the various fields retrieved on-demand (implemented by Karel). Loading the entire mount table and then traversing the entries means the mount table is always possibly out of date. And loading the ids and getting the fields on-demand might have made that problem worse. But loading only the mount ids and using an on-demand method to get needed fields worked surprisingly well. The main issue is a mount going away while getting the fields. Testing showed that simply checking the field is valid and ignoring the entry if it isn't is enough to handle that case. Also the mount going away after the needed fields have been retrieved must be handled by callers of libmount as mounts can just as easily go away after reading the proc based tables. The case of the underlying mount information changing needs to be considered too. We will need to do better on that in the future but it too is a problem with the proc table handing and hasn't seen problems logged against libmount for it AFAIK. So, all in all, this approach worked pretty well as libmount users do use the getter access methods to retrieve the mount entry fields (which is required for the on-demand method to work). Certainly systemd always uses them (and it looks like udisks2 does too). Unfortunately using the libmount on-demand implementation requires library user code be modified (only a little in the systemd case) to use the implementation. Testing showed that we get between 10-15% reduction in overhead and CPU usage remained high. I think processing large numbers of mounts is simply a lot of work and there are particular cases that will remain that require the use of the load and traverse method. For example matching all mounts with a given prefix string (one of the systemd use cases). It's hard to get information about this but I can say that running pref during the autofs start and stop shows the bulk of the counter hits on the fsinfo() table construction code so that ahs to be where the overhead is. The unavoidable conclusion is that the load and traverse method that's been imposed on us for so long (even before libmount) for mount handling is what we need to get away from. After all, this is essentially where the problem comes from in the first place. And fsinfo() is designed to not need to use this method for getting mount information for that reason. There's also the notifications side of things which is the next area to work on. Looking at systemd I see that monitoring the proc mount table leads to a load, traverse, and process of the entire table for every single notification. It's clear that's because of the (what I'll call) anonymous notifications that we have now. The notifications in David's series carry event specific information, for example the mount id for mount notifications and the libmount fsinfo() implementation is written to use the mount id (lowest overhead lookup option), so there has to be significant improvement for this case. But systemd has it's own notifications handling code so there will need to be non-trivial changes there as well as changes in libmount. Bottom line is we have a bit of a challenge with this because we are trying to change coding practices developed over many years that, necessarily, use a load/traverse method and it's going to take quite a while to change these coding practices. My question is, is there something specific, besides what we are doing, that you'd like to see done now in order to get the series merged? Ian
On Tue, Jun 02, 2020 at 04:55:04PM +0100, David Howells wrote: > Date: Tue, 02 Jun 2020 16:51:44 +0100 > > Hi Linus, > > Can you pull this, please? It adds a general notification queue concept > and adds an event source for keys/keyrings, such as linking and unlinking > keys and changing their attributes. > > Thanks to Debarshi Ray, we do have a pull request to use this to fix a > problem with gnome-online-accounts - as mentioned last time: > > https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47 > > Without this, g-o-a has to constantly poll a keyring-based kerberos cache > to find out if kinit has changed anything. > > [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and The mount/sb notification and fsinfo() stuff is something we'd like to use. (And then later extend to allow for supervised mounts where a container manager can supervise the mounts of an unprivileged container.) I'm not sure if the mount notifications are already part of this pr. Christian > Ian Kent have been working on making libmount use them, preparatory to > working on systemd: > > https://github.com/karelzak/util-linux/commits/topic/fsinfo > https://github.com/raven-au/util-linux/commits/topic/fsinfo.public > > Development has stalled briefly due to other commitments, so I'm not > sure I can ask you to pull those parts of the series for now. Christian > Brauner would like to use them in lxc, but hasn't started. > ]] > > > LSM hooks are included: > > (1) A set of hooks are provided that allow an LSM to rule on whether or > not a watch may be set. Each of these hooks takes a different > "watched object" parameter, so they're not really shareable. The LSM > should use current's credentials. [Wanted by SELinux & Smack] > > (2) A hook is provided to allow an LSM to rule on whether or not a > particular message may be posted to a particular queue. This is given > the credentials from the event generator (which may be the system) and > the watch setter. [Wanted by Smack] > > I've provided SELinux and Smack with implementations of some of these hooks. > > > WHY > === > > Key/keyring notifications are desirable because if you have your kerberos > tickets in a file/directory, your Gnome desktop will monitor that using > something like fanotify and tell you if your credentials cache changes. > > However, we also have the ability to cache your kerberos tickets in the > session, user or persistent keyring so that it isn't left around on disk > across a reboot or logout. Keyrings, however, cannot currently be > monitored asynchronously, so the desktop has to poll for it - not so good > on a laptop. This facility will allow the desktop to avoid the need to > poll. > > > DESIGN DECISIONS > ================ > > (1) The notification queue is built on top of a standard pipe. Messages > are effectively spliced in. The pipe is opened with a special flag: > > pipe2(fds, O_NOTIFICATION_PIPE); > > The special flag has the same value as O_EXCL (which doesn't seem like > it will ever be applicable in this context)[?]. It is given up front > to make it a lot easier to prohibit splice and co. from accessing the > pipe. > > [?] Should this be done some other way? I'd rather not use up a new > O_* flag if I can avoid it - should I add a pipe3() system call > instead? > > The pipe is then configured:: > > ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); > ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); > > Messages are then read out of the pipe using read(). > > (2) It should be possible to allow write() to insert data into the > notification pipes too, but this is currently disabled as the kernel > has to be able to insert messages into the pipe *without* holding > pipe->mutex and the code to make this work needs careful auditing. > > (3) sendfile(), splice() and vmsplice() are disabled on notification pipes > because of the pipe->mutex issue and also because they sometimes want > to revert what they just did - but one or more notification messages > might've been interleaved in the ring. > > (4) The kernel inserts messages with the wait queue spinlock held. This > means that pipe_read() and pipe_write() have to take the spinlock to > update the queue pointers. > > (5) Records in the buffer are binary, typed and have a length so that they > can be of varying size. > > This allows multiple heterogeneous sources to share a common buffer; > there are 16 million types available, of which I've used just a few, > so there is scope for others to be used. Tags may be specified when a > watchpoint is created to help distinguish the sources. > > (6) Records are filterable as types have up to 256 subtypes that can be > individually filtered. Other filtration is also available. > > (7) Notification pipes don't interfere with each other; each may be bound > to a different set of watches. Any particular notification will be > copied to all the queues that are currently watching for it - and only > those that are watching for it. > > (8) When recording a notification, the kernel will not sleep, but will > rather mark a queue as having lost a message if there's insufficient > space. read() will fabricate a loss notification message at an > appropriate point later. > > (9) The notification pipe is created and then watchpoints are attached to > it, using one of: > > keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); > watch_mount(AT_FDCWD, "/", 0, fd, 0x02); > watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03); > > where in both cases, fd indicates the queue and the number after is a > tag between 0 and 255. > > (10) Watches are removed if either the notification pipe is destroyed or > the watched object is destroyed. In the latter case, a message will > be generated indicating the enforced watch removal. > > > Things I want to avoid: > > (1) Introducing features that make the core VFS dependent on the network > stack or networking namespaces (ie. usage of netlink). > > (2) Dumping all this stuff into dmesg and having a daemon that sits there > parsing the output and distributing it as this then puts the > responsibility for security into userspace and makes handling > namespaces tricky. Further, dmesg might not exist or might be > inaccessible inside a container. > > (3) Letting users see events they shouldn't be able to see. > > > TESTING AND MANPAGES > ==================== > > (*) The keyutils tree has a pipe-watch branch that has keyctl commands for > making use of notifications. Proposed manual pages can also be found > on this branch, though a couple of them really need to go to the main > manpages repository instead. > > If the kernel supports the watching of keys, then running "make test" > on that branch will cause the testing infrastructure to spawn a > monitoring process on the side that monitors a notifications pipe for > all the key/keyring changes induced by the tests and they'll all be > checked off to make sure they happened. > > https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch > > (*) A test program is provided (samples/watch_queue/watch_test) that can > be used to monitor for keyrings, mount and superblock events. > Information on the notifications is simply logged to stdout. > > Thanks, > David > --- > The following changes since commit b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce: > > Linux 5.7-rc6 (2020-05-17 16:48:37 -0700) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601 > > for you to fetch changes up to a8478a602913dc89a7cd2060e613edecd07e1dbd: > > smack: Implement the watch_key and post_notification hooks (2020-05-19 15:47:38 +0100) > > ---------------------------------------------------------------- > Notifications over pipes + Keyring notifications > > ---------------------------------------------------------------- > David Howells (12): > uapi: General notification queue definitions > security: Add a hook for the point of notification insertion > pipe: Add O_NOTIFICATION_PIPE > pipe: Add general notification queue support > security: Add hooks to rule on setting a watch > watch_queue: Add a key/keyring notification facility > Add sample notification program > pipe: Allow buffers to be marked read-whole-or-error for notifications > pipe: Add notification lossage handling > keys: Make the KEY_NEED_* perms an enum rather than a mask > selinux: Implement the watch_key security hook > smack: Implement the watch_key and post_notification hooks > > Documentation/security/keys/core.rst | 57 ++ > Documentation/userspace-api/ioctl/ioctl-number.rst | 1 + > Documentation/watch_queue.rst | 339 +++++++++++ > fs/pipe.c | 242 +++++--- > fs/splice.c | 12 +- > include/linux/key.h | 33 +- > include/linux/lsm_audit.h | 1 + > include/linux/lsm_hook_defs.h | 9 + > include/linux/lsm_hooks.h | 14 + > include/linux/pipe_fs_i.h | 27 +- > include/linux/security.h | 30 +- > include/linux/watch_queue.h | 127 ++++ > include/uapi/linux/keyctl.h | 2 + > include/uapi/linux/watch_queue.h | 104 ++++ > init/Kconfig | 12 + > kernel/Makefile | 1 + > kernel/watch_queue.c | 659 +++++++++++++++++++++ > samples/Kconfig | 6 + > samples/Makefile | 1 + > samples/watch_queue/Makefile | 7 + > samples/watch_queue/watch_test.c | 186 ++++++ > security/keys/Kconfig | 9 + > security/keys/compat.c | 3 + > security/keys/gc.c | 5 + > security/keys/internal.h | 38 +- > security/keys/key.c | 38 +- > security/keys/keyctl.c | 115 +++- > security/keys/keyring.c | 20 +- > security/keys/permission.c | 31 +- > security/keys/process_keys.c | 46 +- > security/keys/request_key.c | 4 +- > security/security.c | 22 +- > security/selinux/hooks.c | 51 +- > security/smack/smack_lsm.c | 112 +++- > 34 files changed, 2185 insertions(+), 179 deletions(-) > create mode 100644 Documentation/watch_queue.rst > create mode 100644 include/linux/watch_queue.h > create mode 100644 include/uapi/linux/watch_queue.h > create mode 100644 kernel/watch_queue.c > create mode 100644 samples/watch_queue/Makefile > create mode 100644 samples/watch_queue/watch_test.c >
Hi Linus,
On Tue, Jun 02, 2020 at 04:55:04PM +0100, David Howells wrote:
> Can you pull this, please? It adds a general notification queue concept
I'm trying to use David's notification stuff in userspace, and I guess
feedback is welcome :-)
The notification stuff looks pretty promising, but I do not understand
why we need to use pipe for this purpose, see typical userspace use-case:
int pipefd[2], fd;
if (pipe2(pipefd, O_NOTIFICATION_PIPE) == -1)
err(EXIT_FAILURE, "pipe2 failed");
fd = pipefd[0];
All the next operations are done with "fd". It's nowhere used as a
pipe, and nothing uses pipefd[1]. The first impression from this code
is "oh, this is strange; why?".
Is it because we need to create a new file descriptor from nothing?
Why O_NOTIFICATION_PIPE is better than introduce a new syscall
notifyfd()?
(We have signalfd(), no O_SIGNAL_PIPE, etc.)
Karel
[ Finally getting around to this since my normal pull queue is now empty ] On Wed, Jun 10, 2020 at 4:13 AM Karel Zak <kzak@redhat.com> wrote: > > The notification stuff looks pretty promising, but I do not understand > why we need to use pipe for this purpose The original intent was never to use the "pipe()" system call itself, only use pipes as the actual transport mechanism (because I do not for a second believe in the crazy "use sockets" model that a lot of other people seem to blindly believe in). But using "pipe()" also allows for non-kernel notification queues (ie where the events come from a user space process). Then you'd not use O_NOTIFICATION_PIPE, but O_DIRECT (for a packetized pipe). > Is it because we need to create a new file descriptor from nothing? > Why O_NOTIFICATION_PIPE is better than introduce a new syscall > notifyfd()? We could eventually introduce a new system call. But I most definitely did *NOT* want to see anything like that for any first gen stuff. Especially since it wasn't clear who was going to use it, and whether early trials would literally be done with that user-space emulation model of using a perfectly regular pipe (just with packetization). I'm not even convinced O_NOTIFICATION_PIPE is necessary, but at worst it will be a useful marker. I think the only real reason for it was to avoid any clashes with splice(), which has more complex use of the pipe buffers. I'm so far just reading this thread and the arguments for users, and I haven't yet looked at all the actual details in the pull request - but last time I had objections to things it wasn't the code, it was the lack of any use. Linus
[ Actually going through the code now ] On Wed, Jun 10, 2020 at 4:13 AM Karel Zak <kzak@redhat.com> wrote: > > All the next operations are done with "fd". It's nowhere used as a > pipe, and nothing uses pipefd[1]. As an aside, that isn't necessarily true. In some of the examples, pipefd[1] is used for configuration (sizing and adding filters), although I think right now that's not really enforced, and other examples seem to have pipefd[0] do that too. DavidH: should that perhaps be a hard rule, so that you can pass a pipefd[0] to readers, while knowing that they can't then change the kinds of notifications they see. In the "pipe: Add general notification queue support" commit message, the code example uses pipefd[0] for IOC_WATCH_QUEUE_SET_SIZE, but then in the commit message for "watch_queue: Add a key/keyring notification facility" it uses pipefd[1]. And that latter example does make sense: using the write-side pipefd[1] for configuration, while the read-side pipefd[0] is the side that sees the results. That is also how it would work if you have a user-mode pipe with the notification source controlling the writing side - the reading side can obviously not add filters or change the semantics of the watches. So that allows a trusted side to add and create filters, while some untrusted entity can then see the results. This isn't going to hold up me merging the code, but it would be good to clarify and make that something that gets enforced if we decide it's worth it. It does seem conceptually like a good idea, and potentially actually useful to clearly separate the domain of "you can add watches and filters" from "you can see the notifications". Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > I'm not even convinced O_NOTIFICATION_PIPE is necessary, but at worst > it will be a useful marker. I think the only real reason for it was to > avoid any clashes with splice(), which has more complex use of the > pipe buffers. The main reason is to prevent splice because the iov_iter rewind for splice gets quite tricky if the kernel can randomly insert packets into the pipe buffer in between what splice is inserting. > I'm so far just reading this thread and the arguments for users, and I > haven't yet looked at all the actual details in the pull request - but > last time I had objections to things it wasn't the code, it was the > lack of any use. Would you be willing at this point to consider pulling the mount notifications and fsinfo() which helps support that? I could whip up pull reqs for those two pieces - or do you want to see more concrete patches that use it? David
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > All the next operations are done with "fd". It's nowhere used as a > > pipe, and nothing uses pipefd[1]. > > As an aside, that isn't necessarily true. > > In some of the examples, pipefd[1] is used for configuration (sizing > and adding filters), although I think right now that's not really > enforced, and other examples seem to have pipefd[0] do that too. The configuration can happen on either end of the pipe. I just need to be able to find the pipe object. > DavidH: should that perhaps be a hard rule, so that you can pass a > pipefd[0] to readers, while knowing that they can't then change the > kinds of notifications they see. You can argue that the other way: that it should be a hard rule that you can pass pipefd[1] to writers, whilst knowing that they can't then change the kind of notifications that the kernel can insert into the pipe. My feeling is that it's more likely that you would keep the read end yourself and give the write end away - if at all. Most likely, IMO, would be that you attach notification sources and never use the write end directly. There is some argument for making it so that the notification sources belong to the read end only and that they keep the write side alive internally - meaning that you can just close the write end. All the notification sources just then disappear when the read end is closed - but dup() might make this kind of tricky as there is only one pipe object and its shared between both ends. The existence of O_RDWR FIFOs might also make this tricky. > In the "pipe: Add general notification queue support" commit message, > the code example uses pipefd[0] for IOC_WATCH_QUEUE_SET_SIZE, but then > in the commit message for "watch_queue: Add a key/keyring notification > facility" it uses pipefd[1]. > > And that latter example does make sense: using the write-side > pipefd[1] for configuration, while the read-side pipefd[0] is the side > that sees the results. That is also how it would work if you have a > user-mode pipe with the notification source controlling the writing > side - the reading side can obviously not add filters or change the > semantics of the watches. > > So that allows a trusted side to add and create filters, while some > untrusted entity can then see the results. As stated above, I think you should be looking at this the other way round - you're more likely to keep the read end for yourself. If you attach multiple sources to a pipe, everything they produce comes out mixed together from the read end of the pipe. You might even pass the write end to multiple userspace-side event generators, but I'm not sure it would make sense to pass the read end around unless you have sufficient flow that you need multiple consumers to keep up with it. David
On Sat, Jun 13, 2020 at 6:05 AM David Howells <dhowells@redhat.com> wrote: > > Would you be willing at this point to consider pulling the mount notifications > and fsinfo() which helps support that? I could whip up pull reqs for those > two pieces - or do you want to see more concrete patches that use it? I'd want to see more concrete use cases, but I'd also like to see that this keyring thing gets used and doesn't find any show-stoppers when it does. If we have multiple uses, and one of them notices some problem that requires any ABI changes, but the other one has already started using it, we'll have more problems. Linus
On Sat, Jun 13, 2020 at 9:47 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > If we have multiple uses, and one of them notices some problem that > requires any ABI changes, but the other one has already started using > it, we'll have more problems. Ok, it's merged in my tree, although I was somewhat unhappy about the incomprehensible calling conventions of "get_pipe_info()". The random second argument just makes no sense when you read the code, it would have probably been better as a helper function or #define to clarify the whole "for_splice" thing. But let's see how it works and what actually happens. Linus
The pull request you sent on Tue, 02 Jun 2020 16:55:04 +0100:
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6c3297841472b4e53e22e53826eea9e483d993e5
Thank you!
On Sat, Jun 13, 2020 at 3:05 PM David Howells <dhowells@redhat.com> wrote: > > I'm so far just reading this thread and the arguments for users, and I > > haven't yet looked at all the actual details in the pull request - but > > last time I had objections to things it wasn't the code, it was the > > lack of any use. > > Would you be willing at this point to consider pulling the mount notifications > and fsinfo() which helps support that? I could whip up pull reqs for those > two pieces - or do you want to see more concrete patches that use it? Well, I had some questions and comments for the mount notifications last time around[1] and didn't yet get a reply. And the fsinfo stuff is simply immature, please lets not merge it just yet. When we have some uses (most notably systemd) running on top of the current fsinfo interface, we can sit down and discuss how the API can be cleaned up. BTW I had a similar experience with the fsconfig() merge, which was pushed with some unpolished bits and where my comments were also largely ignored. So, before asking to pull, please at least *answer* reviews. You don't have to agree, but at least consider and think about the comments. Thanks, Miklos [1] https://lore.kernel.org/linux-fsdevel/CAJfpegspWA6oUtdcYvYF=3fij=Bnq03b8VMbU9RNMKc+zzjbag@mail.gmail.com/
Hi David, On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote: > Date: Tue, 02 Jun 2020 16:51:44 +0100 > > Hi Linus, > > Can you pull this, please? It adds a general notification queue > concept > and adds an event source for keys/keyrings, such as linking and > unlinking > keys and changing their attributes. [..] This commit: > keys: Make the KEY_NEED_* perms an enum rather than a mask ...upstream as: 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask ...triggers a regression in the libnvdimm unit test that exercises the encrypted keys used to store nvdimm passphrases. It results in the below warning. --- WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key_task_permission+0xd3/0x140 Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) trusted(E) nfit_test_iomap(OE) tpm(E) drm(E) CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G OE 5.7.0-rc6+ #155 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 RIP: 0010:key_task_permission+0xd3/0x140 Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06 RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297 RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01 RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0 R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840 R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540 FS: 00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0 Call Trace: lookup_user_key+0xeb/0x6b0 ? vsscanf+0x3df/0x840 ? key_validate+0x50/0x50 ? key_default_cmp+0x20/0x20 nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm] nvdimm_security_store+0x67d/0xb20 [libnvdimm] security_store+0x67/0x1a0 [libnvdimm] kernfs_fop_write+0xcf/0x1c0 vfs_write+0xde/0x1d0 ksys_write+0x68/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7f863c624547 Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547 RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005 RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0 R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0 R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0 irq event stamp: 36976 hardirqs last enabled at (36975): [<ffffffff9131fa40>] __slab_alloc+0x70/0x90 hardirqs last disabled at (36976): [<ffffffff910049c7>] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (35474): [<ffffffff91e00357>] __do_softirq+0x357/0x466 softirqs last disabled at (35467): [<ffffffff910eae96>] irq_exit+0xe6/0xf0
On Tue, Jun 16, 2020 at 6:15 PM Williams, Dan J <dan.j.williams@intel.com> wrote: > > Hi David, > > On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote: > > Date: Tue, 02 Jun 2020 16:51:44 +0100 > > > > Hi Linus, > > > > Can you pull this, please? It adds a general notification queue > > concept > > and adds an event source for keys/keyrings, such as linking and > > unlinking > > keys and changing their attributes. > [..] > > This commit: > > > keys: Make the KEY_NEED_* perms an enum rather than a mask > > ...upstream as: > > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask > > ...triggers a regression in the libnvdimm unit test that exercises the > encrypted keys used to store nvdimm passphrases. It results in the > below warning. This regression is still present in tip of tree. David, have you had a chance to take a look? > > --- > > WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key_task_permission+0xd3/0x140 > Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p > ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) trusted(E) nfit_test_iomap(OE) tpm(E) drm(E) > CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G OE 5.7.0-rc6+ #155 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 > RIP: 0010:key_task_permission+0xd3/0x140 > Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06 > > RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297 > RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c > RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01 > RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0 > R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840 > R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540 > FS: 00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0 > Call Trace: > lookup_user_key+0xeb/0x6b0 > ? vsscanf+0x3df/0x840 > ? key_validate+0x50/0x50 > ? key_default_cmp+0x20/0x20 > nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm] > nvdimm_security_store+0x67d/0xb20 [libnvdimm] > security_store+0x67/0x1a0 [libnvdimm] > kernfs_fop_write+0xcf/0x1c0 > vfs_write+0xde/0x1d0 > ksys_write+0x68/0xe0 > do_syscall_64+0x5c/0xa0 > entry_SYSCALL_64_after_hwframe+0x49/0xb3 > RIP: 0033:0x7f863c624547 > Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 > RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547 > RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005 > RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0 > R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0 > R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0 > irq event stamp: 36976 > hardirqs last enabled at (36975): [<ffffffff9131fa40>] __slab_alloc+0x70/0x90 > hardirqs last disabled at (36976): [<ffffffff910049c7>] trace_hardirqs_off_thunk+0x1a/0x1c > softirqs last enabled at (35474): [<ffffffff91e00357>] __do_softirq+0x357/0x466 > softirqs last disabled at (35467): [<ffffffff910eae96>] irq_exit+0xe6/0xf0 >
Dan Williams <dan.j.williams@intel.com> wrote: > > This commit: > > > > > keys: Make the KEY_NEED_* perms an enum rather than a mask > > > > ...upstream as: > > > > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask > > > > ...triggers a regression in the libnvdimm unit test that exercises the > > encrypted keys used to store nvdimm passphrases. It results in the > > below warning. > > This regression is still present in tip of tree. David, have you had a > chance to take a look? nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants the key for so that the appropriate security checks can take place in SELinux and Smack. Note that I have a patch in the works that changes this still further. Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work for you? David
On Tue, Jun 23, 2020 at 5:55 PM David Howells <dhowells@redhat.com> wrote: > > Dan Williams <dan.j.williams@intel.com> wrote: > > > > This commit: > > > > > > > keys: Make the KEY_NEED_* perms an enum rather than a mask > > > > > > ...upstream as: > > > > > > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask > > > > > > ...triggers a regression in the libnvdimm unit test that exercises the > > > encrypted keys used to store nvdimm passphrases. It results in the > > > below warning. > > > > This regression is still present in tip of tree. David, have you had a > > chance to take a look? > > nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants > the key for so that the appropriate security checks can take place in SELinux > and Smack. Note that I have a patch in the works that changes this still > further. > > Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work > for you? It does, thanks. Shall I wait for your further reworks to fix this for v5.8, or is that v5.9 material?
Dan Williams <dan.j.williams@intel.com> wrote: > Shall I wait for your further reworks to fix this for v5.8, or is that > v5.9 material? It could do with stewing in linux-next for a while, so 5.9 probably. David