Message ID: 155905930702.7587.7100265859075976147.stgit@warthog.procyon.org.uk (mailing list archive)
Series: Mount, FS, Block and Keyrings notifications
On Tue, May 28, 2019 at 05:01:47PM +0100, David Howells wrote:
> Things I want to avoid:
>
> (1) Introducing features that make the core VFS dependent on the network
>     stack or networking namespaces (ie. usage of netlink).
>
> (2) Dumping all this stuff into dmesg and having a daemon that sits there
>     parsing the output and distributing it as this then puts the
>     responsibility for security into userspace and makes handling
>     namespaces tricky.  Further, dmesg might not exist or might be
>     inaccessible inside a container.
>
> (3) Letting users see events they shouldn't be able to see.

How are you handling namespaces then?  Are they determined by the
namespace of the process that opened the original device handle, or the
namespace that made the new syscall for the events to "start flowing"?

Am I missing the logic that determines this in the patches, or is that
not implemented yet?

thanks,

greg k-h
On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
>
> Hi Al,
>
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
>
> (1) Mount topology events, such as mounting, unmounting, mount expiry,
>     mount reconfiguration.
>
> (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>     errors (not complete yet).
>
> (3) Block layer events, such as I/O errors.
>
> (4) Key/keyring events, such as creating, linking and removal of keys.
>
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
>
>
> Design decisions:
>
> (1) A misc chardev is used to create and open a ring buffer:
>
>         fd = open("/dev/watch_queue", O_RDWR);
>
>     which is then configured and mmap'd into userspace:
>
>         ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
>         ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
>         buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
>                    MAP_SHARED, fd, 0);
>
>     The fd cannot be read or written (though there is a facility to use
>     write to inject records for debugging) and userspace just pulls data
>     directly out of the buffer.
>
> (2) The ring index pointers are stored inside the ring and are thus
>     accessible to userspace.  Userspace should only update the tail
>     pointer and never the head pointer or risk breaking the buffer.  The
>     kernel checks that the pointers appear valid before trying to use
>     them.  A 'skip' record is maintained around the pointers.
>
> (3) poll() can be used to wait for data to appear in the buffer.
>
> (4) Records in the buffer are binary, typed and have a length so that they
>     can be of varying size.
>
>     This means that multiple heterogeneous sources can share a common
>     buffer.  Tags may be specified when a watchpoint is created to help
>     distinguish the sources.
>
> (5) The queue is reusable as there are 16 million types available, of
>     which I've used 4, so there is scope for others to be used.
>
> (6) Records are filterable as types have up to 256 subtypes that can be
>     individually filtered.  Other filtration is also available.
>
> (7) Each time the buffer is opened, a new buffer is created - this means
>     that there's no interference between watchers.
>
> (8) When recording a notification, the kernel will not sleep, but will
>     rather mark a queue as overrun if there's insufficient space, thereby
>     avoiding userspace causing the kernel to hang.
>
> (9) The 'watchpoint' should be specific where possible, meaning that you
>     specify the object that you want to watch.
>
> (10) The buffer is created and then watchpoints are attached to it, using
>      one of:
>
>          keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
>          mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
>          sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>
>      where in all three cases, fd indicates the queue and the number after
>      is a tag between 0 and 255.
>
> (11) The watch must be removed if either the watch buffer is destroyed or
>      the watched object is destroyed.
>
>
> Things I want to avoid:
>
> (1) Introducing features that make the core VFS dependent on the network
>     stack or networking namespaces (ie. usage of netlink).
>
> (2) Dumping all this stuff into dmesg and having a daemon that sits there
>     parsing the output and distributing it as this then puts the
>     responsibility for security into userspace and makes handling
>     namespaces tricky.  Further, dmesg might not exist or might be
>     inaccessible inside a container.
>
> (3) Letting users see events they shouldn't be able to see.
>
>
> Further things that could be considered:
>
> (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>     "children" of that keyring, such that the watch is removed from the
>     child if it is unlinked from the keyring.
>
> (2) Adding a global superblock event queue.
>
> (3) Propagating watches to child superblocks over automounts.

David,

I am interested to know how you envision filesystem notifications would
look with this interface.

fanotify can certainly benefit from providing a ring buffer interface to
read events.

From what I have seen, a common practice of users is to monitor mounts
(somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
It'd be good if those users could use a single watch mechanism/API for
watching the mount namespace and filesystem events within mounts.

A similar usability concern applies to sb_notify and FAN_MARK_FILESYSTEM.
That pair gives users two completely different mechanisms to watch error
and filesystem events, which is generally not a good thing to have.

I am not asking that you implement fs_notify() before merging sb_notify(),
and I understand that you have a use case for sb_notify().
I am asking that you show me the path towards a unified API (what a
typical program would look like), so that we know before merging your
new API that it could be extended to accommodate fsnotify events,
where the final result will look wholesome to users.

Thanks,
Amir.
Amir Goldstein <amir73il@gmail.com> wrote:

> I am interested to know how you envision filesystem notifications would
> look with this interface.

What sort of events are you thinking of by "filesystem notifications"?
You mean things like file changes?

David
On Wed, May 29, 2019 at 9:45 AM David Howells <dhowells@redhat.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I am interested to know how you envision filesystem notifications would
> > look with this interface.
>
> What sort of events are you thinking of by "filesystem notifications"?
> You mean things like file changes?

I mean all the events provided by
http://man7.org/linux/man-pages/man7/fanotify.7.html

fanotify was recently (v4.20) extended to support watching a super block
and more recently (v5.1) extended to support watching directory entry
modifications.

Thanks,
Amir.
Greg KH <gregkh@linuxfoundation.org> wrote:

> > (3) Letting users see events they shouldn't be able to see.
>
> How are you handling namespaces then?  Are they determined by the
> namespace of the process that opened the original device handle, or the
> namespace that made the new syscall for the events to "start flowing"?

So far I haven't had to deal directly with namespaces.

mount_notify() requires you to have access to the mountpoint you want to
watch - and the entire tree rooted there is in one namespace, so your
event sources are restricted to that namespace.  Further, mount objects
don't themselves have any other namespaces, not even a user_ns.

sb_notify() requires you to have access to the superblock you want to
watch.  Superblocks aren't directly namespaced as a class, though
individual superblocks may participate in particular namespaces (ipc,
net, etc.).  I'm thinking some of these should be marked unwatchable
(all pseudo superblocks, kernfs-class, proc, for example).

Superblocks, however, do each have a user_ns - but you were allowed to
access the superblock by pathwalk, so you must have some access to the
user_ns - I think.

KEYCTL_NOTIFY requires you to have View access on the key you're
watching.  Currently, keys have no real namespace restrictions, though I
have patches to include a namespace tag in the lookup criteria.

block_notify() doesn't require any direct access since you're watching a
global queue and there is no blockdev namespacing.  LSMs are given the
option to filter events, though.  The thought here is that if you can
access dmesg, you should be able to watch for blockdev events.


Actually, thinking further on this, restricting access to events is
trickier than I thought, and than perhaps Casey was suggesting.

Say you're watching a mount object and someone in a different user_ns
namespace or with a different security label mounts on it.  What governs
whether you are allowed to see the event?

You're watching the object for changes - and it *has* changed.  Further,
you might be able to see the result of this change by other means
(/proc/mounts, for instance).  Should you be denied the event based on
the security model?

On the other hand, if you're watching a tree of mount objects, it could
be argued that you should be denied access to events on any mount object
you can't reach by pathwalk.  On the third hand, if you can see it in
/proc/mounts or by fsinfo(), you should get an event for it.

> How are you handling namespaces then?

So to go back to the original question: at the moment namespaces haven't
impinged directly and I haven't had to deal with them directly.  There
are indirect namespace restrictions that I get for free just due to
pathwalk, for instance.

David
On Wed 29-05-19 09:33:35, Amir Goldstein wrote:
> On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
> > Here's a set of patches to add a general variable-length notification
> > queue concept and to add sources of events for:
> > [...]
>
> A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> It provides users with two completely different mechanisms to watch error
> and filesystem events.  That is generally not a good thing to have.
>
> I am not asking that you implement fs_notify() before merging sb_notify()
> and I understand that you have a use case for sb_notify().
> I am asking that you show me the path towards a unified API (what a
> typical program would look like), so that we know before merging your
> new API that it could be extended to accommodate fsnotify events
> where the final result will look wholesome to users.

Are you sure we want to combine notification about file changes etc. with
administrator-type notifications about the filesystem?  To me these two
sound like rather different (although sometimes related) things.

								Honza
On Wed, May 29, 2019 at 04:25:04PM +0200, Jan Kara wrote:
> > I am not asking that you implement fs_notify() before merging sb_notify()
> > and I understand that you have a use case for sb_notify().
> > I am asking that you show me the path towards a unified API (what a
> > typical program would look like), so that we know before merging your
> > new API that it could be extended to accommodate fsnotify events
> > where the final result will look wholesome to users.
>
> Are you sure we want to combine notification about file changes etc. with
> administrator-type notifications about the filesystem?  To me these two
> sound like rather different (although sometimes related) things.

This patchset is looking to create a "generic" kernel notification system,
so I think the question is valid.  It's up to the requestor to ask for the
specific type of notification.

thanks,

greg k-h
On 5/29/2019 2:09 AM, David Howells wrote:
> Say you're watching a mount object and someone in a different user_ns
> namespace or with a different security label mounts on it.  What governs
> whether you are allowed to see the event?

Conceptually it should be simple, but we have a variety of different
policies in the core OS, never mind what goes on inside the LSMs.

If you want to treat a notification like a signal, you would only deliver
it if the process that performed the action that triggered the event has
the same UID as the process receiving the notification.  Should you decide
to treat it like an IP packet, only the LSMs would filter delivery.  If
there are mode bits on the thing being watched, shouldn't you respect
them?

> You're watching the object for changes - and it *has* changed.  Further,
> you might be able to see the result of this change by other means
> (/proc/mounts, for instance).
>
> Should you be denied the event based on the security model?

From a subject/object model view, there are two objects and one subject
involved.  The subject (active entity) is the process that changes the
first object, triggering an event.  The watching process (that will
receive the notification) is the second object, because its state will
change (be written to) when the notification is delivered.

For the watching process to receive the notification, the changing process
needs write access to the watching process.  The indirection of the
notification mechanism isn't relevant.  If the changing process couldn't
directly notify the watching process, it shouldn't be able to do it
indirectly, either.

> On the other hand, if you're watching a tree of mount objects, it could
> be argued that you should be denied access to events on any mount object
> you can't reach by pathwalk.
>
> On the third hand, if you can see it in /proc/mounts or by fsinfo(), you
> should get an event for it.

Right.  We've done a pretty good job of muddling the security landscape by
adding spiffy features to make life easier for particular use cases.
/proc is chock-full of examples.  Objects that can be viewed in many
different ways make for confusing security models.  Try explaining
/proc/234/fd/2 to a security theory student.
> > I am not asking that you implement fs_notify() before merging sb_notify()
> > and I understand that you have a use case for sb_notify().
> > I am asking that you show me the path towards a unified API (what a
> > typical program would look like), so that we know before merging your
> > new API that it could be extended to accommodate fsnotify events
> > where the final result will look wholesome to users.
>
> Are you sure we want to combine notification about file changes etc. with
> administrator-type notifications about the filesystem?  To me these two
> sound like rather different (although sometimes related) things.

Well, I am sure that a ring buffer for fanotify events would be useful, so
seeing that David is proposing a generic notification mechanism, I wanted
to know how that mechanism could best share infrastructure with fsnotify.

But apart from that, I foresee questions from users about why the mount
notification API and filesystem events API do not have better integration.

The way I see it, the notification queue can serve several classes of
notifications, and fsnotify could be one of those classes (at least
FAN_CLASS_NOTIF fits nicely into the model).

Thanks,
Amir.
On Wed 29-05-19 18:53:21, Amir Goldstein wrote:
> Well, I am sure that a ring buffer for fanotify events would be useful, so
> seeing that David is proposing a generic notification mechanism, I wanted
> to know how that mechanism could best share infrastructure with fsnotify.
>
> But apart from that, I foresee questions from users about why the mount
> notification API and filesystem events API do not have better integration.
>
> The way I see it, the notification queue can serve several classes of
> notifications, and fsnotify could be one of those classes (at least
> FAN_CLASS_NOTIF fits nicely into the model).

I agree that for some types of fsnotify use a ring buffer would make
sense.  But for others - such as permission events or unlimited queues -
you cannot really use the ring buffer, and I don't like the idea of having
different ways of passing fsnotify events to userspace based on
notification group type...

								Honza
Amir Goldstein <amir73il@gmail.com> wrote:

> Well, I am sure that a ring buffer for fanotify events would be useful, so
> seeing that David is proposing a generic notification mechanism, I wanted
> to know how that mechanism could best share infrastructure with fsnotify.
>
> But apart from that, I foresee questions from users about why the mount
> notification API and filesystem events API do not have better integration.
>
> The way I see it, the notification queue can serve several classes of
> notifications, and fsnotify could be one of those classes (at least
> FAN_CLASS_NOTIF fits nicely into the model).

It could be done; the main thing that concerns me is that the buffer is of
limited capacity.  However, I could take this:

	struct fanotify_event_metadata {
		__u32 event_len;
		__u8 vers;
		__u8 reserved;
		__u16 metadata_len;
		__aligned_u64 mask;
		__s32 fd;
		__s32 pid;
	};

and map it to:

	struct fanotify_notification {
		struct watch_notification watch; /* WATCH_TYPE_FANOTIFY */
		__aligned_u64 mask;
		__u16 metadata_len;
		__u8 vers;
		__u8 reserved;
		__u32 reserved2;
		__s32 fd;
		__s32 pid;
	};

and some of the watch::info bits could be used:

	n->watch.info & WATCH_INFO_OVERRUN	watch queue overran
	n->watch.info & WATCH_INFO_LENGTH	event_len
	n->watch.info & WATCH_INFO_RECURSIVE	FAN_EVENT_ON_CHILD
	n->watch.info & WATCH_INFO_FLAG_0	FAN_*_PERM
	n->watch.info & WATCH_INFO_FLAG_1	FAN_Q_OVERFLOW
	n->watch.info & WATCH_INFO_FLAG_2	FAN_ON_DIR
	n->subtype				ffs(n->mask)

Ideally, I'd dispense with metadata_len, vers and reserved* and set the
version when setting the watch:

	fanotify_watch(int watchfd, unsigned int flags, u64 *mask,
		       int dirfd, const char *pathname,
		       unsigned int at_flags);

We might also want to extend the watch_filter to allow you to, say, filter
on the first __u64 after the watch member so that you could filter on
specific events:

	struct watch_notification_type_filter {
		__u32 type;
		__u32 info_filter;
		__u32 info_mask;
		__u32 subtype_filter[8];
		__u64 payload_mask[1];
		__u64 payload_set[1];
	};

So, in this case, it would require:

	(n->mask & wf->payload_mask[0]) == wf->payload_set[0]

to be true to record the message (parenthesised, since == binds more
tightly than & in C).

David