Message ID | 155991706083.15579.16359443779582362339.stgit@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Mount, FS, Block and Keyrings notifications [ver #4] | expand |
On Fri, Jun 07, 2019 at 03:17:40PM +0100, David Howells wrote: > Add UAPI definitions for the general notification ring, including the > following pieces: > > (1) struct watch_notification. > > This is the metadata header for each entry in the ring. It includes a > type and subtype that indicate the source of the message > (eg. WATCH_TYPE_MOUNT_NOTIFY) and the kind of the message > (eg. NOTIFY_MOUNT_NEW_MOUNT). > > The header also contains an information field that conveys the > following information: > > - WATCH_INFO_LENGTH. The size of the entry (entries are variable > length). > > - WATCH_INFO_OVERRUN. If preceding messages were lost due to ring > overrun or lack of memory. > > - WATCH_INFO_ENOMEM. If preceding messages were lost due to lack > of memory. > > - WATCH_INFO_RECURSIVE. If the event detected was applied to > multiple objects (eg. a recursive change to mount attributes). > > - WATCH_INFO_IN_SUBTREE. If the event didn't happen at the watched > object, but rather to some related object (eg. a subtree mount > watch saw a mount happen somewhere within the subtree). > > - WATCH_INFO_TYPE_FLAGS. Eight flags whose meanings depend on the > message type. > > - WATCH_INFO_ID. The watch ID specified when the watchpoint was > set. > > All the information in the header can be used in filtering messages at > the point of writing into the buffer. > > (2) struct watch_queue_buffer. > > This describes the layout of the ring. Note that the first slots in > the ring contain a special metadata entry that contains the ring > pointers. The producer in the kernel knows to skip this and it has a > proper header (WATCH_TYPE_META, WATCH_META_SKIP_NOTIFICATION) that > indicates the size so that the ring consumer can handle it the same as > any other record and just skip it. > > Note that this means that ring entries can never be split over the end > of the ring, so if an entry would need to be split, a skip record is > inserted to wrap the ring first; this is also WATCH_TYPE_META, > WATCH_META_SKIP_NOTIFICATION. > > Signed-off-by: David Howells <dhowells@redhat.com> I'm starting by reading the uapi changes and the sample program... > --- > > include/uapi/linux/watch_queue.h | 63 ++++++++++++++++++++++++++++++++++++++ > 1 file changed, 63 insertions(+) > create mode 100644 include/uapi/linux/watch_queue.h > > diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h > new file mode 100644 > index 000000000000..c3a88fa5f62a > --- /dev/null > +++ b/include/uapi/linux/watch_queue.h > @@ -0,0 +1,63 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _UAPI_LINUX_WATCH_QUEUE_H > +#define _UAPI_LINUX_WATCH_QUEUE_H > + > +#include <linux/types.h> > + > +enum watch_notification_type { > + WATCH_TYPE_META = 0, /* Special record */ > + WATCH_TYPE_MOUNT_NOTIFY = 1, /* Mount notification record */ > + WATCH_TYPE_SB_NOTIFY = 2, /* Superblock notification */ > + WATCH_TYPE_KEY_NOTIFY = 3, /* Key/keyring change notification */ > + WATCH_TYPE_BLOCK_NOTIFY = 4, /* Block layer notifications */ > +#define WATCH_TYPE___NR 5 Given the way enums work I think you can just make WATCH_TYPE___NR the last element in the enum? > +}; > + > +enum watch_meta_notification_subtype { > + WATCH_META_SKIP_NOTIFICATION = 0, /* Just skip this record */ > + WATCH_META_REMOVAL_NOTIFICATION = 1, /* Watched object was removed */ > +}; > + > +/* > + * Notification record > + */ > +struct watch_notification { Kind of a long name... > + __u32 type:24; /* enum watch_notification_type */ > + __u32 subtype:8; /* Type-specific subtype (filterable) */ 16777216 diferent types and 256 different subtypes? My gut instinct wants a better balance, though I don't know where I'd draw the line. Probably 12 bits for type and 10 for subtype? OTOH I don't have a good sense of how many distinct notification types an XFS would want to send back to userspace, and maybe 256 subtypes is fine. We could always reserve another watch_notification_type if we need > 256. Ok, no objections. :) > + __u32 info; > +#define WATCH_INFO_OVERRUN 0x00000001 /* Event(s) lost due to overrun */ > +#define WATCH_INFO_ENOMEM 0x00000002 /* Event(s) lost due to ENOMEM */ > +#define WATCH_INFO_RECURSIVE 0x00000004 /* Change was recursive */ > +#define WATCH_INFO_LENGTH 0x000001f8 /* Length of record / sizeof(watch_notification) */ This is a mask, isn't it? Could we perhaps have some helpers here? Something along the lines of... #define WATCH_INFO_LENGTH_MASK 0x000001f8 #define WATCH_INFO_LENGTH_SHIFT 3 static inline size_t watch_notification_length(struct watch_notification *wn) { return (wn->info & WATCH_INFO_LENGTH_MASK) >> WATCH_INFO_LENGTH_SHIFT * sizeof(struct watch_notification); } static inline struct watch_notification *watch_notification_next( struct watch_notification *wn) { return wn + ((wn->info & WATCH_INFO_LENGTH_MASK) >> WATCH_INFO_LENGTH_SHIFT); } ...so that we don't have to opencode all of the ring buffer walking magic and stuff? (I might also shorten the namespace from WATCH_INFO_ to WNI_ ...) Hmm so the length field is 6 bits and therefore the maximum size of a notification record is ... 63 * (sizeof(u32) * 2) = 504 bytes? Which means that kernel users can send back a maximum payload of 496 bytes? That's probably big enough for random fs notifications (bad metadata detected, media errors, etc.) Judging from the sample program I guess all that userspace does is allocate a memory buffer and toss it into the kernel, which then initializes the ring management variables, and from there we just scan around the ring buffer every time poll(watch_fd) says there's something to do? How does userspace tell the kernel the size of the ring buffer? Does (watch_notification->info & WATCH_INFO_LENGTH) == 0 have any meaning besides apparently "stop looking at me"? > +#define WATCH_INFO_IN_SUBTREE 0x00000200 /* Change was not at watched root */ > +#define WATCH_INFO_TYPE_FLAGS 0x00ff0000 /* Type-specific flags */ WATCH_INFO_FLAG_MASK ? > +#define WATCH_INFO_FLAG_0 0x00010000 > +#define WATCH_INFO_FLAG_1 0x00020000 > +#define WATCH_INFO_FLAG_2 0x00040000 > +#define WATCH_INFO_FLAG_3 0x00080000 > +#define WATCH_INFO_FLAG_4 0x00100000 > +#define WATCH_INFO_FLAG_5 0x00200000 > +#define WATCH_INFO_FLAG_6 0x00400000 > +#define WATCH_INFO_FLAG_7 0x00800000 > +#define WATCH_INFO_ID 0xff000000 /* ID of watchpoint */ WATCH_INFO_ID_MASK ? > +#define WATCH_INFO_ID__SHIFT 24 Why double underscore here? > +}; > + > +#define WATCH_LENGTH_SHIFT 3 > + > +struct watch_queue_buffer { > + union { > + /* The first few entries are special, containing the > + * ring management variables. The first /two/ entries, correct? Also, weird multiline comment style. > + */ > + struct { > + struct watch_notification watch; /* WATCH_TYPE_META */ Do these structures have to be cache-aligned for the atomic reads and writes to work? --D > + __u32 head; /* Ring head index */ > + __u32 tail; /* Ring tail index */ > + __u32 mask; /* Ring index mask */ > + } meta; > + struct watch_notification slots[0]; > + }; > +}; > + > +#endif /* _UAPI_LINUX_WATCH_QUEUE_H */ >
> I'm starting by reading the uapi changes and the sample program...
If you look in:
[PATCH 05/13] General notification queue with user mmap()'able ring buffer
you'll find the API document that goes in Documentation/
David
Darrick J. Wong <darrick.wong@oracle.com> wrote: > > +enum watch_notification_type { > > + WATCH_TYPE_META = 0, /* Special record */ > > + WATCH_TYPE_MOUNT_NOTIFY = 1, /* Mount notification record */ > > + WATCH_TYPE_SB_NOTIFY = 2, /* Superblock notification */ > > + WATCH_TYPE_KEY_NOTIFY = 3, /* Key/keyring change notification */ > > + WATCH_TYPE_BLOCK_NOTIFY = 4, /* Block layer notifications */ > > +#define WATCH_TYPE___NR 5 > > Given the way enums work I think you can just make WATCH_TYPE___NR the > last element in the enum? Yeah. I've a feeling I'd been asked not to do that, but I don't remember who by. > > +struct watch_notification { > > Kind of a long name... *shrug*. Try to avoid conflicts with userspace symbols. > > + __u32 type:24; /* enum watch_notification_type */ > > + __u32 subtype:8; /* Type-specific subtype (filterable) */ > > 16777216 diferent types and 256 different subtypes? My gut instinct > wants a better balance, though I don't know where I'd draw the line. > Probably 12 bits for type and 10 for subtype? OTOH I don't have a good > sense of how many distinct notification types an XFS would want to send > back to userspace, and maybe 256 subtypes is fine. We could always > reserve another watch_notification_type if we need > 256. > > Ok, no objections. :) If a source wants more than 256 subtypes, it can always allocate an additional type. Note that the greater the number of subtypes, the larger the filter structure (added in patch 5). > > + __u32 info; > > +#define WATCH_INFO_OVERRUN 0x00000001 /* Event(s) lost due to overrun */ > > +#define WATCH_INFO_ENOMEM 0x00000002 /* Event(s) lost due to ENOMEM */ > > +#define WATCH_INFO_RECURSIVE 0x00000004 /* Change was recursive */ > > +#define WATCH_INFO_LENGTH 0x000001f8 /* Length of record / sizeof(watch_notification) */ > > This is a mask, isn't it? Could we perhaps have some helpers here? > Something along the lines of... > > #define WATCH_INFO_LENGTH_MASK 0x000001f8 > #define WATCH_INFO_LENGTH_SHIFT 3 > > static inline size_t watch_notification_length(struct watch_notification *wn) > { > return (wn->info & WATCH_INFO_LENGTH_MASK) >> WATCH_INFO_LENGTH_SHIFT * > sizeof(struct watch_notification); > } > > static inline struct watch_notification *watch_notification_next( > struct watch_notification *wn) > { > return wn + ((wn->info & WATCH_INFO_LENGTH_MASK) >> > WATCH_INFO_LENGTH_SHIFT); > } No inline functions in UAPI headers, please. I'd love to kill off the ones that we have, but that would break things. > ...so that we don't have to opencode all of the ring buffer walking > magic and stuff? There'll end up being a small userspace library, I think. > (I might also shorten the namespace from WATCH_INFO_ to WNI_ ...) I'd rather not do that. WNI_ could mean all sorts of things - at a first glance, it sounds like something to do with Windows or windowing. > Hmm so the length field is 6 bits and therefore the maximum size of a > notification record is ... 63 * (sizeof(u32) * 2) = 504 bytes? Which > means that kernel users can send back a maximum payload of 496 bytes? > That's probably big enough for random fs notifications (bad metadata > detected, media errors, etc.) Yep. The ring buffer is limited in capacity since it has to be unswappable so that notifications can be written into it from softirq context or under lock. > Judging from the sample program I guess all that userspace does is > allocate a memory buffer and toss it into the kernel, which then > initializes the ring management variables, and from there we just scan > around the ring buffer every time poll(watch_fd) says there's something > to do? Yes. Further, if head == tail, then it's empty. Note that since head and tail will go up to 4G, but the buffer is limited to less than that, there's no need for a requisite unusable slot in the ring as head != tail when the ring is full. > How does userspace tell the kernel the size of the ring buffer? If you looked in the sample program, you might have noticed this in the main() function: if (ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE) == -1) { The chardev allocates the buffer and then userspace mmaps it. I need to steal the locked-page accounting from the epoll patches to stop someone from using this to lock away all memory. > Does (watch_notification->info & WATCH_INFO_LENGTH) == 0 have any > meaning besides apparently "stop looking at me"? That's an illegal value, indicating a kernel bug. > > +#define WATCH_INFO_IN_SUBTREE 0x00000200 /* Change was not at watched root */ > > +#define WATCH_INFO_TYPE_FLAGS 0x00ff0000 /* Type-specific flags */ > > WATCH_INFO_FLAG_MASK ? Yeah. > > +#define WATCH_INFO_FLAG_0 0x00010000 > > +#define WATCH_INFO_FLAG_1 0x00020000 > > +#define WATCH_INFO_FLAG_2 0x00040000 > > +#define WATCH_INFO_FLAG_3 0x00080000 > > +#define WATCH_INFO_FLAG_4 0x00100000 > > +#define WATCH_INFO_FLAG_5 0x00200000 > > +#define WATCH_INFO_FLAG_6 0x00400000 > > +#define WATCH_INFO_FLAG_7 0x00800000 > > +#define WATCH_INFO_ID 0xff000000 /* ID of watchpoint */ > > WATCH_INFO_ID_MASK ? Sure. > > +#define WATCH_INFO_ID__SHIFT 24 > > Why double underscore here? Because it's related to WATCH_INFO_ID, but isn't a mask on ->info. > > +}; > > + > > +#define WATCH_LENGTH_SHIFT 3 > > + > > +struct watch_queue_buffer { > > + union { > > + /* The first few entries are special, containing the > > + * ring management variables. > > The first /two/ entries, correct? Currently two. > Also, weird multiline comment style. Not really. > > + */ > > + struct { > > + struct watch_notification watch; /* WATCH_TYPE_META */ > > Do these structures have to be cache-aligned for the atomic reads and > writes to work? I hope not, otherwise we're screwed all over the kernel, particularly with structure randomisation. Further, what alignment is "cache aligned"? It can vary by CPU for a single arch. David
On 6/7/19 8:51 AM, David Howells wrote: > Darrick J. Wong <darrick.wong@oracle.com> wrote: > >>> + __u32 info; >>> +#define WATCH_INFO_OVERRUN 0x00000001 /* Event(s) lost due to overrun */ >>> +#define WATCH_INFO_ENOMEM 0x00000002 /* Event(s) lost due to ENOMEM */ >>> +#define WATCH_INFO_RECURSIVE 0x00000004 /* Change was recursive */ >>> +#define WATCH_INFO_LENGTH 0x000001f8 /* Length of record / sizeof(watch_notification) */ >> >> This is a mask, isn't it? Could we perhaps have some helpers here? >> Something along the lines of... >> >> #define WATCH_INFO_LENGTH_MASK 0x000001f8 >> #define WATCH_INFO_LENGTH_SHIFT 3 >> >> static inline size_t watch_notification_length(struct watch_notification *wn) >> { >> return (wn->info & WATCH_INFO_LENGTH_MASK) >> WATCH_INFO_LENGTH_SHIFT * >> sizeof(struct watch_notification); >> } >> >> static inline struct watch_notification *watch_notification_next( >> struct watch_notification *wn) >> { >> return wn + ((wn->info & WATCH_INFO_LENGTH_MASK) >> >> WATCH_INFO_LENGTH_SHIFT); >> } > > No inline functions in UAPI headers, please. I'd love to kill off the ones > that we have, but that would break things. Hi David, What is the problem with inline functions in UAPI headers? >> ...so that we don't have to opencode all of the ring buffer walking >> magic and stuff? > > There'll end up being a small userspace library, I think. >>> +}; >>> + >>> +#define WATCH_LENGTH_SHIFT 3 >>> + >>> +struct watch_queue_buffer { >>> + union { >>> + /* The first few entries are special, containing the >>> + * ring management variables. >> >> The first /two/ entries, correct? > > Currently two. > >> Also, weird multiline comment style. > > Not really. Yes really. >>> + */ It does not match the preferred coding style for multi-line comments according to coding-style.rst. thanks.
Randy Dunlap <rdunlap@infradead.org> wrote: > What is the problem with inline functions in UAPI headers? It makes compiler problems more likely; it increases the potential for name collisions with userspace; it makes for more potential problems if the headers are imported into some other language; and it's not easy to fix a bug in one if userspace uses it, just in case fixing the bug breaks userspace. Further, in this case, the first of Darrick's functions (calculating the length) is probably reasonable, but the second is not. It should crank the tail pointer and then use that, but that requires > >> Also, weird multiline comment style. > > > > Not really. > > Yes really. No. It's not weird. If anything, the default style is less good for several reasons. I'm going to deal with this separately as I need to generate some stats first. David
On 6/13/19 6:34 AM, David Howells wrote: > Randy Dunlap <rdunlap@infradead.org> wrote: > >> What is the problem with inline functions in UAPI headers? > > It makes compiler problems more likely; it increases the potential for name > collisions with userspace; it makes for more potential problems if the headers > are imported into some other language; and it's not easy to fix a bug in one > if userspace uses it, just in case fixing the bug breaks userspace. > > Further, in this case, the first of Darrick's functions (calculating the > length) is probably reasonable, but the second is not. It should crank the > tail pointer and then use that, but that requires > >>>> Also, weird multiline comment style. >>> >>> Not really. >> >> Yes really. > > No. It's not weird. If anything, the default style is less good for several > reasons. I'm going to deal with this separately as I need to generate some > stats first. > > David > OK, maybe you are objecting to the word "weird." So the multi-line comment style that you used is not the preferred Linux kernel multi-line comment style (except in networking code) [Documentation/process/coding-style.rst] that has been in effect for 20+ years AFAIK.
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h new file mode 100644 index 000000000000..c3a88fa5f62a --- /dev/null +++ b/include/uapi/linux/watch_queue.h @@ -0,0 +1,63 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_WATCH_QUEUE_H +#define _UAPI_LINUX_WATCH_QUEUE_H + +#include <linux/types.h> + +enum watch_notification_type { + WATCH_TYPE_META = 0, /* Special record */ + WATCH_TYPE_MOUNT_NOTIFY = 1, /* Mount notification record */ + WATCH_TYPE_SB_NOTIFY = 2, /* Superblock notification */ + WATCH_TYPE_KEY_NOTIFY = 3, /* Key/keyring change notification */ + WATCH_TYPE_BLOCK_NOTIFY = 4, /* Block layer notifications */ +#define WATCH_TYPE___NR 5 +}; + +enum watch_meta_notification_subtype { + WATCH_META_SKIP_NOTIFICATION = 0, /* Just skip this record */ + WATCH_META_REMOVAL_NOTIFICATION = 1, /* Watched object was removed */ +}; + +/* + * Notification record + */ +struct watch_notification { + __u32 type:24; /* enum watch_notification_type */ + __u32 subtype:8; /* Type-specific subtype (filterable) */ + __u32 info; +#define WATCH_INFO_OVERRUN 0x00000001 /* Event(s) lost due to overrun */ +#define WATCH_INFO_ENOMEM 0x00000002 /* Event(s) lost due to ENOMEM */ +#define WATCH_INFO_RECURSIVE 0x00000004 /* Change was recursive */ +#define WATCH_INFO_LENGTH 0x000001f8 /* Length of record / sizeof(watch_notification) */ +#define WATCH_INFO_IN_SUBTREE 0x00000200 /* Change was not at watched root */ +#define WATCH_INFO_TYPE_FLAGS 0x00ff0000 /* Type-specific flags */ +#define WATCH_INFO_FLAG_0 0x00010000 +#define WATCH_INFO_FLAG_1 0x00020000 +#define WATCH_INFO_FLAG_2 0x00040000 +#define WATCH_INFO_FLAG_3 0x00080000 +#define WATCH_INFO_FLAG_4 0x00100000 +#define WATCH_INFO_FLAG_5 0x00200000 +#define WATCH_INFO_FLAG_6 0x00400000 +#define WATCH_INFO_FLAG_7 0x00800000 +#define WATCH_INFO_ID 0xff000000 /* ID of watchpoint */ +#define WATCH_INFO_ID__SHIFT 24 +}; + +#define WATCH_LENGTH_SHIFT 3 + +struct watch_queue_buffer { + union { + /* The first few entries are special, containing the + * ring management variables. + */ + struct { + struct watch_notification watch; /* WATCH_TYPE_META */ + __u32 head; /* Ring head index */ + __u32 tail; /* Ring tail index */ + __u32 mask; /* Ring index mask */ + } meta; + struct watch_notification slots[0]; + }; +}; + +#endif /* _UAPI_LINUX_WATCH_QUEUE_H */
Add UAPI definitions for the general notification ring, including the following pieces: (1) struct watch_notification. This is the metadata header for each entry in the ring. It includes a type and subtype that indicate the source of the message (eg. WATCH_TYPE_MOUNT_NOTIFY) and the kind of the message (eg. NOTIFY_MOUNT_NEW_MOUNT). The header also contains an information field that conveys the following information: - WATCH_INFO_LENGTH. The size of the entry (entries are variable length). - WATCH_INFO_OVERRUN. If preceding messages were lost due to ring overrun or lack of memory. - WATCH_INFO_ENOMEM. If preceding messages were lost due to lack of memory. - WATCH_INFO_RECURSIVE. If the event detected was applied to multiple objects (eg. a recursive change to mount attributes). - WATCH_INFO_IN_SUBTREE. If the event didn't happen at the watched object, but rather to some related object (eg. a subtree mount watch saw a mount happen somewhere within the subtree). - WATCH_INFO_TYPE_FLAGS. Eight flags whose meanings depend on the message type. - WATCH_INFO_ID. The watch ID specified when the watchpoint was set. All the information in the header can be used in filtering messages at the point of writing into the buffer. (2) struct watch_queue_buffer. This describes the layout of the ring. Note that the first slots in the ring contain a special metadata entry that contains the ring pointers. The producer in the kernel knows to skip this and it has a proper header (WATCH_TYPE_META, WATCH_META_SKIP_NOTIFICATION) that indicates the size so that the ring consumer can handle it the same as any other record and just skip it. Note that this means that ring entries can never be split over the end of the ring, so if an entry would need to be split, a skip record is inserted to wrap the ring first; this is also WATCH_TYPE_META, WATCH_META_SKIP_NOTIFICATION. Signed-off-by: David Howells <dhowells@redhat.com> --- include/uapi/linux/watch_queue.h | 63 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 63 insertions(+) create mode 100644 include/uapi/linux/watch_queue.h