Message ID | 20230605233900.2838-4-beaub@linux.microsoft.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | tracing/user_events: Add auto cleanup and a flag to persist events | expand |
On Mon, Jun 5, 2023 at 4:39 PM Beau Belgrave <beaub@linux.microsoft.com> wrote: > + /* > + * When the event is not enabled for auto-delete there will always > + * be at least 1 reference to the event. During the event creation > + * we initially set the refcnt to 2 to achieve this. In those cases > + * the caller must acquire event_mutex and after decrement check if > + * the refcnt is 1, meaning this is the last reference. When auto > + * delete is enabled, there will only be 1 ref, IE: refcnt will be > + * only set to 1 during creation to allow the below checks to go > + * through upon the last put. The last put must always be done with > + * the event mutex held. > + */ > + if (!locked) { > + lockdep_assert_not_held(&event_mutex); > + delete = refcount_dec_and_mutex_lock(&user->refcnt, &event_mutex); > + } else { > + lockdep_assert_held(&event_mutex); > + delete = refcount_dec_and_test(&user->refcnt); > + } > + > + if (!delete) > + return; > + > + /* We now have the event_mutex in all cases */ > + > + if (user->reg_flags & USER_EVENT_REG_PERSIST) { > + /* We should not get here when persist flag is set */ > + pr_alert("BUG: Auto-delete engaged on persistent event\n"); > + goto out; > + } > + > + /* > + * Unfortunately we have to attempt the actual destroy in a work > + * queue. This is because not all cases handle a trace_event_call > + * being removed within the class->reg() operation for unregister. > + */ > + INIT_WORK(&user->put_work, delayed_destroy_user_event); > + > + /* > + * Since the event is still in the hashtable, we have to re-inc > + * the ref count to 1. This count will be decremented and checked > + * in the work queue to ensure it's still the last ref. This is > + * needed because a user-process could register the same event in > + * between the time of event_mutex release and the work queue > + * running the delayed destroy. If we removed the item now from > + * the hashtable, this would result in a timing window where a > + * user process would fail a register because the trace_event_call > + * register would fail in the tracing layers. > + */ > + refcount_set(&user->refcnt, 1); The recnt-ing scheme is quite unorthodox. Atomically decrementing it to zero and then immediately set it back to 1? Smells racy. Another process can go through the same code and do another dec and set to 1 and we'll have two work queued? Will mutex_lock help somehow? If yes, then why atomic refcnt?
On Tue, Jun 06, 2023 at 06:26:56PM -0700, Alexei Starovoitov wrote: > On Mon, Jun 5, 2023 at 4:39 PM Beau Belgrave <beaub@linux.microsoft.com> wrote: > > + /* > > + * When the event is not enabled for auto-delete there will always > > + * be at least 1 reference to the event. During the event creation > > + * we initially set the refcnt to 2 to achieve this. In those cases > > + * the caller must acquire event_mutex and after decrement check if > > + * the refcnt is 1, meaning this is the last reference. When auto > > + * delete is enabled, there will only be 1 ref, IE: refcnt will be > > + * only set to 1 during creation to allow the below checks to go > > + * through upon the last put. The last put must always be done with > > + * the event mutex held. > > + */ > > + if (!locked) { > > + lockdep_assert_not_held(&event_mutex); > > + delete = refcount_dec_and_mutex_lock(&user->refcnt, &event_mutex); > > + } else { > > + lockdep_assert_held(&event_mutex); > > + delete = refcount_dec_and_test(&user->refcnt); > > + } > > + > > + if (!delete) > > + return; > > + > > + /* We now have the event_mutex in all cases */ > > + > > + if (user->reg_flags & USER_EVENT_REG_PERSIST) { > > + /* We should not get here when persist flag is set */ > > + pr_alert("BUG: Auto-delete engaged on persistent event\n"); > > + goto out; > > + } > > + > > + /* > > + * Unfortunately we have to attempt the actual destroy in a work > > + * queue. This is because not all cases handle a trace_event_call > > + * being removed within the class->reg() operation for unregister. > > + */ > > + INIT_WORK(&user->put_work, delayed_destroy_user_event); > > + > > + /* > > + * Since the event is still in the hashtable, we have to re-inc > > + * the ref count to 1. This count will be decremented and checked > > + * in the work queue to ensure it's still the last ref. This is > > + * needed because a user-process could register the same event in > > + * between the time of event_mutex release and the work queue > > + * running the delayed destroy. If we removed the item now from > > + * the hashtable, this would result in a timing window where a > > + * user process would fail a register because the trace_event_call > > + * register would fail in the tracing layers. > > + */ > > + refcount_set(&user->refcnt, 1); > > The recnt-ing scheme is quite unorthodox. Yes, it's unfortunately because we have to keep the event in the hashtable. Typically we'd just remove the event from the hashtable, ref_dec and then upon final ref_dec free it. The problem with that is the event in the hashtable is an actual trace_event exposed via tracefs/perf. It prevents us from removing it at this time as the comment tries to explain. > Atomically decrementing it to zero and then immediately set it back to 1? > Smells racy. Thanks for pointing this out, I likely need another comment in the code explaining why this is ok. It might smell that way :) But the only way once it's zero to get another reference is by acquiring event_mutex and then calling find_user_event(). This is why it's important that at this phase we hold the event_mutex in all cases. > Another process can go through the same code and do another dec and set to 1 > and we'll have two work queued? Once we set it to 1, we cannot get into this code path again until the work that was queued is finished. The scheduled work acquires event_mutex and ensures it's still the last reference before doing any actual deletion or a refcount_dec that would allow for another work queue via this path. > Will mutex_lock help somehow? If yes, then why atomic refcnt? I hopefully fully explained the mutex_lock above, if it's confusing still let me know. The rason for the atomic refcnt though is because you initially get a reference to a user_event via find_user_event() under the lock, but then each FD (and each enabler) has a reference to the underlying user_event (and thus the trace_event). Fork() and enablement registers are done outside of this lock for performance reasons. When the task exits, we also close down the enablers for that mm/task, and so the lock is not there either (and why having the less used refcount_dec_and_mutex_lock being required on put when it's not acquired). Thanks, -Beau
On Mon, 5 Jun 2023 16:38:58 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > Currently user events need to be manually deleted via the delete IOCTL > call or via the dynamic_events file. Most operators and processes wish > to have these events auto cleanup when they are no longer used by > anything to prevent them piling without manual maintenance. However, > some operators may not want this, such as pre-registering events via the > dynamic_events tracefs file. > > Add a persist flag to user facing header and honor it within the > register IOCTL call. Add max flag as well to ensure that only known > flags can be used now and in the future. Update user_event_put() to > attempt an auto delete of the event if it's the last reference. The > auto delete must run in a work queue to ensure proper behavior of > class->reg() invocations that don't expect the call to go away from > underneath them during the unregister. Add work_struct to user_event > struct to ensure we can do this reliably. > > Link: https://lore.kernel.org/linux-trace-kernel/20230518093600.3f119d68@rorschach.local.home/ Since there still seems to be some controversy over the persistent events, could we hold off on implementing them until next merge window? That is, I would like to only have the fd owned events for this release, which would give us time to hash out any of the issues with persistent events. If they are not in, then they are not an API, but once they are in, then we are stuck with them. I believe everyone is fine with the fd owned events, right? > > Suggested-by: Steven Rostedt <rostedt@goodmis.org> > Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> > --- > include/uapi/linux/user_events.h | 10 ++- > kernel/trace/trace_events_user.c | 118 +++++++++++++++++++++++++++---- > 2 files changed, 114 insertions(+), 14 deletions(-) > > That is we can keep all the code of this patch, but: > static __always_inline __must_check > @@ -1609,7 +1695,8 @@ static int user_event_create(const char *raw_command) > > mutex_lock(&group->reg_mutex); > > - ret = user_event_parse_cmd(group, name, &user, 0); > + /* Dyn events persist, otherwise they would cleanup immediately */ > + ret = user_event_parse_cmd(group, name, &user, USER_EVENT_REG_PERSIST); > > if (!ret) > user_event_put(user, false); > @@ -1843,8 +1930,13 @@ static int user_event_parse(struct user_event_group *group, char *name, > Add here: if (reg_flags) { /* Holding off implementing PERSIST events */ ret = -EINVAL; goto put_user_lock; } Which also reminds me. We should return EINVAL if any flags that we do not know about is set. Otherwise when we do implement new flags, the user will not know if they are taking effect or not. -- Steve > user->reg_flags = reg_flags; > > - /* Ensure we track self ref and caller ref (2) */ > - refcount_set(&user->refcnt, 2); > + if (user->reg_flags & USER_EVENT_REG_PERSIST) { > + /* Ensure we track self ref and caller ref (2) */ > + refcount_set(&user->refcnt, 2); > + } else { > + /* Ensure we track only caller ref (1) */ > + refcount_set(&user->refcnt, 1); > + } > > dyn_event_init(&user->devent, &user_event_dops); > dyn_event_add(&user->devent, &user->call); > @@ -2066,8 +2158,8 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) > if (ret) > return ret; > > - /* Ensure no flags, since we don't support any yet */ > - if (kreg->flags != 0) > + /* Ensure only valid flags */ > + if (kreg->flags & ~(USER_EVENT_REG_MAX-1)) > return -EINVAL; > > /* Ensure supported size */
On Thu, Jun 08, 2023 at 04:25:57PM -0400, Steven Rostedt wrote: > On Mon, 5 Jun 2023 16:38:58 -0700 > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > Currently user events need to be manually deleted via the delete IOCTL > > call or via the dynamic_events file. Most operators and processes wish > > to have these events auto cleanup when they are no longer used by > > anything to prevent them piling without manual maintenance. However, > > some operators may not want this, such as pre-registering events via the > > dynamic_events tracefs file. > > > > Add a persist flag to user facing header and honor it within the > > register IOCTL call. Add max flag as well to ensure that only known > > flags can be used now and in the future. Update user_event_put() to > > attempt an auto delete of the event if it's the last reference. The > > auto delete must run in a work queue to ensure proper behavior of > > class->reg() invocations that don't expect the call to go away from > > underneath them during the unregister. Add work_struct to user_event > > struct to ensure we can do this reliably. > > > > Link: https://lore.kernel.org/linux-trace-kernel/20230518093600.3f119d68@rorschach.local.home/ > > Since there still seems to be some controversy over the persistent events, > could we hold off on implementing them until next merge window? That is, I > would like to only have the fd owned events for this release, which would > give us time to hash out any of the issues with persistent events. > > If they are not in, then they are not an API, but once they are in, then we > are stuck with them. I believe everyone is fine with the fd owned events, > right? > I am fine with this approach, however, FD owned events only means that anyone using the /sys/kernel/tracing/dynamic_events will have the event deleted immediately. Should we flat out issue an error instead of having it work, but then removed immediately? NOTE: I'm waiting for the other user_event patches to land in the tracing for-next since there will be conflicts and I want to make sure I get good coverage with catching all the put/get refs. > > > > Suggested-by: Steven Rostedt <rostedt@goodmis.org> > > Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> > > --- > > include/uapi/linux/user_events.h | 10 ++- > > kernel/trace/trace_events_user.c | 118 +++++++++++++++++++++++++++---- > > 2 files changed, 114 insertions(+), 14 deletions(-) > > > > > > That is we can keep all the code of this patch, but: > > > static __always_inline __must_check > > @@ -1609,7 +1695,8 @@ static int user_event_create(const char *raw_command) > > > > mutex_lock(&group->reg_mutex); > > > > - ret = user_event_parse_cmd(group, name, &user, 0); > > + /* Dyn events persist, otherwise they would cleanup immediately */ > > + ret = user_event_parse_cmd(group, name, &user, USER_EVENT_REG_PERSIST); > > > > if (!ret) > > user_event_put(user, false); > > @@ -1843,8 +1930,13 @@ static int user_event_parse(struct user_event_group *group, char *name, > > > > Add here: > > if (reg_flags) { > /* Holding off implementing PERSIST events */ > ret = -EINVAL; > goto put_user_lock; > } > > Which also reminds me. We should return EINVAL if any flags that we do not > know about is set. Otherwise when we do implement new flags, the user will > not know if they are taking effect or not. > We do that in user_reg_get, but point taken, some flag could come through dynamic_events interface and then we'd miss it. Thanks, -Beau > -- Steve > > > > user->reg_flags = reg_flags; > > > > - /* Ensure we track self ref and caller ref (2) */ > > - refcount_set(&user->refcnt, 2); > > + if (user->reg_flags & USER_EVENT_REG_PERSIST) { > > + /* Ensure we track self ref and caller ref (2) */ > > + refcount_set(&user->refcnt, 2); > > + } else { > > + /* Ensure we track only caller ref (1) */ > > + refcount_set(&user->refcnt, 1); > > + } > > > > dyn_event_init(&user->devent, &user_event_dops); > > dyn_event_add(&user->devent, &user->call); > > @@ -2066,8 +2158,8 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) > > if (ret) > > return ret; > > > > - /* Ensure no flags, since we don't support any yet */ > > - if (kreg->flags != 0) > > + /* Ensure only valid flags */ > > + if (kreg->flags & ~(USER_EVENT_REG_MAX-1)) > > return -EINVAL; > > > > /* Ensure supported size */
On Mon, 5 Jun 2023 16:38:58 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: Continuing on what Alexei was saying ... > + /* > + * Unfortunately we have to attempt the actual destroy in a work > + * queue. This is because not all cases handle a trace_event_call > + * being removed within the class->reg() operation for unregister. > + */ > + INIT_WORK(&user->put_work, delayed_destroy_user_event); We initialize the work here. > + > + /* > + * Since the event is still in the hashtable, we have to re-inc > + * the ref count to 1. This count will be decremented and checked > + * in the work queue to ensure it's still the last ref. This is > + * needed because a user-process could register the same event in > + * between the time of event_mutex release and the work queue > + * running the delayed destroy. If we removed the item now from > + * the hashtable, this would result in a timing window where a > + * user process would fail a register because the trace_event_call > + * register would fail in the tracing layers. > + */ > + refcount_set(&user->refcnt, 1); > + > + if (!schedule_work(&user->put_work)) { From what I understand, schedule_work() can only fail if the work is already queued. That should never happen if we just initialized it. That means we need a WARN_ON_ONCE() here. Because it's a major bug if that does return false. -- Steve > + /* > + * If we fail we must wait for an admin to attempt delete or > + * another register/close of the event, whichever is first. > + */ > + pr_warn("user_events: Unable to queue delayed destroy\n"); > + }
diff --git a/include/uapi/linux/user_events.h b/include/uapi/linux/user_events.h index 2984aae4a2b4..74b909e520aa 100644 --- a/include/uapi/linux/user_events.h +++ b/include/uapi/linux/user_events.h @@ -17,6 +17,14 @@ /* Create dynamic location entry within a 32-bit value */ #define DYN_LOC(offset, size) ((size) << 16 | (offset)) +enum user_reg_flag { + /* Event will not delete upon last reference closing */ + USER_EVENT_REG_PERSIST = 1U << 0, + + /* This value or above is currently non-ABI */ + USER_EVENT_REG_MAX = 1U << 1, +}; + /* * Describes an event registration and stores the results of the registration. * This structure is passed to the DIAG_IOCSREG ioctl, callers at a minimum @@ -33,7 +41,7 @@ struct user_reg { /* Input: Enable size in bytes at address */ __u8 enable_size; - /* Input: Flags for future use, set to 0 */ + /* Input: Flags can be any of the above user_reg_flag values */ __u16 flags; /* Input: Address to update when enabled */ diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index 8f0fb6cb0f33..170ec2f5076c 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -85,6 +85,7 @@ struct user_event { struct hlist_node node; struct list_head fields; struct list_head validators; + struct work_struct put_work; refcount_t refcnt; int min_size; int reg_flags; @@ -169,6 +170,7 @@ static int user_event_parse(struct user_event_group *group, char *name, static struct user_event_mm *user_event_mm_get(struct user_event_mm *mm); static struct user_event_mm *user_event_mm_get_all(struct user_event *user); static void user_event_mm_put(struct user_event_mm *mm); +static int destroy_user_event(struct user_event *user); static u32 user_event_key(char *name) { @@ -182,19 +184,98 @@ static struct user_event *user_event_get(struct user_event *user) return user; } +static void delayed_destroy_user_event(struct work_struct *work) +{ + struct user_event *user = container_of( + work, struct user_event, put_work); + + mutex_lock(&event_mutex); + + if (!refcount_dec_and_test(&user->refcnt)) + goto out; + + if (destroy_user_event(user)) { + /* + * The only reason this would fail here is if we cannot + * update the visibility of the event. In this case the + * event stays in the hashtable, waiting for someone to + * attempt to delete it later. + */ + pr_warn("user_events: Unable to delete event\n"); + refcount_set(&user->refcnt, 1); + } +out: + mutex_unlock(&event_mutex); +} + static void user_event_put(struct user_event *user, bool locked) { -#ifdef CONFIG_LOCKDEP - if (locked) - lockdep_assert_held(&event_mutex); - else - lockdep_assert_not_held(&event_mutex); -#endif + bool delete; if (unlikely(!user)) return; - refcount_dec(&user->refcnt); + /* + * When the event is not enabled for auto-delete there will always + * be at least 1 reference to the event. During the event creation + * we initially set the refcnt to 2 to achieve this. In those cases + * the caller must acquire event_mutex and after decrement check if + * the refcnt is 1, meaning this is the last reference. When auto + * delete is enabled, there will only be 1 ref, IE: refcnt will be + * only set to 1 during creation to allow the below checks to go + * through upon the last put. The last put must always be done with + * the event mutex held. + */ + if (!locked) { + lockdep_assert_not_held(&event_mutex); + delete = refcount_dec_and_mutex_lock(&user->refcnt, &event_mutex); + } else { + lockdep_assert_held(&event_mutex); + delete = refcount_dec_and_test(&user->refcnt); + } + + if (!delete) + return; + + /* We now have the event_mutex in all cases */ + + if (user->reg_flags & USER_EVENT_REG_PERSIST) { + /* We should not get here when persist flag is set */ + pr_alert("BUG: Auto-delete engaged on persistent event\n"); + goto out; + } + + /* + * Unfortunately we have to attempt the actual destroy in a work + * queue. This is because not all cases handle a trace_event_call + * being removed within the class->reg() operation for unregister. + */ + INIT_WORK(&user->put_work, delayed_destroy_user_event); + + /* + * Since the event is still in the hashtable, we have to re-inc + * the ref count to 1. This count will be decremented and checked + * in the work queue to ensure it's still the last ref. This is + * needed because a user-process could register the same event in + * between the time of event_mutex release and the work queue + * running the delayed destroy. If we removed the item now from + * the hashtable, this would result in a timing window where a + * user process would fail a register because the trace_event_call + * register would fail in the tracing layers. + */ + refcount_set(&user->refcnt, 1); + + if (!schedule_work(&user->put_work)) { + /* + * If we fail we must wait for an admin to attempt delete or + * another register/close of the event, whichever is first. + */ + pr_warn("user_events: Unable to queue delayed destroy\n"); + } +out: + /* Ensure if we didn't have event_mutex before we unlock it */ + if (!locked) + mutex_unlock(&event_mutex); } static void user_event_group_destroy(struct user_event_group *group) @@ -793,7 +874,12 @@ static struct user_event_enabler static __always_inline __must_check bool user_event_last_ref(struct user_event *user) { - return refcount_read(&user->refcnt) == 1; + int last = 0; + + if (user->reg_flags & USER_EVENT_REG_PERSIST) + last = 1; + + return refcount_read(&user->refcnt) == last; } static __always_inline __must_check @@ -1609,7 +1695,8 @@ static int user_event_create(const char *raw_command) mutex_lock(&group->reg_mutex); - ret = user_event_parse_cmd(group, name, &user, 0); + /* Dyn events persist, otherwise they would cleanup immediately */ + ret = user_event_parse_cmd(group, name, &user, USER_EVENT_REG_PERSIST); if (!ret) user_event_put(user, false); @@ -1843,8 +1930,13 @@ static int user_event_parse(struct user_event_group *group, char *name, user->reg_flags = reg_flags; - /* Ensure we track self ref and caller ref (2) */ - refcount_set(&user->refcnt, 2); + if (user->reg_flags & USER_EVENT_REG_PERSIST) { + /* Ensure we track self ref and caller ref (2) */ + refcount_set(&user->refcnt, 2); + } else { + /* Ensure we track only caller ref (1) */ + refcount_set(&user->refcnt, 1); + } dyn_event_init(&user->devent, &user_event_dops); dyn_event_add(&user->devent, &user->call); @@ -2066,8 +2158,8 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) if (ret) return ret; - /* Ensure no flags, since we don't support any yet */ - if (kreg->flags != 0) + /* Ensure only valid flags */ + if (kreg->flags & ~(USER_EVENT_REG_MAX-1)) return -EINVAL; /* Ensure supported size */
Currently user events need to be manually deleted via the delete IOCTL call or via the dynamic_events file. Most operators and processes wish to have these events auto cleanup when they are no longer used by anything to prevent them piling without manual maintenance. However, some operators may not want this, such as pre-registering events via the dynamic_events tracefs file. Add a persist flag to user facing header and honor it within the register IOCTL call. Add max flag as well to ensure that only known flags can be used now and in the future. Update user_event_put() to attempt an auto delete of the event if it's the last reference. The auto delete must run in a work queue to ensure proper behavior of class->reg() invocations that don't expect the call to go away from underneath them during the unregister. Add work_struct to user_event struct to ensure we can do this reliably. Link: https://lore.kernel.org/linux-trace-kernel/20230518093600.3f119d68@rorschach.local.home/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> --- include/uapi/linux/user_events.h | 10 ++- kernel/trace/trace_events_user.c | 118 +++++++++++++++++++++++++++---- 2 files changed, 114 insertions(+), 14 deletions(-)