Message ID | 20190721081933-mutt-send-email-mst@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop) |
On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> Hi Paul, others,
>
> So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> is what happens if userspace starts cycling through lots of these
> ioctls. Given we actually use rcu as an optimization, we could just
> disable the optimization temporarily - but the question would be how to
> detect an excessive rate without working too hard :) .
>
> I guess we could define as excessive any rate where callback is
> outstanding at the time when new structure is allocated. I have very
> little understanding of rcu internals - so I wanted to check that the
> following more or less implements this heuristic before I spend time
> actually testing it.
>
> Could others pls take a look and let me know?

These look good as a way of seeing if there are any outstanding callbacks,
but in the case of Tree RCU, call_rcu_outstanding() would almost never
return false on a busy system.

Here are some alternatives:

o	RCU uses some pieces of Rao Shoaib's kfree_rcu() patches.
	The idea is to make kfree_rcu() locally buffer requests into
	batches of (say) 1,000, but processing smaller batches when RCU
	is idle, or when some smallish amount of time has passed with
	no more kfree_rcu() requests from that CPU. RCU then takes in
	the batch using not call_rcu(), but rather queue_rcu_work().
	The resulting batch of kfree() calls would therefore execute in
	workqueue context rather than in softirq context, which should
	be much easier on the system.

	In theory, this would allow people to use kfree_rcu() without
	worrying quite so much about overload. It would also not be
	that hard to implement.

o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
	call_rcu() instead of kfree_rcu(). Keep a count of the number
	of things waiting for a grace period, and when this gets too
	large, disable the optimization. It will then drain down, at
	which point the optimization can be re-enabled.

	But please note that callbacks are -not- guaranteed to run on
	the CPU that queued them. So yes, you would need a per-CPU
	counter, but you would need to periodically sum it up to check
	against the global state. Or keep track of the CPU that
	did the call_rcu() so that you can atomically decrement in
	the callback the same counter that was atomically incremented
	just before the call_rcu(). Or any number of other approaches.

Also, the overhead is important. For example, as far as I know,
current RCU gracefully handles close(open(...)) in a tight userspace
loop. But there might be trouble due to tight userspace loops around
lighter-weight operations.

So an important question is "Just how fast is your ioctl?" If it takes
(say) 100 microseconds to execute, there should be absolutely no problem.
On the other hand, if it can execute in 50 nanoseconds, this very likely
does need serious attention.

Other thoughts?

							Thanx, Paul

> Thanks!
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> [snip the RFC patch; the full diff appears at the end of this page]
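As a concrete illustration of the second alternative above, here is a minimal sketch in which a single global atomic_t stands in for the per-CPU counting Paul outlines. All names and the cutoff are invented for illustration; this is not an existing kernel API:

	/*
	 * Sketch: count in-flight callbacks; once over the limit,
	 * disable the RCU optimization and free synchronously. The
	 * counter is decremented in the callback, which may run on a
	 * different CPU than the one that queued it, hence the atomic.
	 */
	struct my_obj {
		struct rcu_head rcu;
		/* ... payload ... */
	};

	static atomic_t my_cbs_in_flight = ATOMIC_INIT(0);
	#define MY_CB_LIMIT 1000	/* arbitrary illustrative cutoff */

	static void my_obj_cb(struct rcu_head *rhp)
	{
		struct my_obj *obj = container_of(rhp, struct my_obj, rcu);

		kfree(obj);
		atomic_dec(&my_cbs_in_flight);
	}

	static void my_obj_free(struct my_obj *obj)
	{
		if (atomic_inc_return(&my_cbs_in_flight) > MY_CB_LIMIT) {
			/* Too many outstanding: take the slow path. */
			atomic_dec(&my_cbs_in_flight);
			synchronize_rcu();
			kfree(obj);
			return;
		}
		call_rcu(&obj->rcu, my_obj_cb);
	}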
On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> These look good as a way of seeing if there are any outstanding callbacks,
> but in the case of Tree RCU, call_rcu_outstanding() would almost never
> return false on a busy system.

Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?

> Here are some alternatives:
>
> o	RCU uses some pieces of Rao Shoaib's kfree_rcu() patches. [snip]
>
> o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
>	call_rcu() instead of kfree_rcu(). Keep a count of the number
>	of things waiting for a grace period, and when this gets too
>	large, disable the optimization. [snip]

I'm really looking for something we can do this merge window
and without adding too much code, and kfree_rcu is intended to
fix a bug. Adding call_rcu and careful accounting is something
that I'm not happy adding with merge window already open.

> So an important question is "Just how fast is your ioctl?" If it takes
> (say) 100 microseconds to execute, there should be absolutely no problem.
> On the other hand, if it can execute in 50 nanoseconds, this very likely
> does need serious attention.
>
> Other thoughts?

Hmm, the answer to this would be I'm not sure.
It's setup time stuff, we never tested it.

> Thanks!
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> [snip the quoted RFC patch]
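The renamed helper MST suggests might look like the following sketch against the Tree RCU internals of that era (rcu_segcblist_n_lazy_cbs() existed at the time; the 1000 cutoff is arbitrary and the name call_rcu_busy is hypothetical):

	/*
	 * Sketch: report "busy" once this CPU has more than 1000 lazy
	 * callbacks queued and awaiting a grace period.
	 */
	bool call_rcu_busy(void)
	{
		unsigned long flags;
		struct rcu_data *rdp;
		bool busy;

		local_irq_save(flags);
		rdp = this_cpu_ptr(&rcu_data);
		busy = rcu_segcblist_n_lazy_cbs(&rdp->cblist) > 1000;
		local_irq_restore(flags);

		return busy;
	}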
On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?

Or the function could simply return the number of callbacks queued
on the current CPU, and let the caller decide how many is too many.

> I'm really looking for something we can do this merge window
> and without adding too much code, and kfree_rcu is intended to
> fix a bug. Adding call_rcu and careful accounting is something
> that I'm not happy adding with merge window already open.

OK, then I suggest having the interface return you the number of
callbacks. That allows you to experiment with the cutoff.

Give or take the ioctl overhead...

> Hmm, the answer to this would be I'm not sure.
> It's setup time stuff, we never tested it.

Is it possible to measure it easily?

							Thanx, Paul

> [snip the quoted RFC patch]
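The count-returning variant Paul describes could be as simple as this sketch, using the existing per-CPU callback count rcu_segcblist_n_cbs(); the function name call_rcu_n_queued is invented for illustration:

	/*
	 * Sketch: return how many callbacks this CPU has queued, and
	 * let the caller pick its own cutoff.
	 */
	long call_rcu_n_queued(void)
	{
		unsigned long flags;
		struct rcu_data *rdp;
		long n;

		local_irq_save(flags);
		rdp = this_cpu_ptr(&rcu_data);
		n = rcu_segcblist_n_cbs(&rdp->cblist);
		local_irq_restore(flags);

		return n;
	}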
On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> Also, the overhead is important. For example, as far as I know,
> current RCU gracefully handles close(open(...)) in a tight userspace
> loop. But there might be trouble due to tight userspace loops around
> lighter-weight operations.

I thought you believed that RCU was antifragile, in that it would scale
better as it was used more heavily?

Would it make sense to have call_rcu() check to see if there are many
outstanding requests on this CPU and if so process them before returning?
That would ensure that frequent callers usually ended up doing their
own processing.
On Sun, Jul 21, 2019 at 02:08:37PM -0700, Matthew Wilcox wrote:
> I thought you believed that RCU was antifragile, in that it would scale
> better as it was used more heavily?

You are referring to this? https://paulmck.livejournal.com/47933.html

If so, the last few paragraphs might be worth re-reading. ;-)

And in this case, the heuristics RCU uses to decide when to schedule
invocation of the callbacks need some help. One component of that help
is a time-based limit to the number of consecutive callback invocations
(see my crude prototype and Eric Dumazet's more polished patch). Another
component is an overload warning.

Why would an overload warning be needed if RCU's callback-invocation
scheduling heuristics were upgraded? Because someone could boot a
100-CPU system with rcu_nocbs=0-99, bind all of the resulting
rcuo kthreads to (say) CPU 0, and then run a callback-heavy workload
on all of the CPUs. Given the constraints, CPU 0 cannot keep up.

So warnings are required as well.

> Would it make sense to have call_rcu() check to see if there are many
> outstanding requests on this CPU and if so process them before returning?
> That would ensure that frequent callers usually ended up doing their
> own processing.

Unfortunately, no. Here is a code fragment illustrating why:

	void my_cb(struct rcu_head *rhp)
	{
		unsigned long flags;

		spin_lock_irqsave(&my_lock, flags);
		handle_cb(rhp);
		spin_unlock_irqrestore(&my_lock, flags);
	}

	. . .

	spin_lock_irqsave(&my_lock, flags);
	p = look_something_up();
	remove_that_something(p);
	call_rcu(p, my_cb);
	spin_unlock_irqrestore(&my_lock, flags);

Invoking the extra callbacks directly from call_rcu() would thus result
in self-deadlock: call_rcu() runs with my_lock held, and my_cb() would
then try to acquire my_lock again. Documentation/RCU/UP.txt contains a
few more examples along these lines.
On Sun, Jul 21, 2019 at 04:31:13PM -0700, Paul E. McKenney wrote:
> > Would it make sense to have call_rcu() check to see if there are many
> > outstanding requests on this CPU and if so process them before returning?
> > That would ensure that frequent callers usually ended up doing their
> > own processing.
>
> Unfortunately, no. Here is a code fragment illustrating why:
>
> [snip the my_lock/my_cb self-deadlock example]
>
> Invoking the extra callbacks directly from call_rcu() would thus result
> in self-deadlock. Documentation/RCU/UP.txt contains a few more examples
> along these lines.

We could add an option that simply fails if overloaded, right?
Have caller recover...
On Sun, Jul 21, 2019 at 12:28:41PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> > Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> > and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?
>
> Or the function could simply return the number of callbacks queued
> on the current CPU, and let the caller decide how many is too many.
>
> [snip]
>
> OK, then I suggest having the interface return you the number of
> callbacks. That allows you to experiment with the cutoff.
>
> Give or take the ioctl overhead...

OK - and for tiny just assume 1 is too much?

> Is it possible to measure it easily?
>
> 							Thanx, Paul
>
> [snip the rest of the quoted exchange and the RFC patch]
On Mon, Jul 22, 2019 at 03:52:05AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 04:31:13PM -0700, Paul E. McKenney wrote:
> > Invoking the extra callbacks directly from call_rcu() would thus result
> > in self-deadlock. Documentation/RCU/UP.txt contains a few more examples
> > along these lines.
>
> We could add an option that simply fails if overloaded, right?
> Have caller recover...

For example, return EBUSY from your ioctl? That should work. You could
also sleep for a jiffy or two to let things catch up in this BUSY (or
similar) case. Or try three times, waiting a jiffy between each try,
and return EBUSY if all three tries failed. Or just keep it simple and
return EBUSY on the first try. ;-)

All of this assumes that this ioctl is the cause of the overload, which
during early boot seems to me to be a safe assumption.

							Thanx, Paul
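A minimal sketch of the try-three-times variant, built on the call_rcu_outstanding() helper proposed in this thread; the surrounding vhost-side names (my_vq, my_priv, the 'rcu' field, vq->mutex) are invented for illustration:

	struct my_priv {
		struct rcu_head rcu;
		/* ... payload ... */
	};

	/* Sketch only: retry around a kfree_rcu() in an ioctl path. */
	static int vhost_replace_priv(struct my_vq *vq, struct my_priv *new)
	{
		struct my_priv *old;
		int tries;

		for (tries = 0; tries < 3; tries++) {
			if (!call_rcu_outstanding()) {
				old = rcu_dereference_protected(vq->priv,
						lockdep_is_held(&vq->mutex));
				rcu_assign_pointer(vq->priv, new);
				kfree_rcu(old, rcu);
				return 0;
			}
			/* Let queued callbacks drain for roughly a jiffy. */
			schedule_timeout_interruptible(1);
		}
		return -EBUSY;	/* userspace can retry the ioctl later */
	}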
On Mon, Jul 22, 2019 at 03:56:22AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 12:28:41PM -0700, Paul E. McKenney wrote:
> > OK, then I suggest having the interface return you the number of
> > callbacks. That allows you to experiment with the cutoff.
> >
> > Give or take the ioctl overhead...
>
> OK - and for tiny just assume 1 is too much?

I bet that for tiny you won't need to rate-limit at all. The reason is
that grace periods are quite short. In fact, for TINY (that is,
!SMP && !PREEMPT), synchronize_rcu() is a no-op.

So in TINY, given that your ioctl is executing at process level, you
could just invoke synchronize_rcu() and then kfree():

	#ifdef CONFIG_TINY_RCU
		synchronize_rcu(); /* No other CPUs, so a QS is a GP! */
		kfree(whatever);
		return; /* Or whatever control flow is appropriate. */
	#endif

	/* More complicated stuff for !TINY. */

							Thanx, Paul

> [snip the rest of the quoted exchange and the RFC patch]
On Mon, Jul 22, 2019 at 04:51:49AM -0700, Paul E. McKenney wrote:
> > > > Would it make sense to have call_rcu() check to see if there are many
> > > > outstanding requests on this CPU and if so process them before returning?
> > > > That would ensure that frequent callers usually ended up doing their
> > > > own processing.
> > >
> > > Unfortunately, no. Here is a code fragment illustrating why:

That is only true in the general case though; kfree_rcu() doesn't have
this problem, since we know what the callback is doing. In general a
caller of kfree_rcu() should not need to hold any locks while calling
it.

We could apply the same idea more generally and have some
'call_immediate_or_rcu()' which has restrictions on the caller's
context.

I think if we have some kind of problem here it would be better to
handle it inside the core code and only require that callers use the
correct RCU API.

I can think of many places where kfree_rcu() is being used under user
control..

Jason
[snip]
> > Would it make sense to have call_rcu() check to see if there are many
> > outstanding requests on this CPU and if so process them before returning?
> > That would ensure that frequent callers usually ended up doing their
> > own processing.

Other than what Paul already mentioned about deadlocks, I am not sure
if this would even work for all cases, since call_rcu() has to wait for
a grace period.

So, if the number of outstanding requests is higher than a certain
amount, then for some RCU configurations you *still* have to wait for
the grace period duration and cannot just execute the callback in-line.
Did I miss something?

Can waiting in-line for a grace period duration be tolerated in the
vhost case?

thanks,

- Joel
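One existing primitive pair that bounds such an in-line wait is get_state_synchronize_rcu() and cond_synchronize_rcu(): snapshot the grace-period state when the object is retired, and wait later only if a full grace period has not already elapsed. A sketch, with all the my_* names invented for illustration:

	struct my_obj {
		unsigned long gp_state;	/* from get_state_synchronize_rcu() */
		/* ... payload ... */
	};

	static void my_retire(struct my_obj *obj)
	{
		/* Unpublish obj from reader-visible structures first. */
		obj->gp_state = get_state_synchronize_rcu();
	}

	static void my_free(struct my_obj *obj)
	{
		/* No-op if a grace period has already elapsed. */
		cond_synchronize_rcu(obj->gp_state);
		kfree(obj);
	}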
On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> Can waiting in-line for a grace period duration be tolerated in the
> vhost case?

No, but it has many other ways to recover (try again later, drop a
packet, use a slower copy to/from user).
On Mon, Jul 22, 2019 at 10:41:52AM -0300, Jason Gunthorpe wrote:
> That is only true in the general case though; kfree_rcu() doesn't have
> this problem, since we know what the callback is doing. In general a
> caller of kfree_rcu() should not need to hold any locks while calling
> it.

Good point, at least as long as the slab allocators don't call
kfree_rcu() while holding any of the slab locks. However, that would
require a separate list for the kfree_rcu() callbacks, and concurrent
access to those lists of kfree_rcu() callbacks.

So this might work, but would add some complexity and also yet another
restriction between RCU and another kernel subsystem. So I would like
to try the other approaches first, for example, the time-based approach
in my prototype and Eric Dumazet's more polished patch.

But the immediate-invocation possibility is still there if needed.

> We could apply the same idea more generally and have some
> 'call_immediate_or_rcu()' which has restrictions on the caller's
> context.
>
> I think if we have some kind of problem here it would be better to
> handle it inside the core code and only require that callers use the
> correct RCU API.

Agreed. Especially given that there are a number of things that can be
done within RCU.

> I can think of many places where kfree_rcu() is being used under user
> control..

And same for call_rcu(). And this is not the first time we have run
into this. The last time was about 15 years ago, if I remember
correctly, and that one led to some of the quiescent-state forcing and
callback-invocation batch size tricks still in use today. My only real
surprise is that it took so long for this to come up again. ;-)

Please note also that in the common case on default configurations,
callback invocation is done on the CPU that posted the callback. This
means that callback invocation normally applies backpressure to the
callback-happy workload.

So why then is there a problem? The problem is not the lack of
backpressure, but rather that the scheduling of callback invocation
needs to be a bit more considerate of the needs of the rest of the
system. In the common case, that is. Except that the uncommon case is
real-time configurations, in which care is needed anyway. But I am in
the midst of helping those out as well, details on the "dev" branch
of -rcu.

							Thanx, Paul
On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > Can waiting in-line for a grace period duration be tolerated in the
> > vhost case?
>
> No, but it has many other ways to recover (try again later, drop a
> packet, use a slower copy to/from user).

True enough! And your idea of taking recovery action based on the
number of callbacks seems like a good one while we are getting RCU's
callback scheduling improved.

By the way, was this a real problem that you could make happen on real
hardware? If not, I would suggest just letting RCU get improved over
the next couple of releases.

If it is something that you actually made happen, please let me know
what (if anything) you need from me for your callback-counting EBUSY
scheme.

							Thanx, Paul
On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> So why then is there a problem?
I'm not sure there is a real problem, I thought Michael was just
asking how to design with RCU in the case where the user controls the
kfree_rcu??
Sounds like the answer is "don't worry about it" ?
Thanks,
Jason
On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> True enough! And your idea of taking recovery action based on the
> number of callbacks seems like a good one while we are getting RCU's
> callback scheduling improved.
>
> By the way, was this a real problem that you could make happen on real
> hardware? If not, I would suggest just letting RCU get improved over
> the next couple of releases.

So basically use kfree_rcu but add a comment saying e.g. "WARNING:
in the future callers of kfree_rcu might need to check that
not too many callbacks get queued. In that case, we can
disable the optimization, or recover in some other way.
Watch this space."

> If it is something that you actually made happen, please let me know
> what (if anything) you need from me for your callback-counting EBUSY
> scheme.

If you mean kfree_rcu causing OOM, then no, it's all theoretical.
If you mean synchronize_rcu stalling to the point where the guest will
oops, then yes, that's not too hard to trigger.
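Rendered as an actual source comment next to the kfree_rcu() call, MST's suggested warning might look like the following sketch (the object and field names are hypothetical):

	/*
	 * WARNING: in the future, callers of kfree_rcu() might need to
	 * check that not too many callbacks get queued. In that case,
	 * we can disable the optimization, or recover in some other
	 * way. Watch this space.
	 */
	kfree_rcu(old_priv, rcu);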
On Mon, Jul 22, 2019 at 01:04:48PM -0300, Jason Gunthorpe wrote:
> I'm not sure there is a real problem, I thought Michael was just
> asking how to design with RCU in the case where the user controls the
> kfree_rcu??

Right, it's all based on documentation saying we should worry :)

> Sounds like the answer is "don't worry about it" ?
>
> Thanks,
> Jason
On Mon, Jul 22, 2019 at 01:04:48PM -0300, Jason Gunthorpe wrote:
> I'm not sure there is a real problem, I thought Michael was just
> asking how to design with RCU in the case where the user controls the
> kfree_rcu??
>
> Sounds like the answer is "don't worry about it" ?

Unless you can force failures, you should be good. And either way,
improvements to RCU's handling of this sort of situation are in the
works. And rcutorture has gained tests of this stuff in the last year
or so as well; see its "fwd_progress" module parameter and the related
code.

							Thanx, Paul
On Mon, Jul 22, 2019 at 12:13:40PM -0400, Michael S. Tsirkin wrote:
> So basically use kfree_rcu but add a comment saying e.g. "WARNING:
> in the future callers of kfree_rcu might need to check that
> not too many callbacks get queued. In that case, we can
> disable the optimization, or recover in some other way.
> Watch this space."

That sounds fair.

> If you mean kfree_rcu causing OOM, then no, it's all theoretical.
> If you mean synchronize_rcu stalling to the point where the guest will
> oops, then yes, that's not too hard to trigger.

Is synchronize_rcu() being stalled by the userspace loop that is
invoking your ioctl that does kfree_rcu()? Or instead by the resulting
callback invocation?

							Thanx, Paul
On Mon, Jul 22, 2019 at 09:25:51AM -0700, Paul E. McKenney wrote:
> Is synchronize_rcu() being stalled by the userspace loop that is
> invoking your ioctl that does kfree_rcu()? Or instead by the resulting
> callback invocation?

Sorry, let me clarify. We currently have synchronize_rcu in a userspace
loop. I have a patch replacing that with kfree_rcu. This isn't the
first time synchronize_rcu is stalling a VM for a long while, so I
didn't investigate further.
On Mon, Jul 22, 2019 at 12:32:17PM -0400, Michael S. Tsirkin wrote:
> Sorry, let me clarify. We currently have synchronize_rcu in a userspace
> loop. I have a patch replacing that with kfree_rcu. This isn't the
> first time synchronize_rcu is stalling a VM for a long while, so I
> didn't investigate further.

Ah, so a bunch of synchronize_rcu() calls within a single system call
inside the host is stalling the guest, correct?

If so, one straightforward approach is to do an rcu_barrier() every
(say) 1000 kfree_rcu() calls within that loop in the system call. This
will decrease the overhead by almost a factor of 1000 compared to a
synchronize_rcu() on each trip through that loop, and will prevent
callback overload.

Or if the situation is different (for example, the guest does a long
sequence of system calls, each of which does a single kfree_rcu() or
some such), please let me know what the situation is.

							Thanx, Paul
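A sketch of that batching pattern; the loop body and the my_obj names are invented for illustration:

	/*
	 * Sketch: replace one synchronize_rcu() per iteration with one
	 * kfree_rcu() per iteration plus an rcu_barrier() per 1000
	 * calls. rcu_barrier() waits for all previously queued
	 * callbacks to be invoked, so the queue cannot grow unbounded.
	 */
	struct my_obj {
		struct list_head list;
		struct rcu_head rcu;
	};

	static void free_all(struct list_head *to_free)
	{
		struct my_obj *obj, *tmp;
		int count = 0;

		list_for_each_entry_safe(obj, tmp, to_free, list) {
			list_del(&obj->list);
			kfree_rcu(obj, rcu);
			if (++count % 1000 == 0)
				rcu_barrier();	/* drain queued callbacks */
		}
	}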
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 477b4eb44af5..067909521d72 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -125,6 +125,24 @@ void synchronize_rcu(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu);
 
+/*
+ * Helpful for rate-limiting kfree_rcu()/call_rcu() callbacks: returns
+ * true if callbacks are still queued awaiting a grace period.
+ */
+bool call_rcu_outstanding(void)
+{
+	unsigned long flags;
+	bool outstanding;
+
+	local_irq_save(flags);
+	/* Callbacks are outstanding while the done list lags the queue. */
+	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
+	local_irq_restore(flags);
+
+	return outstanding;
+}
+EXPORT_SYMBOL_GPL(call_rcu_outstanding);
+
 /*
  * Post an RCU callback to be invoked after the end of an RCU grace
  * period. But since we have but one CPU, that would be after any
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a14e5fbbea46..d4b9d61e637d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2482,6 +2482,26 @@ static void rcu_leak_callback(struct rcu_head *rhp)
 {
 }
 
+/*
+ * Helpful for rate-limiting kfree_rcu()/call_rcu() callbacks: returns
+ * true if this CPU still has callbacks queued awaiting a grace period.
+ */
+bool call_rcu_outstanding(void)
+{
+	unsigned long flags;
+	struct rcu_data *rdp;
+	bool outstanding;
+
+	local_irq_save(flags);
+	rdp = this_cpu_ptr(&rcu_data);
+	/* A non-empty callback list means callbacks are outstanding. */
+	outstanding = !rcu_segcblist_empty(&rdp->cblist);
+	local_irq_restore(flags);
+
+	return outstanding;
+}
+EXPORT_SYMBOL_GPL(call_rcu_outstanding);
+
 /*
  * Helper function for call_rcu() and friends. The cpu argument will
  * normally be -1, indicating "currently running CPU". It may specify
Hi Paul, others,

So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
is what happens if userspace starts cycling through lots of these
ioctls. Given we actually use rcu as an optimization, we could just
disable the optimization temporarily - but the question would be how to
detect an excessive rate without working too hard :) .

I guess we could define as excessive any rate where a callback is
outstanding at the time when a new structure is allocated. I have very
little understanding of rcu internals - so I wanted to check that the
following more or less implements this heuristic before I spend time
actually testing it.

Could others pls take a look and let me know?

Thanks!

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
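A sketch of how a caller might apply this heuristic, falling back to synchronous freeing whenever a callback is already outstanding; the vhost-side names (my_priv and its 'rcu' field) are invented for illustration:

	struct my_priv {
		struct rcu_head rcu;
		/* ... payload ... */
	};

	/*
	 * Sketch only: use call_rcu_outstanding() to choose between the
	 * kfree_rcu() fast path and a synchronous slow path.
	 */
	static void free_old_priv(struct my_priv *old)
	{
		if (call_rcu_outstanding()) {
			/* Excessive rate: disable the optimization. */
			synchronize_rcu();
			kfree(old);
		} else {
			kfree_rcu(old, rcu);
		}
	}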