Message ID | 1f85dc8e6a0149677563a2dfb4cef9a9c7eaa391.1368702323.git.mst@redhat.com (mailing list archive)
State      | New, archived
On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
> There are several ways to make sure a might_fault-calling function does
> not sleep. One is to use it on kernel or otherwise locked memory -
> apparently nfs/sunrpc does this. As noted by Ingo, this is handled by the
> might_fault() implementation in mm/memory.c but not the one in
> linux/kernel.h, so in the current code might_fault() schedules
> differently depending on CONFIG_PROVE_LOCKING, which is an undesired
> semantic side effect.
>
> Another is to call pagefault_disable: in this case the page fault
> handler will go to fixups processing and we get an error instead of
> sleeping, so the might_sleep annotation is a false positive. The vhost
> driver wants to do this now in order to reuse socket ops under a
> spinlock (and fall back on a slower thread handler on error).

Are you using the assumption that spin_lock() implies preempt_disable()
implies pagefault_disable()? Note that this assumption isn't valid for -rt,
where the spinlock becomes preemptible but we'll not disable pagefaults.

> Address both issues by:
> - dropping the unconditional call to might_sleep
>   from the fast might_fault code in linux/kernel.h
> - checking for pagefault_disable() in the
>   CONFIG_PROVE_LOCKING implementation
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/kernel.h |  1 -
>  mm/memory.c            | 14 +++++++++-----
>  2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index e96329c..322b065 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -198,7 +198,6 @@ void might_fault(void);
>  #else
>  static inline void might_fault(void)
>  {
> -	might_sleep();

This removes potential resched points for PREEMPT_VOLUNTARY -- was that
intentional?

>  }
>  #endif
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6dc1882..1b8327b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4222,13 +4222,17 @@ void might_fault(void)
>  	if (segment_eq(get_fs(), KERNEL_DS))
>  		return;
>
> -	might_sleep();
>  	/*
> -	 * it would be nicer only to annotate paths which are not under
> -	 * pagefault_disable, however that requires a larger audit and
> -	 * providing helpers like get_user_atomic.
> +	 * It would be nicer to annotate paths which are under preempt_disable
> +	 * but not under pagefault_disable, however that requires a new flag
> +	 * for differentiating between the two.

-rt has this: pagefault_disable() doesn't change the preempt count but
pokes at task_struct::pagefault_disable.

>  	 */
> -	if (!in_atomic() && current->mm)
> +	if (in_atomic())
> +		return;
> +
> +	might_sleep();
> +
> +	if (current->mm)
>  		might_lock_read(&current->mm->mmap_sem);
>  }
>  EXPORT_SYMBOL(might_fault);
> --
> MST
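As background for Peter's remark about task_struct::pagefault_disable, here
is a minimal sketch of what a per-task pagefault-disable flag could look
like. It assumes a counter field added to task_struct and is illustrative
only; it is not copied from the -rt patch set.

/*
 * Illustrative sketch, not the actual -rt code: track pagefault_disable()
 * in a per-task counter instead of in the preempt count.
 *
 * Assumes task_struct gains:   int pagefault_disabled;
 */
static inline void pagefault_disable(void)
{
	current->pagefault_disabled++;
	/*
	 * Keep the counter update ordered before any user access that
	 * follows.
	 */
	barrier();
}

static inline void pagefault_enable(void)
{
	barrier();
	current->pagefault_disabled--;
}

/*
 * The fault handler (and a might_fault() annotation) could then test this
 * per-task flag instead of in_atomic():
 */
static inline bool pagefault_disabled(void)
{
	return current->pagefault_disabled != 0;
}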
On Thu, May 16, 2013 at 08:40:41PM +0200, Peter Zijlstra wrote:
> On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
> > There are several ways to make sure a might_fault-calling function does
> > not sleep. One is to use it on kernel or otherwise locked memory -
> > apparently nfs/sunrpc does this. As noted by Ingo, this is handled by the
> > might_fault() implementation in mm/memory.c but not the one in
> > linux/kernel.h, so in the current code might_fault() schedules
> > differently depending on CONFIG_PROVE_LOCKING, which is an undesired
> > semantic side effect.
> >
> > Another is to call pagefault_disable: in this case the page fault
> > handler will go to fixups processing and we get an error instead of
> > sleeping, so the might_sleep annotation is a false positive. The vhost
> > driver wants to do this now in order to reuse socket ops under a
> > spinlock (and fall back on a slower thread handler on error).
>
> Are you using the assumption that spin_lock() implies preempt_disable()
> implies pagefault_disable()? Note that this assumption isn't valid for -rt,
> where the spinlock becomes preemptible but we'll not disable pagefaults.

No, I was not assuming that. What I'm trying to say is that a caller
that does something like this under a spinlock:
	preempt_disable
	pagefault_disable
	error = copy_to_user
	pagefault_enable
	preempt_enable_no_resched

is not doing anything wrong and should not get a warning,
as long as error is handled correctly later.
Right?

> > Address both issues by:
> > - dropping the unconditional call to might_sleep
> >   from the fast might_fault code in linux/kernel.h
> > - checking for pagefault_disable() in the
> >   CONFIG_PROVE_LOCKING implementation
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  include/linux/kernel.h |  1 -
> >  mm/memory.c            | 14 +++++++++-----
> >  2 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index e96329c..322b065 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -198,7 +198,6 @@ void might_fault(void);
> >  #else
> >  static inline void might_fault(void)
> >  {
> > -	might_sleep();
>
> This removes potential resched points for PREEMPT_VOLUNTARY -- was that
> intentional?

No, it's a bug. Thanks for pointing this out.
OK, so I guess it should be might_sleep_if(!in_atomic()), and this means
might_fault would have to move from linux/kernel.h to linux/uaccess.h,
since in_atomic() is in linux/hardirq.h.
Makes sense?

> >  }
> >  #endif
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dc1882..1b8327b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4222,13 +4222,17 @@ void might_fault(void)
> >  	if (segment_eq(get_fs(), KERNEL_DS))
> >  		return;
> >
> > -	might_sleep();
> >  	/*
> > -	 * it would be nicer only to annotate paths which are not under
> > -	 * pagefault_disable, however that requires a larger audit and
> > -	 * providing helpers like get_user_atomic.
> > +	 * It would be nicer to annotate paths which are under preempt_disable
> > +	 * but not under pagefault_disable, however that requires a new flag
> > +	 * for differentiating between the two.
>
> -rt has this: pagefault_disable() doesn't change the preempt count but
> pokes at task_struct::pagefault_disable.

Good to know. So maybe we can import this, at least for
CONFIG_PROVE_LOCKING? To make the patch smaller I'd prefer doing both for
now; this way this patchset does not have to poke at too many mm internals.
I can try doing that - unless someone else has plans to merge this part
soon anyway?

> >  	 */
> > -	if (!in_atomic() && current->mm)
> > +	if (in_atomic())
> > +		return;
> > +
> > +	might_sleep();
> > +
> > +	if (current->mm)
> >  		might_lock_read(&current->mm->mmap_sem);
> >  }
> >  EXPORT_SYMBOL(might_fault);
> > --
> > MST
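For reference, a minimal sketch of the alternative Michael floats here, the
non-CONFIG_PROVE_LOCKING fast path using might_sleep_if(!in_atomic()). It
assumes might_fault() has been moved to a header where in_atomic() is
visible (e.g. linux/uaccess.h pulling in linux/hardirq.h); it is not part
of the posted patch.

/* Illustrative sketch of the suggested fast-path variant. */
static inline void might_fault(void)
{
	/*
	 * Keep the voluntary-preemption resched point, but skip it when
	 * the caller is in atomic context (for example under
	 * pagefault_disable(), which on mainline bumps the preempt count).
	 */
	might_sleep_if(!in_atomic());
}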
On Sun, 2013-05-19 at 12:35 +0300, Michael S. Tsirkin wrote:
> No, I was not assuming that. What I'm trying to say is that a caller
> that does something like this under a spinlock:
> 	preempt_disable
> 	pagefault_disable
> 	error = copy_to_user
> 	pagefault_enable
> 	preempt_enable_no_resched
>
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> Right?

What I see wrong with the above is the preempt_enable_no_resched(). The
only place that should ever be used is right before a schedule(), as you
don't want to schedule twice (once for the preempt_enable() and then
again for the schedule itself).

Remember, in -rt, a spin lock does not disable preemption.

-- Steve
On Sun, May 19, 2013 at 08:34:04AM -0400, Steven Rostedt wrote:
> On Sun, 2013-05-19 at 12:35 +0300, Michael S. Tsirkin wrote:
>
> > No, I was not assuming that. What I'm trying to say is that a caller
> > that does something like this under a spinlock:
> > 	preempt_disable
> > 	pagefault_disable
> > 	error = copy_to_user
> > 	pagefault_enable
> > 	preempt_enable_no_resched
> >
> > is not doing anything wrong and should not get a warning,
> > as long as error is handled correctly later.
> > Right?
>
> What I see wrong with the above is the preempt_enable_no_resched(). The
> only place that should ever be used is right before a schedule(), as you
> don't want to schedule twice (once for the preempt_enable() and then
> again for the schedule itself).
>
> Remember, in -rt, a spin lock does not disable preemption.
>
> -- Steve

Right, but we need to keep it working on upstream as well.
If I do preempt_enable under a spinlock upstream, won't it
try to sleep under the spinlock?
On Sun, 2013-05-19 at 16:34 +0300, Michael S. Tsirkin wrote:

> Right, but we need to keep it working on upstream as well.
> If I do preempt_enable under a spinlock upstream, won't it
> try to sleep under the spinlock?

No it won't. A spinlock calls preempt_disable implicitly, and a
preempt_enable() will not schedule unless preempt_count is zero, which
it won't be under a spinlock.

If it did, there would be lots of bugs all over the place, because this
is done throughout the kernel (a preempt_enable() under a spinlock).

In other words, don't ever use preempt_enable_no_resched().

-- Steve
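To make Steven's point concrete, a small sketch of how the preempt count
nests on a mainline kernel with CONFIG_PREEMPT (so the count is actually
maintained). The lock and function names are made up for illustration and
do not come from the thread.

static DEFINE_SPINLOCK(demo_lock);

static void demo(void)
{
	spin_lock(&demo_lock);		/* preempt_count: 0 -> 1 */

	preempt_disable();		/* preempt_count: 1 -> 2 */
	/* ... */
	preempt_enable();		/* preempt_count: 2 -> 1; does not
					 * reschedule, count is still non-zero */

	spin_unlock(&demo_lock);	/* preempt_count: 1 -> 0, may reschedule */
}

On -rt, as Steven notes, spin_lock() does not touch the preempt count at
all, which is why only the explicit preempt_disable()/preempt_enable()
pair disables preemption there.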
On Sun, May 19, 2013 at 12:06:19PM -0400, Steven Rostedt wrote:
> On Sun, 2013-05-19 at 16:34 +0300, Michael S. Tsirkin wrote:
>
> > Right, but we need to keep it working on upstream as well.
> > If I do preempt_enable under a spinlock upstream, won't it
> > try to sleep under the spinlock?
>
> No it won't. A spinlock calls preempt_disable implicitly, and a
> preempt_enable() will not schedule unless preempt_count is zero, which
> it won't be under a spinlock.
>
> If it did, there would be lots of bugs all over the place, because this
> is done throughout the kernel (a preempt_enable() under a spinlock).
>
> In other words, don't ever use preempt_enable_no_resched().
>
> -- Steve

OK, I get it. So let me correct myself. The simple code
that does something like this under a spinlock:
> 	preempt_disable
> 	pagefault_disable
> 	error = copy_to_user
> 	pagefault_enable
> 	preempt_enable

is not doing anything wrong and should not get a warning,
as long as error is handled correctly later.
Right?
On Sun, 2013-05-19 at 19:40 +0300, Michael S. Tsirkin wrote:

> OK, I get it. So let me correct myself. The simple code
> that does something like this under a spinlock:
> > 	preempt_disable
> > 	pagefault_disable
> > 	error = copy_to_user
> > 	pagefault_enable
> > 	preempt_enable
>
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> Right?

I came in mid-thread and I don't know the context. Anyway, the above
looks to me like you just don't want to sleep. If you try to copy data
to user space that happens not to be currently mapped for any reason,
you will get an error, even if the address space is completely valid.
Is that what you want?

-- Steve
On Sun, May 19, 2013 at 04:23:22PM -0400, Steven Rostedt wrote:
> On Sun, 2013-05-19 at 19:40 +0300, Michael S. Tsirkin wrote:
>
> > OK, I get it. So let me correct myself. The simple code
> > that does something like this under a spinlock:
> > > 	preempt_disable
> > > 	pagefault_disable
> > > 	error = copy_to_user
> > > 	pagefault_enable
> > > 	preempt_enable
> >
> > is not doing anything wrong and should not get a warning,
> > as long as error is handled correctly later.
> > Right?
>
> I came in mid-thread and I don't know the context.

The context is that I want to change might_fault from might_sleep to
might_sleep_if(!in_atomic()), so that the above does not trigger warnings
even with CONFIG_DEBUG_ATOMIC_SLEEP enabled.

> Anyway, the above looks to me like you just don't want to sleep.

Exactly. Upstream we can just do pagefault_disable, but to make this code
-rt ready it's best to do preempt_disable as well.

> If you try to copy data to user space that happens not to be currently
> mapped for any reason, you will get an error, even if the address space
> is completely valid. Is that what you want?
>
> -- Steve

Yes, this is by design. We detect that and bounce the work to a thread
outside any locks.

Thanks,
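To make the design Michael describes concrete, here is a rough sketch of
the pattern under discussion: try the user copy atomically under a
spinlock, and bounce to a sleeping worker when it fails. The structure and
names (demo_dev, the workqueue, the function) are made up for illustration
and are not the actual vhost code.

/* Illustrative sketch only, not the actual vhost code. */
struct demo_dev {
	spinlock_t		lock;
	struct workqueue_struct	*wq;
	struct work_struct	slow_work;
};

static void demo_fast_path(struct demo_dev *d, void __user *dst,
			   const void *src, size_t len)
{
	unsigned long left;

	spin_lock(&d->lock);

	preempt_disable();	/* explicit, so the code also behaves on -rt */
	pagefault_disable();	/* faults now take the exception-table fixup
				 * path instead of sleeping */
	left = copy_to_user(dst, src, len);	/* returns bytes NOT copied */
	pagefault_enable();
	preempt_enable();

	if (left)
		/* Could not copy atomically: defer to a thread that may sleep. */
		queue_work(d->wq, &d->slow_work);

	spin_unlock(&d->lock);
}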
On Sun, May 19, 2013 at 07:40:09PM +0300, Michael S. Tsirkin wrote:
> OK, I get it. So let me correct myself. The simple code
> that does something like this under a spinlock:
> > 	preempt_disable
> > 	pagefault_disable
> > 	error = copy_to_user
> > 	pagefault_enable
> > 	preempt_enable
>
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> Right?

Indeed, but I don't get the point of the preempt_{disable,enable}() here.
Why does it have to disable preemption explicitly? I thought all you
wanted was to avoid the pagefault handler and make it do the exception
table thing; for that, pagefault_disable() is sufficient.
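For contrast with the earlier sketch, what Peter is suggesting reduces to
something like the following. Again the names are made up for
illustration; on mainline, pagefault_disable() alone is enough to make the
copy take the exception-table route instead of sleeping.

static unsigned long demo_copy_atomic(void __user *dst, const void *src,
				      size_t len)
{
	unsigned long left;

	pagefault_disable();
	left = copy_to_user(dst, src, len);	/* bytes not copied */
	pagefault_enable();

	return left;	/* non-zero means the caller must take a slow path */
}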
On Sun, May 19, 2013 at 12:35:26PM +0300, Michael S. Tsirkin wrote:
> On Thu, May 16, 2013 at 08:40:41PM +0200, Peter Zijlstra wrote:
> > On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
> > > There are several ways to make sure a might_fault-calling function
> > > does not sleep. One is to use it on kernel or otherwise locked memory
> > > - apparently nfs/sunrpc does this. As noted by Ingo, this is handled
> > > by the might_fault() implementation in mm/memory.c but not the one in
> > > linux/kernel.h, so in the current code might_fault() schedules
> > > differently depending on CONFIG_PROVE_LOCKING, which is an undesired
> > > semantic side effect.
> > >
> > > Another is to call pagefault_disable: in this case the page fault
> > > handler will go to fixups processing and we get an error instead of
> > > sleeping, so the might_sleep annotation is a false positive. The vhost
> > > driver wants to do this now in order to reuse socket ops under a
> > > spinlock (and fall back on a slower thread handler on error).
> >
> > Are you using the assumption that spin_lock() implies preempt_disable()
> > implies pagefault_disable()? Note that this assumption isn't valid for
> > -rt, where the spinlock becomes preemptible but we'll not disable
> > pagefaults.
>
> No, I was not assuming that. What I'm trying to say is that a caller
> that does something like this under a spinlock:
> 	preempt_disable
> 	pagefault_disable
> 	error = copy_to_user
> 	pagefault_enable
> 	preempt_enable_no_resched
>
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> Right?

Aside from the no_resched() thing which Steven already explained, and my
previous email asking why you need the preempt_disable() at all, that
should indeed work.

The reason I was asking was that I wasn't sure you weren't doing:

	spin_lock(&my_lock);
	error = copy_to_user();
	spin_unlock(&my_lock);

and expecting the copy_to_user() to always take the exception table
route. This works on mainline (since spin_lock implies a preempt disable
and preempt_disable is the same as pagefault_disable). However, as should
be clear by now, it doesn't quite work that way for -rt.
On Thu, May 16, 2013 at 08:40:41PM +0200, Peter Zijlstra wrote:
> On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
> > There are several ways to make sure a might_fault-calling function does
> > not sleep. One is to use it on kernel or otherwise locked memory -
> > apparently nfs/sunrpc does this. As noted by Ingo, this is handled by the
> > might_fault() implementation in mm/memory.c but not the one in
> > linux/kernel.h, so in the current code might_fault() schedules
> > differently depending on CONFIG_PROVE_LOCKING, which is an undesired
> > semantic side effect.
> >
> > Another is to call pagefault_disable: in this case the page fault
> > handler will go to fixups processing and we get an error instead of
> > sleeping, so the might_sleep annotation is a false positive. The vhost
> > driver wants to do this now in order to reuse socket ops under a
> > spinlock (and fall back on a slower thread handler on error).
>
> Are you using the assumption that spin_lock() implies preempt_disable()
> implies pagefault_disable()? Note that this assumption isn't valid for -rt,
> where the spinlock becomes preemptible but we'll not disable pagefaults.
>
> > Address both issues by:
> > - dropping the unconditional call to might_sleep
> >   from the fast might_fault code in linux/kernel.h
> > - checking for pagefault_disable() in the
> >   CONFIG_PROVE_LOCKING implementation
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  include/linux/kernel.h |  1 -
> >  mm/memory.c            | 14 +++++++++-----
> >  2 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index e96329c..322b065 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -198,7 +198,6 @@ void might_fault(void);
> >  #else
> >  static inline void might_fault(void)
> >  {
> > -	might_sleep();
>
> This removes potential resched points for PREEMPT_VOLUNTARY -- was that
> intentional?

OK, so I'm thinking of going back to this idea: it has the advantage of
being very simple, and it just might make some workloads faster if they
do lots of copy_XX_user in a loop. It will have to be tested of course -
does anyone have objections?

> >  }
> >  #endif
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dc1882..1b8327b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4222,13 +4222,17 @@ void might_fault(void)
> >  	if (segment_eq(get_fs(), KERNEL_DS))
> >  		return;
> >
> > -	might_sleep();
> >  	/*
> > -	 * it would be nicer only to annotate paths which are not under
> > -	 * pagefault_disable, however that requires a larger audit and
> > -	 * providing helpers like get_user_atomic.
> > +	 * It would be nicer to annotate paths which are under preempt_disable
> > +	 * but not under pagefault_disable, however that requires a new flag
> > +	 * for differentiating between the two.
>
> -rt has this: pagefault_disable() doesn't change the preempt count but
> pokes at task_struct::pagefault_disable.
>
> >  	 */
> > -	if (!in_atomic() && current->mm)
> > +	if (in_atomic())
> > +		return;
> > +
> > +	might_sleep();
> > +
> > +	if (current->mm)
> >  		might_lock_read(&current->mm->mmap_sem);
> >  }
> >  EXPORT_SYMBOL(might_fault);
> > --
> > MST
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e96329c..322b065 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -198,7 +198,6 @@ void might_fault(void);
 #else
 static inline void might_fault(void)
 {
-	might_sleep();
 }
 #endif
 
diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..1b8327b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4222,13 +4222,17 @@ void might_fault(void)
 	if (segment_eq(get_fs(), KERNEL_DS))
 		return;
 
-	might_sleep();
 	/*
-	 * it would be nicer only to annotate paths which are not under
-	 * pagefault_disable, however that requires a larger audit and
-	 * providing helpers like get_user_atomic.
+	 * It would be nicer to annotate paths which are under preempt_disable
+	 * but not under pagefault_disable, however that requires a new flag
+	 * for differentiating between the two.
 	 */
-	if (!in_atomic() && current->mm)
+	if (in_atomic())
+		return;
+
+	might_sleep();
+
+	if (current->mm)
 		might_lock_read(&current->mm->mmap_sem);
 }
 EXPORT_SYMBOL(might_fault);
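For readability, the CONFIG_PROVE_LOCKING might_fault() in mm/memory.c as
it would read with the hunk above applied. This is reassembled from the
diff, not copied from the file; any lines of the function before the first
context line are elided.

void might_fault(void)
{
	/* ... lines before the hunk elided ... */
	if (segment_eq(get_fs(), KERNEL_DS))
		return;

	/*
	 * It would be nicer to annotate paths which are under preempt_disable
	 * but not under pagefault_disable, however that requires a new flag
	 * for differentiating between the two.
	 */
	if (in_atomic())
		return;

	might_sleep();

	if (current->mm)
		might_lock_read(&current->mm->mmap_sem);
}
EXPORT_SYMBOL(might_fault);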
There are several ways to make sure a might_fault-calling function does
not sleep. One is to use it on kernel or otherwise locked memory -
apparently nfs/sunrpc does this. As noted by Ingo, this is handled by the
might_fault() implementation in mm/memory.c but not the one in
linux/kernel.h, so in the current code might_fault() schedules differently
depending on CONFIG_PROVE_LOCKING, which is an undesired semantic side
effect.

Another is to call pagefault_disable: in this case the page fault handler
will go to fixups processing and we get an error instead of sleeping, so
the might_sleep annotation is a false positive. The vhost driver wants to
do this now in order to reuse socket ops under a spinlock (and fall back
on a slower thread handler on error).

Address both issues by:
- dropping the unconditional call to might_sleep
  from the fast might_fault code in linux/kernel.h
- checking for pagefault_disable() in the
  CONFIG_PROVE_LOCKING implementation

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/kernel.h |  1 -
 mm/memory.c            | 14 +++++++++-----
 2 files changed, 9 insertions(+), 6 deletions(-)