mbox series

[RFC,00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends

Message ID 20200919091751.011116649@linutronix.de (mailing list archive)
Headers show
Series mm/highmem: Provide a preemptible variant of kmap_atomic & friends | expand

Message

Thomas Gleixner Sept. 19, 2020, 9:17 a.m. UTC
First of all, sorry for the horribly big Cc list!

Following up to the discussion in:

  https://lore.kernel.org/r/20200914204209.256266093@linutronix.de

this provides a preemptible variant of kmap_atomic & related
interfaces. This is achieved by:

 - Consolidating all kmap atomic implementations in generic code

 - Switching from per CPU storage of the kmap index to a per task storage

 - Adding a pteval array to the per task storage which contains the ptevals
   of the currently active temporary kmaps

 - Adding context switch code which checks whether the outgoing or the
   incoming task has active temporary kmaps. If so, the outgoing task's
   kmaps are removed and the incoming task's kmaps are restored.

 - Adding new interfaces k[un]map_temporary*() which are not disabling
   preemption and can be called from any context (except NMI).

   Contrary to kmap() which provides preemptible and "persistant" mappings,
   these interfaces are meant to replace the temporary mappings provided by
   kmap_atomic*() today.

This allows to get rid of conditional mapping choices and allows to have
preemptible short term mappings on 64bit which are today enforced to be
non-preemptible due to the highmem constraints. It clearly puts overhead on
the highmem users, but highmem is slow anyway.

This is not a wholesale conversion which makes kmap_atomic magically
preemptible because there might be usage sites which rely on the implicit
preempt disable. So this needs to be done on a case by case basis and the
call sites converted to kmap_temporary.

Note, that this is only lightly tested on X86 and completely untested on
all other architectures.

The lot is also available from

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem

Thanks,

	tglx
---
 a/arch/arm/mm/highmem.c               |  121 ---------------------
 a/arch/microblaze/mm/highmem.c        |   78 -------------
 a/arch/nds32/mm/highmem.c             |   48 --------
 a/arch/powerpc/mm/highmem.c           |   67 -----------
 a/arch/sparc/mm/highmem.c             |  115 --------------------
 arch/arc/Kconfig                      |    1 
 arch/arc/include/asm/highmem.h        |    8 +
 arch/arc/mm/highmem.c                 |   44 -------
 arch/arm/Kconfig                      |    1 
 arch/arm/include/asm/highmem.h        |   30 +++--
 arch/arm/mm/Makefile                  |    1 
 arch/csky/Kconfig                     |    1 
 arch/csky/include/asm/highmem.h       |    4 
 arch/csky/mm/highmem.c                |   75 -------------
 arch/microblaze/Kconfig               |    1 
 arch/microblaze/include/asm/highmem.h |    6 -
 arch/microblaze/mm/Makefile           |    1 
 arch/microblaze/mm/init.c             |    6 -
 arch/mips/Kconfig                     |    1 
 arch/mips/include/asm/highmem.h       |    4 
 arch/mips/mm/highmem.c                |   77 -------------
 arch/mips/mm/init.c                   |    3 
 arch/nds32/Kconfig.cpu                |    1 
 arch/nds32/include/asm/highmem.h      |   21 ++-
 arch/nds32/mm/Makefile                |    1 
 arch/powerpc/Kconfig                  |    1 
 arch/powerpc/include/asm/highmem.h    |    6 -
 arch/powerpc/mm/Makefile              |    1 
 arch/powerpc/mm/mem.c                 |    7 -
 arch/sparc/Kconfig                    |    1 
 arch/sparc/include/asm/highmem.h      |    7 -
 arch/sparc/mm/Makefile                |    3 
 arch/sparc/mm/srmmu.c                 |    2 
 arch/x86/include/asm/fixmap.h         |    1 
 arch/x86/include/asm/highmem.h        |   12 +-
 arch/x86/include/asm/iomap.h          |   29 +++--
 arch/x86/mm/highmem_32.c              |   59 ----------
 arch/x86/mm/init_32.c                 |   15 --
 arch/x86/mm/iomap_32.c                |   57 ----------
 arch/xtensa/Kconfig                   |    1 
 arch/xtensa/include/asm/highmem.h     |    9 +
 arch/xtensa/mm/highmem.c              |   44 -------
 b/arch/x86/Kconfig                    |    3 
 include/linux/highmem.h               |  141 +++++++++++++++---------
 include/linux/io-mapping.h            |    2 
 include/linux/sched.h                 |    9 +
 kernel/sched/core.c                   |   10 +
 mm/Kconfig                            |    3 
 mm/highmem.c                          |  192 ++++++++++++++++++++++++++++++++--
 49 files changed, 422 insertions(+), 909 deletions(-)

Comments

Daniel Vetter Sept. 19, 2020, 10:35 a.m. UTC | #1
On Sat, Sep 19, 2020 at 11:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> First of all, sorry for the horribly big Cc list!
>
> Following up to the discussion in:
>
>   https://lore.kernel.org/r/20200914204209.256266093@linutronix.de
>
> this provides a preemptible variant of kmap_atomic & related
> interfaces. This is achieved by:
>
>  - Consolidating all kmap atomic implementations in generic code
>
>  - Switching from per CPU storage of the kmap index to a per task storage
>
>  - Adding a pteval array to the per task storage which contains the ptevals
>    of the currently active temporary kmaps
>
>  - Adding context switch code which checks whether the outgoing or the
>    incoming task has active temporary kmaps. If so, the outgoing task's
>    kmaps are removed and the incoming task's kmaps are restored.
>
>  - Adding new interfaces k[un]map_temporary*() which are not disabling
>    preemption and can be called from any context (except NMI).
>
>    Contrary to kmap() which provides preemptible and "persistant" mappings,
>    these interfaces are meant to replace the temporary mappings provided by
>    kmap_atomic*() today.
>
> This allows to get rid of conditional mapping choices and allows to have
> preemptible short term mappings on 64bit which are today enforced to be
> non-preemptible due to the highmem constraints. It clearly puts overhead on
> the highmem users, but highmem is slow anyway.
>
> This is not a wholesale conversion which makes kmap_atomic magically
> preemptible because there might be usage sites which rely on the implicit
> preempt disable. So this needs to be done on a case by case basis and the
> call sites converted to kmap_temporary.
>
> Note, that this is only lightly tested on X86 and completely untested on
> all other architectures.
>
> The lot is also available from
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem

I think it should be the case, but I want to double check: Will
copy_*_user be allowed within a kmap_temporary section? This would
allow us to ditch an absolute pile of slowpaths.
-Daniel

>
> Thanks,
>
>         tglx
> ---
>  a/arch/arm/mm/highmem.c               |  121 ---------------------
>  a/arch/microblaze/mm/highmem.c        |   78 -------------
>  a/arch/nds32/mm/highmem.c             |   48 --------
>  a/arch/powerpc/mm/highmem.c           |   67 -----------
>  a/arch/sparc/mm/highmem.c             |  115 --------------------
>  arch/arc/Kconfig                      |    1
>  arch/arc/include/asm/highmem.h        |    8 +
>  arch/arc/mm/highmem.c                 |   44 -------
>  arch/arm/Kconfig                      |    1
>  arch/arm/include/asm/highmem.h        |   30 +++--
>  arch/arm/mm/Makefile                  |    1
>  arch/csky/Kconfig                     |    1
>  arch/csky/include/asm/highmem.h       |    4
>  arch/csky/mm/highmem.c                |   75 -------------
>  arch/microblaze/Kconfig               |    1
>  arch/microblaze/include/asm/highmem.h |    6 -
>  arch/microblaze/mm/Makefile           |    1
>  arch/microblaze/mm/init.c             |    6 -
>  arch/mips/Kconfig                     |    1
>  arch/mips/include/asm/highmem.h       |    4
>  arch/mips/mm/highmem.c                |   77 -------------
>  arch/mips/mm/init.c                   |    3
>  arch/nds32/Kconfig.cpu                |    1
>  arch/nds32/include/asm/highmem.h      |   21 ++-
>  arch/nds32/mm/Makefile                |    1
>  arch/powerpc/Kconfig                  |    1
>  arch/powerpc/include/asm/highmem.h    |    6 -
>  arch/powerpc/mm/Makefile              |    1
>  arch/powerpc/mm/mem.c                 |    7 -
>  arch/sparc/Kconfig                    |    1
>  arch/sparc/include/asm/highmem.h      |    7 -
>  arch/sparc/mm/Makefile                |    3
>  arch/sparc/mm/srmmu.c                 |    2
>  arch/x86/include/asm/fixmap.h         |    1
>  arch/x86/include/asm/highmem.h        |   12 +-
>  arch/x86/include/asm/iomap.h          |   29 +++--
>  arch/x86/mm/highmem_32.c              |   59 ----------
>  arch/x86/mm/init_32.c                 |   15 --
>  arch/x86/mm/iomap_32.c                |   57 ----------
>  arch/xtensa/Kconfig                   |    1
>  arch/xtensa/include/asm/highmem.h     |    9 +
>  arch/xtensa/mm/highmem.c              |   44 -------
>  b/arch/x86/Kconfig                    |    3
>  include/linux/highmem.h               |  141 +++++++++++++++---------
>  include/linux/io-mapping.h            |    2
>  include/linux/sched.h                 |    9 +
>  kernel/sched/core.c                   |   10 +
>  mm/Kconfig                            |    3
>  mm/highmem.c                          |  192 ++++++++++++++++++++++++++++++++--
>  49 files changed, 422 insertions(+), 909 deletions(-)
Daniel Vetter Sept. 19, 2020, 10:37 a.m. UTC | #2
On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Sat, Sep 19, 2020 at 11:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > First of all, sorry for the horribly big Cc list!
> >
> > Following up to the discussion in:
> >
> >   https://lore.kernel.org/r/20200914204209.256266093@linutronix.de
> >
> > this provides a preemptible variant of kmap_atomic & related
> > interfaces. This is achieved by:
> >
> >  - Consolidating all kmap atomic implementations in generic code
> >
> >  - Switching from per CPU storage of the kmap index to a per task storage
> >
> >  - Adding a pteval array to the per task storage which contains the ptevals
> >    of the currently active temporary kmaps
> >
> >  - Adding context switch code which checks whether the outgoing or the
> >    incoming task has active temporary kmaps. If so, the outgoing task's
> >    kmaps are removed and the incoming task's kmaps are restored.
> >
> >  - Adding new interfaces k[un]map_temporary*() which are not disabling
> >    preemption and can be called from any context (except NMI).
> >
> >    Contrary to kmap() which provides preemptible and "persistant" mappings,
> >    these interfaces are meant to replace the temporary mappings provided by
> >    kmap_atomic*() today.
> >
> > This allows to get rid of conditional mapping choices and allows to have
> > preemptible short term mappings on 64bit which are today enforced to be
> > non-preemptible due to the highmem constraints. It clearly puts overhead on
> > the highmem users, but highmem is slow anyway.
> >
> > This is not a wholesale conversion which makes kmap_atomic magically
> > preemptible because there might be usage sites which rely on the implicit
> > preempt disable. So this needs to be done on a case by case basis and the
> > call sites converted to kmap_temporary.
> >
> > Note, that this is only lightly tested on X86 and completely untested on
> > all other architectures.
> >
> > The lot is also available from
> >
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem
>
> I think it should be the case, but I want to double check: Will
> copy_*_user be allowed within a kmap_temporary section? This would
> allow us to ditch an absolute pile of slowpaths.

(coffee just kicked in) copy_*_user is ofc allowed, but if you hit a
page fault you get a short read/write. This looks like it would remove
the need to handle these in a slowpath, since page faults can now be
served in this new kmap_temporary sections. But this sounds too good
to be true, so I'm wondering what I'm missing.
-Daniel
>
> >
> > Thanks,
> >
> >         tglx
> > ---
> >  a/arch/arm/mm/highmem.c               |  121 ---------------------
> >  a/arch/microblaze/mm/highmem.c        |   78 -------------
> >  a/arch/nds32/mm/highmem.c             |   48 --------
> >  a/arch/powerpc/mm/highmem.c           |   67 -----------
> >  a/arch/sparc/mm/highmem.c             |  115 --------------------
> >  arch/arc/Kconfig                      |    1
> >  arch/arc/include/asm/highmem.h        |    8 +
> >  arch/arc/mm/highmem.c                 |   44 -------
> >  arch/arm/Kconfig                      |    1
> >  arch/arm/include/asm/highmem.h        |   30 +++--
> >  arch/arm/mm/Makefile                  |    1
> >  arch/csky/Kconfig                     |    1
> >  arch/csky/include/asm/highmem.h       |    4
> >  arch/csky/mm/highmem.c                |   75 -------------
> >  arch/microblaze/Kconfig               |    1
> >  arch/microblaze/include/asm/highmem.h |    6 -
> >  arch/microblaze/mm/Makefile           |    1
> >  arch/microblaze/mm/init.c             |    6 -
> >  arch/mips/Kconfig                     |    1
> >  arch/mips/include/asm/highmem.h       |    4
> >  arch/mips/mm/highmem.c                |   77 -------------
> >  arch/mips/mm/init.c                   |    3
> >  arch/nds32/Kconfig.cpu                |    1
> >  arch/nds32/include/asm/highmem.h      |   21 ++-
> >  arch/nds32/mm/Makefile                |    1
> >  arch/powerpc/Kconfig                  |    1
> >  arch/powerpc/include/asm/highmem.h    |    6 -
> >  arch/powerpc/mm/Makefile              |    1
> >  arch/powerpc/mm/mem.c                 |    7 -
> >  arch/sparc/Kconfig                    |    1
> >  arch/sparc/include/asm/highmem.h      |    7 -
> >  arch/sparc/mm/Makefile                |    3
> >  arch/sparc/mm/srmmu.c                 |    2
> >  arch/x86/include/asm/fixmap.h         |    1
> >  arch/x86/include/asm/highmem.h        |   12 +-
> >  arch/x86/include/asm/iomap.h          |   29 +++--
> >  arch/x86/mm/highmem_32.c              |   59 ----------
> >  arch/x86/mm/init_32.c                 |   15 --
> >  arch/x86/mm/iomap_32.c                |   57 ----------
> >  arch/xtensa/Kconfig                   |    1
> >  arch/xtensa/include/asm/highmem.h     |    9 +
> >  arch/xtensa/mm/highmem.c              |   44 -------
> >  b/arch/x86/Kconfig                    |    3
> >  include/linux/highmem.h               |  141 +++++++++++++++---------
> >  include/linux/io-mapping.h            |    2
> >  include/linux/sched.h                 |    9 +
> >  kernel/sched/core.c                   |   10 +
> >  mm/Kconfig                            |    3
> >  mm/highmem.c                          |  192 ++++++++++++++++++++++++++++++++--
> >  49 files changed, 422 insertions(+), 909 deletions(-)
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Linus Torvalds Sept. 19, 2020, 5:18 p.m. UTC | #3
On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> this provides a preemptible variant of kmap_atomic & related
> interfaces. This is achieved by:

Ack. This looks really nice, even apart from the new capability.

The only thing I really reacted to is that the name doesn't make sense
to me: "kmap_temporary()" seems a bit odd.

Particularly for an interface that really is basically meant as a
better replacement of "kmap_atomic()" (but is perhaps also a better
replacement for "kmap()").

I think I understand how the name came about: I think the "temporary"
is there as a distinction from the "longterm" regular kmap(). So I
think it makes some sense from an internal implementation angle, but I
don't think it makes a lot of sense from an interface name.

I don't know what might be a better name, but if we want to emphasize
that it's thread-private and a one-off, maybe "local" would be a
better naming, and make it distinct from the "global" nature of the
old kmap() interface?

However, another solution might be to just use this new preemptible
"local" kmap(), and remove the old global one entirely. Yes, the old
global one caches the page table mapping and that sounds really
efficient and nice. But it's actually horribly horribly bad, because
it means that we need to use locking for them. Your new "temporary"
implementation seems to be fundamentally better locking-wise, and only
need preemption disabling as locking (and is equally fast for the
non-highmem case).

So I wonder if the single-page TLB flush isn't a better model, and
whether it wouldn't be a lot simpler to just get rid of the old
complex kmap() entirely, and replace it with this?

I agree we can't replace the kmap_atomic() version, because maybe
people depend on the preemption disabling it also implied. But what
about replacing the non-atomic kmap()?

Maybe I've missed something.  Is it because the new interface still
does "pagefault_disable()" perhaps?

But does it even need the pagefault_disable() at all? Yes, the
*atomic* one obviously needed it. But why does this new one need to
disable page faults?

[ I'm just reading the patches, I didn't try to apply them and look at
the end result, so I might have missed something ]

The only other worry I would have is just test coverage of this
change. I suspect very few developers use HIGHMEM. And I know the
various test robots don't tend to test 32-bit either.

But apart from that question about naming (and perhaps replacing
kmap() entirely), I very much like it.

                        Linus
Matthew Wilcox Sept. 19, 2020, 5:39 p.m. UTC | #4
On Sat, Sep 19, 2020 at 10:18:54AM -0700, Linus Torvalds wrote:
> On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > this provides a preemptible variant of kmap_atomic & related
> > interfaces. This is achieved by:
> 
> Ack. This looks really nice, even apart from the new capability.
> 
> The only thing I really reacted to is that the name doesn't make sense
> to me: "kmap_temporary()" seems a bit odd.
> 
> Particularly for an interface that really is basically meant as a
> better replacement of "kmap_atomic()" (but is perhaps also a better
> replacement for "kmap()").
> 
> I think I understand how the name came about: I think the "temporary"
> is there as a distinction from the "longterm" regular kmap(). So I
> think it makes some sense from an internal implementation angle, but I
> don't think it makes a lot of sense from an interface name.
> 
> I don't know what might be a better name, but if we want to emphasize
> that it's thread-private and a one-off, maybe "local" would be a
> better naming, and make it distinct from the "global" nature of the
> old kmap() interface?
> 
> However, another solution might be to just use this new preemptible
> "local" kmap(), and remove the old global one entirely. Yes, the old
> global one caches the page table mapping and that sounds really
> efficient and nice. But it's actually horribly horribly bad, because
> it means that we need to use locking for them. Your new "temporary"
> implementation seems to be fundamentally better locking-wise, and only
> need preemption disabling as locking (and is equally fast for the
> non-highmem case).
> 
> So I wonder if the single-page TLB flush isn't a better model, and
> whether it wouldn't be a lot simpler to just get rid of the old
> complex kmap() entirely, and replace it with this?
> 
> I agree we can't replace the kmap_atomic() version, because maybe
> people depend on the preemption disabling it also implied. But what
> about replacing the non-atomic kmap()?

My concern with that is people might use kmap() and then pass the address
to a different task.  So we need to audit the current users of kmap()
and convert any that do that into using vmap() instead.

I like kmap_local().  Or kmap_thread().
Linus Torvalds Sept. 19, 2020, 7:13 p.m. UTC | #5
On Sat, Sep 19, 2020 at 10:39 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> My concern with that is people might use kmap() and then pass the address
> to a different task.  So we need to audit the current users of kmap()
> and convert any that do that into using vmap() instead.

Ahh. Yes, I guess they might do that. It sounds strange, but not
entirely crazy - I could imagine some "PIO thread" that does IO to a
page that has been set up by somebody else using kmap(). Or similar.

                Linus
Thomas Gleixner Sept. 20, 2020, 6:23 a.m. UTC | #6
On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
> On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>> I think it should be the case, but I want to double check: Will
>> copy_*_user be allowed within a kmap_temporary section? This would
>> allow us to ditch an absolute pile of slowpaths.
>
> (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a
> page fault you get a short read/write. This looks like it would remove
> the need to handle these in a slowpath, since page faults can now be
> served in this new kmap_temporary sections. But this sounds too good
> to be true, so I'm wondering what I'm missing.

In principle we could allow pagefaults, but not with the currently
proposed interface which can be called from any context. Obviously if
called from atomic context it can't handle user page faults.

In theory we could make a variant which does not disable pagefaults, but
that's what kmap() already provides.

Thanks,

        tglx
Thomas Gleixner Sept. 20, 2020, 6:41 a.m. UTC | #7
On Sat, Sep 19 2020 at 10:18, Linus Torvalds wrote:
> On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> this provides a preemptible variant of kmap_atomic & related
>> interfaces. This is achieved by:
>
> Ack. This looks really nice, even apart from the new capability.
>
> The only thing I really reacted to is that the name doesn't make sense
> to me: "kmap_temporary()" seems a bit odd.

Yeah. Couldn't come up with something useful.

> Particularly for an interface that really is basically meant as a
> better replacement of "kmap_atomic()" (but is perhaps also a better
> replacement for "kmap()").
>
> I think I understand how the name came about: I think the "temporary"
> is there as a distinction from the "longterm" regular kmap(). So I
> think it makes some sense from an internal implementation angle, but I
> don't think it makes a lot of sense from an interface name.
>
> I don't know what might be a better name, but if we want to emphasize
> that it's thread-private and a one-off, maybe "local" would be a
> better naming, and make it distinct from the "global" nature of the
> old kmap() interface?

Right, _local or _thread would be more intuitive.

> However, another solution might be to just use this new preemptible
> "local" kmap(), and remove the old global one entirely. Yes, the old
> global one caches the page table mapping and that sounds really
> efficient and nice. But it's actually horribly horribly bad, because
> it means that we need to use locking for them. Your new "temporary"
> implementation seems to be fundamentally better locking-wise, and only
> need preemption disabling as locking (and is equally fast for the
> non-highmem case).
>
> So I wonder if the single-page TLB flush isn't a better model, and
> whether it wouldn't be a lot simpler to just get rid of the old
> complex kmap() entirely, and replace it with this?
>
> I agree we can't replace the kmap_atomic() version, because maybe
> people depend on the preemption disabling it also implied. But what
> about replacing the non-atomic kmap()?
>
> Maybe I've missed something.  Is it because the new interface still
> does "pagefault_disable()" perhaps?
>
> But does it even need the pagefault_disable() at all? Yes, the
> *atomic* one obviously needed it. But why does this new one need to
> disable page faults?

It disables pagefaults because it can be called from atomic and
non-atomic context. That was the point to get rid of

         if (preeemptible())
         	kmap();
         else
                kmap_atomic();

If it does not disable pagefaults, then it's just a lightweight variant
of kmap() for short lived mappings.

> But apart from that question about naming (and perhaps replacing
> kmap() entirely), I very much like it.

I thought about it, but then I figured that kmap pointers can be
handed to other contexts from the thread which sets up the mapping
because it's 'permanent'.

I'm not sure whether that actually happens, so we'd need to audit all
kmap() users to be sure. If there is no such use case, then we surely
can get of rid of kmap() completely. It's only 300+ instances to stare
at and quite some of them are wrapped into other functions.

Highmem sucks no matter what and the only sane solution is to remove it
completely.

Thanks,

        tglx
Daniel Vetter Sept. 20, 2020, 8:23 a.m. UTC | #8
On Sun, Sep 20, 2020 at 08:23:26AM +0200, Thomas Gleixner wrote:
> On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
> > On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >> I think it should be the case, but I want to double check: Will
> >> copy_*_user be allowed within a kmap_temporary section? This would
> >> allow us to ditch an absolute pile of slowpaths.
> >
> > (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a
> > page fault you get a short read/write. This looks like it would remove
> > the need to handle these in a slowpath, since page faults can now be
> > served in this new kmap_temporary sections. But this sounds too good
> > to be true, so I'm wondering what I'm missing.
> 
> In principle we could allow pagefaults, but not with the currently
> proposed interface which can be called from any context. Obviously if
> called from atomic context it can't handle user page faults.
 
Yeah that's clear, but does the implemention need to disable pagefaults
unconditionally?

> In theory we could make a variant which does not disable pagefaults, but
> that's what kmap() already provides.

Currently we have a bunch of code which roughly does

	kmap_atomic();
	copy_*_user();
	kunmap_atomic();

	if (short_copy_user) {
		kmap();
		copy_*_user(remaining_stuff);
		kunmap();
	}

And the only reason is that kmap is a notch slower, hence the fastpath. If
we get a kmap which is fast and allows pagefaults (only in contexts that
allow pagefaults already ofc) then we can ditch a pile of kmap users.

Cheers, Daniel
Thomas Gleixner Sept. 20, 2020, 8:49 a.m. UTC | #9
On Sun, Sep 20 2020 at 08:41, Thomas Gleixner wrote:
> On Sat, Sep 19 2020 at 10:18, Linus Torvalds wrote:
>> Maybe I've missed something.  Is it because the new interface still
>> does "pagefault_disable()" perhaps?
>>
>> But does it even need the pagefault_disable() at all? Yes, the
>> *atomic* one obviously needed it. But why does this new one need to
>> disable page faults?
>
> It disables pagefaults because it can be called from atomic and
> non-atomic context. That was the point to get rid of
>
>          if (preeemptible())
>          	kmap();
>          else
>                 kmap_atomic();
>
> If it does not disable pagefaults, then it's just a lightweight variant
> of kmap() for short lived mappings.

Actually most usage sites of kmap atomic do not need page faults to be
disabled at all. As Daniel pointed out the implicit pagefault disable
enforces copy_from_user_inatomic() even in places which are clearly
preemptible task context.

As we need to look at each instance anyway we can add the PF disable
explicitely to the very few places which actually need it.

Thanks,

        tglx
Linus Torvalds Sept. 20, 2020, 4:57 p.m. UTC | #10
On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Actually most usage sites of kmap atomic do not need page faults to be
> disabled at all.

Right. I think the pagefault disabling has (almost) nothing at all to
do with the kmap() itself - it comes from the "atomic" part, not the
"kmap" part.

I say *almost*, because there is one issue that needs some thought:
the amount of kmap nesting.

The kmap_atomic() interface - and your local/temporary/whatever
versions of it - depends very much inherently on being strictly
nesting. In fact, it depends so much on it that maybe that should be
part of the new name?

It's very wrong to do

    addr1 = kmap_atomic();
    addr2 = kmap_atomic();
    ..do something with addr 1..
    kunmap_atomic(addr1);
    .. do something with addr 2..
    kunmap_atomic(addr2);

because the way we allocate the slots is by using a percpu-atomic
inc-return (and we deallocate using dec).

So it's fundamentally a stack.

And that's perfectly fine for page faults - if they do any kmaps,
those will obviously nest.

So the only issue with page faults might be that the stack grows
_larger_. And that might need some thought. We already make the kmap
stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we
allow page faults we need to make the kmap stack bigger still.

Btw, looking at the stack code, Ithink your new implementation of it
is a bit scary:

   static inline int kmap_atomic_idx_push(void)
   {
  -       int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
  +       int idx = current->kmap_ctrl.idx++;

and now that 'current->kmap_ctrl.idx' is not atomic wrt

 (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with
nesting I think it's fine anyway - the NMI will undo whatever it did)

 (b) the prev/next switch

And that (b) part worries me. You do the kmap_switch_temporary() to
switch the entries, but you do that *separately* from actually
switching 'current' to the new value.

So kmap_switch_temporary() looks safe, but I don't think it actually
is. Because while it first unmaps the old entries and then remaps the
new ones, an interrupt can come in, and at that point it matters what
is *CURRENT*.

And regardless of whether 'current' is 'prev' or 'next', that
kmap_switch_temporary() loop may be doing the wrong thing, depending
on which one had the deeper stack. The interrupt will be using
whatever "current->kmap_ctrl.idx" is, but that might overwrite entries
that are in the process of being restored (if current is still 'prev',
but kmap_switch_temporary() is in the "restore @next's kmaps" pgase),
or it might stomp on entries that have been pte_clear()'ed by the
'prev' thing.

I dunno. The latter may be one of those "it works anyway, it
overwrites things we don't care about", but the former will most
definitely not work.

And it will be completely impossible to debug, because it will depend
on an interrupt that uses kmap_local/atomic/whatever() coming in
_just_ at the right point in the scheduler, and only when the
scheduler has been entered with the right number of kmap entries on
the prev/next stack.

And no developer will ever see this with any amount of debug code
enabled, because it will only hit on legacy platforms that do this
kmap anyway.

So honestly, that code scares me. I think it's buggy. And even if it
"happens to work", it does so for all the wrong reasons, and is very
fragile.

So I would suggest:

 - continue to use an actual per-cpu kmap_atomic_idx

 - make the switching code save the old idx, then unmap the old
entries one by one (while doing the proper "pop" action), and then map
the new entries one by one (while doing the proper "push" action).

which would mean that the only index that is actually ever *USED* is
the percpu one, and it's always up-to-date and pushed/popped for
individual entries, rather than this - imho completely bogus -
optimization where you use "p->kmap_ctrl.idx" directly and very very
unsafely.

Alternatively, that process counter would need about a hundred lines
of commentary about exactly why it's safe. Because I don't think it
is.

                   Linus
Thomas Gleixner Sept. 20, 2020, 5:24 p.m. UTC | #11
On Sun, Sep 20 2020 at 10:23, Daniel Vetter wrote:

> On Sun, Sep 20, 2020 at 08:23:26AM +0200, Thomas Gleixner wrote:
>> On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
>> > On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>> >> I think it should be the case, but I want to double check: Will
>> >> copy_*_user be allowed within a kmap_temporary section? This would
>> >> allow us to ditch an absolute pile of slowpaths.
>> >
>> > (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a
>> > page fault you get a short read/write. This looks like it would remove
>> > the need to handle these in a slowpath, since page faults can now be
>> > served in this new kmap_temporary sections. But this sounds too good
>> > to be true, so I'm wondering what I'm missing.
>> 
>> In principle we could allow pagefaults, but not with the currently
>> proposed interface which can be called from any context. Obviously if
>> called from atomic context it can't handle user page faults.
>  
> Yeah that's clear, but does the implemention need to disable pagefaults
> unconditionally?

As I wrote in the other reply it's not required and the final interface
will neither disable preemption nor pagefaults (except for the atomic
wrapper around it).

Thanks,

        tglx
Thomas Gleixner Sept. 20, 2020, 5:40 p.m. UTC | #12
On Sun, Sep 20 2020 at 09:57, Linus Torvalds wrote:
> On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> Btw, looking at the stack code, Ithink your new implementation of it
> is a bit scary:
>
>    static inline int kmap_atomic_idx_push(void)
>    {
>   -       int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
>   +       int idx = current->kmap_ctrl.idx++;
>
> and now that 'current->kmap_ctrl.idx' is not atomic wrt
>
>  (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with
> nesting I think it's fine anyway - the NMI will undo whatever it did)

Right. Nesting should be a non issue, but I don't think we have
kmap_atomic() in NMI context.

>  (b) the prev/next switch
>
> And that (b) part worries me. You do the kmap_switch_temporary() to
> switch the entries, but you do that *separately* from actually
> switching 'current' to the new value.
>
> So kmap_switch_temporary() looks safe, but I don't think it actually
> is. Because while it first unmaps the old entries and then remaps the
> new ones, an interrupt can come in, and at that point it matters what
> is *CURRENT*.
>
> And regardless of whether 'current' is 'prev' or 'next', that
> kmap_switch_temporary() loop may be doing the wrong thing, depending
> on which one had the deeper stack. The interrupt will be using
> whatever "current->kmap_ctrl.idx" is, but that might overwrite entries
> that are in the process of being restored (if current is still 'prev',
> but kmap_switch_temporary() is in the "restore @next's kmaps" pgase),
> or it might stomp on entries that have been pte_clear()'ed by the
> 'prev' thing.

Duh yes. Never thought about that.

> Alternatively, that process counter would need about a hundred lines
> of commentary about exactly why it's safe. Because I don't think it
> is.

I think the more obvious solution is to split the whole exercise:


  schedule()
     prepare_switch()
        unmap()

    switch_to()

    finish_switch()
        map()

That's safe because neither the unmap() nor the map() code changes
kmap_ctrl.idx. So if there is an interrupt coming in between unmap() and
switch_to() then a kmap_local() there will use the next entry. So we
could even do the unmap() with interrupts enabled (preemption disabled).
Same for the map() part.

To explain that we need only a few lines of commentry methinks.

Thanks,

        tglx
Linus Torvalds Sept. 20, 2020, 5:42 p.m. UTC | #13
On Sun, Sep 20, 2020 at 10:40 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I think the more obvious solution is to split the whole exercise:
>
>   schedule()
>      prepare_switch()
>         unmap()
>
>     switch_to()
>
>     finish_switch()
>         map()

Yeah, that looks much easier to explain. Ack.

               Linus
Linus Torvalds Sept. 20, 2020, 5:58 p.m. UTC | #14
On Sun, Sep 20, 2020 at 10:42 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yeah, that looks much easier to explain. Ack.

Btw, one thing that might be a good idea at least initially is to add
a check for p->kmap_ctrl.idx being zero at fork, exit and maybe
syscall return time (but that last one may be too cumbersome to really
worry about).

The kmap_atomic() interface basically has a lot of coverage for leaked
as part of all the "might_sleep()" checks sprinkled around,  The new
kmap_temporary/local/whatever wouldn't have that kind of incidental
debug checking, and any leaked kmap indexes would be rather hard to
debug (much) later when they cause index overflows or whatever.

                Linus
Thomas Gleixner Sept. 21, 2020, 7:39 a.m. UTC | #15
On Sun, Sep 20 2020 at 10:42, Linus Torvalds wrote:
> On Sun, Sep 20, 2020 at 10:40 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> I think the more obvious solution is to split the whole exercise:
>>
>>   schedule()
>>      prepare_switch()
>>         unmap()
>>
>>     switch_to()
>>
>>     finish_switch()
>>         map()
>
> Yeah, that looks much easier to explain. Ack.

So far so good, but Peter Z. just pointed out to me that I completely
missed the fact that this cannot work.

If a task is migrated to a different CPU then the mapping address will
change which will explode in colourful ways.

On RT kernels this works because we ping the task to the CPU via
migrate_disable(). On a !RT kernel migrate_disable() maps to
preempt_disable() which brings us back to square one.

/me goes back to the drawing board.

Thanks,

        tglx
Linus Torvalds Sept. 21, 2020, 4:24 p.m. UTC | #16
On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> If a task is migrated to a different CPU then the mapping address will
> change which will explode in colourful ways.

Heh.

Right you are.

Maybe we really *could* call this new kmap functionality something
like "kmap_percpu()" (or maybe "local" is good enough), and make it
act like your RT code does for spinlocks - not disable preemption, but
only disabling CPU migration.

That would probably be good enough for a lot of users that don't want
to expose excessive latencies, but where it's really not a huge deal
to say "stick to this CPU for a short while".

The crypto code certainly sounds like one such case.

             Linus
Thomas Gleixner Sept. 21, 2020, 7:27 p.m. UTC | #17
On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote:
> On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> If a task is migrated to a different CPU then the mapping address will
>> change which will explode in colourful ways.
>
> Right you are.
>
> Maybe we really *could* call this new kmap functionality something
> like "kmap_percpu()" (or maybe "local" is good enough), and make it
> act like your RT code does for spinlocks - not disable preemption, but
> only disabling CPU migration.

I"m all for it, but the scheduler people have opinions :)

> That would probably be good enough for a lot of users that don't want
> to expose excessive latencies, but where it's really not a huge deal
> to say "stick to this CPU for a short while".
>
> The crypto code certainly sounds like one such case.

I looked at a lot of the kmap_atomic() places and quite some of them
only require migration to be disabled to keep the temporary map
stable.

Quite some code could be simplified significantly especially those
places which need to do copy_from/to_user inside these
sections. Graphics is the main example here as Daniel pointed out.

Alternatively this could of course be solved with per CPU page tables
which will come around some day anyway I fear.

Thanks,

        tglx
Peter Zijlstra Sept. 23, 2020, 8:40 a.m. UTC | #18
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote:
> > On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> If a task is migrated to a different CPU then the mapping address will
> >> change which will explode in colourful ways.
> >
> > Right you are.
> >
> > Maybe we really *could* call this new kmap functionality something
> > like "kmap_percpu()" (or maybe "local" is good enough), and make it
> > act like your RT code does for spinlocks - not disable preemption, but
> > only disabling CPU migration.
> 
> I"m all for it, but the scheduler people have opinions :)

Right, so I'm concerned. migrate_disable() wrecks pretty much all
Real-Time scheduler theory we have, and PREEMPRT_RT bringing it in is
somewhat ironic.

Yes, it allows breaking up non-preemptible regions of non-deterministic
duration, and thereby both reduce and bound the scheduling latency, the
cost for doing that is that the theory on CPU utilization/bandwidth go
out the window.

To easily see this consider an SMP system with a number of tasks equal
to the number of CPUs. On a regular (preempt_disable) kernel we can
always run each task, by virtue of always having an idle CPU to take the
task.

However, with migrate_disable() we can have each task preempted in a
migrate_disable() region, worse we can stack them all on the _same_ CPU
(super ridiculous odds, sure). And then we end up only able to run one
task, with the rest of the CPUs picking their nose.

The end result is that, like with unbounded latency, we will still miss
our deadline, simply because we got starved for CPU.


Now, while we could (with a _lot_ of work) rework the kernel to not rely
on the implicit per-cpu ness of things like spinlock_t, the moment we
bring in basic primitives that rely on migrate_disable() we're stuck
with it.


The problem is; afaict there's been no research into this problem. There
might be scheduling (read: balancing) schemes that can mitigate/solve
this problem, or it might prove to be a 'hard' problem, I just don't
know. But once we start down this road, it's going to be hell to get rid
of it.

That's why I've been arguing (for many years) to strictly limit this to
PREEMPT_RT and only as a gap-stop, not a fundamental primitive to build
on.
Peter Zijlstra Sept. 23, 2020, 10:19 a.m. UTC | #19
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
> Alternatively this could of course be solved with per CPU page tables
> which will come around some day anyway I fear.

Previously (with PTI) we looked at making the entire kernel map per-CPU,
and that takes a 2K copy on switch_mm() (or more general, the user part
of whatever the top level directory is for architectures that have a
shared kernel/user page-table setup in the first place).

The idea was having a fixed per-cpu kernel page-table, share a bunch of
(kernel) page-tables between all CPUs and then copy in the user part on
switch.

I've forgotten what the plan was for ASID/PCID in that scheme.

For x86_64 we've been fearing the performance of that 2k copy, but I
don't think we've ever actually bit the bullet and implemented it to see
how bad it really is.
Thomas Gleixner Sept. 23, 2020, 12:33 p.m. UTC | #20
On Wed, Sep 23 2020 at 12:19, peterz wrote:
> On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
>> Alternatively this could of course be solved with per CPU page tables
>> which will come around some day anyway I fear.
>
> Previously (with PTI) we looked at making the entire kernel map per-CPU,
> and that takes a 2K copy on switch_mm() (or more general, the user part
> of whatever the top level directory is for architectures that have a
> shared kernel/user page-table setup in the first place).
>
> The idea was having a fixed per-cpu kernel page-table, share a bunch of
> (kernel) page-tables between all CPUs and then copy in the user part on
> switch.
>
> I've forgotten what the plan was for ASID/PCID in that scheme.
>
> For x86_64 we've been fearing the performance of that 2k copy, but I
> don't think we've ever actually bit the bullet and implemented it to see
> how bad it really is.

I actually did at some point and depending on the workload the overhead
was clearly measurable. And yes, it fell apart with PCID and I could not
come up with a scheme for it which did not suck horribly. So I burried
the patches in the poison cabinet.

Aside of that, we'd need to implement that for a eight other
architectures as well...

Thanks,

        tglx
Thomas Gleixner Sept. 23, 2020, 1:35 p.m. UTC | #21
On Wed, Sep 23 2020 at 10:40, peterz wrote:
> Right, so I'm concerned. migrate_disable() wrecks pretty much all
> Real-Time scheduler theory we have, and PREEMPRT_RT bringing it in is
> somewhat ironic.

It's even more ironic that the approach of PREEMPT_RT has been
'pragmatic ignorance of theory' from the very beginning and despite
violating all theories it still works. :)

> Yes, it allows breaking up non-preemptible regions of non-deterministic
> duration, and thereby both reduce and bound the scheduling latency, the
> cost for doing that is that the theory on CPU utilization/bandwidth go
> out the window.

I agree, that the theory goes out of the window, but does it matter in
practice? I've yet to see a report of migrate disable stacking being the
culprit of a missed deadline and I surely have stared at lots of reports
in the past 10+ years.

> To easily see this consider an SMP system with a number of tasks equal
> to the number of CPUs. On a regular (preempt_disable) kernel we can
> always run each task, by virtue of always having an idle CPU to take the
> task.
>
> However, with migrate_disable() we can have each task preempted in a
> migrate_disable() region, worse we can stack them all on the _same_ CPU
> (super ridiculous odds, sure). And then we end up only able to run one
> task, with the rest of the CPUs picking their nose.
>
> The end result is that, like with unbounded latency, we will still miss
> our deadline, simply because we got starved for CPU.

I'm well aware of that.

> Now, while we could (with a _lot_ of work) rework the kernel to not rely
> on the implicit per-cpu ness of things like spinlock_t, the moment we
> bring in basic primitives that rely on migrate_disable() we're stuck
> with it.

Right, but we are stuck with per CPUness and distangling that is just
infeasible IMO.

> The problem is; afaict there's been no research into this problem.

There is no research on a lot of things the kernel does :)

> There might be scheduling (read: balancing) schemes that can
> mitigate/solve this problem, or it might prove to be a 'hard' problem,
> I just don't know.

In practive balancing surely can take the number of preempted tasks
which are in a migrate disable section into account which would be just
another measure to work around the fact that the kernel is not adhering
to the theories. It never did that even w/o migrate disable.

> But once we start down this road, it's going to be hell to get rid of
> it.

Like most of the other things the kernel came up with to deal with the
oddities of modern hardware :)

> That's why I've been arguing (for many years) to strictly limit this to
> PREEMPT_RT and only as a gap-stop, not a fundamental primitive to build
> on.

I know, but short of rewriting the world, I'm not seing the faintest
plan to remove the stop gap. :)

As we discussed not long ago we have too many inconsistent preemption
models already. RT is adding yet another one. And that's worse than
introducing migrate disable as a general available facility.

IMO, reaching a point of consistency where our different preemption
models (NONE, VOLUNTARY, PREEMPT. RT) build on each other is far more
important.

For !RT migrate disable is far less of an danger than for RT kernels
because the amount of code which will use it is rather limited compared
to code which still will disable preemption implicit through spin and rw
locks.

On RT converting these locks to 'sleepable spinlocks' is just possible
because RT forces almost everything into task context and migrate
disable is just the obvious decomposition of preempt disable which
implicitely disables migration.

But that means that RT is by orders of magnitude more prone to run into
the scheduling trainwreck you are worried about. It just refuses to do
so at least with real world work loads.

I'm surely in favour of having solid theories behind implementation, but
at some point you just have to bite the bullet and chose pragmatism in
order to make progress.

Proliferating inconsistency is not real progress, as it is violating the
most fundamental engineering principles. That's by far more dangerous
than violating scheduling theories which are built on perfect models and
therefore enforce violation by practical implementations anyway.

Thanks,

        tglx
Thomas Gleixner Sept. 23, 2020, 2:33 p.m. UTC | #22
On Mon, Sep 21 2020 at 21:27, Thomas Gleixner wrote:
> On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote:
>> On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Maybe we really *could* call this new kmap functionality something
>> like "kmap_percpu()" (or maybe "local" is good enough), and make it
>> act like your RT code does for spinlocks - not disable preemption, but
>> only disabling CPU migration.
>
> I"m all for it, but the scheduler people have opinions :)

I just took the latest version of migrate disable patches

  https://lore.kernel.org/r/20200921163557.234036895@infradead.org

removed the RT dependency on top of them and adopted the kmap stuff
(addressing the various comments while at it and unbreaking ARM).

I'm not going to post that until there is consensus about the general
availability of migrate disable, but for those who want to play with it
I've pushed the resulting combo out to:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem 

For testing I've tweaked a few places to use the new _local() variants
and it survived testing so far and I've verified that there is actual
preemption which means zap/restore of the thread local kmaps.

Thanks,

        tglx
Steven Rostedt Sept. 23, 2020, 3:52 p.m. UTC | #23
On Wed, 23 Sep 2020 10:40:32 +0200
peterz@infradead.org wrote:

> However, with migrate_disable() we can have each task preempted in a
> migrate_disable() region, worse we can stack them all on the _same_ CPU
> (super ridiculous odds, sure). And then we end up only able to run one
> task, with the rest of the CPUs picking their nose.

What if we just made migrate_disable() a local_lock() available for !RT?

I mean make it a priority inheritance PER CPU lock.

That is, no two tasks could do a migrate_disable() on the same CPU? If
one task does a migrate_disable() and then gets preempted and the
preempting task does a migrate_disable() on the same CPU, it will block
and wait for the first task to do a migrate_enable().

No two tasks on the same CPU could enter the migrate_disable() section
simultaneously, just like no two tasks could enter a preempt_disable()
section.

In essence, we just allow local_lock() to be used for both RT and !RT.

Perhaps make migrate_disable() an anonymous local_lock()?

This should lower the SHC in theory, if you can't have stacked migrate
disables on the same CPU.

-- Steve
Thomas Gleixner Sept. 23, 2020, 8:55 p.m. UTC | #24
On Wed, Sep 23 2020 at 11:52, Steven Rostedt wrote:
> On Wed, 23 Sep 2020 10:40:32 +0200
> peterz@infradead.org wrote:
>
>> However, with migrate_disable() we can have each task preempted in a
>> migrate_disable() region, worse we can stack them all on the _same_ CPU
>> (super ridiculous odds, sure). And then we end up only able to run one
>> task, with the rest of the CPUs picking their nose.
>
> What if we just made migrate_disable() a local_lock() available for !RT?
>
> I mean make it a priority inheritance PER CPU lock.
>
> That is, no two tasks could do a migrate_disable() on the same CPU? If
> one task does a migrate_disable() and then gets preempted and the
> preempting task does a migrate_disable() on the same CPU, it will block
> and wait for the first task to do a migrate_enable().
>
> No two tasks on the same CPU could enter the migrate_disable() section
> simultaneously, just like no two tasks could enter a preempt_disable()
> section.
>
> In essence, we just allow local_lock() to be used for both RT and !RT.
>
> Perhaps make migrate_disable() an anonymous local_lock()?
>
> This should lower the SHC in theory, if you can't have stacked migrate
> disables on the same CPU.

I'm pretty sure this ends up in locking hell pretty fast and aside of
that it's not working for scenarios like:

     kmap_local();
       migrate_disable();
       ...

     copy_from_user()
        -> #PF
           -> schedule()

which brought us into that discussion in the first place. You would stop
any other migrate disable user from running until the page fault is
resolved...

Thanks,

        tglx
Steven Rostedt Sept. 23, 2020, 9:12 p.m. UTC | #25
On Wed, 23 Sep 2020 22:55:54 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > Perhaps make migrate_disable() an anonymous local_lock()?
> >
> > This should lower the SHC in theory, if you can't have stacked migrate
> > disables on the same CPU.  
> 
> I'm pretty sure this ends up in locking hell pretty fast and aside of
> that it's not working for scenarios like:
> 
>      kmap_local();
>        migrate_disable();
>        ...
> 
>      copy_from_user()
>         -> #PF
>            -> schedule()  
> 
> which brought us into that discussion in the first place. You would stop
> any other migrate disable user from running until the page fault is
> resolved...

Then scratch the idea of having anonymous local_lock() and just bring
local_lock in directly? Then have a kmap local lock, which would only
block those that need to do a kmap.

Now as for migration disabled nesting, at least now we would have
groupings of this, and perhaps the theorists can handle that. I mean,
how is this much different that having a bunch of tasks blocked on a
mutex with the owner is pinned on a CPU?

migrate_disable() is a BKL of pinning affinity. If we only have
local_lock() available (even on !RT), then it makes the blocking in
groups. At least this way you could grep for all the different
local_locks in the system and plug that into the algorithm for WCS,
just like one would with a bunch of mutexes.

-- Steve
Thomas Gleixner Sept. 24, 2020, 6:57 a.m. UTC | #26
On Wed, Sep 23 2020 at 17:12, Steven Rostedt wrote:
> On Wed, 23 Sep 2020 22:55:54 +0200
> Then scratch the idea of having anonymous local_lock() and just bring
> local_lock in directly? Then have a kmap local lock, which would only
> block those that need to do a kmap.

That's still going to end up in lock ordering nightmares and you lose
the ability to use kmap_local from arbitrary contexts which was again
one of the goals of this exercise.

Aside of that you're imposing reentrancy protections on something which
does not need it in the first place.

> Now as for migration disabled nesting, at least now we would have
> groupings of this, and perhaps the theorists can handle that. I mean,
> how is this much different that having a bunch of tasks blocked on a
> mutex with the owner is pinned on a CPU?
>
> migrate_disable() is a BKL of pinning affinity.

No. That's just wrong. preempt disable is a concurrency control,
i.e. protecting against reentrancy on a given CPU. But it's a cpu global
protection which means that it's not protecting a specific code path.

Contrary to preempt disable, migrate disable is not protecting against
reentrancy on a given CPU. It's a temporary restriction to the scheduler
on placement.

The fact that disabling preemption implicitely disables migration does
not make them semantically equivalent.

> If we only have local_lock() available (even on !RT), then it makes
> the blocking in groups. At least this way you could grep for all the
> different local_locks in the system and plug that into the algorithm
> for WCS, just like one would with a bunch of mutexes.

You cannot do that on RT at all where migrate disable is substituting
preempt disable in spin and rw locks. The result would be the same as
with a !RT kernel just with horribly bad performance.

That means the stacking problem has to be solved anyway.

So why on earth do you want to create yet another special duct tape case
for kamp_local() which proliferates inconsistency instead of aiming for
consistency accross all preemption models?

Thanks,

        tglx
Peter Zijlstra Sept. 24, 2020, 8:27 a.m. UTC | #27
On Wed, Sep 23, 2020 at 11:52:51AM -0400, Steven Rostedt wrote:
> On Wed, 23 Sep 2020 10:40:32 +0200
> peterz@infradead.org wrote:
> 
> > However, with migrate_disable() we can have each task preempted in a
> > migrate_disable() region, worse we can stack them all on the _same_ CPU
> > (super ridiculous odds, sure). And then we end up only able to run one
> > task, with the rest of the CPUs picking their nose.
> 
> What if we just made migrate_disable() a local_lock() available for !RT?

Can't, neiter migrate_disable() nor migrate_enable() are allowed to
block -- which is what makes their implementation so 'interesting'.

> This should lower the SHC in theory, if you can't have stacked migrate
> disables on the same CPU.

See this email in that other thread (you're on Cc too IIRC):

  https://lkml.kernel.org/r/20200923170809.GY1362448@hirez.programming.kicks-ass.net

I think that is we 'frob' the balance PULL, we'll end up with something
similar.

Whichever way around we turn this thing, the migrate_disable() runtime
(we'll have to add a tracer for that), will be an interference term on
the lower priority task, exactly like preempt_disable() would be. We'll
just not exclude a higher priority task from running.


AFAICT; the best case is a single migrate_disable() nesting, where a
higher priority task preempts in a migrate_disable() section -- this is
per design.

When this preempted task becomes elegible to run under the ideal model
(IOW it becomes one of the M highest priority tasks), it might still
have to wait for the preemptee's migrate_disable() section to complete.
Thereby suffering interference in the exact duration of
migrate_disable() section.

Per this argument, the change from preempt_disable() to
migrate_disable() gets us:

 - higher priority tasks gain reduced wake-up latency
 - lower priority tasks are unchanged and are subject to the exact same
   interference term as if the higher priority task were using
   preempt_disable().

Since we've already established this term is unbounded, any task but the
highest priority task is basically buggered.

TL;DR, if we get balancing fixed and achieve (near) the optimal case
above, migrate_disable() is an over-all win. But it's provably
non-deterministic as long as the migrate_disable() sections are
non-deterministic.


The reason this all mostly works in practise is (I think) because:

 - People care most about the higher prio RT tasks and craft them to
   mostly avoid the migrate_disable() infected code.

 - The preemption scenario is most pronounced at higher utilization
   scenarios, and I suspect this is fairly rare to begin with.

 - And while many of these migrate_disable() regions are unbound in
   theory, in practise they're often fairly reasonable.


So my current todo list is:

 - Change RT PULL
 - Change DL PULL
 - Add migrate_disable() tracer; exactly like preempt/irqoff, except
   measuring task-runtime instead of cpu-time.
 - Add a mode that measures actual interference.
 - Add a traceevent to detect preemption in migrate_disable().


And then I suppose I should twist Daniel's arm to update his model to
include these scenarios and numbers.
Steven Rostedt Sept. 24, 2020, 12:32 p.m. UTC | #28
On Thu, 24 Sep 2020 08:57:52 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > Now as for migration disabled nesting, at least now we would have
> > groupings of this, and perhaps the theorists can handle that. I mean,
> > how is this much different that having a bunch of tasks blocked on a
> > mutex with the owner is pinned on a CPU?
> >
> > migrate_disable() is a BKL of pinning affinity.  
> 
> No. That's just wrong. preempt disable is a concurrency control,

I think you totally misunderstood what I was saying. The above wasn't about
comparing preempt_disable to migrate_disable. It was comparing
migrate_disable to a chain of tasks blocked on mutexes where the top owner
has preempt_disable set. You still have a bunch of tasks that can't move to
other CPUs.


> > If we only have local_lock() available (even on !RT), then it makes
> > the blocking in groups. At least this way you could grep for all the
> > different local_locks in the system and plug that into the algorithm
> > for WCS, just like one would with a bunch of mutexes.  
> 
> You cannot do that on RT at all where migrate disable is substituting
> preempt disable in spin and rw locks. The result would be the same as
> with a !RT kernel just with horribly bad performance.

Note, the spin and rwlocks already have a lock associated with them. Why
would it be any different on RT? I wasn't suggesting adding another lock
inside a spinlock. Why would I recommend THAT? I wasn't recommending
blindly replacing migrate_disable() with local_lock(). I just meant expose
local_lock() but not migrate_disable().

> 
> That means the stacking problem has to be solved anyway.
> 
> So why on earth do you want to create yet another special duct tape case
> for kamp_local() which proliferates inconsistency instead of aiming for
> consistency accross all preemption models?

The idea was to help with the scheduling issue.

Anyway, instead of blocking. What about having a counter of number of
migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's
already another task with migrate_disabled() set, and the current task has
an affinity greater than 1, it tries to migrate to another CPU?

This way migrate_disable() is less likely to have a bunch of tasks blocked
on one CPU serialized by each task exiting the migrate_disable() section.

Yes, there's more overhead, but it only happens if multiple tasks are in a
migrate disable section on the same CPU.

-- Steve
Peter Zijlstra Sept. 24, 2020, 12:42 p.m. UTC | #29
On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
> Anyway, instead of blocking. What about having a counter of number of
> migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's
> already another task with migrate_disabled() set, and the current task has
> an affinity greater than 1, it tries to migrate to another CPU?

That doesn't solve the problem. On wakeup we should already prefer an
idle CPU over one running a (RT) task, but you can always wake more
tasks than there's CPUs around and you'll _have_ to stack at some point.

The trick is how to unstack them correctly. We need to detect when a
migrate_disable() task _should_ start running again, and migrate away
whoever is in the way at that point.

It turns out, that getting selected for pull-balance is exactly that
condition, and clearly a migrate_disable() task cannot be pulled, but we
can use that signal to try and pull away the running task that's in the
way.
Steven Rostedt Sept. 24, 2020, 1:51 p.m. UTC | #30
On Thu, 24 Sep 2020 14:42:41 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
> > Anyway, instead of blocking. What about having a counter of number of
> > migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's
> > already another task with migrate_disabled() set, and the current task has
> > an affinity greater than 1, it tries to migrate to another CPU?  
> 
> That doesn't solve the problem. On wakeup we should already prefer an
> idle CPU over one running a (RT) task, but you can always wake more
> tasks than there's CPUs around and you'll _have_ to stack at some point.

Yes, understood.

> 
> The trick is how to unstack them correctly. We need to detect when a
> migrate_disable() task _should_ start running again, and migrate away
> whoever is in the way at that point.
> 
> It turns out, that getting selected for pull-balance is exactly that
> condition, and clearly a migrate_disable() task cannot be pulled, but we
> can use that signal to try and pull away the running task that's in the
> way.

Unless of course that running task is in a migrate disable section
itself ;-)

But I guess we will always have that SHC, and there will always be a
scenario that you can't balance properly. But hopefully in practice we
wont see that.

How to handle kmap_local(), will migrate_disable() be used only for
32bit or, for consistency, will it also apply to 64bit?

-- Steve
Peter Zijlstra Sept. 24, 2020, 1:58 p.m. UTC | #31
On Thu, Sep 24, 2020 at 09:51:38AM -0400, Steven Rostedt wrote:

> > It turns out, that getting selected for pull-balance is exactly that
> > condition, and clearly a migrate_disable() task cannot be pulled, but we
> > can use that signal to try and pull away the running task that's in the
> > way.
> 
> Unless of course that running task is in a migrate disable section
> itself ;-)

See my ramblings here:

  https://lkml.kernel.org/r/20200924082717.GA1362448@hirez.programming.kicks-ass.net

My plan was to have the migrate_enable() of the running task trigger the
migration in that case.
Thomas Gleixner Sept. 24, 2020, 5:55 p.m. UTC | #32
On Thu, Sep 24 2020 at 08:32, Steven Rostedt wrote:
> On Thu, 24 Sep 2020 08:57:52 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> > Now as for migration disabled nesting, at least now we would have
>> > groupings of this, and perhaps the theorists can handle that. I mean,
>> > how is this much different that having a bunch of tasks blocked on a
>> > mutex with the owner is pinned on a CPU?
>> >
>> > migrate_disable() is a BKL of pinning affinity.  
>> 
>> No. That's just wrong. preempt disable is a concurrency control,
>
> I think you totally misunderstood what I was saying. The above wasn't about
> comparing preempt_disable to migrate_disable. It was comparing
> migrate_disable to a chain of tasks blocked on mutexes where the top owner
> has preempt_disable set. You still have a bunch of tasks that can't move to
> other CPUs.

What? The top owner does not prevent any task from moving. The tasks
cannot move because they are blocked on the mutex, which means they are
not runnable and non runnable tasks are not migrated at all.

I really don't understand what you are trying to say.

>> > If we only have local_lock() available (even on !RT), then it makes
>> > the blocking in groups. At least this way you could grep for all the
>> > different local_locks in the system and plug that into the algorithm
>> > for WCS, just like one would with a bunch of mutexes.  
>> 
>> You cannot do that on RT at all where migrate disable is substituting
>> preempt disable in spin and rw locks. The result would be the same as
>> with a !RT kernel just with horribly bad performance.
>
> Note, the spin and rwlocks already have a lock associated with them. Why
> would it be any different on RT? I wasn't suggesting adding another lock
> inside a spinlock. Why would I recommend THAT? I wasn't recommending
> blindly replacing migrate_disable() with local_lock(). I just meant expose
> local_lock() but not migrate_disable().

We already exposed local_lock() to non RT and it's for places which do
preempt_disable() or local_irq_disable() without having a lock
associated. But both primitives are scope less and therefore behave like
CPU local BKLs. What local_lock() provides in these cases is:

  - Making the protection scope clear by associating a named local
    lock which is coverred by lockdep.

  - It still maps to preempt_disable() or local_irq_disable() in !RT
    kernels

  - The scope and the named lock allows RT kernels to substitute with
    real (recursion aware) locking primitives which keep preemption and
    interupts enabled, but provide the fine grained protection for the
    scoped critical section.
  
So how would you substitute migrate_disable() with a local_lock()? You
can't. Again migrate_disable() is NOT a concurrency control and
therefore it cannot be substituted by any concurrency control primitive.

>> That means the stacking problem has to be solved anyway.
>> 
>> So why on earth do you want to create yet another special duct tape case
>> for kamp_local() which proliferates inconsistency instead of aiming for
>> consistency accross all preemption models?
>
> The idea was to help with the scheduling issue.

I don't see how mixing concepts and trying to duct tape a problem which
is clearly in the realm of the scheduler, i.e. load balancing and
placing algorithms, is helpful.

Thanks,

        tglx
Steven Rostedt Sept. 24, 2020, 6:58 p.m. UTC | #33
On Thu, 24 Sep 2020 19:55:10 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, Sep 24 2020 at 08:32, Steven Rostedt wrote:
> > On Thu, 24 Sep 2020 08:57:52 +0200
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> >  
> >> > Now as for migration disabled nesting, at least now we would have
> >> > groupings of this, and perhaps the theorists can handle that. I mean,
> >> > how is this much different that having a bunch of tasks blocked on a
> >> > mutex with the owner is pinned on a CPU?
> >> >
> >> > migrate_disable() is a BKL of pinning affinity.    
> >> 
> >> No. That's just wrong. preempt disable is a concurrency control,  
> >
> > I think you totally misunderstood what I was saying. The above wasn't about
> > comparing preempt_disable to migrate_disable. It was comparing
> > migrate_disable to a chain of tasks blocked on mutexes where the top owner
> > has preempt_disable set. You still have a bunch of tasks that can't move to
> > other CPUs.  
> 
> What? The top owner does not prevent any task from moving. The tasks
> cannot move because they are blocked on the mutex, which means they are
> not runnable and non runnable tasks are not migrated at all.

And neither are migrated disabled tasks preempted by a high priority
task.

> 
> I really don't understand what you are trying to say.

Don't worry about it. I was just making a high level comparison of how
migrate disabled tasks blocked on a higher priority task is similar to
that of tasks blocked on a mutex held by a pinned task that is
preempted by a high priority task. But we can forget this analogy as
it's not appropriate for the current conversation.

> 
> >> > If we only have local_lock() available (even on !RT), then it makes
> >> > the blocking in groups. At least this way you could grep for all the
> >> > different local_locks in the system and plug that into the algorithm
> >> > for WCS, just like one would with a bunch of mutexes.    
> >> 
> >> You cannot do that on RT at all where migrate disable is substituting
> >> preempt disable in spin and rw locks. The result would be the same as
> >> with a !RT kernel just with horribly bad performance.  
> >
> > Note, the spin and rwlocks already have a lock associated with them. Why
> > would it be any different on RT? I wasn't suggesting adding another lock
> > inside a spinlock. Why would I recommend THAT? I wasn't recommending
> > blindly replacing migrate_disable() with local_lock(). I just meant expose
> > local_lock() but not migrate_disable().  
> 
> We already exposed local_lock() to non RT and it's for places which do
> preempt_disable() or local_irq_disable() without having a lock
> associated. But both primitives are scope less and therefore behave like
> CPU local BKLs. What local_lock() provides in these cases is:
> 
>   - Making the protection scope clear by associating a named local
>     lock which is coverred by lockdep.
> 
>   - It still maps to preempt_disable() or local_irq_disable() in !RT
>     kernels
> 
>   - The scope and the named lock allows RT kernels to substitute with
>     real (recursion aware) locking primitives which keep preemption and
>     interupts enabled, but provide the fine grained protection for the
>     scoped critical section.

I'm very much aware of the above.

>   
> So how would you substitute migrate_disable() with a local_lock()? You
> can't. Again migrate_disable() is NOT a concurrency control and
> therefore it cannot be substituted by any concurrency control primitive.

When I was first writing my email, I was writing about a way to replace
migrate_disable with a construct similar to local locks without
actually mentioning local locks, but then rewrote it to state local
locks, trying to simplify what I was writing. I shouldn't have done
that, because it portrayed that I wanted to use local_lock()
unmodified. I was actually thinking of a new construct that was similar
but not exactly the same as local lock.

But this will just make things more complex and we can forget about it.

I'll wait to see what Peter produces.

-- Steve
Daniel Bristot de Oliveira Sept. 24, 2020, 7:36 p.m. UTC | #34
On 9/24/20 10:27 AM, peterz@infradead.org wrote:
> So my current todo list is:
> 
>  - Change RT PULL
>  - Change DL PULL
>  - Add migrate_disable() tracer; exactly like preempt/irqoff, except
>    measuring task-runtime instead of cpu-time.
>  - Add a mode that measures actual interference.
>  - Add a traceevent to detect preemption in migrate_disable().
> 
> 
> And then I suppose I should twist Daniel's arm to update his model to
> include these scenarios and numbers.

Challenge accepted :-)

-- Daniel