Message ID | 20210731175341.3458608-1-lrizzo@google.com (mailing list archive)
---|---
State | New
Series | Add mmap_assert_locked() annotations to find_vma*()
On Sat, 31 Jul 2021 10:53:41 -0700 Luigi Rizzo <lrizzo@google.com> wrote:

> find_vma() and variants need protection when used.
> This patch adds mmap_assert_locked() calls in the functions.
>
> To make sure the invariant is satisfied, we also need to add a
> mmap_read_lock() around the get_user_pages_remote() call in
> get_arg_page(). The lock is not strictly necessary because the mm
> has been newly created, but the extra cost is limited because
> the same lock was also acquired shortly before in __bprm_mm_init(),
> so it is hot and uncontended.

Well, it isn't cost-free. find_vma() is called a lot, and a surprising
number of systems apparently run with CONFIG_DEBUG_VM. Why do you
think this cost is justified?
On Sun, Aug 1, 2021 at 9:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sat, 31 Jul 2021 10:53:41 -0700 Luigi Rizzo <lrizzo@google.com> wrote:
>
> > find_vma() and variants need protection when used.
> > This patch adds mmap_assert_locked() calls in the functions.
> >
> > To make sure the invariant is satisfied, we also need to add a
> > mmap_read_lock() around the get_user_pages_remote() call in
> > get_arg_page(). The lock is not strictly necessary because the mm
> > has been newly created, but the extra cost is limited because
> > the same lock was also acquired shortly before in __bprm_mm_init(),
> > so it is hot and uncontended.
>
> Well, it isn't cost-free. find_vma() is called a lot, and a surprising
> number of systems apparently run with CONFIG_DEBUG_VM. Why do you
> think this cost is justified?

I assume you are concerned with the cost of mmap_assert_locked()?

I'd say the justification is the same as for all asserts: at some point
some code change may miss the required lock, and the asserts are there
to catch elusive race conditions.

There are in fact already instances of mmap_assert_locked() right
before find_vma() in walk_page_range(), and a couple before calls to
__get_user_pages().

As for the cost, I'd think that if CONFIG_DEBUG_VM is set, one does it
on purpose to catch errors and is prepared to pay the cost (in this
case the atomic_read(counter) in rwsem_is_locked(); the counter should
be hot).

FWIW I have instrumented find_vma() on a fast machine using kstats

  https://github.com/luigirizzo/lr-cstats

(load the module, then enable the trace with

  echo "trace pcpu:find_vma bits 3" > /sys/kernel/debug/kstats/_control

and monitor the time with

  watch "grep CPUS /sys/kernel/debug/kstats/find_vma")

I didn't run anything especially intensive except some network
benchmarks, but I have collected ~2M samples with the following
distribution of find_vma() time in nanoseconds in 3 configs:

  CONFIGURATION     p10  p50  p90  p95   p98

  no-debug           89  109  214  332   605
  debug             331  369  603  862  1338
  debug+this patch  337  369  603  863  1339

As you can see, just compiling a debug kernel, even without this patch,
makes the function 3x more expensive. The effect of this patch is not
measurable (the differences are below measurement error).

cheers
luigi
On Mon, 2 Aug 2021 02:16:14 +0200 Luigi Rizzo <lrizzo@google.com> wrote:

> > Well, it isn't cost-free. find_vma() is called a lot, and a surprising
> > number of systems apparently run with CONFIG_DEBUG_VM. Why do you
> > think this cost is justified?
>
> I assume you are concerned with the cost of mmap_assert_locked()?
>
> I'd say the justification is the same as for all asserts: at some point
> some code change may miss the required lock, and the asserts are there
> to catch elusive race conditions.
>
> There are in fact already instances of mmap_assert_locked() right
> before find_vma() in walk_page_range(), and a couple before calls to
> __get_user_pages().
>
> As for the cost, I'd think that if CONFIG_DEBUG_VM is set, one does it
> on purpose to catch errors and is prepared to pay the cost (in this
> case the atomic_read(counter) in rwsem_is_locked(); the counter should
> be hot).
>
> FWIW I have instrumented find_vma() on a fast machine using kstats
>
>   https://github.com/luigirizzo/lr-cstats
>
> (load the module, then enable the trace with
>
>   echo "trace pcpu:find_vma bits 3" > /sys/kernel/debug/kstats/_control
>
> and monitor the time with
>
>   watch "grep CPUS /sys/kernel/debug/kstats/find_vma")
>
> I didn't run anything especially intensive except some network
> benchmarks, but I have collected ~2M samples with the following
> distribution of find_vma() time in nanoseconds in 3 configs:
>
>   CONFIGURATION     p10  p50  p90  p95   p98
>
>   no-debug           89  109  214  332   605
>   debug             331  369  603  862  1338
>   debug+this patch  337  369  603  863  1339
>
> As you can see, just compiling a debug kernel, even without this patch,
> makes the function 3x more expensive. The effect of this patch is not
> measurable (the differences are below measurement error).

Cool, thanks, that's convincing.
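For reference, the check whose cost is discussed above compiles to very
little: a lockdep annotation plus, under CONFIG_DEBUG_VM, one atomic read
of the rwsem count word. A minimal sketch of the two helpers involved,
assuming the v5.13-era definitions (modeled on include/linux/mmap_lock.h
and include/linux/rwsem.h; illustrative, not a verbatim copy):

/* Sketch, assuming v5.13-era kernel headers. */
static inline int rwsem_is_locked(struct rw_semaphore *sem)
{
	/* a single atomic read; nonzero means readers or a writer hold it */
	return atomic_long_read(&sem->count) != 0;
}

static inline void mmap_assert_locked(struct mm_struct *mm)
{
	lockdep_assert_held(&mm->mmap_lock);		/* no-op without lockdep */
	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm); /* no-op without CONFIG_DEBUG_VM */
}

This is consistent with the numbers above: on a debug kernel the count
word is normally cache-hot, so the assert itself adds nothing measurable.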
On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> find_vma() and variants need protection when used.
> This patch adds mmap_assert_locked() calls in the functions.
>
> To make sure the invariant is satisfied, we also need to add a
> mmap_read_lock() around the get_user_pages_remote() call in
> get_arg_page(). The lock is not strictly necessary because the mm
> has been newly created, but the extra cost is limited because
> the same lock was also acquired shortly before in __bprm_mm_init(),
> so it is hot and uncontended.
>
> Signed-off-by: Luigi Rizzo <lrizzo@google.com>
>  fs/exec.c | 2 ++
>  mm/mmap.c | 2 ++
>  2 files changed, 4 insertions(+)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 38f63451b928..ac7603e985b4 100644
> +++ b/fs/exec.c
> @@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
>  	 * We are doing an exec(). 'current' is the process
>  	 * doing the exec and bprm->mm is the new process's mm.
>  	 */
> +	mmap_read_lock(bprm->mm);
>  	ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
>  			&page, NULL, NULL);
> +	mmap_read_unlock(bprm->mm);
>  	if (ret <= 0)
>  		return NULL;

Wasn't Jann Horn working on something like this too?

https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/

IIRC it was very tricky here, are you sure it is OK to obtain this lock
here?

I would much rather see Jann's complete solution be merged than hacking
at the exec problem on the side..

Jason
On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > find_vma() and variants need protection when used.
> > This patch adds mmap_assert_locked() calls in the functions.
> >
> > To make sure the invariant is satisfied, we also need to add a
> > mmap_read_lock() around the get_user_pages_remote() call in
> > get_arg_page(). The lock is not strictly necessary because the mm
> > has been newly created, but the extra cost is limited because
> > the same lock was also acquired shortly before in __bprm_mm_init(),
> > so it is hot and uncontended.
> >
> > [...]
>
> Wasn't Jann Horn working on something like this too?
>
> https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
>
> IIRC it was very tricky here, are you sure it is OK to obtain this lock
> here?

I cannot comment on Jann's patch series, but no other thread knows
about this mm at this point in the code, so the lock is definitely safe
to acquire (shortly before, a write lock was also acquired on the same
mm, in the same conditions).

cheers
luigi

> I would much rather see Jann's complete solution be merged than
> hacking at the exec problem on the side..
>
> Jason
* Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> On Tue, Aug 3, 2021 at 6:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Sat, Jul 31, 2021 at 10:53:41AM -0700, Luigi Rizzo wrote:
> > > [...]
> >
> > Wasn't Jann Horn working on something like this too?
> >
> > https://lore.kernel.org/linux-mm/20201016225713.1971256-1-jannh@google.com/
> >
> > IIRC it was very tricky here, are you sure it is OK to obtain this lock
> > here?
>
> I cannot comment on Jann's patch series, but no other thread knows
> about this mm at this point in the code, so the lock is definitely safe
> to acquire (shortly before, a write lock was also acquired on the same
> mm, in the same conditions).

If there is no other code that knows about this mm, then does one need
the lock at all? Is this just to satisfy the new check you added?

If you want to make this change, I would suggest writing it in a way to
ensure the call to expand_downwards() in the same function also holds
the lock. I believe this is technically required as well? What do you
think?

Thanks,
Liam
On Tue, Aug 03, 2021 at 11:07:35PM +0000, Liam Howlett wrote:
> * Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> > [...]
> >
> > I cannot comment on Jann's patch series, but no other thread knows
> > about this mm at this point in the code, so the lock is definitely safe
> > to acquire (shortly before, a write lock was also acquired on the same
> > mm, in the same conditions).
>
> If there is no other code that knows about this mm, then does one need
> the lock at all? Is this just to satisfy the new check you added?
>
> If you want to make this change, I would suggest writing it in a way to
> ensure the call to expand_downwards() in the same function also holds
> the lock. I believe this is technically required as well? What do you
> think?

This is essentially what Jann was doing. Since the mm is newly created,
we can create it write-locked and then we can add proper locking tests
to many of the functions called along this path.

Adding useless locks around each troublesome callsite just seems
really confusing to me.

Jason
On Wed, Aug 4, 2021 at 1:35 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Aug 03, 2021 at 11:07:35PM +0000, Liam Howlett wrote:
> > [...]
> >
> > If there is no other code that knows about this mm, then does one need
> > the lock at all? Is this just to satisfy the new check you added?
> >
> > If you want to make this change, I would suggest writing it in a way to
> > ensure the call to expand_downwards() in the same function also holds
> > the lock. I believe this is technically required as well? What do you
> > think?
>
> This is essentially what Jann was doing. Since the mm is newly created,
> we can create it write-locked and then we can add proper locking tests
> to many of the functions called along this path.
>
> Adding useless locks around each troublesome callsite just seems
> really confusing to me.

Uhm... by that reasoning, even creating the mm locked (and unlocking at
the end) is equally unnecessary.

My goal was to add asserts and invariants that are easy to understand
and get right, rather than optimize a path that does not appear to be
critical.

Adding one read lock pair around the one function we annotate is easy
to understand and clearly a leaf lock.

Having alloc_bprm() return a locked object is a bit unconventional, and
also passing it to other methods raises the question of whether they
take other locks, possibly causing lock order reversals in the future.

cheers
luigi
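To make the two shapes under discussion concrete, here is a rough sketch.
get_arg_page_sketch() is a simplified, hypothetical stand-in for fs/exec.c's
get_arg_page() (flag computation and error handling elided), and approach (b)
only paraphrases the description of Jann's series given in this thread; it is
not either patch verbatim.

/*
 * (a) This patch: a self-contained leaf read-lock pair around the one
 *     GUP call. Trivially safe, since no other thread can see bprm->mm
 *     yet, and trivially a leaf lock.
 */
static struct page *get_arg_page_sketch(struct linux_binprm *bprm,
					unsigned long pos,
					unsigned int gup_flags)
{
	struct page *page;
	long ret;

	mmap_read_lock(bprm->mm);
	ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
				    &page, NULL, NULL);
	mmap_read_unlock(bprm->mm);

	return ret <= 0 ? NULL : page;
}

/*
 * (b) The alternative (roughly, Jann's series): alloc_bprm() returns the
 *     bprm with bprm->mm already write-locked, the whole exec setup path
 *     runs under that lock, and it is released once setup completes, so
 *     every helper on the path can call mmap_assert_locked() honestly.
 */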
* Luigi Rizzo <lrizzo@google.com> [210803 19:58]:
> On Wed, Aug 4, 2021 at 1:35 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > [...]
> >
> > This is essentially what Jann was doing. Since the mm is newly created,
> > we can create it write-locked and then we can add proper locking tests
> > to many of the functions called along this path.

That sounds good. Jann has left the patch as pending a fix since
November 2020. Can't the removal of the lock/unlock be added to the
next iteration of the patch? Was there a v4 of that patch?

> > Adding useless locks around each troublesome callsite just seems
> > really confusing to me.
>
> Uhm... by that reasoning, even creating the mm locked (and unlocking at
> the end) is equally unnecessary.

I think taking the lock is more clear than leaving it the way it's
currently written. It is actually confusing, though, to see the lock
taken after calling expand_downwards(), which explicitly mentions the
lock as required in its comments. This should at least have a comment
about early creation not requiring the lock.

> My goal was to add asserts and invariants that are easy to understand
> and get right, rather than optimize a path that does not appear to be
> critical.
>
> Adding one read lock pair around the one function we annotate is easy
> to understand and clearly a leaf lock.
>
> Having alloc_bprm() return a locked object is a bit unconventional, and
> also passing it to other methods raises the question of whether they
> take other locks, possibly causing lock order reversals in the future.

We are (probably?) okay, as the usual order right now is to take the
mmap sem before the pte and interval tree locks. It's also just for the
setup, so unless there is a special case that could cause trouble... or
maybe I should ask which cases will cause trouble?

Thanks,
Liam
On Wed, Aug 4, 2021 at 1:07 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> * Luigi Rizzo <lrizzo@google.com> [210803 17:49]:
> > [...]
> >
> > I cannot comment on Jann's patch series, but no other thread knows
> > about this mm at this point in the code, so the lock is definitely safe
> > to acquire (shortly before, a write lock was also acquired on the same
> > mm, in the same conditions).
>
> If there is no other code that knows about this mm, then does one need
> the lock at all? Is this just to satisfy the new check you added?
>
> If you want to make this change, I would suggest writing it in a way to
> ensure the call to expand_downwards() in the same function also holds
> the lock. I believe this is technically required as well? What do you
> think?

The call to expand_downwards() takes a VMA pointer as argument, and the
mmap lock is the only thing that normally prevents concurrent freeing
of VMA structs. Taking a lock there would be of limited utility -
either the lock is not necessary because nobody else can access the MM,
or the lock is insufficient because someone could have freed the VMA
pointer before the lock was taken. So I think that taking a lock around
the expand_downwards() call would just be obfuscating things, unless
you specifically want to prevent concurrent *reads* while concurrent
*writes* are impossible.

Since I haven't sent a new version of my old series for almost a year,
I think it'd be fine to take Luigi's patch for now, and undo it at a
later point when/if we want to actually use proper locking here because
we're worried about concurrent access to the MM.
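A hypothetical caller makes the lifetime argument above concrete: the mmap
lock has to be held across both the lookup that produces the vma pointer and
every use of it, because the lock is what keeps the vma alive, so pushing the
lock down into expand_downwards() itself would come too late. Sketch
(touch_stack_guard() is an invented illustration, not kernel code; in this
era stack expansion could legitimately run under the read lock, serializing
internally on the anon_vma lock):

static int touch_stack_guard(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	int ret;

	mmap_read_lock(mm);		/* taken before the vma lookup ... */
	vma = find_vma(mm, addr);
	ret = vma ? expand_downwards(vma, addr) : -EFAULT;
	mmap_read_unlock(mm);		/* ... dropped only after the last use */

	return ret;
}

Locking only around the expand_downwards() call, with the find_vma() outside,
would leave exactly the use-after-free window Jann describes.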
On Wed, Aug 04, 2021 at 04:42:23PM +0200, Jann Horn wrote:
> Since I haven't sent a new version of my old series for almost a year,
> I think it'd be fine to take Luigi's patch for now, and undo it at a
> later point when/if we want to actually use proper locking here
> because we're worried about concurrent access to the MM.

IIRC one of the major points of that work was not "proper locking" but
to have enough locking to be compatible with lockdep, so we could add
assertions like in get_user_pages and find_vma.

Jason
* Jann Horn <jannh@google.com> [210804 10:42]:
> On Wed, Aug 4, 2021 at 1:07 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> > [...]
> >
> > If you want to make this change, I would suggest writing it in a way to
> > ensure the call to expand_downwards() in the same function also holds
> > the lock. I believe this is technically required as well? What do you
> > think?
>
> The call to expand_downwards() takes a VMA pointer as argument, and the
> mmap lock is the only thing that normally prevents concurrent freeing
> of VMA structs. Taking a lock there would be of limited utility -
> either the lock is not necessary because nobody else can access the MM,
> or the lock is insufficient because someone could have freed the VMA
> pointer before the lock was taken. So I think that taking a lock around
> the expand_downwards() call would just be obfuscating things, unless
> you specifically want to prevent concurrent *reads* while concurrent
> *writes* are impossible.

Good point on the VMA being passed in, that certainly points to your
previous patch being a better approach. That resolves my questions
around the patch.

> Since I haven't sent a new version of my old series for almost a year,
> I think it'd be fine to take Luigi's patch for now, and undo it at a
> later point when/if we want to actually use proper locking here
> because we're worried about concurrent access to the MM.
On Wed, Aug 4, 2021 at 5:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, Aug 04, 2021 at 04:42:23PM +0200, Jann Horn wrote:
> > Since I haven't sent a new version of my old series for almost a year,
> > I think it'd be fine to take Luigi's patch for now, and undo it at a
> > later point when/if we want to actually use proper locking here
> > because we're worried about concurrent access to the MM.
>
> IIRC one of the major points of that work was not "proper locking" but
> to have enough locking to be compatible with lockdep, so we could add
> assertions like in get_user_pages and find_vma.

That's part of it; but it's also for making the code more clearly
correct and future-proofing it.

Looking at it now, I think process_madvise() might actually already be
able to race with execve() to some degree; and if you made a change
like this to the current kernel:

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d3d348b17f4..3648c198673c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1043,12 +1043,14 @@ madvise_behavior_valid(int behavior)
 static bool
 process_madvise_behavior_valid(int behavior)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_DOFORK:
+	case MADV_DONTFORK:
 		return true;
 	default:
 		return false;
 	}
 }

it would probably introduce a memory corruption bug, because then
someone might be able to destroy the stack VMA between setup_new_exec()
and setup_arg_pages() by using process_madvise() to trigger VMA
splitting/merging in the right pattern.
diff --git a/fs/exec.c b/fs/exec.c
index 38f63451b928..ac7603e985b4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -217,8 +217,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 	 * We are doing an exec(). 'current' is the process
 	 * doing the exec and bprm->mm is the new process's mm.
 	 */
+	mmap_read_lock(bprm->mm);
 	ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
 			&page, NULL, NULL);
+	mmap_read_unlock(bprm->mm);
 	if (ret <= 0)
 		return NULL;

diff --git a/mm/mmap.c b/mm/mmap.c
index ca54d36d203a..79f4f8ae43ec 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -534,6 +534,7 @@ static int find_vma_links(struct mm_struct *mm, unsigned long addr,
 {
 	struct rb_node **__rb_link, *__rb_parent, *rb_prev;

+	mmap_assert_locked(mm);
 	__rb_link = &mm->mm_rb.rb_node;
 	rb_prev = __rb_parent = NULL;

@@ -2303,6 +2304,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 	struct rb_node *rb_node;
 	struct vm_area_struct *vma;

+	mmap_assert_locked(mm);
 	/* Check the cache first. */
 	vma = vmacache_find(mm, addr);
 	if (likely(vma))
find_vma() and variants need protection when used.
This patch adds mmap_assert_locked() calls in the functions.

To make sure the invariant is satisfied, we also need to add a
mmap_read_lock() around the get_user_pages_remote() call in
get_arg_page(). The lock is not strictly necessary because the mm
has been newly created, but the extra cost is limited because
the same lock was also acquired shortly before in __bprm_mm_init(),
so it is hot and uncontended.

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 fs/exec.c | 2 ++
 mm/mmap.c | 2 ++
 2 files changed, 4 insertions(+)