| Message ID | 201803282126.GBC56799.tOFVFSOHJLFOQM@I-love.SAKURA.ne.jp (mailing list archive) |
| --- | --- |
| State | New, archived |
On Wed 28-03-18 21:26:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > > Basic idea is
> > >
> > >   bool downgraded = false;
> > >   down_write(mmap_sem);
> > >   for (something_1_that_might_depend_mmap_sem_held_for_write;
> > >        something_2_that_might_depend_mmap_sem_held_for_write;
> > >        something_3_that_might_depend_mmap_sem_held_for_write) {
> > >     something_4_that_might_depend_mmap_sem_held_for_write();
> > >     if (fatal_signal_pending(current)) {
> > >       downgrade_write(mmap_sem);
> > >       downgraded = true;
> > >       break;
> > >     }
> > >     something_5_that_might_depend_mmap_sem_held_for_write();
> > >   }
> > >   if (!downgraded)
> > >     up_write(mmap_sem);
> > >   else
> > >     up_read(mmap_sem);
> > >
> > > That is, try to interrupt critical sections at locations where it is
> > > known to be safe and consistent.
> >
> > Please explain why those places are safe to interrupt.
>
> Because (regarding the downgrade_write() approach), as far as I know,
> the current thread does not access memory which needs to be protected
> with mmap_sem held for write.

But there are other threads which can be in an arbitrary code path
before they get to the signal handling code.

> > So please explain why it is safe. It is really not straightforward.
>
> So, please explain why it is not safe. How can releasing mmap_sem held
> for write at safely and consistently interruptible locations be unsafe?

Look, you are proposing a patch and the onus to justify its correctness
is on you. You are playing with code which you have only a vague
understanding of, so you should study and understand it more. You seem
to be inferring invariants from the current function scope, and that is
quite dangerous.

[...]
> > > What we need to be careful about is making changes to current->mm.
> > > I'm assuming that current->mm->mmap_sem held for read is enough for
> > > i_mmap_lock_write()/flush_dcache_mmap_lock()/vma_interval_tree_insert_after()/
> > > flush_dcache_mmap_unlock()/i_mmap_unlock_write()/is_vm_hugetlb_page()/
> > > reset_vma_resv_huge_pages()/__vma_link_rb(). But I'm not sure.
> >
> > But as soon as you downgrade the lock then all other threads can
> > interfere and perform page faults or update respective mappings.
> > Does this matter? If not then why?
>
> Why does this matter?
> I don't know what "update respective mappings" means.

E.g. any get_user_pages or an ongoing #PF can update a mapping, and so
the child process might see some inconsistencies (i.e. parts of the
address space with the previous content).

> Is that about mmap()/munmap() which need mmap_sem held for write?

No, it is about those who take it for read.

> Since mmap_sem is still held for read, operations which need
> mmap_sem held for write cannot happen.
>
> Anyway, as long as I downgrade the mmap_sem at safely and consistently
> interruptible locations, there cannot be a problem.

I have a strong feeling that the whole address space has to be copied
atomically from the address space's point of view. I might be wrong, but
it is not really obvious to me why, and if you want to downgrade the
lock there then please make sure to document what your assumptions are
and why they are valid.
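For concreteness, here is a minimal sketch (not a patch from this thread) of the downgrade_write() pattern quoted above, written against a 2018-era kernel where mm_struct still has mmap_sem and VMAs are linked via vm_next. The helper do_per_vma_work() is hypothetical; the locking calls are the real rwsem API.

```c
/*
 * Illustrative sketch only: bail out of a write-locked VMA walk by
 * downgrading mmap_sem when the caller has a fatal signal pending.
 */
#include <linux/mm.h>
#include <linux/rwsem.h>
#include <linux/sched/signal.h>

/* Hypothetical per-VMA work that relies on mmap_sem held for write. */
static void do_per_vma_work(struct vm_area_struct *vma)
{
	/* ... modify the VMA ... */
}

static int walk_vmas_killable(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	bool downgraded = false;
	int ret = 0;

	down_write(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		do_per_vma_work(vma);
		if (fatal_signal_pending(current)) {
			/*
			 * Caller is being killed: stop early.  Writers stay
			 * excluded until up_read(), but readers (e.g. page
			 * faults) may now run concurrently.
			 */
			downgrade_write(&mm->mmap_sem);
			downgraded = true;
			ret = -EINTR;
			break;
		}
	}
	if (downgraded)
		up_read(&mm->mmap_sem);
	else
		up_write(&mm->mmap_sem);
	return ret;
}
```

The point of downgrading rather than dropping the lock entirely is that writers (mmap()/munmap()) remain excluded until up_read(), while readers such as page-fault handlers can proceed; the dispute in the thread is whether letting those readers run mid-copy leaves the child with a consistent view of the address space.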
```diff
diff --git a/kernel/fork.c b/kernel/fork.c
index 1e8c9a7..851c675 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -514,6 +514,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
 
+		if (!retval && fatal_signal_pending(current))
+			retval = -EINTR;
+
 		if (retval)
 			goto out;
 	}
```

This is safe because there is no difference (except for the error code)
between the above change and hitting the "goto fail_nomem;" path after
"mpnt = mpnt->vm_next;". Therefore, I think that interrupting at

```diff
diff --git a/kernel/fork.c b/kernel/fork.c
index 851c675..2706acc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -508,7 +508,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (fatal_signal_pending(current))
+			retval = -EINTR;
+		else if (!(tmp->vm_flags & VM_WIPEONFORK))
 			retval = copy_page_range(mm, oldmm, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
```

is also safe.
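For reference, this is roughly how the affected portion of dup_mmap()'s per-VMA loop would read with the two hunks above applied (the second applies on top of the first, as the blob indexes show). It is reconstructed only from the context lines quoted in the diffs; the rest of the loop is elided.

```c
		/*
		 * Excerpt-style sketch of dup_mmap()'s loop body with both
		 * hunks applied; surrounding code is omitted.
		 */
		rb_parent = &tmp->vm_rb;

		mm->map_count++;
		if (fatal_signal_pending(current))
			retval = -EINTR;	/* second hunk: skip copying page tables */
		else if (!(tmp->vm_flags & VM_WIPEONFORK))
			retval = copy_page_range(mm, oldmm, mpnt);

		if (tmp->vm_ops && tmp->vm_ops->open)
			tmp->vm_ops->open(tmp);

		if (!retval && fatal_signal_pending(current))
			retval = -EINTR;	/* first hunk: recheck at the end of each iteration */

		if (retval)
			goto out;
	}
```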