mbox series

[v8,00/21] Avoid MAP_FIXED gap exposure

Message ID 20240830040101.822209-1-Liam.Howlett@oracle.com (mailing list archive)
Headers show
Series Avoid MAP_FIXED gap exposure | expand

Message

Liam R. Howlett Aug. 30, 2024, 4 a.m. UTC
It is now possible to walk the vma tree using the rcu read locks and is
beneficial to do so to reduce lock contention.  Doing so while a
MAP_FIXED mapping is executing means that a reader may see a gap in the
vma tree that should never logically exist - and does not when using the
mmap lock in read mode.  The temporal gap exists because mmap_region()
calls munmap() prior to installing the new mapping.

This patch set stops rcu readers from seeing the temporal gap by
splitting up the munmap() function into two parts.  The first part
prepares the vma tree for modifications by doing the necessary splits
and tracks the vmas marked for removal in a side tree.  The second part
completes the munmapping of the vmas after the vma tree has been
overwritten (either by a MAP_FIXED replacement vma or by a NULL in the
munmap() case).

Please note that rcu walkers will still be able to see a temporary state
of split vmas that may be in the process of being removed, but the
temporal gap will not be exposed.  vma_start_write() are called on both
parts of the split vma, so this state is detectable.

If existing vmas have a vm_ops->close(), then they will be called prior
to mapping the new vmas (and ptes are cleared out).  Without calling
->close(), hugetlbfs tests fail (hugemmap06 specifically) due to
resources still being marked as 'busy'.  Unfortunately, calling the
corresponding ->open() may not restore the state of the vmas, so it is
safer to keep the existing failure scenario where a gap is inserted and
never replaced.  The failure scenario is in its own patch (0015) for
traceability.

RFC: https://lore.kernel.org/linux-mm/20240531163217.1584450-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20240611180200.711239-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/all/20240625191145.3382793-1-Liam.Howlett@oracle.com/
v3: https://lore.kernel.org/linux-mm/20240704182718.2653918-1-Liam.Howlett@oracle.com/
v4: https://lore.kernel.org/linux-mm/20240710192250.4114783-1-Liam.Howlett@oracle.com/
v5: https://lore.kernel.org/linux-mm/20240717200709.1552558-1-Liam.Howlett@oracle.com/
v6: https://lore.kernel.org/all/20240820235730.2852400-1-Liam.Howlett@oracle.com/
v7: https://lore.kernel.org/all/20240822192543.3359552-1-Liam.Howlett@oracle.com/

Changes since v7:

This is all the patches I've sent for v7 fixups plus the return code for
mseal().  The incorrect return code was introduced in an earlier patch
and then modified (still incorrectly) later, so this version will
hopefully bisect cleanly.

- Fixed return type of vms_gather_munmap_vmas() to -ENOMEM or -EPERM
- Passed through error returned from vms_gather_munmap_vmas() in
  mmap_region() - Thanks Jeff
- Added review tag on last patch - Thanks Lorenzo
- Added #ifdef CONFIG_MMU to vma.h where necessary - Thanks Lorenzo,
  Bert, and Geert
- Fix null pointer dereference in vms_abort_munmap_vmas() - Thanks Dan

Liam R. Howlett (21):
  mm/vma: Correctly position vma_iterator in __split_vma()
  mm/vma: Introduce abort_munmap_vmas()
  mm/vma: Introduce vmi_complete_munmap_vmas()
  mm/vma: Extract the gathering of vmas from do_vmi_align_munmap()
  mm/vma: Introduce vma_munmap_struct for use in munmap operations
  mm/vma: Change munmap to use vma_munmap_struct() for accounting and
    surrounding vmas
  mm/vma: Extract validate_mm() from vma_complete()
  mm/vma: Inline munmap operation in mmap_region()
  mm/vma: Expand mmap_region() munmap call
  mm/vma: Support vma == NULL in init_vma_munmap()
  mm/mmap: Reposition vma iterator in mmap_region()
  mm/vma: Track start and end for munmap in vma_munmap_struct
  mm: Clean up unmap_region() argument list
  mm/mmap: Avoid zeroing vma tree in mmap_region()
  mm: Change failure of MAP_FIXED to restoring the gap on failure
  mm/mmap: Use PHYS_PFN in mmap_region()
  mm/mmap: Use vms accounted pages in mmap_region()
  ipc/shm, mm: Drop do_vma_munmap()
  mm: Move may_expand_vm() check in mmap_region()
  mm/vma: Drop incorrect comment from vms_gather_munmap_vmas()
  mm/vma.h: Optimise vma_munmap_struct

 include/linux/mm.h |   6 +-
 ipc/shm.c          |   8 +-
 mm/mmap.c          | 140 +++++++++---------
 mm/vma.c           | 362 +++++++++++++++++++++++++++------------------
 mm/vma.h           | 168 ++++++++++++++++++---
 5 files changed, 434 insertions(+), 250 deletions(-)

Comments

Jeff Xu Aug. 30, 2024, 4:05 p.m. UTC | #1
Hi Liam

On Thu, Aug 29, 2024 at 9:01 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> Changes since v7:
>
> This is all the patches I've sent for v7 fixups plus the return code for
> mseal().  The incorrect return code was introduced in an earlier patch
> and then modified (still incorrectly) later, so this version will
> hopefully bisect cleanly.
>
> - Fixed return type of vms_gather_munmap_vmas() to -ENOMEM or -EPERM
> - Passed through error returned from vms_gather_munmap_vmas() in
>   mmap_region() - Thanks Jeff

I would think it is cleaner to fix the original commit directly
instead of as part of a large patch series, no ?  If not possible,
maybe mm-unstable should apply my fix first then you can refactor
based on that ?

-Jeff
Liam R. Howlett Aug. 30, 2024, 5:07 p.m. UTC | #2
* Jeff Xu <jeffxu@chromium.org> [240830 12:06]:
> Hi Liam
> 
> On Thu, Aug 29, 2024 at 9:01 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > Changes since v7:
> >
> > This is all the patches I've sent for v7 fixups plus the return code for
> > mseal().  The incorrect return code was introduced in an earlier patch
> > and then modified (still incorrectly) later, so this version will
> > hopefully bisect cleanly.
> >
> > - Fixed return type of vms_gather_munmap_vmas() to -ENOMEM or -EPERM
> > - Passed through error returned from vms_gather_munmap_vmas() in
> >   mmap_region() - Thanks Jeff
> 
> I would think it is cleaner to fix the original commit directly
> instead of as part of a large patch series, no ?  If not possible,
> maybe mm-unstable should apply my fix first then you can refactor
> based on that ?

No, it is not.  These patches are not up stream, so the fixes line will
be stale before it is even available.  No one is affected except the
mm-unstable branch and linux-next.  Patches to commits that are not
upstream can change, and almost always do with fixes like this.

What I have done here is to fix the series so that each patch will keep
passing the mseal() tests.  That way, if there is an issue with mseal(),
it can be git bisected to find the problem without finding a fixed
problem.

If your fix was put into mm-unstable, all linux-next branches would have
an intermittent bug present on bisection, and waiting to respin the
patches means they miss out on testing time.

This is why we squash fixes into patches during development, or ask akpm
to do so by replying to the broken patch.  Andrew prefers not resending
large patch sets because something may be missed, so for smaller changes
they are sent with a request to squash them on his side.

In this case, there would have been 3 or 4 patches that needed to be
updated (two changed, two with merge issues I believe), so I sent a v8
because the alternative would have been confusing.

Thanks,
Liam