[RFC,v2] mount: In mark_umount_candidates and __propagate_umount visit each mount once

Andrei Vagin pointed out that the time to execute propagate_umount can
go non-linear (and take a ludicrous amount of time) when the mount
propagation trees of the mounts to be unmounted by a lazy unmount
overlap.

Solve this in the most straightforward way possible, by adding a new
mount flag to mark parts of the mount propagation tree that have been
visited, and use that mark to skip parts of the mount propagation tree
that have already been visited during an unmount.  This guarantees
that each mountpoint in the possibly overlapping mount propagation
trees will be visited exactly once.
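
The idea can be illustrated outside the kernel with a toy model.  The
sketch below is not the code in fs/pnode.c; struct node, visit_once,
visit_naive and clear_marks are hypothetical names, and the graph is
assumed to be acyclic.  It only shows why a per-node mark bounds the
work when walks from several starting points overlap.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NEXT 4

/* Hypothetical stand-in for a mount in a propagation tree. */
struct node {
	struct node *next[MAX_NEXT];	/* nodes this one propagates to */
	size_t nr_next;
	bool visited;			/* plays the role of the new mount flag */
};

/* Naive walk: shared subtrees are walked again from every path. */
static size_t visit_naive(const struct node *n)
{
	size_t count = 1;

	for (size_t i = 0; i < n->nr_next; i++)
		count += visit_naive(n->next[i]);
	return count;
}

/* Marked walk: returns how many nodes were visited for the first time. */
static size_t visit_once(struct node *n)
{
	size_t count = 1;

	if (n->visited)
		return 0;		/* subtree already handled, skip it */
	n->visited = true;
	for (size_t i = 0; i < n->nr_next; i++)
		count += visit_once(n->next[i]);
	return count;
}

/* Second pass: clear the marks so that a later walk starts clean. */
static void clear_marks(struct node *n)
{
	if (!n->visited)
		return;
	n->visited = false;
	for (size_t i = 0; i < n->nr_next; i++)
		clear_marks(n->next[i]);
}
```

With the marks in place the total work over all starting points is
bounded by the number of nodes, at the cost of a second pass to clear
the marks again.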

Add the functions propagation_visit_next and propagation_revisit_next
to coordinate setting and clearing the visited mount mark.

The skipping of already unmounted mounts has been moved from
__lookup_mnt_last to mark_umount_candidates, so that the new
propagation functions can notice when the propagation tree passes
through the initial set of unmounted mounts.  Except in umount_tree,
as part of the unmounting process, the only place where unmounted
mounts should be found is in unmounted subtrees.  All of the other
callers of __lookup_mnt_last operate on mounted subtrees, so no longer
checking for unmounted mounts should not affect them.

Here is a script to generate such a mount tree:
$ cat run.sh
mount -t tmpfs test-mount /mnt
mount --make-shared /mnt
for i in `seq $1`; do
        mkdir /mnt/test.$i
        mount --bind /mnt /mnt/test.$i
done
cat /proc/mounts | grep test-mount | wc -l
time umount -l /mnt
$ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done

Here are the performance numbers with and without the patch:

mhash  |  8192   |  8192  |  8192       | 131072 | 131072      | 104857 | 104857
mounts | before  | after  | after (sys) | after  | after (sys) |  after | after (sys)
-------------------------------------------------------------------------------------
  1024 |  0.071s | 0.023s | 0.008s      | 0.026s | 0.000s      | 0.024s | 0.008s
  2048 |  0.184s | 0.030s | 0.012s      | 0.035s | 0.008s      | 0.030s | 0.012s
  4096 |  0.604s | 0.047s | 0.012s      | 0.042s | 0.016s      | 0.032s | 0.016s
  8192 |  4.471s | 0.085s | 0.020s      | 0.059s | 0.059s      | 0.050s | 0.036s
 16384 | 34.826s | 0.105s | 0.092s      | 0.109s | 0.060s      | 0.087s | 0.068s
 32768 |         | 0.245s | 0.168s      | 0.192s | 0.144s      | 0.167s | 0.156s
 65536 |         | 0.833s | 0.716s      | 0.485s | 0.276s      | 0.468s | 0.316s
131072 |         | 4.628s | 4.108s      | 0.758s | 0.632s      | 0.736s | 0.612s

Andrei Vagin reports that fixing this performance problem is part of
the work to fix CVE-2016-6213.

Cc: stable@vger.kernel.org
Reported-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

I think this version is very close.  I had to modify __lookup_mnt_last
to not skip MNT_UMOUNT mounts or we would never see when the mount
propagation trees intersected.

The numbers don't look as good as those of the previous buggy version,
but they look good.  When the hash table isn't getting full the times
look pretty linear.  So it may be necessary to do some hash table
resizing.

That said there remains one issue I need to think about some more.

In mark_umount_candidates I don't mark a mount that is locked to its
parent when that parent is not marked as a umount candidate.  Because
each mount is now processed only once, a mount whose parent gets marked
as a umount candidate only after the mount was first visited might
never get marked itself.
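
The ordering concern can be shown with a small toy model.  This is not
the kernel's mark_umount_candidates; struct toy_mount and toy_mark are
hypothetical names, and the locking rule is reduced to a single
boolean.  It only demonstrates that, with a visit-once mark, the result
can depend on whether the child or the parent is seen first.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified mount; not the kernel's struct mount. */
struct toy_mount {
	struct toy_mount *parent;
	bool locked;		/* locked to its parent */
	bool candidate;		/* marked as a umount candidate */
	bool visited;		/* visit-once mark */
};

/*
 * Visit m at most once.  A locked mount is only marked as a candidate
 * if its parent is already a candidate at the time of the visit.
 */
static void toy_mark(struct toy_mount *m)
{
	if (m->visited)
		return;		/* a second encounter is skipped */
	m->visited = true;
	if (m->locked && m->parent && !m->parent->candidate)
		return;		/* parent not (yet) a candidate */
	m->candidate = true;
}
```

If the locked child is visited before its parent, the child stays
unmarked even though the parent is marked afterwards, because the
second visit to the child is skipped.  Visiting the parent first marks
both.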

Anyway, Andrei, if you could check this out and let me know if you see
anything I missed, that would be great.

Eric

 fs/namespace.c        |   6 +--
 fs/pnode.c            | 147 ++++++++++++++++++++++++++++++++++++++++++++------
 fs/pnode.h            |   4 ++
 include/linux/mount.h |   2 +
 4 files changed, 138 insertions(+), 21 deletions(-)
