[v3] fs: don't scan the inode cache before SB_BORN is set

From: Dave Chinner <dchinner@redhat.com>

We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.

We found a mount in a failed state, blocked on the shrinker rwsem
here:

mount_bdev()
  deactivate_locked_super()
    unregister_shrinker()

Essentially, the machine was under memory pressure when the mount
was being run, and xfs_fs_fill_super() failed after allocating the
xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
freed the xfs_mount, but the sb->s_fs_info field still pointed to
the freed memory. Hence when the superblock shrinker then ran
it fell off the bad pointer.

However, we also saw another manifestation of the same problem - the
shrinker can fall off a bad pointer if it runs before the superblock
is fully set up - a use before initialisation problem. This
typically crashed somewhere in the radix tree manipulations in
this path:

  radix_tree_gang_lookup_tag+0xc4/0x130
  xfs_perag_get_tag+0x37/0xf0
  xfs_reclaim_inodes_count+0x32/0x40
  xfs_fs_nr_cached_objects+0x11/0x20
  super_cache_count+0x35/0xc0
  shrink_slab.part.66+0xb1/0x370
  shrink_node+0x7e/0x1a0
  try_to_free_pages+0x199/0x470
  __alloc_pages_slowpath+0x3a1/0xd20
  __alloc_pages_nodemask+0x1c3/0x200
  cache_grow_begin+0x20b/0x2e0
  fallback_alloc+0x160/0x200
  kmem_cache_alloc+0x111/0x4e0

The underlying problem is that the superblock shrinker is running
before the filesystem structures it depends on have been fully set
up, i.e. the shrinker is registered in sget(), before
->fill_super() has been called, and the shrinker can call into the
filesystem before fill_super() does its setup work.

Setting sb->s_fs_info to NULL on xfs_mount setup failure only solves
the use-after-free part of the problem - it doesn't solve the
use-before-initialisation part. To solve that we need to check the
SB_BORN flag in super_cache_count().
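
As a rough illustration of the use-after-free half only, here is a
userspace sketch. The struct and function names (sb_model,
mount_model, fill_super_model, nr_cached_objects_model) are
hypothetical stand-ins, not the kernel or XFS code, and the NULL
check in the count callback is an assumption about how a filesystem
might guard itself:

```c
#include <stdlib.h>

/* Hypothetical stand-ins for struct xfs_mount and struct super_block. */
struct mount_model { long reclaimable; };
struct sb_model { struct mount_model *s_fs_info; };

/* Models a fill_super that attaches private data and may fail afterwards. */
static int fill_super_model(struct sb_model *sb, int fail_late)
{
    struct mount_model *mp = calloc(1, sizeof(*mp));

    if (!mp)
        return -1;
    mp->reclaimable = 7;
    sb->s_fs_info = mp;

    if (fail_late) {
        free(mp);
        sb->s_fs_info = NULL;   /* do not leave a dangling pointer behind */
        return -1;
    }
    return 0;
}

/* Models the shrinker count callback: no mount data means nothing to count. */
static long nr_cached_objects_model(const struct sb_model *sb)
{
    if (!sb->s_fs_info)
        return 0;
    return sb->s_fs_info->reclaimable;
}
```

Note that this only helps after a failed setup has cleared the
pointer. It does nothing for a callback that arrives while setup is
still in progress.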

The SB_BORN flag is not set until ->fs_mount() completes
successfully, and trylock_super() won't succeed until it is set.
Hence super_cache_scan() will not run until SB_BORN is set, so it
makes sense to not allow super_cache_count() to run and enter the
filesystem until it is set, too. This prevents the superblock
shrinker from entering the filesystem while it is being set up and
so avoids the use-before-initialisation issue.
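
The ordering this relies on can be modelled in userspace with C11
atomics. This is only a sketch: MODEL_SB_BORN, super_model and the
two functions are hypothetical names, and release/acquire ordering is
used here as an analogue of a write barrier before setting the flag
paired with a read barrier after checking it. It is not the kernel
code itself:

```c
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_SB_BORN 0x1u   /* hypothetical flag value for this model */

struct fs_info_model { long nr_cached; };

struct super_model {
    _Atomic unsigned int s_flags;
    struct fs_info_model *s_fs_info;
};

/* Mount side: finish all setup first, then publish the "born" flag. */
static void mount_complete_model(struct super_model *sb,
                                 struct fs_info_model *fsi)
{
    sb->s_fs_info = fsi;
    /* Release ordering: the setup above is visible before the flag is. */
    atomic_fetch_or_explicit(&sb->s_flags, MODEL_SB_BORN,
                             memory_order_release);
}

/* Count side: takes no locks, so it must not touch s_fs_info early. */
static long cache_count_model(struct super_model *sb)
{
    unsigned int flags;

    flags = atomic_load_explicit(&sb->s_flags, memory_order_acquire);
    if (!(flags & MODEL_SB_BORN))
        return 0;               /* still being set up: report nothing */

    /* Acquire ordering: safe to read what the mount side set up. */
    return sb->s_fs_info->nr_cached;
}
```

The point of the model is that the barrier protects the data the flag
guards, not the flag itself: a reader that sees the flag set is
guaranteed to also see the completed setup.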

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 3:
- change the memory barriers to protect the superblock data, not
  the SB_BORN flag.

Version 2:
- convert to use SB_BORN, not SB_ACTIVE
- add memory barriers
- rework comment in super_cache_count()

---
 fs/super.c         | 30 ++++++++++++++++++++++++------
 fs/xfs/xfs_super.c | 11 +++++++++++
 2 files changed, 35 insertions(+), 6 deletions(-)
