Message ID | 20150807195552.GB28529@redhat.com (mailing list archive)
---|---
State | New, archived
Hello,

On Fri 07-08-15 21:55:52, Oleg Nesterov wrote:
> I'll try to re-check/re-test, but so far I think that this and the
> previous series are correct. However 3/4 from the previous series
> (attached at the end just in case) uncovers (I think) some problems
> in xfs locking.
>
> What should I do now?
>
> On 07/22, Oleg Nesterov wrote:
> >
> > Testing. Well, so far I only verified that ioctl(FIFREEZE) +
> > ioctl(FITHAW) seems to work "as expected" on my testing machine
> > with ext3. So probably this needs more testing.
>
> Finally I got the testing machine and ran xfstests, I did
>
>	dd if=/dev/zero of=TEST.img bs=1MiB count=4000
>	dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000
>
>	losetup --find --show TEST.img
>	losetup --find --show SCRATCH.img
>
>	mkfs.xfs -f /dev/loop0
>	mkfs.xfs -f /dev/loop1
>
>	mkdir -p TEST SCRATCH
>
>	mount /dev/loop0 TEST
>	mount /dev/loop1 SCRATCH
>
>	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
>	./check `grep -il freeze tests/*/???`
>
> several times and every time the result looks like below:
>
>	FSTYP         -- xfs (non-debug)
>	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.1.0+
>	MKFS_OPTIONS  -- -f -bsize=4096 /dev/loop1
>	MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH
>
>	generic/068 59s ... 61s
>	generic/085 23s ... 26s
>	generic/280 2s ... 3s
>	generic/311 169s ... 167s
>	xfs/011 21s ... 20s
>	xfs/119 32s ... 21s
>	xfs/297 455s ... 301s
>	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
>	Passed all 7 tests

Hum, I had a look at the lockdep report below and it's one of the
peculiarities of the freeze protection. For the record, let me repeat the
full argument for why I don't think there's a possibility of a real
deadlock. Feel free to skip to the end if you get bored.

One would like to construct the lock chain as:

CPU0 (chown foo dir)         CPU1 (readdir dir)        CPU2 (page fault)
process Y                    process X, thread 0       process X, thread 1

                             get ILOCK for dir
gets freeze protection
starts transaction in
  xfs_setattr_nonsize
waits to get ILOCK on 'dir'
                                                       get mmap_sem for X
                             wait for mmap_sem for
                             process X in filldir()
                                                       wait for freeze protection in
                                                       xfs_page_mkwrite

and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
finish its freeze-protected section. But this cannot happen. The reason is
that we block writers level by level and thus while there are writers at
level X, we do not block writers at level X+1. So in this particular case
freeze_super() will block waiting for CPU0 to finish its freeze-protected
section while CPU2 is free to continue.

In general we have a chain like

freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
    ^                                                      |
    \------------------------------------------------------/

But since ILOCK is always acquired with freeze protection at L0 and we can
block at L1 only after there are no writers at L0, this loop can never
happen.

Note that if we use the property of freezing that a lock at level X+1 cannot
block when we hold a lock at level X, we can as well simplify the dependency
graph and track in it only the lowest level of freeze lock that is
currently acquired (since the levels above it cannot block and do not in
any way influence blocking of other processes either and thus are
irrelevant for the purpose of deadlock detection). Then the dependency
graph we'd get would be:

freeze L0 -> ILOCK -> mmap_sem -> freeze L1

and we have a nice acyclic graph we like to see...
So probably we have to
hack the lockdep instrumentation some more and just don't tell lockdep
about freeze locks at higher levels if we already hold a lock at a lower
level. Thoughts?

								Honza

> But with CONFIG_LOCKDEP=y generic/068 triggers the warning:
>
> [ 66.092831] ======================================================
> [ 66.099726] [ INFO: possible circular locking dependency detected ]
> [ 66.106719] 4.1.0+ #2 Not tainted
> [ 66.110414] -------------------------------------------------------
> [ 66.117405] fsstress/2210 is trying to acquire lock:
> [ 66.122944]  (&mm->mmap_sem){++++++}, at: [<ffffffff81200562>] might_fault+0x42/0xa0
> [ 66.131637] 
> but task is already holding lock:
> [ 66.138146]  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
> [ 66.148022] 
> which lock already depends on the new lock.
>
> [ 66.157141] 
> the existing dependency chain (in reverse order) is:
> [ 66.165490] 
> -> #4 (&xfs_dir_ilock_class){++++..}:
> [ 66.170974]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> [ 66.177596]  [<ffffffff810f4bbc>] down_write_nested+0x3c/0x70
> [ 66.184605]  [<ffffffffa057dab6>] xfs_ilock+0x126/0x170 [xfs]
> [ 66.191645]  [<ffffffffa057c4da>] xfs_setattr_nonsize+0x3ba/0x5d0 [xfs]
> [ 66.199638]  [<ffffffffa057cb5a>] xfs_vn_setattr+0x9a/0xd0 [xfs]
> [ 66.206950]  [<ffffffff81279411>] notify_change+0x271/0x3a0
> [ 66.213762]  [<ffffffff8125391b>] chown_common.isra.11+0x15b/0x200
> [ 66.221251]  [<ffffffff812551a6>] SyS_lchown+0xa6/0xf0
> [ 66.227576]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> [ 66.234878] 
> -> #3 (sb_internal#2){++++++}:
> [ 66.239698]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> [ 66.246316]  [<ffffffff81824256>] down_write+0x36/0x70
> [ 66.252644]  [<ffffffff81100979>] percpu_down_write+0x39/0x110
> [ 66.259760]  [<ffffffff8125901d>] freeze_super+0xbd/0x190
> [ 66.266369]  [<ffffffff8126dbc8>] do_vfs_ioctl+0x3f8/0x520
> [ 66.273082]  [<ffffffff8126dd71>] SyS_ioctl+0x81/0xa0
> [ 66.279311]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> [ 66.286610] 
> -> #2 (sb_pagefaults#2){++++++}:
> [ 66.291623]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> [ 66.298237]  [<ffffffff811007c1>] percpu_down_read+0x51/0xa0
> [ 66.305144]  [<ffffffff8125979b>] __sb_start_write+0xdb/0x120
> [ 66.312140]  [<ffffffff81295eda>] block_page_mkwrite+0x3a/0xb0
> [ 66.319245]  [<ffffffffa057046e>] xfs_filemap_page_mkwrite+0x5e/0xb0 [xfs]
> [ 66.327533]  [<ffffffff812009a8>] do_page_mkwrite+0x58/0x100
> [ 66.334433]  [<ffffffff81205497>] handle_mm_fault+0x537/0x17c0
> [ 66.341533]  [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
> [ 66.348542]  [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
> [ 66.355172]  [<ffffffff818286e8>] page_fault+0x28/0x30
> [ 66.361493] 
> -> #1 (&(&ip->i_mmaplock)->mr_lock){++++++}:
> [ 66.367659]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> [ 66.374275]  [<ffffffff810f4aef>] down_read_nested+0x3f/0x60
> [ 66.381185]  [<ffffffffa057daf8>] xfs_ilock+0x168/0x170 [xfs]
> [ 66.388212]  [<ffffffffa057050c>] xfs_filemap_fault+0x4c/0xb0 [xfs]
> [ 66.395817]  [<ffffffff81200a9e>] __do_fault+0x4e/0x100
> [ 66.402239]  [<ffffffff81205454>] handle_mm_fault+0x4f4/0x17c0
> [ 66.409340]  [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
> [ 66.416344]  [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
> [ 66.422959]  [<ffffffff818286e8>] page_fault+0x28/0x30
> [ 66.429283] 
> -> #0 (&mm->mmap_sem){++++++}:
> [ 66.434093]  [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
> [ 66.441194]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> [ 66.447808]  [<ffffffff8120058f>] might_fault+0x6f/0xa0
> [ 66.454235]  [<ffffffff8126e05a>] filldir+0x9a/0x130
> [ 66.460373]  [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
> [ 66.469233]  [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
> [ 66.476449]  [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
> [ 66.483954]  [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
> [ 66.490473]  [<ffffffff8126e361>] SyS_getdents+0x91/0x120
> [ 66.497091]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> [ 66.504381] 
> other info that might help us debug this:
>
> [ 66.513316] Chain exists of:
>   &mm->mmap_sem --> sb_internal#2 --> &xfs_dir_ilock_class
>
> [ 66.522619]  Possible unsafe locking scenario:
>
> [ 66.529222]        CPU0                    CPU1
> [ 66.534275]        ----                    ----
> [ 66.539327]   lock(&xfs_dir_ilock_class);
> [ 66.543823]                                lock(sb_internal#2);
> [ 66.550465]                                lock(&xfs_dir_ilock_class);
> [ 66.557767]   lock(&mm->mmap_sem);
> [ 66.561580] 
>  *** DEADLOCK ***
>
> [ 66.568186] 2 locks held by fsstress/2210:
> [ 66.572753]  #0:  (&type->i_mutex_dir_key#5){+.+.+.}, at: [<ffffffff8126ddf1>] iterate_dir+0x61/0x140
> [ 66.583103]  #1:  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
> [ 66.593457] 
> stack backtrace:
> [ 66.598321] CPU: 26 PID: 2210 Comm: fsstress Not tainted 4.1.0+ #2
> [ 66.605215] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
> [ 66.616372]  0000000000000000 0000000048bb9c77 ffff8817fa127ad8 ffffffff8181d9dd
> [ 66.624663]  0000000000000000 ffffffff8288f500 ffff8817fa127b28 ffffffff810f7b26
> [ 66.632955]  0000000000000002 ffff8817fa127b98 ffff8817fa127b28 ffff881803f06480
> [ 66.641249] Call Trace:
> [ 66.643983]  [<ffffffff8181d9dd>] dump_stack+0x45/0x57
> [ 66.649713]  [<ffffffff810f7b26>] print_circular_bug+0x206/0x280
> [ 66.656413]  [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
> [ 66.662919]  [<ffffffff810fa1b7>] ? __lock_acquire+0xa27/0x2100
> [ 66.669523]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> [ 66.675543]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
> [ 66.681566]  [<ffffffff8120058f>] might_fault+0x6f/0xa0
> [ 66.687394]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
> [ 66.693418]  [<ffffffff8126e05a>] filldir+0x9a/0x130
> [ 66.698968]  [<ffffffffa057db34>] ? xfs_ilock_data_map_shared+0x34/0x40 [xfs]
> [ 66.706941]  [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
> [ 66.715203]  [<ffffffffa057da52>] ? xfs_ilock+0xc2/0x170 [xfs]
> [ 66.721712]  [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
> [ 66.728316]  [<ffffffff81822b6d>] ? mutex_lock_killable_nested+0x3ad/0x480
> [ 66.735998]  [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
> [ 66.742894]  [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
> [ 66.748819]  [<ffffffff8127a1ac>] ? __fget_light+0x6c/0xa0
> [ 66.754938]  [<ffffffff8126e361>] SyS_getdents+0x91/0x120
> [ 66.760958]  [<ffffffff8126dfc0>] ? fillonedir+0xf0/0xf0
> [ 66.766884]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
>
> The chain reported by lockdep is not exactly the same every time,
> but similar.
>
> Once again, I'll recheck. But the patch below still looks correct
> to me, and I think that it is actually a fix.
>
> Before this patch freeze_super() calls sync_filesystem() and
> s_op->freeze_fs(sb) without ->s_writers locks (as seen by
> lockdep) and this is wrong.
>
> After this patch lockdep knows about the ->s_writers locks held and this
> is what we want, but apparently this way lockdep can notice the new
> dependencies.
>
> Oleg.
>
> ------------------------------------------------------------------------------
> Subject: [PATCH 3/4] move rwsem_release() from sb_wait_write() to freeze_super()
>
> Move the "fool the lockdep" code from sb_wait_write() into the new
> simple helper, sb_freeze_release(), called once before return from
> freeze_super().
>
> This is preparation, but imo this makes sense in any case. This way
> we do not hide from lockdep the "write" locks we hold when we call
> s_op->freeze_fs(sb).
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  fs/super.c |   16 ++++++++++------
>  1 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index d0fdd49..e7ea9f1 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1236,16 +1236,10 @@ static void sb_wait_write(struct super_block *sb, int level)
>  {
>  	s64 writers;
>  
> -	/*
> -	 * We just cycle-through lockdep here so that it does not complain
> -	 * about returning with lock to userspace
> -	 */
>  	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
> -	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
>  
>  	do {
>  		DEFINE_WAIT(wait);
> -
>  		/*
>  		 * We use a barrier in prepare_to_wait() to separate setting
>  		 * of frozen and checking of the counter
> @@ -1261,6 +1255,14 @@ static void sb_wait_write(struct super_block *sb, int level)
>  	} while (writers);
>  }
>  
> +static void sb_freeze_release(struct super_block *sb)
> +{
> +	int level;
> +	/* Avoid the warning from lockdep_sys_exit() */
> +	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> +		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
> +}
> +
>  /**
>   * freeze_super - lock the filesystem and force it into a consistent state
>   * @sb: the super to lock
> @@ -1349,6 +1351,7 @@ int freeze_super(struct super_block *sb)
>  		sb->s_writers.frozen = SB_UNFROZEN;
>  		smp_wmb();
>  		wake_up(&sb->s_writers.wait_unfrozen);
> +		sb_freeze_release(sb);
>  		deactivate_locked_super(sb);
>  		return ret;
>  	}
> @@ -1358,6 +1361,7 @@ int freeze_super(struct super_block *sb)
>  	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
>  	 */
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> +	sb_freeze_release(sb);
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> --
> 1.5.5.1
>
>
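Jan's suggestion above -- not reporting a higher-level freeze lock to lockdep
when the caller already holds freeze protection at a lower level -- could take
roughly the following shape. This is only a sketch of the idea, not a tested
patch: the helper name is invented here, it reuses the existing lockdep
annotations (lock_is_held(), rwsem_acquire_read()) and the s_writers.lock_map[]
array that fs/super.c already has, and the release path would need a matching
check so that acquires and releases stay balanced.

static void sb_freeze_acquire_lockdep(struct super_block *sb, int level,
				      unsigned long ip)
{
#ifdef CONFIG_LOCKDEP
	int i;

	/*
	 * freeze_super() drains writers level by level, so while we hold
	 * freeze protection at some level the levels above it can never
	 * block.  Keep those higher levels out of lockdep's dependency
	 * graph to avoid the false positive discussed above.
	 */
	for (i = 0; i < level - 1; i++)
		if (lock_is_held(&sb->s_writers.lock_map[i]))
			return;

	rwsem_acquire_read(&sb->s_writers.lock_map[level - 1], 0, 0, ip);
#endif
}

Callers such as __sb_start_write() would then use a helper like this instead
of annotating the freeze lock unconditionally.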
On Mon, Aug 10, 2015 at 04:59:42PM +0200, Jan Kara wrote:
> On Fri 07-08-15 21:55:52, Oleg Nesterov wrote:
> > I'll try to re-check/re-test, but so far I think that this and the
> > previous series are correct. However 3/4 from the previous series
> > (attached at the end just in case) uncovers (I think) some problems
> > in xfs locking.
....
>
> Hum, I had a look at the lockdep report below and it's one of the
> peculiarities of the freeze protection. For the record, let me repeat the
> full argument for why I don't think there's a possibility of a real
> deadlock. Feel free to skip to the end if you get bored.
>
> One would like to construct the lock chain as:
>
> CPU0 (chown foo dir)         CPU1 (readdir dir)        CPU2 (page fault)
> process Y                    process X, thread 0       process X, thread 1
>
>                              get ILOCK for dir
> gets freeze protection
> starts transaction in
>   xfs_setattr_nonsize
> waits to get ILOCK on 'dir'
>                                                        get mmap_sem for X
>                              wait for mmap_sem for
>                              process X in filldir()
>                                                        wait for freeze protection in
>                                                        xfs_page_mkwrite
>
> and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
> finish its freeze-protected section. But this cannot happen. The reason is
> that we block writers level by level and thus while there are writers at
> level X, we do not block writers at level X+1. So in this particular case
> freeze_super() will block waiting for CPU0 to finish its freeze-protected
> section while CPU2 is free to continue.
>
> In general we have a chain like
>
> freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
>     ^                                                      |
>     \------------------------------------------------------/
>
> But since ILOCK is always acquired with freeze protection at L0 and we can
> block at L1 only after there are no writers at L0, this loop can never
> happen.
>
> Note that if we use the property of freezing that a lock at level X+1 cannot
> block when we hold a lock at level X, we can as well simplify the dependency
> graph and track in it only the lowest level of freeze lock that is
> currently acquired (since the levels above it cannot block and do not in
> any way influence blocking of other processes either and thus are
> irrelevant for the purpose of deadlock detection). Then the dependency
> graph we'd get would be:
>
> freeze L0 -> ILOCK -> mmap_sem -> freeze L1
>
> and we have a nice acyclic graph we like to see... So probably we have to
> hack the lockdep instrumentation some more and just don't tell lockdep
> about freeze locks at higher levels if we already hold a lock at a lower
> level. Thoughts?

The XFS directory ilock->filldir->might_fault locking path has been
generating false positives in quite a lot of places because of things
we do on one side of the mmap_sem in filesystem paths vs things we do
on the other side of the mmap_sem in the page fault path.

I'm getting sick of these stupid lockdep false positives. I think I
need to rework the XFS readdir locking patch I wrote a while back:

http://oss.sgi.com/archives/xfs/2014-03/msg00146.html

At the time it wasn't clear what the best way forward was. Right now
I think the method I originally used (IOLOCK for directory data and
hence readdir, ILOCK for everything else) will be sufficient; if we
need to do anything smarter that can be addressed further down the
road.

Cheers,

Dave.
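For illustration only, the readdir locking split Dave describes (IOLOCK for
directory data and hence readdir, ILOCK for everything else) would roughly
mean serialising the getdents path with the inode IOLOCK instead of relying
on the ILOCK taken inside xfs_readdir(). The sketch below is a guess at the
shape of such a change, not the patch from the posting referenced above; the
buffer sizing is a placeholder and the xfs_ilock_data_map_shared() use inside
the readdir code would have to be adjusted to match.

STATIC int
xfs_file_readdir(
	struct file		*file,
	struct dir_context	*ctx)
{
	struct xfs_inode	*ip = XFS_I(file_inode(file));
	size_t			bufsize = 32768;	/* placeholder; real code sizes this from the inode */
	int			error;

	/*
	 * Serialise readdir against directory modifications with the
	 * IOLOCK rather than the ILOCK, so the copy to user space in
	 * filldir() (which may fault and take mmap_sem) never nests
	 * inside the ILOCK.
	 */
	xfs_ilock(ip, XFS_IOLOCK_SHARED);
	error = xfs_readdir(ip, ctx, bufsize);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);

	return error;
}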
diff --git a/fs/super.c b/fs/super.c
index d0fdd49..e7ea9f1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1236,16 +1236,10 @@ static void sb_wait_write(struct super_block *sb, int level)
 {
 	s64 writers;
 
-	/*
-	 * We just cycle-through lockdep here so that it does not complain
-	 * about returning with lock to userspace
-	 */
 	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
-	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
 
 	do {
 		DEFINE_WAIT(wait);
-
 		/*
 		 * We use a barrier in prepare_to_wait() to separate setting
 		 * of frozen and checking of the counter
@@ -1261,6 +1255,14 @@ static void sb_wait_write(struct super_block *sb, int level)
 	} while (writers);
 }
 
+static void sb_freeze_release(struct super_block *sb)
+{
+	int level;
+	/* Avoid the warning from lockdep_sys_exit() */
+	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
@@ -1349,6 +1351,7 @@ int freeze_super(struct super_block *sb)
 		sb->s_writers.frozen = SB_UNFROZEN;
 		smp_wmb();
 		wake_up(&sb->s_writers.wait_unfrozen);
+		sb_freeze_release(sb);
 		deactivate_locked_super(sb);
 		return ret;
 	}
@@ -1358,6 +1361,7 @@ int freeze_super(struct super_block *sb)
 	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+	sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
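For completeness, the freeze path this patch annotates can be exercised from
user space with the FIFREEZE/FITHAW ioctls Oleg mentions. A minimal exerciser
might look like the program below; the default mount point is an assumption
(pass your own), and it needs root.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	/* hypothetical default mount point; pass the one you want to freeze */
	const char *mnt = argc > 1 ? argv[1] : "/mnt/test";
	int fd = open(mnt, O_RDONLY | O_DIRECTORY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, FIFREEZE, 0) < 0)		/* ends up in freeze_super() */
		perror("FIFREEZE");
	else if (ioctl(fd, FITHAW, 0) < 0)	/* ends up in thaw_super() */
		perror("FITHAW");
	close(fd);
	return 0;
}

Running a writer such as fsstress in parallel while freezing and thawing, as
the xfstests runs above do, is what provokes the lockdep reports under
discussion.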