diff mbox series

xfs: fix data races in inode->i_*time

Message ID 1582661385-30210-1-git-send-email-cai@lca.pw (mailing list archive)
State Deferred, archived
Headers show
Series xfs: fix data races in inode->i_*time | expand

Commit Message

Qian Cai Feb. 25, 2020, 8:09 p.m. UTC
inode->i_*time could be accessed concurrently. The plain reads in
xfs_vn_getattr() is lockless which result in data races. To avoid bad
compiler optimizations like load tearing, adding pairs of
READ|WRITE_ONCE(). While at it, also take care of xfs_setattr_time()
which presumably could run concurrently with xfs_vn_getattr() as well.
The data races were reported by KCSAN,

 write to 0xffff9275a1920ad8 of 16 bytes by task 47311 on cpu 46:
  xfs_vn_update_time+0x1b0/0x400 [xfs]
  xfs_vn_update_time at fs/xfs/xfs_iops.c:1122
  update_time+0x57/0x80
  file_update_time+0x143/0x1f0
  __xfs_filemap_fault+0x1be/0x3d0 [xfs]
  xfs_filemap_page_mkwrite+0x25/0x40 [xfs]
  do_page_mkwrite+0xf7/0x250
  do_fault+0x679/0x920
  __handle_mm_fault+0xc9f/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 4 locks held by doio/47311:
  #0: ffff9275e7d70808 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9
  #1: ffff9274864394d8 (sb_pagefaults){.+.+}, at: __xfs_filemap_fault+0x19b/0x3d0 [xfs]
  #2: ffff9274864395b8 (sb_internal){.+.+}, at: xfs_trans_alloc+0x2af/0x3c0 [xfs]
  #3: ffff9275a1920920 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0x116/0x2c0 [xfs]
 irq event stamp: 42649
 hardirqs last  enabled at (42649): [<ffffffffb22dcdb3>] _raw_spin_unlock_irqrestore+0x53/0x60
 hardirqs last disabled at (42648): [<ffffffffb22dcad1>] _raw_spin_lock_irqsave+0x21/0x60
 softirqs last  enabled at (42306): [<ffffffffb260034c>] __do_softirq+0x34c/0x57c
 softirqs last disabled at (42299): [<ffffffffb18c6762>] irq_exit+0xa2/0xc0

 read to 0xffff9275a1920ad8 of 16 bytes by task 47312 on cpu 40:
  xfs_vn_getattr+0x20c/0x6a0 [xfs]
  xfs_vn_getattr at fs/xfs/xfs_iops.c:551
  vfs_getattr_nosec+0x11a/0x170
  vfs_statx_fd+0x54/0x90
  __do_sys_newfstat+0x40/0x90
  __x64_sys_newfstat+0x3a/0x50
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 no locks held by doio/47312.
 irq event stamp: 43883
 hardirqs last  enabled at (43883): [<ffffffffb1805119>] do_syscall_64+0x39/0xb05
 hardirqs last disabled at (43882): [<ffffffffb1803ede>] trace_hardirqs_off_thunk+0x1a/0x1c
 softirqs last  enabled at (43844): [<ffffffffb260034c>] __do_softirq+0x34c/0x57c
 softirqs last disabled at (43141): [<ffffffffb18c6762>] irq_exit+0xa2/0xc0

Signed-off-by: Qian Cai <cai@lca.pw>
---
 fs/xfs/xfs_iops.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

Comments

Marco Elver Feb. 25, 2020, 8:27 p.m. UTC | #1
On Tue, 25 Feb 2020 at 21:09, Qian Cai <cai@lca.pw> wrote:
>
> inode->i_*time could be accessed concurrently. The plain reads in
> xfs_vn_getattr() is lockless which result in data races. To avoid bad
> compiler optimizations like load tearing, adding pairs of
> READ|WRITE_ONCE(). While at it, also take care of xfs_setattr_time()
> which presumably could run concurrently with xfs_vn_getattr() as well.
> The data races were reported by KCSAN,
>
>  write to 0xffff9275a1920ad8 of 16 bytes by task 47311 on cpu 46:
>   xfs_vn_update_time+0x1b0/0x400 [xfs]
>   xfs_vn_update_time at fs/xfs/xfs_iops.c:1122

So this one is doing concurrent writes and reads of the whole struct,
which is 16 bytes. This will always be split into multiple
loads/stores. Is it intentional?

Sadly, this is pretty much guaranteed to tear, even with the
READ/WRITE_ONCE.  The *ONCE will just make KCSAN not tell us about
this one, which is probably not what we want right now, unless we know
for sure the race is intentional.

Thanks,
-- Marco

>   update_time+0x57/0x80
>   file_update_time+0x143/0x1f0
>   __xfs_filemap_fault+0x1be/0x3d0 [xfs]
>   xfs_filemap_page_mkwrite+0x25/0x40 [xfs]
>   do_page_mkwrite+0xf7/0x250
>   do_fault+0x679/0x920
>   __handle_mm_fault+0xc9f/0xd40
>   handle_mm_fault+0xfc/0x2f0
>   do_page_fault+0x263/0x6f9
>   page_fault+0x34/0x40
>
>  4 locks held by doio/47311:
>   #0: ffff9275e7d70808 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9
>   #1: ffff9274864394d8 (sb_pagefaults){.+.+}, at: __xfs_filemap_fault+0x19b/0x3d0 [xfs]
>   #2: ffff9274864395b8 (sb_internal){.+.+}, at: xfs_trans_alloc+0x2af/0x3c0 [xfs]
>   #3: ffff9275a1920920 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0x116/0x2c0 [xfs]
>  irq event stamp: 42649
>  hardirqs last  enabled at (42649): [<ffffffffb22dcdb3>] _raw_spin_unlock_irqrestore+0x53/0x60
>  hardirqs last disabled at (42648): [<ffffffffb22dcad1>] _raw_spin_lock_irqsave+0x21/0x60
>  softirqs last  enabled at (42306): [<ffffffffb260034c>] __do_softirq+0x34c/0x57c
>  softirqs last disabled at (42299): [<ffffffffb18c6762>] irq_exit+0xa2/0xc0
>
>  read to 0xffff9275a1920ad8 of 16 bytes by task 47312 on cpu 40:
>   xfs_vn_getattr+0x20c/0x6a0 [xfs]
>   xfs_vn_getattr at fs/xfs/xfs_iops.c:551
>   vfs_getattr_nosec+0x11a/0x170
>   vfs_statx_fd+0x54/0x90
>   __do_sys_newfstat+0x40/0x90
>   __x64_sys_newfstat+0x3a/0x50
>   do_syscall_64+0x91/0xb05
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
>  no locks held by doio/47312.
>  irq event stamp: 43883
>  hardirqs last  enabled at (43883): [<ffffffffb1805119>] do_syscall_64+0x39/0xb05
>  hardirqs last disabled at (43882): [<ffffffffb1803ede>] trace_hardirqs_off_thunk+0x1a/0x1c
>  softirqs last  enabled at (43844): [<ffffffffb260034c>] __do_softirq+0x34c/0x57c
>  softirqs last disabled at (43141): [<ffffffffb18c6762>] irq_exit+0xa2/0xc0
>
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  fs/xfs/xfs_iops.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 81f2f93caec0..2d5ca13ee9da 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -547,9 +547,9 @@
>         stat->uid = inode->i_uid;
>         stat->gid = inode->i_gid;
>         stat->ino = ip->i_ino;
> -       stat->atime = inode->i_atime;
> -       stat->mtime = inode->i_mtime;
> -       stat->ctime = inode->i_ctime;
> +       stat->atime = READ_ONCE(inode->i_atime);
> +       stat->mtime = READ_ONCE(inode->i_mtime);
> +       stat->ctime = READ_ONCE(inode->i_ctime);
>         stat->blocks =
>                 XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
>
> @@ -614,11 +614,11 @@
>         ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>
>         if (iattr->ia_valid & ATTR_ATIME)
> -               inode->i_atime = iattr->ia_atime;
> +               WRITE_ONCE(inode->i_atime, iattr->ia_atime);
>         if (iattr->ia_valid & ATTR_CTIME)
> -               inode->i_ctime = iattr->ia_ctime;
> +               WRITE_ONCE(inode->i_ctime, iattr->ia_ctime);
>         if (iattr->ia_valid & ATTR_MTIME)
> -               inode->i_mtime = iattr->ia_mtime;
> +               WRITE_ONCE(inode->i_mtime, iattr->ia_mtime);
>  }
>
>  static int
> @@ -1117,11 +1117,11 @@
>
>         xfs_ilock(ip, XFS_ILOCK_EXCL);
>         if (flags & S_CTIME)
> -               inode->i_ctime = *now;
> +               WRITE_ONCE(inode->i_ctime, *now);
>         if (flags & S_MTIME)
> -               inode->i_mtime = *now;
> +               WRITE_ONCE(inode->i_mtime, *now);
>         if (flags & S_ATIME)
> -               inode->i_atime = *now;
> +               WRITE_ONCE(inode->i_atime, *now);
>
>         xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>         xfs_trans_log_inode(tp, ip, log_flags);
> --
> 1.8.3.1
>
Darrick J. Wong Feb. 25, 2020, 8:28 p.m. UTC | #2
On Tue, Feb 25, 2020 at 03:09:45PM -0500, Qian Cai wrote:
> inode->i_*time could be accessed concurrently. The plain reads in
> xfs_vn_getattr() is lockless which result in data races. To avoid bad
> compiler optimizations like load tearing, adding pairs of
> READ|WRITE_ONCE(). While at it, also take care of xfs_setattr_time()
> which presumably could run concurrently with xfs_vn_getattr() as well.
> The data races were reported by KCSAN,
> 
>  write to 0xffff9275a1920ad8 of 16 bytes by task 47311 on cpu 46:
>   xfs_vn_update_time+0x1b0/0x400 [xfs]
>   xfs_vn_update_time at fs/xfs/xfs_iops.c:1122
>   update_time+0x57/0x80
>   file_update_time+0x143/0x1f0
>   __xfs_filemap_fault+0x1be/0x3d0 [xfs]
>   xfs_filemap_page_mkwrite+0x25/0x40 [xfs]
>   do_page_mkwrite+0xf7/0x250
>   do_fault+0x679/0x920
>   __handle_mm_fault+0xc9f/0xd40
>   handle_mm_fault+0xfc/0x2f0
>   do_page_fault+0x263/0x6f9
>   page_fault+0x34/0x40
> 
>  4 locks held by doio/47311:
>   #0: ffff9275e7d70808 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9
>   #1: ffff9274864394d8 (sb_pagefaults){.+.+}, at: __xfs_filemap_fault+0x19b/0x3d0 [xfs]
>   #2: ffff9274864395b8 (sb_internal){.+.+}, at: xfs_trans_alloc+0x2af/0x3c0 [xfs]
>   #3: ffff9275a1920920 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0x116/0x2c0 [xfs]
>  irq event stamp: 42649
>  hardirqs last  enabled at (42649): [<ffffffffb22dcdb3>] _raw_spin_unlock_irqrestore+0x53/0x60
>  hardirqs last disabled at (42648): [<ffffffffb22dcad1>] _raw_spin_lock_irqsave+0x21/0x60
>  softirqs last  enabled at (42306): [<ffffffffb260034c>] __do_softirq+0x34c/0x57c
>  softirqs last disabled at (42299): [<ffffffffb18c6762>] irq_exit+0xa2/0xc0
> 
>  read to 0xffff9275a1920ad8 of 16 bytes by task 47312 on cpu 40:
>   xfs_vn_getattr+0x20c/0x6a0 [xfs]
>   xfs_vn_getattr at fs/xfs/xfs_iops.c:551
>   vfs_getattr_nosec+0x11a/0x170
>   vfs_statx_fd+0x54/0x90
>   __do_sys_newfstat+0x40/0x90
>   __x64_sys_newfstat+0x3a/0x50
>   do_syscall_64+0x91/0xb05
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>  no locks held by doio/47312.
>  irq event stamp: 43883
>  hardirqs last  enabled at (43883): [<ffffffffb1805119>] do_syscall_64+0x39/0xb05
>  hardirqs last disabled at (43882): [<ffffffffb1803ede>] trace_hardirqs_off_thunk+0x1a/0x1c
>  softirqs last  enabled at (43844): [<ffffffffb260034c>] __do_softirq+0x34c/0x57c
>  softirqs last disabled at (43141): [<ffffffffb18c6762>] irq_exit+0xa2/0xc0
> 
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  fs/xfs/xfs_iops.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 81f2f93caec0..2d5ca13ee9da 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -547,9 +547,9 @@
>  	stat->uid = inode->i_uid;
>  	stat->gid = inode->i_gid;
>  	stat->ino = ip->i_ino;
> -	stat->atime = inode->i_atime;
> -	stat->mtime = inode->i_mtime;
> -	stat->ctime = inode->i_ctime;
> +	stat->atime = READ_ONCE(inode->i_atime);
> +	stat->mtime = READ_ONCE(inode->i_mtime);
> +	stat->ctime = READ_ONCE(inode->i_ctime);

Seeing as one is supposed to take ILOCK_SHARED before reading inode core
information, why don't we do that here?  Is there some huge performance
benefit to be realized from READ_ONCE vs. waiting for the lock that
protects all the writes from each other?

--D

>  	stat->blocks =
>  		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
>  
> @@ -614,11 +614,11 @@
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
>  	if (iattr->ia_valid & ATTR_ATIME)
> -		inode->i_atime = iattr->ia_atime;
> +		WRITE_ONCE(inode->i_atime, iattr->ia_atime);
>  	if (iattr->ia_valid & ATTR_CTIME)
> -		inode->i_ctime = iattr->ia_ctime;
> +		WRITE_ONCE(inode->i_ctime, iattr->ia_ctime);
>  	if (iattr->ia_valid & ATTR_MTIME)
> -		inode->i_mtime = iattr->ia_mtime;
> +		WRITE_ONCE(inode->i_mtime, iattr->ia_mtime);
>  }
>  
>  static int
> @@ -1117,11 +1117,11 @@
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	if (flags & S_CTIME)
> -		inode->i_ctime = *now;
> +		WRITE_ONCE(inode->i_ctime, *now);
>  	if (flags & S_MTIME)
> -		inode->i_mtime = *now;
> +		WRITE_ONCE(inode->i_mtime, *now);
>  	if (flags & S_ATIME)
> -		inode->i_atime = *now;
> +		WRITE_ONCE(inode->i_atime, *now);
>  
>  	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>  	xfs_trans_log_inode(tp, ip, log_flags);
> -- 
> 1.8.3.1
>
Christoph Hellwig Feb. 25, 2020, 8:46 p.m. UTC | #3
On Tue, Feb 25, 2020 at 12:28:29PM -0800, Darrick J. Wong wrote:
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 81f2f93caec0..2d5ca13ee9da 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -547,9 +547,9 @@
> >  	stat->uid = inode->i_uid;
> >  	stat->gid = inode->i_gid;
> >  	stat->ino = ip->i_ino;
> > -	stat->atime = inode->i_atime;
> > -	stat->mtime = inode->i_mtime;
> > -	stat->ctime = inode->i_ctime;
> > +	stat->atime = READ_ONCE(inode->i_atime);
> > +	stat->mtime = READ_ONCE(inode->i_mtime);
> > +	stat->ctime = READ_ONCE(inode->i_ctime);
> 
> Seeing as one is supposed to take ILOCK_SHARED before reading inode core
> information, why don't we do that here?  Is there some huge performance
> benefit to be realized from READ_ONCE vs. waiting for the lock that
> protects all the writes from each other?

Yes, I don't see how READ_ONCE works on a structure.

I think you should look into fixing this race in generic_fillattr
first, and we then piggy back on that fix in XFS once it has all been
sorted out.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 81f2f93caec0..2d5ca13ee9da 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -547,9 +547,9 @@ 
 	stat->uid = inode->i_uid;
 	stat->gid = inode->i_gid;
 	stat->ino = ip->i_ino;
-	stat->atime = inode->i_atime;
-	stat->mtime = inode->i_mtime;
-	stat->ctime = inode->i_ctime;
+	stat->atime = READ_ONCE(inode->i_atime);
+	stat->mtime = READ_ONCE(inode->i_mtime);
+	stat->ctime = READ_ONCE(inode->i_ctime);
 	stat->blocks =
 		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
 
@@ -614,11 +614,11 @@ 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
 	if (iattr->ia_valid & ATTR_ATIME)
-		inode->i_atime = iattr->ia_atime;
+		WRITE_ONCE(inode->i_atime, iattr->ia_atime);
 	if (iattr->ia_valid & ATTR_CTIME)
-		inode->i_ctime = iattr->ia_ctime;
+		WRITE_ONCE(inode->i_ctime, iattr->ia_ctime);
 	if (iattr->ia_valid & ATTR_MTIME)
-		inode->i_mtime = iattr->ia_mtime;
+		WRITE_ONCE(inode->i_mtime, iattr->ia_mtime);
 }
 
 static int
@@ -1117,11 +1117,11 @@ 
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (flags & S_CTIME)
-		inode->i_ctime = *now;
+		WRITE_ONCE(inode->i_ctime, *now);
 	if (flags & S_MTIME)
-		inode->i_mtime = *now;
+		WRITE_ONCE(inode->i_mtime, *now);
 	if (flags & S_ATIME)
-		inode->i_atime = *now;
+		WRITE_ONCE(inode->i_atime, *now);
 
 	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 	xfs_trans_log_inode(tp, ip, log_flags);