[v2] xfs: fix a UAF when inode item push

Message ID 20230722025721.312909-1-leo.lilong@huawei.com (mailing list archive)
State Superseded, archived
Series [v2] xfs: fix a UAF when inode item push

Commit Message

Long Li July 22, 2023, 2:57 a.m. UTC
KASAN reported a UAF bug while fault injection test:

  ==================================================================
  BUG: KASAN: use-after-free in xfs_inode_item_push+0x2db/0x2f0
  Read of size 8 at addr ffff888022f74788 by task xfsaild/sda/479

  CPU: 0 PID: 479 Comm: xfsaild/sda Not tainted 6.2.0-rc7-00003-ga8a43e2eb5f6 #89
  Call Trace:
   <TASK>
   dump_stack_lvl+0x51/0x6a
   print_report+0x171/0x4a6
   kasan_report+0xb7/0x130
   xfs_inode_item_push+0x2db/0x2f0
   xfsaild+0x729/0x1f70
   kthread+0x290/0x340
   ret_from_fork+0x1f/0x30
   </TASK>

  Allocated by task 494:
   kasan_save_stack+0x22/0x40
   kasan_set_track+0x25/0x30
   __kasan_slab_alloc+0x58/0x70
   kmem_cache_alloc+0x197/0x5d0
   xfs_inode_item_init+0x62/0x170
   xfs_trans_ijoin+0x15e/0x240
   xfs_init_new_inode+0x573/0x1820
   xfs_create+0x6a1/0x1020
   xfs_generic_create+0x544/0x5d0
   vfs_mkdir+0x5d0/0x980
   do_mkdirat+0x14e/0x220
   __x64_sys_mkdir+0x6a/0x80
   do_syscall_64+0x39/0x80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

  Freed by task 14:
   kasan_save_stack+0x22/0x40
   kasan_set_track+0x25/0x30
   kasan_save_free_info+0x2e/0x40
   __kasan_slab_free+0x114/0x1b0
   kmem_cache_free+0xee/0x4e0
   xfs_inode_free_callback+0x187/0x2a0
   rcu_do_batch+0x317/0xce0
   rcu_core+0x686/0xa90
   __do_softirq+0x1b6/0x626

  The buggy address belongs to the object at ffff888022f74758
   which belongs to the cache xfs_ili of size 200
  The buggy address is located 48 bytes inside of
   200-byte region [ffff888022f74758, ffff888022f74820)

  The buggy address belongs to the physical page:
  page:ffffea00008bdd00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22f74
  head:ffffea00008bdd00 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
  flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
  raw: 001fffff80010200 ffff888010ed4040 ffffea00008b2510 ffffea00008bde10
  raw: 0000000000000000 00000000001a001a 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888022f74680: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
   ffff888022f74700: fc fc fc fc fc fc fc fc fc fc fc fa fb fb fb fb
  >ffff888022f74780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                        ^
   ffff888022f74800: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
   ffff888022f74880: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ==================================================================

Pushing an inode item in xfsaild can race with the inode reclaim task.
Consider the following call graph, where both tasks deal with the same
inode. While flushing the cluster under shutdown conditions, xfsaild
enters xfs_iflush_abort(), which clears the inode's XFS_IFLUSHING flag
and sets lip->li_buf to null. Concurrently, the inode is reclaimed
under the same shutdown conditions; reclaim does not need to wait on
the xfs buf lock because lip->li_buf is already null, so if the xfsaild
task is scheduled out while flushing the cluster, the inode is freed
via an RCU callback. It is therefore unsafe to reference the lip after
flushing the cluster in xfs_inode_item_push().

			<log item is in AIL>
			<filesystem shutdown>
spin_lock(&ailp->ail_lock)
xfs_inode_item_push(lip)
  xfs_buf_trylock(bp)
  spin_unlock(&lip->li_ailp->ail_lock)
  xfs_iflush_cluster(bp)
    if (xfs_is_shutdown())
      xfs_iflush_abort(ip)
	xfs_trans_ail_delete(ip)
	  spin_lock(&ailp->ail_lock)
	  spin_unlock(&ailp->ail_lock)
	xfs_iflush_abort_clean(ip)
      error = -EIO
			<log item removed from AIL>
			<log item li_buf set to null>
    if (error)
      xfs_force_shutdown()
	xlog_shutdown_wait(mp->m_log)
	  might_sleep()
					xfs_reclaim_inode(ip)
					if (shutdown)
					  xfs_iflush_shutdown_abort(ip)
					    if (!bp)
					      xfs_iflush_abort(ip)
					      return
				        __xfs_inode_free(ip)
					   call_rcu(ip, xfs_inode_free_callback)
			......
			<rcu grace period expires>
			<rcu free callbacks run somewhere>
			  xfs_inode_free_callback(ip)
			    kmem_cache_free(ip->i_itemp)
			......
<starts running again>
    xfs_buf_ioend_fail(bp);
      xfs_buf_ioend(bp)
        xfs_buf_relse(bp);
    return error
spin_lock(&lip->li_ailp->ail_lock)
  <UAF on log item>

Fix the race condition by taking XFS_ILOCK_SHARED on the inode in
xfs_inode_item_push(). XFS_ILOCK_EXCL is held while the inode is
reclaimed, so this prevents the UAF from occurring.
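
For context on why this helps: the reclaim side takes the inode lock
exclusively before freeing the inode, so holding XFS_ILOCK_SHARED in
the push path keeps reclaim out until the push returns. A minimal
sketch of the reclaim path, paraphrased from fs/xfs/xfs_icache.c (not
verbatim kernel code):

	static void
	xfs_reclaim_inode(
		struct xfs_inode	*ip,
		struct xfs_perag	*pag)
	{
		/* reclaim backs off while anyone else holds the ILOCK */
		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
			return;
		....
		/* under shutdown, abort any flush in progress */
		if (xlog_is_shutdown(ip->i_mount->m_log))
			xfs_iflush_shutdown_abort(ip);
		....
		/* the RCU callback eventually frees ip->i_itemp */
		__xfs_inode_free(ip);
	}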

Link: https://patchwork.kernel.org/project/xfs/patch/20230211022941.GA1515023@ceph-admin/
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Suggested-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
v2:
 - Correct the description of the race condition
 - In the v1 solution the lip is still accessed after
 xfsaild_push_item(), so it's still possible to trigger the UAF;
 the solution has been reworked in v2.

 fs/xfs/xfs_inode_item.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Dave Chinner July 26, 2023, 4:27 a.m. UTC | #1
On Sat, Jul 22, 2023 at 10:57:21AM +0800, Long Li wrote:
> KASAN reported a UAF bug while fault injection test:
> 
>   ==================================================================
>   BUG: KASAN: use-after-free in xfs_inode_item_push+0x2db/0x2f0
>   Read of size 8 at addr ffff888022f74788 by task xfsaild/sda/479
> 
>   CPU: 0 PID: 479 Comm: xfsaild/sda Not tainted 6.2.0-rc7-00003-ga8a43e2eb5f6 #89
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x51/0x6a
>    print_report+0x171/0x4a6
>    kasan_report+0xb7/0x130
>    xfs_inode_item_push+0x2db/0x2f0
>    xfsaild+0x729/0x1f70
>    kthread+0x290/0x340
>    ret_from_fork+0x1f/0x30
>    </TASK>
> 
>   Allocated by task 494:
>    kasan_save_stack+0x22/0x40
>    kasan_set_track+0x25/0x30
>    __kasan_slab_alloc+0x58/0x70
>    kmem_cache_alloc+0x197/0x5d0
>    xfs_inode_item_init+0x62/0x170
>    xfs_trans_ijoin+0x15e/0x240
>    xfs_init_new_inode+0x573/0x1820
>    xfs_create+0x6a1/0x1020
>    xfs_generic_create+0x544/0x5d0
>    vfs_mkdir+0x5d0/0x980
>    do_mkdirat+0x14e/0x220
>    __x64_sys_mkdir+0x6a/0x80
>    do_syscall_64+0x39/0x80
>    entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
>   Freed by task 14:
>    kasan_save_stack+0x22/0x40
>    kasan_set_track+0x25/0x30
>    kasan_save_free_info+0x2e/0x40
>    __kasan_slab_free+0x114/0x1b0
>    kmem_cache_free+0xee/0x4e0
>    xfs_inode_free_callback+0x187/0x2a0
>    rcu_do_batch+0x317/0xce0
>    rcu_core+0x686/0xa90
>    __do_softirq+0x1b6/0x626
> 
>   The buggy address belongs to the object at ffff888022f74758
>    which belongs to the cache xfs_ili of size 200
>   The buggy address is located 48 bytes inside of
>    200-byte region [ffff888022f74758, ffff888022f74820)
> 
>   The buggy address belongs to the physical page:
>   page:ffffea00008bdd00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22f74
>   head:ffffea00008bdd00 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
>   flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
>   raw: 001fffff80010200 ffff888010ed4040 ffffea00008b2510 ffffea00008bde10
>   raw: 0000000000000000 00000000001a001a 00000001ffffffff 0000000000000000
>   page dumped because: kasan: bad access detected
> 
>   Memory state around the buggy address:
>    ffff888022f74680: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
>    ffff888022f74700: fc fc fc fc fc fc fc fc fc fc fc fa fb fb fb fb
>   >ffff888022f74780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                         ^
>    ffff888022f74800: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>    ffff888022f74880: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   ==================================================================
> 
> Pushing an inode item in xfsaild can race with the inode reclaim task.
> Consider the following call graph, where both tasks deal with the same
> inode. While flushing the cluster under shutdown conditions, xfsaild
> enters xfs_iflush_abort(), which clears the inode's XFS_IFLUSHING flag
> and sets lip->li_buf to null. Concurrently, the inode is reclaimed
> under the same shutdown conditions; reclaim does not need to wait on
> the xfs buf lock because lip->li_buf is already null, so if the xfsaild
> task is scheduled out while flushing the cluster, the inode is freed
> via an RCU callback. It is therefore unsafe to reference the lip after
> flushing the cluster in xfs_inode_item_push().
> 
> 			<log item is in AIL>
> 			<filesystem shutdown>
> spin_lock(&ailp->ail_lock)
> xfs_inode_item_push(lip)
>   xfs_buf_trylock(bp)
>   spin_unlock(&lip->li_ailp->ail_lock)
>   xfs_iflush_cluster(bp)
>     if (xfs_is_shutdown())
>       xfs_iflush_abort(ip)
> 	xfs_trans_ail_delete(ip)
> 	  spin_lock(&ailp->ail_lock)
> 	  spin_unlock(&ailp->ail_lock)
> 	xfs_iflush_abort_clean(ip)
>       error = -EIO
> 			<log item removed from AIL>
> 			<log item li_buf set to null>
>     if (error)
>       xfs_force_shutdown()
> 	xlog_shutdown_wait(mp->m_log)
> 	  might_sleep()
> 					xfs_reclaim_inode(ip)
> 					if (shutdown)
> 					  xfs_iflush_shutdown_abort(ip)
> 					    if (!bp)
> 					      xfs_iflush_abort(ip)
> 					      return
> 				        __xfs_inode_free(ip)
> 					   call_rcu(ip, xfs_inode_free_callback)
> 			......
> 			<rcu grace period expires>
> 			<rcu free callbacks run somewhere>
> 			  xfs_inode_free_callback(ip)
> 			    kmem_cache_free(ip->i_itemp)
> 			......
> <starts running again>
>     xfs_buf_ioend_fail(bp);
>       xfs_buf_ioend(bp)
>         xfs_buf_relse(bp);
>     return error
> spin_lock(&lip->li_ailp->ail_lock)
>   <UAF on log item>

Yup. It's not safe to reference the inode log item here...

> Fix the race condition by taking XFS_ILOCK_SHARED on the inode in
> xfs_inode_item_push(). XFS_ILOCK_EXCL is held while the inode is
> reclaimed, so this prevents the UAF from occurring.

Having reclaim come in and free the inode after we've already
aborted and removed from the buffer isn't a problem. The inode
flushing code is designed to handle that safely.

The problem is that xfs_inode_item_push() tries to use the inode
item after the failure has occurred and we've already aborted the
inode item and finished it. i.e. the problem is this line:

	spin_lock(&lip->li_ailp->ail_lock);

because it is using the log item that has been freed to get the
ailp. We can safely store the ailp at the start of the function
whilst we still hold the ail_lock.

i.e.

	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+	struct xfs_ail		*ailp = lip->li_ailp;
	struct xfs_inode        *ip = iip->ili_inode;
	struct xfs_buf          *bp = lip->li_buf;
	uint                    rval = XFS_ITEM_SUCCESS;
....

-	spin_unlock(&lip->li_ailp->ail_lock);
+	spin_unlock(&ailp->ail_lock);

.....
	} else {
		/*
		 * Release the buffer if we were unable to flush anything. On
-		 * any other error, the buffer has already been released.
+		 * any other error, the buffer has already been released and it
+		 * is now unsafe to reference the inode item as a flush abort may
+		 * have removed it from the AIL and reclaim freed the inode.
		 */
		if (error == -EAGAIN)
			xfs_buf_relse(bp);
		rval = XFS_ITEM_LOCKED;
	}

-	spin_lock(&lip->li_ailp->ail_lock);
+	/* unsafe to reference anything log item related from here on. */
+	spin_lock(&ailp->ail_lock);
	return rval;

Cheers,

Dave.
Long Li July 26, 2023, 7:32 a.m. UTC | #2
On Wed, Jul 26, 2023 at 02:27:04PM +1000, Dave Chinner wrote:
> On Sat, Jul 22, 2023 at 10:57:21AM +0800, Long Li wrote:
> > KASAN reported a UAF bug while fault injection test:
> > 
> >   ==================================================================
> >   BUG: KASAN: use-after-free in xfs_inode_item_push+0x2db/0x2f0
> >   Read of size 8 at addr ffff888022f74788 by task xfsaild/sda/479
> > 
> >   CPU: 0 PID: 479 Comm: xfsaild/sda Not tainted 6.2.0-rc7-00003-ga8a43e2eb5f6 #89
> >   Call Trace:
> >    <TASK>
> >    dump_stack_lvl+0x51/0x6a
> >    print_report+0x171/0x4a6
> >    kasan_report+0xb7/0x130
> >    xfs_inode_item_push+0x2db/0x2f0
> >    xfsaild+0x729/0x1f70
> >    kthread+0x290/0x340
> >    ret_from_fork+0x1f/0x30
> >    </TASK>
> > 
> >   Allocated by task 494:
> >    kasan_save_stack+0x22/0x40
> >    kasan_set_track+0x25/0x30
> >    __kasan_slab_alloc+0x58/0x70
> >    kmem_cache_alloc+0x197/0x5d0
> >    xfs_inode_item_init+0x62/0x170
> >    xfs_trans_ijoin+0x15e/0x240
> >    xfs_init_new_inode+0x573/0x1820
> >    xfs_create+0x6a1/0x1020
> >    xfs_generic_create+0x544/0x5d0
> >    vfs_mkdir+0x5d0/0x980
> >    do_mkdirat+0x14e/0x220
> >    __x64_sys_mkdir+0x6a/0x80
> >    do_syscall_64+0x39/0x80
> >    entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > 
> >   Freed by task 14:
> >    kasan_save_stack+0x22/0x40
> >    kasan_set_track+0x25/0x30
> >    kasan_save_free_info+0x2e/0x40
> >    __kasan_slab_free+0x114/0x1b0
> >    kmem_cache_free+0xee/0x4e0
> >    xfs_inode_free_callback+0x187/0x2a0
> >    rcu_do_batch+0x317/0xce0
> >    rcu_core+0x686/0xa90
> >    __do_softirq+0x1b6/0x626
> > 
> >   The buggy address belongs to the object at ffff888022f74758
> >    which belongs to the cache xfs_ili of size 200
> >   The buggy address is located 48 bytes inside of
> >    200-byte region [ffff888022f74758, ffff888022f74820)
> > 
> >   The buggy address belongs to the physical page:
> >   page:ffffea00008bdd00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22f74
> >   head:ffffea00008bdd00 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
> >   flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
> >   raw: 001fffff80010200 ffff888010ed4040 ffffea00008b2510 ffffea00008bde10
> >   raw: 0000000000000000 00000000001a001a 00000001ffffffff 0000000000000000
> >   page dumped because: kasan: bad access detected
> > 
> >   Memory state around the buggy address:
> >    ffff888022f74680: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
> >    ffff888022f74700: fc fc fc fc fc fc fc fc fc fc fc fa fb fb fb fb
> >   >ffff888022f74780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >                         ^
> >    ffff888022f74800: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
> >    ffff888022f74880: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >   ==================================================================
> > 
> > Pushing an inode item in xfsaild can race with the inode reclaim task.
> > Consider the following call graph, where both tasks deal with the same
> > inode. While flushing the cluster under shutdown conditions, xfsaild
> > enters xfs_iflush_abort(), which clears the inode's XFS_IFLUSHING flag
> > and sets lip->li_buf to null. Concurrently, the inode is reclaimed
> > under the same shutdown conditions; reclaim does not need to wait on
> > the xfs buf lock because lip->li_buf is already null, so if the xfsaild
> > task is scheduled out while flushing the cluster, the inode is freed
> > via an RCU callback. It is therefore unsafe to reference the lip after
> > flushing the cluster in xfs_inode_item_push().
> > 
> > 			<log item is in AIL>
> > 			<filesystem shutdown>
> > spin_lock(&ailp->ail_lock)
> > xfs_inode_item_push(lip)
> >   xfs_buf_trylock(bp)
> >   spin_unlock(&lip->li_ailp->ail_lock)
> >   xfs_iflush_cluster(bp)
> >     if (xfs_is_shutdown())
> >       xfs_iflush_abort(ip)
> > 	xfs_trans_ail_delete(ip)
> > 	  spin_lock(&ailp->ail_lock)
> > 	  spin_unlock(&ailp->ail_lock)
> > 	xfs_iflush_abort_clean(ip)
> >       error = -EIO
> > 			<log item removed from AIL>
> > 			<log item li_buf set to null>
> >     if (error)
> >       xfs_force_shutdown()
> > 	xlog_shutdown_wait(mp->m_log)
> > 	  might_sleep()
> > 					xfs_reclaim_inode(ip)
> > 					if (shutdown)
> > 					  xfs_iflush_shutdown_abort(ip)
> > 					    if (!bp)
> > 					      xfs_iflush_abort(ip)
> > 					      return
> > 				        __xfs_inode_free(ip)
> > 					   call_rcu(ip, xfs_inode_free_callback)
> > 			......
> > 			<rcu grace period expires>
> > 			<rcu free callbacks run somewhere>
> > 			  xfs_inode_free_callback(ip)
> > 			    kmem_cache_free(ip->i_itemp)
> > 			......
> > <starts running again>
> >     xfs_buf_ioend_fail(bp);
> >       xfs_buf_ioend(bp)
> >         xfs_buf_relse(bp);
> >     return error
> > spin_lock(&lip->li_ailp->ail_lock)
> >   <UAF on log item>
> 
> Yup. It's not safe to reference the inode log item here...
> 
> > Fix the race condition by taking XFS_ILOCK_SHARED on the inode in
> > xfs_inode_item_push(). XFS_ILOCK_EXCL is held while the inode is
> > reclaimed, so this prevents the UAF from occurring.
> 
> Having reclaim come in and free the inode after we've already
> aborted and removed from the buffer isn't a problem. The inode
> flushing code is designed to handle that safely.
> 
> The problem is that xfs_inode_item_push() tries to use the inode
> item after the failure has occurred and we've already aborted the
> inode item and finished it. i.e. the problem is this line:
> 
> 	spin_lock(&lip->li_ailp->ail_lock);
> 
> because it is using the log item that has been freed to get the
> ailp. We can safely store the ailp at the start of the function
> whilst we still hold the ail_lock.
> 

Hi Dave,

That's how I solved it in v1[1], but I found that it doesn't completely
solve the problem, because the freed lip can still be referenced in
xfsaild_push(), unless we stop referring to the lip in the tracepoints
after xfsaild_push_item().

xfsaild_push()
  xfsaild_push_item()
    lip->li_ops->iop_push()
      xfs_inode_item_push(lip)
	xfs_iflush_cluster(bp)
				......
					xfs_reclaim_inode(ip)
					  ......
					  __xfs_inode_free(ip)
					    call_rcu(ip, xfs_inode_free_callback)
				......		
			<rcu grace period expires>
			<rcu free callbacks run somewhere>
			  xfs_inode_free_callback(ip)
			    kmem_cache_free(ip->i_itemp)
				......
  trace_xfs_ail_xxx(lip) //uaf

[1] https://patchwork.kernel.org/project/xfs/patch/20230211022941.GA1515023@ceph-admin/
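
For reference, those tracepoint calls sit right after the push in
xfsaild_push(); roughly (paraphrased from fs/xfs/xfs_trans_ail.c, not
verbatim):

	lock_result = xfsaild_push_item(ailp, lip);
	switch (lock_result) {
	case XFS_ITEM_SUCCESS:
		XFS_STATS_INC(mp, xs_push_ail_success);
		trace_xfs_ail_push(lip);	/* lip may already be freed */
		....
	case XFS_ITEM_LOCKED:
		XFS_STATS_INC(mp, xs_push_ail_locked);
		trace_xfs_ail_locked(lip);	/* likewise */
		....
	}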

Thanks

Long Li

> i.e.
> 
> 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +	struct xfs_ail		*ailp = lip->li_ailp;
> 	struct xfs_inode        *ip = iip->ili_inode;
> 	struct xfs_buf          *bp = lip->li_buf;
> 	uint                    rval = XFS_ITEM_SUCCESS;
> ....
> 
> -	spin_unlock(&lip->li_ailp->ail_lock);
> +	spin_unlock(&ailp->ail_lock);
> 
> .....
> 	} else {
> 		/*
> 		 * Release the buffer if we were unable to flush anything. On
> -		 * any other error, the buffer has already been released.
> +		 * any other error, the buffer has already been released and it
> +		 * is now unsafe to reference the inode item as a flush abort may
> +		 * have removed it from the AIL and reclaim freed the inode.
> 		 */
> 		if (error == -EAGAIN)
> 			xfs_buf_relse(bp);
> 		rval = XFS_ITEM_LOCKED;
> 	}
> 
> -	spin_lock(&lip->li_ailp->ail_lock);
> +	/* unsafe to reference anything log item related from here on. */
> +	spin_lock(&ailp->ail_lock);
> 	return rval;
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner July 27, 2023, 1:25 a.m. UTC | #3
On Wed, Jul 26, 2023 at 03:32:19PM +0800, Long Li wrote:
> On Wed, Jul 26, 2023 at 02:27:04PM +1000, Dave Chinner wrote:
> > On Sat, Jul 22, 2023 at 10:57:21AM +0800, Long Li wrote:
> > > Fix the race condition by taking XFS_ILOCK_SHARED on the inode in
> > > xfs_inode_item_push(). XFS_ILOCK_EXCL is held while the inode is
> > > reclaimed, so this prevents the UAF from occurring.
> > 
> > Having reclaim come in and free the inode after we've already
> > aborted and removed from the buffer isn't a problem. The inode
> > flushing code is designed to handle that safely.
> > 
> > The problem is that xfs_inode_item_push() tries to use the inode
> > item after the failure has occurred and we've already aborted the
> > inode item and finished it. i.e. the problem is this line:
> > 
> > 	spin_lock(&lip->li_ailp->ail_lock);
> > 
> > because it is using the log item that has been freed to get the
> > ailp. We can safely store the ailp at the start of the function
> > whilst we still hold the ail_lock.
> > 
> 
> Hi Dave,
> 
> That's how I solved it in v1[1], but I found that it doesn't completely
> solve the problem, because the freed lip can still be referenced in
> xfsaild_push(), unless we stop referring to the lip in the tracepoints
> after xfsaild_push_item().
> 
> xfsaild_push()
>   xfsaild_push_item()
>     lip->li_ops->iop_push()
>       xfs_inode_item_push(lip)
> 	xfs_iflush_cluster(bp)
> 				......
> 					xfs_reclaim_inode(ip)
> 					  ......
> 					  __xfs_inode_free(ip)
> 					    call_rcu(ip, xfs_inode_free_callback)
> 				......		
> 			<rcu grace period expires>
> 			<rcu free callbacks run somewhere>
> 			  xfs_inode_free_callback(ip)
> 			    kmem_cache_free(ip->i_itemp)
> 				......
>   trace_xfs_ail_xxx(lip) //uaf

Then we need to fix that UAF, too!

Seriously: just because a reclaim race exposes more than one UAF
vector, it doesn't mean that we need to completely change the
lock order and life-cycle/existence guarantees for that object. It
means we need to look at the wider situation and determine if there
are other vectors to those same UAF conditions.

Such as: what happens if other types of log items have the same sort
of "can be freed before returning" situation?

Really, all this means is that returning XFS_ITEM_LOCKED when we've
dropped the AIL lock in .iop_push() is not safe - there is no longer
any guarantee the item is still in the AIL or even whether it exists
on return to xfsaild_push(), so it's just not safe to reference
it there at all.

Indeed, I note that xfs_qm_dquot_logitem_push() has exactly the same
potential UAF problem as xfs_inode_item_push() has. i.e. it also
drops the AIL lock to do a flush, which can abort the item and
remove it from the AIL on failure while the AIL lock is dropped,
then it grabs the AIL lock from the log item and returns
XFS_ITEM_LOCKED....

So, yeah, these failure cases need to return something different to
xfsaild_push() so it knows that it is unsafe to reference the log
item, even for tracing purposes. And, further,
xfs_qm_dquot_logitem_push() needs the same "grab ailp before
dropping the ail_lock" for when it is relocking the AIL....
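
For reference, the problematic shape in xfs_qm_dquot_logitem_push()
is roughly this (paraphrased from fs/xfs/xfs_dquot_item.c, not
verbatim):

	spin_unlock(&lip->li_ailp->ail_lock);

	error = xfs_qm_dqflush(dqp, &bp);	/* can abort and finish the item */
	....
	/* same potential UAF: lip may already be freed here */
	spin_lock(&lip->li_ailp->ail_lock);
	return rval;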

-Dave.

Patch

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 127b2410eb20..c3897c5417f2 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -711,9 +711,14 @@  xfs_inode_item_push(
 	if (xfs_iflags_test(ip, XFS_IFLUSHING))
 		return XFS_ITEM_FLUSHING;
 
-	if (!xfs_buf_trylock(bp))
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
 		return XFS_ITEM_LOCKED;
 
+	if (!xfs_buf_trylock(bp)) {
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		return XFS_ITEM_LOCKED;
+	}
+
 	spin_unlock(&lip->li_ailp->ail_lock);
 
 	/*
@@ -739,6 +744,7 @@  xfs_inode_item_push(
 	}
 
 	spin_lock(&lip->li_ailp->ail_lock);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 	return rval;
 }