diff mbox

[RFC] xfs: Fix "use after free" of intent items

Message ID 20180111125223.10727-1-chandan@linux.vnet.ibm.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Chandan Rajendra Jan. 11, 2018, 12:52 p.m. UTC
generic/388 can cause the following "use after free" error to occur,

 =============================================================================
 BUG xfs_efi_item (Not tainted): Poison overwritten
 -----------------------------------------------------------------------------

 Disabling lock debugging due to kernel taint
 INFO: 0x00000000292c4bd4-0x00000000292c4bd4. First byte 0x6a instead of 0x6b
 INFO: Allocated in .kmem_zone_alloc+0xcc/0x190 age=79 cpu=0 pid=12436
 	.__slab_alloc+0x54/0x80
 	.kmem_cache_alloc+0x124/0x350
 	.kmem_zone_alloc+0xcc/0x190
 	.xfs_efi_init+0x48/0xf0
 	.xfs_extent_free_create_intent+0x40/0x130
 	.xfs_defer_intake_work+0x74/0x1e0
 	.xfs_defer_finish+0xac/0x5c0
 	.xfs_itruncate_extents+0x170/0x590
 	.xfs_inactive_truncate+0xcc/0x170
 	.xfs_inactive+0x1d8/0x2f0
 	.xfs_fs_destroy_inode+0xe4/0x3d0
 	.destroy_inode+0x68/0xb0
 	.do_unlinkat+0x1e8/0x390
 	system_call+0x58/0x6c
 INFO: Freed in .xfs_efi_item_free+0x44/0x80 age=79 cpu=0 pid=12436
 	.kmem_cache_free+0x120/0x2b0
 	.xfs_efi_item_free+0x44/0x80
 	.xfs_trans_free_items+0xd4/0x130
 	.__xfs_trans_commit+0xd0/0x350
 	.xfs_trans_roll+0x4c/0x90
 	.xfs_defer_trans_roll+0xa4/0x2b0
 	.xfs_defer_finish+0xb8/0x5c0
 	.xfs_itruncate_extents+0x170/0x590
 	.xfs_inactive_truncate+0xcc/0x170
 	.xfs_inactive+0x1d8/0x2f0
 	.xfs_fs_destroy_inode+0xe4/0x3d0
 	.destroy_inode+0x68/0xb0
 	.do_unlinkat+0x1e8/0x390
 	system_call+0x58/0x6c

This happens due to the following interaction,
1. xfs_defer_finish() creates "extent free" intent item and adds it to the
   list at xfs_trans->t_items.
2. xfs_defer_trans_roll() invokes __xfs_trans_commit(). Here, if the
   XFS_MOUNT_FS_SHUTDOWN flag is set, we invoke io_unlock() operation of the
   corresponding log item. For "extent free" log items xfs_efi_item_unlock()
   gets invoked which frees the xfs_efi_log_item.
3. xfs_defer_trans_roll() then invokes xfs_defer_trans_abort(). Since the
   xfs_defer_pending->dfp_intent is still set to the "extent free" intent
   item, we invoke xfs_extent_free_abort_intent(). This accesses the
   previously freed xfs_efi_log_item to decrement the ref count.

This commit fixes the bug by changing xfs_efi_item_unlock() to invoke
xfs_efi_release() instead of xfs_efi_item_free(). The xfs_efi_log_item gets
freed when xfs_extent_free_abort_intent() invokes xfs_efi_release() once again
i.e. its refcount becomes zero and hence xfs_efi_item_free() is invoked.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---

The above mentioned "use after free" occurs with other log items.
(e.g. xfs_rui_item). If the below fix is found to be correct, I will
apply it to other affected log items as well and post a new patch.

 fs/xfs/xfs_extfree_item.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Chandan Rajendra Jan. 11, 2018, 1:25 p.m. UTC | #1
On Thursday, January 11, 2018 6:22:23 PM IST Chandan Rajendra wrote:
> generic/388 can cause the following "use after free" error to occur,
> 
>  =============================================================================
>  BUG xfs_efi_item (Not tainted): Poison overwritten
>  -----------------------------------------------------------------------------
> 
>  Disabling lock debugging due to kernel taint
>  INFO: 0x00000000292c4bd4-0x00000000292c4bd4. First byte 0x6a instead of 0x6b
>  INFO: Allocated in .kmem_zone_alloc+0xcc/0x190 age=79 cpu=0 pid=12436
>  	.__slab_alloc+0x54/0x80
>  	.kmem_cache_alloc+0x124/0x350
>  	.kmem_zone_alloc+0xcc/0x190
>  	.xfs_efi_init+0x48/0xf0
>  	.xfs_extent_free_create_intent+0x40/0x130
>  	.xfs_defer_intake_work+0x74/0x1e0
>  	.xfs_defer_finish+0xac/0x5c0
>  	.xfs_itruncate_extents+0x170/0x590
>  	.xfs_inactive_truncate+0xcc/0x170
>  	.xfs_inactive+0x1d8/0x2f0
>  	.xfs_fs_destroy_inode+0xe4/0x3d0
>  	.destroy_inode+0x68/0xb0
>  	.do_unlinkat+0x1e8/0x390
>  	system_call+0x58/0x6c
>  INFO: Freed in .xfs_efi_item_free+0x44/0x80 age=79 cpu=0 pid=12436
>  	.kmem_cache_free+0x120/0x2b0
>  	.xfs_efi_item_free+0x44/0x80
>  	.xfs_trans_free_items+0xd4/0x130
>  	.__xfs_trans_commit+0xd0/0x350
>  	.xfs_trans_roll+0x4c/0x90
>  	.xfs_defer_trans_roll+0xa4/0x2b0
>  	.xfs_defer_finish+0xb8/0x5c0
>  	.xfs_itruncate_extents+0x170/0x590
>  	.xfs_inactive_truncate+0xcc/0x170
>  	.xfs_inactive+0x1d8/0x2f0
>  	.xfs_fs_destroy_inode+0xe4/0x3d0
>  	.destroy_inode+0x68/0xb0
>  	.do_unlinkat+0x1e8/0x390
>  	system_call+0x58/0x6c
> 
> This happens due to the following interaction,
> 1. xfs_defer_finish() creates "extent free" intent item and adds it to the
>    list at xfs_trans->t_items.
> 2. xfs_defer_trans_roll() invokes __xfs_trans_commit(). Here, if the
>    XFS_MOUNT_FS_SHUTDOWN flag is set, we invoke io_unlock() operation of the
>    corresponding log item. For "extent free" log items xfs_efi_item_unlock()
>    gets invoked which frees the xfs_efi_log_item.
> 3. xfs_defer_trans_roll() then invokes xfs_defer_trans_abort(). Since the
>    xfs_defer_pending->dfp_intent is still set to the "extent free" intent
>    item, we invoke xfs_extent_free_abort_intent(). This accesses the
>    previously freed xfs_efi_log_item to decrement the ref count.
> 
> This commit fixes the bug by changing xfs_efi_item_unlock() to invoke
> xfs_efi_release() instead of xfs_efi_item_free(). The xfs_efi_log_item gets
> freed when xfs_extent_free_abort_intent() invokes xfs_efi_release() once again
> i.e. its refcount becomes zero and hence xfs_efi_item_free() is invoked.
> 
> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
> 
> The above mentioned "use after free" occurs with other log items.
> (e.g. xfs_rui_item). If the below fix is found to be correct, I will
> apply it to other affected log items as well and post a new patch.
> 

I am assuming that the refcount for xfs_efi_item is set to 2 to account
for,
1. Linking the corresponding log descriptor item to the transaction's
   t_items list.
2. For storing the pointer to the intent item at 
   xfs_defer_pending->dfp_intent.

Please let me know if my assumption is incorrect.
Brian Foster Jan. 12, 2018, 3:02 p.m. UTC | #2
On Thu, Jan 11, 2018 at 06:22:23PM +0530, Chandan Rajendra wrote:
> generic/388 can cause the following "use after free" error to occur,
> 
>  =============================================================================
>  BUG xfs_efi_item (Not tainted): Poison overwritten
>  -----------------------------------------------------------------------------
> 
>  Disabling lock debugging due to kernel taint
>  INFO: 0x00000000292c4bd4-0x00000000292c4bd4. First byte 0x6a instead of 0x6b
>  INFO: Allocated in .kmem_zone_alloc+0xcc/0x190 age=79 cpu=0 pid=12436
>  	.__slab_alloc+0x54/0x80
>  	.kmem_cache_alloc+0x124/0x350
>  	.kmem_zone_alloc+0xcc/0x190
>  	.xfs_efi_init+0x48/0xf0
>  	.xfs_extent_free_create_intent+0x40/0x130
>  	.xfs_defer_intake_work+0x74/0x1e0
>  	.xfs_defer_finish+0xac/0x5c0
>  	.xfs_itruncate_extents+0x170/0x590
>  	.xfs_inactive_truncate+0xcc/0x170
>  	.xfs_inactive+0x1d8/0x2f0
>  	.xfs_fs_destroy_inode+0xe4/0x3d0
>  	.destroy_inode+0x68/0xb0
>  	.do_unlinkat+0x1e8/0x390
>  	system_call+0x58/0x6c
>  INFO: Freed in .xfs_efi_item_free+0x44/0x80 age=79 cpu=0 pid=12436
>  	.kmem_cache_free+0x120/0x2b0
>  	.xfs_efi_item_free+0x44/0x80
>  	.xfs_trans_free_items+0xd4/0x130
>  	.__xfs_trans_commit+0xd0/0x350
>  	.xfs_trans_roll+0x4c/0x90
>  	.xfs_defer_trans_roll+0xa4/0x2b0
>  	.xfs_defer_finish+0xb8/0x5c0
>  	.xfs_itruncate_extents+0x170/0x590
>  	.xfs_inactive_truncate+0xcc/0x170
>  	.xfs_inactive+0x1d8/0x2f0
>  	.xfs_fs_destroy_inode+0xe4/0x3d0
>  	.destroy_inode+0x68/0xb0
>  	.do_unlinkat+0x1e8/0x390
>  	system_call+0x58/0x6c
> 
> This happens due to the following interaction,
> 1. xfs_defer_finish() creates "extent free" intent item and adds it to the
>    list at xfs_trans->t_items.
> 2. xfs_defer_trans_roll() invokes __xfs_trans_commit(). Here, if the
>    XFS_MOUNT_FS_SHUTDOWN flag is set, we invoke io_unlock() operation of the
>    corresponding log item. For "extent free" log items xfs_efi_item_unlock()
>    gets invoked which frees the xfs_efi_log_item.
> 3. xfs_defer_trans_roll() then invokes xfs_defer_trans_abort(). Since the
>    xfs_defer_pending->dfp_intent is still set to the "extent free" intent
>    item, we invoke xfs_extent_free_abort_intent(). This accesses the
>    previously freed xfs_efi_log_item to decrement the ref count.
> 
> This commit fixes the bug by changing xfs_efi_item_unlock() to invoke
> xfs_efi_release() instead of xfs_efi_item_free(). The xfs_efi_log_item gets
> freed when xfs_extent_free_abort_intent() invokes xfs_efi_release() once again
> i.e. its refcount becomes zero and hence xfs_efi_item_free() is invoked.
> 

Interesting.. trying to refamiliarize myself with the expected reference
counting rules here, I see that this goes back to the old
xfs_bmap_finish() mechanism. That guy would grab an EFI, log it and then
attempt to roll/commit the transaction. IIRC, the direct release covers
the case where that tx commit fails. That is essentially equivalent to a
tx cancel in this example, so the tx is freed and xfs_bmap_finish() is
done (returns with an error). The ->iop_unlock() frees the EFI in this
case explicitly because 1.) the refcount is initialized to 2, one for
the log and one for the EFD and 2.) the EFD is never created in this
situation.

Given that, I suspect this code was correct with the bmap_finish() code
and became a problem sometime later. Have you done any investigation to
confirm/deny whether this is a regression since v4.7 or so?

Moving along to the current code... xfs_defer_finish() creates the EFI,
logs it, etc. and moves on to xfs_trans_roll(). If the tx commit fails,
we're in a similar abort/cancel situation as before. The deferred ops code
tracks the EFI within the dfops data structures, however, so it calls
into the ->abort_intent() callout to drop the other reference and we hit
this problem.

AFAICT, the abort handler covers the only situation where an intent
would be left hanging around with an extra reference for the done item
but without the done item in existence yet. Given that, I think it's
safe to change the intent log state machine handling as this patch does.
The caveat is that with this change it is now required that the
committer of a failed transaction is always responsible for releasing
the reference on intent items for the associated done items that are not
going to be created (where it wasn't required before). IOW, I suspect we
could just as safely remove the abort handling code to resolve the
problem.  Since dfops seems to handle the situation correctly, I don't
really feel strongly either way.

I do think that this patch should at least:

- document the historical context in the commit log, provided the above
  can be verified (i.e., why/since when is this a problem?)
- update the comments in the intent item unlock handlers and probably
  the dfops code to make sure the reference counting rules are clear
- include the same updates for other applicable items in the same patch

... and finally, this should be tested against a frequent shutdown
workload if not done so already. E.g., run 'while [ true ]; do fsstress;
umount; mount; done' in one thread and 'while [ true ]; do sleep N;
xfs_io -xc "shutdown -f" ...; done' in another and verify we don't
explode/hang.

Brian

> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
> 
> The above mentioned "use after free" occurs with other log items.
> (e.g. xfs_rui_item). If the below fix is found to be correct, I will
> apply it to other affected log items as well and post a new patch.
> 
>  fs/xfs/xfs_extfree_item.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 64da906..2119441 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -151,7 +151,7 @@ xfs_efi_item_unlock(
>  	struct xfs_log_item	*lip)
>  {
>  	if (lip->li_flags & XFS_LI_ABORTED)
> -		xfs_efi_item_free(EFI_ITEM(lip));
> +		xfs_efi_release(EFI_ITEM(lip));
>  }
>  
>  /*
> -- 
> 2.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chandan Rajendra Jan. 16, 2018, 1:55 p.m. UTC | #3
On Friday, January 12, 2018 8:32:24 PM IST Brian Foster wrote:
> On Thu, Jan 11, 2018 at 06:22:23PM +0530, Chandan Rajendra wrote:
> > generic/388 can cause the following "use after free" error to occur,
> > 
> >  =============================================================================
> >  BUG xfs_efi_item (Not tainted): Poison overwritten
> >  -----------------------------------------------------------------------------
> > 
> >  Disabling lock debugging due to kernel taint
> >  INFO: 0x00000000292c4bd4-0x00000000292c4bd4. First byte 0x6a instead of 0x6b
> >  INFO: Allocated in .kmem_zone_alloc+0xcc/0x190 age=79 cpu=0 pid=12436
> >  	.__slab_alloc+0x54/0x80
> >  	.kmem_cache_alloc+0x124/0x350
> >  	.kmem_zone_alloc+0xcc/0x190
> >  	.xfs_efi_init+0x48/0xf0
> >  	.xfs_extent_free_create_intent+0x40/0x130
> >  	.xfs_defer_intake_work+0x74/0x1e0
> >  	.xfs_defer_finish+0xac/0x5c0
> >  	.xfs_itruncate_extents+0x170/0x590
> >  	.xfs_inactive_truncate+0xcc/0x170
> >  	.xfs_inactive+0x1d8/0x2f0
> >  	.xfs_fs_destroy_inode+0xe4/0x3d0
> >  	.destroy_inode+0x68/0xb0
> >  	.do_unlinkat+0x1e8/0x390
> >  	system_call+0x58/0x6c
> >  INFO: Freed in .xfs_efi_item_free+0x44/0x80 age=79 cpu=0 pid=12436
> >  	.kmem_cache_free+0x120/0x2b0
> >  	.xfs_efi_item_free+0x44/0x80
> >  	.xfs_trans_free_items+0xd4/0x130
> >  	.__xfs_trans_commit+0xd0/0x350
> >  	.xfs_trans_roll+0x4c/0x90
> >  	.xfs_defer_trans_roll+0xa4/0x2b0
> >  	.xfs_defer_finish+0xb8/0x5c0
> >  	.xfs_itruncate_extents+0x170/0x590
> >  	.xfs_inactive_truncate+0xcc/0x170
> >  	.xfs_inactive+0x1d8/0x2f0
> >  	.xfs_fs_destroy_inode+0xe4/0x3d0
> >  	.destroy_inode+0x68/0xb0
> >  	.do_unlinkat+0x1e8/0x390
> >  	system_call+0x58/0x6c
> > 
> > This happens due to the following interaction,
> > 1. xfs_defer_finish() creates "extent free" intent item and adds it to the
> >    list at xfs_trans->t_items.
> > 2. xfs_defer_trans_roll() invokes __xfs_trans_commit(). Here, if the
> >    XFS_MOUNT_FS_SHUTDOWN flag is set, we invoke io_unlock() operation of the
> >    corresponding log item. For "extent free" log items xfs_efi_item_unlock()
> >    gets invoked which frees the xfs_efi_log_item.
> > 3. xfs_defer_trans_roll() then invokes xfs_defer_trans_abort(). Since the
> >    xfs_defer_pending->dfp_intent is still set to the "extent free" intent
> >    item, we invoke xfs_extent_free_abort_intent(). This accesses the
> >    previously freed xfs_efi_log_item to decrement the ref count.
> > 
> > This commit fixes the bug by changing xfs_efi_item_unlock() to invoke
> > xfs_efi_release() instead of xfs_efi_item_free(). The xfs_efi_log_item gets
> > freed when xfs_extent_free_abort_intent() invokes xfs_efi_release() once again
> > i.e. its refcount becomes zero and hence xfs_efi_item_free() is invoked.
> > 
> 
> Interesting.. trying to refamiliarize myself with the expected reference
> counting rules here, I see that this goes back to the old
> xfs_bmap_finish() mechanism. That guy would grab an EFI, log it and then
> attempt to roll/commit the transaction. IIRC, the direct release covers
> the case where that tx commit fails. That is essentially equivalent to a
> tx cancel in this example, so the tx is freed and xfs_bmap_finish() is
> done (returns with an error). The ->iop_unlock() frees the EFI in this
> case explicitly because 1.) the refcount is initialized to 2, one for
> the log and one for the EFD and 2.) the EFD is never created in this
> situation.
> 
> Given that, I suspect this code was correct with the bmap_finish() code
> and became a problem sometime later. Have you done any investigation to
> confirm/deny whether this is a regression since v4.7 or so?

You are right. The code was working fine with v4.6.  IMHO, this was due the
"committed" argument that was passed to __xfs_trans_roll(). This 
argument tracked if the log items were committed to the CIL. 

If __xfs_trans_roll() did return an error and "committed" was set to  true,
we just release one of the refcounts of the efi, since the efd will never
be created. The final refcount will be released when the EFI is added
to the AIL.

If __xfs_trans_roll() did return an error and "committed" was set to false,
we end up freeing the efi via ->iop_unlock() function and no action
is taken in xfs_bmap_finish().  However, with the present code
xfs_defer_trans_roll() ends up unconditionally (well almost; It does
check for the presence of a non-NULL pointer at 
xfs_defer_pending->dfp_intent) invoking xfs_extent_free_abort_intent(),
which access the freed memory.

I am working on a new patch for fixing this. Thanks for guidance.
diff mbox

Patch

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 64da906..2119441 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -151,7 +151,7 @@  xfs_efi_item_unlock(
 	struct xfs_log_item	*lip)
 {
 	if (lip->li_flags & XFS_LI_ABORTED)
-		xfs_efi_item_free(EFI_ITEM(lip));
+		xfs_efi_release(EFI_ITEM(lip));
 }
 
 /*