diff mbox series

[RFC,v2,7/7] ext4: fix race between blkdev_releasepage() and ext4_put_super()

Message ID 20210414134737.2366971-8-yi.zhang@huawei.com (mailing list archive)
State New, archived
Headers show
Series ext4, jbd2: fix 3 issues about bdev_try_to_free_page() | expand

Commit Message

Zhang Yi April 14, 2021, 1:47 p.m. UTC
There still exist a use after free issue when accessing the journal
structure and ext4_sb_info structure on freeing bdev buffers in
bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be
raced by ext4_put_super(), it dose freeing sb->s_fs_info and
sbi->s_journal while release page progress are still accessing them.
So it could end up trigger use-after-free or NULL pointer dereference.

drop cache                           umount filesystem
blkdev_releasepage()
 get sb
 bdev_try_to_free_page()             ext4_put_super()
                                      kfree(journal)
  if access EXT4_SB(sb)->s_journal  <-- leader to use after free
                                      sb->s_fs_info = NULL;
                                      kfree(sbi)
  if access EXT4_SB(sb)->s_journal  <-- trigger NULL pointer dereference

The above race could also happens on the error path of
ext4_fill_super(). Now the bdev_try_to_free_page() is under RCU
protection, this patch fix this race by use sb->s_usage_counter to
make bdev_try_to_free_page() safe against concurrent kill_block_super().

Fixes: c39a7f84d784 ("ext4: provide function to release metadata pages under memory pressure")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/super.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig April 15, 2021, 2:52 p.m. UTC | #1
On Wed, Apr 14, 2021 at 09:47:37PM +0800, Zhang Yi wrote:
> There still exist a use after free issue when accessing the journal
> structure and ext4_sb_info structure on freeing bdev buffers in
> bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be
> raced by ext4_put_super(), it dose freeing sb->s_fs_info and
> sbi->s_journal while release page progress are still accessing them.
> So it could end up trigger use-after-free or NULL pointer dereference.

I think the right fix is to not even call into ->bdev_try_to_free_page
unless the superblock is active.
Zhang Yi April 16, 2021, 8 a.m. UTC | #2
Hi, Christoph

On 2021/4/15 22:52, Christoph Hellwig wrote:
> On Wed, Apr 14, 2021 at 09:47:37PM +0800, Zhang Yi wrote:
>> There still exist a use after free issue when accessing the journal
>> structure and ext4_sb_info structure on freeing bdev buffers in
>> bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be
>> raced by ext4_put_super(), it dose freeing sb->s_fs_info and
>> sbi->s_journal while release page progress are still accessing them.
>> So it could end up trigger use-after-free or NULL pointer dereference.
> 
> I think the right fix is to not even call into ->bdev_try_to_free_page
> unless the superblock is active.
> .
> 
Thanks for your suggestions.

Now, we use already use "if (bdev->bd_super)" to prevent call into
->bdev_try_to_free_page unless the super is alive, and the problem is
bd_super becomes NULL concurrently after this check. So, IIUC, I think it's
the same to switch to check the superblock is active or not. The acvive
flag also could becomes inactive (raced by umount) after we call into
bdev_try_to_free_page().

In order to close this race, One solution is introduce a lock to synchronize
the active state between kill_block_super() and blkdev_releasepage(), but
the releasing page process have to try to acquire this lock in
blkdev_releasepage() for each page, and the umount process still need to wait
until the page release if some one invoke into ->bdev_try_to_free_page().
I think this solution may affect performace and is not a good way.
Think about it in depth, use percpu refcount seems have the smallest
performance effect on blkdev_releasepage().

If you don't like the refcount, maybe we could add synchronize_rcu_expedited()
in ext4_put_super(), it also could prevent this race. Any suggestions?

Thanks,
Yi.
Christoph Hellwig April 20, 2021, 1:08 p.m. UTC | #3
On Fri, Apr 16, 2021 at 04:00:48PM +0800, Zhang Yi wrote:
> Now, we use already use "if (bdev->bd_super)" to prevent call into
> ->bdev_try_to_free_page unless the super is alive, and the problem is
> bd_super becomes NULL concurrently after this check. So, IIUC, I think it's
> the same to switch to check the superblock is active or not. The acvive
> flag also could becomes inactive (raced by umount) after we call into
> bdev_try_to_free_page().

Indeed.

> In order to close this race, One solution is introduce a lock to synchronize
> the active state between kill_block_super() and blkdev_releasepage(), but
> the releasing page process have to try to acquire this lock in
> blkdev_releasepage() for each page, and the umount process still need to wait
> until the page release if some one invoke into ->bdev_try_to_free_page().
> I think this solution may affect performace and is not a good way.
> Think about it in depth, use percpu refcount seems have the smallest
> performance effect on blkdev_releasepage().
> 
> If you don't like the refcount, maybe we could add synchronize_rcu_expedited()
> in ext4_put_super(), it also could prevent this race. Any suggestions?

I really don't like to put a lot of overhead into the core VFS and block
device code.  ext4/jbd does not own the block device inode and really
has no business controlling releasepage for it.  I suspect the right
answer might be to simply revert the commit that added this hook.
Jan Kara April 21, 2021, 1:46 p.m. UTC | #4
On Tue 20-04-21 14:08:41, Christoph Hellwig wrote:
> On Fri, Apr 16, 2021 at 04:00:48PM +0800, Zhang Yi wrote:
> > Now, we use already use "if (bdev->bd_super)" to prevent call into
> > ->bdev_try_to_free_page unless the super is alive, and the problem is
> > bd_super becomes NULL concurrently after this check. So, IIUC, I think it's
> > the same to switch to check the superblock is active or not. The acvive
> > flag also could becomes inactive (raced by umount) after we call into
> > bdev_try_to_free_page().
> 
> Indeed.
> 
> > In order to close this race, One solution is introduce a lock to synchronize
> > the active state between kill_block_super() and blkdev_releasepage(), but
> > the releasing page process have to try to acquire this lock in
> > blkdev_releasepage() for each page, and the umount process still need to wait
> > until the page release if some one invoke into ->bdev_try_to_free_page().
> > I think this solution may affect performace and is not a good way.
> > Think about it in depth, use percpu refcount seems have the smallest
> > performance effect on blkdev_releasepage().
> > 
> > If you don't like the refcount, maybe we could add synchronize_rcu_expedited()
> > in ext4_put_super(), it also could prevent this race. Any suggestions?
> 
> I really don't like to put a lot of overhead into the core VFS and block
> device code.  ext4/jbd does not own the block device inode and really
> has no business controlling releasepage for it.  I suspect the right
> answer might be to simply revert the commit that added this hook.

Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
by ext4. So maybe it is not that important? I agree with Zhang and
Christoph that getting the lifetime rules sorted out will be hairy and it
is questionable, whether it is worth the additional pages we can reclaim.
Ted, do you remember what was the original motivation for this?

								Honza
Theodore Ts'o April 21, 2021, 4:57 p.m. UTC | #5
On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote:
> 
> Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
> by ext4. So maybe it is not that important? I agree with Zhang and
> Christoph that getting the lifetime rules sorted out will be hairy and it
> is questionable, whether it is worth the additional pages we can reclaim.
> Ted, do you remember what was the original motivation for this?

The comment in fs/ext4/super.c is I thought a pretty good explanation:

/*
 * Try to release metadata pages (indirect blocks, directories) which are
 * mapped via the block device.  Since these pages could have journal heads
 * which would prevent try_to_free_buffers() from freeing them, we must use
 * jbd2 layer's try_to_free_buffers() function to release them.
 */

When we modify a metadata block, we attach a journal_head (jh)
structure to the buffer_head, and bump the ref count to prevent the
buffer from being freed.  Before the transaction is committed, the
buffer is marked jbddirty, but the dirty bit is not set until the
transaction commit.

At that back, writeback happens entirely at the discretion of the
buffer cache.  The jbd layer doesn't get notification when the I/O is
completed, nor when there is an I/O error.  (There was an attempt to
add a callback but that was NACK'ed because of a complaint that it was
jbd specific.)

So we don't actually know when it's safe to detach the jh from the
buffer_head and can drop the refcount so that the buffer_head can be
freed.  When the space in the journal starts getting low, we'll look
at at the jh's attached to completed transactions, and see how many of
them have clean bh's, and at that point, we can release the buffer
heads.

The other time when we'll attempt to detach jh's from clean buffers is
via bdev_try_to_free_buffers().  So if we drop the
bdev_try_to_free_page hook, then when we are under memory pressure,
there could be potentially a large percentage of the buffer cache
which can't be freed, and so the OOM-killer might trigger more often.

Now, if we could get a callback on I/O completion on a per-bh basis,
then we could detach the jh when the buffer is clean --- and as a
bonus, we'd get a notification when there was an I/O error writing
back a metadata block, which would be even better.

So how about an even swap?  If we can get a buffer I/O completion
callback, we can drop bdev_to_free_swap hook.....

	     	      			- Ted
Jan Kara April 22, 2021, 9:04 a.m. UTC | #6
On Wed 21-04-21 12:57:39, Theodore Ts'o wrote:
> On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote:
> > 
> > Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
> > by ext4. So maybe it is not that important? I agree with Zhang and
> > Christoph that getting the lifetime rules sorted out will be hairy and it
> > is questionable, whether it is worth the additional pages we can reclaim.
> > Ted, do you remember what was the original motivation for this?
> 
> The comment in fs/ext4/super.c is I thought a pretty good explanation:
> 
> /*
>  * Try to release metadata pages (indirect blocks, directories) which are
>  * mapped via the block device.  Since these pages could have journal heads
>  * which would prevent try_to_free_buffers() from freeing them, we must use
>  * jbd2 layer's try_to_free_buffers() function to release them.
>  */
> 
> When we modify a metadata block, we attach a journal_head (jh)
> structure to the buffer_head, and bump the ref count to prevent the
> buffer from being freed.  Before the transaction is committed, the
> buffer is marked jbddirty, but the dirty bit is not set until the
> transaction commit.
> 
> At that back, writeback happens entirely at the discretion of the
> buffer cache.  The jbd layer doesn't get notification when the I/O is
> completed, nor when there is an I/O error.  (There was an attempt to
> add a callback but that was NACK'ed because of a complaint that it was
> jbd specific.)
> 
> So we don't actually know when it's safe to detach the jh from the
> buffer_head and can drop the refcount so that the buffer_head can be
> freed.  When the space in the journal starts getting low, we'll look
> at at the jh's attached to completed transactions, and see how many of
> them have clean bh's, and at that point, we can release the buffer
> heads.
> 
> The other time when we'll attempt to detach jh's from clean buffers is
> via bdev_try_to_free_buffers().  So if we drop the
> bdev_try_to_free_page hook, then when we are under memory pressure,
> there could be potentially a large percentage of the buffer cache
> which can't be freed, and so the OOM-killer might trigger more often.

Yes, I understand that. What I was more asking about is: Does it really
matter we leave those buffer heads and journal heads unreclaimed. I
understand it could be triggering premature OOM in theory but is it a
problem in practice? Was there some observed practical case for which this
was added or was it just added due to the theoretical concern?

> Now, if we could get a callback on I/O completion on a per-bh basis,
> then we could detach the jh when the buffer is clean --- and as a
> bonus, we'd get a notification when there was an I/O error writing
> back a metadata block, which would be even better.
> 
> So how about an even swap?  If we can get a buffer I/O completion
> callback, we can drop bdev_to_free_swap hook.....

I'm OK with that because mainly for IO error reporting it makes sense to
me. For this memory reclaim problem I think we have also other reasonably
sensible options. E.g. we could have a shrinker that would just walk the
checkpoint list and reclaim journal heads for whatever is already written
out... Or we could just release journal heads already after commit and
during checkpoint we'd fetch the list of blocks that may need to be written
out e.g. from journal descriptor blocks. This would be a larger overhaul
but as a bonus we'd get rid of probably the last place in the kernel which
can write out page contents through buffer heads without updating page
state properly (and thus get rid of some workarounds in mm code as well).

								Honza
Zhang Yi April 23, 2021, 11:39 a.m. UTC | #7
On 2021/4/22 17:04, Jan Kara wrote:
> On Wed 21-04-21 12:57:39, Theodore Ts'o wrote:
>> On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote:
>>>
>>> Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
>>> by ext4. So maybe it is not that important? I agree with Zhang and
>>> Christoph that getting the lifetime rules sorted out will be hairy and it
>>> is questionable, whether it is worth the additional pages we can reclaim.
>>> Ted, do you remember what was the original motivation for this?
>>
>> The comment in fs/ext4/super.c is I thought a pretty good explanation:
>>
>> /*
>>  * Try to release metadata pages (indirect blocks, directories) which are
>>  * mapped via the block device.  Since these pages could have journal heads
>>  * which would prevent try_to_free_buffers() from freeing them, we must use
>>  * jbd2 layer's try_to_free_buffers() function to release them.
>>  */
>>
>> When we modify a metadata block, we attach a journal_head (jh)
>> structure to the buffer_head, and bump the ref count to prevent the
>> buffer from being freed.  Before the transaction is committed, the
>> buffer is marked jbddirty, but the dirty bit is not set until the
>> transaction commit.
>>
>> At that back, writeback happens entirely at the discretion of the
>> buffer cache.  The jbd layer doesn't get notification when the I/O is
>> completed, nor when there is an I/O error.  (There was an attempt to
>> add a callback but that was NACK'ed because of a complaint that it was
>> jbd specific.)
>>
>> So we don't actually know when it's safe to detach the jh from the
>> buffer_head and can drop the refcount so that the buffer_head can be
>> freed.  When the space in the journal starts getting low, we'll look
>> at at the jh's attached to completed transactions, and see how many of
>> them have clean bh's, and at that point, we can release the buffer
>> heads.
>>
>> The other time when we'll attempt to detach jh's from clean buffers is
>> via bdev_try_to_free_buffers().  So if we drop the
>> bdev_try_to_free_page hook, then when we are under memory pressure,
>> there could be potentially a large percentage of the buffer cache
>> which can't be freed, and so the OOM-killer might trigger more often.
> 
> Yes, I understand that. What I was more asking about is: Does it really
> matter we leave those buffer heads and journal heads unreclaimed. I
> understand it could be triggering premature OOM in theory but is it a
> problem in practice? Was there some observed practical case for which this
> was added or was it just added due to the theoretical concern?
> 
>> Now, if we could get a callback on I/O completion on a per-bh basis,
>> then we could detach the jh when the buffer is clean --- and as a
>> bonus, we'd get a notification when there was an I/O error writing
>> back a metadata block, which would be even better.
>>
>> So how about an even swap?  If we can get a buffer I/O completion
>> callback, we can drop bdev_to_free_swap hook.....
> 
> I'm OK with that because mainly for IO error reporting it makes sense to
> me. For this memory reclaim problem I think we have also other reasonably
> sensible options. E.g. we could have a shrinker that would just walk the
> checkpoint list and reclaim journal heads for whatever is already written
> out... Or we could just release journal heads already after commit and
> during checkpoint we'd fetch the list of blocks that may need to be written
> out e.g. from journal descriptor blocks. This would be a larger overhaul
> but as a bonus we'd get rid of probably the last place in the kernel which
> can write out page contents through buffer heads without updating page
> state properly (and thus get rid of some workarounds in mm code as well).

Thanks for these suggestions, I get your first solution and sounds good, but
I do not understand your last sentence, how does ext4 not updating page state
properly? Could you explain it more clearly?

Thanks,
Yi.
Theodore Ts'o April 23, 2021, 2:40 p.m. UTC | #8
On Thu, Apr 22, 2021 at 11:04:11AM +0200, Jan Kara wrote:
> Yes, I understand that. What I was more asking about is: Does it really
> matter we leave those buffer heads and journal heads unreclaimed. I
> understand it could be triggering premature OOM in theory but is it a
> problem in practice? Was there some observed practical case for which this
> was added or was it just added due to the theoretical concern?

I was doing some research, and found the mail thread which inspired
bdev_try_to_free_page():

https://lore.kernel.org/linux-ext4/20081202200647.72cc5807.toshi.okajima@jp.fujitsu.com/

From what I can tell Toshi Okajima did have a test workload which
would trigger blkdev_releasepage().  He didn't specify it in the mail
thread as near as I can tell, but he did use it to test the page.
Thinking about it, it shouldn't be hard to trigger it via something like:

   find /mnt -print0 | xargs -0 touch

in a memory contrained box with a large file system attached (a
bookshelf NAS scenario).  Under the right circumstances, I'm pretty
sure a premature OOM could be demonstrated.

					- Ted
Jan Kara April 23, 2021, 4:06 p.m. UTC | #9
On Fri 23-04-21 19:39:09, Zhang Yi wrote:
> On 2021/4/22 17:04, Jan Kara wrote:
> > I'm OK with that because mainly for IO error reporting it makes sense to
> > me. For this memory reclaim problem I think we have also other reasonably
> > sensible options. E.g. we could have a shrinker that would just walk the
> > checkpoint list and reclaim journal heads for whatever is already written
> > out... Or we could just release journal heads already after commit and
> > during checkpoint we'd fetch the list of blocks that may need to be written
> > out e.g. from journal descriptor blocks. This would be a larger overhaul
> > but as a bonus we'd get rid of probably the last place in the kernel which
> > can write out page contents through buffer heads without updating page
> > state properly (and thus get rid of some workarounds in mm code as well).
> 
> Thanks for these suggestions, I get your first solution and sounds good, but
> I do not understand your last sentence, how does ext4 not updating page state
> properly? Could you explain it more clearly?

The problem with current checkpointing code is that it writes out dirty
buffer heads through submit_bh() function without updating page dirty state
or without setting PageWriteback bit (because we cannot easily grab page
lock in those places due to lock ordering). Thus we can end up with a page
that is dirty but in fact does not hold any dirty data (none of the buffer
heads is dirty) and also locking a page and checking for PageWriteback
isn't enough to be sure page is not under IO. This is ugly and requires
some workarounds in MM code...

								Honza
diff mbox series

Patch

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 55c7a0d8d77d..14eedd8e5bd3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1172,6 +1172,12 @@  static void ext4_put_super(struct super_block *sb)
 	 */
 	ext4_unregister_sysfs(sb);
 
+	/*
+	 * Prevent racing with bdev_try_to_free_page() access the sbi and
+	 * journal concurrently.
+	 */
+	sb_usage_counter_wait(sb);
+
 	if (sbi->s_journal) {
 		aborted = is_journal_aborted(sbi->s_journal);
 		err = jbd2_journal_destroy(sbi->s_journal);
@@ -1248,6 +1254,7 @@  static void ext4_put_super(struct super_block *sb)
 		kthread_stop(sbi->s_mmp_tsk);
 	brelse(sbi->s_sbh);
 	sb->s_fs_info = NULL;
+	percpu_ref_exit(&sb->s_usage_counter);
 	/*
 	 * Now that we are completely done shutting down the
 	 * superblock, we need to actually destroy the kobject.
@@ -1450,15 +1457,22 @@  static int ext4_nfs_commit_metadata(struct inode *inode)
 static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
 				 gfp_t wait)
 {
-	journal_t *journal = EXT4_SB(sb)->s_journal;
+	int ret = 0;
 
 	WARN_ON(PageChecked(page));
 	if (!page_has_buffers(page))
 		return 0;
-	if (journal)
-		return jbd2_journal_try_to_free_buffers(journal, page);
 
-	return 0;
+	/* Racing with umount filesystem concurrently? */
+	if (percpu_ref_tryget_live(&sb->s_usage_counter)) {
+		journal_t *journal = EXT4_SB(sb)->s_journal;
+
+		if (journal)
+			ret = jbd2_journal_try_to_free_buffers(journal, page);
+		percpu_ref_put(&sb->s_usage_counter);
+	}
+
+	return ret;
 }
 
 #ifdef CONFIG_FS_ENCRYPTION
@@ -4709,6 +4723,9 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	spin_lock_init(&sbi->s_error_lock);
 	INIT_WORK(&sbi->s_error_work, flush_stashed_error_work);
 
+	if (sb_usage_counter_init(sb))
+		goto failed_mount2a;
+
 	/* Register extent status tree shrinker */
 	if (ext4_es_register_shrinker(sbi))
 		goto failed_mount3;
@@ -5148,6 +5165,8 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
 	sbi->s_ea_block_cache = NULL;
 
+	sb->s_bdev->bd_super = NULL;
+	sb_usage_counter_wait(sb);
 	if (sbi->s_journal) {
 		jbd2_journal_destroy(sbi->s_journal);
 		sbi->s_journal = NULL;
@@ -5155,6 +5174,8 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 failed_mount3a:
 	ext4_es_unregister_shrinker(sbi);
 failed_mount3:
+	percpu_ref_exit(&sb->s_usage_counter);
+failed_mount2a:
 	flush_work(&sbi->s_error_work);
 	del_timer_sync(&sbi->s_err_report);
 	if (sbi->s_mmp_tsk)