Message ID | 20210414134737.2366971-8-yi.zhang@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ext4, jbd2: fix 3 issues about bdev_try_to_free_page() | expand |
On Wed, Apr 14, 2021 at 09:47:37PM +0800, Zhang Yi wrote: > There still exist a use after free issue when accessing the journal > structure and ext4_sb_info structure on freeing bdev buffers in > bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be > raced by ext4_put_super(), it dose freeing sb->s_fs_info and > sbi->s_journal while release page progress are still accessing them. > So it could end up trigger use-after-free or NULL pointer dereference. I think the right fix is to not even call into ->bdev_try_to_free_page unless the superblock is active.
Hi, Christoph On 2021/4/15 22:52, Christoph Hellwig wrote: > On Wed, Apr 14, 2021 at 09:47:37PM +0800, Zhang Yi wrote: >> There still exist a use after free issue when accessing the journal >> structure and ext4_sb_info structure on freeing bdev buffers in >> bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be >> raced by ext4_put_super(), it dose freeing sb->s_fs_info and >> sbi->s_journal while release page progress are still accessing them. >> So it could end up trigger use-after-free or NULL pointer dereference. > > I think the right fix is to not even call into ->bdev_try_to_free_page > unless the superblock is active. > . > Thanks for your suggestions. Now, we use already use "if (bdev->bd_super)" to prevent call into ->bdev_try_to_free_page unless the super is alive, and the problem is bd_super becomes NULL concurrently after this check. So, IIUC, I think it's the same to switch to check the superblock is active or not. The acvive flag also could becomes inactive (raced by umount) after we call into bdev_try_to_free_page(). In order to close this race, One solution is introduce a lock to synchronize the active state between kill_block_super() and blkdev_releasepage(), but the releasing page process have to try to acquire this lock in blkdev_releasepage() for each page, and the umount process still need to wait until the page release if some one invoke into ->bdev_try_to_free_page(). I think this solution may affect performace and is not a good way. Think about it in depth, use percpu refcount seems have the smallest performance effect on blkdev_releasepage(). If you don't like the refcount, maybe we could add synchronize_rcu_expedited() in ext4_put_super(), it also could prevent this race. Any suggestions? Thanks, Yi.
On Fri, Apr 16, 2021 at 04:00:48PM +0800, Zhang Yi wrote: > Now, we use already use "if (bdev->bd_super)" to prevent call into > ->bdev_try_to_free_page unless the super is alive, and the problem is > bd_super becomes NULL concurrently after this check. So, IIUC, I think it's > the same to switch to check the superblock is active or not. The acvive > flag also could becomes inactive (raced by umount) after we call into > bdev_try_to_free_page(). Indeed. > In order to close this race, One solution is introduce a lock to synchronize > the active state between kill_block_super() and blkdev_releasepage(), but > the releasing page process have to try to acquire this lock in > blkdev_releasepage() for each page, and the umount process still need to wait > until the page release if some one invoke into ->bdev_try_to_free_page(). > I think this solution may affect performace and is not a good way. > Think about it in depth, use percpu refcount seems have the smallest > performance effect on blkdev_releasepage(). > > If you don't like the refcount, maybe we could add synchronize_rcu_expedited() > in ext4_put_super(), it also could prevent this race. Any suggestions? I really don't like to put a lot of overhead into the core VFS and block device code. ext4/jbd does not own the block device inode and really has no business controlling releasepage for it. I suspect the right answer might be to simply revert the commit that added this hook.
On Tue 20-04-21 14:08:41, Christoph Hellwig wrote: > On Fri, Apr 16, 2021 at 04:00:48PM +0800, Zhang Yi wrote: > > Now, we use already use "if (bdev->bd_super)" to prevent call into > > ->bdev_try_to_free_page unless the super is alive, and the problem is > > bd_super becomes NULL concurrently after this check. So, IIUC, I think it's > > the same to switch to check the superblock is active or not. The acvive > > flag also could becomes inactive (raced by umount) after we call into > > bdev_try_to_free_page(). > > Indeed. > > > In order to close this race, One solution is introduce a lock to synchronize > > the active state between kill_block_super() and blkdev_releasepage(), but > > the releasing page process have to try to acquire this lock in > > blkdev_releasepage() for each page, and the umount process still need to wait > > until the page release if some one invoke into ->bdev_try_to_free_page(). > > I think this solution may affect performace and is not a good way. > > Think about it in depth, use percpu refcount seems have the smallest > > performance effect on blkdev_releasepage(). > > > > If you don't like the refcount, maybe we could add synchronize_rcu_expedited() > > in ext4_put_super(), it also could prevent this race. Any suggestions? > > I really don't like to put a lot of overhead into the core VFS and block > device code. ext4/jbd does not own the block device inode and really > has no business controlling releasepage for it. I suspect the right > answer might be to simply revert the commit that added this hook. Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only by ext4. So maybe it is not that important? I agree with Zhang and Christoph that getting the lifetime rules sorted out will be hairy and it is questionable, whether it is worth the additional pages we can reclaim. Ted, do you remember what was the original motivation for this? Honza
On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote: > > Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only > by ext4. So maybe it is not that important? I agree with Zhang and > Christoph that getting the lifetime rules sorted out will be hairy and it > is questionable, whether it is worth the additional pages we can reclaim. > Ted, do you remember what was the original motivation for this? The comment in fs/ext4/super.c is I thought a pretty good explanation: /* * Try to release metadata pages (indirect blocks, directories) which are * mapped via the block device. Since these pages could have journal heads * which would prevent try_to_free_buffers() from freeing them, we must use * jbd2 layer's try_to_free_buffers() function to release them. */ When we modify a metadata block, we attach a journal_head (jh) structure to the buffer_head, and bump the ref count to prevent the buffer from being freed. Before the transaction is committed, the buffer is marked jbddirty, but the dirty bit is not set until the transaction commit. At that back, writeback happens entirely at the discretion of the buffer cache. The jbd layer doesn't get notification when the I/O is completed, nor when there is an I/O error. (There was an attempt to add a callback but that was NACK'ed because of a complaint that it was jbd specific.) So we don't actually know when it's safe to detach the jh from the buffer_head and can drop the refcount so that the buffer_head can be freed. When the space in the journal starts getting low, we'll look at at the jh's attached to completed transactions, and see how many of them have clean bh's, and at that point, we can release the buffer heads. The other time when we'll attempt to detach jh's from clean buffers is via bdev_try_to_free_buffers(). So if we drop the bdev_try_to_free_page hook, then when we are under memory pressure, there could be potentially a large percentage of the buffer cache which can't be freed, and so the OOM-killer might trigger more often. Now, if we could get a callback on I/O completion on a per-bh basis, then we could detach the jh when the buffer is clean --- and as a bonus, we'd get a notification when there was an I/O error writing back a metadata block, which would be even better. So how about an even swap? If we can get a buffer I/O completion callback, we can drop bdev_to_free_swap hook..... - Ted
On Wed 21-04-21 12:57:39, Theodore Ts'o wrote: > On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote: > > > > Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only > > by ext4. So maybe it is not that important? I agree with Zhang and > > Christoph that getting the lifetime rules sorted out will be hairy and it > > is questionable, whether it is worth the additional pages we can reclaim. > > Ted, do you remember what was the original motivation for this? > > The comment in fs/ext4/super.c is I thought a pretty good explanation: > > /* > * Try to release metadata pages (indirect blocks, directories) which are > * mapped via the block device. Since these pages could have journal heads > * which would prevent try_to_free_buffers() from freeing them, we must use > * jbd2 layer's try_to_free_buffers() function to release them. > */ > > When we modify a metadata block, we attach a journal_head (jh) > structure to the buffer_head, and bump the ref count to prevent the > buffer from being freed. Before the transaction is committed, the > buffer is marked jbddirty, but the dirty bit is not set until the > transaction commit. > > At that back, writeback happens entirely at the discretion of the > buffer cache. The jbd layer doesn't get notification when the I/O is > completed, nor when there is an I/O error. (There was an attempt to > add a callback but that was NACK'ed because of a complaint that it was > jbd specific.) > > So we don't actually know when it's safe to detach the jh from the > buffer_head and can drop the refcount so that the buffer_head can be > freed. When the space in the journal starts getting low, we'll look > at at the jh's attached to completed transactions, and see how many of > them have clean bh's, and at that point, we can release the buffer > heads. > > The other time when we'll attempt to detach jh's from clean buffers is > via bdev_try_to_free_buffers(). So if we drop the > bdev_try_to_free_page hook, then when we are under memory pressure, > there could be potentially a large percentage of the buffer cache > which can't be freed, and so the OOM-killer might trigger more often. Yes, I understand that. What I was more asking about is: Does it really matter we leave those buffer heads and journal heads unreclaimed. I understand it could be triggering premature OOM in theory but is it a problem in practice? Was there some observed practical case for which this was added or was it just added due to the theoretical concern? > Now, if we could get a callback on I/O completion on a per-bh basis, > then we could detach the jh when the buffer is clean --- and as a > bonus, we'd get a notification when there was an I/O error writing > back a metadata block, which would be even better. > > So how about an even swap? If we can get a buffer I/O completion > callback, we can drop bdev_to_free_swap hook..... I'm OK with that because mainly for IO error reporting it makes sense to me. For this memory reclaim problem I think we have also other reasonably sensible options. E.g. we could have a shrinker that would just walk the checkpoint list and reclaim journal heads for whatever is already written out... Or we could just release journal heads already after commit and during checkpoint we'd fetch the list of blocks that may need to be written out e.g. from journal descriptor blocks. This would be a larger overhaul but as a bonus we'd get rid of probably the last place in the kernel which can write out page contents through buffer heads without updating page state properly (and thus get rid of some workarounds in mm code as well). Honza
On 2021/4/22 17:04, Jan Kara wrote: > On Wed 21-04-21 12:57:39, Theodore Ts'o wrote: >> On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote: >>> >>> Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only >>> by ext4. So maybe it is not that important? I agree with Zhang and >>> Christoph that getting the lifetime rules sorted out will be hairy and it >>> is questionable, whether it is worth the additional pages we can reclaim. >>> Ted, do you remember what was the original motivation for this? >> >> The comment in fs/ext4/super.c is I thought a pretty good explanation: >> >> /* >> * Try to release metadata pages (indirect blocks, directories) which are >> * mapped via the block device. Since these pages could have journal heads >> * which would prevent try_to_free_buffers() from freeing them, we must use >> * jbd2 layer's try_to_free_buffers() function to release them. >> */ >> >> When we modify a metadata block, we attach a journal_head (jh) >> structure to the buffer_head, and bump the ref count to prevent the >> buffer from being freed. Before the transaction is committed, the >> buffer is marked jbddirty, but the dirty bit is not set until the >> transaction commit. >> >> At that back, writeback happens entirely at the discretion of the >> buffer cache. The jbd layer doesn't get notification when the I/O is >> completed, nor when there is an I/O error. (There was an attempt to >> add a callback but that was NACK'ed because of a complaint that it was >> jbd specific.) >> >> So we don't actually know when it's safe to detach the jh from the >> buffer_head and can drop the refcount so that the buffer_head can be >> freed. When the space in the journal starts getting low, we'll look >> at at the jh's attached to completed transactions, and see how many of >> them have clean bh's, and at that point, we can release the buffer >> heads. >> >> The other time when we'll attempt to detach jh's from clean buffers is >> via bdev_try_to_free_buffers(). So if we drop the >> bdev_try_to_free_page hook, then when we are under memory pressure, >> there could be potentially a large percentage of the buffer cache >> which can't be freed, and so the OOM-killer might trigger more often. > > Yes, I understand that. What I was more asking about is: Does it really > matter we leave those buffer heads and journal heads unreclaimed. I > understand it could be triggering premature OOM in theory but is it a > problem in practice? Was there some observed practical case for which this > was added or was it just added due to the theoretical concern? > >> Now, if we could get a callback on I/O completion on a per-bh basis, >> then we could detach the jh when the buffer is clean --- and as a >> bonus, we'd get a notification when there was an I/O error writing >> back a metadata block, which would be even better. >> >> So how about an even swap? If we can get a buffer I/O completion >> callback, we can drop bdev_to_free_swap hook..... > > I'm OK with that because mainly for IO error reporting it makes sense to > me. For this memory reclaim problem I think we have also other reasonably > sensible options. E.g. we could have a shrinker that would just walk the > checkpoint list and reclaim journal heads for whatever is already written > out... Or we could just release journal heads already after commit and > during checkpoint we'd fetch the list of blocks that may need to be written > out e.g. from journal descriptor blocks. This would be a larger overhaul > but as a bonus we'd get rid of probably the last place in the kernel which > can write out page contents through buffer heads without updating page > state properly (and thus get rid of some workarounds in mm code as well). Thanks for these suggestions, I get your first solution and sounds good, but I do not understand your last sentence, how does ext4 not updating page state properly? Could you explain it more clearly? Thanks, Yi.
On Thu, Apr 22, 2021 at 11:04:11AM +0200, Jan Kara wrote: > Yes, I understand that. What I was more asking about is: Does it really > matter we leave those buffer heads and journal heads unreclaimed. I > understand it could be triggering premature OOM in theory but is it a > problem in practice? Was there some observed practical case for which this > was added or was it just added due to the theoretical concern? I was doing some research, and found the mail thread which inspired bdev_try_to_free_page(): https://lore.kernel.org/linux-ext4/20081202200647.72cc5807.toshi.okajima@jp.fujitsu.com/ From what I can tell Toshi Okajima did have a test workload which would trigger blkdev_releasepage(). He didn't specify it in the mail thread as near as I can tell, but he did use it to test the page. Thinking about it, it shouldn't be hard to trigger it via something like: find /mnt -print0 | xargs -0 touch in a memory contrained box with a large file system attached (a bookshelf NAS scenario). Under the right circumstances, I'm pretty sure a premature OOM could be demonstrated. - Ted
On Fri 23-04-21 19:39:09, Zhang Yi wrote: > On 2021/4/22 17:04, Jan Kara wrote: > > I'm OK with that because mainly for IO error reporting it makes sense to > > me. For this memory reclaim problem I think we have also other reasonably > > sensible options. E.g. we could have a shrinker that would just walk the > > checkpoint list and reclaim journal heads for whatever is already written > > out... Or we could just release journal heads already after commit and > > during checkpoint we'd fetch the list of blocks that may need to be written > > out e.g. from journal descriptor blocks. This would be a larger overhaul > > but as a bonus we'd get rid of probably the last place in the kernel which > > can write out page contents through buffer heads without updating page > > state properly (and thus get rid of some workarounds in mm code as well). > > Thanks for these suggestions, I get your first solution and sounds good, but > I do not understand your last sentence, how does ext4 not updating page state > properly? Could you explain it more clearly? The problem with current checkpointing code is that it writes out dirty buffer heads through submit_bh() function without updating page dirty state or without setting PageWriteback bit (because we cannot easily grab page lock in those places due to lock ordering). Thus we can end up with a page that is dirty but in fact does not hold any dirty data (none of the buffer heads is dirty) and also locking a page and checking for PageWriteback isn't enough to be sure page is not under IO. This is ugly and requires some workarounds in MM code... Honza
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 55c7a0d8d77d..14eedd8e5bd3 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1172,6 +1172,12 @@ static void ext4_put_super(struct super_block *sb) */ ext4_unregister_sysfs(sb); + /* + * Prevent racing with bdev_try_to_free_page() access the sbi and + * journal concurrently. + */ + sb_usage_counter_wait(sb); + if (sbi->s_journal) { aborted = is_journal_aborted(sbi->s_journal); err = jbd2_journal_destroy(sbi->s_journal); @@ -1248,6 +1254,7 @@ static void ext4_put_super(struct super_block *sb) kthread_stop(sbi->s_mmp_tsk); brelse(sbi->s_sbh); sb->s_fs_info = NULL; + percpu_ref_exit(&sb->s_usage_counter); /* * Now that we are completely done shutting down the * superblock, we need to actually destroy the kobject. @@ -1450,15 +1457,22 @@ static int ext4_nfs_commit_metadata(struct inode *inode) static int bdev_try_to_free_page(struct super_block *sb, struct page *page, gfp_t wait) { - journal_t *journal = EXT4_SB(sb)->s_journal; + int ret = 0; WARN_ON(PageChecked(page)); if (!page_has_buffers(page)) return 0; - if (journal) - return jbd2_journal_try_to_free_buffers(journal, page); - return 0; + /* Racing with umount filesystem concurrently? */ + if (percpu_ref_tryget_live(&sb->s_usage_counter)) { + journal_t *journal = EXT4_SB(sb)->s_journal; + + if (journal) + ret = jbd2_journal_try_to_free_buffers(journal, page); + percpu_ref_put(&sb->s_usage_counter); + } + + return ret; } #ifdef CONFIG_FS_ENCRYPTION @@ -4709,6 +4723,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) spin_lock_init(&sbi->s_error_lock); INIT_WORK(&sbi->s_error_work, flush_stashed_error_work); + if (sb_usage_counter_init(sb)) + goto failed_mount2a; + /* Register extent status tree shrinker */ if (ext4_es_register_shrinker(sbi)) goto failed_mount3; @@ -5148,6 +5165,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) ext4_xattr_destroy_cache(sbi->s_ea_block_cache); sbi->s_ea_block_cache = NULL; + sb->s_bdev->bd_super = NULL; + sb_usage_counter_wait(sb); if (sbi->s_journal) { jbd2_journal_destroy(sbi->s_journal); sbi->s_journal = NULL; @@ -5155,6 +5174,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) failed_mount3a: ext4_es_unregister_shrinker(sbi); failed_mount3: + percpu_ref_exit(&sb->s_usage_counter); +failed_mount2a: flush_work(&sbi->s_error_work); del_timer_sync(&sbi->s_err_report); if (sbi->s_mmp_tsk)
There still exist a use after free issue when accessing the journal structure and ext4_sb_info structure on freeing bdev buffers in bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be raced by ext4_put_super(), it dose freeing sb->s_fs_info and sbi->s_journal while release page progress are still accessing them. So it could end up trigger use-after-free or NULL pointer dereference. drop cache umount filesystem blkdev_releasepage() get sb bdev_try_to_free_page() ext4_put_super() kfree(journal) if access EXT4_SB(sb)->s_journal <-- leader to use after free sb->s_fs_info = NULL; kfree(sbi) if access EXT4_SB(sb)->s_journal <-- trigger NULL pointer dereference The above race could also happens on the error path of ext4_fill_super(). Now the bdev_try_to_free_page() is under RCU protection, this patch fix this race by use sb->s_usage_counter to make bdev_try_to_free_page() safe against concurrent kill_block_super(). Fixes: c39a7f84d784 ("ext4: provide function to release metadata pages under memory pressure") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> --- fs/ext4/super.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)