Message ID | 1c50d284270034703a5c99a42799ff77de871934.1742255123.git.boris@bur.io |
---|---|
State | New |
Series | btrfs: fix broken drop_caches on extent_buffer folios |
On Mon, Mar 17, 2025 at 11:47 PM Boris Burkov <boris@bur.io> wrote:
>
> The (correct) commit
> e41c81d0d30e ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()")
> replaced the page_mapped(page) check with a refcount check. However,
> this refcount check does not work as expected with drop_caches for
> btrfs's metadata pages.
>
> Btrfs has a per-sb metadata inode with cached pages, and when not in
> active use by btrfs, they have a refcount of 3. One from the initial
> call to alloc_pages, one (nr_pages == 1) from filemap_add_folio, and one
> from folio_attach_private. We would expect such pages to get dropped by
> drop_caches. However, drop_caches calls into mapping_evict_folio via
> mapping_try_invalidate which gets a reference on the folio with
> find_lock_entries(). As a result, these pages have a refcount of 4, and
> fail this check.
>
> For what it's worth, such pages do get reclaimed under memory pressure,
> so I would say that while this behavior is surprising, it is not really
> dangerously broken.
>
> The following script produces such pages and uses drgn to further
> analyze the state of the folios at various stages in the lifecycle
> including drop_caches and memory pressure.
> https://github.com/boryas/scripts/blob/main/sh/strand-meta/run.sh

Not sure if it's a good thing to add URLs not as permanent as lore and
kernel.org...
I would place the script in the change log itself.

>
> When I asked the mm folks about the expected refcount in this case, I
> was told that the correct thing to do is to donate the refcount from the
> original allocation to the page cache after inserting it.
> https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
>
> Therefore, attempt to fix this by adding a folio_put() to the critical
> spot in alloc_extent_buffer where we are sure that we have really
> allocated and attached new pages.
>
> Since detach_extent_buffer_folio() has relatively complex logic w.r.t.
> early exits and whether or not it actually calls folio_detach_private(),
> the easiest way to ensure we don't incur a UAF in that function is to
> wrap it in a buffer refcount so that the private reference cannot be the
> last one.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/extent_io.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 7abe6ca5b38ff..207fa2d0de472 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2823,9 +2823,13 @@ static void btrfs_release_extent_buffer_folios(const struct extent_buffer *eb)
>  		if (!folio)
>  			continue;
>
> +		/*
> +		 * Avoid accidentally putting the last refcount during
> +		 * detach_extent_buffer_folio() with an extra
> +		 * folio_get()/folio_put() pair as a buffer.
> +		 */
> +		folio_get(folio);
>  		detach_extent_buffer_folio(eb, folio);
> -
> -		/* One for when we allocated the folio. */
>  		folio_put(folio);

Instead of adding this defensive get/put pair, we could simply change
detach_extent_buffer_folio():

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 19f21540475d..7183ae731288 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2768,6 +2768,7 @@ static bool folio_range_has_eb(struct folio *folio)
 static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct folio *folio)
 {
 	struct btrfs_fs_info *fs_info = eb->fs_info;
+	struct address_space *mapping = folio->mapping;
 	const bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);

 	/*
@@ -2775,11 +2776,11 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo
 	 * be done under the i_private_lock.
 	 */
 	if (mapped)
-		spin_lock(&folio->mapping->i_private_lock);
+		spin_lock(&mapping->i_private_lock);

 	if (!folio_test_private(folio)) {
 		if (mapped)
-			spin_unlock(&folio->mapping->i_private_lock);
+			spin_unlock(&mapping->i_private_lock);
 		return;
 	}

@@ -2799,7 +2800,7 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo
 			folio_detach_private(folio);
 		}
 		if (mapped)
-			spin_unlock(&folio->mapping->i_private_lock);
+			spin_unlock(&mapping->i_private_lock);
 		return;
 	}

@@ -2822,7 +2823,7 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo
 	if (!folio_range_has_eb(folio))
 		btrfs_detach_subpage(fs_info, folio, BTRFS_SUBPAGE_METADATA);

-	spin_unlock(&folio->mapping->i_private_lock);
+	spin_unlock(&mapping->i_private_lock);
 }

 /* Release all folios attached to the extent buffer */

It even makes the detach_extent_buffer_folio() code less verbose.

Either way:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>  	}
>  }
>
> @@ -3370,8 +3374,15 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  	 * btree_release_folio will correctly detect that a page belongs to a
>  	 * live buffer and won't free them prematurely.
>  	 */
> -	for (int i = 0; i < num_extent_folios(eb); i++)
> +	for (int i = 0; i < num_extent_folios(eb); i++) {
>  		folio_unlock(eb->folios[i]);
> +		/*
> +		 * A folio that has been added to an address_space mapping
> +		 * should not continue holding the refcount from its original
> +		 * allocation indefinitely.
> +		 */
> +		folio_put(eb->folios[i]);
> +	}
>  	return eb;
>
> out:
> --
> 2.47.1
>
>
On Mon, Mar 31, 2025 at 12:33:38PM +0000, Filipe Manana wrote:
> On Mon, Mar 17, 2025 at 11:47 PM Boris Burkov <boris@bur.io> wrote:
> >
> > The (correct) commit
> > e41c81d0d30e ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()")
> > replaced the page_mapped(page) check with a refcount check. However,
> > this refcount check does not work as expected with drop_caches for
> > btrfs's metadata pages.
> >
> > Btrfs has a per-sb metadata inode with cached pages, and when not in
> > active use by btrfs, they have a refcount of 3. One from the initial
> > call to alloc_pages, one (nr_pages == 1) from filemap_add_folio, and one
> > from folio_attach_private. We would expect such pages to get dropped by
> > drop_caches. However, drop_caches calls into mapping_evict_folio via
> > mapping_try_invalidate which gets a reference on the folio with
> > find_lock_entries(). As a result, these pages have a refcount of 4, and
> > fail this check.
> >
> > For what it's worth, such pages do get reclaimed under memory pressure,
> > so I would say that while this behavior is surprising, it is not really
> > dangerously broken.
> >
> > The following script produces such pages and uses drgn to further
> > analyze the state of the folios at various stages in the lifecycle
> > including drop_caches and memory pressure.
> > https://github.com/boryas/scripts/blob/main/sh/strand-meta/run.sh
>
> Not sure if it's a good thing to add URLs not as permanent as lore and
> kernel.org...
> I would place the script in the change log itself.
>

Good call, will do so in the future! I guess kernel.org needs a pastebin
too ;)

> >
> > When I asked the mm folks about the expected refcount in this case, I
> > was told that the correct thing to do is to donate the refcount from the
> > original allocation to the page cache after inserting it.
> > https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
> >
> > Therefore, attempt to fix this by adding a folio_put() to the critical
> > spot in alloc_extent_buffer where we are sure that we have really
> > allocated and attached new pages.
> >
> > Since detach_extent_buffer_folio() has relatively complex logic w.r.t.
> > early exits and whether or not it actually calls folio_detach_private(),
> > the easiest way to ensure we don't incur a UAF in that function is to
> > wrap it in a buffer refcount so that the private reference cannot be the
> > last one.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/extent_io.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 7abe6ca5b38ff..207fa2d0de472 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2823,9 +2823,13 @@ static void btrfs_release_extent_buffer_folios(const struct extent_buffer *eb)
> >  		if (!folio)
> >  			continue;
> >
> > +		/*
> > +		 * Avoid accidentally putting the last refcount during
> > +		 * detach_extent_buffer_folio() with an extra
> > +		 * folio_get()/folio_put() pair as a buffer.
> > +		 */
> > +		folio_get(folio);
> >  		detach_extent_buffer_folio(eb, folio);
> > -
> > -		/* One for when we allocated the folio. */
> >  		folio_put(folio);
>
> Instead of adding this defensive get/put pair, we could simply change
> detach_extent_buffer_folio():
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 19f21540475d..7183ae731288 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2768,6 +2768,7 @@ static bool folio_range_has_eb(struct folio *folio)
>  static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct folio *folio)
>  {
>  	struct btrfs_fs_info *fs_info = eb->fs_info;
> +	struct address_space *mapping = folio->mapping;
>  	const bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
>
>  	/*
> @@ -2775,11 +2776,11 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo
>  	 * be done under the i_private_lock.
>  	 */
>  	if (mapped)
> -		spin_lock(&folio->mapping->i_private_lock);
> +		spin_lock(&mapping->i_private_lock);
>
>  	if (!folio_test_private(folio)) {
>  		if (mapped)
> -			spin_unlock(&folio->mapping->i_private_lock);
> +			spin_unlock(&mapping->i_private_lock);
>  		return;
>  	}
>
> @@ -2799,7 +2800,7 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo
>  			folio_detach_private(folio);
>  		}
>  		if (mapped)
> -			spin_unlock(&folio->mapping->i_private_lock);
> +			spin_unlock(&mapping->i_private_lock);
>  		return;
>  	}
>
> @@ -2822,7 +2823,7 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo
>  	if (!folio_range_has_eb(folio))
>  		btrfs_detach_subpage(fs_info, folio, BTRFS_SUBPAGE_METADATA);
>
> -	spin_unlock(&folio->mapping->i_private_lock);
> +	spin_unlock(&mapping->i_private_lock);
>  }
>
>  /* Release all folios attached to the extent buffer */
>
> It even makes the detach_extent_buffer_folio() code less verbose.
>
> Either way:
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.
>
> >  	}
> >  }

That's a neat way of fixing it.

I still think that having a folio pointer passed into the function that
could be dead halfway through the function feels dangerous. If we can
prove that the private is always the last reference, then we can comment
that the variable is dead after any path with a folio_put.

Because we call this from paths where we are removing the extent_buffer,
maybe we can actually be pretty sure the folio refcount is always going
to 0?

> > @@ -3370,8 +3374,15 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> >  	 * btree_release_folio will correctly detect that a page belongs to a
> >  	 * live buffer and won't free them prematurely.
> >  	 */
> > -	for (int i = 0; i < num_extent_folios(eb); i++)
> > +	for (int i = 0; i < num_extent_folios(eb); i++) {
> >  		folio_unlock(eb->folios[i]);
> > +		/*
> > +		 * A folio that has been added to an address_space mapping
> > +		 * should not continue holding the refcount from its original
> > +		 * allocation indefinitely.
> > +		 */
> > +		folio_put(eb->folios[i]);
> > +	}
> >  	return eb;
> >
> > out:
> > --
> > 2.47.1
> >
> >
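To make the lifetime hazard discussed above concrete, here is a minimal sketch, assuming the private reference can be the last one. The function detach_example() is hypothetical; folio_detach_private(), the spinlock calls, and address_space.i_private_lock are the real kernel APIs. It shows why either a cached mapping pointer (Filipe's variant) or a temporary reference (the patch's folio_get()/folio_put() pair) is needed for anything done after the detach.

#include <linux/pagemap.h>
#include <linux/spinlock.h>

static void detach_example(struct folio *folio)
{
	/* Cache the mapping pointer before the detach (Filipe's variant). */
	struct address_space *mapping = folio->mapping;

	spin_lock(&mapping->i_private_lock);
	/* May drop the final reference and free the folio... */
	folio_detach_private(folio);
	/* ...so neither folio nor folio->mapping may be touched past here. */
	spin_unlock(&mapping->i_private_lock);
}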
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7abe6ca5b38ff..207fa2d0de472 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2823,9 +2823,13 @@ static void btrfs_release_extent_buffer_folios(const struct extent_buffer *eb)
 		if (!folio)
 			continue;
 
+		/*
+		 * Avoid accidentally putting the last refcount during
+		 * detach_extent_buffer_folio() with an extra
+		 * folio_get()/folio_put() pair as a buffer.
+		 */
+		folio_get(folio);
 		detach_extent_buffer_folio(eb, folio);
-
-		/* One for when we allocated the folio. */
 		folio_put(folio);
 	}
 }
@@ -3370,8 +3374,15 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	 * btree_release_folio will correctly detect that a page belongs to a
 	 * live buffer and won't free them prematurely.
 	 */
-	for (int i = 0; i < num_extent_folios(eb); i++)
+	for (int i = 0; i < num_extent_folios(eb); i++) {
 		folio_unlock(eb->folios[i]);
+		/*
+		 * A folio that has been added to an address_space mapping
+		 * should not continue holding the refcount from its original
+		 * allocation indefinitely.
+		 */
+		folio_put(eb->folios[i]);
+	}
 	return eb;
 
 out:
The (correct) commit
e41c81d0d30e ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()")
replaced the page_mapped(page) check with a refcount check. However,
this refcount check does not work as expected with drop_caches for
btrfs's metadata pages.

Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3. One from the initial
call to alloc_pages, one (nr_pages == 1) from filemap_add_folio, and one
from folio_attach_private. We would expect such pages to get dropped by
drop_caches. However, drop_caches calls into mapping_evict_folio via
mapping_try_invalidate which gets a reference on the folio with
find_lock_entries(). As a result, these pages have a refcount of 4, and
fail this check.

For what it's worth, such pages do get reclaimed under memory pressure,
so I would say that while this behavior is surprising, it is not really
dangerously broken.

The following script produces such pages and uses drgn to further
analyze the state of the folios at various stages in the lifecycle
including drop_caches and memory pressure.
https://github.com/boryas/scripts/blob/main/sh/strand-meta/run.sh

When I asked the mm folks about the expected refcount in this case, I
was told that the correct thing to do is to donate the refcount from the
original allocation to the page cache after inserting it.
https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/

Therefore, attempt to fix this by adding a folio_put() to the critical
spot in alloc_extent_buffer where we are sure that we have really
allocated and attached new pages.

Since detach_extent_buffer_folio() has relatively complex logic w.r.t.
early exits and whether or not it actually calls folio_detach_private(),
the easiest way to ensure we don't incur a UAF in that function is to
wrap it in a buffer refcount so that the private reference cannot be the
last one.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent_io.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)
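For readers tracing the refcount arithmetic in the changelog, a minimal sketch of the folio lifecycle follows. It uses the real MM APIs named above (folio_alloc(), filemap_add_folio(), folio_attach_private(), folio_unlock(), folio_put()), but attach_metadata_folio() and its flow are illustrative assumptions, not the actual btrfs code path.

#include <linux/gfp.h>
#include <linux/pagemap.h>

/* Hypothetical helper showing where each reference comes from. */
static int attach_metadata_folio(struct address_space *mapping,
				 pgoff_t index, void *private)
{
	struct folio *folio;
	int err;

	folio = folio_alloc(GFP_NOFS, 0);	/* allocation ref: refcount == 1 */
	if (!folio)
		return -ENOMEM;

	/* On success the page cache holds a ref and the folio is locked. */
	err = filemap_add_folio(mapping, folio, index, GFP_NOFS);
	if (err) {
		folio_put(folio);		/* drop the allocation ref */
		return err;
	}
	/* refcount == 2 (allocation + page cache) */

	folio_attach_private(folio, private);	/* refcount == 3 */
	folio_unlock(folio);

	/*
	 * Idling at 3, the transient reference taken by drop_caches
	 * (mapping_try_invalidate -> find_lock_entries()) makes
	 * mapping_evict_folio() see 4 and refuse the eviction. Donating
	 * the allocation ref to the page cache, as the patch does in
	 * alloc_extent_buffer(), fixes the baseline:
	 */
	folio_put(folio);			/* refcount == 2 (cache + private) */
	return 0;
}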