Message ID | 6d25aaaadc853ffb759d538392ff48ed108e3d50.1736221254.git-series.apopple@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs/dax: Fix ZONE_DEVICE page reference counts | expand |
Alistair Popple wrote: > Several functions internal to FS DAX use the following pattern when > trying to obtain an unlocked entry: > > xas_for_each(&xas, entry, end_idx) { > if (dax_is_locked(entry)) > entry = get_unlocked_entry(&xas, 0); > > This is problematic because get_unlocked_entry() will get the next > present entry in the range, and the next entry may not be > locked. Therefore any processing of the original locked entry will be > skipped. This can cause dax_layout_busy_page_range() to miss DMA-busy > pages in the range, leading file systems to free blocks whilst DMA > operations are ongoing which can lead to file system corruption. > > Instead callers from within a xas_for_each() loop should be waiting > for the current entry to be unlocked without advancing the XArray > state so a new function is introduced to wait. Oh wow, good eye! Did this trip up an xfstest, or did you see this purely by inspection? > Also while we are here rename get_unlocked_entry() to > get_next_unlocked_entry() to make it clear that it may advance the > iterator state. Outside of the above clarification of how found / end user effect you can add: Reviewed-by: Dan Williams <dan.j.williams@intel.com>
On Wed, Jan 08, 2025 at 02:50:36PM -0800, Dan Williams wrote: > Alistair Popple wrote: > > Several functions internal to FS DAX use the following pattern when > > trying to obtain an unlocked entry: > > > > xas_for_each(&xas, entry, end_idx) { > > if (dax_is_locked(entry)) > > entry = get_unlocked_entry(&xas, 0); > > > > This is problematic because get_unlocked_entry() will get the next > > present entry in the range, and the next entry may not be > > locked. Therefore any processing of the original locked entry will be > > skipped. This can cause dax_layout_busy_page_range() to miss DMA-busy > > pages in the range, leading file systems to free blocks whilst DMA > > operations are ongoing which can lead to file system corruption. > > > > Instead callers from within a xas_for_each() loop should be waiting > > for the current entry to be unlocked without advancing the XArray > > state so a new function is introduced to wait. > > Oh wow, good eye! > > Did this trip up an xfstest, or did you see this purely by inspection? Oh this was a "fun" one to track down :-) The other half of the story is in "fs/dax: Always remove DAX page-cache entries when breaking layouts". With just that patch applied xfstest triggered the new WARN_ON_ONCE in truncate_folio_batch_exceptionals(). That made no sense, because that patch makes breaking layouts also remove the DAX page-cache entries. Therefore no DAX page-cache entries should be found in truncate_folio_batch_exceptionals() which is now more of a sanity check. However due to the bug addressed by this patch DAX page-cache entries which should have been deleted as part of breaking layouts were being observed in truncate_folio_batch_exceptionals(). Prior to this series nothing would have noticed these being skipped because dax_delete_mapping_entry() doesn't check if the page is DMA idle. I believe this could lead to filesystem corruption if the locked entry was DMA-busy because the filesystem would assume the page was DMA-idle and therefore the underlying block free to be reallocated. However writing a test to actually prove this is tricky, and I didn't get time to do so. > > Also while we are here rename get_unlocked_entry() to > > get_next_unlocked_entry() to make it clear that it may advance the > > iterator state. > > Outside of the above clarification of how found / end user effect you > can add: > > Reviewed-by: Dan Williams <dan.j.williams@intel.com>
diff --git a/fs/dax.c b/fs/dax.c index 5133568..d010c10 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -206,7 +206,7 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, * * Must be called with the i_pages lock held. */ -static void *get_unlocked_entry(struct xa_state *xas, unsigned int order) +static void *get_next_unlocked_entry(struct xa_state *xas, unsigned int order) { void *entry; struct wait_exceptional_entry_queue ewait; @@ -236,6 +236,37 @@ static void *get_unlocked_entry(struct xa_state *xas, unsigned int order) } /* + * Wait for the given entry to become unlocked. Caller must hold the i_pages + * lock and call either put_unlocked_entry() if it did not lock the entry or + * dax_unlock_entry() if it did. Returns an unlocked entry if still present. + */ +static void *wait_entry_unlocked_exclusive(struct xa_state *xas, void *entry) +{ + struct wait_exceptional_entry_queue ewait; + wait_queue_head_t *wq; + + init_wait(&ewait.wait); + ewait.wait.func = wake_exceptional_entry_func; + + while (unlikely(dax_is_locked(entry))) { + wq = dax_entry_waitqueue(xas, entry, &ewait.key); + prepare_to_wait_exclusive(wq, &ewait.wait, + TASK_UNINTERRUPTIBLE); + xas_pause(xas); + xas_unlock_irq(xas); + schedule(); + finish_wait(wq, &ewait.wait); + xas_lock_irq(xas); + entry = xas_load(xas); + } + + if (xa_is_internal(entry)) + return NULL; + + return entry; +} + +/* * The only thing keeping the address space around is the i_pages lock * (it's cycled in clear_inode() after removing the entries from i_pages) * After we call xas_unlock_irq(), we cannot touch xas->xa. @@ -250,7 +281,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry) wq = dax_entry_waitqueue(xas, entry, &ewait.key); /* - * Unlike get_unlocked_entry() there is no guarantee that this + * Unlike get_next_unlocked_entry() there is no guarantee that this * path ever successfully retrieves an unlocked entry before an * inode dies. Perform a non-exclusive wait in case this path * never successfully performs its own wake up. @@ -580,7 +611,7 @@ static void *grab_mapping_entry(struct xa_state *xas, retry: pmd_downgrade = false; xas_lock_irq(xas); - entry = get_unlocked_entry(xas, order); + entry = get_next_unlocked_entry(xas, order); if (entry) { if (dax_is_conflict(entry)) @@ -716,8 +747,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, xas_for_each(&xas, entry, end_idx) { if (WARN_ON_ONCE(!xa_is_value(entry))) continue; - if (unlikely(dax_is_locked(entry))) - entry = get_unlocked_entry(&xas, 0); + entry = wait_entry_unlocked_exclusive(&xas, entry); if (entry) page = dax_busy_page(entry); put_unlocked_entry(&xas, entry, WAKE_NEXT); @@ -750,7 +780,7 @@ static int __dax_invalidate_entry(struct address_space *mapping, void *entry; xas_lock_irq(&xas); - entry = get_unlocked_entry(&xas, 0); + entry = get_next_unlocked_entry(&xas, 0); if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) goto out; if (!trunc && @@ -776,7 +806,9 @@ static int __dax_clear_dirty_range(struct address_space *mapping, xas_lock_irq(&xas); xas_for_each(&xas, entry, end) { - entry = get_unlocked_entry(&xas, 0); + entry = wait_entry_unlocked_exclusive(&xas, entry); + if (!entry) + continue; xas_clear_mark(&xas, PAGECACHE_TAG_DIRTY); xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE); put_unlocked_entry(&xas, entry, WAKE_NEXT); @@ -940,7 +972,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, if (unlikely(dax_is_locked(entry))) { void *old_entry = entry; - entry = get_unlocked_entry(xas, 0); + entry = get_next_unlocked_entry(xas, 0); /* Entry got punched out / reallocated? */ if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) @@ -1949,7 +1981,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) vm_fault_t ret; xas_lock_irq(&xas); - entry = get_unlocked_entry(&xas, order); + entry = get_next_unlocked_entry(&xas, order); /* Did we race with someone splitting entry or so? */ if (!entry || dax_is_conflict(entry) || (order == 0 && !dax_is_pte_entry(entry))) {
Several functions internal to FS DAX use the following pattern when trying to obtain an unlocked entry: xas_for_each(&xas, entry, end_idx) { if (dax_is_locked(entry)) entry = get_unlocked_entry(&xas, 0); This is problematic because get_unlocked_entry() will get the next present entry in the range, and the next entry may not be locked. Therefore any processing of the original locked entry will be skipped. This can cause dax_layout_busy_page_range() to miss DMA-busy pages in the range, leading file systems to free blocks whilst DMA operations are ongoing which can lead to file system corruption. Instead callers from within a xas_for_each() loop should be waiting for the current entry to be unlocked without advancing the XArray state so a new function is introduced to wait. Also while we are here rename get_unlocked_entry() to get_next_unlocked_entry() to make it clear that it may advance the iterator state. Signed-off-by: Alistair Popple <apopple@nvidia.com> --- fs/dax.c | 50 +++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 41 insertions(+), 9 deletions(-)