Message ID | 1676645312-13-4-git-send-email-ruansy.fnst@fujitsu.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind | expand |
On Fri, Feb 17, 2023 at 02:48:32PM +0000, Shiyang Ruan wrote: > This patch is inspired by Dan's "mm, dax, pmem: Introduce > dev_pagemap_failure()"[1]. With the help of dax_holder and > ->notify_failure() mechanism, the pmem driver is able to ask filesystem > (or mapped device) on it to unmap all files in use and notify processes > who are using those files. > > Call trace: > trigger unbind > -> unbind_store() > -> ... (skip) > -> devres_release_all() # was pmem driver ->remove() in v1 > -> kill_dax() > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) > -> xfs_dax_notify_failure() > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove > event. So do not shutdown filesystem directly if something not > supported, or if failure range includes metadata area. Make sure all > files and processes are handled correctly. > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> ..... > --- > @@ -225,6 +242,15 @@ xfs_dax_notify_failure( > if (offset + len - 1 > ddev_end) > len = ddev_end - offset + 1; > > + if (mf_flags & MF_MEM_PRE_REMOVE) { > + xfs_info(mp, "device is about to be removed!"); > + error = freeze_super(mp->m_super); > + if (error) > + return error; > + /* invalidate_inode_pages2() invalidates dax mapping */ > + super_drop_pagecache(mp->m_super, invalidate_inode_pages2); > + } Why do you still need to drop the pagecache here? My suggestion was to replace it with freezing the filesystem at this point is to stop it being dirtied further before the device remove actually occurs. The userspace processes will be killed, their DAX mappings reclaimed and the filesystem shut down before device removal occurs, so super_drop_pagecache() is largely superfluous as it doesn't actually provide any protection against racing with new mappings or dirtying of existing/newly created mappings. Freezing doesn't stop the creation of new mappings, either, it just cleans all the dirty mappings and halts anything that is trying to dirty existing clean mappings. It's not until we kill the userspace processes that new mappings will be stopped, and it's not until we shut the filesystem down that the filesystem itself will stop accessing the storage. Hence I don't see why you retained super_drop_pagecache() here at all. Can you explain why it is still needed? -Dave.
在 2023/2/27 8:07, Dave Chinner 写道: > On Fri, Feb 17, 2023 at 02:48:32PM +0000, Shiyang Ruan wrote: >> This patch is inspired by Dan's "mm, dax, pmem: Introduce >> dev_pagemap_failure()"[1]. With the help of dax_holder and >> ->notify_failure() mechanism, the pmem driver is able to ask filesystem >> (or mapped device) on it to unmap all files in use and notify processes >> who are using those files. >> >> Call trace: >> trigger unbind >> -> unbind_store() >> -> ... (skip) >> -> devres_release_all() # was pmem driver ->remove() in v1 >> -> kill_dax() >> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) >> -> xfs_dax_notify_failure() >> >> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove >> event. So do not shutdown filesystem directly if something not >> supported, or if failure range includes metadata area. Make sure all >> files and processes are handled correctly. >> >> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ >> >> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> > > ..... > >> --- >> @@ -225,6 +242,15 @@ xfs_dax_notify_failure( >> if (offset + len - 1 > ddev_end) >> len = ddev_end - offset + 1; >> >> + if (mf_flags & MF_MEM_PRE_REMOVE) { >> + xfs_info(mp, "device is about to be removed!"); >> + error = freeze_super(mp->m_super); >> + if (error) >> + return error; >> + /* invalidate_inode_pages2() invalidates dax mapping */ >> + super_drop_pagecache(mp->m_super, invalidate_inode_pages2); >> + } > > Why do you still need to drop the pagecache here? My suggestion was > to replace it with freezing the filesystem at this point is to stop > it being dirtied further before the device remove actually occurs. > The userspace processes will be killed, their DAX mappings reclaimed > and the filesystem shut down before device removal occurs, so > super_drop_pagecache() is largely superfluous as it doesn't actually > provide any protection against racing with new mappings or dirtying > of existing/newly created mappings. > > Freezing doesn't stop the creation of new mappings, either, it just > cleans all the dirty mappings and halts anything that is trying to This is the point I wasn't aware of. > dirty existing clean mappings. It's not until we kill the userspace > processes that new mappings will be stopped, and it's not until we > shut the filesystem down that the filesystem itself will stop > accessing the storage. > > Hence I don't see why you retained super_drop_pagecache() here at > all. Can you explain why it is still needed? So I was just afraid that it's not enough for rmap & processes killer to invalidate the dax mappings. If something error happened during the rmap walker, the fs will shutdown and there is no chance to invalidate the rest mappings whose user didn't be killed yet. Now that freezing the fs is enough, I will remove the drop cache code. -- Thanks, Ruan. > > -Dave.
在 2023/2/27 18:06, Shiyang Ruan 写道: > > > 在 2023/2/27 8:07, Dave Chinner 写道: >> On Fri, Feb 17, 2023 at 02:48:32PM +0000, Shiyang Ruan wrote: >>> This patch is inspired by Dan's "mm, dax, pmem: Introduce >>> dev_pagemap_failure()"[1]. With the help of dax_holder and >>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem >>> (or mapped device) on it to unmap all files in use and notify processes >>> who are using those files. >>> >>> Call trace: >>> trigger unbind >>> -> unbind_store() >>> -> ... (skip) >>> -> devres_release_all() # was pmem driver ->remove() in v1 >>> -> kill_dax() >>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, >>> MF_MEM_PRE_REMOVE) >>> -> xfs_dax_notify_failure() >>> >>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove >>> event. So do not shutdown filesystem directly if something not >>> supported, or if failure range includes metadata area. Make sure all >>> files and processes are handled correctly. >>> >>> [1]: >>> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ >>> >>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> >> >> ..... >> >>> --- >>> @@ -225,6 +242,15 @@ xfs_dax_notify_failure( >>> if (offset + len - 1 > ddev_end) >>> len = ddev_end - offset + 1; >>> + if (mf_flags & MF_MEM_PRE_REMOVE) { >>> + xfs_info(mp, "device is about to be removed!"); >>> + error = freeze_super(mp->m_super); >>> + if (error) >>> + return error; >>> + /* invalidate_inode_pages2() invalidates dax mapping */ >>> + super_drop_pagecache(mp->m_super, invalidate_inode_pages2); >>> + } >> >> Why do you still need to drop the pagecache here? My suggestion was >> to replace it with freezing the filesystem at this point is to stop >> it being dirtied further before the device remove actually occurs. >> The userspace processes will be killed, their DAX mappings reclaimed >> and the filesystem shut down before device removal occurs, so >> super_drop_pagecache() is largely superfluous as it doesn't actually >> provide any protection against racing with new mappings or dirtying >> of existing/newly created mappings. >> >> Freezing doesn't stop the creation of new mappings, either, it just >> cleans all the dirty mappings and halts anything that is trying to > > This is the point I wasn't aware of. > >> dirty existing clean mappings. It's not until we kill the userspace >> processes that new mappings will be stopped, and it's not until we >> shut the filesystem down that the filesystem itself will stop >> accessing the storage. >> >> Hence I don't see why you retained super_drop_pagecache() here at >> all. Can you explain why it is still needed? > > > So I was just afraid that it's not enough for rmap & processes killer to > invalidate the dax mappings. If something error happened during the > rmap walker, the fs will shutdown and there is no chance to invalidate > the rest mappings whose user didn't be killed yet. > > Now that freezing the fs is enough, I will remove the drop cache code. I removed the drop cache code, then kernel always went into crash when running the test[1]. After the investigation, I found that the crash is cause by accessing (invalidate dax pages when umounting fs) the page of a pmem while the pmem has been removed. According to the design, the dax page should have been invalidated by mf_dax_kill_procs() but it didn't. I found two reasons: 1. collect_procs_fsdax() only kills the current process 2. unmap_mapping_range() doesn't invalidate the dax pages (disassociate dax entry in fs/dax.c), which causes the crash in my test So, I think we should: 1. pass the mf_flag to collect_procs_fsdax() to let it collect all processes associated with the file on the XFS. 2. drop cache is still needed, but just drop the associated files' cache after mf_dax_kill_procs(), instead of dropping cache of the whole filesystem. Then the logic shuld be looked like this: unbind `-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) `-> xfs_dax_notify_failure() `-> freeze_super() `-> do xfs rmap `-> mf_dax_kill_procs() `-> collect_procs_fsdax() // all associated `-> unmap_and_kill() `-> invalidate_inode_pages2() // drop file's cache `-> thaw_super() [1] The step of unbind test: 1. create fsdax namespace on a pmem 2. mkfs.xfs on it 3. run fsx test in background 4. wait 1s 5. echo "pfn0.1" > unbind 6. wait 1s 7. umount xfs --> crash happened -- Thanks, Ruan. > > > -- > Thanks, > Ruan. > >> >> -Dave.
在 2023/3/21 18:59, Shiyang Ruan 写道: > > > 在 2023/2/27 18:06, Shiyang Ruan 写道: >> >> >> 在 2023/2/27 8:07, Dave Chinner 写道: >>> On Fri, Feb 17, 2023 at 02:48:32PM +0000, Shiyang Ruan wrote: >>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce >>>> dev_pagemap_failure()"[1]. With the help of dax_holder and >>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem >>>> (or mapped device) on it to unmap all files in use and notify processes >>>> who are using those files. >>>> >>>> Call trace: >>>> trigger unbind >>>> -> unbind_store() >>>> -> ... (skip) >>>> -> devres_release_all() # was pmem driver ->remove() in v1 >>>> -> kill_dax() >>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, >>>> MF_MEM_PRE_REMOVE) >>>> -> xfs_dax_notify_failure() >>>> >>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove >>>> event. So do not shutdown filesystem directly if something not >>>> supported, or if failure range includes metadata area. Make sure all >>>> files and processes are handled correctly. >>>> >>>> [1]: >>>> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ >>>> >>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> >>> >>> ..... >>> >>>> --- >>>> @@ -225,6 +242,15 @@ xfs_dax_notify_failure( >>>> if (offset + len - 1 > ddev_end) >>>> len = ddev_end - offset + 1; >>>> + if (mf_flags & MF_MEM_PRE_REMOVE) { >>>> + xfs_info(mp, "device is about to be removed!"); >>>> + error = freeze_super(mp->m_super); >>>> + if (error) >>>> + return error; >>>> + /* invalidate_inode_pages2() invalidates dax mapping */ >>>> + super_drop_pagecache(mp->m_super, invalidate_inode_pages2); >>>> + } >>> >>> Why do you still need to drop the pagecache here? My suggestion was >>> to replace it with freezing the filesystem at this point is to stop >>> it being dirtied further before the device remove actually occurs. >>> The userspace processes will be killed, their DAX mappings reclaimed >>> and the filesystem shut down before device removal occurs, so >>> super_drop_pagecache() is largely superfluous as it doesn't actually >>> provide any protection against racing with new mappings or dirtying >>> of existing/newly created mappings. >>> >>> Freezing doesn't stop the creation of new mappings, either, it just >>> cleans all the dirty mappings and halts anything that is trying to >> >> This is the point I wasn't aware of. >> >>> dirty existing clean mappings. It's not until we kill the userspace >>> processes that new mappings will be stopped, and it's not until we >>> shut the filesystem down that the filesystem itself will stop >>> accessing the storage. >>> >>> Hence I don't see why you retained super_drop_pagecache() here at >>> all. Can you explain why it is still needed? >> >> >> So I was just afraid that it's not enough for rmap & processes killer >> to invalidate the dax mappings. If something error happened during >> the rmap walker, the fs will shutdown and there is no chance to >> invalidate the rest mappings whose user didn't be killed yet. >> >> Now that freezing the fs is enough, I will remove the drop cache code. > > I removed the drop cache code, then kernel always went into crash when > running the test[1]. After the investigation, I found that the crash is > cause by accessing (invalidate dax pages when umounting fs) the page of > a pmem while the pmem has been removed. > > According to the design, the dax page should have been invalidated by > mf_dax_kill_procs() but it didn't. I found two reasons: > 1. collect_procs_fsdax() only kills the current process > 2. unmap_mapping_range() doesn't invalidate the dax pages > (disassociate dax entry in fs/dax.c), which causes the crash in my test > > So, I think we should: > 1. pass the mf_flag to collect_procs_fsdax() to let it collect all > processes associated with the file on the XFS. > 2. drop cache is still needed, but just drop the associated files' > cache after mf_dax_kill_procs(), instead of dropping cache of the whole > filesystem. > > Then the logic shuld be looked like this: > unbind > `-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) > `-> xfs_dax_notify_failure() > `-> freeze_super() > `-> do xfs rmap > `-> mf_dax_kill_procs() > `-> collect_procs_fsdax() // all associated > `-> unmap_and_kill() > `-> invalidate_inode_pages2() // drop file's cache > `-> thaw_super() > > > [1] The step of unbind test: > 1. create fsdax namespace on a pmem > 2. mkfs.xfs on it > 3. run fsx test in background > 4. wait 1s > 5. echo "pfn0.1" > unbind > 6. wait 1s > 7. umount xfs --> crash happened > Hi, Any comments? > > -- > Thanks, > Ruan. > >> >> >> -- >> Thanks, >> Ruan. >> >>> >>> -Dave.
diff --git a/drivers/dax/super.c b/drivers/dax/super.c index c4c4728a36e4..2e1a35e82fce 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) return; if (dax_dev->holder_data != NULL) - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); + dax_holder_notify_failure(dax_dev, 0, U64_MAX, + MF_MEM_PRE_REMOVE); clear_bit(DAXDEV_ALIVE, &dax_dev->flags); synchronize_srcu(&dax_srcu); diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index 7d46a7e4980f..5f915cfc9632 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -22,6 +22,7 @@ #include <linux/mm.h> #include <linux/dax.h> +#include <linux/fs.h> struct xfs_failure_info { xfs_agblock_t startblock; @@ -77,6 +78,9 @@ xfs_dax_failure_fn( if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { + /* The device is about to be removed. Not a really failure. */ + if (notify->mf_flags & MF_MEM_PRE_REMOVE) + return 0; notify->want_shutdown = true; return 0; } @@ -168,7 +172,11 @@ xfs_dax_notify_ddev_failure( xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); if (!error) error = -EFSCORRUPTED; + } else if (mf_flags & MF_MEM_PRE_REMOVE) { + error = thaw_super(mp->m_super); + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT); } + return error; } @@ -182,6 +190,7 @@ xfs_dax_notify_failure( struct xfs_mount *mp = dax_holder(dax_dev); u64 ddev_start; u64 ddev_end; + int error; if (!(mp->m_super->s_flags & SB_BORN)) { xfs_warn(mp, "filesystem is not ready for notify_failure()!"); @@ -196,6 +205,8 @@ xfs_dax_notify_failure( if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev && mp->m_logdev_targp != mp->m_ddev_targp) { + if (mf_flags & MF_MEM_PRE_REMOVE) + return 0; xfs_err(mp, "ondisk log corrupt, shutting down fs!"); xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); return -EFSCORRUPTED; @@ -209,6 +220,12 @@ xfs_dax_notify_failure( ddev_start = mp->m_ddev_targp->bt_dax_part_off; ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1; + /* Notify failure on the whole device */ + if (offset == 0 && len == U64_MAX) { + offset = ddev_start; + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev); + } + /* Ignore the range out of filesystem area */ if (offset + len - 1 < ddev_start) return -ENXIO; @@ -225,6 +242,15 @@ xfs_dax_notify_failure( if (offset + len - 1 > ddev_end) len = ddev_end - offset + 1; + if (mf_flags & MF_MEM_PRE_REMOVE) { + xfs_info(mp, "device is about to be removed!"); + error = freeze_super(mp->m_super); + if (error) + return error; + /* invalidate_inode_pages2() invalidates dax mapping */ + super_drop_pagecache(mp->m_super, invalidate_inode_pages2); + } + return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len), mf_flags); } diff --git a/include/linux/mm.h b/include/linux/mm.h index 8f857163ac89..9711dbc9451f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3424,6 +3424,7 @@ enum mf_flags { MF_UNPOISON = 1 << 4, MF_SW_SIMULATED = 1 << 5, MF_NO_RETRY = 1 << 6, + MF_MEM_PRE_REMOVE = 1 << 7, }; int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, unsigned long count, int mf_flags);
This patch is inspired by Dan's "mm, dax, pmem: Introduce dev_pagemap_failure()"[1]. With the help of dax_holder and ->notify_failure() mechanism, the pmem driver is able to ask filesystem (or mapped device) on it to unmap all files in use and notify processes who are using those files. Call trace: trigger unbind -> unbind_store() -> ... (skip) -> devres_release_all() # was pmem driver ->remove() in v1 -> kill_dax() -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) -> xfs_dax_notify_failure() Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove event. So do not shutdown filesystem directly if something not supported, or if failure range includes metadata area. Make sure all files and processes are handled correctly. [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> --- drivers/dax/super.c | 3 ++- fs/xfs/xfs_notify_failure.c | 26 ++++++++++++++++++++++++++ include/linux/mm.h | 1 + 3 files changed, 29 insertions(+), 1 deletion(-)