diff mbox series

[v10,3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

Message ID 1676645312-13-4-git-send-email-ruansy.fnst@fujitsu.com (mailing list archive)
State New
Headers show
Series mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind | expand

Commit Message

Shiyang Ruan Feb. 17, 2023, 2:48 p.m. UTC
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event.  So do not shutdown filesystem directly if something not
supported, or if failure range includes metadata area.  Make sure all
files and processes are handled correctly.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c         |  3 ++-
 fs/xfs/xfs_notify_failure.c | 26 ++++++++++++++++++++++++++
 include/linux/mm.h          |  1 +
 3 files changed, 29 insertions(+), 1 deletion(-)

Comments

Dave Chinner Feb. 27, 2023, 12:07 a.m. UTC | #1
On Fri, Feb 17, 2023 at 02:48:32PM +0000, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()   # was pmem driver ->remove() in v1
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  So do not shutdown filesystem directly if something not
> supported, or if failure range includes metadata area.  Make sure all
> files and processes are handled correctly.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>

.....

> ---
> @@ -225,6 +242,15 @@ xfs_dax_notify_failure(
>  	if (offset + len - 1 > ddev_end)
>  		len = ddev_end - offset + 1;
>  
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "device is about to be removed!");
> +		error = freeze_super(mp->m_super);
> +		if (error)
> +			return error;
> +		/* invalidate_inode_pages2() invalidates dax mapping */
> +		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
> +	}

Why do you still need to drop the pagecache here? My suggestion was
to replace it with freezing the filesystem at this point is to stop
it being dirtied further before the device remove actually occurs.
The userspace processes will be killed, their DAX mappings reclaimed
and the filesystem shut down before device removal occurs, so
super_drop_pagecache() is largely superfluous as it doesn't actually
provide any protection against racing with new mappings or dirtying
of existing/newly created mappings.

Freezing doesn't stop the creation of new mappings, either, it just
cleans all the dirty mappings and halts anything that is trying to
dirty existing clean mappings. It's not until we kill the userspace
processes that new mappings will be stopped, and it's not until we
shut the filesystem down that the filesystem itself will stop
accessing the storage.

Hence I don't see why you retained super_drop_pagecache() here at
all. Can you explain why it is still needed?

-Dave.
Shiyang Ruan Feb. 27, 2023, 10:06 a.m. UTC | #2
在 2023/2/27 8:07, Dave Chinner 写道:
> On Fri, Feb 17, 2023 at 02:48:32PM +0000, Shiyang Ruan wrote:
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> (or mapped device) on it to unmap all files in use and notify processes
>> who are using those files.
>>
>> Call trace:
>> trigger unbind
>>   -> unbind_store()
>>    -> ... (skip)
>>     -> devres_release_all()   # was pmem driver ->remove() in v1
>>      -> kill_dax()
>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>        -> xfs_dax_notify_failure()
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event.  So do not shutdown filesystem directly if something not
>> supported, or if failure range includes metadata area.  Make sure all
>> files and processes are handled correctly.
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> 
> .....
> 
>> ---
>> @@ -225,6 +242,15 @@ xfs_dax_notify_failure(
>>   	if (offset + len - 1 > ddev_end)
>>   		len = ddev_end - offset + 1;
>>   
>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>> +		xfs_info(mp, "device is about to be removed!");
>> +		error = freeze_super(mp->m_super);
>> +		if (error)
>> +			return error;
>> +		/* invalidate_inode_pages2() invalidates dax mapping */
>> +		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
>> +	}
> 
> Why do you still need to drop the pagecache here? My suggestion was
> to replace it with freezing the filesystem at this point is to stop
> it being dirtied further before the device remove actually occurs.
> The userspace processes will be killed, their DAX mappings reclaimed
> and the filesystem shut down before device removal occurs, so
> super_drop_pagecache() is largely superfluous as it doesn't actually
> provide any protection against racing with new mappings or dirtying
> of existing/newly created mappings.
> 
> Freezing doesn't stop the creation of new mappings, either, it just
> cleans all the dirty mappings and halts anything that is trying to

This is the point I wasn't aware of.

> dirty existing clean mappings. It's not until we kill the userspace
> processes that new mappings will be stopped, and it's not until we
> shut the filesystem down that the filesystem itself will stop
> accessing the storage.
> 
> Hence I don't see why you retained super_drop_pagecache() here at
> all. Can you explain why it is still needed?


So I was just afraid that it's not enough for rmap & processes killer to 
invalidate the dax mappings.  If something error happened during the 
rmap walker, the fs will shutdown and there is no chance to invalidate 
the rest mappings whose user didn't be killed yet.

Now that freezing the fs is enough, I will remove the drop cache code.


--
Thanks,
Ruan.

> 
> -Dave.
Shiyang Ruan March 21, 2023, 10:59 a.m. UTC | #3
在 2023/2/27 18:06, Shiyang Ruan 写道:
> 
> 
> 在 2023/2/27 8:07, Dave Chinner 写道:
>> On Fri, Feb 17, 2023 at 02:48:32PM +0000, Shiyang Ruan wrote:
>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>> (or mapped device) on it to unmap all files in use and notify processes
>>> who are using those files.
>>>
>>> Call trace:
>>> trigger unbind
>>>   -> unbind_store()
>>>    -> ... (skip)
>>>     -> devres_release_all()   # was pmem driver ->remove() in v1
>>>      -> kill_dax()
>>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, 
>>> MF_MEM_PRE_REMOVE)
>>>        -> xfs_dax_notify_failure()
>>>
>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>> event.  So do not shutdown filesystem directly if something not
>>> supported, or if failure range includes metadata area.  Make sure all
>>> files and processes are handled correctly.
>>>
>>> [1]: 
>>> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>
>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>
>> .....
>>
>>> ---
>>> @@ -225,6 +242,15 @@ xfs_dax_notify_failure(
>>>       if (offset + len - 1 > ddev_end)
>>>           len = ddev_end - offset + 1;
>>> +    if (mf_flags & MF_MEM_PRE_REMOVE) {
>>> +        xfs_info(mp, "device is about to be removed!");
>>> +        error = freeze_super(mp->m_super);
>>> +        if (error)
>>> +            return error;
>>> +        /* invalidate_inode_pages2() invalidates dax mapping */
>>> +        super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
>>> +    }
>>
>> Why do you still need to drop the pagecache here? My suggestion was
>> to replace it with freezing the filesystem at this point is to stop
>> it being dirtied further before the device remove actually occurs.
>> The userspace processes will be killed, their DAX mappings reclaimed
>> and the filesystem shut down before device removal occurs, so
>> super_drop_pagecache() is largely superfluous as it doesn't actually
>> provide any protection against racing with new mappings or dirtying
>> of existing/newly created mappings.
>>
>> Freezing doesn't stop the creation of new mappings, either, it just
>> cleans all the dirty mappings and halts anything that is trying to
> 
> This is the point I wasn't aware of.
> 
>> dirty existing clean mappings. It's not until we kill the userspace
>> processes that new mappings will be stopped, and it's not until we
>> shut the filesystem down that the filesystem itself will stop
>> accessing the storage.
>>
>> Hence I don't see why you retained super_drop_pagecache() here at
>> all. Can you explain why it is still needed?
> 
> 
> So I was just afraid that it's not enough for rmap & processes killer to 
> invalidate the dax mappings.  If something error happened during the 
> rmap walker, the fs will shutdown and there is no chance to invalidate 
> the rest mappings whose user didn't be killed yet.
> 
> Now that freezing the fs is enough, I will remove the drop cache code.

I removed the drop cache code, then kernel always went into crash when 
running the test[1].  After the investigation, I found that the crash is 
cause by accessing (invalidate dax pages when umounting fs) the page of 
a pmem while the pmem has been removed.

According to the design, the dax page should have been invalidated by 
mf_dax_kill_procs() but it didn't.  I found two reasons:
  1. collect_procs_fsdax() only kills the current process
  2. unmap_mapping_range() doesn't invalidate the dax pages 
(disassociate dax entry in fs/dax.c), which causes the crash in my test

So, I think we should:
  1. pass the mf_flag to collect_procs_fsdax() to let it collect all 
processes associated with the file on the XFS.
  2. drop cache is still needed, but just drop the associated files' 
cache after mf_dax_kill_procs(), instead of dropping cache of the whole 
filesystem.

Then the logic shuld be looked like this:
unbind
  `-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
    `-> xfs_dax_notify_failure()
      `-> freeze_super()
      `-> do xfs rmap
        `-> mf_dax_kill_procs()
          `-> collect_procs_fsdax()   // all associated
          `-> unmap_and_kill()
        `-> invalidate_inode_pages2() // drop file's cache
      `-> thaw_super()


[1] The step of unbind test:
  1. create fsdax namespace on a pmem
  2. mkfs.xfs on it
  3. run fsx test in background
  4. wait 1s
  5. echo "pfn0.1" > unbind
  6. wait 1s
  7. umount xfs       --> crash happened


--
Thanks,
Ruan.

> 
> 
> -- 
> Thanks,
> Ruan.
> 
>>
>> -Dave.
Shiyang Ruan March 24, 2023, 9:49 a.m. UTC | #4
在 2023/3/21 18:59, Shiyang Ruan 写道:
> 
> 
> 在 2023/2/27 18:06, Shiyang Ruan 写道:
>>
>>
>> 在 2023/2/27 8:07, Dave Chinner 写道:
>>> On Fri, Feb 17, 2023 at 02:48:32PM +0000, Shiyang Ruan wrote:
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> (or mapped device) on it to unmap all files in use and notify processes
>>>> who are using those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>>   -> unbind_store()
>>>>    -> ... (skip)
>>>>     -> devres_release_all()   # was pmem driver ->remove() in v1
>>>>      -> kill_dax()
>>>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, 
>>>> MF_MEM_PRE_REMOVE)
>>>>        -> xfs_dax_notify_failure()
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event.  So do not shutdown filesystem directly if something not
>>>> supported, or if failure range includes metadata area.  Make sure all
>>>> files and processes are handled correctly.
>>>>
>>>> [1]: 
>>>> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>
>>> .....
>>>
>>>> ---
>>>> @@ -225,6 +242,15 @@ xfs_dax_notify_failure(
>>>>       if (offset + len - 1 > ddev_end)
>>>>           len = ddev_end - offset + 1;
>>>> +    if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> +        xfs_info(mp, "device is about to be removed!");
>>>> +        error = freeze_super(mp->m_super);
>>>> +        if (error)
>>>> +            return error;
>>>> +        /* invalidate_inode_pages2() invalidates dax mapping */
>>>> +        super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
>>>> +    }
>>>
>>> Why do you still need to drop the pagecache here? My suggestion was
>>> to replace it with freezing the filesystem at this point is to stop
>>> it being dirtied further before the device remove actually occurs.
>>> The userspace processes will be killed, their DAX mappings reclaimed
>>> and the filesystem shut down before device removal occurs, so
>>> super_drop_pagecache() is largely superfluous as it doesn't actually
>>> provide any protection against racing with new mappings or dirtying
>>> of existing/newly created mappings.
>>>
>>> Freezing doesn't stop the creation of new mappings, either, it just
>>> cleans all the dirty mappings and halts anything that is trying to
>>
>> This is the point I wasn't aware of.
>>
>>> dirty existing clean mappings. It's not until we kill the userspace
>>> processes that new mappings will be stopped, and it's not until we
>>> shut the filesystem down that the filesystem itself will stop
>>> accessing the storage.
>>>
>>> Hence I don't see why you retained super_drop_pagecache() here at
>>> all. Can you explain why it is still needed?
>>
>>
>> So I was just afraid that it's not enough for rmap & processes killer 
>> to invalidate the dax mappings.  If something error happened during 
>> the rmap walker, the fs will shutdown and there is no chance to 
>> invalidate the rest mappings whose user didn't be killed yet.
>>
>> Now that freezing the fs is enough, I will remove the drop cache code.
> 
> I removed the drop cache code, then kernel always went into crash when 
> running the test[1].  After the investigation, I found that the crash is 
> cause by accessing (invalidate dax pages when umounting fs) the page of 
> a pmem while the pmem has been removed.
> 
> According to the design, the dax page should have been invalidated by 
> mf_dax_kill_procs() but it didn't.  I found two reasons:
>   1. collect_procs_fsdax() only kills the current process
>   2. unmap_mapping_range() doesn't invalidate the dax pages 
> (disassociate dax entry in fs/dax.c), which causes the crash in my test
> 
> So, I think we should:
>   1. pass the mf_flag to collect_procs_fsdax() to let it collect all 
> processes associated with the file on the XFS.
>   2. drop cache is still needed, but just drop the associated files' 
> cache after mf_dax_kill_procs(), instead of dropping cache of the whole 
> filesystem.
> 
> Then the logic shuld be looked like this:
> unbind
>   `-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>     `-> xfs_dax_notify_failure()
>       `-> freeze_super()
>       `-> do xfs rmap
>         `-> mf_dax_kill_procs()
>           `-> collect_procs_fsdax()   // all associated
>           `-> unmap_and_kill()
>         `-> invalidate_inode_pages2() // drop file's cache
>       `-> thaw_super()
> 
> 
> [1] The step of unbind test:
>   1. create fsdax namespace on a pmem
>   2. mkfs.xfs on it
>   3. run fsx test in background
>   4. wait 1s
>   5. echo "pfn0.1" > unbind
>   6. wait 1s
>   7. umount xfs       --> crash happened
> 

Hi,

Any comments?


> 
> -- 
> Thanks,
> Ruan.
> 
>>
>>
>> -- 
>> Thanks,
>> Ruan.
>>
>>>
>>> -Dave.
diff mbox series

Patch

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..2e1a35e82fce 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@  void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 7d46a7e4980f..5f915cfc9632 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@ 
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -77,6 +78,9 @@  xfs_dax_failure_fn(
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* The device is about to be removed.  Not a really failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -168,7 +172,11 @@  xfs_dax_notify_ddev_failure(
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
+	} else if (mf_flags & MF_MEM_PRE_REMOVE) {
+		error = thaw_super(mp->m_super);
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
 	}
+
 	return error;
 }
 
@@ -182,6 +190,7 @@  xfs_dax_notify_failure(
 	struct xfs_mount	*mp = dax_holder(dax_dev);
 	u64			ddev_start;
 	u64			ddev_end;
+	int			error;
 
 	if (!(mp->m_super->s_flags & SB_BORN)) {
 		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
@@ -196,6 +205,8 @@  xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -209,6 +220,12 @@  xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
@@ -225,6 +242,15 @@  xfs_dax_notify_failure(
 	if (offset + len - 1 > ddev_end)
 		len = ddev_end - offset + 1;
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "device is about to be removed!");
+		error = freeze_super(mp->m_super);
+		if (error)
+			return error;
+		/* invalidate_inode_pages2() invalidates dax mapping */
+		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
+	}
+
 	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
 			mf_flags);
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f857163ac89..9711dbc9451f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3424,6 +3424,7 @@  enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);