diff mbox series

[v11,7/8] xfs: Implement ->notify_failure() for XFS

Message ID 20220227120747.711169-8-ruansy.fnst@fujitsu.com (mailing list archive)
State New
Headers show
Series fsdax: introduce fs query to support reflink | expand

Commit Message

Shiyang Ruan Feb. 27, 2022, 12:07 p.m. UTC
Introduce xfs_notify_failure.c to handle failure related works, such as
implement ->notify_failure(), register/unregister dax holder in xfs, and
so on.

If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data.  For now all we do
is kill processes with that file mapped into their address spaces, but
future patches could actually do something about corrupt metadata.

After that, the memory failure needs to notify the processes who are
using those files.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/xfs/Makefile             |   1 +
 fs/xfs/xfs_buf.c            |  12 ++
 fs/xfs/xfs_fsops.c          |   3 +
 fs/xfs/xfs_mount.h          |   1 +
 fs/xfs/xfs_notify_failure.c | 235 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_notify_failure.h |  10 ++
 fs/xfs/xfs_super.c          |   6 +
 7 files changed, 268 insertions(+)
 create mode 100644 fs/xfs/xfs_notify_failure.c
 create mode 100644 fs/xfs/xfs_notify_failure.h

Comments

kernel test robot Feb. 27, 2022, 2:05 p.m. UTC | #1
Hi Shiyang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on xfs-linux/for-next]
[also build test ERROR on linux/master]
[cannot apply to hnaz-mm/master linus/master v5.17-rc5 next-20220225]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Shiyang-Ruan/fsdax-introduce-fs-query-to-support-reflink/20220227-200849
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: i386-randconfig-a003 (https://download.01.org/0day-ci/archive/20220227/202202272203.U7VlQY3B-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/9f4bfbd2bae60e9f172e0b7332b2af32aa5baa87
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Shiyang-Ruan/fsdax-introduce-fs-query-to-support-reflink/20220227-200849
        git checkout 9f4bfbd2bae60e9f172e0b7332b2af32aa5baa87
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: fs/xfs/xfs_buf.o: in function `xfs_free_buftarg':
>> fs/xfs/xfs_buf.c:1897: undefined reference to `dax_unregister_holder'
   ld: fs/xfs/xfs_notify_failure.o: in function `xfs_dax_notify_failure':
>> fs/xfs/xfs_notify_failure.c:187: undefined reference to `dax_holder'


vim +1897 fs/xfs/xfs_buf.c

  1885	
  1886	void
  1887	xfs_free_buftarg(
  1888		struct xfs_buftarg	*btp)
  1889	{
  1890		unregister_shrinker(&btp->bt_shrinker);
  1891		ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
  1892		percpu_counter_destroy(&btp->bt_io_count);
  1893		list_lru_destroy(&btp->bt_lru);
  1894	
  1895		blkdev_issue_flush(btp->bt_bdev);
  1896		if (btp->bt_daxdev)
> 1897			dax_unregister_holder(btp->bt_daxdev, btp->bt_mount);
  1898		fs_put_dax(btp->bt_daxdev);
  1899	
  1900		kmem_free(btp);
  1901	}
  1902	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot Feb. 27, 2022, 3:36 p.m. UTC | #2
Hi Shiyang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on xfs-linux/for-next]
[also build test ERROR on linux/master]
[cannot apply to hnaz-mm/master linus/master v5.17-rc5 next-20220225]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Shiyang-Ruan/fsdax-introduce-fs-query-to-support-reflink/20220227-200849
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: hexagon-buildonly-randconfig-r005-20220227 (https://download.01.org/0day-ci/archive/20220227/202202272333.bhLvmuHF-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project d271fc04d5b97b12e6b797c6067d3c96a8d7470e)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/9f4bfbd2bae60e9f172e0b7332b2af32aa5baa87
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Shiyang-Ruan/fsdax-introduce-fs-query-to-support-reflink/20220227-200849
        git checkout 9f4bfbd2bae60e9f172e0b7332b2af32aa5baa87
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: dax_unregister_holder
   >>> referenced by xfs_buf.c
   >>> xfs/xfs_buf.o:(xfs_free_buftarg) in archive fs/built-in.a
   >>> referenced by xfs_buf.c
   >>> xfs/xfs_buf.o:(xfs_free_buftarg) in archive fs/built-in.a
--
>> ld.lld: error: undefined symbol: dax_holder
   >>> referenced by xfs_notify_failure.c
   >>> xfs/xfs_notify_failure.o:(xfs_dax_notify_failure) in archive fs/built-in.a
   >>> referenced by xfs_notify_failure.c
   >>> xfs/xfs_notify_failure.o:(xfs_dax_notify_failure) in archive fs/built-in.a

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot Feb. 27, 2022, 3:46 p.m. UTC | #3
Hi Shiyang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on xfs-linux/for-next]
[also build test ERROR on linux/master]
[cannot apply to hnaz-mm/master linus/master v5.17-rc5 next-20220225]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Shiyang-Ruan/fsdax-introduce-fs-query-to-support-reflink/20220227-200849
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: riscv-randconfig-p001-20220227 (https://download.01.org/0day-ci/archive/20220227/202202272331.SP0o3f9L-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/9f4bfbd2bae60e9f172e0b7332b2af32aa5baa87
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Shiyang-Ruan/fsdax-introduce-fs-query-to-support-reflink/20220227-200849
        git checkout 9f4bfbd2bae60e9f172e0b7332b2af32aa5baa87
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   riscv64-linux-ld: fs/xfs/xfs_buf.o: in function `.L828':
>> xfs_buf.c:(.text+0x3f7c): undefined reference to `dax_unregister_holder'
   riscv64-linux-ld: fs/xfs/xfs_notify_failure.o: in function `xfs_dax_notify_failure':
>> xfs_notify_failure.c:(.text+0x2b0): undefined reference to `dax_holder'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Christoph Hellwig March 30, 2022, 6 a.m. UTC | #4
> @@ -1892,6 +1893,8 @@ xfs_free_buftarg(
>  	list_lru_destroy(&btp->bt_lru);
>  
>  	blkdev_issue_flush(btp->bt_bdev);
> +	if (btp->bt_daxdev)
> +		dax_unregister_holder(btp->bt_daxdev, btp->bt_mount);
>  	fs_put_dax(btp->bt_daxdev);
>  
>  	kmem_free(btp);
> @@ -1939,6 +1942,7 @@ xfs_alloc_buftarg(
>  	struct block_device	*bdev)
>  {
>  	xfs_buftarg_t		*btp;
> +	int			error;
>  
>  	btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
>  
> @@ -1946,6 +1950,14 @@ xfs_alloc_buftarg(
>  	btp->bt_dev =  bdev->bd_dev;
>  	btp->bt_bdev = bdev;
>  	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
> +	if (btp->bt_daxdev) {
> +		error = dax_register_holder(btp->bt_daxdev, mp,
> +				&xfs_dax_holder_operations);
> +		if (error) {
> +			xfs_err(mp, "DAX device already in use?!");
> +			goto error_free;
> +		}
> +	}

It seems to me that just passing the holder and holder ops to
fs_dax_get_by_bdev and the holder to dax_unregister_holder would
significantly simply the interface here.

Dan, what do you think?

> +#if IS_ENABLED(CONFIG_MEMORY_FAILURE) && IS_ENABLED(CONFIG_FS_DAX)

No real need for the IS_ENABLED.  Also any reason to even build this
file if the options are not set?  It seems like
xfs_dax_holder_operations should just be defined to NULL and the
whole file not supported if we can't support the functionality.

Dan: not for this series, but is there any reason not to require
MEMORY_FAILURE for DAX to start with?

> +
> +	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> +	ddev_end = ddev_start +
> +		(mp->m_ddev_targp->bt_bdev->bd_nr_sectors << SECTOR_SHIFT) - 1;

This should use bdev_nr_bytes.

But didn't we say we don't want to support notifications on partitioned
devices and thus don't actually need all this?

> +
> +	/* Ignore the range out of filesystem area */
> +	if ((offset + len) < ddev_start)

No need for the inner braces.

> +	if ((offset + len) > ddev_end)

No need for the braces either.

> diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
> new file mode 100644
> index 000000000000..76187b9620f9
> --- /dev/null
> +++ b/fs/xfs/xfs_notify_failure.h
> @@ -0,0 +1,10 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
> + */
> +#ifndef __XFS_NOTIFY_FAILURE_H__
> +#define __XFS_NOTIFY_FAILURE_H__
> +
> +extern const struct dax_holder_operations xfs_dax_holder_operations;
> +
> +#endif  /* __XFS_NOTIFY_FAILURE_H__ */

Dowe really need a new header for this vs just sequeezing it into
xfs_super.h or something like that?

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e8f37bdc8354..b8de6ed2c888 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -353,6 +353,12 @@ xfs_setup_dax_always(
>  		return -EINVAL;
>  	}
>  
> +	if (xfs_has_reflink(mp) && !xfs_has_rmapbt(mp)) {
> +		xfs_alert(mp,
> +			"need rmapbt when both DAX and reflink enabled.");
> +		return -EINVAL;
> +	}

Right now we can't even enable reflink with DAX yet, so adding this
here seems premature - it should go into the patch allowing DAX+reflink.
Shiyang Ruan March 30, 2022, 3:16 p.m. UTC | #5
在 2022/3/30 14:00, Christoph Hellwig 写道:
>> @@ -1892,6 +1893,8 @@ xfs_free_buftarg(
>>   	list_lru_destroy(&btp->bt_lru);
>>   
>>   	blkdev_issue_flush(btp->bt_bdev);
>> +	if (btp->bt_daxdev)
>> +		dax_unregister_holder(btp->bt_daxdev, btp->bt_mount);
>>   	fs_put_dax(btp->bt_daxdev);
>>   
>>   	kmem_free(btp);
>> @@ -1939,6 +1942,7 @@ xfs_alloc_buftarg(
>>   	struct block_device	*bdev)
>>   {
>>   	xfs_buftarg_t		*btp;
>> +	int			error;
>>   
>>   	btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
>>   
>> @@ -1946,6 +1950,14 @@ xfs_alloc_buftarg(
>>   	btp->bt_dev =  bdev->bd_dev;
>>   	btp->bt_bdev = bdev;
>>   	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
>> +	if (btp->bt_daxdev) {
>> +		error = dax_register_holder(btp->bt_daxdev, mp,
>> +				&xfs_dax_holder_operations);
>> +		if (error) {
>> +			xfs_err(mp, "DAX device already in use?!");
>> +			goto error_free;
>> +		}
>> +	}
> 
> It seems to me that just passing the holder and holder ops to
> fs_dax_get_by_bdev and the holder to dax_unregister_holder would
> significantly simply the interface here.
> 
> Dan, what do you think?
> 
>> +#if IS_ENABLED(CONFIG_MEMORY_FAILURE) && IS_ENABLED(CONFIG_FS_DAX)
> 
> No real need for the IS_ENABLED.  Also any reason to even build this
> file if the options are not set?  It seems like
> xfs_dax_holder_operations should just be defined to NULL and the
> whole file not supported if we can't support the functionality.

Got it.  These two CONFIG seem not related for now.  So, I think I 
should wrap these code with #ifdef CONFIG_MEMORY_FAILURE here, and add 
`xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o` in the makefile.

> 
> Dan: not for this series, but is there any reason not to require
> MEMORY_FAILURE for DAX to start with?
> 
>> +
>> +	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>> +	ddev_end = ddev_start +
>> +		(mp->m_ddev_targp->bt_bdev->bd_nr_sectors << SECTOR_SHIFT) - 1;
> 
> This should use bdev_nr_bytes.

OK.

> 
> But didn't we say we don't want to support notifications on partitioned
> devices and thus don't actually need all this?
> 
>> +
>> +	/* Ignore the range out of filesystem area */
>> +	if ((offset + len) < ddev_start)
> 
> No need for the inner braces.
> 
>> +	if ((offset + len) > ddev_end)
> 
> No need for the braces either.

Really no need?  It is to make sure the range to be handled won't out of 
the filesystem area.  And make sure the @offset and @len are valid and 
correct after subtract the bbdev_start.

> 
>> diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
>> new file mode 100644
>> index 000000000000..76187b9620f9
>> --- /dev/null
>> +++ b/fs/xfs/xfs_notify_failure.h
>> @@ -0,0 +1,10 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
>> + */
>> +#ifndef __XFS_NOTIFY_FAILURE_H__
>> +#define __XFS_NOTIFY_FAILURE_H__
>> +
>> +extern const struct dax_holder_operations xfs_dax_holder_operations;
>> +
>> +#endif  /* __XFS_NOTIFY_FAILURE_H__ */
> 
> Dowe really need a new header for this vs just sequeezing it into
> xfs_super.h or something like that?

Yes, I'll move it into xfs_super.h.  The xfs_notify_failure.c was 
splitted from xfs_super.c in the old patch.  There is no need to create 
a header file for only single line of code.

> 
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index e8f37bdc8354..b8de6ed2c888 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -353,6 +353,12 @@ xfs_setup_dax_always(
>>   		return -EINVAL;
>>   	}
>>   
>> +	if (xfs_has_reflink(mp) && !xfs_has_rmapbt(mp)) {
>> +		xfs_alert(mp,
>> +			"need rmapbt when both DAX and reflink enabled.");
>> +		return -EINVAL;
>> +	}
> 
> Right now we can't even enable reflink with DAX yet, so adding this
> here seems premature - it should go into the patch allowing DAX+reflink.
> 

Yes.  I'll remove it for now.


--
Thanks,
Ruan.
Christoph Hellwig March 30, 2022, 3:52 p.m. UTC | #6
On Wed, Mar 30, 2022 at 11:16:10PM +0800, Shiyang Ruan wrote:
> > > +#if IS_ENABLED(CONFIG_MEMORY_FAILURE) && IS_ENABLED(CONFIG_FS_DAX)
> > 
> > No real need for the IS_ENABLED.  Also any reason to even build this
> > file if the options are not set?  It seems like
> > xfs_dax_holder_operations should just be defined to NULL and the
> > whole file not supported if we can't support the functionality.
> 
> Got it.  These two CONFIG seem not related for now.  So, I think I should
> wrap these code with #ifdef CONFIG_MEMORY_FAILURE here, and add
> `xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o` in the makefile.

I'd do

ifeq ($(CONFIG_MEMORY_FAILURE),y)
xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o
endif

in the makefile and keep it out of the actual source file entirely.

> > > +
> > > +	/* Ignore the range out of filesystem area */
> > > +	if ((offset + len) < ddev_start)
> > 
> > No need for the inner braces.
> > 
> > > +	if ((offset + len) > ddev_end)
> > 
> > No need for the braces either.
> 
> Really no need?  It is to make sure the range to be handled won't out of the
> filesystem area.  And make sure the @offset and @len are valid and correct
> after subtract the bbdev_start.

Yes, but there is no need for the braces per the precedence rules in
C.  So these could be:

	if (offset + len < ddev_start)

and

	if (offset + len > ddev_end)
Shiyang Ruan April 8, 2022, 6:04 a.m. UTC | #7
在 2022/3/30 14:00, Christoph Hellwig 写道:
>> @@ -1892,6 +1893,8 @@ xfs_free_buftarg(
>>   	list_lru_destroy(&btp->bt_lru);
>>   
>>   	blkdev_issue_flush(btp->bt_bdev);
>> +	if (btp->bt_daxdev)
>> +		dax_unregister_holder(btp->bt_daxdev, btp->bt_mount);
>>   	fs_put_dax(btp->bt_daxdev);
>>   
>>   	kmem_free(btp);
>> @@ -1939,6 +1942,7 @@ xfs_alloc_buftarg(
>>   	struct block_device	*bdev)
>>   {
>>   	xfs_buftarg_t		*btp;
>> +	int			error;
>>   
>>   	btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
>>   
>> @@ -1946,6 +1950,14 @@ xfs_alloc_buftarg(
>>   	btp->bt_dev =  bdev->bd_dev;
>>   	btp->bt_bdev = bdev;
>>   	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
>> +	if (btp->bt_daxdev) {
>> +		error = dax_register_holder(btp->bt_daxdev, mp,
>> +				&xfs_dax_holder_operations);
>> +		if (error) {
>> +			xfs_err(mp, "DAX device already in use?!");
>> +			goto error_free;
>> +		}
>> +	}
> 
> It seems to me that just passing the holder and holder ops to
> fs_dax_get_by_bdev and the holder to dax_unregister_holder would
> significantly simply the interface here.
> 
> Dan, what do you think?

Hi Dan,

Could you give some advise on this API?  Is it needed to move 
dax_register_holder's job into fs_dax_get_by_bdev()?


--
Thanks,
Ruan

> 
>> +#if IS_ENABLED(CONFIG_MEMORY_FAILURE) && IS_ENABLED(CONFIG_FS_DAX)
> 
> No real need for the IS_ENABLED.  Also any reason to even build this
> file if the options are not set?  It seems like
> xfs_dax_holder_operations should just be defined to NULL and the
> whole file not supported if we can't support the functionality.
> 
> Dan: not for this series, but is there any reason not to require
> MEMORY_FAILURE for DAX to start with?
> 
>> +
>> +	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>> +	ddev_end = ddev_start +
>> +		(mp->m_ddev_targp->bt_bdev->bd_nr_sectors << SECTOR_SHIFT) - 1;
> 
> This should use bdev_nr_bytes.
> 
> But didn't we say we don't want to support notifications on partitioned
> devices and thus don't actually need all this?
> 
>> +
>> +	/* Ignore the range out of filesystem area */
>> +	if ((offset + len) < ddev_start)
> 
> No need for the inner braces.
> 
>> +	if ((offset + len) > ddev_end)
> 
> No need for the braces either.
> 
>> diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
>> new file mode 100644
>> index 000000000000..76187b9620f9
>> --- /dev/null
>> +++ b/fs/xfs/xfs_notify_failure.h
>> @@ -0,0 +1,10 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
>> + */
>> +#ifndef __XFS_NOTIFY_FAILURE_H__
>> +#define __XFS_NOTIFY_FAILURE_H__
>> +
>> +extern const struct dax_holder_operations xfs_dax_holder_operations;
>> +
>> +#endif  /* __XFS_NOTIFY_FAILURE_H__ */
> 
> Dowe really need a new header for this vs just sequeezing it into
> xfs_super.h or something like that?
> 
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index e8f37bdc8354..b8de6ed2c888 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -353,6 +353,12 @@ xfs_setup_dax_always(
>>   		return -EINVAL;
>>   	}
>>   
>> +	if (xfs_has_reflink(mp) && !xfs_has_rmapbt(mp)) {
>> +		xfs_alert(mp,
>> +			"need rmapbt when both DAX and reflink enabled.");
>> +		return -EINVAL;
>> +	}
> 
> Right now we can't even enable reflink with DAX yet, so adding this
> here seems premature - it should go into the patch allowing DAX+reflink.
>
Dan Williams April 8, 2022, 6:25 a.m. UTC | #8
On Tue, Mar 29, 2022 at 11:01 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> > @@ -1892,6 +1893,8 @@ xfs_free_buftarg(
> >       list_lru_destroy(&btp->bt_lru);
> >
> >       blkdev_issue_flush(btp->bt_bdev);
> > +     if (btp->bt_daxdev)
> > +             dax_unregister_holder(btp->bt_daxdev, btp->bt_mount);
> >       fs_put_dax(btp->bt_daxdev);
> >
> >       kmem_free(btp);
> > @@ -1939,6 +1942,7 @@ xfs_alloc_buftarg(
> >       struct block_device     *bdev)
> >  {
> >       xfs_buftarg_t           *btp;
> > +     int                     error;
> >
> >       btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
> >
> > @@ -1946,6 +1950,14 @@ xfs_alloc_buftarg(
> >       btp->bt_dev =  bdev->bd_dev;
> >       btp->bt_bdev = bdev;
> >       btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
> > +     if (btp->bt_daxdev) {
> > +             error = dax_register_holder(btp->bt_daxdev, mp,
> > +                             &xfs_dax_holder_operations);
> > +             if (error) {
> > +                     xfs_err(mp, "DAX device already in use?!");
> > +                     goto error_free;
> > +             }
> > +     }
>
> It seems to me that just passing the holder and holder ops to
> fs_dax_get_by_bdev and the holder to dax_unregister_holder would
> significantly simply the interface here.
>
> Dan, what do you think?

Yes, makes sense, just like the optional holder arguments to blkdev_get_by_*().

>
> > +#if IS_ENABLED(CONFIG_MEMORY_FAILURE) && IS_ENABLED(CONFIG_FS_DAX)
>
> No real need for the IS_ENABLED.  Also any reason to even build this
> file if the options are not set?  It seems like
> xfs_dax_holder_operations should just be defined to NULL and the
> whole file not supported if we can't support the functionality.
>
> Dan: not for this series, but is there any reason not to require
> MEMORY_FAILURE for DAX to start with?

Given that DAX ties some storage semantics to memory and storage
supports EIO I can see an argument to require memory_failure() for
DAX, and especially for DAX on CXL where hotplug is supported it will
be necessary. Linux currently has no facility to consult PCI drivers
about removal actions, so the only recourse for a force removed CXL
device is mass memory_failure().

>
> > +
> > +     ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > +     ddev_end = ddev_start +
> > +             (mp->m_ddev_targp->bt_bdev->bd_nr_sectors << SECTOR_SHIFT) - 1;
>
> This should use bdev_nr_bytes.
>
> But didn't we say we don't want to support notifications on partitioned
> devices and thus don't actually need all this?

Right.
Dan Williams April 8, 2022, 6:26 a.m. UTC | #9
On Thu, Apr 7, 2022 at 11:05 PM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
>
>
>
> 在 2022/3/30 14:00, Christoph Hellwig 写道:
> >> @@ -1892,6 +1893,8 @@ xfs_free_buftarg(
> >>      list_lru_destroy(&btp->bt_lru);
> >>
> >>      blkdev_issue_flush(btp->bt_bdev);
> >> +    if (btp->bt_daxdev)
> >> +            dax_unregister_holder(btp->bt_daxdev, btp->bt_mount);
> >>      fs_put_dax(btp->bt_daxdev);
> >>
> >>      kmem_free(btp);
> >> @@ -1939,6 +1942,7 @@ xfs_alloc_buftarg(
> >>      struct block_device     *bdev)
> >>   {
> >>      xfs_buftarg_t           *btp;
> >> +    int                     error;
> >>
> >>      btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
> >>
> >> @@ -1946,6 +1950,14 @@ xfs_alloc_buftarg(
> >>      btp->bt_dev =  bdev->bd_dev;
> >>      btp->bt_bdev = bdev;
> >>      btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
> >> +    if (btp->bt_daxdev) {
> >> +            error = dax_register_holder(btp->bt_daxdev, mp,
> >> +                            &xfs_dax_holder_operations);
> >> +            if (error) {
> >> +                    xfs_err(mp, "DAX device already in use?!");
> >> +                    goto error_free;
> >> +            }
> >> +    }
> >
> > It seems to me that just passing the holder and holder ops to
> > fs_dax_get_by_bdev and the holder to dax_unregister_holder would
> > significantly simply the interface here.
> >
> > Dan, what do you think?
>
> Hi Dan,
>
> Could you give some advise on this API?  Is it needed to move
> dax_register_holder's job into fs_dax_get_by_bdev()?


Yes, works for me to just add them as optional arguments.
diff mbox series

Patch

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 04611a1068b4..389970b3e13b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -84,6 +84,7 @@  xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_notify_failure.o \
 				   xfs_pwork.o \
 				   xfs_reflink.o \
 				   xfs_stats.o \
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b45e0d50a405..5f4984a5e108 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -19,6 +19,7 @@ 
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_ag.h"
+#include "xfs_notify_failure.h"
 
 static struct kmem_cache *xfs_buf_cache;
 
@@ -1892,6 +1893,8 @@  xfs_free_buftarg(
 	list_lru_destroy(&btp->bt_lru);
 
 	blkdev_issue_flush(btp->bt_bdev);
+	if (btp->bt_daxdev)
+		dax_unregister_holder(btp->bt_daxdev, btp->bt_mount);
 	fs_put_dax(btp->bt_daxdev);
 
 	kmem_free(btp);
@@ -1939,6 +1942,7 @@  xfs_alloc_buftarg(
 	struct block_device	*bdev)
 {
 	xfs_buftarg_t		*btp;
+	int			error;
 
 	btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
 
@@ -1946,6 +1950,14 @@  xfs_alloc_buftarg(
 	btp->bt_dev =  bdev->bd_dev;
 	btp->bt_bdev = bdev;
 	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
+	if (btp->bt_daxdev) {
+		error = dax_register_holder(btp->bt_daxdev, mp,
+				&xfs_dax_holder_operations);
+		if (error) {
+			xfs_err(mp, "DAX device already in use?!");
+			goto error_free;
+		}
+	}
 
 	/*
 	 * Buffer IO error rate limiting. Limit it to no more than 10 messages
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 33e26690a8c4..d4d36c5bef11 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -542,6 +542,9 @@  xfs_do_force_shutdown(
 	} else if (flags & SHUTDOWN_CORRUPT_INCORE) {
 		tag = XFS_PTAG_SHUTDOWN_CORRUPT;
 		why = "Corruption of in-memory data";
+	} else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
+		tag = XFS_PTAG_SHUTDOWN_CORRUPT;
+		why = "Corruption of on-disk metadata";
 	} else {
 		tag = XFS_PTAG_SHUTDOWN_IOERROR;
 		why = "Metadata I/O Error";
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 00720a02e761..47ff4ac53c4c 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -435,6 +435,7 @@  void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_LOG_IO_ERROR	0x0002	/* write attempt to the log failed */
 #define SHUTDOWN_FORCE_UMOUNT	0x0004	/* shutdown from a forced unmount */
 #define SHUTDOWN_CORRUPT_INCORE	0x0008	/* corrupt in-memory data structures */
+#define SHUTDOWN_CORRUPT_ONDISK	0x0010  /* corrupt metadata on device */
 
 #define XFS_SHUTDOWN_STRINGS \
 	{ SHUTDOWN_META_IO_ERROR,	"metadata_io" }, \
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
new file mode 100644
index 000000000000..b057e9f7eb31
--- /dev/null
+++ b/fs/xfs/xfs_notify_failure.c
@@ -0,0 +1,235 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
+ */
+
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_alloc.h"
+#include "xfs_bit.h"
+#include "xfs_btree.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_trans.h"
+
+#include <linux/mm.h>
+#include <linux/dax.h>
+
+struct failure_info {
+	xfs_agblock_t		startblock;
+	xfs_extlen_t		blockcount;
+	int			mf_flags;
+};
+
+#if IS_ENABLED(CONFIG_MEMORY_FAILURE) && IS_ENABLED(CONFIG_FS_DAX)
+static pgoff_t
+xfs_failure_pgoff(
+	struct xfs_mount		*mp,
+	const struct xfs_rmap_irec	*rec,
+	const struct failure_info	*notify)
+{
+	uint64_t			pos = rec->rm_offset;
+
+	if (notify->startblock > rec->rm_startblock)
+		pos += XFS_FSB_TO_B(mp,
+				notify->startblock - rec->rm_startblock);
+	return pos >> PAGE_SHIFT;
+}
+
+static unsigned long
+xfs_failure_pgcnt(
+	struct xfs_mount		*mp,
+	const struct xfs_rmap_irec	*rec,
+	const struct failure_info	*notify)
+{
+	xfs_agblock_t			end_rec;
+	xfs_agblock_t			end_notify;
+	xfs_agblock_t			start_cross;
+	xfs_agblock_t			end_cross;
+
+	start_cross = max(rec->rm_startblock, notify->startblock);
+
+	end_rec = rec->rm_startblock + rec->rm_blockcount;
+	end_notify = notify->startblock + notify->blockcount;
+	end_cross = min(end_rec, end_notify);
+
+	return XFS_FSB_TO_B(mp, end_cross - start_cross) >> PAGE_SHIFT;
+}
+
+static int
+xfs_dax_failure_fn(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*data)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_inode		*ip;
+	struct failure_info		*notify = data;
+	int				error = 0;
+
+	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
+		return -EFSCORRUPTED;
+	}
+
+	/* Get files that incore, filter out others that are not in use. */
+	error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
+			 0, &ip);
+	/* Continue the rmap query if the inode isn't incore */
+	if (error == -ENODATA)
+		return 0;
+	if (error)
+		return error;
+
+	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
+				  xfs_failure_pgoff(mp, rec, notify),
+				  xfs_failure_pgcnt(mp, rec, notify),
+				  notify->mf_flags);
+	xfs_irele(ip);
+	return error;
+}
+#else
+static int
+xfs_dax_failure_fn(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*data)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+
+	/* No other option besides shutting down the fs. */
+	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
+	return -EFSCORRUPTED;
+}
+#endif /* CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */
+
+static int
+xfs_dax_notify_ddev_failure(
+	struct xfs_mount	*mp,
+	xfs_daddr_t		daddr,
+	xfs_daddr_t		bblen,
+	int			mf_flags)
+{
+	struct xfs_trans	*tp = NULL;
+	struct xfs_btree_cur	*cur = NULL;
+	struct xfs_buf		*agf_bp = NULL;
+	int			error = 0;
+	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
+
+	error = xfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		return error;
+
+	for (; agno <= end_agno; agno++) {
+		struct xfs_rmap_irec	ri_low = { };
+		struct xfs_rmap_irec	ri_high;
+		struct failure_info	notify;
+		struct xfs_agf		*agf;
+		xfs_agblock_t		agend;
+
+		error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+		if (error)
+			break;
+
+		cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agf_bp->b_pag);
+
+		/*
+		 * Set the rmap range from ri_low to ri_high, which represents
+		 * a [start, end] where we looking for the files or metadata.
+		 * The part of range out of a AG will be ignored.  So, it's fine
+		 * to set ri_low to "startblock" in all loops.  When it reaches
+		 * the last AG, set the ri_high to "endblock" to make sure we
+		 * actually end at the end.
+		 */
+		memset(&ri_high, 0xFF, sizeof(ri_high));
+		ri_low.rm_startblock = XFS_FSB_TO_AGBNO(mp, fsbno);
+		if (agno == end_agno)
+			ri_high.rm_startblock = XFS_FSB_TO_AGBNO(mp, end_fsbno);
+
+		agf = agf_bp->b_addr;
+		agend = min(be32_to_cpu(agf->agf_length),
+				ri_high.rm_startblock);
+		notify.startblock = ri_low.rm_startblock;
+		notify.blockcount = agend - ri_low.rm_startblock;
+
+		error = xfs_rmap_query_range(cur, &ri_low, &ri_high,
+				xfs_dax_failure_fn, &notify);
+		xfs_btree_del_cursor(cur, error);
+		xfs_trans_brelse(tp, agf_bp);
+		if (error)
+			break;
+
+		fsbno = XFS_AGB_TO_FSB(mp, agno + 1, 0);
+	}
+
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+static int
+xfs_dax_notify_failure(
+	struct dax_device	*dax_dev,
+	u64			offset,
+	u64			len,
+	int			mf_flags)
+{
+	struct xfs_mount	*mp = dax_holder(dax_dev);
+	u64			ddev_start;
+	u64			ddev_end;
+
+	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
+		xfs_warn(mp,
+			 "notify_failure() not supported on realtime device!");
+		return -EOPNOTSUPP;
+	}
+
+	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
+	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
+		return -EFSCORRUPTED;
+	}
+
+	if (!xfs_has_rmapbt(mp)) {
+		xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
+		return -EOPNOTSUPP;
+	}
+
+	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
+	ddev_end = ddev_start +
+		(mp->m_ddev_targp->bt_bdev->bd_nr_sectors << SECTOR_SHIFT) - 1;
+
+	/* Ignore the range out of filesystem area */
+	if ((offset + len) < ddev_start)
+		return -ENXIO;
+	if (offset > ddev_end)
+		return -ENXIO;
+
+	/* Calculate the real range when it touches the boundary */
+	if (offset > ddev_start)
+		offset -= ddev_start;
+	else {
+		len -= ddev_start - offset;
+		offset = 0;
+	}
+	if ((offset + len) > ddev_end)
+		len -= ddev_end - offset;
+
+	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
+			mf_flags);
+}
+
+const struct dax_holder_operations xfs_dax_holder_operations = {
+	.notify_failure		= xfs_dax_notify_failure,
+};
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
new file mode 100644
index 000000000000..76187b9620f9
--- /dev/null
+++ b/fs/xfs/xfs_notify_failure.h
@@ -0,0 +1,10 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
+ */
+#ifndef __XFS_NOTIFY_FAILURE_H__
+#define __XFS_NOTIFY_FAILURE_H__
+
+extern const struct dax_holder_operations xfs_dax_holder_operations;
+
+#endif  /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e8f37bdc8354..b8de6ed2c888 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -353,6 +353,12 @@  xfs_setup_dax_always(
 		return -EINVAL;
 	}
 
+	if (xfs_has_reflink(mp) && !xfs_has_rmapbt(mp)) {
+		xfs_alert(mp,
+			"need rmapbt when both DAX and reflink enabled.");
+		return -EINVAL;
+	}
+
 	xfs_warn(mp, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
 	return 0;