diff mbox series

xfs: fail dax mount if reflink is enabled on a partition

Message ID 20220609143435.393724-1-ruansy.fnst@fujitsu.com (mailing list archive)
State Accepted
Commit 35fcd75af3edf035638e632bb49607cc8fc3cdf4
Headers show
Series xfs: fail dax mount if reflink is enabled on a partition | expand

Commit Message

Shiyang Ruan June 9, 2022, 2:34 p.m. UTC
Failure notification is not supported on partitions.  So, when we mount
a reflink enabled xfs on a partition with dax option, let it fail with
-EINVAL code.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/xfs/xfs_super.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig June 10, 2022, 5:46 a.m. UTC | #1
On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> Failure notification is not supported on partitions.  So, when we mount
> a reflink enabled xfs on a partition with dax option, let it fail with
> -EINVAL code.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong July 1, 2022, 12:31 a.m. UTC | #2
On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> Failure notification is not supported on partitions.  So, when we mount
> a reflink enabled xfs on a partition with dax option, let it fail with
> -EINVAL code.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>

Looks good to me, though I think this patch applies to ... wherever all
those rmap+reflink+dax patches went.  I think that's akpm's tree, right?

Ideally this would go in through there to keep the pieces together, but
I don't mind tossing this in at the end of the 5.20 merge window if akpm
is unwilling.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_super.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 8495ef076ffc..a3c221841fa6 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>  		goto disable_dax;
>  	}
>  
> -	if (xfs_has_reflink(mp)) {
> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
> +	if (xfs_has_reflink(mp) &&
> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> +		xfs_alert(mp,
> +			"DAX and reflink cannot work with multi-partitions!");
>  		return -EINVAL;
>  	}
>  
> -- 
> 2.36.1
> 
> 
>
Shiyang Ruan July 1, 2022, 5:14 a.m. UTC | #3
在 2022/7/1 8:31, Darrick J. Wong 写道:
> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>> Failure notification is not supported on partitions.  So, when we mount
>> a reflink enabled xfs on a partition with dax option, let it fail with
>> -EINVAL code.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> 
> Looks good to me, though I think this patch applies to ... wherever all
> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?

Yes, they are in his tree, still in mm-unstable now.

> 
> Ideally this would go in through there to keep the pieces together, but
> I don't mind tossing this in at the end of the 5.20 merge window if akpm
> is unwilling.

Both are fine to me.  Thanks!


--
Ruan.

> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
>> ---
>>   fs/xfs/xfs_super.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 8495ef076ffc..a3c221841fa6 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>>   		goto disable_dax;
>>   	}
>>   
>> -	if (xfs_has_reflink(mp)) {
>> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
>> +	if (xfs_has_reflink(mp) &&
>> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
>> +		xfs_alert(mp,
>> +			"DAX and reflink cannot work with multi-partitions!");
>>   		return -EINVAL;
>>   	}
>>   
>> -- 
>> 2.36.1
>>
>>
>>
Shiyang Ruan July 21, 2022, 12:43 p.m. UTC | #4
Hi Andrew,

I am not sure if you had seen this.  So make a ping.

在 2022/7/1 13:14, Shiyang Ruan 写道:
> 
> 
> 在 2022/7/1 8:31, Darrick J. Wong 写道:
>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>>> Failure notification is not supported on partitions.  So, when we mount
>>> a reflink enabled xfs on a partition with dax option, let it fail with
>>> -EINVAL code.
>>>
>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>
>> Looks good to me, though I think this patch applies to ... wherever all
>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
> 
> Yes, they are in his tree, still in mm-unstable now.
> 
>>
>> Ideally this would go in through there to keep the pieces together, but
>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
>> is unwilling.

What's your opinion on this?  I found that the rmap+reflink+dax patches have been merged into mm-stable,  so maybe it's a bit late to merge this one.  ( I'm not pushing, just to know the situation. )


--
Thanks,
Ruan. 

> 
> Both are fine to me.  Thanks!
> 
> 
> -- 
> Ruan.
> 
>>
>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>
>> --D
>>
>>> ---
>>>   fs/xfs/xfs_super.c | 6 ++++--
>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>>> index 8495ef076ffc..a3c221841fa6 100644
>>> --- a/fs/xfs/xfs_super.c
>>> +++ b/fs/xfs/xfs_super.c
>>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>>>           goto disable_dax;
>>>       }
>>> -    if (xfs_has_reflink(mp)) {
>>> -        xfs_alert(mp, "DAX and reflink cannot be used together!");
>>> +    if (xfs_has_reflink(mp) &&
>>> +        bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
>>> +        xfs_alert(mp,
>>> +            "DAX and reflink cannot work with multi-partitions!");
>>>           return -EINVAL;
>>>       }
>>> -- 
>>> 2.36.1
>>>
>>>
>>>
> 
> 
>
Shiyang Ruan July 21, 2022, 2:06 p.m. UTC | #5
在 2022/7/1 8:31, Darrick J. Wong 写道:
> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>> Failure notification is not supported on partitions.  So, when we mount
>> a reflink enabled xfs on a partition with dax option, let it fail with
>> -EINVAL code.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> 
> Looks good to me, though I think this patch applies to ... wherever all
> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
> 
> Ideally this would go in through there to keep the pieces together, but
> I don't mind tossing this in at the end of the 5.20 merge window if akpm
> is unwilling.

BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are 
waiting to be merged, is it time to think about "removing the 
experimental tag" again?  :)


--
Thanks,
Ruan.

> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
>> ---
>>   fs/xfs/xfs_super.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 8495ef076ffc..a3c221841fa6 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>>   		goto disable_dax;
>>   	}
>>   
>> -	if (xfs_has_reflink(mp)) {
>> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
>> +	if (xfs_has_reflink(mp) &&
>> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
>> +		xfs_alert(mp,
>> +			"DAX and reflink cannot work with multi-partitions!");
>>   		return -EINVAL;
>>   	}
>>   
>> -- 
>> 2.36.1
>>
>>
>>
Darrick J. Wong July 21, 2022, 4:16 p.m. UTC | #6
On Thu, Jul 21, 2022 at 02:06:10PM +0000, ruansy.fnst@fujitsu.com wrote:
> 在 2022/7/1 8:31, Darrick J. Wong 写道:
> > On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> >> Failure notification is not supported on partitions.  So, when we mount
> >> a reflink enabled xfs on a partition with dax option, let it fail with
> >> -EINVAL code.
> >>
> >> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > 
> > Looks good to me, though I think this patch applies to ... wherever all
> > those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
> > 
> > Ideally this would go in through there to keep the pieces together, but
> > I don't mind tossing this in at the end of the 5.20 merge window if akpm
> > is unwilling.
> 
> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are 
> waiting to be merged, is it time to think about "removing the 
> experimental tag" again?  :)

It's probably time to take up that question again.

Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
and at least one of those dm layers no longer allows fsdax pass-through,
so XFS silently turned mount -o dax into -o dax=never. :(

I'm not sure how to fix that...

--D

> 
> --
> Thanks,
> Ruan.
> 
> > 
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > --D
> > 
> >> ---
> >>   fs/xfs/xfs_super.c | 6 ++++--
> >>   1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> >> index 8495ef076ffc..a3c221841fa6 100644
> >> --- a/fs/xfs/xfs_super.c
> >> +++ b/fs/xfs/xfs_super.c
> >> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
> >>   		goto disable_dax;
> >>   	}
> >>   
> >> -	if (xfs_has_reflink(mp)) {
> >> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
> >> +	if (xfs_has_reflink(mp) &&
> >> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> >> +		xfs_alert(mp,
> >> +			"DAX and reflink cannot work with multi-partitions!");
> >>   		return -EINVAL;
> >>   	}
> >>   
> >> -- 
> >> 2.36.1
> >>
> >>
> >>
Shiyang Ruan July 29, 2022, 3:55 a.m. UTC | #7
在 2022/7/22 0:16, Darrick J. Wong 写道:
> On Thu, Jul 21, 2022 at 02:06:10PM +0000, ruansy.fnst@fujitsu.com wrote:
>> 在 2022/7/1 8:31, Darrick J. Wong 写道:
>>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>>>> Failure notification is not supported on partitions.  So, when we mount
>>>> a reflink enabled xfs on a partition with dax option, let it fail with
>>>> -EINVAL code.
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>
>>> Looks good to me, though I think this patch applies to ... wherever all
>>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
>>>
>>> Ideally this would go in through there to keep the pieces together, but
>>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
>>> is unwilling.
>>
>> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
>> waiting to be merged, is it time to think about "removing the
>> experimental tag" again?  :)
> 
> It's probably time to take up that question again.
> 
> Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
> and at least one of those dm layers no longer allows fsdax pass-through,
> so XFS silently turned mount -o dax into -o dax=never. :(

Hi Darrick,

I tried generic/470 but it didn't run:
   [not run] Cannot use thin-pool devices on DAX capable block devices.

Did you modify the _require_dm_target() in common/rc?  I added thin-pool 
to not to check dax capability:

         case $target in
         stripe|linear|log-writes|thin-pool)  # add thin-pool here
                 ;;

then the case finally ran and it silently turned off dax as you said.

Are the steps for reproduction correct? If so, I will continue to 
investigate this problem.


--
Thanks,
Ruan.



> 
> I'm not sure how to fix that...
> 
> --D
> 
>>
>> --
>> Thanks,
>> Ruan.
>>
>>>
>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>
>>> --D
>>>
>>>> ---
>>>>    fs/xfs/xfs_super.c | 6 ++++--
>>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>>>> index 8495ef076ffc..a3c221841fa6 100644
>>>> --- a/fs/xfs/xfs_super.c
>>>> +++ b/fs/xfs/xfs_super.c
>>>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>>>>    		goto disable_dax;
>>>>    	}
>>>>    
>>>> -	if (xfs_has_reflink(mp)) {
>>>> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
>>>> +	if (xfs_has_reflink(mp) &&
>>>> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
>>>> +		xfs_alert(mp,
>>>> +			"DAX and reflink cannot work with multi-partitions!");
>>>>    		return -EINVAL;
>>>>    	}
>>>>    
>>>> -- 
>>>> 2.36.1
>>>>
>>>>
>>>>
Darrick J. Wong July 29, 2022, 4:54 a.m. UTC | #8
On Fri, Jul 29, 2022 at 03:55:24AM +0000, ruansy.fnst@fujitsu.com wrote:
> 
> 
> 在 2022/7/22 0:16, Darrick J. Wong 写道:
> > On Thu, Jul 21, 2022 at 02:06:10PM +0000, ruansy.fnst@fujitsu.com wrote:
> >> 在 2022/7/1 8:31, Darrick J. Wong 写道:
> >>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> >>>> Failure notification is not supported on partitions.  So, when we mount
> >>>> a reflink enabled xfs on a partition with dax option, let it fail with
> >>>> -EINVAL code.
> >>>>
> >>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> >>>
> >>> Looks good to me, though I think this patch applies to ... wherever all
> >>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
> >>>
> >>> Ideally this would go in through there to keep the pieces together, but
> >>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
> >>> is unwilling.
> >>
> >> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
> >> waiting to be merged, is it time to think about "removing the
> >> experimental tag" again?  :)
> > 
> > It's probably time to take up that question again.
> > 
> > Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> > didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
> > and at least one of those dm layers no longer allows fsdax pass-through,
> > so XFS silently turned mount -o dax into -o dax=never. :(
> 
> Hi Darrick,
> 
> I tried generic/470 but it didn't run:
>    [not run] Cannot use thin-pool devices on DAX capable block devices.
> 
> Did you modify the _require_dm_target() in common/rc?  I added thin-pool 
> to not to check dax capability:
> 
>          case $target in
>          stripe|linear|log-writes|thin-pool)  # add thin-pool here
>                  ;;
> 
> then the case finally ran and it silently turned off dax as you said.
> 
> Are the steps for reproduction correct? If so, I will continue to 
> investigate this problem.

Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
mention that.  I suspect that the removal of dm support for pmem is
going to force us to completely redesign this test.  I can't really
think of how, though, since there's no good way that I know of to gain a
point-in-time snapshot of a pmem device.

--D

> 
> --
> Thanks,
> Ruan.
> 
> 
> 
> > 
> > I'm not sure how to fix that...
> > 
> > --D
> > 
> >>
> >> --
> >> Thanks,
> >> Ruan.
> >>
> >>>
> >>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> --D
> >>>
> >>>> ---
> >>>>    fs/xfs/xfs_super.c | 6 ++++--
> >>>>    1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> >>>> index 8495ef076ffc..a3c221841fa6 100644
> >>>> --- a/fs/xfs/xfs_super.c
> >>>> +++ b/fs/xfs/xfs_super.c
> >>>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
> >>>>    		goto disable_dax;
> >>>>    	}
> >>>>    
> >>>> -	if (xfs_has_reflink(mp)) {
> >>>> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
> >>>> +	if (xfs_has_reflink(mp) &&
> >>>> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> >>>> +		xfs_alert(mp,
> >>>> +			"DAX and reflink cannot work with multi-partitions!");
> >>>>    		return -EINVAL;
> >>>>    	}
> >>>>    
> >>>> -- 
> >>>> 2.36.1
> >>>>
> >>>>
> >>>>
Shiyang Ruan Aug. 3, 2022, 6:47 a.m. UTC | #9
在 2022/7/29 12:54, Darrick J. Wong 写道:
> On Fri, Jul 29, 2022 at 03:55:24AM +0000, ruansy.fnst@fujitsu.com wrote:
>>
>>
>> 在 2022/7/22 0:16, Darrick J. Wong 写道:
>>> On Thu, Jul 21, 2022 at 02:06:10PM +0000, ruansy.fnst@fujitsu.com wrote:
>>>> 在 2022/7/1 8:31, Darrick J. Wong 写道:
>>>>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>>>>>> Failure notification is not supported on partitions.  So, when we mount
>>>>>> a reflink enabled xfs on a partition with dax option, let it fail with
>>>>>> -EINVAL code.
>>>>>>
>>>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>>>
>>>>> Looks good to me, though I think this patch applies to ... wherever all
>>>>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
>>>>>
>>>>> Ideally this would go in through there to keep the pieces together, but
>>>>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
>>>>> is unwilling.
>>>>
>>>> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
>>>> waiting to be merged, is it time to think about "removing the
>>>> experimental tag" again?  :)
>>>
>>> It's probably time to take up that question again.
>>>
>>> Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
>>> didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
>>> and at least one of those dm layers no longer allows fsdax pass-through,
>>> so XFS silently turned mount -o dax into -o dax=never. :(
>>
>> Hi Darrick,
>>
>> I tried generic/470 but it didn't run:
>>     [not run] Cannot use thin-pool devices on DAX capable block devices.
>>
>> Did you modify the _require_dm_target() in common/rc?  I added thin-pool
>> to not to check dax capability:
>>
>>           case $target in
>>           stripe|linear|log-writes|thin-pool)  # add thin-pool here
>>                   ;;
>>
>> then the case finally ran and it silently turned off dax as you said.
>>
>> Are the steps for reproduction correct? If so, I will continue to
>> investigate this problem.
> 
> Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
> mention that.  I suspect that the removal of dm support for pmem is
> going to force us to completely redesign this test.  I can't really
> think of how, though, since there's no good way that I know of to gain a
> point-in-time snapshot of a pmem device.

Hi Darrick,

 > removal of dm support for pmem
I think here we are saying about xfstest who removed the support, not 
kernel?

I found some xfstests commits:
fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool

So, this case was never able to run since the second commit?  (I didn't 
notice the not run case.  I thought it was expected to be not run.)

And according to the first commit, the restriction was added because 
some of dm devices don't support dax.  So my understanding is: we should 
redesign the case to make the it work, and firstly, we should add dax 
support for dm devices in kernel.


In addition, is there any other testcase has the same problem?  so that 
we can deal with them together.


--
Thanks,
Ruan


> 
> --D
> 
>>
>> --
>> Thanks,
>> Ruan.
>>
>>
>>
>>>
>>> I'm not sure how to fix that...
>>>
>>> --D
>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Ruan.
>>>>
>>>>>
>>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>>>
>>>>> --D
>>>>>
>>>>>> ---
>>>>>>     fs/xfs/xfs_super.c | 6 ++++--
>>>>>>     1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>>>>>> index 8495ef076ffc..a3c221841fa6 100644
>>>>>> --- a/fs/xfs/xfs_super.c
>>>>>> +++ b/fs/xfs/xfs_super.c
>>>>>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>>>>>>     		goto disable_dax;
>>>>>>     	}
>>>>>>     
>>>>>> -	if (xfs_has_reflink(mp)) {
>>>>>> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
>>>>>> +	if (xfs_has_reflink(mp) &&
>>>>>> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
>>>>>> +		xfs_alert(mp,
>>>>>> +			"DAX and reflink cannot work with multi-partitions!");
>>>>>>     		return -EINVAL;
>>>>>>     	}
>>>>>>     
>>>>>> -- 
>>>>>> 2.36.1
>>>>>>
>>>>>>
>>>>>>
Darrick J. Wong Aug. 4, 2022, 12:51 a.m. UTC | #10
On Wed, Aug 03, 2022 at 06:47:24AM +0000, ruansy.fnst@fujitsu.com wrote:
> 
> 
> 在 2022/7/29 12:54, Darrick J. Wong 写道:
> > On Fri, Jul 29, 2022 at 03:55:24AM +0000, ruansy.fnst@fujitsu.com wrote:
> >>
> >>
> >> 在 2022/7/22 0:16, Darrick J. Wong 写道:
> >>> On Thu, Jul 21, 2022 at 02:06:10PM +0000, ruansy.fnst@fujitsu.com wrote:
> >>>> 在 2022/7/1 8:31, Darrick J. Wong 写道:
> >>>>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> >>>>>> Failure notification is not supported on partitions.  So, when we mount
> >>>>>> a reflink enabled xfs on a partition with dax option, let it fail with
> >>>>>> -EINVAL code.
> >>>>>>
> >>>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> >>>>>
> >>>>> Looks good to me, though I think this patch applies to ... wherever all
> >>>>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
> >>>>>
> >>>>> Ideally this would go in through there to keep the pieces together, but
> >>>>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
> >>>>> is unwilling.
> >>>>
> >>>> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
> >>>> waiting to be merged, is it time to think about "removing the
> >>>> experimental tag" again?  :)
> >>>
> >>> It's probably time to take up that question again.
> >>>
> >>> Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> >>> didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
> >>> and at least one of those dm layers no longer allows fsdax pass-through,
> >>> so XFS silently turned mount -o dax into -o dax=never. :(
> >>
> >> Hi Darrick,
> >>
> >> I tried generic/470 but it didn't run:
> >>     [not run] Cannot use thin-pool devices on DAX capable block devices.
> >>
> >> Did you modify the _require_dm_target() in common/rc?  I added thin-pool
> >> to not to check dax capability:
> >>
> >>           case $target in
> >>           stripe|linear|log-writes|thin-pool)  # add thin-pool here
> >>                   ;;
> >>
> >> then the case finally ran and it silently turned off dax as you said.
> >>
> >> Are the steps for reproduction correct? If so, I will continue to
> >> investigate this problem.
> > 
> > Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
> > mention that.  I suspect that the removal of dm support for pmem is
> > going to force us to completely redesign this test.  I can't really
> > think of how, though, since there's no good way that I know of to gain a
> > point-in-time snapshot of a pmem device.
> 
> Hi Darrick,
> 
>  > removal of dm support for pmem
> I think here we are saying about xfstest who removed the support, not 
> kernel?
> 
> I found some xfstests commits:
> fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
> fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool
> 
> So, this case was never able to run since the second commit?  (I didn't 
> notice the not run case.  I thought it was expected to be not run.)
> 
> And according to the first commit, the restriction was added because 
> some of dm devices don't support dax.  So my understanding is: we should 
> redesign the case to make the it work, and firstly, we should add dax 
> support for dm devices in kernel.

dm devices used to have fsdax support; I think Christoph is actively
removing (or already has removed) all that support.

> In addition, is there any other testcase has the same problem?  so that 
> we can deal with them together.

The last I checked, there aren't any that require MAP_SYNC or pmem aside
from g/470 and the three poison notification tests that you sent a few
days ago.

--D

> 
> --
> Thanks,
> Ruan
> 
> 
> > 
> > --D
> > 
> >>
> >> --
> >> Thanks,
> >> Ruan.
> >>
> >>
> >>
> >>>
> >>> I'm not sure how to fix that...
> >>>
> >>> --D
> >>>
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Ruan.
> >>>>
> >>>>>
> >>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> >>>>>
> >>>>> --D
> >>>>>
> >>>>>> ---
> >>>>>>     fs/xfs/xfs_super.c | 6 ++++--
> >>>>>>     1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> >>>>>> index 8495ef076ffc..a3c221841fa6 100644
> >>>>>> --- a/fs/xfs/xfs_super.c
> >>>>>> +++ b/fs/xfs/xfs_super.c
> >>>>>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
> >>>>>>     		goto disable_dax;
> >>>>>>     	}
> >>>>>>     
> >>>>>> -	if (xfs_has_reflink(mp)) {
> >>>>>> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
> >>>>>> +	if (xfs_has_reflink(mp) &&
> >>>>>> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> >>>>>> +		xfs_alert(mp,
> >>>>>> +			"DAX and reflink cannot work with multi-partitions!");
> >>>>>>     		return -EINVAL;
> >>>>>>     	}
> >>>>>>     
> >>>>>> -- 
> >>>>>> 2.36.1
> >>>>>>
> >>>>>>
> >>>>>>
Shiyang Ruan Aug. 4, 2022, 1:36 a.m. UTC | #11
在 2022/8/4 8:51, Darrick J. Wong 写道:
> On Wed, Aug 03, 2022 at 06:47:24AM +0000, ruansy.fnst@fujitsu.com wrote:
>>
>>
>> 在 2022/7/29 12:54, Darrick J. Wong 写道:
>>> On Fri, Jul 29, 2022 at 03:55:24AM +0000, ruansy.fnst@fujitsu.com wrote:
>>>>
>>>>
>>>> 在 2022/7/22 0:16, Darrick J. Wong 写道:
>>>>> On Thu, Jul 21, 2022 at 02:06:10PM +0000, ruansy.fnst@fujitsu.com wrote:
>>>>>> 在 2022/7/1 8:31, Darrick J. Wong 写道:
>>>>>>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>>>>>>>> Failure notification is not supported on partitions.  So, when we mount
>>>>>>>> a reflink enabled xfs on a partition with dax option, let it fail with
>>>>>>>> -EINVAL code.
>>>>>>>>
>>>>>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>>>>>
>>>>>>> Looks good to me, though I think this patch applies to ... wherever all
>>>>>>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
>>>>>>>
>>>>>>> Ideally this would go in through there to keep the pieces together, but
>>>>>>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
>>>>>>> is unwilling.
>>>>>>
>>>>>> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
>>>>>> waiting to be merged, is it time to think about "removing the
>>>>>> experimental tag" again?  :)
>>>>>
>>>>> It's probably time to take up that question again.
>>>>>
>>>>> Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
>>>>> didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
>>>>> and at least one of those dm layers no longer allows fsdax pass-through,
>>>>> so XFS silently turned mount -o dax into -o dax=never. :(
>>>>
>>>> Hi Darrick,
>>>>
>>>> I tried generic/470 but it didn't run:
>>>>      [not run] Cannot use thin-pool devices on DAX capable block devices.
>>>>
>>>> Did you modify the _require_dm_target() in common/rc?  I added thin-pool
>>>> to not to check dax capability:
>>>>
>>>>            case $target in
>>>>            stripe|linear|log-writes|thin-pool)  # add thin-pool here
>>>>                    ;;
>>>>
>>>> then the case finally ran and it silently turned off dax as you said.
>>>>
>>>> Are the steps for reproduction correct? If so, I will continue to
>>>> investigate this problem.
>>>
>>> Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
>>> mention that.  I suspect that the removal of dm support for pmem is
>>> going to force us to completely redesign this test.  I can't really
>>> think of how, though, since there's no good way that I know of to gain a
>>> point-in-time snapshot of a pmem device.
>>
>> Hi Darrick,
>>
>>   > removal of dm support for pmem
>> I think here we are saying about xfstest who removed the support, not
>> kernel?
>>
>> I found some xfstests commits:
>> fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
>> fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool
>>
>> So, this case was never able to run since the second commit?  (I didn't
>> notice the not run case.  I thought it was expected to be not run.)
>>
>> And according to the first commit, the restriction was added because
>> some of dm devices don't support dax.  So my understanding is: we should
>> redesign the case to make the it work, and firstly, we should add dax
>> support for dm devices in kernel.
> 
> dm devices used to have fsdax support; I think Christoph is actively
> removing (or already has removed) all that support.
> 
>> In addition, is there any other testcase has the same problem?  so that
>> we can deal with them together.
> 
> The last I checked, there aren't any that require MAP_SYNC or pmem aside
> from g/470 and the three poison notification tests that you sent a few
> days ago.

Ok.  Got it.  Thank you!


--
Ruan.

> 
> --D
> 
>>
>> --
>> Thanks,
>> Ruan
>>
>>
>>>
>>> --D
>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Ruan.
>>>>
>>>>
>>>>
>>>>>
>>>>> I'm not sure how to fix that...
>>>>>
>>>>> --D
>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Ruan.
>>>>>>
>>>>>>>
>>>>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>>>>>
>>>>>>> --D
>>>>>>>
>>>>>>>> ---
>>>>>>>>      fs/xfs/xfs_super.c | 6 ++++--
>>>>>>>>      1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>>>>>>>> index 8495ef076ffc..a3c221841fa6 100644
>>>>>>>> --- a/fs/xfs/xfs_super.c
>>>>>>>> +++ b/fs/xfs/xfs_super.c
>>>>>>>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>>>>>>>>      		goto disable_dax;
>>>>>>>>      	}
>>>>>>>>      
>>>>>>>> -	if (xfs_has_reflink(mp)) {
>>>>>>>> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
>>>>>>>> +	if (xfs_has_reflink(mp) &&
>>>>>>>> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
>>>>>>>> +		xfs_alert(mp,
>>>>>>>> +			"DAX and reflink cannot work with multi-partitions!");
>>>>>>>>      		return -EINVAL;
>>>>>>>>      	}
>>>>>>>>      
>>>>>>>> -- 
>>>>>>>> 2.36.1
>>>>>>>>
>>>>>>>>
>>>>>>>>
Shiyang Ruan Sept. 8, 2022, 1:46 p.m. UTC | #12
在 2022/8/4 8:51, Darrick J. Wong 写道:
> On Wed, Aug 03, 2022 at 06:47:24AM +0000, ruansy.fnst@fujitsu.com wrote:

...

>>>>>>
>>>>>> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
>>>>>> waiting to be merged, is it time to think about "removing the
>>>>>> experimental tag" again?  :)
>>>>>
>>>>> It's probably time to take up that question again.
>>>>>
>>>>> Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
>>>>> didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
>>>>> and at least one of those dm layers no longer allows fsdax pass-through,
>>>>> so XFS silently turned mount -o dax into -o dax=never. :(
>>>>
>>>> Hi Darrick,
>>>>
>>>> I tried generic/470 but it didn't run:
>>>>      [not run] Cannot use thin-pool devices on DAX capable block devices.
>>>>
>>>> Did you modify the _require_dm_target() in common/rc?  I added thin-pool
>>>> to not to check dax capability:
>>>>
>>>>            case $target in
>>>>            stripe|linear|log-writes|thin-pool)  # add thin-pool here
>>>>                    ;;
>>>>
>>>> then the case finally ran and it silently turned off dax as you said.
>>>>
>>>> Are the steps for reproduction correct? If so, I will continue to
>>>> investigate this problem.
>>>
>>> Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
>>> mention that.  I suspect that the removal of dm support for pmem is
>>> going to force us to completely redesign this test.  I can't really
>>> think of how, though, since there's no good way that I know of to gain a
>>> point-in-time snapshot of a pmem device.
>>
>> Hi Darrick,
>>
>>   > removal of dm support for pmem
>> I think here we are saying about xfstest who removed the support, not
>> kernel?
>>
>> I found some xfstests commits:
>> fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
>> fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool
>>
>> So, this case was never able to run since the second commit?  (I didn't
>> notice the not run case.  I thought it was expected to be not run.)
>>
>> And according to the first commit, the restriction was added because
>> some of dm devices don't support dax.  So my understanding is: we should
>> redesign the case to make the it work, and firstly, we should add dax
>> support for dm devices in kernel.
> 
> dm devices used to have fsdax support; I think Christoph is actively
> removing (or already has removed) all that support.
> 
>> In addition, is there any other testcase has the same problem?  so that
>> we can deal with them together.
> 
> The last I checked, there aren't any that require MAP_SYNC or pmem aside
> from g/470 and the three poison notification tests that you sent a few
> days ago.
> 
> --D
> 

Hi Darrick, Brian

I made a little investigation on generic/470.

This case was able to run before introducing thin-pool[1], but since 
that, it became 'Failed'/'Not Run' because thin-pool does not support 
DAX.  I have checked the log of thin-pool, it never supports DAX.  And, 
it's not someone has removed the fsdax support.  So, I think it's not 
correct to bypass the requirement conditions by adding 'thin-pool' to 
_require_dm_target().

As far as I known, to prevent out-of-order replay of dm-log-write, 
thin-pool was introduced (to provide discard zeroing).  Should we solve 
the 'out-of-order replay' issue instead of avoiding it by thin-pool? @Brian

Besides, since it's not a fsdax problem, I think there is nothing need 
to be fixed in fsdax.  I'd like to help it solved, but I'm still 
wondering if we could back to the original topic("Remove Experimental 
Tag") firstly? :)


[1] fc5870da485aec0f9196a0f2bed32f73f6b2c664 generic/470: use thin 
volume for dmlogwrites target device


--
Thanks,
Ruan.
Brian Foster Sept. 9, 2022, 1:01 p.m. UTC | #13
On Thu, Sep 08, 2022 at 09:46:04PM +0800, Shiyang Ruan wrote:
> 
> 
> 在 2022/8/4 8:51, Darrick J. Wong 写道:
> > On Wed, Aug 03, 2022 at 06:47:24AM +0000, ruansy.fnst@fujitsu.com wrote:
> 
> ...
> 
> > > > > > > 
> > > > > > > BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
> > > > > > > waiting to be merged, is it time to think about "removing the
> > > > > > > experimental tag" again?  :)
> > > > > > 
> > > > > > It's probably time to take up that question again.
> > > > > > 
> > > > > > Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> > > > > > didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
> > > > > > and at least one of those dm layers no longer allows fsdax pass-through,
> > > > > > so XFS silently turned mount -o dax into -o dax=never. :(
> > > > > 
> > > > > Hi Darrick,
> > > > > 
> > > > > I tried generic/470 but it didn't run:
> > > > >      [not run] Cannot use thin-pool devices on DAX capable block devices.
> > > > > 
> > > > > Did you modify the _require_dm_target() in common/rc?  I added thin-pool
> > > > > to not to check dax capability:
> > > > > 
> > > > >            case $target in
> > > > >            stripe|linear|log-writes|thin-pool)  # add thin-pool here
> > > > >                    ;;
> > > > > 
> > > > > then the case finally ran and it silently turned off dax as you said.
> > > > > 
> > > > > Are the steps for reproduction correct? If so, I will continue to
> > > > > investigate this problem.
> > > > 
> > > > Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
> > > > mention that.  I suspect that the removal of dm support for pmem is
> > > > going to force us to completely redesign this test.  I can't really
> > > > think of how, though, since there's no good way that I know of to gain a
> > > > point-in-time snapshot of a pmem device.
> > > 
> > > Hi Darrick,
> > > 
> > >   > removal of dm support for pmem
> > > I think here we are saying about xfstest who removed the support, not
> > > kernel?
> > > 
> > > I found some xfstests commits:
> > > fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
> > > fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool
> > > 
> > > So, this case was never able to run since the second commit?  (I didn't
> > > notice the not run case.  I thought it was expected to be not run.)
> > > 
> > > And according to the first commit, the restriction was added because
> > > some of dm devices don't support dax.  So my understanding is: we should
> > > redesign the case to make the it work, and firstly, we should add dax
> > > support for dm devices in kernel.
> > 
> > dm devices used to have fsdax support; I think Christoph is actively
> > removing (or already has removed) all that support.
> > 
> > > In addition, is there any other testcase has the same problem?  so that
> > > we can deal with them together.
> > 
> > The last I checked, there aren't any that require MAP_SYNC or pmem aside
> > from g/470 and the three poison notification tests that you sent a few
> > days ago.
> > 
> > --D
> > 
> 
> Hi Darrick, Brian
> 
> I made a little investigation on generic/470.
> 
> This case was able to run before introducing thin-pool[1], but since that,
> it became 'Failed'/'Not Run' because thin-pool does not support DAX.  I have
> checked the log of thin-pool, it never supports DAX.  And, it's not someone
> has removed the fsdax support.  So, I think it's not correct to bypass the
> requirement conditions by adding 'thin-pool' to _require_dm_target().
> 
> As far as I known, to prevent out-of-order replay of dm-log-write, thin-pool
> was introduced (to provide discard zeroing).  Should we solve the
> 'out-of-order replay' issue instead of avoiding it by thin-pool? @Brian
> 

Yes.. I don't recall all the internals of the tools and test, but IIRC
it relied on discard to perform zeroing between checkpoints or some such
and avoid spurious failures. The purpose of running on dm-thin was
merely to provide reliable discard zeroing behavior on the target device
and thus to allow the test to run reliably.

Brian

> Besides, since it's not a fsdax problem, I think there is nothing need to be
> fixed in fsdax.  I'd like to help it solved, but I'm still wondering if we
> could back to the original topic("Remove Experimental Tag") firstly? :)
> 
> 
> [1] fc5870da485aec0f9196a0f2bed32f73f6b2c664 generic/470: use thin volume
> for dmlogwrites target device
> 
> 
> --
> Thanks,
> Ruan.
> 
>
Xiao Yang Sept. 14, 2022, 6:44 a.m. UTC | #14
On 2022/9/9 21:01, Brian Foster wrote:
> Yes.. I don't recall all the internals of the tools and test, but IIRC
> it relied on discard to perform zeroing between checkpoints or some such
> and avoid spurious failures. The purpose of running on dm-thin was
> merely to provide reliable discard zeroing behavior on the target device
> and thus to allow the test to run reliably.

Hi Brian,

As far as I know, generic/470 was original designed to verify 
mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the 
reason, we need to ensure that all underlying devices under 
dm-log-writes device support DAX. However dm-thin device never supports 
DAX so
running generic/470 with dm-thin device always returns "not run".

Please see the difference between old and new logic:

          old logic                          new logic
---------------------------------------------------------------
log-writes device(DAX)                 log-writes device(DAX)
            |                                       |
PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
                                          |
                                        PMEM0(DAX)
---------------------------------------------------------------

We think dm-thin device is not a good solution for generic/470, is there 
any other solution to support both discard zero and DAX?

BTW, only log-writes, stripe and linear support DAX for now.

Best Regards,
Xiao Yang
Xiao Yang Sept. 14, 2022, 9:38 a.m. UTC | #15
On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> On 2022/9/9 21:01, Brian Foster wrote:
>> Yes.. I don't recall all the internals of the tools and test, but IIRC
>> it relied on discard to perform zeroing between checkpoints or some such
>> and avoid spurious failures. The purpose of running on dm-thin was
>> merely to provide reliable discard zeroing behavior on the target device
>> and thus to allow the test to run reliably.
> Hi Brian,
> 
> As far as I know, generic/470 was original designed to verify
> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
> reason, we need to ensure that all underlying devices under
> dm-log-writes device support DAX. However dm-thin device never supports
> DAX so
> running generic/470 with dm-thin device always returns "not run".
> 
> Please see the difference between old and new logic:
> 
>            old logic                          new logic
> ---------------------------------------------------------------
> log-writes device(DAX)                 log-writes device(DAX)
>              |                                       |
> PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
>                                            |
>                                          PMEM0(DAX)
> ---------------------------------------------------------------
> 
> We think dm-thin device is not a good solution for generic/470, is there
> any other solution to support both discard zero and DAX?

Hi Brian,

I have sent a patch[1] to revert your fix because I think it's not good 
for generic/470 to use thin volume as my revert patch[1] describes:
[1] 
https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u

With the revert, generic/470 can always run successfully on my 
environment so I wonder how to reproduce the out-of-order replay issue 
on XFS v5 filesystem?

PS: I want to reproduce the issue and try to find a better solution to 
fix it.

Best Regards,
Xiao Yang

> 
> BTW, only log-writes, stripe and linear support DAX for now.
Brian Foster Sept. 14, 2022, 12:34 p.m. UTC | #16
On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > On 2022/9/9 21:01, Brian Foster wrote:
> > > Yes.. I don't recall all the internals of the tools and test, but IIRC
> > > it relied on discard to perform zeroing between checkpoints or some such
> > > and avoid spurious failures. The purpose of running on dm-thin was
> > > merely to provide reliable discard zeroing behavior on the target device
> > > and thus to allow the test to run reliably.
> > Hi Brian,
> > 
> > As far as I know, generic/470 was original designed to verify
> > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
> > reason, we need to ensure that all underlying devices under
> > dm-log-writes device support DAX. However dm-thin device never supports
> > DAX so
> > running generic/470 with dm-thin device always returns "not run".
> > 
> > Please see the difference between old and new logic:
> > 
> >            old logic                          new logic
> > ---------------------------------------------------------------
> > log-writes device(DAX)                 log-writes device(DAX)
> >              |                                       |
> > PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
> >                                            |
> >                                          PMEM0(DAX)
> > ---------------------------------------------------------------
> > 
> > We think dm-thin device is not a good solution for generic/470, is there
> > any other solution to support both discard zero and DAX?
> 
> Hi Brian,
> 
> I have sent a patch[1] to revert your fix because I think it's not good for
> generic/470 to use thin volume as my revert patch[1] describes:
> [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u
> 

I think the history here is that generic/482 was changed over first in
commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
then sometime later we realized generic/455,457,470 had the same general
flaw and were switched over. The dm/dax compatibility thing was probably
just an oversight, but I am a little curious about that because it should
have been obvious that the change caused the test to no longer run. Did
something change after that to trigger that change in behavior?

> With the revert, generic/470 can always run successfully on my environment
> so I wonder how to reproduce the out-of-order replay issue on XFS v5
> filesystem?
> 

I don't quite recall the characteristics of the failures beyond that we
were seeing spurious test failures with generic/482 that were due to
essentially putting the fs/log back in time in a way that wasn't quite
accurate due to the clearing by the logwrites tool not taking place. If
you wanted to reproduce in order to revisit that, perhaps start with
generic/482 and let it run in a loop for a while and see if it
eventually triggers a failure/corruption..?

> PS: I want to reproduce the issue and try to find a better solution to fix
> it.
> 

It's been a while since I looked at any of this tooling to semi-grok how
it works. Perhaps it could learn to rely on something more explicit like
zero range (instead of discard?) or fall back to manual zeroing? If the
eventual solution is simple and low enough overhead, it might make some
sense to replace the dmthin hack across the set of tests mentioned
above.

Brian

> Best Regards,
> Xiao Yang
> 
> > 
> > BTW, only log-writes, stripe and linear support DAX for now.
>
Darrick J. Wong Sept. 14, 2022, 4:28 p.m. UTC | #17
On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> > On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > > On 2022/9/9 21:01, Brian Foster wrote:
> > > > Yes.. I don't recall all the internals of the tools and test, but IIRC
> > > > it relied on discard to perform zeroing between checkpoints or some such
> > > > and avoid spurious failures. The purpose of running on dm-thin was
> > > > merely to provide reliable discard zeroing behavior on the target device
> > > > and thus to allow the test to run reliably.
> > > Hi Brian,
> > > 
> > > As far as I know, generic/470 was original designed to verify
> > > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
> > > reason, we need to ensure that all underlying devices under
> > > dm-log-writes device support DAX. However dm-thin device never supports
> > > DAX so
> > > running generic/470 with dm-thin device always returns "not run".
> > > 
> > > Please see the difference between old and new logic:
> > > 
> > >            old logic                          new logic
> > > ---------------------------------------------------------------
> > > log-writes device(DAX)                 log-writes device(DAX)
> > >              |                                       |
> > > PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
> > >                                            |
> > >                                          PMEM0(DAX)
> > > ---------------------------------------------------------------
> > > 
> > > We think dm-thin device is not a good solution for generic/470, is there
> > > any other solution to support both discard zero and DAX?
> > 
> > Hi Brian,
> > 
> > I have sent a patch[1] to revert your fix because I think it's not good for
> > generic/470 to use thin volume as my revert patch[1] describes:
> > [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u
> > 
> 
> I think the history here is that generic/482 was changed over first in
> commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
> then sometime later we realized generic/455,457,470 had the same general
> flaw and were switched over. The dm/dax compatibility thing was probably
> just an oversight, but I am a little curious about that because it should

It's not an oversight -- it used to work (albeit with EXPERIMENTAL
tags), and now we've broken it on fsdax as the pmem/blockdev divorce
progresses.

> have been obvious that the change caused the test to no longer run. Did
> something change after that to trigger that change in behavior?
> 
> > With the revert, generic/470 can always run successfully on my environment
> > so I wonder how to reproduce the out-of-order replay issue on XFS v5
> > filesystem?
> > 
> 
> I don't quite recall the characteristics of the failures beyond that we
> were seeing spurious test failures with generic/482 that were due to
> essentially putting the fs/log back in time in a way that wasn't quite
> accurate due to the clearing by the logwrites tool not taking place. If
> you wanted to reproduce in order to revisit that, perhaps start with
> generic/482 and let it run in a loop for a while and see if it
> eventually triggers a failure/corruption..?
> 
> > PS: I want to reproduce the issue and try to find a better solution to fix
> > it.
> > 
> 
> It's been a while since I looked at any of this tooling to semi-grok how
> it works.

I /think/ this was the crux of the problem, back in 2019?
https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/

> Perhaps it could learn to rely on something more explicit like
> zero range (instead of discard?) or fall back to manual zeroing?

AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
ought to be adapted to call BLKZEROOUT and (b) in the worst case it
writes zeroes to the entire device, which is/can be slow.

For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
and for cost optimization purposes we're only "paying" for the cheapest
tier.  Weirdly that maps to an upper limit of 6500 write iops and
48MB/s(!) but that would take about 20 minutes to zero the entire
device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
support discard or write-zeroes.

> If the
> eventual solution is simple and low enough overhead, it might make some
> sense to replace the dmthin hack across the set of tests mentioned
> above.

That said, for a *pmem* test you'd expect it to be faster than that...

--D

> Brian
> 
> > Best Regards,
> > Xiao Yang
> > 
> > > 
> > > BTW, only log-writes, stripe and linear support DAX for now.
> > 
>
Xiao Yang Sept. 15, 2022, 10:14 a.m. UTC | #18
On 2022/9/15 0:28, Darrick J. Wong wrote:
> On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
>> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
>>> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
>>>> On 2022/9/9 21:01, Brian Foster wrote:
>>>>> Yes.. I don't recall all the internals of the tools and test, but IIRC
>>>>> it relied on discard to perform zeroing between checkpoints or some such
>>>>> and avoid spurious failures. The purpose of running on dm-thin was
>>>>> merely to provide reliable discard zeroing behavior on the target device
>>>>> and thus to allow the test to run reliably.
>>>> Hi Brian,
>>>>
>>>> As far as I know, generic/470 was original designed to verify
>>>> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
>>>> reason, we need to ensure that all underlying devices under
>>>> dm-log-writes device support DAX. However dm-thin device never supports
>>>> DAX so
>>>> running generic/470 with dm-thin device always returns "not run".
>>>>
>>>> Please see the difference between old and new logic:
>>>>
>>>>             old logic                          new logic
>>>> ---------------------------------------------------------------
>>>> log-writes device(DAX)                 log-writes device(DAX)
>>>>               |                                       |
>>>> PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
>>>>                                             |
>>>>                                           PMEM0(DAX)
>>>> ---------------------------------------------------------------
>>>>
>>>> We think dm-thin device is not a good solution for generic/470, is there
>>>> any other solution to support both discard zero and DAX?
>>>
>>> Hi Brian,
>>>
>>> I have sent a patch[1] to revert your fix because I think it's not good for
>>> generic/470 to use thin volume as my revert patch[1] describes:
>>> [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u
>>>
>>
>> I think the history here is that generic/482 was changed over first in
>> commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
>> then sometime later we realized generic/455,457,470 had the same general
>> flaw and were switched over. The dm/dax compatibility thing was probably
>> just an oversight, but I am a little curious about that because it should
> 
> It's not an oversight -- it used to work (albeit with EXPERIMENTAL
> tags), and now we've broken it on fsdax as the pmem/blockdev divorce
> progresses.
Hi

Do you mean that the following patch set changed the test result of 
generic/470 with thin-volume? (pass => not run/failure)
https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/

> 
>> have been obvious that the change caused the test to no longer run. Did
>> something change after that to trigger that change in behavior?
>>
>>> With the revert, generic/470 can always run successfully on my environment
>>> so I wonder how to reproduce the out-of-order replay issue on XFS v5
>>> filesystem?
>>>
>>
>> I don't quite recall the characteristics of the failures beyond that we
>> were seeing spurious test failures with generic/482 that were due to
>> essentially putting the fs/log back in time in a way that wasn't quite
>> accurate due to the clearing by the logwrites tool not taking place. If
>> you wanted to reproduce in order to revisit that, perhaps start with
>> generic/482 and let it run in a loop for a while and see if it
>> eventually triggers a failure/corruption..?
>>
>>> PS: I want to reproduce the issue and try to find a better solution to fix
>>> it.
>>>
>>
>> It's been a while since I looked at any of this tooling to semi-grok how
>> it works.
> 
> I /think/ this was the crux of the problem, back in 2019?
> https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/

Agreed.

> 
>> Perhaps it could learn to rely on something more explicit like
>> zero range (instead of discard?) or fall back to manual zeroing?
> 
> AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
> ought to be adapted to call BLKZEROOUT and (b) in the worst case it
> writes zeroes to the entire device, which is/can be slow.
> 
> For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
> and for cost optimization purposes we're only "paying" for the cheapest
> tier.  Weirdly that maps to an upper limit of 6500 write iops and
> 48MB/s(!) but that would take about 20 minutes to zero the entire
> device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
> support discard or write-zeroes.

Do you mean that discard zero(BLKDISCARD) is faster than both fill 
zero(BLKZEROOUT) and write zero on user space?

Best Regards,
Xiao Yang
> 
>> If the
>> eventual solution is simple and low enough overhead, it might make some
>> sense to replace the dmthin hack across the set of tests mentioned
>> above.
> 
> That said, for a *pmem* test you'd expect it to be faster than that...
> 
> --D
> 
>> Brian
>>
>>> Best Regards,
>>> Xiao Yang
>>>
>>>>
>>>> BTW, only log-writes, stripe and linear support DAX for now.
>>>
>>
Xiao Yang Sept. 16, 2022, 2:04 a.m. UTC | #19
On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote:
> On 2022/9/15 0:28, Darrick J. Wong wrote:
>> On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
>>> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
>>>> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
>>>>> On 2022/9/9 21:01, Brian Foster wrote:
>>>>>> Yes.. I don't recall all the internals of the tools and test, but 
>>>>>> IIRC
>>>>>> it relied on discard to perform zeroing between checkpoints or 
>>>>>> some such
>>>>>> and avoid spurious failures. The purpose of running on dm-thin was
>>>>>> merely to provide reliable discard zeroing behavior on the target 
>>>>>> device
>>>>>> and thus to allow the test to run reliably.
>>>>> Hi Brian,
>>>>>
>>>>> As far as I know, generic/470 was original designed to verify
>>>>> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
>>>>> reason, we need to ensure that all underlying devices under
>>>>> dm-log-writes device support DAX. However dm-thin device never 
>>>>> supports
>>>>> DAX so
>>>>> running generic/470 with dm-thin device always returns "not run".
>>>>>
>>>>> Please see the difference between old and new logic:
>>>>>
>>>>>             old logic                          new logic
>>>>> ---------------------------------------------------------------
>>>>> log-writes device(DAX)                 log-writes device(DAX)
>>>>>               |                                       |
>>>>> PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
>>>>>                                             |
>>>>>                                           PMEM0(DAX)
>>>>> ---------------------------------------------------------------
>>>>>
>>>>> We think dm-thin device is not a good solution for generic/470, is 
>>>>> there
>>>>> any other solution to support both discard zero and DAX?
>>>>
>>>> Hi Brian,
>>>>
>>>> I have sent a patch[1] to revert your fix because I think it's not 
>>>> good for
>>>> generic/470 to use thin volume as my revert patch[1] describes:
>>>> [1] 
>>>> https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u 
>>>>
>>>>
>>>
>>> I think the history here is that generic/482 was changed over first in
>>> commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
>>> then sometime later we realized generic/455,457,470 had the same general
>>> flaw and were switched over. The dm/dax compatibility thing was probably
>>> just an oversight, but I am a little curious about that because it 
>>> should
>>
>> It's not an oversight -- it used to work (albeit with EXPERIMENTAL
>> tags), and now we've broken it on fsdax as the pmem/blockdev divorce
>> progresses.
> Hi
> 
> Do you mean that the following patch set changed the test result of 
> generic/470 with thin-volume? (pass => not run/failure)
> https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
> 
>>
>>> have been obvious that the change caused the test to no longer run. Did
>>> something change after that to trigger that change in behavior?
>>>
>>>> With the revert, generic/470 can always run successfully on my 
>>>> environment
>>>> so I wonder how to reproduce the out-of-order replay issue on XFS v5
>>>> filesystem?
>>>>
>>>
>>> I don't quite recall the characteristics of the failures beyond that we
>>> were seeing spurious test failures with generic/482 that were due to
>>> essentially putting the fs/log back in time in a way that wasn't quite
>>> accurate due to the clearing by the logwrites tool not taking place. If
>>> you wanted to reproduce in order to revisit that, perhaps start with
>>> generic/482 and let it run in a loop for a while and see if it
>>> eventually triggers a failure/corruption..?
>>>
>>>> PS: I want to reproduce the issue and try to find a better solution 
>>>> to fix
>>>> it.
>>>>
>>>
>>> It's been a while since I looked at any of this tooling to semi-grok how
>>> it works.
>>
>> I /think/ this was the crux of the problem, back in 2019?
>> https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/
> 
> Agreed.
> 
>>
>>> Perhaps it could learn to rely on something more explicit like
>>> zero range (instead of discard?) or fall back to manual zeroing?
>>
>> AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
>> ought to be adapted to call BLKZEROOUT and (b) in the worst case it
>> writes zeroes to the entire device, which is/can be slow.
>>
>> For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
>> and for cost optimization purposes we're only "paying" for the cheapest
>> tier.  Weirdly that maps to an upper limit of 6500 write iops and
>> 48MB/s(!) but that would take about 20 minutes to zero the entire
>> device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
>> support discard or write-zeroes.
> 
> Do you mean that discard zero(BLKDISCARD) is faster than both fill 
> zero(BLKZEROOUT) and write zero on user space?

Hi Darrick, Brian and Christoph

According to the discussion about generic/470. I wonder if it is 
necessary to make thin-pool support DAX. Is there any use case for the 
requirement?

Best Regards,
Xiao Yang
> 
> Best Regards,
> Xiao Yang
>>
>>> If the
>>> eventual solution is simple and low enough overhead, it might make some
>>> sense to replace the dmthin hack across the set of tests mentioned
>>> above.
>>
>> That said, for a *pmem* test you'd expect it to be faster than that...
>>
>> --D
>>
>>> Brian
>>>
>>>> Best Regards,
>>>> Xiao Yang
>>>>
>>>>>
>>>>> BTW, only log-writes, stripe and linear support DAX for now.
>>>>
>>>
Xiao Yang Sept. 20, 2022, 2:38 a.m. UTC | #20
Hi Darrick, Brian and Christoph

Ping. I hope to get your feedback.

1) I have confirmed that the following patch set did not change the test 
result of generic/470 with thin-volume. Besides, I didn't see any 
failure when running generic/470 based on normal PMEM device instaed of 
thin-volume.
https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/

2) I can reproduce the failure of generic/482 without thin-volume.

3) Is it necessary to make thin-volume support DAX. Is there any use 
case for the requirement?

Best Regards,
Xiao Yang

On 2022/9/16 10:04, Yang, Xiao/杨 晓 wrote:
> On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote:
>> On 2022/9/15 0:28, Darrick J. Wong wrote:
>>> On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
>>>> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
>>>>> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
>>>>>> On 2022/9/9 21:01, Brian Foster wrote:
>>>>>>> Yes.. I don't recall all the internals of the tools and test, but 
>>>>>>> IIRC
>>>>>>> it relied on discard to perform zeroing between checkpoints or 
>>>>>>> some such
>>>>>>> and avoid spurious failures. The purpose of running on dm-thin was
>>>>>>> merely to provide reliable discard zeroing behavior on the target 
>>>>>>> device
>>>>>>> and thus to allow the test to run reliably.
>>>>>> Hi Brian,
>>>>>>
>>>>>> As far as I know, generic/470 was original designed to verify
>>>>>> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
>>>>>> reason, we need to ensure that all underlying devices under
>>>>>> dm-log-writes device support DAX. However dm-thin device never 
>>>>>> supports
>>>>>> DAX so
>>>>>> running generic/470 with dm-thin device always returns "not run".
>>>>>>
>>>>>> Please see the difference between old and new logic:
>>>>>>
>>>>>>             old logic                          new logic
>>>>>> ---------------------------------------------------------------
>>>>>> log-writes device(DAX)                 log-writes device(DAX)
>>>>>>               |                                       |
>>>>>> PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
>>>>>>                                             |
>>>>>>                                           PMEM0(DAX)
>>>>>> ---------------------------------------------------------------
>>>>>>
>>>>>> We think dm-thin device is not a good solution for generic/470, is 
>>>>>> there
>>>>>> any other solution to support both discard zero and DAX?
>>>>>
>>>>> Hi Brian,
>>>>>
>>>>> I have sent a patch[1] to revert your fix because I think it's not 
>>>>> good for
>>>>> generic/470 to use thin volume as my revert patch[1] describes:
>>>>> [1] 
>>>>> https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u 
>>>>>
>>>>>
>>>>
>>>> I think the history here is that generic/482 was changed over first in
>>>> commit 65cc9a235919 ("generic/482: use thin volume as data device"), 
>>>> and
>>>> then sometime later we realized generic/455,457,470 had the same 
>>>> general
>>>> flaw and were switched over. The dm/dax compatibility thing was 
>>>> probably
>>>> just an oversight, but I am a little curious about that because it 
>>>> should
>>>
>>> It's not an oversight -- it used to work (albeit with EXPERIMENTAL
>>> tags), and now we've broken it on fsdax as the pmem/blockdev divorce
>>> progresses.
>> Hi
>>
>> Do you mean that the following patch set changed the test result of 
>> generic/470 with thin-volume? (pass => not run/failure)
>> https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
>>
>>>
>>>> have been obvious that the change caused the test to no longer run. Did
>>>> something change after that to trigger that change in behavior?
>>>>
>>>>> With the revert, generic/470 can always run successfully on my 
>>>>> environment
>>>>> so I wonder how to reproduce the out-of-order replay issue on XFS v5
>>>>> filesystem?
>>>>>
>>>>
>>>> I don't quite recall the characteristics of the failures beyond that we
>>>> were seeing spurious test failures with generic/482 that were due to
>>>> essentially putting the fs/log back in time in a way that wasn't quite
>>>> accurate due to the clearing by the logwrites tool not taking place. If
>>>> you wanted to reproduce in order to revisit that, perhaps start with
>>>> generic/482 and let it run in a loop for a while and see if it
>>>> eventually triggers a failure/corruption..?
>>>>
>>>>> PS: I want to reproduce the issue and try to find a better solution 
>>>>> to fix
>>>>> it.
>>>>>
>>>>
>>>> It's been a while since I looked at any of this tooling to semi-grok 
>>>> how
>>>> it works.
>>>
>>> I /think/ this was the crux of the problem, back in 2019?
>>> https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/
>>
>> Agreed.
>>
>>>
>>>> Perhaps it could learn to rely on something more explicit like
>>>> zero range (instead of discard?) or fall back to manual zeroing?
>>>
>>> AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
>>> ought to be adapted to call BLKZEROOUT and (b) in the worst case it
>>> writes zeroes to the entire device, which is/can be slow.
>>>
>>> For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
>>> and for cost optimization purposes we're only "paying" for the cheapest
>>> tier.  Weirdly that maps to an upper limit of 6500 write iops and
>>> 48MB/s(!) but that would take about 20 minutes to zero the entire
>>> device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
>>> support discard or write-zeroes.
>>
>> Do you mean that discard zero(BLKDISCARD) is faster than both fill 
>> zero(BLKZEROOUT) and write zero on user space?
> 
> Hi Darrick, Brian and Christoph
> 
> According to the discussion about generic/470. I wonder if it is 
> necessary to make thin-pool support DAX. Is there any use case for the 
> requirement?
> 
> Best Regards,
> Xiao Yang
>>
>> Best Regards,
>> Xiao Yang
>>>
>>>> If the
>>>> eventual solution is simple and low enough overhead, it might make some
>>>> sense to replace the dmthin hack across the set of tests mentioned
>>>> above.
>>>
>>> That said, for a *pmem* test you'd expect it to be faster than that...
>>>
>>> --D
>>>
>>>> Brian
>>>>
>>>>> Best Regards,
>>>>> Xiao Yang
>>>>>
>>>>>>
>>>>>> BTW, only log-writes, stripe and linear support DAX for now.
>>>>>
>>>>
Yasunori Gotou (Fujitsu) Sept. 30, 2022, 12:56 a.m. UTC | #21
Hello everyone,

On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote:
> Hi Darrick, Brian and Christoph
> 
> Ping. I hope to get your feedback.
> 
> 1) I have confirmed that the following patch set did not change the test 
> result of generic/470 with thin-volume. Besides, I didn't see any 
> failure when running generic/470 based on normal PMEM device instaed of 
> thin-volume.
> https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
> 
> 2) I can reproduce the failure of generic/482 without thin-volume.
> 
> 3) Is it necessary to make thin-volume support DAX. Is there any use 
> case for the requirement?


Though I asked other place(*), I really want to know the usecase of
dm-thin-volume with DAX and reflink.


In my understanding, dm-thin-volume seems to provide similar feature 
like reflink of xfs. Both feature provide COW update to reduce usage of
its region, and snapshot feature, right?

I found that docker seems to select one of them (or other feature which 
supports COW). Then user don't need to use thin-volume and reflink at 
same time.

Database which uses FS-DAX may want to use snapshot for its data of 
FS-DAX, its user seems to be satisfied with reflink or thin-volume.

So I could not find on what use-case user would like to use 
dm-thin-volume and reflink at same time.

The only possibility is that the user has mistakenly configured 
dm-thinpool and reflink to be used at the same time, but if that is the 
case, it seems to be better for the user to disable one or the other.

I really wander why dm-thin-volume must be used with reflik and FS-DAX.

If my understanding is something wrong, please correct me.

(*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/

Thanks,
---
Yasunori Goto


> 
> Best Regards,
> Xiao Yang
> 
> On 2022/9/16 10:04, Yang, Xiao/杨 晓 wrote:
>> On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote:
>>> On 2022/9/15 0:28, Darrick J. Wong wrote:
>>>> On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
>>>>> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
>>>>>> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
>>>>>>> On 2022/9/9 21:01, Brian Foster wrote:
>>>>>>>> Yes.. I don't recall all the internals of the tools and test, 
>>>>>>>> but IIRC
>>>>>>>> it relied on discard to perform zeroing between checkpoints or 
>>>>>>>> some such
>>>>>>>> and avoid spurious failures. The purpose of running on dm-thin was
>>>>>>>> merely to provide reliable discard zeroing behavior on the 
>>>>>>>> target device
>>>>>>>> and thus to allow the test to run reliably.
>>>>>>> Hi Brian,
>>>>>>>
>>>>>>> As far as I know, generic/470 was original designed to verify
>>>>>>> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
>>>>>>> reason, we need to ensure that all underlying devices under
>>>>>>> dm-log-writes device support DAX. However dm-thin device never 
>>>>>>> supports
>>>>>>> DAX so
>>>>>>> running generic/470 with dm-thin device always returns "not run".
>>>>>>>
>>>>>>> Please see the difference between old and new logic:
>>>>>>>
>>>>>>>             old logic                          new logic
>>>>>>> ---------------------------------------------------------------
>>>>>>> log-writes device(DAX)                 log-writes device(DAX)
>>>>>>>               |                                       |
>>>>>>> PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
>>>>>>>                                             |
>>>>>>>                                           PMEM0(DAX)
>>>>>>> ---------------------------------------------------------------
>>>>>>>
>>>>>>> We think dm-thin device is not a good solution for generic/470, 
>>>>>>> is there
>>>>>>> any other solution to support both discard zero and DAX?
>>>>>>
>>>>>> Hi Brian,
>>>>>>
>>>>>> I have sent a patch[1] to revert your fix because I think it's not 
>>>>>> good for
>>>>>> generic/470 to use thin volume as my revert patch[1] describes:
>>>>>> [1] 
>>>>>> https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u 
>>>>>>
>>>>>>
>>>>>
>>>>> I think the history here is that generic/482 was changed over first in
>>>>> commit 65cc9a235919 ("generic/482: use thin volume as data 
>>>>> device"), and
>>>>> then sometime later we realized generic/455,457,470 had the same 
>>>>> general
>>>>> flaw and were switched over. The dm/dax compatibility thing was 
>>>>> probably
>>>>> just an oversight, but I am a little curious about that because it 
>>>>> should
>>>>
>>>> It's not an oversight -- it used to work (albeit with EXPERIMENTAL
>>>> tags), and now we've broken it on fsdax as the pmem/blockdev divorce
>>>> progresses.
>>> Hi
>>>
>>> Do you mean that the following patch set changed the test result of 
>>> generic/470 with thin-volume? (pass => not run/failure)
>>> https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
>>>
>>>>
>>>>> have been obvious that the change caused the test to no longer run. 
>>>>> Did
>>>>> something change after that to trigger that change in behavior?
>>>>>
>>>>>> With the revert, generic/470 can always run successfully on my 
>>>>>> environment
>>>>>> so I wonder how to reproduce the out-of-order replay issue on XFS v5
>>>>>> filesystem?
>>>>>>
>>>>>
>>>>> I don't quite recall the characteristics of the failures beyond 
>>>>> that we
>>>>> were seeing spurious test failures with generic/482 that were due to
>>>>> essentially putting the fs/log back in time in a way that wasn't quite
>>>>> accurate due to the clearing by the logwrites tool not taking 
>>>>> place. If
>>>>> you wanted to reproduce in order to revisit that, perhaps start with
>>>>> generic/482 and let it run in a loop for a while and see if it
>>>>> eventually triggers a failure/corruption..?
>>>>>
>>>>>> PS: I want to reproduce the issue and try to find a better 
>>>>>> solution to fix
>>>>>> it.
>>>>>>
>>>>>
>>>>> It's been a while since I looked at any of this tooling to 
>>>>> semi-grok how
>>>>> it works.
>>>>
>>>> I /think/ this was the crux of the problem, back in 2019?
>>>> https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/
>>>
>>> Agreed.
>>>
>>>>
>>>>> Perhaps it could learn to rely on something more explicit like
>>>>> zero range (instead of discard?) or fall back to manual zeroing?
>>>>
>>>> AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
>>>> ought to be adapted to call BLKZEROOUT and (b) in the worst case it
>>>> writes zeroes to the entire device, which is/can be slow.
>>>>
>>>> For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
>>>> and for cost optimization purposes we're only "paying" for the cheapest
>>>> tier.  Weirdly that maps to an upper limit of 6500 write iops and
>>>> 48MB/s(!) but that would take about 20 minutes to zero the entire
>>>> device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
>>>> support discard or write-zeroes.
>>>
>>> Do you mean that discard zero(BLKDISCARD) is faster than both fill 
>>> zero(BLKZEROOUT) and write zero on user space?
>>
>> Hi Darrick, Brian and Christoph
>>
>> According to the discussion about generic/470. I wonder if it is 
>> necessary to make thin-pool support DAX. Is there any use case for the 
>> requirement?
>>
>> Best Regards,
>> Xiao Yang
>>>
>>> Best Regards,
>>> Xiao Yang
>>>>
>>>>> If the
>>>>> eventual solution is simple and low enough overhead, it might make 
>>>>> some
>>>>> sense to replace the dmthin hack across the set of tests mentioned
>>>>> above.
>>>>
>>>> That said, for a *pmem* test you'd expect it to be faster than that...
>>>>
>>>> --D
>>>>
>>>>> Brian
>>>>>
>>>>>> Best Regards,
>>>>>> Xiao Yang
>>>>>>
>>>>>>>
>>>>>>> BTW, only log-writes, stripe and linear support DAX for now.
>>>>>>
>>>>>
Darrick J. Wong Oct. 4, 2022, 12:12 a.m. UTC | #22
On Fri, Sep 30, 2022 at 09:56:41AM +0900, Gotou, Yasunori/五島 康文 wrote:
> Hello everyone,
> 
> On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote:
> > Hi Darrick, Brian and Christoph
> > 
> > Ping. I hope to get your feedback.
> > 
> > 1) I have confirmed that the following patch set did not change the test
> > result of generic/470 with thin-volume. Besides, I didn't see any
> > failure when running generic/470 based on normal PMEM device instaed of
> > thin-volume.
> > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
> > 
> > 2) I can reproduce the failure of generic/482 without thin-volume.
> > 
> > 3) Is it necessary to make thin-volume support DAX. Is there any use
> > case for the requirement?
> 
> 
> Though I asked other place(*), I really want to know the usecase of
> dm-thin-volume with DAX and reflink.
> 
> 
> In my understanding, dm-thin-volume seems to provide similar feature like
> reflink of xfs. Both feature provide COW update to reduce usage of
> its region, and snapshot feature, right?
> 
> I found that docker seems to select one of them (or other feature which
> supports COW). Then user don't need to use thin-volume and reflink at same
> time.
> 
> Database which uses FS-DAX may want to use snapshot for its data of FS-DAX,
> its user seems to be satisfied with reflink or thin-volume.
> 
> So I could not find on what use-case user would like to use dm-thin-volume
> and reflink at same time.
> 
> The only possibility is that the user has mistakenly configured dm-thinpool
> and reflink to be used at the same time, but if that is the case, it seems
> to be better for the user to disable one or the other.
> 
> I really wander why dm-thin-volume must be used with reflik and FS-DAX.

There isn't a hard requirement between fsdax and dm-thinp.  The /test/
needs dm-logwrites to check that write page faults on a MAP_SYNC
mmapping are persisted directly to disk.  dm-logwrites requires a fast
way to zero an entire device for correct operation of the replay step,
and thinp is the only way to guarantee that.

--D

> If my understanding is something wrong, please correct me.
> 
> (*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/
> 
> Thanks,
> ---
> Yasunori Goto
> 
> 
> > 
> > Best Regards,
> > Xiao Yang
> > 
> > On 2022/9/16 10:04, Yang, Xiao/杨 晓 wrote:
> > > On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote:
> > > > On 2022/9/15 0:28, Darrick J. Wong wrote:
> > > > > On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
> > > > > > On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> > > > > > > On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > > > > > > > On 2022/9/9 21:01, Brian Foster wrote:
> > > > > > > > > Yes.. I don't recall all the internals of
> > > > > > > > > the tools and test, but IIRC
> > > > > > > > > it relied on discard to perform zeroing
> > > > > > > > > between checkpoints or some such
> > > > > > > > > and avoid spurious failures. The purpose of running on dm-thin was
> > > > > > > > > merely to provide reliable discard zeroing
> > > > > > > > > behavior on the target device
> > > > > > > > > and thus to allow the test to run reliably.
> > > > > > > > Hi Brian,
> > > > > > > > 
> > > > > > > > As far as I know, generic/470 was original designed to verify
> > > > > > > > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
> > > > > > > > reason, we need to ensure that all underlying devices under
> > > > > > > > dm-log-writes device support DAX. However
> > > > > > > > dm-thin device never supports
> > > > > > > > DAX so
> > > > > > > > running generic/470 with dm-thin device always returns "not run".
> > > > > > > > 
> > > > > > > > Please see the difference between old and new logic:
> > > > > > > > 
> > > > > > > >             old logic                          new logic
> > > > > > > > ---------------------------------------------------------------
> > > > > > > > log-writes device(DAX)                 log-writes device(DAX)
> > > > > > > >               |                                       |
> > > > > > > > PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
> > > > > > > >                                             |
> > > > > > > >                                           PMEM0(DAX)
> > > > > > > > ---------------------------------------------------------------
> > > > > > > > 
> > > > > > > > We think dm-thin device is not a good solution
> > > > > > > > for generic/470, is there
> > > > > > > > any other solution to support both discard zero and DAX?
> > > > > > > 
> > > > > > > Hi Brian,
> > > > > > > 
> > > > > > > I have sent a patch[1] to revert your fix because I
> > > > > > > think it's not good for
> > > > > > > generic/470 to use thin volume as my revert patch[1] describes:
> > > > > > > [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > I think the history here is that generic/482 was changed over first in
> > > > > > commit 65cc9a235919 ("generic/482: use thin volume as
> > > > > > data device"), and
> > > > > > then sometime later we realized generic/455,457,470 had
> > > > > > the same general
> > > > > > flaw and were switched over. The dm/dax compatibility
> > > > > > thing was probably
> > > > > > just an oversight, but I am a little curious about that
> > > > > > because it should
> > > > > 
> > > > > It's not an oversight -- it used to work (albeit with EXPERIMENTAL
> > > > > tags), and now we've broken it on fsdax as the pmem/blockdev divorce
> > > > > progresses.
> > > > Hi
> > > > 
> > > > Do you mean that the following patch set changed the test result
> > > > of generic/470 with thin-volume? (pass => not run/failure)
> > > > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
> > > > 
> > > > > 
> > > > > > have been obvious that the change caused the test to no
> > > > > > longer run. Did
> > > > > > something change after that to trigger that change in behavior?
> > > > > > 
> > > > > > > With the revert, generic/470 can always run
> > > > > > > successfully on my environment
> > > > > > > so I wonder how to reproduce the out-of-order replay issue on XFS v5
> > > > > > > filesystem?
> > > > > > > 
> > > > > > 
> > > > > > I don't quite recall the characteristics of the failures
> > > > > > beyond that we
> > > > > > were seeing spurious test failures with generic/482 that were due to
> > > > > > essentially putting the fs/log back in time in a way that wasn't quite
> > > > > > accurate due to the clearing by the logwrites tool not
> > > > > > taking place. If
> > > > > > you wanted to reproduce in order to revisit that, perhaps start with
> > > > > > generic/482 and let it run in a loop for a while and see if it
> > > > > > eventually triggers a failure/corruption..?
> > > > > > 
> > > > > > > PS: I want to reproduce the issue and try to find a
> > > > > > > better solution to fix
> > > > > > > it.
> > > > > > > 
> > > > > > 
> > > > > > It's been a while since I looked at any of this tooling
> > > > > > to semi-grok how
> > > > > > it works.
> > > > > 
> > > > > I /think/ this was the crux of the problem, back in 2019?
> > > > > https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/
> > > > 
> > > > Agreed.
> > > > 
> > > > > 
> > > > > > Perhaps it could learn to rely on something more explicit like
> > > > > > zero range (instead of discard?) or fall back to manual zeroing?
> > > > > 
> > > > > AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
> > > > > ought to be adapted to call BLKZEROOUT and (b) in the worst case it
> > > > > writes zeroes to the entire device, which is/can be slow.
> > > > > 
> > > > > For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
> > > > > and for cost optimization purposes we're only "paying" for the cheapest
> > > > > tier.  Weirdly that maps to an upper limit of 6500 write iops and
> > > > > 48MB/s(!) but that would take about 20 minutes to zero the entire
> > > > > device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
> > > > > support discard or write-zeroes.
> > > > 
> > > > Do you mean that discard zero(BLKDISCARD) is faster than both
> > > > fill zero(BLKZEROOUT) and write zero on user space?
> > > 
> > > Hi Darrick, Brian and Christoph
> > > 
> > > According to the discussion about generic/470. I wonder if it is
> > > necessary to make thin-pool support DAX. Is there any use case for
> > > the requirement?
> > > 
> > > Best Regards,
> > > Xiao Yang
> > > > 
> > > > Best Regards,
> > > > Xiao Yang
> > > > > 
> > > > > > If the
> > > > > > eventual solution is simple and low enough overhead, it
> > > > > > might make some
> > > > > > sense to replace the dmthin hack across the set of tests mentioned
> > > > > > above.
> > > > > 
> > > > > That said, for a *pmem* test you'd expect it to be faster than that...
> > > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > Best Regards,
> > > > > > > Xiao Yang
> > > > > > > 
> > > > > > > > 
> > > > > > > > BTW, only log-writes, stripe and linear support DAX for now.
> > > > > > > 
> > > > > > 
>
Yasunori Gotou (Fujitsu) Oct. 4, 2022, 4:12 a.m. UTC | #23
On 2022/10/03 17:12, Darrick J. Wong wrote:
> On Fri, Sep 30, 2022 at 09:56:41AM +0900, Gotou, Yasunori/五島 康文 wrote:
>> Hello everyone,
>>
>> On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote:
>>> Hi Darrick, Brian and Christoph
>>>
>>> Ping. I hope to get your feedback.
>>>
>>> 1) I have confirmed that the following patch set did not change the test
>>> result of generic/470 with thin-volume. Besides, I didn't see any
>>> failure when running generic/470 based on normal PMEM device instaed of
>>> thin-volume.
>>> https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
>>>
>>> 2) I can reproduce the failure of generic/482 without thin-volume.
>>>
>>> 3) Is it necessary to make thin-volume support DAX. Is there any use
>>> case for the requirement?
>>
>>
>> Though I asked other place(*), I really want to know the usecase of
>> dm-thin-volume with DAX and reflink.
>>
>>
>> In my understanding, dm-thin-volume seems to provide similar feature like
>> reflink of xfs. Both feature provide COW update to reduce usage of
>> its region, and snapshot feature, right?
>>
>> I found that docker seems to select one of them (or other feature which
>> supports COW). Then user don't need to use thin-volume and reflink at same
>> time.
>>
>> Database which uses FS-DAX may want to use snapshot for its data of FS-DAX,
>> its user seems to be satisfied with reflink or thin-volume.
>>
>> So I could not find on what use-case user would like to use dm-thin-volume
>> and reflink at same time.
>>
>> The only possibility is that the user has mistakenly configured dm-thinpool
>> and reflink to be used at the same time, but if that is the case, it seems
>> to be better for the user to disable one or the other.
>>
>> I really wander why dm-thin-volume must be used with reflik and FS-DAX.
> 
> There isn't a hard requirement between fsdax and dm-thinp.  The /test/
> needs dm-logwrites to check that write page faults on a MAP_SYNC
> mmapping are persisted directly to disk.  dm-logwrites requires a fast
> way to zero an entire device for correct operation of the replay step,
> and thinp is the only way to guarantee that.

Thank you for your answer. But I still feel something is strange.
Though dm-thinp may be good way to execute the test correctly,
I suppose it seems to be likely a kind of workaround to pass the test,
it may not be really required for actual users.

Could you tell me why passing test by workaround is so necessary?

Thanks,


> 
> --D
> 
>> If my understanding is something wrong, please correct me.
>>
>> (*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/
Darrick J. Wong Oct. 4, 2022, 6:26 p.m. UTC | #24
On Mon, Oct 03, 2022 at 09:12:46PM -0700, Gotou, Yasunori/五島 康文 wrote:
> On 2022/10/03 17:12, Darrick J. Wong wrote:
> > On Fri, Sep 30, 2022 at 09:56:41AM +0900, Gotou, Yasunori/五島 康文 wrote:
> > > Hello everyone,
> > > 
> > > On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote:
> > > > Hi Darrick, Brian and Christoph
> > > > 
> > > > Ping. I hope to get your feedback.
> > > > 
> > > > 1) I have confirmed that the following patch set did not change the test
> > > > result of generic/470 with thin-volume. Besides, I didn't see any
> > > > failure when running generic/470 based on normal PMEM device instaed of
> > > > thin-volume.
> > > > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/
> > > > 
> > > > 2) I can reproduce the failure of generic/482 without thin-volume.
> > > > 
> > > > 3) Is it necessary to make thin-volume support DAX. Is there any use
> > > > case for the requirement?
> > > 
> > > 
> > > Though I asked other place(*), I really want to know the usecase of
> > > dm-thin-volume with DAX and reflink.
> > > 
> > > 
> > > In my understanding, dm-thin-volume seems to provide similar feature like
> > > reflink of xfs. Both feature provide COW update to reduce usage of
> > > its region, and snapshot feature, right?
> > > 
> > > I found that docker seems to select one of them (or other feature which
> > > supports COW). Then user don't need to use thin-volume and reflink at same
> > > time.
> > > 
> > > Database which uses FS-DAX may want to use snapshot for its data of FS-DAX,
> > > its user seems to be satisfied with reflink or thin-volume.
> > > 
> > > So I could not find on what use-case user would like to use dm-thin-volume
> > > and reflink at same time.
> > > 
> > > The only possibility is that the user has mistakenly configured dm-thinpool
> > > and reflink to be used at the same time, but if that is the case, it seems
> > > to be better for the user to disable one or the other.
> > > 
> > > I really wander why dm-thin-volume must be used with reflik and FS-DAX.
> > 
> > There isn't a hard requirement between fsdax and dm-thinp.  The /test/
> > needs dm-logwrites to check that write page faults on a MAP_SYNC
> > mmapping are persisted directly to disk.  dm-logwrites requires a fast
> > way to zero an entire device for correct operation of the replay step,
> > and thinp is the only way to guarantee that.
> 
> Thank you for your answer. But I still feel something is strange.
> Though dm-thinp may be good way to execute the test correctly,

Yep.

> I suppose it seems to be likely a kind of workaround to pass the test,
> it may not be really required for actual users.

Exactly correct.  Real users should /never/ set up this kind of (test
scaffolding|insanity) to use fsdax.

> Could you tell me why passing test by workaround is so necessary?

Notice this line in generic/470:

$XFS_IO_PROG -t -c "truncate $LEN" -c "mmap -S 0 $LEN" -c "mwrite 0 $LEN" \
	-c "log_writes -d $LOGWRITES_NAME -m preunmap" \
	-f $SCRATCH_MNT/test

The second xfs_io command creates a MAP_SYNC mmap of the
SCRATCH_MNT/test file, and the third command memcpy's bytes to the
mapping to invoke the write page fault handler.

The fourth command tells the dm-logwrites driver for $LOGWRITES_NAME
(aka the block device containing the mounted XFS filesystem) to create a
mark called "preunmap".  This mark captures the exact state of the block
device immediately after the write faults complete, so that we can come
back to it later.  There are a few things to note here:

  (1) We did not tell the fs to persist anything;
  (2) We can't use dm-snapshot here, because dm-snapshot will flush the
      fs (I think?); and
  (3) The fs is still mounted, so the state of the block device at the
      mark reflects a dirty XFS with a log that must be replayed.

The next thing the test does is unmount the fs, remove the dm-logwrites
driver to stop recording, and check the fs:

_log_writes_unmount
_log_writes_remove
_dmthin_check_fs

This ensures that the post-umount fs is consistent.  Now we want to roll
back to the place we marked to see if the mwrite data made it to pmem.
It *should* have, since we asked for a MAP_SYNC mapping on a fsdax
filesystem recorded on a pmem device:

# check pre-unmap state
_log_writes_replay_log preunmap $DMTHIN_VOL_DEV
_dmthin_mount

dm-logwrites can't actually roll backwards in time to a mark, since it
only records new disk contents.  It /can/ however roll forward from
whatever point it began recording writes to the mark, so that's what it
does.

However -- remember note (3) from earlier.  When we _dmthin_mount after
replaying the log to the "preunmap" mark, XFS will see the dirty XFS log
and try to recover the XFS log.  This is where the replay problems crop
up.  The XFS log records a monotonically increasing sequence number
(LSN) with every log update, and when updates are written into the
filesystem, that LSN is also written into the filesystem block.  Log
recovery also replays updates into the filesystem, but with the added
behavior that it skips a block replay if the block's LSN is higher than
the transaction being replayed.  IOWs, we never replay older block
contents over newer block contents.

For dm-logwrites this is a major problem, because there could be more
filesystem updates written to the XFS log after the mark is made.  LSNs
will then be handed out like this:

mkfs_lsn                 preunmap_lsn             umount_lsn
  |                           |                      |
  |--------------------------||----------|-----------|
                             |           |
                         xxx_lsn     yyy_lsn

Let's say that a new metadata block "BBB" was created in update "xxx"
immediately before the preunmap mark was made.  Per (1), we didn't flush
the filesystem before taking the mark, which means that the new block's
contents exist only in the log at this point.

Let us further say that the new block was again changed in update "yyy",
where preunmap_lsn < yyy_lsn <= umount_lsn.  Clearly, yyy_lsn > xxx_lsn.
yyy_lsn is written to the block at unmount, because unmounting flushes
the log clean before it completes.  This is the first time that BBB ever
gets written.

_log_writes_replay_log begins replaying the block device from mkfs_lsn
towards preunmap_lsn.  When it's done, it will have a log that reflects
all the changes up to preunmap_lsn.  Recall however that BBB isn't
written until after the preunmap mark, which means that dm-logwrites has
no record of BBB before preunmap_lsn, so dm-logwrites replay won't touch
BBB.  At this point, the block header for BBB has a UUID that matches
the filesystem, but a LSN (yyy_lsn) that is beyond preunmap_lsn.

XFS log recovery starts up, and finds transaction xxx.  It will read BBB
from disk, but then it will see that it has an LSN of yyy_lsn.  This is
larger than xxx_lsn, so it concludes that BBB is newer than the log and
moves on to the next log item.  No other log items touch BBB, so
recovery finishes, and now we have a filesystem containing one metadata
block (BBB) from the future.  This is an inconsistent filesystem, and
has caused failures in the tests that use logwrites.

To work around this problem, all we really need to do is reinitialize
the entire block device to known contents at mkfs time.  This can be
done expensively by writing zeroes to the entire block device, or it can
be done cheaply by (a) issuing DISCARD to the whole the block device at
the start of the test and (b) ensuring that reads after a discard always
produce zeroes.  mkfs.xfs already does (a), so the test merely has to
ensure (b).

dm-thinp is the only software solution that provides (b), so that's why
this test layers dm-logwrites on top of dm-thinp on top of $SCRATCH_DEV.
This combination used to work, but with the pending pmem/blockdev
divorce, this strategy is no longer feasible.

I think the only way to fix this test is (a) revert all of Christoph's
changes so far and scuttle the divorce; or (b) change this test like so:

 1. Create a large sparse file on $TEST_DIR and losetup that sparse
    file.  The resulting loop device will not have dax capability.

 2. Set up the dmthin/dmlogwrites stack on top of this loop device.

 3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem
    device) as the realtime device, and set the daxinherit and rtinherit
    flags on the root directory.  The result is a filesystem with a data
    section that the kernel will treat as a regular block device, a
    realtime section backed by pmem, and the necessary flags to make
    sure that the test file will actually get fsdax mode.

 4. Acknowledge that we no longer have any way to test MAP_SYNC
    functionality on ext4, which means that generic/470 has to move to
    tests/xfs/.

--D

> Thanks,
> 
> 
> > 
> > --D
> > 
> > > If my understanding is something wrong, please correct me.
> > > 
> > > (*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/
Xiao Yang Oct. 20, 2022, 2:17 p.m. UTC | #25
On 2022/10/5 2:26, Darrick J. Wong wrote:
> Notice this line in generic/470:
> 
> $XFS_IO_PROG -t -c "truncate $LEN" -c "mmap -S 0 $LEN" -c "mwrite 0 $LEN" \
> 	-c "log_writes -d $LOGWRITES_NAME -m preunmap" \
> 	-f $SCRATCH_MNT/test
> 
> The second xfs_io command creates a MAP_SYNC mmap of the
> SCRATCH_MNT/test file, and the third command memcpy's bytes to the
> mapping to invoke the write page fault handler.
> 
> The fourth command tells the dm-logwrites driver for $LOGWRITES_NAME
> (aka the block device containing the mounted XFS filesystem) to create a
> mark called "preunmap".  This mark captures the exact state of the block
> device immediately after the write faults complete, so that we can come
> back to it later.  There are a few things to note here:
> 
>    (1) We did not tell the fs to persist anything;
>    (2) We can't use dm-snapshot here, because dm-snapshot will flush the
>        fs (I think?); and
>    (3) The fs is still mounted, so the state of the block device at the
>        mark reflects a dirty XFS with a log that must be replayed.
> 
> The next thing the test does is unmount the fs, remove the dm-logwrites
> driver to stop recording, and check the fs:
> 
> _log_writes_unmount
> _log_writes_remove
> _dmthin_check_fs
> 
> This ensures that the post-umount fs is consistent.  Now we want to roll
> back to the place we marked to see if the mwrite data made it to pmem.
> It*should*  have, since we asked for a MAP_SYNC mapping on a fsdax
> filesystem recorded on a pmem device:
> 
> # check pre-unmap state
> _log_writes_replay_log preunmap $DMTHIN_VOL_DEV
> _dmthin_mount
> 
> dm-logwrites can't actually roll backwards in time to a mark, since it
> only records new disk contents.  It/can/  however roll forward from
> whatever point it began recording writes to the mark, so that's what it
> does.
> 
> However -- remember note (3) from earlier.  When we _dmthin_mount after
> replaying the log to the "preunmap" mark, XFS will see the dirty XFS log
> and try to recover the XFS log.  This is where the replay problems crop
> up.  The XFS log records a monotonically increasing sequence number
> (LSN) with every log update, and when updates are written into the
> filesystem, that LSN is also written into the filesystem block.  Log
> recovery also replays updates into the filesystem, but with the added
> behavior that it skips a block replay if the block's LSN is higher than
> the transaction being replayed.  IOWs, we never replay older block
> contents over newer block contents.
> 
> For dm-logwrites this is a major problem, because there could be more
> filesystem updates written to the XFS log after the mark is made.  LSNs
> will then be handed out like this:
> 
> mkfs_lsn                 preunmap_lsn             umount_lsn
>    |                           |                      |
>    |--------------------------||----------|-----------|
>                               |           |
>                           xxx_lsn     yyy_lsn
> 
> Let's say that a new metadata block "BBB" was created in update "xxx"
> immediately before the preunmap mark was made.  Per (1), we didn't flush
> the filesystem before taking the mark, which means that the new block's
> contents exist only in the log at this point.
> 
> Let us further say that the new block was again changed in update "yyy",
> where preunmap_lsn < yyy_lsn <= umount_lsn.  Clearly, yyy_lsn > xxx_lsn.
> yyy_lsn is written to the block at unmount, because unmounting flushes
> the log clean before it completes.  This is the first time that BBB ever
> gets written.
> 
> _log_writes_replay_log begins replaying the block device from mkfs_lsn
> towards preunmap_lsn.  When it's done, it will have a log that reflects
> all the changes up to preunmap_lsn.  Recall however that BBB isn't
> written until after the preunmap mark, which means that dm-logwrites has
> no record of BBB before preunmap_lsn, so dm-logwrites replay won't touch
> BBB.  At this point, the block header for BBB has a UUID that matches
> the filesystem, but a LSN (yyy_lsn) that is beyond preunmap_lsn.
> 
> XFS log recovery starts up, and finds transaction xxx.  It will read BBB
> from disk, but then it will see that it has an LSN of yyy_lsn.  This is
> larger than xxx_lsn, so it concludes that BBB is newer than the log and
> moves on to the next log item.  No other log items touch BBB, so
> recovery finishes, and now we have a filesystem containing one metadata
> block (BBB) from the future.  This is an inconsistent filesystem, and
> has caused failures in the tests that use logwrites.
> 
> To work around this problem, all we really need to do is reinitialize
> the entire block device to known contents at mkfs time.  This can be
> done expensively by writing zeroes to the entire block device, or it can
> be done cheaply by (a) issuing DISCARD to the whole the block device at
> the start of the test and (b) ensuring that reads after a discard always
> produce zeroes.  mkfs.xfs already does (a), so the test merely has to
> ensure (b).
> 
> dm-thinp is the only software solution that provides (b), so that's why
> this test layers dm-logwrites on top of dm-thinp on top of $SCRATCH_DEV.
> This combination used to work, but with the pending pmem/blockdev
> divorce, this strategy is no longer feasible.

Hi Darrick,

Thanks a lot for your detailed explanation.

Could you tell me if my understanding is correct. I think the issue is 
that log-writes log and XFS log may save the different state of block 
device. It is possible for XFS log to save the more updates than 
log-writes log does. In this case, we can recovery the block device by 
log-writes log's replay but we will get the inconsistent filesystem when 
mounting the block device because the mount operation will try to 
recovery more updates for XFS on the block deivce by XFS log. We need to 
fix the issue by discarding XFS log on the block device.  mkfs.xfs will 
try to discard the blocks including XFS log by calling ioctl(BLKDISCARD) 
  but it will ignore error silently when the block device doesn't 
support ioctl(BLKDISCARD).  Discarding XFS log is what you said 
"reinitialize the entire block device", right?

> 
> I think the only way to fix this test is (a) revert all of Christoph's
> changes so far and scuttle the divorce; or (b) change this test like so:

Sorry, I didn't know which Christoph's patches need to be reverted?
Could you tell me the URL about Christoph's patches?

> 
>   1. Create a large sparse file on $TEST_DIR and losetup that sparse
>      file.  The resulting loop device will not have dax capability.
> 
>   2. Set up the dmthin/dmlogwrites stack on top of this loop device.
> 
>   3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem
>      device) as the realtime device, and set the daxinherit and rtinherit
>      flags on the root directory.  The result is a filesystem with a data
>      section that the kernel will treat as a regular block device, a
>      realtime section backed by pmem, and the necessary flags to make
>      sure that the test file will actually get fsdax mode.
> 
>   4. Acknowledge that we no longer have any way to test MAP_SYNC
>      functionality on ext4, which means that generic/470 has to move to
>      tests/xfs/.

Sorry, I didn't understand why the above test change can fix the issue.
Could you tell me which step can discard XFS log?

In addition, I don't like your idea about the test change because it 
will make generic/470 become the special test for XFS. Do you know if we 
can fix the issue by changing the test in another way? blkdiscard -z can 
fix the issue because it does zero-fill rather than discard on the block 
device.  However, blkdiscard -z will take a lot of time when the block 
device is large.

Best Regards,
Xiao Yang

> 
> --D
Darrick J. Wong Oct. 22, 2022, 2:11 a.m. UTC | #26
On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> On 2022/10/5 2:26, Darrick J. Wong wrote:
> > Notice this line in generic/470:
> > 
> > $XFS_IO_PROG -t -c "truncate $LEN" -c "mmap -S 0 $LEN" -c "mwrite 0 $LEN" \
> > 	-c "log_writes -d $LOGWRITES_NAME -m preunmap" \
> > 	-f $SCRATCH_MNT/test
> > 
> > The second xfs_io command creates a MAP_SYNC mmap of the
> > SCRATCH_MNT/test file, and the third command memcpy's bytes to the
> > mapping to invoke the write page fault handler.
> > 
> > The fourth command tells the dm-logwrites driver for $LOGWRITES_NAME
> > (aka the block device containing the mounted XFS filesystem) to create a
> > mark called "preunmap".  This mark captures the exact state of the block
> > device immediately after the write faults complete, so that we can come
> > back to it later.  There are a few things to note here:
> > 
> >    (1) We did not tell the fs to persist anything;
> >    (2) We can't use dm-snapshot here, because dm-snapshot will flush the
> >        fs (I think?); and
> >    (3) The fs is still mounted, so the state of the block device at the
> >        mark reflects a dirty XFS with a log that must be replayed.
> > 
> > The next thing the test does is unmount the fs, remove the dm-logwrites
> > driver to stop recording, and check the fs:
> > 
> > _log_writes_unmount
> > _log_writes_remove
> > _dmthin_check_fs
> > 
> > This ensures that the post-umount fs is consistent.  Now we want to roll
> > back to the place we marked to see if the mwrite data made it to pmem.
> > It*should*  have, since we asked for a MAP_SYNC mapping on a fsdax
> > filesystem recorded on a pmem device:
> > 
> > # check pre-unmap state
> > _log_writes_replay_log preunmap $DMTHIN_VOL_DEV
> > _dmthin_mount
> > 
> > dm-logwrites can't actually roll backwards in time to a mark, since it
> > only records new disk contents.  It/can/  however roll forward from
> > whatever point it began recording writes to the mark, so that's what it
> > does.
> > 
> > However -- remember note (3) from earlier.  When we _dmthin_mount after
> > replaying the log to the "preunmap" mark, XFS will see the dirty XFS log
> > and try to recover the XFS log.  This is where the replay problems crop
> > up.  The XFS log records a monotonically increasing sequence number
> > (LSN) with every log update, and when updates are written into the
> > filesystem, that LSN is also written into the filesystem block.  Log
> > recovery also replays updates into the filesystem, but with the added
> > behavior that it skips a block replay if the block's LSN is higher than
> > the transaction being replayed.  IOWs, we never replay older block
> > contents over newer block contents.
> > 
> > For dm-logwrites this is a major problem, because there could be more
> > filesystem updates written to the XFS log after the mark is made.  LSNs
> > will then be handed out like this:
> > 
> > mkfs_lsn                 preunmap_lsn             umount_lsn
> >    |                           |                      |
> >    |--------------------------||----------|-----------|
> >                               |           |
> >                           xxx_lsn     yyy_lsn
> > 
> > Let's say that a new metadata block "BBB" was created in update "xxx"
> > immediately before the preunmap mark was made.  Per (1), we didn't flush
> > the filesystem before taking the mark, which means that the new block's
> > contents exist only in the log at this point.
> > 
> > Let us further say that the new block was again changed in update "yyy",
> > where preunmap_lsn < yyy_lsn <= umount_lsn.  Clearly, yyy_lsn > xxx_lsn.
> > yyy_lsn is written to the block at unmount, because unmounting flushes
> > the log clean before it completes.  This is the first time that BBB ever
> > gets written.
> > 
> > _log_writes_replay_log begins replaying the block device from mkfs_lsn
> > towards preunmap_lsn.  When it's done, it will have a log that reflects
> > all the changes up to preunmap_lsn.  Recall however that BBB isn't
> > written until after the preunmap mark, which means that dm-logwrites has
> > no record of BBB before preunmap_lsn, so dm-logwrites replay won't touch
> > BBB.  At this point, the block header for BBB has a UUID that matches
> > the filesystem, but a LSN (yyy_lsn) that is beyond preunmap_lsn.
> > 
> > XFS log recovery starts up, and finds transaction xxx.  It will read BBB
> > from disk, but then it will see that it has an LSN of yyy_lsn.  This is
> > larger than xxx_lsn, so it concludes that BBB is newer than the log and
> > moves on to the next log item.  No other log items touch BBB, so
> > recovery finishes, and now we have a filesystem containing one metadata
> > block (BBB) from the future.  This is an inconsistent filesystem, and
> > has caused failures in the tests that use logwrites.
> > 
> > To work around this problem, all we really need to do is reinitialize
> > the entire block device to known contents at mkfs time.  This can be
> > done expensively by writing zeroes to the entire block device, or it can
> > be done cheaply by (a) issuing DISCARD to the whole the block device at
> > the start of the test and (b) ensuring that reads after a discard always
> > produce zeroes.  mkfs.xfs already does (a), so the test merely has to
> > ensure (b).
> > 
> > dm-thinp is the only software solution that provides (b), so that's why
> > this test layers dm-logwrites on top of dm-thinp on top of $SCRATCH_DEV.
> > This combination used to work, but with the pending pmem/blockdev
> > divorce, this strategy is no longer feasible.
> 
> Hi Darrick,
> 
> Thanks a lot for your detailed explanation.
> 
> Could you tell me if my understanding is correct. I think the issue is that
> log-writes log and XFS log may save the different state of block device. It
> is possible for XFS log to save the more updates than log-writes log does.

Yes.

> In this case, we can recovery the block device by log-writes log's replay
> but we will get the inconsistent filesystem when mounting the block device
> because the mount operation will try to recovery more updates for XFS on the
> block deivce by XFS log.

Sort of...

> We need to fix the issue by discarding XFS log on the block device.
> mkfs.xfs will try to discard the blocks including XFS log by calling
> ioctl(BLKDISCARD)  but it will ignore error silently when the block
> device doesn't support ioctl(BLKDISCARD).

...but I think here's where I think your understanding isn't correct.
It might help to show how the nested logging creates its own problems.

First, let's say there's a block B that contains some stale garbage
AAAA.

XFS writes a block into the XFS log (call the block L) with the
instructions "allocate block B and write CCCC to block B".  dm-logwrites
doesn't know or care about the contents of the blocks that it is told to
write; it only knows that XFS told it to write some data (the
instructions) to block L.  So it remembers the fact that some data got
written to L, but it doesn't know about B at all.

At the point where we create the dm-logwrites preunmap mark, it's only
tracking L.  It is not tracking B.   After the mark is taken, the XFS
AIL writes CCCC to B, and only then does dm-logwrites begin tracking B.
Hence B is not included in the preunmap mark.  The pre-unmount process
in XFS writes to the XFS log "write DDDD to block B" and the unmount
process checkpoints the log contents, so now block B contains contains
DDDD.

Now the test wants to roll to the preunmap mark.  Unfortunately,
dm-logwrites doesn't record former block contents, which means that the
log replay tools cannot roll backwards from "umount" to "preunmap" --
they can only roll forward from the beginning.  So there's no way to
undo writing DDDD or CCCC to B.  IOWs, there's no way to revert B's
state back to AAAA when doing dm-logwrites recovery.

Now XFS log recovery starts.  It sees "allocate block B and write CCCC
to block B".  However, it reads block B, sees that it contains DDDD, and
it skips writing CCCC.  Incorrectly.  The only way to avoid this is to
zero B before replaying the dm-logwrites.

So you could solve the problem via BLKDISCARD, or writing zeroes to the
entire block device, or scanning the metadata and writing zeroes to just
those blocks, or by adding undo buffer records to dm-logwrites and
teaching it to do a proper rollback.

> Discarding XFS log is what you said "reinitialize the entire block
> device", right?

No, I really meant the /entire/ block device.

> > 
> > I think the only way to fix this test is (a) revert all of Christoph's
> > changes so far and scuttle the divorce; or (b) change this test like so:
> 
> Sorry, I didn't know which Christoph's patches need to be reverted?
> Could you tell me the URL about Christoph's patches?

Eh, it's a whole long series of patches scuttling various parts where
pmem could talk to the block layer.  I doubt he'll accept you reverting
his removal code.

> >   1. Create a large sparse file on $TEST_DIR and losetup that sparse
> >      file.  The resulting loop device will not have dax capability.
> > 
> >   2. Set up the dmthin/dmlogwrites stack on top of this loop device.
> > 
> >   3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem
> >      device) as the realtime device, and set the daxinherit and rtinherit
> >      flags on the root directory.  The result is a filesystem with a data
> >      section that the kernel will treat as a regular block device, a
> >      realtime section backed by pmem, and the necessary flags to make
> >      sure that the test file will actually get fsdax mode.
> > 
> >   4. Acknowledge that we no longer have any way to test MAP_SYNC
> >      functionality on ext4, which means that generic/470 has to move to
> >      tests/xfs/.
> 
> Sorry, I didn't understand why the above test change can fix the issue.

XFS supports two-device filesystems -- the "realtime" section, and the
"data" section.  FS metadata and log all live in the "data" section.

So we change the test to set up some regular files, loop-mount the
files, set up the requisite dm-logwrites stuff atop the loop devices,
and format the XFS with the data section backed by the dm-logwrites
device, and make the realtime section backed by the pmem.

This way the log replay program can actually discard the data device
(because it's a regular file) and replay the log forward to the preunmap
mark.  The pmem device is not involved in the replay at all, since
changes to file data are never logged.  It now becomes irrelevant that
pmem no longer supports device mapper.

> Could you tell me which step can discard XFS log?

(None of the steps do that.)

> In addition, I don't like your idea about the test change because it will
> make generic/470 become the special test for XFS. Do you know if we can fix
> the issue by changing the test in another way? blkdiscard -z can fix the
> issue because it does zero-fill rather than discard on the block device.
> However, blkdiscard -z will take a lot of time when the block device is
> large.

Well we /could/ just do that too, but that will suck if you have 2TB of
pmem. ;)

Maybe as an alternative path we could just create a very small
filesystem on the pmem and then blkdiscard -z it?

That said -- does persistent memory actually have a future?  Intel
scuttled the entire Optane product, cxl.mem sounds like expansion
chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
asserts everywhere) and 6.1 (every time I run fstests now I see massive
data corruption).

Frankly at this point I'm tempted just to turn of fsdax support for XFS
for the 6.1 LTS because I don't have time to fix it.

--D

> 
> Best Regards,
> Xiao Yang
> 
> > 
> > --D
Xiao Yang Oct. 23, 2022, 7:04 a.m. UTC | #27
On 2022/10/22 10:11, Darrick J. Wong wrote:
>> We need to fix the issue by discarding XFS log on the block device.
>> mkfs.xfs will try to discard the blocks including XFS log by calling
>> ioctl(BLKDISCARD)  but it will ignore error silently when the block 
>> device doesn't support ioctl(BLKDISCARD).
> ...but I think here's where I think your understanding isn't correct.
> It might help to show how the nested logging creates its own problems.
> 
> First, let's say there's a block B that contains some stale garbage 
> AAAA.
> 
> XFS writes a block into the XFS log (call the block L) with the 
> instructions "allocate block B and write CCCC to block B".  
> dm-logwrites doesn't know or care about the contents of the blocks 
> that it is told to write; it only knows that XFS told it to write some 
> data (the
> instructions) to block L.  So it remembers the fact that some data got 
> written to L, but it doesn't know about B at all.
> 
> At the point where we create the dm-logwrites preunmap mark, it's only
> tracking L.  It is not tracking B.   After the mark is taken, the XFS
> AIL writes CCCC to B, and only then does dm-logwrites begin tracking B.
> Hence B is not included in the preunmap mark.  The pre-unmount process 
> in XFS writes to the XFS log "write DDDD to block B" and the unmount 
> process checkpoints the log contents, so now block B contains contains 
> DDDD.
> 
> Now the test wants to roll to the preunmap mark.  Unfortunately, 
> dm-logwrites doesn't record former block contents, which means that 
> the log replay tools cannot roll backwards from "umount" to "preunmap" 
> -- they can only roll forward from the beginning.  So there's no way 
> to undo writing DDDD or CCCC to B.  IOWs, there's no way to revert B's 
> state back to AAAA when doing dm-logwrites recovery.
> 
> Now XFS log recovery starts.  It sees "allocate block B and write CCCC 
> to block B".  However, it reads block B, sees that it contains DDDD, 
> and it skips writing CCCC.  Incorrectly.  The only way to avoid this 
> is to zero B before replaying the dm-logwrites.
> 
> So you could solve the problem via BLKDISCARD, or writing zeroes to 
> the entire block device, or scanning the metadata and writing zeroes 
> to just those blocks, or by adding undo buffer records to dm-logwrites 
> and teaching it to do a proper rollback.

Hi Darrick,

Thanks for your patient explanation.

Do you know if XFS log records still use buffer write? In other words, they cannot be written into the block device in DAX mode, right?

In fact, I can reproduce the inconsistent filesystem issue on
generic/482 but cannot reproduce the issue on generic/470.

> 
>> Discarding XFS log is what you said "reinitialize the entire block 
>> device", right?
> No, I really meant the/entire/  block device.
> 
>>> I think the only way to fix this test is (a) revert all of 
>>> Christoph's changes so far and scuttle the divorce; or (b) change this test like so:
>> Sorry, I didn't know which Christoph's patches need to be reverted?
>> Could you tell me the URL about Christoph's patches?
> Eh, it's a whole long series of patches scuttling various parts where 
> pmem could talk to the block layer.  I doubt he'll accept you 
> reverting his removal code.

Where can I find the Christoph's patch set you mentioned.
I just want to know the content of Christoph's patch set.

> 
>>>    1. Create a large sparse file on $TEST_DIR and losetup that sparse
>>>       file.  The resulting loop device will not have dax capability.
>>>
>>>    2. Set up the dmthin/dmlogwrites stack on top of this loop device.
>>>
>>>    3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem
>>>       device) as the realtime device, and set the daxinherit and rtinherit
>>>       flags on the root directory.  The result is a filesystem with a data
>>>       section that the kernel will treat as a regular block device, a
>>>       realtime section backed by pmem, and the necessary flags to make
>>>       sure that the test file will actually get fsdax mode.
>>>
>>>    4. Acknowledge that we no longer have any way to test MAP_SYNC
>>>       functionality on ext4, which means that generic/470 has to move to
>>>       tests/xfs/.
>> Sorry, I didn't understand why the above test change can fix the issue.
> XFS supports two-device filesystems -- the "realtime" section, and the 
> "data" section.  FS metadata and log all live in the "data" section.
> 
> So we change the test to set up some regular files, loop-mount the 
> files, set up the requisite dm-logwrites stuff atop the loop devices, 
> and format the XFS with the data section backed by the dm-logwrites 
> device, and make the realtime section backed by the pmem.
> 
> This way the log replay program can actually discard the data device 
> (because it's a regular file) and replay the log forward to the 
> preunmap mark.  The pmem device is not involved in the replay at all, 
> since changes to file data are never logged.  It now becomes 
> irrelevant that pmem no longer supports device mapper.
> 
>> Could you tell me which step can discard XFS log?
> (None of the steps do that.)
> 
>> In addition, I don't like your idea about the test change because it 
>> will make generic/470 become the special test for XFS. Do you know if 
>> we can fix the issue by changing the test in another way? blkdiscard 
>> -z can fix the issue because it does zero-fill rather than discard on the block device.
>> However, blkdiscard -z will take a lot of time when the block device 
>> is large.
> Well we/could/  just do that too, but that will suck if you have 2TB 
> of
> pmem.;)
> 
> Maybe as an alternative path we could just create a very small 
> filesystem on the pmem and then blkdiscard -z it?

Good idea, I have sent a patch set to do it.
https://lore.kernel.org/fstests/20221023064810.847110-1-yangx.jy@fujitsu.com/T/#t

> 
> That said -- does persistent memory actually have a future?  Intel 
> scuttled the entire Optane product, cxl.mem sounds like expansion 
> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird 
> kernel asserts everywhere) and 6.1 (every time I run fstests now I see 
> massive data corruption).

As far as I know, cxl.mem will take use of nvdimm driver and can be used by many existing applications.

Best Regards,
Xiao Yang
Dave Chinner Oct. 23, 2022, 10 p.m. UTC | #28
On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> > In addition, I don't like your idea about the test change because it will
> > make generic/470 become the special test for XFS. Do you know if we can fix
> > the issue by changing the test in another way? blkdiscard -z can fix the
> > issue because it does zero-fill rather than discard on the block device.
> > However, blkdiscard -z will take a lot of time when the block device is
> > large.
> 
> Well we /could/ just do that too, but that will suck if you have 2TB of
> pmem. ;)
> 
> Maybe as an alternative path we could just create a very small
> filesystem on the pmem and then blkdiscard -z it?
> 
> That said -- does persistent memory actually have a future?  Intel
> scuttled the entire Optane product, cxl.mem sounds like expansion
> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> asserts everywhere) and 6.1 (every time I run fstests now I see massive
> data corruption).

Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
don't think that has changed at all - I still see lots of kernel
warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
argument" errors.

If I turn off reflink, then instead of data corruption I get kernel
warnings like this from fsx and fsstress workloads:

[415478.558426] ------------[ cut here ]------------
[415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320
[415478.564028] Modules linked in:
[415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G        W 6.1.0-rc1-dgc+ #1615
[415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320
[415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6
[415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002
[415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001
[415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840
[415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
[415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58
[415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000
[415478.596983] FS:  00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000
[415478.599364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0
[415478.602883] Call Trace:
[415478.603598]  <TASK>
[415478.604229]  dax_fault_iter+0x240/0x600
[415478.605410]  dax_iomap_pte_fault+0x19c/0x3d0
[415478.606706]  __xfs_filemap_fault+0x1dd/0x2b0
[415478.607744]  __do_fault+0x2e/0x1d0
[415478.608587]  __handle_mm_fault+0xcec/0x17b0
[415478.609593]  handle_mm_fault+0xd0/0x2a0
[415478.610517]  exc_page_fault+0x1d9/0x810
[415478.611398]  asm_exc_page_fault+0x22/0x30
[415478.612311] RIP: 0033:0x7fd71a04b9ba
[415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9
[415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206
[415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610
[415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5
[415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640
[415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699
[415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001
[415478.625600]  </TASK>
[415478.626087] ---[ end trace 0000000000000000 ]---

Even generic/247 is generating a warning like this from xfs_io,
which is a mmap vs DIO racer. Given that DIO doesn't exist for
fsdax, this test turns into just a normal write() vs mmap() racer.

Given these are the same fsdax infrastructure failures that I
reported for 6.0, it is also likely that ext4 is still throwing
them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the
6.1 cycle.

> Frankly at this point I'm tempted just to turn of fsdax support for XFS
> for the 6.1 LTS because I don't have time to fix it.

/me shrugs

Backporting fixes (whenever they come along) is a problem for the
LTS kernel maintainer to deal with, not the upstream maintainer.

IMO, the issue right now is that the DAX maintainers seem to have
little interest in ensuring that the FSDAX infrastructure actually
works correctly. If anything, they seem to want to make things
harder for block based filesystems to use pmem devices and hence
FSDAX. e.g. the direction of the DAX core away from block interfaces
that filesystems need for their userspace tools to manage the
storage.

At what point do we simply say "the experiment failed, FSDAX is
dead" and remove it from XFS altogether?

Cheers,

Dave.
Shiyang Ruan Oct. 24, 2022, 3:17 a.m. UTC | #29
在 2022/10/24 6:00, Dave Chinner 写道:
> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
>>> In addition, I don't like your idea about the test change because it will
>>> make generic/470 become the special test for XFS. Do you know if we can fix
>>> the issue by changing the test in another way? blkdiscard -z can fix the
>>> issue because it does zero-fill rather than discard on the block device.
>>> However, blkdiscard -z will take a lot of time when the block device is
>>> large.
>>
>> Well we /could/ just do that too, but that will suck if you have 2TB of
>> pmem. ;)
>>
>> Maybe as an alternative path we could just create a very small
>> filesystem on the pmem and then blkdiscard -z it?
>>
>> That said -- does persistent memory actually have a future?  Intel
>> scuttled the entire Optane product, cxl.mem sounds like expansion
>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
>> asserts everywhere) and 6.1 (every time I run fstests now I see massive
>> data corruption).
>
> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> don't think that has changed at all - I still see lots of kernel
> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> argument" errors.

Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
caused by the restrictions which prevent reflink work together with DAX:

a. fs/xfs/xfs_ioctl.c:1141
/* Don't allow us to set DAX mode for a reflinked file for now. */
if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
        return -EINVAL;

b. fs/xfs/xfs_iops.c:1174
/* Only supported on non-reflinked files. */
if (xfs_is_reflink_inode(ip))
        return false;

These restrictions were removed in "drop experimental warning" patch[1].
  I think they should be separated from that patch.

[1]
https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/


Secondly, how the data corruption happened? Or which case failed?  Could
you give me more info (such as mkfs options, xfstests configs)?

>
> If I turn off reflink, then instead of data corruption I get kernel
> warnings like this from fsx and fsstress workloads:
>
> [415478.558426] ------------[ cut here ]------------
> [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320
> [415478.564028] Modules linked in:
> [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G        W 6.1.0-rc1-dgc+ #1615
> [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320
> [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6
> [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002
> [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001
> [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840
> [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
> [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58
> [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000
> [415478.596983] FS:  00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000
> [415478.599364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0
> [415478.602883] Call Trace:
> [415478.603598]  <TASK>
> [415478.604229]  dax_fault_iter+0x240/0x600
> [415478.605410]  dax_iomap_pte_fault+0x19c/0x3d0
> [415478.606706]  __xfs_filemap_fault+0x1dd/0x2b0
> [415478.607744]  __do_fault+0x2e/0x1d0
> [415478.608587]  __handle_mm_fault+0xcec/0x17b0
> [415478.609593]  handle_mm_fault+0xd0/0x2a0
> [415478.610517]  exc_page_fault+0x1d9/0x810
> [415478.611398]  asm_exc_page_fault+0x22/0x30
> [415478.612311] RIP: 0033:0x7fd71a04b9ba
> [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9
> [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206
> [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610
> [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5
> [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640
> [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699
> [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001
> [415478.625600]  </TASK>
> [415478.626087] ---[ end trace 0000000000000000 ]---
>
> Even generic/247 is generating a warning like this from xfs_io,
> which is a mmap vs DIO racer. Given that DIO doesn't exist for
> fsdax, this test turns into just a normal write() vs mmap() racer.
>
> Given these are the same fsdax infrastructure failures that I
> reported for 6.0, it is also likely that ext4 is still throwing
> them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the
> 6.1 cycle.

Still working on it...

>
>> Frankly at this point I'm tempted just to turn of fsdax support for XFS
>> for the 6.1 LTS because I don't have time to fix it.
>
> /me shrugs
>
> Backporting fixes (whenever they come along) is a problem for the
> LTS kernel maintainer to deal with, not the upstream maintainer.
>
> IMO, the issue right now is that the DAX maintainers seem to have
> little interest in ensuring that the FSDAX infrastructure actually
> works correctly. If anything, they seem to want to make things
> harder for block based filesystems to use pmem devices and hence
> FSDAX. e.g. the direction of the DAX core away from block interfaces
> that filesystems need for their userspace tools to manage the
> storage.
>
> At what point do we simply say "the experiment failed, FSDAX is
> dead" and remove it from XFS altogether?

I'll hurry up and try my best to solve these problems.


--
Thanks,
Ruan.

>
> Cheers,
>
> Dave.
Darrick J. Wong Oct. 24, 2022, 4:05 a.m. UTC | #30
On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote:
> 在 2022/10/24 6:00, Dave Chinner 写道:
> > On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> >> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> >>> In addition, I don't like your idea about the test change because it will
> >>> make generic/470 become the special test for XFS. Do you know if we can fix
> >>> the issue by changing the test in another way? blkdiscard -z can fix the
> >>> issue because it does zero-fill rather than discard on the block device.
> >>> However, blkdiscard -z will take a lot of time when the block device is
> >>> large.
> >>
> >> Well we /could/ just do that too, but that will suck if you have 2TB of
> >> pmem. ;)
> >>
> >> Maybe as an alternative path we could just create a very small
> >> filesystem on the pmem and then blkdiscard -z it?
> >>
> >> That said -- does persistent memory actually have a future?  Intel
> >> scuttled the entire Optane product, cxl.mem sounds like expansion
> >> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> >> asserts everywhere) and 6.1 (every time I run fstests now I see massive
> >> data corruption).
> >
> > Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> > on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> > don't think that has changed at all - I still see lots of kernel
> > warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> > argument" errors.
> 
> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
> caused by the restrictions which prevent reflink work together with DAX:
> 
> a. fs/xfs/xfs_ioctl.c:1141
> /* Don't allow us to set DAX mode for a reflinked file for now. */
> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
>         return -EINVAL;
> 
> b. fs/xfs/xfs_iops.c:1174
> /* Only supported on non-reflinked files. */
> if (xfs_is_reflink_inode(ip))
>         return false;

Yes...

> These restrictions were removed in "drop experimental warning" patch[1].
>   I think they should be separated from that patch.

...and yes.

> 
> [1]
> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/
> 
> 
> Secondly, how the data corruption happened? Or which case failed?  Could
> you give me more info (such as mkfs options, xfstests configs)?
> 
> >
> > If I turn off reflink, then instead of data corruption I get kernel
> > warnings like this from fsx and fsstress workloads:
> >
> > [415478.558426] ------------[ cut here ]------------
> > [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320
> > [415478.564028] Modules linked in:
> > [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G        W 6.1.0-rc1-dgc+ #1615
> > [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320
> > [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6
> > [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002
> > [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001
> > [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840
> > [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
> > [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58
> > [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000
> > [415478.596983] FS:  00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000
> > [415478.599364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0
> > [415478.602883] Call Trace:
> > [415478.603598]  <TASK>
> > [415478.604229]  dax_fault_iter+0x240/0x600
> > [415478.605410]  dax_iomap_pte_fault+0x19c/0x3d0
> > [415478.606706]  __xfs_filemap_fault+0x1dd/0x2b0
> > [415478.607744]  __do_fault+0x2e/0x1d0
> > [415478.608587]  __handle_mm_fault+0xcec/0x17b0
> > [415478.609593]  handle_mm_fault+0xd0/0x2a0
> > [415478.610517]  exc_page_fault+0x1d9/0x810
> > [415478.611398]  asm_exc_page_fault+0x22/0x30
> > [415478.612311] RIP: 0033:0x7fd71a04b9ba
> > [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9
> > [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206
> > [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610
> > [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5
> > [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640
> > [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699
> > [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001
> > [415478.625600]  </TASK>
> > [415478.626087] ---[ end trace 0000000000000000 ]---
> >
> > Even generic/247 is generating a warning like this from xfs_io,
> > which is a mmap vs DIO racer. Given that DIO doesn't exist for
> > fsdax, this test turns into just a normal write() vs mmap() racer.
> >
> > Given these are the same fsdax infrastructure failures that I
> > reported for 6.0, it is also likely that ext4 is still throwing
> > them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the
> > 6.1 cycle.
> 
> Still working on it...

You'll have to port the entire FALLOC_FL_FUNSHARE code path to fsdax too
-- it uses the page cache to stage the COW, which then confuses fsdax
when it finds and trips over DRAM pages in the mapping.  That eliminates
one of the warnings (to be fair I just EONOTSUPP'd FUNSHARE to make that
path go away) but it still produced massive data corruption.

> >
> >> Frankly at this point I'm tempted just to turn of fsdax support for XFS
> >> for the 6.1 LTS because I don't have time to fix it.
> >
> > /me shrugs
> >
> > Backporting fixes (whenever they come along) is a problem for the
> > LTS kernel maintainer to deal with, not the upstream maintainer.
> >
> > IMO, the issue right now is that the DAX maintainers seem to have
> > little interest in ensuring that the FSDAX infrastructure actually
> > works correctly. If anything, they seem to want to make things
> > harder for block based filesystems to use pmem devices and hence
> > FSDAX. e.g. the direction of the DAX core away from block interfaces
> > that filesystems need for their userspace tools to manage the
> > storage.
> >
> > At what point do we simply say "the experiment failed, FSDAX is
> > dead" and remove it from XFS altogether?

We no longer have any pmem products in our pipeline, so I will just say
that if the corruption problems aren't resolved by the end of 6.1-rcX
I'm hiding fsdax support behind CONFIG_XFS_DEBUG or just turning it off
entirely.  I don't want to burden whoever becomes the 6.1 XFS LTS
maintainer with a slew of fsdax data corruption errors.

> I'll hurry up and try my best to solve these problems.

Ok, thank you. :)

--D

> 
> 
> --
> Thanks,
> Ruan.
> 
> >
> > Cheers,
> >
> > Dave.
Dave Chinner Oct. 24, 2022, 5:31 a.m. UTC | #31
On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote:
> 在 2022/10/24 6:00, Dave Chinner 写道:
> > On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> >> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> >>> In addition, I don't like your idea about the test change because it will
> >>> make generic/470 become the special test for XFS. Do you know if we can fix
> >>> the issue by changing the test in another way? blkdiscard -z can fix the
> >>> issue because it does zero-fill rather than discard on the block device.
> >>> However, blkdiscard -z will take a lot of time when the block device is
> >>> large.
> >>
> >> Well we /could/ just do that too, but that will suck if you have 2TB of
> >> pmem. ;)
> >>
> >> Maybe as an alternative path we could just create a very small
> >> filesystem on the pmem and then blkdiscard -z it?
> >>
> >> That said -- does persistent memory actually have a future?  Intel
> >> scuttled the entire Optane product, cxl.mem sounds like expansion
> >> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> >> asserts everywhere) and 6.1 (every time I run fstests now I see massive
> >> data corruption).
> >
> > Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> > on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> > don't think that has changed at all - I still see lots of kernel
> > warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> > argument" errors.
> 
> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
> caused by the restrictions which prevent reflink work together with DAX:
> 
> a. fs/xfs/xfs_ioctl.c:1141
> /* Don't allow us to set DAX mode for a reflinked file for now. */
> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
>         return -EINVAL;
> 
> b. fs/xfs/xfs_iops.c:1174
> /* Only supported on non-reflinked files. */
> if (xfs_is_reflink_inode(ip))
>         return false;
> 
> These restrictions were removed in "drop experimental warning" patch[1].
>   I think they should be separated from that patch.
> 
> [1]
> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/
> 
> 
> Secondly, how the data corruption happened?

No idea - i"m just reporting that lots of fsx tests failed with data
corruptions. I haven't had time to look at why, I'm still trying to
sort out the fix for a different data corruption...

> Or which case failed?

*lots* of them failed with kernel warnings with reflink turned off:

SECTION       -- xfs_dax_noreflink
=========================
Failures: generic/051 generic/068 generic/075 generic/083
generic/112 generic/127 generic/198 generic/231 generic/247
generic/269 generic/270 generic/340 generic/344 generic/388
generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011
xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538
Failed 26 of 1079 tests

All of those except xfs/073 and generic/471 are failures due to
warnings found in dmesg.

With reflink enabled, I terminated the run after g/075, g/091, g/112
and generic/127 reported fsx data corruptions and g/051, g/068,
g/075 and g/083 had reported kernel warnings in dmesg.

> Could
> you give me more info (such as mkfs options, xfstests configs)?

They are exactly the same as last time I reported these problems.

For the "no reflink" test issues:

mkfs options are "-m reflink=0,rmapbt=1", mount options "-o
dax=always" for both filesytems.  Config output at start of test
run:

SECTION       -- xfs_dax_noreflink
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022
MKFS_OPTIONS  -- -f -m reflink=0,rmapbt=1 /dev/pmem1
MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch

pmem devices are a pair of fake 8GB pmem regions set up by kernel
CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up
- the kernel config is kept minimal for these VMs - and the only
kernel debug option I have turned on for these specific test runs is
CONFIG_XFS_DEBUG=y.

THe only difference between the noreflink and reflink runs is that I
drop the "-m reflink=0" mkfs parameter. Otherwise they are identical
and the errors I reported are from back-to-back fstests runs without
rebooting the VM....

-Dave.
Dan Williams Oct. 24, 2022, 5:12 p.m. UTC | #32
Dave Chinner wrote:
> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> > > In addition, I don't like your idea about the test change because it will
> > > make generic/470 become the special test for XFS. Do you know if we can fix
> > > the issue by changing the test in another way? blkdiscard -z can fix the
> > > issue because it does zero-fill rather than discard on the block device.
> > > However, blkdiscard -z will take a lot of time when the block device is
> > > large.
> > 
> > Well we /could/ just do that too, but that will suck if you have 2TB of
> > pmem. ;)
> > 
> > Maybe as an alternative path we could just create a very small
> > filesystem on the pmem and then blkdiscard -z it?
> > 
> > That said -- does persistent memory actually have a future?  Intel
> > scuttled the entire Optane product, cxl.mem sounds like expansion
> > chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> > asserts everywhere) and 6.1 (every time I run fstests now I see massive
> > data corruption).
> 
> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> don't think that has changed at all - I still see lots of kernel
> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> argument" errors.
> 
> If I turn off reflink, then instead of data corruption I get kernel
> warnings like this from fsx and fsstress workloads:
> 
> [415478.558426] ------------[ cut here ]------------
> [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320
> [415478.564028] Modules linked in:
> [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G        W 6.1.0-rc1-dgc+ #1615
> [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320
> [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6
> [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002
> [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001
> [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840
> [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
> [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58
> [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000
> [415478.596983] FS:  00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000
> [415478.599364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0
> [415478.602883] Call Trace:
> [415478.603598]  <TASK>
> [415478.604229]  dax_fault_iter+0x240/0x600
> [415478.605410]  dax_iomap_pte_fault+0x19c/0x3d0
> [415478.606706]  __xfs_filemap_fault+0x1dd/0x2b0
> [415478.607744]  __do_fault+0x2e/0x1d0
> [415478.608587]  __handle_mm_fault+0xcec/0x17b0
> [415478.609593]  handle_mm_fault+0xd0/0x2a0
> [415478.610517]  exc_page_fault+0x1d9/0x810
> [415478.611398]  asm_exc_page_fault+0x22/0x30
> [415478.612311] RIP: 0033:0x7fd71a04b9ba
> [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9
> [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206
> [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610
> [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5
> [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640
> [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699
> [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001
> [415478.625600]  </TASK>
> [415478.626087] ---[ end trace 0000000000000000 ]---
> 
> Even generic/247 is generating a warning like this from xfs_io,
> which is a mmap vs DIO racer. Given that DIO doesn't exist for
> fsdax, this test turns into just a normal write() vs mmap() racer.
> 
> Given these are the same fsdax infrastructure failures that I
> reported for 6.0, it is also likely that ext4 is still throwing
> them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the
> 6.1 cycle.
> 
> > Frankly at this point I'm tempted just to turn of fsdax support for XFS
> > for the 6.1 LTS because I don't have time to fix it.
> 
> /me shrugs
> 
> Backporting fixes (whenever they come along) is a problem for the
> LTS kernel maintainer to deal with, not the upstream maintainer.
> 
> IMO, the issue right now is that the DAX maintainers seem to have
> little interest in ensuring that the FSDAX infrastructure actually
> works correctly. If anything, they seem to want to make things
> harder for block based filesystems to use pmem devices and hence
> FSDAX. e.g. the direction of the DAX core away from block interfaces
> that filesystems need for their userspace tools to manage the
> storage.
> 
> At what point do we simply say "the experiment failed, FSDAX is
> dead" and remove it from XFS altogether?

A fair question, given the regressions made it all the way into
v6.0-final. In retrospect I made the wrong priority call to focus on dax
page reference counting these past weeks.

When I fired up the dax unit tests on v6.0-rc1 I found basic problems
with the notify failure patches that concerned me that they had never
been tested after the final version was merged [1]. Then the rest of the
development cycle was spent fixing dax reference counting [2]. That was
a longstanding wishlist item from gup and folio developers, but, as I
said, that seems the wrong priority given the lingering regressions. I
will take a look the current dax-xfstests regression backlog. That may
find a need to consider reverting the problematic commits depending on
what is still broken if the fixes are trending towards being invasive.

[1]: https://lore.kernel.org/all/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/

[2]: https://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/
Shiyang Ruan Oct. 25, 2022, 2:26 p.m. UTC | #33
在 2022/10/24 13:31, Dave Chinner 写道:
> On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote:
>> 在 2022/10/24 6:00, Dave Chinner 写道:
>>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
>>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
>>>>> In addition, I don't like your idea about the test change because it will
>>>>> make generic/470 become the special test for XFS. Do you know if we can fix
>>>>> the issue by changing the test in another way? blkdiscard -z can fix the
>>>>> issue because it does zero-fill rather than discard on the block device.
>>>>> However, blkdiscard -z will take a lot of time when the block device is
>>>>> large.
>>>>
>>>> Well we /could/ just do that too, but that will suck if you have 2TB of
>>>> pmem. ;)
>>>>
>>>> Maybe as an alternative path we could just create a very small
>>>> filesystem on the pmem and then blkdiscard -z it?
>>>>
>>>> That said -- does persistent memory actually have a future?  Intel
>>>> scuttled the entire Optane product, cxl.mem sounds like expansion
>>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
>>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive
>>>> data corruption).
>>>
>>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
>>> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
>>> don't think that has changed at all - I still see lots of kernel
>>> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
>>> argument" errors.
>>
>> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
>> caused by the restrictions which prevent reflink work together with DAX:
>>
>> a. fs/xfs/xfs_ioctl.c:1141
>> /* Don't allow us to set DAX mode for a reflinked file for now. */
>> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
>>          return -EINVAL;
>>
>> b. fs/xfs/xfs_iops.c:1174
>> /* Only supported on non-reflinked files. */
>> if (xfs_is_reflink_inode(ip))
>>          return false;
>>
>> These restrictions were removed in "drop experimental warning" patch[1].
>>    I think they should be separated from that patch.
>>
>> [1]
>> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/
>>
>>
>> Secondly, how the data corruption happened?
> 
> No idea - i"m just reporting that lots of fsx tests failed with data
> corruptions. I haven't had time to look at why, I'm still trying to
> sort out the fix for a different data corruption...
> 
>> Or which case failed?
> 
> *lots* of them failed with kernel warnings with reflink turned off:
> 
> SECTION       -- xfs_dax_noreflink
> =========================
> Failures: generic/051 generic/068 generic/075 generic/083
> generic/112 generic/127 generic/198 generic/231 generic/247
> generic/269 generic/270 generic/340 generic/344 generic/388
> generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011
> xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538
> Failed 26 of 1079 tests
> 
> All of those except xfs/073 and generic/471 are failures due to
> warnings found in dmesg.
> 
> With reflink enabled, I terminated the run after g/075, g/091, g/112
> and generic/127 reported fsx data corruptions and g/051, g/068,
> g/075 and g/083 had reported kernel warnings in dmesg.
> 
>> Could
>> you give me more info (such as mkfs options, xfstests configs)?
> 
> They are exactly the same as last time I reported these problems.
> 
> For the "no reflink" test issues:
> 
> mkfs options are "-m reflink=0,rmapbt=1", mount options "-o
> dax=always" for both filesytems.  Config output at start of test
> run:
> 
> SECTION       -- xfs_dax_noreflink
> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022
> MKFS_OPTIONS  -- -f -m reflink=0,rmapbt=1 /dev/pmem1
> MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch
> 
> pmem devices are a pair of fake 8GB pmem regions set up by kernel
> CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up
> - the kernel config is kept minimal for these VMs - and the only
> kernel debug option I have turned on for these specific test runs is
> CONFIG_XFS_DEBUG=y.

Thanks for the detailed info.  But, in my environment (and my 
colleagues', and our real server with DCPMM) these failure cases (you 
mentioned above, in dax+non_reflink mode, with same test options) cannot 
reproduce.

Here's our test environment info:
  - Ruan's env: fedora 36(v6.0-rc1) on kvm,pmem 2x4G:file backended
  - Yang's env: fedora 35(v6.1-rc1) on kvm,pmem 2x1G:memmap=1G!1G,1G!2G
  - Server's  : Ubuntu 20.04(v6.0-rc1) real machine,pmem 2x4G:real DCPMM

(To quickly confirm the difference, I just ran the failed 26 cases you 
mentioned above.)  Except for generic/471 and generic/519, which failed 
even when dax is off, the rest passed.


We don't want fsdax to be truned off.  Right now, I think the most 
important thing is solving the failed cases in dax+non_reflink mode. 
So, firstly, I have to reproduce those failures.  Is there any thing 
wrong with my test environments?  I konw you are using 'memmap=XXG!YYG' to 
simulate pmem.  So, (to Darrick) could you show me your config of dev 
environment and the 'testcloud'(I am guessing it's a server with real 
nvdimm just like ours)?


(I just found I only tested on 4G and smaller pmem device.  I'll try the 
test on 8G pmem)

> 
> THe only difference between the noreflink and reflink runs is that I
> drop the "-m reflink=0" mkfs parameter. Otherwise they are identical
> and the errors I reported are from back-to-back fstests runs without
> rebooting the VM....
> 
> -Dave.


--
Thanks,
Ruan.
Darrick J. Wong Oct. 25, 2022, 5:56 p.m. UTC | #34
On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote:
> 
> 
> 在 2022/10/24 13:31, Dave Chinner 写道:
> > On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote:
> >> 在 2022/10/24 6:00, Dave Chinner 写道:
> >>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> >>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> >>>>> In addition, I don't like your idea about the test change because it will
> >>>>> make generic/470 become the special test for XFS. Do you know if we can fix
> >>>>> the issue by changing the test in another way? blkdiscard -z can fix the
> >>>>> issue because it does zero-fill rather than discard on the block device.
> >>>>> However, blkdiscard -z will take a lot of time when the block device is
> >>>>> large.
> >>>>
> >>>> Well we /could/ just do that too, but that will suck if you have 2TB of
> >>>> pmem. ;)
> >>>>
> >>>> Maybe as an alternative path we could just create a very small
> >>>> filesystem on the pmem and then blkdiscard -z it?
> >>>>
> >>>> That said -- does persistent memory actually have a future?  Intel
> >>>> scuttled the entire Optane product, cxl.mem sounds like expansion
> >>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> >>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive
> >>>> data corruption).
> >>>
> >>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> >>> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> >>> don't think that has changed at all - I still see lots of kernel
> >>> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> >>> argument" errors.
> >>
> >> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
> >> caused by the restrictions which prevent reflink work together with DAX:
> >>
> >> a. fs/xfs/xfs_ioctl.c:1141
> >> /* Don't allow us to set DAX mode for a reflinked file for now. */
> >> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
> >>          return -EINVAL;
> >>
> >> b. fs/xfs/xfs_iops.c:1174
> >> /* Only supported on non-reflinked files. */
> >> if (xfs_is_reflink_inode(ip))
> >>          return false;
> >>
> >> These restrictions were removed in "drop experimental warning" patch[1].
> >>    I think they should be separated from that patch.
> >>
> >> [1]
> >> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/
> >>
> >>
> >> Secondly, how the data corruption happened?
> > 
> > No idea - i"m just reporting that lots of fsx tests failed with data
> > corruptions. I haven't had time to look at why, I'm still trying to
> > sort out the fix for a different data corruption...
> > 
> >> Or which case failed?
> > 
> > *lots* of them failed with kernel warnings with reflink turned off:
> > 
> > SECTION       -- xfs_dax_noreflink
> > =========================
> > Failures: generic/051 generic/068 generic/075 generic/083
> > generic/112 generic/127 generic/198 generic/231 generic/247
> > generic/269 generic/270 generic/340 generic/344 generic/388
> > generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011
> > xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538
> > Failed 26 of 1079 tests
> > 
> > All of those except xfs/073 and generic/471 are failures due to
> > warnings found in dmesg.
> > 
> > With reflink enabled, I terminated the run after g/075, g/091, g/112
> > and generic/127 reported fsx data corruptions and g/051, g/068,
> > g/075 and g/083 had reported kernel warnings in dmesg.
> > 
> >> Could
> >> you give me more info (such as mkfs options, xfstests configs)?
> > 
> > They are exactly the same as last time I reported these problems.
> > 
> > For the "no reflink" test issues:
> > 
> > mkfs options are "-m reflink=0,rmapbt=1", mount options "-o
> > dax=always" for both filesytems.  Config output at start of test
> > run:
> > 
> > SECTION       -- xfs_dax_noreflink
> > FSTYP         -- xfs (debug)
> > PLATFORM      -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022
> > MKFS_OPTIONS  -- -f -m reflink=0,rmapbt=1 /dev/pmem1
> > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch
> > 
> > pmem devices are a pair of fake 8GB pmem regions set up by kernel
> > CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up
> > - the kernel config is kept minimal for these VMs - and the only
> > kernel debug option I have turned on for these specific test runs is
> > CONFIG_XFS_DEBUG=y.
> 
> Thanks for the detailed info.  But, in my environment (and my 
> colleagues', and our real server with DCPMM) these failure cases (you 
> mentioned above, in dax+non_reflink mode, with same test options) cannot 
> reproduce.
> 
> Here's our test environment info:
>   - Ruan's env: fedora 36(v6.0-rc1) on kvm,pmem 2x4G:file backended
>   - Yang's env: fedora 35(v6.1-rc1) on kvm,pmem 2x1G:memmap=1G!1G,1G!2G
>   - Server's  : Ubuntu 20.04(v6.0-rc1) real machine,pmem 2x4G:real DCPMM
> 
> (To quickly confirm the difference, I just ran the failed 26 cases you 
> mentioned above.)  Except for generic/471 and generic/519, which failed 
> even when dax is off, the rest passed.
> 
> 
> We don't want fsdax to be truned off.  Right now, I think the most 
> important thing is solving the failed cases in dax+non_reflink mode. 
> So, firstly, I have to reproduce those failures.  Is there any thing 
> wrong with my test environments?  I konw you are using 'memmap=XXG!YYG' to 
> simulate pmem.  So, (to Darrick) could you show me your config of dev 
> environment and the 'testcloud'(I am guessing it's a server with real 
> nvdimm just like ours)?

Nope.  Since the announcement of pmem as a product, I have had 15
minutes of acces to one preproduction prototype server with actual
optane DIMMs in them.

I have /never/ had access to real hardware to test any of this, so it's
all configured via libvirt to simulate pmem in qemu:
https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/

/run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem:

$ grep mtrdisk /proc/mounts
none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0

$ ls -la /run/mtrdisk/[gh].mem
-rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem
-rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem

--D

> 
> 
> (I just found I only tested on 4G and smaller pmem device.  I'll try the 
> test on 8G pmem)
> 
> > 
> > THe only difference between the noreflink and reflink runs is that I
> > drop the "-m reflink=0" mkfs parameter. Otherwise they are identical
> > and the errors I reported are from back-to-back fstests runs without
> > rebooting the VM....
> > 
> > -Dave.
> 
> 
> --
> Thanks,
> Ruan.
Darrick J. Wong Oct. 27, 2022, 9:08 p.m. UTC | #35
[add tytso to cc since he asked about "How do you actually /get/ fsdax
mode these days?" this morning]

On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote:
> > 
> > 
> > 在 2022/10/24 13:31, Dave Chinner 写道:
> > > On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote:
> > >> 在 2022/10/24 6:00, Dave Chinner 写道:
> > >>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> > >>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> > >>>>> In addition, I don't like your idea about the test change because it will
> > >>>>> make generic/470 become the special test for XFS. Do you know if we can fix
> > >>>>> the issue by changing the test in another way? blkdiscard -z can fix the
> > >>>>> issue because it does zero-fill rather than discard on the block device.
> > >>>>> However, blkdiscard -z will take a lot of time when the block device is
> > >>>>> large.
> > >>>>
> > >>>> Well we /could/ just do that too, but that will suck if you have 2TB of
> > >>>> pmem. ;)
> > >>>>
> > >>>> Maybe as an alternative path we could just create a very small
> > >>>> filesystem on the pmem and then blkdiscard -z it?
> > >>>>
> > >>>> That said -- does persistent memory actually have a future?  Intel
> > >>>> scuttled the entire Optane product, cxl.mem sounds like expansion
> > >>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> > >>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive
> > >>>> data corruption).
> > >>>
> > >>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> > >>> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> > >>> don't think that has changed at all - I still see lots of kernel
> > >>> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> > >>> argument" errors.
> > >>
> > >> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
> > >> caused by the restrictions which prevent reflink work together with DAX:
> > >>
> > >> a. fs/xfs/xfs_ioctl.c:1141
> > >> /* Don't allow us to set DAX mode for a reflinked file for now. */
> > >> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
> > >>          return -EINVAL;
> > >>
> > >> b. fs/xfs/xfs_iops.c:1174
> > >> /* Only supported on non-reflinked files. */
> > >> if (xfs_is_reflink_inode(ip))
> > >>          return false;
> > >>
> > >> These restrictions were removed in "drop experimental warning" patch[1].
> > >>    I think they should be separated from that patch.
> > >>
> > >> [1]
> > >> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/
> > >>
> > >>
> > >> Secondly, how the data corruption happened?
> > > 
> > > No idea - i"m just reporting that lots of fsx tests failed with data
> > > corruptions. I haven't had time to look at why, I'm still trying to
> > > sort out the fix for a different data corruption...
> > > 
> > >> Or which case failed?
> > > 
> > > *lots* of them failed with kernel warnings with reflink turned off:
> > > 
> > > SECTION       -- xfs_dax_noreflink
> > > =========================
> > > Failures: generic/051 generic/068 generic/075 generic/083
> > > generic/112 generic/127 generic/198 generic/231 generic/247
> > > generic/269 generic/270 generic/340 generic/344 generic/388
> > > generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011
> > > xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538
> > > Failed 26 of 1079 tests
> > > 
> > > All of those except xfs/073 and generic/471 are failures due to
> > > warnings found in dmesg.
> > > 
> > > With reflink enabled, I terminated the run after g/075, g/091, g/112
> > > and generic/127 reported fsx data corruptions and g/051, g/068,
> > > g/075 and g/083 had reported kernel warnings in dmesg.
> > > 
> > >> Could
> > >> you give me more info (such as mkfs options, xfstests configs)?
> > > 
> > > They are exactly the same as last time I reported these problems.
> > > 
> > > For the "no reflink" test issues:
> > > 
> > > mkfs options are "-m reflink=0,rmapbt=1", mount options "-o
> > > dax=always" for both filesytems.  Config output at start of test
> > > run:
> > > 
> > > SECTION       -- xfs_dax_noreflink
> > > FSTYP         -- xfs (debug)
> > > PLATFORM      -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022
> > > MKFS_OPTIONS  -- -f -m reflink=0,rmapbt=1 /dev/pmem1
> > > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch
> > > 
> > > pmem devices are a pair of fake 8GB pmem regions set up by kernel
> > > CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up
> > > - the kernel config is kept minimal for these VMs - and the only
> > > kernel debug option I have turned on for these specific test runs is
> > > CONFIG_XFS_DEBUG=y.
> > 
> > Thanks for the detailed info.  But, in my environment (and my 
> > colleagues', and our real server with DCPMM) these failure cases (you 
> > mentioned above, in dax+non_reflink mode, with same test options) cannot 
> > reproduce.
> > 
> > Here's our test environment info:
> >   - Ruan's env: fedora 36(v6.0-rc1) on kvm,pmem 2x4G:file backended
> >   - Yang's env: fedora 35(v6.1-rc1) on kvm,pmem 2x1G:memmap=1G!1G,1G!2G
> >   - Server's  : Ubuntu 20.04(v6.0-rc1) real machine,pmem 2x4G:real DCPMM
> > 
> > (To quickly confirm the difference, I just ran the failed 26 cases you 
> > mentioned above.)  Except for generic/471 and generic/519, which failed 
> > even when dax is off, the rest passed.
> > 
> > 
> > We don't want fsdax to be truned off.  Right now, I think the most 
> > important thing is solving the failed cases in dax+non_reflink mode. 
> > So, firstly, I have to reproduce those failures.  Is there any thing 
> > wrong with my test environments?  I konw you are using 'memmap=XXG!YYG' to 
> > simulate pmem.  So, (to Darrick) could you show me your config of dev 
> > environment and the 'testcloud'(I am guessing it's a server with real 
> > nvdimm just like ours)?
> 
> Nope.  Since the announcement of pmem as a product, I have had 15
> minutes of acces to one preproduction prototype server with actual
> optane DIMMs in them.
> 
> I have /never/ had access to real hardware to test any of this, so it's
> all configured via libvirt to simulate pmem in qemu:
> https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/
> 
> /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem:
> 
> $ grep mtrdisk /proc/mounts
> none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0
> 
> $ ls -la /run/mtrdisk/[gh].mem
> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem
> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem

Also forgot to mention that the VM with the fake pmem attached has a
script to do:

ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f

Every time the pmem device gets recreated, because apparently that's the
only way to get S_DAX mode nowadays?

--D

> --D
> 
> > 
> > 
> > (I just found I only tested on 4G and smaller pmem device.  I'll try the 
> > test on 8G pmem)
> > 
> > > 
> > > THe only difference between the noreflink and reflink runs is that I
> > > drop the "-m reflink=0" mkfs parameter. Otherwise they are identical
> > > and the errors I reported are from back-to-back fstests runs without
> > > rebooting the VM....
> > > 
> > > -Dave.
> > 
> > 
> > --
> > Thanks,
> > Ruan.
Dan Williams Oct. 28, 2022, 1:37 a.m. UTC | #36
Darrick J. Wong wrote:
> [add tytso to cc since he asked about "How do you actually /get/ fsdax
> mode these days?" this morning]
> 
> On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote:
> > > 
> > > 
> > > 在 2022/10/24 13:31, Dave Chinner 写道:
> > > > On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote:
> > > >> 在 2022/10/24 6:00, Dave Chinner 写道:
> > > >>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> > > >>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> > > >>>>> In addition, I don't like your idea about the test change because it will
> > > >>>>> make generic/470 become the special test for XFS. Do you know if we can fix
> > > >>>>> the issue by changing the test in another way? blkdiscard -z can fix the
> > > >>>>> issue because it does zero-fill rather than discard on the block device.
> > > >>>>> However, blkdiscard -z will take a lot of time when the block device is
> > > >>>>> large.
> > > >>>>
> > > >>>> Well we /could/ just do that too, but that will suck if you have 2TB of
> > > >>>> pmem. ;)
> > > >>>>
> > > >>>> Maybe as an alternative path we could just create a very small
> > > >>>> filesystem on the pmem and then blkdiscard -z it?
> > > >>>>
> > > >>>> That said -- does persistent memory actually have a future?  Intel
> > > >>>> scuttled the entire Optane product, cxl.mem sounds like expansion
> > > >>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> > > >>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive
> > > >>>> data corruption).
> > > >>>
> > > >>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> > > >>> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> > > >>> don't think that has changed at all - I still see lots of kernel
> > > >>> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> > > >>> argument" errors.
> > > >>
> > > >> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
> > > >> caused by the restrictions which prevent reflink work together with DAX:
> > > >>
> > > >> a. fs/xfs/xfs_ioctl.c:1141
> > > >> /* Don't allow us to set DAX mode for a reflinked file for now. */
> > > >> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
> > > >>          return -EINVAL;
> > > >>
> > > >> b. fs/xfs/xfs_iops.c:1174
> > > >> /* Only supported on non-reflinked files. */
> > > >> if (xfs_is_reflink_inode(ip))
> > > >>          return false;
> > > >>
> > > >> These restrictions were removed in "drop experimental warning" patch[1].
> > > >>    I think they should be separated from that patch.
> > > >>
> > > >> [1]
> > > >> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/
> > > >>
> > > >>
> > > >> Secondly, how the data corruption happened?
> > > > 
> > > > No idea - i"m just reporting that lots of fsx tests failed with data
> > > > corruptions. I haven't had time to look at why, I'm still trying to
> > > > sort out the fix for a different data corruption...
> > > > 
> > > >> Or which case failed?
> > > > 
> > > > *lots* of them failed with kernel warnings with reflink turned off:
> > > > 
> > > > SECTION       -- xfs_dax_noreflink
> > > > =========================
> > > > Failures: generic/051 generic/068 generic/075 generic/083
> > > > generic/112 generic/127 generic/198 generic/231 generic/247
> > > > generic/269 generic/270 generic/340 generic/344 generic/388
> > > > generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011
> > > > xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538
> > > > Failed 26 of 1079 tests
> > > > 
> > > > All of those except xfs/073 and generic/471 are failures due to
> > > > warnings found in dmesg.
> > > > 
> > > > With reflink enabled, I terminated the run after g/075, g/091, g/112
> > > > and generic/127 reported fsx data corruptions and g/051, g/068,
> > > > g/075 and g/083 had reported kernel warnings in dmesg.
> > > > 
> > > >> Could
> > > >> you give me more info (such as mkfs options, xfstests configs)?
> > > > 
> > > > They are exactly the same as last time I reported these problems.
> > > > 
> > > > For the "no reflink" test issues:
> > > > 
> > > > mkfs options are "-m reflink=0,rmapbt=1", mount options "-o
> > > > dax=always" for both filesytems.  Config output at start of test
> > > > run:
> > > > 
> > > > SECTION       -- xfs_dax_noreflink
> > > > FSTYP         -- xfs (debug)
> > > > PLATFORM      -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022
> > > > MKFS_OPTIONS  -- -f -m reflink=0,rmapbt=1 /dev/pmem1
> > > > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch
> > > > 
> > > > pmem devices are a pair of fake 8GB pmem regions set up by kernel
> > > > CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up
> > > > - the kernel config is kept minimal for these VMs - and the only
> > > > kernel debug option I have turned on for these specific test runs is
> > > > CONFIG_XFS_DEBUG=y.
> > > 
> > > Thanks for the detailed info.  But, in my environment (and my 
> > > colleagues', and our real server with DCPMM) these failure cases (you 
> > > mentioned above, in dax+non_reflink mode, with same test options) cannot 
> > > reproduce.
> > > 
> > > Here's our test environment info:
> > >   - Ruan's env: fedora 36(v6.0-rc1) on kvm,pmem 2x4G:file backended
> > >   - Yang's env: fedora 35(v6.1-rc1) on kvm,pmem 2x1G:memmap=1G!1G,1G!2G
> > >   - Server's  : Ubuntu 20.04(v6.0-rc1) real machine,pmem 2x4G:real DCPMM
> > > 
> > > (To quickly confirm the difference, I just ran the failed 26 cases you 
> > > mentioned above.)  Except for generic/471 and generic/519, which failed 
> > > even when dax is off, the rest passed.
> > > 
> > > 
> > > We don't want fsdax to be truned off.  Right now, I think the most 
> > > important thing is solving the failed cases in dax+non_reflink mode. 
> > > So, firstly, I have to reproduce those failures.  Is there any thing 
> > > wrong with my test environments?  I konw you are using 'memmap=XXG!YYG' to 
> > > simulate pmem.  So, (to Darrick) could you show me your config of dev 
> > > environment and the 'testcloud'(I am guessing it's a server with real 
> > > nvdimm just like ours)?
> > 
> > Nope.  Since the announcement of pmem as a product, I have had 15
> > minutes of acces to one preproduction prototype server with actual
> > optane DIMMs in them.
> > 
> > I have /never/ had access to real hardware to test any of this, so it's
> > all configured via libvirt to simulate pmem in qemu:
> > https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/
> > 
> > /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem:
> > 
> > $ grep mtrdisk /proc/mounts
> > none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0
> > 
> > $ ls -la /run/mtrdisk/[gh].mem
> > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem
> > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem
> 
> Also forgot to mention that the VM with the fake pmem attached has a
> script to do:
> 
> ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
> ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f
> 
> Every time the pmem device gets recreated, because apparently that's the
> only way to get S_DAX mode nowadays?

If you have noticed a change here it is due to VM configuration not
anything in the driver.

If you are interested there are two ways to get pmem declared the legacy
way that predates any of the DAX work, the kernel calls it E820_PRAM,
and the modern way by platform firmware tables like ACPI NFIT. The
assumption with E820_PRAM is that it is dealing with battery backed
NVDIMMs of small capacity. In that case the /dev/pmem device can support
DAX operation by default because the necessary memory for the 'struct
page' array for that memory is likely small.

Platform firmware defined PMEM can be terabytes. So the driver does not
enable DAX by default because the user needs to make policy choice about
burning gigabytes of DRAM for that metadata, or placing it in PMEM which
is abundant, but slower. So what I suspect might be happening is your
configuration changed from something that auto-allocated the 'struct
page' array, to something that needed those commands you list above to
explicitly opt-in to reserving some PMEM capacity for the page metadata.
Shiyang Ruan Oct. 30, 2022, 9:31 a.m. UTC | #37
在 2022/10/28 9:37, Dan Williams 写道:
> Darrick J. Wong wrote:
>> [add tytso to cc since he asked about "How do you actually /get/ fsdax
>> mode these days?" this morning]
>>
>> On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote:
>>> On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote:

...skip...

>>>
>>> Nope.  Since the announcement of pmem as a product, I have had 15
>>> minutes of acces to one preproduction prototype server with actual
>>> optane DIMMs in them.
>>>
>>> I have /never/ had access to real hardware to test any of this, so it's
>>> all configured via libvirt to simulate pmem in qemu:
>>> https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/
>>>
>>> /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem:
>>>
>>> $ grep mtrdisk /proc/mounts
>>> none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0
>>>
>>> $ ls -la /run/mtrdisk/[gh].mem
>>> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem
>>> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem
>>
>> Also forgot to mention that the VM with the fake pmem attached has a
>> script to do:
>>
>> ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
>> ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f
>>
>> Every time the pmem device gets recreated, because apparently that's the
>> only way to get S_DAX mode nowadays?
> 
> If you have noticed a change here it is due to VM configuration not
> anything in the driver.
> 
> If you are interested there are two ways to get pmem declared the legacy
> way that predates any of the DAX work, the kernel calls it E820_PRAM,
> and the modern way by platform firmware tables like ACPI NFIT. The
> assumption with E820_PRAM is that it is dealing with battery backed
> NVDIMMs of small capacity. In that case the /dev/pmem device can support
> DAX operation by default because the necessary memory for the 'struct
> page' array for that memory is likely small.
> 
> Platform firmware defined PMEM can be terabytes. So the driver does not
> enable DAX by default because the user needs to make policy choice about
> burning gigabytes of DRAM for that metadata, or placing it in PMEM which
> is abundant, but slower. So what I suspect might be happening is your
> configuration changed from something that auto-allocated the 'struct
> page' array, to something that needed those commands you list above to
> explicitly opt-in to reserving some PMEM capacity for the page metadata.

I am using the same simulation environment as Darrick's and Dave's and 
have tested many times, but still cannot reproduce the failed cases they 
mentioned (dax+non_reflink mode, currently focuing) until now. Only a 
few cases randomly failed because of "target is busy". But IIRC, those 
failed cases you mentioned were failed with dmesg warning around the 
function "dax_associate_entry()" or "dax_disassociate_entry()". Since I 
cannot reproduce the failure, it hard for me to continue sovling the 
problem.

And how is your recent test?  Still failed with those dmesg warnings? 
If so, could you zip the test result and send it to me?


--
Thanks,
Ruan
Darrick J. Wong Nov. 2, 2022, 12:45 a.m. UTC | #38
On Sun, Oct 30, 2022 at 05:31:43PM +0800, Shiyang Ruan wrote:
> 
> 
> 在 2022/10/28 9:37, Dan Williams 写道:
> > Darrick J. Wong wrote:
> > > [add tytso to cc since he asked about "How do you actually /get/ fsdax
> > > mode these days?" this morning]
> > > 
> > > On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote:
> 
> ...skip...
> 
> > > > 
> > > > Nope.  Since the announcement of pmem as a product, I have had 15
> > > > minutes of acces to one preproduction prototype server with actual
> > > > optane DIMMs in them.
> > > > 
> > > > I have /never/ had access to real hardware to test any of this, so it's
> > > > all configured via libvirt to simulate pmem in qemu:
> > > > https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/
> > > > 
> > > > /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem:
> > > > 
> > > > $ grep mtrdisk /proc/mounts
> > > > none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0
> > > > 
> > > > $ ls -la /run/mtrdisk/[gh].mem
> > > > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem
> > > > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem
> > > 
> > > Also forgot to mention that the VM with the fake pmem attached has a
> > > script to do:
> > > 
> > > ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
> > > ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f
> > > 
> > > Every time the pmem device gets recreated, because apparently that's the
> > > only way to get S_DAX mode nowadays?
> > 
> > If you have noticed a change here it is due to VM configuration not
> > anything in the driver.
> > 
> > If you are interested there are two ways to get pmem declared the legacy
> > way that predates any of the DAX work, the kernel calls it E820_PRAM,
> > and the modern way by platform firmware tables like ACPI NFIT. The
> > assumption with E820_PRAM is that it is dealing with battery backed
> > NVDIMMs of small capacity. In that case the /dev/pmem device can support
> > DAX operation by default because the necessary memory for the 'struct
> > page' array for that memory is likely small.
> > 
> > Platform firmware defined PMEM can be terabytes. So the driver does not
> > enable DAX by default because the user needs to make policy choice about
> > burning gigabytes of DRAM for that metadata, or placing it in PMEM which
> > is abundant, but slower. So what I suspect might be happening is your
> > configuration changed from something that auto-allocated the 'struct
> > page' array, to something that needed those commands you list above to
> > explicitly opt-in to reserving some PMEM capacity for the page metadata.
> 
> I am using the same simulation environment as Darrick's and Dave's and have
> tested many times, but still cannot reproduce the failed cases they
> mentioned (dax+non_reflink mode, currently focuing) until now. Only a few
> cases randomly failed because of "target is busy". But IIRC, those failed
> cases you mentioned were failed with dmesg warning around the function
> "dax_associate_entry()" or "dax_disassociate_entry()". Since I cannot
> reproduce the failure, it hard for me to continue sovling the problem.

FWIW things have calmed down as of 6.1-rc3 -- if I disable reflink,
fstests runs without complaint.  Now it only seems to be affecting
reflink=1 filesystems.

> And how is your recent test?  Still failed with those dmesg warnings? If so,
> could you zip the test result and send it to me?

https://djwong.org/docs/kernel/daxbad.zip

--D

> 
> 
> --
> Thanks,
> Ruan
Shiyang Ruan Nov. 2, 2022, 5:17 a.m. UTC | #39
在 2022/11/2 8:45, Darrick J. Wong 写道:
> On Sun, Oct 30, 2022 at 05:31:43PM +0800, Shiyang Ruan wrote:
>>
>>
>> 在 2022/10/28 9:37, Dan Williams 写道:
>>> Darrick J. Wong wrote:
>>>> [add tytso to cc since he asked about "How do you actually /get/ fsdax
>>>> mode these days?" this morning]
>>>>
>>>> On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote:
>>>>> On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote:
>>
>> ...skip...
>>
>>>>>
>>>>> Nope.  Since the announcement of pmem as a product, I have had 15
>>>>> minutes of acces to one preproduction prototype server with actual
>>>>> optane DIMMs in them.
>>>>>
>>>>> I have /never/ had access to real hardware to test any of this, so it's
>>>>> all configured via libvirt to simulate pmem in qemu:
>>>>> https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/
>>>>>
>>>>> /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem:
>>>>>
>>>>> $ grep mtrdisk /proc/mounts
>>>>> none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0
>>>>>
>>>>> $ ls -la /run/mtrdisk/[gh].mem
>>>>> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem
>>>>> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem
>>>>
>>>> Also forgot to mention that the VM with the fake pmem attached has a
>>>> script to do:
>>>>
>>>> ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
>>>> ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f
>>>>
>>>> Every time the pmem device gets recreated, because apparently that's the
>>>> only way to get S_DAX mode nowadays?
>>>
>>> If you have noticed a change here it is due to VM configuration not
>>> anything in the driver.
>>>
>>> If you are interested there are two ways to get pmem declared the legacy
>>> way that predates any of the DAX work, the kernel calls it E820_PRAM,
>>> and the modern way by platform firmware tables like ACPI NFIT. The
>>> assumption with E820_PRAM is that it is dealing with battery backed
>>> NVDIMMs of small capacity. In that case the /dev/pmem device can support
>>> DAX operation by default because the necessary memory for the 'struct
>>> page' array for that memory is likely small.
>>>
>>> Platform firmware defined PMEM can be terabytes. So the driver does not
>>> enable DAX by default because the user needs to make policy choice about
>>> burning gigabytes of DRAM for that metadata, or placing it in PMEM which
>>> is abundant, but slower. So what I suspect might be happening is your
>>> configuration changed from something that auto-allocated the 'struct
>>> page' array, to something that needed those commands you list above to
>>> explicitly opt-in to reserving some PMEM capacity for the page metadata.
>>
>> I am using the same simulation environment as Darrick's and Dave's and have
>> tested many times, but still cannot reproduce the failed cases they
>> mentioned (dax+non_reflink mode, currently focuing) until now. Only a few
>> cases randomly failed because of "target is busy". But IIRC, those failed
>> cases you mentioned were failed with dmesg warning around the function
>> "dax_associate_entry()" or "dax_disassociate_entry()". Since I cannot
>> reproduce the failure, it hard for me to continue sovling the problem.
> 
> FWIW things have calmed down as of 6.1-rc3 -- if I disable reflink,
> fstests runs without complaint.  Now it only seems to be affecting
> reflink=1 filesystems. >
>> And how is your recent test?  Still failed with those dmesg warnings? If so,
>> could you zip the test result and send it to me?
> 
> https://djwong.org/docs/kernel/daxbad.zip

Thanks for your info!


(To Dave) I need your recent test result too.  If cases won't fail when 
reflink disabled, I'll focusing on solving the warning when reflink enabled.


--
Thanks,
Ruan.

> 
> --D
> 
>>
>>
>> --
>> Thanks,
>> Ruan
Dave Chinner Nov. 3, 2022, 2:32 a.m. UTC | #40
On Wed, Nov 02, 2022 at 05:17:18AM +0000, ruansy.fnst@fujitsu.com wrote:
> 
> 在 2022/11/2 8:45, Darrick J. Wong 写道:
> > On Sun, Oct 30, 2022 at 05:31:43PM +0800, Shiyang Ruan wrote:
> > FWIW things have calmed down as of 6.1-rc3 -- if I disable reflink,
> > fstests runs without complaint.  Now it only seems to be affecting
> > reflink=1 filesystems. >
> >> And how is your recent test?  Still failed with those dmesg warnings? If so,
> >> could you zip the test result and send it to me?
> > 
> > https://djwong.org/docs/kernel/daxbad.zip
> 
> Thanks for your info!
> 
> (To Dave) I need your recent test result too.  If cases won't fail when 
> reflink disabled, I'll focusing on solving the warning when reflink enabled.

My first run on 6.1-rc3 with reflink disabled was clean. Then I ran
a few tests with reflink enabled, and they all failed as expected.
Then I ran the no-reflink tests again, and then they all failed,
too. Nothing changed between test configs, I didn't even reboot the
test machine:

$ history |tail -5
500  sudo ./run_check.sh --mkfs-opts "-m rmapbt=1,reflink=0" --run-opts "-s xfs_dax xfs/55[12]"
501  sudo ./run_check.sh --mkfs-opts "-m rmapbt=1,reflink=0"
  --run-opts "-s xfs_dax_noreflink generic/051 generic/068
  generic/074 generic/075 generic/083 generic/112 generic/231
  generic/232 generic/269 generic/270 generic/340 generic/388
  generic/461 generic/471 generic/476 generic/519 generic/560
  generic/561 generic/617 generic/650 generic/656 xfs/011 xfs/013
  xfs/017 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538"
502  sudo ./run_check.sh --mkfs-opts "-m rmapbt=1" --run-opts
    "-s xfs_dax generic/051 generic/068 generic/074 generic/075
    generic/083 generic/112 generic/231 generic/232 generic/269
    generic/270 generic/340 generic/388 generic/461 generic/471
    generic/476 generic/519 generic/560 generic/561 generic/617
    generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297
    xfs/305 xfs/517 xfs/538"
503  sudo ./run_check.sh --mkfs-opts "-m rmapbt=1,reflink=0"
      --run-opts "-s xfs_dax_noreflink generic/051 generic/068
      generic/074 generic/075 generic/083 generic/112 generic/231
      generic/232 generic/269 generic/270 generic/340 generic/388
      generic/461 generic/471 generic/476 generic/519 generic/560
      generic/561 generic/617 generic/650 generic/656 xfs/011
      xfs/013 xfs/017 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538"
504  history |tail -5
$

The first noreflink run:

SECTION       -- xfs_dax_noreflink
=========================
Failures: generic/471 generic/519 xfs/073
Failed 3 of 29 tests

Which are typical failures for this config.

The first reflink enabled run I killed almost immediately as it
threw multiple warnings in the first couple of tests:

Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_dax generic/051 generic/068 generic/074 generic/075 generic/083 generic/112 generic/231 generic/232 generic/269 generic/270 generic/340 generic/388 generic/461 generic/471 generic/476 generic/519 generic/560 generic/561 generic/617 generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538
SECTION       -- xfs_dax
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test3 6.1.0-rc3-dgc+ #1649 SMP PREEMPT_DYNAMIC Wed Nov  2 07:58:17 AEDT 2022
MKFS_OPTIONS  -- -f -m rmapbt=1 /dev/pmem1
MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch

generic/051 79s ... _check_dmesg: something found in dmesg (see /home/dave/src/xfstests-dev/results//xfs_dax/generic/051.dmesg)

generic/068 43s ... ^C
real    1m46.278s
user    0m17.465s
sys     2m9.981s
$

And then I ran the noreflink tests again to make sure the first run
wasn't a fluke:

SECTION       -- xfs_dax_noreflink
=========================
Failures: generic/051 generic/068 generic/231 generic/269
generic/270 generic/340 generic/388 generic/461 generic/471
generic/476 generic/519 generic/560 generic/561 xfs/011 xfs/013
xfs/073 xfs/297 xfs/305 xfs/517 xfs/538
Failed 20 of 29 tests

It was a fluke - most of the tests failed this time with dax
mapping warnings. dmesg from the entire set of test runs is
attached.

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8495ef076ffc..a3c221841fa6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -348,8 +348,10 @@  xfs_setup_dax_always(
 		goto disable_dax;
 	}
 
-	if (xfs_has_reflink(mp)) {
-		xfs_alert(mp, "DAX and reflink cannot be used together!");
+	if (xfs_has_reflink(mp) &&
+	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
+		xfs_alert(mp,
+			"DAX and reflink cannot work with multi-partitions!");
 		return -EINVAL;
 	}