Message ID | 20220609143435.393724-1-ruansy.fnst@fujitsu.com |
---|---|
State | Accepted |
Commit | 35fcd75af3edf035638e632bb49607cc8fc3cdf4 |
Series | xfs: fail dax mount if reflink is enabled on a partition |
On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> Failure notification is not supported on partitions. So, when we mount
> a reflink enabled xfs on a partition with dax option, let it fail with
> -EINVAL code.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> Failure notification is not supported on partitions. So, when we mount
> a reflink enabled xfs on a partition with dax option, let it fail with
> -EINVAL code.
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>

Looks good to me, though I think this patch applies to ... wherever all
those rmap+reflink+dax patches went.  I think that's akpm's tree, right?

Ideally this would go in through there to keep the pieces together, but
I don't mind tossing this in at the end of the 5.20 merge window if akpm
is unwilling.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_super.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 8495ef076ffc..a3c221841fa6 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>  		goto disable_dax;
>  	}
>
> -	if (xfs_has_reflink(mp)) {
> -		xfs_alert(mp, "DAX and reflink cannot be used together!");
> +	if (xfs_has_reflink(mp) &&
> +	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> +		xfs_alert(mp,
> +			"DAX and reflink cannot work with multi-partitions!");
>  		return -EINVAL;
>  	}
>
> --
> 2.36.1
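[Editor's note: to make the semantics of the hunk concrete, here is a toy model of the new check in plain Python — not the kernel code. `setup_dax_always` and its boolean arguments are illustrative stand-ins for `xfs_setup_dax_always()`, `xfs_has_reflink(mp)` and `bdev_is_partition(...)`. The point of the patch is that reflink + DAX is now rejected only on a partition, because failure notification cannot be routed through partitions.]

```python
from errno import EINVAL

def setup_dax_always(has_reflink: bool, is_partition: bool) -> int:
    """Toy model of the mount-time check added by this patch.

    Returns 0 on success, or -EINVAL when reflink is enabled on a
    partition, mirroring the xfs_setup_dax_always() hunk quoted above.
    """
    if has_reflink and is_partition:
        # kernel: xfs_alert(mp, "DAX and reflink cannot work with multi-partitions!");
        return -EINVAL
    return 0

# Before the patch, reflink + DAX failed unconditionally; after it,
# only the partition case is rejected.
assert setup_dax_always(has_reflink=True,  is_partition=True)  == -EINVAL
assert setup_dax_always(has_reflink=True,  is_partition=False) == 0
assert setup_dax_always(has_reflink=False, is_partition=True)  == 0
```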
On 2022/7/1 8:31, Darrick J. Wong wrote:
> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>> Failure notification is not supported on partitions. So, when we mount
>> a reflink enabled xfs on a partition with dax option, let it fail with
>> -EINVAL code.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>
> Looks good to me, though I think this patch applies to ... wherever all
> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?

Yes, they are in his tree, still in mm-unstable now.

> Ideally this would go in through there to keep the pieces together, but
> I don't mind tossing this in at the end of the 5.20 merge window if akpm
> is unwilling.

Both are fine with me.  Thanks!

--
Ruan.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> --D
Hi Andrew,

I am not sure whether you have seen this, so here is a ping.

On 2022/7/1 13:14, Shiyang Ruan wrote:
> On 2022/7/1 8:31, Darrick J. Wong wrote:
>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>>> Failure notification is not supported on partitions. So, when we mount
>>> a reflink enabled xfs on a partition with dax option, let it fail with
>>> -EINVAL code.
>>>
>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>
>> Looks good to me, though I think this patch applies to ... wherever all
>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
>
> Yes, they are in his tree, still in mm-unstable now.
>
>> Ideally this would go in through there to keep the pieces together, but
>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
>> is unwilling.

What's your opinion on this?  I found that the rmap+reflink+dax patches
have been merged into mm-stable, so maybe it's a bit late to merge this
one.  (I'm not pushing, just trying to understand the situation.)

--
Thanks,
Ruan.

> Both are fine with me.  Thanks!
>
>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>
>> --D
On 2022/7/1 8:31, Darrick J. Wong wrote:
> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
>> Failure notification is not supported on partitions. So, when we mount
>> a reflink enabled xfs on a partition with dax option, let it fail with
>> -EINVAL code.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>
> Looks good to me, though I think this patch applies to ... wherever all
> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
>
> Ideally this would go in through there to keep the pieces together, but
> I don't mind tossing this in at the end of the 5.20 merge window if akpm
> is unwilling.

BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
waiting to be merged, is it time to think about "removing the
experimental tag" again?  :)

--
Thanks,
Ruan.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> --D
On Thu, Jul 21, 2022 at 02:06:10PM +0000, ruansy.fnst@fujitsu.com wrote:
> On 2022/7/1 8:31, Darrick J. Wong wrote:
>> Ideally this would go in through there to keep the pieces together, but
>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
>> is unwilling.
>
> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
> waiting to be merged, is it time to think about "removing the
> experimental tag" again?  :)

It's probably time to take up that question again.

Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
and at least one of those dm layers no longer allows fsdax pass-through,
so XFS silently turned mount -o dax into -o dax=never.  :(

I'm not sure how to fix that...

--D
On 2022/7/22 0:16, Darrick J. Wong wrote:
> On Thu, Jul 21, 2022 at 02:06:10PM +0000, ruansy.fnst@fujitsu.com wrote:
>> BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) are
>> waiting to be merged, is it time to think about "removing the
>> experimental tag" again?  :)
>
> It's probably time to take up that question again.
>
> Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
> and at least one of those dm layers no longer allows fsdax pass-through,
> so XFS silently turned mount -o dax into -o dax=never.  :(

Hi Darrick,

I tried generic/470 but it didn't run:

  [not run] Cannot use thin-pool devices on DAX capable block devices.

Did you modify _require_dm_target() in common/rc?  I added thin-pool
there to skip the dax capability check:

  case $target in
  stripe|linear|log-writes|thin-pool)   # add thin-pool here
          ;;

Then the case finally ran, and it silently turned off dax as you said.

Are these the correct steps to reproduce?  If so, I will continue to
investigate this problem.

--
Thanks,
Ruan.

> I'm not sure how to fix that...
>
> --D
On Fri, Jul 29, 2022 at 03:55:24AM +0000, ruansy.fnst@fujitsu.com wrote:
> I tried generic/470 but it didn't run:
>
>   [not run] Cannot use thin-pool devices on DAX capable block devices.
>
> Did you modify _require_dm_target() in common/rc?  I added thin-pool
> there to skip the dax capability check:
>
>   case $target in
>   stripe|linear|log-writes|thin-pool)   # add thin-pool here
>           ;;
>
> Then the case finally ran, and it silently turned off dax as you said.
>
> Are these the correct steps to reproduce?  If so, I will continue to
> investigate this problem.

Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
mention that.  I suspect that the removal of dm support for pmem is
going to force us to completely redesign this test.  I can't really
think of how, though, since there's no good way that I know of to gain a
point-in-time snapshot of a pmem device.

--D
On 2022/7/29 12:54, Darrick J. Wong wrote:
> Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
> mention that.  I suspect that the removal of dm support for pmem is
> going to force us to completely redesign this test.  I can't really
> think of how, though, since there's no good way that I know of to gain a
> point-in-time snapshot of a pmem device.

Hi Darrick,

> removal of dm support for pmem

I think here we are talking about xfstests removing the support, not
the kernel?  I found these xfstests commits:

  fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction
  fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool

So, this case has never been able to run since the second commit?  (I
didn't notice the "not run" state; I thought it was expected not to
run.)

And according to the first commit, the restriction was added because
some dm devices don't support dax.  So my understanding is: we should
redesign the case to make it work, and first we should add dax support
for dm devices in the kernel.

In addition, are there any other test cases with the same problem, so
that we can deal with them together?

--
Thanks,
Ruan.

> I'm not sure how to fix that...
>
> --D
On Wed, Aug 03, 2022 at 06:47:24AM +0000, ruansy.fnst@fujitsu.com wrote:
> I think here we are talking about xfstests removing the support, not
> the kernel?  I found these xfstests commits:
>
>   fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction
>   fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool
>
> So, this case has never been able to run since the second commit?  (I
> didn't notice the "not run" state; I thought it was expected not to
> run.)
>
> And according to the first commit, the restriction was added because
> some dm devices don't support dax.  So my understanding is: we should
> redesign the case to make it work, and first we should add dax support
> for dm devices in the kernel.

dm devices used to have fsdax support; I think Christoph is actively
removing (or already has removed) all that support.

> In addition, are there any other test cases with the same problem, so
> that we can deal with them together?

The last I checked, there aren't any that require MAP_SYNC or pmem aside
from g/470 and the three poison notification tests that you sent a few
days ago.

--D
On 2022/8/4 8:51, Darrick J. Wong wrote:
> On Wed, Aug 03, 2022 at 06:47:24AM +0000, ruansy.fnst@fujitsu.com wrote:
>> And according to the first commit, the restriction was added because
>> some dm devices don't support dax.  So my understanding is: we should
>> redesign the case to make it work, and first we should add dax support
>> for dm devices in the kernel.
>
> dm devices used to have fsdax support; I think Christoph is actively
> removing (or already has removed) all that support.
>
>> In addition, are there any other test cases with the same problem, so
>> that we can deal with them together?
>
> The last I checked, there aren't any that require MAP_SYNC or pmem aside
> from g/470 and the three poison notification tests that you sent a few
> days ago.

OK.  Got it.  Thank you!

--
Ruan.

> --D
On 2022/8/4 8:51, Darrick J. Wong wrote:
> On Wed, Aug 03, 2022 at 06:47:24AM +0000, ruansy.fnst@fujitsu.com wrote:

...

>>>>> Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
>>>>> didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
>>>>> and at least one of those dm layers no longer allows fsdax pass-through,
>>>>> so XFS silently turned mount -o dax into -o dax=never.  :(

Hi Darrick, Brian,

I made a little investigation on generic/470.

This case was able to run before thin-pool was introduced[1], but since
then it has been 'Failed'/'Not Run' because thin-pool does not support
DAX.  I checked the history of thin-pool: it has never supported DAX,
and it is not the case that someone removed its fsdax support.  So I
think it's not correct to bypass the requirement check by adding
'thin-pool' to _require_dm_target().

As far as I know, thin-pool was introduced (to provide discard zeroing)
in order to prevent out-of-order replay of dm-log-writes.  Should we
solve the 'out-of-order replay' issue instead of avoiding it with
thin-pool?  @Brian

Besides, since this is not a fsdax problem, I think there is nothing to
fix in fsdax.  I'd like to help get it solved, but I'm still wondering
if we could get back to the original topic ("Remove Experimental Tag")
first?  :)

[1] fc5870da485aec0f9196a0f2bed32f73f6b2c664  generic/470: use thin
    volume for dmlogwrites target device

--
Thanks,
Ruan.
On Thu, Sep 08, 2022 at 09:46:04PM +0800, Shiyang Ruan wrote:
> This case was able to run before thin-pool was introduced[1], but since
> then it has been 'Failed'/'Not Run' because thin-pool does not support
> DAX.  I checked the history of thin-pool: it has never supported DAX,
> and it is not the case that someone removed its fsdax support.  So I
> think it's not correct to bypass the requirement check by adding
> 'thin-pool' to _require_dm_target().
>
> As far as I know, thin-pool was introduced (to provide discard zeroing)
> in order to prevent out-of-order replay of dm-log-writes.  Should we
> solve the 'out-of-order replay' issue instead of avoiding it with
> thin-pool?  @Brian

Yes..  I don't recall all the internals of the tools and test, but IIRC
it relied on discard to perform zeroing between checkpoints or some such
to avoid spurious failures.  The purpose of running on dm-thin was
merely to provide reliable discard zeroing behavior on the target device
and thus to allow the test to run reliably.

Brian

> Besides, since this is not a fsdax problem, I think there is nothing to
> fix in fsdax.  I'd like to help get it solved, but I'm still wondering
> if we could get back to the original topic ("Remove Experimental Tag")
> first?  :)
>
> [1] fc5870da485aec0f9196a0f2bed32f73f6b2c664  generic/470: use thin
>     volume for dmlogwrites target device
>
> --
> Thanks,
> Ruan.
On 2022/9/9 21:01, Brian Foster wrote: > Yes.. I don't recall all the internals of the tools and test, but IIRC > it relied on discard to perform zeroing between checkpoints or some such > and avoid spurious failures. The purpose of running on dm-thin was > merely to provide reliable discard zeroing behavior on the target device > and thus to allow the test to run reliably.

Hi Brian,

As far as I know, generic/470 was originally designed to verify mmap(MAP_SYNC) on a DAX-enabled dm-log-writes device. For this reason, we need to ensure that all underlying devices under the dm-log-writes device support DAX. However, the dm-thin device has never supported DAX, so running generic/470 on a dm-thin device always returns "not run".

Please see the difference between the old and new logic:

old logic                      new logic
---------------------------------------------------------------
log-writes device(DAX)         log-writes device(DAX)
  |                              |
PMEM0(DAX) + PMEM1(DAX)        Thin device(non-DAX) + PMEM1(DAX)
                                 |
                               PMEM0(DAX)
---------------------------------------------------------------

We think the dm-thin device is not a good solution for generic/470. Is there any other solution that supports both discard zeroing and DAX?

BTW, only log-writes, stripe and linear support DAX for now.

Best Regards, Xiao Yang
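[Side note on verifying the stacks above: modern kernels expose a `queue/dax` sysfs attribute for every block device, which reports whether that layer is DAX-capable. A minimal sketch of reading it — the `dax_supported()` helper name and path handling are made up for illustration, this is not fstests code:]

```c
#include <stdio.h>

/*
 * Sketch: report whether a block device advertises DAX support by
 * reading its queue/dax sysfs attribute, e.g.
 * "/sys/block/pmem0/queue/dax".  Returns 1 for DAX-capable, 0 for
 * not, -1 if the attribute cannot be read.
 */
static int dax_supported(const char *attr_path)
{
	FILE *f = fopen(attr_path, "r");
	char buf[8] = "";

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return buf[0] == '1';
}
```

[With the old stack every layer (log-writes, pmem0, pmem1) reads back "1"; with the new stack the thin device reads back "0", which is why XFS silently falls back to dax=never.]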
On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote: > On 2022/9/9 21:01, Brian Foster wrote: >> Yes.. I don't recall all the internals of the tools and test, but IIRC >> it relied on discard to perform zeroing between checkpoints or some such >> and avoid spurious failures. The purpose of running on dm-thin was >> merely to provide reliable discard zeroing behavior on the target device >> and thus to allow the test to run reliably. > Hi Brian, > > As far as I know, generic/470 was original designed to verify > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the > reason, we need to ensure that all underlying devices under > dm-log-writes device support DAX. However dm-thin device never supports > DAX so > running generic/470 with dm-thin device always returns "not run". > > Please see the difference between old and new logic: > > old logic new logic > --------------------------------------------------------------- > log-writes device(DAX) log-writes device(DAX) > | | > PMEM0(DAX) + PMEM1(DAX) Thin device(non-DAX) + PMEM1(DAX) > | > PMEM0(DAX) > --------------------------------------------------------------- > > We think dm-thin device is not a good solution for generic/470, is there > any other solution to support both discard zero and DAX?

Hi Brian,

I have sent a patch[1] to revert your fix because, as the revert patch describes, I think it's not good for generic/470 to use a thin volume: [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u

With the revert, generic/470 always runs successfully in my environment, so I wonder how to reproduce the out-of-order replay issue on an XFS v5 filesystem?

PS: I want to reproduce the issue and try to find a better solution to fix it.

Best Regards, Xiao Yang

> > BTW, only log-writes, stripe and linear support DAX for now.
On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote: > On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote: > > On 2022/9/9 21:01, Brian Foster wrote: > > > Yes.. I don't recall all the internals of the tools and test, but IIRC > > > it relied on discard to perform zeroing between checkpoints or some such > > > and avoid spurious failures. The purpose of running on dm-thin was > > > merely to provide reliable discard zeroing behavior on the target device > > > and thus to allow the test to run reliably. > > Hi Brian, > > > > As far as I know, generic/470 was original designed to verify > > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the > > reason, we need to ensure that all underlying devices under > > dm-log-writes device support DAX. However dm-thin device never supports > > DAX so > > running generic/470 with dm-thin device always returns "not run". > > > > Please see the difference between old and new logic: > > > > old logic new logic > > --------------------------------------------------------------- > > log-writes device(DAX) log-writes device(DAX) > > | | > > PMEM0(DAX) + PMEM1(DAX) Thin device(non-DAX) + PMEM1(DAX) > > | > > PMEM0(DAX) > > --------------------------------------------------------------- > > > > We think dm-thin device is not a good solution for generic/470, is there > > any other solution to support both discard zero and DAX? > > Hi Brian, > > I have sent a patch[1] to revert your fix because I think it's not good for > generic/470 to use thin volume as my revert patch[1] describes: > [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u > I think the history here is that generic/482 was changed over first in commit 65cc9a235919 ("generic/482: use thin volume as data device"), and then sometime later we realized generic/455,457,470 had the same general flaw and were switched over. 
The dm/dax compatibility thing was probably just an oversight, but I am a little curious about that because it should have been obvious that the change caused the test to no longer run. Did something change after that to trigger that change in behavior? > With the revert, generic/470 can always run successfully on my environment > so I wonder how to reproduce the out-of-order replay issue on XFS v5 > filesystem? > I don't quite recall the characteristics of the failures beyond that we were seeing spurious test failures with generic/482 that were due to essentially putting the fs/log back in time in a way that wasn't quite accurate due to the clearing by the logwrites tool not taking place. If you wanted to reproduce in order to revisit that, perhaps start with generic/482 and let it run in a loop for a while and see if it eventually triggers a failure/corruption..? > PS: I want to reproduce the issue and try to find a better solution to fix > it. > It's been a while since I looked at any of this tooling to semi-grok how it works. Perhaps it could learn to rely on something more explicit like zero range (instead of discard?) or fall back to manual zeroing? If the eventual solution is simple and low enough overhead, it might make some sense to replace the dmthin hack across the set of tests mentioned above. Brian > Best Regards, > Xiao Yang > > > > > BTW, only log-writes, stripe and linear support DAX for now. >
On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote: > On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote: > > On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote: > > > On 2022/9/9 21:01, Brian Foster wrote: > > > > Yes.. I don't recall all the internals of the tools and test, but IIRC > > > > it relied on discard to perform zeroing between checkpoints or some such > > > > and avoid spurious failures. The purpose of running on dm-thin was > > > > merely to provide reliable discard zeroing behavior on the target device > > > > and thus to allow the test to run reliably. > > > Hi Brian, > > > > > > As far as I know, generic/470 was original designed to verify > > > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the > > > reason, we need to ensure that all underlying devices under > > > dm-log-writes device support DAX. However dm-thin device never supports > > > DAX so > > > running generic/470 with dm-thin device always returns "not run". > > > > > > Please see the difference between old and new logic: > > > > > > old logic new logic > > > --------------------------------------------------------------- > > > log-writes device(DAX) log-writes device(DAX) > > > | | > > > PMEM0(DAX) + PMEM1(DAX) Thin device(non-DAX) + PMEM1(DAX) > > > | > > > PMEM0(DAX) > > > --------------------------------------------------------------- > > > > > > We think dm-thin device is not a good solution for generic/470, is there > > > any other solution to support both discard zero and DAX? 
> > > > Hi Brian, > > > > I have sent a patch[1] to revert your fix because I think it's not good for > > generic/470 to use thin volume as my revert patch[1] describes: > > [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u > > > > I think the history here is that generic/482 was changed over first in > commit 65cc9a235919 ("generic/482: use thin volume as data device"), and > then sometime later we realized generic/455,457,470 had the same general > flaw and were switched over. The dm/dax compatibility thing was probably > just an oversight, but I am a little curious about that because it should It's not an oversight -- it used to work (albeit with EXPERIMENTAL tags), and now we've broken it on fsdax as the pmem/blockdev divorce progresses. > have been obvious that the change caused the test to no longer run. Did > something change after that to trigger that change in behavior? > > > With the revert, generic/470 can always run successfully on my environment > > so I wonder how to reproduce the out-of-order replay issue on XFS v5 > > filesystem? > > > > I don't quite recall the characteristics of the failures beyond that we > were seeing spurious test failures with generic/482 that were due to > essentially putting the fs/log back in time in a way that wasn't quite > accurate due to the clearing by the logwrites tool not taking place. If > you wanted to reproduce in order to revisit that, perhaps start with > generic/482 and let it run in a loop for a while and see if it > eventually triggers a failure/corruption..? > > > PS: I want to reproduce the issue and try to find a better solution to fix > > it. > > > > It's been a while since I looked at any of this tooling to semi-grok how > it works. I /think/ this was the crux of the problem, back in 2019? https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/ > Perhaps it could learn to rely on something more explicit like > zero range (instead of discard?) 
or fall back to manual zeroing? AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably ought to be adapted to call BLKZEROOUT and (b) in the worst case it writes zeroes to the entire device, which is/can be slow. For a (crass) example, one of my cloudy test VMs uses 34GB partitions, and for cost optimization purposes we're only "paying" for the cheapest tier. Weirdly that maps to an upper limit of 6500 write iops and 48MB/s(!) but that would take about 20 minutes to zero the entire device if the dm-thin hack wasn't in place. Frustratingly, it doesn't support discard or write-zeroes. > If the > eventual solution is simple and low enough overhead, it might make some > sense to replace the dmthin hack across the set of tests mentioned > above. That said, for a *pmem* test you'd expect it to be faster than that... --D > Brian > > > Best Regards, > > Xiao Yang > > > > > > > > BTW, only log-writes, stripe and linear support DAX for now. > > >
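[The adaptation Darrick suggests in (a)/(b) could look roughly like the sketch below — a hypothetical helper, not the actual src/log-writes code: try the BLKZEROOUT ioctl first (offloaded by the block layer where possible), and only fall back to writing zero buffers from user space (the slow worst case) when the ioctl is unsupported, e.g. ENOTTY on a regular file. Buffer size and error handling are simplified:]

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* BLKZEROOUT */

/* Zero [start, start+len) of fd, preferring the BLKZEROOUT ioctl. */
static int zero_out(int fd, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };
	char buf[65536];

	if (ioctl(fd, BLKZEROOUT, range) == 0)
		return 0;
	if (errno != ENOTTY && errno != EOPNOTSUPP && errno != EINVAL)
		return -1;

	/* Fallback: write zero buffers manually (slow path). */
	memset(buf, 0, sizeof(buf));
	while (len > 0) {
		size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t ret = pwrite(fd, buf, chunk, start);

		if (ret < 0)
			return -1;
		start += ret;
		len -= ret;
	}
	return 0;
}
```

[On a 48MB/s device like the one described above, the fallback path is exactly the ~20-minute full-device zeroing cost; an offloaded BLKZEROOUT, or a real discard, avoids it when the device supports it.]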
On 2022/9/15 0:28, Darrick J. Wong wrote: > On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote: >> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote: >>> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote: >>>> On 2022/9/9 21:01, Brian Foster wrote: >>>>> Yes.. I don't recall all the internals of the tools and test, but IIRC >>>>> it relied on discard to perform zeroing between checkpoints or some such >>>>> and avoid spurious failures. The purpose of running on dm-thin was >>>>> merely to provide reliable discard zeroing behavior on the target device >>>>> and thus to allow the test to run reliably. >>>> Hi Brian, >>>> >>>> As far as I know, generic/470 was original designed to verify >>>> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the >>>> reason, we need to ensure that all underlying devices under >>>> dm-log-writes device support DAX. However dm-thin device never supports >>>> DAX so >>>> running generic/470 with dm-thin device always returns "not run". >>>> >>>> Please see the difference between old and new logic: >>>> >>>> old logic new logic >>>> --------------------------------------------------------------- >>>> log-writes device(DAX) log-writes device(DAX) >>>> | | >>>> PMEM0(DAX) + PMEM1(DAX) Thin device(non-DAX) + PMEM1(DAX) >>>> | >>>> PMEM0(DAX) >>>> --------------------------------------------------------------- >>>> >>>> We think dm-thin device is not a good solution for generic/470, is there >>>> any other solution to support both discard zero and DAX? 
>>> >>> Hi Brian, >>> >>> I have sent a patch[1] to revert your fix because I think it's not good for >>> generic/470 to use thin volume as my revert patch[1] describes: >>> [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u >>> >> >> I think the history here is that generic/482 was changed over first in >> commit 65cc9a235919 ("generic/482: use thin volume as data device"), and >> then sometime later we realized generic/455,457,470 had the same general >> flaw and were switched over. The dm/dax compatibility thing was probably >> just an oversight, but I am a little curious about that because it should > > It's not an oversight -- it used to work (albeit with EXPERIMENTAL > tags), and now we've broken it on fsdax as the pmem/blockdev divorce > progresses. Hi Do you mean that the following patch set changed the test result of generic/470 with thin-volume? (pass => not run/failure) https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ > >> have been obvious that the change caused the test to no longer run. Did >> something change after that to trigger that change in behavior? >> >>> With the revert, generic/470 can always run successfully on my environment >>> so I wonder how to reproduce the out-of-order replay issue on XFS v5 >>> filesystem? >>> >> >> I don't quite recall the characteristics of the failures beyond that we >> were seeing spurious test failures with generic/482 that were due to >> essentially putting the fs/log back in time in a way that wasn't quite >> accurate due to the clearing by the logwrites tool not taking place. If >> you wanted to reproduce in order to revisit that, perhaps start with >> generic/482 and let it run in a loop for a while and see if it >> eventually triggers a failure/corruption..? >> >>> PS: I want to reproduce the issue and try to find a better solution to fix >>> it. >>> >> >> It's been a while since I looked at any of this tooling to semi-grok how >> it works. 
> > I /think/ this was the crux of the problem, back in 2019? > https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/ Agreed. > >> Perhaps it could learn to rely on something more explicit like >> zero range (instead of discard?) or fall back to manual zeroing? > > AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably > ought to be adapted to call BLKZEROOUT and (b) in the worst case it > writes zeroes to the entire device, which is/can be slow. > > For a (crass) example, one of my cloudy test VMs uses 34GB partitions, > and for cost optimization purposes we're only "paying" for the cheapest > tier. Weirdly that maps to an upper limit of 6500 write iops and > 48MB/s(!) but that would take about 20 minutes to zero the entire > device if the dm-thin hack wasn't in place. Frustratingly, it doesn't > support discard or write-zeroes. Do you mean that discard zero(BLKDISCARD) is faster than both fill zero(BLKZEROOUT) and write zero on user space? Best Regards, Xiao Yang > >> If the >> eventual solution is simple and low enough overhead, it might make some >> sense to replace the dmthin hack across the set of tests mentioned >> above. > > That said, for a *pmem* test you'd expect it to be faster than that... > > --D > >> Brian >> >>> Best Regards, >>> Xiao Yang >>> >>>> >>>> BTW, only log-writes, stripe and linear support DAX for now. >>> >>
On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote: > On 2022/9/15 0:28, Darrick J. Wong wrote: >> On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote: >>> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote: >>>> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote: >>>>> On 2022/9/9 21:01, Brian Foster wrote: >>>>>> Yes.. I don't recall all the internals of the tools and test, but >>>>>> IIRC >>>>>> it relied on discard to perform zeroing between checkpoints or >>>>>> some such >>>>>> and avoid spurious failures. The purpose of running on dm-thin was >>>>>> merely to provide reliable discard zeroing behavior on the target >>>>>> device >>>>>> and thus to allow the test to run reliably. >>>>> Hi Brian, >>>>> >>>>> As far as I know, generic/470 was original designed to verify >>>>> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the >>>>> reason, we need to ensure that all underlying devices under >>>>> dm-log-writes device support DAX. However dm-thin device never >>>>> supports >>>>> DAX so >>>>> running generic/470 with dm-thin device always returns "not run". >>>>> >>>>> Please see the difference between old and new logic: >>>>> >>>>> old logic new logic >>>>> --------------------------------------------------------------- >>>>> log-writes device(DAX) log-writes device(DAX) >>>>> | | >>>>> PMEM0(DAX) + PMEM1(DAX) Thin device(non-DAX) + PMEM1(DAX) >>>>> | >>>>> PMEM0(DAX) >>>>> --------------------------------------------------------------- >>>>> >>>>> We think dm-thin device is not a good solution for generic/470, is >>>>> there >>>>> any other solution to support both discard zero and DAX? 
>>>> >>>> Hi Brian, >>>> >>>> I have sent a patch[1] to revert your fix because I think it's not >>>> good for >>>> generic/470 to use thin volume as my revert patch[1] describes: >>>> [1] >>>> https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u >>>> >>>> >>> >>> I think the history here is that generic/482 was changed over first in >>> commit 65cc9a235919 ("generic/482: use thin volume as data device"), and >>> then sometime later we realized generic/455,457,470 had the same general >>> flaw and were switched over. The dm/dax compatibility thing was probably >>> just an oversight, but I am a little curious about that because it >>> should >> >> It's not an oversight -- it used to work (albeit with EXPERIMENTAL >> tags), and now we've broken it on fsdax as the pmem/blockdev divorce >> progresses. > Hi > > Do you mean that the following patch set changed the test result of > generic/470 with thin-volume? (pass => not run/failure) > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ > >> >>> have been obvious that the change caused the test to no longer run. Did >>> something change after that to trigger that change in behavior? >>> >>>> With the revert, generic/470 can always run successfully on my >>>> environment >>>> so I wonder how to reproduce the out-of-order replay issue on XFS v5 >>>> filesystem? >>>> >>> >>> I don't quite recall the characteristics of the failures beyond that we >>> were seeing spurious test failures with generic/482 that were due to >>> essentially putting the fs/log back in time in a way that wasn't quite >>> accurate due to the clearing by the logwrites tool not taking place. If >>> you wanted to reproduce in order to revisit that, perhaps start with >>> generic/482 and let it run in a loop for a while and see if it >>> eventually triggers a failure/corruption..? >>> >>>> PS: I want to reproduce the issue and try to find a better solution >>>> to fix >>>> it. 
>>>> >>> >>> It's been a while since I looked at any of this tooling to semi-grok how >>> it works. >> >> I /think/ this was the crux of the problem, back in 2019? >> https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/ > > Agreed. > >> >>> Perhaps it could learn to rely on something more explicit like >>> zero range (instead of discard?) or fall back to manual zeroing? >> >> AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably >> ought to be adapted to call BLKZEROOUT and (b) in the worst case it >> writes zeroes to the entire device, which is/can be slow. >> >> For a (crass) example, one of my cloudy test VMs uses 34GB partitions, >> and for cost optimization purposes we're only "paying" for the cheapest >> tier. Weirdly that maps to an upper limit of 6500 write iops and >> 48MB/s(!) but that would take about 20 minutes to zero the entire >> device if the dm-thin hack wasn't in place. Frustratingly, it doesn't >> support discard or write-zeroes. > > Do you mean that discard zero(BLKDISCARD) is faster than both fill > zero(BLKZEROOUT) and write zero on user space?

Hi Darrick, Brian and Christoph,

Following the discussion about generic/470, I wonder whether it is necessary to make thin-pool support DAX. Is there any use case for this requirement?

Best Regards, Xiao Yang

> > Best Regards, > Xiao Yang >> >>> If the >>> eventual solution is simple and low enough overhead, it might make some >>> sense to replace the dmthin hack across the set of tests mentioned >>> above. >> >> That said, for a *pmem* test you'd expect it to be faster than that... >> >> --D >> >>> Brian >>> >>>> Best Regards, >>>> Xiao Yang >>>> >>>>> >>>>> BTW, only log-writes, stripe and linear support DAX for now. >>>> >>>
Hi Darrick, Brian and Christoph,

Ping. I hope to get your feedback.

1) I have confirmed that the following patch set did not change the test result of generic/470 with thin-volume. Besides, I didn't see any failure when running generic/470 on a normal PMEM device instead of a thin-volume. https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/

2) I can reproduce the failure of generic/482 without thin-volume.

3) Is it necessary to make thin-volume support DAX? Is there any use case for this requirement?

Best Regards, Xiao Yang

On 2022/9/16 10:04, Yang, Xiao/杨 晓 wrote: > On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote: >> On 2022/9/15 0:28, Darrick J. Wong wrote: >>> On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote: >>>> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote: >>>>> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote: >>>>>> On 2022/9/9 21:01, Brian Foster wrote: >>>>>>> Yes.. I don't recall all the internals of the tools and test, but >>>>>>> IIRC >>>>>>> it relied on discard to perform zeroing between checkpoints or >>>>>>> some such >>>>>>> and avoid spurious failures. The purpose of running on dm-thin was >>>>>>> merely to provide reliable discard zeroing behavior on the target >>>>>>> device >>>>>>> and thus to allow the test to run reliably. >>>>>> Hi Brian, >>>>>> >>>>>> As far as I know, generic/470 was original designed to verify >>>>>> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the >>>>>> reason, we need to ensure that all underlying devices under >>>>>> dm-log-writes device support DAX. However dm-thin device never >>>>>> supports >>>>>> DAX so >>>>>> running generic/470 with dm-thin device always returns "not run".
>>>>>> >>>>>> Please see the difference between old and new logic: >>>>>> >>>>>> old logic new logic >>>>>> --------------------------------------------------------------- >>>>>> log-writes device(DAX) log-writes device(DAX) >>>>>> | | >>>>>> PMEM0(DAX) + PMEM1(DAX) Thin device(non-DAX) + PMEM1(DAX) >>>>>> | >>>>>> PMEM0(DAX) >>>>>> --------------------------------------------------------------- >>>>>> >>>>>> We think dm-thin device is not a good solution for generic/470, is >>>>>> there >>>>>> any other solution to support both discard zero and DAX? >>>>> >>>>> Hi Brian, >>>>> >>>>> I have sent a patch[1] to revert your fix because I think it's not >>>>> good for >>>>> generic/470 to use thin volume as my revert patch[1] describes: >>>>> [1] >>>>> https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u >>>>> >>>>> >>>> >>>> I think the history here is that generic/482 was changed over first in >>>> commit 65cc9a235919 ("generic/482: use thin volume as data device"), >>>> and >>>> then sometime later we realized generic/455,457,470 had the same >>>> general >>>> flaw and were switched over. The dm/dax compatibility thing was >>>> probably >>>> just an oversight, but I am a little curious about that because it >>>> should >>> >>> It's not an oversight -- it used to work (albeit with EXPERIMENTAL >>> tags), and now we've broken it on fsdax as the pmem/blockdev divorce >>> progresses. >> Hi >> >> Do you mean that the following patch set changed the test result of >> generic/470 with thin-volume? (pass => not run/failure) >> https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ >> >>> >>>> have been obvious that the change caused the test to no longer run. Did >>>> something change after that to trigger that change in behavior? >>>> >>>>> With the revert, generic/470 can always run successfully on my >>>>> environment >>>>> so I wonder how to reproduce the out-of-order replay issue on XFS v5 >>>>> filesystem? 
>>>>> >>>> >>>> I don't quite recall the characteristics of the failures beyond that we >>>> were seeing spurious test failures with generic/482 that were due to >>>> essentially putting the fs/log back in time in a way that wasn't quite >>>> accurate due to the clearing by the logwrites tool not taking place. If >>>> you wanted to reproduce in order to revisit that, perhaps start with >>>> generic/482 and let it run in a loop for a while and see if it >>>> eventually triggers a failure/corruption..? >>>> >>>>> PS: I want to reproduce the issue and try to find a better solution >>>>> to fix >>>>> it. >>>>> >>>> >>>> It's been a while since I looked at any of this tooling to semi-grok >>>> how >>>> it works. >>> >>> I /think/ this was the crux of the problem, back in 2019? >>> https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/ >> >> Agreed. >> >>> >>>> Perhaps it could learn to rely on something more explicit like >>>> zero range (instead of discard?) or fall back to manual zeroing? >>> >>> AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably >>> ought to be adapted to call BLKZEROOUT and (b) in the worst case it >>> writes zeroes to the entire device, which is/can be slow. >>> >>> For a (crass) example, one of my cloudy test VMs uses 34GB partitions, >>> and for cost optimization purposes we're only "paying" for the cheapest >>> tier. Weirdly that maps to an upper limit of 6500 write iops and >>> 48MB/s(!) but that would take about 20 minutes to zero the entire >>> device if the dm-thin hack wasn't in place. Frustratingly, it doesn't >>> support discard or write-zeroes. >> >> Do you mean that discard zero(BLKDISCARD) is faster than both fill >> zero(BLKZEROOUT) and write zero on user space? > > Hi Darrick, Brian and Christoph > > According to the discussion about generic/470. I wonder if it is > necessary to make thin-pool support DAX. Is there any use case for the > requirement? 
> > Best Regards, > Xiao Yang >> >> Best Regards, >> Xiao Yang >>> >>>> If the >>>> eventual solution is simple and low enough overhead, it might make some >>>> sense to replace the dmthin hack across the set of tests mentioned >>>> above. >>> >>> That said, for a *pmem* test you'd expect it to be faster than that... >>> >>> --D >>> >>>> Brian >>>> >>>>> Best Regards, >>>>> Xiao Yang >>>>> >>>>>> >>>>>> BTW, only log-writes, stripe and linear support DAX for now. >>>>> >>>>
Hello everyone,

On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote: > Hi Darrick, Brian and Christoph > > Ping. I hope to get your feedback. > > 1) I have confirmed that the following patch set did not change the test > result of generic/470 with thin-volume. Besides, I didn't see any > failure when running generic/470 based on normal PMEM device instaed of > thin-volume. > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ > > 2) I can reproduce the failure of generic/482 without thin-volume. > > 3) Is it necessary to make thin-volume support DAX. Is there any use > case for the requirement?

Though I asked this in another place(*), I really want to know the use case for dm-thin-volume with DAX and reflink.

In my understanding, dm-thin-volume seems to provide a feature similar to XFS reflink. Both features provide COW updates to reduce space usage, plus a snapshot feature, right? I found that Docker seems to select one of them (or another feature which supports COW), so a user doesn't need to use thin-volume and reflink at the same time. A database which uses FS-DAX may want snapshots of its FS-DAX data, but its users seem to be satisfied with either reflink or thin-volume. So I could not find a use case where a user would want to use dm-thin-volume and reflink at the same time.

The only possibility is that the user has mistakenly configured dm-thinpool and reflink to be used at the same time, but in that case it seems better for the user to disable one or the other. I really wonder why dm-thin-volume must be used with reflink and FS-DAX.

If my understanding is wrong, please correct me.

(*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/

Thanks, --- Yasunori Goto

> > Best Regards, > Xiao Yang > > On 2022/9/16 10:04, Yang, Xiao/杨 晓 wrote: >> On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote: >>> On 2022/9/15 0:28, Darrick J.
Wong wrote: >>>> On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote: >>>>> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote: >>>>>> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote: >>>>>>> On 2022/9/9 21:01, Brian Foster wrote: >>>>>>>> Yes.. I don't recall all the internals of the tools and test, >>>>>>>> but IIRC >>>>>>>> it relied on discard to perform zeroing between checkpoints or >>>>>>>> some such >>>>>>>> and avoid spurious failures. The purpose of running on dm-thin was >>>>>>>> merely to provide reliable discard zeroing behavior on the >>>>>>>> target device >>>>>>>> and thus to allow the test to run reliably. >>>>>>> Hi Brian, >>>>>>> >>>>>>> As far as I know, generic/470 was original designed to verify >>>>>>> mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the >>>>>>> reason, we need to ensure that all underlying devices under >>>>>>> dm-log-writes device support DAX. However dm-thin device never >>>>>>> supports >>>>>>> DAX so >>>>>>> running generic/470 with dm-thin device always returns "not run". >>>>>>> >>>>>>> Please see the difference between old and new logic: >>>>>>> >>>>>>> old logic new logic >>>>>>> --------------------------------------------------------------- >>>>>>> log-writes device(DAX) log-writes device(DAX) >>>>>>> | | >>>>>>> PMEM0(DAX) + PMEM1(DAX) Thin device(non-DAX) + PMEM1(DAX) >>>>>>> | >>>>>>> PMEM0(DAX) >>>>>>> --------------------------------------------------------------- >>>>>>> >>>>>>> We think dm-thin device is not a good solution for generic/470, >>>>>>> is there >>>>>>> any other solution to support both discard zero and DAX? 
>>>>>> >>>>>> Hi Brian, >>>>>> >>>>>> I have sent a patch[1] to revert your fix because I think it's not >>>>>> good for >>>>>> generic/470 to use thin volume as my revert patch[1] describes: >>>>>> [1] >>>>>> https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u >>>>>> >>>>>> >>>>> >>>>> I think the history here is that generic/482 was changed over first in >>>>> commit 65cc9a235919 ("generic/482: use thin volume as data >>>>> device"), and >>>>> then sometime later we realized generic/455,457,470 had the same >>>>> general >>>>> flaw and were switched over. The dm/dax compatibility thing was >>>>> probably >>>>> just an oversight, but I am a little curious about that because it >>>>> should >>>> >>>> It's not an oversight -- it used to work (albeit with EXPERIMENTAL >>>> tags), and now we've broken it on fsdax as the pmem/blockdev divorce >>>> progresses. >>> Hi >>> >>> Do you mean that the following patch set changed the test result of >>> generic/470 with thin-volume? (pass => not run/failure) >>> https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ >>> >>>> >>>>> have been obvious that the change caused the test to no longer run. >>>>> Did >>>>> something change after that to trigger that change in behavior? >>>>> >>>>>> With the revert, generic/470 can always run successfully on my >>>>>> environment >>>>>> so I wonder how to reproduce the out-of-order replay issue on XFS v5 >>>>>> filesystem? >>>>>> >>>>> >>>>> I don't quite recall the characteristics of the failures beyond >>>>> that we >>>>> were seeing spurious test failures with generic/482 that were due to >>>>> essentially putting the fs/log back in time in a way that wasn't quite >>>>> accurate due to the clearing by the logwrites tool not taking >>>>> place. If >>>>> you wanted to reproduce in order to revisit that, perhaps start with >>>>> generic/482 and let it run in a loop for a while and see if it >>>>> eventually triggers a failure/corruption..? 
>>>>> >>>>>> PS: I want to reproduce the issue and try to find a better >>>>>> solution to fix >>>>>> it. >>>>>> >>>>> >>>>> It's been a while since I looked at any of this tooling to >>>>> semi-grok how >>>>> it works. >>>> >>>> I /think/ this was the crux of the problem, back in 2019? >>>> https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/ >>> >>> Agreed. >>> >>>> >>>>> Perhaps it could learn to rely on something more explicit like >>>>> zero range (instead of discard?) or fall back to manual zeroing? >>>> >>>> AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably >>>> ought to be adapted to call BLKZEROOUT and (b) in the worst case it >>>> writes zeroes to the entire device, which is/can be slow. >>>> >>>> For a (crass) example, one of my cloudy test VMs uses 34GB partitions, >>>> and for cost optimization purposes we're only "paying" for the cheapest >>>> tier. Weirdly that maps to an upper limit of 6500 write iops and >>>> 48MB/s(!) but that would take about 20 minutes to zero the entire >>>> device if the dm-thin hack wasn't in place. Frustratingly, it doesn't >>>> support discard or write-zeroes. >>> >>> Do you mean that discard zero(BLKDISCARD) is faster than both fill >>> zero(BLKZEROOUT) and write zero on user space? >> >> Hi Darrick, Brian and Christoph >> >> According to the discussion about generic/470. I wonder if it is >> necessary to make thin-pool support DAX. Is there any use case for the >> requirement? >> >> Best Regards, >> Xiao Yang >>> >>> Best Regards, >>> Xiao Yang >>>> >>>>> If the >>>>> eventual solution is simple and low enough overhead, it might make >>>>> some >>>>> sense to replace the dmthin hack across the set of tests mentioned >>>>> above. >>>> >>>> That said, for a *pmem* test you'd expect it to be faster than that... >>>> >>>> --D >>>> >>>>> Brian >>>>> >>>>>> Best Regards, >>>>>> Xiao Yang >>>>>> >>>>>>> >>>>>>> BTW, only log-writes, stripe and linear support DAX for now. >>>>>> >>>>>
On Fri, Sep 30, 2022 at 09:56:41AM +0900, Gotou, Yasunori/五島 康文 wrote: > Hello everyone, > > On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote: > > Hi Darrick, Brian and Christoph > > > > Ping. I hope to get your feedback. > > > > 1) I have confirmed that the following patch set did not change the test > > result of generic/470 with thin-volume. Besides, I didn't see any > > failure when running generic/470 based on normal PMEM device instaed of > > thin-volume. > > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ > > > > 2) I can reproduce the failure of generic/482 without thin-volume. > > > > 3) Is it necessary to make thin-volume support DAX. Is there any use > > case for the requirement? > > > Though I asked other place(*), I really want to know the usecase of > dm-thin-volume with DAX and reflink. > > > In my understanding, dm-thin-volume seems to provide similar feature like > reflink of xfs. Both feature provide COW update to reduce usage of > its region, and snapshot feature, right? > > I found that docker seems to select one of them (or other feature which > supports COW). Then user don't need to use thin-volume and reflink at same > time. > > Database which uses FS-DAX may want to use snapshot for its data of FS-DAX, > its user seems to be satisfied with reflink or thin-volume. > > So I could not find on what use-case user would like to use dm-thin-volume > and reflink at same time. > > The only possibility is that the user has mistakenly configured dm-thinpool > and reflink to be used at the same time, but if that is the case, it seems > to be better for the user to disable one or the other. > > I really wander why dm-thin-volume must be used with reflik and FS-DAX. There isn't a hard requirement between fsdax and dm-thinp. The /test/ needs dm-logwrites to check that write page faults on a MAP_SYNC mmapping are persisted directly to disk. 
dm-logwrites requires a fast way to zero an entire device for correct operation of the replay step, and thinp is the only way to guarantee that. --D > If my understanding is something wrong, please correct me. > > (*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/ > > Thanks, > --- > Yasunori Goto > > > > > > Best Regards, > > Xiao Yang > > > > On 2022/9/16 10:04, Yang, Xiao/杨 晓 wrote: > > > On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote: > > > > On 2022/9/15 0:28, Darrick J. Wong wrote: > > > > > On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote: > > > > > > On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote: > > > > > > > On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote: > > > > > > > > On 2022/9/9 21:01, Brian Foster wrote: > > > > > > > > > Yes.. I don't recall all the internals of > > > > > > > > > the tools and test, but IIRC > > > > > > > > > it relied on discard to perform zeroing > > > > > > > > > between checkpoints or some such > > > > > > > > > and avoid spurious failures. The purpose of running on dm-thin was > > > > > > > > > merely to provide reliable discard zeroing > > > > > > > > > behavior on the target device > > > > > > > > > and thus to allow the test to run reliably. > > > > > > > > Hi Brian, > > > > > > > > > > > > > > > > As far as I know, generic/470 was original designed to verify > > > > > > > > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the > > > > > > > > reason, we need to ensure that all underlying devices under > > > > > > > > dm-log-writes device support DAX. However > > > > > > > > dm-thin device never supports > > > > > > > > DAX so > > > > > > > > running generic/470 with dm-thin device always returns "not run". 
> > > > > > > > > > > > > > > > Please see the difference between old and new logic: > > > > > > > > > > > > > > > > old logic new logic > > > > > > > > --------------------------------------------------------------- > > > > > > > > log-writes device(DAX) log-writes device(DAX) > > > > > > > > | | > > > > > > > > PMEM0(DAX) + PMEM1(DAX) Thin device(non-DAX) + PMEM1(DAX) > > > > > > > > | > > > > > > > > PMEM0(DAX) > > > > > > > > --------------------------------------------------------------- > > > > > > > > > > > > > > > > We think dm-thin device is not a good solution > > > > > > > > for generic/470, is there > > > > > > > > any other solution to support both discard zero and DAX? > > > > > > > > > > > > > > Hi Brian, > > > > > > > > > > > > > > I have sent a patch[1] to revert your fix because I > > > > > > > think it's not good for > > > > > > > generic/470 to use thin volume as my revert patch[1] describes: > > > > > > > [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@fujitsu.com/T/#u > > > > > > > > > > > > > > > > > > > > > > > > > > I think the history here is that generic/482 was changed over first in > > > > > > commit 65cc9a235919 ("generic/482: use thin volume as > > > > > > data device"), and > > > > > > then sometime later we realized generic/455,457,470 had > > > > > > the same general > > > > > > flaw and were switched over. The dm/dax compatibility > > > > > > thing was probably > > > > > > just an oversight, but I am a little curious about that > > > > > > because it should > > > > > > > > > > It's not an oversight -- it used to work (albeit with EXPERIMENTAL > > > > > tags), and now we've broken it on fsdax as the pmem/blockdev divorce > > > > > progresses. > > > > Hi > > > > > > > > Do you mean that the following patch set changed the test result > > > > of generic/470 with thin-volume? 
(pass => not run/failure) > > > > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ > > > > > > > > > > > > > > > have been obvious that the change caused the test to no > > > > > > longer run. Did > > > > > > something change after that to trigger that change in behavior? > > > > > > > > > > > > > With the revert, generic/470 can always run > > > > > > > successfully on my environment > > > > > > > so I wonder how to reproduce the out-of-order replay issue on XFS v5 > > > > > > > filesystem? > > > > > > > > > > > > > > > > > > > I don't quite recall the characteristics of the failures > > > > > > beyond that we > > > > > > were seeing spurious test failures with generic/482 that were due to > > > > > > essentially putting the fs/log back in time in a way that wasn't quite > > > > > > accurate due to the clearing by the logwrites tool not > > > > > > taking place. If > > > > > > you wanted to reproduce in order to revisit that, perhaps start with > > > > > > generic/482 and let it run in a loop for a while and see if it > > > > > > eventually triggers a failure/corruption..? > > > > > > > > > > > > > PS: I want to reproduce the issue and try to find a > > > > > > > better solution to fix > > > > > > > it. > > > > > > > > > > > > > > > > > > > It's been a while since I looked at any of this tooling > > > > > > to semi-grok how > > > > > > it works. > > > > > > > > > > I /think/ this was the crux of the problem, back in 2019? > > > > > https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/ > > > > > > > > Agreed. > > > > > > > > > > > > > > > Perhaps it could learn to rely on something more explicit like > > > > > > zero range (instead of discard?) or fall back to manual zeroing? > > > > > > > > > > AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably > > > > > ought to be adapted to call BLKZEROOUT and (b) in the worst case it > > > > > writes zeroes to the entire device, which is/can be slow. 
> > > > > > > > > > For a (crass) example, one of my cloudy test VMs uses 34GB partitions, > > > > > and for cost optimization purposes we're only "paying" for the cheapest > > > > > tier. Weirdly that maps to an upper limit of 6500 write iops and > > > > > 48MB/s(!) but that would take about 20 minutes to zero the entire > > > > > device if the dm-thin hack wasn't in place. Frustratingly, it doesn't > > > > > support discard or write-zeroes. > > > > > > > > Do you mean that discard zero(BLKDISCARD) is faster than both > > > > fill zero(BLKZEROOUT) and write zero on user space? > > > > > > Hi Darrick, Brian and Christoph > > > > > > According to the discussion about generic/470. I wonder if it is > > > necessary to make thin-pool support DAX. Is there any use case for > > > the requirement? > > > > > > Best Regards, > > > Xiao Yang > > > > > > > > Best Regards, > > > > Xiao Yang > > > > > > > > > > > If the > > > > > > eventual solution is simple and low enough overhead, it > > > > > > might make some > > > > > > sense to replace the dmthin hack across the set of tests mentioned > > > > > > above. > > > > > > > > > > That said, for a *pmem* test you'd expect it to be faster than that... > > > > > > > > > > --D > > > > > > > > > > > Brian > > > > > > > > > > > > > Best Regards, > > > > > > > Xiao Yang > > > > > > > > > > > > > > > > > > > > > > > BTW, only log-writes, stripe and linear support DAX for now. > > > > > > > > > > > > > >
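Darrick's point above — that dm-logwrites replay only works if the device's pre-existing contents are wiped, and that discard-zeroing is the cheap way to wipe them — can be illustrated with a toy model. This is a hypothetical sketch, not the real dm-logwrites code: the device is a dict of block → contents, and `replay_logwrites` is an invented helper.

```python
def replay_logwrites(device, log, upto_mark, discard_zeroes):
    """Toy replay of a dm-logwrites log onto `device` (block -> data).

    dm-logwrites records writes *and* discards; whether a replayed
    discard actually clears old contents depends on the device's
    discard-zeroing behavior (guaranteed by dm-thinp, but not by most
    real disks).
    """
    for entry in log:
        op = entry[0]
        if op == "mark" and entry[1] == upto_mark:
            break
        if op == "write":
            device[entry[1]] = entry[2]
        elif op == "discard" and discard_zeroes:
            device.clear()          # reads after discard return zeroes

    return device

# State of the device when replay starts: unmount already flushed
# block 7 ("BBB"), so it is physically on disk even though it was only
# written after the mark.
after_unmount = {0: "superblock", 3: "inode-chunk", 7: "BBB @ yyy_lsn"}

log = [
    ("discard",),                    # mkfs.xfs discards the whole device
    ("write", 0, "superblock"),
    ("write", 3, "inode-chunk"),
    ("mark", "preunmap"),
    ("write", 7, "BBB @ yyy_lsn"),   # only recorded after the mark
]

# Without discard zeroing, the replayed discard clears nothing, so the
# from-the-future block survives the roll-forward to "preunmap":
no_zero = replay_logwrites(dict(after_unmount), log, "preunmap", False)
print(7 in no_zero)   # True -> leftover block, inconsistent fs

# With dm-thinp's read-zeroes-after-discard guarantee it is wiped:
zeroed = replay_logwrites(dict(after_unmount), log, "preunmap", True)
print(7 in zeroed)    # False
```

The model shows why the test stack needs a device where discard reliably zeroes: rolling forward to a mark can only reproduce blocks the log knows about, and everything else must already read back as zeroes.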
On 2022/10/03 17:12, Darrick J. Wong wrote: > On Fri, Sep 30, 2022 at 09:56:41AM +0900, Gotou, Yasunori/五島 康文 wrote: >> Hello everyone, >> >> On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote: >>> Hi Darrick, Brian and Christoph >>> >>> Ping. I hope to get your feedback. >>> >>> 1) I have confirmed that the following patch set did not change the test >>> result of generic/470 with thin-volume. Besides, I didn't see any >>> failure when running generic/470 based on normal PMEM device instaed of >>> thin-volume. >>> https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ >>> >>> 2) I can reproduce the failure of generic/482 without thin-volume. >>> >>> 3) Is it necessary to make thin-volume support DAX. Is there any use >>> case for the requirement? >> >> >> Though I asked other place(*), I really want to know the usecase of >> dm-thin-volume with DAX and reflink. >> >> >> In my understanding, dm-thin-volume seems to provide similar feature like >> reflink of xfs. Both feature provide COW update to reduce usage of >> its region, and snapshot feature, right? >> >> I found that docker seems to select one of them (or other feature which >> supports COW). Then user don't need to use thin-volume and reflink at same >> time. >> >> Database which uses FS-DAX may want to use snapshot for its data of FS-DAX, >> its user seems to be satisfied with reflink or thin-volume. >> >> So I could not find on what use-case user would like to use dm-thin-volume >> and reflink at same time. >> >> The only possibility is that the user has mistakenly configured dm-thinpool >> and reflink to be used at the same time, but if that is the case, it seems >> to be better for the user to disable one or the other. >> >> I really wander why dm-thin-volume must be used with reflik and FS-DAX. > > There isn't a hard requirement between fsdax and dm-thinp. The /test/ > needs dm-logwrites to check that write page faults on a MAP_SYNC > mmapping are persisted directly to disk. 
dm-logwrites requires a fast
> way to zero an entire device for correct operation of the replay step,
> and thinp is the only way to guarantee that.

Thank you for your answer, but I still feel something is strange.

Though dm-thinp may be a good way to execute the test correctly, it seems to be a kind of workaround to make the test pass; it may not really be required by actual users. Could you tell me why passing the test by such a workaround is so necessary?

Thanks,

>
> --D
>
>> If my understanding is something wrong, please correct me.
>>
>> (*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/
On Mon, Oct 03, 2022 at 09:12:46PM -0700, Gotou, Yasunori/五島 康文 wrote: > On 2022/10/03 17:12, Darrick J. Wong wrote: > > On Fri, Sep 30, 2022 at 09:56:41AM +0900, Gotou, Yasunori/五島 康文 wrote: > > > Hello everyone, > > > > > > On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote: > > > > Hi Darrick, Brian and Christoph > > > > > > > > Ping. I hope to get your feedback. > > > > > > > > 1) I have confirmed that the following patch set did not change the test > > > > result of generic/470 with thin-volume. Besides, I didn't see any > > > > failure when running generic/470 based on normal PMEM device instaed of > > > > thin-volume. > > > > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@lst.de/ > > > > > > > > 2) I can reproduce the failure of generic/482 without thin-volume. > > > > > > > > 3) Is it necessary to make thin-volume support DAX. Is there any use > > > > case for the requirement? > > > > > > > > > Though I asked other place(*), I really want to know the usecase of > > > dm-thin-volume with DAX and reflink. > > > > > > > > > In my understanding, dm-thin-volume seems to provide similar feature like > > > reflink of xfs. Both feature provide COW update to reduce usage of > > > its region, and snapshot feature, right? > > > > > > I found that docker seems to select one of them (or other feature which > > > supports COW). Then user don't need to use thin-volume and reflink at same > > > time. > > > > > > Database which uses FS-DAX may want to use snapshot for its data of FS-DAX, > > > its user seems to be satisfied with reflink or thin-volume. > > > > > > So I could not find on what use-case user would like to use dm-thin-volume > > > and reflink at same time. > > > > > > The only possibility is that the user has mistakenly configured dm-thinpool > > > and reflink to be used at the same time, but if that is the case, it seems > > > to be better for the user to disable one or the other. 
> > >
> > > I really wander why dm-thin-volume must be used with reflik and FS-DAX.
> >
> > There isn't a hard requirement between fsdax and dm-thinp. The /test/
> > needs dm-logwrites to check that write page faults on a MAP_SYNC
> > mmapping are persisted directly to disk. dm-logwrites requires a fast
> > way to zero an entire device for correct operation of the replay step,
> > and thinp is the only way to guarantee that.
>
> Thank you for your answer. But I still feel something is strange.
> Though dm-thinp may be good way to execute the test correctly,

Yep.

> I suppose it seems to be likely a kind of workaround to pass the test,
> it may not be really required for actual users.

Exactly correct. Real users should /never/ set up this kind of (test scaffolding|insanity) to use fsdax.

> Could you tell me why passing test by workaround is so necessary?

Notice this line in generic/470:

$XFS_IO_PROG -t -c "truncate $LEN" -c "mmap -S 0 $LEN" -c "mwrite 0 $LEN" \
	-c "log_writes -d $LOGWRITES_NAME -m preunmap" \
	-f $SCRATCH_MNT/test

The second xfs_io command creates a MAP_SYNC mmap of the SCRATCH_MNT/test file, and the third command memcpy's bytes to the mapping to invoke the write page fault handler.

The fourth command tells the dm-logwrites driver for $LOGWRITES_NAME (aka the block device containing the mounted XFS filesystem) to create a mark called "preunmap". This mark captures the exact state of the block device immediately after the write faults complete, so that we can come back to it later. There are a few things to note here:

(1) We did not tell the fs to persist anything;
(2) We can't use dm-snapshot here, because dm-snapshot will flush the fs (I think?); and
(3) The fs is still mounted, so the state of the block device at the mark reflects a dirty XFS with a log that must be replayed.
The next thing the test does is unmount the fs, remove the dm-logwrites driver to stop recording, and check the fs:

_log_writes_unmount
_log_writes_remove
_dmthin_check_fs

This ensures that the post-umount fs is consistent. Now we want to roll back to the place we marked to see if the mwrite data made it to pmem. It *should* have, since we asked for a MAP_SYNC mapping on a fsdax filesystem recorded on a pmem device:

# check pre-unmap state
_log_writes_replay_log preunmap $DMTHIN_VOL_DEV
_dmthin_mount

dm-logwrites can't actually roll backwards in time to a mark, since it only records new disk contents. It /can/ however roll forward from whatever point it began recording writes to the mark, so that's what it does.

However -- remember note (3) from earlier. When we _dmthin_mount after replaying the log to the "preunmap" mark, XFS will see the dirty XFS log and try to recover the XFS log. This is where the replay problems crop up. The XFS log records a monotonically increasing sequence number (LSN) with every log update, and when updates are written into the filesystem, that LSN is also written into the filesystem block. Log recovery also replays updates into the filesystem, but with the added behavior that it skips a block replay if the block's LSN is higher than the transaction being replayed. IOWs, we never replay older block contents over newer block contents.

For dm-logwrites this is a major problem, because there could be more filesystem updates written to the XFS log after the mark is made. LSNs will then be handed out like this:

mkfs_lsn                  preunmap_lsn           umount_lsn
   |                           |                      |
   |--------------------------||----------|----------|
                              |           |
                           xxx_lsn     yyy_lsn

Let's say that a new metadata block "BBB" was created in update "xxx" immediately before the preunmap mark was made. Per (1), we didn't flush the filesystem before taking the mark, which means that the new block's contents exist only in the log at this point.
Let us further say that the new block was again changed in update "yyy", where preunmap_lsn < yyy_lsn <= umount_lsn. Clearly, yyy_lsn > xxx_lsn. yyy_lsn is written to the block at unmount, because unmounting flushes the log clean before it completes. This is the first time that BBB ever gets written.

_log_writes_replay_log begins replaying the block device from mkfs_lsn towards preunmap_lsn. When it's done, it will have a log that reflects all the changes up to preunmap_lsn. Recall however that BBB isn't written until after the preunmap mark, which means that dm-logwrites has no record of BBB before preunmap_lsn, so dm-logwrites replay won't touch BBB. At this point, the block header for BBB has a UUID that matches the filesystem, but an LSN (yyy_lsn) that is beyond preunmap_lsn.

XFS log recovery starts up, and finds transaction xxx. It will read BBB from disk, but then it will see that it has an LSN of yyy_lsn. This is larger than xxx_lsn, so it concludes that BBB is newer than the log and moves on to the next log item. No other log items touch BBB, so recovery finishes, and now we have a filesystem containing one metadata block (BBB) from the future. This is an inconsistent filesystem, and has caused failures in the tests that use logwrites.

To work around this problem, all we really need to do is reinitialize the entire block device to known contents at mkfs time. This can be done expensively by writing zeroes to the entire block device, or it can be done cheaply by (a) issuing DISCARD to the whole block device at the start of the test and (b) ensuring that reads after a discard always produce zeroes. mkfs.xfs already does (a), so the test merely has to ensure (b).

dm-thinp is the only software solution that provides (b), so that's why this test layers dm-logwrites on top of dm-thinp on top of $SCRATCH_DEV. This combination used to work, but with the pending pmem/blockdev divorce, this strategy is no longer feasible.
I think the only way to fix this test is (a) revert all of Christoph's changes so far and scuttle the divorce; or (b) change this test like so:

1. Create a large sparse file on $TEST_DIR and losetup that sparse file. The resulting loop device will not have dax capability.

2. Set up the dmthin/dmlogwrites stack on top of this loop device.

3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem device) as the realtime device, and set the daxinherit and rtinherit flags on the root directory. The result is a filesystem with a data section that the kernel will treat as a regular block device, a realtime section backed by pmem, and the necessary flags to make sure that the test file will actually get fsdax mode.

4. Acknowledge that we no longer have any way to test MAP_SYNC functionality on ext4, which means that generic/470 has to move to tests/xfs/.

--D

> Thanks,
>
> >
> > --D
> >
> > > If my understanding is something wrong, please correct me.
> > >
> > > (*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@TYWPR01MB10082.jpnprd01.prod.outlook.com/
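The LSN scenario Darrick walks through can be condensed into a toy model of XFS log recovery's "skip if the block looks newer" rule. This is a sketch for illustration only, not real XFS code; `recover_log` and the literal contents strings are invented:

```python
def recover_log(device, transactions):
    """Toy XFS log recovery: replay each logged update into `device`,
    but skip any block whose on-disk LSN is already >= the transaction's
    LSN -- never replay older contents over newer contents."""
    for txn_lsn, block, contents in transactions:
        ondisk = device.get(block)
        if ondisk is not None and ondisk["lsn"] >= txn_lsn:
            continue                      # block appears newer than the log
        device[block] = {"lsn": txn_lsn, "data": contents}
    return device

# dm-logwrites rolled the device forward to the "preunmap" mark, but BBB
# was first flushed at unmount with yyy_lsn (> preunmap_lsn), and replay
# had no earlier record of it -- so it still sits on disk, from the future.
preunmap_lsn = 100
device = {"BBB": {"lsn": 150, "data": "contents from after the mark"}}

# Transaction xxx (xxx_lsn <= preunmap_lsn) logged BBB's initial contents.
transactions = [(99, "BBB", "contents transaction xxx wanted")]

recover_log(device, transactions)
print(device["BBB"]["data"])   # still "contents from after the mark"
```

Recovery leaves the from-the-future block untouched, which is exactly the inconsistent-filesystem outcome described above; zeroing the device before replay removes the stale LSN and lets transaction xxx replay normally.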
On 2022/10/5 2:26, Darrick J. Wong wrote: > Notice this line in generic/470: > > $XFS_IO_PROG -t -c "truncate $LEN" -c "mmap -S 0 $LEN" -c "mwrite 0 $LEN" \ > -c "log_writes -d $LOGWRITES_NAME -m preunmap" \ > -f $SCRATCH_MNT/test > > The second xfs_io command creates a MAP_SYNC mmap of the > SCRATCH_MNT/test file, and the third command memcpy's bytes to the > mapping to invoke the write page fault handler. > > The fourth command tells the dm-logwrites driver for $LOGWRITES_NAME > (aka the block device containing the mounted XFS filesystem) to create a > mark called "preunmap". This mark captures the exact state of the block > device immediately after the write faults complete, so that we can come > back to it later. There are a few things to note here: > > (1) We did not tell the fs to persist anything; > (2) We can't use dm-snapshot here, because dm-snapshot will flush the > fs (I think?); and > (3) The fs is still mounted, so the state of the block device at the > mark reflects a dirty XFS with a log that must be replayed. > > The next thing the test does is unmount the fs, remove the dm-logwrites > driver to stop recording, and check the fs: > > _log_writes_unmount > _log_writes_remove > _dmthin_check_fs > > This ensures that the post-umount fs is consistent. Now we want to roll > back to the place we marked to see if the mwrite data made it to pmem. > It*should* have, since we asked for a MAP_SYNC mapping on a fsdax > filesystem recorded on a pmem device: > > # check pre-unmap state > _log_writes_replay_log preunmap $DMTHIN_VOL_DEV > _dmthin_mount > > dm-logwrites can't actually roll backwards in time to a mark, since it > only records new disk contents. It/can/ however roll forward from > whatever point it began recording writes to the mark, so that's what it > does. > > However -- remember note (3) from earlier. When we _dmthin_mount after > replaying the log to the "preunmap" mark, XFS will see the dirty XFS log > and try to recover the XFS log. 
This is where the replay problems crop > up. The XFS log records a monotonically increasing sequence number > (LSN) with every log update, and when updates are written into the > filesystem, that LSN is also written into the filesystem block. Log > recovery also replays updates into the filesystem, but with the added > behavior that it skips a block replay if the block's LSN is higher than > the transaction being replayed. IOWs, we never replay older block > contents over newer block contents. > > For dm-logwrites this is a major problem, because there could be more > filesystem updates written to the XFS log after the mark is made. LSNs > will then be handed out like this: > > mkfs_lsn preunmap_lsn umount_lsn > | | | > |--------------------------||----------|-----------| > | | > xxx_lsn yyy_lsn > > Let's say that a new metadata block "BBB" was created in update "xxx" > immediately before the preunmap mark was made. Per (1), we didn't flush > the filesystem before taking the mark, which means that the new block's > contents exist only in the log at this point. > > Let us further say that the new block was again changed in update "yyy", > where preunmap_lsn < yyy_lsn <= umount_lsn. Clearly, yyy_lsn > xxx_lsn. > yyy_lsn is written to the block at unmount, because unmounting flushes > the log clean before it completes. This is the first time that BBB ever > gets written. > > _log_writes_replay_log begins replaying the block device from mkfs_lsn > towards preunmap_lsn. When it's done, it will have a log that reflects > all the changes up to preunmap_lsn. Recall however that BBB isn't > written until after the preunmap mark, which means that dm-logwrites has > no record of BBB before preunmap_lsn, so dm-logwrites replay won't touch > BBB. At this point, the block header for BBB has a UUID that matches > the filesystem, but a LSN (yyy_lsn) that is beyond preunmap_lsn. > > XFS log recovery starts up, and finds transaction xxx. 
It will read BBB
> from disk, but then it will see that it has an LSN of yyy_lsn. This is
> larger than xxx_lsn, so it concludes that BBB is newer than the log and
> moves on to the next log item. No other log items touch BBB, so
> recovery finishes, and now we have a filesystem containing one metadata
> block (BBB) from the future. This is an inconsistent filesystem, and
> has caused failures in the tests that use logwrites.
>
> To work around this problem, all we really need to do is reinitialize
> the entire block device to known contents at mkfs time. This can be
> done expensively by writing zeroes to the entire block device, or it can
> be done cheaply by (a) issuing DISCARD to the whole block device at
> the start of the test and (b) ensuring that reads after a discard always
> produce zeroes. mkfs.xfs already does (a), so the test merely has to
> ensure (b).
>
> dm-thinp is the only software solution that provides (b), so that's why
> this test layers dm-logwrites on top of dm-thinp on top of $SCRATCH_DEV.
> This combination used to work, but with the pending pmem/blockdev
> divorce, this strategy is no longer feasible.

Hi Darrick,

Thanks a lot for your detailed explanation.

Could you tell me if my understanding is correct: the issue is that the log-writes log and the XFS log may record different states of the block device. It is possible for the XFS log to hold more updates than the log-writes log does. In that case, we can recover the block device by replaying the log-writes log, but we will get an inconsistent filesystem when mounting the block device, because the mount operation will try to recover the extra XFS updates on the block device from the XFS log.

We need to fix the issue by discarding the XFS log on the block device. mkfs.xfs will try to discard the blocks, including the XFS log, by calling ioctl(BLKDISCARD), but it silently ignores the error when the block device doesn't support ioctl(BLKDISCARD).
Discarding the XFS log is what you meant by "reinitialize the entire block device", right?

> I think the only way to fix this test is (a) revert all of Christoph's
> changes so far and scuttle the divorce; or (b) change this test like so:

Sorry, I don't know which of Christoph's patches need to be reverted. Could you give me the URL of Christoph's patches?

> 1. Create a large sparse file on $TEST_DIR and losetup that sparse
> file. The resulting loop device will not have dax capability.
>
> 2. Set up the dmthin/dmlogwrites stack on top of this loop device.
>
> 3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem
> device) as the realtime device, and set the daxinherit and rtinherit
> flags on the root directory. The result is a filesystem with a data
> section that the kernel will treat as a regular block device, a
> realtime section backed by pmem, and the necessary flags to make
> sure that the test file will actually get fsdax mode.
>
> 4. Acknowledge that we no longer have any way to test MAP_SYNC
> functionality on ext4, which means that generic/470 has to move to
> tests/xfs/.

Sorry, I don't understand why the above test change can fix the issue. Could you tell me which step discards the XFS log?

In addition, I don't like the proposed test change because it would turn generic/470 into an XFS-specific test. Do you know if we can fix the issue by changing the test in another way? blkdiscard -z could fix the issue because it does a zero-fill rather than a discard on the block device. However, blkdiscard -z will take a lot of time when the block device is large.

Best Regards,
Xiao Yang

> --D
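As a rough illustration of why blkdiscard -z is slow on a large device: without working discard, zeroing degrades to actually writing zeroes across the entire device, which is O(device size) in I/O. The sketch below models that with a regular file standing in for the block device (`zero_fill` is a hypothetical helper, not what blkdiscard itself does internally — the real tool issues the BLKZEROOUT ioctl):

```python
import os

def zero_fill(path, chunk=1 << 20):
    """Overwrite the entire file/device with zeroes, chunk by chunk.

    This is effectively what zero-fill degrades to on devices without
    hardware write-zeroes support: one write per chunk over the whole
    length, which is why it takes so long on large devices, whereas a
    discard that guarantees zeroed reads is nearly free.
    """
    size = os.stat(path).st_size
    zeroes = bytes(chunk)
    with open(path, "r+b") as f:
        off = 0
        while off < size:
            n = min(chunk, size - off)
            f.write(zeroes[:n])
            off += n

# Demo on a scratch file standing in for the block device.
path = "/tmp/zerofill-demo.img"
with open(path, "wb") as f:
    f.write(b"\xff" * (3 * 1024 * 1024 + 123))   # "dirty" contents

zero_fill(path)

with open(path, "rb") as f:
    data = f.read()
print(all(b == 0 for b in data), len(data))   # True 3145851
```

At Darrick's quoted 48MB/s, scaling this to a 34GB device gives the multi-minute zeroing time mentioned earlier in the thread, which is exactly the cost dm-thinp's discard-zeroing avoids.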
On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: > On 2022/10/5 2:26, Darrick J. Wong wrote: > > Notice this line in generic/470: > > > > $XFS_IO_PROG -t -c "truncate $LEN" -c "mmap -S 0 $LEN" -c "mwrite 0 $LEN" \ > > -c "log_writes -d $LOGWRITES_NAME -m preunmap" \ > > -f $SCRATCH_MNT/test > > > > The second xfs_io command creates a MAP_SYNC mmap of the > > SCRATCH_MNT/test file, and the third command memcpy's bytes to the > > mapping to invoke the write page fault handler. > > > > The fourth command tells the dm-logwrites driver for $LOGWRITES_NAME > > (aka the block device containing the mounted XFS filesystem) to create a > > mark called "preunmap". This mark captures the exact state of the block > > device immediately after the write faults complete, so that we can come > > back to it later. There are a few things to note here: > > > > (1) We did not tell the fs to persist anything; > > (2) We can't use dm-snapshot here, because dm-snapshot will flush the > > fs (I think?); and > > (3) The fs is still mounted, so the state of the block device at the > > mark reflects a dirty XFS with a log that must be replayed. > > > > The next thing the test does is unmount the fs, remove the dm-logwrites > > driver to stop recording, and check the fs: > > > > _log_writes_unmount > > _log_writes_remove > > _dmthin_check_fs > > > > This ensures that the post-umount fs is consistent. Now we want to roll > > back to the place we marked to see if the mwrite data made it to pmem. > > It*should* have, since we asked for a MAP_SYNC mapping on a fsdax > > filesystem recorded on a pmem device: > > > > # check pre-unmap state > > _log_writes_replay_log preunmap $DMTHIN_VOL_DEV > > _dmthin_mount > > > > dm-logwrites can't actually roll backwards in time to a mark, since it > > only records new disk contents. It/can/ however roll forward from > > whatever point it began recording writes to the mark, so that's what it > > does. 
> > > > However -- remember note (3) from earlier. When we _dmthin_mount after > > replaying the log to the "preunmap" mark, XFS will see the dirty XFS log > > and try to recover the XFS log. This is where the replay problems crop > > up. The XFS log records a monotonically increasing sequence number > > (LSN) with every log update, and when updates are written into the > > filesystem, that LSN is also written into the filesystem block. Log > > recovery also replays updates into the filesystem, but with the added > > behavior that it skips a block replay if the block's LSN is higher than > > the transaction being replayed. IOWs, we never replay older block > > contents over newer block contents. > > > > For dm-logwrites this is a major problem, because there could be more > > filesystem updates written to the XFS log after the mark is made. LSNs > > will then be handed out like this: > >
> > mkfs_lsn                 preunmap_lsn          umount_lsn
> >    |                            |                     |
> >    |--------------------------||----------|-----------|
> >                              |          |
> >                           xxx_lsn    yyy_lsn
> >
> > Let's say that a new metadata block "BBB" was created in update "xxx" > > immediately before the preunmap mark was made. Per (1), we didn't flush > > the filesystem before taking the mark, which means that the new block's > > contents exist only in the log at this point. > > > > Let us further say that the new block was again changed in update "yyy", > > where preunmap_lsn < yyy_lsn <= umount_lsn. Clearly, yyy_lsn > xxx_lsn. > > yyy_lsn is written to the block at unmount, because unmounting flushes > > the log clean before it completes. This is the first time that BBB ever > > gets written. > > > > _log_writes_replay_log begins replaying the block device from mkfs_lsn > > towards preunmap_lsn. When it's done, it will have a log that reflects > > all the changes up to preunmap_lsn.
Recall however that BBB isn't > > written until after the preunmap mark, which means that dm-logwrites has > > no record of BBB before preunmap_lsn, so dm-logwrites replay won't touch > > BBB. At this point, the block header for BBB has a UUID that matches > > the filesystem, but an LSN (yyy_lsn) that is beyond preunmap_lsn. > > > > XFS log recovery starts up, and finds transaction xxx. It will read BBB > > from disk, but then it will see that it has an LSN of yyy_lsn. This is > > larger than xxx_lsn, so it concludes that BBB is newer than the log and > > moves on to the next log item. No other log items touch BBB, so > > recovery finishes, and now we have a filesystem containing one metadata > > block (BBB) from the future. This is an inconsistent filesystem, and > > has caused failures in the tests that use logwrites. > > > > To work around this problem, all we really need to do is reinitialize > > the entire block device to known contents at mkfs time. This can be > > done expensively by writing zeroes to the entire block device, or it can > > be done cheaply by (a) issuing DISCARD to the whole block device at > > the start of the test and (b) ensuring that reads after a discard always > > produce zeroes. mkfs.xfs already does (a), so the test merely has to > > ensure (b). > > > > dm-thinp is the only software solution that provides (b), so that's why > > this test layers dm-logwrites on top of dm-thinp on top of $SCRATCH_DEV. > > This combination used to work, but with the pending pmem/blockdev > > divorce, this strategy is no longer feasible. > > Hi Darrick, > > Thanks a lot for your detailed explanation. > > Could you tell me if my understanding is correct? I think the issue is that > the log-writes log and the XFS log may record different states of the block device. It > is possible for the XFS log to record more updates than the log-writes log does. Yes.
> In this case, we can recover the block device by replaying the log-writes log > but we will get an inconsistent filesystem when mounting the block device > because the mount operation will try to recover more updates for XFS on the > block device via the XFS log. Sort of... > We need to fix the issue by discarding the XFS log on the block device. > mkfs.xfs will try to discard the blocks including the XFS log by calling > ioctl(BLKDISCARD) but it will ignore the error silently when the block > device doesn't support ioctl(BLKDISCARD). ...but here's where I think your understanding isn't correct. It might help to show how the nested logging creates its own problems. First, let's say there's a block B that contains some stale garbage AAAA. XFS writes a block into the XFS log (call the block L) with the instructions "allocate block B and write CCCC to block B". dm-logwrites doesn't know or care about the contents of the blocks that it is told to write; it only knows that XFS told it to write some data (the instructions) to block L. So it remembers the fact that some data got written to L, but it doesn't know about B at all. At the point where we create the dm-logwrites preunmap mark, it's only tracking L. It is not tracking B. After the mark is taken, the XFS AIL writes CCCC to B, and only then does dm-logwrites begin tracking B. Hence B is not included in the preunmap mark. The pre-unmount process in XFS writes to the XFS log "write DDDD to block B" and the unmount process checkpoints the log contents, so now block B contains DDDD. Now the test wants to roll to the preunmap mark. Unfortunately, dm-logwrites doesn't record former block contents, which means that the log replay tools cannot roll backwards from "umount" to "preunmap" -- they can only roll forward from the beginning. So there's no way to undo writing DDDD or CCCC to B. IOWs, there's no way to revert B's state back to AAAA when doing dm-logwrites recovery. Now XFS log recovery starts.
It sees "allocate block B and write CCCC to block B". However, it reads block B, sees that it contains DDDD, and it skips writing CCCC. Incorrectly. The only way to avoid this is to zero B before replaying the dm-logwrites. So you could solve the problem via BLKDISCARD, or writing zeroes to the entire block device, or scanning the metadata and writing zeroes to just those blocks, or by adding undo buffer records to dm-logwrites and teaching it to do a proper rollback. > Discarding XFS log is what you said "reinitialize the entire block > device", right? No, I really meant the /entire/ block device. > > > > I think the only way to fix this test is (a) revert all of Christoph's > > changes so far and scuttle the divorce; or (b) change this test like so: > > Sorry, I didn't know which Christoph's patches need to be reverted? > Could you tell me the URL about Christoph's patches? Eh, it's a whole long series of patches scuttling various parts where pmem could talk to the block layer. I doubt he'll accept you reverting his removal code. > > 1. Create a large sparse file on $TEST_DIR and losetup that sparse > > file. The resulting loop device will not have dax capability. > > > > 2. Set up the dmthin/dmlogwrites stack on top of this loop device. > > > > 3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem > > device) as the realtime device, and set the daxinherit and rtinherit > > flags on the root directory. The result is a filesystem with a data > > section that the kernel will treat as a regular block device, a > > realtime section backed by pmem, and the necessary flags to make > > sure that the test file will actually get fsdax mode. > > > > 4. Acknowledge that we no longer have any way to test MAP_SYNC > > functionality on ext4, which means that generic/470 has to move to > > tests/xfs/. > > Sorry, I didn't understand why the above test change can fix the issue. XFS supports two-device filesystems -- the "realtime" section, and the "data" section. 
FS metadata and log all live in the "data" section. So we change the test to set up some regular files, loop-mount the files, set up the requisite dm-logwrites stuff atop the loop devices, and format the XFS with the data section backed by the dm-logwrites device, and make the realtime section backed by the pmem. This way the log replay program can actually discard the data device (because it's a regular file) and replay the log forward to the preunmap mark. The pmem device is not involved in the replay at all, since changes to file data are never logged. It now becomes irrelevant that pmem no longer supports device mapper. > Could you tell me which step can discard XFS log? (None of the steps do that.) > In addition, I don't like your idea about the test change because it will > make generic/470 become the special test for XFS. Do you know if we can fix > the issue by changing the test in another way? blkdiscard -z can fix the > issue because it does zero-fill rather than discard on the block device. > However, blkdiscard -z will take a lot of time when the block device is > large. Well we /could/ just do that too, but that will suck if you have 2TB of pmem. ;) Maybe as an alternative path we could just create a very small filesystem on the pmem and then blkdiscard -z it? That said -- does persistent memory actually have a future? Intel scuttled the entire Optane product, cxl.mem sounds like expansion chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel asserts everywhere) and 6.1 (every time I run fstests now I see massive data corruption). Frankly at this point I'm tempted just to turn off fsdax support for XFS for the 6.1 LTS because I don't have time to fix it. --D > > Best Regards, > Xiao Yang > > > > --D
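The LSN gate that defeats the dm-logwrites replay can be modelled in a few lines. This is a simplified illustration of the behaviour described in the messages above, not the actual XFS xlog-recovery code; the names `Block` and `replay_item` are invented for the example:

```python
# Toy model of LSN-gated log replay (illustrative only, not real XFS code).
# Each metadata block remembers the LSN of the last transaction written to
# it; recovery skips a log item when the on-disk LSN is already newer.

class Block:
    """A metadata block stamped with the LSN of its last writer."""
    def __init__(self, data, lsn):
        self.data = data
        self.lsn = lsn

def replay_item(block, item_lsn, new_data):
    """Replay one log item, honouring the LSN gate; True if applied."""
    if block.lsn >= item_lsn:
        return False            # block looks newer than the log item: skip
    block.data = new_data
    block.lsn = item_lsn
    return True

# dm-logwrites rolled forward only to the preunmap mark, but block B was
# first written *after* the mark, so it still holds DDDD stamped yyy_lsn.
B = Block("DDDD", lsn=300)                # yyy_lsn = 300

# Recovery replays transaction xxx (xxx_lsn = 200 < 300): skipped, and the
# filesystem keeps a metadata block "from the future".
assert replay_item(B, 200, "CCCC") is False
assert B.data == "DDDD"

# If the device is zeroed first (BLKDISCARD / zero-fill), the stale LSN is
# gone and the very same replay applies cleanly.
B = Block("\0\0\0\0", lsn=0)
assert replay_item(B, 200, "CCCC") is True
assert B.data == "CCCC"
```

This also shows why zeroing only needs to happen once, before replay begins: it resets every stale LSN stamp that the roll-forward cannot reach.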
On 2022/10/22 10:11, Darrick J. Wong wrote: >> We need to fix the issue by discarding XFS log on the block device. >> mkfs.xfs will try to discard the blocks including XFS log by calling >> ioctl(BLKDISCARD) but it will ignore error silently when the block >> device doesn't support ioctl(BLKDISCARD). > ...but here's where I think your understanding isn't correct. > It might help to show how the nested logging creates its own problems. > > First, let's say there's a block B that contains some stale garbage > AAAA. > > XFS writes a block into the XFS log (call the block L) with the > instructions "allocate block B and write CCCC to block B". > dm-logwrites doesn't know or care about the contents of the blocks > that it is told to write; it only knows that XFS told it to write some > data (the > instructions) to block L. So it remembers the fact that some data got > written to L, but it doesn't know about B at all. > > At the point where we create the dm-logwrites preunmap mark, it's only > tracking L. It is not tracking B. After the mark is taken, the XFS > AIL writes CCCC to B, and only then does dm-logwrites begin tracking B. > Hence B is not included in the preunmap mark. The pre-unmount process > in XFS writes to the XFS log "write DDDD to block B" and the unmount > process checkpoints the log contents, so now block B contains > DDDD. > > Now the test wants to roll to the preunmap mark. Unfortunately, > dm-logwrites doesn't record former block contents, which means that > the log replay tools cannot roll backwards from "umount" to "preunmap" > -- they can only roll forward from the beginning. So there's no way > to undo writing DDDD or CCCC to B. IOWs, there's no way to revert B's > state back to AAAA when doing dm-logwrites recovery. > > Now XFS log recovery starts. It sees "allocate block B and write CCCC > to block B". However, it reads block B, sees that it contains DDDD, > and it skips writing CCCC. Incorrectly.
The only way to avoid this > is to zero B before replaying the dm-logwrites. > > So you could solve the problem via BLKDISCARD, or writing zeroes to > the entire block device, or scanning the metadata and writing zeroes > to just those blocks, or by adding undo buffer records to dm-logwrites > and teaching it to do a proper rollback. Hi Darrick, Thanks for your patient explanation. Do you know if XFS log records still use buffer write? In other words, they cannot be written into the block device in DAX mode, right? In fact, I can reproduce the inconsistent filesystem issue on generic/482 but cannot reproduce the issue on generic/470. > >> Discarding XFS log is what you said "reinitialize the entire block >> device", right? > No, I really meant the /entire/ block device. > >>> I think the only way to fix this test is (a) revert all of >>> Christoph's changes so far and scuttle the divorce; or (b) change this test like so: >> Sorry, I didn't know which Christoph's patches need to be reverted? >> Could you tell me the URL about Christoph's patches? > Eh, it's a whole long series of patches scuttling various parts where > pmem could talk to the block layer. I doubt he'll accept you > reverting his removal code. Where can I find Christoph's patch set you mentioned? I just want to know the content of Christoph's patch set. > >>> 1. Create a large sparse file on $TEST_DIR and losetup that sparse >>> file. The resulting loop device will not have dax capability. >>> >>> 2. Set up the dmthin/dmlogwrites stack on top of this loop device. >>> >>> 3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem >>> device) as the realtime device, and set the daxinherit and rtinherit >>> flags on the root directory. The result is a filesystem with a data >>> section that the kernel will treat as a regular block device, a >>> realtime section backed by pmem, and the necessary flags to make >>> sure that the test file will actually get fsdax mode. >>> >>> 4.
Acknowledge that we no longer have any way to test MAP_SYNC >>> functionality on ext4, which means that generic/470 has to move to >>> tests/xfs/. >> Sorry, I didn't understand why the above test change can fix the issue. > XFS supports two-device filesystems -- the "realtime" section, and the > "data" section. FS metadata and log all live in the "data" section. > > So we change the test to set up some regular files, loop-mount the > files, set up the requisite dm-logwrites stuff atop the loop devices, > and format the XFS with the data section backed by the dm-logwrites > device, and make the realtime section backed by the pmem. > > This way the log replay program can actually discard the data device > (because it's a regular file) and replay the log forward to the > preunmap mark. The pmem device is not involved in the replay at all, > since changes to file data are never logged. It now becomes > irrelevant that pmem no longer supports device mapper. > >> Could you tell me which step can discard XFS log? > (None of the steps do that.) > >> In addition, I don't like your idea about the test change because it >> will make generic/470 become the special test for XFS. Do you know if >> we can fix the issue by changing the test in another way? blkdiscard >> -z can fix the issue because it does zero-fill rather than discard on the block device. >> However, blkdiscard -z will take a lot of time when the block device >> is large. > Well we /could/ just do that too, but that will suck if you have 2TB > of > pmem. ;) > > Maybe as an alternative path we could just create a very small > filesystem on the pmem and then blkdiscard -z it? Good idea, I have sent a patch set to do it. https://lore.kernel.org/fstests/20221023064810.847110-1-yangx.jy@fujitsu.com/T/#t > > That said -- does persistent memory actually have a future?
Intel > scuttled the entire Optane product, cxl.mem sounds like expansion > chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird > kernel asserts everywhere) and 6.1 (every time I run fstests now I see > massive data corruption). As far as I know, cxl.mem will make use of the nvdimm driver and can be used by many existing applications. Best Regards, Xiao Yang
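As an aside on the `blkdiscard -z` approach discussed above: it zero-fills the device (via the BLKZEROOUT ioctl) instead of discarding, which is what makes it slow on large devices and cheap on a deliberately small filesystem. Its effect can be approximated on a regular image file with this sketch (illustrative only; a real block device needs root and the util-linux tool, and the chunked loop is just one simple way to bound memory use):

```python
# Emulate the zero-fill that "blkdiscard -z" performs, on an image file
# standing in for a (small) pmem filesystem.  Sketch only, not BLKZEROOUT.
import os
import tempfile

def zero_range(path, offset, length, chunk=1 << 20):
    """Overwrite [offset, offset+length) with zeroes, one chunk at a time,
    so later reads of that range return zeroes like reads after discard."""
    with open(path, "r+b") as f:
        f.seek(offset)
        remaining = length
        while remaining:
            n = min(chunk, remaining)
            f.write(b"\0" * n)
            remaining -= n

# Demo: fill an image with stale garbage (think: old log blocks with high
# LSN stamps), then "discard" it and verify reads come back zeroed.
fd, img = tempfile.mkstemp()
os.close(fd)
with open(img, "wb") as f:
    f.write(b"\xaa" * 4096)
zero_range(img, 0, 4096)
with open(img, "rb") as f:
    assert f.read() == b"\0" * 4096
os.unlink(img)
```

Keeping the filesystem small, as suggested above, bounds how much data this zero pass has to write.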
On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: > On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: > > In addition, I don't like your idea about the test change because it will > > make generic/470 become the special test for XFS. Do you know if we can fix > > the issue by changing the test in another way? blkdiscard -z can fix the > > issue because it does zero-fill rather than discard on the block device. > > However, blkdiscard -z will take a lot of time when the block device is > > large. > > Well we /could/ just do that too, but that will suck if you have 2TB of > pmem. ;) > > Maybe as an alternative path we could just create a very small > filesystem on the pmem and then blkdiscard -z it? > > That said -- does persistent memory actually have a future? Intel > scuttled the entire Optane product, cxl.mem sounds like expansion > chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel > asserts everywhere) and 6.1 (every time I run fstests now I see massive > data corruption). Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I don't think that has changed at all - I still see lots of kernel warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid argument" errors. 
If I turn off reflink, then instead of data corruption I get kernel warnings like this from fsx and fsstress workloads: [415478.558426] ------------[ cut here ]------------ [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320 [415478.564028] Modules linked in: [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G W 6.1.0-rc1-dgc+ #1615 [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320 [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6 [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002 [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001 [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840 [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000 [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58 [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000 [415478.596983] FS: 00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000 [415478.599364] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0 [415478.602883] Call Trace: [415478.603598] <TASK> [415478.604229] dax_fault_iter+0x240/0x600 [415478.605410] dax_iomap_pte_fault+0x19c/0x3d0 [415478.606706] __xfs_filemap_fault+0x1dd/0x2b0 [415478.607744] __do_fault+0x2e/0x1d0 [415478.608587] __handle_mm_fault+0xcec/0x17b0 [415478.609593] handle_mm_fault+0xd0/0x2a0 [415478.610517] exc_page_fault+0x1d9/0x810 [415478.611398] asm_exc_page_fault+0x22/0x30 [415478.612311] RIP: 0033:0x7fd71a04b9ba [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 
f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206 [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610 [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5 [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640 [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699 [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001 [415478.625600] </TASK> [415478.626087] ---[ end trace 0000000000000000 ]--- Even generic/247 is generating a warning like this from xfs_io, which is a mmap vs DIO racer. Given that DIO doesn't exist for fsdax, this test turns into just a normal write() vs mmap() racer. Given these are the same fsdax infrastructure failures that I reported for 6.0, it is also likely that ext4 is still throwing them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the 6.1 cycle. > Frankly at this point I'm tempted just to turn of fsdax support for XFS > for the 6.1 LTS because I don't have time to fix it. /me shrugs Backporting fixes (whenever they come along) is a problem for the LTS kernel maintainer to deal with, not the upstream maintainer. IMO, the issue right now is that the DAX maintainers seem to have little interest in ensuring that the FSDAX infrastructure actually works correctly. If anything, they seem to want to make things harder for block based filesystems to use pmem devices and hence FSDAX. e.g. the direction of the DAX core away from block interfaces that filesystems need for their userspace tools to manage the storage. At what point do we simply say "the experiment failed, FSDAX is dead" and remove it from XFS altogether? Cheers, Dave.
On 2022/10/24 6:00, Dave Chinner wrote: > On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: >> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: >>> In addition, I don't like your idea about the test change because it will >>> make generic/470 become the special test for XFS. Do you know if we can fix >>> the issue by changing the test in another way? blkdiscard -z can fix the >>> issue because it does zero-fill rather than discard on the block device. >>> However, blkdiscard -z will take a lot of time when the block device is >>> large. >> >> Well we /could/ just do that too, but that will suck if you have 2TB of >> pmem. ;) >> >> Maybe as an alternative path we could just create a very small >> filesystem on the pmem and then blkdiscard -z it? >> >> That said -- does persistent memory actually have a future? Intel >> scuttled the entire Optane product, cxl.mem sounds like expansion >> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel >> asserts everywhere) and 6.1 (every time I run fstests now I see massive >> data corruption). > > Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken > on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I > don't think that has changed at all - I still see lots of kernel > warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid > argument" errors. Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is caused by the restrictions which prevent reflink from working together with DAX: a. fs/xfs/xfs_ioctl.c:1141 /* Don't allow us to set DAX mode for a reflinked file for now. */ if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip)) return -EINVAL; b. fs/xfs/xfs_iops.c:1174 /* Only supported on non-reflinked files. */ if (xfs_is_reflink_inode(ip)) return false; These restrictions were removed in the "drop experimental warning" patch [1]. I think they should be separated from that patch.
[1] https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/ Secondly, how the data corruption happened? Or which case failed? Could you give me more info (such as mkfs options, xfstests configs)? > > If I turn off reflink, then instead of data corruption I get kernel > warnings like this from fsx and fsstress workloads: > > [415478.558426] ------------[ cut here ]------------ > [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320 > [415478.564028] Modules linked in: > [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G W 6.1.0-rc1-dgc+ #1615 > [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 > [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320 > [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6 > [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002 > [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001 > [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840 > [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000 > [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58 > [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000 > [415478.596983] FS: 00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000 > [415478.599364] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0 > [415478.602883] Call Trace: > [415478.603598] <TASK> > [415478.604229] dax_fault_iter+0x240/0x600 > [415478.605410] dax_iomap_pte_fault+0x19c/0x3d0 > [415478.606706] __xfs_filemap_fault+0x1dd/0x2b0 > [415478.607744] __do_fault+0x2e/0x1d0 > [415478.608587] 
__handle_mm_fault+0xcec/0x17b0 > [415478.609593] handle_mm_fault+0xd0/0x2a0 > [415478.610517] exc_page_fault+0x1d9/0x810 > [415478.611398] asm_exc_page_fault+0x22/0x30 > [415478.612311] RIP: 0033:0x7fd71a04b9ba > [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 > [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206 > [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610 > [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5 > [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640 > [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699 > [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001 > [415478.625600] </TASK> > [415478.626087] ---[ end trace 0000000000000000 ]--- > > Even generic/247 is generating a warning like this from xfs_io, > which is a mmap vs DIO racer. Given that DIO doesn't exist for > fsdax, this test turns into just a normal write() vs mmap() racer. > > Given these are the same fsdax infrastructure failures that I > reported for 6.0, it is also likely that ext4 is still throwing > them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the > 6.1 cycle. Still working on it... > >> Frankly at this point I'm tempted just to turn of fsdax support for XFS >> for the 6.1 LTS because I don't have time to fix it. > > /me shrugs > > Backporting fixes (whenever they come along) is a problem for the > LTS kernel maintainer to deal with, not the upstream maintainer. > > IMO, the issue right now is that the DAX maintainers seem to have > little interest in ensuring that the FSDAX infrastructure actually > works correctly. If anything, they seem to want to make things > harder for block based filesystems to use pmem devices and hence > FSDAX. 
e.g. the direction of the DAX core away from block interfaces > that filesystems need for their userspace tools to manage the > storage. > > At what point do we simply say "the experiment failed, FSDAX is > dead" and remove it from XFS altogether? I'll hurry up and try my best to solve these problems. -- Thanks, Ruan. > > Cheers, > > Dave.
On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote: > 在 2022/10/24 6:00, Dave Chinner 写道: > > On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: > >> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: > >>> In addition, I don't like your idea about the test change because it will > >>> make generic/470 become the special test for XFS. Do you know if we can fix > >>> the issue by changing the test in another way? blkdiscard -z can fix the > >>> issue because it does zero-fill rather than discard on the block device. > >>> However, blkdiscard -z will take a lot of time when the block device is > >>> large. > >> > >> Well we /could/ just do that too, but that will suck if you have 2TB of > >> pmem. ;) > >> > >> Maybe as an alternative path we could just create a very small > >> filesystem on the pmem and then blkdiscard -z it? > >> > >> That said -- does persistent memory actually have a future? Intel > >> scuttled the entire Optane product, cxl.mem sounds like expansion > >> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel > >> asserts everywhere) and 6.1 (every time I run fstests now I see massive > >> data corruption). > > > > Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken > > on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I > > don't think that has changed at all - I still see lots of kernel > > warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid > > argument" errors. > > Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is > caused by the restrictions which prevent reflink work together with DAX: > > a. fs/xfs/xfs_ioctl.c:1141 > /* Don't allow us to set DAX mode for a reflinked file for now. */ > if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip)) > return -EINVAL; > > b. fs/xfs/xfs_iops.c:1174 > /* Only supported on non-reflinked files. */ > if (xfs_is_reflink_inode(ip)) > return false; Yes... 
> These restrictions were removed in "drop experimental warning" patch[1]. > I think they should be separated from that patch. ...and yes. > > [1] > https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/ > > > Secondly, how the data corruption happened? Or which case failed? Could > you give me more info (such as mkfs options, xfstests configs)? > > > > > If I turn off reflink, then instead of data corruption I get kernel > > warnings like this from fsx and fsstress workloads: > > > > [415478.558426] ------------[ cut here ]------------ > > [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320 > > [415478.564028] Modules linked in: > > [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G W 6.1.0-rc1-dgc+ #1615 > > [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 > > [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320 > > [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6 > > [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002 > > [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001 > > [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840 > > [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000 > > [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58 > > [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000 > > [415478.596983] FS: 00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000 > > [415478.599364] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0 > > [415478.602883] Call Trace: > > [415478.603598] <TASK> > > [415478.604229] 
dax_fault_iter+0x240/0x600 > > [415478.605410] dax_iomap_pte_fault+0x19c/0x3d0 > > [415478.606706] __xfs_filemap_fault+0x1dd/0x2b0 > > [415478.607744] __do_fault+0x2e/0x1d0 > > [415478.608587] __handle_mm_fault+0xcec/0x17b0 > > [415478.609593] handle_mm_fault+0xd0/0x2a0 > > [415478.610517] exc_page_fault+0x1d9/0x810 > > [415478.611398] asm_exc_page_fault+0x22/0x30 > > [415478.612311] RIP: 0033:0x7fd71a04b9ba > > [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 > > [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206 > > [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610 > > [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5 > > [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640 > > [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699 > > [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001 > > [415478.625600] </TASK> > > [415478.626087] ---[ end trace 0000000000000000 ]--- > > > > Even generic/247 is generating a warning like this from xfs_io, > > which is a mmap vs DIO racer. Given that DIO doesn't exist for > > fsdax, this test turns into just a normal write() vs mmap() racer. > > > > Given these are the same fsdax infrastructure failures that I > > reported for 6.0, it is also likely that ext4 is still throwing > > them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the > > 6.1 cycle. > > Still working on it... You'll have to port the entire FALLOC_FL_FUNSHARE code path to fsdax too -- it uses the page cache to stage the COW, which then confuses fsdax when it finds and trips over DRAM pages in the mapping. 
That eliminates one of the warnings (to be fair I just EONOTSUPP'd FUNSHARE to make that path go away) but it still produced massive data corruption. > > > >> Frankly at this point I'm tempted just to turn of fsdax support for XFS > >> for the 6.1 LTS because I don't have time to fix it. > > > > /me shrugs > > > > Backporting fixes (whenever they come along) is a problem for the > > LTS kernel maintainer to deal with, not the upstream maintainer. > > > > IMO, the issue right now is that the DAX maintainers seem to have > > little interest in ensuring that the FSDAX infrastructure actually > > works correctly. If anything, they seem to want to make things > > harder for block based filesystems to use pmem devices and hence > > FSDAX. e.g. the direction of the DAX core away from block interfaces > > that filesystems need for their userspace tools to manage the > > storage. > > > > At what point do we simply say "the experiment failed, FSDAX is > > dead" and remove it from XFS altogether? We no longer have any pmem products in our pipeline, so I will just say that if the corruption problems aren't resolved by the end of 6.1-rcX I'm hiding fsdax support behind CONFIG_XFS_DEBUG or just turning it off entirely. I don't want to burden whoever becomes the 6.1 XFS LTS maintainer with a slew of fsdax data corruption errors. > I'll hurry up and try my best to solve these problems. Ok, thank you. :) --D > > > -- > Thanks, > Ruan. > > > > > Cheers, > > > > Dave.
On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote: > 在 2022/10/24 6:00, Dave Chinner 写道: > > On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: > >> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: > >>> In addition, I don't like your idea about the test change because it will > >>> make generic/470 become the special test for XFS. Do you know if we can fix > >>> the issue by changing the test in another way? blkdiscard -z can fix the > >>> issue because it does zero-fill rather than discard on the block device. > >>> However, blkdiscard -z will take a lot of time when the block device is > >>> large. > >> > >> Well we /could/ just do that too, but that will suck if you have 2TB of > >> pmem. ;) > >> > >> Maybe as an alternative path we could just create a very small > >> filesystem on the pmem and then blkdiscard -z it? > >> > >> That said -- does persistent memory actually have a future? Intel > >> scuttled the entire Optane product, cxl.mem sounds like expansion > >> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel > >> asserts everywhere) and 6.1 (every time I run fstests now I see massive > >> data corruption). > > > > Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken > > on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I > > don't think that has changed at all - I still see lots of kernel > > warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid > > argument" errors. > > Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is > caused by the restrictions which prevent reflink work together with DAX: > > a. fs/xfs/xfs_ioctl.c:1141 > /* Don't allow us to set DAX mode for a reflinked file for now. */ > if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip)) > return -EINVAL; > > b. fs/xfs/xfs_iops.c:1174 > /* Only supported on non-reflinked files. 
*/ > if (xfs_is_reflink_inode(ip)) > return false; > > These restrictions were removed in "drop experimental warning" patch[1]. > I think they should be separated from that patch. > > [1] > https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/ > > > Secondly, how the data corruption happened? No idea - I'm just reporting that lots of fsx tests failed with data corruptions. I haven't had time to look at why, I'm still trying to sort out the fix for a different data corruption... > Or which case failed? *lots* of them failed with kernel warnings with reflink turned off: SECTION -- xfs_dax_noreflink ========================= Failures: generic/051 generic/068 generic/075 generic/083 generic/112 generic/127 generic/198 generic/231 generic/247 generic/269 generic/270 generic/340 generic/344 generic/388 generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011 xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538 Failed 26 of 1079 tests All of those except xfs/073 and generic/471 are failures due to warnings found in dmesg. With reflink enabled, I terminated the run after g/075, g/091, g/112 and generic/127 reported fsx data corruptions and g/051, g/068, g/075 and g/083 had reported kernel warnings in dmesg. > Could > you give me more info (such as mkfs options, xfstests configs)? They are exactly the same as last time I reported these problems. For the "no reflink" test issues: mkfs options are "-m reflink=0,rmapbt=1", mount options "-o dax=always" for both filesystems.
I don't have anything special set up - the kernel config is kept minimal for these VMs - and the only kernel debug option I have turned on for these specific test runs is CONFIG_XFS_DEBUG=y. The only difference between the noreflink and reflink runs is that I drop the "-m reflink=0" mkfs parameter. Otherwise they are identical and the errors I reported are from back-to-back fstests runs without rebooting the VM.... -Dave.
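Pulled together, the setup Dave describes amounts to a kernel command line plus an fstests local.config along these lines. The TEST_DEV/TEST_DIR values are assumptions (only the scratch device appears in the quoted output); the section name is taken from the test run:

```
# kernel command line: carve two fake 8GB pmem regions out of RAM
memmap=8G!15G,8G!24G

# fstests local.config (sourced as shell), matching the reported options
[xfs_dax_noreflink]
FSTYP=xfs
TEST_DEV=/dev/pmem0          # assumption: first fake region
TEST_DIR=/mnt/test           # assumption
SCRATCH_DEV=/dev/pmem1
SCRATCH_MNT=/mnt/scratch
MKFS_OPTIONS="-f -m reflink=0,rmapbt=1"
MOUNT_OPTIONS="-o dax=always"
```

With that in place, the failing subset can be re-run with e.g. `./check -s xfs_dax_noreflink generic/051 generic/068 generic/075`.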
Dave Chinner wrote: > On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: > > On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: > > > In addition, I don't like your idea about the test change because it will > > > make generic/470 become the special test for XFS. Do you know if we can fix > > > the issue by changing the test in another way? blkdiscard -z can fix the > > > issue because it does zero-fill rather than discard on the block device. > > > However, blkdiscard -z will take a lot of time when the block device is > > > large. > > > > Well we /could/ just do that too, but that will suck if you have 2TB of > > pmem. ;) > > > > Maybe as an alternative path we could just create a very small > > filesystem on the pmem and then blkdiscard -z it? > > > > That said -- does persistent memory actually have a future? Intel > > scuttled the entire Optane product, cxl.mem sounds like expansion > > chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel > > asserts everywhere) and 6.1 (every time I run fstests now I see massive > > data corruption). > > Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken > on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I > don't think that has changed at all - I still see lots of kernel > warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid > argument" errors. 
> > If I turn off reflink, then instead of data corruption I get kernel > warnings like this from fsx and fsstress workloads: > > [415478.558426] ------------[ cut here ]------------ > [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320 > [415478.564028] Modules linked in: > [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G W 6.1.0-rc1-dgc+ #1615 > [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 > [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320 > [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6 > [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002 > [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001 > [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840 > [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000 > [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58 > [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000 > [415478.596983] FS: 00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000 > [415478.599364] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0 > [415478.602883] Call Trace: > [415478.603598] <TASK> > [415478.604229] dax_fault_iter+0x240/0x600 > [415478.605410] dax_iomap_pte_fault+0x19c/0x3d0 > [415478.606706] __xfs_filemap_fault+0x1dd/0x2b0 > [415478.607744] __do_fault+0x2e/0x1d0 > [415478.608587] __handle_mm_fault+0xcec/0x17b0 > [415478.609593] handle_mm_fault+0xd0/0x2a0 > [415478.610517] exc_page_fault+0x1d9/0x810 > [415478.611398] asm_exc_page_fault+0x22/0x30 > [415478.612311] RIP: 0033:0x7fd71a04b9ba > [415478.613168] Code: 4d 29 c1 
4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 > [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206 > [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610 > [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5 > [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640 > [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699 > [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001 > [415478.625600] </TASK> > [415478.626087] ---[ end trace 0000000000000000 ]--- > > Even generic/247 is generating a warning like this from xfs_io, > which is a mmap vs DIO racer. Given that DIO doesn't exist for > fsdax, this test turns into just a normal write() vs mmap() racer. > > Given these are the same fsdax infrastructure failures that I > reported for 6.0, it is also likely that ext4 is still throwing > them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the > 6.1 cycle. > > > Frankly at this point I'm tempted just to turn of fsdax support for XFS > > for the 6.1 LTS because I don't have time to fix it. > > /me shrugs > > Backporting fixes (whenever they come along) is a problem for the > LTS kernel maintainer to deal with, not the upstream maintainer. > > IMO, the issue right now is that the DAX maintainers seem to have > little interest in ensuring that the FSDAX infrastructure actually > works correctly. If anything, they seem to want to make things > harder for block based filesystems to use pmem devices and hence > FSDAX. e.g. the direction of the DAX core away from block interfaces > that filesystems need for their userspace tools to manage the > storage. > > At what point do we simply say "the experiment failed, FSDAX is > dead" and remove it from XFS altogether? 
A fair question, given that the regressions made it all the way into v6.0-final. In retrospect I made the wrong priority call to focus on dax page reference counting these past weeks. When I fired up the dax unit tests on v6.0-rc1 I found basic problems with the notify failure patches, which concerned me because they appeared never to have been tested after the final version was merged [1]. Then the rest of the development cycle was spent fixing dax reference counting [2]. That was a longstanding wishlist item from gup and folio developers, but, as I said, it seems the wrong priority given the lingering regressions. I will take a look at the current dax-xfstests regression backlog. That may show a need to revert the problematic commits, depending on what is still broken and whether the fixes are trending towards being invasive. [1]: https://lore.kernel.org/all/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/ [2]: https://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/
在 2022/10/24 13:31, Dave Chinner 写道: > On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote: >> 在 2022/10/24 6:00, Dave Chinner 写道: >>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: >>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: >>>>> In addition, I don't like your idea about the test change because it will >>>>> make generic/470 become the special test for XFS. Do you know if we can fix >>>>> the issue by changing the test in another way? blkdiscard -z can fix the >>>>> issue because it does zero-fill rather than discard on the block device. >>>>> However, blkdiscard -z will take a lot of time when the block device is >>>>> large. >>>> >>>> Well we /could/ just do that too, but that will suck if you have 2TB of >>>> pmem. ;) >>>> >>>> Maybe as an alternative path we could just create a very small >>>> filesystem on the pmem and then blkdiscard -z it? >>>> >>>> That said -- does persistent memory actually have a future? Intel >>>> scuttled the entire Optane product, cxl.mem sounds like expansion >>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel >>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive >>>> data corruption). >>> >>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken >>> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I >>> don't think that has changed at all - I still see lots of kernel >>> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid >>> argument" errors. >> >> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is >> caused by the restrictions which prevent reflink work together with DAX: >> >> a. fs/xfs/xfs_ioctl.c:1141 >> /* Don't allow us to set DAX mode for a reflinked file for now. */ >> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip)) >> return -EINVAL; >> >> b. fs/xfs/xfs_iops.c:1174 >> /* Only supported on non-reflinked files. 
*/ >> if (xfs_is_reflink_inode(ip)) >> return false; >> >> These restrictions were removed in "drop experimental warning" patch[1]. >> I think they should be separated from that patch. >> >> [1] >> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/ >> >> >> Secondly, how the data corruption happened? > > No idea - i"m just reporting that lots of fsx tests failed with data > corruptions. I haven't had time to look at why, I'm still trying to > sort out the fix for a different data corruption... > >> Or which case failed? > > *lots* of them failed with kernel warnings with reflink turned off: > > SECTION -- xfs_dax_noreflink > ========================= > Failures: generic/051 generic/068 generic/075 generic/083 > generic/112 generic/127 generic/198 generic/231 generic/247 > generic/269 generic/270 generic/340 generic/344 generic/388 > generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011 > xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538 > Failed 26 of 1079 tests > > All of those except xfs/073 and generic/471 are failures due to > warnings found in dmesg. > > With reflink enabled, I terminated the run after g/075, g/091, g/112 > and generic/127 reported fsx data corruptions and g/051, g/068, > g/075 and g/083 had reported kernel warnings in dmesg. > >> Could >> you give me more info (such as mkfs options, xfstests configs)? > > They are exactly the same as last time I reported these problems. > > For the "no reflink" test issues: > > mkfs options are "-m reflink=0,rmapbt=1", mount options "-o > dax=always" for both filesytems. 
Config output at start of test > run: > > SECTION -- xfs_dax_noreflink > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022 > MKFS_OPTIONS -- -f -m reflink=0,rmapbt=1 /dev/pmem1 > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch > > pmem devices are a pair of fake 8GB pmem regions set up by kernel > CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up > - the kernel config is kept minimal for these VMs - and the only > kernel debug option I have turned on for these specific test runs is > CONFIG_XFS_DEBUG=y. Thanks for the detailed info. But, in my environment (and my colleagues', and our real server with DCPMM) these failure cases (you mentioned above, in dax+non_reflink mode, with same test options) cannot be reproduced. Here's our test environment info: - Ruan's env: fedora 36(v6.0-rc1) on kvm,pmem 2x4G:file-backed - Yang's env: fedora 35(v6.1-rc1) on kvm,pmem 2x1G:memmap=1G!1G,1G!2G - Server's env: Ubuntu 20.04(v6.0-rc1) real machine,pmem 2x4G:real DCPMM (To quickly confirm the difference, I just ran the failed 26 cases you mentioned above.) Except for generic/471 and generic/519, which failed even when dax is off, the rest passed. We don't want fsdax to be turned off. Right now, I think the most important thing is solving the failed cases in dax+non_reflink mode. So, firstly, I have to reproduce those failures. Is there anything wrong with my test environments? I know you are using 'memmap=XXG!YYG' to simulate pmem. So, (to Darrick) could you show me your config of dev environment and the 'testcloud' (I am guessing it's a server with real nvdimm just like ours)? (I just found I only tested on 4G and smaller pmem device. I'll try the test on 8G pmem) > > The only difference between the noreflink and reflink runs is that I > drop the "-m reflink=0" mkfs parameter.
Otherwise they are identical > and the errors I reported are from back-to-back fstests runs without > rebooting the VM.... > > -Dave. -- Thanks, Ruan.
On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote: > > > 在 2022/10/24 13:31, Dave Chinner 写道: > > On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote: > >> 在 2022/10/24 6:00, Dave Chinner 写道: > >>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: > >>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: > >>>>> In addition, I don't like your idea about the test change because it will > >>>>> make generic/470 become the special test for XFS. Do you know if we can fix > >>>>> the issue by changing the test in another way? blkdiscard -z can fix the > >>>>> issue because it does zero-fill rather than discard on the block device. > >>>>> However, blkdiscard -z will take a lot of time when the block device is > >>>>> large. > >>>> > >>>> Well we /could/ just do that too, but that will suck if you have 2TB of > >>>> pmem. ;) > >>>> > >>>> Maybe as an alternative path we could just create a very small > >>>> filesystem on the pmem and then blkdiscard -z it? > >>>> > >>>> That said -- does persistent memory actually have a future? Intel > >>>> scuttled the entire Optane product, cxl.mem sounds like expansion > >>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel > >>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive > >>>> data corruption). > >>> > >>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken > >>> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I > >>> don't think that has changed at all - I still see lots of kernel > >>> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid > >>> argument" errors. > >> > >> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is > >> caused by the restrictions which prevent reflink work together with DAX: > >> > >> a. fs/xfs/xfs_ioctl.c:1141 > >> /* Don't allow us to set DAX mode for a reflinked file for now. 
*/ > >> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip)) > >> return -EINVAL; > >> > >> b. fs/xfs/xfs_iops.c:1174 > >> /* Only supported on non-reflinked files. */ > >> if (xfs_is_reflink_inode(ip)) > >> return false; > >> > >> These restrictions were removed in "drop experimental warning" patch[1]. > >> I think they should be separated from that patch. > >> > >> [1] > >> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/ > >> > >> > >> Secondly, how the data corruption happened? > > > > No idea - i"m just reporting that lots of fsx tests failed with data > > corruptions. I haven't had time to look at why, I'm still trying to > > sort out the fix for a different data corruption... > > > >> Or which case failed? > > > > *lots* of them failed with kernel warnings with reflink turned off: > > > > SECTION -- xfs_dax_noreflink > > ========================= > > Failures: generic/051 generic/068 generic/075 generic/083 > > generic/112 generic/127 generic/198 generic/231 generic/247 > > generic/269 generic/270 generic/340 generic/344 generic/388 > > generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011 > > xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538 > > Failed 26 of 1079 tests > > > > All of those except xfs/073 and generic/471 are failures due to > > warnings found in dmesg. > > > > With reflink enabled, I terminated the run after g/075, g/091, g/112 > > and generic/127 reported fsx data corruptions and g/051, g/068, > > g/075 and g/083 had reported kernel warnings in dmesg. > > > >> Could > >> you give me more info (such as mkfs options, xfstests configs)? > > > > They are exactly the same as last time I reported these problems. > > > > For the "no reflink" test issues: > > > > mkfs options are "-m reflink=0,rmapbt=1", mount options "-o > > dax=always" for both filesytems. 
Config output at start of test > > run: > > > > SECTION -- xfs_dax_noreflink > > FSTYP -- xfs (debug) > > PLATFORM -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022 > > MKFS_OPTIONS -- -f -m reflink=0,rmapbt=1 /dev/pmem1 > > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch > > > > pmem devices are a pair of fake 8GB pmem regions set up by kernel > > CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up > > - the kernel config is kept minimal for these VMs - and the only > > kernel debug option I have turned on for these specific test runs is > > CONFIG_XFS_DEBUG=y. > > Thanks for the detailed info. But, in my environment (and my > colleagues', and our real server with DCPMM) these failure cases (you > mentioned above, in dax+non_reflink mode, with same test options) cannot > reproduce. > > Here's our test environment info: > - Ruan's env: fedora 36(v6.0-rc1) on kvm,pmem 2x4G:file backended > - Yang's env: fedora 35(v6.1-rc1) on kvm,pmem 2x1G:memmap=1G!1G,1G!2G > - Server's : Ubuntu 20.04(v6.0-rc1) real machine,pmem 2x4G:real DCPMM > > (To quickly confirm the difference, I just ran the failed 26 cases you > mentioned above.) Except for generic/471 and generic/519, which failed > even when dax is off, the rest passed. > > > > We don't want fsdax to be truned off. Right now, I think the most > important thing is solving the failed cases in dax+non_reflink mode. > So, firstly, I have to reproduce those failures. Is there any thing > wrong with my test environments? I konw you are using 'memmap=XXG!YYG' to > simulate pmem. So, (to Darrick) could you show me your config of dev > environment and the 'testcloud'(I am guessing it's a server with real > nvdimm just like ours)? Nope. Since the announcement of pmem as a product, I have had 15 minutes of access to one preproduction prototype server with actual optane DIMMs in them.
I have /never/ had access to real hardware to test any of this, so it's all configured via libvirt to simulate pmem in qemu: https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/ /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem: $ grep mtrdisk /proc/mounts none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0 $ ls -la /run/mtrdisk/[gh].mem -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem --D > > > (I just found I only tested on 4G and smaller pmem device. I'll try the > test on 8G pmem) > > > > > THe only difference between the noreflink and reflink runs is that I > > drop the "-m reflink=0" mkfs parameter. Otherwise they are identical > > and the errors I reported are from back-to-back fstests runs without > > rebooting the VM.... > > > > -Dave. > > > -- > Thanks, > Ruan.
[add tytso to cc since he asked about "How do you actually /get/ fsdax mode these days?" this morning] On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote: > On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote: > > > > > > 在 2022/10/24 13:31, Dave Chinner 写道: > > > On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote: > > >> 在 2022/10/24 6:00, Dave Chinner 写道: > > >>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: > > >>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: > > >>>>> In addition, I don't like your idea about the test change because it will > > >>>>> make generic/470 become the special test for XFS. Do you know if we can fix > > >>>>> the issue by changing the test in another way? blkdiscard -z can fix the > > >>>>> issue because it does zero-fill rather than discard on the block device. > > >>>>> However, blkdiscard -z will take a lot of time when the block device is > > >>>>> large. > > >>>> > > >>>> Well we /could/ just do that too, but that will suck if you have 2TB of > > >>>> pmem. ;) > > >>>> > > >>>> Maybe as an alternative path we could just create a very small > > >>>> filesystem on the pmem and then blkdiscard -z it? > > >>>> > > >>>> That said -- does persistent memory actually have a future? Intel > > >>>> scuttled the entire Optane product, cxl.mem sounds like expansion > > >>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel > > >>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive > > >>>> data corruption). > > >>> > > >>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken > > >>> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I > > >>> don't think that has changed at all - I still see lots of kernel > > >>> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid > > >>> argument" errors. 
> > >> > > >> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is > > >> caused by the restrictions which prevent reflink work together with DAX: > > >> > > >> a. fs/xfs/xfs_ioctl.c:1141 > > >> /* Don't allow us to set DAX mode for a reflinked file for now. */ > > >> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip)) > > >> return -EINVAL; > > >> > > >> b. fs/xfs/xfs_iops.c:1174 > > >> /* Only supported on non-reflinked files. */ > > >> if (xfs_is_reflink_inode(ip)) > > >> return false; > > >> > > >> These restrictions were removed in "drop experimental warning" patch[1]. > > >> I think they should be separated from that patch. > > >> > > >> [1] > > >> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/ > > >> > > >> > > >> Secondly, how the data corruption happened? > > > > > > No idea - i"m just reporting that lots of fsx tests failed with data > > > corruptions. I haven't had time to look at why, I'm still trying to > > > sort out the fix for a different data corruption... > > > > > >> Or which case failed? > > > > > > *lots* of them failed with kernel warnings with reflink turned off: > > > > > > SECTION -- xfs_dax_noreflink > > > ========================= > > > Failures: generic/051 generic/068 generic/075 generic/083 > > > generic/112 generic/127 generic/198 generic/231 generic/247 > > > generic/269 generic/270 generic/340 generic/344 generic/388 > > > generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011 > > > xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538 > > > Failed 26 of 1079 tests > > > > > > All of those except xfs/073 and generic/471 are failures due to > > > warnings found in dmesg. > > > > > > With reflink enabled, I terminated the run after g/075, g/091, g/112 > > > and generic/127 reported fsx data corruptions and g/051, g/068, > > > g/075 and g/083 had reported kernel warnings in dmesg. 
> > > > > >> Could > > >> you give me more info (such as mkfs options, xfstests configs)? > > > > > > They are exactly the same as last time I reported these problems. > > > > > > For the "no reflink" test issues: > > > > > > mkfs options are "-m reflink=0,rmapbt=1", mount options "-o > > > dax=always" for both filesytems. Config output at start of test > > > run: > > > > > > SECTION -- xfs_dax_noreflink > > > FSTYP -- xfs (debug) > > > PLATFORM -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022 > > > MKFS_OPTIONS -- -f -m reflink=0,rmapbt=1 /dev/pmem1 > > > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch > > > > > > pmem devices are a pair of fake 8GB pmem regions set up by kernel > > > CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up > > > - the kernel config is kept minimal for these VMs - and the only > > > kernel debug option I have turned on for these specific test runs is > > > CONFIG_XFS_DEBUG=y. > > > > Thanks for the detailed info. But, in my environment (and my > > colleagues', and our real server with DCPMM) these failure cases (you > > mentioned above, in dax+non_reflink mode, with same test options) cannot > > reproduce. > > > > Here's our test environment info: > > - Ruan's env: fedora 36(v6.0-rc1) on kvm,pmem 2x4G:file backended > > - Yang's env: fedora 35(v6.1-rc1) on kvm,pmem 2x1G:memmap=1G!1G,1G!2G > > - Server's : Ubuntu 20.04(v6.0-rc1) real machine,pmem 2x4G:real DCPMM > > > > (To quickly confirm the difference, I just ran the failed 26 cases you > > mentioned above.) Except for generic/471 and generic/519, which failed > > even when dax is off, the rest passed. > > > > > > We don't want fsdax to be truned off. Right now, I think the most > > important thing is solving the failed cases in dax+non_reflink mode. > > So, firstly, I have to reproduce those failures. Is there any thing > > wrong with my test environments? 
I konw you are using 'memmap=XXG!YYG' to > > simulate pmem. So, (to Darrick) could you show me your config of dev > > environment and the 'testcloud'(I am guessing it's a server with real > > nvdimm just like ours)? > > Nope. Since the announcement of pmem as a product, I have had 15 > minutes of acces to one preproduction prototype server with actual > optane DIMMs in them. > > I have /never/ had access to real hardware to test any of this, so it's > all configured via libvirt to simulate pmem in qemu: > https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/ > > /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem: > > $ grep mtrdisk /proc/mounts > none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0 > > $ ls -la /run/mtrdisk/[gh].mem > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem Also forgot to mention that the VM with the fake pmem attached has a script to do: ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f Every time the pmem device gets recreated, because apparently that's the only way to get S_DAX mode nowadays? --D > --D > > > > > > > (I just found I only tested on 4G and smaller pmem device. I'll try the > > test on 8G pmem) > > > > > > > > THe only difference between the noreflink and reflink runs is that I > > > drop the "-m reflink=0" mkfs parameter. Otherwise they are identical > > > and the errors I reported are from back-to-back fstests runs without > > > rebooting the VM.... > > > > > > -Dave. > > > > > > -- > > Thanks, > > Ruan.
Darrick J. Wong wrote: > [add tytso to cc since he asked about "How do you actually /get/ fsdax > mode these days?" this morning] > > On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote: > > On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote: > > > > > > > > > 在 2022/10/24 13:31, Dave Chinner 写道: > > > > On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@fujitsu.com wrote: > > > >> 在 2022/10/24 6:00, Dave Chinner 写道: > > > >>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote: > > > >>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote: > > > >>>>> In addition, I don't like your idea about the test change because it will > > > >>>>> make generic/470 become the special test for XFS. Do you know if we can fix > > > >>>>> the issue by changing the test in another way? blkdiscard -z can fix the > > > >>>>> issue because it does zero-fill rather than discard on the block device. > > > >>>>> However, blkdiscard -z will take a lot of time when the block device is > > > >>>>> large. > > > >>>> > > > >>>> Well we /could/ just do that too, but that will suck if you have 2TB of > > > >>>> pmem. ;) > > > >>>> > > > >>>> Maybe as an alternative path we could just create a very small > > > >>>> filesystem on the pmem and then blkdiscard -z it? > > > >>>> > > > >>>> That said -- does persistent memory actually have a future? Intel > > > >>>> scuttled the entire Optane product, cxl.mem sounds like expansion > > > >>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel > > > >>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive > > > >>>> data corruption). > > > >>> > > > >>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken > > > >>> on both ext4 and XFS. 
Now that I run a quick check on 6.1-rc1, I > > > >> don't think that has changed at all - I still see lots of kernel > > > >> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid > > > >> argument" errors. > > > >> > > > >> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is > > > >> caused by the restrictions which prevent reflink from working together with DAX: > > > >> > > > >> a. fs/xfs/xfs_ioctl.c:1141 > > > >> /* Don't allow us to set DAX mode for a reflinked file for now. */ > > > >> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip)) > > > >> return -EINVAL; > > > >> > > > >> b. fs/xfs/xfs_iops.c:1174 > > > >> /* Only supported on non-reflinked files. */ > > > >> if (xfs_is_reflink_inode(ip)) > > > >> return false; > > > >> > > > >> These restrictions were removed in "drop experimental warning" patch[1]. > > > >> I think they should be separated from that patch. > > > >> > > > >> [1] > > > >> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@fujitsu.com/ > > > >> > > > >> > > > >> Secondly, how did the data corruption happen? > > > > > > > > No idea - I'm just reporting that lots of fsx tests failed with data > > > > corruptions. I haven't had time to look at why, I'm still trying to > > > > sort out the fix for a different data corruption... > > > > > > > >> Or which case failed? 
> > > > > > > > *lots* of them failed with kernel warnings with reflink turned off: > > > > > > > > SECTION -- xfs_dax_noreflink > > > > ========================= > > > > Failures: generic/051 generic/068 generic/075 generic/083 > > > > generic/112 generic/127 generic/198 generic/231 generic/247 > > > > generic/269 generic/270 generic/340 generic/344 generic/388 > > > > generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011 > > > > xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538 > > > > Failed 26 of 1079 tests > > > > > > > > All of those except xfs/073 and generic/471 are failures due to > > > > warnings found in dmesg. > > > > > > > > With reflink enabled, I terminated the run after g/075, g/091, g/112 > > > > and generic/127 reported fsx data corruptions and g/051, g/068, > > > > g/075 and g/083 had reported kernel warnings in dmesg. > > > > > > > >> Could > > > >> you give me more info (such as mkfs options, xfstests configs)? > > > > > > > > They are exactly the same as last time I reported these problems. > > > > > > > > For the "no reflink" test issues: > > > > > > > > mkfs options are "-m reflink=0,rmapbt=1", mount options "-o > > > > dax=always" for both filesystems. Config output at start of test > > > > run: > > > > > > > > SECTION -- xfs_dax_noreflink > > > > FSTYP -- xfs (debug) > > > > PLATFORM -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022 > > > > MKFS_OPTIONS -- -f -m reflink=0,rmapbt=1 /dev/pmem1 > > > > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch > > > > > > > > pmem devices are a pair of fake 8GB pmem regions set up by kernel > > > > CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up > > > > - the kernel config is kept minimal for these VMs - and the only > > > > kernel debug option I have turned on for these specific test runs is > > > > CONFIG_XFS_DEBUG=y. > > > > > > Thanks for the detailed info. 
But, in my > > > environment (and my > > > colleagues', and our real server with DCPMM) these failure cases (you > > > mentioned above, in dax+non_reflink mode, with same test options) cannot > > > be reproduced. > > > > > > Here's our test environment info: > > > - Ruan's env: fedora 36(v6.0-rc1) on kvm,pmem 2x4G:file backed > > > - Yang's env: fedora 35(v6.1-rc1) on kvm,pmem 2x1G:memmap=1G!1G,1G!2G > > > - Server's : Ubuntu 20.04(v6.0-rc1) real machine,pmem 2x4G:real DCPMM > > > > > > (To quickly confirm the difference, I just ran the failed 26 cases you > > > mentioned above.) Except for generic/471 and generic/519, which failed > > > even when dax is off, the rest passed. > > > > > > > > > We don't want fsdax to be turned off. Right now, I think the most > > > important thing is solving the failed cases in dax+non_reflink mode. > > > So, firstly, I have to reproduce those failures. Is there anything > > > wrong with my test environments? I know you are using 'memmap=XXG!YYG' to > > > simulate pmem. So, (to Darrick) could you show me your config of dev > > > environment and the 'testcloud' (I am guessing it's a server with real > > > nvdimm just like ours)? > > > > Nope. Since the announcement of pmem as a product, I have had 15 > > minutes of access to one preproduction prototype server with actual > > optane DIMMs in them. 
> > > > I have /never/ had access to real hardware to test any of this, so it's > > all configured via libvirt to simulate pmem in qemu: > > https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/ > > > > /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem: > > > > $ grep mtrdisk /proc/mounts > > none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0 > > > > $ ls -la /run/mtrdisk/[gh].mem > > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem > > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem > > Also forgot to mention that the VM with the fake pmem attached has a > script to do: > > ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f > ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f > > Every time the pmem device gets recreated, because apparently that's the > only way to get S_DAX mode nowadays? If you have noticed a change here it is due to VM configuration not anything in the driver. If you are interested there are two ways to get pmem declared the legacy way that predates any of the DAX work, the kernel calls it E820_PRAM, and the modern way by platform firmware tables like ACPI NFIT. The assumption with E820_PRAM is that it is dealing with battery backed NVDIMMs of small capacity. In that case the /dev/pmem device can support DAX operation by default because the necessary memory for the 'struct page' array for that memory is likely small. Platform firmware defined PMEM can be terabytes. So the driver does not enable DAX by default because the user needs to make policy choice about burning gigabytes of DRAM for that metadata, or placing it in PMEM which is abundant, but slower. So what I suspect might be happening is your configuration changed from something that auto-allocated the 'struct page' array, to something that needed those commands you list above to explicitly opt-in to reserving some PMEM capacity for the page metadata.
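[Editorial aside, not part of the thread: the metadata-placement choice Dan describes above is what ndctl's `--map` option expresses when a namespace is switched to fsdax mode. A minimal sketch follows; the namespace name is an assumption for illustration (check `ndctl list` on a real system), and the script only builds the command line rather than running it.]

```shell
# Illustrative only: map the 'struct page' policy choice onto an ndctl
# invocation. "namespace0.0" is an assumed name, not taken from the thread.
ns="namespace0.0"

# "--map dev": keep page metadata in the pmem itself (abundant, slower);
# "--map mem": keep page metadata in DRAM (faster, costs DRAM per 4KiB page).
map="dev"

cmd="ndctl create-namespace --mode fsdax --map $map -e $ns -f"
echo "$cmd"
```

Running the echoed command per namespace is what Darrick's VM script does to get a /dev/pmemX that supports S_DAX operation.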
在 2022/10/28 9:37, Dan Williams 写道: > Darrick J. Wong wrote: >> [add tytso to cc since he asked about "How do you actually /get/ fsdax >> mode these days?" this morning] >> >> On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote: >>> On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote: ...skip... >>> >>> Nope. Since the announcement of pmem as a product, I have had 15 >>> minutes of acces to one preproduction prototype server with actual >>> optane DIMMs in them. >>> >>> I have /never/ had access to real hardware to test any of this, so it's >>> all configured via libvirt to simulate pmem in qemu: >>> https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/ >>> >>> /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem: >>> >>> $ grep mtrdisk /proc/mounts >>> none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0 >>> >>> $ ls -la /run/mtrdisk/[gh].mem >>> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem >>> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem >> >> Also forgot to mention that the VM with the fake pmem attached has a >> script to do: >> >> ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f >> ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f >> >> Every time the pmem device gets recreated, because apparently that's the >> only way to get S_DAX mode nowadays? > > If you have noticed a change here it is due to VM configuration not > anything in the driver. > > If you are interested there are two ways to get pmem declared the legacy > way that predates any of the DAX work, the kernel calls it E820_PRAM, > and the modern way by platform firmware tables like ACPI NFIT. The > assumption with E820_PRAM is that it is dealing with battery backed > NVDIMMs of small capacity. 
In that case the /dev/pmem device can support > DAX operation by default because the necessary memory for the 'struct > page' array for that memory is likely small. > > Platform firmware defined PMEM can be terabytes. So the driver does not > enable DAX by default because the user needs to make policy choice about > burning gigabytes of DRAM for that metadata, or placing it in PMEM which > is abundant, but slower. So what I suspect might be happening is your > configuration changed from something that auto-allocated the 'struct > page' array, to something that needed those commands you list above to > explicitly opt-in to reserving some PMEM capacity for the page metadata. I am using the same simulation environment as Darrick's and Dave's and have tested many times, but still cannot reproduce the failed cases they mentioned (dax+non_reflink mode, currently focusing) until now. Only a few cases randomly failed because of "target is busy". But IIRC, those failed cases you mentioned failed with dmesg warnings around the function "dax_associate_entry()" or "dax_disassociate_entry()". Since I cannot reproduce the failure, it is hard for me to continue solving the problem. And how is your recent test? Is it still failing with those dmesg warnings? If so, could you zip the test result and send it to me? -- Thanks, Ruan
On Sun, Oct 30, 2022 at 05:31:43PM +0800, Shiyang Ruan wrote: > > > 在 2022/10/28 9:37, Dan Williams 写道: > > Darrick J. Wong wrote: > > > [add tytso to cc since he asked about "How do you actually /get/ fsdax > > > mode these days?" this morning] > > > > > > On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote: > > > > On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote: > > ...skip... > > > > > > > > > Nope. Since the announcement of pmem as a product, I have had 15 > > > > minutes of acces to one preproduction prototype server with actual > > > > optane DIMMs in them. > > > > > > > > I have /never/ had access to real hardware to test any of this, so it's > > > > all configured via libvirt to simulate pmem in qemu: > > > > https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/ > > > > > > > > /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem: > > > > > > > > $ grep mtrdisk /proc/mounts > > > > none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0 > > > > > > > > $ ls -la /run/mtrdisk/[gh].mem > > > > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem > > > > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem > > > > > > Also forgot to mention that the VM with the fake pmem attached has a > > > script to do: > > > > > > ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f > > > ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f > > > > > > Every time the pmem device gets recreated, because apparently that's the > > > only way to get S_DAX mode nowadays? > > > > If you have noticed a change here it is due to VM configuration not > > anything in the driver. > > > > If you are interested there are two ways to get pmem declared the legacy > > way that predates any of the DAX work, the kernel calls it E820_PRAM, > > and the modern way by platform firmware tables like ACPI NFIT. 
The > > assumption with E820_PRAM is that it is dealing with battery backed > > NVDIMMs of small capacity. In that case the /dev/pmem device can support > > DAX operation by default because the necessary memory for the 'struct > > page' array for that memory is likely small. > > > > Platform firmware defined PMEM can be terabytes. So the driver does not > > enable DAX by default because the user needs to make policy choice about > > burning gigabytes of DRAM for that metadata, or placing it in PMEM which > > is abundant, but slower. So what I suspect might be happening is your > > configuration changed from something that auto-allocated the 'struct > > page' array, to something that needed those commands you list above to > > explicitly opt-in to reserving some PMEM capacity for the page metadata. > > I am using the same simulation environment as Darrick's and Dave's and have > tested many times, but still cannot reproduce the failed cases they > mentioned (dax+non_reflink mode, currently focusing) until now. Only a few > cases randomly failed because of "target is busy". But IIRC, those failed > cases you mentioned failed with dmesg warnings around the function > "dax_associate_entry()" or "dax_disassociate_entry()". Since I cannot > reproduce the failure, it is hard for me to continue solving the problem. FWIW things have calmed down as of 6.1-rc3 -- if I disable reflink, fstests runs without complaint. Now it only seems to be affecting reflink=1 filesystems. > And how is your recent test? Is it still failing with those dmesg warnings? If so, > could you zip the test result and send it to me? https://djwong.org/docs/kernel/daxbad.zip --D > > > -- > Thanks, > Ruan
在 2022/11/2 8:45, Darrick J. Wong 写道: > On Sun, Oct 30, 2022 at 05:31:43PM +0800, Shiyang Ruan wrote: >> >> >> 在 2022/10/28 9:37, Dan Williams 写道: >>> Darrick J. Wong wrote: >>>> [add tytso to cc since he asked about "How do you actually /get/ fsdax >>>> mode these days?" this morning] >>>> >>>> On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote: >>>>> On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@fujitsu.com wrote: >> >> ...skip... >> >>>>> >>>>> Nope. Since the announcement of pmem as a product, I have had 15 >>>>> minutes of acces to one preproduction prototype server with actual >>>>> optane DIMMs in them. >>>>> >>>>> I have /never/ had access to real hardware to test any of this, so it's >>>>> all configured via libvirt to simulate pmem in qemu: >>>>> https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/ >>>>> >>>>> /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem: >>>>> >>>>> $ grep mtrdisk /proc/mounts >>>>> none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0 >>>>> >>>>> $ ls -la /run/mtrdisk/[gh].mem >>>>> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem >>>>> -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem >>>> >>>> Also forgot to mention that the VM with the fake pmem attached has a >>>> script to do: >>>> >>>> ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f >>>> ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f >>>> >>>> Every time the pmem device gets recreated, because apparently that's the >>>> only way to get S_DAX mode nowadays? >>> >>> If you have noticed a change here it is due to VM configuration not >>> anything in the driver. >>> >>> If you are interested there are two ways to get pmem declared the legacy >>> way that predates any of the DAX work, the kernel calls it E820_PRAM, >>> and the modern way by platform firmware tables like ACPI NFIT. 
The >>> assumption with E820_PRAM is that it is dealing with battery backed >>> NVDIMMs of small capacity. In that case the /dev/pmem device can support >>> DAX operation by default because the necessary memory for the 'struct >>> page' array for that memory is likely small. >>> >>> Platform firmware defined PMEM can be terabytes. So the driver does not >>> enable DAX by default because the user needs to make policy choice about >>> burning gigabytes of DRAM for that metadata, or placing it in PMEM which >>> is abundant, but slower. So what I suspect might be happening is your >>> configuration changed from something that auto-allocated the 'struct >>> page' array, to something that needed those commands you list above to >>> explicitly opt-in to reserving some PMEM capacity for the page metadata. >> >> I am using the same simulation environment as Darrick's and Dave's and have >> tested many times, but still cannot reproduce the failed cases they >> mentioned (dax+non_reflink mode, currently focusing) until now. Only a few >> cases randomly failed because of "target is busy". But IIRC, those failed >> cases you mentioned failed with dmesg warnings around the function >> "dax_associate_entry()" or "dax_disassociate_entry()". Since I cannot >> reproduce the failure, it is hard for me to continue solving the problem. > > FWIW things have calmed down as of 6.1-rc3 -- if I disable reflink, > fstests runs without complaint. Now it only seems to be affecting > reflink=1 filesystems. > >> And how is your recent test? Is it still failing with those dmesg warnings? If so, >> could you zip the test result and send it to me? > > https://djwong.org/docs/kernel/daxbad.zip Thanks for your info! (To Dave) I need your recent test result too. If the cases don't fail when reflink is disabled, I'll focus on solving the warnings when reflink is enabled. -- Thanks, Ruan. > > --D > >> >> >> -- >> Thanks, >> Ruan
On Wed, Nov 02, 2022 at 05:17:18AM +0000, ruansy.fnst@fujitsu.com wrote: > > 在 2022/11/2 8:45, Darrick J. Wong 写道: > > On Sun, Oct 30, 2022 at 05:31:43PM +0800, Shiyang Ruan wrote: > > FWIW things have calmed down as of 6.1-rc3 -- if I disable reflink, > > fstests runs without complaint. Now it only seems to be affecting > > reflink=1 filesystems. > > >> And how is your recent test? Still failed with those dmesg warnings? If so, > >> could you zip the test result and send it to me? > > > > https://djwong.org/docs/kernel/daxbad.zip > > Thanks for your info! > > (To Dave) I need your recent test result too. If cases won't fail when > reflink disabled, I'll focusing on solving the warning when reflink enabled. My first run on 6.1-rc3 with reflink disabled was clean. Then I ran a few tests with reflink enabled, and they all failed as expected. Then I ran the no-reflink tests again, and then they all failed, too. Nothing changed between test configs, I didn't even reboot the test machine: $ history |tail -5 500 sudo ./run_check.sh --mkfs-opts "-m rmapbt=1,reflink=0" --run-opts "-s xfs_dax xfs/55[12]" 501 sudo ./run_check.sh --mkfs-opts "-m rmapbt=1,reflink=0" --run-opts "-s xfs_dax_noreflink generic/051 generic/068 generic/074 generic/075 generic/083 generic/112 generic/231 generic/232 generic/269 generic/270 generic/340 generic/388 generic/461 generic/471 generic/476 generic/519 generic/560 generic/561 generic/617 generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538" 502 sudo ./run_check.sh --mkfs-opts "-m rmapbt=1" --run-opts "-s xfs_dax generic/051 generic/068 generic/074 generic/075 generic/083 generic/112 generic/231 generic/232 generic/269 generic/270 generic/340 generic/388 generic/461 generic/471 generic/476 generic/519 generic/560 generic/561 generic/617 generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538" 503 sudo ./run_check.sh --mkfs-opts "-m rmapbt=1,reflink=0" --run-opts "-s 
xfs_dax_noreflink generic/051 generic/068 generic/074 generic/075 generic/083 generic/112 generic/231 generic/232 generic/269 generic/270 generic/340 generic/388 generic/461 generic/471 generic/476 generic/519 generic/560 generic/561 generic/617 generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538" 504 history |tail -5 $ The first noreflink run: SECTION -- xfs_dax_noreflink ========================= Failures: generic/471 generic/519 xfs/073 Failed 3 of 29 tests Which are typical failures for this config. The first reflink enabled run I killed almost immediately as it threw multiple warnings in the first couple of tests: Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_dax generic/051 generic/068 generic/074 generic/075 generic/083 generic/112 generic/231 generic/232 generic/269 generic/270 generic/340 generic/388 generic/461 generic/471 generic/476 generic/519 generic/560 generic/561 generic/617 generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538 SECTION -- xfs_dax FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 test3 6.1.0-rc3-dgc+ #1649 SMP PREEMPT_DYNAMIC Wed Nov 2 07:58:17 AEDT 2022 MKFS_OPTIONS -- -f -m rmapbt=1 /dev/pmem1 MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch generic/051 79s ... _check_dmesg: something found in dmesg (see /home/dave/src/xfstests-dev/results//xfs_dax/generic/051.dmesg) generic/068 43s ... ^C real 1m46.278s user 0m17.465s sys 2m9.981s $ And then I ran the noreflink tests again to make sure the first run wasn't a fluke: SECTION -- xfs_dax_noreflink ========================= Failures: generic/051 generic/068 generic/231 generic/269 generic/270 generic/340 generic/388 generic/461 generic/471 generic/476 generic/519 generic/560 generic/561 xfs/011 xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538 Failed 20 of 29 tests It was a fluke - most of the tests failed this time with dax mapping warnings. 
dmesg from the entire set of test runs is attached. -Dave.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8495ef076ffc..a3c221841fa6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -348,8 +348,10 @@ xfs_setup_dax_always(
 		goto disable_dax;
 	}
 
-	if (xfs_has_reflink(mp)) {
-		xfs_alert(mp, "DAX and reflink cannot be used together!");
+	if (xfs_has_reflink(mp) &&
+	    bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
+		xfs_alert(mp,
+	"DAX and reflink cannot work with multi-partitions!");
 		return -EINVAL;
 	}
Failure notification is not supported on partitions. So, when we mount
a reflink enabled xfs on a partition with dax option, let it fail with
-EINVAL code.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/xfs/xfs_super.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
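[Editorial aside, not part of the thread: the mount-time decision this patch encodes can be sketched as a tiny userspace truth table. This is an illustration only, not kernel code; `$1` models `xfs_has_reflink()` and `$2` models `bdev_is_partition()`, the two predicates the real check in `xfs_setup_dax_always()` combines.]

```shell
# Sketch of the decision: a DAX mount is rejected with -EINVAL only when
# reflink is enabled AND the data device is a partition, because failure
# notification is not supported on partitions.
dax_mount_check() {
    # $1 = reflink enabled (1/0), $2 = device is a partition (1/0)
    if [ "$1" = 1 ] && [ "$2" = 1 ]; then
        echo "EINVAL"
        return 1
    fi
    echo "ok"
}

dax_mount_check 1 1 || true   # reflink on a partition: rejected
dax_mount_check 1 0 || true   # reflink on a whole device: allowed by this patch
dax_mount_check 0 1 || true   # no reflink on a partition: allowed
```

Note the contrast with the code being replaced, which rejected every reflink+DAX combination; the new check only fails the partition case.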