[2,2/2] xfs: fix rt_dev usage for DAX

Message ID 151751718516.69886.135497175511444689.stgit@djiang5-desk3.ch.intel.com (mailing list archive)
State New, archived

Commit Message

Dave Jiang Feb. 1, 2018, 8:33 p.m. UTC
When using realtime device (rtdev) with xfs where the data device is not
DAX capable, two issues arise. One is when data device is not DAX but the
realtime device is DAX capable, we currently disable DAX.
After passing this check, we are also not marking the inode as DAX capable.
This change will allow DAX enabled if the data device or the realtime
device is DAX capable. S_DAX will be marked for the inode if the file is
residing on a DAX capable device. This will prevent the case of rtdev is not
DAX and data device is DAX to create realtime files.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reported-by: Darrick Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iops.c  |    3 ++-
 fs/xfs/xfs_super.c |    9 ++++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

Comments

Dave Chinner Feb. 1, 2018, 11:44 p.m. UTC | #1
On Thu, Feb 01, 2018 at 01:33:05PM -0700, Dave Jiang wrote:
> When using realtime device (rtdev) with xfs where the data device is not
> DAX capable, two issues arise. One is when data device is not DAX but the
> realtime device is DAX capable, we currently disable DAX.
> After passing this check, we are also not marking the inode as DAX capable.
> This change will allow DAX enabled if the data device or the realtime
> device is DAX capable. S_DAX will be marked for the inode if the file is
> residing on a DAX capable device. This will prevent the case of rtdev is not
> DAX and data device is DAX to create realtime files.

I'm confused by this description. I'm not sure what is broken, nor
what you are trying to fix.

I think what you want to do is enable DAX on RT devices separately
to the data device and vice versa?

i.e. is this what you are trying to achieve?

datadev dax	rtdev dax		DAX enabled on
-----------     ---------		--------------
no		no			neither
yes		no			datadev
no		yes			rtdev
yes		yes			both
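
The table treats each device independently: DAX is enabled on a device exactly when that device supports it. A minimal userspace sketch of that mapping follows. It is a model for illustration only, not kernel code, and the names `dax_scope` and `dax_enabled_on` are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical result type: which devices end up with DAX enabled. */
enum dax_scope {
	DAX_NEITHER = 0,
	DAX_DATADEV = 1 << 0,
	DAX_RTDEV   = 1 << 1,
	DAX_BOTH    = DAX_DATADEV | DAX_RTDEV,
};

/* Each device contributes its own bit; neither depends on the other. */
static enum dax_scope dax_enabled_on(bool datadev_dax, bool rtdev_dax)
{
	int scope = DAX_NEITHER;

	if (datadev_dax)
		scope |= DAX_DATADEV;
	if (rtdev_dax)
		scope |= DAX_RTDEV;
	return (enum dax_scope)scope;
}
```

Each of the four rows in the table corresponds to one combination of the two boolean arguments.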


> 
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> Reported-by: Darrick Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_iops.c  |    3 ++-
>  fs/xfs/xfs_super.c |    9 ++++++++-
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 56475fcd76f2..ab352c325301 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1204,7 +1204,8 @@ xfs_diflags_to_iflags(
>  	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
>  	    !xfs_is_reflink_inode(ip) &&
>  	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
> -	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> +	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) &&
> +	    blk_queue_dax(bdev_get_queue(inode->i_sb->s_bdev)))

This does not discriminate between the rtdev or the data dev. This
needs to call xfs_find_bdev_for_inode() to get the right device
for the inode config.
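
To illustrate the point about picking the right device, here is a small userspace model. It is only a sketch: the `model_*` structures and helpers are hypothetical stand-ins, not the kernel's types, and `model_bdev_for_inode()` merely mimics the role that xfs_find_bdev_for_inode() would play in the real check.

```c
#include <stdbool.h>

/* Simplified stand-ins for kernel structures; all names are hypothetical. */
struct model_bdev {
	bool dax_capable;
};

struct model_mount {
	struct model_bdev datadev;
	struct model_bdev rtdev;
};

struct model_inode {
	struct model_mount *mount;
	bool is_realtime;	/* models the inode's RT flag */
};

/* Pick the device that actually backs this inode's file data. */
static struct model_bdev *model_bdev_for_inode(struct model_inode *ip)
{
	return ip->is_realtime ? &ip->mount->rtdev : &ip->mount->datadev;
}

/* The DAX decision is made against the inode's own device. */
static bool model_inode_can_dax(struct model_inode *ip)
{
	return model_bdev_for_inode(ip)->dax_capable;
}
```

Checking only the superblock's block device, as the hunk above does, would correspond to always returning `&ip->mount->datadev` in this model.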

Further, if we add or remove the RT flag to the inode at a later
point in time (e.g. via ioctl) we also need to re-evaluate the S_DAX
flag at that point in time.

Which brings me to the real problem here: dynamically changing the
S_DAX flag is racy, dangerous and broken. It's not clear that this
should be allowed at all as the inode may have already been mmap()d
by the time the ioctl is called to set/clear the rt file state.

IOWs, right now we cannot support mixed DAX mode filesystems because
the generic DAX code does not support dynamic changing of the DAX
flag on an inode and so checking the block device state here is
irrelevant....

>  		inode->i_flags |= S_DAX;
>  }
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e8a687232614..5ac478924dce 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1649,11 +1649,18 @@ xfs_fs_fill_super(
>  		sb->s_flags |= SB_I_VERSION;
>  
>  	if (mp->m_flags & XFS_MOUNT_DAX) {
> +		bool rtdev_is_dax = false;
> +
>  		xfs_warn(mp,
>  		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
>  
> +		if (mp->m_rtdev_targp->bt_daxdev)
> +			if (bdev_dax_supported(mp->m_rtdev_targp->bt_bdev,
> +					      sb->s_blocksize) == 0)
> +				rtdev_is_dax = true;

.... as this code here needs to turn off DAX here if any device
in the filesystem doesn't support DAX....


FWIW, the logic in the code is terrible (not your fault, Dave).
The logic reads

	if (NOT bdev_dax_supported(rtdev)) then
		rtdev supports DAX

That also needs fixing - we're checking something that has a boolean
return state (yes or no) and so it should define them in a way that
makes the caller logic read cleanly....
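
One way to get caller logic that reads cleanly is to wrap the errno-style check in a predicate that returns a boolean. The sketch below is a userspace model under that assumption; `model_dax_check` and `model_dax_supported` are hypothetical names, and the choice of -EOPNOTSUPP for the failure case is also an assumption.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical errno-style check: 0 means "supported", negative means not. */
static int model_dax_check(bool capable)
{
	return capable ? 0 : -EOPNOTSUPP;
}

/* Boolean wrapper, so callers can write: if (model_dax_supported(...)) */
static bool model_dax_supported(bool capable)
{
	return model_dax_check(capable) == 0;
}
```

With the wrapper, the caller no longer has to compare against 0 to learn that the device supports DAX.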

Cheers,

Dave.
Dave Jiang Feb. 2, 2018, 12:13 a.m. UTC | #2
On 02/01/2018 04:44 PM, Dave Chinner wrote:
> On Thu, Feb 01, 2018 at 01:33:05PM -0700, Dave Jiang wrote:
>> When using realtime device (rtdev) with xfs where the data device is not
>> DAX capable, two issues arise. One is when data device is not DAX but the
>> realtime device is DAX capable, we currently disable DAX.
>> After passing this check, we are also not marking the inode as DAX capable.
>> This change will allow DAX enabled if the data device or the realtime
>> device is DAX capable. S_DAX will be marked for the inode if the file is
>> residing on a DAX capable device. This will prevent the case of rtdev is not
>> DAX and data device is DAX to create realtime files.
> 
> I'm confused by this description. I'm not sure what is broken, nor
> what you are trying to fix.
> 
> I think what you want to do is enable DAX on RT devices separately
> to the data device and vice versa?
> 
> i.e. is this what you are trying to achieve?
> 
> datadev dax	rtdev dax		DAX enabled on
> -----------     ---------		--------------
> no		no			neither
> yes		no			datadev
> no		yes			rtdev
> yes		yes			both

^ Yes that's pretty much what I was trying to say. I probably should've
just provided the table above. Looks like Darrick supplied a much
cleaner patch, although I don't think it addresses the concerns you have
regarding dynamically changing the S_DAX flag.

Darrick J. Wong Feb. 2, 2018, 12:43 a.m. UTC | #3
On Fri, Feb 02, 2018 at 10:44:13AM +1100, Dave Chinner wrote:
> On Thu, Feb 01, 2018 at 01:33:05PM -0700, Dave Jiang wrote:
> > When using realtime device (rtdev) with xfs where the data device is not
> > DAX capable, two issues arise. One is when data device is not DAX but the
> > realtime device is DAX capable, we currently disable DAX.
> > After passing this check, we are also not marking the inode as DAX capable.
> > This change will allow DAX enabled if the data device or the realtime
> > device is DAX capable. S_DAX will be marked for the inode if the file is
> > residing on a DAX capable device. This will prevent the case of rtdev is not
> > DAX and data device is DAX to create realtime files.
> 
> I'm confused by this description. I'm not sure what is broken, nor
> what you are trying to fix.
> 
> I think what you want to do is enable DAX on RT devices separately
> to the data device and vice versa?
> 
> > i.e. is this what you are trying to achieve?
> 
> datadev dax	rtdev dax		DAX enabled on
> -----------     ---------		--------------
> no		no			neither
> yes		no			datadev
> no		yes			rtdev
> yes		yes			both
> 
> 
> > 
> > Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> > Reported-by: Darrick Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_iops.c  |    3 ++-
> >  fs/xfs/xfs_super.c |    9 ++++++++-
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 56475fcd76f2..ab352c325301 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -1204,7 +1204,8 @@ xfs_diflags_to_iflags(
> >  	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
> >  	    !xfs_is_reflink_inode(ip) &&
> >  	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
> > -	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> > +	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) &&
> > +	    blk_queue_dax(bdev_get_queue(inode->i_sb->s_bdev)))
> 
> This does not discriminate between the rtdev or the data dev. This
> needs to call xfs_find_bdev_for_inode() to get the right device
> for the inode config.
> 
> Further, if we add or remove the RT flag to the inode at a later
> point in time (e.g. via ioctl) we also need to re-evaluate the S_DAX
> flag at that point in time.

Ah, right, I'd missed that subtlety in my earlier replies.  Ok, add
another patch to this series to reevaluate S_DAX when we change the RT
flag.

> Which brings me to the real problem here: dynamically changing the
> S_DAX flag is racy, dangerous and broken. It's not clear that this
> should be allowed at all as the inode may have already been mmap()d
> by the time the ioctl is called to set/clear the rt file state.

Agreed that this is a mess.  Either this needs to get fixed in the dax
code, or we need to decide that we're not going to support reconfiguring
the dax flag at all, except possibly for empty files (similar to how we
restrict changes to the rt flag).

> IOWs, right now we cannot support mixed DAX mode filesystems because
> the generic DAX code does not support dynamic changing of the DAX
> flag on an inode and so checking the block device state here is
> irrelevant....
> 
> >  		inode->i_flags |= S_DAX;
> >  }
> >  
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index e8a687232614..5ac478924dce 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1649,11 +1649,18 @@ xfs_fs_fill_super(
> >  		sb->s_flags |= SB_I_VERSION;
> >  
> >  	if (mp->m_flags & XFS_MOUNT_DAX) {
> > +		bool rtdev_is_dax = false;
> > +
> >  		xfs_warn(mp,
> >  		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> >  
> > +		if (mp->m_rtdev_targp->bt_daxdev)
> > +			if (bdev_dax_supported(mp->m_rtdev_targp->bt_bdev,
> > +					      sb->s_blocksize) == 0)
> > +				rtdev_is_dax = true;
> 
> .... as this code here needs to turn off DAX here if any device
> in the filesystem doesn't support DAX....

I think it'd be useful to be able to have a pmem rt device even if the
data device doesn't support it.  Or rather, I have a few clients who
have expressed interest in this sort of configuration.

--D

> 
> 
> FWIW, the logic in the code is terrible (not your fault, Dave).
> The logic reads
> 
> 	if (NOT bdev_dax_supported(rtdev)) then
> 		rtdev supports DAX
> 
> That also needs fixing - we're checking something that has a boolean
> return state (yes or no) and so it should define them in a way that
> makes the caller logic read cleanly....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Feb. 2, 2018, 3:20 a.m. UTC | #4
On Thu, Feb 01, 2018 at 05:13:40PM -0700, Dave Jiang wrote:
> 
> 
> On 02/01/2018 04:44 PM, Dave Chinner wrote:
> > On Thu, Feb 01, 2018 at 01:33:05PM -0700, Dave Jiang wrote:
> >> When using realtime device (rtdev) with xfs where the data device is not
> >> DAX capable, two issues arise. One is when data device is not DAX but the
> >> realtime device is DAX capable, we currently disable DAX.
> >> After passing this check, we are also not marking the inode as DAX capable.
> >> This change will allow DAX enabled if the data device or the realtime
> >> device is DAX capable. S_DAX will be marked for the inode if the file is
> >> residing on a DAX capable device. This will prevent the case of rtdev is not
> >> DAX and data device is DAX to create realtime files.
> > 
> > I'm confused by this description. I'm not sure what is broken, nor
> > what you are trying to fix.
> > 
> > I think what you want to do is enable DAX on RT devices separately
> > to the data device and vice versa?
> > 
> > > i.e. is this what you are trying to achieve?
> > 
> > datadev dax	rtdev dax		DAX enabled on
> > -----------     ---------		--------------
> > no		no			neither
> > yes		no			datadev
> > no		yes			rtdev
> > yes		yes			both
> 
> ^ Yes that's pretty much what I was trying to say. I probably should've
> just provided the table above.

Ok, good to know I understood what you want to achieve. Now we've
just got to work out how to do it. :)

Cheers,

Dave.
Dave Chinner Feb. 2, 2018, 3:36 a.m. UTC | #5
On Thu, Feb 01, 2018 at 04:43:32PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 02, 2018 at 10:44:13AM +1100, Dave Chinner wrote:
> > On Thu, Feb 01, 2018 at 01:33:05PM -0700, Dave Jiang wrote:
> > > Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> > > Reported-by: Darrick Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_iops.c  |    3 ++-
> > >  fs/xfs/xfs_super.c |    9 ++++++++-
> > >  2 files changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index 56475fcd76f2..ab352c325301 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -1204,7 +1204,8 @@ xfs_diflags_to_iflags(
> > >  	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
> > >  	    !xfs_is_reflink_inode(ip) &&
> > >  	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
> > > -	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> > > +	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) &&
> > > +	    blk_queue_dax(bdev_get_queue(inode->i_sb->s_bdev)))
> > 
> > This does not discriminate between the rtdev or the data dev. This
> > needs to call xfs_find_bdev_for_inode() to get the right device
> > for the inode config.
> > 
> > Further, if we add or remove the RT flag to the inode at a later
> > point in time (e.g. via ioctl) we also need to re-evaluate the S_DAX
> > flag at that point in time.
> 
> Ah, right, I'd missed that subtlety in my earlier replies.  Ok, add
> another patch to this series to reevaluate S_DAX when we change the RT
> flag.
> 
> > Which brings me to the real problem here: dynamically changing the
> > S_DAX flag is racy, dangerous and broken. It's not clear that this
> > should be allowed at all as the inode may have already been mmap()d
> > by the time the ioctl is called to set/clear the rt file state.
> 
> Agreed that this is a mess.  Either this needs to get fixed in the dax
> code, or we need to decide that we're not going to support reconfiguring
> the dax flag at all, except possibly for empty files (similar to how we
> restrict changes to the rt flag).

Hmmm, the rt flag limitations are to do with whether extents have
been allocated or not. IIRC the issue with S_DAX is whether the file
has been mmap()d or not which we can't check for reliably even if
the file is empty...

.....

> > >  	if (mp->m_flags & XFS_MOUNT_DAX) {
> > > +		bool rtdev_is_dax = false;
> > > +
> > >  		xfs_warn(mp,
> > >  		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> > >  
> > > +		if (mp->m_rtdev_targp->bt_daxdev)
> > > +			if (bdev_dax_supported(mp->m_rtdev_targp->bt_bdev,
> > > +					      sb->s_blocksize) == 0)
> > > +				rtdev_is_dax = true;
> > 
> > .... as this code here needs to turn off DAX here if any device
> > in the filesystem doesn't support DAX....
> 
> I think it'd be useful to be able to have a pmem rt device even if the
> data device doesn't support it.  Or rather, I have a few clients who
> have expressed interest in this sort of configuration.

Understood.

I suspect all this means we can only support DAX on the rt device by
setting the RT flag on file create, at least until the dynamic S_DAX
issues are sorted out.  And that also means it can't be cleared from
files that have it set. So perhaps all we need to do is disallow
changing the RT flag on regular files via the ioctl if XFS_MOUNT_DAX
is set?
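
A rough userspace sketch of that restriction is below. It is a model only: `model_file` and `model_set_rt_flag` are hypothetical, `mount_dax` stands in for XFS_MOUNT_DAX being set, and returning -EINVAL is an assumption rather than a decision made in this thread.

```c
#include <errno.h>
#include <stdbool.h>

struct model_file {
	bool mount_dax;		/* models XFS_MOUNT_DAX being set */
	bool is_realtime;	/* models the inode's RT flag */
};

/* Returns 0 on success, -EINVAL if the change is refused. */
static int model_set_rt_flag(struct model_file *f, bool want_rt)
{
	if (want_rt == f->is_realtime)
		return 0;		/* nothing to change */
	if (f->mount_dax)
		return -EINVAL;		/* RT state stays as set at create time */
	f->is_realtime = want_rt;
	return 0;
}
```

In this model the RT flag on a DAX mount can only take the value it had when the file was created, which sidesteps any need to re-evaluate S_DAX later.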

Cheers,

Dave.
Dave Jiang Feb. 6, 2018, 10:32 p.m. UTC | #6
On 02/01/2018 05:43 PM, Darrick J. Wong wrote:
> On Fri, Feb 02, 2018 at 10:44:13AM +1100, Dave Chinner wrote:
>> On Thu, Feb 01, 2018 at 01:33:05PM -0700, Dave Jiang wrote:
>>> When using realtime device (rtdev) with xfs where the data device is not
>>> DAX capable, two issues arise. One is when data device is not DAX but the
>>> realtime device is DAX capable, we currently disable DAX.
>>> After passing this check, we are also not marking the inode as DAX capable.
>>> This change will allow DAX enabled if the data device or the realtime
>>> device is DAX capable. S_DAX will be marked for the inode if the file is
>>> residing on a DAX capable device. This will prevent the case of rtdev is not
>>> DAX and data device is DAX to create realtime files.
>>
>> I'm confused by this description. I'm not sure what is broken, nor
>> what you are trying to fix.
>>
>> I think what you want to do is enable DAX on RT devices separately
>> to the data device and vice versa?
>>
>> i.e. is this what you are trying to achieve?
>>
>> datadev dax	rtdev dax		DAX enabled on
>> -----------     ---------		--------------
>> no		no			neither
>> yes		no			datadev
>> no		yes			rtdev
>> yes		yes			both
>>
>>
>>>
>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>> Reported-by: Darrick Wong <darrick.wong@oracle.com>
>>> ---
>>>  fs/xfs/xfs_iops.c  |    3 ++-
>>>  fs/xfs/xfs_super.c |    9 ++++++++-
>>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>>> index 56475fcd76f2..ab352c325301 100644
>>> --- a/fs/xfs/xfs_iops.c
>>> +++ b/fs/xfs/xfs_iops.c
>>> @@ -1204,7 +1204,8 @@ xfs_diflags_to_iflags(
>>>  	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
>>>  	    !xfs_is_reflink_inode(ip) &&
>>>  	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
>>> -	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
>>> +	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) &&
>>> +	    blk_queue_dax(bdev_get_queue(inode->i_sb->s_bdev)))
>>
>> This does not discriminate between the rtdev or the data dev. This
>> needs to call xfs_find_bdev_for_inode() to get the right device
>> for the inode config.
>>
>> Further, if we add or remove the RT flag to the inode at a later
>> point in time (e.g. via ioctl) we also need to re-evaluate the S_DAX
>> flag at that point in time.
> 
> Ah, right, I'd missed that subtlety in my earlier replies.  Ok, add
> another patch to this series to reevaluate S_DAX when we change the RT
> flag.
> 
>> Which brings me to the real problem here: dynamically changing the
>> S_DAX flag is racy, dangerous and broken. It's not clear that this
>> should be allowed at all as the inode may have already been mmap()d
>> by the time the ioctl is called to set/clear the rt file state.
> 
> Agreed that this is a mess.  Either this needs to get fixed in the dax
> code, or we need to decide that we're not going to support reconfiguring
> the dax flag at all, except possibly for empty files (similar to how we
> restrict changes to the rt flag).

Does this mean we should add a check in xfs_ioctl_setattr_xflags() to
reject removal of the realtime flag if S_DAX is set on the inode, until
the dynamic change issue is sorted out?

Darrick J. Wong Feb. 6, 2018, 11:19 p.m. UTC | #7
On Tue, Feb 06, 2018 at 03:32:00PM -0700, Dave Jiang wrote:
> On 02/01/2018 05:43 PM, Darrick J. Wong wrote:
> > On Fri, Feb 02, 2018 at 10:44:13AM +1100, Dave Chinner wrote:
> >> On Thu, Feb 01, 2018 at 01:33:05PM -0700, Dave Jiang wrote:
> >>> When using realtime device (rtdev) with xfs where the data device is not
> >>> DAX capable, two issues arise. One is when data device is not DAX but the
> >>> realtime device is DAX capable, we currently disable DAX.
> >>> After passing this check, we are also not marking the inode as DAX capable.
> >>> This change will allow DAX enabled if the data device or the realtime
> >>> device is DAX capable. S_DAX will be marked for the inode if the file is
> >>> residing on a DAX capable device. This will prevent the case of rtdev is not
> >>> DAX and data device is DAX to create realtime files.
> >>
> >> I'm confused by this description. I'm not sure what is broken, nor
> >> what you are trying to fix.
> >>
> >> I think what you want to do is enable DAX on RT devices separately
> >> to the data device and vice versa?
> >>
> >> i.e. is this what you are trying to achieve?
> >>
> >> datadev dax	rtdev dax		DAX enabled on
> >> -----------     ---------		--------------
> >> no		no			neither
> >> yes		no			datadev
> >> no		yes			rtdev
> >> yes		yes			both
> >>
> >>
> >>>
> >>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> >>> Reported-by: Darrick Wong <darrick.wong@oracle.com>
> >>> ---
> >>>  fs/xfs/xfs_iops.c  |    3 ++-
> >>>  fs/xfs/xfs_super.c |    9 ++++++++-
> >>>  2 files changed, 10 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> >>> index 56475fcd76f2..ab352c325301 100644
> >>> --- a/fs/xfs/xfs_iops.c
> >>> +++ b/fs/xfs/xfs_iops.c
> >>> @@ -1204,7 +1204,8 @@ xfs_diflags_to_iflags(
> >>>  	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
> >>>  	    !xfs_is_reflink_inode(ip) &&
> >>>  	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
> >>> -	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> >>> +	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) &&
> >>> +	    blk_queue_dax(bdev_get_queue(inode->i_sb->s_bdev)))
> >>
> >> This does not discriminate between the rtdev or the data dev. This
> >> needs to call xfs_find_bdev_for_inode() to get the right device
> >> for the inode config.
> >>
> >> Further, if we add or remove the RT flag to the inode at a later
> >> point in time (e.g. via ioctl) we also need to re-evaluate the S_DAX
> >> flag at that point in time.
> > 
> > Ah, right, I'd missed that subtlety in my earlier replies.  Ok, add
> > another patch to this series to reevaluate S_DAX when we change the RT
> > flag.
> > 
> >> Which brings me to the real problem here: dynamically changing the
> >> S_DAX flag is racy, dangerous and broken. It's not clear that this
> >> should be allowed at all as the inode may have already been mmap()d
> >> by the time the ioctl is called to set/clear the rt file state.
>
> > Agreed that this is a mess.  Either this needs to get fixed in the dax
> > code, or we need to decide that we're not going to support reconfiguring
> > the dax flag at all, except possibly for empty files (similar to how we
> > restrict changes to the rt flag).
> 
> Does this mean we should add a check in xfs_ioctl_setattr_xflags() to
> reject removal of the realtime flag if S_DAX is set on the inode, until
> the dynamic change issue is sorted out?

Well, I /was/ tiptoeing around this while trying to take care of the
other 4.16 stuff, but now that I'm done with merge window stuff I'll let
loose. :)

The last time I paid much attention to DAX was the thread "re-enable XFS
per-inode DAX"[1] last September.  Motivating me to merge anything else
into DAX involves convincing me that we (mm, fs, dax developers) have
some kind of agreement about what we want the user-visible interfaces to
DAX to look like.  Namely:

0. On what level do we allow users / administrators to control usage of
the dax paths?  Can the hardware convey enough detail to the kernel that
the kernel can make a reasonable decision on its own whether buffered or
dax io make more sense?  If so, can we please just have that?  If not,
why?

1. If we want to let users override whatever decision the kernel makes,
how should we do this?  One mount option that applies to everything,
like ext4?  Inheritable inode flags, like xfs?  Do we have one to force
it on even if the kernel doesn't want to?  Do we have another to force
it off even if the kernel wants to?  Do we even want to go down this
path?  Can we get away with making the answer to Q0 "yes" and then see
if anyone actually complains about not having fine-grained control?

2. Under what conditions can we support dynamic changing of S_DAX on
inodes at runtime?  Will this switching work at any time?  Only for
files that are open but not mmap'd?  Only for files that are empty?

3. The MAP_SYNC support that was merged into 4.15 -- is this sufficient
to allow this fsyncless clflush business that everyone seems to want?

4. Can someone please fix the XFS iomap_begin function to handle CoW
properly?  I think it's a simple matter of allocate blocks, memcpy, and
remap, though I don't know how to do that. ;)

5. Do we test any of this stuff?

The thread from last September left off with promises to go define what
interface and behaviors we are providing to userspace, but afaict none
of that ever happened?  If we don't resolve these questions before LSF
then I think what's needed is to lock everyone in a room to hash all
this out. :P

--D

PS: My personal inclination is {yes, get rid of all that until someone
complains, i think so but haven't tested it, ???, i sure hope so}.

[1] https://marc.info/?l=linux-xfs&m=150638135225793&w=2

Dan Williams Feb. 7, 2018, 12:19 a.m. UTC | #8
On Tue, Feb 6, 2018 at 3:19 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Tue, Feb 06, 2018 at 03:32:00PM -0700, Dave Jiang wrote:
[..]
> The last time I paid much attention to DAX was the thread "re-enable XFS
> per-inode DAX"[1] last September.  Motivating me to merge anything else
> into DAX involves convincing me that we (mm, fs, dax developers) have
> some kind of agreement about what we want the user-visible interfaces to
> DAX to look like.  Namely:
>
> 0. On what level do we allow users / administrators to control usage of
> the dax paths?  Can the hardware convey enough detail to the kernel that
> the kernel can make a reasonable decision on its own whether buffered or
> dax io make more sense?  If so, can we please just have that?  If not,
> why?
>
> 1. If we want to let users override whatever decision the kernel makes,
> how should we do this?  One mount option that applies to everything,
> like ext4?  Inheritable inode flags, like xfs?  Do we have one to force
> it on even if the kernel doesn't want to?  Do we have another to force
> it off even if the kernel wants to?  Do we even want to go down this
> path?  Can we get away with making the answer to Q0 "yes" and then see
> if anyone actually complains about not having fine-grained control?

I think we will always have folks that want to force it on, i.e. the
MAP_SYNC user crowd. However, I think we might have some people who
will complain if they can't force it off. For example, I'm in the
process of killing off support for passing filesystem-dax mappings
through to guests because VFIO has the same "pin pages / fs-blocks
forever" problem as RDMA. Passing page cache through to a guest works
fine and it would be a shame if that silently stopped working in the
future. Given the page pinning constraint I'm not sure we can ever
support dynamically enabling DAX behind the user's back, at least not
until we kill off any "pin pages / fs-blocks forever" users, or
otherwise convert them to take a lease.
Ross Zwisler March 6, 2018, 12:06 a.m. UTC | #9
On Tue, Feb 06, 2018 at 03:19:15PM -0800, Darrick J. Wong wrote:
<>
> The last time I paid much attention to DAX was the thread "re-enable XFS
> per-inode DAX"[1] last September.  Motivating me to merge anything else
> into DAX involves convincing me that we (mm, fs, dax developers) have
> some kind of agreement about what we want the user-visible interfaces to
> DAX to look like.  

Yep, I agree that is the next step.

> Namely:
> 
> 0. On what level do we allow users / administrators to control usage of
> the dax paths?  Can the hardware convey enough detail to the kernel that
> the kernel can make a reasonable decision on its own whether buffered or
> dax io make more sense?  If so, can we please just have that?  If not,
> why?

Maybe eventually via the HMAT, but I don't think we have any systems today
that do a good job of this.

> 1. If we want to let users override whatever decision the kernel makes,
> how should we do this?  One mount option that applies to everything,
> like ext4?  Inheritable inode flags, like xfs?  Do we have one to force
> it on even if the kernel doesn't want to?  Do we have another to force
> it off even if the kernel wants to?  Do we even want to go down this
> path?  Can we get away with making the answer to Q0 "yes" and then see
> if anyone actually complains about not having fine-grained control?

I agree with Dan's assessment that even if we can make the kernel smart enough
to know when it's not a performance loss to use DAX (i.e. the persistent
memory you're using DAX on is just as fast as the page cache), users will
probably still want to retain the ability to force it on for use cases like
MAP_SYNC, and force it off for things like RDMA or VFIO, at least until the
page pinning work is complete.

Personally I'm still hopeful that we can have both the mount option and the
inheritable inode flags, and that we can figure out what we need to do to get
S_DAX transitions happening again.

> 2. Under what conditions can we support dynamic changing of S_DAX on
> inodes at runtime?  Will this switching work at any time?  Only for
> files that are open but not mmap'd?  Only for files that are empty?
>
> 3. The MAP_SYNC support that was merged into 4.15 -- is this sufficient
> to allow this fsyncless clflush business that everyone seems to want?

Yep, I think so.  The next big battles are S_DAX transitions, per-inode DAX
support, and of course the page pinning / leases code that Dan & Christoph
have been talking about.

> 4. Can someone please fix the XFS iomap_begin function to handle CoW
> properly?  I think it's a simple matter of allocate blocks, memcpy, and
> remap, though I don't know how to do that. ;)
> 
> 5. Do we test any of this stuff?

Yes, I think in general we do a pretty good job of DAX test case coverage
between a combination of xfstests (which I have added to as I've fixed DAX
related bugs), nfit_test and the ndctl unit tests.  hch has recently suggested
we start using blktests as well, though I don't think we've actually made any
new tests there yet.  Suggestions on how we can get better test coverage are
welcome.

> The thread from last September left off with promises to go define what
> interface and behaviors we are providing to userspace, but afaict none
> of that ever happened?  If we don't resolve these questions before LSF
> then I think what's needed is to lock everyone in a room to hash all
> this out. :P

Yep, that's accurate.  I got pulled off onto other work and am just now
finding my way back.  I think talking about it at LSF sounds great, but it's a
shame that hch won't be available.  It'll be nice to finally meet dchinner,
though. :)

> --D
> 
> PS: My personal inclination is {yes, get rid of all that until someone
> complains, i think so but haven't tested it, ???, i sure hope so}.
> 
> [1] https://marc.info/?l=linux-xfs&m=150638135225793&w=2
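For readers unfamiliar with the MAP_SYNC interface mentioned in question 3, the following is a minimal userspace sketch of how an application might request it and fall back to an ordinary shared mapping when the file is not on a DAX-capable filesystem. The helper names (map_prefer_sync, map_sync_demo) are hypothetical. The fallback #define values are assumed to match the generic Linux uapi headers and are only used when the C library headers do not provide the flags.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values assumed from the generic Linux uapi. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Try a MAP_SYNC mapping first. MAP_SYNC is only honoured together with
 * MAP_SHARED_VALIDATE, and the kernel rejects it on files that are not
 * DAX-backed. On any failure, fall back to a plain MAP_SHARED mapping.
 * *got_sync reports which kind of mapping was obtained.
 */
static void *map_prefer_sync(int fd, size_t len, int *got_sync)
{
	void *addr;

	assert(got_sync != NULL);

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	*got_sync = (addr != MAP_FAILED);
	if (addr == MAP_FAILED)
		addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);

	return addr == MAP_FAILED ? NULL : addr;
}

/*
 * Create a one-page file, map it, and store one byte.
 * Returns 1 if MAP_SYNC was granted, 0 if the fallback was used,
 * and -1 on error.
 */
static int map_sync_demo(const char *path)
{
	const size_t len = 4096;
	int got_sync = 0;
	int ret;
	char *p;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}

	p = map_prefer_sync(fd, len, &got_sync);
	if (p == NULL) {
		close(fd);
		return -1;
	}

	p[0] = 'x';
	ret = got_sync;
	/* Without MAP_SYNC the store still needs msync() (or fsync()) to be durable. */
	if (!got_sync && msync(p, len, MS_SYNC) != 0)
		ret = -1;

	munmap(p, len);
	close(fd);
	return ret;
}
```

Even when MAP_SYNC is granted, the application is still expected to flush its stores out of the CPU caches with the appropriate instructions; the flag only removes the need for fsync() to make the file's block mappings durable. On a non-DAX filesystem the sketch simply takes the fallback path.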

Patch

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..ab352c325301 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1204,7 +1204,8 @@  xfs_diflags_to_iflags(
 	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
 	    !xfs_is_reflink_inode(ip) &&
 	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
-	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
+	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) &&
+	    blk_queue_dax(bdev_get_queue(inode->i_sb->s_bdev)))
 		inode->i_flags |= S_DAX;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e8a687232614..5ac478924dce 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1649,11 +1649,18 @@  xfs_fs_fill_super(
 		sb->s_flags |= SB_I_VERSION;
 
 	if (mp->m_flags & XFS_MOUNT_DAX) {
+		bool rtdev_is_dax = false;
+
 		xfs_warn(mp,
 		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
 
+		if (mp->m_rtdev_targp->bt_daxdev)
+			if (bdev_dax_supported(mp->m_rtdev_targp->bt_bdev,
+					      sb->s_blocksize) == 0)
+				rtdev_is_dax = true;
+
 		error = bdev_dax_supported(sb->s_bdev, sb->s_blocksize);
-		if (error) {
+		if (error && !rtdev_is_dax) {
 			xfs_alert(mp,
 			"DAX unsupported by block device. Turning off DAX.");
 			mp->m_flags &= ~XFS_MOUNT_DAX;