[v2,0/2] zone-append support in io-uring and aio

Message ID 1593105349-19270-1-git-send-email-joshi.k@samsung.com (mailing list archive)

Kanchan Joshi June 25, 2020, 5:15 p.m. UTC
[Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]

This patchset enables zone-append using io-uring/linux-aio, on block IO path.
Purpose is to provide zone-append consumption ability to applications which are
using zoned-block-device directly.

The application may specify RWF_ZONE_APPEND flag with write when it wants to
send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
aio, and pwritev2. An error is reported if zone-append is requested using
pwritev2. It is not in the scope of this patchset to support pwritev2 or any
other sync write API for reasons described later.

Zone-append completion result --->
With zone-append, the location where the write took place is known only after
completion. So apart from the usual return value of write, an additional means
is needed to obtain the actual written location.

In aio, this is returned to application using res2 field of io_event -

struct io_event {
        __u64           data;           /* the data field from the iocb */
        __u64           obj;            /* what iocb this event came from */
        __s64           res;            /* result code for this event */
        __s64           res2;           /* secondary result */
};

In io-uring, cqe->flags is repurposed for zone-append result.

struct io_uring_cqe {
        __u64   user_data;      /* sqe->data submission passed back */
        __s32   res;            /* result code for this event */
        __u32   flags;
};

Since the 32-bit flags field cannot hold a disk-relative byte offset, we chose
to return a zone-relative offset in sector/512b units. This can cover any zone
size represented by chunk_sectors. Applications will have to combine this with
the zone start to obtain the disk-relative offset. If more bits were obtained
by pulling from the res field instead, that too would compel applications to
interpret the res field differently, and it seems more painful than the former
option.
To keep uniformity, even with aio, zone-relative offset is returned.
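
For illustration, the conversion an application would have to do under this
proposal can be sketched as below. This is a minimal sketch and not part of the
patchset: the helper name append_result_to_bytes and its parameters are
hypothetical, and it assumes the completion value (cqe->flags or res2) carries
the zone-relative offset in 512-byte sectors as described above.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors, as used in the cover letter */

/*
 * Hypothetical helper (not from the patchset).
 * zone_start_bytes: byte offset of the zone start passed at submission.
 * rel_sectors:      zone-relative result in 512-byte sectors, as reported
 *                   on completion under this proposal.
 * Returns the disk-relative byte offset where the data landed.
 */
static inline uint64_t append_result_to_bytes(uint64_t zone_start_bytes,
                                              uint32_t rel_sectors)
{
        /* Widen before shifting so large sector counts do not overflow. */
        return zone_start_bytes + ((uint64_t)rel_sectors << SECTOR_SHIFT);
}
```

The widening cast matters: a 32-bit sector count shifted by 9 can need up to 41
bits, which is the same reason a byte offset does not fit in cqe->flags alone.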

Append using io_uring fixed-buffer --->
This is flagged as not-supported at the moment. Reason being, for fixed-buffer
io-uring sends iov_iter of bvec type. But current append-infra in block-layer
does not support such iov_iter.

Block IO vs File IO --->
For now, the user zone-append interface is supported only for zoned-block-device.
Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
will not need this anyway, because zone peculiarities are abstracted within FS.
At this point, ZoneFS also likes to use append implicitly rather than explicitly.
But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
allowing-only-block-device should be changed.

Semantics --->
Zone-append, by its nature, may perform write on a different location than what
was specified. It does not fit into POSIX, and trying to fit may just undermine
its benefit. It may be better to keep semantics as close to zone-append as
possible i.e. specify zone-start location, and obtain the actual-write location
post completion. Towards that goal, existing async APIs seem to fit fine.
Async APIs (uring, linux aio) do not work on implicit write-pointer and demand
explicit write offset (which is what we need for append). Neither write-pointer
is taken as input, nor it is updated on completion. And there is a clear way to
get zone-append result. Zone-aware applications while using these async APIs
can be fine with, for the lack of better word, zone-append semantics itself.
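
As a sketch of the submission side of these semantics: an application that
wants to append to the zone containing some byte offset first needs that zone's
start, since that is what it would pass as the write offset. The helper name
below is hypothetical, and it assumes the zone size in bytes is a power of two
(an assumption of this sketch; the zone size itself would come from the device,
for example from a zone report).

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patchset).
 * Rounds a byte offset down to the start of the zone containing it.
 * Assumes zone_size_bytes is a non-zero power of two.
 */
static inline uint64_t zone_start_of(uint64_t off_bytes,
                                     uint64_t zone_size_bytes)
{
        return off_bytes & ~(zone_size_bytes - 1);
}
```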

Sync APIs work with implicit write-pointer (at least few of those), and there is
no way to obtain zone-append result, making it hard for user-space zone-append.

Tests --->
Using new interface in fio (uring and libaio engine) by extending zbd tests
for zone-append: https://github.com/axboe/fio/pull/1026

Changes since v1:
- No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead.
- linux-aio changes vanish because of no new opcode
- Fixed the overflow and other issues mentioned by Damien
- Simplified uring support code, fixed the issues mentioned by Pavel
- Added error checks

Kanchan Joshi (1):
  fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path

Selvakumar S (1):
  io_uring: add support for zone-append

 fs/block_dev.c          | 28 ++++++++++++++++++++++++----
 fs/io_uring.c           | 32 ++++++++++++++++++++++++++++++--
 include/linux/fs.h      |  9 +++++++++
 include/uapi/linux/fs.h |  5 ++++-
 4 files changed, 67 insertions(+), 7 deletions(-)

Comments

Damien Le Moal June 26, 2020, 3:11 a.m. UTC | #1
On 2020/06/26 2:18, Kanchan Joshi wrote:
> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
> 
> This patchset enables zone-append using io-uring/linux-aio, on block IO path.
> Purpose is to provide zone-append consumption ability to applications which are
> using zoned-block-device directly.
> 
> The application may specify RWF_ZONE_APPEND flag with write when it wants to
> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
> aio, and pwritev2. An error is reported if zone-append is requested using
> pwritev2. It is not in the scope of this patchset to support pwritev2 or any
> other sync write API for reasons described later.
> 
> Zone-append completion result --->
> With zone-append, where write took place can only be known after completion.
> So apart from usual return value of write, additional mean is needed to obtain
> the actual written location.
> 
> In aio, this is returned to application using res2 field of io_event -
> 
> struct io_event {
>         __u64           data;           /* the data field from the iocb */
>         __u64           obj;            /* what iocb this event came from */
>         __s64           res;            /* result code for this event */
>         __s64           res2;           /* secondary result */
> };
> 
> In io-uring, cqe->flags is repurposed for zone-append result.
> 
> struct io_uring_cqe {
>         __u64   user_data;      /* sqe->data submission passed back */
>         __s32   res;            /* result code for this event */
>         __u32   flags;
> };
> 
> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
> in sector/512b units. This can cover zone-size represented by chunk_sectors.
> Applications will have the trouble to combine this with zone start to know
> disk-relative offset. But if more bits are obtained by pulling from res field
> that too would compel application to interpret res field differently, and it
> seems more painstaking than the former option.
> To keep uniformity, even with aio, zone-relative offset is returned.

I am really not a fan of this, to say the least. The input is byte offset, the
output is 512B relative sector count... Arg... We really cannot do better than
that ?

At the very least, byte relative offset ? The main reason is that this is
_somewhat_ acceptable for raw block device accesses since the "sector"
abstraction has a clear meaning, but once we add iomap/zonefs async zone append
support, we really will want to have byte unit as the interface is regular
files, not block device file. We could argue that 512B sector unit is still
around even for files (e.g. block counts in file stat). But the different unit
for input and output of one operation is really ugly. This is not nice for the user.

> 
> Append using io_uring fixed-buffer --->
> This is flagged as not-supported at the moment. Reason being, for fixed-buffer
> io-uring sends iov_iter of bvec type. But current append-infra in block-layer
> does not support such iov_iter.
> 
> Block IO vs File IO --->
> For now, the user zone-append interface is supported only for zoned-block-device.
> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
> will not need this anyway, because zone peculiarities are abstracted within FS.
> At this point, ZoneFS also likes to use append implicitly rather than explicitly.
> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
> allowing-only-block-device should be changed.

Sure, but I think the interface is still a problem. I am not super happy about
the 512B sector unit. Zonefs will be the only file system that will be impacted
since other normal POSIX file system will not have zone append interface for
users. So this is a limited problem. Still, even for raw block device files
accesses, POSIX system calls use Byte unit everywhere. Let's try to use that.

For aio, it is easy since res2 is unsigned long long. For io_uring, as discussed
already, we can still take 8 bits from the cqe res. All you need is to add a small
helper function in userspace iouring.h to simplify the work of the application
to get that result.

> 
> Semantics --->
> Zone-append, by its nature, may perform write on a different location than what
> was specified. It does not fit into POSIX, and trying to fit may just undermine
> its benefit. It may be better to keep semantics as close to zone-append as
> possible i.e. specify zone-start location, and obtain the actual-write location
> post completion. Towards that goal, existing async APIs seem to fit fine.
> Async APIs (uring, linux aio) do not work on implicit write-pointer and demand
> explicit write offset (which is what we need for append). Neither write-pointer

What do you mean by "implicit write pointer" ? Are you referring to the behavior
of AIO write with a block device file open with O_APPEND ? Then yes, it does not
work. But that is perfectly fine for regular files, that is for zonefs.

I would prefer that this paragraph simply state the semantic that is implemented
first. Then explain why the choice. But first, clarify how the API works, what
is allowed, what's not etc. That will also simplify reviewing the code as one
can then check the code against the goal.

> is taken as input, nor it is updated on completion. And there is a clear way to
> get zone-append result. Zone-aware applications while using these async APIs
> can be fine with, for the lack of better word, zone-append semantics itself.
> 
> Sync APIs work with implicit write-pointer (at least few of those), and there is
> no way to obtain zone-append result, making it hard for user-space zone-append.

Sync API are executed under inode lock, at least for regular files. So there is
absolutely no problem to use zone append. zonefs does it already. The problem is
the lack of locking for block device file.

> 
> Tests --->
> Using new interface in fio (uring and libaio engine) by extending zbd tests
> for zone-append: https://github.com/axboe/fio/pull/1026
> 
> Changes since v1:
> - No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead.
> - linux-aio changes vanish because of no new opcode
> - Fixed the overflow and other issues mentioned by Damien
> - Simplified uring support code, fixed the issues mentioned by Pavel
> - Added error checks
> 
> Kanchan Joshi (1):
>   fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path
> 
> Selvakumar S (1):
>   io_uring: add support for zone-append
> 
>  fs/block_dev.c          | 28 ++++++++++++++++++++++++----
>  fs/io_uring.c           | 32 ++++++++++++++++++++++++++++++--
>  include/linux/fs.h      |  9 +++++++++
>  include/uapi/linux/fs.h |  5 ++++-
>  4 files changed, 67 insertions(+), 7 deletions(-)
>
Javier Gonzalez June 26, 2020, 6:37 a.m. UTC | #2
On 26.06.2020 03:11, Damien Le Moal wrote:
>On 2020/06/26 2:18, Kanchan Joshi wrote:
>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
>>
>> This patchset enables zone-append using io-uring/linux-aio, on block IO path.
>> Purpose is to provide zone-append consumption ability to applications which are
>> using zoned-block-device directly.
>>
>> The application may specify RWF_ZONE_APPEND flag with write when it wants to
>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
>> aio, and pwritev2. An error is reported if zone-append is requested using
>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any
>> other sync write API for reasons described later.
>>
>> Zone-append completion result --->
>> With zone-append, where write took place can only be known after completion.
>> So apart from usual return value of write, additional mean is needed to obtain
>> the actual written location.
>>
>> In aio, this is returned to application using res2 field of io_event -
>>
>> struct io_event {
>>         __u64           data;           /* the data field from the iocb */
>>         __u64           obj;            /* what iocb this event came from */
>>         __s64           res;            /* result code for this event */
>>         __s64           res2;           /* secondary result */
>> };
>>
>> In io-uring, cqe->flags is repurposed for zone-append result.
>>
>> struct io_uring_cqe {
>>         __u64   user_data;      /* sqe->data submission passed back */
>>         __s32   res;            /* result code for this event */
>>         __u32   flags;
>> };
>>
>> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
>> in sector/512b units. This can cover zone-size represented by chunk_sectors.
>> Applications will have the trouble to combine this with zone start to know
>> disk-relative offset. But if more bits are obtained by pulling from res field
>> that too would compel application to interpret res field differently, and it
>> seems more painstaking than the former option.
>> To keep uniformity, even with aio, zone-relative offset is returned.
>
>I am really not a fan of this, to say the least. The input is byte offset, the
>output is 512B relative sector count... Arg... We really cannot do better than
>that ?
>
>At the very least, byte relative offset ? The main reason is that this is
>_somewhat_ acceptable for raw block device accesses since the "sector"
>abstraction has a clear meaning, but once we add iomap/zonefs async zone append
>support, we really will want to have byte unit as the interface is regular
>files, not block device file. We could argue that 512B sector unit is still
>around even for files (e.g. block counts in file stat). But the different unit
>for input and output of one operation is really ugly. This is not nice for the user.
>

You can refer to the discussion with Jens, Pavel and Alex on the uring
interface. With the bits we have and considering the maximum zone size
supported, there is no space for a byte relative offset. We can take
some bits from cqe->res, but we were afraid this is not very
future-proof. Do you have a better idea?


>>
>> Append using io_uring fixed-buffer --->
>> This is flagged as not-supported at the moment. Reason being, for fixed-buffer
>> io-uring sends iov_iter of bvec type. But current append-infra in block-layer
>> does not support such iov_iter.
>>
>> Block IO vs File IO --->
>> For now, the user zone-append interface is supported only for zoned-block-device.
>> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
>> will not need this anyway, because zone peculiarities are abstracted within FS.
>> At this point, ZoneFS also likes to use append implicitly rather than explicitly.
>> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
>> allowing-only-block-device should be changed.
>
>Sure, but I think the interface is still a problem. I am not super happy about
>the 512B sector unit. Zonefs will be the only file system that will be impacted
>since other normal POSIX file system will not have zone append interface for
>users. So this is a limited problem. Still, even for raw block device files
>accesses, POSIX system calls use Byte unit everywhere. Let's try to use that.
>
>For aio, it is easy since res2 is unsigned long long. For io_uring, as discussed
>already, we can still take 8 bits from the cqe res. All you need is to add a small
>helper function in userspace iouring.h to simplify the work of the application
>to get that result.

Ok. See above. We can do this.

Jens: Do you see this as a problem in the future?

[...]

Javier
Damien Le Moal June 26, 2020, 6:56 a.m. UTC | #3
On 2020/06/26 15:37, javier.gonz@samsung.com wrote:
> On 26.06.2020 03:11, Damien Le Moal wrote:
>> On 2020/06/26 2:18, Kanchan Joshi wrote:
>>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
>>>
>>> This patchset enables zone-append using io-uring/linux-aio, on block IO path.
>>> Purpose is to provide zone-append consumption ability to applications which are
>>> using zoned-block-device directly.
>>>
>>> The application may specify RWF_ZONE_APPEND flag with write when it wants to
>>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
>>> aio, and pwritev2. An error is reported if zone-append is requested using
>>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any
>>> other sync write API for reasons described later.
>>>
>>> Zone-append completion result --->
>>> With zone-append, where write took place can only be known after completion.
>>> So apart from usual return value of write, additional mean is needed to obtain
>>> the actual written location.
>>>
>>> In aio, this is returned to application using res2 field of io_event -
>>>
>>> struct io_event {
>>>         __u64           data;           /* the data field from the iocb */
>>>         __u64           obj;            /* what iocb this event came from */
>>>         __s64           res;            /* result code for this event */
>>>         __s64           res2;           /* secondary result */
>>> };
>>>
>>> In io-uring, cqe->flags is repurposed for zone-append result.
>>>
>>> struct io_uring_cqe {
>>>         __u64   user_data;      /* sqe->data submission passed back */
>>>         __s32   res;            /* result code for this event */
>>>         __u32   flags;
>>> };
>>>
>>> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
>>> in sector/512b units. This can cover zone-size represented by chunk_sectors.
>>> Applications will have the trouble to combine this with zone start to know
>>> disk-relative offset. But if more bits are obtained by pulling from res field
>>> that too would compel application to interpret res field differently, and it
>>> seems more painstaking than the former option.
>>> To keep uniformity, even with aio, zone-relative offset is returned.
>>
>> I am really not a fan of this, to say the least. The input is byte offset, the
>> output is 512B relative sector count... Arg... We really cannot do better than
>> that ?
>>
>> At the very least, byte relative offset ? The main reason is that this is
>> _somewhat_ acceptable for raw block device accesses since the "sector"
>> abstraction has a clear meaning, but once we add iomap/zonefs async zone append
>> support, we really will want to have byte unit as the interface is regular
>> files, not block device file. We could argue that 512B sector unit is still
>> around even for files (e.g. block counts in file stat). But the different unit
>> for input and output of one operation is really ugly. This is not nice for the user.
>>
> 
> You can refer to the discussion with Jens, Pavel and Alex on the uring
> interface. With the bits we have and considering the maximum zone size
> supported, there is no space for a byte relative offset. We can take
> some bits from cqe->res, but we were afraid this is not very
> future-proof. Do you have a better idea?

If you can take 8 bits, that gives you 40 bits, enough to support byte relative
offsets for any zone size defined as a number of 512B sectors using an unsigned
int. Max zone size is 2^31 sectors in that case, so 2^40 bytes. Unless I am
already too tired and my math is failing me...

zone size is defined by chunk_sectors, which is used for raid and software raids
too. This has been an unsigned int forever. I do not see the need for changing
this to a 64bit anytime soon, if ever. A raid with a stripe size larger than 1TB
does not really make any sense. Same for zone size...
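
The arithmetic above can be checked with a short sketch. This is only an
illustration under stated assumptions: the thread does not specify which 8 bits
of cqe->res would be used or how they would be encoded, so the struct and
helper names here (split_offset, split_byte_offset, join_byte_offset) are
hypothetical and simply model "8 extra bits plus the 32 bits of flags".

```c
#include <stdint.h>

#define SECTOR_SHIFT     9                    /* 512-byte sectors */
#define MAX_ZONE_SECTORS (1ULL << 31)         /* max zone size quoted above */
#define MAX_ZONE_BYTES   (MAX_ZONE_SECTORS << SECTOR_SHIFT)

/* 2^31 sectors of 512 bytes is 2^40 bytes, so offsets fit in 40 bits. */
_Static_assert(MAX_ZONE_BYTES == (1ULL << 40),
               "2^31 sectors of 512B must equal 2^40 bytes");

/* Hypothetical split of a 40-bit zone-relative byte offset. */
struct split_offset {
        uint8_t  hi; /* the 8 bits that would be taken from res */
        uint32_t lo; /* the 32 bits carried in flags */
};

static inline struct split_offset split_byte_offset(uint64_t off)
{
        struct split_offset s;

        s.hi = (uint8_t)(off >> 32);
        s.lo = (uint32_t)off;
        return s;
}

/* What a small userspace helper could do to rebuild the byte offset. */
static inline uint64_t join_byte_offset(uint8_t hi, uint32_t lo)
{
        return ((uint64_t)hi << 32) | lo;
}
```

Any byte offset inside a zone is strictly smaller than the zone size, so with
zones of at most 2^40 bytes the 40-bit round trip loses nothing.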

> 
> 
>>>
>>> Append using io_uring fixed-buffer --->
>>> This is flagged as not-supported at the moment. Reason being, for fixed-buffer
>>> io-uring sends iov_iter of bvec type. But current append-infra in block-layer
>>> does not support such iov_iter.
>>>
>>> Block IO vs File IO --->
>>> For now, the user zone-append interface is supported only for zoned-block-device.
>>> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
>>> will not need this anyway, because zone peculiarities are abstracted within FS.
>>> At this point, ZoneFS also likes to use append implicitly rather than explicitly.
>>> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
>>> allowing-only-block-device should be changed.
>>
>> Sure, but I think the interface is still a problem. I am not super happy about
>> the 512B sector unit. Zonefs will be the only file system that will be impacted
>> since other normal POSIX file system will not have zone append interface for
>> users. So this is a limited problem. Still, even for raw block device files
>> accesses, POSIX system calls use Byte unit everywhere. Let's try to use that.
>>
>> For aio, it is easy since res2 is unsigned long long. For io_uring, as discussed
>> already, we can still take 8 bits from the cqe res. All you need is to add a small
>> helper function in userspace iouring.h to simplify the work of the application
>> to get that result.
> 
> Ok. See above. We can do this.
> 
> Jens: Do you see this as a problem in the future?
> 
> [...]
> 
> Javier
>
Javier González June 26, 2020, 7:03 a.m. UTC | #4
On 26.06.2020 06:56, Damien Le Moal wrote:
>On 2020/06/26 15:37, javier.gonz@samsung.com wrote:
>> On 26.06.2020 03:11, Damien Le Moal wrote:
>>> On 2020/06/26 2:18, Kanchan Joshi wrote:
>>>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
>>>>
>>>> This patchset enables zone-append using io-uring/linux-aio, on block IO path.
>>>> Purpose is to provide zone-append consumption ability to applications which are
>>>> using zoned-block-device directly.
>>>>
>>>> The application may specify RWF_ZONE_APPEND flag with write when it wants to
>>>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
>>>> aio, and pwritev2. An error is reported if zone-append is requested using
>>>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any
>>>> other sync write API for reasons described later.
>>>>
>>>> Zone-append completion result --->
>>>> With zone-append, where write took place can only be known after completion.
>>>> So apart from usual return value of write, additional mean is needed to obtain
>>>> the actual written location.
>>>>
>>>> In aio, this is returned to application using res2 field of io_event -
>>>>
>>>> struct io_event {
>>>>         __u64           data;           /* the data field from the iocb */
>>>>         __u64           obj;            /* what iocb this event came from */
>>>>         __s64           res;            /* result code for this event */
>>>>         __s64           res2;           /* secondary result */
>>>> };
>>>>
>>>> In io-uring, cqe->flags is repurposed for zone-append result.
>>>>
>>>> struct io_uring_cqe {
>>>>         __u64   user_data;      /* sqe->data submission passed back */
>>>>         __s32   res;            /* result code for this event */
>>>>         __u32   flags;
>>>> };
>>>>
>>>> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
>>>> in sector/512b units. This can cover zone-size represented by chunk_sectors.
>>>> Applications will have the trouble to combine this with zone start to know
>>>> disk-relative offset. But if more bits are obtained by pulling from res field
>>>> that too would compel application to interpret res field differently, and it
>>>> seems more painstaking than the former option.
>>>> To keep uniformity, even with aio, zone-relative offset is returned.
>>>
>>> I am really not a fan of this, to say the least. The input is byte offset, the
>>> output is 512B relative sector count... Arg... We really cannot do better than
>>> that ?
>>>
>>> At the very least, byte relative offset ? The main reason is that this is
>>> _somewhat_ acceptable for raw block device accesses since the "sector"
>>> abstraction has a clear meaning, but once we add iomap/zonefs async zone append
>>> support, we really will want to have byte unit as the interface is regular
>>> files, not block device file. We could argue that 512B sector unit is still
>>> around even for files (e.g. block counts in file stat). But the different unit
>>> for input and output of one operation is really ugly. This is not nice for the user.
>>>
>>
>> You can refer to the discussion with Jens, Pavel and Alex on the uring
>> interface. With the bits we have and considering the maximum zone size
>> supported, there is no space for a byte relative offset. We can take
>> some bits from cqe->res, but we were afraid this is not very
>> future-proof. Do you have a better idea?
>
>If you can take 8 bits, that gives you 40 bits, enough to support byte relative
>offsets for any zone size defined as a number of 512B sectors using an unsigned
>int. Max zone size is 2^31 sectors in that case, so 2^40 bytes. Unless I am
>already too tired and my math is failing me...

Yes, the math is correct. I was thinking more of the bits being needed
for another use-case that could collide with append. We considered this
and discarded it for being messy - when Pavel brought up the 512B
alignment we saw it as a good alternative.

Note too that we would be able to translate to a byte offset in
iouring.h too so the user would not need to think of this.

I do not feel strongly about this, so whichever option better fits the
current and near-future needs of uring is the one we will send in V3. We
will give it until next week for others to comment too.

>
>zone size is defined by chunk_sectors, which is used for raid and software raids
>too. This has been an unsigned int forever. I do not see the need for changing
>this to a 64bit anytime soon, if ever. A raid with a stripe size larger than 1TB
>does not really make any sense. Same for zone size...

Yes. I think already max zone sizes are pretty huge. But yes, this might
change, so we will take it when it happens.

[...]

Javier
Kanchan Joshi June 26, 2020, 10:15 p.m. UTC | #5
On Fri, Jun 26, 2020 at 03:11:55AM +0000, Damien Le Moal wrote:
>On 2020/06/26 2:18, Kanchan Joshi wrote:
>> Semantics --->
>> Zone-append, by its nature, may perform write on a different location than what
>> was specified. It does not fit into POSIX, and trying to fit may just undermine
>> its benefit. It may be better to keep semantics as close to zone-append as
>> possible i.e. specify zone-start location, and obtain the actual-write location
>> post completion. Towards that goal, existing async APIs seem to fit fine.
>> Async APIs (uring, linux aio) do not work on implicit write-pointer and demand
>> explicit write offset (which is what we need for append). Neither write-pointer
>
>What do you mean by "implicit write pointer" ? Are you referring to the behavior
>of AIO write with a block device file open with O_APPEND ? Then yes, it does not
>work. But that is perfectly fine for regular files, that is for zonefs.
Sorry, I meant file pointer.
Yes, block-device opened with O_APPEND does not increase the file-pointer
to end-of-device. That said, for uring and aio, file-pointer position
plays no role, and it is the application's responsibility to pass the right write
location.
>I would prefer that this paragraph simply state the semantic that is implemented
>first. Then explain why the choice. But first, clarify how the API works, what
>is allowed, what's not etc. That will also simplify reviewing the code as one
>can then check the code against the goal.

In this path (block IO) there is hardly any scope/attempt to abstract away anything.
So raw zoned-storage rules/semantics apply. I expect zone-aware
applications, which are already aware of the rules, to be the consumers of this.

>> is taken as input, nor it is updated on completion. And there is a clear way to
>> get zone-append result. Zone-aware applications while using these async APIs
>> can be fine with, for the lack of better word, zone-append semantics itself.
>>
>> Sync APIs work with implicit write-pointer (at least few of those), and there is
>> no way to obtain zone-append result, making it hard for user-space zone-append.
>
>Sync API are executed under inode lock, at least for regular files. So there is
>absolutely no problem to use zone append. zonefs does it already. The problem is
>the lack of locking for block device file.
Yes. I was referring to the problem of returning actual write-location using
sync APIs like write, pwrite, pwritev/v2.
>>
>> Tests --->
>> Using new interface in fio (uring and libaio engine) by extending zbd tests
>> for zone-append: https://github.com/axboe/fio/pull/1026
>>
>> Changes since v1:
>> - No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead.
>> - linux-aio changes vanish because of no new opcode
>> - Fixed the overflow and other issues mentioned by Damien
>> - Simplified uring support code, fixed the issues mentioned by Pavel
>> - Added error checks
>>
>> Kanchan Joshi (1):
>>   fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path
>>
>> Selvakumar S (1):
>>   io_uring: add support for zone-append
>>
>>  fs/block_dev.c          | 28 ++++++++++++++++++++++++----
>>  fs/io_uring.c           | 32 ++++++++++++++++++++++++++++++--
>>  include/linux/fs.h      |  9 +++++++++
>>  include/uapi/linux/fs.h |  5 ++++-
>>  4 files changed, 67 insertions(+), 7 deletions(-)
>>
>
>
>-- 
>Damien Le Moal
>Western Digital Research
>
Matthew Wilcox (Oracle) June 30, 2020, 12:46 p.m. UTC | #6
On Thu, Jun 25, 2020 at 10:45:47PM +0530, Kanchan Joshi wrote:
> Zone-append completion result --->
> With zone-append, where write took place can only be known after completion.
> So apart from usual return value of write, additional mean is needed to obtain
> the actual written location.
> 
> In aio, this is returned to application using res2 field of io_event -
> 
> struct io_event {
>         __u64           data;           /* the data field from the iocb */
>         __u64           obj;            /* what iocb this event came from */
>         __s64           res;            /* result code for this event */
>         __s64           res2;           /* secondary result */
> };

Ah, now I understand.  I think you're being a little too specific by
calling this zone-append.  This is really a "write-anywhere" operation,
and the specified address is only a hint.

> In io-uring, cqe->flags is repurposed for zone-append result.
> 
> struct io_uring_cqe {
>         __u64   user_data;      /* sqe->data submission passed back */
>         __s32   res;            /* result code for this event */
>         __u32   flags;
> };
> 
> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
> in sector/512b units. This can cover zone-size represented by chunk_sectors.
> Applications will have the trouble to combine this with zone start to know
> disk-relative offset. But if more bits are obtained by pulling from res field
> that too would compel application to interpret res field differently, and it
> seems more painstaking than the former option.
> To keep uniformity, even with aio, zone-relative offset is returned.

Urgh, no, that's dreadful.  I'm not familiar with the io_uring code.
Maybe the first 8 bytes of the user_data could be required to be the
result offset for this submission type?

> Block IO vs File IO --->
> For now, the user zone-append interface is supported only for zoned-block-device.
> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
> will not need this anyway, because zone peculiarities are abstracted within FS.
> At this point, ZoneFS also likes to use append implicitly rather than explicitly.
> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
> allowing-only-block-device should be changed.

But we also have O_APPEND files.  And maybe we'll have other kinds of file
in future for which this would make sense.

> Semantics --->
> Zone-append, by its nature, may perform write on a different location than what
> was specified. It does not fit into POSIX, and trying to fit may just undermine

... I disagree that it doesn't fit into POSIX.  As I said above, O_APPEND
is a POSIX concept, so POSIX already understands that writes may not end
up at the current write pointer.

> its benefit. It may be better to keep semantics as close to zone-append as
> possible i.e. specify zone-start location, and obtain the actual-write location
> post completion. Towards that goal, existing async APIs seem to fit fine.
> Async APIs (uring, linux aio) do not work on implicit write-pointer and demand
> explicit write offset (which is what we need for append). Neither write-pointer
> is taken as input, nor it is updated on completion. And there is a clear way to
> get zone-append result. Zone-aware applications while using these async APIs
> can be fine with, for the lack of better word, zone-append semantics itself.
> 
> Sync APIs work with implicit write-pointer (at least few of those), and there is
> no way to obtain zone-append result, making it hard for user-space zone-append.