Message ID | 1592414619-5646-1-git-send-email-joshi.k@samsung.com (mailing list archive) |
---|---|
Series | zone-append support in aio and io-uring |
On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>
> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
> of the zone to issue append. On completion 'res2' field is used to return
> zone-relative offset.

Maybe it's obvious to everyone working with zoned drives on a daily basis, but
please explain in the commit message why you need to return the zone-relative
offset to the application.
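As a rough illustration of the question above (a toy model, not the patchset's code): with zone append the caller names only the zone, and the device places the data at its current write pointer, so the completion is the only place the application can learn where its data went. `Zone`, `append` and `absolute_offset` are hypothetical names, and byte units are an assumption; the cover letter does not state the unit of the returned offset.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Toy model of one sequential-write zone (all sizes in bytes, by assumption)."""
    start: int              # absolute device offset of the zone's first byte
    capacity: int           # writable bytes in the zone
    write_pointer: int = 0  # zone-relative offset of the next write

    def append(self, length: int) -> int:
        """Place `length` bytes at the write pointer and return where they landed.

        The caller does not choose the location; it only learns it from the
        return value, which stands in for the completion's offset field.
        """
        if length <= 0:
            raise ValueError("length must be positive")
        if self.write_pointer + length > self.capacity:
            raise OSError("zone full")
        landed_at = self.write_pointer
        self.write_pointer += length
        return landed_at


def absolute_offset(zone: Zone, zone_relative: int) -> int:
    """Convert a zone-relative offset into an absolute device offset."""
    return zone.start + zone_relative
```

An application that later wants to read the data back has to record `absolute_offset(zone, landed_at)` for each append, since it never supplied that location itself.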
On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>
> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
> of the zone to issue append. On completion 'res2' field is used to return
> zone-relative offset.
>
> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset

And what exactly are the semantics supposed to be? Remember the
unix file abstraction does not know about zones at all.

I really don't think squeezing low-level, not-quite-block-storage
protocol details into the Linux read/write path is a good idea.

What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
to report where they actually wrote, as that comes close to Zone Append
while still making sense at our usual abstraction level for file I/O.
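To make the "report where they actually wrote" idea concrete, here is a minimal sketch of how a blocking writer can already recover that position indirectly on a regular file. It assumes the descriptor was opened with O_APPEND and is not shared with another thread or process. `append_and_locate` is a hypothetical helper name. This is only a userspace workaround for synchronous writes, not the interface suggested above, and it does not extend to asynchronous submissions with several writes in flight.

```python
import os


def append_and_locate(fd: int, data: bytes) -> tuple:
    """Append `data` via an O_APPEND descriptor; return (landing offset, bytes written).

    With O_APPEND the kernel moves the file offset to end-of-file before the
    write and advances it by the number of bytes written, so the landing
    position is the offset after the write minus the bytes written.
    Assumption: no other user of this same open file description is writing
    or seeking between the two calls.
    """
    written = os.write(fd, data)
    end = os.lseek(fd, 0, os.SEEK_CUR)
    return end - written, written
```

The extra lseek call and the single-user assumption are exactly what a write that returned its own landing position would make unnecessary.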
On 17/06/2020 19.23, Kanchan Joshi wrote:
> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>
> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
> of the zone to issue append. On completion 'res2' field is used to return
> zone-relative offset.
>
> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset

Please provide pointers to applications that are updated and ready
to take advantage of zone append.

I do not believe it's beneficial at this point to change the libaio
API; applications that want to use this API should switch to
io_uring anyway.

Please also note that applications and libraries that want to take
advantage of zone append can already use the zonefs file system, as
it will use the zone append command when applicable.

> Kanchan Joshi (1):
>   aio: add support for zone-append
>
> Selvakumar S (2):
>   fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling
>   io_uring: add support for zone-append
>
>  fs/aio.c                      |  8 +++++
>  fs/block_dev.c                | 19 +++++++++++-
>  fs/io_uring.c                 | 72 +++++++++++++++++++++++++++++++++++++++++--
>  include/linux/fs.h            |  1 +
>  include/uapi/linux/aio_abi.h  |  1 +
>  include/uapi/linux/io_uring.h |  8 ++++-
>  6 files changed, 105 insertions(+), 4 deletions(-)
>
On 18.06.2020 10:04, Matias Bjørling wrote:
>On 17/06/2020 19.23, Kanchan Joshi wrote:
>>This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>
>>For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>>of the zone to issue append. On completion 'res2' field is used to return
>>zone-relative offset.
>>
>>For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>
>Please provide a pointers to applications that are updated and ready
>to take advantage of zone append.

Good point. We are posting an RFC with fio support for append. We wanted
to start the conversation here first.

We can post a fork to improve the reviews in V2.

>
>I do not believe it's beneficial at this point to change the libaio
>API, applications that would want to use this API, should anyway
>switch to use io_uring.

I can see why you say this, but isn't it too restrictive to drop
libaio support outright? We can split the patches and merge uring
first; no problem.

>
>Please also note that applications and libraries that want to take
>advantage of zone append, can already use the zonefs file-system, as
>it will use the zone append command when applicable.

Sure. There are different paths available already, which is great. We
have use cases for uring and would like to enable them too.

Thanks,
Javier
On 17.06.2020 23:56, Christoph Hellwig wrote:
>On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
>> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>
>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>> of the zone to issue append. On completion 'res2' field is used to return
>> zone-relative offset.
>>
>> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>
>And what exactly are the semantics supposed to be? Remember the
>unix file abstractions does not know about zones at all.
>
>I really don't think squeezing low-level not quite block storage
>protocol details into the Linux read/write path is a good idea.
>
>What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
>to report where they actually wrote, as that comes close to Zone Append
>while still making sense at our usual abstraction level for file I/O.

Makes sense. We will look into this for a V2.

Thanks,
Javier
On 18/06/2020 10.27, Javier González wrote:
> On 18.06.2020 10:04, Matias Bjørling wrote:
>> On 17/06/2020 19.23, Kanchan Joshi wrote:
>>> This patchset enables issuing zone-append using aio and io-uring
>>> direct-io interface.
>>>
>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application
>>> uses start LBA
>>> of the zone to issue append. On completion 'res2' field is used to
>>> return
>>> zone-relative offset.
>>>
>>> For io-uring, this introduces three opcodes:
>>> IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>> Since io_uring does not have aio-like res2, cqe->flags are
>>> repurposed to return zone-relative offset
>>
>> Please provide a pointers to applications that are updated and ready
>> to take advantage of zone append.
>
> Good point. We are posting a RFC with fio support for append. We wanted
> to start the conversation here before.
>
> We can post a fork for improve the reviews in V2.

Christoph's response points out that it is not exactly clear how this
matches with the POSIX API.

fio support is great - but I was thinking along the lines of
applications that do more than benchmark performance. fio should be part
of the supported applications, but should not be the sole reason the
API is added.
On 18.06.2020 10:32, Matias Bjørling wrote:
>On 18/06/2020 10.27, Javier González wrote:
>>On 18.06.2020 10:04, Matias Bjørling wrote:
>>>On 17/06/2020 19.23, Kanchan Joshi wrote:
>>>>This patchset enables issuing zone-append using aio and io-uring
>>>>direct-io interface.
>>>>
>>>>For aio, this introduces opcode IOCB_CMD_ZONE_APPEND.
>>>>Application uses start LBA
>>>>of the zone to issue append. On completion 'res2' field is used
>>>>to return
>>>>zone-relative offset.
>>>>
>>>>For io-uring, this introduces three opcodes:
>>>>IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>>>Since io_uring does not have aio-like res2, cqe->flags are
>>>>repurposed to return zone-relative offset
>>>
>>>Please provide a pointers to applications that are updated and
>>>ready to take advantage of zone append.
>>
>>Good point. We are posting a RFC with fio support for append. We wanted
>>to start the conversation here before.
>>
>>We can post a fork for improve the reviews in V2.
>
>Christoph's response points that it is not exactly clear how this
>matches with the POSIX API.

Yes. We will address this.

>
>fio support is great - but I was thinking along the lines of
>applications that not only benchmark performance. fio should be part
>of the supported applications, but should not be the sole reason the
>API is added.

Agree. It is a process with different steps. We definitely want to have
the right kernel interface before pushing any changes to libraries and /
or applications. These will come as the interface becomes more stable.

To start with xNVMe will be leveraging this new path. A number of
customers are leveraging the xNVMe API for their applications already.

Thanks,
Javier
On 18/06/2020 10.39, Javier González wrote:
> On 18.06.2020 10:32, Matias Bjørling wrote:
>> On 18/06/2020 10.27, Javier González wrote:
>>> On 18.06.2020 10:04, Matias Bjørling wrote:
>>>> On 17/06/2020 19.23, Kanchan Joshi wrote:
>>>>> This patchset enables issuing zone-append using aio and io-uring
>>>>> direct-io interface.
>>>>>
>>>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application
>>>>> uses start LBA
>>>>> of the zone to issue append. On completion 'res2' field is used to
>>>>> return
>>>>> zone-relative offset.
>>>>>
>>>>> For io-uring, this introduces three opcodes:
>>>>> IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>>>> Since io_uring does not have aio-like res2, cqe->flags are
>>>>> repurposed to return zone-relative offset
>>>>
>>>> Please provide a pointers to applications that are updated and
>>>> ready to take advantage of zone append.
>>>
>>> Good point. We are posting a RFC with fio support for append. We wanted
>>> to start the conversation here before.
>>>
>>> We can post a fork for improve the reviews in V2.
>>
>> Christoph's response points that it is not exactly clear how this
>> matches with the POSIX API.
>
> Yes. We will address this.
>>
>> fio support is great - but I was thinking along the lines of
>> applications that not only benchmark performance. fio should be part
>> of the supported applications, but should not be the sole reason the
>> API is added.
>
> Agree. It is a process with different steps. We definitely want to have
> the right kernel interface before pushing any changes to libraries and /
> or applications. These will come as the interface becomes more stable.
>
> To start with xNVMe will be leveraging this new path. A number of
> customers are leveraging the xNVMe API for their applications already.

Heh, let me be even more specific: open-source applications, that is,
applications other than fio (or any other benchmarking application) and
other than libraries that act as a mediator between two APIs.
On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:
> Please provide a pointers to applications that are updated and ready to take
> advantage of zone append.

That is a pretty high bar for kernel APIs that we don't otherwise
apply unless seriously in doubt.

> I do not believe it's beneficial at this point to change the libaio API,
> applications that would want to use this API, should anyway switch to use
> io_uring.

I think that really depends on the amount of churn required. We
absolutely can expose things like small additional flags or simple new
operations, as rewriting applications to different APIs is not exactly
trivial. On the other hand we really shouldn't do huge additions to
the machinery.

> Please also note that applications and libraries that want to take advantage
> of zone append, can already use the zonefs file-system, as it will use the
> zone append command when applicable.

Not really. While we already use Zone Append in zonefs for some cases,
we can't fully take advantage of the scalability of Zone Append. For
that we'd need a way to return the file position where an O_APPEND
write actually landed, as suggested in my earlier mail. Which I think
is a very useful addition, and Damien and I had looked into adding it
both for zonefs and normal file systems, but didn't get around to doing
the work yet.
On Wed, Jun 17, 2020 at 11:56:34PM -0700, Christoph Hellwig wrote:
>On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
>> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>
>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>> of the zone to issue append. On completion 'res2' field is used to return
>> zone-relative offset.
>>
>> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>
>And what exactly are the semantics supposed to be? Remember the
>unix file abstractions does not know about zones at all.
>
>I really don't think squeezing low-level not quite block storage
>protocol details into the Linux read/write path is a good idea.

I was thinking of raw block access to a zoned device rather than a
pristine file abstraction. And in that context, the semantics, at this
point, are unchanged (i.e. same as direct writes) while the flexibility
of the async interface gets added.
Synchronous writes on a single zone sound fine, but synchronous appends
on a single zone do not sound that fine.

>What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
>to report where they actually wrote, as that comes close to Zone Append
>while still making sense at our usual abstraction level for file I/O.

Thanks for suggesting this. O_APPEND and RWF_APPEND may not go well
with block access, as end-of-file will be picked from the dev inode. But
perhaps a new flag like RWF_ZONE_APPEND can help to transform writes
(aio or uring) into append without introducing new opcodes.
And, I think, this can fit fine on the file abstraction of zonefs as well.
On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:
>On 17/06/2020 19.23, Kanchan Joshi wrote:
>>This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>
>>For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>>of the zone to issue append. On completion 'res2' field is used to return
>>zone-relative offset.
>>
>>For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>
>Please provide a pointers to applications that are updated and ready
>to take advantage of zone append.
>
>I do not believe it's beneficial at this point to change the libaio
>API, applications that would want to use this API, should anyway
>switch to use io_uring.
>
>Please also note that applications and libraries that want to take
>advantage of zone append, can already use the zonefs file-system, as
>it will use the zone append command when applicable.

AFAIK, zonefs uses append while serving synchronous I/O, and the append
bio is waited upon synchronously. That may be serving some purpose I am
not currently aware of. But it seems applications using the zonefs file
abstraction would benefit if they could use append themselves to
carry the I/O asynchronously.
On 18/06/2020 21.21, Kanchan Joshi wrote:
> On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:
>> On 17/06/2020 19.23, Kanchan Joshi wrote:
>>> This patchset enables issuing zone-append using aio and io-uring
>>> direct-io interface.
>>>
>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application
>>> uses start LBA
>>> of the zone to issue append. On completion 'res2' field is used to
>>> return
>>> zone-relative offset.
>>>
>>> For io-uring, this introduces three opcodes:
>>> IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>> Since io_uring does not have aio-like res2, cqe->flags are
>>> repurposed to return zone-relative offset
>>
>> Please provide a pointers to applications that are updated and ready
>> to take advantage of zone append.
>>
>> I do not believe it's beneficial at this point to change the libaio
>> API, applications that would want to use this API, should anyway
>> switch to use io_uring.
>>
>> Please also note that applications and libraries that want to take
>> advantage of zone append, can already use the zonefs file-system, as
>> it will use the zone append command when applicable.
>
> AFAIK, zonefs uses append while serving synchronous I/O. And append bio
> is waited upon synchronously. That maybe serving some purpose I do
> not know currently. But it seems applications using zonefs file
> abstraction will get benefitted if they could use the append
> themselves to
> carry the I/O, asynchronously.

Yep, please see Christoph's comment regarding adding the support to zonefs.
On 2020/06/19 5:04, Matias Bjørling wrote:
> On 18/06/2020 21.21, Kanchan Joshi wrote:
>> On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:
>>> On 17/06/2020 19.23, Kanchan Joshi wrote:
>>>> This patchset enables issuing zone-append using aio and io-uring
>>>> direct-io interface.
>>>>
>>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application
>>>> uses start LBA
>>>> of the zone to issue append. On completion 'res2' field is used to
>>>> return
>>>> zone-relative offset.
>>>>
>>>> For io-uring, this introduces three opcodes:
>>>> IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>>> Since io_uring does not have aio-like res2, cqe->flags are
>>>> repurposed to return zone-relative offset
>>>
>>> Please provide a pointers to applications that are updated and ready
>>> to take advantage of zone append.
>>>
>>> I do not believe it's beneficial at this point to change the libaio
>>> API, applications that would want to use this API, should anyway
>>> switch to use io_uring.
>>>
>>> Please also note that applications and libraries that want to take
>>> advantage of zone append, can already use the zonefs file-system, as
>>> it will use the zone append command when applicable.
>>
>> AFAIK, zonefs uses append while serving synchronous I/O. And append bio
>> is waited upon synchronously. That maybe serving some purpose I do
>> not know currently. But it seems applications using zonefs file
>> abstraction will get benefitted if they could use the append
>> themselves to
>> carry the I/O, asynchronously.
> Yep, please see Christoph's comment regarding adding the support to zonefs.

For the asynchronous processing of zone append in zonefs, we need to add
plumbing in the iomap code first. Since this is missing currently, zonefs
can only do synchronous/blocking zone append for now. We will be working
on that, if we can come up with semantics that make sense for POSIX
system calls.

zonefs is not a POSIX-compliant file system, so we are not strongly tied
to the POSIX specifications. But we still want to make it as easy as
possible for the user to understand and use.
On 2020/06/19 2:55, Kanchan Joshi wrote:
> On Wed, Jun 17, 2020 at 11:56:34PM -0700, Christoph Hellwig wrote:
>> On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
>>> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>>
>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>>> of the zone to issue append. On completion 'res2' field is used to return
>>> zone-relative offset.
>>>
>>> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>>
>> And what exactly are the semantics supposed to be? Remember the
>> unix file abstractions does not know about zones at all.
>>
>> I really don't think squeezing low-level not quite block storage
>> protocol details into the Linux read/write path is a good idea.
>
> I was thinking of raw block-access to zone device rather than pristine file
> abstraction. And in that context, semantics, at this point, are unchanged
> (i.e. same as direct writes) while flexibility of async-interface gets
> added.

The way aio->aio_offset is used by the user and the kernel differs between
regular writes and zone append writes. This is a significant enough change
to say that the semantics changed. Yes, both cases are direct I/Os, but the
write location specified by the user and where the data actually lands on
disk are different.

There are a lot of subtle things that can happen that make mapping zone
append operations to POSIX semantics difficult. E.g. for a regular file,
using zone append for any write issued to a file opened with O_APPEND maps
well to POSIX only for blocking writes. For asynchronous writes, that is not
true anymore, since the order of data defined by the automatic append after
the previous async write breaks: data can land anywhere in the zone
regardless of the offset specified on submission.

> Synchronous-writes on single-zone sound fine, but synchronous-appends on
> single-zone do not sound that fine.

Why not? This is a perfectly valid use case that actually does not have any
semantic problem. It indeed may not be the most effective method to get high
performance, but saying that it is "not fine" is not correct in my opinion.

>
>> What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
>> to report where they actually wrote, as that comes close to Zone Append
>> while still making sense at our usual abstraction level for file I/O.
>
> Thanks for suggesting this. O and RWF_APPEND may not go well with block
> access as end-of-file will be picked from dev inode. But perhaps a new
> flag like RWF_ZONE_APPEND can help to transform writes (aio or uring)
> into append without introducing new opcodes.

Yes, RWF_ZONE_APPEND may be better if the semantics of RWF_APPEND cannot be
cleanly reused. But as Christoph said, the RWF_ZONE_APPEND semantics need to
be clarified so that all reviewers can check the code against the intended
behavior, and comment on that intended behavior too.

> And, I think, this can fit fine on file-abstraction of ZoneFS as well.

Maybe. It depends on what semantics you are after for the user zone append
interface. Ideally, we should have at least the same for raw block devices
and zonefs. But zonefs may be able to do a better job thanks to its real
regular file abstraction of zones. As Christoph said, we started looking
into it but lacked the time to complete this work. This is still ongoing.
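The ordering problem with asynchronous appends described above can be illustrated with a toy simulation (hypothetical function names, no real device I/O, byte units assumed). Requests are submitted in index order, but each one lands at whatever the write pointer is when the simulated device services it, so the on-media order follows completion order rather than submission order.

```python
from typing import Dict, List, Sequence


def complete_appends(sizes: Sequence[int],
                     completion_order: Sequence[int]) -> Dict[int, int]:
    """Return {request index: zone-relative landing offset}.

    sizes[i] is the length of request i, listed in submission order.
    completion_order lists request indices in the order the simulated
    device services them; each request lands at the current write pointer.
    """
    if sorted(completion_order) != list(range(len(sizes))):
        raise ValueError("completion_order must be a permutation of the request indices")
    write_pointer = 0
    landed: Dict[int, int] = {}
    for req in completion_order:
        landed[req] = write_pointer
        write_pointer += sizes[req]
    return landed


def on_media_order(landed: Dict[int, int]) -> List[int]:
    """Request indices sorted by where their data ended up in the zone."""
    return sorted(landed, key=landed.get)
```

With `sizes = [4096, 8192, 4096]` and `completion_order = [2, 0, 1]`, request 2 lands at offset 0 even though it was submitted last, which is why the application needs the landing offset reported back for each completion.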
On Thu, Jun 18, 2020 at 11:22:58PM +0530, Kanchan Joshi wrote:
> I was thinking of raw block-access to zone device rather than pristine file
> abstraction.

Why?

> And in that context, semantics, at this point, are unchanged
> (i.e. same as direct writes) while flexibility of async-interface gets
> added.
> Synchronous-writes on single-zone sound fine, but synchronous-appends on
> single-zone do not sound that fine.

Where does synchronous access come into play?

> > What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
> > to report where they actually wrote, as that comes close to Zone Append
> > while still making sense at our usual abstraction level for file I/O.
>
> Thanks for suggesting this. O and RWF_APPEND may not go well with block
> access as end-of-file will be picked from dev inode.

No, but they go really well with zonefs.

> But perhaps a new
> flag like RWF_ZONE_APPEND can help to transform writes (aio or uring)
> into append without introducing new opcodes.

I don't think this is a good idea. Zones are a concept for a very
specific class of zoned devices. Trying to shoe-horn this into the
byte-addressed file / whole-device abstraction is not only ugly
conceptually but also adds the overhead for it to the VFS.

And O_APPEND that returns the written position OTOH makes total sense
at the file level as well and not just for raw zoned devices.