Message ID | 20241030180112.4635-7-joshi.k@samsung.com (mailing list archive) |
---|---|
State | New |
Series | Read/Write with metadata/integrity |
On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 024745283783..48dcca125db3 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>  		 */
>  		__u8	cmd[0];
>  	};
> +	/*
> +	 * If the ring is initialized with IORING_SETUP_SQE128, then
> +	 * this field is starting offset for 64 bytes of data. For meta io
> +	 * this contains 'struct io_uring_meta_pi'
> +	 */
> +	__u8	big_sqe[0];
> +};
> +
> +/* this is placed in SQE128 */
> +struct io_uring_meta_pi {
> +	__u16	pi_flags;
> +	__u16	app_tag;
> +	__u32	len;
> +	__u64	addr;
> +	__u64	seed;
> +	__u64	rsvd[2];
>  };

On the previous version, I was more questioning if it aligns with what
Pavel was trying to do here. I didn't quite get it, so I was more
confused than saying it should be this way now.

But I personally think this path makes sense. I would set it up just a
little differently for extended sqe's so that the PI overlays a more
generic struct that other opcodes might find a way to use later.
Something like:

struct io_uring_sqe_ext {
	union {
		__u32	rsvd0[8];
		struct {
			__u16	pi_flags;
			__u16	app_tag;
			__u32	len;
			__u64	addr;
			__u64	seed;
		} rw_pi;
	};
	__u32	rsvd1[8];
};

> @@ -3902,6 +3903,9 @@ static int __init io_uring_init(void)
>  	/* top 8bits are for internal use */
>  	BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
>
> +	BUILD_BUG_ON(sizeof(struct io_uring_meta_pi) >
> +		     sizeof(struct io_uring_sqe));

Then this check would become:

	BUILD_BUG_ON(sizeof(struct io_uring_sqe_ext) != sizeof(struct io_uring_sqe));
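The layout proposed above can be sanity-checked outside the kernel. The following is a minimal userspace sketch, assuming <stdint.h> fixed-width types stand in for __u16/__u32/__u64 and that the base SQE is 64 bytes. struct io_uring_sqe_ext is only a proposal from this thread, and BASE_SQE_SIZE and sqe_ext_pi_slack() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: size of the base struct io_uring_sqe. */
#define BASE_SQE_SIZE 64

/* Userspace mirror of the proposed struct; not a released uapi type. */
struct io_uring_sqe_ext {
	union {
		uint32_t rsvd0[8];
		struct {
			uint16_t pi_flags;
			uint16_t app_tag;
			uint32_t len;
			uint64_t addr;
			uint64_t seed;
		} rw_pi;
	};
	uint32_t rsvd1[8];
};

/* Userspace analogue of the suggested BUILD_BUG_ON() size check. */
_Static_assert(sizeof(struct io_uring_sqe_ext) == BASE_SQE_SIZE,
	       "extended SQE must match the base SQE size");
_Static_assert(offsetof(struct io_uring_sqe_ext, rsvd1) == 32,
	       "second reserved area starts at the 32-byte mark");

/* Bytes of the first 32-byte half that the PI overlay leaves unused. */
static inline size_t sqe_ext_pi_slack(void)
{
	return sizeof(((struct io_uring_sqe_ext *)0)->rsvd0) -
	       sizeof(((struct io_uring_sqe_ext *)0)->rw_pi);
}
```

With this layout the PI overlay occupies 24 of the first 32 bytes, so another opcode could alias the same union without touching rsvd1.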
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
On 10/30/24 21:09, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 024745283783..48dcca125db3 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>>  		 */
>>  		__u8	cmd[0];
>>  	};
>> +	/*
>> +	 * If the ring is initialized with IORING_SETUP_SQE128, then
>> +	 * this field is starting offset for 64 bytes of data. For meta io
>> +	 * this contains 'struct io_uring_meta_pi'
>> +	 */
>> +	__u8	big_sqe[0];
>> +};

I don't think zero sized arrays are good as a uapi regardless of
cmd[0] above, let's just do

sqe = get_sqe();
big_sqe = (void *)(sqe + 1)

with an appropriate helper.

>> +
>> +/* this is placed in SQE128 */
>> +struct io_uring_meta_pi {
>> +	__u16	pi_flags;
>> +	__u16	app_tag;
>> +	__u32	len;
>> +	__u64	addr;
>> +	__u64	seed;
>> +	__u64	rsvd[2];
>>  };
>
> On the previous version, I was more questioning if it aligns with what

I missed that discussion, let me know if I need to look it up

> Pavel was trying to do here. I didn't quite get it, so I was more
> confused than saying it should be this way now.

The point is, SQEs don't have nearly enough space to accommodate all
such optional features, especially when it's taking so much space and
not applicable to all reads but rather some specific use cases and
files. Consider that there might be more similar extensions and we might
even want to use them together.

1. SQE128 makes it big for all requests, intermixing with requests that
don't need additional space wastes space. SQE128 is fine to use but at
the same time we should be mindful about it and try to avoid enabling it
if feasible.

2. This API hard codes io_uring_meta_pi into the extended part of the
SQE. If we want to add another feature it'd need to go after the meta
struct. SQE256? And what if the user doesn't need PI but only the second
feature?

In short, the uAPI needs to have a clear vision of how it can be used
with / extended to multiple optional features and not just PI.

One option I mentioned before is passing a user pointer to an array of
structures, each of which will have the type specifying what kind of
feature / meta information it is, e.g. META_TYPE_PI. It's not a
complete solution but a base idea to extend upon. I separately
mentioned before, if copy_from_user is expensive we can optimise it
with pre-registering memory. I think Jens even tried something similar
with structures we pass as waiting parameters.

I didn't read through all iterations of the series, so if there is
some other approach described that ticks the boxes and is flexible
enough, I'd be absolutely fine with it.
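The "(sqe + 1) with an appropriate helper" suggestion above could look roughly like the sketch below. It assumes a 64-byte base SQE and that each entry of an SQE128 ring occupies two consecutive 64-byte slots. struct sqe64, sqe128_ext() and sqe128_at() are hypothetical names for illustration and do not exist in the kernel or in liburing.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the 64-byte struct io_uring_sqe; only its size matters here. */
struct sqe64 {
	uint8_t bytes[64];
};

/*
 * Hypothetical helper: the extended half of an SQE128 entry starts
 * immediately after the base SQE, so no zero-sized array member is needed.
 */
static inline void *sqe128_ext(struct sqe64 *sqe)
{
	return (void *)(sqe + 1);
}

/*
 * Hypothetical helper: base SQE of entry idx on an SQE128 ring, where
 * every entry takes two 64-byte slots.
 */
static inline struct sqe64 *sqe128_at(struct sqe64 *ring, unsigned int idx)
{
	return &ring[idx * 2];
}
```

A caller would fetch the base SQE with sqe128_at() and pass it to sqe128_ext() to reach the 64 extra bytes, which keeps the uapi struct itself free of a big_sqe[0] member.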
On 10/31/2024 8:09 PM, Pavel Begunkov wrote:
> On 10/30/24 21:09, Keith Busch wrote:
>> On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index 024745283783..48dcca125db3 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>>>  		 */
>>>  		__u8	cmd[0];
>>>  	};
>>> +	/*
>>> +	 * If the ring is initialized with IORING_SETUP_SQE128, then
>>> +	 * this field is starting offset for 64 bytes of data. For meta io
>>> +	 * this contains 'struct io_uring_meta_pi'
>>> +	 */
>>> +	__u8	big_sqe[0];
>>> +};
>
> I don't think zero sized arrays are good as a uapi regardless of
> cmd[0] above, let's just do
>
> sqe = get_sqe();
> big_sqe = (void *)(sqe + 1)
>
> with an appropriate helper.

In one of the internal versions I did just that (i.e., sqe + 1), and
that's fine for the kernel.
But afterwards I added big_sqe so that userspace can directly access
the second half of SQE128. We have the similar big_cqe[] within
io_uring_cqe too.

Is this still an eyesore?

>>> +
>>> +/* this is placed in SQE128 */
>>> +struct io_uring_meta_pi {
>>> +	__u16	pi_flags;
>>> +	__u16	app_tag;
>>> +	__u32	len;
>>> +	__u64	addr;
>>> +	__u64	seed;
>>> +	__u64	rsvd[2];
>>>  };
>>
>> On the previous version, I was more questioning if it aligns with what
>
> I missed that discussion, let me know if I need to look it up

Yes, please take a look at the previous iteration (v5):
https://lore.kernel.org/io-uring/e7aae741-c139-48d1-bb22-dbcd69aa2f73@samsung.com/

Also the corresponding code, since my other answers will use that.

>> Pavel was trying to do here. I didn't quite get it, so I was more
>> confused than saying it should be this way now.
>
> The point is, SQEs don't have nearly enough space to accommodate all
> such optional features, especially when it's taking so much space and
> not applicable to all reads but rather some specific use cases and
> files. Consider that there might be more similar extensions and we might
> even want to use them together.
>
> 1. SQE128 makes it big for all requests, intermixing with requests that
> don't need additional space wastes space. SQE128 is fine to use but at
> the same time we should be mindful about it and try to avoid enabling it
> if feasible.

Right. And initial versions of this series did not use SQE128. But as we
moved towards passing more comprehensive PI information, the first SQE
was not enough. And we thought to make use of SQE128 rather than taking
the copy_from_user cost.

> 2. This API hard codes io_uring_meta_pi into the extended part of the
> SQE. If we want to add another feature it'd need to go after the meta
> struct. SQE256?

Not necessarily. It depends on how much extra space it needs for another
feature. To keep free space in the first SQE, I chose to place PI in the
second one. Anyone requiring 20b (in v6) or 18b (in v5) of space does
not even have to ask for SQE128.
For more, they can use leftover space in the second SQE (about half of
the second SQE will still be free). In v5, they have the entire second
SQE if they don't want to use PI.
If contiguity is a concern, we can move all PI bytes (about 32b) to the
end of the second SQE.

> And what if the user doesn't need PI but only the second
> feature?

Not this version, but v5 exposed meta_type as bit flags.
And with that, the user will not pass the PI flag, and that makes it
possible to use all the PI bytes for something else. We will have a
union of PI with some other info that is known not to co-exist.

> In short, the uAPI needs to have a clear vision of how it can be used
> with / extended to multiple optional features and not just PI.
>
> One option I mentioned before is passing a user pointer to an array of
> structures, each of which will have the type specifying what kind of
> feature / meta information it is, e.g. META_TYPE_PI. It's not a
> complete solution but a base idea to extend upon. I separately
> mentioned before, if copy_from_user is expensive we can optimise it
> with pre-registering memory. I think Jens even tried something similar
> with structures we pass as waiting parameters.
>
> I didn't read through all iterations of the series, so if there is
> some other approach described that ticks the boxes and is flexible
> enough, I'd be absolutely fine with it.

Please just read v5. I think it ticks as many boxes as possible without
having to resort to copy_from_user.
On 11/1/24 17:54, Kanchan Joshi wrote:
> On 10/31/2024 8:09 PM, Pavel Begunkov wrote:
>> On 10/30/24 21:09, Keith Busch wrote:
>>> On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index 024745283783..48dcca125db3 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>>>>  		 */
>>>>  		__u8	cmd[0];
>>>>  	};
>>>> +	/*
>>>> +	 * If the ring is initialized with IORING_SETUP_SQE128, then
>>>> +	 * this field is starting offset for 64 bytes of data. For meta io
>>>> +	 * this contains 'struct io_uring_meta_pi'
>>>> +	 */
>>>> +	__u8	big_sqe[0];
>>>> +};
>>
>> I don't think zero sized arrays are good as a uapi regardless of
>> cmd[0] above, let's just do
>>
>> sqe = get_sqe();
>> big_sqe = (void *)(sqe + 1)
>>
>> with an appropriate helper.
>
> In one of the internal versions I did just that (i.e., sqe + 1), and
> that's fine for the kernel.
> But afterwards I added big_sqe so that userspace can directly access
> the second half of SQE128. We have the similar big_cqe[] within
> io_uring_cqe too.
>
> Is this still an eyesore?

Yes, let's kill it as well please, and I don't think the feature
really cares about it, so it should be easy to do if not already done
in later revisions.

>>>> +
>>>> +/* this is placed in SQE128 */
>>>> +struct io_uring_meta_pi {
>>>> +	__u16	pi_flags;
>>>> +	__u16	app_tag;
>>>> +	__u32	len;
>>>> +	__u64	addr;
>>>> +	__u64	seed;
>>>> +	__u64	rsvd[2];
>>>>  };
>>>
>>> On the previous version, I was more questioning if it aligns with what
>>
>> I missed that discussion, let me know if I need to look it up
>
> Yes, please take a look at the previous iteration (v5):
> https://lore.kernel.org/io-uring/e7aae741-c139-48d1-bb22-dbcd69aa2f73@samsung.com/

"But in general, this is about seeing metadata as a generic term to
encode extra information into io_uring SQE."

Yep, that's the idea, and it also sounds to me that stream hints is one
potential user as well. To summarise, the end goal is to be able to add
more meta types/attributes in the future, which can be file specific,
e.g. pipes don't care about integrity data, and to be able to pass an
arbitrary number of such attributes to a single request.

We don't need to implement it here, but the uapi needs to be flexible
enough to be able to accommodate that, or we should have an
understanding of how it can be extended without dirty hacks.

> Also the corresponding code, since my other answers will use that.
>
>>> Pavel was trying to do here. I didn't quite get it, so I was more
>>> confused than saying it should be this way now.
>>
>> The point is, SQEs don't have nearly enough space to accommodate all
>> such optional features, especially when it's taking so much space and
>> not applicable to all reads but rather some specific use cases and
>> files. Consider that there might be more similar extensions and we might
>> even want to use them together.
>>
>> 1. SQE128 makes it big for all requests, intermixing with requests that
>> don't need additional space wastes space. SQE128 is fine to use but at
>> the same time we should be mindful about it and try to avoid enabling it
>> if feasible.
>
> Right. And initial versions of this series did not use SQE128. But as we
> moved towards passing more comprehensive PI information, the first SQE
> was not enough. And we thought to make use of SQE128 rather than taking
> the copy_from_user cost.

Do we have any data how expensive it is? I don't think I've ever
tried to profile it. And where does the overhead come from? Speculation
prevention?

If it's indeed costly, we can add sth to io_uring like pre-mapping
memory to optimise it, which would be useful in other places as
well.

>> 2. This API hard codes io_uring_meta_pi into the extended part of the
>> SQE. If we want to add another feature it'd need to go after the meta
>> struct. SQE256?
>
> Not necessarily. It depends on how much extra space it needs for another
> feature. To keep free space in the first SQE, I chose to place PI in the
> second one. Anyone requiring 20b (in v6) or 18b (in v5) of space does
> not even have to ask for SQE128.
> For more, they can use leftover space in the second SQE (about half of
> the second SQE will still be free). In v5, they have the entire second
> SQE if they don't want to use PI.
> If contiguity is a concern, we can move all PI bytes (about 32b) to the
> end of the second SQE.
>
>> And what if the user doesn't need PI but only the second
>> feature?
>
> Not this version, but v5 exposed meta_type as bit flags.

There has to be a type, I assume it's being added back.

> And with that, the user will not pass the PI flag, and that makes it
> possible to use all the PI bytes for something else. We will have a
> union of PI with some other info that is known not to co-exist.

Let's say we have 3 different attributes META_TYPE{1,2,3}.

How are they placed in an SQE?

meta1 = (void *)get_big_sqe(sqe);
meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
meta3 = meta2 + sizeof(struct meta2_struct);

Structures are likely not fixed size (?). At least the PI looks large
enough to force everyone to be just aliased to it.

And can the user pass first meta2 in the sqe and then meta1?

meta2 = (void *)get_big_sqe(sqe);
meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)

If yes, how should parsing look? Does the kernel need to read each
chunk's type and look up its size to iterate to the next one?

If no, what happens if we want to pass meta2 and meta3, do they start
from the big_sqe?

How do we pass how many such attributes there are for the request?

It should support an arbitrary number of attributes in the long run,
which we can't pass in an SQE, bumping the SQE size is not scalable in
general, so it'd need to support user pointers or sth similar at some
point. Placing them in an SQE can serve as an optimisation, and a first
step, though it might be easier to start with a user pointer instead.

Also, when we eventually come to user pointers, we want it to be
performant as well and e.g. get by with just one copy_from_user, and the
api/struct layouts would need to be able to support it. And once it's
copied we'll want it to be handled uniformly with the SQE variant, which
requires a common format. For different formats there will be a question
of performance, maintainability, duplicating kernel and userspace code.

All that doesn't need to be implemented, but we need a clear direction
for the API. Maybe we can get simplified user space pseudo code
showing how the end API is supposed to look?
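One way to picture the "array of typed structures behind a user pointer" idea, and the parsing question raised above, is a buffer of self-describing entries that is copied once and then walked. The sketch below is an assumption-laden illustration, not a proposed uapi: struct meta_hdr, META_TYPE_HINT, META_TYPE_MAX, meta_parse() and meta_append() are hypothetical, only META_TYPE_PI is named in the thread, and memcpy() stands in for a single copy_from_user().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types; only META_TYPE_PI is named in the thread. */
enum { META_TYPE_PI = 1, META_TYPE_HINT = 2, META_TYPE_MAX = 16 };

/* Hypothetical fixed-size header placed in front of every payload. */
struct meta_hdr {
	uint16_t type;	/* META_TYPE_* */
	uint16_t len;	/* payload bytes following this header */
	uint32_t rsvd;	/* must be zero */
};

/*
 * Walk a buffer that was already copied in one go. Because each entry
 * carries its own length, entries may appear in any order.
 * Returns a bitmask of the types seen, or -1 on a malformed buffer.
 */
static int meta_parse(const uint8_t *buf, size_t size)
{
	int seen = 0;
	size_t off = 0;

	while (off < size) {
		struct meta_hdr hdr;

		if (size - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		off += sizeof(hdr);

		if (hdr.rsvd || hdr.type == 0 || hdr.type >= META_TYPE_MAX)
			return -1;
		if (hdr.len > size - off)
			return -1;
		if (seen & (1 << hdr.type))
			return -1;	/* duplicate attribute */
		seen |= 1 << hdr.type;
		off += hdr.len;		/* per-type payload handling omitted */
	}
	return seen;
}

/* Append one attribute at offset off; returns the new end offset. */
static size_t meta_append(uint8_t *buf, size_t off, uint16_t type,
			  const void *payload, uint16_t len)
{
	struct meta_hdr hdr = { .type = type, .len = len, .rsvd = 0 };

	memcpy(buf + off, &hdr, sizeof(hdr));
	memcpy(buf + off + sizeof(hdr), payload, len);
	return off + sizeof(hdr) + len;
}
```

The same entry format could in principle be placed inline in the second half of an SQE128 entry or behind a user pointer, which is the kind of common format the message asks for; whether the per-entry header cost is acceptable is exactly what the thread leaves open.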
On 11/7/2024 10:53 PM, Pavel Begunkov wrote:
>>> 1. SQE128 makes it big for all requests, intermixing with requests that
>>> don't need additional space wastes space. SQE128 is fine to use but at
>>> the same time we should be mindful about it and try to avoid enabling it
>>> if feasible.
>>
>> Right. And initial versions of this series did not use SQE128. But as we
>> moved towards passing more comprehensive PI information, the first SQE
>> was not enough. And we thought to make use of SQE128 rather than taking
>> the copy_from_user cost.
>
> Do we have any data how expensive it is? I don't think I've ever
> tried to profile it. And where does the overhead come from? Speculation
> prevention?

We did measure this for nvme passthru commands in the past (and that was
the motivation for building SQE128). Perf profile showed about 3%
overhead for copy [*].

> If it's indeed costly, we can add sth to io_uring like pre-mapping
> memory to optimise it, which would be useful in other places as
> well.

But why operate as if SQE128 does not exist?
Reads/Writes, at this point, are clearly not using about 20b in the
first SQE and the entire second SQE. Not using the second SQE at all
does not seem like the best way to protect it from being used by future
users.

Pre-mapping may be better for opcodes for which copy_from_user has
already been done. For something new (like this), why start in a
suboptimal way and later put the burden of jumping through hoops on
userspace to get to the same level where it can get by simply passing a
flag at the time of ring setup.

[*]
perf record -a fio -iodepth=256 -rw=randread -ioengine=io_uring -bs=512
-numjobs=1 -size=50G -group_reporting -iodepth_batch_submit=64
-iodepth_batch_complete_min=1 -iodepth_batch_complete_max=64
-fixedbufs=1 -hipri=1 -sqthread_poll=0 -filename=/dev/ng0n1
-name=io_uring_1 -uring_cmd=1

# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  ..............................
#
    14.37%  fio      fio               [.] axmap_isset
     6.30%  fio      fio               [.] __fio_gettime
     3.69%  fio      fio               [.] get_io_u
     3.16%  fio      [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     2.61%  fio      [kernel.vmlinux]  [k] io_submit_sqes
     1.99%  fio      [kernel.vmlinux]  [k] fget
     1.96%  fio      [nvme_core]       [k] nvme_alloc_request
     1.82%  fio      [nvme]            [k] nvme_poll
     1.79%  fio      fio               [.] add_clat_sample
     1.69%  fio      fio               [.] fio_ioring_prep
     1.59%  fio      fio               [.] thread_main
     1.59%  fio      [nvme]            [k] nvme_queue_rqs
     1.56%  fio      [kernel.vmlinux]  [k] io_issue_sqe
     1.52%  fio      [kernel.vmlinux]  [k] __put_user_nocheck_8
     1.44%  fio      fio               [.] account_io_completion
     1.37%  fio      fio               [.] get_next_rand_block
     1.37%  fio      fio               [.] __get_next_rand_offset.isra.0
     1.34%  fio      fio               [.] io_completed
     1.34%  fio      fio               [.] td_io_queue
     1.27%  fio      [kernel.vmlinux]  [k] blk_mq_alloc_request
     1.27%  fio      [nvme_core]       [k] nvme_user_cmd64
On 11/7/2024 10:53 PM, Pavel Begunkov wrote:
> Let's say we have 3 different attributes META_TYPE{1,2,3}.
>
> How are they placed in an SQE?
>
> meta1 = (void *)get_big_sqe(sqe);
> meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
> meta3 = meta2 + sizeof(struct meta2_struct);

It is not necessary to do this kind of addition or to think in terms of
sequential ordering for the extra information placed into the
primary/secondary SQE.

Please see v8:
https://lore.kernel.org/io-uring/20241106121842.5004-7-anuj20.g@samsung.com/

It exposes a distinct flag (sqe->ext_cap) for each attribute/cap, and
userspace should place the corresponding information where the kernel
has mandated.

If a particular attribute (for example write-hint) requires <20b of
extra information, we should just place that in the first SQE. PI
requires more, so we are placing that into the second SQE.

When both PI and write-hint flags are specified by the user, they can
get processed fine without actually having to care about the above
additions/ordering.

> Structures are likely not fixed size (?). At least the PI looks large
> enough to force everyone to be just aliased to it.
>
> And can the user pass first meta2 in the sqe and then meta1?

Yes. Just set the ext_cap flags without bothering about first/second.
The user can pass either or both, along with the corresponding info.
There is just no need to assume a specific placement in the SQE.

> meta2 = (void *)get_big_sqe(sqe);
> meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)
>
> If yes, how should parsing look? Does the kernel need to read each
> chunk's type and look up its size to iterate to the next one?

We don't need to iterate if we are not assuming any ordering.

> If no, what happens if we want to pass meta2 and meta3, do they start
> from the big_sqe?

The one who adds the support for meta2/meta3 in the kernel decides
where to place them within the first/second SQE, or to get them fetched
via a pointer from userspace.

> How do we pass how many such attributes there are for the request?

ext_cap allows passing 16 cap/attribute flags. Maybe all can or can not
be passed inline in the SQE, but I have no real visibility into the
space requirements of future users.

> It should support an arbitrary number of attributes in the long run,
> which we can't pass in an SQE, bumping the SQE size is not scalable in
> general, so it'd need to support user pointers or sth similar at some
> point. Placing them in an SQE can serve as an optimisation, and a first
> step, though it might be easier to start with a user pointer instead.
>
> Also, when we eventually come to user pointers, we want it to be
> performant as well and e.g. get by with just one copy_from_user, and the
> api/struct layouts would need to be able to support it. And once it's
> copied we'll want it to be handled uniformly with the SQE variant, which
> requires a common format. For different formats there will be a question
> of performance, maintainability, duplicating kernel and userspace code.
>
> All that doesn't need to be implemented, but we need a clear direction
> for the API. Maybe we can get simplified user space pseudo code
> showing how the end API is supposed to look?

Yes. For a large/arbitrary number, we may have to fetch the entire
attribute list using a user pointer/len combo, and parse it (that's
where all your previous questions fit).

And that can still be added on top of v8. For example, adding a flag
(in ext_cap) that disables inline-sqe processing and switches to an
external attribute buffer:

/* Second SQE has PI information */
#define EXT_CAP_PI		(1U << 0)
/* First SQE has hint information */
#define EXT_CAP_WRITE_HINT	(1U << 1)
/* Do not assume CAP presence in SQE, and fetch capability buffer page instead */
#define EXT_CAP_INDIRECT	(1U << 2)

The corresponding pointer (and/or len) can be put into the last 16b of
the SQE. Use the same flags/structures for the given attributes within
this buffer. That will keep things uniform and will reuse the same
handling that we add for inline attributes.
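The placement rule described above (a fixed location per flag, with EXT_CAP_INDIRECT redirecting everything to a user buffer) can be written down as a small lookup. This is only a sketch: the three flag values are taken from the message, while enum attr_src and attr_source() are hypothetical names, and the exact semantics of EXT_CAP_INDIRECT are an assumption based on its comment.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as given in the message above. */
#define EXT_CAP_PI		(1U << 0)
#define EXT_CAP_WRITE_HINT	(1U << 1)
#define EXT_CAP_INDIRECT	(1U << 2)

/* Hypothetical: where the kernel would look for one attribute's data. */
enum attr_src {
	ATTR_ABSENT,
	ATTR_FIRST_SQE,
	ATTR_SECOND_SQE,
	ATTR_USER_BUFFER,
};

/*
 * Resolve the location of a single attribute flag. Each attribute has a
 * statically defined home, so no iteration or ordering is involved; the
 * indirect flag (assumed semantics) moves all attributes to a buffer.
 */
static enum attr_src attr_source(uint16_t ext_cap, uint16_t attr)
{
	if (!(ext_cap & attr))
		return ATTR_ABSENT;
	if (ext_cap & EXT_CAP_INDIRECT)
		return ATTR_USER_BUFFER;

	switch (attr) {
	case EXT_CAP_PI:
		return ATTR_SECOND_SQE;
	case EXT_CAP_WRITE_HINT:
		return ATTR_FIRST_SQE;
	default:
		return ATTR_ABSENT;
	}
}
```

The lookup makes the trade-off visible: inline attributes need no parsing, but every new attribute needs a reserved spot in one of the two SQE halves or has to go through the indirect path.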
On 11/10/24 17:41, Kanchan Joshi wrote:
> On 11/7/2024 10:53 PM, Pavel Begunkov wrote:
>
>>>> 1. SQE128 makes it big for all requests, intermixing with requests that
>>>> don't need additional space wastes space. SQE128 is fine to use but at
>>>> the same time we should be mindful about it and try to avoid enabling it
>>>> if feasible.
>>>
>>> Right. And initial versions of this series did not use SQE128. But as we
>>> moved towards passing more comprehensive PI information, the first SQE
>>> was not enough. And we thought to make use of SQE128 rather than taking
>>> the copy_from_user cost.
>>
>> Do we have any data how expensive it is? I don't think I've ever
>> tried to profile it. And where does the overhead come from? Speculation
>> prevention?
>
> We did measure this for nvme passthru commands in the past (and that was
> the motivation for building SQE128). Perf profile showed about 3%
> overhead for copy [*].

Interesting. Sounds like the 3% is not accounting for spec barriers,
and then I'm a bit curious how much of it comes from the generic
memcpy that could've been several 64 bit reads. But regardless let's
assume it is expensive.

>> If it's indeed costly, we can add sth to io_uring like pre-mapping
>> memory to optimise it, which would be useful in other places as
>> well.
>
> But why operate as if SQE128 does not exist?
> Reads/Writes, at this point, are clearly not using about 20b in the
> first SQE and the entire second SQE. Not using the second SQE at all
> does not seem like the best way to protect it from being used by future
> users.

You missed the point, if you take another look at the rest of my
reply I even mentioned that SQE128 could be used as an optimisation
and the only mode for this patchset, but the API has to be nicely
extendable with more attributes in the future.

You can't fit everything into SQE128. Even if we grow the SQE size
further, it's one size for all requests, mixing requests would mean
initialising entire SQE256/512/... for all requests, even for those
that don't need it. It might be reasonable for some applications but
not for a generic case.

I know you care about having that particular integrity feature,
but it'd be bad for io_uring to lock into a suboptimal API and
special-casing the PI implementation. Let's shift the discussion
about details to the other sub-thread.

> Pre-mapping may be better for opcodes for which copy_from_user has
> already been done. For something new (like this), why start in a
> suboptimal way and later put the burden of jumping through hoops on
> userspace to get to the same level where it can get by simply passing a
> flag at the time of ring setup.
On 11/10/24 18:36, Kanchan Joshi wrote:
> On 11/7/2024 10:53 PM, Pavel Begunkov wrote:
>
>> Let's say we have 3 different attributes META_TYPE{1,2,3}.
>>
>> How are they placed in an SQE?
>>
>> meta1 = (void *)get_big_sqe(sqe);
>> meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
>> meta3 = meta2 + sizeof(struct meta2_struct);
>
> It is not necessary to do this kind of addition or to think in terms of
> sequential ordering for the extra information placed into the
> primary/secondary SQE.
>
> Please see v8:
> https://lore.kernel.org/io-uring/20241106121842.5004-7-anuj20.g@samsung.com/
>
> It exposes a distinct flag (sqe->ext_cap) for each attribute/cap, and
> userspace should place the corresponding information where the kernel
> has mandated.
>
> If a particular attribute (for example write-hint) requires <20b of
> extra information, we should just place that in the first SQE. PI
> requires more, so we are placing that into the second SQE.
>
> When both PI and write-hint flags are specified by the user, they can
> get processed fine without actually having to care about the above
> additions/ordering.

Ok, this option is to statically define a place in the SQE for each
meta type. The problem is that we can't place everything into an SQE,
and the next big meta would need to be a user pointer, at which point
copy_from_user() is expensive again and we need to invent something
new. PI becomes a special case, most likely handled in a special way,
and either becomes one of a few "optimised" types or forces its users
into SQE128 for nothing (with all additional costs) when it could've
been aligned with other later meta types.

>> Structures are likely not fixed size (?). At least the PI looks large
>> enough to force everyone to be just aliased to it.
>>
>> And can the user pass first meta2 in the sqe and then meta1?
>
> Yes. Just set the ext_cap flags without bothering about first/second.
> The user can pass either or both, along with the corresponding info.
> There is just no need to assume a specific placement in the SQE.
>
>> meta2 = (void *)get_big_sqe(sqe);
>> meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)
>>
>> If yes, how should parsing look? Does the kernel need to read each
>> chunk's type and look up its size to iterate to the next one?
>
> We don't need to iterate if we are not assuming any ordering.
>
>> If no, what happens if we want to pass meta2 and meta3, do they start
>> from the big_sqe?
>
> The one who adds the support for meta2/meta3 in the kernel decides
> where to place them within the first/second SQE, or to get them fetched
> via a pointer from userspace.
>
>> How do we pass how many such attributes there are for the request?
>
> ext_cap allows passing 16 cap/attribute flags. Maybe all can or can not
> be passed inline in the SQE, but I have no real visibility into the
> space requirements of future users.

I like ext_cap, if not in the current form / API, then
as a user hint - quick map of what meta types are passed.
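Treating ext_cap purely as a hint bitmap, as suggested in the last paragraph, could be as simple as the sketch below. It is an illustration under assumptions: the flag values follow the earlier message in this thread, while EXT_CAP_KNOWN, ext_cap_count() and ext_cap_validate() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as posted earlier in the thread. */
#define EXT_CAP_PI		(1U << 0)
#define EXT_CAP_WRITE_HINT	(1U << 1)

/* Hypothetical: the set of attribute bits this build understands. */
#define EXT_CAP_KNOWN		(EXT_CAP_PI | EXT_CAP_WRITE_HINT)

/* Number of attributes announced by the hint bitmap. */
static unsigned int ext_cap_count(uint16_t ext_cap)
{
	unsigned int n = 0;

	/* Clear the lowest set bit on each iteration. */
	for (; ext_cap; ext_cap &= ext_cap - 1)
		n++;
	return n;
}

/* Returns 0 if only known attribute bits are set, -1 otherwise. */
static int ext_cap_validate(uint16_t ext_cap)
{
	return (ext_cap & ~EXT_CAP_KNOWN) ? -1 : 0;
}
```

Used this way the bitmap answers "how many attributes, and which ones" before any attribute data is read, independent of whether that data sits inline in the SQE or behind a user pointer.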
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 024745283783..48dcca125db3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -105,6 +105,22 @@ struct io_uring_sqe {
 		 */
 		__u8	cmd[0];
 	};
+	/*
+	 * If the ring is initialized with IORING_SETUP_SQE128, then
+	 * this field is starting offset for 64 bytes of data. For meta io
+	 * this contains 'struct io_uring_meta_pi'
+	 */
+	__u8	big_sqe[0];
+};
+
+/* this is placed in SQE128 */
+struct io_uring_meta_pi {
+	__u16	pi_flags;
+	__u16	app_tag;
+	__u32	len;
+	__u64	addr;
+	__u64	seed;
+	__u64	rsvd[2];
 };
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 44a772013c09..c5fd74e42c04 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3879,6 +3879,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(48, __u64, addr3);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
 	BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
+	BUILD_BUG_SQE_ELEM_SIZE(64, 0, big_sqe);
 
 	BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
 		     sizeof(struct io_uring_rsrc_update));
@@ -3902,6 +3903,9 @@ static int __init io_uring_init(void)
 	/* top 8bits are for internal use */
 	BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
 
+	BUILD_BUG_ON(sizeof(struct io_uring_meta_pi) >
+		     sizeof(struct io_uring_sqe));
+
 	io_uring_optable_init();
 
 	/*
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 30448f343c7f..cbb74fcfd0d1 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -257,6 +257,46 @@ static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import)
 	return 0;
 }
 
+static inline void io_meta_save_state(struct io_async_rw *io)
+{
+	io->meta_state.seed = io->meta.seed;
+	iov_iter_save_state(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static inline void io_meta_restore(struct io_async_rw *io)
+{
+	io->meta.seed = io->meta_state.seed;
+	iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			   struct io_rw *rw, int ddir)
+{
+	const struct io_uring_meta_pi *md = (struct io_uring_meta_pi *)sqe->big_sqe;
+	const struct io_issue_def *def;
+	struct io_async_rw *io;
+	int ret;
+
+	if (READ_ONCE(md->rsvd[0]) || READ_ONCE(md->rsvd[1]))
+		return -EINVAL;
+
+	def = &io_issue_defs[req->opcode];
+	if (def->vectored)
+		return -EOPNOTSUPP;
+
+	io = req->async_data;
+	io->meta.flags = READ_ONCE(md->pi_flags);
+	io->meta.app_tag = READ_ONCE(md->app_tag);
+	io->meta.seed = READ_ONCE(md->seed);
+	ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(md->addr)),
+			  READ_ONCE(md->len), &io->meta.iter);
+	if (unlikely(ret < 0))
+		return ret;
+	rw->kiocb.ki_flags |= IOCB_HAS_METADATA;
+	io_meta_save_state(io);
+	return ret;
+}
+
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      int ddir, bool do_import)
 {
@@ -279,11 +319,19 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	rw->kiocb.ki_flags = 0;
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
 	rw->flags = READ_ONCE(sqe->rw_flags);
-	return io_prep_rw_setup(req, ddir, do_import);
+	ret = io_prep_rw_setup(req, ddir, do_import);
+
+	if (unlikely(ret))
+		return ret;
+
+	if (req->ctx->flags & IORING_SETUP_SQE128)
+		ret = io_prep_rw_meta(req, sqe, rw, ddir);
+	return ret;
 }
 
 int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -409,7 +457,10 @@ static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
 static void io_resubmit_prep(struct io_kiocb *req)
 {
 	struct io_async_rw *io = req->async_data;
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
+	if (rw->kiocb.ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 	iov_iter_restore(&io->iter, &io->iter_state);
 }
 
@@ -794,7 +845,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 	if (!(req->flags & REQ_F_FIXED_FILE))
 		req->flags |= io_file_get_flags(file);
 
-	kiocb->ki_flags = file->f_iocb_flags;
+	kiocb->ki_flags |= file->f_iocb_flags;
 	ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type);
 	if (unlikely(ret))
 		return ret;
@@ -823,6 +874,18 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_complete = io_complete_rw;
 	}
 
+	if (kiocb->ki_flags & IOCB_HAS_METADATA) {
+		struct io_async_rw *io = req->async_data;
+
+		/*
+		 * We have a union of meta fields with wpq used for buffered-io
+		 * in io_async_rw, so fail it here.
+		 */
+		if (!(req->file->f_flags & O_DIRECT))
+			return -EOPNOTSUPP;
+		kiocb->private = &io->meta;
+	}
+
 	return 0;
 }
 
@@ -897,6 +960,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	 * manually if we need to.
 	 */
 	iov_iter_restore(&io->iter, &io->iter_state);
+	if (kiocb->ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 
 	do {
 		/*
@@ -1101,6 +1166,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	} else {
 ret_eagain:
 		iov_iter_restore(&io->iter, &io->iter_state);
+		if (kiocb->ki_flags & IOCB_HAS_METADATA)
+			io_meta_restore(io);
 		if (kiocb->ki_flags & IOCB_WRITE)
 			io_req_end_write(req);
 		return -EAGAIN;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3f432dc75441..2d7656bd268d 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -2,6 +2,11 @@
 
 #include <linux/pagemap.h>
 
+struct io_meta_state {
+	u32			seed;
+	struct iov_iter_state	iter_meta;
+};
+
 struct io_async_rw {
 	size_t				bytes_done;
 	struct iov_iter			iter;
@@ -9,7 +14,14 @@ struct io_async_rw {
 	struct iovec			fast_iov;
 	struct iovec			*free_iovec;
 	int				free_iov_nr;
-	struct wait_page_queue		wpq;
+	/* wpq is for buffered io, while meta fields are used with direct io */
+	union {
+		struct wait_page_queue		wpq;
+		struct {
+			struct uio_meta			meta;
+			struct io_meta_state		meta_state;
+		};
+	};
 };
 
 int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);