Message ID | 4385ba84-55dd-6b08-0ca7-6b4a43f9d9a2@samba.org (mailing list archive)
State | New
Series | IORING_CQE_F_COPIED
Hey Stefan,

On 10/14/22 12:06, Stefan Metzmacher wrote:
> Hi Pavel,
>
> In the tests I made I used this version of IORING_CQE_F_COPIED:
> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
> (also inlined at the end)
>
> Would that be something we want for 6.1? (Should I post that with a
> useful commit message, after doing some more tests)

I was thinking, can it be delivered separately but not in the same cqe? The intention is to keep it off the IO path. For example, it can emit a zc status CQE or maybe keep a "zc failed" counter inside the ring. Other options? And we can add a separate callback for that, which will make a couple of things better.

What do you think? Especially from the userspace usability perspective.
Hi Pavel,

> On 10/14/22 12:06, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>> In the tests I made I used this version of IORING_CQE_F_COPIED:
>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
>> (also inlined at the end)
>>
>> Would that be something we want for 6.1? (Should I post that with a
>> useful commit message, after doing some more tests)
>
> I was thinking, can it be delivered separately but not in the same cqe?
> The intention is to keep it off the IO path. For example, it can emit a
> zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
> options? And we can add a separate callback for that, will make a couple
> of things better.
>
> What do you think? Especially from the userspace usability perspective.

So far I can't think of any other way that would be useful, but that doesn't mean something else might not exist...

IORING_CQE_F_COPIED is available per request and makes it possible to judge why the related SENDMSG_ZC was fast or not. It's also available in trace-cmd report.

Everything else would likely re-introduce complexity similar to what we had with the notif slots.

Instead of a new IORING_CQE_F_COPIED flag we could also set cqe.res = SO_EE_CODE_ZEROCOPY_COPIED, but that isn't really different.

As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED. Can you be more verbose about why you're thinking of something different?

metze
On 10/18/22 09:43, Stefan Metzmacher wrote:
> Hi Pavel,
>
>> On 10/14/22 12:06, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>> In the tests I made I used this version of IORING_CQE_F_COPIED:
>>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
>>> (also inlined at the end)
>>>
>>> Would that be something we want for 6.1? (Should I post that with a
>>> useful commit message, after doing some more tests)
>>
>> I was thinking, can it be delivered separately but not in the same cqe?
>> The intention is to keep it off the IO path. For example, it can emit a
>> zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
>> options? And we can add a separate callback for that, will make a couple
>> of things better.
>>
>> What do you think? Especially from the userspace usability perspective.
>
> So far I can't think of any other way that would be useful,
> but that doesn't mean something else might not exist...
>
> IORING_CQE_F_COPIED is available per request and makes it possible
> to judge why the related SENDMSG_ZC was fast or not.
> It's also available in trace-cmd report.
>
> Everything else would likely re-introduce similar complexity like we
> had with the notif slots.
>
> Instead of a new IORING_CQE_F_COPIED flag we could also set
> cqe.res = SO_EE_CODE_ZEROCOPY_COPIED, but that isn't really different.
>
> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
> Can you be more verbose why you're thinking about something different?

Because it feels like something that should be done roughly once and in advance. Performance-wise, I agree that a bunch of extra instructions in the (io_uring) IO path won't make a difference as the net overhead is already high, but I still prefer to keep it thin. The complexity is a good point though; if only we could piggyback it onto MSG_PROBE.
Ok, let's do IORING_CQE_F_COPIED and aim for 6.2 + possibly a backport.

First, there is no more ubuf_info::zerocopy, see for-next, but you can grab space in io_kiocb; io_kiocb::iopoll_completed is a good candidate.

You would want to take one io_uring patch I'm going to send (will CC you); with that you won't need to change anything in net/.

And the last bit: let's make the zc probing conditional on an IORING_RECVSEND_* flag. I'll make it zero overhead when not set later by replacing the callback.
Hi Pavel,

>> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
>> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
>> Can you be more verbose why you're thinking about something different?
>
> Because it feels like something that should be done roughly once and in
> advance. Performance wise, I agree that a bunch of extra instructions in
> the (io_uring) IO path won't make difference as the net overhead is
> already high, but I still prefer to keep it thin. The complexity is a
> good point though, if only we could piggy back it onto MSG_PROBE.
>
> Ok, let's do IORING_CQE_F_COPIED and aim 6.2 + possibly backport.

Thanks!

Experimenting with this stuff makes me wish for a way to have a different 'user_data' field for the notif cqe, maybe based on an IORING_RECVSEND_ flag; it may make my life easier and would avoid some complexity in userspace, as I need to handle retry on short writes even with MSG_WAITALL, because EINTR and other errors could cause them.

What do you think?

> First, there is no more ubuf_info::zerocopy, see for-next, but you can
> grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.

Ok, I found your "net: introduce struct ubuf_info_msgzc" and "net: shrink struct ubuf_info" commits.

I think the change would be trivial; the zerocopy field would just move to struct io_notif_data..., maybe as 'bool copied'.

> You would want to take one io_uring patch I'm going to send (will CC
> you), with that you won't need to change anything in net/.

The problem is that e.g. tcp_sendmsg_locked() won't ever call the callback at all if 'zc' is false.

That's why there's the:

if (!zc)
	uarg->zerocopy = 0;

Maybe I can invert the logic and use two variables, 'zero_copied' and 'copied'.
We'd start with both being false and this logic in the callback:

if (success) {
	if (unlikely(!nd->zero_copied && !nd->copied))
		nd->zero_copied = true;
} else {
	if (unlikely(!nd->copied)) {
		nd->copied = true;
		nd->zero_copied = false;
	}
}

And __io_notif_complete_tw still needs:

if (!nd->zero_copied)
	notif->cqe.flags |= IORING_CQE_F_COPIED;

instead of:

if (nd->copied)

> And the last bit, let's make the zc probing conditional under IORING_RECVSEND_* flag,
> I'll make it zero overhead when not set later by replacing the callback.

And is the if statement to select a high-speed callback based on an IORING_RECVSEND_ flag less overhead than the if statements in the slow callback version?

metze
On 10/19/22 17:12, Stefan Metzmacher wrote:
> Hi Pavel,
>
>>> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
>>> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
>>> Can you be more verbose why you're thinking about something different?
>>
>> Because it feels like something that should be done roughly once and in
>> advance. Performance wise, I agree that a bunch of extra instructions in
>> the (io_uring) IO path won't make difference as the net overhead is
>> already high, but I still prefer to keep it thin. The complexity is a
>> good point though, if only we could piggy back it onto MSG_PROBE.
>> Ok, let's do IORING_CQE_F_COPIED and aim 6.2 + possibly backport.
>
> Thanks!
>
> Experimenting with this stuff lets me wish to have a way to
> have a different 'user_data' field for the notif cqe,
> maybe based on a IORING_RECVSEND_ flag, it may make my life
> easier and would avoid some complexity in userspace...
> As I need to handle retry on short writes even with MSG_WAITALL
> as EINTR and other errors could cause them.
>
> What do you think?
>
>> First, there is no more ubuf_info::zerocopy, see for-next, but you can
>> grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.
>
> Ok I found your "net: introduce struct ubuf_info_msgzc" and
> "net: shrink struct ubuf_info" commits.
>
> I think the change would be trivial, the zerocopy field would just move
> to struct io_notif_data..., maybe as 'bool copied'.
>
>> You would want to take one io_uring patch I'm going to send (will CC
>> you), with that you won't need to change anything in net/.
>
> The problem is that e.g. tcp_sendmsg_locked() won't ever call
> the callback at all if 'zc' is false.
>
> That's why there's the:
>
> if (!zc)
> 	uarg->zerocopy = 0;
>
> Maybe I can inverse the logic and use two variables 'zero_copied'
> and 'copied'.
> We'd start with both being false and this logic in the callback:
>
> if (success) {
> 	if (unlikely(!nd->zero_copied && !nd->copied))
> 		nd->zero_copied = true;
> } else {
> 	if (unlikely(!nd->copied)) {
> 		nd->copied = true;
> 		nd->zero_copied = false;
> 	}
> }

Yep, sth like that should do, but let's guard against spurious net_zcopy_put() just in case.

used = false;
copied = false;

callback(skb, success, ubuf) {
	if (skb)
		used = true;
	if (!success)
		copied = true;
}

complete() {
	if (!used || copied)
		set_flag(IORING_CQE_F_COPIED);
}

> And __io_notif_complete_tw still needs:
>
> if (!nd->zero_copied)
> 	notif->cqe.flags |= IORING_CQE_F_COPIED;

Which can be shoved into a custom callback.

>> And the last bit, let's make the zc probing conditional under IORING_RECVSEND_* flag,
>> I'll make it zero overhead when not set later by replacing the callback.
>
> And the if statement to select a highspeed callback based on
> a IORING_RECVSEND_ flag is less overhead than
> the if statements in the slow callback version?

I'm more concerned about future changes around it, but there won't be extra ifs.

#define COMMON_FLAGS (RECVSEND_FIRST_POLL|...)
#define ALL_FLAGS (COMMON_FLAGS|RECVSEND_PROBE)

if (flags & ~COMMON_FLAGS) {
	if (flags & ~ALL_FLAGS)
		return err;
	if (flags & RECVSEND_PROBE)
		set_callback(notif);
}
Hi Pavel,

>> Experimenting with this stuff lets me wish to have a way to
>> have a different 'user_data' field for the notif cqe,
>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>> easier and would avoid some complexity in userspace...
>> As I need to handle retry on short writes even with MSG_WAITALL
>> as EINTR and other errors could cause them.
>>
>> What do you think?

Any comment on this?

IORING_SEND_NOTIF_USER_DATA could let us use:

notif->cqe.user_data = sqe->addr3;

metze
On 10/20/22 11:10, Stefan Metzmacher wrote:
> Hi Pavel,
>
>>> Experimenting with this stuff lets me wish to have a way to
>>> have a different 'user_data' field for the notif cqe,
>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>> easier and would avoid some complexity in userspace...
>>> As I need to handle retry on short writes even with MSG_WAITALL
>>> as EINTR and other errors could cause them.
>>>
>>> What do you think?
>
> Any comment on this?
>
> IORING_SEND_NOTIF_USER_DATA could let us use
> notif->cqe.user_data = sqe->addr3;

I'd rather not use the last available u64, tbh; that was the reason for not adding a second user_data in the first place. Maybe IORING_SETUP_SQE128?
Hi Pavel,

>>>> Experimenting with this stuff lets me wish to have a way to
>>>> have a different 'user_data' field for the notif cqe,
>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>> easier and would avoid some complexity in userspace...
>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>> as EINTR and other errors could cause them.
>>>>
>>>> What do you think?
>>
>> Any comment on this?
>>
>> IORING_SEND_NOTIF_USER_DATA could let us use
>> notif->cqe.user_data = sqe->addr3;
>
> I'd rather not use the last available u64, tbh, that was the
> reason for not adding a second user_data in the first place.

As far as I can see, io_send_zc_prep has this:

if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
	return -EINVAL;

both are u64...

metze
On 10/21/22 09:32, Stefan Metzmacher wrote:
> Hi Pavel,
>
>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>> have a different 'user_data' field for the notif cqe,
>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>> easier and would avoid some complexity in userspace...
>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>> as EINTR and other errors could cause them.
>>>>>
>>>>> What do you think?
>>>
>>> Any comment on this?
>>>
>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>> notif->cqe.user_data = sqe->addr3;
>>
>> I'd rather not use the last available u64, tbh, that was the
>> reason for not adding a second user_data in the first place.
>
> As far as I can see io_send_zc_prep has this:
>
> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
> 	return -EINVAL;
>
> both are u64...

Hah, true, completely forgot about that one
On 21.10.22 11:27, Pavel Begunkov wrote:
> On 10/21/22 09:32, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>> easier and would avoid some complexity in userspace...
>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>> as EINTR and other errors could cause them.
>>>>>>
>>>>>> What do you think?
>>>>
>>>> Any comment on this?
>>>>
>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>> notif->cqe.user_data = sqe->addr3;
>>>
>>> I'd rather not use the last available u64, tbh, that was the
>>> reason for not adding a second user_data in the first place.
>>
>> As far as I can see io_send_zc_prep has this:
>>
>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>> 	return -EINVAL;
>>
>> both are u64...
>
> Hah, true, completely forgot about that one

So would a commit like below be fine for you?

Do you have anything in mind for SEND[MSG]_ZC that could possibly use another u64 in the future?
metze

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 738d6234d1d9..7a6272872334 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -300,6 +300,7 @@ enum io_uring_op {
 #define IORING_RECVSEND_POLL_FIRST	(1U << 0)
 #define IORING_RECV_MULTISHOT		(1U << 1)
 #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
+#define IORING_SEND_NOTIF_USER_DATA	(1U << 3)
 
 /*
  * accept flags stored in sqe->ioprio
diff --git a/io_uring/net.c b/io_uring/net.c
index 735eec545115..e1bc06b58cd7 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -938,7 +938,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_kiocb *notif;
 
-	if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
+	if (unlikely(READ_ONCE(sqe->__pad2[0])))
 		return -EINVAL;
 	/* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
 	if (req->flags & REQ_F_CQE_SKIP)
@@ -946,12 +946,19 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	zc->flags = READ_ONCE(sqe->ioprio);
 	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
-			  IORING_RECVSEND_FIXED_BUF))
+			  IORING_RECVSEND_FIXED_BUF |
+			  IORING_SEND_NOTIF_USER_DATA))
 		return -EINVAL;
 	notif = zc->notif = io_alloc_notif(ctx);
 	if (!notif)
 		return -ENOMEM;
-	notif->cqe.user_data = req->cqe.user_data;
+	if (zc->flags & IORING_SEND_NOTIF_USER_DATA)
+		notif->cqe.user_data = READ_ONCE(sqe->addr3);
+	else {
+		if (unlikely(READ_ONCE(sqe->addr3)))
+			return -EINVAL;
+		notif->cqe.user_data = req->cqe.user_data;
+	}
 	notif->cqe.res = 0;
 	notif->cqe.flags = IORING_CQE_F_NOTIF;
 	req->flags |= REQ_F_NEED_CLEANUP;
Hi Pavel, and others...

>> As far as I can see io_send_zc_prep has this:
>>
>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>> 	return -EINVAL;
>>
>> both are u64...
>
> Hah, true, completely forgot about that one

BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout" thread? That would make it much easier to figure out which fields are used...; see
https://lore.kernel.org/io-uring/cover.1660291547.git.metze@samba.org/#r

metze
On 10/21/22 10:45, Stefan Metzmacher wrote:
> On 21.10.22 11:27, Pavel Begunkov wrote:
>> On 10/21/22 09:32, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>>> easier and would avoid some complexity in userspace...
>>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>>> as EINTR and other errors could cause them.
>>>>>>>
>>>>>>> What do you think?
>>>>>
>>>>> Any comment on this?
>>>>>
>>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>>> notif->cqe.user_data = sqe->addr3;
>>>>
>>>> I'd rather not use the last available u64, tbh, that was the
>>>> reason for not adding a second user_data in the first place.
>>>
>>> As far as I can see io_send_zc_prep has this:
>>>
>>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>> 	return -EINVAL;
>>>
>>> both are u64...
>>
>> Hah, true, completely forgot about that one
>
> So would a commit like below be fine for you?
>
> Do you have anything in mind for SEND[MSG]_ZC that could possibly use
> another u64 in future?

It'll most likely be taken in the future; some features are planned, some I can imagine. The question is how necessary this one is and/or how much simpler it would make things, considering that CQEs are ordered and apps still need to check for F_MORE. It shouldn't even require refcounting. Can you elaborate on the simplifying-userspace part?
On 10/21/22 11:15, Stefan Metzmacher wrote:
> Hi Pavel, and others...
>
>>> As far as I can see io_send_zc_prep has this:
>>>
>>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>> 	return -EINVAL;
>>>
>>> both are u64...
>>
>> Hah, true, completely forgot about that one
>
> BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout"
> thread, that would make it much easier to figure out which fields are used..., see
> https://lore.kernel.org/io-uring/cover.1660291547.git.metze@samba.org/#r

I admit the sqe layout is messy as there is no good separation b/w common vs opcode-specific fields, but it's not like the new layout makes it much simpler. E.g. looking at who is using a field will get more complicated. iow, no strong opinion on it.

btw, will be happy to have the include guard patch from one of your branches
On 21.10.22 13:20, Pavel Begunkov wrote:
> On 10/21/22 10:45, Stefan Metzmacher wrote:
>> On 21.10.22 11:27, Pavel Begunkov wrote:
>>> On 10/21/22 09:32, Stefan Metzmacher wrote:
>>>> Hi Pavel,
>>>>
>>>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>>>> easier and would avoid some complexity in userspace...
>>>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>>>> as EINTR and other errors could cause them.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>
>>>>>> Any comment on this?
>>>>>>
>>>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>>>> notif->cqe.user_data = sqe->addr3;
>>>>>
>>>>> I'd rather not use the last available u64, tbh, that was the
>>>>> reason for not adding a second user_data in the first place.
>>>>
>>>> As far as I can see io_send_zc_prep has this:
>>>>
>>>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>> 	return -EINVAL;
>>>>
>>>> both are u64...
>>>
>>> Hah, true, completely forgot about that one
>>
>> So would a commit like below be fine for you?
>>
>> Do you have anything in mind for SEND[MSG]_ZC that could possibly use
>> another u64 in future?
>
> It'll most likely be taken in the future, some features are planned
> some I can imagine.

Can you give examples? As I can't imagine any possible feature.

> The question is how necessary this one is and/or
> how much simpler it would make it considering that CQEs are ordered
> and apps still need to check for F_MORE. It shouldn't even require
> refcounting. Can you elaborate on the simplifying userspace part?

It's not critical; it would just make it easier to dispatch different callback functions for the two cases.
The current problem I'm facing is that I have a structure holding the state of a response, and that has a single embedded completion structure (simplified):

struct completion {
	uint32_t generation;
	void (*callback_fn)(void *callback_private,
			    const struct io_uring_cqe *cqe);
	void *callback_private;
};

I use the memory address of the completion structure glued to the lower bits of the generation as 'user_data'.

Imagine that I got a short write from SENDMSG_ZC/WAITALL because EINTR was generated; then I need to retry from userspace, which I'd try immediately without waiting for the NOTIF cqe to arrive.

For each incoming cqe I get the completion address and the generation out of user_data and then verify the generation against the one stored in the completion, in order to detect bugs, before passing over to callback_fn(). Because I still need to handle the NOTIF cqe from the first try, I can't change the generation for the next try.

I thought about using two completion structures: one for the main SENDMSG_ZC result (which gets its generation incremented with each retry) and one for the NOTIF cqes, keeping its generation stable and having a simple callback_fn that just waits for a refcount to reach 0.

Most likely I just need to sit down and concentrate to get the refcounting and similar things sorted out.

If there are really useful things we will do with addr3 and __pad2[0], I can try to cope with it... It would just be sad if they wouldn't be used anyway.

metze
On 21.10.22 13:26, Pavel Begunkov wrote:
> On 10/21/22 11:15, Stefan Metzmacher wrote:
>> Hi Pavel, and others...
>>
>>>> As far as I can see io_send_zc_prep has this:
>>>>
>>>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>> 	return -EINVAL;
>>>>
>>>> both are u64...
>>>
>>> Hah, true, completely forgot about that one
>>
>> BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout"
>> thread, that would make it much easier to figure out which fields are used..., see
>> https://lore.kernel.org/io-uring/cover.1660291547.git.metze@samba.org/#r
>
> I admit the sqe layout is messy as there is no good separation b/w
> common vs opcode specific fields, but it's not like the new layout
> makes it much simpler.

Really?

> E.g. looking who is using a field will get more complicated.

Why should anyone care what fields other opcodes use and how they are named?

For legacy reasons we have to live with:

	struct io_uring_sqe_common common;

in the middle. Apart from that, each opcode should be free to use 5 u64 fields and 1 u32 field. But e.g.:

+	/* IORING_OP_FALLOCATE */
+	struct io_uring_sqe_fallocate {
+		struct io_uring_sqe_hdr hdr;
+
+		__u64 offset;
+		__u64 length;
+		__u32 mode;
+		__u32 u32_ofs28;
+
+		struct io_uring_sqe_common common;
+
+		__u32 u32_ofs44;
+		__u64 u64_ofs48;
+		__u64 u64_ofs56;
+	} fallocate;

immediately shows what's used and what's not, and it avoids brain-dead things like using sqe->addr instead of sqe->len for the length.

And it makes it trivial to verify that the _prep function rejects any unused field.

And it would make it easier to write per-opcode tracing code, which can be easily analyzed.

> iow, no strong opinion on it.
>
> btw, will be happy to have the include guard patch from one of
> your branches

This one from the io_uring_livepatch.v6.1 branch?
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=3c36e05baad737f5cb896fdc9fc53dc1b74d2499

metze
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 04729989e6ee..efeab6a9b4f3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -341,6 +341,7 @@ struct io_uring_cqe {
 #define IORING_CQE_F_MORE		(1U << 1)
 #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
 #define IORING_CQE_F_NOTIF		(1U << 3)
+#define IORING_CQE_F_COPIED		(1U << 4)
 
 enum {
 	IORING_CQE_BUFFER_SHIFT		= 16,
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..2162d1af0b60 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -18,6 +18,8 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
 		__io_unaccount_mem(ctx->user, nd->account_pages);
 		nd->account_pages = 0;
 	}
+	if (!nd->uarg.zerocopy)
+		notif->cqe.flags |= IORING_CQE_F_COPIED;
 	io_req_task_complete(notif, locked);
 }
 
@@ -28,6 +30,8 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 	struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
 	struct io_kiocb *notif = cmd_to_io_kiocb(nd);
 
+	uarg->zerocopy = uarg->zerocopy & success;
+
 	if (refcount_dec_and_test(&uarg->refcnt)) {
 		notif->io_task_work.func = __io_notif_complete_tw;
 		io_req_task_work_add(notif);
@@ -53,6 +57,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
 
 	nd = io_notif_to_data(notif);
 	nd->account_pages = 0;
+	nd->uarg.zerocopy = 1;
 	nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	nd->uarg.callback = io_uring_tx_zerocopy_callback;
 	refcount_set(&nd->uarg.refcnt, 1);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04e2034f2f8e..64d263a8ece8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1032,7 +1032,8 @@ static int __ip_append_data(struct sock *sk,
 			paged = true;
 			zc = true;
 			uarg = msg->msg_ubuf;
-		}
+		} else
+			msg->msg_ubuf->zerocopy = 0;
 	} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
 		if (!uarg)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cdf26724d7db..d3a2ed9f22df
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1247,6 +1247,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			uarg = msg->msg_ubuf;
 			net_zcopy_get(uarg);
 			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (!zc)
+				uarg->zerocopy = 0;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
 			if (!uarg) {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index bb0f469a5247..3d75dd05ff98 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1559,7 +1559,8 @@ static int __ip6_append_data(struct sock *sk,
 			paged = true;
 			zc = true;
 			uarg = msg->msg_ubuf;
-		}
+		} else
+			msg->msg_ubuf->zerocopy = 0;
 	} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
 		if (!uarg)