Message ID | 20210831110238.299458-3-leobras@redhat.com (mailing list archive)
---|---
State | New, archived
Series | QIOChannel flags + multifd zerocopy
On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > send calls. It does so by avoiding copying user data into kernel buffers. > > To make it work, three steps are needed: > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > 3 - Process the socket's error queue, dealing with any error AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() from a synchronous call to an asynchronous call. It is forbidden to overwrite/reuse/free the buffer passed to sendmsg until an asynchronous completion notification has been received from the socket error queue. These notifications are not required to arrive in-order, even for a TCP stream, because the kernel hangs on to the buffer if a re-transmit is needed. https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html "Page pinning also changes system call semantics. It temporarily shares the buffer between process and network stack. Unlike with copying, the process cannot immediately overwrite the buffer after system call return without possibly modifying the data in flight. Kernel integrity is not affected, but a buggy program can possibly corrupt its own data stream." AFAICT, the design added in this patch does not provide any way to honour these requirements around buffer lifetime. I can't see how we can introduce MSG_ZEROCOPY in any seemless way. The buffer lifetime requirements imply need for an API design that is fundamentally different for asynchronous usage, with a callback to notify when the write has finished/failed. > Zerocopy has it's costs, so it will only get improved performance if > the sending buffer is big (10KB, according to Linux docs). > > The step 2 makes it possible to use the same socket to send data > using both zerocopy and the default copying approach, so the application > cat choose what is best for each packet. > > To implement step 1, an optional set_zerocopy() interface was created > in QIOChannel, allowing each using code to enable or disable it. > > Step 2 will be enabled by the using code at each qio_channel_write*() > that would benefit of zerocopy; > > Step 3 is done with qio_channel_socket_errq_proc(), that runs after > SOCKET_ERRQ_THRESH (16k) iovs sent, dealing with any error found. 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com> > --- > include/io/channel-socket.h | 2 + > include/io/channel.h | 29 ++++++++++++++ > io/channel-socket.c | 76 +++++++++++++++++++++++++++++++++++++ > io/channel-tls.c | 11 ++++++ > io/channel-websock.c | 9 +++++ > io/channel.c | 11 ++++++ > 6 files changed, 138 insertions(+) > > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h > index e747e63514..09dffe059f 100644 > --- a/include/io/channel-socket.h > +++ b/include/io/channel-socket.h > @@ -47,6 +47,8 @@ struct QIOChannelSocket { > socklen_t localAddrLen; > struct sockaddr_storage remoteAddr; > socklen_t remoteAddrLen; > + size_t errq_pending; > + bool zerocopy_enabled; > }; > > > diff --git a/include/io/channel.h b/include/io/channel.h > index dada9ebaaf..de10a78b10 100644 > --- a/include/io/channel.h > +++ b/include/io/channel.h > @@ -137,6 +137,8 @@ struct QIOChannelClass { > IOHandler *io_read, > IOHandler *io_write, > void *opaque); > + void (*io_set_zerocopy)(QIOChannel *ioc, > + bool enabled); > }; > > /* General I/O handling functions */ > @@ -570,6 +572,33 @@ int qio_channel_shutdown(QIOChannel *ioc, > void qio_channel_set_delay(QIOChannel *ioc, > bool enabled); > > +/** > + * qio_channel_set_zerocopy: > + * @ioc: the channel object > + * @enabled: the new flag state > + * > + * Controls whether the underlying transport is > + * permitted to use zerocopy to avoid copying the > + * sending buffer in kernel. If @enabled is true, then the > + * writes may avoid buffer copy in kernel. If @enabled > + * is false, writes will cause the kernel to always > + * copy the buffer contents before sending. > + * > + * In order to use make a write with zerocopy feature, > + * it's also necessary to sent each packet with > + * MSG_ZEROCOPY flag. With this, it's possible to > + * to select only writes that would benefit from the > + * use of zerocopy feature, i.e. the ones with larger > + * buffers. > + * > + * This feature was added in Linux 4.14, so older > + * versions will fail on enabling. This is not an > + * issue, since it will fall-back to default copying > + * approach. 
> + */ > +void qio_channel_set_zerocopy(QIOChannel *ioc, > + bool enabled); > + > /** > * qio_channel_set_cork: > * @ioc: the channel object > diff --git a/io/channel-socket.c b/io/channel-socket.c > index e377e7303d..a69fec7315 100644 > --- a/io/channel-socket.c > +++ b/io/channel-socket.c > @@ -26,8 +26,10 @@ > #include "io/channel-watch.h" > #include "trace.h" > #include "qapi/clone-visitor.h" > +#include <linux/errqueue.h> That's going to fail to biuld on non-Linux > > #define SOCKET_MAX_FDS 16 > +#define SOCKET_ERRQ_THRESH 16384 > > SocketAddress * > qio_channel_socket_get_local_address(QIOChannelSocket *ioc, > @@ -55,6 +57,8 @@ qio_channel_socket_new(void) > > sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)); > sioc->fd = -1; > + sioc->zerocopy_enabled = false; > + sioc->errq_pending = 0; > > ioc = QIO_CHANNEL(sioc); > qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); > @@ -520,6 +524,54 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc, > return ret; > } > > +static void qio_channel_socket_errq_proc(QIOChannelSocket *sioc, > + Error **errp) > +{ > + int fd = sioc->fd; > + int ret; > + struct msghdr msg = {}; > + struct sock_extended_err *serr; > + struct cmsghdr *cm; > + > + do { > + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); > + if (ret <= 0) { > + if (ret == 0 || errno == EAGAIN) { > + /* Nothing on errqueue */ > + sioc->errq_pending = 0; > + break; > + } > + if (errno == EINTR) { > + continue; > + } > + > + error_setg_errno(errp, errno, > + "Unable to read errqueue"); > + break; > + } > + > + cm = CMSG_FIRSTHDR(&msg); > + if (cm->cmsg_level != SOL_IP && > + cm->cmsg_type != IP_RECVERR) { > + error_setg_errno(errp, EPROTOTYPE, > + "Wrong cmsg in errqueue"); > + break; > + } > + > + serr = (void *) CMSG_DATA(cm); > + if (serr->ee_errno != 0) { > + error_setg_errno(errp, serr->ee_errno, > + "Error on socket"); > + break; > + } > + if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) { > + error_setg_errno(errp, serr->ee_origin, > + "Error not from zerocopy"); > + break; > + } > + } while (true); > +} > + > static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > const struct iovec *iov, > size_t niov, > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > "Unable to write to socket"); > return -1; > } > + > + if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) { > + sioc->errq_pending += niov; > + if (sioc->errq_pending > SOCKET_ERRQ_THRESH) { > + qio_channel_socket_errq_proc(sioc, errp); > + } > + } This silently looses any errors set from upto the final SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set. Also means if you have a series of writes without MSG_ZEROCOPY, it'll delay checking any pending errors. I would suggest checkig in close(), but as mentioned earlier, I think the design is flawed because the caller fundamentally needs to know about completion for every single write they make in order to know when the buffer can be released / reused. > +static void > +qio_channel_socket_set_zerocopy(QIOChannel *ioc, > + bool enabled) > +{ > + QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); > + int v = enabled ? 1 : 0; > + int ret; > + > + ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > + if (ret >= 0) { > + sioc->zerocopy_enabled = true; > + } > +} Surely we need to tell the caller wether this succeeed, otherwise the later sendmsg() is going to fail with EINVAL on older kernels where MSG_ZEROCOPY is not supported. 
> diff --git a/io/channel-tls.c b/io/channel-tls.c > index 4ce890a538..bf44b0f7b0 100644 > --- a/io/channel-tls.c > +++ b/io/channel-tls.c > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc, > qio_channel_set_delay(tioc->master, enabled); > } > > + > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc, > + bool enabled) > +{ > + QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc); > + > + qio_channel_set_zerocopy(tioc->master, enabled); > +} This is going to be unsafe because gnutls will discard/reuse the buffer for the ciphertext after every write(). I can't see a way to safely enable MSG_ZEROCOPY when TLS is used. Regards, Daniel
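For reference, a minimal standalone sketch of the flow described above — the three setup steps plus the completion notification that makes the send asynchronous. This is not QEMU code; it assumes Linux >= 4.14 headers, and error handling is cut down to the bare minimum:

```c
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/errqueue.h>

static int zerocopy_send_and_wait(int fd, void *buf, size_t len)
{
    int one = 1;

    /* step 1: enable SO_ZEROCOPY on the socket */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
        return -1;
    }

    /* step 2: the send is now asynchronous; buf must not be touched yet */
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    if (sendmsg(fd, &msg, MSG_ZEROCOPY) < 0) {
        return -1;
    }

    /* step 3: wait for the completion notification on the error queue */
    for (;;) {
        char control[100];
        struct msghdr emsg = { .msg_control = control,
                               .msg_controllen = sizeof(control) };

        if (recvmsg(fd, &emsg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN || errno == EINTR) {
                struct pollfd pfd = { .fd = fd, .events = 0 };
                poll(&pfd, 1, -1);  /* completions are signalled as POLLERR */
                continue;
            }
            return -1;
        }

        struct cmsghdr *cm = CMSG_FIRSTHDR(&emsg);
        struct sock_extended_err *serr = (void *)CMSG_DATA(cm);
        if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY && serr->ee_errno == 0) {
            /* ee_info..ee_data is the range of completed send calls */
            printf("sends %u..%u completed\n", serr->ee_info, serr->ee_data);
            return 0;  /* only now may buf be overwritten, reused or freed */
        }
    }
}
```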
On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote: > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > > send calls. It does so by avoiding copying user data into kernel buffers. > > > > To make it work, three steps are needed: > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > > 3 - Process the socket's error queue, dealing with any error > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() > from a synchronous call to an asynchronous call. > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg > until an asynchronous completion notification has been received from > the socket error queue. These notifications are not required to > arrive in-order, even for a TCP stream, because the kernel hangs on > to the buffer if a re-transmit is needed. > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > "Page pinning also changes system call semantics. It temporarily > shares the buffer between process and network stack. Unlike with > copying, the process cannot immediately overwrite the buffer > after system call return without possibly modifying the data in > flight. Kernel integrity is not affected, but a buggy program > can possibly corrupt its own data stream." > > AFAICT, the design added in this patch does not provide any way > to honour these requirements around buffer lifetime. > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > way. The buffer lifetime requirements imply need for an API > design that is fundamentally different for asynchronous usage, > with a callback to notify when the write has finished/failed. Regarding buffer reuse - it indeed has a very deep implication on the buffer being available and it's not obvious at all. Just to mention that the initial user of this work will make sure all zero copy buffers will be guest pages only (as it's only used in multi-fd), so they should always be there during the process. I think asking for a complete design still makes sense. E.g., for precopy before we flush device states and completes the migration, we may want to at least have a final ack on all the zero-copies of guest pages to guarantee they are flushed. IOW, we need to make sure the last piece of migration stream lands after the guest pages so that the dest VM will always contain the latest page data when dest VM starts. So far I don't see how current code guaranteed that. In short, we may just want to at least having a way to make sure all zero copied buffers are finished using and they're sent after some function returns (e.g., qio_channel_flush()). That may require us to do some accounting on when we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the error queue and keep those information somewhere too. Some other side notes that reached my mind.. The qio_channel_writev_full() may not be suitable for async operations, as the name "full" implies synchronous to me. So maybe we can add a new helper for zero copy on the channel? We may also want a new QIOChannelFeature as QIO_CHANNEL_FEATURE_ZEROCOPY, then we fail qio_channel_writv_zerocopy() (or whatever name we come up with) if that bit is not set in qio channel features. Thanks,
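A rough sketch of the accounting suggested here (a hypothetical helper, not part of the patch): count every sendmsg(MSG_ZEROCOPY) issued on the socket, then have a flush step drain SO_EE_ORIGIN_ZEROCOPY notifications until every counted send has been acknowledged:

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/* returns 0 once all @sent zerocopy sends have completed, -1 on error */
static int zerocopy_flush(int fd, uint32_t sent, uint32_t *acked)
{
    while (*acked < sent) {
        char control[100];
        struct msghdr msg = { .msg_control = control,
                              .msg_controllen = sizeof(control) };

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN) {
                /* nothing queued yet: wait for POLLERR on the socket */
                struct pollfd pfd = { .fd = fd, .events = 0 };
                poll(&pfd, 1, -1);
                continue;
            }
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        struct sock_extended_err *serr = (void *)CMSG_DATA(cm);
        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY || serr->ee_errno != 0) {
            return -1;  /* a genuine socket error, not a completion */
        }

        /* the kernel merges completions: ee_info..ee_data is inclusive */
        *acked += serr->ee_data - serr->ee_info + 1;
    }
    return 0;
}
```

A qio_channel_flush() built on such a counter would give the migration loop a synchronization point before the final device state is sent.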
On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote: > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote: > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > > > send calls. It does so by avoiding copying user data into kernel buffers. > > > > > > To make it work, three steps are needed: > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > > > 3 - Process the socket's error queue, dealing with any error > > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. > > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() > > from a synchronous call to an asynchronous call. > > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg > > until an asynchronous completion notification has been received from > > the socket error queue. These notifications are not required to > > arrive in-order, even for a TCP stream, because the kernel hangs on > > to the buffer if a re-transmit is needed. > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > "Page pinning also changes system call semantics. It temporarily > > shares the buffer between process and network stack. Unlike with > > copying, the process cannot immediately overwrite the buffer > > after system call return without possibly modifying the data in > > flight. Kernel integrity is not affected, but a buggy program > > can possibly corrupt its own data stream." > > > > AFAICT, the design added in this patch does not provide any way > > to honour these requirements around buffer lifetime. > > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > > way. The buffer lifetime requirements imply need for an API > > design that is fundamentally different for asynchronous usage, > > with a callback to notify when the write has finished/failed. > > Regarding buffer reuse - it indeed has a very deep implication on the buffer > being available and it's not obvious at all. Just to mention that the initial > user of this work will make sure all zero copy buffers will be guest pages only > (as it's only used in multi-fd), so they should always be there during the > process. That is not the case when migration is using TLS, because the buffers transmitted are the ciphertext buffer, not the plaintext guest page. > In short, we may just want to at least having a way to make sure all zero > copied buffers are finished using and they're sent after some function returns > (e.g., qio_channel_flush()). That may require us to do some accounting on when > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the > error queue and keep those information somewhere too. > > Some other side notes that reached my mind.. > > The qio_channel_writev_full() may not be suitable for async operations, as the > name "full" implies synchronous to me. So maybe we can add a new helper for > zero copy on the channel? All the APIs that exist today are fundamentally only suitable for sync operations. Supporting async correctly will definitely a brand new APIs separate from what exists today. Regards, Daniel
On Wed, Sep 01, 2021 at 09:50:56AM +0100, Daniel P. Berrangé wrote: > On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote: > > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote: > > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > > > > send calls. It does so by avoiding copying user data into kernel buffers. > > > > > > > > To make it work, three steps are needed: > > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > > > > 3 - Process the socket's error queue, dealing with any error > > > > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. > > > > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() > > > from a synchronous call to an asynchronous call. > > > > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg > > > until an asynchronous completion notification has been received from > > > the socket error queue. These notifications are not required to > > > arrive in-order, even for a TCP stream, because the kernel hangs on > > > to the buffer if a re-transmit is needed. > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > "Page pinning also changes system call semantics. It temporarily > > > shares the buffer between process and network stack. Unlike with > > > copying, the process cannot immediately overwrite the buffer > > > after system call return without possibly modifying the data in > > > flight. Kernel integrity is not affected, but a buggy program > > > can possibly corrupt its own data stream." > > > > > > AFAICT, the design added in this patch does not provide any way > > > to honour these requirements around buffer lifetime. > > > > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > > > way. The buffer lifetime requirements imply need for an API > > > design that is fundamentally different for asynchronous usage, > > > with a callback to notify when the write has finished/failed. > > > > Regarding buffer reuse - it indeed has a very deep implication on the buffer > > being available and it's not obvious at all. Just to mention that the initial > > user of this work will make sure all zero copy buffers will be guest pages only > > (as it's only used in multi-fd), so they should always be there during the > > process. > > That is not the case when migration is using TLS, because the buffers > transmitted are the ciphertext buffer, not the plaintext guest page. Agreed. > > > In short, we may just want to at least having a way to make sure all zero > > copied buffers are finished using and they're sent after some function returns > > (e.g., qio_channel_flush()). That may require us to do some accounting on when > > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the > > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the > > error queue and keep those information somewhere too. > > > > Some other side notes that reached my mind.. > > > > The qio_channel_writev_full() may not be suitable for async operations, as the > > name "full" implies synchronous to me. So maybe we can add a new helper for > > zero copy on the channel? > > All the APIs that exist today are fundamentally only suitable for sync > operations. Supporting async correctly will definitely a brand new APIs > separate from what exists today. Yes. 
What I wanted to say is that maybe we can still keep the io_writev() interface untouched, but just add a new interface at the qio_channel_writev_full() level. IOW, we could document on io_writev() that it can now be either sync or async, just as sendmsg() has that implication depending on the flag passed in. When io_writev() is reached through the different upper-level helpers, the QIO channel could work out which flag to pass down to it.
On Wed, Sep 01, 2021 at 11:52:13AM -0400, Peter Xu wrote: > On Wed, Sep 01, 2021 at 09:50:56AM +0100, Daniel P. Berrangé wrote: > > On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote: > > > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote: > > > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > > > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > > > > > send calls. It does so by avoiding copying user data into kernel buffers. > > > > > > > > > > To make it work, three steps are needed: > > > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > > > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > > > > > 3 - Process the socket's error queue, dealing with any error > > > > > > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. > > > > > > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() > > > > from a synchronous call to an asynchronous call. > > > > > > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg > > > > until an asynchronous completion notification has been received from > > > > the socket error queue. These notifications are not required to > > > > arrive in-order, even for a TCP stream, because the kernel hangs on > > > > to the buffer if a re-transmit is needed. > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > "Page pinning also changes system call semantics. It temporarily > > > > shares the buffer between process and network stack. Unlike with > > > > copying, the process cannot immediately overwrite the buffer > > > > after system call return without possibly modifying the data in > > > > flight. Kernel integrity is not affected, but a buggy program > > > > can possibly corrupt its own data stream." > > > > > > > > AFAICT, the design added in this patch does not provide any way > > > > to honour these requirements around buffer lifetime. > > > > > > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > > > > way. The buffer lifetime requirements imply need for an API > > > > design that is fundamentally different for asynchronous usage, > > > > with a callback to notify when the write has finished/failed. > > > > > > Regarding buffer reuse - it indeed has a very deep implication on the buffer > > > being available and it's not obvious at all. Just to mention that the initial > > > user of this work will make sure all zero copy buffers will be guest pages only > > > (as it's only used in multi-fd), so they should always be there during the > > > process. > > > > That is not the case when migration is using TLS, because the buffers > > transmitted are the ciphertext buffer, not the plaintext guest page. > > Agreed. > > > > > > In short, we may just want to at least having a way to make sure all zero > > > copied buffers are finished using and they're sent after some function returns > > > (e.g., qio_channel_flush()). That may require us to do some accounting on when > > > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the > > > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the > > > error queue and keep those information somewhere too. > > > > > > Some other side notes that reached my mind.. > > > > > > The qio_channel_writev_full() may not be suitable for async operations, as the > > > name "full" implies synchronous to me. So maybe we can add a new helper for > > > zero copy on the channel? 
> > > > All the APIs that exist today are fundamentally only suitable for sync > > operations. Supporting async correctly will definitely a brand new APIs > > separate from what exists today. > > Yes. What I wanted to say is maybe we can still keep the io_writev() interface > untouched, but just add a new interface at qio_channel_writev_full() level. > > IOW, we could comment on io_writev() that it can be either sync or async now, > just like sendmsg() has that implication too with different flag passed in. > When calling io_writev() with different upper helpers, QIO channel could > identify what flag to pass over to io_writev(). I don't think we should overload any of the existing methods with extra parameters for async. Introduce a completely new set of methods exclusively for this alternative interaction model. I can kinda understand why they took the approach they did with sendmsg, but I wouldn't use it as design motivation for QEMU (except as example of what not to do). Regards, Daniel
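Purely as an illustration of the "brand new methods" direction argued for here, one shape such an interface could take; every name below is made up for the example and none of it exists in QEMU:

```c
#include <sys/types.h>
#include <sys/uio.h>

typedef struct QIOChannel QIOChannel;   /* stand-ins for the QEMU types */
typedef struct Error Error;

typedef void (*QIOChannelWriteDoneFunc)(QIOChannel *ioc,
                                        ssize_t nwritten,
                                        Error *err,
                                        void *opaque);

typedef struct QIOChannelAsyncOps {
    /* queue the write; @iov buffers must stay untouched until @done fires */
    ssize_t (*io_async_writev)(QIOChannel *ioc,
                               const struct iovec *iov,
                               size_t niov,
                               QIOChannelWriteDoneFunc done,
                               void *opaque,
                               Error **errp);

    /* block until every previously queued asynchronous write has finished */
    int (*io_flush)(QIOChannel *ioc, Error **errp);
} QIOChannelAsyncOps;
```

The key property is that the caller hands over the iovec together with a completion callback, so buffer ownership is explicit, and io_flush() provides the synchronization point discussed elsewhere in the thread.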
Hello Daniel, thank you for the feedback! Comments inline. On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > > send calls. It does so by avoiding copying user data into kernel buffers. > > > > To make it work, three steps are needed: > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > > 3 - Process the socket's error queue, dealing with any error > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() > from a synchronous call to an asynchronous call. You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY in a somehow synchronous way, but returning errp (and sometimes closing the channel because of it) was a poor implementation. > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg > until an asynchronous completion notification has been received from > the socket error queue. These notifications are not required to > arrive in-order, even for a TCP stream, because the kernel hangs on > to the buffer if a re-transmit is needed. > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > "Page pinning also changes system call semantics. It temporarily > shares the buffer between process and network stack. Unlike with > copying, the process cannot immediately overwrite the buffer > after system call return without possibly modifying the data in > flight. Kernel integrity is not affected, but a buggy program > can possibly corrupt its own data stream." > By the above piece of documentation, I get there is no problem in overwriting the buffer, but a corrupt, or newer version of the memory may be sent instead of the original one. I am pointing this out because there are workloads like page migration that would not be impacted, given once the buffer is changed, it will dirty the page and it will be re-sent. But I agree with you. It's not a good choice to expect all the users to behave like that, and since an interface for dealing with those errors is not provided to the using code, there is no way of using that in other scenarios. > AFAICT, the design added in this patch does not provide any way > to honour these requirements around buffer lifetime. > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > way. The buffer lifetime requirements imply need for an API > design that is fundamentally different for asynchronous usage, > with a callback to notify when the write has finished/failed. > That is a good point. Proposing a new optional method like io_async_writev() + an error checking mechanism could do the job. io_async_writev() could fall-back to io_writev() in cases where it's not implemented. I am not sure about the error checking yet. Options I can see are: 1 - A callback, as you suggested, which IIUC would be provided by code using the QIOChannel, and would only fix the reported errors, leaving the responsibility of checking for errors to the IOChannel code. 2 - A new method, maybe io_async_errck(), which would be called whenever the using code wants to deal with pending errors. It could return an array/list of IOVs that need to be re-sent, for example, and code using QIOChannel could deal with it however it wants. 
[snip] > > * qio_channel_set_cork: > > * @ioc: the channel object > > diff --git a/io/channel-socket.c b/io/channel-socket.c > > index e377e7303d..a69fec7315 100644 > > --- a/io/channel-socket.c > > +++ b/io/channel-socket.c > > @@ -26,8 +26,10 @@ > > #include "io/channel-watch.h" > > #include "trace.h" > > #include "qapi/clone-visitor.h" > > +#include <linux/errqueue.h> > > That's going to fail to biuld on non-Linux Good catch, thanks! [snip] > > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > > "Unable to write to socket"); > > return -1; > > } > > + > > + if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) { > > + sioc->errq_pending += niov; > > + if (sioc->errq_pending > SOCKET_ERRQ_THRESH) { > > + qio_channel_socket_errq_proc(sioc, errp); > > + } > > + } > > This silently looses any errors set from upto the final > SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set. You are right. > > Also means if you have a series of writes without > MSG_ZEROCOPY, it'll delay checking any pending > errors. That's expected... if there are only happening sends without MSG_ZEROCOPY, it means the ones sent with zerocopy can wait. The problem would be the above case. > > I would suggest checkig in close(), but as mentioned > earlier, I think the design is flawed because the caller > fundamentally needs to know about completion for every > single write they make in order to know when the buffer > can be released / reused. Well, there could be a flush mechanism (maybe in io_sync_errck(), activated with a parameter flag, or on a different method if callback is preferred): In the MSG_ZEROCOPY docs, we can see that the example includes using a poll() syscall after each packet sent, and this means the fd gets a signal after each sendmsg() happens, with error or not. We could harness this with a poll() and a relatively high timeout: - We stop sending packets, and then call poll(). - Whenever poll() returns 0, it means a timeout happened, and so it took too long without sendmsg() happening, meaning all the packets are sent. - If it returns anything else, we go back to fixing the errors found (re-send) The problem may be defining the value of this timeout, but it could be called only when zerocopy is active. What do you think? > > > +static void > > +qio_channel_socket_set_zerocopy(QIOChannel *ioc, > > + bool enabled) > > +{ > > + QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); > > + int v = enabled ? 1 : 0; > > + int ret; > > + > > + ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > > + if (ret >= 0) { > > + sioc->zerocopy_enabled = true; > > + } > > +} > > Surely we need to tell the caller wether this succeeed, otherwise > the later sendmsg() is going to fail with EINVAL on older kernels > where MSG_ZEROCOPY is not supported. Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it should have been something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this way it would reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy otherwise. or something like this, if we want it to stick with zerocopy if setting it off fails. 
if (ret >= 0) { sioc->zerocopy_enabled = enabled; } > > > > diff --git a/io/channel-tls.c b/io/channel-tls.c > > index 4ce890a538..bf44b0f7b0 100644 > > --- a/io/channel-tls.c > > +++ b/io/channel-tls.c > > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc, > > qio_channel_set_delay(tioc->master, enabled); > > } > > > > + > > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc, > > + bool enabled) > > +{ > > + QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc); > > + > > + qio_channel_set_zerocopy(tioc->master, enabled); > > +} > > This is going to be unsafe because gnutls will discard/reuse the > buffer for the ciphertext after every write(). I can't see a > way to safely enable MSG_ZEROCOPY when TLS is used. Yeah, that makes sense. It would make more sense to implement KTLS, as IIRC it kind of does 'zerocopy', since it saves the encrypted data directly in kernel buffer. We could implement KTLS as io_async_writev() for Channel_TLS, and change this flag to async_enabled. If KTLS is not available, it would fall back to using gnuTLS on io_writev, just like it would happen in zerocopy. > > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| >
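A sketch of the poll()-with-timeout flush floated above (illustrative only; as noted, picking the timeout is the weak point, and counting completions exactly, as discussed elsewhere in the thread, is more robust):

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

static int flush_by_timeout(int fd, int timeout_ms)
{
    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = 0 };
        int n = poll(&pfd, 1, timeout_ms);

        if (n == 0) {
            return 0;   /* quiet for timeout_ms: assume all sends are done */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        if (!(pfd.revents & POLLERR)) {
            return -1;  /* POLLHUP/POLLNVAL: the socket is gone */
        }

        char control[100];
        struct msghdr msg = { .msg_control = control,
                              .msg_controllen = sizeof(control) };
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0 && errno != EAGAIN) {
            return -1;
        }
        /* a real implementation would inspect the sock_extended_err here */
    }
}
```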
Thanks for this feedback Peter! I ended up reading/replying the e-mails in thread order, so I may have been redundant with your argument, sorry about that. I will add my comments inline, but I will add references to the previous mail I sent Daniel, so please read it too. On Tue, Aug 31, 2021 at 5:27 PM Peter Xu <peterx@redhat.com> wrote: [snip] > > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > > way. The buffer lifetime requirements imply need for an API > > design that is fundamentally different for asynchronous usage, > > with a callback to notify when the write has finished/failed. > > Regarding buffer reuse - it indeed has a very deep implication on the buffer > being available and it's not obvious at all. Just to mention that the initial > user of this work will make sure all zero copy buffers will be guest pages only > (as it's only used in multi-fd), so they should always be there during the > process. Thanks for pointing that out, what's what I had in mind at the time. > > I think asking for a complete design still makes sense. Agree, since I am touching QIOChannel, it makes sense to make it work for other code that uses it too, not only our case. > E.g., for precopy > before we flush device states and completes the migration, we may want to at > least have a final ack on all the zero-copies of guest pages to guarantee they > are flushed. > > IOW, we need to make sure the last piece of migration stream lands after the > guest pages so that the dest VM will always contain the latest page data when > dest VM starts. So far I don't see how current code guaranteed that. > > In short, we may just want to at least having a way to make sure all zero > copied buffers are finished using and they're sent after some function returns > (e.g., qio_channel_flush()). That may require us to do some accounting on when > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the > error queue and keep those information somewhere too. Yeah, that's correct. I haven't fully studied what the returned data represents, but I suppose this could be a way to fix that. In my previous reply to Daniel I pointed out a way we may achieve a flush behavior with poll() too, but it could be a little hacky. > > Some other side notes that reached my mind.. > > The qio_channel_writev_full() may not be suitable for async operations, as the > name "full" implies synchronous to me. So maybe we can add a new helper for > zero copy on the channel? > > We may also want a new QIOChannelFeature as QIO_CHANNEL_FEATURE_ZEROCOPY, then > we fail qio_channel_writv_zerocopy() (or whatever name we come up with) if that > bit is not set in qio channel features. I also suggested something like that, but I thought it could be good if we could fall back to io_writev() if we didn't have the zerocopy feature (or the async feature). What do you think? > > Thanks, > > -- > Peter Xu > I really appreciate your suggestions, thanks Peter! Best regards, Leonardo
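A sketch of the fallback suggested here, using the hypothetical names from the discussion (QIO_CHANNEL_FEATURE_ZEROCOPY and qio_channel_writev_zerocopy() do not exist at this point): gate the zerocopy path on the channel advertising the feature, otherwise use the ordinary synchronous writev:

```c
#include "qemu/osdep.h"
#include "io/channel.h"

static ssize_t writev_maybe_zerocopy(QIOChannel *ioc,
                                     const struct iovec *iov,
                                     size_t niov,
                                     Error **errp)
{
    if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_ZEROCOPY)) {
        /* hypothetical helper proposed in this thread; not implemented yet */
        return qio_channel_writev_zerocopy(ioc, iov, niov, errp);
    }

    /* channels (or kernels) without the feature: plain synchronous path */
    return qio_channel_writev(ioc, iov, niov, errp);
}
```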
Hello Daniel,. A few more comments: On Wed, Sep 1, 2021 at 5:51 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote: > > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote: > > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > > > > send calls. It does so by avoiding copying user data into kernel buffers. > > > > > > > > To make it work, three steps are needed: > > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > > > > 3 - Process the socket's error queue, dealing with any error > > > > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. > > > > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() > > > from a synchronous call to an asynchronous call. > > > > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg > > > until an asynchronous completion notification has been received from > > > the socket error queue. These notifications are not required to > > > arrive in-order, even for a TCP stream, because the kernel hangs on > > > to the buffer if a re-transmit is needed. > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > "Page pinning also changes system call semantics. It temporarily > > > shares the buffer between process and network stack. Unlike with > > > copying, the process cannot immediately overwrite the buffer > > > after system call return without possibly modifying the data in > > > flight. Kernel integrity is not affected, but a buggy program > > > can possibly corrupt its own data stream." > > > > > > AFAICT, the design added in this patch does not provide any way > > > to honour these requirements around buffer lifetime. > > > > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > > > way. The buffer lifetime requirements imply need for an API > > > design that is fundamentally different for asynchronous usage, > > > with a callback to notify when the write has finished/failed. > > > > Regarding buffer reuse - it indeed has a very deep implication on the buffer > > being available and it's not obvious at all. Just to mention that the initial > > user of this work will make sure all zero copy buffers will be guest pages only > > (as it's only used in multi-fd), so they should always be there during the > > process. > > That is not the case when migration is using TLS, because the buffers > transmitted are the ciphertext buffer, not the plaintext guest page. That makes sense. I am still working my way on Migration code, and I ended up not knowing that part. I think we can work on KTLS for this case, and implement something 'like zerocopy' for this. I added more details in the previous mail I sent you. > > > In short, we may just want to at least having a way to make sure all zero > > copied buffers are finished using and they're sent after some function returns > > (e.g., qio_channel_flush()). That may require us to do some accounting on when > > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the > > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the > > error queue and keep those information somewhere too. > > > > Some other side notes that reached my mind.. 
> > > > The qio_channel_writev_full() may not be suitable for async operations, as the > > name "full" implies synchronous to me. So maybe we can add a new helper for > > zero copy on the channel? > > All the APIs that exist today are fundamentally only suitable for sync > operations. Supporting async correctly will definitely a brand new APIs > separate from what exists today. That's a great, I was thinking on implementing this as an optional method for QIOChannel, more details in the previous mail :). (sorry, I ended up reading/replying the mails in thread order) > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| > Best regards, Leonardo Bras
On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote: > Hello Daniel, thank you for the feedback! > > Comments inline. > > On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > > > send calls. It does so by avoiding copying user data into kernel buffers. > > > > > > To make it work, three steps are needed: > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > > > 3 - Process the socket's error queue, dealing with any error > > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. > > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() > > from a synchronous call to an asynchronous call. > > You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY in > a somehow synchronous way, but returning errp (and sometimes closing the > channel because of it) was a poor implementation. > > > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg > > until an asynchronous completion notification has been received from > > the socket error queue. These notifications are not required to > > arrive in-order, even for a TCP stream, because the kernel hangs on > > to the buffer if a re-transmit is needed. > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > "Page pinning also changes system call semantics. It temporarily > > shares the buffer between process and network stack. Unlike with > > copying, the process cannot immediately overwrite the buffer > > after system call return without possibly modifying the data in > > flight. Kernel integrity is not affected, but a buggy program > > can possibly corrupt its own data stream." > > > > By the above piece of documentation, I get there is no problem in > overwriting the buffer, but a corrupt, or newer version of the memory may > be sent instead of the original one. I am pointing this out because there > are workloads like page migration that would not be impacted, given > once the buffer is changed, it will dirty the page and it will be re-sent. The idea of having the buffer overwritten is still seriously worrying me. I get the idea that in theory the "worst" that should happen is that we unexpectedly transmit newer guest memory contents. There are so many edge cases in migration code though that I'm worried there might be scenarios in which that is still going to cause problems. The current migration code has a synchronization point at the end of each iteration of the migration loop. At the end of each iteration, the source has a guarantee that all pages from the dirty bitmap have now been sent. With the ZEROCOPY feature we have entirely removed all synchronization points from the source host wrt memory sending. At best we know that the pages will have been sent if we get to close() without seeing errors, unless we're going to explicitly track the completion messages. The TCP re-transmit semantics are especially worrying to me, because the re-transmit is liable to send different data than was in the original lost TCP packet. Maybe everything is still safe because TCP is a reliable ordered stream, and thus eventually the dst will get the newest guest page contents. 
I'm still very worried that we have code in the source that implicitly relies on there being some synchronization points that we've effectively removed. > > AFAICT, the design added in this patch does not provide any way > > to honour these requirements around buffer lifetime. > > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > > way. The buffer lifetime requirements imply need for an API > > design that is fundamentally different for asynchronous usage, > > with a callback to notify when the write has finished/failed. > > That is a good point. > Proposing a new optional method like io_async_writev() + an error > checking mechanism could do the job. > io_async_writev() could fall-back to io_writev() in cases where it's not > implemented. > > I am not sure about the error checking yet. > Options I can see are: > 1 - A callback, as you suggested, which IIUC would be provided by > code using the QIOChannel, and would only fix the reported errors, > leaving the responsibility of checking for errors to the IOChannel code. A callback is fairly simple, but potentially limits performance. Reading the kernel docs it seems they intentionally merge notifications to enable a whole contiguous set to be processed in one go. A callback would effectively be unmerging them requiring processing one a time. Not sure how much this matters to our usage. > 2 - A new method, maybe io_async_errck(), which would be called > whenever the using code wants to deal with pending errors. It could > return an array/list of IOVs that need to be re-sent, for example, > and code using QIOChannel could deal with it however it wants. Considering that we're using TCP, we get a reliable, ordered data stream. So there should never be a failure that requires use to know specific IOVs to re-sent - the kernel still handles TCP re-transmit - we just have to still have the buffer available. As noted above though, the re-transmit is liable to send different data than the original transmit, which worries me. So the only errors we should get from the completion notifications would be errors that are completely fatal for the TCP stream. So for errors, we only need to know whether an error has ocurred or not. The second reason for the completion notifications is for providing a synchronization, to know when the buffer can be released or overwritten. That is the only scenario we need to know list of IOVs that completed. > > > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > > > "Unable to write to socket"); > > > return -1; > > > } > > > + > > > + if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) { > > > + sioc->errq_pending += niov; > > > + if (sioc->errq_pending > SOCKET_ERRQ_THRESH) { > > > + qio_channel_socket_errq_proc(sioc, errp); > > > + } > > > + } > > > > This silently looses any errors set from upto the final > > SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set. > > You are right. > > > > > Also means if you have a series of writes without > > MSG_ZEROCOPY, it'll delay checking any pending > > errors. > > That's expected... if there are only happening sends without MSG_ZEROCOPY, > it means the ones sent with zerocopy can wait. The problem would be > the above case. Well it depends whether there are synchonization requirements in the caller. For example, current migration code knows that at the end of each iteration sending pages, that the data has all actally been sent. 
With this new logic, we might have done several more iterations of the migration loop before the data for the original iteration has been sent. The writes() for the original iteration may well be sending the data from the later iterations. Possibly this is ok, but it is a clear difference in the data that will actually go out on the wire with this code. > > I would suggest checkig in close(), but as mentioned > > earlier, I think the design is flawed because the caller > > fundamentally needs to know about completion for every > > single write they make in order to know when the buffer > > can be released / reused. > > Well, there could be a flush mechanism (maybe in io_sync_errck(), > activated with a > parameter flag, or on a different method if callback is preferred): > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll() > syscall after each packet sent, and this means the fd gets a signal after each > sendmsg() happens, with error or not. > > We could harness this with a poll() and a relatively high timeout: > - We stop sending packets, and then call poll(). > - Whenever poll() returns 0, it means a timeout happened, and so it > took too long > without sendmsg() happening, meaning all the packets are sent. > - If it returns anything else, we go back to fixing the errors found (re-send) > > The problem may be defining the value of this timeout, but it could be > called only > when zerocopy is active. Maybe we need to check completions at the end of each iteration of the migration dirtypage loop ? > > > +static void > > > +qio_channel_socket_set_zerocopy(QIOChannel *ioc, > > > + bool enabled) > > > +{ > > > + QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); > > > + int v = enabled ? 1 : 0; > > > + int ret; > > > + > > > + ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > > > + if (ret >= 0) { > > > + sioc->zerocopy_enabled = true; > > > + } > > > +} > > > > Surely we need to tell the caller wether this succeeed, otherwise > > the later sendmsg() is going to fail with EINVAL on older kernels > > where MSG_ZEROCOPY is not supported. > > Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it > should have been > something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this > way it would > reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy > otherwise. > > or something like this, if we want it to stick with zerocopy if > setting it off fails. > if (ret >= 0) { > sioc->zerocopy_enabled = enabled; > } Yes, that is a bug fix we need, but actually I was refering to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY from 'flags', if zerocopy_enabled is not set, to avoid EINVAL from sendmsg. > > > diff --git a/io/channel-tls.c b/io/channel-tls.c > > > index 4ce890a538..bf44b0f7b0 100644 > > > --- a/io/channel-tls.c > > > +++ b/io/channel-tls.c > > > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc, > > > qio_channel_set_delay(tioc->master, enabled); > > > } > > > > > > + > > > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc, > > > + bool enabled) > > > +{ > > > + QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc); > > > + > > > + qio_channel_set_zerocopy(tioc->master, enabled); > > > +} > > > > This is going to be unsafe because gnutls will discard/reuse the > > buffer for the ciphertext after every write(). I can't see a > > way to safely enable MSG_ZEROCOPY when TLS is used. > > Yeah, that makes sense. 
> It would make more sense to implement KTLS, as IIRC it kind of does > 'zerocopy', since it saves the encrypted data directly in kernel buffer. I would hope it is zerocopy, in so much as the kernel can directly use the userspace pages as the plaintext, and then the ciphertext in kernel space can be sent directly without further copies. What i'm not sure is whether this means it also becomes an effectively asynchronous API for transmitting data. ie does it have the same constraints about needing to lock pages, and not modify data in the pages? I've not investigated it in any detail, but I don't recall that being mentioned, and if it was the case, it would be impossible to enable that transparently in gnutls as its a major semantic changed. > We could implement KTLS as io_async_writev() for Channel_TLS, and change this > flag to async_enabled. If KTLS is not available, it would fall back to > using gnuTLS on io_writev, just like it would happen in zerocopy. If gnutls is going to support KTLS in a way we can use, I dont want to have any of that duplicated in QEMU code. I want to be able delegate as much as possible to gnutls, at least so that QEMU isn't in the loop for any crypto sensitive parts, as that complicates certification of crypto. Regards, Daniel
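The guard being asked for here, as a small standalone sketch (illustrative, not the patch itself; assumes Linux >= 4.14 headers): strip MSG_ZEROCOPY before sendmsg() whenever SO_ZEROCOPY could not be enabled, so the write does not fail with EINVAL on older kernels:

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

static ssize_t send_maybe_zerocopy(int fd, struct msghdr *msg,
                                   int flags, bool zerocopy_enabled)
{
    if (!zerocopy_enabled) {
        flags &= ~MSG_ZEROCOPY;  /* fall back to an ordinary copying send */
    }
    return sendmsg(fd, msg, flags);
}
```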
On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote: > > Hello Daniel, thank you for the feedback! > > > > Comments inline. > > > > On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote: > > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket > > > > send calls. It does so by avoiding copying user data into kernel buffers. > > > > > > > > To make it work, three steps are needed: > > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY > > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall > > > > 3 - Process the socket's error queue, dealing with any error > > > > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY. > > > > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg() > > > from a synchronous call to an asynchronous call. > > > > You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY in > > a somehow synchronous way, but returning errp (and sometimes closing the > > channel because of it) was a poor implementation. > > > > > > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg > > > until an asynchronous completion notification has been received from > > > the socket error queue. These notifications are not required to > > > arrive in-order, even for a TCP stream, because the kernel hangs on > > > to the buffer if a re-transmit is needed. > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > "Page pinning also changes system call semantics. It temporarily > > > shares the buffer between process and network stack. Unlike with > > > copying, the process cannot immediately overwrite the buffer > > > after system call return without possibly modifying the data in > > > flight. Kernel integrity is not affected, but a buggy program > > > can possibly corrupt its own data stream." > > > > > > > By the above piece of documentation, I get there is no problem in > > overwriting the buffer, but a corrupt, or newer version of the memory may > > be sent instead of the original one. I am pointing this out because there > > are workloads like page migration that would not be impacted, given > > once the buffer is changed, it will dirty the page and it will be re-sent. > > The idea of having the buffer overwritten is still seriously worrying > me. I get the idea that in theory the "worst" that should happen is > that we unexpectedly transmit newer guest memory contents. There are > so many edge cases in migration code though that I'm worried there > might be scenarios in which that is still going to cause problems. I agree we should keep a eye on that, maybe have David Gilbert's opinion on that. > > The current migration code has a synchronization point at the end of > each iteration of the migration loop. At the end of each iteration, > the source has a guarantee that all pages from the dirty bitmap have > now been sent. With the ZEROCOPY feature we have entirely removed all > synchronization points from the source host wrt memory sending. At > best we know that the pages will have been sent if we get to close() > without seeing errors, unless we're going to explicitly track the > completion messages. 
The TCP re-transmit semantics are especially > worrying to me, because the re-transmit is liable to send different > data than was in the original lost TCP packet. > > Maybe everything is still safe because TCP is a reliable ordered > stream, and thus eventually the dst will get the newest guest > page contents. I'm still very worried that we have code in the > source that implicitly relies on there being some synchronization > points that we've effectively removed. > > > > AFAICT, the design added in this patch does not provide any way > > > to honour these requirements around buffer lifetime. > > > > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless > > > way. The buffer lifetime requirements imply need for an API > > > design that is fundamentally different for asynchronous usage, > > > with a callback to notify when the write has finished/failed. > > > > That is a good point. > > Proposing a new optional method like io_async_writev() + an error > > checking mechanism could do the job. > > io_async_writev() could fall-back to io_writev() in cases where it's not > > implemented. > > > > I am not sure about the error checking yet. > > Options I can see are: > > 1 - A callback, as you suggested, which IIUC would be provided by > > code using the QIOChannel, and would only fix the reported errors, > > leaving the responsibility of checking for errors to the IOChannel code. > > A callback is fairly simple, but potentially limits performance. Reading > the kernel docs it seems they intentionally merge notifications to enable > a whole contiguous set to be processed in one go. A callback would > effectively be unmerging them requiring processing one a time. Not > sure how much this matters to our usage. > > > 2 - A new method, maybe io_async_errck(), which would be called > > whenever the using code wants to deal with pending errors. It could > > return an array/list of IOVs that need to be re-sent, for example, > > and code using QIOChannel could deal with it however it wants. > > Considering that we're using TCP, we get a reliable, ordered data > stream. So there should never be a failure that requires use to > know specific IOVs to re-sent - the kernel still handles TCP > re-transmit - we just have to still have the buffer available. > As noted above though, the re-transmit is liable to send different > data than the original transmit, which worries me. > > So the only errors we should get from the completion notifications > would be errors that are completely fatal for the TCP stream. So > for errors, we only need to know whether an error has ocurred or > not. The second reason for the completion notifications is for > providing a synchronization, to know when the buffer can be released > or overwritten. That is the only scenario we need to know list of > IOVs that completed. > > > > > > > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > > > > "Unable to write to socket"); > > > > return -1; > > > > } > > > > + > > > > + if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) { > > > > + sioc->errq_pending += niov; > > > > + if (sioc->errq_pending > SOCKET_ERRQ_THRESH) { > > > > + qio_channel_socket_errq_proc(sioc, errp); > > > > + } > > > > + } > > > > > > This silently looses any errors set from upto the final > > > SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set. > > > > You are right. > > > > > > > > Also means if you have a series of writes without > > > MSG_ZEROCOPY, it'll delay checking any pending > > > errors. 
> > > > That's expected... if there are only happening sends without MSG_ZEROCOPY, > > it means the ones sent with zerocopy can wait. The problem would be > > the above case. > > Well it depends whether there are synchonization requirements in the > caller. For example, current migration code knows that at the end of > each iteration sending pages, that the data has all actally been > sent. With this new logic, we might have done several more iterations > of the migration loop before the data for the original iteration has > been sent. The writes() for the original iteration may well be sending > the data from the later iterations. Possibly this is ok, but it is > a clear difference in the data that will actually go out on the wire > with this code. > > > > I would suggest checkig in close(), but as mentioned > > > earlier, I think the design is flawed because the caller > > > fundamentally needs to know about completion for every > > > single write they make in order to know when the buffer > > > can be released / reused. > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(), > > activated with a > > parameter flag, or on a different method if callback is preferred): > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll() > > syscall after each packet sent, and this means the fd gets a signal after each > > sendmsg() happens, with error or not. > > > > We could harness this with a poll() and a relatively high timeout: > > - We stop sending packets, and then call poll(). > > - Whenever poll() returns 0, it means a timeout happened, and so it > > took too long > > without sendmsg() happening, meaning all the packets are sent. > > - If it returns anything else, we go back to fixing the errors found (re-send) > > > > The problem may be defining the value of this timeout, but it could be > > called only > > when zerocopy is active. > > Maybe we need to check completions at the end of each iteration of the > migration dirtypage loop ? Sorry, I am really new to this, and I still couldn't understand why would we need to check at the end of each iteration, instead of doing a full check at the end. Is that a case of a fatal error on TCP stream, in which a lot of packets were reported as 'sent' but we may lose track of which were in fact sent? Would the poll() approach above be enough for a 'flush()' ? > > > > > > +static void > > > > +qio_channel_socket_set_zerocopy(QIOChannel *ioc, > > > > + bool enabled) > > > > +{ > > > > + QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); > > > > + int v = enabled ? 1 : 0; > > > > + int ret; > > > > + > > > > + ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > > > > + if (ret >= 0) { > > > > + sioc->zerocopy_enabled = true; > > > > + } > > > > +} > > > > > > Surely we need to tell the caller wether this succeeed, otherwise > > > the later sendmsg() is going to fail with EINVAL on older kernels > > > where MSG_ZEROCOPY is not supported. > > > > Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it > > should have been > > something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this > > way it would > > reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy > > otherwise. > > > > or something like this, if we want it to stick with zerocopy if > > setting it off fails. > > if (ret >= 0) { > > sioc->zerocopy_enabled = enabled; > > } > > Yes, that is a bug fix we need, but actually I was refering > to the later sendmsg() call. 
Surely we need to clear MSG_ZEROCOPY > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL > from sendmsg. Agree, something like io_writev(,, sioc->zerocopy_enabled ? MSG_ZEROCOPY : 0, errp)' should do, right? (or an io_async_writev(), that will fall_back to io_writev() if zerocopy is disabled) > > > > > diff --git a/io/channel-tls.c b/io/channel-tls.c > > > > index 4ce890a538..bf44b0f7b0 100644 > > > > --- a/io/channel-tls.c > > > > +++ b/io/channel-tls.c > > > > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc, > > > > qio_channel_set_delay(tioc->master, enabled); > > > > } > > > > > > > > + > > > > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc, > > > > + bool enabled) > > > > +{ > > > > + QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc); > > > > + > > > > + qio_channel_set_zerocopy(tioc->master, enabled); > > > > +} > > > > > > This is going to be unsafe because gnutls will discard/reuse the > > > buffer for the ciphertext after every write(). I can't see a > > > way to safely enable MSG_ZEROCOPY when TLS is used. > > > > Yeah, that makes sense. > > It would make more sense to implement KTLS, as IIRC it kind of does > > 'zerocopy', since it saves the encrypted data directly in kernel buffer. > > I would hope it is zerocopy, in so much as the kernel can directly > use the userspace pages as the plaintext, and then the ciphertext > in kernel space can be sent directly without further copies. > > What i'm not sure is whether this means it also becomes an effectively > asynchronous API for transmitting data. ie does it have the same > constraints about needing to lock pages, and not modify data in the > pages? I've not investigated it in any detail, but I don't recall > that being mentioned, and if it was the case, it would be impossible > to enable that transparently in gnutls as its a major semantic changed. I think it somehow waits for the user buffer to be encrypted to the kernel buffer, and then returns, behaving like a normal sendmsg(). (except if it's using kernel async cripto API, which I don't think is the case) > > > We could implement KTLS as io_async_writev() for Channel_TLS, and change this > > flag to async_enabled. If KTLS is not available, it would fall back to > > using gnuTLS on io_writev, just like it would happen in zerocopy. > > If gnutls is going to support KTLS in a way we can use, I dont want to > have any of that duplicated in QEMU code. I want to be able delegate > as much as possible to gnutls, at least so that QEMU isn't in the loop > for any crypto sensitive parts, as that complicates certification > of crypto. Yeah, that's a very good argument :) I think it's probably the case to move from the current callback implementation to the implementation in which we give gnuTLS the file descriptor. ( I was thinking on implementing an simpler direct gnuTLS version only for the 'zerocopy' QIOChannel_TLS, but I think it would not make much sense.)
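For illustration, here is a minimal sketch of the bug fix being discussed: record the requested state only when setsockopt() succeeds, and tell the caller when it did not. The int return value and the Error parameter are assumptions about how a fixed interface might look, not something this patch provides:

/*
 * Sketch only: a variant of the patch's setter that records the requested
 * state only when setsockopt() succeeds, and reports failure to the caller
 * instead of silently ignoring it.
 */
static int
qio_channel_socket_set_zerocopy(QIOChannel *ioc,
                                bool enabled,
                                Error **errp)
{
    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
    int v = enabled ? 1 : 0;
    int ret;

    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
    if (ret < 0) {
        /* old kernel (< 4.14) or unsupported socket type: keep copying */
        error_setg_errno(errp, errno, "Unable to set SO_ZEROCOPY");
        return -1;
    }

    /* only reflect the new state once the kernel has accepted it */
    sioc->zerocopy_enabled = enabled;
    return 0;
}

A caller could then fall back to plain copying writes (and never pass MSG_ZEROCOPY) whenever this returns an error.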
On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote: > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote: > > > > > > I would suggest checkig in close(), but as mentioned > > > > earlier, I think the design is flawed because the caller > > > > fundamentally needs to know about completion for every > > > > single write they make in order to know when the buffer > > > > can be released / reused. > > > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(), > > > activated with a > > > parameter flag, or on a different method if callback is preferred): > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll() > > > syscall after each packet sent, and this means the fd gets a signal after each > > > sendmsg() happens, with error or not. > > > > > > We could harness this with a poll() and a relatively high timeout: > > > - We stop sending packets, and then call poll(). > > > - Whenever poll() returns 0, it means a timeout happened, and so it > > > took too long > > > without sendmsg() happening, meaning all the packets are sent. > > > - If it returns anything else, we go back to fixing the errors found (re-send) > > > > > > The problem may be defining the value of this timeout, but it could be > > > called only > > > when zerocopy is active. > > > > Maybe we need to check completions at the end of each iteration of the > > migration dirtypage loop ? > > Sorry, I am really new to this, and I still couldn't understand why would we > need to check at the end of each iteration, instead of doing a full check at the > end. The end of each iteration is an implicit synchronization point in the current migration code. For example, we might do 2 iterations of migration pre-copy, and then switch to post-copy mode. If the data from those 2 iterations hasn't been sent at the point we switch to post-copy, that is a semantic change from current behaviour. I don't know if that will have an problematic effect on the migration process, or not. Checking the async completions at the end of each iteration though, would ensure the semantics similar to current semantics, reducing risk of unexpected problems. > > > or something like this, if we want it to stick with zerocopy if > > > setting it off fails. > > > if (ret >= 0) { > > > sioc->zerocopy_enabled = enabled; > > > } > > > > Yes, that is a bug fix we need, but actually I was refering > > to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY > > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL > > from sendmsg. > > Agree, something like io_writev(,, sioc->zerocopy_enabled ? > MSG_ZEROCOPY : 0, errp)' > should do, right? > (or an io_async_writev(), that will fall_back to io_writev() if > zerocopy is disabled) Something like that - depends what API we end up deciding on > > > We could implement KTLS as io_async_writev() for Channel_TLS, and change this > > > flag to async_enabled. If KTLS is not available, it would fall back to > > > using gnuTLS on io_writev, just like it would happen in zerocopy. > > > > If gnutls is going to support KTLS in a way we can use, I dont want to > > have any of that duplicated in QEMU code. I want to be able delegate > > as much as possible to gnutls, at least so that QEMU isn't in the loop > > for any crypto sensitive parts, as that complicates certification > > of crypto. 
> > Yeah, that's a very good argument :) > I think it's probably the case to move from the current callback implementation > to the implementation in which we give gnuTLS the file descriptor. That would introduce a coupling between the two QIOChannel impls though, which is avoided so far, because we didn't want an assumption that a QIOChannel == a FD. Regards, Daniel
On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote: > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote: > > > > > > > > I would suggest checkig in close(), but as mentioned > > > > > earlier, I think the design is flawed because the caller > > > > > fundamentally needs to know about completion for every > > > > > single write they make in order to know when the buffer > > > > > can be released / reused. > > > > > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(), > > > > activated with a > > > > parameter flag, or on a different method if callback is preferred): > > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll() > > > > syscall after each packet sent, and this means the fd gets a signal after each > > > > sendmsg() happens, with error or not. > > > > > > > > We could harness this with a poll() and a relatively high timeout: > > > > - We stop sending packets, and then call poll(). > > > > - Whenever poll() returns 0, it means a timeout happened, and so it > > > > took too long > > > > without sendmsg() happening, meaning all the packets are sent. > > > > - If it returns anything else, we go back to fixing the errors found (re-send) > > > > > > > > The problem may be defining the value of this timeout, but it could be > > > > called only > > > > when zerocopy is active. > > > > > > Maybe we need to check completions at the end of each iteration of the > > > migration dirtypage loop ? > > > > Sorry, I am really new to this, and I still couldn't understand why would we > > need to check at the end of each iteration, instead of doing a full check at the > > end. > > The end of each iteration is an implicit synchronization point in the > current migration code. > > For example, we might do 2 iterations of migration pre-copy, and then > switch to post-copy mode. If the data from those 2 iterations hasn't > been sent at the point we switch to post-copy, that is a semantic > change from current behaviour. I don't know if that will have an > problematic effect on the migration process, or not. Checking the > async completions at the end of each iteration though, would ensure > the semantics similar to current semantics, reducing risk of unexpected > problems. > What if we do the 'flush()' before we start post-copy, instead of after each iteration? would that be enough? > > > > > or something like this, if we want it to stick with zerocopy if > > > > setting it off fails. > > > > if (ret >= 0) { > > > > sioc->zerocopy_enabled = enabled; > > > > } > > > > > > Yes, that is a bug fix we need, but actually I was refering > > > to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY > > > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL > > > from sendmsg. > > > > Agree, something like io_writev(,, sioc->zerocopy_enabled ? > > MSG_ZEROCOPY : 0, errp)' > > should do, right? > > (or an io_async_writev(), that will fall_back to io_writev() if > > zerocopy is disabled) > > Something like that - depends what API we end up deciding on Great :) > > > > > We could implement KTLS as io_async_writev() for Channel_TLS, and change this > > > > flag to async_enabled. If KTLS is not available, it would fall back to > > > > using gnuTLS on io_writev, just like it would happen in zerocopy. 
> > > > > > If gnutls is going to support KTLS in a way we can use, I dont want to > > > have any of that duplicated in QEMU code. I want to be able delegate > > > as much as possible to gnutls, at least so that QEMU isn't in the loop > > > for any crypto sensitive parts, as that complicates certification > > > of crypto. > > > > Yeah, that's a very good argument :) > > I think it's probably the case to move from the current callback implementation > > to the implementation in which we give gnuTLS the file descriptor. > > That would introduce a coupling between the two QIOChannel impls > though, which is avoided so far, because we didn't want an assuption > that a QIOChannel == a FD. Interesting, QIOChannelTLS would necessarily need to have a QIOChannelSocket, is that right? Maybe we can get a QIOChannelKTLS, which would use gnuTLS but get to have it's own fd, but that possibly would cause a lot of duplicated code. On the other hand, this way the QIOChannelTLS would still be able to accept any other channel, if desired. Anyway, it's a complicated decision :) > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| > Thank you, I am learning a lot today :) Best regards, Leonardo
On Thu, Sep 02, 2021 at 07:19:58AM -0300, Leonardo Bras Soares Passos wrote: > On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote: > > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote: > > > > > > > > > > I would suggest checkig in close(), but as mentioned > > > > > > earlier, I think the design is flawed because the caller > > > > > > fundamentally needs to know about completion for every > > > > > > single write they make in order to know when the buffer > > > > > > can be released / reused. > > > > > > > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(), > > > > > activated with a > > > > > parameter flag, or on a different method if callback is preferred): > > > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll() > > > > > syscall after each packet sent, and this means the fd gets a signal after each > > > > > sendmsg() happens, with error or not. > > > > > > > > > > We could harness this with a poll() and a relatively high timeout: > > > > > - We stop sending packets, and then call poll(). > > > > > - Whenever poll() returns 0, it means a timeout happened, and so it > > > > > took too long > > > > > without sendmsg() happening, meaning all the packets are sent. > > > > > - If it returns anything else, we go back to fixing the errors found (re-send) > > > > > > > > > > The problem may be defining the value of this timeout, but it could be > > > > > called only > > > > > when zerocopy is active. > > > > > > > > Maybe we need to check completions at the end of each iteration of the > > > > migration dirtypage loop ? > > > > > > Sorry, I am really new to this, and I still couldn't understand why would we > > > need to check at the end of each iteration, instead of doing a full check at the > > > end. > > > > The end of each iteration is an implicit synchronization point in the > > current migration code. > > > > For example, we might do 2 iterations of migration pre-copy, and then > > switch to post-copy mode. If the data from those 2 iterations hasn't > > been sent at the point we switch to post-copy, that is a semantic > > change from current behaviour. I don't know if that will have an > > problematic effect on the migration process, or not. Checking the > > async completions at the end of each iteration though, would ensure > > the semantics similar to current semantics, reducing risk of unexpected > > problems. > > > > What if we do the 'flush()' before we start post-copy, instead of after each > iteration? would that be enough? Possibly, yes. This really need David G's input since he understands the code in way more detail than me. Regards, Daniel
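As a concrete (and purely illustrative) form of the flag masking discussed above, assuming the check ends up living inside the socket channel rather than in every caller; the helper name is made up:

/*
 * Sketch: drop MSG_ZEROCOPY unless SO_ZEROCOPY was actually enabled on this
 * channel, so a kernel without the feature never sees the flag (avoiding the
 * EINVAL case discussed above).  Where this check lives is still open.
 */
static int qio_channel_socket_sendmsg_flags(QIOChannelSocket *sioc, int flags)
{
    if (!sioc->zerocopy_enabled) {
        flags &= ~MSG_ZEROCOPY;     /* fall back to the default copying send */
    }
    return flags;
}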
* Daniel P. Berrangé (berrange@redhat.com) wrote: > On Thu, Sep 02, 2021 at 07:19:58AM -0300, Leonardo Bras Soares Passos wrote: > > On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote: > > > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote: > > > > > > > > > > > > I would suggest checkig in close(), but as mentioned > > > > > > > earlier, I think the design is flawed because the caller > > > > > > > fundamentally needs to know about completion for every > > > > > > > single write they make in order to know when the buffer > > > > > > > can be released / reused. > > > > > > > > > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(), > > > > > > activated with a > > > > > > parameter flag, or on a different method if callback is preferred): > > > > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll() > > > > > > syscall after each packet sent, and this means the fd gets a signal after each > > > > > > sendmsg() happens, with error or not. > > > > > > > > > > > > We could harness this with a poll() and a relatively high timeout: > > > > > > - We stop sending packets, and then call poll(). > > > > > > - Whenever poll() returns 0, it means a timeout happened, and so it > > > > > > took too long > > > > > > without sendmsg() happening, meaning all the packets are sent. > > > > > > - If it returns anything else, we go back to fixing the errors found (re-send) > > > > > > > > > > > > The problem may be defining the value of this timeout, but it could be > > > > > > called only > > > > > > when zerocopy is active. > > > > > > > > > > Maybe we need to check completions at the end of each iteration of the > > > > > migration dirtypage loop ? > > > > > > > > Sorry, I am really new to this, and I still couldn't understand why would we > > > > need to check at the end of each iteration, instead of doing a full check at the > > > > end. > > > > > > The end of each iteration is an implicit synchronization point in the > > > current migration code. > > > > > > For example, we might do 2 iterations of migration pre-copy, and then > > > switch to post-copy mode. If the data from those 2 iterations hasn't > > > been sent at the point we switch to post-copy, that is a semantic > > > change from current behaviour. I don't know if that will have an > > > problematic effect on the migration process, or not. Checking the > > > async completions at the end of each iteration though, would ensure > > > the semantics similar to current semantics, reducing risk of unexpected > > > problems. > > > > > > > What if we do the 'flush()' before we start post-copy, instead of after each > > iteration? would that be enough? > > Possibly, yes. This really need David G's input since he understands > the code in way more detail than me. Hmm I'm not entirely sure why we have the sync after each iteration; the case I can think of is if we're doing async sending, we could have two versions of the same page in flight (one from each iteration) - you'd want those to get there in the right order. Dave > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| >
On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote: > I also suggested something like that, but I thought it could be good if we could > fall back to io_writev() if we didn't have the zerocopy feature (or > the async feature). > > What do you think? That fallback looks safe and ok, I'm just not sure whether it'll be of great help. E.g. if we provide an QIO api that allows both sync write and zero-copy write (then we do the fallback when necessary), it means the buffer implication applies too to this api, so it's easier to me to just detect the zero copy capability and use one alternative. Thanks,
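A sketch of the "detect the capability" alternative Peter describes, using the existing qio_channel_has_feature() helper; the QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY flag and qio_channel_writev_zerocopy() are hypothetical names used here for illustration only:

/* caller-side selection: pick one path up front instead of falling back */
if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
    /* buffers must stay untouched until a later flush/completion check */
    qio_channel_writev_zerocopy(ioc, iov, niov, errp);
} else {
    /* ordinary copying write, no buffer lifetime constraint */
    qio_channel_writev_all(ioc, iov, niov, errp);
}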
On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote: > > > What if we do the 'flush()' before we start post-copy, instead of after each > > > iteration? would that be enough? > > > > Possibly, yes. This really need David G's input since he understands > > the code in way more detail than me. > > Hmm I'm not entirely sure why we have the sync after each iteration; We don't have that yet, do we? > the case I can think of is if we're doing async sending, we could have > two versions of the same page in flight (one from each iteration) - > you'd want those to get there in the right order. From MSG_ZEROCOPY document: For protocols that acknowledge data in-order, like TCP, each notification can be squashed into the previous one, so that no more than one notification is outstanding at any one point. Ordered delivery is the common case, but not guaranteed. Notifications may arrive out of order on retransmission and socket teardown. So no matter whether we have a flush() per iteration before, it seems we may want one when zero copy enabled? Thanks,
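To make the quoted documentation concrete, this is roughly what reading one (possibly merged) completion notification looks like; a plain sketch against the kernel API, with QEMU-style error handling omitted. Each MSG_ZEROCOPY sendmsg() gets a sequence number, and a single notification covers the inclusive range [ee_info, ee_data]:

#include <errno.h>
#include <linux/errqueue.h>
#include <stdint.h>
#include <sys/socket.h>

/* returns 1 if a completion range was read, 0 if none pending, -1 on error */
static int read_zerocopy_completion(int fd, uint32_t *lo, uint32_t *hi)
{
    char control[128];                          /* room for the cmsg payload */
    struct msghdr msg = { .msg_control = control,
                          .msg_controllen = sizeof(control) };
    struct cmsghdr *cm;
    struct sock_extended_err *serr;

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
        return errno == EAGAIN ? 0 : -1;        /* error queue never blocks */
    }

    cm = CMSG_FIRSTHDR(&msg);
    if (!cm) {
        return -1;
    }
    serr = (struct sock_extended_err *)CMSG_DATA(cm);
    if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
        return -1;                              /* a real socket error */
    }
    /* merged notification: sends ee_info through ee_data have completed */
    *lo = serr->ee_info;
    *hi = serr->ee_data;
    return 1;
}

For these notifications ee_errno is zero, and the SO_EE_CODE_ZEROCOPY_COPIED bit in ee_code tells you the kernel silently fell back to copying for that range.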
* Peter Xu (peterx@redhat.com) wrote: > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote: > > > > What if we do the 'flush()' before we start post-copy, instead of after each > > > > iteration? would that be enough? > > > > > > Possibly, yes. This really need David G's input since he understands > > > the code in way more detail than me. > > > > Hmm I'm not entirely sure why we have the sync after each iteration; > > We don't have that yet, do we? I think multifd does; I think it calls multifd_send_sync_main sometimes, which I *think* corresponds to iterations. Dave > > the case I can think of is if we're doing async sending, we could have > > two versions of the same page in flight (one from each iteration) - > > you'd want those to get there in the right order. > > From MSG_ZEROCOPY document: > > For protocols that acknowledge data in-order, like TCP, each > notification can be squashed into the previous one, so that no more > than one notification is outstanding at any one point. > > Ordered delivery is the common case, but not guaranteed. Notifications > may arrive out of order on retransmission and socket teardown. > > So no matter whether we have a flush() per iteration before, it seems we may > want one when zero copy enabled? > > Thanks, > > -- > Peter Xu >
On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote: > * Peter Xu (peterx@redhat.com) wrote: > > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote: > > > > > What if we do the 'flush()' before we start post-copy, instead of after each > > > > > iteration? would that be enough? > > > > > > > > Possibly, yes. This really need David G's input since he understands > > > > the code in way more detail than me. > > > > > > Hmm I'm not entirely sure why we have the sync after each iteration; > > > > We don't have that yet, do we? > > I think multifd does; I think it calls multifd_send_sync_main sometimes, > which I *think* corresponds to iterations. Oh.. Then we may need to: (1) Make that sync work for zero-copy too; say, semaphores may not be enough with it - we need to make sure all async buffers are consumed by checking the error queue of the sockets, (2) When we make zero-copy work without multi-fd, we may want some similar sync on the main stream too Thanks, > > Dave > > > > the case I can think of is if we're doing async sending, we could have > > > two versions of the same page in flight (one from each iteration) - > > > you'd want those to get there in the right order. > > > > From MSG_ZEROCOPY document: > > > > For protocols that acknowledge data in-order, like TCP, each > > notification can be squashed into the previous one, so that no more > > than one notification is outstanding at any one point. > > > > Ordered delivery is the common case, but not guaranteed. Notifications > > may arrive out of order on retransmission and socket teardown. > > > > So no matter whether we have a flush() per iteration before, it seems we may > > want one when zero copy enabled? > > > > Thanks, > > > > -- > > Peter Xu > > > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK >
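A sketch of what such a sync/flush could look like for one socket, assuming the channel counts one outstanding completion per MSG_ZEROCOPY sendmsg() (the kernel's counter is per call, not per iov) and reusing the read_zerocopy_completion() helper sketched above; the name and placement are placeholders:

#include <errno.h>
#include <poll.h>
#include <stdint.h>

static int zerocopy_flush(int fd, uint32_t *outstanding)
{
    /* with events == 0, poll() still reports POLLERR for the error queue */
    struct pollfd pfd = { .fd = fd, .events = 0 };
    uint32_t lo, hi;
    int ret;

    while (*outstanding > 0) {
        ret = read_zerocopy_completion(fd, &lo, &hi);
        if (ret < 0) {
            return -1;                      /* fatal error on the stream */
        }
        if (ret > 0) {
            *outstanding -= hi - lo + 1;    /* notifications may be merged */
            continue;
        }
        /* nothing queued yet: wait for the error queue to fill */
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR) {
            return -1;
        }
    }
    return 0;
}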
On Tue, Sep 7, 2021 at 1:44 PM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote: > > I also suggested something like that, but I thought it could be good if we could > > fall back to io_writev() if we didn't have the zerocopy feature (or > > the async feature). > > > > What do you think? > > That fallback looks safe and ok, I'm just not sure whether it'll be of great > help. E.g. if we provide an QIO api that allows both sync write and zero-copy > write (then we do the fallback when necessary), it means the buffer implication > applies too to this api, so it's easier to me to just detect the zero copy > capability and use one alternative. Thanks, > > -- > Peter Xu > I was thinking this way (async method with fallback) we would allow code using the QIO api to just try to use the io_async_writev() whenever the code seems fit, and let the QIO channel layer to decide when it can be used (implementation provides, features available), and just fallback to io_writev() when it can't be used. Best regards, Leo
On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > Possibly, yes. This really need David G's input since he understands > > the code in way more detail than me. > > Hmm I'm not entirely sure why we have the sync after each iteration; > the case I can think of is if we're doing async sending, we could have > two versions of the same page in flight (one from each iteration) - > you'd want those to get there in the right order. > > Dave Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY, we will in fact have the same page in flight twice, instead of two versions, given the buffer is sent as it is during transmission. For me it looks like there will be no change in the current algorithm, but I am still a beginner in migration code, and I am probably missing something. Best regards, Leo
On Wed, Sep 08, 2021 at 05:13:40PM -0300, Leonardo Bras Soares Passos wrote: > On Tue, Sep 7, 2021 at 1:44 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote: > > > I also suggested something like that, but I thought it could be good if we could > > > fall back to io_writev() if we didn't have the zerocopy feature (or > > > the async feature). > > > > > > What do you think? > > > > That fallback looks safe and ok, I'm just not sure whether it'll be of great > > help. E.g. if we provide an QIO api that allows both sync write and zero-copy > > write (then we do the fallback when necessary), it means the buffer implication > > applies too to this api, so it's easier to me to just detect the zero copy > > capability and use one alternative. Thanks, > > > > -- > > Peter Xu > > > > I was thinking this way (async method with fallback) we would allow code using > the QIO api to just try to use the io_async_writev() whenever the code > seems fit, and > let the QIO channel layer to decide when it can be used > (implementation provides, > features available), and just fallback to io_writev() when it can't be used. Sure, I think it depends on whether the whole sync/async interface will be the same, e.g., when async api needs some callback registered then the interface won't be the same, and the caller will be aware of that anyways. Otherwise it looks fine indeed. Thanks,
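The two interface shapes under discussion, written out as declarations; nothing below exists in QEMU, all names are placeholders, and (a) is closer to what Leonardo describes while (b) is the per-write callback variant:

/* (a) writev-like call plus an explicit synchronization point */
ssize_t qio_channel_async_writev(QIOChannel *ioc,
                                 const struct iovec *iov, size_t niov,
                                 Error **errp);   /* may fall back to copy */
int qio_channel_async_flush(QIOChannel *ioc,
                            Error **errp);        /* waits for completions */

/* (b) completion callback, invoked once per finished (or failed) write */
typedef void (*QIOChannelWriteDoneFunc)(QIOChannel *ioc, ssize_t ret,
                                        void *opaque);
ssize_t qio_channel_async_writev_cb(QIOChannel *ioc,
                                    const struct iovec *iov, size_t niov,
                                    QIOChannelWriteDoneFunc done, void *opaque,
                                    Error **errp);

Shape (a) keeps callers almost unchanged but pushes the buffer-lifetime rule into documentation; shape (b) makes the lifetime explicit at the cost of unmerging the kernel's batched notifications, which is the performance concern raised earlier in the thread.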
On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote: > On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert > <dgilbert@redhat.com> wrote: > > > Possibly, yes. This really need David G's input since he understands > > > the code in way more detail than me. > > > > Hmm I'm not entirely sure why we have the sync after each iteration; > > the case I can think of is if we're doing async sending, we could have > > two versions of the same page in flight (one from each iteration) - > > you'd want those to get there in the right order. > > > > Dave > > Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY, we will in > fact have the same page in flight twice, instead of two versions, > given the buffer is > sent as it is during transmission. That's an interesting point, which looks even valid... :) There can still be two versions depending on when the page is read and feed to the NICs as the page can be changing during the window, but as long as the latter sent page always lands later than the former page on dest node then it looks ok.
On Wed, Sep 08, 2021 at 05:09:33PM -0400, Peter Xu wrote: > On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote: > > On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert > > <dgilbert@redhat.com> wrote: > > > > Possibly, yes. This really need David G's input since he understands > > > > the code in way more detail than me. > > > > > > Hmm I'm not entirely sure why we have the sync after each iteration; > > > the case I can think of is if we're doing async sending, we could have > > > two versions of the same page in flight (one from each iteration) - > > > you'd want those to get there in the right order. > > > > > > Dave > > > > Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY, we will in > > fact have the same page in flight twice, instead of two versions, > > given the buffer is > > sent as it is during transmission. > > That's an interesting point, which looks even valid... :) > > There can still be two versions depending on when the page is read and feed to > the NICs as the page can be changing during the window, but as long as the > latter sent page always lands later than the former page on dest node then it > looks ok. The really strange scenario is around TCP re-transmits. The kernel may transmit page version 1, then we have version 2 written. The kernel now needs to retransmit a packet due to network loss. The re-transmission will contain version 2 payload which differs from the originally lost packet's payload. IOW, the supposed "reliable" TCP stream is no longer reliable and actually changes its data retrospectively because we've intentionally violated the rules the kernel documented for use of MSG_ZEROCOPY. We think we're probably ok with migration as we are going to rely on the fact that we eventually pause the guest to stop page changes during the final switchover. None the less I really strongly dislike the idea of not honouring the kernel API contract, despite the potential performance benefits it brings. Regards, Daniel
On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote: > We think we're probably ok with migration as we are going to rely on the > face that we eventually pause the guest to stop page changes during the > final switchover. None the less I really strongly dislike the idea of > not honouring the kernel API contract, despite the potential performance > benefits it brings. Yes understandable, and it does looks tricky. But it's guest page and it's just by nature how it works to me (sending page happening in parallel with page being modified). I think the MSG_ZEROCOPY doc page mentioned that and it's userspace program's own choice if that happens. So even if it's not by design and not suggested, it seems not forbidden either. We can wr-protect the page (using things like userfaultfd-wp) during sending and unprotect them when it's done with a qio callback, that'll guarantee the buffer not changing during sending, however we gain nothing besides the "api cleaness" then.. Thanks,
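For reference, the write-protect idea sketched against the userfaultfd-wp API (Linux 5.7+); this only makes the trade-off concrete and is not being proposed for the patch. It assumes the region was already registered with UFFDIO_REGISTER_MODE_WP on a userfaultfd opened with UFFD_FEATURE_PAGEFAULT_FLAG_WP:

#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* protect right before the MSG_ZEROCOPY send, unprotect on its completion */
static int wp_range(int uffd, void *addr, size_t len, bool protect)
{
    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
    };

    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}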
On Wed, Sep 8, 2021 at 11:05 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote: > > We think we're probably ok with migration as we are going to rely on the > > face that we eventually pause the guest to stop page changes during the > > final switchover. None the less I really strongly dislike the idea of > > not honouring the kernel API contract, despite the potential performance > > benefits it brings. > > Yes understandable, and it does looks tricky. But it's guest page and it's just > by nature how it works to me (sending page happening in parallel with page > being modified). > > I think the MSG_ZEROCOPY doc page mentioned that and it's userspace program's > own choice if that happens. So even if it's not by design and not suggested, it > seems not forbidden either. > > We can wr-protect the page (using things like userfaultfd-wp) during sending > and unprotect them when it's done with a qio callback, that'll guarantee the > buffer not changing during sending, however we gain nothing besides the "api > cleaness" then.. > > Thanks, > > -- > Peter Xu > I can't stress enough how inexperienced I am in qemu codebase, but based on the current discussion and my humble opinion, it would make sense if something like an async_writev + async_flush method (or a callback instead) would be offered by QIOChannel, and let the responsibility of "which flags to use", "when to lock", and "how / when to flush" to the codebase using it. I mean, it's clear how the sync requirements will be very different depending on what the using code will be sending with it, so It makes sense to me that we offer the tool and let it decide how it should be used (and possibly deal with any consequences). Is there anything that could go wrong with the above that I am missing? About the callback proposed, I am not sure how that would work in an efficient way. Could someone help me with that? FWIW, what I had in mind for a (theoretical) migration setup with io_async_writev() + io_async_flush(): - For guest RAM we can decide not to rw_lock memory on zerocopy, because there is no need, - For headers, we can decide to not use async (use io_writev() instead), - flush() can happen each iteration of migration, or at each N seconds, or at the end. Thank you for the great discussion :) Leonardo Bras
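What that split might look like at a call site, reusing the placeholder API declarations sketched earlier; the function below is illustrative only, error handling is elided, and the real migration code is of course more involved:

/* per-iteration send: copying writes for metadata, zerocopy for guest RAM */
static void send_one_iteration(QIOChannel *ioc,
                               const struct iovec *header, size_t nheader,
                               const struct iovec *pages, size_t npages,
                               Error **errp)
{
    /* small, frequently reused buffers: plain copying write */
    qio_channel_writev_all(ioc, header, nheader, errp);

    /* guest RAM: zerocopy; pages stay mapped and may keep being dirtied */
    qio_channel_async_writev(ioc, pages, npages, errp);
}

/* ...and at the chosen sync point (per iteration, or before postcopy):
 *     qio_channel_async_flush(ioc, errp);
 */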
* Peter Xu (peterx@redhat.com) wrote: > On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote: > > * Peter Xu (peterx@redhat.com) wrote: > > > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote: > > > > > > What if we do the 'flush()' before we start post-copy, instead of after each > > > > > > iteration? would that be enough? > > > > > > > > > > Possibly, yes. This really need David G's input since he understands > > > > > the code in way more detail than me. > > > > > > > > Hmm I'm not entirely sure why we have the sync after each iteration; > > > > > > We don't have that yet, do we? > > > > I think multifd does; I think it calls multifd_send_sync_main sometimes, > > which I *think* corresponds to iterations. > > Oh.. Then we may need to: > > (1) Make that sync work for zero-copy too; say, semaphores may not be enough > with it - we need to make sure all async buffers are consumed by checking > the error queue of the sockets, > > (2) When we make zero-copy work without multi-fd, we may want some similar > sync on the main stream too It might not be worth bothering with (2) - zerocopy fits a lot better with multifd's data organisation. Dave > Thanks, > > > > > Dave > > > > > > the case I can think of is if we're doing async sending, we could have > > > > two versions of the same page in flight (one from each iteration) - > > > > you'd want those to get there in the right order. > > > > > > From MSG_ZEROCOPY document: > > > > > > For protocols that acknowledge data in-order, like TCP, each > > > notification can be squashed into the previous one, so that no more > > > than one notification is outstanding at any one point. > > > > > > Ordered delivery is the common case, but not guaranteed. Notifications > > > may arrive out of order on retransmission and socket teardown. > > > > > > So no matter whether we have a flush() per iteration before, it seems we may > > > want one when zero copy enabled? > > > > > > Thanks, > > > > > > -- > > > Peter Xu > > > > > -- > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > > > > -- > Peter Xu >
On Thu, Sep 09, 2021 at 01:58:39AM -0300, Leonardo Bras Soares Passos wrote: > FWIW, what I had in mind for a (theoretical) migration setup with > io_async_writev() + io_async_flush(): One trivial concern is it's not strictly just "async" because "async" can happen on any nonblocking fd; here it's more "zerocopy" specific. But as long as Dan is okay with it I'm fine too to start with "async", as long as we do proper documentation on the requirement of lifecycle of the buffers. > - For guest RAM we can decide not to rw_lock memory on zerocopy, > because there is no need, > - For headers, we can decide to not use async (use io_writev() instead), > - flush() can happen each iteration of migration, or at each N > seconds, or at the end. I think we should only need the flush at the end of precopy, and that should also cover when switching to postcopy. We could merge this flush() into the existing per-iteration sync of multi-fd, but it's not strictly necessary, imho. We can see which is easier. The rest looks good to me. Thanks,
diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h index e747e63514..09dffe059f 100644 --- a/include/io/channel-socket.h +++ b/include/io/channel-socket.h @@ -47,6 +47,8 @@ struct QIOChannelSocket { socklen_t localAddrLen; struct sockaddr_storage remoteAddr; socklen_t remoteAddrLen; + size_t errq_pending; + bool zerocopy_enabled; }; diff --git a/include/io/channel.h b/include/io/channel.h index dada9ebaaf..de10a78b10 100644 --- a/include/io/channel.h +++ b/include/io/channel.h @@ -137,6 +137,8 @@ struct QIOChannelClass { IOHandler *io_read, IOHandler *io_write, void *opaque); + void (*io_set_zerocopy)(QIOChannel *ioc, + bool enabled); }; /* General I/O handling functions */ @@ -570,6 +572,33 @@ int qio_channel_shutdown(QIOChannel *ioc, void qio_channel_set_delay(QIOChannel *ioc, bool enabled); +/** + * qio_channel_set_zerocopy: + * @ioc: the channel object + * @enabled: the new flag state + * + * Controls whether the underlying transport is + * permitted to use zerocopy to avoid copying the + * sending buffer in kernel. If @enabled is true, then the + * writes may avoid buffer copy in kernel. If @enabled + * is false, writes will cause the kernel to always + * copy the buffer contents before sending. + * + * In order to use make a write with zerocopy feature, + * it's also necessary to sent each packet with + * MSG_ZEROCOPY flag. With this, it's possible to + * to select only writes that would benefit from the + * use of zerocopy feature, i.e. the ones with larger + * buffers. + * + * This feature was added in Linux 4.14, so older + * versions will fail on enabling. This is not an + * issue, since it will fall-back to default copying + * approach. + */ +void qio_channel_set_zerocopy(QIOChannel *ioc, + bool enabled); + /** * qio_channel_set_cork: * @ioc: the channel object diff --git a/io/channel-socket.c b/io/channel-socket.c index e377e7303d..a69fec7315 100644 --- a/io/channel-socket.c +++ b/io/channel-socket.c @@ -26,8 +26,10 @@ #include "io/channel-watch.h" #include "trace.h" #include "qapi/clone-visitor.h" +#include <linux/errqueue.h> #define SOCKET_MAX_FDS 16 +#define SOCKET_ERRQ_THRESH 16384 SocketAddress * qio_channel_socket_get_local_address(QIOChannelSocket *ioc, @@ -55,6 +57,8 @@ qio_channel_socket_new(void) sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)); sioc->fd = -1; + sioc->zerocopy_enabled = false; + sioc->errq_pending = 0; ioc = QIO_CHANNEL(sioc); qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); @@ -520,6 +524,54 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc, return ret; } +static void qio_channel_socket_errq_proc(QIOChannelSocket *sioc, + Error **errp) +{ + int fd = sioc->fd; + int ret; + struct msghdr msg = {}; + struct sock_extended_err *serr; + struct cmsghdr *cm; + + do { + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); + if (ret <= 0) { + if (ret == 0 || errno == EAGAIN) { + /* Nothing on errqueue */ + sioc->errq_pending = 0; + break; + } + if (errno == EINTR) { + continue; + } + + error_setg_errno(errp, errno, + "Unable to read errqueue"); + break; + } + + cm = CMSG_FIRSTHDR(&msg); + if (cm->cmsg_level != SOL_IP && + cm->cmsg_type != IP_RECVERR) { + error_setg_errno(errp, EPROTOTYPE, + "Wrong cmsg in errqueue"); + break; + } + + serr = (void *) CMSG_DATA(cm); + if (serr->ee_errno != 0) { + error_setg_errno(errp, serr->ee_errno, + "Error on socket"); + break; + } + if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) { + error_setg_errno(errp, serr->ee_origin, + "Error not from zerocopy"); + break; + } + } while (true); +} 
+ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, const struct iovec *iov, size_t niov, @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, "Unable to write to socket"); return -1; } + + if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) { + sioc->errq_pending += niov; + if (sioc->errq_pending > SOCKET_ERRQ_THRESH) { + qio_channel_socket_errq_proc(sioc, errp); + } + } + return ret; } #else /* WIN32 */ @@ -689,6 +749,21 @@ qio_channel_socket_set_delay(QIOChannel *ioc, } +static void +qio_channel_socket_set_zerocopy(QIOChannel *ioc, + bool enabled) +{ + QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); + int v = enabled ? 1 : 0; + int ret; + + ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); + if (ret >= 0) { + sioc->zerocopy_enabled = true; + } +} + + static void qio_channel_socket_set_cork(QIOChannel *ioc, bool enabled) @@ -789,6 +864,7 @@ static void qio_channel_socket_class_init(ObjectClass *klass, ioc_klass->io_set_delay = qio_channel_socket_set_delay; ioc_klass->io_create_watch = qio_channel_socket_create_watch; ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler; + ioc_klass->io_set_zerocopy = qio_channel_socket_set_zerocopy; } static const TypeInfo qio_channel_socket_info = { diff --git a/io/channel-tls.c b/io/channel-tls.c index 4ce890a538..bf44b0f7b0 100644 --- a/io/channel-tls.c +++ b/io/channel-tls.c @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc, qio_channel_set_delay(tioc->master, enabled); } + +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc, + bool enabled) +{ + QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc); + + qio_channel_set_zerocopy(tioc->master, enabled); +} + + static void qio_channel_tls_set_cork(QIOChannel *ioc, bool enabled) { @@ -416,6 +426,7 @@ static void qio_channel_tls_class_init(ObjectClass *klass, ioc_klass->io_shutdown = qio_channel_tls_shutdown; ioc_klass->io_create_watch = qio_channel_tls_create_watch; ioc_klass->io_set_aio_fd_handler = qio_channel_tls_set_aio_fd_handler; + ioc_klass->io_set_zerocopy = qio_channel_tls_set_zerocopy; } static const TypeInfo qio_channel_tls_info = { diff --git a/io/channel-websock.c b/io/channel-websock.c index 035dd6075b..4e9491966b 100644 --- a/io/channel-websock.c +++ b/io/channel-websock.c @@ -1194,6 +1194,14 @@ static void qio_channel_websock_set_delay(QIOChannel *ioc, qio_channel_set_delay(tioc->master, enabled); } +static void qio_channel_websock_set_zerocopy(QIOChannel *ioc, + bool enabled) +{ + QIOChannelWebsock *tioc = QIO_CHANNEL_WEBSOCK(ioc); + + qio_channel_set_zerocopy(tioc->master, enabled); +} + static void qio_channel_websock_set_cork(QIOChannel *ioc, bool enabled) { @@ -1318,6 +1326,7 @@ static void qio_channel_websock_class_init(ObjectClass *klass, ioc_klass->io_close = qio_channel_websock_close; ioc_klass->io_shutdown = qio_channel_websock_shutdown; ioc_klass->io_create_watch = qio_channel_websock_create_watch; + ioc_klass->io_set_zerocopy = qio_channel_websock_set_zerocopy; } static const TypeInfo qio_channel_websock_info = { diff --git a/io/channel.c b/io/channel.c index ee3cb83d4d..476440e8b2 100644 --- a/io/channel.c +++ b/io/channel.c @@ -450,6 +450,17 @@ void qio_channel_set_delay(QIOChannel *ioc, } +void qio_channel_set_zerocopy(QIOChannel *ioc, + bool enabled) +{ + QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc); + + if (klass->io_set_zerocopy) { + klass->io_set_zerocopy(ioc, enabled); + } +} + + void qio_channel_set_cork(QIOChannel *ioc, bool enabled) {
MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
send calls. It does so by avoiding copying user data into kernel buffers.

To make it work, three steps are needed:
1 - A setsockopt() system call, enabling SO_ZEROCOPY
2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
3 - Processing the socket's error queue, dealing with any error

Zerocopy has its costs, so it will only improve performance if
the sending buffer is big (10KB, according to Linux docs).

Step 2 makes it possible to use the same socket to send data
using both zerocopy and the default copying approach, so the application
can choose what is best for each packet.

To implement step 1, an optional set_zerocopy() interface was created
in QIOChannel, allowing each caller to enable or disable it.

Step 2 will be enabled by the calling code at each qio_channel_write*()
that would benefit from zerocopy.

Step 3 is done with qio_channel_socket_errq_proc(), which runs after
SOCKET_ERRQ_THRESH (16k) iovs have been sent, dealing with any error found.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/io/channel-socket.h |  2 +
 include/io/channel.h        | 29 ++++++++++++++
 io/channel-socket.c         | 76 +++++++++++++++++++++++++++++++++++++
 io/channel-tls.c            | 11 ++++++
 io/channel-websock.c        |  9 +++++
 io/channel.c                | 11 ++++++
 6 files changed, 138 insertions(+)