Message ID | 20220504191835.791580-3-leobras@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | MSG_ZEROCOPY + multifd | expand |
On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote: > +/* > + * Zero-copy defines bellow are included to avoid breaking builds on systems > + * that don't support MSG_ZEROCOPY, while keeping the functions more readable > + * (without a lot of ifdefs). > + */ > +#ifndef MSG_ZEROCOPY > +#define MSG_ZEROCOPY 0x4000000 > +#endif > +#ifndef SO_ZEROCOPY > +#define SO_ZEROCOPY 60 > +#endif So this will define these two values on e.g. FreeBSD, while they do not make sense at all there because these numbers are pure magics and meaningless outside Linux.. I don't think it's anything dangerous, but IMHO it's another way of being not clean comparing of using some "#ifdef"s. Comparing to this approach the "use #ifdef" approach is actually slightly more cleaner to me. :) Let's wait for some other inputs.
On Wed, May 4, 2022 at 4:53 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote: > > +/* > > + * Zero-copy defines bellow are included to avoid breaking builds on systems > > + * that don't support MSG_ZEROCOPY, while keeping the functions more readable > > + * (without a lot of ifdefs). > > + */ > > +#ifndef MSG_ZEROCOPY > > +#define MSG_ZEROCOPY 0x4000000 > > +#endif > > +#ifndef SO_ZEROCOPY > > +#define SO_ZEROCOPY 60 > > +#endif > > So this will define these two values on e.g. FreeBSD, while they do not > make sense at all there because these numbers are pure magics and > meaningless outside Linux.. Correct. But since only in Linux it's possible to set the QIO_CHANNEL_WRITE_FLAG_ZERO_COPY flag, sflags will always be zero and it would never try using MSG_ZEROCOPY outside Linux. > I don't think it's anything dangerous, but IMHO it's another way of being > not clean comparing of using some "#ifdef"s. Comparing to this approach > the "use #ifdef" approach is actually slightly more cleaner to me. :) > This requires: - Creating a define such as 'QEMU_MSG_ZEROCOPY', that needs to include <sys/socket.h> to get some flags: #define QEMU_MSG_ZEROCOPY defined(CONFIG_LINUX) && defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY) - Making it available for all code in this patchset that does "ifdef CONFIG_LINUX' (migration/migration.c/h, qapi/migration.json, monitor/hmp-cmds.c, io/channel-socket.c) - Replace current usage of CONFIG_LINUX in this patchset for QEMU_MSG_ZEROCOPY - Change qio_channel_socket_writev() so the current 2 usages of MSG_ZEROCOPY are surrounded by ifdef QEMU_MSG_ZEROCOPY. Pros of above approach (1): - Smaller binary: The whole MSG_ZEROCOPY code is compiled out if the building system does not support it. - Since it's compiled out, there is a couple lines of less code running if the building system does not support it - It's not even possible to set this option in MigrationSetParams, which will return an error. Pros of current approach (2): - Define is local to file (I am not sure if it's ok to create a 'global' define for above approach, including <sys/socket.h> bits) - A build system that does not support MSG_ZEROCOPY can produce a binary that can use MSG_ZEROCOPY if the target system supports it. - There are no #ifdefs on qio_channel_socket_writev() (2) is already implemented in v11, but I have no issue implementing (1) for v12 if it's ok to create this 'global' define. > Let's wait for some other inputs. Agree. Having the pros of each approach clear, I would like some input on what is better for the project. Best regards, Leo
On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote: > For CONFIG_LINUX, implement the new zero copy flag and the optional callback > io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY > feature is available in the host kernel, which is checked on > qio_channel_socket_connect_sync() > > qio_channel_socket_flush() was implemented by counting how many times > sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the > socket's error queue, in order to find how many of them finished sending. > Flush will loop until those counters are the same, or until some error occurs. > > Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY: > 1: Buffer > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying, > some caution is necessary to avoid overwriting any buffer before it's sent. > If something like this happen, a newer version of the buffer may be sent instead. > - If this is a problem, it's recommended to call qio_channel_flush() before freeing > or re-using the buffer. > > 2: Locked memory > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and > unlocked after it's sent. > - Depending on the size of each buffer, and how often it's sent, it may require > a larger amount of locked memory than usually available to non-root user. > - If the required amount of locked memory is not available, writev_zero_copy > will return an error, which can abort an operation like migration, > - Because of this, when an user code wants to add zero copy as a feature, it > requires a mechanism to disable it, so it can still be accessible to less > privileged users. > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > Reviewed-by: Peter Xu <peterx@redhat.com> > Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> > Reviewed-by: Juan Quintela <quintela@redhat.com> > --- > include/io/channel-socket.h | 2 + > io/channel-socket.c | 120 ++++++++++++++++++++++++++++++++++-- > 2 files changed, 118 insertions(+), 4 deletions(-) > > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h > index e747e63514..513c428fe4 100644 > --- a/include/io/channel-socket.h > +++ b/include/io/channel-socket.h > @@ -47,6 +47,8 @@ struct QIOChannelSocket { > socklen_t localAddrLen; > struct sockaddr_storage remoteAddr; > socklen_t remoteAddrLen; > + ssize_t zero_copy_queued; > + ssize_t zero_copy_sent; > }; > > > diff --git a/io/channel-socket.c b/io/channel-socket.c > index 696a04dc9c..ae756ce166 100644 > --- a/io/channel-socket.c > +++ b/io/channel-socket.c > @@ -25,9 +25,25 @@ > #include "io/channel-watch.h" > #include "trace.h" > #include "qapi/clone-visitor.h" > +#ifdef CONFIG_LINUX > +#include <linux/errqueue.h> > +#include <sys/socket.h> > +#endif > > #define SOCKET_MAX_FDS 16 > > +/* > + * Zero-copy defines bellow are included to avoid breaking builds on systems > + * that don't support MSG_ZEROCOPY, while keeping the functions more readable > + * (without a lot of ifdefs). > + */ > +#ifndef MSG_ZEROCOPY > +#define MSG_ZEROCOPY 0x4000000 > +#endif > +#ifndef SO_ZEROCOPY > +#define SO_ZEROCOPY 60 > +#endif Please put these behind CONFIG_LINUX to make it clear to readers that this is entirely Linux specific > + > SocketAddress * > qio_channel_socket_get_local_address(QIOChannelSocket *ioc, > Error **errp) > @@ -54,6 +70,8 @@ qio_channel_socket_new(void) > > sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)); > sioc->fd = -1; > + sioc->zero_copy_queued = 0; > + sioc->zero_copy_sent = 0; > > ioc = QIO_CHANNEL(sioc); > qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); > @@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc, > return -1; > } > > +#ifdef CONFIG_LINUX > + int ret, v = 1; > + ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > + if (ret == 0) { > + /* Zero copy available on host */ > + qio_channel_set_feature(QIO_CHANNEL(ioc), > + QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY); > + } > +#endif > + > return 0; > } > > @@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)]; > size_t fdsize = sizeof(int) * nfds; > struct cmsghdr *cmsg; > + int sflags = 0; > > memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)); > > @@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > memcpy(CMSG_DATA(cmsg), fds, fdsize); > } > > + if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) { > + sflags = MSG_ZEROCOPY; > + } Also should be behind CONFIG_LINUX > + > retry: > - ret = sendmsg(sioc->fd, &msg, 0); > + ret = sendmsg(sioc->fd, &msg, sflags); > if (ret <= 0) { > - if (errno == EAGAIN) { > + switch (errno) { > + case EAGAIN: > return QIO_CHANNEL_ERR_BLOCK; > - } > - if (errno == EINTR) { > + case EINTR: > goto retry; > + case ENOBUFS: > + if (sflags & MSG_ZEROCOPY) { > + error_setg_errno(errp, errno, > + "Process can't lock enough memory for using MSG_ZEROCOPY"); > + return -1; > + } > + break; And this ENOBUGS case behind CONFIG_LINUX > } > + > error_setg_errno(errp, errno, > "Unable to write to socket"); > return -1; > @@ -659,6 +700,74 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > } > #endif /* WIN32 */ > > + > +#ifdef CONFIG_LINUX > +static int qio_channel_socket_flush(QIOChannel *ioc, > + Error **errp) > +{ > + QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); > + struct msghdr msg = {}; > + struct sock_extended_err *serr; > + struct cmsghdr *cm; > + char control[CMSG_SPACE(sizeof(*serr))]; > + int received; > + int ret = 1; > + > + msg.msg_control = control; > + msg.msg_controllen = sizeof(control); > + memset(control, 0, sizeof(control)); > + > + while (sioc->zero_copy_sent < sioc->zero_copy_queued) { > + received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE); > + if (received < 0) { > + switch (errno) { > + case EAGAIN: > + /* Nothing on errqueue, wait until something is available */ > + qio_channel_wait(ioc, G_IO_ERR); > + continue; > + case EINTR: > + continue; > + default: > + error_setg_errno(errp, errno, > + "Unable to read errqueue"); > + return -1; > + } > + } > + > + cm = CMSG_FIRSTHDR(&msg); > + if (cm->cmsg_level != SOL_IP && > + cm->cmsg_type != IP_RECVERR) { > + error_setg_errno(errp, EPROTOTYPE, > + "Wrong cmsg in errqueue"); > + return -1; > + } > + > + serr = (void *) CMSG_DATA(cm); > + if (serr->ee_errno != SO_EE_ORIGIN_NONE) { > + error_setg_errno(errp, serr->ee_errno, > + "Error on socket"); > + return -1; > + } > + if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) { > + error_setg_errno(errp, serr->ee_origin, > + "Error not from zero copy"); > + return -1; > + } > + > + /* No errors, count successfully finished sendmsg()*/ > + sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1; > + > + /* If any sendmsg() succeeded using zero copy, return 0 at the end */ > + if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) { > + ret = 0; > + } > + } > + > + return ret; > +} > + > +#endif /* CONFIG_LINUX */ > + > static int > qio_channel_socket_set_blocking(QIOChannel *ioc, > bool enabled, > @@ -789,6 +898,9 @@ static void qio_channel_socket_class_init(ObjectClass *klass, > ioc_klass->io_set_delay = qio_channel_socket_set_delay; > ioc_klass->io_create_watch = qio_channel_socket_create_watch; > ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler; > +#ifdef CONFIG_LINUX > + ioc_klass->io_flush = qio_channel_socket_flush; > +#endif > } > > static const TypeInfo qio_channel_socket_info = { > -- > 2.36.0 > With regards, Daniel
On Thu, May 05, 2022 at 01:20:04AM -0300, Leonardo Bras Soares Passos wrote: > (2) is already implemented in v11, but I have no issue implementing > (1) for v12 if it's ok to create this 'global' define. Dan's suggestion in the other thread sounds good to me with current approach, on having CONFIG_LINUX to wrap the defines. But note that it's still prone to change in the future, e.g., when other qemu modules start to use MSG_ZEROCOPY, we'll probably at least move the defines into some other headers. We could probably need to clean things up when it happens. But I won't strongly ask for that - we can leave that for later. Thanks,
On Thu, May 5, 2022 at 5:05 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote: > > For CONFIG_LINUX, implement the new zero copy flag and the optional callback > > io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY > > feature is available in the host kernel, which is checked on > > qio_channel_socket_connect_sync() > > > > qio_channel_socket_flush() was implemented by counting how many times > > sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the > > socket's error queue, in order to find how many of them finished sending. > > Flush will loop until those counters are the same, or until some error occurs. > > > > Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY: > > 1: Buffer > > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying, > > some caution is necessary to avoid overwriting any buffer before it's sent. > > If something like this happen, a newer version of the buffer may be sent instead. > > - If this is a problem, it's recommended to call qio_channel_flush() before freeing > > or re-using the buffer. > > > > 2: Locked memory > > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and > > unlocked after it's sent. > > - Depending on the size of each buffer, and how often it's sent, it may require > > a larger amount of locked memory than usually available to non-root user. > > - If the required amount of locked memory is not available, writev_zero_copy > > will return an error, which can abort an operation like migration, > > - Because of this, when an user code wants to add zero copy as a feature, it > > requires a mechanism to disable it, so it can still be accessible to less > > privileged users. > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > Reviewed-by: Peter Xu <peterx@redhat.com> > > Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> > > Reviewed-by: Juan Quintela <quintela@redhat.com> > > --- > > include/io/channel-socket.h | 2 + > > io/channel-socket.c | 120 ++++++++++++++++++++++++++++++++++-- > > 2 files changed, 118 insertions(+), 4 deletions(-) > > > > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h > > index e747e63514..513c428fe4 100644 > > --- a/include/io/channel-socket.h > > +++ b/include/io/channel-socket.h > > @@ -47,6 +47,8 @@ struct QIOChannelSocket { > > socklen_t localAddrLen; > > struct sockaddr_storage remoteAddr; > > socklen_t remoteAddrLen; > > + ssize_t zero_copy_queued; > > + ssize_t zero_copy_sent; > > }; > > > > > > diff --git a/io/channel-socket.c b/io/channel-socket.c > > index 696a04dc9c..ae756ce166 100644 > > --- a/io/channel-socket.c > > +++ b/io/channel-socket.c > > @@ -25,9 +25,25 @@ > > #include "io/channel-watch.h" > > #include "trace.h" > > #include "qapi/clone-visitor.h" > > +#ifdef CONFIG_LINUX > > +#include <linux/errqueue.h> > > +#include <sys/socket.h> > > +#endif > > > > #define SOCKET_MAX_FDS 16 > > > > +/* > > + * Zero-copy defines bellow are included to avoid breaking builds on systems > > + * that don't support MSG_ZEROCOPY, while keeping the functions more readable > > + * (without a lot of ifdefs). > > + */ > > +#ifndef MSG_ZEROCOPY > > +#define MSG_ZEROCOPY 0x4000000 > > +#endif > > +#ifndef SO_ZEROCOPY > > +#define SO_ZEROCOPY 60 > > +#endif > > Please put these behind CONFIG_LINUX to make it clear to readers that > this is entirely Linux specific > > > > + > > SocketAddress * > > qio_channel_socket_get_local_address(QIOChannelSocket *ioc, > > Error **errp) > > @@ -54,6 +70,8 @@ qio_channel_socket_new(void) > > > > sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)); > > sioc->fd = -1; > > + sioc->zero_copy_queued = 0; > > + sioc->zero_copy_sent = 0; > > > > ioc = QIO_CHANNEL(sioc); > > qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); > > @@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc, > > return -1; > > } > > > > +#ifdef CONFIG_LINUX > > + int ret, v = 1; > > + ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > > + if (ret == 0) { > > + /* Zero copy available on host */ > > + qio_channel_set_feature(QIO_CHANNEL(ioc), > > + QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY); > > + } > > +#endif > > + > > return 0; > > } > > > > @@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > > char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)]; > > size_t fdsize = sizeof(int) * nfds; > > struct cmsghdr *cmsg; > > + int sflags = 0; > > > > memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)); > > > > @@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, > > memcpy(CMSG_DATA(cmsg), fds, fdsize); > > } > > > > + if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) { > > + sflags = MSG_ZEROCOPY; > > + } > > Also should be behind CONFIG_LINUX Hello Daniel, But what if this gets compiled in a Linux system without MSG_ZEROCOPY support? As qapi will have zero-copy-send as an option we could have this scenario: - User request migration using zero-copy-send - multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY - In qio_channel_socket_connect_sync(): setsockopt() part will be compiled-out, so no error here - above part in qio_channel_socket_writev() will be commented-out, which means write_flags will be ignored - sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode - migration will succeed In the above case, the user has all the reason to think migration is using MSG_ZEROCOPY, but in fact it's quietly falling back to copy-mode. That's why I suggested creating a 'global' config usiing SO_ZEROCOPY & MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance of even offering zero-copy-send if we don't have it. Another local option is to do implement your suggestions, and also change qio_channel_socket_connect_sync() so it returns an error if MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as: +#ifdef CONFIG_LINUX +#if defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY) + int ret, v = 1; + ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); + if (ret == 0) { + /* Zero copy available on host */ + qio_channel_set_feature(QIO_CHANNEL(ioc), + QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY); + } +#else + error_setg_errno(errp, errno,"MSG_ZEROCOPY not available"); + return -1; +#endif +#endif What do you think? Best regards, Leo > > > + > > retry: > > - ret = sendmsg(sioc->fd, &msg, 0); > > + ret = sendmsg(sioc->fd, &msg, sflags); > > if (ret <= 0) { > > - if (errno == EAGAIN) { > > + switch (errno) { > > + case EAGAIN: > > return QIO_CHANNEL_ERR_BLOCK; > > - } > > - if (errno == EINTR) { > > + case EINTR: > > goto retry; > > + case ENOBUFS: > > + if (sflags & MSG_ZEROCOPY) { > > + error_setg_errno(errp, errno, > > + "Process can't lock enough memory for using MSG_ZEROCOPY"); > > + return -1; > > + } > > + break; > > And this ENOBUGS case behind CONFIG_LINUX > [...]
On Thu, May 05, 2022 at 12:42:47PM -0300, Leonardo Bras Soares Passos wrote: > > Hello Daniel, > > But what if this gets compiled in a Linux system without MSG_ZEROCOPY support? > As qapi will have zero-copy-send as an option we could have this scenario: > > - User request migration using zero-copy-send > - multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY > - In qio_channel_socket_connect_sync(): setsockopt() part will be > compiled-out, so no error here > - above part in qio_channel_socket_writev() will be commented-out, > which means write_flags will be ignored > - sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode > - migration will succeed > > In the above case, the user has all the reason to think migration is > using MSG_ZEROCOPY, but in fact it's quietly falling back to > copy-mode. I think we're ok because qio_channel_writev_full() does if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) && !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) { error_setg_errno(errp, EINVAL, "Requested Zero Copy feature is not available"); return -1; } and since there's no way for QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY to get set when MSG_ZEROCOPY is compiled out, we'll trigger the error condition. > That's why I suggested creating a 'global' config usiing SO_ZEROCOPY & > MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance > of even offering zero-copy-send if we don't have it. > > Another local option is to do implement your suggestions, and also > change qio_channel_socket_connect_sync() so it returns an error if > MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as: > > +#ifdef CONFIG_LINUX > +#if defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY) > + int ret, v = 1; > + ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > + if (ret == 0) { > + /* Zero copy available on host */ > + qio_channel_set_feature(QIO_CHANNEL(ioc), > + QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY); > + } > +#else > + error_setg_errno(errp, errno,"MSG_ZEROCOPY not available"); > + return -1; > +#endif > +#endif Do we actually need the ifdef CONFIG_LINUX bit at all ? Sufficient to just have the check for MSG_ZEROCOPY + SO_ZEROCOPY, which will fail on non-Linux anyway. With regards, Daniel
On Thu, May 5, 2022 at 12:55 PM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, May 05, 2022 at 12:42:47PM -0300, Leonardo Bras Soares Passos wrote: > > > > Hello Daniel, > > > > But what if this gets compiled in a Linux system without MSG_ZEROCOPY support? > > As qapi will have zero-copy-send as an option we could have this scenario: > > > > - User request migration using zero-copy-send > > - multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY > > - In qio_channel_socket_connect_sync(): setsockopt() part will be > > compiled-out, so no error here > > - above part in qio_channel_socket_writev() will be commented-out, > > which means write_flags will be ignored > > - sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode > > - migration will succeed > > > > In the above case, the user has all the reason to think migration is > > using MSG_ZEROCOPY, but in fact it's quietly falling back to > > copy-mode. > > I think we're ok because qio_channel_writev_full() does > > if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) && > !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) { > error_setg_errno(errp, EINVAL, > "Requested Zero Copy feature is not available"); > return -1; > } > > and since there's no way for QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY to > get set when MSG_ZEROCOPY is compiled out, we'll trigger the error > condition. Oh, that's right. It will fail in the first writev(), I was just considering failing during setup. > > > That's why I suggested creating a 'global' config usiing SO_ZEROCOPY & > > MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance > > of even offering zero-copy-send if we don't have it. > > > > Another local option is to do implement your suggestions, and also > > change qio_channel_socket_connect_sync() so it returns an error if > > MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as: > > > > +#ifdef CONFIG_LINUX > > +#if defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY) > > + int ret, v = 1; > > + ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); > > + if (ret == 0) { > > + /* Zero copy available on host */ > > + qio_channel_set_feature(QIO_CHANNEL(ioc), > > + QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY); > > + } > > +#else > > + error_setg_errno(errp, errno,"MSG_ZEROCOPY not available"); > > + return -1; > > +#endif > > +#endif > > Do we actually need the ifdef CONFIG_LINUX bit at all ? > > Sufficient to just have the check for MSG_ZEROCOPY + SO_ZEROCOPY, > which will fail on non-Linux anyway. By some include issue, or by future implementations we can have MSG_ZEROCOPY or SO_ZEROCOPY getting defined in OS other than Linux, which would introduce some headaches. Since you pointed out that migration will fail on writev, the above piece of code is not necessary. We could have a local define that equals to (MSG_ZEROCOPY && SO_ZEROCOPY && CONFIG_LINUX) so that we can make the code simpler where needed. I will work on a v12 and send it here. Best regards, Leo
diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h index e747e63514..513c428fe4 100644 --- a/include/io/channel-socket.h +++ b/include/io/channel-socket.h @@ -47,6 +47,8 @@ struct QIOChannelSocket { socklen_t localAddrLen; struct sockaddr_storage remoteAddr; socklen_t remoteAddrLen; + ssize_t zero_copy_queued; + ssize_t zero_copy_sent; }; diff --git a/io/channel-socket.c b/io/channel-socket.c index 696a04dc9c..ae756ce166 100644 --- a/io/channel-socket.c +++ b/io/channel-socket.c @@ -25,9 +25,25 @@ #include "io/channel-watch.h" #include "trace.h" #include "qapi/clone-visitor.h" +#ifdef CONFIG_LINUX +#include <linux/errqueue.h> +#include <sys/socket.h> +#endif #define SOCKET_MAX_FDS 16 +/* + * Zero-copy defines bellow are included to avoid breaking builds on systems + * that don't support MSG_ZEROCOPY, while keeping the functions more readable + * (without a lot of ifdefs). + */ +#ifndef MSG_ZEROCOPY +#define MSG_ZEROCOPY 0x4000000 +#endif +#ifndef SO_ZEROCOPY +#define SO_ZEROCOPY 60 +#endif + SocketAddress * qio_channel_socket_get_local_address(QIOChannelSocket *ioc, Error **errp) @@ -54,6 +70,8 @@ qio_channel_socket_new(void) sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)); sioc->fd = -1; + sioc->zero_copy_queued = 0; + sioc->zero_copy_sent = 0; ioc = QIO_CHANNEL(sioc); qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); @@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc, return -1; } +#ifdef CONFIG_LINUX + int ret, v = 1; + ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)); + if (ret == 0) { + /* Zero copy available on host */ + qio_channel_set_feature(QIO_CHANNEL(ioc), + QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY); + } +#endif + return 0; } @@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)]; size_t fdsize = sizeof(int) * nfds; struct cmsghdr *cmsg; + int sflags = 0; memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)); @@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, memcpy(CMSG_DATA(cmsg), fds, fdsize); } + if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) { + sflags = MSG_ZEROCOPY; + } + retry: - ret = sendmsg(sioc->fd, &msg, 0); + ret = sendmsg(sioc->fd, &msg, sflags); if (ret <= 0) { - if (errno == EAGAIN) { + switch (errno) { + case EAGAIN: return QIO_CHANNEL_ERR_BLOCK; - } - if (errno == EINTR) { + case EINTR: goto retry; + case ENOBUFS: + if (sflags & MSG_ZEROCOPY) { + error_setg_errno(errp, errno, + "Process can't lock enough memory for using MSG_ZEROCOPY"); + return -1; + } + break; } + error_setg_errno(errp, errno, "Unable to write to socket"); return -1; @@ -659,6 +700,74 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, } #endif /* WIN32 */ + +#ifdef CONFIG_LINUX +static int qio_channel_socket_flush(QIOChannel *ioc, + Error **errp) +{ + QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); + struct msghdr msg = {}; + struct sock_extended_err *serr; + struct cmsghdr *cm; + char control[CMSG_SPACE(sizeof(*serr))]; + int received; + int ret = 1; + + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + memset(control, 0, sizeof(control)); + + while (sioc->zero_copy_sent < sioc->zero_copy_queued) { + received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE); + if (received < 0) { + switch (errno) { + case EAGAIN: + /* Nothing on errqueue, wait until something is available */ + qio_channel_wait(ioc, G_IO_ERR); + continue; + case EINTR: + continue; + default: + error_setg_errno(errp, errno, + "Unable to read errqueue"); + return -1; + } + } + + cm = CMSG_FIRSTHDR(&msg); + if (cm->cmsg_level != SOL_IP && + cm->cmsg_type != IP_RECVERR) { + error_setg_errno(errp, EPROTOTYPE, + "Wrong cmsg in errqueue"); + return -1; + } + + serr = (void *) CMSG_DATA(cm); + if (serr->ee_errno != SO_EE_ORIGIN_NONE) { + error_setg_errno(errp, serr->ee_errno, + "Error on socket"); + return -1; + } + if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) { + error_setg_errno(errp, serr->ee_origin, + "Error not from zero copy"); + return -1; + } + + /* No errors, count successfully finished sendmsg()*/ + sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1; + + /* If any sendmsg() succeeded using zero copy, return 0 at the end */ + if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) { + ret = 0; + } + } + + return ret; +} + +#endif /* CONFIG_LINUX */ + static int qio_channel_socket_set_blocking(QIOChannel *ioc, bool enabled, @@ -789,6 +898,9 @@ static void qio_channel_socket_class_init(ObjectClass *klass, ioc_klass->io_set_delay = qio_channel_socket_set_delay; ioc_klass->io_create_watch = qio_channel_socket_create_watch; ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler; +#ifdef CONFIG_LINUX + ioc_klass->io_flush = qio_channel_socket_flush; +#endif } static const TypeInfo qio_channel_socket_info = {