Message ID | 20210831110238.299458-4-leobras@redhat.com (mailing list archive)
---|---
State | New, archived
Series | QIOChannel flags + multifd zerocopy
On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > Change the send_write() interface of multifd, allowing it to pass down > flags for qio_channel_write*(). > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > other data being sent at the default copying approach. > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > --- > migration/multifd-zlib.c | 7 ++++--- > migration/multifd-zstd.c | 7 ++++--- > migration/multifd.c | 9 ++++++--- > migration/multifd.h | 3 ++- > 4 files changed, 16 insertions(+), 10 deletions(-) > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > } > > if (used) { > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > + &local_err); I don't think it is valid to unconditionally enable this feature due to the resource usage implications https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html "A zerocopy failure will return -1 with errno ENOBUFS. This happens if the socket option was not set, the socket exceeds its optmem limit or the user exceeds its ulimit on locked pages." The limit on locked pages is something that looks very likely to be exceeded unless you happen to be running a QEMU config that already implies locked memory (eg PCI assignment) Regards, Daniel
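For context, a minimal sketch of the MSG_ZEROCOPY transmit path and the ENOBUFS failure mode Daniel quotes. It assumes a plain Linux TCP socket; the helper names are invented for illustration and this is not the patch's code.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60            /* <asm-generic/socket.h>, Linux >= 4.14 */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt the socket in to zerocopy transmission. */
static bool try_enable_zerocopy(int sockfd)
{
    int one = 1;
    return setsockopt(sockfd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
}

/* Send a page, preferring MSG_ZEROCOPY but falling back to a normal
 * copying send if the kernel refuses - e.g. when the optmem limit or the
 * locked-pages ulimit is exceeded, which surfaces as ENOBUFS. */
static ssize_t send_page(int sockfd, const void *buf, size_t len, bool zerocopy)
{
    if (zerocopy) {
        ssize_t ret = send(sockfd, buf, len, MSG_ZEROCOPY);
        if (ret >= 0) {
            return ret;   /* pages stay pinned until the completion arrives */
        }
        if (errno != ENOBUFS) {
            return -1;
        }
        /* ENOBUFS: fall through to the default copying path. */
    }
    return send(sockfd, buf, len, 0);
}
```

The branch that matters for the discussion above is the ENOBUFS one: with the default locked-pages ulimit, that fallback is likely to be the common case.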
On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote: > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > Change the send_write() interface of multifd, allowing it to pass down > > flags for qio_channel_write*(). > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > other data being sent at the default copying approach. > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > --- > > migration/multifd-zlib.c | 7 ++++--- > > migration/multifd-zstd.c | 7 ++++--- > > migration/multifd.c | 9 ++++++--- > > migration/multifd.h | 3 ++- > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > } > > > > if (used) { > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > + &local_err); > > I don't think it is valid to unconditionally enable this feature due to the > resource usage implications > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > if the socket option was not set, the socket exceeds its optmem > limit or the user exceeds its ulimit on locked pages." > > The limit on locked pages is something that looks very likely to be > exceeded unless you happen to be running a QEMU config that already > implies locked memory (eg PCI assignment) Yes it would be great to be a migration capability in parallel to multifd. At initial phase if it's easy to be implemented on multi-fd only, we can add a dependency between the caps. In the future we can remove that dependency when the code is ready to go without multifd. Thanks,
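A rough illustration of the capability dependency Peter suggests. The structure and names below are purely hypothetical, not existing QEMU code; the real implementation would live in the migration capability checks.

```c
#include <stdbool.h>

/* Hypothetical capability set: a new "zero-copy-send" capability that is
 * only accepted while multifd is also enabled, as suggested above. */
struct migration_caps {
    bool multifd;
    bool zero_copy_send;   /* invented name */
};

static bool caps_check(const struct migration_caps *caps, const char **errp)
{
    if (caps->zero_copy_send && !caps->multifd) {
        *errp = "zero-copy-send currently requires multifd";
        return false;
    }
    *errp = NULL;
    return true;
}
```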
On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote: > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote: > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > flags for qio_channel_write*(). > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > other data being sent at the default copying approach. > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > --- > > > migration/multifd-zlib.c | 7 ++++--- > > > migration/multifd-zstd.c | 7 ++++--- > > > migration/multifd.c | 9 ++++++--- > > > migration/multifd.h | 3 ++- > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > } > > > > > > if (used) { > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > + &local_err); > > > > I don't think it is valid to unconditionally enable this feature due to the > > resource usage implications > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > if the socket option was not set, the socket exceeds its optmem > > limit or the user exceeds its ulimit on locked pages." > > > > The limit on locked pages is something that looks very likely to be > > exceeded unless you happen to be running a QEMU config that already > > implies locked memory (eg PCI assignment) > > Yes it would be great to be a migration capability in parallel to multifd. At > initial phase if it's easy to be implemented on multi-fd only, we can add a > dependency between the caps. In the future we can remove that dependency when > the code is ready to go without multifd. Thanks, Also, I'm wondering how zerocopy support interacts with kernel support for kTLS and multipath-TCP, both of which we want to be able to use with migration. Regards, Daniel
On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote: > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote: > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote: > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > flags for qio_channel_write*(). > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > other data being sent at the default copying approach. > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > --- > > > > migration/multifd-zlib.c | 7 ++++--- > > > > migration/multifd-zstd.c | 7 ++++--- > > > > migration/multifd.c | 9 ++++++--- > > > > migration/multifd.h | 3 ++- > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > } > > > > > > > > if (used) { > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > + &local_err); > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > resource usage implications > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > if the socket option was not set, the socket exceeds its optmem > > > limit or the user exceeds its ulimit on locked pages." > > > > > > The limit on locked pages is something that looks very likely to be > > > exceeded unless you happen to be running a QEMU config that already > > > implies locked memory (eg PCI assignment) > > > > Yes it would be great to be a migration capability in parallel to multifd. At > > initial phase if it's easy to be implemented on multi-fd only, we can add a > > dependency between the caps. In the future we can remove that dependency when > > the code is ready to go without multifd. Thanks, > > Also, I'm wondering how zerocopy support interacts with kernel support > for kTLS and multipath-TCP, both of which we want to be able to use > with migration. Copying Jason Wang for net implications between these features on kernel side and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS). From the safe side we may want to only enable one of them until we prove they'll work together I guess.. Not a immediate concern as I don't really think any of them is really explicitly supported in qemu. KTLS may be implicitly included by a new gnutls, but we need to mark TLS and ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of gnutls won't has a way to maintain the tls buffers used by zerocopy. So at least we need some knob to detect whether kTLS is enabled in gnutls.
On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote: > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote: > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote: > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote: > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > > flags for qio_channel_write*(). > > > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > > other data being sent at the default copying approach. > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > --- > > > > > migration/multifd-zlib.c | 7 ++++--- > > > > > migration/multifd-zstd.c | 7 ++++--- > > > > > migration/multifd.c | 9 ++++++--- > > > > > migration/multifd.h | 3 ++- > > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > > } > > > > > > > > > > if (used) { > > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > > + &local_err); > > > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > resource usage implications > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > if the socket option was not set, the socket exceeds its optmem > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > exceeded unless you happen to be running a QEMU config that already > > > > implies locked memory (eg PCI assignment) > > > > > > Yes it would be great to be a migration capability in parallel to multifd. At > > > initial phase if it's easy to be implemented on multi-fd only, we can add a > > > dependency between the caps. In the future we can remove that dependency when > > > the code is ready to go without multifd. Thanks, > > > > Also, I'm wondering how zerocopy support interacts with kernel support > > for kTLS and multipath-TCP, both of which we want to be able to use > > with migration. > > Copying Jason Wang for net implications between these features on kernel side > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS). > > From the safe side we may want to only enable one of them until we prove > they'll work together I guess.. MPTCP is good when we're network limited for migration KTLS will be good when we're CPU limited on AES for migration, which is essentially always when TLS is used. ZEROCOPY will be good when we're CPU limited for data copy on migration, or to reduce the impact on other concurrent VMs on the same CPUs. Ultimately we woudld benefit from all of them at the same time, if it were technically possible todo. > Not a immediate concern as I don't really think any of them is really > explicitly supported in qemu. QEMU has mptcp support already: commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7 Author: Dr. David Alan Gilbert <dgilbert@redhat.com> Date: Wed Apr 21 12:28:34 2021 +0100 sockets: Support multipath TCP Multipath TCP allows combining multiple interfaces/routes into a single socket, with very little work for the user/admin. 
It's enabled by 'mptcp' on most socket addresses: ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of > gnutls won't has a way to maintain the tls buffers used by zerocopy. So at > least we need some knob to detect whether kTLS is enabled in gnutls. It isn't possible for gnutls to transparently enable KTLS, because GNUTLS doesn't get to see the actual socket directly - it'll need some work in QEMU to enable it. We know MPTCP and KTLS are currently mutually exclusive as they both use the same kernel network hooks framework. Regards, Daniel
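For context, the ',mptcp' socket option essentially asks the kernel for an MPTCP socket instead of a plain TCP one. A simplified sketch of what that means at the socket level follows; this is not QEMU's actual util/qemu-sockets.c code, and the plain-TCP fallback is an assumption about how older kernels might be handled.

```c
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262   /* from <linux/in.h> on kernels >= 5.6 */
#endif

/* Create a listening socket, preferring MPTCP and falling back to TCP. */
static int listen_mptcp(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
    if (fd < 0) {
        /* Kernel without MPTCP support: fall back to plain TCP. */
        fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (fd < 0) {
            return -1;
        }
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```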
On Wed, Sep 01, 2021 at 04:44:30PM +0100, Daniel P. Berrangé wrote: > QEMU has mptcp support already: > > commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7 > Author: Dr. David Alan Gilbert <dgilbert@redhat.com> > Date: Wed Apr 21 12:28:34 2021 +0100 > > sockets: Support multipath TCP > > Multipath TCP allows combining multiple interfaces/routes into a single > socket, with very little work for the user/admin. > > It's enabled by 'mptcp' on most socket addresses: > > ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp Oops, I totally forgot about that, sorry! > > > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and > > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of > > gnutls won't has a way to maintain the tls buffers used by zerocopy. So at > > least we need some knob to detect whether kTLS is enabled in gnutls. > > It isn't possible for gnutls to transparently enable KTLS, because > GNUTLS doesn't get to see the actual socket directly - it'll need > some work in QEMU to enable it. We know MPTCP and KTLS are currently > mutually exclusive as they both use the same kernel network hooks > framework. Then we may need to at least figure out whether zerocopy needs to mask out mptcp.
Hello Daniel, thanks for the feedback ! On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > Change the send_write() interface of multifd, allowing it to pass down > > flags for qio_channel_write*(). > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > other data being sent at the default copying approach. > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > --- > > migration/multifd-zlib.c | 7 ++++--- > > migration/multifd-zstd.c | 7 ++++--- > > migration/multifd.c | 9 ++++++--- > > migration/multifd.h | 3 ++- > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > } > > > > if (used) { > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > + &local_err); > > I don't think it is valid to unconditionally enable this feature due to the > resource usage implications > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > if the socket option was not set, the socket exceeds its optmem > limit or the user exceeds its ulimit on locked pages." You are correct, I unfortunately missed this part in the docs :( > The limit on locked pages is something that looks very likely to be > exceeded unless you happen to be running a QEMU config that already > implies locked memory (eg PCI assignment) Do you mean the limit an user has on locking memory? If so, that makes sense. I remember I needed to set the upper limit of locked memory for the user before using it, or adding a capability to qemu before. Maybe an option would be trying to mlock all guest memory before setting zerocopy=on in qemu code. If it fails, we can print an error message and fall back to not using zerocopy (following the idea of a new io_async_writev() I told you in the previous mail). > > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| > Best regards, Leonardo
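A sketch of the probe Leonardo floats above: try to lock guest RAM up front and disable zerocopy if that is not permitted. Purely illustrative, with invented names; as the rest of the thread discusses, whether locking is acceptable at all is really a host-level policy question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Probe whether the process is allowed to lock the whole guest RAM range.
 * On failure, report it and let the caller fall back to copying sends. */
static bool probe_lock_guest_ram(void *ram, size_t ram_size)
{
    if (mlock(ram, ram_size) != 0) {
        perror("mlock of guest RAM failed, disabling MSG_ZEROCOPY");
        return false;
    }
    /* This sketch only probes; a real implementation would have to decide
     * whether to keep the memory locked for the duration of migration. */
    munlock(ram, ram_size);
    return true;
}
```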
在 2021/9/1 下午11:35, Peter Xu 写道: > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote: >> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote: >>> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote: >>>> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: >>>>> Call qio_channel_set_zerocopy(true) in the start of every multifd thread. >>>>> >>>>> Change the send_write() interface of multifd, allowing it to pass down >>>>> flags for qio_channel_write*(). >>>>> >>>>> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the >>>>> other data being sent at the default copying approach. >>>>> >>>>> Signed-off-by: Leonardo Bras <leobras@redhat.com> >>>>> --- >>>>> migration/multifd-zlib.c | 7 ++++--- >>>>> migration/multifd-zstd.c | 7 ++++--- >>>>> migration/multifd.c | 9 ++++++--- >>>>> migration/multifd.h | 3 ++- >>>>> 4 files changed, 16 insertions(+), 10 deletions(-) >>>>> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) >>>>> } >>>>> >>>>> if (used) { >>>>> - ret = multifd_send_state->ops->send_write(p, used, &local_err); >>>>> + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, >>>>> + &local_err); >>>> I don't think it is valid to unconditionally enable this feature due to the >>>> resource usage implications >>>> >>>> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html >>>> >>>> "A zerocopy failure will return -1 with errno ENOBUFS. This happens >>>> if the socket option was not set, the socket exceeds its optmem >>>> limit or the user exceeds its ulimit on locked pages." >>>> >>>> The limit on locked pages is something that looks very likely to be >>>> exceeded unless you happen to be running a QEMU config that already >>>> implies locked memory (eg PCI assignment) >>> Yes it would be great to be a migration capability in parallel to multifd. At >>> initial phase if it's easy to be implemented on multi-fd only, we can add a >>> dependency between the caps. In the future we can remove that dependency when >>> the code is ready to go without multifd. Thanks, >> Also, I'm wondering how zerocopy support interacts with kernel support >> for kTLS and multipath-TCP, both of which we want to be able to use >> with migration. > Copying Jason Wang for net implications between these features on kernel side Note that the MSG_ZEROCOPY is contributed by Google :) > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS). I think they can. Anyway kernel can choose to do datacopy when necessary. Note that the "zerocopy" is probably not correct here. What's better is "Enable MSG_ZEROCOPY" since: 1) kernel supports various kinds of zerocopy, for TX, it has supported sendfile() for many years. 2) MSG_ZEROCOPY is only used for TX but not RX 3) TCP rx zerocopy is only supported via mmap() which requires some extra configurations e.g 4K MTU, driver support for header split etc. [1] https://www.youtube.com/watch?v=_ZfiQGWFvg0 Thanks > > From the safe side we may want to only enable one of them until we prove > they'll work together I guess.. > > Not a immediate concern as I don't really think any of them is really > explicitly supported in qemu. > > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of > gnutls won't has a way to maintain the tls buffers used by zerocopy. So at > least we need some knob to detect whether kTLS is enabled in gnutls. >
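One TX-side detail worth keeping in mind alongside Jason's points: a MSG_ZEROCOPY sender must reap completion notifications from the socket error queue before reusing (or freeing) the pages it sent, which is also related to Peter's concern about TLS buffer lifetimes. A minimal sketch based on the kernel's msg_zerocopy documentation, with an invented helper name:

```c
#include <errno.h>
#include <string.h>
#include <linux/errqueue.h>
#include <sys/socket.h>

/* Drain pending MSG_ZEROCOPY completion notifications from the error queue.
 * Returns 0 if the queue was drained (or empty), -1 on a real error. */
static int reap_zerocopy_completions(int sockfd)
{
    char control[256];
    struct msghdr msg = {
        .msg_control = control,
        .msg_controllen = sizeof(control),
    };

    if (recvmsg(sockfd, &msg, MSG_ERRQUEUE) < 0) {
        return errno == EAGAIN ? 0 : -1;   /* nothing pending yet */
    }

    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
         cm = CMSG_NXTHDR(&msg, cm)) {
        struct sock_extended_err serr;
        memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
        if (serr.ee_errno == 0 && serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            /* serr.ee_info .. serr.ee_data is the range of completed sends;
             * SO_EE_CODE_ZEROCOPY_COPIED in ee_code means the kernel fell
             * back to copying for that range. Buffers in the range may now
             * be reused. */
        }
    }
    return 0;
}
```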
Hello Peter, thank you for this feedback! On Tue, Aug 31, 2021 at 5:29 PM Peter Xu <peterx@redhat.com> wrote: > Yes it would be great to be a migration capability in parallel to multifd. At > initial phase if it's easy to be implemented on multi-fd only, we can add a > dependency between the caps. In the future we can remove that dependency when > the code is ready to go without multifd. Thanks, I thought the idea was to use MSG_ZEROCOPY whenever possible, and otherwise fall back to the default copying mechanism. In my replies to patch 2/3 I mentioned a new method for the QIOChannel interface that would fall back to copying, which may be a way to avoid having to use a capability. Best regards, Leonardo
A few more comments on this one: On Wed, Sep 1, 2021 at 12:44 PM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > From the safe side we may want to only enable one of them until we prove > > they'll work together I guess.. > > MPTCP is good when we're network limited for migration > > KTLS will be good when we're CPU limited on AES for migration, > which is essentially always when TLS is used. > > ZEROCOPY will be good when we're CPU limited for data copy > on migration, or to reduce the impact on other concurrent > VMs on the same CPUs. > > Ultimately we woudld benefit from all of them at the same > time, if it were technically possible todo. Maybe I misunderstood something, but IIRC KTLS kind of does some 'zerocopy'. I mean, on an usual setup, we would have a buffer A that would contain the encrypted data, and a buffer B that would receive the encrypted data, and a buffer C, in kernel, that would receive a copy of buffer B. On KTLS, the buffer A would be passed to kernel and be encrypted directly in buffer C, avoiding one copy. Oh, it's possible Buffer A would be copied to buffer B', which is inside the kernel, and then encrypted to buffer C, but I am not sure if this would make sense. Anyway, other than B' case, it would make no sense having MSG_ZEROCOPY in QIOChannel_TLS, so that's something that I need to change in this patchset. > > > Not a immediate concern as I don't really think any of them is really > > explicitly supported in qemu. > > QEMU has mptcp support already: > > commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7 > Author: Dr. David Alan Gilbert <dgilbert@redhat.com> > Date: Wed Apr 21 12:28:34 2021 +0100 > > sockets: Support multipath TCP > > Multipath TCP allows combining multiple interfaces/routes into a single > socket, with very little work for the user/admin. > > It's enabled by 'mptcp' on most socket addresses: > > ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp > > > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and > > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of > > gnutls won't has a way to maintain the tls buffers used by zerocopy. So at > > least we need some knob to detect whether kTLS is enabled in gnutls. > > It isn't possible for gnutls to transparently enable KTLS, because > GNUTLS doesn't get to see the actual socket directly Yes, IIRC it uses gnuTLS with callbacks instead. > - it'll need > some work in QEMU to enable it. Maybe it's overkill, but what if we get to implement using KTLS directly in QEMU, and fall back to gnuTLS when it's not available?
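To make the "some work in QEMU to enable it" concrete: enabling kernel TLS on an already-handshaken TCP socket looks roughly like the sketch below. The crypto_info contents must be extracted from the userspace TLS library after the handshake, so treat this as an illustration of the kernel interface only, not a workable gnutls integration.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <linux/tls.h>

#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* Hand the TX keys negotiated by a userspace TLS handshake to the kernel,
 * so that TLS record encryption happens in kernel space (kTLS). */
static int enable_ktls_tx(int sockfd,
                          const struct tls12_crypto_info_aes_gcm_128 *ci)
{
    /* Attach the kernel TLS upper-layer protocol to the TCP socket. */
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0) {
        return -1;   /* no kTLS in this kernel, or the ULP slot is taken */
    }
    /* Install the key/IV/salt/record-sequence taken from the handshake. */
    if (setsockopt(sockfd, SOL_TLS, TLS_TX, ci, sizeof(*ci)) < 0) {
        return -1;
    }
    return 0;
}
```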
Thanks for contributing Jason! On Thu, Sep 2, 2021 at 4:23 AM Jason Wang <jasowang@redhat.com> wrote: > > > 在 2021/9/1 下午11:35, Peter Xu 写道: > > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote: > >> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote: > >>> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote: > >>>> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > >>>>> Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > >>>>> > >>>>> Change the send_write() interface of multifd, allowing it to pass down > >>>>> flags for qio_channel_write*(). > >>>>> > >>>>> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > >>>>> other data being sent at the default copying approach. > >>>>> > >>>>> Signed-off-by: Leonardo Bras <leobras@redhat.com> > >>>>> --- > >>>>> migration/multifd-zlib.c | 7 ++++--- > >>>>> migration/multifd-zstd.c | 7 ++++--- > >>>>> migration/multifd.c | 9 ++++++--- > >>>>> migration/multifd.h | 3 ++- > >>>>> 4 files changed, 16 insertions(+), 10 deletions(-) > >>>>> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > >>>>> } > >>>>> > >>>>> if (used) { > >>>>> - ret = multifd_send_state->ops->send_write(p, used, &local_err); > >>>>> + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > >>>>> + &local_err); > >>>> I don't think it is valid to unconditionally enable this feature due to the > >>>> resource usage implications > >>>> > >>>> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > >>>> > >>>> "A zerocopy failure will return -1 with errno ENOBUFS. This happens > >>>> if the socket option was not set, the socket exceeds its optmem > >>>> limit or the user exceeds its ulimit on locked pages." > >>>> > >>>> The limit on locked pages is something that looks very likely to be > >>>> exceeded unless you happen to be running a QEMU config that already > >>>> implies locked memory (eg PCI assignment) > >>> Yes it would be great to be a migration capability in parallel to multifd. At > >>> initial phase if it's easy to be implemented on multi-fd only, we can add a > >>> dependency between the caps. In the future we can remove that dependency when > >>> the code is ready to go without multifd. Thanks, > >> Also, I'm wondering how zerocopy support interacts with kernel support > >> for kTLS and multipath-TCP, both of which we want to be able to use > >> with migration. > > Copying Jason Wang for net implications between these features on kernel side > > > Note that the MSG_ZEROCOPY is contributed by Google :) > > > > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS). > > > I think they can. Anyway kernel can choose to do datacopy when necessary. > > Note that the "zerocopy" is probably not correct here. What's better is > "Enable MSG_ZEROCOPY" since: > > 1) kernel supports various kinds of zerocopy, for TX, it has supported > sendfile() for many years. > 2) MSG_ZEROCOPY is only used for TX but not RX > 3) TCP rx zerocopy is only supported via mmap() which requires some > extra configurations e.g 4K MTU, driver support for header split etc. RX would be my next challenge :) > > [1] https://www.youtube.com/watch?v=_ZfiQGWFvg0 Thank you for sharing! Best regards, Leonardo
On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > Hello Daniel, thanks for the feedback ! > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > flags for qio_channel_write*(). > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > other data being sent at the default copying approach. > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > --- > > > migration/multifd-zlib.c | 7 ++++--- > > > migration/multifd-zstd.c | 7 ++++--- > > > migration/multifd.c | 9 ++++++--- > > > migration/multifd.h | 3 ++- > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > } > > > > > > if (used) { > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > + &local_err); > > > > I don't think it is valid to unconditionally enable this feature due to the > > resource usage implications > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > if the socket option was not set, the socket exceeds its optmem > > limit or the user exceeds its ulimit on locked pages." > > You are correct, I unfortunately missed this part in the docs :( > > > The limit on locked pages is something that looks very likely to be > > exceeded unless you happen to be running a QEMU config that already > > implies locked memory (eg PCI assignment) > > Do you mean the limit an user has on locking memory? Yes, by default limit QEMU sees will be something very small. > If so, that makes sense. I remember I needed to set the upper limit of locked > memory for the user before using it, or adding a capability to qemu before. > > Maybe an option would be trying to mlock all guest memory before setting > zerocopy=on in qemu code. If it fails, we can print an error message and fall > back to not using zerocopy (following the idea of a new io_async_writev() > I told you in the previous mail). Currently ability to lock memory is something that has to be configured when QEMU starts, and it requires libvirt to grant suitable permissions to QEMU. Memory locking is generally undesirable because it prevents memory overcommit. Or rather if you are allowing memory overcommit, then allowing memory locking is a way to kill your entire host. I don't think we can unconditionally grant ability to lock arbitrary guest RAM at startup, just to satisfy a possible desire to use zerocopy migration later. Granting it at runtime feels questionable as you now need to track and predict how much locked memory you've allowed, and also have possible problems with revokation. Possibly you could unconditionally grant ability to lock a small amount of guest RAM at startup, but how small can it be, while still making a useful difference to migration. It would imply we also need to be very careful with migration to avoid having too large an amount of outstanding zerocopy requests to exceed the limit. IOW, the only clear place in which we can use zerocopy, is where we are already forced to accept the penalty of locked memory at startup. 
eg when the guest is using huge pages and no overcommit, or possibly when the guest is using PCI device assignment, though in the latter case I can't remember whether we allow the entirety of guest RAM to be locked or not. Overall, the memory locking needs look like a significant constraint that will affect the ability to use this feature. Regards, Daniel
On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > Hello Daniel, thanks for the feedback ! > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > flags for qio_channel_write*(). > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > other data being sent at the default copying approach. > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > --- > > > > migration/multifd-zlib.c | 7 ++++--- > > > > migration/multifd-zstd.c | 7 ++++--- > > > > migration/multifd.c | 9 ++++++--- > > > > migration/multifd.h | 3 ++- > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > } > > > > > > > > if (used) { > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > + &local_err); > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > resource usage implications > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > if the socket option was not set, the socket exceeds its optmem > > > limit or the user exceeds its ulimit on locked pages." > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > The limit on locked pages is something that looks very likely to be > > > exceeded unless you happen to be running a QEMU config that already > > > implies locked memory (eg PCI assignment) > > > > Do you mean the limit an user has on locking memory? > > Yes, by default limit QEMU sees will be something very small. > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > memory for the user before using it, or adding a capability to qemu before. > > > > Maybe an option would be trying to mlock all guest memory before setting > > zerocopy=on in qemu code. If it fails, we can print an error message and fall > > back to not using zerocopy (following the idea of a new io_async_writev() > > I told you in the previous mail). > > Currently ability to lock memory is something that has to be configured > when QEMU starts, and it requires libvirt to grant suitable permissions > to QEMU. Memory locking is generally undesirable because it prevents > memory overcommit. Or rather if you are allowing memory overcommit, then > allowing memory locking is a way to kill your entire host. You mean it's gonna consume too much memory, or something else? > > I don't think we can unconditionally grant ability to lock arbitrary > guest RAM at startup, just to satisfy a possible desire to use zerocopy > migration later. Granting it at runtime feels questionable as you now > need to track and predict how much locked memory you've allowed, and > also have possible problems with revokation. (I am really new to this, so please forgive me if I am asking dumb or overly basic questions) What does revokation means in this context? You give the process hability to lock n MB of memory, and then you take it? 
Why would that happen? Is locked memory a limited resource? > > Possibly you could unconditionally grant ability to lock a small amount > of guest RAM at startup, but how small can it be, while still making a > useful difference to migration. It would imply we also need to be very > careful with migration to avoid having too large an amount of outstanding > zerocopy requests to exceed the limit. Yeah, having to decide on a value that would be ok to lock is very complex, given we can migrate with multifd, which can make this value grow a lot. That is, unless we only allow a few of those fds to actually use zerocopy. > > IOW, the only clear place in which we can use zerocopy, is where we are > already forced to accept the penalty of locked memory at startup. eg when > the guest is using huge pages and no overcommit, or possibly when the guest > is using PCI device assignment, It would be something already, given that those scenarios are the ones with the largest number of pages to migrate. But I understand we could give it a try in other scenarios. > though in the latter I can't remember if > we allow entire of guest RAM to be locked or not. If I recall correctly from a previous discussion, this was the case at least for PCI passthrough. > > Overall the memory locking needs look like a significant constraint that > will affect ability to use this feature. > I agree, there is a lot to take into account. In any case, those constraints could be checked in the same function as the setsockopt(), right? (Setting up configs to improve the chance of zerocopy would probably only happen at/before qemu starting, right?) Best regards, Leonardo
On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote: > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > Hello Daniel, thanks for the feedback ! > > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > > flags for qio_channel_write*(). > > > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > > other data being sent at the default copying approach. > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > --- > > > > > migration/multifd-zlib.c | 7 ++++--- > > > > > migration/multifd-zstd.c | 7 ++++--- > > > > > migration/multifd.c | 9 ++++++--- > > > > > migration/multifd.h | 3 ++- > > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > > } > > > > > > > > > > if (used) { > > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > > + &local_err); > > > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > resource usage implications > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > if the socket option was not set, the socket exceeds its optmem > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > The limit on locked pages is something that looks very likely to be > > > > exceeded unless you happen to be running a QEMU config that already > > > > implies locked memory (eg PCI assignment) > > > > > > Do you mean the limit an user has on locking memory? > > > > Yes, by default limit QEMU sees will be something very small. > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > memory for the user before using it, or adding a capability to qemu before. > > > > > > Maybe an option would be trying to mlock all guest memory before setting > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall > > > back to not using zerocopy (following the idea of a new io_async_writev() > > > I told you in the previous mail). > > > > Currently ability to lock memory is something that has to be configured > > when QEMU starts, and it requires libvirt to grant suitable permissions > > to QEMU. Memory locking is generally undesirable because it prevents > > memory overcommit. Or rather if you are allowing memory overcommit, then > > allowing memory locking is a way to kill your entire host. > > You mean it's gonna consume too much memory, or something else? Essentially yes. > > I don't think we can unconditionally grant ability to lock arbitrary > > guest RAM at startup, just to satisfy a possible desire to use zerocopy > > migration later. 
Granting it at runtime feels questionable as you now > > need to track and predict how much locked memory you've allowed, and > > also have possible problems with revokation. > > (I am really new to this, so please forgive me if I am asking dumb or > overly basic questions) > > What does revokation means in this context? > You give the process hability to lock n MB of memory, and then you take it? > Why would that happen? Is Locked memory a limited resource? Consider a VM host with 64 GB of RAM and 64 GB of swap, and an overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM total. So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage which exceeds physical RAM. Most of the time this may well be fine as the VMs don't concurrently need their full RAM allocation, and worst case they'll get pushed to swap as the kernel re-shares memory in respose to load. So perhaps each VM only needs 20 GB resident at any time, but over time one VM can burst upto 32 GB and then 12 GB of it get swapped out later when inactive. But now consider if we allowed 2 of the VMs to lock memory for purposes of migration. Those 2 VMs can now pin 64 GB of memory in the worst case, leaving no free memory for the 3rd VM or for the OS. This will likely take down the entire host, regardless of swap availability. IOW, if you are overcomitting RAM you have to be extremely careful about allowing any VM to lock memory. If you do decide to allow memory locking, you need to make sure that the worst case locked memory amount still leaves enough unlocked memory for the OS to be able to effectively manage the overcommit load via swap. We definitely can't grant memory locking to VMs at startup in this scenario, and if we grant it at runtime, we need to be able to revoke it again later. These overcommit numbers are a bit more extreme that you'd usually do in real world, but it illustrates the genreal problem. Also bear in mind that QEMU has memory overhead beyond the guest RAM block, which varies over time, making accounting quite hard. We have to also assume that QEMU could have been compromised by a guest breakout, so we can't assume that migration will play nice - we have to assume the worst case possible, given the process ulimits. > > Overall the memory locking needs look like a significant constraint that > > will affect ability to use this feature. > > > > I Agree, there is a lot to take in account. > In any way, those constraints could be checked at the same function as > the setsockopt() right? QEMU could possibly check its ulimits to see if it is possible, but it would be safer to require a migration capability to be explicitly set by the mgmt app to opt-in to zerocopy. > (Setting up configs to improve the chance of zerocopy would probably only > happen at/before qemu starting, right?) Usually, but you can change ulimits on the fly for a process. I'm just not sure of semantics if you reduce limits and existing usage exceeds the reduced value. Regards, Daniel
On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote: > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > Hello Daniel, thanks for the feedback ! > > > > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > > > flags for qio_channel_write*(). > > > > > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > > > other data being sent at the default copying approach. > > > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > --- > > > > > > migration/multifd-zlib.c | 7 ++++--- > > > > > > migration/multifd-zstd.c | 7 ++++--- > > > > > > migration/multifd.c | 9 ++++++--- > > > > > > migration/multifd.h | 3 ++- > > > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > > > } > > > > > > > > > > > > if (used) { > > > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > > > + &local_err); > > > > > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > resource usage implications > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > implies locked memory (eg PCI assignment) > > > > > > > > Do you mean the limit an user has on locking memory? > > > > > > Yes, by default limit QEMU sees will be something very small. > > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > > memory for the user before using it, or adding a capability to qemu before. > > > > > > > > Maybe an option would be trying to mlock all guest memory before setting > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall > > > > back to not using zerocopy (following the idea of a new io_async_writev() > > > > I told you in the previous mail). > > > > > > Currently ability to lock memory is something that has to be configured > > > when QEMU starts, and it requires libvirt to grant suitable permissions > > > to QEMU. Memory locking is generally undesirable because it prevents > > > memory overcommit. Or rather if you are allowing memory overcommit, then > > > allowing memory locking is a way to kill your entire host. > > > > You mean it's gonna consume too much memory, or something else? > > Essentially yes. Well, maybe we can check for available memory before doing that, but maybe it's too much effort. 
> > > > I don't think we can unconditionally grant ability to lock arbitrary > > > guest RAM at startup, just to satisfy a possible desire to use zerocopy > > > migration later. Granting it at runtime feels questionable as you now > > > need to track and predict how much locked memory you've allowed, and > > > also have possible problems with revokation. > > > > (I am really new to this, so please forgive me if I am asking dumb or > > overly basic questions) > > > > What does revokation means in this context? > > You give the process hability to lock n MB of memory, and then you take it? > > Why would that happen? Is Locked memory a limited resource? > > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM > total. > > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage > which exceeds physical RAM. Most of the time this may well be fine > as the VMs don't concurrently need their full RAM allocation, and > worst case they'll get pushed to swap as the kernel re-shares > memory in respose to load. So perhaps each VM only needs 20 GB > resident at any time, but over time one VM can burst upto 32 GB > and then 12 GB of it get swapped out later when inactive. > > But now consider if we allowed 2 of the VMs to lock memory for > purposes of migration. Those 2 VMs can now pin 64 GB of memory > in the worst case, leaving no free memory for the 3rd VM or > for the OS. This will likely take down the entire host, regardless > of swap availability. > > IOW, if you are overcomitting RAM you have to be extremely > careful about allowing any VM to lock memory. If you do decide > to allow memory locking, you need to make sure that the worst > case locked memory amount still leaves enough unlocked memory > for the OS to be able to effectively manage the overcommit > load via swap. We definitely can't grant memory locking to > VMs at startup in this scenario, and if we grant it at runtime, > we need to be able to revoke it again later. > > These overcommit numbers are a bit more extreme that you'd > usually do in real world, but it illustrates the genreal > problem. Also bear in mind that QEMU has memory overhead > beyond the guest RAM block, which varies over time, making > accounting quite hard. We have to also assume that QEMU > could have been compromised by a guest breakout, so we > can't assume that migration will play nice - we have to > assume the worst case possible, given the process ulimits. > Yeah, that makes sense. Thanks for this illustration and elucidation ! I assume there is no way of asking the OS to lock memory, and if there is no space available, it fails and rolls back the locking. If there was, the VM2 would fail the locking and we could achieve MSG_ZEROCOPY at least in VM1. But thinking in a better way, if both could lock the required memory, and little was available to VM3, it could experience major slowdown, as would the host OS. Yeah, that would be complex. > > > > Overall the memory locking needs look like a significant constraint that > > > will affect ability to use this feature. > > > > > > > I Agree, there is a lot to take in account. > > In any way, those constraints could be checked at the same function as > > the setsockopt() right? > > QEMU could possibly check its ulimits to see if it is possible, but it > would be safer to require a migration capability to be explicitly set > by the mgmt app to opt-in to zerocopy. Yeah, that would make sense. 
Let the mgmt app check system resources and choose where to request zerocopy. It also has the information needed to know whether that process is allowed to lock enough memory. > > > (Setting up configs to improve the chance of zerocopy would probably only > > happen at/before qemu starting, right?) > > Usually, but you can change ulimits on the fly for a process. I'm just > not sure of semantics if you reduce limits and existing usage exceeds > the reduced value. Yeah, that's a good question. > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| > Best regards, Leonardo Bras
On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote: > On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote: > > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > > Hello Daniel, thanks for the feedback ! > > > > > > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > > > > flags for qio_channel_write*(). > > > > > > > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > > > > other data being sent at the default copying approach. > > > > > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > > --- > > > > > > > migration/multifd-zlib.c | 7 ++++--- > > > > > > > migration/multifd-zstd.c | 7 ++++--- > > > > > > > migration/multifd.c | 9 ++++++--- > > > > > > > migration/multifd.h | 3 ++- > > > > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > > > > } > > > > > > > > > > > > > > if (used) { > > > > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > > > > + &local_err); > > > > > > > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > > resource usage implications > > > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > > implies locked memory (eg PCI assignment) > > > > > > > > > > Do you mean the limit an user has on locking memory? > > > > > > > > Yes, by default limit QEMU sees will be something very small. > > > > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > > > memory for the user before using it, or adding a capability to qemu before. > > > > > > > > > > Maybe an option would be trying to mlock all guest memory before setting > > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall > > > > > back to not using zerocopy (following the idea of a new io_async_writev() > > > > > I told you in the previous mail). > > > > > > > > Currently ability to lock memory is something that has to be configured > > > > when QEMU starts, and it requires libvirt to grant suitable permissions > > > > to QEMU. Memory locking is generally undesirable because it prevents > > > > memory overcommit. 
Or rather if you are allowing memory overcommit, then > > > > allowing memory locking is a way to kill your entire host. > > > > > > You mean it's gonna consume too much memory, or something else? > > > > Essentially yes. > > Well, maybe we can check for available memory before doing that, > but maybe it's too much effort. From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt app needs to enforce policy based on the worst case that the VM configuration allows for. Checking current available memory is not viable, because even if you see 10 GB free, QEMU can't know if that free memory is there to satisfy other VMs's worst case needs, or its own. QEMU needs to be explicitly told whether its OK to use locked memory, and an enforcement policy applied to it. > > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an > > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM > > total. > > > > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage > > which exceeds physical RAM. Most of the time this may well be fine > > as the VMs don't concurrently need their full RAM allocation, and > > worst case they'll get pushed to swap as the kernel re-shares > > memory in respose to load. So perhaps each VM only needs 20 GB > > resident at any time, but over time one VM can burst upto 32 GB > > and then 12 GB of it get swapped out later when inactive. > > > > But now consider if we allowed 2 of the VMs to lock memory for > > purposes of migration. Those 2 VMs can now pin 64 GB of memory > > in the worst case, leaving no free memory for the 3rd VM or > > for the OS. This will likely take down the entire host, regardless > > of swap availability. > > > > IOW, if you are overcomitting RAM you have to be extremely > > careful about allowing any VM to lock memory. If you do decide > > to allow memory locking, you need to make sure that the worst > > case locked memory amount still leaves enough unlocked memory > > for the OS to be able to effectively manage the overcommit > > load via swap. We definitely can't grant memory locking to > > VMs at startup in this scenario, and if we grant it at runtime, > > we need to be able to revoke it again later. > > > > These overcommit numbers are a bit more extreme that you'd > > usually do in real world, but it illustrates the genreal > > problem. Also bear in mind that QEMU has memory overhead > > beyond the guest RAM block, which varies over time, making > > accounting quite hard. We have to also assume that QEMU > > could have been compromised by a guest breakout, so we > > can't assume that migration will play nice - we have to > > assume the worst case possible, given the process ulimits. > > > > Yeah, that makes sense. Thanks for this illustration and elucidation ! > > I assume there is no way of asking the OS to lock memory, and if there is > no space available, it fails and rolls back the locking. Yes & no. On most Linux configs though it ends up being no, because instead of getting a nice failure, when host memory pressure occurs, the OOM Killer wakes up and just culls processes. Regards, Daniel
On Thu, Sep 2, 2021 at 6:59 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote: > > On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote: > > > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > > > Hello Daniel, thanks for the feedback ! > > > > > > > > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > > > > > flags for qio_channel_write*(). > > > > > > > > > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > > > > > other data being sent at the default copying approach. > > > > > > > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > > > --- > > > > > > > > migration/multifd-zlib.c | 7 ++++--- > > > > > > > > migration/multifd-zstd.c | 7 ++++--- > > > > > > > > migration/multifd.c | 9 ++++++--- > > > > > > > > migration/multifd.h | 3 ++- > > > > > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > > > > > } > > > > > > > > > > > > > > > > if (used) { > > > > > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > > > > > + &local_err); > > > > > > > > > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > > > resource usage implications > > > > > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > > > implies locked memory (eg PCI assignment) > > > > > > > > > > > > Do you mean the limit an user has on locking memory? > > > > > > > > > > Yes, by default limit QEMU sees will be something very small. > > > > > > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > > > > memory for the user before using it, or adding a capability to qemu before. > > > > > > > > > > > > Maybe an option would be trying to mlock all guest memory before setting > > > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall > > > > > > back to not using zerocopy (following the idea of a new io_async_writev() > > > > > > I told you in the previous mail). 
> > > > > > > > > > Currently ability to lock memory is something that has to be configured > > > > > when QEMU starts, and it requires libvirt to grant suitable permissions > > > > > to QEMU. Memory locking is generally undesirable because it prevents > > > > > memory overcommit. Or rather if you are allowing memory overcommit, then > > > > > allowing memory locking is a way to kill your entire host. > > > > > > > > You mean it's gonna consume too much memory, or something else? > > > > > > Essentially yes. > > > > Well, maybe we can check for available memory before doing that, > > but maybe it's too much effort. > > From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt > app needs to enforce policy based on the worst case that the > VM configuration allows for. > > Checking current available memory is not viable, because even > if you see 10 GB free, QEMU can't know if that free memory is > there to satisfy other VMs's worst case needs, or its own. QEMU > needs to be explicitly told whether its OK to use locked memory, > and an enforcement policy applied to it. Yeah, it makes sense to let the mgmt app deal with that and enable/disable the MSG_ZEROCOPY on migration whenever it sees fit. > > > > > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an > > > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM > > > total. > > > > > > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage > > > which exceeds physical RAM. Most of the time this may well be fine > > > as the VMs don't concurrently need their full RAM allocation, and > > > worst case they'll get pushed to swap as the kernel re-shares > > > memory in respose to load. So perhaps each VM only needs 20 GB > > > resident at any time, but over time one VM can burst upto 32 GB > > > and then 12 GB of it get swapped out later when inactive. > > > > > > But now consider if we allowed 2 of the VMs to lock memory for > > > purposes of migration. Those 2 VMs can now pin 64 GB of memory > > > in the worst case, leaving no free memory for the 3rd VM or > > > for the OS. This will likely take down the entire host, regardless > > > of swap availability. > > > > > > IOW, if you are overcomitting RAM you have to be extremely > > > careful about allowing any VM to lock memory. If you do decide > > > to allow memory locking, you need to make sure that the worst > > > case locked memory amount still leaves enough unlocked memory > > > for the OS to be able to effectively manage the overcommit > > > load via swap. We definitely can't grant memory locking to > > > VMs at startup in this scenario, and if we grant it at runtime, > > > we need to be able to revoke it again later. > > > > > > These overcommit numbers are a bit more extreme that you'd > > > usually do in real world, but it illustrates the genreal > > > problem. Also bear in mind that QEMU has memory overhead > > > beyond the guest RAM block, which varies over time, making > > > accounting quite hard. We have to also assume that QEMU > > > could have been compromised by a guest breakout, so we > > > can't assume that migration will play nice - we have to > > > assume the worst case possible, given the process ulimits. > > > > > > > Yeah, that makes sense. Thanks for this illustration and elucidation ! > > > > I assume there is no way of asking the OS to lock memory, and if there is > > no space available, it fails and rolls back the locking. > > Yes & no. 
On most Linux configs though it ends up being no, because > instead of getting a nice failure, when host memory pressure occurs, > the OOM Killer wakes up and just culls processes. oh, right the OOM Killer :) > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| > Thanks! Best regards, Leonardo
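The "try to lock, fall back if it fails" idea mentioned above might look roughly like the standalone sketch below. This is hypothetical, not code from the series: try_lock_region() is a made-up helper, the buffer stands in for guest RAM, and, as discussed above, a successful probe says nothing about whether locking is a safe policy on an overcommitted host.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical probe: try to lock a region standing in for guest RAM.
 * On failure (e.g. RLIMIT_MEMLOCK too low), the caller would fall back
 * to the normal copying send path instead of enabling zerocopy. */
static bool try_lock_region(void *buf, size_t len)
{
    if (mlock(buf, len) != 0) {
        fprintf(stderr, "mlock(%zu bytes): %s - not enabling zerocopy\n",
                len, strerror(errno));
        return false;
    }
    munlock(buf, len);  /* probe only: undo the lock again */
    return true;
}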
* Daniel P. Berrangé (berrange@redhat.com) wrote: > On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote: > > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote: > > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote: > > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote: > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > > > flags for qio_channel_write*(). > > > > > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > > > other data being sent at the default copying approach. > > > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > --- > > > > > > migration/multifd-zlib.c | 7 ++++--- > > > > > > migration/multifd-zstd.c | 7 ++++--- > > > > > > migration/multifd.c | 9 ++++++--- > > > > > > migration/multifd.h | 3 ++- > > > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > > > } > > > > > > > > > > > > if (used) { > > > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > > > + &local_err); > > > > > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > resource usage implications > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > implies locked memory (eg PCI assignment) > > > > > > > > Yes it would be great to be a migration capability in parallel to multifd. At > > > > initial phase if it's easy to be implemented on multi-fd only, we can add a > > > > dependency between the caps. In the future we can remove that dependency when > > > > the code is ready to go without multifd. Thanks, > > > > > > Also, I'm wondering how zerocopy support interacts with kernel support > > > for kTLS and multipath-TCP, both of which we want to be able to use > > > with migration. > > > > Copying Jason Wang for net implications between these features on kernel side > > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS). > > > > From the safe side we may want to only enable one of them until we prove > > they'll work together I guess.. > > MPTCP is good when we're network limited for migration > > KTLS will be good when we're CPU limited on AES for migration, > which is essentially always when TLS is used. > > ZEROCOPY will be good when we're CPU limited for data copy > on migration, or to reduce the impact on other concurrent > VMs on the same CPUs. > > Ultimately we woudld benefit from all of them at the same > time, if it were technically possible todo. I think last time I spoke to Paolo Abeni there were some interactions between them; I can't remember what though (I think mptcp and ktls didn't play at the time). 
Dave > > Not a immediate concern as I don't really think any of them is really > > explicitly supported in qemu. > > QEMU has mptcp support already: > > commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7 > Author: Dr. David Alan Gilbert <dgilbert@redhat.com> > Date: Wed Apr 21 12:28:34 2021 +0100 > > sockets: Support multipath TCP > > Multipath TCP allows combining multiple interfaces/routes into a single > socket, with very little work for the user/admin. > > It's enabled by 'mptcp' on most socket addresses: > > ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp > > > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and > > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of > > gnutls won't has a way to maintain the tls buffers used by zerocopy. So at > > least we need some knob to detect whether kTLS is enabled in gnutls. > > It isn't possible for gnutls to transparently enable KTLS, because > GNUTLS doesn't get to see the actual socket directly - it'll need > some work in QEMU to enable it. We know MPTCP and KTLS are currently > mutually exclusive as they both use the same kernel network hooks > framework. > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| >
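As far as I understand, the 'mptcp' socket-address option mentioned above ultimately amounts to opening the socket with IPPROTO_MPTCP instead of plain TCP. A minimal standalone sketch, independent of QEMU and assuming a kernel built with MPTCP support, is:

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262   /* older libc headers may lack this define */
#endif

int main(void)
{
    /* Fails (e.g. EPROTONOSUPPORT) on kernels built without MPTCP. */
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

    if (fd < 0) {
        perror("socket(IPPROTO_MPTCP)");
        return 1;
    }
    printf("MPTCP socket created, fd=%d\n", fd);
    return 0;
}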
* Daniel P. Berrangé (berrange@redhat.com) wrote: > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote: > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > Hello Daniel, thanks for the feedback ! > > > > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > > > flags for qio_channel_write*(). > > > > > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > > > other data being sent at the default copying approach. > > > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > --- > > > > > > migration/multifd-zlib.c | 7 ++++--- > > > > > > migration/multifd-zstd.c | 7 ++++--- > > > > > > migration/multifd.c | 9 ++++++--- > > > > > > migration/multifd.h | 3 ++- > > > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > > > } > > > > > > > > > > > > if (used) { > > > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > > > + &local_err); > > > > > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > resource usage implications > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > implies locked memory (eg PCI assignment) > > > > > > > > Do you mean the limit an user has on locking memory? > > > > > > Yes, by default limit QEMU sees will be something very small. > > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > > memory for the user before using it, or adding a capability to qemu before. > > > > > > > > Maybe an option would be trying to mlock all guest memory before setting > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall > > > > back to not using zerocopy (following the idea of a new io_async_writev() > > > > I told you in the previous mail). > > > > > > Currently ability to lock memory is something that has to be configured > > > when QEMU starts, and it requires libvirt to grant suitable permissions > > > to QEMU. Memory locking is generally undesirable because it prevents > > > memory overcommit. Or rather if you are allowing memory overcommit, then > > > allowing memory locking is a way to kill your entire host. > > > > You mean it's gonna consume too much memory, or something else? > > Essentially yes. 
> > > > I don't think we can unconditionally grant ability to lock arbitrary > > > guest RAM at startup, just to satisfy a possible desire to use zerocopy > > > migration later. Granting it at runtime feels questionable as you now > > > need to track and predict how much locked memory you've allowed, and > > > also have possible problems with revokation. > > > > (I am really new to this, so please forgive me if I am asking dumb or > > overly basic questions) > > > > What does revokation means in this context? > > You give the process hability to lock n MB of memory, and then you take it? > > Why would that happen? Is Locked memory a limited resource? > > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM > total. > > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage > which exceeds physical RAM. Most of the time this may well be fine > as the VMs don't concurrently need their full RAM allocation, and > worst case they'll get pushed to swap as the kernel re-shares > memory in respose to load. So perhaps each VM only needs 20 GB > resident at any time, but over time one VM can burst upto 32 GB > and then 12 GB of it get swapped out later when inactive. > > But now consider if we allowed 2 of the VMs to lock memory for > purposes of migration. Those 2 VMs can now pin 64 GB of memory > in the worst case, leaving no free memory for the 3rd VM or > for the OS. This will likely take down the entire host, regardless > of swap availability. > > IOW, if you are overcomitting RAM you have to be extremely > careful about allowing any VM to lock memory. If you do decide > to allow memory locking, you need to make sure that the worst > case locked memory amount still leaves enough unlocked memory > for the OS to be able to effectively manage the overcommit > load via swap. We definitely can't grant memory locking to > VMs at startup in this scenario, and if we grant it at runtime, > we need to be able to revoke it again later. > > These overcommit numbers are a bit more extreme that you'd > usually do in real world, but it illustrates the genreal > problem. Also bear in mind that QEMU has memory overhead > beyond the guest RAM block, which varies over time, making > accounting quite hard. We have to also assume that QEMU > could have been compromised by a guest breakout, so we > can't assume that migration will play nice - we have to > assume the worst case possible, given the process ulimits. We already have the same problem for RDMA; (Although it has some options for doing rolling locking in chunks and recently there's code for use with new cards that don't need locking). I think the thing here is to offer zerocopy as an option; then people can decide on the costs etc. Dave > > > > Overall the memory locking needs look like a significant constraint that > > > will affect ability to use this feature. > > > > > > > I Agree, there is a lot to take in account. > > In any way, those constraints could be checked at the same function as > > the setsockopt() right? > > QEMU could possibly check its ulimits to see if it is possible, but it > would be safer to require a migration capability to be explicitly set > by the mgmt app to opt-in to zerocopy. > > > (Setting up configs to improve the chance of zerocopy would probably only > > happen at/before qemu starting, right?) > > Usually, but you can change ulimits on the fly for a process. 
I'm just > not sure of semantics if you reduce limits and existing usage exceeds > the reduced value. > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| >
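On the "change ulimits on the fly" point: a privileged management process can adjust a running QEMU's locked-memory limit with prlimit(2). The sketch below is hypothetical tooling, not part of the series, and it leaves open exactly the question raised above about lowering the limit below current usage.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Hypothetical helper: set a running process's RLIMIT_MEMLOCK.
 * Typically needs CAP_SYS_RESOURCE to act on another process or to
 * raise the hard limit. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <bytes>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)strtol(argv[1], NULL, 0);
    rlim_t bytes = (rlim_t)strtoull(argv[2], NULL, 0);
    struct rlimit new_limit = { .rlim_cur = bytes, .rlim_max = bytes };

    if (prlimit(pid, RLIMIT_MEMLOCK, &new_limit, NULL) != 0) {
        perror("prlimit(RLIMIT_MEMLOCK)");
        return 1;
    }
    return 0;
}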
On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > I don't think it is valid to unconditionally enable this feature due to the > > resource usage implications > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > if the socket option was not set, the socket exceeds its optmem > > limit or the user exceeds its ulimit on locked pages." > > You are correct, I unfortunately missed this part in the docs :( > > > The limit on locked pages is something that looks very likely to be > > exceeded unless you happen to be running a QEMU config that already > > implies locked memory (eg PCI assignment) > > Do you mean the limit an user has on locking memory? > > If so, that makes sense. I remember I needed to set the upper limit of locked > memory for the user before using it, or adding a capability to qemu before. So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK. The thing is IIUC that's accounting for pinned pages only with either mlock() (FOLL_MLOCK) or vfio (FOLL_PIN). I don't really think MSG_ZEROCOPY is doing that at all... I'm looking at __zerocopy_sg_from_iter() -> iov_iter_get_pages(). Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, I think most of them do not need such accountings.
On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > I don't think it is valid to unconditionally enable this feature due to the > > > resource usage implications > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > if the socket option was not set, the socket exceeds its optmem > > > limit or the user exceeds its ulimit on locked pages." > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > The limit on locked pages is something that looks very likely to be > > > exceeded unless you happen to be running a QEMU config that already > > > implies locked memory (eg PCI assignment) > > > > Do you mean the limit an user has on locking memory? > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > memory for the user before using it, or adding a capability to qemu before. > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK. > > The thing is IIUC that's accounting for pinned pages only with either mlock() > (FOLL_MLOCK) or vfio (FOLL_PIN). > > I don't really think MSG_ZEROCOPY is doing that at all... I'm looking at > __zerocopy_sg_from_iter() -> iov_iter_get_pages(). It happens probably here: E.g __ip_append_data() msg_zerocopy_realloc() mm_account_pinned_pages() Thanks > > Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, I > think most of them do no need such accountings. > > -- > Peter Xu >
On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote: > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > resource usage implications > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > if the socket option was not set, the socket exceeds its optmem > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > The limit on locked pages is something that looks very likely to be > > > > exceeded unless you happen to be running a QEMU config that already > > > > implies locked memory (eg PCI assignment) > > > > > > Do you mean the limit an user has on locking memory? > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > memory for the user before using it, or adding a capability to qemu before. > > > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK. > > > > The thing is IIUC that's accounting for pinned pages only with either mlock() > > (FOLL_MLOCK) or vfio (FOLL_PIN). > > > > I don't really think MSG_ZEROCOPY is doing that at all... I'm looking at > > __zerocopy_sg_from_iter() -> iov_iter_get_pages(). > > It happens probably here: > > E.g > > __ip_append_data() > msg_zerocopy_realloc() > mm_account_pinned_pages() Right. :) But my previous question is more about the reason behind it - I thought that's a common GUP and it shouldn't rely on locked_vm because it should still be temporary GUP rather than a long time GUP, IMHO that's how we use locked_vm (we don't account for temp GUPs but only longterm ones). IOW, I'm wondering whether all the rest of iov_iter_get_pages() callers should check locked_vm too, and AFAIU they're not doing that right now.. Thanks,
On Wed, Sep 8, 2021 at 11:24 AM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote: > > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > resource usage implications > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > implies locked memory (eg PCI assignment) > > > > > > > > Do you mean the limit an user has on locking memory? > > > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > > memory for the user before using it, or adding a capability to qemu before. > > > > > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK. > > > > > > The thing is IIUC that's accounting for pinned pages only with either mlock() > > > (FOLL_MLOCK) or vfio (FOLL_PIN). > > > > > > I don't really think MSG_ZEROCOPY is doing that at all... I'm looking at > > > __zerocopy_sg_from_iter() -> iov_iter_get_pages(). > > > > It happens probably here: > > > > E.g > > > > __ip_append_data() > > msg_zerocopy_realloc() > > mm_account_pinned_pages() > > Right. :) > > But my previous question is more about the reason behind it - I thought that's > a common GUP and it shouldn't rely on locked_vm because it should still be > temporary GUP rather than a long time GUP, IMHO that's how we use locked_vm (we > don't account for temp GUPs but only longterm ones). IOW, I'm wondering whether > all the rest of iov_iter_get_pages() callers should check locked_vm too, and > AFAIU they're not doing that right now.. You are probably right, but I'm not sure if it's too late to fix that. (E.g it will break existing userspace) Thanks > > Thanks, > > -- > Peter Xu >
* Jason Wang (jasowang@redhat.com) wrote: > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > resource usage implications > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > if the socket option was not set, the socket exceeds its optmem > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > The limit on locked pages is something that looks very likely to be > > > > exceeded unless you happen to be running a QEMU config that already > > > > implies locked memory (eg PCI assignment) > > > > > > Do you mean the limit an user has on locking memory? > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > memory for the user before using it, or adding a capability to qemu before. > > > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK. > > > > The thing is IIUC that's accounting for pinned pages only with either mlock() > > (FOLL_MLOCK) or vfio (FOLL_PIN). > > > > I don't really think MSG_ZEROCOPY is doing that at all... I'm looking at > > __zerocopy_sg_from_iter() -> iov_iter_get_pages(). > > It happens probably here: > > E.g > > __ip_append_data() > msg_zerocopy_realloc() > mm_account_pinned_pages() When do they get uncounted? i.e. is it just the data that's in flight that's marked as pinned? Dave > Thanks > > > > > Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, I > > think most of them do no need such accountings. > > > > -- > > Peter Xu > > >
On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote: > * Jason Wang (jasowang@redhat.com) wrote: > > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > resource usage implications > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > implies locked memory (eg PCI assignment) > > > > > > > > Do you mean the limit an user has on locking memory? > > > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > > memory for the user before using it, or adding a capability to qemu before. > > > > > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK. > > > > > > The thing is IIUC that's accounting for pinned pages only with either mlock() > > > (FOLL_MLOCK) or vfio (FOLL_PIN). > > > > > > I don't really think MSG_ZEROCOPY is doing that at all... I'm looking at > > > __zerocopy_sg_from_iter() -> iov_iter_get_pages(). > > > > It happens probably here: > > > > E.g > > > > __ip_append_data() > > msg_zerocopy_realloc() > > mm_account_pinned_pages() > > When do they get uncounted? i.e. is it just the data that's in flight > that's marked as pinned? I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages() correspondingly. Thanks,
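That "pinned only while in flight" behaviour is visible from userspace: the sender learns when the kernel has released (and unaccounted) the pages by reading completion notifications from the socket error queue, as described in the msg_zerocopy document linked earlier. A minimal illustrative reader, assuming a socket that has already sent data with MSG_ZEROCOPY, could look like this:

#include <errno.h>
#include <linux/errqueue.h>
#include <stdio.h>
#include <sys/socket.h>

/* Drain one notification from the error queue: ee_info..ee_data is the
 * range of zerocopy sends whose pages the kernel no longer keeps pinned. */
static int read_zerocopy_notification(int fd)
{
    char control[128];
    struct msghdr msg = { .msg_control = control,
                          .msg_controllen = sizeof(control) };

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
        return errno == EAGAIN ? 0 : -1;
    }

    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
         cm = CMSG_NXTHDR(&msg, cm)) {
        struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

        if (serr->ee_errno == 0 &&
            serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            printf("sends %u..%u completed%s\n",
                   serr->ee_info, serr->ee_data,
                   (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED) ?
                   " (kernel fell back to copying)" : "");
        }
    }
    return 0;
}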
On Tue, Sep 07, 2021 at 12:13:28PM +0100, Dr. David Alan Gilbert wrote: > * Daniel P. Berrangé (berrange@redhat.com) wrote: > > On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote: > > > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote: > > > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote: > > > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote: > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote: > > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread. > > > > > > > > > > > > > > Change the send_write() interface of multifd, allowing it to pass down > > > > > > > flags for qio_channel_write*(). > > > > > > > > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the > > > > > > > other data being sent at the default copying approach. > > > > > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > > --- > > > > > > > migration/multifd-zlib.c | 7 ++++--- > > > > > > > migration/multifd-zstd.c | 7 ++++--- > > > > > > > migration/multifd.c | 9 ++++++--- > > > > > > > migration/multifd.h | 3 ++- > > > > > > > 4 files changed, 16 insertions(+), 10 deletions(-) > > > > > > > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque) > > > > > > > } > > > > > > > > > > > > > > if (used) { > > > > > > > - ret = multifd_send_state->ops->send_write(p, used, &local_err); > > > > > > > + ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, > > > > > > > + &local_err); > > > > > > > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > > resource usage implications > > > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > > implies locked memory (eg PCI assignment) > > > > > > > > > > Yes it would be great to be a migration capability in parallel to multifd. At > > > > > initial phase if it's easy to be implemented on multi-fd only, we can add a > > > > > dependency between the caps. In the future we can remove that dependency when > > > > > the code is ready to go without multifd. Thanks, > > > > > > > > Also, I'm wondering how zerocopy support interacts with kernel support > > > > for kTLS and multipath-TCP, both of which we want to be able to use > > > > with migration. > > > > > > Copying Jason Wang for net implications between these features on kernel side > > > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS). > > > > > > From the safe side we may want to only enable one of them until we prove > > > they'll work together I guess.. > > > > MPTCP is good when we're network limited for migration > > > > KTLS will be good when we're CPU limited on AES for migration, > > which is essentially always when TLS is used. > > > > ZEROCOPY will be good when we're CPU limited for data copy > > on migration, or to reduce the impact on other concurrent > > VMs on the same CPUs. > > > > Ultimately we woudld benefit from all of them at the same > > time, if it were technically possible todo. 
> > I think last time I spoke to Paolo Abeni there were some interactions > between them; I can't remember what though (I think mptcp and ktls > didn't play at the time). MPTCP and KTLS use the same kernel hook in the network layer and only 1 hook is permitted at a time, making them mutually exclusive for now. In theory this can be fixed but no one is actively working on it yet. Regards, Daniel
On Wed, Sep 8, 2021 at 11:19 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote: > > * Jason Wang (jasowang@redhat.com) wrote: > > > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote: > > > > > > I don't think it is valid to unconditionally enable this feature due to the > > > > > > resource usage implications > > > > > > > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html > > > > > > > > > > > > "A zerocopy failure will return -1 with errno ENOBUFS. This happens > > > > > > if the socket option was not set, the socket exceeds its optmem > > > > > > limit or the user exceeds its ulimit on locked pages." > > > > > > > > > > You are correct, I unfortunately missed this part in the docs :( > > > > > > > > > > > The limit on locked pages is something that looks very likely to be > > > > > > exceeded unless you happen to be running a QEMU config that already > > > > > > implies locked memory (eg PCI assignment) > > > > > > > > > > Do you mean the limit an user has on locking memory? > > > > > > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked > > > > > memory for the user before using it, or adding a capability to qemu before. > > > > > > > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK. > > > > > > > > The thing is IIUC that's accounting for pinned pages only with either mlock() > > > > (FOLL_MLOCK) or vfio (FOLL_PIN). > > > > > > > > I don't really think MSG_ZEROCOPY is doing that at all... I'm looking at > > > > __zerocopy_sg_from_iter() -> iov_iter_get_pages(). > > > > > > It happens probably here: > > > > > > E.g > > > > > > __ip_append_data() > > > msg_zerocopy_realloc() > > > mm_account_pinned_pages() > > > > When do they get uncounted? i.e. is it just the data that's in flight > > that's marked as pinned? > > I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages() > correspondingly. Thanks, Yes, and actually the memory that could be pinned is limited by the sndbuf of TCP socket. So we are fine with rlimit (e.g we don't need to pin all guest pages). Thanks > > -- > Peter Xu >
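The bound Jason mentions can be checked directly on a connected socket; the helper below is just an illustration of where the per-socket ceiling comes from, not code from the series.

#include <stdio.h>
#include <sys/socket.h>

/* As noted above, the kernel only keeps zerocopy pages pinned while the
 * data sits in the socket's send queue, so SO_SNDBUF gives a rough
 * per-socket ceiling on the pinned amount. */
static int print_sndbuf_bound(int fd)
{
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) != 0) {
        perror("getsockopt(SO_SNDBUF)");
        return -1;
    }
    printf("send buffer size (bound on in-flight pinned bytes): %d\n", sndbuf);
    return 0;
}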
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index ab4ba75d75..d8cce1810a 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -160,12 +160,13 @@ static int zlib_send_prepare(MultiFDSendParams *p, uint32_t used, Error **errp)
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int zlib_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int zlib_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                           Error **errp)
 {
     struct zlib_data *z = p->data;
 
-    return qio_channel_write_all(p->c, (void *)z->zbuff, p->next_packet_size,
-                                 errp);
+    return qio_channel_write_all_flags(p->c, (void *)z->zbuff,
+                                       p->next_packet_size, flags, errp);
 }
 
 /**
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 693bddf8c9..fa063fd33e 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -171,12 +171,13 @@ static int zstd_send_prepare(MultiFDSendParams *p, uint32_t used, Error **errp)
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int zstd_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int zstd_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                           Error **errp)
 {
     struct zstd_data *z = p->data;
 
-    return qio_channel_write_all(p->c, (void *)z->zbuff, p->next_packet_size,
-                                 errp);
+    return qio_channel_write_all_flags(p->c, (void *)z->zbuff,
+                                       p->next_packet_size, flags, errp);
 }
 
 /**
diff --git a/migration/multifd.c b/migration/multifd.c
index 377da78f5b..097621c12c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -103,9 +103,10 @@ static int nocomp_send_prepare(MultiFDSendParams *p, uint32_t used,
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                             Error **errp)
 {
-    return qio_channel_writev_all(p->c, p->pages->iov, used, errp);
+    return qio_channel_writev_all_flags(p->c, p->pages->iov, used, flags, errp);
 }
 
 /**
@@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
         }
 
         if (used) {
-            ret = multifd_send_state->ops->send_write(p, used, &local_err);
+            ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
+                                                      &local_err);
             if (ret != 0) {
                 break;
             }
@@ -815,6 +817,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
     } else {
         /* update for tls qio channel */
         p->c = ioc;
+        qio_channel_set_zerocopy(ioc, true);
         qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
                            QEMU_THREAD_JOINABLE);
     }
diff --git a/migration/multifd.h b/migration/multifd.h
index 8d6751f5ed..7243ba4185 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -157,7 +157,8 @@ typedef struct {
     /* Prepare the send packet */
    int (*send_prepare)(MultiFDSendParams *p, uint32_t used, Error **errp);
     /* Write the send packet */
-    int (*send_write)(MultiFDSendParams *p, uint32_t used, Error **errp);
+    int (*send_write)(MultiFDSendParams *p, uint32_t used, int flags,
+                      Error **errp);
     /* Setup for receiving side */
     int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
     /* Cleanup for receiving side */
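For reference, at the plain-socket level the flag threaded down by this diff maps onto the usual SO_ZEROCOPY / MSG_ZEROCOPY pattern. The sketch below is independent of the QIOChannel layer and simplifies error handling; the helpers are hypothetical, and the fallback branch is the ENOBUFS case discussed throughout this thread.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60            /* value on most architectures */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Must be called once per socket before MSG_ZEROCOPY is honoured. */
static bool enable_zerocopy(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
}

/* Prefer a zerocopy send, degrading to an ordinary copying send when the
 * optmem or locked-pages limit is exceeded (the ENOBUFS case above). */
static ssize_t send_maybe_zerocopy(int fd, const void *buf, size_t len,
                                   bool zerocopy)
{
    if (zerocopy) {
        ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);
        if (ret >= 0 || errno != ENOBUFS) {
            /* NB: buf must stay untouched until the error-queue
             * notification reports this send as completed. */
            return ret;
        }
    }
    return send(fd, buf, len, 0);
}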
Call qio_channel_set_zerocopy(true) in the start of every multifd thread.

Change the send_write() interface of multifd, allowing it to pass down
flags for qio_channel_write*().

Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
other data being sent at the default copying approach.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd-zlib.c | 7 ++++---
 migration/multifd-zstd.c | 7 ++++---
 migration/multifd.c      | 9 ++++++---
 migration/multifd.h      | 3 ++-
 4 files changed, 16 insertions(+), 10 deletions(-)