Message ID | 20090121230642.10404.65372.stgit@kvm.aw (mailing list archive) |
---|---|
State | New, archived |
Alex Williamson wrote:
> This is an attempt to improve the latency of virtio-net while not hurting
> throughput. I wanted to try moving packet TX into a different thread
> so we can quickly return to the guest after it kicks us to send packets
> out. I also switched the order of when the tx_timer comes into play, so
> we can get an initial burst of packets out, then wait for the timer to
> fire and notify us if there's more to do. Here's what it does for me
> (average of 5 runs each, testing to a remote system on a 1Gb network):
>
> netperf TCP_STREAM: 939.22Mb/s -> 935.24Mb/s = 99.58%
> netperf TCP_RR: 2028.72/s -> 3927.99/s = 193.62%
> tbench: 92.99MB/s -> 99.97MB/s = 107.51%
>
> I'd be interested to hear if it helps or hurts anyone else. Thanks,

My worry with this change is that it increases cpu utilization even more
than it increases bandwidth, so that our bits/cycle measure decreases. The
descriptors (and perhaps data) are likely on the same cache as the vcpu,
and moving the transmit to the iothread will cause them to move to the
iothread's cache.

My preferred approach to increasing both bandwidth and bits/cycle (the
latter figure is more important IMO; unfortunately, benchmarks don't
measure it) is to aio-enable tap and raw sockets. The vcpu thread would
only touch the packet descriptors (not data) and submit all packets in
one io_submit() call. Unfortunately, a huge amount of work is needed to
pull this off.
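For concreteness, the batching Avi describes would look roughly like the
sketch below. It assumes a hypothetical aio-capable tap/raw-socket fd and
uses the libaio interface; tx_pkt, submit_tx_batch and MAX_BATCH are
made-up names, and nothing like this exists in qemu-kvm today.

/* Sketch only: submit all pending TX packets in a single io_submit()
 * call against a hypothetical aio-capable tap/raw-socket fd. */
#include <libaio.h>
#include <sys/uio.h>

#define MAX_BATCH 64

struct tx_pkt {
    struct iovec iov[2];    /* e.g. virtio header + payload, left in guest memory */
    int iovcnt;
};

static int submit_tx_batch(io_context_t ctx, int tap_fd,
                           struct tx_pkt *pkts, int npkts)
{
    struct iocb iocbs[MAX_BATCH];
    struct iocb *list[MAX_BATCH];
    int i;

    for (i = 0; i < npkts && i < MAX_BATCH; i++) {
        /* The vcpu thread only touches descriptors here; packet data
         * is never copied or read. */
        io_prep_pwritev(&iocbs[i], tap_fd, pkts[i].iov, pkts[i].iovcnt, 0);
        list[i] = &iocbs[i];
    }

    /* One system call for the whole burst; completions would be
     * reaped elsewhere with io_getevents(). */
    return io_submit(ctx, i, list);
}

The io_context_t would come from io_setup(), and the "huge amount of
work" is largely in making tap and raw sockets accept this kind of
submission at all.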
Hi Alex,

On Wed, 2009-01-21 at 16:08 -0700, Alex Williamson wrote:
> This is an attempt to improve the latency of virtio-net while not hurting
> throughput. I wanted to try moving packet TX into a different thread
> so we can quickly return to the guest after it kicks us to send packets
> out. I also switched the order of when the tx_timer comes into play, so
> we can get an initial burst of packets out, then wait for the timer to
> fire and notify us if there's more to do. Here's what it does for me
> (average of 5 runs each, testing to a remote system on a 1Gb network):
>
> netperf TCP_STREAM: 939.22Mb/s -> 935.24Mb/s = 99.58%
> netperf TCP_RR: 2028.72/s -> 3927.99/s = 193.62%
> tbench: 92.99MB/s -> 99.97MB/s = 107.51%
>
> I'd be interested to hear if it helps or hurts anyone else. Thanks,

Avi and I went back and forth on this one in great detail before:

http://www.mail-archive.com/kvm@vger.kernel.org/msg06431.html

Avi's arguments make a lot of sense, but looking over those patches
again now, I still think that applying them would be a better approach
than what we have right now.

Cheers,
Mark.
On Thu, 2009-01-22 at 14:12 +0200, Avi Kivity wrote:
> My worry with this change is that it increases cpu utilization even more
> than it increases bandwidth, so that our bits/cycle measure decreases.

Yep, agreed it's important to watch out for this.

> The descriptors (and perhaps data) are likely on the same cache as the
> vcpu, and moving the transmit to the iothread will cause them to move to
> the iothread's cache.

We flush from the I/O thread right now.

We only ever flush from the vcpu thread when the ring fills up, which
rarely happens from what I've observed.

Cheers,
Mark.
Mark McLoughlin wrote:
> Hi Alex,
>
> On Wed, 2009-01-21 at 16:08 -0700, Alex Williamson wrote:
>
>> This is an attempt to improve the latency of virtio-net while not hurting
>> throughput. I wanted to try moving packet TX into a different thread
>> so we can quickly return to the guest after it kicks us to send packets
>> out. I also switched the order of when the tx_timer comes into play, so
>> we can get an initial burst of packets out, then wait for the timer to
>> fire and notify us if there's more to do. Here's what it does for me
>> (average of 5 runs each, testing to a remote system on a 1Gb network):
>>
>> netperf TCP_STREAM: 939.22Mb/s -> 935.24Mb/s = 99.58%
>> netperf TCP_RR: 2028.72/s -> 3927.99/s = 193.62%
>> tbench: 92.99MB/s -> 99.97MB/s = 107.51%
>>
>> I'd be interested to hear if it helps or hurts anyone else. Thanks,
>
> Avi and I went back and forth on this one in great detail before:
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg06431.html
>
> Avi's arguments make a lot of sense, but looking over those patches
> again now, I still think that applying them would be a better approach
> than what we have right now.

I can go with that. Anthony, should I wait for a qemu iothread so it can
go upstream directly?
(copying Anthony this time)

Mark McLoughlin wrote:
> Hi Alex,
>
> On Wed, 2009-01-21 at 16:08 -0700, Alex Williamson wrote:
>
>> This is an attempt to improve the latency of virtio-net while not hurting
>> throughput. I wanted to try moving packet TX into a different thread
>> so we can quickly return to the guest after it kicks us to send packets
>> out. I also switched the order of when the tx_timer comes into play, so
>> we can get an initial burst of packets out, then wait for the timer to
>> fire and notify us if there's more to do. Here's what it does for me
>> (average of 5 runs each, testing to a remote system on a 1Gb network):
>>
>> netperf TCP_STREAM: 939.22Mb/s -> 935.24Mb/s = 99.58%
>> netperf TCP_RR: 2028.72/s -> 3927.99/s = 193.62%
>> tbench: 92.99MB/s -> 99.97MB/s = 107.51%
>>
>> I'd be interested to hear if it helps or hurts anyone else. Thanks,
>
> Avi and I went back and forth on this one in great detail before:
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg06431.html
>
> Avi's arguments make a lot of sense, but looking over those patches
> again now, I still think that applying them would be a better approach
> than what we have right now.

I can go with that. Anthony, should I wait for a qemu iothread so it can
go upstream directly?
On Thu, 2009-01-22 at 12:48 +0000, Mark McLoughlin wrote:
> On Thu, 2009-01-22 at 14:12 +0200, Avi Kivity wrote:
>
> > My worry with this change is that it increases cpu utilization even more
> > than it increases bandwidth, so that our bits/cycle measure decreases.
>
> Yep, agreed it's important to watch out for this.
>
> > The descriptors (and perhaps data) are likely on the same cache as the
> > vcpu, and moving the transmit to the iothread will cause them to move to
> > the iothread's cache.
>
> We flush from the I/O thread right now.
>
> We only ever flush from the vcpu thread when the ring fills up, which
> rarely happens from what I've observed.

Sorry to have come in late to the discussion, but it seems like maybe it
needed another kick after a couple of months anyway.

As noted, we are mostly (almost exclusively?) doing TX from the timer
anyway, so neither this change nor Mark's previous patch series really
alters the current cache effects. I am curious what happens to latency
with Mark's series, since that isn't really addressed by the charts;
hopefully good things happen without the tx_timer.

A thread per device, or perhaps even a thread per RX/TX stream, seems
like a logical goal, but these current patches do provide a worthwhile
incremental improvement. Perhaps we could affinitize the guest to do I/O
on a specific vcpu via _PXM methods in ACPI, so we can provide hints to
the scheduler to keep a vcpu thread and its associated I/O threads
nearby.

Thanks,
Alex
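As a rough illustration of the "keep the I/O thread near its vcpu" half
of that idea, the hint could also be expressed from userspace with plain
CPU affinity, independent of any ACPI _PXM plumbing; colocate_io_thread
and its arguments are hypothetical names.

/* Pin the virtio TX I/O thread onto the host CPU (or, more realistically,
 * the NUMA node) that the corresponding vcpu thread runs on, so the
 * virtqueue descriptors stay warm in a shared cache. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int colocate_io_thread(pthread_t io_thread, int vcpu_host_cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(vcpu_host_cpu, &set);

    return pthread_setaffinity_np(io_thread, sizeof(set), &set);
}

Pinning to a single CPU is crude and can serialize the two threads;
restricting both to the same NUMA node would be the gentler variant of
the same hint.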
Avi Kivity wrote:
> Mark McLoughlin wrote:
>> Avi and I went back and forth on this one in great detail before:
>>
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg06431.html
>>
>> Avi's arguments make a lot of sense, but looking over those patches
>> again now, I still think that applying them would be a better approach
>> than what we have right now.
>
> I can go with that. Anthony, should I wait for a qemu iothread so it
> can go upstream directly?

Uh, go ahead and apply it now. The io thread should come soon but I
don't think it's going to be easier to wait and merge this later than
dealing with the conflict after you resync against QEMU post IO thread.

Regards,

Anthony Liguori
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 8f3c41d..26508bd 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -17,6 +17,8 @@
 #include "virtio-net.h"
 #ifdef USE_KVM
 #include "qemu-kvm.h"
+#include "compatfd.h"
+#include "qemu-char.h"
 #endif
 
 #define TAP_VNET_HDR
@@ -49,6 +51,9 @@ typedef struct VirtIONet
         int enabled;
         uint32_t *vlans;
     } vlan_table;
+#ifdef USE_KVM
+    int fds[2];
+#endif
 } VirtIONet;
 
 /* TODO
@@ -570,10 +575,50 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
     }
 }
 
+#ifdef USE_KVM
+static void virtio_net_do_tx(void *opaque)
+{
+    VirtIONet *n = opaque;
+
+    virtio_net_flush_tx(n, n->tx_vq);
+    qemu_mod_timer(n->tx_timer,
+                   qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
+}
+
+static void virtio_net_notify_tx(int fd)
+{
+    uint64_t value = 1;
+    char buffer[8];
+    size_t offset = 0;
+
+    memcpy(buffer, &value, sizeof(value));
+
+    while (offset < 8) {
+        ssize_t len;
+
+        len = write(fd, buffer + offset, 8 - offset);
+        if (len == -1 && errno == EINTR)
+            continue;
+
+        if (len <= 0)
+            break;
+
+        offset += len;
+    }
+
+    if (offset != 8)
+        fprintf(stderr, "virtio-net: failed to notify tx thread\n");
+}
+#endif
+
 static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
+#ifdef USE_KVM
+    virtio_queue_set_notification(n->tx_vq, 0);
+    virtio_net_notify_tx(n->fds[0]);
+#else
     if (n->tx_timer_active) {
         virtio_queue_set_notification(vq, 1);
         qemu_del_timer(n->tx_timer);
@@ -585,6 +630,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
         n->tx_timer_active = 1;
         virtio_queue_set_notification(vq, 0);
     }
+#endif
 }
 
 static void virtio_net_tx_timer(void *opaque)
@@ -598,7 +644,9 @@ static void virtio_net_tx_timer(void *opaque)
         return;
 
     virtio_queue_set_notification(n->tx_vq, 1);
+#ifndef USE_KVM
    virtio_net_flush_tx(n, n->tx_vq);
+#endif
 }
 
 /*
@@ -712,6 +760,15 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
                                  virtio_net_receive, virtio_net_can_receive, n);
     n->vc->link_status_changed = virtio_net_set_link_status;
 
+#ifdef USE_KVM
+    if (qemu_eventfd(n->fds) == -1) {
+        fprintf(stderr, "failed to setup virtio-net eventfd\n");
+        return NULL;
+    }
+
+    qemu_set_fd_handler2(n->fds[0], NULL, virtio_net_do_tx, NULL, (void *)n);
+#endif
+
     qemu_format_nic_info_str(n->vc, n->mac);
 
     n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
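The notification path added above boils down to the standard eventfd
wake-up pattern (qemu_eventfd() presumably wraps eventfd() with a pipe
fallback, hence the explicit 8-byte write loop). A minimal standalone
illustration of the mechanism, not QEMU code:

/* Producer/consumer wake-up via an eventfd counter. */
#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, 0);
    if (efd == -1) {
        perror("eventfd");
        return 1;
    }

    /* Producer side (the vcpu thread in the patch): a fixed-size 8-byte
     * write adds to the counter and wakes any reader. */
    uint64_t kick = 1;
    if (write(efd, &kick, sizeof(kick)) != sizeof(kick))
        perror("write");

    /* Consumer side (the I/O thread handler): the read returns the
     * accumulated count and resets it, coalescing multiple kicks. */
    uint64_t kicks;
    if (read(efd, &kicks, sizeof(kicks)) != sizeof(kicks))
        perror("read");
    else
        printf("coalesced %llu kick(s)\n", (unsigned long long)kicks);

    close(efd);
    return 0;
}

Multiple guest kicks between handler runs simply accumulate in the
counter, so the I/O thread wakes once and flushes everything that is
pending.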
This is an attempt to improve the latency of virtio-net while not hurting
throughput. I wanted to try moving packet TX into a different thread so
we can quickly return to the guest after it kicks us to send packets out.
I also switched the order of when the tx_timer comes into play, so we can
get an initial burst of packets out, then wait for the timer to fire and
notify us if there's more to do. Here's what it does for me (average of
5 runs each, testing to a remote system on a 1Gb network):

netperf TCP_STREAM: 939.22Mb/s -> 935.24Mb/s = 99.58%
netperf TCP_RR: 2028.72/s -> 3927.99/s = 193.62%
tbench: 92.99MB/s -> 99.97MB/s = 107.51%

I'd be interested to hear if it helps or hurts anyone else. Thanks,

Alex

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
---
 qemu/hw/virtio-net.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 57 insertions(+), 0 deletions(-)