Message ID | 20200625135752.227293-1-stefanha@redhat.com
---|---
Series | virtio: NUMA-aware memory allocation
On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> These patches are not ready to be merged because I was unable to measure a
> performance improvement. I'm publishing them so they are archived in case
> someone picks up this work again in the future.
>
> The goal of these patches is to allocate virtqueues and driver state from the
> device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> topology and virtio devices spread across vNUMA nodes benefit from this. In
> other cases the memory placement is fine and we don't need to take NUMA into
> account inside the guest.
>
> These patches could be extended to virtio_net.ko and other devices in the
> future. I only tested virtio_blk.ko.
>
> The benchmark configuration was designed to trigger worst-case NUMA placement:
> * Physical NVMe storage controller on host NUMA node 0
> * IOThread pinned to host NUMA node 0
> * virtio-blk-pci device in vNUMA node 1
> * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
> * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1
>
> The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
> node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci device.
> Applying these patches fixes memory placement so that virtqueues and driver
> state are allocated in vNUMA node 1 where the virtio-blk-pci device is located.
>
> The fio 4KB randread benchmark results do not show a significant improvement:
>
> Name             IOPS      Error
> virtio-blk       42373.79  ± 0.54%
> virtio-blk-numa  42517.07  ± 0.79%


I remember I did something similar in vhost by using page_to_nid() for
the descriptor ring, and I got little improvement, as shown here.

Michael reminded me that it was probably because all the data was cached.
So I suspect the test does not put sufficient stress on the cache ...

Thanks


>
> Stefan Hajnoczi (3):
>   virtio-pci: use NUMA-aware memory allocation in probe
>   virtio_ring: use NUMA-aware memory allocation in probe
>   virtio-blk: use NUMA-aware memory allocation in probe
>
>  include/linux/gfp.h                |  2 +-
>  drivers/block/virtio_blk.c         |  7 +++++--
>  drivers/virtio/virtio_pci_common.c | 16 ++++++++++++----
>  drivers/virtio/virtio_ring.c       | 26 +++++++++++++++++---------
>  mm/page_alloc.c                    |  2 +-
>  5 files changed, 36 insertions(+), 17 deletions(-)
>
> --
> 2.26.2
>
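For readers who want a concrete picture of what "NUMA-aware memory allocation in probe" means, here is a minimal, hypothetical sketch using standard kernel APIs (dev_to_node() and kzalloc_node()). It is not code from the series; the struct and helper names are made up for illustration.

```c
/*
 * Illustrative sketch only -- not taken from the series. It shows the
 * general pattern of placing driver state on the device's NUMA node
 * instead of the node of the CPU that happens to run .probe().
 */
#include <linux/device.h>
#include <linux/slab.h>

struct my_drv_state {
	int dummy;	/* hypothetical driver state */
};

static struct my_drv_state *my_drv_alloc_state(struct device *dev)
{
	/* dev_to_node() reports the NUMA node the device is attached to. */
	int node = dev_to_node(dev);

	/*
	 * kzalloc_node() prefers that node (falling back to others if
	 * needed), whereas plain kzalloc() would use the probing CPU's node.
	 */
	return kzalloc_node(sizeof(struct my_drv_state), GFP_KERNEL, node);
}
```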
On Sun, Jun 28, 2020 at 02:34:37PM +0800, Jason Wang wrote:
>
> On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> > These patches are not ready to be merged because I was unable to measure a
> > performance improvement. I'm publishing them so they are archived in case
> > someone picks up this work again in the future.
> >
> > The goal of these patches is to allocate virtqueues and driver state from the
> > device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> > topology and virtio devices spread across vNUMA nodes benefit from this. In
> > other cases the memory placement is fine and we don't need to take NUMA into
> > account inside the guest.
> >
> > These patches could be extended to virtio_net.ko and other devices in the
> > future. I only tested virtio_blk.ko.
> >
> > The benchmark configuration was designed to trigger worst-case NUMA placement:
> > * Physical NVMe storage controller on host NUMA node 0
> > * IOThread pinned to host NUMA node 0
> > * virtio-blk-pci device in vNUMA node 1
> > * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
> > * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1
> >
> > The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
> > node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci device.
> > Applying these patches fixes memory placement so that virtqueues and driver
> > state are allocated in vNUMA node 1 where the virtio-blk-pci device is located.
> >
> > The fio 4KB randread benchmark results do not show a significant improvement:
> >
> > Name             IOPS      Error
> > virtio-blk       42373.79  ± 0.54%
> > virtio-blk-numa  42517.07  ± 0.79%
>
> I remember I did something similar in vhost by using page_to_nid() for
> the descriptor ring, and I got little improvement, as shown here.
>
> Michael reminded me that it was probably because all the data was cached.
> So I suspect the test does not put sufficient stress on the cache ...

Yes, that sounds likely. If there's no real-world performance
improvement then I'm happy to leave these patches unmerged.

Stefan
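The page_to_nid() check Jason mentions can be sketched roughly as follows. This is illustrative only: it assumes the ring memory lives in ordinary directly-mapped kernel memory so that virt_to_page() is valid, and the helper name and calling context are hypothetical.

```c
/*
 * Rough sketch of the kind of check Jason describes: report which NUMA
 * node backs the descriptor ring versus the node the device sits on.
 */
#include <linux/device.h>
#include <linux/mm.h>

static void report_ring_node(struct device *dev, void *ring_addr)
{
	/* page_to_nid() gives the node of the page backing the ring. */
	int ring_node = page_to_nid(virt_to_page(ring_addr));
	int dev_node = dev_to_node(dev);

	dev_info(dev, "ring on node %d, device on node %d%s\n",
		 ring_node, dev_node,
		 ring_node == dev_node ? "" : " (mismatch)");
}
```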
On Mon, Jun 29, 2020 at 10:26:46AM +0100, Stefan Hajnoczi wrote:
> On Sun, Jun 28, 2020 at 02:34:37PM +0800, Jason Wang wrote:
> >
> > On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> > > These patches are not ready to be merged because I was unable to measure a
> > > performance improvement. I'm publishing them so they are archived in case
> > > someone picks up this work again in the future.
> > >
> > > The goal of these patches is to allocate virtqueues and driver state from the
> > > device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> > > topology and virtio devices spread across vNUMA nodes benefit from this. In
> > > other cases the memory placement is fine and we don't need to take NUMA into
> > > account inside the guest.
> > >
> > > These patches could be extended to virtio_net.ko and other devices in the
> > > future. I only tested virtio_blk.ko.
> > >
> > > The benchmark configuration was designed to trigger worst-case NUMA placement:
> > > * Physical NVMe storage controller on host NUMA node 0

It's possible that NUMA is not such a big deal for NVMe. It's also
possible that the BIOS misconfigures ACPI and reports NUMA placement
incorrectly. I think the best thing to try is a ramdisk on a specific
NUMA node.

> > > * IOThread pinned to host NUMA node 0
> > > * virtio-blk-pci device in vNUMA node 1
> > > * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
> > > * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1
> > >
> > > The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
> > > node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci device.
> > > Applying these patches fixes memory placement so that virtqueues and driver
> > > state are allocated in vNUMA node 1 where the virtio-blk-pci device is located.
> > >
> > > The fio 4KB randread benchmark results do not show a significant improvement:
> > >
> > > Name             IOPS      Error
> > > virtio-blk       42373.79  ± 0.54%
> > > virtio-blk-numa  42517.07  ± 0.79%
> >
> > I remember I did something similar in vhost by using page_to_nid() for
> > the descriptor ring, and I got little improvement, as shown here.
> >
> > Michael reminded me that it was probably because all the data was cached.
> > So I suspect the test does not put sufficient stress on the cache ...
>
> Yes, that sounds likely. If there's no real-world performance
> improvement then I'm happy to leave these patches unmerged.
>
> Stefan

Well, that was for vhost, though; this is virtio, which is different.
Doesn't some benchmark put pressure on the CPU cache? I have a feeling
there should be a difference, and the fact that there isn't suggests
some other bottleneck somewhere. It might be worth figuring out.
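A node-pinned buffer of the kind a node-local ramdisk would provide can be obtained with libnuma. The sketch below (userspace C, link with -lnuma) only illustrates binding memory to one host node; it is not the exact setup Michael has in mind, which would more likely involve a real ramdisk block device placed on a chosen node.

```c
/*
 * Hypothetical illustration: allocate a buffer whose pages are bound to
 * host NUMA node 0, so storage-side memory cannot hide NUMA effects.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const size_t size = 1UL << 30;	/* 1 GiB "ramdisk-like" buffer */
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available on this system\n");
		return 1;
	}

	/* Request pages from node 0 only. */
	buf = numa_alloc_onnode(size, 0);
	if (!buf) {
		fprintf(stderr, "numa_alloc_onnode failed\n");
		return 1;
	}

	/* Touch every page so it is actually faulted in on node 0. */
	memset(buf, 0, size);

	numa_free(buf, size);
	return 0;
}
```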
On Mon, Jun 29, 2020 at 11:28:41AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 29, 2020 at 10:26:46AM +0100, Stefan Hajnoczi wrote:
> > On Sun, Jun 28, 2020 at 02:34:37PM +0800, Jason Wang wrote:
> > >
> > > On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> > > > These patches are not ready to be merged because I was unable to measure a
> > > > performance improvement. I'm publishing them so they are archived in case
> > > > someone picks up this work again in the future.
> > > >
> > > > The goal of these patches is to allocate virtqueues and driver state from the
> > > > device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> > > > topology and virtio devices spread across vNUMA nodes benefit from this. In
> > > > other cases the memory placement is fine and we don't need to take NUMA into
> > > > account inside the guest.
> > > >
> > > > These patches could be extended to virtio_net.ko and other devices in the
> > > > future. I only tested virtio_blk.ko.
> > > >
> > > > The benchmark configuration was designed to trigger worst-case NUMA placement:
> > > > * Physical NVMe storage controller on host NUMA node 0
>
> It's possible that NUMA is not such a big deal for NVMe. It's also
> possible that the BIOS misconfigures ACPI and reports NUMA placement
> incorrectly. I think the best thing to try is a ramdisk on a specific
> NUMA node.

Using a ramdisk is an interesting idea, thanks.

Stefan