
[RFC,0/3] virtio: NUMA-aware memory allocation

Message ID 20200625135752.227293-1-stefanha@redhat.com (mailing list archive)

Message

Stefan Hajnoczi June 25, 2020, 1:57 p.m. UTC
These patches are not ready to be merged because I was unable to measure a
performance improvement. I'm publishing them so they are archived in case
someone picks up this work again in the future.

The goal of these patches is to allocate virtqueues and driver state from the
device's NUMA node for optimal memory access latency. Only guests with a vNUMA
topology and virtio devices spread across vNUMA nodes benefit from this.  In
other cases the memory placement is fine and we don't need to take NUMA into
account inside the guest.
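The allocation pattern behind the series is small: query the device's NUMA node at probe time and pass it to the node-aware allocator. A hedged sketch of that pattern follows (not the literal patch contents; `example_probe` and `struct example_state` are hypothetical, while `dev_to_node()` and `kzalloc_node()` are the mainline kernel APIs):

```c
/*
 * Sketch of NUMA-aware allocation at probe time. dev_to_node() returns
 * the device's NUMA node, or NUMA_NO_NODE when the topology is unknown,
 * in which case kzalloc_node() behaves like a plain kzalloc().
 */
static int example_probe(struct virtio_device *vdev)
{
	/* The underlying PCI device carries the NUMA node information. */
	int node = dev_to_node(vdev->dev.parent);
	struct example_state *state;

	state = kzalloc_node(sizeof(*state), GFP_KERNEL, node);
	if (!state)
		return -ENOMEM;

	/* ... allocate virtqueues on the same node and finish probing ... */
	return 0;
}
```

The same node value would be threaded through to the virtqueue and DMA allocations so that all per-device state lands on the device's node.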

These patches could be extended to virtio_net.ko and other devices in the
future. I only tested virtio_blk.ko.

The benchmark configuration was designed to trigger worst-case NUMA placement:
 * Physical NVMe storage controller on host NUMA node 0
 * IOThread pinned to host NUMA node 0
 * virtio-blk-pci device in vNUMA node 1
 * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
 * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1
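A QEMU invocation producing this kind of placement might look roughly as follows (a hedged sketch, not the actual command line used; paths and sizes are made up, and host-side pinning of the vCPU threads and IOThread would be done separately, e.g. with taskset or libvirt's vcpupin/iothreadpin):

```shell
# Two vNUMA nodes, one vCPU each; virtio-blk-pci behind a PXB in node 1.
qemu-system-x86_64 -machine q35 \
    -smp 2 -m 4G \
    -object memory-backend-ram,id=mem0,size=2G \
    -object memory-backend-ram,id=mem1,size=2G \
    -numa node,nodeid=0,cpus=0,memdev=mem0 \
    -numa node,nodeid=1,cpus=1,memdev=mem1 \
    -object iothread,id=iothread0 \
    -device pxb-pcie,id=pxb1,bus_nr=2,numa_node=1,bus=pcie.0 \
    -drive if=none,id=drive0,file=/dev/nvme0n1,format=raw \
    -device virtio-blk-pci,drive=drive0,bus=pxb1,iothread=iothread0
```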

The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci device.
Applying these patches fixes the memory placement so that virtqueues and driver
state are allocated in vNUMA node 1, where the virtio-blk-pci device is located.

The fio 4KB randread benchmark results do not show a significant improvement:

Name                  IOPS   Error
virtio-blk        42373.79 ± 0.54%
virtio-blk-numa   42517.07 ± 0.79%
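The fio job was presumably along these lines (a hedged reconstruction from the "4KB randread" description; the guest device name, iodepth, and runtime are assumptions, since the original job file is not included in the cover letter):

```shell
# Hypothetical fio invocation matching the 4KB randread description.
fio --name=randread --filename=/dev/vda --direct=1 --rw=randread \
    --bs=4k --ioengine=libaio --iodepth=64 \
    --runtime=60 --time_based --numjobs=1
```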

Stefan Hajnoczi (3):
  virtio-pci: use NUMA-aware memory allocation in probe
  virtio_ring: use NUMA-aware memory allocation in probe
  virtio-blk: use NUMA-aware memory allocation in probe

 include/linux/gfp.h                |  2 +-
 drivers/block/virtio_blk.c         |  7 +++++--
 drivers/virtio/virtio_pci_common.c | 16 ++++++++++++----
 drivers/virtio/virtio_ring.c       | 26 +++++++++++++++++---------
 mm/page_alloc.c                    |  2 +-
 5 files changed, 36 insertions(+), 17 deletions(-)

-- 
2.26.2

Comments

Jason Wang June 28, 2020, 6:34 a.m. UTC | #1
On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> These patches are not ready to be merged because I was unable to measure a
> performance improvement. I'm publishing them so they are archived in case
> someone picks up this work again in the future.
>
> The goal of these patches is to allocate virtqueues and driver state from the
> device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> topology and virtio devices spread across vNUMA nodes benefit from this.  In
> other cases the memory placement is fine and we don't need to take NUMA into
> account inside the guest.
>
> These patches could be extended to virtio_net.ko and other devices in the
> future. I only tested virtio_blk.ko.
>
> The benchmark configuration was designed to trigger worst-case NUMA placement:
>   * Physical NVMe storage controller on host NUMA node 0
>   * IOThread pinned to host NUMA node 0
>   * virtio-blk-pci device in vNUMA node 1
>   * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
>   * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1
>
> The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
> node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci device.
> Applying these patches fixes the memory placement so that virtqueues and driver
> state are allocated in vNUMA node 1, where the virtio-blk-pci device is located.
>
> The fio 4KB randread benchmark results do not show a significant improvement:
>
> Name                  IOPS   Error
> virtio-blk        42373.79 ± 0.54%
> virtio-blk-numa   42517.07 ± 0.79%


I remember doing something similar in vhost by using page_to_nid() for the
descriptor ring, and I got little improvement, as shown here.

Michael reminded me that it was probably because all the data was cached. So I
suspect the test lacks sufficient stress on the cache ...

Thanks


>
> Stefan Hajnoczi (3):
>    virtio-pci: use NUMA-aware memory allocation in probe
>    virtio_ring: use NUMA-aware memory allocation in probe
>    virtio-blk: use NUMA-aware memory allocation in probe
>
>   include/linux/gfp.h                |  2 +-
>   drivers/block/virtio_blk.c         |  7 +++++--
>   drivers/virtio/virtio_pci_common.c | 16 ++++++++++++----
>   drivers/virtio/virtio_ring.c       | 26 +++++++++++++++++---------
>   mm/page_alloc.c                    |  2 +-
>   5 files changed, 36 insertions(+), 17 deletions(-)
>
> -- 
> 2.26.2
>
Stefan Hajnoczi June 29, 2020, 9:26 a.m. UTC | #2
On Sun, Jun 28, 2020 at 02:34:37PM +0800, Jason Wang wrote:
> 
> On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> > These patches are not ready to be merged because I was unable to measure a
> > performance improvement. I'm publishing them so they are archived in case
> > someone picks up this work again in the future.
> > 
> > The goal of these patches is to allocate virtqueues and driver state from the
> > device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> > topology and virtio devices spread across vNUMA nodes benefit from this.  In
> > other cases the memory placement is fine and we don't need to take NUMA into
> > account inside the guest.
> > 
> > These patches could be extended to virtio_net.ko and other devices in the
> > future. I only tested virtio_blk.ko.
> > 
> > The benchmark configuration was designed to trigger worst-case NUMA placement:
> >   * Physical NVMe storage controller on host NUMA node 0
> >   * IOThread pinned to host NUMA node 0
> >   * virtio-blk-pci device in vNUMA node 1
> >   * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
> >   * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1
> > 
> > The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
> > node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci device.
> > Applying these patches fixes the memory placement so that virtqueues and driver
> > state are allocated in vNUMA node 1, where the virtio-blk-pci device is located.
> > 
> > The fio 4KB randread benchmark results do not show a significant improvement:
> > 
> > Name                  IOPS   Error
> > virtio-blk        42373.79 ± 0.54%
> > virtio-blk-numa   42517.07 ± 0.79%
> 
> 
> I remember doing something similar in vhost by using page_to_nid() for the
> descriptor ring, and I got little improvement, as shown here.
> 
> Michael reminded me that it was probably because all the data was cached. So I
> suspect the test lacks sufficient stress on the cache ...

Yes, that sounds likely. If there's no real-world performance
improvement then I'm happy to leave these patches unmerged.

Stefan
Michael S. Tsirkin June 29, 2020, 3:28 p.m. UTC | #3
On Mon, Jun 29, 2020 at 10:26:46AM +0100, Stefan Hajnoczi wrote:
> On Sun, Jun 28, 2020 at 02:34:37PM +0800, Jason Wang wrote:
> > 
> > On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> > > These patches are not ready to be merged because I was unable to measure a
> > > performance improvement. I'm publishing them so they are archived in case
> > > someone picks up this work again in the future.
> > > 
> > > The goal of these patches is to allocate virtqueues and driver state from the
> > > device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> > > topology and virtio devices spread across vNUMA nodes benefit from this.  In
> > > other cases the memory placement is fine and we don't need to take NUMA into
> > > account inside the guest.
> > > 
> > > These patches could be extended to virtio_net.ko and other devices in the
> > > future. I only tested virtio_blk.ko.
> > > 
> > > The benchmark configuration was designed to trigger worst-case NUMA placement:
> > >   * Physical NVMe storage controller on host NUMA node 0

It's possible that NUMA is not such a big deal for NVMe.
It's also possible that the BIOS misconfigures ACPI and reports the NUMA
placement incorrectly.
I think the best thing to try is to use a ramdisk
on a specific NUMA node.
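One way to set up such a ramdisk, sketched here under the assumption that the host has the brd module available, is to bind the populating process to the desired node; brd allocates its pages lazily on first write, so a membind'ed writer should place them on that node:

```shell
# Hedged sketch: back the guest disk with a ramdisk whose pages live on
# host node 0. The size and node are example values.
modprobe brd rd_nr=1 rd_size=4194304          # one 4 GiB /dev/ram0
numactl --membind=0 dd if=/dev/zero of=/dev/ram0 bs=1M oflag=direct
```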

> > >   * IOThread pinned to host NUMA node 0
> > >   * virtio-blk-pci device in vNUMA node 1
> > >   * vCPU 0 on host NUMA node 1 and vCPU 1 on host NUMA node 0
> > >   * vCPU 0 in vNUMA node 0 and vCPU 1 in vNUMA node 1
> > > 
> > > The intent is to have .probe() code run on vCPU 0 in vNUMA node 0 (host NUMA
> > > node 1) so that memory is in the wrong NUMA node for the virtio-blk-pci device.
> > > Applying these patches fixes the memory placement so that virtqueues and driver
> > > state are allocated in vNUMA node 1, where the virtio-blk-pci device is located.
> > > 
> > > The fio 4KB randread benchmark results do not show a significant improvement:
> > > 
> > > Name                  IOPS   Error
> > > virtio-blk        42373.79 ± 0.54%
> > > virtio-blk-numa   42517.07 ± 0.79%
> > 
> > 
> > I remember doing something similar in vhost by using page_to_nid() for the
> > descriptor ring, and I got little improvement, as shown here.
> > 
> > Michael reminded me that it was probably because all the data was cached. So I
> > suspect the test lacks sufficient stress on the cache ...
> 
> Yes, that sounds likely. If there's no real-world performance
> improvement then I'm happy to leave these patches unmerged.
> 
> Stefan


Well that was for vhost though. This is virtio, which is different.
Doesn't some benchmark put pressure on the CPU cache?


I kind of feel there should be a difference, and the fact there isn't
means there's some other bottleneck somewhere. Might be worth
figuring out.
Stefan Hajnoczi June 30, 2020, 8:47 a.m. UTC | #4
On Mon, Jun 29, 2020 at 11:28:41AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 29, 2020 at 10:26:46AM +0100, Stefan Hajnoczi wrote:
> > On Sun, Jun 28, 2020 at 02:34:37PM +0800, Jason Wang wrote:
> > > 
> > > On 2020/6/25 9:57 PM, Stefan Hajnoczi wrote:
> > > > These patches are not ready to be merged because I was unable to measure a
> > > > performance improvement. I'm publishing them so they are archived in case
> > > > someone picks up this work again in the future.
> > > > 
> > > > The goal of these patches is to allocate virtqueues and driver state from the
> > > > device's NUMA node for optimal memory access latency. Only guests with a vNUMA
> > > > topology and virtio devices spread across vNUMA nodes benefit from this.  In
> > > > other cases the memory placement is fine and we don't need to take NUMA into
> > > > account inside the guest.
> > > > 
> > > > These patches could be extended to virtio_net.ko and other devices in the
> > > > future. I only tested virtio_blk.ko.
> > > > 
> > > > The benchmark configuration was designed to trigger worst-case NUMA placement:
> > > >   * Physical NVMe storage controller on host NUMA node 0
> 
> It's possible that NUMA is not such a big deal for NVMe.
> It's also possible that the BIOS misconfigures ACPI and reports the NUMA
> placement incorrectly.
> I think the best thing to try is to use a ramdisk
> on a specific NUMA node.

Using ramdisk is an interesting idea, thanks.

Stefan