From patchwork Wed Jun 17 09:37:23 2020
X-Patchwork-Submitter: Stefan Hajnoczi
X-Patchwork-Id: 11609495
From: Stefan Hajnoczi
To: linux-kernel@vger.kernel.org
Cc: Marcelo Tosatti, linux-pci@vger.kernel.org, Thomas Gleixner,
    Bjorn Helgaas, "Michael S. Tsirkin", Stefan Hajnoczi
Subject: [RFC 0/2] genirq: take device NUMA node into account for managed IRQs
Date: Wed, 17 Jun 2020 10:37:23 +0100
Message-Id: <20200617093725.1725569-1-stefanha@redhat.com>

Devices with a small number of managed IRQs do not benefit from spreading
across all CPUs. Instead they benefit from NUMA node affinity so that IRQs
are handled on the device's NUMA node.
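As a rough sketch of the idea (this is not the code in the patches below;
the helper and its call site are hypothetical, while dev_to_node(),
cpumask_of_node(), NUMA_NO_NODE and the cpumask helpers are existing kernel
APIs), the spread of a managed IRQ would be clamped to the CPUs of the
device's NUMA node, keeping the full mask as a fallback:

#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/numa.h>
#include <linux/topology.h>

/*
 * Sketch only: clamp a managed IRQ's affinity mask to the CPUs of the
 * device's NUMA node, leaving the mask untouched when the node is unknown
 * or none of its CPUs are present in the mask.
 */
static void restrict_managed_irq_to_dev_node(struct device *dev,
					     struct cpumask *mask)
{
	int node = dev_to_node(dev);

	if (node == NUMA_NO_NODE)
		return;

	if (!cpumask_intersects(mask, cpumask_of_node(node)))
		return;	/* keep the original spread as a fallback */

	cpumask_and(mask, mask, cpumask_of_node(node));
}

The actual patches below instead touch the irq descriptor allocation and the
x86 vector matrix code; see the diffstat at the end.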
For example, here is a machine with a virtio-blk PCI device on NUMA node 1:

  # lstopo-no-graphics
  Machine (958MB total)
    Package L#0
      NUMANode L#0 (P#0 491MB)
      L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    Package L#1
      NUMANode L#1 (P#1 466MB)
      L3 L#1 (16MB) + L2 L#1 (4096KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      HostBridge
        PCIBridge
          PCI c9:00.0 (SCSI)
            Block "vdb"
    HostBridge
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0"
      PCIBridge
        PCI 05:00.0 (SCSI)
          Block "vda"
      PCI 00:1f.2 (SATA)

Currently the virtio5-req.0 IRQ for the vdb device gets assigned to CPU 0:

  # cat /proc/interrupts
             CPU0       CPU1
  ...
   36:          0          0   PCI-MSI 105381888-edge      virtio5-config
   37:         81          0   PCI-MSI 105381889-edge      virtio5-req.0

If managed IRQ assignment takes the device's NUMA node into account then
CPU 1 is used instead:

  # cat /proc/interrupts
             CPU0       CPU1
  ...
   36:          0          0   PCI-MSI 105381888-edge      virtio5-config
   37:          0         92   PCI-MSI 105381889-edge      virtio5-req.0

Running the fio benchmark (4KB random reads) on CPU 1 increases IOPS by 58%:

  Name    IOPS      Error
  Before  26720.59  ± 0.28%
  After   42373.79  ± 0.54%

Most of this improvement is not due to NUMA itself but simply because the
requests complete on the same CPU where they were submitted. However, if the
IRQ is on CPU 0 and fio also runs on CPU 0, only 39600 IOPS is achieved, not
the full 42373 IOPS that we get when NUMA affinity is honored. So it is worth
taking NUMA into account to achieve maximum performance.

The following patches are a hack that uses the device's NUMA node when
assigning managed IRQs. They are not mergeable but I hope they will help
start the discussion. One bug is that they affect all managed IRQs, even for
devices with many IRQs where spreading across all CPUs is a good policy.

Please let me know what you think:

1. Is there a reason why managed IRQs should *not* take NUMA into account
   that I've missed?

2. Is there a better place to implement this logic? For example,
   pci_alloc_irq_vectors_affinity(), where the cpumasks are calculated
   (a rough usage sketch of that API is appended below).

Any suggestions on how to proceed would be appreciated. Thanks!

Stefan Hajnoczi (2):
  genirq: honor device NUMA node when allocating descs
  genirq/matrix: take NUMA into account for managed IRQs

 include/linux/irq.h           |  2 +-
 arch/x86/kernel/apic/vector.c |  3 ++-
 kernel/irq/irqdesc.c          |  3 ++-
 kernel/irq/matrix.c           | 16 ++++++++++++----
 4 files changed, 17 insertions(+), 7 deletions(-)

-- 
2.26.2
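For reference on question 2 above, this is roughly how a driver invokes that
entry point today (a hypothetical snippet, not from this series; only
pci_alloc_irq_vectors_affinity(), struct irq_affinity, the PCI_IRQ_* flags
and pci_irq_get_affinity() are existing APIs, the rest is illustrative):

#include <linux/interrupt.h>
#include <linux/pci.h>

/*
 * Hypothetical driver setup: allocate one unmanaged config vector plus
 * nr_queues managed vectors. The cpumasks of the managed vectors are
 * calculated by the affinity spreading code that these patches would
 * influence.
 */
static int example_setup_irqs(struct pci_dev *pdev, unsigned int nr_queues)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* config interrupt, not spread */
	};
	int nvecs;

	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	if (nvecs < 0)
		return nvecs;

	/* pci_irq_get_affinity(pdev, i) reports each vector's cpumask. */
	return nvecs;
}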