iommu/arm-smmu-v3: expose numa_node attribute to users in sysfs

Message ID 20200530091505.56664-1-song.bao.hua@hisilicon.com (mailing list archive)
State New, archived
Headers show
Series iommu/arm-smmu-v3: expose numa_node attribute to users in sysfs

Commit Message

Song Bao Hua (Barry Song) May 30, 2020, 9:15 a.m. UTC
Tests show that the latency of dma_unmap can increase dramatically
when it is called across NUMA nodes, especially across CPU packages,
e.g. 300ns vs 800ns while waiting for the completion of CMD_SYNC in
an empty command queue. The large latency caused by a remote node
will in turn make contention on the command queue more serious, and
enlarge the latency of DMA users within the local NUMA node.

Users might want to enforce NUMA locality, taking the position of the
SMMU into account. This patch provides a minor benefit by presenting
this information to users directly, as they might want to know it
without checking the hardware spec at all.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 drivers/iommu/arm-smmu-v3.c | 40 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

Comments

Robin Murphy June 1, 2020, 1:13 p.m. UTC | #1
On 2020-05-30 10:15, Barry Song wrote:
> Tests show that the latency of dma_unmap can increase dramatically
> when it is called across NUMA nodes, especially across CPU packages,
> e.g. 300ns vs 800ns while waiting for the completion of CMD_SYNC in
> an empty command queue. The large latency caused by a remote node
> will in turn make contention on the command queue more serious, and
> enlarge the latency of DMA users within the local NUMA node.
> 
> Users might want to enforce NUMA locality, taking the position of the
> SMMU into account. This patch provides a minor benefit by presenting
> this information to users directly, as they might want to know it
> without checking the hardware spec at all.

Hmm, given that dev_to_node() is a standard driver model thing, is there 
not already some generic device property that can expose it - and if 
not, should there be? Presumably if userspace cares enough to want to 
know whereabouts in the system an IOMMU is, it probably also cares where 
the actual endpoint devices are too.

At the very least, it doesn't seem right for it to be specific to one 
single IOMMU driver.
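
Just to illustrate (an entirely untested sketch, and the function
names are mine): a generic version could live in the driver core, so
that every device with a known node gets the attribute for free:

static ssize_t numa_node_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	return sprintf(buf, "%d\n", dev_to_node(dev));
}
static DEVICE_ATTR_RO(numa_node);

static umode_t dev_numa_attr_is_visible(struct kobject *kobj,
					struct attribute *a, int n)
{
	struct device *dev = kobj_to_dev(kobj);

	/* hide the file when NUMA is off or the node is unknown */
	if (!IS_ENABLED(CONFIG_NUMA) || dev_to_node(dev) == NUMA_NO_NODE)
		return 0;

	return a->mode;
}

That would cover the endpoint devices as well as the IOMMUs.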

Robin.

> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>   drivers/iommu/arm-smmu-v3.c | 40 ++++++++++++++++++++++++++++++++++++-
>   1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 82508730feb7..754c4d59498b 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -4021,6 +4021,44 @@ err_reset_pci_ops: __maybe_unused;
>   	return err;
>   }
>   
> +static ssize_t numa_node_show(struct device *dev,
> +		struct device_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%d\n", dev_to_node(dev));
> +}
> +static DEVICE_ATTR_RO(numa_node);
> +
> +static umode_t arm_smmu_numa_attr_visible(struct kobject *kobj, struct attribute *a,
> +		int n)
> +{
> +	struct device *dev = container_of(kobj, typeof(*dev), kobj);
> +
> +	if (!IS_ENABLED(CONFIG_NUMA))
> +		return 0;
> +
> +	if (a == &dev_attr_numa_node.attr &&
> +			dev_to_node(dev) == NUMA_NO_NODE)
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +static struct attribute *arm_smmu_dev_attrs[] = {
> +	&dev_attr_numa_node.attr,
> +	NULL
> +};
> +
> +static struct attribute_group arm_smmu_dev_attrs_group = {
> +	.attrs          = arm_smmu_dev_attrs,
> +	.is_visible     = arm_smmu_numa_attr_visible,
> +};
> +
> +
> +static const struct attribute_group *arm_smmu_dev_attrs_groups[] = {
> +	&arm_smmu_dev_attrs_group,
> +	NULL,
> +};
> +
>   static int arm_smmu_device_probe(struct platform_device *pdev)
>   {
>   	int irq, ret;
> @@ -4097,7 +4135,7 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
>   		return ret;
>   
>   	/* And we're up. Go go go! */
> -	ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
> +	ret = iommu_device_sysfs_add(&smmu->iommu, dev, arm_smmu_dev_attrs_groups,
>   				     "smmu3.%pa", &ioaddr);
>   	if (ret)
>   		return ret;
>
Song Bao Hua (Barry Song) June 1, 2020, 8:43 p.m. UTC | #2
> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy@arm.com]
> Sent: Tuesday, June 2, 2020 1:14 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; will@kernel.org;
> hch@lst.de; m.szyprowski@samsung.com; iommu@lists.linux-foundation.org
> Cc: Linuxarm <linuxarm@huawei.com>; linux-arm-kernel@lists.infradead.org
> Subject: Re: [PATCH] iommu/arm-smmu-v3: expose numa_node attribute to
> users in sysfs
> 
> On 2020-05-30 10:15, Barry Song wrote:
> > Tests show that the latency of dma_unmap can increase dramatically
> > when it is called across NUMA nodes, especially across CPU packages,
> > e.g. 300ns vs 800ns while waiting for the completion of CMD_SYNC in
> > an empty command queue. The large latency caused by a remote node
> > will in turn make contention on the command queue more serious, and
> > enlarge the latency of DMA users within the local NUMA node.
> >
> > Users might want to enforce NUMA locality, taking the position of the
> > SMMU into account. This patch provides a minor benefit by presenting
> > this information to users directly, as they might want to know it
> > without checking the hardware spec at all.
> 
> Hmm, given that dev_to_node() is a standard driver model thing, is there
> not already some generic device property that can expose it - and if
> not, should there be? Presumably if userspace cares enough to want to
> know whereabouts in the system an IOMMU is, it probably also cares where
> the actual endpoint devices are too.
> 
> At the very least, it doesn't seem right for it to be specific to one
> single IOMMU driver.

Right now, PCI devices generally get the numa_node attribute in sysfs via drivers/pci/pci-sysfs.c:

static ssize_t numa_node_store(struct device *dev,
                               struct device_attribute *attr, const char *buf,
                               size_t count)
{
        ...

        add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
        pci_alert(pdev, FW_BUG "Overriding NUMA node to %d.  Contact your vendor for updates.",
                  node);

        dev->numa_node = node;
        return count;
}

static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
                              char *buf)
{
        return sprintf(buf, "%d\n", dev->numa_node);
}
static DEVICE_ATTR_RW(numa_node);

For other devices that care about NUMA information, the specific drivers do it themselves, for example:

drivers/dax/bus.c:      if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA))
drivers/dax/bus.c:      &dev_attr_numa_node.attr,
drivers/dma/idxd/sysfs.c:       &dev_attr_numa_node.attr,
drivers/hv/vmbus_drv.c: &dev_attr_numa_node.attr,
drivers/nvdimm/bus.c:   &dev_attr_numa_node.attr,
drivers/nvme/host/core.c:       &dev_attr_numa_node.attr,

The SMMU is usually a platform device, so we could actually expose numa_node for platform_device, or globally
expose numa_node for the generic "device" if people don't object.
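
For the platform_device option, the wiring could be as simple as the
below - an untested sketch against drivers/base/platform.c, where
platform_dev_numa_group is a made-up name for an attribute group like
the one in this patch:

static const struct attribute_group *platform_dev_groups[] = {
	&platform_dev_numa_group,
	NULL,
};

struct bus_type platform_bus_type = {
	.name		= "platform",
	.dev_groups	= platform_dev_groups,
	.match		= platform_match,
	.uevent		= platform_uevent,
	.pm		= &platform_dev_pm_ops,
};

Then every platform device whose node is known would get a numa_node
entry with no per-driver code at all.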

Barry

> 
> Robin.
> 
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> >   drivers/iommu/arm-smmu-v3.c | 40
> ++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 39 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 82508730feb7..754c4d59498b 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -4021,6 +4021,44 @@ err_reset_pci_ops: __maybe_unused;
> >   	return err;
> >   }
> >
> > +static ssize_t numa_node_show(struct device *dev,
> > +		struct device_attribute *attr, char *buf)
> > +{
> > +	return sprintf(buf, "%d\n", dev_to_node(dev));
> > +}
> > +static DEVICE_ATTR_RO(numa_node);
> > +
> > +static umode_t arm_smmu_numa_attr_visible(struct kobject *kobj, struct
> attribute *a,
> > +		int n)
> > +{
> > +	struct device *dev = container_of(kobj, typeof(*dev), kobj);
> > +
> > +	if (!IS_ENABLED(CONFIG_NUMA))
> > +		return 0;
> > +
> > +	if (a == &dev_attr_numa_node.attr &&
> > +			dev_to_node(dev) == NUMA_NO_NODE)
> > +		return 0;
> > +
> > +	return a->mode;
> > +}
> > +
> > +static struct attribute *arm_smmu_dev_attrs[] = {
> > +	&dev_attr_numa_node.attr,
> > +	NULL
> > +};
> > +
> > +static struct attribute_group arm_smmu_dev_attrs_group = {
> > +	.attrs          = arm_smmu_dev_attrs,
> > +	.is_visible     = arm_smmu_numa_attr_visible,
> > +};
> > +
> > +
> > +static const struct attribute_group *arm_smmu_dev_attrs_groups[] = {
> > +	&arm_smmu_dev_attrs_group,
> > +	NULL,
> > +};
> > +
> >   static int arm_smmu_device_probe(struct platform_device *pdev)
> >   {
> >   	int irq, ret;
> > @@ -4097,7 +4135,7 @@ static int arm_smmu_device_probe(struct
> platform_device *pdev)
> >   		return ret;
> >
> >   	/* And we're up. Go go go! */
> > -	ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
> > +	ret = iommu_device_sysfs_add(&smmu->iommu, dev,
> arm_smmu_dev_attrs_groups,
> >   				     "smmu3.%pa", &ioaddr);
> >   	if (ret)
> >   		return ret;
> >
Will Deacon July 3, 2020, 4:21 p.m. UTC | #3
On Sat, May 30, 2020 at 09:15:05PM +1200, Barry Song wrote:
> Tests show that the latency of dma_unmap can increase dramatically
> when it is called across NUMA nodes, especially across CPU packages,
> e.g. 300ns vs 800ns while waiting for the completion of CMD_SYNC in
> an empty command queue. The large latency caused by a remote node
> will in turn make contention on the command queue more serious, and
> enlarge the latency of DMA users within the local NUMA node.
> 
> Users might want to enforce NUMA locality, taking the position of the
> SMMU into account. This patch provides a minor benefit by presenting
> this information to users directly, as they might want to know it
> without checking the hardware spec at all.

I don't think that's a very good reason to expose things to userspace.
I know sysfs shouldn't be treated as ABI, but the grim reality is that
once somebody relies on this stuff then we can't change it, so I'd
rather avoid exposing it unless it's absolutely necessary.

Thanks,

Will
Song Bao Hua (Barry Song) July 5, 2020, 9:53 a.m. UTC | #4
> -----Original Message-----
> From: Will Deacon [mailto:will@kernel.org]
> Sent: Saturday, July 4, 2020 4:22 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: robin.murphy@arm.com; hch@lst.de; m.szyprowski@samsung.com;
> iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
> Linuxarm <linuxarm@huawei.com>
> Subject: Re: [PATCH] iommu/arm-smmu-v3: expose numa_node attribute to
> users in sysfs
> 
> On Sat, May 30, 2020 at 09:15:05PM +1200, Barry Song wrote:
> > Tests show that the latency of dma_unmap can increase dramatically
> > when it is called across NUMA nodes, especially across CPU packages,
> > e.g. 300ns vs 800ns while waiting for the completion of CMD_SYNC in
> > an empty command queue. The large latency caused by a remote node
> > will in turn make contention on the command queue more serious, and
> > enlarge the latency of DMA users within the local NUMA node.
> >
> > Users might want to enforce NUMA locality, taking the position of the
> > SMMU into account. This patch provides a minor benefit by presenting
> > this information to users directly, as they might want to know it
> > without checking the hardware spec at all.
> 
> I don't think that's a very good reason to expose things to userspace.
> I know sysfs shouldn't be treated as ABI, but the grim reality is that
> once somebody relies on this stuff then we can't change it, so I'd
> rather avoid exposing it unless it's absolutely necessary.

Will, thanks for taking a look!

I am not sure if it is absolutely necessary, but it is useful to users. The whole story started
from some users who wanted to see the hardware topology clearly by reading sysfs nodes,
just like they are able to do for PCI devices. The intention is that users can learn the
hardware topology of various devices easily from Linux, since they may not know all the
hardware details.

For PCI devices, the kernel has done that. And there are some other drivers outside PCI
exposing numa_node as well. It seems hard to say it is absolutely necessary for them
either, since sysfs shouldn't be treated as ABI.

I got some input from Linux users who also wanted to know the NUMA node of
other devices which are not PCI, for example, platform devices. I thought the
requirement was reasonable, so I also sent another patch to support this kind of
requirement generally; with the patch below, this SMMU patch is not necessary
any more:
https://lkml.org/lkml/2020/6/18/1257

For platform devices created by ARM ACPI/IORT and the generic acpi_create_platform_device() in
drivers/acpi/scan.c:
static void acpi_default_enumeration(struct acpi_device *device)
{
	...
	if (!device->flags.enumeration_by_parent) {
		acpi_create_platform_device(device, NULL);
		acpi_device_set_enumerated(device);
	}
}

struct platform_device *acpi_create_platform_device(struct acpi_device *adev,
					struct property_entry *properties)
{
	...

	pdev = platform_device_register_full(&pdevinfo);
	if (IS_ERR(pdev))
		...
	else {
		set_dev_node(&pdev->dev, acpi_get_node(adev->handle));
		...
	}
	...
}
numa_node is set for these devices.
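
And once the node is set, in-kernel users can already exploit the
locality; roughly like the below (illustrative fragment only, the
helper name is made up):

/* keep a driver's descriptors on the same node as its device */
static void *alloc_on_device_node(struct device *dev, size_t size)
{
	return kzalloc_node(size, GFP_KERNEL, dev_to_node(dev));
}

Exposing numa_node in sysfs only adds the last step of making the same
node number visible to userspace.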

Anyway, I just wanted to explain the background: some people want to learn the
hardware topology from Linux in the same simple way. And it seems a reasonable
requirement to me :-)

> 
> Thanks,
> 
> Will

Thanks
barry
Jonathan Cameron July 6, 2020, 8:26 a.m. UTC | #5
+CC Brice.  

On Sun, 5 Jul 2020 09:53:58 +0000
"Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com> wrote:

> > -----Original Message-----
> > From: Will Deacon [mailto:will@kernel.org]
> > Sent: Saturday, July 4, 2020 4:22 AM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> > Cc: robin.murphy@arm.com; hch@lst.de; m.szyprowski@samsung.com;
> > iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
> > Linuxarm <linuxarm@huawei.com>
> > Subject: Re: [PATCH] iommu/arm-smmu-v3: expose numa_node attribute to
> > users in sysfs
> > 
> > On Sat, May 30, 2020 at 09:15:05PM +1200, Barry Song wrote:  
> > > Tests show that the latency of dma_unmap can increase dramatically
> > > when it is called across NUMA nodes, especially across CPU packages,
> > > e.g. 300ns vs 800ns while waiting for the completion of CMD_SYNC in
> > > an empty command queue. The large latency caused by a remote node
> > > will in turn make contention on the command queue more serious, and
> > > enlarge the latency of DMA users within the local NUMA node.
> > >
> > > Users might want to enforce NUMA locality, taking the position of the
> > > SMMU into account. This patch provides a minor benefit by presenting
> > > this information to users directly, as they might want to know it
> > > without checking the hardware spec at all.
> > 
> > I don't think that's a very good reason to expose things to userspace.
> > I know sysfs shouldn't be treated as ABI, but the grim reality is that
> > once somebody relies on this stuff then we can't change it, so I'd
> > rather avoid exposing it unless it's absolutely necessary.  
> 
> Will, thanks for taking a look!
> 
> I am not sure if it is absolutely necessary, but it is useful to users. The whole story started
> from some users who wanted to see the hardware topology clearly by reading sysfs nodes,
> just like they are able to do for PCI devices. The intention is that users can learn the
> hardware topology of various devices easily from Linux, since they may not know all the
> hardware details.
> 
> For PCI devices, the kernel has done that. And there are some other drivers outside PCI
> exposing numa_node as well. It seems hard to say it is absolutely necessary for them
> either, since sysfs shouldn't be treated as ABI.
Brice,

Given hwloc is probably the most demanding user of topology information
currently...

How useful would this info be for hwloc and hwloc users?
Sort of feels like it might be useful in some cases.

The very brief description of what we have here is exposing the NUMA node
of an IOMMU. The discussion also broadened into whether it just makes sense
to expose this for all platform devices, or even to do it at the generic device level.

Jonathan


> 
> I got some input from Linux users who also wanted to know the NUMA node of
> other devices which are not PCI, for example, platform devices. I thought the
> requirement was reasonable, so I also sent another patch to support this kind of
> requirement generally; with the patch below, this SMMU patch is not necessary
> any more:
> https://lkml.org/lkml/2020/6/18/1257
> 
> For platform devices created by ARM ACPI/IORT and the generic acpi_create_platform_device() in
> drivers/acpi/scan.c:
> static void acpi_default_enumeration(struct acpi_device *device)
> {
> 	...
> 	if (!device->flags.enumeration_by_parent) {
> 		acpi_create_platform_device(device, NULL);
> 		acpi_device_set_enumerated(device);
> 	}
> }
> 
> struct platform_device *acpi_create_platform_device(struct acpi_device *adev,
> 					struct property_entry *properties)
> {
> 	...
> 
> 	pdev = platform_device_register_full(&pdevinfo);
> 	if (IS_ERR(pdev))
> 		...
> 	else {
> 		set_dev_node(&pdev->dev, acpi_get_node(adev->handle));
> 		...
> 	}
> 	...
> }
> numa_node is set for these devices.
> 
> Anyway, I just wanted to explain the background: some people want to learn the
> hardware topology from Linux in the same simple way. And it seems a reasonable
> requirement to me :-)
> 
> > 
> > Thanks,
> > 
> > Will  
> 
> Thanks
> barry
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Brice Goglin July 8, 2020, 6:28 a.m. UTC | #6
On 06/07/2020 10:26, Jonathan Cameron wrote:
> On Sun, 5 Jul 2020 09:53:58 +0000
> "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com> wrote:
>
>>> -----Original Message-----
>>> From: Will Deacon [mailto:will@kernel.org]
>>> Sent: Saturday, July 4, 2020 4:22 AM
>>> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
>>> Cc: robin.murphy@arm.com; hch@lst.de; m.szyprowski@samsung.com;
>>> iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
>>> Linuxarm <linuxarm@huawei.com>
>>> Subject: Re: [PATCH] iommu/arm-smmu-v3: expose numa_node attribute to
>>> users in sysfs
>>>
>>> On Sat, May 30, 2020 at 09:15:05PM +1200, Barry Song wrote:  
>>>> Tests show that the latency of dma_unmap can increase dramatically
>>>> when it is called across NUMA nodes, especially across CPU packages,
>>>> e.g. 300ns vs 800ns while waiting for the completion of CMD_SYNC in
>>>> an empty command queue. The large latency caused by a remote node
>>>> will in turn make contention on the command queue more serious, and
>>>> enlarge the latency of DMA users within the local NUMA node.
>>>>
>>>> Users might want to enforce NUMA locality, taking the position of the
>>>> SMMU into account. This patch provides a minor benefit by presenting
>>>> this information to users directly, as they might want to know it
>>>> without checking the hardware spec at all.
>>> I don't think that's a very good reason to expose things to userspace.
>>> I know sysfs shouldn't be treated as ABI, but the grim reality is that
>>> once somebody relies on this stuff then we can't change it, so I'd
>>> rather avoid exposing it unless it's absolutely necessary.  
>> Will, thanks for taking a look!
>>
>> I am not sure if it is absolutely necessary, but it is useful to users. The whole story started
>> from some users who wanted to see the hardware topology clearly by reading sysfs nodes,
>> just like they are able to do for PCI devices. The intention is that users can learn the
>> hardware topology of various devices easily from Linux, since they may not know all the
>> hardware details.
>>
>> For PCI devices, the kernel has done that. And there are some other drivers outside PCI
>> exposing numa_node as well. It seems hard to say it is absolutely necessary for them
>> either, since sysfs shouldn't be treated as ABI.
> Brice,
>
> Given hwloc is probably the most demanding user of topology information
> currently...
>
> How useful would this info be for hwloc and hwloc users?
> Sort of feels like it might be useful in some cases.
>
> The very brief description of what we have here is exposing the NUMA node
> of an IOMMU. The discussion also broadened into whether it just makes sense
> to expose this for all platform devices, or even to do it at the generic device level.


Hello

We don't have anything about IOMMU in hwloc so far, likely because its
locality never mattered in the past? I guess we'll get some user
requests for it once more platforms show this issue and some
performance-critical applications are not happy with it.

Can you clarify what the whole machine topology looks like? Are we
talking about some PCI devices being attached to one socket but talking
to the IOMMU of the other socket?

Brice

Patch

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 82508730feb7..754c4d59498b 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -4021,6 +4021,44 @@  err_reset_pci_ops: __maybe_unused;
 	return err;
 }
 
+static ssize_t numa_node_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", dev_to_node(dev));
+}
+static DEVICE_ATTR_RO(numa_node);
+
+static umode_t arm_smmu_numa_attr_visible(struct kobject *kobj, struct attribute *a,
+		int n)
+{
+	struct device *dev = container_of(kobj, typeof(*dev), kobj);
+
+	if (!IS_ENABLED(CONFIG_NUMA))
+		return 0;
+
+	if (a == &dev_attr_numa_node.attr &&
+			dev_to_node(dev) == NUMA_NO_NODE)
+		return 0;
+
+	return a->mode;
+}
+
+static struct attribute *arm_smmu_dev_attrs[] = {
+	&dev_attr_numa_node.attr,
+	NULL
+};
+
+static struct attribute_group arm_smmu_dev_attrs_group = {
+	.attrs          = arm_smmu_dev_attrs,
+	.is_visible     = arm_smmu_numa_attr_visible,
+};
+
+
+static const struct attribute_group *arm_smmu_dev_attrs_groups[] = {
+	&arm_smmu_dev_attrs_group,
+	NULL,
+};
+
 static int arm_smmu_device_probe(struct platform_device *pdev)
 {
 	int irq, ret;
@@ -4097,7 +4135,7 @@  static int arm_smmu_device_probe(struct platform_device *pdev)
 		return ret;
 
 	/* And we're up. Go go go! */
-	ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
+	ret = iommu_device_sysfs_add(&smmu->iommu, dev, arm_smmu_dev_attrs_groups,
 				     "smmu3.%pa", &ioaddr);
 	if (ret)
 		return ret;