Message ID | 20200630001051.12350-2-vdumpa@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Nvidia Arm SMMUv2 Implementation | expand |
On Mon, Jun 29, 2020 at 05:10:49PM -0700, Krishna Reddy wrote: > NVIDIA's Tegra194 SoC uses two ARM MMU-500s together to interleave > IOVA accesses across them. > Add NVIDIA implementation for dual ARM MMU-500s and add new compatible > string for Tegra194 SoC SMMU topology. > > Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> Reviewed-by: Nicolin Chen <nicoleotsuka@gmail.com>
> NVIDIA's Tegra194 SoC uses two ARM MMU-500s together to interleave IOVA > accesses across them. > Add NVIDIA implementation for dual ARM MMU-500s and add new compatible > string for Tegra194 SoC SMMU topology. > > Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> Reviewed-by: Pritesh Raithatha <praithatha@nvidia.com>
On 30/06/2020 01:10, Krishna Reddy wrote: > NVIDIA's Tegra194 SoC uses two ARM MMU-500s together to interleave > IOVA accesses across them. > Add NVIDIA implementation for dual ARM MMU-500s and add new compatible > string for Tegra194 SoC SMMU topology. There is no description here of the 3rd SMMU that you mention below. I think that we should describe the full picture here. > Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> > --- > MAINTAINERS | 2 + > drivers/iommu/Makefile | 2 +- > drivers/iommu/arm-smmu-impl.c | 3 + > drivers/iommu/arm-smmu-nvidia.c | 196 ++++++++++++++++++++++++++++++++ > drivers/iommu/arm-smmu.h | 1 + > 5 files changed, 203 insertions(+), 1 deletion(-) > create mode 100644 drivers/iommu/arm-smmu-nvidia.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index 7b5ffd646c6b9..64c37dbdd4426 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -16808,8 +16808,10 @@ F: drivers/i2c/busses/i2c-tegra.c > > TEGRA IOMMU DRIVERS > M: Thierry Reding <thierry.reding@gmail.com> > +R: Krishna Reddy <vdumpa@nvidia.com> > L: linux-tegra@vger.kernel.org > S: Supported > +F: drivers/iommu/arm-smmu-nvidia.c > F: drivers/iommu/tegra* > > TEGRA KBC DRIVER > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile > index 342190196dfb0..2b8203db73ec3 100644 > --- a/drivers/iommu/Makefile > +++ b/drivers/iommu/Makefile > @@ -15,7 +15,7 @@ obj-$(CONFIG_AMD_IOMMU) += amd/iommu.o amd/init.o amd/quirks.o > obj-$(CONFIG_AMD_IOMMU_DEBUGFS) += amd/debugfs.o > obj-$(CONFIG_AMD_IOMMU_V2) += amd/iommu_v2.o > obj-$(CONFIG_ARM_SMMU) += arm_smmu.o > -arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-qcom.o > +arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-nvidia.o arm-smmu-qcom.o > obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o > obj-$(CONFIG_DMAR_TABLE) += intel/dmar.o > obj-$(CONFIG_INTEL_IOMMU) += intel/iommu.o intel/pasid.o > diff --git a/drivers/iommu/arm-smmu-impl.c b/drivers/iommu/arm-smmu-impl.c > index c75b9d957b702..70f7318017617 100644 > --- a/drivers/iommu/arm-smmu-impl.c > +++ b/drivers/iommu/arm-smmu-impl.c > @@ -171,6 +171,9 @@ struct arm_smmu_device *arm_smmu_impl_init(struct arm_smmu_device *smmu) > if (of_property_read_bool(np, "calxeda,smmu-secure-config-access")) > smmu->impl = &calxeda_impl; > > + if (of_device_is_compatible(smmu->dev->of_node, "nvidia,tegra194-smmu")) > + return nvidia_smmu_impl_init(smmu); > + > if (of_device_is_compatible(np, "qcom,sdm845-smmu-500") || > of_device_is_compatible(np, "qcom,sc7180-smmu-500")) > return qcom_smmu_impl_init(smmu); > diff --git a/drivers/iommu/arm-smmu-nvidia.c b/drivers/iommu/arm-smmu-nvidia.c > new file mode 100644 > index 0000000000000..1124f0ac1823a > --- /dev/null > +++ b/drivers/iommu/arm-smmu-nvidia.c > @@ -0,0 +1,196 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +// NVIDIA ARM SMMU v2 implementation quirks > +// Copyright (C) 2019-2020 NVIDIA CORPORATION. All rights reserved. > + > +#include <linux/bitfield.h> > +#include <linux/delay.h> > +#include <linux/of.h> > +#include <linux/platform_device.h> > +#include <linux/slab.h> > + > +#include "arm-smmu.h" > + > +/* > + * Tegra194 has three ARM MMU-500 Instances. > + * Two of them are used together for interleaved IOVA accesses and > + * used by non-isochronous HW devices for SMMU translations. > + * Third one is used for SMMU translations from isochronous HW devices. > + * It is possible to use this implementation to program either > + * all three or two of the instances identically as desired through > + * DT node. > + * > + * Programming all the three instances identically comes with redundant TLB > + * invalidations as all three never need to be TLB invalidated for a HW device. > + * > + * When Linux kernel supports multiple SMMU devices, the SMMU device used for > + * isochornous HW devices should be added as a separate ARM MMU-500 device > + * in DT and be programmed independently for efficient TLB invalidates. > + */ > +#define MAX_SMMU_INSTANCES 3 > + > +#define TLB_LOOP_TIMEOUT_IN_US 1000000 /* 1s! */ > +#define TLB_SPIN_COUNT 10 > + > +struct nvidia_smmu { > + struct arm_smmu_device smmu; > + unsigned int num_inst; > + void __iomem *bases[MAX_SMMU_INSTANCES]; > +}; > + > +static inline struct nvidia_smmu *to_nvidia_smmu(struct arm_smmu_device *smmu) > +{ > + return container_of(smmu, struct nvidia_smmu, smmu); > +} > + > +static inline void __iomem *nvidia_smmu_page(struct arm_smmu_device *smmu, > + unsigned int inst, int page) If you run checkpatch --strict on these you will get a lot of ... CHECK: Alignment should match open parenthesis #116: FILE: drivers/iommu/arm-smmu-nvidia.c:46: +static inline void __iomem *nvidia_smmu_page(struct arm_smmu_device *smmu, + unsigned int inst, int page) We should fix these. > +{ > + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); > + > + if (!nvidia_smmu->bases[0]) > + nvidia_smmu->bases[0] = smmu->base; > + > + return nvidia_smmu->bases[inst] + (page << smmu->pgshift); > +} > + > +static u32 nvidia_smmu_read_reg(struct arm_smmu_device *smmu, > + int page, int offset) > +{ > + void __iomem *reg = nvidia_smmu_page(smmu, 0, page) + offset; > + > + return readl_relaxed(reg); > +} > + > +static void nvidia_smmu_write_reg(struct arm_smmu_device *smmu, > + int page, int offset, u32 val) > +{ > + unsigned int i; > + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); > + > + for (i = 0; i < nvidia_smmu->num_inst; i++) { > + void __iomem *reg = nvidia_smmu_page(smmu, i, page) + offset; Personally, I would declare 'reg' outside of the loop as I feel it will make the code cleaner and easier to read. > + > + writel_relaxed(val, reg); > + } > +} > + > +static u64 nvidia_smmu_read_reg64(struct arm_smmu_device *smmu, > + int page, int offset) > +{ > + void __iomem *reg = nvidia_smmu_page(smmu, 0, page) + offset; > + > + return readq_relaxed(reg); > +} > + > +static void nvidia_smmu_write_reg64(struct arm_smmu_device *smmu, > + int page, int offset, u64 val) > +{ > + unsigned int i; > + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); > + > + for (i = 0; i < nvidia_smmu->num_inst; i++) { > + void __iomem *reg = nvidia_smmu_page(smmu, i, page) + offset; > + > + writeq_relaxed(val, reg); > + } > +} > + > +static void nvidia_smmu_tlb_sync(struct arm_smmu_device *smmu, int page, > + int sync, int status) > +{ > + unsigned int delay; > + > + arm_smmu_writel(smmu, page, sync, 0); > + > + for (delay = 1; delay < TLB_LOOP_TIMEOUT_IN_US; delay *= 2) { So we are doubling the delay every time? Is this better than just using the same on each loop? > + unsigned int spin_cnt; > + > + for (spin_cnt = TLB_SPIN_COUNT; spin_cnt > 0; spin_cnt--) { > + u32 val = 0; > + unsigned int i; > + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); Why not do this once at the beginning of the function? > + > + for (i = 0; i < nvidia_smmu->num_inst; i++) { > + void __iomem *reg = > + nvidia_smmu_page(smmu, i, page) + status; > + > + val |= readl_relaxed(reg); > + } > + > + if (!(val & ARM_SMMU_sTLBGSTATUS_GSACTIVE)) > + return; > + > + cpu_relax(); > + } > + > + udelay(delay); > + } > + > + dev_err_ratelimited(smmu->dev, > + "TLB sync timed out -- SMMU may be deadlocked\n"); > +} > + > +static int nvidia_smmu_reset(struct arm_smmu_device *smmu) > +{ > + unsigned int i; > + > + for (i = 0; i < to_nvidia_smmu(smmu)->num_inst; i++) { > + u32 val; > + void __iomem *reg = nvidia_smmu_page(smmu, i, ARM_SMMU_GR0) + > + ARM_SMMU_GR0_sGFSR; I feel that declaring variables here clutters the code. > + > + /* clear global FSR */ > + val = readl_relaxed(reg); > + writel_relaxed(val, reg); > + } > + > + return 0; > +} > + > +static const struct arm_smmu_impl nvidia_smmu_impl = { > + .read_reg = nvidia_smmu_read_reg, > + .write_reg = nvidia_smmu_write_reg, > + .read_reg64 = nvidia_smmu_read_reg64, > + .write_reg64 = nvidia_smmu_write_reg64, > + .reset = nvidia_smmu_reset, > + .tlb_sync = nvidia_smmu_tlb_sync, > +}; > + > +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu) > +{ > + unsigned int i; > + struct nvidia_smmu *nvidia_smmu; > + struct platform_device *pdev = to_platform_device(smmu->dev); > + > + nvidia_smmu = devm_kzalloc(smmu->dev, sizeof(*nvidia_smmu), GFP_KERNEL); > + if (!nvidia_smmu) > + return ERR_PTR(-ENOMEM); > + > + nvidia_smmu->smmu = *smmu; > + /* Instance 0 is ioremapped by arm-smmu.c after this function returns */ > + nvidia_smmu->num_inst = 1; > + > + for (i = 1; i < MAX_SMMU_INSTANCES; i++) { > + struct resource *res; > + > + res = platform_get_resource(pdev, IORESOURCE_MEM, i); > + if (!res) > + break; > + > + nvidia_smmu->bases[i] = devm_ioremap_resource(smmu->dev, res); > + if (IS_ERR(nvidia_smmu->bases[i])) > + return ERR_CAST(nvidia_smmu->bases[i]); > + > + nvidia_smmu->num_inst++; > + } > + > + nvidia_smmu->smmu.impl = &nvidia_smmu_impl; > + /* > + * Free the arm_smmu_device struct allocated in arm-smmu.c. > + * Once this function returns, arm-smmu.c would use arm_smmu_device > + * allocated as part of nvidia_smmu struct. > + */ > + devm_kfree(smmu->dev, smmu); Why don't we just store the pointer of the smmu struct passed to this function in the nvidia_smmu struct and then we do not need to free this here. In other words make ... struct nvidia_smmu { struct arm_smmu_device *smmu; unsigned int num_inst; void __iomem *bases[MAX_SMMU_INSTANCES]; }; This seems more appropriate, than copying the struct and freeing memory allocated else-where. > + > + return &nvidia_smmu->smmu; > +} > diff --git a/drivers/iommu/arm-smmu.h b/drivers/iommu/arm-smmu.h > index d172c024be618..8cf1511ed9874 100644 > --- a/drivers/iommu/arm-smmu.h > +++ b/drivers/iommu/arm-smmu.h > @@ -450,6 +450,7 @@ static inline void arm_smmu_writeq(struct arm_smmu_device *smmu, int page, > arm_smmu_writeq((s), ARM_SMMU_CB((s), (n)), (o), (v)) > > struct arm_smmu_device *arm_smmu_impl_init(struct arm_smmu_device *smmu); > +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu); > struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu); > > int arm_mmu500_reset(struct arm_smmu_device *smmu); > Cheers Jon
On 30/06/2020 01:10, Krishna Reddy wrote: > NVIDIA's Tegra194 SoC uses two ARM MMU-500s together to interleave > IOVA accesses across them. > Add NVIDIA implementation for dual ARM MMU-500s and add new compatible > string for Tegra194 SoC SMMU topology. > > Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> > --- > MAINTAINERS | 2 + > drivers/iommu/Makefile | 2 +- > drivers/iommu/arm-smmu-impl.c | 3 + > drivers/iommu/arm-smmu-nvidia.c | 196 ++++++++++++++++++++++++++++++++ > drivers/iommu/arm-smmu.h | 1 + > 5 files changed, 203 insertions(+), 1 deletion(-) > create mode 100644 drivers/iommu/arm-smmu-nvidia.c ... > +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu) > +{ > + unsigned int i; > + struct nvidia_smmu *nvidia_smmu; > + struct platform_device *pdev = to_platform_device(smmu->dev); > + > + nvidia_smmu = devm_kzalloc(smmu->dev, sizeof(*nvidia_smmu), GFP_KERNEL); > + if (!nvidia_smmu) > + return ERR_PTR(-ENOMEM); > + > + nvidia_smmu->smmu = *smmu; > + /* Instance 0 is ioremapped by arm-smmu.c after this function returns */ > + nvidia_smmu->num_inst = 1; > + > + for (i = 1; i < MAX_SMMU_INSTANCES; i++) { > + struct resource *res; > + > + res = platform_get_resource(pdev, IORESOURCE_MEM, i); > + if (!res) > + break; Currently this driver is only supported for Tegra194 which I understand has 3 SMMUs. Therefore, I don't feel that we should fail silently here, I think it is better to return an error if all 3 cannot be initialised. In the future if there is an SoC that has less (hopefully not more) than Tegra194 then we should handle this via the DT compatible string. In other words, we should always know how many SMMUs there are for a given SoC and how many we should initialise. > + > + nvidia_smmu->bases[i] = devm_ioremap_resource(smmu->dev, res); > + if (IS_ERR(nvidia_smmu->bases[i])) > + return ERR_CAST(nvidia_smmu->bases[i]); You want to use PTR_ERR() here. Jon
On 2020-06-30 09:19, Jon Hunter wrote: > > On 30/06/2020 01:10, Krishna Reddy wrote: >> NVIDIA's Tegra194 SoC uses two ARM MMU-500s together to interleave >> IOVA accesses across them. >> Add NVIDIA implementation for dual ARM MMU-500s and add new compatible >> string for Tegra194 SoC SMMU topology. > > There is no description here of the 3rd SMMU that you mention below. > I think that we should describe the full picture here. > >> Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> >> --- >> MAINTAINERS | 2 + >> drivers/iommu/Makefile | 2 +- >> drivers/iommu/arm-smmu-impl.c | 3 + >> drivers/iommu/arm-smmu-nvidia.c | 196 ++++++++++++++++++++++++++++++++ >> drivers/iommu/arm-smmu.h | 1 + >> 5 files changed, 203 insertions(+), 1 deletion(-) >> create mode 100644 drivers/iommu/arm-smmu-nvidia.c >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 7b5ffd646c6b9..64c37dbdd4426 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -16808,8 +16808,10 @@ F: drivers/i2c/busses/i2c-tegra.c >> >> TEGRA IOMMU DRIVERS >> M: Thierry Reding <thierry.reding@gmail.com> >> +R: Krishna Reddy <vdumpa@nvidia.com> >> L: linux-tegra@vger.kernel.org >> S: Supported >> +F: drivers/iommu/arm-smmu-nvidia.c >> F: drivers/iommu/tegra* >> >> TEGRA KBC DRIVER >> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile >> index 342190196dfb0..2b8203db73ec3 100644 >> --- a/drivers/iommu/Makefile >> +++ b/drivers/iommu/Makefile >> @@ -15,7 +15,7 @@ obj-$(CONFIG_AMD_IOMMU) += amd/iommu.o amd/init.o amd/quirks.o >> obj-$(CONFIG_AMD_IOMMU_DEBUGFS) += amd/debugfs.o >> obj-$(CONFIG_AMD_IOMMU_V2) += amd/iommu_v2.o >> obj-$(CONFIG_ARM_SMMU) += arm_smmu.o >> -arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-qcom.o >> +arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-nvidia.o arm-smmu-qcom.o >> obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o >> obj-$(CONFIG_DMAR_TABLE) += intel/dmar.o >> obj-$(CONFIG_INTEL_IOMMU) += intel/iommu.o intel/pasid.o >> diff --git a/drivers/iommu/arm-smmu-impl.c b/drivers/iommu/arm-smmu-impl.c >> index c75b9d957b702..70f7318017617 100644 >> --- a/drivers/iommu/arm-smmu-impl.c >> +++ b/drivers/iommu/arm-smmu-impl.c >> @@ -171,6 +171,9 @@ struct arm_smmu_device *arm_smmu_impl_init(struct arm_smmu_device *smmu) >> if (of_property_read_bool(np, "calxeda,smmu-secure-config-access")) >> smmu->impl = &calxeda_impl; >> >> + if (of_device_is_compatible(smmu->dev->of_node, "nvidia,tegra194-smmu")) Nit: please use "np" like all the surrounding code does. >> + return nvidia_smmu_impl_init(smmu); >> + >> if (of_device_is_compatible(np, "qcom,sdm845-smmu-500") || >> of_device_is_compatible(np, "qcom,sc7180-smmu-500")) >> return qcom_smmu_impl_init(smmu); >> diff --git a/drivers/iommu/arm-smmu-nvidia.c b/drivers/iommu/arm-smmu-nvidia.c >> new file mode 100644 >> index 0000000000000..1124f0ac1823a >> --- /dev/null >> +++ b/drivers/iommu/arm-smmu-nvidia.c >> @@ -0,0 +1,196 @@ >> +// SPDX-License-Identifier: GPL-2.0-only >> +// NVIDIA ARM SMMU v2 implementation quirks >> +// Copyright (C) 2019-2020 NVIDIA CORPORATION. All rights reserved. >> + >> +#include <linux/bitfield.h> >> +#include <linux/delay.h> >> +#include <linux/of.h> >> +#include <linux/platform_device.h> >> +#include <linux/slab.h> >> + >> +#include "arm-smmu.h" >> + >> +/* >> + * Tegra194 has three ARM MMU-500 Instances. >> + * Two of them are used together for interleaved IOVA accesses and >> + * used by non-isochronous HW devices for SMMU translations. >> + * Third one is used for SMMU translations from isochronous HW devices. >> + * It is possible to use this implementation to program either >> + * all three or two of the instances identically as desired through >> + * DT node. >> + * >> + * Programming all the three instances identically comes with redundant TLB >> + * invalidations as all three never need to be TLB invalidated for a HW device. >> + * >> + * When Linux kernel supports multiple SMMU devices, the SMMU device used for >> + * isochornous HW devices should be added as a separate ARM MMU-500 device >> + * in DT and be programmed independently for efficient TLB invalidates. I don't understand the "When" there - the driver has always supported multiple independent SMMUs, and it's not something that could be configured out or otherwise disabled. Plus I really don't see why you would ever want to force unrelated SMMUs to be programmed together - beyond the TLB thing mentioned it would also waste precious context bank resources and might lead to weird device grouping via false stream ID aliasing, with no obvious upside at all. >> + */ >> +#define MAX_SMMU_INSTANCES 3 >> + >> +#define TLB_LOOP_TIMEOUT_IN_US 1000000 /* 1s! */ >> +#define TLB_SPIN_COUNT 10 >> + >> +struct nvidia_smmu { >> + struct arm_smmu_device smmu; >> + unsigned int num_inst; >> + void __iomem *bases[MAX_SMMU_INSTANCES]; >> +}; >> + >> +static inline struct nvidia_smmu *to_nvidia_smmu(struct arm_smmu_device *smmu) >> +{ >> + return container_of(smmu, struct nvidia_smmu, smmu); >> +} >> + >> +static inline void __iomem *nvidia_smmu_page(struct arm_smmu_device *smmu, >> + unsigned int inst, int page) > > If you run checkpatch --strict on these you will get a lot of ... > > CHECK: Alignment should match open parenthesis > #116: FILE: drivers/iommu/arm-smmu-nvidia.c:46: > +static inline void __iomem *nvidia_smmu_page(struct arm_smmu_device *smmu, > + unsigned int inst, int page) > > We should fix these. > >> +{ >> + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); >> + >> + if (!nvidia_smmu->bases[0]) >> + nvidia_smmu->bases[0] = smmu->base; >> + >> + return nvidia_smmu->bases[inst] + (page << smmu->pgshift); >> +} >> + >> +static u32 nvidia_smmu_read_reg(struct arm_smmu_device *smmu, >> + int page, int offset) >> +{ >> + void __iomem *reg = nvidia_smmu_page(smmu, 0, page) + offset; >> + >> + return readl_relaxed(reg); >> +} >> + >> +static void nvidia_smmu_write_reg(struct arm_smmu_device *smmu, >> + int page, int offset, u32 val) >> +{ >> + unsigned int i; >> + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); >> + >> + for (i = 0; i < nvidia_smmu->num_inst; i++) { >> + void __iomem *reg = nvidia_smmu_page(smmu, i, page) + offset; > > Personally, I would declare 'reg' outside of the loop as I feel it will make > the code cleaner and easier to read. > >> + >> + writel_relaxed(val, reg); >> + } >> +} >> + >> +static u64 nvidia_smmu_read_reg64(struct arm_smmu_device *smmu, >> + int page, int offset) >> +{ >> + void __iomem *reg = nvidia_smmu_page(smmu, 0, page) + offset; >> + >> + return readq_relaxed(reg); >> +} >> + >> +static void nvidia_smmu_write_reg64(struct arm_smmu_device *smmu, >> + int page, int offset, u64 val) >> +{ >> + unsigned int i; >> + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); >> + >> + for (i = 0; i < nvidia_smmu->num_inst; i++) { >> + void __iomem *reg = nvidia_smmu_page(smmu, i, page) + offset; >> + >> + writeq_relaxed(val, reg); >> + } >> +} >> + >> +static void nvidia_smmu_tlb_sync(struct arm_smmu_device *smmu, int page, >> + int sync, int status) >> +{ >> + unsigned int delay; >> + >> + arm_smmu_writel(smmu, page, sync, 0); >> + >> + for (delay = 1; delay < TLB_LOOP_TIMEOUT_IN_US; delay *= 2) { > > So we are doubling the delay every time? Is this better than just using > the same on each loop? This is the same logic as the main driver (see 8513c8930069) - the sync is expected to complete relatively quickly, hence why we have the inner spin loop to avoid the delay entirely in the typical case, and the longer it's taking, the more likely it is that something's wrong and it will never complete anyway. Realistically, a heavily loaded SMMU at a modest clock rate might take us through a couple of iterations of the outer loop, but beyond that we're pretty much just killing time until we declare it wedged and give up, and by then there's not much point in burning power frantically hamering on the interconnect. > >> + unsigned int spin_cnt; >> + >> + for (spin_cnt = TLB_SPIN_COUNT; spin_cnt > 0; spin_cnt--) { >> + u32 val = 0; >> + unsigned int i; >> + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); > > Why not do this once at the beginning of the function? > >> + >> + for (i = 0; i < nvidia_smmu->num_inst; i++) { >> + void __iomem *reg = >> + nvidia_smmu_page(smmu, i, page) + status; >> + >> + val |= readl_relaxed(reg); >> + } >> + >> + if (!(val & ARM_SMMU_sTLBGSTATUS_GSACTIVE)) >> + return; >> + >> + cpu_relax(); >> + } >> + >> + udelay(delay); >> + } >> + >> + dev_err_ratelimited(smmu->dev, >> + "TLB sync timed out -- SMMU may be deadlocked\n"); >> +} >> + >> +static int nvidia_smmu_reset(struct arm_smmu_device *smmu) >> +{ >> + unsigned int i; >> + >> + for (i = 0; i < to_nvidia_smmu(smmu)->num_inst; i++) { >> + u32 val; >> + void __iomem *reg = nvidia_smmu_page(smmu, i, ARM_SMMU_GR0) + >> + ARM_SMMU_GR0_sGFSR; > > I feel that declaring variables here clutters the code. > >> + >> + /* clear global FSR */ >> + val = readl_relaxed(reg); >> + writel_relaxed(val, reg); >> + } >> + >> + return 0; >> +} >> + >> +static const struct arm_smmu_impl nvidia_smmu_impl = { >> + .read_reg = nvidia_smmu_read_reg, >> + .write_reg = nvidia_smmu_write_reg, >> + .read_reg64 = nvidia_smmu_read_reg64, >> + .write_reg64 = nvidia_smmu_write_reg64, >> + .reset = nvidia_smmu_reset, >> + .tlb_sync = nvidia_smmu_tlb_sync, >> +}; >> + >> +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu) >> +{ >> + unsigned int i; >> + struct nvidia_smmu *nvidia_smmu; >> + struct platform_device *pdev = to_platform_device(smmu->dev); >> + >> + nvidia_smmu = devm_kzalloc(smmu->dev, sizeof(*nvidia_smmu), GFP_KERNEL); >> + if (!nvidia_smmu) >> + return ERR_PTR(-ENOMEM); >> + >> + nvidia_smmu->smmu = *smmu; >> + /* Instance 0 is ioremapped by arm-smmu.c after this function returns */ >> + nvidia_smmu->num_inst = 1; >> + >> + for (i = 1; i < MAX_SMMU_INSTANCES; i++) { >> + struct resource *res; >> + >> + res = platform_get_resource(pdev, IORESOURCE_MEM, i); >> + if (!res) >> + break; >> + >> + nvidia_smmu->bases[i] = devm_ioremap_resource(smmu->dev, res); >> + if (IS_ERR(nvidia_smmu->bases[i])) >> + return ERR_CAST(nvidia_smmu->bases[i]); >> + >> + nvidia_smmu->num_inst++; >> + } >> + >> + nvidia_smmu->smmu.impl = &nvidia_smmu_impl; >> + /* >> + * Free the arm_smmu_device struct allocated in arm-smmu.c. >> + * Once this function returns, arm-smmu.c would use arm_smmu_device >> + * allocated as part of nvidia_smmu struct. >> + */ >> + devm_kfree(smmu->dev, smmu); > > Why don't we just store the pointer of the smmu struct passed to this function > in the nvidia_smmu struct and then we do not need to free this here. In other > words make ... > > struct nvidia_smmu { > struct arm_smmu_device *smmu; > unsigned int num_inst; > void __iomem *bases[MAX_SMMU_INSTANCES]; > }; > > This seems more appropriate, than copying the struct and freeing memory > allocated else-where. But then how do you get back to struct nvidia_smmu given just a pointer to struct arm_smmu_device? I'll admit my quickly-hacked-up design for post-hoc subclassing isn't the prettiest, but in fairness it was only ever intended for a couple of exceptional cases. I guess it would be equally possible to wrap struct arm_smmu_impl, and container_of() back to the private data from smmu->impl itself, which avoids the minor weirdness of reassigning the smmu pointer in the middle of probing, but then you'd have to store the function pointers in writeable memory, which in principle might make the integrity folks pull a sad face. The aim was to avoid the non-architectural stuff cluttering up the main driver as far as possible, which is why I preferred a subclassing approach over cramming more random pointers into struct arm_smmu_device. Not to say it can't be changed if you've got a really solid idea for an alternative... Robin. > >> + >> + return &nvidia_smmu->smmu; >> +} >> diff --git a/drivers/iommu/arm-smmu.h b/drivers/iommu/arm-smmu.h >> index d172c024be618..8cf1511ed9874 100644 >> --- a/drivers/iommu/arm-smmu.h >> +++ b/drivers/iommu/arm-smmu.h >> @@ -450,6 +450,7 @@ static inline void arm_smmu_writeq(struct arm_smmu_device *smmu, int page, >> arm_smmu_writeq((s), ARM_SMMU_CB((s), (n)), (o), (v)) >> >> struct arm_smmu_device *arm_smmu_impl_init(struct arm_smmu_device *smmu); >> +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu); >> struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu); >> >> int arm_mmu500_reset(struct arm_smmu_device *smmu); >> > > Cheers > Jon >
On 30/06/2020 15:53, Robin Murphy wrote: > On 2020-06-30 09:19, Jon Hunter wrote: >> >> On 30/06/2020 01:10, Krishna Reddy wrote: >>> NVIDIA's Tegra194 SoC uses two ARM MMU-500s together to interleave >>> IOVA accesses across them. >>> Add NVIDIA implementation for dual ARM MMU-500s and add new compatible >>> string for Tegra194 SoC SMMU topology. >> >> There is no description here of the 3rd SMMU that you mention below. >> I think that we should describe the full picture here. >> >>> Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> ... >>> +static void nvidia_smmu_tlb_sync(struct arm_smmu_device *smmu, int >>> page, >>> + int sync, int status) >>> +{ >>> + unsigned int delay; >>> + >>> + arm_smmu_writel(smmu, page, sync, 0); >>> + >>> + for (delay = 1; delay < TLB_LOOP_TIMEOUT_IN_US; delay *= 2) { >> >> So we are doubling the delay every time? Is this better than just using >> the same on each loop? > > This is the same logic as the main driver (see 8513c8930069) - the sync > is expected to complete relatively quickly, hence why we have the inner > spin loop to avoid the delay entirely in the typical case, and the > longer it's taking, the more likely it is that something's wrong and it > will never complete anyway. Realistically, a heavily loaded SMMU at a > modest clock rate might take us through a couple of iterations of the > outer loop, but beyond that we're pretty much just killing time until we > declare it wedged and give up, and by then there's not much point in > burning power frantically hamering on the interconnect. Ah OK. Then maybe we should move the definitions for TLB_LOOP_TIMEOUT and TLB_SPIN_COUNT into the arm-smmu.h so that we can use them directly in this file instead of redefining them. Then it maybe clear that these are part of the main driver. >>> +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device >>> *smmu) >>> +{ >>> + unsigned int i; >>> + struct nvidia_smmu *nvidia_smmu; >>> + struct platform_device *pdev = to_platform_device(smmu->dev); >>> + >>> + nvidia_smmu = devm_kzalloc(smmu->dev, sizeof(*nvidia_smmu), >>> GFP_KERNEL); >>> + if (!nvidia_smmu) >>> + return ERR_PTR(-ENOMEM); >>> + >>> + nvidia_smmu->smmu = *smmu; >>> + /* Instance 0 is ioremapped by arm-smmu.c after this function >>> returns */ >>> + nvidia_smmu->num_inst = 1; >>> + >>> + for (i = 1; i < MAX_SMMU_INSTANCES; i++) { >>> + struct resource *res; >>> + >>> + res = platform_get_resource(pdev, IORESOURCE_MEM, i); >>> + if (!res) >>> + break; >>> + >>> + nvidia_smmu->bases[i] = devm_ioremap_resource(smmu->dev, res); >>> + if (IS_ERR(nvidia_smmu->bases[i])) >>> + return ERR_CAST(nvidia_smmu->bases[i]); >>> + >>> + nvidia_smmu->num_inst++; >>> + } >>> + >>> + nvidia_smmu->smmu.impl = &nvidia_smmu_impl; >>> + /* >>> + * Free the arm_smmu_device struct allocated in arm-smmu.c. >>> + * Once this function returns, arm-smmu.c would use arm_smmu_device >>> + * allocated as part of nvidia_smmu struct. >>> + */ >>> + devm_kfree(smmu->dev, smmu); >> >> Why don't we just store the pointer of the smmu struct passed to this >> function >> in the nvidia_smmu struct and then we do not need to free this here. >> In other >> words make ... >> >> struct nvidia_smmu { >> struct arm_smmu_device *smmu; >> unsigned int num_inst; >> void __iomem *bases[MAX_SMMU_INSTANCES]; >> }; >> >> This seems more appropriate, than copying the struct and freeing memory >> allocated else-where. > > But then how do you get back to struct nvidia_smmu given just a pointer > to struct arm_smmu_device? Ah yes of course that is what I was missing. I wondered what was going on here. So I think we should add a nice comment in the above function of why we are copying this and cannot simply store the pointer. Cheers Jon
>> +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device >> +*smmu) { >> + unsigned int i; .... >> + for (i = 1; i < MAX_SMMU_INSTANCES; i++) { >> + struct resource *res; >> + >> + res = platform_get_resource(pdev, IORESOURCE_MEM, i); >> + if (!res) >> + break; >Currently this driver is only supported for Tegra194 which I understand has 3 SMMUs. Therefore, I don't feel that we should fail silently here, I think it is better to return an error if all 3 cannot be initialised. Initialization of all the three SMMU instances is not necessary here. The driver can work with all the possible number of instances 1, 2 and 3 based on the DT config though it doesn't make much sense to use it with 1 instance. There is no silent failure here from driver point of view. If there is misconfig in DT, SMMU faults would catch issues. >> + nvidia_smmu->bases[i] = devm_ioremap_resource(smmu->dev, res); >> + if (IS_ERR(nvidia_smmu->bases[i])) >> + return ERR_CAST(nvidia_smmu->bases[i]); >You want to use PTR_ERR() here. PTR_ERR() returns long integer. This function returns a pointer. ERR_CAST is the right one to use here. -- nvpublic
On 30/06/2020 17:23, Krishna Reddy wrote: >>> +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device >>> +*smmu) { >>> + unsigned int i; > .... >>> + for (i = 1; i < MAX_SMMU_INSTANCES; i++) { >>> + struct resource *res; >>> + >>> + res = platform_get_resource(pdev, IORESOURCE_MEM, i); >>> + if (!res) >>> + break; > >> Currently this driver is only supported for Tegra194 which I understand has 3 SMMUs. Therefore, I don't feel that we should fail silently here, I think it is better to return an error if all 3 cannot be initialised. > > Initialization of all the three SMMU instances is not necessary here. That is not what I am saying. > The driver can work with all the possible number of instances 1, 2 and 3 based on the DT config though it doesn't make much sense to use it with 1 instance. > There is no silent failure here from driver point of view. If there is misconfig in DT, SMMU faults would catch issues. I disagree and you should return a proper error here. >>> + nvidia_smmu->bases[i] = devm_ioremap_resource(smmu->dev, res); >>> + if (IS_ERR(nvidia_smmu->bases[i])) >>> + return ERR_CAST(nvidia_smmu->bases[i]); > >> You want to use PTR_ERR() here. > > PTR_ERR() returns long integer. > This function returns a pointer. ERR_CAST is the right one to use here. Ah yes, indeed. OK that's fine. Jon
On 30/06/2020 17:32, Jon Hunter wrote: > On 30/06/2020 17:23, Krishna Reddy wrote: >>>> +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device >>>> +*smmu) { >>>> + unsigned int i; >> .... >>>> + for (i = 1; i < MAX_SMMU_INSTANCES; i++) { >>>> + struct resource *res; >>>> + >>>> + res = platform_get_resource(pdev, IORESOURCE_MEM, i); >>>> + if (!res) >>>> + break; >> >>> Currently this driver is only supported for Tegra194 which I understand has 3 SMMUs. Therefore, I don't feel that we should fail silently here, I think it is better to return an error if all 3 cannot be initialised. >> >> Initialization of all the three SMMU instances is not necessary here. > > That is not what I am saying. > >> The driver can work with all the possible number of instances 1, 2 and 3 based on the DT config though it doesn't make much sense to use it with 1 instance. >> There is no silent failure here from driver point of view. If there is misconfig in DT, SMMU faults would catch issues. > > I disagree and you should return a proper error here. OK, well I see what you are saying, but if we intended to support all 3 for Tegra194, then we should ensure all 3 are initialised correctly. My concern here is testing, because when things break in upstream I am usually the one that tracks it down. Not having clear warning/error messages when something is not initialised as expected makes it harder. It would be better to query the number of SMMUs populated in device-tree and then ensure that all are initialised correctly. Jon
>> NVIDIA's Tegra194 SoC uses two ARM MMU-500s together to interleave >> IOVA accesses across them. >> Add NVIDIA implementation for dual ARM MMU-500s and add new compatible >> string for Tegra194 SoC SMMU topology. >There is no description here of the 3rd SMMU that you mention below. >I think that we should describe the full picture here. This driver is primarily for dual SMMU config. So, It is avoided in the commit message. However, Implementation supports option to configure 3 instances identically with one SMMU DT node and is documented in the implementation. >> + >> +static inline void __iomem *nvidia_smmu_page(struct arm_smmu_device *smmu, >> + unsigned int inst, int page) >If you run checkpatch --strict on these you will get a lot of ... >CHECK: Alignment should match open parenthesis >#116: FILE: drivers/iommu/arm-smmu-nvidia.c:46: >+static inline void __iomem *nvidia_smmu_page(struct arm_smmu_device *smmu, >+ unsigned int inst, int page) >We should fix these. I will fix these if I need to push a new patch set. >> +static void nvidia_smmu_write_reg(struct arm_smmu_device *smmu, >> + int page, int offset, u32 val) { >> + unsigned int i; >> + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); >> + >> + for (i = 0; i < nvidia_smmu->num_inst; i++) { >> + void __iomem *reg = nvidia_smmu_page(smmu, i, page) + offset; >Personally, I would declare 'reg' outside of the loop as I feel it will make the code cleaner and easier to read. It was like that before and is updated to its current form to limit the scope of variables as per Thierry's comments in v6. We can just leave it as it is as there is no technical issue here. -KR -- nvpublic
>OK, well I see what you are saying, but if we intended to support all 3 for Tegra194, then we should ensure all 3 are initialised correctly. The driver intend to support up to 3 instances. It doesn't really mandate that all three instances be present in same DT node. Each mmio aperture in "reg" property is an instance here. reg = <inst0_base, size>, <inst1_base, size>, <inst2_base, size>; The reg can have all three or less and driver just configures based on reg and it works fine. >It would be better to query the number of SMMUs populated in device-tree and then ensure that all are initialised correctly. Getting the IORESOURCE_MEM is the way to count the instances driver need to support. In a way, It is already querying through IORESOURCE_MEM here. -KR
On 30/06/2020 18:16, Krishna Reddy wrote: >> OK, well I see what you are saying, but if we intended to support all 3 for Tegra194, then we should ensure all 3 are initialised correctly. > > The driver intend to support up to 3 instances. It doesn't really mandate that all three instances be present in same DT node. > Each mmio aperture in "reg" property is an instance here. reg = <inst0_base, size>, <inst1_base, size>, <inst2_base, size>; > The reg can have all three or less and driver just configures based on reg and it works fine. So it sounds like we need at least 2 SMMUs (for non-iso and iso) but we have up to 3 (for Tegra194). So the question is do we have a use-case where we only use 2 and not 3? If not, then it still seems that we should require that all 3 are present. The other problem I see here is that currently the arm-smmu binding defines the 'reg' with a 'maxItems' of 1, whereas we have 3. I believe that this will get caught by the 'dt_binding_check' when we try to populate the binding. >> It would be better to query the number of SMMUs populated in device-tree and then ensure that all are initialised correctly. > > Getting the IORESOURCE_MEM is the way to count the instances driver need to support. > In a way, It is already querying through IORESOURCE_MEM here. Yes I was wondering that. I think we just need to decide if the 3rd SMMU is optional or not. The DT binding should detail and min and max supported. Jon
>> The driver intend to support up to 3 instances. It doesn't really mandate that all three instances be present in same DT node. >> Each mmio aperture in "reg" property is an instance here. reg = >> <inst0_base, size>, <inst1_base, size>, <inst2_base, size>; The reg can have all three or less and driver just configures based on reg and it works fine. >So it sounds like we need at least 2 SMMUs (for non-iso and iso) but we have up to 3 (for Tegra194). So the question is do we have a use-case where we only use 2 and not 3? If not, then it still seems that we should require that all 3 are present. It can be either 2 SMMUs (for non-iso) or 3 SMMUs (for non-iso and iso). Let me fail the one instance case as it can use regular arm smmu implementation and don't need nvidia implementation explicitly. >The other problem I see here is that currently the arm-smmu binding defines the 'reg' with a 'maxItems' of 1, whereas we have 3. I believe that this will get caught by the 'dt_binding_check' when we try to populate the binding. Thanks for pointing it out! Will update the binding doc. -KR -- nvpublic
>> + * When Linux kernel supports multiple SMMU devices, the SMMU device >> +used for >> + * isochornous HW devices should be added as a separate ARM MMU-500 >> +device >> + * in DT and be programmed independently for efficient TLB invalidates. >I don't understand the "When" there - the driver has always supported multiple independent SMMUs, and it's not something that could be configured out or otherwise disabled. Plus I really don't see why you would ever want to force unrelated SMMUs to be >programmed together - beyond the TLB thing mentioned it would also waste precious context bank resources and might lead to weird device grouping via false stream ID aliasing, with no obvious upside at all. Sorry, I missed this comment. During the initial patches, when the iommu_ops were different between, support multiple SMMU drivers at the same is not possible as one of them(that gets probed last) overwrites the platform bus ops. On revisiting the original issue, This problem is no longer relevant. At this point, It makes more sense to just get rid of 3rd instance programming in arm-smmu-nvidia.c and just limit it to the SMMU instances that need identical programming. -KR
On 2020-07-01 19:18, Krishna Reddy wrote: >>> + * When Linux kernel supports multiple SMMU devices, the SMMU >>> device +used for + * isochornous HW devices should be added as a >>> separate ARM MMU-500 +device + * in DT and be programmed >>> independently for efficient TLB invalidates. > >> I don't understand the "When" there - the driver has always >> supported multiple independent SMMUs, and it's not something that >> could be configured out or otherwise disabled. Plus I really don't >> see why you would ever want to force unrelated SMMUs to be >> >programmed together - beyond the TLB thing mentioned it would also >> waste precious context bank resources and might lead to weird >> device grouping via false stream ID aliasing, with no obvious >> upside at all. > > Sorry, I missed this comment. During the initial patches, when the > iommu_ops were different between, support multiple SMMU drivers at > the same is not possible as one of them(that gets probed last) > overwrites the platform bus ops. On revisiting the original issue, > This problem is no longer relevant. At this point, It makes more > sense to just get rid of 3rd instance programming in > arm-smmu-nvidia.c and just limit it to the SMMU instances that need > identical programming. Yeah, I realised later last night that this probably originated from forking the whole driver downstream. But even then you could have treated the other one as a separate nsmmu with a single instance ;) Since it does add a bit of confusion to the code and comments, let's just keep things simple. I do like Jon's suggestion of actually enforcing that the number of "reg" regions exactly matches the number expected for the given compatible - I guess for now that means just hard-coding 2 and hoping the hardware folks don't cook up any more of these... Robin.
>Yeah, I realised later last night that this probably originated from forking the whole driver downstream. But even then you could have treated the other one as a separate nsmmu with a single instance ;) True, But the initial nvidia implementation had limitation that it can only handle one instance of usage. With your implementation hooks design, it should be able to handle multiple instances of usage now. >Since it does add a bit of confusion to the code and comments, let's just keep things simple. I do like Jon's suggestion of actually enforcing that the number of "reg" regions exactly matches the number expected for the given compatible - I guess for now that means just hard-coding 2 and hoping the hardware folks don't cook up any more of these... For T194, reg can just be forced to 2. No future plan to use more than two MMU-500s together as of now. -KR
diff --git a/MAINTAINERS b/MAINTAINERS index 7b5ffd646c6b9..64c37dbdd4426 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16808,8 +16808,10 @@ F: drivers/i2c/busses/i2c-tegra.c TEGRA IOMMU DRIVERS M: Thierry Reding <thierry.reding@gmail.com> +R: Krishna Reddy <vdumpa@nvidia.com> L: linux-tegra@vger.kernel.org S: Supported +F: drivers/iommu/arm-smmu-nvidia.c F: drivers/iommu/tegra* TEGRA KBC DRIVER diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 342190196dfb0..2b8203db73ec3 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -15,7 +15,7 @@ obj-$(CONFIG_AMD_IOMMU) += amd/iommu.o amd/init.o amd/quirks.o obj-$(CONFIG_AMD_IOMMU_DEBUGFS) += amd/debugfs.o obj-$(CONFIG_AMD_IOMMU_V2) += amd/iommu_v2.o obj-$(CONFIG_ARM_SMMU) += arm_smmu.o -arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-qcom.o +arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-nvidia.o arm-smmu-qcom.o obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o obj-$(CONFIG_DMAR_TABLE) += intel/dmar.o obj-$(CONFIG_INTEL_IOMMU) += intel/iommu.o intel/pasid.o diff --git a/drivers/iommu/arm-smmu-impl.c b/drivers/iommu/arm-smmu-impl.c index c75b9d957b702..70f7318017617 100644 --- a/drivers/iommu/arm-smmu-impl.c +++ b/drivers/iommu/arm-smmu-impl.c @@ -171,6 +171,9 @@ struct arm_smmu_device *arm_smmu_impl_init(struct arm_smmu_device *smmu) if (of_property_read_bool(np, "calxeda,smmu-secure-config-access")) smmu->impl = &calxeda_impl; + if (of_device_is_compatible(smmu->dev->of_node, "nvidia,tegra194-smmu")) + return nvidia_smmu_impl_init(smmu); + if (of_device_is_compatible(np, "qcom,sdm845-smmu-500") || of_device_is_compatible(np, "qcom,sc7180-smmu-500")) return qcom_smmu_impl_init(smmu); diff --git a/drivers/iommu/arm-smmu-nvidia.c b/drivers/iommu/arm-smmu-nvidia.c new file mode 100644 index 0000000000000..1124f0ac1823a --- /dev/null +++ b/drivers/iommu/arm-smmu-nvidia.c @@ -0,0 +1,196 @@ +// SPDX-License-Identifier: GPL-2.0-only +// NVIDIA ARM SMMU v2 implementation quirks +// Copyright (C) 2019-2020 NVIDIA CORPORATION. All rights reserved. + +#include <linux/bitfield.h> +#include <linux/delay.h> +#include <linux/of.h> +#include <linux/platform_device.h> +#include <linux/slab.h> + +#include "arm-smmu.h" + +/* + * Tegra194 has three ARM MMU-500 Instances. + * Two of them are used together for interleaved IOVA accesses and + * used by non-isochronous HW devices for SMMU translations. + * Third one is used for SMMU translations from isochronous HW devices. + * It is possible to use this implementation to program either + * all three or two of the instances identically as desired through + * DT node. + * + * Programming all the three instances identically comes with redundant TLB + * invalidations as all three never need to be TLB invalidated for a HW device. + * + * When Linux kernel supports multiple SMMU devices, the SMMU device used for + * isochornous HW devices should be added as a separate ARM MMU-500 device + * in DT and be programmed independently for efficient TLB invalidates. + */ +#define MAX_SMMU_INSTANCES 3 + +#define TLB_LOOP_TIMEOUT_IN_US 1000000 /* 1s! */ +#define TLB_SPIN_COUNT 10 + +struct nvidia_smmu { + struct arm_smmu_device smmu; + unsigned int num_inst; + void __iomem *bases[MAX_SMMU_INSTANCES]; +}; + +static inline struct nvidia_smmu *to_nvidia_smmu(struct arm_smmu_device *smmu) +{ + return container_of(smmu, struct nvidia_smmu, smmu); +} + +static inline void __iomem *nvidia_smmu_page(struct arm_smmu_device *smmu, + unsigned int inst, int page) +{ + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); + + if (!nvidia_smmu->bases[0]) + nvidia_smmu->bases[0] = smmu->base; + + return nvidia_smmu->bases[inst] + (page << smmu->pgshift); +} + +static u32 nvidia_smmu_read_reg(struct arm_smmu_device *smmu, + int page, int offset) +{ + void __iomem *reg = nvidia_smmu_page(smmu, 0, page) + offset; + + return readl_relaxed(reg); +} + +static void nvidia_smmu_write_reg(struct arm_smmu_device *smmu, + int page, int offset, u32 val) +{ + unsigned int i; + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); + + for (i = 0; i < nvidia_smmu->num_inst; i++) { + void __iomem *reg = nvidia_smmu_page(smmu, i, page) + offset; + + writel_relaxed(val, reg); + } +} + +static u64 nvidia_smmu_read_reg64(struct arm_smmu_device *smmu, + int page, int offset) +{ + void __iomem *reg = nvidia_smmu_page(smmu, 0, page) + offset; + + return readq_relaxed(reg); +} + +static void nvidia_smmu_write_reg64(struct arm_smmu_device *smmu, + int page, int offset, u64 val) +{ + unsigned int i; + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); + + for (i = 0; i < nvidia_smmu->num_inst; i++) { + void __iomem *reg = nvidia_smmu_page(smmu, i, page) + offset; + + writeq_relaxed(val, reg); + } +} + +static void nvidia_smmu_tlb_sync(struct arm_smmu_device *smmu, int page, + int sync, int status) +{ + unsigned int delay; + + arm_smmu_writel(smmu, page, sync, 0); + + for (delay = 1; delay < TLB_LOOP_TIMEOUT_IN_US; delay *= 2) { + unsigned int spin_cnt; + + for (spin_cnt = TLB_SPIN_COUNT; spin_cnt > 0; spin_cnt--) { + u32 val = 0; + unsigned int i; + struct nvidia_smmu *nvidia_smmu = to_nvidia_smmu(smmu); + + for (i = 0; i < nvidia_smmu->num_inst; i++) { + void __iomem *reg = + nvidia_smmu_page(smmu, i, page) + status; + + val |= readl_relaxed(reg); + } + + if (!(val & ARM_SMMU_sTLBGSTATUS_GSACTIVE)) + return; + + cpu_relax(); + } + + udelay(delay); + } + + dev_err_ratelimited(smmu->dev, + "TLB sync timed out -- SMMU may be deadlocked\n"); +} + +static int nvidia_smmu_reset(struct arm_smmu_device *smmu) +{ + unsigned int i; + + for (i = 0; i < to_nvidia_smmu(smmu)->num_inst; i++) { + u32 val; + void __iomem *reg = nvidia_smmu_page(smmu, i, ARM_SMMU_GR0) + + ARM_SMMU_GR0_sGFSR; + + /* clear global FSR */ + val = readl_relaxed(reg); + writel_relaxed(val, reg); + } + + return 0; +} + +static const struct arm_smmu_impl nvidia_smmu_impl = { + .read_reg = nvidia_smmu_read_reg, + .write_reg = nvidia_smmu_write_reg, + .read_reg64 = nvidia_smmu_read_reg64, + .write_reg64 = nvidia_smmu_write_reg64, + .reset = nvidia_smmu_reset, + .tlb_sync = nvidia_smmu_tlb_sync, +}; + +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu) +{ + unsigned int i; + struct nvidia_smmu *nvidia_smmu; + struct platform_device *pdev = to_platform_device(smmu->dev); + + nvidia_smmu = devm_kzalloc(smmu->dev, sizeof(*nvidia_smmu), GFP_KERNEL); + if (!nvidia_smmu) + return ERR_PTR(-ENOMEM); + + nvidia_smmu->smmu = *smmu; + /* Instance 0 is ioremapped by arm-smmu.c after this function returns */ + nvidia_smmu->num_inst = 1; + + for (i = 1; i < MAX_SMMU_INSTANCES; i++) { + struct resource *res; + + res = platform_get_resource(pdev, IORESOURCE_MEM, i); + if (!res) + break; + + nvidia_smmu->bases[i] = devm_ioremap_resource(smmu->dev, res); + if (IS_ERR(nvidia_smmu->bases[i])) + return ERR_CAST(nvidia_smmu->bases[i]); + + nvidia_smmu->num_inst++; + } + + nvidia_smmu->smmu.impl = &nvidia_smmu_impl; + /* + * Free the arm_smmu_device struct allocated in arm-smmu.c. + * Once this function returns, arm-smmu.c would use arm_smmu_device + * allocated as part of nvidia_smmu struct. + */ + devm_kfree(smmu->dev, smmu); + + return &nvidia_smmu->smmu; +} diff --git a/drivers/iommu/arm-smmu.h b/drivers/iommu/arm-smmu.h index d172c024be618..8cf1511ed9874 100644 --- a/drivers/iommu/arm-smmu.h +++ b/drivers/iommu/arm-smmu.h @@ -450,6 +450,7 @@ static inline void arm_smmu_writeq(struct arm_smmu_device *smmu, int page, arm_smmu_writeq((s), ARM_SMMU_CB((s), (n)), (o), (v)) struct arm_smmu_device *arm_smmu_impl_init(struct arm_smmu_device *smmu); +struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu); struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu); int arm_mmu500_reset(struct arm_smmu_device *smmu);
NVIDIA's Tegra194 SoC uses two ARM MMU-500s together to interleave IOVA accesses across them. Add NVIDIA implementation for dual ARM MMU-500s and add new compatible string for Tegra194 SoC SMMU topology. Signed-off-by: Krishna Reddy <vdumpa@nvidia.com> --- MAINTAINERS | 2 + drivers/iommu/Makefile | 2 +- drivers/iommu/arm-smmu-impl.c | 3 + drivers/iommu/arm-smmu-nvidia.c | 196 ++++++++++++++++++++++++++++++++ drivers/iommu/arm-smmu.h | 1 + 5 files changed, 203 insertions(+), 1 deletion(-) create mode 100644 drivers/iommu/arm-smmu-nvidia.c