
[v3] iommu/arm-smmu-v3: Fix L1 stream table index calculation for 32-bit sid size

Message ID 20241004180405.555194-1-yang@os.amperecomputing.com (mailing list archive)
State New, archived

Commit Message

Yang Shi Oct. 4, 2024, 6:04 p.m. UTC
The commit ce410410f1a7 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
calculated the last index of the L1 stream table with 1 << smmu->sid_bits,
where 1 is a 32-bit value.
However, some platforms, for example AmpereOne and platforms with
ARM MMU-700, have a 32-bit stream ID size.  This resulted in an
out-of-bounds shift.  The disassembly of the shift is:
The disassembly of shift is:

    ldr     w2, [x19, 828]  //, smmu_7(D)->sid_bits
    mov     w20, 1
    lsl     w20, w20, w2

According to the Arm architecture, if the registers are 32-bit, the
instruction actually performs:
    dest = src << (shift % 32)

So the value was actually shifted by zero bits.

The out-of-bounds shift is also undefined behavior according to the C
language standard.

This caused v6.12-rc1 to fail to boot on such platforms.

UBSAN also reported:

UBSAN: shift-out-of-bounds in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3628:29
shift exponent 32 is too large for 32-bit type 'int'

Using a 64-bit immediate for the shift solves the problem.  The
disassembly after the fix looks like:
    ldr     w20, [x19, 828] //, smmu_7(D)->sid_bits
    mov     x0, 1
    lsl     x0, x0, x20

There are a couple of problematic places, so the shift is extracted into a helper.

Fixes: ce410410f1a7 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
Tested-by: James Morse <james.morse@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 +++++++++++-----
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  5 +++++
 2 files changed, 16 insertions(+), 5 deletions(-)

v3: * Some trivial modifications to the commit log per Robin Murphy.
    * Used "num_sids" instead of "max_sids" per Robin Murphy.
    * Returned u64 type for arm_smmu_strtab_num_sids() per Nicolin Chen.
    * Checked size in arm_smmu_init_strtab_linear() in order to avoid
      overflow per Jason Gunthorpe.
    * Collected r-b tag from Jason Gunthorpe.
v2: * Extracted the shift into a helper per Jason Gunthorpe.
    * Covered more places per Nicolin Chen and Jason Gunthorpe.
    * Used 1ULL instead of 1UL to guarantee 64 bit per Robin Murphy.
    * Made the subject more general since this is not AmpereOne specific
      problem per the report from James Morse.
    * Collected t-b tag from James Morse.
    * Added Fixes tag in commit log.

Comments

Daniel Mentz Oct. 4, 2024, 9:14 p.m. UTC | #1
On Fri, Oct 4, 2024 at 11:04 AM Yang Shi <yang@os.amperecomputing.com> wrote:
>  static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
>  {
> -       u32 size;
> +       u64 size;
>         struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> +       u64 num_sids = arm_smmu_strtab_num_sids(smmu);
> +
> +       size = num_sids * sizeof(struct arm_smmu_ste);
> +       /* The max size for dmam_alloc_coherent() is 32-bit */

I'd remove this comment. I assume the intent here was to say that the
maximum size is 4GB (not 32 bit). I also can't find any reference to
this limitation. Where does dmam_alloc_coherent() limit the size of an
allocation to 4GB? Also, this comment might not be applicable to 64
bit platforms.

> +       if (size > SIZE_MAX)
> +               return -EINVAL;

I'm assuming this is for platforms where the range of a u64 is larger
than that of a size_t type? If we're printing an error message if an
allocation fails (i.e. "failed to allocate linear stream table (%llu
bytes)\n"), then we might also want to print an error message here.

> -       cfg->linear.num_ents = 1 << smmu->sid_bits;
> +       cfg->linear.num_ents = num_sids;

If you're worried about 32 bit platforms, then I'm wondering if this
also needs some attention. cfg->linear.num_ents is defined as an
unsigned int and num_sids could potentially be outside the range of an
unsigned int on 32 bit platforms.

> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 1e9952ca989f..c8ceddc5e8ef 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -853,6 +853,11 @@ struct arm_smmu_master_domain {
>         ioasid_t ssid;
>  };
>
> +static inline u64 arm_smmu_strtab_num_sids(struct arm_smmu_device *smmu)
> +{
> +       return (1ULL << smmu->sid_bits);
> +}
> +

I'm wondering if it makes sense to move this up and put it right
before arm_smmu_strtab_l1_idx(). That way, all the arm_smmu_strtab_*
functions are in one place.

On a related note, in arm_smmu_init_strtab_2lvl() we're capping the
number of l1 entries at STRTAB_MAX_L1_ENTRIES for 2 level stream
tables. I'm thinking it would make sense to limit the size of linear
stream tables for the same reasons.
Yang Shi Oct. 4, 2024, 9:47 p.m. UTC | #2
On 10/4/24 2:14 PM, Daniel Mentz wrote:
> On Fri, Oct 4, 2024 at 11:04 AM Yang Shi <yang@os.amperecomputing.com> wrote:
>>   static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
>>   {
>> -       u32 size;
>> +       u64 size;
>>          struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
>> +       u64 num_sids = arm_smmu_strtab_num_sids(smmu);
>> +
>> +       size = num_sids * sizeof(struct arm_smmu_ste);
>> +       /* The max size for dmam_alloc_coherent() is 32-bit */
> I'd remove this comment. I assume the intent here was to say that the
> maximum size is 4GB (not 32 bit). I also can't find any reference to
> this limitation. Where does dmam_alloc_coherent() limit the size of an
> allocation to 4GB? Also, this comment might not be applicable to 64
> bit platforms.

The "size" parameter passed to dmam_alloc_coherent() is of type size_t,
which is unsigned int.

>
>> +       if (size > SIZE_MAX)
>> +               return -EINVAL;
> I'm assuming this is for platforms where the range of a u64 is larger
> than that of a size_t type? If we're printing an error message if an
> allocation fails (i.e. "failed to allocate linear stream table (%llu
> bytes)\n"), then we might also want to print an error message here.

Thanks for the suggestion. Yes, we can if it really helps.

>
>> -       cfg->linear.num_ents = 1 << smmu->sid_bits;
>> +       cfg->linear.num_ents = num_sids;
> If you're worried about 32 bit platforms, then I'm wondering if this
> also needs some attention. cfg->linear.num_ents is defined as an
> unsigned int and num_sids could potentially be outside the range of an
> unsigned int on 32 bit platforms.

The (size > SIZE_MAX) check guarantees that an excessively large num_sids
won't reach here.

>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index 1e9952ca989f..c8ceddc5e8ef 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -853,6 +853,11 @@ struct arm_smmu_master_domain {
>>          ioasid_t ssid;
>>   };
>>
>> +static inline u64 arm_smmu_strtab_num_sids(struct arm_smmu_device *smmu)
>> +{
>> +       return (1ULL << smmu->sid_bits);
>> +}
>> +
> I'm wondering if it makes sense to move this up and put it right
> before arm_smmu_strtab_l1_idx(). That way, all the arm_smmu_strtab_*
> functions are in one place.

I did that. But the function uses struct arm_smmu_device, which is defined
after those arm_smmu_strtab_* helpers, so I had to put the helper after the
struct arm_smmu_device definition to avoid a compile error. We may consider
re-organizing the header file to group them better, but I don't think it is
urgent enough, and it seems out of the scope of this bug fix patch. I really
want to have the bug fix land upstream ASAP.

>
> On a related note, in arm_smmu_init_strtab_2lvl() we're capping the
> number of l1 entries at STRTAB_MAX_L1_ENTRIES for 2 level stream
> tables. I'm thinking it would make sense to limit the size of linear
> stream tables for the same reasons.

Yes, this also works. But I don't know what value should be used. Jason
actually suggested (size > SIZE_512M) in the v2 review, but I thought that
value was a magic number. Why 512M? Just because it is too large for
allocation. So I picked SIZE_MAX, simply because it is the largest size
supported by the size_t type.
Daniel Mentz Oct. 5, 2024, 1:03 a.m. UTC | #3
On Fri, Oct 4, 2024 at 2:47 PM Yang Shi <yang@os.amperecomputing.com> wrote:
> On 10/4/24 2:14 PM, Daniel Mentz wrote:
> > On Fri, Oct 4, 2024 at 11:04 AM Yang Shi <yang@os.amperecomputing.com> wrote:
> >>   static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> >>   {
> >> -       u32 size;
> >> +       u64 size;
> >>          struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> >> +       u64 num_sids = arm_smmu_strtab_num_sids(smmu);
> >> +
> >> +       size = num_sids * sizeof(struct arm_smmu_ste);
> >> +       /* The max size for dmam_alloc_coherent() is 32-bit */
> > I'd remove this comment. I assume the intent here was to say that the
> > maximum size is 4GB (not 32 bit). I also can't find any reference to
> > this limitation. Where does dmam_alloc_coherent() limit the size of an
> > allocation to 4GB? Also, this comment might not be applicable to 64
> > bit platforms.
>
> The "size" parameter passed to dmam_alloc_coherent() is size_t type
> which is unsigned int.

I believe that this is true only for 32-bit platforms. On arm64,
unsigned int is 32-bit, whereas size_t is 64-bit. I'm still in favor
of removing that comment, because it's not applicable to arm64.

> >
> >> -       cfg->linear.num_ents = 1 << smmu->sid_bits;
> >> +       cfg->linear.num_ents = num_sids;
> > If you're worried about 32 bit platforms, then I'm wondering if this
> > also needs some attention. cfg->linear.num_ents is defined as an
> > unsigned int and num_sids could potentially be outside the range of an
> > unsigned int on 32 bit platforms.
>
> The (size > SIZE_MAX) check can guarantee excessively large num_sids
> won't reach here.

Now that I think about it, unsigned int is 32 bit even on arm64. So,
I'm afraid this could (theoretically) overflow. On arm64, I don't
think that the (size > SIZE_MAX) check will prevent this.

> >> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> >> index 1e9952ca989f..c8ceddc5e8ef 100644
> >> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> >> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> >> @@ -853,6 +853,11 @@ struct arm_smmu_master_domain {
> >>          ioasid_t ssid;
> >>   };
> >>
> >> +static inline u64 arm_smmu_strtab_num_sids(struct arm_smmu_device *smmu)
> >> +{
> >> +       return (1ULL << smmu->sid_bits);
> >> +}
> >> +
> > I'm wondering if it makes sense to move this up and put it right
> > before arm_smmu_strtab_l1_idx(). That way, all the arm_smmu_strtab_*
> > functions are in one place.
>
> I did it. But the function uses struct arm_smmu_device which is defined
> after those arm_smmu_strtab_* helpers. I have to put the helper after
> struct arm_smmu_device definition to avoid compile error. We may
> consider re-organize the header file to group them better, but I don't
> think it is urgent enough and it seems out of the scope of the bug fix
> patch. I really want to have the bug fix landed in upstream ASAP.

Understood. Thanks. We could move the changes in
arm_smmu_init_strtab_linear() into a separate patch to accelerate the
process. I'm fine either way, though. I don't want to get in the way
of this landing upstream.

>
> >
> > On a related note, in arm_smmu_init_strtab_2lvl() we're capping the
> > number of l1 entries at STRTAB_MAX_L1_ENTRIES for 2 level stream
> > tables. I'm thinking it would make sense to limit the size of linear
> > stream tables for the same reasons.
>
> Yes, this also works. But I don't know what value should be used. Jason
> actually suggested (size > SIZE_512M) in v2 review, but I thought the
> value is a magic number. Why 512M? Just because it is too large for
> allocation. So I picked up SIZE_MAX, just because it is the largest size
> supported by size_t type.

I think it should be capped to STRTAB_MAX_L1_ENTRIES
Yang Shi Oct. 5, 2024, 1:53 a.m. UTC | #4
On 10/4/24 6:03 PM, Daniel Mentz wrote:
> On Fri, Oct 4, 2024 at 2:47 PM Yang Shi <yang@os.amperecomputing.com> wrote:
>> On 10/4/24 2:14 PM, Daniel Mentz wrote:
>>> On Fri, Oct 4, 2024 at 11:04 AM Yang Shi <yang@os.amperecomputing.com> wrote:
>>>>    static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
>>>>    {
>>>> -       u32 size;
>>>> +       u64 size;
>>>>           struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
>>>> +       u64 num_sids = arm_smmu_strtab_num_sids(smmu);
>>>> +
>>>> +       size = num_sids * sizeof(struct arm_smmu_ste);
>>>> +       /* The max size for dmam_alloc_coherent() is 32-bit */
>>> I'd remove this comment. I assume the intent here was to say that the
>>> maximum size is 4GB (not 32 bit). I also can't find any reference to
>>> this limitation. Where does dmam_alloc_coherent() limit the size of an
>>> allocation to 4GB? Also, this comment might not be applicable to 64
>>> bit platforms.
>> The "size" parameter passed to dmam_alloc_coherent() is size_t type
>> which is unsigned int.
> I believe that this is true only for 32 bit platforms. On arm64,
> unsigned int is 32 bit, whereas size_t is 64 bit. I'm still in favor
> of removing that comment, because it's not applicable to arm64.

Thanks for figuring this out. Yes, you are right. I missed this point.

>
>>>> -       cfg->linear.num_ents = 1 << smmu->sid_bits;
>>>> +       cfg->linear.num_ents = num_sids;
>>> If you're worried about 32 bit platforms, then I'm wondering if this
>>> also needs some attention. cfg->linear.num_ents is defined as an
>>> unsigned int and num_sids could potentially be outside the range of an
>>> unsigned int on 32 bit platforms.
>> The (size > SIZE_MAX) check can guarantee excessively large num_sids
>> won't reach here.
> Now that I think about it, unsigned int is 32 bit even on arm64. So,
> I'm afraid this could (theoretically) overflow. On arm64, I don't
> think that the (size > SIZE_MAX) check will prevent this.

Yes, SIZE_MAX is ~(size_t)0, but size_t is unsigned long on ARM64, so
the check actually doesn't do what I expected it to do. U32_MAX should
be used.

>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> index 1e9952ca989f..c8ceddc5e8ef 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>>> @@ -853,6 +853,11 @@ struct arm_smmu_master_domain {
>>>>           ioasid_t ssid;
>>>>    };
>>>>
>>>> +static inline u64 arm_smmu_strtab_num_sids(struct arm_smmu_device *smmu)
>>>> +{
>>>> +       return (1ULL << smmu->sid_bits);
>>>> +}
>>>> +
>>> I'm wondering if it makes sense to move this up and put it right
>>> before arm_smmu_strtab_l1_idx(). That way, all the arm_smmu_strtab_*
>>> functions are in one place.
>> I did it. But the function uses struct arm_smmu_device which is defined
>> after those arm_smmu_strtab_* helpers. I have to put the helper after
>> struct arm_smmu_device definition to avoid compile error. We may
>> consider re-organize the header file to group them better, but I don't
>> think it is urgent enough and it seems out of the scope of the bug fix
>> patch. I really want to have the bug fix landed in upstream ASAP.
> Understood. Thanks. We could move the changes in
> arm_smmu_init_strtab_linear() into a separate patch to accelerate the
> process. I'm fine either way, though. I don't want to get in the way
> of this landing upstream.

Thank you for your understanding.

>
>>> On a related note, in arm_smmu_init_strtab_2lvl() we're capping the
>>> number of l1 entries at STRTAB_MAX_L1_ENTRIES for 2 level stream
>>> tables. I'm thinking it would make sense to limit the size of linear
>>> stream tables for the same reasons.
>> Yes, this also works. But I don't know what value should be used. Jason
>> actually suggested (size > SIZE_512M) in v2 review, but I thought the
>> value is a magic number. Why 512M? Just because it is too large for
>> allocation. So I picked up SIZE_MAX, just because it is the largest size
>> supported by size_t type.
> I think it should be capped to STRTAB_MAX_L1_ENTRIES

I'm not an expert on SMMU. Does the linear stream table have the same cap
as the 2-level stream table? Is this defined by the hardware spec? If it
is not, why should we pick this value?
Daniel Mentz Oct. 7, 2024, 4:36 p.m. UTC | #5
On Fri, Oct 4, 2024 at 6:53 PM Yang Shi <yang@os.amperecomputing.com> wrote:
> >>> On a related note, in arm_smmu_init_strtab_2lvl() we're capping the
> >>> number of l1 entries at STRTAB_MAX_L1_ENTRIES for 2 level stream
> >>> tables. I'm thinking it would make sense to limit the size of linear
> >>> stream tables for the same reasons.
> >> Yes, this also works. But I don't know what value should be used. Jason
> >> actually suggested (size > SIZE_512M) in v2 review, but I thought the
> >> value is a magic number. Why 512M? Just because it is too large for
> >> allocation. So I picked up SIZE_MAX, just because it is the largest size
> >> supported by size_t type.
> > I think it should be capped to STRTAB_MAX_L1_ENTRIES
>
> I'm not expert on SMMU. Does the linear stream table have the same cap
> as 2-level stream table? Is this defined by the hardware spec? If it is
> not, why should we pick this value?

No. I don't think it's defined by the architecture specification. I
don't have a strong opinion on the particular value for the size limit
of linear Stream tables. However, I do believe that we should pick a
size limit. Today, the driver limits the number of Level-1 Stream
Table Descriptors in a 2-level Stream table. For consistency, we
should limit the size of linear Stream tables, too.
Yang Shi Oct. 7, 2024, 5:49 p.m. UTC | #6
On 10/7/24 9:36 AM, Daniel Mentz wrote:
> On Fri, Oct 4, 2024 at 6:53 PM Yang Shi <yang@os.amperecomputing.com> wrote:
>>>>> On a related note, in arm_smmu_init_strtab_2lvl() we're capping the
>>>>> number of l1 entries at STRTAB_MAX_L1_ENTRIES for 2 level stream
>>>>> tables. I'm thinking it would make sense to limit the size of linear
>>>>> stream tables for the same reasons.
>>>> Yes, this also works. But I don't know what value should be used. Jason
>>>> actually suggested (size > SIZE_512M) in v2 review, but I thought the
>>>> value is a magic number. Why 512M? Just because it is too large for
>>>> allocation. So I picked up SIZE_MAX, just because it is the largest size
>>>> supported by size_t type.
>>> I think it should be capped to STRTAB_MAX_L1_ENTRIES
>> I'm not expert on SMMU. Does the linear stream table have the same cap
>> as 2-level stream table? Is this defined by the hardware spec? If it is
>> not, why should we pick this value?
> No. I don't think it's defined by the architecture specification. I
> don't have a strong opinion on the particular value for the size limit
> of linear Stream tables. However, I do believe that we should pick a
> size limit. Today, the driver limits the number of Level-1 Stream
> Table Descriptors in a 2-level Stream table. For consistency, we
> should limit the size of linear Stream tables, too.

We are on the same page regarding having a size limit. Taking a look at
the definition of STRTAB_MAX_L1_ENTRIES, I saw this comment:

/*
  * Stream table.
  *
  * Linear: Enough to cover 1 << IDR1.SIDSIZE entries
  * 2lvl: 128k L1 entries,
  *       256 lazy entries per table (each table covers a PCI bus)
  */

I'm not sure whether the comment for linear is still true with a large
IDR1.SIDSIZE, but using STRTAB_MAX_L1_ENTRIES for linear does conflict
with the comment. I will pick U32_MAX for now since it is the largest
size on 32-bit and good enough to prevent overflow.
Jason Gunthorpe Oct. 7, 2024, 5:50 p.m. UTC | #7
On Fri, Oct 04, 2024 at 06:53:25PM -0700, Yang Shi wrote:

> Yes, SIZE_MAX is ~(size_t)0, but size_t is unsigned long on ARM64. So the
> check actually doesn't do what I expect it should do. U32_MAX should be
> used.

SIZE_MAX is right:

static inline void *dmam_alloc_coherent(struct device *dev, size_t size,
		dma_addr_t *dma_handle, gfp_t gfp)

It is up to dmam_alloc_coherent() to not truncate its arguments; when
you pass the u64 bounded by SIZE_MAX, you guarantee that size will be a
valid value.

> > I think it should be capped to STRTAB_MAX_L1_ENTRIES
> 
> I'm not expert on SMMU. Does the linear stream table have the same cap as
> 2-level stream table? Is this defined by the hardware spec? If it is not,
> why should we pick this value?

Well, the way the driver works is that in 2-level mode it caps the
whole table to STRTAB_MAX_L1_ENTRIES * STRTAB_NUM_L2_STES total SIDs,
which is 0x2000000, or 26 bits.

I guess there is a reasonable argument that linear and 2-level tables
should have the same software-enforced max size. Though that would put
it at a max 2G linear stream table, which is still larger than Linux can
allocate, so it doesn't really make any practical difference compared to
SIZE_MAX.

Jason
Yang Shi Oct. 7, 2024, 6:43 p.m. UTC | #8
On 10/7/24 10:50 AM, Jason Gunthorpe wrote:
> On Fri, Oct 04, 2024 at 06:53:25PM -0700, Yang Shi wrote:
>
>> Yes, SIZE_MAX is ~(size_t)0, but size_t is unsigned long on ARM64. So the
>> check actually doesn't do what I expect it should do. U32_MAX should be
>> used.
> SIZE_MAX is right:
>
> static inline void *dmam_alloc_coherent(struct device *dev, size_t size,
> 		dma_addr_t *dma_handle, gfp_t gfp)
>
> It is up to dmam_alloc_coherent() to not truncate it's arguments, when
> you pass the u64 bounded to SIZE_MAX you guarentee that size will be a
> valid value.

SIZE_MAX is fine for dmam_alloc_coherent(). The concern from Daniel is
that the later assignment to cfg->linear.num_ents, which is unsigned
int, may truncate the value.

If I read the code correctly, it looks like dmam_alloc_coherent() may
not be guaranteed to fail a large allocation, for example, if a very
large CMA area is configured. It is ridiculous, but it is allowed.
Please correct me if I'm wrong. So the concern seems valid to me, and I
proposed U32_MAX.

>
>>> I think it should be capped to STRTAB_MAX_L1_ENTRIES
>> I'm not expert on SMMU. Does the linear stream table have the same cap as
>> 2-level stream table? Is this defined by the hardware spec? If it is not,
>> why should we pick this value?
> Well, the way the driver works is in the 2 level mode it caps the
> whole table to STRTAB_MAX_L1_ENTRIES * STRTAB_NUM_L2_STES total SIDs
> which is 0x2000000 or 26 bits
>
> I guess there is a reasonable argument that linear or 2 level should
> have the same software enforced max size. Though would put it at a max
> 2G linear STE which is still larger than Linux can allocate, so it
> doesn't really make any practical difference compared to SIZE_MAX.
>
> Jason
Will Deacon Oct. 8, 2024, 1:34 p.m. UTC | #9
Hi folks,

Sorry I'm late to the party, I went fishing.

On Fri, Oct 04, 2024 at 11:04:05AM -0700, Yang Shi wrote:
> The commit ce410410f1a7 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
> calculated the last index of L1 stream table by 1 << smmu->sid_bits. 1
> is 32 bit value.
> However some platforms, for example, AmpereOne and the platforms with
> ARM MMU-700, have 32-bit stream id size.  This resulted in ouf-of-bound shift.
> The disassembly of shift is:
> 
>     ldr     w2, [x19, 828]  //, smmu_7(D)->sid_bits
>     mov     w20, 1
>     lsl     w20, w20, w2
> 
> According to ARM spec, if the registers are 32 bit, the instruction actually
> does:
>     dest = src << (shift % 32)
> 
> So it actually shifted by zero bit.
> 
> The out-of-bound shift is also undefined behavior according to C
> language standard.
> 
> This caused v6.12-rc1 failed to boot on such platforms.
> 
> UBSAN also reported:
> 
> UBSAN: shift-out-of-bounds in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3628:29
> shift exponent 32 is too large for 32-bit type 'int'
> 
> Using 64 bit immediate when doing shift can solve the problem.  The
> disassembly after the fix looks like:
>     ldr     w20, [x19, 828] //, smmu_7(D)->sid_bits
>     mov     x0, 1
>     lsl     x0, x0, x20
> 
> There are a couple of problematic places, extracted the shift into a helper.
> 
> Fixes: ce410410f1a7 ("iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()")
> Tested-by: James Morse <james.morse@arm.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 +++++++++++-----
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  5 +++++
>  2 files changed, 16 insertions(+), 5 deletions(-)

[...]

> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 737c5b882355..9d4fc91d9258 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3624,8 +3624,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
>  {
>  	u32 l1size;
>  	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> +	u64 num_sids = arm_smmu_strtab_num_sids(smmu);
>  	unsigned int last_sid_idx =
> -		arm_smmu_strtab_l1_idx((1 << smmu->sid_bits) - 1);
> +		arm_smmu_strtab_l1_idx(num_sids - 1);
>  
>  	/* Calculate the L1 size, capped to the SIDSIZE. */
>  	cfg->l2.num_l1_ents = min(last_sid_idx + 1, STRTAB_MAX_L1_ENTRIES);
> @@ -3655,20 +3656,25 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
>  
>  static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
>  {
> -	u32 size;
> +	u64 size;
>  	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> +	u64 num_sids = arm_smmu_strtab_num_sids(smmu);
> +
> +	size = num_sids * sizeof(struct arm_smmu_ste);
> +	/* The max size for dmam_alloc_coherent() is 32-bit */
> +	if (size > SIZE_MAX)
> +		return -EINVAL;
>  
> -	size = (1 << smmu->sid_bits) * sizeof(struct arm_smmu_ste);
>  	cfg->linear.table = dmam_alloc_coherent(smmu->dev, size,
>  						&cfg->linear.ste_dma,
>  						GFP_KERNEL);
>  	if (!cfg->linear.table) {
>  		dev_err(smmu->dev,
> -			"failed to allocate linear stream table (%u bytes)\n",
> +			"failed to allocate linear stream table (%llu bytes)\n",
>  			size);
>  		return -ENOMEM;
>  	}
> -	cfg->linear.num_ents = 1 << smmu->sid_bits;
> +	cfg->linear.num_ents = num_sids;

This all looks a bit messy to me. The architecture guarantees that
2-level stream tables are supported once we hit 7-bit SIDs and, although
the driver relaxes this to > 8-bit SIDs, we'll never run into overflow
problems in the linear table code above.

So I'm inclined to take Daniel's one-liner [1] which just chucks the
'ULL' suffix into the 2-level case. Otherwise, we're in a weird
situation where the size is 64-bit for a short while until it gets
truncated anyway when we assign it to a 32-bit field.

Any objections?

Will

[1] https://lore.kernel.org/r/20241002015357.1766934-1-danielmentz@google.com
Jason Gunthorpe Oct. 8, 2024, 3:15 p.m. UTC | #10
On Tue, Oct 08, 2024 at 02:34:58PM +0100, Will Deacon wrote:

> This all looks a bit messy to me. The architecture guarantees that
> 2-level stream tables are supported once we hit 7-bit SIDs and, although
> the driver relaxes this to > 8-bit SIDs, we'll never run into overflow
> problems in the linear table code above.

My original point was about the confidential compute position (sigh)
that the untrusted hypervisor should not corrupt the driver.

So your statement is architecturally true, but we never check that
IDR0_ST_LVL_2LVL is set if IDR1_SIDSIZE > 2**7, and so we can get into
this situation where the hypervisor could trigger some kind of bad
behavior.

> So I'm inclined to take Daniel's one-liner [1] which just chucks the
> 'ULL' suffix into the 2-level case. Otherwise, we're in a weird

I think you should take it and let better be for the CC crowd.

Jason
Yang Shi Oct. 8, 2024, 5:04 p.m. UTC | #11
On 10/8/24 8:15 AM, Jason Gunthorpe wrote:
> On Tue, Oct 08, 2024 at 02:34:58PM +0100, Will Deacon wrote:
>
>> This all looks a bit messy to me. The architecture guarantees that
>> 2-level stream tables are supported once we hit 7-bit SIDs and, although
>> the driver relaxes this to > 8-bit SIDs, we'll never run into overflow
>> problems in the linear table code above.
> My original point was about the confidential compute position (sigh)
> that the untrusted hypverisor should not corrupt the driver.
>
> So your statement is architecturally true, but we never check that
> IDR0_ST_LVL_2LVL is set if IDR1_SIDSIZE > 2**7, and so we can get into
> this situation where the hypervisor could trigger some kind of bad
> behavior.

Jason's concern seems valid to me. But if the simpler version is
preferred, I'd suggest at least adding some comments, or the check
suggested by Jason, to make the architectural guarantee clearer, just in
case someone else repeats what we did because they see "1ULL" in the
2lvl code but not in the linear code.

>
>> So I'm inclined to take Daniel's one-liner [1] which just chucks the
>> 'ULL' suffix into the 2-level case. Otherwise, we're in a weird
> I think you should take it and let better be for the CC crowd.
>
> Jason
Will Deacon Oct. 8, 2024, 5:41 p.m. UTC | #12
On Tue, Oct 08, 2024 at 10:04:50AM -0700, Yang Shi wrote:
> On 10/8/24 8:15 AM, Jason Gunthorpe wrote:
> > On Tue, Oct 08, 2024 at 02:34:58PM +0100, Will Deacon wrote:
> > 
> > > This all looks a bit messy to me. The architecture guarantees that
> > > 2-level stream tables are supported once we hit 7-bit SIDs and, although
> > > the driver relaxes this to > 8-bit SIDs, we'll never run into overflow
> > > problems in the linear table code above.
> > My original point was about the confidential compute position (sigh)
> > that the untrusted hypverisor should not corrupt the driver.
> > 
> > So your statement is architecturally true, but we never check that
> > IDR0_ST_LVL_2LVL is set if IDR1_SIDSIZE > 2**7, and so we can get into
> > this situation where the hypervisor could trigger some kind of bad
> > behavior.
> 
> Jason's concern seems valid to me IMHO. But if the simpler version is
> preferred, I'd suggest add some comments at least or the check suggested by
> Jason to make the architecture guarantee more clear. Just in case someone
> else won't repeat what we had done just because they see "1ULL" in 2lvl code
> but not in linear code.

In principle, I'm not opposed to hardening the driver against malicious
virtual hardware, but I don't think we need to add comments everywhere
where we're relying on architectural behaviour. The hardening should
also be done separately from fixing a bug affecting users with real
hardware.

Will
Will Deacon Oct. 8, 2024, 5:42 p.m. UTC | #13
On Tue, Oct 08, 2024 at 12:15:06PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 08, 2024 at 02:34:58PM +0100, Will Deacon wrote:
> 
> > This all looks a bit messy to me. The architecture guarantees that
> > 2-level stream tables are supported once we hit 7-bit SIDs and, although
> > the driver relaxes this to > 8-bit SIDs, we'll never run into overflow
> > problems in the linear table code above.
> 
> My original point was about the confidential compute position (sigh)
> that the untrusted hypverisor should not corrupt the driver.
> 
> So your statement is architecturally true, but we never check that
> IDR0_ST_LVL_2LVL is set if IDR1_SIDSIZE > 2**7, and so we can get into
> this situation where the hypervisor could trigger some kind of bad
> behavior.
> 
> > So I'm inclined to take Daniel's one-liner [1] which just chucks the
> > 'ULL' suffix into the 2-level case. Otherwise, we're in a weird
> 
> I think you should take it and let better be for the CC crowd.

Heh. I wish them luck! :p

Will

Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 737c5b882355..9d4fc91d9258 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3624,8 +3624,9 @@  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
 {
 	u32 l1size;
 	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+	u64 num_sids = arm_smmu_strtab_num_sids(smmu);
 	unsigned int last_sid_idx =
-		arm_smmu_strtab_l1_idx((1 << smmu->sid_bits) - 1);
+		arm_smmu_strtab_l1_idx(num_sids - 1);
 
 	/* Calculate the L1 size, capped to the SIDSIZE. */
 	cfg->l2.num_l1_ents = min(last_sid_idx + 1, STRTAB_MAX_L1_ENTRIES);
@@ -3655,20 +3656,25 @@  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
 
 static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
 {
-	u32 size;
+	u64 size;
 	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+	u64 num_sids = arm_smmu_strtab_num_sids(smmu);
+
+	size = num_sids * sizeof(struct arm_smmu_ste);
+	/* The max size for dmam_alloc_coherent() is 32-bit */
+	if (size > SIZE_MAX)
+		return -EINVAL;
 
-	size = (1 << smmu->sid_bits) * sizeof(struct arm_smmu_ste);
 	cfg->linear.table = dmam_alloc_coherent(smmu->dev, size,
 						&cfg->linear.ste_dma,
 						GFP_KERNEL);
 	if (!cfg->linear.table) {
 		dev_err(smmu->dev,
-			"failed to allocate linear stream table (%u bytes)\n",
+			"failed to allocate linear stream table (%llu bytes)\n",
 			size);
 		return -ENOMEM;
 	}
-	cfg->linear.num_ents = 1 << smmu->sid_bits;
+	cfg->linear.num_ents = num_sids;
 
 	arm_smmu_init_initial_stes(cfg->linear.table, cfg->linear.num_ents);
 	return 0;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1e9952ca989f..c8ceddc5e8ef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -853,6 +853,11 @@  struct arm_smmu_master_domain {
 	ioasid_t ssid;
 };
 
+static inline u64 arm_smmu_strtab_num_sids(struct arm_smmu_device *smmu)
+{
+	return (1ULL << smmu->sid_bits);
+}
+
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 {
 	return container_of(dom, struct arm_smmu_domain, domain);