diff mbox series

[1/4] iommu/arm-smmu-v3: Use command queue batching helpers to improve performance

Message ID 20210811114852.2429-2-thunder.leizhen@huawei.com (mailing list archive)
State New, archived
Headers show
Series Prepare for ECMDQ support | expand

Commit Message

Leizhen (ThunderTown) Aug. 11, 2021, 11:48 a.m. UTC
The obvious key to the performance optimization of commit 587e6c10a7ce
("iommu/arm-smmu-v3: Reduce contention during command-queue insertion") is
to allow multiple cores to insert commands in parallel after a brief mutex
contention.

Obviously, inserting as many commands at a time as possible can reduce the
number of times the mutex contention participates, thereby improving the
overall performance. At least it reduces the number of calls to function
arm_smmu_cmdq_issue_cmdlist().

Therefore, use command queue batching helpers to insert multiple commands
at a time.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Robin Murphy Aug. 13, 2021, 4:01 p.m. UTC | #1
On 2021-08-11 12:48, Zhen Lei wrote:
> The obvious key to the performance optimization of commit 587e6c10a7ce
> ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion") is
> to allow multiple cores to insert commands in parallel after a brief mutex
> contention.
> 
> Obviously, inserting as many commands at a time as possible can reduce the
> number of times the mutex contention participates, thereby improving the
> overall performance. At least it reduces the number of calls to function
> arm_smmu_cmdq_issue_cmdlist().
> 
> Therefore, use command queue batching helpers to insert multiple commands
> at a time.
> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 235f9bdaeaf223b..c81cd929047f573 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
>   {
>   	int i;
>   	struct arm_smmu_cmdq_ent cmd;
> +	struct arm_smmu_cmdq_batch cmds = {};

BTW, it looks like this has crossed over with John's patch removing these.

Robin.

>   
>   	arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
>   
>   	for (i = 0; i < master->num_streams; i++) {
>   		cmd.atc.sid = master->streams[i].id;
> -		arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
> +		arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
>   	}
>   
> -	return arm_smmu_cmdq_issue_sync(master->smmu);
> +	return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
>   }
>   
>   int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
>
John Garry Aug. 13, 2021, 4:45 p.m. UTC | #2
On 13/08/2021 17:01, Robin Murphy wrote:
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 235f9bdaeaf223b..c81cd929047f573 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct 
>> arm_smmu_master *master)
>>   {
>>       int i;
>>       struct arm_smmu_cmdq_ent cmd;
>> +    struct arm_smmu_cmdq_batch cmds = {};
> 
> BTW, it looks like this has crossed over with John's patch removing these.

It is only called from arm_smmu_disable_ats(), so not hot-path by the 
look for it. Or who even has ats HW ...?

But it should be at least cleaned-up for consistency. Leizhen?

Thanks,
John
Leizhen (ThunderTown) Aug. 16, 2021, 2:15 a.m. UTC | #3
On 2021/8/14 0:45, John Garry wrote:
> On 13/08/2021 17:01, Robin Murphy wrote:
>>>
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> index 235f9bdaeaf223b..c81cd929047f573 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
>>>   {
>>>       int i;
>>>       struct arm_smmu_cmdq_ent cmd;
>>> +    struct arm_smmu_cmdq_batch cmds = {};
>>
>> BTW, it looks like this has crossed over with John's patch removing these.
> 
> It is only called from arm_smmu_disable_ats(), so not hot-path by the look for it. Or who even has ats HW ...?
> 
> But it should be at least cleaned-up for consistency. Leizhen?

Okay, I'll revise it. But Will already took it. So I'm not sure whether to send v2 or a separate patch.

> 
> Thanks,
> John
> .
>
Leizhen (ThunderTown) Aug. 16, 2021, 4:05 a.m. UTC | #4
On 2021/8/16 10:15, Leizhen (ThunderTown) wrote:
> 
> 
> On 2021/8/14 0:45, John Garry wrote:
>> On 13/08/2021 17:01, Robin Murphy wrote:
>>>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index 235f9bdaeaf223b..c81cd929047f573 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
>>>>   {
>>>>       int i;
>>>>       struct arm_smmu_cmdq_ent cmd;
>>>> +    struct arm_smmu_cmdq_batch cmds = {};
>>>
>>> BTW, it looks like this has crossed over with John's patch removing these.
>>
>> It is only called from arm_smmu_disable_ats(), so not hot-path by the look for it. Or who even has ats HW ...?
>>
>> But it should be at least cleaned-up for consistency. Leizhen?
> 
> Okay, I'll revise it. But Will already took it. So I'm not sure whether to send v2 or a separate patch.

I think I'd better post v2, otherwise I should write the same description.

In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized
slightly, three useless instructions can be reduced.

Case 1):
void arm_smmu_cmdq_build_cmd_tst1(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
{
        memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
        cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
}
0000000000004608 <arm_smmu_cmdq_build_cmd_tst1>:
    4608:       a9007c1f        stp     xzr, xzr, [x0]
    460c:       39400022        ldrb    w2, [x1]
    4610:       f9400001        ldr     x1, [x0]
    4614:       aa020021        orr     x1, x1, x2
    4618:       f9000001        str     x1, [x0]
    461c:       d65f03c0        ret

Case 2):
void arm_smmu_cmdq_build_cmd_tst2(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
{
        int i;

        cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
        for (i = 1; i < CMDQ_ENT_DWORDS; i++)
                cmd[i] = 0;
}
0000000000004620 <arm_smmu_cmdq_build_cmd_tst2>:
    4620:       39400021        ldrb    w1, [x1]
    4624:       a9007c01        stp     x1, xzr, [x0]
    4628:       d65f03c0        ret
    462c:       d503201f        nop

Case 3):
void arm_smmu_cmdq_build_cmd_tst3(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
{
        memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
        cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
}
0000000000004630 <arm_smmu_cmdq_build_cmd_tst3>:
    4630:       a9007c1f        stp     xzr, xzr, [x0]
    4634:       39400021        ldrb    w1, [x1]
    4638:       f9000001        str     x1, [x0]
    463c:       d65f03c0        ret

> 
>>
>> Thanks,
>> John
>> .
>>
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
John Garry Aug. 16, 2021, 7:24 a.m. UTC | #5
> In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized
> slightly, three useless instructions can be reduced.

I think that you could optimise further by pre-building commonly used 
commands.

For example, CMD_SYNC without MSI polling is always the same. And then 
only different in 1 field for MSI polling.

But you need to check if the performance gain is worth the change.

> 
> Case 1):
> void arm_smmu_cmdq_build_cmd_tst1(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> {
>          memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
>          cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
> }
> 0000000000004608 <arm_smmu_cmdq_build_cmd_tst1>:
>      4608:       a9007c1f        stp     xzr, xzr, [x0]
>      460c:       39400022        ldrb    w2, [x1]
>      4610:       f9400001        ldr     x1, [x0]
>      4614:       aa020021        orr     x1, x1, x2
>      4618:       f9000001        str     x1, [x0]
>      461c:       d65f03c0        ret
> 
> Case 2):
> void arm_smmu_cmdq_build_cmd_tst2(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> {
>          int i;
> 
>          cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
>          for (i = 1; i < CMDQ_ENT_DWORDS; i++)
>                  cmd[i] = 0;
> }
> 0000000000004620 <arm_smmu_cmdq_build_cmd_tst2>:
>      4620:       39400021        ldrb    w1, [x1]
>      4624:       a9007c01        stp     x1, xzr, [x0]
>      4628:       d65f03c0        ret
>      462c:       d503201f        nop
> 
> Case 3):
> void arm_smmu_cmdq_build_cmd_tst3(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> {
>          memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
>          cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
> }
> 0000000000004630 <arm_smmu_cmdq_build_cmd_tst3>:
>      4630:       a9007c1f        stp     xzr, xzr, [x0]
>      4634:       39400021        ldrb    w1, [x1]
>      4638:       f9000001        str     x1, [x0]
>      463c:       d65f03c0        ret
>
Leizhen (ThunderTown) Aug. 16, 2021, 7:47 a.m. UTC | #6
On 2021/8/16 15:24, John Garry wrote:
>> In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized
>> slightly, three useless instructions can be reduced.
> 
> I think that you could optimise further by pre-building commonly used commands.
> 
> For example, CMD_SYNC without MSI polling is always the same. And then only different in 1 field for MSI polling.
> 
> But you need to check if the performance gain is worth the change.

Good advice. I can give it a try.

> 
>>
>> Case 1):
>> void arm_smmu_cmdq_build_cmd_tst1(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>> {
>>          memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
>>          cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
>> }
>> 0000000000004608 <arm_smmu_cmdq_build_cmd_tst1>:
>>      4608:       a9007c1f        stp     xzr, xzr, [x0]
>>      460c:       39400022        ldrb    w2, [x1]
>>      4610:       f9400001        ldr     x1, [x0]
>>      4614:       aa020021        orr     x1, x1, x2
>>      4618:       f9000001        str     x1, [x0]
>>      461c:       d65f03c0        ret
>>
>> Case 2):
>> void arm_smmu_cmdq_build_cmd_tst2(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>> {
>>          int i;
>>
>>          cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
>>          for (i = 1; i < CMDQ_ENT_DWORDS; i++)
>>                  cmd[i] = 0;
>> }
>> 0000000000004620 <arm_smmu_cmdq_build_cmd_tst2>:
>>      4620:       39400021        ldrb    w1, [x1]
>>      4624:       a9007c01        stp     x1, xzr, [x0]
>>      4628:       d65f03c0        ret
>>      462c:       d503201f        nop
>>
>> Case 3):
>> void arm_smmu_cmdq_build_cmd_tst3(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>> {
>>          memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
>>          cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
>> }
>> 0000000000004630 <arm_smmu_cmdq_build_cmd_tst3>:
>>      4630:       a9007c1f        stp     xzr, xzr, [x0]
>>      4634:       39400021        ldrb    w1, [x1]
>>      4638:       f9000001        str     x1, [x0]
>>      463c:       d65f03c0        ret
>>
> 
> .
>
Will Deacon Aug. 16, 2021, 8:21 a.m. UTC | #7
On Mon, Aug 16, 2021 at 03:47:58PM +0800, Leizhen (ThunderTown) wrote:
> 
> 
> On 2021/8/16 15:24, John Garry wrote:
> >> In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized
> >> slightly, three useless instructions can be reduced.
> > 
> > I think that you could optimise further by pre-building commonly used commands.
> > 
> > For example, CMD_SYNC without MSI polling is always the same. And then only different in 1 field for MSI polling.
> > 
> > But you need to check if the performance gain is worth the change.
> 
> Good advice. I can give it a try.

Please send it as a new patch on top. I've queued the old one and sent
it to Joerg. Since this is just further cleanup, it can be done separately.

Will
Leizhen (ThunderTown) Aug. 16, 2021, 8:41 a.m. UTC | #8
On 2021/8/16 16:21, Will Deacon wrote:
> On Mon, Aug 16, 2021 at 03:47:58PM +0800, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2021/8/16 15:24, John Garry wrote:
>>>> In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized
>>>> slightly, three useless instructions can be reduced.
>>>
>>> I think that you could optimise further by pre-building commonly used commands.
>>>
>>> For example, CMD_SYNC without MSI polling is always the same. And then only different in 1 field for MSI polling.
>>>
>>> But you need to check if the performance gain is worth the change.
>>
>> Good advice. I can give it a try.
> 
> Please send it as a new patch on top. I've queued the old one and sent
> it to Joerg. Since this is just further cleanup, it can be done separately.

OK, I'll post a separate patch later.

> 
> Will
> .
>
diff mbox series

Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 235f9bdaeaf223b..c81cd929047f573 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1747,15 +1747,16 @@  static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
 {
 	int i;
 	struct arm_smmu_cmdq_ent cmd;
+	struct arm_smmu_cmdq_batch cmds = {};
 
 	arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
 
 	for (i = 0; i < master->num_streams; i++) {
 		cmd.atc.sid = master->streams[i].id;
-		arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
+		arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
 	}
 
-	return arm_smmu_cmdq_issue_sync(master->smmu);
+	return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
 }
 
 int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,