Message ID | 20210811114852.2429-2-thunder.leizhen@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Prepare for ECMDQ support | expand |
On 2021-08-11 12:48, Zhen Lei wrote: > The obvious key to the performance optimization of commit 587e6c10a7ce > ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion") is > to allow multiple cores to insert commands in parallel after a brief mutex > contention. > > Obviously, inserting as many commands at a time as possible can reduce the > number of times the mutex contention participates, thereby improving the > overall performance. At least it reduces the number of calls to function > arm_smmu_cmdq_issue_cmdlist(). > > Therefore, use command queue batching helpers to insert multiple commands > at a time. > > Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> > --- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > index 235f9bdaeaf223b..c81cd929047f573 100644 > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master) > { > int i; > struct arm_smmu_cmdq_ent cmd; > + struct arm_smmu_cmdq_batch cmds = {}; BTW, it looks like this has crossed over with John's patch removing these. Robin. > > arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd); > > for (i = 0; i < master->num_streams; i++) { > cmd.atc.sid = master->streams[i].id; > - arm_smmu_cmdq_issue_cmd(master->smmu, &cmd); > + arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd); > } > > - return arm_smmu_cmdq_issue_sync(master->smmu); > + return arm_smmu_cmdq_batch_submit(master->smmu, &cmds); > } > > int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid, >
On 13/08/2021 17:01, Robin Murphy wrote: >> >> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> index 235f9bdaeaf223b..c81cd929047f573 100644 >> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct >> arm_smmu_master *master) >> { >> int i; >> struct arm_smmu_cmdq_ent cmd; >> + struct arm_smmu_cmdq_batch cmds = {}; > > BTW, it looks like this has crossed over with John's patch removing these. It is only called from arm_smmu_disable_ats(), so not hot-path by the look for it. Or who even has ats HW ...? But it should be at least cleaned-up for consistency. Leizhen? Thanks, John
On 2021/8/14 0:45, John Garry wrote: > On 13/08/2021 17:01, Robin Murphy wrote: >>> >>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>> index 235f9bdaeaf223b..c81cd929047f573 100644 >>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>> @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master) >>> { >>> int i; >>> struct arm_smmu_cmdq_ent cmd; >>> + struct arm_smmu_cmdq_batch cmds = {}; >> >> BTW, it looks like this has crossed over with John's patch removing these. > > It is only called from arm_smmu_disable_ats(), so not hot-path by the look for it. Or who even has ats HW ...? > > But it should be at least cleaned-up for consistency. Leizhen? Okay, I'll revise it. But Will already took it. So I'm not sure whether to send v2 or a separate patch. > > Thanks, > John > . >
On 2021/8/16 10:15, Leizhen (ThunderTown) wrote: > > > On 2021/8/14 0:45, John Garry wrote: >> On 13/08/2021 17:01, Robin Murphy wrote: >>>> >>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>>> index 235f9bdaeaf223b..c81cd929047f573 100644 >>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >>>> @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master) >>>> { >>>> int i; >>>> struct arm_smmu_cmdq_ent cmd; >>>> + struct arm_smmu_cmdq_batch cmds = {}; >>> >>> BTW, it looks like this has crossed over with John's patch removing these. >> >> It is only called from arm_smmu_disable_ats(), so not hot-path by the look for it. Or who even has ats HW ...? >> >> But it should be at least cleaned-up for consistency. Leizhen? > > Okay, I'll revise it. But Will already took it. So I'm not sure whether to send v2 or a separate patch. I think I'd better post v2, otherwise I should write the same description. In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized slightly, three useless instructions can be reduced. Case 1): void arm_smmu_cmdq_build_cmd_tst1(u64 *cmd, struct arm_smmu_cmdq_ent *ent) { memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT); cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode); } 0000000000004608 <arm_smmu_cmdq_build_cmd_tst1>: 4608: a9007c1f stp xzr, xzr, [x0] 460c: 39400022 ldrb w2, [x1] 4610: f9400001 ldr x1, [x0] 4614: aa020021 orr x1, x1, x2 4618: f9000001 str x1, [x0] 461c: d65f03c0 ret Case 2): void arm_smmu_cmdq_build_cmd_tst2(u64 *cmd, struct arm_smmu_cmdq_ent *ent) { int i; cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode); for (i = 1; i < CMDQ_ENT_DWORDS; i++) cmd[i] = 0; } 0000000000004620 <arm_smmu_cmdq_build_cmd_tst2>: 4620: 39400021 ldrb w1, [x1] 4624: a9007c01 stp x1, xzr, [x0] 4628: d65f03c0 ret 462c: d503201f nop Case 3): void arm_smmu_cmdq_build_cmd_tst3(u64 *cmd, struct arm_smmu_cmdq_ent *ent) { memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT); cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode); } 0000000000004630 <arm_smmu_cmdq_build_cmd_tst3>: 4630: a9007c1f stp xzr, xzr, [x0] 4634: 39400021 ldrb w1, [x1] 4638: f9000001 str x1, [x0] 463c: d65f03c0 ret > >> >> Thanks, >> John >> . >> > _______________________________________________ > iommu mailing list > iommu@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/iommu >
> In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized > slightly, three useless instructions can be reduced. I think that you could optimise further by pre-building commonly used commands. For example, CMD_SYNC without MSI polling is always the same. And then only different in 1 field for MSI polling. But you need to check if the performance gain is worth the change. > > Case 1): > void arm_smmu_cmdq_build_cmd_tst1(u64 *cmd, struct arm_smmu_cmdq_ent *ent) > { > memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT); > cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode); > } > 0000000000004608 <arm_smmu_cmdq_build_cmd_tst1>: > 4608: a9007c1f stp xzr, xzr, [x0] > 460c: 39400022 ldrb w2, [x1] > 4610: f9400001 ldr x1, [x0] > 4614: aa020021 orr x1, x1, x2 > 4618: f9000001 str x1, [x0] > 461c: d65f03c0 ret > > Case 2): > void arm_smmu_cmdq_build_cmd_tst2(u64 *cmd, struct arm_smmu_cmdq_ent *ent) > { > int i; > > cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode); > for (i = 1; i < CMDQ_ENT_DWORDS; i++) > cmd[i] = 0; > } > 0000000000004620 <arm_smmu_cmdq_build_cmd_tst2>: > 4620: 39400021 ldrb w1, [x1] > 4624: a9007c01 stp x1, xzr, [x0] > 4628: d65f03c0 ret > 462c: d503201f nop > > Case 3): > void arm_smmu_cmdq_build_cmd_tst3(u64 *cmd, struct arm_smmu_cmdq_ent *ent) > { > memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT); > cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode); > } > 0000000000004630 <arm_smmu_cmdq_build_cmd_tst3>: > 4630: a9007c1f stp xzr, xzr, [x0] > 4634: 39400021 ldrb w1, [x1] > 4638: f9000001 str x1, [x0] > 463c: d65f03c0 ret >
On 2021/8/16 15:24, John Garry wrote: >> In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized >> slightly, three useless instructions can be reduced. > > I think that you could optimise further by pre-building commonly used commands. > > For example, CMD_SYNC without MSI polling is always the same. And then only different in 1 field for MSI polling. > > But you need to check if the performance gain is worth the change. Good advice. I can give it a try. > >> >> Case 1): >> void arm_smmu_cmdq_build_cmd_tst1(u64 *cmd, struct arm_smmu_cmdq_ent *ent) >> { >> memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT); >> cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode); >> } >> 0000000000004608 <arm_smmu_cmdq_build_cmd_tst1>: >> 4608: a9007c1f stp xzr, xzr, [x0] >> 460c: 39400022 ldrb w2, [x1] >> 4610: f9400001 ldr x1, [x0] >> 4614: aa020021 orr x1, x1, x2 >> 4618: f9000001 str x1, [x0] >> 461c: d65f03c0 ret >> >> Case 2): >> void arm_smmu_cmdq_build_cmd_tst2(u64 *cmd, struct arm_smmu_cmdq_ent *ent) >> { >> int i; >> >> cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode); >> for (i = 1; i < CMDQ_ENT_DWORDS; i++) >> cmd[i] = 0; >> } >> 0000000000004620 <arm_smmu_cmdq_build_cmd_tst2>: >> 4620: 39400021 ldrb w1, [x1] >> 4624: a9007c01 stp x1, xzr, [x0] >> 4628: d65f03c0 ret >> 462c: d503201f nop >> >> Case 3): >> void arm_smmu_cmdq_build_cmd_tst3(u64 *cmd, struct arm_smmu_cmdq_ent *ent) >> { >> memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT); >> cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode); >> } >> 0000000000004630 <arm_smmu_cmdq_build_cmd_tst3>: >> 4630: a9007c1f stp xzr, xzr, [x0] >> 4634: 39400021 ldrb w1, [x1] >> 4638: f9000001 str x1, [x0] >> 463c: d65f03c0 ret >> > > . >
On Mon, Aug 16, 2021 at 03:47:58PM +0800, Leizhen (ThunderTown) wrote: > > > On 2021/8/16 15:24, John Garry wrote: > >> In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized > >> slightly, three useless instructions can be reduced. > > > > I think that you could optimise further by pre-building commonly used commands. > > > > For example, CMD_SYNC without MSI polling is always the same. And then only different in 1 field for MSI polling. > > > > But you need to check if the performance gain is worth the change. > > Good advice. I can give it a try. Please send it as a new patch on top. I've queued the old one and sent it to Joerg. Since this is just further cleanup, it can be done separately. Will
On 2021/8/16 16:21, Will Deacon wrote: > On Mon, Aug 16, 2021 at 03:47:58PM +0800, Leizhen (ThunderTown) wrote: >> >> >> On 2021/8/16 15:24, John Garry wrote: >>>> In addition, I find that function arm_smmu_cmdq_build_cmd() can also be optimized >>>> slightly, three useless instructions can be reduced. >>> >>> I think that you could optimise further by pre-building commonly used commands. >>> >>> For example, CMD_SYNC without MSI polling is always the same. And then only different in 1 field for MSI polling. >>> >>> But you need to check if the performance gain is worth the change. >> >> Good advice. I can give it a try. > > Please send it as a new patch on top. I've queued the old one and sent > it to Joerg. Since this is just further cleanup, it can be done separately. OK, I'll post a separate patch later. > > Will > . >
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 235f9bdaeaf223b..c81cd929047f573 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1747,15 +1747,16 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master) { int i; struct arm_smmu_cmdq_ent cmd; + struct arm_smmu_cmdq_batch cmds = {}; arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd); for (i = 0; i < master->num_streams; i++) { cmd.atc.sid = master->streams[i].id; - arm_smmu_cmdq_issue_cmd(master->smmu, &cmd); + arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd); } - return arm_smmu_cmdq_issue_sync(master->smmu); + return arm_smmu_cmdq_batch_submit(master->smmu, &cmds); } int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
The obvious key to the performance optimization of commit 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion") is to allow multiple cores to insert commands in parallel after a brief mutex contention. Obviously, inserting as many commands at a time as possible can reduce the number of times the mutex contention participates, thereby improving the overall performance. At least it reduces the number of calls to function arm_smmu_cmdq_issue_cmdlist(). Therefore, use command queue batching helpers to insert multiple commands at a time. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)