
[v2,0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency

Message ID 1598018062-175608-1-git-send-email-john.garry@huawei.com

Message

John Garry Aug. 21, 2020, 1:54 p.m. UTC
As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we find is that the cmpxchg() loop
used to get space on the queue takes a lot of time once many CPUs start
contending - from experiment, with 64 CPUs contending for the cmdq, the
success rate is ~1 in 12, which is poor, but not totally awful.

This series removes that cmpxchg() and replaces it with an atomic_add,
the same way the actual cmdq deals with maintaining the prod pointer.
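For illustration, the difference between the two allocation schemes looks
roughly like the following userspace C sketch. This is not the driver code:
q_val stands in for the cmdq's combined queue word, and the real prod
arithmetic and wrap handling are omitted.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t q_val;

/* Before: a cmpxchg()-style retry loop. Under heavy contention most
 * attempts lose the race and must be retried. */
static uint64_t reserve_cmpxchg(uint64_t n)
{
	uint64_t old, new;

	do {
		old = atomic_load_explicit(&q_val, memory_order_relaxed);
		new = old + n;	/* stand-in for the real prod arithmetic */
	} while (!atomic_compare_exchange_weak_explicit(&q_val, &old, new,
			memory_order_relaxed, memory_order_relaxed));
	return old;	/* where this CPU's slots begin */
}

/* After: a single atomic add always succeeds in one shot. */
static uint64_t reserve_add(uint64_t n)
{
	return atomic_fetch_add_explicit(&q_val, n, memory_order_relaxed);
}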

For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
increase:
Before: 1250K IOPs
After: 1550K IOPs

I also have a test harness to check the rate of DMA map+unmaps we can
achieve:

CPU count	8	16	32	64
Before:		282K	115K	36K	11K
After:		302K	193K	80K	30K

(unit is map+unmaps per CPU per second)
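For reference, a harness like that boils down to per-CPU kernel threads in a
tight map+unmap loop. The sketch below shows the idea only; it is not the
actual test code:

#include <linux/dma-mapping.h>
#include <linux/kthread.h>
#include <linux/slab.h>

static int map_stress_fn(void *data)
{
	struct device *dev = data;
	void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
	u64 iters = 0;

	while (buf && !kthread_should_stop()) {
		dma_addr_t dma = dma_map_single(dev, buf, PAGE_SIZE,
						DMA_TO_DEVICE);

		if (dma_mapping_error(dev, dma))
			break;
		/* The unmap is what reaches arm_smmu_cmdq_issue_cmdlist(),
		 * via the TLB invalidation commands it issues. */
		dma_unmap_single(dev, dma, PAGE_SIZE, DMA_TO_DEVICE);
		iters++;
	}
	kfree(buf);
	return 0;	/* iters over elapsed time gives the per-CPU rate */
}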

[0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

Differences to v1:
- Simplify by dropping patch to always issue a CMD_SYNC
- Use 64b atomic add, keeping prod in a separate 32b field (see the sketch below)
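A minimal sketch of how such a 64b add can work, assuming prod is kept in
the upper half of the 64-bit word; the layout and names here are
illustrative assumptions, not the driver's actual structures:

#include <stdatomic.h>
#include <stdint.h>

struct llq {
	_Atomic uint64_t val;	/* [63:32] prod, [31:0] other state */
};

/* Reserve n entries: the add only touches the prod half, and any
 * overflow carries out of the top of the 64-bit word, so the low
 * half can never be corrupted. */
static uint32_t reserve(struct llq *q, uint32_t n)
{
	uint64_t old = atomic_fetch_add_explicit(&q->val,
						 (uint64_t)n << 32,
						 memory_order_relaxed);
	return (uint32_t)(old >> 32);	/* start of this CPU's slots */
}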

John Garry (2):
  iommu/arm-smmu-v3: Calculate max commands per batch
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166 ++++++++++++++------
 1 file changed, 114 insertions(+), 52 deletions(-)

Comments

Song Bao Hua (Barry Song) Sept. 1, 2020, 11:17 a.m. UTC | #1
> -----Original Message-----
> From: John Garry <john.garry@huawei.com>
> Sent: Saturday, August 22, 2020 1:54 AM
> Subject: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
> 
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many CPUs
> contending - from experiment, for 64 CPUs contending the cmdq, success rate
> is ~ 1 in 12, which is poor, but not totally awful.
> 
> This series removes that cmpxchg() and replaces with an atomic_add, same as
> how the actual cmdq deals with maintaining the prod pointer.
> 
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1250K IOPs
> After: 1550K IOPs
> 
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
> 
> CPU count	8	16	32	64
> Before:		282K	115K	36K	11K
> After:		302K	193K	80K	30K
> 
> (unit is map+unmaps per CPU per second)

I have seen a performance improvement on hns3 networking when sending UDP with 1-32 threads:

Thread count		1	 4	  8	   16	    32
Before patch (TX Mbps)	7636.05	 16444.36 21694.48 25746.40 25295.93
After patch (TX Mbps)	7711.60	 16478.98 26561.06 32628.75 33764.56

As you can see, for 8, 16 and 32 threads, network TX throughput improves considerably. For 1 and 4 threads, TX throughput is almost the same before and after the patch. This makes sense, as this patch mainly targets reducing lock contention.

> 
> [0] https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3
> 
> Differences to v1:
> - Simplify by dropping patch to always issue a CMD_SYNC
> - Use 64b atomic add, keeping prod in a separate 32b field
> 
> John Garry (2):
>   iommu/arm-smmu-v3: Calculate max commands per batch
>   iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
> 
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166 ++++++++++++++------
>  1 file changed, 114 insertions(+), 52 deletions(-)
> 
> --
> 2.26.2

Thanks
Barry
Will Deacon Sept. 21, 2020, 1:43 p.m. UTC | #2
On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many
> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
> success rate is ~ 1 in 12, which is poor, but not totally awful.
> 
> This series removes that cmpxchg() and replaces with an atomic_add,
> same as how the actual cmdq deals with maintaining the prod pointer.

I'm still not a fan of this. Could you try to adapt the hacks I sent before,
please? I know they weren't quite right (I have no hardware to test on), but
the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
queueing in the spinlock implementation should avoid the contention.

Thanks,

Will
John Garry Sept. 21, 2020, 1:58 p.m. UTC | #3
On 21/09/2020 14:43, Will Deacon wrote:
> On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
>> As mentioned in [0], the CPU may consume many cycles processing
>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>> get space on the queue takes a lot of time once we start getting many
>> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
>> success rate is ~ 1 in 12, which is poor, but not totally awful.
>>
>> This series removes that cmpxchg() and replaces with an atomic_add,
>> same as how the actual cmdq deals with maintaining the prod pointer.
> 
> I'm still not a fan of this.

:(

> Could you try to adapt the hacks I sent before,
> please? I know they weren't quite right (I have no hardware to test on), but
> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
> queueing in the spinlock implementation should avoid the contention.

OK, so if you're asking me to try this again, then I can do that, and 
see what it gives us.

John
John Garry Sept. 23, 2020, 2:47 p.m. UTC | #4
On 21/09/2020 14:58, John Garry wrote:
> 
>> Could you try to adapt the hacks I sent before,
>> please? I know they weren't quite right (I have no hardware to test 
>> on

Could the ARM Rev C FVP be used to at least functionally test this? I can't 
seem to access it myself, even though it's gratis...

>> ), but
>> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
>> queueing in the spinlock implementation should avoid the contention.
> 

So I modified that suggested change to get it functioning, and it looks 
like this:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..f907b7c233a2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
  	atomic_long_t			*valid_map;
  	atomic_t			owner_prod;
  	atomic_t			lock;
+	spinlock_t			slock;
  };

  struct arm_smmu_cmdq_batch {
@@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
  	u64 cmd_sync[CMDQ_ENT_DWORDS];
  	u32 prod;
  	unsigned long flags;
-	bool owner;
+	bool owner, locked = false;
  	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
  	struct arm_smmu_ll_queue llq = {
  		.max_n_shift = cmdq->q.llq.max_n_shift,
@@ -1387,26 +1388,42 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,

  	/* 1. Allocate some space in the queue */
  	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
  	do {
  		u64 old;

-		while (!queue_has_space(&llq, n + sync)) {
+		llq.val = READ_ONCE(cmdq->q.llq.val);
+
+		if (queue_has_space(&llq, n + sync))
+			goto try_cas;
+
+		if (locked) {
+			spin_unlock(&cmdq->slock);
+			locked = false; // added
+		}
+
+		do {
  			local_irq_restore(flags);
  			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
  				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
  			local_irq_save(flags);
-		}
+		} while (!queue_has_space(&llq, n + sync));

+try_cas:
  		head.cons = llq.cons;
  		head.prod = queue_inc_prod_n(&llq, n + sync) |
  					     CMDQ_PROD_OWNED_FLAG;

  		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
+		if (old == llq.val) { /* was: if (old != llq.val) break; */
+			if (locked)
+				spin_unlock(&cmdq->slock);
  			break;
+		}

-		llq.val = old;
+		if (!locked) {
+			spin_lock(&cmdq->slock);
+			locked = true;
+		}
  	} while (1);
  	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
  	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
@@ -3192,6 +3209,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)

  	atomic_set(&cmdq->owner_prod, 0);
  	atomic_set(&cmdq->lock, 0);
+	spin_lock_init(&cmdq->slock);

  	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
  	if (!bitmap) {
John Garry Nov. 13, 2020, 10:43 a.m. UTC | #5
On 21/09/2020 14:58, John Garry wrote:
> On 21/09/2020 14:43, Will Deacon wrote:
>> On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
>>> As mentioned in [0], the CPU may consume many cycles processing
>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() 
>>> loop to
>>> get space on the queue takes a lot of time once we start getting many
>>> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
>>> success rate is ~ 1 in 12, which is poor, but not totally awful.
>>>
>>> This series removes that cmpxchg() and replaces with an atomic_add,
>>> same as how the actual cmdq deals with maintaining the prod pointer.
>> 
>> I'm still not a fan of this.
> 
> :(
> 
>> Could you try to adapt the hacks I sent before,
>> please? I know they weren't quite right (I have no hardware to test 
>> on), but
>> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
>> queueing in the spinlock implementation should avoid the contention.
> 
> OK, so if you're asking me to try this again, then I can do that, and 
> see what it gives us.
> 

JFYI, to prove that this is not a problem which affects only our HW, I 
managed to test an arm64 platform from another vendor. Generally I see 
the same issue, and this patchset actually helps that platform even more.

		CPUs	Before	After	% Increase
Huawei D06	8	282K	302K	7%
Other		8	379K	420K	11%

Huawei D06	16	115K	193K	68%
Other		16	102K	291K	185%

Huawei D06	32	36K	80K	122%
Other		32	41K	156K	280%

Huawei D06	64	11K	30K	172%
Other		64	6K	47K	683%

I tested with something like [1], so the unit is map+unmaps per CPU per 
second - higher is better.

My D06 is memory-poor, so I would expect higher results otherwise (with 
more memory). Indeed, my D05 has memory on all nodes and performs better.

Anyway, I see that the implementation here is not perfect, and I could 
not get the suggested approach to improve performance significantly. So 
it's back to the drawing board...

Thanks,
John

[1] https://lore.kernel.org/linux-iommu/20201102080646.2180-1-song.bao.hua@hisilicon.com/