Message ID: 1598018062-175608-1-git-send-email-john.garry@huawei.com (mailing list archive)
Series: iommu/arm-smmu-v3: Improve cmdq lock efficiency
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of John Garry
> Sent: Saturday, August 22, 2020 1:54 AM
> To: will@kernel.org; robin.murphy@arm.com
> Cc: joro@8bytes.org; linux-arm-kernel@lists.infradead.org;
> iommu@lists.linux-foundation.org; maz@kernel.org; Linuxarm
> <linuxarm@huawei.com>; linux-kernel@vger.kernel.org; John Garry
> <john.garry@huawei.com>
> Subject: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
>
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many CPUs
> contending - from experiment, for 64 CPUs contending the cmdq, success rate
> is ~ 1 in 12, which is poor, but not totally awful.
>
> This series removes that cmpxchg() and replaces with an atomic_add, same as
> how the actual cmdq deals with maintaining the prod pointer.
>
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1250K IOPs
> After: 1550K IOPs
>
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
>
> CPU count    8       16      32      64
> Before:      282K    115K    36K     11K
> After:       302K    193K    80K     30K
>
> (unit is map+unmaps per CPU per second)

I have seen a performance improvement on hns3 networking when sending UDP
with 1-32 threads:

Threads number         1        4        8        16       32
Before patch(TX Mbps)  7636.05  16444.36 21694.48 25746.40 25295.93
After patch(TX Mbps)   7711.60  16478.98 26561.06 32628.75 33764.56

As you can see, for 8, 16 and 32 threads, network TX throughput improves
considerably. For 1 and 4 threads, TX throughput is almost the same before
and after the patch. That seems sensible, as this patch is mainly about
reducing lock contention.

>
> [0]
> https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3
>
> Differences to v1:
> - Simplify by dropping patch to always issue a CMD_SYNC
> - Use 64b atomic add, keeping prod in a separate 32b field
>
> John Garry (2):
>   iommu/arm-smmu-v3: Calculate max commands per batch
>   iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()
>
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166 ++++++++++++++------
>  1 file changed, 114 insertions(+), 52 deletions(-)
>
> --
> 2.26.2

Thanks
Barry
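To make the before/after concrete for readers skimming the thread, the change boils down to roughly the following. This is a heavily simplified sketch, not the driver code: the queue-full check and wrap/overflow handling are omitted, the variables are as in arm_smmu_cmdq_issue_cmdlist(), and the atomic_prod field name is invented here purely for illustration.

/* Before (sketch): claim queue space with a cmpxchg() retry loop.
 * Under contention only one CPU wins each round; the rest reload the
 * queue state and try again. */
llq.val = READ_ONCE(cmdq->q.llq.val);
do {
	u64 old;

	head.cons = llq.cons;
	head.prod = queue_inc_prod_n(&llq, n + sync);

	old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
	if (old == llq.val)
		break;		/* we won the race */

	llq.val = old;		/* lost it - retry with the fresh state */
} while (1);

/* After (sketch): prod lives in the low 32 bits of a 64-bit word
 * ("atomic_prod" is a made-up atomic64_t field name for this sketch),
 * so every CPU claims its slots with a single atomic add that cannot
 * be lost. */
u64 prev;

prev = atomic64_fetch_add(n + sync, &cmdq->q.llq.atomic_prod);
llq.prod = lower_32_bits(prev);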
On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many
> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
> success rate is ~ 1 in 12, which is poor, but not totally awful.
>
> This series removes that cmpxchg() and replaces with an atomic_add,
> same as how the actual cmdq deals with maintaining the prod pointer.

I'm still not a fan of this. Could you try to adapt the hacks I sent
before, please? I know they weren't quite right (I have no hardware to
test on), but the basic idea is to fall back to a spinlock if the
cmpxchg() fails. The queueing in the spinlock implementation should
avoid the contention.

Thanks,

Will
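The shape of the fallback being suggested is roughly the following - a rough sketch of the idea only, not the actual hack. The queue-full handling is left out, and slock is a new spinlock assumed to be added to struct arm_smmu_cmdq for this purpose (John's adaptation against the real driver follows further down the thread).

bool locked = false;

do {
	llq.val = READ_ONCE(cmdq->q.llq.val);
	head.cons = llq.cons;
	head.prod = queue_inc_prod_n(&llq, n + sync);

	/* Try to claim the space in one shot. */
	if (cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val) == llq.val) {
		if (locked)
			spin_unlock(&cmdq->slock);
		break;			/* space claimed */
	}

	/* We lost the race: queue up behind a spinlock and retry while
	 * holding it, instead of hammering the shared cacheline. */
	if (!locked) {
		spin_lock(&cmdq->slock);
		locked = true;
	}
} while (1);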
On 21/09/2020 14:43, Will Deacon wrote:
> On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
>> As mentioned in [0], the CPU may consume many cycles processing
>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>> get space on the queue takes a lot of time once we start getting many
>> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
>> success rate is ~ 1 in 12, which is poor, but not totally awful.
>>
>> This series removes that cmpxchg() and replaces with an atomic_add,
>> same as how the actual cmdq deals with maintaining the prod pointer.
>
> I'm still not a fan of this.

:(

> Could you try to adapt the hacks I sent before,
> please? I know they weren't quite right (I have no hardware to test
> on), but
> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
> queueing in the spinlock implementation should avoid the contention.

OK, so if you're asking me to try this again, then I can do that, and
see what it gives us.

John
On 21/09/2020 14:58, John Garry wrote:
>
>> Could you try to adapt the hacks I sent before,
>> please? I know they weren't quite right (I have no hardware to test
>> on

Could the ARM Rev C FVP be used to at least functionally test? Can't
seem to access it myself, even though it's gratis...

), but
>> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
>> queueing in the spinlock implementation should avoid the contention.
>

So I modified that suggested change to get it functioning, and it looks
like this:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..f907b7c233a2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
 	atomic_long_t		*valid_map;
 	atomic_t		owner_prod;
 	atomic_t		lock;
+	spinlock_t		slock;
 };
 
 struct arm_smmu_cmdq_batch {
@@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	u64 cmd_sync[CMDQ_ENT_DWORDS];
 	u32 prod;
 	unsigned long flags;
-	bool owner;
+	bool owner, locked = false;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
 		.max_n_shift = cmdq->q.llq.max_n_shift,
@@ -1387,26 +1388,42 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
 	do {
 		u64 old;
 
-		while (!queue_has_space(&llq, n + sync)) {
+		llq.val = READ_ONCE(cmdq->q.llq.val);
+
+		if (queue_has_space(&llq, n + sync))
+			goto try_cas;
+
+		if (locked) {
+			spin_unlock(&cmdq->slock);
+			locked = 0; // added
+		}
+
+		do {
 			local_irq_restore(flags);
 			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
 				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 			local_irq_save(flags);
-		}
+		} while (!queue_has_space(&llq, n + sync));
 
+try_cas:
 		head.cons = llq.cons;
 		head.prod = queue_inc_prod_n(&llq, n + sync) |
 					     CMDQ_PROD_OWNED_FLAG;
 
 		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
+		if (old == llq.val) { // was if (old != llq.val)
+			if (locked) //			break;
+				spin_unlock(&cmdq->slock);//
+			break;//
+		}//
 
-		llq.val = old;
+		if (!locked) {
+			spin_lock(&cmdq->slock);
+			locked = true;
+		}
 	} while (1);
 
 	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
 	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
@@ -3192,6 +3209,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
 
 	atomic_set(&cmdq->owner_prod, 0);
 	atomic_set(&cmdq->lock, 0);
+	spin_lock_init(&cmdq->slock);
 
 	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
 	if (!bitmap) {
On 21/09/2020 14:58, John Garry wrote:
> On 21/09/2020 14:43, Will Deacon wrote:
>> On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
>>> As mentioned in [0], the CPU may consume many cycles processing
>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg()
>>> loop to
>>> get space on the queue takes a lot of time once we start getting many
>>> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
>>> success rate is ~ 1 in 12, which is poor, but not totally awful.
>>>
>>> This series removes that cmpxchg() and replaces with an atomic_add,
>>> same as how the actual cmdq deals with maintaining the prod pointer.
>>
>> I'm still not a fan of this.
>
> :(
>
>> Could you try to adapt the hacks I sent before,
>> please? I know they weren't quite right (I have no hardware to test
>> on), but
>> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
>> queueing in the spinlock implementation should avoid the contention.
>
> OK, so if you're asking me to try this again, then I can do that, and
> see what it gives us.
>

JFYI, to prove that this is not a problem which affects only our HW, I
managed to test an arm64 platform from another vendor. Generally I see
the same issue, and this patchset actually helps that platform even more.

             CPUs  Before  After  % Increase
Huawei D06   8     282K    302K   7%
Other              379K    420K   11%
Huawei D06   16    115K    193K   68%
Other              102K    291K   185%
Huawei D06   32    36K     80K    122%
Other              41K     156K   280%
Huawei D06   64    11K     30K    172%
Other              6K      47K    683%

I tested with something like [1], so the unit is map+unmaps per CPU per
second - higher is better.

My D06 is memory poor, so I would expect higher results otherwise (with
more memory). Indeed, my D05 has memory on all nodes and performs better.

Anyway, I see that the implementation here is not perfect, and I could
not get the suggested approach to improve performance significantly. So
back to the drawing board...

Thanks,
John

[1] https://lore.kernel.org/linux-iommu/20201102080646.2180-1-song.bao.hua@hisilicon.com/
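For a sense of what a map+unmap harness of this kind does, a minimal per-CPU measurement loop might look like the sketch below. This is illustrative only and not the code from [1]: the device pointer, buffer size, DMA direction and one-second run time are arbitrary choices here, and in practice one such thread would be started per CPU (e.g. via kthread_run()) with the results aggregated.

static int map_unmap_thread(void *data)
{
	struct device *dev = data;	/* device whose IOMMU domain is exercised */
	void *buf = (void *)__get_free_page(GFP_KERNEL);
	ktime_t end = ktime_add_ms(ktime_get(), 1000);	/* run for ~1s */
	u64 iters = 0;

	if (!buf)
		return -ENOMEM;

	while (ktime_before(ktime_get(), end)) {
		dma_addr_t dma = dma_map_single(dev, buf, PAGE_SIZE,
						DMA_TO_DEVICE);

		if (dma_mapping_error(dev, dma))
			break;
		dma_unmap_single(dev, dma, PAGE_SIZE, DMA_TO_DEVICE);
		iters++;
	}

	pr_info("%s: ~%llu map+unmaps per second\n", __func__,
		(unsigned long long)iters);
	free_page((unsigned long)buf);
	return 0;
}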