Message ID | 1592846920-45338-5-git-send-email-john.garry@huawei.com (mailing list archive)
---|---
State | New, archived
Series | iommu/arm-smmu-v3: Improve cmdq lock efficiency
Hi John, I love your patch! Perhaps something to improve: [auto build test WARNING on iommu/next] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438 base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next config: arm64-randconfig-c024-20200622 (attached as .config) compiler: aarch64-linux-gcc (GCC) 9.3.0 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All warnings (new ones prefixed by >>, old ones prefixed by <<): In file included from include/linux/bits.h:23, from include/linux/ioport.h:15, from include/linux/acpi.h:12, from drivers/iommu/arm-smmu-v3.c:12: drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist': include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) | ^ include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO' 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) | ^ include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK' 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) | ^~~~~~~~~~~~~~~~~~~ >> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK' 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); | ^~~~~~~ include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) | ^ include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO' 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) | ^ include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK' 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) | ^~~~~~~~~~~~~~~~~~~ >> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK' 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); | ^~~~~~~ vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c 1369 1370 /* 1371 * This is the actual insertion function, and provides the following 1372 * ordering guarantees to callers: 1373 * 1374 * - There is a dma_wmb() before publishing any commands to the queue. 1375 * This can be relied upon to order prior writes to data structures 1376 * in memory (such as a CD or an STE) before the command. 1377 * 1378 * - On completion of a CMD_SYNC, there is a control dependency. 1379 * This can be relied upon to order subsequent writes to memory (e.g. 1380 * freeing an IOVA) after completion of the CMD_SYNC. 1381 * 1382 * - Command insertion is totally ordered, so if two CPUs each race to 1383 * insert their own list of commands then all of the commands from one 1384 * CPU will appear before any of the commands from the other CPU. 1385 * 1386 * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer 1387 * for when the cmdq is full, such that we don't wrap more than twice. 1388 * It also makes it easy for the owner to know by how many to increment the 1389 * cmdq lock. 
1390 */ 1391 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, 1392 u64 *cmds, int n) 1393 { 1394 u64 cmd_sync[CMDQ_ENT_DWORDS]; 1395 const int sync = 1; 1396 u32 prod; 1397 unsigned long flags; 1398 bool owner; 1399 struct arm_smmu_cmdq *cmdq = &smmu->cmdq; 1400 struct arm_smmu_ll_queue llq = { 1401 .max_n_shift = cmdq->q.llq.max_n_shift, 1402 }, head = llq, space = llq; 1403 u32 owner_val = 1 << cmdq->q.llq.owner_count_shift; > 1404 u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); 1405 u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift); 1406 int ret = 0; 1407 1408 /* 1. Allocate some space in the queue */ 1409 local_irq_save(flags); 1410 1411 prod = atomic_fetch_add(n + sync + owner_val, 1412 &cmdq->q.llq.atomic.prod); 1413 1414 owner = !(prod & owner_mask); 1415 llq.prod = prod_mask & prod; 1416 head.prod = queue_inc_prod_n(&llq, n + sync); 1417 1418 /* 1419 * Ensure it's safe to write the entries. For this, we need to ensure 1420 * that there is space in the queue from our prod pointer. 1421 */ 1422 space.cons = READ_ONCE(cmdq->q.llq.cons); 1423 space.prod = llq.prod; 1424 while (!queue_has_space(&space, n + sync)) { 1425 if (arm_smmu_cmdq_poll_until_not_full(smmu, &space)) 1426 dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); 1427 1428 space.prod = llq.prod; 1429 } 1430 1431 /* 1432 * 2. Write our commands into the queue 1433 * Dependency ordering from the space-checking while loop, above. 1434 */ 1435 arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n); 1436 1437 prod = queue_inc_prod_n(&llq, n); 1438 arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod); 1439 queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS); 1440 1441 /* 3. Mark our slots as valid, ensuring commands are visible first */ 1442 dma_wmb(); 1443 arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod); 1444 1445 /* 4. If we are the owner, take control of the SMMU hardware */ 1446 if (owner) { 1447 int owner_count; 1448 u32 prod_tmp; 1449 1450 /* a. Wait for previous owner to finish */ 1451 atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod); 1452 1453 /* b. Stop gathering work by clearing the owned mask */ 1454 prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask, 1455 &cmdq->q.llq.atomic.prod); 1456 prod = prod_tmp & prod_mask; 1457 owner_count = prod_tmp & owner_mask; 1458 owner_count >>= cmdq->q.llq.owner_count_shift; 1459 1460 /* 1461 * c. Wait for any gathered work to be written to the queue. 1462 * Note that we read our own entries so that we have the control 1463 * dependency required by (d). 1464 */ 1465 arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod); 1466 1467 /* 1468 * In order to determine completion of the CMD_SYNCs, we must 1469 * ensure that the queue can't wrap twice without us noticing. 1470 * We achieve that by taking the cmdq lock as shared before 1471 * progressing the prod pointer. 1472 * The owner does this for all the non-owners it has gathered. 1473 * Otherwise, some non-owner(s) may lock the cmdq, blocking 1474 * cons being updating. This could be when the cmdq has just 1475 * become full. In this case, other sibling non-owners could be 1476 * blocked from progressing, leading to deadlock. 1477 */ 1478 arm_smmu_cmdq_shared_lock(cmdq, owner_count); 1479 1480 /* 1481 * d. Advance the hardware prod pointer 1482 * Control dependency ordering from the entries becoming valid. 1483 */ 1484 writel_relaxed(prod, cmdq->q.prod_reg); 1485 1486 /* 1487 * e. 
Tell the next owner we're done 1488 * Make sure we've updated the hardware first, so that we don't 1489 * race to update prod and potentially move it backwards. 1490 */ 1491 atomic_set_release(&cmdq->owner_prod, prod); 1492 } 1493 1494 /* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */ 1495 llq.prod = queue_inc_prod_n(&llq, n); 1496 ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq); 1497 if (ret) 1498 dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n", 1499 llq.prod, readl_relaxed(cmdq->q.prod_reg), 1500 readl_relaxed(cmdq->q.cons_reg)); 1501 1502 /* 1503 * Try to unlock the cmdq lock. This will fail if we're the last reader, 1504 * in which case we can safely update cmdq->q.llq.cons 1505 */ 1506 if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) { 1507 WRITE_ONCE(cmdq->q.llq.cons, llq.cons); 1508 arm_smmu_cmdq_shared_unlock(cmdq); 1509 } 1510 1511 local_irq_restore(flags); 1512 return ret; 1513 } 1514 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On 23/06/2020 02:07, kernel test robot wrote: + Rikard, as the GENMASK compile-time sanity checks were added recently > Hi John, > > I love your patch! Perhaps something to improve: > > [auto build test WARNING on iommu/next] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use as documented in > https://git-scm.com/docs/git-format-patch] > > url: https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438 > base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next > config: arm64-randconfig-c024-20200622 (attached as .config) > compiler: aarch64-linux-gcc (GCC) 9.3.0 > > If you fix the issue, kindly add following tag as appropriate > Reported-by: kernel test robot <lkp@intel.com> > > All warnings (new ones prefixed by >>, old ones prefixed by <<): > > In file included from include/linux/bits.h:23, > from include/linux/ioport.h:15, > from include/linux/acpi.h:12, > from drivers/iommu/arm-smmu-v3.c:12: > drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist': > include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] > 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and h=unsigned value, so I doubt this warn. Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks like GENMASK_INPUT_CHECK() could be improved. > | ^ > include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO' > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) > | ^ > include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK' > 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) > | ^~~~~~~~~~~~~~~~~~~ >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK' > 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); > | ^~~~~~~ > include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] > 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) > | ^ > include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO' > 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) > | ^ > include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK' > 39 | (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) > | ^~~~~~~~~~~~~~~~~~~ >>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK' > 1404 | u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); > | ^~~~~~~ > > vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c
On 23/06/2020 10:35, Rikard Falkeborn wrote: > > I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and > h=unsigned value, so I doubt this warn. > > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it > looks > like GENMASK_INPUT_CHECK() could be improved. > > > Indeed it could, it is fixed in -next. ok, thanks for the pointer, but I still see this on today's -next with this patch: make W=1 drivers/iommu/arm-smmu-v3.o In file included from ./include/linux/bits.h:23:0, from ./include/linux/ioport.h:15, from ./include/linux/acpi.h:12, from drivers/iommu/arm-smmu-v3.c:12: drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_cmdq_issue_cmdlist’: ./include/linux/bits.h:27:7: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] (l) > (h), 0))) ^ ./include/linux/build_bug.h:16:62: note: in definition of macro ‘BUILD_BUG_ON_ZERO’ #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) ^ ./include/linux/bits.h:40:3: note: in expansion of macro ‘GENMASK_INPUT_CHECK’ (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) ^~~~~~~~~~~~~~~~~~~ drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro ‘GENMASK’ u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); That's gcc 7.5.0 . Cheers, John
On 2020-06-23 10:21, John Garry wrote: > On 23/06/2020 02:07, kernel test robot wrote: > > + Rikard, as the GENMASK compile-time sanity checks were added recently > >> Hi John, >> >> I love your patch! Perhaps something to improve: >> >> [auto build test WARNING on iommu/next] >> [If your patch is applied to the wrong git tree, kindly drop us a note. >> And when submitting patch, we suggest to use as documented in >> https://git-scm.com/docs/git-format-patch] >> >> url: >> https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438 >> >> base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git >> next >> config: arm64-randconfig-c024-20200622 (attached as .config) >> compiler: aarch64-linux-gcc (GCC) 9.3.0 >> >> If you fix the issue, kindly add following tag as appropriate >> Reported-by: kernel test robot <lkp@intel.com> >> >> All warnings (new ones prefixed by >>, old ones prefixed by <<): >> >> In file included from include/linux/bits.h:23, >> from include/linux/ioport.h:15, >> from include/linux/acpi.h:12, >> from drivers/iommu/arm-smmu-v3.c:12: >> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist': >> include/linux/bits.h:26:28: warning: comparison of unsigned expression >> < 0 is always false [-Wtype-limits] >> 26 | __builtin_constant_p((l) > (h)), (l) > (h), 0))) > > I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and > h=unsigned value, so I doubt this warn. > > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks > like GENMASK_INPUT_CHECK() could be improved. That said, I think this particular case might be even better off dodging GENMASK() entirely, by doing something like this first. Untested... Robin. ----->8----- Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations Beyond the initial queue setup based on the log2 values from ID registers, the log2 queue size is only ever used in the form of (1 << max_n_shift) to repeatedly recalculate the number of queue elements. 
Simply storing it in that form leads to slightly more efficient code, particularly in the low-level queue accessors where it counts most: add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116) Function old new delta arm_smmu_init_one_queue 360 364 +4 arm_smmu_priq_thread 512 508 -4 arm_smmu_evtq_thread 300 292 -8 __arm_smmu_cmdq_poll_set_valid_map.isra 296 288 -8 queue_remove_raw 180 164 -16 arm_smmu_gerror_handler 732 716 -16 arm_smmu_device_probe 4312 4284 -28 arm_smmu_cmdq_issue_cmdlist 1892 1852 -40 Total: Before=20135, After=20019, chg -0.58% Signed-off-by: Robin Murphy <robin.murphy@arm.com> --- drivers/iommu/arm-smmu-v3.c | 46 ++++++++++++++++++------------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index f578677a5c41..407cb9451a7a 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -185,8 +185,8 @@ #define ARM_SMMU_MEMATTR_DEVICE_nGnRE 0x1 #define ARM_SMMU_MEMATTR_OIWB 0xf -#define Q_IDX(llq, p) ((p) & ((1 << (llq)->max_n_shift) - 1)) -#define Q_WRP(llq, p) ((p) & (1 << (llq)->max_n_shift)) +#define Q_IDX(llq, p) ((p) & ((llq)->max_n - 1)) +#define Q_WRP(llq, p) ((p) & (llq)->max_n) #define Q_OVERFLOW_FLAG (1U << 31) #define Q_OVF(p) ((p) & Q_OVERFLOW_FLAG) #define Q_ENT(q, p) ((q)->base + \ @@ -531,7 +531,7 @@ struct arm_smmu_ll_queue { } atomic; u8 __pad[SMP_CACHE_BYTES]; } ____cacheline_aligned_in_smp; - u32 max_n_shift; + u32 max_n; }; struct arm_smmu_queue { @@ -771,7 +771,7 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n) cons = Q_IDX(q, q->cons); if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) - space = (1 << q->max_n_shift) - (prod - cons); + space = q->max_n - (prod - cons); else space = cons - prod; @@ -1164,8 +1164,8 @@ static void __arm_smmu_cmdq_poll_set_valid_map(struct arm_smmu_cmdq *cmdq, { u32 swidx, sbidx, ewidx, ebidx; struct arm_smmu_ll_queue llq = { - .max_n_shift = cmdq->q.llq.max_n_shift, - .prod = sprod, + .max_n = cmdq->q.llq.max_n, + .prod = sprod, }; ewidx = BIT_WORD(Q_IDX(&llq, eprod)); @@ -1344,8 +1344,8 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds, { int i; struct arm_smmu_ll_queue llq = { - .max_n_shift = cmdq->q.llq.max_n_shift, - .prod = prod, + .max_n = cmdq->q.llq.max_n, + .prod = prod, }; for (i = 0; i < n; ++i) { @@ -1381,7 +1381,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, bool owner; struct arm_smmu_cmdq *cmdq = &smmu->cmdq; struct arm_smmu_ll_queue llq = { - .max_n_shift = cmdq->q.llq.max_n_shift, + .max_n = cmdq->q.llq.max_n, }, head = llq; int ret = 0; @@ -3144,13 +3144,13 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu, size_t qsz; do { - qsz = ((1 << q->llq.max_n_shift) * dwords) << 3; + qsz = (q->llq.max_n * dwords) << 3; q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma, GFP_KERNEL); if (q->base || qsz < PAGE_SIZE) break; - q->llq.max_n_shift--; + q->llq.max_n >>= 1; } while (1); if (!q->base) { @@ -3162,7 +3162,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu, if (!WARN_ON(q->base_dma & (qsz - 1))) { dev_info(smmu->dev, "allocated %u entries for %s\n", - 1 << q->llq.max_n_shift, name); + q->llq.max_n, name); } q->prod_reg = arm_smmu_page1_fixup(prod_off, smmu); @@ -3171,7 +3171,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu, q->q_base = Q_BASE_RWA; q->q_base |= q->base_dma & Q_BASE_ADDR_MASK; - q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift); + q->q_base |= 
FIELD_PREP(Q_BASE_LOG2SIZE, ilog2(q->llq.max_n)); q->llq.prod = q->llq.cons = 0; return 0; @@ -3187,13 +3187,12 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu) { int ret = 0; struct arm_smmu_cmdq *cmdq = &smmu->cmdq; - unsigned int nents = 1 << cmdq->q.llq.max_n_shift; atomic_long_t *bitmap; atomic_set(&cmdq->owner_prod, 0); atomic_set(&cmdq->lock, 0); - bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL); + bitmap = (atomic_long_t *)bitmap_zalloc(cmdq->q.llq.max_n, GFP_KERNEL); if (!bitmap) { dev_err(smmu->dev, "failed to allocate cmdq bitmap\n"); ret = -ENOMEM; @@ -3695,7 +3694,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass) static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) { - u32 reg; + u32 reg, max_n_shift; bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY; /* IDR0 */ @@ -3798,9 +3797,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) } /* Queue sizes, capped to ensure natural alignment */ - smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, - FIELD_GET(IDR1_CMDQS, reg)); - if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) { + max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_CMDQS, reg)); + smmu->cmdq.q.llq.max_n = 1 << max_n_shift; + if (smmu->cmdq.q.llq.max_n <= CMDQ_BATCH_ENTRIES) { /* * We don't support splitting up batches, so one batch of * commands plus an extra sync needs to fit inside the command @@ -3812,10 +3811,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) return -ENXIO; } - smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, - FIELD_GET(IDR1_EVTQS, reg)); - smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, - FIELD_GET(IDR1_PRIQS, reg)); + max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg)); + smmu->evtq.q.llq.max_n = 1 << max_n_shift; + + max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg)); + smmu->priq.q.llq.max_n = 1 << max_n_shift; /* SID/SSID sizes */ smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
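As a quick sanity check of the conversion, a standalone sketch showing that the Q_IDX()/Q_WRP() extractions are unchanged when max_n == 1 << max_n_shift; the queue size used is an arbitrary example, and none of this is driver code:

/*
 * Standalone check, not driver code: with max_n == 1 << max_n_shift, the
 * Q_IDX()/Q_WRP() forms before and after the patch agree for any pointer
 * value. The 256-entry queue size is an arbitrary example.
 */
#include <assert.h>
#include <stdint.h>

int main(void)
{
	const uint32_t max_n_shift = 8;
	const uint32_t max_n = 1u << max_n_shift;

	for (uint32_t p = 0; p < 4 * max_n; p++) {
		/* Q_IDX: index bits */
		assert((p & ((1u << max_n_shift) - 1)) == (p & (max_n - 1)));
		/* Q_WRP: wrap bit */
		assert((p & (1u << max_n_shift)) == (p & max_n));
	}
	return 0;
}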
>> >> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and >> h=unsigned value, so I doubt this warn. >> >> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks >> like GENMASK_INPUT_CHECK() could be improved. > > That said, I think this particular case might be even better off dodging > GENMASK() entirely, by doing something like this first. Untested... > > Robin. > > ----->8----- > Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations > > Beyond the initial queue setup based on the log2 values from ID > registers, the log2 queue size is only ever used in the form of > (1 << max_n_shift) to repeatedly recalculate the number of queue > elements. Simply storing it in that form leads to slightly more > efficient code, particularly in the low-level queue accessors > where it counts most: > > add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116) > Function old new delta > arm_smmu_init_one_queue 360 364 +4 > arm_smmu_priq_thread 512 508 -4 > arm_smmu_evtq_thread 300 292 -8 > __arm_smmu_cmdq_poll_set_valid_map.isra 296 288 -8 > queue_remove_raw 180 164 -16 > arm_smmu_gerror_handler 732 716 -16 > arm_smmu_device_probe 4312 4284 -28 > arm_smmu_cmdq_issue_cmdlist 1892 1852 -40 > Total: Before=20135, After=20019, chg -0.58% > > Signed-off-by: Robin Murphy <robin.murphy@arm.com> > --- [...] > } > > - smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, > - FIELD_GET(IDR1_EVTQS, reg)); > - smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, > - FIELD_GET(IDR1_PRIQS, reg)); > + max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg)); > + smmu->evtq.q.llq.max_n = 1 << max_n_shift; So I require the bitmask of this for the prod, which would be (max_n << 1) - 1. I don't feel too strongly either way, and the other big changes in this series need to be considered first... Thanks, John > + > + max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg)); > + smmu->priq.q.llq.max_n = 1 << max_n_shift; > > /* SID/SSID sizes */ > smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg); >
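For reference, the mask John mentions could be derived like this under Robin's stored-size representation; this is a sketch only, and the helper name is hypothetical:

/*
 * Hypothetical helper, assuming Robin's stored-size (max_n) form is
 * adopted: the prod field mask (index bits plus the wrap bit) needed by
 * arm_smmu_cmdq_issue_cmdlist(), i.e. what GENMASK(max_n_shift, 0)
 * computes today.
 */
static inline u32 queue_prod_mask(u32 max_n)
{
	/* max_n is a power of two, so this sets bits [log2(max_n):0] */
	return (max_n << 1) - 1;
}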
On 23/06/2020 14:55, Rikard Falkeborn wrote: > Den tis 23 juni 2020 12:21John Garry <john.garry@huawei.com > <mailto:john.garry@huawei.com>> skrev: > > On 23/06/2020 10:35, Rikard Falkeborn wrote: > > > > I'd say that GENMASK_INPUT_CHECK() should be able to handle a > l=0 and > > h=unsigned value, so I doubt this warn. > > > > Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, > but it > > looks > > like GENMASK_INPUT_CHECK() could be improved. > > > > > > Indeed it could, it is fixed in -next. > > ok, thanks for the pointer, but I still see this on today's -next with > this patch: > > make W=1 drivers/iommu/arm-smmu-v3.o > > > Oh, ok thanks for reporting. I guess different gcc versions have > different behaviour. I guess we'll have to change the comparison to > (!((h) == (l) || (h) > (l))) instead (not sure I got all parenthesis and > logic correct but you get the idea). > Yeah, so this looks to fix it: --- a/include/linux/bits.h +++ b/include/linux/bits.h @@ -23,7 +23,8 @@ #include <linux/build_bug.h> #define GENMASK_INPUT_CHECK(h, l) \ (BUILD_BUG_ON_ZERO(__builtin_choose_expr( \ - __builtin_constant_p((l) > (h)), (l) > (h), 0))) + __builtin_constant_p(!((h) == (l) ||(h) > (l))), !((h) == (l) ||(h) > (l)), 0))) + We may be able to just use (h) == (l) as the const expr to make it more concise, but that may be confusing. I only tested with my toolchain based on 7.5.0 Thanks, John
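To make the report easier to reproduce outside the kernel tree, here is a minimal standalone sketch of the warning. GENMASK_INPUT_CHECK() and BUILD_BUG_ON_ZERO() are the versions quoted earlier in this thread; __GENMASK() is a simplified 32-bit stand-in assumed for this example only:

/*
 * Minimal reproducer sketch for the -Wtype-limits warning discussed above.
 * GENMASK_INPUT_CHECK() and BUILD_BUG_ON_ZERO() are copied from the bits.h
 * and build_bug.h excerpts quoted in this thread; __GENMASK() is a
 * simplified 32-bit stand-in for this example only.
 *
 *   gcc -Wtype-limits -c genmask-warn.c
 */
#define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))

#define GENMASK_INPUT_CHECK(h, l) \
	(BUILD_BUG_ON_ZERO(__builtin_choose_expr( \
		__builtin_constant_p((l) > (h)), (l) > (h), 0)))

#define __GENMASK(h, l) \
	((~0U >> (31 - (h))) & ~((1U << (l)) - 1))	/* 32-bit stand-in */

#define GENMASK(h, l) \
	(GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))

unsigned int make_prod_mask(unsigned int max_n_shift)
{
	/*
	 * "(l) > (h)" becomes "0 > max_n_shift": a comparison of an unsigned
	 * expression < 0, which is always false, hence the warning. Writing
	 * GENMASK((int)max_n_shift, 0), as suggested above, avoids it.
	 */
	return GENMASK(max_n_shift, 0);
}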
On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote: > It has been shown that the cmpxchg() for finding space in the cmdq can > be a bottleneck: > - for more CPUs contending the cmdq, the cmpxchg() will fail more often > - since the software-maintained cons pointer is updated on the same 64b > memory region, the chance of cmpxchg() failure increases again > > The cmpxchg() is removed as part of 2 related changes: > > - Update prod and cmdq owner in a single atomic add operation. For this, we > count the prod and owner in separate regions in prod memory. > > As with simple binary counting, once the prod+wrap fields overflow, they > will zero. They should never overflow into "owner" region, and we zero > the non-owner, prod region for each owner. This maintains the prod > pointer. > > As for the "owner", we now count this value, instead of setting a flag. > Similar to before, once the owner has finished gathering, it will clear > a mask. As such, a CPU declares itself as the "owner" when it reads zero > for this region. This zeroing will also clear possible overflow in > wrap+prod region, above. > > The owner is now responsible for all cmdq locking to avoid possible > deadlock. The owner will lock the cmdq for all non-owers it has gathered > when they have space in the queue and have written their entries. > > - Check for space in the cmdq after the prod pointer has been assigned. > > We don't bother checking for space in the cmdq before assigning the prod > pointer, as this would be racy. > > So since the prod pointer is updated unconditionally, it would be common > for no space to be available in the cmdq when prod is assigned - that > is, according the software-maintained prod and cons pointer. So now > it must be ensured that the entries are not yet written and not until > there is space. > > How the prod pointer is maintained also leads to a strange condition > where the prod pointer can wrap past the cons pointer. We can detect this > condition, and report no space here. However, a prod pointer progressed > twice past the cons pointer cannot be detected. But it can be ensured that > this that this scenario does not occur, as we limit the amount of > commands any CPU can issue at any given time, such that we cannot > progress prod pointer further. > > Signed-off-by: John Garry <john.garry@huawei.com> > --- > drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++-------------- > 1 file changed, 61 insertions(+), 40 deletions(-) I must admit, you made me smile putting trivial@kernel.org on cc for this ;) Will
On 16/07/2020 11:20, Will Deacon wrote: > On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote: >> It has been shown that the cmpxchg() for finding space in the cmdq can >> be a bottleneck: >> - for more CPUs contending the cmdq, the cmpxchg() will fail more often >> - since the software-maintained cons pointer is updated on the same 64b >> memory region, the chance of cmpxchg() failure increases again >> >> The cmpxchg() is removed as part of 2 related changes: >> >> - Update prod and cmdq owner in a single atomic add operation. For this, we >> count the prod and owner in separate regions in prod memory. >> >> As with simple binary counting, once the prod+wrap fields overflow, they >> will zero. They should never overflow into "owner" region, and we zero >> the non-owner, prod region for each owner. This maintains the prod >> pointer. >> >> As for the "owner", we now count this value, instead of setting a flag. >> Similar to before, once the owner has finished gathering, it will clear >> a mask. As such, a CPU declares itself as the "owner" when it reads zero >> for this region. This zeroing will also clear possible overflow in >> wrap+prod region, above. >> >> The owner is now responsible for all cmdq locking to avoid possible >> deadlock. The owner will lock the cmdq for all non-owers it has gathered >> when they have space in the queue and have written their entries. >> >> - Check for space in the cmdq after the prod pointer has been assigned. >> >> We don't bother checking for space in the cmdq before assigning the prod >> pointer, as this would be racy. >> >> So since the prod pointer is updated unconditionally, it would be common >> for no space to be available in the cmdq when prod is assigned - that >> is, according the software-maintained prod and cons pointer. So now >> it must be ensured that the entries are not yet written and not until >> there is space. >> >> How the prod pointer is maintained also leads to a strange condition >> where the prod pointer can wrap past the cons pointer. We can detect this >> condition, and report no space here. However, a prod pointer progressed >> twice past the cons pointer cannot be detected. But it can be ensured that >> this that this scenario does not occur, as we limit the amount of >> commands any CPU can issue at any given time, such that we cannot >> progress prod pointer further. >> >> Signed-off-by: John Garry <john.garry@huawei.com> >> --- >> drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++-------------- >> 1 file changed, 61 insertions(+), 40 deletions(-) > > I must admit, you made me smile putting trivial@kernel.org on cc for this ;) > Yes, quite ironic :)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 45a39ccaf455..54d056f10bea 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -772,10 +772,19 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n) prod = Q_IDX(q, q->prod); cons = Q_IDX(q, q->cons); - if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) + if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) { + /* check if we have wrapped, meaning definitely no space */ + if (cons > prod) + return false; + space = (1 << q->max_n_shift) - (prod - cons); - else + } else { + /* similar check to above */ + if (prod > cons) + return false; + space = cons - prod; + } return space >= n; } @@ -1073,7 +1082,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu) * fails if the caller appears to be the last lock holder (yes, this is * racy). All successful UNLOCK routines have RELEASE semantics. */ -static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq) +static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count) { int val; @@ -1083,12 +1092,12 @@ static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq) * to INT_MIN so these increments won't hurt as the value will remain * negative. */ - if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0) + if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0) return; do { val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0); - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val); + } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val); } static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq) @@ -1374,8 +1383,10 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds, * insert their own list of commands then all of the commands from one * CPU will appear before any of the commands from the other CPU. * - * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue - * more than the permitted amount commands at once. + * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer + * for when the cmdq is full, such that we don't wrap more than twice. + * It also makes it easy for the owner to know by how many to increment the + * cmdq lock. */ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, u64 *cmds, int n) @@ -1388,39 +1399,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, struct arm_smmu_cmdq *cmdq = &smmu->cmdq; struct arm_smmu_ll_queue llq = { .max_n_shift = cmdq->q.llq.max_n_shift, - }, head = llq; + }, head = llq, space = llq; + u32 owner_val = 1 << cmdq->q.llq.owner_count_shift; + u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0); + u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift); int ret = 0; /* 1. Allocate some space in the queue */ local_irq_save(flags); - llq.val = READ_ONCE(cmdq->q.llq.val); - do { - u64 old; - while (!queue_has_space(&llq, n + sync)) { - local_irq_restore(flags); - if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq)) - dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); - local_irq_save(flags); - } + prod = atomic_fetch_add(n + sync + owner_val, + &cmdq->q.llq.atomic.prod); - head.cons = llq.cons; - head.prod = queue_inc_prod_n(&llq, n + sync) | - CMDQ_PROD_OWNED_FLAG; + owner = !(prod & owner_mask); + llq.prod = prod_mask & prod; + head.prod = queue_inc_prod_n(&llq, n + sync); - old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val); - if (old == llq.val) - break; + /* + * Ensure it's safe to write the entries. 
For this, we need to ensure + * that there is space in the queue from our prod pointer. + */ + space.cons = READ_ONCE(cmdq->q.llq.cons); + space.prod = llq.prod; + while (!queue_has_space(&space, n + sync)) { + if (arm_smmu_cmdq_poll_until_not_full(smmu, &space)) + dev_err_ratelimited(smmu->dev, "CMDQ timeout\n"); - llq.val = old; - } while (1); - owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG); - head.prod &= ~CMDQ_PROD_OWNED_FLAG; - llq.prod &= ~CMDQ_PROD_OWNED_FLAG; + space.prod = llq.prod; + } /* * 2. Write our commands into the queue - * Dependency ordering from the cmpxchg() loop above. + * Dependency ordering from the space-checking while loop, above. */ arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n); @@ -1428,26 +1438,24 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod); queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS); - /* - * In order to determine completion of our CMD_SYNC, we must ensure - * that the queue can't wrap twice without us noticing. We achieve that - * by taking the cmdq lock as shared before marking our slot as valid. - */ - arm_smmu_cmdq_shared_lock(cmdq); - /* 3. Mark our slots as valid, ensuring commands are visible first */ dma_wmb(); arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod); /* 4. If we are the owner, take control of the SMMU hardware */ if (owner) { + int owner_count; + u32 prod_tmp; + /* a. Wait for previous owner to finish */ atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod); - /* b. Stop gathering work by clearing the owned flag */ - prod = atomic_fetch_andnot_relaxed(CMDQ_PROD_OWNED_FLAG, - &cmdq->q.llq.atomic.prod); - prod &= ~CMDQ_PROD_OWNED_FLAG; + /* b. Stop gathering work by clearing the owned mask */ + prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask, + &cmdq->q.llq.atomic.prod); + prod = prod_tmp & prod_mask; + owner_count = prod_tmp & owner_mask; + owner_count >>= cmdq->q.llq.owner_count_shift; /* * c. Wait for any gathered work to be written to the queue. @@ -1456,6 +1464,19 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, */ arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod); + /* + * In order to determine completion of the CMD_SYNCs, we must + * ensure that the queue can't wrap twice without us noticing. + * We achieve that by taking the cmdq lock as shared before + * progressing the prod pointer. + * The owner does this for all the non-owners it has gathered. + * Otherwise, some non-owner(s) may lock the cmdq, blocking + * cons being updating. This could be when the cmdq has just + * become full. In this case, other sibling non-owners could be + * blocked from progressing, leading to deadlock. + */ + arm_smmu_cmdq_shared_lock(cmdq, owner_count); + /* * d. Advance the hardware prod pointer * Control dependency ordering from the entries becoming valid.
It has been shown that the cmpxchg() for finding space in the cmdq can
be a bottleneck:
 - for more CPUs contending the cmdq, the cmpxchg() will fail more often
 - since the software-maintained cons pointer is updated on the same 64b
   memory region, the chance of cmpxchg() failure increases again

The cmpxchg() is removed as part of 2 related changes:

- Update prod and cmdq owner in a single atomic add operation. For this, we
  count the prod and owner in separate regions of the prod memory.

  As with simple binary counting, once the prod+wrap fields overflow, they
  will zero. They should never overflow into the "owner" region, and we zero
  the non-owner, prod region for each owner. This maintains the prod
  pointer.

  As for the "owner", we now count this value, instead of setting a flag.
  Similar to before, once the owner has finished gathering, it will clear
  a mask. As such, a CPU declares itself as the "owner" when it reads zero
  for this region. This zeroing will also clear possible overflow in the
  wrap+prod region, above.

  The owner is now responsible for all cmdq locking to avoid possible
  deadlock. The owner will lock the cmdq for all non-owners it has gathered
  when they have space in the queue and have written their entries.

- Check for space in the cmdq after the prod pointer has been assigned.

  We don't bother checking for space in the cmdq before assigning the prod
  pointer, as this would be racy.

  So since the prod pointer is updated unconditionally, it would be common
  for no space to be available in the cmdq when prod is assigned - that is,
  according to the software-maintained prod and cons pointers. So now it
  must be ensured that the entries are not written until there is space.

  How the prod pointer is maintained also leads to a strange condition
  where the prod pointer can wrap past the cons pointer. We can detect this
  condition, and report no space here. However, a prod pointer progressed
  twice past the cons pointer cannot be detected. But it can be ensured that
  this scenario does not occur, as we limit the number of commands any CPU
  can issue at any given time, such that we cannot progress the prod pointer
  that far.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
 1 file changed, 61 insertions(+), 40 deletions(-)
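To make the single-atomic-add scheme above concrete, here is a standalone sketch (not the driver code) of how the prod+wrap value and the owner count can share one 32-bit word, mirroring the prod_mask/owner_mask layout in the patch. The struct, function names and the exact owner field position are assumptions for the demo, and the memory-ordering and locking details of the real code are omitted:

/*
 * Illustrative sketch only, not the driver code: pack the prod+wrap value
 * and the owner count into a single 32-bit atomic word, as the patch does
 * with GENMASK(max_n_shift, 0) for prod and GENMASK(30, owner_count_shift)
 * for the owner count. All names and the field placement are assumptions
 * made for this demo.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_cmdq {
	_Atomic uint32_t prod;		/* owner count in [30:owner_shift], prod+wrap in [max_n_shift:0] */
	unsigned int max_n_shift;	/* log2(number of queue entries) */
	unsigned int owner_shift;	/* lowest bit of the owner-count field, above the wrap bit */
};

static uint32_t demo_prod_mask(const struct demo_cmdq *q)
{
	return (2u << q->max_n_shift) - 1;		/* bits [max_n_shift:0] */
}

static uint32_t demo_owner_mask(const struct demo_cmdq *q)
{
	return 0x7fffffffu & ~((1u << q->owner_shift) - 1);	/* bits [30:owner_shift] */
}

/* One CPU reserving space for n commands plus the trailing CMD_SYNC. */
static bool demo_alloc_space(struct demo_cmdq *q, uint32_t n, uint32_t *prod)
{
	uint32_t owner_val = 1u << q->owner_shift;
	/* advance prod by n + 1 and the owner count by one in a single atomic add */
	uint32_t old = atomic_fetch_add(&q->prod, (n + 1) + owner_val);

	*prod = old & demo_prod_mask(q);
	/* the CPU that read the owner count as zero becomes the owner */
	return !(old & demo_owner_mask(q));
}

/*
 * The owner stops gathering: atomically clear everything outside the prod
 * field (the owner count and any overflow above the wrap bit), and recover
 * how many CPUs it gathered, e.g. to take the shared cmdq lock that many
 * times on their behalf.
 */
static uint32_t demo_owner_close(struct demo_cmdq *q, uint32_t *prod)
{
	uint32_t old = atomic_fetch_and_explicit(&q->prod, demo_prod_mask(q),
						 memory_order_relaxed);

	*prod = old & demo_prod_mask(q);
	return (old & demo_owner_mask(q)) >> q->owner_shift;
}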