diff mbox series

[4/4] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

Message ID 1592846920-45338-5-git-send-email-john.garry@huawei.com (mailing list archive)
State New, archived
Series iommu/arm-smmu-v3: Improve cmdq lock efficiency

Commit Message

John Garry June 22, 2020, 5:28 p.m. UTC
It has been shown that the cmpxchg() for finding space in the cmdq can
be a bottleneck:
- as more CPUs contend for the cmdq, the cmpxchg() will fail more often
- since the software-maintained cons pointer is updated on the same 64b
  memory region, the chance of cmpxchg() failure increases further

The cmpxchg() is removed as part of two related changes:

- Update prod and cmdq owner in a single atomic add operation. For this, we
  count the prod and owner in separate regions of the prod memory.

  As with simple binary counting, once the prod+wrap fields overflow, they
  wrap to zero. They should never overflow into the "owner" region, since we
  zero the non-owner (prod) region for each owner. This maintains the prod
  pointer.

  As for the "owner", we now count this value, instead of setting a flag.
  Similar to before, once the owner has finished gathering, it will clear
  a mask. As such, a CPU declares itself as the "owner" when it reads zero
  for this region. This zeroing will also clear possible overflow in the
  wrap+prod region, above.

  The owner is now responsible for all cmdq locking to avoid possible
  deadlock. The owner will lock the cmdq for all non-owners it has gathered
  when they have space in the queue and have written their entries.

- Check for space in the cmdq after the prod pointer has been assigned.

  We don't bother checking for space in the cmdq before assigning the prod
  pointer, as this would be racy.

  Since the prod pointer is updated unconditionally, it would be common
  for no space to be available in the cmdq when prod is assigned - that
  is, according to the software-maintained prod and cons pointers. So it
  must now be ensured that the entries are not written until there is
  space for them.

  How the prod pointer is maintained also leads to a strange condition
  where the prod pointer can wrap past the cons pointer. We can detect this
  condition, and report no space here. However, a prod pointer progressed
  twice past the cons pointer cannot be detected. But it can be ensured that
  this scenario does not occur, as we limit the number of commands any CPU
  can issue at any given time, such that the prod pointer cannot be
  progressed that far.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
 1 file changed, 61 insertions(+), 40 deletions(-)
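
As a rough sketch of the scheme described in the commit message above (a
standalone example, not the driver code; the field positions, the
owner_count_shift value and the helper names are assumptions chosen purely
for illustration), the prod word is split into a wrap+prod region and an
owner-count region, and a single atomic add claims both the queue slots and
an owner-count increment:

/* Hypothetical layout: 2^8-entry queue, owner count in bits [30:9] */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_N_SHIFT		8
#define PROD_MASK		((1u << (MAX_N_SHIFT + 1)) - 1)	/* wrap + prod */
#define OWNER_COUNT_SHIFT	(MAX_N_SHIFT + 1)
#define OWNER_VAL		(1u << OWNER_COUNT_SHIFT)
#define OWNER_MASK		(((1u << 31) - 1) & ~(OWNER_VAL - 1))

static _Atomic uint32_t llq_prod;	/* software-maintained prod word */

static void issue_cmdlist(int n)
{
	/* one add claims n entries, one CMD_SYNC slot and one owner count */
	uint32_t old = atomic_fetch_add(&llq_prod, (uint32_t)(n + 1) + OWNER_VAL);
	int owner = !(old & OWNER_MASK);	/* read zero owners -> own the batch */
	uint32_t prod = old & PROD_MASK;	/* private prod snapshot for this CPU */

	printf("prod=%u owner=%d\n", (unsigned)prod, owner);
}

int main(void)
{
	issue_cmdlist(4);	/* first caller becomes the owner */
	issue_cmdlist(2);	/* later callers are gathered non-owners */
	return 0;
}

The owner's later fetch_andnot of everything above the prod mask, which
clears both the owner count and any prod overflow, is left out to keep the
sketch short.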

Comments

kernel test robot June 23, 2020, 1:07 a.m. UTC | #1
Hi John,

I love your patch! Perhaps something to improve:

[auto build test WARNING on iommu/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: arm64-randconfig-c024-20200622 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>, old ones prefixed by <<):

In file included from include/linux/bits.h:23,
from include/linux/ioport.h:15,
from include/linux/acpi.h:12,
from drivers/iommu/arm-smmu-v3.c:12:
drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
|                            ^
include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
|                                                              ^
include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
|   ^~~~~~~~~~~~~~~~~~~
>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
|                  ^~~~~~~
include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
|                                        ^
include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
|                                                              ^
include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
|   ^~~~~~~~~~~~~~~~~~~
>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
|                  ^~~~~~~

vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c

  1369	
  1370	/*
  1371	 * This is the actual insertion function, and provides the following
  1372	 * ordering guarantees to callers:
  1373	 *
  1374	 * - There is a dma_wmb() before publishing any commands to the queue.
  1375	 *   This can be relied upon to order prior writes to data structures
  1376	 *   in memory (such as a CD or an STE) before the command.
  1377	 *
  1378	 * - On completion of a CMD_SYNC, there is a control dependency.
  1379	 *   This can be relied upon to order subsequent writes to memory (e.g.
  1380	 *   freeing an IOVA) after completion of the CMD_SYNC.
  1381	 *
  1382	 * - Command insertion is totally ordered, so if two CPUs each race to
  1383	 *   insert their own list of commands then all of the commands from one
  1384	 *   CPU will appear before any of the commands from the other CPU.
  1385	 *
  1386	 * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
  1387	 *   for when the cmdq is full, such that we don't wrap more than twice.
  1388	 *   It also makes it easy for the owner to know by how many to increment the
  1389	 *   cmdq lock.
  1390	 */
  1391	static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
  1392					       u64 *cmds, int n)
  1393	{
  1394		u64 cmd_sync[CMDQ_ENT_DWORDS];
  1395		const int sync = 1;
  1396		u32 prod;
  1397		unsigned long flags;
  1398		bool owner;
  1399		struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
  1400		struct arm_smmu_ll_queue llq = {
  1401			.max_n_shift = cmdq->q.llq.max_n_shift,
  1402		}, head = llq, space = llq;
  1403		u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
> 1404		u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
  1405		u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
  1406		int ret = 0;
  1407	
  1408		/* 1. Allocate some space in the queue */
  1409		local_irq_save(flags);
  1410	
  1411		prod = atomic_fetch_add(n + sync + owner_val,
  1412					&cmdq->q.llq.atomic.prod);
  1413	
  1414		owner = !(prod & owner_mask);
  1415		llq.prod = prod_mask & prod;
  1416		head.prod = queue_inc_prod_n(&llq, n + sync);
  1417	
  1418		/*
  1419		 * Ensure it's safe to write the entries. For this, we need to ensure
  1420		 * that there is space in the queue from our prod pointer.
  1421		 */
  1422		space.cons = READ_ONCE(cmdq->q.llq.cons);
  1423		space.prod = llq.prod;
  1424		while (!queue_has_space(&space, n + sync)) {
  1425			if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
  1426				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
  1427	
  1428			space.prod = llq.prod;
  1429		}
  1430	
  1431		/*
  1432		 * 2. Write our commands into the queue
  1433		 * Dependency ordering from the space-checking while loop, above.
  1434		 */
  1435		arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
  1436	
  1437		prod = queue_inc_prod_n(&llq, n);
  1438		arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
  1439		queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
  1440	
  1441		/* 3. Mark our slots as valid, ensuring commands are visible first */
  1442		dma_wmb();
  1443		arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod);
  1444	
  1445		/* 4. If we are the owner, take control of the SMMU hardware */
  1446		if (owner) {
  1447			int owner_count;
  1448			u32 prod_tmp;
  1449	
  1450			/* a. Wait for previous owner to finish */
  1451			atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
  1452	
  1453			/* b. Stop gathering work by clearing the owned mask */
  1454			prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask,
  1455							       &cmdq->q.llq.atomic.prod);
  1456			prod = prod_tmp & prod_mask;
  1457			owner_count = prod_tmp & owner_mask;
  1458			owner_count >>= cmdq->q.llq.owner_count_shift;
  1459	
  1460			/*
  1461			 * c. Wait for any gathered work to be written to the queue.
  1462			 * Note that we read our own entries so that we have the control
  1463			 * dependency required by (d).
  1464			 */
  1465			arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
  1466	
  1467			/*
  1468			 * In order to determine completion of the CMD_SYNCs, we must
  1469			 * ensure that the queue can't wrap twice without us noticing.
  1470			 * We achieve that by taking the cmdq lock as shared before
  1471			 * progressing the prod pointer.
  1472			 * The owner does this for all the non-owners it has gathered.
  1473			 * Otherwise, some non-owner(s) may lock the cmdq, blocking
  1474			 * cons from being updated. This could be when the cmdq has just
  1475			 * become full. In this case, other sibling non-owners could be
  1476			 * blocked from progressing, leading to deadlock.
  1477			 */
  1478			arm_smmu_cmdq_shared_lock(cmdq, owner_count);
  1479	
  1480			/*
  1481			 * d. Advance the hardware prod pointer
  1482			 * Control dependency ordering from the entries becoming valid.
  1483			 */
  1484			writel_relaxed(prod, cmdq->q.prod_reg);
  1485	
  1486			/*
  1487			 * e. Tell the next owner we're done
  1488			 * Make sure we've updated the hardware first, so that we don't
  1489			 * race to update prod and potentially move it backwards.
  1490			 */
  1491			atomic_set_release(&cmdq->owner_prod, prod);
  1492		}
  1493	
  1494		/* 5. Since we always insert a CMD_SYNC, we must wait for it to complete */
  1495		llq.prod = queue_inc_prod_n(&llq, n);
  1496		ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
  1497		if (ret)
  1498			dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
  1499					    llq.prod, readl_relaxed(cmdq->q.prod_reg),
  1500					    readl_relaxed(cmdq->q.cons_reg));
  1501	
  1502		/*
  1503		 * Try to unlock the cmdq lock. This will fail if we're the last reader,
  1504		 * in which case we can safely update cmdq->q.llq.cons
  1505		 */
  1506		if (!arm_smmu_cmdq_shared_tryunlock(cmdq)) {
  1507			WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
  1508			arm_smmu_cmdq_shared_unlock(cmdq);
  1509		}
  1510	
  1511		local_irq_restore(flags);
  1512		return ret;
  1513	}
  1514	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
John Garry June 23, 2020, 9:21 a.m. UTC | #2
On 23/06/2020 02:07, kernel test robot wrote:

+ Rikard, as the GENMASK compile-time sanity checks were added recently

> Hi John,
> 
> I love your patch! Perhaps something to improve:
> 
> [auto build test WARNING on iommu/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use  as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
> config: arm64-randconfig-c024-20200622 (attached as .config)
> compiler: aarch64-linux-gcc (GCC) 9.3.0
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
> 
> In file included from include/linux/bits.h:23,
> from include/linux/ioport.h:15,
> from include/linux/acpi.h:12,
> from drivers/iommu/arm-smmu-v3.c:12:
> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
> include/linux/bits.h:26:28: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
> 26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))

I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and 
h=unsigned value, so I doubt this warn.

Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks 
like GENMASK_INPUT_CHECK() could be improved.

> |                            ^
> include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
> 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
> |                                                              ^
> include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
> 39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> |   ^~~~~~~~~~~~~~~~~~~
>>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
> 1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> |                  ^~~~~~~
> include/linux/bits.h:26:40: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
> 26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
> |                                        ^
> include/linux/build_bug.h:16:62: note: in definition of macro 'BUILD_BUG_ON_ZERO'
> 16 | #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
> |                                                              ^
> include/linux/bits.h:39:3: note: in expansion of macro 'GENMASK_INPUT_CHECK'
> 39 |  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
> |   ^~~~~~~~~~~~~~~~~~~
>>> drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro 'GENMASK'
> 1404 |  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
> |                  ^~~~~~~
> 
> vim +/GENMASK +1404 drivers/iommu/arm-smmu-v3.c
John Garry June 23, 2020, 10:19 a.m. UTC | #3
On 23/06/2020 10:35, Rikard Falkeborn wrote:
> 
>     I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
>     h=unsigned value, so I doubt this warn.
> 
>     Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it
>     looks
>     like GENMASK_INPUT_CHECK() could be improved.
> 
> 
> Indeed it could, it is fixed in -next.

ok, thanks for the pointer, but I still see this on today's -next with 
this patch:

make W=1 drivers/iommu/arm-smmu-v3.o

In file included from ./include/linux/bits.h:23:0,
                 from ./include/linux/ioport.h:15,
                 from ./include/linux/acpi.h:12,
                 from drivers/iommu/arm-smmu-v3.c:12:
drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_cmdq_issue_cmdlist’:
./include/linux/bits.h:27:7: warning: comparison of unsigned expression 
< 0 is always false [-Wtype-limits]
   (l) > (h), 0)))
       ^
./include/linux/build_bug.h:16:62: note: in definition of macro 
‘BUILD_BUG_ON_ZERO’
#define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); })))
                                                              ^
./include/linux/bits.h:40:3: note: in expansion of macro 
‘GENMASK_INPUT_CHECK’
  (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
   ^~~~~~~~~~~~~~~~~~~
drivers/iommu/arm-smmu-v3.c:1404:18: note: in expansion of macro ‘GENMASK’
  u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);

That's gcc 7.5.0.

Cheers,
John
Robin Murphy June 23, 2020, 4:22 p.m. UTC | #4
On 2020-06-23 10:21, John Garry wrote:
> On 23/06/2020 02:07, kernel test robot wrote:
> 
> + Rikard, as the GENMASK compile-time sanity checks were added recently
> 
>> Hi John,
>>
>> I love your patch! Perhaps something to improve:
>>
>> [auto build test WARNING on iommu/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use  as documented in
>> https://git-scm.com/docs/git-format-patch]
>>
>> url:    
>> https://github.com/0day-ci/linux/commits/John-Garry/iommu-arm-smmu-v3-Improve-cmdq-lock-efficiency/20200623-013438 
>>
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
>> next
>> config: arm64-randconfig-c024-20200622 (attached as .config)
>> compiler: aarch64-linux-gcc (GCC) 9.3.0
>>
>> If you fix the issue, kindly add following tag as appropriate
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>>
>> In file included from include/linux/bits.h:23,
>> from include/linux/ioport.h:15,
>> from include/linux/acpi.h:12,
>> from drivers/iommu/arm-smmu-v3.c:12:
>> drivers/iommu/arm-smmu-v3.c: In function 'arm_smmu_cmdq_issue_cmdlist':
>> include/linux/bits.h:26:28: warning: comparison of unsigned expression 
>> < 0 is always false [-Wtype-limits]
>> 26 |   __builtin_constant_p((l) > (h)), (l) > (h), 0)))
> 
> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and 
> h=unsigned value, so I doubt this warn.
> 
> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks 
> like GENMASK_INPUT_CHECK() could be improved.

That said, I think this particular case might be even better off dodging
GENMASK() entirely, by doing something like this first. Untested...

Robin.

----->8-----
Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations

Beyond the initial queue setup based on the log2 values from ID
registers, the log2 queue size is only ever used in the form of
(1 << max_n_shift) to repeatedly recalculate the number of queue
elements. Simply storing it in that form leads to slightly more
efficient code, particularly in the low-level queue accessors
where it counts most:

add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
Function                                     old     new   delta
arm_smmu_init_one_queue                      360     364      +4
arm_smmu_priq_thread                         512     508      -4
arm_smmu_evtq_thread                         300     292      -8
__arm_smmu_cmdq_poll_set_valid_map.isra      296     288      -8
queue_remove_raw                             180     164     -16
arm_smmu_gerror_handler                      732     716     -16
arm_smmu_device_probe                       4312    4284     -28
arm_smmu_cmdq_issue_cmdlist                 1892    1852     -40
Total: Before=20135, After=20019, chg -0.58%

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 46 ++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..407cb9451a7a 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -185,8 +185,8 @@
 #define ARM_SMMU_MEMATTR_DEVICE_nGnRE	0x1
 #define ARM_SMMU_MEMATTR_OIWB		0xf
 
-#define Q_IDX(llq, p)			((p) & ((1 << (llq)->max_n_shift) - 1))
-#define Q_WRP(llq, p)			((p) & (1 << (llq)->max_n_shift))
+#define Q_IDX(llq, p)			((p) & ((llq)->max_n - 1))
+#define Q_WRP(llq, p)			((p) & (llq)->max_n)
 #define Q_OVERFLOW_FLAG			(1U << 31)
 #define Q_OVF(p)			((p) & Q_OVERFLOW_FLAG)
 #define Q_ENT(q, p)			((q)->base +			\
@@ -531,7 +531,7 @@ struct arm_smmu_ll_queue {
 		} atomic;
 		u8			__pad[SMP_CACHE_BYTES];
 	} ____cacheline_aligned_in_smp;
-	u32				max_n_shift;
+	u32				max_n;
 };
 
 struct arm_smmu_queue {
@@ -771,7 +771,7 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	cons = Q_IDX(q, q->cons);
 
 	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
-		space = (1 << q->max_n_shift) - (prod - cons);
+		space = q->max_n - (prod - cons);
 	else
 		space = cons - prod;
 
@@ -1164,8 +1164,8 @@ static void __arm_smmu_cmdq_poll_set_valid_map(struct arm_smmu_cmdq *cmdq,
 {
 	u32 swidx, sbidx, ewidx, ebidx;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift	= cmdq->q.llq.max_n_shift,
-		.prod		= sprod,
+		.max_n	= cmdq->q.llq.max_n,
+		.prod	= sprod,
 	};
 
 	ewidx = BIT_WORD(Q_IDX(&llq, eprod));
@@ -1344,8 +1344,8 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
 {
 	int i;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift	= cmdq->q.llq.max_n_shift,
-		.prod		= prod,
+		.max_n	= cmdq->q.llq.max_n,
+		.prod	= prod,
 	};
 
 	for (i = 0; i < n; ++i) {
@@ -1381,7 +1381,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	bool owner;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
-		.max_n_shift = cmdq->q.llq.max_n_shift,
+		.max_n = cmdq->q.llq.max_n,
 	}, head = llq;
 	int ret = 0;
 
@@ -3144,13 +3144,13 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	size_t qsz;
 
 	do {
-		qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
+		qsz = (q->llq.max_n * dwords) << 3;
 		q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
 					      GFP_KERNEL);
 		if (q->base || qsz < PAGE_SIZE)
 			break;
 
-		q->llq.max_n_shift--;
+		q->llq.max_n >>= 1;
 	} while (1);
 
 	if (!q->base) {
@@ -3162,7 +3162,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 
 	if (!WARN_ON(q->base_dma & (qsz - 1))) {
 		dev_info(smmu->dev, "allocated %u entries for %s\n",
-			 1 << q->llq.max_n_shift, name);
+			 q->llq.max_n, name);
 	}
 
 	q->prod_reg	= arm_smmu_page1_fixup(prod_off, smmu);
@@ -3171,7 +3171,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 
 	q->q_base  = Q_BASE_RWA;
 	q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
-	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
+	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, ilog2(q->llq.max_n));
 
 	q->llq.prod = q->llq.cons = 0;
 	return 0;
@@ -3187,13 +3187,12 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
 {
 	int ret = 0;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
-	unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
 	atomic_long_t *bitmap;
 
 	atomic_set(&cmdq->owner_prod, 0);
 	atomic_set(&cmdq->lock, 0);
 
-	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
+	bitmap = (atomic_long_t *)bitmap_zalloc(cmdq->q.llq.max_n, GFP_KERNEL);
 	if (!bitmap) {
 		dev_err(smmu->dev, "failed to allocate cmdq bitmap\n");
 		ret = -ENOMEM;
@@ -3695,7 +3694,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
-	u32 reg;
+	u32 reg, max_n_shift;
 	bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
 
 	/* IDR0 */
@@ -3798,9 +3797,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	}
 
 	/* Queue sizes, capped to ensure natural alignment */
-	smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_CMDQS, reg));
-	if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
+	max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_CMDQS, reg));
+	smmu->cmdq.q.llq.max_n = 1 << max_n_shift;
+	if (smmu->cmdq.q.llq.max_n <= CMDQ_BATCH_ENTRIES) {
 		/*
 		 * We don't support splitting up batches, so one batch of
 		 * commands plus an extra sync needs to fit inside the command
@@ -3812,10 +3811,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 		return -ENXIO;
 	}
 
-	smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_EVTQS, reg));
-	smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
-					     FIELD_GET(IDR1_PRIQS, reg));
+	max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
+	smmu->evtq.q.llq.max_n = 1 << max_n_shift;
+
+	max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
+	smmu->priq.q.llq.max_n = 1 << max_n_shift;
 
 	/* SID/SSID sizes */
 	smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
John Garry June 24, 2020, 8:15 a.m. UTC | #5
>>
>> I'd say that GENMASK_INPUT_CHECK() should be able to handle a l=0 and
>> h=unsigned value, so I doubt this warn.
>>
>> Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it, but it looks
>> like GENMASK_INPUT_CHECK() could be improved.
> 
> That said, I think this particular case might be even better off dodging
> GENMASK() entirely, by doing something like this first. Untested...
> 
> Robin.
> 
> ----->8-----
> Subject: [PATCH] iommu/arm-smmu-v3: Streamline queue calculations
> 
> Beyond the initial queue setup based on the log2 values from ID
> registers, the log2 queue size is only ever used in the form of
> (1 << max_n_shift) to repeatedly recalculate the number of queue
> elements. Simply storing it in that form leads to slightly more
> efficient code, particularly in the low-level queue accessors
> where it counts most:
> 
> add/remove: 0/0 grow/shrink: 1/7 up/down: 4/-120 (-116)
> Function                                     old     new   delta
> arm_smmu_init_one_queue                      360     364      +4
> arm_smmu_priq_thread                         512     508      -4
> arm_smmu_evtq_thread                         300     292      -8
> __arm_smmu_cmdq_poll_set_valid_map.isra      296     288      -8
> queue_remove_raw                             180     164     -16
> arm_smmu_gerror_handler                      732     716     -16
> arm_smmu_device_probe                       4312    4284     -28
> arm_smmu_cmdq_issue_cmdlist                 1892    1852     -40
> Total: Before=20135, After=20019, chg -0.58%
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---

[...]


>   	}
>   
> -	smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
> -					     FIELD_GET(IDR1_EVTQS, reg));
> -	smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
> -					     FIELD_GET(IDR1_PRIQS, reg));
> +	max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_EVTQS, reg));
> +	smmu->evtq.q.llq.max_n = 1 << max_n_shift;

So I require the bitmask of this for the prod, which would be
(max_n << 1) - 1.

I don't feel too strongly either way, and the other big changes in this 
series need to be considered first...

Thanks,
John

> +
> +	max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_PRIQS, reg));
> +	smmu->priq.q.llq.max_n = 1 << max_n_shift;
>   
>   	/* SID/SSID sizes */
>   	smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
>
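
A quick way to check the mask arithmetic John derives above: with the queue
size stored as max_n = 1 << max_n_shift, the wrap+prod mask that was
GENMASK(max_n_shift, 0) becomes (max_n << 1) - 1. A tiny standalone check
(plain userspace C, with the values and names chosen for illustration):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t max_n_shift = 8;		/* example: a 256-entry queue */
	uint32_t max_n = 1u << max_n_shift;

	/* GENMASK(max_n_shift, 0): the prod index bits plus the wrap bit */
	uint32_t mask_from_shift = (1u << (max_n_shift + 1)) - 1;

	/* the same mask when only max_n is stored */
	uint32_t mask_from_max_n = (max_n << 1) - 1;

	assert(mask_from_shift == mask_from_max_n);	/* both 0x1ff here */
	return 0;
}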
John Garry June 26, 2020, 10:05 a.m. UTC | #6
On 23/06/2020 14:55, Rikard Falkeborn wrote:
> On Tue, 23 Jun 2020 at 12:21, John Garry <john.garry@huawei.com> wrote:
> 
>     On 23/06/2020 10:35, Rikard Falkeborn wrote:
>      >
>      >     I'd say that GENMASK_INPUT_CHECK() should be able to handle a
>     l=0 and
>      >     h=unsigned value, so I doubt this warn.
>      >
>      >     Using GENMASK((int)cmdq->q.llq.max_n_shift, 0) resolves it,
>     but it
>      >     looks
>      >     like GENMASK_INPUT_CHECK() could be improved.
>      >
>      >
>      > Indeed it could, it is fixed in -next.
> 
>     ok, thanks for the pointer, but I still see this on today's -next with
>     this patch:
> 
>     make W=1 drivers/iommu/arm-smmu-v3.o
> 
> 
> Oh, ok thanks for reporting. I guess different gcc versions have 
> different behaviour. I guess we'll have to change the comparison to 
> (!((h) == (l) || (h) > (l))) instead (not sure I got all parenthesis and 
> logic correct but you get the idea).
> 

Yeah, so this looks to fix it:

--- a/include/linux/bits.h
+++ b/include/linux/bits.h
@@ -23,7 +23,8 @@
#include <linux/build_bug.h>
#define GENMASK_INPUT_CHECK(h, l) \
        (BUILD_BUG_ON_ZERO(__builtin_choose_expr( \
-               __builtin_constant_p((l) > (h)), (l) > (h), 0)))
+               __builtin_constant_p(!((h) == (l) || (h) > (l))), !((h) == (l) || (h) > (l)), 0)))
+

We may be able to just use (h) == (l) as the const expr to make it more 
concise, but that may be confusing.

I only tested with my toolchain based on 7.5.0

Thanks,
John
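
For context, the warning John still sees comes down to GENMASK()'s input
check comparing a constant low bound of 0 against an unsigned, non-constant
high bound. A minimal standalone reproduction of that core comparison (an
illustrative sketch, not the bits.h code; it should trip the same class of
warning when built with gcc -Wtype-limits):

#include <stdint.h>

unsigned int prod_mask_of(uint32_t max_n_shift)
{
	/*
	 * Mimics the (l) > (h) sanity check with l == 0 and an unsigned,
	 * non-constant h: gcc flags the comparison below as "comparison
	 * of unsigned expression < 0 is always false".
	 */
	if (0 > max_n_shift)
		return 0;

	return (1u << (max_n_shift + 1)) - 1;
}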
Will Deacon July 16, 2020, 10:20 a.m. UTC | #7
On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote:
> It has been shown that the cmpxchg() for finding space in the cmdq can
> be a bottleneck:
> - for more CPUs contending the cmdq, the cmpxchg() will fail more often
> - since the software-maintained cons pointer is updated on the same 64b
>   memory region, the chance of cmpxchg() failure increases again
> 
> The cmpxchg() is removed as part of 2 related changes:
> 
> - Update prod and cmdq owner in a single atomic add operation. For this, we
>   count the prod and owner in separate regions in prod memory.
> 
>   As with simple binary counting, once the prod+wrap fields overflow, they
>   will zero. They should never overflow into "owner" region, and we zero
>   the non-owner, prod region for each owner. This maintains the prod
>   pointer.
> 
>   As for the "owner", we now count this value, instead of setting a flag.
>   Similar to before, once the owner has finished gathering, it will clear
>   a mask. As such, a CPU declares itself as the "owner" when it reads zero
>   for this region. This zeroing will also clear possible overflow in
>   wrap+prod region, above.
> 
>   The owner is now responsible for all cmdq locking to avoid possible
>   deadlock. The owner will lock the cmdq for all non-owers it has gathered
>   when they have space in the queue and have written their entries.
> 
> - Check for space in the cmdq after the prod pointer has been assigned.
> 
>   We don't bother checking for space in the cmdq before assigning the prod
>   pointer, as this would be racy.
> 
>   So since the prod pointer is updated unconditionally, it would be common
>   for no space to be available in the cmdq when prod is assigned - that
>   is, according the software-maintained prod and cons pointer. So now
>   it must be ensured that the entries are not yet written and not until
>   there is space.
> 
>   How the prod pointer is maintained also leads to a strange condition
>   where the prod pointer can wrap past the cons pointer. We can detect this
>   condition, and report no space here. However, a prod pointer progressed
>   twice past the cons pointer cannot be detected. But it can be ensured that
>   this that this scenario does not occur, as we limit the amount of
>   commands any CPU can issue at any given time, such that we cannot
>   progress prod pointer further.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
>  1 file changed, 61 insertions(+), 40 deletions(-)

I must admit, you made me smile putting trivial@kernel.org on cc for this ;)

Will
John Garry July 16, 2020, 10:26 a.m. UTC | #8
On 16/07/2020 11:20, Will Deacon wrote:
> On Tue, Jun 23, 2020 at 01:28:40AM +0800, John Garry wrote:
>> It has been shown that the cmpxchg() for finding space in the cmdq can
>> be a bottleneck:
>> - for more CPUs contending the cmdq, the cmpxchg() will fail more often
>> - since the software-maintained cons pointer is updated on the same 64b
>>    memory region, the chance of cmpxchg() failure increases again
>>
>> The cmpxchg() is removed as part of 2 related changes:
>>
>> - Update prod and cmdq owner in a single atomic add operation. For this, we
>>    count the prod and owner in separate regions in prod memory.
>>
>>    As with simple binary counting, once the prod+wrap fields overflow, they
>>    will zero. They should never overflow into "owner" region, and we zero
>>    the non-owner, prod region for each owner. This maintains the prod
>>    pointer.
>>
>>    As for the "owner", we now count this value, instead of setting a flag.
>>    Similar to before, once the owner has finished gathering, it will clear
>>    a mask. As such, a CPU declares itself as the "owner" when it reads zero
>>    for this region. This zeroing will also clear possible overflow in
>>    wrap+prod region, above.
>>
>>    The owner is now responsible for all cmdq locking to avoid possible
>>    deadlock. The owner will lock the cmdq for all non-owers it has gathered
>>    when they have space in the queue and have written their entries.
>>
>> - Check for space in the cmdq after the prod pointer has been assigned.
>>
>>    We don't bother checking for space in the cmdq before assigning the prod
>>    pointer, as this would be racy.
>>
>>    So since the prod pointer is updated unconditionally, it would be common
>>    for no space to be available in the cmdq when prod is assigned - that
>>    is, according the software-maintained prod and cons pointer. So now
>>    it must be ensured that the entries are not yet written and not until
>>    there is space.
>>
>>    How the prod pointer is maintained also leads to a strange condition
>>    where the prod pointer can wrap past the cons pointer. We can detect this
>>    condition, and report no space here. However, a prod pointer progressed
>>    twice past the cons pointer cannot be detected. But it can be ensured that
>>    this that this scenario does not occur, as we limit the amount of
>>    commands any CPU can issue at any given time, such that we cannot
>>    progress prod pointer further.
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>>   drivers/iommu/arm-smmu-v3.c | 101 ++++++++++++++++++++++--------------
>>   1 file changed, 61 insertions(+), 40 deletions(-)
> 
> I must admit, you made me smile putting trivial@kernel.org on cc for this ;)
> 

Yes, quite ironic :)

Patch

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 45a39ccaf455..54d056f10bea 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -772,10 +772,19 @@  static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	prod = Q_IDX(q, q->prod);
 	cons = Q_IDX(q, q->cons);
 
-	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
+	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) {
+		/* check if we have wrapped, meaning definitely no space */
+		if (cons > prod)
+			return false;
+
 		space = (1 << q->max_n_shift) - (prod - cons);
-	else
+	} else {
+		/* similar check to above */
+		if (prod > cons)
+			return false;
+
 		space = cons - prod;
+	}
 
 	return space >= n;
 }
@@ -1073,7 +1082,7 @@  static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
  *   fails if the caller appears to be the last lock holder (yes, this is
  *   racy). All successful UNLOCK routines have RELEASE semantics.
  */
-static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
+static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count)
 {
 	int val;
 
@@ -1083,12 +1092,12 @@  static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
 	 * to INT_MIN so these increments won't hurt as the value will remain
 	 * negative.
 	 */
-	if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
+	if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0)
 		return;
 
 	do {
 		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -1374,8 +1383,10 @@  static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
  *   insert their own list of commands then all of the commands from one
  *   CPU will appear before any of the commands from the other CPU.
  *
- * - A CMD_SYNC is always inserted, ensuring that any CPU does not issue
- *   more than the permitted amount commands at once.
+ * - A CMD_SYNC is always inserted, which ensures we limit the prod pointer
+ *   for when the cmdq is full, such that we don't wrap more than twice.
+ *   It also makes it easy for the owner to know by how many to increment the
+ *   cmdq lock.
  */
 static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 				       u64 *cmds, int n)
@@ -1388,39 +1399,38 @@  static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
 		.max_n_shift = cmdq->q.llq.max_n_shift,
-	}, head = llq;
+	}, head = llq, space = llq;
+	u32 owner_val = 1 << cmdq->q.llq.owner_count_shift;
+	u32 prod_mask = GENMASK(cmdq->q.llq.max_n_shift, 0);
+	u32 owner_mask = GENMASK(30, cmdq->q.llq.owner_count_shift);
 	int ret = 0;
 
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
-	do {
-		u64 old;
 
-		while (!queue_has_space(&llq, n + sync)) {
-			local_irq_restore(flags);
-			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
-				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
-			local_irq_save(flags);
-		}
+	prod = atomic_fetch_add(n + sync + owner_val,
+				&cmdq->q.llq.atomic.prod);
 
-		head.cons = llq.cons;
-		head.prod = queue_inc_prod_n(&llq, n + sync) |
-					     CMDQ_PROD_OWNED_FLAG;
+	owner = !(prod & owner_mask);
+	llq.prod = prod_mask & prod;
+	head.prod = queue_inc_prod_n(&llq, n + sync);
 
-		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
-			break;
+	/*
+	 * Ensure it's safe to write the entries. For this, we need to ensure
+	 * that there is space in the queue from our prod pointer.
+	 */
+	space.cons = READ_ONCE(cmdq->q.llq.cons);
+	space.prod = llq.prod;
+	while (!queue_has_space(&space, n + sync)) {
+		if (arm_smmu_cmdq_poll_until_not_full(smmu, &space))
+			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 
-		llq.val = old;
-	} while (1);
-	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
-	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
-	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
+		space.prod = llq.prod;
+	}
 
 	/*
 	 * 2. Write our commands into the queue
-	 * Dependency ordering from the cmpxchg() loop above.
+	 * Dependency ordering from the space-checking while loop, above.
 	 */
 	arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
 
@@ -1428,26 +1438,24 @@  static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
 	queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
 
-	/*
-	 * In order to determine completion of our CMD_SYNC, we must ensure
-	 * that the queue can't wrap twice without us noticing. We achieve that
-	 * by taking the cmdq lock as shared before marking our slot as valid.
-	 */
-	arm_smmu_cmdq_shared_lock(cmdq);
-
 	/* 3. Mark our slots as valid, ensuring commands are visible first */
 	dma_wmb();
 	arm_smmu_cmdq_set_valid_map(cmdq, llq.prod, head.prod);
 
 	/* 4. If we are the owner, take control of the SMMU hardware */
 	if (owner) {
+		int owner_count;
+		u32 prod_tmp;
+
 		/* a. Wait for previous owner to finish */
 		atomic_cond_read_relaxed(&cmdq->owner_prod, VAL == llq.prod);
 
-		/* b. Stop gathering work by clearing the owned flag */
-		prod = atomic_fetch_andnot_relaxed(CMDQ_PROD_OWNED_FLAG,
-						   &cmdq->q.llq.atomic.prod);
-		prod &= ~CMDQ_PROD_OWNED_FLAG;
+		/* b. Stop gathering work by clearing the owned mask */
+		prod_tmp = atomic_fetch_andnot_relaxed(~prod_mask,
+						       &cmdq->q.llq.atomic.prod);
+		prod = prod_tmp & prod_mask;
+		owner_count = prod_tmp & owner_mask;
+		owner_count >>= cmdq->q.llq.owner_count_shift;
 
 		/*
 		 * c. Wait for any gathered work to be written to the queue.
@@ -1456,6 +1464,19 @@  static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		 */
 		arm_smmu_cmdq_poll_valid_map(cmdq, llq.prod, prod);
 
+		/*
+		 * In order to determine completion of the CMD_SYNCs, we must
+		 * ensure that the queue can't wrap twice without us noticing.
+		 * We achieve that by taking the cmdq lock as shared before
+		 * progressing the prod pointer.
+		 * The owner does this for all the non-owners it has gathered.
+		 * Otherwise, some non-owner(s) may lock the cmdq, blocking
+		 * cons from being updated. This could be when the cmdq has just
+		 * become full. In this case, other sibling non-owners could be
+		 * blocked from progressing, leading to deadlock.
+		 */
+		arm_smmu_cmdq_shared_lock(cmdq, owner_count);
+
 		/*
 		 * d. Advance the hardware prod pointer
 		 * Control dependency ordering from the entries becoming valid.