iommu/arm-smmu-v3: Remove some unneeded init in arm_smmu_cmdq_issue_cmdlist()

Message ID 1624293394-202509-1-git-send-email-john.garry@huawei.com (mailing list archive)
State New, archived
Series iommu/arm-smmu-v3: Remove some unneeded init in arm_smmu_cmdq_issue_cmdlist()

Commit Message

John Garry June 21, 2021, 4:36 p.m. UTC
Members of struct "llq" will be zero-inited, apart from member max_n_shift.
But we write llq.val straight after the init, so it was pointless to zero
init those other members. As such, separately init member max_n_shift
only.

In addition, struct "head" is initialised to "llq" only so that member
max_n_shift is set. But that member is never referenced for "head", so
remove any init there.

Removing these initializations is seen as a small performance optimisation,
as this code is a (very) hot path.

Signed-off-by: John Garry <john.garry@huawei.com>

Comments

John Garry Aug. 5, 2021, 10:22 a.m. UTC | #1
On 21/06/2021 17:36, John Garry wrote:
> Members of struct "llq" will be zero-inited, apart from member max_n_shift.
> But we write llq.val straight after the init, so it was pointless to zero
> init those other members. As such, separately init member max_n_shift
> only.
> 
> In addition, struct "head" is initialised to "llq" only so that member
> max_n_shift is set. But that member is never referenced for "head", so
> remove any init there.
> 
> Removing these initializations is seen as a small performance optimisation,
> as this code is (very) hot path.
> 

Hi Will,

Any chance you can pick up this small optimisation?

Cheers

> Signed-off-by: John Garry <john.garry@huawei.com>
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 54b2f27b81d4..8a8ad49bb7fd 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -727,11 +727,11 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>   	unsigned long flags;
>   	bool owner;
>   	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
> -	struct arm_smmu_ll_queue llq = {
> -		.max_n_shift = cmdq->q.llq.max_n_shift,
> -	}, head = llq;
> +	struct arm_smmu_ll_queue llq, head;
>   	int ret = 0;
>   
> +	llq.max_n_shift = cmdq->q.llq.max_n_shift;
> +
>   	/* 1. Allocate some space in the queue */
>   	local_irq_save(flags);
>   	llq.val = READ_ONCE(cmdq->q.llq.val);
>
Will Deacon Aug. 5, 2021, 11:21 a.m. UTC | #2
On Thu, Aug 05, 2021 at 11:22:15AM +0100, John Garry wrote:
> On 21/06/2021 17:36, John Garry wrote:
> > Members of struct "llq" will be zero-inited, apart from member max_n_shift.
> > But we write llq.val straight after the init, so it was pointless to zero
> > init those other members. As such, separately init member max_n_shift
> > only.
> > 
> > In addition, struct "head" is initialised to "llq" only so that member
> > max_n_shift is set. But that member is never referenced for "head", so
> > remove any init there.
> > 
> > Removing these initializations is seen as a small performance optimisation,
> > as this code is (very) hot path.
> > 
> 
> Hi Will,
> 
> Any chance you can pick up this small optimisation?

Yup! I've actually queued it locally, but I may end up asking Joerg to take
it directly depending on what else I queue for 5.15. So far, most of the
SMMU stuff is all part of wider refactorings.

Cheers,

Will
Robin Murphy Aug. 5, 2021, 11:24 a.m. UTC | #3
On 2021-06-21 17:36, John Garry wrote:
> Members of struct "llq" will be zero-inited, apart from member max_n_shift.
> But we write llq.val straight after the init, so it was pointless to zero
> init those other members. As such, separately init member max_n_shift
> only.
> 
> In addition, struct "head" is initialised to "llq" only so that member
> max_n_shift is set. But that member is never referenced for "head", so
> remove any init there.
> 
> Removing these initializations is seen as a small performance optimisation,
> as this code is (very) hot path.

I looked at this and immediately thought "surely the compiler can see 
that all the prod/cons/val fields are written anyway and elide the 
initialisation?", so I dumped the before and after disassembly, and... oh.

You should probably clarify that it's zero-initialising all the 
cacheline padding which is both pointless and painful. With that,

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

However, having looked this closely I'm now tangentially wondering why 
max_n_shift isn't inside the padded union? It's read at the same time as 
both prod and cons by queue_has_space(), and never updated, so there 
doesn't appear to be any benefit to it being in a separate cacheline all 
by itself, and llq is already twice as big as it needs to be. Sorting 
that would also be a good opportunity to store the value of interest in 
its appropriate form so we're not needlessly recalculating 1 << shift 
every flippin' time...

Robin.
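[Editorial sketch] Robin's observation can be reproduced in a standalone userspace program. The struct below is a hypothetical miniature of arm_smmu_ll_queue (the 64-byte pad stands in for SMP_CACHE_BYTES): a designated initializer obliges the compiler to zero every byte it does not name, padding included, whereas plain member assignment writes only the fields actually used.

```c
#include <assert.h>

/* Hypothetical stand-in for struct arm_smmu_ll_queue: one cache line
 * of padding (64 bytes, standing in for SMP_CACHE_BYTES) plus the
 * read-only shift. */
struct ll_queue {
	union {
		unsigned long long val;
		struct { unsigned int prod, cons; };
		unsigned char __pad[64];
	};
	unsigned int max_n_shift;
};

/* Designated initializer: everything not named is zeroed, so the
 * compiler emits stores covering the padding too, even though val is
 * overwritten immediately afterwards. */
unsigned long long init_then_clobber(unsigned int shift,
				     unsigned long long fresh_val)
{
	struct ll_queue llq = { .max_n_shift = shift };

	llq.val = fresh_val;	/* mirrors llq.val = READ_ONCE(...) */
	return llq.val;
}

/* Member assignment, as in the patch: only the two fields actually
 * used are ever written. */
unsigned long long assign_only(unsigned int shift,
			       unsigned long long fresh_val)
{
	struct ll_queue llq;

	llq.max_n_shift = shift;
	llq.val = fresh_val;
	return llq.val;
}
```

Both functions behave identically; the difference only shows up in the generated code, which is exactly why the disassembly had to be checked.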

> Signed-off-by: John Garry <john.garry@huawei.com>
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 54b2f27b81d4..8a8ad49bb7fd 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -727,11 +727,11 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>   	unsigned long flags;
>   	bool owner;
>   	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
> -	struct arm_smmu_ll_queue llq = {
> -		.max_n_shift = cmdq->q.llq.max_n_shift,
> -	}, head = llq;
> +	struct arm_smmu_ll_queue llq, head;
>   	int ret = 0;
>   
> +	llq.max_n_shift = cmdq->q.llq.max_n_shift;
> +
>   	/* 1. Allocate some space in the queue */
>   	local_irq_save(flags);
>   	llq.val = READ_ONCE(cmdq->q.llq.val);
>
Robin Murphy Aug. 5, 2021, 12:18 p.m. UTC | #4
On 2021-08-05 12:24, Robin Murphy wrote:
> On 2021-06-21 17:36, John Garry wrote:
>> Members of struct "llq" will be zero-inited, apart from member 
>> max_n_shift.
>> But we write llq.val straight after the init, so it was pointless to zero
>> init those other members. As such, separately init member max_n_shift
>> only.
>>
>> In addition, struct "head" is initialised to "llq" only so that member
>> max_n_shift is set. But that member is never referenced for "head", so
>> remove any init there.
>>
>> Removing these initializations is seen as a small performance 
>> optimisation,
>> as this code is (very) hot path.
> 
> I looked at this and immediately thought "surely the compiler can see 
> that all the prod/cons/val fields are written anyway and elide the 
> initialisation?", so I dumped the before and after disassembly, and... oh.
> 
> You should probably clarify that it's zero-initialising all the 
> cacheline padding which is both pointless and painful. With that,
> 
> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
> 
> However, having looked this closely I'm now tangentially wondering why 
> max_n_shift isn't inside the padded union? It's read at the same time as 
> both prod and cons by queue_has_space(), and never updated, so there 
> doesn't appear to be any benefit to it being in a separate cacheline all 
> by itself, and llq is already twice as big as it needs to be. Sorting 
> that would also be a good opportunity to store the value of interest in 
> its appropriate form so we're not needlessly recalculating 1 << shift 
> every flippin' time...

...on which note, how about something like this on top?

(untested since I don't have any SMMUv3 hardware to hand)

Robin.

----->8-----
Subject: [PATCH] iommu/arm-smmu-v3: Improve arm_smmu_ll_queue efficiency

Once initialised, max_n_shift is only ever read at the same time as
accessing prod or cons, thus should not have any impact on contention
to justify keeping it in its own separate cacheline. Move it inside the
padding union to halve the size of struct arm_smmu_ll_queue. Even then,
though, there are a couple more spots in the command issuing path where
we could do without the overhead of zeroing even one cache line worth of
padding, so avoid implicit initialisation of those temporary structures
as was done at the top level in arm_smmu_cmdq_issue_cmdlist().

Furthermore, the shift value is only directly relevant for initially
setting up the relevant queue base register; all we care about after
that is the number of entries, so store that value instead once a
queue is initialised and avoid needlessly recalculating it everywhere.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 35 +++++++++++----------
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 29 ++++++++++-------
  2 files changed, 37 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 47610dc5d920..bc55217d6d61 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -111,7 +111,7 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
  	cons = Q_IDX(q, q->cons);

  	if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
-		space = (1 << q->max_n_shift) - (prod - cons);
+		space = q->nents - (prod - cons);
  	else
  		space = cons - prod;

@@ -517,10 +517,11 @@ static void __arm_smmu_cmdq_poll_set_valid_map(struct arm_smmu_cmdq *cmdq,
  					       u32 sprod, u32 eprod, bool set)
  {
  	u32 swidx, sbidx, ewidx, ebidx;
-	struct arm_smmu_ll_queue llq = {
-		.max_n_shift	= cmdq->q.llq.max_n_shift,
-		.prod		= sprod,
-	};
+	struct arm_smmu_ll_queue llq;
+
+	/* Avoid zero-initialising all the padding */
+	llq.nents = cmdq->q.llq.nents;
+	llq.prod = sprod;

  	ewidx = BIT_WORD(Q_IDX(&llq, eprod));
  	ebidx = Q_IDX(&llq, eprod) % BITS_PER_LONG;
@@ -696,10 +697,11 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
  					u32 prod, int n)
  {
  	int i;
-	struct arm_smmu_ll_queue llq = {
-		.max_n_shift	= cmdq->q.llq.max_n_shift,
-		.prod		= prod,
-	};
+	struct arm_smmu_ll_queue llq;
+
+	/* Avoid zero-initialising all the padding */
+	llq.nents = cmdq->q.llq.nents;
+	llq.prod = prod;

  	for (i = 0; i < n; ++i) {
  		u64 *cmd = &cmds[i * CMDQ_ENT_DWORDS];
@@ -736,7 +738,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
  	struct arm_smmu_ll_queue llq, head;
  	int ret = 0;

-	llq.max_n_shift = cmdq->q.llq.max_n_shift;
+	llq.nents = cmdq->q.llq.nents;

  	/* 1. Allocate some space in the queue */
  	local_irq_save(flags);
@@ -2845,16 +2847,18 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
  				   unsigned long cons_off,
  				   size_t dwords, const char *name)
  {
+	u32 max_n_shift = q->llq.max_n_shift;
  	size_t qsz;

  	do {
-		qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
+		q->llq.nents = 1 << max_n_shift;
+		qsz = q->llq.nents * dwords * sizeof(u64);
  		q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
  					      GFP_KERNEL);
  		if (q->base || qsz < PAGE_SIZE)
  			break;

-		q->llq.max_n_shift--;
+		max_n_shift--;
  	} while (1);

  	if (!q->base) {
@@ -2866,7 +2870,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,

  	if (!WARN_ON(q->base_dma & (qsz - 1))) {
  		dev_info(smmu->dev, "allocated %u entries for %s\n",
-			 1 << q->llq.max_n_shift, name);
+			 q->llq.nents, name);
  	}

  	q->prod_reg	= page + prod_off;
@@ -2875,7 +2879,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,

  	q->q_base  = Q_BASE_RWA;
  	q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
-	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
+	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, max_n_shift);

  	q->llq.prod = q->llq.cons = 0;
  	return 0;
@@ -2891,13 +2895,12 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
  {
  	int ret = 0;
  	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
-	unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
  	atomic_long_t *bitmap;

  	atomic_set(&cmdq->owner_prod, 0);
  	atomic_set(&cmdq->lock, 0);

-	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
+	bitmap = (atomic_long_t *)bitmap_zalloc(cmdq->q.llq.nents, GFP_KERNEL);
  	if (!bitmap) {
  		dev_err(smmu->dev, "failed to allocate cmdq bitmap\n");
  		ret = -ENOMEM;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 4cb136f07914..bc13952ad445 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -166,8 +166,8 @@
  #define ARM_SMMU_MEMATTR_DEVICE_nGnRE	0x1
  #define ARM_SMMU_MEMATTR_OIWB		0xf

-#define Q_IDX(llq, p)			((p) & ((1 << (llq)->max_n_shift) - 1))
-#define Q_WRP(llq, p)			((p) & (1 << (llq)->max_n_shift))
+#define Q_IDX(llq, p)			((p) & ((llq)->nents - 1))
+#define Q_WRP(llq, p)			((p) & (llq)->nents)
  #define Q_OVERFLOW_FLAG			(1U << 31)
  #define Q_OVF(p)			((p) & Q_OVERFLOW_FLAG)
  #define Q_ENT(q, p)			((q)->base +			\
@@ -505,18 +505,25 @@ struct arm_smmu_cmdq_ent {

  struct arm_smmu_ll_queue {
  	union {
-		u64			val;
-		struct {
-			u32		prod;
-			u32		cons;
+		struct {
+			union {
+				u64	val;
+				struct {
+					u32 prod;
+					u32 cons;
+				};
+				struct {
+					atomic_t prod;
+					atomic_t cons;
+				} atomic;
+			};
+			union {
+				u32	max_n_shift;
+				u32	nents;
+			};
  		};
-		struct {
-			atomic_t	prod;
-			atomic_t	cons;
-		} atomic;
  		u8			__pad[SMP_CACHE_BYTES];
  	} ____cacheline_aligned_in_smp;
-	u32				max_n_shift;
  };

  struct arm_smmu_queue {
John Garry Aug. 5, 2021, 1:40 p.m. UTC | #5
On 05/08/2021 12:24, Robin Murphy wrote:
> On 2021-06-21 17:36, John Garry wrote:
>> Members of struct "llq" will be zero-inited, apart from member 
>> max_n_shift.
>> But we write llq.val straight after the init, so it was pointless to zero
>> init those other members. As such, separately init member max_n_shift
>> only.
>>
>> In addition, struct "head" is initialised to "llq" only so that member
>> max_n_shift is set. But that member is never referenced for "head", so
>> remove any init there.
>>
>> Removing these initializations is seen as a small performance 
>> optimisation,
>> as this code is (very) hot path.
> 
> I looked at this and immediately thought "surely the compiler can see 
> that all the prod/cons/val fields are written anyway and elide the 
> initialisation?", so I dumped the before and after disassembly, and... oh.
> 
> You should probably clarify that it's zero-initialising all the 
> cacheline padding which is both pointless and painful. With that,
> 
> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
> 
> However, having looked this closely I'm now tangentially wondering why 
> max_n_shift isn't inside the padded union? It's read at the same time as 
> both prod and cons by queue_has_space(), and never updated, so there 
> doesn't appear to be any benefit to it being in a separate cacheline all 
> by itself, and llq is already twice as big as it needs to be.

I think the problem is that if the prod+cons 64b value and the shift are
on the same cacheline, then we have a chance of accessing a stale
cacheline twice:

static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
				       u64 *cmds, int n, bool sync)
{
	u64 cmd_sync[CMDQ_ENT_DWORDS];
	u32 prod;
	unsigned long flags;
	bool owner;
	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
	struct arm_smmu_ll_queue llq = {
		.max_n_shift = cmdq->q.llq.max_n_shift,	// here
	}, head = llq;
	int ret = 0;

	/* 1. Allocate some space in the queue */
	local_irq_save(flags);
	llq.val = READ_ONCE(cmdq->q.llq.val);	// and again here


since cmdq->q.llq is per-SMMU. If max_n_shift is on a separate 
cacheline, then it should never be stale.

I suppose they could be combined into a smaller sub-struct and loaded in 
a single operation, but it looks messy, and prob without much gain.

Thanks,
John

> Sorting 
> that would also be a good opportunity to store the value of interest in 
> its appropriate form so we're not needlessly recalculating 1 << shift 
> every flippin' time...
> 
> Robin.
> 
>> Signed-off-by: John Garry <john.garry@huawei.com>
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 54b2f27b81d4..8a8ad49bb7fd 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -727,11 +727,11 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>>       unsigned long flags;
>>       bool owner;
>>       struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
>> -    struct arm_smmu_ll_queue llq = {
>> -        .max_n_shift = cmdq->q.llq.max_n_shift,
>> -    }, head = llq;
>> +    struct arm_smmu_ll_queue llq, head;
>>       int ret = 0;
>> +    llq.max_n_shift = cmdq->q.llq.max_n_shift;
>> +
>>       /* 1. Allocate some space in the queue */
>>       local_irq_save(flags);
>>       llq.val = READ_ONCE(cmdq->q.llq.val);
>>
> .
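[Editorial sketch] The layout John defends above can be checked in a small standalone program. The field names mirror the kernel struct, but the 64-byte line size is an assumption standing in for SMP_CACHE_BYTES: because max_n_shift sits outside the padded, cacheline-aligned union, it lands on a different line from the contended prod/cons word, so a remote CPU bumping prod can never invalidate the line that max_n_shift is loaded from.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64	/* stand-in for SMP_CACHE_BYTES */

/* Hypothetical mirror of the existing arm_smmu_ll_queue layout. */
struct ll_queue {
	/* _Alignas plays the role of ____cacheline_aligned_in_smp */
	_Alignas(CACHE_LINE) union {
		unsigned long long val;	/* contended: prod+cons word */
		struct { unsigned int prod, cons; };
		unsigned char __pad[CACHE_LINE];
	} u;
	unsigned int max_n_shift;	/* read-mostly, own cache line */
};
```

Moving max_n_shift inside the padded union (Robin's first idea) would halve the struct, but would also put the read-mostly field back on the contended line; the asserts below document the current trade-off instead.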
Robin Murphy Aug. 5, 2021, 2:41 p.m. UTC | #6
On 2021-08-05 14:40, John Garry wrote:
> On 05/08/2021 12:24, Robin Murphy wrote:
>> On 2021-06-21 17:36, John Garry wrote:
>>> Members of struct "llq" will be zero-inited, apart from member 
>>> max_n_shift.
>>> But we write llq.val straight after the init, so it was pointless to 
>>> zero
>>> init those other members. As such, separately init member max_n_shift
>>> only.
>>>
>>> In addition, struct "head" is initialised to "llq" only so that member
>>> max_n_shift is set. But that member is never referenced for "head", so
>>> remove any init there.
>>>
>>> Removing these initializations is seen as a small performance 
>>> optimisation,
>>> as this code is (very) hot path.
>>
>> I looked at this and immediately thought "surely the compiler can see 
>> that all the prod/cons/val fields are written anyway and elide the 
>> initialisation?", so I dumped the before and after disassembly, and... 
>> oh.
>>
>> You should probably clarify that it's zero-initialising all the 
>> cacheline padding which is both pointless and painful. With that,
>>
>> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
>>
>> However, having looked this closely I'm now tangentially wondering why 
>> max_n_shift isn't inside the padded union? It's read at the same time 
>> as both prod and cons by queue_has_space(), and never updated, so 
>> there doesn't appear to be any benefit to it being in a separate 
>> cacheline all by itself, and llq is already twice as big as it needs 
>> to be.
> 
> I think that the problem is if the prod+cons 64b value and the shift are 
> on the same cacheline, then we have a chance of accessing a stale 
> cacheline twice:
> 
> static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>                         u64 *cmds, int n, bool sync)
> {
>      u64 cmd_sync[CMDQ_ENT_DWORDS];
>      u32 prod;
>      unsigned long flags;
>      bool owner;
>      struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
>      struct arm_smmu_ll_queue llq = {
>          .max_n_shift = cmdq->q.llq.max_n_shift,    // here
>      }, head = llq;
>      int ret = 0;
> 
>      /* 1. Allocate some space in the queue */
>      local_irq_save(flags);
>      llq.val = READ_ONCE(cmdq->q.llq.val);    // and again here
> 
> 
> since cmdq->q.llq is per-SMMU. If max_n_shift is on a separate 
> cacheline, then it should never be stale.

Ah, right, even though the accesses are always going to be close 
together, I suppose it could still technically cause some false sharing 
if someone else is trying to update prod at exactly the right time. I 
guess that might be why we need the explicit padding there in the first 
place, it's just a shame that it ends up wasting even more space with 
implicit padding at the end too (and I have a vague memory that trying 
to force member alignment and structure packing at the same time doesn't 
work well). Oh well.

> I suppose they could be combined into a smaller sub-struct and loaded in 
> a single operation, but it looks messy, and prob without much gain.

Indeed I wouldn't say that saving memory is the primary concern here, 
and any more convoluted code is hardly going to help performance. Plus 
it still wouldn't help the other cases where we're just copying the size 
into a fake queue to do some prod arithmetic - I hadn't fully clocked 
what was going on there when I skimmed through things earlier.

Disregarding the bogus layout change, though, do you reckon the rest of 
my idea makes sense?

Cheers,
Robin.
John Garry Aug. 5, 2021, 3:16 p.m. UTC | #7
On 05/08/2021 15:41, Robin Murphy wrote:
>> I suppose they could be combined into a smaller sub-struct and loaded 
>> in a single operation, but it looks messy, and prob without much gain.
> 
> Indeed I wouldn't say that saving memory is the primary concern here, 
> and any more convoluted code is hardly going to help performance. Plus 
> it still wouldn't help the other cases where we're just copying the size 
> into a fake queue to do some prod arithmetic - I hadn't fully clocked 
> what was going on there when I skimmed through things earlier.
> 
> Disregarding the bogus layout change, though, do you reckon the rest of 
> my idea makes sense?

I tried a similar change to avoid zero-initialising the padding in 
arm_smmu_cmdq_write_entries() and 
__arm_smmu_cmdq_poll_set_valid_map(), but the disassembly was the same. 
So the compiler must have got smart there.

But for the original change in this patch, it did make a difference. 
It's nice to remove what was a memcpy:

     1770: a9077eff stp xzr, xzr, [x23, #112]
}, head = llq;
     1774: 94000000 bl 0 <memcpy>

And performance was very fractionally better.

As for pre-evaluating "nents", I'm not sure how much that can help, but 
I am not too optimistic. I can try some testing when I get a chance. 
Having said that, I would need to check the disassembly also.

Thanks,
John
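[Editorial sketch] The `bl <memcpy>` John quotes comes from the `head = llq` copy-initialisation: copying a struct that spans two cache lines is large enough that the compiler calls out to memcpy. A hypothetical flattened sketch of the before/after pattern (sizes illustrative, not the kernel's):

```c
#include <assert.h>

/* Oversized struct standing in for the padded arm_smmu_ll_queue. */
struct big {
	unsigned char __pad[128];
	unsigned int shift;
};

/* Old pattern: "head" is copy-initialised from "llq", so the whole
 * object is copied even though only shift was ever meaningfully set. */
unsigned int old_pattern(unsigned int s)
{
	struct big llq = { .shift = s }, head = llq;

	return head.shift;
}

/* Patched pattern: both variables start uninitialised and only the
 * fields actually needed are ever written. */
unsigned int new_pattern(unsigned int s)
{
	struct big llq, head;

	llq.shift = s;
	head.shift = llq.shift;	/* only if/when head needs it */
	return head.shift;
}
```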
Robin Murphy Aug. 5, 2021, 5:14 p.m. UTC | #8
On 2021-08-05 16:16, John Garry wrote:
> On 05/08/2021 15:41, Robin Murphy wrote:
>>> I suppose they could be combined into a smaller sub-struct and loaded 
>>> in a single operation, but it looks messy, and prob without much gain.
>>
>> Indeed I wouldn't say that saving memory is the primary concern here, 
>> and any more convoluted code is hardly going to help performance. Plus 
>> it still wouldn't help the other cases where we're just copying the 
>> size into a fake queue to do some prod arithmetic - I hadn't fully 
>> clocked what was going on there when I skimmed through things earlier.
>>
>> Disregarding the bogus layout change, though, do you reckon the rest 
>> of my idea makes sense?
> 
> I tried the similar change to avoid zero-init the padding in 
> arm_smmu_cmdq_write_entries() and the 
> _arm_smmu_cmdq_poll_set_valid_map(), but the disassembly was the same. 
> So the compiler must have got smart there.

Yeah, in my build __arm_smmu_cmdq_poll_set_valid_map() only uses 32 
bytes of stack, so clearly it's managed to see through the macro magic 
once queue_inc_prod_n() is inlined and elide the whole struct. 
arm_smmu_cmdq_write_entries() is inlined already, but logically must be 
the same deal since it's a similarly inlined queue_inc_prod_n().

However, that may all change if different compiler flags or a different 
compiler lead to different inlining decisions, so I'd argue that if this 
can matter anywhere then it's worth treating consistently everywhere.

> But for the original change in this patch, it did make a difference. 
> It's nice to remove what was a memcpy:
> 
>      1770: a9077eff stp xzr, xzr, [x23, #112]
> }, head = llq;
>      1774: 94000000 bl 0 <memcpy>
> 
> And performance was very fractionally better.

Heh, mine was this beauty:

         struct arm_smmu_ll_queue llq = {
     17d4:       a9017f7f        stp     xzr, xzr, [x27, #16]
     17d8:       a9027f7f        stp     xzr, xzr, [x27, #32]
     17dc:       a9037f7f        stp     xzr, xzr, [x27, #48]
     17e0:       a9047f7f        stp     xzr, xzr, [x27, #64]
         }, head = llq;
     17e4:       b900c340        str     w0, [x26, #192]
{
     17e8:       290d0be1        stp     w1, w2, [sp, #104]
         }, head = llq;
     17ec:       a9440f62        ldp     x2, x3, [x27, #64]
     17f0:       a9007f5f        stp     xzr, xzr, [x26]
     17f4:       a9017f5f        stp     xzr, xzr, [x26, #16]
     17f8:       a9027f5f        stp     xzr, xzr, [x26, #32]
     17fc:       a9037f5f        stp     xzr, xzr, [x26, #48]
     1800:       a9040f42        stp     x2, x3, [x26, #64]
         struct arm_smmu_ll_queue llq = {
     1804:       a9057f7f        stp     xzr, xzr, [x27, #80]
         }, head = llq;
     1808:       a9057f5f        stp     xzr, xzr, [x26, #80]
         struct arm_smmu_ll_queue llq = {
     180c:       a9067f7f        stp     xzr, xzr, [x27, #96]
         }, head = llq;
     1810:       a9067f5f        stp     xzr, xzr, [x26, #96]
         struct arm_smmu_ll_queue llq = {
     1814:       a9077f7f        stp     xzr, xzr, [x27, #112]
         }, head = llq;
     1818:       a9077f5f        stp     xzr, xzr, [x26, #112]
         struct arm_smmu_ll_queue llq = {
     181c:       a9087f5f        stp     xzr, xzr, [x26, #128]

> As for pre-evaluating "nents", I'm not sure how much that can help, but 
> I am not too optimistic. I can try some testing when I get a chance. 
> Having said that, I would need to check the disassembly also.

It'll just turn MOV,LDR,LSL sequences into plain LDRs - a small saving 
but with no real downside, and a third of it is in the place where doing 
less work matters most:

add/remove: 0/0 grow/shrink: 0/8 up/down: 0/-100 (-100)
Function                                     old     new   delta
arm_smmu_priq_thread                         532     528      -4
arm_smmu_evtq_thread                         368     364      -4
arm_smmu_device_probe                       4564    4556      -8
__arm_smmu_cmdq_poll_set_valid_map.isra      316     308      -8
arm_smmu_init_one_queue.isra                 320     308     -12
queue_remove_raw                             192     176     -16
arm_smmu_gerror_handler                      752     736     -16
arm_smmu_cmdq_issue_cmdlist                 1812    1780     -32
Total: Before=23776, After=23676, chg -0.42%


Robin.
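[Editorial sketch] The codegen win Robin quantifies above comes from replacing a recomputed `1 << max_n_shift` with a stored entry count. A minimal sketch of the two forms of the queue_has_space() calculation (simplified: wrap handling omitted, names illustrative):

```c
#include <assert.h>

/* Recomputing the size from the shift: typically a load plus a
 * MOV/LSL pair at every call site. */
unsigned int space_from_shift(unsigned int max_n_shift,
			      unsigned int prod, unsigned int cons)
{
	return (1u << max_n_shift) - (prod - cons);
}

/* Storing nents = 1 << max_n_shift once at queue init: the same
 * calculation becomes a single load. */
unsigned int space_from_nents(unsigned int nents,
			      unsigned int prod, unsigned int cons)
{
	return nents - (prod - cons);
}
```

The two are equivalent whenever nents was set to 1 << max_n_shift at init time, which is what Robin's follow-up patch does in arm_smmu_init_one_queue().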
Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 54b2f27b81d4..8a8ad49bb7fd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -727,11 +727,11 @@  static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	unsigned long flags;
 	bool owner;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
-	struct arm_smmu_ll_queue llq = {
-		.max_n_shift = cmdq->q.llq.max_n_shift,
-	}, head = llq;
+	struct arm_smmu_ll_queue llq, head;
 	int ret = 0;
 
+	llq.max_n_shift = cmdq->q.llq.max_n_shift;
+
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
 	llq.val = READ_ONCE(cmdq->q.llq.val);