diff mbox series

[net-next,1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling

Message ID 20220126231239.1443128-2-tobias@waldekranz.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series net: dsa: mv88e6xxx: Improve indirect addressing performance

Checks

Context                         Check    Description
netdev/tree_selection           success  Clearly marked for net-next
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/subject_prefix           success  Link
netdev/cover_letter             success  Series has a cover letter
netdev/patch_count              success  Link
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers           success  CCed 7 of 7 maintainers
netdev/build_clang              success  Errors and warnings before: 0 this patch: 0
netdev/module_param             success  Was 0 now: 0
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 0 this patch: 0
netdev/checkpatch               success  total: 0 errors, 0 warnings, 0 checks, 47 lines checked
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0

Commit Message

Tobias Waldekranz Jan. 26, 2022, 11:12 p.m. UTC
Avoid a long delay when a busy bit is still set and has to be polled
again.

Measurements on a system with 2 Opals (6097F) and one Agate (6352)
show that even with this much tighter loop, we have about a 50% chance
of the bit being cleared on the first poll; all other accesses see the
bit being cleared on the second poll.

On a standard MDIO bus running MDC at 2.5MHz, a single access with 32
bits of preamble plus 32 bits of data takes 64*(1/2.5MHz) = 25.6us.

This means that mv88e6xxx_smi_direct_wait took 26us + CPU overhead in
the fast scenario, but 26us + 1500us + 26us + CPU overhead in the slow
case - bringing the average close to 1ms.

With this change in place, the slow case is closer to 2*26us + CPU
overhead, with the average well below 100us - a 10x improvement.
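
The arithmetic above can be reproduced with a small sketch. The helper
names mdio_access_ns() and avg_wait_ns() are hypothetical and not part
of the driver. The sketch takes the 50/50 split between one-poll and
two-poll waits from the measurements, uses 1500us as the midpoint of
usleep_range(1000, 2000), and ignores CPU overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds needed to clock `bits` bits at an MDC frequency of `mdc_hz`. */
uint64_t mdio_access_ns(uint32_t bits, uint64_t mdc_hz)
{
	assert(mdc_hz > 0);
	return (uint64_t)bits * 1000000000ULL / mdc_hz;
}

/*
 * Mean wait, assuming half of the waits finish on the first read and the
 * other half need a sleep of `sleep_ns` followed by a second read.
 * CPU overhead is ignored (assumption).
 */
uint64_t avg_wait_ns(uint64_t access_ns, uint64_t sleep_ns)
{
	uint64_t fast = access_ns;
	uint64_t slow = access_ns + sleep_ns + access_ns;

	return (fast + slow) / 2;
}
```

mdio_access_ns(64, 2500000) gives 25600ns. avg_wait_ns(25600, 1500000)
gives 788400ns for the old loop and avg_wait_ns(25600, 0) gives 38400ns
for the new one, before CPU overhead is added.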

This translates to real-world gains. On a 3-chip 20-port system,
the modprobe time drops by about 86%:

Before:

root@coronet:~# time modprobe mv88e6xxx
real    0m 15.99s
user    0m 0.00s
sys     0m 1.52s

After:

root@coronet:~# time modprobe mv88e6xxx
real    0m 2.21s
user    0m 0.00s
sys     0m 1.54s

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 8 ++++----
 drivers/net/dsa/mv88e6xxx/smi.c  | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

Comments

Andrew Lunn Jan. 26, 2022, 11:45 p.m. UTC | #1
> @@ -86,12 +86,12 @@ int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int addr, int reg, u16 val)
>  int mv88e6xxx_wait_mask(struct mv88e6xxx_chip *chip, int addr, int reg,
>  			u16 mask, u16 val)
>  {
> +	const unsigned long timeout = jiffies + msecs_to_jiffies(50);
>  	u16 data;
>  	int err;
> -	int i;
>  
>  	/* There's no bus specific operation to wait for a mask */
> -	for (i = 0; i < 16; i++) {
> +	do {
>  		err = mv88e6xxx_read(chip, addr, reg, &data);
>  		if (err)
>  			return err;
> @@ -99,8 +99,8 @@ int mv88e6xxx_wait_mask(struct mv88e6xxx_chip *chip, int addr, int reg,
>  		if ((data & mask) == val)
>  			return 0;
>  
> -		usleep_range(1000, 2000);
> -	}
> +		cpu_relax();
> +	} while (time_before(jiffies, timeout));

I don't know if this is an issue or not...

There are a few bit-banging systems out there. For those, I wonder if
50ms is too short? With the old code, they had 16 chances, no matter
how slow they were. With the new code, if they take 50ms for one
transaction, they don't get a second chance.

But if they have taken 50ms, around 37ms has been spent with the
preamble, start, op, phy address, and register address. I assume at
that point the switch actually looks at the register, and given your
timings, it really should be ready, so a second loop is probably not
required?

O.K, so this seems safe.

     Andrew
Andrew Lunn Jan. 26, 2022, 11:54 p.m. UTC | #2
On Thu, Jan 27, 2022 at 12:12:38AM +0100, Tobias Waldekranz wrote:
> Avoid a long delay when a busy bit is still set and has to be polled
> again.
> 
> Measurements on a system with 2 Opals (6097F) and one Agate (6352)
> show that even with this much tighter loop, we have about a 50% chance
> of the bit being cleared on the first poll; all other accesses see the
> bit being cleared on the second poll.
> 
> On a standard MDIO bus running MDC at 2.5MHz, a single access with 32
> bits of preamble plus 32 bits of data takes 64*(1/2.5MHz) = 25.6us.
> 
> This means that mv88e6xxx_smi_direct_wait took 26us + CPU overhead in
> the fast scenario, but 26us + 1500us + 26us + CPU overhead in the slow
> case - bringing the average close to 1ms.
> 
> With this change in place, the slow case is closer to 2*26us + CPU
> overhead, with the average well below 100us - a 10x improvement.
> 
> This translates to real-world gains. On a 3-chip 20-port system,
> the modprobe time drops by about 86%:
> 
> Before:
> 
> root@coronet:~# time modprobe mv88e6xxx
> real    0m 15.99s
> user    0m 0.00s
> sys     0m 1.52s
> 
> After:
> 
> root@coronet:~# time modprobe mv88e6xxx
> real    0m 2.21s
> user    0m 0.00s
> sys     0m 1.54s
> 
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew
Tobias Waldekranz Jan. 27, 2022, 12:58 p.m. UTC | #3
On Thu, Jan 27, 2022 at 00:45, Andrew Lunn <andrew@lunn.ch> wrote:
> There are a few bit-banging systems out there. For those, I wonder if
> 50ms is too short? With the old code, they had 16 chances, no matter
> how slow they were. With the new code, if they take 50ms for one
> transaction, they don't get a second chance.
>
> But if they have taken 50ms, around 37ms has been spent with the
> preamble, start, op, phy address, and register address. I assume at
> that point the switch actually looks at the register, and given your
> timings, it really should be ready, so a second loop is probably not
> required?
>
> O.K, so this seems safe.

I think you raise a good point though. Say that you then have this
series of events:

1. Bang out ST
2. Bang out OP
3. Bang out PHYADR
4. Bang out REGADR
5. Clock out TA
6. schedule()
7. A SCHED_FIFO/P99 task runs
8. Clock in DATA

- Steps 1 through 5 could plausibly be completed before the bit clears
  if you are running over some memory mapped GPIO lines
- Step 7 could execute for more than 50ms
- After step 8, you would see the busy bit set, but your time is up

All of this is of course _very_ unlikely, but not impossible. Should we
ensure that you always get at least two bites at the apple?
Andrew Lunn Jan. 27, 2022, 1:06 p.m. UTC | #4
On Thu, Jan 27, 2022 at 01:58:12PM +0100, Tobias Waldekranz wrote:
> On Thu, Jan 27, 2022 at 00:45, Andrew Lunn <andrew@lunn.ch> wrote:
> > There are a few bit-banging systems out there. For those, I wonder if
> > 50ms is too short? With the old code, they had 16 chances, no matter
> > how slow they were. With the new code, if they take 50ms for one
> > transaction, they don't get a second chance.
> >
> > But if they have taken 50ms, around 37ms has been spent with the
> > preamble, start, op, phy address, and register address. I assume at
> > that point the switch actually looks at the register, and given your
> > timings, it really should be ready, so a second loop is probably not
> > required?
> >
> > O.K, so this seems safe.
> 
> I think you raise a good point though. Say that you then have this
> series of events:
> 
> 1. Bang out ST
> 2. Bang out OP
> 3. Bang out PHYADR
> 4. Bang out REGADR
> 5. Clock out TA
> 6. schedule()
> 7. A SCHED_FIFO/P99 task runs
> 8. Clock in DATA
> 
> - Steps 1 through 5 could plausibly be completed before the bit clears
>   if you are running over some memory mapped GPIO lines
> - Step 7 could execute for more than 50ms
> - After step 8, you would see the busy bit set, but your time is up

So this is the opposite case I was thinking about. A very fast bit
banger. Yes, in theory this could happen.

> All of this is of course _very_ unlikely, but not impossible. Should we
> ensure that you always get at least two bites at the apple?

This is why I always point people at include/linux/iopoll.h. It
handles conditions like this by doing one more poll after the timeout
just to be sure the scheduler has not interfered. So a minimum of 2
would be good.

      Andrew
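
For illustration, here is a minimal userspace sketch of the "one more
poll after the timeout" pattern discussed above. It is not the kernel
code and not the patch as posted: wait_mask(), read_fn, struct
fake_chip and fake_read() are hypothetical names, and clock_gettime()
stands in for jiffies. The point is that a caller gets at least two
reads even if it is scheduled out for longer than the timeout between
the first read and the deadline check.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical register read callback: returns 0 or a negative errno. */
typedef int (*read_fn)(void *ctx, uint16_t *data);

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Poll until (data & mask) == val, re-reading once after the deadline. */
int wait_mask(read_fn rd, void *ctx, uint16_t mask, uint16_t val,
	      int64_t timeout_ns)
{
	const int64_t deadline = now_ns() + timeout_ns;
	uint16_t data;
	int err;

	assert(rd != NULL);

	do {
		err = rd(ctx, &data);
		if (err)
			return err;
		if ((data & mask) == val)
			return 0;
	} while (now_ns() < deadline);

	/* The deadline may have passed while we were not running: look once more. */
	err = rd(ctx, &data);
	if (err)
		return err;

	return ((data & mask) == val) ? 0 : -ETIMEDOUT;
}

/* Fake device for experiments: busy bit 15 clears on read number `ready_after`. */
struct fake_chip {
	int reads;
	int ready_after;
};

int fake_read(void *ctx, uint16_t *data)
{
	struct fake_chip *chip = ctx;

	chip->reads++;
	*data = (chip->reads >= chip->ready_after) ? 0x0000 : 0x8000;
	return 0;
}
```

With a timeout of 0, which models a deadline that has already expired
by the time the first read returns, a fake chip that becomes ready on
its second read still succeeds, because the final re-read happens
regardless of the clock.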

Patch

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 58ca684d73f7..3566617143cf 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -86,12 +86,12 @@  int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int addr, int reg, u16 val)
 int mv88e6xxx_wait_mask(struct mv88e6xxx_chip *chip, int addr, int reg,
 			u16 mask, u16 val)
 {
+	const unsigned long timeout = jiffies + msecs_to_jiffies(50);
 	u16 data;
 	int err;
-	int i;
 
 	/* There's no bus specific operation to wait for a mask */
-	for (i = 0; i < 16; i++) {
+	do {
 		err = mv88e6xxx_read(chip, addr, reg, &data);
 		if (err)
 			return err;
@@ -99,8 +99,8 @@  int mv88e6xxx_wait_mask(struct mv88e6xxx_chip *chip, int addr, int reg,
 		if ((data & mask) == val)
 			return 0;
 
-		usleep_range(1000, 2000);
-	}
+		cpu_relax();
+	} while (time_before(jiffies, timeout));
 
 	dev_err(chip->dev, "Timeout while waiting for switch\n");
 	return -ETIMEDOUT;
diff --git a/drivers/net/dsa/mv88e6xxx/smi.c b/drivers/net/dsa/mv88e6xxx/smi.c
index 282fe08db050..a59f32243e08 100644
--- a/drivers/net/dsa/mv88e6xxx/smi.c
+++ b/drivers/net/dsa/mv88e6xxx/smi.c
@@ -55,11 +55,11 @@  static int mv88e6xxx_smi_direct_write(struct mv88e6xxx_chip *chip,
 static int mv88e6xxx_smi_direct_wait(struct mv88e6xxx_chip *chip,
 				     int dev, int reg, int bit, int val)
 {
+	const unsigned long timeout = jiffies + msecs_to_jiffies(50);
 	u16 data;
 	int err;
-	int i;
 
-	for (i = 0; i < 16; i++) {
+	do {
 		err = mv88e6xxx_smi_direct_read(chip, dev, reg, &data);
 		if (err)
 			return err;
@@ -67,8 +67,8 @@  static int mv88e6xxx_smi_direct_wait(struct mv88e6xxx_chip *chip,
 		if (!!(data & BIT(bit)) == !!val)
 			return 0;
 
-		usleep_range(1000, 2000);
-	}
+		cpu_relax();
+	} while (time_before(jiffies, timeout));
 
 	return -ETIMEDOUT;
 }