diff mbox series

[v5,6/7] r8169: Coalesce mac ocp write and modify for 8125 and 8125B start to reduce spinlocks

Message ID 20231029183600.451694-6-mirsad.todorovac@alu.unizg.hr (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [v5,1/7] r8169: Add r8169_mac_ocp_(write|modify)_seq helpers to reduce spinlock contention | expand

Checks

Context Check Description
netdev/series_format warning Series does not have a cover letter; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1344 this patch: 1344
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1369 this patch: 1369
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1369 this patch: 1369
netdev/checkpatch warning WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: 0439297be951 ("r8169: add support for RTL8125B")' WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: f1bce4ad2f1c ("r8169: add support for RTL8125")'
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Mirsad Todorovac Oct. 29, 2023, 6:36 p.m. UTC
Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
the startup of 8125 and 8125B involve implicit spin_lock_irqsave() and
spin_unlock_irqrestore() on each invocation.

Coalesced with the corresponding helpers r8168_mac_ocp_write_seq() and
r8168_mac_ocp_modify_seq() into sequential write or modidy with a sinqle
pair of spin_lock_irqsave() and spin_unlock_irqrestore(), these calls
reduce overall lock contention.

Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
Fixes: 0439297be9511 ("r8169: add support for RTL8125B")
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
---
v5:
 added unlocked primitives to allow mac ocs modify grouping
 applied coalescing of mac ocp writes/modifies for 8168ep and 8117
 some formatting fixes to please checkpatch.pl

v4:
 fixed complaints as advised by Heiner and checkpatch.pl
 split the patch into five sections to be more easily manipulated and reviewed
 introduced r8168_mac_ocp_write_seq()
 applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B

v3:
 removed register/mask pair array sentinels, so using ARRAY_SIZE().
 avoided duplication of RTL_W32() call code as advised by Heiner.

 drivers/net/ethernet/realtek/r8169_main.c | 75 +++++++++++++----------
 1 file changed, 44 insertions(+), 31 deletions(-)

Comments

Heiner Kallweit Oct. 30, 2023, 2:02 p.m. UTC | #1
On 29.10.2023 19:36, Mirsad Goran Todorovac wrote:
> Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
> the startup of 8125 and 8125B involve implicit spin_lock_irqsave() and
> spin_unlock_irqrestore() on each invocation.
> 
> Coalesced with the corresponding helpers r8168_mac_ocp_write_seq() and
> r8168_mac_ocp_modify_seq() into sequential write or modidy with a sinqle
> pair of spin_lock_irqsave() and spin_unlock_irqrestore(), these calls
> reduce overall lock contention.
> 
> Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
> Fixes: 0439297be9511 ("r8169: add support for RTL8125B")
> Cc: Heiner Kallweit <hkallweit1@gmail.com>
> Cc: Marco Elver <elver@google.com>
> Cc: nic_swsd@realtek.com
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
> Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
> Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
> ---
> v5:
>  added unlocked primitives to allow mac ocs modify grouping
>  applied coalescing of mac ocp writes/modifies for 8168ep and 8117
>  some formatting fixes to please checkpatch.pl
> 
> v4:
>  fixed complaints as advised by Heiner and checkpatch.pl
>  split the patch into five sections to be more easily manipulated and reviewed
>  introduced r8168_mac_ocp_write_seq()
>  applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
> 
> v3:
>  removed register/mask pair array sentinels, so using ARRAY_SIZE().
>  avoided duplication of RTL_W32() call code as advised by Heiner.
> 
>  drivers/net/ethernet/realtek/r8169_main.c | 75 +++++++++++++----------
>  1 file changed, 44 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index 50fbacb05953..0778cd0ba2e0 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -3553,6 +3553,28 @@ DECLARE_RTL_COND(rtl_mac_ocp_e00e_cond)
>  
>  static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
>  {
> +	static const struct e_info_regmaskset e_info_8125_common_1[] = {
> +		{ 0xd3e2, 0x0fff, 0x03a9 },
> +		{ 0xd3e4, 0x00ff, 0x0000 },
> +		{ 0xe860, 0x0000, 0x0080 },
> +	};
> +
> +	static const struct e_info_regmaskset e_info_8125_common_2[] = {
> +		{ 0xc0b4, 0x0000, 0x000c },
> +		{ 0xeb6a, 0x00ff, 0x0033 },
> +		{ 0xeb50, 0x03e0, 0x0040 },
> +		{ 0xe056, 0x00f0, 0x0030 },
> +		{ 0xe040, 0x1000, 0x0000 },
> +		{ 0xea1c, 0x0003, 0x0001 },
> +		{ 0xe0c0, 0x4f0f, 0x4403 },
> +		{ 0xe052, 0x0080, 0x0068 },
> +		{ 0xd430, 0x0fff, 0x047f },
> +		{ 0xea1c, 0x0004, 0x0000 },
> +		{ 0xeb54, 0x0000, 0x0001 },
> +	};
> +
> +	unsigned long flags;
> +
>  	rtl_pcie_state_l2l3_disable(tp);
>  
>  	RTL_W16(tp, 0x382, 0x221b);
> @@ -3560,47 +3582,38 @@ static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
>  	RTL_W16(tp, 0x4800, 0);
>  
>  	/* disable UPS */
> -	r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
> +
> +	raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
> +	__r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
>  
>  	RTL_W8(tp, Config1, RTL_R8(tp, Config1) & ~0x10);
>  
> -	r8168_mac_ocp_write(tp, 0xc140, 0xffff);
> -	r8168_mac_ocp_write(tp, 0xc142, 0xffff);
> +	__r8168_mac_ocp_write(tp, 0xc140, 0xffff);
> +	__r8168_mac_ocp_write(tp, 0xc142, 0xffff);
>  
> -	r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
> -	r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
> -	r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);
> +	__r8168_mac_ocp_modify_seq(tp, e_info_8125_common_1);
>  
>  	/* disable new tx descriptor format */
> -	r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
> +	__r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
>  
> -	if (tp->mac_version == RTL_GIGA_MAC_VER_63)
> -		r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
> -	else
> -		r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
> +	if (tp->mac_version == RTL_GIGA_MAC_VER_63) {
> +		__r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
> +		__r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
> +	} else {
> +		__r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
> +		__r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
> +	}
> +
> +	__r8168_mac_ocp_modify_seq(tp, e_info_8125_common_2);
> +	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>  
> -	if (tp->mac_version == RTL_GIGA_MAC_VER_63)
> -		r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
> -	else
> -		r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
> -
> -	r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
> -	r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
> -	r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
> -	r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
> -	r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
> -	r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
> -	r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
> -	r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
> -	r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);
> -
> -	r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
> -	r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);
>  	udelay(1);
> -	r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
> -	RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
>  
> -	r8168_mac_ocp_write(tp, 0xe098, 0xc302);
> +	raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
> +	__r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
> +	RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
> +	__r8168_mac_ocp_write(tp, 0xe098, 0xc302);
> +	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>  
>  	rtl_loop_wait_low(tp, &rtl_mac_ocp_e00e_cond, 1000, 10);
>  

All this manual locking and unlocking makes the code harder
to read and more error-prone. Maybe, as a rule of thumb:
If you can replace a block with more than 10 mac ocp ops,
then fine with me.
Mirsad Todorovac Oct. 30, 2023, 3:02 p.m. UTC | #2
On 10/30/23 15:02, Heiner Kallweit wrote:
> On 29.10.2023 19:36, Mirsad Goran Todorovac wrote:
>> Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
>> the startup of 8125 and 8125B involve implicit spin_lock_irqsave() and
>> spin_unlock_irqrestore() on each invocation.
>>
>> Coalesced with the corresponding helpers r8168_mac_ocp_write_seq() and
>> r8168_mac_ocp_modify_seq() into sequential write or modidy with a sinqle
>> pair of spin_lock_irqsave() and spin_unlock_irqrestore(), these calls
>> reduce overall lock contention.
>>
>> Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
>> Fixes: 0439297be9511 ("r8169: add support for RTL8125B")
>> Cc: Heiner Kallweit <hkallweit1@gmail.com>
>> Cc: Marco Elver <elver@google.com>
>> Cc: nic_swsd@realtek.com
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: netdev@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
>> Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
>> Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
>> ---
>> v5:
>>   added unlocked primitives to allow mac ocs modify grouping
>>   applied coalescing of mac ocp writes/modifies for 8168ep and 8117
>>   some formatting fixes to please checkpatch.pl
>>
>> v4:
>>   fixed complaints as advised by Heiner and checkpatch.pl
>>   split the patch into five sections to be more easily manipulated and reviewed
>>   introduced r8168_mac_ocp_write_seq()
>>   applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>
>> v3:
>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>
>>   drivers/net/ethernet/realtek/r8169_main.c | 75 +++++++++++++----------
>>   1 file changed, 44 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>> index 50fbacb05953..0778cd0ba2e0 100644
>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -3553,6 +3553,28 @@ DECLARE_RTL_COND(rtl_mac_ocp_e00e_cond)
>>   
>>   static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
>>   {
>> +	static const struct e_info_regmaskset e_info_8125_common_1[] = {
>> +		{ 0xd3e2, 0x0fff, 0x03a9 },
>> +		{ 0xd3e4, 0x00ff, 0x0000 },
>> +		{ 0xe860, 0x0000, 0x0080 },
>> +	};
>> +
>> +	static const struct e_info_regmaskset e_info_8125_common_2[] = {
>> +		{ 0xc0b4, 0x0000, 0x000c },
>> +		{ 0xeb6a, 0x00ff, 0x0033 },
>> +		{ 0xeb50, 0x03e0, 0x0040 },
>> +		{ 0xe056, 0x00f0, 0x0030 },
>> +		{ 0xe040, 0x1000, 0x0000 },
>> +		{ 0xea1c, 0x0003, 0x0001 },
>> +		{ 0xe0c0, 0x4f0f, 0x4403 },
>> +		{ 0xe052, 0x0080, 0x0068 },
>> +		{ 0xd430, 0x0fff, 0x047f },
>> +		{ 0xea1c, 0x0004, 0x0000 },
>> +		{ 0xeb54, 0x0000, 0x0001 },
>> +	};
>> +
>> +	unsigned long flags;
>> +
>>   	rtl_pcie_state_l2l3_disable(tp);
>>   
>>   	RTL_W16(tp, 0x382, 0x221b);
>> @@ -3560,47 +3582,38 @@ static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
>>   	RTL_W16(tp, 0x4800, 0);
>>   
>>   	/* disable UPS */
>> -	r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
>> +
>> +	raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>> +	__r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
>>   
>>   	RTL_W8(tp, Config1, RTL_R8(tp, Config1) & ~0x10);
>>   
>> -	r8168_mac_ocp_write(tp, 0xc140, 0xffff);
>> -	r8168_mac_ocp_write(tp, 0xc142, 0xffff);
>> +	__r8168_mac_ocp_write(tp, 0xc140, 0xffff);
>> +	__r8168_mac_ocp_write(tp, 0xc142, 0xffff);
>>   
>> -	r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
>> -	r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
>> -	r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);
>> +	__r8168_mac_ocp_modify_seq(tp, e_info_8125_common_1);
>>   
>>   	/* disable new tx descriptor format */
>> -	r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
>> +	__r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
>>   
>> -	if (tp->mac_version == RTL_GIGA_MAC_VER_63)
>> -		r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
>> -	else
>> -		r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
>> +	if (tp->mac_version == RTL_GIGA_MAC_VER_63) {
>> +		__r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
>> +		__r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
>> +	} else {
>> +		__r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
>> +		__r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
>> +	}
>> +
>> +	__r8168_mac_ocp_modify_seq(tp, e_info_8125_common_2);
>> +	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>   
>> -	if (tp->mac_version == RTL_GIGA_MAC_VER_63)
>> -		r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
>> -	else
>> -		r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
>> -
>> -	r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
>> -	r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
>> -	r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
>> -	r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
>> -	r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
>> -	r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
>> -	r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
>> -	r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
>> -	r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);
>> -
>> -	r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
>> -	r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);

[1] I think we have a candidate for a squeeze here in this driver.

>>   	udelay(1);
>> -	r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
>> -	RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
>>   
>> -	r8168_mac_ocp_write(tp, 0xe098, 0xc302);
>> +	raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>> +	__r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
>> +	RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
>> +	__r8168_mac_ocp_write(tp, 0xe098, 0xc302);
>> +	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>   
>>   	rtl_loop_wait_low(tp, &rtl_mac_ocp_e00e_cond, 1000, 10);
>>   
> 
> All this manual locking and unlocking makes the code harder
> to read and more error-prone. Maybe, as a rule of thumb:
> If you can replace a block with more than 10 mac ocp ops,
> then fine with me
As I worked with another German developer, Mr. Frank Heckenbach from the GNU Pascal project,
I know that Germans are pedantic and reliable :-)

If this rtl_hw_start_8125_common() is called only once, then maybe every memory bus cycle
isn't worth saving, and then maybe the additional complexity isn't worth adding (but it
was fun doing, and it works with my NIC).

AFAIK, a spin_lock_irqsave()/spin_unlock_irqrestore() isn't a free lunch as you know, and I read
from the manuals that on modern CPUs a locked ADD $0, -128(%esp) or something takes about 50
clock cycles, in which all cores have to wait.

Doing that in storm of 10 lock/unlock pairs amounts to 500 cycles or 125 ns in the best case
on a 4 GHz CPU.

But I trust that you as the maintainer have the big picture and greater insight in the actual hw.

Elon Musk said that the greatest error in engineering is optimising stuff that never should
have been written.

As I like my RTL 8125b NIC to be fast (TM), I am not giving up on it.

Just recalled that you mentioned that it supports RSS. Is it feasible to add it to the r8169
driver?

It is OK for me that we have a formal and less formal mode of communication and switch between
them.

Thank you for all your quick reviews, much obliged, thank you for partially accepting the patches,
but I think I need to sleep over it before submitting a new version.

Regards,
Mirsad Todorovac
Heiner Kallweit Oct. 30, 2023, 3:53 p.m. UTC | #3
On 30.10.2023 16:02, Mirsad Todorovac wrote:
> 
> 
> On 10/30/23 15:02, Heiner Kallweit wrote:
>> On 29.10.2023 19:36, Mirsad Goran Todorovac wrote:
>>> Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
>>> the startup of 8125 and 8125B involve implicit spin_lock_irqsave() and
>>> spin_unlock_irqrestore() on each invocation.
>>>
>>> Coalesced with the corresponding helpers r8168_mac_ocp_write_seq() and
>>> r8168_mac_ocp_modify_seq() into sequential write or modidy with a sinqle
>>> pair of spin_lock_irqsave() and spin_unlock_irqrestore(), these calls
>>> reduce overall lock contention.
>>>
>>> Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
>>> Fixes: 0439297be9511 ("r8169: add support for RTL8125B")
>>> Cc: Heiner Kallweit <hkallweit1@gmail.com>
>>> Cc: Marco Elver <elver@google.com>
>>> Cc: nic_swsd@realtek.com
>>> Cc: "David S. Miller" <davem@davemloft.net>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>> Cc: netdev@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
>>> Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
>>> Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
>>> ---
>>> v5:
>>>   added unlocked primitives to allow mac ocs modify grouping
>>>   applied coalescing of mac ocp writes/modifies for 8168ep and 8117
>>>   some formatting fixes to please checkpatch.pl
>>>
>>> v4:
>>>   fixed complaints as advised by Heiner and checkpatch.pl
>>>   split the patch into five sections to be more easily manipulated and reviewed
>>>   introduced r8168_mac_ocp_write_seq()
>>>   applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>>
>>> v3:
>>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>>
>>>   drivers/net/ethernet/realtek/r8169_main.c | 75 +++++++++++++----------
>>>   1 file changed, 44 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>>> index 50fbacb05953..0778cd0ba2e0 100644
>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>> @@ -3553,6 +3553,28 @@ DECLARE_RTL_COND(rtl_mac_ocp_e00e_cond)
>>>     static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
>>>   {
>>> +    static const struct e_info_regmaskset e_info_8125_common_1[] = {
>>> +        { 0xd3e2, 0x0fff, 0x03a9 },
>>> +        { 0xd3e4, 0x00ff, 0x0000 },
>>> +        { 0xe860, 0x0000, 0x0080 },
>>> +    };
>>> +
>>> +    static const struct e_info_regmaskset e_info_8125_common_2[] = {
>>> +        { 0xc0b4, 0x0000, 0x000c },
>>> +        { 0xeb6a, 0x00ff, 0x0033 },
>>> +        { 0xeb50, 0x03e0, 0x0040 },
>>> +        { 0xe056, 0x00f0, 0x0030 },
>>> +        { 0xe040, 0x1000, 0x0000 },
>>> +        { 0xea1c, 0x0003, 0x0001 },
>>> +        { 0xe0c0, 0x4f0f, 0x4403 },
>>> +        { 0xe052, 0x0080, 0x0068 },
>>> +        { 0xd430, 0x0fff, 0x047f },
>>> +        { 0xea1c, 0x0004, 0x0000 },
>>> +        { 0xeb54, 0x0000, 0x0001 },
>>> +    };
>>> +
>>> +    unsigned long flags;
>>> +
>>>       rtl_pcie_state_l2l3_disable(tp);
>>>         RTL_W16(tp, 0x382, 0x221b);
>>> @@ -3560,47 +3582,38 @@ static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
>>>       RTL_W16(tp, 0x4800, 0);
>>>         /* disable UPS */
>>> -    r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
>>> +
>>> +    raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>> +    __r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
>>>         RTL_W8(tp, Config1, RTL_R8(tp, Config1) & ~0x10);
>>>   -    r8168_mac_ocp_write(tp, 0xc140, 0xffff);
>>> -    r8168_mac_ocp_write(tp, 0xc142, 0xffff);
>>> +    __r8168_mac_ocp_write(tp, 0xc140, 0xffff);
>>> +    __r8168_mac_ocp_write(tp, 0xc142, 0xffff);
>>>   -    r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
>>> -    r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
>>> -    r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);
>>> +    __r8168_mac_ocp_modify_seq(tp, e_info_8125_common_1);
>>>         /* disable new tx descriptor format */
>>> -    r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
>>> +    __r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
>>>   -    if (tp->mac_version == RTL_GIGA_MAC_VER_63)
>>> -        r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
>>> -    else
>>> -        r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
>>> +    if (tp->mac_version == RTL_GIGA_MAC_VER_63) {
>>> +        __r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
>>> +        __r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
>>> +    } else {
>>> +        __r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
>>> +        __r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
>>> +    }
>>> +
>>> +    __r8168_mac_ocp_modify_seq(tp, e_info_8125_common_2);
>>> +    raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>   -    if (tp->mac_version == RTL_GIGA_MAC_VER_63)
>>> -        r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
>>> -    else
>>> -        r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
>>> -
>>> -    r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
>>> -    r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
>>> -    r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
>>> -    r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
>>> -    r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
>>> -    r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
>>> -    r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
>>> -    r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
>>> -    r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);
>>> -
>>> -    r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
>>> -    r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);
> 
> [1] I think we have a candidate for a squeeze here in this driver.
> 
>>>       udelay(1);
>>> -    r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
>>> -    RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
>>>   -    r8168_mac_ocp_write(tp, 0xe098, 0xc302);
>>> +    raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>> +    __r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
>>> +    RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
>>> +    __r8168_mac_ocp_write(tp, 0xe098, 0xc302);
>>> +    raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>         rtl_loop_wait_low(tp, &rtl_mac_ocp_e00e_cond, 1000, 10);
>>>   
>>
>> All this manual locking and unlocking makes the code harder
>> to read and more error-prone. Maybe, as a rule of thumb:
>> If you can replace a block with more than 10 mac ocp ops,
>> then fine with me
> As I worked with another German developer, Mr. Frank Heckenbach from the GNU Pascal project,
> I know that Germans are pedantic and reliable :-)
> 
> If this rtl_hw_start_8125_common() is called only once, then maybe every memory bus cycle
> isn't worth saving, and then maybe the additional complexity isn't worth adding (but it
> was fun doing, and it works with my NIC).
> 
> AFAIK, a spin_lock_irqsave()/spin_unlock_irqrestore() isn't a free lunch as you know, and I read
> from the manuals that on modern CPUs a locked ADD $0, -128(%esp) or something takes about 50
> clock cycles, in which all cores have to wait.
> 
> Doing that in storm of 10 lock/unlock pairs amounts to 500 cycles or 125 ns in the best case
> on a 4 GHz CPU.
> 
> But I trust that you as the maintainer have the big picture and greater insight in the actual hw.
> 
Big picture: maybe, as I've been working on r8169 for about 5 yrs
Greater insight in the actual hw: not really, because Realtek doesn't publish datasheets and
errata information. I just know the history of a lot of needed quirks in r8169, and I can recall
what we tried to improve and had to roll back because users had problems.
Most famous example is ASPM L1 + sub-state support. ASPM with r8169-driven NIC versions has a
long history of users facing slow network connectivity / tx timeouts / complete NIC stalls.
And it's still not solved with the newest NIC versions. But maybe not only Realtek is to blame.
Basically every consumer mainboard comes with a Realtek NIC, chipset issues and BIOS bugs
contribute to the mess.

> Elon Musk said that the greatest error in engineering is optimising stuff that never should
> have been written.
> 
> As I like my RTL 8125b NIC to be fast (TM), I am not giving up on it.
> 
Does this mean that you're not achieving the 2500Mbps line rate with r8169 as-is?

> Just recalled that you mentioned that it supports RSS. Is it feasible to add it to the r8169
> driver?
> 
Feasible: definitely, as you "just" have to take the RSS-related code from r8125 and adjust it
to meet mainline code quality criteria
Question is which actual benefit it brings, and whether it's worth the additional complexity.
As a starting point you could compare performance KPI's when using r8169 vs. r8125.

> It is OK for me that we have a formal and less formal mode of communication and switch between
> them.
> 
> Thank you for all your quick reviews, much obliged, thank you for partially accepting the patches,
> but I think I need to sleep over it before submitting a new version.
> 
> Regards,
> Mirsad Todorovac
diff mbox series

Patch

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 50fbacb05953..0778cd0ba2e0 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -3553,6 +3553,28 @@  DECLARE_RTL_COND(rtl_mac_ocp_e00e_cond)
 
 static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
 {
+	static const struct e_info_regmaskset e_info_8125_common_1[] = {
+		{ 0xd3e2, 0x0fff, 0x03a9 },
+		{ 0xd3e4, 0x00ff, 0x0000 },
+		{ 0xe860, 0x0000, 0x0080 },
+	};
+
+	static const struct e_info_regmaskset e_info_8125_common_2[] = {
+		{ 0xc0b4, 0x0000, 0x000c },
+		{ 0xeb6a, 0x00ff, 0x0033 },
+		{ 0xeb50, 0x03e0, 0x0040 },
+		{ 0xe056, 0x00f0, 0x0030 },
+		{ 0xe040, 0x1000, 0x0000 },
+		{ 0xea1c, 0x0003, 0x0001 },
+		{ 0xe0c0, 0x4f0f, 0x4403 },
+		{ 0xe052, 0x0080, 0x0068 },
+		{ 0xd430, 0x0fff, 0x047f },
+		{ 0xea1c, 0x0004, 0x0000 },
+		{ 0xeb54, 0x0000, 0x0001 },
+	};
+
+	unsigned long flags;
+
 	rtl_pcie_state_l2l3_disable(tp);
 
 	RTL_W16(tp, 0x382, 0x221b);
@@ -3560,47 +3582,38 @@  static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
 	RTL_W16(tp, 0x4800, 0);
 
 	/* disable UPS */
-	r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
+
+	raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
+	__r8168_mac_ocp_modify(tp, 0xd40a, 0x0010, 0x0000);
 
 	RTL_W8(tp, Config1, RTL_R8(tp, Config1) & ~0x10);
 
-	r8168_mac_ocp_write(tp, 0xc140, 0xffff);
-	r8168_mac_ocp_write(tp, 0xc142, 0xffff);
+	__r8168_mac_ocp_write(tp, 0xc140, 0xffff);
+	__r8168_mac_ocp_write(tp, 0xc142, 0xffff);
 
-	r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
-	r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
-	r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);
+	__r8168_mac_ocp_modify_seq(tp, e_info_8125_common_1);
 
 	/* disable new tx descriptor format */
-	r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
+	__r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
 
-	if (tp->mac_version == RTL_GIGA_MAC_VER_63)
-		r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
-	else
-		r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
+	if (tp->mac_version == RTL_GIGA_MAC_VER_63) {
+		__r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0200);
+		__r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
+	} else {
+		__r8168_mac_ocp_modify(tp, 0xe614, 0x0700, 0x0400);
+		__r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
+	}
+
+	__r8168_mac_ocp_modify_seq(tp, e_info_8125_common_2);
+	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
 
-	if (tp->mac_version == RTL_GIGA_MAC_VER_63)
-		r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0000);
-	else
-		r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
-
-	r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
-	r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
-	r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
-	r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
-	r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
-	r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
-	r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
-	r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
-	r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);
-
-	r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
-	r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);
 	udelay(1);
-	r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
-	RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
 
-	r8168_mac_ocp_write(tp, 0xe098, 0xc302);
+	raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
+	__r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
+	RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
+	__r8168_mac_ocp_write(tp, 0xe098, 0xc302);
+	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
 
 	rtl_loop_wait_low(tp, &rtl_mac_ocp_e00e_cond, 1000, 10);