Message ID | 20231104221514.45821-1-mirsad.todorovac@alu.unizg.hr (mailing list archive) |
---|---|
Headers | show |
Series | Coalesce mac ocp write/modify calls to reduce spinlock contention | expand |
On 04.11.2023 23:15, Mirsad Goran Todorovac wrote: > The motivation for these helpers was the locking overhead of 130 consecutive > r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused > if the PHY is powered-down. > > To quote Heiner: > > On RTL8411b the RX unit gets confused if the PHY is powered-down. > This was reported in [0] and confirmed by Realtek. Realtek provided > a sequence to fix the RX unit after PHY wakeup. > > A series of about 130 r8168_mac_ocp_write() calls is performed to program the > RTL registers for recovery, each doing an expensive spin_lock_irqsave() and > spin_unlock_irqrestore(). > > Each mac ocp write is made of: > > static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, > u32 data) > { > if (rtl_ocp_reg_failure(reg)) > return; > > RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); > } > > static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, > u32 data) > { > unsigned long flags; > > raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); > __r8168_mac_ocp_write(tp, reg, data); > raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); > } > > Register programming is done through RTL_W32() macro which expands into > > #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg)) > > which is further (on Alpha): > > extern inline void writel(u32 b, volatile void __iomem *addr) > { > mb(); > __raw_writel(b, addr); > } > > or on i386/x86_64: > > #define build_mmio_write(name, size, type, reg, barrier) \ > static inline void name(type val, volatile void __iomem *addr) \ > { asm volatile("mov" size " %0,%1": :reg (val), \ > "m" (*(volatile type __force *)addr) barrier); } > > build_mmio_write(writel, "l", unsigned int, "r", :"memory") > > This obviously involves iat least a compiler barrier. > > mb() expands into something like this i.e. on x86_64: > > #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory") > > This means a whole lot of memory bus stalls: for spin_lock_irqsave(), > memory barrier, writel(), and spin_unlock_irqrestore(). > > With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like > a lock storm that will stall all of the cores and CPUs on the same memory controller > for certain time I/O takes to finish. > > In a sequential case of RTL register programming, the writes to RTL registers > can be coalesced under a same raw spinlock. This can dramatically decrease the > number of bus stalls in a multicore or multi-CPU system. > > Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are > provided to reduce lock contention: > > static void rtl_hw_start_8411_2(struct rtl8169_private *tp) > { > > ... > > /* The following Realtek-provided magic fixes an issue with the RX unit > * getting confused after the PHY having been powered-down. > */ > > static const struct recover_8411b_info init_zero_seq[] = { > { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, > ... > }; > > ... > > r8168_mac_ocp_write_seq(tp, init_zero_seq); > > ... > > } > > The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/ > functions that only changed the function names and the ending of the line, so the actual > hex data is unchanged. > > To repeat, the reason for the introduction of the original commit > was to enable recovery of the RX unit on the RTL8411b which was confused by the > powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem > into a series of about 500+ memory bus locks, most waiting for the main memory read, > modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for > the programming sequence to reach RTL NIC registers. > > [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075 > > v6: > proceeded according to Jacob Keller's suggestions by creating a cover page and reducing > the text within the commits. Applying to the net-next tree as Heiner Kallweit requested. > > v5: > attempted some new optimisations, which were rejected, but not all and not completely. > > v4: > fixed complaints as advised by Heiner and checkpatch.pl. > split the patch into five sections to be more easily manipulated and reviewed > introduced r8168_mac_ocp_write_seq() > applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B > > v3: > removed register/mask pair array sentinels, so using ARRAY_SIZE(). > avoided duplication of RTL_W32() call code as advised by Heiner. > > Mirsad Goran Todorovac (5): > r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock > stalls > r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce > spinlock stalls > r8169: Coalesce mac ocp write and modify for 8168H start to reduce > spinlocks > r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce > spinlock contention > r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce > spinlocks > > drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++----------- > 1 file changed, 150 insertions(+), 154 deletions(-) > You still write: "a lock storm that will stall all of the cores and CPUs on the same memory controller" even though you were informed that that's not the case. There's no actual problem, therefore your Fixes tags are incorrect. Also net-next is closed at the moment. In patches 3-5 I see no benefit. And I have doubts whether the small benefit in patch 2 is worth adding all the helpers in patch 1.
On 11/4/23 23:37, Heiner Kallweit wrote: > On 04.11.2023 23:15, Mirsad Goran Todorovac wrote: >> The motivation for these helpers was the locking overhead of 130 consecutive >> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused >> if the PHY is powered-down. >> >> To quote Heiner: >> >> On RTL8411b the RX unit gets confused if the PHY is powered-down. >> This was reported in [0] and confirmed by Realtek. Realtek provided >> a sequence to fix the RX unit after PHY wakeup. >> >> A series of about 130 r8168_mac_ocp_write() calls is performed to program the >> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and >> spin_unlock_irqrestore(). >> >> Each mac ocp write is made of: >> >> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, >> u32 data) >> { >> if (rtl_ocp_reg_failure(reg)) >> return; >> >> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); >> } >> >> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, >> u32 data) >> { >> unsigned long flags; >> >> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); >> __r8168_mac_ocp_write(tp, reg, data); >> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); >> } >> >> Register programming is done through RTL_W32() macro which expands into >> >> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg)) >> >> which is further (on Alpha): >> >> extern inline void writel(u32 b, volatile void __iomem *addr) >> { >> mb(); >> __raw_writel(b, addr); >> } >> >> or on i386/x86_64: >> >> #define build_mmio_write(name, size, type, reg, barrier) \ >> static inline void name(type val, volatile void __iomem *addr) \ >> { asm volatile("mov" size " %0,%1": :reg (val), \ >> "m" (*(volatile type __force *)addr) barrier); } >> >> build_mmio_write(writel, "l", unsigned int, "r", :"memory") >> >> This obviously involves iat least a compiler barrier. >> >> mb() expands into something like this i.e. on x86_64: >> >> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory") >> >> This means a whole lot of memory bus stalls: for spin_lock_irqsave(), >> memory barrier, writel(), and spin_unlock_irqrestore(). >> >> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like >> a lock storm that will stall all of the cores and CPUs on the same memory controller >> for certain time I/O takes to finish. >> >> In a sequential case of RTL register programming, the writes to RTL registers >> can be coalesced under a same raw spinlock. This can dramatically decrease the >> number of bus stalls in a multicore or multi-CPU system. >> >> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are >> provided to reduce lock contention: >> >> static void rtl_hw_start_8411_2(struct rtl8169_private *tp) >> { >> >> ... >> >> /* The following Realtek-provided magic fixes an issue with the RX unit >> * getting confused after the PHY having been powered-down. >> */ >> >> static const struct recover_8411b_info init_zero_seq[] = { >> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, >> ... >> }; >> >> ... >> >> r8168_mac_ocp_write_seq(tp, init_zero_seq); >> >> ... >> >> } >> >> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/ >> functions that only changed the function names and the ending of the line, so the actual >> hex data is unchanged. >> >> To repeat, the reason for the introduction of the original commit >> was to enable recovery of the RX unit on the RTL8411b which was confused by the >> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem >> into a series of about 500+ memory bus locks, most waiting for the main memory read, >> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for >> the programming sequence to reach RTL NIC registers. >> >> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075 >> >> v6: >> proceeded according to Jacob Keller's suggestions by creating a cover page and reducing >> the text within the commits. Applying to the net-next tree as Heiner Kallweit requested. >> >> v5: >> attempted some new optimisations, which were rejected, but not all and not completely. >> >> v4: >> fixed complaints as advised by Heiner and checkpatch.pl. >> split the patch into five sections to be more easily manipulated and reviewed >> introduced r8168_mac_ocp_write_seq() >> applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B >> >> v3: >> removed register/mask pair array sentinels, so using ARRAY_SIZE(). >> avoided duplication of RTL_W32() call code as advised by Heiner. >> >> Mirsad Goran Todorovac (5): >> r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock >> stalls >> r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce >> spinlock stalls >> r8169: Coalesce mac ocp write and modify for 8168H start to reduce >> spinlocks >> r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce >> spinlock contention >> r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce >> spinlocks >> >> drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++----------- >> 1 file changed, 150 insertions(+), 154 deletions(-) >> Hi, Mr. Kallweit, So good to hear so soon from you. I'm encouraged that you are positive about improving the speed and reducing the size of the Realtek drivers. > You still write: > "a lock storm that will stall all of the cores and CPUs on the same memory controller" > even though you were informed that that's not the case. I was not convinced. There is no such thing as a free lunch, and there is no locking without affecting other cores, or locking would not make sense. > There's no actual problem, therefore your Fixes tags are incorrect. Mea culpa - my mistake, I will fix that in the next version. > Also net-next is closed at the moment. There is no problem with that, as these are only optimisation fixes, not zero day exploits. I am a patient person. > In patches 3-5 I see no benefit. And I have doubts whether the small benefit in > patch 2 is worth adding all the helpers in patch 1. I merely followed and mimed driver style from the constructions like this one: static const struct ephy_info e_info_8168e_1[] = { { 0x00, 0x0200, 0x0100 }, { 0x00, 0x0000, 0x0004 }, { 0x06, 0x0002, 0x0001 }, { 0x06, 0x0000, 0x0030 }, { 0x07, 0x0000, 0x2000 }, { 0x00, 0x0000, 0x0020 }, { 0x03, 0x5800, 0x2000 }, { 0x03, 0x0000, 0x0001 }, { 0x01, 0x0800, 0x1000 }, { 0x07, 0x0000, 0x4000 }, { 0x1e, 0x0000, 0x2000 }, { 0x19, 0xffff, 0xfe6c }, { 0x0a, 0x0000, 0x0040 } }; rtl_set_def_aspm_entry_latency(tp); rtl_ephy_init(tp, e_info_8168e_1); Here you did not think that introducing an array reduced code readability. My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock on each RTL_W32() write. I am convinced that a driver with less raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better with more NICs and more cores. You said nothing to convinced me otherwise. But I am merely defending my point, this by no means implies disrespect or overlooking your contribution to the source as a coder and a a maintainer. Realtek NICs are known as cheap NIC for motherboards, but they are becoming more ubiquitous, and it is logical to use less locking, as locking is expensive. "barrier" in writev() guarantees sequential orders of write, and locking and unlocking on each read/modify/write is unnecessary overhead, IMHO. As the conclusion, I would like to emphasise that improving lock contention for the code is by no means a personal attack on the maintainer or a breach of the Code of Conduct. If you are so much against the changes which Mr. Jacob Keller from Intel reviewed, maybe we can cool emotions and start thinking rationally. Additionally, I would like to "inline" many functions, as I think that call/return sequences with stack frame generation /destruction are more expensive than inlining the small one liners. But I will certainly respect your opinion on the matter as a maintainer. What I realise that I might be optimising the cold paths of the code, but from your emails it seems like nothing is worth optimising in this driver, and with all due respect Sir, I think that is dead wrong. Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded in the spirit that this is exactly what the guys in Chernobyl did while maintaining the reactor that malfunctioned: they did not dare to question the authority telling them that everything is alright. Have a nice evening, and please do not take these words as a breach of the Code or a personal attack. I believe we are on the same side, and that is making this driver better. The Linux kernel developer community was my last hope that this human race has a force to improve the mankind and make it worth surviving. But sometimes it is more honourable to go down with the ship and preserve the honour. Best regards, Mirsad Todorovac
On 11/4/23 23:37, Heiner Kallweit wrote: > On 04.11.2023 23:15, Mirsad Goran Todorovac wrote: >> The motivation for these helpers was the locking overhead of 130 consecutive >> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused >> if the PHY is powered-down. >> >> To quote Heiner: >> >> On RTL8411b the RX unit gets confused if the PHY is powered-down. >> This was reported in [0] and confirmed by Realtek. Realtek provided >> a sequence to fix the RX unit after PHY wakeup. >> >> A series of about 130 r8168_mac_ocp_write() calls is performed to program the >> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and >> spin_unlock_irqrestore(). >> >> Each mac ocp write is made of: >> >> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, >> u32 data) >> { >> if (rtl_ocp_reg_failure(reg)) >> return; >> >> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); >> } >> >> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, >> u32 data) >> { >> unsigned long flags; >> >> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); >> __r8168_mac_ocp_write(tp, reg, data); >> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); >> } >> >> Register programming is done through RTL_W32() macro which expands into >> >> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg)) >> >> which is further (on Alpha): >> >> extern inline void writel(u32 b, volatile void __iomem *addr) >> { >> mb(); >> __raw_writel(b, addr); >> } >> >> or on i386/x86_64: >> >> #define build_mmio_write(name, size, type, reg, barrier) \ >> static inline void name(type val, volatile void __iomem *addr) \ >> { asm volatile("mov" size " %0,%1": :reg (val), \ >> "m" (*(volatile type __force *)addr) barrier); } >> >> build_mmio_write(writel, "l", unsigned int, "r", :"memory") >> >> This obviously involves iat least a compiler barrier. >> >> mb() expands into something like this i.e. on x86_64: >> >> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory") >> >> This means a whole lot of memory bus stalls: for spin_lock_irqsave(), >> memory barrier, writel(), and spin_unlock_irqrestore(). >> >> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like >> a lock storm that will stall all of the cores and CPUs on the same memory controller >> for certain time I/O takes to finish. >> >> In a sequential case of RTL register programming, the writes to RTL registers >> can be coalesced under a same raw spinlock. This can dramatically decrease the >> number of bus stalls in a multicore or multi-CPU system. >> >> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are >> provided to reduce lock contention: >> >> static void rtl_hw_start_8411_2(struct rtl8169_private *tp) >> { >> >> ... >> >> /* The following Realtek-provided magic fixes an issue with the RX unit >> * getting confused after the PHY having been powered-down. >> */ >> >> static const struct recover_8411b_info init_zero_seq[] = { >> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, >> ... >> }; >> >> ... >> >> r8168_mac_ocp_write_seq(tp, init_zero_seq); >> >> ... >> >> } >> >> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/ >> functions that only changed the function names and the ending of the line, so the actual >> hex data is unchanged. >> >> To repeat, the reason for the introduction of the original commit >> was to enable recovery of the RX unit on the RTL8411b which was confused by the >> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem >> into a series of about 500+ memory bus locks, most waiting for the main memory read, >> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for >> the programming sequence to reach RTL NIC registers. >> >> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075 >> >> v6: >> proceeded according to Jacob Keller's suggestions by creating a cover page and reducing >> the text within the commits. Applying to the net-next tree as Heiner Kallweit requested. >> >> v5: >> attempted some new optimisations, which were rejected, but not all and not completely. >> >> v4: >> fixed complaints as advised by Heiner and checkpatch.pl. >> split the patch into five sections to be more easily manipulated and reviewed >> introduced r8168_mac_ocp_write_seq() >> applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B >> >> v3: >> removed register/mask pair array sentinels, so using ARRAY_SIZE(). >> avoided duplication of RTL_W32() call code as advised by Heiner. >> >> Mirsad Goran Todorovac (5): >> r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock >> stalls >> r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce >> spinlock stalls >> r8169: Coalesce mac ocp write and modify for 8168H start to reduce >> spinlocks >> r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce >> spinlock contention >> r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce >> spinlocks >> >> drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++----------- >> 1 file changed, 150 insertions(+), 154 deletions(-) >> > > You still write: > "a lock storm that will stall all of the cores and CPUs on the same memory controller" > even though you were informed that that's not the case. > There's no actual problem, therefore your Fixes tags are incorrect. > Also net-next is closed at the moment. > In patches 3-5 I see no benefit. And I have doubts whether the small benefit in > patch 2 is worth adding all the helpers in patch 1. After some thought, I would like to have a consensus on these patches, rather than someone feels defeated or outvoted. So I will try to reach some common ground, if you think the cause is worth it. Why is adding six lines of a helper a problem worse than removing 130 lines of callers? I would hate to think that the Linux kernel developer community became the place where Authority has higher weight than Reason and Logic. I have no personal gain from improving these drivers other than the Galactic credits. One thing I wouldn't like and do not like is the Windows drivers being better because their programmers are more innovative. Best regards, Mirsad Todorovac
On 05.11.2023 01:15, Mirsad Todorovac wrote: > > > On 11/4/23 23:37, Heiner Kallweit wrote: >> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote: >>> The motivation for these helpers was the locking overhead of 130 consecutive >>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused >>> if the PHY is powered-down. >>> >>> To quote Heiner: >>> >>> On RTL8411b the RX unit gets confused if the PHY is powered-down. >>> This was reported in [0] and confirmed by Realtek. Realtek provided >>> a sequence to fix the RX unit after PHY wakeup. >>> >>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the >>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and >>> spin_unlock_irqrestore(). >>> >>> Each mac ocp write is made of: >>> >>> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, >>> u32 data) >>> { >>> if (rtl_ocp_reg_failure(reg)) >>> return; >>> >>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); >>> } >>> >>> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, >>> u32 data) >>> { >>> unsigned long flags; >>> >>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); >>> __r8168_mac_ocp_write(tp, reg, data); >>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); >>> } >>> >>> Register programming is done through RTL_W32() macro which expands into >>> >>> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg)) >>> >>> which is further (on Alpha): >>> >>> extern inline void writel(u32 b, volatile void __iomem *addr) >>> { >>> mb(); >>> __raw_writel(b, addr); >>> } >>> >>> or on i386/x86_64: >>> >>> #define build_mmio_write(name, size, type, reg, barrier) \ >>> static inline void name(type val, volatile void __iomem *addr) \ >>> { asm volatile("mov" size " %0,%1": :reg (val), \ >>> "m" (*(volatile type __force *)addr) barrier); } >>> >>> build_mmio_write(writel, "l", unsigned int, "r", :"memory") >>> >>> This obviously involves iat least a compiler barrier. >>> >>> mb() expands into something like this i.e. on x86_64: >>> >>> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory") >>> >>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(), >>> memory barrier, writel(), and spin_unlock_irqrestore(). >>> >>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like >>> a lock storm that will stall all of the cores and CPUs on the same memory controller >>> for certain time I/O takes to finish. >>> >>> In a sequential case of RTL register programming, the writes to RTL registers >>> can be coalesced under a same raw spinlock. This can dramatically decrease the >>> number of bus stalls in a multicore or multi-CPU system. >>> >>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are >>> provided to reduce lock contention: >>> >>> static void rtl_hw_start_8411_2(struct rtl8169_private *tp) >>> { >>> >>> ... >>> >>> /* The following Realtek-provided magic fixes an issue with the RX unit >>> * getting confused after the PHY having been powered-down. >>> */ >>> >>> static const struct recover_8411b_info init_zero_seq[] = { >>> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, >>> ... >>> }; >>> >>> ... >>> >>> r8168_mac_ocp_write_seq(tp, init_zero_seq); >>> >>> ... >>> >>> } >>> >>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/ >>> functions that only changed the function names and the ending of the line, so the actual >>> hex data is unchanged. >>> >>> To repeat, the reason for the introduction of the original commit >>> was to enable recovery of the RX unit on the RTL8411b which was confused by the >>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem >>> into a series of about 500+ memory bus locks, most waiting for the main memory read, >>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for >>> the programming sequence to reach RTL NIC registers. >>> >>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075 >>> >>> v6: >>> proceeded according to Jacob Keller's suggestions by creating a cover page and reducing >>> the text within the commits. Applying to the net-next tree as Heiner Kallweit requested. >>> >>> v5: >>> attempted some new optimisations, which were rejected, but not all and not completely. >>> >>> v4: >>> fixed complaints as advised by Heiner and checkpatch.pl. >>> split the patch into five sections to be more easily manipulated and reviewed >>> introduced r8168_mac_ocp_write_seq() >>> applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B >>> >>> v3: >>> removed register/mask pair array sentinels, so using ARRAY_SIZE(). >>> avoided duplication of RTL_W32() call code as advised by Heiner. >>> >>> Mirsad Goran Todorovac (5): >>> r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock >>> stalls >>> r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce >>> spinlock stalls >>> r8169: Coalesce mac ocp write and modify for 8168H start to reduce >>> spinlocks >>> r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce >>> spinlock contention >>> r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce >>> spinlocks >>> >>> drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++----------- >>> 1 file changed, 150 insertions(+), 154 deletions(-) >>> > > Hi, Mr. Kallweit, > > So good to hear so soon from you. I'm encouraged that you are positive about improving > the speed and reducing the size of the Realtek drivers. > >> You still write: >> "a lock storm that will stall all of the cores and CPUs on the same memory controller" >> even though you were informed that that's not the case. > > I was not convinced. There is no such thing as a free lunch, and there is no locking > without affecting other cores, or locking would not make sense. > >> There's no actual problem, therefore your Fixes tags are incorrect. > > Mea culpa - my mistake, I will fix that in the next version. > >> Also net-next is closed at the moment. > > There is no problem with that, as these are only optimisation fixes, not zero day > exploits. I am a patient person. > >> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in >> patch 2 is worth adding all the helpers in patch 1. > > I merely followed and mimed driver style from the constructions like this one: > > static const struct ephy_info e_info_8168e_1[] = { > { 0x00, 0x0200, 0x0100 }, > { 0x00, 0x0000, 0x0004 }, > { 0x06, 0x0002, 0x0001 }, > { 0x06, 0x0000, 0x0030 }, > { 0x07, 0x0000, 0x2000 }, > { 0x00, 0x0000, 0x0020 }, > { 0x03, 0x5800, 0x2000 }, > { 0x03, 0x0000, 0x0001 }, > { 0x01, 0x0800, 0x1000 }, > { 0x07, 0x0000, 0x4000 }, > { 0x1e, 0x0000, 0x2000 }, > { 0x19, 0xffff, 0xfe6c }, > { 0x0a, 0x0000, 0x0040 } > }; > > rtl_set_def_aspm_entry_latency(tp); > > rtl_ephy_init(tp, e_info_8168e_1); > > Here you did not think that introducing an array reduced code readability. > > My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock > on each RTL_W32() write. I am convinced that a driver with less > raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better > with more NICs and more cores. > Then please focus on hot paths where it actually could make a difference, and provide numbers instead of a purely theoretical discussion. > You said nothing to convinced me otherwise. > > But I am merely defending my point, this by no means implies disrespect or overlooking > your contribution to the source as a coder and a a maintainer. > > Realtek NICs are known as cheap NIC for motherboards, but they are becoming more ubiquitous, > and it is logical to use less locking, as locking is expensive. "barrier" in writev() > guarantees sequential orders of write, and locking and unlocking on each read/modify/write > is unnecessary overhead, IMHO. > > As the conclusion, I would like to emphasise that improving lock contention for the code > is by no means a personal attack on the maintainer or a breach of the Code of Conduct. > > If you are so much against the changes which Mr. Jacob Keller from Intel reviewed, > maybe we can cool emotions and start thinking rationally. > > Additionally, I would like to "inline" many functions, as I think that call/return > sequences with stack frame generation /destruction are more expensive than inlining the > small one liners. > Mainline standard is to let the compiler decide on inlining. > But I will certainly respect your opinion on the matter as a maintainer. > > What I realise that I might be optimising the cold paths of the code, but from your emails > it seems like nothing is worth optimising in this driver, and with all due respect Sir, > I think that is dead wrong. > Nobody ever said that, and if you look at the history of the driver you'll see a lot of optimizations that have been added over time. Ideally an optimization improves both: performance and code readability Code readability is important for maintainability and weighs higher for me than a minor performance optimization in a code path that is very rarely used. > Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded > in the spirit that this is exactly what the guys in Chernobyl did while maintaining the > reactor that malfunctioned: they did not dare to question the authority telling them that > everything is alright. > > Have a nice evening, and please do not take these words as a breach of the Code or a > personal attack. I believe we are on the same side, and that is making this driver better. > > The Linux kernel developer community was my last hope that this human race has a force > to improve the mankind and make it worth surviving. > > But sometimes it is more honourable to go down with the ship and preserve the honour. > > Best regards, > Mirsad Todorovac
On 11/5/23 10:20, Heiner Kallweit wrote: > On 05.11.2023 01:15, Mirsad Todorovac wrote: >> >> >> On 11/4/23 23:37, Heiner Kallweit wrote: >>> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote: >>>> The motivation for these helpers was the locking overhead of 130 consecutive >>>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused >>>> if the PHY is powered-down. >>>> >>>> To quote Heiner: >>>> >>>> On RTL8411b the RX unit gets confused if the PHY is powered-down. >>>> This was reported in [0] and confirmed by Realtek. Realtek provided >>>> a sequence to fix the RX unit after PHY wakeup. >>>> >>>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the >>>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and >>>> spin_unlock_irqrestore(). >>>> >>>> Each mac ocp write is made of: >>>> >>>> static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, >>>> u32 data) >>>> { >>>> if (rtl_ocp_reg_failure(reg)) >>>> return; >>>> >>>> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); >>>> } >>>> >>>> static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, >>>> u32 data) >>>> { >>>> unsigned long flags; >>>> >>>> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); >>>> __r8168_mac_ocp_write(tp, reg, data); >>>> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); >>>> } >>>> >>>> Register programming is done through RTL_W32() macro which expands into >>>> >>>> #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg)) >>>> >>>> which is further (on Alpha): >>>> >>>> extern inline void writel(u32 b, volatile void __iomem *addr) >>>> { >>>> mb(); >>>> __raw_writel(b, addr); >>>> } >>>> >>>> or on i386/x86_64: >>>> >>>> #define build_mmio_write(name, size, type, reg, barrier) \ >>>> static inline void name(type val, volatile void __iomem *addr) \ >>>> { asm volatile("mov" size " %0,%1": :reg (val), \ >>>> "m" (*(volatile type __force *)addr) barrier); } >>>> >>>> build_mmio_write(writel, "l", unsigned int, "r", :"memory") >>>> >>>> This obviously involves iat least a compiler barrier. >>>> >>>> mb() expands into something like this i.e. on x86_64: >>>> >>>> #define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory") >>>> >>>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(), >>>> memory barrier, writel(), and spin_unlock_irqrestore(). >>>> >>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like >>>> a lock storm that will stall all of the cores and CPUs on the same memory controller >>>> for certain time I/O takes to finish. >>>> >>>> In a sequential case of RTL register programming, the writes to RTL registers >>>> can be coalesced under a same raw spinlock. This can dramatically decrease the >>>> number of bus stalls in a multicore or multi-CPU system. >>>> >>>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are >>>> provided to reduce lock contention: >>>> >>>> static void rtl_hw_start_8411_2(struct rtl8169_private *tp) >>>> { >>>> >>>> ... >>>> >>>> /* The following Realtek-provided magic fixes an issue with the RX unit >>>> * getting confused after the PHY having been powered-down. >>>> */ >>>> >>>> static const struct recover_8411b_info init_zero_seq[] = { >>>> { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, >>>> ... >>>> }; >>>> >>>> ... >>>> >>>> r8168_mac_ocp_write_seq(tp, init_zero_seq); >>>> >>>> ... >>>> >>>> } >>>> >>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/ >>>> functions that only changed the function names and the ending of the line, so the actual >>>> hex data is unchanged. >>>> >>>> To repeat, the reason for the introduction of the original commit >>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the >>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem >>>> into a series of about 500+ memory bus locks, most waiting for the main memory read, >>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for >>>> the programming sequence to reach RTL NIC registers. >>>> >>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075 >>>> >>>> v6: >>>> proceeded according to Jacob Keller's suggestions by creating a cover page and reducing >>>> the text within the commits. Applying to the net-next tree as Heiner Kallweit requested. >>>> >>>> v5: >>>> attempted some new optimisations, which were rejected, but not all and not completely. >>>> >>>> v4: >>>> fixed complaints as advised by Heiner and checkpatch.pl. >>>> split the patch into five sections to be more easily manipulated and reviewed >>>> introduced r8168_mac_ocp_write_seq() >>>> applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B >>>> >>>> v3: >>>> removed register/mask pair array sentinels, so using ARRAY_SIZE(). >>>> avoided duplication of RTL_W32() call code as advised by Heiner. >>>> >>>> Mirsad Goran Todorovac (5): >>>> r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock >>>> stalls >>>> r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce >>>> spinlock stalls >>>> r8169: Coalesce mac ocp write and modify for 8168H start to reduce >>>> spinlocks >>>> r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce >>>> spinlock contention >>>> r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce >>>> spinlocks >>>> >>>> drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++----------- >>>> 1 file changed, 150 insertions(+), 154 deletions(-) >>>> >> >> Hi, Mr. Kallweit, >> >> So good to hear so soon from you. I'm encouraged that you are positive about improving >> the speed and reducing the size of the Realtek drivers. >> >>> You still write: >>> "a lock storm that will stall all of the cores and CPUs on the same memory controller" >>> even though you were informed that that's not the case. >> >> I was not convinced. There is no such thing as a free lunch, and there is no locking >> without affecting other cores, or locking would not make sense. >> >>> There's no actual problem, therefore your Fixes tags are incorrect. >> >> Mea culpa - my mistake, I will fix that in the next version. >> >>> Also net-next is closed at the moment. >> >> There is no problem with that, as these are only optimisation fixes, not zero day >> exploits. I am a patient person. >> >>> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in >>> patch 2 is worth adding all the helpers in patch 1. >> >> I merely followed and mimed driver style from the constructions like this one: >> >> static const struct ephy_info e_info_8168e_1[] = { >> { 0x00, 0x0200, 0x0100 }, >> { 0x00, 0x0000, 0x0004 }, >> { 0x06, 0x0002, 0x0001 }, >> { 0x06, 0x0000, 0x0030 }, >> { 0x07, 0x0000, 0x2000 }, >> { 0x00, 0x0000, 0x0020 }, >> { 0x03, 0x5800, 0x2000 }, >> { 0x03, 0x0000, 0x0001 }, >> { 0x01, 0x0800, 0x1000 }, >> { 0x07, 0x0000, 0x4000 }, >> { 0x1e, 0x0000, 0x2000 }, >> { 0x19, 0xffff, 0xfe6c }, >> { 0x0a, 0x0000, 0x0040 } >> }; >> >> rtl_set_def_aspm_entry_latency(tp); >> >> rtl_ephy_init(tp, e_info_8168e_1); >> >> Here you did not think that introducing an array reduced code readability. >> >> My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock >> on each RTL_W32() write. I am convinced that a driver with less >> raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better >> with more NICs and more cores. >> > Then please focus on hot paths where it actually could make a difference, > and provide numbers instead of a purely theoretical discussion. I will comply. RTL8411b losing PHY that requires this expensive reset probably doesn't happen anyway on the Linux servers. :-/ I have done my homework and I see that you are also co-maintainer of the net PHYLIB, so your insight on this matter is undoubtedly greater after five years of experience in maintaining the driver. Learning about the network stack and the PHY layer is however a formidable thought very interesting task. The whole area of making multimedia more responsive on Linux and Windows graphic interface is very challenging, and I could pass it with my day job as research. But as I said, I have to catch up with a lot of homework. >> You said nothing to convinced me otherwise. >> >> But I am merely defending my point, this by no means implies disrespect or overlooking >> your contribution to the source as a coder and a a maintainer. >> >> Realtek NICs are known as cheap NIC for motherboards, but they are becoming more ubiquitous, >> and it is logical to use less locking, as locking is expensive. "barrier" in writev() >> guarantees sequential orders of write, and locking and unlocking on each read/modify/write >> is unnecessary overhead, IMHO. >> >> As the conclusion, I would like to emphasise that improving lock contention for the code >> is by no means a personal attack on the maintainer or a breach of the Code of Conduct. >> >> If you are so much against the changes which Mr. Jacob Keller from Intel reviewed, >> maybe we can cool emotions and start thinking rationally. >> >> Additionally, I would like to "inline" many functions, as I think that call/return >> sequences with stack frame generation /destruction are more expensive than inlining the >> small one liners. > Mainline standard is to let the compiler decide on inlining. >> But I will certainly respect your opinion on the matter as a maintainer. >> >> What I realise that I might be optimising the cold paths of the code, but from your emails >> it seems like nothing is worth optimising in this driver, and with all due respect Sir, >> I think that is dead wrong. > > Nobody ever said that, and if you look at the history of the driver you'll see a lot of > optimizations that have been added over time. Ideally an optimization improves both: > performance and code readability > Code readability is important for maintainability and weighs higher for me than a minor > performance optimization in a code path that is very rarely used. I see. However, you do use lookup tables for programming with fn rtl_ephy_init(). If this would be more readable, I can unroll the table so it is one entry per line like e_info_8168e_1 was made? Then the actual function call adds nothing to readability and the ease of maintanance, as the principle { address, value } and { address, mask, value } would be preserved. In fact, some programming books advise separating data from code for readability and efficiency sake. I admit that the 8125 optimisation I proposed is minimal locking but hard to read and maintain. (It defies the KISS principle.) Thank you for your time and patience with me. I always like more that things are explained through Spock logic than by a call to authority (which is BTW the argumentum-ad-hominem logical fallacy). (This is of course not the case with the sacred texts, where I use quotes from the relevant authorities.) Have a nice day and I wish you a blessed Sunday. I should probably catch up with the documentation before using any more of your valuable time and energy. I hope to reciprocate. Best regards, Mirsad Todorovac >> Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded >> in the spirit that this is exactly what the guys in Chernobyl did while maintaining the >> reactor that malfunctioned: they did not dare to question the authority telling them that >> everything is alright. >> >> Have a nice evening, and please do not take these words as a breach of the Code or a >> personal attack. I believe we are on the same side, and that is making this driver better. >> >> The Linux kernel developer community was my last hope that this human race has a force >> to improve the mankind and make it worth surviving. >> >> But sometimes it is more honourable to go down with the ship and preserve the honour. >> >> Best regards, >> Mirsad Todorovac
> > > With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like > > > a lock storm that will stall all of the cores and CPUs on the same memory controller > > > for certain time I/O takes to finish. Please provide benchmark data to show this is a real issue, and the patch fixes it. > Additionally, I would like to "inline" many functions, as I think that call/return > sequences with stack frame generation /destruction are more expensive than inlining the > small one liners. Please provide benchmarks to show the compiler is getting this wrong, and inline really is needed. Until there are benchmarks: NACK. Andrew
On 11/5/23 16:33, Andrew Lunn wrote: >>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like >>>> a lock storm that will stall all of the cores and CPUs on the same memory controller >>>> for certain time I/O takes to finish. > Please provide benchmark data to show this is a real issue, and the > patch fixes it. Certainly, Sir, but I have to figure out what to measure. To better think of it, I actually do not have a system with a physical RTL8411b, this patch was made by the finding in the visual inspection of the code. FWIW, separation of code and data is the design principle that is strongly endorsed lately and it seems like a good practice that prevents a range of security breaches and attacks: [1] https://blog.klipse.tech/databook/2020/10/02/separate-code-data.html [2] Reduce system complexity by separating Code from Data, https://livebook.manning.com/book/data-oriented-programming/chapter-2/v-2/ In this case, data is hard-coded. The resulting code would be smaller in size and execution time, and IMHO more readable, (in a table), but I will not enter much discussion for you have made your mind already. >> Additionally, I would like to "inline" many functions, as I think that call/return >> sequences with stack frame generation /destruction are more expensive than inlining the >> small one liners. > > Please provide benchmarks to show the compiler is getting this wrong, > and inline really is needed. I think I am by now technologically up to that: "drivers/net/ethernet/realtek/r8169_main.s" 302034 lines ------------------------------------------------------------------------------------- 7500 r8168_mac_ocp_write: 7501 .LVL488: 7502 .LFB6322: 7503 .loc 1 900 1 is_stmt 1 view -0 7504 .cfi_startproc 7505 1: call __fentry__ 7506 .section __mcount_loc, "a",@progbits 7507 .quad 1b 7508 .previous 7509 .loc 1 901 2 view .LVU1955 7510 .loc 1 903 2 view .LVU1956 7511 .loc 1 903 7 view .LVU1957 7512 .loc 1 903 10 view .LVU1958 7513 .loc 1 903 33 view .LVU1959 7514 .loc 1 903 57 view .LVU1960 7515 .loc 1 903 88 view .LVU1961 7516 .loc 1 903 95 view .LVU1962 7517 .loc 1 900 1 is_stmt 0 view .LVU1963 7518 pushq %rbp 7519 .cfi_def_cfa_offset 16 7520 .cfi_offset 6, -16 7521 movq %rsp, %rbp 7522 .cfi_def_cfa_register 6 7523 pushq %r15 7524 .cfi_offset 15, -24 7525 .loc 1 903 103 view .LVU1964 7526 leaq 6692(%rdi), %r15 7527 .loc 1 900 1 view .LVU1965 7528 pushq %r14 7529 .cfi_offset 14, -32 7530 movl %edx, %r14d 7531 pushq %r13 7532 pushq %r12 7533 .cfi_offset 13, -40 7534 .cfi_offset 12, -48 7535 movq %rdi, %r12 7536 .loc 1 903 103 view .LVU1966 7537 movq %r15, %rdi 7538 .LVL489: 7539 .loc 1 900 1 view .LVU1967 7540 pushq %rbx 7541 .cfi_offset 3, -56 7542 .loc 1 900 1 view .LVU1968 7543 movl %esi, %ebx 7544 .loc 1 903 103 view .LVU1969 7545 call _raw_spin_lock_irqsave 7546 .LVL490: 7547 .LBB3557: 7548 .LBB3558: 7549 .loc 1 893 6 view .LVU1970 7550 movl %ebx, %edi 7551 .LBE3558: 7552 .LBE3557: 7553 .loc 1 903 103 view .LVU1971 7554 movq %rax, %r13 7555 .LVL491: 7556 .loc 1 903 5 is_stmt 1 view .LVU1972 7557 .loc 1 904 2 view .LVU1973 7558 .LBB3564: 7559 .LBI3557: 7560 .loc 1 891 13 view .LVU1974 7561 .LBB3563: 7562 .loc 1 893 2 view .LVU1975 7563 .loc 1 893 6 is_stmt 0 view .LVU1976 7564 call rtl_ocp_reg_failure 7565 .LVL492: 7566 .loc 1 893 5 view .LVU1977 7567 testb %al, %al 7568 jne .L375 7569 .LVL493: 7570 .LBB3559: 7571 .LBI3559: 7572 .loc 1 891 13 is_stmt 1 view .LVU1978 7573 .LBB3560: 7574 .loc 1 896 2 view .LVU1979 7575 .loc 1 896 28 is_stmt 0 view .LVU1980 7576 sall $15, %ebx 7577 .LVL494: 7578 .loc 1 896 58 view .LVU1981 7579 movq (%r12), %rax 7580 .loc 1 896 35 view .LVU1982 7581 orl %r14d, %ebx 7582 .loc 1 896 2 view .LVU1983 7583 orl $-2147483648, %ebx 7584 .LVL495: 7585 .LBB3561: 7586 .LBI3561: 7587 .loc 2 66 120 is_stmt 1 view .LVU1984 7588 .LBB3562: 7589 .loc 2 66 168 view .LVU1985 7590 #APP 7591 # 66 "./arch/x86/include/asm/io.h" 1 7592 movl %ebx,176(%rax) 7593 # 0 "" 2 7594 .LVL496: 7595 #NO_APP 7596 .L375: 7597 .loc 2 66 168 is_stmt 0 view .LVU1986 7598 .LBE3562: 7599 .LBE3561: 7600 .LBE3560: 7601 .LBE3559: 7602 .LBE3563: 7603 .LBE3564: 7604 .loc 1 905 2 is_stmt 1 view .LVU1987 7605 .loc 1 905 7 view .LVU1988 7606 .loc 1 905 10 view .LVU1989 7607 .loc 1 905 33 view .LVU1990 7608 .loc 1 905 57 view .LVU1991 7609 .loc 1 905 88 view .LVU1992 7610 .loc 1 905 95 view .LVU1993 7611 movq %r13, %rsi 7612 movq %r15, %rdi 7613 call _raw_spin_unlock_irqrestore 7614 .LVL497: 7615 .loc 1 905 5 view .LVU1994 7616 .loc 1 906 1 is_stmt 0 view .LVU1995 7617 popq %rbx 7618 .cfi_restore 3 7619 popq %r12 7620 .cfi_restore 12 7621 .LVL498: 7622 .loc 1 906 1 view .LVU1996 7623 popq %r13 7624 .cfi_restore 13 7625 .LVL499: 7626 .loc 1 906 1 view .LVU1997 7627 popq %r14 7628 .cfi_restore 14 7629 .LVL500: 7630 .loc 1 906 1 view .LVU1998 7631 popq %r15 7632 .cfi_restore 15 7633 .LVL501: 7634 .loc 1 906 1 view .LVU1999 7635 popq %rbp 7636 .cfi_restore 6 7637 .cfi_def_cfa 7, 8 7638 xorl %eax, %eax 7639 xorl %edx, %edx 7640 xorl %esi, %esi 7641 xorl %edi, %edi 7642 jmp __x86_return_thunk 7643 .cfi_endproc 7644 .LFE6322: 7645 .size r8168_mac_ocp_write, .-r8168_mac_ocp_write 7646 .p2align 4 7647 .section __patchable_function_entries,"awo",@progbits,rtl_eriar_cond_check 7648 .align 8 7649 .quad .LPFE44 7650 .text 7651 .LPFE44: 7652 nop 7653 nop 7654 nop 7655 nop ------------------------------------------------------------------------------------- The call of the function is the actual call: ------------------------------------------------------------------------------------- 39334 .LBE11119: 39335 .loc 1 3112 2 is_stmt 1 view .LVU10399 39336 xorl %edx, %edx 39337 movl $64556, %esi 39338 movq %r13, %rdi 39339 call r8168_mac_ocp_write ------------------------------------------------------------------------------------- The command used for generating the assembly was taken from .o.cmd file and added -save-temps as the only change: $ gcc -Wp,-MMD,drivers/net/ethernet/realtek/.r8169_main.o.d -save-temps -nostdinc \ -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi \ -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi \ -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h \ -include ./include/linux/compiler_types.h -D__KERNEL__ -fmacro-prefix-map=./= \ -std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing \ -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-jumps=1 \ -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup \ -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables \ -mindirect-branch=thunk-extern -mindirect-branch-register -mindirect-branch-cs-prefix \ -mfunction-return=thunk-extern -fno-jump-tables -mharden-sls=all -fpatchable-function-entry=16,16 \ -fno-delete-null-pointer-checks -O2 -fno-allow-store-data-races -fstack-protector-strong \ -fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-stack-clash-protection \ -fzero-call-used-regs=used-gpr -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -falign-functions=16 \ -fno-strict-overflow -fno-stack-check -fconserve-stack -Wall -Wundef -Werror=implicit-function-declaration \ -Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs \ -Wno-frame-address -Wno-address-of-packed-member -Wframe-larger-than=1024 -Wno-main \ -Wno-unused-but-set-variable -Wno-unused-const-variable -Wvla -Wno-pointer-sign -Wcast-function-type \ -Wno-array-bounds -Wno-alloc-size-larger-than -Wimplicit-fallthrough=5 -Werror=date-time \ -Werror=incompatible-pointer-types -Werror=designated-init -Wenum-conversion -Wno-unused-but-set-variable \ -Wno-unused-const-variable -Wno-restrict -Wno-packed-not-aligned -Wno-format-overflow -Wno-format-truncation \ -Wno-stringop-overflow -Wno-stringop-truncation -Wno-missing-field-initializers -Wno-type-limits \ -Wno-shift-negative-value -Wno-maybe-uninitialized -Wno-sign-compare -g -gdwarf-5 -fsanitize=bounds-strict \ -fsanitize=shift -fsanitize=bool -fsanitize=enum -DMODULE -DKBUILD_BASENAME='"r8169_main"' \ -DKBUILD_MODNAME='"r8169"' -D__KBUILD_MODNAME=kmod_r8169 -c -o drivers/net/ethernet/realtek/r8169_main.o \ drivers/net/ethernet/realtek/r8169_main.c ; ./tools/objtool/objtool --hacks=jump_label --hacks=noinstr \ --hacks=skylake --retpoline --rethunk --sls --stackval --static-call --uaccess --prefix=16 \ --module drivers/net/ethernet/realtek/r8169_main.o This is a build against net-next. Please find the attached config. RTL_(R|W)(8|16|32) family are obviously macros: #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg)) This is the writel(): 7590 #APP 7591 # 66 "./arch/x86/include/asm/io.h" 1 7592 movl %ebx,176(%rax) 7593 # 0 "" 2 7594 .LVL496: 7595 #NO_APP writel() looks optimal. Hope this helps. > Until there are benchmarks: NACK. That means I've got a Reviewed-by: Jacob Keller and two NACKs. I am voted out. :-/ I suppose one NACK from a maintainer is sufficient to halt the patch. Going back to the documentation, the drawing board, and of course, the Source. :-) Best regards, Mirsad Todorovac
> The command used for generating the assembly was taken from .o.cmd file and > added -save-temps as the only change: make drivers/net/ethernet/realtek/r8169_main.lst is simpler. You get the C and the generated assembler listed together. Andrew
On 11/5/23 21:36, Andrew Lunn wrote: >> The command used for generating the assembly was taken from .o.cmd file and >> added -save-temps as the only change: > > make drivers/net/ethernet/realtek/r8169_main.lst > > is simpler. You get the C and the generated assembler listed together. > > Andrew Here, Sir: ---------------------------------------------------------------------------------------------- 0000000000001950 <r8168_mac_ocp_write>: { 1950: e8 00 00 00 00 call 1955 <r8168_mac_ocp_write+0x5> 1951: R_X86_64_PLT32 __fentry__-0x4 1955: 55 push %rbp 1956: 48 89 e5 mov %rsp,%rbp 1959: 41 57 push %r15 raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 195b: 4c 8d bf 24 1a 00 00 lea 0x1a24(%rdi),%r15 { 1962: 41 56 push %r14 1964: 41 89 d6 mov %edx,%r14d 1967: 41 55 push %r13 1969: 41 54 push %r12 196b: 49 89 fc mov %rdi,%r12 raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 196e: 4c 89 ff mov %r15,%rdi { 1971: 53 push %rbx 1972: 89 f3 mov %esi,%ebx raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 1974: e8 00 00 00 00 call 1979 <r8168_mac_ocp_write+0x29> 1975: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 if (rtl_ocp_reg_failure(reg)) 1979: 89 df mov %ebx,%edi raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 197b: 49 89 c5 mov %rax,%r13 if (rtl_ocp_reg_failure(reg)) 197e: e8 1d f2 ff ff call ba0 <rtl_ocp_reg_failure> 1983: 84 c0 test %al,%al 1985: 75 16 jne 199d <r8168_mac_ocp_write+0x4d> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 1987: c1 e3 0f shl $0xf,%ebx 198a: 49 8b 04 24 mov (%r12),%rax 198e: 44 09 f3 or %r14d,%ebx 1991: 81 cb 00 00 00 80 or $0x80000000,%ebx build_mmio_write(writel, "l", unsigned int, "r", :"memory") 1997: 89 98 b0 00 00 00 mov %ebx,0xb0(%rax) raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 199d: 4c 89 ee mov %r13,%rsi 19a0: 4c 89 ff mov %r15,%rdi 19a3: e8 00 00 00 00 call 19a8 <r8168_mac_ocp_write+0x58> 19a4: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 ---------------------------------------------------------------------------------------------- Actually, this time rtl_hw_start_8411_2() has r8169_mac_ocp_write mostly() inlined without an explicit "inline". I cannot explain why the r8169_main.s looked different with the same .config. ---------------------------------------------------------------------------------------------- 0000000000008370 <rtl_hw_start_8411_2>: { 8370: e8 00 00 00 00 call 8375 <rtl_hw_start_8411_2+0x5> 8371: R_X86_64_PLT32 __fentry__-0x4 8375: 55 push %rbp 8376: 48 89 e5 mov %rsp,%rbp 8379: 41 56 push %r14 837b: 41 55 push %r13 837d: 49 89 fd mov %rdi,%r13 8380: 41 54 push %r12 raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 8382: 4d 8d a5 24 1a 00 00 lea 0x1a24(%r13),%r12 rtl_hw_start_8168g(tp); 8389: e8 02 c7 ff ff call 4a90 <rtl_hw_start_8168g> rtl_ephy_init(tp, e_info_8411_2); 838e: ba 0a 00 00 00 mov $0xa,%edx 8393: 48 c7 c6 00 00 00 00 mov $0x0,%rsi 8396: R_X86_64_32S .rodata+0x440 839a: 4c 89 ef mov %r13,%rdi 839d: e8 be 9d ff ff call 2160 <__rtl_ephy_init> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 83a2: 4c 89 e7 mov %r12,%rdi 83a5: e8 00 00 00 00 call 83aa <rtl_hw_start_8411_2+0x3a> 83a6: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 83aa: ba 00 00 14 fe mov $0xfe140000,%edx 83af: 48 89 c6 mov %rax,%rsi RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 83b2: 49 8b 45 00 mov 0x0(%r13),%rax 83b6: 89 90 b0 00 00 00 mov %edx,0xb0(%rax) raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 83bc: 4c 89 e7 mov %r12,%rdi 83bf: e8 00 00 00 00 call 83c4 <rtl_hw_start_8411_2+0x54> 83c0: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 83c4: 4c 89 e7 mov %r12,%rdi 83c7: e8 00 00 00 00 call 83cc <rtl_hw_start_8411_2+0x5c> 83c8: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 83cc: ba 00 00 15 fe mov $0xfe150000,%edx 83d1: 48 89 c6 mov %rax,%rsi RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 83d4: 49 8b 45 00 mov 0x0(%r13),%rax 83d8: 89 90 b0 00 00 00 mov %edx,0xb0(%rax) raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 83de: 4c 89 e7 mov %r12,%rdi 83e1: e8 00 00 00 00 call 83e6 <rtl_hw_start_8411_2+0x76> 83e2: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 r8168_mac_ocp_write(tp, 0xFC2C, 0x0000); 83e6: 31 d2 xor %edx,%edx 83e8: be 2c fc 00 00 mov $0xfc2c,%esi 83ed: 4c 89 ef mov %r13,%rdi 83f0: e8 5b 95 ff ff call 1950 <r8168_mac_ocp_write> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 83f5: 4c 89 e7 mov %r12,%rdi 83f8: e8 00 00 00 00 call 83fd <rtl_hw_start_8411_2+0x8d> 83f9: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 83fd: ba 00 00 17 fe mov $0xfe170000,%edx 8402: 48 89 c6 mov %rax,%rsi RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 8405: 49 8b 45 00 mov 0x0(%r13),%rax 8409: 89 90 b0 00 00 00 mov %edx,0xb0(%rax) raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 840f: 4c 89 e7 mov %r12,%rdi 8412: e8 00 00 00 00 call 8417 <rtl_hw_start_8411_2+0xa7> 8413: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 8417: 4c 89 e7 mov %r12,%rdi 841a: e8 00 00 00 00 call 841f <rtl_hw_start_8411_2+0xaf> 841b: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 841f: ba 00 00 18 fe mov $0xfe180000,%edx 8424: 48 89 c6 mov %rax,%rsi RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 8427: 49 8b 45 00 mov 0x0(%r13),%rax 842b: 89 90 b0 00 00 00 mov %edx,0xb0(%rax) raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 8431: 4c 89 e7 mov %r12,%rdi 8434: e8 00 00 00 00 call 8439 <rtl_hw_start_8411_2+0xc9> 8435: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 8439: 4c 89 e7 mov %r12,%rdi 843c: e8 00 00 00 00 call 8441 <rtl_hw_start_8411_2+0xd1> 843d: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 8441: ba 00 00 19 fe mov $0xfe190000,%edx 8446: 48 89 c6 mov %rax,%rsi RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 8449: 49 8b 45 00 mov 0x0(%r13),%rax 844d: 89 90 b0 00 00 00 mov %edx,0xb0(%rax) raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 8453: 4c 89 e7 mov %r12,%rdi 8456: e8 00 00 00 00 call 845b <rtl_hw_start_8411_2+0xeb> 8457: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 845b: 4c 89 e7 mov %r12,%rdi 845e: e8 00 00 00 00 call 8463 <rtl_hw_start_8411_2+0xf3> 845f: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 8463: ba 00 00 1a fe mov $0xfe1a0000,%edx 8468: 48 89 c6 mov %rax,%rsi RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 846b: 49 8b 45 00 mov 0x0(%r13),%rax 846f: 89 90 b0 00 00 00 mov %edx,0xb0(%rax) raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 8475: 4c 89 e7 mov %r12,%rdi 8478: e8 00 00 00 00 call 847d <rtl_hw_start_8411_2+0x10d> 8479: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 r8168_mac_ocp_write(tp, 0xFC36, 0x0000); 847d: 31 d2 xor %edx,%edx 847f: be 36 fc 00 00 mov $0xfc36,%esi 8484: 4c 89 ef mov %r13,%rdi 8487: e8 c4 94 ff ff call 1950 <r8168_mac_ocp_write> mdelay(3); ---------------------------------------------------------------------------------------------- With the patch applied, and an explicit "inline" on: static inline bool rtl_ocp_reg_failure(u32 reg); static inline void __r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp, ...); static inline void r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp, ...); the code looks like: ---------------------------------------------------------------------------------------------- 0000000000004d60 <rtl_hw_start_8411_2>: { 4d60: e8 00 00 00 00 call 4d65 <rtl_hw_start_8411_2+0x5> 4d61: R_X86_64_PLT32 __fentry__-0x4 4d65: 55 push %rbp 4d66: 48 89 e5 mov %rsp,%rbp 4d69: 41 57 push %r15 4d6b: 41 56 push %r14 for (p = array; len--; p++) 4d6d: 49 c7 c6 00 00 00 00 mov $0x0,%r14 4d70: R_X86_64_32S .rodata+0x820 { 4d74: 41 55 push %r13 4d76: 41 54 push %r12 raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 4d78: 41 bc 28 fc 00 00 mov $0xfc28,%r12d { 4d7e: 53 push %rbx 4d7f: 48 89 fb mov %rdi,%rbx raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 4d82: 4c 8d ab 24 1a 00 00 lea 0x1a24(%rbx),%r13 { 4d89: 48 83 ec 08 sub $0x8,%rsp rtl_hw_start_8168g(tp); 4d8d: e8 3e fe ff ff call 4bd0 <rtl_hw_start_8168g> rtl_ephy_init(tp, e_info_8411_2); 4d92: ba 0a 00 00 00 mov $0xa,%edx 4d97: 48 c7 c6 00 00 00 00 mov $0x0,%rsi 4d9a: R_X86_64_32S .rodata+0x860 4d9e: 48 89 df mov %rbx,%rdi 4da1: e8 fa d4 ff ff call 22a0 <__rtl_ephy_init> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 4da6: 4c 89 ef mov %r13,%rdi 4da9: e8 00 00 00 00 call 4dae <rtl_hw_start_8411_2+0x4e> 4daa: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 4dae: 31 c9 xor %ecx,%ecx 4db0: 49 89 c7 mov %rax,%r15 for (p = array; len--; p++) 4db3: eb 07 jmp 4dbc <rtl_hw_start_8411_2+0x5c> __r8168_mac_ocp_write(tp, p->reg, p->data); 4db5: 41 8b 4e 04 mov 0x4(%r14),%ecx 4db9: 45 8b 26 mov (%r14),%r12d return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg); 4dbc: 41 f7 c4 01 00 ff ff test $0xffff0001,%r12d 4dc3: 0f 85 9b 01 00 00 jne 4f64 <rtl_hw_start_8411_2+0x204> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 4dc9: 41 c1 e4 0f shl $0xf,%r12d 4dcd: 48 8b 03 mov (%rbx),%rax 4dd0: 41 09 cc or %ecx,%r12d 4dd3: 41 81 cc 00 00 00 80 or $0x80000000,%r12d build_mmio_write(writel, "l", unsigned int, "r", :"memory") 4dda: 44 89 a0 b0 00 00 00 mov %r12d,0xb0(%rax) for (p = array; len--; p++) 4de1: 49 83 c6 08 add $0x8,%r14 4de5: 49 81 fe 00 00 00 00 cmp $0x0,%r14 4de8: R_X86_64_32S .rodata+0x860 4dec: 75 c7 jne 4db5 <rtl_hw_start_8411_2+0x55> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 4dee: 4c 89 fe mov %r15,%rsi 4df1: 4c 89 ef mov %r13,%rdi raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 4df4: 41 bc 00 f8 00 00 mov $0xf800,%r12d for (p = array; len--; p++) 4dfa: 49 c7 c6 00 00 00 00 mov $0x0,%r14 4dfd: R_X86_64_32S .rodata+0x4a0 raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 4e01: e8 00 00 00 00 call 4e06 <rtl_hw_start_8411_2+0xa6> 4e02: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 mdelay(3); 4e06: bf 08 9c c4 00 mov $0xc49c08,%edi 4e0b: e8 00 00 00 00 call 4e10 <rtl_hw_start_8411_2+0xb0> 4e0c: R_X86_64_PLT32 __const_udelay-0x4 r8168_mac_ocp_write(tp, 0xFC26, 0x0000); 4e10: 31 d2 xor %edx,%edx 4e12: be 26 fc 00 00 mov $0xfc26,%esi 4e17: 48 89 df mov %rbx,%rdi 4e1a: e8 71 d0 ff ff call 1e90 <r8168_mac_ocp_write> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 4e1f: 4c 89 ef mov %r13,%rdi 4e22: e8 00 00 00 00 call 4e27 <rtl_hw_start_8411_2+0xc7> 4e23: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 4e27: b9 08 e0 00 00 mov $0xe008,%ecx 4e2c: 49 89 c7 mov %rax,%r15 for (p = array; len--; p++) 4e2f: eb 07 jmp 4e38 <rtl_hw_start_8411_2+0xd8> __r8168_mac_ocp_write(tp, p->reg, p->data); 4e31: 41 8b 4e 04 mov 0x4(%r14),%ecx 4e35: 45 8b 26 mov (%r14),%r12d return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg); 4e38: 41 f7 c4 01 00 ff ff test $0xffff0001,%r12d 4e3f: 0f 85 eb 00 00 00 jne 4f30 <rtl_hw_start_8411_2+0x1d0> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 4e45: 41 c1 e4 0f shl $0xf,%r12d 4e49: 48 8b 03 mov (%rbx),%rax 4e4c: 41 09 cc or %ecx,%r12d 4e4f: 41 81 cc 00 00 00 80 or $0x80000000,%r12d 4e56: 44 89 a0 b0 00 00 00 mov %r12d,0xb0(%rax) for (p = array; len--; p++) 4e5d: 49 83 c6 08 add $0x8,%r14 4e61: 49 81 fe 00 00 00 00 cmp $0x0,%r14 4e64: R_X86_64_32S .rodata+0x818 4e68: 75 c7 jne 4e31 <rtl_hw_start_8411_2+0xd1> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 4e6a: 4c 89 fe mov %r15,%rsi 4e6d: 4c 89 ef mov %r13,%rdi raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 4e70: 41 bc 2a fc 00 00 mov $0xfc2a,%r12d for (p = array; len--; p++) 4e76: 49 c7 c6 00 00 00 00 mov $0x0,%r14 4e79: R_X86_64_32S .rodata+0x460 raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 4e7d: e8 00 00 00 00 call 4e82 <rtl_hw_start_8411_2+0x122> 4e7e: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 r8168_mac_ocp_write(tp, 0xFC26, 0x8000); 4e82: ba 00 80 00 00 mov $0x8000,%edx 4e87: be 26 fc 00 00 mov $0xfc26,%esi 4e8c: 48 89 df mov %rbx,%rdi 4e8f: e8 fc cf ff ff call 1e90 <r8168_mac_ocp_write> raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags); 4e94: 4c 89 ef mov %r13,%rdi 4e97: e8 00 00 00 00 call 4e9c <rtl_hw_start_8411_2+0x13c> 4e98: R_X86_64_PLT32 _raw_spin_lock_irqsave-0x4 4e9c: b9 43 07 00 00 mov $0x743,%ecx 4ea1: 49 89 c7 mov %rax,%r15 for (p = array; len--; p++) 4ea4: eb 07 jmp 4ead <rtl_hw_start_8411_2+0x14d> __r8168_mac_ocp_write(tp, p->reg, p->data); 4ea6: 41 8b 4e 04 mov 0x4(%r14),%ecx 4eaa: 45 8b 26 mov (%r14),%r12d return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg); 4ead: 41 f7 c4 01 00 ff ff test $0xffff0001,%r12d 4eb4: 75 4d jne 4f03 <rtl_hw_start_8411_2+0x1a3> RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data); 4eb6: 41 c1 e4 0f shl $0xf,%r12d 4eba: 48 8b 03 mov (%rbx),%rax 4ebd: 41 09 cc or %ecx,%r12d 4ec0: 41 81 cc 00 00 00 80 or $0x80000000,%r12d 4ec7: 44 89 a0 b0 00 00 00 mov %r12d,0xb0(%rax) for (p = array; len--; p++) 4ece: 49 83 c6 08 add $0x8,%r14 4ed2: 49 81 fe 00 00 00 00 cmp $0x0,%r14 4ed5: R_X86_64_32S .rodata+0x498 4ed9: 75 cb jne 4ea6 <rtl_hw_start_8411_2+0x146> raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags); 4edb: 4c 89 fe mov %r15,%rsi 4ede: 4c 89 ef mov %r13,%rdi 4ee1: e8 00 00 00 00 call 4ee6 <rtl_hw_start_8411_2+0x186> 4ee2: R_X86_64_PLT32 _raw_spin_unlock_irqrestore-0x4 } 4ee6: 48 83 c4 08 add $0x8,%rsp 4eea: 5b pop %rbx 4eeb: 41 5c pop %r12 4eed: 41 5d pop %r13 4eef: 41 5e pop %r14 4ef1: 41 5f pop %r15 4ef3: 5d pop %rbp 4ef4: 31 c0 xor %eax,%eax 4ef6: 31 d2 xor %edx,%edx 4ef8: 31 c9 xor %ecx,%ecx 4efa: 31 f6 xor %esi,%esi 4efc: 31 ff xor %edi,%edi 4efe: e9 00 00 00 00 jmp 4f03 <rtl_hw_start_8411_2+0x1a3> 4eff: R_X86_64_PLT32 __x86_return_thunk-0x4 return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg); 4f03: 0f b6 05 00 00 00 00 movzbl 0x0(%rip),%eax # 4f0a <rtl_hw_start_8411_2+0x1aa> 4f06: R_X86_64_PC32 .data.once-0x2 4f0a: 3c 01 cmp $0x1,%al 4f0c: 0f 87 00 00 00 00 ja 4f12 <rtl_hw_start_8411_2+0x1b2> 4f0e: R_X86_64_PC32 .text.unlikely+0xd0 4f12: a8 01 test $0x1,%al 4f14: 75 b8 jne 4ece <rtl_hw_start_8411_2+0x16e> 4f16: 44 89 e6 mov %r12d,%esi 4f19: 48 c7 c7 00 00 00 00 mov $0x0,%rdi 4f1c: R_X86_64_32S .rodata.str1.1+0x53 4f20: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4f27 <rtl_hw_start_8411_2+0x1c7> 4f22: R_X86_64_PC32 .data.once-0x3 4f27: e8 00 00 00 00 call 4f2c <rtl_hw_start_8411_2+0x1cc> 4f28: R_X86_64_PLT32 __warn_printk-0x4 4f2c: 0f 0b ud2 4f2e: eb 9e jmp 4ece <rtl_hw_start_8411_2+0x16e> 4f30: 0f b6 05 00 00 00 00 movzbl 0x0(%rip),%eax # 4f37 <rtl_hw_start_8411_2+0x1d7> 4f33: R_X86_64_PC32 .data.once-0x2 4f37: 3c 01 cmp $0x1,%al 4f39: 0f 87 00 00 00 00 ja 4f3f <rtl_hw_start_8411_2+0x1df> 4f3b: R_X86_64_PC32 .text.unlikely+0x106 4f3f: a8 01 test $0x1,%al 4f41: 0f 85 16 ff ff ff jne 4e5d <rtl_hw_start_8411_2+0xfd> 4f47: 44 89 e6 mov %r12d,%esi 4f4a: 48 c7 c7 00 00 00 00 mov $0x0,%rdi 4f4d: R_X86_64_32S .rodata.str1.1+0x53 4f51: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4f58 <rtl_hw_start_8411_2+0x1f8> 4f53: R_X86_64_PC32 .data.once-0x3 4f58: e8 00 00 00 00 call 4f5d <rtl_hw_start_8411_2+0x1fd> 4f59: R_X86_64_PLT32 __warn_printk-0x4 4f5d: 0f 0b ud2 4f5f: e9 f9 fe ff ff jmp 4e5d <rtl_hw_start_8411_2+0xfd> 4f64: 0f b6 05 00 00 00 00 movzbl 0x0(%rip),%eax # 4f6b <rtl_hw_start_8411_2+0x20b> 4f67: R_X86_64_PC32 .data.once-0x2 4f6b: 3c 01 cmp $0x1,%al 4f6d: 0f 87 00 00 00 00 ja 4f73 <rtl_hw_start_8411_2+0x213> 4f6f: R_X86_64_PC32 .text.unlikely+0xeb 4f73: a8 01 test $0x1,%al 4f75: 0f 85 66 fe ff ff jne 4de1 <rtl_hw_start_8411_2+0x81> 4f7b: 44 89 e6 mov %r12d,%esi 4f7e: 48 c7 c7 00 00 00 00 mov $0x0,%rdi 4f81: R_X86_64_32S .rodata.str1.1+0x53 4f85: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4f8c <rtl_hw_start_8411_2+0x22c> 4f87: R_X86_64_PC32 .data.once-0x3 4f8c: e8 00 00 00 00 call 4f91 <rtl_hw_start_8411_2+0x231> 4f8d: R_X86_64_PLT32 __warn_printk-0x4 4f91: 0f 0b ud2 4f93: e9 49 fe ff ff jmp 4de1 <rtl_hw_start_8411_2+0x81> 4f98: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 4f9f: 00 4fa0: 90 nop 4fa1: 90 nop 4fa2: 90 nop 4fa3: 90 nop 4fa4: 90 nop 4fa5: 90 nop 4fa6: 90 nop 4fa7: 90 nop 4fa8: 90 nop 4fa9: 90 nop 4faa: 90 nop 4fab: 90 nop 4fac: 90 nop 4fad: 90 nop 4fae: 90 nop 4faf: 90 nop ---------------------------------------------------------------------------------------------- Best regards, Mirsad Todorovac