[net-next,v2,0/9] Add support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface

Message ID	20231023154649.45931-1-Parthiban.Veerasooran@microchip.com (mailing list archive)
From: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
To: <davem@davemloft.net>, <edumazet@google.com>, <kuba@kernel.org>, <pabeni@redhat.com>, <robh+dt@kernel.org>, <krzysztof.kozlowski+dt@linaro.org>, <conor+dt@kernel.org>, <corbet@lwn.net>, <steen.hegelund@microchip.com>, <rdunlap@infradead.org>, <horms@kernel.org>, <casper.casan@gmail.com>, <andrew@lunn.ch>
CC: <netdev@vger.kernel.org>, <devicetree@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <linux-doc@vger.kernel.org>, <horatiu.vultur@microchip.com>, <Woojung.Huh@microchip.com>, <Nicolas.Ferre@microchip.com>, <UNGLinuxDriver@microchip.com>, <Thorsten.Kummermehr@microchip.com>, Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com>
Subject: [PATCH net-next v2 0/9] Add support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface
Date: Mon, 23 Oct 2023 21:16:40 +0530

Parthiban Veerasooran Oct. 23, 2023, 3:46 p.m. UTC

This patch series contains the following updates:
- Adds support for the OPEN Alliance 10BASE-T1x MACPHY Serial Interface in
  drivers/net/ethernet/oa_tc6.c.
- Adds a driver for the Microchip LAN8650/1 Rev.B0 10BASE-T1S MACPHY in
  drivers/net/ethernet/microchip/lan865x.c.

Changes:
v2:
- Removed RFC tag.
- OA TC6 framework configured in the Kconfig and Makefile to compile as a
  module.
- Kerneldoc headers added for all the API methods exposed to MAC driver.
- Odd parity calculation logic updated based on the code at the link below,
  https://elixir.bootlin.com/linux/latest/source/lib/bch.c#L348
- Control buffer memory allocation moved to the initial function.
- struct oa_tc6 implemented as an opaque structure.
- Removed the kthread for handling the mac-phy interrupt; a threaded irq is
  used instead.
- Removed the interrupt-based implementation of soft reset handling;
  polling is used instead.
- Register names in the defines changed according to the specification
  document.
- Register defines are arranged in order of offset, each followed by
  its register fields.
- oa_tc6_write_register() implemented for writing a single register and
  oa_tc6_write_registers() implemented for writing multiple registers.
- oa_tc6_read_register() implemented for reading a single register and
  oa_tc6_read_registers() implemented for reading multiple registers.
- Removed DRV_VERSION macro as the git hash is provided by ethtool.
- Moved MDIO bus registration and PHY initialization to the OA TC6 lib.
- Replaced lan865x_set/get_link_ksettings() functions with
  phy_ethtool_ksettings_set/get() functions.
- MAC-PHY's standard capability register values checked against the
  user configured values.
- Removed unnecessary parameters validity check in various places.
- Removed MAC address configuration in the lan865x_net_open() function as
  it is done in the lan865x_probe() function already.
- Moved standard registers and proprietary vendor registers to the
  respective files.
- Added proper subject prefixes for the DT bindings.
- Moved OA specific properties to a separate DT bindings and corrected the
  types & mistakes in the DT bindings.
- LAN865x specific DT bindings now inherit the OA specific DT bindings.
- Removed sparse warnings in all the places.
- Used net_err_ratelimited() for printing the error messages.
- oa_tc6_process_rx_chunks() function and the content of oa_tc6_handler()
  function are split into small functions.
- Used proper macros provided by network layer for calculating the
  MAX_ETH_LEN.
- Return value of netif_rx() function handled properly.
- Removed unnecessary NULL initialization of skb in the
  oa_tc6_rx_eth_ready() function.
- Local variables declaration ordered in reverse xmas tree notation.
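
As an illustration of the odd parity item in the list above, here is a
minimal sketch of how an odd parity bit could be computed for a 32-bit
header. It assumes the parity bit is bit 0 of the header; the names
parity32() and hdr_set_odd_parity() are hypothetical and are not taken
from the patch. It uses plain XOR folding and is not a copy of the code
in lib/bch.c.

```c
#include <stdint.h>

/* Returns 1 if x has an odd number of set bits, 0 otherwise. */
static unsigned int parity32(uint32_t x)
{
	/* Fold the word onto itself; bit 0 ends up as the XOR of all bits. */
	x ^= x >> 16;
	x ^= x >> 8;
	x ^= x >> 4;
	x ^= x >> 2;
	x ^= x >> 1;
	return x & 1u;
}

/*
 * Assumption for this sketch: bit 0 of the header is the parity bit.
 * Set it so that the complete 32-bit word has an odd number of ones.
 */
static uint32_t hdr_set_odd_parity(uint32_t hdr)
{
	hdr &= ~UINT32_C(1);		/* start with the parity bit cleared */
	if (!parity32(hdr))		/* even number of ones so far */
		hdr |= UINT32_C(1);	/* add one more to make it odd */
	return hdr;
}
```

A receiver can validate a header with the same helper: parity32() of a
correctly protected word returns 1.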

Parthiban Veerasooran (9):
  net: ethernet: implement OPEN Alliance control transaction interface
  net: ethernet: oa_tc6: implement mac-phy software reset
  net: ethernet: oa_tc6: implement OA TC6 configuration function
  dt-bindings: net: add OPEN Alliance 10BASE-T1x MAC-PHY Serial
    Interface
  net: ethernet: oa_tc6: implement internal PHY initialization
  dt-bindings: net: oa-tc6: add PHY register access capability
  net: ethernet: oa_tc6: implement data transaction interface
  microchip: lan865x: add driver support for Microchip's LAN865X MACPHY
  dt-bindings: net: add Microchip's LAN865X 10BASE-T1S MACPHY

 .../bindings/net/microchip,lan865x.yaml       |  101 ++
 .../devicetree/bindings/net/oa-tc6.yaml       |   86 ++
 Documentation/networking/oa-tc6-framework.rst |  233 ++++
 MAINTAINERS                                   |   16 +
 drivers/net/ethernet/Kconfig                  |   12 +
 drivers/net/ethernet/Makefile                 |    1 +
 drivers/net/ethernet/microchip/Kconfig        |   11 +
 drivers/net/ethernet/microchip/Makefile       |    2 +
 drivers/net/ethernet/microchip/lan865x.c      |  415 ++++++
 drivers/net/ethernet/oa_tc6.c                 | 1117 +++++++++++++++++
 include/linux/oa_tc6.h                        |  109 ++
 11 files changed, 2103 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/microchip,lan865x.yaml
 create mode 100644 Documentation/devicetree/bindings/net/oa-tc6.yaml
 create mode 100644 Documentation/networking/oa-tc6-framework.rst
 create mode 100644 drivers/net/ethernet/microchip/lan865x.c
 create mode 100644 drivers/net/ethernet/oa_tc6.c
 create mode 100644 include/linux/oa_tc6.h

Andrew Lunn Feb. 16, 2024, 10:13 p.m. UTC | #1

On Mon, Oct 23, 2023 at 09:16:40PM +0530, Parthiban Veerasooran wrote:
> This patch series contain the below updates,
> - Adds support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface in the
>   net/ethernet/oa_tc6.c.
> - Adds driver support for Microchip LAN8650/1 Rev.B0 10BASE-T1S MACPHY
>   Ethernet driver in the net/ethernet/microchip/lan865x.c.

Hi Parthiban

Onsemi also have a TC6 device, which should use this framework. They
would like to make progress getting their device supported in
mainline.

What is happening with this patchset? It's been a few months since you
posted this. Will there be a new version soon? Has Microchip stopped
development? Postponed it because of other priorities etc?

Thanks
	Andrew

Parthiban Veerasooran Feb. 19, 2024, 9:46 a.m. UTC | #2

On 17/02/24 3:43 am, Andrew Lunn wrote:
> On Mon, Oct 23, 2023 at 09:16:40PM +0530, Parthiban Veerasooran wrote:
>> This patch series contain the below updates,
>> - Adds support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface in the
>>    net/ethernet/oa_tc6.c.
>> - Adds driver support for Microchip LAN8650/1 Rev.B0 10BASE-T1S MACPHY
>>    Ethernet driver in the net/ethernet/microchip/lan865x.c.
> 
> Hi Parthiban
> 
> Onsemi also have a TC6 device, which should use this framework. They
> would like to make progress getting their device supported in
> mainline.
> 
> What is happening with this patchset? It's been a few months since you
> posted this. Will there be a new version soon? Has Microchip stopped
> development? Postponed it because of other priorities etc?
Hi Andrew,

From the Microchip side, we haven't stopped or postponed development of
this framework. We are actively working on it and it is in its final
stage now. We are doing internal reviews right now and expect to post it
for mainline again in about 3 weeks. We will send a new version (v3) of
this patch series soon.

Thanks for your patience.

Best regards,
Parthiban V
> 
> Thanks
>          Andrew

Andrew Lunn Feb. 19, 2024, 2:30 p.m. UTC | #3

> Hi Andrew,
> 
>  From Microchip side, we haven't stopped/postponed this framework 
> development. We are already working on it. It is in the final stage now. 
> We are doing internal reviews right now and we expect that in 3 weeks 
> time frame in the mainline again. We will send a new version (v3) of 
> this patch series soon.

Hi Parthiban

It is good to hear you are still working on it.

I have a few comments about how Linux mainline works. It tends to be
very iterative. Cycles tend to be fast. You will probably get review
comments within a couple of days of posting code. You often see
developers posting a new version within a few days, maybe a week. If
reviewers have asked for large changes, it can take longer, but in
general, the cycles are short.

When you say you need three weeks for internal review, that to me
seems very slow. Is it so hard to get access to internal reviewers? Do
you have a very formal review process? More waterfall than iterative
development? I would suggest you try to keep your internal reviews
fast and low overhead, because you will be doing it lots of times as
we iterate the framework.

	Andrew

Parthiban Veerasooran Feb. 21, 2024, 5:15 a.m. UTC | #4

On 19/02/24 8:00 pm, Andrew Lunn wrote:
>> Hi Andrew,
>>
>>   From Microchip side, we haven't stopped/postponed this framework
>> development. We are already working on it. It is in the final stage now.
>> We are doing internal reviews right now and we expect that in 3 weeks
>> time frame in the mainline again. We will send a new version (v3) of
>> this patch series soon.
> 
> Hi Parthiban
> 
> It is good to hear you are still working on it.
> 
> I have a few comments about how Linux mainline works. It tends to be
> very iterative. Cycles tend to be fast. You will probably get review
> comments within a couple of days of posting code. You often see
> developers posting a new version within a few days, maybe a week. If
> reviewers have asked for large changes, it can take longer, but in
> general, the cycles are short.
> 
> When you say you need three weeks for internal review, that to me
> seems very slow. Is it so hard to get access to internal reviewers? Do
> you have a very formal review process? More waterfall than iterative
> development? I would suggest you try to keep your internal reviews
> fast and low overhead, because you will be doing it lots of times as
> we iterate the framework.

Hi Andrew,

We understand your concern. We are working on this task with full focus.
Initially there were some implementation change proposals from our
internal reviewers to improve the performance and code quality.
Consequently, testing the new implementation and bringing it into good
shape took a while.

Our internal reviewers Steen Hegelund and Horatiu Vultur are actively
participating in reviewing my patches. I have already talked to them and
we are working together to get the next version ready for
submission. We are trying our level best and working hard to push the
next set of patches to the mainline as soon as possible.

Best regards,
Parthiban V
> 
>          Andrew
>

Piergiorgio Beruto Feb. 21, 2024, 12:21 p.m. UTC | #5

Hi all,
thank you for this discussion.

Since this framework is supposed to be a base for every potential OA-TC6 MACPHY implementation, I propose this could be a joint effort.
Selvamani and I spotted some points in the code where we could optimize the performance and we would like to add the required changes to accommodate all potential implementations.

Would it be possible to share with the group the latest patches?

Thanks,
Piergiorgio


Parthiban Veerasooran Feb. 22, 2024, 5:17 a.m. UTC | #6

On 21/02/24 5:51 pm, Piergiorgio Beruto wrote:
> Hi all,
> thank you for this discussion.
> 
> Since this framework is supposed to be a base for every potential OA-TC6 MACPHY implementation, I propose this could be a joint work.
> Selvamani and I spotted some points in the code where we could optimize the performance and we would like to add the required changes to accommodate all potential implementations.
> 
> Would it be possible to share with the group the latest patches?

Hi Piergiorgio,

Good to hear! Yes, I agree with you, let's work together. We really
appreciate your efforts on this.

As I already replied to Andrew, we too made some implementation changes
proposed by our internal reviewers to optimize the performance and
improve the code quality. We are in the final stage of the internal
review, and to avoid confusion between versions we would rather not
share an intermediate version. The next version (v3) will definitely
hit the mainline in the next couple of days so that you
can give your comments. My team and our internal reviewers are
working extremely hard to make it happen.

Thanks for your patience.

Best regards,
Parthiban V
> 
> Thanks,
> Piergiorgio

Parthiban Veerasooran March 4, 2024, 11:14 a.m. UTC | #7

On 19/02/24 8:00 pm, Andrew Lunn wrote:
>> Hi Andrew,
>>
>>   From Microchip side, we haven't stopped/postponed this framework
>> development. We are already working on it. It is in the final stage now.
>> We are doing internal reviews right now and we expect that in 3 weeks
>> time frame in the mainline again. We will send a new version (v3) of
>> this patch series soon.
> 
> Hi Parthiban
> 
> It is good to hear you are still working on it.
> 
> I have a few comments about how Linux mainline works. It tends to be
> very iterative. Cycles tend to be fast. You will probably get review
> comments within a couple of days of posting code. You often see
> developers posting a new version within a few days, maybe a week. If
> reviewers have asked for large changes, it can take longer, but in
> general, the cycles are short.
> 
> When you say you need three weeks for internal review, that to me
> seems very slow. Is it so hard to get access to internal reviewers? Do
> you have a very formal review process? More waterfall than iterative
> development? I would suggest you try to keep your internal reviews
> fast and low overhead, because you will be doing it lots of times as
> we iterate the framework.
> 
>          Andrew
> 
Hi Andrew,

Good day...!

We have finally completed preparation of the v3 patch series and plan
to post it to the mainline in the next few days. FYI, next week (from 11th
to 15th March) I will be out of office and will not have access to
emails. I will be back at work on 18th March. Would it be OK with you
if I post the patch series this week, or shall I post it on March 18th,
given that I will not be able to reply to the comments immediately? Could you
please provide your suggestion on this?

Best regards,
Parthiban V

Andrew Lunn March 4, 2024, 1:22 p.m. UTC | #8

> Hi Andrew,
> 
> Good day...!
> 
> Finally we have completed the v3 patch series preparation and planning 
> to post it in the mainline in the next days. FYI, next week (from 11th 
> to 15th March) I will be out of office and will not have access to 
> emails. Again will be back to work on 18th March. Would it be ok for you 
> to post the patch series this week or shall I post it on March 18th? as 
> I will not be able to reply to the comments immediately. Could you 
> please provide your suggestion on this?

Onsemi are waiting for the new version. So I would suggest you post
them sooner, not later. If need be, get one of the other Microchip
developers to post them. If they are posted as RFC, the Signed-off-by
of the actual submitter can be missing.

       Andrew

Parthiban Veerasooran March 5, 2024, 1:22 p.m. UTC | #9

On 04/03/24 6:52 pm, Andrew Lunn wrote:
>> Hi Andrew,
>>
>> Good day...!
>>
>> Finally we have completed the v3 patch series preparation and planning
>> to post it in the mainline in the next days. FYI, next week (from 11th
>> to 15th March) I will be out of office and will not have access to
>> emails. Again will be back to work on 18th March. Would it be ok for you
>> to post the patch series this week or shall I post it on March 18th? as
>> I will not be able to reply to the comments immediately. Could you
>> please provide your suggestion on this?
> 
> Onsemi are waiting for the new version. So i would suggest you post
> them sooner, not later. If need be, get one of the other Microchip
> developers to post them. If they are posted RFC, the signed-off-by can
> be missing the actual submitter.
> 
>         Andrew
Hi Andrew,

Thank you for your suggestion. We will send out the patches soon and 
will try our level best to reply to the comments as soon as possible.

Best regards,
Parthiban V

Benjamin Bigler March 24, 2024, 11:55 a.m. UTC | #10

Hi Parthiban

I hope I send this in the right context as it is not related to just one patch or
some specific code.

I conducted UDP load testing using three i.MX8MM boards in conjunction with the
LAN8651. The setup involved one board functioning as a server, which is just
echoing back received data, while the remaining two boards acted as clients,
sending UDP packets of different sizes in various bursts to the server.
Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might
have influenced the results.

During the tests I experienced some issues:

- The boards only start receiving after they have first sent something (e.g. a ping to
  another board). Some measurements showed that the irq stays asserted after init. This
  makes sense: as far as I understand section 7.7 of the specification, the irq is deasserted
  on reception of the first data header following CSn being asserted. As a workaround
  I trigger the thread at the end of oa_tc6_init.

- If there is a lot of traffic, the receive buffer overflow error spams the log.

- If there is a lot of traffic, I get various kernel panics in oa_tc6_update_rx_skb,
  mostly because more data is added to rx_skb than was allocated, and sometimes because
  rx_skb is NULL in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging
  with a logic analyzer showed that the chip does not behave correctly: there are more
  bytes between start_valid and end_valid than there should be, and there
  seem to be two end_valid without a start_valid in between. What these cases have in
  common is that the incorrect frame starts in a chunk where both end_valid and
  start_valid are set.
  In my opinion it's a problem in the chip (maybe related to the erratum in the next point),
  but the driver should be resilient and just drop the packet rather than cause a kernel panic.

- Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0)
  and exst is not active either. I found an erratum (DS80001075, point s3)
  that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above.
  The driver has now been running for about 2.5 weeks under various loads with just one
  loss of frame error, where I had to reboot the system after about 4 days.
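
To make the "drop the packet instead of panicking" request in the third
point concrete, here is a minimal user-space sketch of defensive frame
reassembly. It is not the oa_tc6 code: struct rx_state, rx_segment() and
RX_FRAME_MAX are hypothetical names, the buffer size is an assumption,
and each call models one contiguous piece of a single frame (a real
chunk with both end_valid and start_valid set would be fed as two
calls).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_FRAME_MAX 1522u	/* assumed size of the receive buffer */

struct rx_state {
	uint8_t buf[RX_FRAME_MAX];
	size_t len;		/* bytes collected for the current frame */
	bool in_frame;		/* a start_valid was seen, no end_valid yet */
	unsigned int dropped;	/* frames discarded as malformed */
	unsigned int orphans;	/* pieces seen without a preceding start */
};

/*
 * Feed one piece of frame data. Returns true when buf[0..len) holds a
 * complete frame. Malformed sequences are counted and discarded.
 */
static bool rx_segment(struct rx_state *rx, const uint8_t *data, size_t len,
		       bool start_valid, bool end_valid)
{
	if (start_valid) {
		if (rx->in_frame)
			rx->dropped++;	/* previous frame never ended */
		rx->in_frame = true;
		rx->len = 0;
	} else if (!rx->in_frame) {
		rx->orphans++;		/* data or end without a start */
		return false;
	}

	/* Never copy more than the buffer can hold: drop the frame. */
	if (len > RX_FRAME_MAX - rx->len) {
		rx->dropped++;
		rx->in_frame = false;
		rx->len = 0;
		return false;
	}

	memcpy(rx->buf + rx->len, data, len);
	rx->len += len;

	if (!end_valid)
		return false;

	rx->in_frame = false;		/* frame complete, caller consumes it */
	return true;
}
```

The point of the sketch is that both failure modes described above (too
many bytes between start and end, and an end without a start) reach a
counter and an early return rather than a write past the buffer or a
NULL dereference.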

Is there a reason why you removed the netdev watchdog which was active in v2?

Thanks,
Benjamin Bigler

Parthiban Veerasooran March 25, 2024, 1:24 p.m. UTC | #11

Hi Benjamin Bigler,

Thank you for your testing and feedback. It is really helpful in
getting the driver into good shape. We really appreciate your efforts on this.

On 24/03/24 5:25 pm, Benjamin Bigler wrote:
> Hi Parthiban
> 
> I hope I send this in the right context as it is not related to just one patch or
> some specific code.
> 
> I conducted UDP load testing using three i.MX8MM boards in conjunction with the
> LAN8651. The setup involved one board functioning as a server, which is just
> echoing back received data, while the remaining two boards acted as clients,
> sending UDP packets of different sizes in various bursts to the server.
> Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might
> have influenced the results.
> 
> During the tests I experienced some issues:
> 
> - The boards just start receiving after first sending something (ping another board).
>    Some measurements showed that the irq stays asserted after init. This makes sense
>    as far as I understand the chapter 7.7 of the specification, the irq is deasserted
>    on reception of the first data header following CSn being asserted. As a workaround
>    I trigger the thread at the end of oa_tc6_init.
It looks like the IRQ is asserted on RESET completion and the MAC-PHY
expects a data chunk from the host to deassert it. I test the driver on
an RPI 4 using iperf3 and for some reason never faced this issue; maybe
some packet transmission happens while the network device is being
registered, which delivers a data chunk so that the IRQ is
deasserted. Thanks for the workaround. I think that is the right way
to solve this issue. Adding the lines below at the end of
oa_tc6_init() will trigger oa_tc6_spi_thread_handler() to
perform an empty data chunk transfer, which deasserts the IRQ before
the actual data transfer starts.

/* oa_tc6_sw_reset_macphy() resets the MAC-PHY and clears the reset
 * complete status. The IRQ is also asserted on reset completion and
 * remains asserted until the MAC-PHY receives a data chunk, so performing
 * an empty data chunk transmission will deassert the IRQ. Refer to sections
 * 7.7 and 9.2.8.8 of the OPEN Alliance specification for more details.
 */
tc6->int_flag = true;
wake_up_interruptible(&tc6->spi_wq);
> 
> - If there is a lot of traffic, the receive buffer overflow error spams the log.
> 
> - If there is a lot of traffic, I get various kernel panics in oa_tc6_update_rx_skb,
>    mostly because more data is added to rx_skb than was allocated, and sometimes because
>    rx_skb is NULL in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging
>    with a logic analyzer showed that the chip does not behave correctly: there are more
>    bytes between start_valid and end_valid than there should be, and there
>    seem to be two end_valid without a start_valid in between. What these cases have in
>    common is that the incorrect frame starts in a chunk where both end_valid and
>    start_valid are set.
>    In my opinion it's a problem in the chip (maybe related to the erratum in the next point),
>    but the driver should be resilient and just drop the packet rather than cause a kernel panic.
I usually run into this "receive buffer overflow" issue when I run the
RPI 4 with the default cpu governor setting, which is "ondemand". In this
case, even if I set the SPI clock speed to 15 MHz, the RPI 4 core clock
is clocked down when it is idle, which results in about half of the
configured SPI clock speed, around 5.9 MHz. So systems like the RPI 4
need performance mode enabled to get the proper clock speed for SPI.
Refer to the link below for more details.

https://github.com/raspberrypi/linux/issues/3381#issuecomment-1144723750

I enable performance mode using the command below.

echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > /dev/null

So please verify the SPI clock speed with a logic analyzer to get the
maximum throughput without receive buffer overflow.
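
As a rough back-of-the-envelope check of why the effective SPI clock
matters here, the sketch below estimates the SPI clock needed to keep up
with a given payload rate. It assumes each data chunk carries a 4-byte
header/footer plus a 64-byte payload, and it ignores gaps between SPI
transfers, control transactions and host latency, so real requirements
are higher. The function names are hypothetical.

```c
#include <stdint.h>

#define CHUNK_OVERHEAD_BYTES 4u	/* assumed per-chunk header/footer size */

/* Minimum SPI clock (Hz) needed to carry payload_bps of frame data. */
static uint64_t min_spi_hz(uint64_t payload_bps, unsigned int payload_bytes)
{
	return payload_bps * (payload_bytes + CHUNK_OVERHEAD_BYTES) /
	       payload_bytes;
}

/* Best-case payload rate (bit/s) achievable at a given SPI clock. */
static uint64_t max_payload_bps(uint64_t spi_hz, unsigned int payload_bytes)
{
	return spi_hz * payload_bytes /
	       (payload_bytes + CHUNK_OVERHEAD_BYTES);
}
```

Under these assumptions, min_spi_hz(10000000, 64) evaluates to 10625000,
while max_payload_bps(5900000, 64) evaluates to 5552941. In other words,
a 15 MHz clock leaves some headroom for a saturated 10 Mbit/s link, but
an effective clock of about 5.9 MHz cannot drain the receive buffers
fast enough even in the best case.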

Of course, I agree that the driver should not crash in case of a receive
buffer overflow. From your investigation, I understand that the
buffers in the MAC-PHY are being overwritten again and again
because the host is too slow to read the data from the MAC-PHY buffers
through SPI, which alters the descriptors. There might be two reasons why
we run into this situation.
1. The host is busy doing something else and is late in initiating the
    SPI transfer even though the SPI clock speed is 15 MHz.
2. The SPI clock speed is less than 15 MHz.

I use the iperf3 setup below for my testing and have never seen the driver
crash, even though I did see the "receive buffer overflow" error when
running the RPI 4 in the default "ondemand" mode.

Node 0 - Raspberry Pi 4 with LAN8650 MAC-PHY
  $ iperf3 -s
Node 1 - Raspberry Pi 4 with EVB-LAN8670-USB USB Stick
  $ iperf3 -c 192.168.5.100 -u -b 10M -i 1 -t 0

and vice versa.

I have never seen the "receive buffer overflow" error when running the RPI 4
with "performance" mode enabled, even when all the cores are stressed
using the command below,

$ yes >/dev/null & yes >/dev/null & yes >/dev/null & yes >/dev/null &

Can you share more details about your testing setup and applications you
use, so that I will try to reproduce the issue in my setup to debug the
driver?
> 
> - Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0)
>    and also exst is not active. I found out that there is an errata (DS80001075) point s3
>    that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above.
>    The driver now works since about 2.5 weeks with various load with just one loss of frame
>    error where I had to reboot the system after about 4 days.
It is good to hear that the driver works fine with the above changes. As
mentioned in the errata, this continuous interrupt issue is a known
issue with LAN8651 Rev.B0. Switching to LAN8651 Rev.B1 will solve this
issue without needing any workaround. Setting the ZARFE bit in CONFIG0
will solve the continuous interrupt issue, but I don't know how it also
solves the above "receive buffer overflow" issue. I think it is a good idea
to test once with LAN8651 Rev.B1 without setting the ZARFE bit. It would be
interesting to see the result. I always use LAN8651 Rev.B1 for my
testing.

I need to be able to reproduce the "receive buffer overflow" issue and
the resulting kernel crash in my setup with LAN8651 Rev.B1 so that I can
investigate the issue further. As I am not able to reproduce it on my RPI
4, I need your help with the tests and applications you used in your
setup.
> 
> Is there a reason why you removed the netdev watchdog which was active in v2?
When the timeout occurs, no further action is taken except incrementing
tx_errors. We don't see a use for this except with devices such as
USB-to-Ethernet adapters, which can be removed unexpectedly. This is an
SPI interface, which will not be removed unexpectedly as it is a platform
device. That's why we removed it.

Best regards,
Parthiban V
> 
> Thanks,
> Benjamin Bigler
>

Andrew Lunn March 25, 2024, 2:01 p.m. UTC | #12

> It looks like the IRQ is asserted on RESET completion and expects a data
> chunk from host to deassert the IRQ. I used to test the driver in RPI 4
> using iperf3. For some reason I never faced this issue, may be when the
> network device is being registered there might be some packet 
> transmission which leads to deliver a data chunk so that the IRQ is
> deasserted.

If you have IPv6 enabled, the network stack will try to add a link
local IPv6 address to the interface, which means performing a
Duplicate Address Detection. That means sending a few packets.

Try disabling IPv6 if you want to reproduce the problem.

	  Andrew

Parthiban Veerasooran March 26, 2024, 9:43 a.m. UTC | #13

Hi Andrew,

On 25/03/24 7:31 pm, Andrew Lunn wrote:
>> It looks like the IRQ is asserted on RESET completion and expects a data
>> chunk from host to deassert the IRQ. I used to test the driver in RPI 4
>> using iperf3. For some reason I never faced this issue, may be when the
>> network device is being registered there might be some packet
>> transmission which leads to deliver a data chunk so that the IRQ is
>> deasserted.
> 
> If you have IPv6 enabled, the network stack will try to add a link
> local IPv6 address to the interface, which means performing a
> Duplicate Address Detection. That means sending a few packets.
> 
> Try disabling IPv6 if you want to reproduce the problem.
Yes, I saw that IPv6 is enabled by default on my RPI 4. So, as you
mentioned, I tried disabling it, but within approximately 1 second
there is still a packet coming from the network stack which deasserts the
IRQ (refer to the attached screenshot). So, just for testing, I
disabled waking up the SPI thread from the oa_tc6_start_xmit() function
and noted in the logic analyzer that the IRQ then stays asserted
throughout the period.

As I replied in the previous email, using the code below deasserts the
IRQ when oa_tc6_init() finishes, without waiting for any data
chunk from oa_tc6_start_xmit(). So in my opinion let's stick with this
solution.

/* oa_tc6_sw_reset_macphy() resets the MAC-PHY and clears the reset
 * complete status. The IRQ is also asserted on reset completion and
 * remains asserted until the MAC-PHY receives a data chunk. So performing
 * an empty data chunk transmission will deassert the IRQ. Refer to
 * sections 7.7 and 9.2.8.8 in the OPEN Alliance specification for more
 * details.
 */
tc6->int_flag = true;
wake_up_interruptible(&tc6->spi_wq);

Best regards,
Parthiban V
> 
>            Andrew

Benjamin Bigler April 3, 2024, 9:40 p.m. UTC | #14

Hi Parthiban,

Sorry for the late answer, I was quite busy the last few days.

On Mon, 2024-03-25 at 13:24 +0000, Parthiban.Veerasooran@microchip.com wrote:
> Hi Benjamin Bigler,
> 
> Thank you for your testing and feedback. It would be really helpful to 
> bring the driver to a good shape. We really appreciate your efforts on this.
> 
> On 24/03/24 5:25 pm, Benjamin Bigler wrote:
> > Hi Parthiban
> > 
> > I hope I send this in the right context as it is not related to just one patch or
> > some specific code.
> > 
> > I conducted UDP load testing using three i.MX8MM boards in conjunction with the
> > LAN8651. The setup involved one board functioning as a server, which is just
> > echoing back received data, while the remaining two boards acted as clients,
> > sending UDP packets of different sizes in various bursts to the server.
> > Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might
> > have influenced the results.
> > 
> > During the tests I experienced some issues:
> > 
> > - The boards just start receiving after first sending something (ping another board).
> >    Some measurements showed that the irq stays asserted after init. This makes sense
> >    as far as I understand the chapter 7.7 of the specification, the irq is deasserted
> >    on reception of the first data header following CSn being asserted. As a workaround
> >    I trigger the thread at the end of oa_tc6_init.
> It looks like the IRQ is asserted on RESET completion and expects a data
> chunk from host to deassert the IRQ. I used to test the driver in RPI 4
> using iperf3. For some reason I never faced this issue, may be when the
> network device is being registered there might be some packet 
> transmission which leads to deliver a data chunk so that the IRQ is
> deasserted. Thanks for the workaround. I think that would be the 
> solution to solve this issue. Adding the below lines in the end of the 
> function oa_tc6_init() will trigger the oa_tc6_spi_thread_handler() to 
> perform an empty data chunk transfer which will deassert the IRQ before 
> starting the actual data transfer.

I have IPv6 disabled and use static IPv4 addresses. That could be the reason why
no packet is sent on my side.

> 
> /* oa_tc6_sw_reset_macphy() function resets and clears the MAC-PHY reset
>   * complete status. IRQ is also asserted on reset completion and it is
>   * remain asserted until MAC-PHY receives a data chunk. So performing an
>   * empty data chunk transmission will deassert the IRQ. Refer section
>   * 7.7 and 9.2.8.8 in the OPEN Alliance specification for more details.
>   */
> tc6->int_flag = true;
> wake_up_interruptible(&tc6->spi_wq);

Perfect, that's the same thing I added, and it also works on my side.

> > 
> > - If there is a lot of traffic, the receive buffer overflow error spams the log.
> > 
> > - If there is a lot of traffic, I got various kernel panics in oa_tc6_update_rx_skb.
> >    Mostly because more data to rx_skb is added than allocated and sometimes because
> >    rx_skb is null in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging
> >    with a logic analyzer showed that the chip is not behave correctly. There is more
> >    bytes between start_valid and end_valid than there should be. Also there
> >    seems to be 2 end_valid without a start_valid between. What is common is that the incorrect
> >    frame starts in a chunk where end_valid and start_valid is set.
> >    In my opinion its a problem in the chip (maybe related to the errata in the next point)
> >    but the driver should be resilent and just drop the packet and not cause a kernel panic.
> Usually I run into this issue "receive buffer overflow" when I run RPI 4
> with default cpu governor setting which is "ondemand". In this case, 
> even though if I set SPI clock speed as 15 MHz the RPI 4 core clock is
> clocking down when it is idle which leads delivering half of the
> configured SPI clock speed around 5.9 MHz. So the systems like RPI 4 
> need performance mode enabled to get the proper clock speed for SPI. 
> Refer below link for more details.
> 
> https://github.com/raspberrypi/linux/issues/3381#issuecomment-1144723750
> 
> I used to enable performance mode using the below command.
> 
> echo performance | sudo tee 
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > /dev/null
> 
> So please ensure the SPI clock speed using a logic analyzer to get the
> maximum throughput without receive buffer overflow.
> 
> Of course, I agree that the driver should not crash in case of receive
> buffer overflow. By referring your investigations, I understand that the
> buffers in the MAC-PHY is being continuously overwritten again and again
> as the host is very slow to read the data from the MAC-PHY buffers
> through SPI which alters the descriptors. There might be two reasons why
> we run into this situation.
> 1. The host is busy doing something else and delays to initiate SPI even
>     though SPI clock speed is 15 MHz.
> 2. The SPI clock speed is less than 15 MHz.

Sorry, there is a misunderstanding between us. The receive buffer overflow is not
causing any harm except filling the log. In my setup I get about 35000 entries
in one day. I am not sure if it's appropriate to log these errors.

The SPI frequency is 14.8 MHz. If I have just 2 boards connected, I am not able
to reproduce this. Only with 3 boards, when 2 boards send multiple big Ethernet
frames (1512 bytes per frame) to one, do I get these log entries.
The latency seems to be quite low: from IRQ to the start of reading the first frame
it always takes less than 500 us. Also, the boards are running only the UDP test.
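
One way to keep such an error from flooding the log is to rate-limit it. In the kernel, helpers such as dev_err_ratelimited() and net_ratelimit() exist for this; whether the driver should use them here is a separate question. The sketch below is only a userspace model of the interval/burst idea behind such limiters, with hypothetical names and parameters, to show the effect on a burst of identical messages.

```c
#include <assert.h>

/* Hypothetical userspace model of an interval/burst log rate limiter:
 * at most 'burst' messages are let through per 'interval_ms' window and
 * the rest are only counted.
 */
struct log_ratelimit {
	unsigned long interval_ms;     /* length of one window */
	unsigned int burst;            /* messages allowed per window */
	unsigned long window_start_ms; /* start time of the current window */
	unsigned int printed;          /* messages let through in this window */
	unsigned int suppressed;       /* messages dropped so far */
};

/* Returns 1 if the caller may log now, 0 if the message is suppressed. */
static int log_ratelimit_allow(struct log_ratelimit *rl, unsigned long now_ms)
{
	assert(rl != 0);

	/* Open a new window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1;
	}

	rl->suppressed++;
	return 0;
}
```

With an assumed window of 5 seconds and a burst of 10, a run of 100 overflow events arriving within 100 ms would produce 10 log lines instead of 100, and logging resumes in the next window.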

> 
> I use the below iperf3 setup for my testing and never faced the driver
> crash issue even though faced "receive buffer overflow" error when I run
> RPI 4 with "ondemand" default mode.
> 
> Node 0 - Raspberry Pi 4 with LAN8650 MAC-PHY
>   $ iperf3 -s
> Node 1 - Raspberry Pi 4 with EVB-LAN8670-USB USB Stick
>   $ iperf3 -c 192.168.5.100 -u -b 10M -i 1 -t 0
> 
> and vice versa.
> 
> I never faced "receive buffer overflow" error when I run RPI 4 with
> "performance" mode enabled and even though all the cores are stressed
> using the below command,
> 
> $ yes >/dev/null & yes >/dev/null & yes >/dev/null & yes >/dev/null &
> 
> Can you share more details about your testing setup and applications you
> use, so that I will try to reproduce the issue in my setup to debug the
> driver?

I use an internal tool which does some stress tests using UDP. Unfortunately,
I am not allowed to publish it, but a colleague is working on a Rust implementation
which we can publish; it's not fully ready yet.
On one board the tool runs in server mode. It just echoes back the received
data. On the 2 other boards the tool runs in client mode. It sends UDP packets
of various sizes in different bursts and then checks that it receives the same
data in the same order.


The crashes only happen when ZARFE is not set (with Rev B0). When the crash
happens, I see on the logic analyzer that there are more bytes than MTU + headers
between the frame where start_valid is set and the frame where end_valid is set.
Then this happens:

[  437.155673] skbuff: skb_over_panic: text:ffff80007a8c2bd8 len:1600 put:64 head:ffff00000de28080 data:ffff00000de280c0 tail:0x680 end:0x640 dev:eth1
[  437.168987] kernel BUG at net/core/skbuff.c:192!
[  437.173612] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[  437.180407] Modules linked in: ppp_async crc_ccitt ppp_generic slhc lan865x oa_tc6 bec_infoo(O) tpm_tis_spi tpm_tis_core spi_imx imx_sdma
[  437.196016] CPU: 1 PID: 455 Comm: oa-tc6-spi-thre Tainted: G           O       6.6.11-gce336e2c2bc3-dirty #1
[  437.205853] Hardware name: Toradex Verdin iMX8M Mini on FUMU (DT)
[  437.212820] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  437.219790] pc : skb_panic+0x58/0x5c
[  437.223376] lr : skb_panic+0x58/0x5c
[  437.226959] sp : ffff80008362bd90
[  437.230278] x29: ffff80008362bda0 x28: 0000000000000000 x27: ffff000001066878
[  437.237426] x26: 000000000000001e x25: 00000000000007f8 x24: ffff0000010cea80
[  437.244571] x23: 00000000f0f0f0f1 x22: 000000000000001f x21: 0000000000000000
[  437.251720] x20: ffff0000010ceaa8 x19: 000000003f20003f x18: ffffffffffffffff
[  437.258867] x17: ffff7ffffded9000 x16: ffff800080008000 x15: 073a0764076e0765
[  437.266015] x14: 0720073007380736 x13: ffff8000823d1f58 x12: 0000000000000534
[  437.273162] x11: 00000000000001bc x10: ffff800082429f58 x9 : ffff8000823d1f58
[  437.280310] x8 : 00000000ffffefff x7 : ffff800082429f58 x6 : 0000000000000000
[  437.287455] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[  437.294606] x2 : 0000000000000000 x1 : ffff000001223b00 x0 : 0000000000000087
[  437.301753] Call trace:
[  437.304203]  skb_panic+0x58/0x5c
[  437.307436]  skb_find_text+0x0/0xf0
[  437.310933]  oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6]
[  437.316523]  kthread+0x118/0x11c
[  437.319758]  ret_from_fork+0x10/0x20
[  437.323343] Code: f90007e9 b940b908 f90003e8 97ca3c34 (d4210000)
[  437.329446] ---[ end trace 0000000000000000 ]---


Sometimes there are 2 end_valid flags one after the other without a start_valid in between.
Then this happens:

[  469.737297] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000074
[  469.746137] Mem abort info:
[  469.748950]   ESR = 0x0000000096000004
[  469.752709]   EC = 0x25: DABT (current EL), IL = 32 bits
[  469.758036]   SET = 0, FnV = 0
[  469.761098]   EA = 0, S1PTW = 0
[  469.764252]   FSC = 0x04: level 0 translation fault
[  469.769144] Data abort info:
[  469.772033]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  469.777529]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  469.782594]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  469.787921] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043c32000
[  469.794377] [0000000000000074] pgd=0000000000000000, p4d=0000000000000000
[  469.801184] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  469.807459] Modules linked in: ppp_async crc_ccitt ppp_generic slhc lan865x oa_tc6 bec_infoo(O) tpm_tis_spi tpm_tis_core spi_imx imx_sdma
[  469.823064] CPU: 2 PID: 456 Comm: oa-tc6-spi-thre Tainted: G           O       6.6.11-g350ed394a6ca-dirty #1
[  469.832903] Hardware name: Toradex Verdin iMX8M Mini on FUMU (DT)
[  469.839871] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  469.846841] pc : skb_put+0xc/0x6c
[  469.850169] lr : oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6]
[  469.856106] sp : ffff80008376bdb0
[  469.859424] x29: ffff80008376bdb0 x28: 0000000000000000 x27: ffff00000194c080
[  469.866573] x26: 0000000000000000 x25: 0000000000000000 x24: ffff000001095c80
[  469.873720] x23: 00000000f0f0f0f1 x22: 000000000000001f x21: 0000000000000000
[  469.880870] x20: ffff000001095ca8 x19: 000000003f20003f x18: 0000000000000000
[  469.888023] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  469.895174] x14: 0000031acf8b86d8 x13: 0000000000000000 x12: 0000000000000000
[  469.902321] x11: 0000000000000002 x10: 0000000000000a60 x9 : ffff80008376b970
[  469.909467] x8 : ffff00007fb6e580 x7 : 000000000194b080 x6 : 0000000000000000
[  469.916616] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 000000000000fc80
[  469.923765] x2 : 0000000000000001 x1 : 0000000000000040 x0 : 0000000000000000
[  469.930915] Call trace:
[  469.933365]  skb_put+0xc/0x6c
[  469.936342]  oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6]
[  469.941929]  kthread+0x118/0x11c
[  469.945166]  ret_from_fork+0x10/0x20
[  469.948752] Code: d65f03c0 d503233f a9bf7bfd 910003fd (b9407406)
[  469.954854] ---[ end trace 0000000000000000 ]---
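
Reading the skb_over_panic line of the first trace: data sits 0x40 (64) bytes after head, end is at offset 0x640 (1600), and after the failing 64-byte put the length is 1600 with tail at 0x680 (1664). That suggests room for 1536 bytes, i.e. 24 chunks of 64 bytes, and that a 25th chunk arrived before any end_valid. The second trace is a put on a NULL skb. A minimal userspace sketch of the kind of guard that would drop such frames instead of crashing is shown below. It is not the driver's code; the structure, function names and the 1536-byte capacity are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RX_BUF_SIZE 1536u /* hypothetical frame buffer capacity */

/* Hypothetical model of receive frame assembly from 64-byte chunks. */
struct rx_assembler {
	unsigned char buf[RX_BUF_SIZE];
	size_t len;            /* bytes collected for the current frame */
	int in_frame;          /* set between start_valid and end_valid */
	unsigned long dropped; /* frames discarded as malformed */
};

/* Handle a chunk with start_valid set. */
static void rx_frame_start(struct rx_assembler *rx)
{
	if (rx->in_frame)
		rx->dropped++; /* previous frame never saw end_valid */
	rx->len = 0;
	rx->in_frame = 1;
}

/* Append chunk payload. Returns 0 on success, -1 if the data was dropped. */
static int rx_frame_append(struct rx_assembler *rx, const void *data, size_t n)
{
	assert(data != NULL);

	if (!rx->in_frame)
		return -1; /* payload without a preceding start_valid */

	if (n > RX_BUF_SIZE - rx->len) {
		/* More data than the buffer holds: drop the whole frame. */
		rx->in_frame = 0;
		rx->dropped++;
		return -1;
	}

	memcpy(rx->buf + rx->len, data, n);
	rx->len += n;
	return 0;
}

/* Handle end_valid. Returns the frame length, or -1 if no frame is open. */
static long rx_frame_end(struct rx_assembler *rx)
{
	if (!rx->in_frame)
		return -1; /* e.g. a second end_valid without start_valid */
	rx->in_frame = 0;
	return (long)rx->len;
}
```

In this model an oversized frame is counted and discarded at the chunk that would overflow the buffer, and a stray end_valid is ignored, so neither of the two situations above can write past the buffer or touch a frame that does not exist.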


If you are interested, I can try to get a recording with the logic analyzer and send it to you.

By the way, in the other answer you attached a screenshot of the logic analyzer, and you
have a very nice HLA for oa_tc6. Is it open source, or are there any plans to publish it?

> > 
> > - Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0)
> >    and also exst is not active. I found out that there is an errata (DS80001075) point s3
> >    that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above.
> >    The driver now works since about 2.5 weeks with various load with just one loss of frame
> >    error where I had to reboot the system after about 4 days.
> It is good to hear that the driver works fine with the above changes. As 
> mentioned in the errata, this continuous interrupt issue is a known
> issue with LAN8651 Rev.B0. Switching to LAN8651 Rev.B1 will solve this
> issue and no need of any workaround. Setting ZARFE bit in the CONFIG0
> will solve the continuous interrupt issue but don't know how the above
> "receive buffer overflow" issue also solved. I think it is a good idea 
> to test with LAN8651 Rev.B1 without setting ZARFE bit once. It would be 
> interesting to see the result. I am always using LAN8651 Rev.B1 for my 
> testing.

Unfortunately I only have LAN8651 Rev. B0 chips. Are you sure that Rev. B1 has the
issue fixed? The errata here says that B1 is affected too:
https://ww1.microchip.com/downloads/aemDocuments/documents/AIS/ProductDocuments/Errata/LAN8650-1-Errata-80001075.pdf

> 
> I should be able to reproduce the "receive buffer overflow" issue and 
> consequently kernel crash in my setup with LAN8651 Rev.B1 so that I can 
> investigate the issue further. As I am not able to reproduce in my RPI 
> 4, I need your support for the tests and applications you used in your 
> setup.
> > 
> > Is there a reason why you removed the netdev watchdog which was active in v2?
> When the timeout occurs, there is no further action except increasing
> tx_errors. Not seeing this except USB-to-Ethernet which can be removed
> unexpectedly. But this is SPI interface which will not be removed
> unexpectedly as it is a platform device. That's why we removed this.
> 
> Best regards,
> Parthiban V
> > 
> > Thanks,
> > Benjamin Bigler
> > 
> 

Thanks,
Benjamin Bigler

Parthiban Veerasooran April 8, 2024, 1:41 p.m. UTC | #15

Hi Benjamin,

On 04/04/24 3:10 am, Benjamin Bigler wrote:
> Hi Parthiban,
> 
> Sorry for the late answer, I was quite busy the last few days.
No problem.
> 
> On Mon, 2024-03-25 at 13:24 +0000, Parthiban.Veerasooran@microchip.com wrote:
>> Hi Benjamin Bigler,
>>
>> Thank you for your testing and feedback. It would be really helpful to
>> bring the driver to a good shape. We really appreciate your efforts on this.
>>
>> On 24/03/24 5:25 pm, Benjamin Bigler wrote:
>>> Hi Parthiban
>>>
>>> I hope I send this in the right context as it is not related to just one patch or
>>> some specific code.
>>>
>>> I conducted UDP load testing using three i.MX8MM boards in conjunction with the
>>> LAN8651. The setup involved one board functioning as a server, which is just
>>> echoing back received data, while the remaining two boards acted as clients,
>>> sending UDP packets of different sizes in various bursts to the server.
>>> Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might
>>> have influenced the results.
>>>
>>> During the tests I experienced some issues:
>>>
>>> - The boards just start receiving after first sending something (ping another board).
>>>     Some measurements showed that the irq stays asserted after init. This makes sense
>>>     as far as I understand the chapter 7.7 of the specification, the irq is deasserted
>>>     on reception of the first data header following CSn being asserted. As a workaround
>>>     I trigger the thread at the end of oa_tc6_init.
>> It looks like the IRQ is asserted on RESET completion and expects a data
>> chunk from host to deassert the IRQ. I used to test the driver in RPI 4
>> using iperf3. For some reason I never faced this issue, may be when the
>> network device is being registered there might be some packet
>> transmission which leads to deliver a data chunk so that the IRQ is
>> deasserted. Thanks for the workaround. I think that would be the
>> solution to solve this issue. Adding the below lines in the end of the
>> function oa_tc6_init() will trigger the oa_tc6_spi_thread_handler() to
>> perform an empty data chunk transfer which will deassert the IRQ before
>> starting the actual data transfer.
> 
> I have ipv6 disabled and use static ipv4 addresses. That could be the reason why on
> my side no packet is sent.
> 
>>
>> /* oa_tc6_sw_reset_macphy() function resets and clears the MAC-PHY reset
>>    * complete status. IRQ is also asserted on reset completion and it is
>>    * remain asserted until MAC-PHY receives a data chunk. So performing an
>>    * empty data chunk transmission will deassert the IRQ. Refer section
>>    * 7.7 and 9.2.8.8 in the OPEN Alliance specification for more details.
>>    */
>> tc6->int_flag = true;
>> wake_up_interruptible(&tc6->spi_wq);
> 
> Perfect, thats the same I added and also works on my side.
> 
>>>
>>> - If there is a lot of traffic, the receive buffer overflow error spams the log.
>>>
>>> - If there is a lot of traffic, I got various kernel panics in oa_tc6_update_rx_skb.
>>>     Mostly because more data to rx_skb is added than allocated and sometimes because
>>>     rx_skb is null in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging
>>>     with a logic analyzer showed that the chip is not behave correctly. There is more
>>>     bytes between start_valid and end_valid than there should be. Also there
>>>     seems to be 2 end_valid without a start_valid between. What is common is that the incorrect
>>>     frame starts in a chunk where end_valid and start_valid is set.
>>>     In my opinion its a problem in the chip (maybe related to the errata in the next point)
>>>     but the driver should be resilent and just drop the packet and not cause a kernel panic.
>> Usually I run into this issue "receive buffer overflow" when I run RPI 4
>> with default cpu governor setting which is "ondemand". In this case,
>> even though if I set SPI clock speed as 15 MHz the RPI 4 core clock is
>> clocking down when it is idle which leads delivering half of the
>> configured SPI clock speed around 5.9 MHz. So the systems like RPI 4
>> need performance mode enabled to get the proper clock speed for SPI.
>> Refer below link for more details.
>>
>> https://github.com/raspberrypi/linux/issues/3381#issuecomment-1144723750
>>
>> I used to enable performance mode using the below command.
>>
>> echo performance | sudo tee
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > /dev/null
>>
>> So please ensure the SPI clock speed using a logic analyzer to get the
>> maximum throughput without receive buffer overflow.
>>
>> Of course, I agree that the driver should not crash in case of receive
>> buffer overflow. By referring your investigations, I understand that the
>> buffers in the MAC-PHY is being continuously overwritten again and again
>> as the host is very slow to read the data from the MAC-PHY buffers
>> through SPI which alters the descriptors. There might be two reasons why
>> we run into this situation.
>> 1. The host is busy doing something else and delays to initiate SPI even
>>      though SPI clock speed is 15 MHz.
>> 2. The SPI clock speed is less than 15 MHz.
> 
> Sorry there is a missunderstanding between us. The receive buffer overflow is not
> causing any harm except filling the log. In my setup I get in one day about 35000
> entries. I am not sure if its appropriate to log these errors.
> 
> The SPI Frequency is at 14.8 MHz. If I just have 2 boards connected, I am not able
> to reproduce this. Only with 3 boards when 2 boards sends multiple big ethernet
> frames (1512 byte per Frame) to one, I get these log entries.
> The latency seems to be quite low, from IRQ to start reading first frame it takes
> always less than 500us. Also the boards are just running the udp test.
> 
>>
>> I use the below iperf3 setup for my testing and never faced the driver
>> crash issue even though faced "receive buffer overflow" error when I run
>> RPI 4 with "ondemand" default mode.
>>
>> Node 0 - Raspberry Pi 4 with LAN8650 MAC-PHY
>>    $ iperf3 -s
>> Node 1 - Raspberry Pi 4 with EVB-LAN8670-USB USB Stick
>>    $ iperf3 -c 192.168.5.100 -u -b 10M -i 1 -t 0
>>
>> and vice versa.
>>
>> I never faced "receive buffer overflow" error when I run RPI 4 with
>> "performance" mode enabled and even though all the cores are stressed
>> using the below command,
>>
>> $ yes >/dev/null & yes >/dev/null & yes >/dev/null & yes >/dev/null &
>>
>> Can you share more details about your testing setup and applications you
>> use, so that I will try to reproduce the issue in my setup to debug the
>> driver?
> 
> I use a internal tool which does some stress tests using udp. Unfortunately,
> I am not allowed to publish it, but a colleague works on a rust implementation,
> which we can publish, but its not fully ready yet.
> On one board the tool is running in server mode. It just echoes back the received
> data. On the 2 other boards the tool is running in client mode. It sends various
> sized udp-packets in different bursts and then checks if it receives the same
> data in the same order.
> 
> 
> The crashes only happens when ZARFE is not set (with Rev B0). When the crash
> happens, I see on the logic analyzer that there are more bytes than mtu + headers
> between the frame where start_valid is set and the frame where end_valid is set.
> Then this happens:
Thanks for all the above details. I will include this ZARFE fix in the
next version, v4, which I am going to post soon.
> 
> [  437.155673] skbuff: skb_over_panic: text:ffff80007a8c2bd8 len:1600 put:64 head:ffff00000de28080
> data:ffff00000de280c0 tail:0x680 end:0x640 dev:eth1
> [  437.168987] kernel BUG at net/core/skbuff.c:192!
> [  437.173612] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [  437.180407] Modules linked in: ppp_async crc_ccitt ppp_generic slhc lan865x oa_tc6 bec_infoo(O)
> tpm_tis_spi tpm_tis_core spi_imx imx_sdma
> [  437.196016] CPU: 1 PID: 455 Comm: oa-tc6-spi-thre Tainted: G           O       6.6.11-
> gce336e2c2bc3-dirty #1
> [  437.205853] Hardware name: Toradex Verdin iMX8M Mini on FUMU (DT)
> [  437.212820] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  437.219790] pc : skb_panic+0x58/0x5c
> [  437.223376] lr : skb_panic+0x58/0x5c
> [  437.226959] sp : ffff80008362bd90
> [  437.230278] x29: ffff80008362bda0 x28: 0000000000000000 x27: ffff000001066878
> [  437.237426] x26: 000000000000001e x25: 00000000000007f8 x24: ffff0000010cea80
> [  437.244571] x23: 00000000f0f0f0f1 x22: 000000000000001f x21: 0000000000000000
> [  437.251720] x20: ffff0000010ceaa8 x19: 000000003f20003f x18: ffffffffffffffff
> [  437.258867] x17: ffff7ffffded9000 x16: ffff800080008000 x15: 073a0764076e0765
> [  437.266015] x14: 0720073007380736 x13: ffff8000823d1f58 x12: 0000000000000534
> [  437.273162] x11: 00000000000001bc x10: ffff800082429f58 x9 : ffff8000823d1f58
> [  437.280310] x8 : 00000000ffffefff x7 : ffff800082429f58 x6 : 0000000000000000
> [  437.287455] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
> [  437.294606] x2 : 0000000000000000 x1 : ffff000001223b00 x0 : 0000000000000087
> [  437.301753] Call trace:
> [  437.304203]  skb_panic+0x58/0x5c
> [  437.307436]  skb_find_text+0x0/0xf0
> [  437.310933]  oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6]
> [  437.316523]  kthread+0x118/0x11c
> [  437.319758]  ret_from_fork+0x10/0x20
> [  437.323343] Code: f90007e9 b940b908 f90003e8 97ca3c34 (d4210000)
> [  437.329446] ---[ end trace 0000000000000000 ]---
> 
> 
> Sometimes there are 2 end_valid after eachother without a start_valid between.
> Then this happens:
> 
> [  469.737297] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000074
> [  469.746137] Mem abort info:
> [  469.748950]   ESR = 0x0000000096000004
> [  469.752709]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  469.758036]   SET = 0, FnV = 0
> [  469.761098]   EA = 0, S1PTW = 0
> [  469.764252]   FSC = 0x04: level 0 translation fault
> [  469.769144] Data abort info:
> [  469.772033]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> [  469.777529]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [  469.782594]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [  469.787921] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043c32000
> [  469.794377] [0000000000000074] pgd=0000000000000000, p4d=0000000000000000
> [  469.801184] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> [  469.807459] Modules linked in: ppp_async crc_ccitt ppp_generic slhc lan865x oa_tc6 bec_infoo(O)
> tpm_tis_spi tpm_tis_core spi_imx imx_sdma
> [  469.823064] CPU: 2 PID: 456 Comm: oa-tc6-spi-thre Tainted: G           O       6.6.11-
> g350ed394a6ca-dirty #1
> [  469.832903] Hardware name: Toradex Verdin iMX8M Mini on FUMU (DT)
> [  469.839871] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  469.846841] pc : skb_put+0xc/0x6c
> [  469.850169] lr : oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6]
> [  469.856106] sp : ffff80008376bdb0
> [  469.859424] x29: ffff80008376bdb0 x28: 0000000000000000 x27: ffff00000194c080
> [  469.866573] x26: 0000000000000000 x25: 0000000000000000 x24: ffff000001095c80
> [  469.873720] x23: 00000000f0f0f0f1 x22: 000000000000001f x21: 0000000000000000
> [  469.880870] x20: ffff000001095ca8 x19: 000000003f20003f x18: 0000000000000000
> [  469.888023] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> [  469.895174] x14: 0000031acf8b86d8 x13: 0000000000000000 x12: 0000000000000000
> [  469.902321] x11: 0000000000000002 x10: 0000000000000a60 x9 : ffff80008376b970
> [  469.909467] x8 : ffff00007fb6e580 x7 : 000000000194b080 x6 : 0000000000000000
> [  469.916616] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 000000000000fc80
> [  469.923765] x2 : 0000000000000001 x1 : 0000000000000040 x0 : 0000000000000000
> [  469.930915] Call trace:
> [  469.933365]  skb_put+0xc/0x6c
> [  469.936342]  oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6]
> [  469.941929]  kthread+0x118/0x11c
> [  469.945166]  ret_from_fork+0x10/0x20
> [  469.948752] Code: d65f03c0 d503233f a9bf7bfd 910003fd (b9407406)
> [  469.954854] ---[ end trace 0000000000000000 ]---
> 
> 
> If interested I can try to get a recording with the logic analyzer and send it to you.
I don't think it is needed.
> 
> By the way in the other answer you attached a screenshot of the logic analyzer and you
> have a very nice HLA for oa_tc6. Are they open-source or are there any plans to publish them?
It is already publicly available on Microchip's GitHub page.
See the link below.

https://github.com/MicrochipTech/oa-tc6-saleae-extension
> 
>>>
>>> - Sometimes the chip stops working. It keeps asserting the IRQ, but there is no data (rca=0)
>>>     and exst is not active either. I found out that there is an erratum (DS80001075, point s3)
>>>     that explains this, so I set the ZARFE bit in CONFIG0. This also fixes the point above.
>>>     The driver has now been working for about 2.5 weeks under various loads, with just one
>>>     loss-of-frame error, after about 4 days, where I had to reboot the system.
>> It is good to hear that the driver works fine with the above changes. As
>> mentioned in the errata, this continuous interrupt issue is a known
>> issue with LAN8651 Rev.B0. Switching to LAN8651 Rev.B1 will solve this
>> issue without any workaround. Setting the ZARFE bit in CONFIG0
>> will solve the continuous interrupt issue, but I don't know how it also
>> solved the above "receive buffer overflow" issue. I think it is a good
>> idea to test once with LAN8651 Rev.B1 without setting the ZARFE bit. It
>> would be interesting to see the result. I always use LAN8651 Rev.B1 for
>> my testing.
> 
> Unfortunately I only have LAN8651 Rev.B0 chips. Are you sure that Rev.B1 has the
> issue fixed? The errata here says that B1 is affected too:
> https://ww1.microchip.com/downloads/aemDocuments/documents/AIS/ProductDocuments/Errata/LAN8650-1-Errata-80001075.pdf
To my knowledge it is fixed in Rev.B1 but, as you said, the errata
says the issue persists in both revisions. Let me check internally and
get back to you on this. In any case, it is recommended to use Rev.B1
rather than Rev.B0. If possible, I would suggest using the latest one.

Best regards,
Parthiban V
> 
>>
>> I should be able to reproduce the "receive buffer overflow" issue and
>> the resulting kernel crash in my setup with LAN8651 Rev.B1 so that I can
>> investigate the issue further. As I am not able to reproduce it on my
>> RPI 4, I need your help with the tests and applications you used in
>> your setup.
>>>
>>> Is there a reason why you removed the netdev watchdog which was active in v2?
>> When the timeout occurs, no further action is taken except incrementing
>> tx_errors. We do not see this except with USB-to-Ethernet devices, which
>> can be removed unexpectedly. But this is an SPI interface, which will not
>> be removed unexpectedly as it is a platform device. That's why we removed it.
>>
>> Best regards,
>> Parthiban V
>>>
>>> Thanks,
>>> Benjamin Bigler
>>>
>>
> 
> Thanks,
> Benjamin Bigler
>
