Message ID | 20231023154649.45931-1-Parthiban.Veerasooran@microchip.com (mailing list archive) |
---|---|
Headers | show |
Series | Add support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface | expand |
On Mon, Oct 23, 2023 at 09:16:40PM +0530, Parthiban Veerasooran wrote: > This patch series contain the below updates, > - Adds support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface in the > net/ethernet/oa_tc6.c. > - Adds driver support for Microchip LAN8650/1 Rev.B0 10BASE-T1S MACPHY > Ethernet driver in the net/ethernet/microchip/lan865x.c. Hi Parthiban Omsemi also have a TC6 device, which should use this framework. They would like to make progress getting their device supported in mainline. What is happening with this patchset? Its been a few months since you posted this. Will there be a new version soon? Has Microchip stopped development? Postponed it because of other priorities etc? Thanks Andrew
On 17/02/24 3:43 am, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > > On Mon, Oct 23, 2023 at 09:16:40PM +0530, Parthiban Veerasooran wrote: >> This patch series contain the below updates, >> - Adds support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface in the >> net/ethernet/oa_tc6.c. >> - Adds driver support for Microchip LAN8650/1 Rev.B0 10BASE-T1S MACPHY >> Ethernet driver in the net/ethernet/microchip/lan865x.c. > > Hi Parthiban > > Omsemi also have a TC6 device, which should use this framework. They > would like to make progress getting their device supported in > mainline. > > What is happening with this patchset? Its been a few months since you > posted this. Will there be a new version soon? Has Microchip stopped > development? Postponed it because of other priorities etc? Hi Andrew, From Microchip side, we haven't stopped/postponed this framework development. We are already working on it. It is in the final stage now. We are doing internal reviews right now and we expect that in 3 weeks time frame in the mainline again. We will send a new version (v3) of this patch series soon. Thanks for your patience. Best regards, Parthiban V > > Thanks > Andrew
> Hi Andrew, > > From Microchip side, we haven't stopped/postponed this framework > development. We are already working on it. It is in the final stage now. > We are doing internal reviews right now and we expect that in 3 weeks > time frame in the mainline again. We will send a new version (v3) of > this patch series soon. Hi Parthiban It is good to here you are still working on it. A have a few comments about how Linux mainline works. It tends to be very iterative. Cycles tend to be fast. You will probably get review comments within a couple of days of posting code. You often see developers posting a new version within a few days, maybe a week. If reviewers have asked for large changes, it can take longer, but general, the cycles are short. When you say you need three weeks for internal review, that to me seems very slow. Is it so hard to get access to internal reviewers? Do you have a very formal review process? More waterfall than iterative development? I would suggest you try to keep your internal reviews fast and low overhead, because you will be doing it lots of times as we iterate the framework. Andrew
On 19/02/24 8:00 pm, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > >> Hi Andrew, >> >> From Microchip side, we haven't stopped/postponed this framework >> development. We are already working on it. It is in the final stage now. >> We are doing internal reviews right now and we expect that in 3 weeks >> time frame in the mainline again. We will send a new version (v3) of >> this patch series soon. > > Hi Parthiban > > It is good to here you are still working on it. > > A have a few comments about how Linux mainline works. It tends to be > very iterative. Cycles tend to be fast. You will probably get review > comments within a couple of days of posting code. You often see > developers posting a new version within a few days, maybe a week. If > reviewers have asked for large changes, it can take longer, but > general, the cycles are short. > > When you say you need three weeks for internal review, that to me > seems very slow. Is it so hard to get access to internal reviewers? Do > you have a very formal review process? More waterfall than iterative > development? I would suggest you try to keep your internal reviews > fast and low overhead, because you will be doing it lots of times as > we iterate the framework. Hi Andrew, We understand your concern. We are working on this task with full focus. Initially there were some implementation change proposal from our internal reviewers to improve the performance and code quality. Consequently the testing of the new implementation took some while to bring it to a good shape. Our internal reviewers Steen Hegelund and Horatiu Vultur are actively participating in reviewing my patches. I already have talked to them and we are in progress together to get the next version ready for the submission. We are trying our level best and working hard to push the next set of patches to the mainline as soon as possible. Best regards, Parthiban V > > Andrew >
Hi all, thank you for this discussion. Since this framework is supposed to be a base for every potential OA-TC6 MACPHY implementation, I propose this could be a joint work. Selvamani and I spotted some points in the code where we could optimize the performance and we would like to add the required changes to accommodate all potential implementations. Would it be possible to share with the group the latest patches? Thanks, Piergiorgio -----Original Message----- From: Parthiban.Veerasooran@microchip.com <Parthiban.Veerasooran@microchip.com> Sent: 21 February, 2024 06:15 To: andrew@lunn.ch Cc: davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Steen.Hegelund@microchip.com; netdev@vger.kernel.org; Horatiu.Vultur@microchip.com; Woojung.Huh@microchip.com; Nicolas.Ferre@microchip.com; UNGLinuxDriver@microchip.com; Thorsten.Kummermehr@microchip.com; Piergiorgio Beruto <Pier.Beruto@onsemi.com>; Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com> Subject: Re: [PATCH net-next v2 0/9] Add support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface [External Email]: This email arrived from an external source - Please exercise caution when opening any attachments or clicking on links. On 19/02/24 8:00 pm, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you know > the content is safe > >> Hi Andrew, >> >> From Microchip side, we haven't stopped/postponed this framework >> development. We are already working on it. It is in the final stage now. >> We are doing internal reviews right now and we expect that in 3 weeks >> time frame in the mainline again. We will send a new version (v3) of >> this patch series soon. > > Hi Parthiban > > It is good to here you are still working on it. > > A have a few comments about how Linux mainline works. It tends to be > very iterative. Cycles tend to be fast. You will probably get review > comments within a couple of days of posting code. You often see > developers posting a new version within a few days, maybe a week. If > reviewers have asked for large changes, it can take longer, but > general, the cycles are short. > > When you say you need three weeks for internal review, that to me > seems very slow. Is it so hard to get access to internal reviewers? Do > you have a very formal review process? More waterfall than iterative > development? I would suggest you try to keep your internal reviews > fast and low overhead, because you will be doing it lots of times as > we iterate the framework. Hi Andrew, We understand your concern. We are working on this task with full focus. Initially there were some implementation change proposal from our internal reviewers to improve the performance and code quality. Consequently the testing of the new implementation took some while to bring it to a good shape. Our internal reviewers Steen Hegelund and Horatiu Vultur are actively participating in reviewing my patches. I already have talked to them and we are in progress together to get the next version ready for the submission. We are trying our level best and working hard to push the next set of patches to the mainline as soon as possible. Best regards, Parthiban V > > Andrew >
On 21/02/24 5:51 pm, Piergiorgio Beruto wrote: > [Some people who received this message don't often get email from pier.beruto@onsemi.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > > Hi all, > thank you for this discussion. > > Since this framework is supposed to be a base for every potential OA-TC6 MACPHY implementation, I propose this could be a joint work. > Selvamani and I spotted some points in the code where we could optimize the performance and we would like to add the required changes to accommodate all potential implementations. > > Would it be possible to share with the group the latest patches? Hi Piergiorgio, Good to hear! Yes I agree with you, let's work together. We really appreciate your efforts on this. As I already replied to Andrew, we too did some implementation changes proposed by our internal reviewers to optimize the performance and improve the code quality. Also we are in the final stage of the internal review and to avoid unnecessary confusions between the versions we don't recommend to share an intermediate version. Definitely the next version (v3) is going to hit the mainline in the next couple of days so that you can give your comments. My team with our internal reviewers are extremely working hard to make it happen. Thanks for your patience. Best regards, Parthiban V > > Thanks, > Piergiorgio > > -----Original Message----- > From: Parthiban.Veerasooran@microchip.com <Parthiban.Veerasooran@microchip.com> > Sent: 21 February, 2024 06:15 > To: andrew@lunn.ch > Cc: davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Steen.Hegelund@microchip.com; netdev@vger.kernel.org; Horatiu.Vultur@microchip.com; Woojung.Huh@microchip.com; Nicolas.Ferre@microchip.com; UNGLinuxDriver@microchip.com; Thorsten.Kummermehr@microchip.com; Piergiorgio Beruto <Pier.Beruto@onsemi.com>; Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com> > Subject: Re: [PATCH net-next v2 0/9] Add support for OPEN Alliance 10BASE-T1x MACPHY Serial Interface > > [External Email]: This email arrived from an external source - Please exercise caution when opening any attachments or clicking on links. > > On 19/02/24 8:00 pm, Andrew Lunn wrote: >> EXTERNAL EMAIL: Do not click links or open attachments unless you know >> the content is safe >> >>> Hi Andrew, >>> >>> From Microchip side, we haven't stopped/postponed this framework >>> development. We are already working on it. It is in the final stage now. >>> We are doing internal reviews right now and we expect that in 3 weeks >>> time frame in the mainline again. We will send a new version (v3) of >>> this patch series soon. >> >> Hi Parthiban >> >> It is good to here you are still working on it. >> >> A have a few comments about how Linux mainline works. It tends to be >> very iterative. Cycles tend to be fast. You will probably get review >> comments within a couple of days of posting code. You often see >> developers posting a new version within a few days, maybe a week. If >> reviewers have asked for large changes, it can take longer, but >> general, the cycles are short. >> >> When you say you need three weeks for internal review, that to me >> seems very slow. Is it so hard to get access to internal reviewers? Do >> you have a very formal review process? More waterfall than iterative >> development? I would suggest you try to keep your internal reviews >> fast and low overhead, because you will be doing it lots of times as >> we iterate the framework. > > Hi Andrew, > > We understand your concern. We are working on this task with full focus. > Initially there were some implementation change proposal from our internal reviewers to improve the performance and code quality. > Consequently the testing of the new implementation took some while to bring it to a good shape. > > Our internal reviewers Steen Hegelund and Horatiu Vultur are actively participating in reviewing my patches. I already have talked to them and we are in progress together to get the next version ready for the submission. We are trying our level best and working hard to push the next set of patches to the mainline as soon as possible. > > Best regards, > Parthiban V >> >> Andrew >> >
On 19/02/24 8:00 pm, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > >> Hi Andrew, >> >> From Microchip side, we haven't stopped/postponed this framework >> development. We are already working on it. It is in the final stage now. >> We are doing internal reviews right now and we expect that in 3 weeks >> time frame in the mainline again. We will send a new version (v3) of >> this patch series soon. > > Hi Parthiban > > It is good to here you are still working on it. > > A have a few comments about how Linux mainline works. It tends to be > very iterative. Cycles tend to be fast. You will probably get review > comments within a couple of days of posting code. You often see > developers posting a new version within a few days, maybe a week. If > reviewers have asked for large changes, it can take longer, but > general, the cycles are short. > > When you say you need three weeks for internal review, that to me > seems very slow. Is it so hard to get access to internal reviewers? Do > you have a very formal review process? More waterfall than iterative > development? I would suggest you try to keep your internal reviews > fast and low overhead, because you will be doing it lots of times as > we iterate the framework. > > Andrew > Hi Andrew, Good day...! Finally we have completed the v3 patch series preparation and planning to post it in the mainline in the next days. FYI, next week (from 11th to 15th March) I will be out of office and will not have access to emails. Again will be back to work on 18th March. Would it be ok for you to post the patch series this week or shall I post it on March 18th? as I will not be able to reply to the comments immediately. Could you please provide your suggestion on this? Best regards, Parthiban V
> Hi Andrew, > > Good day...! > > Finally we have completed the v3 patch series preparation and planning > to post it in the mainline in the next days. FYI, next week (from 11th > to 15th March) I will be out of office and will not have access to > emails. Again will be back to work on 18th March. Would it be ok for you > to post the patch series this week or shall I post it on March 18th? as > I will not be able to reply to the comments immediately. Could you > please provide your suggestion on this? Onsemi are waiting for the new version. So i would suggest you post them sooner, not later. If need be, get one of the other Microchip developers to post them. If they are posted RFC, the signed-off-by can be missing the actual submitter. Andrew
On 04/03/24 6:52 pm, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > >> Hi Andrew, >> >> Good day...! >> >> Finally we have completed the v3 patch series preparation and planning >> to post it in the mainline in the next days. FYI, next week (from 11th >> to 15th March) I will be out of office and will not have access to >> emails. Again will be back to work on 18th March. Would it be ok for you >> to post the patch series this week or shall I post it on March 18th? as >> I will not be able to reply to the comments immediately. Could you >> please provide your suggestion on this? > > Onsemi are waiting for the new version. So i would suggest you post > them sooner, not later. If need be, get one of the other Microchip > developers to post them. If they are posted RFC, the signed-off-by can > be missing the actual submitter. > > Andrew Hi Andrew, Thank you for your suggestion. We will send out the patches soon and will try our level best to reply to the comments as soon as possible. Best regards, Parthiban V
Hi Parthiban I hope I send this in the right context as it is not related to just one patch or some specific code. I conducted UDP load testing using three i.MX8MM boards in conjunction with the LAN8651. The setup involved one board functioning as a server, which is just echoing back received data, while the remaining two boards acted as clients, sending UDP packets of different sizes in various bursts to the server. Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might have influenced the results. During the tests I experienced some issues: - The boards just start receiving after first sending something (ping another board). Some measurements showed that the irq stays asserted after init. This makes sense as far as I understand the chapter 7.7 of the specification, the irq is deasserted on reception of the first data header following CSn being asserted. As a workaround I trigger the thread at the end of oa_tc6_init. - If there is a lot of traffic, the receive buffer overflow error spams the log. - If there is a lot of traffic, I got various kernel panics in oa_tc6_update_rx_skb. Mostly because more data to rx_skb is added than allocated and sometimes because rx_skb is null in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging with a logic analyzer showed that the chip is not behave correctly. There is more bytes between start_valid and end_valid than there should be. Also there seems to be 2 end_valid without a start_valid between. What is common is that the incorrect frame starts in a chunk where end_valid and start_valid is set. In my opinion its a problem in the chip (maybe related to the errata in the next point) but the driver should be resilent and just drop the packet and not cause a kernel panic. - Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0) and also exst is not active. I found out that there is an errata (DS80001075) point s3 that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above. The driver now works since about 2.5 weeks with various load with just one loss of frame error where I had to reboot the system after about 4 days. Is there a reason why you removed the netdev watchdog which was active in v2? Thanks, Benjamin Bigler
Hi Benjamin Bigler, Thank you for your testing and feedback. It would be really helpful to bring the driver to a good shape. We really appreciate your efforts on this. On 24/03/24 5:25 pm, Benjamin Bigler wrote: > [Some people who received this message don't often get email from benjamin@bigler.one. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > > Hi Parthiban > > I hope I send this in the right context as it is not related to just one patch or > some specific code. > > I conducted UDP load testing using three i.MX8MM boards in conjunction with the > LAN8651. The setup involved one board functioning as a server, which is just > echoing back received data, while the remaining two boards acted as clients, > sending UDP packets of different sizes in various bursts to the server. > Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might > have influenced the results. > > During the tests I experienced some issues: > > - The boards just start receiving after first sending something (ping another board). > Some measurements showed that the irq stays asserted after init. This makes sense > as far as I understand the chapter 7.7 of the specification, the irq is deasserted > on reception of the first data header following CSn being asserted. As a workaround > I trigger the thread at the end of oa_tc6_init. It looks like the IRQ is asserted on RESET completion and expects a data chunk from host to deassert the IRQ. I used to test the driver in RPI 4 using iperf3. For some reason I never faced this issue, may be when the network device is being registered there might be some packet transmission which leads to deliver a data chunk so that the IRQ is deasserted. Thanks for the workaround. I think that would be the solution to solve this issue. Adding the below lines in the end of the function oa_tc6_init() will trigger the oa_tc6_spi_thread_handler() to perform an empty data chunk transfer which will deassert the IRQ before starting the actual data transfer. /* oa_tc6_sw_reset_macphy() function resets and clears the MAC-PHY reset * complete status. IRQ is also asserted on reset completion and it is * remain asserted until MAC-PHY receives a data chunk. So performing an * empty data chunk transmission will deassert the IRQ. Refer section * 7.7 and 9.2.8.8 in the OPEN Alliance specification for more details. */ tc6->int_flag = true; wake_up_interruptible(&tc6->spi_wq); > > - If there is a lot of traffic, the receive buffer overflow error spams the log. > > - If there is a lot of traffic, I got various kernel panics in oa_tc6_update_rx_skb. > Mostly because more data to rx_skb is added than allocated and sometimes because > rx_skb is null in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging > with a logic analyzer showed that the chip is not behave correctly. There is more > bytes between start_valid and end_valid than there should be. Also there > seems to be 2 end_valid without a start_valid between. What is common is that the incorrect > frame starts in a chunk where end_valid and start_valid is set. > In my opinion its a problem in the chip (maybe related to the errata in the next point) > but the driver should be resilent and just drop the packet and not cause a kernel panic. Usually I run into this issue "receive buffer overflow" when I run RPI 4 with default cpu governor setting which is "ondemand". In this case, even though if I set SPI clock speed as 15 MHz the RPI 4 core clock is clocking down when it is idle which leads delivering half of the configured SPI clock speed around 5.9 MHz. So the systems like RPI 4 need performance mode enabled to get the proper clock speed for SPI. Refer below link for more details. https://github.com/raspberrypi/linux/issues/3381#issuecomment-1144723750 I used to enable performance mode using the below command. echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > /dev/null So please ensure the SPI clock speed using a logic analyzer to get the maximum throughput without receive buffer overflow. Of course, I agree that the driver should not crash in case of receive buffer overflow. By referring your investigations, I understand that the buffers in the MAC-PHY is being continuously overwritten again and again as the host is very slow to read the data from the MAC-PHY buffers through SPI which alters the descriptors. There might be two reasons why we run into this situation. 1. The host is busy doing something else and delays to initiate SPI even though SPI clock speed is 15 MHz. 2. The SPI clock speed is less than 15 MHz. I use the below iperf3 setup for my testing and never faced the driver crash issue even though faced "receive buffer overflow" error when I run RPI 4 with "ondemand" default mode. Node 0 - Raspberry Pi 4 with LAN8650 MAC-PHY $ iperf3 -s Node 1 - Raspberry Pi 4 with EVB-LAN8670-USB USB Stick $ iperf3 -c 192.168.5.100 -u -b 10M -i 1 -t 0 and vice versa. I never faced "receive buffer overflow" error when I run RPI 4 with "performance" mode enabled and even though all the cores are stressed using the below command, $ yes >/dev/null & yes >/dev/null & yes >/dev/null & yes >/dev/null & Can you share more details about your testing setup and applications you use, so that I will try to reproduce the issue in my setup to debug the driver? > > - Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0) > and also exst is not active. I found out that there is an errata (DS80001075) point s3 > that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above. > The driver now works since about 2.5 weeks with various load with just one loss of frame > error where I had to reboot the system after about 4 days. It is good to hear that the driver works fine with the above changes. As mentioned in the errata, this continuous interrupt issue is a known issue with LAN8651 Rev.B0. Switching to LAN8651 Rev.B1 will solve this issue and no need of any workaround. Setting ZARFE bit in the CONFIG0 will solve the continuous interrupt issue but don't know how the above "receive buffer overflow" issue also solved. I think it is a good idea to test with LAN8651 Rev.B1 without setting ZARFE bit once. It would be interesting to see the result. I am always using LAN8651 Rev.B1 for my testing. I should be able to reproduce the "receive buffer overflow" issue and consequently kernel crash in my setup with LAN8651 Rev.B1 so that I can investigate the issue further. As I am not able to reproduce in my RPI 4, I need your support for the tests and applications you used in your setup. > > Is there a reason why you removed the netdev watchdog which was active in v2? When the timeout occurs, there is no further action except increasing tx_errors. Not seeing this except USB-to-Ethernet which can be removed unexpectedly. But this is SPI interface which will not be removed unexpectedly as it is a platform device. That's why we removed this. Best regards, Parthiban V > > Thanks, > Benjamin Bigler >
> It looks like the IRQ is asserted on RESET completion and expects a data > chunk from host to deassert the IRQ. I used to test the driver in RPI 4 > using iperf3. For some reason I never faced this issue, may be when the > network device is being registered there might be some packet > transmission which leads to deliver a data chunk so that the IRQ is > deasserted. If you have IPv6 enabled, the network stack will try to add a link local IPv6 address to the interface, which means performing a Duplicate Address Detection. That means sending a few packets. Try disabling IPv6 if you want to reproduce the problem. Andrew
Hi Andrew, On 25/03/24 7:31 pm, Andrew Lunn wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > >> It looks like the IRQ is asserted on RESET completion and expects a data >> chunk from host to deassert the IRQ. I used to test the driver in RPI 4 >> using iperf3. For some reason I never faced this issue, may be when the >> network device is being registered there might be some packet >> transmission which leads to deliver a data chunk so that the IRQ is >> deasserted. > > If you have IPv6 enabled, the network stack will try to add a link > local IPv6 address to the interface, which means performing a > Duplicate Address Detection. That means sending a few packets. > > Try disabling IPv6 if you want to reproduce the problem. Yes, I saw that IPv6 is enabled by default in my RPI 4. So as you mentioned, I tried disabling it but still approximately in 1 second there is a packet coming from the n/w stack which makes the IRQ deasserted (Refer the attached screenshot). So just for testing, I disabled waking up the SPI thread from the oa_tc6_start_xmit() function and noted that the IRQ is not deasserted throughout the period in Logic analyzer. As I replied in the previous email, using the below code makes the IRQ deasserted when the oa_tc6_init() finishes without expecting any data chunk from oa_tc6_start_xmit(). So in my opinion let's stick with this solution. /* oa_tc6_sw_reset_macphy() function resets and clears the MAC-PHY reset * complete status. IRQ is also asserted on reset completion and it is * remain asserted until MAC-PHY receives a data chunk. So performing an * empty data chunk transmission will deassert the IRQ. Refer section * 7.7 and 9.2.8.8 in the OPEN Alliance specification for more details. */ tc6->int_flag = true; wake_up_interruptible(&tc6->spi_wq); Best regards, Parthiban V > > Andrew
Hi Parthiban, Sorry for the late answer, I was quite busy the last few days. On Mon, 2024-03-25 at 13:24 +0000, Parthiban.Veerasooran@microchip.com wrote: > Hi Benjamin Bigler, > > Thank you for your testing and feedback. It would be really helpful to > bring the driver to a good shape. We really appreciate your efforts on this. > > On 24/03/24 5:25 pm, Benjamin Bigler wrote: > > [Some people who received this message don't often get email from benjamin@bigler.one. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > > > > Hi Parthiban > > > > I hope I send this in the right context as it is not related to just one patch or > > some specific code. > > > > I conducted UDP load testing using three i.MX8MM boards in conjunction with the > > LAN8651. The setup involved one board functioning as a server, which is just > > echoing back received data, while the remaining two boards acted as clients, > > sending UDP packets of different sizes in various bursts to the server. > > Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might > > have influenced the results. > > > > During the tests I experienced some issues: > > > > - The boards just start receiving after first sending something (ping another board). > > Some measurements showed that the irq stays asserted after init. This makes sense > > as far as I understand the chapter 7.7 of the specification, the irq is deasserted > > on reception of the first data header following CSn being asserted. As a workaround > > I trigger the thread at the end of oa_tc6_init. > It looks like the IRQ is asserted on RESET completion and expects a data > chunk from host to deassert the IRQ. I used to test the driver in RPI 4 > using iperf3. For some reason I never faced this issue, may be when the > network device is being registered there might be some packet > transmission which leads to deliver a data chunk so that the IRQ is > deasserted. Thanks for the workaround. I think that would be the > solution to solve this issue. Adding the below lines in the end of the > function oa_tc6_init() will trigger the oa_tc6_spi_thread_handler() to > perform an empty data chunk transfer which will deassert the IRQ before > starting the actual data transfer. I have ipv6 disabled and use static ipv4 addresses. That could be the reason why on my side no packet is sent. > > /* oa_tc6_sw_reset_macphy() function resets and clears the MAC-PHY reset > * complete status. IRQ is also asserted on reset completion and it is > * remain asserted until MAC-PHY receives a data chunk. So performing an > * empty data chunk transmission will deassert the IRQ. Refer section > * 7.7 and 9.2.8.8 in the OPEN Alliance specification for more details. > */ > tc6->int_flag = true; > wake_up_interruptible(&tc6->spi_wq); Perfect, thats the same I added and also works on my side. > > > > - If there is a lot of traffic, the receive buffer overflow error spams the log. > > > > - If there is a lot of traffic, I got various kernel panics in oa_tc6_update_rx_skb. > > Mostly because more data to rx_skb is added than allocated and sometimes because > > rx_skb is null in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging > > with a logic analyzer showed that the chip is not behave correctly. There is more > > bytes between start_valid and end_valid than there should be. Also there > > seems to be 2 end_valid without a start_valid between. What is common is that the incorrect > > frame starts in a chunk where end_valid and start_valid is set. > > In my opinion its a problem in the chip (maybe related to the errata in the next point) > > but the driver should be resilent and just drop the packet and not cause a kernel panic. > Usually I run into this issue "receive buffer overflow" when I run RPI 4 > with default cpu governor setting which is "ondemand". In this case, > even though if I set SPI clock speed as 15 MHz the RPI 4 core clock is > clocking down when it is idle which leads delivering half of the > configured SPI clock speed around 5.9 MHz. So the systems like RPI 4 > need performance mode enabled to get the proper clock speed for SPI. > Refer below link for more details. > > https://github.com/raspberrypi/linux/issues/3381#issuecomment-1144723750 > > I used to enable performance mode using the below command. > > echo performance | sudo tee > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > /dev/null > > So please ensure the SPI clock speed using a logic analyzer to get the > maximum throughput without receive buffer overflow. > > Of course, I agree that the driver should not crash in case of receive > buffer overflow. By referring your investigations, I understand that the > buffers in the MAC-PHY is being continuously overwritten again and again > as the host is very slow to read the data from the MAC-PHY buffers > through SPI which alters the descriptors. There might be two reasons why > we run into this situation. > 1. The host is busy doing something else and delays to initiate SPI even > though SPI clock speed is 15 MHz. > 2. The SPI clock speed is less than 15 MHz. Sorry there is a missunderstanding between us. The receive buffer overflow is not causing any harm except filling the log. In my setup I get in one day about 35000 entries. I am not sure if its appropriate to log these errors. The SPI Frequency is at 14.8 MHz. If I just have 2 boards connected, I am not able to reproduce this. Only with 3 boards when 2 boards sends multiple big ethernet frames (1512 byte per Frame) to one, I get these log entries. The latency seems to be quite low, from IRQ to start reading first frame it takes always less than 500us. Also the boards are just running the udp test. > > I use the below iperf3 setup for my testing and never faced the driver > crash issue even though faced "receive buffer overflow" error when I run > RPI 4 with "ondemand" default mode. > > Node 0 - Raspberry Pi 4 with LAN8650 MAC-PHY > $ iperf3 -s > Node 1 - Raspberry Pi 4 with EVB-LAN8670-USB USB Stick > $ iperf3 -c 192.168.5.100 -u -b 10M -i 1 -t 0 > > and vice versa. > > I never faced "receive buffer overflow" error when I run RPI 4 with > "performance" mode enabled and even though all the cores are stressed > using the below command, > > $ yes >/dev/null & yes >/dev/null & yes >/dev/null & yes >/dev/null & > > Can you share more details about your testing setup and applications you > use, so that I will try to reproduce the issue in my setup to debug the > driver? I use a internal tool which does some stress tests using udp. Unfortunately, I am not allowed to publish it, but a colleague works on a rust implementation, which we can publish, but its not fully ready yet. On one board the tool is running in server mode. It just echoes back the received data. On the 2 other boards the tool is running in client mode. It sends various sized udp-packets in different bursts and then checks if it receives the same data in the same order. The crashes only happens when ZARFE is not set (with Rev B0). When the crash happens, I see on the logic analyzer that there are more bytes than mtu + headers between the frame where start_valid is set and the frame where end_valid is set. Then this happens: [ 437.155673] skbuff: skb_over_panic: text:ffff80007a8c2bd8 len:1600 put:64 head:ffff00000de28080 data:ffff00000de280c0 tail:0x680 end:0x640 dev:eth1 [ 437.168987] kernel BUG at net/core/skbuff.c:192! [ 437.173612] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 437.180407] Modules linked in: ppp_async crc_ccitt ppp_generic slhc lan865x oa_tc6 bec_infoo(O) tpm_tis_spi tpm_tis_core spi_imx imx_sdma [ 437.196016] CPU: 1 PID: 455 Comm: oa-tc6-spi-thre Tainted: G O 6.6.11- gce336e2c2bc3-dirty #1 [ 437.205853] Hardware name: Toradex Verdin iMX8M Mini on FUMU (DT) [ 437.212820] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 437.219790] pc : skb_panic+0x58/0x5c [ 437.223376] lr : skb_panic+0x58/0x5c [ 437.226959] sp : ffff80008362bd90 [ 437.230278] x29: ffff80008362bda0 x28: 0000000000000000 x27: ffff000001066878 [ 437.237426] x26: 000000000000001e x25: 00000000000007f8 x24: ffff0000010cea80 [ 437.244571] x23: 00000000f0f0f0f1 x22: 000000000000001f x21: 0000000000000000 [ 437.251720] x20: ffff0000010ceaa8 x19: 000000003f20003f x18: ffffffffffffffff [ 437.258867] x17: ffff7ffffded9000 x16: ffff800080008000 x15: 073a0764076e0765 [ 437.266015] x14: 0720073007380736 x13: ffff8000823d1f58 x12: 0000000000000534 [ 437.273162] x11: 00000000000001bc x10: ffff800082429f58 x9 : ffff8000823d1f58 [ 437.280310] x8 : 00000000ffffefff x7 : ffff800082429f58 x6 : 0000000000000000 [ 437.287455] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000 [ 437.294606] x2 : 0000000000000000 x1 : ffff000001223b00 x0 : 0000000000000087 [ 437.301753] Call trace: [ 437.304203] skb_panic+0x58/0x5c [ 437.307436] skb_find_text+0x0/0xf0 [ 437.310933] oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6] [ 437.316523] kthread+0x118/0x11c [ 437.319758] ret_from_fork+0x10/0x20 [ 437.323343] Code: f90007e9 b940b908 f90003e8 97ca3c34 (d4210000) [ 437.329446] ---[ end trace 0000000000000000 ]--- Sometimes there are 2 end_valid after eachother without a start_valid between. Then this happens: [ 469.737297] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000074 [ 469.746137] Mem abort info: [ 469.748950] ESR = 0x0000000096000004 [ 469.752709] EC = 0x25: DABT (current EL), IL = 32 bits [ 469.758036] SET = 0, FnV = 0 [ 469.761098] EA = 0, S1PTW = 0 [ 469.764252] FSC = 0x04: level 0 translation fault [ 469.769144] Data abort info: [ 469.772033] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 469.777529] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 469.782594] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 469.787921] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043c32000 [ 469.794377] [0000000000000074] pgd=0000000000000000, p4d=0000000000000000 [ 469.801184] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 469.807459] Modules linked in: ppp_async crc_ccitt ppp_generic slhc lan865x oa_tc6 bec_infoo(O) tpm_tis_spi tpm_tis_core spi_imx imx_sdma [ 469.823064] CPU: 2 PID: 456 Comm: oa-tc6-spi-thre Tainted: G O 6.6.11- g350ed394a6ca-dirty #1 [ 469.832903] Hardware name: Toradex Verdin iMX8M Mini on FUMU (DT) [ 469.839871] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 469.846841] pc : skb_put+0xc/0x6c [ 469.850169] lr : oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6] [ 469.856106] sp : ffff80008376bdb0 [ 469.859424] x29: ffff80008376bdb0 x28: 0000000000000000 x27: ffff00000194c080 [ 469.866573] x26: 0000000000000000 x25: 0000000000000000 x24: ffff000001095c80 [ 469.873720] x23: 00000000f0f0f0f1 x22: 000000000000001f x21: 0000000000000000 [ 469.880870] x20: ffff000001095ca8 x19: 000000003f20003f x18: 0000000000000000 [ 469.888023] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 469.895174] x14: 0000031acf8b86d8 x13: 0000000000000000 x12: 0000000000000000 [ 469.902321] x11: 0000000000000002 x10: 0000000000000a60 x9 : ffff80008376b970 [ 469.909467] x8 : ffff00007fb6e580 x7 : 000000000194b080 x6 : 0000000000000000 [ 469.916616] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 000000000000fc80 [ 469.923765] x2 : 0000000000000001 x1 : 0000000000000040 x0 : 0000000000000000 [ 469.930915] Call trace: [ 469.933365] skb_put+0xc/0x6c [ 469.936342] oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6] [ 469.941929] kthread+0x118/0x11c [ 469.945166] ret_from_fork+0x10/0x20 [ 469.948752] Code: d65f03c0 d503233f a9bf7bfd 910003fd (b9407406) [ 469.954854] ---[ end trace 0000000000000000 ]--- If interested I can try to get a recording with the logic analyzer and send it to you. By the way in the other answer you attached a screenshot of the logic analyzer and you have a very nice HLA for oa_tc6. Are they open-source or are there any plans to publish them? > > > > - Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0) > > and also exst is not active. I found out that there is an errata (DS80001075) point s3 > > that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above. > > The driver now works since about 2.5 weeks with various load with just one loss of frame > > error where I had to reboot the system after about 4 days. > It is good to hear that the driver works fine with the above changes. As > mentioned in the errata, this continuous interrupt issue is a known > issue with LAN8651 Rev.B0. Switching to LAN8651 Rev.B1 will solve this > issue and no need of any workaround. Setting ZARFE bit in the CONFIG0 > will solve the continuous interrupt issue but don't know how the above > "receive buffer overflow" issue also solved. I think it is a good idea > to test with LAN8651 Rev.B1 without setting ZARFE bit once. It would be > interesting to see the result. I am always using LAN8651 Rev.B1 for my > testing. Unfortunately I just have LAN8651 Rev. B0 Chips. Are you sure that the Rev B1 has the issue fixed? The errata here says that B1 is affected too: https://ww1.microchip.com/downloads/aemDocuments/documents/AIS/ProductDocuments/Errata/LAN8650-1-Errata-80001075.pdf > > I should be able to reproduce the "receive buffer overflow" issue and > consequently kernel crash in my setup with LAN8651 Rev.B1 so that I can > investigate the issue further. As I am not able to reproduce in my RPI > 4, I need your support for the tests and applications you used in your > setup. > > > > Is there a reason why you removed the netdev watchdog which was active in v2? > When the timeout occurs, there is no further action except increasing > tx_errors. Not seeing this except USB-to-Ethernet which can be removed > unexpectedly. But this is SPI interface which will not be removed > unexpectedly as it is a platform device. That's why we removed this. > > Best regards, > Parthiban V > > > > Thanks, > > Benjamin Bigler > > > Thanks, Benjamin Bigler
Hi Benjamin, On 04/04/24 3:10 am, Benjamin Bigler wrote: > [Some people who received this message don't often get email from benjamin@bigler.one. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > > Hi Parthiban, > > Sorry for the late answer, I was quite busy the last few days. No problem. > > On Mon, 2024-03-25 at 13:24 +0000, Parthiban.Veerasooran@microchip.com wrote: >> Hi Benjamin Bigler, >> >> Thank you for your testing and feedback. It would be really helpful to >> bring the driver to a good shape. We really appreciate your efforts on this. >> >> On 24/03/24 5:25 pm, Benjamin Bigler wrote: >>> [Some people who received this message don't often get email from benjamin@bigler.one. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>> >>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe >>> >>> Hi Parthiban >>> >>> I hope I send this in the right context as it is not related to just one patch or >>> some specific code. >>> >>> I conducted UDP load testing using three i.MX8MM boards in conjunction with the >>> LAN8651. The setup involved one board functioning as a server, which is just >>> echoing back received data, while the remaining two boards acted as clients, >>> sending UDP packets of different sizes in various bursts to the server. >>> Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might >>> have influenced the results. >>> >>> During the tests I experienced some issues: >>> >>> - The boards just start receiving after first sending something (ping another board). >>> Some measurements showed that the irq stays asserted after init. This makes sense >>> as far as I understand the chapter 7.7 of the specification, the irq is deasserted >>> on reception of the first data header following CSn being asserted. As a workaround >>> I trigger the thread at the end of oa_tc6_init. >> It looks like the IRQ is asserted on RESET completion and expects a data >> chunk from host to deassert the IRQ. I used to test the driver in RPI 4 >> using iperf3. For some reason I never faced this issue, may be when the >> network device is being registered there might be some packet >> transmission which leads to deliver a data chunk so that the IRQ is >> deasserted. Thanks for the workaround. I think that would be the >> solution to solve this issue. Adding the below lines in the end of the >> function oa_tc6_init() will trigger the oa_tc6_spi_thread_handler() to >> perform an empty data chunk transfer which will deassert the IRQ before >> starting the actual data transfer. > > I have ipv6 disabled and use static ipv4 addresses. That could be the reason why on > my side no packet is sent. > >> >> /* oa_tc6_sw_reset_macphy() function resets and clears the MAC-PHY reset >> * complete status. IRQ is also asserted on reset completion and it is >> * remain asserted until MAC-PHY receives a data chunk. So performing an >> * empty data chunk transmission will deassert the IRQ. Refer section >> * 7.7 and 9.2.8.8 in the OPEN Alliance specification for more details. >> */ >> tc6->int_flag = true; >> wake_up_interruptible(&tc6->spi_wq); > > Perfect, thats the same I added and also works on my side. > >>> >>> - If there is a lot of traffic, the receive buffer overflow error spams the log. >>> >>> - If there is a lot of traffic, I got various kernel panics in oa_tc6_update_rx_skb. >>> Mostly because more data to rx_skb is added than allocated and sometimes because >>> rx_skb is null in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging >>> with a logic analyzer showed that the chip is not behave correctly. There is more >>> bytes between start_valid and end_valid than there should be. Also there >>> seems to be 2 end_valid without a start_valid between. What is common is that the incorrect >>> frame starts in a chunk where end_valid and start_valid is set. >>> In my opinion its a problem in the chip (maybe related to the errata in the next point) >>> but the driver should be resilent and just drop the packet and not cause a kernel panic. >> Usually I run into this issue "receive buffer overflow" when I run RPI 4 >> with default cpu governor setting which is "ondemand". In this case, >> even though if I set SPI clock speed as 15 MHz the RPI 4 core clock is >> clocking down when it is idle which leads delivering half of the >> configured SPI clock speed around 5.9 MHz. So the systems like RPI 4 >> need performance mode enabled to get the proper clock speed for SPI. >> Refer below link for more details. >> >> https://github.com/raspberrypi/linux/issues/3381#issuecomment-1144723750 >> >> I used to enable performance mode using the below command. >> >> echo performance | sudo tee >> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > /dev/null >> >> So please ensure the SPI clock speed using a logic analyzer to get the >> maximum throughput without receive buffer overflow. >> >> Of course, I agree that the driver should not crash in case of receive >> buffer overflow. By referring your investigations, I understand that the >> buffers in the MAC-PHY is being continuously overwritten again and again >> as the host is very slow to read the data from the MAC-PHY buffers >> through SPI which alters the descriptors. There might be two reasons why >> we run into this situation. >> 1. The host is busy doing something else and delays to initiate SPI even >> though SPI clock speed is 15 MHz. >> 2. The SPI clock speed is less than 15 MHz. > > Sorry there is a missunderstanding between us. The receive buffer overflow is not > causing any harm except filling the log. In my setup I get in one day about 35000 > entries. I am not sure if its appropriate to log these errors. > > The SPI Frequency is at 14.8 MHz. If I just have 2 boards connected, I am not able > to reproduce this. Only with 3 boards when 2 boards sends multiple big ethernet > frames (1512 byte per Frame) to one, I get these log entries. > The latency seems to be quite low, from IRQ to start reading first frame it takes > always less than 500us. Also the boards are just running the udp test. > >> >> I use the below iperf3 setup for my testing and never faced the driver >> crash issue even though faced "receive buffer overflow" error when I run >> RPI 4 with "ondemand" default mode. >> >> Node 0 - Raspberry Pi 4 with LAN8650 MAC-PHY >> $ iperf3 -s >> Node 1 - Raspberry Pi 4 with EVB-LAN8670-USB USB Stick >> $ iperf3 -c 192.168.5.100 -u -b 10M -i 1 -t 0 >> >> and vice versa. >> >> I never faced "receive buffer overflow" error when I run RPI 4 with >> "performance" mode enabled and even though all the cores are stressed >> using the below command, >> >> $ yes >/dev/null & yes >/dev/null & yes >/dev/null & yes >/dev/null & >> >> Can you share more details about your testing setup and applications you >> use, so that I will try to reproduce the issue in my setup to debug the >> driver? > > I use a internal tool which does some stress tests using udp. Unfortunately, > I am not allowed to publish it, but a colleague works on a rust implementation, > which we can publish, but its not fully ready yet. > On one board the tool is running in server mode. It just echoes back the received > data. On the 2 other boards the tool is running in client mode. It sends various > sized udp-packets in different bursts and then checks if it receives the same > data in the same order. > > > The crashes only happens when ZARFE is not set (with Rev B0). When the crash > happens, I see on the logic analyzer that there are more bytes than mtu + headers > between the frame where start_valid is set and the frame where end_valid is set. > Then this happens: Thanks for all the above details. I will include this ZARFE fix in the next version v4 which I am going to post soon. > > [ 437.155673] skbuff: skb_over_panic: text:ffff80007a8c2bd8 len:1600 put:64 head:ffff00000de28080 > data:ffff00000de280c0 tail:0x680 end:0x640 dev:eth1 > [ 437.168987] kernel BUG at net/core/skbuff.c:192! > [ 437.173612] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP > [ 437.180407] Modules linked in: ppp_async crc_ccitt ppp_generic slhc lan865x oa_tc6 bec_infoo(O) > tpm_tis_spi tpm_tis_core spi_imx imx_sdma > [ 437.196016] CPU: 1 PID: 455 Comm: oa-tc6-spi-thre Tainted: G O 6.6.11- > gce336e2c2bc3-dirty #1 > [ 437.205853] Hardware name: Toradex Verdin iMX8M Mini on FUMU (DT) > [ 437.212820] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 437.219790] pc : skb_panic+0x58/0x5c > [ 437.223376] lr : skb_panic+0x58/0x5c > [ 437.226959] sp : ffff80008362bd90 > [ 437.230278] x29: ffff80008362bda0 x28: 0000000000000000 x27: ffff000001066878 > [ 437.237426] x26: 000000000000001e x25: 00000000000007f8 x24: ffff0000010cea80 > [ 437.244571] x23: 00000000f0f0f0f1 x22: 000000000000001f x21: 0000000000000000 > [ 437.251720] x20: ffff0000010ceaa8 x19: 000000003f20003f x18: ffffffffffffffff > [ 437.258867] x17: ffff7ffffded9000 x16: ffff800080008000 x15: 073a0764076e0765 > [ 437.266015] x14: 0720073007380736 x13: ffff8000823d1f58 x12: 0000000000000534 > [ 437.273162] x11: 00000000000001bc x10: ffff800082429f58 x9 : ffff8000823d1f58 > [ 437.280310] x8 : 00000000ffffefff x7 : ffff800082429f58 x6 : 0000000000000000 > [ 437.287455] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000 > [ 437.294606] x2 : 0000000000000000 x1 : ffff000001223b00 x0 : 0000000000000087 > [ 437.301753] Call trace: > [ 437.304203] skb_panic+0x58/0x5c > [ 437.307436] skb_find_text+0x0/0xf0 > [ 437.310933] oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6] > [ 437.316523] kthread+0x118/0x11c > [ 437.319758] ret_from_fork+0x10/0x20 > [ 437.323343] Code: f90007e9 b940b908 f90003e8 97ca3c34 (d4210000) > [ 437.329446] ---[ end trace 0000000000000000 ]--- > > > Sometimes there are 2 end_valid after eachother without a start_valid between. > Then this happens: > > [ 469.737297] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000074 > [ 469.746137] Mem abort info: > [ 469.748950] ESR = 0x0000000096000004 > [ 469.752709] EC = 0x25: DABT (current EL), IL = 32 bits > [ 469.758036] SET = 0, FnV = 0 > [ 469.761098] EA = 0, S1PTW = 0 > [ 469.764252] FSC = 0x04: level 0 translation fault > [ 469.769144] Data abort info: > [ 469.772033] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 > [ 469.777529] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > [ 469.782594] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > [ 469.787921] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043c32000 > [ 469.794377] [0000000000000074] pgd=0000000000000000, p4d=0000000000000000 > [ 469.801184] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > [ 469.807459] Modules linked in: ppp_async crc_ccitt ppp_generic slhc lan865x oa_tc6 bec_infoo(O) > tpm_tis_spi tpm_tis_core spi_imx imx_sdma > [ 469.823064] CPU: 2 PID: 456 Comm: oa-tc6-spi-thre Tainted: G O 6.6.11- > g350ed394a6ca-dirty #1 > [ 469.832903] Hardware name: Toradex Verdin iMX8M Mini on FUMU (DT) > [ 469.839871] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 469.846841] pc : skb_put+0xc/0x6c > [ 469.850169] lr : oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6] > [ 469.856106] sp : ffff80008376bdb0 > [ 469.859424] x29: ffff80008376bdb0 x28: 0000000000000000 x27: ffff00000194c080 > [ 469.866573] x26: 0000000000000000 x25: 0000000000000000 x24: ffff000001095c80 > [ 469.873720] x23: 00000000f0f0f0f1 x22: 000000000000001f x21: 0000000000000000 > [ 469.880870] x20: ffff000001095ca8 x19: 000000003f20003f x18: 0000000000000000 > [ 469.888023] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 > [ 469.895174] x14: 0000031acf8b86d8 x13: 0000000000000000 x12: 0000000000000000 > [ 469.902321] x11: 0000000000000002 x10: 0000000000000a60 x9 : ffff80008376b970 > [ 469.909467] x8 : ffff00007fb6e580 x7 : 000000000194b080 x6 : 0000000000000000 > [ 469.916616] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 000000000000fc80 > [ 469.923765] x2 : 0000000000000001 x1 : 0000000000000040 x0 : 0000000000000000 > [ 469.930915] Call trace: > [ 469.933365] skb_put+0xc/0x6c > [ 469.936342] oa_tc6_spi_thread_handler+0x438/0x880 [oa_tc6] > [ 469.941929] kthread+0x118/0x11c > [ 469.945166] ret_from_fork+0x10/0x20 > [ 469.948752] Code: d65f03c0 d503233f a9bf7bfd 910003fd (b9407406) > [ 469.954854] ---[ end trace 0000000000000000 ]--- > > > If interested I can try to get a recording with the logic analyzer and send it to you. I don't think it is needed. > > By the way in the other answer you attached a screenshot of the logic analyzer and you > have a very nice HLA for oa_tc6. Are they open-source or are there any plans to publish them? It is already available in the Microchip's github page for public. Checkout the below link for the same. https://github.com/MicrochipTech/oa-tc6-saleae-extension > >>> >>> - Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0) >>> and also exst is not active. I found out that there is an errata (DS80001075) point s3 >>> that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above. >>> The driver now works since about 2.5 weeks with various load with just one loss of frame >>> error where I had to reboot the system after about 4 days. >> It is good to hear that the driver works fine with the above changes. As >> mentioned in the errata, this continuous interrupt issue is a known >> issue with LAN8651 Rev.B0. Switching to LAN8651 Rev.B1 will solve this >> issue and no need of any workaround. Setting ZARFE bit in the CONFIG0 >> will solve the continuous interrupt issue but don't know how the above >> "receive buffer overflow" issue also solved. I think it is a good idea >> to test with LAN8651 Rev.B1 without setting ZARFE bit once. It would be >> interesting to see the result. I am always using LAN8651 Rev.B1 for my >> testing. > > Unfortunately I just have LAN8651 Rev. B0 Chips. Are you sure that the Rev B1 has the > issue fixed? The errata here says that B1 is affected too: > https://ww1.microchip.com/downloads/aemDocuments/documents/AIS/ProductDocuments/Errata/LAN8650-1-Errata-80001075.pdf As per my knowledge it is fixed in the Rev.B1 but as you said errata says the issue persists in both revisions. Let me check internally and get back to you on this. But it is always recommended to use Rev.B1 rather Rev.B0. If possible, I would suggest to use the latest one. Best regards, Parthiban V > >> >> I should be able to reproduce the "receive buffer overflow" issue and >> consequently kernel crash in my setup with LAN8651 Rev.B1 so that I can >> investigate the issue further. As I am not able to reproduce in my RPI >> 4, I need your support for the tests and applications you used in your >> setup. >>> >>> Is there a reason why you removed the netdev watchdog which was active in v2? >> When the timeout occurs, there is no further action except increasing >> tx_errors. Not seeing this except USB-to-Ethernet which can be removed >> unexpectedly. But this is SPI interface which will not be removed >> unexpectedly as it is a platform device. That's why we removed this. >> >> Best regards, >> Parthiban V >>> >>> Thanks, >>> Benjamin Bigler >>> >> > > Thanks, > Benjamin Bigler >