Message ID | 20240603184714.3697911-1-joshua.a.hay@intel.com (mailing list archive) |
---|---|
State | Awaiting Upstream |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [iwl-net] idpf: extend tx watchdog timeout | expand |
On 6/3/2024 11:47 AM, Joshua Hay wrote: > There are several reasons for a TX completion to take longer than usual > to be written back by HW. For example, the completion for a packet that > misses a rule will have increased latency. The side effect of these > variable latencies for any given packet is out of order completions. The > stack sends packet X and Y. If packet X takes longer because of the rule > miss in the example above, but packet Y hits, it can go on the wire > immediately. Which also means it can be completed first. The driver > will then receive a completion for packet Y before packet X. The driver > will stash the buffers for packet X in a hash table to allow the tx send > queue descriptors for both packet X and Y to be reused. The driver will > receive the completion for packet X sometime later and have to search > the hash table for the associated packet. > > The driver cleans packets directly on the ring first, i.e. not out of > order completions since they are to some extent considered "slow(er) > path". However, certain workloads can increase the frequency of out of > order completions thus introducing even more latency into the cleaning > path. Bump up the timeout value to account for these workloads. > > Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") > Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> > --- Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
On 6/3/2024 11:47 AM, Joshua Hay wrote: > > There are several reasons for a TX completion to take longer than usual > to be written back by HW. For example, the completion for a packet that > misses a rule will have increased latency. The side effect of these > variable latencies for any given packet is out of order completions. The > stack sends packet X and Y. If packet X takes longer because of the rule > miss in the example above, but packet Y hits, it can go on the wire > immediately. Which also means it can be completed first. The driver > will then receive a completion for packet Y before packet X. The driver > will stash the buffers for packet X in a hash table to allow the tx send > queue descriptors for both packet X and Y to be reused. The driver will > receive the completion for packet X sometime later and have to search > the hash table for the associated packet. > > The driver cleans packets directly on the ring first, i.e. not out of > order completions since they are to some extent considered "slow(er) > path". However, certain workloads can increase the frequency of out of > order completions thus introducing even more latency into the cleaning > path. Bump up the timeout value to account for these workloads. > > Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") > Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> > --- > drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c > index f1ee5584e8fa..3d4ae2ed9b96 100644 > --- a/drivers/net/ethernet/intel/idpf/idpf_lib.c > +++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c > @@ -770,8 +770,8 @@ static int idpf_cfg_netdev(struct idpf_vport *vport) > else > netdev->netdev_ops = &idpf_netdev_ops_singleq; > > - /* setup watchdog timeout value to be 5 second */ > - netdev->watchdog_timeo = 5 * HZ; > + /* setup watchdog timeout value to be 30 seconds */ > + netdev->watchdog_timeo = 30 * HZ; Huh... that's a pretty big number. If it really needs to be that big I wonder if there's something else that needs attention. sln > > netdev->dev_port = idx; > > -- > 2.39.2 > >
On 6/3/2024 11:47 AM, Joshua Hay wrote: > > There are several reasons for a TX completion to take longer than usual > to be written back by HW. For example, the completion for a packet that > misses a rule will have increased latency. The side effect of these > variable latencies for any given packet is out of order completions. The > stack sends packet X and Y. If packet X takes longer because of the rule > miss in the example above, but packet Y hits, it can go on the wire > immediately. Which also means it can be completed first. The driver > will then receive a completion for packet Y before packet X. The driver > will stash the buffers for packet X in a hash table to allow the tx send > queue descriptors for both packet X and Y to be reused. The driver will > receive the completion for packet X sometime later and have to search > the hash table for the associated packet. > > The driver cleans packets directly on the ring first, i.e. not out of > order completions since they are to some extent considered "slow(er) > path". However, certain workloads can increase the frequency of out of > order completions thus introducing even more latency into the cleaning > path. Bump up the timeout value to account for these workloads. > > Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") > Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> > --- > drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) We tested this patch with our intensive high-performance workloads that have been able to reproduce the issue, and it was sufficient to avoid tx timeouts. We also noticed that these longer timeouts are not unusual in the smartnic space: we see 100s or 50s timeouts for a few NICs, and other NICs receive this timeout as a hint from the fw. Reviewed-by: David Decotigny <ddecotig@google.com>
From: David Decotigny <ddecotig@gmail.com> Date: Tue, 4 Jun 2024 16:34:48 -0700 > > > On 6/3/2024 11:47 AM, Joshua Hay wrote: >> >> There are several reasons for a TX completion to take longer than usual >> to be written back by HW. For example, the completion for a packet that >> misses a rule will have increased latency. The side effect of these >> variable latencies for any given packet is out of order completions. The >> stack sends packet X and Y. If packet X takes longer because of the rule >> miss in the example above, but packet Y hits, it can go on the wire >> immediately. Which also means it can be completed first. The driver >> will then receive a completion for packet Y before packet X. The driver >> will stash the buffers for packet X in a hash table to allow the tx send >> queue descriptors for both packet X and Y to be reused. The driver will >> receive the completion for packet X sometime later and have to search >> the hash table for the associated packet. >> >> The driver cleans packets directly on the ring first, i.e. not out of >> order completions since they are to some extent considered "slow(er) >> path". However, certain workloads can increase the frequency of out of >> order completions thus introducing even more latency into the cleaning >> path. Bump up the timeout value to account for these workloads. >> >> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") >> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> >> --- >> drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) > > > We tested this patch with our intensive high-performance workloads that > have been able to reproduce the issue, and it was sufficient to avoid tx > timeouts. We also noticed that these longer timeouts are not unusual in > the smartnic space: we see 100s or 50s timeouts for a few NICs, and Example? > other NICs receive this timeout as a hint from the fw. > > Reviewed-by: David Decotigny <ddecotig@google.com> We either need to teach watchdog core to account OOOs or there's something really wrong in the driver. For example, how can we then distinguish if > 5 sec delay happened just due to an OOO or due to that something went bad with the HW? Note that there are several patches fixing Tx (incl. timeouts) in my tree, including yours (Joshua's) which you somehow didn't send yet ._. Maybe start from them first? I don't buy 30 seconds, at least for now. Maybe I'm missing something. Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com> Thanks, Olek
On 6/11/2024 3:44 AM, Alexander Lobakin wrote: > From: David Decotigny <ddecotig@gmail.com> > Date: Tue, 4 Jun 2024 16:34:48 -0700 > >> >> >> On 6/3/2024 11:47 AM, Joshua Hay wrote: >>> >>> There are several reasons for a TX completion to take longer than usual >>> to be written back by HW. For example, the completion for a packet that >>> misses a rule will have increased latency. The side effect of these >>> variable latencies for any given packet is out of order completions. The >>> stack sends packet X and Y. If packet X takes longer because of the rule >>> miss in the example above, but packet Y hits, it can go on the wire >>> immediately. Which also means it can be completed first. The driver >>> will then receive a completion for packet Y before packet X. The driver >>> will stash the buffers for packet X in a hash table to allow the tx send >>> queue descriptors for both packet X and Y to be reused. The driver will >>> receive the completion for packet X sometime later and have to search >>> the hash table for the associated packet. >>> >>> The driver cleans packets directly on the ring first, i.e. not out of >>> order completions since they are to some extent considered "slow(er) >>> path". However, certain workloads can increase the frequency of out of >>> order completions thus introducing even more latency into the cleaning >>> path. Bump up the timeout value to account for these workloads. >>> >>> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") >>> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >>> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> >>> --- >>> drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> >> We tested this patch with our intensive high-performance workloads that >> have been able to reproduce the issue, and it was sufficient to avoid tx >> timeouts. We also noticed that these longer timeouts are not unusual in >> the smartnic space: we see 100s or 50s timeouts for a few NICs, and > > Example? > >> other NICs receive this timeout as a hint from the fw. >> >> Reviewed-by: David Decotigny <ddecotig@google.com> > > We either need to teach watchdog core to account OOOs or there's > something really wrong in the driver. For example, how can we then > distinguish if > 5 sec delay happened just due to an OOO or due to that > something went bad with the HW? > > Note that there are several patches fixing Tx (incl. timeouts) in my > tree, including yours (Joshua's) which you somehow didn't send yet ._. > Maybe start from them first? I believe it was you who specifically asked our team to defer pushing any upstream patches while you were working on the XDP series "to avoid having to rebase", which was a reasonable request at the time. We also had no reason to believe the existing upstream idpf implementation was experiencing timeouts (it is being tested by numerous validation teams). So there was no urgency to get those patches upstream. Which patches in your tree do you believe fix specific timeout situations? It appears you pulled in some of the changes from the out-of-tree driver, but those were all enhancements. It wasn't until the workload that David mentioned was run on the current driver that we had any indication there were timeout issues. > > I don't buy 30 seconds, at least for now. Maybe I'm missing something. > > Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com> In the process of debugging the newly discovered timeout, our architecture team made it clear that 5 seconds is too low for this type of device, with a non deterministic pipeline where packets can take a number of exception/slow paths. Admittedly, we don't know the exact number, so the solution for the time being was to bump it up with a comfortable buffer. As we tune things and debug with various workloads, we can bring it back down. As David mentioned, there is precedent for an extended timeout for smartnics. Why is it suddenly unacceptable for Intel's device? > > Thanks, > Olek Thanks, Josh
From: Josh Hay <joshua.a.hay@intel.com> Date: Tue, 11 Jun 2024 11:13:53 -0700 > > > On 6/11/2024 3:44 AM, Alexander Lobakin wrote: >> From: David Decotigny <ddecotig@gmail.com> >> Date: Tue, 4 Jun 2024 16:34:48 -0700 [...] >> Note that there are several patches fixing Tx (incl. timeouts) in my >> tree, including yours (Joshua's) which you somehow didn't send yet ._. >> Maybe start from them first? > > I believe it was you who specifically asked our team to defer pushing > any upstream patches while you were working on the XDP series "to avoid > having to rebase", which was a reasonable request at the time. We also It was only related to the virtchnl refactoring and later I cancelled that when I realized it will go earlier than our series. > had no reason to believe the existing upstream idpf implementation was > experiencing timeouts (it is being tested by numerous validation teams). > So there was no urgency to get those patches upstream. Which patches in > your tree do you believe fix specific timeout situations? It appears you [0][1][2] > pulled in some of the changes from the out-of-tree driver, but those > were all enhancements. It wasn't until the workload that David mentioned No, there are all fixes. [0] is your from the OOT, extended. [1] is mine and never was in the OOT. [2] is your from the OOT, extended by Michał. They really do help. Note that there's one more Tx timeout patch from you in the OOT, but it actually broke Tx xD > was run on the current driver that we had any indication there were > timeout issues. > >> >> I don't buy 30 seconds, at least for now. Maybe I'm missing something. >> >> Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com> > > > In the process of debugging the newly discovered timeout, our > architecture team made it clear that 5 seconds is too low for this type > of device, with a non deterministic pipeline where packets can take a > number of exception/slow paths. Admittedly, we don't know the exact Slowpath which takes 30 seconds to complete, seriously? > number, so the solution for the time being was to bump it up with a > comfortable buffer. As we tune things and debug with various workloads, > we can bring it back down. As David mentioned, there is precedent for an > extended timeout for smartnics. Why is it suddenly unacceptable for > Intel's device? I don't know where this "suddenly" comes from. Because even 5 seconds is too much. HW usually send packets in microseconds if not faster. Extending the timeout will hide real issues and make debugging more difficult. I suspect this all is for OOO packets with an explicit sending timestamp passed from the userspace, but as I said, you need to teach the kernel watchdog to account them. Otherwise, I can ask the driver to send a packet in 31 seconds, what then, there will be a timeout and you will send a patch to extend it to 60 seconds? > >> >> Thanks, >> Olek > > Thanks, > Josh [0] https://github.com/alobakin/linux/commit/aad547037598 [1] https://github.com/alobakin/linux/commit/50f4c27ba13e [2] https://github.com/alobakin/linux/commit/4a9b6c5d0ee8 Thanks, Olek
On Tue, Jun 11, 2024 at 11:13 AM Josh Hay <joshua.a.hay@intel.com> wrote: > > > > On 6/11/2024 3:44 AM, Alexander Lobakin wrote: > > From: David Decotigny <ddecotig@gmail.com> > > Date: Tue, 4 Jun 2024 16:34:48 -0700 > > > >> > >> > >> On 6/3/2024 11:47 AM, Joshua Hay wrote: > >>> > >>> There are several reasons for a TX completion to take longer than usual > >>> to be written back by HW. For example, the completion for a packet that > >>> misses a rule will have increased latency. The side effect of these > >>> variable latencies for any given packet is out of order completions. The > >>> stack sends packet X and Y. If packet X takes longer because of the rule > >>> miss in the example above, but packet Y hits, it can go on the wire > >>> immediately. Which also means it can be completed first. The driver > >>> will then receive a completion for packet Y before packet X. The driver > >>> will stash the buffers for packet X in a hash table to allow the tx send > >>> queue descriptors for both packet X and Y to be reused. The driver will > >>> receive the completion for packet X sometime later and have to search > >>> the hash table for the associated packet. > >>> > >>> The driver cleans packets directly on the ring first, i.e. not out of > >>> order completions since they are to some extent considered "slow(er) > >>> path". However, certain workloads can increase the frequency of out of > >>> order completions thus introducing even more latency into the cleaning > >>> path. Bump up the timeout value to account for these workloads. > >>> > >>> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") > >>> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > >>> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> > >>> --- > >>> drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >> > >> > >> We tested this patch with our intensive high-performance workloads that > >> have been able to reproduce the issue, and it was sufficient to avoid tx > >> timeouts. We also noticed that these longer timeouts are not unusual in > >> the smartnic space: we see 100s or 50s timeouts for a few NICs, and > > > > Example? a sample: drivers/net/ethernet/cavium/thunder/nic.h:#define NICVF_TX_TIMEOUT (50 * HZ) drivers/net/ethernet/cavium/thunder/nicvf_main.c: netdev->watchdog_timeo = NICVF_TX_TIMEOUT; drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h:#define OTX2_TX_TIMEOUT (100 * HZ) drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c: netdev->watchdog_timeo = OTX2_TX_TIMEOUT; drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c: netdev->watchdog_timeo = OTX2_TX_TIMEOUT; drivers/net/ethernet/amazon/ena/ena_netdev.c: netdev->watchdog_timeo = msecs_to_jiffies(hints->netdev_wd_timeout); > > > >> other NICs receive this timeout as a hint from the fw. > >> > >> Reviewed-by: David Decotigny <ddecotig@google.com> > > > > We either need to teach watchdog core to account OOOs or there's > > something really wrong in the driver. For example, how can we then > > distinguish if > 5 sec delay happened just due to an OOO or due to that > > something went bad with the HW? > > > > Note that there are several patches fixing Tx (incl. timeouts) in my > > tree, including yours (Joshua's) which you somehow didn't send yet ._. > > Maybe start from them first? > > I believe it was you who specifically asked our team to defer pushing > any upstream patches while you were working on the XDP series "to avoid > having to rebase", which was a reasonable request at the time. We also > had no reason to believe the existing upstream idpf implementation was > experiencing timeouts (it is being tested by numerous validation teams). > So there was no urgency to get those patches upstream. Which patches in > your tree do you believe fix specific timeout situations? It appears you > pulled in some of the changes from the out-of-tree driver, but those > were all enhancements. It wasn't until the workload that David mentioned > was run on the current driver that we had any indication there were > timeout issues. Some issues were observed with high cpu/memory/network utilization workloads such as a SAP HANA benchmark. > > > > > I don't buy 30 seconds, at least for now. Maybe I'm missing something. > > > > Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com> > > > In the process of debugging the newly discovered timeout, our > architecture team made it clear that 5 seconds is too low for this type > of device, with a non deterministic pipeline where packets can take a > number of exception/slow paths. Admittedly, we don't know the exact > number, so the solution for the time being was to bump it up with a > comfortable buffer. As we tune things and debug with various workloads, > we can bring it back down. As David mentioned, there is precedent for an > extended timeout for smartnics. Why is it suddenly unacceptable for > Intel's device? > > > > > Thanks, > > Olek > > Thanks, > Josh -- David Decotigny -- David Decotigny
On 6/12/2024 2:34 AM, Alexander Lobakin wrote: > From: Josh Hay <joshua.a.hay@intel.com> > Date: Tue, 11 Jun 2024 11:13:53 -0700 > >> >> >> On 6/11/2024 3:44 AM, Alexander Lobakin wrote: >>> From: David Decotigny <ddecotig@gmail.com> >>> Date: Tue, 4 Jun 2024 16:34:48 -0700 > > [...] > >>> Note that there are several patches fixing Tx (incl. timeouts) in my >>> tree, including yours (Joshua's) which you somehow didn't send yet ._. >>> Maybe start from them first? >> >> I believe it was you who specifically asked our team to defer pushing >> any upstream patches while you were working on the XDP series "to avoid >> having to rebase", which was a reasonable request at the time. We also > > It was only related to the virtchnl refactoring and later I cancelled > that when I realized it will go earlier than our series. > >> had no reason to believe the existing upstream idpf implementation was >> experiencing timeouts (it is being tested by numerous validation teams). >> So there was no urgency to get those patches upstream. Which patches in >> your tree do you believe fix specific timeout situations? It appears you > > [0][1][2] > >> pulled in some of the changes from the out-of-tree driver, but those >> were all enhancements. It wasn't until the workload that David mentioned > > No, there are all fixes. > > [0] is your from the OOT, extended. > [1] is mine and never was in the OOT. > [2] is your from the OOT, extended by Michał. My main point was since no other tx timeouts have been reported on the upstream driver, none of these seem like critical fixes. This particular timeout signature did not seem to match any of these in general. E.g. it would have been obvious if the timeout was because of what [0] fixes. It's also possible these timeouts did not manifest on the upstream driver because it is missing other OOT changes. > > They really do help. No disagreement there. I would've loved to push these changes sooner, but we already covered why that didn't happen. > > Note that there's one more Tx timeout patch from you in the OOT, but it > actually broke Tx xD If you are implying that our team would commit code that is knowingly broken, that is absolutely not true. I believe what you're referring to is a change that introduced a tx timeout that also took a very specific workload to trigger it. That issue was fixed and not applicable to the current upstream implementation, so I do not see how that is relevant to this conversation. > >> was run on the current driver that we had any indication there were >> timeout issues. >> >>> >>> I don't buy 30 seconds, at least for now. Maybe I'm missing something. >>> >>> Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com> >> >> >> In the process of debugging the newly discovered timeout, our >> architecture team made it clear that 5 seconds is too low for this type >> of device, with a non deterministic pipeline where packets can take a >> number of exception/slow paths. Admittedly, we don't know the exact > > Slowpath which takes 30 seconds to complete, seriously? The architecture team said 5s is too low. 30s was chosen to give ample cushion to avoid changing the timeout should this situation come up again. > >> number, so the solution for the time being was to bump it up with a >> comfortable buffer. As we tune things and debug with various workloads, >> we can bring it back down. As David mentioned, there is precedent for an >> extended timeout for smartnics. Why is it suddenly unacceptable for >> Intel's device? > > I don't know where this "suddenly" comes from. > Because even 5 seconds is too much. > HW usually send packets in microseconds if not faster. Extending the > timeout will hide real issues and make debugging more difficult. Can you please elaborate on exactly why it would be more difficult? If something is actually wrong in HW, it seems unlikely extra time would correct it. If something is functionally wrong in the driver, why does it matter if it's 5s, 15s, or 30s? It will timeout just the same. > > I suspect this all is for OOO packets with an explicit sending timestamp > passed from the userspace, but as I said, you need to teach the kernel > watchdog to account them. Out of order completions can happen for numerous reasons, some of which the driver will know nothing about, i.e. the userspace timestamps are not the only things that trigger OOO completions. We can detect that we're processing completion B before A, but it's only at that time that we can tell the stack to _maybe_ expect a delayed completion. I'm open to discussing this further, but it does not seem like a simple solution that can be implemented in the immediate future. > Otherwise, I can ask the driver to send a packet in 31 seconds, what > then, there will be a timeout and you will send a patch to extend it to > 60 seconds? I hope there are checks in the stack itself that would not allow the packet to be scheduled beyond the timeout window :) > >> >>> >>> Thanks, >>> Olek >> >> Thanks, >> Josh > > [0] https://github.com/alobakin/linux/commit/aad547037598 > [1] https://github.com/alobakin/linux/commit/50f4c27ba13e > [2] https://github.com/alobakin/linux/commit/4a9b6c5d0ee8 > > Thanks, > Olek Thanks, Josh
From: David Decotigny <ddecotig@google.com> Date: Wed, 12 Jun 2024 11:01:46 -0700 > On Tue, Jun 11, 2024 at 11:13 AM Josh Hay <joshua.a.hay@intel.com> wrote: >> >> >> >> On 6/11/2024 3:44 AM, Alexander Lobakin wrote: >>> From: David Decotigny <ddecotig@gmail.com> >>> Date: Tue, 4 Jun 2024 16:34:48 -0700 >>> >>>> >>>> >>>> On 6/3/2024 11:47 AM, Joshua Hay wrote: >>>>> >>>>> There are several reasons for a TX completion to take longer than usual >>>>> to be written back by HW. For example, the completion for a packet that >>>>> misses a rule will have increased latency. The side effect of these >>>>> variable latencies for any given packet is out of order completions. The >>>>> stack sends packet X and Y. If packet X takes longer because of the rule >>>>> miss in the example above, but packet Y hits, it can go on the wire >>>>> immediately. Which also means it can be completed first. The driver >>>>> will then receive a completion for packet Y before packet X. The driver >>>>> will stash the buffers for packet X in a hash table to allow the tx send >>>>> queue descriptors for both packet X and Y to be reused. The driver will >>>>> receive the completion for packet X sometime later and have to search >>>>> the hash table for the associated packet. >>>>> >>>>> The driver cleans packets directly on the ring first, i.e. not out of >>>>> order completions since they are to some extent considered "slow(er) >>>>> path". However, certain workloads can increase the frequency of out of >>>>> order completions thus introducing even more latency into the cleaning >>>>> path. Bump up the timeout value to account for these workloads. >>>>> >>>>> Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") >>>>> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >>>>> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> >>>>> --- >>>>> drivers/net/ethernet/intel/idpf/idpf_lib.c | 4 ++-- >>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>> >>>> >>>> We tested this patch with our intensive high-performance workloads that >>>> have been able to reproduce the issue, and it was sufficient to avoid tx >>>> timeouts. We also noticed that these longer timeouts are not unusual in >>>> the smartnic space: we see 100s or 50s timeouts for a few NICs, and >>> >>> Example? > > a sample: > > drivers/net/ethernet/cavium/thunder/nic.h:#define > NICVF_TX_TIMEOUT (50 * HZ) > drivers/net/ethernet/cavium/thunder/nicvf_main.c: > netdev->watchdog_timeo = NICVF_TX_TIMEOUT; > drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h:#define > OTX2_TX_TIMEOUT (100 * HZ) > drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c: > netdev->watchdog_timeo = OTX2_TX_TIMEOUT; > drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c: > netdev->watchdog_timeo = OTX2_TX_TIMEOUT; > drivers/net/ethernet/amazon/ena/ena_netdev.c: > netdev->watchdog_timeo = msecs_to_jiffies(hints->netdev_wd_timeout); This one doesn't say anything at all :D mlx5 has 15 seconds, but mlx4 has the same TO as well, so this might be some legacy stuff. Netronome has 5, QLogic has 5, Broadcom 5 etc etc. These 50-100 belong to one vendor (Cavium is Marvell) and look like a hack to hide HW issues. Re "some issues were observed" -- this patch only hides the symptoms, at least from what I'm seeing currently. Still no details, so that I could understand the reasons for it. Thanks, Olek
From: Josh Hay <joshua.a.hay@intel.com> Date: Wed, 12 Jun 2024 23:36:29 -0700 > > > On 6/12/2024 2:34 AM, Alexander Lobakin wrote: >> From: Josh Hay <joshua.a.hay@intel.com> >> Date: Tue, 11 Jun 2024 11:13:53 -0700 >> >>> >>> >>> On 6/11/2024 3:44 AM, Alexander Lobakin wrote: >>>> From: David Decotigny <ddecotig@gmail.com> >>>> Date: Tue, 4 Jun 2024 16:34:48 -0700 >> >> [...] >> >>>> Note that there are several patches fixing Tx (incl. timeouts) in my >>>> tree, including yours (Joshua's) which you somehow didn't send yet ._. >>>> Maybe start from them first? >>> >>> I believe it was you who specifically asked our team to defer pushing >>> any upstream patches while you were working on the XDP series "to avoid >>> having to rebase", which was a reasonable request at the time. We also >> >> It was only related to the virtchnl refactoring and later I cancelled >> that when I realized it will go earlier than our series. >> >>> had no reason to believe the existing upstream idpf implementation was >>> experiencing timeouts (it is being tested by numerous validation teams). >>> So there was no urgency to get those patches upstream. Which patches in >>> your tree do you believe fix specific timeout situations? It appears you >> >> [0][1][2] >> >>> pulled in some of the changes from the out-of-tree driver, but those >>> were all enhancements. It wasn't until the workload that David mentioned >> >> No, there are all fixes. >> >> [0] is your from the OOT, extended. > [1] is mine and never was in the >> OOT. >> [2] is your from the OOT, extended by Michał. > > My main point was since no other tx timeouts have been reported on the > upstream driver, none of these seem like critical fixes. This particular Lots of bugs have never been reported, this doesn't mean they don't exist. > timeout signature did not seem to match any of these in general. E.g. it > would have been obvious if the timeout was because of what [0] fixes. > It's also possible these timeouts did not manifest on the upstream > driver because it is missing other OOT changes. What I'm saying is that why not try to reproduce the issues that this patch tries to hide on my tree with [0][1][2] first and see whether they're still here? If they fix the issue, then why extend the timeout? > >> >> They really do help. > > No disagreement there. I would've loved to push these changes sooner, > but we already covered why that didn't happen. Because "they're not critical"? Intel's been having Tx timeouts on idpf for years. I'd say that at least [1] is critical since this is obvious bug. Without [2], XDP just doesn't work on idpf. Why isn't [0] critical since without this patch, you stash partial packets? And in some cases, ntc/ntu can go beyond the ring size? Is it okay? > >> >> Note that there's one more Tx timeout patch from you in the OOT, but it >> actually broke Tx xD > > If you are implying that our team would commit code that is knowingly > broken, that is absolutely not true. I believe what you're referring to I didn't say that. I just said that I tried to pull your latest yet another Tx timeout fix and with it, I have more timeouts than without. > is a change that introduced a tx timeout that also took a very specific > workload to trigger it. That issue was fixed and not applicable to the > current upstream implementation, so I do not see how that is relevant to > this conversation. It was just a side note, don't focus too much on such instead of really important stuff. > >> >>> was run on the current driver that we had any indication there were >>> timeout issues. >>> >>>> >>>> I don't buy 30 seconds, at least for now. Maybe I'm missing something. >>>> >>>> Nacked-by: Alexander Lobakin <aleksander.lobakin@intel.com> >>> >>> >>> In the process of debugging the newly discovered timeout, our >>> architecture team made it clear that 5 seconds is too low for this type >>> of device, with a non deterministic pipeline where packets can take a >>> number of exception/slow paths. Admittedly, we don't know the exact >> >> Slowpath which takes 30 seconds to complete, seriously? > > The architecture team said 5s is too low. 30s was chosen to give ample > cushion to avoid changing the timeout should this situation come up again. Just "said", without any details why? Now I can say that 5 is too much and we need to have 1 second there. Why believe them, not me, w/o any arguments/explanation? (I at least have an argument that usually packets get sent in a couple us or faster and even 1 second between triggering the sending and receiving a completion is too much and if that happens, there were some HW issues) > >> >>> number, so the solution for the time being was to bump it up with a >>> comfortable buffer. As we tune things and debug with various workloads, >>> we can bring it back down. As David mentioned, there is precedent for an >>> extended timeout for smartnics. Why is it suddenly unacceptable for >>> Intel's device? >> >> I don't know where this "suddenly" comes from. >> Because even 5 seconds is too much. >> HW usually send packets in microseconds if not faster. Extending the >> timeout will hide real issues and make debugging more difficult. > > Can you please elaborate on exactly why it would be more difficult? If > something is actually wrong in HW, it seems unlikely extra time would > correct it. If something is functionally wrong in the driver, why does > it matter if it's 5s, 15s, or 30s? It will timeout just the same. HW can hang and perform a reset and if you have 30 seconds for Tx timeout, you can even not notice it. Having 5 seconds or lower for the timeout will most likely spot it. > >> >> I suspect this all is for OOO packets with an explicit sending timestamp >> passed from the userspace, but as I said, you need to teach the kernel >> watchdog to account them. > > Out of order completions can happen for numerous reasons, some of which > the driver will know nothing about, i.e. the userspace timestamps are > not the only things that trigger OOO completions. We can detect that > we're processing completion B before A, but it's only at that time that > we can tell the stack to _maybe_ expect a delayed completion. I'm open > to discussing this further, but it does not seem like a simple solution > that can be implemented in the immediate future. I still didn't get an explanation why a packet can receive an OOO completion in whole 30 seconds instead of a couple microseconds. I also don't understand why we can receive an OOO completion for packets that were send without an explicit timestamp from the userspace. Without the timestamp (or some sort of priority etc.), they should be sent in the same order as they were passed to the driver, right? If so, why a packet can get an OOO completion, which means it was sent not in the same order as the driver got them? Does HW do some prioritizing even if the kernel didn't ask for it? But even if so, HW shouldn't in general delay sending for 30 seconds? > >> Otherwise, I can ask the driver to send a packet in 31 seconds, what >> then, there will be a timeout and you will send a patch to extend it to >> 60 seconds? > > I hope there are checks in the stack itself that would not allow the > packet to be scheduled beyond the timeout window :) I don't think so, but since I haven't read the actual code, I won't say anything particular. Anyway, it was just a side note, my main concerns are above. Thanks, Olek
diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c index f1ee5584e8fa..3d4ae2ed9b96 100644 --- a/drivers/net/ethernet/intel/idpf/idpf_lib.c +++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c @@ -770,8 +770,8 @@ static int idpf_cfg_netdev(struct idpf_vport *vport) else netdev->netdev_ops = &idpf_netdev_ops_singleq; - /* setup watchdog timeout value to be 5 second */ - netdev->watchdog_timeo = 5 * HZ; + /* setup watchdog timeout value to be 30 seconds */ + netdev->watchdog_timeo = 30 * HZ; netdev->dev_port = idx;