Message ID | 20240130095353.2881-8-darinzon@amazon.com (mailing list archive)
---|---
State | Accepted
Commit | 071271f39ce833a3534d1fbd47174d1bed6d9326
Delegated to: | Netdev Maintainers
Series | ENA driver changes
On Tue, 2024-01-30 at 09:53 +0000, darinzon@amazon.com wrote:
> @@ -3408,25 +3437,45 @@ static int check_missing_comp_in_tx_queue(struct ena_adapter *adapter,
> 						   adapter->missing_tx_completion_to);
> 
> 		if (unlikely(is_tx_comp_time_expired)) {
> -			if (!tx_buf->print_once) {
> -				time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> -				missing_tx_comp_to = jiffies_to_msecs(adapter->missing_tx_completion_to);
> -				netif_notice(adapter, tx_err, adapter->netdev,
> -					     "Found a Tx that wasn't completed on time, qid %d, index %d. %u usecs have passed since last napi execution. Missing Tx timeout value %u msecs\n",
> -					     tx_ring->qid, i, time_since_last_napi, missing_tx_comp_to);
> +			time_since_last_napi =
> +				jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> +			napi_scheduled = !!(ena_napi->napi.state & NAPIF_STATE_SCHED);
> +
> +			if (missing_tx_comp_to < time_since_last_napi && napi_scheduled) {
> +				/* We suspect napi isn't called because the
> +				 * bottom half is not run. Require a bigger
> +				 * timeout for these cases
> +				 */

Not blocking this series, but I guess the above "the bottom half is not
run", after commit d15121be7485655129101f3960ae6add40204463, happens
only when running in napi threaded mode, right?

cheers,

Paolo
On Tue, Jan 30, 2024 at 09:53:49AM +0000, darinzon@amazon.com wrote:
> From: David Arinzon <darinzon@amazon.com>
> 
> The function responsible for polling TX completions might not receive
> the CPU resources it needs due to higher priority tasks running on the
> requested core.
> 
> The driver might not be able to recognize such cases, but it can use its
> state to suspect that they happened. If both conditions are met:
> 
> - napi hasn't been executed more than the TX completion timeout value
> - napi is scheduled (meaning that we've received an interrupt)
> 
> Then it's more likely that the napi handler isn't scheduled because of
> an overloaded CPU.
> It was decided that for this case, the driver would wait twice as long
> as the regular timeout before scheduling a reset.
> The driver uses ENA_REGS_RESET_SUSPECTED_POLL_STARVATION reset reason to
> indicate this case to the device.
> 
> This patch also adds more information to the ena_tx_timeout() callback.
> This function is called by the kernel when it detects that a specific TX
> queue has been closed for too long.
> 
> Signed-off-by: Shay Agroskin <shayagr@amazon.com>
> Signed-off-by: David Arinzon <darinzon@amazon.com>
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 77 +++++++++++++++----
>  .../net/ethernet/amazon/ena/ena_regs_defs.h  |  1 +
>  2 files changed, 64 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 18acb76..ae9291b 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -47,19 +47,44 @@ static int ena_restore_device(struct ena_adapter *adapter);
>  
>  static void ena_tx_timeout(struct net_device *dev, unsigned int txqueue)
>  {
> +	enum ena_regs_reset_reason_types reset_reason = ENA_REGS_RESET_OS_NETDEV_WD;
>  	struct ena_adapter *adapter = netdev_priv(dev);
> +	unsigned int time_since_last_napi, threshold;
> +	struct ena_ring *tx_ring;
> +	int napi_scheduled;
> +
> +	if (txqueue >= adapter->num_io_queues) {
> +		netdev_err(dev, "TX timeout on invalid queue %u\n", txqueue);
> +		goto schedule_reset;
> +	}
> +
> +	threshold = jiffies_to_usecs(dev->watchdog_timeo);
> +	tx_ring = &adapter->tx_ring[txqueue];
> +
> +	time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> +	napi_scheduled = !!(tx_ring->napi->state & NAPIF_STATE_SCHED);
>  
> +	netdev_err(dev,
> +		   "TX q %d is paused for too long (threshold %u). Time since last napi %u usec. napi scheduled: %d\n",
> +		   txqueue,
> +		   threshold,
> +		   time_since_last_napi,
> +		   napi_scheduled);
> +
> +	if (threshold < time_since_last_napi && napi_scheduled) {
> +		netdev_err(dev,
> +			   "napi handler hasn't been called for a long time but is scheduled\n");
> +			reset_reason = ENA_REGS_RESET_SUSPECTED_POLL_STARVATION;

Hi David,

a nit from my side: the line above is indented one tab-stop too many.
No need to respin just for this AFAIC.

> +	}
> +schedule_reset:
>  	/* Change the state of the device to trigger reset
>  	 * Check that we are not in the middle or a trigger already
>  	 */
> -
>  	if (test_and_set_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))
>  		return;
>  
> -	ena_reset_device(adapter, ENA_REGS_RESET_OS_NETDEV_WD);
> +	ena_reset_device(adapter, reset_reason);
>  	ena_increase_stat(&adapter->dev_stats.tx_timeout, 1, &adapter->syncp);
> -
> -	netif_err(adapter, tx_err, dev, "Transmit time out\n");
>  }
> 
> ...
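The two-condition heuristic in the commit message above (napi overdue past the TX completion timeout, yet still scheduled, so wait twice as long before resetting) can be sketched as a plain-C userspace simulation. All names and unit choices here are illustrative, not the driver's actual code, which works in jiffies and uses the ENA_REGS_RESET_* reason codes:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the driver's reset reason codes. */
enum reset_reason {
	RESET_MISS_TX_CMPL,
	RESET_SUSPECTED_POLL_STARVATION,
};

struct tx_probe {
	unsigned int usecs_since_last_napi;    /* time since napi last ran */
	bool napi_scheduled;                   /* interrupt arrived, poll pending */
	unsigned int usecs_completion_missing; /* age of the stuck completion */
};

/* Returns true if a reset should be scheduled now and fills *reason.
 * timeout_usecs is the regular missing-TX-completion timeout.
 */
static bool should_reset(const struct tx_probe *p, unsigned int timeout_usecs,
			 enum reset_reason *reason)
{
	if (timeout_usecs < p->usecs_since_last_napi && p->napi_scheduled) {
		/* Suspected poll starvation: napi is pending but hasn't
		 * run, likely due to CPU overload. Require double the
		 * timeout before giving up.
		 */
		if (p->usecs_completion_missing < 2 * timeout_usecs)
			return false;
		*reason = RESET_SUSPECTED_POLL_STARVATION;
		return true;
	}

	*reason = RESET_MISS_TX_CMPL;
	return p->usecs_completion_missing >= timeout_usecs;
}
```

The design point mirrored here is that the escalated reason is only reported once the doubled grace period has also expired; before that, the check simply declines to reset.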
> On Tue, Jan 30, 2024 at 09:53:49AM +0000, darinzon@amazon.com wrote:
> > From: David Arinzon <darinzon@amazon.com>
> >
> > The function responsible for polling TX completions might not receive
> > the CPU resources it needs due to higher priority tasks running on the
> > requested core.
> >
> > The driver might not be able to recognize such cases, but it can use
> > its state to suspect that they happened. If both conditions are met:
> >
> > - napi hasn't been executed more than the TX completion timeout value
> > - napi is scheduled (meaning that we've received an interrupt)
> >
> > Then it's more likely that the napi handler isn't scheduled because of
> > an overloaded CPU.
> > It was decided that for this case, the driver would wait twice as long
> > as the regular timeout before scheduling a reset.
> > The driver uses ENA_REGS_RESET_SUSPECTED_POLL_STARVATION reset reason
> > to indicate this case to the device.
> >
> > This patch also adds more information to the ena_tx_timeout() callback.
> > This function is called by the kernel when it detects that a specific
> > TX queue has been closed for too long.
> >
> > Signed-off-by: Shay Agroskin <shayagr@amazon.com>
> > Signed-off-by: David Arinzon <darinzon@amazon.com>
> > ---
> >  drivers/net/ethernet/amazon/ena/ena_netdev.c | 77 +++++++++++++++----
> >  .../net/ethernet/amazon/ena/ena_regs_defs.h  |  1 +
> >  2 files changed, 64 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> > index 18acb76..ae9291b 100644
> > --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> > +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> > @@ -47,19 +47,44 @@ static int ena_restore_device(struct ena_adapter *adapter);
> >
> >  static void ena_tx_timeout(struct net_device *dev, unsigned int txqueue)
> >  {
> > +	enum ena_regs_reset_reason_types reset_reason = ENA_REGS_RESET_OS_NETDEV_WD;
> >  	struct ena_adapter *adapter = netdev_priv(dev);
> > +	unsigned int time_since_last_napi, threshold;
> > +	struct ena_ring *tx_ring;
> > +	int napi_scheduled;
> > +
> > +	if (txqueue >= adapter->num_io_queues) {
> > +		netdev_err(dev, "TX timeout on invalid queue %u\n", txqueue);
> > +		goto schedule_reset;
> > +	}
> > +
> > +	threshold = jiffies_to_usecs(dev->watchdog_timeo);
> > +	tx_ring = &adapter->tx_ring[txqueue];
> > +
> > +	time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> > +	napi_scheduled = !!(tx_ring->napi->state & NAPIF_STATE_SCHED);
> >
> > +	netdev_err(dev,
> > +		   "TX q %d is paused for too long (threshold %u). Time since last napi %u usec. napi scheduled: %d\n",
> > +		   txqueue,
> > +		   threshold,
> > +		   time_since_last_napi,
> > +		   napi_scheduled);
> > +
> > +	if (threshold < time_since_last_napi && napi_scheduled) {
> > +		netdev_err(dev,
> > +			   "napi handler hasn't been called for a long time but is scheduled\n");
> > +			reset_reason = ENA_REGS_RESET_SUSPECTED_POLL_STARVATION;
>
> Hi David,
>
> a nit from my side: the line above is indented one tab-stop too many.
> No need to respin just for this AFAIC.

Hi Simon,

Thanks for pointing it out. Seems like I got carried away a bit with the
indentation due to the print above it.

Thanks,
David

> > +	}
> > +schedule_reset:
> >  	/* Change the state of the device to trigger reset
> >  	 * Check that we are not in the middle or a trigger already
> >  	 */
> > -
> >  	if (test_and_set_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))
> >  		return;
> >
> > -	ena_reset_device(adapter, ENA_REGS_RESET_OS_NETDEV_WD);
> > +	ena_reset_device(adapter, reset_reason);
> >  	ena_increase_stat(&adapter->dev_stats.tx_timeout, 1, &adapter->syncp);
> > -
> > -	netif_err(adapter, tx_err, dev, "Transmit time out\n");
> >  }
> >
> > ...
> On Tue, 2024-01-30 at 09:53 +0000, darinzon@amazon.com wrote:
> > @@ -3408,25 +3437,45 @@ static int check_missing_comp_in_tx_queue(struct ena_adapter *adapter,
> > 						   adapter->missing_tx_completion_to);
> >
> > 		if (unlikely(is_tx_comp_time_expired)) {
> > -			if (!tx_buf->print_once) {
> > -				time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> > -				missing_tx_comp_to = jiffies_to_msecs(adapter->missing_tx_completion_to);
> > -				netif_notice(adapter, tx_err, adapter->netdev,
> > -					     "Found a Tx that wasn't completed on time, qid %d, index %d. %u usecs have passed since last napi execution. Missing Tx timeout value %u msecs\n",
> > -					     tx_ring->qid, i, time_since_last_napi, missing_tx_comp_to);
> > +			time_since_last_napi =
> > +				jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> > +			napi_scheduled = !!(ena_napi->napi.state & NAPIF_STATE_SCHED);
> > +
> > +			if (missing_tx_comp_to < time_since_last_napi && napi_scheduled) {
> > +				/* We suspect napi isn't called because the
> > +				 * bottom half is not run. Require a bigger
> > +				 * timeout for these cases
> > +				 */
>
> Not blocking this series, but I guess the above "the bottom half is not
> run", after commit d15121be7485655129101f3960ae6add40204463, happens
> only when running in napi threaded mode, right?
>
> cheers,
>
> Paolo

Hi Paolo,

The ENA driver napi routine doesn't run in threaded mode.
We've seen cases where napi is indeed scheduled, but didn't get a chance
to run for a noticeable amount of time and process TX completions, and
based on that we conclude that there's a high CPU load that doesn't
allow the routine to run in a timely manner.
Based on the information in d15121be7485655129101f3960ae6add40204463,
the observed stalls are in the magnitude of milliseconds, the above code
is actually an additional grace time, and the timeouts here are in
seconds.

Thanks,
David
On Thu, 2024-02-01 at 13:21 +0000, Arinzon, David wrote:
> > On Tue, 2024-01-30 at 09:53 +0000, darinzon@amazon.com wrote:
> > > @@ -3408,25 +3437,45 @@ static int check_missing_comp_in_tx_queue(struct ena_adapter *adapter,
> > > 						   adapter->missing_tx_completion_to);
> > >
> > > 		if (unlikely(is_tx_comp_time_expired)) {
> > > -			if (!tx_buf->print_once) {
> > > -				time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> > > -				missing_tx_comp_to = jiffies_to_msecs(adapter->missing_tx_completion_to);
> > > -				netif_notice(adapter, tx_err, adapter->netdev,
> > > -					     "Found a Tx that wasn't completed on time, qid %d, index %d. %u usecs have passed since last napi execution. Missing Tx timeout value %u msecs\n",
> > > -					     tx_ring->qid, i, time_since_last_napi, missing_tx_comp_to);
> > > +			time_since_last_napi =
> > > +				jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> > > +			napi_scheduled = !!(ena_napi->napi.state & NAPIF_STATE_SCHED);
> > > +
> > > +			if (missing_tx_comp_to < time_since_last_napi && napi_scheduled) {
> > > +				/* We suspect napi isn't called because the
> > > +				 * bottom half is not run. Require a bigger
> > > +				 * timeout for these cases
> > > +				 */
> >
> > Not blocking this series, but I guess the above "the bottom half is not
> > run", after commit d15121be7485655129101f3960ae6add40204463, happens
> > only when running in napi threaded mode, right?
> >
> > cheers,
> >
> > Paolo
>
> Hi Paolo,
>
> The ENA driver napi routine doesn't run in threaded mode.

... unless you do:

echo 1 > /sys/class/net/<nic name>/threaded

:)

> We've seen cases where napi is indeed scheduled, but didn't get a chance
> to run for a noticeable amount of time and process TX completions, and
> based on that we conclude that there's a high CPU load that doesn't
> allow the routine to run in a timely manner.
> Based on the information in d15121be7485655129101f3960ae6add40204463,
> the observed stalls are in the magnitude of milliseconds, the above code
> is actually an additional grace time, and the timeouts here are in
> seconds.

Do I read correctly that in your scenario the napi instance is not
scheduled for _seconds_? That smells like a serious bug somewhere else
really worthy of more investigation.

Cheers,

Paolo
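Paolo's remark above refers to the standard per-interface sysfs knob for threaded NAPI. A hedged userspace sketch of toggling it from C (the path format is the documented one; "eth0" and the helper names are just examples, and the write fails harmlessly if the interface or attribute doesn't exist):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path for the threaded-NAPI attribute of an interface. */
static int threaded_sysfs_path(char *buf, size_t len, const char *ifname)
{
	return snprintf(buf, len, "/sys/class/net/%s/threaded", ifname);
}

/* Best-effort toggle: writing "1" moves the interface's napi processing
 * to a kernel thread, "0" returns it to softirq context. Returns 0 on
 * success, -1 if the file can't be opened (missing interface, no
 * permission, or a kernel without threaded-NAPI support).
 */
static int set_threaded_napi(const char *ifname, int on)
{
	char path[128];
	FILE *f;

	threaded_sysfs_path(path, sizeof(path), ifname);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", !!on);
	return fclose(f) == 0 ? 0 : -1;
}
```

This is the userspace side of the `echo 1 > /sys/class/net/<nic name>/threaded` command in the email above.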
> On Thu, 2024-02-01 at 13:21 +0000, Arinzon, David wrote:
> > > On Tue, 2024-01-30 at 09:53 +0000, darinzon@amazon.com wrote:
> > > > @@ -3408,25 +3437,45 @@ static int check_missing_comp_in_tx_queue(struct ena_adapter *adapter,
> > > > 						   adapter->missing_tx_completion_to);
> > > >
> > > > 		if (unlikely(is_tx_comp_time_expired)) {
> > > > -			if (!tx_buf->print_once) {
> > > > -				time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> > > > -				missing_tx_comp_to = jiffies_to_msecs(adapter->missing_tx_completion_to);
> > > > -				netif_notice(adapter, tx_err, adapter->netdev,
> > > > -					     "Found a Tx that wasn't completed on time, qid %d, index %d. %u usecs have passed since last napi execution. Missing Tx timeout value %u msecs\n",
> > > > -					     tx_ring->qid, i, time_since_last_napi, missing_tx_comp_to);
> > > > +			time_since_last_napi =
> > > > +				jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
> > > > +			napi_scheduled = !!(ena_napi->napi.state & NAPIF_STATE_SCHED);
> > > > +
> > > > +			if (missing_tx_comp_to < time_since_last_napi && napi_scheduled) {
> > > > +				/* We suspect napi isn't called because the
> > > > +				 * bottom half is not run. Require a bigger
> > > > +				 * timeout for these cases
> > > > +				 */
> > >
> > > Not blocking this series, but I guess the above "the bottom half is
> > > not run", after commit d15121be7485655129101f3960ae6add40204463,
> > > happens only when running in napi threaded mode, right?
> > >
> > > cheers,
> > >
> > > Paolo
> >
> > Hi Paolo,
> >
> > The ENA driver napi routine doesn't run in threaded mode.
>
> ... unless you do:
>
> echo 1 > /sys/class/net/<nic name>/threaded
>
> :)

Thanks for pointing this out. We will look into this further.

> > We've seen cases where napi is indeed scheduled, but didn't get a
> > chance to run for a noticeable amount of time and process TX
> > completions, and based on that we conclude that there's a high CPU
> > load that doesn't allow the routine to run in a timely manner.
> > Based on the information in d15121be7485655129101f3960ae6add40204463,
> > the observed stalls are in the magnitude of milliseconds, the above
> > code is actually an additional grace time, and the timeouts here are
> > in seconds.
>
> Do I read correctly that in your scenario the napi instance is not
> scheduled for _seconds_? That smells like a serious bug somewhere else
> really worthy of more investigation.
>
> Cheers,
>
> Paolo

Thanks for noting this. It is something that we're actively monitoring
and looking to improve.

Thanks,
David
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 18acb76..ae9291b 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -47,19 +47,44 @@ static int ena_restore_device(struct ena_adapter *adapter);
 
 static void ena_tx_timeout(struct net_device *dev, unsigned int txqueue)
 {
+	enum ena_regs_reset_reason_types reset_reason = ENA_REGS_RESET_OS_NETDEV_WD;
 	struct ena_adapter *adapter = netdev_priv(dev);
+	unsigned int time_since_last_napi, threshold;
+	struct ena_ring *tx_ring;
+	int napi_scheduled;
+
+	if (txqueue >= adapter->num_io_queues) {
+		netdev_err(dev, "TX timeout on invalid queue %u\n", txqueue);
+		goto schedule_reset;
+	}
+
+	threshold = jiffies_to_usecs(dev->watchdog_timeo);
+	tx_ring = &adapter->tx_ring[txqueue];
+
+	time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
+	napi_scheduled = !!(tx_ring->napi->state & NAPIF_STATE_SCHED);
 
+	netdev_err(dev,
+		   "TX q %d is paused for too long (threshold %u). Time since last napi %u usec. napi scheduled: %d\n",
+		   txqueue,
+		   threshold,
+		   time_since_last_napi,
+		   napi_scheduled);
+
+	if (threshold < time_since_last_napi && napi_scheduled) {
+		netdev_err(dev,
+			   "napi handler hasn't been called for a long time but is scheduled\n");
+			reset_reason = ENA_REGS_RESET_SUSPECTED_POLL_STARVATION;
+	}
+schedule_reset:
 	/* Change the state of the device to trigger reset
 	 * Check that we are not in the middle or a trigger already
 	 */
-
 	if (test_and_set_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))
 		return;
 
-	ena_reset_device(adapter, ENA_REGS_RESET_OS_NETDEV_WD);
+	ena_reset_device(adapter, reset_reason);
 	ena_increase_stat(&adapter->dev_stats.tx_timeout, 1, &adapter->syncp);
-
-	netif_err(adapter, tx_err, dev, "Transmit time out\n");
 }
 
 static void update_rx_ring_mtu(struct ena_adapter *adapter, int mtu)
@@ -3374,14 +3399,18 @@ static int check_missing_comp_in_tx_queue(struct ena_adapter *adapter,
 					  struct ena_ring *tx_ring)
 {
 	struct ena_napi *ena_napi = container_of(tx_ring->napi, struct ena_napi, napi);
+	enum ena_regs_reset_reason_types reset_reason = ENA_REGS_RESET_MISS_TX_CMPL;
 	unsigned int time_since_last_napi;
 	unsigned int missing_tx_comp_to;
 	bool is_tx_comp_time_expired;
 	struct ena_tx_buffer *tx_buf;
 	unsigned long last_jiffies;
+	int napi_scheduled;
 	u32 missed_tx = 0;
 	int i, rc = 0;
 
+	missing_tx_comp_to = jiffies_to_msecs(adapter->missing_tx_completion_to);
+
 	for (i = 0; i < tx_ring->ring_size; i++) {
 		tx_buf = &tx_ring->tx_buffer_info[i];
 		last_jiffies = tx_buf->last_jiffies;
@@ -3408,25 +3437,45 @@ static int check_missing_comp_in_tx_queue(struct ena_adapter *adapter,
 							   adapter->missing_tx_completion_to);
 
 		if (unlikely(is_tx_comp_time_expired)) {
-			if (!tx_buf->print_once) {
-				time_since_last_napi = jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
-				missing_tx_comp_to = jiffies_to_msecs(adapter->missing_tx_completion_to);
-				netif_notice(adapter, tx_err, adapter->netdev,
-					     "Found a Tx that wasn't completed on time, qid %d, index %d. %u usecs have passed since last napi execution. Missing Tx timeout value %u msecs\n",
-					     tx_ring->qid, i, time_since_last_napi, missing_tx_comp_to);
+			time_since_last_napi =
+				jiffies_to_usecs(jiffies - tx_ring->tx_stats.last_napi_jiffies);
+			napi_scheduled = !!(ena_napi->napi.state & NAPIF_STATE_SCHED);
+
+			if (missing_tx_comp_to < time_since_last_napi && napi_scheduled) {
+				/* We suspect napi isn't called because the
+				 * bottom half is not run. Require a bigger
+				 * timeout for these cases
+				 */
+				if (!time_is_before_jiffies(last_jiffies +
+					2 * adapter->missing_tx_completion_to))
+					continue;
+
+				reset_reason = ENA_REGS_RESET_SUSPECTED_POLL_STARVATION;
 			}
 
-			tx_buf->print_once = 1;
 			missed_tx++;
+
+			if (tx_buf->print_once)
+				continue;
+
+			netif_notice(adapter, tx_err, adapter->netdev,
+				     "TX hasn't completed, qid %d, index %d. %u usecs from last napi execution, napi scheduled: %d\n",
+				     tx_ring->qid, i, time_since_last_napi, napi_scheduled);
+
+			tx_buf->print_once = 1;
 		}
 	}
 
 	if (unlikely(missed_tx > adapter->missing_tx_completion_threshold)) {
 		netif_err(adapter, tx_err, adapter->netdev,
-			  "The number of lost tx completions is above the threshold (%d > %d). Reset the device\n",
+			  "Lost TX completions are above the threshold (%d > %d). Completion transmission timeout: %u.\n",
 			  missed_tx,
-			  adapter->missing_tx_completion_threshold);
-		ena_reset_device(adapter, ENA_REGS_RESET_MISS_TX_CMPL);
+			  adapter->missing_tx_completion_threshold,
+			  missing_tx_comp_to);
+		netif_err(adapter, tx_err, adapter->netdev,
+			  "Resetting the device\n");
+
+		ena_reset_device(adapter, reset_reason);
 		rc = -EIO;
 	}
 
diff --git a/drivers/net/ethernet/amazon/ena/ena_regs_defs.h b/drivers/net/ethernet/amazon/ena/ena_regs_defs.h
index 1e007a4..2c3d6a7 100644
--- a/drivers/net/ethernet/amazon/ena/ena_regs_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_regs_defs.h
@@ -21,6 +21,7 @@ enum ena_regs_reset_reason_types {
 	ENA_REGS_RESET_USER_TRIGGER = 12,
 	ENA_REGS_RESET_GENERIC = 13,
 	ENA_REGS_RESET_MISS_INTERRUPT = 14,
+	ENA_REGS_RESET_SUSPECTED_POLL_STARVATION = 15,
 };
 
 /* ena_registers offsets */
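The patch repeatedly tests a state bit with the `!!(state & NAPIF_STATE_SCHED)` idiom. A minimal userspace illustration of that idiom, with made-up flag values (the kernel's NAPIF_STATE_* constants are defined in include/linux/netdevice.h and differ from these):

```c
#include <assert.h>

/* Illustrative flag bits, not the kernel's actual values. */
#define STATE_SCHED   (1UL << 0)
#define STATE_DISABLE (1UL << 1)

/* !!(x) collapses any non-zero bitmask to exactly 1, so the result fits
 * in an int and prints cleanly with %d, as the patch's log lines do.
 */
static int is_scheduled(unsigned long state)
{
	return !!(state & STATE_SCHED);
}
```

Without the double negation, `state & STATE_SCHED` would yield the raw bit value (here 1, but in general any power of two), which is why the driver normalizes it before logging.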