Aquantia ethernet driver suspend/resume issues

Message ID CAHk-=wiZZi7FcvqVSUirHBjx0bBUZ4dFrMDVLc3+3HCrtq0rBA@mail.gmail.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series Aquantia ethernet driver suspend/resume issues

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/codegen success Generated files up to date
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1115 this patch: 1115
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 1142 this patch: 1142
netdev/verify_signedoff fail author Signed-off-by missing
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1142 this patch: 1142
netdev/checkpatch warning WARNING: Possible repeated word: 'please'
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Linus Torvalds Nov. 26, 2023, 3:04 a.m. UTC
Ok, so this is pretty random, but I ended up replacing my main SSD
today, and decided that I'll just do a clean re-install and copy my
user data over from my old SSD. As a result of all that, my ethernet
cable ended up in a random ethernet port when I reconnected
everything, and because of the system reinstall I ended up with
suspend-at-idle on by default (which I very much don't want, but I
only noticed after it happened).

And it turns out that suspend/resume *really* doesn't work on the
Aquantia ethernet driver, which is where the cable happened to be.

First you get an allocation failure at resume:

  kworker/u256:41: page allocation failure: order:6,
mode:0x40d00(GFP_NOIO|__GFP_COMP|__GFP_ZERO),
nodemask=(null),cpuset=/,mems_allowed=0
  CPU: 58 PID: 11654 Comm: kworker/u256:41 Not tainted
  Workqueue: events_unbound async_run_entry_fn
  Call Trace:
   <TASK>
   dump_stack_lvl+0x47/0x60
   warn_alloc+0x165/0x1e0
   __alloc_pages_slowpath.constprop.0+0xcd4/0xd90
   __alloc_pages+0x32d/0x350
   __kmalloc_large_node+0x73/0x130
   __kmalloc+0xc3/0x150
   aq_ring_alloc+0x22/0xb0 [atlantic]
   aq_vec_ring_alloc+0xee/0x1a0 [atlantic]
   aq_nic_init+0x118/0x1d0 [atlantic]
   atl_resume_common+0x40/0xd0 [atlantic]
   ...

and immediately after that we get

  trying to free invalid coherent area: 000000006fb35228
  WARNING: CPU: 58 PID: 11654 at kernel/dma/remap.c:65
dma_common_free_remap+0x2d/0x40
  CPU: 58 PID: 11654 Comm: kworker/u256:41 Not tainted 6.5.6-300.fc39.x86_64 #1
  Workqueue: events_unbound async_run_entry_fn
  Call Trace:
   <TASK>
   __iommu_dma_free+0xe8/0x100
   aq_ring_alloc+0xa4/0xb0 [atlantic]
   aq_vec_ring_alloc+0xee/0x1a0 [atlantic]
   aq_nic_init+0x118/0x1d0 [atlantic]
   atl_resume_common+0x40/0xd0 [atlantic]
   ...
  atlantic 0000:44:00.0: PM: dpm_run_callback():
pci_pm_resume+0x0/0xf0 returns -12
  atlantic 0000:44:00.0: PM: failed to resume async: error -12

and now the slab cache is corrupt and the system is dead.

My *guess* is that what is going on is that when the kcalloc() failed
(because it tries to allocate a large area, and it has only been
tested at boot-time when it succeeds), we end up doing that

  err_exit:
        if (err < 0) {
                aq_ring_free(self);
                self = NULL;
        }

but aq_ring_free() does

        kfree(self->buff_ring);

        if (self->dx_ring)
                dma_free_coherent(aq_nic_get_dev(self->aq_nic),
                                  self->size * self->dx_size, self->dx_ring,
                                  self->dx_ring_pa);

and notice how it will free the dx_ring even though it was never
allocated! I suspect dx_ring is non-zero because it was allocated
earlier, but the suspend free'd it - but never cleared the pointer.

That "never cleared the pointer on free" is true for buff_ring too,
but the aq_ring_alloc() did

        self->buff_ring =
                kcalloc(self->size, sizeof(struct aq_ring_buff_s), GFP_KERNEL);

so when that failed, at least it re-initialized that part to NULL, so
we just had a kfree(NULL) which is fine.

Anyway, I suspect a fix for the fatal error might be something like
the attached, but I think the *root* of the problem is how the
aquantia driver tried to allocate a humongous buff_ring with kmalloc,
which really doesn't work.  You can see that "order:6", ie we're
talking an allocation > 100kB, and in low-memory situations that kind
of kmalloc space simply isn't available. It *will* fail.
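
As a rough sanity check of that "order:6" (a sketch only: the 0x30000 byte
figure comes from the debug output posted further down in this thread, and a
4 KiB page size is assumed):

#include <stdio.h>

/* order-N means a physically contiguous block of 2^N pages */
static unsigned int order_for(unsigned long bytes, unsigned long page_size)
{
        unsigned long pages = (bytes + page_size - 1) / page_size;
        unsigned int order = 0;

        while ((1UL << order) < pages)
                order++;
        return order;
}

int main(void)
{
        unsigned long bytes = 0x30000;  /* 196608 bytes, one failing buff_ring allocation */

        /* 48 pages of 4 KiB round up to 64 contiguous pages, i.e. order 6 = 256 KiB */
        printf("%lu bytes -> order %u\n", bytes, order_for(bytes, 4096));
        return 0;
}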

Again, during boot you'll probably never see any issues. During
suspend/resume it very much does not work.

In general, suspend/resume should *not* do big memory management
things. It should probably have never free'd the old data structure,
and it most definitely cannot try to allocate a big new data structure
in resume.
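
As a very rough sketch of that shape (everything named example_* here is
hypothetical, not atlantic code; the only point is that the PM callbacks
quiesce and re-arm the hardware while the rings allocated at open time stay
allocated):

/* hypothetical driver state; rings are allocated once in ndo_open() */
struct example_nic {
        void *rings;
};

static int example_suspend(struct device *dev)
{
        struct example_nic *nic = dev_get_drvdata(dev);

        example_hw_stop(nic);           /* stop DMA, mask interrupts */
        return 0;                       /* note: no ring freeing here */
}

static int example_resume(struct device *dev)
{
        struct example_nic *nic = dev_get_drvdata(dev);

        example_hw_reinit(nic);         /* reprogram the ring addresses kept from open() */
        return example_hw_start(nic);
}

static SIMPLE_DEV_PM_OPS(example_pm_ops, example_suspend, example_resume);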

To make matters worse, it looks like there's not just *one* of those
big allocations, there's multiple ones, both for RX and TX. But I
didn't look much more closely.

I don't know what the right fix is, but *one* fix would certainly be
to not tear everything down at suspend time, only to build it up again
at resume.

And please please please don't double-free things randomly (if that is
what was going on, but it does look like it was).

           Linus

Comments

Igor Russkikh Nov. 27, 2023, 5:29 p.m. UTC | #1
Hi Linus,

> and now the slab cache is corrupt and the system is dead.
> 
> My *guess* is that what is going on is that when the kcalloc() failed
> (because it tries to allocate a large area, and it has only been
> tested at boot-time when it succeeds), we end up doing that

You are probably right in root-causing the issue here, and the bad thing is
that a direct fix will not solve the whole problem.

I'm trying to repro this on my side with some artificially increased structure
sizes, but no success so far.

Datapath initialization in the driver normally requires 8 vectors, each having
an rx/tx pair of rings, in total making 8 * (4096 + 2048) * 64 = ~3 MB of memory.

...

> Anyway, I suspect a fix for the fatal error might be something like
> the attached, but I think the *root* of the problem is how the
> aquantia driver tried to allocate a humongous buff_ring with kmalloc,
> which really doesn't work.  You can see that "order:6", ie we're
> talking an allocation > 100kB, and in low-memory situations that kind
> of kmalloc space simply isn't available. It *will* fail.

Correct.
It seems that after a series of PM-related changes the driver logic was changed to always
reinit the full SW structures during suspend/resume, which is obviously overkill.

...

> 
> I don't know what the right fix is, but *one* fix would certainly be
> to not tear everything down at suspend time, only to build it up again
> at resume.
> 
> And please please please don't double-free things randomly (if that is
> what was going on, but it does look like it was).

You are right, it's also a potential double-free problem (maybe not manifesting):
aq_ring_free is being called from the "same level" aq_ring_alloc, but also then
from the aq_ring_tx_alloc caller's error handler.

Attached is my draft patch to improve that.

Still looking into the bigger problem of fixing suspend/resume.

Thanks
  Igor
From d753ba3bc5d18f0213833da7ceba3097d8db027a Mon Sep 17 00:00:00 2001
From: Igor Russkikh <irusskikh@marvell.com>
Date: Mon, 27 Nov 2023 16:09:05 +0100
Subject: [PATCH] net: atlantic: fixed double free on reinit

---
 .../net/ethernet/aquantia/atlantic/aq_ptp.c   | 17 ++----
 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 60 ++++++-------------
 .../net/ethernet/aquantia/atlantic/aq_ring.h  | 18 +++---
 .../net/ethernet/aquantia/atlantic/aq_vec.c   | 22 ++++---
 4 files changed, 42 insertions(+), 75 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ptp.c b/drivers/net/ethernet/aquantia/atlantic/aq_ptp.c
index 80b44043e6c5..1814375fecee 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ptp.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ptp.c
@@ -953,8 +953,6 @@ int aq_ptp_ring_alloc(struct aq_nic_s *aq_nic)
 {
 	struct aq_ptp_s *aq_ptp = aq_nic->aq_ptp;
 	unsigned int tx_ring_idx, rx_ring_idx;
-	struct aq_ring_s *hwts;
-	struct aq_ring_s *ring;
 	int err;
 
 	if (!aq_ptp)
@@ -962,27 +960,24 @@ int aq_ptp_ring_alloc(struct aq_nic_s *aq_nic)
 
 	tx_ring_idx = aq_ptp_ring_idx(aq_nic->aq_nic_cfg.tc_mode);
 
-	ring = aq_ring_tx_alloc(&aq_ptp->ptp_tx, aq_nic,
+	err = aq_ring_tx_alloc(&aq_ptp->ptp_tx, aq_nic,
 				tx_ring_idx, &aq_nic->aq_nic_cfg);
-	if (!ring) {
-		err = -ENOMEM;
+	if (err) {
 		goto err_exit;
 	}
 
 	rx_ring_idx = aq_ptp_ring_idx(aq_nic->aq_nic_cfg.tc_mode);
 
-	ring = aq_ring_rx_alloc(&aq_ptp->ptp_rx, aq_nic,
+	err = aq_ring_rx_alloc(&aq_ptp->ptp_rx, aq_nic,
 				rx_ring_idx, &aq_nic->aq_nic_cfg);
-	if (!ring) {
-		err = -ENOMEM;
+	if (err) {
 		goto err_exit_ptp_tx;
 	}
 
-	hwts = aq_ring_hwts_rx_alloc(&aq_ptp->hwts_rx, aq_nic, PTP_HWST_RING_IDX,
+	err = aq_ring_hwts_rx_alloc(&aq_ptp->hwts_rx, aq_nic, PTP_HWST_RING_IDX,
 				     aq_nic->aq_nic_cfg.rxds,
 				     aq_nic->aq_nic_cfg.aq_hw_caps->rxd_size);
-	if (!hwts) {
-		err = -ENOMEM;
+	if (err) {
 		goto err_exit_ptp_rx;
 	}
 
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 4de22eed099a..241472e56eb2 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -132,8 +132,8 @@ static int aq_get_rxpages(struct aq_ring_s *self, struct aq_ring_buff_s *rxbuf)
 	return 0;
 }
 
-static struct aq_ring_s *aq_ring_alloc(struct aq_ring_s *self,
-				       struct aq_nic_s *aq_nic)
+static int aq_ring_alloc(struct aq_ring_s *self,
+			 struct aq_nic_s *aq_nic)
 {
 	int err = 0;
 
@@ -156,46 +156,29 @@ static struct aq_ring_s *aq_ring_alloc(struct aq_ring_s *self,
 err_exit:
 	if (err < 0) {
 		aq_ring_free(self);
-		self = NULL;
 	}
 
-	return self;
+	return err;
 }
 
-struct aq_ring_s *aq_ring_tx_alloc(struct aq_ring_s *self,
+int aq_ring_tx_alloc(struct aq_ring_s *self,
 				   struct aq_nic_s *aq_nic,
 				   unsigned int idx,
 				   struct aq_nic_cfg_s *aq_nic_cfg)
 {
-	int err = 0;
-
 	self->aq_nic = aq_nic;
 	self->idx = idx;
 	self->size = aq_nic_cfg->txds;
 	self->dx_size = aq_nic_cfg->aq_hw_caps->txd_size;
 
-	self = aq_ring_alloc(self, aq_nic);
-	if (!self) {
-		err = -ENOMEM;
-		goto err_exit;
-	}
-
-err_exit:
-	if (err < 0) {
-		aq_ring_free(self);
-		self = NULL;
-	}
-
-	return self;
+	return aq_ring_alloc(self, aq_nic);
 }
 
-struct aq_ring_s *aq_ring_rx_alloc(struct aq_ring_s *self,
-				   struct aq_nic_s *aq_nic,
-				   unsigned int idx,
-				   struct aq_nic_cfg_s *aq_nic_cfg)
+int aq_ring_rx_alloc(struct aq_ring_s *self,
+		     struct aq_nic_s *aq_nic,
+		     unsigned int idx,
+		     struct aq_nic_cfg_s *aq_nic_cfg)
 {
-	int err = 0;
-
 	self->aq_nic = aq_nic;
 	self->idx = idx;
 	self->size = aq_nic_cfg->rxds;
@@ -217,22 +200,10 @@ struct aq_ring_s *aq_ring_rx_alloc(struct aq_ring_s *self,
 		self->tail_size = 0;
 	}
 
-	self = aq_ring_alloc(self, aq_nic);
-	if (!self) {
-		err = -ENOMEM;
-		goto err_exit;
-	}
-
-err_exit:
-	if (err < 0) {
-		aq_ring_free(self);
-		self = NULL;
-	}
-
-	return self;
+	return aq_ring_alloc(self, aq_nic);
 }
 
-struct aq_ring_s *
+int
 aq_ring_hwts_rx_alloc(struct aq_ring_s *self, struct aq_nic_s *aq_nic,
 		      unsigned int idx, unsigned int size, unsigned int dx_size)
 {
@@ -250,10 +221,10 @@ aq_ring_hwts_rx_alloc(struct aq_ring_s *self, struct aq_nic_s *aq_nic,
 					   GFP_KERNEL);
 	if (!self->dx_ring) {
 		aq_ring_free(self);
-		return NULL;
+		return -ENOMEM;
 	}
 
-	return self;
+	return 0;
 }
 
 int aq_ring_init(struct aq_ring_s *self, const enum atl_ring_type ring_type)
@@ -932,11 +903,14 @@ void aq_ring_free(struct aq_ring_s *self)
 		return;
 
 	kfree(self->buff_ring);
+	self->buff_ring = NULL;
 
-	if (self->dx_ring)
+	if (self->dx_ring) {
 		dma_free_coherent(aq_nic_get_dev(self->aq_nic),
 				  self->size * self->dx_size, self->dx_ring,
 				  self->dx_ring_pa);
+		self->dx_ring = NULL;
+	}
 }
 
 unsigned int aq_ring_fill_stats_data(struct aq_ring_s *self, u64 *data)
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.h b/drivers/net/ethernet/aquantia/atlantic/aq_ring.h
index 0a6c34438c1d..00fbeccc16c8 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.h
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.h
@@ -183,14 +183,14 @@ static inline unsigned int aq_ring_avail_dx(struct aq_ring_s *self)
 		self->sw_head - self->sw_tail - 1);
 }
 
-struct aq_ring_s *aq_ring_tx_alloc(struct aq_ring_s *self,
-				   struct aq_nic_s *aq_nic,
-				   unsigned int idx,
-				   struct aq_nic_cfg_s *aq_nic_cfg);
-struct aq_ring_s *aq_ring_rx_alloc(struct aq_ring_s *self,
-				   struct aq_nic_s *aq_nic,
-				   unsigned int idx,
-				   struct aq_nic_cfg_s *aq_nic_cfg);
+int aq_ring_tx_alloc(struct aq_ring_s *self,
+		     struct aq_nic_s *aq_nic,
+		     unsigned int idx,
+		     struct aq_nic_cfg_s *aq_nic_cfg);
+int aq_ring_rx_alloc(struct aq_ring_s *self,
+		     struct aq_nic_s *aq_nic,
+		     unsigned int idx,
+		     struct aq_nic_cfg_s *aq_nic_cfg);
 
 int aq_ring_init(struct aq_ring_s *self, const enum atl_ring_type ring_type);
 void aq_ring_rx_deinit(struct aq_ring_s *self);
@@ -207,7 +207,7 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 		     int budget);
 int aq_ring_rx_fill(struct aq_ring_s *self);
 
-struct aq_ring_s *aq_ring_hwts_rx_alloc(struct aq_ring_s *self,
+int aq_ring_hwts_rx_alloc(struct aq_ring_s *self,
 		struct aq_nic_s *aq_nic, unsigned int idx,
 		unsigned int size, unsigned int dx_size);
 void aq_ring_hwts_rx_clean(struct aq_ring_s *self, struct aq_nic_s *aq_nic);
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_vec.c b/drivers/net/ethernet/aquantia/atlantic/aq_vec.c
index f5db1c44e9b9..d811106862ff 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_vec.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_vec.c
@@ -136,10 +136,9 @@ int aq_vec_ring_alloc(struct aq_vec_s *self, struct aq_nic_s *aq_nic,
 		const unsigned int idx_ring = AQ_NIC_CFG_TCVEC2RING(aq_nic_cfg,
 								    i, idx);
 
-		ring = aq_ring_tx_alloc(&self->ring[i][AQ_VEC_TX_ID], aq_nic,
-					idx_ring, aq_nic_cfg);
-		if (!ring) {
-			err = -ENOMEM;
+		ring = &self->ring[i][AQ_VEC_TX_ID];
+		err = aq_ring_tx_alloc(ring, aq_nic, idx_ring, aq_nic_cfg);
+		if (err) {
 			goto err_exit;
 		}
 
@@ -147,24 +146,23 @@ int aq_vec_ring_alloc(struct aq_vec_s *self, struct aq_nic_s *aq_nic,
 
 		aq_nic_set_tx_ring(aq_nic, idx_ring, ring);
 
-		if (xdp_rxq_info_reg(&self->ring[i][AQ_VEC_RX_ID].xdp_rxq,
+		ring = &self->ring[i][AQ_VEC_RX_ID];
+		if (xdp_rxq_info_reg(&ring->xdp_rxq,
 				     aq_nic->ndev, idx,
 				     self->napi.napi_id) < 0) {
 			err = -ENOMEM;
 			goto err_exit;
 		}
-		if (xdp_rxq_info_reg_mem_model(&self->ring[i][AQ_VEC_RX_ID].xdp_rxq,
+		if (xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
 					       MEM_TYPE_PAGE_SHARED, NULL) < 0) {
-			xdp_rxq_info_unreg(&self->ring[i][AQ_VEC_RX_ID].xdp_rxq);
+			xdp_rxq_info_unreg(&ring->xdp_rxq);
 			err = -ENOMEM;
 			goto err_exit;
 		}
 
-		ring = aq_ring_rx_alloc(&self->ring[i][AQ_VEC_RX_ID], aq_nic,
-					idx_ring, aq_nic_cfg);
-		if (!ring) {
-			xdp_rxq_info_unreg(&self->ring[i][AQ_VEC_RX_ID].xdp_rxq);
-			err = -ENOMEM;
+		err = aq_ring_rx_alloc(ring, aq_nic, idx_ring, aq_nic_cfg);
+		if (err) {
+			xdp_rxq_info_unreg(&ring->xdp_rxq);
 			goto err_exit;
 		}
Linus Torvalds Nov. 27, 2023, 6:02 p.m. UTC | #2
On Mon, 27 Nov 2023 at 09:29, Igor Russkikh <irusskikh@marvell.com> wrote:
>
> I'm trying to repro this on my side with some artificially increased structure
> sizes, but no success so far.

So I suspect that one reason I triggered the problem was simply
because the suspend/resume happened while I walked away from the
computer when it was copying a few hundred gig of data from the old
SSD (over USB, so not hugely fast).

End result: the suspend/resume happened while the machine was actually
quite busy, and presumably a *lot* of memory was just used for disk
buffers etc. The "duspend when idle" logic doesn't really take
background tasks into account, and my logs leading up to the suspend
shows

  13:54:09  systemd[1]: Starting systemd-suspend.service - System Suspend...
  13:54:09  wpa_supplicant[2401]: wlo2: CTRL-EVENT-DSCP-POLICY clear_all
  13:54:09  systemd-sleep[12477]: Entering sleep state 'suspend'...
  13:54:09  kernel: PM: suspend entry (deep)
  13:54:19  systemd[1]: NetworkManager-dispatcher.service: Deactivated
successfully.
  13:54:22  kernel: Filesystems sync: 12.738 seconds
  14:06:30  kernel: Freezing user space processes

and while that last timestamp is bogus (the timestamp comes from
syslogd logging, and it actually happens at *resume*), you can see
that the filesystem activity was pretty significant with the sync
taking a long time, because the copy process was still on-going the
whole time. And it continued *after* the sync too.

So I - accidentally - ended up hitting a lot of "that's not great"
cases on this, that I wouldn't hit normally (because I obviously turn
off suspend-at-idle). All on hardware that isn't normally used for
suspend/resume anyway, so it probably has somewhat limited testing to
begin with.

For triggering it, you might try to change that

        self->buff_ring =
                kcalloc(self->size, sizeof(struct aq_ring_buff_s), GFP_KERNEL);

to use GFP_NOWAIT instead of GFP_KERNEL. That makes allocation
failures *much* more likely. It will still work at boot time.
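
For reference, that would just be the quoted call with the flag swapped,
something like (debug-only, not a proposed fix):

        self->buff_ring =
                kcalloc(self->size, sizeof(struct aq_ring_buff_s),
                        GFP_NOWAIT);    /* never sleeps or reclaims, so it fails far more readily */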

Or just artificially make it fail with a "fail the Nth time you hit it".

Also, make sure you don't have ridiculous amounts of memory in your
machine.  I've got "only" 64GB in mine, which is small for a big
machine, and presumably a lot of it was used for buffer cache, and I'm
not sure what the device suspend/resume ordering was (ie disk might be
resumed after ethernet).

                Linus
Jakub Kicinski Nov. 27, 2023, 10:59 p.m. UTC | #3
On Mon, 27 Nov 2023 18:29:19 +0100 Igor Russkikh wrote:
> > Anyway, I suspect a fix for the fatal error might be something like
> > the attached, but I think the *root* of the problem is how the
> > aquantia driver tried to allocate a humongous buff_ring with kmalloc,
> > which really doesn't work.  You can see that "order:6", ie we're
> > talking an allocation > 100kB, and in low-memory situations that kind
> > of kmalloc space simply isn't available. It *will* fail.  
> 
> Correct.
> It seems that after a series of PM-related changes the driver logic was changed to always
> reinit the full SW structures during suspend/resume, which is obviously overkill.

Another option you can consider is lowering the default ring size.
If I'm looking right you default to 4k descriptors for Tx.
Is it based on real life experience?

Longer term hopefully the queue management API will help drivers
avoid freeing memory at suspend. We'll be able to disarm and
unconfigure queues without freeing the memory.
Igor Russkikh Nov. 28, 2023, 5:54 p.m. UTC | #4
On 11/27/2023 7:02 PM, Linus Torvalds wrote:

> 
> So I suspect that one reason I triggered the problem was simply
> because the suspend/resume happened while I walked away from the
> computer when it was copying a few hundred gig of data from the old
> SSD (over USB, so not hugely fast).

...

> Also, make sure you don't have ridiculous amounts of memory in your
> machine.  I've got "only" 64GB in mine, which is small for a big
> machine, and presumably a lot of it was used for buffer cache, and I'm
> not sure what the device suspend/resume ordering was (ie disk might be
> resumed after ethernet).

With these details in mind I was able to repro this within seconds on my 16 GB machine,
basically by running a stress load in parallel:

    stress -m 2000 --vm-bytes 20M --vm-hang 10 --backoff 1000

    while true; do sudo ifconfig enp1s0 down; sudo ifconfig enp1s0 up; done

in 5-10 seconds I get

[  859.536856] atlantic 0000:01:00.0 enp1s0: aq_ring_alloc[6](0x30000)
[  859.563153] warn_alloc: 1 callbacks suppressed
[  859.563156] ifconfig: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[  859.563163] CPU: 13 PID: 48544 Comm: ifconfig Tainted: G           OE      6.7.0-rc2igor1+ #1
[  859.563165] Hardware name: ASUS System Product Name/PRIME Z590-P, BIOS 1017 07/12/2021
[  859.563166] Call Trace:
[  859.563168]  <TASK>
[  859.563170]  dump_stack_lvl+0x48/0x70
[  859.563175]  dump_stack+0x10/0x20
[  859.563177]  warn_alloc+0x119/0x190
[  859.563180]  ? __alloc_pages_direct_compact+0xae/0x1f0
[  859.563183]  __alloc_pages_slowpath.constprop.0+0xd1a/0xdd0
[  859.563188]  __alloc_pages+0x304/0x350
[  859.563192]  ? aq_ring_alloc+0x29/0xe0 [atlantic]
[  859.563207]  __kmalloc_large_node+0x7f/0x140
[  859.563210]  __kmalloc+0xc9/0x140
[  859.563212]  aq_ring_alloc+0x29/0xe0 [atlantic]
[  859.563221]  aq_ring_rx_alloc+0x7d/0x90 [atlantic]
[  859.563230]  aq_vec_ring_alloc+0xab/0x170 [atlantic]
[  859.563241]  aq_nic_init+0x11c/0x1e0 [atlantic]
[  859.563250]  aq_ndev_open+0x20/0x90 [atlantic]
[  859.563259]  __dev_open+0xe9/0x190
[  859.563261]  __dev_change_flags+0x18c/0x1f0
[  859.563263]  dev_change_flags+0x26/0x70
[  859.563265]  devinet_ioctl+0x602/0x760
[  859.563268]  inet_ioctl+0x167/0x190
[  859.563269]  ? sk_ioctl+0x4b/0x110
[  859.563271]  ? inet_ioctl+0x95/0x190
[  859.563273]  sock_do_ioctl+0x44/0xf0
[  859.563274]  ? __check_object_size+0x51/0x2d0
[  859.563277]  ? _copy_to_user+0x25/0x40
[  859.563279]  sock_ioctl+0xf7/0x300
[  859.563280]  __x64_sys_ioctl+0x95/0xd0
[  859.563283]  do_syscall_64+0x5c/0xe0
[  859.563286]  ? exit_to_user_mode_prepare+0x45/0x1a0
[  859.563289]  ? syscall_exit_to_user_mode+0x34/0x50
[  859.563291]  ? do_syscall_64+0x6b/0xe0
[  859.563293]  ? do_syscall_64+0x6b/0xe0
[  859.563295]  ? syscall_exit_to_user_mode+0x34/0x50
[  859.563296]  ? __x64_sys_openat+0x20/0x30
[  859.563298]  ? do_syscall_64+0x6b/0xe0
[  859.563300]  ? syscall_exit_to_user_mode+0x34/0x50
[  859.563301]  ? __x64_sys_read+0x1a/0x20
[  859.563303]  ? do_syscall_64+0x6b/0xe0
[  859.563305]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  859.563307] RIP: 0033:0x7f98499df3ab
[  859.563309] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d 00 f7 d8 64 89 01 48
[  859.563310] RSP: 002b:00007fffba955138 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  859.563312] RAX: ffffffffffffffda RBX: 00007fffba955140 RCX: 00007f98499df3ab
[  859.563313] RDX: 00007fffba955140 RSI: 0000000000008914 RDI: 0000000000000004
[  859.563314] RBP: 00007fffba9551f0 R08: 0000000000000008 R09: 0000000000000001
[  859.563315] R10: 0000000000000011 R11: 0000000000000202 R12: 0000000000000041
[  859.563316] R13: 00007fffba9554e8 R14: 0000000000000000 R15: 0000000000000000
[  859.563318]  </TASK>
[  859.563319] Mem-Info:
[  859.563320] active_anon:14091 inactive_anon:3805083 isolated_anon:2336
                active_file:4601 inactive_file:5452 isolated_file:3
                unevictable:2258 dirty:56 writeback:0
                slab_reclaimable:35879 slab_unreclaimable:42730
                mapped:8485 shmem:2635 pagetables:35066
                sec_pagetables:0 bounce:0
                kernel_misc_reclaimable:0
                free:56673 free_pcp:0 free_cma:0
[  859.563323] Node 0 active_anon:56364kB inactive_anon:15220332kB active_file:18404kB inactive_file:21808kB unevictable:9032kB isolated(anon):9344kB isolated(file):12kB mapped:33940kB dirty:224kB writeback:0kB shmem:10540kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB kernel_stack:42592kB pagetables:140264kB sec_pagetables:0kB all_unreclaimable? no
[  859.563326] Node 0 DMA free:13308kB boost:0kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  859.563329] lowmem_reserve[]: 0 2305 15744 15744 15744
[  859.563332] Node 0 DMA32 free:63608kB boost:0kB min:9884kB low:12352kB high:14820kB reserved_highatomic:0KB active_anon:16kB inactive_anon:2330560kB active_file:88kB inactive_file:36kB unevictable:0kB writepending:8kB present:2467796kB managed:2401988kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  859.563335] lowmem_reserve[]: 0 0 13439 13439 13439
[  859.563338] Node 0 Normal free:149776kB boost:90112kB min:147740kB low:162144kB high:176548kB reserved_highatomic:2048KB active_anon:56024kB inactive_anon:12889984kB active_file:19608kB inactive_file:22216kB unevictable:9032kB writepending:0kB present:14098432kB managed:13769880kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  859.563342] lowmem_reserve[]: 0 0 0 0 0
[  859.563344] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 13308kB
[  859.563353] Node 0 DMA32: 2*4kB (UM) 2*8kB (UM) 5*16kB (M) 5*32kB (UM) 6*64kB (UM) 17*128kB (UM) 22*256kB (UM) 4*512kB (UM) 6*1024kB (UM) 11*2048kB (UM) 6*4096kB (UM) = 63752kB
[  859.563362] Node 0 Normal: 7073*4kB (UMEH) 1922*8kB (UMEH) 756*16kB (UMEH) 464*32kB (UMEH) 830*64kB (UMEH) 197*128kB (UMH) 6*256kB (MH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 150484kB
[  859.563371] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  859.563373] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  859.563374] 23673 total pagecache pages
[  859.563374] 11582 pages in swap cache
[  859.563375] Free swap  = 7137276kB
[  859.563375] Total swap = 8003580kB
[  859.563376] 4145555 pages RAM
[  859.563376] 0 pages HighMem/MovableOnly
[  859.563377] 98748 pages reserved
[  859.563378] 0 pages hwpoisoned
[  859.563379] atlantic 0000:01:00.0 enp1s0: aq_ring_alloc[6](0x18000)
[  859.563381] atlantic 0000:01:00.0 enp1s0: aq_ring_alloc[6] FAILURE =============================
[  859.563388] atlantic 0000:01:00.0 enp1s0: device open failure
[  860.996946] atlantic 0000:01:00.0 enp1s0: aq_ring_alloc[0](0x30000)
[  860.996961] atlantic 0000:01:00.0 enp1s0: aq_ring_alloc[0](0x18000)

That's already with the patch applied, so no panic, and the next "ifconfig up" recovers the device state.

I will submit a bugfix patch for that solution, but will also continue looking into suspend/resume refactoring.

Thanks,
  Igor
Linus Torvalds Nov. 28, 2023, 6:37 p.m. UTC | #5
On Tue, 28 Nov 2023 at 09:54, Igor Russkikh <irusskikh@marvell.com> wrote:
>
> That's already with the patch applied, so no panic, and the next "ifconfig up" recovers the device state.

Ack. At least it's recoverable, I had to reboot the machine when it
happened to me (and am very happy that at least all the logs made it
to disk).

> I will submit a bugfix patch for that solution, but will also continue looking into suspend/resume refactoring.

Thanks,

                   Linus
Igor Russkikh Nov. 28, 2023, 7:18 p.m. UTC | #6
On 11/27/2023 11:59 PM, Jakub Kicinski wrote:

> Another option you can consider is lowering the default ring size.
> If I'm looking right you default to 4k descriptors for Tx.
> Is it based on real life experience?

Probably reducing the default will help - but again not 100%.

I remember these numbers were chosen mainly to show good 10 Gbps
line speed in tests, like iperf udp/tcp flood. But those are of course
artificial.

For sure a "normal" user can survive even with lower numbers.

Thanks
  Igor
Jakub Kicinski Nov. 28, 2023, 9:09 p.m. UTC | #7
On Tue, 28 Nov 2023 20:18:49 +0100 Igor Russkikh wrote:
> > Another option you can consider is lowering the default ring size.
> > If I'm looking right you default to 4k descriptors for Tx.
> > Is it based on real life experience?  
> 
> Probably reducing the default will help - but again not 100%.
>
> I remember these numbers were chosen mainly to show good 10 Gbps
> line speed in tests, like iperf udp/tcp flood. But those are of course
> artificial.
>
> For sure a "normal" user can survive even with lower numbers.

For Rx under load larger rings are sometimes useful to avoid drops.
But your Tx rings are larger than Rx, which is a bit odd.

I was going to say that with BQL enabled you're very unlikely to ever
use much of the 4k Tx ring, anyway. But you don't have BQL support :S

My free advice is to recheck you really need these sizes and implement
BQL :)
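
For reference, a minimal sketch of what BQL accounting looks like in a driver's
Tx path (the example_* names are hypothetical and not atlantic code; the
netdev_tx_sent_queue()/netdev_tx_completed_queue() helpers are the real BQL API
from <linux/netdevice.h>):

static netdev_tx_t example_start_xmit(struct sk_buff *skb, struct net_device *ndev)
{
        struct netdev_queue *txq =
                netdev_get_tx_queue(ndev, skb_get_queue_mapping(skb));

        /* ... map the skb, post Tx descriptors, ring the doorbell ... */

        netdev_tx_sent_queue(txq, skb->len);    /* account bytes now in flight */
        return NETDEV_TX_OK;
}

static void example_tx_clean(struct net_device *ndev, unsigned int qidx,
                             unsigned int pkts, unsigned int bytes)
{
        struct netdev_queue *txq = netdev_get_tx_queue(ndev, qidx);

        /* after reclaiming completed Tx descriptors */
        netdev_tx_completed_queue(txq, pkts, bytes);
}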
Igor Russkikh Nov. 30, 2023, 12:59 p.m. UTC | #8
On 11/28/2023 10:09 PM, Jakub Kicinski wrote:
> 
> For Rx under load larger rings are sometimes useful to avoid drops.
> But your Tx rings are larger than Rx, which is a bit odd.

Agree. Just looked into the history, and it looks like this size was chosen
since the very first commit of this driver.

> I was going to say that with BQL enabled you're very unlikely to ever
> use much of the 4k Tx ring, anyway. But you don't have BQL support :S
> 
> My free advice is to recheck you really need these sizes and implement
> BQL :)

Thanks for the hint, will consider this.

Regards
  Igor
Peter Waller Jan. 21, 2024, 9:05 p.m. UTC | #9
I see a fix for double free [0] landed in 6.7; I've been running that 
for a few days and have hit a resume from suspend issue twice. Stack 
trace looks a little different (via __iommu_dma_map instead of 
__iommu_dma_free), provided below.

I've had resume issues with the atlantic driver since I've had this 
hardware, but it went away for a while and seems as though it may have 
come back with 6.7. (No crashes since logs begin on Dec 15 till Jan 12, 
Upgrade to 6.7; crashes 20th and 21st, though my usage style of the 
system has also varied, maybe crashes are associated with higher memory 
usage?).

Possibly unrelated but I also see fairly frequent (1 to ten times per 
boot, since logs begin?) messages in my logs of the form "atlantic 
0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 
address=0xffce8000 flags=0x0020]".

[0] 
https://github.com/torvalds/linux/commit/7bb26ea74aa86fdf894b7dbd8c5712c5b4187da7

- Peter

kworker/u65:2: page allocation failure: order:6, 
mode:0x40d00(GFP_NOIO|__GFP_COMP|__GFP_ZERO), 
nodemask=(null),cpuset=/,mems_allowed=0
CPU: 18 PID: 166017 Comm: kworker/u65:2 Not tainted 6.7.0
Hardware name: ASUS System Product [...] BIOS 1502 06/08/2023
Workqueue: events_unbound async_run_entry_fn
Call Trace:
  <TASK>
  dump_stack_lvl+0x47/0x60
  warn_alloc+0x165/0x1e0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __alloc_pages_direct_compact+0xb3/0x290
  __alloc_pages+0x109e/0x1130
  ? iommu_dma_alloc_iova+0xd4/0x120
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __iommu_dma_map+0x84/0xf0
  ? aq_ring_alloc+0x22/0x80 [atlantic]
  __kmalloc_large_node+0x77/0x130
  __kmalloc+0xc6/0x150
  aq_ring_alloc+0x22/0x80 [atlantic]
  aq_vec_ring_alloc+0xee/0x1a0 [atlantic]
  aq_nic_init+0x118/0x1d0 [atlantic]
  atl_resume_common+0x40/0xd0 [atlantic]


On 30/11/2023 12:59, Igor Russkikh wrote:
>
> On 11/28/2023 10:09 PM, Jakub Kicinski wrote:
>> For Rx under load larger rings are sometimes useful to avoid drops.
>> But your Tx rings are larger than Rx, which is a bit odd.
> Agree. Just looked into the history, and it looks like this size was chosen
> since the very first commit of this driver.
>
>> I was going to say that with BQL enabled you're very unlikely to ever
>> use much of the 4k Tx ring, anyway. But you don't have BQL support :S
>>
>> My free advice is to recheck you really need these sizes and implement
>> BQL :)
> Thanks for the hint, will consider this.
>
> Regards
>    Igor
Igor Russkikh Jan. 23, 2024, 2:58 p.m. UTC | #10
On 1/21/2024 10:05 PM, Peter Waller wrote:
> I see a fix for double free [0] landed in 6.7; I've been running that
> for a few days and have hit a resume from suspend issue twice. Stack
> trace looks a little different (via __iommu_dma_map instead of
> __iommu_dma_free), provided below.
> 
> I've had resume issues with the atlantic driver since I've had this
> hardware, but it went away for a while and seems as though it may have
> come back with 6.7. (No crashes since logs begin on Dec 15 till Jan 12,
> Upgrade to 6.7; crashes 20th and 21st, though my usage style of the
> system has also varied, maybe crashes are associated with higher memory
> usage?).

Hi Peter,

Are these hard crashes, or just warnings in dmesg you see?
From the log you provided it looks like a warning, meaning system is usable
and driver can be restored with `if down/up` sequence.

If so, then this is somewhat expected, because I'm still looking into
how to refactor this suspend/resume cycle to reduce mem usage.
Permanent workaround would be to reduce rx/tx ring sizes with something like

    ethtool -G <iface> rx 1024 tx 1024

If it's a hard panic, we should look deeper into it.

> Possibly unrelated but I also see fairly frequent (1 to ten times per
> boot, since logs begin?) messages in my logs of the form "atlantic
> 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014
> address=0xffce8000 flags=0x0020]".

Seems to be unrelated, but basically indicates HW or FW tries to access unmapped
memory addresses, and iommu catches that.
Full dmesg may help analyze this.

Regards
  Igor
Peter Waller Jan. 23, 2024, 3:13 p.m. UTC | #11
True, it is a warning rather than a hard crash, though shutdown hangs. Thanks for the workaround.

I can provide more dmesg when I’m back at my computer. Do you need the whole thing or is there something in particular you want from it? From memory there isn’t much more in the way of messages that looked connected to me.

Sent from my mobile, please excuse brevity

> On 23 Jan 2024, at 14:59, Igor Russkikh <irusskikh@marvell.com> wrote:
> 
> 
>> On 1/21/2024 10:05 PM, Peter Waller wrote:
>> I see a fix for double free [0] landed in 6.7; I've been running that
>> for a few days and have hit a resume from suspend issue twice. Stack
>> trace looks a little different (via __iommu_dma_map instead of
>> __iommu_dma_free), provided below.
>> 
>> I've had resume issues with the atlantic driver since I've had this
>> hardware, but it went away for a while and seems as though it may have
>> come back with 6.7. (No crashes since logs begin on Dec 15 till Jan 12,
>> Upgrade to 6.7; crashes 20th and 21st, though my usage style of the
>> system has also varied, maybe crashes are associated with higher memory
>> usage?).
> 
> Hi Peter,
> 
> Are these hard crashes, or just warnings in dmesg you see?
> From the log you provided it looks like a warning, meaning system is usable
> and driver can be restored with `if down/up` sequence.
> 
> If so, then this is somewhat expected, because I'm still looking into
> how to refactor this suspend/resume cycle to reduce mem usage.
> Permanent workaround would be to reduce rx/tx ring sizes with something like
> 
>    ethtool -G <iface> rx 1024 tx 1024
> 
> If it's a hard panic, we should look deeper into it.
> 
>> Possibly unrelated but I also see fairly frequent (1 to ten times per
>> boot, since logs begin?) messages in my logs of the form "atlantic
>> 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014
>> address=0xffce8000 flags=0x0020]".
> 
> Seems to be unrelated, but basically indicates HW or FW tries to access unmapped
> memory addresses, and iommu catches that.
> Full dmesg may help analyze this.
> 
> Regards
>  Igor
Peter Waller Jan. 23, 2024, 9:02 p.m. UTC | #12
Here's part of the log, I can provide more off list if it helps. - Peter

<previous boot> Filesystems sync: 0.014 seconds
<n>.678271 Freezing user space processes
<n>.678366 Freezing user space processes completed (elapsed 0.001 seconds)
<n>.678383 OOM killer disabled.
<n>.678397 Freezing remaining freezable tasks
<n>.678407 Freezing remaining freezable tasks completed (elapsed 0.000 
seconds)
<n>.678423 printk: Suspending console(s) (use no_console_suspend to debug)
<n>.678437 serial 00:04: disabled
<n>.678654 queueing ieee80211 work while going to suspend
<n>.678680 sd 9:0:0:0: [sda] Synchronizing SCSI cache
<n>.678884 ata10.00: Entering standby power mode
<n>.678900 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0014 address=0xfc80b000 flags=0x0020]
<n>.679124 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0014 address=0xffeae520 flags=0x0020]
<n>.679270 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0014 address=0xfc80c000 flags=0x0020]
<n>.679411 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x0014 address=0xffeae530 flags=0x0020]
<n>.679541 ACPI: EC: interrupt blocked
<n>.679554 amdgpu 0000:03:00.0: amdgpu: MODE1 reset
<n>.679682 amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
<n>.679803 amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
<n>.679919 ACPI: PM: Preparing to enter system sleep state S3
<n>.679931 ACPI: EC: event blocked
<n>.679942 ACPI: EC: EC stopped
<n>.679952 ACPI: PM: Saving platform NVS memory
<n>.679959 Disabling non-boot CPUs ...
<snip>
<n>.682471 atlantic 0000:0c:00.0 eno2: atlantic: link change old 1000 new 0
<snip>
<n>.687497 PM: suspend exit

On 23/01/2024 15:13, Peter Waller wrote:
> True, it is a warning rather than a hard crash, though shutdown hangs. Thanks for the workaround.
>
> I can provide more dmesg when I’m back at my computer. Do you need the whole thing or is there something in particular you want from it? From memory there isn’t much more in the way of messages that looked connected to me.
>
> Sent from my mobile, please excuse brevity
>
>> On 23 Jan 2024, at 14:59, Igor Russkikh <irusskikh@marvell.com> wrote:
>>
>> 
>>> On 1/21/2024 10:05 PM, Peter Waller wrote:
>>> I see a fix for double free [0] landed in 6.7; I've been running that
>>> for a few days and have hit a resume from suspend issue twice. Stack
>>> trace looks a little different (via __iommu_dma_map instead of
>>> __iommu_dma_free), provided below.
>>>
>>> I've had resume issues with the atlantic driver since I've had this
>>> hardware, but it went away for a while and seems as though it may have
>>> come back with 6.7. (No crashes since logs begin on Dec 15 till Jan 12,
>>> Upgrade to 6.7; crashes 20th and 21st, though my usage style of the
>>> system has also varied, maybe crashes are associated with higher memory
>>> usage?).
>> Hi Peter,
>>
>> Are these hard crashes, or just warnings in dmesg you see?
>>  From the log you provided it looks like a warning, meaning system is usable
>> and driver can be restored with `if down/up` sequence.
>>
>> If so, then this is somewhat expected, because I'm still looking into
>> how to refactor this suspend/resume cycle to reduce mem usage.
>> Permanent workaround would be to reduce rx/tx ring sizes with something like
>>
>>     ethtool -G <iface> rx 1024 tx 1024
>>
>> If it's a hard panic, we should look deeper into it.
>>
>>> Possibly unrelated but I also see fairly frequent (1 to ten times per
>>> boot, since logs begin?) messages in my logs of the form "atlantic
>>> 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014
>>> address=0xffce8000 flags=0x0020]".
>> Seems to be unrelated, but basically indicates HW or FW tries to access unmapped
>> memory addresses, and iommu catches that.
>> Full dmesg may help analyze this.
>>
>> Regards
>>   Igor
Igor Russkikh Jan. 25, 2024, 2:58 p.m. UTC | #13
On 1/23/2024 10:02 PM, Peter Waller wrote:
> Here's part of the log, I can provide more off list if it helps. - Peter
> 
> <n>.678900 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0014 address=0xfc80b000 flags=0x0020]
> <n>.679124 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0014 address=0xffeae520 flags=0x0020]
> <n>.679270 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0014 address=0xfc80c000 flags=0x0020]
> <n>.679411 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0014 address=0xffeae530 flags=0x0020]

Thanks, these look like descriptor prefetch accesses (those aligned to 0x10) and packet page accesses (aligned to 0x1000).
Overall strange, because the driver normally deinits all device activity, so it should not access host memory after unmapping.

I will check if any potential flaws exist.

Regards,
  Igor
Igor Russkikh Feb. 2, 2024, 10:35 a.m. UTC | #14
Hi Peter,

> I've been running with this for a day or two, and had a panic on resume 
> today, dmesg below.
> 
> - Peter
> 
> [65525.454687] atlantic: Boot code hanged
> [65525.561024] ------------[ cut here ]------------
> [65525.561026] hw_atl2_shared_buffer_finish_ack
> [65525.561042] WARNING: CPU: 8 PID: 797385 at 
> drivers/net/ethernet/aquantia/atlantic/hw_atl2/hw_atl2_utils_fw.c:112 
> aq_a2_fw_deinit+0xcd/0xe0 [atlantic]

Unfortunately this seems to be a different issue, HW related. I suspect you have ASUS labeled NIC or MB?

The driver is reporting here that it can't activate the device and the firmware is not responsive.

There exists a BZ:
https://bugzilla.kernel.org/show_bug.cgi?id=217260

exploring a similar problem. It has a workaround patch proposal, but it's questionable.

Igor
Peter Waller Feb. 2, 2024, 10:59 a.m. UTC | #15
On 02/02/2024 10:35, Igor Russkikh wrote:
> Unfortunately this seems to be a different issue, HW related. I suspect you have ASUS labeled NIC or MB?

Thanks for the link. Yes, it's an ASUS motherboard with onboard NIC. 
There are BIOS updates available which may also come with firmware updates.

Do you happen to know if there is a way to identify/fingerprint/version the
firmware on the NIC via the dmesg/lshw output/BIOS screens so that I may
tell if updating the BIOS is also supplying new firmware?

Patch

 drivers/net/ethernet/aquantia/atlantic/aq_ring.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 4de22eed099a..472c7c08bfed 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -932,11 +932,14 @@  void aq_ring_free(struct aq_ring_s *self)
 		return;
 
 	kfree(self->buff_ring);
+	self->buff_ring = NULL;
 
-	if (self->dx_ring)
+	if (self->dx_ring) {
 		dma_free_coherent(aq_nic_get_dev(self->aq_nic),
 				  self->size * self->dx_size, self->dx_ring,
 				  self->dx_ring_pa);
+		self->dx_ring = NULL;
+	}
 }
 
 unsigned int aq_ring_fill_stats_data(struct aq_ring_s *self, u64 *data)