Message ID | 1595351666-28193-3-git-send-email-pillair@codeaurora.org (mailing list archive)
---|---
State | RFC
Delegated to: | Kalle Valo
Series | Add support to process rx packets in thread
On 2020-07-21 10:14, Rakesh Pillai wrote:
> NAPI instance gets scheduled on a CPU core on which
> the IRQ was triggered. The processing of rx packets
> can be CPU intensive and since NAPI cannot be moved
> to a different CPU core, to get better performance,
> it's better to move the gist of rx packet processing
> to a high priority thread.
>
> Add the init/deinit part for a thread to process the
> receive packets.
>
IMHO this defeats the whole purpose of NAPI. Originally in ath10k,
irq processing happened in tasklet (high priority) context, which in
turn pushed more data to the net core even though the network stack
was unable to keep up with the driver, as the two ran in different
contexts (a fast producer - slow consumer issue). Why can't the CPU
governor schedule the interrupts on a less loaded CPU core? Otherwise
you can play with different RPS and affinity settings to meet your
requirement.

IMO introducing high priority tasklets/threads is not a viable solution.

-Rajkumar
On 2020-07-21 23:53, Rajkumar Manoharan wrote:
> On 2020-07-21 10:14, Rakesh Pillai wrote:
>> NAPI instance gets scheduled on a CPU core on which
>> the IRQ was triggered. The processing of rx packets
>> can be CPU intensive and since NAPI cannot be moved
>> to a different CPU core, to get better performance,
>> it's better to move the gist of rx packet processing
>> to a high priority thread.
>>
>> Add the init/deinit part for a thread to process the
>> receive packets.
>>
> IMHO this defeats the whole purpose of NAPI. Originally in ath10k,
> irq processing happened in tasklet (high priority) context, which in
> turn pushed more data to the net core even though the network stack
> was unable to keep up with the driver, as the two ran in different
> contexts (a fast producer - slow consumer issue). Why can't the CPU
> governor schedule the interrupts on a less loaded CPU core? Otherwise
> you can play with different RPS and affinity settings to meet your
> requirement.
>
> IMO introducing high priority tasklets/threads is not a viable solution.

I'm beginning to think that the main problem with NAPI here is that the
work done by poll functions of 802.11 drivers is significantly more CPU
intensive compared to ethernet drivers, possibly more than what NAPI
was designed for.

I'm considering testing a different approach (with mt76 initially):
- Add a mac80211 rx function that puts processed skbs into a list
  instead of handing them to the network stack directly.
- Move all rx processing to a high priority thread, keep a driver
  internal queue for fully processed packets.
- Schedule NAPI poll on completion.
- NAPI poll function pulls from the internal queue and passes to the
  network stack.

With this approach, the network stack retains some control over the
processing rate of rx packets, while the scheduler can move the CPU
intensive processing around to where it fits best.

What do you think?

- Felix
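As an illustration of the split Felix describes above — heavy rx
processing in a schedulable thread, with NAPI keeping flow control —
a minimal sketch might look like the following. This is not Felix's
actual code: all drv_* names are hypothetical, and drv_reap_and_process()
is assumed to block until one frame has been fully processed.

#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct drv_dev {
        struct sk_buff_head rx_done_q;  /* fully processed skbs;
                                         * init with skb_queue_head_init() */
        struct napi_struct napi;
        struct task_struct *rx_thread;
};

/* hypothetical: runs all rx handlers, blocks until an skb is ready */
static struct sk_buff *drv_reap_and_process(struct drv_dev *dev);

/* High priority thread: does the CPU intensive rx work, queues the
 * result and kicks NAPI. The scheduler is free to migrate this thread
 * to a less loaded core.
 */
static int drv_rx_thread_fn(void *data)
{
        struct drv_dev *dev = data;

        while (!kthread_should_stop()) {
                struct sk_buff *skb = drv_reap_and_process(dev);

                if (!skb)
                        continue;
                skb_queue_tail(&dev->rx_done_q, skb);
                napi_schedule(&dev->napi);  /* "schedule NAPI poll on completion" */
        }
        return 0;
}

/* NAPI poll only drains pre-processed skbs, so the stack keeps control
 * over the rx rate via the budget.
 */
static int drv_poll(struct napi_struct *napi, int budget)
{
        struct drv_dev *dev = container_of(napi, struct drv_dev, napi);
        struct sk_buff *skb;
        int done = 0;

        while (done < budget && (skb = skb_dequeue(&dev->rx_done_q))) {
                napi_gro_receive(napi, skb);
                done++;
        }
        if (done < budget) {
                napi_complete_done(napi, done);
                /* re-arm if the thread queued more skbs meanwhile */
                if (!skb_queue_empty(&dev->rx_done_q))
                        napi_schedule(napi);
        }
        return done;
}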
On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> I'm considering testing a different approach (with mt76 initially):
> - Add a mac80211 rx function that puts processed skbs into a list
>   instead of handing them to the network stack directly.

Would this be *after* all the mac80211 processing, i.e. in place of the
rx-up-to-stack?

johannes
On 2020-07-22 14:55, Johannes Berg wrote:
> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>
>> I'm considering testing a different approach (with mt76 initially):
>> - Add a mac80211 rx function that puts processed skbs into a list
>>   instead of handing them to the network stack directly.
>
> Would this be *after* all the mac80211 processing, i.e. in place of the
> rx-up-to-stack?
Yes, it would run all the rx handlers normally and then put the
resulting skbs into a list instead of calling netif_receive_skb or
napi_gro_frags.

- Felix
On 2020-07-22 06:00, Felix Fietkau wrote:
> On 2020-07-22 14:55, Johannes Berg wrote:
>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>
>>> I'm considering testing a different approach (with mt76 initially):
>>> - Add a mac80211 rx function that puts processed skbs into a list
>>>   instead of handing them to the network stack directly.
>>
>> Would this be *after* all the mac80211 processing, i.e. in place of
>> the rx-up-to-stack?
> Yes, it would run all the rx handlers normally and then put the
> resulting skbs into a list instead of calling netif_receive_skb or
> napi_gro_frags.
>
Felix,

This seems like split & batch processing. In the past (ath9k), we
observed some behavioral differences between netif_rx and
netif_receive_skb. The intermediate queue in netif_rx affected not
just performance but also time sensitive application data. I agree
that wireless stack processing might be heavier than ethernet packet
processing. If the hardware supports rx decap offload, NAPI processing
shouldn't be an issue, right?

-Rajkumar
> -----Original Message-----
> From: Rajkumar Manoharan <rmanohar@codeaurora.org>
> Sent: Wednesday, July 22, 2020 3:23 AM
> To: Rakesh Pillai <pillair@codeaurora.org>
> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org;
> linux-kernel@vger.kernel.org; kvalo@codeaurora.org;
> johannes@sipsolutions.net; davem@davemloft.net; kuba@kernel.org;
> netdev@vger.kernel.org; dianders@chromium.org; evgreen@chromium.org;
> linux-wireless-owner@vger.kernel.org
> Subject: Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
>
> On 2020-07-21 10:14, Rakesh Pillai wrote:
> > NAPI instance gets scheduled on a CPU core on which
> > the IRQ was triggered. The processing of rx packets
> > can be CPU intensive and since NAPI cannot be moved
> > to a different CPU core, to get better performance,
> > it's better to move the gist of rx packet processing
> > to a high priority thread.
> >
> > Add the init/deinit part for a thread to process the
> > receive packets.
> >
> IMHO this defeats the whole purpose of NAPI. Originally in ath10k,
> irq processing happened in tasklet (high priority) context, which in
> turn pushed more data to the net core even though the network stack
> was unable to keep up with the driver, as the two ran in different
> contexts (a fast producer - slow consumer issue). Why can't the CPU
> governor schedule the interrupts on a less loaded CPU core? Otherwise
> you can play with different RPS and affinity settings to meet your
> requirement.
>
> IMO introducing high priority tasklets/threads is not a viable solution.

Hi Rajkumar,
In Linux, IRQs are directed to the first core which is booted.
I see that all the IRQs are getting routed to CORE0 even if it's
heavily loaded.

IRQ and NAPI are not under the scheduler's control, since they cannot
be moved; NAPI is scheduled only on the same core as the IRQ. But a
thread can be moved around by the scheduler based on the CPU load.
This is the reason I went ahead with using a thread.

Affinity settings are static, and cannot be changed at runtime without
a downstream userspace entity.

>
> -Rajkumar
On 7/23/2020 11:25 AM, Rakesh Pillai wrote:
> Hi Rajkumar,
> In Linux, IRQs are directed to the first core which is booted.
> I see that all the IRQs are getting routed to CORE0 even if it's
> heavily loaded.
>
You should be able to configure the initial IRQ setup so that they
don't all go on CPU 0 when you create the IRQ. That obviously doesn't
help the case of wanting the scheduler to dynamically move the
processing around to other CPUs, though.
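For reference, the static alternative discussed here can be done
in-driver with an affinity hint when the IRQ is requested, or later
from userspace via /proc/irq/<n>/smp_affinity. A sketch, where the IRQ
number, the "drv_rx" name and the CPU choice are placeholders:

#include <linux/interrupt.h>
#include <linux/cpumask.h>

static int drv_request_rx_irq(unsigned int irq, irq_handler_t handler,
                              void *ctx)
{
        int ret;

        ret = request_irq(irq, handler, 0, "drv_rx", ctx);
        if (ret)
                return ret;

        /* Hint that this IRQ should run on CPU 2 instead of CPU 0.
         * This is static: it does not follow load, which is exactly
         * the limitation raised in this thread.
         */
        irq_set_affinity_hint(irq, cpumask_of(2));
        return 0;
}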
On 7/22/20 6:00 AM, Felix Fietkau wrote:
> On 2020-07-22 14:55, Johannes Berg wrote:
>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>
>>> I'm considering testing a different approach (with mt76 initially):
>>> - Add a mac80211 rx function that puts processed skbs into a list
>>>   instead of handing them to the network stack directly.
>>
>> Would this be *after* all the mac80211 processing, i.e. in place of the
>> rx-up-to-stack?
> Yes, it would run all the rx handlers normally and then put the
> resulting skbs into a list instead of calling netif_receive_skb or
> napi_gro_frags.

Whatever came of this? I realized I'm running Felix's patch since his
mt76 driver needs it. Any chance it will go upstream?

Thanks,
Ben

> - Felix
On Mon, Mar 22, 2021 at 4:58 PM Ben Greear <greearb@candelatech.com> wrote:
> On 7/22/20 6:00 AM, Felix Fietkau wrote:
> > On 2020-07-22 14:55, Johannes Berg wrote:
> >> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> >>
> >>> I'm considering testing a different approach (with mt76 initially):
> >>> - Add a mac80211 rx function that puts processed skbs into a list
> >>>   instead of handing them to the network stack directly.
> >>
> >> Would this be *after* all the mac80211 processing, i.e. in place of the
> >> rx-up-to-stack?
> > Yes, it would run all the rx handlers normally and then put the
> > resulting skbs into a list instead of calling netif_receive_skb or
> > napi_gro_frags.
>
> Whatever came of this? I realized I'm running Felix's patch since his
> mt76 driver needs it. Any chance it will go upstream?

If you're asking about $subject (moving NAPI/RX to a thread), this
landed upstream recently:
http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715

It needs a bit of coaxing to work on a WiFi driver (including: WiFi
drivers tend to have a different netdev for NAPI than they expose to
/sys/class/net/), but it's there.

I'm not sure if people had something else in mind in the stuff you're
quoting though.

Brian
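For context, a driver opts into the threaded NAPI support Brian
mentions with dev_set_threaded(). A minimal sketch using the
netif_napi_add() signature of that era (driver names are placeholders):

#include <linux/netdevice.h>

static int drv_poll(struct napi_struct *napi, int budget); /* driver's poll fn */

static void drv_setup_threaded_napi(struct net_device *ndev,
                                    struct napi_struct *napi)
{
        netif_napi_add(ndev, napi, drv_poll, NAPI_POLL_WEIGHT);

        /* Run this device's NAPI poll in a per-NAPI kthread
         * ("napi/<dev>-<id>") instead of softirq context, so the
         * scheduler can migrate the rx processing between CPUs.
         */
        dev_set_threaded(ndev, true);
}

Userspace can also toggle it by writing 1 to
/sys/class/net/<dev>/threaded; the "coaxing" Brian refers to is that a
WiFi driver's NAPI often lives on an internal netdev not visible there,
so the driver has to make the dev_set_threaded() call itself.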
On 3/22/21 6:20 PM, Brian Norris wrote:
> On Mon, Mar 22, 2021 at 4:58 PM Ben Greear <greearb@candelatech.com> wrote:
>> On 7/22/20 6:00 AM, Felix Fietkau wrote:
>>> On 2020-07-22 14:55, Johannes Berg wrote:
>>>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>>>
>>>>> I'm considering testing a different approach (with mt76 initially):
>>>>> - Add a mac80211 rx function that puts processed skbs into a list
>>>>>   instead of handing them to the network stack directly.
>>>>
>>>> Would this be *after* all the mac80211 processing, i.e. in place of the
>>>> rx-up-to-stack?
>>> Yes, it would run all the rx handlers normally and then put the
>>> resulting skbs into a list instead of calling netif_receive_skb or
>>> napi_gro_frags.
>>
>> Whatever came of this? I realized I'm running Felix's patch since his
>> mt76 driver needs it. Any chance it will go upstream?
>
> If you're asking about $subject (moving NAPI/RX to a thread), this
> landed upstream recently:
> http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715
>
> It needs a bit of coaxing to work on a WiFi driver (including: WiFi
> drivers tend to have a different netdev for NAPI than they expose to
> /sys/class/net/), but it's there.
>
> I'm not sure if people had something else in mind in the stuff you're
> quoting though.

No, I got it confused with something Felix did:

https://github.com/greearb/mt76/blob/master/patches/0001-net-add-support-for-threaded-NAPI-polling.patch

Maybe the NAPI/RX-to-a-thread thing superseded Felix's patch?

Thanks,
Ben

> Brian
On 2021-03-23 04:01, Ben Greear wrote:
> On 3/22/21 6:20 PM, Brian Norris wrote:
>> On Mon, Mar 22, 2021 at 4:58 PM Ben Greear <greearb@candelatech.com> wrote:
>>> On 7/22/20 6:00 AM, Felix Fietkau wrote:
>>>> On 2020-07-22 14:55, Johannes Berg wrote:
>>>>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>>>>
>>>>>> I'm considering testing a different approach (with mt76 initially):
>>>>>> - Add a mac80211 rx function that puts processed skbs into a list
>>>>>>   instead of handing them to the network stack directly.
>>>>>
>>>>> Would this be *after* all the mac80211 processing, i.e. in place of the
>>>>> rx-up-to-stack?
>>>> Yes, it would run all the rx handlers normally and then put the
>>>> resulting skbs into a list instead of calling netif_receive_skb or
>>>> napi_gro_frags.
>>>
>>> Whatever came of this? I realized I'm running Felix's patch since his
>>> mt76 driver needs it. Any chance it will go upstream?
>>
>> If you're asking about $subject (moving NAPI/RX to a thread), this
>> landed upstream recently:
>> http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715
>>
>> It needs a bit of coaxing to work on a WiFi driver (including: WiFi
>> drivers tend to have a different netdev for NAPI than they expose to
>> /sys/class/net/), but it's there.
>>
>> I'm not sure if people had something else in mind in the stuff you're
>> quoting though.
>
> No, I got it confused with something Felix did:
>
> https://github.com/greearb/mt76/blob/master/patches/0001-net-add-support-for-threaded-NAPI-polling.patch
>
> Maybe the NAPI/RX-to-a-thread thing superseded Felix's patch?
Yes, it did, and it's in linux-next already.
I sent the following change to make mt76 use it:
https://github.com/nbd168/wireless/commit/1d4ff31437e5aaa999bd7a

- Felix
> -----Original Message-----
> From: Felix Fietkau <nbd@nbd.name>
> Sent: Tuesday, March 23, 2021 1:16 PM
> To: Ben Greear <greearb@candelatech.com>; Brian Norris
> <briannorris@chromium.org>
> Cc: Johannes Berg <johannes@sipsolutions.net>; Rajkumar Manoharan
> <rmanohar@codeaurora.org>; Rakesh Pillai <pillair@codeaurora.org>;
> ath10k <ath10k@lists.infradead.org>; linux-wireless
> <linux-wireless@vger.kernel.org>; Linux Kernel
> <linux-kernel@vger.kernel.org>; Kalle Valo <kvalo@codeaurora.org>;
> David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>;
> netdev@vger.kernel.org; Doug Anderson <dianders@chromium.org>;
> Evan Green <evgreen@chromium.org>
> Subject: Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
>
> On 2021-03-23 04:01, Ben Greear wrote:
> > On 3/22/21 6:20 PM, Brian Norris wrote:
> >> On Mon, Mar 22, 2021 at 4:58 PM Ben Greear <greearb@candelatech.com> wrote:
> >>> On 7/22/20 6:00 AM, Felix Fietkau wrote:
> >>>> On 2020-07-22 14:55, Johannes Berg wrote:
> >>>>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> >>>>>
> >>>>>> I'm considering testing a different approach (with mt76 initially):
> >>>>>> - Add a mac80211 rx function that puts processed skbs into a list
> >>>>>>   instead of handing them to the network stack directly.
> >>>>>
> >>>>> Would this be *after* all the mac80211 processing, i.e. in place of the
> >>>>> rx-up-to-stack?
> >>>> Yes, it would run all the rx handlers normally and then put the
> >>>> resulting skbs into a list instead of calling netif_receive_skb or
> >>>> napi_gro_frags.
> >>>
> >>> Whatever came of this? I realized I'm running Felix's patch since his
> >>> mt76 driver needs it. Any chance it will go upstream?
> >>
> >> If you're asking about $subject (moving NAPI/RX to a thread), this
> >> landed upstream recently:
> >> http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715
> >>
> >> It needs a bit of coaxing to work on a WiFi driver (including: WiFi
> >> drivers tend to have a different netdev for NAPI than they expose to
> >> /sys/class/net/), but it's there.
> >>
> >> I'm not sure if people had something else in mind in the stuff you're
> >> quoting though.
> >
> > No, I got it confused with something Felix did:
> >
> > https://github.com/greearb/mt76/blob/master/patches/0001-net-add-support-for-threaded-NAPI-polling.patch
> >
> > Maybe the NAPI/RX-to-a-thread thing superseded Felix's patch?
> Yes, it did, and it's in linux-next already.
> I sent the following change to make mt76 use it:
> https://github.com/nbd168/wireless/commit/1d4ff31437e5aaa999bd7a

Hi Felix / Ben,

In the case of ath10k (snoc based targets), we have a lot of processing
in the NAPI context. Even moving this to threaded NAPI is not helping
much due to the load.

Breaking the tasks into multiple contexts (with the patch series I
posted) is helping in improving the throughput. With the current
rx_thread based approach, the rx processing is broken into two parallel
contexts:
1) reaping the packets from the HW
2) processing these packets and handing them over to mac80211 (and
   later to the network stack)

This is the primary reason for choosing the rx thread approach.

Thanks,
Rakesh.

>
> - Felix
On 2021-03-25 10:45, Rakesh Pillai wrote:
> Hi Felix / Ben,
>
> In the case of ath10k (snoc based targets), we have a lot of
> processing in the NAPI context. Even moving this to threaded NAPI is
> not helping much due to the load.
>
> Breaking the tasks into multiple contexts (with the patch series I
> posted) is helping in improving the throughput. With the current
> rx_thread based approach, the rx processing is broken into two
> parallel contexts:
> 1) reaping the packets from the HW
> 2) processing these packets and handing them over to mac80211 (and
>    later to the network stack)
>
> This is the primary reason for choosing the rx thread approach.

Have you considered the possibility that the problem is the driver
simply doing too much work? One example is that you could take
advantage of the new 802.3 decap offload to simplify rx processing.
Worked for me on mt76, where a dual-core 1.3 GHz A64 can easily handle
>1.8 Gbps local TCP rx on a single card, without the rx NAPI thread
being the biggest consumer of CPU cycles.

And if you can't do that and still consider all of the metric tons of
processing work necessary, you could still do this:
On interrupts, spawn a processing thread that traverses the ring and
does the preparation work (instead of NAPI). From that thread you
schedule the threaded NAPI handler that processes these packets
further and hands them to mac80211.
To keep the load somewhat balanced, you can limit the number of
pre-processed packets in the ring.

- Felix
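A rough sketch of the two-stage pipeline Felix suggests, under the
assumption of a simple descriptor-ring API: the interrupt handler wakes
a preparation thread, which does a bounded amount of pre-processing and
then schedules the threaded NAPI handler for the mac80211 hand-off. All
drv_* names and the DRV_MAX_PREPPED bound are hypothetical.

#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/wait.h>

#define DRV_MAX_PREPPED 256     /* cap on pre-processed packets in the ring */

struct drv_hw {
        wait_queue_head_t prep_wq;      /* woken from the irq handler */
        struct napi_struct napi;        /* threaded via dev_set_threaded() */
};

/* hypothetical ring helpers */
static bool drv_ring_has_work(struct drv_hw *hw);
static int drv_prepped_count(struct drv_hw *hw);
static bool drv_prep_one(struct drv_hw *hw);

static int drv_prep_thread(void *data)
{
        struct drv_hw *hw = data;

        while (!kthread_should_stop()) {
                if (wait_event_interruptible(hw->prep_wq,
                                             drv_ring_has_work(hw) ||
                                             kthread_should_stop()))
                        continue;       /* interrupted by a signal */

                /* Bounding the prepared backlog keeps the two stages
                 * roughly balanced, as suggested above.
                 */
                while (drv_prepped_count(hw) < DRV_MAX_PREPPED &&
                       drv_prep_one(hw))
                        ;

                /* Threaded NAPI finishes processing and hands the
                 * packets to mac80211.
                 */
                napi_schedule(&hw->napi);
        }
        return 0;
}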
diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index 9104496..2b520a0 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -668,6 +668,39 @@ static unsigned int ath10k_core_get_fw_feature_str(char *buf,
         return scnprintf(buf, buf_len, "%s", ath10k_core_fw_feature_str[feat]);
 }
 
+int ath10k_core_thread_shutdown(struct ath10k *ar,
+                                struct ath10k_thread *thread)
+{
+        ath10k_dbg(ar, ATH10K_DBG_BOOT, "shutting down %s\n", thread->name);
+        set_bit(ATH10K_THREAD_EVENT_SHUTDOWN, thread->event_flags);
+        wake_up_process(thread->task);
+        wait_for_completion(&thread->shutdown);
+        ath10k_info(ar, "thread %s exited\n", thread->name);
+
+        return 0;
+}
+EXPORT_SYMBOL(ath10k_core_thread_shutdown);
+
+int ath10k_core_thread_init(struct ath10k *ar,
+                            struct ath10k_thread *thread,
+                            int (*handler)(void *data),
+                            char *thread_name)
+{
+        thread->task = kthread_create(handler, thread, thread_name);
+        if (IS_ERR(thread->task))
+                return -EINVAL;
+
+        init_waitqueue_head(&thread->wait_q);
+        init_completion(&thread->shutdown);
+        memcpy(thread->name, thread_name, ATH10K_THREAD_NAME_SIZE_MAX);
+        clear_bit(ATH10K_THREAD_EVENT_SHUTDOWN, thread->event_flags);
+        ath10k_info(ar, "Starting thread %s\n", thread_name);
+        wake_up_process(thread->task);
+
+        return 0;
+}
+EXPORT_SYMBOL(ath10k_core_thread_init);
+
 void ath10k_core_get_fw_features_str(struct ath10k *ar,
                                      char *buf,
                                      size_t buf_len)
diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 5c18f6c..96919e8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -970,6 +970,22 @@ struct ath10k_bus_params {
         bool hl_msdu_ids;
 };
 
+#define ATH10K_THREAD_NAME_SIZE_MAX 32
+
+enum ath10k_thread_events {
+        ATH10K_THREAD_EVENT_SHUTDOWN,
+        ATH10K_THREAD_EVENT_MAX,
+};
+
+struct ath10k_thread {
+        struct ath10k *ar;
+        struct task_struct *task;
+        struct completion shutdown;
+        wait_queue_head_t wait_q;
+        DECLARE_BITMAP(event_flags, ATH10K_THREAD_EVENT_MAX);
+        char name[ATH10K_THREAD_NAME_SIZE_MAX];
+};
+
 struct ath10k {
         struct ath_common ath_common;
         struct ieee80211_hw *hw;
@@ -982,6 +998,7 @@ struct ath10k {
         } msa;
         u8 mac_addr[ETH_ALEN];
 
+        struct ath10k_thread rx_thread;
         enum ath10k_hw_rev hw_rev;
         u16 dev_id;
         u32 chip_id;
@@ -1276,6 +1293,12 @@ static inline bool ath10k_peer_stats_enabled(struct ath10k *ar)
 
 extern unsigned long ath10k_coredump_mask;
 
+int ath10k_core_thread_shutdown(struct ath10k *ar,
+                                struct ath10k_thread *thread);
+int ath10k_core_thread_init(struct ath10k *ar,
+                            struct ath10k_thread *thread,
+                            int (*handler)(void *data),
+                            char *thread_name);
 struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
                                   enum ath10k_bus bus,
                                   enum ath10k_hw_rev hw_rev,
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 1ef5fdb..463c34e 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -909,6 +909,31 @@ static void ath10k_snoc_buffer_cleanup(struct ath10k *ar)
         }
 }
 
+int ath10k_snoc_rx_thread_loop(void *data)
+{
+        struct ath10k_thread *rx_thread = data;
+        struct ath10k *ar = rx_thread->ar;
+        bool shutdown = false;
+
+        ath10k_dbg(ar, ATH10K_DBG_SNOC, "rx thread started\n");
+        set_user_nice(current, -1);
+
+        while (!shutdown) {
+                wait_event_interruptible(
+                        rx_thread->wait_q,
+                        (test_bit(ATH10K_THREAD_EVENT_SHUTDOWN,
+                                  rx_thread->event_flags)));
+                if (test_and_clear_bit(ATH10K_THREAD_EVENT_SHUTDOWN,
+                                       rx_thread->event_flags))
+                        shutdown = true;
+        }
+
+        ath10k_dbg(ar, ATH10K_DBG_SNOC, "rx thread exiting\n");
+        complete(&rx_thread->shutdown);
+
+        do_exit(0);
+}
+
 static void ath10k_snoc_hif_stop(struct ath10k *ar)
 {
         if (!test_bit(ATH10K_FLAG_CRASH_FLUSH, &ar->dev_flags))
@@ -916,6 +941,7 @@ static void ath10k_snoc_hif_stop(struct ath10k *ar)
         napi_synchronize(&ar->napi);
         napi_disable(&ar->napi);
+        ath10k_core_thread_shutdown(ar, &ar->rx_thread);
         ath10k_snoc_buffer_cleanup(ar);
         ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif stop\n");
 }
@@ -923,9 +949,19 @@ static void ath10k_snoc_hif_stop(struct ath10k *ar)
 static int ath10k_snoc_hif_start(struct ath10k *ar)
 {
         struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
+        int ret;
 
         bitmap_clear(ar_snoc->pending_ce_irqs, 0, CE_COUNT_MAX);
         napi_enable(&ar->napi);
+
+        ret = ath10k_core_thread_init(ar, &ar->rx_thread,
+                                      ath10k_snoc_rx_thread_loop,
+                                      "ath10k_rx_thread");
+        if (ret) {
+                ath10k_err(ar, "failed to start rx thread\n");
+                goto rx_thread_fail;
+        }
+
         ath10k_snoc_irq_enable(ar);
         ath10k_snoc_rx_post(ar);
 
@@ -934,6 +970,10 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
         ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
 
         return 0;
+
+rx_thread_fail:
+        napi_disable(&ar->napi);
+        return ret;
 }
 
 static int ath10k_snoc_init_pipes(struct ath10k *ar)
@@ -1652,6 +1692,7 @@ static int ath10k_snoc_probe(struct platform_device *pdev)
                 return -ENOMEM;
         }
 
+        ar->rx_thread.ar = ar;
         ar_snoc = ath10k_snoc_priv(ar);
         ar_snoc->dev = pdev;
         platform_set_drvdata(pdev, ar);
NAPI instance gets scheduled on a CPU core on which the IRQ was
triggered. The processing of rx packets can be CPU intensive and since
NAPI cannot be moved to a different CPU core, to get better
performance, it's better to move the gist of rx packet processing to a
high priority thread.

Add the init/deinit part for a thread to process the receive packets.

Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1-01040-QCAHLSWMTPLZ-1

Signed-off-by: Rakesh Pillai <pillair@codeaurora.org>
---
 drivers/net/wireless/ath/ath10k/core.c | 33 +++++++++++++++++++++++++++
 drivers/net/wireless/ath/ath10k/core.h | 23 +++++++++++++++++++
 drivers/net/wireless/ath/ath10k/snoc.c | 41 ++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+)
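Note that the rx thread loop added by this patch only waits on
ATH10K_THREAD_EVENT_SHUTDOWN; the actual packet hand-off is expected in
later patches of the RFC series. The delivery path would presumably
look roughly like this sketch, where the rx_thread_q skb list is an
assumption and not part of this patch:

/* Hypothetical enqueue path: queue a received skb for the rx thread
 * and wake it via the wait_q this patch initializes.
 */
static void ath10k_rx_thread_enqueue(struct ath10k *ar, struct sk_buff *skb)
{
        skb_queue_tail(&ar->rx_thread_q, skb);  /* assumed skb list */
        wake_up(&ar->rx_thread.wait_q);
}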