Message ID | 1595351666-28193-3-git-send-email-pillair@codeaurora.org (mailing list archive)
---|---
State | New, archived
Series | Add support to process rx packets in thread
On 2020-07-21 10:14, Rakesh Pillai wrote:
> A NAPI instance gets scheduled on the CPU core on which the IRQ was
> triggered. The processing of rx packets can be CPU intensive, and since
> NAPI cannot be moved to a different CPU core, it is better for
> performance to move the bulk of rx packet processing into a high
> priority thread.
>
> Add the init/deinit part for a thread to process the receive packets.

IMHO this defeats the whole purpose of NAPI. Originally in ath10k, irq
processing happened in tasklet (high priority) context, which in turn
pushed more data to the net core even when the network stack was unable
to process it, since the two run in different contexts (a fast producer -
slow consumer issue). Why can't the CPU governor schedule the interrupts
on a less loaded CPU core? Otherwise you can play with different RPS and
affinity settings to meet your requirement.

IMO introducing high priority tasklets/threads is not a viable solution.

-Rajkumar
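For context, the RPS and IRQ affinity settings referred to above are normally configured from userspace by writing CPU bitmasks to procfs/sysfs. Below is a minimal sketch of that, not part of the patch or the discussion; the IRQ number (42), interface name (wlan0), and CPU masks are placeholder values chosen only for illustration.

```c
/*
 * Illustration only: steer an IRQ away from CPU0 and enable RPS on a
 * wlan interface. IRQ number, interface name and masks are placeholders.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Pin the (hypothetical) wlan IRQ 42 to CPU2 (bitmask 0x4). */
	write_str("/proc/irq/42/smp_affinity", "4");

	/* Let RPS spread rx stack processing across CPU1-CPU3 (mask 0xe). */
	write_str("/sys/class/net/wlan0/queues/rx-0/rps_cpus", "e");

	return 0;
}
```

As Rakesh notes later in the thread, these settings are static: some userspace entity has to apply them, and they do not let the scheduler migrate the work dynamically.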
On 2020-07-21 23:53, Rajkumar Manoharan wrote:
> On 2020-07-21 10:14, Rakesh Pillai wrote:
>> A NAPI instance gets scheduled on the CPU core on which the IRQ was
>> triggered. The processing of rx packets can be CPU intensive, and
>> since NAPI cannot be moved to a different CPU core, it is better for
>> performance to move the bulk of rx packet processing into a high
>> priority thread.
>>
>> Add the init/deinit part for a thread to process the receive packets.
>>
> IMHO this defeats the whole purpose of NAPI. Originally in ath10k, irq
> processing happened in tasklet (high priority) context, which in turn
> pushed more data to the net core even when the network stack was unable
> to process it, since the two run in different contexts (a fast producer -
> slow consumer issue). Why can't the CPU governor schedule the interrupts
> on a less loaded CPU core? Otherwise you can play with different RPS and
> affinity settings to meet your requirement.
>
> IMO introducing high priority tasklets/threads is not a viable solution.

I'm beginning to think that the main problem with NAPI here is that the
work done by the poll functions of 802.11 drivers is significantly more
CPU intensive than that of ethernet drivers, possibly more than what
NAPI was designed for.

I'm considering testing a different approach (with mt76 initially); see
the sketch below this message:
- Add a mac80211 rx function that puts processed skbs into a list
  instead of handing them to the network stack directly.
- Move all rx processing to a high priority thread, keep a driver
  internal queue for fully processed packets.
- Schedule NAPI poll on completion.
- NAPI poll function pulls from the internal queue and passes the
  packets to the network stack.

With this approach, the network stack retains some control over the
processing rate of rx packets, while the scheduler can move the CPU
intensive processing around to where it fits best.

What do you think?

- Felix
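A rough sketch of the flow Felix outlines follows. Every identifier in it (drv_rx_ctx, drv_rx_thread, drv_napi_poll, drv_dequeue_raw_frame, drv_run_rx_handlers) is hypothetical: nothing here is an existing mac80211 or mt76 API, it only illustrates the producer (rx thread) / consumer (NAPI poll) split he describes.

```c
/*
 * Hypothetical sketch of the deferred-rx flow described above; none of
 * these symbols exist in mac80211 or mt76.
 */
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/kthread.h>

struct drv_rx_ctx {
	struct sk_buff_head done_q;	/* fully processed frames, skb_queue_head_init() at probe */
	struct napi_struct napi;
	struct task_struct *rx_thread;
};

/* Hypothetical driver hooks, declared only so the sketch is complete. */
struct sk_buff *drv_dequeue_raw_frame(struct drv_rx_ctx *ctx);
void drv_run_rx_handlers(struct drv_rx_ctx *ctx, struct sk_buff *skb);

/* High priority thread: runs the CPU intensive rx handlers. */
static int drv_rx_thread(void *data)
{
	struct drv_rx_ctx *ctx = data;

	while (!kthread_should_stop()) {
		struct sk_buff *skb = drv_dequeue_raw_frame(ctx);

		if (!skb)
			continue;

		drv_run_rx_handlers(ctx, skb);		/* mac80211-style rx processing */
		skb_queue_tail(&ctx->done_q, skb);	/* hand off fully processed frame */
		napi_schedule(&ctx->napi);		/* let the stack pull at its own pace */
	}
	return 0;
}

/* NAPI poll: only feeds already-processed skbs to the network stack. */
static int drv_napi_poll(struct napi_struct *napi, int budget)
{
	struct drv_rx_ctx *ctx = container_of(napi, struct drv_rx_ctx, napi);
	struct sk_buff *skb;
	int done = 0;

	while (done < budget && (skb = skb_dequeue(&ctx->done_q))) {
		netif_receive_skb(skb);
		done++;
	}

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}
```

The key property is that the NAPI budget still throttles how fast packets reach the stack, while the heavy 802.11 rx processing runs in a schedulable thread.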
On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> I'm considering testing a different approach (with mt76 initially):
> - Add a mac80211 rx function that puts processed skbs into a list
>   instead of handing them to the network stack directly.

Would this be *after* all the mac80211 processing, i.e. in place of the
rx-up-to-stack?

johannes
On 2020-07-22 14:55, Johannes Berg wrote:
> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>
>> I'm considering testing a different approach (with mt76 initially):
>> - Add a mac80211 rx function that puts processed skbs into a list
>>   instead of handing them to the network stack directly.
>
> Would this be *after* all the mac80211 processing, i.e. in place of the
> rx-up-to-stack?

Yes, it would run all the rx handlers normally and then put the
resulting skbs into a list instead of calling netif_receive_skb or
napi_gro_frags.

- Felix
On 2020-07-22 06:00, Felix Fietkau wrote:
> On 2020-07-22 14:55, Johannes Berg wrote:
>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>
>>> I'm considering testing a different approach (with mt76 initially):
>>> - Add a mac80211 rx function that puts processed skbs into a list
>>>   instead of handing them to the network stack directly.
>>
>> Would this be *after* all the mac80211 processing, i.e. in place of
>> the rx-up-to-stack?
>
> Yes, it would run all the rx handlers normally and then put the
> resulting skbs into a list instead of calling netif_receive_skb or
> napi_gro_frags.

Felix,

This seems like split & batch processing. In the past (ath9k), we
observed some behavioral differences between netif_rx and
netif_receive_skb. The intermediate queue in netif_rx affected not just
performance but also time sensitive application data. I agree that
wireless stack processing might be heavier than ethernet packet
processing. If the hardware supports rx decap offload, NAPI processing
shouldn't be an issue, right?

-Rajkumar
> -----Original Message-----
> From: Rajkumar Manoharan <rmanohar@codeaurora.org>
> Sent: Wednesday, July 22, 2020 3:23 AM
> To: Rakesh Pillai <pillair@codeaurora.org>
> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org;
> linux-kernel@vger.kernel.org; kvalo@codeaurora.org;
> johannes@sipsolutions.net; davem@davemloft.net; kuba@kernel.org;
> netdev@vger.kernel.org; dianders@chromium.org; evgreen@chromium.org;
> linux-wireless-owner@vger.kernel.org
> Subject: Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
>
> On 2020-07-21 10:14, Rakesh Pillai wrote:
> > A NAPI instance gets scheduled on the CPU core on which the IRQ was
> > triggered. The processing of rx packets can be CPU intensive, and
> > since NAPI cannot be moved to a different CPU core, it is better for
> > performance to move the bulk of rx packet processing into a high
> > priority thread.
> >
> > Add the init/deinit part for a thread to process the receive packets.
> >
> IMHO this defeats the whole purpose of NAPI. Originally in ath10k, irq
> processing happened in tasklet (high priority) context, which in turn
> pushed more data to the net core even when the network stack was unable
> to process it, since the two run in different contexts (a fast producer -
> slow consumer issue). Why can't the CPU governor schedule the interrupts
> on a less loaded CPU core? Otherwise you can play with different RPS and
> affinity settings to meet your requirement.
>
> IMO introducing high priority tasklets/threads is not a viable solution.

Hi Rajkumar,

In Linux, IRQs are directed to the first core which is booted; I see all
the IRQs getting routed to CORE0 even when it is heavily loaded. The IRQ
and NAPI are not under the scheduler's control, since they cannot be
moved to another core: NAPI is scheduled only on the same core as the
IRQ. A thread, however, can be moved around by the scheduler based on
the CPU load. This is the reason I went ahead with using a thread.
Affinity settings are static and cannot be changed at runtime without a
downstream userspace entity.

> -Rajkumar
On 7/23/2020 11:25 AM, Rakesh Pillai wrote:
> Hi Rajkumar,
> In Linux, IRQs are directed to the first core which is booted;
> I see all the IRQs getting routed to CORE0 even when it is heavily
> loaded.

You should be able to configure the initial IRQ setup so that they don't
all go to CPU 0 when you create the IRQ. That obviously doesn't help the
case of wanting the scheduler to dynamically move the processing around
to other CPUs, though.
diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index 9104496..2b520a0 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -668,6 +668,39 @@ static unsigned int ath10k_core_get_fw_feature_str(char *buf,
 	return scnprintf(buf, buf_len, "%s", ath10k_core_fw_feature_str[feat]);
 }
 
+int ath10k_core_thread_shutdown(struct ath10k *ar,
+				struct ath10k_thread *thread)
+{
+	ath10k_dbg(ar, ATH10K_DBG_BOOT, "shutting down %s\n", thread->name);
+	set_bit(ATH10K_THREAD_EVENT_SHUTDOWN, thread->event_flags);
+	wake_up_process(thread->task);
+	wait_for_completion(&thread->shutdown);
+	ath10k_info(ar, "thread %s exited\n", thread->name);
+
+	return 0;
+}
+EXPORT_SYMBOL(ath10k_core_thread_shutdown);
+
+int ath10k_core_thread_init(struct ath10k *ar,
+			    struct ath10k_thread *thread,
+			    int (*handler)(void *data),
+			    char *thread_name)
+{
+	thread->task = kthread_create(handler, thread, thread_name);
+	if (IS_ERR(thread->task))
+		return -EINVAL;
+
+	init_waitqueue_head(&thread->wait_q);
+	init_completion(&thread->shutdown);
+	memcpy(thread->name, thread_name, ATH10K_THREAD_NAME_SIZE_MAX);
+	clear_bit(ATH10K_THREAD_EVENT_SHUTDOWN, thread->event_flags);
+	ath10k_info(ar, "Starting thread %s\n", thread_name);
+	wake_up_process(thread->task);
+
+	return 0;
+}
+EXPORT_SYMBOL(ath10k_core_thread_init);
+
 void ath10k_core_get_fw_features_str(struct ath10k *ar,
 				     char *buf,
 				     size_t buf_len)
diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 5c18f6c..96919e8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -970,6 +970,22 @@ struct ath10k_bus_params {
 	bool hl_msdu_ids;
 };
 
+#define ATH10K_THREAD_NAME_SIZE_MAX	32
+
+enum ath10k_thread_events {
+	ATH10K_THREAD_EVENT_SHUTDOWN,
+	ATH10K_THREAD_EVENT_MAX,
+};
+
+struct ath10k_thread {
+	struct ath10k *ar;
+	struct task_struct *task;
+	struct completion shutdown;
+	wait_queue_head_t wait_q;
+	DECLARE_BITMAP(event_flags, ATH10K_THREAD_EVENT_MAX);
+	char name[ATH10K_THREAD_NAME_SIZE_MAX];
+};
+
 struct ath10k {
 	struct ath_common ath_common;
 	struct ieee80211_hw *hw;
@@ -982,6 +998,7 @@ struct ath10k {
 	} msa;
 
 	u8 mac_addr[ETH_ALEN];
+	struct ath10k_thread rx_thread;
 
 	enum ath10k_hw_rev hw_rev;
 	u16 dev_id;
 	u32 chip_id;
@@ -1276,6 +1293,12 @@ static inline bool ath10k_peer_stats_enabled(struct ath10k *ar)
 
 extern unsigned long ath10k_coredump_mask;
 
+int ath10k_core_thread_shutdown(struct ath10k *ar,
+				struct ath10k_thread *thread);
+int ath10k_core_thread_init(struct ath10k *ar,
+			    struct ath10k_thread *thread,
+			    int (*handler)(void *data),
+			    char *thread_name);
 struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
 				  enum ath10k_bus bus,
 				  enum ath10k_hw_rev hw_rev,
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 1ef5fdb..463c34e 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -909,6 +909,31 @@ static void ath10k_snoc_buffer_cleanup(struct ath10k *ar)
 	}
 }
 
+int ath10k_snoc_rx_thread_loop(void *data)
+{
+	struct ath10k_thread *rx_thread = data;
+	struct ath10k *ar = rx_thread->ar;
+	bool shutdown = false;
+
+	ath10k_dbg(ar, ATH10K_DBG_SNOC, "rx thread started\n");
+	set_user_nice(current, -1);
+
+	while (!shutdown) {
+		wait_event_interruptible(
+			rx_thread->wait_q,
+			(test_bit(ATH10K_THREAD_EVENT_SHUTDOWN,
+				  rx_thread->event_flags)));
+		if (test_and_clear_bit(ATH10K_THREAD_EVENT_SHUTDOWN,
+				       rx_thread->event_flags))
+			shutdown = true;
+	}
+
+	ath10k_dbg(ar, ATH10K_DBG_SNOC, "rx thread exiting\n");
+	complete(&rx_thread->shutdown);
+
+	do_exit(0);
+}
+
 static void ath10k_snoc_hif_stop(struct ath10k *ar)
 {
 	if (!test_bit(ATH10K_FLAG_CRASH_FLUSH, &ar->dev_flags))
@@ -916,6 +941,7 @@ static void ath10k_snoc_hif_stop(struct ath10k *ar)
 	napi_synchronize(&ar->napi);
 	napi_disable(&ar->napi);
+	ath10k_core_thread_shutdown(ar, &ar->rx_thread);
 	ath10k_snoc_buffer_cleanup(ar);
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif stop\n");
 }
@@ -923,9 +949,19 @@ static void ath10k_snoc_hif_stop(struct ath10k *ar)
 static int ath10k_snoc_hif_start(struct ath10k *ar)
 {
 	struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
+	int ret;
 
 	bitmap_clear(ar_snoc->pending_ce_irqs, 0, CE_COUNT_MAX);
 	napi_enable(&ar->napi);
+
+	ret = ath10k_core_thread_init(ar, &ar->rx_thread,
+				      ath10k_snoc_rx_thread_loop,
+				      "ath10k_rx_thread");
+	if (ret) {
+		ath10k_err(ar, "failed to start rx thread\n");
+		goto rx_thread_fail;
+	}
+
 	ath10k_snoc_irq_enable(ar);
 	ath10k_snoc_rx_post(ar);
 
@@ -934,6 +970,10 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
 
 	return 0;
+
+rx_thread_fail:
+	napi_disable(&ar->napi);
+	return ret;
 }
 
 static int ath10k_snoc_init_pipes(struct ath10k *ar)
@@ -1652,6 +1692,7 @@ static int ath10k_snoc_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	}
 
+	ar->rx_thread.ar = ar;
 	ar_snoc = ath10k_snoc_priv(ar);
 	ar_snoc->dev = pdev;
 	platform_set_drvdata(pdev, ar);
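Note that this patch only adds the thread lifecycle: the wait loop above wakes only for the shutdown event. The sketch below shows one way a later patch in the series could wake the thread from the rx interrupt path; the event bit ATH10K_THREAD_EVENT_RX_POST and the function ath10k_snoc_process_rx() are hypothetical names invented for illustration and are not part of this patch.

```c
/*
 * Hypothetical wiring (not part of this patch): a later change would add
 * an rx event bit, set it from the CE interrupt handler, and extend the
 * thread loop to drain pending rx packets.
 */

/* From the CE/IRQ path: defer rx work to the thread instead of NAPI. */
static void ath10k_snoc_rx_wake_thread(struct ath10k *ar)
{
	set_bit(ATH10K_THREAD_EVENT_RX_POST, ar->rx_thread.event_flags);
	wake_up(&ar->rx_thread.wait_q);
}

/*
 * Inside ath10k_snoc_rx_thread_loop(), the wait condition would then also
 * test the rx event bit and call the rx processing routine:
 *
 *	wait_event_interruptible(rx_thread->wait_q,
 *		test_bit(ATH10K_THREAD_EVENT_RX_POST, rx_thread->event_flags) ||
 *		test_bit(ATH10K_THREAD_EVENT_SHUTDOWN, rx_thread->event_flags));
 *
 *	if (test_and_clear_bit(ATH10K_THREAD_EVENT_RX_POST,
 *			       rx_thread->event_flags))
 *		ath10k_snoc_process_rx(ar);
 */
```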
A NAPI instance gets scheduled on the CPU core on which the IRQ was
triggered. The processing of rx packets can be CPU intensive, and since
NAPI cannot be moved to a different CPU core, it is better for
performance to move the bulk of rx packet processing into a high
priority thread.

Add the init/deinit part for a thread to process the receive packets.

Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1-01040-QCAHLSWMTPLZ-1

Signed-off-by: Rakesh Pillai <pillair@codeaurora.org>
---
 drivers/net/wireless/ath/ath10k/core.c | 33 +++++++++++++++++++++++++++
 drivers/net/wireless/ath/ath10k/core.h | 23 +++++++++++++++++++
 drivers/net/wireless/ath/ath10k/snoc.c | 41 ++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+)