Message ID | 20240507-dwc3_per_ep_irqthread-v1-1-f14dec6de19f@pengutronix.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | usb: dwc3: gadget: create per ep interrupts | expand |
Hi Michael, kernel test robot noticed the following build warnings: [auto build test WARNING on dd5a440a31fae6e459c0d6271dddd62825505361] url: https://github.com/intel-lab-lkp/linux/commits/Michael-Grzeschik/usb-dwc3-gadget-create-per-ep-interrupts/20240507-070804 base: dd5a440a31fae6e459c0d6271dddd62825505361 patch link: https://lore.kernel.org/r/20240507-dwc3_per_ep_irqthread-v1-1-f14dec6de19f%40pengutronix.de patch subject: [PATCH] usb: dwc3: gadget: create per ep interrupts config: x86_64-buildonly-randconfig-001-20240507 (https://download.01.org/0day-ci/archive/20240507/202405071800.omy5Ou1D-lkp@intel.com/config) compiler: clang version 18.1.4 (https://github.com/llvm/llvm-project e6c3289804a67ea0bb6a86fadbe454dd93b8d855) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240507/202405071800.omy5Ou1D-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202405071800.omy5Ou1D-lkp@intel.com/ All warnings (new ones prefixed by >>): >> drivers/usb/dwc3/gadget.c:1068:6: warning: variable 'count_processed' set but not used [-Wunused-but-set-variable] 1068 | int count_processed = 0; | ^ 1 warning generated. vim +/count_processed +1068 drivers/usb/dwc3/gadget.c 1059 1060 static void dwc3_process_event_entry(struct dwc3 *dwc, 1061 const union dwc3_event *event); 1062 1063 static irqreturn_t dwc3_endpoint_thread_irq(int irq, void *_dep) 1064 { 1065 struct dwc3_ep *dep = _dep; 1066 struct dwc3 *dwc = dep->dwc; 1067 const union dwc3_event *event; > 1068 int count_processed = 0; 1069 u32 event_raw; 1070 unsigned long flags; 1071 1072 dep->givebacks_current_turn = 0; 1073 1074 spin_lock_irqsave(&dep->event_lock, flags); 1075 1076 if (dep->ep_event_flags & DWC3_EP_EVENT_OVERFLOW) { 1077 dev_err(dwc->dev, "ep%d: event buffer overflow\n", dep->number); 1078 dep->ep_event_flags &= ~DWC3_EP_EVENT_OVERFLOW; 1079 } 1080 1081 while (dep->ep_event_r_index != dep->ep_event_w_index) { 1082 1083 event_raw = dep->ep_event_buffer[dep->ep_event_r_index]; 1084 1085 /* 1086 * we have a copy of the event, so we can release the lock 1087 */ 1088 spin_unlock_irqrestore(&dep->event_lock, flags); 1089 1090 event = (const union dwc3_event *) &event_raw; 1091 1092 spin_lock(&dwc->lock); 1093 dwc3_process_event_entry(dwc, event); 1094 spin_unlock(&dwc->lock); 1095 1096 /* 1097 * we need to re-acquire the lock to update the read index 1098 */ 1099 spin_lock_irqsave(&dep->event_lock, flags); 1100 1101 dep->ep_event_r_index = (dep->ep_event_r_index + 1) % 1102 ARRAY_SIZE(dep->ep_event_buffer); 1103 1104 count_processed += 1; 1105 } 1106 1107 spin_unlock_irqrestore(&dep->event_lock, flags); 1108 1109 return IRQ_HANDLED; 1110 } 1111
Hi Michael,
kernel test robot noticed the following build warnings:
[auto build test WARNING on dd5a440a31fae6e459c0d6271dddd62825505361]
url: https://github.com/intel-lab-lkp/linux/commits/Michael-Grzeschik/usb-dwc3-gadget-create-per-ep-interrupts/20240507-070804
base: dd5a440a31fae6e459c0d6271dddd62825505361
patch link: https://lore.kernel.org/r/20240507-dwc3_per_ep_irqthread-v1-1-f14dec6de19f%40pengutronix.de
patch subject: [PATCH] usb: dwc3: gadget: create per ep interrupts
config: i386-randconfig-141-20240507 (https://download.01.org/0day-ci/archive/20240507/202405072202.VCOpSmIK-lkp@intel.com/config)
compiler: clang version 18.1.4 (https://github.com/llvm/llvm-project e6c3289804a67ea0bb6a86fadbe454dd93b8d855)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405072202.VCOpSmIK-lkp@intel.com/
New smatch warnings:
drivers/usb/dwc3/gadget.c:1126 dwc3_gadget_init_endpoint_irq() warn: unsigned 'dep->irq_endpoint' is never less than zero.
drivers/usb/dwc3/gadget.c:3077 dwc3_gadget_start() warn: 'dwc->irq_gadget' from request_threaded_irq() not released on lines: 3067.
Old smatch warnings:
drivers/usb/dwc3/gadget.c:1733 __dwc3_gadget_kick_transfer() warn: missing error code? 'ret'
drivers/usb/dwc3/gadget.c:2848 dwc3_gadget_pullup() warn: pm_runtime_get_sync() also returns 1 on success
vim +1126 drivers/usb/dwc3/gadget.c
1111
1112 static int dwc3_gadget_init_endpoint_irq(struct dwc3 *dwc, struct dwc3_ep *dep)
1113 {
1114 char *irq_name;
1115 int ret = 0;
1116
1117 /* FIXME: endpoint.claimed would be better here, but somehow
1118 * the composite gadget layer is leaving the claimed value to 0
1119 * after calling usb_ep_autoconfig_reset after the final bind
1120 */
1121 /* ep0in and ep0out share the same interrupt thread */
1122 if (!dep->endpoint.address && dep->number)
1123 return 0;
1124
1125 dep->irq_endpoint = irq_create_mapping(dwc->ep_irq_domain, dep->number);
> 1126 if (dep->irq_endpoint < 0) {
1127 ret = dep->irq_endpoint;
1128
1129 dev_err(dwc->dev, "failed to map irq for ep%d --> %d\n",
1130 dep->number, ret);
1131 return ret;
1132 }
1133
1134 irq_name = kzalloc(16, GFP_KERNEL);
1135 if (!dep->number)
1136 snprintf(irq_name, 16, "ep0");
1137 else
1138 snprintf(irq_name, 16, "ep%d%s", dep->number >> 1, dep->direction ?
1139 "in" : "out");
1140
1141 ret = request_threaded_irq(dep->irq_endpoint, dwc3_endpoint_irq,
1142 dwc3_endpoint_thread_irq, IRQF_SHARED,
1143 irq_name, dep);
1144 if (ret) {
1145 irq_dispose_mapping(irq_find_mapping(dwc->ep_irq_domain, dep->number));
1146 dev_err(dwc->dev, "failed to request irq #%d --> %d\n",
1147 dep->irq_endpoint, ret);
1148 }
1149
1150 return ret;
1151 }
1152
Hi Michael, On 5/6/2024 4:06 PM, Michael Grzeschik wrote: > This patch is splitting up the interrupt event handling from one > interrupt thread to separate per endpoint interrupt threads. > I assume that the incentive from doing this is to improve overall throughput numbers. Would you be able to share some data on the benefits of moving to per EP event management? > To achieve this we create a new dwc3 interrupt domain in which > we map all claimed interrupts to individual interrupt threads. > > Although the gadget layer is preparing the claimed parameter > of each usb_ep which could be checked if the endpoint is > to used or not, the claimed value was 0 for each ep in gadget_start. > This was tested when describing some composite gadget using configfs. > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs (not sure if you're using this) whenever it adds a USB config. This is to handle multi config situations, so subsequent USB configs can be assigned (resuse) endpoints, since only one config is active at a time for a USB device. This was a struggle for me as well when adding the TXFIFO resizing logic. We won't actually know which EPs are going to be used until the host issues the set configuration packet to select a config, and the set_alt() callback issues usb_ep_enable(). So the implementation (TXFIFO resizing) is currently based on the maximum potential endpoints used by any USB configuration. Not sure if having 31 (potentially) different IRQ entries would be ok, but maybe it would be simpler to just to request IRQ for dwc->num_eps always? Have you tried this on a multi config device? Thanks Wesley Cheng
On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote: >Hi Michael, > >On 5/6/2024 4:06 PM, Michael Grzeschik wrote: >>This patch is splitting up the interrupt event handling from one >>interrupt thread to separate per endpoint interrupt threads. >> > >I assume that the incentive from doing this is to improve overall >throughput numbers. Would you be able to share some data on the >benefits of moving to per EP event management? The main benefit is to make it possible to use high demanding usb endpoints simultaneously. In our special case we saw that streaming via uac and streaming via uvc was producing noise in the audio stream. This was due to the fact, that the isoc feedback endpoint that would adjust the samplerate was not being called fast enough when there was heavy a lot of traffic in the uvc endpoint context. By moving the endpoints into their own thread handlers the short feedback requests are at least able to be scheduled in between the bursts of the uvc packages. The next step is to have all threads running on different cpu cores, without interfering each other. However, as we still have not matrix irq allocator for arm, there still is no direct benefit from that yet. >>To achieve this we create a new dwc3 interrupt domain in which >>we map all claimed interrupts to individual interrupt threads. >> >>Although the gadget layer is preparing the claimed parameter >>of each usb_ep which could be checked if the endpoint is >>to used or not, the claimed value was 0 for each ep in gadget_start. >>This was tested when describing some composite gadget using configfs. >> > >yeah... the claimed flag is cleared by the USB gadget, ie USB configfs >(not sure if you're using this) whenever it adds a USB config. This >is to handle multi config situations, so subsequent USB configs can be >assigned (resuse) endpoints, since only one config is active at a time >for a USB device. > >This was a struggle for me as well when adding the TXFIFO resizing >logic. We won't actually know which EPs are going to be used until >the host issues the set configuration packet to select a config, and >the set_alt() callback issues usb_ep_enable(). So the implementation >(TXFIFO resizing) is currently based on the maximum potential >endpoints used by any USB configuration. > >Not sure if having 31 (potentially) different IRQ entries would be ok, >but maybe it would be simpler to just to request IRQ for dwc->num_eps >always? > >Have you tried this on a multi config device? No, I didn't. I doubt that this will work after your explanation. So thanks for the insides! I tried putting the request_threaded_irq into the ep_enable function but this does not work as I see a lot of schedule while atomic errors. This is possible as ep_enable is called from an set alt coming from ep0 interrupt thread context. So there is probably now no other option left to have exact endpoint interrupt threads. I will rework this back to request a kthread for each endpoint even as we will probably would not be using them. Regards, Michael
On Wed, May 08, 2024, Michael Grzeschik wrote: > On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote: > > Hi Michael, > > > > On 5/6/2024 4:06 PM, Michael Grzeschik wrote: > > > This patch is splitting up the interrupt event handling from one > > > interrupt thread to separate per endpoint interrupt threads. > > > > > > > I assume that the incentive from doing this is to improve overall > > throughput numbers. Would you be able to share some data on the > > benefits of moving to per EP event management? > > The main benefit is to make it possible to use high demanding usb > endpoints simultaneously. In our special case we saw that streaming > via uac and streaming via uvc was producing noise in the audio > stream. This was due to the fact, that the isoc feedback endpoint > that would adjust the samplerate was not being called fast enough > when there was heavy a lot of traffic in the uvc endpoint context. > > By moving the endpoints into their own thread handlers the short > feedback requests are at least able to be scheduled in between the bursts > of the uvc packages. The next step is to have all threads running on > different cpu cores, without interfering each other. However, as we > still have not matrix irq allocator for arm, there still is no direct > benefit from that yet. > > > > > To achieve this we create a new dwc3 interrupt domain in which > > > we map all claimed interrupts to individual interrupt threads. > > > > > > Although the gadget layer is preparing the claimed parameter > > > of each usb_ep which could be checked if the endpoint is > > > to used or not, the claimed value was 0 for each ep in gadget_start. > > > This was tested when describing some composite gadget using configfs. > > > > > > > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs > > (not sure if you're using this) whenever it adds a USB config. This is > > to handle multi config situations, so subsequent USB configs can be > > assigned (resuse) endpoints, since only one config is active at a time > > for a USB device. > > > > This was a struggle for me as well when adding the TXFIFO resizing > > logic. We won't actually know which EPs are going to be used until the > > host issues the set configuration packet to select a config, and the > > set_alt() callback issues usb_ep_enable(). So the implementation > > (TXFIFO resizing) is currently based on the maximum potential endpoints > > used by any USB configuration. > > > > Not sure if having 31 (potentially) different IRQ entries would be ok, > > but maybe it would be simpler to just to request IRQ for dwc->num_eps > > always? > > > > Have you tried this on a multi config device? > > No, I didn't. I doubt that this will work after your explanation. So > thanks for the insides! > > I tried putting the request_threaded_irq into the ep_enable function > but this does not work as I see a lot of schedule while atomic > errors. This is possible as ep_enable is called from an set alt > coming from ep0 interrupt thread context. > > So there is probably now no other option left to have exact endpoint > interrupt threads. I will rework this back to request a kthread for each > endpoint even as we will probably would not be using them. > Do you have any data on latency here? I don't see how introducing more soft interrupts would improve on latency, if anything, it should be worse? This is making the driver way more complicated and potentially introduce many bugs. I may be wrong here, but I suspect that by multiplying the interrupt handlings, you _may_ see improvement due to the a higher chance being selected by the scheduler. However, the overall latency will probably be worse. (correct me if I'm wrong). This will affect other applications. Let's not do this. BR, Thinh
On Wed, May 08, 2024 at 11:20:03PM +0000, Thinh Nguyen wrote: >On Wed, May 08, 2024, Michael Grzeschik wrote: >> On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote: >> > Hi Michael, >> > >> > On 5/6/2024 4:06 PM, Michael Grzeschik wrote: >> > > This patch is splitting up the interrupt event handling from one >> > > interrupt thread to separate per endpoint interrupt threads. >> > > >> > >> > I assume that the incentive from doing this is to improve overall >> > throughput numbers. Would you be able to share some data on the >> > benefits of moving to per EP event management? >> >> The main benefit is to make it possible to use high demanding usb >> endpoints simultaneously. In our special case we saw that streaming >> via uac and streaming via uvc was producing noise in the audio >> stream. This was due to the fact, that the isoc feedback endpoint >> that would adjust the samplerate was not being called fast enough >> when there was heavy a lot of traffic in the uvc endpoint context. >> >> By moving the endpoints into their own thread handlers the short >> feedback requests are at least able to be scheduled in between the bursts >> of the uvc packages. The next step is to have all threads running on >> different cpu cores, without interfering each other. However, as we >> still have not matrix irq allocator for arm, there still is no direct >> benefit from that yet. >> >> >> > > To achieve this we create a new dwc3 interrupt domain in which >> > > we map all claimed interrupts to individual interrupt threads. >> > > >> > > Although the gadget layer is preparing the claimed parameter >> > > of each usb_ep which could be checked if the endpoint is >> > > to used or not, the claimed value was 0 for each ep in gadget_start. >> > > This was tested when describing some composite gadget using configfs. >> > > >> > >> > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs >> > (not sure if you're using this) whenever it adds a USB config. This is >> > to handle multi config situations, so subsequent USB configs can be >> > assigned (resuse) endpoints, since only one config is active at a time >> > for a USB device. >> > >> > This was a struggle for me as well when adding the TXFIFO resizing >> > logic. We won't actually know which EPs are going to be used until the >> > host issues the set configuration packet to select a config, and the >> > set_alt() callback issues usb_ep_enable(). So the implementation >> > (TXFIFO resizing) is currently based on the maximum potential endpoints >> > used by any USB configuration. >> > >> > Not sure if having 31 (potentially) different IRQ entries would be ok, >> > but maybe it would be simpler to just to request IRQ for dwc->num_eps >> > always? >> > >> > Have you tried this on a multi config device? >> >> No, I didn't. I doubt that this will work after your explanation. So >> thanks for the insides! >> >> I tried putting the request_threaded_irq into the ep_enable function >> but this does not work as I see a lot of schedule while atomic >> errors. This is possible as ep_enable is called from an set alt >> coming from ep0 interrupt thread context. >> >> So there is probably now no other option left to have exact endpoint >> interrupt threads. I will rework this back to request a kthread for each >> endpoint even as we will probably would not be using them. >> > >Do you have any data on latency here? I don't have the exact numbers for the uac feedback isoc endpoint at the moment. But without the patch applied, it was reproducably returning with EXDEV when we started uvc streaming and therefor increased the amount of events per interrupt thread cycle. With the patch applied however, we are able to only route the events to the corresponding soft irqs and leave the moment of truth to the scheduler. >I don't see how introducing more soft interrupts would improve on >latency, if anything, it should be worse? Why should explicit handling of coherent ep events on one cpu core introduce more latency then by interleaving different events for arbitrary ep all in one thread? >This is making the driver way more complicated and potentially >introduce many bugs. Possible, but not unsolvable. >I may be wrong here, but I suspect that by multiplying the interrupt >handlings, you _may_ see improvement due to the a higher chance being >selected by the scheduler. However, the overall latency will probably >be worse. (correct me if I'm wrong). I doubt that it will be worse if each softirq can be handled on different cpus at the same time. >This will affect other applications. Let's make sure we will not break anything on the way. Okay? :) >Let's not do this. I actually thought that this is even requested: https://docs.kernel.org/usb/dwc3.html Michael
On Thu, May 09, 2024, Michael Grzeschik wrote: > On Wed, May 08, 2024 at 11:20:03PM +0000, Thinh Nguyen wrote: > > On Wed, May 08, 2024, Michael Grzeschik wrote: > > > On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote: > > > > Hi Michael, > > > > > > > > On 5/6/2024 4:06 PM, Michael Grzeschik wrote: > > > > > This patch is splitting up the interrupt event handling from one > > > > > interrupt thread to separate per endpoint interrupt threads. > > > > > > > > > > > > > I assume that the incentive from doing this is to improve overall > > > > throughput numbers. Would you be able to share some data on the > > > > benefits of moving to per EP event management? > > > > > > The main benefit is to make it possible to use high demanding usb > > > endpoints simultaneously. In our special case we saw that streaming > > > via uac and streaming via uvc was producing noise in the audio > > > stream. This was due to the fact, that the isoc feedback endpoint > > > that would adjust the samplerate was not being called fast enough > > > when there was heavy a lot of traffic in the uvc endpoint context. > > > > > > By moving the endpoints into their own thread handlers the short > > > feedback requests are at least able to be scheduled in between the bursts > > > of the uvc packages. The next step is to have all threads running on > > > different cpu cores, without interfering each other. However, as we > > > still have not matrix irq allocator for arm, there still is no direct > > > benefit from that yet. > > > > > > > > > > > To achieve this we create a new dwc3 interrupt domain in which > > > > > we map all claimed interrupts to individual interrupt threads. > > > > > > > > > > Although the gadget layer is preparing the claimed parameter > > > > > of each usb_ep which could be checked if the endpoint is > > > > > to used or not, the claimed value was 0 for each ep in gadget_start. > > > > > This was tested when describing some composite gadget using configfs. > > > > > > > > > > > > > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs > > > > (not sure if you're using this) whenever it adds a USB config. This is > > > > to handle multi config situations, so subsequent USB configs can be > > > > assigned (resuse) endpoints, since only one config is active at a time > > > > for a USB device. > > > > > > > > This was a struggle for me as well when adding the TXFIFO resizing > > > > logic. We won't actually know which EPs are going to be used until the > > > > host issues the set configuration packet to select a config, and the > > > > set_alt() callback issues usb_ep_enable(). So the implementation > > > > (TXFIFO resizing) is currently based on the maximum potential endpoints > > > > used by any USB configuration. > > > > > > > > Not sure if having 31 (potentially) different IRQ entries would be ok, > > > > but maybe it would be simpler to just to request IRQ for dwc->num_eps > > > > always? > > > > > > > > Have you tried this on a multi config device? > > > > > > No, I didn't. I doubt that this will work after your explanation. So > > > thanks for the insides! > > > > > > I tried putting the request_threaded_irq into the ep_enable function > > > but this does not work as I see a lot of schedule while atomic > > > errors. This is possible as ep_enable is called from an set alt > > > coming from ep0 interrupt thread context. > > > > > > So there is probably now no other option left to have exact endpoint > > > interrupt threads. I will rework this back to request a kthread for each > > > endpoint even as we will probably would not be using them. > > > > > > > Do you have any data on latency here? > > I don't have the exact numbers for the uac feedback isoc endpoint > at the moment. But without the patch applied, it was reproducably > returning with EXDEV when we started uvc streaming and therefor > increased the amount of events per interrupt thread cycle. > > With the patch applied however, we are able to only route the events to > the corresponding soft irqs and leave the moment of truth to the > scheduler. Basically you're trying increase the priority of dwc3 by the greater number of soft interrupts. > > > I don't see how introducing more soft interrupts would improve on > > latency, if anything, it should be worse? > > Why should explicit handling of coherent ep events on one cpu core > introduce more latency then by interleaving different events for > arbitrary ep all in one thread? Because we are only using a single interrupt line, the sequence of events need to be handled 1 set at a time. The amount of time saved from handling interrupts of different endpoint should be miniscule. There's latency to switching context and locking, which I think would offset and introduce more latency than what you can potentially save. I'd like to see data on the improvement on the net latency here. > > > This is making the driver way more complicated and potentially > > introduce many bugs. > > Possible, but not unsolvable. > > > I may be wrong here, but I suspect that by multiplying the interrupt > > handlings, you _may_ see improvement due to the a higher chance being > > selected by the scheduler. However, the overall latency will probably > > be worse. (correct me if I'm wrong). > > I doubt that it will be worse if each softirq can be handled on > different cpus at the same time. See comment above. > > > This will affect other applications. > > Let's make sure we will not break anything on the way. Okay? :) > > > Let's not do this. > > I actually thought that this is even requested: > > https://docs.kernel.org/usb/dwc3.html > That's a very old and outdate TODO list. We don't use wait_for_completion_timeout in the commands. During transfers, we're using Update Transfer command, and it completes almost immediately. The only time where a command may take a longer time is when we need to bring the device down for reset/disconnect and stop transfers, but that's not what contributes to the problem here. Internal tests show that we can achieve very close theoretical USB speeds with the current dwc3 implementation. BR, Thinh
On Thu, May 09, 2024 at 12:23:09AM +0000, Thinh Nguyen wrote: >On Thu, May 09, 2024, Michael Grzeschik wrote: >> On Wed, May 08, 2024 at 11:20:03PM +0000, Thinh Nguyen wrote: >> > On Wed, May 08, 2024, Michael Grzeschik wrote: >> > > On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote: >> > > > Hi Michael, >> > > > >> > > > On 5/6/2024 4:06 PM, Michael Grzeschik wrote: >> > > > > This patch is splitting up the interrupt event handling from one >> > > > > interrupt thread to separate per endpoint interrupt threads. >> > > > > >> > > > >> > > > I assume that the incentive from doing this is to improve overall >> > > > throughput numbers. Would you be able to share some data on the >> > > > benefits of moving to per EP event management? >> > > >> > > The main benefit is to make it possible to use high demanding usb >> > > endpoints simultaneously. In our special case we saw that streaming >> > > via uac and streaming via uvc was producing noise in the audio >> > > stream. This was due to the fact, that the isoc feedback endpoint >> > > that would adjust the samplerate was not being called fast enough >> > > when there was heavy a lot of traffic in the uvc endpoint context. >> > > >> > > By moving the endpoints into their own thread handlers the short >> > > feedback requests are at least able to be scheduled in between the bursts >> > > of the uvc packages. The next step is to have all threads running on >> > > different cpu cores, without interfering each other. However, as we >> > > still have not matrix irq allocator for arm, there still is no direct >> > > benefit from that yet. >> > > >> > > >> > > > > To achieve this we create a new dwc3 interrupt domain in which >> > > > > we map all claimed interrupts to individual interrupt threads. >> > > > > >> > > > > Although the gadget layer is preparing the claimed parameter >> > > > > of each usb_ep which could be checked if the endpoint is >> > > > > to used or not, the claimed value was 0 for each ep in gadget_start. >> > > > > This was tested when describing some composite gadget using configfs. >> > > > > >> > > > >> > > > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs >> > > > (not sure if you're using this) whenever it adds a USB config. This is >> > > > to handle multi config situations, so subsequent USB configs can be >> > > > assigned (resuse) endpoints, since only one config is active at a time >> > > > for a USB device. >> > > > >> > > > This was a struggle for me as well when adding the TXFIFO resizing >> > > > logic. We won't actually know which EPs are going to be used until the >> > > > host issues the set configuration packet to select a config, and the >> > > > set_alt() callback issues usb_ep_enable(). So the implementation >> > > > (TXFIFO resizing) is currently based on the maximum potential endpoints >> > > > used by any USB configuration. >> > > > >> > > > Not sure if having 31 (potentially) different IRQ entries would be ok, >> > > > but maybe it would be simpler to just to request IRQ for dwc->num_eps >> > > > always? >> > > > >> > > > Have you tried this on a multi config device? >> > > >> > > No, I didn't. I doubt that this will work after your explanation. So >> > > thanks for the insides! >> > > >> > > I tried putting the request_threaded_irq into the ep_enable function >> > > but this does not work as I see a lot of schedule while atomic >> > > errors. This is possible as ep_enable is called from an set alt >> > > coming from ep0 interrupt thread context. >> > > >> > > So there is probably now no other option left to have exact endpoint >> > > interrupt threads. I will rework this back to request a kthread for each >> > > endpoint even as we will probably would not be using them. >> > > >> > >> > Do you have any data on latency here? >> >> I don't have the exact numbers for the uac feedback isoc endpoint >> at the moment. But without the patch applied, it was reproducably >> returning with EXDEV when we started uvc streaming and therefor >> increased the amount of events per interrupt thread cycle. >> >> With the patch applied however, we are able to only route the events to >> the corresponding soft irqs and leave the moment of truth to the >> scheduler. > >Basically you're trying increase the priority of dwc3 by the greater >number of soft interrupts. Possible. Never thought about this. >> > I don't see how introducing more soft interrupts would improve on >> > latency, if anything, it should be worse? >> >> Why should explicit handling of coherent ep events on one cpu core >> introduce more latency then by interleaving different events for >> arbitrary ep all in one thread? > >Because we are only using a single interrupt line, the sequence of >events need to be handled 1 set at a time. The amount of time saved from >handling interrupts of different endpoint should be miniscule. There's >latency to switching context and locking, which I think would offset and >introduce more latency than what you can potentially save. > >I'd like to see data on the improvement on the net latency here. If this is the case. Then we are currently dealing with way to much durtion in the complete handler of the endpoints. I can't really tell for the uac endpoints. But the uvc complete endpoint is going through this roundtrip. With no_interupt = 0 at every 16 request: dwc3_endpoint_interrupt dwc3_gadget_endpoint_trbs_complete dwc3_gadget_ep_cleanup_completed_requests ~16 * { dwc3_gadget_ep_cleanup_completed_request dwc3_gadget_giveback usb_gadget_giveback_request usb_ep_queue __dwc3_gadget_ep_queue dwc3_prepare_trbs ~ * { dwc3_prepare_trbs_sg/dwc3_prepare_trbs_linear } dwc3_send_gadget_ep_cmd } I think this is a lot of stack for an interrupt thread to handle if you really want to pipeline this in one irqthread run and leave and make sure that the other endpoints will also be handled soon enough. >> >> > This is making the driver way more complicated and potentially >> > introduce many bugs. >> >> Possible, but not unsolvable. >> >> > I may be wrong here, but I suspect that by multiplying the interrupt >> > handlings, you _may_ see improvement due to the a higher chance being >> > selected by the scheduler. However, the overall latency will probably >> > be worse. (correct me if I'm wrong). >> >> I doubt that it will be worse if each softirq can be handled on >> different cpus at the same time. > >See comment above. To solve this issue I see two options: We could either do this by having different interrupt threads per ep like in this patch. Or we ensure that the complete handler is not running that long. This could be ensured by providing an interface that is similar to the threaded interrupt interface. The complete handler should then only wake up the corresponding complete thread. This policy of a short running complete handler also should be commented somewhere in the kernel. Which brings me back to the open discussion with avichal, where I already ment that it should be possible to completely remove the usb_ep_queue callback from the complete handler. We there should only update the buffer state and make sure that the pump worker would take care of queueing the right buffers to the dwc3 driver. I will go more into the details in this thread: https://lore.kernel.org/all/17192e0f-7f18-49ae-96fc-71054d46f74a@google.com/ >> > This will affect other applications. >> >> Let's make sure we will not break anything on the way. Okay? :) >> >> > Let's not do this. >> >> I actually thought that this is even requested: >> >> https://docs.kernel.org/usb/dwc3.html >> > >That's a very old and outdate TODO list. We should ensure that this chapter will be removed then. >We don't use wait_for_completion_timeout in the commands. During >transfers, we're using Update Transfer command, and it completes almost >immediately. The only time where a command may take a longer time is >when we need to bring the device down for reset/disconnect and stop >transfers, but that's not what contributes to the problem here. > >Internal tests show that we can achieve very close theoretical USB >speeds with the current dwc3 implementation. Granted, but only if we ensure that the complete() callback is not destroing the runtime duration and probably no usb_ep_queue is never called from the complete callback. Michael
On Fri, May 10, 2024, Michael Grzeschik wrote: > On Thu, May 09, 2024 at 12:23:09AM +0000, Thinh Nguyen wrote: > > On Thu, May 09, 2024, Michael Grzeschik wrote: > > > On Wed, May 08, 2024 at 11:20:03PM +0000, Thinh Nguyen wrote: > > > > On Wed, May 08, 2024, Michael Grzeschik wrote: > > > > > On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote: > > > > > > Hi Michael, > > > > > > > > > > > > On 5/6/2024 4:06 PM, Michael Grzeschik wrote: > > > > > > > This patch is splitting up the interrupt event handling from one > > > > > > > interrupt thread to separate per endpoint interrupt threads. > > > > > > > > > > > > > > > > > > > I assume that the incentive from doing this is to improve overall > > > > > > throughput numbers. Would you be able to share some data on the > > > > > > benefits of moving to per EP event management? > > > > > > > > > > The main benefit is to make it possible to use high demanding usb > > > > > endpoints simultaneously. In our special case we saw that streaming > > > > > via uac and streaming via uvc was producing noise in the audio > > > > > stream. This was due to the fact, that the isoc feedback endpoint > > > > > that would adjust the samplerate was not being called fast enough > > > > > when there was heavy a lot of traffic in the uvc endpoint context. > > > > > > > > > > By moving the endpoints into their own thread handlers the short > > > > > feedback requests are at least able to be scheduled in between the bursts > > > > > of the uvc packages. The next step is to have all threads running on > > > > > different cpu cores, without interfering each other. However, as we > > > > > still have not matrix irq allocator for arm, there still is no direct > > > > > benefit from that yet. > > > > > > > > > > > > > > > > > To achieve this we create a new dwc3 interrupt domain in which > > > > > > > we map all claimed interrupts to individual interrupt threads. > > > > > > > > > > > > > > Although the gadget layer is preparing the claimed parameter > > > > > > > of each usb_ep which could be checked if the endpoint is > > > > > > > to used or not, the claimed value was 0 for each ep in gadget_start. > > > > > > > This was tested when describing some composite gadget using configfs. > > > > > > > > > > > > > > > > > > > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs > > > > > > (not sure if you're using this) whenever it adds a USB config. This is > > > > > > to handle multi config situations, so subsequent USB configs can be > > > > > > assigned (resuse) endpoints, since only one config is active at a time > > > > > > for a USB device. > > > > > > > > > > > > This was a struggle for me as well when adding the TXFIFO resizing > > > > > > logic. We won't actually know which EPs are going to be used until the > > > > > > host issues the set configuration packet to select a config, and the > > > > > > set_alt() callback issues usb_ep_enable(). So the implementation > > > > > > (TXFIFO resizing) is currently based on the maximum potential endpoints > > > > > > used by any USB configuration. > > > > > > > > > > > > Not sure if having 31 (potentially) different IRQ entries would be ok, > > > > > > but maybe it would be simpler to just to request IRQ for dwc->num_eps > > > > > > always? > > > > > > > > > > > > Have you tried this on a multi config device? > > > > > > > > > > No, I didn't. I doubt that this will work after your explanation. So > > > > > thanks for the insides! > > > > > > > > > > I tried putting the request_threaded_irq into the ep_enable function > > > > > but this does not work as I see a lot of schedule while atomic > > > > > errors. This is possible as ep_enable is called from an set alt > > > > > coming from ep0 interrupt thread context. > > > > > > > > > > So there is probably now no other option left to have exact endpoint > > > > > interrupt threads. I will rework this back to request a kthread for each > > > > > endpoint even as we will probably would not be using them. > > > > > > > > > > > > > Do you have any data on latency here? > > > > > > I don't have the exact numbers for the uac feedback isoc endpoint > > > at the moment. But without the patch applied, it was reproducably > > > returning with EXDEV when we started uvc streaming and therefor > > > increased the amount of events per interrupt thread cycle. > > > > > > With the patch applied however, we are able to only route the events to > > > the corresponding soft irqs and leave the moment of truth to the > > > scheduler. > > > > Basically you're trying increase the priority of dwc3 by the greater > > number of soft interrupts. > > Possible. Never thought about this. > > > > > I don't see how introducing more soft interrupts would improve on > > > > latency, if anything, it should be worse? > > > > > > Why should explicit handling of coherent ep events on one cpu core > > > introduce more latency then by interleaving different events for > > > arbitrary ep all in one thread? > > > > Because we are only using a single interrupt line, the sequence of > > events need to be handled 1 set at a time. The amount of time saved from > > handling interrupts of different endpoint should be miniscule. There's > > latency to switching context and locking, which I think would offset and > > introduce more latency than what you can potentially save. > > > > I'd like to see data on the improvement on the net latency here. > > If this is the case. Then we are currently dealing with way to much > durtion in the complete handler of the endpoints. I can't really > tell for the uac endpoints. But the uvc complete endpoint is going > through this roundtrip. > > With no_interupt = 0 at every 16 request: > > dwc3_endpoint_interrupt > dwc3_gadget_endpoint_trbs_complete > dwc3_gadget_ep_cleanup_completed_requests > ~16 * { > dwc3_gadget_ep_cleanup_completed_request > dwc3_gadget_giveback > usb_gadget_giveback_request > usb_ep_queue > __dwc3_gadget_ep_queue > dwc3_prepare_trbs > ~ * { > dwc3_prepare_trbs_sg/dwc3_prepare_trbs_linear > } > dwc3_send_gadget_ep_cmd > } > > I think this is a lot of stack for an interrupt thread to handle if you > really want to pipeline this in one irqthread run and leave and make > sure that the other endpoints will also be handled soon enough. > The usb_ep_queue ops should be relatively quick, I think you mean the request process and/or preparation at the function driver before queuing a new request? For usb_ep_queue(), the dwc3 driver doesn't need to do much except telling the controller that "new TRBs are prepared, go cache and process them when possible" in usb_ep_queue(). > > > > > > > This is making the driver way more complicated and potentially > > > > introduce many bugs. > > > > > > Possible, but not unsolvable. > > > > > > > I may be wrong here, but I suspect that by multiplying the interrupt > > > > handlings, you _may_ see improvement due to the a higher chance being > > > > selected by the scheduler. However, the overall latency will probably > > > > be worse. (correct me if I'm wrong). > > > > > > I doubt that it will be worse if each softirq can be handled on > > > different cpus at the same time. > > > > See comment above. > > To solve this issue I see two options: > > We could either do this by having different interrupt threads per ep > like in this patch. I'd like to avoid this. > > Or we ensure that the complete handler is not running that long. This should be the way to go. At the upper layer, you know what takes longer to prepare and what priority the work of each request/endpoint should be. From the dwc3 driver, we currently don't handle the controller with "Multiple Interrupt Support" configuration where each interrupt line is associated with a separate event buffer and endpoint. So, it doesn't make sense to create different interrupt threads for each endpoint. For applications that have many endpoints, discounting the latency introduces by the function driver, we will have more latency from handling, scheduling, and waking up interrupt threads. > > This could be ensured by providing an interface that is similar to the > threaded interrupt interface. The complete handler should then only > wake up the corresponding complete thread. > > This policy of a short running complete handler also should be commented > somewhere in the kernel. > > Which brings me back to the open discussion with avichal, where I > already ment that it should be possible to completely remove the > usb_ep_queue callback from the complete handler. We there should only > update the buffer state and make sure that the pump worker would take > care of queueing the right buffers to the dwc3 driver. I will go more > into the details in this thread: > > https://lore.kernel.org/all/17192e0f-7f18-49ae-96fc-71054d46f74a@google.com/ > > > > > This will affect other applications. > > > > > > Let's make sure we will not break anything on the way. Okay? :) > > > > > > > Let's not do this. > > > > > > I actually thought that this is even requested: > > > > > > https://docs.kernel.org/usb/dwc3.html > > > > > > > That's a very old and outdate TODO list. > > We should ensure that this chapter will be removed then. Sure, we can remove that. > > > We don't use wait_for_completion_timeout in the commands. During > > transfers, we're using Update Transfer command, and it completes almost > > immediately. The only time where a command may take a longer time is > > when we need to bring the device down for reset/disconnect and stop > > transfers, but that's not what contributes to the problem here. > > > > Internal tests show that we can achieve very close theoretical USB > > speeds with the current dwc3 implementation. > > Granted, but only if we ensure that the complete() callback is not > destroing the runtime duration and probably no usb_ep_queue is never > called from the complete callback. > Right, we have tests that prepare/process requests within the completion callbacks, and we have tests that prepare/process requests in a separate work. Depending on the amount of work/latency needed, we implement in a certain way. e.g. for UASP tests, the processing of the request is on a separate work than completion callback. BR, Thinh
On Sat, May 11, 2024 at 12:11:31AM +0000, Thinh Nguyen wrote: >On Fri, May 10, 2024, Michael Grzeschik wrote: >> On Thu, May 09, 2024 at 12:23:09AM +0000, Thinh Nguyen wrote: >> > On Thu, May 09, 2024, Michael Grzeschik wrote: >> > > On Wed, May 08, 2024 at 11:20:03PM +0000, Thinh Nguyen wrote: >> > > > On Wed, May 08, 2024, Michael Grzeschik wrote: >> > > > > On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote: >> > > > > > Hi Michael, >> > > > > > >> > > > > > On 5/6/2024 4:06 PM, Michael Grzeschik wrote: >> > > > > > > This patch is splitting up the interrupt event handling from one >> > > > > > > interrupt thread to separate per endpoint interrupt threads. >> > > > > > > >> > > > > > >> > > > > > I assume that the incentive from doing this is to improve overall >> > > > > > throughput numbers. Would you be able to share some data on the >> > > > > > benefits of moving to per EP event management? >> > > > > >> > > > > The main benefit is to make it possible to use high demanding usb >> > > > > endpoints simultaneously. In our special case we saw that streaming >> > > > > via uac and streaming via uvc was producing noise in the audio >> > > > > stream. This was due to the fact, that the isoc feedback endpoint >> > > > > that would adjust the samplerate was not being called fast enough >> > > > > when there was heavy a lot of traffic in the uvc endpoint context. >> > > > > >> > > > > By moving the endpoints into their own thread handlers the short >> > > > > feedback requests are at least able to be scheduled in between the bursts >> > > > > of the uvc packages. The next step is to have all threads running on >> > > > > different cpu cores, without interfering each other. However, as we >> > > > > still have not matrix irq allocator for arm, there still is no direct >> > > > > benefit from that yet. >> > > > > >> > > > > >> > > > > > > To achieve this we create a new dwc3 interrupt domain in which >> > > > > > > we map all claimed interrupts to individual interrupt threads. >> > > > > > > >> > > > > > > Although the gadget layer is preparing the claimed parameter >> > > > > > > of each usb_ep which could be checked if the endpoint is >> > > > > > > to used or not, the claimed value was 0 for each ep in gadget_start. >> > > > > > > This was tested when describing some composite gadget using configfs. >> > > > > > > >> > > > > > >> > > > > > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs >> > > > > > (not sure if you're using this) whenever it adds a USB config. This is >> > > > > > to handle multi config situations, so subsequent USB configs can be >> > > > > > assigned (resuse) endpoints, since only one config is active at a time >> > > > > > for a USB device. >> > > > > > >> > > > > > This was a struggle for me as well when adding the TXFIFO resizing >> > > > > > logic. We won't actually know which EPs are going to be used until the >> > > > > > host issues the set configuration packet to select a config, and the >> > > > > > set_alt() callback issues usb_ep_enable(). So the implementation >> > > > > > (TXFIFO resizing) is currently based on the maximum potential endpoints >> > > > > > used by any USB configuration. >> > > > > > >> > > > > > Not sure if having 31 (potentially) different IRQ entries would be ok, >> > > > > > but maybe it would be simpler to just to request IRQ for dwc->num_eps >> > > > > > always? >> > > > > > >> > > > > > Have you tried this on a multi config device? >> > > > > >> > > > > No, I didn't. I doubt that this will work after your explanation. So >> > > > > thanks for the insides! >> > > > > >> > > > > I tried putting the request_threaded_irq into the ep_enable function >> > > > > but this does not work as I see a lot of schedule while atomic >> > > > > errors. This is possible as ep_enable is called from an set alt >> > > > > coming from ep0 interrupt thread context. >> > > > > >> > > > > So there is probably now no other option left to have exact endpoint >> > > > > interrupt threads. I will rework this back to request a kthread for each >> > > > > endpoint even as we will probably would not be using them. >> > > > > >> > > > >> > > > Do you have any data on latency here? >> > > >> > > I don't have the exact numbers for the uac feedback isoc endpoint >> > > at the moment. But without the patch applied, it was reproducably >> > > returning with EXDEV when we started uvc streaming and therefor >> > > increased the amount of events per interrupt thread cycle. >> > > >> > > With the patch applied however, we are able to only route the events to >> > > the corresponding soft irqs and leave the moment of truth to the >> > > scheduler. >> > >> > Basically you're trying increase the priority of dwc3 by the greater >> > number of soft interrupts. >> >> Possible. Never thought about this. >> >> > > > I don't see how introducing more soft interrupts would improve on >> > > > latency, if anything, it should be worse? >> > > >> > > Why should explicit handling of coherent ep events on one cpu core >> > > introduce more latency then by interleaving different events for >> > > arbitrary ep all in one thread? >> > >> > Because we are only using a single interrupt line, the sequence of >> > events need to be handled 1 set at a time. The amount of time saved from >> > handling interrupts of different endpoint should be miniscule. There's >> > latency to switching context and locking, which I think would offset and >> > introduce more latency than what you can potentially save. >> > >> > I'd like to see data on the improvement on the net latency here. >> >> If this is the case. Then we are currently dealing with way to much >> durtion in the complete handler of the endpoints. I can't really >> tell for the uac endpoints. But the uvc complete endpoint is going >> through this roundtrip. >> >> With no_interupt = 0 at every 16 request: >> >> dwc3_endpoint_interrupt >> dwc3_gadget_endpoint_trbs_complete >> dwc3_gadget_ep_cleanup_completed_requests >> ~16 * { >> dwc3_gadget_ep_cleanup_completed_request >> dwc3_gadget_giveback >> usb_gadget_giveback_request >> usb_ep_queue >> __dwc3_gadget_ep_queue >> dwc3_prepare_trbs >> ~ * { >> dwc3_prepare_trbs_sg/dwc3_prepare_trbs_linear >> } >> dwc3_send_gadget_ep_cmd >> } >> >> I think this is a lot of stack for an interrupt thread to handle if you >> really want to pipeline this in one irqthread run and leave and make >> sure that the other endpoints will also be handled soon enough. >> > >The usb_ep_queue ops should be relatively quick, I think you mean the >request process and/or preparation at the function driver before queuing >a new request? For usb_ep_queue(), the dwc3 driver doesn't need to do >much except telling the controller that "new TRBs are prepared, go cache >and process them when possible" in usb_ep_queue(). What you refer is the call of prepare_trbs and ep_cmd. This is probably pretty fast. But we still do this up to 16 times on one interrupt run. To really tell the weight in that case we will have to come back with numbers. >> > > >> > > > This is making the driver way more complicated and potentially >> > > > introduce many bugs. >> > > >> > > Possible, but not unsolvable. >> > > >> > > > I may be wrong here, but I suspect that by multiplying the interrupt >> > > > handlings, you _may_ see improvement due to the a higher chance being >> > > > selected by the scheduler. However, the overall latency will probably >> > > > be worse. (correct me if I'm wrong). >> > > >> > > I doubt that it will be worse if each softirq can be handled on >> > > different cpus at the same time. >> > >> > See comment above. >> >> To solve this issue I see two options: >> >> We could either do this by having different interrupt threads per ep >> like in this patch. > >I'd like to avoid this. > >> >> Or we ensure that the complete handler is not running that long. > >This should be the way to go. At the upper layer, you know what takes >longer to prepare and what priority the work of each request/endpoint >should be. Good. >From the dwc3 driver, we currently don't handle the controller with >"Multiple Interrupt Support" configuration where each interrupt line is >associated with a separate event buffer and endpoint. So, it doesn't >make sense to create different interrupt threads for each endpoint. For >applications that have many endpoints, discounting the latency >introduces by the function driver, we will have more latency from >handling, scheduling, and waking up interrupt threads. I was not aware that this mode (Multiple Interrupt Support) is even possible with dwc3. So if I really want to get the per ep hanlder patch to be mainline it really needs to make use of that feature. >> This could be ensured by providing an interface that is similar to the >> threaded interrupt interface. The complete handler should then only >> wake up the corresponding complete thread. >> >> This policy of a short running complete handler also should be commented >> somewhere in the kernel. >> >> Which brings me back to the open discussion with avichal, where I >> already ment that it should be possible to completely remove the >> usb_ep_queue callback from the complete handler. We there should only >> update the buffer state and make sure that the pump worker would take >> care of queueing the right buffers to the dwc3 driver. I will go more >> into the details in this thread: >> >> https://lore.kernel.org/all/17192e0f-7f18-49ae-96fc-71054d46f74a@google.com/ >> >> > > > This will affect other applications. >> > > >> > > Let's make sure we will not break anything on the way. Okay? :) >> > > >> > > > Let's not do this. >> > > >> > > I actually thought that this is even requested: >> > > >> > > https://docs.kernel.org/usb/dwc3.html >> > > >> > >> > That's a very old and outdate TODO list. >> >> We should ensure that this chapter will be removed then. > >Sure, we can remove that. Good. >> >> > We don't use wait_for_completion_timeout in the commands. During >> > transfers, we're using Update Transfer command, and it completes almost >> > immediately. The only time where a command may take a longer time is >> > when we need to bring the device down for reset/disconnect and stop >> > transfers, but that's not what contributes to the problem here. >> > >> > Internal tests show that we can achieve very close theoretical USB >> > speeds with the current dwc3 implementation. >> >> Granted, but only if we ensure that the complete() callback is not >> destroing the runtime duration and probably no usb_ep_queue is never >> called from the complete callback. >> > >Right, we have tests that prepare/process requests within the completion >callbacks, and we have tests that prepare/process requests in a separate >work. Depending on the amount of work/latency needed, we implement in a >certain way. e.g. for UASP tests, the processing of the request is on a >separate work than completion callback. So we really need to come up with some interface for the user of the complete handlers. So the upper layers won't be informed that the amount of work is critical and therefor could need a separate thread. (At least this should be documented) Michael
On Sun, May 12, 2024, Michael Grzeschik wrote: > On Sat, May 11, 2024 at 12:11:31AM +0000, Thinh Nguyen wrote: > > On Fri, May 10, 2024, Michael Grzeschik wrote: > > > On Thu, May 09, 2024 at 12:23:09AM +0000, Thinh Nguyen wrote: > > > > On Thu, May 09, 2024, Michael Grzeschik wrote: > > > > > On Wed, May 08, 2024 at 11:20:03PM +0000, Thinh Nguyen wrote: > > > > > > On Wed, May 08, 2024, Michael Grzeschik wrote: > > > > > > > On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote: > > > > > > > > Hi Michael, > > > > > > > > > > > > > > > > On 5/6/2024 4:06 PM, Michael Grzeschik wrote: > > > > > > > > > This patch is splitting up the interrupt event handling from one > > > > > > > > > interrupt thread to separate per endpoint interrupt threads. > > > > > > > > > > > > > > > > > > > > > > > > > I assume that the incentive from doing this is to improve overall > > > > > > > > throughput numbers. Would you be able to share some data on the > > > > > > > > benefits of moving to per EP event management? > > > > > > > > > > > > > > The main benefit is to make it possible to use high demanding usb > > > > > > > endpoints simultaneously. In our special case we saw that streaming > > > > > > > via uac and streaming via uvc was producing noise in the audio > > > > > > > stream. This was due to the fact, that the isoc feedback endpoint > > > > > > > that would adjust the samplerate was not being called fast enough > > > > > > > when there was heavy a lot of traffic in the uvc endpoint context. > > > > > > > > > > > > > > By moving the endpoints into their own thread handlers the short > > > > > > > feedback requests are at least able to be scheduled in between the bursts > > > > > > > of the uvc packages. The next step is to have all threads running on > > > > > > > different cpu cores, without interfering each other. However, as we > > > > > > > still have not matrix irq allocator for arm, there still is no direct > > > > > > > benefit from that yet. > > > > > > > > > > > > > > > > > > > > > > > To achieve this we create a new dwc3 interrupt domain in which > > > > > > > > > we map all claimed interrupts to individual interrupt threads. > > > > > > > > > > > > > > > > > > Although the gadget layer is preparing the claimed parameter > > > > > > > > > of each usb_ep which could be checked if the endpoint is > > > > > > > > > to used or not, the claimed value was 0 for each ep in gadget_start. > > > > > > > > > This was tested when describing some composite gadget using configfs. > > > > > > > > > > > > > > > > > > > > > > > > > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs > > > > > > > > (not sure if you're using this) whenever it adds a USB config. This is > > > > > > > > to handle multi config situations, so subsequent USB configs can be > > > > > > > > assigned (resuse) endpoints, since only one config is active at a time > > > > > > > > for a USB device. > > > > > > > > > > > > > > > > This was a struggle for me as well when adding the TXFIFO resizing > > > > > > > > logic. We won't actually know which EPs are going to be used until the > > > > > > > > host issues the set configuration packet to select a config, and the > > > > > > > > set_alt() callback issues usb_ep_enable(). So the implementation > > > > > > > > (TXFIFO resizing) is currently based on the maximum potential endpoints > > > > > > > > used by any USB configuration. > > > > > > > > > > > > > > > > Not sure if having 31 (potentially) different IRQ entries would be ok, > > > > > > > > but maybe it would be simpler to just to request IRQ for dwc->num_eps > > > > > > > > always? > > > > > > > > > > > > > > > > Have you tried this on a multi config device? > > > > > > > > > > > > > > No, I didn't. I doubt that this will work after your explanation. So > > > > > > > thanks for the insides! > > > > > > > > > > > > > > I tried putting the request_threaded_irq into the ep_enable function > > > > > > > but this does not work as I see a lot of schedule while atomic > > > > > > > errors. This is possible as ep_enable is called from an set alt > > > > > > > coming from ep0 interrupt thread context. > > > > > > > > > > > > > > So there is probably now no other option left to have exact endpoint > > > > > > > interrupt threads. I will rework this back to request a kthread for each > > > > > > > endpoint even as we will probably would not be using them. > > > > > > > > > > > > > > > > > > > Do you have any data on latency here? > > > > > > > > > > I don't have the exact numbers for the uac feedback isoc endpoint > > > > > at the moment. But without the patch applied, it was reproducably > > > > > returning with EXDEV when we started uvc streaming and therefor > > > > > increased the amount of events per interrupt thread cycle. > > > > > > > > > > With the patch applied however, we are able to only route the events to > > > > > the corresponding soft irqs and leave the moment of truth to the > > > > > scheduler. > > > > > > > > Basically you're trying increase the priority of dwc3 by the greater > > > > number of soft interrupts. > > > > > > Possible. Never thought about this. > > > > > > > > > I don't see how introducing more soft interrupts would improve on > > > > > > latency, if anything, it should be worse? > > > > > > > > > > Why should explicit handling of coherent ep events on one cpu core > > > > > introduce more latency then by interleaving different events for > > > > > arbitrary ep all in one thread? > > > > > > > > Because we are only using a single interrupt line, the sequence of > > > > events need to be handled 1 set at a time. The amount of time saved from > > > > handling interrupts of different endpoint should be miniscule. There's > > > > latency to switching context and locking, which I think would offset and > > > > introduce more latency than what you can potentially save. > > > > > > > > I'd like to see data on the improvement on the net latency here. > > > > > > If this is the case. Then we are currently dealing with way to much > > > durtion in the complete handler of the endpoints. I can't really > > > tell for the uac endpoints. But the uvc complete endpoint is going > > > through this roundtrip. > > > > > > With no_interupt = 0 at every 16 request: > > > > > > dwc3_endpoint_interrupt > > > dwc3_gadget_endpoint_trbs_complete > > > dwc3_gadget_ep_cleanup_completed_requests > > > ~16 * { > > > dwc3_gadget_ep_cleanup_completed_request > > > dwc3_gadget_giveback > > > usb_gadget_giveback_request > > > usb_ep_queue > > > __dwc3_gadget_ep_queue > > > dwc3_prepare_trbs > > > ~ * { > > > dwc3_prepare_trbs_sg/dwc3_prepare_trbs_linear > > > } > > > dwc3_send_gadget_ep_cmd > > > } > > > > > > I think this is a lot of stack for an interrupt thread to handle if you > > > really want to pipeline this in one irqthread run and leave and make > > > sure that the other endpoints will also be handled soon enough. > > > > > > > The usb_ep_queue ops should be relatively quick, I think you mean the > > request process and/or preparation at the function driver before queuing > > a new request? For usb_ep_queue(), the dwc3 driver doesn't need to do > > much except telling the controller that "new TRBs are prepared, go cache > > and process them when possible" in usb_ep_queue(). > > What you refer is the call of prepare_trbs and ep_cmd. This is probably > pretty fast. But we still do this up to 16 times on one interrupt run. > To really tell the weight in that case we will have to come back with > numbers. Yes, some numbers will be nice. > > > > > > > > > > > > This is making the driver way more complicated and potentially > > > > > > introduce many bugs. > > > > > > > > > > Possible, but not unsolvable. > > > > > > > > > > > I may be wrong here, but I suspect that by multiplying the interrupt > > > > > > handlings, you _may_ see improvement due to the a higher chance being > > > > > > selected by the scheduler. However, the overall latency will probably > > > > > > be worse. (correct me if I'm wrong). > > > > > > > > > > I doubt that it will be worse if each softirq can be handled on > > > > > different cpus at the same time. > > > > > > > > See comment above. > > > > > > To solve this issue I see two options: > > > > > > We could either do this by having different interrupt threads per ep > > > like in this patch. > > > > I'd like to avoid this. > > > > > > > > Or we ensure that the complete handler is not running that long. > > > > This should be the way to go. At the upper layer, you know what takes > > longer to prepare and what priority the work of each request/endpoint > > should be. > > Good. > > > From the dwc3 driver, we currently don't handle the controller with > > "Multiple Interrupt Support" configuration where each interrupt line is > > associated with a separate event buffer and endpoint. So, it doesn't > > make sense to create different interrupt threads for each endpoint. For > > applications that have many endpoints, discounting the latency > > introduces by the function driver, we will have more latency from > > handling, scheduling, and waking up interrupt threads. > > I was not aware that this mode (Multiple Interrupt Support) is even possible > with dwc3. So if I really want to get the per ep hanlder patch to be mainline > it really needs to make use of that feature. You can check GHWPARAMS1[22:17] capability value to see how many interrupts the controller supports. I believe most platforms don't support more than 1 interrupt. > > > > This could be ensured by providing an interface that is similar to the > > > threaded interrupt interface. The complete handler should then only > > > wake up the corresponding complete thread. > > > > > > This policy of a short running complete handler also should be commented > > > somewhere in the kernel. > > > > > > Which brings me back to the open discussion with avichal, where I > > > already ment that it should be possible to completely remove the > > > usb_ep_queue callback from the complete handler. We there should only > > > update the buffer state and make sure that the pump worker would take > > > care of queueing the right buffers to the dwc3 driver. I will go more > > > into the details in this thread: > > > > > > https://lore.kernel.org/all/17192e0f-7f18-49ae-96fc-71054d46f74a@google.com/ > > > > > > > > > This will affect other applications. > > > > > > > > > > Let's make sure we will not break anything on the way. Okay? :) > > > > > > > > > > > Let's not do this. > > > > > > > > > > I actually thought that this is even requested: > > > > > > > > > > https://docs.kernel.org/usb/dwc3.html > > > > > > > > > > > > > That's a very old and outdate TODO list. > > > > > > We should ensure that this chapter will be removed then. > > > > Sure, we can remove that. > > Good. > > > > > > > > We don't use wait_for_completion_timeout in the commands. During > > > > transfers, we're using Update Transfer command, and it completes almost > > > > immediately. The only time where a command may take a longer time is > > > > when we need to bring the device down for reset/disconnect and stop > > > > transfers, but that's not what contributes to the problem here. > > > > > > > > Internal tests show that we can achieve very close theoretical USB > > > > speeds with the current dwc3 implementation. > > > > > > Granted, but only if we ensure that the complete() callback is not > > > destroing the runtime duration and probably no usb_ep_queue is never > > > called from the complete callback. > > > > > > > Right, we have tests that prepare/process requests within the completion > > callbacks, and we have tests that prepare/process requests in a separate > > work. Depending on the amount of work/latency needed, we implement in a > > certain way. e.g. for UASP tests, the processing of the request is on a > > separate work than completion callback. > > So we really need to come up with some interface for the user of the > complete handlers. So the upper layers won't be informed that the > amount of work is critical and therefor could need a separate thread. > > (At least this should be documented) > The upperlayer should know whether the work will be time consuming. So the decision should be up to the upperlayer driver to decide whether to handle the work outside of the completion callback at the upperlayer driver. We should not expect to have an interface telling the controller driver whether to use separate threads for different endpoints. This gets tricky if the UDC depends on this feature to work properly. Also it requires quite an effort to implement and impose across different controller drivers. Thanks, Thinh
diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h index 180dd8d29287c..53cc34ce71682 100644 --- a/drivers/usb/dwc3/core.h +++ b/drivers/usb/dwc3/core.h @@ -733,6 +733,18 @@ struct dwc3_ep { struct list_head pending_list; struct list_head started_list; + unsigned int irq_endpoint; + + spinlock_t event_lock; + u32 ep_event_buffer[256]; + int ep_event_w_index; + int ep_event_r_index; + + int givebacks_current_turn; + +#define DWC3_EP_EVENT_OVERFLOW BIT(0) + unsigned int ep_event_flags; + void __iomem *regs; struct dwc3_trb *trb_pool; @@ -1173,6 +1185,8 @@ struct dwc3 { struct usb_gadget *gadget; struct usb_gadget_driver *gadget_driver; + struct irq_domain *ep_irq_domain; + struct clk *bus_clk; struct clk *ref_clk; struct clk *susp_clk; diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index f94f68f1e7d2b..3b49d80fc8dfa 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -16,6 +16,9 @@ #include <linux/pm_runtime.h> #include <linux/interrupt.h> #include <linux/io.h> +#include <linux/irq.h> +#include <linux/irqchip.h> +#include <linux/irqdomain.h> #include <linux/list.h> #include <linux/dma-mapping.h> @@ -1049,6 +1052,106 @@ static int __dwc3_gadget_ep_disable(struct dwc3_ep *dep) /* -------------------------------------------------------------------------- */ +static irqreturn_t dwc3_endpoint_irq(int irq, void *_dep) +{ + return IRQ_WAKE_THREAD; +} + +static void dwc3_process_event_entry(struct dwc3 *dwc, + const union dwc3_event *event); + +static irqreturn_t dwc3_endpoint_thread_irq(int irq, void *_dep) +{ + struct dwc3_ep *dep = _dep; + struct dwc3 *dwc = dep->dwc; + const union dwc3_event *event; + int count_processed = 0; + u32 event_raw; + unsigned long flags; + + dep->givebacks_current_turn = 0; + + spin_lock_irqsave(&dep->event_lock, flags); + + if (dep->ep_event_flags & DWC3_EP_EVENT_OVERFLOW) { + dev_err(dwc->dev, "ep%d: event buffer overflow\n", dep->number); + dep->ep_event_flags &= ~DWC3_EP_EVENT_OVERFLOW; + } + + while (dep->ep_event_r_index != dep->ep_event_w_index) { + + event_raw = dep->ep_event_buffer[dep->ep_event_r_index]; + + /* + * we have a copy of the event, so we can release the lock + */ + spin_unlock_irqrestore(&dep->event_lock, flags); + + event = (const union dwc3_event *) &event_raw; + + spin_lock(&dwc->lock); + dwc3_process_event_entry(dwc, event); + spin_unlock(&dwc->lock); + + /* + * we need to re-acquire the lock to update the read index + */ + spin_lock_irqsave(&dep->event_lock, flags); + + dep->ep_event_r_index = (dep->ep_event_r_index + 1) % + ARRAY_SIZE(dep->ep_event_buffer); + + count_processed += 1; + } + + spin_unlock_irqrestore(&dep->event_lock, flags); + + return IRQ_HANDLED; +} + +static int dwc3_gadget_init_endpoint_irq(struct dwc3 *dwc, struct dwc3_ep *dep) +{ + char *irq_name; + int ret = 0; + + /* FIXME: endpoint.claimed would be better here, but somehow + * the composite gadget layer is leaving the claimed value to 0 + * after calling usb_ep_autoconfig_reset after the final bind + */ + /* ep0in and ep0out share the same interrupt thread */ + if (!dep->endpoint.address && dep->number) + return 0; + + dep->irq_endpoint = irq_create_mapping(dwc->ep_irq_domain, dep->number); + if (dep->irq_endpoint < 0) { + ret = dep->irq_endpoint; + + dev_err(dwc->dev, "failed to map irq for ep%d --> %d\n", + dep->number, ret); + return ret; + } + + irq_name = kzalloc(16, GFP_KERNEL); + if (!dep->number) + snprintf(irq_name, 16, "ep0"); + else + snprintf(irq_name, 16, "ep%d%s", dep->number >> 1, dep->direction ? + "in" : "out"); + + ret = request_threaded_irq(dep->irq_endpoint, dwc3_endpoint_irq, + dwc3_endpoint_thread_irq, IRQF_SHARED, + irq_name, dep); + if (ret) { + irq_dispose_mapping(irq_find_mapping(dwc->ep_irq_domain, dep->number)); + dev_err(dwc->dev, "failed to request irq #%d --> %d\n", + dep->irq_endpoint, ret); + } + + return ret; +} + +/* -------------------------------------------------------------------------- */ + static int dwc3_gadget_ep0_enable(struct usb_ep *ep, const struct usb_endpoint_descriptor *desc) { @@ -2939,6 +3042,7 @@ static int dwc3_gadget_start(struct usb_gadget *g, struct usb_gadget_driver *driver) { struct dwc3 *dwc = gadget_to_dwc(g); + u8 epnum; unsigned long flags; int ret; int irq; @@ -2952,6 +3056,17 @@ static int dwc3_gadget_start(struct usb_gadget *g, return ret; } + for (epnum = 0; epnum < dwc->num_eps; epnum++) { + int ret; + /* ep0in and ep0out share the same interrupt thread */ + if (epnum == 1) + continue; + + ret = dwc3_gadget_init_endpoint_irq(dwc, dwc->eps[epnum]); + if (ret) + return ret; + } + spin_lock_irqsave(&dwc->lock, flags); dwc->gadget_driver = driver; spin_unlock_irqrestore(&dwc->lock, flags); @@ -2972,6 +3087,7 @@ static void __dwc3_gadget_stop(struct dwc3 *dwc) static int dwc3_gadget_stop(struct usb_gadget *g) { struct dwc3 *dwc = gadget_to_dwc(g); + u8 epnum; unsigned long flags; if (dwc->sys_wakeup) @@ -2982,6 +3098,18 @@ static int dwc3_gadget_stop(struct usb_gadget *g) dwc->max_cfg_eps = 0; spin_unlock_irqrestore(&dwc->lock, flags); + for (epnum = 0; epnum < dwc->num_eps; epnum++) { + struct dwc3_ep *dep; + + if (epnum == 1) + continue; + + dep = dwc->eps[epnum]; + + free_irq(dep->irq_endpoint, dwc->eps[epnum]); + irq_dispose_mapping(dep->irq_endpoint); + } + free_irq(dwc->irq_gadget, dwc->ev_buf); return 0; @@ -3298,6 +3426,8 @@ static int dwc3_gadget_init_endpoint(struct dwc3 *dwc, u8 epnum) INIT_LIST_HEAD(&dep->started_list); INIT_LIST_HEAD(&dep->cancelled_list); + spin_lock_init(&dep->event_lock); + dwc3_debugfs_create_endpoint_dir(dep); return 0; @@ -4403,7 +4533,9 @@ static irqreturn_t dwc3_process_event_buf(struct dwc3_event_buffer *evt) { struct dwc3 *dwc = evt->dwc; irqreturn_t ret = IRQ_NONE; + unsigned long flags; int left; + int i; left = evt->count; @@ -4412,10 +4544,36 @@ static irqreturn_t dwc3_process_event_buf(struct dwc3_event_buffer *evt) while (left > 0) { union dwc3_event event; + struct dwc3_ep *dep; + int epnum = 0; event.raw = *(u32 *) (evt->cache + evt->lpos); - dwc3_process_event_entry(dwc, &event); + if (!event.type.is_devspec) { + struct dwc3_event_depevt *depevt = &event.depevt; + + epnum = depevt->endpoint_number; + /* ep0in and ep0out share the same interrupt thread */ + if (epnum <= 1) + epnum &= ~0x01; + + if (epnum < 0 || epnum >= dwc->num_eps) { + dev_err(dwc->dev, "invalid epnum %d\n", epnum); + continue; + } + } + + dep = dwc->eps[epnum]; + + spin_lock(&dep->event_lock); + dep->ep_event_buffer[dep->ep_event_w_index] = event.raw; + dep->ep_event_w_index = (dep->ep_event_w_index + 1) % + ARRAY_SIZE(dep->ep_event_buffer); + + if (dep->ep_event_w_index == dep->ep_event_r_index) + dep->ep_event_flags |= DWC3_EP_EVENT_OVERFLOW; + + spin_unlock(&dep->event_lock); /* * FIXME we wrap around correctly to the next entry as @@ -4430,6 +4588,22 @@ static irqreturn_t dwc3_process_event_buf(struct dwc3_event_buffer *evt) left -= 4; } + for (i = 0; i < dwc->num_eps; i++) { + struct dwc3_ep *dep = dwc->eps[i]; + + /* ep0in and ep0out share the same interrupt thread */ + if (i == 1) + continue; + + spin_lock_irqsave(&dep->event_lock, flags); + + // TODO: improve + if (dep->ep_event_r_index != dep->ep_event_w_index) + generic_handle_domain_irq_safe(dwc->ep_irq_domain, i); + + spin_unlock_irqrestore(&dep->event_lock, flags); + } + evt->count = 0; ret = IRQ_HANDLED; @@ -4553,6 +4727,22 @@ static void dwc_gadget_release(struct device *dev) kfree(gadget); } +static const struct irq_chip ep_irq_chip = { + .name = "dwc3-ep", +}; + +static int ep_irq_domain_map(struct irq_domain *d, unsigned int virq, irq_hw_number_t hwirq) +{ + irq_set_chip_and_handler(virq, &ep_irq_chip, handle_simple_irq); + + return 0; +} + +static const struct irq_domain_ops ep_irq_dom_ops = { + .map = ep_irq_domain_map, + .xlate = irq_domain_xlate_onetwocell, +}; + /** * dwc3_gadget_init - initializes gadget related registers * @dwc: pointer to our controller context structure @@ -4573,6 +4763,13 @@ int dwc3_gadget_init(struct dwc3 *dwc) dwc->irq_gadget = irq; + dwc->ep_irq_domain = irq_domain_add_simple(NULL, dwc->num_eps, 0, &ep_irq_dom_ops, dwc); + if (!dwc->ep_irq_domain) { + dev_err(dwc->dev, "failed to create ep irq domain\n"); + ret = -ENOMEM; + goto err0; + } + dwc->ep0_trb = dma_alloc_coherent(dwc->sysdev, sizeof(*dwc->ep0_trb) * 2, &dwc->ep0_trb_addr, GFP_KERNEL); @@ -4691,7 +4888,10 @@ void dwc3_gadget_exit(struct dwc3 *dwc) if (!dwc->gadget) return; + irq_domain_remove(dwc->ep_irq_domain); + dwc3_enable_susphy(dwc, false); + usb_del_gadget(dwc->gadget); dwc3_gadget_free_endpoints(dwc); usb_put_gadget(dwc->gadget);
This patch is splitting up the interrupt event handling from one interrupt thread to separate per endpoint interrupt threads. To achieve this we create a new dwc3 interrupt domain in which we map all claimed interrupts to individual interrupt threads. Although the gadget layer is preparing the claimed parameter of each usb_ep which could be checked if the endpoint is to used or not, the claimed value was 0 for each ep in gadget_start. This was tested when describing some composite gadget using configfs. As a workaround we check the ep->address value instead for now. The ep0 is handling in and out events both in one common thread. There is still some work left to improve. 1) The big dwc->lock can block other endpoint threads. To solve this, the endpoint need their own locks. While the ep0 device events have still to be handled priviliged. 2) The smp_affinity is currently not possible to change which will make the per ep threads run on the same cpu as the current irq. To gain benefit from running the ep threads on different cores, the big dwc->lock is needs to be solved first anyways. Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> --- drivers/usb/dwc3/core.h | 14 ++++ drivers/usb/dwc3/gadget.c | 202 +++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 215 insertions(+), 1 deletion(-) --- base-commit: dd5a440a31fae6e459c0d6271dddd62825505361 change-id: 20240507-dwc3_per_ep_irqthread-d01e1abdce6c Best regards,