Message ID | 20230602103750.2290132-1-vladimir.oltean@nxp.com (mailing list archive) |
---|---|
Headers | show |
Series | Improve the taprio qdisc's relationship with its children | expand |
Hi, On Fri, Jun 02, 2023 at 01:37:45PM +0300, Vladimir Oltean wrote: > [ Original patch set was lost due to an apparent transient problem with > kernel.org's DNSBL setup. This is an identical resend. ] > > Prompted by Vinicius' request to consolidate some child Qdisc > dereferences in taprio: > https://lore.kernel.org/netdev/87edmxv7x2.fsf@intel.com/ > > I remembered that I had left some unfinished work in this Qdisc, namely > commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see > the per-netdev-queue pfifo child qdiscs""). > > This patch set represents another stab at, essentially, what's in the > title. Not only does taprio not properly detect when it's grafted as a > non-root qdisc, but it also returns incorrect per-class stats. > Eventually, Vinicius' request is addressed too, although in a different > form than the one he requested (which was purely cosmetic). > > Review from people more experienced with Qdiscs than me would be > appreciated. I tried my best to explain what I consider to be problems. > I am deliberately targeting net-next because the changes are too > invasive for net - they were reverted from stable once already. I noticed that this patch set has "Changes Requested" in patchwork. I can't completely exclude the fact that maybe someone has requested some changes to be made, but there is no email in my inbox to that end, and for that matter, neither did patchwork or the email archive process any responses to this thread.
On Mon, Jun 05, 2023 at 03:50:42PM +0300, Vladimir Oltean wrote: > Hi, > > On Fri, Jun 02, 2023 at 01:37:45PM +0300, Vladimir Oltean wrote: > > [ Original patch set was lost due to an apparent transient problem with > > kernel.org's DNSBL setup. This is an identical resend. ] > > > > Prompted by Vinicius' request to consolidate some child Qdisc > > dereferences in taprio: > > https://lore.kernel.org/netdev/87edmxv7x2.fsf@intel.com/ > > > > I remembered that I had left some unfinished work in this Qdisc, namely > > commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see > > the per-netdev-queue pfifo child qdiscs""). > > > > This patch set represents another stab at, essentially, what's in the > > title. Not only does taprio not properly detect when it's grafted as a > > non-root qdisc, but it also returns incorrect per-class stats. > > Eventually, Vinicius' request is addressed too, although in a different > > form than the one he requested (which was purely cosmetic). > > > > Review from people more experienced with Qdiscs than me would be > > appreciated. I tried my best to explain what I consider to be problems. > > I am deliberately targeting net-next because the changes are too > > invasive for net - they were reverted from stable once already. > > I noticed that this patch set has "Changes Requested" in patchwork. > > I can't completely exclude the fact that maybe someone has requested > some changes to be made, but there is no email in my inbox to that end, > and for that matter, neither did patchwork or the email archive process > any responses to this thread. I concur. Let's see if this sets set it to "Under Review".
On Fri, Jun 2, 2023 at 6:38 AM Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > [ Original patch set was lost due to an apparent transient problem with > kernel.org's DNSBL setup. This is an identical resend. ] > > Prompted by Vinicius' request to consolidate some child Qdisc > dereferences in taprio: > https://lore.kernel.org/netdev/87edmxv7x2.fsf@intel.com/ > > I remembered that I had left some unfinished work in this Qdisc, namely > commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see > the per-netdev-queue pfifo child qdiscs""). > > This patch set represents another stab at, essentially, what's in the > title. Not only does taprio not properly detect when it's grafted as a > non-root qdisc, but it also returns incorrect per-class stats. > Eventually, Vinicius' request is addressed too, although in a different > form than the one he requested (which was purely cosmetic). > > Review from people more experienced with Qdiscs than me would be > appreciated. I tried my best to explain what I consider to be problems. I havent been following - but if you show me sample intended tc configs for both s/w and hardware offloads i can comment. In my cursory look i assumed you wanted to go along the path of mqprio where nothing much happens in the s/w datapath other than requeues when the tx hardware path is busy (notice it is missing an enqueue/deque ops). In that case the hardware selection is essentially of a DMA ring based on skb tags. It seems you took it up a notch by infact having a choice of whether to have pure s/w or offload path. cheers, jamal > I am deliberately targeting net-next because the changes are too > invasive for net - they were reverted from stable once already. > > Vladimir Oltean (5): > net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during > attach() > net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload > mode > net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf() > net/sched: taprio: delete misleading comment about preallocating child > qdiscs > net/sched: taprio: dump class stats for the actual q->qdiscs[] > > net/sched/sch_taprio.c | 60 ++++++++++++++++++++++++------------------ > 1 file changed, 35 insertions(+), 25 deletions(-) > > -- > 2.34.1 >
Hi Jamal, On Mon, Jun 05, 2023 at 11:44:17AM -0400, Jamal Hadi Salim wrote: > I havent been following - but if you show me sample intended tc > configs for both s/w and hardware offloads i can comment. There is not much difference in usage between the 2 modes. IMO the software data path logic is only a simulation for demonstrative purposes of what the shaper is intended to do. If hardware offload is available, it is always preferable. Otherwise, I'm not sure if anyone uses the pure software scheduling mode (also without txtime assist) for a real life use case. I was working with something like this for testing the code paths affected by these changes: #!/bin/bash add_taprio() { local offload=$1 local extra_flags case $offload in true) extra_flags="flags 0x2" ;; false) extra_flags="clockid CLOCK_TAI" ;; esac tc qdisc replace dev eno0 handle 8001: parent root stab overhead 24 taprio \ num_tc 8 \ map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ max-sdu 0 0 0 0 0 200 0 0 \ base-time 200 \ sched-entry S 80 20000 \ sched-entry S a0 20000 \ sched-entry S 5f 60000 \ $extra_flags } add_cbs() { local offload=$1 local extra_flags case $offload in true) extra_flags="offload 1" ;; false) extra_flags="" ;; esac max_frame_size=1500 data_rate_kbps=20000 port_transmit_rate_kbps=1000000 idleslope=$data_rate_kbps sendslope=$(($idleslope - $port_transmit_rate_kbps)) locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps)) hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps)) tc qdisc replace dev eno0 parent 8001:8 cbs \ idleslope $idleslope \ sendslope $sendslope \ hicredit $hicredit \ locredit $locredit \ $extra_flags } # this should always fail add_second_taprio() { tc qdisc replace dev eno0 parent 8001:7 taprio \ num_tc 8 \ map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ max-sdu 0 0 0 0 0 200 0 0 \ base-time 200 \ sched-entry S 80 20000 \ sched-entry S a0 20000 \ sched-entry S 5f 60000 \ clockid CLOCK_TAI } ip link set eno0 up echo "Offload:" add_taprio true add_cbs true add_second_taprio mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60 sleep 5 tc -s class show dev eno0 tc qdisc del dev eno0 root echo "Software:" add_taprio false add_cbs false add_second_taprio mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60 sleep 5 tc -s class show dev eno0 tc qdisc del dev eno0 root > In my cursory look i assumed you wanted to go along the path of mqprio > where nothing much happens in the s/w datapath other than requeues > when the tx hardware path is busy (notice it is missing an > enqueue/deque ops). In that case the hardware selection is essentially > of a DMA ring based on skb tags. It seems you took it up a notch by > infact having a choice of whether to have pure s/w or offload path. Yes. Actually the original taprio design always had the enqueue()/dequeue() ops involved in the data path, then commit 13511704f8d7 ("net: taprio offload: enforce qdisc to netdev queue mapping") retrofitted the mqprio model when using the "flags 0x2" argument. If you have time to read, the discussion behind that redesign was here: https://lore.kernel.org/netdev/20210511171829.17181-1-yannick.vignon@oss.nxp.com/
Hi Vladimir, On Mon, Jun 5, 2023 at 12:53 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > Hi Jamal, > > On Mon, Jun 05, 2023 at 11:44:17AM -0400, Jamal Hadi Salim wrote: > > I havent been following - but if you show me sample intended tc > > configs for both s/w and hardware offloads i can comment. > > There is not much difference in usage between the 2 modes. IMO the software > data path logic is only a simulation for demonstrative purposes of what the > shaper is intended to do. If hardware offload is available, it is always > preferable. Otherwise, I'm not sure if anyone uses the pure software > scheduling mode (also without txtime assist) for a real life use case. > Thanks for the sample. Understood on the software twin being useful for simulation (our standard rule is you need to have a software twin) - it is great you conform to that. I took a cursory glance at the existing kernel code in order to get better context (I will make more time later). Essentially this qdisc is classful and is capable of hardware offload. Initial comments are unrelated to the patchset (all this Klingon DCB thingy configuration like a flag 0x2 is still magic to me). 1)Just some details become confusing in regards to offload vs not; F.e class grafting (taprio_graft()) is de/activating the device but that seems only needed for offload. Would it not be better to have those separate and call graft_offload vs graft_software, etc? We really need to create a generic document on how someone would write code for qdiscs for consistency (I started working on one but never completed it - if there is a volunteer i would be happy to work with one to complete it). 2) It seems like in mqprio this qdisc can only be root qdisc (like mqprio) and you dont want to replace the children with other types of qdiscs i.e the children are always pfifo? i.e is it possible or intended for example to replace 8001:x with bfifo etc? or even change the pfifo queue size, etc? 3) Offload intention seems really to be bypassing enqueue() and going straigth to the driver xmit() for a specific DMA ring that the skb is mapped to. Except for the case where the driver says it's busy and refuses to stash the skb in ring in which case you have to requeue to the appropriate child qdisc/class. I am not sure how that would work here - mqprio gets away with it by not defining any of the en/de/requeue() callbacks - but likely it will be the lack of requeue that makes it work. I will read the other thread you pointed to when i get a moment. cheers, jamal > I was working with something like this for testing the code paths affected > by these changes: > > #!/bin/bash > > add_taprio() > { > local offload=$1 > local extra_flags > > case $offload in > true) > extra_flags="flags 0x2" > ;; > false) > extra_flags="clockid CLOCK_TAI" > ;; > esac > > tc qdisc replace dev eno0 handle 8001: parent root stab overhead 24 taprio \ > num_tc 8 \ > map 0 1 2 3 4 5 6 7 \ > queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ > max-sdu 0 0 0 0 0 200 0 0 \ > base-time 200 \ > sched-entry S 80 20000 \ > sched-entry S a0 20000 \ > sched-entry S 5f 60000 \ > $extra_flags > } > > add_cbs() > { > local offload=$1 > local extra_flags > > case $offload in > true) > extra_flags="offload 1" > ;; > false) > extra_flags="" > ;; > esac > > max_frame_size=1500 > data_rate_kbps=20000 > port_transmit_rate_kbps=1000000 > idleslope=$data_rate_kbps > sendslope=$(($idleslope - $port_transmit_rate_kbps)) > locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps)) > hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps)) > tc qdisc replace dev eno0 parent 8001:8 cbs \ > idleslope $idleslope \ > sendslope $sendslope \ > hicredit $hicredit \ > locredit $locredit \ > $extra_flags > } > > # this should always fail > add_second_taprio() > { > tc qdisc replace dev eno0 parent 8001:7 taprio \ > num_tc 8 \ > map 0 1 2 3 4 5 6 7 \ > queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ > max-sdu 0 0 0 0 0 200 0 0 \ > base-time 200 \ > sched-entry S 80 20000 \ > sched-entry S a0 20000 \ > sched-entry S 5f 60000 \ > clockid CLOCK_TAI > } > > ip link set eno0 up > > echo "Offload:" > add_taprio true > add_cbs true > add_second_taprio > mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60 > sleep 5 > tc -s class show dev eno0 > tc qdisc del dev eno0 root > > echo "Software:" > add_taprio false > add_cbs false > add_second_taprio > mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60 > sleep 5 > tc -s class show dev eno0 > tc qdisc del dev eno0 root > > > In my cursory look i assumed you wanted to go along the path of mqprio > > where nothing much happens in the s/w datapath other than requeues > > when the tx hardware path is busy (notice it is missing an > > enqueue/deque ops). In that case the hardware selection is essentially > > of a DMA ring based on skb tags. It seems you took it up a notch by > > infact having a choice of whether to have pure s/w or offload path. > > Yes. Actually the original taprio design always had the enqueue()/dequeue() > ops involved in the data path, then commit 13511704f8d7 ("net: taprio > offload: enforce qdisc to netdev queue mapping") retrofitted the mqprio > model when using the "flags 0x2" argument. > If you have time to read, the discussion behind that redesign was here: > https://lore.kernel.org/netdev/20210511171829.17181-1-yannick.vignon@oss.nxp.com/
On Tue, Jun 06, 2023 at 11:39:32AM -0400, Jamal Hadi Salim wrote: > 1)Just some details become confusing in regards to offload vs not; F.e > class grafting (taprio_graft()) is de/activating the device but that > seems only needed for offload. Would it not be better to have those > separate and call graft_offload vs graft_software, etc? We really need > to create a generic document on how someone would write code for > qdiscs for consistency (I started working on one but never completed > it - if there is a volunteer i would be happy to work with one to > complete it). I would be a happy reader of that document. I haven't studied whether dev_deactivate() and dev_activate() are necessary for the pure software data path, where the root taprio is also directly attached to the netdev TXQs and that fact doesn't change across its lifetime. > 2) It seems like in mqprio this qdisc can only be root qdisc (like > mqprio) so far so good > and you dont want to replace the children with other types of > qdiscs i.e the children are always pfifo? i.e is it possible or > intended for example to replace 8001:x with bfifo etc? or even change > the pfifo queue size, etc? no, this is not true, why do you say this? > 3) Offload intention seems really to be bypassing enqueue() and going > straigth to the driver xmit() for a specific DMA ring that the skb is > mapped to. Except for the case where the driver says it's busy and > refuses to stash the skb in ring in which case you have to requeue to > the appropriate child qdisc/class. I am not sure how that would work > here - mqprio gets away with it by not defining any of the > en/de/requeue() callbacks wait, there is a requeue() callback? where? > - but likely it will be the lack of requeue that makes it work. Looking at dev_requeue_skb(), isn't it always going to be requeued to the same qdisc it was originally dequeued from? I don't see what is the problem. My understanding of the offload intention is not really to bypass dequeue() in general and as a matter of principle, but rather to bypass the root's taprio_dequeue() specifically, as that could do unrelated work, and jump right to the specific child's dequeue(). The child could have its own complex enqueue() and dequeue() and that is perfectly fine - for example cbs_dequeue_soft() is a valid child dequeue procedure - as long as the process isn't blocked in the sendmsg() call by __qdisc_run() processing packets belonging to unrelated traffic classes.
On Tue, Jun 6, 2023 at 12:32 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > On Tue, Jun 06, 2023 at 11:39:32AM -0400, Jamal Hadi Salim wrote: > > 1)Just some details become confusing in regards to offload vs not; F.e > > class grafting (taprio_graft()) is de/activating the device but that > > seems only needed for offload. Would it not be better to have those > > separate and call graft_offload vs graft_software, etc? We really need > > to create a generic document on how someone would write code for > > qdiscs for consistency (I started working on one but never completed > > it - if there is a volunteer i would be happy to work with one to > > complete it). > > I would be a happy reader of that document. I haven't studied whether > dev_deactivate() and dev_activate() are necessary for the pure software > data path, where the root taprio is also directly attached to the netdev > TXQs and that fact doesn't change across its lifetime. I didnt go that far in the doc but was intending to. It was supposed to be a tutorial somewhere and i ended not doing it. something like htb will be a good example to compare with (it is a classful qdisc which is also capable of offloading). It is not the same as mqprio which can only be root. > > 2) It seems like in mqprio this qdisc can only be root qdisc (like > > mqprio) > > so far so good > > > and you dont want to replace the children with other types of > > qdiscs i.e the children are always pfifo? i.e is it possible or > > intended for example to replace 8001:x with bfifo etc? or even change > > the pfifo queue size, etc? > > no, this is not true, why do you say this? I am just asking questions trying to understand;-> So if can i replace, for example, with a tbf would it make sense even in s/w? > > 3) Offload intention seems really to be bypassing enqueue() and going > > straigth to the driver xmit() for a specific DMA ring that the skb is > > mapped to. Except for the case where the driver says it's busy and > > refuses to stash the skb in ring in which case you have to requeue to > > the appropriate child qdisc/class. I am not sure how that would work > > here - mqprio gets away with it by not defining any of the > > en/de/requeue() callbacks > > wait, there is a requeue() callback? where? Hrm, someone removed that ops i guess at some point - not sure when, need to look at git history. But short answer is yes it used to be there; git will probably reveal from the commit its disappearance. > > > - but likely it will be the lack of requeue that makes it work. > > Looking at dev_requeue_skb(), isn't it always going to be requeued to > the same qdisc it was originally dequeued from? I don't see what is the > problem. In the basic case that approach is sufficient. For pfifo you want it to the first skb dequeued the next opportunity to send to h/w. Basically the idea is/was: if the hardware is busy you may need to run some algorithm (present in requeue but not in enqueu) to decide if you should put the skb at the head, at the tail, somewhere else, drop it, check if some time limit has exceeded and do something funky etc etc. > My understanding of the offload intention is not really to bypass dequeue() > in general and as a matter of principle, but rather to bypass the root's > taprio_dequeue() specifically, as that could do unrelated work, and jump > right to the specific child's dequeue(). > > The child could have its own complex enqueue() and dequeue() and that is > perfectly fine - for example cbs_dequeue_soft() is a valid child dequeue > procedure - as long as the process isn't blocked in the sendmsg() call > by __qdisc_run() processing packets belonging to unrelated traffic > classes. Does it matter what type the child enqueue/dequeue? eg can i attach htb, etc? cheers, jamal
On Tue, Jun 06, 2023 at 01:42:19PM -0400, Jamal Hadi Salim wrote: > > > 2) It seems like in mqprio this qdisc can only be root qdisc (like > > > mqprio) > > > > so far so good > > > > > and you dont want to replace the children with other types of > > > qdiscs i.e the children are always pfifo? i.e is it possible or > > > intended for example to replace 8001:x with bfifo etc? or even change > > > the pfifo queue size, etc? > > > > no, this is not true, why do you say this? > > I am just asking questions trying to understand;-> So if can i > replace, for example, with a tbf would it make sense even in s/w? > > > The child could have its own complex enqueue() and dequeue() and that is > > perfectly fine - for example cbs_dequeue_soft() is a valid child dequeue > > procedure - as long as the process isn't blocked in the sendmsg() call > > by __qdisc_run() processing packets belonging to unrelated traffic > > classes. > > Does it matter what type the child enqueue/dequeue? eg can i attach htb, etc? So in principle, the taprio model is compatible with attaching any child Qdisc to the per-TXQ child classes - with tc-cbs in particular being of interest, because that is a TSN shaper, but also, tbf or htb could be reasonably imagined as children, and taprio doesn't oppose to any Qdisc as its child. That being said, a non-offloaded cbs/htb will not work with an offloaded taprio root anymore after commit 13511704f8d7 ("net: taprio offload: enforce qdisc to netdev queue mapping"), and IMO what should be done there is to reject somehow those child Qdiscs which also can't be offloaded when the root is offloaded. Offloading a taprio qdisc (potentially with offloaded cbs children) also affects the autonomous forwarding data path in case of an Ethernet switch. So it's not just about dev_queue_xmits() from Linux.
On Wed, Jun 07, 2023 at 01:09:01PM +0300, Vladimir Oltean wrote: > That being said, a non-offloaded cbs/htb will not work with an offloaded > taprio root anymore after commit 13511704f8d7 ("net: taprio offload: > enforce qdisc to netdev queue mapping"), and IMO what should be done > there is to reject somehow those child Qdiscs which also can't be > offloaded when the root is offloaded. > > Offloading a taprio qdisc (potentially with offloaded cbs children) also > affects the autonomous forwarding data path in case of an Ethernet switch. > So it's not just about dev_queue_xmits() from Linux. Ah, no, I said something stupid. Non-offloaded child Qdisc should still work with offloaded root (shaping done in software, scheduling in hardware). I was thinking about the autonomous forwarding case in the back of my mind, that's still problematic because the shaping wouldn't affect the packets that Linux didn't originate.