Message ID | 0a50f0cf5593baeb628dc8606c523665e5e2ae6c.1589519600.git.viresh.kumar@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] dt-bindings: mailbox: add doorbell support to ARM MHU | expand |
On Fri, May 15, 2020 at 12:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > - The hardware gives us the capability to write the register in > parallel, i.e. we can write 0x800 and 0x400 together without any > software locks, and so these 32 bits should be considered as separate > channel even if only one interrupt is issued by the hardware finally. > This shouldn't be called as virtualization of the channels, as the > hardware supports this (as clearly mentioned in the TRM) and it takes > care of handling the signal properly. > I'll leave this one open to bikeshed arguments. > - With serialization, if we use only one channel as today at every > priority, if there are 5 requests to send signal to the receiver and > the dvfs request is the last one in queue (which may be called from > scheduler's hot path with fast switching), it unnecessarily needs to > wait for the first four transfers to finish due to the software > locking imposed by the mailbox framework. This adds additional delay, > maybe of few ms only, which isn't required by the hardware but just by > the software and few ms can be important in scheduler's hotpath. > As I asked you yesterday over the call, it may help if you could share some numbers to back up the doomsday scenario. I don't believe mailbox will be a bottleneck, unless you send commands in a while(1) ... but even then you have to compare against the virtual-channel implementation. (Not to forget one usually doesn't need/want the dvfs, power, clock, hotplug all happening at the _same_ time) Please note, SCMI... lets not pretend it is not about making scmi work with mhu :) ... itself uses shared-memory transfers and wait_for_completion_timeout in scmi_do_xfer(). If some platform _really-really_ faced speed bottlenecks, it would come to want to exchange 32-bit encoded command/response over the mhu register, asynchronously and totally bypassing shmem... which is possible only now. > - With the current approach it isn't possible to assign different bits > (or doorbell numbers) to clients from DT and the only way of doing > that without adding new bindings is by extending #mbox-cells to accept > a value of 2 as done in this patch. > I am afraid you are confused. You can use bit/doorbell-6 by passing 0x40 to mhu as the data to send. Cheers!
On 15-05-20, 11:46, Jassi Brar wrote: > As I asked you yesterday over the call, it may help if you could share > some numbers to back up the doomsday scenario. Yes, I have already asked Sudeep to get some numbers for this. He will get back to us. > > - With the current approach it isn't possible to assign different bits > > (or doorbell numbers) to clients from DT and the only way of doing > > that without adding new bindings is by extending #mbox-cells to accept > > a value of 2 as done in this patch. > > > I am afraid you are confused. You can use bit/doorbell-6 by passing > 0x40 to mhu as the data to send. That's how the code will do it, right I agree. What I was asking was the way this information is passed from DT.
On Mon, May 18, 2020 at 2:42 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > On 15-05-20, 11:46, Jassi Brar wrote: > > As I asked you yesterday over the call, it may help if you could share > > some numbers to back up the doomsday scenario. > > Yes, I have already asked Sudeep to get some numbers for this. He will > get back to us. > Thanks, current bottleneck numbers and the patch/changes to improve that, would help. > > > - With the current approach it isn't possible to assign different bits > > > (or doorbell numbers) to clients from DT and the only way of doing > > > that without adding new bindings is by extending #mbox-cells to accept > > > a value of 2 as done in this patch. > > > > > I am afraid you are confused. You can use bit/doorbell-6 by passing > > 0x40 to mhu as the data to send. > > That's how the code will do it, right I agree. What I was asking was > the way this information is passed from DT. > That is a client/protocol property and has nothing to do with the controller dt node. cheers!
On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote: > From: Sudeep Holla <sudeep.holla@arm.com> > > Hi Rob, Arnd and Jassi, > > This stuff has been doing rounds on the mailing list since several years > now with no agreed conclusion by all the parties. And here is another > attempt to get some feedback from everyone involved to close this once > and for ever. Your comments will very much be appreciated. > > The ARM MHU is defined here in the TRM [1] for your reference, which > states following: > > "The MHU drives the signal using a 32-bit register, with all 32 > bits logically ORed together. The MHU provides a set of > registers to enable software to set, clear, and check the status > of each of the bits of this register independently. The use of > 32 bits for each interrupt line enables software to provide more > information about the source of the interrupt. For example, each > bit of the register can be associated with a type of event that > can contribute to raising the interrupt." > Does this mean that there are 32 different signals and they are all ORed into the same interrupt line to trigger software action when something happens? Or does it mean that this register is used to pass multi-bit information and when any such information is passed an interrupt will be triggered? If so, what does that information mean? How is it tied into other Linux drivers/subsystems? > On few other platforms, like qcom, similar doorbell mechanism is present > with separate interrupt for each of the bits (that's how I understood > it), while in case of ARM MHU, there is a single interrupt line for all > the 32 bits. Also in case of ARM MHU, these registers and interrupts > have 3 copies for different priority levels, i.e. low priority > non-secure, high priority non-secure and secure channels. > In the Qualcomm case we have 32 bits in a register where each bit written will trigger an interrupt elsewhere in the SoC, as such the mailbox is purely write only and the "read" side is handled by some interrupt-controller. The three copies just sounds like 3 (or 3 * 32) different mailbox channels. > For ARM MHU, both the dt bindings and the Linux driver support 3 > channels for the different priorities right now and support sending a 32 > bit data on every transfer in a locked fashion, i.e. only one transfer > can be done at once and the other have to wait for it to finish first. > > Here are the point of view of the parties involved on this subject: > > Jassi's viewpoint: > > - Virtualization of channels should be discouraged in software based on > specific usecases of the application. This may invite other mailbox > driver authors to ask for doing virtualization in their drivers. > > - In mailbox's terminology, every channel is equivalent to a signal, > since there is only one signal generated here by the MHU, there should > be only one channel per priority level. > > - The clients should send data (of just setting 1 bit or many in the 32 > bit word) using the existing mechanism as the delays due to > serialization shouldn't be significant anyway. > > - The driver supports 90% of the users with the current implementation > and it shouldn't be extended to support doorbell and implement two > different modes by changing value of #mbox-cells field in bindings. I interpret Jassi's view as this is a channel that passes 32 bit "messages". > > Sudeep (ARM) and myself as well to some extent: > > - The hardware gives us the capability to write the register in > parallel, i.e. we can write 0x800 and 0x400 together without any > software locks, and so these 32 bits should be considered as separate > channel even if only one interrupt is issued by the hardware finally. > This shouldn't be called as virtualization of the channels, as the > hardware supports this (as clearly mentioned in the TRM) and it takes > care of handling the signal properly. > But if writes to the register is ORed together than it seems like the hardware isn't supposed to pass multi-bit messages, but instead is supposed to carry 32 individual signals - somewhat similar to the Qualcomm block. I don't see a problem with having a cascaded IRQ handler for incoming notifications. > - With serialization, if we use only one channel as today at every > priority, if there are 5 requests to send signal to the receiver and > the dvfs request is the last one in queue (which may be called from > scheduler's hot path with fast switching), it unnecessarily needs to > wait for the first four transfers to finish due to the software > locking imposed by the mailbox framework. This adds additional delay, > maybe of few ms only, which isn't required by the hardware but just by > the software and few ms can be important in scheduler's hotpath. > So these 5 requests, are they conveyed by the signals [1,2,3,4,5] or [BIT(0), BIT(1), BIT(2), BIT(3), BIT(4)]? In the first case you have to serialize things given that e.g. signal 1 immediately followed by 2 is indistinguishable from 3. If you signals are single-bit notifications then you don't need any serialization. Regards, Bjorn > - With the current approach it isn't possible to assign different bits > (or doorbell numbers) to clients from DT and the only way of doing > that without adding new bindings is by extending #mbox-cells to accept > a value of 2 as done in this patch. > > Jassi and Sudeep, I hope I was able to represent both the view points > properly here. Please correct me if I have made a mistake here. > > This is it. It would be nice to get the views of everyone now on this > and how should this be handled. > > Thanks. > > [1] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0515f/DDI0515F_juno_arm_development_platform_soc_trm.pdf , section 3.4.4, page 3-38. > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> > --- > .../devicetree/bindings/mailbox/arm-mhu.txt | 39 ++++++++++++++++++- > 1 file changed, 37 insertions(+), 2 deletions(-) > > diff --git a/Documentation/devicetree/bindings/mailbox/arm-mhu.txt b/Documentation/devicetree/bindings/mailbox/arm-mhu.txt > index 4971f03f0b33..ba659bcc7109 100644 > --- a/Documentation/devicetree/bindings/mailbox/arm-mhu.txt > +++ b/Documentation/devicetree/bindings/mailbox/arm-mhu.txt > @@ -10,6 +10,15 @@ STAT register and the remote clears it after having read the data. > The last channel is specified to be a 'Secure' resource, hence can't be > used by Linux running NS. > > +The MHU drives the interrupt signal using a 32-bit register, with all > +32-bits logically ORed together. It provides a set of registers to > +enable software to set, clear and check the status of each of the bits > +of this register independently. The use of 32 bits per interrupt line > +enables software to provide more information about the source of the > +interrupt. For example, each bit of the register can be associated with > +a type of event that can contribute to raising the interrupt. Each of > +the 32-bits can be used as "doorbell" to alert the remote processor. > + > Mailbox Device Node: > ==================== > > @@ -18,13 +27,21 @@ used by Linux running NS. > - compatible: Shall be "arm,mhu" & "arm,primecell" > - reg: Contains the mailbox register address range (base > address and length) > -- #mbox-cells Shall be 1 - the index of the channel needed. > +- #mbox-cells Shall be 1 - the index of the channel needed when > + not used as set of doorbell bits. > + Shall be 2 - the index of the channel needed, and > + the index of the doorbell bit within the channel > + when used in doorbell mode. > - interrupts: Contains the interrupt information corresponding to > - each of the 3 links of MHU. > + each of the 3 physical channels of MHU namely low > + priority non-secure, high priority non-secure and > + secure channels. > > Example: > -------- > > +1. Controller which doesn't support doorbells > + > mhu: mailbox@2b1f0000 { > #mbox-cells = <1>; > compatible = "arm,mhu", "arm,primecell"; > @@ -41,3 +58,21 @@ used by Linux running NS. > reg = <0 0x2e000000 0x4000>; > mboxes = <&mhu 1>; /* HP-NonSecure */ > }; > + > +2. Controller which supports doorbells > + > + mhu: mailbox@2b1f0000 { > + #mbox-cells = <2>; > + compatible = "arm,mhu", "arm,primecell"; > + reg = <0 0x2b1f0000 0x1000>; > + interrupts = <0 36 4>, /* LP-NonSecure */ > + <0 35 4>; /* HP-NonSecure */ > + clocks = <&clock 0 2 1>; > + clock-names = "apb_pclk"; > + }; > + > + mhu_client: scb@2e000000 { > + compatible = "arm,scpi"; > + reg = <0 0x2e000000 0x200>; > + mboxes = <&mhu 1 4>; /* HP-NonSecure 5th doorbell bit */ > + }; > -- > 2.25.0.rc1.19.g042ed3e048af >
On 18-05-20, 18:29, Bjorn Andersson wrote: > On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote: > > This stuff has been doing rounds on the mailing list since several years > > now with no agreed conclusion by all the parties. And here is another > > attempt to get some feedback from everyone involved to close this once > > and for ever. Your comments will very much be appreciated. > > > > The ARM MHU is defined here in the TRM [1] for your reference, which > > states following: > > > > "The MHU drives the signal using a 32-bit register, with all 32 > > bits logically ORed together. The MHU provides a set of > > registers to enable software to set, clear, and check the status > > of each of the bits of this register independently. The use of > > 32 bits for each interrupt line enables software to provide more > > information about the source of the interrupt. For example, each > > bit of the register can be associated with a type of event that > > can contribute to raising the interrupt." > > > > Does this mean that there are 32 different signals and they are all ORed > into the same interrupt line to trigger software action when something > happens? > > Or does it mean that this register is used to pass multi-bit information > and when any such information is passed an interrupt will be triggered? > If so, what does that information mean? How is it tied into other Linux > drivers/subsystems? I have started to believe the hardware is written badly at this point :) The thing is that the register can be used to send a 32 bit data (which the driver already provides), or it can be used by writing different bits to the SET register concurrently, without corrupting the other bits as writing 0 to a bit has no effect, we have a separate CLEAR register for that. And so it says that all the bits are ORed together to generate the interrupt, i.e. any bit set will generate an interrupt. Which also means that you can't send data 0 with the register.
On Mon, May 18, 2020 at 10:40 PM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > On 18-05-20, 18:29, Bjorn Andersson wrote: > > On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote: > > > This stuff has been doing rounds on the mailing list since several years > > > now with no agreed conclusion by all the parties. And here is another > > > attempt to get some feedback from everyone involved to close this once > > > and for ever. Your comments will very much be appreciated. > > > > > > The ARM MHU is defined here in the TRM [1] for your reference, which > > > states following: > > > > > > "The MHU drives the signal using a 32-bit register, with all 32 > > > bits logically ORed together. The MHU provides a set of > > > registers to enable software to set, clear, and check the status > > > of each of the bits of this register independently. The use of > > > 32 bits for each interrupt line enables software to provide more > > > information about the source of the interrupt. For example, each > > > bit of the register can be associated with a type of event that > > > can contribute to raising the interrupt." > > > > > > > Does this mean that there are 32 different signals and they are all ORed > > into the same interrupt line to trigger software action when something > > happens? > > > > Or does it mean that this register is used to pass multi-bit information > > and when any such information is passed an interrupt will be triggered? > > If so, what does that information mean? How is it tied into other Linux > > drivers/subsystems? > > I have started to believe the hardware is written badly at this point > :) > H/W is actually fine :) Its just that the driver is written to _also_ support a platform (my original) that doesn't have shmem and need to pass data via 32bit registers. Frankly, I am not against the doorbell mode, I am against implementing two modes in a driver. If it really helped (note the past tense) the SCMI, we could implement the driver only in doorbell mode but unfortunately SCMI would still be _broken_ for non-doorbell controllers.
On 18-05-20, 19:53, Jassi Brar wrote: > That is a client/protocol property and has nothing to do with the > controller dt node. That's what I am concerned about, i.e. different ways of passing the doorbell number via DT.
On Fri, May 15, 2020 at 10:47:38AM +0530, Viresh Kumar wrote: > From: Sudeep Holla <sudeep.holla@arm.com> > > Hi Rob, Arnd and Jassi, > > This stuff has been doing rounds on the mailing list since several years > now with no agreed conclusion by all the parties. And here is another > attempt to get some feedback from everyone involved to close this once > and for ever. Your comments will very much be appreciated. > > The ARM MHU is defined here in the TRM [1] for your reference, which > states following: > > "The MHU drives the signal using a 32-bit register, with all 32 > bits logically ORed together. The MHU provides a set of > registers to enable software to set, clear, and check the status > of each of the bits of this register independently. The use of > 32 bits for each interrupt line enables software to provide more > information about the source of the interrupt. For example, each > bit of the register can be associated with a type of event that > can contribute to raising the interrupt." > > On few other platforms, like qcom, similar doorbell mechanism is present > with separate interrupt for each of the bits (that's how I understood > it), while in case of ARM MHU, there is a single interrupt line for all > the 32 bits. Also in case of ARM MHU, these registers and interrupts > have 3 copies for different priority levels, i.e. low priority > non-secure, high priority non-secure and secure channels. > > For ARM MHU, both the dt bindings and the Linux driver support 3 > channels for the different priorities right now and support sending a 32 > bit data on every transfer in a locked fashion, i.e. only one transfer > can be done at once and the other have to wait for it to finish first. > > Here are the point of view of the parties involved on this subject: > > Jassi's viewpoint: > > - Virtualization of channels should be discouraged in software based on > specific usecases of the application. This may invite other mailbox > driver authors to ask for doing virtualization in their drivers. > > - In mailbox's terminology, every channel is equivalent to a signal, > since there is only one signal generated here by the MHU, there should > be only one channel per priority level. > > - The clients should send data (of just setting 1 bit or many in the 32 > bit word) using the existing mechanism as the delays due to > serialization shouldn't be significant anyway. > > - The driver supports 90% of the users with the current implementation > and it shouldn't be extended to support doorbell and implement two > different modes by changing value of #mbox-cells field in bindings. > > Sudeep (ARM) and myself as well to some extent: > > - The hardware gives us the capability to write the register in > parallel, i.e. we can write 0x800 and 0x400 together without any > software locks, and so these 32 bits should be considered as separate > channel even if only one interrupt is issued by the hardware finally. > This shouldn't be called as virtualization of the channels, as the > hardware supports this (as clearly mentioned in the TRM) and it takes > care of handling the signal properly. > > - With serialization, if we use only one channel as today at every > priority, if there are 5 requests to send signal to the receiver and > the dvfs request is the last one in queue (which may be called from > scheduler's hot path with fast switching), it unnecessarily needs to > wait for the first four transfers to finish due to the software > locking imposed by the mailbox framework. This adds additional delay, > maybe of few ms only, which isn't required by the hardware but just by > the software and few ms can be important in scheduler's hotpath. > > - With the current approach it isn't possible to assign different bits > (or doorbell numbers) to clients from DT and the only way of doing > that without adding new bindings is by extending #mbox-cells to accept > a value of 2 as done in this patch. > > Jassi and Sudeep, I hope I was able to represent both the view points > properly here. Please correct me if I have made a mistake here. > > This is it. It would be nice to get the views of everyone now on this > and how should this be handled. I am perfectly fine with adding another cell which seems appropriate here. You can have 5 cells for all I care if that makes sense for the h/w. That has nothing to do with the Linux design. Whether Linux requires serializing mailbox accesses is a separate issue. On that side, it seems silly to not allow driving the h/w in the most efficient way possible. Rob
On 28-05-20, 13:20, Rob Herring wrote: > Whether Linux > requires serializing mailbox accesses is a separate issue. On that side, > it seems silly to not allow driving the h/w in the most efficient way > possible. That's exactly what we are trying to say. The hardware allows us to write all 32 bits in parallel, without any hardware issues, why shouldn't we do that ? The delay (which Sudeep will find out, he is facing issues with hardware access because of lockdown right now) which may be small in transmitting across a mailbox becomes significant because of the fact that it happens synchronously and the receiver will send some sort of acknowledgement (and that depends on the firmware there) and the kernel needs to wait for it, while the kernel doesn't really need to. There is no reason IMHO for being inefficient here while we can do better.
On Thu, May 28, 2020 at 2:20 PM Rob Herring <robh@kernel.org> wrote: > > On Fri, May 15, 2020 at 10:47:38AM +0530, Viresh Kumar wrote: > > From: Sudeep Holla <sudeep.holla@arm.com> > > > > Hi Rob, Arnd and Jassi, > > > > This stuff has been doing rounds on the mailing list since several years > > now with no agreed conclusion by all the parties. And here is another > > attempt to get some feedback from everyone involved to close this once > > and for ever. Your comments will very much be appreciated. > > > > The ARM MHU is defined here in the TRM [1] for your reference, which > > states following: > > > > "The MHU drives the signal using a 32-bit register, with all 32 > > bits logically ORed together. The MHU provides a set of > > registers to enable software to set, clear, and check the status > > of each of the bits of this register independently. The use of > > 32 bits for each interrupt line enables software to provide more > > information about the source of the interrupt. For example, each > > bit of the register can be associated with a type of event that > > can contribute to raising the interrupt." > > > > On few other platforms, like qcom, similar doorbell mechanism is present > > with separate interrupt for each of the bits (that's how I understood > > it), while in case of ARM MHU, there is a single interrupt line for all > > the 32 bits. Also in case of ARM MHU, these registers and interrupts > > have 3 copies for different priority levels, i.e. low priority > > non-secure, high priority non-secure and secure channels. > > > > For ARM MHU, both the dt bindings and the Linux driver support 3 > > channels for the different priorities right now and support sending a 32 > > bit data on every transfer in a locked fashion, i.e. only one transfer > > can be done at once and the other have to wait for it to finish first. > > > > Here are the point of view of the parties involved on this subject: > > > > Jassi's viewpoint: > > > > - Virtualization of channels should be discouraged in software based on > > specific usecases of the application. This may invite other mailbox > > driver authors to ask for doing virtualization in their drivers. > > > > - In mailbox's terminology, every channel is equivalent to a signal, > > since there is only one signal generated here by the MHU, there should > > be only one channel per priority level. > > > > - The clients should send data (of just setting 1 bit or many in the 32 > > bit word) using the existing mechanism as the delays due to > > serialization shouldn't be significant anyway. > > > > - The driver supports 90% of the users with the current implementation > > and it shouldn't be extended to support doorbell and implement two > > different modes by changing value of #mbox-cells field in bindings. > > > > Sudeep (ARM) and myself as well to some extent: > > > > - The hardware gives us the capability to write the register in > > parallel, i.e. we can write 0x800 and 0x400 together without any > > software locks, and so these 32 bits should be considered as separate > > channel even if only one interrupt is issued by the hardware finally. > > This shouldn't be called as virtualization of the channels, as the > > hardware supports this (as clearly mentioned in the TRM) and it takes > > care of handling the signal properly. > > > > - With serialization, if we use only one channel as today at every > > priority, if there are 5 requests to send signal to the receiver and > > the dvfs request is the last one in queue (which may be called from > > scheduler's hot path with fast switching), it unnecessarily needs to > > wait for the first four transfers to finish due to the software > > locking imposed by the mailbox framework. This adds additional delay, > > maybe of few ms only, which isn't required by the hardware but just by > > the software and few ms can be important in scheduler's hotpath. > > > > - With the current approach it isn't possible to assign different bits > > (or doorbell numbers) to clients from DT and the only way of doing > > that without adding new bindings is by extending #mbox-cells to accept > > a value of 2 as done in this patch. > > > > Jassi and Sudeep, I hope I was able to represent both the view points > > properly here. Please correct me if I have made a mistake here. > > > > This is it. It would be nice to get the views of everyone now on this > > and how should this be handled. > > I am perfectly fine with adding another cell which seems appropriate > here. You can have 5 cells for all I care if that makes sense for > the h/w. That has nothing to do with the Linux design. Whether Linux > requires serializing mailbox accesses is a separate issue. On that side, > it seems silly to not allow driving the h/w in the most efficient way > possible. > The fact that all these bits are backed by one physical signal makes the firmware implement its own mux-demux'ing scheme. So the choice was between demux and single signal at a time. Had there been one signal per bit, the implementation would have definitely been 'efficient'. And the first platform the driver was developed on, required message passing over the registers. So now my approach is to make do with what we have...unless it shows in numbers. Anyways, if it comes to that, I'd rather a separate "doorbell' driver than a driver working in two s/w modes. Thanks.
On 29-05-20, 00:20, Jassi Brar wrote: > The fact that all these bits are backed by one physical signal makes I believe that within the IP itself, there must be 32 signals, just that only one comes out of it all OR'd together. Maybe that can be verified by writing 0x01 to it multiple times and it should generate the interrupt only once on the first instance. i.e. writing 1 to any big again won't raise the interrupt. > the firmware implement its own mux-demux'ing scheme. Similar to how the interrupt controllers do it in the kernel, they receive a single interrupt and then go check its registers/bits to see which peripheral raised it. > So the choice was > between demux and single signal at a time. Well, the demux is on the firmware side and the kernel may not need to worry about that, let the platform do it ? > Had there been one signal > per bit, the implementation would have definitely been 'efficient'. I will say 'More efficient', it will still be 'efficient' if we implement doorbell. > And the first platform the driver was developed on, required message > passing over the registers. I think it was correctly done at that point of time. No doubt about that. > So now my approach is to make do with what > we have...unless it shows in numbers. ARM's office is closed (lock-down) and he doesn't have remote access to the board and so was unable to do it until now. Numbers or no numbers, I don't think there is a question here about what is the most efficient way of doing it. Doing it in parallel (when the hardware allows) is surely going to be more efficient. Sending a signal, getting it processed by the firmware, doing something with it and then sending an Ack and that being received by kernel will take time. Why make a queue on the kernel side when we don't have to. :) > Anyways, if it comes to that, I'd rather a separate "doorbell' driver > than a driver working in two s/w modes. I think that would be fine with everyone, if you are fine with that now (based on plain logic and no numbers), maybe we can get that done now?
On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote: > On 28-05-20, 13:20, Rob Herring wrote: > > Whether Linux > > requires serializing mailbox accesses is a separate issue. On that side, > > it seems silly to not allow driving the h/w in the most efficient way > > possible. > > That's exactly what we are trying to say. The hardware allows us to > write all 32 bits in parallel, without any hardware issues, why > shouldn't we do that ? The delay (which Sudeep will find out, he is > facing issues with hardware access because of lockdown right now) OK, I was able to access the setup today. I couldn't reach a point where I can do measurements as the system just became unusable with one physical channel instead of 2 virtual channels as in my patches. My test was simple. Switch to schedutil and read sensors periodically via sysfs. arm-scmi firmware:scmi: message for 1 is not expected! arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120) arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120) arm-scmi firmware:scmi: message for 1 is not expected! arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120) arm-scmi firmware:scmi: message for 1 is not expected! With trace enabled I can see even cpufreq_set timing out. Sample trace output: bash-1019 [005] 1149.452340: scmi_xfer_begin: transfer_id=1537 msg_id=7 protocol_id=19 seq=0 poll=1 bash-1019 [005] 1149.452407: scmi_xfer_end: transfer_id=1537 msg_id=7 protocol_id=19 seq=0 status=0 bash-1526 [000] 1149.472553: scmi_xfer_begin: transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0 <idle>-0 [001] 1149.472733: scmi_xfer_begin: transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1 <idle>-0 [001] 1149.472842: scmi_xfer_end: transfer_id=1539 msg_id=7 protocol_id=19 seq=1 status=-110 <idle>-0 [001] 1149.483040: scmi_xfer_begin: transfer_id=1540 msg_id=7 protocol_id=19 seq=1 poll=1 <idle>-0 [001] 1149.483043: scmi_xfer_end: transfer_id=1540 msg_id=7 protocol_id=19 seq=1 status=0 rs:main-543 [003] 1149.493031: scmi_xfer_begin: transfer_id=1541 msg_id=7 protocol_id=19 seq=1 poll=1 rs:main-543 [003] 1149.493047: scmi_xfer_end: transfer_id=1541 msg_id=7 protocol_id=19 seq=1 status=0 <idle>-0 [000] 1149.507033: scmi_xfer_begin: transfer_id=1542 msg_id=7 protocol_id=19 seq=1 poll=1 <idle>-0 [000] 1149.507044: scmi_xfer_end: transfer_id=1542 msg_id=7 protocol_id=19 seq=1 status=0 bash-1526 [000] 1149.516068: scmi_xfer_end: transfer_id=1538 msg_id=6 protocol_id=21 seq=0 status=-110 bash-1526 [000] 1149.516559: scmi_xfer_begin: transfer_id=1543 msg_id=6 protocol_id=21 seq=0 poll=0 <idle>-0 [001] 1149.516729: scmi_xfer_begin: transfer_id=1544 msg_id=7 protocol_id=19 seq=1 poll=1 <idle>-0 [001] 1149.516837: scmi_xfer_end: transfer_id=1544 msg_id=7 protocol_id=19 seq=1 status=-110 ksoftirqd/0-9 [000] 1149.519065: scmi_xfer_begin: transfer_id=1545 msg_id=7 protocol_id=19 seq=1 poll=1 ksoftirqd/0-9 [000] 1149.519072: scmi_xfer_end: transfer_id=1545 msg_id=7 protocol_id=19 seq=1 status=0 <idle>-0 [001] 1149.526878: scmi_xfer_begin: transfer_id=1546 msg_id=7 protocol_id=19 seq=1 poll=1 <idle>-0 [001] 1149.526882: scmi_xfer_end: transfer_id=1546 msg_id=7 protocol_id=19 seq=1 status=0 <idle>-0 [000] 1149.551119: scmi_xfer_begin: transfer_id=1547 msg_id=7 protocol_id=19 seq=1 poll=1 <idle>-0 [000] 1149.551138: scmi_xfer_end: transfer_id=1547 msg_id=7 protocol_id=19 seq=1 status=0 bash-1526 [000] 1149.560191: scmi_xfer_end: transfer_id=1543 msg_id=6 protocol_id=21 seq=0 status=-110 bash-1526 [000] 1149.560690: scmi_xfer_begin: transfer_id=1548 msg_id=6 protocol_id=21 seq=0 poll=0 <idle>-0 [001] 1149.560859: scmi_xfer_begin: transfer_id=1549 msg_id=7 protocol_id=19 seq=1 poll=1 <idle>-0 [001] 1149.560968: scmi_xfer_end: transfer_id=1549 msg_id=7 protocol_id=19 seq=1 status=-110 protocol_id=19 is cpufreq and 21 is sensor. This is simplest test and I can easily generate more timeouts starting some stress test with DVFS. -- Regards, Sudeep
On Wed, Jun 03, 2020 at 07:04:35PM +0100, Sudeep Holla wrote: > On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote: > > On 28-05-20, 13:20, Rob Herring wrote: > > > Whether Linux > > > requires serializing mailbox accesses is a separate issue. On that side, > > > it seems silly to not allow driving the h/w in the most efficient way > > > possible. > > > > That's exactly what we are trying to say. The hardware allows us to > > write all 32 bits in parallel, without any hardware issues, why > > shouldn't we do that ? The delay (which Sudeep will find out, he is > > facing issues with hardware access because of lockdown right now) > > OK, I was able to access the setup today. I couldn't reach a point > where I can do measurements as the system just became unusable with > one physical channel instead of 2 virtual channels as in my patches. > > My test was simple. Switch to schedutil and read sensors periodically > via sysfs. > > arm-scmi firmware:scmi: message for 1 is not expected! > arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120) > arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120) > arm-scmi firmware:scmi: message for 1 is not expected! > arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120) > arm-scmi firmware:scmi: message for 1 is not expected! > > With trace enabled I can see even cpufreq_set timing out. Sample trace > output: > > bash-1019 [005] 1149.452340: scmi_xfer_begin: transfer_id=1537 msg_id=7 protocol_id=19 seq=0 poll=1 > bash-1019 [005] 1149.452407: scmi_xfer_end: transfer_id=1537 msg_id=7 protocol_id=19 seq=0 status=0 > bash-1526 [000] 1149.472553: scmi_xfer_begin: transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0 > <idle>-0 [001] 1149.472733: scmi_xfer_begin: transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1 > <idle>-0 [001] 1149.472842: scmi_xfer_end: transfer_id=1539 msg_id=7 protocol_id=19 seq=1 status=-110 > <idle>-0 [001] 1149.483040: scmi_xfer_begin: transfer_id=1540 msg_id=7 protocol_id=19 seq=1 poll=1 > <idle>-0 [001] 1149.483043: scmi_xfer_end: transfer_id=1540 msg_id=7 protocol_id=19 seq=1 status=0 > rs:main-543 [003] 1149.493031: scmi_xfer_begin: transfer_id=1541 msg_id=7 protocol_id=19 seq=1 poll=1 > rs:main-543 [003] 1149.493047: scmi_xfer_end: transfer_id=1541 msg_id=7 protocol_id=19 seq=1 status=0 > <idle>-0 [000] 1149.507033: scmi_xfer_begin: transfer_id=1542 msg_id=7 protocol_id=19 seq=1 poll=1 > <idle>-0 [000] 1149.507044: scmi_xfer_end: transfer_id=1542 msg_id=7 protocol_id=19 seq=1 status=0 > bash-1526 [000] 1149.516068: scmi_xfer_end: transfer_id=1538 msg_id=6 protocol_id=21 seq=0 status=-110 > bash-1526 [000] 1149.516559: scmi_xfer_begin: transfer_id=1543 msg_id=6 protocol_id=21 seq=0 poll=0 > <idle>-0 [001] 1149.516729: scmi_xfer_begin: transfer_id=1544 msg_id=7 protocol_id=19 seq=1 poll=1 > <idle>-0 [001] 1149.516837: scmi_xfer_end: transfer_id=1544 msg_id=7 protocol_id=19 seq=1 status=-110 > ksoftirqd/0-9 [000] 1149.519065: scmi_xfer_begin: transfer_id=1545 msg_id=7 protocol_id=19 seq=1 poll=1 > ksoftirqd/0-9 [000] 1149.519072: scmi_xfer_end: transfer_id=1545 msg_id=7 protocol_id=19 seq=1 status=0 > <idle>-0 [001] 1149.526878: scmi_xfer_begin: transfer_id=1546 msg_id=7 protocol_id=19 seq=1 poll=1 > <idle>-0 [001] 1149.526882: scmi_xfer_end: transfer_id=1546 msg_id=7 protocol_id=19 seq=1 status=0 > <idle>-0 [000] 1149.551119: scmi_xfer_begin: transfer_id=1547 msg_id=7 protocol_id=19 seq=1 poll=1 > <idle>-0 [000] 1149.551138: scmi_xfer_end: transfer_id=1547 msg_id=7 protocol_id=19 seq=1 status=0 > bash-1526 [000] 1149.560191: scmi_xfer_end: transfer_id=1543 msg_id=6 protocol_id=21 seq=0 status=-110 > bash-1526 [000] 1149.560690: scmi_xfer_begin: transfer_id=1548 msg_id=6 protocol_id=21 seq=0 poll=0 > <idle>-0 [001] 1149.560859: scmi_xfer_begin: transfer_id=1549 msg_id=7 protocol_id=19 seq=1 poll=1 > <idle>-0 [001] 1149.560968: scmi_xfer_end: transfer_id=1549 msg_id=7 protocol_id=19 seq=1 status=-110 > > protocol_id=19 is cpufreq and 21 is sensor. This is simplest test and > I can easily generate more timeouts starting some stress test with DVFS. > I just realised that we have the timing info in the traces and you will observe the sensor readings take something in order of 100us to 500-600us or even more based on which sensor is being read. While we have 100us timeout for cpufreq opp set. I am attaching full trace now.
Hi Bjorn, Thanks for the details response. On Mon, May 18, 2020 at 06:29:27PM -0700, Bjorn Andersson wrote: > On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote: > [...] I find this part nicely summarise your response. > > - With serialization, if we use only one channel as today at every > > priority, if there are 5 requests to send signal to the receiver and > > the dvfs request is the last one in queue (which may be called from > > scheduler's hot path with fast switching), it unnecessarily needs to > > wait for the first four transfers to finish due to the software > > locking imposed by the mailbox framework. This adds additional delay, > > maybe of few ms only, which isn't required by the hardware but just by > > the software and few ms can be important in scheduler's hotpath. > > > > So these 5 requests, are they conveyed by the signals [1,2,3,4,5] or > [BIT(0), BIT(1), BIT(2), BIT(3), BIT(4)]? > Latter in this case. IMO it is platform choice on how to use it. It is equally possible to send 2^31 different signals. But the receiver must also interpret it in the *exact* same way. In this case, the receiver which is platform firmware interprets as individual bit signals. > In the first case you have to serialize things given that e.g. signal 1 > immediately followed by 2 is indistinguishable from 3. > Agree and we are not proposing to break that use case. It exists in the driver/binding today and will continue as is. > If you signals are single-bit notifications then you don't need any > serialization. > Indeed, we are making use of that. -- Regards, Sudeep
On Mon, May 18, 2020 at 11:05:03PM -0500, Jassi Brar wrote: > On Mon, May 18, 2020 at 10:40 PM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > > > On 18-05-20, 18:29, Bjorn Andersson wrote: > > > On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote: > > > > This stuff has been doing rounds on the mailing list since several years > > > > now with no agreed conclusion by all the parties. And here is another > > > > attempt to get some feedback from everyone involved to close this once > > > > and for ever. Your comments will very much be appreciated. > > > > > > > > The ARM MHU is defined here in the TRM [1] for your reference, which > > > > states following: > > > > > > > > "The MHU drives the signal using a 32-bit register, with all 32 > > > > bits logically ORed together. The MHU provides a set of > > > > registers to enable software to set, clear, and check the status > > > > of each of the bits of this register independently. The use of > > > > 32 bits for each interrupt line enables software to provide more > > > > information about the source of the interrupt. For example, each > > > > bit of the register can be associated with a type of event that > > > > can contribute to raising the interrupt." > > > > > > > > > > Does this mean that there are 32 different signals and they are all ORed > > > into the same interrupt line to trigger software action when something > > > happens? > > > > > > Or does it mean that this register is used to pass multi-bit information > > > and when any such information is passed an interrupt will be triggered? > > > If so, what does that information mean? How is it tied into other Linux > > > drivers/subsystems? > > > > I have started to believe the hardware is written badly at this point > > :) > > > H/W is actually fine :) Its just that the driver is written to > _also_ support a platform (my original) that doesn't have shmem and > need to pass data via 32bit registers. > Frankly, I am not against the doorbell mode, I am against implementing > two modes in a driver. If it really helped (note the past tense) the > SCMI, we could implement the driver only in doorbell mode but > unfortunately SCMI would still be _broken_ for non-doorbell > controllers. Should be fine as the specification is designed to work with only shmem for any data transfer and mailbox is just a signal mechanism. I won't be too worried about that. -- Regards, Sudeep
On Wed, Jun 3, 2020 at 1:04 PM Sudeep Holla <sudeep.holla@arm.com> wrote: > > On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote: > > On 28-05-20, 13:20, Rob Herring wrote: > > > Whether Linux > > > requires serializing mailbox accesses is a separate issue. On that side, > > > it seems silly to not allow driving the h/w in the most efficient way > > > possible. > > > > That's exactly what we are trying to say. The hardware allows us to > > write all 32 bits in parallel, without any hardware issues, why > > shouldn't we do that ? The delay (which Sudeep will find out, he is > > facing issues with hardware access because of lockdown right now) > > OK, I was able to access the setup today. I couldn't reach a point > where I can do measurements as the system just became unusable with > one physical channel instead of 2 virtual channels as in my patches. > > My test was simple. Switch to schedutil and read sensors periodically > via sysfs. > > arm-scmi firmware:scmi: message for 1 is not expected! > This sounds like you are not serialising requests on a shared channel. Can you please also share the patch? Thanks.
On Wed, Jun 3, 2020 at 1:31 PM Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > > H/W is actually fine :) Its just that the driver is written to > > _also_ support a platform (my original) that doesn't have shmem and > > need to pass data via 32bit registers. > > Frankly, I am not against the doorbell mode, I am against implementing > > two modes in a driver. If it really helped (note the past tense) the > > SCMI, we could implement the driver only in doorbell mode but > > unfortunately SCMI would still be _broken_ for non-doorbell > > controllers. > > Should be fine as the specification is designed to work with only shmem > for any data transfer and mailbox is just a signal mechanism. I won't > be too worried about that. > Please clarify how it will work on, say again, rockchip platform with shmem. thanks.
On 03-06-20, 19:17, Sudeep Holla wrote: > I just realised that we have the timing info in the traces and you will > observe the sensor readings take something in order of 100us to 500-600us > or even more based on which sensor is being read. While we have 100us > timeout for cpufreq opp set. Which timeout from opp core are you talking about ?
On Thu, Jun 04, 2020 at 11:29:03AM +0530, Viresh Kumar wrote: > On 03-06-20, 19:17, Sudeep Holla wrote: > > I just realised that we have the timing info in the traces and you will > > observe the sensor readings take something in order of 100us to 500-600us > > or even more based on which sensor is being read. While we have 100us > > timeout for cpufreq opp set. > > Which timeout from opp core are you talking about ? The one we have in SCMI itself to meet the fast_switch requirement. -- Regards, Sudeep
On Wed, Jun 03, 2020 at 01:32:42PM -0500, Jassi Brar wrote: > On Wed, Jun 3, 2020 at 1:04 PM Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote: > > > On 28-05-20, 13:20, Rob Herring wrote: > > > > Whether Linux > > > > requires serializing mailbox accesses is a separate issue. On that side, > > > > it seems silly to not allow driving the h/w in the most efficient way > > > > possible. > > > > > > That's exactly what we are trying to say. The hardware allows us to > > > write all 32 bits in parallel, without any hardware issues, why > > > shouldn't we do that ? The delay (which Sudeep will find out, he is > > > facing issues with hardware access because of lockdown right now) > > > > OK, I was able to access the setup today. I couldn't reach a point > > where I can do measurements as the system just became unusable with > > one physical channel instead of 2 virtual channels as in my patches. > > > > My test was simple. Switch to schedutil and read sensors periodically > > via sysfs. > > > > arm-scmi firmware:scmi: message for 1 is not expected! > > > This sounds like you are not serialising requests on a shared channel. > Can you please also share the patch? OK, I did try with a small patch initially and then realised we must hit issue with mainline as is. Tried and the behaviour is exact same. All I did is removed my patches and use bit[0] as the signal. It doesn't matter as writes to the register are now serialised. Oh, the above message comes when OS times out in advance while firmware continues to process the old request and respond. The trace I sent gives much better view of what's going on. -- Regards, Sudeep
On Thu, Jun 4, 2020 at 4:20 AM Sudeep Holla <sudeep.holla@arm.com> wrote: > > On Wed, Jun 03, 2020 at 01:32:42PM -0500, Jassi Brar wrote: > > On Wed, Jun 3, 2020 at 1:04 PM Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > > On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote: > > > > On 28-05-20, 13:20, Rob Herring wrote: > > > > > Whether Linux > > > > > requires serializing mailbox accesses is a separate issue. On that side, > > > > > it seems silly to not allow driving the h/w in the most efficient way > > > > > possible. > > > > > > > > That's exactly what we are trying to say. The hardware allows us to > > > > write all 32 bits in parallel, without any hardware issues, why > > > > shouldn't we do that ? The delay (which Sudeep will find out, he is > > > > facing issues with hardware access because of lockdown right now) > > > > > > OK, I was able to access the setup today. I couldn't reach a point > > > where I can do measurements as the system just became unusable with > > > one physical channel instead of 2 virtual channels as in my patches. > > > > > > My test was simple. Switch to schedutil and read sensors periodically > > > via sysfs. > > > > > > arm-scmi firmware:scmi: message for 1 is not expected! > > > > > This sounds like you are not serialising requests on a shared channel. > > Can you please also share the patch? > > OK, I did try with a small patch initially and then realised we must hit > issue with mainline as is. Tried and the behaviour is exact same. All > I did is removed my patches and use bit[0] as the signal. It doesn't > matter as writes to the register are now serialised. Oh, the above > message comes when OS times out in advance while firmware continues to > process the old request and respond. > > The trace I sent gives much better view of what's going on. > BTW, you didn't even share 'good' vs 'bad' log for me to actually see if the api lacks. Let us look closely ... >> bash-1019 [005] 1149.452340: scmi_xfer_begin: transfer_id=1537 msg_id=7 protocol_id=19 seq=0 poll=1 >> bash-1019 [005] 1149.452407: scmi_xfer_end: transfer_id=1537 msg_id=7 protocol_id=19 seq=0 status=0 > This round trip took 67usecs. (log shows some at even 3usecs) That includes mailbox api overhead, memcpy and the time taken by remote to respond. So the api is definitely capable of much faster transfers than you require. >> bash-1526 [000] 1149.472553: scmi_xfer_begin: transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0 >> <idle>-0 [001] 1149.472733: scmi_xfer_begin: transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1 > Here another request is started before the first is finished. If you want this to work you have to serialise access to the single physical channel and/or run the remote firmware in asynchronous mode - that is, the firmware ack the bit as soon as it sees and starts working in the background, so that we return in ~3usec always, while the data returns whenever it is ready. Again, I don't have the code or the 'good' run log to compare. PS: I feel it is probably less effort for me to simply let the patch through, than use my reiki powers to imagine how your test code and firmware looks like.
On Thu, Jun 04, 2020 at 10:15:55AM -0500, Jassi Brar wrote: > On Thu, Jun 4, 2020 at 4:20 AM Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > On Wed, Jun 03, 2020 at 01:32:42PM -0500, Jassi Brar wrote: > > > On Wed, Jun 3, 2020 at 1:04 PM Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > > > > On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote: > > > > > On 28-05-20, 13:20, Rob Herring wrote: > > > > > > Whether Linux > > > > > > requires serializing mailbox accesses is a separate issue. On that side, > > > > > > it seems silly to not allow driving the h/w in the most efficient way > > > > > > possible. > > > > > > > > > > That's exactly what we are trying to say. The hardware allows us to > > > > > write all 32 bits in parallel, without any hardware issues, why > > > > > shouldn't we do that ? The delay (which Sudeep will find out, he is > > > > > facing issues with hardware access because of lockdown right now) > > > > > > > > OK, I was able to access the setup today. I couldn't reach a point > > > > where I can do measurements as the system just became unusable with > > > > one physical channel instead of 2 virtual channels as in my patches. > > > > > > > > My test was simple. Switch to schedutil and read sensors periodically > > > > via sysfs. > > > > > > > > arm-scmi firmware:scmi: message for 1 is not expected! > > > > > > > This sounds like you are not serialising requests on a shared channel. > > > Can you please also share the patch? > > > > OK, I did try with a small patch initially and then realised we must hit > > issue with mainline as is. Tried and the behaviour is exact same. All > > I did is removed my patches and use bit[0] as the signal. It doesn't > > matter as writes to the register are now serialised. Oh, the above > > message comes when OS times out in advance while firmware continues to > > process the old request and respond. > > > > The trace I sent gives much better view of what's going on. > > > BTW, you didn't even share 'good' vs 'bad' log for me to actually see > if the api lacks. > > Let us look closely ... > > >> bash-1019 [005] 1149.452340: scmi_xfer_begin: > transfer_id=1537 msg_id=7 protocol_id=19 seq=0 poll=1 > >> bash-1019 [005] 1149.452407: scmi_xfer_end: > transfer_id=1537 msg_id=7 protocol_id=19 seq=0 status=0 > > > This round trip took 67usecs. (log shows some at even 3usecs) > That includes mailbox api overhead, memcpy and the time taken by > remote to respond. This is DVFS request which firmware acknowledges quickly and expected to at most 100us. > So the api is definitely capable of much faster transfers than you require. > I am not complaining about that. The delay is mostly due to the load on the mailbox and parallelising helps is the focus here. > >> bash-1526 [000] 1149.472553: scmi_xfer_begin: transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0 > >> <idle>-0 [001] 1149.472733: scmi_xfer_begin: transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1 > > > Here another request is started before the first is finished. Ah, the prints are when the client requested. It is not when the mailbox started it. So this just indicates the beginning of the transfer from the client. I must have mentioned that earlier. Some request timeout before being picked up by mailbox if the previous request is not acknowledge quickly. E.g. Say a sensor command started which may take upto 1ms, almost 5-6 DVFS request after that will timeout. > If you want this to work you have to serialise access to the single > physical channel and/or run the remote firmware in asynchronous mode - > that is, the firmware ack the bit as soon as it sees and starts > working in the background, so that we return in ~3usec always, while > the data returns whenever it is ready. Yes it does that for few requests like DVFS while it uses synchronous mode for few others. While ideally I agree everything in asynchronous most is better, I don't know there may be reasons for such design. Also the solution given is to use different bits as independent channels which hardware allows. Anyways it's open source SCP project[1]. -- Regards, Sudeep [1] https://github.com/ARM-software/SCP-firmware
On Thu, Jun 4, 2020 at 11:56 PM Sudeep Holla <sudeep.holla@arm.com> wrote: > > > >> bash-1526 [000] 1149.472553: scmi_xfer_begin: transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0 > > >> <idle>-0 [001] 1149.472733: scmi_xfer_begin: transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1 > > > > > Here another request is started before the first is finished. > > Ah, the prints are when the client requested. It is not when the mailbox > started it. So this just indicates the beginning of the transfer from the > client. > There maybe condition on a sensor read to finish within 1ms, but there is no condition for the read to _start_ at this very moment (usually there are sleeps in the path to sensor requests). > > If you want this to work you have to serialise access to the single > > physical channel and/or run the remote firmware in asynchronous mode - > > that is, the firmware ack the bit as soon as it sees and starts > > working in the background, so that we return in ~3usec always, while > > the data returns whenever it is ready. > > Yes it does that for few requests like DVFS while it uses synchronous > mode for few others. While ideally I agree everything in asynchronous > most is better, I don't know there may be reasons for such design. > The reason is that, if the firmware works asynchronously, every call would return in ~3usec and you won't think you need virtual channels. You have shared only 'bad' log without serialising access. Please share log after serialising access to the channel and the 'good' log with virtual channels. That should put the topic to rest. thanks.
On Fri, Jun 5, 2020 at 3:58 AM Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > >> bash-1526 [000] 1149.472553: scmi_xfer_begin: transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0 > > > > >> <idle>-0 [001] 1149.472733: scmi_xfer_begin: transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1 > > > > > > > > > Here another request is started before the first is finished. > > > > > > Ah, the prints are when the client requested. It is not when the mailbox > > > started it. So this just indicates the beginning of the transfer from the > > > client. > > > > > There maybe condition on a sensor read to finish within 1ms, but there > > is no condition for the read to _start_ at this very moment (usually > > there are sleeps in the path to sensor requests). > > > > Again I wasn't clear. The trace logs are at the point just before calling > mbox_send_messages. So any delay in sensor drivers won't get include. It > is after the point sensor driver request to read the value and before we > send the request via mailbox. > No, you were clear, I wasn't. Let me try again. Since origin upto scmi_xfer, there can be many forms of sleep like schedule/mutexlock etc.... think of some userspace triggering sensor or dvfs operation. Linux does not provide real-time guarantees. Even if remote (scmi) firmware guarantee RT response, it makes sense to timeout a response only after the _request is on the bus_ and not when you submit a request to the api (unless you serialise it). IOW, start the timeout from mbox_client.tx_prepare() when the message actually gets on the bus. > > You have shared only 'bad' log without serialising access. Please > > share log after serialising access to the channel and the 'good' log > > with virtual channels. That should put the topic to rest. > > > > I didn't realise that, sorry for missing that earlier. Attached both > now, thanks for asking. > Interesting logs ! The time taken to complete _successful_ requests are arguably better in bad_trace ... there are many <10usec responses in bad_trace, while the fastest response in good_trace is 53usec. And the requests that 'fail/timeout' are purely the result of not serialising them or checkout for timeout at wrong place as explained above. thanks.
On 05-06-20, 10:42, Jassi Brar wrote: > Since origin upto scmi_xfer, there can be many forms of sleep like > schedule/mutexlock etc.... think of some userspace triggering sensor > or dvfs operation. Linux does not provide real-time guarantees. Even > if remote (scmi) firmware guarantee RT response, it makes sense to > timeout a response only after the _request is on the bus_ and not > when you submit a request to the api (unless you serialise it). > IOW, start the timeout from mbox_client.tx_prepare() when the > message actually gets on the bus. There are multiple purposes of the timeout IMO: - Returning early if the other side is dead/hung, in such a case the timeout can be put when the request is put on the bus as we don't care of the time it takes to complete the request until the time the request can be fulfilled. This can be a example of i2c/spi memory read. - Ensuring maximum time in which the request needs to be serviced. There may be hard requirements, like in case for DVFS from scheduler's hot path (which is essential for better working of the overall system). And for such a case the timeout is placed at the right place IMO, i.e. right after a request is submitted to mailbox. And some more points I wanted to share.. - I am not sure I understood the *serializing* part you guys were talking about. I believe mailbox framework is already serializing the requests it is receiving on a single channel with a spin lock, right ? Why does the client need to serialize them as well? Is that for avoiding timeouts ? - For me, and Sudeep as well IIUC, the bigger problem isn't that timeouts are happening and requests are failing (and so changing the timeout to a bigger value isn't going to fix anything), but the problem is that it is taking too long (because of the queue of requests on a channel) for a request to finish after being submitted. Scheduler doesn't care of the underneath logistics for example, all it cares for is the time it takes to change the frequency of a CPU. If you can do it fast enough in a guaranteed manner, then you can use fast switching, otherwise not. - The hardware can very well support the case today where this can be done in parallel and (almost) in a guaranteed time-frame. While the software wants to add a limit to that and so wants to serialize requests. - As many people have already suggested it (like me, Sudeep, Rob, maybe Bjorn as well), it seems silly to not allow driving the h/w in the most efficient way possible (and allow fast cpu switching in this case). > Interesting logs ! The time taken to complete _successful_ requests > are arguably better in bad_trace ... there are many <10usec responses > in bad_trace, while the fastest response in good_trace is 53usec. Indeed this is interesting. It may be worth looking (separately) into why don't we see those 3 us long requests anymore, or maybe they were just not there in the logs. > And the requests that 'fail/timeout' are purely the result of not > serialising them or checkout for timeout at wrong place as explained > above. We can't allow for the requests to go on for ever in some cases, while in other cases it may be absolutely fine.
Hi Viresh, Thanks for summarising the thoughts quite nicely. On Wed, Jun 10, 2020 at 03:03:34PM +0530, Viresh Kumar wrote: > On 05-06-20, 10:42, Jassi Brar wrote: > > Since origin upto scmi_xfer, there can be many forms of sleep like > > schedule/mutexlock etc.... think of some userspace triggering sensor > > or dvfs operation. Linux does not provide real-time guarantees. Even > > if remote (scmi) firmware guarantee RT response, it makes sense to > > timeout a response only after the _request is on the bus_ and not > > when you submit a request to the api (unless you serialise it). > > IOW, start the timeout from mbox_client.tx_prepare() when the > > message actually gets on the bus. > > There are multiple purposes of the timeout IMO: > > - Returning early if the other side is dead/hung, in such a case the > timeout can be put when the request is put on the bus as we don't > care of the time it takes to complete the request until the time the > request can be fulfilled. This can be a example of i2c/spi memory > read. > > - Ensuring maximum time in which the request needs to be serviced. > There may be hard requirements, like in case for DVFS from > scheduler's hot path (which is essential for better working of the > overall system). And for such a case the timeout is placed at the > right place IMO, i.e. right after a request is submitted to mailbox. > Agreed on both points. > And some more points I wanted to share.. > > - I am not sure I understood the *serializing* part you guys were > talking about. I believe mailbox framework is already serializing > the requests it is receiving on a single channel with a spin lock, > right ? Why does the client need to serialize them as well? Is that > for avoiding timeouts ? > > - For me, and Sudeep as well IIUC, the bigger problem isn't that > timeouts are happening and requests are failing (and so changing the > timeout to a bigger value isn't going to fix anything), but the > problem is that it is taking too long (because of the queue of > requests on a channel) for a request to finish after being > submitted. Scheduler doesn't care of the underneath logistics for > example, all it cares for is the time it takes to change the > frequency of a CPU. If you can do it fast enough in a guaranteed > manner, then you can use fast switching, otherwise not. > > - The hardware can very well support the case today where this can be > done in parallel and (almost) in a guaranteed time-frame. While the > software wants to add a limit to that and so wants to serialize > requests. > +1 > - As many people have already suggested it (like me, Sudeep, Rob, > maybe Bjorn as well), it seems silly to not allow driving the h/w in > the most efficient way possible (and allow fast cpu switching in > this case). > > > Interesting logs ! The time taken to complete _successful_ requests > > are arguably better in bad_trace ... there are many <10usec responses > > in bad_trace, while the fastest response in good_trace is 53usec. > > Indeed this is interesting. It may be worth looking (separately) into > why don't we see those 3 us long requests anymore, or maybe they were > just not there in the logs. > As I mentioned in another thread that non-dvfs requests may be prioritised lower when there are parallel request to the remote. The so called bad trace doesn't have such scenario with single channel and all requests from OS being serialised. The good trace has 2 channels and requests to remote happen in parallel and hence it is fair to see slightly higher latencies for lower priority requests. -- Regards, Sudeep
On Thu, Jun 11, 2020 at 5:00 AM Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > Interesting logs ! The time taken to complete _successful_ requests > > > are arguably better in bad_trace ... there are many <10usec responses > > > in bad_trace, while the fastest response in good_trace is 53usec. > > > > Indeed this is interesting. It may be worth looking (separately) into > > why don't we see those 3 us long requests anymore, or maybe they were > > just not there in the logs. > > > > As I mentioned in another thread that non-dvfs requests may be prioritised > lower when there are parallel request to the remote. The so called bad > trace doesn't have such scenario with single channel and all requests > from OS being serialised. The good trace has 2 channels and requests to > remote happen in parallel and hence it is fair to see slightly higher > latencies for lower priority requests. > In the first post in this thread, Viresh lamented that mailbox introduces "a few ms" delay in the scheduler path. Your own tests show that is certainly not the case -- average is the same as proposed virtual channels 50-100us, the best case is 3us vs 53us for virtual channels. -#define SCMI_MAX_POLL_TO_NS (100 * NSEC_PER_USEC) +#define SCMI_MAX_POLL_TO_NS (30 * 1000 * NSEC_PER_USEC) The above simple fix (not a hack or workaround) avoids the need of virtual channels' implementation, as per tests you conducted. It might have been silly to not implement virtual channels originally, but it would be just as silly now to implement if we can reuse the code. So I welcome new tests. thanks.
On 11-06-20, 19:34, Jassi Brar wrote: > In the first post in this thread, Viresh lamented that mailbox > introduces "a few ms" delay in the scheduler path. > Your own tests show that is certainly not the case -- average is the > same as proposed virtual channels 50-100us, the best case is 3us vs > 53us for virtual channels. Hmmm, I am not sure where is the confusion here Jassi. There are two things which are very very different from each other. - Time taken by the mailbox framework (and remote for acknowledging it) for completion of a single request, this can be 3us to 100s of us. This is clear for everyone. THIS IS NOT THE PROBLEM. - Delay introduced by few of such requests on the last one, i.e. 5 normal requests followed by an important one (like DVFS), the last one needs to wait for the first 5 to finish first. THIS IS THE PROBLEM. Just increasing the timeout isn't going to solve anything as I said in the last email, we can make it 5 minutes for what's its worth. The idea is to make the turn-around-time less for all the requests.. From Google (I know you must already know it, I am just trying to highlight the importance of this thing here): Turnaround time (TAT) is the time interval from the time of submission of a process (read request) to the time of the completion of the process. This is what people care about, that is the whole reason kernel has multi-processing support in the first place. If making things sequential was good enough, we would have never reached here. The whole idea is to parallelize things as much as possible without hurting efficiency in a bad way (like too much parallelism). The hardware allows parallelism and there is absolutely no point in not allowing that. The kernel doesn't need to worry about how the remote is going to handle it. Remote may be simple and handle it sequentially or it may be running Linux itself and can schedule multiple threads for requests.
Picking up the old thread again after and getting pinged by multiple colleagues about it (thanks!) reading through the history. On Fri, Jun 12, 2020 at 7:29 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > On 11-06-20, 19:34, Jassi Brar wrote: > > In the first post in this thread, Viresh lamented that mailbox > > introduces "a few ms" delay in the scheduler path. > > Your own tests show that is certainly not the case -- average is the > > same as proposed virtual channels 50-100us, the best case is 3us vs > > 53us for virtual channels. > > Hmmm, I am not sure where is the confusion here Jassi. There are two > things which are very very different from each other. > > - Time taken by the mailbox framework (and remote for acknowledging > it) for completion of a single request, this can be 3us to 100s of > us. This is clear for everyone. THIS IS NOT THE PROBLEM. > > - Delay introduced by few of such requests on the last one, i.e. 5 > normal requests followed by an important one (like DVFS), the last > one needs to wait for the first 5 to finish first. THIS IS THE > PROBLEM. Earlier, Jassi also commented "Linux does not provide real-time guarantees", which to me is what actually causes the issue here: Linux having timeouts when communicating to the firmware means that it relies on the hardware and firmware having real-time behavior even when not providing real-time guarantees to its processes. When comparing the two usage models, it's clear that the minimum latency for a message delivery is always at least the time time to process an interrupt, plus at least one expensive MMIO read and one less expensive posted MMIO write for an Ack. If we have a doorbell plus out-of-band message, we need an extra DMA barrier and a read from coherent memory, both of which can be noticeable. As soon as messages are queued in the current model, the maximum latency increases by a potentially unbounded number of round-trips, while in the doorbell model that problem does not exist, so I agree that we need to handle both modes in the kernel deal with all existing hardware as well as firmware that requires low-latency communication. It also sounds like that debate is already settled because there are platforms using both modes, and in the kernel we usually end up supporting the platforms that our users have, whether we think it's a good idea or not. The only questions that I see in need of being answered are: 1. Should the binding use just different "#mbox-cells" values or also different "compatible" strings to tell that difference? 2. Should one driver try to handle both modes or should there be two drivers? It sounds like Jassi strongly prefers separate drivers, which would make separate compatible strings the more practical approach. While the argument can be made that a single piece of hardware should only have one DT description, the counter-argument would be that the behavior described by the DT here is made up by both the hardware and the firmware behind it, and they are in fact different. Arnd
On 08-09-20, 11:14, Arnd Bergmann wrote: > Picking up the old thread again after and getting pinged by multiple > colleagues about it (thanks!) reading through the history. Thanks for your inputs Arnd. > Earlier, Jassi also commented "Linux does not provide real-time > guarantees", which to me is what actually causes the issue here: > > Linux having timeouts when communicating to the firmware means > that it relies on the hardware and firmware having real-time behavior > even when not providing real-time guarantees to its processes. > > When comparing the two usage models, it's clear that the minimum > latency for a message delivery is always at least the time time > to process an interrupt, plus at least one expensive MMIO read > and one less expensive posted MMIO write for an Ack. If we > have a doorbell plus out-of-band message, we need an extra > DMA barrier and a read from coherent memory, both of which can > be noticeable. As soon as messages are queued in the current > model, the maximum latency increases by a potentially unbounded > number of round-trips, while in the doorbell model that problem > does not exist, so I agree that we need to handle both modes > in the kernel deal with all existing hardware as well as firmware > that requires low-latency communication. Right. > It also sounds like that debate is already settled because there > are platforms using both modes, and in the kernel we usually > end up supporting the platforms that our users have, whether > we think it's a good idea or not. > > The only questions that I see in need of being answered are: > > 1. Should the binding use just different "#mbox-cells" values or > also different "compatible" strings to tell that difference? > 2. Should one driver try to handle both modes or should there > be two drivers? > > It sounds like Jassi strongly prefers separate drivers, which > would make separate compatible strings the more practical > approach. While the argument can be made that a single > piece of hardware should only have one DT description, > the counter-argument would be that the behavior described > by the DT here is made up by both the hardware and the > firmware behind it, and they are in fact different. I would be fine with both the ideas and that isn't a blocker for me. Though I still feel this is exactly why we have #mbox-cells here and that should be enough in this case, even if we opt for multiple drivers. But whatever everyone agrees to will be fine.
On Tue, Sep 08, 2020 at 11:14:50AM +0200, Arnd Bergmann wrote: > Picking up the old thread again after and getting pinged by multiple > colleagues about it (thanks!) reading through the history. > > On Fri, Jun 12, 2020 at 7:29 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > > > On 11-06-20, 19:34, Jassi Brar wrote: > > > In the first post in this thread, Viresh lamented that mailbox > > > introduces "a few ms" delay in the scheduler path. > > > Your own tests show that is certainly not the case -- average is the > > > same as proposed virtual channels 50-100us, the best case is 3us vs > > > 53us for virtual channels. > > > > Hmmm, I am not sure where is the confusion here Jassi. There are two > > things which are very very different from each other. > > > > - Time taken by the mailbox framework (and remote for acknowledging > > it) for completion of a single request, this can be 3us to 100s of > > us. This is clear for everyone. THIS IS NOT THE PROBLEM. > > > > - Delay introduced by few of such requests on the last one, i.e. 5 > > normal requests followed by an important one (like DVFS), the last > > one needs to wait for the first 5 to finish first. THIS IS THE > > PROBLEM. > > Earlier, Jassi also commented "Linux does not provide real-time > guarantees", which to me is what actually causes the issue here: > > Linux having timeouts when communicating to the firmware means > that it relies on the hardware and firmware having real-time behavior > even when not providing real-time guarantees to its processes. > > When comparing the two usage models, it's clear that the minimum > latency for a message delivery is always at least the time time > to process an interrupt, plus at least one expensive MMIO read > and one less expensive posted MMIO write for an Ack. If we > have a doorbell plus out-of-band message, we need an extra > DMA barrier and a read from coherent memory, both of which can > be noticeable. As soon as messages are queued in the current > model, the maximum latency increases by a potentially unbounded > number of round-trips, while in the doorbell model that problem > does not exist, so I agree that we need to handle both modes > in the kernel deal with all existing hardware as well as firmware > that requires low-latency communication. > > It also sounds like that debate is already settled because there > are platforms using both modes, and in the kernel we usually > end up supporting the platforms that our users have, whether > we think it's a good idea or not. > Thanks for the nice summary of the discussion so far. > The only questions that I see in need of being answered are: > > 1. Should the binding use just different "#mbox-cells" values or > also different "compatible" strings to tell that difference? I initially proposed latter, but Rob preferred the former which makes sense for the reasons you have mentioned below. > 2. Should one driver try to handle both modes or should there > be two drivers? > > It sounds like Jassi strongly prefers separate drivers, which > would make separate compatible strings the more practical > approach. Indeed. > While the argument can be made that a single > piece of hardware should only have one DT description, > the counter-argument would be that the behavior described > by the DT here is made up by both the hardware and the > firmware behind it, and they are in fact different. > I am too fine either way. -- Regards, Sudeep
On Tue, Sep 8, 2020 at 4:15 AM Arnd Bergmann <arnd@arndb.de> wrote: > > Picking up the old thread again after and getting pinged by multiple > colleagues about it (thanks!) reading through the history. > > On Fri, Jun 12, 2020 at 7:29 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > > > On 11-06-20, 19:34, Jassi Brar wrote: > > > In the first post in this thread, Viresh lamented that mailbox > > > introduces "a few ms" delay in the scheduler path. > > > Your own tests show that is certainly not the case -- average is the > > > same as proposed virtual channels 50-100us, the best case is 3us vs > > > 53us for virtual channels. > > > > Hmmm, I am not sure where is the confusion here Jassi. There are two > > things which are very very different from each other. > > > > - Time taken by the mailbox framework (and remote for acknowledging > > it) for completion of a single request, this can be 3us to 100s of > > us. This is clear for everyone. THIS IS NOT THE PROBLEM. > > > > - Delay introduced by few of such requests on the last one, i.e. 5 > > normal requests followed by an important one (like DVFS), the last > > one needs to wait for the first 5 to finish first. THIS IS THE > > PROBLEM. > > Earlier, Jassi also commented "Linux does not provide real-time > guarantees", which to me is what actually causes the issue here: > > Linux having timeouts when communicating to the firmware means > that it relies on the hardware and firmware having real-time behavior > even when not providing real-time guarantees to its processes. > The timeout used in SCMI is simply based on how long the Juno (?) platform takes to reply in most cases. Talking proper code-design, the timeout (if at all) shouldn't even be a hardcoded value, but instead taken from the platform. > When comparing the two usage models, it's clear that the minimum > latency for a message delivery is always at least the time time > to process an interrupt, plus at least one expensive MMIO read > and one less expensive posted MMIO write for an Ack. If we > have a doorbell plus out-of-band message, we need an extra > DMA barrier and a read from coherent memory, both of which can > be noticeable. As soon as messages are queued in the current > model, the maximum latency increases by a potentially unbounded > number of round-trips, while in the doorbell model that problem > does not exist, so I agree that we need to handle both modes > in the kernel deal with all existing hardware as well as firmware > that requires low-latency communication. > From the test case Sudeep last shared, the scmi usage on mhu doesn't not even hit any bottleneck ... the test "failed" because of the too small hardcoded timeout value. Otherwise the current code actually shows better numbers. We need some synthetic tests to bring the limitation to the surface. I agree that there may be such a test case, however fictitious. For that reason, I am ok with the doorbell mode. > The only questions that I see in need of being answered are: > > 1. Should the binding use just different "#mbox-cells" values or > also different "compatible" strings to tell that difference? > 2. Should one driver try to handle both modes or should there > be two drivers? > > It sounds like Jassi strongly prefers separate drivers, which > would make separate compatible strings the more practical > approach. While the argument can be made that a single > piece of hardware should only have one DT description, > the counter-argument would be that the behavior described > by the DT here is made up by both the hardware and the > firmware behind it, and they are in fact different. > I totally agree with one compat-string for one hardware. However, as you said, unlike other device classes, the mailbox driver runs the sumtotal of hardware and the remote firmware behaviour. Also the implementations wouldn't share much, so I think a separate file+dt will be better. But I wanna get rid of this toothache that flares up every season, so whatever. Cheers!
On 08-09-20, 22:23, Jassi Brar wrote: > From the test case Sudeep last shared, the scmi usage on mhu doesn't > not even hit any bottleneck ... the test "failed" because of the too > small hardcoded timeout value. Otherwise the current code actually > shows better numbers. Its not important on why the test failed there, but the fact that there were requests in queue which have to be completed one by one and the last ones in the queue will always pay the penalty. > We need some synthetic tests to bring the limitation to the surface. I > agree that there may be such a test case, however fictitious. For that > reason, I am ok with the doorbell mode. > > I totally agree with one compat-string for one hardware. However, as > you said, unlike other device classes, the mailbox driver runs the > sumtotal of hardware and the remote firmware behaviour. Also the > implementations wouldn't share much, so I think a separate file+dt > will be better. > But I wanna get rid of this toothache that flares up > every season, so whatever. I can't agree more :) So to conclude the thread, if I have understood correctly, we are going to implement another doorbell driver for this hardware which will use a different compatible string and #mbox-cells value. I will try to refresh the bindings soon, which will be followed by the driver implementation. Thanks everyone.
On Tue, Sep 08, 2020 at 10:23:33PM -0500, Jassi Brar wrote: > On Tue, Sep 8, 2020 at 4:15 AM Arnd Bergmann <arnd@arndb.de> wrote: > > > > Picking up the old thread again after and getting pinged by multiple > > colleagues about it (thanks!) reading through the history. > > > > On Fri, Jun 12, 2020 at 7:29 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > > > > > On 11-06-20, 19:34, Jassi Brar wrote: > > > > In the first post in this thread, Viresh lamented that mailbox > > > > introduces "a few ms" delay in the scheduler path. > > > > Your own tests show that is certainly not the case -- average is the > > > > same as proposed virtual channels 50-100us, the best case is 3us vs > > > > 53us for virtual channels. > > > > > > Hmmm, I am not sure where is the confusion here Jassi. There are two > > > things which are very very different from each other. > > > > > > - Time taken by the mailbox framework (and remote for acknowledging > > > it) for completion of a single request, this can be 3us to 100s of > > > us. This is clear for everyone. THIS IS NOT THE PROBLEM. > > > > > > - Delay introduced by few of such requests on the last one, i.e. 5 > > > normal requests followed by an important one (like DVFS), the last > > > one needs to wait for the first 5 to finish first. THIS IS THE > > > PROBLEM. > > > > Earlier, Jassi also commented "Linux does not provide real-time > > guarantees", which to me is what actually causes the issue here: > > > > Linux having timeouts when communicating to the firmware means > > that it relies on the hardware and firmware having real-time behavior > > even when not providing real-time guarantees to its processes. > > > The timeout used in SCMI is simply based on how long the Juno (?) > platform takes to reply in most cases. Just FYI, the timeouts in SCMI can be platform specific. So each platform have flexibility to choose it's own choice of timeout and need not be stuck with so called *Juno values". The architects of SCMI believe the transfers(especially DVFS) must not exceed few 100s of uS and worst case for any transfers must be few ms. My initial choice of 30ms was based on the jiffes based timer and 100Hz. Architect claim that is too much, but I thought 3 jiffies at minimum in case we start timer when we are close to the boundaries. > Talking proper code-design, the timeout (if at all) shouldn't even be > a hardcoded value, but instead taken from the platform. > > > When comparing the two usage models, it's clear that the minimum > > latency for a message delivery is always at least the time time > > to process an interrupt, plus at least one expensive MMIO read > > and one less expensive posted MMIO write for an Ack. If we > > have a doorbell plus out-of-band message, we need an extra > > DMA barrier and a read from coherent memory, both of which can > > be noticeable. As soon as messages are queued in the current > > model, the maximum latency increases by a potentially unbounded > > number of round-trips, while in the doorbell model that problem > > does not exist, so I agree that we need to handle both modes > > in the kernel deal with all existing hardware as well as firmware > > that requires low-latency communication. > > > From the test case Sudeep last shared, the scmi usage on mhu doesn't > not even hit any bottleneck ... the test "failed" because of the too > small hardcoded timeout value. Otherwise the current code actually > shows better numbers. No disagreement on the latter part. But we can't ignore the bottlenecks even if they are rare. > We need some synthetic tests to bring the limitation to the surface. I > agree that there may be such a test case, however fictitious. For that > reason, I am ok with the doorbell mode. > Thanks !
diff --git a/Documentation/devicetree/bindings/mailbox/arm-mhu.txt b/Documentation/devicetree/bindings/mailbox/arm-mhu.txt index 4971f03f0b33..ba659bcc7109 100644 --- a/Documentation/devicetree/bindings/mailbox/arm-mhu.txt +++ b/Documentation/devicetree/bindings/mailbox/arm-mhu.txt @@ -10,6 +10,15 @@ STAT register and the remote clears it after having read the data. The last channel is specified to be a 'Secure' resource, hence can't be used by Linux running NS. +The MHU drives the interrupt signal using a 32-bit register, with all +32-bits logically ORed together. It provides a set of registers to +enable software to set, clear and check the status of each of the bits +of this register independently. The use of 32 bits per interrupt line +enables software to provide more information about the source of the +interrupt. For example, each bit of the register can be associated with +a type of event that can contribute to raising the interrupt. Each of +the 32-bits can be used as "doorbell" to alert the remote processor. + Mailbox Device Node: ==================== @@ -18,13 +27,21 @@ used by Linux running NS. - compatible: Shall be "arm,mhu" & "arm,primecell" - reg: Contains the mailbox register address range (base address and length) -- #mbox-cells Shall be 1 - the index of the channel needed. +- #mbox-cells Shall be 1 - the index of the channel needed when + not used as set of doorbell bits. + Shall be 2 - the index of the channel needed, and + the index of the doorbell bit within the channel + when used in doorbell mode. - interrupts: Contains the interrupt information corresponding to - each of the 3 links of MHU. + each of the 3 physical channels of MHU namely low + priority non-secure, high priority non-secure and + secure channels. Example: -------- +1. Controller which doesn't support doorbells + mhu: mailbox@2b1f0000 { #mbox-cells = <1>; compatible = "arm,mhu", "arm,primecell"; @@ -41,3 +58,21 @@ used by Linux running NS. reg = <0 0x2e000000 0x4000>; mboxes = <&mhu 1>; /* HP-NonSecure */ }; + +2. Controller which supports doorbells + + mhu: mailbox@2b1f0000 { + #mbox-cells = <2>; + compatible = "arm,mhu", "arm,primecell"; + reg = <0 0x2b1f0000 0x1000>; + interrupts = <0 36 4>, /* LP-NonSecure */ + <0 35 4>; /* HP-NonSecure */ + clocks = <&clock 0 2 1>; + clock-names = "apb_pclk"; + }; + + mhu_client: scb@2e000000 { + compatible = "arm,scpi"; + reg = <0 0x2e000000 0x200>; + mboxes = <&mhu 1 4>; /* HP-NonSecure 5th doorbell bit */ + };