[v1,0/7] MPFS system controller/mailbox fixes

Message ID 20230111134513.2495510-1-conor.dooley@microchip.com (mailing list archive)

Message

Conor Dooley Jan. 11, 2023, 1:45 p.m. UTC
Hey Jassi, all,

Here are some fixes for the system controller on PolarFire SoC that I
ran into while implementing support for using the system controller to
re-program the FPGA. A few are just minor bits that I fixed in passing,
but the bulk of the patchset is changes to how the mailbox figures out
if a "service" has completed.

Prior to implementing this particular functionality, the services
requested from the system controller, via its mailbox interface, always
triggered an interrupt when the system controller was finished with
the service.

Unfortunately some of the services used to validate the FPGA images
before programming them do not trigger an interrupt if they fail.
For example, the service that checks whether an FPGA image is actually
a newer version than what is already programmed, does not trigger an
interrupt, unless the image is actually newer than the one currently
programmed. If it has an earlier version, no interrupt is triggered
and a status is set in the system controller's status register to
signify the reason for the failure.

In order to differentiate between the service succeeding & the system
controller being inoperative or otherwise unable to function, I had to
switch the controller to poll a busy bit in the system controller's
registers to see if it has completed a service.
This makes sense anyway, as the interrupt corresponds to "data ready"
rather than "tx done", so I have changed the mailbox controller driver
to do that & left the interrupt solely for signalling data ready.
It just so happened that all of the services that I had worked with and
tested up to this point were "infallible" & did not set a status, so the
particular code paths were never tested.
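
As an illustration of the polling approach described above, here is a
minimal user-space sketch in C. It is not the driver code. The register
model, the bit position of the busy flag and all of the names
(fake_sys_ctrl, BUSY_BIT, poll_tx_done) are assumptions made for the
example. In the kernel, the equivalent check would presumably sit in the
controller's .last_tx_done() callback with txdone_poll set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: one bit of the status word means "busy". */
#define BUSY_BIT (1u << 1)

/* Stand-in for the memory-mapped system controller status register. */
struct fake_sys_ctrl {
	uint32_t status_reg;
	unsigned int polls_until_idle;	/* number of reads that report busy */
};

/* Simulated register read: busy for a fixed number of reads, then idle. */
uint32_t fake_read_status(struct fake_sys_ctrl *sc)
{
	if (sc->polls_until_idle > 0) {
		sc->polls_until_idle--;
		return sc->status_reg | BUSY_BIT;
	}
	return sc->status_reg & ~BUSY_BIT;
}

/* "tx done" only means that the controller is no longer busy. */
bool tx_done(struct fake_sys_ctrl *sc)
{
	return !(fake_read_status(sc) & BUSY_BIT);
}

/* Poll at most max_polls times: true on completion, false on timeout. */
bool poll_tx_done(struct fake_sys_ctrl *sc, unsigned int max_polls)
{
	assert(sc != NULL);

	for (unsigned int i = 0; i < max_polls; i++) {
		if (tx_done(sc))
			return true;
	}
	return false;
}
```

With this shape, a controller that never clears the busy bit shows up as
a timeout, which is distinct from a service that completes without
raising the "data ready" interrupt.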

Jassi, the mailbox and soc patches depend on each other, as the change
in what the interrupt is used for requires changing the client driver's
behaviour too, as mbox_send_message() will now return when the system
controller is no longer busy rather than when the data is ready.
I'm happy to send the lot via the soc tree with your Ack and/or review,
if that also works for you?

Secondly, I have a question about what to do if a service does fail, but
not due to a timeout - eg the above example where the "new" image for
the FPGA is actually older than the one that currently exists.
Ideally, if a service fails due to something other than the transaction
timing out, I would go and read the status registers to see what the
cause of failure was.
I could not find a function in the mailbox framework that allows the
client to request that sort of information from the controller. Trying to
do something with the auxiliary bus, or exporting some function to a
device specific header seemed like a circumvention of the mailbox
framework.
Do you think it would be a good idea to implement something like
mbox_client_peek_status(struct mbox_chan *chan, void *data) to allow
clients to request this type of information?

It'd certainly allow me to report the actual errors to the drivers
implementing the service & make better decisions there about how to
proceed.
Perhaps I have missed some way of doing this kind of thing that (should
have been) staring me in the face!

Thanks,
Conor.

CC: Conor Dooley <conor.dooley@microchip.com>
CC: Daire McNamara <daire.mcnamara@microchip.com>
CC: Jassi Brar <jassisinghbrar@gmail.com>
CC: linux-riscv@lists.infradead.org
CC: linux-kernel@vger.kernel.org

Conor Dooley (7):
  mailbox: mpfs: fix an incorrect mask width
  mailbox: mpfs: switch to txdone_poll
  mailbox: mpfs: ditch a useless busy check
  soc: microchip: mpfs: fix some horrible alignment
  soc: microchip: mpfs: use a consistent completion timeout
  soc: microchip: mpfs: simplify error handling in
    mpfs_blocking_transaction()
  soc: microchip: mpfs: handle timeouts and failed services differently

 drivers/mailbox/mailbox-mpfs.c              | 25 +++++++----
 drivers/soc/microchip/mpfs-sys-controller.c | 48 +++++++++++++--------
 2 files changed, 47 insertions(+), 26 deletions(-)


base-commit: 88603b6dc419445847923fcb7fe5080067a30f98

Comments

Conor Dooley Jan. 18, 2023, 1:53 p.m. UTC | #1
On Wed, Jan 11, 2023 at 01:45:06PM +0000, Conor Dooley wrote:
> Hey Jassi, all,
> 
> Here are some fixes for the system controller on PolarFire SoC that I
> ran into while implementing support for using the system controller to
> re-program the FPGA. A few are just minor bits that I fixed in passing,
> but the bulk of the patchset is changes to how the mailbox figures out
> if a "service" has completed.
> 
> Prior to implementing this particular functionality, the services
> requested from the system controller, via its mailbox interface, always
> triggered an interrupt when the system controller was finished with
> the service.
> 
> Unfortunately some of the services used to validate the FPGA images
> before programming them do not trigger an interrupt if they fail.
> For example, the service that checks whether an FPGA image is actually
> a newer version than what is already programmed, does not trigger an
> interrupt, unless the image is actually newer than the one currently
> programmed. If it has an earlier version, no interrupt is triggered
> and a status is set in the system controller's status register to
> signify the reason for the failure.

I think, with how things currently are, the timeout is insufficient.
I've noticed some services taking (significantly) longer than what
I have provisioned for.

I'll try to find an upper bound and respin a v2. Questions below are
still valid either way!

Thanks,
Conor.

> In order to differentiate between the service succeeding & the system
> controller being inoperative or otherwise unable to function, I had to
> switch the controller to poll a busy bit in the system controller's
> registers to see if it has completed a service.
> This makes sense anyway, as the interrupt corresponds to "data ready"
> rather than "tx done", so I have changed the mailbox controller driver
> to do that & left the interrupt solely for signalling data ready.
> It just so happened that all of the services that I had worked with and
> tested up to this point were "infallible" & did not set a status, so the
> particular code paths were never tested.
> 
> Jassi, the mailbox and soc patches depend on each other, as the change
> in what the interrupt is used for requires changing the client driver's
> behaviour too, as mbox_send_message() will now return when the system
> controller is no longer busy rather than when the data is ready.
> I'm happy to send the lot via the soc tree with your Ack and/or review,
> if that also works for you?
> 
> Secondly, I have a question about what to do if a service does fail, but
> not due to a timeout - eg the above example where the "new" image for
> the FPGA is actually older than the one that currently exists.
> Ideally, if a service fails due to something other than the transaction
> timing out, I would go and read the status registers to see what the
> cause of failure was.
> I could not find a function in the mailbox framework that allows the
> client to request that sort of information from the controller. Trying to
> do something with the auxiliary bus, or exporting some function to a
> device specific header seemed like a circumvention of the mailbox
> framework.
> Do you think it would be a good idea to implement something like
> mbox_client_peek_status(struct mbox_chan *chan, void *data) to allow
> clients to request this type of information?
> 
> It'd certainly allow me to report the actual errors to the drivers
> implementing the service & make better decisions there about how to
> proceed.
> Perhaps I have missed some way of doing this kind of thing that (should
> have been) staring me in the face!
> 
> Thanks,
> Conor.
> 
> CC: Conor Dooley <conor.dooley@microchip.com>
> CC: Daire McNamara <daire.mcnamara@microchip.com>
> CC: Jassi Brar <jassisinghbrar@gmail.com>
> CC: linux-riscv@lists.infradead.org
> CC: linux-kernel@vger.kernel.org
> 
> Conor Dooley (7):
>   mailbox: mpfs: fix an incorrect mask width
>   mailbox: mpfs: switch to txdone_poll
>   mailbox: mpfs: ditch a useless busy check
>   soc: microchip: mpfs: fix some horrible alignment
>   soc: microchip: mpfs: use a consistent completion timeout
>   soc: microchip: mpfs: simplify error handling in
>     mpfs_blocking_transaction()
>   soc: microchip: mpfs: handle timeouts and failed services differently
> 
>  drivers/mailbox/mailbox-mpfs.c              | 25 +++++++----
>  drivers/soc/microchip/mpfs-sys-controller.c | 48 +++++++++++++--------
>  2 files changed, 47 insertions(+), 26 deletions(-)
> 
> 
> base-commit: 88603b6dc419445847923fcb7fe5080067a30f98
> -- 
> 2.39.0
>
Jassi Brar Jan. 21, 2023, 4:01 p.m. UTC | #2
On Wed, Jan 11, 2023 at 7:45 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>
> In order to differentiate between the service succeeding & the system
> controller being inoperative or otherwise unable to function, I had to
> switch the controller to poll a busy bit in the system controller's
> registers to see if it has completed a service.
> This makes sense anyway, as the interrupt corresponds to "data ready"
> rather than "tx done", so I have changed the mailbox controller driver
> to do that & left the interrupt solely for signalling data ready.
> It just so happened that all of the services that I had worked with and
> tested up to this point were "infallible" & did not set a status, so the
> particular code paths were never tested.
>
> Jassi, the mailbox and soc patches depend on each other, as the change
> in what the interrupt is used for requires changing the client driver's
> behaviour too, as mbox_send_message() will now return when the system
> controller is no longer busy rather than when the data is ready.
> I'm happy to send the lot via the soc tree with your Ack and/or review,
> if that also works for you?
>
Ok, let me review them and get back to you.

> Secondly, I have a question about what to do if a service does fail, but
> not due to a timeout - eg the above example where the "new" image for
> the FPGA is actually older than the one that currently exists.
> Ideally, if a service fails due to something other than the transaction
> timing out, I would go and read the status registers to see what the
> cause of failure was.
> I could not find a function in the mailbox framework that allows the
> client to request that sort of information from the controller. Trying to
> do something with the auxiliary bus, or exporting some function to a
> device specific header seemed like a circumvention of the mailbox
> framework.
> Do you think it would be a good idea to implement something like
> mbox_client_peek_status(struct mbox_chan *chan, void *data) to allow
> clients to request this type of information?
>
.last_tx_done() is supposed to make sure everything is ok.
If the expected status bit is "sometimes not set", that means that bit
is not the complete status. You have to check multiple registers to
detect if and what caused the failure.

Cheers.
Conor Dooley Jan. 21, 2023, 7:12 p.m. UTC | #3
On Sat, Jan 21, 2023 at 10:01:41AM -0600, Jassi Brar wrote:
> On Wed, Jan 11, 2023 at 7:45 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> >
> > In order to differentiate between the service succeeding & the system
> > controller being inoperative or otherwise unable to function, I had to
> > switch the controller to poll a busy bit in the system controller's
> > registers to see if it has completed a service.
> > This makes sense anyway, as the interrupt corresponds to "data ready"
> > rather than "tx done", so I have changed the mailbox controller driver
> > to do that & left the interrupt solely for signalling data ready.
> > It just so happened that all of the services that I had worked with and
> > tested up to this point were "infallible" & did not set a status, so the
> > particular code paths were never tested.
> >
> > Jassi, the mailbox and soc patches depend on each other, as the change
> > in what the interrupt is used for requires changing the client driver's
> > behaviour too, as mbox_send_message() will now return when the system
> > controller is no longer busy rather than when the data is ready.
> > I'm happy to send the lot via the soc tree with your Ack and/or review,
> > if that also works for you?
> >
> Ok, let me review them and get back to you.

FYI, I did send a v2 on Friday:
https://lore.kernel.org/all/20230120143734.3438755-1-conor.dooley@microchip.com/

The change is just a timeout duration though.

> > Secondly, I have a question about what to do if a service does fail, but
> > not due to a timeout - eg the above example where the "new" image for
> > the FPGA is actually older than the one that currently exists.
> > Ideally, if a service fails due to something other than the transaction
> > timing out, I would go and read the status registers to see what the
> > cause of failure was.
> > I could not find a function in the mailbox framework that allows the
> > client to request that sort of information from the controller. Trying to
> > do something with the auxiliary bus, or exporting some function to a
> > device specific header seemed like a circumvention of the mailbox
> > framework.
> > Do you think it would be a good idea to implement something like
> > mbox_client_peek_status(struct mbox_chan *chan, void *data) to allow
> > clients to request this type of information?
> >
> .last_tx_done() is supposed to make sure everything is ok.

Hm, might've explained badly as I think you've misunderstood. Or (see
below) I might have mistakenly thought that last_tx_done() was only meant
to signify that tx was done.

Anyways, I'll try to clarify.
Some services don't set a status, but whether a status is, or isn't,
set has nothing to do with whether the service has completed.
One service that sets a status is "Authenticate Bitstream". This
service sets a status of 0x0 if the bitstream in question is okay _and_
something that the FPGA can be upgraded to. It returns a failure of 0x18
if the bitstream is valid _but_ is the same as that currently programmed.
(and of course a whole host of other possible errors in-between)

These statuses, and whether they are a bad outcome or not, are dependent
on the service and I don't think should be handled in the mailbox
controller driver.
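
To make the point about service-specific statuses concrete, here is a
small C sketch of how a client might interpret the "Authenticate
Bitstream" result. Only the two values 0x0 and 0x18 come from the
description above. The macro names, the enum, and the decision to treat
0x18 as "nothing to do" rather than a hard error are assumptions for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Status values quoted above; the names are hypothetical. */
#define AUTH_STATUS_OK			0x00u
#define AUTH_STATUS_SAME_AS_PROGRAMMED	0x18u

/* What the client decides to do; this policy is illustrative only. */
enum auth_outcome {
	AUTH_PROCEED,		/* valid and something the FPGA can be upgraded to */
	AUTH_NOTHING_TO_DO,	/* valid, but identical to what is programmed */
	AUTH_REJECTED,		/* any other status */
};

/* Map a raw status to an outcome for this one service. */
enum auth_outcome interpret_auth_status(uint8_t status)
{
	switch (status) {
	case AUTH_STATUS_OK:
		return AUTH_PROCEED;
	case AUTH_STATUS_SAME_AS_PROGRAMMED:
		return AUTH_NOTHING_TO_DO;
	default:
		return AUTH_REJECTED;
	}
}
```

Another service could map the same numeric value to a different outcome,
which is why this decoding arguably belongs in the client rather than in
the mailbox controller driver.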

> If the expected status bit is "sometimes not set", that means that bit
> is not the complete status.

If the "busy" bit goes low, then the transmission must be complete,
there should be no need to check other bits for *completion*, but...

> You have to check multiple registers to
> detect if and what caused the failure.

...maybe I have just misunderstood the role of .last_tx_done(). The
comment in mailbox-controller.h led me to believe that it was used just
to check if it had been completed.

Am I allowed to use .last_tx_done() to pass information back to the
mailbox client? If I could, that'd certainly be a nice way to get the
information on whether the service failed etc.

Hopefully that, plus when you have a chance to look at the code, will
make what I am asking about a little clearer!

Thanks,
Conor.
Conor Dooley Feb. 16, 2023, 10:24 p.m. UTC | #4
Hey Jassi,

On Sat, Jan 21, 2023 at 07:12:52PM +0000, Conor Dooley wrote:
> On Sat, Jan 21, 2023 at 10:01:41AM -0600, Jassi Brar wrote:
> > On Wed, Jan 11, 2023 at 7:45 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > >
> > > In order to differentiate between the service succeeding & the system
> > > controller being inoperative or otherwise unable to function, I had to
> > > switch the controller to poll a busy bit in the system controller's
> > > registers to see if it has completed a service.
> > > This makes sense anyway, as the interrupt corresponds to "data ready"
> > > rather than "tx done", so I have changed the mailbox controller driver
> > > to do that & left the interrupt solely for signalling data ready.
> > > It just so happened that all of the services that I had worked with and
> > > tested up to this point were "infallible" & did not set a status, so the
> > > particular code paths were never tested.
> > >
> > > Jassi, the mailbox and soc patches depend on each other, as the change
> > > in what the interrupt is used for requires changing the client driver's
> > > behaviour too, as mbox_send_message() will now return when the system
> > > controller is no longer busy rather than when the data is ready.
> > > I'm happy to send the lot via the soc tree with your Ack and/or review,
> > > if that also works for you?
> > >
> > Ok, let me review them and get back to you.
> 
> FYI, I did send a v2 on Friday:
> https://lore.kernel.org/all/20230120143734.3438755-1-conor.dooley@microchip.com/
> 
> The change is just a timeout duration though.
> 
> > > Secondly, I have a question about what to do if a service does fail, but
> > > not due to a timeout - eg the above example where the "new" image for
> > > the FPGA is actually older than the one that currently exists.
> > > Ideally, if a service fails due to something other than the transaction
> > > timing out, I would go and read the status registers to see what the
> > > cause of failure was.
> > > I could not find a function in the mailbox framework that allows the
> > > client to request that sort of information from the controller. Trying to
> > > do something with the auxiliary bus, or exporting some function to a
> > > device specific header seemed like a circumvention of the mailbox
> > > framework.
> > > Do you think it would be a good idea to implement something like
> > > mbox_client_peek_status(struct mbox_chan *chan, void *data) to allow
> > > clients to request this type of information?
> > >
> > .last_tx_done() is supposed to make sure everything is ok.
> 
> Hm, might've explained badly as I think you've misunderstood. Or (see
> below) I might have mistakenly thought that last_tx_done() was only meant
> to signify that tx was done.
> 
> Anyways, I'll try to clarify.
> Some services don't set a status, but whether a status is, or isn't,
> set has nothing to do with whether the service has completed.
> One service that sets a status is "Authenticate Bitstream". This
> service sets a status of 0x0 if the bitstream in question is okay _and_
> something that the FPGA can be upgraded to. It returns a failure of 0x18
> if the bitstream is valid _but_ is the same as that currently programmed.
> (and of course a whole host of other possible errors in-between)
> 
> These statuses, and whether they are a bad outcome or not, are dependent
> on the service and I don't think should be handled in the mailbox
> controller driver.
> 
> > If the expected status bit is "sometimes not set", that means that bit
> > is not the complete status.
> 
> If the "busy" bit goes low, then the transmission must be complete,
> there should be no need to check other bits for *completion*, but...
> 
> > You have to check multiple registers to
> > detect if and what caused the failure.
> 
> ...maybe I have just misunderstood the role of .last_tx_done(). The
> comment in mailbox-controller.h led me to believe that it was used just
> to check if it had been completed.
> 
> Am I allowed to use .last_tx_done() to pass information back to the
> mailbox client? If I could, that'd certainly be a nice way to get the
> information on whether the service failed etc.
> 
> Hopefully that, plus when you have a chance to look at the code, will
> make what I am asking about a little clearer!

Just wondering if you've had a chance to look at this again! I know it's
missed the merge window this time around but I would like to get this
behaviour fixed as other work depends on it.

Thanks,
Conor.
Jassi Brar Feb. 17, 2023, 4:04 a.m. UTC | #5
On Thu, Feb 16, 2023 at 4:24 PM Conor Dooley <conor@kernel.org> wrote:
> >
> > > > Secondly, I have a question about what to do if a service does fail, but
> > > > not due to a timeout - eg the above example where the "new" image for
> > > > the FPGA is actually older than the one that currently exists.
> > > > Ideally, if a service fails due to something other than the transaction
> > > > timing out, I would go and read the status registers to see what the
> > > > cause of failure was.
> > > > I could not find a function in the mailbox framework that allows the
> > > > client to request that sort of information from the controller. Trying to
> > > > do something with the auxiliary bus, or exporting some function to a
> > > > device specific header seemed like a circumvention of the mailbox
> > > > framework.
> > > > Do you think it would be a good idea to implement something like
> > > > mbox_client_peek_status(struct mbox_chan *chan, void *data) to allow
> > > > clients to request this type of information?
> > > >
> > > .last_tx_done() is supposed to make sure everything is ok.
> >
> > Hm, might've explained badly as I think you've misunderstood. Or (see
> > below) I might have mistakenly thought that last_tx_done() was only meant
> > to signify that tx was done.
> >
> > Anyways, I'll try to clarify.
> > Some services don't set a status, but whether a status is, or isn't,
> > set has nothing to do with whether the service has completed.
> > One service that sets a status is "Authenticate Bitstream". This
> > service sets a status of 0x0 if the bitstream in question is okay _and_
> > something that the FPGA can be upgraded to. It returns a failure of 0x18
> > if the bitstream is valid _but_ is the same as that currently programmed.
> > (and of course a whole host of other possible errors in-between)
> >
> > These statuses, and whether they are a bad outcome or not, are dependent
> > on the service and I don't think should be handled in the mailbox
> > controller driver.
> >
> > > If the expected status bit is "sometimes not set", that means that bit
> > > is not the complete status.
> >
> > If the "busy" bit goes low, then the transmission must be complete,
> > there should be no need to check other bits for *completion*, but...
> >
> > > You have to check multiple registers to
> > > detect if and what caused the failure.
> >
> > ...maybe I have just misunderstood the role of .last_tx_done(). The
> > comment in mailbox-controller.h led me to believe that it was used just
> > to check if it had been completed.
> >
> > Am I allowed to use .last_tx_done() to pass information back to the
> > mailbox client? If I could, that'd certainly be a nice way to get the
> > information on whether the service failed etc.
> >
> > Hopefully that, plus when you have a chance to look at the code, will
> > make what I am asking about a little clearer!
>
> Just wondering if you've had a chance to look at this again! I know it's
> missed the merge window this time around but I would like to get this
> behaviour fixed as other work depends on it.
>
My opinion about adding a new api just to accommodate remote f/w's
behaviour change across updates is still no.
last_tx_done() is more abstract than you think -- it has to play with
dozens of behaviors of remotes. So you may just wrap whatever your logic
of "tx is done" is in that.

This query within the patchset threw me off -- I thought you needed
the new api for the patchset, so I didn't look further.
Looking at it now, I am ok with applying Patches 1, 2 and 3, if you want.

cheers.
Conor Dooley Feb. 17, 2023, 7:34 a.m. UTC | #6
On Thu, Feb 16, 2023 at 10:04:17PM -0600, Jassi Brar wrote:
> On Thu, Feb 16, 2023 at 4:24 PM Conor Dooley <conor@kernel.org> wrote:
> > >
> > > > > Secondly, I have a question about what to do if a service does fail, but
> > > > > not due to a timeout - eg the above example where the "new" image for
> > > > > the FPGA is actually older than the one that currently exists.
> > > > > Ideally, if a service fails due to something other than the transaction
> > > > > timing out, I would go and read the status registers to see what the
> > > > > cause of failure was.
> > > > > I could not find a function in the mailbox framework that allows the
> > > > > client to request that sort of information from the controller. Trying to
> > > > > do something with the auxiliary bus, or exporting some function to a
> > > > > device specific header seemed like a circumvention of the mailbox
> > > > > framework.
> > > > > Do you think it would be a good idea to implement something like
> > > > > mbox_client_peek_status(struct mbox_chan *chan, void *data) to allow
> > > > > clients to request this type of information?
> > > > >
> > > > .last_tx_done() is supposed to make sure everything is ok.
> > >
> > > Hm, might've explained badly as I think you've misunderstood. Or (see
> > > below) I might have mistakenly thought that last_tx_done() was only meant
> > > to signify that tx was done.
> > >
> > > Anyways, I'll try to clarify.
> > > Some services don't set a status, but whether a status is, or isn't,
> > > set has nothing to do with whether the service has completed.
> > > One service that sets a status is "Authenticate Bitstream". This
> > > service sets a status of 0x0 if the bitstream in question is okay _and_
> > > something that the FPGA can be upgraded to. It returns a failure of 0x18
> > > if the bitstream is valid _but_ is the same as that currently programmed.
> > > (and of course a whole host of other possible errors in-between)
> > >
> > > These statuses, and whether they are a bad outcome or not, are dependent
> > > on the service and I don't think should be handled in the mailbox
> > > controller driver.
> > >
> > > > If the expected status bit is "sometimes not set", that means that bit
> > > > is not the complete status.
> > >
> > > If the "busy" bit goes low, then the transmission must be complete,
> > > there should be no need to check other bits for *completion*, but...
> > >
> > > > You have to check multiple registers to
> > > > detect if and what caused the failure.
> > >
> > > ...maybe I have just misunderstood the role of .last_tx_done(). The
> > > comment in mailbox-controller.h led me to believe that it was used just
> > > to check if it had been completed.
> > >
> > > Am I allowed to use .last_tx_done() to pass information back to the
> > > mailbox client? If I could, that'd certainly be a nice way to get the
> > > information on whether the service failed etc.
> > >
> > > Hopefully that, plus when you have a chance to look at the code, will
> > > make what I am asking about a little clearer!
> >
> > Just wondering if you've had a chance to look at this again! I know it's
> > missed the merge window this time around but I would like to get this
> > behaviour fixed as other work depends on it.
> >
> My opinion about adding a new api just to accommodate remote f/w's
> behaviour change across updates is still no.
> last_tx_done() is more abstract than you think -- it has to play with
> dozens of behaviors of remotes. So you may just wrap whatever your logic
> of "tx is done" is in that.

Okay, that's fine. I'd rather not add a new API, so adding that to
last_tx_done() is fine by me! I just wasn't sure from your earlier reply
if that was okay.

> This query within the patchset threw me off -- I thought you needed
> the new api for the patchset, so I didn't look further.

Ahh, sorry. The query was about improving things to report an accurate
status rather than only timeouts. What's here *works* but is
sub-optimal.

> Looking at it now, I am ok with applying Patches 1, 2 and 3, if you want.

Ehh, the whole thing needs to go together to avoid breaking behaviour,
so, given the merge window is next week, I'd rather rework it to report
the actual status from last_tx_done().
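
One possible shape for that rework, as a user-space C model rather than
kernel code: the completion check latches the status once the controller
goes idle, and the blocking transaction then tells a timeout apart from a
failed service. The thread does not say how the status would actually be
handed back to the client, so the per-channel last_status field and every
name below (fake_chan, model_last_tx_done, model_blocking_transaction)
are assumptions. The real .last_tx_done() callback only returns a bool.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-channel state standing in for the hardware. */
struct fake_chan {
	unsigned int busy_reads_left;	/* polls that still report busy */
	uint16_t hw_status;		/* status presented once idle */
	uint16_t last_status;		/* latched by the completion check */
};

/* Model of the completion check: done once idle, latching the status. */
bool model_last_tx_done(struct fake_chan *ch)
{
	if (ch->busy_reads_left > 0) {
		ch->busy_reads_left--;
		return false;
	}
	ch->last_status = ch->hw_status;
	return true;
}

/*
 * Returns 0 on success, -ETIMEDOUT if the controller never went idle,
 * and -EIO if the service completed with a non-zero status. The raw
 * status is written to *status_out only when the service completed.
 */
int model_blocking_transaction(struct fake_chan *ch, unsigned int max_polls,
			       uint16_t *status_out)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (!model_last_tx_done(ch))
			continue;
		*status_out = ch->last_status;
		return ch->last_status ? -EIO : 0;
	}
	return -ETIMEDOUT;
}
```

The caller can then react to -ETIMEDOUT (controller unresponsive)
differently from -EIO (service ran and reported a failure), and inspect
the raw status in the latter case.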

Thanks!