Message ID | 20201214162937.2.Ibade998ed587e070388b4bf58801f1107a40eb53@changeid (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case | expand |
Quoting Douglas Anderson (2020-12-14 16:30:19) > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > index 6f736e94e9f4..5ef2e9f38ac9 100644 > --- a/drivers/spi/spi-geni-qcom.c > +++ b/drivers/spi/spi-geni-qcom.c > @@ -145,12 +145,49 @@ static void handle_fifo_timeout(struct spi_master *spi, > dev_err(mas->dev, "Failed to cancel/abort m_cmd\n"); > } > > +static int spi_geni_check_busy(struct spi_geni_master *mas) Maybe spi_geni_is_busy() and return bool? > +{ > + struct geni_se *se = &mas->se; > + u32 m_irq, m_irq_en; > + > + /* > + * We grab the spinlock so that if we raced really fast and the IRQ > + * handler is still actually running we'll wait for it to exit. This > + * can happen because the IRQ handler may signal in the middle of the > + * function and the next transfer can kick off right away. > + * > + * Once we have the spinlock, if we're starting a new transfer we > + * expect nothing is pending. We check this to handle the case where > + * the previous transfer timed out and then handle_fifo_timeout() timed > + * out. This can happen if the interrupt handler was blocked for > + * a long time and we don't want to start any new transfers until it's > + * all done. > + * > + * We are OK releasing the spinlock after we're done here since (if > + * we're returning 0 and going ahead with the transfer) we know that > + * the SPI controller must be in a quiet state. > + */ > + spin_lock_irq(&mas->lock); > + m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); > + m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN); > + spin_unlock_irq(&mas->lock); > + > + if (m_irq & m_irq_en) { Is this really "busy" though? If we canceled something out then maybe the irq has fired but what if it's to tell us that we have some available space in the TX fifo? Does that really matter? It seems like if we have an RX irq when we're starting a transfer that might be bad too but we could forcibly clear that by acking it here and then setting the fifo word count that we're expecting for rx? Put another way, why isn't this driver looking at the TX and RX fifo status registers more than in one place? > + dev_err(mas->dev, "Busy, IRQs pending %#010x\n", > + m_irq & m_irq_en); > + return -EBUSY; > + } > + > + return 0; > +} > + > static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) > { > struct spi_geni_master *mas = spi_master_get_devdata(slv->master); > struct spi_master *spi = dev_get_drvdata(mas->dev); > struct geni_se *se = &mas->se; > unsigned long time_left; > + int ret; > > if (!(slv->mode & SPI_CS_HIGH)) > set_flag = !set_flag; > @@ -158,6 +195,12 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) > if (set_flag == mas->cs_flag) > return; > > + ret = spi_geni_check_busy(mas); > + if (ret) { if (spi_geni_is_busy()) > + dev_err(mas->dev, "Can't set chip select\n"); > + return; > + } > + > mas->cs_flag = set_flag; > > pm_runtime_get_sync(mas->dev); > @@ -277,8 +320,12 @@ static int setup_fifo_params(struct spi_device *spi_slv, > static int spi_geni_prepare_message(struct spi_master *spi, > struct spi_message *spi_msg) > { > - int ret; > struct spi_geni_master *mas = spi_master_get_devdata(spi); > + int ret; > + > + ret = spi_geni_check_busy(mas); > + if (ret) if (spi_geni_is_busy()) return -EBUSY; > + return ret; > > ret = setup_fifo_params(spi_msg->spi, spi); > if (ret) > @@ -440,21 +487,6 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, > struct geni_se *se = &mas->se; > int ret; > > - /* > - * Ensure that our interrupt handler isn't still running from some > - * prior command before we start messing with the hardware behind > - * its back. We don't need to _keep_ the lock here since we're only > - * worried about racing with out interrupt handler. The SPI core > - * already handles making sure that we're not trying to do two > - * transfers at once or setting a chip select and doing a transfer > - * concurrently. > - * > - * NOTE: we actually _can't_ hold the lock here because possibly we > - * might call clk_set_rate() which needs to be able to sleep. > - */ > - spin_lock_irq(&mas->lock); > - spin_unlock_irq(&mas->lock); > - > if (xfer->bits_per_word != mas->cur_bits_per_word) { > spi_setup_word_len(mas, mode, xfer->bits_per_word); > mas->cur_bits_per_word = xfer->bits_per_word; > @@ -511,6 +543,11 @@ static int spi_geni_transfer_one(struct spi_master *spi, > struct spi_transfer *xfer) > { > struct spi_geni_master *mas = spi_master_get_devdata(spi); > + int ret; > + > + ret = spi_geni_check_busy(mas); > + if (ret) > + return ret; if (spi_geni_is_busy()) return -EBUSY; > > /* Terminate and return success for 0 byte length transfer */ > if (!xfer->len)
Hi, On Mon, Dec 14, 2020 at 6:57 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Douglas Anderson (2020-12-14 16:30:19) > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > > index 6f736e94e9f4..5ef2e9f38ac9 100644 > > --- a/drivers/spi/spi-geni-qcom.c > > +++ b/drivers/spi/spi-geni-qcom.c > > @@ -145,12 +145,49 @@ static void handle_fifo_timeout(struct spi_master *spi, > > dev_err(mas->dev, "Failed to cancel/abort m_cmd\n"); > > } > > > > +static int spi_geni_check_busy(struct spi_geni_master *mas) > > Maybe spi_geni_is_busy() and return bool? Yeah, that's cleaner, thanks. > > + spin_lock_irq(&mas->lock); > > + m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); > > + m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN); > > + spin_unlock_irq(&mas->lock); > > + > > + if (m_irq & m_irq_en) { > > Is this really "busy" though? If we canceled something out then maybe > the irq has fired but what if it's to tell us that we have some > available space in the TX fifo? Does that really matter? It seems like > if we have an RX irq when we're starting a transfer that might be bad > too but we could forcibly clear that by acking it here and then setting > the fifo word count that we're expecting for rx? > > Put another way, why isn't this driver looking at the TX and RX fifo > status registers more than in one place? I'm not sure I understand all your concerns. Can you clarify? In case it helps, I'll add a few thoughts here: 1. SPI is a controller clocked protocol and this is the driver for the controller. There is no way to get a RX IRQ unless we initiate it. 2. The code always takes care to make sure that when we're done with a transfer that we disable the TX watermark. This means we won't get any more interrupts. The only time an interrupt could still be pending when we start a new transfer is: a) If the interrupt handler is still running on another CPU. In that case it will have the spinlock and won't release it until it clears the interrupts. b) If we had a timeout on the previous transfer and then got timeouts sending the cancel and abort. In general when we're starting a new transfer we assume that we can program the hardware willy-nilly. If there's some chance something else is happening (or our interrupt could go off) then it breaks that whole model. I've addressed all your other concerns and I'm ready to send out v2, but I'll hold off until you confirm that the above explanation satisfies your questions. If you can think of any extra comments somewhere that would help document that better I'm happy to add them into the commit or commit message. -Doug
Quoting Doug Anderson (2020-12-15 09:25:51) > On Mon, Dec 14, 2020 at 6:57 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > Quoting Douglas Anderson (2020-12-14 16:30:19) > > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > > > index 6f736e94e9f4..5ef2e9f38ac9 100644 > > > --- a/drivers/spi/spi-geni-qcom.c > > > +++ b/drivers/spi/spi-geni-qcom.c > > > > + spin_lock_irq(&mas->lock); > > > + m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); > > > + m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN); > > > + spin_unlock_irq(&mas->lock); > > > + > > > + if (m_irq & m_irq_en) { > > > > Is this really "busy" though? If we canceled something out then maybe > > the irq has fired but what if it's to tell us that we have some > > available space in the TX fifo? Does that really matter? It seems like > > if we have an RX irq when we're starting a transfer that might be bad > > too but we could forcibly clear that by acking it here and then setting > > the fifo word count that we're expecting for rx? > > > > Put another way, why isn't this driver looking at the TX and RX fifo > > status registers more than in one place? > > I'm not sure I understand all your concerns. Can you clarify? In > case it helps, I'll add a few thoughts here: > > 1. SPI is a controller clocked protocol and this is the driver for the > controller. There is no way to get a RX IRQ unless we initiate it. > > 2. The code always takes care to make sure that when we're done with a > transfer that we disable the TX watermark. This means we won't get > any more interrupts. > > The only time an interrupt could still be pending when we start a new > transfer is: > > a) If the interrupt handler is still running on another CPU. In that > case it will have the spinlock and won't release it until it clears > the interrupts. > > b) If we had a timeout on the previous transfer and then got timeouts > sending the cancel and abort. > > In general when we're starting a new transfer we assume that we can > program the hardware willy-nilly. If there's some chance something > else is happening (or our interrupt could go off) then it breaks that > whole model. Right. I thought this patch was making sure that the hardware wasn't in the process of doing something else when we setup the transfer. I'm saying that only checking the irq misses the fact that maybe the transfer hasn't completed yet or a pending irq hasn't come in yet, but the fifo status would tell us that the fifo is transferring something or receiving something. If an RX can't happen, then the code should clearly show that an RX irq isn't expected, and mask out that bit so it is ignored or explicitly check for it and call WARN_ON() if the bit is set. I'm wondering why we don't check the FIFO status and the irq bits to make sure that some previous cancelled operation isn't still pending either in the FIFO or as an irq. While this patch will fix the scenario where the irq is delayed but pending in the hardware it won't cover the case that the hardware itself is wedged, for example because the sequencer just decided to stop working entirely.
Hi, On Tue, Dec 15, 2020 at 2:25 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2020-12-15 09:25:51) > > On Mon, Dec 14, 2020 at 6:57 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > > > Quoting Douglas Anderson (2020-12-14 16:30:19) > > > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > > > > index 6f736e94e9f4..5ef2e9f38ac9 100644 > > > > --- a/drivers/spi/spi-geni-qcom.c > > > > +++ b/drivers/spi/spi-geni-qcom.c > > > > > > + spin_lock_irq(&mas->lock); > > > > + m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); > > > > + m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN); > > > > + spin_unlock_irq(&mas->lock); > > > > + > > > > + if (m_irq & m_irq_en) { > > > > > > Is this really "busy" though? If we canceled something out then maybe > > > the irq has fired but what if it's to tell us that we have some > > > available space in the TX fifo? Does that really matter? It seems like > > > if we have an RX irq when we're starting a transfer that might be bad > > > too but we could forcibly clear that by acking it here and then setting > > > the fifo word count that we're expecting for rx? > > > > > > Put another way, why isn't this driver looking at the TX and RX fifo > > > status registers more than in one place? > > > > I'm not sure I understand all your concerns. Can you clarify? In > > case it helps, I'll add a few thoughts here: > > > > 1. SPI is a controller clocked protocol and this is the driver for the > > controller. There is no way to get a RX IRQ unless we initiate it. > > > > 2. The code always takes care to make sure that when we're done with a > > transfer that we disable the TX watermark. This means we won't get > > any more interrupts. > > > > The only time an interrupt could still be pending when we start a new > > transfer is: > > > > a) If the interrupt handler is still running on another CPU. In that > > case it will have the spinlock and won't release it until it clears > > the interrupts. > > > > b) If we had a timeout on the previous transfer and then got timeouts > > sending the cancel and abort. > > > > In general when we're starting a new transfer we assume that we can > > program the hardware willy-nilly. If there's some chance something > > else is happening (or our interrupt could go off) then it breaks that > > whole model. > > Right. I thought this patch was making sure that the hardware wasn't in > the process of doing something else when we setup the transfer. I'm > saying that only checking the irq misses the fact that maybe the > transfer hasn't completed yet or a pending irq hasn't come in yet, but > the fifo status would tell us that the fifo is transferring something or > receiving something. If an RX can't happen, then the code should clearly > show that an RX irq isn't expected, and mask out that bit so it is > ignored or explicitly check for it and call WARN_ON() if the bit is set. > > I'm wondering why we don't check the FIFO status and the irq bits to > make sure that some previous cancelled operation isn't still pending > either in the FIFO or as an irq. While this patch will fix the scenario > where the irq is delayed but pending in the hardware it won't cover the > case that the hardware itself is wedged, for example because the > sequencer just decided to stop working entirely. It also won't catch the case where the SoC decided that all GPIOs are inverted and starts reporting highs for lows and lows for highs, nor does it handle the case where the CPU suddenly switches to Big Endian mode for no reason. :-P ...by that, I mean I'm not trying to catch the case where the hardware itself is behaving in a totally unexpected way. I have seen no instances where the hardware wedges nor where the sequencer stops working and until I see them happen I'm not inclined to add code for them. Without seeing them actually happen I'm not really sure what the right way to recover would be. We've already tried "cancel" and "abort" and then waited at least 1 second. If you know of some sort of magic "unwedge" then we should add it into handle_fifo_timeout(). However, super delayed interrupts due to software not servicing the interrupt in time is something that really happens, if rarely. Adding code to account for that seems worth it and is easy to test... Did I convince you? ;-) -Doug
Quoting Doug Anderson (2020-12-15 15:34:59) > On Tue, Dec 15, 2020 at 2:25 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > Quoting Doug Anderson (2020-12-15 09:25:51) > > > In general when we're starting a new transfer we assume that we can > > > program the hardware willy-nilly. If there's some chance something > > > else is happening (or our interrupt could go off) then it breaks that > > > whole model. > > > > Right. I thought this patch was making sure that the hardware wasn't in > > the process of doing something else when we setup the transfer. I'm > > saying that only checking the irq misses the fact that maybe the > > transfer hasn't completed yet or a pending irq hasn't come in yet, but > > the fifo status would tell us that the fifo is transferring something or > > receiving something. If an RX can't happen, then the code should clearly > > show that an RX irq isn't expected, and mask out that bit so it is > > ignored or explicitly check for it and call WARN_ON() if the bit is set. > > > > I'm wondering why we don't check the FIFO status and the irq bits to > > make sure that some previous cancelled operation isn't still pending > > either in the FIFO or as an irq. While this patch will fix the scenario > > where the irq is delayed but pending in the hardware it won't cover the > > case that the hardware itself is wedged, for example because the > > sequencer just decided to stop working entirely. > > It also won't catch the case where the SoC decided that all GPIOs are > inverted and starts reporting highs for lows and lows for highs, nor > does it handle the case where the CPU suddenly switches to Big Endian > mode for no reason. :-P > > ...by that, I mean I'm not trying to catch the case where the hardware > itself is behaving in a totally unexpected way. I have seen no > instances where the hardware wedges nor where the sequencer stops > working and until I see them happen I'm not inclined to add code for > them. Without seeing them actually happen I'm not really sure what > the right way to recover would be. We've already tried "cancel" and > "abort" and then waited at least 1 second. If you know of some sort > of magic "unwedge" then we should add it into handle_fifo_timeout(). I am not aware of an "unwedge" command. Presumably the cancel/abort stuff makes the FIFO state "sane" so there's nothing to see in the FIFO status registers. I wonder if we should keep around some "did we cancel last time?" flag and only check the isr if we canceled out and timed out to boot? That would be a cheap and easy check to make sure that we don't check this each transaction. > > However, super delayed interrupts due to software not servicing the > interrupt in time is something that really happens, if rarely. Adding > code to account for that seems worth it and is easy to test... > Agreed. The function name is wrong then as the device is not "busy". So maybe spi_geni_isr_pending()? That would clearly describe what's being checked.
Hi, On Tue, Dec 15, 2020 at 5:18 PM Stephen Boyd <swboyd@chromium.org> wrote: > > Quoting Doug Anderson (2020-12-15 15:34:59) > > On Tue, Dec 15, 2020 at 2:25 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > > > Quoting Doug Anderson (2020-12-15 09:25:51) > > > > In general when we're starting a new transfer we assume that we can > > > > program the hardware willy-nilly. If there's some chance something > > > > else is happening (or our interrupt could go off) then it breaks that > > > > whole model. > > > > > > Right. I thought this patch was making sure that the hardware wasn't in > > > the process of doing something else when we setup the transfer. I'm > > > saying that only checking the irq misses the fact that maybe the > > > transfer hasn't completed yet or a pending irq hasn't come in yet, but > > > the fifo status would tell us that the fifo is transferring something or > > > receiving something. If an RX can't happen, then the code should clearly > > > show that an RX irq isn't expected, and mask out that bit so it is > > > ignored or explicitly check for it and call WARN_ON() if the bit is set. > > > > > > I'm wondering why we don't check the FIFO status and the irq bits to > > > make sure that some previous cancelled operation isn't still pending > > > either in the FIFO or as an irq. While this patch will fix the scenario > > > where the irq is delayed but pending in the hardware it won't cover the > > > case that the hardware itself is wedged, for example because the > > > sequencer just decided to stop working entirely. > > > > It also won't catch the case where the SoC decided that all GPIOs are > > inverted and starts reporting highs for lows and lows for highs, nor > > does it handle the case where the CPU suddenly switches to Big Endian > > mode for no reason. :-P > > > > ...by that, I mean I'm not trying to catch the case where the hardware > > itself is behaving in a totally unexpected way. I have seen no > > instances where the hardware wedges nor where the sequencer stops > > working and until I see them happen I'm not inclined to add code for > > them. Without seeing them actually happen I'm not really sure what > > the right way to recover would be. We've already tried "cancel" and > > "abort" and then waited at least 1 second. If you know of some sort > > of magic "unwedge" then we should add it into handle_fifo_timeout(). > > I am not aware of an "unwedge" command. Presumably the cancel/abort > stuff makes the FIFO state "sane" so there's nothing to see in the FIFO > status registers. I wonder if we should keep around some "did we cancel > last time?" flag and only check the isr if we canceled out and timed > out to boot? That would be a cheap and easy check to make sure that we > don't check this each transaction. Sure. I guess technically it would be a "did we fail to cancel last time". > > However, super delayed interrupts due to software not servicing the > > interrupt in time is something that really happens, if rarely. Adding > > code to account for that seems worth it and is easy to test... > > > > Agreed. The function name is wrong then as the device is not "busy". So > maybe spi_geni_isr_pending()? That would clearly describe what's being > checked. I changed this to just be about the abort. See if v2 looks better to you.
diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c index 6f736e94e9f4..5ef2e9f38ac9 100644 --- a/drivers/spi/spi-geni-qcom.c +++ b/drivers/spi/spi-geni-qcom.c @@ -145,12 +145,49 @@ static void handle_fifo_timeout(struct spi_master *spi, dev_err(mas->dev, "Failed to cancel/abort m_cmd\n"); } +static int spi_geni_check_busy(struct spi_geni_master *mas) +{ + struct geni_se *se = &mas->se; + u32 m_irq, m_irq_en; + + /* + * We grab the spinlock so that if we raced really fast and the IRQ + * handler is still actually running we'll wait for it to exit. This + * can happen because the IRQ handler may signal in the middle of the + * function and the next transfer can kick off right away. + * + * Once we have the spinlock, if we're starting a new transfer we + * expect nothing is pending. We check this to handle the case where + * the previous transfer timed out and then handle_fifo_timeout() timed + * out. This can happen if the interrupt handler was blocked for + * a long time and we don't want to start any new transfers until it's + * all done. + * + * We are OK releasing the spinlock after we're done here since (if + * we're returning 0 and going ahead with the transfer) we know that + * the SPI controller must be in a quiet state. + */ + spin_lock_irq(&mas->lock); + m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); + m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN); + spin_unlock_irq(&mas->lock); + + if (m_irq & m_irq_en) { + dev_err(mas->dev, "Busy, IRQs pending %#010x\n", + m_irq & m_irq_en); + return -EBUSY; + } + + return 0; +} + static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) { struct spi_geni_master *mas = spi_master_get_devdata(slv->master); struct spi_master *spi = dev_get_drvdata(mas->dev); struct geni_se *se = &mas->se; unsigned long time_left; + int ret; if (!(slv->mode & SPI_CS_HIGH)) set_flag = !set_flag; @@ -158,6 +195,12 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) if (set_flag == mas->cs_flag) return; + ret = spi_geni_check_busy(mas); + if (ret) { + dev_err(mas->dev, "Can't set chip select\n"); + return; + } + mas->cs_flag = set_flag; pm_runtime_get_sync(mas->dev); @@ -277,8 +320,12 @@ static int setup_fifo_params(struct spi_device *spi_slv, static int spi_geni_prepare_message(struct spi_master *spi, struct spi_message *spi_msg) { - int ret; struct spi_geni_master *mas = spi_master_get_devdata(spi); + int ret; + + ret = spi_geni_check_busy(mas); + if (ret) + return ret; ret = setup_fifo_params(spi_msg->spi, spi); if (ret) @@ -440,21 +487,6 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, struct geni_se *se = &mas->se; int ret; - /* - * Ensure that our interrupt handler isn't still running from some - * prior command before we start messing with the hardware behind - * its back. We don't need to _keep_ the lock here since we're only - * worried about racing with out interrupt handler. The SPI core - * already handles making sure that we're not trying to do two - * transfers at once or setting a chip select and doing a transfer - * concurrently. - * - * NOTE: we actually _can't_ hold the lock here because possibly we - * might call clk_set_rate() which needs to be able to sleep. - */ - spin_lock_irq(&mas->lock); - spin_unlock_irq(&mas->lock); - if (xfer->bits_per_word != mas->cur_bits_per_word) { spi_setup_word_len(mas, mode, xfer->bits_per_word); mas->cur_bits_per_word = xfer->bits_per_word; @@ -511,6 +543,11 @@ static int spi_geni_transfer_one(struct spi_master *spi, struct spi_transfer *xfer) { struct spi_geni_master *mas = spi_master_get_devdata(spi); + int ret; + + ret = spi_geni_check_busy(mas); + if (ret) + return ret; /* Terminate and return success for 0 byte length transfer */ if (!xfer->len)
In commit 2ee471a1e28e ("spi: spi-geni-qcom: Mo' betta locking") we added a dance in setup_fifo_xfer() to make sure that the previous transfer was really done before we setup the next one. However, it wasn't enough. Specifically, if we had a timeout it's possible that the previous transfer could still be pending. This could happen if our interrupt handler was blocked for a long while (interrupt storm or someone disablng IRQs for a while). This pending interrupt could throw off our logic. Let's really make sure that the previous interrupt isn't still pending before we start the next transfer. Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP") Signed-off-by: Douglas Anderson <dianders@chromium.org> --- drivers/spi/spi-geni-qcom.c | 69 ++++++++++++++++++++++++++++--------- 1 file changed, 53 insertions(+), 16 deletions(-)