[v2] tpm_tis: Increase ST19NP18 TPM command timeout to avoid chip lockup

Message ID 1465252079-126836-1-git-send-email-eswierk@skyportsystems.com (mailing list archive)
State New, archived

Commit Message

Ed Swierk June 6, 2016, 10:27 p.m. UTC
The STMicro ST19NP18-TPM sometimes takes much longer to execute
commands than it reports in its capabilities. For example, command 186
(TPM_FlushSpecific) has been observed to take 14560 msec to complete,
far longer than the 750 msec limit for "short" commands reported by
the chip. The behavior has also been seen with command 101
(TPM_GetCapability).

Worse, when the tpm_tis driver attempts to cancel the current command
(by writing commandReady = 1 to TPM_STS_x), the chip locks up
completely, returning all-1s from all memory-mapped register
reads. The lockup can be cleared only by resetting the system.

The occurrence of this excessive command duration depends on the
sequence of commands preceding it. One sequence is creating at least 2
new keys via TPM_CreateWrapKey, then letting the TPM idle for at least
30 seconds, then loading a key via TPM_LoadKey2. The next
TPM_FlushSpecific occasionally takes tens of seconds to
complete. Another sequence is creating many keys in a row without
pause. The TPM_CreateWrapKey operation gets much slower after the
first few iterations, as one would expect when the pool of precomputed
keys is exhausted. Then after a 35-second pause, the same TPM_LoadKey2
followed by TPM_FlushSpecific sequence triggers the behavior.

Our working theory is that this older TPM sometimes pauses to perform
internal garbage collection, which modern chips implement as a
background process. Without access to the chip's implementation
details it's impossible to know whether any commands are immune to
this behavior.  So it seems safest to ignore the chip's reported
command durations, and use a value much higher than any observed
duration, like 2 minutes (which happens to be the value used for
TPM_UNDEFINED commands in tpm_calc_ordinal_duration()).

v2: Minor correction of patch description.

Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
---
 drivers/char/tpm/tpm_tis.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Stefan Berger June 7, 2016, 1:07 a.m. UTC | #1
Ed Swierk <eswierk@skyportsystems.com> wrote on 06/06/2016 06:27:59 PM:

> 
> The STMicro ST19NP18-TPM sometimes takes much longer to execute
> commands than it reports in its capabilities. For example, command 186
> (TPM_FlushSpecific) has been observed to take 14560 msec to complete,
> far longer than the 750 msec limit for "short" commands reported by
> the chip. The behavior has also been seen with command 101
> (TPM_GetCapability).

Hm, those should be really fast.

> 
> Worse, when the tpm_tis driver attempts to cancel the current command
> (by writing commandReady = 1 to TPM_STS_x), the chip locks up
> completely, returning all-1s from all memory-mapped register
> reads. The lockup can be cleared only by resetting the system.
> 
> The occurrence of this excessive command duration depends on the
> sequence of commands preceding it. One sequence is creating at least 2
> new keys via TPM_CreateWrapKey, then letting the TPM idle for at least

How long does it take to create those keys? Maybe it will create new keys 
in the 'background' after that.

> 30 seconds, then loading a key via TPM_LoadKey2. The next
> TPM_FlushSpecific occasionally takes tens of seconds to
> complete. Another sequence is creating many keys in a row without
> pause. The TPM_CreateWrapKey operation gets much slower after the
> first few iterations, as one would expect when the pool of precomputed
> keys is exhausted. Then after a 35-second pause, the same TPM_LoadKey2
> followed by TPM_FlushSpecific sequence triggers the behavior.
> 
> Our working theory is that this older TPM sometimes pauses to perform
> internal garbage collection, which modern chips implement as a
> background process. Without access to the chip's implementation
> details it's impossible to know whether any commands are immune to
> this behavior.  So it seems safest to ignore the chip's reported
> command durations, and use a value much higher than any observed
> duration, like 2 minutes (which happens to be the value used for
> TPM_UNDEFINED commands in tpm_calc_ordinal_duration()).
> 
> v2: Minor correction of patch description.
> 
> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>

Reviewed-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
------------------------------------------------------------------------------
What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic
patterns at an interface-level. Reveals which users, apps, and protocols are 
consuming the most bandwidth. Provides multi-vendor support for NetFlow, 
J-Flow, sFlow and other flows. Make informed decisions using capacity 
planning reports. https://ad.doubleclick.net/ddm/clk/305295220;132659582;e
Ed Swierk June 7, 2016, 1:48 a.m. UTC | #2
On Mon, Jun 6, 2016 at 6:07 PM, Stefan Berger <stefanb@us.ibm.com> wrote:
> Ed Swierk <eswierk@skyportsystems.com> wrote on 06/06/2016 06:27:59 PM:
> > The occurrence of this excessive command duration depends on the
> > sequence of commands preceding it. One sequence is creating at least 2
> > new keys via TPM_CreateWrapKey, then letting the TPM idle for at least
>
> How long does it take to create those keys? Maybe it will create new keys in the 'background' after that.

The first few TPM_CreateWrapKey commands take roughly 300 msec. I've
seen it go as high as 80 seconds after several of those in a row.

It makes sense that a key generation process starts up after the chip
thinks it's idle. Slowing down unrelated operations stretches the
meaning of "background", of course. And locking up the chip is
downright impolite.

> > 30 seconds, then loading a key via TPM_LoadKey2. The next
> > TPM_FlushSpecific occasionally takes tens of seconds to
> > complete. Another sequence is creating many keys in a row without
> > pause. The TPM_CreateWrapKey operation gets much slower after the
> > first few iterations, as one would expect when the pool of precomputed
> > keys is exhausted. Then after a 35-second pause, the same TPM_LoadKey2
> > followed by TPM_FlushSpecific sequence triggers the behavior.
> >
> > Our working theory is that this older TPM sometimes pauses to perform
> > internal garbage collection, which modern chips implement as a
> > background process. Without access to the chip's implementation
> > details it's impossible to know whether any commands are immune to
> > this behavior.  So it seems safest to ignore the chip's reported
> > command durations, and use a value much higher than any observed
> > duration, like 2 minutes (which happens to be the value used for
> > TPM_UNDEFINED commands in tpm_calc_ordinal_duration()).

On further testing of my patch, I realize that I have totally confused
timeouts and durations. I need to override the command durations
(short, medium, long) reported by the chip. The reported protocol
timeouts (a, b, c, d) are fine.

Please consider this patch withdrawn. I'll send an updated patch shortly.

--Ed
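
The distinction drawn in this message can be sketched in plain user-space C. This is an illustration only, not the kernel's code: the struct `tpm_timing` and the helpers `override_durations()` and `would_time_out()` are hypothetical names, and every numeric value except the 750 msec "short" duration, the 14560 msec observation, and the 2-minute override is a placeholder. The sketch assumes that the protocol timeouts (a, b, c, d) bound individual interface handshakes, while the command durations (short, medium, long) bound how long a whole command may run before the driver cancels it.

```c
#include <stdint.h>

/* Hypothetical index names; not the kernel's identifiers. */
enum { TMO_A, TMO_B, TMO_C, TMO_D, TMO_COUNT };
enum { DUR_SHORT, DUR_MEDIUM, DUR_LONG, DUR_COUNT };

/* Hypothetical container; not the kernel's struct tpm_chip. */
struct tpm_timing {
	uint32_t timeout_ms[TMO_COUNT];  /* protocol timeouts a..d */
	uint32_t duration_ms[DUR_COUNT]; /* command durations short/medium/long */
};

/* 2 minutes, the value proposed in the patch description. */
#define OVERRIDE_DURATION_MS (120u * 1000u)

/* Replace only the command durations; the protocol timeouts are untouched. */
static void override_durations(struct tpm_timing *t)
{
	for (int i = 0; i < DUR_COUNT; i++)
		t->duration_ms[i] = OVERRIDE_DURATION_MS;
}

/* Returns 1 if a command of the given duration class that has run for
 * elapsed_ms has exceeded its limit and would be cancelled. */
static int would_time_out(const struct tpm_timing *t, int dur_class,
			  uint32_t elapsed_ms)
{
	return elapsed_ms > t->duration_ms[dur_class];
}
```

With a reported short duration of 750 msec, the 14560 msec TPM_FlushSpecific from the patch description exceeds the limit; after `override_durations()` it does not, and the timeouts keep their reported values.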

Jarkko Sakkinen June 7, 2016, 1:49 p.m. UTC | #3
On Mon, Jun 06, 2016 at 03:27:59PM -0700, Ed Swierk wrote:
> The STMicro ST19NP18-TPM sometimes takes much longer to execute
> commands than it reports in its capabilities. For example, command 186
> (TPM_FlushSpecific) has been observed to take 14560 msec to complete,
> far longer than the 750 msec limit for "short" commands reported by
> the chip. The behavior has also been seen with command 101
> (TPM_GetCapability).
> 
> Worse, when the tpm_tis driver attempts to cancel the current command
> (by writing commandReady = 1 to TPM_STS_x), the chip locks up
> completely, returning all-1s from all memory-mapped register
> reads. The lockup can be cleared only by resetting the system.

Does this also happen when a command doesn't take enormously long? I'm
just trying to understand whether this is one issue or two. I don't
know how hard this would be to test in practice with commands that last
750 ms (maybe with a quirk in the kernel code).

> The occurrence of this excessive command duration depends on the
> sequence of commands preceding it. One sequence is creating at least 2
> new keys via TPM_CreateWrapKey, then letting the TPM idle for at least
> 30 seconds, then loading a key via TPM_LoadKey2. The next
> TPM_FlushSpecific occasionally takes tens of seconds to
> complete. Another sequence is creating many keys in a row without
> pause. The TPM_CreateWrapKey operation gets much slower after the
> first few iterations, as one would expect when the pool of precomputed
> keys is exhausted. Then after a 35-second pause, the same TPM_LoadKey2
> followed by TPM_FlushSpecific sequence triggers the behavior.
> 
> Our working theory is that this older TPM sometimes pauses to perform
> internal garbage collection, which modern chips implement as a
> background process. Without access to the chip's implementation
> details it's impossible to know whether any commands are immune to
> this behavior.  So it seems safest to ignore the chip's reported
> command durations, and use a value much higher than any observed
> duration, like 2 minutes (which happens to be the value used for
> TPM_UNDEFINED commands in tpm_calc_ordinal_duration()).
> 
> v2: Minor correction of patch description.
> 
> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>

Acked-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>

/Jarkko
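
For anyone trying to answer the question above on hardware, the lockup symptom described in the quoted text (all-1s from every memory-mapped register read) could be checked with something like the following. This is a minimal user-space sketch: `reads_all_ones()` is a hypothetical helper that operates on a buffer of register bytes already read by other means, and it does not touch any hardware itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every byte in the snapshot is 0xFF, which is the symptom
 * the patch description reports for a locked-up chip. An empty snapshot
 * is treated as "not locked up" because it carries no evidence. */
static int reads_all_ones(const uint8_t *regs, size_t len)
{
	if (len == 0)
		return 0;
	for (size_t i = 0; i < len; i++) {
		if (regs[i] != 0xFF)
			return 0;
	}
	return 1;
}
```

Taking a snapshot right after cancelling a command that ran for only about 750 ms, and another after cancelling one that ran far longer, would help separate the two possible issues.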

Jarkko Sakkinen June 7, 2016, 1:52 p.m. UTC | #4
On Mon, Jun 06, 2016 at 06:48:10PM -0700, Ed Swierk wrote:
> On Mon, Jun 6, 2016 at 6:07 PM, Stefan Berger <stefanb@us.ibm.com> wrote:
> > Ed Swierk <eswierk@skyportsystems.com> wrote on 06/06/2016 06:27:59 PM:
> > > The occurrence of this excessive command duration depends on the
> > > sequence of commands preceding it. One sequence is creating at least 2
> > > new keys via TPM_CreateWrapKey, then letting the TPM idle for at least
> >
> > How long does it take to create those keys? Maybe it will create new keys in the 'background' after that.
> 
> The first few TPM_CreateWrapKey commands take roughly 300 msec. I've
> seen it go as high as 80 seconds after several of those in a row.
> 
> It makes sense that a key generation process starts up after the chip
> thinks it's idle. Slowing down unrelated operations stretches the
> meaning of "background", of course. And locking up the chip is
> downright impolite.
> 
> > > 30 seconds, then loading a key via TPM_LoadKey2. The next
> > > TPM_FlushSpecific occasionally takes tens of seconds to
> > > complete. Another sequence is creating many keys in a row without
> > > pause. The TPM_CreateWrapKey operation gets much slower after the
> > > first few iterations, as one would expect when the pool of precomputed
> > > keys is exhausted. Then after a 35-second pause, the same TPM_LoadKey2
> > > followed by TPM_FlushSpecific sequence triggers the behavior.
> > >
> > > Our working theory is that this older TPM sometimes pauses to perform
> > > internal garbage collection, which modern chips implement as a
> > > background process. Without access to the chip's implementation
> > > details it's impossible to know whether any commands are immune to
> > > this behavior.  So it seems safest to ignore the chip's reported
> > > command durations, and use a value much higher than any observed
> > > duration, like 2 minutes (which happens to be the value used for
> > > TPM_UNDEFINED commands in tpm_calc_ordinal_duration()).
> 
> On further testing of my patch, I realize that I have totally confused
> timeouts and durations. I need to override the command durations
> (short, medium, long) reported by the chip. The reported protocol
> timeouts (a, b, c, d) are fine.
> 
> Please consider this patch withdrawn. I'll send an updated patch shortly.

Ack.

> --Ed

/Jarkko

Patch

diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
index 088fa86..3070be2 100644
--- a/drivers/char/tpm/tpm_tis.c
+++ b/drivers/char/tpm/tpm_tis.c
@@ -484,6 +484,9 @@  static const struct tis_vendor_timeout_override vendor_timeout_overrides[] = {
 	/* Atmel 3204 */
 	{ 0x32041114, { (TIS_SHORT_TIMEOUT*1000), (TIS_LONG_TIMEOUT*1000),
 			(TIS_SHORT_TIMEOUT*1000), (TIS_SHORT_TIMEOUT*1000) } },
+	/* STMicro ST19NP18-TPM */
+	{ 0x0000104a, { (120*1000*1000), (120*1000*1000),
+			(120*1000*1000), (120*1000*1000) } },
 };
 
 static bool tpm_tis_update_timeouts(struct tpm_chip *chip,
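
As a reading aid for the hunk above, here is a stand-alone sketch of how a vendor override table of this shape could be consulted. It is not the kernel's implementation: `struct vendor_override` and `apply_override()` are hypothetical names, and the values given for `TIS_SHORT_TIMEOUT` and `TIS_LONG_TIMEOUT` are assumed for the example. The sketch also assumes the table entries are in microseconds, which is what makes `120*1000*1000` equal to 2 minutes. Note that this table overrides the four protocol timeouts; per the follow-up in the thread, the values that actually need overriding are the command durations, which is why the patch was withdrawn.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed defaults in msec for this example only. */
#define TIS_SHORT_TIMEOUT 750
#define TIS_LONG_TIMEOUT  2000

/* Hypothetical stand-in for the override entry type used in the diff. */
struct vendor_override {
	uint32_t did_vid;        /* device/vendor ID to match */
	uint32_t timeout_us[4];  /* timeouts a, b, c, d in microseconds */
};

static const struct vendor_override overrides[] = {
	/* Atmel 3204 */
	{ 0x32041114, { TIS_SHORT_TIMEOUT * 1000, TIS_LONG_TIMEOUT * 1000,
			TIS_SHORT_TIMEOUT * 1000, TIS_SHORT_TIMEOUT * 1000 } },
	/* STMicro ST19NP18-TPM: 120 s expressed in microseconds */
	{ 0x0000104a, { 120 * 1000 * 1000, 120 * 1000 * 1000,
			120 * 1000 * 1000, 120 * 1000 * 1000 } },
};

/* Copies the matching entry into timeout_us and returns 1, or leaves
 * timeout_us unchanged and returns 0 when no entry matches. */
static int apply_override(uint32_t did_vid, uint32_t timeout_us[4])
{
	for (size_t i = 0; i < sizeof(overrides) / sizeof(overrides[0]); i++) {
		if (overrides[i].did_vid != did_vid)
			continue;
		memcpy(timeout_us, overrides[i].timeout_us,
		       sizeof(overrides[i].timeout_us));
		return 1;
	}
	return 0;
}
```

A chip whose ID is not in the table keeps whatever values it reported, so the override stays limited to the listed parts.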