diff mbox

[RFC,15/15] s390-bios: Use sense ccw to ensure consistent device state at boot time

Message ID 1530811543-6881-16-git-send-email-jjherne@linux.ibm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jason J. Herne July 5, 2018, 5:25 p.m. UTC
If a vfio-ccw device is left in an error state (example: pending unit
check) then it is possible for that state to persist for a vfio-ccw device even
after the enable subchannel that we do to bring the device online. If this state
is allowed to persist then even simple I/O operations will needlessly fail. A
basic sense ccw is used to clear this error state for the boot device.

Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com>
---
 pc-bios/s390-ccw/cio.c  | 13 +++++++++++++
 pc-bios/s390-ccw/cio.h  | 13 +++++++++++++
 pc-bios/s390-ccw/main.c |  5 +++++
 3 files changed, 31 insertions(+)

Comments

Cornelia Huck July 6, 2018, 10:08 a.m. UTC | #1
On Thu,  5 Jul 2018 13:25:43 -0400
"Jason J. Herne" <jjherne@linux.ibm.com> wrote:

> If a vfio-ccw device is left in an error state (example: pending unit
> check) then it is possible for that state to persist for a vfio-ccw device even
> after the enable subchannel that we do to bring the device online. If this state
> is allowed to persist then even simple I/O operations will needlessly fail. A
> basic sense ccw is used to clear this error state for the boot device.

Another thing: What about unsolicited interrupts? I.e., you enable the
subchannel, and then it becomes pending with unsolicited status. Do you
have any handling for that (or plan to add it)?

We could ignore that for virtio devices, but probably not for dasds.

> 
> Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com>
> ---
>  pc-bios/s390-ccw/cio.c  | 13 +++++++++++++
>  pc-bios/s390-ccw/cio.h  | 13 +++++++++++++
>  pc-bios/s390-ccw/main.c |  5 +++++
>  3 files changed, 31 insertions(+)

> diff --git a/pc-bios/s390-ccw/main.c b/pc-bios/s390-ccw/main.c
> index 2bccfa7..e0ce59b 100644
> --- a/pc-bios/s390-ccw/main.c
> +++ b/pc-bios/s390-ccw/main.c
> @@ -201,12 +201,17 @@ static void virtio_setup(void)
>  
>  int main(void)
>  {
> +    SenseData sd;
> +
>      sclp_setup();
>      cio_setup();
>      boot_setup();
>      find_boot_device();
>      enable_subchannel(blk_schid);
>  
> +    /* Clear any outstanding device error conditions */
> +    basic_sense(blk_schid, &sd);

Hmm. Could an error condition reassert itself after it was cleared?
Probably not worth spending too much time on, though.

> +
>      switch (cu_type(blk_schid)) {
>      case 0x3990:  /* Real DASD device */
>          dasd_ipl(blk_schid); /* no return */
Jason J. Herne July 6, 2018, 2:45 p.m. UTC | #2
On 07/06/2018 06:08 AM, Cornelia Huck wrote:
> On Thu,  5 Jul 2018 13:25:43 -0400
> "Jason J. Herne" <jjherne@linux.ibm.com> wrote:
> 
>> If a vfio-ccw device is left in an error state (example: pending unit
>> check) then it is possible for that state to persist for a vfio-ccw device even
>> after the enable subchannel that we do to bring the device online. If this state
>> is allowed to persist then even simple I/O operations will needlessly fail. A
>> basic sense ccw is used to clear this error state for the boot device.
> 
> Another thing: What about unsolicited interrupts? I.e., you enable the
> subchannel, and then it becomes pending with unsolicited status. Do you
> have any handling for that (or plan to add it)?
> 
> We could ignore that for virtio devices, but probably not for dasds.
...
> Hmm. Could an error condition reassert itself after it was cleared?
> Probably not worth spending too much time on, though.
> 
>> +
>>       switch (cu_type(blk_schid)) {
>>       case 0x3990:  /* Real DASD device */
>>           dasd_ipl(blk_schid); /* no return */

I believe both of these issues to be the same as we are currently 
discussion in our replies to patch #10. My opinion posted there applies 
to this scenario. If my information is incorrect or you disagree with my 
assessment please yell at me! :-D
Halil Pasic July 6, 2018, 10:20 p.m. UTC | #3
On 07/05/2018 07:25 PM, Jason J. Herne wrote:
> If a vfio-ccw device is left in an error state (example: pending unit
> check) then it is possible for that state to persist for a vfio-ccw device even
> after the enable subchannel that we do to bring the device online. If this state
> is allowed to persist then even simple I/O operations will needlessly fail. A
> basic sense ccw is used to clear this error state for the boot device.
> 
> Signed-off-by: Jason J. Herne<jjherne@linux.ibm.com>

I don't like this. AFAIK an IPL is preceded by and subsystem reset. If
it weren't the IPL-ed OS (program) would have to take care any potential
mess left by the previous one -- and pray it gets control. A subsystem
reset should clear any device error state, so it is not supposed to
persist across subsystem resets. If the error re-emerges (unsolicited)
after the reset, it's likely something is really broken and needs investigation.
Generally IPL is supposed to fail in such cases (except for corner cases
which are not really handled by this patch).

AFAIU this patch works around a broken reset. While our bios is not guest and
one could try to argue that it's firmware -- part of 'the machine', a believe
handling the reset in the bios is wrong. AFAIR the qemu emulator is in charge,
and if needed makes kvm do what it has to.

If the reset is broken for vfio-ccw (which is very possible, but I would have
to check), I think we should fix it in the right place. A workaround may be
still justified (if kernel changes  like clear support are needed). But we
should indicate that clearly in the commit message and in the code.

Regards,
Halil
diff mbox

Patch

diff --git a/pc-bios/s390-ccw/cio.c b/pc-bios/s390-ccw/cio.c
index f440380..e77ec94 100644
--- a/pc-bios/s390-ccw/cio.c
+++ b/pc-bios/s390-ccw/cio.c
@@ -57,6 +57,19 @@  __u16 cu_type(SubChannelId schid)
     return senseData.cu_type;
 }
 
+void basic_sense(SubChannelId schid, SenseData *sd)
+{
+    Ccw1 senseCcw;
+
+    senseCcw.cmd_code = CCW_CMD_BASIC_SENSE;
+    senseCcw.cda = ptr2u32(sd);
+    senseCcw.count = sizeof(*sd);
+
+    if (do_cio(schid, ptr2u32(&senseCcw), CCW_FMT1)) {
+        panic("Failed to run Basic Sense CCW\n");
+    }
+}
+
 static bool irb_error(Irb *irb)
 {
     /* We have to ignore Incorrect Length (cstat == 0x40) indicators because
diff --git a/pc-bios/s390-ccw/cio.h b/pc-bios/s390-ccw/cio.h
index b5c9fd7..003524e 100644
--- a/pc-bios/s390-ccw/cio.h
+++ b/pc-bios/s390-ccw/cio.h
@@ -234,6 +234,18 @@  typedef struct senseid {
     struct ciw ciw[62];
 }  __attribute__ ((packed, aligned(4))) SenseId;
 
+/* basic sense response buffer layout */
+typedef struct senseData {
+    uint8_t status[3];
+    uint8_t res_count;
+    uint8_t phys_drive_id;
+    uint8_t low_cyl_addr;
+    uint8_t head_high_cyl_addr;
+    uint8_t fmt_msg;
+    uint8_t reserved[16];
+    uint8_t reserved2[8];
+}  __attribute__ ((packed, aligned(4))) SenseData;
+
 /* interruption response block */
 typedef struct irb {
     struct scsw scsw;
@@ -260,6 +272,7 @@  int enable_mss_facility(void);
 void enable_subchannel(SubChannelId schid);
 __u16 cu_type(SubChannelId schid);
 void await_io_int(uint16_t sch_no);
+void basic_sense(SubChannelId schid, SenseData *sd);
 int do_cio(SubChannelId schid, uint32_t ccw_addr, int fmt);
 
 /*
diff --git a/pc-bios/s390-ccw/main.c b/pc-bios/s390-ccw/main.c
index 2bccfa7..e0ce59b 100644
--- a/pc-bios/s390-ccw/main.c
+++ b/pc-bios/s390-ccw/main.c
@@ -201,12 +201,17 @@  static void virtio_setup(void)
 
 int main(void)
 {
+    SenseData sd;
+
     sclp_setup();
     cio_setup();
     boot_setup();
     find_boot_device();
     enable_subchannel(blk_schid);
 
+    /* Clear any outstanding device error conditions */
+    basic_sense(blk_schid, &sd);
+
     switch (cu_type(blk_schid)) {
     case 0x3990:  /* Real DASD device */
         dasd_ipl(blk_schid); /* no return */