Message ID | 1530811543-6881-16-git-send-email-jjherne@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, 5 Jul 2018 13:25:43 -0400 "Jason J. Herne" <jjherne@linux.ibm.com> wrote: > If a vfio-ccw device is left in an error state (example: pending unit > check) then it is possible for that state to persist for a vfio-ccw device even > after the enable subchannel that we do to bring the device online. If this state > is allowed to persist then even simple I/O operations will needlessly fail. A > basic sense ccw is used to clear this error state for the boot device. Another thing: What about unsolicited interrupts? I.e., you enable the subchannel, and then it becomes pending with unsolicited status. Do you have any handling for that (or plan to add it)? We could ignore that for virtio devices, but probably not for dasds. > > Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com> > --- > pc-bios/s390-ccw/cio.c | 13 +++++++++++++ > pc-bios/s390-ccw/cio.h | 13 +++++++++++++ > pc-bios/s390-ccw/main.c | 5 +++++ > 3 files changed, 31 insertions(+) > diff --git a/pc-bios/s390-ccw/main.c b/pc-bios/s390-ccw/main.c > index 2bccfa7..e0ce59b 100644 > --- a/pc-bios/s390-ccw/main.c > +++ b/pc-bios/s390-ccw/main.c > @@ -201,12 +201,17 @@ static void virtio_setup(void) > > int main(void) > { > + SenseData sd; > + > sclp_setup(); > cio_setup(); > boot_setup(); > find_boot_device(); > enable_subchannel(blk_schid); > > + /* Clear any outstanding device error conditions */ > + basic_sense(blk_schid, &sd); Hmm. Could an error condition reassert itself after it was cleared? Probably not worth spending too much time on, though. > + > switch (cu_type(blk_schid)) { > case 0x3990: /* Real DASD device */ > dasd_ipl(blk_schid); /* no return */
On 07/06/2018 06:08 AM, Cornelia Huck wrote: > On Thu, 5 Jul 2018 13:25:43 -0400 > "Jason J. Herne" <jjherne@linux.ibm.com> wrote: > >> If a vfio-ccw device is left in an error state (example: pending unit >> check) then it is possible for that state to persist for a vfio-ccw device even >> after the enable subchannel that we do to bring the device online. If this state >> is allowed to persist then even simple I/O operations will needlessly fail. A >> basic sense ccw is used to clear this error state for the boot device. > > Another thing: What about unsolicited interrupts? I.e., you enable the > subchannel, and then it becomes pending with unsolicited status. Do you > have any handling for that (or plan to add it)? > > We could ignore that for virtio devices, but probably not for dasds. ... > Hmm. Could an error condition reassert itself after it was cleared? > Probably not worth spending too much time on, though. > >> + >> switch (cu_type(blk_schid)) { >> case 0x3990: /* Real DASD device */ >> dasd_ipl(blk_schid); /* no return */ I believe both of these issues to be the same as we are currently discussion in our replies to patch #10. My opinion posted there applies to this scenario. If my information is incorrect or you disagree with my assessment please yell at me! :-D
On 07/05/2018 07:25 PM, Jason J. Herne wrote: > If a vfio-ccw device is left in an error state (example: pending unit > check) then it is possible for that state to persist for a vfio-ccw device even > after the enable subchannel that we do to bring the device online. If this state > is allowed to persist then even simple I/O operations will needlessly fail. A > basic sense ccw is used to clear this error state for the boot device. > > Signed-off-by: Jason J. Herne<jjherne@linux.ibm.com> I don't like this. AFAIK an IPL is preceded by and subsystem reset. If it weren't the IPL-ed OS (program) would have to take care any potential mess left by the previous one -- and pray it gets control. A subsystem reset should clear any device error state, so it is not supposed to persist across subsystem resets. If the error re-emerges (unsolicited) after the reset, it's likely something is really broken and needs investigation. Generally IPL is supposed to fail in such cases (except for corner cases which are not really handled by this patch). AFAIU this patch works around a broken reset. While our bios is not guest and one could try to argue that it's firmware -- part of 'the machine', a believe handling the reset in the bios is wrong. AFAIR the qemu emulator is in charge, and if needed makes kvm do what it has to. If the reset is broken for vfio-ccw (which is very possible, but I would have to check), I think we should fix it in the right place. A workaround may be still justified (if kernel changes like clear support are needed). But we should indicate that clearly in the commit message and in the code. Regards, Halil
diff --git a/pc-bios/s390-ccw/cio.c b/pc-bios/s390-ccw/cio.c index f440380..e77ec94 100644 --- a/pc-bios/s390-ccw/cio.c +++ b/pc-bios/s390-ccw/cio.c @@ -57,6 +57,19 @@ __u16 cu_type(SubChannelId schid) return senseData.cu_type; } +void basic_sense(SubChannelId schid, SenseData *sd) +{ + Ccw1 senseCcw; + + senseCcw.cmd_code = CCW_CMD_BASIC_SENSE; + senseCcw.cda = ptr2u32(sd); + senseCcw.count = sizeof(*sd); + + if (do_cio(schid, ptr2u32(&senseCcw), CCW_FMT1)) { + panic("Failed to run Basic Sense CCW\n"); + } +} + static bool irb_error(Irb *irb) { /* We have to ignore Incorrect Length (cstat == 0x40) indicators because diff --git a/pc-bios/s390-ccw/cio.h b/pc-bios/s390-ccw/cio.h index b5c9fd7..003524e 100644 --- a/pc-bios/s390-ccw/cio.h +++ b/pc-bios/s390-ccw/cio.h @@ -234,6 +234,18 @@ typedef struct senseid { struct ciw ciw[62]; } __attribute__ ((packed, aligned(4))) SenseId; +/* basic sense response buffer layout */ +typedef struct senseData { + uint8_t status[3]; + uint8_t res_count; + uint8_t phys_drive_id; + uint8_t low_cyl_addr; + uint8_t head_high_cyl_addr; + uint8_t fmt_msg; + uint8_t reserved[16]; + uint8_t reserved2[8]; +} __attribute__ ((packed, aligned(4))) SenseData; + /* interruption response block */ typedef struct irb { struct scsw scsw; @@ -260,6 +272,7 @@ int enable_mss_facility(void); void enable_subchannel(SubChannelId schid); __u16 cu_type(SubChannelId schid); void await_io_int(uint16_t sch_no); +void basic_sense(SubChannelId schid, SenseData *sd); int do_cio(SubChannelId schid, uint32_t ccw_addr, int fmt); /* diff --git a/pc-bios/s390-ccw/main.c b/pc-bios/s390-ccw/main.c index 2bccfa7..e0ce59b 100644 --- a/pc-bios/s390-ccw/main.c +++ b/pc-bios/s390-ccw/main.c @@ -201,12 +201,17 @@ static void virtio_setup(void) int main(void) { + SenseData sd; + sclp_setup(); cio_setup(); boot_setup(); find_boot_device(); enable_subchannel(blk_schid); + /* Clear any outstanding device error conditions */ + basic_sense(blk_schid, &sd); + switch (cu_type(blk_schid)) { case 0x3990: /* Real DASD device */ dasd_ipl(blk_schid); /* no return */
If a vfio-ccw device is left in an error state (example: pending unit check) then it is possible for that state to persist for a vfio-ccw device even after the enable subchannel that we do to bring the device online. If this state is allowed to persist then even simple I/O operations will needlessly fail. A basic sense ccw is used to clear this error state for the boot device. Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com> --- pc-bios/s390-ccw/cio.c | 13 +++++++++++++ pc-bios/s390-ccw/cio.h | 13 +++++++++++++ pc-bios/s390-ccw/main.c | 5 +++++ 3 files changed, 31 insertions(+)