
virtio-blk: add SG_IO passthru support

Message ID 20090427082914.GA383@lst.de (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig April 27, 2009, 8:29 a.m. UTC
[had the qemu list address wrong the first time, reply to this message,
 not the previous if you were on Cc]


Add support for SG_IO passthru (packet commands) to the virtio-blk
backend.  Conceptually based on an older patch from Hannes Reinecke
but largely rewritten to match the code structure and layering in
virtio-blk.

Note that currently we issue the host SG_IO synchronously.  We could
easily switch to async I/O, but that would require either bloating
the VirtIOBlockReq by the size of struct sg_io_hdr or an additional
memory allocation for each SG_IO request.
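
For illustration, the embedding option would look roughly like this (a
sketch only, extending the VirtIOBlockReq from the patch below; the new
field is hypothetical and not part of this submission):

    /* Hypothetical async variant: embed the sg_io_hdr in the request so
     * it stays valid until an aio completion callback fires.  sg_io_hdr
     * is well under 100 bytes, which is the "bloat" mentioned above.
     */
    typedef struct VirtIOBlockReq
    {
        VirtIOBlock *dev;
        VirtQueueElement elem;
        struct virtio_blk_inhdr *in;
        struct virtio_blk_outhdr *out;
        struct virtio_scsi_inhdr *scsi;
    #ifdef __linux__
        struct sg_io_hdr sg_hdr;    /* embedded instead of on the stack */
    #endif
        QEMUIOVector qiov;
        struct VirtIOBlockReq *next;
    } VirtIOBlockReq;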


Signed-off-by: Christoph Hellwig <hch@lst.de>



Comments

Avi Kivity April 27, 2009, 9:15 a.m. UTC | #1
Christoph Hellwig wrote:
> [had the qemu list address wrong the first time, reply to this message,
>  not the previous if you were on Cc]
>
>
> Add support for SG_IO passthru (packet commands) to the virtio-blk
> backend.  Conceptually based on an older patch from Hannes Reinecke
> but largely rewritten to match the code structure and layering in
> virtio-blk.
>
> Note that currently we issue the host SG_IO synchronously.  We could
> easily switch to async I/O, but that would require either bloating
> the VirtIOBlockReq by the size of struct sg_io_hdr or an additional
> memory allocation for each SG_IO request.
>   

I think that's worthwhile.  The extra bloat is trivial (especially as 
the number of inflight virtio requests is tightly bounded), and stalling 
the vcpu for requests is a pain.
Anthony Liguori April 27, 2009, 2:36 p.m. UTC | #2
Christoph Hellwig wrote:
> [had the qemu list address wrong the first time, reply to this message,
>  not the previous if you were on Cc]
>
>
> Add support for SG_IO passthru (packet commands) to the virtio-blk
> backend.  Conceptually based on an older patch from Hannes Reinecke
> but largely rewritten to match the code structure and layering in
> virtio-blk.
>
> Note that currently we issue the host SG_IO synchronously.  We could
> easily switch to async I/O, but that would require either bloating
> the VirtIOBlockReq by the size of struct sg_io_hdr or an additional
> memory allocation for each SG_IO request.
>
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
>   

So practically speaking, what can you do with this?  Should we be 
handling some SCSI cmds internally to QEMU (like eject operations) and 
supporting media=cdrom in -drive for if=virtio?

On a related topic, should we switch /dev/vdX to be /dev/sdX?

Regards,

Anthony Liguori

Christoph Hellwig April 28, 2009, 9:47 a.m. UTC | #3
On Mon, Apr 27, 2009 at 12:15:31PM +0300, Avi Kivity wrote:
> I think that's worthwhile.  The extra bloat is trivial (especially as 
> the number of inflight virtio requests is tightly bounded), and stalling 
> the vcpu for requests is a pain.

Ok, new patch will follow ASAP.

Christoph Hellwig April 28, 2009, 9:51 a.m. UTC | #4
On Mon, Apr 27, 2009 at 09:36:51AM -0500, Anthony Liguori wrote:
> So practically speaking, what can you do with this?

It's interesting if your underlying device is a real scsi disk and you
want virtio for efficiency but also want to be able to issue scsi
commands on the underlying device, e.g. for finding out more
information about the underlying setup using the sg_* tools, for
certain manual multipathing setups, or for upgrading firmware or..
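
As a (hypothetical) illustration of what that looks like from inside
the guest, here is a minimal SG_IO INQUIRY against the virtio disk,
assuming it shows up as /dev/vda and with error handling trimmed:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <scsi/sg.h>

    int main(void)
    {
        unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 }; /* INQUIRY, 96 bytes */
        unsigned char buf[96], sense[32];
        struct sg_io_hdr hdr;
        int fd = open("/dev/vda", O_RDONLY);

        if (fd < 0)
            return 1;

        memset(&hdr, 0, sizeof(hdr));
        hdr.interface_id = 'S';
        hdr.cmd_len = sizeof(cdb);
        hdr.cmdp = cdb;
        hdr.dxfer_direction = SG_DXFER_FROM_DEV;
        hdr.dxferp = buf;
        hdr.dxfer_len = sizeof(buf);
        hdr.sbp = sense;
        hdr.mx_sb_len = sizeof(sense);
        hdr.timeout = 5000;             /* milliseconds */

        if (ioctl(fd, SG_IO, &hdr) < 0 || hdr.status)
            return 1;

        /* vendor (bytes 8..15) and product (16..31) of the INQUIRY data */
        printf("%.24s\n", buf + 8);
        return 0;
    }

This is essentially what the sg_* tools do under the hood.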

> Should we be 
> handling some SCSI cmds internally to QEMU (like eject operations) and 
> supporting media=cdrom in -drive for if=virtio?

Not quite yet.  Eventually I want to support a virtio-scsi kind of
transport which would use the same virtio-blk protocol but only send
scsi commands.  We'd need a different driver on the Linux side for
it that registers with the scsi layer.  On the QEMU side it could
either do pass-through to a real scsi device using SG_IO or use the
existing scsi command emulator in scsi-disk.c.

> On a related topic, should we switch /dev/vdX to be /dev/sdX?

We can't, and don't want to, for the current virtio-blk driver.  Once
we implement a virtio-scsi driver, the disks in the guest will show up
as /dev/sdX.

Anthony Liguori April 28, 2009, 4:37 p.m. UTC | #5
Christoph Hellwig wrote:
>> Should we be 
>> handling some SCSI cmds internally to QEMU (like eject operations) and 
>> supporting media=cdrom in -drive for if=virtio?
>>     
>
> Not quite yet.  Eventually I want to support a virtio-scsi kind of
> transport which would use the same virtio-blk protocol but only send
> scsi commands.  We'd need a different driver on the Linux side for
> it that registers to the scsi layer.  On the QEMU side it could either
> do pass-through to a real scsi device using SG_IO or use the existing
> command scsi emulator in scsi-disk.c.
>   

Ah, excellent.  I think that's a great thing to do.  So do you think 
virtio-scsi would deprecate virtio-blk?

I've gotten a number of requests for exposing stuff like VPD through 
virtio-blk.  There's also a surprising amount of software out there that 
assumes /dev/sdX or /dev/hdX naming.

Regards,

Anthony Liguori
Christian Borntraeger April 28, 2009, 4:52 p.m. UTC | #6
On Tuesday 28 April 2009 18:37:01, Anthony Liguori wrote:
> Christoph Hellwig wrote:
> >> Should we be 
> >> handling some SCSI cmds internally to QEMU (like eject operations) and 
> >> supporting media=cdrom in -drive for if=virtio?
> >>     
> >
> > Not quite yet.  Eventually I want to support a virtio-scsi kind of
> > transport which would use the same virtio-blk protocol but only send
> > scsi commands.  We'd need a different driver on the Linux side for
> > it that registers to the scsi layer.  On the QEMU side it could either
> > do pass-through to a real scsi device using SG_IO or use the existing
> > command scsi emulator in scsi-disk.c.
> >   
> 
> Ah, excellent.  I think that's a great thing to do.  So do you think 
> virtio-scsi would deprecate virtio-blk?

There are lots of other block device drivers with strange characteristics. 
For example dasd/xpram disks with 4k block size or other disks with
strange partitioning schemes.  Doing a scsi emulation on top of these
beasts looks possible but far from ideal.  So I guess we will need
virtio_blk for non-scsi block devices.

Christian

Christian Borntraeger April 28, 2009, 7:09 p.m. UTC | #7
On Tuesday 28 April 2009 11:51:54, Christoph Hellwig wrote:
> Not quite yet.  Eventually I want to support a virtio-scsi kind of
> transport which would use the same virtio-blk protocol but only send
> scsi commands.  We'd need a different driver on the Linux side for
> it that registers to the scsi layer.  On the QEMU side it could either
> do pass-through to a real scsi device using SG_IO or use the existing

Yes, virtio-scsi is also something we were thinking of.  The last time
we discussed this idea of SCSI passthrough internally, we stumbled over
error recovery.  Who is responsible for the error recovery?  The host,
the guest, or both?  Are there problems which will trigger error
recovery in the guest and host midlayer at the same time?  To be
honest, my scsi knowledge is very limited, so I don't know if that is a
real problem.

Christian
Christoph Hellwig April 29, 2009, 10:48 a.m. UTC | #8
On Tue, Apr 28, 2009 at 11:37:01AM -0500, Anthony Liguori wrote:
> Ah, excellent.  I think that's a great thing to do.  So do you think 
> virtio-scsi would deprecate virtio-blk?

I don't think so.  If you have an image format or a non-scsi
blockdevice underneath, virtio-blk avoids the encoding into SCSI CDBs
and back, and should be faster.

> I've gotten a number of requests for exposing stuff like VPD through 
> virtio-blk.  There's also a surprising amount of software out there that 
> assumes /dev/sdX or /dev/hdX naming.

They should have been fixed for a while now; we have lots of non-scsi,
non-ide block devices.

Christoph Hellwig April 29, 2009, 10:50 a.m. UTC | #9
On Tue, Apr 28, 2009 at 09:09:52PM +0200, Christian Borntraeger wrote:
> Yes, virtio-scsi is also something we were thinking of.  The last time
> we discussed this idea of SCSI passthrough internally, we stumbled over
> error recovery.  Who is responsible for the error recovery?  The host,
> the guest, or both?  Are there problems which will trigger error
> recovery in the guest and host midlayer at the same time?  To be
> honest, my scsi knowledge is very limited, so I don't know if that is a
> real problem.

I'm not that far along with planning yet.  The best way would probably
be to do real error handling in the host.

Christian Borntraeger April 29, 2009, 11:07 a.m. UTC | #10
On Wednesday 29 April 2009 12:50:04, Christoph Hellwig wrote:
> On Tue, Apr 28, 2009 at 09:09:52PM +0200, Christian Borntraeger wrote:
> > Yes, virtio-scsi is also something we were thinking of.  The last
> > time we discussed this idea of SCSI passthrough internally, we
> > stumbled over error recovery.  Who is responsible for the error
> > recovery?  The host, the guest, or both?  Are there problems which
> > will trigger error recovery in the guest and host midlayer at the
> > same time?  To be honest, my scsi knowledge is very limited, so I
> > don't know if that is a real problem.
> 
> I'm not that far along with planning yet.  The best way would probably
> be to do real error handling in the host.

Yes, that was my impression too.  The tricky part is the timeout-based
recovery found in the scsi midlayer (e.g. scsi_error.c).  In case of a
timeout, the guest and the host will start recovery at the same time.
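
One knob that matters here, shown as a sketch only: on a Linux guest
the per-device timeout that drives that recovery is a tunable sysfs
attribute, so if the host is to own error recovery, the guest's timeout
could simply be raised above the host's (device name assumed):

    #include <stdio.h>

    int main(void)
    {
        /* Equivalent to: echo 180 > /sys/block/sda/device/timeout.
         * 180s gives host-side recovery time to finish first; the
         * usual default is 30-60 seconds.
         */
        FILE *f = fopen("/sys/block/sda/device/timeout", "w");

        if (!f)
            return 1;
        fprintf(f, "180\n");
        return fclose(f) ? 1 : 0;
    }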

Christian
Paul Brook April 29, 2009, 11:11 a.m. UTC | #11
On Wednesday 29 April 2009, Christoph Hellwig wrote:
> On Tue, Apr 28, 2009 at 11:37:01AM -0500, Anthony Liguori wrote:
> > Ah, excellent.  I think that's a great thing to do.  So do you think
> > virtio-scsi would deprecate virtio-blk?
>
> I don't think so.  If you have an image format or a non-scsi
> blockdevice underneath, virtio-blk avoids the encoding into SCSI CDBs
> and back, and should be faster.

Is this actually measurably faster, or just infinitesimally faster in theory?

I can maybe see that virtio-blk is slightly simpler for dumb drivers, though 
even then a basic scsi host is pretty straightforward and I find it hard to 
believe there's much real benefit.

Paul
Christian Borntraeger April 29, 2009, 11:19 a.m. UTC | #12
On Wednesday 29 April 2009 13:11:19, Paul Brook wrote:
> On Wednesday 29 April 2009, Christoph Hellwig wrote:
> > On Tue, Apr 28, 2009 at 11:37:01AM -0500, Anthony Liguori wrote:
> > > Ah, excellent.  I think that's a great thing to do.  So do you think
> > > virtio-scsi would deprecate virtio-blk?
> >
> > I don't think so.  If you have an image format or a non-scsi
> > blockdevice underneath, virtio-blk avoids the encoding into SCSI CDBs
> > and back, and should be faster.
> 
> Is this actually measurably faster, or just infinitesimally faster in 
> theory? 

If the underlying backing store has a 4k block size and the emulated
scsi disk has a 512 byte block size, write performance can be a lot
slower, since writes smaller than the backing block size force a
read-modify-write cycle.
Christoph Hellwig April 29, 2009, 11:21 a.m. UTC | #13
On Wed, Apr 29, 2009 at 12:11:19PM +0100, Paul Brook wrote:
> Is this actually measurably faster, or just infinitesimally faster in theory?

On normal disks it's rather theoretical.  On high-end SSDs and arrays the
impact is noticeable, mostly due to the additional latency.

Paul Brook April 29, 2009, 11:29 a.m. UTC | #14
On Wednesday 29 April 2009, Christian Borntraeger wrote:
> On Wednesday 29 April 2009 13:11:19, Paul Brook wrote:
> > On Wednesday 29 April 2009, Christoph Hellwig wrote:
> > > On Tue, Apr 28, 2009 at 11:37:01AM -0500, Anthony Liguori wrote:
> > > > Ah, excellent.  I think that's a great thing to do.  So do you think
> > > > virtio-scsi would deprecate virtio-blk?
> > >
> > > I don't think so.  If you have an image format or a non-scsi
> > > blockdevice underneath, virtio-blk avoids the encoding into SCSI
> > > CDBs and back, and should be faster.
> >
> > Is this actually measurably faster, or just infinitesimally faster in
> > theory?
>
> If the underlying backing store has a 4k block size and the emulated
> scsi disk has a 512 byte block size, write performance can be a lot
> slower.

Rubbish.  If this is really a problem, you can just use a scsi disk
with 4k sectors.

Paul
Paul Brook April 29, 2009, 11:37 a.m. UTC | #15
On Wednesday 29 April 2009, Christoph Hellwig wrote:
> On Wed, Apr 29, 2009 at 12:11:19PM +0100, Paul Brook wrote:
> > Is this actually measurably faster, or just infinitesimally faster in
> > theory?
>
> On normal disks it's rather theoretical.  On high-end SSDs and arrays the
> impact is noticeable, mostly due to the additional latency.

How exactly does it introduce additional latency?  A scsi command block
is hardly large or complicated.  Are you suggesting that a 16/32-byte
scsi command takes significantly longer to process than a 16-byte
virtio command descriptor?  I'd expect any extra processing to be a
small fraction of the host syscall latency, let alone the latency of
the physical host adapter.  It probably even fits on the same CPU cache
line.

Paul
Christoph Hellwig April 30, 2009, 8:13 p.m. UTC | #16
On Wed, Apr 29, 2009 at 12:37:20PM +0100, Paul Brook wrote:
> How exactly does it introduce additional latency? A scsi command block is 
> hardly large or complicated. Are you suggesting that a 16/32byte scsi command 
> takes significantly longer to process than a 16byte virtio command 
> descriptor? I'd expect any extra processing to be a small fraction of the 
> host syscall latency, let alone the latency of the physical host adapter. It 
> probably even fits on the same CPU cache line.

Encoding the scsi CDB is additional work, but I would be surprised if
it is measurable.  Just using scsi CDBs would be simple enough; the
bigger issue is emulating a full blown scsi bus, because then you need
to make all kinds of queueing decisions at the target level etc. and
drag in a complicated scsi stack rather than just a simple block driver
in the guest.  And at least on current linux kernels that does
introduce measurable latency.

Now it might be possible to get that latency down to a level where we
can ignore it, but when doing all this additional work there will
always be additional overhead.

Nicholas A. Bellinger April 30, 2009, 8:55 p.m. UTC | #17
On Thu, 2009-04-30 at 22:13 +0200, Christoph Hellwig wrote:
> On Wed, Apr 29, 2009 at 12:37:20PM +0100, Paul Brook wrote:
> > How exactly does it introduce additional latency? A scsi command block is 
> > hardly large or complicated. Are you suggesting that a 16/32byte scsi command 
> > takes significantly longer to process than a 16byte virtio command 
> > descriptor? I'd expect any extra processing to be a small fraction of the 
> > host syscall latency, let alone the latency of the physical host adapter. It 
> > probably even fits on the same CPU cache line.
> 
> Encoding the scsi CDB is additional work, but I would be surprised if
> it is measurable.  Just using scsi CDBs would be simple enough; the
> bigger issue is emulating a full blown scsi bus, because then you need
> to make all kinds of queueing decisions at the target level etc. and
> drag in a complicated scsi stack rather than just a simple block driver
> in the guest.  And at least on current linux kernels that does
> introduce measurable latency.
>
> Now it might be possible to get that latency down to a level where we
> can ignore it, but when doing all this additional work there will
> always be additional overhead.
> 

/me puts on SCSI target mode hat

The other obvious benefit of allowing passthrough SCSI block devices
into KVM guests via virtio-blk is that at some point those SCSI block
devices could be coming from a local target mode stack that is
representing LVM block devices as SCSI-3 storage, or say FILEIO on top
of a btrfs mount also representing SCSI-3 storage, along with the
typical hardware passthroughs for local (KVM host) accessible SCSI
devices.

The important part is that KVM guests using SG_IO passthrough with
virtio-blk (assuming the devices actually show up as SCSI devices in
the KVM guest) would be able to take advantage of existing Linux SCSI
I/O fencing functionality that is used for H/A, and for cluster
filesystems like GFS in RHEL, etc.

Also, using scsi_dh_alua in Linux KVM guests against SCSI-3 compatible
block devices using the passthrough means you could do some interesting
things with controlling bandwidth and paths using asymmetric access
port states from local storage into the Linux KVM guest.

How much latency overhead a SCSI passthrough (with a pseudo SCSI bus) in
the KVM guest would add compared to native virtio-blk would be the key
metric.  Now if the KVM SCSI passthrough was talking in SCSI-target mode
directly to drivers/scsi (instead of going through Block -> SCSI for
example) it would actually involve less overhead on the KVM host IMHO.

Using BLOCK and FILEIO devices that have SCSI-3 emulation on top (and
then passing them into the KVM guest) would obviously add processing
overhead, but for those folks who are interested in SCSI I/O fencing in
the KVM guest using existing tools, the performance overhead would be a
reasonable tradeoff on machines with a fast memory and point-to-point
bus architecture.

--nab

Paul Brook April 30, 2009, 9:49 p.m. UTC | #18
On Thursday 30 April 2009, Christoph Hellwig wrote:
> On Wed, Apr 29, 2009 at 12:37:20PM +0100, Paul Brook wrote:
> > How exactly does it introduce additional latency? A scsi command block is
> > hardly large or complicated. Are you suggesting that a 16/32byte scsi
> > command takes significantly longer to process than a 16byte virtio
> > command descriptor? I'd expect any extra processing to be a small
> > fraction of the host syscall latency, let alone the latency of the
> > physical host adapter. It probably even fits on the same CPU cache line.
>
> Encoding the scsi CDB is additional work, but I would be surprised if
> it is measurable.  Just using scsi CDBs would be simple enough; the
> bigger issue is emulating a full blown scsi bus, because then you need
> to make all kinds of queueing decisions at the target level etc. and
> drag in a complicated scsi stack rather than just a simple block driver
> in the guest.  And at least on current linux kernels that does
> introduce measurable latency.

Only if you emulate a crufty old parallel scsi bus, and that's just silly.
One of the nice things about scsi is it separates the command set from the 
transport layer. cf. USB mass-storage, SAS, SBP2(firewire), and probably 
several others I've forgotten.

Paul
Javier Guerra April 30, 2009, 9:56 p.m. UTC | #19
On Thu, Apr 30, 2009 at 4:49 PM, Paul Brook <paul@codesourcery.com> wrote:
> One of the nice things about scsi is it separates the command set from the
> transport layer. cf. USB mass-storage, SAS, SBP2(firewire), and probably
> several others I've forgotten.

ATAPI, SCSIoFC, SCSIoIB....
Christoph Hellwig May 1, 2009, 7:24 a.m. UTC | #20
On Thu, Apr 30, 2009 at 10:49:19PM +0100, Paul Brook wrote:
> Only if you emulate a crufty old parallel scsi bus, and that's just silly.
> One of the nice things about scsi is it separates the command set from the 
> transport layer. cf. USB mass-storage, SAS, SBP2(firewire), and probably 
> several others I've forgotten.

It has nothing to do with an SPI bus.  Everything that resembles a SAM
architecture can have multiple LUs per target, and multiple targets per
initiator port, so we need all the complex queueing code, and we need
error handling, and so on.  For some really simple ones like USB
mass-storage we might get away with some ad-hoc SCSI CDB generation
instead of a real scsi stack, but I would recommend against it.

Jamie Lokier May 1, 2009, 11:08 a.m. UTC | #21
Christoph Hellwig wrote:
> On Thu, Apr 30, 2009 at 10:49:19PM +0100, Paul Brook wrote:
> > Only if you emulate a crufty old parallel scsi bus, and that's just silly.
> > One of the nice things about scsi is it separates the command set from the 
> > transport layer. cf. USB mass-storage, SAS, SBP2(firewire), and probably 
> > several others I've forgotten.
> 
> It has nothing to do with an SPI bus.  Everything that resembles a SAM
> architecture can have multiple LUs per target, and multiple targets per
> initiator port, so we need all the complex queueing code, and we need
> error handling, and so on.

If you're using virtio-block to connect to lots of LUNs on lots of
targets (i.e. lots of block devices), don't you need similar queuing
code and error handling for all that too?

-- Jamie
Christoph Hellwig May 1, 2009, 2:28 p.m. UTC | #22
On Fri, May 01, 2009 at 12:08:06PM +0100, Jamie Lokier wrote:
> If you're using virtio-block to connect to lots of LUNs on lots of
> targets (i.e. lots of block devices), don't you need similar queuing
> code and error handling for all that too?

Currently there's a 1:1 relation of virtio drivers and virtio queues.

Note that I don't argue against virtio-scsi; I brought this up first.
I just think that virtio-blk is actually nicer for some use cases.
(And of course we'll have to keep it for backwards compatibility anyway.)
Rusty Russell May 5, 2009, 5:21 a.m. UTC | #23
On Fri, 1 May 2009 05:43:50 am Christoph Hellwig wrote:
> On Wed, Apr 29, 2009 at 12:37:20PM +0100, Paul Brook wrote:
> > How exactly does it introduce additional latency? A scsi command block is
> > hardly large or complicated. Are you suggesting that a 16/32byte scsi
> > command takes significantly longer to process than a 16byte virtio
> > command descriptor? I'd expect any extra processing to be a small
> > fraction of the host syscall latency, let alone the latency of the
> > physical host adapter. It probably even fits on the same CPU cache line.
>
> Encoding the scsi CDB is additional work, but I would be surprised if
> it is measurable.  Just using scsi CDBs would be simple enough; the
> bigger issue is emulating a full blown scsi bus, because then you need
> to make all kinds of queueing decisions at the target level etc. and
> drag in a complicated scsi stack rather than just a simple block driver
> in the guest.  And at least on current linux kernels that does
> introduce measurable latency.
>
> Now it might be possible to get that latency down to a level where we
> can ignore it, but when doing all this additional work there will
> always be additional overhead.

But Paul, if you're enthusiastic you could patch lguest and show how simple
it is :)

Cheers,
Rusty.

Patch

Index: qemu/hw/virtio-blk.h
===================================================================
--- qemu.orig/hw/virtio-blk.h	2009-04-26 16:50:38.154074532 +0200
+++ qemu/hw/virtio-blk.h	2009-04-26 22:51:16.838076869 +0200
@@ -28,6 +28,9 @@ 
 #define VIRTIO_BLK_F_SIZE_MAX   1       /* Indicates maximum segment size */
 #define VIRTIO_BLK_F_SEG_MAX    2       /* Indicates maximum # of segments */
 #define VIRTIO_BLK_F_GEOMETRY   4       /* Indicates support of legacy geometry */
+#define VIRTIO_BLK_F_RO         5       /* Disk is read-only */
+#define VIRTIO_BLK_F_BLK_SIZE   6       /* Block size of disk is available */
+#define VIRTIO_BLK_F_SCSI       7       /* Supports scsi command passthru */
 
 struct virtio_blk_config
 {
@@ -70,6 +73,15 @@  struct virtio_blk_inhdr
     unsigned char status;
 };
 
+/* SCSI pass-through header */
+struct virtio_scsi_inhdr
+{
+    uint32_t errors;
+    uint32_t data_len;
+    uint32_t sense_len;
+    uint32_t residual;
+};
+
 void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs);
 
 #endif
Index: qemu/hw/virtio-blk.c
===================================================================
--- qemu.orig/hw/virtio-blk.c	2009-04-26 16:50:38.160074667 +0200
+++ qemu/hw/virtio-blk.c	2009-04-27 10:25:19.278074514 +0200
@@ -15,6 +15,9 @@ 
 #include <sysemu.h>
 #include "virtio-blk.h"
 #include "block_int.h"
+#ifdef __linux__
+# include <scsi/sg.h>
+#endif
 
 typedef struct VirtIOBlock
 {
@@ -35,6 +38,7 @@  typedef struct VirtIOBlockReq
     VirtQueueElement elem;
     struct virtio_blk_inhdr *in;
     struct virtio_blk_outhdr *out;
+    struct virtio_scsi_inhdr *scsi;
     QEMUIOVector qiov;
     struct VirtIOBlockReq *next;
 } VirtIOBlockReq;
@@ -103,6 +107,108 @@  static VirtIOBlockReq *virtio_blk_get_re
     return req;
 }
 
+#ifdef __linux__
+static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
+{
+    struct sg_io_hdr hdr;
+    int ret, size = 0;
+    int status;
+    int i;
+
+    /*
+     * We require at least one output segment each for the virtio_blk_outhdr
+     * and the SCSI command block.
+     *
+     * We also at least require the virtio_blk_inhdr, the virtio_scsi_inhdr
+     * and the sense buffer pointer in the input segments.
+     */
+    if (req->elem.out_num < 2 || req->elem.in_num < 3) {
+        virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+        return;
+    }
+
+    /*
+     * No support for bidirectional commands yet.
+     */
+    if (req->elem.out_num > 2 && req->elem.in_num > 3) {
+        virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
+        return;
+    }
+
+    /*
+     * The scsi inhdr is placed in the second-to-last input segment, just
+     * before the regular inhdr.
+     */
+    req->scsi = (void *)req->elem.in_sg[req->elem.in_num - 2].iov_base;
+    size = sizeof(*req->in) + sizeof(*req->scsi);
+
+    memset(&hdr, 0, sizeof(struct sg_io_hdr));
+    hdr.interface_id = 'S';
+    hdr.cmd_len = req->elem.out_sg[1].iov_len;
+    hdr.cmdp = req->elem.out_sg[1].iov_base;
+    hdr.dxfer_len = 0;
+
+    if (req->elem.out_num > 2) {
+        /*
+         * If there are more than the minimally required 2 output segments
+         * there is write payload starting from the third iovec.
+         */
+        hdr.dxfer_direction = SG_DXFER_TO_DEV;
+        hdr.iovec_count = req->elem.out_num - 2;
+
+        for (i = 0; i < hdr.iovec_count; i++)
+            hdr.dxfer_len += req->elem.out_sg[i + 2].iov_len;
+
+        hdr.dxferp = req->elem.out_sg + 2;
+
+    } else if (req->elem.in_num > 3) {
+        /*
+         * If we have more than 3 input segments the guest wants to actually
+         * read data.
+         */
+        hdr.dxfer_direction = SG_DXFER_FROM_DEV;
+        hdr.iovec_count = req->elem.in_num - 3;
+        for (i = 0; i < hdr.iovec_count; i++)
+            hdr.dxfer_len += req->elem.in_sg[i].iov_len;
+
+        hdr.dxferp = req->elem.in_sg;
+        size += hdr.dxfer_len;
+    } else {
+        /*
+         * Some SCSI commands don't actually transfer any data.
+         */
+        hdr.dxfer_direction = SG_DXFER_NONE;
+    }
+
+    hdr.sbp = req->elem.in_sg[req->elem.in_num - 3].iov_base;
+    hdr.mx_sb_len = req->elem.in_sg[req->elem.in_num - 3].iov_len;
+    size += hdr.mx_sb_len;
+
+    ret = bdrv_ioctl(req->dev->bs, SG_IO, &hdr);
+    if (ret) {
+        status = VIRTIO_BLK_S_UNSUPP;
+        hdr.status = ret;
+        hdr.resid = hdr.dxfer_len;
+    } else if (hdr.status) {
+        status = VIRTIO_BLK_S_IOERR;
+    } else {
+        status = VIRTIO_BLK_S_OK;
+    }
+
+    req->scsi->errors = hdr.status;
+    req->scsi->residual = hdr.resid;
+    req->scsi->sense_len = hdr.sb_len_wr;
+    req->scsi->data_len = hdr.dxfer_len;
+
+    virtio_blk_req_complete(req, status);
+}
+#else
+static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
+{
+    virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
+}
+#endif /* __linux__ */
+
 static void virtio_blk_handle_write(VirtIOBlockReq *req)
 {
     bdrv_aio_writev(req->dev->bs, req->out->sector, &req->qiov,
@@ -136,12 +242,7 @@  static void virtio_blk_handle_output(Vir
         req->in = (void *)req->elem.in_sg[req->elem.in_num - 1].iov_base;
 
         if (req->out->type & VIRTIO_BLK_T_SCSI_CMD) {
-            unsigned int len = sizeof(*req->in);
-
-            req->in->status = VIRTIO_BLK_S_UNSUPP;
-            virtqueue_push(vq, &req->elem, len);
-            virtio_notify(vdev, vq);
-            qemu_free(req);
+            virtio_blk_handle_scsi(req);
         } else if (req->out->type & VIRTIO_BLK_T_OUT) {
             qemu_iovec_init_external(&req->qiov, &req->elem.out_sg[1],
                                      req->elem.out_num - 1);
@@ -203,7 +304,15 @@  static void virtio_blk_update_config(Vir
 
 static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 {
-    return (1 << VIRTIO_BLK_F_SEG_MAX | 1 << VIRTIO_BLK_F_GEOMETRY);
+    uint32_t features = 0;
+
+    features |= (1 << VIRTIO_BLK_F_SEG_MAX);
+    features |= (1 << VIRTIO_BLK_F_GEOMETRY);
+#ifdef __linux__
+    features |= (1 << VIRTIO_BLK_F_SCSI);
+#endif
+
+    return features;
 }
 
 static void virtio_blk_save(QEMUFile *f, void *opaque)