
Add shared memory PCI device that shares a memory object between VMs

Message ID 1238600608-9120-1-git-send-email-cam@cs.ualberta.ca (mailing list archive)
State New, archived

Commit Message

Cam Macdonell April 1, 2009, 3:43 p.m. UTC
This patch supports sharing memory between VMs and between the host/VM.  It's a first 
cut and comments are encouraged.  The goal is to support simple Inter-VM communication
with zero-copy access to shared memory.

The patch adds the switch -ivshmem (short for Inter-VM shared memory), which is 
used as follows: "-ivshmem file,size".  

The shared memory object named 'file' will be created/opened and mapped onto a
PCI memory device with size 'size'.  The PCI device has two BARs, BAR0 for
registers and BAR1 for the memory region that maps the file above.  The memory
region can be mmapped into userspace on the guest (or read and written if you want). 
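
For example, a guest userspace program could map BAR1 through sysfs with
something roughly like this (just a sketch -- the 0000:00:04.0 slot and the
1 MB size are placeholders for wherever the device actually shows up and
whatever size was passed to -ivshmem):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical PCI slot; look it up under /sys/bus/pci/devices/. */
    const char *path = "/sys/bus/pci/devices/0000:00:04.0/resource1";
    size_t size = 1024 * 1024;          /* must match the -ivshmem size */
    int fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Map the shared-memory BAR directly into this process. */
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(mem, "hello from the guest");   /* visible to the host and other VMs */
    munmap(mem, size);
    close(fd);
    return 0;
}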

The register region will eventually be used to support interrupts which are
communicated via unix domain sockets, but I need some tips on how to do this
using a qemu character device. 

Also, feel free to suggest a better name if you have one.

Thanks,
Cam

---
 qemu/Makefile.target |    2 +
 qemu/hw/ivshmem.c    |  363 ++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu/hw/pc.c         |    6 +
 qemu/hw/pc.h         |    3 +
 qemu/qemu-options.hx |   10 ++
 qemu/sysemu.h        |    7 +
 qemu/vl.c            |   12 ++
 7 files changed, 403 insertions(+), 0 deletions(-)
 create mode 100644 qemu/hw/ivshmem.c

Comments

Anthony Liguori April 1, 2009, 4:29 p.m. UTC | #1
Hi Cam,

Cam Macdonell wrote:
> This patch supports sharing memory between VMs and between the host/VM.  It's a first 
> cut and comments are encouraged.  The goal is to support simple Inter-VM communication
> with zero-copy access to shared memory.
>   

Nice work!

I would suggest two design changes to make here.  The first is that I 
think you should use virtio.  The second is that I think instead of 
relying on mapping in device memory to the guest, you should have the 
guest allocate its own memory to dedicate to sharing.

A lot of what you're doing is duplicating functionality in virtio-pci.  
You can also obtain greater portability by building the drivers with 
virtio.  It may not seem obvious how to make the memory sharing via BAR 
fit into virtio, but if you follow my second suggestion, it will be a 
lot easier.

Right now, you've got a bit of a hole in your implementation because you 
only support files that are powers-of-two in size even though that's not 
documented/enforced.  This is a limitation of PCI resource regions.  
Also, the PCI memory hole is limited in size today which is going to put 
an upper bound on the amount of memory you could ever map into a guest.  
Since you're using qemu_ram_alloc() also, it makes hotplug unworkable 
too since qemu_ram_alloc() is a static allocation from a contiguous heap.

If you used virtio, what you could do is provide a ring queue that was 
used to communicate a series of requests/response.  The exchange might 
look like this:

guest: REQ discover memory region
host: RSP memory region id: 4 size: 8k
guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), 
(addr=944000,size=4k)}
host: RSP mapped region id: 4
guest: REQ notify region id: 4
host: RSP notify region id: 4
guest: REQ poll region id: 4
host: RSP poll region id: 4

And the REQ/RSP order does not have to be in series like this.  In 
general, you need one entry on the queue to poll for new memory regions, 
one entry for each mapped region to poll for incoming notification, and 
then the remaining entries can be used to send short-lived 
requests/responses.

It's important that the REQ map takes a scatter/gather list of physical 
addresses because after running for a while, it's unlikely that you'll 
be able to allocate any significant size of contiguous memory.
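
Just to illustrate the shape of it, the map request could carry something
like the following (a purely hypothetical layout -- nothing like this exists
in virtio today):

#include <stdint.h>

/* Hypothetical wire format for the map REQ above. */
struct shmem_sg_entry {
    uint64_t addr;               /* guest-physical address of one chunk */
    uint64_t len;                /* length of the chunk in bytes */
};

struct shmem_map_req {
    uint32_t region_id;          /* id returned by the discover REQ */
    uint32_t nr_entries;         /* number of entries in sgl[] */
    struct shmem_sg_entry sgl[]; /* scatter/gather list covering the region */
};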

 From a QEMU perspective, you would do memory sharing by waiting for a 
map REQ from the guest and then you would complete the request by doing 
an mmap(MAP_FIXED) with the appropriate parameters into phys_ram_base.
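
Roughly like this (a sketch only; the fd, offsets and lengths are assumed to
come from the shared memory object and from translating the guest's sgl
entries):

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Sketch: overlay one sgl entry of guest RAM with the shared backing file. */
static int shmem_complete_map(uint8_t *phys_ram_base, unsigned long guest_paddr,
                              size_t len, int fd, off_t file_offset)
{
    void *target = phys_ram_base + guest_paddr;  /* assumes a flat RAM layout */

    if (mmap(target, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, file_offset) == MAP_FAILED)
        return -1;                               /* fail the RSP back to the guest */

    return 0;
}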

Notifications are a topic for discussion I think.  A CharDriverState 
could be used, but I think it would be more interesting to do something 
like an fd passed via SCM_RIGHTS so that eventfd can be used.
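
The eventfd half of that is tiny -- something like this on a reasonably
recent Linux host (sketch only):

#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* One side creates the eventfd and passes it to the other (e.g. over
 * SCM_RIGHTS); after that a notification is just a write and a read. */
static int make_notify_fd(void)
{
    return eventfd(0, 0);
}

static void notify(int efd)
{
    uint64_t one = 1;
    write(efd, &one, sizeof(one));        /* kick the peer */
}

static uint64_t wait_for_notify(int efd)
{
    uint64_t count;
    read(efd, &count, sizeof(count));     /* blocks; returns the pending kick count */
    return count;
}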

To simplify things, I'd suggest starting out only supporting one memory 
region mapping.

Regards,

Anthony Liguori
Avi Kivity April 1, 2009, 6:07 p.m. UTC | #2
Anthony Liguori wrote:
> Hi Cam,
>
> Cam Macdonell wrote:
>> This patch supports sharing memory between VMs and between the 
>> host/VM.  It's a first cut and comments are encouraged.  The goal is 
>> to support simple Inter-VM communication
>> with zero-copy access to shared memory.
>>   
>
> Nice work!
>
> I would suggest two design changes to make here.  The first is that I 
> think you should use virtio.

I disagree with this.  While virtio is excellent at exporting guest 
memory, it isn't so good at importing another guest's memory.

>   The second is that I think instead of relying on mapping in device 
> memory to the guest, you should have the guest allocate its own 
> memory to dedicate to sharing.

That's not what you describe below.  You're having the guest allocate 
parts of its address space that happen to be used by RAM, and overlaying 
those parts with the shared memory.

> Right now, you've got a bit of a hole in your implementation because 
> you only support files that are powers-of-two in size even though 
> that's not documented/enforced.  This is a limitation of PCI resource 
> regions.  

While the BAR needs to be a power of two, I don't think the RAM backing 
it needs to be.

> Also, the PCI memory hole is limited in size today which is going to 
> put an upper bound on the amount of memory you could ever map into a 
> guest.  

Today.  We could easily lift this restriction by supporting 64-bit 
BARs.  It would probably take only a few lines of code.

> Since you're using qemu_ram_alloc() also, it makes hotplug unworkable 
> too since qemu_ram_alloc() is a static allocation from a contiguous heap.

We need to fix this anyway, for memory hotplug.

>
> If you used virtio, what you could do is provide a ring queue that was 
> used to communicate a series of requests/response.  The exchange might 
> look like this:
>
> guest: REQ discover memory region
> host: RSP memory region id: 4 size: 8k
> guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), 
> (addr=944000,size=4k)}
> host: RSP mapped region id: 4
> guest: REQ notify region id: 4
> host: RSP notify region id: 4
> guest: REQ poll region id: 4
> host: RSP poll region id: 4

That looks significantly more complex.

>
> And the REQ/RSP order does not have to be in series like this.  In 
> general, you need one entry on the queue to poll for new memory 
> regions, one entry for each mapped region to poll for incoming 
> notification, and then the remaining entries can be used to send 
> short-lived requests/responses.
>
> It's important that the REQ map takes a scatter/gather list of 
> physical addresses because after running for a while, it's unlikely 
> that you'll be able to allocate any significant size of contiguous 
> memory.
>
> From a QEMU perspective, you would do memory sharing by waiting for a 
> map REQ from the guest and then you would complete the request by 
> doing an mmap(MAP_FIXED) with the appropriate parameters into 
> phys_ram_base.

That will fragment the vma list.  And what do you do when you unmap the 
region?

How does a 256M guest map 1G of shared memory?
Anthony Liguori April 1, 2009, 6:52 p.m. UTC | #3
Avi Kivity wrote:
> Anthony Liguori wrote:
>> Hi Cam,
>>
>>
>> I would suggest two design changes to make here.  The first is that I 
>> think you should use virtio.
>
> I disagree with this.  While virtio is excellent at exporting guest 
> memory, it isn't so good at importing another guest's memory.

First we need to separate static memory sharing and dynamic memory 
sharing.  Static memory sharing has to be configured on start up.  I 
think in practice, static memory sharing is not terribly interesting 
except for maybe embedded environments.

Dynamic memory sharing requires bidirectional communication in order 
to establish mappings and tear down mappings.  You'll eventually 
recreate virtio once you've implemented this communication mechanism.

>>   The second is that I think instead of relying on mapping in device 
>> memory to the guest, you should have the guest allocate its own 
>> memory to dedicate to sharing.
>
> That's not what you describe below.  You're having the guest allocate 
> parts of its address space that happen to be used by RAM, and 
> overlaying those parts with the shared memory.

But from the guest's perspective, its RAM is being used for memory sharing.

If you're clever, you could start a guest with -mem-path and then use 
this mechanism to map a portion of one guest's memory into another guest 
without either guest ever knowing who "owns" the memory and with exactly 
the same driver on both.

>> Right now, you've got a bit of a hole in your implementation because 
>> you only support files that are powers-of-two in size even though 
>> that's not documented/enforced.  This is a limitation of PCI resource 
>> regions.  
>
> While the BAR needs to be a power of two, I don't think the RAM 
> backing it needs to be.

Then you need a side channel to communicate the information to the guest.

>> Also, the PCI memory hole is limited in size today which is going to 
>> put an upper bound on the amount of memory you could ever map into a 
>> guest.  
>
> Today.  We could easily lift this restriction by supporting 64-bit 
> BARs.  It would probably take only a few lines of code.
>
>> Since you're using qemu_ram_alloc() also, it makes hotplug unworkable 
>> too since qemu_ram_alloc() is a static allocation from a contiguous 
>> heap.
>
> We need to fix this anyway, for memory hotplug.

It's going to be hard to "fix" with TCG.

>> If you used virtio, what you could do is provide a ring queue that 
>> was used to communicate a series of requests/response.  The exchange 
>> might look like this:
>>
>> guest: REQ discover memory region
>> host: RSP memory region id: 4 size: 8k
>> guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), 
>> (addr=944000,size=4k)}
>> host: RSP mapped region id: 4
>> guest: REQ notify region id: 4
>> host: RSP notify region id: 4
>> guest: REQ poll region id: 4
>> host: RSP poll region id: 4
>
> That looks significantly more complex.

It's also supporting dynamic shared memory.  If you do use BARs, then 
perhaps you'd just do PCI hotplug to make things dynamic.

>>
>> And the REQ/RSP order does not have to be in series like this.  In 
>> general, you need one entry on the queue to poll for new memory 
>> regions, one entry for each mapped region to poll for incoming 
>> notification, and then the remaining entries can be used to send 
>> short-lived requests/responses.
>>
>> It's important that the REQ map takes a scatter/gather list of 
>> physical addresses because after running for a while, it's unlikely 
>> that you'll be able to allocate any significant size of contiguous 
>> memory.
>>
>> From a QEMU perspective, you would do memory sharing by waiting for a 
>> map REQ from the guest and then you would complete the request by 
>> doing an mmap(MAP_FIXED) with the appropriate parameters into 
>> phys_ram_base.
>
> That will fragment the vma list.  And what do you do when you unmap 
> the region?
>
> How does a 256M guest map 1G of shared memory?

It doesn't but it couldn't today either b/c of the 32-bit BARs.

Regards,

Anthony Liguori

Cam Macdonell April 1, 2009, 8:32 p.m. UTC | #4
Hi Anthony and Avi,

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> Hi Cam,
>>>
>>>
>>> I would suggest two design changes to make here.  The first is that I 
>>> think you should use virtio.
>>
>> I disagree with this.  While virtio is excellent at exporting guest 
>> memory, it isn't so good at importing another guest's memory.
> 
> First we need to separate static memory sharing and dynamic memory 
> sharing.  Static memory sharing has to be configured on start up.  I 
> think in practice, static memory sharing is not terribly interesting 
> except for maybe embedded environments.

I think there is value for static memory sharing.   It can be used for 
fast, simple synchronization and communication between guests (and the 
host) that need to share data that must be updated frequently 
(such as a simple cache or notification system).  It may not be a common 
task, but I think static sharing has its place and that's what this 
device is for at this point.

> Dynamic memory sharing requires bidirectional communication in order 
> to establish mappings and tear down mappings.  You'll eventually 
> recreate virtio once you've implemented this communication mechanism.

>>>   The second is that I think instead of relying on mapping in device 
>>> memory to the guest, you should have the guest allocate its own 
>>> memory to dedicate to sharing.
>>
>> That's not what you describe below.  You're having the guest allocate 
>> parts of its address space that happen to be used by RAM, and 
>> overlaying those parts with the shared memory.
> 
> But from the guest's perspective, its RAM is being used for memory 
> sharing.
> 
> If you're clever, you could start a guest with -mem-path and then use 
> this mechanism to map a portion of one guest's memory into another guest 
> without either guest ever knowing who "owns" the memory and with exactly 
> the same driver on both.
> 
>>> Right now, you've got a bit of a hole in your implementation because 
>>> you only support files that are powers-of-two in size even though 
>>> that's not documented/enforced.  This is a limitation of PCI resource 
>>> regions.  
>>
>> While the BAR needs to be a power of two, I don't think the RAM 
>> backing it needs to be.
> 
> Then you need a side channel to communicate the information to the guest.

Couldn't one of the registers in BAR0 be used to store the actual 
(non-power-of-two) size?

>>> Also, the PCI memory hole is limited in size today which is going to 
>>> put an upper bound on the amount of memory you could ever map into a 
>>> guest.  
>>
>> Today.  We could easily lift this restriction by supporting 64-bit 
>> BARs.  It would probably take only a few lines of code.
>>
>>> Since you're using qemu_ram_alloc() also, it makes hotplug unworkable 
>>> too since qemu_ram_alloc() is a static allocation from a contiguous 
>>> heap.
>>
>> We need to fix this anyway, for memory hotplug.
> 
> It's going to be hard to "fix" with TCG.
> 
>>> If you used virtio, what you could do is provide a ring queue that 
>>> was used to communicate a series of requests/response.  The exchange 
>>> might look like this:
>>>
>>> guest: REQ discover memory region
>>> host: RSP memory region id: 4 size: 8k
>>> guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), 
>>> (addr=944000,size=4k)}
>>> host: RSP mapped region id: 4
>>> guest: REQ notify region id: 4
>>> host: RSP notify region id: 4
>>> guest: REQ poll region id: 4
>>> host: RSP poll region id: 4
>>
>> That looks significantly more complex.
> 
> It's also supporting dynamic shared memory.  If you do use BARs, then 
> perhaps you'd just do PCI hotplug to make things dynamic.
> 
>>>
>>> And the REQ/RSP order does not have to be in series like this.  In 
>>> general, you need one entry on the queue to poll for new memory 
>>> regions, one entry for each mapped region to poll for incoming 
>>> notification, and then the remaining entries can be used to send 
>>> short-lived requests/responses.
>>>
>>> It's important that the REQ map takes a scatter/gather list of 
>>> physical addresses because after running for a while, it's unlikely 
>>> that you'll be able to allocate any significant size of contiguous 
>>> memory.
>>>
>>> From a QEMU perspective, you would do memory sharing by waiting for a 
>>> map REQ from the guest and then you would complete the request by 
>>> doing an mmap(MAP_FIXED) with the appropriate parameters into 
>>> phys_ram_base.
>>
>> That will fragment the vma list.  And what do you do when you unmap 
>> the region?
>>
>> How does a 256M guest map 1G of shared memory?
> 
> It doesn't but it couldn't today either b/c of the 32-bit BARs.
> 

Cam
Avi Kivity April 2, 2009, 7:05 a.m. UTC | #5
Anthony Liguori wrote:
>> I disagree with this.  While virtio is excellent at exporting guest 
>> memory, it isn't so good at importing another guest's memory.
>
> First we need to separate static memory sharing and dynamic memory 
> sharing.  Static memory sharing has to be configured on start up.  I 
> think in practice, static memory sharing is not terribly interesting 
> except for maybe embedded environments.
>
> Dynamic memory sharing requires bidirectional communication in 
> order to establish mappings and tear down mappings.  You'll eventually 
> recreate virtio once you've implemented this communication mechanism.
>

I guess that depends on what one uses shared memory for.

Cam?

>>>   The second is that I think instead of relying on mapping in device 
>>> memory to the guest, you should have the guest allocate its own 
>>> memory to dedicate to sharing.
>>
>> That's not what you describe below.  You're having the guest allocate 
>> parts of its address space that happen to be used by RAM, and 
>> overlaying those parts with the shared memory.
>
> But from the guest's perspective, its RAM is being used for memory 
> sharing.
>
> If you're clever, you could start a guest with -mem-path and then use 
> this mechanism to map a portion of one guest's memory into another 
> guest without either guest ever knowing who "owns" the memory and with 
> exactly the same driver on both.
>

If it's part of the normal address space, it will just confuse the 
guest.  Consider for example a reboot.

Shared memory is not normal RAM!

>>> Right now, you've got a bit of a hole in your implementation because 
>>> you only support files that are powers-of-two in size even though 
>>> that's not documented/enforced.  This is a limitation of PCI 
>>> resource regions.  
>>
>> While the BAR needs to be a power of two, I don't think the RAM 
>> backing it needs to be.
>
> Then you need a side channel to communicate the information to the guest.

There is the PCI config space for that.

>>> Since you're using qemu_ram_alloc() also, it makes hotplug 
>>> unworkable too since qemu_ram_alloc() is a static allocation from a 
>>> contiguous heap.
>>
>> We need to fix this anyway, for memory hotplug.
>
> It's going to be hard to "fix" with TCG.
>

Why?  Instead of an offset against phys_ram_base you'd store an offset 
against (char *)0 in the tlb.  Where do you see an issue?

>>
>> That will fragment the vma list.  And what do you do when you unmap 
>> the region?

^^^

>>
>> How does a 256M guest map 1G of shared memory?
>
> It doesn't but it couldn't today either b/c of the 32-bit BARs.

Let's compare the two approaches, not how they fit or don't fit random 
qemu limitations which need lifting anyway.
Avi Kivity April 2, 2009, 7:07 a.m. UTC | #6
Cam Macdonell wrote:
> I think there is value for static memory sharing.   It can be used for 
> fast, simple synchronization and communication between guests (and the 
> host) that need to share data that must be updated frequently 
> (such as a simple cache or notification system).  It may not be a 
> common task, but I think static sharing has its place and that's what 
> this device is for at this point.

It would be good to detail a use case for reference.


>> Then you need a side channel to communicate the information to the 
>> guest.
>
> Couldn't one of the registers in BAR0 be used to store the actual 
> (non-power-of-two) size?

The PCI config space (where the BARs reside) is a good place for it.  
Registers 0x40+ are device specific IIRC.
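
For illustration, if the device stashed the real size as a 32-bit value at,
say, offset 0x40, the guest driver could read it back with something like
this (the offset and names are made up):

#include <linux/pci.h>

#define IVSHMEM_CFG_REAL_SIZE  0x40   /* hypothetical device-specific register */

static u32 ivshmem_read_real_size(struct pci_dev *pdev)
{
    u32 size;

    pci_read_config_dword(pdev, IVSHMEM_CFG_REAL_SIZE, &size);
    return size;
}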
Cam Macdonell April 3, 2009, 4:54 p.m. UTC | #7
Avi Kivity wrote:
> Cam Macdonell wrote:
>> I think there is value for static memory sharing.   It can be used for 
>> fast, simple synchronization and communication between guests (and the 
>> host) that need to share data that must be updated frequently 
>> (such as a simple cache or notification system).  It may not be a 
>> common task, but I think static sharing has its place and that's what 
>> this device is for at this point.
> 
> It would be good to detail a use case for reference.

I'll try my best...

We are using the (static) shared memory region for fast, interprocess 
communications (IPC).  Of course, shared-memory IPC is an old idea, and 
the IPC here is actually between VMs (i.e., ivshmem), not processes 
inside a single VM.  But, fast IPC is useful for shared caches, OS 
bypass (guest-to-guest, and host-to-guest), and low-latency IPC use-cases.

For example, one use of ivshmem is as a file cache between VMs.  Note 
that, unlike stream-oriented IPC, this file cache can be shared between, 
say, four VMs simultaneously.  In using VMs as sandboxes for distributed 
computing (condor, cloud, etc.), if two (or more) VMs are co-located on 
the same server, they can effectively share a single, unified cache. 
Any VM can bring in the data, and other VMs can use it.  Otherwise, two 
VMs might each transfer the same file (over the WAN, in the worst case, as 
in a cloud) and buffer-cache it separately.  In some ways, the 
configuration would look like an in-memory cluster file system, but 
instead of shared disks, we have shared memory.

Alternative forms of file sharing between VMs (e.g., via SAMBA or NFS) 
are possible, but they also result in multiple cached copies of the same 
file data on the same physical server.  Furthermore, ivshmem has the 
(usual, planned) latency (e.g., for file metadata stats) and bandwidth 
advantages over most forms of stream-oriented IPC used by file-sharing 
protocols.

Other (related) use cases include bulk-data movement between the host 
and guest VMs, due to the OS bypass properties of ivshmem.  Since 
static shared memory shares a file (or memory object) on the host, 
host-guest sharing is simpler than with dynamic shared memory.

We acknowledge that work has to be done with thread/process scheduling 
to truly gain low IPC latency; that is to come, possibly with
PCI interrupts.  And, as the VMware experience shows (see below), VM 
migration *is* affected by ivshmem, but we think a good (but 
non-trivial) attach-to-ivshmem and detach-from-ivshmem protocol (in the 
future) can mostly address that issue.

As an aside, VMware ditched shared memory as part of their VMCI 
interface.  We emailed with some of their people, who suggested using 
sockets since shared memory "de-virtualizes" the VM (i.e. it breaks 
migration).  But on their forums there were users that used shared 
memory for their work and were disappointed to see it go.  One person I 
emailed with used shared memory for simulations running across VMs. 
Using shared memory freed him from having to come up with a protocol to 
exchange updates and having a central VM responsible for receiving and 
broadcasting updates.  When he did try to use a socket-based approach, 
the performance dropped substantially due to the communication overhead.

>>> Then you need a side channel to communicate the information to the 
>>> guest.
>>
>> Couldn't one of the registers in BAR0 be used to store the actual 
>> (non-power-of-two) size?
> 
> The PCI config space (where the BARs reside) is a good place for it.  
> Registers 0x40+ are device specific IIRC.
> 

Ok.

Cam
Cam Macdonell April 19, 2009, 5:22 a.m. UTC | #8
Hi Avi and Anthony,

Sorry for the top-reply, but we haven't discussed this aspect here  
before.

I've been thinking about how to implement interrupts.  As far as I can  
tell, unix domain sockets in Qemu/KVM are used point-to-point with one  
VM being the server by specifying "server" along with the unix:  
option.  This works simply for two VMs, but I'm unsure how this can  
extend to multiple VMs.  How would a server VM know how many clients  
to wait for?  How can messages then be multicast or broadcast?  Is a  
separate "interrupt server" necessary?

Thanks,
Cam

On 1-Apr-09, at 12:52 PM, Anthony Liguori wrote:

> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> Hi Cam,
>>>
>>>
>>> I would suggest two design changes to make here.  The first is  
>>> that I think you should use virtio.
>>
>> I disagree with this.  While virtio is excellent at exporting guest  
>> memory, it isn't so good at importing another guest's memory.
>
> First we need to separate static memory sharing and dynamic memory  
> sharing.  Static memory sharing has to be configured on start up.  I  
> think in practice, static memory sharing is not terribly interesting  
> except for maybe embedded environments.
>
> Dynamic memory sharing requires bidirectional communication in 
> order to establish mappings and tear down mappings.  You'll  
> eventually recreate virtio once you've implemented this  
> communication mechanism.
>
>>>  The second is that I think instead of relying on mapping in  
>>> device memory to the guest, you should have the guest allocate  
>>> its own memory to dedicate to sharing.
>>
>> That's not what you describe below.  You're having the guest  
>> allocate parts of its address space that happen to be used by RAM,  
>> and overlaying those parts with the shared memory.
>
> But from the guest's perspective, its RAM is being used for memory 
> sharing.
>
> If you're clever, you could start a guest with -mem-path and then  
> use this mechanism to map a portion of one guest's memory into  
> another guest without either guest ever knowing who "owns" the  
> memory and with exactly the same driver on both.
>
>>> Right now, you've got a bit of a hole in your implementation  
>>> because you only support files that are powers-of-two in size even  
>>> though that's not documented/enforced.  This is a limitation of  
>>> PCI resource regions.
>>
>> While the BAR needs to be a power of two, I don't think the RAM  
>> backing it needs to be.
>
> Then you need a side channel to communicate the information to the  
> guest.
>
>>> Also, the PCI memory hole is limited in size today which is going  
>>> to put an upper bound on the amount of memory you could ever map  
>>> into a guest.
>>
>> Today.  We could easily lift this restriction by supporting 64-bit  
>> BARs.  It would probably take only a few lines of code.
>>
>>> Since you're using qemu_ram_alloc() also, it makes hotplug  
>>> unworkable too since qemu_ram_alloc() is a static allocation from  
>>> a contiguous heap.
>>
>> We need to fix this anyway, for memory hotplug.
>
> It's going to be hard to "fix" with TCG.
>
>>> If you used virtio, what you could do is provide a ring queue that  
>>> was used to communicate a series of requests/response.  The  
>>> exchange might look like this:
>>>
>>> guest: REQ discover memory region
>>> host: RSP memory region id: 4 size: 8k
>>> guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k),  
>>> (addr=944000,size=4k)}
>>> host: RSP mapped region id: 4
>>> guest: REQ notify region id: 4
>>> host: RSP notify region id: 4
>>> guest: REQ poll region id: 4
>>> host: RSP poll region id: 4
>>
>> That looks significantly more complex.
>
> It's also supporting dynamic shared memory.  If you do use BARs,  
> then perhaps you'd just do PCI hotplug to make things dynamic.
>
>>>
>>> And the REQ/RSP order does not have to be in series like this.  In  
>>> general, you need one entry on the queue to poll for new memory  
>>> regions, one entry for each mapped region to poll for incoming  
>>> notification, and then the remaining entries can be used to send  
>>> short-lived requests/responses.
>>>
>>> It's important that the REQ map takes a scatter/gather list of  
>>> physical addresses because after running for a while, it's  
>>> unlikely that you'll be able to allocate any significant size of  
>>> contiguous memory.
>>>
>>> From a QEMU perspective, you would do memory sharing by waiting  
>>> for a map REQ from the guest and then you would complete the  
>>> request by doing an mmap(MAP_FIXED) with the appropriate  
>>> parameters into phys_ram_base.
>>
>> That will fragment the vma list.  And what do you do when you unmap  
>> the region?
>>
>> How does a 256M guest map 1G of shared memory?
>
> It doesn't but it couldn't today either b/c of the 32-bit BARs.
>
> Regards,
>
> Anthony Liguori
>



-----------------------------------------------
A. Cameron Macdonell
Ph.D. Student
Department of Computing Science
University of Alberta
cam@cs.ualberta.ca



Avi Kivity April 19, 2009, 10:26 a.m. UTC | #9
Cameron Macdonell wrote:
>
> Hi Avi and Anthony,
>
> Sorry for the top-reply, but we haven't discussed this aspect here 
> before.
>
> I've been thinking about how to implement interrupts.  As far as I can 
> tell, unix domain sockets in Qemu/KVM are used point-to-point with one 
> VM being the server by specifying "server" along with the unix: 
> option.  This works simply for two VMs, but I'm unsure how this can 
> extend to multiple VMs.  How would a server VM know how many clients 
> to wait for?  How can messages then be multicast or broadcast?  Is a 
> separate "interrupt server" necessary?


I don't think unix provides a reliable multicast RPC.  So yes, an 
interrupt server seems necessary.

You could expand its role and make it a "shared memory PCI card server", 
and have it also be responsible for providing the backing file using an 
SCM_RIGHTS fd.  That would reduce setup headaches for users (setting up 
a file for which all VMs have permissions).
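
The fd hand-off itself is just the usual SCM_RIGHTS dance.  Roughly, on the
server side (a sketch only; 'sock' is assumed to be a connected unix domain
socket and 'shm_fd' the shared memory object):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send shm_fd to a connected VM; the receiver gets its own copy of the fd. */
static int send_shm_fd(int sock, int shm_fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };
    struct cmsghdr *cmsg;

    memset(ctrl, 0, sizeof(ctrl));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &shm_fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}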
Cam Macdonell April 22, 2009, 10:41 p.m. UTC | #10
subbu kl wrote:
> Correct me if I'm wrong:
> can we do the sharing business by writing a non-transparent qemu PCI 
> device in the host so that guests can access each other's address space?

Hi Subbu,

I'm a bit confused by your question.  Are you asking how this device 
works or suggesting an alternative approach?  I'm not sure what you mean 
by a non-transparent qemu device.

Cam

> 
> ~subbu
> 
> On Sun, Apr 19, 2009 at 3:56 PM, Avi Kivity <avi@redhat.com> wrote:
> 
>     Cameron Macdonell wrote:
> 
> 
>         Hi Avi and Anthony,
> 
>         Sorry for the top-reply, but we haven't discussed this aspect
>         here before.
> 
>         I've been thinking about how to implement interrupts.  As far as
>         I can tell, unix domain sockets in Qemu/KVM are used
>         point-to-point with one VM being the server by specifying
>         "server" along with the unix: option.  This works simply for two
>         VMs, but I'm unsure how this can extend to multiple VMs.  How
>         would a server VM know how many clients to wait for?  How can
>         messages then be multicast or broadcast?  Is a separate
>         "interrupt server" necessary?
> 
> 
> 
>     I don't think unix provides a reliable multicast RPC.  So yes, an
>     interrupt server seems necessary.
> 
>     You could expand its role and make it a "shared memory PCI card
>     server", and have it also be responsible for providing the backing
>     file using an SCM_RIGHTS fd.  That would reduce setup headaches for
>     users (setting up a file for which all VMs have permissions).
> 
>     -- 
>     Do not meddle in the internals of kernels, for they are subtle and
>     quick to panic.
> 
> 
> 
> 
> 
> 
> -- 
> ~subbu
Cam Macdonell April 23, 2009, 4:28 p.m. UTC | #11
subbu kl wrote:
> Cam,
> 
> just a wild thought about an alternative approach. 

Ideas are always good.

> Once a specific 
> address range of one guest is visible to another guest, it's just a matter 
> of a DMA or a single memcpy to transfer the data across.

My idea is to eliminate unnecessary copying.  This introduces one.

> Usually non-transparent PCIe bridges (NTBs) are used for inter-processor 
> data communication.  A physical PCIe NTB between two processors 
> just sets up a PCIe data channel with some address translation.
> 
> So I was just wondering: if we write this non-transparent bridge 
> (qemu PCI device) with address translation capability, then guests 
> can just mmap and start accessing each other's memory :)

I think your concept is similar to what Anthony suggested: using virtio 
to export and import other VMs' memory.  However, RAM and shared memory 
are not the same thing and having one guest access another's RAM could 
confuse the guest.  With the approach of mapping a BAR, the shared 
memory is separate from the guest RAM but it can be mapped by the guest 
processes.

Cam

> ~subbu
> 
> On Thu, Apr 23, 2009 at 4:11 AM, Cam Macdonell <cam@cs.ualberta.ca> wrote:
> 
>     subbu kl wrote:
> 
>         correct me if wrong,
>         can we do the sharing business by writing a non-transparent qemu
>         PCI device in host and guests can access each other's address
>         space ?
> 
> 
>     Hi Subbu,
> 
>     I'm a bit confused by your question.  Are you asking how this device
>     works or suggesting an alternative approach?  I'm not sure what you
>     mean by a non-transparent qemu device.
> 
>     Cam
> 
> 
>         ~subbu
> 
> 
>         On Sun, Apr 19, 2009 at 3:56 PM, Avi Kivity <avi@redhat.com> wrote:
> 
>            Cameron Macdonell wrote:
> 
> 
>                Hi Avi and Anthony,
> 
>                Sorry for the top-reply, but we haven't discussed this aspect
>                here before.
> 
>                I've been thinking about how to implement interrupts.  As
>         far as
>                I can tell, unix domain sockets in Qemu/KVM are used
>                point-to-point with one VM being the server by specifying
>                "server" along with the unix: option.  This works simply
>         for two
>                VMs, but I'm unsure how this can extend to multiple VMs.  How
>                would a server VM know how many clients to wait for?  How can
>                messages then be multicast or broadcast?  Is a separate
>                "interrupt server" necessary?
> 
> 
> 
>            I don't think unix provides a reliable multicast RPC.  So yes, an
>            interrupt server seems necessary.
> 
>            You could expand its role and make it a "shared memory PCI card
>            server", and have it also be responsible for providing the
>         backing
>            file using an SCM_RIGHTS fd.  That would reduce setup
>         headaches for
>            users (setting up a file for which all VMs have permissions).
> 
>            --    Do not meddle in the internals of kernels, for they are
>         subtle and
>            quick to panic.
> 
> 
> 
> 
> 
> 
>         -- 
>         ~subbu
> 
> 
> 
> 
> -- 
> ~subbu

Patch

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index 6eed853..167db55 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -640,6 +640,8 @@  OBJS += e1000.o
 
 # Serial mouse
 OBJS += msmouse.o
+# Inter-VM PCI shared memory
+OBJS += ivshmem.o
 
 ifeq ($(USE_KVM_DEVICE_ASSIGNMENT), 1)
 OBJS+= device-assignment.o
diff --git a/qemu/hw/ivshmem.c b/qemu/hw/ivshmem.c
new file mode 100644
index 0000000..27db95f
--- /dev/null
+++ b/qemu/hw/ivshmem.c
@@ -0,0 +1,363 @@ 
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *      Cam Macdonell <cam@cs.ualberta.ca>
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "qemu-common.h"
+#include <sys/mman.h>
+
+#define PCI_COMMAND_IOACCESS                0x0001
+#define PCI_COMMAND_MEMACCESS               0x0002
+#define PCI_COMMAND_BUSMASTER               0x0004
+
+//#define DEBUG_IVSHMEM
+
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)        \
+    do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct IVShmemState {
+    uint16_t intrmask;
+    uint16_t intrstatus;
+    uint8_t *ivshmem_ptr;
+    unsigned long ivshmem_offset;
+    unsigned int ivshmem_size;
+    unsigned long bios_offset;
+    unsigned int bios_size;
+    target_phys_addr_t base_ctrl;
+    int it_shift;
+    PCIDevice *pci_dev;
+    unsigned long map_addr;
+    unsigned long map_end;
+    int ivshmem_mmio_io_addr;
+} IVShmemState;
+
+typedef struct PCI_IVShmemState {
+    PCIDevice dev;
+    IVShmemState ivshmem_state;
+} PCI_IVShmemState;
+
+typedef struct IVShmemDesc {
+    char name[1024];
+    int size;
+} IVShmemDesc;
+
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 16
+};
+
+static int num_ivshmem_devices = 0;
+static IVShmemDesc ivshmem_desc;
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+                    uint32_t addr, uint32_t size, int type)
+{
+    PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+    IVShmemState *s = &d->ivshmem_state;
+
+    IVSHMEM_DPRINTF("addr = %u size = %u\n", addr, size);
+    cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
+
+}
+
+void ivshmem_init(const char * optarg) {
+
+    char * temp;
+    int size;
+
+    num_ivshmem_devices++;
+
+    /* currently we only support 1 device */
+    if (num_ivshmem_devices > MAX_IVSHMEM_DEVICES) {
+        return;
+    }
+
+    temp = strdup(optarg);
+    snprintf(ivshmem_desc.name, 1024, "/%s", strsep(&temp,","));
+    size = atol(temp);
+    if ( size == -1) {
+        ivshmem_desc.size = TARGET_PAGE_SIZE;
+    } else {
+        ivshmem_desc.size = size*1024*1024;
+    }
+    IVSHMEM_DPRINTF("optarg is %s, name is %s, size is %d\n", optarg,
+                                        ivshmem_desc.name,
+                                        ivshmem_desc.size);
+}
+
+int ivshmem_get_size(void) {
+    return ivshmem_desc.size;
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s)
+{
+    int isr;
+    isr = (s->intrstatus & s->intrmask) & 0xffff;
+
+    /* don't print ISR resets */
+    if (isr) {
+        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+           isr ? 1 : 0, s->intrstatus, s->intrmask);
+    }
+
+    qemu_set_irq(s->pci_dev->irq[0], (isr != 0));
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+                       uint32_t addr, uint32_t size, int type)
+{
+    PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+    IVShmemState *s = &d->ivshmem_state;
+
+    cpu_register_physical_memory(addr + 0, 0x100, s->ivshmem_mmio_io_addr);
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
+
+    s->intrmask = val;
+
+    ivshmem_update_irq(s);
+}
+
+static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrmask;
+
+    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
+
+    return ret;
+}
+
+static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
+
+    s->intrstatus = val;
+
+    ivshmem_update_irq(s);
+    return;
+}
+
+static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrstatus;
+
+    /* reading ISR clears all interrupts */
+    s->intrstatus = 0;
+
+    ivshmem_update_irq(s);
+
+    return ret;
+}
+
+static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVShmemState *s = opaque;
+
+    IVSHMEM_DPRINTF("writing 0x%x to 0x%lx\n", addr, (unsigned long) opaque);
+
+    addr &= 0xfe;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ivshmem_IntrMask_write(s, val);
+            break;
+
+        case IntrStatus:
+            ivshmem_IntrStatus_write(s, val);
+            break;
+
+        default:
+            IVSHMEM_DPRINTF("why are we writing 0x%x\n", addr);
+    }
+}
+
+static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing longs\n");
+}
+
+static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
+}
+
+static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
+{
+
+    IVShmemState *s = opaque;
+    uint32_t ret;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ret = ivshmem_IntrMask_read(s);
+            break;
+
+        case IntrStatus:
+            ret = ivshmem_IntrStatus_read(s);
+            break;
+        default:
+            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
+            ret = 0;
+    }
+
+    return ret;
+}
+
+static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading longs\n");
+    return 0;
+}
+
+static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
+
+    return 0;
+}
+
+static void ivshmem_mmio_writeb(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writeb(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writew(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writew(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writel(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writel(opaque, addr & 0xFF, val);
+}
+
+static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
+{
+    return ivshmem_io_readb(opaque, addr & 0xFF);
+}
+
+static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
+    return val;
+}
+
+static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
+    return val;
+}
+
+static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
+    ivshmem_mmio_readb,
+    ivshmem_mmio_readw,
+    ivshmem_mmio_readl,
+};
+
+static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
+    ivshmem_mmio_writeb,
+    ivshmem_mmio_writew,
+    ivshmem_mmio_writel,
+};
+
+int pci_ivshmem_init(PCIBus *bus, uint8_t *phys_ram_base)
+{
+    PCI_IVShmemState *d;
+    IVShmemState *s;
+    uint8_t *pci_conf;
+    int ivshmem_fd;
+
+    IVSHMEM_DPRINTF("shared file is %s\n", ivshmem_desc.name);
+    d = (PCI_IVShmemState *)pci_register_device(bus, "kvm_ivshmem",
+                                           sizeof(PCI_IVShmemState),
+                                           -1, NULL, NULL);
+    if (!d) {
+        return -1;
+    }
+
+    s = &d->ivshmem_state;
+
+    /* allocate shared memory RAM */
+    s->ivshmem_offset = qemu_ram_alloc(ivshmem_desc.size);
+    IVSHMEM_DPRINTF("size is = %d\n", ivshmem_desc.size);
+    IVSHMEM_DPRINTF("ivshmem ram offset = %ld\n", s->ivshmem_offset);
+
+    s->ivshmem_ptr = phys_ram_base + s->ivshmem_offset;
+
+    s->pci_dev = &d->dev;
+    s->ivshmem_size = ivshmem_desc.size;
+
+    pci_conf = d->dev.config;
+    pci_conf[0x00] = 0xf4; // Qumranet vendor ID 0x5002
+    pci_conf[0x01] = 0x1a;
+    pci_conf[0x02] = 0x10;
+    pci_conf[0x03] = 0x11;
+    pci_conf[0x04] = PCI_COMMAND_IOACCESS | PCI_COMMAND_MEMACCESS;
+    pci_conf[0x0a] = 0x00; // RAM controller
+    pci_conf[0x0b] = 0x05;
+    pci_conf[0x0e] = 0x00; // header_type
+
+    pci_conf[PCI_INTERRUPT_PIN] = 1; // we are going to support interrupts
+
+    /* XXX: ivshmem_desc.size must be a power of two */
+
+    s->ivshmem_mmio_io_addr = cpu_register_io_memory(0, ivshmem_mmio_read,
+                                    ivshmem_mmio_write, s);
+
+    /* region for registers*/
+    pci_register_io_region(&d->dev, 0, 0x100,
+                           PCI_ADDRESS_SPACE_MEM, ivshmem_mmio_map);
+
+    /* region for shared memory */
+    pci_register_io_region(&d->dev, 1, ivshmem_desc.size,
+                           PCI_ADDRESS_SPACE_MEM, ivshmem_map);
+
+    /* open shared memory file  */
+    if ((ivshmem_fd = shm_open(ivshmem_desc.name, O_CREAT|O_RDWR, S_IRWXU)) < 0)
+    {
+        fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+        exit(-1);
+    }
+
+    ftruncate(ivshmem_fd, ivshmem_desc.size);
+
+    /* mmap onto PCI device's memory */
+    if (mmap(s->ivshmem_ptr, ivshmem_desc.size, PROT_READ|PROT_WRITE,
+                        MAP_SHARED|MAP_FIXED, ivshmem_fd, 0) == MAP_FAILED)
+    {
+        fprintf(stderr, "kvm_ivshmem: could not mmap shared file\n");
+        exit(-1);
+    }
+
+    IVSHMEM_DPRINTF("shared object mapped to 0x%p\n", s->ivshmem_ptr);
+
+    return 0;
+}
+
diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index d4a4320..34cd1ba 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -64,6 +64,8 @@  static PITState *pit;
 static IOAPICState *ioapic;
 static PCIDevice *i440fx_state;
 
+extern int ivshmem_enabled;
+
 static void ioport80_write(void *opaque, uint32_t addr, uint32_t data)
 {
 }
@@ -1038,6 +1040,10 @@  vga_bios_error:
         }
     }
 
+    if (pci_enabled && ivshmem_enabled) {
+        pci_ivshmem_init(pci_bus, phys_ram_base);
+    }
+
     rtc_state = rtc_init(0x70, i8259[8], 2000);
 
     qemu_register_boot_set(pc_boot_set, rtc_state);
diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h
index 85319ea..0158ef3 100644
--- a/qemu/hw/pc.h
+++ b/qemu/hw/pc.h
@@ -190,4 +190,7 @@  void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd);
 
 void extboot_init(BlockDriverState *bs, int cmd);
 
+/* ivshmem.c */
+int pci_ivshmem_init(PCIBus *bus, uint8_t *phys_ram_base);
+
 #endif
diff --git a/qemu/qemu-options.hx b/qemu/qemu-options.hx
index bb4c8e6..84c7af2 100644
--- a/qemu/qemu-options.hx
+++ b/qemu/qemu-options.hx
@@ -1201,6 +1201,16 @@  The default device is @code{vc} in graphical mode and @code{stdio} in
 non graphical mode.
 ETEXI
 
+DEF("ivshmem", HAS_ARG, QEMU_OPTION_ivshmem, \
+    "-ivshmem name,size    creates or opens a shared file 'name' of size \
+    'size' (in MB) and exposes it as a PCI device in the guest\n")
+STEXI
+@item -ivshmem @var{file},@var{size}
+Creates a POSIX shared file named @var{file} of size @var{size} and creates a
+PCI device of the same size that maps the shared file into the device for guests
+to access.  The created file on the host is located in /dev/shm/
+ETEXI
+
 DEF("pidfile", HAS_ARG, QEMU_OPTION_pidfile, \
     "-pidfile file   write PID to 'file'\n")
 STEXI
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index d765465..ed34b5a 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -215,6 +215,13 @@  extern CharDriverState *parallel_hds[MAX_PARALLEL_PORTS];
 
 extern CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES];
 
+/* inter-VM shared memory devices */
+
+#define MAX_IVSHMEM_DEVICES 1
+
+void ivshmem_init(const char * optarg);
+int ivshmem_get_size(void);
+
 #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR)
 
 #ifdef NEED_CPU_H
diff --git a/qemu/vl.c b/qemu/vl.c
index b3da7ad..e0a08fb 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -219,6 +219,7 @@  static int rtc_date_offset = -1; /* -1 means no change */
 int cirrus_vga_enabled = 1;
 int std_vga_enabled = 0;
 int vmsvga_enabled = 0;
+int ivshmem_enabled = 0;
 #ifdef TARGET_SPARC
 int graphic_width = 1024;
 int graphic_height = 768;
@@ -236,6 +237,7 @@  int no_quit = 0;
 CharDriverState *serial_hds[MAX_SERIAL_PORTS];
 CharDriverState *parallel_hds[MAX_PARALLEL_PORTS];
 CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES];
+const char * ivshmem_device;
 #ifdef TARGET_I386
 int win2k_install_hack = 0;
 int rtc_td_hack = 0;
@@ -4522,6 +4524,7 @@  int main(int argc, char **argv, char **envp)
     cyls = heads = secs = 0;
     translation = BIOS_ATA_TRANSLATION_AUTO;
     monitor_device = "vc:80Cx24C";
+    ivshmem_device = NULL;
 
     serial_devices[0] = "vc:80Cx24C";
     for(i = 1; i < MAX_SERIAL_PORTS; i++)
@@ -4944,6 +4947,10 @@  int main(int argc, char **argv, char **envp)
                 parallel_devices[parallel_device_index] = optarg;
                 parallel_device_index++;
                 break;
+            case QEMU_OPTION_ivshmem:
+                ivshmem_device = optarg;
+                ivshmem_enabled = 1;
+                break;
 	    case QEMU_OPTION_loadvm:
 		loadvm = optarg;
 		break;
@@ -5416,6 +5423,11 @@  int main(int argc, char **argv, char **envp)
 	    }
     }
 
+    if (ivshmem_enabled) {
+        ivshmem_init(ivshmem_device);
+        phys_ram_size += ivshmem_get_size();
+    }
+
     phys_ram_base = qemu_alloc_physram(phys_ram_size);
     if (!phys_ram_base) {
         fprintf(stderr, "Could not allocate physical memory\n");