Message ID | 20240410100345.389462-1-eperezma@redhat.com
Series     | Identify aliased maps in vdpa SVQ iova_tree
On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> The guest may have overlapped memory regions, where different GPAs lead
> to the same HVA. This causes a problem when overlapped regions
> (different GPA but same translated HVA) exist in the tree, as looking
> them up by HVA will return them twice.

I think I don't understand if there's any side effect for shadow virtqueue?

Thanks

>
> To solve this, track the GPA in the DMA entry so that it acts as a unique
> identifier for the map. When the map needs to be removed, the iova tree
> is able to find the right one.
>
> Users that do not need this extra layer of indirection can use the
> iova tree as usual, with id = 0.
>
> This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> time reproducing the issue. This has been tested only without overlapping
> maps. If it works with overlapping maps, it will be integrated in the main
> series.
>
> Comments are welcome. Thanks!
>
> Eugenio Pérez (2):
>   iova_tree: add an id member to DMAMap
>   vdpa: identify aliased maps in iova_tree
>
>  hw/virtio/vhost-vdpa.c   | 2 ++
>  include/qemu/iova-tree.h | 5 +++--
>  util/iova-tree.c         | 3 ++-
>  3 files changed, 7 insertions(+), 3 deletions(-)
>
> --
> 2.44.0
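For reference, patch 1/2 ("iova_tree: add an id member to DMAMap") boils down to something like the sketch below. The existing fields match include/qemu/iova-tree.h; the type and placement of the new id member are assumptions, not the literal patch:

    #include "exec/hwaddr.h"    /* hwaddr */
    #include "exec/memory.h"    /* IOMMUAccessFlags */

    /* Minimal sketch of the extended DMAMap. Callers that do not care
     * about aliasing leave id as 0 and see no behaviour change. */
    typedef struct DMAMap {
        hwaddr iova;
        hwaddr translated_addr;
        hwaddr size;            /* Inclusive: [iova, iova + size] */
        IOMMUAccessFlags perm;
        uint64_t id;            /* New: disambiguates maps with the same HVA */
    } DMAMap;

The vdpa memory listener (patch 2/2) would then set id to the GPA of the section, so two maps that share an HVA still compare as different entries.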
On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > The guest may have overlapped memory regions, where different GPAs lead
> > to the same HVA. This causes a problem when overlapped regions
> > (different GPA but same translated HVA) exist in the tree, as looking
> > them up by HVA will return them twice.
>
> I think I don't understand if there's any side effect for shadow virtqueue?
>

My bad, I totally forgot to put a reference to where this comes from.

Si-Wei found that during initialization this sequence of maps / unmaps
happens [1]:

HVA                               GPA                          IOVA
--------------------------------------------------------------------------------------
Map
[0x7f7903e00000, 0x7f7983e00000)  [0x0, 0x80000000)            [0x1000, 0x80000000)
[0x7f7983e00000, 0x7f9903e00000)  [0x100000000, 0x2080000000)  [0x80001000, 0x2000001000)
[0x7f7903ea0000, 0x7f7903ec0000)  [0xfeda0000, 0xfedc0000)     [0x2000001000, 0x2000021000)

Unmap
[0x7f7903ea0000, 0x7f7903ec0000)  [0xfeda0000, 0xfedc0000)     [0x1000, 0x20000) ???

The third HVA range is contained in the first one, but exposed under a
different GPA (aliased). This is not "flattened" by QEMU, as the GPAs do
not overlap; only the HVAs do.

At the third chunk unmap, the current algorithm finds the first chunk,
not the third one. This series is the way to tell the difference at
unmap time.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html

Thanks!

> Thanks
>
> [...]
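To make the failure concrete: the lookup matches entries by translated (HVA) range only, so both the first and the third map above match the same HVA needle. A sketch of the matching logic together with the id-based fix; the function names are illustrative, not the literal util/iova-tree.c code:

    /* Two maps "match" when their HVA ranges overlap; size is inclusive,
     * so the last byte of a map is at translated_addr + size. */
    static bool hva_ranges_overlap(const DMAMap *a, const DMAMap *b)
    {
        return a->translated_addr <= b->translated_addr + b->size &&
               b->translated_addr <= a->translated_addr + a->size;
    }

    /* Today's behaviour: unmapping the third chunk by HVA can hit the
     * first (larger) map instead, yielding the bogus [0x1000, 0x20000)
     * unmap above. With the new member, only the entry whose id (the
     * GPA) also matches is considered, so the right map is removed. */
    static bool maps_match(const DMAMap *a, const DMAMap *b)
    {
        return hva_ranges_overlap(a, b) && a->id == b->id;
    }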
On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> [...]
>
> Si-Wei found that during initialization this sequence of maps / unmaps
> happens [1]:
>
> [...]
>
> At the third chunk unmap, the current algorithm finds the first chunk,
> not the third one. This series is the way to tell the difference at
> unmap time.
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html

Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
the iova tree to solve this issue completely. Then there won't be
aliasing issues.

Thanks

> [...]
On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> [...]
>
> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> the iova tree to solve this issue completely. Then there won't be
> aliasing issues.
>

I'm ok to explore that route, but this has another problem. Both SVQ
vrings and CVQ buffers also need to be addressable by VhostIOVATree,
and they do not have a GPA.

At this moment vhost_svq_translate_addr is able to handle this
transparently, as we translate vaddr to SVQ IOVA. How can we store
these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA, and
then a list to go through the other entries (SVQ vaddr and CVQ buffers).

Thanks!

> [...]
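A sketch of the sentinel idea, assuming the GPA is carried in the map entry; both the constant and the helper below are hypothetical names, not code from the series:

    /* Hypothetical marker for host-only entries (SVQ vrings, CVQ buffers),
     * which have a vaddr and an SVQ IOVA but no guest physical address. */
    #define VHOST_IOVA_NO_GPA ((hwaddr)-1)

    static bool map_has_gpa(const DMAMap *map)
    {
        return map->id != VHOST_IOVA_NO_GPA;  /* .id carries the GPA */
    }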
On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> [...]
>
> I'm ok to explore that route, but this has another problem. Both SVQ
> vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> and they do not have a GPA.
>
> At this moment vhost_svq_translate_addr is able to handle this
> transparently, as we translate vaddr to SVQ IOVA. How can we store
> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA, and
> then a list to go through the other entries (SVQ vaddr and CVQ buffers).

This seems to be tricky.

As discussed, it could be another iova tree.

Thanks

> [...]
On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> [...]
>
> This seems to be tricky.
>
> As discussed, it could be another iova tree.
>

Yes, but there are many ways to add another IOVATree. Let me expand & recap.

Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
Let's call it gpa_iova_tree, as opposed to the current iova_tree that
translates from vaddr to SVQ IOVA. Knowing which one to use is easy at
adding or removing time, like in the memory listener, but how do we know
at vhost_svq_translate_addr?

The easiest way for me is to rely on memory_region_from_host(). When
vaddr is from the guest, it returns a valid MemoryRegion. When it is
not, it returns NULL. I'm not sure if this is a valid use case; it
just worked in my tests so far.

Now we have the second problem: the GPA values of the regions of the
two IOVA trees must be unique. We need to be able to find unallocated
regions in SVQ IOVA. At this moment there is only one IOVATree, so
this is done easily by vhost_iova_tree_map_alloc. But it is very
complicated with two trees.

Option 2a is to add another IOVATree in VhostIOVATree. I think the
easiest way is to keep the GPA -> SVQ IOVA in one tree, let's call it
iova_gpa_map, and the current vaddr -> SVQ IOVA tree in
iova_taddr_map. This second tree should contain both vaddr memory that
belongs to the guest and host-only vaddrs like vrings and CVQ buffers.

How do we pass the GPA to the VhostIOVATree API? Adding it to DMAMap is
complicated, as it is shared with intel_iommu. We can add new
functions to VhostIOVATree that accept vaddr plus GPA, or maybe it is
enough with GPA only. There should be functions to add, remove, and
allocate new entries. But the vaddr ones must be kept, because buffers
might be host-only.

Then the caller can choose which version to call: for adding and
removing guest memory from the memory listener, the GPA variant.
Adding SVQ vrings and CVQ buffers should use the current vaddr
versions. vhost_svq_translate_addr still needs to use
memory_region_from_host() to know which one to call.

Although I didn't like this approach because it complicates
VhostIOVATree, I think it is the easiest way, except for option 4; I'll
explain later.

This has an extra advantage: currently, the lookup in
vhost_svq_translate_addr is linear, O(1). This would allow us to use
the tree properly.

Option 2b could be to keep them totally separated. So the current
VhostIOVATree->iova_taddr_map only contains host-only entries, and the
new iova_gpa_map contains the guest entries. I don't think this case
adds anything except less memory usage, as the gpa map (which should
be the fastest) will be the same size. Also, it makes it difficult to
implement vhost_iova_tree_map_alloc.

Option 3 is to not add new functions but extend the current ones. That
would require special GPA values to indicate no GPA, like for the SVQ
vrings. I think option 2a is better, but this may help to keep the
interface simpler.

Option 4 is what I'm proposing in this RFC: to store the GPA as the map
id so we can tell if the vaddr corresponds to one SVQ IOVA or another.
Right now I'm having trouble retrieving the memory section I see in the
memory listener. It should not be so difficult, but I haven't succeeded
so far. The main advantage is not duplicating data structs that are
already in QEMU, but maybe it is not worth the effort.

Going further with this option, we could add a flag to ignore the .id
member added. But it adds more and more complexity to the API, so I
would prefer option 2a for this.

Thanks!

> [...]
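For concreteness, a sketch of what the option 2a surface might look like; every name below (iova_gpa_map, the _gpa variants) is an assumption drawn from the description above, not code from the series:

    /* Sketch: VhostIOVATree keeps two trees behind one allocator. The
     * iova_first/iova_last/iova_taddr_map fields exist today in
     * hw/virtio/vhost-iova-tree.c; iova_gpa_map is the addition. */
    typedef struct VhostIOVATree {
        hwaddr iova_first, iova_last;
        IOVATree *iova_taddr_map;  /* SVQ IOVA <-> vaddr, guest + host-only */
        IOVATree *iova_gpa_map;    /* SVQ IOVA <-> GPA, guest memory only   */
    } VhostIOVATree;

    /* Guest memory (memory listener): hypothetical GPA variants keep both
     * trees in sync, so SVQ IOVA allocation stays unique across them. */
    int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map,
                                      hwaddr gpa);
    void vhost_iova_tree_remove_gpa(VhostIOVATree *tree, DMAMap map);

    /* Host-only memory (SVQ vrings, CVQ buffers): the existing variants,
     * which only touch iova_taddr_map. */
    int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map);
    void vhost_iova_tree_remove(VhostIOVATree *tree, DMAMap map);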
On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> [...]
>
> Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> translates from vaddr to SVQ IOVA. Knowing which one to use is easy at
> adding or removing time, like in the memory listener, but how do we know
> at vhost_svq_translate_addr?

Then we won't use virtqueue_pop() at all, we need a SVQ version of
virtqueue_pop() to translate GPA to SVQ IOVA directly?

> The easiest way for me is to rely on memory_region_from_host(). When
> vaddr is from the guest, it returns a valid MemoryRegion. When it is
> not, it returns NULL. I'm not sure if this is a valid use case; it
> just worked in my tests so far.
>
> Now we have the second problem: the GPA values of the regions of the
> two IOVA trees must be unique. We need to be able to find unallocated
> regions in SVQ IOVA. At this moment there is only one IOVATree, so
> this is done easily by vhost_iova_tree_map_alloc. But it is very
> complicated with two trees.

Would it be simpler if we decouple the IOVA allocator? For example, we
can have a dedicated gtree to track the allocated IOVA ranges. It is
shared by both

1) Guest memory (GPA)
2) SVQ virtqueue and buffers

And another gtree to track the GPA to IOVA.

The SVQ code could use either

1) one linear mapping that contains both SVQ virtqueue and buffers

or

2) dynamic IOVA allocation/deallocation helpers

So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?

> Option 2a is to add another IOVATree in VhostIOVATree. [...]
>
> Then the caller can choose which version to call: for adding and
> removing guest memory from the memory listener, the GPA variant.
> Adding SVQ vrings and CVQ buffers should use the current vaddr
> versions. vhost_svq_translate_addr still needs to use
> memory_region_from_host() to know which one to call.

So the idea is, for region_del/region_add use iova_gpa_map? For the
SVQ translation, use the iova_taddr_map?

I suspect we still need to synchronize those two trees, so it might
still be problematic, as iova_taddr_map contains the overlapped regions.

> Although I didn't like this approach because it complicates
> VhostIOVATree, I think it is the easiest way, except for option 4; I'll
> explain later.
>
> This has an extra advantage: currently, the lookup in
> vhost_svq_translate_addr is linear, O(1). This would allow us to use
> the tree properly.

It uses g_tree_foreach() which I guess is not O(1).

> [...]

Thanks
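A sketch of the decoupled allocator Jason describes: one GTree that only records which SVQ IOVA ranges are in use, shared by guest maps and SVQ/CVQ buffers. All names are illustrative (QEMU's util/iova-tree.c already implements a similar first-fit search in iova_tree_alloc_map):

    #include <glib.h>
    #include "exec/hwaddr.h"

    typedef struct { hwaddr iova; hwaddr size; } IOVARange;  /* half-open */

    /* Order ranges by start address; the allocator keeps them disjoint.
     * The tree is created with g_tree_new_full(range_cmp, NULL, g_free,
     * NULL), so keys are freed on removal. */
    static gint range_cmp(gconstpointer a, gconstpointer b, gpointer d)
    {
        const IOVARange *ra = a, *rb = b;
        return ra->iova < rb->iova ? -1 : (ra->iova > rb->iova ? 1 : 0);
    }

    typedef struct { hwaddr next, size, found; } AllocState;

    /* g_tree_foreach() visits keys in ascending order; returning TRUE
     * stops the traversal at the first hole that fits. */
    static gboolean find_hole(gpointer key, gpointer value, gpointer data)
    {
        const IOVARange *r = key;
        AllocState *s = data;

        if (s->next + s->size <= r->iova) {
            s->found = s->next;
            return TRUE;
        }
        if (r->iova + r->size > s->next) {
            s->next = r->iova + r->size;
        }
        return FALSE;
    }

    /* First-fit allocation in [first, last); returns HWADDR_MAX on failure. */
    static hwaddr iova_alloc(GTree *used, hwaddr first, hwaddr last, hwaddr size)
    {
        AllocState s = { .next = first, .size = size, .found = HWADDR_MAX };

        g_tree_foreach(used, find_hole, &s);
        if (s.found == HWADDR_MAX && s.next + size <= last) {
            s.found = s.next;               /* hole after the highest range */
        }
        if (s.found != HWADDR_MAX) {
            IOVARange *r = g_new(IOVARange, 1);
            *r = (IOVARange){ .iova = s.found, .size = size };
            g_tree_insert(used, r, r);
        }
        return s.found;
    }

Guest (GPA -> IOVA) and host-only (HVA -> IOVA) mappings can then live in separate trees that both draw from iova_alloc(), so the IOVA space stays consistent without the two trees knowing about each other.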
On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
>
> [...]
>
> Then we won't use virtqueue_pop() at all, we need a SVQ version of
> virtqueue_pop() to translate GPA to SVQ IOVA directly?
>

The problem is not virtqueue_pop; that's outside of
vhost_svq_translate_addr. The problem is the need to add conditionals /
complexity in all the callers of vhost_svq_translate_addr.

> Would it be simpler if we decouple the IOVA allocator? For example, we
> can have a dedicated gtree to track the allocated IOVA ranges. It is
> shared by both
>
> 1) Guest memory (GPA)
> 2) SVQ virtqueue and buffers
>
> And another gtree to track the GPA to IOVA.
>
> The SVQ code could use either
>
> 1) one linear mapping that contains both SVQ virtqueue and buffers
>
> or
>
> 2) dynamic IOVA allocation/deallocation helpers
>
> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
>

That's possible, but that scatters the IOVA handling code instead of
keeping it self-contained in VhostIOVATree.

> So the idea is, for region_del/region_add use iova_gpa_map? For the
> SVQ translation, use the iova_taddr_map?
>
> I suspect we still need to synchronize those two trees, so it might
> still be problematic, as iova_taddr_map contains the overlapped regions.
>

Right. All adding / removing functions with GPA must also update the
current iova_taddr_map too.

> It uses g_tree_foreach() which I guess is not O(1).
>

I'm sorry, I meant O(N).

Thanks!

> [...]
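For the "which tree to use" question that keeps coming up, a sketch of the dispatch inside the translation path; memory_region_from_host() is the real QEMU helper (it returns NULL for pointers that are not backed by guest RAM), while both lookup helpers below are hypothetical:

    #include "exec/memory.h"   /* memory_region_from_host() */

    /* Sketch: pick the tree based on whether vaddr belongs to the guest. */
    static const DMAMap *svq_find_map(VhostIOVATree *tree, void *vaddr,
                                      size_t len)
    {
        ram_addr_t offset;
        MemoryRegion *mr = memory_region_from_host(vaddr, &offset);

        if (mr) {
            /* Guest buffer: translate through the GPA-keyed tree
             * (hypothetical helper). */
            return vhost_iova_tree_find_gpa(tree, mr, offset, len);
        }
        /* Host-only buffer (SVQ vring, CVQ shadow buffer): use the
         * vaddr-keyed tree (hypothetical helper). */
        return vhost_iova_tree_find_taddr(tree, vaddr, len);
    }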
On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> [...]
>
> > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> >
>
> That's possible, but that scatters the IOVA handling code instead of
> keeping it self-contained in VhostIOVATree.

To me, the IOVA range/allocation is orthogonal to how the IOVA is used.

An example is the iova allocator in the kernel.

Note that there's an even simpler IOVA "allocator" in the NVMe passthrough
code; not sure it is useful here though (haven't had a deep look at that).

Thanks
On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> [...]
>
> To me, the IOVA range/allocation is orthogonal to how the IOVA is used.
>
> An example is the iova allocator in the kernel.
>
> Note that there's an even simpler IOVA "allocator" in the NVMe passthrough
> code; not sure it is useful here though (haven't had a deep look at that).
>

I don't know enough about them to have an opinion. I keep seeing the
drawback of needing to synchronize both allocation & adding in all the
places where we want to modify the IOVATree. At this moment, these are
the vhost-vdpa memory listener, the SVQ vring creation and removal, and
the net CVQ buffers. But there may be more in the future.

What are the advantages of keeping these separated that justify needing
to synchronize in all these places, compared with keeping them
synchronized in VhostIOVATree?

Thanks!
On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin > > > > > > > > <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads > > > > > > > > > > > to the same HVA. This causes a problem when overlapped regions > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking > > > > > > > > > > > them by HVA will return them twice. > > > > > > > > > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue? > > > > > > > > > > > > > > > > > > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from. > > > > > > > > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps / > > > > > > > > > unmaps happens [1]: > > > > > > > > > > > > > > > > > > HVA GPA IOVA > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------- > > > > > > > > > Map > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > > > > > > > > > [0x80001000, 0x2000001000) > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > > > > > > > > > [0x2000001000, 0x2000021000) > > > > > > > > > > > > > > > > > > Unmap > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, > > > > > > > > > 0x20000) ??? > > > > > > > > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does > > > > > > > > > not overlap, only HVA. > > > > > > > > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk, > > > > > > > > > not the second one. This series is the way to tell the difference at > > > > > > > > > unmap time. > > > > > > > > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in > > > > > > > > the iova tree to solve this issue completely. Then there won't be > > > > > > > > aliasing issues. > > > > > > > > > > > > > > > > > > > > > > I'm ok to explore that route but this has another problem. 
Both SVQ > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree, > > > > > > > and they do not have GPA. > > > > > > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers). > > > > > > > > > > > > This seems to be tricky. > > > > > > > > > > > > As discussed, it could be another iova tree. > > > > > > > > > > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap. > > > > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue. > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at > > > > > adding or removing, like in the memory listener, but how to know at > > > > > vhost_svq_translate_addr? > > > > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly? > > > > > > > > > > The problem is not virtqueue_pop, that's out of the > > > vhost_svq_translate_addr. The problem is the need of adding > > > conditionals / complexity in all the callers of > > > > > > > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it > > > > > just worked in my tests so far. > > > > > > > > > > Now we have the second problem: The GPA values of the regions of the > > > > > two IOVA tree must be unique. We need to be able to find unallocated > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very > > > > > complicated with two trees. > > > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is > > > > shared by both > > > > > > > > 1) Guest memory (GPA) > > > > 2) SVQ virtqueue and buffers > > > > > > > > And another gtree to track the GPA to IOVA. > > > > > > > > The SVQ code could use either > > > > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers > > > > > > > > or > > > > > > > > 2) dynamic IOVA allocation/deallocation helpers > > > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? > > > > > > > > > > That's possible, but that scatters the IOVA handling code instead of > > > keeping it self-contained in VhostIOVATree. > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used. > > > > An example is the iova allocator in the kernel. > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough > > code, not sure it is useful here though (haven't had a deep look at > > that). > > > > I don't know enough about them to have an opinion. I keep seeing the > drawback of needing to synchronize both allocation & adding in all the > places we want to modify the IOVATree. At this moment, these are the > vhost-vdpa memory listener, the SVQ vring creation and removal, and > net CVQ buffers. But it may be more in the future. 
> > What are the advantages of keeping these separated that justifies > needing to synchronize in all these places, compared with keeping them > synchronized in VhostIOVATree? It doesn't need to be synchronized. Assuming the guest and SVQ share the IOVA range, IOVA only needs to track which part of the range has been used. This simplifies things: we can store GPA->IOVA mappings and SVQ -> IOVA mappings separately. Thanks > > Thanks! >
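To make "only track which part of the range has been used" concrete, here is a minimal standalone sketch of such an allocator, with a sorted array standing in for the tree. All names are hypothetical; this is not the QEMU IOVATree API:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_RANGES 64

/* One allocated IOVA range; note there is no payload describing how
 * the range is used. */
typedef struct {
    uint64_t first, last;                   /* inclusive */
} IovaRange;

typedef struct {
    uint64_t iova_first, iova_last;         /* usable window, inclusive */
    IovaRange used[MAX_RANGES];             /* kept sorted by .first */
    int n;
} IovaAllocator;

/* First-fit: walk the holes between used ranges, in order. */
static int iova_alloc(IovaAllocator *a, uint64_t size, uint64_t *iova)
{
    uint64_t hole = a->iova_first;

    if (a->n == MAX_RANGES || size == 0) {
        return -1;
    }
    for (int i = 0; i <= a->n; i++) {
        uint64_t hole_end = (i == a->n) ? a->iova_last + 1 : a->used[i].first;

        if (hole_end - hole >= size) {
            for (int j = a->n; j > i; j--) {    /* keep the array sorted */
                a->used[j] = a->used[j - 1];
            }
            a->used[i] = (IovaRange){ hole, hole + size - 1 };
            a->n++;
            *iova = hole;
            return 0;
        }
        if (i < a->n) {
            hole = a->used[i].last + 1;
        }
    }
    return -1;                                  /* no hole big enough */
}

int main(void)
{
    IovaAllocator a = { .iova_first = 0x1000, .iova_last = 0xffffffffULL };
    uint64_t guest, vring;

    iova_alloc(&a, 0x80000000ULL - 0x1000, &guest); /* guest memory */
    iova_alloc(&a, 0x2000, &vring);                 /* an SVQ vring */
    printf("guest IOVA 0x%" PRIx64 ", vring IOVA 0x%" PRIx64 "\n",
           guest, vring);
    return 0;
}

The allocator records no translations at all; under this proposal the GPA->IOVA and SVQ->IOVA mappings would live in their own structures.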
On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin > <eperezma@redhat.com> wrote: > > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin > > > > > > > > > <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads > > > > > > > > > > > > to the same HVA. This causes a problem when overlapped regions > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking > > > > > > > > > > > > them by HVA will return them twice. > > > > > > > > > > > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from. > > > > > > > > > > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps / > > > > > > > > > > unmaps happens [1]: > > > > > > > > > > > > > > > > > > > > HVA GPA IOVA > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------- > > > > > > > > > > Map > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > > > > > > > > > > [0x80001000, 0x2000001000) > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > > > > > > > > > > [0x2000001000, 0x2000021000) > > > > > > > > > > > > > > > > > > > > Unmap > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, > > > > > > > > > > 0x20000) ??? > > > > > > > > > > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does > > > > > > > > > > not overlap, only HVA. > > > > > > > > > > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk, > > > > > > > > > > not the second one. This series is the way to tell the difference at > > > > > > > > > > unmap time. > > > > > > > > > > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in > > > > > > > > > the iova tree to solve this issue completely. Then there won't be > > > > > > > > > aliasing issues. 
> > > > > > > > > > > > > > > > > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree, > > > > > > > > and they do not have GPA. > > > > > > > > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers). > > > > > > > > > > > > > > This seems to be tricky. > > > > > > > > > > > > > > As discussed, it could be another iova tree. > > > > > > > > > > > > > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap. > > > > > > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue. > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at > > > > > > adding or removing, like in the memory listener, but how to know at > > > > > > vhost_svq_translate_addr? > > > > > > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly? > > > > > > > > > > > > > The problem is not virtqueue_pop, that's out of the > > > > vhost_svq_translate_addr. The problem is the need of adding > > > > conditionals / complexity in all the callers of > > > > > > > > > > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it > > > > > > just worked in my tests so far. > > > > > > > > > > > > Now we have the second problem: The GPA values of the regions of the > > > > > > two IOVA tree must be unique. We need to be able to find unallocated > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very > > > > > > complicated with two trees. > > > > > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is > > > > > shared by both > > > > > > > > > > 1) Guest memory (GPA) > > > > > 2) SVQ virtqueue and buffers > > > > > > > > > > And another gtree to track the GPA to IOVA. > > > > > > > > > > The SVQ code could use either > > > > > > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers > > > > > > > > > > or > > > > > > > > > > 2) dynamic IOVA allocation/deallocation helpers > > > > > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? > > > > > > > > > > > > > That's possible, but that scatters the IOVA handling code instead of > > > > keeping it self-contained in VhostIOVATree. > > > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used. > > > > > > An example is the iova allocator in the kernel. > > > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough > > > code, not sure it is useful here though (haven't had a deep look at > > > that). > > > > > > > I don't know enough about them to have an opinion. I keep seeing the > > drawback of needing to synchronize both allocation & adding in all the > > places we want to modify the IOVATree. 
At this moment, these are the > > vhost-vdpa memory listener, the SVQ vring creation and removal, and > > net CVQ buffers. But it may be more in the future. > > > > What are the advantages of keeping these separated that justifies > > needing to synchronize in all these places, compared with keeping them > > synchronized in VhostIOVATree? > > It doesn't need to be synchronized. > > Assuming guest and SVQ shares IOVA range. IOVA only needs to track > which part of the range has been used. > Not sure if I follow; that is what I mean by "synchronized". > This simplifies things, we can store GPA->IOVA mappings and SVQ -> > IOVA mappings separately. > Sorry, I still cannot see the whole picture :). Let's assume we have all the GPA mapped to specific IOVA regions, so we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ because of the migration. How can we know where we can place SVQ vrings without having them synchronized? At this moment we're using a tree. The tree nature of the current SVQ IOVA -> VA makes all nodes ordered, so it is more or less easy to look for holes. Now your proposal uses the SVQ IOVA as tree values. Should we iterate over all of them across the two trees, order them, and then look for holes there? > Thanks > > > > Thanks! >
On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: > > > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin > > <eperezma@redhat.com> wrote: > > > > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin > > > > > > > > > > <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads > > > > > > > > > > > > > to the same HVA. This causes a problem when overlapped regions > > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking > > > > > > > > > > > > > them by HVA will return them twice. > > > > > > > > > > > > > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from. > > > > > > > > > > > > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps / > > > > > > > > > > > unmaps happens [1]: > > > > > > > > > > > > > > > > > > > > > > HVA GPA IOVA > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------- > > > > > > > > > > > Map > > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) > > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > > > > > > > > > > > [0x80001000, 0x2000001000) > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > > > > > > > > > > > [0x2000001000, 0x2000021000) > > > > > > > > > > > > > > > > > > > > > > Unmap > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, > > > > > > > > > > > 0x20000) ??? > > > > > > > > > > > > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a > > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does > > > > > > > > > > > not overlap, only HVA. > > > > > > > > > > > > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk, > > > > > > > > > > > not the second one. This series is the way to tell the difference at > > > > > > > > > > > unmap time. > > > > > > > > > > > > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html > > > > > > > > > > > > > > > > > > > > > > Thanks! 
> > > > > > > > > > > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in > > > > > > > > > > the iova tree to solve this issue completely. Then there won't be > > > > > > > > > > aliasing issues. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ > > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree, > > > > > > > > > and they do not have GPA. > > > > > > > > > > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this > > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store > > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and > > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers). > > > > > > > > > > > > > > > > This seems to be tricky. > > > > > > > > > > > > > > > > As discussed, it could be another iova tree. > > > > > > > > > > > > > > > > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap. > > > > > > > > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue. > > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that > > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at > > > > > > > adding or removing, like in the memory listener, but how to know at > > > > > > > vhost_svq_translate_addr? > > > > > > > > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of > > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly? > > > > > > > > > > > > > > > > The problem is not virtqueue_pop, that's out of the > > > > > vhost_svq_translate_addr. The problem is the need of adding > > > > > conditionals / complexity in all the callers of > > > > > > > > > > > > > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When > > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is > > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it > > > > > > > just worked in my tests so far. > > > > > > > > > > > > > > Now we have the second problem: The GPA values of the regions of the > > > > > > > two IOVA tree must be unique. We need to be able to find unallocated > > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so > > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very > > > > > > > complicated with two trees. > > > > > > > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we > > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is > > > > > > shared by both > > > > > > > > > > > > 1) Guest memory (GPA) > > > > > > 2) SVQ virtqueue and buffers > > > > > > > > > > > > And another gtree to track the GPA to IOVA. > > > > > > > > > > > > The SVQ code could use either > > > > > > > > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers > > > > > > > > > > > > or > > > > > > > > > > > > 2) dynamic IOVA allocation/deallocation helpers > > > > > > > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? > > > > > > > > > > > > > > > > That's possible, but that scatters the IOVA handling code instead of > > > > > keeping it self-contained in VhostIOVATree. > > > > > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used. > > > > > > > > An example is the iova allocator in the kernel. 
> > > > > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough > > > > code, not sure it is useful here though (haven't had a deep look at > > > > that). > > > > > > > > > > I don't know enough about them to have an opinion. I keep seeing the > > > drawback of needing to synchronize both allocation & adding in all the > > > places we want to modify the IOVATree. At this moment, these are the > > > vhost-vdpa memory listener, the SVQ vring creation and removal, and > > > net CVQ buffers. But it may be more in the future. > > > > > > What are the advantages of keeping these separated that justifies > > > needing to synchronize in all these places, compared with keeping them > > > synchronized in VhostIOVATree? > > > > It doesn't need to be synchronized. > > > > Assuming guest and SVQ shares IOVA range. IOVA only needs to track > > which part of the range has been used. > > > > Not sure if I follow, that is what I mean with "synchronized". Oh right. > > > This simplifies things, we can store GPA->IOVA mappings and SVQ -> > > IOVA mappings separately. > > > > Sorry, I still cannot see the whole picture :). > > Let's assume we have all the GPA mapped to specific IOVA regions, so > we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ > because of the migration. How can we know where we can place SVQ > vrings without having them synchronized? Just allocating a new IOVA range for SVQ? > > At this moment we're using a tree. The tree nature of the current SVQ > IOVA -> VA makes all nodes ordered so it is more or less easy to look > for holes. Yes, IOVA allocation could still be implemented via a tree. > > Now your proposal uses the SVQ IOVA as tree values. Should we iterate > over all of them, order them, of the two trees, and then look for > holes there? Let me clarify; correct me if I'm wrong: 1) IOVA allocator is still implemented via a tree, we just don't need to store how the IOVA is used 2) A dedicated GPA -> IOVA tree, updated via listeners and used in the datapath SVQ translation 3) A linear mapping or another SVQ -> IOVA tree used for SVQ Thanks > > > Thanks > > > > > > Thanks! > > > > > >
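A standalone sketch of how pieces (2) and (3) could meet at translation time, with flat arrays standing in for the trees. The names and the from_guest flag are hypothetical rather than existing QEMU code; the flag is the extra piece of information the caller has to supply, which becomes a point of contention below:

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One contiguous src -> IOVA mapping; src is a GPA in one map and an
 * HVA in the other. */
typedef struct {
    uint64_t src_first, src_last;           /* inclusive */
    uint64_t iova_first;
} LinearMap;

typedef struct {
    LinearMap gpa2iova[16];                 /* (2) kept by the listener */
    int n_gpa;
    LinearMap svq2iova[16];                 /* (3) vrings + CVQ buffers */
    int n_svq;
} SVQTranslations;

static bool map_xlate(const LinearMap *m, int n, uint64_t src,
                      uint64_t *iova)
{
    for (int i = 0; i < n; i++) {
        if (src >= m[i].src_first && src <= m[i].src_last) {
            *iova = m[i].iova_first + (src - m[i].src_first);
            return true;
        }
    }
    return false;
}

/* The caller has to say whether src is a guest GPA or an SVQ-owned
 * HVA, so it can pick the right map. */
static bool svq_translate(const SVQTranslations *t, bool from_guest,
                          uint64_t src, uint64_t *iova)
{
    return map_xlate(from_guest ? t->gpa2iova : t->svq2iova,
                     from_guest ? t->n_gpa : t->n_svq, src, iova);
}

int main(void)
{
    SVQTranslations t = {
        .gpa2iova = { { 0x0, 0xffffffffULL, 0x1000 } }, .n_gpa = 1,
        .svq2iova = { { 0x7f7903e00000ULL, 0x7f7903e01fffULL,
                        0x2000000000ULL } }, .n_svq = 1,
    };
    uint64_t iova;

    if (svq_translate(&t, true, 0xfeda0000, &iova)) {
        printf("GPA 0xfeda0000 -> IOVA 0x%" PRIx64 "\n", iova);
    }
    return 0;
}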
On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote: > > On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin > <eperezma@redhat.com> wrote: > > > > On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin > > > <eperezma@redhat.com> wrote: > > > > > > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin > > > > > > > > > > > <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads > > > > > > > > > > > > > > to the same HVA. This causes a problem when overlapped regions > > > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking > > > > > > > > > > > > > > them by HVA will return them twice. > > > > > > > > > > > > > > > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from. > > > > > > > > > > > > > > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps / > > > > > > > > > > > > unmaps happens [1]: > > > > > > > > > > > > > > > > > > > > > > > > HVA GPA IOVA > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------- > > > > > > > > > > > > Map > > > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) > > > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > > > > > > > > > > > > [0x80001000, 0x2000001000) > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > > > > > > > > > > > > [0x2000001000, 0x2000021000) > > > > > > > > > > > > > > > > > > > > > > > > Unmap > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, > > > > > > > > > > > > 0x20000) ??? > > > > > > > > > > > > > > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a > > > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does > > > > > > > > > > > > not overlap, only HVA. > > > > > > > > > > > > > > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk, > > > > > > > > > > > > not the second one. 
This series is the way to tell the difference at > > > > > > > > > > > > unmap time. > > > > > > > > > > > > > > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html > > > > > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in > > > > > > > > > > > the iova tree to solve this issue completely. Then there won't be > > > > > > > > > > > aliasing issues. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ > > > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree, > > > > > > > > > > and they do not have GPA. > > > > > > > > > > > > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this > > > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store > > > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and > > > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers). > > > > > > > > > > > > > > > > > > This seems to be tricky. > > > > > > > > > > > > > > > > > > As discussed, it could be another iova tree. > > > > > > > > > > > > > > > > > > > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap. > > > > > > > > > > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue. > > > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that > > > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at > > > > > > > > adding or removing, like in the memory listener, but how to know at > > > > > > > > vhost_svq_translate_addr? > > > > > > > > > > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of > > > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly? > > > > > > > > > > > > > > > > > > > The problem is not virtqueue_pop, that's out of the > > > > > > vhost_svq_translate_addr. The problem is the need of adding > > > > > > conditionals / complexity in all the callers of > > > > > > > > > > > > > > > > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When > > > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is > > > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it > > > > > > > > just worked in my tests so far. > > > > > > > > > > > > > > > > Now we have the second problem: The GPA values of the regions of the > > > > > > > > two IOVA tree must be unique. We need to be able to find unallocated > > > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so > > > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very > > > > > > > > complicated with two trees. > > > > > > > > > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we > > > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is > > > > > > > shared by both > > > > > > > > > > > > > > 1) Guest memory (GPA) > > > > > > > 2) SVQ virtqueue and buffers > > > > > > > > > > > > > > And another gtree to track the GPA to IOVA. 
> > > > > > > > > > > > > > The SVQ code could use either > > > > > > > > > > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers > > > > > > > > > > > > > > or > > > > > > > > > > > > > > 2) dynamic IOVA allocation/deallocation helpers > > > > > > > > > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? > > > > > > > > > > > > > > > > > > That's possible, but that scatters the IOVA handling code instead of > > > > > > keeping it self-contained in VhostIOVATree. > > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used. > > > > > An example is the iova allocator in the kernel. > > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough > > > > > code, not sure it is useful here though (haven't had a deep look at > > > > > that). > > > > > > > > I don't know enough about them to have an opinion. I keep seeing the > > > > drawback of needing to synchronize both allocation & adding in all the > > > > places we want to modify the IOVATree. At this moment, these are the > > > > vhost-vdpa memory listener, the SVQ vring creation and removal, and > > > > net CVQ buffers. But it may be more in the future. > > > > > > > > What are the advantages of keeping these separated that justifies > > > > needing to synchronize in all these places, compared with keeping them > > > > synchronized in VhostIOVATree? > > > > > > It doesn't need to be synchronized. > > > > > > Assuming guest and SVQ shares IOVA range. IOVA only needs to track > > > which part of the range has been used. > > > > > > > > Not sure if I follow, that is what I mean with "synchronized". > > > > Oh right. > > > > > > > > > This simplifies things, we can store GPA->IOVA mappings and SVQ -> > > > IOVA mappings separately. > > > > > > > > Sorry, I still cannot see the whole picture :). > > > > Let's assume we have all the GPA mapped to specific IOVA regions, so > > > we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ > > > because of the migration. How can we know where we can place SVQ > > > vrings without having them synchronized? > > > > Just allocating a new IOVA range for SVQ? > > > > > > > > At this moment we're using a tree. The tree nature of the current SVQ > > > IOVA -> VA makes all nodes ordered so it is more or less easy to look > > > for holes. > > > > Yes, iova allocate could still be implemented via a tree. > > > > > > > > Now your proposal uses the SVQ IOVA as tree values. Should we iterate > > > over all of them, order them, of the two trees, and then look for > > > holes there? > > > > Let me clarify, correct me if I was wrong: > > > > 1) IOVA allocator is still implemented via a tree, we just don't need > > to store how the IOVA is used > > 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in > > the datapath SVQ translation > > 3) A linear mapping or another SVQ -> IOVA tree used for SVQ > > Ok, so the part I was missing is that now we have 3 whole trees, with somehow redundant information :). In some sense this is simpler than trying to get all the information from only two trees. On the bad side, all SVQ calls that allocate some region need to remember to add to one of the two other trees. That is what I mean by synchronized. But sure, we can go that way. > Thanks > > > > Thanks > > > > > > Thanks! > > > > > >
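To make the "remember to add" burden concrete: with the allocator decoupled, every mapping site (the memory listener, SVQ vring setup, net CVQ buffers) repeats a two-step update. A sketch with hypothetical names and stub bodies:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stub allocator and map update; only the calling convention matters. */
static int iova_range_alloc(uint64_t size, uint64_t *iova)
{
    static uint64_t next = 0x1000;              /* stub bump allocator */

    *iova = next;
    next += size;
    return 0;
}

static int svq_map_insert(uint64_t hva, uint64_t size, uint64_t iova)
{
    printf("HVA 0x%" PRIx64 " (+0x%" PRIx64 ") -> IOVA 0x%" PRIx64 "\n",
           hva, size, iova);
    return 0;
}

/* Every mapping site has to pair the two calls itself; forgetting the
 * second one leaves an allocated but untranslatable range, which is
 * the synchronization hazard being discussed. */
static int svq_add_vring(uint64_t vring_hva, uint64_t vring_size)
{
    uint64_t iova;

    if (iova_range_alloc(vring_size, &iova) < 0) {
        return -1;
    }
    return svq_map_insert(vring_hva, vring_size, iova);
}

int main(void)
{
    return svq_add_vring(0x7f7903e00000ULL, 0x2000);
}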
On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote: > > > > On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin > > <eperezma@redhat.com> wrote: > > > > > > On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin > > > > <eperezma@redhat.com> wrote: > > > > > > > > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin > > > > > > > > > > > > <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads > > > > > > > > > > > > > > > to the same HVA. This causes a problem when overlapped regions > > > > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking > > > > > > > > > > > > > > > them by HVA will return them twice. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from. > > > > > > > > > > > > > > > > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps / > > > > > > > > > > > > > unmaps happens [1]: > > > > > > > > > > > > > > > > > > > > > > > > > > HVA GPA IOVA > > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------- > > > > > > > > > > > > > Map > > > > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) > > > > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > > > > > > > > > > > > > [0x80001000, 0x2000001000) > > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > > > > > > > > > > > > > [0x2000001000, 0x2000021000) > > > > > > > > > > > > > > > > > > > > > > > > > > Unmap > > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, > > > > > > > > > > > > > 0x20000) ??? > > > > > > > > > > > > > > > > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a > > > > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does > > > > > > > > > > > > > not overlap, only HVA. 
> > > > > > > > > > > > > > > > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk, > > > > > > > > > > > > > not the second one. This series is the way to tell the difference at > > > > > > > > > > > > > unmap time. > > > > > > > > > > > > > > > > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in > > > > > > > > > > > > the iova tree to solve this issue completely. Then there won't be > > > > > > > > > > > > aliasing issues. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ > > > > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree, > > > > > > > > > > > and they do not have GPA. > > > > > > > > > > > > > > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this > > > > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store > > > > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and > > > > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers). > > > > > > > > > > > > > > > > > > > > This seems to be tricky. > > > > > > > > > > > > > > > > > > > > As discussed, it could be another iova tree. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap. > > > > > > > > > > > > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue. > > > > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that > > > > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at > > > > > > > > > adding or removing, like in the memory listener, but how to know at > > > > > > > > > vhost_svq_translate_addr? > > > > > > > > > > > > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of > > > > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly? > > > > > > > > > > > > > > > > > > > > > > The problem is not virtqueue_pop, that's out of the > > > > > > > vhost_svq_translate_addr. The problem is the need of adding > > > > > > > conditionals / complexity in all the callers of > > > > > > > > > > > > > > > > > > > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When > > > > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is > > > > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it > > > > > > > > > just worked in my tests so far. > > > > > > > > > > > > > > > > > > Now we have the second problem: The GPA values of the regions of the > > > > > > > > > two IOVA tree must be unique. We need to be able to find unallocated > > > > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so > > > > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very > > > > > > > > > complicated with two trees. > > > > > > > > > > > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we > > > > > > > > can have a dedicated gtree to track the allocated IOVA ranges. 
It is > > > > > > > > shared by both > > > > > > > > > > > > > > > > 1) Guest memory (GPA) > > > > > > > > 2) SVQ virtqueue and buffers > > > > > > > > > > > > > > > > And another gtree to track the GPA to IOVA. > > > > > > > > > > > > > > > > The SVQ code could use either > > > > > > > > > > > > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers > > > > > > > > > > > > > > > > or > > > > > > > > > > > > > > > > 2) dynamic IOVA allocation/deallocation helpers > > > > > > > > > > > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? > > > > > > > > > > > > > > > > > > > > > > That's possible, but that scatters the IOVA handling code instead of > > > > > > > keeping it self-contained in VhostIOVATree. > > > > > > > > > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used. > > > > > > > > > > > > An example is the iova allocator in the kernel. > > > > > > > > > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough > > > > > > code, not sure it is useful here though (haven't had a deep look at > > > > > > that). > > > > > > > > > > > > > > > > I don't know enough about them to have an opinion. I keep seeing the > > > > > drawback of needing to synchronize both allocation & adding in all the > > > > > places we want to modify the IOVATree. At this moment, these are the > > > > > vhost-vdpa memory listener, the SVQ vring creation and removal, and > > > > > net CVQ buffers. But it may be more in the future. > > > > > > > > > > What are the advantages of keeping these separated that justifies > > > > > needing to synchronize in all these places, compared with keeping them > > > > > synchronized in VhostIOVATree? > > > > > > > > It doesn't need to be synchronized. > > > > > > > > Assuming guest and SVQ shares IOVA range. IOVA only needs to track > > > > which part of the range has been used. > > > > > > > > > > Not sure if I follow, that is what I mean with "synchronized". > > > > Oh right. > > > > > > > > > This simplifies things, we can store GPA->IOVA mappings and SVQ -> > > > > IOVA mappings separately. > > > > > > > > > > Sorry, I still cannot see the whole picture :). > > > > > > Let's assume we have all the GPA mapped to specific IOVA regions, so > > > we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ > > > because of the migration. How can we know where we can place SVQ > > > vrings without having them synchronized? > > > > Just allocating a new IOVA range for SVQ? > > > > > > > > At this moment we're using a tree. The tree nature of the current SVQ > > > IOVA -> VA makes all nodes ordered so it is more or less easy to look > > > for holes. > > > > Yes, iova allocate could still be implemented via a tree. > > > > > > > > Now your proposal uses the SVQ IOVA as tree values. Should we iterate > > > over all of them, order them, of the two trees, and then look for > > > holes there? > > > > Let me clarify, correct me if I was wrong: > > > > 1) IOVA allocator is still implemented via a tree, we just don't need > > to store how the IOVA is used > > 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in > > the datapath SVQ translation > > 3) A linear mapping or another SVQ -> IOVA tree used for SVQ > > > > Ok, so the part I was missing is that now we have 3 whole trees, with > somehow redundant information :). Somehow, it decouples the IOVA usage out of the IOVA allocator. This might be simple as guests and SVQ may try to share a single IOVA address space. 
> > In some sense this is simpler than trying to get all the information > from only two trees. On the bad side, all SVQ calls that allocate some > region need to remember to add to one of the two other trees. That is > what I mean by synchronized. But sure, we can go that way. Just a suggestion; if it turns out to complicate the issue, I'm fine to go the other way. Thanks > > > Thanks > > > > > > > > Thanks > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > >
On 5/13/24 11:56 PM, Jason Wang wrote: > On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin > <eperezma@redhat.com> wrote: >> >> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote: >>> >>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin >>> <eperezma@redhat.com> wrote: >>>> >>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: >>>>> >>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin >>>>> <eperezma@redhat.com> wrote: >>>>>> >>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>> >>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>> >>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>> >>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>>>> >>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin >>>>>>>>>>>>> <eperezma@redhat.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads >>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped regions >>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the tree, as looking >>>>>>>>>>>>>>>> them by HVA will return them twice. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue? >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of maps / >>>>>>>>>>>>>> unmaps happens [1]: >>>>>>>>>>>>>> >>>>>>>>>>>>>> HVA GPA IOVA >>>>>>>>>>>>>> ------------------------------------------------------------------------------------------------------------------------- >>>>>>>>>>>>>> Map >>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) >>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) >>>>>>>>>>>>>> [0x80001000, 0x2000001000) >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) >>>>>>>>>>>>>> [0x2000001000, 0x2000021000) >>>>>>>>>>>>>> >>>>>>>>>>>>>> Unmap >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, >>>>>>>>>>>>>> 0x20000) ??? >>>>>>>>>>>>>> >>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a >>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU, as GPA does >>>>>>>>>>>>>> not overlap, only HVA. >>>>>>>>>>>>>> >>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk, >>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at >>>>>>>>>>>>>> unmap time. >>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$ >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks! 
>>>>>>>>>>>>> >>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in >>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be >>>>>>>>>>>>> aliasing issues. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ >>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree, >>>>>>>>>>>> and they do not have GPA. >>>>>>>>>>>> >>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this >>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store >>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and >>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers). >>>>>>>>>>> >>>>>>>>>>> This seems to be tricky. >>>>>>>>>>> >>>>>>>>>>> As discussed, it could be another iova tree. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap. >>>>>>>>>> >>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue. >>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that >>>>>>>>>> translates from vaddr to SVQ IOVA. To know which one to use is easy at >>>>>>>>>> adding or removing, like in the memory listener, but how to know at >>>>>>>>>> vhost_svq_translate_addr? >>>>>>>>> >>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of >>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly? >>>>>>>>> >>>>>>>> >>>>>>>> The problem is not virtqueue_pop, that's out of the >>>>>>>> vhost_svq_translate_addr. The problem is the need of adding >>>>>>>> conditionals / complexity in all the callers of >>>>>>>> >>>>>>>>>> >>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When >>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is >>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it >>>>>>>>>> just worked in my tests so far. >>>>>>>>>> >>>>>>>>>> Now we have the second problem: The GPA values of the regions of the >>>>>>>>>> two IOVA tree must be unique. We need to be able to find unallocated >>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so >>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very >>>>>>>>>> complicated with two trees. >>>>>>>>> >>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we >>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. It is >>>>>>>>> shared by both >>>>>>>>> >>>>>>>>> 1) Guest memory (GPA) >>>>>>>>> 2) SVQ virtqueue and buffers >>>>>>>>> >>>>>>>>> And another gtree to track the GPA to IOVA. >>>>>>>>> >>>>>>>>> The SVQ code could use either >>>>>>>>> >>>>>>>>> 1) one linear mappings that contains both SVQ virtqueue and buffers >>>>>>>>> >>>>>>>>> or >>>>>>>>> >>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers >>>>>>>>> >>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? >>>>>>>>> >>>>>>>> >>>>>>>> That's possible, but that scatters the IOVA handling code instead of >>>>>>>> keeping it self-contained in VhostIOVATree. >>>>>>> >>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used. >>>>>>> >>>>>>> An example is the iova allocator in the kernel. >>>>>>> >>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough >>>>>>> code, not sure it is useful here though (haven't had a deep look at >>>>>>> that). 
>>>>>>> >>>>>> >>>>>> I don't know enough about them to have an opinion. I keep seeing the >>>>>> drawback of needing to synchronize both allocation & adding in all the >>>>>> places we want to modify the IOVATree. At this moment, these are the >>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and >>>>>> net CVQ buffers. But it may be more in the future. >>>>>> >>>>>> What are the advantages of keeping these separated that justifies >>>>>> needing to synchronize in all these places, compared with keeping them >>>>>> synchronized in VhostIOVATree? >>>>> >>>>> It doesn't need to be synchronized. >>>>> >>>>> Assuming guest and SVQ shares IOVA range. IOVA only needs to track >>>>> which part of the range has been used. >>>>> >>>> >>>> Not sure if I follow, that is what I mean with "synchronized". >>> >>> Oh right. >>> >>>> >>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ -> >>>>> IOVA mappings separately. >>>>> >>>> >>>> Sorry, I still cannot see the whole picture :). >>>> >>>> Let's assume we have all the GPA mapped to specific IOVA regions, so >>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ >>>> because of the migration. How can we know where we can place SVQ >>>> vrings without having them synchronized? >>> >>> Just allocating a new IOVA range for SVQ? >>> >>>> >>>> At this moment we're using a tree. The tree nature of the current SVQ >>>> IOVA -> VA makes all nodes ordered so it is more or less easy to look >>>> for holes. >>> >>> Yes, iova allocate could still be implemented via a tree. >>> >>>> >>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate >>>> over all of them, order them, of the two trees, and then look for >>>> holes there? >>> >>> Let me clarify, correct me if I was wrong: >>> >>> 1) IOVA allocator is still implemented via a tree, we just don't need >>> to store how the IOVA is used >>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in >>> the datapath SVQ translation >>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ >>> >> >> Ok, so the part I was missing is that now we have 3 whole trees, with >> somehow redundant information :). > > Somehow, it decouples the IOVA usage out of the IOVA allocator. This > might be simple as guests and SVQ may try to share a single IOVA > address space. > I'm working on implementing your three suggestions above but I'm a bit confused with some of the wording and I was hoping you could clarify some of it for me when you get the chance. --- For your first suggestion (1) you mention decoupling the IOVA allocator and "don't need to store how the IOVA is used." By this, do you mean to not save the IOVA->HVA mapping and instead only save the allocated IOVA ranges? In other words, are you suggesting to create a dedicated "IOVA->IOVA" tree like: struct VhostIOVATree { uint64_t iova_first; uint64_t iova_last; IOVATree *iova_map; }; Where the mapping might look something like (where translated_addr is given some kind of 0 value): iova_region = (DMAMap) { .iova = iova_first, .translated_addr = 0, .size = region_size - 1, .perm = IOMMU_ACCESS_FLAG(true, section->readonly), }; Also, if this is what you mean by decoupling the IOVA allocator, what happens to the IOVA->HVA tree? Are we no longer saving these mappings in a tree? --- In your second suggestion (2) with a dedicated GPA->IOVA tree, were you thinking something like this? 
Just adding on to VhostIOVATree here: > > struct VhostIOVATree { > uint64_t iova_first; > uint64_t iova_last; > IOVATree *iova_map; > IOVATree *gpa_map; > }; > > Where the mapping might look something like: > > gpa_region = (DMAMap) { > .iova = iova_first, > .translated_addr = gpa_first, > .size = region_size - 1, > .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > }; > > Also, when you say "used in the datapath SVQ translation", we still need > to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is > called, right? > > --- > Lastly, in your third suggestion (3) you mention implementing a > SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for > just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g. > > struct VhostIOVATree { > uint64_t iova_first; > uint64_t iova_last; > IOVATree *iova_map; > IOVATree *gpa_map; > IOVATree *svq_map; > }; > > --- > > Let me know if I'm understanding this correctly. If I am, this would > require a pretty good amount of refactoring of the IOVA allocation, > searching, destroying, etc. code to account for these new trees. > > Thanks Jason! > > >> > >> In some sense this is simpler than trying to get all the information > >> from only two trees. On the bad side, all SVQ calls that allocate some > >> region need to remember to add to one of the two other trees. That is > >> what I mean by synchronized. But sure, we can go that way. 
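One observation on the gpa_map shape sketched above: with .translated_addr holding the GPA, the datapath lookup has to search by translated_addr rather than by the iova key. A standalone sketch of that lookup direction, using a stand-in for DMAMap and the guest RAM ranges from the trace quoted earlier in the thread (hypothetical code, not util/iova-tree.c):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Shaped like QEMU's DMAMap but standalone: iova would be the tree's
 * sort key, translated_addr holds the GPA, size is bytes - 1. */
typedef struct {
    uint64_t iova;
    uint64_t translated_addr;
    uint64_t size;
} GpaMapEntry;

/* Datapath direction: given a GPA from the guest's vring, find the
 * IOVA, i.e. search entries by translated_addr, not by the key. */
static int gpa_map_find(const GpaMapEntry *map, int n, uint64_t gpa,
                        uint64_t *iova)
{
    for (int i = 0; i < n; i++) {
        if (gpa >= map[i].translated_addr &&
            gpa - map[i].translated_addr <= map[i].size) {
            *iova = map[i].iova + (gpa - map[i].translated_addr);
            return 0;
        }
    }
    return -1;
}

int main(void)
{
    /* The two guest RAM ranges from the trace earlier in the thread. */
    GpaMapEntry map[] = {
        { 0x1000, 0x0, 0x80000000ULL - 1 },
        { 0x80001000ULL, 0x100000000ULL, 0x1f80000000ULL - 1 },
    };
    uint64_t iova;

    if (gpa_map_find(map, 2, 0x100000000ULL, &iova) == 0) {
        printf("GPA 0x100000000 -> IOVA 0x%" PRIx64 "\n", iova);
    }
    return 0;
}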
On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.palmer@oracle.com> wrote: > > > > On 5/13/24 11:56 PM, Jason Wang wrote: > > On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin > > <eperezma@redhat.com> wrote: > >> > >> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote: > >>> > >>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin > >>> <eperezma@redhat.com> wrote: > >>>> > >>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>> > >>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin > >>>>> <eperezma@redhat.com> wrote: > >>>>>> > >>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>> > >>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>>>>>>> > >>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>> > >>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin > >>>>>>>>>>>>> <eperezma@redhat.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads > >>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped regions > >>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the tree, as looking > >>>>>>>>>>>>>>>> them by HVA will return them twice. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of maps / > >>>>>>>>>>>>>> unmaps happens [1]: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> HVA GPA IOVA > >>>>>>>>>>>>>> ------------------------------------------------------------------------------------------------------------------------- > >>>>>>>>>>>>>> Map > >>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) > >>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > >>>>>>>>>>>>>> [0x80001000, 0x2000001000) > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > >>>>>>>>>>>>>> [0x2000001000, 0x2000021000) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Unmap > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, > >>>>>>>>>>>>>> 0x20000) ??? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a > >>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU, as GPA does > >>>>>>>>>>>>>> not overlap, only HVA. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk, > >>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at > >>>>>>>>>>>>>> unmap time. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$ > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks! > >>>>>>>>>>>>> > >>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in > >>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be > >>>>>>>>>>>>> aliasing issues. > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ > >>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree, > >>>>>>>>>>>> and they do not have GPA. > >>>>>>>>>>>> > >>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this > >>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store > >>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and > >>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers). > >>>>>>>>>>> > >>>>>>>>>>> This seems to be tricky. > >>>>>>>>>>> > >>>>>>>>>>> As discussed, it could be another iova tree. > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap. > >>>>>>>>>> > >>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue. > >>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that > >>>>>>>>>> translates from vaddr to SVQ IOVA. To know which one to use is easy at > >>>>>>>>>> adding or removing, like in the memory listener, but how to know at > >>>>>>>>>> vhost_svq_translate_addr? > >>>>>>>>> > >>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of > >>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly? > >>>>>>>>> > >>>>>>>> > >>>>>>>> The problem is not virtqueue_pop, that's out of the > >>>>>>>> vhost_svq_translate_addr. The problem is the need of adding > >>>>>>>> conditionals / complexity in all the callers of > >>>>>>>> > >>>>>>>>>> > >>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When > >>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is > >>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it > >>>>>>>>>> just worked in my tests so far. > >>>>>>>>>> > >>>>>>>>>> Now we have the second problem: The GPA values of the regions of the > >>>>>>>>>> two IOVA tree must be unique. We need to be able to find unallocated > >>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so > >>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very > >>>>>>>>>> complicated with two trees. > >>>>>>>>> > >>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we > >>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. It is > >>>>>>>>> shared by both > >>>>>>>>> > >>>>>>>>> 1) Guest memory (GPA) > >>>>>>>>> 2) SVQ virtqueue and buffers > >>>>>>>>> > >>>>>>>>> And another gtree to track the GPA to IOVA. > >>>>>>>>> > >>>>>>>>> The SVQ code could use either > >>>>>>>>> > >>>>>>>>> 1) one linear mappings that contains both SVQ virtqueue and buffers > >>>>>>>>> > >>>>>>>>> or > >>>>>>>>> > >>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers > >>>>>>>>> > >>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? 
> >>>>>>>>> > >>>>>>>> > >>>>>>>> That's possible, but that scatters the IOVA handling code instead of > >>>>>>>> keeping it self-contained in VhostIOVATree. > >>>>>>> > >>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used. > >>>>>>> > >>>>>>> An example is the iova allocator in the kernel. > >>>>>>> > >>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough > >>>>>>> code, not sure it is useful here though (haven't had a deep look at > >>>>>>> that). > >>>>>>> > >>>>>> > >>>>>> I don't know enough about them to have an opinion. I keep seeing the > >>>>>> drawback of needing to synchronize both allocation & adding in all the > >>>>>> places we want to modify the IOVATree. At this moment, these are the > >>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and > >>>>>> net CVQ buffers. But it may be more in the future. > >>>>>> > >>>>>> What are the advantages of keeping these separated that justifies > >>>>>> needing to synchronize in all these places, compared with keeping them > >>>>>> synchronized in VhostIOVATree? > >>>>> > >>>>> It doesn't need to be synchronized. > >>>>> > >>>>> Assuming guest and SVQ shares IOVA range. IOVA only needs to track > >>>>> which part of the range has been used. > >>>>> > >>>> > >>>> Not sure if I follow, that is what I mean with "synchronized". > >>> > >>> Oh right. > >>> > >>>> > >>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ -> > >>>>> IOVA mappings separately. > >>>>> > >>>> > >>>> Sorry, I still cannot see the whole picture :). > >>>> > >>>> Let's assume we have all the GPA mapped to specific IOVA regions, so > >>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ > >>>> because of the migration. How can we know where we can place SVQ > >>>> vrings without having them synchronized? > >>> > >>> Just allocating a new IOVA range for SVQ? > >>> > >>>> > >>>> At this moment we're using a tree. The tree nature of the current SVQ > >>>> IOVA -> VA makes all nodes ordered so it is more or less easy to look > >>>> for holes. > >>> > >>> Yes, iova allocate could still be implemented via a tree. > >>> > >>>> > >>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate > >>>> over all of them, order them, of the two trees, and then look for > >>>> holes there? > >>> > >>> Let me clarify, correct me if I was wrong: > >>> > >>> 1) IOVA allocator is still implemented via a tree, we just don't need > >>> to store how the IOVA is used > >>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in > >>> the datapath SVQ translation > >>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ > >>> > >> > >> Ok, so the part I was missing is that now we have 3 whole trees, with > >> somehow redundant information :). > > > > Somehow, it decouples the IOVA usage out of the IOVA allocator. This > > might be simple as guests and SVQ may try to share a single IOVA > > address space. > > > > I'm working on implementing your three suggestions above but I'm a bit > confused with some of the wording and I was hoping you could clarify > some of it for me when you get the chance. > > --- > For your first suggestion (1) you mention decoupling the IOVA allocator > and "don't need to store how the IOVA is used." > > By this, do you mean to not save the IOVA->HVA mapping and instead only > save the allocated IOVA ranges? 
In other words, are you suggesting to > create a dedicated "IOVA->IOVA" tree like: > > struct VhostIOVATree { > uint64_t iova_first; > uint64_t iova_last; > IOVATree *iova_map; > }; > > Where the mapping might look something like (where translated_addr is > given some kind of 0 value): > > iova_region = (DMAMap) { > .iova = iova_first, > .translated_addr = 0, > .size = region_size - 1, > .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > }; > > Also, if this is what you mean by decoupling the IOVA allocator, what > happens to the IOVA->HVA tree? Are we no longer saving these mappings in > a tree? > > --- > In your second suggestion (2) with a dedicated GPA->IOVA tree, were you > thinking something like this? Just adding on to VhostIOVATree here: > > struct VhostIOVATree { > uint64_t iova_first; > uint64_t iova_last; > IOVATree *iova_map; > IOVATree *gpa_map; > }; > > Where the mapping might look something like: > > gpa_region = (DMAMap) { > .iova = iova_first, > .translated_addr = gpa_first, > .size = region_size - 1, > .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > }; > > Also, when you say "used in the datapath SVQ translation", we still need > to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is > called, right? > > --- > Lastly, in your third suggestion (3) you mention implementing a > SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for > just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g. > > struct VhostIOVATree { > uint64_t iova_first; > uint64_t iova_last; > IOVATree *iova_map; > IOVATree *gpa_map; > IOVATree *svq_map; > }; > > --- > > Let me know if I'm understanding this correctly. If I am, this would > require a pretty good amount of refactoring on the IOVA allocation, > searching, destroying, etc. code to account for these new trees. > Ok, I think I understand your previous question better now; sorry for the delay :). If I'm not wrong, Jason did not enumerate these as alternatives but as parts of the same solution. Jason, please correct me if I'm wrong here. His solution is composed of three trees: 1) One for the IOVA allocations, so we know where to allocate new ranges. 2) One for the GPA -> SVQ IOVA translations. 3) Another one for the SVQ vring translations. In my opinion, using GPA is problematic because it forces all of the callers to know whether the address we want to translate comes from the guest or not. Current code does not know it, as vhost_svq_translate_addr is called to translate both dataplane buffer addresses and control virtqueue buffers, which are also shadowed. Transmitting that information to the caller code would require duplicating functions, adding a boolean, etc., as it lives in a different layer (net-specific net/vhost-vdpa.c vs generic hw/virtio/vhost-shadow-virtqueue.c). In my opinion it is easier to keep just two trees (or similar structs): 1) One for the IOVA allocations, so we know where to allocate new ranges. We don't actually need to store the translations here. 2) One for HVA -> SVQ IOVA translations. This way we can accommodate both SVQ vrings, CVQ buffers, and guest memory buffers, all on the second tree, and take care of the HVA duplications. Thanks! > Thanks Jason! > > >> > >> In some sense this is simpler than trying to get all the information > >> from only two trees. On the bad side, all SVQ calls that allocate some > >> region need to remember to add to one of the two other trees. That is > >> what I mean by synchronized. But sure, we can go that way.
> > > > Just a suggestion, if it turns out to complicate the issue, I'm fine > > to go the other way. > > > > Thanks > > > >> > >>> Thanks > >>> > >>>> > >>>>> Thanks > >>>>> > >>>>>> > >>>>>> Thanks! > >>>>>> > >>>>> > >>>> > >>> > >> > >> > > >
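To make the two-structure layout proposed above concrete, here is a minimal, self-contained sketch in C on top of GLib's GTree (the library QEMU's util/iova-tree.c builds on). Everything suffixed _sketch, the next_free bump cursor, and the simplified overlap-as-equal comparators are illustrative assumptions for this thread, not QEMU code; in particular, a real allocator would reuse freed holes instead of only bump-allocating.

#include <glib.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t hwaddr;

/* Mirrors the spirit of QEMU's DMAMap; .size is inclusive, as in QEMU. */
typedef struct DMAMap {
    hwaddr iova;
    hwaddr translated_addr;    /* HVA in this sketch */
    hwaddr size;
} DMAMap;

/* The two structures: an allocation-only tree (values unused) plus an
 * HVA-keyed tree holding the HVA -> SVQ IOVA translations. */
typedef struct VhostIOVATree {
    hwaddr iova_first, iova_last;
    hwaddr next_free;          /* toy bump cursor */
    GTree *alloc;              /* keys ordered by IOVA range */
    GTree *hva_map;            /* keys ordered by HVA range */
} VhostIOVATree;

/* Overlapping ranges compare equal, so looking up any address inside an
 * existing range finds that range. */
static gint iova_cmp(gconstpointer a, gconstpointer b, gpointer d)
{
    const DMAMap *m1 = a, *m2 = b;
    if (m1->iova + m1->size < m2->iova) return -1;
    if (m2->iova + m2->size < m1->iova) return 1;
    return 0;
}

static gint hva_cmp(gconstpointer a, gconstpointer b, gpointer d)
{
    const DMAMap *m1 = a, *m2 = b;
    if (m1->translated_addr + m1->size < m2->translated_addr) return -1;
    if (m2->translated_addr + m2->size < m1->translated_addr) return 1;
    return 0;
}

static VhostIOVATree *vhost_iova_tree_new_sketch(hwaddr first, hwaddr last)
{
    VhostIOVATree *t = g_new0(VhostIOVATree, 1);
    t->iova_first = first;
    t->iova_last = last;
    t->next_free = first;
    t->alloc = g_tree_new_full(iova_cmp, NULL, g_free, NULL);
    t->hva_map = g_tree_new_full(hva_cmp, NULL, g_free, NULL);
    return t;
}

/* Single entry point for every user (memory listener, SVQ vrings, CVQ
 * buffers): allocate an IOVA range and record its HVA translation, so
 * both structures stay consistent in one place. */
static bool vhost_iova_tree_map_alloc_sketch(VhostIOVATree *t, DMAMap *map)
{
    DMAMap *alloc_key, *hva_key;

    if (map->size >= t->iova_last - t->next_free) {
        return false;          /* out of IOVA space in this toy allocator */
    }
    map->iova = t->next_free;
    t->next_free += map->size + 1;

    alloc_key = g_new(DMAMap, 1);
    *alloc_key = *map;
    hva_key = g_new(DMAMap, 1);
    *hva_key = *map;
    g_tree_insert(t->alloc, alloc_key, NULL);
    g_tree_insert(t->hva_map, hva_key, NULL);
    return true;
}

Note how the GPA question disappears in this variant: guest memory, SVQ vrings and CVQ buffers all enter through the same HVA-keyed tree, which is the point of keeping just two structures.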
On 7/29/24 6:04 AM, Eugenio Perez Martin wrote: > [...] > > In my opinion it is easier to keep just two trees (or similar structs): > 1) One for the IOVA allocations, so we know where to allocate new ranges. We don't actually need to store the translations here. > 2) One for HVA -> SVQ IOVA translations. > > This way we can accommodate both SVQ vrings, CVQ buffers, and guest memory buffers, all on the second tree, and take care of the HVA duplications. > > Thanks! I assume that this dedicated IOVA tree will be created and added to in vhost_iova_tree_map_alloc --> iova_tree_alloc_map function calls, but what about the IOVA->HVA tree that gets created during vhost_vdpa_listener_region_add?
Will this tree just be replaced with the dedicated IOVA tree? Also, an HVA->SVQ IOVA tree means that the tree is balanced using the HVA value and not the IOVA value, right? In other words, a tree where it's more efficient to search using the HVA values vs. IOVA values?
On Mon, Jul 29, 2024 at 7:50 PM Jonah Palmer <jonah.palmer@oracle.com> wrote: > [...]
> I assume that this dedicated IOVA tree will be created and added to in vhost_iova_tree_map_alloc --> iova_tree_alloc_map function calls, but what about the IOVA->HVA tree that gets created during vhost_vdpa_listener_region_add? Not sure if I get you. The only IOVA tree that vdpa is using is created at either net/vhost-vdpa.c:vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start. From that moment, other places like vhost_vdpa_listener_region_add just add entries to it. > Will this tree just be replaced with the dedicated IOVA tree? I'd say that for a first iteration it is ok to keep using VhostIOVATree->IOVATree, even if the values of the tree (HVA) are not used anymore. > Also, an HVA->SVQ IOVA tree means that the tree is balanced using the HVA value and not the IOVA value, right? In other words, a tree where it's more efficient to search using the HVA values vs. IOVA values? Right, HVA is the key and SVQ IOVA is the value. That way we can detect duplicates and act accordingly.
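A hedged illustration of that last answer, reusing the sketch types from the earlier example (hva_tree_find_sketch is a hypothetical helper, not a QEMU function): because the tree is keyed by HVA range, with overlap comparing as equal, a lookup before insertion is what exposes aliased HVA ranges such as the one Si-Wei reported.

/* Find an existing mapping whose HVA range overlaps [hva, hva + size].
 * A hit means the HVA is already mapped (possibly under another GPA),
 * so the caller can reuse or disambiguate the stored SVQ IOVA instead
 * of silently shadowing the first entry. */
static const DMAMap *hva_tree_find_sketch(GTree *hva_map, hwaddr hva,
                                          hwaddr size)
{
    const DMAMap needle = { .translated_addr = hva, .size = size };
    gpointer key = NULL, value = NULL;

    if (g_tree_lookup_extended(hva_map, &needle, &key, &value)) {
        return key;            /* the stored DMAMap, carrying its IOVA */
    }
    return NULL;
}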
On 7/29/24 2:20 PM, Eugenio Perez Martin wrote: > [...]
>> I assume that this dedicated IOVA tree will be created and added to in vhost_iova_tree_map_alloc --> iova_tree_alloc_map function calls, but what about the IOVA->HVA tree that gets created during vhost_vdpa_listener_region_add? > Not sure if I get you. The only IOVA tree that vdpa is using is created at either net/vhost-vdpa.c:vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start. From that moment, other places like vhost_vdpa_listener_region_add just add entries to it. Ah, right, my apologies. "Created" was the wrong word to use here; I really meant built up / added to. >> Will this tree just be replaced with the dedicated IOVA tree? > I'd say that for a first iteration it is ok to keep using VhostIOVATree->IOVATree, even if the values of the tree (HVA) are not used anymore. Ack, thanks! >> Also, an HVA->SVQ IOVA tree means that the tree is balanced using the HVA value and not the IOVA value, right? > Right, HVA is the key and SVQ IOVA is the value. That way we can detect duplicates and act accordingly. Ack. Thanks for the help! I'll see what I can do and get a first iteration out as soon as I can.
On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > [...]
> > > > > > > I'm working on implementing your three suggestions above but I'm a bit > > confused with some of the wording and I was hoping you could clarify > > some of it for me when you get the chance. > > > > --- > > For your first suggestion (1) you mention decoupling the IOVA allocator > > and "don't need to store how the IOVA is used." > > > > By this, do you mean to not save the IOVA->HVA mapping and instead only > > save the allocated IOVA ranges? Yes. > > In other words, are you suggesting to > > create a dedicated "IOVA->IOVA" tree like: > > > > struct VhostIOVATree { > > uint64_t iova_first; > > uint64_t iova_last; > > IOVATree *iova_map; > > }; It could be this or other. I think the point is the allocator is only used for IOVA allocation but it doesn't have any information about the mapping. > > > > Where the mapping might look something like (where translated_addr is > > given some kind of 0 value): > > > > iova_region = (DMAMap) { > > .iova = iova_first, > > .translated_addr = 0, > > .size = region_size - 1, > > .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > > }; > > > > Also, if this is what you mean by decoupling the IOVA allocator, what > > happens to the IOVA->HVA tree? We will have two structures: 1) IOVA domain (in charge of IOVA range allocation/deallocation) 2) How the IOVA is used, e.g an IOVA->HVA tree > Are we no longer saving these mappings in > > a tree? > > > > --- > > In your second suggestion (2) with a dedicated GPA->IOVA tree, were you > > thinking something like this? Just adding on to VhostIOVATree here: > > > > struct VhostIOVATree { > > uint64_t iova_first; > > uint64_t iova_last; > > IOVATree *iova_map; > > IOVATree *gpa_map; > > }; I'd split the mappings if it's possible. > > > > Where the mapping might look something like: > > > > gpa_region = (DMAMap) { > > .iova = iova_first, > > .translated_addr = gpa_first, > > .size = region_size - 1, > > .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > > }; > > > > Also, when you say "used in the datapath SVQ translation", we still need > > to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is > > called, right? Right. > > > > --- > > Lastly, in your third suggestion (3) you mention implementing a > > SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for > > just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g. > > > > struct VhostIOVATree { > > uint64_t iova_first; > > uint64_t iova_last; > > IOVATree *iova_map; > > IOVATree *gpa_map; > > IOVATree *svq_map; > > }; > > > > --- > > > > Let me know if I'm understanding this correctly. If I am, this would > > require a pretty good amount of refactoring on the IOVA allocation, > > searching, destroying, etc. code to account for these new trees. > > > > Ok I think I understand your previous question better, sorry for the delay :). > > If I'm not wrong, Jason did not enumerate these as alternatives but as > part of the same solution. Jason please correct me if I'm wrong here. > > His solution is composed of three trees: > 1) One for the IOVA allocations, so we know where to allocate new ranges > 2) One of the GPA -> SVQ IOVA translations. > 3) Another one for SVQ vrings translations. > Exactly. > In my opinion to use GPA this is problematic as we force all of the > callers to know if the address we want to translate comes from the > guest or not. 
Current code does now know it, as > vhost_svq_translate_addr is called to translate both buffer dataplane > addresses and control virtqueue buffers, which are also shadowed. To > transmit that information to the caller code would require either > duplicate functions, to add a boolean, etc, as it is in a different > layer (net specific net/vhost-vdpa.c vs generic > hw/virtio/vhost-shadow-virtqueue.c). > > In my opinion is easier to keep just two trees (or similar structs): > 1) One for the IOVA allocations, so we know where to allocate new > ranges. We don't actually need to store the translations here. > 2) One for HVA -> SVQ IOVA translations. > > This way we can accommodate both SVQ vrings, CVQ buffers, and guest > memory buffers, all on the second tree, and take care of the HVA > duplications. Probably, but how to take care of HVA duplications (I may miss the context, sorry)? Thanks > > Thanks! > > > Thanks Jason! > > > > >> > > >> In some sense this is simpler than trying to get all the information > > >> from only two trees. On the bad side, all SVQ calls that allocate some > > >> region need to remember to add to one of the two other threes. That is > > >> what I mean by synchronized. But sure, we can go that way. > > > > > > Just a suggestion, if it turns out to complicate the issue, I'm fine > > > to go the other way. > > > > > > Thanks > > > > > >> > > >>> Thanks > > >>> > > >>>> > > >>>>> Thanks > > >>>>> > > >>>>>> > > >>>>>> Thanks! > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > >> > > > > > >
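Jason's split can be pictured with two independent structures, sketched below under assumed names (IOVADomain and SVQUsageMaps are illustrative, not an actual QEMU patch; they reuse the headers from the sketch above):

/* 1) IOVA domain: only knows which ranges are allocated. */
typedef struct IOVADomain {
    hwaddr iova_first;
    hwaddr iova_last;
    IOVATree *allocated;      /* entries carry translated_addr == 0 */
} IOVADomain;

/* 2) Usage maps: how each allocated IOVA range is actually used. */
typedef struct SVQUsageMaps {
    IOVATree *gpa_iova;       /* GPA -> SVQ IOVA, fed by the listener */
    IOVATree *svq_hva_iova;   /* SVQ vring/CVQ HVA -> SVQ IOVA */
} SVQUsageMaps;

The allocator tree then matches Jonah's iova_region example above, where translated_addr is given a 0 value and only the reserved range is meaningful.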
On Tue, Jul 30, 2024 at 10:48 AM Jason Wang <jasowang@redhat.com> wrote: > > On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin > <eperezma@redhat.com> wrote: > > > > On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.palmer@oracle.com> wrote: > > > > > > > > > > > > On 5/13/24 11:56 PM, Jason Wang wrote: > > > > On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin > > > > <eperezma@redhat.com> wrote: > > > >> > > > >> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote: > > > >>> > > > >>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin > > > >>> <eperezma@redhat.com> wrote: > > > >>>> > > > >>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: > > > >>>>> > > > >>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin > > > >>>>> <eperezma@redhat.com> wrote: > > > >>>>>> > > > >>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > >>>>>>> > > > >>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > >>>>>>>> > > > >>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: > > > >>>>>>>>> > > > >>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin > > > >>>>>>>>>>>>> <eperezma@redhat.com> wrote: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads > > > >>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped regions > > > >>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the tree, as looking > > > >>>>>>>>>>>>>>>> them by HVA will return them twice. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue? > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of maps / > > > >>>>>>>>>>>>>> unmaps happens [1]: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> HVA GPA IOVA > > > >>>>>>>>>>>>>> ------------------------------------------------------------------------------------------------------------------------- > > > >>>>>>>>>>>>>> Map > > > >>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) > > > >>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > > > >>>>>>>>>>>>>> [0x80001000, 0x2000001000) > > > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > > > >>>>>>>>>>>>>> [0x2000001000, 0x2000021000) > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Unmap > > > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, > > > >>>>>>>>>>>>>> 0x20000) ??? 
> > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a > > > >>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU, as GPA does > > > >>>>>>>>>>>>>> not overlap, only HVA. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk, > > > >>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at > > > >>>>>>>>>>>>>> unmap time. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$ > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Thanks! > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in > > > >>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be > > > >>>>>>>>>>>>> aliasing issues. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ > > > >>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree, > > > >>>>>>>>>>>> and they do not have GPA. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this > > > >>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store > > > >>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and > > > >>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers). > > > >>>>>>>>>>> > > > >>>>>>>>>>> This seems to be tricky. > > > >>>>>>>>>>> > > > >>>>>>>>>>> As discussed, it could be another iova tree. > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap. > > > >>>>>>>>>> > > > >>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue. > > > >>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that > > > >>>>>>>>>> translates from vaddr to SVQ IOVA. To know which one to use is easy at > > > >>>>>>>>>> adding or removing, like in the memory listener, but how to know at > > > >>>>>>>>>> vhost_svq_translate_addr? > > > >>>>>>>>> > > > >>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of > > > >>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly? > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> The problem is not virtqueue_pop, that's out of the > > > >>>>>>>> vhost_svq_translate_addr. The problem is the need of adding > > > >>>>>>>> conditionals / complexity in all the callers of > > > >>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When > > > >>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is > > > >>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it > > > >>>>>>>>>> just worked in my tests so far. > > > >>>>>>>>>> > > > >>>>>>>>>> Now we have the second problem: The GPA values of the regions of the > > > >>>>>>>>>> two IOVA tree must be unique. We need to be able to find unallocated > > > >>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so > > > >>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very > > > >>>>>>>>>> complicated with two trees. > > > >>>>>>>>> > > > >>>>>>>>> Would it be simpler if we decouple the IOVA allocator? 
For example, we > > > >>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. It is > > > >>>>>>>>> shared by both > > > >>>>>>>>> > > > >>>>>>>>> 1) Guest memory (GPA) > > > >>>>>>>>> 2) SVQ virtqueue and buffers > > > >>>>>>>>> > > > >>>>>>>>> And another gtree to track the GPA to IOVA. > > > >>>>>>>>> > > > >>>>>>>>> The SVQ code could use either > > > >>>>>>>>> > > > >>>>>>>>> 1) one linear mappings that contains both SVQ virtqueue and buffers > > > >>>>>>>>> > > > >>>>>>>>> or > > > >>>>>>>>> > > > >>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers > > > >>>>>>>>> > > > >>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> That's possible, but that scatters the IOVA handling code instead of > > > >>>>>>>> keeping it self-contained in VhostIOVATree. > > > >>>>>>> > > > >>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used. > > > >>>>>>> > > > >>>>>>> An example is the iova allocator in the kernel. > > > >>>>>>> > > > >>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough > > > >>>>>>> code, not sure it is useful here though (haven't had a deep look at > > > >>>>>>> that). > > > >>>>>>> > > > >>>>>> > > > >>>>>> I don't know enough about them to have an opinion. I keep seeing the > > > >>>>>> drawback of needing to synchronize both allocation & adding in all the > > > >>>>>> places we want to modify the IOVATree. At this moment, these are the > > > >>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and > > > >>>>>> net CVQ buffers. But it may be more in the future. > > > >>>>>> > > > >>>>>> What are the advantages of keeping these separated that justifies > > > >>>>>> needing to synchronize in all these places, compared with keeping them > > > >>>>>> synchronized in VhostIOVATree? > > > >>>>> > > > >>>>> It doesn't need to be synchronized. > > > >>>>> > > > >>>>> Assuming guest and SVQ shares IOVA range. IOVA only needs to track > > > >>>>> which part of the range has been used. > > > >>>>> > > > >>>> > > > >>>> Not sure if I follow, that is what I mean with "synchronized". > > > >>> > > > >>> Oh right. > > > >>> > > > >>>> > > > >>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ -> > > > >>>>> IOVA mappings separately. > > > >>>>> > > > >>>> > > > >>>> Sorry, I still cannot see the whole picture :). > > > >>>> > > > >>>> Let's assume we have all the GPA mapped to specific IOVA regions, so > > > >>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ > > > >>>> because of the migration. How can we know where we can place SVQ > > > >>>> vrings without having them synchronized? > > > >>> > > > >>> Just allocating a new IOVA range for SVQ? > > > >>> > > > >>>> > > > >>>> At this moment we're using a tree. The tree nature of the current SVQ > > > >>>> IOVA -> VA makes all nodes ordered so it is more or less easy to look > > > >>>> for holes. > > > >>> > > > >>> Yes, iova allocate could still be implemented via a tree. > > > >>> > > > >>>> > > > >>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate > > > >>>> over all of them, order them, of the two trees, and then look for > > > >>>> holes there? 
> > > >>> > > > >>> Let me clarify, correct me if I was wrong: > > > >>> > > > >>> 1) IOVA allocator is still implemented via a tree, we just don't need > > > >>> to store how the IOVA is used > > > >>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in > > > >>> the datapath SVQ translation > > > >>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ > > > >>> > > > >> > > > >> Ok, so the part I was missing is that now we have 3 whole trees, with > > > >> somehow redundant information :). > > > > > > > > Somehow, it decouples the IOVA usage out of the IOVA allocator. This > > > > might be simple as guests and SVQ may try to share a single IOVA > > > > address space. > > > > > > > > > > I'm working on implementing your three suggestions above but I'm a bit > > > confused with some of the wording and I was hoping you could clarify > > > some of it for me when you get the chance. > > > > > > --- > > > For your first suggestion (1) you mention decoupling the IOVA allocator > > > and "don't need to store how the IOVA is used." > > > > > > By this, do you mean to not save the IOVA->HVA mapping and instead only > > > save the allocated IOVA ranges? > > Yes. > > > > In other words, are you suggesting to > > > create a dedicated "IOVA->IOVA" tree like: > > > > > > struct VhostIOVATree { > > > uint64_t iova_first; > > > uint64_t iova_last; > > > IOVATree *iova_map; > > > }; > > It could be this or other. I think the point is the allocator is only > used for IOVA allocation but it doesn't have any information about the > mapping. > > > > > > > Where the mapping might look something like (where translated_addr is > > > given some kind of 0 value): > > > > > > iova_region = (DMAMap) { > > > .iova = iova_first, > > > .translated_addr = 0, > > > .size = region_size - 1, > > > .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > > > }; > > > > > > Also, if this is what you mean by decoupling the IOVA allocator, what > > > happens to the IOVA->HVA tree? > > We will have two structures: > > 1) IOVA domain (in charge of IOVA range allocation/deallocation) > 2) How the IOVA is used, e.g an IOVA->HVA tree > > > Are we no longer saving these mappings in > > > a tree? > > > > > > --- > > > In your second suggestion (2) with a dedicated GPA->IOVA tree, were you > > > thinking something like this? Just adding on to VhostIOVATree here: > > > > > > struct VhostIOVATree { > > > uint64_t iova_first; > > > uint64_t iova_last; > > > IOVATree *iova_map; > > > IOVATree *gpa_map; > > > }; > > I'd split the mappings if it's possible. > > > > > > > Where the mapping might look something like: > > > > > > gpa_region = (DMAMap) { > > > .iova = iova_first, > > > .translated_addr = gpa_first, > > > .size = region_size - 1, > > > .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > > > }; > > > > > > Also, when you say "used in the datapath SVQ translation", we still need > > > to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is > > > called, right? > > Right. > > > > > > > --- > > > Lastly, in your third suggestion (3) you mention implementing a > > > SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for > > > just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g. > > > > > > struct VhostIOVATree { > > > uint64_t iova_first; > > > uint64_t iova_last; > > > IOVATree *iova_map; > > > IOVATree *gpa_map; > > > IOVATree *svq_map; > > > }; > > > > > > --- > > > > > > Let me know if I'm understanding this correctly. 
If I am, this would > > > require a pretty good amount of refactoring on the IOVA allocation, > > > searching, destroying, etc. code to account for these new trees. > > > > > > > Ok I think I understand your previous question better, sorry for the delay :). > > > > If I'm not wrong, Jason did not enumerate these as alternatives but as > > part of the same solution. Jason please correct me if I'm wrong here. > > > > His solution is composed of three trees: > > 1) One for the IOVA allocations, so we know where to allocate new ranges > > 2) One of the GPA -> SVQ IOVA translations. > > 3) Another one for SVQ vrings translations. > > > > Exactly. > > > In my opinion to use GPA this is problematic as we force all of the > > callers to know if the address we want to translate comes from the > > guest or not. Current code does now know it, as > > vhost_svq_translate_addr is called to translate both buffer dataplane > > addresses and control virtqueue buffers, which are also shadowed. To > > transmit that information to the caller code would require either > > duplicate functions, to add a boolean, etc, as it is in a different > > layer (net specific net/vhost-vdpa.c vs generic > > hw/virtio/vhost-shadow-virtqueue.c). > > > > In my opinion is easier to keep just two trees (or similar structs): > > 1) One for the IOVA allocations, so we know where to allocate new > > ranges. We don't actually need to store the translations here. > > 2) One for HVA -> SVQ IOVA translations. > > > > This way we can accommodate both SVQ vrings, CVQ buffers, and guest > > memory buffers, all on the second tree, and take care of the HVA > > duplications. > > Probably, but how to take care of HVA duplications (I may miss the > context, sorry)? > What to do with the duplicated HVA is an added problem that deserves more discussion, right. If a duplicated entry gets deleted, the other one should take its place in both trees. Store them in an ordered list and look for replacements on tree deletion maybe? It would be great to locate a similar usage and see what is done there. > Thanks > > > > > Thanks! > > > > > Thanks Jason! > > > > > > >> > > > >> In some sense this is simpler than trying to get all the information > > > >> from only two trees. On the bad side, all SVQ calls that allocate some > > > >> region need to remember to add to one of the two other threes. That is > > > >> what I mean by synchronized. But sure, we can go that way. > > > > > > > > Just a suggestion, if it turns out to complicate the issue, I'm fine > > > > to go the other way. > > > > > > > > Thanks > > > > > > > >> > > > >>> Thanks > > > >>> > > > >>>> > > > >>>>> Thanks > > > >>>>> > > > >>>>>> > > > >>>>>> Thanks! > > > >>>>>> > > > >>>>> > > > >>>> > > > >>> > > > >> > > > >> > > > > > > > > > >
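The "ordered list" idea Eugenio floats above can be sketched as follows. Everything here is hypothetical (AliasedMap, the list head, the promote-on-removal policy), and a real version would also have to check that the promoted alias actually overlaps the removed range:

#include "qemu/osdep.h"
#include "qemu/queue.h"
#include "qemu/iova-tree.h"

typedef struct AliasedMap {
    DMAMap map;                        /* shadowed HVA -> IOVA entry */
    QTAILQ_ENTRY(AliasedMap) entry;
} AliasedMap;

typedef QTAILQ_HEAD(, AliasedMap) AliasedList;

/* Remove a visible entry; let the next shadowed duplicate take its place. */
static void hva_tree_remove_sketch(IOVATree *hva_iova,
                                   AliasedList *shadowed, DMAMap map)
{
    AliasedMap *next;

    iova_tree_remove(hva_iova, map);
    next = QTAILQ_FIRST(shadowed);
    if (next) {
        QTAILQ_REMOVE(shadowed, next, entry);
        iova_tree_insert(hva_iova, &next->map);
        g_free(next);
    }
}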
On 7/30/24 7:00 AM, Eugenio Perez Martin wrote: > On Tue, Jul 30, 2024 at 10:48 AM Jason Wang <jasowang@redhat.com> wrote: >> >> On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin >> <eperezma@redhat.com> wrote: >>> >>> On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.palmer@oracle.com> wrote: >>>> >>>> >>>> >>>> On 5/13/24 11:56 PM, Jason Wang wrote: >>>>> On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin >>>>> <eperezma@redhat.com> wrote: >>>>>> >>>>>> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>> >>>>>>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin >>>>>>> <eperezma@redhat.com> wrote: >>>>>>>> >>>>>>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>> >>>>>>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin >>>>>>>>> <eperezma@redhat.com> wrote: >>>>>>>>>> >>>>>>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin >>>>>>>>>>>>>>>>> <eperezma@redhat.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads >>>>>>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped regions >>>>>>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the tree, as looking >>>>>>>>>>>>>>>>>>>> them by HVA will return them twice. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of maps / >>>>>>>>>>>>>>>>>> unmaps happens [1]: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> HVA GPA IOVA >>>>>>>>>>>>>>>>>> ------------------------------------------------------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>> Map >>>>>>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) >>>>>>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) >>>>>>>>>>>>>>>>>> [0x80001000, 0x2000001000) >>>>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) >>>>>>>>>>>>>>>>>> [0x2000001000, 0x2000021000) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Unmap >>>>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, >>>>>>>>>>>>>>>>>> 0x20000) ??? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a >>>>>>>>>>>>>>>>>> different GVA (aliased). 
This is not "flattened" by QEMU, as GPA does >>>>>>>>>>>>>>>>>> not overlap, only HVA. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk, >>>>>>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at >>>>>>>>>>>>>>>>>> unmap time. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$ >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in >>>>>>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be >>>>>>>>>>>>>>>>> aliasing issues. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ >>>>>>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree, >>>>>>>>>>>>>>>> and they do not have GPA. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this >>>>>>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store >>>>>>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and >>>>>>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This seems to be tricky. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> As discussed, it could be another iova tree. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue. >>>>>>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that >>>>>>>>>>>>>> translates from vaddr to SVQ IOVA. To know which one to use is easy at >>>>>>>>>>>>>> adding or removing, like in the memory listener, but how to know at >>>>>>>>>>>>>> vhost_svq_translate_addr? >>>>>>>>>>>>> >>>>>>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of >>>>>>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> The problem is not virtqueue_pop, that's out of the >>>>>>>>>>>> vhost_svq_translate_addr. The problem is the need of adding >>>>>>>>>>>> conditionals / complexity in all the callers of >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When >>>>>>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is >>>>>>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it >>>>>>>>>>>>>> just worked in my tests so far. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Now we have the second problem: The GPA values of the regions of the >>>>>>>>>>>>>> two IOVA tree must be unique. We need to be able to find unallocated >>>>>>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so >>>>>>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very >>>>>>>>>>>>>> complicated with two trees. >>>>>>>>>>>>> >>>>>>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we >>>>>>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. 
It is >>>>>>>>>>>>> shared by both >>>>>>>>>>>>> >>>>>>>>>>>>> 1) Guest memory (GPA) >>>>>>>>>>>>> 2) SVQ virtqueue and buffers >>>>>>>>>>>>> >>>>>>>>>>>>> And another gtree to track the GPA to IOVA. >>>>>>>>>>>>> >>>>>>>>>>>>> The SVQ code could use either >>>>>>>>>>>>> >>>>>>>>>>>>> 1) one linear mappings that contains both SVQ virtqueue and buffers >>>>>>>>>>>>> >>>>>>>>>>>>> or >>>>>>>>>>>>> >>>>>>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers >>>>>>>>>>>>> >>>>>>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> That's possible, but that scatters the IOVA handling code instead of >>>>>>>>>>>> keeping it self-contained in VhostIOVATree. >>>>>>>>>>> >>>>>>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used. >>>>>>>>>>> >>>>>>>>>>> An example is the iova allocator in the kernel. >>>>>>>>>>> >>>>>>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough >>>>>>>>>>> code, not sure it is useful here though (haven't had a deep look at >>>>>>>>>>> that). >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I don't know enough about them to have an opinion. I keep seeing the >>>>>>>>>> drawback of needing to synchronize both allocation & adding in all the >>>>>>>>>> places we want to modify the IOVATree. At this moment, these are the >>>>>>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and >>>>>>>>>> net CVQ buffers. But it may be more in the future. >>>>>>>>>> >>>>>>>>>> What are the advantages of keeping these separated that justifies >>>>>>>>>> needing to synchronize in all these places, compared with keeping them >>>>>>>>>> synchronized in VhostIOVATree? >>>>>>>>> >>>>>>>>> It doesn't need to be synchronized. >>>>>>>>> >>>>>>>>> Assuming guest and SVQ shares IOVA range. IOVA only needs to track >>>>>>>>> which part of the range has been used. >>>>>>>>> >>>>>>>> >>>>>>>> Not sure if I follow, that is what I mean with "synchronized". >>>>>>> >>>>>>> Oh right. >>>>>>> >>>>>>>> >>>>>>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ -> >>>>>>>>> IOVA mappings separately. >>>>>>>>> >>>>>>>> >>>>>>>> Sorry, I still cannot see the whole picture :). >>>>>>>> >>>>>>>> Let's assume we have all the GPA mapped to specific IOVA regions, so >>>>>>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ >>>>>>>> because of the migration. How can we know where we can place SVQ >>>>>>>> vrings without having them synchronized? >>>>>>> >>>>>>> Just allocating a new IOVA range for SVQ? >>>>>>> >>>>>>>> >>>>>>>> At this moment we're using a tree. The tree nature of the current SVQ >>>>>>>> IOVA -> VA makes all nodes ordered so it is more or less easy to look >>>>>>>> for holes. >>>>>>> >>>>>>> Yes, iova allocate could still be implemented via a tree. >>>>>>> >>>>>>>> >>>>>>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate >>>>>>>> over all of them, order them, of the two trees, and then look for >>>>>>>> holes there? >>>>>>> >>>>>>> Let me clarify, correct me if I was wrong: >>>>>>> >>>>>>> 1) IOVA allocator is still implemented via a tree, we just don't need >>>>>>> to store how the IOVA is used >>>>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in >>>>>>> the datapath SVQ translation >>>>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ >>>>>>> >>>>>> >>>>>> Ok, so the part I was missing is that now we have 3 whole trees, with >>>>>> somehow redundant information :). 
>>>>> >>>>> Somehow, it decouples the IOVA usage out of the IOVA allocator. This >>>>> might be simple as guests and SVQ may try to share a single IOVA >>>>> address space. >>>>> >>>> >>>> I'm working on implementing your three suggestions above but I'm a bit >>>> confused with some of the wording and I was hoping you could clarify >>>> some of it for me when you get the chance. >>>> >>>> --- >>>> For your first suggestion (1) you mention decoupling the IOVA allocator >>>> and "don't need to store how the IOVA is used." >>>> >>>> By this, do you mean to not save the IOVA->HVA mapping and instead only >>>> save the allocated IOVA ranges? >> >> Yes. >> >>>> In other words, are you suggesting to >>>> create a dedicated "IOVA->IOVA" tree like: >>>> >>>> struct VhostIOVATree { >>>> uint64_t iova_first; >>>> uint64_t iova_last; >>>> IOVATree *iova_map; >>>> }; >> >> It could be this or other. I think the point is the allocator is only >> used for IOVA allocation but it doesn't have any information about the >> mapping. >> >>>> >>>> Where the mapping might look something like (where translated_addr is >>>> given some kind of 0 value): >>>> >>>> iova_region = (DMAMap) { >>>> .iova = iova_first, >>>> .translated_addr = 0, >>>> .size = region_size - 1, >>>> .perm = IOMMU_ACCESS_FLAG(true, section->readonly), >>>> }; >>>> >>>> Also, if this is what you mean by decoupling the IOVA allocator, what >>>> happens to the IOVA->HVA tree? >> >> We will have two structures: >> >> 1) IOVA domain (in charge of IOVA range allocation/deallocation) >> 2) How the IOVA is used, e.g an IOVA->HVA tree >> >>> Are we no longer saving these mappings in >>>> a tree? >>>> >>>> --- >>>> In your second suggestion (2) with a dedicated GPA->IOVA tree, were you >>>> thinking something like this? Just adding on to VhostIOVATree here: >>>> >>>> struct VhostIOVATree { >>>> uint64_t iova_first; >>>> uint64_t iova_last; >>>> IOVATree *iova_map; >>>> IOVATree *gpa_map; >>>> }; >> >> I'd split the mappings if it's possible. >> >>>> >>>> Where the mapping might look something like: >>>> >>>> gpa_region = (DMAMap) { >>>> .iova = iova_first, >>>> .translated_addr = gpa_first, >>>> .size = region_size - 1, >>>> .perm = IOMMU_ACCESS_FLAG(true, section->readonly), >>>> }; >>>> >>>> Also, when you say "used in the datapath SVQ translation", we still need >>>> to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is >>>> called, right? >> >> Right. >> >>>> >>>> --- >>>> Lastly, in your third suggestion (3) you mention implementing a >>>> SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for >>>> just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g. >>>> >>>> struct VhostIOVATree { >>>> uint64_t iova_first; >>>> uint64_t iova_last; >>>> IOVATree *iova_map; >>>> IOVATree *gpa_map; >>>> IOVATree *svq_map; >>>> }; >>>> >>>> --- >>>> >>>> Let me know if I'm understanding this correctly. If I am, this would >>>> require a pretty good amount of refactoring on the IOVA allocation, >>>> searching, destroying, etc. code to account for these new trees. >>>> >>> >>> Ok I think I understand your previous question better, sorry for the delay :). >>> >>> If I'm not wrong, Jason did not enumerate these as alternatives but as >>> part of the same solution. Jason please correct me if I'm wrong here. >>> >>> His solution is composed of three trees: >>> 1) One for the IOVA allocations, so we know where to allocate new ranges >>> 2) One of the GPA -> SVQ IOVA translations. >>> 3) Another one for SVQ vrings translations. 
>>> >> >> Exactly. >> >>> In my opinion to use GPA this is problematic as we force all of the >>> callers to know if the address we want to translate comes from the >>> guest or not. Current code does now know it, as >>> vhost_svq_translate_addr is called to translate both buffer dataplane >>> addresses and control virtqueue buffers, which are also shadowed. To >>> transmit that information to the caller code would require either >>> duplicate functions, to add a boolean, etc, as it is in a different >>> layer (net specific net/vhost-vdpa.c vs generic >>> hw/virtio/vhost-shadow-virtqueue.c). >>> >>> In my opinion is easier to keep just two trees (or similar structs): >>> 1) One for the IOVA allocations, so we know where to allocate new >>> ranges. We don't actually need to store the translations here. >>> 2) One for HVA -> SVQ IOVA translations. >>> >>> This way we can accommodate both SVQ vrings, CVQ buffers, and guest >>> memory buffers, all on the second tree, and take care of the HVA >>> duplications. >> >> Probably, but how to take care of HVA duplications (I may miss the >> context, sorry)? >> > > What to do with the duplicated HVA is an added problem that deserves > more discussion, right. If a duplicated entry gets deleted, the other > one should take its place in both trees. > > Store them in an ordered list and look for replacements on tree > deletion maybe? It would be great to locate a similar usage and see > what is done there. > Isn't what we're doing here supposed to solve this issue of duplicated/overlapping HVA ranges? >> Thanks >> >>> >>> Thanks! >>> >>>> Thanks Jason! >>>> >>>>>> >>>>>> In some sense this is simpler than trying to get all the information >>>>>> from only two trees. On the bad side, all SVQ calls that allocate some >>>>>> region need to remember to add to one of the two other threes. That is >>>>>> what I mean by synchronized. But sure, we can go that way. >>>>> >>>>> Just a suggestion, if it turns out to complicate the issue, I'm fine >>>>> to go the other way. >>>>> >>>>> Thanks >>>>> >>>>>> >>>>>>> Thanks >>>>>>> >>>>>>>> >>>>>>>>> Thanks >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks! >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >
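Part of the answer to Jonah's question is the property that makes a GPA-keyed tree attractive: guest physical ranges never overlap each other, so a removal keyed by GPA is unambiguous even when the HVAs alias. A minimal sketch, assuming the GPA is stored in DMAMap.translated_addr so the existing iova_tree_find_iova() helper can search it:

/* Remove the mapping for a GPA range. Because guest physical ranges do
 * not overlap, the lookup returns at most one candidate even when the
 * backing HVAs alias each other. */
static void gpa_map_remove_sketch(IOVATree *gpa_iova_map,
                                  hwaddr gpa, hwaddr len)
{
    const DMAMap needle = {
        .translated_addr = gpa,
        .size = len - 1,          /* DMAMap sizes are inclusive */
    };
    const DMAMap *found = iova_tree_find_iova(gpa_iova_map, &needle);

    if (found) {
        iova_tree_remove(gpa_iova_map, *found);
    }
}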
On Tue, Jul 30, 2024 at 2:32 PM Jonah Palmer <jonah.palmer@oracle.com> wrote: > > > > On 7/30/24 7:00 AM, Eugenio Perez Martin wrote: > > On Tue, Jul 30, 2024 at 10:48 AM Jason Wang <jasowang@redhat.com> wrote: > >> > >> On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin > >> <eperezma@redhat.com> wrote: > >>> > >>> On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.palmer@oracle.com> wrote: > >>>> > >>>> > >>>> > >>>> On 5/13/24 11:56 PM, Jason Wang wrote: > >>>>> On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin > >>>>> <eperezma@redhat.com> wrote: > >>>>>> > >>>>>> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>> > >>>>>>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin > >>>>>>> <eperezma@redhat.com> wrote: > >>>>>>>> > >>>>>>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>> > >>>>>>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin > >>>>>>>>> <eperezma@redhat.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin > >>>>>>>>>>>>>>>>> <eperezma@redhat.com> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads > >>>>>>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped regions > >>>>>>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the tree, as looking > >>>>>>>>>>>>>>>>>>>> them by HVA will return them twice. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue? > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from. 
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of maps / > >>>>>>>>>>>>>>>>>> unmaps happens [1]: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> HVA GPA IOVA > >>>>>>>>>>>>>>>>>> ------------------------------------------------------------------------------------------------------------------------- > >>>>>>>>>>>>>>>>>> Map > >>>>>>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) > >>>>>>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > >>>>>>>>>>>>>>>>>> [0x80001000, 0x2000001000) > >>>>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > >>>>>>>>>>>>>>>>>> [0x2000001000, 0x2000021000) > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Unmap > >>>>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, > >>>>>>>>>>>>>>>>>> 0x20000) ??? > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a > >>>>>>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU, as GPA does > >>>>>>>>>>>>>>>>>> not overlap, only HVA. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk, > >>>>>>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at > >>>>>>>>>>>>>>>>>> unmap time. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$ > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks! > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in > >>>>>>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be > >>>>>>>>>>>>>>>>> aliasing issues. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ > >>>>>>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree, > >>>>>>>>>>>>>>>> and they do not have GPA. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this > >>>>>>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store > >>>>>>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and > >>>>>>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers). > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> This seems to be tricky. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> As discussed, it could be another iova tree. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue. > >>>>>>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that > >>>>>>>>>>>>>> translates from vaddr to SVQ IOVA. To know which one to use is easy at > >>>>>>>>>>>>>> adding or removing, like in the memory listener, but how to know at > >>>>>>>>>>>>>> vhost_svq_translate_addr? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of > >>>>>>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> The problem is not virtqueue_pop, that's out of the > >>>>>>>>>>>> vhost_svq_translate_addr. 
The problem is the need of adding > >>>>>>>>>>>> conditionals / complexity in all the callers of > >>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When > >>>>>>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is > >>>>>>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it > >>>>>>>>>>>>>> just worked in my tests so far. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Now we have the second problem: The GPA values of the regions of the > >>>>>>>>>>>>>> two IOVA tree must be unique. We need to be able to find unallocated > >>>>>>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so > >>>>>>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very > >>>>>>>>>>>>>> complicated with two trees. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we > >>>>>>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. It is > >>>>>>>>>>>>> shared by both > >>>>>>>>>>>>> > >>>>>>>>>>>>> 1) Guest memory (GPA) > >>>>>>>>>>>>> 2) SVQ virtqueue and buffers > >>>>>>>>>>>>> > >>>>>>>>>>>>> And another gtree to track the GPA to IOVA. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The SVQ code could use either > >>>>>>>>>>>>> > >>>>>>>>>>>>> 1) one linear mappings that contains both SVQ virtqueue and buffers > >>>>>>>>>>>>> > >>>>>>>>>>>>> or > >>>>>>>>>>>>> > >>>>>>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers > >>>>>>>>>>>>> > >>>>>>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> That's possible, but that scatters the IOVA handling code instead of > >>>>>>>>>>>> keeping it self-contained in VhostIOVATree. > >>>>>>>>>>> > >>>>>>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used. > >>>>>>>>>>> > >>>>>>>>>>> An example is the iova allocator in the kernel. > >>>>>>>>>>> > >>>>>>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough > >>>>>>>>>>> code, not sure it is useful here though (haven't had a deep look at > >>>>>>>>>>> that). > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> I don't know enough about them to have an opinion. I keep seeing the > >>>>>>>>>> drawback of needing to synchronize both allocation & adding in all the > >>>>>>>>>> places we want to modify the IOVATree. At this moment, these are the > >>>>>>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and > >>>>>>>>>> net CVQ buffers. But it may be more in the future. > >>>>>>>>>> > >>>>>>>>>> What are the advantages of keeping these separated that justifies > >>>>>>>>>> needing to synchronize in all these places, compared with keeping them > >>>>>>>>>> synchronized in VhostIOVATree? > >>>>>>>>> > >>>>>>>>> It doesn't need to be synchronized. > >>>>>>>>> > >>>>>>>>> Assuming guest and SVQ shares IOVA range. IOVA only needs to track > >>>>>>>>> which part of the range has been used. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Not sure if I follow, that is what I mean with "synchronized". > >>>>>>> > >>>>>>> Oh right. > >>>>>>> > >>>>>>>> > >>>>>>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ -> > >>>>>>>>> IOVA mappings separately. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Sorry, I still cannot see the whole picture :). > >>>>>>>> > >>>>>>>> Let's assume we have all the GPA mapped to specific IOVA regions, so > >>>>>>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ > >>>>>>>> because of the migration. 
How can we know where we can place SVQ > >>>>>>>> vrings without having them synchronized? > >>>>>>> > >>>>>>> Just allocating a new IOVA range for SVQ? > >>>>>>> > >>>>>>>> > >>>>>>>> At this moment we're using a tree. The tree nature of the current SVQ > >>>>>>>> IOVA -> VA makes all nodes ordered so it is more or less easy to look > >>>>>>>> for holes. > >>>>>>> > >>>>>>> Yes, iova allocate could still be implemented via a tree. > >>>>>>> > >>>>>>>> > >>>>>>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate > >>>>>>>> over all of them, order them, of the two trees, and then look for > >>>>>>>> holes there? > >>>>>>> > >>>>>>> Let me clarify, correct me if I was wrong: > >>>>>>> > >>>>>>> 1) IOVA allocator is still implemented via a tree, we just don't need > >>>>>>> to store how the IOVA is used > >>>>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in > >>>>>>> the datapath SVQ translation > >>>>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ > >>>>>>> > >>>>>> > >>>>>> Ok, so the part I was missing is that now we have 3 whole trees, with > >>>>>> somehow redundant information :). > >>>>> > >>>>> Somehow, it decouples the IOVA usage out of the IOVA allocator. This > >>>>> might be simple as guests and SVQ may try to share a single IOVA > >>>>> address space. > >>>>> > >>>> > >>>> I'm working on implementing your three suggestions above but I'm a bit > >>>> confused with some of the wording and I was hoping you could clarify > >>>> some of it for me when you get the chance. > >>>> > >>>> --- > >>>> For your first suggestion (1) you mention decoupling the IOVA allocator > >>>> and "don't need to store how the IOVA is used." > >>>> > >>>> By this, do you mean to not save the IOVA->HVA mapping and instead only > >>>> save the allocated IOVA ranges? > >> > >> Yes. > >> > >>>> In other words, are you suggesting to > >>>> create a dedicated "IOVA->IOVA" tree like: > >>>> > >>>> struct VhostIOVATree { > >>>> uint64_t iova_first; > >>>> uint64_t iova_last; > >>>> IOVATree *iova_map; > >>>> }; > >> > >> It could be this or other. I think the point is the allocator is only > >> used for IOVA allocation but it doesn't have any information about the > >> mapping. > >> > >>>> > >>>> Where the mapping might look something like (where translated_addr is > >>>> given some kind of 0 value): > >>>> > >>>> iova_region = (DMAMap) { > >>>> .iova = iova_first, > >>>> .translated_addr = 0, > >>>> .size = region_size - 1, > >>>> .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > >>>> }; > >>>> > >>>> Also, if this is what you mean by decoupling the IOVA allocator, what > >>>> happens to the IOVA->HVA tree? > >> > >> We will have two structures: > >> > >> 1) IOVA domain (in charge of IOVA range allocation/deallocation) > >> 2) How the IOVA is used, e.g an IOVA->HVA tree > >> > >>> Are we no longer saving these mappings in > >>>> a tree? > >>>> > >>>> --- > >>>> In your second suggestion (2) with a dedicated GPA->IOVA tree, were you > >>>> thinking something like this? Just adding on to VhostIOVATree here: > >>>> > >>>> struct VhostIOVATree { > >>>> uint64_t iova_first; > >>>> uint64_t iova_last; > >>>> IOVATree *iova_map; > >>>> IOVATree *gpa_map; > >>>> }; > >> > >> I'd split the mappings if it's possible. 
> >> > >>>> > >>>> Where the mapping might look something like: > >>>> > >>>> gpa_region = (DMAMap) { > >>>> .iova = iova_first, > >>>> .translated_addr = gpa_first, > >>>> .size = region_size - 1, > >>>> .perm = IOMMU_ACCESS_FLAG(true, section->readonly), > >>>> }; > >>>> > >>>> Also, when you say "used in the datapath SVQ translation", we still need > >>>> to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is > >>>> called, right? > >> > >> Right. > >> > >>>> > >>>> --- > >>>> Lastly, in your third suggestion (3) you mention implementing a > >>>> SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for > >>>> just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g. > >>>> > >>>> struct VhostIOVATree { > >>>> uint64_t iova_first; > >>>> uint64_t iova_last; > >>>> IOVATree *iova_map; > >>>> IOVATree *gpa_map; > >>>> IOVATree *svq_map; > >>>> }; > >>>> > >>>> --- > >>>> > >>>> Let me know if I'm understanding this correctly. If I am, this would > >>>> require a pretty good amount of refactoring on the IOVA allocation, > >>>> searching, destroying, etc. code to account for these new trees. > >>>> > >>> > >>> Ok I think I understand your previous question better, sorry for the delay :). > >>> > >>> If I'm not wrong, Jason did not enumerate these as alternatives but as > >>> part of the same solution. Jason please correct me if I'm wrong here. > >>> > >>> His solution is composed of three trees: > >>> 1) One for the IOVA allocations, so we know where to allocate new ranges > >>> 2) One of the GPA -> SVQ IOVA translations. > >>> 3) Another one for SVQ vrings translations. > >>> > >> > >> Exactly. > >> > >>> In my opinion to use GPA this is problematic as we force all of the > >>> callers to know if the address we want to translate comes from the > >>> guest or not. Current code does now know it, as > >>> vhost_svq_translate_addr is called to translate both buffer dataplane > >>> addresses and control virtqueue buffers, which are also shadowed. To > >>> transmit that information to the caller code would require either > >>> duplicate functions, to add a boolean, etc, as it is in a different > >>> layer (net specific net/vhost-vdpa.c vs generic > >>> hw/virtio/vhost-shadow-virtqueue.c). > >>> > >>> In my opinion is easier to keep just two trees (or similar structs): > >>> 1) One for the IOVA allocations, so we know where to allocate new > >>> ranges. We don't actually need to store the translations here. > >>> 2) One for HVA -> SVQ IOVA translations. > >>> > >>> This way we can accommodate both SVQ vrings, CVQ buffers, and guest > >>> memory buffers, all on the second tree, and take care of the HVA > >>> duplications. > >> > >> Probably, but how to take care of HVA duplications (I may miss the > >> context, sorry)? > >> > > > > What to do with the duplicated HVA is an added problem that deserves > > more discussion, right. If a duplicated entry gets deleted, the other > > one should take its place in both trees. > > > > Store them in an ordered list and look for replacements on tree > > deletion maybe? It would be great to locate a similar usage and see > > what is done there. > > > > Isn't what we're doing here supposed to solve this issue of > duplicated/overlapping HVA ranges? 
>

Yes, let me explain again with the example of Si-Wei [1]:

HVA                                GPA                           IOVA
-------------------------------------------------------------------------------------------------------------------------
Map
[0x7f7903e00000, 0x7f7983e00000)   [0x0, 0x80000000)             [0x1000, 0x80000000)
[0x7f7983e00000, 0x7f9903e00000)   [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
[0x7f7903ea0000, 0x7f7903ec0000)   [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)

Unmap
[0x7f7903ea0000, 0x7f7903ec0000)   [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???

The third HVA range is contained in the first one, but exposed under a
different GPA (aliased). This is not "flattened" by QEMU, as the GPA
ranges do not overlap, only the HVA ones.

In this case we can ignore the third range, as its HVA is already
covered by the first one. Since it is unmapped first, it is ok to
simply ignore it for the whole QEMU run.

Now, what action should we take if the first range, GPA [0x0,
0x80000000), is unmapped first? We should reintroduce the smaller
chunk HVA [0x7f7903ea0000, 0x7f7903ec0000) -> SVQ IOVA [0x2000001000,
0x2000021000) entry in the tree.

Maybe we can trust that every case of this will be handled in a stack
fashion and all of this is overthinking, but I'd like to know others'
opinions / thoughts here.

Thanks!

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html

> >> Thanks > >> > >>> > >>> Thanks! > >>> > >>>> Thanks Jason! > >>>> > >>>>>> > >>>>>> In some sense this is simpler than trying to get all the information > >>>>>> from only two trees. On the bad side, all SVQ calls that allocate some > >>>>>> region need to remember to add to one of the two other trees. That is > >>>>>> what I mean by synchronized. But sure, we can go that way. > >>>>> > >>>>> Just a suggestion, if it turns out to complicate the issue, I'm fine > >>>>> to go the other way. > >>>>> > >>>>> Thanks > >>>>> > >>>>>> > >>>>>>> Thanks > >>>>>>> > >>>>>>>> > >>>>>>>>> Thanks > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> Thanks > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks! > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > >
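To make the failure mode above concrete, here is a minimal,
self-contained C sketch. It is illustrative only, not the series'
actual util/iova-tree.c change: the flat array stands in for the real
GTree, and the "id" field simply mirrors the DMAMap member the series
proposes, using the GPA start as the unique identifier.

/*
 * Standalone sketch: disambiguating overlapped HVA ranges at unmap
 * time by tagging each entry with its GPA as an id. Values are taken
 * from Si-Wei's trace above.
 */
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    uint64_t hva;   /* translated_addr, in DMAMap terms */
    uint64_t size;  /* inclusive size, as in DMAMap */
    uint64_t iova;
    uint64_t id;    /* GPA start; 0 for users that do not need ids */
} Entry;

static const Entry tree[] = {
    { 0x7f7903e00000, 0x80000000 - 1,   0x1000,       0x0 },
    { 0x7f7983e00000, 0x1f80000000 - 1, 0x80001000,   0x100000000 },
    { 0x7f7903ea0000, 0x20000 - 1,      0x2000001000, 0xfeda0000 },
};

/* Return the entry containing hva whose id also matches. */
static const Entry *find_by_hva_and_id(uint64_t hva, uint64_t id)
{
    for (size_t i = 0; i < sizeof(tree) / sizeof(tree[0]); i++) {
        if (hva >= tree[i].hva && hva <= tree[i].hva + tree[i].size &&
            tree[i].id == id) {
            return &tree[i];
        }
    }
    return NULL;
}

int main(void)
{
    /*
     * HVA 0x7f7903ea0000 is contained in both the first and the third
     * entries, so a plain HVA search is ambiguous; the id (GPA)
     * selects the right one.
     */
    const Entry *first = find_by_hva_and_id(0x7f7903ea0000, 0x0);
    const Entry *third = find_by_hva_and_id(0x7f7903ea0000, 0xfeda0000);

    printf("id 0x0        -> IOVA 0x%" PRIx64 "\n", first->iova);
    printf("id 0xfeda0000 -> IOVA 0x%" PRIx64 "\n", third->iova);
    return 0;
}

Without the id check, the search for HVA 0x7f7903ea0000 hits the first
chunk (IOVA 0x1000), which is exactly the wrong unmap shown in the
trace above.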
On 7/31/24 5:56 AM, Eugenio Perez Martin wrote: > On Tue, Jul 30, 2024 at 2:32 PM Jonah Palmer <jonah.palmer@oracle.com> wrote: >> >> >> >> On 7/30/24 7:00 AM, Eugenio Perez Martin wrote: >>> On Tue, Jul 30, 2024 at 10:48 AM Jason Wang <jasowang@redhat.com> wrote: >>>> >>>> On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin >>>> <eperezma@redhat.com> wrote: >>>>> >>>>> On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.palmer@oracle.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 5/13/24 11:56 PM, Jason Wang wrote: >>>>>>> On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin >>>>>>> <eperezma@redhat.com> wrote: >>>>>>>> >>>>>>>> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>> >>>>>>>>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin >>>>>>>>> <eperezma@redhat.com> wrote: >>>>>>>>>> >>>>>>>>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin >>>>>>>>>>> <eperezma@redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin >>>>>>>>>>>>>>>>>>> <eperezma@redhat.com> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where different GPA leads >>>>>>>>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped regions >>>>>>>>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the tree, as looking >>>>>>>>>>>>>>>>>>>>>> them by HVA will return them twice. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I think I don't understand if there's any side effect for shadow virtqueue? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this comes from. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of maps / >>>>>>>>>>>>>>>>>>>> unmaps happens [1]: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> HVA GPA IOVA >>>>>>>>>>>>>>>>>>>> ------------------------------------------------------------------------------------------------------------------------- >>>>>>>>>>>>>>>>>>>> Map >>>>>>>>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, 0x80000000) >>>>>>>>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) >>>>>>>>>>>>>>>>>>>> [0x80001000, 0x2000001000) >>>>>>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) >>>>>>>>>>>>>>>>>>>> [0x2000001000, 0x2000021000) >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Unmap >>>>>>>>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) [0x1000, >>>>>>>>>>>>>>>>>>>> 0x20000) ??? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The third HVA range is contained in the first one, but exposed under a >>>>>>>>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU, as GPA does >>>>>>>>>>>>>>>>>>>> not overlap, only HVA. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the first chunk, >>>>>>>>>>>>>>>>>>>> not the second one. This series is the way to tell the difference at >>>>>>>>>>>>>>>>>>>> unmap time. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> [1] https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$ >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in >>>>>>>>>>>>>>>>>>> the iova tree to solve this issue completely. Then there won't be >>>>>>>>>>>>>>>>>>> aliasing issues. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm ok to explore that route but this has another problem. Both SVQ >>>>>>>>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by VhostIOVATree, >>>>>>>>>>>>>>>>>> and they do not have GPA. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle this >>>>>>>>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we store >>>>>>>>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and >>>>>>>>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ buffers). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This seems to be tricky. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> As discussed, it could be another iova tree. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me expand & recap. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Option 1 is to simply add another iova tree to VhostShadowVirtqueue. >>>>>>>>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current iova_tree that >>>>>>>>>>>>>>>> translates from vaddr to SVQ IOVA. To know which one to use is easy at >>>>>>>>>>>>>>>> adding or removing, like in the memory listener, but how to know at >>>>>>>>>>>>>>>> vhost_svq_translate_addr? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version of >>>>>>>>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly? >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> The problem is not virtqueue_pop, that's out of the >>>>>>>>>>>>>> vhost_svq_translate_addr. 
The problem is the need of adding >>>>>>>>>>>>>> conditionals / complexity in all the callers of >>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The easiest way for me is to rely on memory_region_from_host(). When >>>>>>>>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When it is >>>>>>>>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use case, it >>>>>>>>>>>>>>>> just worked in my tests so far. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Now we have the second problem: The GPA values of the regions of the >>>>>>>>>>>>>>>> two IOVA tree must be unique. We need to be able to find unallocated >>>>>>>>>>>>>>>> regions in SVQ IOVA. At this moment there is only one IOVATree, so >>>>>>>>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is very >>>>>>>>>>>>>>>> complicated with two trees. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For example, we >>>>>>>>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges. It is >>>>>>>>>>>>>>> shared by both >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1) Guest memory (GPA) >>>>>>>>>>>>>>> 2) SVQ virtqueue and buffers >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> And another gtree to track the GPA to IOVA. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The SVQ code could use either >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1) one linear mappings that contains both SVQ virtqueue and buffers >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> or >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA? >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> That's possible, but that scatters the IOVA handling code instead of >>>>>>>>>>>>>> keeping it self-contained in VhostIOVATree. >>>>>>>>>>>>> >>>>>>>>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is used. >>>>>>>>>>>>> >>>>>>>>>>>>> An example is the iova allocator in the kernel. >>>>>>>>>>>>> >>>>>>>>>>>>> Note that there's an even simpler IOVA "allocator" in NVME passthrough >>>>>>>>>>>>> code, not sure it is useful here though (haven't had a deep look at >>>>>>>>>>>>> that). >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I don't know enough about them to have an opinion. I keep seeing the >>>>>>>>>>>> drawback of needing to synchronize both allocation & adding in all the >>>>>>>>>>>> places we want to modify the IOVATree. At this moment, these are the >>>>>>>>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and >>>>>>>>>>>> net CVQ buffers. But it may be more in the future. >>>>>>>>>>>> >>>>>>>>>>>> What are the advantages of keeping these separated that justifies >>>>>>>>>>>> needing to synchronize in all these places, compared with keeping them >>>>>>>>>>>> synchronized in VhostIOVATree? >>>>>>>>>>> >>>>>>>>>>> It doesn't need to be synchronized. >>>>>>>>>>> >>>>>>>>>>> Assuming guest and SVQ shares IOVA range. IOVA only needs to track >>>>>>>>>>> which part of the range has been used. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Not sure if I follow, that is what I mean with "synchronized". >>>>>>>>> >>>>>>>>> Oh right. >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ -> >>>>>>>>>>> IOVA mappings separately. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Sorry, I still cannot see the whole picture :). >>>>>>>>>> >>>>>>>>>> Let's assume we have all the GPA mapped to specific IOVA regions, so >>>>>>>>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ >>>>>>>>>> because of the migration. 
How can we know where we can place SVQ >>>>>>>>>> vrings without having them synchronized? >>>>>>>>> >>>>>>>>> Just allocating a new IOVA range for SVQ? >>>>>>>>> >>>>>>>>>> >>>>>>>>>> At this moment we're using a tree. The tree nature of the current SVQ >>>>>>>>>> IOVA -> VA makes all nodes ordered so it is more or less easy to look >>>>>>>>>> for holes. >>>>>>>>> >>>>>>>>> Yes, iova allocate could still be implemented via a tree. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate >>>>>>>>>> over all of them, order them, of the two trees, and then look for >>>>>>>>>> holes there? >>>>>>>>> >>>>>>>>> Let me clarify, correct me if I was wrong: >>>>>>>>> >>>>>>>>> 1) IOVA allocator is still implemented via a tree, we just don't need >>>>>>>>> to store how the IOVA is used >>>>>>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in >>>>>>>>> the datapath SVQ translation >>>>>>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ >>>>>>>>> >>>>>>>> >>>>>>>> Ok, so the part I was missing is that now we have 3 whole trees, with >>>>>>>> somehow redundant information :). >>>>>>> >>>>>>> Somehow, it decouples the IOVA usage out of the IOVA allocator. This >>>>>>> might be simple as guests and SVQ may try to share a single IOVA >>>>>>> address space. >>>>>>> >>>>>> >>>>>> I'm working on implementing your three suggestions above but I'm a bit >>>>>> confused with some of the wording and I was hoping you could clarify >>>>>> some of it for me when you get the chance. >>>>>> >>>>>> --- >>>>>> For your first suggestion (1) you mention decoupling the IOVA allocator >>>>>> and "don't need to store how the IOVA is used." >>>>>> >>>>>> By this, do you mean to not save the IOVA->HVA mapping and instead only >>>>>> save the allocated IOVA ranges? >>>> >>>> Yes. >>>> >>>>>> In other words, are you suggesting to >>>>>> create a dedicated "IOVA->IOVA" tree like: >>>>>> >>>>>> struct VhostIOVATree { >>>>>> uint64_t iova_first; >>>>>> uint64_t iova_last; >>>>>> IOVATree *iova_map; >>>>>> }; >>>> >>>> It could be this or other. I think the point is the allocator is only >>>> used for IOVA allocation but it doesn't have any information about the >>>> mapping. >>>> >>>>>> >>>>>> Where the mapping might look something like (where translated_addr is >>>>>> given some kind of 0 value): >>>>>> >>>>>> iova_region = (DMAMap) { >>>>>> .iova = iova_first, >>>>>> .translated_addr = 0, >>>>>> .size = region_size - 1, >>>>>> .perm = IOMMU_ACCESS_FLAG(true, section->readonly), >>>>>> }; >>>>>> >>>>>> Also, if this is what you mean by decoupling the IOVA allocator, what >>>>>> happens to the IOVA->HVA tree? >>>> >>>> We will have two structures: >>>> >>>> 1) IOVA domain (in charge of IOVA range allocation/deallocation) >>>> 2) How the IOVA is used, e.g an IOVA->HVA tree >>>> >>>>> Are we no longer saving these mappings in >>>>>> a tree? >>>>>> >>>>>> --- >>>>>> In your second suggestion (2) with a dedicated GPA->IOVA tree, were you >>>>>> thinking something like this? Just adding on to VhostIOVATree here: >>>>>> >>>>>> struct VhostIOVATree { >>>>>> uint64_t iova_first; >>>>>> uint64_t iova_last; >>>>>> IOVATree *iova_map; >>>>>> IOVATree *gpa_map; >>>>>> }; >>>> >>>> I'd split the mappings if it's possible. 
>>>> >>>>>> >>>>>> Where the mapping might look something like: >>>>>> >>>>>> gpa_region = (DMAMap) { >>>>>> .iova = iova_first, >>>>>> .translated_addr = gpa_first, >>>>>> .size = region_size - 1, >>>>>> .perm = IOMMU_ACCESS_FLAG(true, section->readonly), >>>>>> }; >>>>>> >>>>>> Also, when you say "used in the datapath SVQ translation", we still need >>>>>> to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is >>>>>> called, right? >>>> >>>> Right. >>>> >>>>>> >>>>>> --- >>>>>> Lastly, in your third suggestion (3) you mention implementing a >>>>>> SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for >>>>>> just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g. >>>>>> >>>>>> struct VhostIOVATree { >>>>>> uint64_t iova_first; >>>>>> uint64_t iova_last; >>>>>> IOVATree *iova_map; >>>>>> IOVATree *gpa_map; >>>>>> IOVATree *svq_map; >>>>>> }; >>>>>> >>>>>> --- >>>>>> >>>>>> Let me know if I'm understanding this correctly. If I am, this would >>>>>> require a pretty good amount of refactoring on the IOVA allocation, >>>>>> searching, destroying, etc. code to account for these new trees. >>>>>> >>>>> >>>>> Ok I think I understand your previous question better, sorry for the delay :). >>>>> >>>>> If I'm not wrong, Jason did not enumerate these as alternatives but as >>>>> part of the same solution. Jason please correct me if I'm wrong here. >>>>> >>>>> His solution is composed of three trees: >>>>> 1) One for the IOVA allocations, so we know where to allocate new ranges >>>>> 2) One of the GPA -> SVQ IOVA translations. >>>>> 3) Another one for SVQ vrings translations. >>>>> >>>> >>>> Exactly. >>>> >>>>> In my opinion to use GPA this is problematic as we force all of the >>>>> callers to know if the address we want to translate comes from the >>>>> guest or not. Current code does now know it, as >>>>> vhost_svq_translate_addr is called to translate both buffer dataplane >>>>> addresses and control virtqueue buffers, which are also shadowed. To >>>>> transmit that information to the caller code would require either >>>>> duplicate functions, to add a boolean, etc, as it is in a different >>>>> layer (net specific net/vhost-vdpa.c vs generic >>>>> hw/virtio/vhost-shadow-virtqueue.c). >>>>> >>>>> In my opinion is easier to keep just two trees (or similar structs): >>>>> 1) One for the IOVA allocations, so we know where to allocate new >>>>> ranges. We don't actually need to store the translations here. >>>>> 2) One for HVA -> SVQ IOVA translations. >>>>> >>>>> This way we can accommodate both SVQ vrings, CVQ buffers, and guest >>>>> memory buffers, all on the second tree, and take care of the HVA >>>>> duplications. >>>> >>>> Probably, but how to take care of HVA duplications (I may miss the >>>> context, sorry)? >>>> >>> >>> What to do with the duplicated HVA is an added problem that deserves >>> more discussion, right. If a duplicated entry gets deleted, the other >>> one should take its place in both trees. >>> >>> Store them in an ordered list and look for replacements on tree >>> deletion maybe? It would be great to locate a similar usage and see >>> what is done there. >>> >> >> Isn't what we're doing here supposed to solve this issue of >> duplicated/overlapping HVA ranges? 
>>
> Yes, let me explain again with the example of Si-Wei [1]:
>
> HVA                                GPA                           IOVA
> -------------------------------------------------------------------------------------------------------------------------
> Map
> [0x7f7903e00000, 0x7f7983e00000)   [0x0, 0x80000000)             [0x1000, 0x80000000)
> [0x7f7983e00000, 0x7f9903e00000)   [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
> [0x7f7903ea0000, 0x7f7903ec0000)   [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)
>
> Unmap
> [0x7f7903ea0000, 0x7f7903ec0000)   [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???
>
> The third HVA range is contained in the first one, but exposed under a
> different GPA (aliased). This is not "flattened" by QEMU, as the GPA
> ranges do not overlap, only the HVA ones.
>
> In this case we can ignore the third range, as its HVA is already
> covered by the first one. Since it is unmapped first, it is ok to
> simply ignore it for the whole QEMU run.
>
> Now, what action should we take if the first range, GPA [0x0,
> 0x80000000), is unmapped first? We should reintroduce the smaller
> chunk HVA [0x7f7903ea0000, 0x7f7903ec0000) -> SVQ IOVA [0x2000001000,
> 0x2000021000) entry in the tree.
>
> Maybe we can trust that every case of this will be handled in a stack
> fashion and all of this is overthinking, but I'd like to know others'
> opinions / thoughts here.
>
> Thanks!
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
>

For my understanding, say we have those 3 memory mappings:

     HVA                                GPA                           IOVA
---------------------------------------------------
Map
(1)  [0x7f7903e00000, 0x7f7983e00000)   [0x0, 0x80000000)             [0x1000, 0x80000000)
(2)  [0x7f7983e00000, 0x7f9903e00000)   [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
(3)  [0x7f7903ea0000, 0x7f7903ec0000)   [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)

And then say when we go to unmap (e.g. vhost_vdpa_svq_unmap_ring)
we're given an HVA of 0x7f7903eb0000, which fits in both the first and
third mappings.

The correct one to remove here would be the third mapping, right? Not
only because the HVA range of the third mapping has a more "specific"
or "tighter" range fit given an HVA of 0x7f7903eb0000 (which, as I
understand, may not always be the case in other scenarios), but mainly
because the HVA->GPA translation would give GPA 0xfedb0000, which only
fits in the third mapping's GPA range. Am I understanding this correctly?

---

In the case where the first mapping here is removed (GPA [0x0,
0x80000000)), why do we use the word "reintroduce" here? As I
understand it, when we remove a mapping, we're essentially
invalidating the IOVA range associated with that mapping, right? In
other words, the IOVA ranges here don't overlap, so removing a mapping
where its HVA range overlaps another mapping's HVA range shouldn't
affect the other mapping since they have unique IOVA ranges. Is my
understanding correct here or am I probably missing something?

>>>> Thanks >>>> >>>>> >>>>> Thanks! >>>>> >>>>>> Thanks Jason! >>>>>> >>>>>>>> >>>>>>>> In some sense this is simpler than trying to get all the information >>>>>>>> from only two trees. On the bad side, all SVQ calls that allocate some >>>>>>>> region need to remember to add to one of the two other trees. That is >>>>>>>> what I mean by synchronized. But sure, we can go that way. >>>>>>> >>>>>>> Just a suggestion, if it turns out to complicate the issue, I'm fine >>>>>>> to go the other way.
>>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>>> >>>>>>>>> Thanks >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thanks! >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
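Jonah's disambiguation argument can be sanity-checked numerically with
a small standalone C sketch. It is illustrative only: the mappings are
the three from the example, and the GPA 0xfedb0000 is assumed to be
what the memory subsystem (e.g. the ram block lookup) would report for
the HVA. An HVA search for 0x7f7903eb0000 is ambiguous, while a
GPA-keyed lookup is not, because the GPA ranges never overlap.

#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    uint64_t gpa, hva, iova, size;
} Mapping;

static const Mapping maps[] = {
    { 0x0,         0x7f7903e00000, 0x1000,       0x80000000 },
    { 0x100000000, 0x7f7983e00000, 0x80001000,   0x1f80000000 },
    { 0xfeda0000,  0x7f7903ea0000, 0x2000001000, 0x20000 },
};

/* Count how many mappings contain hva: 2 for 0x7f7903eb0000. */
static int count_hva_matches(uint64_t hva)
{
    int n = 0;
    for (size_t i = 0; i < 3; i++) {
        if (hva >= maps[i].hva && hva < maps[i].hva + maps[i].size) {
            n++;
        }
    }
    return n;
}

/* GPA ranges are disjoint, so at most one mapping can match. */
static const Mapping *gpa_to_map(uint64_t gpa)
{
    for (size_t i = 0; i < 3; i++) {
        if (gpa >= maps[i].gpa && gpa < maps[i].gpa + maps[i].size) {
            return &maps[i];
        }
    }
    return NULL;
}

int main(void)
{
    uint64_t gpa = 0xfedb0000; /* assumed result of the ram block lookup */
    const Mapping *m = gpa_to_map(gpa);

    printf("HVA 0x7f7903eb0000 matches %d mappings by HVA\n",
           count_hva_matches(0x7f7903eb0000));
    printf("GPA 0x%" PRIx64 " -> IOVA 0x%" PRIx64 " (mapping 3 only)\n",
           gpa, m->iova + (gpa - m->gpa));
    return 0;
}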
Hi Jonah,

On 7/31/2024 7:09 AM, Jonah Palmer wrote:
>
>>>>>>>>>> Let me clarify, correct me if I was wrong:
>>>>>>>>>>
>>>>>>>>>> 1) IOVA allocator is still implemented via a tree, we just
>>>>>>>>>> don't need
>>>>>>>>>> to store how the IOVA is used
>>>>>>>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is
>>>>>>>>>> used in
>>>>>>>>>> the datapath SVQ translation
>>>>>>>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> His solution is composed of three trees:
>>>>>>>>> 1) One for the IOVA allocations, so we know where to allocate
>>>>>>>>> new ranges
>>>>>>>>> 2) One of the GPA -> SVQ IOVA translations.
>>>>>>>>> 3) Another one for SVQ vrings translations.
>>>>>>>>>
>>>>>
>>>
>
> For my understanding, say we have those 3 memory mappings:
>
>      HVA                                GPA                           IOVA
> ---------------------------------------------------
> Map
> (1)  [0x7f7903e00000, 0x7f7983e00000)   [0x0, 0x80000000)             [0x1000, 0x80000000)
> (2)  [0x7f7983e00000, 0x7f9903e00000)   [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
> (3)  [0x7f7903ea0000, 0x7f7903ec0000)   [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)
>
> And then say when we go to unmap (e.g. vhost_vdpa_svq_unmap_ring)
> we're given an HVA of 0x7f7903eb0000, which fits in both the first and
> third mappings.
>
> The correct one to remove here would be the third mapping, right? Not
> only because the HVA range of the third mapping has a more "specific"
> or "tighter" range fit given an HVA of 0x7f7903eb0000 (which, as I
> understand, may not always be the case in other scenarios), but mainly
> because the HVA->GPA translation would give GPA 0xfedb0000, which only
> fits in the third mapping's GPA range. Am I understanding this correctly?

You're correct, we would still need a GPA -> IOVA tree for mapping and
unmapping on guest mem. I've talked to Eugenio this morning and I think
he is now aligned. Granted, this GPA tree covers only part of the IOVA
space, since it doesn't contain ranges from host-only memory (e.g.
memory backed by SVQ descriptors or buffers). We could create an API
variant of vhost_iova_tree_map_alloc() and vhost_iova_tree_map_remove()
that not only adds the IOVA -> HVA range to the HVA tree but also
updates the GPA tree to maintain the guest memory mappings, and that is
only invoked from the memory listener ops. That way the new API is
distinguishable from the one in the SVQ mapping and unmapping path,
which only manipulates the HVA tree.

I think the only case that you may need to pay attention to in the
implementation is the SVQ address translation path, where if you come
to an HVA address for translation, you would need to tell apart which
tree you'd have to look up - if this HVA is backed by guest mem you
could use the API qemu_ram_block_from_host() to infer the ram block and
then the GPA, so you end up doing a lookup on the GPA tree; or else the
HVA may be from the SVQ mappings, where you'd have to search the HVA
tree again to look for a host-mem-only range before you can claim the
HVA is a bogus/unmapped address... For now, this additional second
lookup is sub-optimal but inevitable, but I think both of us agreed
that you could start to implement this version first, and look for
future opportunities to optimize the lookup performance on top.

>
> ---
>
> In the case where the first mapping here is removed (GPA [0x0,
> 0x80000000)), why do we use the word "reintroduce" here? As I
> understand it, when we remove a mapping, we're essentially
> invalidating the IOVA range associated with that mapping, right?
> In other words, the IOVA ranges here don't overlap, so removing a mapping
> where its HVA range overlaps another mapping's HVA range shouldn't
> affect the other mapping since they have unique IOVA ranges. Is my
> understanding correct here or am I probably missing something?

With the GPA tree I think this case should work fine. I've
double-checked the implementation of the vhost-vdpa iotlb and don't
see a red flag there.

Thanks,
-Siwei
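The two-path lookup Si-Wei describes could look roughly like the
following QEMU-style fragment. vhost_iova_tree_find_iova() and
qemu_ram_block_from_host() are existing QEMU APIs;
vhost_iova_tree_find_iova_gpa() and gpa_from_ram_block() are
hypothetical names used only to illustrate the control flow, not what
the actual series implements.

/* Rough sketch of the translation path discussed above. */
static hwaddr svq_translate_addr_sketch(VhostIOVATree *tree, void *vaddr)
{
    ram_addr_t offset;
    RAMBlock *rb = qemu_ram_block_from_host(vaddr, false, &offset);
    DMAMap needle = {};
    const DMAMap *map;

    if (rb) {
        /* Guest-memory-backed HVA: derive its GPA (helper below is
         * hypothetical) and look it up in the alias-free GPA -> IOVA
         * tree. */
        needle.translated_addr = gpa_from_ram_block(rb, offset);
        map = vhost_iova_tree_find_iova_gpa(tree, &needle);
    } else {
        /* Host-only HVA (SVQ vring or CVQ buffer): fall back to the
         * existing HVA -> IOVA tree. */
        needle.translated_addr = (hwaddr)(uintptr_t)vaddr;
        map = vhost_iova_tree_find_iova(tree, &needle);
    }

    if (!map) {
        return 0; /* bogus/unmapped address */
    }
    return map->iova + (needle.translated_addr - map->translated_addr);
}

This is the "additional second lookup" trade-off mentioned above: the
ram block query (or, in Eugenio's alternative, a second tree search)
runs on every translation of a non-guest HVA.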
On Thu, Aug 1, 2024 at 2:41 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: > > Hi Jonah, > > On 7/31/2024 7:09 AM, Jonah Palmer wrote: > > > >>>>>>>>>> Let me clarify, correct me if I was wrong: > >>>>>>>>>> > >>>>>>>>>> 1) IOVA allocator is still implemented via a tree, we just > >>>>>>>>>> don't need > >>>>>>>>>> to store how the IOVA is used > >>>>>>>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is > >>>>>>>>>> used in > >>>>>>>>>> the datapath SVQ translation > >>>>>>>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> His solution is composed of three trees: > >>>>>>>>> 1) One for the IOVA allocations, so we know where to allocate > >>>>>>>>> new ranges > >>>>>>>>> 2) One of the GPA -> SVQ IOVA translations. > >>>>>>>>> 3) Another one for SVQ vrings translations. > >>>>>>>>> > >>>>> > >>> > > > > For my understanding, say we have those 3 memory mappings: > > > > HVA GPA IOVA > > --------------------------------------------------- > > Map > > (1) [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, > > 0x80000000) > > (2) [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) > > [0x80001000, 0x2000001000) > > (3) [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) > > [0x2000001000, 0x2000021000) > > > > And then say when we go to unmap (e.g. vhost_vdpa_svq_unmap_ring) > > we're given an HVA of 0x7f7903eb0000, which fits in both the first and > > third mappings. > > > > The correct one to remove here would be the third mapping, right? Not > > only because the HVA range of the third mapping has a more "specific" > > or "tighter" range fit given an HVA of 0x7f7903eb0000 (which, as I > > understand, may not always be the case in other scenarios), but mainly > > because the HVA->GPA translation would give GPA 0xfedb0000, which only > > fits in the third mapping's GPA range. Am I understanding this correctly? > You're correct, we would still need a GPA -> IOVA tree for mapping and > unmapping on guest mem. I've talked to Eugenio this morning and I think > he is now aligned. Granted, this GPA tree is partial in IOVA space that > doesn't contain ranges from host-only memory (e.g. backed by SVQ > descriptors or buffers), we could create an API variant to > vhost_iova_tree_map_alloc() and vhost_iova_tree_map_remove(), which not > just adds IOVA -> HVA range to the HVA tree, but also manipulates the > GPA tree to maintain guest memory mappings, i.e. only invoked from the > memory listener ops. Such that this new API is distinguishable from the > one in the SVQ mapping and unmapping path that only manipulates the HVA > tree. > Right, I think I understand both Jason's and your approach better, and I think it is the best one. To modify the lookup API is hard, as the caller does not know if the HVA looked up is contained in the guest memory or not. To modify the add or remove regions is easier, as they know it. > I think the only case that you may need to pay attention to in > implementation is in the SVQ address translation path, where if you come > to an HVA address for translation, you would need to tell apart which > tree you'd have to look up - if this HVA is backed by guest mem you > could use API qemu_ram_block_from_host() to infer the ram block then the > GPA, so you end up doing a lookup on the GPA tree; or else the HVA may > be from the SVQ mappings, where you'd have to search the HVA tree again > to look for host-mem-only range before you can claim the HVA is a > bogus/unmapped address... 
I'd leave this HVA -> IOVA tree for future performance optimization on
top, and focus on the aliased maps for a first series.

However, calling qemu_ram_block_from_host() is actually not needed if
the HVA tree contains all the translations, both SVQ and guest buffers
in memory.

> For now, this additional second
> lookup is sub-optimal but inevitable, but I think both of us agreed
> that you could start to implement this version first, and look for
> future opportunities to optimize the lookup performance on top.
>

Right, thanks for explaining!

> >
> > ---
> >
> > In the case where the first mapping here is removed (GPA [0x0,
> > 0x80000000)), why do we use the word "reintroduce" here? As I
> > understand it, when we remove a mapping, we're essentially
> > invalidating the IOVA range associated with that mapping, right? In
> > other words, the IOVA ranges here don't overlap, so removing a mapping
> > where its HVA range overlaps another mapping's HVA range shouldn't
> > affect the other mapping since they have unique IOVA ranges. Is my
> > understanding correct here or am I probably missing something?
> With the GPA tree I think this case should work fine. I've double
> checked the implementation of vhost-vdpa iotlb, and don't see a red
> flag there.
>
> Thanks,
> -Siwei
>
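For reference, the listener-only API variant discussed above (the "add
or remove regions" place where the caller knows the address is guest
memory) might be sketched as follows. vhost_iova_tree_map_alloc() and
iova_tree_insert() are existing helpers; the _gpa() name and the
gpa_map field are assumptions for illustration, not the actual series.

/*
 * Hypothetical sketch: same IOVA allocation path as
 * vhost_iova_tree_map_alloc(), but the result is also recorded in a
 * GPA -> IOVA tree. Only the vhost-vdpa memory listener would call
 * this variant; SVQ vrings and CVQ buffers keep using the plain one.
 */
int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map,
                                  hwaddr gpa)
{
    /* Allocate the IOVA range and record HVA -> IOVA, as today. */
    int r = vhost_iova_tree_map_alloc(tree, map);
    if (r != IOVA_OK) {
        return r;
    }

    /* Additionally record GPA -> IOVA for the datapath translation. */
    DMAMap gmap = {
        .iova = map->iova,
        .translated_addr = gpa,
        .size = map->size,
        .perm = map->perm,
    };
    return iova_tree_insert(tree->gpa_map, &gmap); /* gpa_map: assumed field */
}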
On 8/1/2024 1:22 AM, Eugenio Perez Martin wrote: > On Thu, Aug 1, 2024 at 2:41 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: >> Hi Jonah, >> >> On 7/31/2024 7:09 AM, Jonah Palmer wrote: >>>>>>>>>>>> Let me clarify, correct me if I was wrong: >>>>>>>>>>>> >>>>>>>>>>>> 1) IOVA allocator is still implemented via a tree, we just >>>>>>>>>>>> don't need >>>>>>>>>>>> to store how the IOVA is used >>>>>>>>>>>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is >>>>>>>>>>>> used in >>>>>>>>>>>> the datapath SVQ translation >>>>>>>>>>>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ >>>>>>>>>>>> >>>>>>>>>>> His solution is composed of three trees: >>>>>>>>>>> 1) One for the IOVA allocations, so we know where to allocate >>>>>>>>>>> new ranges >>>>>>>>>>> 2) One of the GPA -> SVQ IOVA translations. >>>>>>>>>>> 3) Another one for SVQ vrings translations. >>>>>>>>>>> >>> For my understanding, say we have those 3 memory mappings: >>> >>> HVA GPA IOVA >>> --------------------------------------------------- >>> Map >>> (1) [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000) [0x1000, >>> 0x80000000) >>> (2) [0x7f7983e00000, 0x7f9903e00000) [0x100000000, 0x2080000000) >>> [0x80001000, 0x2000001000) >>> (3) [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000, 0xfedc0000) >>> [0x2000001000, 0x2000021000) >>> >>> And then say when we go to unmap (e.g. vhost_vdpa_svq_unmap_ring) >>> we're given an HVA of 0x7f7903eb0000, which fits in both the first and >>> third mappings. >>> >>> The correct one to remove here would be the third mapping, right? Not >>> only because the HVA range of the third mapping has a more "specific" >>> or "tighter" range fit given an HVA of 0x7f7903eb0000 (which, as I >>> understand, may not always be the case in other scenarios), but mainly >>> because the HVA->GPA translation would give GPA 0xfedb0000, which only >>> fits in the third mapping's GPA range. Am I understanding this correctly? >> You're correct, we would still need a GPA -> IOVA tree for mapping and >> unmapping on guest mem. I've talked to Eugenio this morning and I think >> he is now aligned. Granted, this GPA tree is partial in IOVA space that >> doesn't contain ranges from host-only memory (e.g. backed by SVQ >> descriptors or buffers), we could create an API variant to >> vhost_iova_tree_map_alloc() and vhost_iova_tree_map_remove(), which not >> just adds IOVA -> HVA range to the HVA tree, but also manipulates the >> GPA tree to maintain guest memory mappings, i.e. only invoked from the >> memory listener ops. Such that this new API is distinguishable from the >> one in the SVQ mapping and unmapping path that only manipulates the HVA >> tree. >> > Right, I think I understand both Jason's and your approach better, and > I think it is the best one. To modify the lookup API is hard, as the > caller does not know if the HVA looked up is contained in the guest > memory or not. To modify the add or remove regions is easier, as they > know it. Exactly. 
>
>> I think the only case that you may need to pay attention to in the
>> implementation is the SVQ address translation path, where if you come
>> to an HVA address for translation, you would need to tell apart which
>> tree you'd have to look up - if this HVA is backed by guest mem you
>> could use the API qemu_ram_block_from_host() to infer the ram block and
>> then the GPA, so you end up doing a lookup on the GPA tree; or else the
>> HVA may be from the SVQ mappings, where you'd have to search the HVA
>> tree again to look for a host-mem-only range before you can claim the
>> HVA is a bogus/unmapped address...
> I'd leave this HVA -> IOVA tree for future performance optimization on
> top, and focus on the aliased maps for a first series.
>
> However, calling qemu_ram_block_from_host() is actually not needed if
> the HVA tree contains all the translations, both SVQ and guest buffers
> in memory.

If we don't have to take aliased maps or overlapped HVAs into account,
looking up the HVA tree itself should work. But I think calling
qemu_ram_block_from_host() further assures that we always deal with
the real ram block that backs the guest memory, while it is hard to
guarantee the same with an IOVA -> HVA tree when overlapped HVA ranges
exist. Since this is simple and reliable, as we avoid building the HVA
lookup tree around any assumption or API implication in the memory
subsystem, I'd lean toward using the existing memory system API to
simplify the implementation of the IOVA -> HVA tree (especially the
lookup routine).

>
>> For now, this additional second lookup is
>> sub-optimal but inevitable, but I think both of us agreed that you
>> could start to implement this version first, and look for future
>> opportunities to optimize the lookup performance on top.
>>
> Right, thanks for explaining!

Thanks for the discussion!

-Siwei

>
>>> ---
>>>
>>> In the case where the first mapping here is removed (GPA [0x0,
>>> 0x80000000)), why do we use the word "reintroduce" here? As I
>>> understand it, when we remove a mapping, we're essentially
>>> invalidating the IOVA range associated with that mapping, right? In
>>> other words, the IOVA ranges here don't overlap, so removing a mapping
>>> where its HVA range overlaps another mapping's HVA range shouldn't
>>> affect the other mapping since they have unique IOVA ranges. Is my
>>> understanding correct here or am I probably missing something?
>> With the GPA tree I think this case should work fine. I've double
>> checked the implementation of vhost-vdpa iotlb, and don't see a red
>> flag there.
>>
>> Thanks,
>> -Siwei
>>
>