Message ID | 1714406135-451286-20-git-send-email-steven.sistare@oracle.com (mailing list archive)
State      | New, archived
Series     | Live update: cpr-exec
On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> Preserve fields of RAMBlocks that allocate their host memory during CPR so
> the RAM allocation can be recovered.

This sentence itself does not explain much, IMHO. QEMU can already share
memory of all kinds using fd-based memory: as long as the memory backend is
path-based, it can be shared by passing the same paths to the destination.

This reads very confusingly as a generic concept. I mean, QEMU migration
relies on so many things to work right. We mostly ask users to "use exactly
the same cmdline for src/dst QEMU unless you know what you're doing",
otherwise many things can break. That should also include ramblocks being
matched between src/dst due to the same cmdlines provided on both sides.
It'll be confusing to mention this when we thought the ramblocks already
rely on that fact.

So IIUC this sentence should be dropped in the real patch, and I'll try to
guess the real reason below..

> Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> Preserve the old host address, even though it is immediately discarded,
> as it will be needed in the future for CPR with iommufd. Preserve
> guest_memfd, even though CPR does not yet support it, to maintain vmstate
> compatibility when it becomes supported.

.. It could be about the vfio vaddr update feature that you mentioned, and
only for iommufd (as IIUC vfio still relies on iova ranges, so it won't
help here)?

If so, IMHO we should have this patch (or some variant of it) appear with
your upcoming vfio support. Keeping it around like this will make the
series harder to review. Or is it needed even before VFIO?

Another thing to ask: does this idea also rely on some future iommufd
kernel support? If there's anything that's not merged in current Linux
upstream, this series needs to be marked as RFC, so it's not a target for
merging. This will also be true if this patch is "preparing" for that
work. It means that if this patch only serves the iommufd purpose, even if
it doesn't require any kernel header to be referenced, we should only merge
it together with the full iommufd support that comes later (and that'll be
after the iommufd kernel support lands).

Thanks,
On 5/28/2024 5:44 PM, Peter Xu wrote:
> On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
>> Preserve fields of RAMBlocks that allocate their host memory during CPR so
>> the RAM allocation can be recovered.
>
> This sentence itself does not explain much, IMHO. QEMU can already share
> memory of all kinds using fd-based memory: as long as the memory backend
> is path-based, it can be shared by passing the same paths to the
> destination.
>
> [...]
>
> So IIUC this sentence should be dropped in the real patch, and I'll try to
> guess the real reason below..

The properties of the implicitly created ramblocks must be preserved.
The defaults can and do change between qemu releases, even when the
command-line parameters do not change for the explicit objects that cause
these implicit ramblocks to be created.

>> Mirror the mr->align field in the RAMBlock to simplify the vmstate.
>> Preserve the old host address, even though it is immediately discarded,
>> as it will be needed in the future for CPR with iommufd. Preserve
>> guest_memfd, even though CPR does not yet support it, to maintain vmstate
>> compatibility when it becomes supported.
>
> .. It could be about the vfio vaddr update feature that you mentioned, and
> only for iommufd (as IIUC vfio still relies on iova ranges, so it won't
> help here)?
>
> If so, IMHO we should have this patch (or some variant of it) appear with
> your upcoming vfio support. Keeping it around like this will make the
> series harder to review. Or is it needed even before VFIO?

This patch is needed independently of vfio or iommufd.

guest_memfd is independent of vfio or iommufd. It is a recent addition
which I have not tried to support, but I added this placeholder field so
it can be supported in the future without adding a new field later and
maintaining backwards compatibility.

> Another thing to ask: does this idea also rely on some future iommufd
> kernel support? [...]

It does not rely on future kernel support.

- Steve
On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
> On 5/28/2024 5:44 PM, Peter Xu wrote:
>> [...]
>> So IIUC this sentence should be dropped in the real patch, and I'll try
>> to guess the real reason below..
>
> The properties of the implicitly created ramblocks must be preserved.
> The defaults can and do change between qemu releases, even when the
> command-line parameters do not change for the explicit objects that cause
> these implicit ramblocks to be created.

AFAIU, QEMU relied on ramblocks being the same before this series. Do you
have an example? Would that already cause an issue when migrating?

>> [...]
>> If so, IMHO we should have this patch (or some variant of it) appear
>> with your upcoming vfio support. Keeping it around like this will make
>> the series harder to review. Or is it needed even before VFIO?
>
> This patch is needed independently of vfio or iommufd.
>
> guest_memfd is independent of vfio or iommufd. It is a recent addition
> which I have not tried to support, but I added this placeholder field so
> it can be supported in the future without adding a new field later and
> maintaining backwards compatibility.

Is guest_memfd the only user so far, then? If so, would it be possible to
split it out as a separate effort on top of the base cpr-exec support?
On 5/29/2024 3:25 PM, Peter Xu wrote:
> On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
>> [...]
>> The properties of the implicitly created ramblocks must be preserved.
>> The defaults can and do change between qemu releases, even when the
>> command-line parameters do not change for the explicit objects that
>> cause these implicit ramblocks to be created.
>
> AFAIU, QEMU relied on ramblocks being the same before this series. Do you
> have an example? Would that already cause an issue when migrating?

Alignment has changed, and used_length vs max_length changed when
resizeable ramblocks were introduced. I have dealt with these issues while
supporting cpr for our internal use, and the lesson learned is to
explicitly communicate the creation-time parameters to new qemu.

These are not an issue for migration because the ramblock is re-created
and the data copied into the new memory.

>> [...]
>> guest_memfd is independent of vfio or iommufd. It is a recent addition
>> which I have not tried to support, but I added this placeholder field so
>> it can be supported in the future without adding a new field later and
>> maintaining backwards compatibility.
>
> Is guest_memfd the only user so far, then? If so, would it be possible to
> split it out as a separate effort on top of the base cpr-exec support?

I don't understand the question. I am indeed deferring support for
guest_memfd to a future time. For now, I am adding a blocker, and reserving
a field for it in the preserved ramblock attributes, to avoid adding a
subsection later.

- Steve
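[For readers unfamiliar with the used_length vs max_length split under
discussion, here is a minimal sketch in plain POSIX C, not QEMU code: a
resizeable region reserves max_length up front while only used_length backs
guest RAM, so the mapping size (max_length) must survive CPR even though
migration only transfers used_length. The field names are borrowed from
RAMBlock; everything else is assumed for illustration.]

    /* Illustrative only: a resizeable RAM region in the RAMBlock style. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t max_length  = 16u << 20;  /* size of the reserved mapping */
        size_t used_length = 4u << 20;   /* portion currently backing RAM */

        /* Reserve the whole range once; later resizes only change
         * used_length, never the mapping itself. */
        void *host = mmap(NULL, max_length, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (host == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(host, 0, used_length);    /* only the used part is touched */

        /* Growing within the reservation is pure bookkeeping. */
        used_length = 8u << 20;

        printf("host=%p used=%zu max=%zu\n", host, used_length, max_length);
        munmap(host, max_length);
        return 0;
    }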
On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
> On 5/29/2024 3:25 PM, Peter Xu wrote:
>> [...]
>> AFAIU, QEMU relied on ramblocks being the same before this series. Do
>> you have an example? Would that already cause an issue when migrating?
>
> Alignment has changed, and used_length vs max_length changed when
> resizeable ramblocks were introduced. I have dealt with these issues
> while supporting cpr for our internal use, and the lesson learned is to
> explicitly communicate the creation-time parameters to new qemu.

Why can used_length change? I'm looking at ram_mig_ram_block_resized():

    if (!migration_is_idle()) {
        /*
         * Precopy code on the source cannot deal with the size of RAM blocks
         * changing at random points in time - especially after sending the
         * RAM block sizes in the migration stream, they must no longer change.
         * Abort and indicate a proper reason.
         */
        error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr);
        migration_cancel(err);
        error_free(err);
    }

We send used_length upfront during the migration SETUP phase. Looks like
what you're describing may be something different, though?

Regarding rb->align: isn't that mostly a constant, reflecting the MR's
alignment? It's set when the ramblock is created, IIUC:

    rb->align = mr->align;

When will the alignment change?

> These are not an issue for migration because the ramblock is re-created
> and the data copied into the new memory.
>
>> [...]
>> Is guest_memfd the only user so far, then? If so, would it be possible
>> to split it out as a separate effort on top of the base cpr-exec
>> support?
>
> I don't understand the question. I am indeed deferring support for
> guest_memfd to a future time. For now, I am adding a blocker, and
> reserving a field for it in the preserved ramblock attributes, to avoid
> adding a subsection later.

I meant that I'm wondering whether the new ramblock vmsd may not be
required for the initial implementation.

E.g., IIUC vaddr is required by iommufd, and so far that's not part of the
initial support.

Then I think a major remaining thing is the fds that will need to be
shared. If we put guest_memfd aside, it can be really, mostly, about VFIO
fds. For that, I'm wondering whether you looked into something like this:

    commit da3e04b26fd8d15b344944504d5ffa9c5f20b54b
    Author: Zhenzhong Duan <zhenzhong.duan@intel.com>
    Date:   Tue Nov 21 16:44:10 2023 +0800

        vfio/pci: Make vfio cdev pre-openable by passing a file handle

I just noticed this when I was thinking of a way to avoid QEMU vfio-pci
opening the device at all, then I found we have something like that
already.. Then, if the mgmt wants, IIUC that fd can be passed down from
Libvirt cleanly to dest QEMU in a no-exec context. Would this work too,
and be cleaner / reuse existing infrastructure?

I think it's nice to always have libvirt managing most, or if possible
all, fds that QEMU uses; then we don't even need scm_rights. But I didn't
look deeper into this, just a thought.

When thinking about this, I also wonder how cpr-exec handles limited
environments like cgroups and especially seccomp. I'm not sure what the
status of that is in most cloud environments, but I think exec() / fork()
is definitely not always on the seccomp whitelist, and I think that's also
another reason why we can think about avoiding them.
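[For context on the scm_rights mechanism mentioned above: a minimal sketch
of SCM_RIGHTS fd passing over a Unix domain socket, the kernel facility
behind both libvirt's fd passing and the cpr-scm idea discussed later in
the thread. The send_fd helper name is illustrative, not QEMU's; error
handling is trimmed for brevity.]

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send one byte of payload plus a duplicated file descriptor.
     * The kernel installs a copy of fd into the receiving process. */
    static int send_fd(int sock, int fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        char buf[CMSG_SPACE(sizeof(fd))];
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = buf, .msg_controllen = sizeof(buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;   /* ancillary data carries the fd */
        cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }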
On 5/30/2024 2:39 PM, Peter Xu wrote:
> On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
>> [...]
>> Alignment has changed, and used_length vs max_length changed when
>> resizeable ramblocks were introduced. I have dealt with these issues
>> while supporting cpr for our internal use, and the lesson learned is to
>> explicitly communicate the creation-time parameters to new qemu.
>
> Why can used_length change? I'm looking at ram_mig_ram_block_resized():
> [...]
> We send used_length upfront during the migration SETUP phase. Looks like
> what you're describing may be something different, though?

I was imprecise. used_length did not change; it was introduced as being
different from max_length when resizeable ramblocks were introduced.

The max_length is not sent. It is an implicit property of the
implementation, and can change. It is the size of the memfd mapping, so we
need to know it and preserve it.

used_length is indeed sent during SETUP. We could also send max_length at
that time, and store both in the struct ramblock, and *maybe* that would
be safe, but that is more fragile and less future-proof than setting both
properties to the correct value when the ramblock struct is created.

And BTW, the ramblock properties are sent using ad-hoc code in setup.
I send them using nice clean vmstate.

> Regarding rb->align: isn't that mostly a constant, reflecting the MR's
> alignment? It's set when the ramblock is created, IIUC:
>
>     rb->align = mr->align;
>
> When will the alignment change?

The alignment specified by the mr to allocate a new block is an implicit
property of the implementation, and has changed before, from one qemu
release to another. Not often, but it did, and could again in the future.
Communicating the alignment from old qemu to new qemu is future-proof.

>> [...]
>> I don't understand the question. I am indeed deferring support for
>> guest_memfd to a future time. For now, I am adding a blocker, and
>> reserving a field for it in the preserved ramblock attributes, to avoid
>> adding a subsection later.
>
> I meant that I'm wondering whether the new ramblock vmsd may not be
> required for the initial implementation.
>
> E.g., IIUC vaddr is required by iommufd, and so far that's not part of
> the initial support.
>
> Then I think a major remaining thing is the fds that will need to be
> shared. If we put guest_memfd aside, it can be really, mostly, about
> VFIO fds.

The block->fd must be preserved. That is the fd of the memfd_create used
by cpr.

> For that, I'm wondering whether you looked into something like this:
>
>     commit da3e04b26fd8d15b344944504d5ffa9c5f20b54b
>     Author: Zhenzhong Duan <zhenzhong.duan@intel.com>
>     Date:   Tue Nov 21 16:44:10 2023 +0800
>
>         vfio/pci: Make vfio cdev pre-openable by passing a file handle
>
> I just noticed this when I was thinking of a way to avoid QEMU vfio-pci
> opening the device at all, then I found we have something like that
> already.. Then, if the mgmt wants, IIUC that fd can be passed down from
> Libvirt cleanly to dest QEMU in a no-exec context. Would this work too,
> and be cleaner / reuse existing infrastructure?

That capability as currently defined would not work for cpr. The fd is
pre-created, but qemu still calls the kernel to configure it. cpr skips
all kernel configuration calls.

> I think it's nice to always have libvirt managing most, or if possible
> all, fds that QEMU uses; then we don't even need scm_rights. But I
> didn't look deeper into this, just a thought.

One could imagine a solution where the manager extracts the internal
properties of vfio, ramblock, etc and passes them as creation-time
parameters on the new qemu command line, and pre-creates all fd's so they
can be passed to old and new qemu. Lots of code required in qemu and in
the manager, and all implicitly created objects would need to be made
explicit. Yuck. The precreate vmstate approach is much simpler for all.

> When thinking about this, I also wonder how cpr-exec handles limited
> environments like cgroups and especially seccomp. I'm not sure what the
> status of that is in most cloud environments, but I think exec() /
> fork() is definitely not always on the seccomp whitelist, and I think
> that's also another reason why we can think about avoiding them.

Exec must be allowed to use cpr-exec mode. Fork can remain blocked.
Currently the qemu sandbox option can block 'spawn', which blocks both
exec and fork. I have a patch in my next series that makes this more
fine-grained, so one or the other can be blocked. Those unwilling to allow
exec can wait for cpr-scm mode :)

- Steve
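[To make the seccomp concern concrete: a sketch, using libseccomp, of the
kind of filter that makes exec unavailable to a process. This is
illustrative only, not QEMU's actual -sandbox implementation; the
deny_spawn helper name is invented here. Link with -lseccomp.]

    #include <errno.h>
    #include <seccomp.h>

    /* Once this filter is loaded, execve() fails with EPERM, so a
     * cpr-exec style live update cannot run under it. Blocking exec and
     * fork with separate rules is what allows the finer granularity
     * Steve describes. */
    static int deny_spawn(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        int rc;

        if (!ctx) {
            return -1;
        }
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execve), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execveat), 0);
        rc = seccomp_load(ctx);
        seccomp_release(ctx);
        return rc;
    }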
On Fri, May 31, 2024 at 03:32:11PM -0400, Steven Sistare wrote:
> On 5/30/2024 2:39 PM, Peter Xu wrote:
>> [...]
>> We send used_length upfront during the migration SETUP phase. Looks
>> like what you're describing may be something different, though?
>
> I was imprecise. used_length did not change; it was introduced as being
> different from max_length when resizeable ramblocks were introduced.
>
> The max_length is not sent. It is an implicit property of the
> implementation, and can change. It is the size of the memfd mapping, so
> we need to know it and preserve it.
>
> used_length is indeed sent during SETUP. We could also send max_length
> at that time, and store both in the struct ramblock, and *maybe* that
> would be safe, but that is more fragile and less future-proof than
> setting both properties to the correct value when the ramblock struct is
> created.
>
> And BTW, the ramblock properties are sent using ad-hoc code in setup.
> I send them using nice clean vmstate.

Right, I agree that's not pretty at all... I wish we had had something
better, but that has just been there for years.

When you said max_length can change, could you give an example? I want to
know whether it means we already have a bug, and the bug fixing can even
be done before the rest.

Thinking about it now, maybe max_length is indeed fine to change across
migration? Consider the fact that only used_length is used on both src/dst
for e.g. migration, dirty tracking, etc. purposes. Basically we assumed
that's the "real size" of RAM, irrelevant of "how large it used to be
before migration" or "how large it can grow after migration completes",
while max_length is the "possible max value" here but isn't really
important for migration. E.g., mem resize can allow a larger range after
migration if the user somehow specifies a max_length on dest that is
larger than the src max_length, and logically migration should still work.
I just don't know whether there'll be people using it like that.

>> Regarding rb->align: isn't that mostly a constant, reflecting the MR's
>> alignment? It's set when the ramblock is created, IIUC:
>>
>>     rb->align = mr->align;
>>
>> When will the alignment change?
>
> The alignment specified by the mr to allocate a new block is an implicit
> property of the implementation, and has changed before, from one qemu
> release to another. Not often, but it did, and could again in the
> future. Communicating the alignment from old qemu to new qemu is
> future-proof.

Same on this one; do you have examples you can share? I hope we don't
introduce things without good reason. If we're talking about "alignment
can change", it'll be very helpful to know what we're fixing (before CPR's
need).

>> [...]
>> Then I think a major remaining thing is the fds that will need to be
>> shared. If we put guest_memfd aside, it can be really, mostly, about
>> VFIO fds.
>
> The block->fd must be preserved. That is the fd of the memfd_create used
> by cpr.

Right, cpr needs all fds to be passed over, and I think that's a great
idea. It could be a matter of how we mark those fds, how we pass them
over, and whether we need to manage them one by one, or in a batch.

E.g., in my mind now I'm picturing something; I probably shared it bit by
bit in my previous replies when trying to review your series, but in
general, a cleaner approach may look like this:

  - QEMU provides a fd-manager, managing all relevant fds. They can be
    ramblock fds, vfio fds, vhost fds, or whatever fds. We "name" these
    fds in some way, so that we know how to recover them on the other
    side. We don't differentiate them with different vmsds: no need to
    migrate a fd in the ramblock vmsd, then a fd in a vfio vmsd, then a fd
    in a vhost vmsd. We migrate them all, then modules can try to fetch
    them on dest qemu, perhaps transparently (like qemu_open_internal()
    on /dev/fdsets), maybe not. I haven't thought about the details.

  - FDs need to be passed over _before_ the VM starts. It might be easier
    to not attach that to a "pre" phase of "migration", but it might be
    doable in such a way that: when a cpr-xxx mode is supported, Libvirt
    can use a new QMP command to fetch all the FDs in one shot using scm
    rights (e.g., "fd-manager-fetch"), then apply that list of fds
    _before_ dest QEMU tries to initialize, using another QMP command
    (e.g., "fd-manager-apply"). QEMU src/dst don't talk at all about the
    FDs; they rely on Libvirt to set them up. A possible wire sketch is
    shown after this mail.

This will greatly simplify migration code on fd passovers, whether using
execve() or scm rights. In this picture, neither execve() nor a migration
protocol change is needed. The migration stream stays just like a normal
migration stream.

>> For that, I'm wondering whether you looked into something like this:
>>
>>     commit da3e04b26fd8d15b344944504d5ffa9c5f20b54b
>>     [...]
>>         vfio/pci: Make vfio cdev pre-openable by passing a file handle
>> [...]
>
> That capability as currently defined would not work for cpr. The fd is
> pre-created, but qemu still calls the kernel to configure it. cpr skips
> all kernel configuration calls.

It's just an idea. I didn't look into the details of it, but I suppose
from this part it might be similar to what cpr-exec would need when using
a new fd-manager or similar approach. Basically we allow fds to be passed
over too, not from the original qemu using exec() but from libvirt. Would
that work for us?

>> [...]
>
> One could imagine a solution where the manager extracts the internal
> properties of vfio, ramblock, etc and passes them as creation-time
> parameters on the new qemu command line, and pre-creates all fd's so
> they can be passed to old and new qemu. Lots of code required in qemu
> and in the manager, and all implicitly created objects would need to be
> made explicit. Yuck. The precreate vmstate approach is much simpler for
> all.

So please correct me here if I misunderstood, but isn't this a shared
problem with or without the precreate vmsd? IIUC we always need a way to
pass over the fds in this case, either by exec() or scm rights or other
approaches. It looks to me that here precreate is only the transport that
delivers those fds, or am I wrong?

>> [...]
>
> Exec must be allowed to use cpr-exec mode. Fork can remain blocked.
> Currently the qemu sandbox option can block 'spawn', which blocks both
> exec and fork. I have a patch in my next series that makes this more
> fine-grained, so one or the other can be blocked. Those unwilling to
> allow exec can wait for cpr-scm mode :)

The question is what cpr-scm will do differently from cpr-exec, and
whether we'd like them both! As a maintainer, I definitely want to
maintain as "little" as possible.. :-( If they play similar roles, I
suggest we stick with one for sure and discuss the design. If cpr-exec is
accepted, I hope it's because we decided to give up seccomp, rather than
to wait for cpr-scm. :)

Thanks,
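[For concreteness, the fd-manager flow Peter sketches might look like the
following on the wire. Both command names are his hypotheticals from the
mail above, not existing QMP commands, and the payload shape is invented
here purely for illustration:]

    (to src QEMU; the fds themselves ride along as SCM_RIGHTS ancillary
     data on the QMP Unix socket)
    -> { "execute": "fd-manager-fetch" }
    <- { "return": { "fds": [ { "name": "ram-node0" },
                              { "name": "vfio-dev0" } ] } }

    (to dst QEMU, before machine init consumes them)
    -> { "execute": "fd-manager-apply",
         "arguments": { "fds": [ { "name": "ram-node0" },
                                 { "name": "vfio-dev0" } ] } }
    <- { "return": {} }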
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 61deefe..b492d89 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -44,6 +44,7 @@ struct RAMBlock {
     uint64_t fd_offset;
     int guest_memfd;
     size_t page_size;
+    uint64_t align;
 
     /* dirty bitmap used during migration */
     unsigned long *bmap;
@@ -91,5 +92,10 @@ struct RAMBlock {
      */
     ram_addr_t postcopy_length;
 };
+
+#define RAM_BLOCK "RAMBlock"
+
+extern const VMStateDescription vmstate_ram_block;
+
 #endif
 #endif
diff --git a/system/physmem.c b/system/physmem.c
index 36d97ec..3019284 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1398,6 +1398,7 @@ static void *file_ram_alloc(RAMBlock *block,
         block->mr->align = MAX(block->mr->align, QEMU_VMALLOC_ALIGN);
     }
 #endif
+    block->align = block->mr->align;
 
     if (memory < block->page_size) {
         error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
@@ -1848,6 +1849,7 @@ static void *ram_block_alloc_host(RAMBlock *rb, Error **errp)
                        rb->idstr);
         }
     }
+    rb->align = mr->align;
 
     if (host) {
         memory_try_enable_merging(host, rb->max_length);
@@ -1934,6 +1936,7 @@ static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,
     rb->flags = ram_flags;
     rb->page_size = qemu_real_host_page_size();
     rb->mr = mr;
+    rb->align = mr->align;
 
     if (ram_flags & RAM_GUEST_MEMFD) {
         rb->guest_memfd = ram_block_create_guest_memfd(rb, errp);
@@ -2060,6 +2063,26 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
 }
 #endif
 
+const VMStateDescription vmstate_ram_block = {
+    .name = RAM_BLOCK,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .precreate = true,
+    .factory = true,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(align, RAMBlock),
+        VMSTATE_VOID_PTR(host, RAMBlock),
+        VMSTATE_INT32(fd, RAMBlock),
+        VMSTATE_INT32(guest_memfd, RAMBlock),
+        VMSTATE_UINT32(flags, RAMBlock),
+        VMSTATE_UINT64(used_length, RAMBlock),
+        VMSTATE_UINT64(max_length, RAMBlock),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+vmstate_register_init_factory(vmstate_ram_block, RAMBlock);
+
 static RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                          void (*resized)(const char*,
@@ -2070,6 +2093,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 {
     RAMBlock *new_block;
     int align;
+    g_autofree RAMBlock *preserved = NULL;
 
     assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
                           RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
@@ -2086,6 +2110,17 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     }
     new_block->resized = resized;
 
+    preserved = vmstate_claim_factory_object(RAM_BLOCK, new_block->idstr, 0);
+    if (preserved) {
+        assert(mr->align <= preserved->align);
+        mr->align = mr->align ?: preserved->align;
+        new_block->align = preserved->align;
+        new_block->fd = preserved->fd;
+        new_block->flags = preserved->flags;
+        new_block->used_length = preserved->used_length;
+        new_block->max_length = preserved->max_length;
+    }
+
     if (!host) {
         host = ram_block_alloc_host(new_block, errp);
         if (!host) {
@@ -2093,6 +2128,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
             g_free(new_block);
             return NULL;
         }
+        if (!(ram_flags & RAM_GUEST_MEMFD)) {
+            vmstate_register_named(new_block->idstr, 0, &vmstate_ram_block,
+                                   new_block);
+        }
     }
     new_block->host = host;
@@ -2157,6 +2196,7 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    vmstate_unregister_named(RAM_BLOCK, block->idstr, 0);
     qemu_ram_unset_idstr(block);
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
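[A sketch of what the consuming side of this vmstate might do with a
preserved memfd. The remap_preserved_ram helper is hypothetical, shown only
to illustrate why fd, max_length, and align are the fields worth
preserving; in the patch itself the equivalent logic would sit behind
ram_block_alloc_host.]

    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    /* Re-map a memfd inherited across exec at its preserved size. */
    static void *remap_preserved_ram(int fd, uint64_t max_length,
                                     uint64_t align)
    {
        struct stat st;

        /* Sanity-check that the inherited fd matches the preserved
         * mapping size before trusting it. */
        if (fstat(fd, &st) < 0 || (uint64_t)st.st_size < max_length) {
            return MAP_FAILED;
        }

        /* MAP_SHARED keeps the pages live across the exec: the old
         * process's mapping is gone, but the memfd contents are not. */
        void *host = mmap(NULL, max_length, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

        /* A real implementation would honor the preserved alignment,
         * e.g. via an aligned-mmap helper, rather than ignore it. */
        (void)align;
        return host;
    }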
Preserve fields of RAMBlocks that allocate their host memory during CPR so
the RAM allocation can be recovered.

Mirror the mr->align field in the RAMBlock to simplify the vmstate.
Preserve the old host address, even though it is immediately discarded,
as it will be needed in the future for CPR with iommufd. Preserve
guest_memfd, even though CPR does not yet support it, to maintain vmstate
compatibility when it becomes supported.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/ramblock.h |  6 ++++++
 system/physmem.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)