Message ID: 20241027100751.219214-1-yishaih@nvidia.com (mailing list archive)
Series: Enhances the vfio-virtio driver to support live migration
On Sun, 27 Oct 2024 12:07:44 +0200 Yishai Hadas <yishaih@nvidia.com> wrote: > > - According to the Virtio specification, a device has only two states: > RUNNING and STOPPED. Consequently, certain VFIO transitions (e.g., > RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When > transitioning to RUNNING_P2P, the device state is set to STOP and > remains STOPPED until it transitions back from RUNNING_P2P->RUNNING, at > which point it resumes its RUNNING state. Does this assume the virtio device is not a DMA target for another device? If so, how can we make such an assumption? Otherwise, what happens on a DMA write to the stopped virtio device? > - Furthermore, the Virtio specification does not support reading partial > or incremental device contexts. This means that during the PRE_COPY > state, the vfio-virtio driver reads the full device state. This step is > beneficial because it allows the device to send some "initial data" > before moving to the STOP_COPY state, thus reducing downtime by > preparing early. To avoid an infinite number of device calls during > PRE_COPY, the vfio-virtio driver limits this flow to a maximum of 128 > calls. After reaching this limit, the driver will report zero bytes > remaining in PRE_COPY, signaling to QEMU to transition to STOP_COPY. If the virtio spec doesn't support partial contexts, what makes it beneficial here? Can you qualify to what extent this initial data improves the overall migration performance? If it is beneficial, why is it beneficial to send initial data more than once? In particular, what heuristic supports capping iterations at 128? The code also only indicates this is to prevent infinite iterations. Would it be better to rate-limit calls, by reporting no data available for some time interval after the previous call? Thanks, Alex
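As a rough illustration of the state mapping under discussion, the sketch below shows how the VFIO migration states could collapse onto the two states the Virtio specification defines, with RUNNING_P2P already stopping the device so that RUNNING_P2P<->STOP becomes a no-op. This is a minimal sketch, not the driver source; the enum and function names are invented for illustration, and it deliberately says nothing about how incoming DMA to a stopped device is handled.

```c
/*
 * Minimal sketch, not the vfio-virtio driver source: how the VFIO
 * migration states discussed above could collapse onto the two states
 * the Virtio spec defines. All identifiers here are illustrative; the
 * real driver works with the VFIO_DEVICE_STATE_* uAPI values.
 */
#include <stdio.h>

enum vfio_mig_state { VFIO_RUNNING, VFIO_RUNNING_P2P, VFIO_PRE_COPY, VFIO_STOP, VFIO_STOP_COPY };
enum virtio_dev_state { VIRTIO_DEV_RUNNING, VIRTIO_DEV_STOPPED };

static enum virtio_dev_state virtio_state_for(enum vfio_mig_state s)
{
	switch (s) {
	case VFIO_RUNNING:
	case VFIO_PRE_COPY:	/* device keeps running while its context is read */
		return VIRTIO_DEV_RUNNING;
	case VFIO_RUNNING_P2P:	/* device is already stopped here...         */
	case VFIO_STOP:		/* ...so RUNNING_P2P->STOP is a no-op        */
	case VFIO_STOP_COPY:
	default:
		return VIRTIO_DEV_STOPPED;
	}
}

int main(void)
{
	printf("RUNNING_P2P maps to %s\n",
	       virtio_state_for(VFIO_RUNNING_P2P) == VIRTIO_DEV_STOPPED ?
	       "STOPPED" : "RUNNING");
	return 0;
}
```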
On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > On Sun, 27 Oct 2024 12:07:44 +0200 > Yishai Hadas <yishaih@nvidia.com> wrote: > > > > - According to the Virtio specification, a device has only two states: > RUNNING and STOPPED. Consequently, certain VFIO transitions (e.g., > RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When > transitioning to RUNNING_P2P, the device state is set to STOP and > remains STOPPED until it transitions back from RUNNING_P2P->RUNNING, at > which point it resumes its RUNNING state. > > Does this assume the virtio device is not a DMA target for another > device? If so, how can we make such an assumption? Otherwise, what > happens on a DMA write to the stopped virtio device? I was told the virtio spec says that during VFIO STOP it only stops doing outgoing DMA; it still must accept incoming operations. It was a point of debate whether the additional step (stop everything vs stop outgoing only) was necessary, and the virtio folks felt that stopping outgoing only was good enough. > If the virtio spec doesn't support partial contexts, what makes it > beneficial here? It still lets the receiver 'warm up', like allocating memory and approximately sizing things. > If it is beneficial, why is it beneficial to send initial data more than > once? I guess because it is allowed to change and the benefit is highest when the pre-copy data closely matches the final data. Rate limiting does seem better to me. Jason
On Mon, 28 Oct 2024 13:23:54 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > If the virtio spec doesn't support partial contexts, what makes it > > beneficial here? > > It stil lets the receiver 'warm up', like allocating memory and > approximately sizing things. > > > If it is beneficial, why is it beneficial to send initial data more than > > once? > > I guess because it is allowed to change and the benefit is highest > when the pre copy data closely matches the final data.. It would be useful to see actual data here. For instance, what is the latency advantage to allocating anything in the warm-up and what's the probability that allocation is simply refreshed versus starting over? Re-sending the initial data up to some arbitrary cap sounds more like we're making a policy decision in the driver to consume more migration bandwidth for some unknown latency trade-off at stop-copy. I wonder if that advantage disappears if the pre-copy data is at all stale relative to the current device state. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Monday, October 28, 2024 10:24 PM > > On Mon, 28 Oct 2024 13:23:54 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > If the virtio spec doesn't support partial contexts, what makes it > > > beneficial here? > > > > It stil lets the receiver 'warm up', like allocating memory and > > approximately sizing things. > > > > > If it is beneficial, why is it beneficial to send initial data more than > > > once? > > > > I guess because it is allowed to change and the benefit is highest > > when the pre copy data closely matches the final data.. > > It would be useful to see actual data here. For instance, what is the latency > advantage to allocating anything in the warm-up and what's the probability > that allocation is simply refreshed versus starting over? > Allocating everything during the warm-up phase, compared to no allocation, reduced the total VM downtime from 439 ms to 128 ms. This was tested using two PCI VF hardware devices per VM. The benefit comes from the device state staying mostly the same. We tested with configurations ranging from 1 to 4 devices per VM, varying the vCPUs and memory. Also, more detailed test results are captured in Figure-2 on page 6 at [1]. The commit log of patch 7 should have included the perf summary table showing the value of that patch. Yishai, if you are planning to send the next revision, please add it. > Re-sending the initial data up to some arbitrary cap sounds more like we're > making a policy decision in the driver to consume more migration bandwidth > for some unknown latency trade-off at stop-copy. I wonder if that advantage > disappears if the pre-copy data is at all stale relative to the current device > state. Thanks, > You're right. If the pre-copy data differs significantly from the current device state, the benefits might be lost. However, this can also depend on the device's design. A more advanced device could apply a low-pass filter to avoid unnecessary refreshes. > Alex [1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf
On Sun, 27 Oct 2024 12:07:44 +0200 Yishai Hadas <yishaih@nvidia.com> wrote: > This series enhances the vfio-virtio driver to support live migration > for virtio-net Virtual Functions (VFs) that are migration-capable. What's the status of making virtio-net VFs in QEMU migration capable? There would be some obvious benefits for the vfio migration ecosystem if we could validate migration of a functional device (ie. not mtty) in an L2 guest with no physical hardware dependencies. Thanks, Alex
On 28/10/2024 20:17, Alex Williamson wrote: > On Sun, 27 Oct 2024 12:07:44 +0200 > Yishai Hadas <yishaih@nvidia.com> wrote: > >> This series enhances the vfio-virtio driver to support live migration >> for virtio-net Virtual Functions (VFs) that are migration-capable. > > What's the status of making virtio-net VFs in QEMU migration capable? Currently, we don’t have plans to make virtio-net VFs in QEMU migration-capable. > > There would be some obvious benefits for the vfio migration ecosystem > if we could validate migration of a functional device (ie. not mtty) in > an L2 guest with no physical hardware dependencies. Thanks, > > Alex > Right, software testing could be beneficial; however, we are not the ones who own or perform the software simulation and its related tasks inside QEMU. So this task could be considered in the future, once the host side is actually accepted. Yishai
On 28/10/2024 18:23, Jason Gunthorpe wrote: > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: >> On Sun, 27 Oct 2024 12:07:44 +0200 >> Yishai Hadas <yishaih@nvidia.com> wrote: >> If the virtio spec doesn't support partial contexts, what makes it >> beneficial here? > > It stil lets the receiver 'warm up', like allocating memory and > approximately sizing things. > >> If it is beneficial, why is it beneficial to send initial data more than >> once? > > I guess because it is allowed to change and the benefit is highest > when the pre copy data closely matches the final data.. > > Rate limiting does seem better to me > > Jason Right, given that the device state is likely to remain mostly unchanged over a certain period, using a rate limiter could be a sensible approach. So, in V1, I plan to replace the hard-coded limit value of 128 with a rate limiter that reports no data available for some time interval after the previous call. I would start with a one-second interval, which seems reasonable for this kind of device. Yishai
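As a rough picture of the rate-limited PRE_COPY flow proposed above, the sketch below models the decision a driver could make each time userspace polls for pre-copy data: expose a fresh full device context only if the interval has elapsed since the previous read, otherwise report zero bytes remaining. This is a minimal userspace-style sketch, not the vfio-virtio code; the structure, constant, and function names are assumptions made for illustration.

```c
/*
 * Minimal sketch, not the vfio-virtio driver source: a time-based rate
 * limiter for PRE_COPY reads, replacing a fixed cap of 128 reads.
 * All names here are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define PRE_COPY_MIN_INTERVAL_NS (1ULL * 1000 * 1000 * 1000)	/* 1 second */

struct precopy_state {
	uint64_t last_read_ns;	/* 0 until the first full context read */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/*
 * Called when userspace asks how much PRE_COPY data is pending: return
 * true to re-read and expose a fresh full device context, false to
 * report zero bytes remaining for now.
 */
static bool precopy_should_refresh(struct precopy_state *s)
{
	uint64_t now = now_ns();

	if (s->last_read_ns && now - s->last_read_ns < PRE_COPY_MIN_INTERVAL_NS)
		return false;

	s->last_read_ns = now;
	return true;
}

int main(void)
{
	struct precopy_state s = { 0 };

	/* First query refreshes; an immediate second one is throttled. */
	printf("%d %d\n", precopy_should_refresh(&s), precopy_should_refresh(&s));
	return 0;
}
```

With a one-second interval, the number of full device-context reads scales with how long pre-copy actually runs rather than being capped at a fixed count of 128.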
On Mon, 28 Oct 2024 17:46:57 +0000 Parav Pandit <parav@nvidia.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Monday, October 28, 2024 10:24 PM > > > > On Mon, 28 Oct 2024 13:23:54 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > > > If the virtio spec doesn't support partial contexts, what makes it > > > > beneficial here? > > > > > > It stil lets the receiver 'warm up', like allocating memory and > > > approximately sizing things. > > > > > > > If it is beneficial, why is it beneficial to send initial data more than > > > > once? > > > > > > I guess because it is allowed to change and the benefit is highest > > > when the pre copy data closely matches the final data.. > > > > It would be useful to see actual data here. For instance, what is the latency > > advantage to allocating anything in the warm-up and what's the probability > > that allocation is simply refreshed versus starting over? > > > > Allocating everything during the warm-up phase, compared to no > allocation, reduced the total VM downtime from 439 ms to 128 ms. This > was tested using two PCI VF hardware devices per VM. > > The benefit comes from the device state staying mostly the same. > > We tested with different configurations from 1 to 4 devices per VM, > varied with vcpus and memory. Also, more detailed test results are > captured in Figure-2 on page 6 at [1]. Those numbers seem to correspond to column 1 of Figure 2 in the referenced document, but that's looking only at downtime. To me that chart seems to show a step function where there's ~400ms of downtime per device, which suggests we're serializing device resume in the stop-copy phase on the target without pre-copy. Figure 3 appears to look at total VM migration time, where pre-copy tends to show marginal improvements in smaller configurations, but up to 60% worse overall migration time as the vCPU, device, and VM memory size increase. The paper comes to the conclusion: It can be concluded that either of increasing the VM memory or device configuration has equal effect on the VM total migration time, but no effect on the VM downtime due to pre-copy enablement. Noting specifically "downtime" here ignores that the overall migration time actually got worse with pre-copy. Between columns 10 & 11 the device count is doubled. With pre-copy enabled, the migration time increases by 135% while with pre-copy disabled we only see a 113% increase. Between columns 11 & 12 the VM memory is further doubled. This results in another 33% increase in migration time with pre-copy enabled and only a 3% increase with pre-copy disabled. For the most part this entire figure shows that overall migration time with pre-copy enabled is either on par with or worse than the same with pre-copy disabled. We then move on to Tables 1 & 2, which are again back to specifically showing timing of operations related to downtime rather than overall migration time. The notable thing here seems to be that we've amortized the 300ms per device load time across the pre-copy phase, leaving only 11ms per device contributing to downtime. However, the paper also goes into this tangent: Our observations indicate that enabling device-level pre-copy results in more pre-copy operations of the system RAM and device state. This leads to a 50% reduction in memory (RAM) copy time in the device pre-copy method in the micro-benchmark results, saving 100 milliseconds of downtime.
I'd argue that this is an anti-feature. A less generous interpretation is that pre-copy extended the migration time, likely resulting in more RAM transfer during pre-copy, potentially to the point that the VM undershot its prescribed downtime. Further analysis should also look at the total data transferred for the migration and adherence to the configured VM downtime, rather than just the absolute downtime. At the end of the paper, I think we come to the same conclusion shown in Figure 1, where device load seems to be serialized and therefore significantly limits scalability. That could be parallelized, but even 300-400ms for loading all devices is still too much contribution to downtime. I'd therefore agree that pre-loading the device during pre-copy improves the scaling by an order of magnitude, but it doesn't solve the scaling problem. Also, it should not come with the cost of drawing out pre-copy and thus the overall migration time to this extent. The reduction in downtime related to RAM copy time should be evidence that the pre-copy behavior here has exceeded its scope and is interfering with the balance between pre- and post- copy elsewhere. Thanks, Alex > > [1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf >
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, October 30, 2024 1:58 AM > > On Mon, 28 Oct 2024 17:46:57 +0000 > Parav Pandit <parav@nvidia.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Monday, October 28, 2024 10:24 PM > > > > > > On Mon, 28 Oct 2024 13:23:54 -0300 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > > > > > If the virtio spec doesn't support partial contexts, what makes > > > > > it beneficial here? > > > > > > > > It stil lets the receiver 'warm up', like allocating memory and > > > > approximately sizing things. > > > > > > > > > If it is beneficial, why is it beneficial to send initial data > > > > > more than once? > > > > > > > > I guess because it is allowed to change and the benefit is highest > > > > when the pre copy data closely matches the final data.. > > > > > > It would be useful to see actual data here. For instance, what is > > > the latency advantage to allocating anything in the warm-up and > > > what's the probability that allocation is simply refreshed versus starting over? > > > > > > > Allocating everything during the warm-up phase, compared to no > > allocation, reduced the total VM downtime from 439 ms to 128 ms. This > > was tested using two PCI VF hardware devices per VM. > > > > The benefit comes from the device state staying mostly the same. > > > > We tested with different configurations from 1 to 4 devices per VM, > > varied with vcpus and memory. Also, more detailed test results are > > captured in Figure-2 on page 6 at [1]. > > Those numbers seems to correspond to column 1 of Figure 2 in the > referenced document, but that's looking only at downtime. Yes. What do you mean by only looking at the downtime? The intention was to measure the downtime in various configurations. Do you mean we should have looked at migration bandwidth, the amount of migrated data, and migration time too? If so, yes, some of them were not considered, as the focus was on two things: (a) total VM downtime and (b) total migration time. But with the recent tests, we looked at more things; this is explained in more detail below. > To me that chart > seems to show a step function where there's ~400ms of downtime per > device, which suggests we're serializing device resume in the stop-copy > phase on the target without pre-copy. > Yes. Even without serialization, the same bottleneck can be observed with a single device. And your orthogonal suggestion of using parallelism is very useful. The paper captures this aspect in the text on page 7, after Table 2. > Figure 3 appears to look at total VM migration time, where pre-copy tends to > show marginal improvements in smaller configurations, but up to 60% worse > overall migration time as the vCPU, device, and VM memory size increase. > The paper comes to the conclusion: > > It can be concluded that either of increasing the VM memory or > device configuration has equal effect on the VM total migration > time, but no effect on the VM downtime due to pre-copy > enablement. > > Noting specifically "downtime" here ignores that the overall migration time > actually got worse with pre-copy. > > Between columns 10 & 11 the device count is doubled. With pre-copy > enabled, the migration time increases by 135% while with pre-copy disabled > we only only see a 113% increase. Between columns 11 & 12 the VM > memory is further doubled. This results in another 33% increase in > migration time with pre-copy enabled and only a 3% increase with pre-copy > disabled. For the most part this entire figure shows that overall migration > time with pre-copy enabled is either on par with or worse than the same > with pre-copy disabled. > I will answer this part in more detail towards the end of the email. > > We then move on to Tables 1 & 2, which are again back to specifically > showing timing of operations related to downtime rather than overall > migration time. Yes, because the objective was to analyze the effects on, and improvements in, downtime across the various device, VM, and pre-copy configurations. > The notable thing here seems to be that we've amortized > the 300ms per device load time across the pre-copy phase, leaving only 11ms > per device contributing to downtime. > Correct. > However, the paper also goes into this tangent: > > Our observations indicate that enabling device-level pre-copy > results in more pre-copy operations of the system RAM and > device state. This leads to a 50% reduction in memory (RAM) > copy time in the device pre-copy method in the micro-benchmark > results, saving 100 milliseconds of downtime. > > I'd argue that this is an anti-feature. A less generous interpretation is that > pre-copy extended the migration time, likely resulting in more RAM transfer > during pre-copy, potentially to the point that the VM undershot its > prescribed downtime. VM downtime was close to the configured downtime, on the slightly higher side. > Further analysis should also look at the total data > transferred for the migration and adherence to the configured VM > downtime, rather than just the absolute downtime. > We did look at the device-side total data transferred to see how many iterations of pre-copy were done. > At the end of the paper, I think we come to the same conclusion shown in > Figure 1, where device load seems to be serialized and therefore significantly > limits scalability. That could be parallelized, but even 300-400ms for loading > all devices is still too much contribution to downtime. I'd therefore agree > that pre-loading the device during pre-copy improves the scaling by an order > of magnitude, Yep. > but it doesn't solve the scaling problem. Yes, your suggestion is very valid. Parallel operation from QEMU would make the downtime even smaller. The paper also highlighted this on page 7, after Table 2. > Also, it should not > come with the cost of drawing out pre-copy and thus the overall migration > time to this extent. Right, you pointed this out correctly. So we did several more tests in the last two days based on the insights you provided, and found an interesting outcome. We collected 30+ samples each for (a) pre-copy enabled and (b) pre-copy disabled. This was done for columns 10 and 11. The VM total migration time varied in the range of 13 to 60 seconds. Most noticeably, it varied across this large range even with pre-copy off. In the paper it was pure coincidence that every time pre-copy=on had a higher migration time compared to pre-copy=on. This led us to wrongly conclude that pre-copy influenced the higher migration time. After some investigation, we found a QEMU anomaly which was fixed/overcome by the "avail-switchover-bandwidth" knob. Basically, the bandwidth calculation was not accurate, due to which the migration time fluctuated a lot. This problem and solution are described in [2]. Following the solution in [2], we ran the exact same tests for columns 10 and 11 with "avail-switchover-bandwidth" configured. With that, for both modes (pre-copy=on and off) the total migration time stayed constant at 14-15 seconds. And this conclusion aligns with your analysis that "pre-copy should not extend the migration time this much". Great finding, proving that Figure 3 was incomplete in the paper. > The reduction in downtime related to RAM copy time > should be evidence that the pre-copy behavior here has exceeded its scope > and is interfering with the balance between pre- and post- copy elsewhere. As I explained above, pre-copy did its job; it didn't interfere. We just did not have enough of the right samples to analyze back then. Now it is resolved. Thanks a lot for the direction. > Thanks, > > Alex > > [1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf [2] https://lore.kernel.org/qemu-devel/20231010221922.40638-1-peterx@redhat.com/
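The role of "avail-switchover-bandwidth" described above can be pictured with a small model. The sketch below is not QEMU source; it is a hedged illustration, with invented names, of the switchover decision being discussed: the migration proceeds to the stop-copy phase only when the data still pending is estimated to fit within the configured downtime limit, so an inaccurate bandwidth estimate defers switchover and inflates total migration time, while a trusted override restores a stable decision.

```c
/*
 * Minimal sketch, not QEMU source: the switchover decision discussed in
 * the thread. All structure and field names are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct mig_params {
	uint64_t downtime_limit_ms;          /* configured max blackout time         */
	uint64_t measured_bandwidth;         /* bytes/s estimated during pre-copy    */
	uint64_t avail_switchover_bandwidth; /* bytes/s supplied by admin, 0 = unset */
};

static bool should_switchover(const struct mig_params *p, uint64_t pending_bytes)
{
	uint64_t bw = p->avail_switchover_bandwidth ?
		      p->avail_switchover_bandwidth : p->measured_bandwidth;

	/* Bytes we could still push during the allowed downtime window. */
	uint64_t threshold = bw * p->downtime_limit_ms / 1000;

	return pending_bytes <= threshold;
}

int main(void)
{
	/* 1 GiB still pending, 300 ms downtime limit. */
	struct mig_params p = {
		.downtime_limit_ms = 300,
		.measured_bandwidth = 1ULL << 30,	/* underestimated: ~1 GiB/s */
		.avail_switchover_bandwidth = 0,
	};
	uint64_t pending = 1ULL << 30;

	printf("measured only: %d\n", should_switchover(&p, pending)); /* 0: deferred */
	p.avail_switchover_bandwidth = 10ULL << 30;	/* trusted ~10 GiB/s link */
	printf("with override: %d\n", should_switchover(&p, pending)); /* 1: proceeds */
	return 0;
}
```

This is only a model of the trade-off; the real decision in QEMU takes more inputs and bookkeeping into account than shown here.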
On Thu, 31 Oct 2024 15:04:51 +0000 Parav Pandit <parav@nvidia.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, October 30, 2024 1:58 AM > > > > On Mon, 28 Oct 2024 17:46:57 +0000 > > Parav Pandit <parav@nvidia.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Monday, October 28, 2024 10:24 PM > > > > > > > > On Mon, 28 Oct 2024 13:23:54 -0300 > > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > > > > > > > If the virtio spec doesn't support partial contexts, what makes > > > > > > it beneficial here? > > > > > > > > > > It stil lets the receiver 'warm up', like allocating memory and > > > > > approximately sizing things. > > > > > > > > > > > If it is beneficial, why is it beneficial to send initial data > > > > > > more than once? > > > > > > > > > > I guess because it is allowed to change and the benefit is highest > > > > > when the pre copy data closely matches the final data.. > > > > > > > > It would be useful to see actual data here. For instance, what is > > > > the latency advantage to allocating anything in the warm-up and > > > > what's the probability that allocation is simply refreshed versus starting > > over? > > > > > > > > > > Allocating everything during the warm-up phase, compared to no > > > allocation, reduced the total VM downtime from 439 ms to 128 ms. This > > > was tested using two PCI VF hardware devices per VM. > > > > > > The benefit comes from the device state staying mostly the same. > > > > > > We tested with different configurations from 1 to 4 devices per VM, > > > varied with vcpus and memory. Also, more detailed test results are > > > captured in Figure-2 on page 6 at [1]. > > > > Those numbers seems to correspond to column 1 of Figure 2 in the > > referenced document, but that's looking only at downtime. > Yes. > What do you mean by only looking at the downtime? It's just a prelude to my interpretation that the paper is focusing mostly on the benefits to downtime and downplaying the apparent longer overall migration time while rationalizing the effect on RAM migration downtime. > The intention was to measure the downtime in various configurations. > Do you mean, we should have looked at migration bandwidth, migration amount of data, migration time too? > If so, yes, some of them were not considered as the focus was on two things: > a. total VM downtime > b. total migration time > > But with recent tests, we looked at more things. Explained more below. Good. Yes, there should be a more holistic approach, improving the thing we intend to improve without degrading other aspects. > > To me that chart > > seems to show a step function where there's ~400ms of downtime per > > device, which suggests we're serializing device resume in the stop-copy > > phase on the target without pre-copy. > > > Yes. even without serialization, when there is single device, same bottleneck can be observed. > And your orthogonal suggestion of using parallelism is very useful. > The paper captures this aspect in text on page 7 after the Table 2. > > > Figure 3 appears to look at total VM migration time, where pre-copy tends to > > show marginal improvements in smaller configurations, but up to 60% worse > > overall migration time as the vCPU, device, and VM memory size increase. 
> > The paper comes to the conclusion: > > > > It can be concluded that either of increasing the VM memory or > > device configuration has equal effect on the VM total migration > > time, but no effect on the VM downtime due to pre-copy > > enablement. > > > > Noting specifically "downtime" here ignores that the overall migration time > > actually got worse with pre-copy. > > > > Between columns 10 & 11 the device count is doubled. With pre-copy > > enabled, the migration time increases by 135% while with pre-copy disabled > > we only only see a 113% increase. Between columns 11 & 12 the VM > > memory is further doubled. This results in another 33% increase in > > migration time with pre-copy enabled and only a 3% increase with pre-copy > > disabled. For the most part this entire figure shows that overall migration > > time with pre-copy enabled is either on par with or worse than the same > > with pre-copy disabled. > > > I will answer this part in more detail towards the end of the email. > > > We then move on to Tables 1 & 2, which are again back to specifically > > showing timing of operations related to downtime rather than overall > > migration time. > Yes, because the objective was to analyze the effects and improvements on downtime of various configurations of device, VM, pre-copy. > > > The notable thing here seems to be that we've amortized > > the 300ms per device load time across the pre-copy phase, leaving only 11ms > > per device contributing to downtime. > > > Correct. > > > However, the paper also goes into this tangent: > > > > Our observations indicate that enabling device-level pre-copy > > results in more pre-copy operations of the system RAM and > > device state. This leads to a 50% reduction in memory (RAM) > > copy time in the device pre-copy method in the micro-benchmark > > results, saving 100 milliseconds of downtime. > > > > I'd argue that this is an anti-feature. A less generous interpretation is that > > pre-copy extended the migration time, likely resulting in more RAM transfer > > during pre-copy, potentially to the point that the VM undershot its > > prescribed downtime. > VM downtime was close to the configured downtime, on slightly higher side. > > > Further analysis should also look at the total data > > transferred for the migration and adherence to the configured VM > > downtime, rather than just the absolute downtime. > > > We did look the device side total data transferred to see how many iterations of pre-copy done. > > > At the end of the paper, I think we come to the same conclusion shown in > > Figure 1, where device load seems to be serialized and therefore significantly > > limits scalability. That could be parallelized, but even 300-400ms for loading > > all devices is still too much contribution to downtime. I'd therefore agree > > that pre-loading the device during pre-copy improves the scaling by an order > > of magnitude, > Yep. > > but it doesn't solve the scaling problem. > Yes, your suggestion is very valid. > Parallel operation from the qemu would make the downtime even smaller. > The paper also highlighted this in page 7 after Table-2. > > > Also, it should not > > come with the cost of drawing out pre-copy and thus the overall migration > > time to this extent. > Right. You pointed out rightly. > So we did several more tests in last 2 days for insights you provided. > And found an interesting outcome. > > In 30+ samples, we collected for each, > (a) pre-copy enabled and > (b) pre-copy disabled. > > This was done for column 10 and 11. 
> > The VM total migration time varied in range of 13 seconds to 60 seconds. > Most noticeably with pre-copy off also it varied in such large range. > > In the paper it was pure co-incidence that every time pre-copy=on had > higher migration time compared to pre-copy=on. This led us to Assuming typo here, =on vs =off. > misguide that pre-copy influenced the higher migration time. > > After some reason, we found the QEMU anomaly which was fixed/overcome > by the knob " avail-switchover-bandwidth". Basically the bandwidth > calculation was not accurate, due to which the migration time > fluctuated a lot. This problem and solution are described in [2]. > > Following the solution_2, > We ran exact same tests of column 10 and 11, with " > avail-switchover-bandwidth" configured. With that for both the modes > pre-copy=on and off the total migration time stayed constant to 14-15 > seconds. > > And this conclusion aligns with your analysis that "pre-copy should > not extent the migration time to this much". Great finding, proving > that figure_3 was incomplete in the paper. Great! So with this the difference in downtime related to RAM migration in the trailing tables of the paper becomes negligible? Is this using the originally proposed algorithm of migrating device data up to 128 consecutive times or is it using rate-limiting of device data in pre-copy? Any notable differences between those algorithms? > > The reduction in downtime related to RAM copy time > > should be evidence that the pre-copy behavior here has exceeded its > > scope and is interfering with the balance between pre- and post- > > copy elsewhere. > As I explained above, pre-copy did its job, it didn't interfere. It > was just not enough and right samples to analyze back then. Now it is > resolved. Thanks a lot for the direction. Glad we could arrive at a better understanding overall. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, November 1, 2024 9:55 PM > > On Thu, 31 Oct 2024 15:04:51 +0000 > Parav Pandit <parav@nvidia.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Wednesday, October 30, 2024 1:58 AM > > > > > > On Mon, 28 Oct 2024 17:46:57 +0000 > > > Parav Pandit <parav@nvidia.com> wrote: > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Monday, October 28, 2024 10:24 PM > > > > > > > > > > On Mon, 28 Oct 2024 13:23:54 -0300 Jason Gunthorpe > > > > > <jgg@nvidia.com> wrote: > > > > > > > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > > > > > > > > > If the virtio spec doesn't support partial contexts, what > > > > > > > makes it beneficial here? > > > > > > > > > > > > It stil lets the receiver 'warm up', like allocating memory > > > > > > and approximately sizing things. > > > > > > > > > > > > > If it is beneficial, why is it beneficial to send initial > > > > > > > data more than once? > > > > > > > > > > > > I guess because it is allowed to change and the benefit is > > > > > > highest when the pre copy data closely matches the final data.. > > > > > > > > > > It would be useful to see actual data here. For instance, what > > > > > is the latency advantage to allocating anything in the warm-up > > > > > and what's the probability that allocation is simply refreshed > > > > > versus starting > > > over? > > > > > > > > > > > > > Allocating everything during the warm-up phase, compared to no > > > > allocation, reduced the total VM downtime from 439 ms to 128 ms. > > > > This was tested using two PCI VF hardware devices per VM. > > > > > > > > The benefit comes from the device state staying mostly the same. > > > > > > > > We tested with different configurations from 1 to 4 devices per > > > > VM, varied with vcpus and memory. Also, more detailed test results > > > > are captured in Figure-2 on page 6 at [1]. > > > > > > Those numbers seems to correspond to column 1 of Figure 2 in the > > > referenced document, but that's looking only at downtime. > > Yes. > > What do you mean by only looking at the downtime? > > It's just a prelude to my interpretation that the paper is focusing mostly on the > benefits to downtime and downplaying the apparent longer overall migration > time while rationalizing the effect on RAM migration downtime. > Now after the new debug we shared, we know the other areas too. > > The intention was to measure the downtime in various configurations. > > Do you mean, we should have looked at migration bandwidth, migration > amount of data, migration time too? > > If so, yes, some of them were not considered as the focus was on two > things: > > a. total VM downtime > > b. total migration time > > > > But with recent tests, we looked at more things. Explained more below. > > Good. Yes, there should be a more holistic approach, improving the thing we > intend to improve without degrading other aspects. > Yes. > > > To me that chart > > > seems to show a step function where there's ~400ms of downtime per > > > device, which suggests we're serializing device resume in the > > > stop-copy phase on the target without pre-copy. > > > > > Yes. even without serialization, when there is single device, same bottleneck > can be observed. > > And your orthogonal suggestion of using parallelism is very useful. > > The paper captures this aspect in text on page 7 after the Table 2. 
> > > > > Figure 3 appears to look at total VM migration time, where pre-copy > > > tends to show marginal improvements in smaller configurations, but > > > up to 60% worse overall migration time as the vCPU, device, and VM > memory size increase. > > > The paper comes to the conclusion: > > > > > > It can be concluded that either of increasing the VM memory or > > > device configuration has equal effect on the VM total migration > > > time, but no effect on the VM downtime due to pre-copy > > > enablement. > > > > > > Noting specifically "downtime" here ignores that the overall > > > migration time actually got worse with pre-copy. > > > > > > Between columns 10 & 11 the device count is doubled. With pre-copy > > > enabled, the migration time increases by 135% while with pre-copy > > > disabled we only only see a 113% increase. Between columns 11 & 12 > > > the VM memory is further doubled. This results in another 33% > > > increase in migration time with pre-copy enabled and only a 3% > > > increase with pre-copy disabled. For the most part this entire > > > figure shows that overall migration time with pre-copy enabled is > > > either on par with or worse than the same with pre-copy disabled. > > > > > I will answer this part in more detail towards the end of the email. > > > > > We then move on to Tables 1 & 2, which are again back to > > > specifically showing timing of operations related to downtime rather than > overall > > > migration time. > > Yes, because the objective was to analyze the effects and improvements on > downtime of various configurations of device, VM, pre-copy. > > > > > The notable thing here seems to be that we've amortized the 300ms > > > per device load time across the pre-copy phase, leaving only 11ms > > > per device contributing to downtime. > > > > > Correct. > > > > > However, the paper also goes into this tangent: > > > > > > Our observations indicate that enabling device-level pre-copy > > > results in more pre-copy operations of the system RAM and > > > device state. This leads to a 50% reduction in memory (RAM) > > > copy time in the device pre-copy method in the micro-benchmark > > > results, saving 100 milliseconds of downtime. > > > > > > I'd argue that this is an anti-feature. A less generous > > > interpretation is that pre-copy extended the migration time, likely > > > resulting in more RAM transfer during pre-copy, potentially to the point > that the VM undershot its > > > prescribed downtime. > > VM downtime was close to the configured downtime, on slightly higher side. > > > > > Further analysis should also look at the total data transferred for > > > the migration and adherence to the configured VM downtime, rather > > > than just the absolute downtime. > > > > > We did look the device side total data transferred to see how many iterations > of pre-copy done. > > > > > At the end of the paper, I think we come to the same conclusion > > > shown in Figure 1, where device load seems to be serialized and > > > therefore significantly limits scalability. That could be > > > parallelized, but even 300-400ms for loading all devices is still > > > too much contribution to downtime. I'd therefore agree that pre-loading > the device during pre-copy improves the scaling by an order > > > of magnitude, > > Yep. > > > but it doesn't solve the scaling problem. > > Yes, your suggestion is very valid. > > Parallel operation from the qemu would make the downtime even smaller. > > The paper also highlighted this in page 7 after Table-2. 
> > > > > Also, it should not > > > come with the cost of drawing out pre-copy and thus the overall migration > > > time to this extent. > > Right. You pointed out rightly. > > So we did several more tests in last 2 days for insights you provided. > > And found an interesting outcome. > > > > In 30+ samples, we collected for each, > > (a) pre-copy enabled and > > (b) pre-copy disabled. > > > > This was done for column 10 and 11. > > > > The VM total migration time varied in range of 13 seconds to 60 seconds. > > Most noticeably with pre-copy off also it varied in such large range. > > > > In the paper it was pure co-incidence that every time pre-copy=on had > > higher migration time compared to pre-copy=on. This led us to > > Assuming typo here, =on vs =off. > Correct it is pre-copy=off. > > misguide that pre-copy influenced the higher migration time. > > > > After some reason, we found the QEMU anomaly which was fixed/overcome > > by the knob " avail-switchover-bandwidth". Basically the bandwidth > > calculation was not accurate, due to which the migration time > > fluctuated a lot. This problem and solution are described in [2]. > > > > Following the solution_2, > > We ran exact same tests of column 10 and 11, with " > > avail-switchover-bandwidth" configured. With that for both the modes > > pre-copy=on and off the total migration time stayed constant to 14-15 > > seconds. > > > > And this conclusion aligns with your analysis that "pre-copy should > > not extent the migration time to this much". Great finding, proving > > that figure_3 was incomplete in the paper. > > Great! So with this the difference in downtime related to RAM migration in the > trailing tables of the paper becomes negligible? Yes. > Is this using the originally > proposed algorithm of migrating device data up to 128 consecutive times or is > it using rate-limiting of device data in pre-copy? Both. Yishai has new rate limiting based algorithm which also has similar results. > Any notable differences > between those algorithms? > No significant differences. Vfio level data transfer size is less now, as the frequency is reduced with your suggested algorithm. > > > The reduction in downtime related to RAM copy time should be > > > evidence that the pre-copy behavior here has exceeded its scope and > > > is interfering with the balance between pre- and post- copy > > > elsewhere. > > As I explained above, pre-copy did its job, it didn't interfere. It > > was just not enough and right samples to analyze back then. Now it is > > resolved. Thanks a lot for the direction. > > Glad we could arrive at a better understanding overall. Thanks, > > Alex