
[v2,18/18] migration/ram: Add direct-io support to precopy file migration

Message ID 20240523190548.23977-19-farosas@suse.de (mailing list archive)
State New, archived
Series migration/mapped-ram: Add direct-io support

Commit Message

Fabiano Rosas May 23, 2024, 7:05 p.m. UTC
We've recently added support for direct-io with multifd, which brings
performance benefits, but creates a non-uniform user interface by
coupling direct-io with the multifd capability. This means that users
cannot keep the direct-io flag enabled while disabling multifd.

Libvirt in particular already has support for direct-io and parallel
migration separately from each other, so it would be a regression to
now require both options together. It's relatively simple for QEMU to
add support for direct-io migration without multifd, so let's do this
in order to keep both options decoupled.

We cannot simply enable the O_DIRECT flag, however, because not all IO
performed by the migration thread satisfies the alignment requirements
of O_DIRECT. There are many small reads & writes that add headers and
synchronization flags to the stream, which at the moment are required
to always be present.

Fortunately, due to fixed-ram migration there is a discernible moment
where only RAM pages are written to the migration file. Enable
direct-io during that moment.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/ram.c              | 40 ++++++++++++++++++++++++++++--------
 tests/qtest/migration-test.c | 30 +++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 9 deletions(-)

Comments

Peter Xu June 4, 2024, 8:56 p.m. UTC | #1
On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
> We've recently added support for direct-io with multifd, which brings
> performance benefits, but creates a non-uniform user interface by
> coupling direct-io with the multifd capability. This means that users
> cannot keep the direct-io flag enabled while disabling multifd.
> 
> Libvirt in particular already has support for direct-io and parallel
> migration separately from each other, so it would be a regression to
> now require both options together. It's relatively simple for QEMU to
> add support for direct-io migration without multifd, so let's do this
> in order to keep both options decoupled.
> 
> We cannot simply enable the O_DIRECT flag, however, because not all IO
> performed by the migration thread satisfies the alignment requirements
> of O_DIRECT. There are many small read & writes that add headers and
> synchronization flags to the stream, which at the moment are required
> to always be present.
> 
> Fortunately, due to fixed-ram migration there is a discernible moment
> where only RAM pages are written to the migration file. Enable
> direct-io during that moment.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Is anyone going to consume this?  How's the performance?

It doesn't look super fast to me if we need to enable/disable dio in each
loop... then it's a matter of whether we should bother, or whether it would be
easier to simply require multifd when direct-io=on.

Thanks,
Fabiano Rosas June 7, 2024, 6:42 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
>> We've recently added support for direct-io with multifd, which brings
>> performance benefits, but creates a non-uniform user interface by
>> coupling direct-io with the multifd capability. This means that users
>> cannot keep the direct-io flag enabled while disabling multifd.
>> 
>> Libvirt in particular already has support for direct-io and parallel
>> migration separately from each other, so it would be a regression to
>> now require both options together. It's relatively simple for QEMU to
>> add support for direct-io migration without multifd, so let's do this
>> in order to keep both options decoupled.
>> 
>> We cannot simply enable the O_DIRECT flag, however, because not all IO
>> performed by the migration thread satisfies the alignment requirements
>> of O_DIRECT. There are many small read & writes that add headers and
>> synchronization flags to the stream, which at the moment are required
>> to always be present.
>> 
>> Fortunately, due to fixed-ram migration there is a discernible moment
>> where only RAM pages are written to the migration file. Enable
>> direct-io during that moment.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> Is anyone going to consume this?  How's the performance?

I don't think we have a pre-determined consumer for this. This came up
in an internal discussion about making the interface simpler for libvirt
and in a thread on the libvirt mailing list[1] about using O_DIRECT to
keep the snapshot data out of the caches to avoid impacting the rest of
the system. (I could have described this better in the commit message,
sorry).

Quoting Daniel:

  "Note the reason for using O_DIRECT is *not* to make saving / restoring
   the guest VM faster. Rather it is to ensure that saving/restoring a VM
   does not trash the host I/O / buffer cache, which will negatively impact
   performance of all the *other* concurrently running VMs."

1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de

About performance: a quick test on a stopped 30G guest shows that
mapped-ram=on direct-io=on is 12% slower than mapped-ram=on
direct-io=off.

>
> It doesn't look super fast to me if we need to enable/disable dio in each
> loop.. then it's a matter of whether we should bother, or would it be
> easier that we simply require multifd when direct-io=on.

AIUI, the issue here is that users are already allowed to specify in
libvirt the equivalent of direct-io and multifd independently of each
other (bypass-cache, parallel). To start requiring both options together
now in some situations would be a regression. I confess I don't know the
libvirt code well enough to know whether this can be worked around somehow,
but as I said, it's a relatively simple change on the QEMU side.

Another option would be for libvirt to keep using multifd, but
make it 1 channel only if --parallel is not specified. That might be
enough to solve the interface issues. Of course, it's a different code
path altogether from the usual precopy code that gets executed when
multifd=off; I don't know whether that could be an issue somehow.
Jim Fehlig June 7, 2024, 8:39 p.m. UTC | #3
On 6/7/24 12:42 PM, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
>>> We've recently added support for direct-io with multifd, which brings
>>> performance benefits, but creates a non-uniform user interface by
>>> coupling direct-io with the multifd capability. This means that users
>>> cannot keep the direct-io flag enabled while disabling multifd.
>>>
>>> Libvirt in particular already has support for direct-io and parallel
>>> migration separately from each other, so it would be a regression to
>>> now require both options together. It's relatively simple for QEMU to
>>> add support for direct-io migration without multifd, so let's do this
>>> in order to keep both options decoupled.
>>>
>>> We cannot simply enable the O_DIRECT flag, however, because not all IO
>>> performed by the migration thread satisfies the alignment requirements
>>> of O_DIRECT. There are many small read & writes that add headers and
>>> synchronization flags to the stream, which at the moment are required
>>> to always be present.
>>>
>>> Fortunately, due to fixed-ram migration there is a discernible moment
>>> where only RAM pages are written to the migration file. Enable
>>> direct-io during that moment.
>>>
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>
>> Is anyone going to consume this?  How's the performance?
> 
> I don't think we have a pre-determined consumer for this. This came up
> in an internal discussion about making the interface simpler for libvirt
> and in a thread on the libvirt mailing list[1] about using O_DIRECT to
> keep the snapshot data out of the caches to avoid impacting the rest of
> the system. (I could have described this better in the commit message,
> sorry).
> 
> Quoting Daniel:
> 
>    "Note the reason for using O_DIRECT is *not* to make saving / restoring
>     the guest VM faster. Rather it is to ensure that saving/restoring a VM
>     does not trash the host I/O / buffer cache, which will negatively impact
>     performance of all the *other* concurrently running VMs."
> 
> 1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de
> 
> About performance, a quick test on a stopped 30G guest, shows
> mapped-ram=on direct-io=on it's 12% slower than mapped-ram=on
> direct-io=off.
> 
>>
>> It doesn't look super fast to me if we need to enable/disable dio in each
>> loop.. then it's a matter of whether we should bother, or would it be
>> easier that we simply require multifd when direct-io=on.
> 
> AIUI, the issue here that users are already allowed to specify in
> libvirt the equivalent to direct-io and multifd independent of each
> other (bypass-cache, parallel). To start requiring both together now in
> some situations would be a regression. I confess I don't know libvirt
> code to know whether this can be worked around somehow, but as I said,
> it's a relatively simple change from the QEMU side.

Currently, libvirt does not support --parallel with virDomainSave* and 
virDomainRestore* APIs. I'll work on that after getting support for mapped-ram 
merged. --parallel is supported in virDomainMigrate* APIs, but obviously those 
APIs don't accept --bypass-cache.

Regards,
Jim

> 
> Another option which would be for libvirt to keep using multifd, but
> make it 1 channel only if --parallel is not specified. That might be
> enough to solve the interface issues. Of course, it's a different code
> altogether than the usual precopy code that gets executed when
> multifd=off, I don't know whether that could be an issue somehow.
Peter Xu June 10, 2024, 4:09 p.m. UTC | #4
On Fri, Jun 07, 2024 at 03:42:35PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
> >> We've recently added support for direct-io with multifd, which brings
> >> performance benefits, but creates a non-uniform user interface by
> >> coupling direct-io with the multifd capability. This means that users
> >> cannot keep the direct-io flag enabled while disabling multifd.
> >> 
> >> Libvirt in particular already has support for direct-io and parallel
> >> migration separately from each other, so it would be a regression to
> >> now require both options together. It's relatively simple for QEMU to
> >> add support for direct-io migration without multifd, so let's do this
> >> in order to keep both options decoupled.
> >> 
> >> We cannot simply enable the O_DIRECT flag, however, because not all IO
> >> performed by the migration thread satisfies the alignment requirements
> >> of O_DIRECT. There are many small read & writes that add headers and
> >> synchronization flags to the stream, which at the moment are required
> >> to always be present.
> >> 
> >> Fortunately, due to fixed-ram migration there is a discernible moment
> >> where only RAM pages are written to the migration file. Enable
> >> direct-io during that moment.
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >
> > Is anyone going to consume this?  How's the performance?
> 
> I don't think we have a pre-determined consumer for this. This came up
> in an internal discussion about making the interface simpler for libvirt
> and in a thread on the libvirt mailing list[1] about using O_DIRECT to
> keep the snapshot data out of the caches to avoid impacting the rest of
> the system. (I could have described this better in the commit message,
> sorry).
> 
> Quoting Daniel:
> 
>   "Note the reason for using O_DIRECT is *not* to make saving / restoring
>    the guest VM faster. Rather it is to ensure that saving/restoring a VM
>    does not trash the host I/O / buffer cache, which will negatively impact
>    performance of all the *other* concurrently running VMs."
> 
> 1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de
> 
> About performance, a quick test on a stopped 30G guest, shows
> mapped-ram=on direct-io=on it's 12% slower than mapped-ram=on
> direct-io=off.

Yes, this makes sense.

> 
> >
> > It doesn't look super fast to me if we need to enable/disable dio in each
> > loop.. then it's a matter of whether we should bother, or would it be
> > easier that we simply require multifd when direct-io=on.
> 
> AIUI, the issue here that users are already allowed to specify in
> libvirt the equivalent to direct-io and multifd independent of each
> other (bypass-cache, parallel). To start requiring both together now in
> some situations would be a regression. I confess I don't know libvirt
> code to know whether this can be worked around somehow, but as I said,
> it's a relatively simple change from the QEMU side.

Firstly, I definitely want to avoid all the calls to either
migration_direct_io_start() or *_finish(); we already need to
explicitly call them in three paths, and that's not intuitive and less
readable, just like the hard-coded rdma code.

I also worry we may overlook the complexity here, and pinning buffers
definitely needs more thought on its own.  It's easier to digest when using
multifd and when QEMU only pins guest pages, just like tcp-zerocopy does:
those are naturally host-page-size aligned, and also guaranteed not to be
freed (being reused / modified is fine here, as dirty tracking guarantees a
new page will be migrated again soon).

IMHO here the "not be freed / modified" is even more important than
"alignment": the latter is about perf, the former is about correctness.
When we do directio on random buffers, AFAIU we don't want to have the
buffer modified before flushed to disk, and that's IMHO not easy to
guarantee.

E.g., I don't think this guarantees a flush on the buffer usages:

  migration_direct_io_start()
    /* flush any potentially unaligned IO before setting O_DIRECT */
    qemu_fflush(file);

qemu_fflush() internally does writev(), and that "flush" is about "flushing
the qemufile iov[] to the fd", not "flushing buffers to disk".  I think it
means that if we do qemu_fflush() and then modify QEMUFile.buf[IO_BUF_SIZE],
we're doomed: we will never know whether dio has happened, and which version
of the buffer will be sent; I don't think it's guaranteed it will always be
the old version of the buffer.

However, the issue is that QEMUFile defines qemu_fflush() as: after the call,
buf[] can be reused!  That suggests breaking things, I guess, in a dio context.
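
To make that distinction concrete, here is a minimal sketch (illustrative
only, not QEMU code) of the two different kinds of "flush" involved:

    #include <sys/uio.h>
    #include <unistd.h>

    static void flush_example(int fd, struct iovec *iov, int niov)
    {
        /* hands the bytes to the kernel (page cache, or straight to the
         * device if O_DIRECT is set) */
        if (writev(fd, iov, niov) < 0) {
            return;
        }
        /* only this forces any buffered data out to the device */
        fdatasync(fd);
    }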

IIUC currently mapped-ram is OK because mapped-ram is special in that it
doesn't have page headers, so it doesn't use buf[] during iterations;
while for zero pages it uses the file_bmap bitmap, which is separate too and
does not generate any bytes on the wire either.

xbzrle could use that buf[], but maybe mapped-ram doesn't work anyway with
xbzrle.

Everything here is very non-obvious and tricky, and this still looks
pretty dangerous to me.  Would migration_direct_io_finish() guarantee
something like an fdatasync()?  If so it looks safer, but if someone calls
qemu_fflush() and reuses the buffer between start() and finish() we can
still get hard-to-debug issues (the outcome would be that we see
corrupted migration files).

> 
> Another option which would be for libvirt to keep using multifd, but
> make it 1 channel only if --parallel is not specified. That might be
> enough to solve the interface issues. Of course, it's a different code
> altogether than the usual precopy code that gets executed when
> multifd=off, I don't know whether that could be an issue somehow.

Would there be any comment from the libvirt side?  This sounds like a good
solution if my above concern is real; as long as we always restrict dio to
guest pages we'll be fine.

Thanks,
Fabiano Rosas June 10, 2024, 5:45 p.m. UTC | #5
Peter Xu <peterx@redhat.com> writes:

> On Fri, Jun 07, 2024 at 03:42:35PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
>> >> We've recently added support for direct-io with multifd, which brings
>> >> performance benefits, but creates a non-uniform user interface by
>> >> coupling direct-io with the multifd capability. This means that users
>> >> cannot keep the direct-io flag enabled while disabling multifd.
>> >> 
>> >> Libvirt in particular already has support for direct-io and parallel
>> >> migration separately from each other, so it would be a regression to
>> >> now require both options together. It's relatively simple for QEMU to
>> >> add support for direct-io migration without multifd, so let's do this
>> >> in order to keep both options decoupled.
>> >> 
>> >> We cannot simply enable the O_DIRECT flag, however, because not all IO
>> >> performed by the migration thread satisfies the alignment requirements
>> >> of O_DIRECT. There are many small read & writes that add headers and
>> >> synchronization flags to the stream, which at the moment are required
>> >> to always be present.
>> >> 
>> >> Fortunately, due to fixed-ram migration there is a discernible moment
>> >> where only RAM pages are written to the migration file. Enable
>> >> direct-io during that moment.
>> >> 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >
>> > Is anyone going to consume this?  How's the performance?
>> 
>> I don't think we have a pre-determined consumer for this. This came up
>> in an internal discussion about making the interface simpler for libvirt
>> and in a thread on the libvirt mailing list[1] about using O_DIRECT to
>> keep the snapshot data out of the caches to avoid impacting the rest of
>> the system. (I could have described this better in the commit message,
>> sorry).
>> 
>> Quoting Daniel:
>> 
>>   "Note the reason for using O_DIRECT is *not* to make saving / restoring
>>    the guest VM faster. Rather it is to ensure that saving/restoring a VM
>>    does not trash the host I/O / buffer cache, which will negatively impact
>>    performance of all the *other* concurrently running VMs."
>> 
>> 1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de
>> 
>> About performance, a quick test on a stopped 30G guest, shows
>> mapped-ram=on direct-io=on it's 12% slower than mapped-ram=on
>> direct-io=off.
>
> Yes, this makes sense.
>
>> 
>> >
>> > It doesn't look super fast to me if we need to enable/disable dio in each
>> > loop.. then it's a matter of whether we should bother, or would it be
>> > easier that we simply require multifd when direct-io=on.
>> 
>> AIUI, the issue here that users are already allowed to specify in
>> libvirt the equivalent to direct-io and multifd independent of each
>> other (bypass-cache, parallel). To start requiring both together now in
>> some situations would be a regression. I confess I don't know libvirt
>> code to know whether this can be worked around somehow, but as I said,
>> it's a relatively simple change from the QEMU side.
>
> Firstly, I definitely want to already avoid all the calls to either
> migration_direct_io_start() or *_finish(), now we already need to
> explicitly call them in three paths, and that's not intuitive and less
> readable, just like the hard coded rdma codes.

Right, but that's just a side-effect of how the code is structured and
the fact that writes to the stream happen in small chunks. Setting
O_DIRECT needs to happen around aligned IO. We could move the calls
further down into qemu_put_buffer_at(), but that would be four fcntl()
calls for every page.

A tangent:
 one thing that occurred to me now is that we may be able to restrict
 calls to qemu_fflush() to internal code like add_to_iovec() and maybe
 use that function to gather the correct amount of data before writing,
 making sure it disables O_DIRECT in case alignment is about to be
 broken?

>
> I also worry we may overlook the complexity here, and pinning buffers
> definitely need more thoughts on its own.  It's easier to digest when using
> multifd and when QEMU only pins guest pages just like tcp-zerocopy does,
> which are naturally host page size aligned, and also guaranteed to not be
> freed (while reused / modified is fine here, as dirty tracking guarantees a
> new page will be migrated soon again).

I don't get this at all, sorry. What is different from multifd here?
We're writing to the same HVA as the one that would be given to multifd
(if it were enabled), and dirty tracking works the same.

> IMHO here the "not be freed / modified" is even more important than
> "alignment": the latter is about perf, the former is about correctness.
> When we do directio on random buffers, AFAIU we don't want to have the
> buffer modified before flushed to disk, and that's IMHO not easy to
> guarantee.
>
> E.g., I don't think this guarantees a flush on the buffer usages:
>
>   migration_direct_io_start()
>     /* flush any potentially unaligned IO before setting O_DIRECT */
>     qemu_fflush(file);
>
> qemu_fflush() internally does writev(), and that "flush" is about "flushing
> qemufile iov[] to fd", not "flushing buffers to disk".  I think it means
> if we do qemu_fflush() then we modify QEMUFile.buf[IO_BUF_SIZE] we're
> doomed: we will never know whether dio has happened, and which version of
> buffer will be sent; I don't think it's guaranteed it will always be the
> old version of the buffer.
>
> However the issue is, QEMUFile defines qemu_fflush() as: after call, the
> buf[] can be reused!  It suggests breaking things I guess in dio context.

I think you're mixing the usage of qemu_put_byte()/qemu_put_buffer()
with the usage of qemu_put_buffer_at(). The former two use the
QEMUFile.buf without O_DIRECT and the latter writes directly to the fd
at the page offset. So there's no issue in reusing buf before writes
have reached the disk. All writes going through buf are serialized and
all writes going through qio_channel_pwrite() go to a different offset.

I included all of these assert(!f->dio) checks to ensure that we don't use
the two APIs incorrectly, mainly that we don't try to write to buf while
O_DIRECT is set.
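
For illustration, a minimal self-contained sketch of that guard pattern
(names are illustrative, not the actual patch):

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct file_state {
        bool dio;   /* O_DIRECT currently enabled on the underlying fd */
    };

    /* Buffered, unaligned writes must never run while O_DIRECT is set;
     * only the aligned pwrite-style path may touch the fd in that window. */
    static void buffered_put(struct file_state *f, const void *buf, size_t len)
    {
        assert(!f->dio);
        /* ... append to the internal buffer as usual ... */
        (void)buf;
        (void)len;
    }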

>
> IIUC currently mapped-ram is ok because mapped-ram is just special that it
> doesn't have page headers, so it doesn't use the buf[] during iterations;
> while for zeropage it uses file_bmap bitmap and that's separate too and
> does not generate any byte on the wire either.

Right. This is all mapped-ram. I'm not proposing to enable O_DIRECT for
any migration.

>
> xbzrle could use that buf[], but maybe mapped-ram doesn't work anyway with
> xbzrle.
>
> Everything is just very not obvious and tricky to me.  This still looks
> pretty dangerous to me.  Would migration_direct_io_finish() guarantee
> something like a fdatasync()?  If so it looks safer, but still within the
> start() and finish() if someone calls qemu_fflush() and reuse the buffer we
> can still get hard to debug issues (as the outcome would be that we saw
> corrupted migration files).
>
>> 
>> Another option which would be for libvirt to keep using multifd, but
>> make it 1 channel only if --parallel is not specified. That might be
>> enough to solve the interface issues. Of course, it's a different code
>> altogether than the usual precopy code that gets executed when
>> multifd=off, I don't know whether that could be an issue somehow.
>
> Would there be any comment from Libvirt side?  This sounds like a good
> solution if my above concern is real; as long as we always stick dio with
> guest pages we'll be all fine.
>
> Thanks,
Peter Xu June 10, 2024, 7:02 p.m. UTC | #6
On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
> >> AIUI, the issue here that users are already allowed to specify in
> >> libvirt the equivalent to direct-io and multifd independent of each
> >> other (bypass-cache, parallel). To start requiring both together now in
> >> some situations would be a regression. I confess I don't know libvirt
> >> code to know whether this can be worked around somehow, but as I said,
> >> it's a relatively simple change from the QEMU side.
> >
> > Firstly, I definitely want to already avoid all the calls to either
> > migration_direct_io_start() or *_finish(), now we already need to
> > explicitly call them in three paths, and that's not intuitive and less
> > readable, just like the hard coded rdma codes.
> 
> Right, but that's just a side-effect of how the code is structured and
> the fact that writes to the stream happen in small chunks. Setting
> O_DIRECT needs to happen around aligned IO. We could move the calls
> further down into qemu_put_buffer_at(), but that would be four fcntl()
> calls for every page.

Hmm.. why we need four fcntl()s instead of two?

> 
> A tangent:
>  one thing that occured to me now is that we may be able to restrict
>  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
>  use that function to gather the correct amount of data before writing,
>  making sure it disables O_DIRECT in case alignment is about to be
>  broken?

IIUC dio doesn't require alignment if we don't care about perf?  I meant it
should be legal to write(fd, buffer, 5) even if O_DIRECT?

I just noticed the asserts you added in previous patch, I think that's
better indeed, but still I'm wondering whether we can avoid enabling it on
qemufile.

It makes me slightly nervous to introduce dio to QEMUFile rather
than the iochannels - the API design of QEMUFile, with its default static
buffering, seems to make it easy to break things in a dio world. And if
we're going to blacklist most of the API anyway except the new one for
mapped-ram, I start to wonder why bother on top of QEMUFile at all.

IIRC you also mentioned in the previous doc patch that libvirt should
always pass two fds into the fdset anyway if dio is enabled.  I wonder
whether that's also true for multifd=off and direct-io=on; if so, would it
be possible to use dio for the guest pages with one fd, while keeping the
normal stream on !dio with the other fd?  I'm not sure whether it's
easy to avoid qemufile on the dio fd, but even if not, it looks like we may
avoid frequent fcntl()s?
Daniel P. Berrangé June 10, 2024, 7:07 p.m. UTC | #7
On Mon, Jun 10, 2024 at 03:02:10PM -0400, Peter Xu wrote:
> On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
> > >> AIUI, the issue here that users are already allowed to specify in
> > >> libvirt the equivalent to direct-io and multifd independent of each
> > >> other (bypass-cache, parallel). To start requiring both together now in
> > >> some situations would be a regression. I confess I don't know libvirt
> > >> code to know whether this can be worked around somehow, but as I said,
> > >> it's a relatively simple change from the QEMU side.
> > >
> > > Firstly, I definitely want to already avoid all the calls to either
> > > migration_direct_io_start() or *_finish(), now we already need to
> > > explicitly call them in three paths, and that's not intuitive and less
> > > readable, just like the hard coded rdma codes.
> > 
> > Right, but that's just a side-effect of how the code is structured and
> > the fact that writes to the stream happen in small chunks. Setting
> > O_DIRECT needs to happen around aligned IO. We could move the calls
> > further down into qemu_put_buffer_at(), but that would be four fcntl()
> > calls for every page.
> 
> Hmm.. why we need four fcntl()s instead of two?
> 
> > 
> > A tangent:
> >  one thing that occured to me now is that we may be able to restrict
> >  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
> >  use that function to gather the correct amount of data before writing,
> >  making sure it disables O_DIRECT in case alignment is about to be
> >  broken?
> 
> IIUC dio doesn't require alignment if we don't care about perf?  I meant it
> should be legal to write(fd, buffer, 5) even if O_DIRECT?

No, we must assume that O_DIRECT requires alignment of both the userspace
memory buffers and the file offset on disk:

[quote man(open)]
  O_DIRECT
       The O_DIRECT flag may impose alignment restrictions  on  the  length
       and  address  of user-space buffers and the file offset of I/Os.  In
       Linux alignment restrictions vary by filesystem and  kernel  version
       and  might  be absent entirely.  The handling of misaligned O_DIRECT
       I/Os also varies; they can either fail with EINVAL or fall  back  to
       buffered I/O.
[/quote]

Given QEMU's code base, it is only safe for us to use O_DIRECT with RAM
blocks where we have predictable in-memory alignment, and have defined
a good on-disk offset alignment too.
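
As a minimal self-contained sketch of what that alignment means in practice
(illustrative only, not QEMU code; the exact requirements vary by filesystem
and kernel, as the man page says):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write one chunk with O_DIRECT: the user buffer, the length and the
     * file offset all need to respect the device/filesystem alignment
     * (4096 is used here as a common safe choice). */
    static int write_chunk_direct(const char *path, off_t aligned_off, size_t len)
    {
        void *buf;
        int fd, ret = -1;

        if (posix_memalign(&buf, 4096, len)) {
            return -1;
        }
        memset(buf, 0, len);
        fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
        if (fd >= 0) {
            if (pwrite(fd, buf, len, aligned_off) == (ssize_t)len) {
                ret = 0;
            }
            close(fd);
        }
        free(buf);
        return ret;
    }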


With regards,
Daniel
Fabiano Rosas June 10, 2024, 8:12 p.m. UTC | #8
Peter Xu <peterx@redhat.com> writes:

> On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
>> >> AIUI, the issue here that users are already allowed to specify in
>> >> libvirt the equivalent to direct-io and multifd independent of each
>> >> other (bypass-cache, parallel). To start requiring both together now in
>> >> some situations would be a regression. I confess I don't know libvirt
>> >> code to know whether this can be worked around somehow, but as I said,
>> >> it's a relatively simple change from the QEMU side.
>> >
>> > Firstly, I definitely want to already avoid all the calls to either
>> > migration_direct_io_start() or *_finish(), now we already need to
>> > explicitly call them in three paths, and that's not intuitive and less
>> > readable, just like the hard coded rdma codes.
>> 
>> Right, but that's just a side-effect of how the code is structured and
>> the fact that writes to the stream happen in small chunks. Setting
>> O_DIRECT needs to happen around aligned IO. We could move the calls
>> further down into qemu_put_buffer_at(), but that would be four fcntl()
>> calls for every page.
>
> Hmm.. why we need four fcntl()s instead of two?

Because we need to first get the flags before flipping the O_DIRECT
bit. And we do this once to enable and once to disable.

    int flags = fcntl(fioc->fd, F_GETFL);
    if (enabled) {
        flags |= O_DIRECT;
    } else {
        flags &= ~O_DIRECT;
    }
    fcntl(fioc->fd, F_SETFL, flags);
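
For reference, a self-contained version of that toggle with the error
handling spelled out (a sketch, not the exact QEMU helper):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdbool.h>

    /* Two fcntl()s per toggle: read the current flags, then write them back
     * with O_DIRECT set or cleared.  Enabling plus disabling makes four. */
    static int set_direct_io(int fd, bool enabled)
    {
        int flags = fcntl(fd, F_GETFL);

        if (flags < 0) {
            return -1;
        }
        if (enabled) {
            flags |= O_DIRECT;
        } else {
            flags &= ~O_DIRECT;
        }
        return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
    }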

>> 
>> A tangent:
>>  one thing that occured to me now is that we may be able to restrict
>>  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
>>  use that function to gather the correct amount of data before writing,
>>  making sure it disables O_DIRECT in case alignment is about to be
>>  broken?
>
> IIUC dio doesn't require alignment if we don't care about perf?  I meant it
> should be legal to write(fd, buffer, 5) even if O_DIRECT?

No, we may get an -EINVAL. See Daniel's reply.

>
> I just noticed the asserts you added in previous patch, I think that's
> better indeed, but still I'm wondering whether we can avoid enabling it on
> qemufile.
>
> It makes me feel slightly nervous when introducing dio to QEMUFile rather
> than iochannels - the API design of QEMUFile seems to easily encourage
> breaking things in dio worlds with a default and static buffering. And if
> we're going to blacklist most of the API anyway except the new one for
> mapped-ram, I start to wonder too why bother on top of QEMUFile anyway.
>
> IIRC you also mentioned in the previous doc patch so that libvirt should
> always pass in two fds anyway to the fdset if dio is enabled.  I wonder
> whether it's also true for multifd=off and directio=on, then would it be
> possible to use the dio for guest pages with one fd, while keeping the
> normal stream to use !dio with the other fd.  I'm not sure whether it's
> easy to avoid qemufile in the dio fd, even if not looks like we may avoid
> frequent fcntl()s?

Hm, sounds like a good idea. We'd need a place to put that new ioc
though. Either QEMUFile.direct_ioc and then make use of it in
qemu_put_buffer_at() or a more transparent QIOChannelFile.direct_fd that
gets set somewhere during file_start_outgoing_migration(). Let me try to
come up with something.
Fabiano Rosas June 12, 2024, 6:08 p.m. UTC | #9
Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
>>> >> AIUI, the issue here that users are already allowed to specify in
>>> >> libvirt the equivalent to direct-io and multifd independent of each
>>> >> other (bypass-cache, parallel). To start requiring both together now in
>>> >> some situations would be a regression. I confess I don't know libvirt
>>> >> code to know whether this can be worked around somehow, but as I said,
>>> >> it's a relatively simple change from the QEMU side.
>>> >
>>> > Firstly, I definitely want to already avoid all the calls to either
>>> > migration_direct_io_start() or *_finish(), now we already need to
>>> > explicitly call them in three paths, and that's not intuitive and less
>>> > readable, just like the hard coded rdma codes.
>>> 
>>> Right, but that's just a side-effect of how the code is structured and
>>> the fact that writes to the stream happen in small chunks. Setting
>>> O_DIRECT needs to happen around aligned IO. We could move the calls
>>> further down into qemu_put_buffer_at(), but that would be four fcntl()
>>> calls for every page.
>>
>> Hmm.. why we need four fcntl()s instead of two?
>
> Because we need to first get the flags before flipping the O_DIRECT
> bit. And we do this once to enable and once to disable.
>
>     int flags = fcntl(fioc->fd, F_GETFL);
>     if (enabled) {
>         flags |= O_DIRECT;
>     } else {
>         flags &= ~O_DIRECT;
>     }
>     fcntl(fioc->fd, F_SETFL, flags);
>
>>> 
>>> A tangent:
>>>  one thing that occured to me now is that we may be able to restrict
>>>  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
>>>  use that function to gather the correct amount of data before writing,
>>>  making sure it disables O_DIRECT in case alignment is about to be
>>>  broken?
>>
>> IIUC dio doesn't require alignment if we don't care about perf?  I meant it
>> should be legal to write(fd, buffer, 5) even if O_DIRECT?
>
> No, we may get an -EINVAL. See Daniel's reply.
>
>>
>> I just noticed the asserts you added in previous patch, I think that's
>> better indeed, but still I'm wondering whether we can avoid enabling it on
>> qemufile.
>>
>> It makes me feel slightly nervous when introducing dio to QEMUFile rather
>> than iochannels - the API design of QEMUFile seems to easily encourage
>> breaking things in dio worlds with a default and static buffering. And if
>> we're going to blacklist most of the API anyway except the new one for
>> mapped-ram, I start to wonder too why bother on top of QEMUFile anyway.
>>
>> IIRC you also mentioned in the previous doc patch so that libvirt should
>> always pass in two fds anyway to the fdset if dio is enabled.  I wonder
>> whether it's also true for multifd=off and directio=on, then would it be
>> possible to use the dio for guest pages with one fd, while keeping the
>> normal stream to use !dio with the other fd.  I'm not sure whether it's
>> easy to avoid qemufile in the dio fd, even if not looks like we may avoid
>> frequent fcntl()s?
>
> Hm, sounds like a good idea. We'd need a place to put that new ioc
> though. Either QEMUFile.direct_ioc and then make use of it in
> qemu_put_buffer_at() or a more transparent QIOChannelFile.direct_fd that
> gets set somewhere during file_start_outgoing_migration(). Let me try to
> come up with something.

I looked into this and it's cumbersome:

- We'd need to check migrate_direct_io() several times: once to get the
  second fd, and again during every IO to know whether to use that fd.

- Even getting the second fd is not straightforward: we need to create
  a new ioc for it with qio_channel_new_path(). But QEMUFile is generic
  code, so we'd probably need to call this channel-file-specific
  function from migration_channel_connect().

- With the new ioc, do we put it in QEMUFile, or do we take just the fd?
  Or maybe an ioc with two fds? Or a new QIOChannelDirect? All options
  look bad to me.

So I suggest we proceed with the 1 multifd channel approach,
passing 2 fds into QEMU just like we do for the n channels. Is that ok
from libvirt's perspective? I assume libvirt users are mostly interested
in _enabling_ parallelism with --parallel, rather than in _avoiding_ it by
omitting the option, so main thread + 1 channel should not be a bad thing.

Choosing to use 1 multifd channel now is also a gentler introduction for
when we finally move all of the vmstate migration into multifd (I've
been looking into this, but don't hold your breath).
Daniel P. Berrangé June 12, 2024, 6:15 p.m. UTC | #10
On Wed, Jun 12, 2024 at 03:08:02PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
> > Peter Xu <peterx@redhat.com> writes:
> >
> >> On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
> I looked into this and it's cumbersome:
> 
> - We'd need to check migrate_direct_io() several times, once to get the
>   second fd and during every IO to know to use the fd.
> 
> - Even getting the second fd is not straight forward, we need to create
>   a new ioc for it with qio_channel_new_path(). But QEMUFile is generic
>   code, so we'd probably need to call this channel-file specific
>   function from migration_channel_connect().
> 
> - With the new ioc, do we put it in QEMUFile, or do we take the fd only?
>   Or maybe an ioc with two fds? Or a new QIOChannelDirect? All options
>   look bad to me.
> 
> So I suggest we proceed proceed with the 1 multifd channel approach,
> passing 2 fds into QEMU just like we do for the n channels. Is that ok
> from libvirt's perspective? I assume libvirt users are mostly interested
> in _enabling_ parallelism with --parallel, instead of _avoiding_ it with
> the ommision of the option, so main thread + 1 channel should not be a
> bad thing.

IIUC, with the "fixed-ram" feature, the on-disk format of a saved VM
should end up the same whether we're using traditional migration, or
multifd migration. Use of multifd is simply an optimization that lets
us write RAM in parallel to the file, with direct-io further optimizing.

There's also a clear break with libvirt between the existing on-disk
format libvirt uses and the new fixed-ram format. So we have no backwards
compatibility concerns added from multifd, beyond what we already have to
figure out when deciding on use of 'fixed-ram'.

Thus I believe there is no downside to always using multifd for save
images with fixed-ram, even if we only want nchannels=1.


> Choosing to use 1 multifd channel now is also a gentler introduction for
> when we finally move all of the vmstate migration into multifd (I've
> been looking into this, but don't hold your breaths).

Yes, future proofing is a good idea.

With regards,
Daniel
Peter Xu June 12, 2024, 6:27 p.m. UTC | #11
On Wed, Jun 12, 2024 at 07:15:19PM +0100, Daniel P. Berrangé wrote:
> IIUC, with the "fixed-ram" feature, the on-disk format of a saved VM
> should end up the same whether we're using traditional migration, or
> multifd migration. Use of multifd is simply an optimization that lets
> us write RAM in parallel to the file, with direct-io further optimizing.
> 
> There's also a clear break with libvirt between the existing on-disk
> format libvirt uses, and the new fixed-ram format. So we have no backwards
> compatibilty concerns added from multifd, beyond what we already have to
> figure out when deciding on use of 'fixed-ram'. 
> 
> Thus I believe there is no downside to always using multifd for save
> images with fixed-ram, even if we only want nchannels=1.

That sounds good.

Just to double-check with all of us: we allow mapped-ram to be used in
any case when !dio, but we restrict dio to multifd=on only, am I
right?

I'd personally like that, and it also pretty much matches what we have
with tcp zerocopy send. After all, they're really similar to me in terms of
pinning implications and locked_vm restrictions.  It's just that the target
of the data movement is different here: either to the NIC, or to/from a file.

Thanks,
Fabiano Rosas June 12, 2024, 6:44 p.m. UTC | #12
Peter Xu <peterx@redhat.com> writes:

> On Wed, Jun 12, 2024 at 07:15:19PM +0100, Daniel P. Berrangé wrote:
>> IIUC, with the "fixed-ram" feature, the on-disk format of a saved VM
>> should end up the same whether we're using traditional migration, or
>> multifd migration. Use of multifd is simply an optimization that lets
>> us write RAM in parallel to the file, with direct-io further optimizing.
>> 
>> There's also a clear break with libvirt between the existing on-disk
>> format libvirt uses, and the new fixed-ram format. So we have no backwards
>> compatibilty concerns added from multifd, beyond what we already have to
>> figure out when deciding on use of 'fixed-ram'. 
>> 
>> Thus I believe there is no downside to always using multifd for save
>> images with fixed-ram, even if we only want nchannels=1.
>
> That sounds good.
>
> Just to double check with all of us: so we allow mapped-ram to be used in
> whatever case when !dio, however we restrict dio only when with multifd=on,
> am I right?

Yes. The restricting part is not yet in place. I'll add a multifd check
to migrate_direct_io():

bool migrate_direct_io(void)
{
    MigrationState *s = migrate_get_current();

    return s->parameters.direct_io &&
        s->capabilities[MIGRATION_CAPABILITY_MAPPED_RAM] &&
        s->capabilities[MIGRATION_CAPABILITY_MULTIFD];
}
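
A hypothetical call site, just to sketch how this check would gate the fd
setup (names assumed for illustration, not the actual QEMU code):

    /* Only flip O_DIRECT on the data fd when the combined check above says
     * direct I/O is really in use (mapped-ram + multifd + direct-io). */
    if (migrate_direct_io()) {
        set_direct_io(data_fd, true);   /* as sketched earlier in the thread */
    }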

Patch

diff --git a/migration/ram.c b/migration/ram.c
index ceea586b06..5183d1f97c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3131,6 +3131,7 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
     int i;
     int64_t t0;
     int done = 0;
+    Error **errp = NULL;
 
     /*
      * We'll take this lock a little bit long, but it's okay for two reasons.
@@ -3154,6 +3155,10 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
                 goto out;
             }
 
+            if (!migration_direct_io_start(f, errp)) {
+                return -errno;
+            }
+
             t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
             i = 0;
             while ((ret = migration_rate_exceeded(f)) == 0 ||
@@ -3194,6 +3199,9 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
                 }
                 i++;
             }
+            if (!migration_direct_io_finish(f, errp)) {
+                return -errno;
+            }
         }
     }
 
@@ -3242,7 +3250,8 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
 {
     RAMState **temp = opaque;
     RAMState *rs = *temp;
-    int ret = 0;
+    int ret = 0, pages;
+    Error **errp = NULL;
 
     rs->last_stage = !migration_in_colo_state();
 
@@ -3257,25 +3266,30 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
             return ret;
         }
 
+        if (!migration_direct_io_start(f, errp)) {
+            return -errno;
+        }
+
         /* try transferring iterative blocks of memory */
 
         /* flush all remaining blocks regardless of rate limiting */
         qemu_mutex_lock(&rs->bitmap_mutex);
         while (true) {
-            int pages;
-
             pages = ram_find_and_save_block(rs);
-            /* no more blocks to sent */
-            if (pages == 0) {
+            if (pages <= 0) {
                 break;
             }
-            if (pages < 0) {
-                qemu_mutex_unlock(&rs->bitmap_mutex);
-                return pages;
-            }
         }
         qemu_mutex_unlock(&rs->bitmap_mutex);
 
+        if (!migration_direct_io_finish(f, errp)) {
+            return -errno;
+        }
+
+        if (pages < 0) {
+            return pages;
+        }
+
         ret = rdma_registration_stop(f, RAM_CONTROL_FINISH);
         if (ret < 0) {
             qemu_file_set_error(f, ret);
@@ -3920,6 +3934,10 @@  static bool read_ramblock_mapped_ram(QEMUFile *f, RAMBlock *block,
     void *host;
     size_t read, unread, size;
 
+    if (!migration_direct_io_start(f, errp)) {
+        return false;
+    }
+
     for (set_bit_idx = find_first_bit(bitmap, num_pages);
          set_bit_idx < num_pages;
          set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
@@ -3955,6 +3973,10 @@  static bool read_ramblock_mapped_ram(QEMUFile *f, RAMBlock *block,
         }
     }
 
+    if (!migration_direct_io_finish(f, errp)) {
+        return false;
+    }
+
     return true;
 
 err:
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 5ced3b90c9..8c6a122c20 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2245,6 +2245,34 @@  static void test_multifd_file_mapped_ram_dio(void)
     test_file_common(&args, true);
 }
 
+static void *mapped_ram_dio_start(QTestState *from, QTestState *to)
+{
+    migrate_mapped_ram_start(from, to);
+
+    migrate_set_parameter_bool(from, "direct-io", true);
+    migrate_set_parameter_bool(to, "direct-io", true);
+
+    return NULL;
+}
+
+static void test_precopy_file_mapped_ram_dio(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+                                           FILE_TEST_FILENAME);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = mapped_ram_dio_start,
+    };
+
+    if (!probe_o_direct_support(tmpfs)) {
+        g_test_skip("Filesystem does not support O_DIRECT");
+        return;
+    }
+
+    test_file_common(&args, true);
+}
+
 #ifndef _WIN32
 static void multifd_mapped_ram_fdset_end(QTestState *from, QTestState *to,
                                          void *opaque)
@@ -3735,6 +3763,8 @@  int main(int argc, char **argv)
 
     migration_test_add("/migration/multifd/file/mapped-ram/dio",
                        test_multifd_file_mapped_ram_dio);
+    migration_test_add("/migration/precopy/file/mapped-ram/dio",
+                       test_precopy_file_mapped_ram_dio);
 
 #ifndef _WIN32
     qtest_add_func("/migration/multifd/file/mapped-ram/fdset",