| Message ID | 20221025044730.319941-5-leobras@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | MultiFD zero-copy improvements |
On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
> Zero-copy multifd migration sends both the header and the memory pages in a
> single syscall. Since it's necessary to flush before reusing the header, a
> header array was implemented, so each write call uses a different array
> entry, and flushing only takes place after all headers have been used,
> meaning 1 flush for each N writes.
>
> This method has a bottleneck, though: after the last write, a flush will
> have to wait for all writes to finish, which can be a lot of them, meaning
> the recvmsg() syscall called in qio_channel_socket_flush() will be called
> many times. On top of that, it creates a period when the I/O queue is
> empty and nothing is getting sent: between the flush and the next write.
>
> To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
> until at most half of the array is still in use (i.e. the LRU half of the
> array can be reused).
>
> Flushing the LRU half of the array is much faster, since it does not have
> to wait for the most recent writes to finish, making up for having to
> flush twice per array.
>
> As a main benefit, this approach keeps the I/O queue from being empty while
> there is still data to be sent, making it easier to sustain maximum I/O
> throughput while consuming less CPU time.

Doesn't this defeat the reason for adding the flush in the first place,
which was to ensure that a migration iteration was fully sent before
starting the next iteration over RAM? If it is OK to only partially flush
on each iteration, then why do we need to flush at all?

With regards,
Daniel
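To make the mechanism under discussion concrete, here is a minimal sketch of the header-ring pattern the commit message describes, using assumed names (send_channel, flush_pending_sends) rather than QEMU's actual structures: each zero-copy write consumes one header slot, and after every half-ring of writes the sender waits until at most half of the sends are still pending, so the slots about to be reused are known to have completed.

```c
/*
 * Minimal sketch with assumed names; not QEMU's actual code.  Each
 * zero-copy write consumes one header slot, and once half the ring has
 * been handed to the kernel we wait until at most half of the sends are
 * still pending, so the slots we are about to reuse have completed.
 */
#include <stddef.h>

#define HEADER_ARR_SZ 1024

struct send_channel {
    char *headers;        /* HEADER_ARR_SZ contiguous header slots */
    size_t header_len;    /* size of one header */
    unsigned int idx;     /* next slot to hand to the kernel */
};

/*
 * Stand-in for qio_channel_flush(ioc, max_pending, &err).  A real
 * implementation drains zero-copy completions until at most @max_pending
 * sends remain unacknowledged; here it is a stub so the sketch stays
 * self-contained.
 */
static int flush_pending_sends(struct send_channel *c, unsigned int max_pending)
{
    (void)c;
    (void)max_pending;
    return 0;
}

static int send_one_packet(struct send_channel *c)
{
    char *hdr = c->headers + c->idx * c->header_len;

    /* ... fill *hdr and issue the zero-copy write of header + pages ... */
    (void)hdr;

    c->idx = (c->idx + 1) % HEADER_ARR_SZ;

    /* Every HEADER_ARR_SZ / 2 writes, reclaim the LRU half of the ring. */
    if (c->idx % (HEADER_ARR_SZ / 2) == 0) {
        return flush_pending_sends(c, HEADER_ARR_SZ / 2);
    }
    return 0;
}
```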
Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
>> Zero-copy multifd migration sends both the header and the memory pages in a
>> single syscall. Since it's necessary to flush before reusing the header, a
>> header array was implemented, so each write call uses a different array
>> entry, and flushing only takes place after all headers have been used,
>> meaning 1 flush for each N writes.
>>
>> This method has a bottleneck, though: after the last write, a flush will
>> have to wait for all writes to finish, which can be a lot of them, meaning
>> the recvmsg() syscall called in qio_channel_socket_flush() will be called
>> many times. On top of that, it creates a period when the I/O queue is
>> empty and nothing is getting sent: between the flush and the next write.
>>
>> To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
>> until at most half of the array is still in use (i.e. the LRU half of the
>> array can be reused).
>>
>> Flushing the LRU half of the array is much faster, since it does not have
>> to wait for the most recent writes to finish, making up for having to
>> flush twice per array.
>>
>> As a main benefit, this approach keeps the I/O queue from being empty while
>> there is still data to be sent, making it easier to sustain maximum I/O
>> throughput while consuming less CPU time.
>
> Doesn't this defeat the reason for adding the flush in the first
> place, which was to ensure that a migration iteration was fully
> sent before starting the next iteration over RAM? If it is OK to
> only partially flush on each iteration, then why do we need to
> flush at all?

Now we need to do the flush in two places:
- on sync_main (the old place)
- on the migration thread, when we run out of array entries.
  This one has nothing to do with correctness, it is just that we have
  more space than needed.

So, in this regard, I think that the patches are correct.

But on the other hand, I am not sure that I like the size of the array.
Leonardo is using 1024 entries for each migration channel. That means up
to 1024 entries * 512 KB per packet = 512 MB of unflushed data in each
channel. I think that this is still too much.

I will need some data from testing, but my understanding of how zero
copy works is that having around 10 MB in flight in each channel would be
more than enough to saturate the link. And once the data in flight is
smaller, we can just flush when we run out of entries.

My idea here was to work out the size the other way around: add a user
parameter for how much memory is available for mlocking, and just use
memory/channels array entries on each channel. That will:

a - limit the amount of mlocked memory that we need.
    10 MB/channel for 10 channels is 100 MB of mlocked memory, which for
    a guest that has lots of gigabytes of RAM looks reasonable.

b - avoid synchronizing after each write, because I still claim that
    doing a non-asynchronous write on the channel just syncs everything
    (otherwise I can't see how the hardware could even work).

So I guess that the best way to decide this is:
- get a couple of nice 100 Gigabit networking cards
- enable 16 channels or so, so we know that the CPU is not going to be
  the bottleneck
- test with this patch
- remove the last two patches and test with several array sizes
  (10, 20, 30, ...); we will see that after some size, performance does
  not improve at all
- use that value as the default one.

What do you think?

Later, Juan.

PD. Yes, I don't like adding another parameter, so either make it
something you recompile with different values, or we drop the parameter
once we find the right value.
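As a rough illustration of the sizing Juan proposes, the sketch below derives the per-channel slot count from a locked-memory budget; the parameter name zerocopy_mlock_budget and the helper are assumptions for illustration, not anything present in QEMU or in this series.

```c
/*
 * Hypothetical sketch of the sizing proposed above: derive the number of
 * header/packet slots per channel from a user-supplied locked-memory
 * budget.  "zerocopy_mlock_budget" is an illustrative name, not an
 * existing QEMU parameter; the 512 KB packet size comes from the thread.
 */
#include <stdint.h>

#define PACKET_SIZE (512 * 1024)   /* 512 KB of guest pages per packet */

static unsigned int slots_per_channel(uint64_t zerocopy_mlock_budget,
                                      unsigned int channels)
{
    uint64_t per_channel = zerocopy_mlock_budget / channels;
    uint64_t slots = per_channel / PACKET_SIZE;

    /*
     * Keep at least two slots so one header can be refilled while the
     * previous one is still in flight.
     */
    return slots < 2 ? 2 : (unsigned int)slots;
}
```

With Juan's numbers (100 MB of mlocked memory across 10 channels), this would give 20 slots of 512 KB per channel, compared to the 1024 currently hard-coded.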
On Tue, 2022-10-25 at 12:07 +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
> > > Zero-copy multifd migration sends both the header and the memory pages in a
> > > single syscall. Since it's necessary to flush before reusing the header, a
> > > header array was implemented, so each write call uses a different array
> > > entry, and flushing only takes place after all headers have been used,
> > > meaning 1 flush for each N writes.
> > >
> > > This method has a bottleneck, though: after the last write, a flush will
> > > have to wait for all writes to finish, which can be a lot of them, meaning
> > > the recvmsg() syscall called in qio_channel_socket_flush() will be called
> > > many times. On top of that, it creates a period when the I/O queue is
> > > empty and nothing is getting sent: between the flush and the next write.
> > >
> > > To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
> > > until at most half of the array is still in use (i.e. the LRU half of the
> > > array can be reused).
> > >
> > > Flushing the LRU half of the array is much faster, since it does not have
> > > to wait for the most recent writes to finish, making up for having to
> > > flush twice per array.
> > >
> > > As a main benefit, this approach keeps the I/O queue from being empty while
> > > there is still data to be sent, making it easier to sustain maximum I/O
> > > throughput while consuming less CPU time.
> >
> > Doesn't this defeat the reason for adding the flush in the first
> > place, which was to ensure that a migration iteration was fully
> > sent before starting the next iteration over RAM? If it is OK to
> > only partially flush on each iteration, then why do we need to
> > flush at all?
>
> Now we need to do the flush in two places:
> - on sync_main (the old place)
> - on the migration thread, when we run out of array entries.
>   This one has nothing to do with correctness, it is just that we have
>   more space than needed.

That's correct! In fact, sync_main (the old place) will call flush with
max_remaining = 0, meaning it will only return when there is nothing else
pending.

> So, in this regard, I think that the patches are correct.
>
> But on the other hand, I am not sure that I like the size of the array.
> Leonardo is using 1024 entries for each migration channel. That means up
> to 1024 entries * 512 KB per packet = 512 MB of unflushed data in each
> channel. I think that this is still too much.

The size is correct, meaning we need to allow 512 MB of locked memory per
multifd channel.

> I will need some data from testing, but my understanding of how zero
> copy works is that having around 10 MB in flight in each channel would be
> more than enough to saturate the link. And once the data in flight is
> smaller, we can just flush when we run out of entries.
>
> My idea here was to work out the size the other way around: add a user
> parameter for how much memory is available for mlocking, and just use
> memory/channels array entries on each channel. That will:
>
> a - limit the amount of mlocked memory that we need.
>     10 MB/channel for 10 channels is 100 MB of mlocked memory, which for
>     a guest that has lots of gigabytes of RAM looks reasonable.
>
> b - avoid synchronizing after each write, because I still claim that
>     doing a non-asynchronous write on the channel just syncs everything
>     (otherwise I can't see how the hardware could even work).

So the user sets the locked memory available and we split it between the
number of channels during migration. Seems a good strategy, since it will
make good use of the hardware regardless of the channel count.

> So I guess that the best way to decide this is:
> - get a couple of nice 100 Gigabit networking cards
> - enable 16 channels or so, so we know that the CPU is not going to be
>   the bottleneck
> - test with this patch
> - remove the last two patches and test with several array sizes
>   (10, 20, 30, ...); we will see that after some size, performance does
>   not improve at all
> - use that value as the default one.
>
> What do you think?

Most of it is good. I only disagree on removing the last two patches. I
did some tests with smaller array sizes, and without the last two patches
it would totally destroy performance. I blame the fact that once every N
writes it would have to wait for the queue to be completely empty before
adding more data for sending. Also, waiting for the completion of a send
that was just scheduled takes multiple recvmsg() syscalls, raising CPU
usage.

Other than that, it seems a good strategy.

Thank you both for reviewing!

Best regards,
Leo

> Later, Juan.
>
> PD. Yes, I don't like adding another parameter, so either make it
> something you recompile with different values, or we drop the parameter
> once we find the right value.
```diff
diff --git a/migration/multifd.c b/migration/multifd.c
index c5d1f911a4..fe9df460f6 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -569,12 +569,13 @@ void multifd_save_cleanup(void)
     multifd_send_state = NULL;
 }
 
-static int multifd_zero_copy_flush(QIOChannel *c)
+static int multifd_zero_copy_flush(QIOChannel *c,
+                                   int max_remaining)
 {
     int ret;
     Error *err = NULL;
 
-    ret = qio_channel_flush(c, 0, &err);
+    ret = qio_channel_flush(c, max_remaining, &err);
     if (ret < 0) {
         error_report_err(err);
         return -1;
@@ -636,7 +637,7 @@ int multifd_send_sync_main(QEMUFile *f)
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
 
-        if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c) < 0)) {
+        if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c, 0) < 0)) {
             return -1;
         }
     }
@@ -719,12 +720,17 @@ static void *multifd_send_thread(void *opaque)
 
             if (use_zero_copy_send) {
                 p->packet_idx = (p->packet_idx + 1) % HEADER_ARR_SZ;
-
-                if (!p->packet_idx && (multifd_zero_copy_flush(p->c) < 0)) {
+                /*
+                 * When half the array have been used, flush to make sure the
+                 * next half is available
+                 */
+                if (!(p->packet_idx % (HEADER_ARR_SZ / 2)) &&
+                    (multifd_zero_copy_flush(p->c, HEADER_ARR_SZ / 2) < 0)) {
                     break;
                 }
                 header = (void *)p->packet + p->packet_idx * p->packet_len;
             }
+
             qemu_mutex_lock(&p->mutex);
             p->pending_job--;
             qemu_mutex_unlock(&p->mutex);
```
Zero-copy multifd migration sends both the header and the memory pages in a
single syscall. Since it's necessary to flush before reusing the header, a
header array was implemented, so each write call uses a different array
entry, and flushing only takes place after all headers have been used,
meaning 1 flush for each N writes.

This method has a bottleneck, though: after the last write, a flush will
have to wait for all writes to finish, which can be a lot of them, meaning
the recvmsg() syscall called in qio_channel_socket_flush() will be called
many times. On top of that, it creates a period when the I/O queue is empty
and nothing is getting sent: between the flush and the next write.

To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
until at most half of the array is still in use (i.e. the LRU half of the
array can be reused).

Flushing the LRU half of the array is much faster, since it does not have
to wait for the most recent writes to finish, making up for having to flush
twice per array.

As a main benefit, this approach keeps the I/O queue from being empty while
there is still data to be sent, making it easier to sustain maximum I/O
throughput while consuming less CPU time.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)
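For background on why a partial flush returns sooner than a full one: with MSG_ZEROCOPY the kernel reports send completions on the socket's error queue, and qio_channel_socket_flush() reads them with recvmsg(). The standalone sketch below, which is not QEMU's implementation, drains those notifications until at most max_pending sends remain unacknowledged; the sent/acked counters are assumptions of the sketch.

```c
/*
 * Standalone sketch (not QEMU's qio_channel_socket_flush) of draining
 * MSG_ZEROCOPY completions from a socket's error queue until at most
 * @max_pending sends remain unacknowledged.  @sent is assumed to count
 * MSG_ZEROCOPY sendmsg() calls issued so far; the kernel acknowledges
 * sends in ranges [ee_info, ee_data].
 */
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

static int zerocopy_flush(int sockfd, uint32_t sent, uint32_t *acked,
                          uint32_t max_pending)
{
    while (sent - *acked > max_pending) {
        struct msghdr msg = { 0 };
        char control[CMSG_SPACE(sizeof(struct sock_extended_err)) +
                     CMSG_SPACE(sizeof(struct sockaddr_in))];

        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        /* Completion notifications live on the error queue, not the data stream. */
        if (recvmsg(sockfd, &msg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN || errno == EINTR) {
                continue;   /* nothing reported yet; a real impl would poll() */
            }
            return -1;
        }

        for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm != NULL;
             cm = CMSG_NXTHDR(&msg, cm)) {
            struct sock_extended_err serr;

            if (cm->cmsg_len < CMSG_LEN(sizeof(serr))) {
                continue;
            }
            memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
            if (serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                /* One notification covers the send range [ee_info, ee_data]. */
                *acked += serr.ee_data - serr.ee_info + 1;
            }
        }
    }
    return 0;
}
```

With max_pending = 0 (as multifd_send_sync_main uses), the loop only returns once every send has been acknowledged; with max_pending = HEADER_ARR_SZ / 2 it returns as soon as the older half of the writes has completed.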