
io: Set unix socket buffers on macOS

Message ID 20250418142436.6121-1-nirsof@gmail.com (mailing list archive)
State New
Series: io: Set unix socket buffers on macOS

Commit Message

Nir Soffer April 18, 2025, 2:24 p.m. UTC
Testing with qemu-nbd shows that computing a hash of an image via
qemu-nbd is 5-7 times faster with this change.

Tested with 2 qemu-nbd processes:

    $ ./qemu-nbd-after -r -t -e 0 -f raw -k /tmp/after.sock /var/tmp/bench/data-10g.img &
    $ ./qemu-nbd-before -r -t -e 0 -f raw -k /tmp/before.sock /var/tmp/bench/data-10g.img &

With nbdcopy, using 4 NBD connections:

    $ hyperfine -w 3 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:"
                     "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:"
    Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
      Time (mean ± σ):      8.670 s ±  0.025 s    [User: 5.670 s, System: 7.113 s]
      Range (min … max):    8.620 s …  8.703 s    10 runs

    Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:
      Time (mean ± σ):      1.839 s ±  0.008 s    [User: 4.651 s, System: 1.882 s]
      Range (min … max):    1.830 s …  1.853 s    10 runs

    Summary
      ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null: ran
        4.72 ± 0.02 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:

With blksum, using one NBD connection:

    $ hyperfine -w 3 "blksum 'nbd+unix:///?socket=/tmp/before.sock'" \
                     "blksum 'nbd+unix:///?socket=/tmp/after.sock'"
    Benchmark 1: blksum 'nbd+unix:///?socket=/tmp/before.sock'
      Time (mean ± σ):     13.606 s ±  0.081 s    [User: 5.799 s, System: 6.231 s]
      Range (min … max):   13.516 s … 13.785 s    10 runs

    Benchmark 2: blksum 'nbd+unix:///?socket=/tmp/after.sock'
      Time (mean ± σ):      1.946 s ±  0.017 s    [User: 4.541 s, System: 1.481 s]
      Range (min … max):    1.912 s …  1.979 s    10 runs

    Summary
      blksum 'nbd+unix:///?socket=/tmp/after.sock' ran
        6.99 ± 0.07 times faster than blksum 'nbd+unix:///?socket=/tmp/before.sock'

This will also improve other uses of unix domain sockets on macOS, but I
tested only qemu-nbd.

Signed-off-by: Nir Soffer <nirsof@gmail.com>
---
 io/channel-socket.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Philippe Mathieu-Daudé April 18, 2025, 2:50 p.m. UTC | #1
Hi Nir,

On 18/4/25 16:24, Nir Soffer wrote:
> Testing with qemu-nbd shows that computing a hash of an image via
> qemu-nbd is 5-7 times faster with this change.
> 
> Tested with 2 qemu-nbd processes:
> 
>      $ ./qemu-nbd-after -r -t -e 0 -f raw -k /tmp/after.sock /var/tmp/bench/data-10g.img &
>      $ ./qemu-nbd-before -r -t -e 0 -f raw -k /tmp/before.sock /var/tmp/bench/data-10g.img &
> 
> With nbdcopy, using 4 NBD connections:
> 
>      $ hyperfine -w 3 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:"
>                       "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:"
>      Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
>        Time (mean ± σ):      8.670 s ±  0.025 s    [User: 5.670 s, System: 7.113 s]
>        Range (min … max):    8.620 s …  8.703 s    10 runs
> 
>      Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:
>        Time (mean ± σ):      1.839 s ±  0.008 s    [User: 4.651 s, System: 1.882 s]
>        Range (min … max):    1.830 s …  1.853 s    10 runs
> 
>      Summary
>        ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null: ran
>          4.72 ± 0.02 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
> 
> With blksum, using one NBD connection:
> 
>      $ hyperfine -w 3 "blksum 'nbd+unix:///?socket=/tmp/before.sock'" \
>                       "blksum 'nbd+unix:///?socket=/tmp/after.sock'"
>      Benchmark 1: blksum 'nbd+unix:///?socket=/tmp/before.sock'
>        Time (mean ± σ):     13.606 s ±  0.081 s    [User: 5.799 s, System: 6.231 s]
>        Range (min … max):   13.516 s … 13.785 s    10 runs
> 
>      Benchmark 2: blksum 'nbd+unix:///?socket=/tmp/after.sock'
>        Time (mean ± σ):      1.946 s ±  0.017 s    [User: 4.541 s, System: 1.481 s]
>        Range (min … max):    1.912 s …  1.979 s    10 runs
> 
>      Summary
>        blksum 'nbd+unix:///?socket=/tmp/after.sock' ran
>          6.99 ± 0.07 times faster than blksum 'nbd+unix:///?socket=/tmp/before.sock'
> 
> This will improve other usage of unix domain sockets on macOS, I tested
> only qemu-nbd.
> 
> Signed-off-by: Nir Soffer <nirsof@gmail.com>
> ---
>   io/channel-socket.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index 608bcf066e..b858659764 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -410,6 +410,19 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>       }
>   #endif /* WIN32 */
>   
> +#if __APPLE__
> +    /* On macOS we need to tune unix domain socket buffer for best performance.
> +     * Apple recommends sizing the receive buffer at 4 times the size of the
> +     * send buffer.
> +     */
> +    if (cioc->localAddr.ss_family == AF_UNIX) {
> +        const int sndbuf_size = 1024 * 1024;

Please add a definition instead of magic value, i.e.:

   #define SOCKET_SEND_BUFSIZE  (1 * MiB)

BTW in test_io_channel_set_socket_bufs() we use 64 KiB, why 1 MiB?

> +        const int rcvbuf_size = 4 * sndbuf_size;
> +        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
> +        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
> +    }
> +#endif /* __APPLE__ */

Thanks,

Phil.
Eric Blake April 18, 2025, 6:55 p.m. UTC | #2
On Fri, Apr 18, 2025 at 05:24:36PM +0300, Nir Soffer wrote:
> Testing with qemu-nbd shows that computing a hash of an image via
> qemu-nbd is 5-7 times faster with this change.
> 

> +++ b/io/channel-socket.c
> @@ -410,6 +410,19 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>      }
>  #endif /* WIN32 */
>  
> +#if __APPLE__
> +    /* On macOS we need to tune unix domain socket buffer for best performance.
> +     * Apple recommends sizing the receive buffer at 4 times the size of the
> +     * send buffer.
> +     */
> +    if (cioc->localAddr.ss_family == AF_UNIX) {
> +        const int sndbuf_size = 1024 * 1024;
> +        const int rcvbuf_size = 4 * sndbuf_size;
> +        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
> +        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
> +    }
> +#endif /* __APPLE__ */

Why does this have to be limited?  On linux, 'man 7 unix' documents
that SO_SNDBUF is honored (SO_RCVBUF is silently ignored but accepted
for compatibility).  On the other hand, 'man 7 socket' states that it
defaults to the value in /proc/sys/net/core/wmem_default (212992 on my
machine) and cannot exceed the value in /proc/sys/net/core/wmem_max
without CAP_NET_ADMIN privileges (also 212992 on my machine).

Of course, Linux and MacOS are different kernels, so your effort to
set it to 1M may actually be working on Apple rather than being
silently cut back to the enforced maximum.  And the fact that raising
it at all makes a difference merely says that unlike Linux (where the
default appears to already be as large as possible), Apple is set up
to default to a smaller buffer (more fragmentation requires more
time), and bumping to the larger value improves performance.  But can
you use getsockopt() prior to your setsockopt() to see what value
Apple was defaulting to, and then again afterwards to see whether it
actually got as large as you suggested?
Nir Soffer April 18, 2025, 7:55 p.m. UTC | #3
> On 18 Apr 2025, at 17:50, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
> 
> Hi Nir,
> 
> On 18/4/25 16:24, Nir Soffer wrote:
>> Testing with qemu-nbd shows that computing a hash of an image via
>> qemu-nbd is 5-7 times faster with this change.
>> Tested with 2 qemu-nbd processes:
>>     $ ./qemu-nbd-after -r -t -e 0 -f raw -k /tmp/after.sock /var/tmp/bench/data-10g.img &
>>     $ ./qemu-nbd-before -r -t -e 0 -f raw -k /tmp/before.sock /var/tmp/bench/data-10g.img &
>> With nbdcopy, using 4 NBD connections:
>>     $ hyperfine -w 3 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:"
>>                      "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:"
>>     Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
>>       Time (mean ± σ):      8.670 s ±  0.025 s    [User: 5.670 s, System: 7.113 s]
>>       Range (min … max):    8.620 s …  8.703 s    10 runs
>>     Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:
>>       Time (mean ± σ):      1.839 s ±  0.008 s    [User: 4.651 s, System: 1.882 s]
>>       Range (min … max):    1.830 s …  1.853 s    10 runs
>>     Summary
>>       ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null: ran
>>         4.72 ± 0.02 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
>> With blksum, using one NBD connection:
>>     $ hyperfine -w 3 "blksum 'nbd+unix:///?socket=/tmp/before.sock'" \
>>                      "blksum 'nbd+unix:///?socket=/tmp/after.sock'"
>>     Benchmark 1: blksum 'nbd+unix:///?socket=/tmp/before.sock'
>>       Time (mean ± σ):     13.606 s ±  0.081 s    [User: 5.799 s, System: 6.231 s]
>>       Range (min … max):   13.516 s … 13.785 s    10 runs
>>     Benchmark 2: blksum 'nbd+unix:///?socket=/tmp/after.sock'
>>       Time (mean ± σ):      1.946 s ±  0.017 s    [User: 4.541 s, System: 1.481 s]
>>       Range (min … max):    1.912 s …  1.979 s    10 runs
>>     Summary
>>       blksum 'nbd+unix:///?socket=/tmp/after.sock' ran
>>         6.99 ± 0.07 times faster than blksum 'nbd+unix:///?socket=/tmp/before.sock'
>> This will improve other usage of unix domain sockets on macOS, I tested
>> only qemu-nbd.
>> Signed-off-by: Nir Soffer <nirsof@gmail.com>
>> ---
>>  io/channel-socket.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>> index 608bcf066e..b858659764 100644
>> --- a/io/channel-socket.c
>> +++ b/io/channel-socket.c
>> @@ -410,6 +410,19 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>>      }
>>  #endif /* WIN32 */
>>  +#if __APPLE__
>> +    /* On macOS we need to tune unix domain socket buffer for best performance.
>> +     * Apple recommends sizing the receive buffer at 4 times the size of the
>> +     * send buffer.
>> +     */
>> +    if (cioc->localAddr.ss_family == AF_UNIX) {
>> +        const int sndbuf_size = 1024 * 1024;
> 
> Please add a definition instead of magic value, i.e.:
> 
>  #define SOCKET_SEND_BUFSIZE  (1 * MiB)

Using 1 * MiB is nicer.

Not sure about the “magic” value; do you mean:

    #define SOCKET_SEND_BUFSIZE  (1 * MiB)

At the top of the file, or near the use:

    const int sndbuf_size = 1 * MiB;

If we want it at the top of the file, the name may be confusing since this is used only for macOS and only for unix sockets.

We can have:

    #define MACOS_UNIX_SOCKET_SEND_BUFSIZE (1 * MiB)

Or maybe:

    #if __APPLE__
    #define UNIX_SOCKET_SEND_BUFSIZE (1 * MiB)
    #endif

But we use this in one function so I’m not sure it helps.
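
For example, a minimal sketch of the last variant applied to the current hunk, assuming the MiB macro from qemu/units.h is available in io/channel-socket.c:

    /* Near the top of io/channel-socket.c */
    #if __APPLE__
    /* Apple recommends a receive buffer 4 times the size of the send buffer. */
    #define UNIX_SOCKET_SEND_BUFSIZE (1 * MiB)
    #define UNIX_SOCKET_RECV_BUFSIZE (4 * UNIX_SOCKET_SEND_BUFSIZE)
    #endif

    /* In qio_channel_socket_accept() */
    #if __APPLE__
    if (cioc->localAddr.ss_family == AF_UNIX) {
        const int sndbuf_size = UNIX_SOCKET_SEND_BUFSIZE;
        const int rcvbuf_size = UNIX_SOCKET_RECV_BUFSIZE;
        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
    }
    #endif /* __APPLE__ */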

In vmnet-helper I’m using this in 2 places so it moved to config.h.
https://github.com/nirs/vmnet-helper/blob/main/config.h.in

> 
> BTW in test_io_channel_set_socket_bufs() we use 64 KiB, why 1 MiB?

This test uses a small buffer size so we can see the effect of partial reads/writes. I’m trying to improve throughput when reading image data with qemu-nbd. This will likely also improve qemu-storage-daemon and the qemu built-in NBD server, but I did not test them.

I did some benchmarks with send buffer sizes from 64k to 2m, and they show that 1m gives the best performance.

Running one qemu-nbd process with each configuration:

% ps
...
18850 ttys013    2:01.78 ./qemu-nbd-64k -r -t -e 0 -f raw -k /tmp/64k.sock /Users/nir/bench/data-10g.img
18871 ttys013    1:53.49 ./qemu-nbd-128k -r -t -e 0 -f raw -k /tmp/128k.sock /Users/nir/bench/data-10g.img
18877 ttys013    1:47.95 ./qemu-nbd-256k -r -t -e 0 -f raw -k /tmp/256k.sock /Users/nir/bench/data-10g.img
18885 ttys013    1:52.06 ./qemu-nbd-512k -r -t -e 0 -f raw -k /tmp/512k.sock /Users/nir/bench/data-10g.img
18894 ttys013    2:02.34 ./qemu-nbd-1m -r -t -e 0 -f raw -k /tmp/1m.sock /Users/nir/bench/data-10g.img
22918 ttys013    0:00.02 ./qemu-nbd-2m -r -t -e 0 -f raw -k /tmp/2m.sock /Users/nir/bench/data-10g.img

% hyperfine -w 3 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:"
Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:
  Time (mean ± σ):      2.760 s ±  0.014 s    [User: 4.871 s, System: 2.576 s]
  Range (min … max):    2.736 s …  2.788 s    10 runs

Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:
  Time (mean ± σ):      2.284 s ±  0.006 s    [User: 4.774 s, System: 2.044 s]
  Range (min … max):    2.275 s …  2.294 s    10 runs

Benchmark 3: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:
  Time (mean ± σ):      2.036 s ±  0.010 s    [User: 4.734 s, System: 1.822 s]
  Range (min … max):    2.021 s …  2.052 s    10 runs

Benchmark 4: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:
  Time (mean ± σ):      1.763 s ±  0.005 s    [User: 4.637 s, System: 1.801 s]
  Range (min … max):    1.755 s …  1.771 s    10 runs

Benchmark 5: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null:
  Time (mean ± σ):      1.653 s ±  0.012 s    [User: 4.568 s, System: 1.818 s]
  Range (min … max):    1.636 s …  1.683 s    10 runs

Benchmark 6: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:
  Time (mean ± σ):      1.802 s ±  0.052 s    [User: 4.573 s, System: 1.918 s]
  Range (min … max):    1.736 s …  1.896 s    10 runs

Summary
  ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null: ran
    1.07 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:
    1.09 ± 0.03 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:
    1.23 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:
    1.38 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:
    1.67 ± 0.02 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:

I can add a compact table showing the results in a comment, or add the test output to the commit message for reference.
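
For example, a compact summary of the runs above (send buffer size vs. mean time over 10 runs, derived from the output):

    send buffer   mean time
    64k           2.760 s ± 0.014 s
    128k          2.284 s ± 0.006 s
    256k          2.036 s ± 0.010 s
    512k          1.763 s ± 0.005 s
    1m            1.653 s ± 0.012 s
    2m            1.802 s ± 0.052 s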

> 
>> +        const int rcvbuf_size = 4 * sndbuf_size;
>> +        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
>> +        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
>> +    }
>> +#endif /* __APPLE__ */
> 
> Thanks,
> 
> Phil.
Nir Soffer April 18, 2025, 8:02 p.m. UTC | #4
This should also be changed on the client side.

The libnbd part is here:
https://gitlab.com/nbdkit/libnbd/-/merge_requests/21

We may also want to change the NBD client code used in qemu-img. I can look at this later.
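
As a rough sketch, the connecting side would mirror the accept side, assuming the tuning lives in the connect path of io/channel-socket.c and that the fd and localAddr fields are set up the same way there as in the accept path (placement not verified):

    #if __APPLE__
    /* Mirror the accept() side: tune unix socket buffers after connecting. */
    if (ioc->localAddr.ss_family == AF_UNIX) {
        const int sndbuf_size = 1024 * 1024;
        const int rcvbuf_size = 4 * sndbuf_size;
        setsockopt(ioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
        setsockopt(ioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
    }
    #endif /* __APPLE__ */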


> On 18 Apr 2025, at 17:24, Nir Soffer <nirsof@gmail.com> wrote:
> 
> Testing with qemu-nbd shows that computing a hash of an image via
> qemu-nbd is 5-7 times faster with this change.
> 
> Tested with 2 qemu-nbd processes:
> 
>    $ ./qemu-nbd-after -r -t -e 0 -f raw -k /tmp/after.sock /var/tmp/bench/data-10g.img &
>    $ ./qemu-nbd-before -r -t -e 0 -f raw -k /tmp/before.sock /var/tmp/bench/data-10g.img &
> 
> With nbdcopy, using 4 NBD connections:
> 
>    $ hyperfine -w 3 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:"
>                     "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:"
>    Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
>      Time (mean ± σ):      8.670 s ±  0.025 s    [User: 5.670 s, System: 7.113 s]
>      Range (min … max):    8.620 s …  8.703 s    10 runs
> 
>    Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:
>      Time (mean ± σ):      1.839 s ±  0.008 s    [User: 4.651 s, System: 1.882 s]
>      Range (min … max):    1.830 s …  1.853 s    10 runs
> 
>    Summary
>      ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null: ran
>        4.72 ± 0.02 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
> 
> With blksum, using one NBD connection:
> 
>    $ hyperfine -w 3 "blksum 'nbd+unix:///?socket=/tmp/before.sock'" \
>                     "blksum 'nbd+unix:///?socket=/tmp/after.sock'"
>    Benchmark 1: blksum 'nbd+unix:///?socket=/tmp/before.sock'
>      Time (mean ± σ):     13.606 s ±  0.081 s    [User: 5.799 s, System: 6.231 s]
>      Range (min … max):   13.516 s … 13.785 s    10 runs
> 
>    Benchmark 2: blksum 'nbd+unix:///?socket=/tmp/after.sock'
>      Time (mean ± σ):      1.946 s ±  0.017 s    [User: 4.541 s, System: 1.481 s]
>      Range (min … max):    1.912 s …  1.979 s    10 runs
> 
>    Summary
>      blksum 'nbd+unix:///?socket=/tmp/after.sock' ran
>        6.99 ± 0.07 times faster than blksum 'nbd+unix:///?socket=/tmp/before.sock'
> 
> This will improve other usage of unix domain sockets on macOS, I tested
> only qemu-nbd.
> 
> Signed-off-by: Nir Soffer <nirsof@gmail.com>
> ---
> io/channel-socket.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
> 
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index 608bcf066e..b858659764 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -410,6 +410,19 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>     }
> #endif /* WIN32 */
> 
> +#if __APPLE__
> +    /* On macOS we need to tune unix domain socket buffer for best performance.
> +     * Apple recommends sizing the receive buffer at 4 times the size of the
> +     * send buffer.
> +     */
> +    if (cioc->localAddr.ss_family == AF_UNIX) {
> +        const int sndbuf_size = 1024 * 1024;
> +        const int rcvbuf_size = 4 * sndbuf_size;
> +        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
> +        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
> +    }
> +#endif /* __APPLE__ */
> +
>     qio_channel_set_feature(QIO_CHANNEL(cioc),
>                             QIO_CHANNEL_FEATURE_READ_MSG_PEEK);
> 
> -- 
> 2.39.5 (Apple Git-154)
>
Nir Soffer April 18, 2025, 8:29 p.m. UTC | #5
> On 18 Apr 2025, at 21:55, Eric Blake <eblake@redhat.com> wrote:
> 
> On Fri, Apr 18, 2025 at 05:24:36PM +0300, Nir Soffer wrote:
>> Testing with qemu-nbd shows that computing a hash of an image via
>> qemu-nbd is 5-7 times faster with this change.
>> 
> 
>> +++ b/io/channel-socket.c
>> @@ -410,6 +410,19 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>>     }
>> #endif /* WIN32 */
>> 
>> +#if __APPLE__
>> +    /* On macOS we need to tune unix domain socket buffer for best performance.
>> +     * Apple recommends sizing the receive buffer at 4 times the size of the
>> +     * send buffer.
>> +     */
>> +    if (cioc->localAddr.ss_family == AF_UNIX) {
>> +        const int sndbuf_size = 1024 * 1024;
>> +        const int rcvbuf_size = 4 * sndbuf_size;
>> +        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
>> +        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
>> +    }
>> +#endif /* __APPLE__ */
> 
> Why does this have to be limited?  On linux, 'man 7 unix' documents
> that SO_SNDBUF is honored (SO_RCVBUF is silently ignored but accepted
> for compatibility).  On the other hand, 'man 7 socket' states that it
> defaults to the value in /proc/sys/net/core/wmem_default (212992 on my
> machine) and cannot exceed the value in /proc/sys/net/core/wmem_max
> without CAP_NET_ADMIN privileges (also 212992 on my machine).
> 
> Of course, Linux and MacOS are different kernels, so your effort to
> set it to 1M may actually be working on Apple rather than being
> silently cut back to the enforced maximum.

Testing with values up to a 2m send buffer and an 8m receive buffer shows that changing the values changes the performance, so they are not silently clipped.

> And the fact that raising
> it at all makes a difference merely says that unlike Linux (where the
> default appears to already be as large as possible), Apple is set up
> to default to a smaller buffer (more fragmentation requires more
> time), and bumping to the larger value improves performance.  But can
> you use getsockopt() prior to your setsockopt() to see what value
> Apple was defaulting to, and then again afterwards to see whether it
> actually got as large as you suggested?

Sure, tested with:

diff --git a/io/channel-socket.c b/io/channel-socket.c
index b858659764..9600a076be 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -418,8 +418,21 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
     if (cioc->localAddr.ss_family == AF_UNIX) {
         const int sndbuf_size = 1024 * 1024;
         const int rcvbuf_size = 4 * sndbuf_size;
+        int value;
+        socklen_t value_size = sizeof(value);
+
+        getsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &value, &value_size);
+        fprintf(stderr, "before: send buffer size: %d\n", value);
+        getsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &value, &value_size);
+        fprintf(stderr, "before: recv buffer size: %d\n", value);
+
         setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
         setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
+
+        getsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &value, &value_size);
+        fprintf(stderr, "after: send buffer size: %d\n", value);
+        getsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &value, &value_size);
+        fprintf(stderr, "after: recv buffer size: %d\n", value);
     }
 #endif /* __APPLE__ */
 
With 1m send buffer:

% ./qemu-nbd -r -t -e 0 -f raw -k /tmp/nbd.sock ~/bench/data-10g.img
before: send buffer size: 8192
before: recv buffer size: 8192
after: send buffer size: 1048576
after: recv buffer size: 4194304

With 2m send buffer:

% ./qemu-nbd -r -t -e 0 -f raw -k /tmp/nbd.sock ~/bench/data-10g.img
before: send buffer size: 8192
before: recv buffer size: 8192
after: send buffer size: 2097152
after: recv buffer size: 8388608

Patch

diff --git a/io/channel-socket.c b/io/channel-socket.c
index 608bcf066e..b858659764 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -410,6 +410,19 @@  qio_channel_socket_accept(QIOChannelSocket *ioc,
     }
 #endif /* WIN32 */
 
+#if __APPLE__
+    /* On macOS we need to tune unix domain socket buffer for best performance.
+     * Apple recommends sizing the receive buffer at 4 times the size of the
+     * send buffer.
+     */
+    if (cioc->localAddr.ss_family == AF_UNIX) {
+        const int sndbuf_size = 1024 * 1024;
+        const int rcvbuf_size = 4 * sndbuf_size;
+        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
+        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
+    }
+#endif /* __APPLE__ */
+
     qio_channel_set_feature(QIO_CHANNEL(cioc),
                             QIO_CHANNEL_FEATURE_READ_MSG_PEEK);