diff mbox series

[2/2] block: file-posix: Replace posix_fallocate with fallocate

Message ID 20200831140127.657134-3-nsoffer@redhat.com (mailing list archive)
State New, archived
Headers show
Series Replace posix_fallocate() with falloate() | expand

Commit Message

Nir Soffer Aug. 31, 2020, 2:01 p.m. UTC
If fallocate() is not supported, posix_fallocate() falls back to
inefficient allocation, writing one byte for every 4k bytes[1]. This is
very slow compared with writing zeros. In oVirt we measured ~400%
improvement in allocation time when replacing posix_fallocate() with
manually writing zeroes[2].

We also know that posix_fallocated() does not work well when using OFD
locks[3]. We don't know the reason yet for this issue yet.

Change preallocate_falloc() to use fallocate() instead of
posix_falloate(), and fall back to full preallocation if not supported.

Here are quick test results with this change.

Before (qemu-img-5.1.0-2.fc32.x86_64):

$ time qemu-img create -f raw -o preallocation=falloc /tmp/nfs3/test.raw 6g
Formatting '/tmp/nfs3/test.raw', fmt=raw size=6442450944 preallocation=falloc

real 0m42.100s
user 0m0.602s
sys 0m4.137s

NFS stats:
calls      retrans    authrefrsh    write
1571583    0          1572205       1571321

After:

$ time ./qemu-img create -f raw -o preallocation=falloc /tmp/nfs3/test.raw 6g
Formatting '/tmp/nfs3/test.raw', fmt=raw size=6442450944 preallocation=falloc

real 0m15.551s
user 0m0.070s
sys 0m2.623s

NFS stats:
calls      retrans    authrefrsh    write
24620      0          24624         24567  

[1] https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
[2] https://bugzilla.redhat.com/1850267#c25
[3] https://bugzilla.redhat.com/1851097

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
---
 block/file-posix.c                     | 32 +++++++++-----------------
 docs/system/qemu-block-drivers.rst.inc | 11 +++++----
 docs/tools/qemu-img.rst                | 11 +++++----
 qapi/block-core.json                   |  4 ++--
 4 files changed, 25 insertions(+), 33 deletions(-)

Comments

Alberto Garcia Sept. 1, 2020, 3:51 p.m. UTC | #1
On Mon 31 Aug 2020 04:01:27 PM CEST, Nir Soffer wrote:
> If fallocate() is not supported, posix_fallocate() falls back to
> inefficient allocation, writing one byte for every 4k bytes[1]. This is
> very slow compared with writing zeros. In oVirt we measured ~400%
> improvement in allocation time when replacing posix_fallocate() with
> manually writing zeroes[2].
>
> We also know that posix_fallocated() does not work well when using OFD
> locks[3]. We don't know the reason yet for this issue yet.
>
> Change preallocate_falloc() to use fallocate() instead of
> posix_falloate(), and fall back to full preallocation if not
> supported.


>      case PREALLOC_MODE_FALLOC:
>          result = preallocate_falloc(fd, current_length, offset, errp);
> -        goto out;
> +        if (result != -ENOTSUP)
> +            goto out;
> +        /* If fallocate() is not supported, fallback to full preallocation. */

With the missing braces in this if statement,

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto
Daniel P. Berrangé Sept. 14, 2020, 5:32 p.m. UTC | #2
On Mon, Aug 31, 2020 at 05:01:27PM +0300, Nir Soffer wrote:
> If fallocate() is not supported, posix_fallocate() falls back to
> inefficient allocation, writing one byte for every 4k bytes[1]. This is
> very slow compared with writing zeros. In oVirt we measured ~400%
> improvement in allocation time when replacing posix_fallocate() with
> manually writing zeroes[2].
> 
> We also know that posix_fallocated() does not work well when using OFD
> locks[3]. We don't know the reason yet for this issue yet.
> 
> Change preallocate_falloc() to use fallocate() instead of
> posix_falloate(), and fall back to full preallocation if not supported.
> 
> Here are quick test results with this change.
> 
> Before (qemu-img-5.1.0-2.fc32.x86_64):
> 
> $ time qemu-img create -f raw -o preallocation=falloc /tmp/nfs3/test.raw 6g
> Formatting '/tmp/nfs3/test.raw', fmt=raw size=6442450944 preallocation=falloc
> 
> real 0m42.100s
> user 0m0.602s
> sys 0m4.137s
> 
> NFS stats:
> calls      retrans    authrefrsh    write
> 1571583    0          1572205       1571321
> 
> After:
> 
> $ time ./qemu-img create -f raw -o preallocation=falloc /tmp/nfs3/test.raw 6g
> Formatting '/tmp/nfs3/test.raw', fmt=raw size=6442450944 preallocation=falloc
> 
> real 0m15.551s
> user 0m0.070s
> sys 0m2.623s
> 
> NFS stats:
> calls      retrans    authrefrsh    write
> 24620      0          24624         24567  
> 
> [1] https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> [2] https://bugzilla.redhat.com/1850267#c25
> [3] https://bugzilla.redhat.com/1851097

This bug appears to be private to RH employees only, so rather than link
to it, please summarize any important facts in it for benefit of nonn-RH
QEMU contributors.

> 
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  block/file-posix.c                     | 32 +++++++++-----------------
>  docs/system/qemu-block-drivers.rst.inc | 11 +++++----
>  docs/tools/qemu-img.rst                | 11 +++++----
>  qapi/block-core.json                   |  4 ++--
>  4 files changed, 25 insertions(+), 33 deletions(-)

Regards,
Daniel
Nir Soffer Sept. 15, 2020, 8:55 a.m. UTC | #3
On Mon, Sep 14, 2020 at 8:32 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Mon, Aug 31, 2020 at 05:01:27PM +0300, Nir Soffer wrote:
> > If fallocate() is not supported, posix_fallocate() falls back to
> > inefficient allocation, writing one byte for every 4k bytes[1]. This is
> > very slow compared with writing zeros. In oVirt we measured ~400%
> > improvement in allocation time when replacing posix_fallocate() with
> > manually writing zeroes[2].
> >
> > We also know that posix_fallocated() does not work well when using OFD
> > locks[3]. We don't know the reason yet for this issue yet.
> >
> > Change preallocate_falloc() to use fallocate() instead of
> > posix_falloate(), and fall back to full preallocation if not supported.
> >
> > Here are quick test results with this change.
> >
> > Before (qemu-img-5.1.0-2.fc32.x86_64):
> >
> > $ time qemu-img create -f raw -o preallocation=falloc /tmp/nfs3/test.raw 6g
> > Formatting '/tmp/nfs3/test.raw', fmt=raw size=6442450944 preallocation=falloc
> >
> > real 0m42.100s
> > user 0m0.602s
> > sys 0m4.137s
> >
> > NFS stats:
> > calls      retrans    authrefrsh    write
> > 1571583    0          1572205       1571321
> >
> > After:
> >
> > $ time ./qemu-img create -f raw -o preallocation=falloc /tmp/nfs3/test.raw 6g
> > Formatting '/tmp/nfs3/test.raw', fmt=raw size=6442450944 preallocation=falloc
> >
> > real 0m15.551s
> > user 0m0.070s
> > sys 0m2.623s
> >
> > NFS stats:
> > calls      retrans    authrefrsh    write
> > 24620      0          24624         24567
> >
> > [1] https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> > [2] https://bugzilla.redhat.com/1850267#c25
> > [3] https://bugzilla.redhat.com/1851097
>
> This bug appears to be private to RH employees only, so rather than link
> to it, please summarize any important facts in it for benefit of nonn-RH
> QEMU contributors.

Thanks, I missed that detail when linking to the bug. The bug is public now.

> > Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> > ---
> >  block/file-posix.c                     | 32 +++++++++-----------------
> >  docs/system/qemu-block-drivers.rst.inc | 11 +++++----
> >  docs/tools/qemu-img.rst                | 11 +++++----
> >  qapi/block-core.json                   |  4 ++--
> >  4 files changed, 25 insertions(+), 33 deletions(-)
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
diff mbox series

Patch

diff --git a/block/file-posix.c b/block/file-posix.c
index 341ffb1cb4..eac3c0b412 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1835,36 +1835,24 @@  static int allocate_first_block(int fd, size_t max_size)
 static int preallocate_falloc(int fd, int64_t current_length, int64_t offset,
                               Error **errp)
 {
-#ifdef CONFIG_POSIX_FALLOCATE
+#ifdef CONFIG_FALLOCATE
     int result;
 
     if (offset == current_length)
         return 0;
 
-    /*
-     * Truncating before posix_fallocate() makes it about twice slower on
-     * file systems that do not support fallocate(), trying to check if a
-     * block is allocated before allocating it, so don't do that here.
-     */
-
-    result = -posix_fallocate(fd, current_length,
-                              offset - current_length);
+    result = do_fallocate(fd, 0, current_length, offset - current_length);
     if (result != 0) {
-        /* posix_fallocate() doesn't set errno. */
-        error_setg_errno(errp, -result,
-                         "Could not preallocate new data");
+        error_setg_errno(errp, -result, "Could not preallocate new data");
         return result;
     }
 
     if (current_length == 0) {
         /*
-         * posix_fallocate() uses fallocate() if the filesystem supports
-         * it, or fallback to manually writing zeroes. If fallocate()
-         * was used, unaligned reads from the fallocated area in
-         * raw_probe_alignment() will succeed, hence we need to allocate
-         * the first block.
+         * Unaligned reads from the fallocated area in raw_probe_alignment()
+         * will succeed, hence we need to allocate the first block.
          *
-         * Optimize future alignment probing; ignore failures.
+         * Optimizes future alignment probing; ignore failures.
          */
         allocate_first_block(fd, offset);
     }
@@ -1973,10 +1961,12 @@  static int handle_aiocb_truncate(void *opaque)
     }
 
     switch (prealloc) {
-#ifdef CONFIG_POSIX_FALLOCATE
+#ifdef CONFIG_FALLOCATE
     case PREALLOC_MODE_FALLOC:
         result = preallocate_falloc(fd, current_length, offset, errp);
-        goto out;
+        if (result != -ENOTSUP)
+            goto out;
+        /* If fallocate() is not supported, fallback to full preallocation. */
 #endif
     case PREALLOC_MODE_FULL:
         result = preallocate_full(fd, current_length, offset, errp);
@@ -3080,7 +3070,7 @@  static QemuOptsList raw_create_opts = {
             .name = BLOCK_OPT_PREALLOC,
             .type = QEMU_OPT_STRING,
             .help = "Preallocation mode (allowed values: off"
-#ifdef CONFIG_POSIX_FALLOCATE
+#ifdef CONFIG_FALLOCATE
                     ", falloc"
 #endif
                     ", full)"
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index b052a6d14e..8e4acf397e 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -25,11 +25,12 @@  This section describes each format and the options that are supported for it.
   .. program:: raw
   .. option:: preallocation
 
-    Preallocation mode (allowed values: ``off``, ``falloc``,
-    ``full``). ``falloc`` mode preallocates space for image by
-    calling ``posix_fallocate()``. ``full`` mode preallocates space
-    for image by writing data to underlying storage. This data may or
-    may not be zero, depending on the storage location.
+    Preallocation mode (allowed values: ``off``, ``falloc``, ``full``).
+    ``falloc`` mode preallocates space for image by calling
+    ``fallocate()``, and falling back to ``full` mode if not supported.
+    ``full`` mode preallocates space for image by writing data to
+    underlying storage. This data may or may not be zero, depending on
+    the storage location.
 
 .. program:: image-formats
 .. option:: qcow2
diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index c35bd64822..a2089bd1b7 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -750,11 +750,12 @@  Supported image file formats:
   Supported options:
 
   ``preallocation``
-    Preallocation mode (allowed values: ``off``, ``falloc``,
-    ``full``).  ``falloc`` mode preallocates space for image by
-    calling ``posix_fallocate()``.  ``full`` mode preallocates space
-    for image by writing data to underlying storage.  This data may or
-    may not be zero, depending on the storage location.
+    Preallocation mode (allowed values: ``off``, ``falloc``, ``full``).
+    ``falloc`` mode preallocates space for image by calling
+    ``fallocate()``, and falling back to ``full` mode if not supported.
+    ``full`` mode preallocates space for image by writing data to
+    underlying storage. This data may or may not be zero, depending on
+    the storage location.
 
 ``qcow2``
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index db08c58d78..681d79ec63 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -5021,8 +5021,8 @@ 
 #
 # @off: no preallocation
 # @metadata: preallocate only for metadata
-# @falloc: like @full preallocation but allocate disk space by
-#          posix_fallocate() rather than writing data.
+# @falloc: try to allocate disk space by fallocate(), and fallback to
+#          @full preallocation if not supported.
 # @full: preallocate all data by writing it to the device to ensure
 #        disk space is really available. This data may or may not be
 #        zero, depending on the image format and storage.