Message ID | 20190811205024.19482-1-nsoffer@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v2] block: posix: Handle undetectable alignment
Am 11.08.2019 um 22:50 hat Nir Soffer geschrieben:
> In some cases buf_align or request_alignment cannot be detected:
>
> - With Gluster, buf_align cannot be detected since the actual I/O is
>   done on Gluster server, and qemu buffer alignment does not matter.

If it doesn't matter, the best value would be buf_align = 1.

> - With local XFS filesystem, buf_align cannot be detected if reading
>   from unallocated area.

Here, we actually do need alignment, but it's unknown whether it would
be 512 or 4096 or something else entirely. Failing to align requests
correctly results in I/O errors.

> - With Gluster backed by XFS, request_alignment cannot be detected if
>   reading from unallocated area.

This is like buf_align for XFS: we don't know the right value, but
getting it wrong causes I/O errors.

> - With NFS, the server does not use direct I/O, so both buf_align and
>   request_alignment cannot be detected.

This suggests that byte-aligned requests are fine for NFS, i.e.
buf_align = request_alignment = 1 would be optimal in this case.

> These cases seem to work when storage sector size is 512 bytes, because
> the current code starts checking align=512. If the check succeeds
> because alignment cannot be detected we use 512. But this does not work
> for storage with 4k sector size.
>
> Practically the alignment requirements are the same for buffer
> alignment, buffer length, and offset in file. So in case we cannot
> detect buf_align, we can use request alignment. If we cannot detect
> request alignment, we can fall back to a safe value.

This makes sense in general.

What the commit message doesn't explain, but probably should, is how we
determine whether we could successfully detect request alignment. The
approach taken here is that a detected alignment of 1 is understood as
failure to detect the real alignment. With cases 2 and 3 this gives the
desired result; however, for cases 1 and 4, an alignment of 1 would be
the actual correct value, and the new probing algorithm results in a
worse result.
However, since the negative effect of the old algorithm in cases 2 and 3
is I/O errors, whereas the effect of the new one in cases 1 and 4 is just
degraded performance for I/O that isn't 4k aligned, the new approach is
still preferable.

I think we need to make this tradeoff clearer in the commit message and
the comment in the code, but the approach is reasonable enough.

> With this change:
>
> - Provisioning VM and copying disks on local XFS and Gluster with 4k
>   sector size works, resolving bugs [1],[2].
>
> - With NFS we fall back to buf_align and request_alignment of 4096
>   instead of 512. This may cause unneeded data copying, but so far I see
>   better performance with this change.
>
> [1] https://bugzilla.redhat.com/1737256
> [2] https://bugzilla.redhat.com/1738657
>
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>
> v1 was a minimal hack; this version is a more generic fix that works for
> any storage without requiring users to allocate the first block of an
> image. Allocating the first block of an image is still a good idea since
> it allows detecting the right alignment in some cases.
>
> v1 could also affect cases when we could detect buf_align to use
> request_alignment instead; v2 will only affect cases when buf_align or
> request_alignment cannot be detected.
>
> v1 was here:
> https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00133.html
>
> block/file-posix.c | 40 +++++++++++++++++++++++++++++-----------
> 1 file changed, 29 insertions(+), 11 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index f33b542b33..511468f166 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -323,6 +323,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>      BDRVRawState *s = bs->opaque;
>      char *buf;
>      size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
> +    size_t alignments[] = {1, 512, 1024, 2048, 4096};
>
>      /* For SCSI generic devices the alignment is not really used.
>         With buffered I/O, we don't have any restrictions. */
> @@ -349,25 +350,42 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>      }
>  #endif
>
> -    /* If we could not get the sizes so far, we can only guess them */
> -    if (!s->buf_align) {
> +    /*
> +     * If we could not get the sizes so far, we can only guess them. First try
> +     * to detect request alignment, since it is more likely to succeed. Then
> +     * try to detect buf_align, which cannot be detected in some cases (e.g.
> +     * Gluster). If buf_align cannot be detected, we fallback to the value of
> +     * request_alignment.
> +     */
> +
> +    if (!bs->bl.request_alignment) {
> +        int i;
>          size_t align;
> -        buf = qemu_memalign(max_align, 2 * max_align);
> -        for (align = 512; align <= max_align; align <<= 1) {
> -            if (raw_is_io_aligned(fd, buf + align, max_align)) {
> -                s->buf_align = align;
> +        buf = qemu_memalign(max_align, max_align);
> +        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
> +            align = alignments[i];
> +            if (raw_is_io_aligned(fd, buf, align)) {
> +                /* Fallback to safe value. */
> +                bs->bl.request_alignment = (align != 1) ? align : max_align;
>                  break;
>              }
>          }
>          qemu_vfree(buf);
>      }
>
> -    if (!bs->bl.request_alignment) {
> +    if (!s->buf_align) {
> +        int i;
>          size_t align;
> -        buf = qemu_memalign(s->buf_align, max_align);
> -        for (align = 512; align <= max_align; align <<= 1) {
> -            if (raw_is_io_aligned(fd, buf, align)) {
> -                bs->bl.request_alignment = align;
> +        buf = qemu_memalign(max_align, 2 * max_align);
> +        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
> +            align = alignments[i];
> +            if (raw_is_io_aligned(fd, buf + align, max_align)) {
> +                /* Fallback to request_alignment or safe value. */
> +                s->buf_align = (align != 1)
> +                    ? align
> +                    : (bs->bl.request_alignment != 0)
> +                        ? bs->bl.request_alignment
> +                        : max_align;

Nested ternary operators over five lines aren't that readable any more.
I'd suggest using at least one more if:

    if (align != 1) {
        s->buf_align = align;
    } else {
        s->buf_align = bs->bl.request_alignment ?: max_align;
    }

In fact, is checking bs->bl.request_alignment for 0 even worth it here?
If it's 0, we failed to find any working configuration and will return
an error anyway. Then it doesn't matter if s->buf_align becomes 0, too.

Kevin
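The probing order the patch establishes (request_alignment first, align == 1 treated as detection failure, fall back to max_align) can be sketched outside of qemu as a small standalone function. This is a hypothetical illustration, not qemu code: `raw_is_io_aligned()` is replaced by a caller-supplied predicate so the selection logic can be exercised without O_DIRECT storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for qemu's raw_is_io_aligned(): does a direct I/O
 * request with this length/alignment succeed on the storage? */
typedef bool (*io_aligned_fn)(size_t align);

/*
 * Probe the candidate alignments smallest-first, as the patch does. A probe
 * that already succeeds at align == 1 means the storage accepted an
 * unaligned request, i.e. detection failed, so fall back to the safe
 * max_align value instead.
 */
static size_t probe_request_alignment(io_aligned_fn is_io_aligned,
                                      size_t max_align)
{
    static const size_t alignments[] = {1, 512, 1024, 2048, 4096};

    for (size_t i = 0; i < sizeof(alignments) / sizeof(alignments[0]); i++) {
        size_t align = alignments[i];
        if (is_io_aligned(align)) {
            return align != 1 ? align : max_align;
        }
    }
    return 0; /* nothing worked; the caller reports an error */
}

/* Simulated 4k storage: only multiples of 4096 are accepted (cases 2/3). */
static bool sim_4k(size_t align) { return align % 4096 == 0; }

/* Simulated buffered backend, e.g. an NFS server without direct I/O:
 * every request succeeds, so nothing can be detected (cases 1/4). */
static bool sim_buffered(size_t align) { (void)align; return true; }
```

With the 4k simulation the probe returns 4096 (the I/O-error cases are fixed); with the buffered simulation it returns max_align, the tradeoff Kevin describes: safe but potentially slower than the optimal 1.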
On Mon, Aug 12, 2019 at 5:23 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 11.08.2019 um 22:50 hat Nir Soffer geschrieben:
> > In some cases buf_align or request_alignment cannot be detected:
> >
> > - With Gluster, buf_align cannot be detected since the actual I/O is
> >   done on Gluster server, and qemu buffer alignment does not matter.
>
> If it doesn't matter, the best value would be buf_align = 1.

Right, if we know that this is gluster.

> > - With local XFS filesystem, buf_align cannot be detected if reading
> >   from unallocated area.
>
> Here, we actually do need alignment, but it's unknown whether it would
> be 512 or 4096 or something else entirely. Failing to align requests
> correctly results in I/O errors.
>
> > - With Gluster backed by XFS, request_alignment cannot be detected if
> >   reading from unallocated area.
>
> This is like buf_align for XFS: we don't know the right value, but
> getting it wrong causes I/O errors.
>
> > - With NFS, the server does not use direct I/O, so both buf_align and
> >   request_alignment cannot be detected.
>
> This suggests that byte-aligned requests are fine for NFS, i.e.
> buf_align = request_alignment = 1 would be optimal in this case.

Right, but again we don't know this is NFS.

> > These cases seem to work when storage sector size is 512 bytes, because
> > the current code starts checking align=512. If the check succeeds
> > because alignment cannot be detected we use 512. But this does not work
> > for storage with 4k sector size.
> >
> > Practically the alignment requirements are the same for buffer
> > alignment, buffer length, and offset in file. So in case we cannot
> > detect buf_align, we can use request alignment. If we cannot detect
> > request alignment, we can fall back to a safe value.
>
> This makes sense in general.
>
> What the commit message doesn't explain, but probably should, is how
> we determine whether we could successfully detect request alignment.
> The approach taken here is that a detected alignment of 1 is understood
> as failure to detect the real alignment.

Failing with EINVAL when using 1, and succeeding with another value, is
considered a successful detection.

We have 3 issues preventing detection:
- filesystem not using direct I/O on the remote server (NFS, Gluster
  when network.remote-dio=on)
- area probed is unallocated with XFS or Gluster backed by XFS
- filesystem without buffer alignment requirement (e.g. Gluster)

For handling unallocated areas, we can:
- always allocate the first block when creating an image (qemu-img
  create/convert)
- use write() instead of read().

In oVirt we went with the second option - when we initialize a file
storage domain, we create a special file and do direct write to this
file with 1, 512, and 4096 bytes length. If we detect 512 or 4096, we
use this value for creating the domain (e.g. for sanlock). If we
detect 1, we use the user provided value (default 512).

You can see the code here:
https://github.com/oVirt/vdsm/blob/4733018f9a719729242738b486906d3b9ed058cd/lib/vdsm/storage/fileSD.py#L838

One way we can use in qemu is to create a temporary file:

/path/to/image.tmp9vo8US

Delete the file, keeping the fd open, and detect the alignment on this
file using write(). With this we fixed all the cases listed above, but
creating new files requires write permission in the directory where the
image is in, and will not work for some strange setups (e.g. bind-mount
images).

One issue with this is that there is no guarantee that the temporary
file will be deleted, so the user will have to deal with leftover files.

> With cases 2 and 3 this gives the desired result; however, for cases 1
> and 4, an alignment of 1 would be the actual correct value, and the new
> probing algorithm results in a worse result.
> However, since the negative effect of the old algorithm in cases 2 and 3
> is I/O errors whereas the effect of the new one in cases 1 and 4 is just
> degraded performance for I/O that isn't 4k aligned, the new approach is
> still preferable.
>
> I think we need to make this tradeoff clearer in the commit message and
> the comment in the code, but the approach is reasonable enough.

I'll try to make this more clear in v3.

> > With this change:
> >
> > - Provisioning VM and copying disks on local XFS and Gluster with 4k
> >   sector size works, resolving bugs [1],[2].
> >
> > - With NFS we fall back to buf_align and request_alignment of 4096
> >   instead of 512. This may cause unneeded data copying, but so far I see
> >   better performance with this change.
> >
> > [1] https://bugzilla.redhat.com/1737256
> > [2] https://bugzilla.redhat.com/1738657
> >
> > Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> > ---
> >
> > v1 was a minimal hack; this version is a more generic fix that works for
> > any storage without requiring users to allocate the first block of an
> > image. Allocating the first block of an image is still a good idea since
> > it allows detecting the right alignment in some cases.
> >
> > v1 could also affect cases when we could detect buf_align to use
> > request_alignment instead; v2 will only affect cases when buf_align or
> > request_alignment cannot be detected.
> > v1 was here:
> > https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00133.html
> >
> > block/file-posix.c | 40 +++++++++++++++++++++++++++++-----------
> > 1 file changed, 29 insertions(+), 11 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index f33b542b33..511468f166 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -323,6 +323,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
> >      BDRVRawState *s = bs->opaque;
> >      char *buf;
> >      size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
> > +    size_t alignments[] = {1, 512, 1024, 2048, 4096};
> >
> >      /* For SCSI generic devices the alignment is not really used.
> >         With buffered I/O, we don't have any restrictions. */
> > @@ -349,25 +350,42 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
> >      }
> >  #endif
> >
> > -    /* If we could not get the sizes so far, we can only guess them */
> > -    if (!s->buf_align) {
> > +    /*
> > +     * If we could not get the sizes so far, we can only guess them. First try
> > +     * to detect request alignment, since it is more likely to succeed. Then
> > +     * try to detect buf_align, which cannot be detected in some cases (e.g.
> > +     * Gluster). If buf_align cannot be detected, we fallback to the value of
> > +     * request_alignment.
> > +     */
> > +
> > +    if (!bs->bl.request_alignment) {
> > +        int i;
> >          size_t align;
> > -        buf = qemu_memalign(max_align, 2 * max_align);
> > -        for (align = 512; align <= max_align; align <<= 1) {
> > -            if (raw_is_io_aligned(fd, buf + align, max_align)) {
> > -                s->buf_align = align;
> > +        buf = qemu_memalign(max_align, max_align);
> > +        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
> > +            align = alignments[i];
> > +            if (raw_is_io_aligned(fd, buf, align)) {
> > +                /* Fallback to safe value. */
> > +                bs->bl.request_alignment = (align != 1) ? align : max_align;
> >                  break;
> >              }
> >          }
> >          qemu_vfree(buf);
> >      }
> >
> > -    if (!bs->bl.request_alignment) {
> > +    if (!s->buf_align) {
> > +        int i;
> >          size_t align;
> > -        buf = qemu_memalign(s->buf_align, max_align);
> > -        for (align = 512; align <= max_align; align <<= 1) {
> > -            if (raw_is_io_aligned(fd, buf, align)) {
> > -                bs->bl.request_alignment = align;
> > +        buf = qemu_memalign(max_align, 2 * max_align);
> > +        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
> > +            align = alignments[i];
> > +            if (raw_is_io_aligned(fd, buf + align, max_align)) {
> > +                /* Fallback to request_alignment or safe value. */
> > +                s->buf_align = (align != 1)
> > +                    ? align
> > +                    : (bs->bl.request_alignment != 0)
> > +                        ? bs->bl.request_alignment
> > +                        : max_align;
>
> Nested ternary operators over five lines aren't that readable any more.
> I'd suggest using at least one more if:
>
>     if (align != 1) {
>         s->buf_align = align;
>     } else {
>         s->buf_align = bs->bl.request_alignment ?: max_align;
>     }

This is better.

> In fact, is checking bs->bl.request_alignment for 0 even worth it here?
> If it's 0, we failed to find any working configuration and will return
> an error anyway. Then it doesn't matter if s->buf_align becomes 0, too.

Right, so we can:

    s->buf_align = (align != 1) ? align : bs->bl.request_alignment;

Thanks for reviewing.

Nir
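The write()-based probe that the vdsm code linked in this message performs can be sketched roughly as follows. This is a simplified, hypothetical C version (the real probe is Python and opens the scratch file with O_DIRECT); because O_DIRECT is deliberately omitted here for portability, every size succeeds on an ordinary file and the probe reports 1, which is exactly the "cannot detect" outcome the thread describes for buffered backends like NFS.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Try writes of increasing size at offset 0 of a scratch fd, as the vdsm
 * probe does with lengths 1, 512 and 4096. The smallest size that succeeds
 * is the detected alignment; 1 means the backend accepted an unaligned
 * write, i.e. detection failed and a fallback value must be used.
 */
static size_t probe_write_alignment(int fd)
{
    static const size_t sizes[] = {1, 512, 4096};
    char *buf;
    size_t detected = 0;

    /* 4096-aligned buffer, big enough for the largest probe */
    if (posix_memalign((void **)&buf, 4096, 4096) != 0) {
        return 0;
    }
    memset(buf, 0, 4096);

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        if (pwrite(fd, buf, sizes[i], 0) == (ssize_t)sizes[i]) {
            detected = sizes[i]; /* smallest length that worked */
            break;
        }
    }
    free(buf);
    return detected;
}
```

On a buffered fd even the 1-byte write succeeds, so the caller gets 1 and must fall back to a user-provided or safe default, just as vdsm falls back to its configured value (default 512).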
Am 13.08.2019 um 12:45 hat Nir Soffer geschrieben:
> On Mon, Aug 12, 2019 at 5:23 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 11.08.2019 um 22:50 hat Nir Soffer geschrieben:
> > > In some cases buf_align or request_alignment cannot be detected:
> > >
> > > - With Gluster, buf_align cannot be detected since the actual I/O is
> > >   done on Gluster server, and qemu buffer alignment does not matter.
> >
> > If it doesn't matter, the best value would be buf_align = 1.
>
> Right, if we know that this is gluster.
>
> > > - With local XFS filesystem, buf_align cannot be detected if reading
> > >   from unallocated area.
> >
> > Here, we actually do need alignment, but it's unknown whether it would
> > be 512 or 4096 or something else entirely. Failing to align requests
> > correctly results in I/O errors.
> >
> > > - With Gluster backed by XFS, request_alignment cannot be detected if
> > >   reading from unallocated area.
> >
> > This is like buf_align for XFS: we don't know the right value, but
> > getting it wrong causes I/O errors.
> >
> > > - With NFS, the server does not use direct I/O, so both buf_align and
> > >   request_alignment cannot be detected.
> >
> > This suggests that byte-aligned requests are fine for NFS, i.e.
> > buf_align = request_alignment = 1 would be optimal in this case.
>
> Right, but again we don't know this is NFS.

Yes, I agree. I was just trying to list the optimal settings for each
case so I could compare them against the actual results the patch
provides. I'm well aware that we don't know a way to get the optimal
results for all four cases.

> > > These cases seem to work when storage sector size is 512 bytes, because
> > > the current code starts checking align=512. If the check succeeds
> > > because alignment cannot be detected we use 512. But this does not work
> > > for storage with 4k sector size.
> > >
> > > Practically the alignment requirements are the same for buffer
> > > alignment, buffer length, and offset in file.
> > > So in case we cannot detect buf_align, we can use request alignment.
> > > If we cannot detect request alignment, we can fall back to a safe
> > > value.
> >
> > This makes sense in general.
> >
> > What the commit message doesn't explain, but probably should, is how
> > we determine whether we could successfully detect request alignment.
> > The approach taken here is that a detected alignment of 1 is understood
> > as failure to detect the real alignment.
>
> Failing with EINVAL when using 1, and succeeding with another value, is
> considered a successful detection.
>
> We have 3 issues preventing detection:
> - filesystem not using direct I/O on the remote server (NFS, Gluster
>   when network.remote-dio=on)
> - area probed is unallocated with XFS or Gluster backed by XFS
> - filesystem without buffer alignment requirement (e.g. Gluster)

I would say case 1 is effectively a subset of case 3 (i.e. it's just one
specific reason why we don't have a buffer alignment requirement).

> For handling unallocated areas, we can:
> - always allocate the first block when creating an image (qemu-img
>   create/convert)
> - use write() instead of read().
>
> In oVirt we went with the second option - when we initialize a file
> storage domain, we create a special file and do direct write to this
> file with 1, 512, and 4096 bytes length. If we detect 512 or 4096, we
> use this value for creating the domain (e.g. for sanlock). If we
> detect 1, we use the user provided value (default 512).

Yes, but there's the important difference that oVirt controls the image
files, whereas QEMU doesn't. Even if qemu-img create made sure that we
allocate the first block, the user could still pass us an image that
was created using a different way.

Using write() is actually an interesting thought. Obviously, we can't
just overwrite the user image. But maybe what we could do is read the
first block and then try to rewrite it with different alignments.
However, this will break down with read-only images, so if we can't
write, we'd still have to fall back to a safe default.

Also, given the straces we saw, I'm afraid we might trigger gluster bugs
where writes that failed with EINVAL mess up the internal state so that
even later aligned requests would fail.

> You can see the code here:
> https://github.com/oVirt/vdsm/blob/4733018f9a719729242738b486906d3b9ed058cd/lib/vdsm/storage/fileSD.py#L838
>
> One way we can use in qemu is to create a temporary file:
>
> /path/to/image.tmp9vo8US
>
> Delete the file, keeping the fd open, and detect the alignment on this
> file using write().

This isn't going to fly. We might not have write permission to the
directory even for read-write images. Worse, we might get passed only a
file descriptor instead of a path. So whatever we do, we must do it with
the image file descriptor.

> With this we fixed all the cases listed above, but creating new files
> requires write permission in the directory where the image is in, and
> will not work for some strange setups (e.g. bind-mount images).
>
> One issue with this is that there is no guarantee that the temporary
> file will be deleted, so the user will have to deal with leftover
> files.

On Linux, we could use O_TMPFILE for this. However, as I mentioned
above, we may not even know the directory.

> > With cases 2 and 3 this gives the desired result; however, for cases 1
> > and 4, an alignment of 1 would be the actual correct value, and the new
> > probing algorithm results in a worse result.
> >
> > However, since the negative effect of the old algorithm in cases 2 and 3
> > is I/O errors whereas the effect of the new one in cases 1 and 4 is just
> > degraded performance for I/O that isn't 4k aligned, the new approach is
> > still preferable.
> >
> > I think we need to make this tradeoff clearer in the commit message and
> > the comment in the code, but the approach is reasonable enough.
>
> I'll try to make this more clear in v3.

Thanks.
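The read-then-rewrite idea discussed above can be sketched like this. It is a hypothetical illustration, not qemu code: a real implementation would open the image with O_DIRECT (omitted here, so on a buffered fd every rewrite succeeds and detection reports 1) and would have to handle short reads, empty images, and the read-only and gluster caveats raised in the thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Read up to one block from the start of the image, then try to write the
 * same bytes back with increasing lengths. The contents stay identical, so
 * the rewrite is non-destructive on an image at least one block long; the
 * smallest length that succeeds is the detected alignment (1 = failure to
 * detect, as elsewhere in the thread).
 */
static size_t probe_by_rewrite(int fd)
{
    static const size_t sizes[] = {1, 512, 4096};
    char *buf;
    size_t detected = 0;

    if (posix_memalign((void **)&buf, 4096, 4096) != 0) {
        return 0;
    }
    memset(buf, 0, 4096);

    if (pread(fd, buf, 4096, 0) < 0) {
        free(buf);
        return 0; /* cannot even read; give up */
    }

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        /* rewrite the bytes we just read; data is unchanged */
        if (pwrite(fd, buf, sizes[i], 0) == (ssize_t)sizes[i]) {
            detected = sizes[i];
            break;
        }
    }
    free(buf);
    return detected;
}
```

Note the sketch still needs write access, which is exactly the read-only-image limitation Kevin points out.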
> > > With this change:
> > >
> > > - Provisioning VM and copying disks on local XFS and Gluster with 4k
> > >   sector size works, resolving bugs [1],[2].
> > >
> > > - With NFS we fall back to buf_align and request_alignment of 4096
> > >   instead of 512. This may cause unneeded data copying, but so far I see
> > >   better performance with this change.
> > >
> > > [1] https://bugzilla.redhat.com/1737256
> > > [2] https://bugzilla.redhat.com/1738657
> > >
> > > Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> > > ---
> > >
> > > v1 was a minimal hack; this version is a more generic fix that works for
> > > any storage without requiring users to allocate the first block of an
> > > image. Allocating the first block of an image is still a good idea since
> > > it allows detecting the right alignment in some cases.
> > >
> > > v1 could also affect cases when we could detect buf_align to use
> > > request_alignment instead; v2 will only affect cases when buf_align or
> > > request_alignment cannot be detected.
> > >
> > > v1 was here:
> > > https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00133.html
> > >
> > > block/file-posix.c | 40 +++++++++++++++++++++++++++++-----------
> > > 1 file changed, 29 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/block/file-posix.c b/block/file-posix.c
> > > index f33b542b33..511468f166 100644
> > > --- a/block/file-posix.c
> > > +++ b/block/file-posix.c
> > > @@ -323,6 +323,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
> > >      BDRVRawState *s = bs->opaque;
> > >      char *buf;
> > >      size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
> > > +    size_t alignments[] = {1, 512, 1024, 2048, 4096};
> > >
> > >      /* For SCSI generic devices the alignment is not really used.
> > >         With buffered I/O, we don't have any restrictions. */
> > > @@ -349,25 +350,42 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
> > >      }
> > >  #endif
> > >
> > > -    /* If we could not get the sizes so far, we can only guess them */
> > > -    if (!s->buf_align) {
> > > +    /*
> > > +     * If we could not get the sizes so far, we can only guess them. First try
> > > +     * to detect request alignment, since it is more likely to succeed. Then
> > > +     * try to detect buf_align, which cannot be detected in some cases (e.g.
> > > +     * Gluster). If buf_align cannot be detected, we fallback to the value of
> > > +     * request_alignment.
> > > +     */
> > > +
> > > +    if (!bs->bl.request_alignment) {
> > > +        int i;
> > >          size_t align;
> > > -        buf = qemu_memalign(max_align, 2 * max_align);
> > > -        for (align = 512; align <= max_align; align <<= 1) {
> > > -            if (raw_is_io_aligned(fd, buf + align, max_align)) {
> > > -                s->buf_align = align;
> > > +        buf = qemu_memalign(max_align, max_align);
> > > +        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
> > > +            align = alignments[i];
> > > +            if (raw_is_io_aligned(fd, buf, align)) {
> > > +                /* Fallback to safe value. */
> > > +                bs->bl.request_alignment = (align != 1) ? align : max_align;
> > >                  break;
> > >              }
> > >          }
> > >          qemu_vfree(buf);
> > >      }
> > >
> > > -    if (!bs->bl.request_alignment) {
> > > +    if (!s->buf_align) {
> > > +        int i;
> > >          size_t align;
> > > -        buf = qemu_memalign(s->buf_align, max_align);
> > > -        for (align = 512; align <= max_align; align <<= 1) {
> > > -            if (raw_is_io_aligned(fd, buf, align)) {
> > > -                bs->bl.request_alignment = align;
> > > +        buf = qemu_memalign(max_align, 2 * max_align);
> > > +        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
> > > +            align = alignments[i];
> > > +            if (raw_is_io_aligned(fd, buf + align, max_align)) {
> > > +                /* Fallback to request_alignment or safe value. */
> > > +                s->buf_align = (align != 1)
> > > +                    ? align
> > > +                    : (bs->bl.request_alignment != 0)
> > > +                        ? bs->bl.request_alignment
> > > +                        : max_align;
> >
> > Nested ternary operators over five lines aren't that readable any more.
> > I'd suggest using at least one more if:
> >
> >     if (align != 1) {
> >         s->buf_align = align;
> >     } else {
> >         s->buf_align = bs->bl.request_alignment ?: max_align;
> >     }
>
> This is better.
>
> > In fact, is checking bs->bl.request_alignment for 0 even worth it here?
> > If it's 0, we failed to find any working configuration and will return
> > an error anyway. Then it doesn't matter if s->buf_align becomes 0, too.
>
> Right, so we can:
>
>     s->buf_align = (align != 1) ? align : bs->bl.request_alignment;
>
> Thanks for reviewing.

Yes, that should work.

Kevin
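The one-liner Nir and Kevin converge on can be expressed as a tiny pure function (hypothetical name; in qemu it is just an assignment inside raw_probe_alignment()). A probed value of 1 means "could not detect", so buf_align inherits the possibly already fallen-back request_alignment; if that is also 0 the whole probe failed and the caller errors out anyway, so propagating 0 is harmless.

```c
#include <assert.h>
#include <stddef.h>

/* Pick buf_align from the probe result: a detected value wins, a probe
 * result of 1 (detection failure) inherits the request alignment. */
static size_t pick_buf_align(size_t probed, size_t request_alignment)
{
    return probed != 1 ? probed : request_alignment;
}
```

This keeps the same behavior as the nested ternary from the patch minus the redundant zero check Kevin questioned.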
On Tue, Aug 13, 2019 at 2:21 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 13.08.2019 um 12:45 hat Nir Soffer geschrieben:
> > On Mon, Aug 12, 2019 at 5:23 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > Am 11.08.2019 um 22:50 hat Nir Soffer geschrieben:
> > > > In some cases buf_align or request_alignment cannot be detected:
> > > >
> > > > - With Gluster, buf_align cannot be detected since the actual I/O is
> > > >   done on Gluster server, and qemu buffer alignment does not matter.
> > >
> > > If it doesn't matter, the best value would be buf_align = 1.
> >
> > Right, if we know that this is gluster.
> >
> > > > - With local XFS filesystem, buf_align cannot be detected if reading
> > > >   from unallocated area.
> > >
> > > Here, we actually do need alignment, but it's unknown whether it would
> > > be 512 or 4096 or something else entirely. Failing to align requests
> > > correctly results in I/O errors.
> > >
> > > > - With Gluster backed by XFS, request_alignment cannot be detected if
> > > >   reading from unallocated area.
> > >
> > > This is like buf_align for XFS: we don't know the right value, but
> > > getting it wrong causes I/O errors.
> > >
> > > > - With NFS, the server does not use direct I/O, so both buf_align and
> > > >   request_alignment cannot be detected.
> > >
> > > This suggests that byte-aligned requests are fine for NFS, i.e.
> > > buf_align = request_alignment = 1 would be optimal in this case.
> >
> > Right, but again we don't know this is NFS.
>
> Yes, I agree. I was just trying to list the optimal settings for each
> case so I could compare them against the actual results the patch
> provides. I'm well aware that we don't know a way to get the optimal
> results for all four cases.
>
> > > > These cases seem to work when storage sector size is 512 bytes, because
> > > > the current code starts checking align=512. If the check succeeds
> > > > because alignment cannot be detected we use 512. But this does not work
> > > > for storage with 4k sector size.
> > > > Practically the alignment requirements are the same for buffer
> > > > alignment, buffer length, and offset in file. So in case we cannot
> > > > detect buf_align, we can use request alignment. If we cannot detect
> > > > request alignment, we can fall back to a safe value.
> > >
> > > This makes sense in general.
> > >
> > > What the commit message doesn't explain, but probably should, is how
> > > we determine whether we could successfully detect request alignment.
> > > The approach taken here is that a detected alignment of 1 is understood
> > > as failure to detect the real alignment.
> >
> > Failing with EINVAL when using 1, and succeeding with another value, is
> > considered a successful detection.
> >
> > We have 3 issues preventing detection:
> > - filesystem not using direct I/O on the remote server (NFS, Gluster
> >   when network.remote-dio=on)
> > - area probed is unallocated with XFS or Gluster backed by XFS
> > - filesystem without buffer alignment requirement (e.g. Gluster)
>
> I would say case 1 is effectively a subset of case 3 (i.e. it's just one
> specific reason why we don't have a buffer alignment requirement).
>
> > For handling unallocated areas, we can:
> > - always allocate the first block when creating an image (qemu-img
> >   create/convert)
> > - use write() instead of read().
> >
> > In oVirt we went with the second option - when we initialize a file
> > storage domain, we create a special file and do direct write to this
> > file with 1, 512, and 4096 bytes length. If we detect 512 or 4096, we
> > use this value for creating the domain (e.g. for sanlock). If we
> > detect 1, we use the user provided value (default 512).
>
> Yes, but there's the important difference that oVirt controls the image
> files, whereas QEMU doesn't. Even if qemu-img create made sure that we
> allocate the first block, the user could still pass us an image that
> was created using a different way.
> Using write() is actually an interesting thought. Obviously, we can't
> just overwrite the user image. But maybe what we could do is read the
> first block and then try to rewrite it with different alignments.

Yes, this is what we do in ovirt-imageio for file based storage:
https://github.com/oVirt/ovirt-imageio/blob/ca70170886b0c1fbeca8640b12bcf54f01a3fea0/common/ovirt_imageio_common/backends/file.py#L311

But we have lots of assumptions that may not work for qemu:
- we don't support read only images
- we assume nobody else is writing to the image imageio uses (enforced
  by oVirt)

So this will not work for qemu-img read-only operations.

> However, this will break down with read-only images, so if we can't
> write, we'd still have to fall back to a safe default.
>
> Also, given the straces we saw, I'm afraid we might trigger gluster bugs
> where writes that failed with EINVAL mess up the internal state so that
> even later aligned requests would fail.

If this happens only when Gluster performance.strict-o-direct is off, we
will enforce performance.strict-o-direct = on in oVirt. Otherwise this is
a Gluster bug and it should be fixed in Gluster.

> > You can see the code here:
> > https://github.com/oVirt/vdsm/blob/4733018f9a719729242738b486906d3b9ed058cd/lib/vdsm/storage/fileSD.py#L838
> >
> > One way we can use in qemu is to create a temporary file:
> >
> > /path/to/image.tmp9vo8US
> >
> > Delete the file, keeping the fd open, and detect the alignment on this
> > file using write().
>
> This isn't going to fly. We might not have write permission to the
> directory even for read-write images. Worse, we might get passed only a
> file descriptor instead of a path. So whatever we do, we must do it with
> the image file descriptor.
>
> > With this we fixed all the cases listed above, but creating new files
> > requires write permission in the directory where the image is in, and
> > will not work for some strange setups (e.g. bind-mount images).
> > One issue with this is that there is no guarantee that the temporary
> > file will be deleted, so the user will have to deal with leftover
> > files.
>
> On Linux, we could use O_TMPFILE for this. However, as I mentioned
> above, we may not even know the directory.

...

Nir
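The O_TMPFILE idea can be sketched as below (a hypothetical helper, not qemu code). O_TMPFILE is Linux-specific (3.11+) and not supported by every filesystem, so the sketch falls back to the create-and-unlink pattern described earlier in the thread, which also leaves no visible file once the unlink succeeds; neither variant solves Kevin's objection that qemu may only have an fd and no directory at all.

```c
#define _GNU_SOURCE /* for O_TMPFILE */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Open an anonymous scratch file in 'dir' for alignment probing. With
 * O_TMPFILE the file never has a name, so nothing can be left behind.
 */
static int open_probe_file(const char *dir)
{
    int fd = -1;

#ifdef O_TMPFILE
    fd = open(dir, O_TMPFILE | O_RDWR, 0600);
    if (fd >= 0) {
        return fd;
    }
    /* e.g. EOPNOTSUPP: filesystem without O_TMPFILE support */
#endif

    /* Fallback: create-and-unlink. A crash between the two calls can
     * still leak the file, which is the leftover-file problem the
     * thread mentions. */
    char path[4096];
    snprintf(path, sizeof(path), "%s/.probeXXXXXX", dir);
    fd = mkstemp(path);
    if (fd >= 0) {
        unlink(path);
    }
    return fd;
}
```

Either way the returned fd can then be probed with direct writes without ever touching the user's image.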
diff --git a/block/file-posix.c b/block/file-posix.c
index f33b542b33..511468f166 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -323,6 +323,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     BDRVRawState *s = bs->opaque;
     char *buf;
     size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
+    size_t alignments[] = {1, 512, 1024, 2048, 4096};
 
     /* For SCSI generic devices the alignment is not really used.
        With buffered I/O, we don't have any restrictions. */
@@ -349,25 +350,42 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     }
 #endif
 
-    /* If we could not get the sizes so far, we can only guess them */
-    if (!s->buf_align) {
+    /*
+     * If we could not get the sizes so far, we can only guess them. First try
+     * to detect request alignment, since it is more likely to succeed. Then
+     * try to detect buf_align, which cannot be detected in some cases (e.g.
+     * Gluster). If buf_align cannot be detected, we fallback to the value of
+     * request_alignment.
+     */
+
+    if (!bs->bl.request_alignment) {
+        int i;
         size_t align;
-        buf = qemu_memalign(max_align, 2 * max_align);
-        for (align = 512; align <= max_align; align <<= 1) {
-            if (raw_is_io_aligned(fd, buf + align, max_align)) {
-                s->buf_align = align;
+        buf = qemu_memalign(max_align, max_align);
+        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
+            align = alignments[i];
+            if (raw_is_io_aligned(fd, buf, align)) {
+                /* Fallback to safe value. */
+                bs->bl.request_alignment = (align != 1) ? align : max_align;
                 break;
             }
         }
         qemu_vfree(buf);
     }
 
-    if (!bs->bl.request_alignment) {
+    if (!s->buf_align) {
+        int i;
         size_t align;
-        buf = qemu_memalign(s->buf_align, max_align);
-        for (align = 512; align <= max_align; align <<= 1) {
-            if (raw_is_io_aligned(fd, buf, align)) {
-                bs->bl.request_alignment = align;
+        buf = qemu_memalign(max_align, 2 * max_align);
+        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
+            align = alignments[i];
+            if (raw_is_io_aligned(fd, buf + align, max_align)) {
+                /* Fallback to request_alignment or safe value. */
+                s->buf_align = (align != 1)
+                    ? align
+                    : (bs->bl.request_alignment != 0)
+                        ? bs->bl.request_alignment
+                        : max_align;
                 break;
             }
         }
In some cases buf_align or request_alignment cannot be detected:

- With Gluster, buf_align cannot be detected since the actual I/O is
  done on Gluster server, and qemu buffer alignment does not matter.

- With local XFS filesystem, buf_align cannot be detected if reading
  from unallocated area.

- With Gluster backed by XFS, request_alignment cannot be detected if
  reading from unallocated area.

- With NFS, the server does not use direct I/O, so both buf_align and
  request_alignment cannot be detected.

These cases seem to work when storage sector size is 512 bytes, because
the current code starts checking align=512. If the check succeeds
because alignment cannot be detected we use 512. But this does not work
for storage with 4k sector size.

Practically the alignment requirements are the same for buffer
alignment, buffer length, and offset in file. So in case we cannot
detect buf_align, we can use request alignment. If we cannot detect
request alignment, we can fall back to a safe value.

With this change:

- Provisioning VM and copying disks on local XFS and Gluster with 4k
  sector size works, resolving bugs [1],[2].

- With NFS we fall back to buf_align and request_alignment of 4096
  instead of 512. This may cause unneeded data copying, but so far I see
  better performance with this change.

[1] https://bugzilla.redhat.com/1737256
[2] https://bugzilla.redhat.com/1738657

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
---

v1 was a minimal hack; this version is a more generic fix that works for
any storage without requiring users to allocate the first block of an
image. Allocating the first block of an image is still a good idea since
it allows detecting the right alignment in some cases.

v1 could also affect cases when we could detect buf_align to use
request_alignment instead; v2 will only affect cases when buf_align or
request_alignment cannot be detected.

v1 was here:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00133.html

block/file-posix.c | 40 +++++++++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 11 deletions(-)