mbox series

[0/5] btrfs-progs: use direct-IO for zoned device

Message ID 20210927041554.325884-1-naohiro.aota@wdc.com (mailing list archive)
Headers show
Series btrfs-progs: use direct-IO for zoned device | expand

Message

Naohiro Aota Sept. 27, 2021, 4:15 a.m. UTC
As discussed in the Zoned Storage page [1],  the kernel page cache does not
guarantee that cached dirty pages will be flushed to a block device in
sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
device to ensure the write ordering.

[1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions

As a writng buffer is embedded in some other struct (e.g., "char data[]" in
struct extent_buffer), it is difficult to allocate the struct so that the
writng buffer is aligned.

This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
which allocates an aligned bounce buffer, copy the buffer contents, and
proceeds the IO. And, it now opens a zoned device with O_DIRECT.

Since the allocation and copying are costly, it is better to do them only
when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
the file is opened with O_DIRECT or not every time doing an IO.

As zoned device forces to use zoned btrfs, I decided to use the zoned flag
to determine if it is direct-IO or not. This can cause a false-positive (to
use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
emulated zoned mode on a non-zoned device or a regular file. Considering
the emulated zoned mode is mostly for debugging or testing, I believe this
is acceptable.

Patch 1 is a preparation not to set an emulated zone_size value when not
needed.

Patches 2 and 3 wraps pread/pwrite with newly introduced function
btrfs_pread/btrfs_pwrite.

Patches 4 deals with the zoned flag while reading the initial trees.

Patch 5 finally opens a zoned device with O_DIRECT.

Naohiro Aota (5):
  btrfs-progs: mkfs: do not set zone size on non-zoned mode
  btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
  btrfs-progs: introduce btrfs_pread wrapper for pread
  btrfs-progs: temporally set zoned flag for initial tree reading
  btrfs-progs: use direct-IO for zoned device

 common/device-utils.c     | 76 ++++++++++++++++++++++++++++++++++++---
 common/device-utils.h     | 29 ++++++++++++++-
 kernel-shared/disk-io.c   | 19 +++++++++-
 kernel-shared/extent_io.c | 14 +++++---
 kernel-shared/volumes.c   |  4 +++
 kernel-shared/zoned.c     |  6 ++--
 mkfs/common.c             | 14 +++++---
 mkfs/main.c               | 12 +++++--
 8 files changed, 153 insertions(+), 21 deletions(-)

Comments

David Sterba Sept. 27, 2021, 7:26 p.m. UTC | #1
On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> As discussed in the Zoned Storage page [1],  the kernel page cache does not
> guarantee that cached dirty pages will be flushed to a block device in
> sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> device to ensure the write ordering.
> 
> [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> 
> As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> struct extent_buffer), it is difficult to allocate the struct so that the
> writng buffer is aligned.
> 
> This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> which allocates an aligned bounce buffer, copy the buffer contents, and
> proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> 
> Since the allocation and copying are costly, it is better to do them only
> when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> the file is opened with O_DIRECT or not every time doing an IO.

This should be in the changelog somewhere too, the last patch looks like
a good place so I'll copy it there.

> As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> to determine if it is direct-IO or not. This can cause a false-positive (to
> use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> emulated zoned mode on a non-zoned device or a regular file. Considering
> the emulated zoned mode is mostly for debugging or testing, I believe this
> is acceptable.

Agreed.

All patches added to devel. Would be good to add some tests for the
emulated mode, ie. that we can test at least something regularly without
special devices.
David Sterba Sept. 27, 2021, 9:51 p.m. UTC | #2
On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> As discussed in the Zoned Storage page [1],  the kernel page cache does not
> guarantee that cached dirty pages will be flushed to a block device in
> sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> device to ensure the write ordering.
> 
> [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> 
> As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> struct extent_buffer), it is difficult to allocate the struct so that the
> writng buffer is aligned.
> 
> This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> which allocates an aligned bounce buffer, copy the buffer contents, and
> proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> 
> Since the allocation and copying are costly, it is better to do them only
> when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> the file is opened with O_DIRECT or not every time doing an IO.
> 
> As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> to determine if it is direct-IO or not. This can cause a false-positive (to
> use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> emulated zoned mode on a non-zoned device or a regular file. Considering
> the emulated zoned mode is mostly for debugging or testing, I believe this
> is acceptable.
> 
> Patch 1 is a preparation not to set an emulated zone_size value when not
> needed.
> 
> Patches 2 and 3 wraps pread/pwrite with newly introduced function
> btrfs_pread/btrfs_pwrite.
> 
> Patches 4 deals with the zoned flag while reading the initial trees.
> 
> Patch 5 finally opens a zoned device with O_DIRECT.
> 
> Naohiro Aota (5):
>   btrfs-progs: mkfs: do not set zone size on non-zoned mode
>   btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
>   btrfs-progs: introduce btrfs_pread wrapper for pread
>   btrfs-progs: temporally set zoned flag for initial tree reading
>   btrfs-progs: use direct-IO for zoned device

I was doing some btrfs-convert changes and found that it crashed, rough
bisection points to this series. With the last patch applied, convert
fails with the following ASAN error:

...
Create initial btrfs filesystem
Create ext2 image file
AddressSanitizer:DEADLYSIGNAL
=================================================================
==18432==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x000000496627 bp 0x7ffe5299e4d0 sp 0x7ffe5299e4b0 T0)
==18432==The signal is caused by a READ memory access.
==18432==Hint: this fault was caused by a dereference of a high value address (see register values below).  Dissassemble the provided pc to learn which register was used.
    #0 0x496627 in write_extent_to_disk kernel-shared/extent_io.c:815
    #1 0x470080 in write_and_map_eb kernel-shared/disk-io.c:525
    #2 0x411af9 in migrate_one_reserved_range convert/main.c:402
    #3 0x411fa5 in migrate_reserved_ranges convert/main.c:459
    #4 0x414088 in create_image convert/main.c:878
    #5 0x416d70 in do_convert convert/main.c:1269
    #6 0x41a294 in main convert/main.c:1993
    #7 0x7f7ef7c2753f in __libc_start_call_main (/lib64/libc.so.6+0x2d53f)
    #8 0x7f7ef7c275eb in __libc_start_main_alias_1 (/lib64/libc.so.6+0x2d5eb)
    #9 0x40ed04 in _start (.../btrfs-progs/btrfs-convert+0x40ed04)

kernel-shared/extent_io.c:815:

 811 int write_extent_to_disk(struct extent_buffer *eb)
 812 {
 813         int ret;
 814         ret = btrfs_pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr,
 815                            eb->fs_info->zoned);
 816         if (ret < 0)
 817                 goto out;
 818         if (ret != eb->len) {
 819                 ret = -EIO;
 820                 goto out;
 821         }
 822         ret = 0;
 823 out:
 824         return ret;
 825 }
Naohiro Aota Sept. 29, 2021, 2:21 a.m. UTC | #3
On Mon, Sep 27, 2021 at 09:26:18PM +0200, David Sterba wrote:
> On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > As discussed in the Zoned Storage page [1],  the kernel page cache does not
> > guarantee that cached dirty pages will be flushed to a block device in
> > sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> > device to ensure the write ordering.
> > 
> > [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> > 
> > As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> > struct extent_buffer), it is difficult to allocate the struct so that the
> > writng buffer is aligned.
> > 
> > This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> > which allocates an aligned bounce buffer, copy the buffer contents, and
> > proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> > 
> > Since the allocation and copying are costly, it is better to do them only
> > when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> > the file is opened with O_DIRECT or not every time doing an IO.
> 
> This should be in the changelog somewhere too, the last patch looks like
> a good place so I'll copy it there.
> 
> > As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> > to determine if it is direct-IO or not. This can cause a false-positive (to
> > use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> > emulated zoned mode on a non-zoned device or a regular file. Considering
> > the emulated zoned mode is mostly for debugging or testing, I believe this
> > is acceptable.
> 
> Agreed.
> 
> All patches added to devel. Would be good to add some tests for the
> emulated mode, ie. that we can test at least something regularly without
> special devices.

Will do. We may also add some tests for zoned device by setting up
null_blk (provided the machine has enough memory).
Naohiro Aota Sept. 29, 2021, 2:24 a.m. UTC | #4
On Mon, Sep 27, 2021 at 11:51:39PM +0200, David Sterba wrote:
> On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > As discussed in the Zoned Storage page [1],  the kernel page cache does not
> > guarantee that cached dirty pages will be flushed to a block device in
> > sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> > device to ensure the write ordering.
> > 
> > [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> > 
> > As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> > struct extent_buffer), it is difficult to allocate the struct so that the
> > writng buffer is aligned.
> > 
> > This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> > which allocates an aligned bounce buffer, copy the buffer contents, and
> > proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> > 
> > Since the allocation and copying are costly, it is better to do them only
> > when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> > the file is opened with O_DIRECT or not every time doing an IO.
> > 
> > As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> > to determine if it is direct-IO or not. This can cause a false-positive (to
> > use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> > emulated zoned mode on a non-zoned device or a regular file. Considering
> > the emulated zoned mode is mostly for debugging or testing, I believe this
> > is acceptable.
> > 
> > Patch 1 is a preparation not to set an emulated zone_size value when not
> > needed.
> > 
> > Patches 2 and 3 wraps pread/pwrite with newly introduced function
> > btrfs_pread/btrfs_pwrite.
> > 
> > Patches 4 deals with the zoned flag while reading the initial trees.
> > 
> > Patch 5 finally opens a zoned device with O_DIRECT.
> > 
> > Naohiro Aota (5):
> >   btrfs-progs: mkfs: do not set zone size on non-zoned mode
> >   btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
> >   btrfs-progs: introduce btrfs_pread wrapper for pread
> >   btrfs-progs: temporally set zoned flag for initial tree reading
> >   btrfs-progs: use direct-IO for zoned device
> 
> I was doing some btrfs-convert changes and found that it crashed, rough
> bisection points to this series. With the last patch applied, convert
> fails with the following ASAN error:

It looks like eb->fs_info == NULL at this point. In case of
btrfs-convert, we can assume it is non-zoned because we do not support
the converting on a zoned device (we can't create ext*, reiserfs on a
zoned device anyway).

But, I also found a similar issue occurs with "mkfs.btrfs -f -d raid5
-m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3" in
read_extent_from_disk(). Let me check which is better to ensure the
fs_info is set or to check if it's NULL.

> ...
> Create initial btrfs filesystem
> Create ext2 image file
> AddressSanitizer:DEADLYSIGNAL
> =================================================================
> ==18432==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x000000496627 bp 0x7ffe5299e4d0 sp 0x7ffe5299e4b0 T0)
> ==18432==The signal is caused by a READ memory access.
> ==18432==Hint: this fault was caused by a dereference of a high value address (see register values below).  Dissassemble the provided pc to learn which register was used.
>     #0 0x496627 in write_extent_to_disk kernel-shared/extent_io.c:815
>     #1 0x470080 in write_and_map_eb kernel-shared/disk-io.c:525
>     #2 0x411af9 in migrate_one_reserved_range convert/main.c:402
>     #3 0x411fa5 in migrate_reserved_ranges convert/main.c:459
>     #4 0x414088 in create_image convert/main.c:878
>     #5 0x416d70 in do_convert convert/main.c:1269
>     #6 0x41a294 in main convert/main.c:1993
>     #7 0x7f7ef7c2753f in __libc_start_call_main (/lib64/libc.so.6+0x2d53f)
>     #8 0x7f7ef7c275eb in __libc_start_main_alias_1 (/lib64/libc.so.6+0x2d5eb)
>     #9 0x40ed04 in _start (.../btrfs-progs/btrfs-convert+0x40ed04)
> 
> kernel-shared/extent_io.c:815:
> 
>  811 int write_extent_to_disk(struct extent_buffer *eb)
>  812 {
>  813         int ret;
>  814         ret = btrfs_pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr,
>  815                            eb->fs_info->zoned);
>  816         if (ret < 0)
>  817                 goto out;
>  818         if (ret != eb->len) {
>  819                 ret = -EIO;
>  820                 goto out;
>  821         }
>  822         ret = 0;
>  823 out:
>  824         return ret;
>  825 }
David Sterba Sept. 29, 2021, 10:16 a.m. UTC | #5
On Wed, Sep 29, 2021 at 02:21:02AM +0000, Naohiro Aota wrote:
> On Mon, Sep 27, 2021 at 09:26:18PM +0200, David Sterba wrote:
> > On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > > As discussed in the Zoned Storage page [1],  the kernel page cache does not
> > > guarantee that cached dirty pages will be flushed to a block device in
> > > sequential sector order. Thus, we must use O_DIRECT for writing to a zoned
> > > device to ensure the write ordering.
> > > 
> > > [1] https://zonedstorage.io/linux/overview/#zbd-support-restrictions
> > > 
> > > As a writng buffer is embedded in some other struct (e.g., "char data[]" in
> > > struct extent_buffer), it is difficult to allocate the struct so that the
> > > writng buffer is aligned.
> > > 
> > > This series introduces btrfs_{pread,pwrite} to wrap around pread/pwrite,
> > > which allocates an aligned bounce buffer, copy the buffer contents, and
> > > proceeds the IO. And, it now opens a zoned device with O_DIRECT.
> > > 
> > > Since the allocation and copying are costly, it is better to do them only
> > > when necessary. But, it is cumbersome to call fcntl(F_GETFL) to determine
> > > the file is opened with O_DIRECT or not every time doing an IO.
> > 
> > This should be in the changelog somewhere too, the last patch looks like
> > a good place so I'll copy it there.
> > 
> > > As zoned device forces to use zoned btrfs, I decided to use the zoned flag
> > > to determine if it is direct-IO or not. This can cause a false-positive (to
> > > use the bounce buffer when a file is *not* opened with O_DIRECT) in case of
> > > emulated zoned mode on a non-zoned device or a regular file. Considering
> > > the emulated zoned mode is mostly for debugging or testing, I believe this
> > > is acceptable.
> > 
> > Agreed.
> > 
> > All patches added to devel. Would be good to add some tests for the
> > emulated mode, ie. that we can test at least something regularly without
> > special devices.
> 
> Will do. We may also add some tests for zoned device by setting up
> null_blk (provided the machine has enough memory).

As setting up the elated zoned devices requires some resources or
non-trivial setup we can add a separate class of tests. As there are
still limitations to what zoned mode supports, running all current tests
won't work anyway without tons of workarounds or quirks to existing
tests.

Adding a separate class would probably duplicate some of the tests but
that's IMHO less error prone way than changing the other tests. If the
zoned-tests are not run by default it should be safe for regular
testing.

Eventually, to avoid code duplication, we cand do some sort of test
links. The zoned-test will set up the environment and then run the
existing test from other directory.
David Sterba Sept. 29, 2021, 10:22 a.m. UTC | #6
On Wed, Sep 29, 2021 at 02:24:22AM +0000, Naohiro Aota wrote:
> On Mon, Sep 27, 2021 at 11:51:39PM +0200, David Sterba wrote:
> > On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > I was doing some btrfs-convert changes and found that it crashed, rough
> > bisection points to this series. With the last patch applied, convert
> > fails with the following ASAN error:
> 
> It looks like eb->fs_info == NULL at this point. In case of
> btrfs-convert, we can assume it is non-zoned because we do not support
> the converting on a zoned device (we can't create ext*, reiserfs on a
> zoned device anyway).

That would mean that extN/reiserfs was created on a zoned device. One
can still do a image copy to a zoned device and then convert. Even if
this is possible in theory I'd rather not allow that right now because
there are probably more changes required to do full support.

I've just noticed that ZONED bit is mistakenly among the feature flag
bits allowed in convert. Added in 242c8328bcd55175 "btrfs-progs: zoned:
add new ZONED feature flag":

BTRFS_CONVERT_ALLOWED_FEATURES must not contain
BTRFS_FEATURE_INCOMPAT_ZONED.
Naohiro Aota Oct. 5, 2021, 6:11 a.m. UTC | #7
On Wed, Sep 29, 2021 at 12:22:23PM +0200, David Sterba wrote:
> On Wed, Sep 29, 2021 at 02:24:22AM +0000, Naohiro Aota wrote:
> > On Mon, Sep 27, 2021 at 11:51:39PM +0200, David Sterba wrote:
> > > On Mon, Sep 27, 2021 at 01:15:49PM +0900, Naohiro Aota wrote:
> > > I was doing some btrfs-convert changes and found that it crashed, rough
> > > bisection points to this series. With the last patch applied, convert
> > > fails with the following ASAN error:
> > 
> > It looks like eb->fs_info == NULL at this point. In case of
> > btrfs-convert, we can assume it is non-zoned because we do not support
> > the converting on a zoned device (we can't create ext*, reiserfs on a
> > zoned device anyway).
> 
> That would mean that extN/reiserfs was created on a zoned device. One
> can still do a image copy to a zoned device and then convert. Even if
> this is possible in theory I'd rather not allow that right now because
> there are probably more changes required to do full support.
> 
> I've just noticed that ZONED bit is mistakenly among the feature flag
> bits allowed in convert. Added in 242c8328bcd55175 "btrfs-progs: zoned:
> add new ZONED feature flag":
> 
> BTRFS_CONVERT_ALLOWED_FEATURES must not contain
> BTRFS_FEATURE_INCOMPAT_ZONED.

Oops, I thought I did not list BTRFS_FEATURE_INCOMPAT_ZONED in the
ALLOWED_FEATURES list. I'll fix it in the new series.